Choose Random vs Butina

Use this guide to choose between the available train/test split strategies.

Random split

Use random when:

  • you want a simple baseline split

  • the dataset is large and diverse enough that a random partition gives a representative test set

  • you are doing quick iteration

Tradeoff:

  • close analogues of training compounds can land in the test set, so a random split may not stress scaffold or chemotype generalisation
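As a point of reference, a random split can be sketched as a seeded shuffle of row indices. This is a minimal illustration only, not the project's implementation: the function name random_split and the parameters test_fraction and seed are hypothetical.

```python
import random


def random_split(n_items, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) from a seeded shuffle of range(n_items).

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    indices = list(range(n_items))
    # A dedicated Random instance keeps the split reproducible without
    # touching the global random state.
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(n_items * test_fraction))
    return sorted(indices[n_test:]), sorted(indices[:n_test])


train_idx, test_idx = random_split(10, test_fraction=0.2, seed=0)
```

Nothing in this procedure looks at chemical structure, which is why near-identical compounds can end up on both sides of the split.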

Butina split

Use butina when:

  • you want cluster-aware splitting

  • you want test compounds to be chemically dissimilar from the training compounds

  • you want a more demanding validation setting

Tradeoff:

  • the split depends on the distance cutoff: a smaller cutoff gives more, tighter clusters, and a larger cutoff gives fewer, broader ones

  • cluster-based splitting can produce a harder task, so scores are often lower than, and not directly comparable with, scores from a random split
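To make the cutoff dependence concrete, here is a small pure-Python sketch of Butina-style clustering followed by a cluster-level split. It is an illustration under stated assumptions, not the project's code: fingerprints are modelled as plain sets of feature ids, and the helper names (tanimoto_distance, butina_clusters, split_by_cluster) and the policy of filling the test set from the smallest clusters first are hypothetical choices. A real pipeline would more likely rely on a library implementation such as RDKit's Butina module.

```python
def tanimoto_distance(a, b):
    """1 - |a & b| / |a | b| for two feature sets."""
    union = len(a | b)
    return 1.0 - (len(a & b) / union if union else 1.0)


def butina_clusters(fps, cutoff):
    """Butina-style clustering: visit items from most to fewest neighbours;
    each unassigned item becomes a centroid and claims its unassigned
    neighbours within `cutoff`."""
    n = len(fps)
    neighbours = [
        {j for j in range(n)
         if j != i and tanimoto_distance(fps[i], fps[j]) <= cutoff}
        for i in range(n)
    ]
    # Most-connected items first; ties break on the lower index.
    order = sorted(range(n), key=lambda i: (-len(neighbours[i]), i))
    assigned, clusters = set(), []
    for i in order:
        if i in assigned:
            continue
        members = [i] + sorted(neighbours[i] - assigned)
        assigned.update(members)
        clusters.append(members)
    return clusters


def split_by_cluster(clusters, n_items, test_fraction=0.3):
    """Assign whole clusters to test (smallest first) until the target
    size is reached; this fill policy is an assumption for illustration."""
    target = round(n_items * test_fraction)
    train, test = [], []
    for members in sorted(clusters, key=len):
        (test if len(test) < target else train).extend(members)
    return sorted(train), sorted(test)


# Toy "fingerprints": two families of analogues plus one outlier.
fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 3, 6},
       {10, 11, 12, 13}, {10, 11, 12, 14}, {20, 21, 22}]

loose = butina_clusters(fps, cutoff=0.5)   # analogues group together
tight = butina_clusters(fps, cutoff=0.3)   # every item is its own cluster
train_idx, test_idx = split_by_cluster(loose, len(fps))
```

With the looser cutoff the two analogue families each form one cluster, so no family is divided between train and test. With the tighter cutoff every compound is a singleton and the split is no longer cluster-aware in any useful sense.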

Practical recommendation

For exploratory work:

  • start with random

For more realistic evaluation:

  • compare random and butina

The NCATS-sol example script renders both so you can inspect the difference directly.
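Beyond visual inspection, one way to put a number on the difference is the average distance from each test compound to its nearest training compound. The sketch below is a hypothetical illustration: fingerprints are modelled as sets of feature ids, the two index splits are hand-written stand-ins for a random-like and a cluster-aware split, and mean_nearest_train_distance is not a function from the example script.

```python
def tanimoto_distance(a, b):
    """1 - |a & b| / |a | b| for two feature sets."""
    union = len(a | b)
    return 1.0 - (len(a & b) / union if union else 1.0)


def mean_nearest_train_distance(fps, train_idx, test_idx):
    """Average, over test items, of the distance to the closest train item."""
    nearest = [
        min(tanimoto_distance(fps[t], fps[r]) for r in train_idx)
        for t in test_idx
    ]
    return sum(nearest) / len(nearest)


# Toy "fingerprints": two families of analogues plus one outlier.
fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 3, 6},
       {10, 11, 12, 13}, {10, 11, 12, 14}, {20, 21, 22}]

# Hand-written splits (assumptions): the first leaves analogues of every
# test item in train, the second holds out one whole family.
mixed = mean_nearest_train_distance(fps, train_idx=[0, 1, 3, 5], test_idx=[2, 4])
separated = mean_nearest_train_distance(fps, train_idx=[0, 1, 2, 5], test_idx=[3, 4])
```

A larger value means the test set sits further from the training data, which is the property a cluster-aware split is meant to enforce.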