Choose Random vs Butina
Use this guide to choose between the available train/test split strategies.
Random split
Use random when:
you want a simple baseline split
the dataset is large enough that a random partition is acceptable
you are doing quick iteration
Tradeoff:
close analogues can land in both train and test, so a random split may not stress scaffold or chemotype generalisation
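As a rough illustration of what a random split does, here is a minimal sketch using only the Python standard library. The function name `random_split`, its parameters, and the toy SMILES list are hypothetical and are not taken from this project's API.

```python
import random


def random_split(items, test_fraction=0.2, seed=0):
    """Shuffle a copy of `items` and hold out the last `test_fraction` as test."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(items)                 # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[:-n_test], shuffled[-n_test:]


# Toy input (assumption: any list of records works, SMILES are just an example).
smiles = ["CCO", "CCN", "CCC", "c1ccccc1", "c1ccccc1O",
          "CC(=O)O", "CCCl", "CCBr", "CCS", "CCF"]
train, test = random_split(smiles, test_fraction=0.2, seed=42)
print(len(train), len(test))
```

Nothing in this procedure looks at chemical structure, which is why near-identical molecules can end up on both sides of the split.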
Butina split
Use butina when:
you want cluster-aware splitting
you care more about chemical dissimilarity between train and test
you want a more demanding validation setting
Tradeoff:
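the cluster sizes, and therefore the resulting split, depend on the distance cutoff
cluster-based splitting can produce a harder task, so scores are not directly comparable with those from a random split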
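To make the role of the distance cutoff concrete, the following is a simplified pure-Python sketch of Butina-style clustering followed by a cluster-aware split. It is not this project's implementation: fingerprints are modelled as plain sets of integers, and the function names, the `cutoff` default, and the rule of filling the test set with the smallest clusters first are all assumptions made for illustration.

```python
def tanimoto_distance(a, b):
    """1 - Tanimoto similarity for two fingerprints given as sets of on-bits."""
    union = len(a | b)
    if union == 0:
        return 0.0
    return 1.0 - len(a & b) / union


def butina_clusters(fingerprints, cutoff):
    """Group items so each cluster member lies within `cutoff` of its centroid."""
    n = len(fingerprints)
    neighbours = [set() for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if tanimoto_distance(fingerprints[i], fingerprints[j]) <= cutoff:
                neighbours[i].add(j)
                neighbours[j].add(i)
    # Visit candidates from most to fewest neighbours (ties: lowest index).
    order = sorted(range(n), key=lambda i: (-len(neighbours[i]), i))
    assigned, clusters = set(), []
    for centre in order:
        if centre in assigned:
            continue
        members = [centre] + sorted(neighbours[centre] - assigned)
        assigned.update(members)
        clusters.append(members)
    return clusters


def butina_split(fingerprints, cutoff=0.5, test_fraction=0.2):
    """Assign whole clusters to test (smallest first) up to the target size."""
    target = round(len(fingerprints) * test_fraction)
    train, test = [], []
    for cluster in sorted(butina_clusters(fingerprints, cutoff), key=len):
        if len(test) + len(cluster) <= target:
            test.extend(cluster)
        else:
            train.extend(cluster)
    return sorted(train), sorted(test)


# Toy fingerprints: two groups of close analogues plus one outlier.
fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 3, 6},
       {10, 11, 12, 13}, {10, 11, 12, 14}, {20, 21}]
clusters = butina_clusters(fps, cutoff=0.5)
train_idx, test_idx = butina_split(fps, cutoff=0.5, test_fraction=0.5)
print(clusters, train_idx, test_idx)
```

Because clusters are never divided, no test item shares a cluster with a training item. Tightening the cutoff on the same toy data (for example to 0.3) leaves every item in its own cluster, at which point the split no longer keeps analogues together.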
Practical recommendation
For exploratory work:
start with random
For more realistic evaluation:
compare random and butina
The NCATS-sol example script renders both so you can inspect the difference directly.
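One way to put a number on the difference between two splits is to ask how similar each test item is to its closest training item. The sketch below is a hypothetical helper, not part of the example script; it again assumes fingerprints are sets of integers, and the two index splits are hand-written toy inputs.

```python
def tanimoto_similarity(a, b):
    """Tanimoto similarity for two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def mean_nearest_train_similarity(fingerprints, train_idx, test_idx):
    """Average, over test items, of the similarity to the closest train item."""
    nearest = [
        max(tanimoto_similarity(fingerprints[t], fingerprints[r])
            for r in train_idx)
        for t in test_idx
    ]
    return sum(nearest) / len(nearest)


# Toy fingerprints: two groups of close analogues plus one outlier.
fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 3, 6},
       {10, 11, 12, 13}, {10, 11, 12, 14}, {20, 21}]

# A split that separates analogues across train and test ...
mixed = mean_nearest_train_similarity(fps, train_idx=[0, 3, 5], test_idx=[1, 4])
# ... versus a split that keeps each group on one side.
grouped = mean_nearest_train_similarity(fps, train_idx=[0, 1, 2], test_idx=[3, 4, 5])
print(mixed, grouped)
```

A lower value means the test set sits further from the training data, which is the property a cluster-aware split is meant to enforce.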