Choose Random vs Butina

Use this guide to choose between the available train/test split strategies.

Random split

Use random when:

you want a simple baseline split
the dataset is large enough that a random partition is acceptable
you are doing quick iteration

Tradeoff:

close analogues can land in both train and test, so a random split may not stress scaffold or chemotype generalisation and can give optimistic scores
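As a rough illustration of what a random split does, here is a minimal sketch using only the Python standard library. The function name `random_split`, its parameters, and the toy SMILES list are assumptions made for this example; they are not the tool's actual API.

```python
import random


def random_split(items, test_fraction=0.2, seed=0):
    """Shuffle a copy of `items` and cut it into (train, test)."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(items)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Toy input (hypothetical): ten SMILES strings.
smiles = [
    "CCO", "CCN", "CCC", "c1ccccc1", "c1ccccc1O",
    "CC(=O)O", "CCCl", "CCBr", "c1ccncc1", "CC(C)O",
]
train, test = random_split(smiles, test_fraction=0.2, seed=42)
print(len(train), len(test))
```

Nothing in this procedure looks at chemical structure, which is why near-duplicates can end up on both sides of the split.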

Butina split

Use butina when:

you want cluster-aware splitting
you want test compounds to be chemically dissimilar from the training compounds
you want a more demanding validation setting

Tradeoff:

the split depends on the distance cutoff: a smaller cutoff yields more, tighter clusters
cluster-based splitting can produce a harder task, so metrics are often lower than with a random split
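The idea behind a cluster-aware split can be sketched in plain Python. This is a simplified Butina-style clustering under stated assumptions, not the tool's implementation: fingerprints are modelled as sets of integer feature ids, the names `butina_clusters` and `cluster_split` are hypothetical, and the rule of sending the smallest clusters to the test set is one possible choice among several.

```python
def tanimoto_distance(a, b):
    """1 - Tanimoto similarity for two feature sets."""
    union = len(a | b)
    return 1.0 - (len(a & b) / union if union else 1.0)


def butina_clusters(fps, cutoff):
    """Group indices so each member is within `cutoff` of its centroid."""
    n = len(fps)
    neighbours = [set() for _ in range(n)]
    for i in range(n):
        for j in range(i):
            if tanimoto_distance(fps[i], fps[j]) <= cutoff:
                neighbours[i].add(j)
                neighbours[j].add(i)
    # Visit candidates with the most neighbours first; each becomes a centroid
    # and claims its still-unassigned neighbours.
    order = sorted(range(n), key=lambda i: (-len(neighbours[i]), i))
    assigned, clusters = set(), []
    for i in order:
        if i in assigned:
            continue
        members = [i] + sorted(neighbours[i] - assigned)
        assigned.update(members)
        clusters.append(members)
    return clusters


def cluster_split(fps, cutoff, test_fraction=0.2):
    """Assign whole clusters to test (smallest first) until the target is met."""
    target = max(1, round(len(fps) * test_fraction))
    train, test = [], []
    for members in sorted(butina_clusters(fps, cutoff), key=len):
        (test if len(test) < target else train).extend(members)
    return sorted(train), sorted(test)


# Toy fingerprints (hypothetical): four chemotypes of decreasing size.
fps = [
    {1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 3, 6}, {1, 2, 3, 7},
    {10, 11, 12, 13}, {10, 11, 12, 14}, {10, 11, 12, 15},
    {20, 21, 22, 23}, {20, 21, 22, 24},
    {30, 31},
]
train_idx, test_idx = cluster_split(fps, cutoff=0.5, test_fraction=0.3)
print(train_idx, test_idx)
```

Because clusters are never divided, the test set can overshoot the requested fraction when a large cluster is the next one assigned. Lowering `cutoff` far enough turns every compound into its own cluster, at which point the split no longer separates chemotypes.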

Practical recommendation

For exploratory work:

start with random

For more realistic evaluation:

compare random and butina

The NCATS-sol example script renders both so you can inspect the difference directly.
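One way to put a number on the difference between two splits is to ask how similar each test compound is to its closest training compound. The sketch below is an illustrative diagnostic only, not part of the NCATS-sol script; the function names and the set-based toy fingerprints are assumptions.

```python
def tanimoto(a, b):
    """Tanimoto similarity for two feature sets."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def mean_nearest_train_similarity(train_fps, test_fps):
    """Average, over test items, of the highest similarity to any train item."""
    nearest = [max(tanimoto(t, r) for r in train_fps) for t in test_fps]
    return sum(nearest) / len(nearest)


# Toy fingerprints (hypothetical).
train_fps = [{1, 2, 3, 4}, {10, 11, 12, 13}]
leaky_test = [{1, 2, 3, 5}]   # close analogue of a training compound
distant_test = [{20, 21}]     # shares no features with the training set

print(mean_nearest_train_similarity(train_fps, leaky_test))
print(mean_nearest_train_similarity(train_fps, distant_test))
```

A higher value means the test set sits closer to the training data; a cluster-aware split would typically be expected to give a lower value than a random split on the same dataset.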