NNNS Chemistry blog

An organic chemistry dataset

12 October 2025 - Research update 00005

So what does it take to create a high-quality reaction SMILES dataset for use in Large Language Model training? A previous blog looked at reaction mapping as a possible metric for the reliability of a proposed reaction (link), but the method was dismissed because it returned too many false negatives. Can we instead look at a similarity index for a given reactant and product?
Similarity indexes have been used to search a database for similar reactions, and not just similar molecules, as early as 1990 according to one publication (doi), but I am unaware of any publication dealing with similarity indexes between product molecules and their precursors. And why should there be one? Reactants and reaction products are supposed to be dissimilar after all. But what if a plausible reaction has a reactant-product similarity that is just that bit higher than that of an implausible reaction? Could this be a way to filter out implausible reactions and dismiss them from the database?

In this exercise, solvents, reagents and catalysts are thrown out and, crucially, all reactants are combined into a single SMILES string separated by dots. The dots mean that, in an odd way, the reaction components are treated as a salt even when they obviously do not constitute one. The Tanimoto similarity index (the number of fingerprint bits two molecules share, divided by the number of bits set in either) can be calculated using RDKit, and insanely fast (1.44 million reactant-product pairs in 30 minutes on a consumer laptop), so that brings us to selecting the fingerprint. RDKit has several fingerprint options, and a useful fingerprint will have a sufficiently low index value for an implausible reaction and a sufficiently high value for a plausible one. An analysis (not very scientifically sound) yielded a preference for the MACCS keys fingerprint. This algorithm (repo) uses a set of 166 SMARTS patterns to scan molecule properties such as "does it have a ketone group?", "how many rings does it have?" or "is it a heterocycle?", and the more matches between two sets of 166 entries, the more similar two molecules are.
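As a rough illustration of the calculation described above, here is a minimal sketch using RDKit. The function name `maccs_similarity` and the choice to return `None` for unparsable or empty input are my own assumptions for this example, not the code actually used for the 1.44M run.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import MACCSkeys


def maccs_similarity(reactants_smiles, product_smiles):
    """Tanimoto similarity (MACCS keys) between dot-joined reactants and a product.

    Hypothetical helper: returns None when a side is empty or cannot be parsed.
    """
    if not reactants_smiles or not product_smiles:
        return None
    # Joining with "." makes RDKit read all reactants as one multi-fragment
    # molecule, which is the "salt" trick mentioned in the text.
    reactant_mol = Chem.MolFromSmiles(".".join(reactants_smiles))
    product_mol = Chem.MolFromSmiles(product_smiles)
    if reactant_mol is None or product_mol is None:
        return None
    fp_reactants = MACCSkeys.GenMACCSKeys(reactant_mol)
    fp_product = MACCSkeys.GenMACCSKeys(product_mol)
    return DataStructs.TanimotoSimilarity(fp_reactants, fp_product)


# Acetone (three equivalents) to mesitylene, the Organic Syntheses classic
score = maccs_similarity(["CC(C)=O", "CC(C)=O", "CC(C)=O"], "Cc1cc(C)cc(C)c1")
```

Returning `None` rather than 0 keeps broken records (for instance a missing product) separate from reactions that are merely dissimilar, so they can be inspected on their own.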

Running the exercise quickly unearths an uncomfortable truth: the bottom 0 - 0.2 range of the MACCS index has lots of dubious reactions! One subset consists of reactions missing a product, very embarrassing because these should have been removed in a database query ages ago. In another subset one of the reactants ended up on the product side of the equation. No reaction should be left behind (!) and fortunately it was possible to fix many of them manually. In many cases there was no obvious relationship between reactant and product. In yet another relevant subset, reaction components were assigned the wrong role: a compound classified as a reactant was in fact a reagent, or a compound classified as a solvent was in fact a reactant. Fortunately this type of role mismatch can be fixed in an automated way on a per-reagent basis. Again using RDKit, all reactions a specific reagent is involved in can be checked for the presence of a fragment of that reagent in the product molecule. If it is present then the compound is assigned a reactant role, and otherwise a reagent role. For example, the reagent methanesulfonyl chloride when incorporated leaves a S(=O)(=O)([C&H3]) fragment in the product, and in a dataset of 14K entries, 5K entries were reassigned a reactant role and 1.5K entries were reassigned a reagent role. Other examples: hydrazine hydrate (9.6K) with 3.6K reassigned to reactant, or triphenylphosphine (20K) with 0.6K entries reassigned to reactant. The project is getting somewhere!
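The per-reagent role check can be sketched roughly as follows, using the methanesulfonyl fragment quoted above as a SMARTS pattern. The function name `assign_role` and the two example products are assumptions made for illustration; the actual script may well differ.

```python
from rdkit import Chem

# Fragment left behind by methanesulfonyl chloride, taken from the text above
MESYL_FRAGMENT = "S(=O)(=O)([C&H3])"


def assign_role(product_smiles, fragment_smarts):
    """Return "reactant" if the fragment shows up in the product, else "reagent".

    Hypothetical helper: returns None when the product or pattern cannot be parsed.
    """
    product = Chem.MolFromSmiles(product_smiles)
    pattern = Chem.MolFromSmarts(fragment_smarts)
    if product is None or pattern is None:
        return None
    return "reactant" if product.HasSubstructMatch(pattern) else "reagent"


# Ethyl methanesulfonate keeps the mesyl group, so the compound was a reactant
role_incorporated = assign_role("CS(=O)(=O)OCC", MESYL_FRAGMENT)
# Propene carries no sulfur, so the compound only acted as a reagent
role_not_incorporated = assign_role("C=CC", MESYL_FRAGMENT)
```

One limitation of such a check: if another reactant already carries the same fragment, the product will match regardless, so a stricter version would probably also inspect the other reactants.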

The outcome of the final run of the 1.44M dataset is depicted in this table as total record volume against the calculated index in 0.05 point intervals. The front of the distribution, up to an index of 0.55, accounts for 5.6% of the total count. The 0.55 cutoff is based on this article, in which thresholds are calculated from random molecule pairs for many fingerprints. In the case of the MACCS index, 90% of random pairs happen to have an index of less than 0.55. Not that all these reactions are dodgy. For example, a reaction classic from Organic Syntheses, the conversion of acetone to mesitylene (Link), is legitimate but has a terrible index. Therefore none of the reactions will be dismissed. And the real figure for dodgy reactions? According to an expert panel of organic chemists (me), out of a random 100 reactions in the 1.44M dataset, on average 4 reactions have issues.
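For completeness, the binning into 0.05 point intervals and the share of records below the 0.55 cutoff could be computed along these lines. This is a plain Python sketch; the function names and the toy scores are made up for the example and are not the real data.

```python
from collections import Counter


def bin_scores(scores, width=0.05):
    """Count scores per interval; bin index 0 covers [0, width), and so on."""
    n_bins = round(1 / width)
    counts = Counter()
    for s in scores:
        # The small epsilon guards against float error at bin edges, and the
        # clamp puts a perfect score of 1.0 into the top bin.
        idx = min(int(s / width + 1e-9), n_bins - 1)
        counts[idx] += 1
    return counts


def fraction_below(scores, cutoff=0.55):
    """Share of scores strictly below the cutoff (0.0 for an empty list)."""
    if not scores:
        return 0.0
    return sum(1 for s in scores if s < cutoff) / len(scores)


toy_scores = [0.12, 0.31, 0.33, 0.58, 0.97, 1.0]  # invented values
histogram = bin_scores(toy_scores)
front_share = fraction_below(toy_scores)
```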