CRD year in review 2023

05 January 2024 - Taking stock

And what a year it was! How will it be remembered? We have everything: perpetual war, a climate disaster, democracy under threat, out-of-control machine-intelligence, all threatening our very existence. Can we still have civilized discussions about silly things like chemistry?

The Chemical Reaction Database is still growing: 137K reactions from USPTO 2023 are now included, bringing the total to 843K. A backlog of USPTO records from 2017 to 2022 remains and is a work in progress. The production line is now automated; the only manual tasks remaining are importing and extracting the weekly zip files. The 2023 USPTO batch is now available on Figshare as a reaction SMILES text file (10.6084/m9.figshare.24921555). The 2023 batch was partly processed by ChatGPT as an alternative to ChemicalTagger (summary here). All of this means that I have time to try out new software. It was relatively easy to get Decimer installed (macOS, M2, Conda). This library enables optical recognition of molecules in PDF files. It would make it easier to extract reactions from the academic literature: if a systematic name is unavailable, the molecules are captured as PNG files and Decimer returns a SMILES string for each of them. I ran a small trial with it: it performs admirably, but more complex molecules are a challenge. As a next developmental step it is also possible to apply optical recognition to complete reaction schemes as they appear in the academic literature. The toolkit for that is called ReactionDataExtractor and I intend to try it out.
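For readers who want to work with the reaction SMILES text file, here is a minimal sketch of how one line could be split into its parts. It relies only on the general reaction SMILES convention (reactants>agents>products, with molecules separated by dots). The function name and the assumption that anything after the first whitespace is an identifier are illustrative and not taken from the Figshare file itself.

```python
from typing import List, NamedTuple


class Reaction(NamedTuple):
    reactants: List[str]
    agents: List[str]
    products: List[str]


def parse_reaction_smiles(line: str) -> Reaction:
    """Split one reaction SMILES line into reactants, agents and products.

    Hypothetical helper: assumes the reaction SMILES is the first
    whitespace-separated token on the line.
    """
    tokens = line.strip().split(maxsplit=1)
    if not tokens:
        raise ValueError("empty line")
    parts = tokens[0].split(">")
    if len(parts) != 3:
        raise ValueError(f"expected two '>' separators, got: {tokens[0]!r}")

    def molecules(field: str) -> List[str]:
        # Dots separate molecules; an empty field yields an empty list.
        return [m for m in field.split(".") if m]

    return Reaction(*(molecules(p) for p in parts))


if __name__ == "__main__":
    # Fischer esterification with sulfuric acid as the agent.
    print(parse_reaction_smiles("CC(=O)O.OCC>OS(=O)(=O)O>CC(=O)OCC.O"))
```

One caveat: a dot also separates the ions of a salt, so a simple split like this counts a salt as two molecules.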

Speaking of the academic literature, a new dataset called CJHIF came to my attention recently, containing 3 million reaction SMILES from the academic literature. It comes with information on yield, includes the solvents and reagents, and can be found here. All that is known about its creation is that it is based on several high-impact journals and that it has been a Chinese effort with paid collaborators. It was made available by Shu Jiang, who deployed the dataset (and others) for work on predicting the practicality of chemical reactions (DOI). It has already been picked up by Victor Gil, who very recently used it in work on reaction prediction models (DOI), and another dataset (based on CJHIF) called CORISO is also available here.

Incorporating the dataset into the Chemical Reaction Database will at the very least require converting each SMILES string to a systematic name. I was happy to find out about STOUT (SMILES-TO-IUPAC-name translator), a toolkit that does just that. As the accompanying paper explains (DOI), the tool is based on deep learning and machine translation. As an initial test of both the CJHIF dataset and STOUT, I have everything up and running, and STOUT has been spitting out IUPAC names without any complaints for the past 4 days.
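A run of several days makes it worthwhile to translate each distinct molecule only once and to keep track of failures. Below is a minimal sketch of such a loop; it is not the actual production code. The name name_molecules is hypothetical, and the translator is passed in as a plain function so the example runs without any model installed. As far as I know the STOUT package exposes a translate_forward function that could be plugged in here, but treat that as an assumption.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def name_molecules(
    smiles_list: Iterable[str],
    translate: Callable[[str], str],
) -> Tuple[Dict[str, str], List[str]]:
    """Translate each distinct SMILES once; collect failures separately.

    `translate` is any SMILES-to-name function (assumption: something like
    STOUT's translate_forward). Returns (names by SMILES, failed SMILES).
    """
    names: Dict[str, str] = {}
    failed: List[str] = []
    seen = set()
    for smi in smiles_list:
        if smi in seen:
            continue  # already handled, skip the expensive call
        seen.add(smi)
        try:
            name = translate(smi)
        except Exception:
            # One bad molecule should not stop a multi-day run.
            failed.append(smi)
            continue
        if name:
            names[smi] = name
        else:
            failed.append(smi)
    return names, failed


if __name__ == "__main__":
    # Stand-in translator for demonstration only.
    lookup = {"CCO": "ethanol", "CC(=O)O": "acetic acid"}
    print(name_molecules(["CCO", "CC(=O)O", "CCO"], lookup.__getitem__))
```

Because the same solvents and reagents recur across millions of reactions, skipping repeats should cut the number of model calls considerably.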

All in all, a target of 2 million reactions by the end of 2024 is not unreasonable. See you there!