The prompt engineer and the AI assistant
30 April 2023 - AI
And the results are in! For the past 6 months the CRD project has been importing current (2022-2023) USPTO patents, with each valid synthetic procedure parsed and included in the CRD database. To give you an idea, here are the numbers.
A total of 105K patents were imported, totaling 18 GB of data. Of course only a fraction deals with organic chemistry (C07), and a weekly batch typically consists of 200 files (60-80 MB). The total number of potential procedures in each weekly batch? Difficult to assess, but on average 3900. After isolating the individual procedures and running ChemicalTagger, the success rate turned out to be 40 to 50%, with 40 to 50 relevant patents each week (0.5% of the total). A total of 25K reactions could be added to the CRD database.
Three weekly USPTO batches were processed using ChatGPT-3. My OpenAI API account was awarded 18 dollars in grant money, so there was something to play with! I managed to spend half the amount because the grant was time-limited to 6 weeks. In any event, the chatbot performs as well as ChemicalTagger, with a success rate of around 46%. Both tools have their quirks: ChemicalTagger cannot handle a pinacol ester or XPhos, while ChatGPT has problems with hydrogen. Be aware that the bot has been extensively trained on the entire USPTO corpus in the first place, which may explain its confident attitude towards chemistry.
ChatGPT can also spontaneously forget that it is supposed to return data in JSON and revert to freestyle text. The job of prompt engineer is the hottest new job on the planet and I can understand why. It takes some time to figure out how to explain to the AI assistant how it should answer your questions. You cannot simply ask it to list all the solvents but exclude the extraction solvents (the CRD database only accepts reaction solvents), but you can ask the bot to add the solvent role for each solvent. It will then dutifully explain whether a solvent is used as a reaction medium or as an extraction or chromatography solvent.
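The workaround described above (ask for a role per solvent, then filter afterwards, and guard against replies that are not JSON) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the prompt wording, the JSON layout with "solvents", "name" and "role" keys, and the function names are all hypothetical and are not the ones used by the CRD project. No API call is made; the sketch only handles the text that comes back.

```python
import json

# Hypothetical prompt: the wording and the JSON layout are assumptions,
# not the actual prompt used for the CRD import.
PROMPT_TEMPLATE = (
    "Extract all solvents from the procedure below. "
    "Reply with JSON only, in the form "
    '{{"solvents": [{{"name": "...", "role": "..."}}]}}, '
    "where role is one of: reaction, extraction, chromatography.\n\n"
    "Procedure:\n{procedure}"
)


def build_prompt(procedure):
    """Insert the procedure text into the (hypothetical) prompt."""
    return PROMPT_TEMPLATE.format(procedure=procedure)


def parse_reply(reply):
    """Return the parsed JSON object, or None if the reply is freestyle text."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        # The bot may wrap the JSON in prose; try the outermost braces.
        start, end = reply.find("{"), reply.rfind("}")
        if start == -1 or end <= start:
            return None
        try:
            data = json.loads(reply[start:end + 1])
        except json.JSONDecodeError:
            return None
    return data if isinstance(data, dict) else None


def reaction_solvents(data):
    """Keep only the names of solvents whose role is 'reaction'."""
    names = []
    for solvent in data.get("solvents", []):
        if not isinstance(solvent, dict) or "name" not in solvent:
            continue
        if str(solvent.get("role", "")).lower() == "reaction":
            names.append(solvent["name"])
    return names
```

A reply for which parse_reply returns None is one where the bot forgot about JSON altogether; in that case the procedure could simply be sent again or skipped.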
ChatGPT is just one AI tool in an avalanche of new initiatives. The CRD project will attempt to keep up, or at least it has made a solemn resolution to that effect. The ChatGPT API cannot access the Internet, and it would be nice if an AI were able to directly read and analyze a supplementary information document from a given academic article. A new AI assistant that can at least make sense of a PDF is the Chrome extension SciSpace Copilot. When it comes to summarizing a page, my test run was a failure. Just for kicks I selected the very first procedure in volume 1 of Organic Syntheses from 1921 (the reaction of glycerol to allyl alcohol). Asked to list all chemical compounds mentioned in the article, it first ignored glycerol. When asked to specify the amount of glycerol used (2 kg), it was confident the amount was not included in the text. And when told that the amount was there and that it was 2 kg, the assistant apologized that, yes, the amount was indeed 2 kg. It will not be the first time an AI assistant drives the prompt engineer mad.