The hamburger problem
23 November 2019 - Data mining
In a recent article submitted to ChemRxiv, Christopher Southan introduces the reader to the hamburger problem in big data science (link). Two things keep data scientists from mining the huge mountain of scientific data that has accumulated over the past 200 years. One is the paywall and the refusal of scientific publishers to make data available. Some time ago this blog featured the efforts of Peter Murray-Rust on this topic (link). But even with the paywall eradicated, the information is still locked in millions of tiny data parcels called PDF files. Southan reminds us of the quote "We have spent millions putting data into the literature but now have to spend millions more getting it back out".
Southan's primary focus is on medicinal chemistry and the assets contained in a typical article: the Document itself (identified by a DOI), the Activity (the assay), the Result, the Compound and the Protein, or DARCP for short. CAS-SciFinder is a large chemical database but limits its handling to documents and compounds (D-C). DARCP-specific databases exist, but they are small and fragmented. Southan notes several shortcomings in the current publication system: the PubChem open chemistry database currently does not have overly active contributors (link), just one journal demands the publication of SMILES, and the collection of identifiers (article, compound) cannot be expected to be complete. Main conclusion: researchers in the medicinal chemistry field are a long way from having their data corralled in a machine-readable database. In terms of the hamburger problem, DARCP is the meat and the PDF the hamburger.
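To make the DARCP idea a little more concrete, here is a minimal Python sketch of what one machine-readable record could look like. It is an illustration only: the class name, the field names and every value are hypothetical placeholders, not taken from Southan's article or from any existing database schema.

```python
from dataclasses import dataclass, fields
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class DarcpRecord:
    """One Document-Activity-Result-Compound-Protein link (hypothetical field names)."""

    document_doi: Optional[str] = None  # D: the article, identified by its DOI
    activity: Optional[str] = None      # A: the assay that was run
    result: Optional[str] = None        # R: the measured value, kept as plain text here
    compound: Optional[str] = None      # C: the compound, for example as a SMILES string
    protein: Optional[str] = None       # P: an identifier for the protein target

    def missing(self) -> List[str]:
        """Return the names of the fields that were not captured."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]

    def document_compound(self) -> Tuple[Optional[str], Optional[str]]:
        """Return the D-C projection: document and compound only."""
        return (self.document_doi, self.compound)


# Placeholder values, not taken from any real article.
full = DarcpRecord(
    document_doi="10.0000/placeholder",
    activity="enzyme inhibition assay",
    result="IC50 = 1.0 uM",
    compound="CCO",
    protein="PLACEHOLDER_TARGET",
)
dc_only = DarcpRecord(document_doi="10.0000/placeholder", compound="CCO")

print(full.missing())     # []
print(dc_only.missing())  # ['activity', 'result', 'protein']
```

A record that holds only the document and the compound still links the two, but the assay, the result and the target, the parts a medicinal chemist would want to mine, are simply absent.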
As a random trial, three open-access articles in the field of medicinal chemistry were examined, all about novel compounds inhibiting enzymes associated with Alzheimer's disease. As a plus, all three articles (from the European Journal of Medicinal Chemistry, the International Journal of Molecular Sciences and the Journal of Enzyme Inhibition and Medicinal Chemistry) publish the systematic IUPAC names of the compounds synthesized. For comparison, in synthetic organic chemistry the most common identifier for a chemical compound is still a compound number (1, 2, 3), just as it was one hundred years ago. Unfortunately, none of the three articles identifies the starting materials, which again makes inclusion in a reaction database cumbersome.