The language of science might be mathematics, but scientific knowledge is largely captured in the written word. Automating the capture of that knowledge is the aim of research published in the journal Nature. The approach promises to overcome an inherent problem of text: it is difficult to process using conventional statistical analysis or even modern machine-learning methods.

Would that scientific knowledge were intrinsically open to analysis. But, as Vahe Tshitoyan of the Lawrence Berkeley National Laboratory, Berkeley, California, USA, and colleagues point out, the main source of machine-interpretable data for the materials research community has been structured property databases. Unfortunately, these databases represent a mere fraction of all the knowledge that might otherwise be accessible and analyzed. Scientific publications contain more than the property values of materials, though; embedded within them are the connections and relationships among those data as interpreted by the authors.

Natural language processing has been tried as a way to extract this kind of information, but it typically requires enormous labeled data sets on which the algorithms can be trained, an obstacle the team needed to circumvent. They have now demonstrated that materials science knowledge in the published scientific literature can be encoded efficiently as what they refer to as "information-dense word embeddings": vector representations of words generated without any need for a materials scientist to label data or supervise the training.
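One common way to generate such unsupervised embeddings is the Word2Vec skip-gram model. The sketch below, which uses the gensim library and a hypothetical toy corpus standing in for millions of tokenized abstracts, shows the general idea rather than the team's actual pipeline.

```python
# Minimal sketch: unsupervised word embeddings from tokenized abstracts.
# The corpus here is a hypothetical stand-in for a real literature corpus.
from gensim.models import Word2Vec

abstracts = [
    "we report the thermoelectric properties of bi2te3 thin films".split(),
    "zt values of pbte based compounds exceed unity at high temperature".split(),
    # ...in practice, millions more tokenized abstracts
]

model = Word2Vec(
    sentences=abstracts,
    vector_size=200,  # dimensionality of each word embedding
    window=8,         # context window around each target word
    min_count=1,      # keep rare tokens in this toy corpus
    sg=1,             # skip-gram rather than CBOW
    workers=4,
)

# Each word in the vocabulary is now a dense vector; no labels were supplied.
vec = model.wv["bi2te3"]
print(vec.shape)  # (200,)
```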

"Without any explicit insertion of chemical knowledge, these embeddings capture complex materials science concepts such as the underlying structure of the periodic table and structure-property relationships in materials," the team writes [Tshitoyan, V. et al. Nature (2019) 571, 95-98; DOI: 10.1038/s41586-019-1335-8]

The important point is that once this unsupervised generation of embeddings has been done, the system can recommend specific materials for particular functional applications. Using historical data, the team demonstrated that such applications could have been foreseen with their method several years before the actual discoveries were made. "This suggests that latent knowledge regarding future discoveries is to a large extent embedded in past publications," the team adds.
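In the simplest reading, such a recommendation amounts to ranking candidate material names by the cosine similarity of their embeddings to an application keyword. A minimal sketch under that assumption, with a hypothetical keyword and candidate list:

```python
# Rank hypothetical candidate materials by similarity to an application
# keyword, using the embeddings trained above.
candidates = ["bi2te3", "pbte", "nacl"]
scores = {
    name: model.wv.similarity(name, "thermoelectric")
    for name in candidates
    if name in model.wv and "thermoelectric" in model.wv
}
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.3f}")
```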

The team trained their system on some 3.3 million scientific abstracts published between 1922 and 2018 in more than a thousand journals relevant to materials science, generating a vocabulary of about half a million words, so the implications are enormous. The approach could make mining this huge part of the scientific record for knowledge and relationships far more efficient, and it points to a general strategy for finding ideas and inspiration.