In 2007, I wrote an article for Materials Today introducing the idea that Materials Informatics should be considered a formal sub-discipline of materials science and engineering, just as Bioinformatics is, for instance, in the life sciences. One of the points I made in that original article centered on the need to assess whether we had a “data feast or famine” in any given materials science problem. The idea that we could have a “feast”, or an overabundance of data, may at first seem puzzling, but I argued that quantitatively defining what constitutes “enough” or the “right” data in any given materials science problem is a critical aspect of Materials Informatics. With the growing popularity of phrases such as “materials by design”, “integrated computational materials engineering” and “materials genome”, there is a sense that informatics simply equates to the need for more data. From that perspective, the focus of “informatics” activities invariably leans towards collecting and generating more data, addressing the software and information management challenges of organizing and querying that data in some digital form, and finally distributing that data by taking advantage of advances in information technology. Hence topics such as combinatorial materials science and high throughput screening, coupled with digital libraries, have blossomed in a wide array of materials-science-oriented disciplines. There has been a similar surge of effort in the computational materials science community, driven largely by the field of condensed matter physics, in taking advantage of computational tools and algorithms to generate massive arrays of results proposing new materials chemistries and properties.
While acknowledging that these efforts are all important and needed, it should be pointed out that they explore only one aspect of the field of Materials Informatics. The data-driven paradigm for materials discovery needs to explore a much higher-dimensional feature space in data, one that includes issues such as uncertainty, skewness and sparseness, as well as the diverse and numerous forms of data, including numerical, textual, conceptual and imaging. The integration of all these different modalities and attributes of data, along with volume, is what Materials Informatics is really about.
If enhancing data volume is an important but not sufficient criterion for informatics, then where in fact should the field of Materials Informatics be headed? The answer lies in harnessing the paradigm of “Big Data”, where the word “big” refers to the dimensionality of the correlations to be explored in analyzing data-driven problems, of which the volume of data is just one aspect. The broader information science community has appropriately defined Big Data as governed by four metrics: volume, velocity, variety and veracity. The “4 Vs” are the core of informatics, yet at present most Materials Informatics efforts focus only on expanding the volume of data while largely ignoring the other Vs, the data analytics component of informatics.
Data volume, of course, we readily understand. Data velocity refers to harnessing real-time data acquisition (e.g., data from dynamics experiments). Data variety is concerned with the fact that data takes all forms in materials science, ranging from discrete numerical values to qualitative descriptions of materials behavior and imaging data. Veracity, the final V, acknowledges the practical reality in materials science that we have a lot of “missing” data and that the data we do have has uncertainty associated with it. Quantifying that uncertainty and knowing how to fill in the data gaps with limited knowledge are challenging yet achievable goals when one judiciously links the tools of statistical learning, data mining and statistics with materials physics, chemistry and engineering principles. Even with limited data, by taking this approach and addressing the other Vs of Big Data, we have had success in predicting new materials, identifying new physical parameters controlling structure-property relationships and developing accelerated means of generating reference data. This is the power of informatics.
Finally, we need to reiterate that our ultimate goal for Materials Informatics is to discover new knowledge. Increasing data volume alone does not necessarily increase knowledge, a fact well known in the computer and information science fields as well as in domain applications such as genomics and biotechnology. Knowledge often trails behind data, and simply increasing data without addressing the other Vs exacerbates the problem by making the gap between knowledge and data even larger. Being surrounded by lots of data can bring a false sense of intellectual security. Informatics is the science of how to address the 4 Vs of Big Data simultaneously and how to integrate the findings from these efforts. This is where the tools of machine learning, coupled with statistics, need to be judiciously linked to the foundations of materials science, namely theory, modeling and experiments, to make databases a laboratory for generating new information and not just a repository for retrieving known or expected information. Informatics is the science that formalizes the use of those tools, and it holds the key to a promising and rich future for materials science.
This article was originally published in Materials Today (2012) 15(11), 470.