Extracting physics through deep data analysis

In recent decades humankind has become very adept at generating and recording enormous amounts of data, ranging from tweets and selfies on social networks to financial transactions in banks and stores. The scientific community has not shunned this popular trend and now routinely produces hundreds of petabytes of data per year [1]. The reason is that materials and phenomena in the world around us exist in an interwoven, entangled form, which gives rise to the complexity of the Universe and determines the size and complexity of the data that describe it. Science and technology endeavor to unravel this convolution and extract pure components from the mixtures, be it in ore mining and metal smelting or in the separation of thermal conductivity into electronic and phononic contributions. Decomposition of complex behavior is the key to understanding the manifestations of Nature. However, tools to carry out this task are not readily available, and intricate systems therefore often remain well characterized experimentally but poorly understood owing to the intricacy of the collected data. In materials science, understanding and ultimately designing new materials with complex properties will require the ability to integrate and analyze data from multiple instruments, including computational models, designed to probe complementary ranges of space, time, and energy.

This problem is particularly relevant in the field of imaging. Much of the progress in electron and scanning probe microscopies since the 1980s was enabled by computer-assisted methods for data acquisition and analysis. New developments in imaging technologies since the beginning of the twenty-first century have opened the veritable floodgates of high-veracity information describing the structure and functionality of materials. These data often come in the form of multidimensional data sets containing partial or full information on a material's response to a range of external stimuli acquired over time. Typical examples of Big Data in the field are the spectroscopic modes of scanning transmission electron microscopy (STEM) and scanning probe microscopy (SPM). Possessing high variability and containing information on the nanoscale behavior of materials, these data sets could be very useful for understanding and controlling materials functionality. The challenge, however, is to convert them into useful information on materials structure and functionality.

The information hidden in the data can be explored using two complementary frameworks: statistical and physical. The former reveals the variability and abundance of different behaviors and the internal correlations within the data set. The latter postulates the physical mechanisms underpinning the observed behaviors, or attempts to infer them by comparison with similar systems. For instance, electron energy loss spectroscopy (EELS) can map the distribution of spectra across a sample. The number of distinct spectra (behaviors) and their relative abundance maps comprise the statistical information of an EELS data set. Typically, however, the local spectra of individual elements appear mixed owing to the complex local chemical composition, and the physical information – the exact local chemistry, relaxation processes, and so on – lies deeper in the data and is more fundamental.
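To make the structure of such data concrete: a spectrum image of this kind is usually treated as a three-dimensional data cube (two spatial axes and one energy axis) that is flattened into a matrix of local spectra before any decomposition. The Python sketch below illustrates this step; the array sizes are hypothetical placeholders, not values from the works discussed here.

    # Minimal sketch: arranging a hyperspectral (e.g., EELS) data cube
    # for multivariate analysis. Sizes below are assumed for illustration.
    import numpy as np

    nx, ny, n_energy = 64, 64, 1024          # assumed scan grid and spectral bins
    cube = np.random.rand(nx, ny, n_energy)  # placeholder for measured intensities

    # Flatten the spatial axes: one row per probed location, one column per energy.
    spectra = cube.reshape(nx * ny, n_energy)
    print(spectra.shape)                     # (4096, 1024)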

Multivariate statistical methods used in data mining allow the extraction of this statistical information. Algorithms such as principal component analysis (PCA), independent component analysis (ICA), linear discriminant analysis (LDA), and k-means clustering have been widely used in hyperspectral image processing, STEM, electronic-nose gas sensing, neuroscience, and the general area of chemometrics. To go deeper and also extract the physical information, more sophisticated mathematical tools are required – tools that are based on physical models and incorporate physical constraints into the outcome of the statistical analysis.
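As a hedged illustration of this purely statistical step, the sketch below applies PCA – here via scikit-learn, one common implementation – to flattened spectra like those in the previous snippet. The component spectra and loading maps summarize variability and correlations in the data, but, as discussed below, they may contain negative intensities and thus need not be physically interpretable.

    # Sketch only: PCA of flattened spectra (random stand-in for real data).
    import numpy as np
    from sklearn.decomposition import PCA

    nx, ny, n_energy = 64, 64, 1024
    spectra = np.random.rand(nx * ny, n_energy)   # stand-in for a measured data set

    pca = PCA(n_components=4)                     # the number of behaviors is a choice
    loadings = pca.fit_transform(spectra)         # (4096, 4) per-pixel weights
    eigenspectra = pca.components_                # (4, 1024) component "spectra"

    loading_maps = loadings.reshape(nx, ny, 4)    # spatial maps of each component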

One such approach – Bayesian linear unmixing (BLU) – was recently developed by Dobigeon et al. [2]. This algorithm treats experimental data as a linear combination of a specified number of behaviors (e.g., spectra) and endeavors to unmix them into pure components with respective abundance maps. BLU features abundance nonnegativity and sum-to-one constraints, which ensure that the resultant maps and components are physically meaningful (e.g., the intensity of the spectral signal is nonnegative at all energies) – something that a number of multivariate methods explicitly lack. This tool has been successfully used [3] to unmix EELS data sets, yielding the spectra of pure elements and the background, as well as the spatial distribution of their abundances. The same work showed that neither PCA nor ICA can handle this task, as both produce spectra that are not physically meaningful or interpretable. A very recent work [4] demonstrates a more complicated case: BLU was applied to SPM voltage spectroscopy data collected from a two-material nanocomposite (see figure). The analysis showed that both material components and their interface can be identified, and that four different conductive behaviors, corresponding to two conduction mechanisms (Fowler-Nordheim tunneling and Poole-Frenkel transport), are responsible for the diversity in the raw data. Unlike the interpretation of the EELS data, which is performed on the basis of a priori known spectra of pure chemical elements, the individual conductive behaviors of the nanocomposite were not known in advance but were found in the process of analysis. The fact that the chemical composition and physical mechanisms became identifiable in these two examples via BLU highlights its usefulness and bodes well for its future applicability.
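BLU itself is a full Bayesian algorithm and is beyond a short snippet; the sketch below instead illustrates the constrained linear mixing model at its core – each measured spectrum y is modeled as y ≈ Ma, with the abundances a nonnegative and summing to one – for the simpler case where the pure-component spectra are already known, using the classic augmented nonnegative least-squares trick. All names, sizes, and the delta weight are illustrative assumptions, not part of the original algorithm.

    # Not BLU itself: a minimal sketch of the constrained linear mixing model,
    # y ~= M a with a >= 0 and sum(a) = 1, solved per pixel. The sum-to-one
    # constraint is enforced (softly) by appending a weighted row of ones
    # before a nonnegative least-squares solve.
    import numpy as np
    from scipy.optimize import nnls

    def unmix_pixel(y, M, delta=1e3):
        """Estimate abundances for one spectrum y, given endmember columns of M."""
        n_endmembers = M.shape[1]
        M_aug = np.vstack([M, delta * np.ones(n_endmembers)])  # encodes sum(a) = 1
        y_aug = np.append(y, delta)
        a, _ = nnls(M_aug, y_aug)                              # a >= 0 by construction
        return a

    # Toy check: two assumed endmember spectra mixed 70/30 with a little noise.
    rng = np.random.default_rng(0)
    M = np.abs(rng.standard_normal((100, 2)))
    y = M @ np.array([0.7, 0.3]) + 0.01 * rng.standard_normal(100)
    print(unmix_pixel(y, M))   # approximately [0.7, 0.3]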

Materials scientists need methods like BLU, which combine the power of statistics and physics, to extract physically meaningful information about the behavior of materials and thereby improve their functionalities. The synergy between imaging and data analytics will harness the multivariate statistical methods and modern computing power that enabled hallmarks of modern civilization such as Google, and apply them to understanding and exploring multidimensional imaging and spectroscopy data sets. Rather than requiring multiple samples, the structure-property relationships extracted from a single disordered sample could offer a statistical picture of materials functionality, providing the experimental counterpart to Materials Genome-type programs, in which advances in theoretical methods and computational capacity have enabled large-scale simulations and high-throughput screening of material properties (also leading to a plethora of data). Application of exploratory data analysis tools to multidimensional structural and spectroscopic data sets will allow the divestment of human visual perception as the benchmark of meaning, and a transition away from the 'illustration' mode of microscopy studies. This approach will reveal new local behaviors and previously unseen local structure-property correlations. It will also finally allow us to describe and explore systems with nanoscale phase separation – ferroelectric relaxors, morphotropic systems, phase-separated manganites, and more – as well as more disordered mesoscopic systems ranging from non-crystalline soft materials to fossil fuels, batteries, and fuel cells.

This research was conducted at the Center for Nanophase Materials Sciences, which is sponsored at Oak Ridge National Laboratory by the Scientific User Facilities Division, Office of Basic Energy Sciences, U.S. Department of Energy.

Further reading

[1] A.A. White, MRS Bull., 38 (2013), pp. 594–595.
[2] N. Dobigeon, et al., IEEE Trans. Signal Process., 57 (2009), pp. 4355–4368.
[3] N. Dobigeon, N. Brun, Ultramicroscopy, 120 (2012), pp. 25–34.
[4] E. Strelcov, et al., ACS Nano, 8 (2014), pp. 6449–6457.

DOI: 10.1016/j.mattod.2014.10.002