I am a data scientist, and this article is aimed at my fellow practitioners. It is mostly based on a 2017 paper called “Theory-guided Data Science: A New Paradigm for Scientific Discovery from Data” [TGDS]. [TGDS] presents to natural scientists what data science can contribute to their work. Here, we will discuss [TGDS] and perform the reverse mapping: where can our practices as data scientists fit in the scientific endeavour? How can we advertise ourselves to scientists who are willing to incorporate data science into their work?
Data science is now commonplace in many industries. The increasing availability of data, combined with progress in machine learning and computing in general, is a major factor in this ubiquity.
Machine learning has long been enlisted for business purposes, fostered by industries such as e-commerce and marketing. But we see more and more applications in the natural sciences, where algorithms are used to stimulate scientific discovery. Here is a quick, non-comprehensive list of uses of machine learning in the natural sciences (more details can be found in [TGDS] and [MLPHYS]):
However, there is more to it than throwing the latest machine learning algorithm into the broth. Here we are dealing with physical phenomena for which we only have imperfect models in the first place. Data science can surely help improve those mathematical models. But to be of practical use, the machine learning outputs must be physically consistent. This means that special care must be taken when we marry data and natural sciences together, that is, when we incorporate domain knowledge into machine learning.
The literature we have gathered was produced by physicists and engineers from different fields. Some of the authors conducted actual experiments, comparing physics-based with data-based approaches; others reviewed such experiments in order to draw lessons from them. According to them, here are the stakes that must guide us when putting data science to good use:
In other words, physicists and engineers take a critical view of the use of data science in their fields of expertise. They do not oppose the deductive (physics-based) and inductive (data-based) strategies but acknowledge their complementarity (“In a narrow context of the present study, the machine learning is defined as the capability to create effective surrogates for a massive amount of data from measurements and simulations” [CLASSIF]).
There are many ways that data science and physics can be combined. The concept of Theory-Guided Data Science has emerged from the need to capitalize on existing practices; it is extensively described in the [TGDS] paper. We can see TGDS as a plea for mutual awareness between the two fields. It acknowledges both the increasing importance of data in natural science, and the need to build a catalog of good practices for fulfilling the stakes cited in the introduction of this article: “The overarching vision of TGDS is to introduce scientific consistency as an essential component for learning generalizable models.”
In [TGDS], the performance of a machine learning model is defined as:
Performance ∝ Accuracy + Simplicity + Consistency
Consistency is the driving principle behind the various practices detailed in [TGDS]. Those practices are organized into five TGDS research themes, namely:
(A) Theory-guided design of data science models
(B) Theory-guided training of data science models
(C) Theory-guided refinement of data science outputs
(D) Learning hybrid models of theory and data science
(E) Augmenting theory-based models using data science
In this theme, domain knowledge influences the choice of the machine learning algorithm, its architecture, and (when applicable) the link function between the inputs and the output.
As an example, the scientist may be guided towards an artificial neural network, with convolutional layers, because the problem under study involves spatial correlations.
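The intuition behind that architectural choice can be made concrete. A convolution commutes with translations, so a spatial pattern is detected identically wherever it appears in the input; this is a minimal numpy sketch of that property (illustrative only, a real model would use a deep-learning framework):

```python
import numpy as np

def conv2d_circular(image, kernel):
    """Naive 2D circular cross-correlation of an image with a small kernel."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros_like(image, dtype=float)
    for i in range(H):
        for j in range(W):
            for a in range(kh):
                for b in range(kw):
                    out[i, j] += kernel[a, b] * image[(i + a) % H, (j + b) % W]
    return out

rng = np.random.default_rng(0)
img = rng.normal(size=(8, 8))
ker = rng.normal(size=(3, 3))

# Equivariance: convolving a shifted image equals shifting the convolved image.
shifted = np.roll(img, shift=(2, 3), axis=(0, 1))
lhs = conv2d_circular(shifted, ker)
rhs = np.roll(conv2d_circular(img, ker), shift=(2, 3), axis=(0, 1))
assert np.allclose(lhs, rhs)
```

This built-in equivariance is precisely the domain knowledge (spatial correlations, no privileged location) that theme (A) encodes into the model's architecture.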
Here, domain knowledge dictates various constraints imposed on the data science model (bounds on values or gradients, initial values of parameters, symmetries, interactions between variables, regularization, …). In effect, it belongs to what we call “model selection”.
For example, if the prediction target must take its values in a known range, then a penalization term can be added to the loss function of the machine learning model accordingly. If, in addition, the problem is invariant under some transformation of the input space, then additional training samples can be artificially created to force the model to feature the same invariance.
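The second idea, augmenting the training set with transformed samples, can be sketched in a few lines. Here we assume (hypothetically) a problem known to be invariant under mirror symmetry of the input, so each sample yields a second, mirrored one with the same target:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))     # raw input samples
y = (X ** 2).sum(axis=1)          # a target invariant under x -> -x

# Theory-guided augmentation: append the mirrored inputs with the SAME targets,
# pushing the fitted model toward the known symmetry.
X_aug = np.vstack([X, -X])
y_aug = np.concatenate([y, y])

assert X_aug.shape == (200, 3)
assert np.allclose(y_aug[:100], y_aug[100:])
```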
The outputs of the machine learning model are post-processed (refined) according to theoretical considerations, in order to make more sense of them or to make them physically consistent.
The simplest example is to eliminate a candidate model that produces inconsistent results. Another algorithm or set of parameters can then be tested, with the hope that it meets the consistency requirements. If the inconsistencies are not too severe, the results may instead be “tweaked”, for instance by replacing aberrant values with something more sensible.
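As a minimal sketch of such a refinement step, assume the predicted quantity (say, a concentration) must be non-negative; aberrant negative outputs are replaced by the nearest physically valid value:

```python
import numpy as np

raw_predictions = np.array([0.8, -0.05, 1.2, -0.3, 0.0])

# Enforce the physical lower bound a posteriori; valid values are untouched.
refined = np.clip(raw_predictions, 0.0, None)

assert (refined >= 0).all()
assert refined[0] == 0.8
```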
Theoretical and/or numerical models coexist with machine learning models. The former provide inputs for the latter, and conversely the outputs of machine learning models can fuel traditional models. All those models then form a complex graph, with intermediate results flowing between them until the final output step.
One example is the concept of a surrogate model. A surrogate model copes with the absence of a theoretical model in part of the system. Instead of taking equations out of the blue, the model learns the relationships between the variables involved, and is plugged into a more general theoretical or numerical model. A surrogate model is also useful when the theoretical model is available but too costly to run numerically on every instance.
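A toy sketch of the idea, with a deliberately simple stand-in for the costly simulator (the function and polynomial surrogate are illustrative assumptions, not a recipe from the papers):

```python
import numpy as np

def expensive_simulation(x):
    """Placeholder for a costly physics code; here just a smooth function."""
    return np.sin(x) + 0.1 * x ** 2

# Sample the simulator sparsely, then fit a cheap polynomial surrogate.
x_train = np.linspace(0.0, 3.0, 20)
y_train = expensive_simulation(x_train)
coeffs = np.polyfit(x_train, y_train, deg=5)
surrogate = np.poly1d(coeffs)

# The surrogate can now be called cheaply wherever the simulator would be,
# e.g. inside an optimization loop or a larger numerical model.
x_new = 1.7
assert abs(surrogate(x_new) - expensive_simulation(x_new)) < 1e-2
```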
Here, we don’t necessarily build machine learning models. Rather, data science practices are called in to help the theoretician improve their models.
The two use cases described in [TGDS] for this theme are data assimilation and parameter calibration. They rely on inference and uncertainty modelling to maximize the likelihood of the theoretical model or its parameters.
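As a hypothetical, minimal illustration of parameter calibration: the model form below (an exponential decay) is assumed known from theory, and only its rate is estimated from noisy observations. A real study would use maximum likelihood or Bayesian inference with uncertainty estimates; here a plain grid search on the squared error conveys the idea:

```python
import numpy as np

rng = np.random.default_rng(2)
k_true = 0.7
t = np.linspace(0.0, 5.0, 50)
# Noisy observations of the theoretical model y = exp(-k * t)
observed = np.exp(-k_true * t) + rng.normal(scale=0.01, size=t.size)

# Calibrate k by minimizing the sum of squared errors over a grid.
candidates = np.linspace(0.1, 2.0, 400)
sse = [((np.exp(-k * t) - observed) ** 2).sum() for k in candidates]
k_hat = candidates[int(np.argmin(sse))]

assert abs(k_hat - k_true) < 0.05
```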
As data scientists, we usually don’t plan our work in this way. We are rather used to a data science workflow or pipeline, such as the following for supervised learning:
So let’s review the TGDS practices as ways to introduce domain knowledge at specific points of our data science workflow. For examples of practical knowledge that can be mobilized in such situations, please refer to the [TGDS] paper.
One of the practices of TGDS does not fit in our table, because it is not centered on data science. It is the exploitation of data science results in a physical or numerical model (mentioned in the “Learning hybrid models of theory and data science” research theme). This kind of hybridization will be a particular case of the data/physics coupling approaches, the subject of a future article.
To complete the picture, let’s mention two techniques detailed in [INVARIANCE], inspired by physical considerations. These techniques aim at producing machine learning models that are more accurate yet simpler (and faster) than “naive” ones. The first technique is to train the machine learning model not on raw data, but on input features inspired by the invariants of the mathematical model. The second one is well known in image recognition applications; it consists of augmenting the input data by applying transformations under which the mathematical model is invariant (e.g. by translating or rotating the inputs).
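The first technique can be sketched concretely. Assuming (hypothetically) a target that is rotation-invariant, the radius r = sqrt(x² + y²) makes a better input feature than the raw coordinates, because it is unchanged by any rotation:

```python
import numpy as np

def invariant_feature(xy):
    """Map raw (x, y) coordinates to the rotation-invariant radius."""
    return np.sqrt((xy ** 2).sum(axis=-1))

point = np.array([3.0, 4.0])
theta = 0.9                      # an arbitrary rotation angle
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
rotated = rot @ point

# The engineered feature is identical for the point and its rotated image,
# so a model trained on it is rotation-invariant by construction.
assert np.isclose(invariant_feature(point), 5.0)
assert np.isclose(invariant_feature(rotated), invariant_feature(point))
```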
Projecting the TGDS practices onto the data science workflow was rather easy. It is a good signal in favor of the desired collaboration between data and natural scientists.
We also found that some practices are far from naive: they go well beyond “mundane” data science tasks such as feature engineering and algorithm tuning. The most enlightening item is, arguably, “Regularization and penalization terms (B)”, at the training step. It is a technique whereby physical consistency is embedded inside the machine learning model instead of being used a posteriori to discriminate among candidate models. Concretely, it encompasses several tricks.
Not all training algorithms or frameworks let the data scientist act freely on the loss function, so this practice also has an impact on the “Choice of algorithm and its architecture” step of the workflow. For example, the scikit-learn Python ecosystem does not have a group lasso implementation. XGBoost, on the other hand, lets you plug in your own objective function and metric.
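A custom objective for XGBoost is supplied as a function returning the per-sample gradient and hessian of the loss. The sketch below (the bound `UPPER` and penalty weight `LAM` are illustrative values) augments squared error with a penalty whenever a prediction exceeds a known physical upper bound; it is written in pure numpy so the framework itself is not required:

```python
import numpy as np

UPPER, LAM = 1.0, 10.0

def bounded_squared_error(preds, targets):
    """Gradient and hessian of (p - y)^2 + LAM * max(0, p - UPPER)^2."""
    excess = np.maximum(0.0, preds - UPPER)
    grad = 2.0 * (preds - targets) + 2.0 * LAM * excess
    hess = 2.0 + 2.0 * LAM * (preds > UPPER)
    return grad, hess

preds = np.array([0.5, 1.5])
targets = np.array([0.5, 0.5])
grad, hess = bounded_squared_error(preds, targets)
assert grad[0] == 0.0                          # in range, on target: no pull
assert grad[1] == 2.0 * 1.0 + 2.0 * LAM * 0.5  # error term plus penalty term
```

With XGBoost, a function of this shape (gradients and hessians computed from predictions and labels) is what gets passed as the custom objective when training.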
On the minus side, two important (supervised) machine learning tasks are not mentioned in [TGDS]: the choice of the target, and the choice of the cross-validation scheme and metric.
One might consider the choice of the target obvious in a given scientific context: an observable quantity, a desired outcome such as a yield, a discrete parameter like a regime… all are natural candidates. However, in the presence of a mathematical model, many quantities are linked together; several of them may be observable, and we must choose one. They are not of equal interest once we consider that our mathematical model is imperfect, that observations can be noisy, and that the candidate targets may not all exhibit the same measurement uncertainty or data quality issues.
As regards cross-validation, in a data science context we would expect this item to be addressed explicitly. Like the choice of the target above, we can guess that the question was deemed too obvious by the authors of [TGDS]. As a matter of fact, for concrete physical problems, the RMSE (root mean squared error) is by far the most natural choice for regressions; for classification problems, accuracy feels just as natural. But we must acknowledge that cross-validation plays a role in the ability of the model to generalize (and thus in its consistency), and that the cross-validation scheme and metric can themselves encode domain knowledge.
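As one hypothetical illustration of a knowledge-aware scheme: if samples come from distinct experiments, holding out a whole experiment per fold measures generalization to unseen experimental conditions rather than interpolation within a known one. This is what scikit-learn's `LeaveOneGroupOut` and `GroupKFold` implement; here is a minimal hand-rolled version:

```python
import numpy as np

def leave_one_group_out(groups):
    """Yield (train_idx, test_idx) pairs, holding out one group per fold."""
    groups = np.asarray(groups)
    for g in np.unique(groups):
        yield np.where(groups != g)[0], np.where(groups == g)[0]

# Seven samples coming from three distinct experiments.
groups = np.array([0, 0, 1, 1, 1, 2, 2])
folds = list(leave_one_group_out(groups))

assert len(folds) == 3
for train, test in folds:
    # No experiment leaks between the training and test sides of a fold.
    assert set(groups[train]) & set(groups[test]) == set()
```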
Those two issues, the definition of the target and cross-validation, are not to be overlooked. A data scientist and a physicist working together really need to discuss them as equals, because both concepts have physical as well as methodological implications. Here, it is not just the physicist asking the data scientist to implement their knowledge.
Our discussion so far has focused on approaches for marrying data science with physics. Among the stakes listed in the introduction, the first three (Efficiency, Generalizability, Interpretability) are well served by those approaches. The last one (Common language, tools & platforms) has so far only been treated as a call for interdisciplinary collaboration. Let’s be more specific now.
As already mentioned, data science is pervasive in many scientific fields. It is also well established in industry, where it links physical phenomena and processes. Put differently, data science is an ideal collaboration vehicle for all activities involved in the lifecycle of a product: research, design, process engineering, quality control, supply chain, logistics, customer service, …
Recalling the data science workflow sketched earlier, there are at least 3 steps that produce or consume artifacts that are valuable across the organization:
The above is valid inside an organization. To some extent, it also holds between organizations operating in the same industry, as long as they wish to cooperate. Data science, relying heavily on software tools, is steeped in the culture of open source. Publishing one’s findings as reusable software packages is common nowadays; for an example in materials science, see [PYMKS], which provides functions for calculations on microstructures, for visualization of the results, and more.
There is also a call for sharing data sets widely: data is the fuel for building better machine learning models. It can come in many flavours: measurements, experiments, numerical simulations, … or models themselves. Of course, extensive sharing of data, though highly desirable, is difficult in practice. Intellectual property and the absence of agreed-upon metadata are obvious obstacles. Another is the acknowledgement that a data repository should come with a computation grid, software, and a metadata catalog; building a data platform with many stakeholders is indeed a complex project. [MATERIALS] and [PYMKS] are at least calling for such a platform.
This article is an attempt at summing up the various factors for a fruitful collaboration between data and natural sciences, seen through a data scientist’s eyes. As the authors we have reviewed already point out, organizations and individuals from both worlds have a strong interest in working together. For the benefit of scientific discovery, they will build efficient, physically consistent data science models that fit into a wider scientific workflow. To quote [MLPHYS], “Balancing the computational cost of data generation, ease of model training, and model evaluation time continues to be an important consideration when choosing the appropriate ML method for each application.”
The references below cover these aspects in much greater detail. They also give plenty of concrete examples to illustrate the practicalities of the various approaches outlined. For curious readers, [MLPHYS] also traces the influence of statistical physics on the theoretical foundations of machine learning: several conceptual tools from statistical physics were transposed to mathematics, providing new intuitions and taking us well beyond the mere empirical effectiveness of machine learning.
We have only scratched the surface of how theory and data science can work together. Many aspects of the collaboration remain to be explored in greater detail. For example, we would like to build a systematic methodology for coupling data science with mathematical models (PDEs, simulations). Another topic of high interest is the prominent role played by one particular family of algorithms in the literature we have reviewed, artificial neural networks: how can knowledge be incorporated into their design?
I would like to thank Annabelle Blangero and Rémy Frenoy for their thoughtful review of this article.
[CLASSIF] C.-W. Chang and N. T. Dinh, “Classification of Machine Learning Frameworks for Data-Driven Thermal Fluid Models,” International Journal of Thermal Sciences, vol. 135, pp. 559–579, Jan. 2019.
[INVARIANCE] J. Ling, R. Jones, and J. Templeton, “Machine learning strategies for systems with invariance properties,” Journal of Computational Physics, vol. 318, pp. 22–35, Aug. 2016.
[MATERIALS] J. Hill, G. Mulholland, K. Persson, R. Seshadri, C. Wolverton, and B. Meredig, “Materials science with large-scale data and informatics: Unlocking new opportunities,” MRS Bulletin, vol. 41, no. 05, pp. 399–409, May 2016.
[MLPHYS] G. Carleo et al., “Machine learning and the physical sciences,” arXiv:1903.10563, Mar. 2019.
[PYMKS] D. B. Brough, D. Wheeler, and S. R. Kalidindi, “Materials Knowledge Systems in Python — a Data Science Framework for Accelerated Development of Hierarchical Materials,” Integrating Materials and Manufacturing Innovation, vol. 6, no. 1, pp. 36–53, Mar. 2017.
[TGDS] A. Karpatne et al., “Theory-guided Data Science: A New Paradigm for Scientific Discovery from Data,” IEEE Transactions on Knowledge and Data Engineering, vol. 29, no. 10, pp. 2318–2331, Oct. 2017.