What is Scientific Machine Learning?
Scientific machine learning** (SciML) is an emerging discipline within the data science community. SciML seeks to address domain-specific data challenges and extract insights from scientific data sets through innovative methodological solutions. SciML draws on tools from both machine learning and scientific computing to develop new methods for scalable, domain-aware, robust, reliable, and interpretable learning and data analysis, and will be critical in driving the next wave of data-driven scientific discovery in the physical and engineering sciences.
Like scientific computing, SciML is multidisciplinary and leverages expertise from applied and computational mathematics, computer science, and the physical sciences.
** The term SciML is borrowed from this technical report prepared for the DOE ASCR program.
Why Scientific Machine Learning?
Innovations in machine learning (ML) and “big data” are beginning to drive advances in scientific disciplines such as the Earth sciences (Bergen et al., 2019), but the potential of these techniques for data-driven discovery has yet to be fully realized. One barrier to data-driven discovery is that existing methods often do not meet the needs of scientific users. Application-agnostic algorithms, or those designed for more traditional ML applications such as image or natural language processing, typically cannot be applied directly to scientific data sets and require non-trivial, task-specific modifications. In other cases, the models or outputs do not provide the insights or guarantees required for scientific applications.
Consider the following:
- In many applications only limited or low-quality labels are available, while massive unlabeled (often class imbalanced) data sets are common.
- In discovery-oriented tasks, ground truth is unknown and benchmark data sets are unavailable.
- Scientific data are often high-dimensional, noisy, heterogeneous, low-signal-to-noise, and multiscale.
- Models should respect or incorporate physical laws, constraints, and other scientific domain knowledge.
- Robust methods and an ability to quantify uncertainty are required for scientific rigor.
- Extracting new scientific insights from data requires human-interpretable models or outputs.
Research to advance data-driven discovery in the Earth and physical sciences
A (non-exhaustive) list of research topics in scientific machine learning:
- Big data & small labels. Methods for unsupervised, semi-supervised, positive-unlabeled, active, or weakly supervised learning that account for biases in labeling and make realistic assumptions about the label-generating process.
- Leveraging non-traditional / low-cost data sources, data fusion. Extracting insights from multiple sensors/sources that produce larger quantities of lower-quality (noisy, heterogeneous, unstructured, high-uncertainty) data.
- Robust and reliable learning. Uncertainty quantification, stability analysis, validation, performance metrics, and reproducibility, especially in high-stakes or safety-critical applications.
- Domain-aware and physics-informed learning. Hybrid models that include both data-driven and domain-aware components.
- Enhancing modeling and simulation capabilities with machine learning.
- Novelty detection in large data sets.
- Interpretable/explainable machine learning.
- Algorithms for streaming sensor data.
- Data compression.