Materials Science Has a Reproducibility Problem. Machine Learning Inherits It.
Mining the published literature to build datasets for machine learning has become an increasingly popular strategy for accelerating materials research. Yet a foundational issue remains unresolved: experimental details in many papers are incomplete, and the field continues to struggle with reproducibility. When results cannot be independently recreated, the datasets derived from them carry hidden and unquantified uncertainty. This post examines the structural data gaps underlying the reproducibility problem and analyzes how they constrain the reliability, transferability, and interpretability of literature-driven models.
The Reproducibility Problem in Materials Science
Materials scientists are very good at discovering new things, whether synthesizing a novel structure or pushing device performance to a new record. However, the field is not as effective at verifying whether those results can be reproduced by someone else. This is the crux of the reproducibility problem: without the ability to recreate a synthesis or measurement, we cannot distinguish a genuine breakthrough from a statistical fluke or experimental artifact. If a result cannot be replicated, we cannot reliably build on it to create something functional or useful.
Repeatability and reproducibility form the bedrock of modern science, though they test different things. Repeatability refers to internal consistency: the ability of the same researcher to reach the same results using the same methods and equipment in the same lab. Reproducibility, by contrast, is a test of universal truth. It requires an independent research team to obtain the same results using the same methodology, but in a different environment with different equipment.
In practice, scientists are generally diligent about repeatability. Researchers frequently repeat experiments internally to ensure their findings are robust or to generate additional samples for further testing. However, these internal repeats and their associated statistics, such as means, standard deviations and confidence intervals, are rarely reported in the literature, remaining hidden in lab notebooks.
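To make that gap concrete, here is a minimal sketch (in Python, using invented resistivity values for a hypothetical thin-film sample) of the summary statistics that internal repeats would support but that rarely make it into a paper: a mean, a standard deviation, and an approximate 95% confidence interval.

```python
import math
import statistics

# Hypothetical resistivity readings (Ω·cm) from five internal repeats of the
# same synthesis -- the kind of replicate data that usually stays in the lab notebook.
repeats = [3.41e-3, 3.55e-3, 3.38e-3, 3.62e-3, 3.49e-3]

mean = statistics.mean(repeats)
std = statistics.stdev(repeats)          # sample standard deviation (n - 1)
sem = std / math.sqrt(len(repeats))      # standard error of the mean

# Approximate 95% confidence interval using the t critical value for n = 5 (4 dof).
t_95 = 2.776
ci_low, ci_high = mean - t_95 * sem, mean + t_95 * sem

print(f"mean = {mean:.3e} Ω·cm, std = {std:.2e}")
print(f"95% CI: [{ci_low:.3e}, {ci_high:.3e}] Ω·cm")
```

Reporting even this much, a point estimate together with its spread and sample size, would let downstream dataset builders weight literature entries by their uncertainty instead of treating every value as equally trustworthy.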
Reproducibility, on the other hand, remains largely unexplored territory. When an external group attempts to reproduce a published result, they may need to re-optimize the entire process to account for the subtle, local variables of their own laboratory environment and essentially create a new procedure. This re-optimization can take weeks or months and may fail entirely. When it fails, the original discovery is effectively lost to the community. Even when re-optimization succeeds, the modified procedures are typically not reported, leaving subsequent groups to encounter the same obstacles.
The Mechanism of the Crisis: Complexity in Practice
The challenge of reproducibility in experimental materials science is rarely due to poor intent or carelessness. Rather, it reflects the inherent complexity of the field. Materials synthesis and processing are sensitive to variables that standard reporting formats struggle to capture. In practice, information loss can occur through several channels:
System and device architecture (‘black box’ effects): For technologically relevant systems such as semiconductor processes, energy devices, or catalytic cells, geometry and scale are not background details but primary variables. This is because the detailed architecture of a deposition reactor or the specific positioning of electrodes in an electrochemical cell dictates the mass and energy transfer within the system. These factors control the local environment where the actual chemistry occurs and, ultimately, the final outcome of the experiment. When these physical dimensions are omitted from a report, the experiment essentially takes place inside a ‘black box’, leaving other researchers with no roadmap to follow.
Instrument calibration and insufficient measurement coverage: Scientific accuracy is only as reliable as the instruments generating the data. Instruments may be improperly calibrated or experience unnoticed drift over time. A furnace set to 1050 °C may in reality operate at 1020 °C. A single temperature reading may also fail to capture spatial gradients across the system. While some processes tolerate minor deviations, others are highly sensitive, and small inaccuracies can compromise the resulting dataset; a short sketch after this list estimates how much a 30 °C drift can shift a thermally activated process.
Latent variables: In many cases, the variables that disrupt a process are unrecognized. These may include trace impurities in reagents, ambient humidity, or residue remaining on reactor walls from previous experiments. Because these factors are not monitored or reported, they introduce hidden conditions that can cause a process to succeed in one environment and fail in another.
Under-reported procedures and human variation: Experimental sections in academic papers are often brief. A report may state that a solution was stirred without specifying stirring speed, stir bar geometry, or flask volume. Temperature ramp rates, drying conditions, and ambient environmental conditions are frequently omitted. Negative or unsuccessful experiments are almost never reported, leaving the community blind to the conditions under which a process fails. In addition, operator-dependent factors, such as how a doctor-blading step is performed or how a sample is generally handled, introduce variation that is difficult to standardize but can strongly influence material quality.
Legacy instruments: Older literature often relies on legacy instruments with lower sensitivity or more manual measurement protocols. Replicating such data using modern standardized equipment can be difficult, particularly if methodological details were not fully documented.
Increasing material complexity: As materials science advances toward smaller scales and more complex architectures, sensitivity to process and environmental variables increases. Decades ago, a millimeter-scale material may have been insensitive to parts-per-million impurities; today, a nanomaterial with high surface area can be strongly affected by the same level of trace contaminants. As complexity increases, tolerance to both known and latent variations decreases.
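As a rough illustration of the calibration point above, consider a thermally activated process with Arrhenius-type kinetics. The sketch below (Python, with an activation energy of 150 kJ/mol assumed purely for illustration) estimates how far the rate shifts when a furnace reported at 1050 °C actually runs at 1020 °C.

```python
import math

R = 8.314          # gas constant, J/(mol·K)
E_A = 150_000      # assumed activation energy, J/mol (illustrative only)

def arrhenius_rate(temp_c: float, prefactor: float = 1.0) -> float:
    """Relative Arrhenius rate k = A * exp(-Ea / RT) at a temperature given in °C."""
    temp_k = temp_c + 273.15
    return prefactor * math.exp(-E_A / (R * temp_k))

k_nominal = arrhenius_rate(1050.0)   # what the paper reports
k_actual = arrhenius_rate(1020.0)    # what the drifted furnace delivers

print(f"rate ratio (nominal / actual) = {k_nominal / k_actual:.2f}")
# ≈ 1.37: for this hypothetical process, a 30 °C calibration offset changes the
# rate by roughly 37%, even though both datasets would list the same nominal setpoint.
```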
Field-Specific Examples: Where Reproducibility Breaks Down
The reproducibility problem is not uniform across materials science. Each subfield faces distinct challenges.
Thin Film Deposition: Processes such as atomic layer deposition (ALD) and chemical vapor deposition (CVD) are highly sensitive to reactor geometry and dimensions, details often omitted from publications. Moreover, reported parameters such as temperature, pressure, and flow rate may be affected by calibration errors or insufficient spatial measurements. Latent variables, including parts-per-million impurities on reactor walls or in precursors, further complicate replication.
(Further reading: Frustratingly complex: Why is growing electronic grade 2D materials so difficult? and How to Calculate Gas Flow Velocities in CVD/ALD and Why Pressure Accuracy Is Key)
Energy devices: A coin cell that performs well in one laboratory may not deliver the same performance elsewhere or at industrial scale. Lithium-ion battery metrics can vary significantly between laboratories due to sensitivity to impurities, subtle differences in assembly, and scaling effects in cell architecture. Reports may also omit measurement details such as internal resistance compensation or electrode area calculations, which directly affect the reported performance metrics. Solid-state batteries add further mechanical complexity, as performance is sensitive to applied pressure and interfacial contact during fabrication and operation; small fabrication differences can produce substantial discrepancies in electrochemical behavior.
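To see why those omitted measurement details matter, consider two conventions that are often left implicit: whether a reported potential has been corrected for uncompensated resistance (iR drop), and which electrode area is used to normalize current. The sketch below uses invented numbers for a hypothetical cell to show how large the resulting spread can be.

```python
# Hypothetical electrode measurement; all numbers are illustrative.
current_a = 0.010          # measured current, A
r_uncomp_ohm = 6.0         # uncompensated series resistance, Ω
e_measured_v = 1.62        # measured potential, V

# iR-corrected potential: E_corr = E_meas - I * R_u
e_corrected_v = e_measured_v - current_a * r_uncomp_ohm
print(f"measured: {e_measured_v:.2f} V, iR-corrected: {e_corrected_v:.2f} V")

# The reported current density depends entirely on which area is used.
geometric_area_cm2 = 1.0          # projected electrode footprint
electroactive_area_cm2 = 3.5      # e.g. estimated from capacitance (assumed value)

j_geometric = current_a / geometric_area_cm2 * 1_000      # mA/cm²
j_active = current_a / electroactive_area_cm2 * 1_000     # mA/cm²
print(f"current density: {j_geometric:.1f} vs {j_active:.1f} mA/cm²")
# Same raw measurement, two defensible numbers. Unless the paper states which
# convention was used, the two values cannot be compared or pooled reliably.
```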
Catalytic systems: Reproducibility in electrocatalysis is complicated by numerous latent parameters. Performance can depend on humidity during electrode drying, stirring rates during electrode synthesis, or trace impurities such as iron contamination in the potassium hydroxide electrolyte. Metal leaching from counter electrodes can also introduce unintended activity. In some cases, the synthesized material functions as a pre-catalyst that undergoes structural evolution in situ to form the catalytically active phase, making performance highly sensitive to cell configuration and operating history.
Metal-organic frameworks (MOFs): MOFs are highly porous materials constructed from metal ions or clusters bridged by organic linkers to form extended networks or cages. Often described as molecular sponges, they are widely studied and utilized for gas storage and separation. Despite the extensive literature, a large fraction of widely cited MOFs lack any reported repeat synthesis. A meta-analysis by Park et al. of CO2 adsorption data across different MOFs found that a significant fraction of measurements were statistically inconsistent outliers. Without standardized activation and handling protocols, performance data for these materials can be unreliable.
From Reproducibility Failure to Dataset Uncertainty: Consequences for Machine Learning
The pivot to machine learning (ML) in materials science is difficult for the same reasons that reproducibility fails. If the published literature lacks the information necessary for a human to recreate an experiment, then pattern matching across those experiments becomes physically meaningless for an ML model. While ML is adept at finding trends in noisy data, it cannot fill the void left by missing causal variables. Without comprehensive documentation, we have no way to distinguish a genuine physical signal from an experimental artifact or a statistical outlier.1
The lack of statistical depth, specifically the absence of reported means and error bars, further complicates this transition. When comparing datasets for performance metrics or properties, we rarely have access to the detailed measurement protocols required to ensure true equivalence, i.e., an ‘apples-to-apples’ comparison. We are effectively blind to potential calibration errors, instrument drift, or the latent variables that may have influenced the original results.
Finally, there is the critical issue of missing information on negative outcomes. Because failed or unsuccessful experiments are rarely reported, ML models are trained on a skewed representation of the experimental space. Without failure data, the models cannot learn the boundaries of a process, making it nearly impossible to predict when or why a new experiment might fail in the real world.
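As a toy illustration of how a missing causal variable skews a literature-trained model, the sketch below simulates two labs whose outcomes depend on an unreported impurity level. A simple linear fit on Lab A's records looks accurate but transfers poorly to Lab B, because the model has attributed the impurity's effect to the temperature it can see. All values are synthetic.

```python
import random

random.seed(0)

def run_experiment(temp_c: float, impurity_ppm: float) -> float:
    """Synthetic 'yield': depends on temperature AND an unreported impurity level."""
    return 0.05 * temp_c - 0.08 * impurity_ppm + random.gauss(0, 0.5)

# Lab A works with ~2 ppm impurity, Lab B with ~40 ppm -- neither reports it.
lab_a = [(t, run_experiment(t, 2.0)) for t in range(100, 200, 5)]
lab_b = [(t, run_experiment(t, 40.0)) for t in range(100, 200, 5)]

# Least-squares fit of yield vs. temperature using only Lab A's (reported) data.
n = len(lab_a)
mean_t = sum(t for t, _ in lab_a) / n
mean_y = sum(y for _, y in lab_a) / n
slope = sum((t - mean_t) * (y - mean_y) for t, y in lab_a) / sum((t - mean_t) ** 2 for t, _ in lab_a)
intercept = mean_y - slope * mean_t

def rmse(data):
    return (sum((y - (slope * t + intercept)) ** 2 for t, y in data) / len(data)) ** 0.5

print(f"RMSE on Lab A: {rmse(lab_a):.2f}")   # small: the fit looks trustworthy
print(f"RMSE on Lab B: {rmse(lab_b):.2f}")   # large: the hidden impurity shifted every outcome
```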
Conclusion: The Implications for Materials AI
In experimental materials science, much of the data-generating process is only partially reported. Consequently, irreproducibility becomes a structural feature, and the resulting uncertainty propagates into the datasets themselves. Machine learning cannot eliminate that uncertainty; it inherits and trains on it.
The implications are clear: as we move toward a data-driven era, the utility of our models will be limited not by algorithmic complexity, but by the transparency of our reporting. To make materials AI truly reliable, the field must shift toward standardized metadata reporting, ensuring that calibration logs, reactor geometries, and negative results are no longer hidden in lab notebooks but are instead part of a robust, verifiable record.
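What standardized metadata reporting could look like in practice is still an open design question; the sketch below is one hypothetical record layout (a Python dataclass with invented field names, not a proposed standard) capturing the kinds of information this post argues should travel with every published result.

```python
from dataclasses import dataclass, field

# Hypothetical schema -- field names and structure are illustrative only.
@dataclass
class ExperimentRecord:
    sample_id: str
    nominal_conditions: dict          # e.g. {"temperature_c": 250, "pressure_pa": 130}
    measured_conditions: dict         # independently logged values, where available
    instrument_calibration: dict      # last calibration date, reference standard, drift notes
    reactor_geometry: dict            # chamber volume, electrode spacing, substrate position, ...
    n_internal_repeats: int           # how many times the result was repeated in-house
    result_mean: float
    result_std: float
    outcome: str                      # "success" or "failure" -- negative results included
    notes: str = ""
    raw_data_files: list = field(default_factory=list)

record = ExperimentRecord(
    sample_id="ALD-2024-017",
    nominal_conditions={"temperature_c": 250, "pressure_pa": 130},
    measured_conditions={"temperature_c": 242},
    instrument_calibration={"thermocouple_last_calibrated": "2024-03-02"},
    reactor_geometry={"chamber_volume_l": 2.4, "substrate_to_inlet_mm": 85},
    n_internal_repeats=3,
    result_mean=1.12,   # e.g. growth per cycle, Å
    result_std=0.04,
    outcome="success",
)
```

Even a lightweight record along these lines, attached as supplementary data, would let dataset builders filter on calibration status or repeat count instead of treating every literature entry identically.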
TL;DR
Reproducibility and ML: If a human can’t reproduce an experiment due to missing information, an ML model cannot reliably extract transferable patterns from it.
Signal vs. noise: Without better documentation, we cannot determine whether a data point reflects a breakthrough, an outlier, or a laboratory artifact.
The metadata gap: Means, error bars, and calibration histories are often missing, preventing true ‘apples-to-apples’ comparisons between datasets.
Selection bias: The absence of negative outcomes skews datasets and limits a model’s ability to learn the true limits of a material or process.
The path forward: The utility of materials AI trained on literature datasets is currently capped not by algorithmic sophistication, but by the transparency of our reporting. Moving forward, the goal cannot just be more data; it must be more comprehensive, complete and reproducible data.
In a follow-up post The Missing Piece of the Reproducibility Puzzle: The Local Chemical Environment I examine what must actually be recreated for experiments to be reproducible, and why nominal settings alone are typically insufficient.
References
@Reproducibility in Science: A Metrology Perspective
@Responding to the growing issue of research reproducibility
Thin Films:
@Consistency and reproducibility in atomic layer deposition
@Research on scalable graphene faces a reproducibility gap
@An industry view on two-dimensional materials in electronics
Energy Devices:
@From small batteries to big claims
@Benchmarking the reproducibility of all-solid-state battery cell performance
@Error, reproducibility and uncertainty in experiments for electrochemical energy technologies
Catalysis:
@To Err is Human; To Reproduce Takes Time
@Reproducibility in Electrocatalysis
@Toward Benchmarking in Catalysis Science: Best Practices, Challenges, and Opportunities
MOFs:
@Does Chemical Engineering Research Have a Reproducibility Problem?
@How Reproducible Are Isotherm Measurements in Metal–Organic Frameworks?
1. While ML models can be robust to stochastic (random) noise, the ‘noise’ we are discussing here for these materials science experiments is what is systematically missing. When an important cause, such as reactor geometry or trace impurities, is omitted from the feature set, it becomes a latent variable. The model does not average out this gap; instead, it attributes the resulting variance to the features it can observe, leading to spurious correlations that fail to generalize across laboratory environments.

