What You'll Learn
By the end of this tutorial, you will be able to:
Data Mastery
Access materials databases (Materials Project, AFLOW), clean and prepare data, and create proper train/test splits for materials ML.
Featurization
Transform crystal structures and compositions into numerical descriptors using matminer's 70+ featurizers.
Model Building
Train, tune, and compare ML models from simple linear regression to powerful ensemble methods like XGBoost.
Evaluation
Properly evaluate model performance with cross-validation, learning curves, and appropriate metrics.
Explainability
Interpret your models using SHAP values to understand which features drive predictions.
Real Projects
Apply everything to predict band gaps, formation energies, and other material properties.
Prerequisites
What you should know before starting
New to Python?
Check out Tutorial 02: Python for Materials Science in this series first!
Course Modules
7 interactive notebooks from fundamentals to real projects
ML Fundamentals
Beginner
Build a solid foundation in machine learning concepts. Understand supervised vs unsupervised learning, the bias-variance tradeoff, and why ML is transforming materials science.
Data Foundation
Beginner
Learn to access the Materials Project API, query material properties, clean messy data, and create proper train/validation/test splits that avoid data leakage.
Featurization Basics
Intermediate
Transform chemical compositions and crystal structures into numerical feature vectors using matminer. Learn composition-based, structure-based, and site-based descriptors.
Classical ML Models
Intermediate
Train and compare various ML algorithms: Linear Regression, Ridge/Lasso, Decision Trees, Random Forest, Gradient Boosting, and XGBoost. Understand when to use each.
Model Evaluation
Intermediate
Master model evaluation with proper metrics (MAE, RMSE, R²), learning curves, hyperparameter tuning with Optuna, and nested cross-validation for unbiased estimates.
Explainable AI
Advanced
Open the black box! Use SHAP values to understand feature importance, create interpretable visualizations, and gain physical insights from your ML models.
Project: Band Gap Prediction
Advanced
Capstone project! Build an end-to-end ML pipeline to predict semiconductor band gaps. Apply everything you've learned to a real-world materials discovery challenge.
Key Concepts
Essential ML concepts explained for materials scientists
Supervised vs Unsupervised Learning
Supervised Learning
Learn from labeled data to predict outcomes.
- Regression: Predict continuous values (band gap, formation energy)
- Classification: Predict categories (metal/insulator, stable/unstable)
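# Example: Predict band gap from composition
from sklearn.ensemble import RandomForestRegressor
X = features # composition descriptors, shape (n_samples, n_features)
y = band_gaps # known band gaps in eV, shape (n_samples,)
model = RandomForestRegressor(random_state=0)
model.fit(X, y)
# Predict from descriptors of unseen compositions, not from raw formulas
prediction = model.predict(new_features)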
Unsupervised Learning
Find patterns in data without labels.
- Clustering: Group similar materials together
- Dimensionality Reduction: Visualize high-D materials space
# Example: Cluster materials by similarity
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0) # fixed seed for reproducible clusters
clusters = kmeans.fit_predict(features)
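The dimensionality-reduction bullet above can be sketched in the same way. This is a minimal example, assuming scikit-learn and NumPy are installed; the random `features` matrix is a placeholder standing in for real descriptors, and the choice of PCA with two components is an illustrative assumption rather than the method used in the course notebooks.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Placeholder descriptor matrix: 100 "materials" x 8 features (random numbers)
rng = np.random.default_rng(0)
features = rng.normal(size=(100, 8))

# Put every feature on the same scale so no single descriptor dominates
scaled = StandardScaler().fit_transform(features)

# Project onto the two directions of largest variance
pca = PCA(n_components=2)
coords = pca.fit_transform(scaled)

# Fraction of total variance captured by each component
explained = pca.explained_variance_ratio_
print(coords.shape, explained)
```

The two columns of `coords` can be used directly as x and y values in a scatter plot, optionally colored by the cluster labels from KMeans.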
The Overfitting Problem
Overfitting occurs when your model memorizes the training data instead of learning generalizable patterns. It performs great on training data but fails on new, unseen materials.
Warning Signs of Overfitting
Training R² = 0.99, Test R² = 0.50 → Your model memorized the data!
Solutions:
- More data: Larger datasets reduce overfitting risk
- Simpler models: Use regularization (Ridge, Lasso)
- Cross-validation: Properly evaluate on held-out data
- Feature selection: Remove irrelevant features
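The warning sign and two of the fixes listed above can be reproduced on synthetic data. The sketch below is not taken from the course notebooks: it assumes scikit-learn and NumPy, uses made-up data in which only two of ten features matter, and picks an unpruned decision tree and a Ridge model purely for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: the target depends on 2 features, the other 8 are irrelevant
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# An unpruned tree can fit every training point, noise included
deep_tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
tree_train_r2 = deep_tree.score(X_train, y_train)
tree_test_r2 = deep_tree.score(X_test, y_test)

# A regularized linear model is simpler and cannot memorize the noise
ridge = Ridge(alpha=1.0).fit(X_train, y_train)
ridge_train_r2 = ridge.score(X_train, y_train)
ridge_test_r2 = ridge.score(X_test, y_test)

# 5-fold cross-validation on the training set estimates held-out performance
cv_scores = cross_val_score(Ridge(alpha=1.0), X_train, y_train, cv=5, scoring="r2")

print(f"Tree : train R2 = {tree_train_r2:.2f}, test R2 = {tree_test_r2:.2f}")
print(f"Ridge: train R2 = {ridge_train_r2:.2f}, test R2 = {ridge_test_r2:.2f}")
print(f"Ridge 5-fold CV R2 = {cv_scores.mean():.2f} +/- {cv_scores.std():.2f}")
```

A large gap between the training and test scores of the tree is the memorization pattern described above; a small gap for Ridge, together with a cross-validation score close to its test score, suggests the simpler model generalizes.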
Materials Featurization
ML models need numerical inputs. Featurization converts materials (compositions, structures) into feature vectors that capture relevant physics and chemistry.
| Feature Type | Input | Examples | Use Case |
|---|---|---|---|
| Composition | Chemical formula | Avg. electronegativity, atomic radius | Quick screening |
| Structure | Crystal structure | Coordination numbers, bond lengths | Accurate predictions |
| Electronic | DOS, band structure | Band center, width, occupation | Electronic properties |
from matminer.featurizers.composition import ElementProperty
from pymatgen.core import Composition
# Create featurizer
ep = ElementProperty.from_preset("magpie")
# Featurize a composition
comp = Composition("Fe2O3")
features = ep.featurize(comp)
# Returns a list of 132 numbers (22 elemental properties x 6 statistics)
# ep.feature_labels() gives the matching feature names
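To make the idea of a composition feature concrete, here is a hand-rolled sketch that needs only the standard library. It computes a few electronegativity statistics of the kind listed in the table. The helper names (`parse_formula`, `electronegativity_features`) are hypothetical, the lookup table is a small hand-entered subset of Pauling values, and the parser handles only simple formulas without parentheses; matminer does all of this far more completely.

```python
import re

# Pauling electronegativities for a few elements (hand-entered subset)
ELECTRONEGATIVITY = {"Fe": 1.83, "O": 3.44, "Ti": 1.54, "Si": 1.90}


def parse_formula(formula):
    """Parse a simple formula such as 'Fe2O3' into {'Fe': 2.0, 'O': 3.0}."""
    counts = {}
    for symbol, amount in re.findall(r"([A-Z][a-z]?)(\d*\.?\d*)", formula):
        counts[symbol] = counts.get(symbol, 0.0) + (float(amount) if amount else 1.0)
    return counts


def electronegativity_features(formula):
    """Return composition-weighted mean, min, max and range of electronegativity."""
    counts = parse_formula(formula)
    total = sum(counts.values())
    values = [ELECTRONEGATIVITY[element] for element in counts]
    fractions = [amount / total for amount in counts.values()]
    mean = sum(f * v for f, v in zip(fractions, values))
    return {
        "mean": mean,
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
    }


fe2o3_features = electronegativity_features("Fe2O3")
print(fe2o3_features)
```

For Fe2O3 the weighted mean is 0.4 × 1.83 + 0.6 × 3.44 = 2.796 and the range is 3.44 − 1.83 = 1.61. A featurizer preset repeats this kind of calculation over many elemental properties and statistics to build the full feature vector.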
SHAP: Understanding Your Model
SHAP (SHapley Additive exPlanations) values explain how each feature contributes to a prediction. Based on game theory, SHAP fairly distributes the "credit" among features.
Why SHAP Matters for Materials Science
SHAP helps you gain physical insights: "This material has a high band gap because of its large electronegativity difference."
import shap
# Create explainer for your trained model
explainer = shap.Explainer(model)
shap_values = explainer(X_test)
# Summary plot: which features matter most?
shap.summary_plot(shap_values, X_test)
# Waterfall: explain a single prediction
shap.waterfall_plot(shap_values[0])
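The game-theory idea behind SHAP can be illustrated without the library. The sketch below computes exact Shapley values for a toy two-feature model by averaging each feature's marginal contribution over all feature orderings. Everything in it is an assumption made for illustration: `toy_model` and its coefficients are invented, the descriptor values are made up, and a single reference point stands in for the background dataset that SHAP normally averages over.

```python
from itertools import permutations


def shapley_values(predict, x, baseline):
    """Exact Shapley values relative to a single baseline point.

    Features are switched from their baseline value to their actual value
    one at a time; each feature's contribution is averaged over all orderings.
    """
    n_features = len(x)
    contributions = [0.0] * n_features
    orderings = list(permutations(range(n_features)))
    for order in orderings:
        current = list(baseline)
        previous = predict(current)
        for i in order:
            current[i] = x[i]
            value = predict(current)
            contributions[i] += value - previous
            previous = value
    return [c / len(orderings) for c in contributions]


def toy_model(features):
    """Hypothetical band gap model with an interaction term (made-up coefficients)."""
    en_diff, mean_radius = features
    return 1.5 * en_diff - 0.5 * mean_radius + 0.4 * en_diff * mean_radius


x = [2.0, 1.0]         # material to explain (made-up descriptor values)
baseline = [1.0, 1.5]  # reference point (made-up)
phi = shapley_values(toy_model, x, baseline)

print("Shapley values:", phi)
print("Sum:", sum(phi), "Prediction - baseline:", toy_model(x) - toy_model(baseline))
```

The contributions add up to the difference between the prediction and the baseline prediction, which is the additivity property that a waterfall plot visualizes. Enumerating all orderings scales factorially with the number of features, so this brute-force approach is only practical for toy cases.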
Tools & Libraries
Your ML for materials toolkit
Quick Start
Get up and running in minutes
# Clone the repository
git clone https://github.com/NabKh/ML-for-Materials-Science.git
cd ML-for-Materials-Science
# Create conda environment
conda env create -f environment.yml
conda activate ml-materials
# Verify installation
python -c 'import matminer, shap; print("Ready!")'
# Launch Jupyter
jupyter lab
Materials Project API Key
You'll need a free API key from materialsproject.org/api to access the database.
Code Preview
A taste of what you'll build
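# Complete ML pipeline for band gap prediction
import pandas as pd
from mp_api.client import MPRester
from matminer.featurizers.composition import ElementProperty
from matminer.featurizers.conversions import StrToComposition
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
import shap
# 1. Get data from Materials Project
with MPRester("your_api_key") as mpr:
    docs = mpr.materials.summary.search(
        band_gap=(0.1, 5.0),
        fields=["material_id", "formula_pretty", "band_gap"],
    )
# 2. Create DataFrame and featurize
# Pull the requested fields out of the returned documents
df = pd.DataFrame(
    {
        "material_id": [str(d.material_id) for d in docs],
        "formula_pretty": [d.formula_pretty for d in docs],
        "band_gap": [d.band_gap for d in docs],
    }
)
# ElementProperty expects pymatgen Composition objects, so convert the formulas first
df = StrToComposition().featurize_dataframe(df, "formula_pretty")
ep = ElementProperty.from_preset("magpie")
df = ep.featurize_dataframe(df, "composition", ignore_errors=True)
df = df.dropna(subset=ep.feature_labels())
# 3. Train model
X = df[ep.feature_labels()] # numeric Magpie features only (no IDs or formulas)
y = df["band_gap"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)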
# 4. Evaluate
print(f"Test R²: {model.score(X_test, y_test):.3f}")
# 5. Explain with SHAP
explainer = shap.Explainer(model)
shap_values = explainer(X_test)
shap.summary_plot(shap_values, X_test)
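One caveat about the random split in step 3: a database can contain several entries with the same formula (polymorphs), and composition features are identical for all of them, so a random split can put near-duplicates on both sides and inflate the test score. The sketch below shows one way to keep each formula entirely in train or entirely in test, using only the standard library. The function `grouped_train_test_split` is a hypothetical helper, and the records use placeholder IDs rather than real Materials Project entries.

```python
import random
from collections import defaultdict


def grouped_train_test_split(records, group_key, test_fraction=0.2, seed=0):
    """Split records so that all records sharing group_key land on the same side."""
    groups = defaultdict(list)
    for record in records:
        groups[record[group_key]].append(record)

    # Shuffle the group keys, not the individual records
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)
    n_test = max(1, round(test_fraction * len(keys)))
    test_keys = set(keys[:n_test])

    train = [r for k in keys if k not in test_keys for r in groups[k]]
    test = [r for k in keys if k in test_keys for r in groups[k]]
    return train, test


# Placeholder records: two formulas appear twice, as polymorphs would
records = [
    {"material_id": "id-1", "formula_pretty": "TiO2"},
    {"material_id": "id-2", "formula_pretty": "TiO2"},
    {"material_id": "id-3", "formula_pretty": "SiO2"},
    {"material_id": "id-4", "formula_pretty": "SiO2"},
    {"material_id": "id-5", "formula_pretty": "Fe2O3"},
    {"material_id": "id-6", "formula_pretty": "ZnO"},
]

train, test = grouped_train_test_split(records, "formula_pretty", test_fraction=0.25)
train_formulas = {r["formula_pretty"] for r in train}
test_formulas = {r["formula_pretty"] for r in test}
print("Train formulas:", sorted(train_formulas))
print("Test formulas:", sorted(test_formulas))
```

scikit-learn provides the same idea through `GroupShuffleSplit` and `GroupKFold`, which take a `groups` array alongside `X` and `y`.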
Additional Resources
Dive deeper into ML for materials
ACS Best Practices Guide
Comprehensive guide to ML for materials scientists with best practices.
matminer Documentation
Full documentation for 70+ materials featurizers.
Materials Project Workshop
Official tutorials from the Materials Project team.
SHAP Documentation
Learn to interpret any ML model with SHAP values.
What's Next?
Continue your ML journey
Tutorial 08: Neural Network Potentials
Learn to use graph neural networks (M3GNet, CHGNet) for molecular dynamics and property prediction.
Coming Soon
Tutorial 09: Advanced Features
Master SOAP/MBTR descriptors, active learning, and Bayesian optimization for materials discovery.
Coming Soon