What You'll Learn

By the end of this tutorial, you will be able to:

Data Mastery

Access materials databases (Materials Project, AFLOW), clean and prepare data, and create proper train/test splits for materials ML.

Featurization

Transform crystal structures and compositions into numerical descriptors using matminer's 70+ featurizers.

Model Building

Train, tune, and compare ML models from simple linear regression to powerful ensemble methods like XGBoost.

Evaluation

Properly evaluate model performance with cross-validation, learning curves, and appropriate metrics.

Explainability

Interpret your models using SHAP values to understand which features drive predictions.

Real Projects

Apply everything to predict band gaps, formation energies, and other material properties.

Prerequisites

What you should know before starting

Python basics (variables, functions, loops)
NumPy arrays and operations
Pandas DataFrames
Basic plotting with Matplotlib
Chemistry: atoms, bonds, crystals
Basic statistics (mean, std, distributions)

New to Python?

Check out Tutorial 02: Python for Materials Science in this series first!

Course Modules

7 interactive notebooks from fundamentals to real projects

01 · ML Fundamentals (Beginner)

Build a solid foundation in machine learning concepts. Understand supervised vs unsupervised learning, the bias-variance tradeoff, and why ML is transforming materials science.

Topics: Supervised Learning, Unsupervised Learning, Overfitting, Cross-Validation

02 · Data Foundation (Beginner)

Learn to access the Materials Project API, query material properties, clean messy data, and create proper train/validation/test splits that avoid data leakage.

Topics: Materials Project API, Data Cleaning, Train/Test Split, Data Leakage

03 · Featurization Basics (Intermediate)

Transform chemical compositions and crystal structures into numerical feature vectors using matminer. Learn composition-based, structure-based, and site-based descriptors.

Topics: matminer, Composition Features, Structure Features, Feature Selection

04 · Classical ML Models (Intermediate)

Train and compare various ML algorithms: Linear Regression, Ridge/Lasso, Decision Trees, Random Forest, Gradient Boosting, and XGBoost. Understand when to use each.

Topics: Linear Models, Decision Trees, Random Forest, XGBoost

05 · Model Evaluation (Intermediate)

Master model evaluation with proper metrics (MAE, RMSE, R²), learning curves, hyperparameter tuning with Optuna, and nested cross-validation for unbiased estimates.

Topics: Metrics, Learning Curves, Hyperparameter Tuning, Cross-Validation

06 · Explainable AI (Advanced)

Open the black box! Use SHAP values to understand feature importance, create interpretable visualizations, and gain physical insights from your ML models.

Topics: SHAP Values, Feature Importance, Partial Dependence, Model Interpretation

07 · Project: Band Gap Prediction (Advanced)

Capstone project! Build an end-to-end ML pipeline to predict semiconductor band gaps. Apply everything you've learned to a real-world materials discovery challenge.

Topics: Full Pipeline, Band Gap, Model Comparison, Physical Insights

Key Concepts

Essential ML concepts explained for materials scientists

Supervised vs Unsupervised Learning

Supervised Learning

Learn from labeled data to predict outcomes.

  • Regression: Predict continuous values (band gap, formation energy)
  • Classification: Predict categories (metal/insulator, stable/unstable)
```python
# Example: predict band gap from composition
# (model can be any scikit-learn regressor)
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()
X = features   # composition descriptors (n_samples × n_features)
y = band_gaps  # known target values
model.fit(X, y)
prediction = model.predict(new_composition_features)
```

Unsupervised Learning

Find patterns in data without labels.

  • Clustering: Group similar materials together
  • Dimensionality Reduction: Visualize high-D materials space
```python
# Example: cluster materials by similarity
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
clusters = kmeans.fit_predict(features)  # one cluster label per material
```
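
The other unsupervised technique mentioned above, dimensionality reduction, follows the same fit/transform pattern. A minimal sketch using PCA on synthetic stand-in features (the array shapes here are illustrative, not from the tutorial data):

```python
# Example: project high-dimensional material features down to 2D for plotting
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
features = rng.normal(size=(100, 20))  # stand-in for 100 featurized materials

pca = PCA(n_components=2)
coords = pca.fit_transform(features)   # 2D coordinates, shape (100, 2)
```

The resulting 2D coordinates can be passed straight to a Matplotlib scatter plot to visualize the materials space.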

The Overfitting Problem

Overfitting occurs when your model memorizes the training data instead of learning generalizable patterns. It performs great on training data but fails on new, unseen materials.

Warning Signs of Overfitting

Training R² = 0.99, Test R² = 0.50 → Your model memorized the data!

Solutions:

  • More data: Larger datasets reduce overfitting risk
  • Simpler models: Use regularization (Ridge, Lasso)
  • Cross-validation: Properly evaluate on held-out data
  • Feature selection: Remove irrelevant features
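
Two of these remedies, regularization and cross-validation, can be sketched in a few lines. This is a minimal example on synthetic data (the arrays and the alpha value are illustrative only):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic data: y depends on the first feature plus a little noise
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 10))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

# Ridge shrinks coefficients toward zero, limiting overfitting
model = Ridge(alpha=1.0)

# 5-fold cross-validation scores the model on held-out folds only
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print(f"CV R²: {scores.mean():.3f} ± {scores.std():.3f}")
```

If the cross-validated R² is far below the training R², that gap is exactly the warning sign described above.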

Materials Featurization

ML models need numerical inputs. Featurization converts materials (compositions, structures) into feature vectors that capture relevant physics and chemistry.

| Feature Type | Input | Examples | Use Case |
|---|---|---|---|
| Composition | Chemical formula | Avg. electronegativity, atomic radius | Quick screening |
| Structure | Crystal structure | Coordination numbers, bond lengths | Accurate predictions |
| Electronic | DOS, band structure | Band center, width, occupation | Electronic properties |
```python
from matminer.featurizers.composition import ElementProperty
from pymatgen.core import Composition

# Create a featurizer from the Magpie elemental-property preset
ep = ElementProperty.from_preset("magpie")

# Featurize a composition
comp = Composition("Fe2O3")
features = ep.featurize(comp)  # returns ~130 numerical features
```

SHAP: Understanding Your Model

SHAP (SHapley Additive exPlanations) values explain how each feature contributes to a prediction. Based on game theory, SHAP fairly distributes the "credit" among features.

Why SHAP Matters for Materials Science

SHAP helps you gain physical insights: "This material has a high band gap because of its large electronegativity difference."

```python
import shap

# Create an explainer for your trained model
explainer = shap.Explainer(model)
shap_values = explainer(X_test)

# Summary plot: which features matter most?
shap.summary_plot(shap_values, X_test)

# Waterfall plot: explain a single prediction
shap.plots.waterfall(shap_values[0])
```

Tools & Libraries

Your ML for materials toolkit

  • scikit-learn: ML algorithms & preprocessing
  • matminer: 70+ materials featurizers
  • pymatgen: Materials analysis & MP API
  • XGBoost: Gradient boosting champion
  • SHAP: Model explainability
  • Optuna: Hyperparameter optimization

Quick Start

Get up and running in minutes

```bash
# Clone the repository
git clone https://github.com/NabKh/ML-for-Materials-Science.git
cd ML-for-Materials-Science

# Create conda environment
conda env create -f environment.yml
conda activate ml-materials

# Verify installation
python -c "import matminer; import shap; print('Ready!')"

# Launch Jupyter
jupyter lab
```

Materials Project API Key

You'll need a free API key from materialsproject.org/api to access the database.

Code Preview

A taste of what you'll build

```python
# Complete ML pipeline for band gap prediction
import pandas as pd
from mp_api.client import MPRester
from pymatgen.core import Composition
from matminer.featurizers.composition import ElementProperty
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
import shap

# 1. Get data from the Materials Project
with MPRester("your_api_key") as mpr:
    docs = mpr.materials.summary.search(
        band_gap=(0.1, 5.0),
        fields=["material_id", "formula_pretty", "band_gap"]
    )

# 2. Build a DataFrame and featurize the compositions
df = pd.DataFrame(
    [{"formula_pretty": d.formula_pretty, "band_gap": d.band_gap} for d in docs]
)
df["composition"] = df["formula_pretty"].apply(Composition)
ep = ElementProperty.from_preset("magpie")
df = ep.featurize_dataframe(df, "composition")

# 3. Train a model on the numerical features only
X = df.drop(columns=["band_gap", "formula_pretty", "composition"])
y = df["band_gap"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# 4. Evaluate
print(f"Test R²: {model.score(X_test, y_test):.3f}")

# 5. Explain with SHAP
explainer = shap.Explainer(model)
shap_values = explainer(X_test)
shap.summary_plot(shap_values, X_test)
```
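
The pipeline above reports only R²; the MAE and RMSE metrics covered in Module 05 are computed the same way. A self-contained sketch with made-up band-gap values (in the real pipeline, `y_true` and `y_pred` would be `y_test` and `model.predict(X_test)`):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Stand-in values (eV) for demonstration only
y_true = np.array([1.1, 0.0, 2.5, 3.2])
y_pred = np.array([0.9, 0.1, 2.4, 3.5])

mae = mean_absolute_error(y_true, y_pred)           # average |error|
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # penalizes large errors
r2 = r2_score(y_true, y_pred)
print(f"MAE: {mae:.3f} eV  RMSE: {rmse:.3f} eV  R²: {r2:.3f}")
```

Reporting MAE and RMSE together is useful: a large gap between them signals a few badly mispredicted materials.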

Additional Resources

Dive deeper into ML for materials

What's Next?

Continue your ML journey

Tutorial 08: Neural Network Potentials

Learn to use graph neural networks (M3GNet, CHGNet) for molecular dynamics and property prediction.

Coming Soon

Tutorial 09: Advanced Features

Master SOAP/MBTR descriptors, active learning, and Bayesian optimization for materials discovery.

Coming Soon