Methodology | MLIP Benchmark

Track methodologies

This page covers the potential-energy-surface (PES) benchmark that drives the home leaderboard. Each property track has its own methodology page:

Stability ordering — per-element polymorph ranking (static / hydrostatic / full relax)
Elastic constants — Cᵢⱼ via the MP stress–strain protocol
Vacancy formation — per-site monovacancy E_f, cross-potential reference
Surface energy — clean-surface γ on low-index facets vs MP/Crystalium PBE
GB Energies — elemental CSL grain-boundary energy / W_sep / decohesion, cross-model consensus (no DFT reference)
Pure GB Benchmark — pymatgen γ-surface-optimised MP grain-boundary energy vs MP DFT (MAE/RMSE/r, real reference)
Phonons & QHA — finite-displacement phonons + QHA (designed / upcoming)
Melting point — TI (calphy NETI) + two-phase coexistence Tm for 7 GRACE-1L potentials

The full code-cited provenance for the PES offset fit lives in the deep reference.

The DFT ground truth is the ASSYST campaign: ~7M structures stored in the dft_surrogate_mlip_assyst Postgres database. This count is not the raw candidate pool from the generator. Each entry is a DFT evaluation produced by the pipeline below — the final relaxed structures (recomputed at KSPACING 0.25) plus every RATTLE / TRIAX / SHEAR configuration derived from them.

Structure generation & validation filter

The training and evaluation structures are generated with the ASSYST methodology — Automated Small SYmmetric Structure Training (Poul, Huber & Neugebauer, 2024) — implemented in pyiron_workflow_assyst. ASSYST builds an unbiased, systematically extendable training set by sampling small random crystal prototypes across the space groups (varying cell shape, volume, and decoration) rather than relying on hand-picked known phases. This deliberately produces many configurations — including strongly compressed and near-overlapping ones — so the potential sees the full repulsive wall, but it also produces structures that are too unphysical to be worth a DFT calculation.

Generated candidate structures — both the random symmetric prototypes and the derived perturbations described below — are screened by the validation filter in structure_filter_utils.py. A structure is accepted (is_valid_structure) only if it clears both gates:

Absolute minimum distance. The smallest interatomic distance anywhere in the cell (self-distances excluded) must be ≥ min_dist, default 1.0 Å. This is a cheap global floor that catches near-coincident atoms regardless of species.
Species-resolved core-overlap test. For every element pair (i, j) the closest observed i–j distance must be ≥ (1 − tol) · (Rᵢ + Rⱼ), where Rᵢ are the VASP POTCAR core radii (RCORE, given in Bohr and converted to Å with the factor 0.5291773) and tol is the core_overlap_tolerance. The filter function defaults to tol = 0.2, but this campaign was run with tol = 0.3 (the run_ASSYST_on_structure default), i.e. atomic cores were allowed to interpenetrate by up to 30 % of their summed radii before a structure was rejected.

Per-pair minimum distances are obtained from the pymatgen neighbour list (get_all_neighbors, 5.0 Å cutoff), keeping the smallest distance seen for each sorted element pair. A candidate failing either gate is re-drawn (up to 100 attempts per slot) rather than computed, which keeps the dataset physically reasonable while the 30 % tolerance still retains the short-range, high-energy configurations the potentials need to learn.
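The two gates can be sketched in a few lines. This is a minimal illustration, not the code in structure_filter_utils.py: the function name passes_gates and its input format (a dict of per-pair minimum distances, as would be collected from the neighbour list) are assumptions, and only three RCORE values from the table are included.

```python
BOHR_TO_ANG = 0.5291773  # conversion factor quoted in the text

# POTCAR RCORE values in Bohr, copied from the table (subset for illustration)
RCORE_BOHR = {"H": 1.10, "C": 1.50, "Fe": 2.30}


def passes_gates(pair_min_dist, min_dist=1.0, tol=0.3):
    """Return True if a structure clears both gates.

    pair_min_dist maps a sorted element pair, e.g. ("C", "Fe"), to the
    smallest distance (in Å) seen for that pair (hypothetical input format).
    """
    # Gate 1: absolute minimum distance, regardless of species
    if min(pair_min_dist.values()) < min_dist:
        return False
    # Gate 2: species-resolved core overlap
    for (a, b), d in pair_min_dist.items():
        r_sum = (RCORE_BOHR[a] + RCORE_BOHR[b]) * BOHR_TO_ANG
        if d < (1.0 - tol) * r_sum:
            return False
    return True


# Fe-C threshold: 0.7 * (1.217 + 0.794) Å ≈ 1.41 Å at tol = 0.3,
# but 0.8 * (1.217 + 0.794) Å ≈ 1.61 Å at tol = 0.2.
dists = {("C", "Fe"): 1.50, ("Fe", "Fe"): 2.40}
print(passes_gates(dists, tol=0.3))  # campaign setting
print(passes_gates(dists, tol=0.2))  # filter-function default
```

With these numbers, a 1.50 Å Fe-C contact is kept at the campaign tolerance but would be rejected at the filter default. That is the kind of short-range configuration the 30 % setting is meant to retain.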

The RCORE table covers all 69 elements from H through Bi; the resulting radii Rᵢ range from ≈0.58 Å (H, He) up to ≈1.69 Å (Tl). The first few:

Element	POTCAR `RCORE` (Bohr)	Rᵢ (Å)
H	1.10	0.582
He	1.10	0.582
Li	2.05	1.085
Be	1.90	1.005
B	1.70	0.900
C	1.50	0.794

All 69 elements (H–Bi):

Element	POTCAR `RCORE` (Bohr)	Rᵢ (Å)
H	1.10	0.582
He	1.10	0.582
Li	2.05	1.085
Be	1.90	1.005
B	1.70	0.900
C	1.50	0.794
N	1.50	0.794
O	1.52	0.804
F	1.52	0.804
Ne	1.70	0.900
Na	2.20	1.164
Mg	2.00	1.058
Al	1.90	1.005
Si	1.90	1.005
P	1.90	1.005
S	1.90	1.005
Cl	1.90	1.005
Ar	1.90	1.005
K	2.30	1.217
Ca	2.30	1.217
Sc	2.50	1.323
Ti	2.80	1.482
V	2.70	1.429
Cr	2.50	1.323
Mn	2.30	1.217
Fe	2.30	1.217
Co	2.30	1.217
Ni	2.30	1.217
Cu	2.30	1.217
Zn	2.30	1.217
Ga	2.60	1.376
Ge	2.30	1.217
As	2.10	1.111
Se	2.10	1.111
Br	2.10	1.111
Kr	2.30	1.217
Rb	2.50	1.323
Sr	2.50	1.323
Y	2.80	1.482
Zr	3.00	1.588
Nb	2.40	1.270
Mo	2.75	1.455
Tc	2.80	1.482
Ru	2.70	1.429
Rh	2.70	1.429
Pd	2.60	1.376
Ag	2.50	1.323
Cd	2.30	1.217
In	3.10	1.640
Sn	3.00	1.588
Sb	2.30	1.217
Te	2.30	1.217
I	2.30	1.217
Xe	2.50	1.323
Cs	2.50	1.323
Ba	2.80	1.482
La	2.80	1.482
Hf	3.00	1.588
Ta	2.90	1.535
W	2.75	1.455
Re	2.70	1.429
Os	2.70	1.429
Ir	2.60	1.376
Pt	2.60	1.376
Au	2.50	1.323
Hg	2.50	1.323
Tl	3.20	1.693
Pb	3.10	1.640
Bi	3.00	1.588

From prototype to reference structure

Each random prototype is run through run_ASSYST_on_structure, which relaxes it, harvests the relaxed cell, and spawns a family of perturbations; every resulting point is then re-evaluated at high accuracy. All DFT is VASP 5.4.4 / PBE, driven under Custodian error handling (VaspErrorHandler, NonConvergingErrorHandler, FrozenJobErrorHandler).

Symmetry break. Before relaxation each seed is displaced by structure.perturb(0.1) (0.1 Å random kick) so that high-symmetry prototypes do not sit on a saddle.
Three-stage relaxation at the default k-mesh (KSPACING ≈ 0.5), each stage feeding its relaxed cell to the next:
- ISIF 7 — relax cell volume (shape and ions fixed);
- ISIF 5 — relax cell shape;
- ISIF 2 — relax ions in the fixed cell.
IBRION = 2 (conjugate gradient), up to NSW = 300 ionic steps, EDIFF = 1e-4.
Harvest. Only the final relaxed image of the ISIF 2 trajectory is kept, and only if its SCF loop converged (non-converged frames are dropped).
High-accuracy recompute (static). The relaxed base structure is re-evaluated on a denser mesh — KSPACING = 0.25, tightened EDIFF = 1e-5, NSW = 0 — and this same accurate setting is used for every perturbation below. These static single points are what enter the reference set.
Derived perturbations. From the relaxed base, three families are generated (the production run used n_rattle = 5 and n_stretch = 5, where the "stretch" count drives both triaxial and shear — so 5 RATTLE + 5 TRIAX + 5 SHEAR = 15 per base structure):
- RATTLE — Gaussian per-atom displacements with σ = 0.1 Å, combined with a uniform diagonal cell strain per axis from U(−5 %, +5 %). Samples near-equilibrium forces and small volume/shape fluctuations.
- TRIAX — diagonal-only strain, each of the three cell axes independently scaled by 1 + U(−0.8, +0.8). Large normal (volumetric / uniaxial) deformations for the equation of state and diagonal stress.
- SHEAR — a full 3×3 strain matrix I + U(−0.8, +0.8) with its diagonal pinned to 1, so only the off-diagonal (shear) components act.
Each perturbation is re-checked against is_valid_structure (min-distance 1.0 Å + RCORE core-overlap at tol = 0.3); rejected draws are retried up to 100 times per slot. Frames are named {base}_{rattle|triax|shear}_{n}.
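The three perturbation families can be sketched as below. This is an illustration under stated assumptions, not the run_ASSYST_on_structure code: the function names are hypothetical, the shear off-diagonals are drawn independently (no symmetry is imposed), and frame indices are assumed to start at 0.

```python
import random


def diagonal_strain(rng, max_strain):
    """Diagonal 3x3 matrix, each axis scaled by 1 + U(-max_strain, +max_strain)."""
    return [[1.0 + rng.uniform(-max_strain, max_strain) if i == j else 0.0
             for j in range(3)] for i in range(3)]


def rattle(rng, n_atoms, sigma=0.1, max_strain=0.05):
    """RATTLE: Gaussian displacements (Å) plus a small diagonal cell strain."""
    disp = [[rng.gauss(0.0, sigma) for _ in range(3)] for _ in range(n_atoms)]
    return disp, diagonal_strain(rng, max_strain)


def triax(rng, max_strain=0.8):
    """TRIAX: diagonal-only strain, axes scaled independently."""
    return diagonal_strain(rng, max_strain)


def shear(rng, max_strain=0.8):
    """SHEAR: I + U(-max_strain, +max_strain) with the diagonal pinned to 1."""
    return [[1.0 if i == j else rng.uniform(-max_strain, max_strain)
             for j in range(3)] for i in range(3)]


def frame_names(base, n_rattle=5, n_stretch=5):
    """n_stretch drives both TRIAX and SHEAR, as in the production run."""
    names = [f"{base}_rattle_{n}" for n in range(n_rattle)]
    names += [f"{base}_triax_{n}" for n in range(n_stretch)]
    names += [f"{base}_shear_{n}" for n in range(n_stretch)]
    return names


rng = random.Random(0)
print(shear(rng))
print(len(frame_names("base")))  # 5 + 5 + 5 frames per base structure
```

In the actual pipeline, each draw would then be passed through is_valid_structure and redrawn on failure; that retry loop is omitted here.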

The ~7M reference structures are therefore the recomputed relaxed bases plus all of their RATTLE / TRIAX / SHEAR perturbations — every one a KSPACING = 0.25 static single point — not the raw prototypes, and not intermediate relaxation steps.

DFT settings

Key INCAR tags (host-element example Fe; the full INCAR is built per structure in the driver script):

Tag	Relaxation (ISIF 7→5→2)	Accurate recompute
`GGA` / functional	PBE (`Pe`)	PBE
`ENCUT`	400 eV	400 eV
`PREC`	Accurate	Accurate
`ALGO`	Normal	Normal
`ISMEAR` / `SIGMA`	1 / 0.2 (Methfessel–Paxton)	1 / 0.2
`EDIFF`	1e-4	1e-5
`EDIFFG`	1e-4	1e-4
`KSPACING`	~0.5 (VASP default)	0.25
`IBRION` / `NSW` / `NELM`	2 / 300 / 300	— / 0 (static) / —
`ISPIN`	2 (magnetic; `--nonmagnetic` → 1)	same
`MAGMOM`	set per-structure (not in the base INCAR)	same
mixing	`AMIX 0.01`, `AMIX_MAG 0.1`, `BMIX`/`BMIX_MAG 0.0001`	same
parallel / misc	`KPAR 4`, `NCORE 4`, `LORBIT 10`, `LREAL .FALSE.`	same

The default runs are spin-polarised with element-dependent initial moments (ferromagnetic host, lightly antiparallel solutes) and deliberately hard linear mixing to push difficult magnetic SCF cycles to convergence.
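For readers who prefer the table as data, the two columns can be written as one shared dictionary plus per-stage overrides. This is a sketch only: the stage labels and the incar_for helper are hypothetical, MAGMOM is omitted because it is set per structure, and KSPACING is left unset during relaxation so that VASP falls back to its default.

```python
# Tags shared by the relaxation and the accurate recompute (from the table)
COMMON = {
    "GGA": "Pe", "ENCUT": 400, "PREC": "Accurate", "ALGO": "Normal",
    "ISMEAR": 1, "SIGMA": 0.2, "EDIFFG": 1e-4, "ISPIN": 2,
    "AMIX": 0.01, "AMIX_MAG": 0.1, "BMIX": 0.0001, "BMIX_MAG": 0.0001,
    "KPAR": 4, "NCORE": 4, "LORBIT": 10, "LREAL": False,
}


def incar_for(stage, nonmagnetic=False):
    """Return the INCAR tags for 'isif7', 'isif5', 'isif2' or 'static'.

    The stage labels are hypothetical; they are not names used by the driver.
    """
    incar = dict(COMMON)
    if nonmagnetic:
        incar["ISPIN"] = 1
    if stage == "static":
        # Accurate recompute: denser mesh, tighter SCF, no ionic steps
        incar.update(EDIFF=1e-5, KSPACING=0.25, NSW=0)
    else:
        # Relaxation stage: ISIF taken from the label, default k-mesh
        incar.update(EDIFF=1e-4, IBRION=2, NSW=300, NELM=300,
                     ISIF=int(stage[-1]))
    return incar


for stage in ("isif7", "isif5", "isif2", "static"):
    tags = incar_for(stage)
    print(stage, tags.get("ISIF"), tags["EDIFF"], tags.get("KSPACING"))
```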

Reference-set filtering

The reference set itself is filtered only by SCF convergence (the Harvest step above) and the geometric validity filter (min-distance + RCORE core-overlap, above) at generation time; there is no separate absolute-energy or absolute-force cap on the DFT data. The single magnitude-based cut applied for the benchmark is the evaluation-time outlier filter below, which screens the MLIP-minus-DFT error and the predicted force, not the reference structures.

Offset fit (per composition-system × spin)

Each potential's raw MLIP energy can be off the DFT scale by an element-dependent constant. We fit per-element shifts Δμᵢ by least squares, minimising Σₛ [(E_MLIP − E_DFT)ₛ − Σᵢ Nₛᵢ · Δμᵢ]² over the structures s, via the normal equations (AᵀA)Δμ = Aᵀb, where A holds the per-structure element counts Nₛᵢ and b the energy differences. Robust iterative MAD σ-clipping on the per-atom residual ensures a handful of broken structures cannot bias the fit. The fit is separated by composition-system and by spin rather than fitting one global shift per element:

system: for each element-set S (e.g. {Fe}, {Fe,C}, {Fe,Ni,C}), the shifts are fit over every structure whose composition ⊆ S — so {Fe,C} pools pure-Fe ∪ pure-C ∪ Fe–C — and each structure is corrected with its own home-system shifts. Reported in the _corrected_global columns (the headline E_rmse_corrected).
spin: the fit is run independently for spin-polarised (ISPIN=2) and non-spin-polarised (ISPIN=1) structures, since the DFT energy scale differs.
elemental: a second flavour where each mixed system's shifts come only from its constituent unary partitions (Fe–C corrected from pure-Fe and pure-C alone, never from Fe–C structures themselves) — a pure transfer test. Reported in the _corrected_elemental columns and surfaced by the leaderboard's Energy correction → Elemental transfer toggle. For unary structures elemental ≡ system by construction.

The fit is performed in the upstream per-potential evaluation; each structure carries energy_err_per_atom_corrected_global (and …_elemental), which E_rmse_corrected (and E_rmse_corrected_elemental) aggregate (below).
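A minimal sketch of such a fit for one (system, spin) pool is given below, in pure Python so that the normal equations are explicit. It is not the upstream implementation. The clipping threshold (3.5 σ), the 1.4826 MAD-to-σ factor, the σ floor, centring the residual on its median, and the function names are all assumptions made for illustration.

```python
from statistics import median


def solve_normal_equations(A, b):
    """Solve (A^T A) x = A^T b by Gaussian elimination with partial pivoting."""
    n = len(A[0])
    M = [[sum(r[i] * r[j] for r in A) for j in range(n)] for i in range(n)]
    v = [sum(r[i] * bi for r, bi in zip(A, b)) for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        v[col], v[piv] = v[piv], v[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n):
                M[r][c] -= f * M[col][c]
            v[r] -= f * v[col]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (v[i] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def fit_offsets(counts, dE, n_sigma=3.5, sigma_floor=1e-3, max_iter=10):
    """counts[s][i]: atoms of element i in structure s; dE[s]: E_MLIP - E_DFT (eV)."""
    all_idx = range(len(dE))
    keep = list(all_idx)
    for _ in range(max_iter):
        mu = solve_normal_equations([counts[s] for s in keep],
                                    [dE[s] for s in keep])
        # Per-atom residual after removing the fitted shifts
        resid = [(dE[s] - sum(n * m for n, m in zip(counts[s], mu)))
                 / sum(counts[s]) for s in all_idx]
        med = median(resid[s] for s in keep)
        mad = median(abs(resid[s] - med) for s in keep)
        sigma = max(1.4826 * mad, sigma_floor)
        new_keep = [s for s in all_idx if abs(resid[s] - med) <= n_sigma * sigma]
        if new_keep == keep:
            break
        keep = new_keep
    return mu, keep


# {Fe, C} pool: pure Fe, pure C and Fe-C structures; columns are (N_Fe, N_C).
counts = [[2, 0], [4, 0], [0, 2], [0, 4], [1, 1], [3, 1], [1, 3], [2, 2]]
true_mu = [-0.5, 0.25]  # synthetic shifts in eV/atom, for the demo only
dE = [sum(n * m for n, m in zip(c, true_mu)) for c in counts]
dE[7] += 40.0           # one broken structure (10 eV/atom off)
mu, keep = fit_offsets(counts, dE)
print(mu, keep)
```

In this synthetic pool the broken structure is clipped on the first pass, and the refit then recovers the shifts from the seven remaining structures.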

Outlier filter

Before any per-element/per-dataset RMSE/MAE is reported, each structure must clear a three-part cut, applied per structure in regen_summary_v2.py and documented in the summary_filter.json sidecar shipped beside the CSVs:

status == "done" — the MLIP evaluation completed;
|energy_err_per_atom_corrected_global| ≤ 5.0 eV/atom — the offset-corrected per-atom energy error (not the raw E_MLIP − E_DFT);
max_force ≤ 10.0 eV/Å — the largest predicted atomic-force component.

These drop catastrophically broken MLIP predictions, not real reference offsets — without the cut a single runaway structure can inflate a per-element RMSE by 10–100×. The raw wide parquets are unfiltered; only the summary CSVs apply this cut, so every headline number is post-filter. The per-row n_filtered_e column records how many structures each row dropped.
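The three-part cut is simple enough to sketch directly. The column names and thresholds below are the ones quoted above; representing each structure as a dict, and the helper names, are assumptions rather than the layout used in regen_summary_v2.py.

```python
import math

E_ERR_MAX = 5.0   # eV/atom, on the offset-corrected per-atom energy error
F_MAX = 10.0      # eV/Å, on the largest predicted force component


def passes_cut(row):
    """True if a per-structure row survives the three-part cut."""
    return (row.get("status") == "done"
            and abs(row["energy_err_per_atom_corrected_global"]) <= E_ERR_MAX
            and row["max_force"] <= F_MAX)


def rmse(values):
    return math.sqrt(sum(v * v for v in values) / len(values))


rows = [
    {"status": "done", "energy_err_per_atom_corrected_global": 0.01, "max_force": 1.2},
    {"status": "done", "energy_err_per_atom_corrected_global": -0.02, "max_force": 0.8},
    {"status": "done", "energy_err_per_atom_corrected_global": 8.0, "max_force": 3.0},
    {"status": "failed"},
]
kept = [r for r in rows if passes_cut(r)]
n_filtered = len(rows) - len(kept)
key = "energy_err_per_atom_corrected_global"
print(n_filtered, rmse([r[key] for r in kept]))
```

In this toy group, keeping the single 8 eV/atom row would raise the RMSE from about 0.016 to about 4.6 eV/atom, which illustrates why every headline number is reported after the cut.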

The same per-structure cut governs the force rows. The per-species force metrics (F_*) aggregate the per-atom errors of only those atoms belonging to structures that pass the cut, so forces, energies, and stresses are all reported over the same set of structures. (Earlier snapshots aggregated forces under a looser 50 eV/atom + 50 eV/Å rule, which left pathological structures in the force statistics; they were re-aggregated to this filter on 2026-06-01.)

Metric definitions

Metric	Definition (per summary-CSV row)
`E_rmse_corrected`	√mean((dE/atom − offset)²) over the kept structures in the row group; offset = per-(system, spin) pooled fit
`E_mae_corrected`	mean(|dE/atom − offset|), same group
`E_rmse_corrected_elemental` / `E_mae_corrected_elemental`	same, but offset fit from constituent unary partitions only (elemental-transfer; Energy correction → Elemental transfer)
`E_rmse_raw` / `E_mae_raw`	same, on the uncorrected per-atom energy error
`F_rmse_comp`	√mean(|F_MLIP − F_DFT|²) over atoms, grouped by atomic species
`F_mae_comp`	mean(|F_MLIP − F_DFT|), same
`F_radial_rmse`	√mean((|F_MLIP| − |F_DFT|)²) over atoms (magnitude-only error)
`F_radial_mae`	mean(||F_MLIP| − |F_DFT||), same
`F_ang_med_deg`	median angle(F_MLIP, F_DFT) per atomic species
`S_rmse_voigt`	RMS over structures of each structure's stress RMSE, where the per-structure error runs over the 6 ASE-Voigt components `[xx, yy, zz, yz, xz, xy]`; stored in kBar
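The force rows of the table can be illustrated for a single species with a few atoms. This is a sketch of the formulas as written above, not the production aggregation. The function name is hypothetical, and skipping zero-magnitude forces in the angle is an assumption.

```python
import math
from statistics import median


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def force_metrics(f_mlip, f_dft):
    """Per-atom force metrics for one species; inputs are lists of 3-vectors (eV/Å)."""
    diff = [norm([a - b for a, b in zip(fm, fd)]) for fm, fd in zip(f_mlip, f_dft)]
    radial = [norm(fm) - norm(fd) for fm, fd in zip(f_mlip, f_dft)]
    angles = []
    for fm, fd in zip(f_mlip, f_dft):
        nm, nd = norm(fm), norm(fd)
        if nm > 0.0 and nd > 0.0:  # angle undefined for a zero force
            cos = sum(a * b for a, b in zip(fm, fd)) / (nm * nd)
            angles.append(math.degrees(math.acos(max(-1.0, min(1.0, cos)))))
    n = len(diff)
    return {
        "F_rmse_comp": math.sqrt(sum(d * d for d in diff) / n),
        "F_mae_comp": sum(diff) / n,
        "F_radial_rmse": math.sqrt(sum(r * r for r in radial) / n),
        "F_radial_mae": sum(abs(r) for r in radial) / n,
        "F_ang_med_deg": median(angles),
    }


# Atom 1: right magnitude, wrong direction (90° off). Atom 2: exact.
f_dft = [(1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
f_mlip = [(0.0, 1.0, 0.0), (2.0, 0.0, 0.0)]
print(force_metrics(f_mlip, f_dft))
```

The example shows why the radial and angular metrics are reported alongside the vector error: the first atom has zero magnitude error but a 90° direction error, which only F_rmse_comp, F_mae_comp, and F_ang_med_deg pick up.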

Row granularity. Each summary-CSV row is one (partition, dataset) group, not a single structure-element. The unary partition is keyed per element (the Unary and Unary_magnetic datasets are merged); binary_Fe_X / binary_Ni_X / carbide / ternary are keyed per dataset. Energy and stress metrics are computed per group as above; the force metrics are carried per atomic species.

Stress: Voigt alignment & units. The shipped numbers come from the stress-patched table (wide_stresspatched/). Earlier exports differenced the DFT stress (a flat row-major 3×3) against the MLIP stress (a 6-vector in ASE Voigt order) without reordering, so only the xx component lined up and S_rmse_voigt was inflated ~2.7–9×; the patch reorders the DFT tensor to ASE Voigt (VOIGT_IDX = [0, 4, 8, 5, 2, 1]) before differencing. S_mae_voigt is left blank (not derivable from the scalar per-structure stress RMSE). The CSV stores stress in kBar; the leaderboard converts to GPa at display time (× 0.1).
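The reordering can be checked on a small symmetric tensor. VOIGT_IDX and the × 0.1 conversion are taken from the text; the helper names are hypothetical. Comparing against the first six flat entries is an assumption about how the earlier export misaligned the components, though it is consistent with only xx lining up.

```python
import math

VOIGT_IDX = [0, 4, 8, 5, 2, 1]  # flat row-major 3x3 -> [xx, yy, zz, yz, xz, xy]
KBAR_TO_GPA = 0.1


def to_voigt(flat9):
    """Reorder a flat row-major 3x3 stress tensor into ASE Voigt order."""
    return [flat9[i] for i in VOIGT_IDX]


def rmse(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


# Symmetric tensor with xx=1, yy=2, zz=3, yz=4, xz=5, xy=6 (kBar)
dft_flat = [1.0, 6.0, 5.0,
            6.0, 2.0, 4.0,
            5.0, 4.0, 3.0]
mlip_voigt = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # a perfect prediction

aligned = rmse(mlip_voigt, to_voigt(dft_flat))
misaligned = rmse(mlip_voigt, dft_flat[:6])   # no reordering
print(aligned * KBAR_TO_GPA, misaligned * KBAR_TO_GPA)
```

Even for a perfect prediction, the unreordered comparison reports a non-zero stress error, which is the inflation the patch removes.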

Technical performance metrics

The technical-performance page reports each potential's runtime cost on a controlled benchmark — a single B200 GPU running molecular dynamics on an FCC-Cu supercell (NVE, 1 fs timestep), independent of the accuracy metrics. Two engines are measured: ASE (the universal Python path, run for every potential) and LAMMPS (where an official/community GPU pair_style build exists — typically several × faster). The data ships in engine-bench.csv (one row per potential × engine).

max atoms — the largest FCC-Cu cell that ran on one 183 GB B200, localised by bisection to ~one lattice step. This is the practical GPU-memory capacity ceiling (how big a simulation the model can hold), and it is the bubble size on the scatter.
mem-derived max — MB/atom (measured at the largest stable cell) extrapolated to the full card; a smooth cross-check on the bisected ceiling.
steps/day — MD throughput at a fixed ~4 000-atom reference cell (the apples-to-apples point across potentials); the scatter's X axis (log).
peak/day — throughput at the smallest cell that ran for that potential and engine (shown after the @ in the table). This "smallest cell" is the bottom rung of the FCC-Cu size ladder — as few as 32 atoms in ASE, but larger for models with long neighbour cutoffs (LAMMPS builds and e.g. PET bottom out around 256 atoms), so the reference size differs per row. It is therefore an upper bound on throughput, useful for ranking a potential against itself but not comparable across rows.

The table on the page shows the three headline columns — max atoms, steps/day, and peak/day — while engine-bench.csv additionally carries first_oom_atoms, mb_per_atom, mem_derived_max, and peak_at_atoms for anyone reproducing the sweep.
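The capacity columns can be illustrated with a toy memory model. Everything here is a sketch under assumptions: the ladder of cubic supercells with 4·n³ atoms is inferred from the 32-, 256-, and ~4 000-atom sizes quoted above, the fits predicate stands in for an actual MD run on the card, and the 183 000 MB capacity and the linear memory model are placeholders.

```python
def atoms_in_cell(n):
    """Atoms in an n x n x n supercell of the 4-atom FCC conventional cell."""
    return 4 * n ** 3


def bisect_max_atoms(fits, n_lo=1, n_hi=64):
    """Largest cell that fits, assuming fits() is monotone in cell size.

    Requires fits(atoms_in_cell(n_lo)) to be True and
    fits(atoms_in_cell(n_hi)) to be False. Returns (max_atoms, first_oom_atoms).
    """
    while n_hi - n_lo > 1:
        mid = (n_lo + n_hi) // 2
        if fits(atoms_in_cell(mid)):
            n_lo = mid
        else:
            n_hi = mid
    return atoms_in_cell(n_lo), atoms_in_cell(n_hi)


CARD_MB = 183_000.0  # placeholder capacity in MB


def used_mb(n_atoms):
    """Toy memory model: fixed overhead plus a per-atom cost (made-up numbers)."""
    return 2_000.0 + 0.5 * n_atoms


max_atoms, first_oom_atoms = bisect_max_atoms(lambda n: used_mb(n) <= CARD_MB)
mb_per_atom = used_mb(max_atoms) / max_atoms   # measured at the largest stable cell
mem_derived_max = CARD_MB / mb_per_atom        # extrapolated to the full card
print(max_atoms, first_oom_atoms, round(mem_derived_max))
```

In this toy model the extrapolated value falls between the last cell that ran and the first that ran out of memory. That is the sense in which mem-derived max serves as a smooth cross-check on the bisected ceiling.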

The scatter plots steps/day (log X — faster to the right) against the accuracy of a selectable benchmark track (Y); on the home page the axis is fixed to ASSYST PES E RMSE. Unlike the earlier A100 throughput proxy this is a controlled, single-card benchmark; absolute numbers still depend on the card and build, so read it as a capacity/throughput comparison between potentials rather than an absolute spec.

Leaderboard aggregation

The per-row metrics above are combined into the single number shown per potential through a per-row value, a choice of which partitions to include, and a choice of how to weight them. The roll-up is computed in the browser from the summary table (src/components/leaderboard-agg.js), so both choices are interactive.

Per-row metric — computed as in Metric definitions, one value per (partition, dataset) row (regen_summary_v2.py).
Partition subset — the leaderboard's Partitions checkboxes select which of the five partitions (unary, binary_Fe_X, binary_Ni_X, carbide, ternary) contribute; the score is computed over the selected subset, and ticking all five gives the global ranking. (The single Partition selector on the periodic-table heatmap and comparison cards is a separate control and does not affect the leaderboard.)
Weighting — the Weighting radio chooses how the rows in the selected subset are rolled up. For a metric m with weight column w (energy → n_structures, force → n_atoms, stress → n_structs_stress), over the contributing rows R:
- Equal weight on structures — Σ(w·m) / Σ(w), a structure/atom-count-weighted mean. Every structure counts equally, so high-population datasets dominate. This is the legacy roll-up — a weighted mean of per-row RMSEs, not a single RMSE pooled over every structure.
- Equal weight per dataset — the unweighted arithmetic mean of the per-row values (per element for unary, per dataset otherwise). Each dataset counts equally regardless of how many structures it holds.
- Equal weight per partition — the Equal weight on structures value is computed within each selected partition, then those partition values are averaged equally, so the ~93 unary elements do not swamp the 3 ternary datasets. This is the default, with all five partitions selected.
RMSE/stress columns under the two equal-weight schemes use the arithmetic mean of the per-row RMSEs (not a root-mean-square of RMSEs). Rows whose metric or weight is missing are dropped, and a partition with no finite value for a metric is dropped from the Equal weight per partition average (it is not counted as zero).
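The three roll-ups can be compared on a toy summary table. The real roll-up runs in the browser (src/components/leaderboard-agg.js); the Python below is only a sketch of the arithmetic described above, with hypothetical field names m (per-row metric), w (weight column), and partition.

```python
def weighted_mean(rows):
    """Σ(w·m) / Σ(w) over the given rows."""
    return sum(r["w"] * r["m"] for r in rows) / sum(r["w"] for r in rows)


def aggregate(rows, scheme, partitions):
    """Roll per-row metrics up to one number for the selected partitions."""
    # Rows with a missing metric or weight are dropped
    rows = [r for r in rows
            if r["partition"] in partitions
            and r["m"] is not None and r["w"] is not None]
    if scheme == "structures":   # Equal weight on structures
        return weighted_mean(rows)
    if scheme == "dataset":      # Equal weight per dataset
        return sum(r["m"] for r in rows) / len(rows)
    if scheme == "partition":    # Equal weight per partition
        per_partition = []
        for p in partitions:
            sub = [r for r in rows if r["partition"] == p]
            if sub:              # empty partitions are skipped, not zeroed
                per_partition.append(weighted_mean(sub))
        return sum(per_partition) / len(per_partition)
    raise ValueError(f"unknown scheme: {scheme}")


rows = [
    {"partition": "unary", "m": 0.01, "w": 100},
    {"partition": "unary", "m": 0.03, "w": 300},
    {"partition": "ternary", "m": 0.10, "w": 100},
    {"partition": "carbide", "m": None, "w": 50},   # missing metric
]
selected = ["unary", "carbide", "ternary"]
for scheme in ("structures", "dataset", "partition"):
    print(scheme, aggregate(rows, scheme, selected))
```

For this table the three schemes give roughly 0.040, 0.047, and 0.0625: the small ternary row carries progressively more weight as the roll-up moves from structures to datasets to partitions.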

Magnetic split. The leaderboard can be restricted by magnetic state. Two levels are offered, both recomputing every metric over only the chosen subset (energy & stress from the per-structure wide table, the five force metrics from the full per-atom arrays):

Binary (INCAR ISPIN): Spin-polarised (ISPIN=2) vs Non-spin-polarised (ISPIN=1). These partition every structure, so spin-pol ∪ non-spin-pol = All.
DFT 5-class (magnetic_class), classified per structure from the converged DFT moments:
- NSP — ISPIN≠2; magnetism not assessed (identical set to non-spin-pol).
- NM — spin-polarised but relaxed to ≈0 moment (max |mᵢ| < 0.05 µB): DFT-confirmed non-magnetic.
- FM — ferromagnetic (significant local moments, one dominant sign).
- AFM — antiferromagnetic (antiparallel sublattices, |Σm|/N < 0.10 µB, compensated).
- FiM — ferrimagnetic (antiparallel but uncompensated).
So NSP ≡ non-spin-pol, and NM ∪ FM ∪ AFM ∪ FiM ≡ spin-pol. An element present in only one state appears only under that state. The classifier lives in scripts/build_wide_per_struct.py::classify_magnetic.
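One way to read the five classes as a decision procedure is sketched below. It is not the code in classify_magnetic. The two thresholds are the ones quoted above, but treating "one dominant sign" as "no significant moment of the opposite sign", reusing the 0.05 µB threshold for "significant", and the function name are assumptions.

```python
def classify_magnetic_sketch(ispin, moments, nm_tol=0.05, afm_tol=0.10):
    """Classify one structure from its converged per-atom moments (µB)."""
    if ispin != 2:
        return "NSP"                       # magnetism not assessed
    if max(abs(m) for m in moments) < nm_tol:
        return "NM"                        # spin-polarised but ≈0 moment
    significant = [m for m in moments if abs(m) >= nm_tol]
    has_up = any(m > 0 for m in significant)
    has_down = any(m < 0 for m in significant)
    if not (has_up and has_down):
        return "FM"                        # all significant moments share a sign
    if abs(sum(moments)) / len(moments) < afm_tol:
        return "AFM"                       # antiparallel and compensated
    return "FiM"                           # antiparallel but uncompensated


examples = {
    "NSP": (1, []),
    "NM": (2, [0.01, -0.02]),
    "FM": (2, [2.2, 2.3]),
    "AFM": (2, [2.0, -2.0]),
    "FiM": (2, [3.0, -1.0]),
}
for expected, (ispin, moments) in examples.items():
    print(expected, classify_magnetic_sketch(ispin, moments))
```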

The leaderboard defaults to sorting ascending by E_rmse_corrected (lowest energy RMSE first). The per-element heatmap and comparison cards use an unweighted mean across rows — i.e. they line up with the Equal weight per dataset scheme but will differ from the structure-weighted one. F_ang_med_deg is itself a per-species median, so any leaderboard roll-up of it is a mean of per-species medians — an approximation, not a true global median.

Element attribution caveat

For binary/carbide datasets the row's element (which drives the periodic-table heatmap cell) is the X-partner parsed from the dataset name — Binary_Fe_Mo → Mo, Binary_Fe_B → B — and the host (Fe/Ni/C) is deliberately left unattributed so its cell stays grey. Ternary rows keep the wide-table alphabetical attribution. Force metrics use per-atom species (atomic-number-derived) grouping, which is consistent across datasets.

Code

The scientific core of the PES track is a single MLIP evaluation (energy, forces, stress) on each corpus structure; the ~7M-row queue, DB, and SLURM plumbing are omitted here:

from ase.io import read
from mace.calculators import mace_mp   # example MLIP only; any ASE calculator works

# Build the calculator once and reuse it for every structure
calc = mace_mp(model="medium", device="cuda")

atoms = read("structure.xyz")          # any structure from the corpus
atoms.calc = calc
energy = atoms.get_potential_energy()  # total energy, eV
forces = atoms.get_forces()            # eV/Å, shape (N, 3)
stress = atoms.get_stress()            # eV/Å³, ASE Voigt order (6,): xx, yy, zz, yz, xz, xy