Training set construction

When you have gathered several dozens of structures describing your system of interest, you can use them to build a training set. appa contains tools for structure selection and setting up DFT calculations with VASP.

If you do not want to do this, you can find some example training data in the examples folder of this repo. platinum-solvated-protons.xyz and platinum-water-dissociation.xyz should include some near-transition state structures.

Structure selection¶

The command appa select uses the Maximum Set Coverage algorithm from the quests package to select diverse structures from a dataset. To use this tool you need to have quests installed (available with pip).

Usage: appa select [OPTIONS]

  Select the most diverse configurations spanning the configuration space
  using the Maximum Set Coverage algorithm.

Options:
  -d, --data-dir PATH  Directory with .extxyz/.xyz/.traj files  [required]
  --size INTEGER       Number of structures to select  [default: 100]
  --bw FLOAT           Bandwidth for entropy estimation kernel  [default: 0.065]
  -o, --out TEXT       Output XYZ file  [default: selected.xyz]
  -s, --species TEXT   Allowed species (e.g. -s O -s H -s Pt)
  --help               Show this message and exit.

The default bandwidth seems okay for water-related atomic environments.

Example:

appa select -d data --size 1200 -s O -s H -s Pt

Running can take a few minutes so it’s best to run it on a small CPU node on the HPC.

Data labeling¶

When you have a set of structures, you can set up a batch DFT calculation with appa vasp input.

Usage: appa vasp input [OPTIONS]

  Generate VASP input files for a single structure (specified by xyz and
  index). Requires a config .yaml file with sections specified by config_type,
  and an ASE-readable XYZ extended file where all structures have a
  config_type in their info section.

Options:
  --xyz FILE              XYZ file with configurations to be labelled.
                          [required]
  --index INTEGER         Index of the configuration (e.g. SLURM array index).
                          [required]
  --output-dir DIRECTORY  Output directory for VASP input files.  [required]
  --params FILE           YAML file with VASP parameters.  [default: params.yaml]
  --ncore INTEGER         VASP NCORE.  [default: 8]
  --kpar INTEGER          VASP KPAR.  [default: 1]
  --help                  Show this message and exit.

An example of a params.yaml file is given below. The tags correspond to the ASE Vasp calculator input kwargs.

interface:
  encut: 450
  kspacing: 0.18
  xc: rpbe
  ivdw: 11
  ediff: 1.0e-6
  ismear: 0
  sigma: 0.1
  lreal: Auto
  nelm: 200
  algo: Fast
  ldipol: True
  idipol: 3
  lwave: False
  lcharg: False
  lasph: True

An example jobscript, which can be submitted by sbatch job --array=0-123 where 123 is the number of structures minus one in your to_label.xyz file.

#!/bin/bash
#SBATCH --job-name=label
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=64
#SBATCH --cpus-per-task=1
#SBATCH --threads-per-core=1
#SBATCH --time=1-00:00:00

source ~/.bashrc
conda activate grace  # whatever environment has appa installed
which python

module purge
module load 2025
module load VASP5/5.4.4.pl2-foss-2025a-VASPsol-VTST-CUDA-12.8.0

echo "$(date "+%Y-%m-%d %H:%M:%S"): Job started"
echo "$(date "+%Y-%m-%d %H:%M:%S"): Node $SLURMD_NODENAME"

TASK_DIR=$(printf "$PWD/results/%05d" "$SLURM_ARRAY_TASK_ID")
SCRATCH_BASE="/scratch-local/$USER"
SCRATCH_DIR="$SCRATCH_BASE/${SLURM_JOB_ID}_${SLURM_ARRAY_TASK_ID}"

appa vasp input --xyz="to_label.xyz" --index=$SLURM_ARRAY_TASK_ID --output-dir=$TASK_DIR --ncore=32 --kpar=2

mkdir -p $SCRATCH_DIR
rsync -a "$TASK_DIR/" "$SCRATCH_DIR/"
pushd $SCRATCH_DIR
echo "$(date "+%Y-%m-%d %H:%M:%S"): Running VASP"
srun vasp_std
popd
rsync -a "$SCRATCH_DIR/" "$TASK_DIR/"

# Clean up
rm -rf "$SCRATCH_DIR"
echo "$(date "+%Y-%m-%d %H:%M:%S"): Job finished"

To collect the outputs from the DFT calculations, you can use appa vasp collect, where DIRECTORY is the name of the directory where your VASP calculations are stored, according to the format DIRECTORY/*/vasprun.xml.

Usage: appa vasp collect [OPTIONS] DIRECTORY

  Collect VASP outputs, analyze energy/force distributions, filter outliers,
  and write to extxyz.

Options:
  -o, --output FILE  Output extxyz file.  [default: collected.xyz]
  --fmax FLOAT       Maximum allowed force magnitude in eV/Å  [default: 10]
  --emax FLOAT       Maximum deviation from mean energy as a fraction of mean
                     energy  [default: 0.1]
  --help             Show this message and exit.

This script also filters outliers, i.e. structures with large forces or large energy deviations, which can destabilize the ML model training.

This approach is an alternative to selecting structures based on committee disagreement; it does not require running slow molecular dynamics with a committee. It might do a few more DFT calculations than necessary (on outliers), though, but usually CPU hours are cheaper and more readily available than GPU hours.

References¶

Yu, B., Lordi, V., & Schwalbe-Koda, D. (2025). Maximizing Efficiency of Dataset Compression for Machine Learning Potentials With Information Theory. arXiv. 10.48550/ARXIV.2511.10561