Wrenformer #44

janosh · 2022-05-12T10:32:41Z

This PR adds a new variant of the Wren model called Wrenformer, a rewrite that drops the custom self-attention in favor of PyTorch's built-in TransformerEncoder. It also prepares Matbench submissions for Roost and Wren (structure tasks only for Wren).

Rewriting Wren as a transformer encoder was necessary to run robustly across Matbench tasks, because the original Wren ran into out-of-memory errors during training and inference on materials with large numbers of Wyckoff positions (<= 16 was a safe cutoff). This happened even on A100 GPUs with 80 GB of memory.

Initial performance testing (training for 100 epochs and with only 3 transformer layers) suggests Wrenformer slightly beats Wren across all Matbench tasks I recorded for both.

More sets of hyperparameters

Speed difference between Wren and Wrenformer

According to @CompRhys, Wren could run 500 epochs in 5.5 h on a P100 when training on 120k samples of MP data (similar to the matbench_mp_e_form dataset with 132k samples). Wrenformer only managed 207 epochs in 4 h on the more powerful A100 when training on matbench_mp_e_form. However, to avoid out-of-memory issues, Rhys constrained Wren to only run on systems with <= 16 Wyckoff positions. The code below shows that this restriction lightens the workload by a factor of about 7.5, which likely explains the apparent slowdown in Wrenformer.

import pandas as pd
from aviary.wren.utils import count_wyks
from examples.mat_bench import DATA_PATHS

df = pd.read_json(DATA_PATHS["matbench_mp_e_form"])

# number of Wyckoff positions in each structure
df["n_wyckoff"] = df.wyckoff.map(count_wyks)

# sum of squared Wyckoff position counts as a proxy for workload
# (assumes cost grows quadratically with the number of positions)
sum_wyckoffs_sqr = (df.n_wyckoff**2).sum()
sum_wyckoffs_lte_16_sqr = (df.query("n_wyckoff <= 16").n_wyckoff ** 2).sum()
print(f"{sum_wyckoffs_sqr=}")
print(f"{sum_wyckoffs_lte_16_sqr=}")
print(f"{sum_wyckoffs_sqr/sum_wyckoffs_lte_16_sqr=:.3}")
# prints 7.45, so Wrenformer has to do about 7.45x more work, which likely explains the
# roughly 2x slowdown despite the more powerful GPU (Wrenformer on an A100 vs Wren on a P100)
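
To put the quoted timings and the 7.45x workload ratio side by side, here is a rough back-of-the-envelope sketch. It uses only the figures stated above. Treating 7.45 as the per-epoch cost ratio is an assumption: it presumes quadratic scaling and ignores the difference in dataset size (120k vs 132k samples) and in GPU.

```python
# Figures quoted in the description above
wren_epochs, wren_hours = 500, 5.5  # Wren on a P100, <= 16 Wyckoff positions only
wrenformer_epochs, wrenformer_hours = 207, 4.0  # Wrenformer on an A100, full dataset

wren_rate = wren_epochs / wren_hours  # epochs per hour
wrenformer_rate = wrenformer_epochs / wrenformer_hours

# how many times fewer epochs per hour Wrenformer completed
slowdown = wren_rate / wrenformer_rate

# workload ratio printed by the snippet above; using it as the per-epoch
# cost ratio is an assumption, not a measurement
workload_ratio = 7.45
work_adjusted_speedup = workload_ratio / slowdown

print(f"Wren: {wren_rate:.1f} epochs/h")
print(f"Wrenformer: {wrenformer_rate:.1f} epochs/h")
print(f"{slowdown=:.2f}")
print(f"{work_adjusted_speedup=:.2f}")
```

Under these assumptions, Wrenformer completes fewer epochs per hour but processes more estimated work per hour. A like-for-like benchmark on the same GPU and dataset would be needed to confirm this.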

* add class InMemoryDataLoader in new module aviary/data.py
  refactor run_matbench_task() to work with it
* remove device kwarg from BaseModelClass, load tensors onto GPU externally
  doing device IO inside the epoch loop can lead to a significant slowdown and means doing the same work at every epoch instead of once
  also remove WyckoffData class from wrenformer/data.py and improve slurm submit script header formatting
* rewrite print_walltime decorator to also work as context manager
  ensure model checkpoints and tensorboard logs are always saved relative to project root by prefixing paths with ROOT
* refactor run_matbench_task() to do single fold so each fold can be a separate slurm job
* mv examples/mat_bench/run_{wrenformer=>matbench}.py
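
One way a timing helper can work both as a decorator and as a context manager is to build it with `contextlib.contextmanager`, whose return objects also support decorator use. The sketch below is a minimal illustration of that pattern, not the implementation in this PR. The `label` parameter, the message format and the `square` function are assumptions made for the example.

```python
import time
from contextlib import contextmanager


@contextmanager
def print_walltime(label="block"):
    """Print the wall time of the wrapped code (hypothetical signature)."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # runs even if the wrapped code raises
        elapsed = time.perf_counter() - start
        print(f"{label} took {elapsed:.2f} sec")


# usage as a context manager
with print_walltime("sleep in with-block"):
    time.sleep(0.01)


# usage as a decorator (note the call with parentheses)
@print_walltime("decorated function")
def square(x):
    return x * x


result = square(3)
```

Each call to the decorated function enters a fresh context, so the timer restarts on every call.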

aviary/data.py

aviary/wren/data.py

aviary/wren/utils.py

aviary/wrenformer/run.py

also rename some poorly named variables:
- element_weights -> wyckoff_site_multiplicities in aviary/wren/data.py
- aflow -> aflow_label_with_chemsys in aviary/wren/utils.py

remove global numpy random seed in aviary/data.py

arises due to torch.tensor(float("nan")) defaulting to CPU

Wrenformer

janosh added 28 commits April 27, 2022 14:10

start work on roost + wren matbench submission

b95437a

run_matbench_task() fix reading out classification predictions

f9d4146

add examples/matbench/{plotting_functions,make_plots}.py

a048224

add scaled_error_heatmap() to matbench/plotting_functions.py

5d5ee7e

rename plot_scaled_errors() to scale_errors()

mv examples/{matbench,mat_bench} to avoid shadowing matbench package …

a3d28a5

…imports

initial working version of Wren as a transformer

f645ab1

in collaboration with Rokas

add examples/mat_bench/run_wrenformer.py for running wrenformer on al…

1c4ceed

…l matbench tasks modify examples/mat_bench/slurm_submit.py to run wrenformer

wrenformer fix node aggregation: exclude padded sequence values

349e2ef

fix pytorch error from model and data on different devices

4c89dd1

run_wrenformer.py only create benchmark_dir if not empty string

5496d4a

drop parse_aflow() from aviary/wrenformer/data.py, import from aviary…

6f81d75

…/wren/data.py rename cry_ids -> material_ids

rename new Wren variant to Wrenformer, fix pydocstyle doc string errors

34472f5

use longer but clearer variable names

fix run_matbench_task() saving models to wrong hard-coded checkpoint …

7a2d9a0

…name

run_matbench_task() ditch writing MatbenchBenchmark to disk in favor …

df65c76

…of just JSON simply write {dataset: {fold: preds}} dict as compressed JSON to disk

fix run_matbench_task() use correct number of outputs for classificat…

3b0aa17

…ion tasks change typo in InMemoryDataLoader: default shuffle=True->False

fix run_matbench_task() oom error from too large batch size in test_l…

280a029

…oader 1024->128 also fix key error in bench_dict[dataset_name][fold]

add get_composition_embedding() in aviary/wrenformer/data.py

c165b04

use it to run roostformer in run_matbench_task() if model name contains roost

run_matbench_task() add kwarg n_transformer_layers

131bbef

run_matbench_task() save model scores directly to JSON

b0b1958

more efficient than reloading all model preds and computing afterwards

slurm_submit.py use subprocess.run() to submit slurm jobs instead of …

d276a7b

…ipython shell magic cmd

add examples/mat_bench/utils.py with open_json() context manager for …

d88a0f0

…merging results from separate slurm jobs run_matbench_task() drop arg benchmark_path: str replaced by timestamp: str

refactor data loading of model errors in examples/mat_bench/make_plot…

762e9b6

…s.py

move print_walltime context manager from aviary/utils.py to examples/…

abd5c23

…mat_bench/utils.py

add custom np softmax() and one_hot() implementations and remove scip…

fca4644

…y dependency (was imported only for softmax)

wrenformer add linear layer to project embedding dim to d_model

d7be55f

d_model can now be specified separately depending on dataset size

fix BaseModelClass initial epoch to 0 (was 1)

76625a7

fix tests after removal of device kwarg from BaseModelClass

36d7807

janosh added the enhancement New feature or request label May 12, 2022

fix py37 not supporting unenclosed iterable unpacking

1daf90e

https://docs.python.org/3/whatsnew/3.8.html#other-language-changes
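
For context on the linked language change: before Python 3.8, star-unpacking in a `return` or `yield` statement had to be enclosed in parentheses. The sketch below shows the form that works on both versions. `split_head` is a hypothetical function for illustration, not the code changed in this commit.

```python
def split_head(values, extra):
    """Return the first item, the remaining items and one extra value as a flat tuple."""
    head, *rest = values
    # The parenthesized form parses on Python 3.5+.
    # The bare form `return head, *rest, extra` is only valid syntax on 3.8+.
    return (head, *rest, extra)


print(split_head([1, 2, 3], "x"))  # prints (1, 2, 3, 'x')
```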

janosh and others added 6 commits June 15, 2022 10:02

Merge remote-tracking branch 'origin/main' into wrenformer

8f83842

[pre-commit.ci] auto fixes from pre-commit.com hooks

3daee8c

for more information, see https://pre-commit.ci

torch.save checkpoint swa_model if using SWA

9daa82a

record number of trainable params in wandb config

add kwargs learning_rate and warmup_steps to run_wrenformer()

6fa5400

include test_df in run_wrenformer() return values

65ff974

make embedding_aggregations an explicit kwarg to wrenformer

f6b32d4

janosh force-pushed the wrenformer branch from 5b9a49b to f6b32d4 Compare June 16, 2022 11:47

janosh added 2 commits June 16, 2022 18:01

add module test_wrenformer.py with non-robust regression and robust c…

5a9a651

…lassification test also adds batch_size kwarg to run_wrenformer()

fix py37 CI by not using walrus operator

735beec
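
The walrus operator (`:=`) was added in Python 3.8, so code that must also run on 3.7 needs a plain assignment before the condition. The sketch below shows the general rewrite. The regex and log line are made up for illustration and are not taken from the commit.

```python
import re

pattern = re.compile(r"(\d+) epochs")  # hypothetical pattern
log_line = "ran 207 epochs"  # hypothetical input

# Python 3.8+ only:
#     if (match := pattern.search(log_line)) is not None: ...

# Python 3.7-compatible equivalent: assign first, then test
match = pattern.search(log_line)
n_epochs = int(match.group(1)) if match is not None else None

print(n_epochs)
```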