adding variable length attention to llama3 8b #2000
Conversation
(force-pushed from eeecb63 to cad97e5)
fegin
left a comment
This implementation won't work with PP and is too model-intrusive. The packing logic should be hidden inside the inner attention.
(force-pushed from 55352a5 to 066ca02)
(force-pushed from 066ca02 to c9b6d5c)
fegin
left a comment
LGTM, thanks for the update. Left some other comments; once they are addressed, this PR should be ready.
(force-pushed from a902cbe to de416f9)
tianyu-l
left a comment
Thanks! Left some comments, please see if they make sense to you.
(force-pushed from caafc81 to 4d36560)
(force-pushed from ca0efc0 to 291daea)
```
use_flex_attn = getattr(model.model_args, "use_flex_attn", False)
if job_config.parallelism.context_parallel_degree > 1 and use_flex_attn:
attn_type = getattr(model.model_args, "attn_type", "sdpa")
```
nit: in Python 3.11+, StrEnum seems like a good fit for this
TorchTitan still sticks to 3.10 afaik.
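For reference, a minimal sketch of the 3.10-compatible alternative: a `str`-subclassing `Enum` gives `StrEnum`-like ergonomics without requiring Python 3.11. `AttnType` here is illustrative, not the PR's actual config type.

```python
from enum import Enum


class AttnType(str, Enum):
    """Illustrative only, not the PR's code: on Python 3.10, where enum.StrEnum
    (added in 3.11) is unavailable, subclassing str gives similar ergonomics --
    members compare equal to their string values, so string-based checks keep working."""

    SDPA = "sdpa"
    FLEX = "flex"
    VARLEN = "varlen"


assert AttnType.FLEX == "flex"                # plain string comparison still works
assert AttnType("varlen") is AttnType.VARLEN  # parse from a config string
```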
(force-pushed from 9380847 to 42c0c85)
tianyu-l
left a comment
Left some more comments. If you'd like to focus on Llama 3 in this PR, that's fine with me too.
```
extra_kwargs = {}

if getattr(self.model_args, "use_flex_attn", False):
if getattr(self.model_args, "attn_type", "sdpa") == "flex":
```
"varlen" should also work here?
IIUC this isn't limited to Llama 3; I'll add varlen here for the other models after more thorough testing.
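As a rough sketch of the reviewer's suggestion (using a stand-in `model_args` object rather than torchtitan's real config), the check could branch on membership so that "flex" and "varlen" share the same path once varlen is enabled here:

```python
from types import SimpleNamespace

# Stand-in config object for illustration only; torchtitan's model_args is richer.
model_args = SimpleNamespace(attn_type="varlen")

# Hedged sketch: both fused-attention backends take the same branch.
if getattr(model_args, "attn_type", "sdpa") in ("flex", "varlen"):
    print("fused attention backend selected")
```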
```python
match self.attn_type:
    case "flex":
        self.inner_attention = FlexAttentionWrapper()
    case _:
```
How about varlen? Also, it seems the get_attention_masks function in this file is not changed.
If the scope of this PR is to support Llama 3 only, that's fine too.
I think we can limit this PR to Llama 3 and add support for the other models later.
(force-pushed from 5528029 to 31c1c77)
fegin
left a comment
LGTM, we can leave other models to other PR(s).
```python
if use_flex_attn:
    attention_kernel_plan = prepare_module_input(
        input_layouts=(Shard(1), Shard(1), Shard(1)),
        desired_input_layouts=(Shard(1), Shard(1), Shard(1)),
        use_local_output=True,
    )
else:
    attention_kernel_plan = prepare_module_input(
        input_layouts=(Shard(1), Shard(1), Shard(1)),
        desired_input_layouts=(Shard(1), Shard(1), Shard(1)),
        use_local_output=True,
    )
```
Is this existing duplicated code?
Yeah, this was there before; I removed it per #2000 (comment)
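If the block were kept, de-duplicating it would collapse both branches into a single call. A sketch reusing the identifiers from the quoted diff (assuming `prepare_module_input` and `Shard` are already in scope in that file), shown for illustration rather than as the exact code that landed:

```python
# Illustrative de-duplicated form of the quoted block, not necessarily what landed.
attention_kernel_plan = prepare_module_input(
    input_layouts=(Shard(1), Shard(1), Shard(1)),
    desired_input_layouts=(Shard(1), Shard(1), Shard(1)),
    use_local_output=True,
)
```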
(force-pushed from 697b9b9 to b717da3)
(force-pushed from b717da3 to 9c99fcb)
Summary
This PR adds variable length attention (varlen) support to the Llama 3 8B model in torchtitan. We replace `use_flex_attn` with `attn_type` (one of `"sdpa"`, `"varlen"`, or `"flex"`). If `attn_type = "varlen"`, the attention module calls a compiled `varlen_attn` defined here.

Testing
Ran loss and performance tests against flex attention. Loss is on par.
Varlen is slightly slower than flex due to the underlying CUDA kernel speeds (varlen calls into `flash_attention_forward`/`flash_attention_backward` today).
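For readers unfamiliar with varlen attention, the usual pattern is to pack several documents into one flattened sequence and hand the kernel cumulative sequence lengths so attention stays within document boundaries. The helper below is a hypothetical illustration of how such offsets could be built; it is not the PR's implementation.

```python
import torch


def build_cu_seqlens(seq_lens: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper, for illustration only (not the PR's code).

    Varlen-style kernels typically take int32 cumulative sequence lengths marking
    where each packed document starts and ends in the flattened batch.
    """
    cu_seqlens = torch.zeros(
        seq_lens.numel() + 1, dtype=torch.int32, device=seq_lens.device
    )
    cu_seqlens[1:] = torch.cumsum(seq_lens, dim=0)
    return cu_seqlens


if __name__ == "__main__":
    # Three packed documents of lengths 3, 5, and 2 tokens.
    print(build_cu_seqlens(torch.tensor([3, 5, 2])))
    # tensor([ 0,  3,  8, 10], dtype=torch.int32)
```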