## Overview
A production-ready template for training LLMs with two strategies:
- FSDP for multi-GPU sharded training and sharded checkpoints
- Unsloth for efficient 4-bit fine-tuning on a single/multi-GPU node
It is designed for AWS SageMaker Spot with preemption-safe saves (SIGTERM handling) and automatic resume from the latest checkpoint. It also includes hardened data loading (HF Hub and S3 parquet) and robust checkpoint sync to S3.

For non-developers (quick start): use the provided commands without changing code. For developers: the code lives under `src/` in a standard Python package layout.
## Features

- FSDP strategy with sharded initialization and sharded checkpoints
- Unsloth strategy with 4-bit training and a fallback to Transformers
- Preemption-safe: SIGTERM/SIGINT triggers `emergency_stop()` with a final checkpoint (see the sketch after this list)
- Auto-resume: if `--resume` is not provided, the trainer uses the most recent checkpoint in `checkpoint.output_dir`
- Hardened dataloader: HF datasets or S3 parquet, batched mapping with prompt templating
- S3 checkpoint sync with retries and exponential backoff
- Dockerfile with CUDA PyTorch, `awscli`, and `s5cmd` preinstalled
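For illustration, the preemption-safe behavior amounts to routing SIGTERM/SIGINT into the trainer's `emergency_stop()`. The sketch below is not the template's actual code: `install_preemption_handlers` and the `trainer` argument are assumed names; only `emergency_stop()` comes from the feature list above.

```python
import signal


def install_preemption_handlers(trainer):
    """Route SIGTERM/SIGINT to the trainer's emergency_stop() so a final
    checkpoint is written before the Spot capacity is reclaimed."""

    def _handler(signum, frame):
        # The actual save/exit behavior lives in the trainer; this only forwards the signal.
        trainer.emergency_stop()

    signal.signal(signal.SIGTERM, _handler)
    signal.signal(signal.SIGINT, _handler)
```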
## Repository layout

- `src/fsdp_unsloth/`
  - `core/`: trainers, strategy selection, security checks
  - `common/`: logging, memory, checkpoint utils, and config adapter/schema
- `scripts/`
  - `train.py`: CLI entry (thin wrapper; you can also use the installed CLI)
  - `infer.py`: example inference script
  - `configs/`: example configs (FSDP/Unsloth + smoke)
- `.github/workflows/`: GitHub Actions for CI and pre-commit
## Setup

```bash
python -m venv .venv
. .venv/bin/activate
pip install uv
uv pip install -e ".[dev]"
pre-commit install
```

A template notebook is provided at `notebooks/secure_submit.ipynb`, which demonstrates:

- Building SageMaker guardrails via `scripts/core/security.py::build_sagemaker_guardrails()`
- Redacting secrets before logging configs
- Merging guardrails into a job request (example `boto3` call commented out)
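For a rough idea of the redaction step, a config can be masked before logging along the following lines; the function name and key list are illustrative, not the notebook's exact helper.

```python
import copy

# Illustrative key list; extend to match your config schema.
SENSITIVE_KEYS = {"hf_token", "wandb_api_key", "aws_secret_access_key"}


def redact_config(config: dict) -> dict:
    """Return a deep copy of the config with sensitive values masked."""
    redacted = copy.deepcopy(config)
    for key, value in redacted.items():
        if isinstance(value, dict):
            redacted[key] = redact_config(value)
        elif key.lower() in SENSITIVE_KEYS and value:
            redacted[key] = "***REDACTED***"
    return redacted
```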
## Environment

- Configure environment values using `.env.example` (copy it to `.env`).
- Optional: install `python-dotenv` (already in requirements) and load env in your scripts:

```python
from dotenv import load_dotenv

load_dotenv()
```
## Security preflight

- `src/fsdp_unsloth/core/strategy_selector.py` runs preflight checks (HF token format, S3/local path safety, W&B readiness) before trainer construction.
- Enable strict mode to fail fast:

```yaml
security:
  strict_preflight: true
```
## Docker image (recommended for SageMaker)
```bash
docker build -t unsloth-fsdp-training:latest .
```

## Configuration

- Base schema: `src/fsdp_unsloth/common/configs/base_config.yaml`
- Examples:
  - FSDP: `scripts/configs/fsdp/llama-7b.yaml`
  - Unsloth: `scripts/configs/unsloth/finance-alpaca.yaml`
  - Smoke tests: `scripts/configs/{fsdp,unsloth}/smoke.yaml`
Backend selection is explicit:

- Set `backend: fsdp` or `backend: unsloth` at the top of the config.
- CLI override is available via `--backend` (an alias of `--strategy`).

Key fields:

- `training.*` (batch sizes, lr, steps)
- `checkpoint.save_interval`, `checkpoint.output_dir`
- `logging.log_interval`, `logging.wandb_project`
- `model.name`, `model.max_length`, `model.load_in_4bit`, `model.hf_token`
- `fsdp.mixed_precision` and other sharding params
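For orientation, a config combining these fields might look roughly like the skeleton below. The values and the exact field names under `training.*` are placeholders; consult `base_config.yaml` and the example configs for the authoritative schema.

```yaml
backend: fsdp                     # or: unsloth

model:
  name: meta-llama/Llama-2-7b-hf  # illustrative model ID
  max_length: 2048
  load_in_4bit: false
  hf_token: null                  # required for gated models

training:
  # batch sizes, lr, steps: see base_config.yaml for the exact field names

checkpoint:
  save_interval: 200
  output_dir: outputs/

logging:
  log_interval: 10
  wandb_project: fsdp-unsloth

fsdp:
  mixed_precision: bf16           # plus other sharding params
```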
## Training

- Using the installed CLI (recommended):

  ```bash
  fsdp-train --config scripts/configs/fsdp/smoke.yaml --smoke
  fsdp-train --config scripts/configs/unsloth/smoke.yaml --backend unsloth --smoke
  ```

- Via the provided script wrapper (equivalent):

  ```bash
  python scripts/train.py --config scripts/configs/fsdp/llama-7b.yaml
  ```

- Multi-GPU (torchrun):

  ```bash
  make train-fsdp-mgpu NGPU=8
  make train-unsloth-mgpu NGPU=8
  ```

  Optional NCCL hints for multi-node networking are included (commented out) in the Makefile.
## Checkpoints

- FSDP saves sharded checkpoints into folders like `checkpoint_<step>/` under `checkpoint.output_dir`.
- Unsloth saves a single-file checkpoint `checkpoint_<step>.bin`.
- Auto-resume (when `--resume` is not provided): the trainer auto-detects the latest checkpoint in `checkpoint.output_dir` (see the sketch after this list).
- SageMaker: set `CheckpointConfig` (S3 URI); the trainer then syncs to `SM_CHECKPOINT_DIR` automatically.
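For illustration, latest-checkpoint discovery can be sketched as below, assuming the `checkpoint_<step>/` and `checkpoint_<step>.bin` naming described above; `find_latest_checkpoint` is an assumed name and the trainer's actual resume logic may differ in details.

```python
import re
from pathlib import Path


def find_latest_checkpoint(output_dir: str) -> Path | None:
    """Return the checkpoint_<step> directory or .bin file with the highest step."""
    pattern = re.compile(r"checkpoint_(\d+)(?:\.bin)?$")
    best_step, best_path = -1, None
    for path in Path(output_dir).glob("checkpoint_*"):
        match = pattern.fullmatch(path.name)
        if match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path
```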
## Data

- HF dataset: `data.name` is an HF dataset ID; streaming is supported.
- S3 parquet: `data.name` = `s3://bucket/path/file.parquet` (parquet only); uses `s3fs`.
- Prompt templating: define `data.prompt_template` using `{instruction}`, `{input}`, `{output}`, and `{eos_token}`.
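Conceptually, the template is a Python format string filled per record. The sketch below is illustrative only: `render_prompt` is an assumed helper name, the Alpaca-style template is just an example, and the real batched mapping lives in the dataloader.

```python
def render_prompt(template: str, record: dict, eos_token: str) -> str:
    """Fill a data.prompt_template string with one dataset record."""
    return template.format(
        instruction=record.get("instruction", ""),
        input=record.get("input", ""),
        output=record.get("output", ""),
        eos_token=eos_token,
    )


template = (
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{output}{eos_token}"
)
print(render_prompt(template, {"instruction": "Summarize.", "input": "Spot notes", "output": "OK"}, "</s>"))
```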
## SageMaker Spot training

- Spot preemption triggers SIGTERM; the trainer catches it and performs an emergency checkpoint save.
- Recommended GPU instances:
  - FSDP: `p4d.24xlarge` (A100, 8x GPU) or `p5.48xlarge` (H100) for larger models
  - Unsloth: `g5.12xlarge` (A10G) or `p4d.24xlarge`, depending on model size
- Use `CheckpointConfig` for S3 checkpointing and enable Managed Spot Training. Ensure `MaxWaitTimeInSeconds > MaxRuntimeInSeconds` to allow for queueing.
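A hedged sketch of such a job using the SageMaker Python SDK follows; the image URI, role ARN, bucket, instance choice, and timeouts are placeholders, and this snippet is not part of the template itself.

```python
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="<account>.dkr.ecr.<region>.amazonaws.com/unsloth-fsdp-training:latest",
    role="arn:aws:iam::<account>:role/<sagemaker-execution-role>",
    instance_type="p4d.24xlarge",
    instance_count=1,
    # Managed Spot Training: max_wait must exceed max_run to allow for queueing.
    use_spot_instances=True,
    max_run=24 * 3600,
    max_wait=36 * 3600,
    # CheckpointConfig: checkpoints written locally are synced to this S3 URI,
    # so a preempted job can resume from the latest checkpoint.
    checkpoint_s3_uri="s3://<bucket>/checkpoints/",
    checkpoint_local_path="/opt/ml/checkpoints",
)
estimator.fit()
```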
## Smoke tests

Minimal runs to validate wiring and error handling:

```bash
make train-unsloth-smoke
make train-fsdp-smoke   # requires GPU
```

## Checkpoint conversion

Use `scripts/tools/convert_checkpoint.py` to convert between FSDP sharded directories and Unsloth single-file checkpoints.
- Convert FSDP shards to a single Unsloth file:

  ```bash
  python -m scripts.tools.convert_checkpoint \
    --source_path outputs/checkpoint_1000 \
    --target_path outputs/unsloth_1000.bin \
    --strategy fsdp --target_strategy unsloth
  ```

- Convert an Unsloth file to an FSDP shards directory:

  ```bash
  python -m scripts.tools.convert_checkpoint \
    --source_path outputs/unsloth_1000.bin \
    --target_path outputs/fsdp_1000 \
    --strategy unsloth --target_strategy fsdp
  ```

## Inference

Run inference with an optional checkpoint (single file or shard directory):
```bash
python scripts/infer.py \
  --config scripts/configs/unsloth/smoke.yaml \
  --prompt "Hello" \
  --checkpoint outputs/unsloth_1000.bin
```

## Contributing

Prereqs:
- Python 3.10+, CUDA drivers for GPU runs
- HF credentials (`HF_TOKEN`) if using gated models/datasets
- AWS credentials for S3 (optional, for S3 paths)
Workflow:

- Branch from `main` and implement changes.
- Run format/lint/tests:

  ```bash
  pre-commit run --all-files
  pytest -v
  ```
- Submit a PR with a concise description and test plan
## License

This project is licensed under the Apache License 2.0 (see `LICENSE`).
## Roadmap

- [docs] Add SageMaker job submission examples (Estimator config, Spot flags, CheckpointConfig)
- [fsdp] Add richer sharding options in the `fsdp` config (activation checkpointing policies, CPU offload)
- [resume] Write a `latest` pointer file after each save to speed up auto-resume discovery
- [inference] Validate and document `scripts/infer.py` for both strategies
- [tests] Add CPU-only unit tests and a small CI workflow for lint + schema checks
- [monitoring] Add optional CloudWatch/W&B guidance and Makefile targets for metrics sync
- [datasets] Add JSONL and multi-file S3 dataset examples