[RFC] Users, Training abstractions, and Repo design #54
Conversation

Below is an RFC on how to design the UX and code layout for tune, along with a description of the users we want to aim for.
# Repo Design

Finally, a note on repo design. I think all of the components should be grouped together so that they serve as a library within the library, both for the recipes to access and for User 3 to access directly with clean imports. The components should be grouped according to the getter functions.
With this proposed repo structure, can you provide an example of the exact command User 1 would use?
Maybe something like `torchrun finetune_llm.py --defaults finetune_llama2_on_alpaca.yaml --num_devices 8 --loss cross_entropy`, where the user launches the script first and then chooses a set of defaults that we parse unless they're explicitly overwritten. The alternative would be that we provide our own launcher, something like `tune finetune_llama2_on_alpaca --num_devices 8 --loss cross_entropy`, which would wrap torchrun and parse the yaml that links to the recipe, but I'm afraid this approach would bias users towards building yaml files instead of scripts.
- copy llama2_finetune.py to my_finetune.py
- edit my_finetune.py directly with no need to change any library files

To enable this user it's very important that our recipes are hackable, self-contained, and easily readable. For this reason we encourage copying and pasting the training loop as we add more recipes.
And short. Our recipes should ideally be "nano" sized.
I agree with this. While we're iterating, the initial recipes will be longer, but we expect to abstract out the boilerplate so that the recipes end up close in length to the example pseudocode provided. Do you consider the amount of code in the Trainer abstraction pseudocode nano sized?
Love the overall design approach & categorization of users. Left a couple of comments for your consideration, thanks!
To be able to understand how we want to design the abstractions and the repo, it's important that we define how we want our users to interact with the library. One big complaint we heard from a number of users was around the black-box nature of the fine-tuning libraries they were using. On the other hand, we also spoke to users who were very new to ML and were happy to just directly launch scripts without ever looking inside. To address this, I propose we model three levels of users and build our library to allow users to advance through these stages as they grow in requirements.

### User 1:
This user just wants to use a recipe on their own dataset. They may want to play around with a couple of parameter changes, but this user would be happiest with CLI-level control and access to good default values for a particular recipe.
what do we mean by parameter changes concretely - are these flags that would be exposed by the trainer via CLI / config?
By parameter changes I mean common hyper-parameters such as learning rate. Essentially the trainer would have a set of parameters that we think are the most likely to need to be controlled by the user and we'd make those args available. We'd provide default values for some specific recipe examples but any of these args could be accessed via CLI.
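To make that concrete, here is a hypothetical example of what "good defaults for a particular recipe" might look like; the recipe name, parameter names, and values are illustrative assumptions, not a proposed interface:

```
# Hypothetical defaults shipped with a finetune_llama2_on_alpaca recipe example.
# Each key maps to a CLI flag the user can override, e.g. --lr 1e-5.
ALPACA_FINETUNE_DEFAULTS = {
    "lr": 2e-5,        # learning rate
    "batch_size": 8,
    "epochs": 3,
    "dtype": "bf16",   # mixed precision setting
    "seed": 42,
}
```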
model = get_model(...)  # get from our models, or our PEFT models
model = tune.distributed(model, num_devices)  # FSDP wrap
dataset = get_dataset(...)  # one of our datasets, which point to hf datasets
Thinking about these APIs and whether they're the right level of abstraction in general. I see how something like get_model is concise, but I wonder if that hides away the model architecture and makes things harder to reason about and hack into.
For example, with this proposal a LoRA might be instantiated as follows:
model = get_model(llama2, peft=True) # or similar
versus
model = LoRATransformerDecoder(...) # user would specify num_layers, num_heads, etc. similar to Llama params
Similar w/ tune.distributed - on one hand I can see this being valuable, as we might not just FSDP wrap but also set defaults like sharded checkpointing, etc. But I wonder how much this dissuades configurability - for example, configuring auto wrapping w/ FSDP is a common pain point in higher-level trainers (granted that might be going away with per-parameter FSDP).
The idea for get_model() is just to make model implementations we already configured accessible from the CLI. So you wouldn't do model = get_model("llama2", peft=True) but instead do model = get_model("LoRaTransformerDecoder"). We could be clever and make the model itself configurable with get_model("LoRaTransformerDecoder", **kwargs), but I think at that point the user should be encouraged to move past the CLI and copy and paste the recipe. I think in general we should move users to level 2 the moment they want configurability and not try to lock ourselves into an over-engineered config system.
^ We should demonstrate clear examples for both. get_model caters to User 1, and the creation of a model like LoRaTransformerDecoder caters to User 2 or 3.
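To illustrate how the two paths could coexist, here is a minimal sketch of a string-keyed registry behind a get_model() getter. The class names, registry keys, and configurations are placeholder assumptions for illustration, not the actual API:

```
import torch.nn as nn

# Placeholder model classes standing in for the real implementations.
class TransformerDecoder(nn.Module):
    def __init__(self, num_layers: int, num_heads: int):
        super().__init__()
        self.num_layers, self.num_heads = num_layers, num_heads

class LoRATransformerDecoder(TransformerDecoder):  # stand-in PEFT variant
    pass

# String-keyed registry of pre-configured builders, so a name typed on the
# CLI (or read from a config) maps directly to a model instance.
_MODEL_REGISTRY = {
    "llama2_7b": lambda: TransformerDecoder(num_layers=32, num_heads=32),
    "lora_llama2_7b": lambda: LoRATransformerDecoder(num_layers=32, num_heads=32),
}

def get_model(name: str) -> nn.Module:
    """Return a pre-configured model by name; error out on unknown names."""
    if name not in _MODEL_REGISTRY:
        raise ValueError(f"Unknown model '{name}', choose from {list(_MODEL_REGISTRY)}")
    return _MODEL_REGISTRY[name]()

model = get_model("lora_llama2_7b")                          # User 1: CLI/config path
model = LoRATransformerDecoder(num_layers=32, num_heads=32)  # User 2/3: direct construction
```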
evaluator = get_eval(...)  # eval harness to use

opt = get_optimizer(...)
opt = tune.wrapped_optimizer(opt, num_devices)  # autocast + distributed logic
nit: what does the optimizer have to do w/autocast logic?
If you train with fp16 you have to use a gradient scaler, and you don't call optimizer.step() but scaler.step(optimizer). This wrapper could hide those inconsistencies from the training loop.
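A minimal sketch of that kind of wrapper, assuming torch.cuda.amp.GradScaler is used under the hood; the class and method names here are illustrative, not the proposed tune API:

```
import torch

class WrappedOptimizer:
    """Hide GradScaler bookkeeping so the recipe always calls backward()/step()."""

    def __init__(self, optimizer: torch.optim.Optimizer, use_fp16: bool):
        self.optimizer = optimizer
        # With enabled=False the scaler is a pass-through, so the same code
        # path also works for fp32/bf16 runs.
        self.scaler = torch.cuda.amp.GradScaler(enabled=use_fp16)

    def backward(self, loss: torch.Tensor) -> None:
        self.scaler.scale(loss).backward()

    def step(self) -> None:
        self.scaler.step(self.optimizer)  # falls back to optimizer.step() when disabled
        self.scaler.update()

    def zero_grad(self) -> None:
        self.optimizer.zero_grad()
```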
data = data.to(device)

with tune.cast(dtype):  # choose the appropriate autocast if distributed
There might be some clarification needed here. I know this isn't a concrete proposal to add a tune.cast context manager, but autocast and the FSDP mixed precision APIs compose well together, and using one or the other or composing both means different things in a training context. Happy to clarify more.
Yeah, I think those details are out of scope for this RFC, but the main idea is just to provide unified mixed precision code that doesn't change depending on whether or not we're doing distributed training.
Hmm, I don't think I understand - why would I have mixed precision code if I'm not doing mixed precision training?
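One possible reading of tune.cast(dtype) is "autocast when a low-precision dtype is requested, otherwise a no-op", so the same recipe line works for fp32, bf16, and fp16 runs. A sketch under that assumption, which deliberately does not address how it composes with FSDP's MixedPrecision:

```
import contextlib
import torch

def cast(dtype: torch.dtype, device_type: str = "cuda"):
    """Return an autocast context for low-precision dtypes, else a no-op."""
    if dtype in (torch.float16, torch.bfloat16):
        return torch.autocast(device_type=device_type, dtype=dtype)
    return contextlib.nullcontext()  # full precision: the context does nothing
```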
logger(model, l, step, epoch)
checkpoint(model, opt, dataloader, step, epoch)
evaluator(model, step, epoch)
so this is mostly a training utilities based approach, where we offer a bunch of convenient utils and users roll their own training loop (and we provide a couple they can copy paste from or use as starter examples). At what point do we consider consolidating on TNT here and not rolling our own training utilities?
Is TNT a library we want to take a dependency on? This design was made under the assumption that we didn't want a dependency on a trainer. I do think the "open for-loop" design with utils is more extensible than the trainer-class approach that's very popular, but I don't want to reinvent the wheel for no reason.
recipes/
finetune_llm.py

defaults/
nit: should we call this configs?
I like configs or examples
I'm open to it
# Users

To be able to understand how we want to design the abstractions and the repo, it's important that we define how we want our users to interact with the library. One big complaint we heard from a number of users was around the black-box nature of the fine-tuning libraries they were using. On the other hand, we also spoke to users who were very new to ML and were happy to just directly launch scripts without ever looking inside. To address this, I propose we model three levels of users and build our library to allow users to advance through these stages as they grow in requirements.

### User 1:
I think this is directionally right, but we should simplify these definitions:
- User 1: Clone, modify and run configs
- User 2: Trivial recipe changes (eg: support for new dataset, support for models with similar architectures etc) which usually includes cloning and modifying existing examples
- User 3: Non-trivial code changes (eg: support for new model architectures, new training techniques etc) which includes writing new recipes using existing components
I will include pseudocode here for how we can design our recipes to support the user profiles listed above.

```
parser = argparse.ArgumentParser(...)  # support all getters, lr, batchsize, epoch, dtype, num_devices, seed
```
I'm guessing this is just an example? Ideally we would focus less on argparse and more on yaml configs (though some support for argparse makes sense).
Currently using argparse to define the options, and it can read in a yaml for all of the values. So a specific recipe does have a yaml that you can pass in with the "--config" flag, or you can override specific values with command line arguments.
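A minimal sketch of how that "--config yaml plus CLI override" behavior could be wired up with argparse and pyyaml; the flag names and merge strategy below are assumptions for illustration, not the recipe's actual implementation:

```
import argparse
import yaml

parser = argparse.ArgumentParser()
parser.add_argument("--config", type=str, default=None)  # path to a defaults yaml
parser.add_argument("--lr", type=float, default=2e-5)
parser.add_argument("--num-devices", type=int, default=1)

# First pass: only read --config so we know which yaml to load.
known_args, _ = parser.parse_known_args()
if known_args.config is not None:
    with open(known_args.config) as f:
        parser.set_defaults(**yaml.safe_load(f))  # yaml values become the new defaults

args = parser.parse_args()  # flags given explicitly on the CLI still win
```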
The above example, if we agree on it, should act as a guide for the direction of our recipe code. Initially none of the getter functions and utils will be built, so there will be much more boilerplate, but we can reduce this with time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
> we can reduce this with time

I think we should do this as part of the initial launch. It can come at the cost of additional features, but whatever features we do provide should have well-defined utilities.
On that note, do we have a concrete set of features/utilities we are targeting for a very first version?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't have a concrete list. There has been a lot of discussion since this RFC was put up on the value of simple utilities vs transparency. We have been iteratively adding a few utilities so far but it could be worth making an explicit list with designs.