[RFC] Users, Training abstractions, and Repo design #54
Conversation

Below is an RFC on how to design the UX and code layout for tune, along with a description of the users we want to aim for.
# Repo Design

Finally, a note on repo design. I think all of the components should be grouped together so that they serve as a library within the library, both for the recipes to access and for User 3 to access directly with clean imports. The components should be grouped according to the getter functions.
With this proposed repo structure, can you provide an example of the exact command User 1 would use?
Maybe something like `torchrun finetune_llm.py --defaults finetune_llama2_on_alpaca.yaml --num_devices 8 --loss cross_entropy`, where the user launches the script first and then chooses a set of defaults that we parse unless they're explicitly overwritten. The alternative would be that we provide our own launcher, something like `tune finetune_llama2_on_alpaca --num_devices 8 --loss cross_entropy`, which would wrap torchrun and parse the yaml that links to the recipe, but I'm afraid this approach would bias users towards building yaml files instead of scripts.
- copy llama2_finetune.py to my_finetune.py
- edit my_finetune.py directly with no need to change any library files

To enable this user it's very important that our recipes are hackable, self-contained, and easily readable. For this reason we encourage copying and pasting the training loop as we add more recipes.
And short. Our recipes should ideally be "nano" sized.
I agree with this. While we're iterating, the initial recipes will be longer, but we expect to abstract out the boilerplate so that the recipes end up close in length to the example pseudocode provided. Do you consider the amount of code in the Trainer abstraction pseudocode nano sized?
Love the overall design approach & categorization of users. Left a couple of comments for your consideration, thanks!
To be able to understand how we want to design the abstractions and the repo, it's important that we define how we want our users to interact with the library. One big complaint we heard from a number of users was around the black-box nature of the fine-tuning libraries they were using. On the other hand, we also spoke to users who were very new to ML and were happy to just directly launch scripts without ever looking inside. To address this, I propose we model three levels of users and build our library to allow users to advance through these stages as they grow in requirements.

### User 1:
This user just wants to use a recipe on their own dataset. They may want to play around with a couple of parameter changes, but this user would be happiest with CLI-level control and access to good default values for a particular recipe.
what do we mean by parameter changes concretely - are these flags that would be exposed by the trainer via CLI / config?
By parameter changes I mean common hyper-parameters such as learning rate. Essentially the trainer would have a set of parameters that we think are the most likely to need to be controlled by the user and we'd make those args available. We'd provide default values for some specific recipe examples but any of these args could be accessed via CLI.
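To make that concrete, here is a hypothetical example of what "good defaults for a particular recipe" might look like; the recipe name, parameter names, and values are illustrative assumptions, not a proposed interface:

```
# Hypothetical defaults shipped with a finetune_llama2_on_alpaca recipe example.
# Each key maps to a CLI flag the user can override, e.g. --lr 1e-5.
ALPACA_FINETUNE_DEFAULTS = {
    "lr": 2e-5,        # learning rate
    "batch_size": 8,
    "epochs": 3,
    "dtype": "bf16",   # mixed precision setting
    "seed": 42,
}
```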
model = get_model(...)  # get from our models, or our PEFT models
model = tune.distributed(model, num_devices)  # FSDP wrap
dataset = get_dataset(...)  # one of our datasets, which point to hf datasets
Thinking about these APIs and whether they're the right level of abstraction in general. I see how something like get_model is concise, but I wonder if that hides away the model architecture and makes things harder to reason about and hack into.
For example, with this proposal a LoRA might be instantiated as follows:
model = get_model(llama2, peft=True) # or similar
versus
model = LoRATransformerDecoder(...) # user would specify num_layers, num_heads, etc. similar to Llama params
Similar w/ tune.distributed - on one hand I can see this being valuable, as we might not just FSDP wrap but also set defaults like sharded checkpointing, etc. But I wonder how much this dissuades configurability - for example, configuring auto wrapping w/ FSDP is a common pain point in higher-level trainers (granted that might be going away with per-parameter FSDP).
The idea for get_model() is just to make model implementations we already configured accessible from the CLI. So you wouldn't do model = get_model("llama2", peft=True) but instead do model = get_model("LoRaTransformerDecoder"). We could be clever and make the model itself configurable with get_model("LoRaTransformerDecoder", **kwargs), but I think at that point the user should be encouraged to move past the CLI and copy and paste the recipe. I think in general we should move users to level 2 the moment they want configurability and not try to lock ourselves into an over-engineered config system.
^ We should demonstrate clear examples for both. get_model caters to User 1, and the creation of a model like LoRaTransformerDecoder caters to User 2 or 3.
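To illustrate how the two paths could coexist, here is a minimal sketch of a string-keyed registry behind a get_model() getter. The class names, registry keys, and configurations are placeholder assumptions for illustration, not the actual API:

```
import torch.nn as nn

# Placeholder model classes standing in for the real implementations.
class TransformerDecoder(nn.Module):
    def __init__(self, num_layers: int, num_heads: int):
        super().__init__()
        self.num_layers, self.num_heads = num_layers, num_heads

class LoRATransformerDecoder(TransformerDecoder):  # stand-in PEFT variant
    pass

# String-keyed registry of pre-configured builders, so a name typed on the
# CLI (or read from a config) maps directly to a model instance.
_MODEL_REGISTRY = {
    "llama2_7b": lambda: TransformerDecoder(num_layers=32, num_heads=32),
    "lora_llama2_7b": lambda: LoRATransformerDecoder(num_layers=32, num_heads=32),
}

def get_model(name: str) -> nn.Module:
    """Return a pre-configured model by name; error out on unknown names."""
    if name not in _MODEL_REGISTRY:
        raise ValueError(f"Unknown model '{name}', choose from {list(_MODEL_REGISTRY)}")
    return _MODEL_REGISTRY[name]()

model = get_model("lora_llama2_7b")                          # User 1: CLI/config path
model = LoRATransformerDecoder(num_layers=32, num_heads=32)  # User 2/3: direct construction
```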
evaluator = get_eval(...)  # eval harness to use

opt = get_optimizer(...)
opt = tune.wrapped_optimizer(opt, num_devices)  # autocast + distributed logic
nit: what does the optimizer have to do w/autocast logic?
If you train with fp16 you have to use a gradient scaler, and you don't call optimizer.step() but scaler.step(optimizer). This wrapper could hide those inconsistencies from the training loop.
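A minimal sketch of that kind of wrapper, assuming torch.cuda.amp.GradScaler is used under the hood; the class and method names here are illustrative, not the proposed tune API:

```
import torch

class WrappedOptimizer:
    """Hide GradScaler bookkeeping so the recipe always calls backward()/step()."""

    def __init__(self, optimizer: torch.optim.Optimizer, use_fp16: bool):
        self.optimizer = optimizer
        # With enabled=False the scaler is a pass-through, so the same code
        # path also works for fp32/bf16 runs.
        self.scaler = torch.cuda.amp.GradScaler(enabled=use_fp16)

    def backward(self, loss: torch.Tensor) -> None:
        self.scaler.scale(loss).backward()

    def step(self) -> None:
        self.scaler.step(self.optimizer)  # falls back to optimizer.step() when disabled
        self.scaler.update()

    def zero_grad(self) -> None:
        self.optimizer.zero_grad()
```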
data = data.to(device)

with tune.cast(dtype):  # choose the appropriate autocast if distributed
There might be some clarification needed here. I know this isn't a concrete proposal to add a tune.cast context manager, but autocast and the FSDP mixed precision APIs compose well together, and using one or the other or composing both means different things in a training context. Happy to clarify more.
Yeah, I think those details are out of scope for this RFC, but the main idea is just to provide unified mixed precision code that doesn't change depending on whether or not we're doing distributed training.
Hmm, I don't think I understand - why would I have mixed precision code if I'm not doing mixed precision training?
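One possible reading of tune.cast(dtype) is "autocast when a low-precision dtype is requested, otherwise a no-op", so the same recipe line works for fp32, bf16, and fp16 runs. A sketch under that assumption, which deliberately does not address how it composes with FSDP's MixedPrecision:

```
import contextlib
import torch

def cast(dtype: torch.dtype, device_type: str = "cuda"):
    """Return an autocast context for low-precision dtypes, else a no-op."""
    if dtype in (torch.float16, torch.bfloat16):
        return torch.autocast(device_type=device_type, dtype=dtype)
    return contextlib.nullcontext()  # full precision: the context does nothing
```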
logger(model, l, step, epoch)
checkpoint(model, opt, dataloader, step, epoch)
evaluator(model, step, epoch)
so this is mostly a training utilities based approach, where we offer a bunch of convenient utils and users roll their own training loop (and we provide a couple they can copy paste from or use as starter examples). At what point do we consider consolidating on TNT here and not rolling our own training utilities?
Is TNT a library we want to take a dependency on? This design was made under the assumption that we didn't want a dependency on a trainer. I do think the "open for-loop" design with utils is more extensible than the trainer-class approach that's very popular, but I don't want to reinvent the wheel for no reason.
recipes/
finetune_llm.py

defaults/
nit: should we call this configs?
I like configs or examples
I'm open to it
# Users

To be able to understand how we want to design the abstractions and the repo, it's important that we define how we want our users to interact with the library. One big complaint we heard from a number of users was around the black-box nature of the fine-tuning libraries they were using. On the other hand, we also spoke to users who were very new to ML and were happy to just directly launch scripts without ever looking inside. To address this, I propose we model three levels of users and build our library to allow users to advance through these stages as they grow in requirements.

### User 1:
I think this is directionally right, but we should simplify these definitions:
- User 1: Clone, modify and run configs
- User 2: Trivial recipe changes (eg: support for new dataset, support for models with similar architectures etc) which usually includes cloning and modifying existing examples
- User 3: Non-trivial code changes (eg: support for new model architectures, new training techniques etc) which includes writing new recipes using existing components
I will include pseudocode here for how we can design our recipes to support the user profiles listed above.

```
parser = argparse.ArgumentParser(...)  # support all getters, lr, batchsize, epoch, dtype, num_devices, seed
```
I'm guessing this is just an example? Ideally we would focus less on argparse and more on yaml configs (though some support for argparse makes sense).
Currently using argparse to define the options, and it can read in a yaml for all of the values. So a specific recipe does have a yaml that you can pass in with the "--config" flag, or you can override specific values with command line arguments.
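A minimal sketch of how that "--config yaml plus CLI override" behavior could be wired up with argparse and pyyaml; the flag names and merge strategy below are assumptions for illustration, not the recipe's actual implementation:

```
import argparse
import yaml

parser = argparse.ArgumentParser()
parser.add_argument("--config", type=str, default=None)  # path to a defaults yaml
parser.add_argument("--lr", type=float, default=2e-5)
parser.add_argument("--num-devices", type=int, default=1)

# First pass: only read --config so we know which yaml to load.
known_args, _ = parser.parse_known_args()
if known_args.config is not None:
    with open(known_args.config) as f:
        parser.set_defaults(**yaml.safe_load(f))  # yaml values become the new defaults

args = parser.parse_args()  # flags given explicitly on the CLI still win
```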
The above example, if we agree on it, should act as a guide for the direction of our recipe code. Initially none of the getter functions and utils will be built, so there will be much more boilerplate, but we can reduce this with time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
> we can reduce this with time

I think we should do this as part of the initial launch. It can come at the cost of additional features, but whatever features we do provide should have well-defined utilities.
On that note, do we have a concrete set of features/utilities we are targeting for a very first version?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't have a concrete list. There has been a lot of discussion since this RFC was put up on the value of simple utilities vs transparency. We have been iteratively adding a few utilities so far but it could be worth making an explicit list with designs.