Hi,
I was trying to reproduce the LLaMA-7B results with 1 gist token from scratch, following the training instructions in the README. I ran the script below on 4 A100-80GB GPUs:
TAG="train80g"
port=$(shuf -i25000-30000 -n1)
deepspeed --master_port $port --num_gpus=4 --no_local_rank \
--module src.train \
+model=llama-7b wandb.tag=$TAG \
training.deepspeed=ds_configs/stage3.json \
training.gist.condition=gist \
training.gist.num_gist_tokens=1However, the final results after 3 epochs are much lower than the reported ones in the paper. I got seen 51.24, unseen 42.01, human 19.00 for ROUGE-L. I tried training for longer epochs but it didn't help with unseen and human ROUGE-L results. I did not change anything in the training config other than the wandb account.
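One thing I wanted to confirm is the effective batch size, since it scales with the GPU count and might differ from the setup used for the paper. Below is a minimal sanity-check sketch of how I think about it; the per-device batch size and gradient accumulation steps are placeholder values, not the repo's actual config values:

```python
# Hypothetical sanity check: the effective batch size seen by the optimizer is
# per-device batch size x number of GPUs x gradient accumulation steps.
# The first two values below are placeholders; the real ones are in the training config.
per_device_train_batch_size = 4   # assumed, not the repo's actual default
gradient_accumulation_steps = 8   # assumed, not the repo's actual default
num_gpus = 4                      # from the deepspeed launch above

effective_batch_size = (
    per_device_train_batch_size * num_gpus * gradient_accumulation_steps
)
print(f"Effective batch size: {effective_batch_size}")  # 128 with these placeholders
```

If the paper's runs used a different number of GPUs, I assume the per-device batch size or accumulation steps would need adjusting to keep this product the same.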
I also evaluated the 3 provided checkpoints (gist, pos_control, neg_control), and the results are consistent with the paper (< 0.1 difference in ROUGE-L) for all of them, so the evaluation code appears to be working correctly. Could you help double-check whether the training setup above is correct, and do you have any suggestions for reproducing the LLaMA results from the paper?
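For reference, when comparing against the reported numbers I computed ROUGE-L with the `rouge_score` package roughly as follows; I am not certain this matches the repo's evaluation code exactly (e.g. whether stemming is applied), so treat it as an assumption on my part:

```python
from rouge_score import rouge_scorer

# ROUGE-L F-measure between a model prediction and a reference, as I used it
# for my own comparison; the stemming setting is an assumption.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

prediction = "the model generated summary"   # placeholder text
reference = "the reference summary"          # placeholder text

scores = scorer.score(reference, prediction)  # signature is score(target, prediction)
print(scores["rougeL"].fmeasure)
```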