Hi,
I was trying to reproduce the LLaMA-7B results with 1 gist token from scratch, following the training instructions in the README. I ran the script below on 4 A100-80GB GPUs:
TAG="train80g"
port=$(shuf -i25000-30000 -n1)
deepspeed --master_port $port --num_gpus=4 --no_local_rank \
--module src.train \
+model=llama-7b wandb.tag=$TAG \
training.deepspeed=ds_configs/stage3.json \
training.gist.condition=gist \
training.gist.num_gist_tokens=1However, the final results after 3 epochs are much lower than the reported ones in the paper. I got seen 51.24, unseen 42.01, human 19.00 for ROUGE-L. I tried training for longer epochs but it didn't help with unseen and human ROUGE-L results. I did not change anything in the training config other than the wandb account.
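One thing I wanted to confirm is the effective batch size, since it scales with the GPU count and might differ from the setup used for the paper. Below is a minimal sanity-check sketch of how I think about it; the per-device batch size and gradient accumulation steps are placeholder values, not the repo's actual config values:

```python
# Hypothetical sanity check: the effective batch size seen by the optimizer is
# per-device batch size x number of GPUs x gradient accumulation steps.
# The first two values below are placeholders; the real ones are in the training config.
per_device_train_batch_size = 4   # assumed, not the repo's actual default
gradient_accumulation_steps = 8   # assumed, not the repo's actual default
num_gpus = 4                      # from the deepspeed launch above

effective_batch_size = (
    per_device_train_batch_size * num_gpus * gradient_accumulation_steps
)
print(f"Effective batch size: {effective_batch_size}")  # 128 with these placeholders
```

If the paper's runs used a different number of GPUs, I assume the per-device batch size or accumulation steps would need adjusting to keep this product the same.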
I also evaluated the 3 provided checkpoints (gist, pos_control, neg_control), and the results are consistent with the paper (< 0.1 difference in ROUGE-L) for all of them, so the evaluation code appears to be working correctly. Could you help double-check whether the training setup above is correct, and do you have any suggestions for reproducing the LLaMA results from the paper?
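For reference, when comparing against the reported numbers I computed ROUGE-L with the `rouge_score` package roughly as follows; I am not certain this matches the repo's evaluation code exactly (e.g. whether stemming is applied), so treat it as an assumption on my part:

```python
from rouge_score import rouge_scorer

# ROUGE-L F-measure between a model prediction and a reference, as I used it
# for my own comparison; the stemming setting is an assumption.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

prediction = "the model generated summary"   # placeholder text
reference = "the reference summary"          # placeholder text

scores = scorer.score(reference, prediction)  # signature is score(target, prediction)
print(scores["rougeL"].fmeasure)
```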