Open
Labels
bug: Something isn't working
Description
Is there an existing issue for this bug?
- I have searched the existing issues
🐛 Describe the bug
When I start training, the following error occurs. How can I fix it? The system is Ubuntu 22.04.
(llama) root@autodl-container-ad594b8360-ad0d4c6e:~/autodl-tmp/ColossalAI/applications/ColossalChat/examples/training_scripts# bash train_sft.sh
GPU Memory Usage:
0 1 MiB
1 1 MiB
Now CUDA_VISIBLE_DEVICES is set to:
CUDA_VISIBLE_DEVICES=0,1
/root/miniconda3/envs/llama/bin/colossalai
/root/miniconda3/envs/llama/bin/python
/bin/bash: line 1: export: `=/usr/bin/supervisord': not a valid identifier
Error: failed to run torchrun --nproc_per_node=2 --nnodes=1 --node_rank=0 --master_addr=127.0.0.1 --master_port=31312 train_sft.py --pretrain /root/autodl-tmp/model/qwen/Qwen2-1___5B-Instruct --tokenizer_dir /root/autodl-tmp/model/qwen/Qwen2-1___5B-Instruct --save_interval 5 --dataset /root/autodl-tmp/ColossalAI/applications/ColossalChat/examples/data_preparation_scripts/arrow/part-00000 /root/autodl-tmp/ColossalAI/applications/ColossalChat/examples/data_preparation_scripts/arrow/part-00001 /root/autodl-tmp/ColossalAI/applications/ColossalChat/examples/data_preparation_scripts/arrow/part-00002 /root/autodl-tmp/ColossalAI/applications/ColossalChat/examples/data_preparation_scripts/arrow/part-00003 /root/autodl-tmp/ColossalAI/applications/ColossalChat/examples/data_preparation_scripts/arrow/part-00004 /root/autodl-tmp/ColossalAI/applications/ColossalChat/examples/data_preparation_scripts/arrow/part-00005 /root/autodl-tmp/ColossalAI/applications/ColossalChat/examples/data_preparation_scripts/arrow/part-00006 /root/autodl-tmp/ColossalAI/applications/ColossalChat/examples/data_preparation_scripts/arrow/part-00007 /root/autodl-tmp/ColossalAI/applications/ColossalChat/examples/data_preparation_scripts/arrow/part-00008 /root/autodl-tmp/ColossalAI/applications/ColossalChat/examples/data_preparation_scripts/arrow/part-00009 --plugin zero2 --batch_size 2 --max_epochs 1 --accumulation_steps 4 --lr 5e-5 --max_len 4096 --grad_checkpoint --save_path /root/autodl-tmp/ColossalAI/applications/ColossalChat/examples/logsSFT-2024-08-23-19-13-31 --config_file /root/autodl-tmp/ColossalAI/applications/ColossalChat/examples/logsSFT-2024-08-23-19-13-31.json --log_dir /root/autodl-tmp/ColossalAI/applications/ColossalChat/examples/logsSFT-2024-08-23-19-13-31 on localhost, is localhost: True, exception: Encountered a bad command exit code!
Command: 'cd /root/autodl-tmp/ColossalAI/applications/ColossalChat/examples/training_scripts && export ="/usr/bin/supervisord" SHELL="/bin/bash" NV_LIBCUBLAS_VERSION="12.1.0.26-1" NVIDIA_VISIBLE_DEVICES="GPU-f80c38c0-bbe5-0581-78f3-c584fea4b8c2,GPU-d2d56e77-1eaf-3254-c719-013dea3447cd" NV_NVML_DEV_VERSION="12.1.55-1" NV_CUDNN_PACKAGE_NAME="libcudnn8" NV_LIBNCCL_DEV_PACKAGE="libnccl-dev=2.17.1-1+cuda12.1" CONDA_EXE="/root/miniconda3/bin/conda" NV_LIBNCCL_DEV_PACKAGE_VERSION="2.17.1-1" HOSTNAME="autodl-container-ad594b8360-ad0d4c6e" NVIDIA_REQUIRE_CUDA="cuda>=12.1 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471 brand=tesla,driver>=525,driver<526 brand=unknown,driver>=525,driver<526 brand=nvidia,driver>=525,driver<526 brand=nvidiartx,driver>=525,driver<526 brand=geforce,driver>=525,driver<526 brand=geforcertx,driver>=525,driver<526 brand=quadro,driver>=525,driver<526 brand=quadrortx,driver>=525,driver<526 brand=titan,driver>=525,driver<526 brand=titanrtx,driver>=525,driver<526" NV_LIBCUBLAS_DEV_PACKAGE="libcublas-dev-12-1=12.1.0.26-1" NV_NVTX_VERSION="12.1.66-1" NV_CUDA_CUDART_DEV_VERSION="12.1.55-1" NV_LIBCUSPARSE_VERSION="12.0.2.55-1" NV_LIBNPP_VERSION="12.0.2.50-1" NCCL_VERSION="2.17.1-1" PWD="/root/autodl-tmp/ColossalAI/applications/ColossalChat/examples/training_scripts" AutoDLContainerUUID="ad594b8360-ad0d4c6e" NV_CUDNN_PACKAGE="libcudnn8=8.9.0.131-1+cuda12.1" CONDA_PREFIX="/root/miniconda3/envs/llama" NVIDIA_DRIVER_CAPABILITIES="compute,utility,graphics,video" JUPYTER_SERVER_URL="http://autodl-container-ad594b8360-ad0d4c6e:8888/jupyter/" NV_NVPROF_DEV_PACKAGE="cuda-nvprof-12-1=12.1.55-1" NV_LIBNPP_PACKAGE="libnpp-12-1=12.0.2.50-1" 
NV_LIBNCCL_DEV_PACKAGE_NAME="libnccl-dev" TZ="Asia/Shanghai" NV_LIBCUBLAS_DEV_VERSION="12.1.0.26-1" NVIDIA_PRODUCT_NAME="CUDA" NV_LIBCUBLAS_DEV_PACKAGE_NAME="libcublas-dev-12-1" LINES="57" NV_CUDA_CUDART_VERSION="12.1.55-1" AutoDLServiceURL="https://u258683-8360-ad0d4c6e.westc.gpuhub.com:8443" HOME="/root" LS_COLORS="rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.zst=01;31:*.tzst=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.wim=01;31:*.swm=01;31:*.dwm=01;31:*.esd=01;31:*.jpg=01;35:*.jpeg=01;35:*.mjpg=01;35:*.mjpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.webp=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.opus=00;36:*.spx=00;36:*.xspf=00;36:" COLUMNS="221" NVIDIA_CUDA_END_OF_LIFE="1" AutoDLRegion="west-C" CUDA_VERSION="12.1.0" AgentHost="172.21.0.184" NV_LIBCUBLAS_PACKAGE="libcublas-12-1=12.1.0.26-1" 
NV_CUDA_NSIGHT_COMPUTE_DEV_PACKAGE="cuda-nsight-compute-12-1=12.1.0-1" CONDA_PROMPT_MODIFIER="(llama) " PYDEVD_USE_FRAME_EVAL="NO" PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION="python" NV_LIBNPP_DEV_PACKAGE="libnpp-dev-12-1=12.0.2.50-1" NV_LIBCUBLAS_PACKAGE_NAME="libcublas-12-1" NV_LIBNPP_DEV_VERSION="12.0.2.50-1" CUDA_VISIBLE_DEVICES="0,1" JUPYTER_SERVER_ROOT="/root" TERM="xterm-256color" NV_LIBCUSPARSE_DEV_VERSION="12.0.2.55-1" LIBRARY_PATH="/usr/local/cuda/lib64/stubs" NV_CUDNN_VERSION="8.9.0.131" AutodlAutoPanelToken="jupyter-autodl-container-ad594b8360-ad0d4c6e-3db8691ef93a642afa031eb431dfa54b2de5d9a5f90f44fa69fbd6d8ab6c8ef2b" CONDA_SHLVL="2" SHLVL="3" PYXTERM_DIMENSIONS="80x25" NV_CUDA_LIB_VERSION="12.1.0-1" NVARCH="x86_64" NV_CUDNN_PACKAGE_DEV="libcudnn8-dev=8.9.0.131-1+cuda12.1" NV_CUDA_COMPAT_PACKAGE="cuda-compat-12-1" CONDA_PYTHON_EXE="/root/miniconda3/bin/python" NV_LIBNCCL_PACKAGE="libnccl2=2.17.1-1+cuda12.1" LD_LIBRARY_PATH="/usr/local/nvidia/lib:/usr/local/nvidia/lib64" LC_CTYPE="C.UTF-8" CONDA_DEFAULT_ENV="llama" NV_CUDA_NSIGHT_COMPUTE_VERSION="12.1.0-1" REQUESTS_CA_BUNDLE="/etc/ssl/certs/ca-certificates.crt" OMP_NUM_THREADS="32" NV_NVPROF_VERSION="12.1.55-1" PATH="/root/miniconda3/envs/llama/bin:/root/miniconda3/condabin:/root/miniconda3/bin:/usr/local/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin" NV_LIBNCCL_PACKAGE_NAME="libnccl2" NV_LIBNCCL_PACKAGE_VERSION="2.17.1-1" MKL_NUM_THREADS="32" CONDA_PREFIX_1="/root/miniconda3" DEBIAN_FRONTEND="noninteractive" OLDPWD="/root/autodl-tmp/ColossalAI/applications/ColossalChat/examples/data_preparation_scripts" _="/root/miniconda3/envs/llama/bin/colossalai" && torchrun --nproc_per_node=2 --nnodes=1 --node_rank=0 --master_addr=127.0.0.1 --master_port=31312 train_sft.py --pretrain /root/autodl-tmp/model/qwen/Qwen2-1___5B-Instruct --tokenizer_dir /root/autodl-tmp/model/qwen/Qwen2-1___5B-Instruct --save_interval 5 --dataset 
/root/autodl-tmp/ColossalAI/applications/ColossalChat/examples/data_preparation_scripts/arrow/part-00000 /root/autodl-tmp/ColossalAI/applications/ColossalChat/examples/data_preparation_scripts/arrow/part-00001 /root/autodl-tmp/ColossalAI/applications/ColossalChat/examples/data_preparation_scripts/arrow/part-00002 /root/autodl-tmp/ColossalAI/applications/ColossalChat/examples/data_preparation_scripts/arrow/part-00003 /root/autodl-tmp/ColossalAI/applications/ColossalChat/examples/data_preparation_scripts/arrow/part-00004 /root/autodl-tmp/ColossalAI/applications/ColossalChat/examples/data_preparation_scripts/arrow/part-00005 /root/autodl-tmp/ColossalAI/applications/ColossalChat/examples/data_preparation_scripts/arrow/part-00006 /root/autodl-tmp/ColossalAI/applications/ColossalChat/examples/data_preparation_scripts/arrow/part-00007 /root/autodl-tmp/ColossalAI/applications/ColossalChat/examples/data_preparation_scripts/arrow/part-00008 /root/autodl-tmp/ColossalAI/applications/ColossalChat/examples/data_preparation_scripts/arrow/part-00009 --plugin zero2 --batch_size 2 --max_epochs 1 --accumulation_steps 4 --lr 5e-5 --max_len 4096 --grad_checkpoint --save_path /root/autodl-tmp/ColossalAI/applications/ColossalChat/examples/logsSFT-2024-08-23-19-13-31 --config_file /root/autodl-tmp/ColossalAI/applications/ColossalChat/examples/logsSFT-2024-08-23-19-13-31.json --log_dir /root/autodl-tmp/ColossalAI/applications/ColossalChat/examples/logsSFT-2024-08-23-19-13-31'
Exit code: 1
Stdout: already printed
Stderr: already printed
====== Training on All Nodes =====
localhost: failure
====== Stopping All Nodes =====
localhost: finish
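The failing command in the log starts with `export ="/usr/bin/supervisord"`, which suggests that the environment contains an entry with an empty name and that every variable is being copied into one `export` statement. The following is a minimal diagnostic sketch under that assumption, not ColossalAI's actual launcher code. The helper names `split_env` and `build_export` are hypothetical. It lists variables whose names bash would reject and shows how an export line could be built from the remaining ones.

```python
import os
import re
import shlex

# A shell variable name: letters, digits, underscores, not starting with a digit.
_VALID_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def split_env(env):
    """Split a mapping into (exportable, rejected) by variable name validity."""
    good, bad = {}, {}
    for name, value in env.items():
        target = good if _VALID_NAME.fullmatch(name) else bad
        target[name] = value
    return good, bad


def build_export(env):
    """Build an `export ...` line from the variables with valid names only."""
    good, _ = split_env(env)
    if not good:
        return ""
    pairs = " ".join(f"{name}={shlex.quote(value)}" for name, value in good.items())
    return "export " + pairs


if __name__ == "__main__":
    # Report the entries that would make `export` fail in bash.
    _, rejected = split_env(os.environ)
    for name, value in rejected.items():
        print(f"not a valid identifier: {name!r} = {value!r}")
```

Running this script in the same shell that launches `train_sft.sh` should print the offending entry if one exists. Whether removing that entry before launching is enough to get past the error is untested here.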
Environment
No response