Closed as not planned
Description
prompt len: 6495, max_tokens: 21000

Benchmark command:
python benchmark_serving.py --backend=vllm --host=localhost --port=8888 --dataset=/mnt/vllm/benchmarks/fake_data --tokenizer=/mnt/disk2/lama-tokenizer --num-prompts=1

Server launch command:
python -m vllm.entrypoints.api_server --model=/mnt/disk2/llama-2-13b-chat-hf/ --tokenizer=/mnt/disk2/lama-tokenizer --tensor-parallel-size=2 --swap-space=64 --engine-use-ray --worker-use-ray --max-num-batched-tokens=60000
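For reference, the single request the benchmark ends up sending looks roughly like the following. This is a minimal sketch, not taken verbatim from benchmark_serving.py: the /generate endpoint and the convention of passing SamplingParams fields as top-level JSON keys are assumptions based on this version's demo api_server, and the field values mirror the request log below.

```python
import json

# Hypothetical reproduction payload; values mirror the SamplingParams
# printed in the request log. Endpoint and schema are assumptions.
payload = {
    "prompt": "<~6495-token prompt>",
    "max_tokens": 21000,   # far beyond Llama-2's native context window
    "ignore_eos": True,
    "temperature": 1.0,
    "top_p": 1.0,
    "n": 1,
}
body = json.dumps(payload)
# e.g. POST http://localhost:8888/generate with this body
print(body)
```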
INFO 11-17 08:58:33 async_llm_engine.py:371] Received request 93296c1db0b24cfbb2ee20b7208ceced: prompt: ' U1XiBoEelEJeEDfIAGLrf27N9d1********dgbZq8fXYw215vKF2k77Cjb',
sampling params: SamplingParams(n=1, best_of=1, presence_penalty=0.0,
frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=-1,
use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[],
ignore_eos=True, max_tokens=21000, logprobs=None, prompt_logprobs=None,
skip_special_tokens=True, spaces_between_special_tokens=True), prompt token ids: None.
Error log:
(RayWorker pid=296668) [2023-11-17 08:38:23,099 E 296668 296668] logging.cc:97: Unhandled exception: N3c105ErrorE. what(): CUDA error: an illegal memory access was encountered
(RayWorker pid=296668) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(RayWorker pid=296668) For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
(RayWorker pid=296668) Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(RayWorker pid=296668)
(RayWorker pid=296668) Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
(RayWorker pid=296668) frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f5ab808e4d7 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
(RayWorker pid=296668) frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f5ab805836b in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
(RayWorker pid=296668) frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f5ab073bb58 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so)
(RayWorker pid=296668) frame #3: <unknown function> + 0x1c36b (0x7f5ab070c36b in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so)
(RayWorker pid=296668) frame #4: <unknown function> + 0x2b930 (0x7f5ab071b930 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so)
(RayWorker pid=296668) frame #5: <unknown function> + 0x4d46c6 (0x7f5a50b766c6 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
(RayWorker pid=296668) frame #6: <unknown function> + 0x3ee77 (0x7f5ab8073e77 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
(RayWorker pid=296668) frame #7: c10::TensorImpl::~TensorImpl() + 0x1be (0x7f5ab806c69e in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
(RayWorker pid=296668) frame #8: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f5ab806c7b9 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
(RayWorker pid=296668) frame #9: <unknown function> + 0x759cc8 (0x7f5a50dfbcc8 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
(RayWorker pid=296668) frame #10: THPVariable_subclass_dealloc(_object*) + 0x325 (0x7f5a50dfc075 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
(RayWorker pid=296668) frame #11: ray::RayWorker.execute_method() [0x5ecd90]
(RayWorker pid=296668) frame #12: ray::RayWorker.execute_method() [0x5447b8]
(RayWorker pid=296668) frame #13: ray::RayWorker.execute_method() [0x54480a]
(RayWorker pid=296668) frame #14: ray::RayWorker.execute_method() [0x54480a]
(RayWorker pid=296668) frame #15: ray::RayWorker.execute_method() [0x54480a]
(RayWorker pid=296668) frame #16: ray::RayWorker.execute_method() [0x54480a]
(RayWorker pid=296668) frame #17: ray::RayWorker.execute_method() [0x54480a]
(RayWorker pid=296668) frame #18: ray::RayWorker.execute_method() [0x54480a]
(RayWorker pid=296668) frame #19: ray::RayWorker.execute_method() [0x54480a]
(RayWorker pid=296668) frame #20: ray::RayWorker.execute_method() [0x54480a]
(RayWorker pid=296668) frame #21: ray::RayWorker.execute_method() [0x54480a]
(RayWorker pid=296668) frame #22: ray::RayWorker.execute_method() [0x54480a]
(RayWorker pid=296668) frame #23: ray::RayWorker.execute_method() [0x54480a]
(RayWorker pid=296668) frame #24: ray::RayWorker.execute_method() [0x54480a]
(RayWorker pid=296668) frame #25: <unknown function> + 0x644015 (0x7f5abceb6015 in /usr/local/lib/python3.8/dist-packages/ray/_raylet.so)
(RayWorker pid=296668) frame #26: std::_Function_handler<ray::Status (ray::rpc::Address const&, ray::rpc::TaskType, std::string, ray::core::RayFunction const&, std::unordered_map<std::string, double, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, double> > > const&, std::vector<std::shared_ptr<ray::RayObject>, std::allocator<std::shared_ptr<ray::RayObject> > > const&, std::vector<ray::rpc::ObjectReference, std::allocator<ray::rpc::ObjectReference> > const&, std::string const&, std::string const&, std::vector<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> >, std::allocator<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> > > >*, std::vector<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> >, std::allocator<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> > > >*, std::vector<std::pair<ray::ObjectID, bool>, std::allocator<std::pair<ray::ObjectID, bool> > >*, std::shared_ptr<ray::LocalMemoryBuffer>&, bool*, std::string*, std::vector<ray::ConcurrencyGroup, std::allocator<ray::ConcurrencyGroup> > const&, std::string, bool, bool, bool), ray::Status (*)(ray::rpc::Address const&, ray::rpc::TaskType, std::string, ray::core::RayFunction const&, std::unordered_map<std::string, double, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, double> > > const&, std::vector<std::shared_ptr<ray::RayObject>, std::allocator<std::shared_ptr<ray::RayObject> > > const&, std::vector<ray::rpc::ObjectReference, std::allocator<ray::rpc::ObjectReference> > const&, std::string, std::string, std::vector<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> >, std::allocator<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> > > >*, std::vector<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> >, std::allocator<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> > > >*, std::vector<std::pair<ray::ObjectID, bool>, std::allocator<std::pair<ray::ObjectID, 
bool> > >*, std::shared_ptr<ray::LocalMemoryBuffer>&, bool*, std::string*, std::vector<ray::ConcurrencyGroup, std::allocator<ray::ConcurrencyGroup> > const&, std::string, bool, bool, bool)>::_M_invoke(std::_Any_data const&, ray::rpc::Address const&, ray::rpc::TaskType&&, std::string&&, ray::core::RayFunction const&, std::unordered_map<std::string, double, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, double> > > const&, std::vector<std::shared_ptr<ray::RayObject>, std::allocator<std::shared_ptr<ray::RayObject> > > const&, std::vector<ray::rpc::ObjectReference, std::allocator<ray::rpc::ObjectReference> > const&, std::string const&, std::string const&, std::vector<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> >, std::allocator<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> > > >*&&, std::vector<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> >, std::allocator<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> > > >*&&, std::vector<std::pair<ray::ObjectID, bool>, std::allocator<std::pair<ray::ObjectID, bool> > >*&&, std::shared_ptr<ray::LocalMemoryBuffer>&, bool*&&, std::string*&&, std::vector<ray::ConcurrencyGroup, std::allocator<ray::ConcurrencyGroup> > const&, std::string&&, bool&&, bool&&, bool&&) + 0x157 (0x7f5abcdf2547 in /usr/local/lib/python3.8/dist-packages/ray/_raylet.so)
(RayWorker pid=296668) frame #27: ray::core::CoreWorker::ExecuteTask(ray::TaskSpecification const&, std::shared_ptr<std::unordered_map<std::string, std::vector<std::pair<long, double>, std::allocator<std::pair<long, double> > >, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, std::vector<std::pair<long, double>, std::allocator<std::pair<long, double> > > > > > > const&, std::vector<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> >, std::allocator<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> > > >*, std::vector<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> >, std::allocator<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> > > >*, std::vector<std::pair<ray::ObjectID, bool>, std::allocator<std::pair<ray::ObjectID, bool> > >*, google::protobuf::RepeatedPtrField<ray::rpc::ObjectReferenceCount>*, bool*, std::string*) + 0xc1e (0x7f5abcfdce5e in /usr/local/lib/python3.8/dist-packages/ray/_raylet.so)
(RayWorker pid=296668) frame #28: std::_Function_handler<ray::Status (ray::TaskSpecification const&, std::shared_ptr<std::unordered_map<std::string, std::vector<std::pair<long, double>, std::allocator<std::pair<long, double> > >, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, std::vector<std::pair<long, double>, std::allocator<std::pair<long, double> > > > > > >, std::vector<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> >, std::allocator<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> > > >*, std::vector<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> >, std::allocator<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> > > >*, std::vector<std::pair<ray::ObjectID, bool>, std::allocator<std::pair<ray::ObjectID, bool> > >*, google::protobuf::RepeatedPtrField<ray::rpc::ObjectReferenceCount>*, bool*, std::string*), std::_Bind<ray::Status (ray::core::CoreWorker::*(ray::core::CoreWorker*, std::_Placeholder<1>, std::_Placeholder<2>, std::_Placeholder<3>, std::_Placeholder<4>, std::_Placeholder<5>, std::_Placeholder<6>, std::_Placeholder<7>, std::_Placeholder<8>))(ray::TaskSpecification const&, std::shared_ptr<std::unordered_map<std::string, std::vector<std::pair<long, double>, std::allocator<std::pair<long, double> > >, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, std::vector<std::pair<long, double>, std::allocator<std::pair<long, double> > > > > > > const&, std::vector<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> >, std::allocator<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> > > >*, std::vector<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> >, std::allocator<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> > > >*, std::vector<std::pair<ray::ObjectID, bool>, std::allocator<std::pair<ray::ObjectID, bool> > >*, google::protobuf::RepeatedPtrField<ray::rpc::ObjectReferenceCount>*, bool*, std::string*)> 
>::_M_invoke(std::_Any_data const&, ray::TaskSpecification const&, std::shared_ptr<std::unordered_map<std::string, std::vector<std::pair<long, double>, std::allocator<std::pair<long, double> > >, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, std::vector<std::pair<long, double>, std::allocator<std::pair<long, double> > > > > > >&&, std::vector<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> >, std::allocator<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> > > >*&&, std::vector<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> >, std::allocator<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> > > >*&&, std::vector<std::pair<ray::ObjectID, bool>, std::allocator<std::pair<ray::ObjectID, bool> > >*&&, google::protobuf::RepeatedPtrField<ray::rpc::ObjectReferenceCount>*&&, bool*&&, std::string*&&) + 0x58 (0x7f5abcf117d8 in /usr/local/lib/python3.8/dist-packages/ray/_raylet.so)
(RayWorker pid=296668) frame #29: <unknown function> + 0x793684 (0x7f5abd005684 in /usr/local/lib/python3.8/dist-packages/ray/_raylet.so)
(RayWorker pid=296668) frame #30: <unknown function> + 0x79498a (0x7f5abd00698a in /usr/local/lib/python3.8/dist-packages/ray/_raylet.so)
(RayWorker pid=296668) frame #31: <unknown function> + 0x7ac04e (0x7f5abd01e04e in /usr/local/lib/python3.8/dist-packages/ray/_raylet.so)
(RayWorker pid=296668) frame #32: ray::core::ActorSchedulingQueue::AcceptRequestOrRejectIfCanceled(ray::TaskID, ray::core::InboundRequest&) + 0x10c (0x7f5abd01f35c in /usr/local/lib/python3.8/dist-packages/ray/_raylet.so)
(RayWorker pid=296668) frame #33: <unknown function> + 0x7b02cb (0x7f5abd0222cb in /usr/local/lib/python3.8/dist-packages/ray/_raylet.so)
(RayWorker pid=296668) frame #34: ray::core::ActorSchedulingQueue::Add(long, long, std::function<void (std::function<void (ray::Status, std::function<void ()>, std::function<void ()>)>)>, std::function<void (ray::Status const&, std::function<void (ray::Status, std::function<void ()>, std::function<void ()>)>)>, std::function<void (ray::Status, std::function<void ()>, std::function<void ()>)>, std::string const&, std::shared_ptr<ray::FunctionDescriptorInterface> const&, ray::TaskID, std::vector<ray::rpc::ObjectReference, std::allocator<ray::rpc::ObjectReference> > const&) + 0x400 (0x7f5abd023da0 in /usr/local/lib/python3.8/dist-packages/ray/_raylet.so)
(RayWorker pid=296668) frame #35: ray::core::CoreWorkerDirectTaskReceiver::HandleTask(ray::rpc::PushTaskRequest const&, ray::rpc::PushTaskReply*, std::function<void (ray::Status, std::function<void ()>, std::function<void ()>)>) + 0x1216 (0x7f5abd005016 in /usr/local/lib/python3.8/dist-packages/ray/_raylet.so)
(RayWorker pid=296668) frame #36: <unknown function> + 0x735e25 (0x7f5abcfa7e25 in /usr/local/lib/python3.8/dist-packages/ray/_raylet.so)
(RayWorker pid=296668) frame #37: <unknown function> + 0xa59886 (0x7f5abd2cb886 in /usr/local/lib/python3.8/dist-packages/ray/_raylet.so)
(RayWorker pid=296668) frame #38: <unknown function> + 0xa4b55e (0x7f5abd2bd55e in /usr/local/lib/python3.8/dist-packages/ray/_raylet.so)
(RayWorker pid=296668) frame #39: <unknown function> + 0xa4bab6 (0x7f5abd2bdab6 in /usr/local/lib/python3.8/dist-packages/ray/_raylet.so)
(RayWorker pid=296668) frame #40: <unknown function> + 0x102fdbb (0x7f5abd8a1dbb in /usr/local/lib/python3.8/dist-packages/ray/_raylet.so)
(RayWorker pid=296668) frame #41: <unknown function> + 0x1031d99 (0x7f5abd8a3d99 in /usr/local/lib/python3.8/dist-packages/ray/_raylet.so)
(RayWorker pid=296668) frame #42: <unknown function> + 0x10324a2 (0x7f5abd8a44a2 in /usr/local/lib/python3.8/dist-packages/ray/_raylet.so)
(RayWorker pid=296668) frame #43: ray::core::CoreWorker::RunTaskExecutionLoop() + 0x1c (0x7f5abcfa6a8c in /usr/local/lib/python3.8/dist-packages/ray/_raylet.so)
(RayWorker pid=296668) frame #44: ray::core::CoreWorkerProcessImpl::RunWorkerTaskExecutionLoop() + 0x8c (0x7f5abcfe825c in /usr/local/lib/python3.8/dist-packages/ray/_raylet.so)
(RayWorker pid=296668) frame #45: ray::core::CoreWorkerProcess::RunTaskExecutionLoop() + 0x1d (0x7f5abcfe840d in /usr/local/lib/python3.8/dist-packages/ray/_raylet.so)
(RayWorker pid=296668) frame #46: <unknown function> + 0x57b5d7 (0x7f5abcded5d7 in /usr/local/lib/python3.8/dist-packages/ray/_raylet.so)
(RayWorker pid=296668) frame #47: ray::RayWorker.execute_method() [0x504b7b]
(RayWorker pid=296668) frame #48: _PyEval_EvalFrameDefault + 0x851 (0x56bbe1 in ray::RayWorker.execute_method)
(RayWorker pid=296668) frame #49: _PyFunction_Vectorcall + 0x1b6 (0x5f5ee6 in ray::RayWorker.execute_method)
(RayWorker pid=296668) frame #50: _PyEval_EvalFrameDefault + 0x851 (0x56bbe1 in ray::RayWorker.execute_method)
(RayWorker pid=296668) frame #51: _PyEval_EvalCodeWithName + 0x26a (0x569d8a in ray::RayWorker.execute_method)
(RayWorker pid=296668) frame #52: PyEval_EvalCode + 0x27 (0x68e267 in ray::RayWorker.execute_method)
(RayWorker pid=296668) frame #53: ray::RayWorker.execute_method() [0x67d9b1]
(RayWorker pid=296668) frame #54: ray::RayWorker.execute_method() [0x67da2f]
(RayWorker pid=296668) frame #55: ray::RayWorker.execute_method() [0x67dad1]
(RayWorker pid=296668) frame #56: PyRun_SimpleFileExFlags + 0x197 (0x67fbf7 in ray::RayWorker.execute_method)
(RayWorker pid=296668) frame #57: Py_RunMain + 0x212 (0x6b8082 in ray::RayWorker.execute_method)
(RayWorker pid=296668) frame #58: Py_BytesMain + 0x2d (0x6b840d in ray::RayWorker.execute_method)
(RayWorker pid=296668) frame #59: __libc_start_main + 0xf3 (0x7f5abe4b5083 in /usr/lib/x86_64-linux-gnu/libc.so.6)
(RayWorker pid=296668) frame #60: _start + 0x2e (0x5faa2e in ray::RayWorker.execute_method)
(RayWorker pid=296668)
(RayWorker pid=296668) [2023-11-17 08:38:23,158 E 296668 296668] logging.cc:104: Stack trace:
(RayWorker pid=296668) /usr/local/lib/python3.8/dist-packages/ray/_raylet.so(+0xf2e81a) [0x7f5abd7a081a] ray::operator<<()
(RayWorker pid=296668) /usr/local/lib/python3.8/dist-packages/ray/_raylet.so(+0xf30fd8) [0x7f5abd7a2fd8] ray::TerminateHandler()
(RayWorker pid=296668) /usr/lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_Resume+0x12a) [0x7f5abc6865aa] _Unwind_Resume
(RayWorker pid=296668) /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so(+0x759cc8) [0x7f5a50dfbcc8] THPVariable_clear()
(RayWorker pid=296668) /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so(_Z28THPVariable_subclass_deallocP7_object+0x325) [0x7f5a50dfc075] THPVariable_subclass_dealloc()
(RayWorker pid=296668) ray::RayWorker.execute_method() [0x5ecd90]
(RayWorker pid=296668) ray::RayWorker.execute_method() [0x5447b8]
(RayWorker pid=296668) ray::RayWorker.execute_method() [0x54480a]
(RayWorker pid=296668) ray::RayWorker.execute_method() [0x54480a]
(RayWorker pid=296668) ray::RayWorker.execute_method() [0x54480a]
(RayWorker pid=296668) ray::RayWorker.execute_method() [0x54480a]
(RayWorker pid=296668) ray::RayWorker.execute_method() [0x54480a]
(RayWorker pid=296668) ray::RayWorker.execute_method() [0x54480a]
(RayWorker pid=296668) ray::RayWorker.execute_method() [0x54480a]
(RayWorker pid=296668) ray::RayWorker.execute_method() [0x54480a]
(RayWorker pid=296668) ray::RayWorker.execute_method() [0x54480a]
(RayWorker pid=296668) ray::RayWorker.execute_method() [0x54480a]
(RayWorker pid=296668) ray::RayWorker.execute_method() [0x54480a]
(RayWorker pid=296668) ray::RayWorker.execute_method() [0x54480a]
(RayWorker pid=296668) /usr/local/lib/python3.8/dist-packages/ray/_raylet.so(_ZNSt17_Function_handlerIFN3ray6StatusERKNS0_3rpc7AddressENS2_8TaskTypeESsRKNS0_4core11RayFunctionERKSt13unordered_mapISsdSt4hashISsESt8equal_toISsESaISt4pairIKSsdEEERKSt6vectorISt10shared_ptrINS0_9RayObjectEESaISQ_EERKSN_INS2_15ObjectReferenceESaISV_EERSH_S10_PSN_ISG_INS0_8ObjectIDESQ_ESaIS12_EES15_PSN_ISG_IS11_bESaIS16_EERSO_INS0_17LocalMemoryBufferEEPbPSsRKSN_INS0_16ConcurrencyGroupESaIS1F_EESsbbbEPFS1_S5_S6_SsSA_SM_SU_SZ_SsSsS15_S15_S19_S1C_S1D_S1E_S1J_SsbbbEE9_M_invokeERKSt9_Any_dataS5_OS6_OSsSA_SM_SU_SZ_S10_S10_OS15_S1T_OS19_S1C_OS1D_OS1E_S1J_S1S_ObS1X_S1X_+0x157) [0x7f5abcdf2547] std::_Function_handler<>::_M_invoke()
(RayWorker pid=296668) /usr/local/lib/python3.8/dist-packages/ray/_raylet.so(_ZN3ray4core10CoreWorker11ExecuteTaskERKNS_17TaskSpecificationERKSt10shared_ptrISt13unordered_mapISsSt6vectorISt4pairIldESaIS9_EESt4hashISsESt8equal_toISsESaIS8_IKSsSB_EEEEPS7_IS8_INS_8ObjectIDES5_INS_9RayObjectEEESaISQ_EEST_PS7_IS8_ISN_bESaISU_EEPN6google8protobuf16RepeatedPtrFieldINS_3rpc20ObjectReferenceCountEEEPbPSs+0xc1e) [0x7f5abcfdce5e] ray::core::CoreWorker::ExecuteTask()
(RayWorker pid=296668) /usr/local/lib/python3.8/dist-packages/ray/_raylet.so(_ZNSt17_Function_handlerIFN3ray6StatusERKNS0_17TaskSpecificationESt10shared_ptrISt13unordered_mapISsSt6vectorISt4pairIldESaIS9_EESt4hashISsESt8equal_toISsESaIS8_IKSsSB_EEEEPS7_IS8_INS0_8ObjectIDES5_INS0_9RayObjectEEESaISO_EESR_PS7_IS8_ISL_bESaISS_EEPN6google8protobuf16RepeatedPtrFieldINS0_3rpc20ObjectReferenceCountEEEPbPSsESt5_BindIFMNS0_4core10CoreWorkerEFS1_S4_RKSK_SR_SR_SV_S12_S13_S14_EPS18_St12_PlaceholderILi1EES1E_ILi2EES1E_ILi3EES1E_ILi4EES1E_ILi5EES1E_ILi6EES1E_ILi7EES1E_ILi8EEEEE9_M_invokeERKSt9_Any_dataS4_OSK_OSR_S1U_OSV_OS12_OS13_OS14_+0x58) [0x7f5abcf117d8] std::_Function_handler<>::_M_invoke()
(RayWorker pid=296668) /usr/local/lib/python3.8/dist-packages/ray/_raylet.so(+0x79498a) [0x7f5abd00698a] std::_Function_handler<>::_M_invoke()
(RayWorker pid=296668) /usr/local/lib/python3.8/dist-packages/ray/_raylet.so(+0x7ac04e) [0x7f5abd01e04e] ray::core::InboundRequest::Accept()
(RayWorker pid=296668) /usr/local/lib/python3.8/dist-packages/ray/_raylet.so(_ZN3ray4core20ActorSchedulingQueue31AcceptRequestOrRejectIfCanceledENS_6TaskIDERNS0_14InboundRequestE+0x10c) [0x7f5abd01f35c] ray::core::ActorSchedulingQueue::AcceptRequestOrRejectIfCanceled()
(RayWorker pid=296668) /usr/local/lib/python3.8/dist-packages/ray/_raylet.so(+0x7b02cb) [0x7f5abd0222cb] ray::core::ActorSchedulingQueue::ScheduleRequests()
(RayWorker pid=296668) /usr/local/lib/python3.8/dist-packages/ray/_raylet.so(_ZN3ray4core20ActorSchedulingQueue3AddEllSt8functionIFvS2_IFvNS_6StatusES2_IFvvEES5_EEEES2_IFvRKS3_S7_EES7_RKSsRKSt10shared_ptrINS_27FunctionDescriptorInterfaceEENS_6TaskIDERKSt6vectorINS_3rpc15ObjectReferenceESaISO_EE+0x400) [0x7f5abd023da0] ray::core::ActorSchedulingQueue::Add()
(RayWorker pid=296668) /usr/local/lib/python3.8/dist-packages/ray/_raylet.so(_ZN3ray4core28CoreWorkerDirectTaskReceiver10HandleTaskERKNS_3rpc15PushTaskRequestEPNS2_13PushTaskReplyESt8functionIFvNS_6StatusES8_IFvvEESB_EE+0x1216) [0x7f5abd005016] ray::core::CoreWorkerDirectTaskReceiver::HandleTask()
(RayWorker pid=296668) /usr/local/lib/python3.8/dist-packages/ray/_raylet.so(+0x735e25) [0x7f5abcfa7e25] std::_Function_handler<>::_M_invoke()
(RayWorker pid=296668) /usr/local/lib/python3.8/dist-packages/ray/_raylet.so(+0xa59886) [0x7f5abd2cb886] EventTracker::RecordExecution()
(RayWorker pid=296668) /usr/local/lib/python3.8/dist-packages/ray/_raylet.so(+0xa4b55e) [0x7f5abd2bd55e] std::_Function_handler<>::_M_invoke()
(RayWorker pid=296668) /usr/local/lib/python3.8/dist-packages/ray/_raylet.so(+0xa4bab6) [0x7f5abd2bdab6] boost::asio::detail::completion_handler<>::do_complete()
(RayWorker pid=296668) /usr/local/lib/python3.8/dist-packages/ray/_raylet.so(+0x102fdbb) [0x7f5abd8a1dbb] boost::asio::detail::scheduler::do_run_one()
(RayWorker pid=296668) /usr/local/lib/python3.8/dist-packages/ray/_raylet.so(+0x1031d99) [0x7f5abd8a3d99] boost::asio::detail::scheduler::run()
(RayWorker pid=296668) /usr/local/lib/python3.8/dist-packages/ray/_raylet.so(+0x10324a2) [0x7f5abd8a44a2] boost::asio::io_context::run()
(RayWorker pid=296668) /usr/local/lib/python3.8/dist-packages/ray/_raylet.so(_ZN3ray4core10CoreWorker20RunTaskExecutionLoopEv+0x1c) [0x7f5abcfa6a8c] ray::core::CoreWorker::RunTaskExecutionLoop()
(RayWorker pid=296668) /usr/local/lib/python3.8/dist-packages/ray/_raylet.so(_ZN3ray4core21CoreWorkerProcessImpl26RunWorkerTaskExecutionLoopEv+0x8c) [0x7f5abcfe825c] ray::core::CoreWorkerProcessImpl::RunWorkerTaskExecutionLoop()
(RayWorker pid=296668) /usr/local/lib/python3.8/dist-packages/ray/_raylet.so(_ZN3ray4core17CoreWorkerProcess20RunTaskExecutionLoopEv+0x1d) [0x7f5abcfe840d] ray::core::CoreWorkerProcess::RunTaskExecutionLoop()
(RayWorker pid=296668) ray::RayWorker.execute_method() [0x504b7b]
(RayWorker pid=296668) ray::RayWorker.execute_method(_PyEval_EvalFrameDefault+0x851) [0x56bbe1] _PyEval_EvalFrameDefault
(RayWorker pid=296668) ray::RayWorker.execute_method(_PyFunction_Vectorcall+0x1b6) [0x5f5ee6] _PyFunction_Vectorcall
(RayWorker pid=296668) ray::RayWorker.execute_method(_PyEval_EvalFrameDefault+0x851) [0x56bbe1] _PyEval_EvalFrameDefault
(RayWorker pid=296668) ray::RayWorker.execute_method(_PyEval_EvalCodeWithName+0x26a) [0x569d8a] _PyEval_EvalCodeWithName
(RayWorker pid=296668) ray::RayWorker.execute_method(PyEval_EvalCode+0x27) [0x68e267] PyEval_EvalCode
(RayWorker pid=296668) ray::RayWorker.execute_method() [0x67d9b1]
(RayWorker pid=296668) ray::RayWorker.execute_method() [0x67da2f]
(RayWorker pid=296668) ray::RayWorker.execute_method() [0x67dad1]
(RayWorker pid=296668) ray::RayWorker.execute_method(PyRun_SimpleFileExFlags+0x197) [0x67fbf7] PyRun_SimpleFileExFlags
(RayWorker pid=296668) ray::RayWorker.execute_method(Py_RunMain+0x212) [0x6b8082] Py_RunMain
(RayWorker pid=296668) ray::RayWorker.execute_method(Py_BytesMain+0x2d) [0x6b840d] Py_BytesMain
(RayWorker pid=296668) /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7f5abe4b5083] __libc_start_main
(RayWorker pid=296668) ray::RayWorker.execute_method(_start+0x2e) [0x5faa2e] _start
(RayWorker pid=296668) *** SIGABRT received at time=1700210303 on cpu 36 ***
(RayWorker pid=296668) PC: @ 0x7f5abe4d400b (unknown) raise
(RayWorker pid=296668) @ 0x7f5abe4d4090 (unknown) (unknown)
(RayWorker pid=296668) @ 0x7f5abc73a38c 1008 (unknown)
(RayWorker pid=296668) @ 0x7ffec2633b50 248 (unknown)
(RayWorker pid=296668) @ 0x1 (unknown) (unknown)
(RayWorker pid=296668) [2023-11-17 08:38:23,161 E 296668 296668] logging.cc:361: *** SIGABRT received at time=1700210303 on cpu 36 ***
(RayWorker pid=296668) [2023-11-17 08:38:23,161 E 296668 296668] logging.cc:361: PC: @ 0x7f5abe4d400b (unknown) raise
(RayWorker pid=296668) [2023-11-17 08:38:23,162 E 296668 296668] logging.cc:361: @ 0x7f5abe4d4090 (unknown) (unknown)
(RayWorker pid=296668) [2023-11-17 08:38:23,162 E 296668 296668] logging.cc:361: @ 0x7f5abc73a38c 1008 (unknown)
(RayWorker pid=296668) [2023-11-17 08:38:23,163 E 296668 296668] logging.cc:361: @ 0x7ffec2633b50 248 (unknown)
(RayWorker pid=296668) [2023-11-17 08:38:23,165 E 296668 296668] logging.cc:361: @ 0x1 (unknown) (unknown)
(RayWorker pid=296668) Fatal Python error: Aborted
(RayWorker pid=296668) Stack (most recent call first):
(RayWorker pid=296668) File "/usr/local/lib/python3.8/dist-packages/ray/_private/worker.py", line 782 in main_loop
(RayWorker pid=296668) File "/usr/local/lib/python3.8/dist-packages/ray/_private/workers/default_worker.py", line 278 in <module>
2023-11-17 08:39:33,162 WARNING worker.py:2058 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffffda6fb1ed560b0a9302273c5d01000000 Worker ID: d334f3efb7d78478dec2949c7ed2b0ae2563c2266e188631380e6bab Node ID: 4ebfbe49244d6a6cb436e651244d2576a9246ebe2d63480b0e5f80c1 Worker IP address: 172.16.47.112 Worker port: 33295 Worker PID: 296669 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
Exception in callback functools.partial(<function _raise_exception_on_finish at 0x7f0e1b5e95e0>, request_tracker=<vllm.engine.async_llm_engine.RequestTracker object at 0x7f0cf13e5e20>)
handle: <Handle functools.partial(<function _raise_exception_on_finish at 0x7f0e1b5e95e0>, request_tracker=<vllm.engine.async_llm_engine.RequestTracker object at 0x7f0cf13e5e20>)>
Traceback (most recent call last):
File "/mnt/disk2/test/vllm_latest/vllm/vllm/engine/async_llm_engine.py", line 28, in _raise_exception_on_finish
task.result()
File "/mnt/disk2/test/vllm_latest/vllm/vllm/engine/async_llm_engine.py", line 351, in run_engine_loop
has_requests_in_progress = await self.engine_step()
File "/mnt/disk2/test/vllm_latest/vllm/vllm/engine/async_llm_engine.py", line 328, in engine_step
request_outputs = await self.engine.step.remote()
ray.exceptions.RayTaskError: ray::_AsyncLLMEngine.step() (pid=296628, ip=172.16.47.112, actor_id=744b80b9032fa37fd1ee549001000000, repr=<vllm.engine.async_llm_engine._AsyncLLMEngine object at 0x7fa737994310>)
File "/usr/lib/python3.8/concurrent/futures/_base.py", line 444, in result
return self.__get_result()
File "/usr/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
raise self._exception
File "/mnt/disk2/test/vllm_latest/vllm/vllm/engine/llm_engine.py", line 563, in step
output = self._run_workers(
File "/mnt/disk2/test/vllm_latest/vllm/vllm/engine/llm_engine.py", line 711, in _run_workers
all_outputs = ray.get(all_outputs)
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
class_name: RayWorker
actor_id: da6fb1ed560b0a9302273c5d01000000
pid: 296669
namespace: 1e594810-681e-4d7c-878c-65854434ef82
ip: 172.16.47.112
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
File "/mnt/disk2/test/vllm_latest/vllm/vllm/engine/async_llm_engine.py", line 37, in _raise_exception_on_finish
raise exc
File "/mnt/disk2/test/vllm_latest/vllm/vllm/engine/async_llm_engine.py", line 32, in _raise_exception_on_finish
raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
INFO 11-17 08:39:33 async_llm_engine.py:134] Aborted request 8fd068bbbc5c4760ac2bc86d3174b33f.
INFO: ::1:47850 - "POST /generate HTTP/1.1" 500 Internal Server Error
ERROR: Exception in ASGI application
Traceback (most recent call last):
File "/mnt/disk2/test/vllm_latest/vllm/vllm/engine/async_llm_engine.py", line 28, in _raise_exception_on_finish
task.result()
File "/mnt/disk2/test/vllm_latest/vllm/vllm/engine/async_llm_engine.py", line 351, in run_engine_loop
has_requests_in_progress = await self.engine_step()
File "/mnt/disk2/test/vllm_latest/vllm/vllm/engine/async_llm_engine.py", line 328, in engine_step
request_outputs = await self.engine.step.remote()
ray.exceptions.RayTaskError: ray::_AsyncLLMEngine.step() (pid=296628, ip=172.16.47.112, actor_id=744b80b9032fa37fd1ee549001000000, repr=<vllm.engine.async_llm_engine._AsyncLLMEngine object at 0x7fa737994310>)
File "/usr/lib/python3.8/concurrent/futures/_base.py", line 444, in result
return self.__get_result()
File "/usr/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
raise self._exception
File "/mnt/disk2/test/vllm_latest/vllm/vllm/engine/llm_engine.py", line 563, in step
output = self._run_workers(
File "/mnt/disk2/test/vllm_latest/vllm/vllm/engine/llm_engine.py", line 711, in _run_workers
all_outputs = ray.get(all_outputs)
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
class_name: RayWorker
actor_id: da6fb1ed560b0a9302273c5d01000000
pid: 296669
namespace: 1e594810-681e-4d7c-878c-65854434ef82
ip: 172.16.47.112
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 426, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/usr/local/lib/python3.8/dist-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
return await self.app(scope, receive, send)
File "/usr/local/lib/python3.8/dist-packages/fastapi/applications.py", line 292, in __call__
await super().__call__(scope, receive, send)
File "/usr/local/lib/python3.8/dist-packages/starlette/applications.py", line 122, in __call__
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.8/dist-packages/starlette/middleware/errors.py", line 184, in __call__
raise exc
File "/usr/local/lib/python3.8/dist-packages/starlette/middleware/errors.py", line 162, in __call__
await self.app(scope, receive, _send)
File "/usr/local/lib/python3.8/dist-packages/starlette/middleware/exceptions.py", line 79, in __call__
raise exc
File "/usr/local/lib/python3.8/dist-packages/starlette/middleware/exceptions.py", line 68, in __call__
await self.app(scope, receive, sender)
File "/usr/local/lib/python3.8/dist-packages/fastapi/middleware/asyncexitstack.py", line 20, in __call__
raise e
File "/usr/local/lib/python3.8/dist-packages/fastapi/middleware/asyncexitstack.py", line 17, in __call__
await self.app(scope, receive, send)
File "/usr/local/lib/python3.8/dist-packages/starlette/routing.py", line 718, in __call__
await route.handle(scope, receive, send)
File "/usr/local/lib/python3.8/dist-packages/starlette/routing.py", line 276, in handle
await self.app(scope, receive, send)
File "/usr/local/lib/python3.8/dist-packages/starlette/routing.py", line 66, in app
response = await func(request)
File "/usr/local/lib/python3.8/dist-packages/fastapi/routing.py", line 273, in app
raw_response = await run_endpoint_function(
File "/usr/local/lib/python3.8/dist-packages/fastapi/routing.py", line 190, in run_endpoint_function
return await dependant.call(**values)
File "/mnt/disk2/test/vllm_latest/vllm/vllm/entrypoints/api_server.py", line 58, in generate
async for request_output in results_generator:
File "/mnt/disk2/test/vllm_latest/vllm/vllm/engine/async_llm_engine.py", line 436, in generate
raise e
File "/mnt/disk2/test/vllm_latest/vllm/vllm/engine/async_llm_engine.py", line 430, in generate
async for request_output in stream:
File "/mnt/disk2/test/vllm_latest/vllm/vllm/engine/async_llm_engine.py", line 70, in __anext__
raise result
File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
File "/mnt/disk2/test/vllm_latest/vllm/vllm/engine/async_llm_engine.py", line 37, in _raise_exception_on_finish
raise exc
File "/mnt/disk2/test/vllm_latest/vllm/vllm/engine/async_llm_engine.py", line 32, in _raise_exception_on_finish
raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
2023-11-17 08:39:33,977 WARNING worker.py:2058 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffff2e42461d426b10b9fffc43ce01000000 Worker ID: 8689b72cd4f3d992c79a89e6fe9161d936dcc38e1ef9f5e8bb0f14b5 Node ID: 4ebfbe49244d6a6cb436e651244d2576a9246ebe2d63480b0e5f80c1 Worker IP address: 172.16.47.112 Worker port: 36633 Worker PID: 296668 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
Reported by tattrongvu