Describe the bug
I've been running into core dump errors using diffusers on Flax/TPU.
I first hit this when running the text_to_image training script (I followed the exact steps in the instructions at https://github.com/huggingface/diffusers/tree/main/examples/text_to_image#training-with-flaxjax). I was able to fix it with sudo rm -f /tmp/libtpu_lockfile, as suggested by @patrickvonplaten.
However, I just got the same error when running a different inference script (the example code in the PR I'm working on, #2727), and this time the same workaround does not make it go away.
I'm not sure how to reproduce these errors reliably. It seems that once the core dump is triggered, scripts that previously worked on the same machine stop working and always produce the same core dump error. However, I'm still able to run the official Flax examples, so the issue is likely in diffusers/transformers.
I would appreciate any advice on steps I can take to fix this.
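For reference, this is a sketch of the workaround that fixed the training-script case. The lockfile removal is the command suggested above; the process check on /dev/accel0 is an assumption about Cloud TPU VMs, not something from the original report:

```shell
# Hedged workaround sketch, not an official fix: after a crash, a stale
# lockfile (and possibly leftover processes) can keep the TPU registered,
# which matches the "TPU platform already registered" FAILED_PRECONDITION
# in the logs below.

# 1) (Assumption) check for leftover processes still holding the TPU device,
#    and kill them if any are found:
#      sudo lsof -w /dev/accel0

# 2) Remove the stale libtpu lockfile (may need sudo if it is owned by root):
rm -f /tmp/libtpu_lockfile
```

After this, re-running the previously failing script worked in the training-script case, but not for the inference script above.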
Reproduction
Follow the exact steps in the instructions at https://github.com/huggingface/diffusers/tree/main/examples/text_to_image#training-with-flaxjax
Logs
F0317 02:46:21.118406 358837 tpu_hal_vxc_hardware_impl_registration.cc:10] Check failed: tpu::TpuHalFactory::Register( tpu::TpuPlatformType::kHardware, tpu::TpuVersion::ke0897fcce, std::make_unique<tpu::TpuHalVxcHardwareFactory>( tpu::TpuVersion::ke0897fcce)) is OK (FAILED_PRECONDITION: TPU platform already registered for platform hardware version e0897fcce
=== Source Location Trace: ===
learning/45eac/tpu/runtime/hal/internal/tpu_hal_factory.cc:39
)
*** Check failure stack trace: ***
@ 0x7f20029aca84 (unknown)
@ 0x7f20029ac50d (unknown)
@ 0x7f20029acdc9 (unknown)
@ 0x7f2002b7d980 (unknown)
@ 0x7f200296898d (unknown)
@ 0x7f20029687a0 (unknown)
@ 0x7f20029687a0 (unknown)
@ 0x7f2002967fc4 (unknown)
@ 0x7f2002960648 (unknown)
@ 0x7f1ffcd16f1a (unknown)
@ 0x7f21edee4c65 tensorflow::tpu::InitializeTpuLibrary()
@ 0x7f21edee4dc9 tensorflow::tpu::FindAndLoadTpuLibrary()
@ 0x7f21ebfb77b9 xla::GetTpuClient()
@ 0x7f21e973578b pybind11::cpp_function::initialize<>()::{lambda()#3}::_FUN()
@ 0x7f21e9707acf pybind11::cpp_function::dispatcher()
@ 0x5f5b39 PyCFunction_Call
*** SIGABRT received by PID 358837 (TID 358837) on cpu 60 from PID 358837; ***
F0317 02:46:21.118406 358837 tpu_hal_vxc_hardware_impl_registration.cc:10] Check failed: tpu::TpuHalFactory::Register( tpu::TpuPlatformType::kHardware, tpu::TpuVersion::ke0897fcce, std::make_unique<tpu::TpuHalVxcHardwareFactory>( tpu::TpuVersion::ke0897fcce)) is OK (FAILED_PRECONDITION: TPU platform already registered for platform hardware version e0897fcce
=== Source Location Trace: ===
learning/45eac/tpu/runtime/hal/internal/tpu_hal_factory.cc:39
)
E0317 02:46:21.148313 358837 process_state.cc:784] RAW: Raising signal 6 with default behavior
Aborted (core dumped)
System Info
- diffusers version: 0.15.0.dev0
- Platform: Linux-5.13.0-1023-gcp-x86_64-with-glibc2.29
- Python version: 3.8.10
- PyTorch version (GPU?): 2.0.0+cu117 (False)
- Huggingface_hub version: 0.13.2
- Transformers version: 4.27.1
- Accelerate version: 0.17.1
- xFormers version: not installed
- Using GPU in script?: NO
- Using distributed or parallel set-up in script?: TPU pmap