Core dump when using diffusers on TPU #2745

@yiyixuxu

Description

Describe the bug

I've been running into core dump errors when using diffusers with Flax on TPU.

I first hit this issue when trying to run the text_to_image training script (I followed the exact steps in the instructions: https://github.com/huggingface/diffusers/tree/main/examples/text_to_image#training-with-flaxjax). I was able to fix it with `sudo rm -f /tmp/libtpu_lockfile`, as suggested by @patrickvonplaten.

However, I just got the same error when trying to run a different inference script (the example code in this PR I'm working on, #2727), and this time I'm not able to make it go away with the same method.

I'm not sure how to reproduce these errors reliably. It seems that once the core dump is triggered, scripts that previously worked on the same machine stop working and always fail with the same core dump. However, I'm still able to run the official Flax examples, so the issue likely lies in diffusers/transformers.

I would appreciate any advice on steps I can take to fix this issue.
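For reference, the workaround that resolved the training-script case can be sketched as below. The lockfile path comes from the suggestion above; guarding the `rm` behind an existence check is my own addition, not part of the original suggestion:

```shell
#!/bin/sh
# Workaround sketch for the TPU core dump: remove the stale libtpu
# lockfile left behind by a crashed run, then retry the script.
LOCKFILE=/tmp/libtpu_lockfile

if [ -e "$LOCKFILE" ]; then
    echo "stale lockfile found: $LOCKFILE"
    # sudo is needed because the lockfile may be owned by another user
    sudo rm -f "$LOCKFILE"
else
    echo "no stale lockfile"
fi
```

In the inference-script case this was not enough, so the stale lockfile is apparently not the only way the "TPU platform already registered" precondition can fail.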

Reproduction

Follow the exact steps in the instructions: https://github.com/huggingface/diffusers/tree/main/examples/text_to_image#training-with-flaxjax

Logs

F0317 02:46:21.118406 358837 tpu_hal_vxc_hardware_impl_registration.cc:10] Check failed: tpu::TpuHalFactory::Register( tpu::TpuPlatformType::kHardware, tpu::TpuVersion::ke0897fcce, std::make_unique<tpu::TpuHalVxcHardwareFactory>( tpu::TpuVersion::ke0897fcce)) is OK (FAILED_PRECONDITION: TPU platform already registered for platform hardware version e0897fcce
=== Source Location Trace: ===
learning/45eac/tpu/runtime/hal/internal/tpu_hal_factory.cc:39
)
*** Check failure stack trace: ***
  @   0x7f20029aca84 (unknown)
  @   0x7f20029ac50d (unknown)
  @   0x7f20029acdc9 (unknown)
  @   0x7f2002b7d980 (unknown)
  @   0x7f200296898d (unknown)
  @   0x7f20029687a0 (unknown)
  @   0x7f20029687a0 (unknown)
  @   0x7f2002967fc4 (unknown)
  @   0x7f2002960648 (unknown)
  @   0x7f1ffcd16f1a (unknown)
  @   0x7f21edee4c65 tensorflow::tpu::InitializeTpuLibrary()
  @   0x7f21edee4dc9 tensorflow::tpu::FindAndLoadTpuLibrary()
  @   0x7f21ebfb77b9 xla::GetTpuClient()
  @   0x7f21e973578b pybind11::cpp_function::initialize<>()::{lambda()#3}::_FUN()
  @   0x7f21e9707acf pybind11::cpp_function::dispatcher()
  @      0x5f5b39 PyCFunction_Call
*** SIGABRT received by PID 358837 (TID 358837) on cpu 60 from PID 358837; ***
F0317 02:46:21.118406 358837 tpu_hal_vxc_hardware_impl_registration.cc:10] Check failed: tpu::TpuHalFactory::Register( tpu::TpuPlatformType::kHardware, tpu::TpuVersion::ke0897fcce, std::make_unique<tpu::TpuHalVxcHardwareFactory>( tpu::TpuVersion::ke0897fcce)) is OK (FAILED_PRECONDITION: TPU platform already registered for platform hardware version e0897fcce
=== Source Location Trace: ===
learning/45eac/tpu/runtime/hal/internal/tpu_hal_factory.cc:39
)
E0317 02:46:21.148313 358837 process_state.cc:784] RAW: Raising signal 6 with default behavior
Aborted (core dumped)

System Info

  • diffusers version: 0.15.0.dev0
  • Platform: Linux-5.13.0-1023-gcp-x86_64-with-glibc2.29
  • Python version: 3.8.10
  • PyTorch version (GPU?): 2.0.0+cu117 (False)
  • Huggingface_hub version: 0.13.2
  • Transformers version: 4.27.1
  • Accelerate version: 0.17.1
  • xFormers version: not installed
  • Using GPU in script?: NO
  • Using distributed or parallel set-up in script?: TPU pmap

Labels: bug