Troubleshooting your Model

This section describes how to troubleshoot common functional issues that may occur when training PyTorch models on HPU but not on CPU/GPU.

Runtime Errors

Ensure that both the model and its inputs are moved to the HPU device in the script before the training loop begins. A missing device move commonly manifests as a runtime error from the Python stack, which produces a backtrace:

model_inputs = model_inputs.to("hpu")
model = model.to("hpu")
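As a minimal sketch of this pattern, the snippet below moves a model and a batch of inputs to the target device before running a forward pass. The `Linear` layer and tensor shapes are illustrative only, and the code falls back to CPU when the Habana bridge is not installed so the pattern can be tried anywhere:

```python
import torch

# Use HPU when the Habana PyTorch bridge is available; fall back to CPU
# for illustration so this sketch runs on any machine.
try:
    import habana_frameworks.torch.core as htcore  # registers the "hpu" device
    device = torch.device("hpu")
except ImportError:
    device = torch.device("cpu")

model = torch.nn.Linear(4, 2)
model = model.to(device)                      # move parameters before the training loop
model_inputs = torch.randn(8, 4).to(device)   # move each input batch as well

output = model(model_inputs)
print(tuple(output.shape))  # (8, 2)
```

Moving the model once before the loop and each batch as it is loaded keeps all tensors on the same device, which avoids the device-mismatch errors described below.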

The following lists possible runtime errors and notes on resolving them:

Error:
  KeyError: 'torch_dynamo_backends'
  torch._dynamo.exc.BackendCompilerFailed: debug_wrapper raised AssertionError: Torch not compiled with CUDA enabled

Notes:
  • Make sure that the model does not use torch.compile (to be supported in a future release).
  • Make sure PT_HPU_LAZY_MODE is set to "1".
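Environment flags such as PT_HPU_LAZY_MODE are typically read when the Habana PyTorch bridge initializes, so they should be set before the bridge is imported. A small sketch of setting the flag from the script rather than the shell (the bridge import is commented out because it requires a Gaudi environment):

```python
import os

# Set the flag before any habana_frameworks import; setting it afterwards
# may have no effect because the bridge reads it at initialization.
os.environ["PT_HPU_LAZY_MODE"] = "1"

# import habana_frameworks.torch.core as htcore  # import only after the flag is set

print(os.environ["PT_HPU_LAZY_MODE"])  # 1
```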

Error:
  RuntimeError: FATAL ERROR :: MODULE:BRIDGE syn compile encountered : Graph compile failed. 26 compile time 5076188974 ns

Notes:
  • Make sure the Eager/Lazy mode flags are set correctly.
  • Or, run the following to enable detailed logging:
    • export LOG_LEVEL_ALL_PT=1
    • export ENABLE_CONSOLE=true
    • export LOG_LEVEL_ALL=4
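The same debug-logging exports can also be set from Python before the bridge loads. A sketch mirroring the shell commands above:

```python
import os

# Equivalent of the shell exports, set from the script before the
# Habana bridge is imported so the debug logs take effect.
os.environ["LOG_LEVEL_ALL_PT"] = "1"
os.environ["ENABLE_CONSOLE"] = "true"
os.environ["LOG_LEVEL_ALL"] = "4"

print({k: os.environ[k]
       for k in ("LOG_LEVEL_ALL_PT", "ENABLE_CONSOLE", "LOG_LEVEL_ALL")})
```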

Error:
  File "/usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/core/step_closure.py", line 45, in mark_step
    htcore._mark_step(device_str)
  RuntimeError: FATAL ERROR :: MODULE:BRIDGE syn launch encountered : synLaunch failed. 26

Notes:
  Run the following:
  • export LOG_LEVEL_ALL_PT=1
  • export ENABLE_CONSOLE=true
  • export LOG_LEVEL_ALL=4

Error:
  RuntimeError: Unaccounted output %t925__1 at index 21. Cached recipe execution might break

Notes:
  Run the following:
  • export LOG_LEVEL_ALL_PT=1
  • export ENABLE_CONSOLE=true
  • export LOG_LEVEL_ALL=4

Error:
  RuntimeError: Sizes of tensors along one of the non-cat dimensions don't match

Notes:
  Check the cat operation: all input tensors must match in size along every dimension except the concatenation dimension. Fixing this requires modifying the model code, which may only be possible if you control the model source.
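To see why this error occurs, the concatenation rule can be checked ahead of time. The helper below is hypothetical (not part of PyTorch or the Habana bridge) and uses plain Python shape tuples so it runs anywhere; it pinpoints which input violates the rule:

```python
# Hypothetical helper: cat requires all tensors to match on every dimension
# except the concatenation dimension `dim`. Validating shapes up front
# identifies the offending input before the runtime error fires.
def check_cat_shapes(shapes, dim):
    ref = shapes[0]
    for i, shape in enumerate(shapes[1:], start=1):
        for d, (a, b) in enumerate(zip(ref, shape)):
            if d != dim and a != b:
                raise ValueError(
                    f"tensor {i} size {b} != {a} along non-cat dimension {d}"
                )

# Shapes (8, 4) and (8, 5) can be concatenated along dim=1 but not dim=0.
check_cat_shapes([(8, 4), (8, 5)], dim=1)  # passes
try:
    check_cat_shapes([(8, 4), (8, 5)], dim=0)
except ValueError as e:
    print(e)  # reports the mismatched non-cat dimension
```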

Error:
  RuntimeError: tensor does not have a device

Notes:
  Run the following:
  • export LOG_LEVEL_ALL_PT=1
  • export ENABLE_CONSOLE=true
  • export LOG_LEVEL_ALL=4

Error:
  RuntimeError: optimum-habana v1.x.x has been validated for Intel Gaudi 1.15.1, but the displayed driver version is v1.y.y. This could lead to undefined behavior.

Notes:
  Make sure you use the supported versions according to the Support Matrix.

Error:
  RuntimeError: synStatus=26 [Generic failure] Device acquire failed.

Notes:
  Your model requires more than one Gaudi, or other models are already running on the available Gaudis. Verify whether other models are running:
  • Run hl-smi to verify if other cards are consuming memory or running PyTorch jobs.
  • Run docker stats on bare metal to verify if other Docker images are consuming resources.
  • From the Docker image, run ps aux to verify which Python jobs are running. You can run pkill to stop the jobs.
  • From the Jupyter Notebook, run exit() or restart the Jupyter kernel to ensure that all jobs are stopped and no longer consuming resources.
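When scanning `ps aux` output for leftover training jobs, a small filter can help. The helper below is hypothetical and simply parses `ps aux`-style text for Python processes; the sample output is fabricated for illustration:

```python
# Hypothetical helper: filter `ps aux`-style output down to Python
# processes that may still be holding Gaudi devices.
def find_python_jobs(ps_output):
    jobs = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 10)        # 11th field is the full command
        if len(fields) == 11 and "python" in fields[10]:
            jobs.append((fields[1], fields[10]))  # (PID, command)
    return jobs

# Fabricated sample of `ps aux` output for illustration.
sample = """USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 101 0.0 0.1 1000 200 ? Ss 10:00 0:00 /bin/bash
root 202 95.0 4.2 9000 800 ? Rl 10:01 5:00 python3 train.py --epochs 10"""

print(find_python_jobs(sample))  # [('202', 'python3 train.py --epochs 10')]
```

The PIDs returned can then be passed to `pkill`/`kill` as described above.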

Performance Issues

For details on how to get the best performance on HPU, refer to the Model Performance Optimization Guide for PyTorch.