Troubleshooting your Model
This section provides troubleshooting instructions for common issues encountered when training PyTorch models. The following functional issues may occur when running on HPU but not on CPU/GPU.
Runtime Errors
Make sure that both the model and its inputs are moved to the HPU device before the training loop begins. If they are not, the most common symptom is a runtime error raised from the Python stack, accompanied by a backtrace:
model_inputs = model_inputs.to("hpu")
model = model.to("hpu")
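For reference, the following is a minimal sketch of this setup in lazy mode; MyModel and train_loader are placeholders for your own module and data loader, and are not part of any Habana API:

import torch
import habana_frameworks.torch.core as htcore   # enables HPU support and lazy-mode APIs

device = torch.device("hpu")

model = MyModel().to(device)                     # move the model to HPU once, before the loop
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()

for inputs, labels in train_loader:
    inputs = inputs.to(device)                   # move each batch to HPU inside the loop
    labels = labels.to(device)

    optimizer.zero_grad()
    loss = loss_fn(model(inputs), labels)
    loss.backward()
    htcore.mark_step()                           # trigger graph execution in lazy mode
    optimizer.step()
    htcore.mark_step()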
The following table outlines possible runtime errors:
| Error | Notes |
|---|---|
| KeyError: 'torch_dynamo_backends' torch._dynamo.exc.BackendCompilerFailed: debug_wrapper raised AssertionError: Torch not compiled with CUDA enabled | |
| RuntimeError: FATAL ERROR :: MODULE:BRIDGE syn compile encountered : Graph compile failed. 26 compile time 5076188974 ns | |
| File "/usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/core/step_closure.py", line 45, in mark_step htcore._mark_step(device_str) RuntimeError: FATAL ERROR :: MODULE:BRIDGE syn launch encountered : synLaunch failed. 26 | Run the following: |
| RuntimeError: Unaccounted output %t925__1 at index 21. Cached recipe execution might break | Run the following: |
| RuntimeError: Sizes of tensors along one of the non-cat dimensions don't match | Check the cat operation. This may not be fixable by the user unless the model was written from scratch. |
| RuntimeError: tensor does not have a device | Run the following: |
| RuntimeError: optimum-habana v1.6.0 has been validated for SynapseAI 1.11.0, but the displayed driver version is v1.10.0. This could lead to undefined behavior. | Make sure you use the supported versions according to the Support Matrix. |
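The BackendCompilerFailed / "Torch not compiled with CUDA enabled" error in the first row typically appears when torch.compile is invoked without an HPU-capable backend. The following is a minimal sketch only, assuming the hpu_backend registered by habana_frameworks is available in your release; MyModel and inputs are placeholders, and the PT_HPU_LAZY_MODE setting may not be required on every release:

import os
os.environ.setdefault("PT_HPU_LAZY_MODE", "0")   # assumption: torch.compile may require lazy mode disabled

import torch
import habana_frameworks.torch.core as htcore    # registers the HPU backend with PyTorch

model = MyModel().to("hpu")                      # MyModel is a placeholder for your own module
# Selecting the HPU backend explicitly prevents dynamo from falling back to a CUDA-only backend.
compiled_model = torch.compile(model, backend="hpu_backend")

outputs = compiled_model(inputs.to("hpu"))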
Performance Issues
For details on how to achieve the best performance on HPU, refer to the Model Performance Optimization Guide for PyTorch.