Troubleshooting PyTorch Model
This section provides troubleshooting instructions for common issues that may occur when training PyTorch models on the Intel® Gaudi® AI accelerator.
Runtime Errors
Ensure that both the model and its inputs are moved to the device before the training loop begins. Forgetting to do so commonly manifests as runtime errors from the Python stack, accompanied by a backtrace:
model = model.to("hpu")
model_inputs = model_inputs.to("hpu")
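For reference, here is a minimal training-step sketch with both pieces on the device. The model class and dataloader are hypothetical placeholders, and the htcore.mark_step() calls apply only to lazy mode:

```python
import torch
import habana_frameworks.torch.core as htcore  # loads the HPU backend

model = MyModel().to("hpu")                    # hypothetical model class
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

for inputs, labels in dataloader:              # hypothetical dataloader
    inputs = inputs.to("hpu")                  # move each batch to the device
    labels = labels.to("hpu")
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(inputs), labels)
    loss.backward()
    htcore.mark_step()                         # flush the lazy-mode graph after backward
    optimizer.step()
    htcore.mark_step()                         # flush again after the optimizer step
```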
The following table outlines possible runtime errors and their workarounds:
| Error | Workaround |
|---|---|
| KeyError: 'torch_dynamo_backends' torch._dynamo.exc.BackendCompilerFailed: debug_wrapper raised AssertionError: Torch not compiled with CUDA enabled | See the torch.compile sketch after this table. |
| RuntimeError: FATAL ERROR :: MODULE:BRIDGE syn compile encountered : Graph compile failed. 26 compile time 5076188974 ns | |
| File | Run the following: |
| RuntimeError: Unaccounted output %t925__1 at index 21. Cached recipe execution might break | Run the following: |
| RuntimeError: Sizes of tensors along one of the non-cat dimensions don't match | Check the cat operation: all input tensors must match in size along every dimension except the concatenation dimension. Fixing this requires changing the model code, which may only be possible if the model was written from scratch. |
| RuntimeError: tensor does not have a device | Run the following: |
| RuntimeError: optimum-habana v1.x.x has been validated for Intel Gaudi software 1.18.0, but the displayed driver version is v1.y.y. This could lead to undefined behavior. | Make sure you use the supported versions according to the Support Matrix. |
| RuntimeError: synStatus=26 [Generic failure] Device acquire failed. | Your model requires more than one Gaudi card, or other models are already running on the available Gaudi cards. Verify whether other models are running. |
| RuntimeError: Port 9: DOWN | In a single-server setup, internal port 9 may become disconnected if external interfaces are not connected; as a result, a workload cannot run on eight Gaudi cards. |
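For the torch_dynamo_backends / BackendCompilerFailed errors in the first row, the usual cause is torch.compile falling back to a CUDA-oriented backend. The following is a minimal sketch, assuming a Gaudi software stack where importing habana_frameworks.torch registers the hpu_backend with dynamo:

```python
import torch
import habana_frameworks.torch  # assumption: this import registers "hpu_backend" with dynamo

model = MyModel().to("hpu")  # hypothetical model

# Pass the Gaudi backend explicitly instead of relying on the
# default (CUDA-oriented) backend selection.
compiled_model = torch.compile(model, backend="hpu_backend")
```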
Using torch.float8_e4m3fn on Gaudi 2

When running torch.float8_e4m3fn on Gaudi 2, the IEEE float standard is used instead of the torch.float8_e4m3fn format. This allows FP8 training and inference models to run without any change on Gaudi 2, since Gaudi 2 does not support torch.float8_e4m3fn. If torch.float8_e4m3fn is used in your model on Gaudi 2, the data type actually supported is torch.float8_e4m3fnuz, which leads to differences in max/Inf/NaN values as shown below:
- The maximum supported value differs between Gaudi 2 FP8-143 (S.1110.111 = 240.0) and torch.float8_e4m3fn FP8-143 (S.1111.110 = 448.0).
- Gaudi 2 requires an encoding for Inf (S.1111.000), which is not supported in torch.float8_e4m3fn.
- Gaudi 2 has more encodings for NaN (S.1111.{001, 010, 011, 100, 101, 110, 111}), while torch.float8_e4m3fn supports only two (S.1111.111 with either sign).
Since the max values differ, using torch.float8_e4m3fn yields a different range of FP8 values. The corresponding changes for FP8 scaling are applied automatically by Intel Gaudi. However, if you explicitly depend on the max/Inf/NaN values of torch.float8_e4m3fn on Gaudi 2, you must modify these values to use the torch.float8_e4m3fnuz values.
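A quick way to see the range difference is torch.finfo; this sketch assumes a PyTorch build that includes both FP8 dtypes:

```python
import torch

# Maximum representable magnitudes of the two FP8 formats:
print(torch.finfo(torch.float8_e4m3fn).max)    # 448.0
print(torch.finfo(torch.float8_e4m3fnuz).max)  # 240.0

# Code that hard-codes 448.0 as the e4m3 max must switch to the
# torch.float8_e4m3fnuz value (240.0) on Gaudi 2.
```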
Performance Issues

For details on how to get the best performance on HPU, refer to the Model Performance Optimization Guide for PyTorch.