Troubleshooting PyTorch Model

This section provides troubleshooting instructions for common issues that may occur when training PyTorch models on the Intel® Gaudi® AI accelerator.

Runtime Errors

Ensure that both the model and its inputs are moved to the device in the script before the training loop begins. When they are not, the most common symptom is a runtime error from the Python stack that results in a backtrace:

model_inputs = model_inputs.to("hpu")
model = model.to("hpu")
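
Inputs often arrive as nested dictionaries or lists rather than as a single tensor. The following is a minimal sketch of a helper that moves every object exposing a .to() method to the device. The name move_to_device is hypothetical and is not part of PyTorch or the Intel Gaudi software. The sketch assumes the "hpu" device has already been made available by importing the Intel Gaudi PyTorch bridge.

```python
def move_to_device(obj, device="hpu"):
    """Recursively move tensors (or modules) inside nested containers.

    Hypothetical helper: anything with a .to() method is moved,
    dicts, lists and tuples are walked, everything else is returned as-is.
    """
    if hasattr(obj, "to"):
        return obj.to(device)
    if isinstance(obj, dict):
        return {key: move_to_device(value, device) for key, value in obj.items()}
    if isinstance(obj, list):
        return [move_to_device(value, device) for value in obj]
    if isinstance(obj, tuple):
        return tuple(move_to_device(value, device) for value in obj)
    return obj

# Typical use inside the training loop (batch comes from your own data loader):
#     batch = move_to_device(batch, "hpu")
```

Because the helper only relies on the .to() method, the same call works for a single tensor, a model, or a whole batch dictionary.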

The following table outlines possible runtime errors and their workarounds:

Error

Workaround

KeyError: ‘torch_dynamo_backends’ torch._dynamo.exc.BackendCompilerFailed: debug_wrapper raised AssertionError: Torch not compiled with CUDA enabled

  • Make sure that the model does not use torch.compile if it does not support it.

  • Make sure PT_HPU_LAZY_MODE is set to “1”.

RuntimeError: FATAL ERROR :: MODULE:BRIDGE syn compile encountered : Graph compile failed. 26 compile time 5076188974 ns

  • Make sure Eager/Lazy mode flags were set correctly:

    • Eager mode: PT_HPU_LAZY_MODE=0

    • Lazy mode: PT_HPU_LAZY_MODE=1

  • Run the following:

    • export LOG_LEVEL_ALL_PT=1

    • export ENABLE_CONSOLE=true

    • export LOG_LEVEL_ALL=4

  • Make sure not to use FP32 data type on HL-225D as it is not supported on this device.

File /usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/core/step_closure.py, line 45, in mark_step htcore._mark_step(device_str) RuntimeError: FATAL ERROR :: MODULE:BRIDGE syn launch encountered : synLaunch failed. 26

Run the following:

  • export LOG_LEVEL_ALL_PT=1

  • export ENABLE_CONSOLE=true

  • export LOG_LEVEL_ALL=4

RuntimeError: Unaccounted output %t925__1 at index 21. Cached recipe execution might break

Run the following:

  • export LOG_LEVEL_ALL_PT=1

  • export ENABLE_CONSOLE=true

  • export LOG_LEVEL_ALL=4

RuntimeError: Sizes of tensors along one of the non-cat dimensions don’t match

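Check the cat operation in the model: the tensors being concatenated must have matching sizes in every dimension except the one they are concatenated along. You may only be able to fix this if you wrote the model from scratch.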

RuntimeError: tensor does not have a device

Run the following:

  • export LOG_LEVEL_ALL_PT=1

  • export ENABLE_CONSOLE=true

  • export LOG_LEVEL_ALL=4

RuntimeError: optimum-habana v1.x.x. has been validated for Intel Gaudi software 1.19.1, but the displayed driver version is v1.y.y. This could lead to undefined behavior.

Make sure you use the supported versions according to the Support Matrix.

RuntimeError: synStatus=26 [Generic failure] Device acquire failed.

Your model requires more than one Gaudi card or other models are already running on the available Gaudi cards. Verify if other models are running:

  • Run hl-smi to verify if other cards are consuming memory or running PyTorch jobs.

  • Run docker stats on bare metal to verify if other Docker images are consuming resources.

  • From the Docker image, run ps aux to verify which Python jobs are running. You can run pkill to stop the jobs.

  • From the Jupyter Notebook, run exit() or restart the Jupyter Kernel to ensure that all the jobs are stopped and are not consuming resources.

RuntimeError: Port 9: DOWN

In a single-server setup, internal port 9 may get disconnected if external interfaces are not connected. As a result, a workload cannot be run on eight Gaudi cards.

  1. Verify that port 9 is DOWN by running ./check_link_status1.sh -o link.

  2. If it is DOWN, install hl_qual by following the instructions in Custom Driver and Software Installation. Skip this step if it is already installed.

  3. Run the following:

    cd /opt/habanalabs/qual/gaudi2/bin/
    ./manage_network_ifs.sh --down
    ./manage_network_ifs.sh --up
    ./manage_network_ifs.sh --down
    ./manage_network_ifs.sh --up
    ./manage_network_ifs.sh --down
    
  4. Run ./check_link_status1.sh -o link. All ports should be UP.

Alternatively, disable the external ports by following the instructions in Disable/Enable Gaudi 2 External NICs.
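
Several rows in the table above ask for the same three logging variables, and two rows depend on PT_HPU_LAZY_MODE. As a rough sketch, these can also be set from Python at the very top of the training script instead of with export. The name enable_debug_logging is hypothetical, and the sketch assumes the variables need to be in the environment before the Intel Gaudi PyTorch bridge is imported.

```python
import os

# Values copied from the "Run the following" workarounds in the table above.
DEBUG_LOG_ENV = {
    "LOG_LEVEL_ALL_PT": "1",
    "ENABLE_CONSOLE": "true",
    "LOG_LEVEL_ALL": "4",
}

def enable_debug_logging(lazy_mode=None, env=os.environ):
    """Set the logging variables and, optionally, the Eager/Lazy mode flag.

    Hypothetical helper. lazy_mode=True sets PT_HPU_LAZY_MODE=1,
    lazy_mode=False sets PT_HPU_LAZY_MODE=0, None leaves it untouched.
    Returns the names of the variables that were written.
    """
    written = []
    for name, value in DEBUG_LOG_ENV.items():
        env[name] = value
        written.append(name)
    if lazy_mode is not None:
        env["PT_HPU_LAZY_MODE"] = "1" if lazy_mode else "0"
        written.append("PT_HPU_LAZY_MODE")
    return written

# Call this before any Intel Gaudi import in the script, for example:
#     enable_debug_logging(lazy_mode=True)
```

Passing a plain dictionary as env makes it possible to inspect the settings without touching the real process environment.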

Using torch.float8_e4m3fn on Gaudi 2

When running torch.float8_e4m3fn on Gaudi 2, the IEEE float format is used instead of the torch.float8_e4m3fn format. This allows FP8 training and inference models to run unchanged on Gaudi 2, which does not support torch.float8_e4m3fn.

If torch.float8_e4m3fn is used in your model on Gaudi 2, the data type actually supported is torch.float8_e4m3fnuz, which leads to differences in max/Inf/NaN values as shown below:

  • The max supported value differs: Gaudi 2 FP8-143 (S.1110.111 = 240.0) versus torch.e4m3fn FP8-143 (S.1111.110 = 448.0).

  • Gaudi 2 requires an encoding for INF (S.1111.000), which torch.e4m3fn does not support.

  • Gaudi 2 has more encodings for NaN (S.1111.{001, 010, 011, 100, 101, 110, 111}), while torch.e4m3fn supports only two (S.1111.111).
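
The two max values in the list above follow directly from the bit patterns. The sketch below decodes a positive, normal FP8-143 value as 2^(exponent - bias) × (1 + mantissa/8). The function name decode_e4m3 is hypothetical, and the exponent bias of 7 is an assumption that reproduces both values quoted above.

```python
def decode_e4m3(exponent_bits, mantissa_bits, bias=7):
    """Decode a positive, normal 4-bit-exponent / 3-bit-mantissa value.

    Hypothetical helper for illustration: it ignores the sign bit,
    subnormals, and the INF/NaN encodings discussed above.
    """
    exponent = int(exponent_bits, 2)
    mantissa = int(mantissa_bits, 2)
    return 2.0 ** (exponent - bias) * (1 + mantissa / 8)

gaudi2_max = decode_e4m3("1110", "111")    # S.1110.111
torch_fn_max = decode_e4m3("1111", "110")  # S.1111.110
print(gaudi2_max, torch_fn_max)  # 240.0 448.0
```

Gaudi 2 stops one exponent step earlier because the patterns with exponent 1111 are used for INF and NaN, which is why its largest finite value is smaller.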

Since the max values differ, the range of FP8 values is different when you use torch.float8_e4m3fn on Gaudi 2. The corresponding changes for FP8 scaling are provided automatically by Intel Gaudi. However, if your code explicitly depends on the max/Inf/NaN values of torch.float8_e4m3fn on Gaudi 2, these values have to be modified to use the torch.float8_e4m3fnuz values.
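
As an illustration of what explicitly depending on the max value can look like, the sketch below computes a per-tensor scale from a lookup table instead of a hard-coded 448.0. This is not the automatic scaling provided by Intel Gaudi. FP8_MAX and fp8_scale are hypothetical names, and the two constants are the max values quoted in this section.

```python
# Largest finite value per format, as quoted in this section.
FP8_MAX = {
    "float8_e4m3fn": 448.0,
    "float8_e4m3fnuz": 240.0,  # the format actually used on Gaudi 2
}

def fp8_scale(amax, dtype_name):
    """Scale factor that maps a tensor's absolute max onto the FP8 max.

    Hypothetical user-side helper: amax is the largest absolute value
    observed in the tensor that is about to be cast to FP8.
    """
    if amax <= 0:
        raise ValueError("amax must be positive")
    return FP8_MAX[dtype_name] / amax

# The same tensor needs a smaller scale on Gaudi 2 than the 448.0 constant implies.
print(fp8_scale(480.0, "float8_e4m3fnuz"))  # 0.5
```

Keeping the constant in one table means that moving code written for torch.float8_e4m3fn to Gaudi 2 only requires changing the dtype name.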

Performance Issues

For details on how to get the best performance on HPU, refer to the Model Performance Optimization Guide for PyTorch.