Debugging Model Divergence¶
Print Tensors¶
The values stored in tensors can indicate how the training process is behaving. In particular, it is recommended to look for unexpected 0.0 or NaN values.
Any standard Python debugger can be used to set breakpoints, pause execution, and step through the training or modeling script.
At any point in the script, you can use the Python print() function to print the contents of a tensor.
Note that if the tensor is on the Gaudi (i.e. hpu) device, you need to move it to the CPU before you can print it. For example, if ‘t’ is a tensor on the hpu device, you can print it as:
print(t.to('cpu'))
Tensors on hpu containing a single value can be printed directly using the item() method. For example, if ‘t’ is such a tensor, you can print it as:
print(t.item())
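As a minimal sketch of how these pieces fit together in a script, assuming a Gaudi PyTorch installation where habana_frameworks.torch is available (the tensor names and shapes below are hypothetical):
import torch
import habana_frameworks.torch.core as htcore   # registers the hpu device (assumes a Gaudi PyTorch install)

t = torch.randn(4, 4).to('hpu')          # hypothetical tensor on the Gaudi device
loss = (t * t).sum()                     # hypothetical single-value tensor derived from it

# breakpoint()                           # optionally pause here with the standard Python debugger

print(t.to('cpu'))                       # full tensors must be moved to CPU before printing
print(loss.item())                       # single-value tensors can be printed via item()
print(torch.isnan(t).any().item())       # True if any NaN is present
print((t == 0.0).any().item())           # True if any element is exactly 0.0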
When running in legacy Lazy mode, inspecting tensor values requires forcing op-by-op execution rather than allowing ops to be fused into larger graphs. This behavior, similar to Eager mode, can be enabled by setting the PT_HPU_MAX_COMPOUND_OP_SIZE=1 environment variable, which limits the cluster size to a single op.
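For example, assuming your training entry point is a script (train.py below is a hypothetical name), the variable can be set for a single run from the shell:
$ PT_HPU_MAX_COMPOUND_OP_SIZE=1 python3 train.py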
Other Data Types¶
If a model converges on CPU but fails to converge on Gaudi, it is useful to experiment with other data types. For example, use the FP32 data type instead of BF16.
Once the FP32-based model converges, you can experiment with different mixed precision configurations to achieve the best performance/accuracy trade-off for your model. Refer to the Mixed Precision Training with PyTorch Autocast section for more details on the configuration procedure and debugging.
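As a rough sketch of such an experiment, assuming a Gaudi PyTorch installation and a hypothetical toy model (the autocast usage follows the Mixed Precision Training with PyTorch Autocast section; removing the context manager runs the same code in FP32):
import torch
import torch.nn as nn
import habana_frameworks.torch.core as htcore    # assumes a Gaudi PyTorch install

model = nn.Linear(8, 2).to('hpu')                # hypothetical toy model
data = torch.randn(4, 8, device='hpu')           # hypothetical input batch
target = torch.randn(4, 2, device='hpu')         # hypothetical targets

# BF16 mixed precision via autocast on the hpu device
with torch.autocast(device_type='hpu', dtype=torch.bfloat16):
    output = model(data)
    loss = nn.functional.mse_loss(output, target)
print(loss.item())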
Framework Version¶
If your model converges successfully on CPU or GPU but fails to converge on Gaudi, check for a mismatch between the framework version you are running on Gaudi and the framework version you ran on the CPU or GPU. You can print the installed PyTorch version on each system with:
$ python3 -c "import torch; print(torch.__version__)"
If you run training using an existing checkpoint and your model fails to converge on Gaudi, check for a mismatch between the version of the framework you are running and the Gaudi framework version that generated the checkpoint.
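One way to make this check straightforward is to store the running framework version alongside the checkpoint when it is written and compare it when resuming. This is a general pattern rather than a Gaudi-specific API; the model and file name below are hypothetical:
import torch
import torch.nn as nn

model = nn.Linear(8, 2)                          # hypothetical stand-in for your model

# When saving, record the framework version next to the weights
torch.save({'state_dict': model.state_dict(),
            'torch_version': torch.__version__}, 'ckpt.pt')

# When resuming, compare the stored version against the framework you are running now
ckpt = torch.load('ckpt.pt')
if ckpt.get('torch_version') != torch.__version__:
    print(f"Checkpoint written with torch {ckpt.get('torch_version')}, "
          f"running torch {torch.__version__}")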