Debugging Model Divergence
Print Tensors¶
The values stored in tensors can give an indication of the training process behavior. In particular, it is recommended to look for unexpected 0.0 or NaN values.
Setting the PT_HPU_MAX_COMPOUND_OP_SIZE environment variable to 1 limits cluster sizes to 1, which executes a PyTorch model op by op and lets you examine tensor values at the input and/or output of each op. PT_HPU_MAX_COMPOUND_OP_SIZE=1 emulates Eager mode functionality, since Eager mode support, as a subset of Lazy mode, is deprecated.
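As a minimal sketch, the variable can also be set from inside the training script instead of the shell. This assumes the variable is read when the Gaudi PyTorch bridge is loaded, so the assignment is placed before any framework import. That ordering is an assumption, not something this page states.

```python
import os

# Assumption: the variable is read when the Gaudi PyTorch bridge is loaded,
# so it is set before any framework import.
os.environ["PT_HPU_MAX_COMPOUND_OP_SIZE"] = "1"

# Framework imports and the rest of the training script would follow here.
print(os.environ["PT_HPU_MAX_COMPOUND_OP_SIZE"])
```

Exporting the variable in the shell before launching the script has the same intent and avoids any dependency on import order.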
You can use any standard Python debugger to set breakpoints, break into the training/modeling script, step through the script and so on.
At any point in the script, you can use the Python print() function to print the contents of tensors.
Note that if the tensor is on the Gaudi (i.e. hpu) device, you need to bring it to the CPU before you can print it. For example, if ‘t’ is a tensor on the hpu device, you can print it as:
print(t.to('cpu'))
Tensors on hpu containing a single value can be printed directly using the item() method.
For example, if ‘t’ is a single-value tensor on the hpu device, you can print it as:
print(t.item())
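When a tensor is too large to inspect by eye, a small helper can count the suspicious values instead. The sketch below works on a flat list of Python floats so that it runs without Gaudi hardware. The helper name summarize_values is hypothetical, and the commented-out conversion line assumes ‘t’ is a floating-point tensor.

```python
import math

def summarize_values(values):
    """Count NaN and exact-zero entries in a flat list of Python floats."""
    nan_count = sum(1 for v in values if math.isnan(v))
    zero_count = sum(1 for v in values if v == 0.0)
    return {"count": len(values), "nan": nan_count, "zero": zero_count}

# With a real tensor, the list could be obtained as (not executed here):
#   values = t.to('cpu').float().flatten().tolist()
values = [0.5, 0.0, float("nan"), -1.25]
print(summarize_values(values))
```

A sudden rise in the NaN or zero count between two training steps points to the step where the divergence starts.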
Other Data Types¶
If a model converges on CPU but fails to converge on Gaudi, it is useful to experiment with other data types. For example, use the FP32 data type instead of BF16.
Once the FP32-based model converges, you can experiment with different mixed precision configurations to find the best performance/accuracy trade-off for your model. Refer to the Mixed Precision Training with PyTorch Autocast section for more details on the configuration procedure and debugging.
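One reason BF16 can behave differently from FP32 is its shorter mantissa: near 1.0, adjacent BF16 values are 2^-7 (0.0078125) apart, so smaller updates can be rounded away. The sketch below emulates BF16 rounding in pure Python to illustrate this. The helper to_bf16 is hypothetical, handles only finite values in the FP32 range, and is not how the hardware or the framework performs the conversion.

```python
import struct

def to_bf16(x):
    """Round a finite float to the nearest BF16 value (ties to even)."""
    # Reinterpret the FP32 encoding of x as a 32-bit unsigned integer.
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    # Add a rounding bias, then keep only the upper 16 bits (the BF16 part).
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack(">f", struct.pack(">I", bits))[0]

weight = 1.0
update = 0.001  # illustrative value, smaller than the BF16 spacing near 1.0

# The update is lost: the sum rounds back to the original weight.
print(to_bf16(weight + update) == weight)
```

If a model's weight updates fall in this range, running in FP32 first helps separate precision effects from other causes of divergence.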
Framework Version¶
If your model converges successfully on CPU or GPU but fails to converge on Gaudi, check for a mismatch between the version of the framework you are running on Gaudi and the version of the framework you are running on the CPU or GPU.
$ python3 -c "import torch; print(torch.__version__)"
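Version strings printed on different machines often carry different build suffixes, so comparing them by eye is error-prone. The sketch below compares only the leading major.minor.patch numbers. The function names and the sample version strings are made up for illustration.

```python
import re

def release_tuple(version):
    """Extract (major, minor, patch) from the start of a version string."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    # A missing patch component is treated as 0.
    return tuple(int(part) for part in match.groups(default="0"))

def same_release(version_a, version_b):
    """Return True when both strings name the same major.minor.patch."""
    return release_tuple(version_a) == release_tuple(version_b)

# Hypothetical strings as printed on a Gaudi machine and a GPU machine.
print(same_release("2.1.0a0+gitabc123", "2.1.0+cu118"))
print(same_release("2.1.0", "2.0.1"))
```

Matching release numbers do not guarantee identical behavior, but a mismatch is a cheap first thing to rule out.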
If you run training using an existing checkpoint and your model fails to converge on Gaudi, check for a mismatch between the version of the framework you are running and the Gaudi framework version that generated the checkpoint.