Debugging Model Divergence
The values stored in tensors can give an indication of the behavior of the training process. In particular, look for unexpected 0.0 or NaN values.
Habana’s integration of PyTorch supports eager mode execution. This mode executes a PyTorch model op by op, so you can examine tensor values at the input and/or output of each op. You can use any standard Python debugger to set breakpoints, break into the training/modeling script, step through the script, and so on. At any point in the script, you can use the Python print() function to print the contents of the tensors. Note that if the tensor is on the Gaudi (i.e. ‘hpu’) device, you need to bring it to the CPU before you can print it. For example, if ‘t’ is a tensor on the ‘hpu’ device, you can print it with print(t.to('cpu')).
Tensors on ‘hpu’ containing a single value can be printed directly using the item() method. For example, if ‘t’ is a single-value tensor, you can print it with print(t.item()).
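When stepping through a script, it can help to summarize a tensor rather than print every element. The following is a minimal sketch, not part of the Habana tooling: summarize_tensor is a hypothetical helper name, and the example tensor is created on the CPU so the snippet runs anywhere. On Gaudi the tensor would live on ‘hpu’, and the to("cpu") call inside the helper would perform the copy described above.

```python
import torch


def summarize_tensor(t: torch.Tensor) -> dict:
    """Hypothetical helper: count NaN and exact-zero entries in a tensor."""
    # Bring the data to the CPU first; this is a no-op for CPU tensors.
    cpu_t = t.detach().to("cpu")
    return {
        "numel": cpu_t.numel(),
        "nan": int(torch.isnan(cpu_t).sum().item()),
        "zero": int((cpu_t == 0).sum().item()),
    }


# Illustrative values only; in practice pass an activation, weight, or gradient.
t = torch.tensor([1.0, 0.0, float("nan"), 2.0])
summary = summarize_tensor(t)
print(summary)
```

Calling such a helper at a breakpoint, or right after a suspicious op, gives a quick signal of where unexpected 0.0 or NaN values first appear.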
Other Data Types
If a model converges on CPU but fails to converge on Gaudi, it is useful to experiment with other data types. For example, use the FP32 data type instead of BF16.
Once the FP32-based model converges, you may want to experiment with different mixed precision configurations to find the best balance of performance and accuracy. Please refer to the PyTorch Mixed Precision Training on Gaudi section for more details on the configuration procedure and debugging.
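One reason the data type can matter is that BF16 keeps fewer mantissa bits than FP32, so a small update added to a larger value may be rounded away. The sketch below illustrates this on the CPU with plain PyTorch tensors; update_survives is a hypothetical helper, and the weight and update values are chosen only for illustration, not taken from any particular model.

```python
import torch


def update_survives(dtype: torch.dtype, weight: float = 1.0, update: float = 1e-3) -> bool:
    """Hypothetical helper: does adding `update` to `weight` change the value in `dtype`?"""
    w = torch.tensor(weight, dtype=dtype)
    u = torch.tensor(update, dtype=dtype)
    # If the sum rounds back to the original value, the update is lost.
    return bool((w + u) != w)


kept_in_fp32 = update_survives(torch.float32)
kept_in_bf16 = update_survives(torch.bfloat16)
print("FP32 keeps the update:", kept_in_fp32)
print("BF16 keeps the update:", kept_in_bf16)
```

If a model only diverges in BF16, effects of this kind are one possible cause, which is why converging in FP32 first gives a useful baseline.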
If your model converges successfully on CPU or GPU but fails to converge on Gaudi, check whether the framework version you are running on Gaudi matches the framework version you are running on the CPU or GPU.
For PyTorch, use:
$ python3 -c "import torch; print(torch.__version__)"
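Version strings reported by different builds often carry build-specific suffixes, so comparing them by eye can be error-prone. The following sketch, using only the Python standard library, compares the major.minor.patch part of two strings. The helper names release_tuple and same_release are hypothetical, and the two version strings are made-up placeholders; substitute the values printed by the command above in each environment.

```python
import re


def release_tuple(version: str) -> tuple:
    """Extract (major, minor, patch) from a version string such as '2.1.0+cpu'."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def same_release(version_a: str, version_b: str) -> bool:
    """Return True when both strings share the same major.minor.patch release."""
    return release_tuple(version_a) == release_tuple(version_b)


# Placeholder strings for illustration, not real measurements.
gaudi_version = "2.1.0a0+gitabc123"
other_version = "2.1.0+cu118"
print(same_release(gaudi_version, other_version))
```

Whether a matching release number is sufficient depends on the model; this check only flags the obvious mismatches.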
If you train using an existing checkpoint and your model fails to converge on Gaudi, check for a mismatch between the version of the framework you are running and the Gaudi framework version that generated the checkpoint.