Debugging Model Divergence

Other Data Types

If a model converges on CPU but fails to converge on Gaudi, try a different data type. For example, train in FP32 instead of BF16 to check whether reduced precision is causing the divergence.

Once the FP32-based model converges, you can experiment with different mixed precision configurations to find the best balance of performance and accuracy. Refer to the Mixed Precision Training with PyTorch Autocast section for details on the configuration procedure and debugging.
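When comparing an FP32 run against a BF16 or mixed precision run, it can help to locate the step at which the two loss curves part ways. The following is a minimal sketch, not part of any Gaudi or PyTorch API: the helper name first_divergence_step, the 10% relative tolerance, and the loss values are all assumptions chosen for illustration.

```python
import math


def first_divergence_step(ref_losses, test_losses, rel_tol=0.1):
    """Return the first step where test_losses departs from ref_losses.

    A step counts as divergent when the test loss is NaN or infinite, or
    when it differs from the reference loss by more than rel_tol
    (relative to the reference). Returns None if no step qualifies.
    """
    for step, (ref, test) in enumerate(zip(ref_losses, test_losses)):
        if not math.isfinite(test):
            return step
        if abs(test - ref) > rel_tol * max(abs(ref), 1e-12):
            return step
    return None


# Made-up per-step losses, for illustration only.
fp32_losses = [2.30, 1.90, 1.52, 1.21, 0.98]
bf16_losses = [2.31, 1.92, 1.60, 1.85, float("nan")]

print(first_divergence_step(fp32_losses, bf16_losses))
```

The tolerance is a judgment call: BF16 losses normally differ slightly from FP32 losses, so a tolerance that is too tight flags harmless noise.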

Framework Version

If your model converges on CPU or GPU but fails to converge on Gaudi, check whether the framework version running on Gaudi differs from the framework version running on the CPU or GPU.

For PyTorch, print the installed version on each system with:

$ python3 -c "import torch; print(torch.__version__)"
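Version strings printed this way often carry build suffixes, so a plain string comparison can report a mismatch even when the release numbers agree. The sketch below compares only the numeric release part. The helper names release_tuple and same_release are hypothetical, and the two version strings are illustrative placeholders rather than values taken from a real installation.

```python
import re


def release_tuple(version):
    """Extract (major, minor, patch) from a string such as '2.1.0a0+gitabc123'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor, patch = match.groups()
    # A missing patch component is treated as 0.
    return int(major), int(minor), int(patch or 0)


def same_release(version_a, version_b):
    """True when both strings share the same major.minor.patch release."""
    return release_tuple(version_a) == release_tuple(version_b)


# Placeholder strings; paste in the output of the command from each system.
gaudi_version = "2.1.0a0+gitabc123"
gpu_version = "2.0.1+cu118"

print(same_release(gaudi_version, gpu_version))
```

This ignores pre-release and build tags on purpose. If those differences matter for your model, compare the full strings instead.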

If you train from an existing checkpoint and your model fails to converge on Gaudi, check that the framework version you are running matches the Gaudi framework version that generated the checkpoint.
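A checkpoint does not necessarily record which framework version produced it, so this check is easiest if you store the version yourself when saving. The sketch below assumes a checkpoint represented as a dictionary with a "framework_version" entry. That key and the helper checkpoint_version_warning are hypothetical conventions for illustration, not something the framework writes by default.

```python
def checkpoint_version_warning(checkpoint, running_version, key="framework_version"):
    """Return a warning string if the checkpoint's recorded version differs.

    Returns None when the recorded version equals running_version.
    The key is a hypothetical convention that you populate at save time.
    """
    saved = checkpoint.get(key)
    if saved is None:
        return f"checkpoint has no '{key}' entry; cannot verify framework version"
    if saved != running_version:
        return f"checkpoint was written by {saved}, but {running_version} is running"
    return None


# Illustrative checkpoint contents; a real one would also hold model weights.
checkpoint = {"step": 1000, "framework_version": "2.0.1"}

print(checkpoint_version_warning(checkpoint, "2.1.0"))
```

In a PyTorch workflow, the value stored under the key could be torch.__version__ at save time, and the same attribute could be passed as running_version at load time.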