Debugging Model Divergence

Other Data Types

If a model converges on CPU but fails to converge on Gaudi, experiment with other data types. For example, train with the FP32 data type instead of BF16 to check whether reduced precision is causing the divergence.

Once the FP32-based model converges, you can experiment with different mixed precision configurations to find the best balance of performance and accuracy for your model. Refer to the Mixed Precision Training with PyTorch Autocast section for more details on the configuration procedure and debugging.
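A minimal sketch of switching between FP32 and BF16 with a single flag is shown below. It uses the standard torch.autocast context manager on the CPU so that it runs anywhere. The helper name forward_with_precision and the toy nn.Linear model are hypothetical, and on Gaudi the device_type would presumably be "hpu" with the model and inputs moved to that device, which is an assumption to verify against the Mixed Precision Training with PyTorch Autocast section.

```python
import torch
from torch import nn


def forward_with_precision(model, inputs, use_bf16, device_type="cpu"):
    """Run one forward pass in FP32 or under BF16 autocast.

    Hypothetical helper for illustration. device_type="cpu" keeps the
    sketch runnable anywhere; using "hpu" on Gaudi is an assumption.
    """
    # With enabled=False the context manager is a no-op, so the model
    # runs in its native FP32 precision.
    with torch.autocast(device_type=device_type, dtype=torch.bfloat16, enabled=use_bf16):
        return model(inputs)


model = nn.Linear(4, 2)      # toy model, stands in for the real network
inputs = torch.randn(3, 4)   # toy batch

fp32_out = forward_with_precision(model, inputs, use_bf16=False)
bf16_out = forward_with_precision(model, inputs, use_bf16=True)

print("FP32 run output dtype:", fp32_out.dtype)
print("BF16 run output dtype:", bf16_out.dtype)
```

Keeping the precision choice behind one flag makes it easy to confirm convergence in FP32 first and then re-enable BF16 without touching the rest of the training loop.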

Framework Version

If your model converges successfully on CPU or GPU but fails to converge on Gaudi, check whether the framework version you are running on Gaudi differs from the version you are running on the CPU or GPU. To print the installed PyTorch version, run the following command in each environment:

$ python3 -c "import torch; print(torch.__version__)"
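The comparison of the version strings reported by each environment can also be scripted. The sketch below is a hypothetical helper, not part of any Gaudi tooling: installed_version reads the version from package metadata, and same_release compares two version strings while ignoring the local build label after "+" (for example "+cpu"). Whether differing build labels matter for your setup is an assumption you should check.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str = "torch") -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def same_release(version_a: str, version_b: str) -> bool:
    """Compare two version strings, ignoring the local label after '+'.

    Hypothetical comparison rule: "2.1.0+cpu" and "2.1.0+cu118" are
    treated as the same release.
    """
    return version_a.split("+")[0] == version_b.split("+")[0]


# Version found in the current environment (None if torch is not installed).
print("Installed torch version:", installed_version("torch"))
```

For example, paste the string printed on the CPU or GPU machine into same_release together with the string printed on the Gaudi machine to flag a mismatch.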

If you run training using an existing checkpoint and your model fails to converge on Gaudi, check for a mismatch between the version of the framework you are running and the Gaudi framework version that generated the checkpoint.
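One way to make this check possible later is to record the framework version inside the checkpoint when it is written. The sketch below assumes you control the checkpoint format: the "torch_version" and "model_state" keys are hypothetical and are not something PyTorch or Gaudi adds automatically. An in-memory buffer stands in for the checkpoint file.

```python
import io

import torch
from torch import nn

model = nn.Linear(4, 2)  # toy model, stands in for the real network

# Save: store the framework version next to the weights.
# The key names are hypothetical, chosen for this illustration.
checkpoint = {
    "model_state": model.state_dict(),
    "torch_version": str(torch.__version__),
}
buffer = io.BytesIO()  # stands in for a checkpoint file on disk
torch.save(checkpoint, buffer)

# Load: compare the recorded version with the running one.
buffer.seek(0)
loaded = torch.load(buffer)
saved_version = loaded.get("torch_version")
version_matches = saved_version == str(torch.__version__)

if not version_matches:
    print(f"Checkpoint written with torch {saved_version}, "
          f"running torch {torch.__version__}")
model.load_state_dict(loaded["model_state"])
```

Checkpoints that lack such a field give no direct way to recover the version, so in that case the version has to be tracked outside the checkpoint.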