Debugging Model Divergence

This section suggests actions to take if your TensorFlow model diverges.

Other Data Types

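If a model converges on CPU but fails to converge on Gaudi, experiment with other data types. For example, use the FP32 data type instead of BF16. BF16 keeps fewer mantissa bits than FP32, so switching to FP32 shows whether reduced precision is causing the divergence.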
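To see why the data type can matter, the following minimal sketch accumulates many small increments in emulated BF16 and in FP32. It uses only the Python standard library and does not touch TensorFlow or Gaudi. The helpers `to_bf16`, `to_fp32`, and `accumulate` are hypothetical names introduced for illustration, and the BF16 emulation truncates the low 16 bits of the FP32 encoding, which is a simplification (hardware typically rounds instead).

```python
import struct


def to_fp32(x: float) -> float:
    """Round a Python float (64-bit) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x: float) -> float:
    """Emulate BF16 by keeping only the top 16 bits of the FP32 encoding.

    Assumption: truncation is used here for simplicity; real hardware
    typically rounds rather than truncates.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


def accumulate(step: float, n: int, cast) -> float:
    """Add `step` to a running total `n` times, casting after every add."""
    total = cast(0.0)
    for _ in range(n):
        total = cast(total + cast(step))
    return total


fp32_total = accumulate(0.001, 2000, to_fp32)
bf16_total = accumulate(0.001, 2000, to_bf16)

print(f"FP32 sum: {fp32_total:.6f}")
print(f"BF16 sum: {bf16_total:.6f}")
```

In this sketch the BF16 total stops growing once the increment is smaller than the spacing between adjacent representable values, while the FP32 total stays close to the exact sum. Small gradient updates can be lost in a similar way, which is one reason a model may behave differently in BF16.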

Framework Version

If your model converges on CPU or GPU but fails to converge on Gaudi, check whether the framework version running on Gaudi differs from the version running on the CPU or GPU. To print the installed TensorFlow version, run:

import tensorflow as tf
print(tf.__version__)

or

import tensorflow as tf
print(tf.version.VERSION)

If you train from an existing checkpoint and your model fails to converge on Gaudi, check for a mismatch between the framework version you are running and the Gaudi framework version that generated the checkpoint.
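Both checks come down to comparing two version strings. The sketch below is one way to do that, assuming you have copied the printed versions by hand. The helpers `release_of` and `same_release` are hypothetical, the version strings are placeholders rather than recommended releases, and treating a matching major.minor pair as "the same release" is an assumption you may want to tighten.

```python
def release_of(version: str) -> tuple:
    """Return (major, minor) from a version string such as '2.15.0'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def same_release(a: str, b: str) -> bool:
    """Assumption: two versions are compatible if major and minor match."""
    return release_of(a) == release_of(b)


# Placeholder values: paste the strings printed on each system, or the
# version that produced the checkpoint.
gaudi_version = "2.15.0"
reference_version = "2.13.1"

if same_release(gaudi_version, reference_version):
    print("Framework releases match.")
else:
    print(f"Mismatch: Gaudi runs {gaudi_version}, reference runs {reference_version}")
```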