Debugging Model Divergence
This section provides suggested courses of action to take if your TensorFlow model diverges.
The values stored in tensors can give an indication of the behavior of the training process. In particular, look for unexpected 0.0 or NaN values.
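One way to look for such values is to pull a tensor back to the host and count them. The sketch below is a minimal illustration using only the Python standard library; the helper name `summarize_values` is hypothetical, and it assumes the values arrive as a flat iterable of numbers (for example, `tensor.numpy().ravel()` in eager mode).

```python
import math

def summarize_values(values):
    """Count NaN, exact-zero, and total entries in a flat iterable of numbers."""
    total = nan_count = zero_count = 0
    for value in values:
        x = float(value)
        total += 1
        if math.isnan(x):
            nan_count += 1
        elif x == 0.0:
            zero_count += 1
    return {"total": total, "nan": nan_count, "zero": zero_count}

# Illustrative input; in practice pass the flattened contents of a tensor.
result = summarize_values([0.0, 1.5, float("nan"), 0.0])
print(result)
```

A sudden rise in the NaN or zero count between training steps is a hint about where the divergence starts.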
The tf.print operator prints information to a desired output stream or logging level. Gaudi integration with TensorFlow supports many, but not all, tensor types for tf.print.
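As a minimal sketch of this usage, the following prints a scalar loss from inside a `tf.function`. The function name and the input values are illustrative assumptions, and whether a particular tensor type prints on Gaudi depends on the integration.

```python
import sys
import tensorflow as tf

@tf.function
def mean_squared_error(pred, target):
    loss = tf.reduce_mean(tf.square(pred - target))
    # tf.print runs as part of the graph, so it reports the value at execution time.
    tf.print("loss:", loss, output_stream=sys.stdout)
    return loss

loss = mean_squared_error(tf.constant([1.0, 2.0]), tf.constant([1.0, 4.0]))
```

The `output_stream` argument also accepts `sys.stderr` or a logging level such as `tf.compat.v1.logging.info`.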
The Gaudi integration with TensorFlow does not support tf.debugging.check_numerics.
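One possible stand-in for the unsupported operator is to test for finite values manually. The sketch below assumes that `tf.math.is_finite` and `tf.reduce_all` are available on your device, which this page does not confirm; the helper name `all_finite` is hypothetical.

```python
import tensorflow as tf

def all_finite(tensor):
    """Return a scalar bool tensor that is True when no element is NaN or Inf."""
    return tf.reduce_all(tf.math.is_finite(tensor))

good = tf.constant([1.0, 2.0, 3.0])
bad = tf.constant([1.0, float("nan"), 3.0])

tf.print("good is finite:", all_finite(good))
tf.print("bad is finite:", all_finite(bad))
```

Unlike tf.debugging.check_numerics, this helper does not raise an error; it only reports a flag that you can print or act on.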
Other Data Types
If a model converges on CPU but fails to converge on Gaudi, experiment with other data types. For example, use the FP32 data type instead of BF16. BF16 keeps fewer mantissa bits than FP32, so small updates that FP32 preserves can be rounded away.
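To illustrate why the data type can matter, the sketch below emulates BF16 rounding in pure Python and repeatedly adds a small increment to 1.0. It is an illustration under simplifying assumptions (finite inputs only, round-to-nearest-even, every intermediate result stored in BF16) and does not describe how Gaudi accumulates values; the helper name `to_bf16` is hypothetical.

```python
import struct

def to_bf16(x):
    """Round a finite float to the nearest BF16 value (ties to even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Add a bias so that dropping the low 16 bits rounds to nearest, ties to even.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + rounding_bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

bf16_total = 1.0
fp_total = 1.0
for _ in range(1000):
    bf16_total = to_bf16(bf16_total + 0.001)  # increment is below BF16 spacing near 1.0
    fp_total = fp_total + 0.001

print("BF16-emulated total:", bf16_total)
print("higher-precision total:", fp_total)
```

Near 1.0, adjacent BF16 values are 2^-7 apart, so an increment of 0.001 is rounded away on every step and the emulated total never moves.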
If your model converges successfully on CPU or GPU but fails to converge on Gaudi, check whether the framework version you are running on Gaudi differs from the version you are running on the CPU or GPU. Either of the following snippets prints the installed TensorFlow version; run one in each environment and compare the results.
import tensorflow as tf
print(tf.__version__)

import tensorflow as tf
print(tf.version.VERSION)
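Once you have the version string from each environment, a small helper can flag a mismatch. This is a sketch only: the function name is hypothetical, the version strings are made-up examples, and treating a patch-level difference as harmless is an assumption you may want to tighten.

```python
def same_major_minor(version_a, version_b):
    """Compare only the major and minor components of two version strings."""
    return version_a.split(".")[:2] == version_b.split(".")[:2]

# Hypothetical values copied from the output on each machine.
gaudi_version = "2.8.0"
reference_version = "2.9.1"

if not same_major_minor(gaudi_version, reference_version):
    print("Version mismatch:", gaudi_version, "vs", reference_version)
```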
If you train using an existing checkpoint and your model fails to converge on Gaudi, check for a mismatch between the version of the framework you are running and the Gaudi framework version that generated the checkpoint.
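This check is easier if the version is recorded when the checkpoint is written. The sketch below shows one possible approach, assuming you control the training script: it stores the version in a sidecar JSON file next to the checkpoint. The file name `framework_version.json` and both helper names are hypothetical; in practice you would pass `tf.__version__` rather than a literal string.

```python
import json
import os
import tempfile

NOTE_NAME = "framework_version.json"  # hypothetical sidecar file name

def save_version_note(checkpoint_dir, framework_version):
    """Write the framework version next to the checkpoint files."""
    path = os.path.join(checkpoint_dir, NOTE_NAME)
    with open(path, "w") as f:
        json.dump({"framework_version": framework_version}, f)
    return path

def version_matches(checkpoint_dir, current_version):
    """Return True when the recorded version equals the current one."""
    with open(os.path.join(checkpoint_dir, NOTE_NAME)) as f:
        return json.load(f)["framework_version"] == current_version

# Demonstration in a temporary directory with made-up version strings.
demo_dir = tempfile.mkdtemp()
save_version_note(demo_dir, "2.8.0")
print(version_matches(demo_dir, "2.9.1"))
```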