Porting a Simple TensorFlow Model to Gaudi

To set up the TensorFlow environment, refer to the Installation Guide. The supported TensorFlow versions are listed in the Support Matrix.


Using APIs from different TensorFlow versions can cause compatibility issues. Please refer to the TensorFlow Known Issues and Limitations section for a list of current limitations.

Creating a TensorFlow Example

To run the following example, start the Docker image in interactive mode on the Intel® Gaudi® AI accelerator, following the instructions in the Installation Guide.

After entering a Docker shell, create an “example.py” TensorFlow example with the following code snippet, which is also available in the TensorFlow Hello World Example.

import tensorflow as tf
from habana_frameworks.tensorflow import load_habana_module

load_habana_module()

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

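model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    # One output logit per MNIST digit class (the loss below uses from_logits=True).
    tf.keras.layers.Dense(10),
])
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)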

model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])

model.fit(x_train, y_train, epochs=5, batch_size=128)
model.evaluate(x_test, y_test)

example.py is a basic TensorFlow example. The Gaudi-specific lines are explained below:

  • Line 2 - Imports the function that enables a single Gaudi device.


Ensure that PYTHONPATH is set correctly by checking that it includes /root or, more specifically, by running export PYTHONPATH=/root/Model-References:/usr/lib/habanalabs:$PYTHONPATH

  • Line 4 - Calls the function imported earlier to enable Gaudi (registered as the ‘HPU’ device in TensorFlow), along with the Intel Gaudi optimization passes, ops and so on.
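The PYTHONPATH check mentioned above can also be done from Python. The following is a minimal sketch that assumes the two directories from the export command are the ones that matter; the helper name missing_pythonpath_entries is hypothetical and not part of any Intel Gaudi package.

```python
import os

# Directories taken from the export command above.
REQUIRED_ENTRIES = ("/root/Model-References", "/usr/lib/habanalabs")


def missing_pythonpath_entries(pythonpath, required=REQUIRED_ENTRIES):
    """Return the required directories that are absent from a PYTHONPATH string."""
    # Normalize trailing slashes so "/usr/lib/habanalabs/" still counts as present.
    present = {entry.rstrip("/") for entry in pythonpath.split(os.pathsep) if entry}
    return [path for path in required if path.rstrip("/") not in present]


missing = missing_pythonpath_entries(os.environ.get("PYTHONPATH", ""))
if missing:
    print("PYTHONPATH is missing:", ", ".join(missing))
else:
    print("PYTHONPATH contains the expected entries.")
```

Running this inside the Docker shell before launching example.py gives a quick indication of whether the export command is still needed.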

Additional Migration Examples

In addition to the above, two migration examples are provided: example_tf_func.py and example_tf_session.py.

For example_tf_func.py, the migration instructions are similar to those for the example.py (Hello World) example detailed above. For example_tf_session.py, you must also disable eager execution by adding tf.compat.v1.disable_eager_execution(), which enables graph mode.

The table below summarizes when tf.compat.v1.disable_eager_execution() should be added to a model script to enable graph mode:

| TF version and API | Recommendation for disable_eager_execution | Code examples in GitHub |
|---|---|---|
| TF1 scripts running in TF2 compatible mode | tf.compat.v1.disable_eager_execution() is required to enable graph mode. | TF1 model running in TF2 compatible mode: example_tf_session.py |
| TF2 scripts running with a Keras model (graph mode by default) | tf.compat.v1.disable_eager_execution() is NOT required to enable graph mode. | TF2 model running with Keras on a single Gaudi: example.py; TF2 model running with Keras on Horovod-based multi-Gaudi: example_hvd.py |
| TF2 scripts running with tf.function (graph mode) | tf.compat.v1.disable_eager_execution() is NOT required to enable graph mode. | TF2 model running with tf.function on a single Gaudi: example_tf_func.py; TF2 model running with tf.function on Horovod-based multi-Gaudi: example_tf_func_hvd.py |
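As an illustration of the TF1 case, the sketch below shows a TF1-style session script running in TF2 compatible mode. It is not the contents of example_tf_session.py; the function name run_session_example and the toy computation are assumptions chosen for illustration. The Gaudi module load is left as a comment so that the sketch also runs on a machine without the Gaudi software, and the function returns None when TensorFlow is not installed.

```python
def run_session_example():
    """Run a tiny TF1-style graph in a session; return None if TensorFlow is missing."""
    try:
        import tensorflow as tf
    except ImportError:
        return None

    # On Gaudi, load_habana_module() would be imported and called here,
    # in the same way as in example.py.

    # Required for TF1-style scripts: switch from eager mode to graph mode.
    tf.compat.v1.disable_eager_execution()

    graph = tf.Graph()
    with graph.as_default():
        # Placeholders only exist in graph mode; values are fed at run time.
        x = tf.compat.v1.placeholder(tf.float32, shape=[None], name="x")
        doubled_sum = tf.reduce_sum(x * 2.0)

    with tf.compat.v1.Session(graph=graph) as sess:
        return float(sess.run(doubled_sum, feed_dict={x: [1.0, 2.0]}))


result = run_session_example()
print(result)
```

For the Keras and tf.function cases, the disable_eager_execution() call is simply omitted.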

Executing the Example

After creating the example.py, execute the example by running:

$PYTHON example.py

You can also run the above example with BF16 support enabled by setting the TF_BF16_CONVERSION=1 environment variable. For a full list of available runtime environment variables, see Runtime Environment Variables.
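When the script is launched from another Python program rather than from the shell, the same setting can be passed through the child process environment. The following is a minimal sketch, assuming example.py is in the current directory and that the current interpreter plays the role of $PYTHON; the helper names bf16_env and run_with_bf16 are hypothetical.

```python
import os
import subprocess
import sys


def bf16_env(base_env=None):
    """Return a copy of the environment with TF_BF16_CONVERSION set to 1."""
    env = dict(os.environ if base_env is None else base_env)
    env["TF_BF16_CONVERSION"] = "1"
    return env


def run_with_bf16(script="example.py"):
    """Launch the script with TF_BF16_CONVERSION=1 and return its exit code."""
    completed = subprocess.run([sys.executable, script], env=bf16_env())
    return completed.returncode
```

Calling run_with_bf16() is intended to mirror prefixing the shell command with TF_BF16_CONVERSION=1, without changing the environment of the calling process.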


The following lines should appear as part of the output:

Epoch 1/5
469/469 [==============================] - 1s 3ms/step - loss: 1.2647 - accuracy: 0.7208
Epoch 2/5
469/469 [==============================] - 1s 2ms/step - loss: 0.7113 - accuracy: 0.8433
Epoch 3/5
469/469 [==============================] - 1s 2ms/step - loss: 0.5845 - accuracy: 0.8606
Epoch 4/5
469/469 [==============================] - 1s 2ms/step - loss: 0.5237 - accuracy: 0.8688
Epoch 5/5
469/469 [==============================] - 1s 2ms/step - loss: 0.4865 - accuracy: 0.8749
313/313 [==============================] - 1s 2ms/step - loss: 0.4482 - accuracy: 0.8869

Because the first iteration includes graph compilation time, it takes longer to run than later iterations. The software stack compiles the graph and saves the recipe to a cache, so no recompilation is needed during training unless the graph changes or a new graph is introduced. Graph compilation typically happens at the beginning of training and at the beginning of evaluation.
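The caching behavior described above can be pictured with a toy model. This is only a conceptual sketch in plain Python, not the Intel Gaudi software stack: the ToyRecipeCache class, its method names, and the use of a string as the graph signature are all assumptions made for illustration.

```python
class ToyRecipeCache:
    """Toy stand-in for a compile-once cache keyed by a graph signature."""

    def __init__(self):
        self._recipes = {}
        self.compilations = 0

    def run(self, signature, batch):
        # Compile only when this graph signature has not been seen before.
        if signature not in self._recipes:
            self.compilations += 1
            self._recipes[signature] = sum  # pretend this is the compiled recipe
        return self._recipes[signature](batch)


cache = ToyRecipeCache()
for _ in range(3):  # three training iterations share one graph
    cache.run("train_step", [1, 2, 3])
for _ in range(2):  # evaluation introduces a second graph
    cache.run("eval_step", [4, 5])
print(cache.compilations)  # 2: one compilation for training, one for evaluation
```

Only the first call for each signature pays the compilation cost, which is why the first training step and the first evaluation step are the slow ones.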

Viewing Loss and Accuracy in TensorFlow

You can find the loss and accuracy in the demo script's output. Loss and accuracy metrics can also be visualized using different Profiler tools. For further details about the Profiler tools you can use, see the Analysis section.
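If you want the numbers rather than the console text, the progress lines can be parsed. The following is a minimal sketch, assuming the line format shown in the sample output above; parse_metrics is a hypothetical helper name.

```python
import re

# Matches "name: value" pairs such as "loss: 1.2647" or "accuracy: 0.7208".
METRIC_PATTERN = re.compile(r"(\w+): ([0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?)")


def parse_metrics(line):
    """Extract metric names and float values from one Keras progress line."""
    return {name: float(value) for name, value in METRIC_PATTERN.findall(line)}


line = "469/469 [==============================] - 1s 3ms/step - loss: 1.2647 - accuracy: 0.7208"
print(parse_metrics(line))  # {'loss': 1.2647, 'accuracy': 0.7208}
```

Lines without metrics, such as "Epoch 1/5", produce an empty dictionary and can be skipped.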

Loading the Intel Gaudi Module

To load habana_module for TensorFlow, call load_habana_module(), located in library_loader.py. This function loads the Intel Gaudi libraries needed to use the device at the TensorFlow level. Once they are loaded, the Gaudi HPU is registered in TensorFlow and prioritized over the CPU: when a given op is available for both the CPU and the Gaudi HPU, the op is assigned to the Gaudi HPU.
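One way to confirm that the module loaded is to list the devices TensorFlow knows about. The following is a minimal sketch, assuming the Gaudi device type string is 'HPU' as stated earlier in this guide; list_hpu_devices is a hypothetical helper, and it returns an empty list when TensorFlow or the Intel Gaudi package is not installed.

```python
def list_hpu_devices():
    """Load the Gaudi module and return the names of logical HPU devices."""
    try:
        import tensorflow as tf
        from habana_frameworks.tensorflow import load_habana_module
    except ImportError:
        # TensorFlow or the Intel Gaudi package is unavailable on this machine.
        return []

    load_habana_module()
    return [device.name for device in tf.config.list_logical_devices("HPU")]


print(list_hpu_devices())
```

An empty list on a Gaudi machine would suggest that the module did not load or that no device was registered.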

Intel Gaudi op support and custom TensorFlow ops are defined in the habana_ops object, also available in habana-tensorflow. It can be imported with from habana_frameworks.tensorflow import habana_ops, but should only be used after load_habana_module() has been called. The custom ops are used for pattern matching vanilla TensorFlow ops.

load_habana_module() accepts an optional parameter, allow_op_override, which defaults to True. It allows a default TensorFlow op implementation to be replaced with a custom one to improve performance. Only tf.keras.layers.LayerNormalization is currently supported. A known issue in TensorFlow may require you to disable the override by calling load_habana_module(allow_op_override=False).
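The parameter can be exposed as a switch in your own script so that the override is easy to turn off when the known issue is hit. The following is a minimal sketch; enable_gaudi is a hypothetical wrapper, and returning False when the Intel Gaudi package cannot be imported is a choice made here so that the sketch also runs on machines without it.

```python
def enable_gaudi(allow_op_override=True):
    """Call load_habana_module(); return False if the Gaudi package is absent."""
    try:
        from habana_frameworks.tensorflow import load_habana_module
    except ImportError:
        return False

    # allow_op_override=False keeps the default TensorFlow op implementations.
    load_habana_module(allow_op_override=allow_op_override)
    return True


gaudi_enabled = enable_gaudi(allow_op_override=False)
print(gaudi_enabled)
```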

Enabling a Single Gaudi Device

To enable a single Gaudi device, add the code below to the main function:

from habana_frameworks.tensorflow import load_habana_module
load_habana_module()
# or
import habana_frameworks.tensorflow as htf
htf.load_habana_module()

To enable Horovod for multi-Gaudi runs, add distributed functions to the main function. To enable multi-worker training with tf.distribute, use the HPUStrategy class. For more details on porting multi-node models, see Distributed Training with TensorFlow.