5. Migration Guide

5.1. Introduction

The purpose of this document is to guide users porting their own TensorFlow or PyTorch models to the Habana(R) Gaudi(R) HPU. The instructions provided in this document help ensure the models are functional and ready for further optimization. In addition to this document, refer to the TensorFlow User Guide or PyTorch User Guide.

5.2. Porting a Simple TensorFlow Model to Gaudi

Porting TensorFlow models to Gaudi requires the following:

  • Habana driver

  • Habana TensorFlow Docker image

To install the above, refer to the Setup and Install GitHub page. As TensorFlow releases newer versions, Habana will continue to support newer versions of TensorFlow and drop support for older ones. See the Support Matrix section of the Release Notes for details on what is covered in this release.

Caution

Using APIs from different TensorFlow versions can cause compatibility issues. Please refer to the TensorFlow section of the Release Notes for a list of current constraints.

5.2.1. Loading the Habana Module

To load the Habana Module for TensorFlow, you need to call load_habana_module() located under library_loader.py. This function loads the Habana libraries needed to use Gaudi HPU at the TensorFlow level. Once loaded, Gaudi HPU is registered in TensorFlow and prioritized over the CPU. This means that when a given Op is available for both the CPU and the Gaudi HPU, the Op is assigned to the Gaudi HPU.

Habana op support and custom TensorFlow ops are defined in the habana_ops object, also available in habana-tensorflow. It can be imported as follows: from habana_frameworks.tensorflow import habana_ops, but it should only be used after load_habana_module() is called. The custom ops are used to pattern-match vanilla TensorFlow ops.
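For example, the required ordering looks as follows (a minimal sketch illustrating only the import placement):

from habana_frameworks.tensorflow import load_habana_module

# Load the Habana libraries first...
load_habana_module()

# ...only then may habana_ops be imported and used.
from habana_frameworks.tensorflow import habana_ops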

5.2.2. Enabling a Single Gaudi Device

To enable a single Gaudi device, add the below code to the main function:

from habana_frameworks.tensorflow import load_habana_module
load_habana_module()
# or
import habana_frameworks.tensorflow as htf
htf.load_habana_module()

To enable Horovod for multi-Gaudi runs, add distributed functions to the main function. To enable multi-worker training with tf.distribute, use the HPUStrategy class. For more details on porting multi-node models, see Distributed Training with TensorFlow.
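For reference, the following is a minimal HPUStrategy sketch; it assumes HPUStrategy is exposed under habana_frameworks.tensorflow.distribute as described in Distributed Training with TensorFlow, and that TF_CONFIG is set appropriately for each worker:

import tensorflow as tf
from habana_frameworks.tensorflow import load_habana_module

load_habana_module()
from habana_frameworks.tensorflow.distribute import HPUStrategy

strategy = HPUStrategy()
with strategy.scope():
    # Variables created in this scope are managed by the strategy.
    model = tf.keras.models.Sequential([tf.keras.layers.Dense(10)])
    model.compile(optimizer="sgd", loss="mse")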

5.2.3. Creating a TensorFlow Example

In order to run the following example, run the Docker image in interactive mode on the Gaudi machine according to the instructions detailed in the Setup and Install GitHub page.

After entering a Docker shell, create an example.py TensorFlow example with the following code snippet, which is available in the TensorFlow Hello World Example.

import tensorflow as tf
from habana_frameworks.tensorflow import load_habana_module

load_habana_module()

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.models.Sequential([
	tf.keras.layers.Flatten(input_shape=(28, 28)),
	tf.keras.layers.Dense(10),
])
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01) 

model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])

model.fit(x_train, y_train, epochs=5, batch_size=128)
model.evaluate(x_test, y_test)

The example.py script presents a basic TensorFlow example. The following explains the Habana-specific lines:

  • Line 2 (from habana_frameworks.tensorflow import load_habana_module) - imports the function used to enable Gaudi.

Note

Ensure that you have a proper PYTHONPATH set by checking that it includes /root/Model-References, for example: export PYTHONPATH=/root/Model-References:$PYTHONPATH

  • Line 4 (load_habana_module()) - calls the function imported earlier to enable Gaudi (registered as the ‘HPU’ device in TensorFlow), Habana optimization passes, Habana ops, and so on.

In addition to the above, two migration examples are provided: example_tf_func.py and example_tf_session.py.

For example_tf_func.py, the migration instructions are the same as for the example.py detailed above. For example_tf_session.py, you must also disable eager mode by adding tf.compat.v1.disable_eager_execution() to enable graph mode, as sketched below.
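For illustration, the following is a minimal TF1-style session sketch; the tiny graph is illustrative and not taken from example_tf_session.py:

import tensorflow as tf
from habana_frameworks.tensorflow import load_habana_module

# TF1-style scripts running in TF2 compatible mode must disable eager
# execution to enable graph mode.
tf.compat.v1.disable_eager_execution()
load_habana_module()

x = tf.compat.v1.placeholder(tf.float32, shape=(None,))
y = x * 2.0

with tf.compat.v1.Session() as sess:
    print(sess.run(y, feed_dict={x: [1.0, 2.0, 3.0]}))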

In summary, the minimal required change is the addition of the following lines. Please refer to the TensorFlow section of the Release Notes for a list of model constraints.

from habana_frameworks.tensorflow import load_habana_module

load_habana_module()

The table below summarizes the conditions under which tf.compat.v1.disable_eager_execution() should be added to model scripts to enable graph mode:

| TF Version and API | Recommendation for disable_eager_execution | Code Examples in GitHub |
|--------------------|--------------------------------------------|-------------------------|
| TF1 scripts running in TF2 compatible mode | tf.compat.v1.disable_eager_execution() is required to enable graph mode. | TF1 model running in TF2 compatible mode: example_tf_session.py |
| TF2 scripts running with a Keras model (graph mode by default) | tf.compat.v1.disable_eager_execution() is NOT required to enable graph mode. | TF2 model running with Keras on a single Gaudi: example.py; TF2 model running with Keras on Horovod-based multi-Gaudi: example_hvd.py |
| TF2 scripts running with tf.function (graph mode) | tf.compat.v1.disable_eager_execution() is NOT required to enable graph mode. | TF2 model running with tf.function on a single Gaudi: example_tf_func.py; TF2 model running with tf.function on Horovod-based multi-Gaudi: example_tf_func_hvd.py |

5.2.4. Executing the Example

After creating the example.py, execute the example by running:

$PYTHON example.py

You can also run the above example with BF16 support enabled by setting the TF_ENABLE_BF16_CONVERSION=1 environment variable. For a full list of available runtime environment variables, see Runtime Environment Variables.

TF_ENABLE_BF16_CONVERSION=1 $PYTHON example.py

The following lines should appear as part of the output:

Epoch 1/5
469/469 [==============================] - 1s 3ms/step - loss: 1.2647 - accuracy: 0.7208
Epoch 2/5
469/469 [==============================] - 1s 2ms/step - loss: 0.7113 - accuracy: 0.8433
Epoch 3/5
469/469 [==============================] - 1s 2ms/step - loss: 0.5845 - accuracy: 0.8606
Epoch 4/5
469/469 [==============================] - 1s 2ms/step - loss: 0.5237 - accuracy: 0.8688
Epoch 5/5
469/469 [==============================] - 1s 2ms/step - loss: 0.4865 - accuracy: 0.8749
313/313 [==============================] - 1s 2ms/step - loss: 0.4482 - accuracy: 0.8869

Since the first iteration includes graph compilation time, it takes longer to run than later iterations. The software stack compiles the graph and saves the recipe to cache. Unless the graph changes or a new graph comes in, no recompilation is needed during training. Typically, graph compilation happens at the beginning of training and at the beginning of evaluation.

5.2.5. Viewing Loss and Accuracy in TensorFlow

You can find loss and accuracy in the output of the demo scripts. Loss and accuracy metrics can be visualized using different Profiler tools. For further details about the Profiler tools you can use, see Viewing Instructions.

5.3. Porting a Simple PyTorch Model to Gaudi

At the time of this release, support for PyTorch was under active development. Refer to the PyTorch models in the Model References GitHub repository for examples. The steps below provide a baseline for preparing a PyTorch model to run on Gaudi; models following these steps may still require additional modifications to achieve optimized performance.

Porting PyTorch models to Gaudi requires the following:

  • Habana driver

  • Habana PyTorch Docker image

To install the above, refer to the Setup and Install GitHub page.

The following code additions need to be made to run a model on Habana. The steps cover both Eager and Lazy modes of execution.

1. Load the Habana PyTorch Plugin Library, libhabana_pytorch_plugin.so:

from habana_frameworks.torch.utils.library_loader import load_habana_module
load_habana_module()

2. Target the Gaudi HPU device:

device = torch.device("hpu")

3. Move the model to the device:

model.to(device)

Note

Step 3 may already be implemented in your existing model script.

4. Import the Habana Torch Library:

import habana_frameworks.torch.core as htcore

5. Enable Lazy execution mode by setting the environment variable shown below if you want to run your model in this mode. Do not set this variable if you want to execute your code in Eager mode. For more information on Lazy and Eager modes, refer to the PyTorch User Guide:

os.environ["PT_HPU_LAZY_MODE"] = "1"

6. In Lazy mode, execution is triggered wherever data is read back to the host from the Habana device. For example, execution is triggered if you are running a topology and getting loss value into the host from the device with loss.item(). Adding a mark_step() in the code is another mechanism to trigger execution. The placement of mark_step() is required at the following points in a training script:

  • Right after optimizer.step() to cleanly demarcate training iterations,

  • Between loss.backward() and optimizer.step() if the optimizer being used is a Habana custom optimizer.

htcore.mark_step()
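For illustration, the following is a minimal Lazy mode training loop sketch; model, criterion, optimizer, data_loader, and device are assumed to be defined as in the previous steps:

for inputs, targets in data_loader:
    inputs, targets = inputs.to(device), targets.to(device)
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    # Required between loss.backward() and optimizer.step() only when a
    # Habana custom optimizer is used.
    htcore.mark_step()
    optimizer.step()
    # Demarcates the end of the training iteration and triggers execution.
    htcore.mark_step()
    # Reading the loss back to the host also triggers execution.
    loss_value = loss.item()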

Note

Placing mark_step() at any arbitrary point in the code is not currently supported. We will support insertion of mark_step() at arbitrary positions in future releases.

7. Load the checkpoint. Vision models with convolutions require Habana PyTorch-specific steps. Refer to Convolution Weight Ordering in PyTorch Habana Vision Topologies for additional steps on weight order handling.

8. Save the checkpoint. Bring the trainable parameters of the model and optimizer tensors to the CPU using .to('cpu') on the tensors, then save. For vision models with convolutions, refer to Convolution Weight Ordering in PyTorch Habana Vision Topologies for additional steps on weight order handling.
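As an illustration of step 8, the following is a minimal sketch of saving a checkpoint; the checkpoint layout and file name are illustrative:

import torch

# Bring the trainable parameters to the CPU before saving. Optimizer
# state tensors residing on the HPU need the same treatment.
cpu_state = {name: param.to('cpu') for name, param in model.state_dict().items()}
torch.save({'model': cpu_state}, 'checkpoint.pt')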

5.3.1. Distributed Communication Using PyTorch

PyTorch supports distributed communication using the torch.distributed and torch.nn.parallel.DistributedDataParallel APIs for both data and model parallelism. PyTorch natively supports several communication backends such as MPI, Gloo, and NCCL. Habana support for distributed communication can be enabled using the HCCL backend. For PyTorch distributed to work correctly, you need to export the environment variable ID. ID is mapped to the local rank, which is used to acquire the Gaudi card for a particular process in the multi-node case.

os.environ["ID"] = str(local_rank)  # environment variable values must be strings

5.3.2. Distributed Backend Initialization

5.3.3. HCCL

Support for the HCCL communication backend is loaded, and the process group communication backend is initialized as “hccl”, using the following script changes:

import habana_frameworks.torch.core.hccl
torch.distributed.init_process_group(backend='hccl', rank=rank, world_size=world_size)
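Putting the pieces together, the following is a minimal initialization sketch; it assumes the process is launched with Open MPI (mpirun), so rank information is read from Open MPI's environment variables:

import os
import torch
import habana_frameworks.torch.core.hccl

# Rank information from Open MPI's environment; adapt to your launcher.
rank = int(os.environ.get("OMPI_COMM_WORLD_RANK", "0"))
world_size = int(os.environ.get("OMPI_COMM_WORLD_SIZE", "1"))
local_rank = int(os.environ.get("OMPI_COMM_WORLD_LOCAL_RANK", "0"))

# ID maps the local rank to the Gaudi card used by this process.
os.environ["ID"] = str(local_rank)

torch.distributed.init_process_group(backend='hccl', rank=rank, world_size=world_size)

# A model already moved to the HPU can then be wrapped for data
# parallelism:
# model = torch.nn.parallel.DistributedDataParallel(model)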

5.4. Torch Multiprocessing for DataLoaders (Vision Models only)

If training scripts use multiprocessing for the data loader, change the start method to spawn or forkserver using the PyTorch API torch.multiprocessing.set_start_method(...). For example:

torch.multiprocessing.set_start_method('spawn')
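Since the start method can only be set once per process, it is typically guarded under the main entry point; the following is a minimal sketch with an illustrative dataset:

import torch

def main():
    dataset = torch.utils.data.TensorDataset(torch.randn(1024, 10))
    # DataLoader worker processes are created with the 'spawn' method.
    loader = torch.utils.data.DataLoader(dataset, batch_size=128, num_workers=4)
    for batch in loader:
        pass

if __name__ == '__main__':
    torch.multiprocessing.set_start_method('spawn')
    main()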

5.5. Custom Habana OPs for PyTorch

Habana provides its own implementation of some complex PyTorch OPs customized for Habana devices. In a given model, replacing these complex OPs with custom Habana versions will give better performance.

5.5.1. Custom Optimizers

The following is a list of custom optimizers currently supported on Habana devices:

The following code snippet demonstrates the usage of a custom optimizer:

try:
    from habana_frameworks.torch.hpex.optimizers import FusedLamb
except ImportError:
    raise ImportError("Please install habana_torch package")

optimizer = FusedLamb(model.parameters(), lr=args.learning_rate)

Note

For models using Lazy mode execution, add a mark_step() right before the optimizer.step() call when using a custom optimizer.

5.5.1.1. Other Custom OPs

The following is a list of other custom OPs currently supported on Habana devices:

The following code snippet demonstrates the usage of FusedClipNorm:

try:
    from habana_frameworks.torch.hpex.normalization import FusedClipNorm
except ImportError:
    raise ImportError("Please install habana_torch package")

FusedNorm = FusedClipNorm(model.parameters(), args.max_grad_norm)

FusedNorm.clip_norm(model.parameters())