4. Migration Guide

4.1. Introduction

The purpose of this document is to guide users porting their own TensorFlow or PyTorch models to the Habana(R) Gaudi(R) HPU. The instructions provided in this document help ensure the models are functional and ready for further optimization. In addition to this document, refer to the TensorFlow User Guide or PyTorch User Guide.

Note

Please make sure that the version of the SynapseAI software stack installed matches the version of the Docker images you are using. Our documentation on docs.habana.ai is also versioned, so select the appropriate version. The Setup and Install GitHub repository and the Model-References GitHub repository have branches for each release version; select the branch that matches the version of your SynapseAI software installation. For example, if SynapseAI software version 0.15.4 is installed, clone the Model-References repository as follows: % git clone -b 0.15.4 https://github.com/HabanaAI/Model-References. To confirm the SynapseAI software version on your build, run the hl-smi tool and check the "Driver Version" field (see the figure below).

Figure 4.1 SynapseAI Version Check

4.2. Porting a Simple TensorFlow Model to Gaudi

Porting TensorFlow models to Gaudi requires the following:

  • Habana driver

  • Habana TensorFlow Docker image

To install the above, refer to the TensorFlow Installation section or the Setup and Install GitHub page. As TensorFlow releases newer versions, Habana will continue to support newer versions of TensorFlow and drop support for older ones. See the TensorFlow section of the Release Notes for more details on what is covered in this release.

Caution

Using APIs from different TensorFlow versions can cause compatibility issues. Please refer to the TensorFlow section of the Release Notes for a list of current constraints.

4.2.1. Loading the Habana Module

To load the Habana Module for TensorFlow, you need to call load_habana_module() located under library_loader.py. This function loads the Habana libraries needed to use the Gaudi HPU at the TensorFlow level. Once loaded, the Gaudi HPU is registered in TensorFlow and prioritized over the CPU. This means that when a given Op is available for both the CPU and the Gaudi HPU, the Op is assigned to the Gaudi HPU.

Habana op support and custom TensorFlow ops are defined in the habana_ops object, also available in habana-tensorflow. It can be imported as follows: from habana_frameworks.tensorflow import habana_ops, but it should only be used after load_habana_module() is called. The custom ops are used for pattern matching against vanilla TensorFlow ops.
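Since the ordering matters, the following minimal sketch shows a valid import sequence (the final print is for illustration only):

import tensorflow as tf
from habana_frameworks.tensorflow import load_habana_module, habana_ops

load_habana_module()  # must be called before habana_ops is used
print(habana_ops)     # habana_ops can now be referenced safely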

4.2.2. Enabling a Single Gaudi Device

To enable a single Gaudi device, add the below code to the main function:

from habana_frameworks.tensorflow import load_habana_module
load_habana_module()
# or
import habana_frameworks.tensorflow as htf
htf.load_habana_module()

To enable Horovod for multi-Gaudi runs, add Horovod distributed functions to the main function. To enable multi-worker training with tf.distribute, use the HPUStrategy class; a minimal sketch is shown below. For more details on porting multi-node models, see Distributed Training with TensorFlow.
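The following is a minimal sketch of HPUStrategy usage, assuming the class is exposed under habana_frameworks.tensorflow.distribute as described in Distributed Training with TensorFlow:

import tensorflow as tf
from habana_frameworks.tensorflow import load_habana_module
from habana_frameworks.tensorflow.distribute import HPUStrategy

load_habana_module()

strategy = HPUStrategy()
with strategy.scope():
    # variables created in this scope are placed for multi-worker HPU training
    model = tf.keras.models.Sequential([tf.keras.layers.Dense(10)])
    model.compile(optimizer="sgd", loss="mse")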

4.2.3. Creating a TensorFlow Example

In order to run the following example, run the Docker image in interactive mode on the Gaudi machine, following the instructions detailed in the Setup and Install GitHub page.

After entering the Docker shell, create a TensorFlow example file named example.py with the following code snippet, available in the TensorFlow Hello World Example.

import tensorflow as tf
from habana_frameworks.tensorflow import load_habana_module

load_habana_module()

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.models.Sequential([
	tf.keras.layers.Flatten(input_shape=(28, 28)),
	tf.keras.layers.Dense(10),
])
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01) 

model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])

model.fit(x_train, y_train, epochs=5, batch_size=128)
model.evaluate(x_test, y_test)

The example.py script presents a basic TensorFlow code example. The following explains the Habana-specific lines:

  • Line 2 - Imports the function that enables Gaudi.

Note

Ensure that you have a proper PYTHONPATH set by checking that it includes /root, or more specifically: export PYTHONPATH=/root/Model-References:$PYTHONPATH

  • Line 4 - Calls the function imported earlier to enable Gaudi (registered as an 'HPU' device in TensorFlow), Habana optimization passes, Habana ops, and so on.

Note

Disabling eager execution with tf.compat.v1.disable_eager_execution() is recommended for scripts using tf.Session. For scripts not using tf.Session, disabling eager execution is not required; however, using tf.function is recommended. TensorFlow can run Habana optimization passes in eager mode, provided the code is enclosed in a tf.function.
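As a minimal sketch of the tf.function recommendation (the function body is illustrative only):

import tensorflow as tf

@tf.function  # enclosing the computation allows graph-mode optimization passes to run
def train_step(x, w):
    return tf.matmul(x, w)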

In addition to the above, two migration examples are provided: example_tf_func.py and example_tf_session.py.

For example_tf_func.py, the migration instructions are similar to the example.py (Hello World) example detailed above. For example_tf_session.py, you must disable eager mode by adding tf.compat.v1.disable_eager_execution() to the code.

In summary, the minimal required change is the addition of the following lines. Please refer to the TensorFlow section of the Release Notes for a list of model constraints.

from habana_frameworks.tensorflow import load_habana_module

load_habana_module()

The table below summarizes the conditions under which tf.compat.v1.disable_eager_execution() should be added to model scripts to enable graph mode:

| TF version and API | Single Gaudi | Horovod-based Multi-Gaudi | Code Examples in GitHub |
|---|---|---|---|
| TF1 scripts running in TF2 compatible mode | tf.compat.v1.disable_eager_execution() is required to enable graph mode. | tf.compat.v1.disable_eager_execution() is required to enable graph mode. | TF1 model running in TF2 compatible mode: example_tf_session.py |
| TF2 scripts running with a Keras model (graph mode by default) | tf.compat.v1.disable_eager_execution() is NOT required to enable graph mode. | tf.compat.v1.disable_eager_execution() is required to enable graph mode. This requirement will be removed when Habana Horovod is updated to a newer version (see the note below). | TF2 model running with Keras on a single Gaudi: example.py; TF2 model running with Keras on Horovod-based Multi-Gaudi: example_hvd.py |
| TF2 scripts running with tf.function (graph mode) | tf.compat.v1.disable_eager_execution() is NOT required to enable graph mode. | tf.compat.v1.disable_eager_execution() is NOT required to enable graph mode. | TF2 model running with tf.function on a single Gaudi: example_tf_func.py; TF2 model running with tf.function on Horovod-based Multi-Gaudi: example_tf_func_hvd.py |

Note

Horovod-based multi-Gaudi runs require adding tf.compat.v1.disable_eager_execution() due to Horovod-related bugs. This requirement will be removed in a future release.

4.2.4. Executing the Example

After creating the example.py, execute the example by running:

$PYTHON example.py

You can also run the above example with BF16 support enabled by setting the TF_ENABLE_BF16_CONVERSION=1 environment variable. For a full list of available runtime environment variables, see Runtime Environment Variables.

TF_ENABLE_BF16_CONVERSION=1 $PYTHON example.py

The following lines should appear as part of output:

Epoch 1/5
469/469 [==============================] - 1s 3ms/step - loss: 1.2647 - accuracy: 0.7208
Epoch 2/5
469/469 [==============================] - 1s 2ms/step - loss: 0.7113 - accuracy: 0.8433
Epoch 3/5
469/469 [==============================] - 1s 2ms/step - loss: 0.5845 - accuracy: 0.8606
Epoch 4/5
469/469 [==============================] - 1s 2ms/step - loss: 0.5237 - accuracy: 0.8688
Epoch 5/5
469/469 [==============================] - 1s 2ms/step - loss: 0.4865 - accuracy: 0.8749
313/313 [==============================] - 1s 2ms/step - loss: 0.4482 - accuracy: 0.8869

Since the first iteration includes graph compilation time, it takes longer to run than later iterations. The software stack compiles the graph and saves the recipe to cache. Unless the graph changes or a new graph comes in, no recompilation is needed during training. Typically, graph compilation happens at the beginning of training and at the beginning of evaluation.

4.2.5. Viewing Loss and Accuracy in TensorFlow

You can find loss and accuracy in the demo scripts' output. Loss and accuracy metrics can also be visualized using various Profiler tools. For further details on the available Profiler tools, see Viewing Instructions.

4.3. Porting a Simple PyTorch Model to Gaudi

As of this release, support for PyTorch is under active development. Refer to the PyTorch model examples in the Model References GitHub page for examples. The steps below are for reference and provide a baseline for preparing a PyTorch model to run on Gaudi. Models following these steps may still require additional modification to achieve optimized performance.

Porting PyTorch models to Gaudi requires the following:

  • Habana driver

  • Habana PyTorch Docker image

To install the above, refer to the Setup and Install GitHub page.

The following code additions need to be made to run a model on Habana. The steps below cover both Eager and Lazy modes of execution; a minimal end-to-end sketch combining them is shown after the list.

  1. Load the Habana PyTorch Plugin Library, libhabana_pytorch_plugin.so.

from habana_frameworks.torch.utils.library_loader import load_habana_module
load_habana_module()
  2. Target the Gaudi HPU device:

device = torch.device("hpu")
  3. Move the model to the device:

model.to(device)

Note

Step 3 may already be implemented in your existing model script.

  4. Import the Habana Torch Library:

import habana_frameworks.torch.core as htcore

5. Enable Lazy execution mode by setting the environment variable shown below, if you want to run your model in this mode. Do not set this variable if you want to execute your code in Eager mode. For more information on Lazy and Eager modes, refer to the PyTorch User Guide:

os.environ["PT_HPU_LAZY_MODE"] = "1"

6. In Lazy mode, execution is triggered whenever data is read back to the host from the Habana device. For example, execution is triggered if you are running a topology and fetching the loss value to the host from the device with loss.item(). Adding a mark_step() to the code is another mechanism to trigger execution. mark_step() must be placed at the following points in a training script:

  • Right after optimizer.step() to cleanly demarcate training iterations,

  • Between loss.backward() and optimizer.step() if the optimizer being used is a Habana custom optimizer.

htcore.mark_step()

Note

Placing mark_step() at any arbitrary point in the code is not currently supported. We will support insertion of mark_step() at arbitrary positions in future releases.

7. Load the checkpoint. Vision models with convolutions require Habana PyTorch-specific steps. Refer to Convolution Weight Ordering in PyTorch Habana Vision Topologies for additional steps on weight order handling.

8. Save the checkpoint. Bring the trainable parameters of the model and the optimizer tensors to the CPU using .to('cpu') on the tensors, and save them. For vision models with convolutions, refer to Convolution Weight Ordering in PyTorch Habana Vision Topologies for additional steps on weight order handling.
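The following minimal end-to-end sketch combines the steps above for Lazy mode. The model, optimizer, and data are hypothetical placeholders; only the Habana-specific lines come from the steps in this section:

import os
os.environ["PT_HPU_LAZY_MODE"] = "1"  # step 5: enable Lazy execution mode

import torch
from habana_frameworks.torch.utils.library_loader import load_habana_module

load_habana_module()  # step 1: load the Habana PyTorch plugin library
import habana_frameworks.torch.core as htcore  # step 4: Habana Torch library

device = torch.device("hpu")  # step 2: target the Gaudi HPU device

model = torch.nn.Linear(784, 10).to(device)  # step 3: move the (placeholder) model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.CrossEntropyLoss()

inputs = torch.randn(128, 784).to(device)        # placeholder batch
targets = torch.randint(0, 10, (128,)).to(device)

optimizer.zero_grad()
loss = loss_fn(model(inputs), targets)
loss.backward()
optimizer.step()
htcore.mark_step()  # step 6: demarcate the training iteration
print(loss.item())  # reading the loss back to the host also triggers execution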

4.3.1. Distributed Communication Using PyTorch

PyTorch supports distributed communication using the torch.distributed and torch.nn.parallel.DistributedDataParallel APIs for both data and model parallelism. PyTorch natively supports several communication backends, such as MPI, Gloo, and NCCL. Support for the HCL communication backend is loaded, and the process group communication backend is initialized as "hcl", using the following script changes:

import habana_torch_hcl
torch.distributed.init_process_group(backend='hcl', rank=rank, world_size=world_size)

PyTorch initializes the HCL communicator with default HCL configuration parameters. You can override the defaults and use your own HCL configuration JSON file by setting the HCL_CONFIG_PATH environment variable. Refer to the Habana Communication Library (HCL) API Reference for more details on configuring HCL parameters. For PyTorch distributed to work correctly, you need to export the environment variable ID. ID is mapped to the local rank, which is used to acquire the Gaudi card for a particular process in the multi-node case.

os.environ["ID"] = local_rank

Additional recommendations to follow when running PyTorch models on HPU are listed below; a short sketch illustrating them follows the list:

  1. For best performance in vision models, images should be moved from host to device using a non-blocking copy call.

  2. During DataLoader initialization, set pin_memory=True for best performance.

  3. To avoid potential Out of Memory (OOM) issues with large models, modify your script to drop the last batch in an epoch if it is a partial batch.

  4. For models that require a significant amount of compute for data preprocessing, an MPI barrier can be used to exclude data load time variations across processes so that training time can be measured accurately.
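A minimal sketch illustrating these recommendations; the dataset here is a synthetic placeholder:

import torch
from torch.utils.data import DataLoader, TensorDataset

# synthetic placeholder dataset of 1024 images
dataset = TensorDataset(torch.randn(1024, 3, 224, 224),
                        torch.randint(0, 10, (1024,)))

loader = DataLoader(dataset,
                    batch_size=256,
                    pin_memory=True,  # recommendation 2: pinned host memory
                    drop_last=True)   # recommendation 3: drop the partial last batch

device = torch.device("hpu")
for images, labels in loader:
    # recommendation 1: non-blocking host-to-device copies
    images = images.to(device, non_blocking=True)
    labels = labels.to(device, non_blocking=True)
    # ... training step ...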

4.4. Torch Multiprocessing for DataLoaders (Vision Models only)

If your training scripts use multiprocessing for the dataloader, change the start method to spawn or forkserver using the PyTorch API multiprocessing.set_start_method(...). For example:

torch.multiprocessing.set_start_method('spawn')

4.5. Convolution Weight Ordering in PyTorch Habana Vision Topologies

Convolution operations are central to vision topologies like ResNet. Gaudi hardware performs convolution operations with the filters (weights) in a 'filters last' format, RSCK, where:

  • R = height of the filter

  • S = width of the filter

  • C = number of channels per filter

  • K = number of filters

The default PyTorch convolution weight ordering is 'filters first' (KCRS). Therefore, a re-ordering/permutation of all the convolution weights from KCRS to RSCK format is required before convolution operations. This permutation of weights is done once at the beginning of training in the PyTorch Habana vision topologies. However, because the weights are in RSCK format during training, a conversion back to KCRS format is necessary when saving intermediate checkpoints or the final trained weights. This brings the weights back to the default PyTorch format (KCRS), for example, for use across DL training platforms.

Due to the permutation of weights to RSCK format, the gradients of these weights will automatically be in the same format on the HPU. Any other tensors calculated as a function of the convolution weights (or their gradients) on the HPU will also be in RSCK format; an example is the 'momentum' tensors corresponding to convolution weights in a ResNet model trained with the Stochastic Gradient Descent with Momentum optimizer. If these tensors (convolution weights, gradients, momentum, etc.) are to be transferred between the CPU and the HPU, they should be permuted appropriately to match the destination's default format (for example, CPU (KCRS) <-> (RSCK) HPU).

The following sections list the various scenarios in which such permutations need to be handled and provide recommendations on how to handle them. The instructions refer to the permutations done in the ResNet training script located in the PyTorch Model References GitHub page.
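The snippet below is illustrative only; it shows what the KCRS to RSCK permutation means for a single convolution weight tensor. The actual permute_params and permute_momentum helpers in the reference script handle this for the whole model:

import torch

w_kcrs = torch.randn(64, 3, 7, 7)    # K=64 filters, C=3 channels, R=7, S=7
w_rsck = w_kcrs.permute(2, 3, 1, 0)  # reorder dims to (R, S, C, K)
print(w_rsck.shape)                  # torch.Size([7, 7, 3, 64])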

4.5.1. Scenario 1: Initializing Training from the Beginning

  1. Initialize the weight on the CPU for the entire model.

  2. Move the model to ‘hpu’ device, for example, model.to("hpu").

  3. Permute the convolution weights and any other dependent tensors like ‘momentum’ to RSCK format. For example:

permute_params(model, True...)  # permute conv weights; True -> KCRS to RSCK format
permute_momentum(optimizer, True ...) # permute momentum; True -> KCRS to RSCK format
  4. Start the training.

4.5.2. Scenario 2: Initializing Training from a Checkpoint

1. If checkpoint loading is followed by the weight permutation mentioned in Scenario 1, first permute the weights and dependent tensors back to the default PyTorch format (if not, go to step 2). For example:

permute_params(model, False...)  # permute conv weights; False ->  RSCK to KCRS format
permute_momentum(optimizer, False ...)  # permute momentum; False ->  RSCK to KCRS format
  2. Load the checkpoint and optimizer state dictionary.

  3. Move the model to the 'hpu' device (if not already done).

  4. Permute the weights and dependent tensors to RSCK format. For example:

permute_params(model, True...)  # permute conv weights; True -> KCRS to RSCK format
permute_momentum(optimizer, True ...) # permute momentum; True -> KCRS to RSCK format
  5. Start the training.

4.5.3. Scenario 3: Saving a Checkpoint

The convolution weights and dependent tensors on the ‘hpu’ device are in RSCK format.

  1. Permute the weights and dependent tensors to KCRS format. For example:

permute_params(model, False...)  # permute conv weights; False ->  RSCK to KCRS format
permute_momentum(optimizer, False ...) # permute momentum; False ->  RSCK to KCRS format
  2. Bring the trainable parameters of the model and optimizer tensors to the CPU and save.

  3. Move the trainable params and optimizer tensors to 'hpu'.

  4. Permute the conv weight tensors and dependent tensors to RSCK format. For example:

permute_params(model, True...)  # permute conv weights; True -> KCRS to RSCK format
permute_momentum(optimizer, True ...) # permute momentum; True -> KCRS to RSCK format
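Putting Scenario 3 together, the following simplified sketch shows the save path. permute_params and permute_momentum are the helpers from the reference script (additional arguments elided, as above), and moving every optimizer state tensor to the CPU is abbreviated here:

permute_params(model, False...)      # RSCK -> KCRS before saving
permute_momentum(optimizer, False...)

checkpoint = {
    'model': {k: v.to('cpu') for k, v in model.state_dict().items()},
    'optimizer': optimizer.state_dict(),  # optimizer tensors should also be brought to the CPU
}
torch.save(checkpoint, 'checkpoint.pth')

permute_params(model, True...)       # KCRS -> RSCK to continue training on HPU
permute_momentum(optimizer, True...)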

The requirement for explicit addition of permutes with permute_params and permute_momentum in the model script will be removed in future releases.

4.6. Custom Habana OPs for PyTorch

Habana provides its own implementation of some complex PyTorch OPs, customized for Habana devices. In a given model, replacing these complex OPs with the custom Habana versions can improve performance.

4.6.1. Custom Optimizers

The following is a list of custom optimizers currently supported on Habana devices:

The following code snippet demonstrates the usage of a custom optimizer:

try:
   from habana_frameworks.torch.hpex.optimizers import FusedLamb
except ImportError:
   raise ImportError("Please install habana_torch package")

optimizer = FusedLamb(model.parameters(), lr=args.learning_rate)

Note

For models using Lazy mode execution, add a mark_step() right before the optimizer.step() call when using a custom optimizer.
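For example, a minimal sketch of this placement, assuming htcore has been imported as shown in the porting steps above:

loss.backward()
htcore.mark_step()  # trigger execution before the custom optimizer step
optimizer.step()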

4.6.1.1. Other Custom OPs

The following is a list of other custom OPs currently supported on Habana devices:

The following code snippet demonstrates the usage of FusedClipNorm:

try:
   from habana_frameworks.torch.hpex.normalization import FusedClipNorm
except ImportError:
   raise ImportError("Please install habana_torch package")

FusedNorm = FusedClipNorm(model.parameters(), args.max_grad_norm)

FusedNorm.clip_norm(model.parameters())