4. Migration Guide

4.1. Introduction

The purpose of this document is to guide users porting their own TensorFlow or PyTorch models to the Gaudi HPU. The instructions provided in this document help ensure the models are functional and ready for further optimization. In addition to this document, refer to the TensorFlow User Guide or PyTorch User Guide.

4.2. Porting a Simple TensorFlow Model to Gaudi

Porting TensorFlow models to Gaudi requires the following:

  • Habana driver

  • Habana TensorFlow Docker image

To install the above, refer to the TensorFlow Installation section or the Setup and Install GitHub page.

Note

The integration is valid for all supported TensorFlow versions. See the TensorFlow section of the Release Notes for more details.

Caution

Using APIs from different TensorFlow versions can cause compatibility issues. Please refer to the TensorFlow section of the Release Notes for a list of current constraints.

4.2.1. Loading the Habana Module

To load the Habana Module for TensorFlow, call load_habana_module() located under library_loader.py. This function loads the Habana libraries needed to use the HPU device at the TensorFlow level. Once loaded, the HPU device is registered in TensorFlow and prioritized over the CPU. This means that when a given op is available for both the CPU and the HPU, the op is assigned to the HPU.

Habana op support and custom TensorFlow ops are defined in the habana_ops object. It can also be imported from library_loader.py, but should only be used after load_habana_module() is called. The custom ops are used to pattern-match vanilla TensorFlow ops.

4.2.2. Enabling a Single Gaudi Device

To enable a single Gaudi device, add the below code to the main function:

from library_loader import load_habana_module

log_info_devices = load_habana_module()

To enable Horovod for multi-Gaudi runs, add distributed functions to the main function. To enable multi-worker training with tf.distribute, use the HPUStrategy class. For more details on porting multi-node models, see Distributed Training with TensorFlow.

4.2.3. Creating Hello_world.py TensorFlow Example

To run the following example, run the Docker image in interactive mode on the Gaudi machine according to the instructions detailed in the Setup and Install GitHub page. After entering the Docker shell, create a “Hello world” TensorFlow example with the following code snippet, also available in the TensorFlow Hello World Example.

import tensorflow as tf
from TensorFlow.common.library_loader import load_habana_module

load_habana_module()

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.models.Sequential([
	tf.keras.layers.Flatten(input_shape=(28, 28)),
	tf.keras.layers.Dense(10),
])
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01) 

model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])

model.fit(x_train, y_train, epochs=5, batch_size=128)
model.evaluate(x_test, y_test)

Hello_world.py is a basic TensorFlow code example. The following explains the Habana-specific lines:

  • Line 2 - Imports the function that enables Gaudi. The imported load_habana_module can be found in /root/<model directory>/TensorFlow/common/library_loader.py.

Note

Ensure that you have a proper PYTHONPATH set by checking that it includes /root.

  • Line 4 - Calls the function imported earlier to enable Gaudi (registered as the ‘HPU’ device in TensorFlow), Habana optimization passes, Habana ops, and so on.

Note

Disabling eager execution with tf.compat.v1.disable_eager_execution() is recommended for scripts that use tf.Session. For scripts that do not use tf.Session, disabling eager execution is not required; however, using tf.function is recommended. TensorFlow can run Habana optimization passes in eager mode, provided the code is enclosed in tf.function.
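As a rough illustration of why tf.function helps, the following toy decorator (plain Python, no TensorFlow; the name `function` and the one-trace-per-signature rule are simplifications invented for this sketch) mimics tf.function's trace-once-then-reuse behavior:

```python
import functools

def function(fn):
    """Toy analogue of tf.function: trace once per input signature, then reuse."""
    traced = {}

    @functools.wraps(fn)
    def wrapper(*args):
        key = tuple(type(a).__name__ for a in args)
        if key not in traced:
            wrapper.traces += 1      # "tracing" (graph building) happens only here
            traced[key] = fn
        return traced[key](*args)

    wrapper.traces = 0
    return wrapper

@function
def step(x):
    return x * 2

step(1)
step(2)      # same input signature: no new trace
step(1.5)    # new input type: re-traced
```

Once traced, subsequent calls with the same signature reuse the compiled graph, which is what lets optimization passes run even though the surrounding script is in eager mode.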

In addition to the above, two migration examples are provided:

For example_tf_func.py, the migration instructions are similar to the Hello_world.py example detailed above. For example_tf_session.py, you must disable eager mode by adding tf.compat.v1.disable_eager_execution() to the code.

In summary, the minimal required change is the addition of the following lines. Please refer to the TensorFlow section of the Release Notes for a list of model constraints.

from TensorFlow.common.library_loader import load_habana_module

load_habana_module()

# tf.compat.v1.disable_eager_execution()  # not needed if there is no tf.Session

The tf.compat.v1.disable_eager_execution() call enables TensorFlow v1 graph mode. Before adding it to your model scripts, review the following requirements:

  • For TensorFlow v1 models, tf.compat.v1.disable_eager_execution() should be added.

  • For TensorFlow v2 models, tf.compat.v1.disable_eager_execution() should be added except for the following conditions:

    • If the model is wrapped using tf.function, adding tf.compat.v1.disable_eager_execution() is not required.

    • If the model is implemented using tf.keras.models and executed using model.fit, tf.compat.v1.disable_eager_execution() is not needed since Keras runs graph mode by default. However, if run_eagerly is set to True, adding tf.compat.v1.disable_eager_execution() is required.

    • For distributed models implemented using Horovod, adding tf.compat.v1.disable_eager_execution() is required. Otherwise, the distributed model will not function.

Note

Distributed models without tf.compat.v1.disable_eager_execution() will not function properly. This known issue is not Habana-related: https://github.com/horovod/horovod/issues/1365.

4.2.4. Executing the Example

After creating Hello_world.py, execute the example by running:

python3 Hello_world.py

You can also run the above example with BF16 support enabled by setting the TF_ENABLE_BF16_CONVERSION=1 environment variable. For a full list of available runtime flags, see Runtime Flags.

TF_ENABLE_BF16_CONVERSION=1 python3 Hello_world.py

The following lines should appear as part of the output:

Epoch 1/5

469/469 [==============================] - 1s 3ms/step - loss: 1.2647 - accuracy: 0.7208

Epoch 2/5

469/469 [==============================] - 1s 2ms/step - loss: 0.7113 - accuracy: 0.8433

Epoch 3/5

469/469 [==============================] - 1s 2ms/step - loss: 0.5845 - accuracy: 0.8606

Epoch 4/5

469/469 [==============================] - 1s 2ms/step - loss: 0.5237 - accuracy: 0.8688

Epoch 5/5

469/469 [==============================] - 1s 2ms/step - loss: 0.4865 - accuracy: 0.8749

313/313 [==============================] - 1s 2ms/step - loss: 0.4482 - accuracy: 0.8869

Since the first iteration includes graph compilation time, it takes longer to run than later iterations. The software stack compiles the graph and saves the recipe to a cache. Unless the graph changes or a new graph arrives, no recompilation is needed during training. Typically, graph compilation happens at the beginning of training and at the beginning of evaluation.
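The compile-once-and-cache behavior can be sketched with a toy recipe cache (plain Python; the class and method names are invented for illustration, not Habana APIs):

```python
class RecipeCache:
    """Toy sketch of graph-recipe caching: compile on first sight, reuse after."""

    def __init__(self):
        self._cache = {}
        self.compilations = 0

    def run(self, graph_signature):
        if graph_signature not in self._cache:
            self.compilations += 1                          # slow first iteration
            self._cache[graph_signature] = "recipe-for-" + graph_signature
        return self._cache[graph_signature]                 # fast cached path

cache = RecipeCache()
for _ in range(5):
    cache.run("train-step")        # same graph every training iteration
assert cache.compilations == 1     # compiled only once, on the first iteration
cache.run("eval-step")             # the evaluation graph triggers one more compile
assert cache.compilations == 2
```

This mirrors why the first training step and the first evaluation step are slow while all later iterations run at full speed.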

4.2.5. Viewing Loss and Accuracy in TensorFlow

You can find loss and accuracy in the demo scripts’ output. Loss and accuracy metrics can be visualized using different Profiler tools. For further details about the Profiler tools you can use, see Viewing Instructions.

4.3. Porting a Simple PyTorch Model to Gaudi

Currently, support for PyTorch is under active development and available only in preview mode. Refer to the PyTorch model examples in the Model References GitHub for experimentation. The steps below are for reference and provide a baseline for preparing a PyTorch model to run on Gaudi. Models following these steps may still require additional modification to achieve optimized performance.

Porting PyTorch models to Gaudi requires the following:

  • Habana driver

  • Habana PyTorch Docker image

To install the above, refer to the Setup and Install GitHub page.

The following code additions need to be made to run a model on Habana. The steps below cover both Eager and Lazy execution modes.

1. Load the Habana PyTorch Plugin Library, libhabana_pytorch_plugin.so. The PATH-TO-LIBRARY-LOADER is the path to your local clone of /model-references/PyTorch/common, located under PyTorch Model References on GitHub.

import os
import sys

library_loader_path = "<PATH-TO-LIBRARY-LOADER>"
sys.path.append(os.path.realpath(os.path.join(
    os.path.dirname(os.path.realpath(__file__)), library_loader_path)))
from library_loader import load_habana_module
load_habana_module()

Alternatively, you may use the following code to load the library. The load_habana_module() utility function, provided in library_loader.py in the PyTorch Model References common directory on GitHub, is a wrapper for this code.

import os
import sys
import torch

habana_modules_directory = "/usr/lib/habanalabs"
sys.path.insert(0, habana_modules_directory)
torch.ops.load_library(os.path.abspath(os.path.join(
    habana_modules_directory, "libhabana_pytorch_plugin.so")))

2. Target the Habana device:

device = torch.device("hpu")

3. Move the model to the device:

model.to(device)

Note

Step 3 may already be implemented in your existing model script.

4. Import the Habana Torch Library:

import habana_frameworks.torch.core as htcore

5. To run your model in Lazy execution mode, set the environment variable shown below. Do not set it if you want to execute your code in Eager mode:

os.environ["PT_HPU_LAZY_MODE"] = "1"

6. In Lazy mode, execution is triggered wherever data is read back to the host from the Habana device. For example, execution is triggered if you are running a topology and getting the loss value into the host from the device with loss.item(). Adding a mark_step() in the code is another mechanism to trigger execution. mark_step() must be placed at the following points in a training script:

  • Right after optimizer.step(), to cleanly demarcate training iterations.

  • Between loss.backward() and optimizer.step(), if the optimizer being used is a Habana custom optimizer.

htcore.mark_step()
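The trigger points above can be illustrated with a toy model of Lazy mode (plain Python; ToyLazyDevice and its methods are invented stand-ins for illustration, not Habana APIs):

```python
class ToyLazyDevice:
    """Toy model of Lazy mode: queued ops run only when a trigger fires."""

    def __init__(self):
        self.pending = []
        self.executed = []

    def enqueue(self, op):
        self.pending.append(op)

    def mark_step(self):
        # Explicit trigger, analogous to htcore.mark_step().
        self.executed.extend(self.pending)
        self.pending.clear()

    def read_back(self, value):
        # Reading data back to the host (e.g. loss.item()) also triggers execution.
        self.mark_step()
        return value

dev = ToyLazyDevice()
dev.enqueue("forward")
dev.enqueue("backward")
dev.enqueue("optimizer.step")
assert dev.executed == []      # nothing has run yet
dev.mark_step()                # placed right after the optimizer step
assert dev.executed == ["forward", "backward", "optimizer.step"]
```

The key point: in Lazy mode, work accumulates until either the host reads a value back or mark_step() is called, which is why its placement in the training loop matters.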

Note

Placing mark_step() at an arbitrary point in the code is not currently supported. Insertion of mark_step() at arbitrary positions will be supported in future releases.

7. Load the checkpoint. Vision models with convolutions require Habana PyTorch-specific steps. Refer to Convolution Weight Ordering in PyTorch Habana Vision Topologies for additional steps on weight order handling.

8. Save the checkpoint. Bring the trainable parameters of the model and the optimizer tensors to the CPU using .to('cpu') on the tensors, then save. For vision models with convolutions, refer to Convolution Weight Ordering in PyTorch Habana Vision Topologies for additional steps on weight order handling.

4.4. Torch Multiprocessing

If your training scripts use multiprocessing, for example in the dataloader, change the start method to spawn or forkserver using the PyTorch API multiprocessing.set_start_method(...). For example:

torch.multiprocessing.set_start_method('spawn')
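The equivalent behavior can be demonstrated with Python's standard multiprocessing module, which torch.multiprocessing wraps (a runnable sketch; `square` and `parallel_squares` are illustrative names):

```python
import multiprocessing as mp

def square(x):
    return x * x

def parallel_squares(values):
    # A 'spawn' context mirrors torch.multiprocessing.set_start_method('spawn').
    ctx = mp.get_context("spawn")
    with ctx.Pool(2) as pool:
        return pool.map(square, values)

if __name__ == "__main__":
    # The __main__ guard is required with 'spawn': child processes
    # re-import this module instead of inheriting the parent's state.
    print(parallel_squares([1, 2, 3]))
```

Note that with spawn (unlike the default fork on Linux), worker functions must be importable at module top level and the entry point must be guarded by `if __name__ == "__main__":`.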

4.5. Convolution Weight Ordering in PyTorch Habana Vision Topologies

Convolution operations are central to vision topologies like ResNet. Gaudi hardware performs convolution operations with the filter (weights) in a filters-last format - RSCK - where:

  • R = height of the filter

  • S = width of the filter

  • C = number of channels per filter

  • K = number of filters

The default PyTorch convolution weight ordering is ‘filters first’ (KCRS). Therefore, a re-ordering/permutation of all convolution weights from KCRS to RSCK format is required before convolution operations. This permutation is done once at the beginning of training in the PyTorch Habana vision topologies. However, since the weights are in RSCK format during training, a conversion back to KCRS format is necessary when saving intermediate checkpoints or the final trained weights. This brings the weights back to the default PyTorch format (KCRS), for example, for use across DL training platforms.

Due to the permutation of the weights to RSCK format, the gradients of these weights will automatically be in the same format on the HPU. Any other tensors calculated as a function of the convolution weights (or their gradients) on the HPU will also be in RSCK format; an example is the ‘momentum’ tensors corresponding to convolution weights in a ResNet model trained with the Stochastic Gradient Descent with Momentum optimizer. If these tensors (convolution weights, gradients, momentum, etc.) are transferred between the CPU and the HPU (for example, CPU (KCRS) <–> (RSCK) HPU), they should be permuted to align with the destination’s default format.
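The KCRS <–> RSCK re-ordering itself is a fixed permutation of axes. The following stdlib-only sketch shows the index math for a weight tensor stored as a flat buffer (`kcrs_to_rsck` and `rsck_to_kcrs` are hypothetical helpers written for this illustration; the actual scripts use permute_params and permute_momentum on torch tensors):

```python
def kcrs_to_rsck(flat, K, C, R, S):
    """Re-order a flat KCRS weight buffer into RSCK (filters-last) layout."""
    out = [None] * (K * C * R * S)
    for k in range(K):
        for c in range(C):
            for r in range(R):
                for s in range(S):
                    src = ((k * C + c) * R + r) * S + s    # KCRS offset
                    dst = ((r * S + s) * C + c) * K + k    # RSCK offset
                    out[dst] = flat[src]
    return out

def rsck_to_kcrs(flat, K, C, R, S):
    """Inverse permutation: RSCK back to the default PyTorch KCRS layout."""
    out = [None] * (K * C * R * S)
    for k in range(K):
        for c in range(C):
            for r in range(R):
                for s in range(S):
                    src = ((r * S + s) * C + c) * K + k    # RSCK offset
                    dst = ((k * C + c) * R + r) * S + s    # KCRS offset
                    out[dst] = flat[src]
    return out

K, C, R, S = 2, 3, 2, 2
w = list(range(K * C * R * S))
# The two permutations are exact inverses: a round trip restores KCRS order.
assert rsck_to_kcrs(kcrs_to_rsck(w, K, C, R, S), K, C, R, S) == w
```

Because the mapping is a pure re-indexing, applying it once at training start and once more before checkpointing is lossless, which is what the scenarios below rely on.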

The following sections list the various scenarios in which such permutations need to be handled and provides recommendations on how to handle them. The instructions refer to permutations done in the ResNet training script located in the PyTorch Model Reference GitHub page.

4.5.1. Scenario 1: Initializing Training from the Beginning

  1. Initialize the weight on the CPU for the entire model.

  2. Move the model to ‘hpu’ device, for example, model.to("hpu").

  3. Permute the convolution weights and any other dependent tensors like ‘momentum’ to RSCK format. For example:

permute_params(model, True)        # permute conv weights; True -> KCRS to RSCK format
permute_momentum(optimizer, True)  # permute momentum; True -> KCRS to RSCK format

  4. Start the training.

4.5.2. Scenario 2: Initializing Training from a Checkpoint

1. If checkpoint loading is preceded by the weight permutation described in Scenario 1, first permute the weights and dependent tensors back to the default PyTorch format (if not, go to step 2). For example:

permute_params(model, False)        # permute conv weights; False -> RSCK to KCRS format
permute_momentum(optimizer, False)  # permute momentum; False -> RSCK to KCRS format

  2. Load the checkpoint and optimizer state dictionary.

  3. Move the model to the ‘hpu’ device (if not already done).

  4. Permute the weights and dependent tensors to RSCK format. For example:

permute_params(model, True)        # permute conv weights; True -> KCRS to RSCK format
permute_momentum(optimizer, True)  # permute momentum; True -> KCRS to RSCK format

  5. Start the training.

4.5.3. Scenario 3: Saving a Checkpoint

The convolution weights and dependent tensors on the ‘hpu’ device are in RSCK format.

  1. Permute the weights and dependent tensors to KCRS format. For example:

permute_params(model, False)        # permute conv weights; False -> RSCK to KCRS format
permute_momentum(optimizer, False)  # permute momentum; False -> RSCK to KCRS format

  2. Bring the trainable parameters of the model and optimizer tensors to the CPU and save.

  3. Move the trainable params and optimizer tensors to ‘hpu’.

  4. Permute the conv weight tensors and dependent tensors to RSCK format. For example:

permute_params(model, True)        # permute conv weights; True -> KCRS to RSCK format
permute_momentum(optimizer, True)  # permute momentum; True -> KCRS to RSCK format

The requirement to explicitly add permutes with permute_params and permute_momentum in the model script will be removed in future releases.

4.6. Custom Habana OPs for PyTorch

For some complex PyTorch ops, Habana provides its own implementations customized for Habana devices. In a given model, replacing these complex ops with their custom Habana versions can improve performance.

4.6.1. Custom Optimizers

The following is a list of custom optimizers currently supported on Habana devices:

The following code snippet demonstrates the usage of a custom optimizer:

try:
    from habana_frameworks.torch.hpex.optimizers import FusedLamb
except ImportError:
    raise ImportError("Please install habana_torch package")

optimizer = FusedLamb(model.parameters(), lr=args.learning_rate)
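If a script should also run on machines without the Habana software stack, an alternative to raising is to fall back gracefully (a sketch under that assumption; `load_fused_lamb` is an illustrative helper, not a Habana API):

```python
def load_fused_lamb():
    """Return Habana's FusedLamb optimizer class when available, otherwise None."""
    try:
        from habana_frameworks.torch.hpex.optimizers import FusedLamb
        return FusedLamb
    except ImportError:
        # Habana stack not installed: caller can fall back to a standard optimizer.
        return None

FusedLamb = load_fused_lamb()
print("FusedLamb available:", FusedLamb is not None)
```

On a Gaudi machine with the Habana packages installed, the helper returns the custom optimizer class; elsewhere it returns None so the script can substitute a stock optimizer.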

Note

For models using Lazy mode execution, add a mark_step() right before the optimizer.step() call when using a custom optimizer.

4.6.2. Other Custom OPs

The following is a list of other custom OPs currently supported on Habana devices:

The following code snippet demonstrates the usage of FusedClipNorm:

try:
    from habana_frameworks.torch.hpex.normalization import FusedClipNorm
except ImportError:
    raise ImportError("Please install habana_torch package")

FusedNorm = FusedClipNorm(model.parameters(), args.max_grad_norm)

FusedNorm.clip_norm(model.parameters())
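Assuming FusedClipNorm performs global-norm gradient clipping (scaling all gradients so their combined L2 norm does not exceed max_grad_norm, as torch.nn.utils.clip_grad_norm_ does), the underlying math can be sketched in plain Python:

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Scale gradients so their combined L2 norm does not exceed max_norm."""
    total = math.sqrt(sum(g * g for g in grads))
    if total > max_norm:
        scale = max_norm / total
        return [g * scale for g in grads]
    return list(grads)   # already within the limit: unchanged

clipped = clip_by_global_norm([3.0, 4.0], 1.0)   # global norm 5.0 -> rescaled to 1.0
```

The fused custom op computes this on-device in one pass instead of iterating over parameter tensors on the host, which is where the performance benefit comes from.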