5. Migration Guide¶
The purpose of this document is to guide users porting their own TensorFlow or PyTorch models to the Habana(R) Gaudi(R) HPU. The instructions provided in this document help ensure the models are functional and ready for further optimization. In addition to this document, refer to the TensorFlow User Guide or PyTorch User Guide.
5.2. Porting a Simple TensorFlow Model to Gaudi¶
Porting TensorFlow models to Gaudi requires the following:
Habana TensorFlow Docker image
To install the above, refer to the Setup and Install GitHub page. As TensorFlow releases newer versions, Habana will continue to support newer versions of TensorFlow and drop support for older versions. See the Support Matrix section of the Release Notes for more details of what is covered in this release.
Using APIs from different TensorFlow versions can cause compatibility issues. Please refer to the TensorFlow section of the Release Notes for a list of current constraints.
5.2.1. Loading the Habana Module¶
To load the Habana Module for TensorFlow, call
load_habana_module() located under
habana_frameworks.tensorflow.
This function loads the Habana libraries needed to use Gaudi HPU at the TensorFlow level.
Once loaded, Gaudi HPU is registered in TensorFlow and prioritized over the CPU.
This means that when a given op is available for both the CPU and the Gaudi HPU, the op is assigned to the Gaudi HPU.
Habana op support and custom TensorFlow ops are defined in the
habana_ops object, also available in habana-tensorflow.
It can be imported as follows:
from habana_frameworks.tensorflow import habana_ops, but it should only be used after
load_habana_module() is called.
The custom ops are used for pattern matching vanilla TensorFlow ops.
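The placement priority described above can be pictured with a toy sketch in plain Python. This is not TensorFlow's actual placer; the priority values are an illustrative assumption reflecting the guide's statement that the HPU is preferred when both devices provide an op.

```python
# Toy model of device-priority placement (illustrative only).
# The numeric priorities are assumptions for this sketch.
DEVICE_PRIORITY = {"HPU": 2, "CPU": 1}

def place_op(available_devices):
    # Among the devices that provide a kernel for this op,
    # pick the one with the highest priority.
    return max(available_devices, key=lambda d: DEVICE_PRIORITY.get(d, 0))

print(place_op(["CPU", "HPU"]))  # op available on both devices -> HPU
print(place_op(["CPU"]))         # no HPU kernel -> falls back to CPU
```
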
5.2.2. Enabling a Single Gaudi Device¶
To enable a single Gaudi device, add the below code to the main function:
from habana_frameworks.tensorflow import load_habana_module
load_habana_module()

# or

import habana_frameworks.tensorflow as htf
htf.load_habana_module()
To enable Horovod for multi-Gaudi runs, add distributed functions to the main function.
For more details on porting multi-node models, including multi-worker training, see Distributed Training with TensorFlow.
5.2.3. Creating a TensorFlow Example¶
In order to run the following example, run the Docker image in interactive mode on the Gaudi machine according to the instructions detailed in the Setup and Install GitHub page.
After entering a Docker shell, create an example.py TensorFlow example with the following code snippet, available in the TensorFlow Hello World Example.
 1  import tensorflow as tf
 2  from habana_frameworks.tensorflow import load_habana_module
 3
 4  load_habana_module()
 5
 6  (x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
 7  x_train, x_test = x_train / 255.0, x_test / 255.0
 8
 9  model = tf.keras.models.Sequential([
10      tf.keras.layers.Flatten(input_shape=(28, 28)),
11      tf.keras.layers.Dense(10),
12  ])
13
14  loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
15  optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
16  model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])
17
18  model.fit(x_train, y_train, epochs=5, batch_size=128)
19  model.evaluate(x_test, y_test)
example.py presents a basic TensorFlow code example. The following further explains the Habana-specific lines:
Line 2 - Import function to enable Gaudi.
Ensure that you have a proper PYTHONPATH set by checking that it includes /root.
Line 4 - The function imported earlier is called to enable Gaudi (registered as ‘HPU’ device in TensorFlow), Habana optimization passes, Habana ops and so on.
In addition to the above, two migration examples are provided:
example_tf_session.py with tf.Session (originally from TF 1.15, found in the TensorFlow Hello World Example). For example_tf_session.py, you must disable eager mode by adding tf.compat.v1.disable_eager_execution() to enable graph mode.

example_tf_func.py with tf.function. The migration instructions are similar to the Hello_world.py example detailed above.
In summary, the minimal required change is the addition of the following lines. Please refer to the TensorFlow section of the Release Notes for a list of model constraints.
from habana_frameworks.tensorflow import load_habana_module
load_habana_module()
The table below summarizes the conditions under which
tf.compat.v1.disable_eager_execution() should be added to model scripts to enable graph mode:

| TF version and API | Recommendation for disable_eager_execution | Code Examples in GitHub |
| --- | --- | --- |
| TF1 scripts running in TF2 compatible mode | tf.compat.v1.disable_eager_execution() is required to enable graph mode. | TF1 model running in TF2 compatible mode: example_tf_session.py |
| TF2 scripts running with Keras model (graph mode by default) | tf.compat.v1.disable_eager_execution() is NOT required to enable graph mode. | TF2 model running with Keras on single Gaudi: example.py; TF2 model running with Keras on Horovod-based multi-Gaudi: example_hvd.py |
| TF2 scripts running with tf.function (graph mode) | tf.compat.v1.disable_eager_execution() is NOT required to enable graph mode. | TF2 model running with tf.function on single Gaudi: example_tf_func.py; TF2 model running with tf.function on Horovod-based multi-Gaudi: example_tf_func_hvd.py |
5.2.4. Executing the Example¶
After creating example.py, execute the example by running:

$PYTHON example.py

You can also run the above example with BF16 support enabled by adding the
TF_ENABLE_BF16_CONVERSION=1 flag. For a full list of available runtime environment variables, see Runtime Environment Variables.

TF_ENABLE_BF16_CONVERSION=1 $PYTHON example.py
The following lines should appear as part of output:
Epoch 1/5
469/469 [==============================] - 1s 3ms/step - loss: 1.2647 - accuracy: 0.7208
Epoch 2/5
469/469 [==============================] - 1s 2ms/step - loss: 0.7113 - accuracy: 0.8433
Epoch 3/5
469/469 [==============================] - 1s 2ms/step - loss: 0.5845 - accuracy: 0.8606
Epoch 4/5
469/469 [==============================] - 1s 2ms/step - loss: 0.5237 - accuracy: 0.8688
Epoch 5/5
469/469 [==============================] - 1s 2ms/step - loss: 0.4865 - accuracy: 0.8749
313/313 [==============================] - 1s 2ms/step - loss: 0.4482 - accuracy: 0.8869
Since the first iteration includes graph compilation time, it takes longer to run than later iterations. The software stack compiles the graph and saves the recipe to cache. Unless the graph changes or a new graph arrives, no recompilation is needed during training. Graph compilation typically happens at the beginning of training and at the beginning of evaluation.
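The caching behavior can be illustrated with a minimal sketch in plain Python. The names here are hypothetical; the real recipe cache lives inside the Habana software stack. Compilation happens once per unique graph, and later iterations reuse the cached recipe.

```python
# Minimal sketch of recipe caching, assuming a graph is identified by a
# signature (e.g. phase and input shapes). Hypothetical names; the actual
# cache is internal to the Habana software stack.
recipe_cache = {}
compile_count = 0

def compile_recipe(signature):
    global compile_count
    compile_count += 1                 # expensive step: once per signature
    return f"recipe for {signature}"

def run_graph(signature):
    if signature not in recipe_cache:  # first time: compile and cache
        recipe_cache[signature] = compile_recipe(signature)
    return recipe_cache[signature]     # later iterations: cache hit

run_graph(("train", (128, 28, 28)))    # compiles (slow first iteration)
run_graph(("train", (128, 28, 28)))    # same graph: reuses cached recipe
run_graph(("eval", (128, 28, 28)))     # new graph: compiles again
print(compile_count)  # 2
```

This mirrors why the first training step and the first evaluation step are slow while all subsequent steps are fast.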
5.3. Porting a Simple PyTorch Model to Gaudi¶
At the time of this release, support for PyTorch was under active development. Refer to the PyTorch model examples in the Model References GitHub for examples. The steps below are for reference and provide a baseline for preparing a PyTorch model to run on Gaudi. Models following these steps may still require additional modification to achieve optimized performance.
Porting PyTorch models to Gaudi requires the following:
Habana PyTorch Docker image
To install the above, refer to the Setup and Install GitHub page.
The following code additions need to be made to run a model on Habana. The steps below cover both Eager and Lazy modes of execution.

1. Load the Habana PyTorch plugin library:

from habana_frameworks.torch.utils.library_loader import load_habana_module
load_habana_module()
2. Target the Gaudi HPU device:

device = torch.device("hpu")

3. Move the model to the device:

model = model.to(device)

Step 3 may already be implemented in your existing model script.
4. Import the Habana Torch Library:
import habana_frameworks.torch.core as htcore
5. Enable Lazy execution mode by setting the environment variable shown below if you want to run your model in this mode. Do not set it if you want to execute your code in Eager mode. For more information on Lazy and Eager modes, refer to the PyTorch User Guide:
os.environ["PT_HPU_LAZY_MODE"] = "1"
6. In Lazy mode, execution is triggered whenever data is read back to the host from the Habana device. For example,
execution is triggered if you are running a topology and retrieving the loss value from the device to the host.
Calling mark_step() in the code is another mechanism to trigger execution. The placement of
mark_step() is required at the following points in a training script:

After optimizer.step(), to cleanly demarcate training iterations.

Right before optimizer.step(), if the optimizer being used is a Habana custom optimizer.

Placing mark_step() at any arbitrary point in the code is not currently supported; insertion of mark_step() at arbitrary positions will be supported in future releases.
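The placement rules above can be sketched with stand-in stub functions in plain Python, so the call order inside one training iteration is visible without the Habana stack. In a real script, mark_step() comes from habana_frameworks.torch.core and the optimizer from torch.optim or a Habana custom optimizer.

```python
# Stand-in stubs to illustrate mark_step() placement; these record call
# order only and are not the real Habana APIs.
calls = []

def loss_backward():
    calls.append("loss.backward")

def optimizer_step():
    calls.append("optimizer.step")

def mark_step():
    calls.append("mark_step")

using_habana_custom_optimizer = True  # assumption for this sketch

for _ in range(2):  # two training iterations
    loss_backward()
    if using_habana_custom_optimizer:
        mark_step()      # right before optimizer.step() for custom optimizers
    optimizer_step()
    mark_step()          # after optimizer.step(): demarcates the iteration

print(calls[:4])
```
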
7. Load the checkpoint. Vision models with convolutions require Habana PyTorch specific steps. Refer to Convolution Weight Ordering in PyTorch Habana Vision Topologies for additional steps on weight order handling.
8. Save the checkpoint. Bring the trainable parameters of the model and the optimizer tensors to the CPU by calling
.to('cpu') on the tensors, and save them. For vision models with convolutions, refer to Convolution Weight Ordering in PyTorch Habana Vision Topologies for additional steps on weight order handling.
5.3.1. Distributed Communication Using PyTorch¶
PyTorch supports distributed communication using the
torch.distributed APIs for both data and model parallelism.
PyTorch natively supports several communication backends, such as MPI, Gloo, and NCCL.
Habana support for distributed communication can be enabled using the HCCL backend.
For PyTorch distributed to work correctly, you need to export the environment variable
ID. ID is mapped to the local
rank, which is used to acquire the Gaudi card for a particular process in the multi-node case.

os.environ["ID"] = local_rank
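A small helper can make the mapping explicit; the helper name is hypothetical, and only the os.environ["ID"] assignment itself comes from the guide (note that environment variable values must be strings).

```python
import os

def assign_gaudi_card(local_rank):
    # Map this process's local rank to a Gaudi card via the ID environment
    # variable, as required for multi-node PyTorch distributed runs.
    # Hypothetical helper name; environment values must be strings.
    os.environ["ID"] = str(local_rank)

assign_gaudi_card(3)
print(os.environ["ID"])  # 3
```
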
5.3.2. Distributed Backend Initialization¶
Support for the HCCL communication backend is loaded and the process group communication backend is initialized as "hccl" using the following script changes:

import habana_frameworks.torch.core.hccl
torch.distributed.init_process_group(backend='hccl', rank=rank, world_size=world_size)
5.4. Torch Multiprocessing for DataLoaders (Vision Models only)¶
If your training scripts use multiprocessing for the dataloader, change the start method to
forkserver using
multiprocessing.set_start_method(...).
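A minimal sketch with the standard library's multiprocessing module is shown below; training scripts typically call the same API on torch.multiprocessing before creating DataLoader workers. The force=True argument is only there to make the sketch re-runnable; real scripts call set_start_method once, early in the program.

```python
import multiprocessing as mp

# Select the forkserver start method before any worker processes are
# created (available on Unix-like systems). In a PyTorch script this
# would typically be torch.multiprocessing.set_start_method("forkserver").
mp.set_start_method("forkserver", force=True)

print(mp.get_start_method())  # forkserver
```
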
5.5. Custom Habana OPs for PyTorch¶
Habana provides its own implementation of some complex PyTorch ops customized for Habana devices. In a given model, replacing these complex ops with the custom Habana versions can improve performance.
5.5.1. Custom Optimizers¶
The following is a list of custom optimizers currently supported on Habana devices:
FusedAdagrad - refer to torch.optim.Adagrad
FusedAdamW - refer to torch.optim.AdamW
FusedLamb - refer to LAMB optimizer paper
FusedSGD - refer to torch.optim.SGD
The below shows an example code snippet demonstrating the usage of a custom optimizer:
try:
    from habana_frameworks.torch.hpex.optimizers import FusedLamb
except ImportError:
    raise ImportError("Please install habana_torch package")

optimizer = FusedLamb(model.parameters(), lr=args.learning_rate)
For models using Lazy mode execution, add a
mark_step() right before the
optimizer.step() call when using a custom optimizer.
5.5.2. Other Custom OPs¶
The following is a list of other custom OPs currently supported on Habana devices:
FusedClipNorm - refer to torch.nn.utils.clip_grad_norm_
The below is an example code snippet demonstrating the usage of FusedClipNorm:
try:
    from habana_frameworks.torch.hpex.normalization import FusedClipNorm
except ImportError:
    raise ImportError("Please install habana_torch package")

FusedNorm = FusedClipNorm(model.parameters(), args.max_grad_norm)
FusedNorm.clip_norm(model.parameters())