GPU Migration Toolkit

The GPU Migration Toolkit simplifies migrating PyTorch models that run on GPU-based architecture to run on Gaudi. Rather than manually replacing Python API calls that have dependencies on GPU libraries with Habana-specific API calls, the toolkit automates this process so you can run your model with fewer modifications.

The GPU Migration toolkit maps specific API calls from the Python libraries and modules listed below to their SynapseAI equivalents:

  • torch.cuda

  • Torch API with GPU-related parameters. For example, torch.randn(device="cuda").

  • Apex. If a specific model requires Apex, see the Limitations section for further instructions.

  • pynvml
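
For illustration, the snippet below is a minimal sketch of the kind of GPU-dependent code these mappings cover (the tensor shapes and model are arbitrary placeholders). With the toolkit imported, torch.cuda queries and device="cuda" arguments are redirected to the HPU:

import habana_frameworks.torch.gpu_migration  # activates the API mappings on import
import torch

# torch.cuda calls are mapped to their HPU equivalents
print(torch.cuda.is_available())

# device="cuda" arguments are remapped to the Gaudi device
x = torch.randn(2, 3, device="cuda")
model = torch.nn.Linear(3, 1).to("cuda")
y = model(x)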

The toolkit does not optimize the performance of the model, so further modifications may be required. For more details, refer to the Model Performance Optimization Guide for PyTorch. This document describes a subset of supported Python APIs and provides instructions on how to automatically migrate PyTorch models using these calls.

The GPU Migration Toolkit is currently experimental.

Using GPU Migration Toolkit

The GPU Migration toolkit is pre-installed as a Python package in SynapseAI. To migrate your model from GPU to HPU, perform the following steps:

  1. Prepare the environment for initial setup by following the steps in the Installation Guide.

    Note: It is recommended to use the Habana PyTorch Docker images. Make sure that the packages listed in the model's requirements.txt do not override the Habana PyTorch module torch, or PyTorch Lightning, as these packages contain Gaudi-specific enhancements. Additionally, torchaudio and torchvision are validated on Gaudi and included in the Docker image. Other PyTorch libraries have not been formally validated.
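
    For example, hypothetical requirements.txt pins like the following would pull the upstream builds from PyPI over the Habana-provided packages and should be removed or relaxed (the version numbers are arbitrary):

# requirements.txt -- pins like these override the Gaudi-enabled builds in the container:
torch==2.0.1
pytorch-lightning==2.0.4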

  2. Import the GPU Migration Package and Habana Torch Library at the beginning of the primary Python script (main.py, train.py, etc.):

import habana_frameworks.torch.gpu_migration
import habana_frameworks.torch.core as htcore
  3. Add mark_step(). In Lazy mode, mark_step() must be added in all training scripts right after loss.backward() and optimizer.step():

htcore.mark_step()
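
For example, a typical Lazy mode training loop with these calls in place might look like the following sketch (model, loss_fn, optimizer, and train_loader are placeholders for your own objects):

for inputs, targets in train_loader:
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    htcore.mark_step()  # triggers graph execution after the backward pass
    optimizer.step()
    htcore.mark_step()  # triggers graph execution after the optimizer step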
  4. Make sure that any device selection argument passed to the script is configured as if the script were running on a GPU. For example, add --cuda or --device gpu in the runtime command of your model. This ensures that the GPU Migration toolkit accurately detects and migrates instructions.
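
Assuming your script exposes a --cuda flag (a hypothetical argument name), the run command might look like:

$PYTHON train.py --cuda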

You are now ready to begin training your model on HPU.

It is highly recommended to review the GPU Migration examples in the Model References GitHub repository. These show the GPU Migration toolkit working on publicly available models and include the log files and additional information needed for performance tuning.

Additional Model Considerations

For other libraries and model packages, please consider the following:

  • For DeepSpeed models, be sure to continue using the Habana fork of the DeepSpeed library, and set dist_backend to HCCL when calling deepspeed.init_distributed(): deepspeed.init_distributed(dist_backend='hccl', init_method=<init_method>). See the sketch after this list.

  • For Hugging Face models, it is recommended to use the existing Optimum-Habana interface for training and inference. You can refer to the Gaudi-specific examples or use additional models from the Hugging Face library. In some cases, the GPU Migration toolkit can help identify structures that may need to be modified.

  • For PyTorch Lightning, users should follow the existing methods used to migrate models to work with PyTorch Lightning.

  • For Fairseq models, users should start with Habana's Fairseq fork, available in the Habana GitHub repository.
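
For the DeepSpeed case above, a minimal initialization sketch might look as follows (the rest of the training setup follows the standard DeepSpeed flow and is omitted):

import habana_frameworks.torch.gpu_migration
import deepspeed

# Use the HCCL collective communication backend on Gaudi
deepspeed.init_distributed(dist_backend='hccl')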

Enabling GPU Migration Logging

You can enable the Logging feature, included in the GPU Migration Toolkit, by setting the GPU_MIGRATION_LOG_LEVEL environment variable as described in the table below. This generates log files that provide insight into the automation enabled by the GPU Migration Toolkit while running the model.

Level   Description
1       Logs all modules and prints to the console.
2       Logs all modules.
3       Logs all modules excluding torch.

Using the MNIST example, you can enable the Logging feature as follows:

GPU_MIGRATION_LOG_LEVEL=3 $PYTHON main.py

The log files are stored under $HABANA_LOGS/gpu_migration_logs/. Sample log files for the example models listed above can be found in their corresponding directory.

The Logging feature allows you to identify GPU calls that do not match HPU equivalents or may not have been implemented. If you encounter such a scenario, implement the necessary changes to the model script manually, based on the information provided in the log file:

  • If you modify only a few unimplemented calls, you can then directly rerun the training process.

  • If you modify all the unimplemented calls, remove the import habana_frameworks.torch.gpu_migration line and restart the training process without the GPU Migration toolkit.

GPU Migration APIs Support

The Python APIs supported by the GPU Migration toolkit are classified as hpu_match, hpu_modified, and hpu_mismatch according to their HPU implementation. To access the full list of these calls and their compatibility with Gaudi, refer to the Habana GPU Migration APIs guide.

Support Matrix

The matrix below lists the library versions verified with the GPU Migration toolkit:

Library   Verified version   Additional notes
Apex      7b2e71b            See Limitations section for details on installation.
pynvml    8.0.4
torch     2.0.1

Limitations

  • All the libraries and modules are preinstalled in Habana containers except for the Apex library. If a specific model requires Apex, run the following commands to install it:

git clone https://github.com/NVIDIA/apex.git && cd apex
# Required for installation without GPU dependencies and build isolation
git fetch origin pull/1610/head:bug_fix && git cherry-pick b559175
git fetch origin pull/1680/head:dependency_fix && git cherry-pick --allow-empty --no-commit 2944255 --strategy-option=theirs
pip install -v --disable-pip-version-check --no-cache-dir ./
  • GPU Migration does not migrate calls to third-party programs such as nvcc or nvidia-smi. Those calls need to be migrated manually, as in the example below.
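
For example, a launch script that shells out to nvidia-smi for device status needs a manual change; on Gaudi machines, the hl-smi utility provides comparable device information (assuming it is available, as in the SynapseAI containers):

# Before (GPU): query device status
nvidia-smi
# After (Gaudi): hl-smi reports HPU device status
hl-smi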