GPU Migration Toolkit

The GPU Migration toolkit simplifies migrating PyTorch models that run on GPU-based architectures to the Intel® Gaudi® AI accelerator. Rather than requiring you to manually replace Python API calls that depend on GPU libraries with Gaudi-specific API calls, the toolkit automates this process so you can run your model with fewer modifications.

The GPU Migration toolkit maps specific API calls from the Python libraries and modules listed below to the appropriate equivalents in the Intel Gaudi software:

torch.cuda
Torch API with GPU-related parameters. For example, torch.randn(device="cuda").
Apex. If a specific model requires Apex, see the Limitations section for further instructions.
pynvml

The toolkit does not optimize the performance of the model, so further modifications may be required. For more details, refer to the Model Performance Optimization Guide.

Enabling the GPU Migration Toolkit

The GPU Migration toolkit is preinstalled as a Python package in the Intel Gaudi software. To migrate your model from GPU to Intel Gaudi, perform the following steps:

Prepare the environment for initial setup by following the steps in the Installation Guide.

Note: It is recommended to use the Intel Gaudi PyTorch Docker images. Make sure that the packages in the model's requirements.txt do not override the Intel Gaudi PyTorch module, torch, or PyTorch Lightning, as these packages contain Gaudi-specific enhancements. torchaudio and torchvision are also validated on Gaudi and included in the Docker image. Other PyTorch libraries have not been formally validated.
Import habana_frameworks.torch.core at the beginning of the primary Python script (main.py, train.py, etc.):
import habana_frameworks.torch.core as htcore
Add mark_step(). In Lazy mode, mark_step() must be added in all training scripts right after loss.backward() and right after optimizer.step(). For further details on mark_step, refer to the mark_step section.
htcore.mark_step()
Set the PT_HPU_GPU_MIGRATION=1 environment variable when running the primary Python script (main.py, train.py, etc.). Make sure that any device selection argument passed to the script is configured as if the script were running on a GPU. For example, add --cuda or --device gpu to the runtime command of your model. This ensures that the GPU Migration toolkit accurately detects and migrates instructions:
PT_HPU_GPU_MIGRATION=1 $PYTHON main.py
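The import and mark_step() steps above can be sketched as a single training step. This is a minimal sketch, not code from the toolkit: train_one_step and its parameters are hypothetical names, and the ImportError fallback is an assumption added only so the sketch can run on a machine without the Intel Gaudi software.

```python
# Import the Gaudi core module at the top of the primary script.
try:
    import habana_frameworks.torch.core as htcore
    mark_step = htcore.mark_step
except ImportError:
    # Assumption for illustration only: a no-op stand-in so this sketch
    # still runs where the Intel Gaudi software is not installed.
    def mark_step():
        pass


def train_one_step(model, loss_fn, optimizer, inputs, targets, mark=None):
    """Run one optimization step with mark_step() placed as described above.

    Hypothetical helper: `model`, `loss_fn`, and `optimizer` are expected
    to behave like their PyTorch counterparts.
    """
    mark = mark_step if mark is None else mark
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    mark()  # right after loss.backward()
    optimizer.step()
    mark()  # right after optimizer.step()
    return loss
```

The optional mark argument exists only so the call order can be checked without Gaudi hardware. In a real script, calling htcore.mark_step() directly at the two marked places is enough.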

You are now ready to begin training your model on Intel Gaudi.

It is highly recommended to review the GPU Migration examples in the Intel Gaudi Model References GitHub repository. These show the GPU Migration toolkit working on publicly available models, including the log files and additional information needed for performance tuning: MNIST, BERT, ResNet50, and Stable Diffusion.

Additional Model Considerations

For other libraries and model packages, please consider the following:

For DeepSpeed models, continue to use the Intel Gaudi fork of the DeepSpeed library and set the distributed backend to HCCL: deepspeed.init_distributed(dist_backend='hccl', init_method=<init_method>).
For Hugging Face models, it is recommended to simply use the existing Optimum for Intel Gaudi interface for training and inference. You can refer to the Gaudi-specific examples or use additional models from the Hugging Face library. In some cases, the GPU Migration toolkit may help in identifying structures that may need to be modified.
For PyTorch Lightning, follow the existing methods used to migrate models to work with PyTorch Lightning.
For Fairseq models, start with Intel Gaudi’s Fairseq fork.

Enabling GPU Migration Logging

You can enable the logging feature, included in the GPU Migration toolkit, by setting the GPU_MIGRATION_LOG_LEVEL environment variable as described in the table below. This generates log files that provide insight into the automation enabled by the GPU Migration toolkit while running the model.

Level	Description
1	Logs all modules and prints to the console.
2	Logs all modules.
3	Logs all modules excluding torch.

Using the MNIST example, you can add the logging environment variable to the run command as follows:

GPU_MIGRATION_LOG_LEVEL=3 PT_HPU_GPU_MIGRATION=1 $PYTHON main.py
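The same launch can also be scripted. The sketch below builds the environment for a child process from the two variables named above. migration_env and run_with_migration are hypothetical helpers, and main.py is the placeholder script name used in this guide. Log levels outside the table are rejected so that a typo is caught before launch.

```python
import os
import subprocess
import sys

VALID_LOG_LEVELS = {1, 2, 3}  # the levels listed in the table above


def migration_env(log_level=None, base_env=None):
    """Return a copy of the environment with the GPU Migration toolkit enabled."""
    env = dict(os.environ if base_env is None else base_env)
    env["PT_HPU_GPU_MIGRATION"] = "1"
    if log_level is not None:
        if log_level not in VALID_LOG_LEVELS:
            raise ValueError(f"log_level must be one of {sorted(VALID_LOG_LEVELS)}")
        env["GPU_MIGRATION_LOG_LEVEL"] = str(log_level)
    return env


def run_with_migration(script="main.py", log_level=None):
    """Run `script` with the current interpreter and the toolkit enabled."""
    return subprocess.run(
        [sys.executable, script],
        env=migration_env(log_level),
        check=True,
    )
```

Calling run_with_migration("main.py", log_level=3) corresponds to the command line shown above, with sys.executable standing in for $PYTHON.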

The log files are stored under $HABANA_LOGS/gpu_migration_logs/. Sample log files for ResNet50 and Stable Diffusion can be found in their corresponding directories.

The logging feature allows you to identify GPU calls that do not match an Intel Gaudi equivalent and may not have been implemented. If you encounter such a scenario, manually implement the necessary changes to the model script, based on the information provided in the log file:

If you modify only some of the unimplemented calls, rerun the training process.
If you modify all unimplemented calls, disable the toolkit by unsetting the PT_HPU_GPU_MIGRATION environment variable or setting PT_HPU_GPU_MIGRATION=0, and restart the training process without the GPU Migration toolkit.

GPU Migration APIs Support

The Python APIs supported by the GPU Migration toolkit are categorized as identical (directly mapped), modified and unsupported based on their compatibility with Gaudi. These categories correspond to the logging labels: hpu_match, hpu_modified and hpu_mismatch. To access the full list of these calls and their compatibility with Gaudi, refer to the Intel Gaudi GPU Migration Toolkit APIs guide.
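One way to triage a run is to count how often each label appears in the generated logs. The sketch below assumes, without confirmation from this guide, that the three labels appear verbatim in the log text. The exact log layout is not described here, so the code only does a plain substring match per line. count_labels and default_log_dir are hypothetical helpers.

```python
import os
from collections import Counter
from pathlib import Path

LABELS = ("hpu_match", "hpu_modified", "hpu_mismatch")


def default_log_dir():
    """Log location named in this guide: $HABANA_LOGS/gpu_migration_logs/."""
    return Path(os.environ["HABANA_LOGS"]) / "gpu_migration_logs"


def count_labels(log_dir):
    """Count the lines that mention each label in every file under log_dir."""
    counts = Counter({label: 0 for label in LABELS})
    for path in Path(log_dir).rglob("*"):
        if not path.is_file():
            continue
        # Assumption: logs are text files; undecodable bytes are replaced.
        with path.open(errors="replace") as handle:
            for line in handle:
                for label in LABELS:
                    if label in line:
                        counts[label] += 1
    return counts
```

A non-zero hpu_mismatch count from count_labels(default_log_dir()) suggests the script still contains calls that need manual changes.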

Support Matrix

The support matrix below lists the library versions verified with the GPU Migration toolkit:

Library	Verified version	Additional notes
Apex	e13873d	See the Limitations section for details on installation.
pynvml	12.0.0
torch	2.6.0
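A quick way to compare an environment against the matrix is to query the installed package metadata. This is a sketch under assumptions: the distribution names torch and pynvml are assumed to match what is installed, any local build suffix after "+" is ignored, and check_versions is a hypothetical helper. Apex is left out because the matrix pins it to a commit rather than a release number.

```python
from importlib.metadata import PackageNotFoundError, version

# Verified versions copied from the support matrix above.
VERIFIED = {"pynvml": "12.0.0", "torch": "2.6.0"}


def check_versions(lookup=version):
    """Map each library to (installed_or_None, verified, matches)."""
    report = {}
    for name, verified in VERIFIED.items():
        try:
            installed = lookup(name)
        except PackageNotFoundError:
            installed = None
        # Drop a local build suffix such as "+<tag>" before comparing.
        public = installed.split("+", 1)[0] if installed else None
        report[name] = (installed, verified, public == verified)
    return report
```

A False in the last tuple position means only that the installed version differs from the verified one, not that the toolkit will fail.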

Limitations

All the libraries and modules are preinstalled in Gaudi containers except for the Apex library. If a specific model requires Apex, run the following commands to install it:

git clone https://github.com/NVIDIA/apex.git && cd apex
# Required for installation without GPU dependencies and build isolation
git checkout e13873d
pip install -r requirements.txt
git fetch origin pull/1610/head:bug_fix && git cherry-pick b559175 --allow-empty --no-commit
pip install -v --no-build-isolation --disable-pip-version-check --no-cache-dir ./

The GPU Migration toolkit does not migrate calls to third-party programs such as nvcc or nvidia-smi. Those calls need to be migrated manually.
Intel Gaudi prefers the bfloat16 data type over float16 for model training and inference. To enable automatic conversion from float16 to bfloat16, set PT_HPU_CONVERT_FP16_TO_BF16_FOR_MIGRATION=1 as shown below. With the default, PT_HPU_CONVERT_FP16_TO_BF16_FOR_MIGRATION=0, the declared data type is used:
```
PT_HPU_CONVERT_FP16_TO_BF16_FOR_MIGRATION=1 PT_HPU_GPU_MIGRATION=1 $PYTHON main.py
```
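For the limitation on third-party programs noted above, a project can be searched for the two program names before migration. The sketch below is a plain text search over .py files, not a feature of the toolkit. find_external_gpu_tools is a hypothetical helper, and it also reports matches that occur in comments or strings.

```python
import re
from pathlib import Path

# The two third-party programs named in the limitation above.
TOOL_PATTERN = re.compile(r"\b(nvcc|nvidia-smi)\b")


def find_external_gpu_tools(project_dir):
    """Return sorted (relative_path, line_number, tool) tuples for .py files."""
    root = Path(project_dir)
    hits = []
    for path in root.rglob("*.py"):
        text = path.read_text(errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            for match in TOOL_PATTERN.finditer(line):
                relative = path.relative_to(root).as_posix()
                hits.append((relative, number, match.group(1)))
    return sorted(hits)
```

Each reported location is a candidate for manual migration. Shell scripts and build files would need a separate search.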

Gaudi Documentation 1.21.1
