GPU Migration Toolkit¶
The GPU Migration Toolkit simplifies migrating PyTorch models built for GPU-based architectures so they run on Gaudi. Rather than requiring you to manually replace Python API calls that depend on GPU libraries with Habana-specific API calls, the toolkit automates this process so you can run your model with fewer modifications.
The GPU Migration toolkit maps specific API calls from the Python libraries and modules listed below to the appropriate SynapseAI equivalents:

- torch.cuda
- Torch APIs with GPU-related parameters, for example torch.randn(device="cuda")
- Apex. If a specific model requires Apex, see the Limitations section for further instructions.
- pynvml
The toolkit does not optimize the performance of the model, so further modifications may be required. For more details, refer to Model Performance Optimization Guide for PyTorch. This document describes a subset of supported Python APIs and provides instructions on how to automatically migrate PyTorch models using these calls.
GPU Migration Toolkit is currently experimental.
Using GPU Migration Toolkit¶
GPU Migration toolkit is pre-installed as a Python package in SynapseAI. To migrate your model from GPU to HPU, perform the following steps:
Prepare the environment for initial setup by following the steps in the Installation Guide.
Note: It is recommended to use the Habana PyTorch Docker images and to ensure that the existing packages in the model's requirements.txt do not override the Habana PyTorch module torch or PyTorch Lightning, as these packages contain Gaudi-specific enhancements. Additionally, torchaudio and torchvision are validated on Gaudi and included in the Docker image. Other PyTorch libraries have not been formally validated.
Import the GPU Migration Package and Habana Torch Library at the beginning of the primary Python script (main.py, train.py, etc.):
import habana_frameworks.torch.gpu_migration
import habana_frameworks.torch.core as htcore
Add mark_step(). In Lazy mode, mark_step() must be added in all training scripts right after loss.backward() and optimizer.step():
htcore.mark_step()
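As a minimal sketch of the steps above, the imports and mark_step() placement fit into a training loop as follows. The model, data, and optimizer here are placeholders chosen for illustration; on a machine without the Habana stack, a no-op stand-in is substituted so the structure can still be exercised:

```python
# Sketch of a Lazy-mode training loop with mark_step() placement.
# The model, data, and optimizer are illustrative placeholders.
import torch

try:
    import habana_frameworks.torch.gpu_migration  # noqa: F401  (must be imported first)
    import habana_frameworks.torch.core as htcore
except ImportError:
    class htcore:  # no-op stand-in so the sketch also runs without the Habana stack
        @staticmethod
        def mark_step():
            pass

model = torch.nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

losses = []
for _ in range(3):
    x = torch.randn(8, 4)
    loss = model(x).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    htcore.mark_step()   # required right after loss.backward() in Lazy mode
    optimizer.step()
    htcore.mark_step()   # required right after optimizer.step() in Lazy mode
    losses.append(loss.item())
```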
Make sure that any device selection argument passed to the script is configured as if the script were running on a GPU. For example, add --cuda or --device gpu to the runtime command of your model. This guarantees that the GPU Migration toolkit accurately detects and migrates instructions.
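For example, a script's device argument can keep its GPU spelling and let the toolkit handle the mapping. This is a hypothetical sketch using argparse; the flag name and helper are illustrative, not part of the toolkit:

```python
# Hypothetical device-selection parsing: keep the GPU spelling of the flag so
# the GPU Migration toolkit can detect and migrate the device placement.
import argparse

def parse_device(argv):
    parser = argparse.ArgumentParser()
    parser.add_argument("--device", default="cpu", choices=["cpu", "gpu"])
    args = parser.parse_args(argv)
    # Keep "cuda" here as if running on a GPU; the toolkit redirects it to HPU.
    return "cuda" if args.device == "gpu" else "cpu"

print(parse_device(["--device", "gpu"]))  # -> cuda
```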
You are now prepared to begin your model training on HPU.
It is highly recommended to review the GPU Migration examples in the Model References GitHub repository. These show the GPU Migration toolkit working on publicly available models, including the log files and additional information needed for performance tuning:
MNIST, BERT, ResNet50, and Stable Diffusion
Additional Model Considerations¶
For other libraries and model packages, please consider the following:
For DeepSpeed models, be sure to continue using the Habana fork of the DeepSpeed library, and set the dist_backend parameter of deepspeed.init_distributed() to HCCL:
deepspeed.init_distributed(dist_backend='hccl', init_method=<init_method>)
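As a short sketch of the change above (assuming the Habana DeepSpeed fork is installed; the wrapper function name is illustrative, and init_method is omitted here as it depends on your launch setup):

```python
# Sketch only: the backend change needed when running DeepSpeed on Gaudi.
# The function name is illustrative; call it from your distributed setup code.
def init_distributed_for_gaudi():
    import deepspeed  # assumes the Habana DeepSpeed fork is installed
    # HCCL replaces NCCL as the collective-communication backend on Gaudi.
    deepspeed.init_distributed(dist_backend="hccl")
```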
For Hugging Face models, it is recommended to simply use the existing Optimum-Habana interface for training and inference. You can refer to the Gaudi-specific examples or use additional models from the Hugging Face library. In some cases, the GPU Migration toolkit may help in identifying structures that may need to be modified.
For PyTorch Lightning, users should follow the existing methods used to migrate models to work with PyTorch Lightning.
For Fairseq models, users should start with Habana’s Fairseq fork from the Habana GitHub Repository here.
Enabling GPU Migration Logging¶
You can enable the Logging feature included in the GPU Migration Toolkit by setting the GPU_MIGRATION_LOG_LEVEL environment variable as described in the table below. This generates log files that provide insight into the automation performed by the GPU Migration Toolkit while running the model.
| Level | Description |
|---|---|
| 1 | Logs all modules and prints to the console. |
| 2 | Logs all modules. |
| 3 | Logs all modules excluding torch. |
Using the MNIST Example, you can add the Logging feature as follows:
GPU_MIGRATION_LOG_LEVEL=3 $PYTHON main.py
The log files are stored under $HABANA_LOGS/gpu_migration_logs/.
Sample log files for the example models listed above can be found in their corresponding directory.
The Logging feature allows you to identify GPU calls that do not match HPU functionality and may not have been implemented. If you encounter such a scenario, implement the necessary changes in the model script manually, based on the information provided in the log file:
If you modify only a few unimplemented calls, you can then directly rerun the training process.
If you modify all the unimplemented calls, remove the import habana_frameworks.torch.gpu_migration line of code and restart the training process without the GPU Migration toolkit.
GPU Migration APIs Support¶
The Python APIs supported by the GPU Migration toolkit are classified as hpu_match, hpu_modified, or hpu_mismatch according to their HPU implementation. To access the full list of these calls and their compatibility with Gaudi, refer to the Habana GPU Migration APIs guide.
Support Matrix¶
The matrix below lists the library versions verified with the GPU Migration toolkit:
| Library | Verified version | Additional notes |
|---|---|---|
| Apex | 7b2e71b | See Limitations section for details on installation. |
| pynvml | 8.0.4 | |
| torch | 2.0.1 | |
Limitations¶
All the libraries and modules are preinstalled in Habana containers except for the Apex library. If a specific model requires Apex, run the following command to install it:
git clone https://github.com/NVIDIA/apex.git && cd apex
# Required for installation without GPU dependencies and build isolation
git fetch origin pull/1610/head:bug_fix && git cherry-pick b559175
git fetch origin pull/1680/head:dependency_fix && git cherry-pick --allow-empty --no-commit 2944255 --strategy-option=theirs
pip install -v --disable-pip-version-check --no-cache-dir ./
GPU Migration does not migrate calls to third-party programs such as nvcc or nvidia-smi. Those calls need to be migrated manually.
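For instance, a hard-coded nvidia-smi invocation can be migrated by hand. The helper below is a hypothetical sketch (the function name and fallback logic are illustrative, not part of the toolkit) that prefers Habana's hl-smi tool when it is available:

```python
# Hypothetical manual migration of a hard-coded nvidia-smi call:
# prefer Habana's hl-smi when it is on PATH, fall back to nvidia-smi.
import shutil
import subprocess

def smi_tool():
    """Return the first available system-management binary, or None."""
    for tool in ("hl-smi", "nvidia-smi"):
        if shutil.which(tool):
            return tool
    return None

tool = smi_tool()
if tool:  # only invoke when some SMI tool actually exists on this machine
    print(subprocess.run([tool], capture_output=True, text=True).stdout)
```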