Running Workloads on Docker

Before you start, make sure to follow the instructions in the Installation Guide and On-Premise System Update.

Start Training a PyTorch Model on Gaudi

Run the Intel Gaudi Docker image:

DOCKER_OPTS="-e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host"
docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all $DOCKER_OPTS vault.habana.ai/gaudi-docker/1.22.1/ubuntu22.04/habanalabs/pytorch-installer-2.7.1:latest

Clone the Model References repository inside the container that you have just started:
```
git clone https://github.com/HabanaAI/Model-References.git
```
Change to the subdirectory containing the hello_world example, which is a basic PyTorch code example:
```
cd Model-References/PyTorch/examples/computer_vision/hello_world/
```
Set GC_KERNEL_PATH, update PYTHONPATH to include the Model References repository, and set PYTHON to the Python executable:
```
export GC_KERNEL_PATH=/usr/lib/habanalabs/libtpc_kernels.so
export PYTHONPATH=$PYTHONPATH:Model-References
export PYTHON=/usr/bin/python3.10
```
Note

The Python version depends on the operating system. Refer to the Support Matrix for a full list of supported operating systems and Python versions.
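
Because the interpreter path varies by operating system, a script can look it up instead of hard-coding it. The following is a minimal sketch, assuming the interpreter you want is either a specific versioned name on `PATH`, the first `python3` on `PATH`, or the one running the script. The function name `find_python` is illustrative.

```python
import shutil
import sys

def find_python(preferred="python3.10"):
    """Return a path to a Python executable, preferring a specific version name."""
    # shutil.which returns None when the name is not found on PATH.
    return shutil.which(preferred) or shutil.which("python3") or sys.executable

if __name__ == "__main__":
    # Emit a line that can be pasted into the shell in place of the hard-coded path.
    print(f"export PYTHON={find_python()}")
```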

Training Examples

Training on a Single Gaudi Card

Run training on a single HPU in BF16 with autocast enabled. This is a simple MNIST image classification model. Copy this run command into your terminal window:

PT_HPU_LAZY_MODE=0 $PYTHON mnist.py --batch-size=64 --epochs=1 --lr=1.0 --gamma=0.7 --hpu --autocast --use-torch-compile

The following shows the expected output:

============================= HABANA PT BRIDGE CONFIGURATION ===========================
 PT_HPU_LAZY_MODE = 0
 PT_HPU_RECIPE_CACHE_CONFIG = ,false,1024
 PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
 PT_HPU_LAZY_ACC_PAR_MODE = 1
 PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
 PT_HPU_EAGER_PIPELINE_ENABLE = 1
 PT_HPU_EAGER_COLLECTIVE_PIPELINE_ENABLE = 1
 PT_HPU_ENABLE_LAZY_COLLECTIVES = 0
---------------------------: System Configuration :---------------------------
Num CPU Cores : 224
CPU RAM       : 1056253844 KB
------------------------------------------------------------------------------
Train Epoch: 1 [0/60000.0 (0%)] Loss: 2.265625
Train Epoch: 1 [640/60000.0 (1%)]       Loss: 1.445312
***
Train Epoch: 1 [58880/60000.0 (98%)]    Loss: 0.035156
Train Epoch: 1 [59520/60000.0 (99%)]    Loss: 0.001045

Total test set: 10000, number of workers: 1
* Average Acc 92.850 Average loss 0.053
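
If you want to track the loss programmatically, the progress lines in this output can be parsed with a regular expression. The sketch below assumes the log format is exactly as shown above. The names `TRAIN_LINE` and `parse_train_log` are illustrative and not part of the example script.

```python
import re

# Matches lines such as "Train Epoch: 1 [640/60000.0 (1%)]       Loss: 1.445312".
TRAIN_LINE = re.compile(
    r"Train Epoch:\s*(\d+)\s*\[(\d+)/([\d.]+)\s*\((\d+)%\)\]\s*Loss:\s*([\d.]+)"
)

def parse_train_log(text):
    """Return (epoch, samples_seen, loss) tuples for each training progress line."""
    records = []
    for match in TRAIN_LINE.finditer(text):
        epoch, seen, _total, _percent, loss = match.groups()
        records.append((int(epoch), int(seen), float(loss)))
    return records

if __name__ == "__main__":
    sample = """\
Train Epoch: 1 [0/60000.0 (0%)] Loss: 2.265625
Train Epoch: 1 [640/60000.0 (1%)]       Loss: 1.445312
Train Epoch: 1 [59520/60000.0 (99%)]    Loss: 0.001045
"""
    for epoch, seen, loss in parse_train_log(sample):
        print(f"epoch={epoch} samples={seen} loss={loss:.6f}")
```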

Distributed Training on 8 Gaudis

Run training on the same model using all eight HPUs. Copy this run command into your terminal window:

mpirun -n 8 --bind-to core --map-by socket:PE=6 \
      --rank-by core --report-bindings \
      --allow-run-as-root \
      -x PT_HPU_LAZY_MODE=0 $PYTHON mnist.py \
      --batch-size=64 --epochs=1 \
      --lr=1.0 --gamma=0.7 \
      --hpu --autocast --use-torch-compile

The following shows part of the expected output:

| distributed init (rank 0): env://
| distributed init (rank 3): env://
| distributed init (rank 5): env://
| distributed init (rank 6): env://
| distributed init (rank 4): env://
| distributed init (rank 7): env://
| distributed init (rank 1): env://
| distributed init (rank 2): env://
============================= HABANA PT BRIDGE CONFIGURATION ===========================
PT_HPU_LAZY_MODE = 0
PT_HPU_RECIPE_CACHE_CONFIG = ,false,1024
PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
PT_HPU_LAZY_ACC_PAR_MODE = 1
PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
PT_HPU_EAGER_PIPELINE_ENABLE = 1
PT_HPU_EAGER_COLLECTIVE_PIPELINE_ENABLE = 1
PT_HPU_ENABLE_LAZY_COLLECTIVES = 0
---------------------------: System Configuration :---------------------------
Num CPU Cores : 160
CPU RAM       : 1056387128 KB
------------------------------------------------------------------------------
Train Epoch: 1 [0/7500.0 (0%)]  Loss: 2.296875
Train Epoch: 1 [640/7500.0 (9%)]        Loss: 0.972656
Train Epoch: 1 [1280/7500.0 (17%)]      Loss: 0.539062
***
Train Epoch: 1 [5760/7500.0 (77%)]      Loss: 0.057861
Train Epoch: 1 [6400/7500.0 (85%)]      Loss: 0.069824
Train Epoch: 1 [7040/7500.0 (94%)]      Loss: 0.044434

Total test set: 10000, number of workers: 8
* Average Acc 97.882 Average loss 0.063
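
The denominator in the progress lines drops from 60000 to 7500 because the training set is divided across the eight workers. The arithmetic can be sketched as follows. The even split is inferred from the output above rather than from the script itself, the batch count assumes the last partial batch is kept, and `per_rank_shard` is an illustrative name.

```python
import math

def per_rank_shard(dataset_size, world_size, batch_size):
    """Return (samples per rank, batches per rank per epoch) for an even split."""
    samples = dataset_size / world_size
    batches = math.ceil(samples / batch_size)
    return samples, batches

if __name__ == "__main__":
    samples, batches = per_rank_shard(dataset_size=60000, world_size=8, batch_size=64)
    print(samples, batches)  # 7500.0 118
```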

Fine-tuning with Hugging Face Optimum for Intel Gaudi Library

The Optimum for Intel Gaudi library is the interface between the Hugging Face Transformers and Diffusers libraries and the Gaudi card. It provides a set of tools for model loading, training, and inference in single-card and multi-card settings across different downstream tasks. The following example uses the text-classification task to fine-tune a BERT-Large model on the MRPC (Microsoft Research Paraphrase Corpus) dataset and then run inference.

Follow the steps below to install the stable release of the Optimum for Intel Gaudi library and its examples:

Clone the Optimum for Intel Gaudi project to access the examples that are optimized for Gaudi:

cd ~
git clone https://github.com/huggingface/optimum-habana.git
cd optimum-habana
git checkout v1.19.0

Install the Optimum for Intel Gaudi library. This installs the v1.19.0 stable release, matching the tag checked out above:
```
pip install optimum-habana==v1.19.0
```
In order to use the DeepSpeed library on Gaudi, install the Intel Gaudi DeepSpeed fork:
```
pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.22.1
```
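
After installing, you may want to confirm which versions ended up in the environment. The following sketch uses only the standard library. The distribution names `optimum-habana` and `deepspeed` are assumed to match what `pip` registers, and `installed_version` is an illustrative helper.

```python
from importlib import metadata

def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

if __name__ == "__main__":
    # 1.19.0 is the optimum-habana release pinned in the steps above; the
    # DeepSpeed fork's version string is only printed, not checked.
    for package in ("optimum-habana", "deepspeed"):
        print(f"{package}: {installed_version(package) or 'not installed'}")
```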

The following example is based on the Optimum for Intel Gaudi Text Classification task example. Change to the text-classification directory and install the additional software requirements for this example:

cd ~
cd optimum-habana/examples/text-classification/
pip install -r requirements.txt

Now your system is ready to execute fine-tuning on a BERT-Large model.

Single-Card Training

In the ~/optimum-habana/examples/text-classification/ folder, copy and paste the following commands to your terminal window to fine-tune the BERT-Large model on one Gaudi card:

python run_glue.py \
--model_name_or_path bert-large-uncased-whole-word-masking \
--gaudi_config_name Habana/bert-large-uncased-whole-word-masking  \
--task_name mrpc   \
--do_train   \
--do_eval   \
--per_device_train_batch_size 32 \
--learning_rate 3e-5  \
--num_train_epochs 3   \
--max_seq_length 128   \
--output_dir ./output/mrpc/  \
--use_habana  \
--use_lazy_mode   \
--bf16   \
--use_hpu_graphs_for_inference \
--throughput_warmup_steps 3

The results will show both training and evaluation:

{'train_runtime': 54.8875, 'train_samples_per_second': 266.059, 'train_steps_per_second': 8.342, 'train_loss': 0.3403122169384058, 'epoch': 3.0, 'memory_allocated (GB)': 7.47, 'max_memory_allocated (GB)': 9.97, 'total_memory_available (GB)': 94.61}
100%|██████████████ 345/345 [00:54<00:00,  6.29it/s]

***** train metrics *****
epoch                       =        3.0
max_memory_allocated (GB)   =       9.97
memory_allocated (GB)       =       7.47
total_memory_available (GB) =      94.61
train_loss                  =     0.3403
train_runtime               = 0:00:54.88
train_samples               =       3668
train_samples_per_second    =    266.059
train_steps_per_second      =      8.342

   ***** eval metrics *****
epoch                       =        3.0
eval_accuracy               =     0.8775
eval_combined_score         =     0.8959
eval_f1                     =     0.9144
eval_loss                   =     0.4336
eval_runtime                = 0:00:01.73
eval_samples                =        408
eval_samples_per_second     =    234.571
eval_steps_per_second       =     29.321
max_memory_allocated (GB)   =       9.97
memory_allocated (GB)       =       7.47
total_memory_available (GB) =      94.61
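
The 345 steps in the progress bar follow from the dataset size, the batch size, and the number of epochs. The sketch below reproduces that count, assuming no gradient accumulation and that the last partial batch of each epoch is kept. `total_optimizer_steps` is an illustrative name.

```python
import math

def total_optimizer_steps(train_samples, per_device_batch_size, num_devices, epochs):
    """Steps per epoch (rounded up for the last partial batch) times epochs."""
    global_batch = per_device_batch_size * num_devices
    return math.ceil(train_samples / global_batch) * epochs

if __name__ == "__main__":
    steps = total_optimizer_steps(train_samples=3668, per_device_batch_size=32,
                                  num_devices=1, epochs=3)
    print(steps)  # 345
```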

Multi-Card Training

In this example, you will run the same fine-tuning task on eight Gaudi cards. Copy and paste the following into the terminal window:

python ../gaudi_spawn.py  --world_size 8 --use_mpi run_glue.py  \
--model_name_or_path bert-large-uncased-whole-word-masking  \
--gaudi_config_name Habana/bert-large-uncased-whole-word-masking  \
--task_name mrpc  \
--do_train  \
--do_eval  \
--per_device_train_batch_size 32  \
--per_device_eval_batch_size 8  \
--learning_rate 3e-5  \
--num_train_epochs 3   \
--max_seq_length 128  \
--output_dir ./output/mrpc/  \
--use_habana   \
--use_lazy_mode   \
--bf16    \
--use_hpu_graphs_for_inference  \
--throughput_warmup_steps 3

The training throughput (train_samples_per_second) is significantly higher when using all eight Gaudi cards:

{'train_runtime': 41.8426, 'train_samples_per_second': 1663.393, 'train_steps_per_second': 6.825, 'train_loss': 0.5247347513834636, 'epoch': 3.0, 'memory_allocated (GB)': 8.6, 'max_memory_allocated (GB)': 34.84, 'total_memory_available (GB)': 94.61}
100%|██████████| 45/45 [00:41<00:00,  1.07it/s]
***** train metrics *****
epoch                       =        3.0
max_memory_allocated (GB)   =      34.84
memory_allocated (GB)       =        8.6
total_memory_available (GB) =      94.61
train_loss                  =     0.5247
train_runtime               = 0:00:41.84
train_samples               =       3668
train_samples_per_second    =   1663.393
train_steps_per_second      =      6.825

***** eval metrics *****
epoch                       =        3.0
eval_accuracy               =     0.7623
eval_combined_score         =     0.7999
eval_f1                     =     0.8375
eval_loss                   =     0.4668
eval_runtime                = 0:00:02.06
eval_samples                =        408
eval_samples_per_second     =    198.062
eval_steps_per_second       =      3.398
max_memory_allocated (GB)   =      34.84
memory_allocated (GB)       =        8.6
total_memory_available (GB) =      94.61
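
To put the two runs side by side, you can compute the speedup and scaling efficiency from the reported train_samples_per_second values (266.059 on one card, 1663.393 on eight). This is a sketch of the arithmetic only. A short run on a small dataset includes fixed startup costs, so treat the result as indicative. `scaling` is an illustrative name.

```python
def scaling(single_throughput, multi_throughput, num_devices):
    """Return (speedup, efficiency) of a multi-card run relative to one card."""
    speedup = multi_throughput / single_throughput
    return speedup, speedup / num_devices

if __name__ == "__main__":
    # train_samples_per_second values taken from the two runs shown above.
    speedup, efficiency = scaling(266.059, 1663.393, num_devices=8)
    print(f"speedup: {speedup:.2f}x, efficiency: {efficiency:.0%}")
```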

Training with DeepSpeed

With the DeepSpeed package already installed, you can run multi-card training with DeepSpeed. Create a ds_config.json file that sets the parameters of the DeepSpeed run and pass it with the --deepspeed option. See the Hugging Face GitHub page and copy the configuration file example. Once the ds_config.json file is created, copy and paste these instructions into your terminal:

python ../gaudi_spawn.py \
--world_size 8 --use_deepspeed run_glue.py \
--model_name_or_path bert-large-uncased-whole-word-masking \
--gaudi_config_name Habana/bert-large-uncased-whole-word-masking \
--task_name mrpc \
--do_train \
--do_eval \
--per_device_train_batch_size 32 \
--per_device_eval_batch_size 8 \
--learning_rate 3e-5 \
--num_train_epochs 3 \
--max_seq_length 128 \
--output_dir /tmp/mrpc_output_deepspeed/ \
--use_habana \
--use_lazy_mode \
--use_hpu_graphs_for_inference \
--throughput_warmup_steps 3 \
--deepspeed ds_config.json
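
The ds_config.json file passed to --deepspeed above can also be generated by a script. The following is a sketch only: the keys are standard DeepSpeed options, and the "auto" values rely on the Hugging Face integration filling them in from the command-line arguments. The specific values are assumptions for illustration, and the example file on the Hugging Face GitHub page remains the reference. The script name `make_ds_config.py` in the usage line is hypothetical.

```python
import json

# Illustrative values only; copy the reference example for real runs.
DS_CONFIG = {
    "steps_per_print": 64,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "bf16": {"enabled": True},
    "gradient_clipping": 1.0,
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": False,
        "reduce_scatter": False,
        "contiguous_gradients": False,
    },
}

def render_ds_config(config=DS_CONFIG):
    """Serialize the configuration dictionary as indented JSON text."""
    return json.dumps(config, indent=4)

if __name__ == "__main__":
    # Usage: python make_ds_config.py > ds_config.json
    print(render_ds_config())
```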

Note

To learn more about DeepSpeed, refer to the DeepSpeed User Guide for Training.
The --output_dir option specifies the directory where the model predictions and checkpoints are saved. If the specified output directory already exists, provide a different directory. In the example above, the output directory is set to /tmp/mrpc_output_deepspeed/ to avoid a conflict with the existing directory ./output/mrpc/ used in the previous examples. Without this change, the example might fail. Alternatively, use the --overwrite_output_dir option to overwrite the contents of the existing output directory.
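
When scripting repeated runs, one way to avoid the directory conflict described above is to pick an unused path before launching. The helper below is a minimal sketch. `fresh_output_dir` and the `_N` suffix scheme are assumptions for illustration, and it treats an existing but empty directory as safe to reuse.

```python
import tempfile
from pathlib import Path

def fresh_output_dir(base):
    """Return `base` if it is absent or an empty directory, else the first free `base_N`."""
    base = Path(base)
    if not base.exists():
        return base
    if base.is_dir() and not any(base.iterdir()):
        return base
    n = 1
    while base.with_name(f"{base.name}_{n}").exists():
        n += 1
    return base.with_name(f"{base.name}_{n}")

if __name__ == "__main__":
    # Demonstrate with a throwaway directory instead of touching ./output/mrpc/.
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp) / "mrpc"
        out.mkdir()
        (out / "checkpoint.bin").write_text("placeholder")
        print(fresh_output_dir(out).name)  # mrpc_1
```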

Inference Example Run

Running inference computes the same evaluation metrics (accuracy and F1 score) shown above, which indicate how well the model performs:

python run_glue.py --model_name_or_path bert-large-uncased-whole-word-masking \
--gaudi_config_name Habana/bert-large-uncased-whole-word-masking \
--task_name mrpc \
--do_eval \
--max_seq_length 128 \
--output_dir ./output/mrpc/  \
--use_habana \
--use_lazy_mode \
--use_hpu_graphs_for_inference
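
For reference, the accuracy and F1 metrics reported for MRPC, a binary task, can be computed from labels and predictions as sketched below. The labels here are toy values, not MRPC data, and `accuracy_and_f1` is an illustrative name. The combined score printed at the end is the mean of the two metrics, which is consistent with the eval_combined_score values in the outputs above.

```python
def accuracy_and_f1(labels, predictions, positive=1):
    """Compute accuracy and binary F1 for two equal-length label sequences."""
    pairs = list(zip(labels, predictions))
    correct = sum(1 for y, p in pairs if y == p)
    tp = sum(1 for y, p in pairs if y == positive and p == positive)
    fp = sum(1 for y, p in pairs if y != positive and p == positive)
    fn = sum(1 for y, p in pairs if y == positive and p != positive)
    accuracy = correct / len(pairs)
    denominator = 2 * tp + fp + fn
    f1 = 2 * tp / denominator if denominator else 0.0
    return accuracy, f1

if __name__ == "__main__":
    # Toy data for illustration only.
    labels = [1, 1, 1, 0, 0, 1, 0, 1]
    predictions = [1, 1, 0, 0, 1, 1, 0, 1]
    acc, f1 = accuracy_and_f1(labels, predictions)
    print(f"accuracy={acc:.4f} f1={f1:.4f} combined={(acc + f1) / 2:.4f}")
```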

You have now run training and inference on Gaudi.

Next Steps

For next steps you can refer to the following:

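To explore more models, see the Model References repository (https://github.com/HabanaAI/Model-References).
To run more examples using Hugging Face, see the examples in the optimum-habana repository (https://github.com/huggingface/optimum-habana).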
To migrate other models to Gaudi, refer to PyTorch Model Porting.
