Using Gaudi Trained Checkpoints on Xeon

It is common to train deep learning models on GPUs and Intel® Xeon® Scalable processors, and to run pre-trained or fine-tuned models for inference on either GPUs or CPUs. Intel AI Engines provide a performance boost for the entire AI pipeline with Intel AMX. This document provides guidance on using checkpoints trained on the Intel® Gaudi® AI accelerator on Xeon with PyTorch. The use of OpenVINO for inference with HPU-trained models is also covered. A previous article covered writing training scripts that can run on Gaudi, GPU, or CPU.

This section describes the use of HPU trained models on Xeon using the following examples:

  • Simple example of MNIST

  • BERT Large from the Intel Gaudi Model References GitHub repository

  • Hugging Face sourced Image Classification

  • OpenVINO example

Using Gaudi Trained Checkpoints on CPU

Gaudi trained checkpoints contain information that allows for HPU device optimizations. In general, these checkpoints will not load on CPU systems that do not have the Intel Gaudi software stack installed, and loading them generates this error:

ModuleNotFoundError: No module named 'habana_frameworks'

To load HPU trained checkpoints on CPU, first convert them on a system which has the Intel Gaudi software stack installed: reload the checkpoint with PyTorch with the device set to CPU, then save it again, as follows:

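import torch
device = torch.device('cpu')
# Remap all tensors in the HPU checkpoint to the CPU while loading
state_dict = torch.load("path_to_hpu_checkpoint", map_location=device)
# Save the CPU-mapped checkpoint to a new file
torch.save(state_dict, "path_to_cpu_checkpoint")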

The saved checkpoint will then load on Xeon and other CPU systems which do not have HPUs or the Intel Gaudi software stack installed.
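As a quick sanity check after conversion, the tensors in the saved checkpoint can be inspected to confirm that none of them still refer to a non-CPU device. The following is a minimal sketch, not part of the Intel Gaudi tooling: the helper name tensor_devices and the stand-in checkpoint dictionary are hypothetical. It only inspects tensor devices, so it does not prove that the file is free of other HPU-specific pickled objects.

```python
import torch


def tensor_devices(obj):
    """Recursively collect the device types of all tensors in a checkpoint object."""
    if isinstance(obj, torch.Tensor):
        return {obj.device.type}
    if isinstance(obj, dict):
        values = obj.values()
    elif isinstance(obj, (list, tuple)):
        values = obj
    else:
        return set()  # ints, floats, strings, None, ...
    found = set()
    for value in values:
        found |= tensor_devices(value)
    return found


# Hypothetical stand-in for a converted checkpoint. In practice this would be
# torch.load("path_to_cpu_checkpoint", map_location="cpu").
checkpoint = {"model": {"fc.weight": torch.zeros(2, 3)}, "epoch": 1}
print(tensor_devices(checkpoint))  # prints {'cpu'}
```

A result containing anything other than 'cpu' suggests the checkpoint was saved without remapping its tensors.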

MNIST Example Using HPU Checkpoints

The MNIST example is available here. This example demonstrates how to save the model on Gaudi (HPU) and then load the model on Xeon (CPU):

  1. Save a checkpoint of the model running on one Gaudi 2 HPU. This generates a file:

    $ PT_HPU_LAZY_MODE=1 python --hpu --save-model --epochs 1
  2. Load the model:

    checkpoint = torch.load(args.checkpoint, map_location=torch.device("cpu"))

There are two cases for loading and running this model on Xeon:

  • Load the saved model file on a Xeon CPU system which has the Intel Gaudi software stack installed. This will load and run without errors as the Intel Gaudi software stack is able to interpret the HPU device specifiers embedded in the model.

  • Load the saved model file on a Xeon CPU system which does not have the Intel Gaudi software stack installed, using a CPU-only version of the script. Some key changes applied to the original script to generate the CPU-only version are as follows:

    • Remove lines that import habana_frameworks and call mark_step.

    • Remove lines that refer to hpu and to lazy execution.

  1. Save the model:

    torch.save({"model": model.state_dict(), 'optimizer': optimizer.state_dict(),
                'epoch': args.epochs}, "")
  2. Run the model:

    $ python --checkpoint --epochs 1
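The save and load steps above can be sketched end to end on a CPU-only system. This is an illustrative example under stated assumptions, not the actual MNIST script: the Net class, the helper functions, and the temporary file name are hypothetical. The checkpoint dictionary uses the same "model", "optimizer", and "epoch" keys as the save step above.

```python
import os
import tempfile

import torch
import torch.nn as nn


class Net(nn.Module):
    """Hypothetical small classifier standing in for the MNIST model."""

    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(28 * 28, 10)

    def forward(self, x):
        return self.fc(torch.flatten(x, 1))


def save_checkpoint(model, optimizer, epochs, path):
    # Same dictionary layout as the save step above.
    torch.save({"model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "epoch": epochs}, path)


def load_for_cpu_inference(path):
    # map_location remaps all tensor storages to the CPU at load time.
    checkpoint = torch.load(path, map_location=torch.device("cpu"))
    model = Net()
    model.load_state_dict(checkpoint["model"])
    model.eval()  # switch layers such as dropout to inference behavior
    return model, checkpoint["epoch"]


model = Net()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
path = os.path.join(tempfile.mkdtemp(), "mnist_checkpoint.pt")
save_checkpoint(model, optimizer, 1, path)

restored, epoch = load_for_cpu_inference(path)
with torch.no_grad():
    logits = restored(torch.rand(4, 1, 28, 28))
print(logits.shape, epoch)
```

Only the "model" entry is needed for inference. The optimizer state and epoch counter matter when training is resumed.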

BERT Large Model from Model References

The BERT Large pre-trained model is generated using the instructions described in BERT for PyTorch. The checkpoint file name depends on the number of steps and the --max_steps setting used in phase 1 and phase 2 of the training.

Running Inference on Xeon CPU

To run inference with the generated model, remove the --use_habana option from the command line. The other options of the call that runs inference on the SQuAD dataset can be left as is. For example, on a system that has the Intel Gaudi software stack installed, set the variables path_to_checkpoint, path_to_vocab, and path_to_eval_script, and run:

python \
    --bert_model=bert-large-uncased \
    --autocast  \
    --config_file=./bert_config.json \
    --do_lower_case \
    --output_dir=/output/checkpoints/bert/inference \
    --json-summary=/tmp/log_directory/dllogger.json  \
    --predict_batch_size=24  \
    --init_checkpoint=$path_to_checkpoint \
    --vocab_file=$path_to_vocab  \
    --do_predict  \
    --predict_file=/output/bert/v1.1/dev-v1.1.json \
    --do_eval \
    --eval_script=$path_to_eval_script

To run this model on a CPU system that does not have the Intel Gaudi software stack installed, first follow the guidelines in the Using Gaudi Trained Checkpoints on CPU section to save the checkpoint without the HPU device specifiers, i.e., with torch.device("cpu").

Hugging Face Sourced Image Classification Model

To run training on a Gaudi system configured with the Intel Gaudi software stack, use the following command:

python \
    --model_name_or_path google/vit-base-patch16-224-in21k \
    --dataset_name cifar10 \
    --output_dir /tmp/outputs/ \
    --remove_unused_columns False \
    --do_train \
    --do_eval \
    --learning_rate 3e-5 \
    --num_train_epochs 5 \
    --per_device_train_batch_size 64 \
    --per_device_eval_batch_size 64 \
    --evaluation_strategy epoch \
    --save_strategy epoch \
    --load_best_model_at_end True \
    --save_total_limit 3 \
    --seed 1337 \
    --use_habana \
    --use_lazy_mode \
    --use_hpu_graphs_for_inference \
    --gaudi_config_name Habana/vit \
    --throughput_warmup_steps 3 \
    --dataloader_num_workers 1

This creates the checkpoint file under /tmp/outputs/.

To run inference on Xeon, use the following command:

python3  \
    --model_name_or_path /tmp/outputs/   \
    --dataset_name cifar10  \
    --output_dir /tmp/outputs/  \
    --remove_unused_columns False  \
    --do_eval \
    --per_device_eval_batch_size 64 \
    --dataloader_num_workers 1

Use with OpenVINO

To run inference on a Gaudi PyTorch model using OpenVINO runtime, refer to the OpenVINO tutorial document - Convert a PyTorch Model to OpenVINO Intermediate Representation (IR).

  1. Load the Gaudi PyTorch model:

    1. Create an instance of a model class.

    2. Load checkpoint state dictionary, which contains pre-trained model weights.

    3. Set the model to evaluation mode, which switches some operations to inference behavior.

  2. Convert the PyTorch model to OpenVINO intermediate representation (IR):

    1. Call openvino.convert_model to convert the PyTorch model object to an openvino.Model instance.

    2. Call openvino.Core.compile_model to load the model on a device.

    3. Call openvino.save_model to save the model to disk for later use.