Using Gaudi Trained Checkpoints on Xeon

It is common to train deep learning models on GPUs and Intel® Xeon® Scalable processors, as well as to run pre-trained or fine-tuned models for inference on either GPUs or CPUs. Intel AI Engines provide a performance boost for the entire AI pipeline with Intel AMX. This document provides guidance on using checkpoints trained on the Intel® Gaudi® AI accelerator on Xeon with PyTorch, using the following examples:

  • Simple example of MNIST

  • BERT Large from the Intel Gaudi Model References GitHub repository

  • Hugging Face sourced Image Classification

  • OpenVINO example


A previous article addresses the writing of training scripts capable of running on Gaudi, GPU, or CPU.

Using Gaudi Trained Checkpoints on CPU

HPU-trained checkpoints contain information enabling optimizations specific to the HPU device. In general, these checkpoints will not load on CPU systems on which the Intel Gaudi software stack is not installed and will generate the following error:

ModuleNotFoundError: No module named 'habana_frameworks'
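
A script that has to run on both kinds of systems can check for the package before trying to load an HPU-trained checkpoint. The following is a minimal sketch that uses only the Python standard library; the helper names `module_available` and `gaudi_stack_available` are hypothetical and are not part of the Intel Gaudi software.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if the top-level module `name` can be found on this system."""
    return importlib.util.find_spec(name) is not None


def gaudi_stack_available() -> bool:
    """Hypothetical helper: True when the habana_frameworks package is importable."""
    return module_available("habana_frameworks")


if gaudi_stack_available():
    print("habana_frameworks found: HPU-trained checkpoints can be loaded here.")
else:
    print("habana_frameworks not found: use a checkpoint that was saved for CPU.")
```

The check only looks the package up and does not import it, so it is safe to run on a system without any Intel Gaudi components.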

To load HPU-trained checkpoints on a CPU, you can reload the checkpoint using PyTorch and then save it with the device set to CPU. This process can be executed on a system with the Intel Gaudi software stack installed:

import torch

# Remap all tensors to the CPU while loading, then save the remapped checkpoint.
device = torch.device('cpu')
torch.save(torch.load("path_to_hpu_checkpoint", map_location=device), "path_to_cpu_checkpoint")

The saved checkpoint will then load on Xeon and other CPU systems which do not have HPUs or the Intel Gaudi software stack installed.
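
The conversion can also be wrapped in a small helper whose result is easy to check. This is a sketch under stated assumptions: the function name `convert_checkpoint_to_cpu` and the file names are hypothetical, and the demo uses a tiny stand-in checkpoint created on the CPU, because loading a real HPU checkpoint requires a system with the Intel Gaudi software stack.

```python
import os
import tempfile

import torch


def convert_checkpoint_to_cpu(src_path: str, dst_path: str):
    """Load a checkpoint with all tensors remapped to CPU, then save it again."""
    # map_location moves every stored tensor to the CPU while loading.
    checkpoint = torch.load(src_path, map_location=torch.device("cpu"))
    torch.save(checkpoint, dst_path)
    return checkpoint


# Demo with a stand-in checkpoint (hypothetical file names).
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "stand_in_checkpoint.pt")
dst = os.path.join(workdir, "cpu_checkpoint.pt")
torch.save({"weight": torch.ones(2, 3), "bias": torch.zeros(3)}, src)

converted = convert_checkpoint_to_cpu(src, dst)
print({name: str(tensor.device) for name, tensor in converted.items()})
```

On a system with the Intel Gaudi software stack, `src` would point to the HPU-trained checkpoint instead of the stand-in file.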

MNIST Example Using HPU Checkpoints

The MNIST example is available here. This example demonstrates how to save the model on Gaudi (HPU) and then load the model on Xeon (CPU):

  1. Save a checkpoint of the model running on one Gaudi 2 HPU. This generates a file:

    $ PT_HPU_LAZY_MODE=1 python --hpu --save-model --epochs 1
  2. Load the model:

    checkpoint = torch.load(args.checkpoint, map_location=torch.device("cpu"))

There are two cases of loading this model and running on Xeon:

  • Load the saved model file on a Xeon CPU system which has the Intel Gaudi software stack installed. This will load and run without errors as the Intel Gaudi software stack is able to interpret the HPU device specifiers embedded in the model.

  • Load this saved model file on a Xeon CPU system which does not have the Intel Gaudi software stack installed, using a CPU-only version of the script. The key changes applied to the original script to generate the CPU-only version are as follows:

    • Remove lines that import habana_frameworks and call mark_step.

    • Remove lines that refer to hpu and to lazy execution.

  1. Save the model:

    torch.save({"model": model.state_dict(), "optimizer": optimizer.state_dict(), "epoch": args.epochs}, "")
  2. Run the model:

    $ python --checkpoint --epochs 1
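
The checkpoint saved in this example is a dictionary with the keys "model", "optimizer", and "epoch". A minimal sketch of restoring such a checkpoint for inference on a CPU-only system follows. `TinyNet` is a hypothetical stand-in for the MNIST network, and the function and file names are placeholders rather than names from the example script.

```python
import os
import tempfile

import torch
import torch.nn as nn


class TinyNet(nn.Module):
    """Hypothetical stand-in for the MNIST model (28x28 input, 10 classes)."""

    def __init__(self) -> None:
        super().__init__()
        self.fc = nn.Linear(28 * 28, 10)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc(torch.flatten(x, 1))


def load_for_cpu_inference(path: str):
    """Rebuild the model on CPU from a {'model', 'optimizer', 'epoch'} checkpoint."""
    checkpoint = torch.load(path, map_location=torch.device("cpu"))
    model = TinyNet()
    model.load_state_dict(checkpoint["model"])
    model.eval()  # put layers such as dropout into inference behavior
    return model, checkpoint["epoch"]


# Demo: write a checkpoint with the same layout, then restore it.
trained = TinyNet()
optimizer = torch.optim.SGD(trained.parameters(), lr=0.01)
ckpt_path = os.path.join(tempfile.mkdtemp(), "stand_in_mnist.pt")
torch.save(
    {"model": trained.state_dict(), "optimizer": optimizer.state_dict(), "epoch": 1},
    ckpt_path,
)

restored, epoch = load_for_cpu_inference(ckpt_path)
with torch.no_grad():
    logits = restored(torch.zeros(4, 1, 28, 28))
print(tuple(logits.shape), epoch)
```

The optimizer state is only needed to resume training, so the inference path reads just the "model" and "epoch" entries.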

BERT Large Model from Model References

The BERT Large pre-trained model is generated using the instructions described in BERT for PyTorch. The checkpoint file name depends on the number of steps and the --max_steps setting used in phase 1 and phase 2 of the training.

Running Inference on Xeon CPU

To run inference with the generated model, remove the --use_habana option from the command line. The other options of the SQuAD inference command can be left unchanged. For example, on a system that has the Intel Gaudi software stack installed, set the variables path_to_checkpoint, path_to_vocab, and path_to_eval_script and run:

python \
    --bert_model=bert-large-uncased \
    --autocast  \
    --config_file=./bert_config.json \
    --do_lower_case \
    --output_dir=/output/checkpoints/bert/inference \
    --json-summary=/tmp/log_directory/dllogger.json  \
    --predict_batch_size=24  \
    --init_checkpoint=$path_to_checkpoint \
    --vocab_file=$path_to_vocab  \
    --do_predict  \
    --predict_file=/output/bert/v1.1/dev-v1.1.json \
    --do_eval

To run this model on a CPU system, first save the checkpoint without the HPU device specifiers by mapping it to torch.device("cpu"), as described in the section Using Gaudi Trained Checkpoints on CPU.
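
Before saving a checkpoint for a CPU-only system, it can help to confirm that no tensor in it still refers to another device. The sketch below walks a nested checkpoint object and reports such entries. The function name `find_non_cpu_tensors` is hypothetical, and the demo uses PyTorch's `meta` device as a stand-in for a non-CPU device, since no HPU is assumed to be present.

```python
import torch


def find_non_cpu_tensors(obj, prefix: str = "") -> list:
    """Return the key paths of all tensors in a nested checkpoint that are not on CPU."""
    found = []
    if isinstance(obj, torch.Tensor):
        if obj.device.type != "cpu":
            found.append(f"{prefix} ({obj.device})")
    elif isinstance(obj, dict):
        for key, value in obj.items():
            found.extend(find_non_cpu_tensors(value, f"{prefix}/{key}"))
    elif isinstance(obj, (list, tuple)):
        for index, value in enumerate(obj):
            found.extend(find_non_cpu_tensors(value, f"{prefix}/{index}"))
    return found


# Stand-in checkpoint with one tensor that is deliberately not on the CPU.
checkpoint = {
    "model": {"weight": torch.ones(2, 2)},
    "optimizer": {"state": [torch.empty(2, device="meta")]},
    "epoch": 1,
}
print(find_non_cpu_tensors(checkpoint))
```

An empty result means every tensor found in the dictionaries, lists, and tuples of the checkpoint is on the CPU. Tensors held inside other container types are not inspected by this sketch.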

Hugging Face Sourced Image Classification Model

To run training on a Gaudi system configured with the Intel Gaudi software stack, use the command below. This creates the checkpoint file under /tmp/outputs/.

python \
    --model_name_or_path google/vit-base-patch16-224-in21k \
    --dataset_name cifar10 \
    --output_dir /tmp/outputs/ \
    --remove_unused_columns False \
    --do_train \
    --do_eval \
    --learning_rate 3e-5 \
    --num_train_epochs 5 \
    --per_device_train_batch_size 64 \
    --per_device_eval_batch_size 64 \
    --evaluation_strategy epoch \
    --save_strategy epoch \
    --load_best_model_at_end True \
    --save_total_limit 3 \
    --seed 1337 \
    --use_habana \
    --use_lazy_mode \
    --use_hpu_graphs_for_inference \
    --gaudi_config_name Habana/vit \
    --throughput_warmup_steps 3 \
    --dataloader_num_workers 1

To run inference on Xeon, use the following command:

python3  \
    --model_name_or_path /tmp/outputs/   \
    --dataset_name cifar10  \
    --output_dir /tmp/outputs/  \
    --remove_unused_columns False  \
    --do_eval \
    --per_device_eval_batch_size 64 \
    --dataloader_num_workers 1

Use with OpenVINO

To run inference on a Gaudi-trained PyTorch model using the OpenVINO runtime, refer to the OpenVINO tutorial "Convert a PyTorch Model to OpenVINO Intermediate Representation (IR)". The main steps are:

  1. Load the Gaudi PyTorch model:

    1. Create an instance of a model class.

    2. Load the checkpoint state dictionary, which contains the pre-trained model weights.

    3. Switch the model to evaluation mode so that operations such as dropout and batch normalization use their inference behavior.

  2. Convert the PyTorch model to OpenVINO intermediate representation (IR):

    1. Call openvino.convert_model to convert the PyTorch model object to an openvino.Model instance.

    2. Call openvino.Core.compile_model to load the model on a device.

    3. Call openvino.save_model to save the model to disk for later use.
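
The two steps above can be sketched as follows. This is an illustration under assumptions, not the tutorial's code: `TinyClassifier` is a hypothetical stand-in for the Gaudi-trained model class, the helper names and the output path are placeholders, and the conversion helper needs the `openvino` package to be installed.

```python
import torch
import torch.nn as nn


class TinyClassifier(nn.Module):
    """Hypothetical stand-in for the Gaudi-trained model class."""

    def __init__(self) -> None:
        super().__init__()
        self.fc = nn.Linear(8, 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc(x)


def load_model(state_dict: dict) -> nn.Module:
    """Step 1: instantiate the model, load the weights, switch to evaluation mode."""
    model = TinyClassifier()
    model.load_state_dict(state_dict)
    model.eval()
    return model


def convert_and_save(model: nn.Module, example_input: torch.Tensor, xml_path: str):
    """Step 2: convert to OpenVINO IR, compile for the CPU device, save to disk."""
    import openvino as ov  # imported here so that step 1 runs without OpenVINO

    ov_model = ov.convert_model(model, example_input=example_input)
    compiled = ov.Core().compile_model(ov_model, "CPU")
    ov.save_model(ov_model, xml_path)
    return compiled


# Step 1 with stand-in weights. In practice the state dictionary would come from
# torch.load on the CPU checkpoint with map_location=torch.device("cpu").
model = load_model(TinyClassifier().state_dict())
with torch.no_grad():
    output = model(torch.zeros(1, 8))
print(model.training, tuple(output.shape))
```

With OpenVINO installed, a call such as `convert_and_save(model, torch.zeros(1, 8), "model.xml")` would run the second step and return the compiled model.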