Model Optimization Checklist

This page is a checklist for optimizing your models on the Intel® Gaudi® AI accelerator. Following the steps below ensures that you cover the main areas of model optimization.

Optimizing a model for Gaudi is divided into three steps:

  1. Initial Model Porting - Ensures the model is functional on Gaudi.

  2. Model Optimizations - General performance enhancements that apply to most models.

  3. Profiling - Identifies bottlenecks on the host CPU or on Gaudi; this is the final step in optimizing a model.

Initial Model Porting

The following are the first steps for getting the model functional and making an initial assessment of performance.

| Task | Description | Details |
|---|---|---|
| Running GPU Model Migration | The model is functional and running on Gaudi; manual migration steps may also be used. | GPU Migration Toolkit |
| Placement of mark_step() | Add after the backward pass and after the optimizer step; this reduces memory consumption. | Usage of mark_step |
| Perform CPU Fallback analysis | Ensures that all model ops run on Gaudi and not on the host CPU. | Placement of Ops on HPU |
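
The mark_step() placement in the table above can be sketched as a single training iteration. This is a minimal sketch, not the reference implementation: `train_step` and its parameters are hypothetical names, and the no-op fallback exists only so the snippet runs on machines without the Gaudi software stack. See "Usage of mark_step" for the authoritative guidance.

```python
try:
    # Gaudi PyTorch bridge; importable only where the Gaudi software stack is installed.
    import habana_frameworks.torch.core as htcore
    mark_step = htcore.mark_step
except ImportError:
    def mark_step():
        """No-op stand-in so this sketch also runs without Gaudi."""


def train_step(model, optimizer, loss_fn, inputs, targets):
    """Run one training iteration (hypothetical helper for illustration).

    mark_step() is called at the two positions named in the checklist:
    once after the backward pass and once after the optimizer step.
    """
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    mark_step()  # after the backward pass
    optimizer.step()
    mark_step()  # after the optimizer step
    return loss
```

Call `train_step` once per batch inside your own training loop; the model and inputs are assumed to already be on the Gaudi device.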

Model Optimizations

Apply the following optimizations to your model as a baseline for training or inference.

| Task | Description | Details |
|---|---|---|
| Set Global Batch Size | Experiment to find the largest batch size that does not run out of memory. | Batch Size |
| Use HPU Graphs | HPU Graphs capture operations using an HPU stream and replay them. | Using HPU Graphs for Training |
| Use Autocast | Set BF16 or FP8 for better performance. | Mixed Precision Training with PyTorch Autocast |
| Set Static Shapes and Static Ops | Remove dynamic shapes to eliminate re-compilations. | Handling Dynamic Shapes |
| Set Fused Optimizers | Use the custom Gaudi versions of the optimizer step. | Usage of Fused Operators |
| Use Fused Scaled Dot Product Attention | Use FusedSDPA for Transformer-based models. | Using Fused Scaled Dot Product Attention (FusedSDPA) |
| Correct optimization for DeepSpeed (optional) | Select the best ZeRO option for performance and memory usage. | DeepSpeed Validated Configurations |
| General optimizations | Gradient bucket size and view, pinning data for the Dataloader, and more. | Additional General Optimizations |
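
The batch-size experiment in the first row can be automated with a simple search. The sketch below is one possible approach, not a Gaudi API: `find_max_batch_size` and `fits` are hypothetical names, where `fits(batch_size)` is a callable you supply that runs one training step at that size and returns False if it hits an out-of-memory error. It assumes that if a batch size fits, every smaller size fits too.

```python
def find_max_batch_size(fits, start=1, limit=65536):
    """Return the largest batch size in [start, limit] for which fits() is True.

    Returns 0 if even `start` does not fit. `fits` is a user-supplied,
    hypothetical probe; `limit` is assumed to be >= `start`.
    """
    if not fits(start):
        return 0

    low = start        # largest size known to fit
    high = start * 2   # next candidate to probe

    # Phase 1: double the batch size until a probe fails or the limit is passed.
    while high <= limit and fits(high):
        low = high
        high *= 2

    # `high` is now the smallest size known (or assumed) not to be usable.
    high = min(high, limit + 1)

    # Phase 2: binary search between the last success and the first failure.
    while high - low > 1:
        mid = (low + high) // 2
        if fits(mid):
            low = mid
        else:
            high = mid
    return low
```

In practice the largest size that fits is not always the fastest, so it is worth measuring throughput at the result and at a few smaller sizes.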

Profiling Analysis

Use the following profiling steps to find performance bottlenecks on Gaudi or the host CPU.

| Task | Description | Details |
|---|---|---|
| PyTorch Profiling with TensorBoard | Obtains Gaudi-specific performance recommendations using TensorBoard. | Profiling with PyTorch |
| Review the PT_HPU_METRICS_FILE | Look for excessive re-compilations during runtime. | Runtime Environment Variables |
| Profiling Trace Viewer | Use the Trace Viewer or Perfetto to view traces. | Getting Started with Intel Gaudi Profiler |
| Model Logging | Set ENABLE_CONSOLE to enable logging for debugging and analysis. | Runtime Environment Variables |
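
The two environment variables in the table above have to be set before the training process starts. A minimal sketch of doing so from a Python launcher follows. `gaudi_debug_env` is a hypothetical helper, and the value "true" for ENABLE_CONSOLE is an assumption to confirm against the Runtime Environment Variables page for your release.

```python
import os


def gaudi_debug_env(metrics_path, base_env=None):
    """Return a copy of the environment with the two checklist variables set.

    `gaudi_debug_env` is a hypothetical helper for illustration.
    """
    env = dict(os.environ if base_env is None else base_env)
    # File in which the PyTorch bridge records metrics such as re-compilations.
    env["PT_HPU_METRICS_FILE"] = metrics_path
    # Assumed value: "true" sends log output to the console.
    env["ENABLE_CONSOLE"] = "true"
    return env


# Example launch (train.py is a placeholder for your own script):
#   import subprocess, sys
#   subprocess.run([sys.executable, "train.py"],
#                  env=gaudi_debug_env("metrics.json"), check=True)
```

After the run, inspect the file at `metrics_path` for signs of repeated graph compilation, which usually points to dynamic shapes.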