Model Optimization Checklist

This page is a checklist for optimizing your models on the Intel® Gaudi® AI accelerator. The steps in this guide cover three main optimization areas:

  • Initial model porting - Ensures the model is functional on Gaudi.

  • Model optimizations - General performance enhancements that apply to most models.

  • Profiling - Allows you to identify bottlenecks on the host CPU or Gaudi.

Initial Model Porting

Use the following steps to bring the model up on Gaudi and verify its functionality and performance.


  • Run the GPU Migration Toolkit - Gets the model functional and running on Gaudi; manual migration steps may also be used. See: GPU Migration Toolkit.
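As a sketch of this step, the toolkit can be enabled at launch time via an environment variable so CUDA API calls in an existing script are mapped to Gaudi equivalents; `train.py` is a placeholder for your own script:

```shell
# Hedged example: enable the GPU Migration Toolkit for an unmodified
# CUDA-targeting training script.
PT_HPU_GPU_MIGRATION=1 python train.py
```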

  • Place mark_step() - Reduces memory consumption when added after the backward pass and the optimizer step. See: Usage of mark_step.
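A minimal training-loop sketch of the placement described above; the no-op fallback on machines without the Gaudi software stack is my addition so the snippet runs anywhere:

```python
import torch

# On Gaudi, habana_frameworks provides mark_step(); fall back to a
# no-op so this sketch also runs on CPU-only machines.
try:
    import habana_frameworks.torch.core as htcore
    device, mark_step = torch.device("hpu"), htcore.mark_step
except ImportError:
    device, mark_step = torch.device("cpu"), (lambda: None)

model = torch.nn.Linear(8, 1).to(device)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
data = torch.randn(16, 8, device=device)
target = torch.randn(16, 1, device=device)

for _ in range(3):
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(model(data), target)
    loss.backward()
    mark_step()   # trigger execution after the backward pass
    opt.step()
    mark_step()   # and again after the optimizer step
```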

  • Perform CPU fallback analysis - Ensures that all model ops run on Gaudi rather than falling back to the host CPU. See: Placement of Ops on HPU.

Model Optimizations

Add the following optimizations to your model to establish a baseline for training or inference.


  • Set the global batch size - Experiment to find the largest batch size that runs without going out of memory. See: Batch Size.
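The experiment above can be automated with a simple doubling-then-bisection search. This is a hypothetical sketch: `fits` is a stand-in callback you would replace with a short training step wrapped in a try/except for out-of-memory errors.

```python
def largest_batch_size(fits, start=1, limit=4096):
    """Find the largest batch size for which fits() succeeds.

    Assumes fits(start) is True. `fits` is a user-supplied probe.
    """
    # Double until the first failure (or the limit).
    bs = start
    while bs * 2 <= limit and fits(bs * 2):
        bs *= 2
    # Bisect between the last success and the first failure.
    lo, hi = bs, min(bs * 2, limit)
    while lo + 1 < hi:
        mid = (lo + hi) // 2
        if fits(mid):
            lo = mid
        else:
            hi = mid
    return lo

# Example: pretend anything up to 96 samples fits in memory.
print(largest_batch_size(lambda bs: bs <= 96))  # → 96
```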

  • Use HPU Graphs - Capture operations on an HPU stream once and replay them. See: HPU Graphs for Training.
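A sketch of inference-style graph capture, assuming the `wrap_in_hpu_graph` helper from the Gaudi PyTorch bridge; the CPU fallback is my addition so the snippet runs without the hardware (training uses a different mechanism, see the linked page):

```python
import torch

model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.ReLU())

try:
    # On Gaudi, wrap the module so its ops are captured once and replayed.
    import habana_frameworks.torch as ht
    model = ht.hpu.wrap_in_hpu_graph(model.to("hpu"))
    device = "hpu"
except ImportError:
    device = "cpu"  # sketch still runs, just without graph capture

with torch.no_grad():
    out = model(torch.randn(4, 16, device=device))
print(out.shape)
```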

  • Use autocast - Run in BF16 or FP8 mixed precision for better performance. See: Mixed Precision Training with PyTorch Autocast.
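A minimal BF16 autocast sketch; the CPU fallback device is my addition so the snippet runs on machines without the Gaudi stack:

```python
import torch

try:
    import habana_frameworks.torch.core  # noqa: F401
    device = "hpu"
except ImportError:
    device = "cpu"

model = torch.nn.Linear(32, 32).to(device)
x = torch.randn(8, 32, device=device)

# Ops inside the region run in BF16 where safe, FP32 elsewhere.
with torch.autocast(device_type=device, dtype=torch.bfloat16):
    y = model(x)
print(y.dtype)  # torch.bfloat16
```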

  • Set static shapes and static ops - Remove dynamic shapes to eliminate re-compilations. See: Handling Dynamic Shapes.
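One common way to make input shapes static is bucketing: pad variable sequence lengths up to a small set of fixed sizes so the graph only ever sees a few shapes. This is a hypothetical sketch; the bucket sizes and helpers are illustrative, not part of the Gaudi API.

```python
# Pad variable-length sequences to a few fixed bucket sizes so the
# compiled graph sees only a handful of distinct input shapes.
BUCKETS = [32, 64, 128, 256]

def bucket_length(seq_len, buckets=BUCKETS):
    """Smallest bucket that fits the sequence (largest bucket otherwise)."""
    for b in buckets:
        if seq_len <= b:
            return b
    return buckets[-1]

def pad_to_bucket(tokens, pad_id=0):
    target = bucket_length(len(tokens))
    return tokens + [pad_id] * (target - len(tokens))

print(bucket_length(50))             # 64
print(len(pad_to_bucket([1] * 50)))  # 64
```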

  • Set fused optimizers and custom ops - Use the fused Gaudi versions of optimizers and custom ops. See: Fused Optimizers and Custom Ops for Intel Gaudi.
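A sketch of swapping in a fused optimizer, assuming `FusedAdamW` from the Gaudi `hpex` package; the fallback to stock `AdamW` is my addition so the snippet runs anywhere:

```python
import torch

# Prefer the Gaudi fused optimizer when available; fall back to the
# stock implementation so the sketch runs on any machine.
try:
    from habana_frameworks.torch.hpex.optimizers import FusedAdamW as AdamW
    device = "hpu"
except ImportError:
    from torch.optim import AdamW
    device = "cpu"

model = torch.nn.Linear(4, 4).to(device)
opt = AdamW(model.parameters(), lr=1e-3)

loss = model(torch.randn(2, 4, device=device)).sum()
loss.backward()
opt.step()
```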

  • Use FusedSDPA - Use fused scaled dot product attention for Transformer-based models. See: Using Fused Scaled Dot Product Attention (FusedSDPA).
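A sketch of calling the fused attention kernel; the fallback to PyTorch's built-in `scaled_dot_product_attention` is my addition so the snippet runs on CPU, and the exact positional arguments in the Gaudi branch follow my reading of the linked page:

```python
import torch
import torch.nn.functional as F

try:
    from habana_frameworks.torch.hpex.kernels import FusedSDPA

    def sdpa(q, k, v):
        # Positional args assumed: attn_mask, dropout_p, is_causal.
        return FusedSDPA.apply(q, k, v, None, 0.0, False)
    device = "hpu"
except ImportError:
    sdpa = F.scaled_dot_product_attention
    device = "cpu"

# (batch, heads, seq, head_dim)
q = torch.randn(2, 4, 16, 8, device=device)
k = torch.randn(2, 4, 16, 8, device=device)
v = torch.randn(2, 4, 16, 8, device=device)
out = sdpa(q, k, v)
print(out.shape)
```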

  • Use DeepSpeed optimizations (optional) - Select the ZeRO configuration that best balances performance and memory usage. See: DeepSpeed Validated Configurations.
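For illustration, a minimal DeepSpeed config fragment with ZeRO stage 2 and BF16; the specific values here are placeholders, not validated settings, so consult the linked page for configurations validated on Gaudi:

```json
{
  "train_batch_size": 64,
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true
  }
}
```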

  • Choose an optimal execution mode for your model - Use Lazy mode, or Eager mode with torch.compile. See: Recommended Usage.
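A sketch of the Eager-plus-torch.compile path using the `hpu_backend` compile backend; the CPU branch compiles with the lightweight `eager` backend purely so the snippet runs without Gaudi hardware:

```python
import torch

model = torch.nn.Linear(8, 8)

try:
    import habana_frameworks.torch.core  # noqa: F401
    # On Gaudi: Eager mode plus torch.compile with the HPU backend.
    model = model.to("hpu")
    compiled = torch.compile(model, backend="hpu_backend")
    x = torch.randn(4, 8, device="hpu")
except ImportError:
    # CPU fallback; the "eager" backend avoids needing a C++ toolchain.
    compiled = torch.compile(model, backend="eager")
    x = torch.randn(4, 8)

out = compiled(x)
print(out.shape)
```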

  • Use model-specific optimizations - Tune gradient bucket size and view, pin data loader memory, and more. See: Model-specific Optimizations.
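As one example of the data-loading tweaks above, pinning host memory in the DataLoader speeds up host-to-device copies of each batch; the dataset here is synthetic for illustration:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

ds = TensorDataset(torch.randn(64, 8), torch.randn(64, 1))

# pin_memory=True keeps batches in page-locked host memory so the
# copy to the accelerator is faster.
loader = DataLoader(ds, batch_size=16, shuffle=True, pin_memory=True)

x, y = next(iter(loader))
print(x.shape, y.shape)
```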

Profiling Analysis

Use the following profiling steps to find performance bottlenecks on Gaudi or the host CPU.


  • Profile with TensorBoard - Obtains Gaudi-specific performance recommendations through TensorBoard. See: Profiling with PyTorch.
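A minimal `torch.profiler` sketch; the `HPU` activity exists only in the Gaudi-enabled PyTorch build, so the `hasattr` guard keeps this runnable on a stock install (the TensorBoard plugin then adds recommendations on top of such traces):

```python
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(32, 32)
x = torch.randn(8, 32)

# Add the HPU activity only when the Gaudi build provides it.
activities = [ProfilerActivity.CPU]
if hasattr(ProfilerActivity, "HPU"):
    activities.append(ProfilerActivity.HPU)

with profile(activities=activities) as prof:
    for _ in range(3):
        model(x)

# Summarize the hottest ops.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```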

  • Review PT_HPU_METRICS_FILE - Check for excessive re-compilations during runtime. See: Runtime Environment Variables.
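A hedged sketch of this step; the dump-trigger variable follows my reading of the Runtime Environment Variables page, and `train.py` is a placeholder for your own script:

```shell
# Dump runtime metrics (including graph compilation counts) to a file,
# then inspect it for excessive re-compilations.
export PT_HPU_METRICS_FILE=/tmp/metrics.json
export PT_HPU_METRICS_DUMP_TRIGGERS=process_exit,metric_change
python train.py
grep -i compile /tmp/metrics.json
```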

  • Profile with the Intel Gaudi Profiler - View traces with the Trace viewer or Perfetto. See: Getting Started with Intel Gaudi Profiler.

  • Set ENABLE_CONSOLE for model logging - Enables console logging for debug and analysis. See: Runtime Environment Variables.
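A sketch of enabling console logging; `train.py` is a placeholder for your own script, and the exact log-level values are documented on the Runtime Environment Variables page:

```shell
# Route Gaudi runtime logs to the console for debug and analysis.
export ENABLE_CONSOLE=true
export LOG_LEVEL_ALL=3   # lower values are more verbose
python train.py
```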