Model Optimization Checklist

This page is a checklist for optimizing your models on the Intel® Gaudi® AI accelerator. The steps in this guide cover the three main optimization areas:

Initial model porting - Ensures the model is functional on Gaudi.
Model optimizations - Covers general performance enhancements that apply to most models.
Profiling - Allows you to identify bottlenecks on the host CPU or Gaudi.

Initial Model Porting

The following steps establish baseline functionality and performance on Gaudi.

Task	Description	Details
Run GPU Migration Toolkit	Gets the model functional and running on Gaudi; manual migration steps may also be used.	GPU Migration Toolkit
Place `mark_step()`	Reduces memory consumption when added after the backward pass and after the optimizer step.	Usage of mark_step
Perform CPU fallback analysis	Ensures that all model ops are running on Gaudi and not host CPU.	Placement of Ops on HPU
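
The `mark_step()` placement described in the table can be sketched as follows. This is a minimal illustration, not the toolkit's own code: `train_step` and its parameters are hypothetical names, and the step marker is passed in as a callable so the sketch runs without Gaudi hardware. On Gaudi, the callable would typically be `htcore.mark_step` after `import habana_frameworks.torch.core as htcore`.

```python
from typing import Any, Callable


def train_step(
    model: Callable[[Any], Any],
    loss_fn: Callable[[Any, Any], Any],
    optimizer: Any,
    inputs: Any,
    targets: Any,
    mark_step: Callable[[], None],
) -> Any:
    """Run one training step with a step marker after backward and after the optimizer.

    `mark_step` is injected (hypothetical parameter); on Gaudi it is assumed to be
    `htcore.mark_step`, elsewhere a no-op such as `lambda: None` can be passed.
    """
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    mark_step()  # first marker: right after the backward pass
    optimizer.step()
    mark_step()  # second marker: right after the optimizer update
    return loss
```

Injecting the marker keeps the same loop usable on a development machine without the Gaudi software stack, which can help when checking functionality before porting.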

Model Optimizations

Add the following optimizations to your model as a baseline for training or inference.

Task	Description	Details
Set global batch size	Experiment to find the largest batch size that does not run out of memory.	Batch Size
Use HPU Graphs	HPU Graphs capture operations using HPU stream and replay them.	HPU Graphs for Training
Use autocast	Set BF16 or FP8 for better performance.	Mixed Precision Training with PyTorch Autocast
Set static shapes and static ops	Remove dynamic shapes to eliminate re-compilations.	Handling Dynamic Shapes
Set fused optimizers and custom ops	Use fused Gaudi versions of optimizers and custom ops.	Fused Optimizers and Custom Ops for Intel Gaudi
Use FusedSDPA	Use FusedSDPA for Transformer-based models.	Using Fused Scaled Dot Product Attention (FusedSDPA)
Use DeepSpeed optimizations (optional)	Select the best ZeRO configuration for performance and memory usage.	DeepSpeed Validated Configurations
Choose an optimal execution mode for your model	Use Eager mode with `torch.compile` or Lazy mode.	Recommended Usage
Use model-specific optimizations	Gradient bucket size and view, pinning data for dataloader, and more.	Model-specific Optimizations
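
The batch size experiment from the table can be automated with a simple search. The sketch below is one possible approach, not a method prescribed by the Gaudi documentation: `largest_fitting_batch_size` and the `fits` callback are hypothetical names. The callback is assumed to run a trial step at the given batch size and return `False` when it runs out of memory; how that is detected is left to the caller. The search also assumes that if a batch size fits, every smaller one fits too.

```python
from typing import Callable


def largest_fitting_batch_size(
    fits: Callable[[int], bool], start: int = 1, limit: int = 65536
) -> int:
    """Return the largest batch size in [start, limit] for which fits() is True.

    Returns 0 if even `start` does not fit. `fits` is a hypothetical callback
    that tries one step at the given batch size and reports success.
    """
    if start < 1 or start > limit:
        raise ValueError("expected 1 <= start <= limit")
    if not fits(start):
        return 0

    # Phase 1: double the batch size until a trial fails or the limit is passed.
    low = start
    high = start * 2
    while high <= limit and fits(high):
        low = high
        high *= 2
    # `high` is now either a size that failed or one past the allowed limit.
    high = min(high, limit + 1)

    # Phase 2: binary search between the last success and the first failure.
    while high - low > 1:
        mid = (low + high) // 2
        if fits(mid):
            low = mid
        else:
            high = mid
    return low
```

Doubling first keeps the number of trial runs small when the workable batch size is far from the starting point; the binary search then narrows the result to the exact boundary.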

Profiling Analysis

Use the following profiling steps to find performance bottlenecks on Gaudi or the host CPU.

Task	Description	Details
Profile with TensorBoard	Obtains Gaudi-specific recommendations for performance using TensorBoard.	Profiling with PyTorch
Review `PT_HPU_METRICS_FILE`	Looks for excessive re-compilations during runtime.	Runtime Environment Variables
Profile with Intel Gaudi Profiler	Uses Trace viewer or Perfetto to view traces.	Getting Started with Intel Gaudi Profiler
Set `ENABLE_CONSOLE` for model logging	Sets logging for debug and analysis.	Runtime Environment Variables
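
The two environment variables in the table can be set for a single run without changing the shell environment. The sketch below is an illustration under assumptions: `gaudi_debug_env` is a hypothetical helper, the metrics path is a placeholder, and the value `"true"` for `ENABLE_CONSOLE` is assumed; check the Runtime Environment Variables page for the accepted values.

```python
import os
from typing import Dict, Optional


def gaudi_debug_env(
    metrics_file: str,
    console_logging: bool = True,
    base: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Build an environment mapping for a profiling or debug run.

    Copies `base` (or the current process environment) and adds the variables
    named in the checklist. The helper name and arguments are hypothetical.
    """
    env = dict(os.environ if base is None else base)
    # File where runtime metrics, such as re-compilation counts, are written.
    env["PT_HPU_METRICS_FILE"] = metrics_file
    if console_logging:
        # Assumed value format; enables log output on the console.
        env["ENABLE_CONSOLE"] = "true"
    return env
```

The returned mapping can be passed to a child process, for example `subprocess.run([sys.executable, "train.py"], env=gaudi_debug_env("/tmp/hpu_metrics.json"))`, where `train.py` stands in for your own training script.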