Model Optimization Checklist
This page is a checklist for optimizing your models on the Intel® Gaudi® AI accelerator. The steps in this guide cover the main optimization areas:
- Initial model porting - Ensures the model is functional on Gaudi. 
- Model optimizations - Covers general performance enhancements that apply to most models.
- Profiling - Allows you to identify bottlenecks on the host CPU or Gaudi. 
Initial Model Porting
The steps below verify that the model is functional and give a first assessment of its performance.
| Task | Description | Details | 
|---|---|---|
| Run GPU Migration toolkit | Gets the model functional and running on Gaudi; manual migration steps may be used instead. | |
| Place  | Reduces memory consumption when added after the backward pass and the optimizer step. | |
| Perform CPU fallback analysis | Ensures that all model ops run on Gaudi and not on the host CPU. | |
Model Optimizations
Apply the optimizations below to your model as a baseline for training or inference.
| Task | Description | Details | 
|---|---|---|
| Set global batch size | Experiment to find the largest batch size that fits without running out of memory. | |
| Use HPU Graphs | HPU Graphs capture operations using an HPU stream and replay them. | |
| Use autocast | Set BF16 or FP8 for better performance. | |
| Set static shapes and static ops | Remove dynamic shapes to eliminate re-compilations. | |
| Set fused optimizers and custom ops | Use fused Gaudi versions of optimizers and custom ops. | |
| Use FusedSDPA | Use FusedSDPA for Transformer-based models. | |
| Use DeepSpeed optimizations (optional) | Select the best ZeRO configuration for performance and memory usage. | |
| Choose an optimal execution mode for your model | Use Eager mode with  | |
| Use model-specific optimizations | Gradient bucket size and view, pinning data for dataloader, and more. | |
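
The batch-size experiment in the table above can be automated with a binary search. The following is a minimal sketch, not part of the Gaudi software: `largest_fitting_batch_size` and `fake_fits` are hypothetical names, and the sketch assumes that if one batch size fits in memory, every smaller one does too.

```python
from typing import Callable


def largest_fitting_batch_size(fits: Callable[[int], bool], upper: int) -> int:
    """Return the largest batch size in [1, upper] for which fits() is True.

    Assumes fits() is monotonic: if a batch size fits, every smaller one fits.
    Returns 0 if even batch size 1 does not fit.
    """
    lo, hi = 0, upper  # invariant: lo fits (or is 0); every size above hi fails
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the loop always makes progress
        if fits(mid):
            lo = mid
        else:
            hi = mid - 1
    return lo


def fake_fits(batch_size: int) -> bool:
    # Hypothetical stand-in for a real trial step: pretend memory runs out above 48.
    return batch_size <= 48


print(largest_fitting_batch_size(fake_fits, upper=256))  # prints 48
```

In a real experiment, `fits` would run one training or inference step at the given batch size and return `False` when the step fails with an out-of-memory error. How that error surfaces depends on your framework and setup, so the check is left to the caller here.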
Profiling Analysis
Use these profiling steps to find performance bottlenecks on Gaudi or the host CPU.
| Task | Description | Details | 
|---|---|---|
| Profile with TensorBoard | Obtains Gaudi-specific recommendations for performance using TensorBoard. | |
| Review  | Looks for excessive re-compilations during runtime. | |
| Profile with Intel Gaudi Profiler | Uses Trace viewer or Perfetto to view traces. | |
| Set  | Sets logging for debug and analysis. | |
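
Before collecting traces on hardware, a rough proxy for re-compilation pressure is the number of distinct input shapes a workload produces, assuming each new shape can trigger a compilation. The sketch below is framework-independent and does not reproduce the output of any Gaudi tool. The helper names, sequence lengths, and bucket sizes are illustrative assumptions. It shows how padding sequence lengths up to a few fixed buckets reduces the number of distinct shapes.

```python
from typing import Iterable, Sequence


def round_up_to_bucket(length: int, buckets: Sequence[int]) -> int:
    """Return the smallest bucket >= length; buckets must be sorted ascending."""
    for bucket in buckets:
        if length <= bucket:
            return bucket
    raise ValueError(f"length {length} exceeds largest bucket {buckets[-1]}")


def count_distinct_shapes(shapes: Iterable[tuple]) -> int:
    """Count unique shapes; each new one is a potential re-compilation."""
    return len(set(shapes))


# Hypothetical per-step sequence lengths at a fixed batch size of 8.
seq_lengths = [37, 120, 64, 91, 37, 200, 128, 15]
buckets = [64, 128, 256]  # illustrative padding targets

raw_shapes = [(8, n) for n in seq_lengths]
padded_shapes = [(8, round_up_to_bucket(n, buckets)) for n in seq_lengths]

print(count_distinct_shapes(raw_shapes))     # prints 7
print(count_distinct_shapes(padded_shapes))  # prints 3
```

Fewer buckets mean fewer distinct shapes but more padding per batch, so the bucket list is a trade-off to tune against the re-compilation counts you observe in the actual profile.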
