Model Optimization Checklist
This page is a checklist for optimizing your models on the Intel® Gaudi® AI accelerator. Following the steps below ensures that you cover the main areas of model optimization.
Optimizing a model for Gaudi is divided into three steps:
Initial Model Porting - Ensures the model is functional on Gaudi.
Model Optimizations - General performance enhancements that apply to most models.
Profiling - Identifies bottlenecks on the host CPU or on Gaudi; this is the final step in optimizing a model.
Initial Model Porting
The following are the first steps for establishing functionality and making an initial assessment of performance.
Task | Description | Details |
---|---|---|
Running GPU Model Migration | Model is functional and running on Gaudi; manual migration steps may also be used. | |
Placement of mark_step() | Added after the backward training pass and after the optimizer step; this reduces memory consumption. | |
Perform CPU Fallback analysis | Ensures that all model ops are running on Gaudi and not on the host CPU. | |
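
The mark_step() placement described in the table can be sketched as follows. This is a minimal sketch, assuming PyTorch Lazy mode and the `habana_frameworks.torch.core` module; `train_step` and its parameters are hypothetical names chosen for illustration, and the no-op fallback exists only so that the snippet can be run on a machine without the Gaudi software stack.

```python
# Sketch only: train_step and its arguments are hypothetical names.
try:
    # Present only where the Gaudi PyTorch bridge is installed.
    import habana_frameworks.torch.core as htcore
    mark_step = htcore.mark_step
except ImportError:
    def mark_step():
        """No-op stand-in used when the Gaudi software stack is absent."""
        return None


def train_step(model, optimizer, loss_fn, inputs, targets, mark_step_fn=mark_step):
    """Run one training iteration with mark_step() in the two checklist positions."""
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    mark_step_fn()  # first placement: right after the backward pass
    optimizer.step()
    mark_step_fn()  # second placement: right after the optimizer step
    return loss
```

Passing `mark_step_fn` as a parameter is a design choice for this sketch only; it keeps the function testable off-device. In a real training script the calls would typically be written inline.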
Model Optimizations
Add the following optimizations to your model as the baseline for training or inference.
Task | Description | Details |
---|---|---|
Set Global Batch Size | Experiment to find the largest batch size that does not run out of memory. | |
Use HPU_Graph | HPU Graphs capture operations using an HPU stream and replay them. | |
Use Autocast | Set BF16 or FP8 for better performance. | |
Set Static Shapes and Static OPS | Remove dynamic shapes to eliminate re-compilations. | |
Set Fused Optimizers | Use custom Gaudi versions of the optimizer step. | |
Use Fused Scaled Dot Product Attention | Use FusedSDPA for Transformer-based models. | |
Correct optimization for DeepSpeed (optional) | Select the best ZeRO option for performance and memory usage. | |
General optimizations | Gradient Bucket Size and View, Pinning data for Dataloader, and more. | |
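
The batch size experiment from the first row of the table can be automated. The following is a minimal sketch, not a Gaudi API: `find_max_batch_size` and the `fits` probe are hypothetical names. It assumes that `fits(batch_size)` runs a trial step and returns False when that step runs out of memory, and that memory use grows with batch size, so any size above a failing size also fails.

```python
def find_max_batch_size(fits, start=1, limit=4096):
    """Return the largest batch size in [start, limit] accepted by fits().

    fits is a hypothetical probe supplied by the caller: it should return
    True if a trial step at that batch size completes, and False otherwise.
    Returns 0 if even the starting size does not fit.
    """
    if not fits(start):
        return 0

    good = start  # largest size known to fit
    bad = None    # smallest size known to fail

    # Phase 1: double the batch size until a failure or the limit is reached.
    while bad is None and good < limit:
        candidate = min(good * 2, limit)
        if fits(candidate):
            good = candidate
        else:
            bad = candidate

    if bad is None:
        return good  # reached the limit without failing

    # Phase 2: binary search between the last success and the first failure.
    while bad - good > 1:
        mid = (good + bad) // 2
        if fits(mid):
            good = mid
        else:
            bad = mid
    return good
```

In practice the probe would wrap a single forward and backward pass and report whether it completed. The largest size that fits is not always the fastest, so it is worth comparing throughput at the resulting value and at nearby values.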
Profiling Analysis
Use these profiling steps to find performance bottlenecks on Gaudi or on the host CPU.
Task | Description | Details |
---|---|---|
PyTorch Profiling with TensorBoard | Obtains Gaudi-specific recommendations for performance using TensorBoard. | |
Review the | Looks for excessive re-compilations during runtime. | |
Profiling Trace Viewer | Uses Trace viewer or Perfetto to view traces. | |
Model Logging | Sets | |