Model Optimization Checklist
This page is a checklist for optimizing your models on the Intel® Gaudi® AI accelerator. The steps in this guide cover the main optimization areas:
Initial model porting - Ensures the model is functional on Gaudi.
Model optimizations - Covers general performance enhancements that apply to most models.
Profiling - Helps you identify bottlenecks on the host CPU or Gaudi.
Initial Model Porting
The steps below assess performance and functionality.
Task | Description | Details |
---|---|---|
Run GPU Migration toolkit | Model is functional and running on Gaudi; manual migration steps may also be used. | |
Place | Reduces memory consumption when added after backward training pass and optimizer. | |
Perform CPU fallback analysis | Ensures that all model ops are running on Gaudi and not host CPU. | |
Model Optimizations
Add the optimizations below to your model as a baseline for training or inference.
Task | Description | Details |
---|---|---|
Set global batch size | Experiment to find largest batch size before reaching Out of Memory. | |
Use HPU Graphs | HPU Graphs capture operations using HPU stream and replay them. | |
Use autocast | Set BF16 or FP8 for better performance. | |
Set static shapes and static ops | Remove dynamic shapes to eliminate re-compilations. | |
Set fused optimizers and custom ops | Use fused Gaudi versions of optimizers and custom ops. | |
Use FusedSDPA | Use FusedSDPA for Transformer-based models. | |
Use DeepSpeed optimizations (optional) | Select the best ZeRO configuration for performance and memory usage. | |
Choose an optimal execution mode for your model | Use Lazy mode or Eager mode with | |
Use model-specific optimizations | Gradient bucket size and view, pinning data for dataloader, and more. | |
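
The batch size item above amounts to a search for the largest value that still fits in device memory. The following is a minimal sketch of that search, assuming memory use grows monotonically with batch size. `find_max_batch_size`, `try_step`, and `fake_step` are hypothetical names introduced for illustration, and `MemoryError` is only a placeholder for whatever out-of-memory exception your framework actually raises.

```python
from typing import Callable, Optional


def find_max_batch_size(
    try_step: Callable[[int], None], start: int = 1, limit: int = 65536
) -> int:
    """Return the largest batch size in [start, limit] that try_step accepts.

    try_step(batch_size) is a hypothetical callable that runs one step and
    raises MemoryError (placeholder exception) when memory is exhausted.
    Returns 0 if even `start` does not fit.
    """

    def fits(batch_size: int) -> bool:
        try:
            try_step(batch_size)
            return True
        except MemoryError:
            return False

    if not fits(start):
        return 0

    # Phase 1: double the batch size until a step fails or the limit is hit.
    good = start
    bad: Optional[int] = None
    while good < limit:
        candidate = min(good * 2, limit)
        if fits(candidate):
            good = candidate
        else:
            bad = candidate
            break
    if bad is None:
        return good  # reached the limit without running out of memory

    # Phase 2: binary search between the last success and the first failure.
    while bad - good > 1:
        mid = (good + bad) // 2
        if fits(mid):
            good = mid
        else:
            bad = mid
    return good


def fake_step(batch_size: int) -> None:
    # Stand-in for a real step: pretend anything above 96 samples runs out of memory.
    if batch_size > 96:
        raise MemoryError("simulated out of memory")


print(find_max_batch_size(fake_step))
```

With the simulated limit, the script prints 96. In a real run, `try_step` would execute one full training or inference step at the given batch size.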
Profiling Analysis
Use these profiling steps to find performance bottlenecks on Gaudi or the host CPU.
Task | Description | Details |
---|---|---|
Profile with TensorBoard | Obtains Gaudi-specific recommendations for performance using TensorBoard. | |
Review | Looks for excessive re-compilations during runtime. | |
Profile with Intel Gaudi Profiler | Uses Trace viewer or Perfetto to view traces. | |
Set | Sets logging for debug and analysis. | |
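
The re-compilation check above connects to the static-shapes item under Model Optimizations. The sketch below illustrates why, under the simplifying assumption that each distinct input shape costs one compilation. It does not read any Gaudi metric or log; `pad_to_bucket`, `count_distinct_shapes`, the bucket sizes, and the sample sequence lengths are all hypothetical and chosen only for illustration.

```python
from bisect import bisect_left
from typing import Iterable, Sequence, Tuple

Shape = Tuple[int, int]


def pad_to_bucket(length: int, buckets: Sequence[int]) -> int:
    """Return the smallest bucket that is >= length (buckets sorted ascending)."""
    index = bisect_left(buckets, length)
    if index == len(buckets):
        raise ValueError(f"length {length} exceeds the largest bucket {buckets[-1]}")
    return buckets[index]


def count_distinct_shapes(shapes: Iterable[Shape]) -> int:
    """Count unique shapes, used here as a rough proxy for compilations."""
    return len(set(shapes))


BATCH_SIZE = 8
BUCKETS = [32, 64, 128]  # hypothetical padded sequence lengths
sequence_lengths = [17, 33, 64, 65, 100, 128, 12, 90]  # made-up per-step lengths

# Without padding, every step presents a new (batch, length) shape.
raw_shapes = [(BATCH_SIZE, n) for n in sequence_lengths]

# With padding to fixed buckets, the steps share a small set of shapes.
bucketed_shapes = [(BATCH_SIZE, pad_to_bucket(n, BUCKETS)) for n in sequence_lengths]

print("distinct shapes without bucketing:", count_distinct_shapes(raw_shapes))
print("distinct shapes with bucketing:", count_distinct_shapes(bucketed_shapes))
```

In this made-up trace the eight steps present eight distinct shapes without padding and three with it. Bucketing trades some wasted computation on padding for fewer distinct shapes, so the bucket sizes are worth tuning against your own data.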