Optimizing Large Language Models
As device memory is limited when training large language models, two methods can be used to reduce per-HPU memory usage: DeepSpeed ZeRO or Megatron 3D Parallelism. Using DeepSpeed ZeRO is highly recommended, as Megatron 3D Parallelism requires you to reconfigure your model. The table below shows which method to use in different scenarios.
| Condition | Strategy | Recommendations |
|---|---|---|
| Model fits on one HPU | Use PyTorch Distributed Data Parallel (DDP) | Using one of the ZeRO configurations is not recommended due to communication overhead. See the DDP sketch after this table. |
| Model does not fit on one HPU | Use ZeRO-1 | See the ZeRO configuration sketch after this table. |
| Model does not fit with ZeRO-1 across ranks | Use ZeRO-2 | |
| Model does not fit with ZeRO-1 and ZeRO-2 across ranks | Use ZeRO-3 | |
| Model does not fit on HPU with ZeRO-1 and ZeRO-2 | Use Megatron 3D Parallelism | |
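For the first row, where the model fits on a single HPU, the sketch below shows one way to wrap such a model in PyTorch DDP on Gaudi. The `habana_frameworks` import paths and the `hccl` backend name reflect the Intel Gaudi PyTorch integration and should be treated as assumptions about your installed software stack; the model, batch size, and optimizer are placeholders.

```python
# Minimal DDP sketch for a model that fits on one HPU (assumption: one process
# per Gaudi device, launched with torchrun or mpirun so RANK/WORLD_SIZE are set).
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

import habana_frameworks.torch.core as htcore    # registers the "hpu" device
import habana_frameworks.torch.distributed.hccl  # registers the "hccl" backend

dist.init_process_group(backend="hccl")          # collective backend on Gaudi
device = torch.device("hpu")

model = torch.nn.Linear(1024, 1024).to(device)   # placeholder model
ddp_model = DDP(model)                           # gradients are all-reduced across ranks
optimizer = torch.optim.SGD(ddp_model.parameters(), lr=1e-3)

# Standard training step: DDP synchronizes gradients during backward().
inputs = torch.randn(8, 1024, device=device)
loss = ddp_model(inputs).sum()
loss.backward()
optimizer.step()
htcore.mark_step()                               # flush the lazy-mode graph on HPU
```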
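For the ZeRO-1, ZeRO-2, and ZeRO-3 rows, the sketch below shows how a ZeRO stage is selected through a DeepSpeed configuration, assuming the DeepSpeed build provided for Gaudi is installed. The batch size, optimizer, and bf16 settings are illustrative assumptions; only the `zero_optimization` stage changes between ZeRO-1, ZeRO-2, and ZeRO-3.

```python
# Minimal ZeRO sketch (assumption: the DeepSpeed fork supporting Gaudi is installed);
# switch "stage" between 1, 2, and 3 following the table above.
import torch
import deepspeed

ds_config = {
    "train_micro_batch_size_per_gpu": 8,                    # illustrative value
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},  # illustrative optimizer
    "bf16": {"enabled": True},                              # assumption: bf16 training on HPU
    "zero_optimization": {
        "stage": 1,                                         # 1 = ZeRO-1, 2 = ZeRO-2, 3 = ZeRO-3
    },
}

model = torch.nn.Linear(1024, 1024)                         # placeholder model

# deepspeed.initialize wraps the model, builds the ZeRO optimizer, and sets up
# distributed communication; model_engine.backward()/step() replace the usual
# loss.backward() and optimizer.step() calls in the training loop.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```

ZeRO-1 shards only optimizer states across ranks, ZeRO-2 additionally shards gradients, and ZeRO-3 also shards the parameters themselves, which is why each successive stage accommodates larger models at the cost of more communication.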