Optimizing Large Language Models

As device memory is limited when training large language models, two methods can be used to reduce per-HPU memory usage: DeepSpeed ZeRO or Megatron 3D Parallelism. DeepSpeed ZeRO is highly recommended, as Megatron 3D Parallelism requires you to reconfigure your model. The scenarios below describe recommended optimization methods using DeepSpeed ZeRO and Megatron 3D Parallelism configurations.

Condition: Model fits on one HPU
Strategy: Use PyTorch Distributed Data Parallel (DDP).
Recommendations:

  • Using one of the ZeRO configurations is not recommended due to communication overhead; a minimal DDP sketch is shown after this list.
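
The following is a minimal DDP sketch for this case, assuming the Intel Gaudi PyTorch integration (the habana_frameworks module path, the hccl backend, and the "hpu" device string are taken from that integration and may differ between releases). The model and tensor sizes are placeholders.

    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    # Registers the hccl backend for HPU; module path per the Gaudi PyTorch
    # integration and may differ between releases.
    import habana_frameworks.torch.distributed.hccl  # noqa: F401
    import habana_frameworks.torch.core as htcore

    dist.init_process_group(backend="hccl")          # one process per HPU
    device = torch.device("hpu")

    model = torch.nn.Linear(1024, 1024).to(device)   # placeholder model that fits on one HPU
    ddp_model = DDP(model)                           # plain gradient all-reduce, no ZeRO partitioning

    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=1e-3)
    x = torch.randn(8, 1024, device=device)
    loss = ddp_model(x).pow(2).mean()
    loss.backward()
    optimizer.step()
    htcore.mark_step()                               # flush the lazy-mode graph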

Condition: Model does not fit on one HPU
Strategy: Use ZeRO-1.
Recommendations:

  • ZeRO-1 partitions the optimizer states across ranks to reduce per-device memory usage. Enabling the BFloat16 data type saves additional memory by using bfloat16 for the optimizer and communication data types; a configuration sketch follows this list.

  • Increase micro batch size if memory allows.
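
A minimal sketch of a ZeRO-1 configuration with bfloat16 enabled, expressed as a Python dict passed to deepspeed.initialize() under the DeepSpeed launcher. The placeholder model and batch sizes are illustrative assumptions, not recommendations.

    import torch
    import deepspeed

    model = torch.nn.Linear(1024, 1024)            # placeholder model

    ds_config = {
        "train_micro_batch_size_per_gpu": 1,       # increase if memory allows
        "gradient_accumulation_steps": 1,
        "bf16": {"enabled": True},                 # bfloat16 optimizer and communication data types
        "zero_optimization": {
            "stage": 1,                            # partition optimizer states across ranks
        },
    }

    model_engine, optimizer, _, _ = deepspeed.initialize(
        model=model,
        model_parameters=model.parameters(),
        config=ds_config,
    )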

Condition: Model does not fit with ZeRO-1 across ranks
Strategy: Use ZeRO-2.
Recommendations:

  • ZeRO-2 partitions gradients in addition to optimizer states across ranks to reduce per-device memory usage.

  • If the model still does not fit, add activation checkpointing.

  • If the model still does not fit, set PT_HPU_MAX_COMPOUND_OP_SIZE to reduce the graph size and lower memory usage.

  • If the model does not fit after implementing all of the above, try ZeRO-Offload; a configuration sketch combining these options follows this list.

  • Increase micro batch size if memory allows.
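
Below is a minimal sketch of a ZeRO-2 configuration that also enables ZeRO-Offload (optimizer states offloaded to host memory) and sets PT_HPU_MAX_COMPOUND_OP_SIZE from Python. The numeric values are assumptions for illustration; tune them for your workload.

    import os

    # Assumed illustrative value; limits the accumulated graph size to reduce memory.
    os.environ["PT_HPU_MAX_COMPOUND_OP_SIZE"] = "10"

    ds_config = {
        "train_micro_batch_size_per_gpu": 1,
        "bf16": {"enabled": True},
        "zero_optimization": {
            "stage": 2,                                  # partition gradients + optimizer states
            "offload_optimizer": {"device": "cpu"},      # ZeRO-Offload: optimizer states to host memory
        },
    }
    # Activation checkpointing is enabled separately in the training script,
    # e.g. with torch.utils.checkpoint or deepspeed.checkpointing.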

Condition: Model does not fit with ZeRO-1 and ZeRO-2 across ranks
Strategy: Use ZeRO-3.
Recommendations:

  • ZeRO-3 additionally partitions the bfloat16 model parameters across ranks.

  • If the model does not fit, add activation checkpointing.

  • If the model still does not fit, set PT_HPU_MAX_COMPOUND_OP_SIZE and/or PT_HPU_MAX_COMPOUND_OP_SYNC.

  • If the model still does not fit, set DEEPSPEED_HPU_ZERO3_SYNC_MARK_STEP_REQUIRE for memory reduction.

  • If the model does not fit after implementing all of the above, add ZeRO-Infinity; a configuration sketch follows this list.

  • Increase micro batch size if memory allows.
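
A minimal sketch of a ZeRO-3 configuration extended with ZeRO-Infinity (parameters and optimizer states offloaded to NVMe). The environment-variable values, offload devices, and nvme path are assumptions for illustration.

    import os

    # Assumed illustrative values; tune per workload.
    os.environ["PT_HPU_MAX_COMPOUND_OP_SIZE"] = "10"
    os.environ["DEEPSPEED_HPU_ZERO3_SYNC_MARK_STEP_REQUIRE"] = "1"

    ds_config = {
        "train_micro_batch_size_per_gpu": 1,
        "bf16": {"enabled": True},
        "zero_optimization": {
            "stage": 3,                                          # also partition model parameters
            # ZeRO-Infinity: spill parameters and optimizer states to NVMe.
            "offload_param": {"device": "nvme", "nvme_path": "/local_nvme"},
            "offload_optimizer": {"device": "nvme", "nvme_path": "/local_nvme"},
        },
    }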

Condition: Model does not fit on one HPU with ZeRO-1 and ZeRO-2
Strategy: Use Megatron 3D Parallelism.
Recommendations:

  • Increase the Tensor Parallel (TP) size, or combine TP with Pipeline Parallel (PP), until the model fits.

  • Increase Data Parallel (DP) for better performance and memory optimization; see the sketch after this list.

  • Increase micro batch size if memory allows.
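
As a rough illustration of how the three dimensions interact, the sketch below derives the Data Parallel size from the world size and the chosen TP and PP sizes. The specific numbers are assumptions; in practice the sizes are set through your Megatron launch arguments.

    # Hypothetical sizes for illustration only.
    world_size = 64   # total number of HPUs
    tp = 8            # Tensor Parallel size: grow first until the model fits
    pp = 2            # Pipeline Parallel size: combine with TP if TP alone is not enough

    assert world_size % (tp * pp) == 0, "TP * PP must divide the world size"
    dp = world_size // (tp * pp)         # remaining ranks form the Data Parallel dimension
    print(f"TP={tp}, PP={pp}, DP={dp}")  # larger DP improves throughput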