Optimizing Large Language Models

As device memory is limited when training large language models, two methods can be used to reduce per-HPU memory usage: DeepSpeed ZeRO or Megatron 3D Parallelism. Using DeepSpeed ZeRO is highly recommended, as Megatron 3D Parallelism requires you to reconfigure your model. The scenarios below list the recommended optimization methods for DeepSpeed ZeRO and Megatron 3D Parallelism configurations.

Condition: Model fits on a single HPU
Strategy: Use PyTorch Distributed Data Parallel (DDP); a minimal setup is sketched after this entry.
Recommendations:
  • Using one of the ZeRO configurations is not recommended due to communication overhead.
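
The following is a minimal DDP sketch for this case, not an exact recipe: it assumes the habana_frameworks PyTorch package is installed (its distributed module registers the HCCL backend), and the nn.Linear module is a stand-in for your real model.

    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    # Assumed import that registers the HCCL backend for HPU collectives.
    import habana_frameworks.torch.distributed.hccl  # noqa: F401

    dist.init_process_group(backend="hccl")
    model = torch.nn.Linear(4096, 4096).to("hpu")  # stand-in for the real model
    model = DDP(model)                             # each rank keeps a full model replica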

Condition: Model does not fit on a single HPU
Strategy: Use ZeRO-1; a configuration sketch follows this entry.
Recommendations:
  • ZeRO-1 partitions the optimizer states across ranks to reduce per-device memory usage. Enabling the bfloat16 data type saves additional memory, for example through a bfloat16 optimizer and bfloat16 communication data type.
  • Increase the micro batch size if memory allows.
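
A minimal DeepSpeed configuration sketch for this scenario, passed as a Python dict; the batch size, learning rate, and communication_data_type value are illustrative assumptions, not prescribed settings.

    import deepspeed
    import torch.nn as nn

    model = nn.Linear(4096, 4096)  # stand-in for the real model

    ds_config = {
        "train_micro_batch_size_per_gpu": 4,        # increase if memory allows
        "bf16": {"enabled": True},                   # train in bfloat16
        "communication_data_type": "bf16",           # bfloat16 collectives (assumed value)
        "optimizer": {"type": "AdamW", "params": {"lr": 1e-5}},
        "zero_optimization": {"stage": 1},           # partition optimizer states across ranks
    }

    model_engine, optimizer, _, _ = deepspeed.initialize(
        model=model, model_parameters=model.parameters(), config=ds_config
    )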

Condition: Model does not fit with ZeRO-1 across ranks
Strategy: Use ZeRO-2; a configuration sketch follows this entry.
Recommendations:
  • ZeRO-2 partitions the gradients, in addition to the optimizer states, across ranks to reduce per-device memory usage.
  • If the model still does not fit, add Activation Checkpointing.
  • If the model still does not fit, set PT_HPU_MAX_COMPOUND_OP_SIZE to reduce the graph size and lower memory usage.
  • If the model does not fit after implementing all of the above, try ZeRO-Offload.
  • Increase the micro batch size if memory allows.
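
A sketch of how these knobs might be combined; the environment variable value, batch size, and CPU offload setting are placeholders to adapt, and the config is passed to deepspeed.initialize as in the ZeRO-1 sketch above.

    import os

    # Smaller compound ops can reduce peak graph memory (value is workload dependent).
    os.environ["PT_HPU_MAX_COMPOUND_OP_SIZE"] = "256"

    ds_config = {
        "train_micro_batch_size_per_gpu": 1,         # increase if memory allows
        "bf16": {"enabled": True},
        "optimizer": {"type": "AdamW", "params": {"lr": 1e-5}},
        "zero_optimization": {
            "stage": 2,                               # partition gradients + optimizer states
            "offload_optimizer": {"device": "cpu"},   # ZeRO-Offload; add only if still out of memory
        },
    }
    # Activation checkpointing is enabled in the model or training script
    # (e.g. by checkpointing transformer layers), not only through this config.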

Condition: Model does not fit with ZeRO-1 and ZeRO-2 across ranks
Strategy: Use ZeRO-3; a configuration sketch follows this entry.
Recommendations:
  • ZeRO-3 partitions the bfloat16 model parameters across ranks.
  • If the model does not fit, add Activation Checkpointing.
  • If the model still does not fit, set PT_HPU_MAX_COMPOUND_OP_SIZE and/or PT_HPU_MAX_COMPOUND_OP_SYNC.
  • If the model still does not fit, set DEEPSPEED_HPU_ZERO3_SYNC_MARK_STEP_REQUIRE for memory reduction.
  • If the model does not fit after implementing all of the above, add ZeRO-Infinity.
  • Increase the micro batch size if memory allows.
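
A sketch combining ZeRO-3 with the environment variables mentioned above and a ZeRO-Infinity style NVMe offload; the variable values and the /local_nvme path are placeholder assumptions, and the config is again passed to deepspeed.initialize as in the ZeRO-1 sketch.

    import os

    # HPU knobs referenced above; the values here are placeholders.
    os.environ["PT_HPU_MAX_COMPOUND_OP_SIZE"] = "256"
    os.environ["PT_HPU_MAX_COMPOUND_OP_SYNC"] = "1"
    os.environ["DEEPSPEED_HPU_ZERO3_SYNC_MARK_STEP_REQUIRE"] = "1"

    ds_config = {
        "train_micro_batch_size_per_gpu": 1,          # increase if memory allows
        "bf16": {"enabled": True},
        "optimizer": {"type": "AdamW", "params": {"lr": 1e-5}},
        "zero_optimization": {
            "stage": 3,                                # also partition model parameters
            # ZeRO-Infinity style offload; add only if the model still does not fit.
            "offload_param": {"device": "nvme", "nvme_path": "/local_nvme"},
            "offload_optimizer": {"device": "nvme", "nvme_path": "/local_nvme"},
        },
    }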

Condition: Model does not fit on the HPU with ZeRO-1 and ZeRO-2
Strategy: Use Megatron 3D Parallelism (an alternative to ZeRO-3 when reconfiguring the model is acceptable); see the sketch after this entry.
Recommendations:
  • Increase the Tensor Parallel (TP) size, or combine TP with Pipeline Parallel (PP), to fit the model.
  • Increase Data Parallel (DP) for better performance and memory optimization.
  • Increase the micro batch size if memory allows.
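
In 3D parallelism the total number of devices is split across the three axes, so DP = world_size / (TP × PP). The short sketch below illustrates that relationship with placeholder numbers.

    # Placeholder numbers: 64 devices split into TP=8 and PP=4 leave DP=2 replicas.
    world_size = 64
    tensor_parallel = 8
    pipeline_parallel = 4
    assert world_size % (tensor_parallel * pipeline_parallel) == 0
    data_parallel = world_size // (tensor_parallel * pipeline_parallel)  # -> 2
    print(f"TP={tensor_parallel}, PP={pipeline_parallel}, DP={data_parallel}")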