Profiling Real-World Examples

The following case studies show how profiling helps identify problems and guide further optimization of various models.

StableDiffusion XL Finetune Example

In this example, a PyTorch profiler trace (.json) of a Stable Diffusion XL fine-tuning run is examined in the Perfetto UI.
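
For reference, a trace like this can be collected with the standard torch.profiler API and opened in Perfetto. The sketch below is illustrative only; the step schedule, output file name, and training-loop names (dataloader, train_step) are assumptions rather than the exact setup of this run.

from torch.profiler import ProfilerActivity, profile, schedule

# Illustrative sketch: record a few training steps into a Perfetto-compatible .json trace.
def trace_handler(prof):
    prof.export_chrome_trace("sdxl_finetune_trace.json")

with profile(
    activities=[ProfilerActivity.CPU],  # add the device activity provided by your Gaudi PyTorch bridge
    schedule=schedule(wait=1, warmup=1, active=2),
    on_trace_ready=trace_handler,
) as prof:
    for step, batch in enumerate(dataloader):  # `dataloader` and `train_step` are placeholders
        train_step(batch)
        prof.step()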

The trace image below shows large periods with no device utilization (upper portion of the image) between successive profiling steps, resulting in inefficient device usage. On the host side (bottom portion of the image), the dataloader iterator call takes up most of the total execution time of each profiling step. This initial analysis suggests that the process is host-bound, which can also be confirmed by running the Habana Perf Tool:

../_images/use_case_1_0.png

The identified dataloader issue can be mitigated by incrementing the --dataloader-num-processes parameter in the run command until results improve, as seen below:

../_images/use_case_1_1.png
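
For context, this flag presumably maps to the number of worker processes used by the PyTorch DataLoader, so that batch preparation overlaps with device execution instead of blocking it. A minimal sketch of the equivalent in plain PyTorch, with an illustrative dataset name and worker count:

from torch.utils.data import DataLoader

# More worker processes let batch preparation run in parallel with device execution,
# shrinking the host-bound gaps seen in the trace. `train_dataset` is a placeholder.
train_loader = DataLoader(
    train_dataset,
    batch_size=8,
    num_workers=8,            # raise gradually until the dataloader no longer dominates the step time
    pin_memory=True,
    persistent_workers=True,  # avoid re-spawning workers every epoch
)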

Large periods of no device utilization are still observed between the two profiling steps. The aten::item function call now takes up a major share of the execution time. This call occurs on the host when obtaining tensor values (e.g. loss prints) from the device. Its frequency can be reduced by increasing the --logging_steps parameter. After this change, device utilization improves (more continuity and no large gaps):

../_images/use_case_1_2.png
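
The reason aten::item is expensive is that it copies a tensor value from the device to the host and blocks until the device has produced it. A minimal sketch of limiting such synchronizations to every logging_steps iterations (all names below are illustrative):

logging_steps = 100  # illustrative value; matches the intent of the --logging_steps parameter

for step, batch in enumerate(train_loader):
    loss = train_step(batch)               # placeholder step returning a loss tensor on the device
    if step % logging_steps == 0:
        # .item() forces a host-device sync, so it is only paid once per logging interval
        print(f"step {step}: loss = {loss.item():.4f}")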

There is still a delay between the host process and device execution at the beginning. This can be alleviated by using HPU Graphs, as seen below:

../_images/use_case_1_3.png
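
HPU Graphs record the device operations of a step once and then replay them, which hides per-op launch latency on the host. In the fine-tuning scripts this is typically enabled through a command-line switch; as a rough illustration only, and assuming the habana_frameworks PyTorch bridge is installed (the exact entry point may vary between releases), a module can be wrapped like this:

import habana_frameworks.torch.hpu.graphs as htgraphs  # assumed module path; requires the Gaudi PyTorch bridge

# `model` is a placeholder for the module used in the run. Wrapping it records the
# device graph on the first call and replays it afterwards, reducing host launch overhead.
model = model.to("hpu")
model = htgraphs.wrap_in_hpu_graph(model)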

Optimum for Intel Gaudi Example

This example uses a text-generation command based on DeepSpeed and Optimum for Intel Gaudi:

deepspeed --num_gpus 8 run_generation.py --model_name_or_path meta-llama/Llama-2-7b-hf --max_new_tokens 256 --bf16 --use_hpu_graphs --use_kv_cache --batch_size 180 --attn_softmax_bf16 --limit_hpu_graphs --reuse_cache --trim_logits

The image below shows how the low-level trace (.hltv) for this process looks when loaded in the Perfetto UI. A big gap is observed at the right end of the trace: the Python process has finished, but the device DMA, TPC, and Synapse logger threads are still busy. Ideally, these should be tightly synchronized and not lag behind by tens of milliseconds:

../_images/use_case_2_0.png

There appears to be some data transfer overhead during the data pre-processing phase, which may be due to inefficient memory access and resource contention on the host. This is remedied by using the --bind_cores_to_rank parameter, which instructs DeepSpeed to split the CPU cores evenly across ranks for better memory management and improved data parallelism:

../_images/use_case_2_1.png
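
Conceptually, core binding gives each rank a disjoint slice of host CPU cores so the ranks stop contending for the same cores and memory channels. A rough sketch of the manual equivalent on Linux, with illustrative environment-variable names and a placeholder world size:

import os

rank = int(os.environ.get("LOCAL_RANK", "0"))
world_size = int(os.environ.get("LOCAL_WORLD_SIZE", "8"))

all_cores = sorted(os.sched_getaffinity(0))              # cores currently available to this process
per_rank = max(1, len(all_cores) // world_size)
my_cores = all_cores[rank * per_rank:(rank + 1) * per_rank]
os.sched_setaffinity(0, set(my_cores))                   # pin this rank to its own slice of cores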

Hugging Face GPT Example

In this text-generation example with a Hugging Face GPT-based model, generating the first token takes far too long (nearly 30 minutes). The latency is caused by waiting for some process on the device, so the analysis focuses on the device side.

In the trace below, observe that the compileGraph function call repeats at every profiling step. This is unexpected, since normally the graph is compiled once, cached, and reused in subsequent steps.

../_images/use_case_3_0.png

This indicates dynamicity in the model, which could be due to dynamic shapes, dynamic ops, and/or loops in the control flow of the graph itself. Follow the Handling Dynamicity guide to mitigate these issues.
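
As an example of one such mitigation, padding variable-length inputs to a small set of fixed bucket sizes keeps tensor shapes static, so the compiled graph can be cached and reused instead of being recompiled each step. The bucket sizes and names below are illustrative:

import torch.nn.functional as F

BUCKETS = [128, 256, 512]  # illustrative fixed lengths

def pad_to_bucket(input_ids, pad_token_id):
    # Pad the sequence up to the nearest bucket so the device sees only a few distinct shapes.
    length = input_ids.shape[-1]
    target = next((b for b in BUCKETS if b >= length), length)  # fall back to the raw length if it exceeds all buckets
    return F.pad(input_ids, (0, target - length), value=pad_token_id)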