Parallel Compilation

Note

This feature is enabled only with PT_HPU_LAZY_MODE=0.

Parallel Compilation is designed to improve performance in Eager mode and reduce time-to-train (TTT) in Compile mode. To execute any workload represented as a graph on the Intel® Gaudi® AI accelerator, the application must first compile it into a recipe. Compilation can be expensive, especially for large graphs in Compile mode. The goal is to reduce overall compilation latency by enabling concurrent compilation of multiple graphs.

The PT_HPU_COMPILE_THREAD_POOL_SIZE=<int> environment variable controls the number of compilation threads. It defaults to 8 but can be set to a different value as needed.
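A minimal sketch of configuring these variables from Python before the Gaudi PyTorch bridge is loaded. The thread-count value of 16 is illustrative, not a recommendation:

```python
import os

# Parallel Compilation requires Lazy mode to be disabled.
os.environ["PT_HPU_LAZY_MODE"] = "0"

# Override the default pool of 8 compilation threads. The value 16 is
# an arbitrary example - tune it to your host's available CPU cores.
os.environ["PT_HPU_COMPILE_THREAD_POOL_SIZE"] = "16"

# These must be set before the Gaudi PyTorch bridge initializes,
# i.e. before `import habana_frameworks.torch` in a real workload.
print(os.environ["PT_HPU_COMPILE_THREAD_POOL_SIZE"])
```

Setting the variables in the shell that launches the workload (e.g. via `export`) works equally well; the only requirement is that they are in place before framework initialization.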

For a usage example, see vLLM Warmup Code. This example illustrates a common scenario where parallel compilation improves performance, as multiple graph compilations are initiated at the start of the workload, enabling parallel execution.

Compile Mode

In Compile mode, compiled graphs are stored in two levels of recipe cache - one in memory and one on disk. However, any changes to the model, such as architecture modifications or hyperparameter tuning, can trigger recompilation. In vLLM workloads, it is common to precompile a series of graphs for various input sizes during a warm-up phase. If the model changes, the application cannot reuse the cache and must recompile the graphs each time. To accelerate the warm-up phase, Parallel Compilation enables multiple graph compilations to run concurrently, producing recipes for different input sizes more efficiently.
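The warm-up flow described above can be illustrated with a conceptual sketch. This is not the Gaudi API: `compile_recipe`, `get_recipe`, and the in-memory `recipe_cache` dictionary are stand-ins for the real compiler and recipe cache, and the on-disk cache level is omitted:

```python
# Stand-in for the expensive graph -> recipe compilation step.
def compile_recipe(input_shape):
    # In a real workload, this is where the Gaudi graph compiler runs.
    return f"recipe-for-{input_shape}"

# Simple in-memory recipe cache keyed by input shape, mimicking the
# first caching level described above.
recipe_cache = {}

def get_recipe(input_shape):
    # Compile on a cache miss; reuse the stored recipe on a hit.
    if input_shape not in recipe_cache:
        recipe_cache[input_shape] = compile_recipe(input_shape)
    return recipe_cache[input_shape]

# Warm-up phase: precompile recipes for the input sizes the workload
# is expected to see, so later requests are pure cache hits.
for shape in [(1, 128), (1, 256), (1, 512)]:
    get_recipe(shape)

print(len(recipe_cache))  # 3 recipes precompiled
```

The point of Parallel Compilation is that the warm-up loop above need not be serial: independent shape variants can be handed to the compilation thread pool and built concurrently.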

Permutations Limitation

Compile mode is impacted by permutation handling - an optimization that is particularly important for vision-based models and aims to deliver out-of-the-box performance. The application cannot detect in advance whether permutation handling is needed (for example, models that process only text typically do not require permutations). To correctly propagate permutation information throughout the application, the main thread must synchronize and wait for the first compilation of a given graph type to complete before proceeding - that is, for the same graph submitted with different input shapes, as in the dynamic-shapes case. This synchronization prevents the first compilation from overlapping with others, but once it finishes, all subsequent compilations of that graph type run in parallel.
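The synchronization pattern can be sketched with the standard library's thread pool. This is a conceptual model only - `compile_graph` and the shape list are hypothetical stand-ins, not Gaudi internals:

```python
from concurrent.futures import ThreadPoolExecutor

def compile_graph(shape):
    # Stand-in for compiling one recipe for the given input shape.
    return f"recipe-{shape}"

# Same graph type, different input shapes (the dynamic-shapes case).
shapes = [(1, 128), (1, 256), (1, 512), (1, 1024)]

# The first compilation of this graph type runs on the main thread,
# so permutation information is known before any parallel work starts.
recipes = {shapes[0]: compile_graph(shapes[0])}

# The remaining shapes of the same graph can now compile concurrently.
with ThreadPoolExecutor(max_workers=8) as pool:
    for shape, recipe in zip(shapes[1:], pool.map(compile_graph, shapes[1:])):
        recipes[shape] = recipe

print(len(recipes))  # all 4 shape variants compiled
```

The cost of the limitation is therefore bounded: only the first compilation per graph type is serialized, and every later shape variant still benefits from the thread pool.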

Note

Parallel compilation is supported with both vision-based and non-vision-based models.

Eager Mode

By default, Eager mode does not use a recipe cache, so graph compilations occur at runtime rather than only during the warm-up phase. As a result, Parallel Compilation improves the runtime performance of Eager mode. Unlike Compile mode, Eager mode does not face the previously mentioned permutation-handling limitation, because it uses an additional shapeless cache layer that stores permutation information. Once the shapeless cache is populated - typically after a few iterations - graph compilations are pipelined and executed in parallel.
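A conceptual sketch of how a shapeless cache removes the synchronization requirement. All names here (`shapeless_cache`, `compile_graph`, `warm_shapeless_cache`, the graph key) are hypothetical illustrations, not Gaudi internals:

```python
from concurrent.futures import ThreadPoolExecutor

# Shapeless cache: keyed by graph structure alone, independent of
# input shapes; it stores metadata such as permutation information.
shapeless_cache = {}

def compile_graph(graph_key, shape):
    # Stand-in for a runtime graph compilation in Eager mode.
    return f"{graph_key}-{shape}"

def warm_shapeless_cache(graph_key):
    # The first few iterations record permutation info synchronously.
    shapeless_cache.setdefault(graph_key, {"permutations": "recorded"})

warm_shapeless_cache("matmul-graph")

# With permutation info already cached, compilations for new input
# shapes can be submitted to a pool and executed in parallel.
with ThreadPoolExecutor(max_workers=8) as pool:
    futures = [pool.submit(compile_graph, "matmul-graph", s)
               for s in [(8, 64), (8, 128), (8, 256)]]
    recipes = [f.result() for f in futures]

print(len(recipes))  # 3 shape variants compiled in parallel
```

Because permutation metadata lives in the shape-independent cache, no compilation has to wait for a sibling with a different input shape, which is why Eager mode can pipeline compilations freely after the first few iterations.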