Parallel Compilation¶
Note
This feature is enabled only with PT_HPU_LAZY_MODE=0.
Parallel Compilation is designed to improve performance in Eager mode and reduce time-to-train (TTT) in Compile mode. To execute any workload represented as a graph on the Intel® Gaudi® AI accelerator, the application must first compile it into a recipe. Compilation can be expensive, especially for large graphs in Compile mode. The goal is to reduce overall compilation latency by enabling concurrent compilation of multiple graphs.
The PT_HPU_COMPILE_THREAD_POOL_SIZE=<int> environment variable sets the number of compilation threads. It defaults to 8 but can be configured to a different value as needed.
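As a minimal sketch, these variables can also be set from Python before the HPU bridge is imported; the value 16 below is an arbitrary example, not a recommendation:

```python
import os

# Parallel Compilation requires lazy mode to be disabled.
os.environ["PT_HPU_LAZY_MODE"] = "0"
# Override the default of 8 compilation threads (16 is an arbitrary example).
os.environ["PT_HPU_COMPILE_THREAD_POOL_SIZE"] = "16"
```

Note that environment variables are typically read once at bridge initialization, so they must be set before any HPU modules are imported.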
For a usage example, see vLLM Warmup Code. This example illustrates a common scenario where parallel compilation improves performance, as multiple graph compilations are initiated at the start of the workload, enabling parallel execution.
Compile Mode¶
In Compile mode, there are two levels of recipe caching - one in memory and one on disk - for storing compiled graphs. However, any changes to the model, such as architecture modifications or hyperparameter tuning, can trigger recompilation. In vLLM workloads, it is common to precompile a series of graphs for various input sizes during a warm-up phase. If the model changes, the application cannot reuse the cache and must recompile the graphs each time. To accelerate the warm-up phase, Parallel Compilation enables multiple graph compilations to run concurrently, producing recipes for different input sizes more efficiently.
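The warm-up pattern can be sketched in plain Python: a pool of worker threads compiles one recipe per input-size bucket, mirroring what the runtime does internally when multiple compilation threads are configured. Here `compile_recipe` is a hypothetical stand-in for the actual graph compiler, which in a real workload is driven by the HPU backend rather than user code:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for Gaudi graph compilation: maps an input-size
# bucket to a compiled recipe.
def compile_recipe(bucket_size: int) -> str:
    return f"recipe_for_{bucket_size}"

# Input buckets a vLLM-style warm-up phase would precompile,
# e.g. supported sequence lengths.
buckets = [128, 256, 512, 1024]

# Compile all buckets concurrently; 8 workers mirrors the default
# PT_HPU_COMPILE_THREAD_POOL_SIZE.
with ThreadPoolExecutor(max_workers=8) as pool:
    recipes = dict(zip(buckets, pool.map(compile_recipe, buckets)))
```

`pool.map` preserves input order, so each bucket pairs with its own recipe even though compilations finish in arbitrary order.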
Permutations Limitation¶
Compile mode is currently limited by permutation handling, an optimization that is particularly important for delivering out-of-the-box performance in vision-based models.
To correctly propagate permutation information throughout the application, it must synchronize on the main thread and wait for all graph compilations to complete before proceeding.
This synchronization requirement prevents any graph compilations from running in parallel.
Moreover, the application cannot dynamically detect whether permutation handling is needed (for example, models that process only text typically do not require permutations).
As a temporary workaround, disable permutation handling by setting PT_HPU_ENABLE_SYNAPSE_OUTPUT_PERMUTE=false.
This will be fixed in a future release.
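The workaround can be applied in the launch environment before starting the workload, for example:

```shell
# Temporary workaround: disable permutation handling so graph
# compilations can run in parallel in Compile mode.
export PT_HPU_ENABLE_SYNAPSE_OUTPUT_PERMUTE=false
```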
Note
Parallel Compilation is currently supported only with non-vision-based models.
Eager Mode¶
By default, Eager mode does not use a recipe cache, so graph compilations occur at runtime and not only during the warm-up phase. As a result, Parallel Compilation improves the runtime performance of Eager mode. Unlike Compile mode, Eager mode does not face the previously mentioned permutation-handling limitation. This is because it uses an additional layer of shapeless cache, which stores permutation information. Once the shapeless cache is populated - typically after a few iterations - graph compilations are pipelined and executed in parallel.
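The role of the shapeless cache can be illustrated with a conceptual sketch (this is not the real HPU API; the cache structure and function names are assumptions for illustration). The key idea is that the cache is keyed by graph structure rather than input shapes, so a single entry carries the permutation information that later per-shape compilations need, removing the main-thread synchronization that blocks parallelism in Compile mode:

```python
# Conceptual sketch (assumed behavior, not the real HPU API): a
# shapeless cache keyed by graph structure only, so one entry covers
# all input shapes and stores the permutation info up front.
shapeless_cache = {}

def warm_shapeless_cache(graph_key, permutation_info):
    # Populated during the first few Eager-mode iterations.
    shapeless_cache.setdefault(graph_key, {"permutation": permutation_info})

def can_compile_in_parallel(graph_key):
    # Once permutation info is cached, per-shape compilations no longer
    # need to synchronize on the main thread and can be pipelined.
    return graph_key in shapeless_cache

warm_shapeless_cache("matmul_relu", permutation_info="identity")
```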