Runtime Environment Variables

The following table describes runtime flags that can be set in the environment to change runtime behavior and to enable or disable certain features.

Table 5 Runtime Environment Variables

Flag

Default

Description

Consumer

PT_HPU_LAZY_MODE

0

Controls execution mode:

  • 0x0 - torch.compile and Eager mode

  • 0x1 - Lazy mode

Intel Gaudi PyTorch bridge
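
Flags like this one are read when the PyTorch bridge initializes, so the safe pattern is to set them before importing the Gaudi modules. A minimal sketch (the habana_frameworks import is shown commented out, since it requires a Gaudi software installation):

```python
import os

# Select Lazy mode. Set the flag before the Gaudi bridge is imported,
# since the bridge reads it at initialization time.
os.environ["PT_HPU_LAZY_MODE"] = "1"

# import torch
# import habana_frameworks.torch.core as htcore  # the bridge picks up the flag here
```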

GRAPH_VISUALIZATION

False

Creates graph visualization files. The output graph dumps are stored in the ./.graph_dumps folder.

Intel Gaudi software

PT_HPU_RECIPE_CACHE_CONFIG

Unset

Holds the configuration of the recipe cache. The configuration is encoded as a comma-separated list in the following format: <RECIPE_CACHE_PATH>,<RECIPE_CACHE_DELETE>,<RECIPE_CACHE_SIZE_MB>.

  • <RECIPE_CACHE_PATH> - Path (directory) where compiled graph recipes are stored to accelerate a scale-up scenario. Only one process compiles a recipe; the other processes read it from disk. If unset (default), compiled graph recipes are not stored on disk (recipe disk caching is disabled).

  • <RECIPE_CACHE_DELETE> - Bool flag (true/false). If set to true, the directory provided as <RECIPE_CACHE_PATH> is cleared when the workload starts.

  • <RECIPE_CACHE_SIZE_MB> - Maximum size, in MB, of the recipe cache directory. If the size limit is reached, the oldest recipes (by creation time on the file system) are removed by the PyTorch bridge.

Note: If a recipe cache is shared among several processes (scale-up), it must be stored on a local physical disk. Avoid remote drives (such as NFS) where file locks are not supported, as this may lead to instability and unpredictable behavior.

Intel Gaudi PyTorch bridge
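
The three-field format above can be assembled and split with plain string handling. A small sketch, using hypothetical values for the path and size limit:

```python
import os

# Hypothetical values for the three comma-separated fields.
cache_path = "/tmp/gaudi_recipe_cache"  # <RECIPE_CACHE_PATH>: local disk, not NFS
delete_on_start = "false"               # <RECIPE_CACHE_DELETE>
max_size_mb = "1024"                    # <RECIPE_CACHE_SIZE_MB>

os.environ["PT_HPU_RECIPE_CACHE_CONFIG"] = ",".join(
    [cache_path, delete_on_start, max_size_mb]
)

# Splitting the same format back into its fields:
path, delete_flag, size_mb = os.environ["PT_HPU_RECIPE_CACHE_CONFIG"].split(",")
```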

PT_HPU_MAX_COMPOUND_OP_SIZE

INT64_MAX

Limits the internal graph size to the specified number of ops, reducing Lazy mode memory overhead. This will be improved in future releases.

Note: This may affect performance.

Intel Gaudi PyTorch bridge

PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES

False

The dynamic shapes feature is disabled by default. If a model experiences excessive recompilations due to dynamic data or ops, this variable can be set to let the Intel Gaudi PyTorch bridge and graph compiler automatically manage dynamic shapes in model scripts. The graphs are automatically bucketed and padded into ranges to achieve a common size, reducing recompilations and improving performance for dynamic workloads. To enable dynamic shape handling, set PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES=1 in the script or console.

Intel Gaudi PyTorch bridge

HABANA_PGM_LRU_MAX

30000

If cache evictions cause performance degradation, increasing the cache size can improve performance.

Note: Boost in performance may cause an increase in host memory consumption.

Intel Gaudi PyTorch bridge

PT_HPU_LAZY_ACC_PAR_MODE

1

Turns on host-time optimization of lazy op accumulation by offloading op accumulation to a separate thread, reducing the computation time of the main thread.

Intel Gaudi PyTorch bridge

PT_HPU_METRICS_FILE

Unset

Path (file) where the collected metrics are stored. Metrics are stored in a file only when the PT_HPU_METRICS_FILE flag is set.

Intel Gaudi PyTorch bridge

PT_HPU_METRICS_DUMP_TRIGGERS

process_exit

When PT_HPU_METRICS_FILE is set, this flag specifies precisely when the Intel Gaudi PyTorch bridge stores the metrics to the file.

Supported values:

  • process_exit - stores metrics in a file on process exit.

  • mark_step - stores metrics in a file at each mark_step.

  • metric_change - stores metrics whenever they change during execution of the training script.

Multiple triggers can be enabled together by separating them with a comma, for example: PT_HPU_METRICS_DUMP_TRIGGERS=process_exit,metric_change

Intel Gaudi PyTorch bridge

PT_HPU_METRICS_FILE_FORMAT

json

Metrics file format. Both JSON and TEXT formats are supported:

  • json

  • text

Intel Gaudi PyTorch bridge
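
The three metrics flags above are typically used together. A minimal sketch, with a hypothetical output path:

```python
import os

# Dump metrics to a JSON file both on process exit and whenever a metric
# changes. The file path is a hypothetical example.
os.environ["PT_HPU_METRICS_FILE"] = "/tmp/hpu_metrics.json"
os.environ["PT_HPU_METRICS_DUMP_TRIGGERS"] = "process_exit,metric_change"
os.environ["PT_HPU_METRICS_FILE_FORMAT"] = "json"
```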

PT_HPU_ENABLE_GENERIC_STREAM

True

Enables generic stream[1], which allows a user to submit different types of operations in the same user stream[2]. For a usage example of this flag, see the Wav2Vec2 inference script.
[1] Generic stream: A user stream where all operations can be pushed to a stream irrespective of the type of operation (copy, compute, or collective operations).
[2] User stream: A queue of device work. The host (user) places the work in this queue and continues immediately. The device schedules the work in this queue when the resources are free.

Intel Gaudi PyTorch bridge

PT_HPU_EAGER_PIPELINE_ENABLE

True

Accelerates Eager mode by enabling a multithreaded pipeline for operation processing:

  • False - standard single-threaded processing

  • True - multithreaded processing pipeline

Intel Gaudi PyTorch bridge

PT_COMPILE_ONLY_MODE

False

Option to disable launching any computations on the hardware. This can be useful in the warmup phase of a workload, when the user wants to compile all the graphs and have a well-defined separation before starting actual execution on the hardware.

Intel Gaudi PyTorch bridge

PT_HPU_GPU_MIGRATION

False

Enables the GPU Migration Toolkit, which simplifies migrating PyTorch models that run on GPU-based architectures to the Intel Gaudi AI accelerator:

  • False - GPU Migration disabled

  • True - GPU Migration enabled

Intel Gaudi PyTorch bridge

PT_HPU_ENABLE_LAZY_COLLECTIVES

False

Captures the collectives in Lazy IR. This is useful when multi-card inference is used with HPU Graphs.

  • False - Does not capture the collectives in Lazy IR.

  • True - Captures the collectives in Lazy IR.

Intel Gaudi PyTorch bridge

PT_HPU_LAZY_COLLECTIVES_HOLD_TENSORS

False

Applicable only when using PT_HPU_ENABLE_LAZY_COLLECTIVES=1.

  • False - Tensor references are released immediately after computation/collective is completed. Models requiring high memory should use this value.

  • True - Tensor references are held for a longer time until the respective computation and collective are completed. This provides better performance but is memory-intensive. This may cause an OOM error for some models.

Intel Gaudi PyTorch bridge

PT_HPU_COMPILE_THREAD_POOL_SIZE

8

Sets the number of threads in the compile thread pool used in Eager and Compile modes. If set to 0 or to more than the number of available CPUs, the number of threads equals the number of available CPUs.

Intel Gaudi PyTorch bridge

PT_HPU_ENABLE_SYNAPSE_OUTPUT_PERMUTE

True

Option to disable internal permutation handling for output tensors. This is a temporary workaround needed to enable parallel compilation in Compile mode - see Compile Mode.

Intel Gaudi PyTorch bridge

PT_TE_CUSTOM_OP

False

Enables an experimental custom_op-based solution for efficient operation of Intel Transformer Engine modules in torch.compile mode.

Intel Transformer Engine

The following table describes runtime flags that can be set in the environment to obtain Intel Gaudi software and Intel Gaudi PyTorch bridge logs.

Table 6 Runtime Environment Variables for Logging Mechanism

Flag

Default

Description

Consumer

HABANA_LOGS

~/.habana_logs/

Sets the location of log files.

Intel Gaudi software and Intel Gaudi PyTorch bridge

ENABLE_CONSOLE

False

If set to True, enables printing Intel Gaudi software and Intel Gaudi PyTorch bridge logs to the console. If unset, logs are output in the directory specified by HABANA_LOGS.

Intel Gaudi software and Intel Gaudi PyTorch bridge

LOG_LEVEL_ALL

5

Logging level for Intel Gaudi software components and Intel Gaudi PyTorch bridge.

Intel Gaudi software and Intel Gaudi PyTorch bridge

LOG_LEVEL_ALL_PT

5

Logging level for Intel Gaudi PyTorch bridge. If unset, LOG_LEVEL_ALL will be used.

Intel Gaudi PyTorch bridge

LOG_LEVEL_<COMPONENT>

5

Logging level for <COMPONENT> in Intel Gaudi PyTorch bridge. If unset, LOG_LEVEL_ALL_PT will be used.

There are many different components logging in PyTorch bridge. Notable mentions are:

  • PT_EAGER - C++ layer logs of Eager and Compile modes

  • PT_PYTHON - Python layer logs of Compile mode (hpu_backend specific logs)

  • PT_LAZY - C++ layer logs of Lazy mode

  • PT_FALLBACK - C++ layer logs of op fallbacks to CPU (performance penalty)

There are more components, but in most cases they are not critical for the end user.

Intel Gaudi PyTorch bridge
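
The fallback chain described above (LOG_LEVEL_<COMPONENT>, then LOG_LEVEL_ALL_PT, then LOG_LEVEL_ALL) can be sketched as a small resolver. The function below is illustrative only, not part of the bridge:

```python
import os

def effective_pt_log_level(component: str, env=None) -> str:
    """Resolve the log level for a PyTorch bridge component by falling back
    from LOG_LEVEL_<COMPONENT> to LOG_LEVEL_ALL_PT to LOG_LEVEL_ALL."""
    if env is None:
        env = os.environ
    for var in (f"LOG_LEVEL_{component}", "LOG_LEVEL_ALL_PT", "LOG_LEVEL_ALL"):
        if var in env:
            return env[var]
    return "5"  # documented default when no level is set

# Example: only PT_FALLBACK is made more verbose; other components
# inherit the bridge-wide level.
env = {"LOG_LEVEL_ALL_PT": "4", "LOG_LEVEL_PT_FALLBACK": "1"}
```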

GPU_MIGRATION_LOG_LEVEL

Logging level for the PT_HPU_GPU_MIGRATION component of the Intel Gaudi PyTorch bridge:

  • 1 - Logs all modules and prints to the console.

  • 2 - Logs all modules.

  • 3 - Logs all modules excluding torch.

Intel Gaudi PyTorch bridge