Runtime Environment Variables

The following table describes runtime environment variables that change runtime behavior and enable or disable certain features. Of the flags below, TF_NUM_INTEROP_THREADS, TF_CPP_MIN_LOG_LEVEL, and TF_CPP_MIN_VLOG_LEVEL are native TensorFlow flags. All other flags are specific to Intel® Gaudi® software.






Accepts a comma-separated list of op types to be placed on the CPU by the PlaceUnsupportedOpsOnCpu pass. If set to “all_nodes”, all nodes in the graph are placed on the CPU.



Controls dumping of TensorFlow graphs after different graph transformation phases.

  • 1 (default) - dumps only from POST_REWRITE_FOR_EXEC

  • 0 - disables dumping

  • Value above 1 - enables dumping from all phases



Sets the path that TensorFlow dumps are saved to.

If unset, graphs will not be dumped. A warning message is shown for built-in TF graph dumping.



If set to ‘true’, enables printing Intel Gaudi software logs to the console.



Logging level from Intel Gaudi software and perf_lib.

  • 6 is no logs

  • 0 is verbose

By default, logs are placed either in the console (if ENABLE_CONSOLE=true) or under ~/.habana_logs/.
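The destination rule above can be sketched as a small helper. The function name log_destination is hypothetical, and the exact string comparison against ‘true’ is an assumption; the real software may accept other spellings.

```python
import os


def log_destination(env=None):
    """Return where logs would go under the rule described above.

    Hypothetical helper for illustration; not part of any Intel Gaudi API.
    """
    env = os.environ if env is None else env
    # Assumption: only the exact string 'true' switches logging to the console.
    if env.get("ENABLE_CONSOLE") == "true":
        return "console"
    # Otherwise logs land under the per-user directory named in the text.
    return os.path.expanduser("~/.habana_logs/")
```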



Enables FP32 to BF16 conversion pass for mixed precision training. Currently supported settings:

  • ‘0’, ‘false’ or unset - conversion is disabled

  • ‘1’, ‘true’, ‘full’ - all convertible ops are converted

  • ‘basic’ - only general matrix multiplications and convolutions are allowed

  • /path/to/model/mixed_precision_config.json
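To make the accepted settings easier to scan, the following sketch maps a raw setting string to the mode it selects. The helper name classify_bf16_setting and the returned labels are hypothetical, and treating every other value as a JSON config path is an assumption.

```python
def classify_bf16_setting(value):
    """Map a raw setting string (or None if unset) to a conversion mode label.

    Hypothetical helper for illustration only.
    """
    if value is None or value in ("0", "false"):
        return "disabled"
    if value in ("1", "true", "full"):
        return "full"    # all convertible ops
    if value == "basic":
        return "basic"   # only general matrix multiplications and convolutions
    # Assumption: any other value is a path to a mixed precision JSON config.
    return "custom_json"
```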



Allows dumping the current BF16 config to a JSON file. If set to the absolute path of an output file, the config is dumped to that file.



If set to ‘false’ or ‘0’, support for compiling clusters with dynamic shape inputs is disabled.

Dynamic shapes are cluster inputs whose shapes change between iterations, forcing recompilation of graph recipes. Dynamic shape support reduces the number of recompilations required by introducing graph recipes with input shape ranges (min, max), derived from the patterns of input shapes seen by a given cluster.

Disabling dynamic shape support forces all clusters to be compiled statically, which can increase the overall number of compilations.
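The effect on compilation count can be illustrated with a toy model. This is not the actual range-selection logic, which the text says is based on observed input shape patterns; the fixed-width bucketing below is a hypothetical stand-in, and both function names are invented for the example.

```python
def count_static_compilations(shapes):
    """Static compilation: one recipe per distinct input shape."""
    return len(set(shapes))


def count_ranged_compilations(shapes, bucket=64):
    """Ranged compilation: one recipe per (min, max) range that is hit.

    Assumption: ranges are fixed-width buckets, purely for illustration.
    """
    recipes = []  # list of (lo, hi) ranges already compiled
    for s in shapes:
        if not any(lo <= s <= hi for lo, hi in recipes):
            lo = (s // bucket) * bucket
            recipes.append((lo, lo + bucket - 1))
    return len(recipes)


# Sequence lengths seen by one cluster over nine iterations (made-up data).
lengths = [17, 33, 45, 60, 70, 90, 120, 17, 33]
static_count = count_static_compilations(lengths)
ranged_count = count_ranged_compilations(lengths)
```

In this toy run the static path compiles once per distinct length, while the ranged path compiles once per bucket that is touched.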



If set to ‘0’, the Pattern Matcher optimization pass is disabled.



Sets the initial allocated memory size for the workspace buffer, in MB. This option is mainly for cases in which dynamic workspace allocation does not work properly.



The default allocation strategy allocates host memory equal to the minimum of the following values:

  • 64G (for machines with more than that)

  • 80% of available memory size

  • Available memory size - 16G

If this flag is set to any value, it instructs the Intel Gaudi CPU allocator to override the default configuration of the CPU memory pool size with the given size in Gigabytes.
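A minimal sketch of the default sizing rule follows. It assumes the three listed values are combined by taking the smallest of them, and the function name default_pool_size_gb is hypothetical.

```python
def default_pool_size_gb(available_gb):
    """Default CPU memory pool size in GB under the assumed 'smallest of three' rule."""
    return min(
        64.0,                 # cap for machines with more memory than that
        0.8 * available_gb,   # 80% of available memory
        available_gb - 16.0,  # available memory minus 16G
    )
```

Under this reading, a host with 512 GB available gets a 64 GB pool, while a host with 40 GB available gets 24 GB.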



If set to a non-zero value, this flag enforces the thread count for TensorFlow op execution. Otherwise, TensorFlow selects the count based on the available cores and MKL/OpenMP configurations.



Logging level from native TensorFlow. A lower value means more logs. The valid value range is [0-4].



Another logging level from native TensorFlow. A higher value means more logs. The valid value range is [0-10].
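As a minimal sketch, the three native TensorFlow flags can be set from Python before TensorFlow is imported. The values below are illustrative, not recommendations.

```python
import os

# Illustrative values only. Set these before `import tensorflow`
# so that the native runtime sees them at start-up.
os.environ["TF_NUM_INTEROP_THREADS"] = "4"  # non-zero enforces the thread count
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "1"    # lower value means more logs
os.environ["TF_CPP_MIN_VLOG_LEVEL"] = "0"   # higher value means more logs
```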



If set to ‘True’, disables legacy Variables registration on HPU and allows them to be executed on CPU. Otherwise, legacy variables registration on HPU will prevent them from being executed at all.



Path to the directory where compiled graph recipes are stored between runs of the same model. Reusing stored recipes shortens the first iteration.

If unset, compiled graph recipes are not stored on disk (recipe disk caching disabled).

In a scale-up scenario, different processes on one platform may share the same directory for the recipe cache. Only one process compiles the recipe, and the other processes read it from disk.

Note: The recipe cache directory is not cleared automatically and can increase in size over time.

Note: If a recipe cache is shared among several processes (scale-up), it must be stored on a local physical disk. Avoid using remote drives (such as NFS) where file locks are not supported, as this may lead to instability and unpredictable behavior.
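The reliance on file locks can be illustrated with a generic "compile once, read many" pattern. This is not the Intel Gaudi implementation: the function get_or_compile, the file naming, and the use of fcntl.flock (Linux/Unix only) are all assumptions made for the sketch.

```python
import fcntl
import os


def get_or_compile(cache_dir, key, compile_fn):
    """Return the cached bytes for `key`, compiling at most once across processes.

    Hypothetical sketch; `compile_fn` is any callable that returns bytes.
    """
    os.makedirs(cache_dir, exist_ok=True)
    recipe_path = os.path.join(cache_dir, key + ".recipe")
    lock_path = os.path.join(cache_dir, key + ".lock")
    with open(lock_path, "w") as lock_file:
        # Blocks until no other process holds the lock. On file systems
        # without working locks, two processes could both get past this point.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            if not os.path.exists(recipe_path):
                tmp_path = recipe_path + ".tmp"
                with open(tmp_path, "wb") as f:
                    f.write(compile_fn())
                # Atomic rename so readers never see a half-written file.
                os.replace(tmp_path, recipe_path)
            with open(recipe_path, "rb") as f:
                return f.read()
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
```

If the lock silently fails to exclude other processes, several of them may write the same recipe at once, which is the kind of instability the note warns about.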



If set to ‘True’ in multi-worker training, instantiating the HPUStrategy class alters the behavior of TensorFlow’s tensorflow.python.ops.collective_ops.all_reduce_v2 function so that a barrier precedes the all-reduce operation.

The barrier effectively aligns multiple worker processes prior to every CollectiveReduceV2 operation, using an additional, host NIC-based CollectiveAllgather operation (placed on /device:CPU:0 device).

This mode of operation addresses scenarios where the iteration time varies greatly between workers. In such cases, calls to the Habana Communication Library (specifically hcclAllreduce()) may crash a process that waits too long for its counterparts in other workers. This timeout is related to the command queue of the Intel Gaudi Linux Kernel Driver (LKD) and may be set manually during LKD insmod, for example: build_and_insmod_habanalabs -p "timeout_locked=90". Enabling TF_HABANA_COLLECTIVE_REDUCE_SYNC is effectively an alternative to loading the LKD with an increased timeout parameter.

This feature does not affect Horovod’s HorovodAllreduce operation.



Limit on the number of nodes placed in a single graph recipe. If unset, graphs have no upper limit on node count and are sliced only at algorithmic synchronization points (similar to sending tensors between devices). If set, large graphs are sliced into smaller ones, each containing at most the number of nodes given by the variable.

Note: Use TF_MAX_CLUSTER_SIZE only when graph compilation takes too long or when certain models exceed the available memory. Set its value with caution, as it may reduce performance or increase the memory footprint by breaking graph compiler optimization mechanisms.
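The slicing behavior can be pictured with a toy example that cuts an ordered node list into chunks. This is only an illustration: the real slicing also honors synchronization points and graph structure, and the function name slice_nodes is hypothetical.

```python
def slice_nodes(nodes, max_cluster_size=None):
    """Split an ordered node list into clusters of at most `max_cluster_size` nodes.

    Hypothetical helper; None models the 'unset' case with no upper limit.
    """
    nodes = list(nodes)
    if max_cluster_size is None:
        return [nodes]
    if max_cluster_size < 1:
        raise ValueError("max_cluster_size must be a positive integer")
    return [
        nodes[i:i + max_cluster_size]
        for i in range(0, len(nodes), max_cluster_size)
    ]
```

Smaller clusters shorten each compilation but give the graph compiler fewer nodes to optimize across, which is why the note above advises caution.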