Optimize Inference on PyTorch

Inference workloads require optimization because they are more prone to host overhead. Since an inference step consumes less computation than a training step and is usually executed with smaller batch sizes, host overhead is more likely to reduce throughput and increase latency.

This document describes how to apply the optimization methods below to minimize host overhead and improve inference performance.

Set CPU Setting to Performance

BIOS Configuration Requirement for CPU Frequency Scaling

To enable the operating system to manage CPU performance states dynamically, the system BIOS must expose and allow control over CPU frequency scaling features. Specifically, the BIOS must support ACPI CPU frequency scaling interfaces and must not restrict access to performance states (P-states) or governors. This ensures that the OS can read and modify the CPU frequency governor settings through /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor. Without proper BIOS support, this interface may be unavailable or return errors, preventing the OS from applying the desired power or performance policies.

The following is an example of setting the CPU governor to performance on Ubuntu:

#Get setting:
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

#Set setting:
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

Note

The CPU settings must be updated on bare metal before starting the container.
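To confirm that the setting took effect on every core, the sysfs files can also be read from a script. The following is a minimal sketch, assuming the standard Linux cpufreq sysfs layout shown above; the function name governor_summary is hypothetical and not part of any Intel Gaudi tooling.

```python
import glob
from collections import Counter

DEFAULT_PATTERN = "/sys/devices/system/cpu/cpu*/cpufreq/scaling_governor"


def governor_summary(pattern=DEFAULT_PATTERN):
    """Count how many CPUs use each frequency governor.

    Returns an empty Counter when no cpufreq files match, which usually
    means the interface is not exposed (for example, due to BIOS settings).
    """
    counts = Counter()
    for path in glob.glob(pattern):
        with open(path) as f:
            counts[f.read().strip()] += 1
    return counts


if __name__ == "__main__":
    summary = governor_summary()
    if not summary:
        print("No cpufreq interface found; check the BIOS configuration.")
    elif set(summary) != {"performance"}:
        print(f"Some CPUs are not in performance mode: {dict(summary)}")
    else:
        print(f"All {summary['performance']} CPUs use the performance governor.")
```

Run the script on bare metal, before starting the container, for the same reason given in the note above.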

Update CPU Settings

This section describes how to update CPU settings on Gaudi 3 systems that use Sapphire Rapids or Granite Rapids CPUs to optimize performance.

Identify Host Overhead

The following example shows the host overhead observed in a typical inference workload on PyTorch:

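import torch
import habana_frameworks.torch.core as htcore  # loads the Intel Gaudi PyTorch bridge

hpu = torch.device("hpu")
cpu = torch.device("cpu")

# model and inference_queries are assumed to be defined by the application
for query in inference_queries:
    # Optional pre-processing on the host, such as tokenization or normalization of images
    # Copy the query to device memory
    query_hpu = query.to(hpu)
    # Perform processing
    output_hpu = model(query_hpu)
    # Copy the result to host memory
    output = output_hpu.to(cpu)
    # Optional host post-processing, such as decoding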

The model’s forward() function typically involves a series of computations. When executed without optimization, each line of Python code in the forward() call is evaluated by the Python interpreter, passed through the PyTorch Python front-end, then sent to the Intel Gaudi PyTorch bridge. Processing on the device only occurs when mark_step is invoked or when the copy to the CPU is requested. See the illustration below.

The diagram indicates that the HPU will have extended periods of inactivity due to computation steps being dependent on each other.

To identify similar cases, use the Intel Gaudi integration with the PyTorch Profiler (see the Profiling with PyTorch section). If there are gaps between device invocations, host overhead is impeding throughput. When host overhead is minimal, the device runs continuously after a brief ramp-up period.
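The gaps can be quantified once the device activity intervals are available. The sketch below is illustrative only: it assumes the device-side events have already been extracted from a profiler trace as (start, end) timestamp pairs, which is a hypothetical input format and not something the profiler produces directly. The function name device_idle_fraction is likewise an assumption.

```python
def device_idle_fraction(intervals):
    """Return the fraction of the traced window in which the device was idle.

    intervals: iterable of (start, end) timestamps of device activity,
    in any order and possibly overlapping. The window runs from the
    earliest start to the latest end.
    """
    ordered = sorted(intervals)
    if not ordered:
        return 0.0

    window_start = ordered[0][0]
    window_end = max(end for _, end in ordered)
    total = window_end - window_start
    if total <= 0:
        return 0.0

    # Merge overlapping intervals and accumulate the busy time.
    busy = 0.0
    cur_start, cur_end = ordered[0]
    for start, end in ordered[1:]:
        if start > cur_end:
            busy += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    busy += cur_end - cur_start

    return 1.0 - busy / total
```

A value close to zero suggests the device is running almost continuously, while a large value points to host overhead between device invocations.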

The following are three techniques to lower host overhead and enhance your inference performance.

Use HPU Graphs

As described in Run Inference Using HPU Graphs, wrapping the forward() call in htorch.hpu.wrap_in_hpu_graph minimizes the time in which each line of Python code in the forward() call is evaluated by the Python interpreter. See the example below.

As a result, the HPU can start running the computation and copying the output sooner, which improves throughput.

Note

Using HPU Graphs for optimizing inference on Gaudi is highly recommended.
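A minimal sketch of the wrapping step follows. It uses htorch.hpu.wrap_in_hpu_graph as named above; the helper name prepare_model_for_inference is hypothetical, and the bridge import is placed inside the function only so that the snippet can be loaded on a host without the Intel Gaudi software stack. Calling it requires a Gaudi device.

```python
def prepare_model_for_inference(model):
    """Move a torch.nn.Module to the HPU and wrap it in an HPU graph.

    Assumes the Intel Gaudi software stack is installed; the import
    below fails otherwise.
    """
    import habana_frameworks.torch as htorch  # Intel Gaudi PyTorch bridge

    model = model.eval().to("hpu")
    return htorch.hpu.wrap_in_hpu_graph(model)


# Usage (model is any torch.nn.Module defined by the application):
#   model = prepare_model_for_inference(model)
#   output_hpu = model(query_hpu)
```

The wrapped model is called exactly like the original one, so the inference loop itself does not need to change.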

Use Asynchronous Copies

By default, the host thread that submits the computation to the device will wait for the copy operation to complete. However, by specifying the argument non_blocking=True during the copy operation, the Python thread can continue to execute other tasks while the copy occurs in the background.

To use asynchronous copies, replace query_hpu = query.to(hpu) with query_hpu = query.to(hpu, non_blocking=True). See our Wav2Vec inference example.

The following is an example of the timing diagram:

Note

Asynchronous Copies are currently supported from the host to the device only.
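Applied to the inference loop shown earlier, the change might look like the sketch below. The function name run_inference and its device parameter are assumptions made for illustration; on a Gaudi system the Intel Gaudi PyTorch bridge must be imported before "hpu" can be used as a device.

```python
def run_inference(model, queries, device="hpu"):
    """Run each query through the model using a non-blocking input copy.

    model:   a callable such as a torch.nn.Module already on `device`
    queries: an iterable of host tensors
    """
    outputs = []
    for query in queries:
        # The host-to-device copy is queued and the host thread continues
        # immediately instead of waiting for the copy to finish.
        query_dev = query.to(device, non_blocking=True)
        output_dev = model(query_dev)
        # The device-to-host copy stays blocking, since asynchronous copies
        # are supported from the host to the device only.
        outputs.append(output_dev.to("cpu"))
    return outputs
```

Only the input copy changes; the copy back to the host still synchronizes with the device.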

Use Software Pipelining

This section describes how to achieve software pipelining using threads. This method can be applied to models in which an inference request goes through several stages.

For each inference iteration, waiting for the result and post-processing it are dispatched to separate threads. This allows each inference iteration to run as fast as possible without being blocked by result computation and post-processing. Because the threads run in parallel, throughput is bounded by the latency of the slowest pipeline stage. An example of this method can be found in the Wav2Vec Model Reference on GitHub, where applying it allowed the device to be fully utilized.

The following should be added to the inference model:

pool = multiprocessing.pool.ThreadPool(4) - Initializes a thread pool. It is common to align the number of threads with the number of pipeline stages. For example, the Wav2Vec model uses two pipeline stages: tokenize + compute, then decode. However, the Wav2Vec example uses twice that number of threads to leave some leeway in scheduling and to absorb random hiccups on the host side.
logits_pinned.copy_(logits, non_blocking=True) - Copies into pinned memory instead of regular memory. This allows the DMA on the device side to access the buffer.
pool.apply_async(sync, args=(stream_obj, processor, logits_pinned, args, decodes, e2e, predicted, ground_truth, ds, r, i, tokens, perfs, perf_start)) - In the original loop, where software pipelining is not applied, the host thread blocks until the compute completes and then copies back the results. Software pipelining with threads defers this work to the thread pool. A thread is dispatched to perform the waiting, data copying, and decoding. Once the sample is done, the thread is returned to the pool.
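The overall pattern can be sketched without any device code. The example below is a simplified illustration, not the Wav2Vec implementation: compute and postprocess are hypothetical placeholders for the tokenize + compute stage and for the wait, copy, and decode stage, and only ThreadPool and apply_async are taken from the description above.

```python
import time
from multiprocessing.pool import ThreadPool


def compute(sample):
    """Hypothetical stage 1: stands in for tokenize + device compute."""
    return sample * 2


def postprocess(index, raw, results):
    """Hypothetical stage 2: stands in for waiting, copying, and decoding."""
    time.sleep(0.01)  # simulate host-side work
    results[index] = f"decoded:{raw}"


def run_pipelined(samples, num_threads=4):
    """Run stage 1 in the main loop and hand stage 2 to a thread pool."""
    results = [None] * len(samples)
    pool = ThreadPool(num_threads)
    pending = []
    for i, sample in enumerate(samples):
        raw = compute(sample)
        # The main loop does not wait here; a pool thread finishes the sample.
        pending.append(pool.apply_async(postprocess, args=(i, raw, results)))
    for handle in pending:
        handle.get()  # wait for completion and re-raise any worker exception
    pool.close()
    pool.join()
    return results


if __name__ == "__main__":
    print(run_pipelined([1, 2, 3]))
```

Each worker writes to its own slot in results, so the output order matches the input order even when the threads finish out of order.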

The following is an example of the timing diagram in the steady state:

Gaudi Documentation 1.21.1 documentation
