Optimize Inference on PyTorch

Inference workloads are more prone to host overhead than training workloads and therefore benefit from optimization. Because an inference step requires less computation than a training step and usually runs with smaller batch sizes, host overhead is more likely to reduce throughput and increase latency.

This document describes how to identify host overhead and how to apply the optimization methods below to minimize it and improve inference performance.

Identifying Host Overhead

The following example shows the host overhead observed in a typical inference workload on PyTorch:

# hpu and cpu are the target devices, for example torch.device("hpu") and torch.device("cpu")
for query in inference_queries:
    # Optional pre-processing on the host, such as tokenization or normalization of images
    # Copy the query to device memory
    query_hpu = query.to(hpu)
    # Perform processing on the device
    output_hpu = model(query_hpu)
    # Copy the result to host memory
    output = output_hpu.to(cpu)
    # Optional host post-processing, such as decoding

The model’s forward() function typically involves a series of computations. When executed without optimization, each line of Python code in the forward() call is evaluated by the Python interpreter, passed through the PyTorch Python front-end, then sent to the Intel Gaudi PyTorch bridge. Processing on the device only occurs when mark_step is invoked or when the copy to the CPU is requested. See the illustration below.


The diagram indicates that the HPU will have extended periods of inactivity due to computation steps being dependent on each other.

To identify similar cases, use the Intel Gaudi integration with the PyTorch Profiler (see the Profiling with PyTorch section). Gaps between device invocations indicate that host overhead is impeding throughput. When host overhead is minimal, the device runs continuously after a brief ramp-up period.
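
As a rough illustration of what "gaps between device invocations" means, the sketch below computes the fraction of a traced window in which the device was idle. It assumes the device-busy intervals have already been read out of a profiler trace as (start, end) pairs in milliseconds; the function name device_idle_fraction and both sample traces are hypothetical and are not part of the profiler's API.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_ms, end_ms) of one device invocation


def device_idle_fraction(busy: List[Interval]) -> float:
    """Return the fraction of the traced window in which the device was idle."""
    if not busy:
        return 0.0
    ordered = sorted(busy)
    window = max(end for _, end in ordered) - ordered[0][0]
    if window <= 0:
        return 0.0
    busy_time = 0.0
    cur_start, cur_end = ordered[0]
    for start, end in ordered[1:]:
        if start > cur_end:
            # A gap: the device had nothing to run until the host caught up.
            busy_time += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or back-to-back invocation: extend the busy span.
            cur_end = max(cur_end, end)
    busy_time += cur_end - cur_start
    return 1.0 - busy_time / window


# Hypothetical trace: three 2 ms device invocations separated by 3 ms gaps.
gappy = [(0.0, 2.0), (5.0, 7.0), (10.0, 12.0)]
# Hypothetical trace: the same work issued back to back.
dense = [(0.0, 2.0), (2.0, 4.0), (4.0, 6.0)]

print(f"idle fraction with gaps: {device_idle_fraction(gappy):.2f}")
print(f"idle fraction dense:     {device_idle_fraction(dense):.2f}")
```

A high idle fraction in the steady state suggests that the host, not the device, is limiting throughput.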

The following three techniques lower host overhead and improve inference performance.

Using HPU Graphs

As described in Run Inference Using HPU Graphs, wrapping the forward() call in htorch.hpu.wrap_in_hpu_graph minimizes the time the Python interpreter spends evaluating each line of Python code in the forward() call. See the example below.


Consequently, minimizing this time allows the HPU to start running the computation and copying the output sooner, improving throughput.


Using HPU Graphs for optimizing inference on Gaudi is highly recommended.
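
The snippet below is a minimal sketch of wrapping a model with htorch.hpu.wrap_in_hpu_graph. It assumes the Intel Gaudi software stack is importable as habana_frameworks.torch; TinyModel and its layer sizes are hypothetical stand-ins for a real model, and the CPU fallback exists only so the sketch still runs on a machine without the Gaudi software installed.

```python
import torch

try:
    # Intel Gaudi PyTorch bridge; assumed importable only where it is installed.
    import habana_frameworks.torch as htorch
    device = torch.device("hpu")
except ImportError:
    htorch = None
    device = torch.device("cpu")  # fallback so the sketch still runs


class TinyModel(torch.nn.Module):
    """Hypothetical stand-in for a real inference model."""

    def __init__(self) -> None:
        super().__init__()
        self.fc1 = torch.nn.Linear(16, 32)
        self.fc2 = torch.nn.Linear(32, 4)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(torch.relu(self.fc1(x)))


model = TinyModel().eval().to(device)
if htorch is not None:
    # Wrap once, before the inference loop, so later calls replay the graph.
    model = htorch.hpu.wrap_in_hpu_graph(model)

with torch.no_grad():
    for _ in range(3):
        query = torch.randn(8, 16)                   # host-side input
        output = model(query.to(device)).to("cpu")   # compute, then copy back

print(output.shape)
```

The input shape is kept constant across iterations here on purpose, since the sketch assumes the wrapped graph is replayed with same-shaped inputs.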

Using Asynchronous Copies

By default, the host thread that submits the computation to the device will wait for the copy operation to complete. However, by specifying the argument non_blocking=True during the copy operation, the Python thread can continue to execute other tasks while the copy occurs in the background.

To use Asynchronous Copies, replace query_hpu = query.to(hpu) with query_hpu = query.to(hpu, non_blocking=True). See our Wav2Vec inference example.

The following is an example of the timing diagram:



Asynchronous Copies are currently supported from the host to the device only.

Using Software Pipelining

This section describes how to achieve software pipelining using threads. This method can be applied to models in which an inference request goes through several stages.

For each inference iteration, waiting for the result and post-processing it are dispatched to separate threads. This lets the main loop issue inference iterations as fast as possible without being blocked by result computation and post-processing. Because the threads run in parallel, throughput is limited by the latency of the slowest pipeline stage. An example of this method can be seen in the Wav2Vec Model Reference on GitHub, where applying it utilized the device fully.

The following should be added to the inference model:

  • pool = multiprocessing.pool.ThreadPool(4) - Initializes a thread pool. It is common to align the number of threads with the number of pipeline stages. For example, the Wav2Vec model uses two pipeline stages: tokenize + compute, and then decode. However, the Wav2Vec example uses twice that number of threads to leave some leeway in scheduling and to absorb random hiccups on the host side.

  • logits_pinned.copy_(logits, non_blocking=True) - Copies the result into pinned (page-locked) host memory instead of regular pageable memory. This allows the DMA on the device side to access the buffer.

  • pool.apply_async(sync, args=(stream_obj, processor, logits_pinned, args, decodes, e2e, predicted, ground_truth, ds, r, i, tokens, perfs, perf_start)) - In the original loop, where software pipelining is not applied, the host thread blocks until the compute completes so that it can copy back the results. With software pipelining, this work is deferred to the thread pool: a thread is dispatched to wait for the result, copy the data, and decode it. Once the sample is done, the thread returns to the pool.
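
The pattern described in the list above can be sketched with the standard library alone. This is a simplified simulation, not the Wav2Vec code: the device is modeled as a single-worker pool so that its work is serialized, compute and decode are replaced by time.sleep, and the names compute, sync, run, COMPUTE_S and DECODE_S are hypothetical.

```python
import time
from multiprocessing.pool import ThreadPool

COMPUTE_S = 0.02  # hypothetical device time per query
DECODE_S = 0.02   # hypothetical host post-processing time per query


def compute(query: int) -> int:
    """Stand-in for the model's forward pass on the device."""
    time.sleep(COMPUTE_S)
    return query * 2


def sync(handle, results: list, i: int) -> None:
    """Wait for one result, then post-process it."""
    logits = handle.get()   # blocks only the thread that calls sync
    time.sleep(DECODE_S)    # stand-in for decoding
    results[i] = f"decoded-{logits}"


def run(queries: list, pipelined: bool) -> tuple:
    device = ThreadPool(1)  # one worker: device work is serialized
    pool = ThreadPool(4)    # twice the two pipeline stages, as in the text
    results = [None] * len(queries)
    pending = []
    start = time.perf_counter()
    for i, query in enumerate(queries):
        handle = device.apply_async(compute, (query,))  # returns immediately
        if pipelined:
            # Defer waiting and decoding to a pool thread.
            pending.append(pool.apply_async(sync, (handle, results, i)))
        else:
            # Baseline: the main thread blocks on every query.
            sync(handle, results, i)
    for task in pending:
        task.get()          # drain the pipeline; re-raises worker errors
    elapsed = time.perf_counter() - start
    for p in (device, pool):
        p.close()
        p.join()
    return results, elapsed


queries = list(range(8))
baseline, sequential_s = run(queries, pipelined=False)
overlapped, pipelined_s = run(queries, pipelined=True)
print(f"sequential: {sequential_s:.3f} s, pipelined: {pipelined_s:.3f} s")
```

Under these assumed timings the baseline pays for compute plus decode on every query, while the pipelined run keeps the simulated device busy and overlaps decoding with the next query's compute, so its total time should approach the compute time of all queries plus one decode.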

The following is an example of the timing diagram in the steady state: