Optimize Inference on PyTorch¶
Inference workloads require optimization as they are more prone to host overhead. Since an inference step consumes less computation than a training step and is usually executed with smaller batch sizes, host overhead is more likely to reduce throughput and increase latency.
This document describes how to apply the optimization methods below to minimize host overhead and improve inference performance.
Identifying Host Overhead¶
The following example shows the host overhead observed in a typical inference workload on PyTorch:
for query in inference_queries:
    # Optional pre-processing on the host, such as tokenization or normalization of images
    # Copy the result to device memory
    query_hpu = query.to(hpu)
    # Perform processing
    output_hpu = model(query_hpu)
    # Copy the result to host memory
    output = output_hpu.to(cpu)
    # Optional host post-processing, such as decoding
The model’s forward() function typically involves a series of computations. When executed without optimization, each line of Python code in the forward() call is evaluated by the Python interpreter, passed through the PyTorch Python front-end, then sent to the Intel Gaudi PyTorch bridge. Processing on the device only occurs when mark_step is invoked or when the copy to the CPU is requested. See the illustration below.
The diagram indicates that the HPU will have extended periods of inactivity because each computation step depends on the previous one.
To identify similar cases, use the Intel Gaudi integration with the PyTorch Profiler; see the Profiling with PyTorch section. If there are gaps between device invocations, host overhead is impeding throughput. When host overhead is minimal, the device functions continuously after a brief ramp-up period.
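For illustration, a minimal profiling sketch might look like the following. It assumes the standard torch.profiler API and that loading the Intel Gaudi PyTorch bridge exposes an HPU profiler activity; model, inference_queries, and the output directory are placeholders taken from the loop above.

import torch
import habana_frameworks.torch as htorch  # loads the Intel Gaudi PyTorch bridge
from torch.profiler import profile, ProfilerActivity, schedule, tensorboard_trace_handler

# Profile a few inference steps and write a trace that can be inspected for
# gaps between device invocations, which indicate host overhead.
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.HPU],  # HPU activity assumed available
    schedule=schedule(wait=1, warmup=1, active=3),
    on_trace_ready=tensorboard_trace_handler("./profiler_logs"),
) as prof:
    for query in inference_queries:
        query_hpu = query.to("hpu")
        output = model(query_hpu).to("cpu")
        prof.step()  # advance the profiler schedule after each iteration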
The following are three techniques to lower host overhead and enhance your inference performance.
Using HPU Graphs¶
As described in Run Inference Using HPU Graphs, wrapping the forward() call in htorch.hpu.wrap_in_hpu_graph minimizes the time in which each line of Python code in the forward() call is evaluated by the Python interpreter. See the example below.
Consequently, minimizing this time allows the HPU to start copying the output and running the computation faster, improving throughput.
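A minimal sketch of this wrapping is shown below; MyModel and query_hpu are placeholders, and htorch refers to habana_frameworks.torch.

import torch
import habana_frameworks.torch as htorch

model = MyModel().eval().to("hpu")  # MyModel is a placeholder for your model
# Wrap the module so its forward() is captured and replayed as an HPU Graph,
# avoiding per-line Python and framework overhead on subsequent calls.
model = htorch.hpu.wrap_in_hpu_graph(model)

with torch.no_grad():
    output_hpu = model(query_hpu)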
Note
Using HPU Graphs for optimizing inference on Gaudi is highly recommended.
Using Asynchronous Copies¶
By default, the host thread that submits the computation to the device will wait for the copy operation to complete. However, by specifying the argument non_blocking=True during the copy operation, the Python thread can continue to execute other tasks while the copy occurs in the background.
To use Asynchronous Copies, replace query_hpu = query.to(hpu) with query_hpu = query.to(hpu, non_blocking=True).
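Applied to the loop from the beginning of this document, this looks roughly as follows (hpu and cpu are assumed to be the corresponding torch devices):

for query in inference_queries:
    # The host-to-device copy returns immediately and runs in the background
    query_hpu = query.to(hpu, non_blocking=True)
    output_hpu = model(query_hpu)
    # Device-to-host copies remain blocking; asynchronous copies are supported
    # from the host to the device only
    output = output_hpu.to(cpu)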
See our Wav2vec inference example.
The following is an example of the timing diagram:
Note
Asynchronous Copies are currently supported from the host to the device only.
Using Software Pipelining¶
This section describes how to achieve software pipelining using threads. This method can be applied to models in which an inference request goes through several stages.
For each inference iteration, waiting for the result and post-processing the result are dispatched on different threads. This allows each iteration to submit its work as fast as possible without being blocked by result collection and post-processing. Since the different threads run in parallel, throughput is limited only by the latency of the slowest pipeline stage. An example of this method can be found in the Wav2Vec Model Reference on GitHub. Applying this method to the Wav2Vec model fully utilized the device.
The following should be added to the inference model:
pool = multiprocessing.pool.ThreadPool(4) - Initializes a thread pool. It is common to align the number of threads with the number of pipeline stages. For example, the Wav2Vec model uses two pipeline stages: tokenize + compute and then decode. However, the Wav2Vec model uses twice that amount to have some leeway in scheduling and to overcome random hiccups on the host side.
logits_pinned.copy_(logits, non_blocking=True) - Utilizes a copy into pinned memory instead of regular memory. This allows the DMA on the device side to access the buffer.
pool.apply_async(sync, args=(stream_obj, processor, logits_pinned, args, decodes, e2e, predicted, ground_truth, ds, r, i, tokens, perfs, perf_start)) - In the original loop where software pipelining is not applied, the host thread blocks upon completion of the compute to copy back the results. Using software pipelining with the threads method defers this process to the thread pool. A thread is dispatched to perform the waiting, data copying, and decoding. Once the sample is done, the thread is returned to the pool.
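The following is a simplified sketch of this pattern, not the Wav2Vec reference code. The model, inference_queries, and post_process names are placeholders, and the use of htorch.hpu.synchronize() in the worker thread is an assumption; the actual reference example coordinates the copy and decode stages with streams and events.

import multiprocessing.pool

import torch
import habana_frameworks.torch as htorch

pool = multiprocessing.pool.ThreadPool(4)  # roughly twice the number of pipeline stages

def wait_and_decode(logits_host, results, index):
    # Runs on a worker thread: wait for the device to finish, then post-process on the host.
    # The reference example coordinates with streams and events instead of a full synchronize.
    htorch.hpu.synchronize()
    results[index] = post_process(logits_host)  # post_process is a placeholder

results = {}
for i, query in enumerate(inference_queries):
    query_hpu = query.to("hpu", non_blocking=True)
    logits = model(query_hpu)
    # Copy the result into pinned host memory so the device DMA can access the buffer.
    # One buffer per in-flight request keeps the sketch simple; the reference example reuses buffers.
    logits_pinned = torch.empty(logits.shape, dtype=logits.dtype, device="cpu", pin_memory=True)
    logits_pinned.copy_(logits, non_blocking=True)
    # Defer waiting and decoding to the thread pool so the next iteration can be submitted immediately
    pool.apply_async(wait_and_decode, args=(logits_pinned, results, i))

pool.close()
pool.join()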
The following is an example of the timing diagram in the steady state: