Triton Inference Server with Gaudi

This document provides instructions on deploying models using Triton on Gaudi. The process involves:

  • Creating a model repository and docker image

  • Launching a Triton Inference Server

  • Sending an Inference Request

The document is based on the Triton Inference Server Quick Start Guide.

Create a Model Repository and Docker Image

The Triton Inference Server is launched inside a docker container. The first step is to create a model repository which will be used by Triton to load your models. You can find a comprehensive guide in the following GitHub Repository.

Create a Model Repository

Create a model repository according to the structure detailed in Setting up the model repository and Model repository.

In this document, model.py is used as the <model-definition-file> which is defined when creating the model repository. Since python is used as the backend for Triton, you need to create a model.py that defines a TritonPythonModel class. An example can be found in Usage.
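As a rough sketch, such a model definition file might look like the following. The loading logic and tensor names are illustrative only; triton_python_backend_utils ships inside the Triton container, so it is imported lazily here to keep the sketch importable elsewhere:

```python
# Illustrative skeleton of a Triton Python backend model definition file.
class TritonPythonModel:
    def initialize(self, args):
        # Called once when Triton loads the model; args["model_config"]
        # holds the parsed config.pbtxt as a JSON string. Load your model
        # here and, for Gaudi, move it to the HPU device.
        self.model = None  # placeholder for the real model load

    def execute(self, requests):
        # triton_python_backend_utils is only available inside the Triton
        # container, so import it at call time rather than module import.
        import triton_python_backend_utils as pb_utils

        responses = []
        for request in requests:
            input0 = pb_utils.get_input_tensor_by_name(request, "INPUT0")
            batch = input0.as_numpy()  # shape matches dims in config.pbtxt
            # ... run inference on `batch` with self.model ...
            output0 = pb_utils.Tensor("OUTPUT0", batch.astype("float32"))
            responses.append(
                pb_utils.InferenceResponse(output_tensors=[output0])
            )
        return responses

    def finalize(self):
        # Optional cleanup when the model is unloaded.
        pass
```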

In order to enable your model on HPU, some modifications are required as detailed in PyTorch Model Porting and Run Inference Using Native PyTorch.


  • Currently, only Ubuntu20.04 is supported.

  • The number of models you deploy should be equal to or smaller than the number of cards you have in your server or container.

A config.pbtxt model configuration file is also created in the model repository. In this file, make sure KIND_CPU is used for instance_group. See the Model Configuration GitHub page for more details.

For example:

name: "bert_large_256"
backend: "python"
input [
    {
        name: "INPUT0"
        data_type: TYPE_INT64
        dims: [ -1, 256 ]
    }
]
output [
    {
        name: "OUTPUT0"
        data_type: TYPE_FP32
        dims: [ -1, 1024 ]
    }
]
instance_group [{ kind: KIND_CPU }]

Create a Docker Image for HPU

Since a Triton server is launched within a docker container, a docker image tailored for HPU is needed. Based on the guidelines detailed in the Setup_and_Install GitHub repository, you can build a Triton docker image for HPU using the following command:

./ triton ubuntu20.04

Run the Backend Container

With the server image and the model repository in place, the next step is to launch the Habana docker container for the Triton server using the command below, where ${image_name} is the docker image you built in the previous step:

docker run -it --runtime=habana --shm-size "4g" -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host --name triton_backend ${image_name}

Launch the Triton Server

As the container you built is not for any specific model, you need to install the necessary prerequisites for your model. After this, you can launch the Triton server to load the model and start the service.

  1. Install the required libraries based on your specific model requirements.

  2. Start the server inside the container, pointing --model-repository at your model repository on the local file system, by running the following command:

    $ tritonserver --model-repository=/path/to/model/repository ...

    For example:

    root@ip-172-31-8-104:/workspace/app# tritonserver --model-repository backend/
    # Launches the Triton server and downloads the transformer models, then optimizes and torchscripts the model with IPEX.
    I0426 23:05:22.486991 130]
    | Model          | Version | Status |
    | Model1         | 1       | READY  |
  3. After setting up the server, check the service status and port by running the following command from Verify Triton Is Running Correctly:

    curl -v localhost:8000/v2/health/ready
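The same readiness check can be scripted. Below is a minimal sketch using only the Python standard library; the endpoint path is the one from the curl command above, and the helper name is our own:

```python
import http.client

def triton_ready(host="localhost", port=8000, timeout=5.0):
    """Return True if Triton's HTTP health endpoint reports ready."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", "/v2/health/ready")
        # Triton answers 200 on this endpoint only when the server is ready.
        return conn.getresponse().status == 200
    except OSError:
        return False  # server not reachable (yet)
    finally:
        conn.close()
```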

Run the Client Container

After the server is up and running, you can use a client inside the container to make requests.

To use Triton on the client side with HPU, no specific changes are required. You can refer to the Building a client application section and other details in the client documentation to customize your script. Finally, run the script inside a docker container to make HTTP requests.

Create a Client Script

You can create a script using the tritonclient library, which is pre-installed in the docker image. In the script, you can define the request data, such as the type of data (text, image, etc.) and the IP address of the server for your model. Refer to the following example script:


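A minimal client sketch matching the example config.pbtxt above is shown below. The model name and shapes come from that example; the dummy input and the helper name are assumptions, and tritonclient is assumed to be pre-installed as described:

```python
import numpy as np

def run_inference(url="localhost:8000", model_name="bert_large_256"):
    # tritonclient is pre-installed in the docker image; it is imported
    # here so the sketch stays importable on machines without it.
    import tritonclient.http as httpclient

    client = httpclient.InferenceServerClient(url=url)

    # Input name, dtype, and shape must match config.pbtxt.
    token_ids = np.zeros((1, 256), dtype=np.int64)  # dummy request data
    infer_input = httpclient.InferInput("INPUT0", list(token_ids.shape), "INT64")
    infer_input.set_data_from_numpy(token_ids)

    result = client.infer(model_name, inputs=[infer_input])
    return result.as_numpy("OUTPUT0")  # (1, 1024) for the example config
```

Calling run_inference() from inside the client container sends one HTTP request to the server and returns the output tensor as a numpy array.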
The names of the inputs and outputs in the client script should be consistent with the names defined in config.pbtxt.

Launch and Run the Client

  1. After the server is ready, launch a client container by running the below command (on the same system), where ${image_name} is the image you built in the previous step:

    docker run -it  --runtime=habana --shm-size "4g" -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host --name triton_client ${image_name}
  2. Install the model prerequisites with the required libraries based on your specific model requirements.

  3. Run your script inside the container.

See Other client examples.

Check Results

You can view the outputs from your requests inside the client container, while the request status and logs can be found in the server container.