Triton Inference Server with Gaudi

This document provides instructions on deploying models using Triton with Intel® Gaudi® AI accelerator. The process involves:

  • Creating a model repository and docker image

  • Launching a Triton Inference Server

  • Sending an Inference Request

The document is based on the Triton Inference Server Quick Start Guide.

Note

A set of example files for setting up a Triton server is available in the Intel Gaudi Vault. Download these files first and then follow the instructions below.

Create a Model Repository and Docker Image

The Triton Inference Server is launched inside a docker container. The first step is to create a model repository, which Triton uses to load your models. A comprehensive guide is available in the Triton Inference Server GitHub repository.

Create a Model Repository

Create a model repository according to the structure detailed in Setting up the model repository and Model repository.

Follow the steps below; the resulting repository layout is shown after the list. This example uses the files downloaded from the Intel Gaudi Vault:

  • Create a new folder: mkdir -p /$HOME/models/llama2/1.

  • Copy utils.py and model.py from the Vault into the models/llama2/1 folder created above.

  • Copy config.pbtxt from the Vault into the models/llama2/ folder.

  • Install the Hugging Face Optimum Habana library: pip install optimum[habana].
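
After these steps, the model repository (under $HOME/models) should have the following layout:

models/
└── llama2/
    ├── config.pbtxt
    └── 1/
        ├── model.py
        └── utils.py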

In this document, model.py is used as the <model-definition-file> that is defined when creating the model repository. Since Python is used as the Triton backend, you can use the provided model.py, which defines a habana_args class. A generic example can be found in Usage.
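
For orientation, a Triton Python backend model file implements the TritonPythonModel interface (initialize, execute, finalize). The sketch below is only a generic placeholder that echoes the input string back; it is not the vault's model.py, and the logic inside execute() is an assumption for illustration:

import numpy as np
import triton_python_backend_utils as pb_utils  # provided by the Triton Python backend


class TritonPythonModel:
    def initialize(self, args):
        # Called once when the model is loaded; load your model/tokenizer here.
        pass

    def execute(self, requests):
        # Called for every batch of inference requests.
        responses = []
        for request in requests:
            in0 = pb_utils.get_input_tensor_by_name(request, "INPUT0")
            prompts = [t.decode("utf-8") for t in in0.as_numpy().flatten()]
            # Placeholder: echo the prompts back instead of running generation.
            out = np.array([p.encode("utf-8") for p in prompts], dtype=np.object_)
            responses.append(
                pb_utils.InferenceResponse(
                    output_tensors=[pb_utils.Tensor("OUTPUT0", out)]
                )
            )
        return responses

    def finalize(self):
        # Called once when the model is unloaded.
        pass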

To enable a generic model on Gaudi, some modifications are required as detailed in PyTorch Model Porting and Getting Started with Inference on Intel Gaudi.
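
As a rough, minimal sketch of what such a port involves (assuming PyTorch with the Habana bridge installed; the tiny model below is only a placeholder, see the linked guides for the authoritative steps), the main change is to target the hpu device:

import torch
import habana_frameworks.torch.core as htcore  # registers the Gaudi "hpu" device

device = torch.device("hpu")                       # target Gaudi instead of "cuda"/"cpu"
model = torch.nn.Linear(4, 4).eval().to(device)    # placeholder model for illustration
x = torch.randn(1, 4).to(device)                   # move inputs to the same device

with torch.no_grad():
    y = model(x)
htcore.mark_step()                                 # flush queued ops in lazy mode
print(y.to("cpu"))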

Note

The number of models you deploy should not exceed the number of Gaudi cards available in your server or container.

The config.pbtxt model configuration file is also created with the model repository. In this file, make sure KIND_CPU is used for instance_group. See the Model Configuration GitHub page for more details.

For example:

name: "llama2"
backend: "python"

input [
  {
    name: "INPUT0"
    data_type: TYPE_STRING
    dims: [ -1 ]
  }
]

output [
  {
    name: "OUTPUT0"
    data_type: TYPE_STRING
    dims: [ -1 ]
  }
]
instance_group [{ kind: KIND_CPU }]

Create a Docker Image for Gaudi

Since a Triton server is launched within a docker container, a docker image tailored for HPU is needed. Based on the guidelines detailed in the Setup_and_Install GitHub repository, you can build a Triton docker image for HPU using the following command:

cd triton
make build BUILD_OS=ubuntu22.04

Run the Backend Container

With the server image and the model repository in place, the next step is to launch the Intel Gaudi docker container for the Triton server using the command below, where ${image_name} is the docker image you built in the previous step. Note that the command mounts the /$HOME/models/llama2/1 folder created earlier into /root/models/llama2/1 inside the container.

docker run -it --runtime=habana --name triton_backend --shm-size "4g" -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host -v /$HOME/models/llama2/1:/root/models/llama2/1 ${image_name}

Launch the Triton Server

As the container you built is not for any specific model, you need to install the necessary prerequisites for your model. After this, you can launch the Triton server to load the model and start the service.

  1. Install the required libraries based on your specific model requirements.

  2. Start the server inside the container, pointing it at the model repository on the local file system, by running the following command:

    $ tritonserver --model-repository models/
    

    For example, when tritonserver starts up, the following table in its output indicates that your model is ready:

    root@ip-172-31-8-104:/workspace/app tritonserver --model-repository backend/
    
    
    I0103 00:04:58.435488 237 server.cc:626]
    +--------+---------+--------+
    | Model  | Version | Status |
    +--------+---------+--------+
    | llama2 | 1       | READY  |
    +--------+---------+--------+
    
  3. After setting up the server, check the service status and port by running the following command from Verify Triton Is Running Correctly (a Python alternative is sketched after this list):

    curl -v localhost:8000/v2/health/ready
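
As an optional alternative to curl, the tritonclient Python library (also used later by the client script) can query the same readiness information; a minimal sketch, assuming the default HTTP port 8000:

import tritonclient.http as httpclient

# Connect to the Triton server started in the previous step.
client = httpclient.InferenceServerClient(url="localhost:8000")
print("server ready:", client.is_server_ready())
print("llama2 ready:", client.is_model_ready("llama2"))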
    

Run the Client Container

After the server is up and running, you can use a client inside the container to make requests.

To use Triton on the client side with Gaudi, no specific changes are required. You can refer to the Building a client application section and other details in the client documentation to customize your script. Finally, run the script inside a Docker container to make HTTP requests.

Create a Client Script

Use the client.py from the Intel Gaudi Vault to run the actual inference using the Triton server.

This file is based on the tritonclient library, which is pre-installed in the docker image. In the script, you can define the request data, such as the type of data (text, image, etc.) and the IP address of the server for your model. Refer to the following example script for more information: simple_http_infer_client.py.
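
For illustration only (this is not the vault's client.py), a minimal HTTP client for the llama2 configuration shown earlier could look like the sketch below; the prompt text is an arbitrary example:

import numpy as np
import tritonclient.http as httpclient

# Connect to the Triton server started earlier (default HTTP port 8000).
client = httpclient.InferenceServerClient(url="localhost:8000")

# TYPE_STRING tensors are sent as BYTES; the prompt here is just an example.
prompts = np.array([b"Hello, my name is"], dtype=np.object_)
infer_input = httpclient.InferInput("INPUT0", list(prompts.shape), "BYTES")
infer_input.set_data_from_numpy(prompts)

requested_output = httpclient.InferRequestedOutput("OUTPUT0")
result = client.infer(model_name="llama2",
                      inputs=[infer_input],
                      outputs=[requested_output])
print(result.as_numpy("OUTPUT0"))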

Note

The input and output names used in client.py must be consistent with the names defined in config.pbtxt.

Launch and Run the Client

  1. After client.py is ready, launch a client container on the same system by running the command below, where ${image_name} is the image you built in the previous step. Make sure client.py is available inside the container, either by mounting it in the docker run command (for example, with an additional -v flag) or by copying it into the running container afterwards.

    docker run -it --runtime=habana --name triton_client --shm-size "4g" -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host ${image_name}
    
  2. Install the required libraries based on your specific model requirements.

  3. Run your script client.py inside the container.

See Other client examples.

Check Results

You can view the outputs from your requests inside the client container, while the request status and logs can be found in the server container.