Triton Inference Server with Gaudi
This document provides instructions on deploying models using Triton with the Intel® Gaudi® AI accelerator. The process involves:
Creating a model repository and Docker image
Launching a Triton Inference Server
Sending an Inference Request
The document is based on the Triton Inference Server Quick Start Guide.
Note
A set of example files for setting up a Triton server is available in the Intel Gaudi Vault. Download these files first and then follow the instructions below.
Create a Model Repository and Docker Image
The Triton Inference Server is launched inside a Docker container. The first step is to create a model repository which will be used by Triton to load your models. You can find a comprehensive guide in the following GitHub Repository.
Create a Model Repository
Create a model repository according to the structure detailed in Setting up the model repository and Model repository.
Follow the steps below. This example uses the files downloaded from the Intel Gaudi Vault:

1. Create a new folder:

   mkdir -p /$HOME/models/llama2/1

2. Copy the utils.py and model.py files from the vault into the /models/llama2/1 folder.

3. Copy the config.pbtxt file from the vault into the /models/llama2/ folder.

4. Install the Hugging Face Optimum for Intel Gaudi library:

   pip install optimum[habana]
In this document, model.py is used as the <model-definition-file> that is defined when creating the model repository. Since Python is used as the backend for Triton, you can use the provided model.py, which defines a habana_args class. A generic example can be found in Usage.
To enable a generic model on Gaudi, some modifications are required as detailed in PyTorch Model Porting and Getting Started with Inference on Intel Gaudi.
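For orientation, the following is a minimal sketch of a Triton Python backend model definition. It is not the vault's model.py (which defines the habana_args class and runs the model on the Gaudi device); the echo-style execute logic here is purely illustrative and only shows the TritonPythonModel structure and the INPUT0/OUTPUT0 names from config.pbtxt:

import numpy as np
import triton_python_backend_utils as pb_utils

class TritonPythonModel:
    def initialize(self, args):
        # args["model_config"] holds the model configuration as a JSON string.
        # A real model.py would load the Llama 2 checkpoint onto the Gaudi device here.
        self.model_config = args["model_config"]

    def execute(self, requests):
        responses = []
        for request in requests:
            # Read the TYPE_STRING input declared as INPUT0 in config.pbtxt.
            prompts = pb_utils.get_input_tensor_by_name(request, "INPUT0").as_numpy()
            # Placeholder logic: echo the prompts back. A real model.py would
            # run text generation on the HPU instead.
            output = pb_utils.Tensor("OUTPUT0", np.array(list(prompts), dtype=object))
            responses.append(pb_utils.InferenceResponse(output_tensors=[output]))
        return responses

    def finalize(self):
        # Release any resources held by the model instance.
        pass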
Note
The number of models you deploy should not exceed the number of Gaudi cards available in your server or container.
The config.pbtxt model configuration file is also created with the model repository. In this file, make sure KIND_CPU is used for instance_group. See the Model Configuration GitHub page for more details.
For example:
name: "llama2"
backend: "python"
input [
{
name: "INPUT0"
data_type: TYPE_STRING
dims: [ -1 ]
}
]
output [
{
name: "OUTPUT0"
data_type: TYPE_STRING
dims: [ -1 ]
}
]
instance_group [{ kind: KIND_CPU }]
Create a Docker Image for Gaudi
Since a Triton server is launched within a Docker container, a Docker image tailored for HPU is needed. Based on the guidelines detailed in the Setup_and_Install GitHub repository, you can build a Triton Docker image for HPU using the following command:
cd triton
make build BUILD_OS=ubuntu22.04
Run the Backend Container
With the server image and the model repository in place, the next step is to launch the Intel Gaudi Docker container for the Triton server using the command below, where ${image_name} is the Docker image you built in the previous step. Note that the command mounts the /$HOME/models/llama2/1 folder created earlier into the container at /root/models/llama2/1 as a shared directory.
docker run -it --runtime=habana --name triton_backend --shm-size "4g" -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host -v /$HOME/models/llama2/1:/root/models/llama2/1 ${image_name}
Launch the Triton Server
Since the container you built is not tailored to any specific model, you need to install the necessary prerequisites for your particular model. Once this is done, you can proceed to launch the Triton server to load the model and start the service.
Install the required libraries based on your specific model requirements.
Start the server inside the container, pointing it to the model repository on the local file system, by running the following command:
$ tritonserver --model-repository models/
For example, the following output from starting tritonserver includes a table indicating that your model is ready:
root@ip-172-31-8-104:/workspace/app tritonserver --model-repository backend/
I0103 00:04:58.435488 237 server.cc:626]
+--------+---------+--------+
| Model  | Version | Status |
+--------+---------+--------+
| llama2 | 1       | READY  |
+--------+---------+--------+
After setting up the server, check the service status and port by running the following command from Verify Triton Is Running Correctly:
curl -v localhost:8000/v2/health/ready
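If you prefer to check readiness from Python instead of curl, the tritonclient library provides equivalent calls. This is a brief sketch; the localhost address and the llama2 model name are taken from the example above:

import tritonclient.http as httpclient

# Connect to the Triton HTTP endpoint started above.
client = httpclient.InferenceServerClient(url="localhost:8000")
print("Server ready:", client.is_server_ready())
print("Model ready:", client.is_model_ready("llama2"))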
Run the Client Container
After the server is up and running, you can use a client running inside a container to make requests.
To use Triton on the client side with Gaudi, no specific changes are required. You can refer to the Building a client application section and other details in the client documentation to customize your script. Finally, run the script inside a Docker container to make HTTP requests.
Create a Client Script
Use the client.py from the Intel Gaudi Vault to run the actual inference using the Triton server. This file is based on the tritonclient library, which is preinstalled in the Docker image. In the script, you can define the request data, such as the type of data (text, image, etc.) and the IP address of the server for your model. Refer to the following example script for more information: simple_http_infer_client.py.
Note
The names of inputs and outputs in client.py should be consistent with the names defined in config.pbtxt.
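For illustration, below is a minimal client sketch based on the tritonclient HTTP API. It is not the vault's client.py; the prompt text and server address are placeholders. It sends one TYPE_STRING request to the llama2 model using the INPUT0 and OUTPUT0 names from config.pbtxt:

import numpy as np
import tritonclient.http as httpclient

# Server address is a placeholder; point it at your Triton server.
client = httpclient.InferenceServerClient(url="localhost:8000")

# TYPE_STRING tensors are sent as BYTES with an object-dtype NumPy array.
prompt = np.array(["Hello, how are you?"], dtype=object)
infer_input = httpclient.InferInput("INPUT0", [1], "BYTES")
infer_input.set_data_from_numpy(prompt)

# Request the OUTPUT0 tensor declared in config.pbtxt.
requested_output = httpclient.InferRequestedOutput("OUTPUT0")

# Send the request to the llama2 model and print the returned text.
response = client.infer(model_name="llama2", inputs=[infer_input], outputs=[requested_output])
print(response.as_numpy("OUTPUT0"))

The input is declared with the BYTES wire type because Triton transmits TYPE_STRING tensors as serialized byte arrays.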
Launch and Run the Client
After the client.py is ready, launch a client container by running the below command (on the same system), where ${image_name} is the image you built in the previous step. Make sure client.py is available inside the container, either by passing it in the docker run command (for example, as a mounted volume) or by copying it into the running container afterwards.

docker run -it --runtime=habana --name triton_client --shm-size "4g" -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host ${image_name}
Install the required libraries based on your specific model requirements.
Run your client.py script inside the container.
Check Results
You can view the outputs of your requests inside the client container, while the request status and logs can be found in the server container.