Triton Inference Server with Gaudi¶
This document provides instructions on deploying models using Triton on Gaudi. The process involves:
Creating a model repository and docker image
Launching a Triton Inference Server
Sending an Inference Request
The document is based on the Triton Inference Server Quick Start Guide.
Create a Model Repository and Docker Image¶
The Triton Inference Server is launched inside a Docker container. The first step is to create a model repository, which Triton uses to load your models. You can find a comprehensive guide in the Triton server GitHub repository.
Create a Model Repository¶
Create a model repository according to the structure detailed in Setting up the model repository and Model repository.
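For reference, a minimal repository layout for the Python backend typically looks like the following (the model name bert_large_256 matches the example configuration later in this document; your model name may differ):
model_repository/
  bert_large_256/
    config.pbtxt
    1/
      model.py
The config.pbtxt file sits directly under the model directory, and each numbered subdirectory holds one version of the model definition.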
In this document, model.py is used as the <model-definition-file>, which is defined when creating the model repository. Since Python is used as the backend for Triton, you need to create a model.py that defines a TritonPythonModel class. An example can be found in Usage.
In order to enable your model on HPU, some modifications are required as detailed in PyTorch Model Porting and Run Inference Using Native PyTorch.
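The following is a minimal, illustrative sketch of such a model.py. It is not the authoritative implementation: the INPUT0/OUTPUT0 names and shapes are assumptions chosen to match the example config.pbtxt below, and the torch.nn.Linear model is a hypothetical placeholder for your actual model. Refer to the Python backend documentation and the HPU porting guides above for full details.
import numpy as np
import torch
import habana_frameworks.torch.core as htcore  # importing this registers the HPU backend
import triton_python_backend_utils as pb_utils

class TritonPythonModel:
    def initialize(self, args):
        # Placeholder: load or build your real model here, then move it to the HPU device.
        self.device = torch.device("hpu")
        self.model = torch.nn.Linear(256, 1024).to(self.device)  # hypothetical model
        self.model.eval()

    def execute(self, requests):
        responses = []
        for request in requests:
            # "INPUT0" / "OUTPUT0" match the names used in the example config.pbtxt.
            input0 = pb_utils.get_input_tensor_by_name(request, "INPUT0")
            tokens = torch.from_numpy(input0.as_numpy()).to(self.device)
            with torch.no_grad():
                output = self.model(tokens.to(torch.float32))
            out = pb_utils.Tensor("OUTPUT0", output.cpu().numpy().astype(np.float32))
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses

    def finalize(self):
        # Optional cleanup when the model is unloaded.
        self.model = None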
Note
Currently, only Ubuntu 20.04 is supported.
The number of models you deploy should not exceed the number of cards in your server or container.
A config.pbtxt model configuration file is also created with the model repository. In this file, make sure KIND_CPU is used for instance_group. See the Model Configuration GitHub page for more details.
For example:
name: "bert_large_256"
backend: "python"
input [
{
name: "INPUT0"
data_type: TYPE_INT64
dims: [ -1, 256 ]
}
]
output [
{
name: "OUTPUT0"
data_type: TYPE_FP32
dims: [ -1, 1024 ]
}
]
instance_group [{ kind: KIND_CPU }]
Create a Docker Image for HPU¶
Since a Triton server is launched within a docker container, a docker image tailored for HPU is needed. Based on the guidelines detailed in the Setup_and_Install GitHub repository, you can build a Triton docker image for HPU using the following command:
./docker_build.sh triton ubuntu20.04
Run the Backend Container¶
With the server image and the model repository in place, the next step is to launch the Habana Docker container for the Triton server using the below command, where ${image_name} is the Docker image you built in the previous step:
docker run -it --runtime=habana --shm-size "4g" -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host --name triton_backend ${image_name}
Launch the Triton Server¶
As the container you built is not for any specific model, you need to install the necessary prerequisites for your model. After this, you can launch the Triton server to load the model and start the service.
Install the required libraries based on your specific model requirements.
Start the server inside the container, pointing --model-repository at the model repository on the local file system, by running the following command:
$ tritonserver --model-repository=/path/to/model/repository ...
For example:
root@ip-172-31-8-104:/workspace/app# tritonserver --model-repository backend/
# Launches the Triton server, downloads the transformer models, and optimizes and torchscripts the model with IPEX.
I0426 23:05:22.486991 130 server.cc:594]
+----------------+---------+--------+
| Model          | Version | Status |
+----------------+---------+--------+
| Model1         | 1       | READY  |
+----------------+---------+--------+
After setting up the server, check the service status and port by running the following command from Verify Triton Is Running Correctly:
curl -v localhost:8000/v2/health/ready
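If you prefer to check readiness programmatically, a minimal sketch using the tritonclient library (assuming the server listens on the default HTTP port 8000 and serves the example bert_large_256 model) is:
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")
print("server ready:", client.is_server_ready())
print("model ready:", client.is_model_ready("bert_large_256"))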
Run the Client Container¶
After the server is up and running, you can send requests from a client running inside a separate container.
To use Triton on the client side with HPU, no specific changes are required. You can refer to the Building a client application section and other details in the client documentation to customize your script. Finally, run the script inside a Docker container to make HTTP requests.
Create a Client Script¶
You can create a script, client.py, using the tritonclient library, which is pre-installed in the Docker image.
In the script, you can define the request data, such as the type of data (text, image, etc.) and the IP address of the server for your model.
Refer to the following example script: simple_http_infer_client.py.
Note
The names of inputs and outputs in client.py should be consistent with the names defined in config.pbtxt.
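For illustration, a minimal client.py might look like the sketch below. It assumes the example bert_large_256 model with the INPUT0/OUTPUT0 names and shapes from the config.pbtxt shown earlier, and a server on the default HTTP port 8000; adjust these for your deployment.
import numpy as np
import tritonclient.http as httpclient

MODEL_NAME = "bert_large_256"  # model name from the example config.pbtxt
SERVER_URL = "localhost:8000"  # default Triton HTTP port

client = httpclient.InferenceServerClient(url=SERVER_URL)

# Build a dummy batch of token IDs with shape [1, 256], matching INPUT0 in config.pbtxt.
token_ids = np.random.randint(0, 30522, size=(1, 256), dtype=np.int64)

inputs = [httpclient.InferInput("INPUT0", list(token_ids.shape), "INT64")]
inputs[0].set_data_from_numpy(token_ids)
outputs = [httpclient.InferRequestedOutput("OUTPUT0")]

# Send the inference request and read OUTPUT0 back as a numpy array.
result = client.infer(MODEL_NAME, inputs, outputs=outputs)
print("OUTPUT0 shape:", result.as_numpy("OUTPUT0").shape)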
Launch and Run the Client¶
After client.py is ready, launch a client container by running the below command (on the same system), where ${image_name} is the image you built in the previous step:
docker run -it --runtime=habana --shm-size "4g" -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host --name triton_client ${image_name}
Install the model prerequisites, that is, the required libraries for your specific model.
Run your script client.py inside the container.
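For example, assuming client.py is in the container's current working directory:
python client.py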
Check Results¶
You can view the outputs from your requests inside the client container, while the request status and logs can be found in the server container.