AI Model Serving with Intel Gaudi

AI model serving involves deploying and managing machine learning models in a production environment, making them accessible for inference through an API or other interfaces. Each tool has its own strengths, so the choice depends on the requirements of your AI application, the type of models you are deploying, and the performance characteristics you need. The following describes the AI model serving tools with Intel® Gaudi® AI accelerator support:

TorchServe

Description: TorchServe is an open-source model serving framework for PyTorch that allows you to deploy PyTorch models at scale.

Features:

  • Easy deployment of PyTorch models.

  • Multi-model serving and model versioning.

  • Built-in metrics for monitoring.

  • RESTful endpoints for inference.

Use Cases: When using PyTorch models and looking for a native serving solution; suitable for batch and real-time inference. A minimal client sketch follows this entry. See TorchServe Inference Server with Gaudi.
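
For illustration only, the following Python sketch calls TorchServe's REST inference endpoint. It assumes a server is already running on the default port 8080 with a model registered under the hypothetical name my_model, and that kitten.jpg is a sample input the model's handler can process.

    # Minimal TorchServe inference client sketch.
    # Assumes a running server on localhost:8080 and a registered model
    # named "my_model" (hypothetical); the server-side handler defines how
    # the payload is interpreted.
    import requests

    # TorchServe exposes predictions at /predictions/<model_name>.
    url = "http://localhost:8080/predictions/my_model"

    with open("kitten.jpg", "rb") as f:   # sample input file (hypothetical)
        response = requests.post(url, data=f)

    print(response.status_code, response.json())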

Text Generation Inference (TGI)

Description: Hugging Face’s Text Generation Inference (TGI) is an optimized serving solution for large language models, particularly those used in text generation.

Features:

  • Optimized for high throughput and low latency.

  • Supports large language models like GPT-3.

  • Integration with Hugging Face Transformers.

Use Cases: When serving large language models specifically for text generation tasks, and applications requiring high-performance text generation. A minimal client sketch follows this entry. See the TGI-Gaudi fork on GitHub.
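
For illustration only, the sketch below sends a request to TGI's /generate REST endpoint. It assumes a TGI (or TGI-Gaudi) container is already serving a model on localhost:8080; the prompt and generation parameters are arbitrary examples.

    # Minimal TGI client sketch.
    # Assumes a TGI server already listening on localhost:8080.
    import requests

    payload = {
        "inputs": "Explain what an HPU is in one sentence.",
        "parameters": {"max_new_tokens": 64, "temperature": 0.7},
    }

    # TGI's /generate endpoint returns the generated text as JSON.
    response = requests.post("http://localhost:8080/generate", json=payload)
    print(response.json()["generated_text"])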

Virtual Large Language Model (vLLM)

Description: vLLM is designed to serve large language models efficiently; it manages the attention key-value cache with virtual-memory-style paging (PagedAttention), which reduces the memory footprint and improves inference efficiency.

Features:

  • Efficient memory management for large models.

  • High throughput and low latency.

  • Supports model parallelism.

Use Cases: When deploying extremely large language models that exceed typical memory constraints, and scenarios requiring efficient resource utilization and high-performance serving. A minimal usage sketch follows this entry. See the Intel Gaudi vLLM fork on GitHub.
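
As a sketch only, the following uses vLLM's offline Python API to generate text. The model name is an arbitrary example, and the Intel Gaudi vLLM fork may require fork-specific installation and environment setup that is not shown here.

    # Minimal vLLM offline inference sketch.
    # The model name is an example; the Gaudi fork may need its own setup.
    from vllm import LLM, SamplingParams

    llm = LLM(model="meta-llama/Llama-2-7b-hf")      # load the model once
    params = SamplingParams(temperature=0.7, max_tokens=64)

    outputs = llm.generate(["Explain what an HPU is."], params)
    for out in outputs:
        print(out.outputs[0].text)                   # first sampled completion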

Triton Inference Server

Description: Triton Inference Server is an open-source solution for fast and scalable model deployment, used to serve models with accelerated inference performance on HPUs.

Features:

  • Lets you build model repositories that are ready and optimized for inference requests.

  • Runs inside a Docker container and includes the models and metadata (configuration files, versions, and so on) needed to serve them.

  • Handles user inference requests by managing model repositories, HPU resources, and backends.

Use Cases: When you need to efficiently deploy multiple models with different requirements and configurations while managing each serving configuration independently. A minimal client sketch follows this entry. See Triton Inference Server with Gaudi.
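
For illustration only, the sketch below queries a model deployed on Triton using the Python HTTP client (tritonclient). The model name and the input/output tensor names, shape, and datatype are hypothetical and must match the model's config.pbtxt in your repository.

    # Minimal Triton HTTP client sketch.
    # Model and tensor names are hypothetical; they must match the
    # config.pbtxt of the deployed model.
    import numpy as np
    import tritonclient.http as httpclient

    client = httpclient.InferenceServerClient(url="localhost:8000")

    # Build the input tensor; name, shape, and datatype come from the model config.
    data = np.random.rand(1, 3, 224, 224).astype(np.float32)
    infer_input = httpclient.InferInput("INPUT__0", list(data.shape), "FP32")
    infer_input.set_data_from_numpy(data)

    # Run inference and read back the named output tensor.
    result = client.infer(model_name="my_model", inputs=[infer_input])
    print(result.as_numpy("OUTPUT__0"))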