AI Model Serving with Intel Gaudi
AI model serving involves deploying and managing machine learning models in a production environment and making them accessible for inference through an API or other interface. Each serving tool has its own strengths, so the right choice depends on the requirements of your AI application, the type of models you are deploying, and the performance characteristics you need. The following table describes the AI model serving tools that support Intel® Gaudi® AI accelerators:

| Model Serving Tool | Description | Features | Use Cases |
|---|---|---|---|
| TorchServe | TorchServe is an open-source model serving framework for PyTorch that lets you deploy PyTorch models at scale. | | For PyTorch models that need a native serving solution. Suitable for batch and real-time inference. See TorchServe Inference Server with Gaudi. |
| Text Generation Inference (TGI) | Hugging Face’s Text Generation Inference (TGI) is an optimized serving solution for large language models, particularly those used for text generation. | | For serving large language models in text generation tasks, and for applications that require high-performance text generation. See TGI-Gaudi fork on GitHub. |
| Virtual Large Language Model (vLLM) | vLLM is designed to serve large language models efficiently by managing the attention key-value cache in pages, in the manner of virtual memory, which reduces the memory footprint and improves inference efficiency. | | For deploying extremely large language models that exceed typical memory constraints, and for scenarios that require efficient resource utilization and high-performance serving. See Intel Gaudi vLLM fork on GitHub. |
| Triton (Inference Server) | Triton Inference Server is an open-source solution for fast and scalable model deployment. With Gaudi, it serves models with accelerated inference performance on HPUs. | | For efficiently deploying multiple models with different requirements and configurations while managing each serving configuration independently. See Triton Inference Server with Gaudi. |
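
Each of these tools exposes inference over HTTP, so a client request has a similar shape across them. The following is a minimal sketch that builds, but does not send, a request for each tool. It assumes the Gaudi builds expose the same HTTP paths as the upstream projects. The base URL, the model name `my-model`, the TorchServe body format, and the Triton tensor name `text_input` are placeholders for illustration and depend on your deployment.

```python
import json
from urllib import request


def build_inference_request(tool: str, base_url: str, prompt: str,
                            model: str = "my-model") -> request.Request:
    """Build (without sending) an HTTP POST for one of the serving tools.

    base_url and model are placeholders; the paths follow each tool's
    upstream HTTP API and are assumed to apply to the Gaudi builds as well.
    """
    if tool == "torchserve":
        # TorchServe inference API: POST /predictions/<model_name>
        path = f"/predictions/{model}"
        payload = {"data": prompt}  # assumed body; the real format depends on the model handler
    elif tool == "tgi":
        # TGI: POST /generate with the prompt under "inputs"
        path = "/generate"
        payload = {"inputs": prompt, "parameters": {"max_new_tokens": 32}}
    elif tool == "vllm":
        # vLLM OpenAI-compatible server: POST /v1/completions
        path = "/v1/completions"
        payload = {"model": model, "prompt": prompt, "max_tokens": 32}
    elif tool == "triton":
        # Triton HTTP/REST (KServe v2 protocol): POST /v2/models/<model>/infer
        # The tensor name "text_input" is hypothetical and model-specific.
        path = f"/v2/models/{model}/infer"
        payload = {"inputs": [{"name": "text_input", "shape": [1],
                               "datatype": "BYTES", "data": [prompt]}]}
    else:
        raise ValueError(f"unknown serving tool: {tool}")

    return request.Request(
        url=base_url.rstrip("/") + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_inference_request("tgi", "http://localhost:8080", "What is Gaudi?")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # To send it against a running server:
    # with request.urlopen(req) as resp:
    #     print(resp.read().decode("utf-8"))
```

Separating request construction from sending makes it possible to inspect the payload before a server is running. Check the documentation linked for each tool for the ports and request fields your deployment actually uses.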