Welcome to Habana® Gaudi® v1.7 Documentation
APIs
Learn how to use the APIs provided by various SynapseAI libraries.
Habana Collective Communications Library (HCCL) API Reference
Habana Labs Management Library (HLML) C API Reference
Habana Labs Management Library (PYHLML) Python API Reference
Habana TensorFlow Python API
Habana PyTorch Python API