• Welcome to Habana® Gaudi® v1.10 Documentation

Management and Monitoring

The following tools and libraries help you manage and monitor your Gaudi server:

  • Qualification Library Guide (hl_qual Tool)
  • System Management Interface Tool User Guide (hl-smi Tool)
  • Habana Labs Management Library (HLML) C API Reference
  • Habana Labs Management Library (PYHLML) Python API Reference

By Habana Labs
© Copyright 2023, Habana Labs.