Intel Tiber AI Cloud Quick Start Guide

This document provides instructions on setting up the Intel® Gaudi® 2 AI accelerator instance on the Intel® Tiber™ AI Cloud and running models from the Intel Gaudi Model References repository and the Hugging Face Optimum for Intel Gaudi library.

Please follow along with the video on our Developer Page to walk through the steps below. To set up a multi-server instance with two or more Gaudi nodes, refer to Setting up a Multi-Server Environment.

Creating an Account and Getting an Instance

Follow the steps below to get access to the Intel Tiber AI Cloud and launch a Gaudi 2 instance.

  1. Go to https://console.cloud.intel.com and select Get Started to create an account and get SSH access.

  2. Go to “Console Home” and select “Catalog > Hardware”:

  3. Select the “Gaudi 2 Deep Learning Server”:

  4. In the instance Configuration window, enter an instance name and select the SSH key that you created in the Getting Started section. Click “Launch”. You will see that the node is being provisioned:

  5. Once the State has changed from “provisioning” to “ready”, click on the instance name. Then select the “How to Connect” box:

  6. You will then see all the options to SSH into the Intel Tiber AI Cloud instance. Copy the SSH command and paste it into your terminal window:


Note

If you do not have access to the Gaudi 2 instance, request to be added to the wait list: click the “Preview Catalog” link at the top of the screen and, on the Preview Catalog page, select the specific hardware to request an instance.

Start Training a PyTorch Model on Gaudi 2

Note

For partners not using the Intel Tiber AI Cloud, follow the instructions starting from this section to run models using Gaudi.

Now that the instance has been created, start with some simple model examples from the Intel Gaudi Model References GitHub repository.

  1. Run the hl-smi tool to confirm the Intel Gaudi software version used on your Intel Tiber AI Cloud instance. You will need to use the correct software version in the docker run and git clone commands. Use the HL-SMI Version at the top. In this case, the version is 1.18.0:

       HL-SMI Version:       hl-1.18.0-XXXXXXX
       Driver Version:       1.18.0-XXXXXX
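
    To pull out just that line, you can filter the output (assuming hl-smi is on your PATH, as it is on Gaudi instances):

       hl-smi | grep "HL-SMI Version"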
    
  2. Run the Intel Gaudi Docker image:

       docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest
    

    Note

    You may see this error message after running the above docker command:

    docker: permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Head "http://%2Fvar%2Frun%2Fdocker.sock/_ping": dial unix /var/run/docker.sock: connect: permission denied.

    In such a case, you need to add your user to the docker group with this command:

    sudo usermod -a -G docker $USER

    Exit the terminal and re-enter. Your original docker command should work without any issues.

  3. Clone the Model References repository inside the container that you have just started:

       cd ~
       git clone -b 1.18.0 https://github.com/HabanaAI/Model-References.git
    
  4. Move to the subdirectory containing the hello_world example:

    cd Model-References/PyTorch/examples/computer_vision/hello_world/
    
  5. Update the environment variables to point to the location of the Model References repository and set PYTHON to the Python executable:

    export PYTHONPATH=$PYTHONPATH:/root/Model-References
    export PYTHON=/usr/bin/python3.10
    

    Note

    The Python version depends on the operating system. Refer to the Support Matrix for a full list of supported operating systems and Python versions.
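
    If you are not sure which Python executable your container provides, one option (a sketch; confirm that the resolved version matches the Support Matrix) is to let the shell resolve it:

    export PYTHON=$(which python3)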

Training Examples
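
As a minimal illustration, a single-card MNIST run from the hello_world directory above might look like the following (the script name and flags are taken from the hello_world example and should be checked against the repository's README):

    # train MNIST for one epoch on a single Gaudi device
    $PYTHON mnist.py --batch-size=64 --epochs=1 --lr=1.0 --gamma=0.7 --hpu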

Next Steps

For next steps you can refer to the following:

  • To explore more models from the Model References, start here.

  • To run more examples using Hugging Face go here.

  • To migrate other models to Gaudi, refer to PyTorch Model Porting.

Setting up a Multi-Server Environment

Follow these steps to manually set up a multi-server environment. This example shows how to set up two Gaudi nodes on the Intel Tiber AI Cloud.

Initial Setup

These generic settings should be applied on any platform:

  1. Make sure the CPUs on the node are set to performance mode:

    cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
    echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
    cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
    
  2. Set the number of huge pages on all hosts to support the models:

    sudo sysctl -w vm.nr_hugepages=150000
    cat /proc/meminfo | grep HugePages_Total | awk '{ print $2 }'
    
  3. Make sure all external ports are set to ON for each host:

    /opt/habanalabs/qual/gaudi2/bin/manage_network_ifs.sh  --status
    /opt/habanalabs/qual/gaudi2/bin/manage_network_ifs.sh  --up
    
  4. (Optional) Generate an /etc/gaudinet.json file for L3-based scaling. This is not required for L2-based scaling, and it is not needed for this two-node example; a sketch of the file's format is shown below.
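
    A minimal sketch of such a file follows. The field names are assumptions based on the Intel Gaudi scale-out network documentation and the addresses are placeholders; verify the exact schema against that documentation before use:

    cat > /etc/gaudinet.json <<'EOF'
    {
        "NIC_NET_CONFIG": [
            {
                "NIC_MAC": "xx:xx:xx:xx:xx:xx",
                "NIC_IP": "192.168.1.101",
                "SUBNET_MASK": "255.255.255.0",
                "GATEWAY_MAC": "yy:yy:yy:yy:yy:yy"
            }
        ]
    }
    EOF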

Get the Host IP Address

You will need the IP addresses of all the nodes in the cluster. The “Instances” window shows the Host IP address for each node; in the example below, the IP address is 100.83.89.152. Collect the IP addresses of all your configured nodes.
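
As a quick connectivity check (using the example address above; substitute the addresses of your own nodes), confirm that each host can reach the others:

    ping -c 3 100.83.89.152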


Start the Intel Gaudi Docker and Share SSH Keys

  1. Start the Intel Gaudi Docker on each of the host machines:

       docker run -it --name pytorch_gaudi2 --runtime=habana --privileged -v /sys/fs/cgroup:/sys/fs/cgroup -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host  vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest
    
  2. Share the local .ssh key pair across the nodes to allow connectivity between them. Run the following commands to copy your SSH keys into the container on each node:

    docker exec -it pytorch_gaudi2 mkdir -p /root/.ssh
    docker cp id_rsa  pytorch_gaudi2:/root/.ssh/
    docker cp id_rsa.pub  pytorch_gaudi2:/root/.ssh/
    docker cp authorized_keys pytorch_gaudi2:/root/.ssh/
    
  3. Set the proper permissions and generate SSH host keys in each Docker container. Make sure to replace the {IP_ADDRESS_NODE0} and {IP_ADDRESS_NODE1} in the commands below with the actual IP addresses of your nodes:

    chmod 700 /root/.ssh
    chmod 644 /root/.ssh/id_rsa.pub
    chmod 600 /root/.ssh/id_rsa
    chmod 644 /root/.ssh/authorized_keys
    sed -i 's/#Port 22/Port 3022/g' /etc/ssh/sshd_config
    sed -i 's/#   Port 22/    Port 3022/g' /etc/ssh/ssh_config
    sed -i 's/#PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config
    /usr/bin/ssh-keygen -A
    service ssh restart
    ssh-keyscan -p 3022 -H {IP_ADDRESS_NODE0} >> ~/.ssh/known_hosts
    ssh-keyscan -p 3022 -H {IP_ADDRESS_NODE1} >> ~/.ssh/known_hosts
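
    To confirm the containers can now reach each other without a password prompt (a quick check, assuming the key pair copied above is authorized on both nodes), run the following from one container; it should print the remote hostname:

    ssh -p 3022 root@{IP_ADDRESS_NODE1} hostname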
    

Test Using hccl_demo

Running hccl_demo tests basic node-to-node communication. The first run tests each node individually, and the second uses both nodes together.

  1. Set up the test:

    cd /root
    git clone https://github.com/HabanaAI/hccl_demo.git
    cd hccl_demo/
    make clean
    
  2. Run the first test: a single-node run using the command below on each node individually:

    HCCL_COMM_ID=127.0.0.1:5555 python3 run_hccl_demo.py --nranks 8 --node_id 0 --size 256m --test all_reduce --loop 1000 --ranks_per_node 8
    
  3. Run the second test: a two-node hccl_demo run. Make sure to replace the {IP_ADDRESS_NODE0} and {IP_ADDRESS_NODE1} in the command below with the actual IP addresses of your nodes:

    python3 run_hccl_demo.py --test all_reduce --loop 1000 --size 32m -mpi --host {IP_ADDRESS_NODE0}:8,{IP_ADDRESS_NODE1}:8 --mca btl_tcp_if_include ens7f1
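
    The interface name passed to --mca btl_tcp_if_include (ens7f1 above) is specific to this example environment. If your hosts differ, list the interfaces and pick the one that carries the host IP address, for example:

    ip -brief addr show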
    

Test Using DeepSpeed

The next test uses the LLaMA 13B model from the Megatron-DeepSpeed GitHub repository.

  1. Prepare a shared directory accessible to all the nodes and clone the Megatron-DeepSpeed repository into it from one node:

       cd /shared_dir
       git clone https://github.com/HabanaAI/Megatron-DeepSpeed.git
    
  2. Set up the environment variables and install necessary modules on all the nodes:

       cd /shared_dir/Megatron-DeepSpeed
       export MEGATRON_DEEPSPEED_ROOT=/path/to/Megatron-DeepSpeed
       export PYTHONPATH=$MEGATRON_DEEPSPEED_ROOT/:$PYTHONPATH
       pip install -r megatron/core/requirements.txt
       pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.18.0
       apt update
       apt install pdsh -y
    
  3. Update the two IP addresses in scripts/hostsfile with the ones from your nodes:

    10.10.100.101 slots=8
    10.10.100.102 slots=8
    
  4. Follow the Dataset Preparation instructions to acquire the full (500 GB+) oscar-en dataset, or apply the following steps from one node to create a small (0.5 GB), customized RedPajama dataset to check the connectivity between the nodes:

    mkdir -p /shared_dir/redpajama
    cd /shared_dir/redpajama
    # download the RedPajama dataset list file and pick only the first jsonl, which is arxiv
    wget 'https://data.together.xyz/redpajama-data-1T/v1.0.0/urls.txt'
    head -n 1 urls.txt > first_jsonl.txt
    
    while read line; do
      dload_loc=${line#https://data.together.xyz/redpajama-data-1T/v1.0.0/}
      mkdir -p $(dirname $dload_loc)
      wget "$line" -O "$dload_loc"
    done < first_jsonl.txt
    
    # download the tokenizer file corresponding to the target model, e.g. LLaMA 13B
    wget -O tokenizer.model "https://huggingface.co/huggyllama/llama-13b/resolve/main/tokenizer.model"
    
    # install necessary modules for data preparation
    pip install nltk sentencepiece
    mkdir -p arxiv_tokenized
    python $MEGATRON_DEEPSPEED_ROOT/tools/preprocess_data.py --input arxiv/*.jsonl \
          --output-prefix arxiv_tokenized/meg-gpt2 --tokenizer-model ./tokenizer.model \
          --append-eod --tokenizer-type GPTSentencePieceTokenizer --workers 64
    # use the tokenized files from above step to train
    
  5. Choose the oscar-en dataset or the customized RedPajama dataset to run a multi-server test. For example, run the command below from one of the two nodes:

    cd $MEGATRON_DEEPSPEED_ROOT
    mkdir -p out_llama
    # DATA_DIR_ROOT=/data/pytorch/megatron-gpt/oscar-en
    DATA_DIR_ROOT=/shared_dir/redpajama
    HL_DATA_FILE_PREFIX=meg-gpt2_text_document HL_RESULTS_DIR=out_llama \
          HL_DATA_DIR_ROOT=${DATA_DIR_ROOT}/arxiv_tokenized HL_HOSTSFILE=scripts/hostsfile \
          HL_TOKENIZER_MODEL=${DATA_DIR_ROOT}/tokenizer.model HL_NUM_NODES=2 HL_PP=2 HL_TP=2 HL_DP=4 scripts/run_llama.sh