Collectives Performance

The tests below are run with the HCCL demo (run_hccl_demo.py).

HCCL_Allreduce

  • Running without MPI:

    cd $DEMOS_ROOT/gaudi/hccl_test
    HLS_ID=0 HCCL_COMM_ID=10.111.233.253:5555 python3 run_hccl_demo.py -clean \
        --test all_reduce --nranks 8 --loop 1000 --node_id 0 --size 256m \
        --ranks_per_node 8

    [BENCHMARK] hcclAllReduce(src!=dst, InputSizeInBytes=268435456, count=67108864, dtype=float, iterations=1000)
    
    [BENCHMARK]     NW Bandwidth   : 512.957298 GB/s
    [BENCHMARK]     Algo Bandwidth : 293.118456 GB/s
    [BENCHMARK]     Time           : 0.915792 seconds
    
  • Running with MPI:

    • Running HCCL on 1 server (8 Gaudi devices):

    python3 run_hccl_demo.py --size 32m --test all_reduce --loop 1000 -mpi -np 8
    
    • Running HCCL on 2 servers (16 Gaudi devices):

    python3 run_hccl_demo.py --test all_reduce --loop 1000 --size 32m -mpi --hostfile hostfile.txt
    

HCCL_Allgather

  • Running without MPI:

    cd $DEMOS_ROOT/gaudi/hccl_test
    HLS_ID=0 HCCL_COMM_ID=10.111.233.253:5555 python3 run_hccl_demo.py -clean \
        --test all_gather --nranks 8 --loop 1000 --node_id 0 --size 16m \
        --ranks_per_node 8

    [BENCHMARK] hcclAllGather(src!=dst, InputSizeInBytes=16777216, count=4194304, dtype=float, iterations=1000)
    
    [BENCHMARK]     NW Bandwidth   : 509.189283 GB/s
    [BENCHMARK]     Algo Bandwidth : 72.741326 GB/s
    [BENCHMARK]     Time           : 0.230642 seconds
    
  • Running with MPI:

    • Running HCCL on 1 server (8 Gaudi devices):

    python3 run_hccl_demo.py --size 32m --test all_gather --loop 1000 -mpi -np 8
    
    • Running HCCL on 2 servers (16 Gaudi devices):

    python3 run_hccl_demo.py --test all_gather --loop 1000 --size 32m -mpi --hostfile hostfile.txt
    

HCCL_ReduceScatter

  • Running without MPI:

    cd $DEMOS_ROOT/gaudi/hccl_test
    HLS_ID=0 HCCL_COMM_ID=10.111.233.253:5555 python3 run_hccl_demo.py -clean \
        --test reduce_scatter --nranks 8 --loop 1000 --node_id 0 --size 256m \
        --ranks_per_node 8

    [BENCHMARK] hcclReduceScatter(src!=dst, InputSizeInBytes=268435456, count=67108864, dtype=float, iterations=1000)
    
    [BENCHMARK]     NW Bandwidth   : 509.285134 GB/s
    [BENCHMARK]     Algo Bandwidth : 582.040153 GB/s
    [BENCHMARK]     Time           : 0.461197 seconds
    
  • Running with MPI:

    • Running HCCL on 1 server (8 Gaudi devices):

    python3 run_hccl_demo.py --size 32m --test reduce_scatter --loop 1000 -mpi -np 8
    
    • Running HCCL on 2 servers (16 Gaudi devices):

    python3 run_hccl_demo.py --test reduce_scatter --loop 1000 --size 32m -mpi --hostfile hostfile.txt
    

HCCL_All2All

  • Running without MPI:

    cd $DEMOS_ROOT/gaudi/hccl_test
    HLS_ID=0 HCCL_COMM_ID=10.111.233.253:5555 python3 run_hccl_demo.py -clean \
        --test all2all --nranks 8 --loop 1000 --node_id 0 --size 256m \
        --ranks_per_node 8

    [BENCHMARK] hcclAlltoAll(src!=dst, InputSizeInBytes=268435456, count=67108864, dtype=float, iterations=1000)
    
    [BENCHMARK]     NW Bandwidth   : 509.252740 GB/s
    [BENCHMARK]     Algo Bandwidth : 582.003132 GB/s
    [BENCHMARK]     Time           : 0.461227 seconds
    
  • Running with MPI:

    • Running HCCL on 1 server (8 Gaudi devices):

    python3 run_hccl_demo.py --size 32m --test all2all --loop 1000 -mpi -np 8
    
    • Running HCCL on 2 servers (16 Gaudi devices):

    python3 run_hccl_demo.py --test all2all --loop 1000 --size 32m -mpi --hostfile hostfile.txt
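
The [BENCHMARK] lines in the four runs without MPI can be cross-checked with a little arithmetic. The sketch below recomputes Algo Bandwidth as InputSizeInBytes × iterations / Time (with 1 GB = 1e9 bytes) and scales it to NW Bandwidth. The scaling factors are inferred from the printed numbers for 8 ranks and are not taken from HCCL documentation, so treat them as an assumption. The function and table names are hypothetical.

```python
# Cross-check of the [BENCHMARK] output shown above.
# Assumption: the NW/Algo ratios below are inferred from the printed numbers.

def algo_bandwidth_gbps(size_bytes: int, iterations: int, seconds: float) -> float:
    """Input bytes processed per second over all iterations, in GB/s (1 GB = 1e9 bytes)."""
    return size_bytes * iterations / seconds / 1e9

# Inferred ratio of NW Bandwidth to Algo Bandwidth for n ranks.
NW_FACTOR = {
    "all_reduce": lambda n: 2 * (n - 1) / n,
    "all_gather": lambda n: n - 1,
    "reduce_scatter": lambda n: (n - 1) / n,
    "all2all": lambda n: (n - 1) / n,
}

def nw_bandwidth_gbps(test: str, algo_gbps: float, nranks: int) -> float:
    """Scale the algorithm bandwidth by the inferred per-collective factor."""
    return algo_gbps * NW_FACTOR[test](nranks)

if __name__ == "__main__":
    # (test, InputSizeInBytes, iterations, Time in seconds) copied from the runs above
    runs = [
        ("all_reduce", 268435456, 1000, 0.915792),
        ("all_gather", 16777216, 1000, 0.230642),
        ("reduce_scatter", 268435456, 1000, 0.461197),
        ("all2all", 268435456, 1000, 0.461227),
    ]
    for test, size, iters, secs in runs:
        algo = algo_bandwidth_gbps(size, iters, secs)
        nw = nw_bandwidth_gbps(test, algo, nranks=8)
        print(f"{test:15s} algo={algo:8.2f} GB/s  nw={nw:8.2f} GB/s")
```

Because the printed Time is rounded to six decimals, the recomputed values agree with the logged ones only to about three decimal places.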
    

Bisection Bandwidth Test Inside a Leaf Switch

  • Check that all leaf switches are performant (bisection bandwidth test inside leaf):

    • Run a 64 MB bidirectional send/recv test using all Gaudis/boxes within the leaf switch: each rank performs send_recv with (rank ^ 8).

    • Make sure every rank sends data through the leaf so that the leaf switch is fully utilized. As a simple rule of thumb, use all Gaudis and run a pair-wise send/recv exchange in which each rank communicates with (rank ^ 8).

    • PASS criteria: Ensure BW is > 68 GB/sec in each direction.

../../_images/Image1.png
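
The (rank ^ 8) pairing used in the leaf test can be sketched as follows. This is an illustration only, not part of the HCCL demo. It assumes 8 Gaudi devices per box with ranks numbered consecutively per box, so flipping bit 3 of the rank selects the same device index in a neighboring box. The helper names and the `passes` check are hypothetical.

```python
from __future__ import annotations

RANKS_PER_BOX = 8            # assumption: 8 Gaudi devices per box, ranked consecutively
PASS_THRESHOLD_GBPS = 68.0   # PASS criterion from the text, per direction

def leaf_peer(rank: int) -> int:
    """Partner for the pair-wise exchange: rank ^ 8 flips bit 3 of the rank."""
    return rank ^ RANKS_PER_BOX

def leaf_pairs(total_ranks: int) -> list[tuple[int, int]]:
    """Unique (rank, peer) pairs. total_ranks must be a multiple of 16,
    otherwise some rank ^ 8 would fall outside the job."""
    if total_ranks <= 0 or total_ranks % (2 * RANKS_PER_BOX) != 0:
        raise ValueError("total_ranks must be a positive multiple of 16")
    return [(r, leaf_peer(r)) for r in range(total_ranks) if r < leaf_peer(r)]

def passes(send_gbps: float, recv_gbps: float) -> bool:
    """True when both directions exceed the 68 GB/sec threshold."""
    return send_gbps > PASS_THRESHOLD_GBPS and recv_gbps > PASS_THRESHOLD_GBPS

if __name__ == "__main__":
    for rank, peer in leaf_pairs(16):
        print(f"rank {rank:2d} <-> rank {peer:2d}")
```

With 16 ranks this pairs rank 0 with 8, rank 1 with 9, and so on, so every device in one box exchanges data with its counterpart in the other box.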

Bisection Bandwidth Test Across Two Leaf Switches and a Spine Switch

  • Check that spine switches are performant (bisection bandwidth test across leaf and spine):

    • Run the same 64 MB bidirectional send/recv test, except that one half of the ranks communicates with the other half: run send_recv with (rank ^ (total_ranks/2)).

    • PASS criteria: Ensure BW is > 68 GB/sec in each direction.

../../_images/Image2.png
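
A minimal sketch of the half-to-half pairing follows. It assumes total_ranks is a power of two, since only then is total_ranks/2 a single bit and every (rank ^ (total_ranks/2)) a valid rank in the opposite half. With 24 ranks, for example, rank 16 ^ 12 would be 28, which is out of range. The function name is hypothetical.

```python
def spine_peer(rank: int, total_ranks: int) -> int:
    """Partner in the opposite half of the job: rank ^ (total_ranks / 2).

    Assumption: total_ranks is a power of two (and at least 2).
    """
    if total_ranks < 2 or total_ranks & (total_ranks - 1):
        raise ValueError("total_ranks must be a power of two >= 2")
    if not 0 <= rank < total_ranks:
        raise ValueError("rank out of range")
    return rank ^ (total_ranks // 2)

if __name__ == "__main__":
    total = 32
    for rank in range(total // 2):
        print(f"rank {rank:2d} <-> rank {spine_peer(rank, total):2d}")
```

Every pair has one rank in the lower half and one in the upper half, which is the property the text describes as half the ranks communicating with the other half.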

Bisection Bandwidth Test Across All Gaudi 3s for a Given Job

  • Check that PEER alltoall connections are functional (bisection bandwidth test including all nodes):

    • Run a 64 MB bidirectional send/recv test in a loop: for (i = 0; i < total_ranks/8; i++), run send_recv with (rank ^ (i*8)).

    • PASS criteria: Ensure BW is > 68 GB/sec in each direction.

../../_images/Image3.png

  • Check that alltoall connections are functional (bisection bandwidth test including all nodes):

    • Run a 64 MB bidirectional send/recv test in a loop: for (i = 0; i < total_ranks; i++), run send_recv with (rank ^ i).

    • PASS criteria: Ensure BW is > 68 GB/sec in each direction.

../../_images/Image4.png
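
The two loops in this section can be sketched as one schedule generator. This is an illustration under stated assumptions rather than the demo's implementation: total_ranks is assumed to be a power of two so that every XOR result is a valid rank, and the sketch skips i = 0 by default because rank ^ 0 is the rank itself. The name `xor_schedule` and its parameters are hypothetical.

```python
from __future__ import annotations
from typing import Iterator

def xor_schedule(total_ranks: int, stride: int = 1,
                 skip_self: bool = True) -> Iterator[tuple[int, list[int]]]:
    """Yield (mask, peers) per loop step, where peers[rank] = rank ^ mask.

    stride=8 mirrors the PEER loop (rank ^ (i*8), i < total_ranks/8);
    stride=1 mirrors the full loop (rank ^ i, i < total_ranks).
    Assumption: total_ranks is a power of two and divisible by stride.
    """
    if total_ranks < 1 or total_ranks & (total_ranks - 1):
        raise ValueError("total_ranks must be a power of two")
    if stride < 1 or total_ranks % stride:
        raise ValueError("stride must divide total_ranks")
    start = 1 if skip_self else 0  # i = 0 pairs every rank with itself
    for i in range(start, total_ranks // stride):
        mask = i * stride
        yield mask, [rank ^ mask for rank in range(total_ranks)]

if __name__ == "__main__":
    for mask, peers in xor_schedule(32, stride=8):
        print(f"mask {mask:2d}: rank 0 <-> {peers[0]}, rank 9 <-> {peers[9]}")
```

In each step the peer list is a perfect matching (the peer of a rank's peer is the rank itself), so all ranks send and receive at the same time. With stride=1, every rank meets every other rank exactly once over the whole loop.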