Collectives Performance
The tests below are run with the HCCL demo scripts.
HCCL_Allreduce
Running without MPI:
cd $DEMOS_ROOT/gaudi/hccl_test; HLS_ID=0 HCCL_COMM_ID=10.111.233.253:5555 python3 run_hccl_demo.py -clean --test all_reduce --nranks 8 --loop 1000 --node_id 0 --size 256m --ranks_per_node 8
[BENCHMARK] hcclAllReduce(src!=dst, InputSizeInBytes=268435456, count=67108864, dtype=float, iterations=1000)
[BENCHMARK] NW Bandwidth : 512.957298 GB/s
[BENCHMARK] Algo Bandwidth : 293.118456 GB/s
[BENCHMARK] Time : 0.915792 seconds
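The reported Algo Bandwidth can be cross-checked against the other fields of the benchmark output. The sketch below assumes that Time is the total wall time for all iterations and that 1 GB means 1e9 bytes; both assumptions are inferred from the numbers shown here, not taken from the HCCL demo source. The function name is illustrative.

```python
def algo_bandwidth_gbps(size_bytes: int, iterations: int, total_seconds: float) -> float:
    """Bytes processed per second over all iterations, in GB/s (assuming 1 GB = 1e9 bytes)."""
    return size_bytes * iterations / total_seconds / 1e9

# Values copied from the all_reduce benchmark output above.
bw = algo_bandwidth_gbps(size_bytes=268435456, iterations=1000, total_seconds=0.915792)
print(f"Recomputed Algo Bandwidth: {bw:.3f} GB/s (reported: 293.118456 GB/s)")
```

The recomputed value agrees with the reported one to within the rounding of the printed Time field.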
Running with MPI:
Running HCCL on 1 server (8 Gaudi devices):
python3 run_hccl_demo.py --size 32m --test all_reduce --loop 1000 -mpi -np 8
Running HCCL on 2 servers (16 Gaudi devices):
python3 run_hccl_demo.py --test all_reduce --loop 1000 --size 32m -mpi --hostfile hostfile.txt
HCCL_Allgather
Running without MPI:
cd $DEMOS_ROOT/gaudi/hccl_test; HLS_ID=0 HCCL_COMM_ID=10.111.233.253:5555 python3 run_hccl_demo.py -clean --test all_gather --nranks 8 --loop 1000 --node_id 0 --size 16m --ranks_per_node 8
[BENCHMARK] hcclAllGather(src!=dst, InputSizeInBytes=16777216, count=4194304, dtype=float, iterations=1000)
[BENCHMARK] NW Bandwidth : 509.189283 GB/s
[BENCHMARK] Algo Bandwidth : 72.741326 GB/s
[BENCHMARK] Time : 0.230642 seconds
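When results are collected from many runs, it can help to extract the metrics from the [BENCHMARK] lines programmatically. The following is a minimal sketch that assumes the output format shown above; the function and variable names are illustrative and are not part of the HCCL demo.

```python
import re

# Sample output copied from the all_gather run above.
SAMPLE = """\
[BENCHMARK] hcclAllGather(src!=dst, InputSizeInBytes=16777216, count=4194304, dtype=float, iterations=1000)
[BENCHMARK] NW Bandwidth : 509.189283 GB/s
[BENCHMARK] Algo Bandwidth : 72.741326 GB/s
[BENCHMARK] Time : 0.230642 seconds
"""

# Matches the three numeric metric lines; the header line with the call signature is skipped.
METRIC_RE = re.compile(r"\[BENCHMARK\]\s+(NW Bandwidth|Algo Bandwidth|Time)\s*:\s*([0-9.]+)")

def parse_benchmark(text: str) -> dict:
    """Return a mapping from metric name to its numeric value."""
    return {name: float(value) for name, value in METRIC_RE.findall(text)}

metrics = parse_benchmark(SAMPLE)
print(metrics)
```

Because the pattern is not anchored to the start of a line, it also works when the output has been captured as a single line.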
Running with MPI:
Running HCCL on 1 server (8 Gaudi devices):
python3 run_hccl_demo.py --size 32m --test all_gather --loop 1000 -mpi -np 8
Running HCCL on 2 servers (16 Gaudi devices):
python3 run_hccl_demo.py --test all_gather --loop 1000 --size 32m -mpi --hostfile hostfile.txt
HCCL_ReduceScatter
Running without MPI:
cd $DEMOS_ROOT/gaudi/hccl_test; HLS_ID=0 HCCL_COMM_ID=10.111.233.253:5555 python3 run_hccl_demo.py -clean --test reduce_scatter --nranks 8 --loop 1000 --node_id 0 --size 256m --ranks_per_node 8
[BENCHMARK] hcclReduceScatter(src!=dst, InputSizeInBytes=268435456, count=67108864, dtype=float, iterations=1000)
[BENCHMARK] NW Bandwidth : 509.285134 GB/s
[BENCHMARK] Algo Bandwidth : 582.040153 GB/s
[BENCHMARK] Time : 0.461197 seconds
Running with MPI:
Running HCCL on 1 server (8 Gaudi devices):
python3 run_hccl_demo.py --size 32m --test reduce_scatter --loop 1000 -mpi -np 8
Running HCCL on 2 servers (16 Gaudi devices):
python3 run_hccl_demo.py --test reduce_scatter --loop 1000 --size 32m -mpi --hostfile hostfile.txt
HCCL_All2All
Running without MPI:
cd $DEMOS_ROOT/gaudi/hccl_test; HLS_ID=0 HCCL_COMM_ID=10.111.233.253:5555 python3 run_hccl_demo.py -clean --test all2all --nranks 8 --loop 1000 --node_id 0 --size 256m --ranks_per_node 8
[BENCHMARK] hcclAlltoAll(src!=dst, InputSizeInBytes=268435456, count=67108864, dtype=float, iterations=1000)
[BENCHMARK] NW Bandwidth : 509.252740 GB/s
[BENCHMARK] Algo Bandwidth : 582.003132 GB/s
[BENCHMARK] Time : 0.461227 seconds
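The four single-node runs above report NW Bandwidth and Algo Bandwidth values that differ by a fixed factor per collective. The sketch below computes that ratio from the reported numbers and compares it with simple candidate expressions in the rank count. The candidate factors are inferred from these numbers only; they are an assumption, not the documented HCCL formula.

```python
N_RANKS = 8  # --nranks 8 in the runs above

# (NW Bandwidth, Algo Bandwidth) in GB/s, copied from the benchmark outputs above.
reported = {
    "all_reduce": (512.957298, 293.118456),
    "all_gather": (509.189283, 72.741326),
    "reduce_scatter": (509.285134, 582.040153),
    "all2all": (509.252740, 582.003132),
}

# Candidate scaling factors (assumption: chosen because they match the observed ratios).
candidate = {
    "all_reduce": 2 * (N_RANKS - 1) / N_RANKS,
    "all_gather": N_RANKS - 1,
    "reduce_scatter": (N_RANKS - 1) / N_RANKS,
    "all2all": (N_RANKS - 1) / N_RANKS,
}

ratios = {name: nw / algo for name, (nw, algo) in reported.items()}
for name, ratio in ratios.items():
    print(f"{name:15s} NW/Algo = {ratio:.4f}   candidate = {candidate[name]:.4f}")
```

Under these assumptions, all four collectives reach a similar NW Bandwidth of roughly 509 to 513 GB/s even though their Algo Bandwidth values differ widely.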
Running with MPI:
Running HCCL on 1 server (8 Gaudi devices):
python3 run_hccl_demo.py --size 32m --test all2all --loop 1000 -mpi -np 8
Running HCCL on 2 servers (16 Gaudi devices):
python3 run_hccl_demo.py --test all2all --loop 1000 --size 32m -mpi --hostfile hostfile.txt
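The MPI invocations in the sections above differ only in the test name and in whether -np or --hostfile is passed. A small helper can generate them for a sweep over all four collectives. This is a sketch that only assembles command strings from the flags shown on this page; the helper name and its defaults are illustrative.

```python
import shlex

TESTS = ["all_reduce", "all_gather", "reduce_scatter", "all2all"]

def mpi_command(test, size="32m", loop=1000, np=None, hostfile=None):
    """Build a run_hccl_demo.py command line using only flags shown on this page."""
    args = ["python3", "run_hccl_demo.py", "--test", test,
            "--loop", str(loop), "--size", size, "-mpi"]
    if hostfile is not None:
        args += ["--hostfile", hostfile]   # multi-server run
    elif np is not None:
        args += ["-np", str(np)]           # single-server run
    return shlex.join(args)

single_server = [mpi_command(t, np=8) for t in TESTS]
two_servers = [mpi_command(t, hostfile="hostfile.txt") for t in TESTS]
for cmd in single_server + two_servers:
    print(cmd)
```

The generated strings can be passed to a shell or to a job scheduler; the sketch itself does not launch anything.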
Bisection Bandwidth Test Inside a Leaf Switch
Check if all leaf switches are performant (bisection bandwidth test inside leaf):
Run a 64 MB bidirectional send/recv test using all Gaudi devices (all boxes) attached to the leaf switch. Each rank runs send_recv with its partner (rank ^ 8), where ^ is bitwise XOR.
Make sure every rank sends data through the leaf so that the leaf switch is fully utilized. As a rule of thumb, use all Gaudi devices and run a pairwise send/recv exchange in which each rank communicates with (rank ^ 8).
PASS criteria: bandwidth must exceed 68 GB/s in each direction.
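The pairing rule and the PASS check can be sketched as follows. The sketch assumes 8 Gaudi devices per box with ranks numbered box by box, so that XOR with 8 maps each rank to the same local device index in the neighboring box. The rank count and all function names are hypothetical.

```python
BOX_SIZE = 8            # assumption: 8 Gaudi devices per box, ranks numbered box by box
PASS_THRESHOLD = 68.0   # GB/s per direction, from the PASS criteria above

def leaf_partner(rank: int) -> int:
    """Partner for the intra-leaf test: flip the bit that selects the neighboring box."""
    return rank ^ BOX_SIZE

def passes(send_gbps: float, recv_gbps: float) -> bool:
    """Both directions must exceed the threshold."""
    return send_gbps > PASS_THRESHOLD and recv_gbps > PASS_THRESHOLD

total_ranks = 16  # hypothetical: two boxes under one leaf switch
pairs = sorted({tuple(sorted((r, leaf_partner(r)))) for r in range(total_ranks)})
print(pairs)
```

With two boxes, rank 0 pairs with rank 8, rank 1 with rank 9, and so on, so every device sends and receives through the leaf at the same time.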
Bisection Bandwidth Test Two Leaf and Spine Switch
Check if spine switches are performant (bisection bandwidth test across leaf and spine):
Run a 64 MB bidirectional send/recv test in which each half of the ranks communicates with the other half: each rank runs send_recv with (rank ^ (total_ranks/2)), where ^ is bitwise XOR.
PASS criteria: bandwidth must exceed 68 GB/s in each direction.
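A minimal sketch of the half-to-half pairing is shown below. It assumes total_ranks is a power of two; otherwise the XOR can produce a partner outside the valid rank range. The function name and the example rank count are hypothetical.

```python
def spine_partner(rank: int, total_ranks: int) -> int:
    """Partner in the opposite half of the job: rank ^ (total_ranks / 2)."""
    if total_ranks < 2 or (total_ranks & (total_ranks - 1)) != 0:
        raise ValueError("this sketch assumes total_ranks is a power of two >= 2")
    return rank ^ (total_ranks // 2)

total_ranks = 32  # hypothetical job size: four boxes of 8 devices
partners = [spine_partner(r, total_ranks) for r in range(total_ranks)]
print(partners[:4], "...", partners[-4:])
```

Every rank in the lower half is paired with exactly one rank in the upper half, so the traffic crosses the bisection of the job.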
Bisection Bandwidth Test Across all Gaudi 3s for a Given Job
Check that PEER alltoall connections are functional (bisection bandwidth test including all nodes):
Run a 64 MB bidirectional send/recv test in a loop: for (i = 0; i < total_ranks/8; i++), each rank runs send_recv with (rank ^ (i*8)), where ^ is bitwise XOR.
PASS criteria: bandwidth must exceed 68 GB/s in each direction.
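The loop above can be sketched as a per-rank schedule of partners. The sketch assumes 8 devices per box and a power-of-two number of boxes, so that every computed partner is a valid rank. Note that the first step (i = 0) yields the rank itself. Names and the example job size are hypothetical.

```python
def peer_schedule(rank: int, total_ranks: int, box_size: int = 8) -> list:
    """Partners visited by one rank: the same local device index in every box."""
    return [rank ^ (i * box_size) for i in range(total_ranks // box_size)]

total_ranks = 32  # hypothetical: four boxes of 8 devices
schedule = peer_schedule(5, total_ranks)
print(schedule)
```

In this example, rank 5 visits ranks 5, 13, 21, and 29, which are the devices with local index 5 in each of the four boxes.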
Check that alltoall connections are functional (bisection bandwidth test including all nodes):
Run a 64 MB bidirectional send/recv test in a loop: for (i = 0; i < total_ranks; i++), each rank runs send_recv with (rank ^ i), where ^ is bitwise XOR.
PASS criteria: bandwidth must exceed 68 GB/s in each direction.
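The full loop can be sketched as a list of rounds, where round i pairs every rank with (rank ^ i). The sketch assumes total_ranks is a power of two, in which case each round is a consistent pairing and every rank meets every other rank exactly once across all rounds. Round 0 pairs each rank with itself. Names and the example size are hypothetical.

```python
def alltoall_rounds(total_ranks: int) -> list:
    """Round i maps each rank r to its partner r ^ i."""
    return [[r ^ i for r in range(total_ranks)] for i in range(total_ranks)]

total_ranks = 8  # hypothetical small job, kept small for readability
rounds = alltoall_rounds(total_ranks)

# Partners seen by rank 3 over all rounds.
rank3_partners = [rounds[i][3] for i in range(total_ranks)]
print(rank3_partners)
```

Sorting the partners of any rank gives the full rank range, which confirms that the schedule exercises every pairwise connection.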