Monitoring Switch and Gaudi 3 Accelerator

Perform the following steps and checks when monitoring the switch:

  • Clear counters before each experiment

  • Ingress and egress port packet drops/discards

  • Tx/Rx PFC pause frames, broken down per priority (0-7)

  • Queue length

  • PFC on/off

  • Headroom settings

  • Hash function

  • Enable only priority 0, since it is the only priority in use

  • Check for the following settings in the switch configuration:

    • platform trident mmu queue profile PFC_Profile

    • ingress threshold 1

    • ingress headroom 64000

    • interface Ethernet1/1-32/7

    • priority-flow-control on

    • priority-flow-control priority 0 no-drop

    • priority-flow-control priority 1 no-drop

    • priority-flow-control priority 2 no-drop

    • priority-flow-control priority 3 no-drop

    • uc-tx-queue 2 no priority

    • uc-tx-queue 3 no priority

    • platform trident mmu queue profile PFC_Profile apply
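
The settings check above can be automated against a saved copy of the switch configuration. The following is a minimal sketch, assuming the running configuration has been exported to a text file. The function name check_pfc_config and the file name are hypothetical, and the expected lines are copied from the list above (the interface range line is left out because it depends on the port layout).

```shell
# Hypothetical helper: report which expected PFC settings are absent from a
# saved switch configuration file. The body runs in a subshell so that its
# variables do not leak into the caller.
check_pfc_config() (
  config_file="$1"
  missing=0
  # Strip leading/trailing whitespace so indented config lines compare exactly.
  normalized=$(sed 's/^[[:space:]]*//; s/[[:space:]]*$//' "$config_file")
  while IFS= read -r expected; do
    # -F: fixed string, -x: whole-line match, -q: no output
    if ! printf '%s\n' "$normalized" | grep -Fxq -- "$expected"; then
      echo "MISSING: $expected"
      missing=1
    fi
  done <<'EOF'
platform trident mmu queue profile PFC_Profile
ingress threshold 1
ingress headroom 64000
priority-flow-control on
priority-flow-control priority 0 no-drop
priority-flow-control priority 1 no-drop
priority-flow-control priority 2 no-drop
priority-flow-control priority 3 no-drop
uc-tx-queue 2 no priority
uc-tx-queue 3 no priority
platform trident mmu queue profile PFC_Profile apply
EOF
  # Exit status 0 when every line was found, 1 otherwise.
  return "$missing"
)
```

For example, `check_pfc_config running-config.txt && echo "PFC settings present"` prints the confirmation only when every expected line is found. Whole-line matching is used so that a value such as `ingress threshold 10` is not mistaken for `ingress threshold 1`.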

The following are the steps required to monitor each Gaudi:

  • Clear counters as shown below:

    for i in {0..7}; do
      echo 1 | sudo tee "${HL_DBG_PATH}/${HL_DBG_NAME}${i}/nic_reset_cnt"
    done
    
  • Monitor the Psn_out_of_range and psn_out_of_sequence counters on all Ethernet interfaces.

  • Check dmesg for port toggling or resets.
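
The counter and kernel log checks above can be scripted. The following is a minimal sketch, assuming the PSN counters are exposed through `ethtool -S` in its usual "name: value" layout, which the text above does not state explicitly. The function names, the interface list, and the log pattern are hypothetical and will likely need adjusting to the actual driver output.

```shell
# Hypothetical helper: read "name: value" statistics on stdin (the layout
# printed by `ethtool -S <iface>`) and print only the PSN out-of-range and
# out-of-sequence counters whose value is greater than zero.
nonzero_psn_counters() {
  awk -F': *' '
    (tolower($1) ~ /psn_out_of_(range|sequence)/) && (($2 + 0) > 0) {
      gsub(/^[ \t]+/, "", $1)   # drop the indentation before the counter name
      print $1 "=" $2
    }'
}

# Hypothetical helper: read kernel log lines on stdin and keep those that
# mention a port together with a reset or a state change. The pattern is a
# guess, not a list of known driver messages.
port_events() {
  grep -Ei 'port.*(reset|toggl|up|down)'
}

# Example usage (interface names are placeholders):
#   for iface in eth0 eth1; do ethtool -S "$iface" | nonzero_psn_counters; done
#   sudo dmesg | port_events
```

Both helpers read from standard input, so they can also be run on saved `ethtool -S` or `dmesg` output collected after an experiment.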