Profiling with TensorFlow

Following the original guide, Optimize TensorFlow performance using the Profiler, you can use any of the Profiling APIs to collect a profile. The TensorBoard Keras Callback is the recommended API.

  1. Add the code below to your training script:

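from datetime import datetime

import tensorflow as tf

# Create a TensorBoard callback that writes to a timestamped log directory
logs = "logs/" + datetime.now().strftime("%Y%m%d-%H%M%S")

# profile_batch selects the range of steps (batches) to profile
tboard_callback = tf.keras.callbacks.TensorBoard(log_dir=logs,
                                                 histogram_freq=1,
                                                 profile_batch='500,520')

# model, ds_train and ds_test are defined earlier in the training script
model.fit(ds_train,
          epochs=2,
          validation_data=ds_test,
          callbacks=[tboard_callback])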

Note that the TensorBoard callback object is passed to the fit method. Use profile_batch to specify the range of steps (batches) you want to profile, keeping in mind the limited capacity of the buffer that collects the data in the Synapse Profiling Subsystem.

  2. Start the TensorBoard server in a dedicated terminal window:

$ tensorboard --logdir logs --bind_all --port=5990

In the example above, the listening port is set to 5990.

  3. Open a new tab in your browser and go to your TensorBoard page:

http://fq_domain_name:5990

Replace fq_domain_name with the fully qualified domain name of the machine running TensorBoard. You are now ready to start your training.
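If the TensorBoard page does not load, a quick first check is whether the server port is reachable from your machine. The snippet below is a minimal sketch using only the Python standard library; is_port_open is a hypothetical helper name, and the host and port are taken from the example above and should be adjusted to your setup.

```python
import socket


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds (hypothetical helper)."""
    try:
        # create_connection raises OSError if the port is closed or unreachable
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Port 5990 matches the tensorboard command shown earlier
    print(is_port_open("localhost", 5990))
```

A result of False only says that nothing accepted the connection. It does not distinguish a stopped server from a firewall rule or a wrong host name.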

TensorBoard presents two kinds of information:

  • While your workload is processed step by step (batch by batch), you can monitor the training process live on the dashboard by tracking your model's cost (loss) and accuracy.

  • As soon as the last requested step completes, TensorFlow analyzes the collected profiling data and sends it to your browser. There is no need to wait for the training process to finish.
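Once the profiled steps have finished, one way to confirm that trace data reached disk is to look for profile run directories under the log directory. The sketch below assumes the profiler writes to a plugins/profile/<run> subdirectory somewhere below logs. The exact nesting can differ between TensorFlow versions, and find_profile_runs is a hypothetical helper name, not a TensorFlow or TensorBoard API.

```python
from pathlib import Path


def find_profile_runs(logdir):
    """Return sorted run directories matching <logdir>/**/plugins/profile/<run>.

    Hypothetical helper; the directory layout is an assumption.
    """
    root = Path(logdir)
    # "**" matches zero or more intermediate directories (e.g. a timestamp or "train")
    return sorted(p for p in root.glob("**/plugins/profile/*") if p.is_dir())


if __name__ == "__main__":
    for run in find_profile_runs("logs"):
        print(run)
```

An empty result while the requested steps have already passed may indicate that the profiled range was never reached or that the layout assumption does not hold for your version.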

Note

  • Carefully consider how many steps you really need to profile, and keep the limited buffer size in mind.

  • If you need to extend the buffer, consult the SynapseAI Profiler User Guide.

  • For the vast majority of use cases, the default settings are sufficient and no internal parameters need to be adjusted.

  • An error, Unknown device vendor, might appear. It occurs because TensorFlow does not recognize Gaudi. It does not affect performance and will be removed in future releases.
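To keep the profiled window small and explicit, the profile_batch string can be derived from a start step and a step count instead of being typed by hand. The helper below is a minimal sketch: profile_window and its parameters are hypothetical names, not part of TensorFlow, and the check against total_steps assumes that profile_batch counts steps from the start of training rather than per epoch.

```python
def profile_window(start_step, num_steps, total_steps):
    """Build a 'start,stop' string for the profile_batch argument.

    Hypothetical helper, not part of TensorFlow. total_steps is the number
    of training steps in the whole run (an assumption about how steps are counted).
    """
    if start_step < 1 or num_steps < 1:
        raise ValueError("start_step and num_steps must be positive")
    stop_step = start_step + num_steps - 1  # the range is inclusive
    if stop_step > total_steps:
        raise ValueError("profiled range ends after the last training step")
    return f"{start_step},{stop_step}"


# 21 steps starting at step 500 reproduce the range used in the example script
print(profile_window(500, 21, total_steps=2000))  # 500,520
```

The returned string can be passed as profile_batch when constructing the TensorBoard callback. Choosing a smaller num_steps is the simplest way to stay within the buffer capacity.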