Profiling with TensorFlow

This section gives basic guidelines for profiling your model during training.

As described in the original TensorFlow guide, Optimize TensorFlow performance using the Profiler, you can use any of the profiling APIs. The TensorBoard Keras Callback API is the recommended one.

  1. Add the following code to your training script.

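from datetime import datetime

import tensorflow as tf

# Create a TensorBoard callback that writes to a timestamped log directory
logs = "logs/" + datetime.now().strftime("%Y%m%d-%H%M%S")

# profile_batch selects the range of steps (batches) to profile
tboard_callback = tf.keras.callbacks.TensorBoard(log_dir=logs,
                                                 histogram_freq=1,
                                                 profile_batch='500,520')
# model, ds_train, and ds_test are defined elsewhere in your training script
model.fit(ds_train,
          epochs=2,
          validation_data=ds_test,
          callbacks=[tboard_callback])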

Note

The TensorBoard callback object is passed to the fit method. Make sure to specify the range of steps (batches) you want to profile, and keep that range small: the buffer that collects data within the Intel Gaudi Profiling subsystem has limited capacity.
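
As a small illustration of choosing the profiled range, the sketch below builds the profile_batch string from a first step and a step count. The helper name profile_batch_range is hypothetical (it is not part of TensorFlow), and the code assumes the range is inclusive at both ends, so that '500,520' covers 21 steps.

```python
def profile_batch_range(first_step: int, num_steps: int) -> str:
    """Build a 'start,stop' string for the profile_batch argument.

    Hypothetical helper, not a TensorFlow API. Assumes the range is
    inclusive at both ends.
    """
    if first_step < 1:
        raise ValueError("first_step must be a positive integer")
    if num_steps < 1:
        raise ValueError("num_steps must be a positive integer")
    last_step = first_step + num_steps - 1
    return f"{first_step},{last_step}"


# Reproduces the range used in the callback above: steps 500 through 520.
print(profile_batch_range(500, 21))  # prints 500,520
```

Keeping num_steps small is the simplest way to stay within the capacity of the profiling buffer.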

  2. Start the TensorBoard server in a dedicated terminal window.

$ tensorboard --logdir logs --bind_all --port=5990

In the example above, the listening port is set to 5990.

  3. Open a new browser tab and go to your TensorBoard page, replacing fq_domain_name with the fully qualified domain name of the machine running TensorBoard.

http://fq_domain_name:5990
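
If the page does not load, it can help to check that the server port is reachable before investigating further. The following is a minimal sketch using only the Python standard library; the function name is_port_open and the host "localhost" are illustrative assumptions, and the port is the one from the example command.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # 5990 is the port from the tensorboard command above; "localhost"
    # is an assumption, use your server's host name instead.
    print("TensorBoard reachable:", is_port_open("localhost", 5990))
```

A False result only shows that nothing accepted the connection; it does not distinguish a stopped server from a blocked port.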

You are now ready to start training.

Two types of information are produced by TensorBoard:

  • Model Performance Tracking - While your workload is processed in batches, you can follow the training progress on the dashboard in real time by monitoring the model’s loss (cost) and accuracy.

  • Profiling Analysis - As soon as the last requested step completes, TensorBoard analyzes the collected profiling data and makes it available in your browser. You do not need to wait for the training process to finish.
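
To confirm that profiling data was written once the profiled steps finish, a sketch like the following can list the profile runs under the log directory. The function name find_profile_runs is hypothetical, and the plugins/profile layout is an assumption about how the profiler stores its output; the exact nesting may differ between versions.

```python
from pathlib import Path


def find_profile_runs(logdir: str) -> list[str]:
    """List profile run directories found anywhere under logdir.

    Assumes each run is stored in a 'plugins/profile/<run>' subdirectory
    at some depth below logdir.
    """
    root = Path(logdir)
    runs = [p for p in root.glob("**/plugins/profile/*") if p.is_dir()]
    return sorted(str(p.relative_to(root)) for p in runs)


if __name__ == "__main__":
    # "logs" matches the --logdir value used in the tensorboard command.
    for run in find_profile_runs("logs"):
        print(run)
```

An empty result after the profiled steps have completed suggests that the requested step range was never reached or that the log directory is different from the one expected.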

Note

  • Carefully consider the size of your buffer and the number of steps you actually need to profile.

  • If you need a larger buffer, refer to the profiling section.

  • In most use cases, the default settings are sufficient, and there is no need for any internal parameter adjustments.

  • The error "Unknown device vendor" might appear because TensorFlow does not recognize Gaudi. It does not affect performance and will be removed in a future release.