Make Tensorflow’s Object-Detection Validation a true Post-Process

TRAN Ngoc Thach
5 min read · Feb 17, 2022

In Tensorflow 2, the Object Detection API separates Training and Validation into two different processes. Moreover, they are supposed to run simultaneously, both monitoring the same directory where Checkpoints are saved. This article explains a trick that lets Users run Validation as a true Post-process, which is, as far as I know, not possible with the current implementation of the Object Detection API.

Table of Contents

1). By default, how are Training and Validation supposed to intertwine?
2). The Pros & Cons of the aforementioned approach
3). Make Validation a true Post-process
4). Verify
5). Conclusion

By default, how are Training and Validation supposed to intertwine?

Assuming everything is set up properly, from preparing the data as TFRecord, to downloading the Git repo models, to compiling the protos, to installing the Tensorflow Object Detection API's setup.py, as well as configuring the pipeline.config, the Training process is then executed with the command:

python model_main_tf2.py \
--pipeline_config_path="some_path/pipeline.config" \
--model_dir="some_path" \
--checkpoint_every_n=100 \
--num_train_steps=1000 \
--alsologtostderr

At the same time, the Validation is started, as follows:

python model_main_tf2.py \
--pipeline_config_path="some_path/pipeline.config" \
--model_dir="some_path" \
--checkpoint_dir="some_path" \
--eval_timeout=60

Whenever the Training saves a Checkpoint, e.g. ckpt-1 (ckpt-1.data-00000-of-00001 and ckpt-1.index), the Validation detects it, ingests the file, starts validating, and saves the result as e.g. events.out.tfevents.1645005144.6403b50c5b59.912.0.v2. The two processes keep going until the Training is complete; the Validation then times out after finding no new Checkpoint.
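For orientation, after a few saves the shared directory ends up looking roughly like this (file and folder names are illustrative, derived from the flags above):

some_path/
├── checkpoint                      # bookkeeping file listing recent Checkpoints
├── ckpt-1.data-00000-of-00001
├── ckpt-1.index
├── ckpt-2.data-00000-of-00001
├── ckpt-2.index
├── train/                          # event files written by the Training process
│   └── events.out.tfevents...
└── eval/                           # event files written by the Validation process
    └── events.out.tfevents...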

The Pros & Cons of the aforementioned approach

The reader may have a different resource set-up for training and validating, for example one or two powerful PCs with plenty of CPU/GPU. In my case, however, Google CoLab is used. Unfortunately, a CoLab Notebook executes cells one after another, and within one cell, one command after another, e.g. !python model_main_tf2.py…. If Validation is run after Training, it will only consider the latest Checkpoint, as the code excerpts below show:

# 1). module "model_main_tf2.py", line 81:
if FLAGS.checkpoint_dir:  # if the checkpoint_dir argument is used.
  model_lib_v2.eval_continuously(...

# 2). module "model_lib_v2.py", function eval_continuously(), line 1136:
for latest_checkpoint in tf.train.checkpoints_iterator(
    checkpoint_dir, timeout=timeout, min_interval_secs=wait_interval):

# 3). module "checkpoint_utils.py", function checkpoints_iterator(), line 194:
while True:
  new_checkpoint_path = wait_for_new_checkpoint(
      checkpoint_dir, checkpoint_path, timeout=timeout)

# 4). same module, function wait_for_new_checkpoint(), line 139:
while True:
  checkpoint_path = checkpoint_management.latest_checkpoint(checkpoint_dir)

There may be a few tricks to get multiple cells running at the same time, such as ipyparallel (but I have no experience with this technique, and am not sure it works with CoLab at all).
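For completeness, one such trick (which I have not tested myself) would be to push the Training into a background process from a single cell, so that a later cell can start the Validation while Training is still running; a rough sketch:

import subprocess

# Hedged sketch: launch the Training command from above as a background process.
# A later Notebook cell could then run the Validation command while this is alive.
train_proc = subprocess.Popen([
    "python", "model_main_tf2.py",
    "--pipeline_config_path=some_path/pipeline.config",
    "--model_dir=some_path",
    "--checkpoint_every_n=100",
    "--num_train_steps=1000",
    "--alsologtostderr",
])
# ... later: train_proc.wait()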

To the best of my knowledge, the pros and cons of running Training and Validation in parallel are:

  • Pros:
    - No need to keep all Checkpoints (in my case, one Checkpoint costs 42.5 MB). Indeed, Training only keeps the last 7 Checkpoints, and this configuration cannot be changed from outside.
    - Combined with Tensorboard in real time, one can stop the Training early if, e.g., the loss is no longer decreasing substantially.
  • Cons:
    - Resources must be shared between Training and Validation, which can slow down the Training; the Validation may also use the GPU (Picture 1).
    - The Validation process only considers the latest Checkpoint. In some cases, e.g. a large Validation set combined with a small checkpoint_every_n, the Validation may miss some Checkpoints, resulting in a less fine-grained Visualization of the Validation data.
    - Complexity: in CoLab, it is not straightforward to run multiple cells in parallel (i.e. one cell for Training, one cell for Validation).
Picture 1: GPU is used during Validation

Therefore, there may be a need for the Validation process to be carried out after the Training process. This is particularly handy when using CoLab.

Make Validation a true Post-process

The code for Validation, specifically ingesting a Checkpoint, restoring the model, and validating it against the Validation set, is scattered across multiple places in the Git repo models. Assembling all of the pieces into a standalone script would be time-consuming and could introduce new, subtle issues.

A safer and quicker approach is to keep the current way of executing, i.e. calling model_main_tf2.py with the checkpoint_dir argument, but modify a few places in the code so that, instead of continuously watching the directory for a new Checkpoint, it simply enumerates all Checkpoints already in the directory and processes each one as if it were the single latest Checkpoint. This way, the correctness of the entire Validation process is kept intact, and every artifact produced remains well-formatted and consumable by, e.g., Tensorboard.

First things first: the Training only keeps the last 7 Checkpoints:

# 1). module "model_main_tf2.py", line 105:
with strategy.scope():
  model_lib_v2.train_loop(
      pipeline_config_path=FLAGS.pipeline_config_path,
      model_dir=FLAGS.model_dir,
      train_steps=FLAGS.num_train_steps,
      use_tpu=FLAGS.use_tpu,
      checkpoint_every_n=FLAGS.checkpoint_every_n,
      record_summaries=FLAGS.record_summaries)

# 2). module "model_lib_v2.py", function train_loop(), line 443:
def train_loop(
    pipeline_config_path,
    model_dir,
    config_override=None,
    train_steps=None,
    use_tpu=False,
    save_final_config=False,
    checkpoint_every_n=1000,
    checkpoint_max_to_keep=7,
    record_summaries=True,
    performance_summary_exporter=None,
    num_steps_per_iteration=NUM_STEPS_PER_ITERATION,
    **kwargs):

Unfortunately, this argument cannot be changed from outside. The simplest fix is to add checkpoint_max_to_keep=some_big_number to the call to train_loop() inside module model_main_tf2.py (the module used for both Training and Validation).
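Concretely, the patched call could look like the sketch below; 1000 is just an arbitrarily large value so that, in practice, no Checkpoint gets deleted:

# Sketch of the patched call in model_main_tf2.py (the value 1000 is arbitrary):
with strategy.scope():
  model_lib_v2.train_loop(
      pipeline_config_path=FLAGS.pipeline_config_path,
      model_dir=FLAGS.model_dir,
      train_steps=FLAGS.num_train_steps,
      use_tpu=FLAGS.use_tpu,
      checkpoint_every_n=FLAGS.checkpoint_every_n,
      checkpoint_max_to_keep=1000,  # keep (practically) all Checkpoints
      record_summaries=FLAGS.record_summaries)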

Next, within module model_lib_v2.py, function eval_continuously(), line 1136:

for latest_checkpoint in tf.train.checkpoints_iterator(
    checkpoint_dir, timeout=timeout, min_interval_secs=wait_interval):
  ...
  ckpt.restore(latest_checkpoint).expect_partial()

The variable latest_checkpoint is a string holding the full path to the latest Checkpoint, e.g. absolute_path/ckpt-33. Once the path is known, the loop body carries out the Validation, i.e. restores the Model from the Checkpoint at that path and then starts validating.

If we replace the use of tf.train.checkpoints_iterator() with code that enumerates all the Checkpoints and feeds each one into the body of the for-loop, everything should proceed as usual and correctness is maintained. Below is what I did: replace the iterator with the for-loop header shown at the end of the snippet (besides that, the libraries regex and glob and a utility function natural_sort are also needed):

import regex as re
import glob

def natural_sort(l):
  convert = lambda text: int(text) if text.isdigit() else text.lower()
  alphanum_key = lambda key: [convert(c) for c in re.split('([0-9]+)', key)]
  return sorted(l, key=alphanum_key)

for latest_checkpoint in natural_sort(
    list(set(map(lambda n: n[:n.index(".")],
                 glob.glob(f"{checkpoint_dir}/ckpt-*.*"))))):
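To make that one-liner easier to digest, here is roughly what each step produces, assuming checkpoint_dir itself contains no dots and holds Checkpoints ckpt-1 through ckpt-10:

# glob.glob(f"{checkpoint_dir}/ckpt-*.*")
#   -> ['.../ckpt-1.data-00000-of-00001', '.../ckpt-1.index', ..., '.../ckpt-10.index']
# map(lambda n: n[:n.index(".")], ...)   # strip everything from the first dot
#   -> ['.../ckpt-1', '.../ckpt-1', ..., '.../ckpt-10', '.../ckpt-10']
# set(...)                               # de-duplicate the per-shard entries
#   -> {'.../ckpt-1', ..., '.../ckpt-10'}
# natural_sort(list(...))                # numeric order, so ckpt-2 comes before ckpt-10
#   -> ['.../ckpt-1', '.../ckpt-2', ..., '.../ckpt-10']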

Last but not least, all of the changes above must be made after the GitHub repo models is cloned, and before running !cp object_detection/packages/tf2/setup.py . and !pip install . (otherwise the installed object_detection package would not include them).
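In a CoLab Notebook, the overall order of operations would therefore roughly be (cells are illustrative; adapt paths to your own set-up):

# Illustrative CoLab cell sequence:
!git clone --depth 1 https://github.com/tensorflow/models.git
# 1). Patch model_main_tf2.py (pass checkpoint_max_to_keep) and
#     model_lib_v2.py (enumerate all Checkpoints), as described above.
%cd models/research
!protoc object_detection/protos/*.proto --python_out=.
!cp object_detection/packages/tf2/setup.py .
!pip install .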

Verify

I think the argument checkpoint_every_n should be divisible by 100, which is NUM_STEPS_PER_ITERATION. The reason can be seen here: after each iteration (of 100 steps), Training only attempts to save a Checkpoint once the steps accumulated since the last save exceed checkpoint_every_n. By keeping it divisible by 100, we can easily estimate how many Checkpoints will be generated for a given num_train_steps.
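As a quick sanity check with the numbers used in the Training command above:

# Rough estimate (assumption: one Checkpoint roughly every checkpoint_every_n
# steps, with checkpoint_every_n a multiple of NUM_STEPS_PER_ITERATION = 100):
num_train_steps = 1000
checkpoint_every_n = 100
expected_checkpoints = num_train_steps // checkpoint_every_n  # ~10 Checkpoints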

The Training now runs first; once it is complete, the Validation follows, all in a sequential manner. At the end, the checkpoint_dir contains eval and train folders holding the data for Visualization with Tensorboard:

tensorboard --logdir=checkpoint_dir
Picture 2: Visualization for Training & Validation

Conclusion

We have learned how to make Validation a true Post-process for Tensorflow's Object Detection API. This trick is at least helpful in my case: a simple CoLab set-up in which Training and Validation run sequentially.

P.S.: If this article is useful, please consider following me on Medium. 😋
