【Tensorflow】【Python】Train Your Own Dataset — Data Reading, Processing, Training, Testing, Visualization, Debugging (Single Machine Single GPU, Single Machine Multiple GPUs, Multiple Machines Multiple GPUs)

Github Code Address: https://github.com/HansRen1024/Tensorflow-preprocessing-training-testing

All code is on Github; match the filenames mentioned below to the files in the repository.

TF version must be at least 1.4.0.

There is very little information online about MonitoredTrainingSession; I have studied the official API and source code for a long time.

2018.04.13 Update: SSD single machine training + testing code: https://github.com/HansRen1024/Tensorflow-SSD

2018.03.16 Update: Added visualization code

2018.03.19 Update: Added debug code

2018.03.20 Update: Added mean subtraction and normalization code

2018.03.21 Update: Added mobilenet and mobilenet v2 network models

2018.03.22 Update: Added resnet-50, resnet-101, resnet-152, resnet-200 network models

Resolved the issue of GPU memory being fully occupied upon running (only works well on a single machine, not in a distributed environment).

2018.03.23 Update: Select GPU

2018.03.26 Update: Added validation phase, can monitor both train and val phase losses on tensorboard

2018.03.29 Update: Added finetune functionality

2018.04.02 Update: Optimized visualization effects, reset the method for calculating steps.

2018.04.09 Update: Added synchronous and asynchronous training switches in a distributed environment, rewrote MonitoredTrainingSession

2018.04.10 Update: Replaced custom i or step with global step, resolved log output mismatch issue

2018.04.11 Update: Resolved the issue of asynchronous training failing to record the last event

Introduction

A new project requires a distributed cluster. I previously used caffe and now need to use Tensorflow.

Most online resources are just copies of the official demo, and I could not find a detailed article that covers the whole path from data reading and processing to training and testing. So I am writing one.

1. Preparing the Dataset

With caffe, I kept each class of the dataset in its own directory. Here I provide two scripts and explain what each one does.

Script 1 (img2bin.py) is not recommended and will not be updated:

Classes are the names of the per-class directories under the images directory.

The size in img=img.resize((227,227)) must be chosen to match the input size of the network.

This script sits at the same level as the images directory, which contains three directories “0”, “1”, “2”, one per category, each holding all images of that category. It generates a binary file named “train.tfrecords”.
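The directory layout described above can be turned into (path, label) pairs before anything is written to a tfrecords file. The following is a minimal sketch using only the standard library. The function name `list_labeled_images` and the extension filter are hypothetical, and the label is assumed to be the index of the class directory in sorted order; img2bin.py itself may do this differently.

```python
import os


def list_labeled_images(images_dir, extensions=(".jpg", ".jpeg", ".png")):
    """Return (classes, samples) for a directory with one sub-directory per class.

    classes is the sorted list of class directory names; samples is a list of
    (image_path, label) pairs, where label is the index of the class directory.
    """
    classes = sorted(
        name for name in os.listdir(images_dir)
        if os.path.isdir(os.path.join(images_dir, name))
    )
    samples = []
    for label, class_name in enumerate(classes):
        class_dir = os.path.join(images_dir, class_name)
        for filename in sorted(os.listdir(class_dir)):
            # Skip anything that does not look like an image file.
            if filename.lower().endswith(extensions):
                samples.append((os.path.join(class_dir, filename), label))
    return classes, samples
```

With the layout above, `list_labeled_images("images")` would return the class names "0", "1", "2" and one pair per image.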

Script 2 (list2bin.py):

At the end, the script prints the per-channel means of the training set in BGR order. Put these values in arg_parsing.py, where they are used for mean subtraction.

Images must be converted to three-channel RGB; otherwise, there will be issues when reading the data later.
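To make the mean computation concrete, here is a minimal sketch of a dataset-wide per-channel mean. It assumes each image is already available as a list of (R, G, B) pixel tuples; the function name `channel_means_bgr` is hypothetical, and list2bin.py may accumulate the values in another way. The result is returned in B, G, R order to match what the script is described as printing.

```python
def channel_means_bgr(images):
    """Compute the per-channel mean over all pixels of all images.

    images: iterable of images, each a list of (R, G, B) pixel tuples.
    Returns (mean_b, mean_g, mean_r).
    """
    sums = [0.0, 0.0, 0.0]  # running sums for R, G, B
    count = 0
    for pixels in images:
        for r, g, b in pixels:
            sums[0] += r
            sums[1] += g
            sums[2] += b
            count += 1
    if count == 0:
        raise ValueError("no pixels to average")
    mean_r, mean_g, mean_b = (s / count for s in sums)
    return (mean_b, mean_g, mean_r)
```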

When I used caffe, I was accustomed to first generating a txt document (formatted as below), and then generating an lmdb dataset based on the document.

Script 2 generates tfrecords binary dataset based on this txt document.
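The txt format is not shown here. A common caffe-style list has one `<image path> <integer label>` pair per line, and the sketch below assumes exactly that; the function name `parse_list_file` is hypothetical. It splits on the last whitespace so that paths containing spaces still parse.

```python
def parse_list_file(lines):
    """Parse caffe-style list lines of the form '<image path> <integer label>'.

    lines: iterable of strings (for example, an open text file).
    Returns a list of (path, label) pairs; blank lines are skipped.
    """
    samples = []
    for line_number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue
        try:
            # Split on the last run of whitespace only.
            path, label = line.rsplit(None, 1)
            samples.append((path, int(label)))
        except ValueError:
            raise ValueError(
                "line %d is not '<path> <label>': %r" % (line_number, raw)
            )
    return samples
```

For example, `parse_list_file(open("train.txt"))` would yield the pairs that a script like list2bin.py needs before encoding each image.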

The blog address for the script that generates the txt document (which is img2list.sh in Github):
【Caffe】Quick Start to Training Your Own Data 《A Serious Talk About Caffe》

The code is a .sh script that generates train.txt and val.txt in the dataset.

I wrote the second script because I was accustomed to judging the correctness of the training process based on the val output during the training phase. ~~However, due to some reasons, I couldn't perform val during the training phase. More research is needed -0-.~~

2. Configuration Preparation Phase

Once you have the generated .tfrecords binary dataset, you can start setting up the environment for training.

There is a lot of code, so I will place explanations within the code for convenience. To be honest, I haven't understood many tf APIs; I just got it running. A lot of time will still be needed for learning and research.

Document 1 (arg_parsing.py):

It mainly sets the input parameters for a run and is easy to understand. You can change the default values next to the uppercase parameter names in the code, or pass the parameters on the command line.
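As an illustration of defaults that can be overridden on the command line, here is a minimal sketch using the standard library's argparse. The flag names --mode, --job_name, --task_index, ps_hosts, worker_hosts and model_dir appear in this article; the default values, the "training" choice name, the port numbers and the use of argparse itself are assumptions, and arg_parsing.py in the repository may be written differently.

```python
import argparse


def build_parser():
    """Build a parser whose defaults can be overridden on the command line."""
    parser = argparse.ArgumentParser(description="Training/testing options (sketch)")
    # 'testing' is used in this article; 'training' is an assumed name.
    parser.add_argument("--mode", default="training", choices=["training", "testing"])
    # An empty job name means single-machine training in this sketch.
    parser.add_argument("--job_name", default="", help="'ps' or 'worker'")
    parser.add_argument("--task_index", type=int, default=0)
    parser.add_argument("--model_dir", default="models/")
    # Comma-separated host:port lists with no spaces (ports are hypothetical).
    parser.add_argument("--ps_hosts", default="10.100.1.151:2222")
    parser.add_argument("--worker_hosts", default="10.100.1.151:2223,10.100.1.120:2222")
    return parser
```

Calling `build_parser().parse_args()` with no arguments returns the defaults, while `python main.py --mode=testing` style arguments override them.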

Document 2 (dataset.py):

Reads data, performs mean subtraction, normalization, and integrates into a batch input for the network.
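The arithmetic behind mean subtraction and normalization is small enough to show on a single pixel. This is only a sketch: the function name `normalize_pixel` is hypothetical, the scale factor of 1/255 is an assumption (dataset.py may normalize differently), and the mean is taken in BGR order because that is how the dataset script is described as printing it.

```python
def normalize_pixel(pixel_rgb, mean_bgr, scale=1.0 / 255.0):
    """Subtract the per-channel mean from one RGB pixel, then scale it.

    pixel_rgb: (R, G, B) values of one pixel.
    mean_bgr:  dataset means in (B, G, R) order.
    scale:     multiplier applied after subtraction (assumed value).
    """
    mean_rgb = mean_bgr[::-1]  # reorder the means to match the pixel layout
    return tuple((value - mean) * scale for value, mean in zip(pixel_rgb, mean_rgb))
```

In the real input pipeline the same operation would be applied to whole image tensors rather than to single pixels.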

Document 3 (main.py):

This is the entry point run from the command line; its content is very simple.

if tf.gfile.Exists(FLAGS.model_dir):

The three commented-out lines clear the models directory, creating it if it does not exist. I commented them out to prevent accidental deletion.
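For reference, the clear-and-recreate behaviour can be sketched with the standard library. This is an assumption about what the commented-out lines do, modelled on the pattern used in the official cifar-10 demo (tf.gfile.Exists, tf.gfile.DeleteRecursively, tf.gfile.MakeDirs); the function name `reset_model_dir` is hypothetical.

```python
import os
import shutil


def reset_model_dir(model_dir):
    """Delete model_dir with everything in it, then create it again empty.

    Destructive: all checkpoints and event files in model_dir are lost.
    """
    if os.path.exists(model_dir):
        shutil.rmtree(model_dir)
    os.makedirs(model_dir)
```

Because this wipes earlier checkpoints, it is the kind of call worth leaving disabled unless a fresh run is really intended.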

Document 4 (net/squeezenet.py):

Network structure document, written separately for convenience when changing network structures in the future.

My network is squeezenet; the official demo can only handle 32×32 datasets, but I modified it so that this network can handle normal 227×227 datasets.

network.py is just for reference, I later abandoned it and stopped updating it.

Document 5 (test.py):

Document for testing.

Document 6 (train.py):

Training document, which contains three training methods. This is the most labor-intensive document to write, quite frustrating～～～.

train():# General training method for single machine single GPU + single machine multiple GPUs.
train_dis_():# Usable distributed multi-machine multi-GPU.

I want to elaborate a bit on train.py.

Regarding the session: most people use tf.Session(). I use tf.train.MonitoredTrainingSession(), which the official API recommends and which the cifar-10 demo uses. Its hook interface is very convenient and lets you define and plug in functionality flexibly. According to the official API documentation, almost all hooks can be used with it (debug, summary, and so on), so it is worth studying. (2018.03.23: With this MTS manager it is difficult to add a validation phase. If validation shares parameters with training, the main problem is a global_step conflict. If validation reads a local ckpt instead, it is hard to balance how often to save the ckpt against how often to validate.)
I don't understand why all GPU memory is occupied as soon as training starts; setting batch_size to 1, 32, or 64 gives the same result. As a result, running val reports insufficient GPU memory, and after a while the machine freezes -0-. I will resolve this issue later.

The third method, train_dis(), can switch between synchronous and asynchronous training, but it has some issues. I was in a hurry to get the pipeline running and did not deal with them; I will look into it when I have time.

The second method, train_dis_(), is the usable distributed multi-machine multi-GPU method, which I believe is asynchronous.
You can control which GPUs are used with CUDA_VISIBLE_DEVICES=0,1 on the command line, or by setting the same environment variable in main.py.
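Setting the variable from Python can be sketched as follows. The helper name `select_gpus` is hypothetical. The variable has to be set before TensorFlow initializes CUDA, so the safest place is at the very top of main.py, before tensorflow is imported.

```python
import os


def select_gpus(gpu_ids):
    """Restrict the process to the given GPU ids via CUDA_VISIBLE_DEVICES.

    gpu_ids: iterable of integer GPU indices; an empty iterable hides all GPUs.
    Returns the string that was written to the environment.
    """
    value = ",".join(str(gpu_id) for gpu_id in gpu_ids)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value
```

`select_gpus([0, 1])` has the same effect as prefixing the command with CUDA_VISIBLE_DEVICES=0,1, and `select_gpus([])` matches the empty setting used for a CPU-only process.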

3. Setting Up the Training Phase

First scenario (single machine single GPU, single machine multiple GPUs):

In main.py, set the training method to call and the paths, then run directly:

python main.py

The default is training mode.

Second scenario (multiple machines multiple GPUs):

As long as you specify --job_name in the command line, it will automatically run distributed training.

First, distribute all documents and datasets to each server.

For example, I currently have two servers, 10.100.1.151 and 10.100.1.120.

I want to use server 151 as both ps and worker.

Use server 120 as a worker.

First, set ps_hosts and worker_hosts in arg_parsing.py, separated by commas, with no spaces.

Then, run on server 151:

CUDA_VISIBLE_DEVICES='' python src/main.py --job_name=ps --task_index=0

Here, CUDA_VISIBLE_DEVICES='' means no GPU is used; the parameter server can run on the CPU.

Next, continue to run on server 151:

CUDA_VISIBLE_DEVICES=0 python src/main.py --job_name=worker --task_index=0

Finally, run on server 120:

CUDA_VISIBLE_DEVICES=0 python src/main.py --job_name=worker --task_index=1

You can use CUDA_VISIBLE_DEVICES to set which GPUs to enable, separated by commas.
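The comma-separated ps_hosts and worker_hosts settings described above map naturally onto the dictionary form that tf.train.ClusterSpec accepts. The sketch below builds that dictionary in plain Python. The helper names and the port numbers used in the example are hypothetical, and the chief rule (worker with task index 0) reflects how this article describes the index 0 machine; train.py may organize this differently.

```python
def parse_hosts(hosts):
    """Split a comma-separated 'host:port' list; spaces are rejected."""
    if any(ch.isspace() for ch in hosts):
        raise ValueError("host list must not contain spaces: %r" % hosts)
    return [host for host in hosts.split(",") if host]


def build_cluster(ps_hosts, worker_hosts):
    """Return a dict mapping job names to address lists."""
    return {"ps": parse_hosts(ps_hosts), "worker": parse_hosts(worker_hosts)}


def is_chief(job_name, task_index):
    """Treat the worker with task index 0 as the chief."""
    return job_name == "worker" and task_index == 0
```

For the two servers in this example, `build_cluster("10.100.1.151:2222", "10.100.1.151:2223,10.100.1.120:2222")` gives one ps address and two worker addresses. Server 151 appears twice with different ports because it runs both a ps and a worker process.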

4. Testing

Note that in a multi-machine multi-GPU setup, only the machine with index 0 saves the ckpt.

Prepare the .tfrecords for the test set, configure the path, and run in the command line:

python main.py --mode=testing

You must specify the testing mode.

Postscript

You can explore the command line parameters; although there are many, they are all quite simple.

I will continue to update...

Synchronous and asynchronous (2018.04.09 resolved)
Mean subtraction, normalization (2018.03.20 resolved)
Visualization (2018.03.16 resolved)
Debug (2018.03.19 resolved)
Validation (2018.03.26 resolved)
The issue of GPU memory being fully occupied (2018.03.22 resolved)
Preparing more network model documents (2018.03.19 mobilenet, 2018.03.22 resnet)
Selecting GPU (2018.03.19 resolved)
Optimizing the visualization of the network structure diagram (2018.04.02 resolved)
Finetune (2018.03.29 resolved)
New issue, unable to find a solution online. After completing asynchronous distributed training, the last event recording fails. This is because when the index=0 machine runs run_context.request_stop(), other machines may be in the validation phase. (2018.04.11 resolved)

New issue, in a distributed synchronous training environment, after the validation phase, the global step and i appear to be mismatched. See the images for details.

Machine with index=0:

Machine with index=1 (stuck at 990 steps, waiting for machine 0 to finish validation.):

The solution I thought of was to write the log and validation as hooks added to MonitoredTrainingSession, but after trying, I found that the issue was not resolved. (2018.04.10 resolved)

This is also related to the issue below:

~~When reaching the validation step, machines with different performances will not synchronize during validation.~~

~~The sequence is: when reaching the validation step, the faster machines perform validation first, and only after they finish does the slower machines perform validation.~~

This increases the overall training time. The specific cause I observed is this: if the validation step is set to 1000, then when the session runs step 1001, the parameters from step 1000 are first sent to the ps machine for the update and then sent back to the worker machine, and only then does training continue from step 1001. (I don't know if this is understandable -0-. MonitoredTrainingSession goes through three stages on each run: begin, before run, and after run. Passing parameters to the ps machine and back to the worker machine should all happen in the before run stage.) (2018.04.02: After rewriting MonitoredTrainingSession, validation is only performed on the machine with index=0.)

After each synchronous training, the machine with index 0 will hang for a while and then report an error:

I searched Google but found no solution. However, it does not affect our overall training; we can ignore it!

I have a question; in tensorboard, I observed that val_loss is recorded every 100 steps. However, I actually run validation every 1000 steps. I'm a bit confused.
In synchronous training, the faster machines will start training first. If the main machine does not start training first, the initial step in tensorboard will not start from 1.

---------2018.03.16 Update-----Visualization-------

I rewrote a network structure document (squeezenet.py), optimized the parameter calculation process, and added summary code, allowing for viewing many contents on tensorboard.

Other documents have been updated accordingly.

~~If you want to revert to the old network structure document (network.py), you just need to modify the inference line in the train.py document.~~

Transfer the models directory from the server with index=0 back to your local machine, and run in the command line:

tensorboard --logdir=models/

In the browser address bar, enter:

localhost:6006

You can then view the visualization content. You can also run the tensorboard command on the server and enter the server's IP address with port 6006 in your local browser.

---------2018.03.19 Update-----DEBUG-------

Added debug code to the training code: it is a single line in the hook list of tf.train.MonitoredTrainingSession. Whether debug mode is enabled is set in arg_parsing.py.

I also found an interface for debugging on tensorboard in the API, tfdbg.TensorBoardDebugHook(), but my local tensorflow version is 1.2, which does not have this interface, so I didn't test it. It depends on personal preference whether to debug in the command line or on tensorboard.

At the same time, I modified the parameter code; each run must specify the training mode.

---------2018.03.20 Update-----Mean Subtraction, Normalization-------

Modified img2bin_list.py, which will finally calculate and print the BGR three-channel mean.
Modified arg_parsing.py to add mean parameters.
Modified dataset.py to include mean subtraction and normalization operations during data reading.

---------2018.03.21 Update-----Added Network Models-------

Modified arg_parsing.py, test.py, train.py
Added mobilenet and mobilenet v2 network models

---------2018.03.22 Update-----Added Network Models, Resolved GPU Memory Issue-------

Modified arg_parsing.py, test.py, train.py
Added resnet-50, resnet-101, resnet-152, resnet-200 network models
Resolved the issue of GPU memory being fully occupied upon running, but it only works well on a single machine, not in a distributed environment.

---------2018.03.26 Update-----Added Validation Phase During Training-------

Significant changes were made to train.py, and all network structure .py files were modified, requiring TensorFlow version to be at least 1.4.0.

The reason is that tf.variable_scope() in earlier versions does not support reuse=tf.AUTO_REUSE.

Of course, if you can set reuse individually, theoretically, it can also work with lower versions of TensorFlow.

It's just more troublesome, needing to determine whether it's train or val, and set reuse to False or True accordingly.

---------2018.03.29 Update-----Finetune-------

Finetuning is now possible.

I performed finetuning based on a model I had previously trained.

~~I haven't looked for open-source pre-trained models yet.~~ (2018.04.02 I couldn't find suitable pre-trained models on Github, and even if I found them, there might be some issues.)

You can automatically finetune by specifying the model save path in the command line:

--finetune=path/

---------2018.04.02 Update-----Optimized Visualization Effects-------

Separated the training and validation phase summaries into different namespaces, and optimized the network structure diagram.

---------2018.04.09 Update-----Synchronous and Asynchronous Training-------

In a distributed environment, added a boolean switch (--issync) to control synchronous or asynchronous parameter updates during training.
Worker servers must run the program one by one starting from the machine with index 0; if the program is not run in order, it will cause significant differences in steps between the servers.
Rewrote MonitoredTrainingSession, writing log and validation into two separate hooks, and in synchronous mode, only the worker machine with index=0 will perform validation.

---------2018.04.10 Update-----Resolved Log Output Step Mismatch Issue-------

In a distributed, synchronous training environment, logging and validation are now driven by the global step, which the hooks read through run_context.session.run(global_step). This resolves the step mismatch between different machines.
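The trigger logic for step-driven logging or validation can be sketched independently of TensorFlow. The class below is hypothetical and not the hook from train.py. It assumes a worker may not observe every value of the shared global step, since other workers also advance it, so it fires when the step crosses a multiple of every_n instead of testing `global_step % every_n == 0`.

```python
class EveryNGlobalSteps(object):
    """Fire once each time the global step crosses a multiple of every_n."""

    def __init__(self, every_n):
        if every_n <= 0:
            raise ValueError("every_n must be positive")
        self.every_n = every_n
        self._last_bucket = 0  # index of the last multiple already handled

    def should_trigger(self, global_step):
        """Return True if a new multiple of every_n has been reached or passed."""
        bucket = global_step // self.every_n
        if bucket > self._last_bucket:
            self._last_bucket = bucket
            return True
        return False
```

Inside a hook's after_run method, the value passed to `should_trigger` would be the one read with run_context.session.run(global_step), and a True result would start logging or validation.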

---------2018.04.11 Update-----Resolved Asynchronous Training Failing to Record Last Event Issue-------

Changed asynchronous training to also control through reading global_step, and in ExitHook, forced the machine with index=0 to perform validation in the last phase, ensuring that other machines execute run_context.request_stop() first. This resolved the issue of failing to record the last event. My ideal solution to this problem is that when the machine with index=0 detects that the current global step has reached the training end count, it first waits for other machines to request stop before it requests stop itself.

---------2018.04.13 Update-----Added SSD Code-------

The specific model save path and data path need to be set correctly.
The images and xml files in the data path must be saved in two separate directories.