Some insights on training my own dataset with 【Darknet】【yolo v2】----VOC format


-------【2017.11.2 Update】------------SSD Portal----------

http://blog.csdn.net/renhanchi/article/details/78411095

------- 【2017.10.30 Update】------------Some Things to Mention-----------

Although I have hardly used darknet and yolo since writing this blog, I have been continuously updating this blog for more than half a year and solving problems for everyone.

If you want to learn how to use darknet and yolo through this blog, including some detection knowledge, I hope you can read every sentence of this article carefully and seriously.


Recently, I found that the author's code has changed a lot, resulting in many places being different from the content of my blog. I am now packaging the darknet that I initially downloaded for everyone.

**The difference between the old and new versions lies in the code; the algorithm architecture remains the same, so feel free to use the old version.**

After decompressing, enter the directory, configure the Makefile, and simply run make all.

https://pan.baidu.com/s/1jIR2oTo

1. Introduction#

There are indeed many blog posts about training your own VOC format data with yolo, but when I followed their methods step by step, I encountered issues that other authors did not mention. Here, I will share my own experience on how to train my own dataset.

2. Dataset#

I recommend using the VOC and ILSVRC competition datasets because the xml files are readily available, saving a lot of effort.

If you want to label your own data, you can search for labelImg on github, download it, and run it directly after making it. I won't elaborate on the specific usage here.

You can also directly download existing datasets.

The ILSVRC2015 competition address is: http://image-net.org/challenges/LSVRC/2015/download-images-3j16.php

The VOC competition address is: http://host.robots.ox.ac.uk/pascal/VOC/index.html

My dataset consists of all the person-related data from VOC2007, VOC2012, ILSVRC2013, and ILSVRC2014, since I only want to train a person detector. Note that ILSVRC images use the JPEG suffix; you can change it to jpg or leave it as-is, because the darknet code also accepts JPEG. To save trouble later, though, I changed them all to the jpg suffix.

Regarding how to extract all person-related data from VOC, you can use the shell script below, which can be slightly modified to extract data from ILSVRC as well. In the ILSVRC dataset, the category for people is not person but n00007846. Leaving n00007846 in the xml files does not affect the subsequent training, so you do not need to change it to person. The reason is that the label txt files identify categories by numeric indices 0, 1, 2, 3, and so on, not by words; these indices correspond to the order of the categories in data/names.list.

#!/bin/sh

year="VOC2012"

mkdir /your_path/VOCperson/${year}_Anno/  # Create folder
mkdir /your_path/VOCperson/${year}_Image/

cd /your_path/VOCdevkit/$year/Annotations/
grep -H -R "<name>person</name>" > /your_path/VOCperson/temp.txt  # Find lines with keywords and save them to a temporary document

cd /your_path/VOCperson/
cat temp.txt | sort | uniq > $year.txt     # Sort by name and delete duplicate adjacent lines that are exactly the same.
find -name $year.txt | xargs perl -pi -e 's|.xml:\t\t<name>person</name>||g'   # Remove suffix and other useless information from the document, keeping only the file names without suffixes

cat $year.txt | xargs -i cp /your_path/VOCdevkit/$year/Annotations/{}.xml /your_path/VOCperson/${year}_Anno/ # Copy annotation files based on file names
cat $year.txt | xargs -i cp /your_path/VOCdevkit/$year/JPEGImages/{}.jpg /your_path/VOCperson/${year}_Image/ # Copy dataset based on file names

rm temp.txt

3. Training Files#

3.1 Folder Setup#

Annotations ---- This folder contains all xml description files.

JPEGImages ---- This folder contains all jpg image files.

ImageSets -> Main ----
This folder contains a names txt document listing the names of all training images without suffixes (mine is named train.txt; note that this name must match the second element of the sets list in the Python code below).

PS: The official training method for VOC places images in two different paths for 2007 and 2012, which I find cumbersome, so I put all files in one folder.

3.2 txt Documents#

A total of three types of txt documents need to be prepared:

First is the names.txt document containing all training data names under the ImageSets folder mentioned above.

Then there are the label txt files, one per image, which are generated by scripts/voc_label.py; the paths in that file need to be modified. I placed all image files and xml files in one folder, and my only training category is person, so the initial sets and classes lists also need to be changed. Note that if you use the ILSVRC dataset and, like me, did not change n00007846 to person in the xml files, you must set classes to n00007846 so the script can find the bbox information for that category. Also note that the xml files in the ILSVRC dataset do not contain the difficult field, so the related lines in the .py file can be commented out.

Finally, there is the paths.txt document that saves the absolute paths of all training images. Note that the image file names in this document have the jpg suffix. When generating the above labels.txt document, paths.txt will be automatically generated.

Below is my .py file

import xml.etree.ElementTree as ET
import pickle
import os
from os import listdir, getcwd
from os.path import join

sets=[('person','train')]
classes = ["n00007846"]

def convert(size, box):
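    # Convert a VOC box given as (xmin, xmax, ymin, ymax) in pixels into YOLO's
    # normalized (x_center, y_center, width, height) relative to the image size.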
    dw = 1./(size[0])
    dh = 1./(size[1])
    x = (box[0] + box[1])/2.0 - 1
    y = (box[2] + box[3])/2.0 - 1
    w = box[1] - box[0]
    h = box[3] - box[2]
    x = x*dw
    w = w*dw
    y = y*dh
    h = h*dh
    return (x,y,w,h)

def convert_annotation(year, image_id):
    in_file = open('/home/hans/darknet/person/VOC%s/Annotations/%s.xml'%(year, image_id))
    out_file = open('/home/hans/darknet/person/VOC%s/labels/%s.txt'%(year, image_id), 'w')
    tree=ET.parse(in_file)
    root = tree.getroot()
    size = root.find('size')
    w = int(size.find('width').text)
    h = int(size.find('height').text)

    for obj in root.iter('object'):
        # difficult = obj.find('difficult').text
        cls = obj.find('name').text
        if cls not in classes: # or int(difficult)==1:
            continue
        cls_id = classes.index(cls)
        xmlbox = obj.find('bndbox')
        b = (float(xmlbox.find('xmin').text), float(xmlbox.find('xmax').text), float(xmlbox.find('ymin').text), float(xmlbox.find('ymax').text))
        bb = convert((w,h), b)
        out_file.write(str(cls_id) + " " + " ".join([str(a) for a in bb]) + '\n')

wd = getcwd()
for year, image_set in sets:
    if not os.path.exists('/home/hans/darknet/person/VOC%s/labels/'%(year)):
        os.makedirs('/home/hans/darknet/person/VOC%s/labels/'%(year))
    image_ids = open('/home/hans/darknet/person/VOC%s/ImageSets/Main/%s.txt'%(year, image_set)).read().strip().split()
    list_file = open('%s_%s.txt'%(year, image_set), 'w')
    for image_id in image_ids:
        list_file.write('%s/VOC%s/JPEGImages/%s.jpg\n'%(wd, year, image_id))
        convert_annotation(year, image_id)
    list_file.close()
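
As a quick sanity check on convert() (the numbers below are made up, not taken from my data): each line written to labels/xxx.txt has the form "class_id x_center y_center width height", all normalized to the image size.

# Hypothetical example, not part of voc_label.py: check convert() by hand.
# size = (width, height), box = (xmin, xmax, ymin, ymax)
print(convert((500, 400), (100.0, 300.0, 50.0, 350.0)))
# -> (0.398, 0.4975, 0.4, 0.75), written as "0 0.398 0.4975 0.4 0.75" for class index 0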

3.3 Training Configuration File#

First, create a person.names file in the data folder, containing only the word person.

Then modify the voc.data file in the cfg folder: change classes to 1; point the train path at the paths.txt mentioned above; point the names path at the person.names file; and set the backup path to wherever you want the training weight files saved.
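
For reference, here is roughly what my voc.data looked like for a single person class (the paths are placeholders; a valid entry is added later, in section 6, for evaluation):

classes = 1
train   = /your_path/paths.txt
names   = data/person.names
backup  = backup/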

Finally, select a .cfg network. Many blog posts and the darknet official site use yolo-voc.cfg, but I consistently failed to train with that network: after training for a long time, testing produced no bboxes and no predict results. There are two possible reasons for this: either the training diverged, or the training was insufficient and the confidence of the predictions is too low. Divergence is easy to spot; just watch the two loss values printed after the iteration number during training. If they keep climbing into the hundreds, training has diverged, which can be resolved by lowering the learning rate and increasing the batch size. I will explain the meaning of these parameters and of the training output later.
For the case of insufficient training, you can still display predict results by setting the threshold at test time; darknet's default is .25, and you can try gradually lowering it to see the effect:

./darknet detector test cfg/voc.data cfg/yolo_voc.cfg backup/xxx.weights data/your_image.jpg -thresh 0.25

At that time, my understanding of darknet was not deep enough, and when I couldn't solve the above problems, I switched to the yolo-voc.2.0.cfg network that I currently use. Here’s a brief explanation of the parameters inside:

batch:
The number of images sent to the network in each iteration, also known as batch size. Increasing this can allow the network to complete one epoch in fewer iterations. Given a fixed maximum number of iterations, increasing the batch size will extend the training time but will better find the direction of gradient descent. If you have enough GPU memory, you can appropriately increase this value to improve memory utilization. This value needs to be continuously tested; if it is too small, it will prevent convergence, and if it is too large, it will fall into local optima.

subdivisions:
This parameter is interesting: it means each batch is not sent to the network all at once. Instead, it is split into subdivisions parts, and after every part has been processed the results are accumulated and counted as one iteration, which reduces GPU memory usage. If this parameter is 1, all batch images are sent to the network at once; if it is 2, half are sent at a time.
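
A minimal illustration with assumed values (not taken from my cfg):

batch, subdivisions = 64, 8         # assumed example values
mini_batch = batch // subdivisions  # 8 images go through the GPU at a time
# gradients from the 8 mini-batches are accumulated, so one iteration still covers 64 images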

angle: The rotation angle applied to images, used for data augmentation; rotating images effectively enlarges the training set.

saturation, exposure, hue: Color-jitter parameters, likewise used for data augmentation.

learning_rate: The learning rate; if training diverges, you can lower the learning rate. If training encounters a bottleneck and the loss remains unchanged, you can also reduce the learning rate.

max_batches: The maximum number of iterations.

policy: The learning policy, usually step-based.

steps, scales: These two work together. For example, with learning_rate 0.001, steps=100,25000,35000 and scales=10,.1,.1: during iterations 0-100 the learning rate is the base 0.001; from 100 to 25000 it is 10 times the base, i.e. 0.01; from 25000 to 35000 it is 0.1 times the current value, i.e. 0.001; and from 35000 up to the maximum iteration it is again 0.1 times the current value, i.e. 0.0001. Lowering the learning rate as iterations increase lets the model learn more effectively and keep reducing the training loss.
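
A small sketch of how I understand the steps policy, using the example numbers above (an illustration, not darknet's actual code):

def lr_at(iteration, base_lr=0.001, steps=(100, 25000, 35000), scales=(10, 0.1, 0.1)):
    # each time the iteration passes one of the steps, the current rate is multiplied by the matching scale
    lr = base_lr
    for step, scale in zip(steps, scales):
        if iteration >= step:
            lr *= scale
    return lr

print(lr_at(50), lr_at(200), lr_at(30000), lr_at(40000))  # -> 0.001 0.01 0.001 0.0001 (up to float rounding)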

In the last convolutional layer, the filters value is 5×(number of classes + 5); for a single person class that is 5×(1+5) = 30. I won't elaborate on the reason here; just remember it.

In the region layer, change classes to your own number of classes.

The last line, random, is a switch. If set to 1, each batch of images is randomly resized during training to a size between 320 and 640 (in multiples of 32); the purpose is the same as that of the augmentation parameters such as saturation and exposure above. If set to 0, all images are only resized to the default 416×416. (Updated 2018.04.08: a reader commented that with random=1, obj and noobj may all become 0 during training; setting it to 0 resolved everything.)

3.4 Start Training#

You can download a pre-trained weights file (darknet19_448.conv.23, used in the command below) to improve your training efficiency.

Once all the above preparations are done, you can start training your model.

Run in terminal:

./darknet detector train cfg/voc.data cfg/yolo_voc.cfg darknet19_448.conv.23

-------- 【2017.06.29 Update】 --------- The source code has undergone significant changes by the author; the old version can refer to the content below --------------------------------------

I want to mention one point: other blog posts suggest modifying the .c source files. In fact, this is unnecessary for lazy people like us. The reason is that there is a segment of code for executing tests on the official website:

./darknet detect cfg/yolo.cfg yolo.weights data/dog.jpg

This is a simplified execution statement. Its complete form is:

./darknet detector test cfg/coco.data cfg/yolo.cfg yolo.weights data/dog.jpg

Actually, modifying the .c files allows us to use the simplified test execution statement, and the program will automatically call the paths set in the .c file. I personally think this is unnecessary. Also, the latest version no longer contains the yolo.cu file.


4. Training Output#

Here, I will talk about what the outputs are; some of them I don't fully understand, but I will mention the useful ones that I do understand.

Region Avg IOU: This is the intersection of the predicted bbox and the actual labeled bbox divided by their union. Obviously, the larger this value, the better the prediction result.

Avg Recall: This represents the average recall rate, meaning the number of detected objects divided by the total number of labeled objects.

count: The total number of labeled objects. If count = 6 and recall = 0.66667, it means there are 6 labeled objects in total (possibly of different categories; category is ignored here) and 4 of them were detected, so Recall is 4/6 = 0.66667.

There is one line that differs from the above: it starts with the iteration number, followed by train loss, avg train loss, learning rate, batch processing time, and the total number of images processed so far. Pay special attention to train loss and avg train loss; both should gradually decrease as the number of iterations increases. If the loss climbs into the hundreds, training has diverged. If the loss stays unchanged for a while, you need to lower the learning rate or change the batch size to enhance learning, though it may also simply mean that training is already sufficient; that calls for your own judgment.

5. Visualization#

Here, I will share a Matlab code for visualizing loss. It can intuitively show the change curve of your loss.

First, during training, you can use the script command to record all terminal outputs to a txt document.

script -a log.txt

./darknet detector train cfg/voc.data cfg/yolo_voc.cfg darknet19_448.conv.23

After training is complete, remember to press ctrl+D or type exit to end the script session.

Below is the Matlab code:

clear;
clc;
close all;

train_log_file = 'log.txt';

[~, string_output] = dos(['cat ', train_log_file, ' | grep "avg," | awk ''{print $3}''']);
train_loss = str2num(string_output);
n = 1:length(train_loss);
idx_train = (n-1);

figure;plot(idx_train, train_loss);

grid on;
legend('Train Loss');
xlabel('iterations');
ylabel('avg loss');
title(' Train Loss Curve');

I plotted the avg train loss curve. My batch size is 8, so the curve fluctuates significantly. The learning rate is 0.0001, dropping to 0.00001 after 25000 iterations. Notice that the loss stabilizes around 7-8 and barely decreases afterwards. My understanding is that there are two possible situations: either it has fallen into a local optimum or learning has hit a bottleneck, but I couldn't resolve the issue. Lowering the batch size to improve learning does not change the loss; increasing the batch size lets the network consider the bigger picture and lowers the loss slightly, but then it stays flat; lowering the learning rate also lowers it slightly, but again it stays flat. I hope someone knowledgeable will see this article and offer guidance. The slight decrease after 25000 iterations comes from lowering the learning rate, but the drop is small and the curve quickly flattens again. The training set has over 43,000 images, selected from the ILSVRC2015 and VOC2012 training sets, all related to people.

----------- 【2017.09.26 Update】 -----------------

I had been too lazy to write this supplementary note. I mentioned earlier that I might have fallen into a local optimum, and I did a lot of work without any improvement. The real problem is that I had been using caffe's loss values as a reference standard for darknet, which is misguided. The model trained above actually performs very well; it can detect even a small head far off in the distance.

Now, looking back at the training parameters, I think there is room for improvement. I will note it here for future reference when using darknet again. Since I did fine-tuning, the learning rate is very low, which is fine, but the momentum can be considered to increase appropriately, changing from 0.9 to 0.99. The learning policy can also be changed to poly.


----------- 【2017.10.30 Update】 ------------------------

Recently I started working on some detection tasks, and looking back at darknet I want to correct one point: for detection, you should not blindly use loss to evaluate the quality of a model. Loss should be used to judge whether training is proceeding normally; for example, if the loss keeps increasing at the start of training and eventually becomes NaN, the learning rate is too high. So what value should we use to evaluate model quality? It should be mAP, the current mainstream standard, which is essentially the mean average precision. Darknet does not output precision during training, so we can also judge through recall: the closer recall is to 1.0, the more of the actual objects the model detects. If you use the old version of darknet and modify the source code, precision can be output during the testing phase; the specifics are written later in this article.


----------- 【2017.11.30 Update】 Added IOU --------------

I previously overlooked the IOU output; this output can also be used to judge whether the model is being trained correctly during the training phase and how effective it is in the end.



----------------- 【 2017.09.19 Update】 Python Visualization Code --------------------------------------------

I initially wrote this for caffe, but the principle is the same, so I will update this as well.

#!/usr/bin/env python2
# -*- coding: utf-8 -*-
"""
Created on Tue Aug 29 10:05:13 2017

@author: hans

http://blog.csdn.net/renhanchi
"""
import matplotlib.pyplot as plt
import numpy as np
import commands

train_log_file = "vegetable_squeezenet.log"

display = 10 #solver
test_interval = 100 #solver

train_output = commands.getoutput("cat " + train_log_file + " | grep 'avg,' | awk '{print $3}'")  #train loss

train_loss = np.array(train_output.split("\n"), dtype=float)  # convert the logged strings to floats before plotting

_,ax1 = plt.subplots()

l1, = ax1.plot(display*np.arange(len(train_loss)), train_loss)

ax1.set_xlabel('Iteration')
ax1.set_ylabel('Train Loss')

plt.legend([l1], ['Train Loss'], loc='upper right')
plt.show()

If you find the fluctuations too large and the trend hard to see, you can refer to http://blog.csdn.net/renhanchi/article/details/78411095 for visualization code and make some modifications.
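
If you just want to take some of the jitter out of the curve without following that link, one simple option (my own assumption, not the code from that post) is a plain moving average over the parsed loss values:

import numpy as np

def moving_average(values, window=20):
    # average each point with its neighbours; a larger window gives a smoother curve
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode='valid')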


6. Evaluating the Model#

-------- 【2017.07.03 Update】 -------------------------------------------------------------------------

The following command does not apply to the new version of darknet (if you downloaded the old version of darknet from the Baidu cloud link above, the following command is applicable).

First, use the following command:

./darknet detector recall cfg/xxx.data cfg/xxx.cfg backup/xxx.weights

The following command is applicable in the new version

In the xxxx.data file, add a line under train: valid=path/to/valid/images.txt

./darknet detector valid cfg/.....data cfg/.....cfg backup/....weights

Then it outputs a series of numbers, what is this??


------ 【2017.06.29 Update】 --- The source code has undergone significant changes by the author; the content below is no longer applicable to the new version- -----------

If you downloaded the old version of darknet from the Baidu cloud link above, the following command is still applicable

The output is cumulative, and the result only includes recall, not precision. Because the first generation of yolo had certain flaws and its precision was relatively low compared to other methods, the author simply set the threshold very low, focusing only on recall and abandoning precision. The second generation of yolo fixed this flaw, so you can modify the code to output precision. Open src/detector.c and find the validate_detector_recall function, where the line float thresh = ...; sets the threshold. The original author's value was 0.0001, which is why my initial precision was only a little over 1% -0-; I changed it to 0.25.
Continuing to look down, find this line:

fprintf(stderr, "%5d\t%5d\t%5d\tRPs/Img: %.2f\tIOU: %.2f%%\tRecall:%.2f%%\t", i, correct, total, (float)proposals/(i+1), avg_iou*100/total, 100.*correct/total);

Change it to the following line:

fprintf(stderr, "Number: %5d\tCorrect: %5d\tTotal: %5d\tRPs/Img: %.2f\tIOU: %.2f%%\tRecall:%.2f%%\tProposals: %5d\tPrecision: %.2f%%\n", i, correct, total, (float)proposals/(i+1), avg_iou*100/total, 100.*correct/total, proposals, 100.*correct/(float)proposals);


After recompiling, execute the recall command again, and you will have precision now.


------ 【2017.12.22 Update】 --- Rephrasing Correct -----------------------------------------

Correct indicates how many bboxes were correctly identified. The value is calculated as follows: when an image is input into the network, the network predicts many bboxes for each category of objects. Each bbox has its confidence probability, and the bboxes with probabilities greater than the threshold are compared with the actual bboxes (i.e., the content of the labels in the txt file) to calculate the IOU. The bbox with the highest IOU for the current category is found, and if this maximum value exceeds the preset IOU threshold, it indicates that the current category object is classified correctly, and correct is incremented.
I will elaborate a bit more on the bbox threshold; we can adjust it through the command line using -thresh, which is also the threshold mentioned above for modifying the source code to output precision. The IOU threshold can only be adjusted through the source code.


Regarding the output parameters, my understanding is as follows:

Number indicates which image is being processed.

Correct indicates how many bboxes were correctly identified. As described above, when an image is input into the network it predicts many bboxes, each with its own confidence probability; the ones above the threshold are matched against the ground-truth bboxes by IOU, and every match whose best IOU exceeds the IOU threshold is counted in Correct.

Total indicates how many actual bboxes there are.

Rps/img indicates how many bboxes are predicted on average per image.

IOU I have explained above.

Recall I have also explained above. From the code, we can see that it is the value of Correct divided by Total.

Proposal indicates the number of predicted bboxes that exceed the threshold.

Precision indicates accuracy, which is the value of Correct divided by Proposal.

In summary, when an image is input into the network, Npro bboxes above the threshold are predicted, some correct and some incorrect, and nCor is the number of correctly predicted objects. Recall is the ratio of correctly predicted objects (nCor) to the actual number of objects (Total); Precision is the ratio of correctly predicted objects (nCor) to the total number of predictions (Proposal). As mentioned above, the author originally set the threshold to 0.0001, which makes Proposal (Npro) very large and Correct (nCor) somewhat larger as well, so Recall comes out high while Precision comes out very low.
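
The relationships above can be written out in a few lines (a sketch of the definitions as I understand them, not darknet's actual evaluation code):

def iou(a, b):
    # a, b are boxes as (xmin, ymin, xmax, ymax); returns intersection over union
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

n_cor, total, n_pro = 4, 6, 10      # correct detections, labeled objects, proposals above -thresh
recall = float(n_cor) / total       # 4/6 = 0.66667, as in the count example above
precision = float(n_cor) / n_pro    # drops quickly when a tiny threshold inflates n_pro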

Note: The above content is partially referenced from http://blog.csdn.net/hysteric314/article/details/54097845

For recall, precision, and IOU, you can check this article
http://blog.csdn.net/hysteric314/article/details/54093734
