Recently, I have been studying the ILSVRC_DET dataset and wanted to train models that detect only certain specified categories. I wrote the following shell script to extract all images and annotation files for the specified categories from the 2011-2014 competition train and val sets.
#!/bin/bash # bash, not sh: the script uses arrays and let, which are bash features
stage=val # I keep the training set and validation set in separate folders; change to train as needed
classes=(person)
nums=(n00007846) # ILSVRC corresponding code
years=(2011 2012 2013)
root="/media/hans/000DABD1000BEFA8/ImageNet" # top-level directory to store all files
i=0 # index used to pair each class with its WordNet code
for class in ${classes[@]};
do
all_path="$root/Home/${class}/all"
mkdir -p $all_path/Annotations/ # directory to store all files together
mkdir -p $all_path/JPEGImages/
num=${nums[$i]} # pick the WordNet code that matches the current class
for year in ${years[@]};
do
data_path="$root/ILSVRC$year/$stage" # specify the location of the original data
year_path="$root/Home/${class}/${year}" # specify the location of the data for each year
name="${class}_${year}_name.txt" # file listing the matching file names for each year (extensions stripped below)
path="${class}_${year}_path.txt" # file listing the absolute paths of the matching annotations
mkdir -p $year_path/Annotations/
mkdir -p $year_path/JPEGImages/
echo creating $path ...
find $data_path/xml/ -type f | xargs grep -l "<name>$num</name>" > $year_path/$path
# recursively search all directories under the original xml data file, find files containing the specified code, and save the absolute path of the file to the path.txt file
echo creating $name ...
cd $year_path/
cat $path | awk -F '/' '{print $NF}' > $name # extract file names with file extensions
perl -pi -e 's|\.xml$||' $name # strip the .xml extension from each name (anchored so only the trailing extension is removed)
echo creating temp.txt ...
cat $name >> $all_path/temp.txt # save all file names without overwriting to a temporary document
# copy .xml documents
echo copying ${class}_${year}.xml documents ...
cat $path | xargs -I {} cp {} $year_path/Annotations/
# copy the original data to the directory for each year based on the absolute path
/bin/cp -rf $year_path/Annotations/*.xml $all_path/Annotations/ # cp without prompt
# search for the specified .xml files for each year based on the file name, integrate all year data into one directory, and overwrite duplicate items
# copy .JPEG images
echo copying ${class}_${year}.JPEG images ...
cat $name | xargs -I {} find $data_path/img/ -name {}.JPEG | xargs -I {} cp {} $year_path/JPEGImages/
# search for the original data based on the file name and specified file extension, and save it to the directory for each year
/bin/cp -rf $year_path/JPEGImages/*.JPEG $all_path/JPEGImages/
# search for each year's data based on the file name and specified file extension, integrate all year data into one directory, and overwrite duplicate items
echo
done
echo creating ${class}_all_name.txt ...
cd $all_path/
cat temp.txt | sort | uniq > ${class}_all_name.txt
# sort the data names containing all years, delete duplicate file names, and save them to a new document.
echo deleting temp.txt ...
rm temp.txt
let i+=1 # advance the index so the next class is paired with its own WordNet code
done
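After the script finishes, a quick sanity check can confirm that every annotation in the combined directory has a matching image. This is only a sketch: /tmp/demo_all and the sample file names below are stand-ins for the real $root/Home/${class}/all output, fabricated so the snippet runs on its own.

```shell
# Sanity-check sketch: compare annotation and image counts in the "all" folder.
# /tmp/demo_all stands in for the real output directory; the files are fake.
all_path="/tmp/demo_all"
rm -rf "$all_path"
mkdir -p "$all_path/Annotations" "$all_path/JPEGImages"
touch "$all_path/Annotations/demo_0001.xml" "$all_path/JPEGImages/demo_0001.JPEG"
xml_count=$(find "$all_path/Annotations" -name '*.xml' | wc -l)
img_count=$(find "$all_path/JPEGImages" -name '*.JPEG' | wc -l)
echo "annotations: $xml_count  images: $img_count"
```

If the two counts differ, some images were missing from the img directories and the corresponding names are worth grepping for in the per-year path.txt files.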
I am new to scripting and have been doing a lot of research online. The world of scripting is vast: many different commands can achieve the same effect, and I am learning slowly.
The directory layout is split this finely because it greatly reduces the time spent on recursive searches. The compressed archives are extracted into their corresponding subdirectories, and the script above is placed in the Home directory.
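The step that makes the whole script work is the grep -l call, which prints only the names of files whose contents match the pattern. A minimal self-contained demonstration (toy XML files in a temp directory; n00007846 is the person code from the script, n99999999 is a made-up code used only for contrast):

```shell
# Demo of the grep -l filter the script relies on (all files are fabricated).
tmp=$(mktemp -d)
printf '<annotation><object><name>n00007846</name></object></annotation>\n' > "$tmp/match.xml"
printf '<annotation><object><name>n99999999</name></object></annotation>\n' > "$tmp/nomatch.xml"
# -l lists matching file names instead of matching lines
hits=$(find "$tmp" -type f | xargs grep -l "<name>n00007846</name>")
echo "$hits"
```

Only match.xml should be listed; nomatch.xml is silently skipped, which is exactly how the script builds each year's path.txt.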
Below is the directory tree:
ImageNet -|
|- Home
|- ILSVRC2011 -|
| |- train -|
| | |- img
| | |- xml
| |- val -|
| |- img
| |- xml
|- ILSVRC2012 -|
| |- train -|
| | |- img
| | |- xml
| |- val -|
| |- img
| |- xml
|- ILSVRC2013 -|
| |- train -|
| | |- img
| | |- xml
| |- val -|
| |- img
| |- xml
|- ILSVRC2014 -|
| |- train -|
| | |- img
| | |- xml
| |- val -|
| |- img
| |- xml
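To reproduce the layout above from scratch, a nested loop over years and stages is enough. This is a sketch: /tmp/ImageNet_demo is a stand-in for the real top-level directory.

```shell
# Build the directory skeleton shown above under a demo root.
root="/tmp/ImageNet_demo"   # stand-in for the real ImageNet root
rm -rf "$root"
mkdir -p "$root/Home"
for year in 2011 2012 2013 2014; do
  for stage in train val; do
    mkdir -p "$root/ILSVRC$year/$stage/img" "$root/ILSVRC$year/$stage/xml"
  done
done
# root + Home + 4 year dirs + 8 stage dirs + 16 img/xml dirs = 30
find "$root" -type d | wc -l
```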