【Shell】Extract the images and annotation files for specific classes from the ILSVRC_DET dataset

Recently I have been working with the ILSVRC_DET dataset and wanted to train models that detect only certain categories. I wrote the following shell script to extract all images and annotation files for the specified categories from the train and val sets of the 2011-2014 competitions.

#!/bin/bash

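stage=val   # train and val are kept in separate folders; set this to "train" or "val"
classes=(person)
nums=(n00007846) # WordNet ID used by ILSVRC for each class, in the same order as classes
years=(2011 2012 2013) # years to process; add 2014 here to cover that year as well
root="/media/hans/000DABD1000BEFA8/ImageNet" # top-level directory that holds all files
i=0  # index into nums, kept in step with classes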

for class in "${classes[@]}";
do
    all_path="$root/Home/${class}/all"
    mkdir -p "$all_path/Annotations/" # combined directory for the files from all years
    mkdir -p "$all_path/JPEGImages/"

    num=${nums[$i]} # WordNet ID that belongs to the current class

    for year in "${years[@]}";
    do
        data_path="$root/ILSVRC$year/$stage" # location of the original data
        year_path="$root/Home/${class}/${year}" # location of the extracted data for this year

        name="${class}_${year}_name.txt" # list of matching file names for this year, without file extension
        path="${class}_${year}_path.txt" # list of absolute paths to the matching .xml files


        mkdir -p "$year_path/Annotations/"
        mkdir -p "$year_path/JPEGImages/"

        echo creating $path ...
        # Recursively search the original xml directory for files that contain the
        # specified code, and save their absolute paths to the path.txt file.
        find "$data_path/xml/" -type f | xargs grep -l "<name>$num</name>" > "$year_path/$path"

        echo creating $name ...
        cd "$year_path/"
        awk -F '/' '{print $NF}' "$path" > "$name" # extract the file names, still with the .xml extension
        perl -pi -e 's|\.xml$||' "$name" # strip the .xml extension in place

        echo creating temp.txt ...
        cat "$name" >> "$all_path/temp.txt" # append this year's file names to a temporary list

        # copy .xml documents
        echo copying ${class}_${year}.xml documents ...
        # Copy each matching annotation from the original data into this year's directory.
        xargs -I {} cp {} "$year_path/Annotations/" < "$path"

        # Merge this year's annotations into the combined directory, overwriting duplicates without prompting.
        /bin/cp -rf "$year_path"/Annotations/*.xml "$all_path/Annotations/"

        # copy .JPEG images
        echo copying ${class}_${year}.JPEG images ...
        # Look up each image in the original data by file name and copy it into this year's directory.
        xargs -I {} find "$data_path/img/" -name '{}.JPEG' < "$name" | xargs -I {} cp {} "$year_path/JPEGImages/"
        # Merge this year's images into the combined directory, overwriting duplicates.
        /bin/cp -rf "$year_path"/JPEGImages/*.JPEG "$all_path/JPEGImages/"

        echo 
    done

    echo creating ${class}_all_name.txt ...
    cd "$all_path/"
    # Sort the file names collected from all years and drop duplicates.
    sort -u temp.txt > "${class}_all_name.txt"

    echo deleting temp.txt ...
    rm temp.txt

    i=$((i + 1)) # advance the index so nums stays aligned with classes
done
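
The image lookup in the script depends on each annotation having an image with the same base name, so it can be worth checking the result afterwards. The following is only a sketch, assuming the Annotations/JPEGImages layout the script creates; the function name missing_images is made up for this example and is not part of the original script.

```shell
# missing_images ANN_DIR IMG_DIR
# Print the base name of every .xml file in ANN_DIR that has no
# matching .JPEG file in IMG_DIR. Prints nothing when all are paired.
missing_images() {
    local ann_dir="$1"
    local img_dir="$2"
    local xml base
    for xml in "$ann_dir"/*.xml; do
        [ -e "$xml" ] || continue          # the glob matched nothing
        base=$(basename "$xml" .xml)
        [ -f "$img_dir/$base.JPEG" ] || echo "$base"
    done
}
```

It could be called as missing_images "$all_path/Annotations" "$all_path/JPEGImages" once the main loop has finished; an empty result means every annotation has its image.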

I am new to scripting and have been researching a lot online. I have found that the world of scripting is vast, with many different commands that can achieve the same effect. I am learning slowly.
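
As one example of different commands reaching the same result: the find, xargs grep, awk and perl steps in the script could probably be collapsed into a single pipeline. This is a rough sketch rather than a drop-in replacement. It assumes a grep that supports -r and --include (GNU grep does), and the function name list_matching_ids is invented here.

```shell
# list_matching_ids DIR WNID
# Print the extension-less names of all .xml files under DIR that
# contain <name>WNID</name>, one per line, sorted and de-duplicated.
list_matching_ids() {
    local dir="$1"
    local wnid="$2"
    grep -rlF --include='*.xml' "<name>${wnid}</name>" "$dir" \
        | sed -e 's|.*/||' -e 's|\.xml$||' \
        | sort -u
}
```

For instance, list_matching_ids "$data_path/xml" "$num" would print the same kind of list that the script stores in the name.txt file, but without keeping the absolute paths that the later copy steps rely on.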

The data is split into many directories because this greatly reduces the time spent on recursive searches. The compressed archives are extracted into the corresponding subdirectories. The script above is placed in the Home directory.
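
Since the script sits in Home, one level below the ImageNet directory, the hard-coded root could in principle be derived from the script's own location instead. A minimal sketch under that assumption; resolve_root is a hypothetical helper, not something the original script defines.

```shell
# resolve_root SCRIPT_PATH
# Print the absolute path of the directory one level above the
# directory that contains SCRIPT_PATH (ImageNet, if the script is in Home).
resolve_root() {
    local script_path="$1"
    (cd "$(dirname "$script_path")/.." && pwd)
}
```

Inside the script this would be used as root=$(resolve_root "$0"), which only gives the right answer as long as the script really stays in Home.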

Below is the directory tree:

ImageNet
|- Home
|- ILSVRC2011
|  |- train
|  |  |- img
|  |  |- xml
|  |- val
|     |- img
|     |- xml
|- ILSVRC2012
|  |- train
|  |  |- img
|  |  |- xml
|  |- val
|     |- img
|     |- xml
|- ILSVRC2013
|  |- train
|  |  |- img
|  |  |- xml
|  |- val
|     |- img
|     |- xml
|- ILSVRC2014
   |- train
   |  |- img
   |  |- xml
   |- val
      |- img
      |- xml
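
Because the script assumes this layout, a quick check before running it can save a long search that ends with empty results. The sketch below assumes the ILSVRC<year>/<stage>/img and ILSVRC<year>/<stage>/xml directories shown in the tree; check_layout is a made-up name.

```shell
# check_layout ROOT STAGE YEAR...
# Report every expected img/xml directory that does not exist.
# Returns 0 when the layout is complete, 1 otherwise.
check_layout() {
    local root="$1"
    local stage="$2"
    local status=0
    local year sub
    shift 2
    for year in "$@"; do
        for sub in img xml; do
            if [ ! -d "$root/ILSVRC$year/$stage/$sub" ]; then
                echo "missing: $root/ILSVRC$year/$stage/$sub" >&2
                status=1
            fi
        done
    done
    return "$status"
}
```

With the variables from the script, a call such as check_layout "$root" "$stage" "${years[@]}" || exit 1 would stop early if a year has not been extracted yet.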