[Machine Learning] [Python] Pre-classifying images with histogram features (decision tree, random forest, AdaBoost)

GitHub: https://github.com/HansRen1024/Image-Pre-Classification

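The idea is to run a pre-classification step before classifying an image with a convolutional neural network. A binary classifier is enough: it only has to decide whether the current image contains any of the objects I want to classify.

The reason is that a convolutional neural network always outputs a result for whatever image you feed it. The confidence of that result may not be high, but in certain cases it still exceeds the threshold you set.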
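The gating idea described above can be sketched roughly as follows. This is only an illustration: `classify_with_gate`, `pre_classifier`, `cnn`, and `threshold` are hypothetical names that do not appear in the original script, and both models are assumed to be plain callables.

```python
def classify_with_gate(image, pre_classifier, cnn, threshold=0.5):
    """Run the CNN only if the binary pre-classifier accepts the image.

    pre_classifier(image) -> bool        (hypothetical: True if a target object is present)
    cnn(image) -> (label, confidence)    (hypothetical wrapper around the CNN)
    Returns the CNN label, or None when the image is rejected.
    """
    if not pre_classifier(image):
        return None  # rejected before the CNN is ever run
    label, confidence = cnn(image)
    # The confidence threshold is still applied to whatever passes the gate.
    return label if confidence >= threshold else None
```

With this arrangement, an irrelevant image has to fool both the pre-classifier and the confidence threshold before it produces a false positive.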

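There are many feature-extraction algorithms, and the histogram is too weak a feature to be practical here.

On a small dataset, histograms can still train a reasonably good model. On the full dataset, however, they do not work well.

What I offer here is a line of thinking, not a solution. I am still exploring.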

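--------- [2018.01.24] Update --------------------

All runs use the same training set (70,000+ images) and test set (10,000 images).

No parameter tuning, 256-dimensional grayscale histogram, AdaBoost: I forgot to record the test-set accuracy.

No parameter tuning, 768-dimensional color histogram, AdaBoost: accuracy 0.8516.

Same parameters, reduced to 221 dimensions with PCA at a retained-variance ratio of 0.99, AdaBoost: accuracy 0.6483, which suggests the histogram features are highly independent of one another. I then tried automatic dimensionality reduction with the MLE algorithm, but it removed only one dimension. I will try LDA next.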
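The post does not show the PCA step itself. As a rough sketch of what reducing "by a 0.99 variance ratio" means, the NumPy function below keeps the fewest principal components whose cumulative explained variance reaches the requested ratio. `pca_reduce` and its arguments are hypothetical names, not part of the original script; in practice scikit-learn's `PCA(n_components=0.99)` would be the more likely tool.

```python
import numpy as np

def pca_reduce(X, variance_ratio=0.99):
    """Project X onto the fewest principal components whose cumulative
    explained-variance ratio reaches variance_ratio (hypothetical helper)."""
    X = np.asarray(X, dtype=np.float64)
    mean = X.mean(axis=0)
    Xc = X - mean                                   # PCA works on centered data
    # Rows of Vt are the principal axes; S holds the singular values.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = S ** 2                              # proportional to per-axis variance
    ratio = np.cumsum(explained) / explained.sum()
    # First position where the cumulative ratio reaches the target, as a count.
    k = min(int(np.searchsorted(ratio, variance_ratio)) + 1, len(S))
    axes = Vt[:k]
    return np.dot(Xc, axes.T), axes, mean
```

Test data would be projected with the training statistics, i.e. `np.dot(X_test - mean, axes.T)`, so that both sets live in the same reduced space.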

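After parameter tuning, 768-dimensional color histogram, AdaBoost: accuracy is 0.90 so far, and tuning is not finished. There are many parameters and it is really slow!

Parameters to be determined, 256-dimensional LBP histogram, AdaBoost: to come in the next update.

Parameters to be determined, 1024-dimensional color histogram plus LBP histogram, AdaBoost: to come in the next update.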
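The LBP feature is not implemented in the script below, and the post does not say which LBP variant is planned. As one possible reading, here is a minimal sketch of the basic 3x3 LBP operator, which yields a 256-bin histogram. `lbp_histogram` is a hypothetical name, and the bit ordering is an arbitrary choice.

```python
import numpy as np

def lbp_histogram(gray):
    """Return a (1, 256) histogram of basic 3x3 LBP codes for a 2-D uint8 image."""
    g = np.asarray(gray, dtype=np.int32)
    h, w = g.shape
    center = g[1:-1, 1:-1]                      # border pixels have no full neighbourhood
    # Eight neighbours, clockwise from the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros(center.shape, dtype=np.int32)
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = g[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        # Set this bit wherever the neighbour is at least as bright as the center.
        codes |= (neighbour >= center).astype(np.int32) << bit
    hist = np.bincount(codes.ravel(), minlength=256)
    return hist.reshape(1, -1).astype(np.float32)
```

The output has the same `(1, 256)` float32 layout as the `cv2.calcHist` features in the script, so a 1024-dimensional vector could be formed with `np.hstack` on the 768-dimensional color histogram and this one.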

At present I do no preprocessing such as zero-centering, normalization, or regularization.
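For reference, two of the preprocessing steps named above might look like the following sketch. The exact steps the author has in mind are not stated, so this is an assumption: `l1_normalize` scales each histogram to sum to 1, and `standardize` zero-centers and scales each dimension. Both function names are hypothetical.

```python
import numpy as np

def l1_normalize(X):
    """Scale each row (one histogram) so that its bins sum to 1."""
    X = np.asarray(X, dtype=np.float64)
    totals = X.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0          # leave all-zero rows unchanged
    return X / totals

def standardize(X_train, X_test):
    """Zero-center and scale every dimension using training statistics only."""
    X_train = np.asarray(X_train, dtype=np.float64)
    X_test = np.asarray(X_test, dtype=np.float64)
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0                # constant dimensions become all zeros
    return (X_train - mean) / std, (X_test - mean) / std
```

The statistics come from the training set alone so that no information from the test set leaks into the model. Tree-based models split on per-dimension thresholds, so standardization is likely to matter more for PCA than for the classifiers themselves.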

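Because none of that preprocessing is applied, the variance of each dimension is still quite large, but I am not worrying about that for now.

Using the correlation-coefficient method I looked at the top 221 feature dimensions. Whether this works as dimensionality reduction remains to be verified.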
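The post gives no code for the correlation-coefficient method. A minimal sketch, assuming it means ranking each histogram bin by the absolute Pearson correlation between that bin and the class label, could look like this. `top_k_by_correlation` is a hypothetical name.

```python
import numpy as np

def top_k_by_correlation(X, y, k=221):
    """Return (indices of the k columns with the largest |Pearson r| against y, all r)."""
    X = np.asarray(X, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    r = np.zeros(X.shape[1])
    valid = denom > 0                  # constant columns get r = 0 instead of NaN
    r[valid] = np.dot(yc, Xc)[valid] / denom[valid]
    order = np.argsort(-np.abs(r))     # strongest correlation first
    return order[:k], r
```

The selected columns would then be used as `X_train[:, idx]` and `X_test[:, idx]`. Pearson correlation only captures linear dependence on the label, so a bin that matters only in combination with others can be ranked low.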

The train and test list files use the same format as the list files used to convert data to lmdb for caffe:

path + space + class index
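As a small illustration of that format, the sketch below parses two made-up lines. The paths are invented for the example, and `parse_list_line` is a hypothetical helper rather than a function from the script.

```python
def parse_list_line(line):
    """Split one "<path> <class index>" line into (path, label)."""
    # Split on the last run of whitespace so the label is always the final field.
    path, label = line.rsplit(None, 1)
    return path, int(label)

sample = "cat/001.jpg 1\nbackground/042.jpg 0\n"
entries = [parse_list_line(l) for l in sample.splitlines() if l.strip()]
# entries == [("cat/001.jpg", 1), ("background/042.jpg", 0)]
```

Splitting from the right tolerates spaces inside the path, which the script's own `line.split(' ')[0]` does not.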

#!/usr/bin/env python2
# -*- coding: utf-8 -*-
"""
Created on Wed Jan 17 13:09:08 2018

@author: hans
"""

import cv2
import os
import numpy as np
# sklearn.externals.joblib was removed in scikit-learn 0.23; use the standalone joblib package.
import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from skimage import transform
from sklearn import tree
import datetime



def gray(img_path):
    img = cv2.imread(img_path, 0)
    img=transform.resize(img, (227, 227))
    img = img*255
    img = img.astype(np.uint8)
    feature = cv2.calcHist([img],[0],None,[256],[0,256]).reshape(1,-1)
    return feature

def rgb(img_path):
    img = cv2.imread(img_path)
    img=transform.resize(img, (227, 227,3))
    b = img[:,:,0]*255
    g = img[:,:,1]*255
    r = img[:,:,2]*255
    b = b.astype(np.uint8)
    g = g.astype(np.uint8)
    r = r.astype(np.uint8)
    feature_b = cv2.calcHist([b],[0],None,[256],[0,256]).reshape(1,-1)
    feature_g = cv2.calcHist([g],[0],None,[256],[0,256]).reshape(1,-1)
    feature_r = cv2.calcHist([r],[0],None,[256],[0,256]).reshape(1,-1)
    feature = np.hstack((feature_b,feature_g,feature_r))
    return feature
    
def hist_feature(list_txt):
    """Extract one histogram feature per image listed in list_txt and cache the result.

    Each line of list_txt is "<relative path> <class index>".
    """
    root_path = 'image/'
    extract = {0: gray, 1: rgb}[mode]  # mode is the module-level switch set below
    features, labels = [], []
    with open(list_txt, 'r') as f:
        for num, line in enumerate(f, 1):
            parts = line.split()
            if len(parts) < 2:
                continue  # skip blank or malformed lines
            img_path = os.path.join(root_path, parts[0])
            if not os.path.isfile(img_path):
                continue  # skip entries whose image is missing (now also checked for the first line)
            print("%d dealing with %s ..." % (num, parts[0]))
            features.append(extract(img_path))
            labels.append(int(parts[1]))
    # Stack once at the end instead of growing the arrays on every iteration.
    feature = np.vstack(features)
    label = np.array(labels)
    joblib.dump(feature, list_txt.split('.')[0]+filename,compress=5)
    joblib.dump(label, list_txt.split('.')[0]+'_label.pkl', compress=5)
    return feature, label

def save_feature():
    t1 = datetime.datetime.now()
    X_train, y_train = hist_feature(train_list)
    t2 = datetime.datetime.now()
    X_test, y_test = hist_feature(test_list)
    t3 = datetime.datetime.now()
    print("\nTraining feature extraction time: %0.2f"%(t2-t1).total_seconds())
    print("Test feature extraction time: %0.2f"%(t3-t2).total_seconds())

def decision_tree():
    dt = tree.DecisionTreeClassifier(criterion='gini',max_depth=None, min_samples_split=2, min_samples_leaf=1,random_state=80)
    return fit(dt, 'dt')

def random_forest():
    # criterion: split criterion (gini/entropy), n_estimators: number of trees, bootstrap: whether to sample with replacement, n_jobs: number of parallel jobs
    rf = RandomForestClassifier(n_estimators=25,criterion='entropy',bootstrap=True,n_jobs=4,random_state=80) # random forest
    return fit(rf, 'rf')

def adaboost():
    ada = AdaBoostClassifier(tree.DecisionTreeClassifier(criterion='gini',max_depth=11, min_samples_split=400, \
                                                         min_samples_leaf=30,max_features=30,random_state=10), \
                             algorithm="SAMME", n_estimators=100, learning_rate=0.001,random_state=10)
    return fit(ada, 'ada')

def fit(clf, s):
    t3 = datetime.datetime.now()
    X_train = joblib.load(train_list.split('.')[0]+filename)
    y_train = joblib.load(train_list.split('.')[0]+'_label.pkl')
    clf = clf.fit(X_train, y_train)
    joblib.dump(clf, s+'_model.pkl')
    t4 = datetime.datetime.now()
    print("--------------------------------\nModel training time: %0.2f"%(t4-t3).total_seconds())
    scores = cross_val_score(clf, X_train, y_train,scoring='accuracy' ,cv=3)
    print("Training cross-validation mean score: %0.4f (+/- %0.4f)" % (scores.mean(), scores.std() * 2))
    return clf
    
def testScore(clf):
    t4 = datetime.datetime.now()
    X_test = joblib.load(test_list.split('.')[0]+filename)
    y_test = joblib.load(test_list.split('.')[0]+'_label.pkl')
    clf_score = clf.score(X_test, y_test)
    print("--------------------------------\nTest score: %.4f" % clf_score)
    clf_pred = clf.predict(X_test)
    print(clf_pred[:10])  # parenthesized so the line also runs under Python 3
    t5 = datetime.datetime.now()
    print("Model test time: %0.2f"%(t5-t4).total_seconds())
    scores = cross_val_score(clf, X_test, y_test,scoring='accuracy' ,cv=10)
    print("Test cross-validation mean score: %0.4f (+/- %0.4f)" % (scores.mean(), scores.std() * 2))
    
mode=1
if mode==0:
    filename = '_feature_gray.pkl'
elif mode==1:
    filename = '_feature_rgb.pkl'

train_list = "train_all.txt"
test_list = "test_all.txt"

#train_list = "train.txt"
#test_list = "test.txt"

if __name__ == '__main__':
    
#    save_feature()
    
#    dt = decision_tree()
#    rf = random_forest()
    ada = adaboost()
#    gbdt = gradientboost()
    
#    dt = joblib.load('dt_model.pkl')
#    rf = joblib.load('rf_model.pkl')
    ada = joblib.load('ada_model.pkl')
    testScore(ada)