Faster R-CNN in Keras Explained (RPN Computation)

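Some time ago I finished Udacity's machine learning and deep learning courses and felt I had only just reached the threshold of deep learning, so I started on Stanford's cs231n (http://cs231n.stanford.edu/syllabus.html) and promptly fell down the computer vision rabbit hole. It turns out that besides recognizing objects, you can also do localization, object detection, semantic segmentation, instance segmentation, networks that spar with themselves (GAN), style transfer, and more. A real eye-opener. I'm starting with detection, so let's get to work!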

For detection, the natural starting point is the series of work by rbg (Ross Girshick), running from R-CNN all the way to YOLO. Here is a YOLO video for you to enjoy: YOLO: Real-Time Object Detection (https://www.youtube.com/watch?v=VOC3huqHrss&feature=youtu.be). You can also go straight to this page, which has more detail: YOLO: Real-Time Object Detection (https://pjreddie.com/darknet/yolo/). The original Faster R-CNN paper is here: Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks (https://arxiv.org/abs/1506.01497).

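Since I'm not very fluent in TensorFlow and do most of my projects in Keras, I found a Keras version of Faster R-CNN on GitHub (https://github.com/yhenon/keras-frcnn) to study. After cloning it, only a few small code tweaks were needed to get it running. I trained it on the Oxford pet dataset; after a little over an hour of training on my aging GTX 970, it could already do reasonably effective detection. The result is shown below.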

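The next step is understanding the code. The core idea of Faster R-CNN is to replace the separate region proposal step used by earlier methods with an RPN (Region Proposal Network), which makes the whole pipeline trainable end to end and speeds the algorithm up. So understanding the RPN is the first step towards understanding Faster R-CNN. The code below computes the ground truth used to train the RPN; once you fully understand it, you understand how the RPN works.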

The computation is fairly long but involves no complicated math. I drew a rough flowchart, which should make it much easier to follow.

Now let's look at the code:

import random  # used further down, when sampling anchors

import numpy as np

# iou() is defined in the same module of the repository


def calc_rpn(C, img_data, width, height, resized_width, resized_height, img_length_calc_function):

    downscale = float(C.rpn_stride)
    anchor_sizes = C.anchor_box_scales
    anchor_ratios = C.anchor_box_ratios
    num_anchors = len(anchor_sizes) * len(anchor_ratios)

    # calculate the output map size based on the network architecture
    (output_width, output_height) = img_length_calc_function(resized_width, resized_height)

    n_anchratios = len(anchor_ratios)

    # initialise empty output objectives
    y_rpn_overlap = np.zeros((output_height, output_width, num_anchors))
    y_is_box_valid = np.zeros((output_height, output_width, num_anchors))
    y_rpn_regr = np.zeros((output_height, output_width, num_anchors * 4))

    num_bboxes = len(img_data['bboxes'])

    num_anchors_for_bbox = np.zeros(num_bboxes).astype(int)
    best_anchor_for_bbox = -1*np.ones((num_bboxes, 4)).astype(int)
    best_iou_for_bbox = np.zeros(num_bboxes).astype(np.float32)
    best_x_for_bbox = np.zeros((num_bboxes, 4)).astype(int)
    best_dx_for_bbox = np.zeros((num_bboxes, 4)).astype(np.float32)

    # get the GT box coordinates, and resize to account for image resizing
    # note the column order of gta: x1, x2, y1, y2
    gta = np.zeros((num_bboxes, 4))
    for bbox_num, bbox in enumerate(img_data['bboxes']):
        gta[bbox_num, 0] = bbox['x1'] * (resized_width / float(width))
        gta[bbox_num, 1] = bbox['x2'] * (resized_width / float(width))
        gta[bbox_num, 2] = bbox['y1'] * (resized_height / float(height))
        gta[bbox_num, 3] = bbox['y2'] * (resized_height / float(height))

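First, the parameters. C is the configuration. img_data holds one image's path together with its bbox coordinates and the corresponding classes (an image may have several, meaning it contains several objects). Next come the image's original size and its size after resizing, which are used to map the bbox coordinates onto the resized image. img_length_calc_function is a function that, based on our settings, computes the size of the feature map that the image produces after passing through the network.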

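Next, a few parameters are read. downscale is the scaling factor from image to feature map. anchor_sizes and anchor_ratios define the initial candidate regions: for example, 3 sizes and 3 ratios combine into 9 regions of different shape and size. Then img_length_calc_function is called to compute the size of the feature map.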
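To make the "3 sizes times 3 ratios" combination concrete, here is a minimal sketch. The scale and ratio values, the stride of 16, and the helper `feature_map_size` are assumptions chosen for illustration; the real values live on the config object C and the real size function depends on the backbone network.

```python
# Hypothetical config values; in the repository they come from the config object C.
anchor_box_scales = [128, 256, 512]
anchor_box_ratios = [[1, 1], [1, 2], [2, 1]]
rpn_stride = 16


def feature_map_size(resized_width, resized_height, stride=rpn_stride):
    """Stand-in for img_length_calc_function, assuming a plain stride-16 backbone."""
    return resized_width // stride, resized_height // stride


# Same formula as in calc_rpn: width = size * ratio[0], height = size * ratio[1]
anchor_shapes = [
    (size * ratio[0], size * ratio[1])
    for size in anchor_box_scales
    for ratio in anchor_box_ratios
]

print(len(anchor_shapes))          # 9
print(anchor_shapes[:3])           # [(128, 128), (128, 256), (256, 128)]
print(feature_map_size(800, 600))  # (50, 37)
```

With these assumed numbers, an 800x600 input gives a 50x37 feature map, so up to 50 * 37 * 9 anchors are considered before boundary filtering.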

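The next step initializes several variables; you can skip them for now and come back when they are used. Because all our computation is based on the resized image, the code then scales x1, x2, y1, y2 of each bbox to match the resized image. The result is stored as gta, with shape (num_of_bbox, 4).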
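The rescaling step can be tried in isolation. The sketch below lifts it into a small helper; `rescale_boxes` and the sample box are hypothetical names and values used only for this illustration.

```python
import numpy as np


def rescale_boxes(bboxes, width, height, resized_width, resized_height):
    """Map GT boxes from the original image onto the resized image.

    Column order follows calc_rpn: x1, x2, y1, y2.
    """
    gta = np.zeros((len(bboxes), 4))
    for i, bbox in enumerate(bboxes):
        gta[i, 0] = bbox['x1'] * (resized_width / float(width))
        gta[i, 1] = bbox['x2'] * (resized_width / float(width))
        gta[i, 2] = bbox['y1'] * (resized_height / float(height))
        gta[i, 3] = bbox['y2'] * (resized_height / float(height))
    return gta


# A made-up box on a 400x300 image that is resized to 800x600 (scale factor 2).
boxes = [{'x1': 100, 'x2': 300, 'y1': 50, 'y2': 250}]
gta = rescale_boxes(boxes, width=400, height=300, resized_width=800, resized_height=600)
print(gta)  # [[200. 600. 100. 500.]]
```

Note that gta stores both x coordinates first, which is why the later call to iou() reorders the columns into (x1, y1, x2, y2).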

    for anchor_size_idx in range(len(anchor_sizes)):
        for anchor_ratio_idx in range(n_anchratios):
            anchor_x = anchor_sizes[anchor_size_idx] * anchor_ratios[anchor_ratio_idx][0]
            anchor_y = anchor_sizes[anchor_size_idx] * anchor_ratios[anchor_ratio_idx][1]

            for ix in range(output_width):
                # x-coordinates of the current anchor box
                x1_anc = downscale * (ix + 0.5) - anchor_x / 2
                x2_anc = downscale * (ix + 0.5) + anchor_x / 2

                # ignore boxes that go across image boundaries
                if x1_anc < 0 or x2_anc > resized_width:
                    continue

                for jy in range(output_height):

                    # y-coordinates of the current anchor box
                    y1_anc = downscale * (jy + 0.5) - anchor_y / 2
                    y2_anc = downscale * (jy + 0.5) + anchor_y / 2

                    # ignore boxes that go across image boundaries
                    if y1_anc < 0 or y2_anc > resized_height:
                        continue

                    # bbox_type indicates whether an anchor should be a target
                    bbox_type = 'neg'

                    # this is the best IOU for the (x,y) coord and the current anchor
                    # note that this is different from the best IOU for a GT bbox
                    best_iou_for_loc = 0.0

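The block above computes each anchor's width and height (anchor_x, anchor_y). The important part comes next: every point of the feature map is treated as an anchor point, mapped back to the image's actual size by multiplying by downscale, and combined with the anchor's width and height; anchors that extend beyond the image are ignored. What emerges is a set of rectangular candidate boxes of different sizes and aspect ratios. The code iterates over these boxes and performs the following computation for each one: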
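The mapping from a feature-map cell to an anchor box in image coordinates can be sketched as below. The helper names `anchor_box` and `inside_image` and the sample numbers are assumptions for illustration; the formulas are the ones used in the loop.

```python
def anchor_box(ix, jy, anchor_w, anchor_h, downscale=16.0):
    """Return (x1, y1, x2, y2) of the anchor centred on feature-map cell (ix, jy)."""
    cx = downscale * (ix + 0.5)  # centre of the cell, in image pixels
    cy = downscale * (jy + 0.5)
    return (cx - anchor_w / 2, cy - anchor_h / 2, cx + anchor_w / 2, cy + anchor_h / 2)


def inside_image(box, resized_width, resized_height):
    """Mirror of the boundary checks: anchors crossing the image border are skipped."""
    x1, y1, x2, y2 = box
    return x1 >= 0 and y1 >= 0 and x2 <= resized_width and y2 <= resized_height


box = anchor_box(ix=10, jy=10, anchor_w=128, anchor_h=128)
print(box)                                                 # (104.0, 104.0, 232.0, 232.0)
print(inside_image(box, 800, 600))                         # True
print(inside_image(anchor_box(0, 0, 128, 128), 800, 600))  # False
```

The last line shows why cells near the border contribute few anchors: a 128x128 box centred on cell (0, 0) starts at x = -56 and is discarded.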

                    # bbox_type indicates whether an anchor should be a target
                    bbox_type = 'neg'

                    # this is the best IOU for the (x,y) coord and the current anchor
                    # note that this is different from the best IOU for a GT bbox
                    best_iou_for_loc = 0.0

                    for bbox_num in range(num_bboxes):

                        # get IOU of the current GT box and the current anchor box
                        curr_iou = iou([gta[bbox_num, 0], gta[bbox_num, 2], gta[bbox_num, 1], gta[bbox_num, 3]], [x1_anc, y1_anc, x2_anc, y2_anc])
                        # calculate the regression targets if they will be needed
                        if curr_iou > best_iou_for_bbox[bbox_num] or curr_iou > C.rpn_max_overlap:
                            cx = (gta[bbox_num, 0] + gta[bbox_num, 1]) / 2.0
                            cy = (gta[bbox_num, 2] + gta[bbox_num, 3]) / 2.0
                            cxa = (x1_anc + x2_anc)/2.0
                            cya = (y1_anc + y2_anc)/2.0

                            tx = (cx - cxa) / (x2_anc - x1_anc)
                            ty = (cy - cya) / (y2_anc - y1_anc)
                            tw = np.log((gta[bbox_num, 1] - gta[bbox_num, 0]) / (x2_anc - x1_anc))
                            th = np.log((gta[bbox_num, 3] - gta[bbox_num, 2]) / (y2_anc - y1_anc))

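Two variables are defined here, bbox_type and best_iou_for_loc, which are used later. The code then computes the IoU (intersection over union) between the anchor and each gta box; that part is simple, so I won't expand on it. If the IoU is greater than best_iou_for_bbox[bbox_num] or greater than the threshold we set, the code computes the center points of the gta box and the anchor, and from the centers and the box coordinates derives four values tx, ty, tw, th for x, y, w, h. I first thought of these as "gradients", but they are really regression targets: the center offsets (normalized by the anchor's width and height) and the log scale factors that map the anchor onto the GT box. Why compute them? Because the regions produced by the RPN are not necessarily accurate, which you can guess from the fact that there are only 9 anchor shapes. So at prediction time a regression step refines the box rather than using the anchor's coordinates directly.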
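Since the IoU computation is skipped in the text, here is a minimal sketch of it together with the regression targets and their inverse. These are simplified stand-ins written for this explanation, not the repository's own iou() helper; the names `encode` and `decode` and the sample boxes are hypothetical.

```python
import math


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def encode(gt, anchor):
    """Regression targets (tx, ty, tw, th) that map the anchor onto the GT box."""
    wa, ha = anchor[2] - anchor[0], anchor[3] - anchor[1]
    cxa, cya = (anchor[0] + anchor[2]) / 2.0, (anchor[1] + anchor[3]) / 2.0
    cx, cy = (gt[0] + gt[2]) / 2.0, (gt[1] + gt[3]) / 2.0
    return ((cx - cxa) / wa, (cy - cya) / ha,
            math.log((gt[2] - gt[0]) / wa), math.log((gt[3] - gt[1]) / ha))


def decode(t, anchor):
    """Inverse of encode: apply predicted targets to an anchor to recover a box."""
    wa, ha = anchor[2] - anchor[0], anchor[3] - anchor[1]
    cx = t[0] * wa + (anchor[0] + anchor[2]) / 2.0
    cy = t[1] * ha + (anchor[1] + anchor[3]) / 2.0
    w, h = wa * math.exp(t[2]), ha * math.exp(t[3])
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


gt = (100.0, 100.0, 200.0, 200.0)
anchor = (120.0, 120.0, 220.0, 220.0)
print(round(iou(gt, anchor), 3))  # 0.471
targets = encode(gt, anchor)
print(targets)                    # (-0.2, -0.2, 0.0, 0.0)
recovered = decode(targets, anchor)
```

Here the anchor has the same size as the GT box but sits 20 pixels to the lower right, so tw and th are 0 and the center offsets are -20 / 100 = -0.2. Decoding the targets against the same anchor gives back the GT box, which is what the network's regression output is used for at prediction time.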

Next, each anchor is labeled according to how well it overlaps the GT boxes.

                        # if the GT bbox is not background
                        if img_data['bboxes'][bbox_num]['class'] != 'bg':

                            # all GT boxes should be mapped to an anchor box, so we keep track of which anchor box was best
                            if curr_iou > best_iou_for_bbox[bbox_num]:
                                best_anchor_for_bbox[bbox_num] = [jy, ix, anchor_ratio_idx, anchor_size_idx]
                                best_iou_for_bbox[bbox_num] = curr_iou
                                best_x_for_bbox[bbox_num,:] = [x1_anc, x2_anc, y1_anc, y2_anc]
                                best_dx_for_bbox[bbox_num,:] = [tx, ty, tw, th]

                            # we set the anchor to positive if the IOU is >0.7 (it does not matter if there was another better box, it just indicates overlap)
                            if curr_iou > C.rpn_max_overlap:
                                bbox_type = 'pos'
                                num_anchors_for_bbox[bbox_num] += 1
                                # we update the regression layer target if this IOU is the best for the current (x,y) and anchor position
                                if curr_iou > best_iou_for_loc:
                                    best_iou_for_loc = curr_iou
                                    best_regr = (tx, ty, tw, th)

                            # if the IOU is >0.3 and <0.7, it is ambiguous and not included in the objective
                            if C.rpn_min_overlap < curr_iou < C.rpn_max_overlap:
                                # gray zone between neg and pos
                                if bbox_type != 'pos':
                                    bbox_type = 'neutral'

                    # turn on or off outputs depending on IOUs
                    if bbox_type == 'neg':
                        y_is_box_valid[jy, ix, anchor_ratio_idx + n_anchratios * anchor_size_idx] = 1
                        y_rpn_overlap[jy, ix, anchor_ratio_idx + n_anchratios * anchor_size_idx] = 0
                    elif bbox_type == 'neutral':
                        y_is_box_valid[jy, ix, anchor_ratio_idx + n_anchratios * anchor_size_idx] = 0
                        y_rpn_overlap[jy, ix, anchor_ratio_idx + n_anchratios * anchor_size_idx] = 0
                    elif bbox_type == 'pos':
                        y_is_box_valid[jy, ix, anchor_ratio_idx + n_anchratios * anchor_size_idx] = 1
                        y_rpn_overlap[jy, ix, anchor_ratio_idx + n_anchratios * anchor_size_idx] = 1
                        start = 4 * (anchor_ratio_idx + n_anchratios * anchor_size_idx)
                        y_rpn_regr[jy, ix, start:start+4] = best_regr

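The precondition is that the bbox's class is not 'bg', i.e. background. If the IoU is greater than the best value recorded so far for this bbox, a series of updates is made. If the IoU is greater than the threshold we set (rpn_max_overlap), the anchor is marked positive, meaning there is a bbox that overlaps it strongly, and that bbox's num_anchors count is incremented by 1. If the IoU is also greater than best_iou_for_loc, best_regr is set to the current regression targets. Here best_iou_for_loc is the best IoU for this particular anchor. As I understand it, if an anchor matches more than one bbox as pos, we take the regression targets of the bbox with the highest IoU; at this stage we only need to find the best region and do not care which class is inside it. If the IoU falls between the minimum and maximum thresholds, we cannot tell whether the anchor is background or object, so it is marked neutral.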
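The labeling rule can be condensed into a small function that takes the IoUs of one anchor against every GT box. This is a simplified sketch: `label_anchor` is a hypothetical helper, the thresholds 0.7 and 0.3 are assumed defaults for rpn_max_overlap and rpn_min_overlap, and all GT boxes are assumed to be non-background.

```python
def label_anchor(ious, max_overlap=0.7, min_overlap=0.3):
    """Return 'pos', 'neutral' or 'neg' for one anchor, given its IoU with each GT box."""
    bbox_type = 'neg'
    for curr_iou in ious:
        if curr_iou > max_overlap:
            bbox_type = 'pos'
        if min_overlap < curr_iou < max_overlap and bbox_type != 'pos':
            bbox_type = 'neutral'
    return bbox_type


print(label_anchor([0.1, 0.2]))  # neg
print(label_anchor([0.5, 0.1]))  # neutral
print(label_anchor([0.5, 0.8]))  # pos
print(label_anchor([0.8, 0.5]))  # pos
```

The last two calls show that one strong overlap is enough: once an anchor is positive, a later ambiguous IoU does not demote it to neutral.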

Next, the anchor is tagged according to bbox_type: y_is_box_valid records whether this anchor is usable for training, and y_rpn_overlap records whether it contains an object.
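As a quick reference, the tagging can be written as a lookup table plus the channel index formula used throughout the function. `FLAGS`, `anchor_channel` and `regr_slice` are hypothetical names for this sketch, and 3 ratios per size is an assumed setting.

```python
# bbox_type -> (y_is_box_valid, y_rpn_overlap)
FLAGS = {
    'neg': (1, 0),      # usable for training, contains no object
    'neutral': (0, 0),  # ambiguous, excluded from the loss
    'pos': (1, 1),      # usable for training, contains an object
}


def anchor_channel(anchor_ratio_idx, anchor_size_idx, n_anchratios=3):
    """Channel of a given (ratio, size) pair in y_is_box_valid / y_rpn_overlap."""
    return anchor_ratio_idx + n_anchratios * anchor_size_idx


def regr_slice(anchor_ratio_idx, anchor_size_idx, n_anchratios=3):
    """The 4 consecutive channels in y_rpn_regr holding (tx, ty, tw, th)."""
    start = 4 * anchor_channel(anchor_ratio_idx, anchor_size_idx, n_anchratios)
    return start, start + 4


print(anchor_channel(2, 1))  # 5
print(regr_slice(2, 1))      # (20, 24)
print(FLAGS['pos'])          # (1, 1)
```

So the third ratio of the second size lives in channel 5 of the two flag arrays and in channels 20 to 23 of the regression array.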

    # we ensure that every bbox has at least one positive RPN region
    for idx in range(num_anchors_for_bbox.shape[0]):
        if num_anchors_for_bbox[idx] == 0:
            # no box with an IOU greater than zero ...
            if best_anchor_for_bbox[idx, 0] == -1:
                continue
            y_is_box_valid[
                best_anchor_for_bbox[idx,0], best_anchor_for_bbox[idx,1],
                best_anchor_for_bbox[idx,2] + n_anchratios * best_anchor_for_bbox[idx,3]] = 1
            y_rpn_overlap[
                best_anchor_for_bbox[idx,0], best_anchor_for_bbox[idx,1],
                best_anchor_for_bbox[idx,2] + n_anchratios * best_anchor_for_bbox[idx,3]] = 1
            start = 4 * (best_anchor_for_bbox[idx,2] + n_anchratios * best_anchor_for_bbox[idx,3])
            y_rpn_regr[
                best_anchor_for_bbox[idx,0], best_anchor_for_bbox[idx,1], start:start+4] = best_dx_for_bbox[idx,:]

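This is where another problem shows up: many bboxes may never find an anchor they like (none exceeds the positive threshold), and then that training data would go unused. As a compromise, the code above guarantees that every bbox has at least one anchor assigned to it. The method is simple: for a bbox with no positive anchor, pick the anchor with the best IoU recorded for it (which may be a neutral or even a negative one) and mark it positive. The one condition is that the two overlap at all; an anchor that doesn't intersect the bbox whatsoever would be asking too much.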
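A toy run of this fallback may help. The sketch below uses a tiny 2x2 feature map and made-up bookkeeping arrays (all values are assumptions): bbox 0 already has positive anchors, bbox 1 has none but did overlap some anchor, and bbox 2 never overlapped anything.

```python
import numpy as np

n_anchratios = 3
y_is_box_valid = np.zeros((2, 2, 9))
y_rpn_overlap = np.zeros((2, 2, 9))

# Made-up results of the main loop for three GT boxes.
num_anchors_for_bbox = np.array([2, 0, 0])
# Each row is [jy, ix, anchor_ratio_idx, anchor_size_idx]; -1 means "never overlapped".
best_anchor_for_bbox = np.array([[0, 0, 0, 0], [1, 0, 2, 1], [-1, -1, -1, -1]])

for idx in range(num_anchors_for_bbox.shape[0]):
    if num_anchors_for_bbox[idx] != 0:
        continue  # this bbox already has at least one positive anchor
    jy, ix, ratio_idx, size_idx = best_anchor_for_bbox[idx]
    if jy == -1:
        continue  # no anchor overlapped this bbox at all
    channel = ratio_idx + n_anchratios * size_idx
    y_is_box_valid[jy, ix, channel] = 1
    y_rpn_overlap[jy, ix, channel] = 1

promoted = np.argwhere(y_rpn_overlap == 1).tolist()
print(promoted)  # [[1, 0, 5]]
```

Only bbox 1 triggers the fallback, promoting the anchor at row 1, column 0, channel 5. The regression targets would be copied from best_dx_for_bbox in the same step; that part is left out here to keep the sketch short.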

    y_rpn_overlap = np.transpose(y_rpn_overlap, (2, 0, 1))
    y_rpn_overlap = np.expand_dims(y_rpn_overlap, axis=0)

    y_is_box_valid = np.transpose(y_is_box_valid, (2, 0, 1))
    y_is_box_valid = np.expand_dims(y_is_box_valid, axis=0)

    y_rpn_regr = np.transpose(y_rpn_regr, (2, 0, 1))
    y_rpn_regr = np.expand_dims(y_rpn_regr, axis=0)

    pos_locs = np.where(np.logical_and(y_rpn_overlap[0, :, :, :] == 1, y_is_box_valid[0, :, :, :] == 1))
    neg_locs = np.where(np.logical_and(y_rpn_overlap[0, :, :, :] == 0, y_is_box_valid[0, :, :, :] == 1))

    num_pos = len(pos_locs[0])

Next, a series of numpy operations moves the anchor axis to the front, adds a batch dimension, and locates the positive and negative anchors.
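The effect of these numpy calls is easier to see on a tiny array. In this sketch the 2x3 feature map and the two hand-placed anchors are assumptions for illustration.

```python
import numpy as np

H, W, A = 2, 3, 9  # feature-map height, width, anchors per position
y_rpn_overlap = np.zeros((H, W, A))
y_is_box_valid = np.zeros((H, W, A))

# One positive anchor at (row 0, col 1, channel 4), one negative at (row 1, col 2, channel 0).
y_is_box_valid[0, 1, 4] = 1
y_rpn_overlap[0, 1, 4] = 1
y_is_box_valid[1, 2, 0] = 1

# (H, W, A) -> (A, H, W) -> (1, A, H, W)
y_rpn_overlap = np.expand_dims(np.transpose(y_rpn_overlap, (2, 0, 1)), axis=0)
y_is_box_valid = np.expand_dims(np.transpose(y_is_box_valid, (2, 0, 1)), axis=0)
print(y_rpn_overlap.shape)  # (1, 9, 2, 3)

pos_locs = np.where(np.logical_and(y_rpn_overlap[0] == 1, y_is_box_valid[0] == 1))
neg_locs = np.where(np.logical_and(y_rpn_overlap[0] == 0, y_is_box_valid[0] == 1))

# np.where returns one index array per axis: (channel, row, col)
print([a.tolist() for a in pos_locs])  # [[4], [0], [1]]
print([a.tolist() for a in neg_locs])  # [[0], [1], [2]]
```

Because the anchor axis now comes first, the three arrays returned by np.where are ordered (channel, row, column), which is exactly how they are used to index y_is_box_valid in the sampling step.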

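    # one issue is that the RPN has many more negative than positive regions, so we turn off some of the negative
    # regions. We also limit it to 256 regions.
    num_regions = 256

    # integer division (//) so that random.sample receives an int under Python 3
    if len(pos_locs[0]) > num_regions // 2:
        val_locs = random.sample(range(len(pos_locs[0])), len(pos_locs[0]) - num_regions // 2)
        y_is_box_valid[0, pos_locs[0][val_locs], pos_locs[1][val_locs], pos_locs[2][val_locs]] = 0
        num_pos = num_regions // 2

    if len(neg_locs[0]) + num_pos > num_regions:
        val_locs = random.sample(range(len(neg_locs[0])), len(neg_locs[0]) - num_pos)
        y_is_box_valid[0, neg_locs[0][val_locs], neg_locs[1][val_locs], neg_locs[2][val_locs]] = 0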

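Because negative anchors vastly outnumber positive ones, a maximum number of regions (256) is set here and the pos and neg samples are balanced: positives beyond 128 are randomly switched off, and if the total still exceeds 256, negatives are randomly switched off until as many negatives remain as positives.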
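The sampling rule can be sketched on plain index lists. `subsample` is a hypothetical helper that returns the anchors to keep, whereas the original code marks the dropped ones as invalid in y_is_box_valid; the seed parameter is added only to make the sketch reproducible.

```python
import random


def subsample(pos_idx, neg_idx, num_regions=256, seed=0):
    """Return the positive and negative anchor indices kept for training."""
    rng = random.Random(seed)
    max_pos = num_regions // 2
    if len(pos_idx) > max_pos:
        # too many positives: keep a random half-batch worth of them
        pos_idx = rng.sample(pos_idx, max_pos)
    if len(neg_idx) + len(pos_idx) > num_regions:
        # too many anchors overall: keep as many negatives as positives
        neg_idx = rng.sample(neg_idx, len(pos_idx))
    return pos_idx, neg_idx


kept_pos, kept_neg = subsample(list(range(300)), list(range(1000)))
print(len(kept_pos), len(kept_neg))  # 128 128

kept_pos, kept_neg = subsample(list(range(10)), list(range(1000)))
print(len(kept_pos), len(kept_neg))  # 10 10

kept_pos, kept_neg = subsample(list(range(10)), list(range(100)))
print(len(kept_pos), len(kept_neg))  # 10 100
```

The second case is the common one in practice: with only 10 positives, just 10 negatives survive, so the effective batch is much smaller than 256. The third case shows that nothing is dropped when the total is already within the limit.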

    y_rpn_cls = np.concatenate([y_is_box_valid, y_rpn_overlap], axis=1)
    y_rpn_regr = np.concatenate([np.repeat(y_rpn_overlap, 4, axis=1), y_rpn_regr], axis=1)

    return np.copy(y_rpn_cls), np.copy(y_rpn_regr)

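Finally, we get two return values, y_rpn_cls and y_rpn_regr. y_rpn_cls concatenates the validity flags with the object flags and is used to decide whether an anchor contains an object; y_rpn_regr concatenates the object flags (repeated 4 times) with the regression targets and is used for box regression.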
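The layout of the two return values can be checked on dummy arrays. The sizes and the single positive anchor below are assumptions for illustration; the two concatenate calls are the ones from the function.

```python
import numpy as np

A, H, W = 9, 2, 3
y_is_box_valid = np.zeros((1, A, H, W))
y_rpn_overlap = np.zeros((1, A, H, W))
y_rpn_regr = np.zeros((1, 4 * A, H, W))

# Pretend anchor channel 2 at position (0, 0) is positive.
y_is_box_valid[0, 2, 0, 0] = 1
y_rpn_overlap[0, 2, 0, 0] = 1

y_rpn_cls = np.concatenate([y_is_box_valid, y_rpn_overlap], axis=1)
y_rpn_regr_full = np.concatenate([np.repeat(y_rpn_overlap, 4, axis=1), y_rpn_regr], axis=1)

print(y_rpn_cls.shape)        # (1, 18, 2, 3)
print(y_rpn_regr_full.shape)  # (1, 72, 2, 3)

# The first 4*A channels of y_rpn_regr_full act as a mask over the last 4*A channels.
mask_channels = np.flatnonzero(y_rpn_regr_full[0, :4 * A, 0, 0]).tolist()
print(mask_channels)          # [8, 9, 10, 11]
```

np.repeat copies each anchor's flag four times in a row, so the mask for anchor channel 2 covers channels 8 to 11, the same positions where that anchor's (tx, ty, tw, th) sit in the second half. A loss function can then ignore regression outputs for anchors that contain no object.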

Now let's look at the structure of the RPN layers in the network:

from keras.layers import Convolution2D


def rpn(base_layers, num_anchors):

    # shared 3x3 convolution over the backbone feature map
    x = Convolution2D(512, (3, 3), padding='same', activation='relu', kernel_initializer='normal', name='rpn_conv1')(base_layers)

    # one objectness score per anchor, and four regression values per anchor
    x_class = Convolution2D(num_anchors, (1, 1), activation='sigmoid', kernel_initializer='uniform', name='rpn_out_class')(x)
    x_regr = Convolution2D(num_anchors * 4, (1, 1), activation='linear', kernel_initializer='zero', name='rpn_out_regress')(x)

    return [x_class, x_regr, base_layers]

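After a shared 3*3 convolution, a 1*1 window slides over the feature map and produces num_anchors channels. Each channel holds one sigmoid activation per feature-map position (w*h values), indicating whether the corresponding anchor contains an object; this matches the y_rpn_cls we just computed. In the same way, x_regr is produced to match the y_rpn_regr we just computed.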
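A 1*1 convolution is just the same small dense layer applied at every feature-map position, which a few lines of numpy can illustrate. This is only a shape sketch under assumptions, not the Keras layers themselves: the weights are random, biases are omitted, and the 4x5 feature map is made up.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, C, A = 4, 5, 512, 9  # feature-map height, width, channels, anchors per position

# Stand-in for the output of the shared 3x3 convolution.
features = rng.standard_normal((H, W, C))

w_cls = rng.standard_normal((C, A)) * 0.01  # 1x1 conv weights for the class head
w_regr = np.zeros((C, 4 * A))               # zero-initialised, like kernel_initializer='zero'

# Matrix multiplication over the last axis == 1x1 convolution at every position.
x_class = 1.0 / (1.0 + np.exp(-(features @ w_cls)))  # sigmoid objectness per anchor
x_regr = features @ w_regr                           # linear regression outputs

print(x_class.shape)  # (4, 5, 9)
print(x_regr.shape)   # (4, 5, 36)
```

Each position gets 9 objectness scores and 36 regression values, one (tx, ty, tw, th) group per anchor, which lines up with the per-anchor channel layout built in calc_rpn.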

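With the region proposals in hand, the next important idea is RoI pooling, which converts feature maps of different shapes into a fixed shape so they can be fed to the fully connected layers for the final prediction. I'll update this post once I've studied it. Since I'm still learning myself, my understanding may be off in places; corrections are welcome.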

Author: Non
