
DL | YOLOv3: Translation and Interpretation of the Paper "YOLOv3: An Incremental Improvement" (Part 1)

Paper: https://arxiv.org/pdf/1804.02767.pdf

YOLOv3 Paper: Translation and Interpretation

Abstract

      We present some updates to YOLO! We made a bunch of little design changes to make it better. We also trained this new network that’s pretty swell. It’s a little bigger than last time but more accurate. It’s still fast though, don’t worry. At 320 × 320 YOLOv3 runs in 22 ms at 28.2 mAP, as accurate as SSD but three times faster. When we look at the old .5 IOU mAP detection metric YOLOv3 is quite good. It achieves 57.9 AP50 in 51 ms on a Titan X, compared to 57.5 AP50 in 198 ms by RetinaNet, similar performance but 3.8× faster. As always, all the code is online at https://pjreddie.com/yolo/.


1. Introduction

      Sometimes you just kinda phone it in for a year, you know? I didn’t do a whole lot of research this year. Spent a lot of time on Twitter. Played around with GANs a little. I had a little momentum left over from last year [12] [1]; I managed to make some improvements to YOLO. But, honestly, nothing like super interesting, just a bunch of small changes that make it better. I also helped out with other people’s research a little. Actually, that’s what brings us here today. We have a camera-ready deadline [4] and we need to cite some of the random updates I made to YOLO but we don’t have a source. So get ready for a TECH REPORT! The great thing about tech reports is that they don’t need intros, y’all know why we’re here. So the end of this introduction will signpost for the rest of the paper. First we’ll tell you what the deal is with YOLOv3. Then we’ll tell you how we do. We’ll also tell you about some things we tried that didn’t work. Finally we’ll contemplate what this all means.


2. The Deal

   So here’s the deal with YOLOv3: We mostly took good ideas from other people. We also trained a new classifier network that’s better than the other ones. We’ll just take you through the whole system from scratch so you can understand it all.



     Figure 1. We adapt this figure from the Focal Loss paper [9]. YOLOv3 runs significantly faster than other detection methods with comparable performance. Times from either an M40 or Titan X, they are basically the same GPU.


2.1. Bounding Box Prediction

      Following YOLO9000 our system predicts bounding boxes using dimension clusters as anchor boxes [15]. The network predicts 4 coordinates for each bounding box, tx, ty, tw, th. If the cell is offset from the top left corner of the image by (cx, cy) and the bounding box prior has width and height pw, ph, then the predictions correspond to:
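      bx = σ(tx) + cx
      by = σ(ty) + cy
      bw = pw · e^(tw)
      bh = ph · e^(th)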


    During training we use sum of squared error loss. If the ground truth for some coordinate prediction is t̂* our gradient is the ground truth value (computed from the ground truth box) minus our prediction: t̂* − t*. This ground truth value can be easily computed by inverting the equations above.

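    As a concrete illustration of these two paragraphs, here is a minimal NumPy sketch of decoding the raw predictions (tx, ty, tw, th) into a box and of inverting the equations to obtain the ground-truth targets t̂. The function names and the NumPy dependency are illustrative choices, not code from the paper.

import numpy as np

def decode_box(t, cell_xy, prior_wh):
    # Decode raw outputs (tx, ty, tw, th) into a box, following
    # bx = sigmoid(tx) + cx, by = sigmoid(ty) + cy, bw = pw * exp(tw), bh = ph * exp(th).
    tx, ty, tw, th = t
    cx, cy = cell_xy            # offset of the grid cell from the image top-left corner
    pw, ph = prior_wh           # width and height of the bounding box prior (anchor)
    bx = 1.0 / (1.0 + np.exp(-tx)) + cx
    by = 1.0 / (1.0 + np.exp(-ty)) + cy
    bw = pw * np.exp(tw)
    bh = ph * np.exp(th)
    return bx, by, bw, bh

def encode_box(b, cell_xy, prior_wh, eps=1e-7):
    # Invert the equations above to get the ground-truth targets t̂ used in the
    # sum-of-squared-error loss.
    bx, by, bw, bh = b
    cx, cy = cell_xy
    pw, ph = prior_wh
    ox = np.clip(bx - cx, eps, 1.0 - eps)   # fractional offset inside the cell, in (0, 1)
    oy = np.clip(by - cy, eps, 1.0 - eps)
    tx_hat = np.log(ox / (1.0 - ox))        # inverse of the sigmoid
    ty_hat = np.log(oy / (1.0 - oy))
    tw_hat = np.log(bw / pw)
    th_hat = np.log(bh / ph)
    return tx_hat, ty_hat, tw_hat, th_hat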

      Figure 2. Bounding boxes with dimension priors and location prediction. We predict the width and height of the box as offsets from cluster centroids. We predict the center coordinates of the box relative to the location of filter application using a sigmoid function. This figure blatantly self-plagiarized from [15].


    YOLOv3 predicts an objectness score for each bounding box using logistic regression. This should be 1 if the bounding box prior overlaps a ground truth object by more than any other bounding box prior. If the bounding box prior is not the best but does overlap a ground truth object by more than some threshold we ignore the prediction, following [17]. We use the threshold of .5. Unlike [17] our system only assigns one bounding box prior for each ground truth object. If a bounding box prior is not assigned to a ground truth object it incurs no loss for coordinate or class predictions, only objectness.

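    A minimal sketch of this assignment rule, assuming the IOU of each bounding box prior with the ground-truth box has already been computed; the function and argument names are illustrative, not taken from the paper's code.

def assign_objectness(ious_with_gt, best_prior_idx, ignore_thresh=0.5):
    # Objectness targets for one ground-truth object, following the rule above:
    # the prior with the best overlap gets target 1; priors that are not the best
    # but still overlap the object by more than the threshold are ignored (no loss);
    # all remaining priors are negatives and incur only objectness loss.
    targets = []
    for i, iou in enumerate(ious_with_gt):
        if i == best_prior_idx:
            targets.append(1)       # positive: objectness, coordinate and class loss
        elif iou > ignore_thresh:
            targets.append(None)    # ignored: the prediction is neither rewarded nor penalized
        else:
            targets.append(0)       # negative: objectness loss only
    return targets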

2.2. Class Prediction

    Each box predicts the classes the bounding box may contain using multilabel classification. We do not use a softmax as we have found it is unnecessary for good performance, instead we simply use independent logistic classifiers. During training we use binary cross-entropy loss for the class predictions. This formulation helps when we move to more complex domains like the Open Images Dataset [7]. In this dataset there are many overlapping labels (i.e. Woman and Person). Using a softmax imposes the assumption that each box has exactly one class which is often not the case. A multilabel approach better models the data.

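     As a rough sketch of this formulation (a hypothetical NumPy helper, not the paper's implementation), the class loss for one box could look like the following, where several targets may be 1 at once.

import numpy as np

def multilabel_class_loss(class_logits, class_targets, eps=1e-7):
    # Independent logistic classifiers with binary cross-entropy, instead of a softmax.
    # class_logits: raw scores, shape (num_classes,)
    # class_targets: 0/1 vector; several entries may be 1 at once
    # (e.g. both "Woman" and "Person" in the Open Images Dataset).
    p = 1.0 / (1.0 + np.exp(-np.asarray(class_logits, dtype=float)))  # per-class sigmoid
    t = np.asarray(class_targets, dtype=float)
    bce = -(t * np.log(p + eps) + (1.0 - t) * np.log(1.0 - p + eps))
    return bce.sum()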
