Technical Deep Dive | Understanding Focal Loss Better with MindSpore

Abstract: Focal Loss is a loss function proposed by Kaiming He's team in the paper Focal Loss for Dense Object Detection, where it is used to improve object detection in images.

This post looks at Focal Loss, the ICCV 2017 work by RBG and Kaiming He (https://arxiv.org/pdf/1708.02002.pdf).

Use case

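I have been working on facial expression recognition recently. Datasets in this area are small and often suffer from an imbalance between positive and negative samples. In general, there are two ways to deal with such an imbalance:

1. Design a sampling strategy, usually by resampling the classes that have few samples

2. Design the loss, usually by assigning different weights to samples of different classes

I have used both strategies. This article covers Focal Loss, which belongs to the second.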
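As a small illustration of the second strategy, the sketch below derives per-class weights from label counts. It assumes one common heuristic (weight = N / (K * count)), which is not taken from the article; the function name and the label data are hypothetical.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Return {class: weight} with weight = N / (K * count), so rare classes weigh more.

    N is the number of samples and K the number of distinct classes.
    This heuristic is an assumption for illustration, not the article's method.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Hypothetical imbalanced label set: 90 "neutral" faces vs 10 "surprise" faces.
labels = ["neutral"] * 90 + ["surprise"] * 10
weights = inverse_frequency_weights(labels)
print(weights)
```

With these counts the rare class receives a weight nine times larger than the frequent one, which is the kind of per-class weighting that a weighted loss consumes.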

Paper analysis

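Object detectors generally fall into two categories according to their pipeline. The first is two-stage detectors, such as the classic Faster R-CNN and R-FCN, which require region proposals. The second is one-stage detectors, such as SSD and the YOLO series, which need no region proposals and regress detections directly.

The first category can reach high accuracy but is slow. Reducing the number of proposals or lowering the input image resolution speeds it up, but not by a qualitative margin.

The second category is fast, but less accurate than the first.

So the starting point of focal loss is to let a one-stage detector reach the accuracy of a two-stage detector without giving up its original speed.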

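So, why? And what is the result?

What causes the accuracy gap? The reason is class imbalance: the numbers of positive and negative samples are heavily unbalanced.

In object detection, a single image can generate thousands or tens of thousands of candidate locations, but only a small fraction of them contain an object. This produces class imbalance. What are its consequences? The paper names two: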

(1) training is inefficient as most locations are easy negatives that contribute no useful learning signal;

(2) en masse, the easy negatives can overwhelm training and lead to degenerate models.

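In other words, the negative samples (those belonging to the background) are far too numerous. They account for most of the total loss, and most of them are easy to classify, so the model is optimized in a direction we do not want. The network then fails to learn useful information and cannot classify objects accurately. Earlier algorithms also tried to handle class imbalance, for example OHEM (online hard example mining). Its main idea can be summarized by one sentence from the paper: In OHEM each example is scored by its loss, non-maximum suppression (nms) is then applied, and a minibatch is constructed with the highest-loss examples. OHEM puts more weight on misclassified examples, but it discards the easy examples entirely.

To address class imbalance, the authors propose a new loss function, Focal Loss, obtained by modifying the standard cross-entropy loss. It reduces the weight of easy samples so that the model focuses on hard samples during training. To demonstrate its effectiveness, the authors design a dense detector, RetinaNet, and train it with Focal Loss. Experiments show that RetinaNet matches the speed of one-stage detectors while also reaching the accuracy of two-stage detectors.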
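To make the point about easy negatives dominating the total loss concrete, here is a minimal numeric sketch. The sample counts and probabilities are hypothetical values chosen for illustration; they are not taken from the paper.

```python
import math


def cross_entropy(p_t):
    """Cross entropy of one sample, given the probability assigned to its true class."""
    return -math.log(p_t)


# Hypothetical batch: many easy background locations, few hard object locations.
n_easy, p_easy = 10_000, 0.99   # easy negatives, each classified confidently
n_hard, p_hard = 20, 0.30       # hard positives, each classified poorly

easy_total = n_easy * cross_entropy(p_easy)
hard_total = n_hard * cross_entropy(p_hard)
easy_share = easy_total / (easy_total + hard_total)

print(f"easy negatives: {easy_total:.1f}")
print(f"hard positives: {hard_total:.1f}")
print(f"share of total loss from easy negatives: {easy_share:.1%}")
```

Each easy negative contributes a tiny loss, yet under these assumed numbers their sum is several times larger than the sum over the hard positives.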

Formulas

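Before introducing focal loss, let's look at the cross-entropy loss, using binary classification as the example. The original classification loss is the plain sum of the cross entropy of every training sample, so all samples carry the same weight. The formula is:

$$\mathrm{CE}(p, y) = \begin{cases} -\log(p) & \text{if } y = 1 \\ -\log(1 - p) & \text{otherwise} \end{cases} \tag{1}$$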

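Since this is binary classification, p is the predicted probability that the sample belongs to class 1 (in the range 0 to 1), and y is the label, taking values in {+1, -1}. When the true label is 1 (y = 1) and a sample x is predicted to be class 1 with probability p = 0.6, the loss is -log(0.6); note that this loss is always greater than or equal to 0. If p = 0.9, the loss is -log(0.9), so the loss at p = 0.6 is larger than the loss at p = 0.9, as expected. Binary classification is only the example here; the multi-class case follows the same pattern. For convenience, p_t replaces p, as in equation (2):

$$p_t = \begin{cases} p & \text{if } y = 1 \\ 1 - p & \text{otherwise} \end{cases} \tag{2}$$

This p_t is the horizontal axis of Figure 1.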

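In other words, p_t is the probability that the model assigns to the true class, so equation (1) can be written as:

$$\mathrm{CE}(p, y) = \mathrm{CE}(p_t) = -\log(p_t)$$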
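The worked numbers for the binary cross entropy can be checked with a short sketch. The function name is hypothetical, and the natural logarithm is assumed.

```python
import math


def binary_cross_entropy(p, y):
    """CE(p, y) for a label y in {+1, -1}; p is the predicted probability of class +1."""
    p_t = p if y == 1 else 1.0 - p   # probability assigned to the true class
    return -math.log(p_t)


loss_06 = binary_cross_entropy(0.6, 1)     # true class predicted with p = 0.6
loss_09 = binary_cross_entropy(0.9, 1)     # true class predicted with p = 0.9
loss_neg = binary_cross_entropy(0.9, -1)   # true label is -1, so p_t = 0.1

print(round(loss_06, 4), round(loss_09, 4), round(loss_neg, 4))
# prints: 0.5108 0.1054 2.3026
```

The more confident correct prediction (p = 0.9) gets the smaller loss, and the same p = 0.9 becomes a large loss when the true label is -1.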

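Equation (3), the α-balanced cross entropy $\mathrm{CE}(p_t) = -\alpha_t \log(p_t)$, can control the weights of positive and negative samples, but it cannot control the weights of easy and hard samples. This is what Focal Loss adds. Here γ ≥ 0 is called the focusing parameter, and $(1 - p_t)^\gamma$ is called the modulating factor:

$$\mathrm{FL}(p_t) = -(1 - p_t)^\gamma \log(p_t) \tag{4}$$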

Why add this modulating factor? The goal is to reduce the weight of easy samples so that the model focuses on hard samples during training.
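A minimal sketch of the focal loss for a single sample, assuming the form FL(p_t) = -(1 - p_t)^γ log(p_t) from the paper and the natural logarithm. The function name and the sample values of p_t are chosen for illustration.

```python
import math


def focal_loss(p_t, gamma=2.0):
    """FL(p_t) = -(1 - p_t)^gamma * log(p_t); gamma = 0 gives plain cross entropy."""
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


for p_t in (0.1, 0.5, 0.9):
    ce = focal_loss(p_t, gamma=0.0)   # standard cross entropy
    fl = focal_loss(p_t, gamma=2.0)   # focal loss with gamma = 2
    print(f"p_t={p_t}: CE={ce:.4f}  FL(gamma=2)={fl:.4f}")
```

For the hard sample (p_t = 0.1) the two losses stay close, while for the easy sample (p_t = 0.9) the focal loss is far smaller than the cross entropy.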

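The effect is visible when the losses are plotted, as in Figure 1 of the paper, with p_t on the horizontal axis and the loss on the vertical axis. CE(p_t) is the standard cross entropy, and FL(p_t) is the modified cross entropy used in focal loss. The blue curve in Figure 1, γ = 0, is the standard cross-entropy loss.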

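In practice the paper combines both ideas in the α-balanced form $\mathrm{FL}(p_t) = -\alpha_t (1 - p_t)^\gamma \log(p_t)$. This addresses the imbalance between positive and negative samples as well as the imbalance between easy and hard samples.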
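The combination of class weighting and the modulating factor can be sketched as follows, assuming the α-balanced form from the paper, FL(p_t) = -α_t (1 - p_t)^γ log(p_t), where α_t = α for y = 1 and 1 - α otherwise. The defaults α = 0.25 and γ = 2 are the setting the paper reports as working best; the function name is hypothetical.

```python
import math


def alpha_balanced_focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Alpha-balanced focal loss for one binary sample with label y in {+1, -1}."""
    p_t = p if y == 1 else 1.0 - p               # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha   # class-dependent weight
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# An easy negative (p = 0.05, y = -1) versus a hard positive (p = 0.2, y = +1).
easy_negative = alpha_balanced_focal_loss(0.05, -1)
hard_positive = alpha_balanced_focal_loss(0.2, 1)
print(f"easy negative: {easy_negative:.6f}")
print(f"hard positive: {hard_positive:.6f}")
```

Here α_t handles the positive/negative weighting and (1 - p_t)^γ handles the easy/hard weighting, so the two terms can be tuned separately.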

Conclusion

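The authors identify class imbalance as the main obstacle keeping one-stage methods from surpassing the top-performing two-stage methods. To address it, they propose focal loss, which adds a modulating term to the cross entropy so that learning focuses on hard examples and the large number of easy negatives is down-weighted. It therefore handles both the positive/negative imbalance and the distinction between easy and hard samples.

Here is the code that implements Focal Loss on MindSpore: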

import mindspore
import mindspore.common.dtype as mstype
from mindspore.common.tensor import Tensor
from mindspore.common.parameter import Parameter
from mindspore.ops import operations as P
from mindspore.ops import functional as F
from mindspore import nn

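# NOTE: _Loss, validator and the _check_* helpers used below are not imported in this
# excerpt; they are assumed to be internal helpers from MindSpore's own loss module.
class FocalLoss(_Loss):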

    def __init__(self, weight=None, gamma=2.0, reduction='mean'):
        super(FocalLoss, self).__init__(reduction=reduction)
        # Validate gamma, the focusing parameter (gamma >= 0) in the modulating factor (1 - p_t)^gamma
        self.gamma = validator.check_value_type("gamma", gamma, [float])
        if weight is not None and not isinstance(weight, Tensor):
            raise TypeError("The type of weight should be Tensor, but got {}.".format(type(weight)))
        self.weight = weight
        # MindSpore operators used below
        self.expand_dims = P.ExpandDims()
        self.gather_d = P.GatherD()
        self.squeeze = P.Squeeze(axis=1)
        self.tile = P.Tile()
        self.cast = P.Cast()

    def construct(self, predict, target):
        targets = target
        # Validate the inputs
        _check_ndim(predict.ndim, targets.ndim)
        _check_channel_and_shape(targets.shape[1], predict.shape[1])
        _check_predict_channel(predict.shape[1])

        # Reshape logits and target to num_batch * num_class * num_voxels.
        if predict.ndim > 2:
            predict = predict.view(predict.shape[0], predict.shape[1], -1) # N,C,H,W => N,C,H*W
            targets = targets.view(targets.shape[0], targets.shape[1], -1) # N,1,H,W => N,1,H*W or N,C,H*W
        else:
            predict = self.expand_dims(predict, 2) # N,C => N,C,1
            targets = self.expand_dims(targets, 2) # N,1 => N,1,1 or N,C,1
 
        # Compute the log-probabilities
        log_probability = nn.LogSoftmax(1)(predict)
        # Keep only the log-probability of the ground-truth class for each voxel.
        if target.shape[1] == 1:
            log_probability = self.gather_d(log_probability, 1, self.cast(targets, mindspore.int32))
            log_probability = self.squeeze(log_probability)

        # Recover the probabilities
        probability = F.exp(log_probability)

        if self.weight is not None:
            convert_weight = self.weight[None, :, None]  # C => 1,C,1
            convert_weight = self.tile(convert_weight, (targets.shape[0], 1, targets.shape[2])) # 1,C,1 => N,C,H*W
            if target.shape[1] == 1:
                convert_weight = self.gather_d(convert_weight, 1, self.cast(targets, mindspore.int32))  # selection of the weights  => N,1,H*W
                convert_weight = self.squeeze(convert_weight)  # N,1,H*W => N,H*W
            # Multiply the log-probabilities by their class weights
            log_probability = log_probability * convert_weight
        # Compute the loss for the mini-batch
        weight = F.pows(-probability + 1.0, self.gamma)
        if target.shape[1] == 1:
            loss = (-weight * log_probability).mean(axis=1)  # N
        else:
            loss = (-weight * targets * log_probability).mean(axis=-1)  # N,C

        return self.get_loss(loss)
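To follow the arithmetic in construct without a framework, here is a minimal pure-Python sketch. It is not MindSpore's implementation: it assumes 2-D logits of shape (N, C), integer class-index targets, per-class weights applied to the log-probability, and 'mean' reduction. The function name and the inputs are hypothetical.

```python
import math


def softmax_focal_loss(logits, targets, class_weights=None, gamma=2.0):
    """Mean focal loss over rows of logits with integer class targets."""
    losses = []
    for row, target in zip(logits, targets):
        # Log-softmax, computed stably by subtracting the row maximum.
        row_max = max(row)
        log_sum_exp = row_max + math.log(sum(math.exp(v - row_max) for v in row))
        log_p_t = row[target] - log_sum_exp   # log-probability of the true class
        p_t = math.exp(log_p_t)               # probability of the true class
        # The modulating factor uses the unweighted probability.
        modulating = (1.0 - p_t) ** gamma
        weight = 1.0 if class_weights is None else class_weights[target]
        losses.append(-weight * modulating * log_p_t)
    return sum(losses) / len(losses)


# Hypothetical inputs: two confident, correct rows and one undecided row.
logits = [[2.0, 0.0], [0.0, 2.0], [0.5, 0.5]]
targets = [0, 1, 1]

plain_ce = softmax_focal_loss(logits, targets, gamma=0.0)
focal = softmax_focal_loss(logits, targets, gamma=2.0)
weighted = softmax_focal_loss(logits, targets, class_weights=[1.0, 2.0], gamma=2.0)
print(f"cross entropy: {plain_ce:.4f}")
print(f"focal loss:    {focal:.4f}")
print(f"weighted:      {weighted:.4f}")
```

With gamma = 0 and no weights the function reduces to the mean softmax cross entropy, which is a quick sanity check for any focal loss implementation.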

Usage:

from mindspore.common import dtype as mstype
from mindspore import nn
from mindspore import Tensor

predict = Tensor([[0.8, 1.4], [0.5, 0.9], [1.2, 0.9]], mstype.float32)
target = Tensor([[1], [1], [0]], mstype.int32)
focalloss = nn.FocalLoss(weight=Tensor([1, 2]), gamma=2.0, reduction='mean')
output = focalloss(predict, target)
print(output)

0.33365273

Two important properties of Focal Loss

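1. When a sample is misclassified, p_t is small, so the modulating factor (1 - p_t)^γ is close to 1 and the loss is nearly unchanged from the original cross entropy. As p_t → 1 (the sample is classified correctly and is easy), the modulating factor approaches 0, so well-classified samples are down-weighted and contribute very little to the total loss.

2. When γ = 0, focal loss is the ordinary cross-entropy loss. As γ increases, the effect of the modulating factor grows as well. The focusing parameter γ smoothly adjusts the rate at which easy samples are down-weighted; in the paper's experiments γ = 2 works best. Intuitively, the modulating factor reduces the loss contribution of easy samples and extends the range in which a sample receives a low loss. For example, with γ = 2, an easy example with p_t = 0.9 has a loss 100 times lower than under standard cross entropy, and at p_t ≈ 0.968 about 1000 times lower, while for a hard example (p_t ≤ 0.5) the loss is reduced by at most 4 times. The relative weight of hard examples therefore rises considerably, which increases the importance of correcting misclassified samples.

These two properties are the core of Focal Loss: it uses a suitable function to measure how much hard and easy samples contribute to the total loss.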
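The down-weighting figures for γ = 2 can be checked directly, since the ratio of cross entropy to focal loss is CE / FL = 1 / (1 - p_t)^γ. The function name below is hypothetical.

```python
def down_weighting(p_t, gamma=2.0):
    """Factor by which focal loss is smaller than cross entropy: 1 / (1 - p_t)^gamma."""
    return 1.0 / (1.0 - p_t) ** gamma


for p_t in (0.5, 0.9, 0.968):
    print(f"p_t={p_t}: loss reduced about {round(down_weighting(p_t))}x")
# prints:
# p_t=0.5: loss reduced about 4x
# p_t=0.9: loss reduced about 100x
# p_t=0.968: loss reduced about 977x
```

The reduction at p_t = 0.968 is close to, rather than exactly, 1000 times, and no sample with p_t ≤ 0.5 is reduced by more than 4 times.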

Official MindSpore resources: GitHub: https://github.com/mindspore-ai/mindspore

Gitee: https://gitee.com/mindspore/mindspore


This article is shared from the Huawei Cloud community post "Technical Deep Dive | Understanding Focal Loss Better with MindSpore"; original author: chengxiaoli.
