Class-Balanced Loss Based on Effective Number of Samples (CVPR 2019)

Figure 1. Two classes, one from the head and one from the tail of a long-tailed dataset (iNaturalist 2017 [40] in this example), have drastically different numbers of samples. Models trained on these samples are biased toward dominant classes (black solid line). Re-weighting the loss by inverse class frequency usually yields poor performance (red dashed line) on real-world data with high class imbalance. We propose a theoretical framework to quantify the effective number of samples by taking data overlap into consideration. A class-balanced term is designed to re-weight the loss by the inverse effective number of samples. We show in experiments that the performance of a model can be improved when trained with the proposed class-balanced loss (blue dashed line).

Abstract

With the rapid increase of large-scale, real-world datasets, it becomes critical to address the problem of long-tailed data distribution (i.e., a few classes account for most of the data, while most classes are under-represented). Existing solutions typically adopt class re-balancing strategies such as re-sampling and re-weighting based on the number of observations for each class. In this work, we argue that as the number of samples increases, the additional benefit of a newly added data point will diminish. We introduce a novel theoretical framework to measure data overlap by associating with each sample a small neighboring region rather than a single point. The effective number of samples is defined as the volume of samples and can be calculated by a simple formula (1−β^n)/(1−β), where n is the number of samples and β ∈ [0, 1) is a hyperparameter. We design a re-weighting scheme that uses the effective number of samples for each class to re-balance the loss, thereby yielding a class-balanced loss. Comprehensive experiments are conducted on artificially induced long-tailed CIFAR datasets and large-scale datasets including ImageNet and iNaturalist. Our results show that when trained with the proposed class-balanced loss, the network is able to achieve significant performance gains on long-tailed datasets.

Introduction

The recent success of deep Convolutional Neural Networks (CNNs) for visual recognition [25, 36, 37, 16] owes much to the availability of large-scale, real-world annotated datasets [7, 27, 48, 40]. In contrast with commonly used visual recognition datasets (e.g., CIFAR [24, 39], ImageNet ILSVRC 2012 [7, 33] and CUB-200 Birds [42]) that exhibit roughly uniform distributions of class labels, real-world datasets have skewed [21] distributions with a long tail: a few dominant classes claim most of the examples, while most of the other classes are represented by relatively few examples. CNNs trained on such data perform poorly for weakly represented classes [19, 15, 41, 4].

A number of recent studies have aimed to alleviate the challenge of long-tailed training data [3, 31, 17, 41, 43, 12, 47, 44]. In general, there are two strategies: re-sampling and cost-sensitive re-weighting. In re-sampling, the number of examples is directly adjusted by over-sampling (adding repeated data) for the minor classes, under-sampling (removing data) for the major classes, or both. In cost-sensitive re-weighting, we influence the loss function by assigning relatively higher costs to examples from minor classes. In the context of deep feature representation learning with CNNs, re-sampling may either introduce large amounts of duplicated samples, which slows down training and makes the model susceptible to overfitting when over-sampling, or discard valuable examples that are important for feature learning when under-sampling. Due to these disadvantages of applying re-sampling to CNN training, the present work focuses on re-weighting approaches, namely, how to design a better class-balanced loss.

Typically, a class-balanced loss assigns sample weights inversely proportional to the class frequency. This simple heuristic has been widely adopted [17, 43]. However, recent work on training from large-scale, real-world, long-tailed datasets [30, 28] reveals poor performance when using this strategy. Instead, they use a "smoothed" version of weights that is empirically set to be inversely proportional to the square root of the class frequency. These observations suggest an interesting question: how can we design a better class-balanced loss that is applicable to a diverse array of datasets?

We aim to answer this question from the perspective of sample size. As illustrated in Figure 1, we consider training a model to discriminate between a major class and a minor class from a long-tailed dataset. Due to the highly imbalanced data, directly training the model or re-weighting the loss by the inverse number of samples cannot yield satisfactory performance. Intuitively, the more data, the better. However, since there is information overlap among data, as the number of samples increases, the marginal benefit a model can extract from the data diminishes. In light of this, we propose a novel theoretical framework to characterize data overlap and calculate the effective number of samples in a model- and loss-agnostic manner. A class-balanced re-weighting term that is inversely proportional to the effective number of samples is added to the loss function. Extensive experimental results indicate that this class-balanced term provides a significant boost to the performance of commonly used loss functions for training CNNs on long-tailed datasets.

Our key contributions can be summarized as follows:
(1) We provide a theoretical framework to study the effective number of samples and show how to design a class-balanced term to deal with long-tailed training data.
(2) We show that significant performance improvements can be achieved by adding the proposed class-balanced term to existing commonly used loss functions, including softmax cross-entropy, sigmoid cross-entropy and focal loss. In addition, we show that our class-balanced loss can be used as a generic loss for visual recognition by outperforming the commonly used softmax cross-entropy loss on ILSVRC 2012. We believe our study on quantifying the effective number of samples and the class-balanced loss can offer useful guidelines for researchers working in domains with long-tailed class distributions.

Effective Number of Samples

We formulate the data sampling process as a simplified version of random covering. The key idea is to associate each sample with a small neighboring region instead of a single point. We present our theoretical framework and the formulation for calculating the effective number of samples.

1. Data Sampling as Random Covering

Given a class, denote the set of all possible data in the feature space of this class as S. We assume the volume of S is N, with N ≥ 1. Denote each sample as a subset of S that has unit volume 1 and may overlap with other samples. Consider the data sampling process as a random covering problem, where each sample (subset) is randomly drawn from S to cover the entire set S. The more data are sampled, the better the coverage of S. The expected total volume of sampled data increases as the number of samples increases and is bounded by N. Therefore, we define:

Definition 1 (Effective Number). The effective number of samples is the expected volume of samples.

The calculation of the expected volume of samples is a very difficult problem that depends on the shape of the sample and the dimensionality of the feature space [18]. To make the problem tractable, we simplify it by not considering the situation of partial overlapping.

That is, we assume a newly sampled data point can only interact with previously sampled data in two ways: it is either entirely inside the set of previously sampled data with probability p, or entirely outside with probability 1 − p, as illustrated in Figure 2. As the number of sampled data points increases, the probability p becomes higher.

Figure 2. Given the set of all possible data with volume N and the set of previously sampled data, a new sample with volume 1 has probability p of being overlapped with previous data and probability 1 − p of not being overlapped.

Before we dive into the mathematical formulation, we discuss the connection between our definition of the effective number of samples and real-world visual data. Our key idea is to capture the diminishing marginal benefit of using more data points of a class. Due to intrinsic similarities among real-world data, as the number of samples grows, it is highly possible that a newly added sample is a near-duplicate of existing samples. In addition, CNNs are trained with heavy data augmentation, where simple transformations such as random cropping, re-scaling and horizontal flipping are applied to the input data. In this case, all augmented examples are also considered to be the same as the original example. Presumably, the stronger the data augmentation, the smaller N will be. The small neighboring region of a sample is a way to capture all near-duplicates and instances that can be obtained by data augmentation. For a class, N can be viewed as the number of unique prototypes.

2. Mathematical Formulation

Denote the effective number (expected volume) of samples as E_n, where n ∈ ℤ>0 is the number of samples.

Proposition 1 (Effective Number). E_n = (1−βⁿ)/(1−β), where β = (N−1)/N.

Proof. We prove the proposition by induction. It is obvious that E₁ = 1 because there is no overlapping, so E₁ = (1−β¹)/(1−β) = 1 holds. Now consider the general case where we have previously sampled n−1 examples and are about to sample the nᵗʰ example. The expected volume of previously sampled data is E_{n−1}, and the newly sampled data point has probability p = E_{n−1}/N of being overlapped with previous samples. Therefore, the expected volume after sampling the nᵗʰ example is:

E_n = p·E_{n−1} + (1−p)(E_{n−1} + 1) = 1 + ((N−1)/N)·E_{n−1} = 1 + β·E_{n−1}.

Assume E_{n−1} = (1−β^{n−1})/(1−β) holds; then

E_n = 1 + β·(1−β^{n−1})/(1−β) = ((1−β) + β − βⁿ)/(1−β) = (1−βⁿ)/(1−β). ∎

The above proposition shows that the effective number of samples is an exponential function of n. The hyperparameter β ∈ [0, 1) controls how fast E_n grows as n increases.
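
To make the closed form concrete, the short Python sketch below (added for this write-up, not part of the paper) evaluates E_n = (1−βⁿ)/(1−β) and cross-checks it against a Monte Carlo simulation of the simplified covering model described above; N = 10 and the trial count are arbitrary illustrative choices.

```python
import random

def effective_number(n, beta):
    """Closed form E_n = (1 - beta^n) / (1 - beta), with beta = (N - 1) / N."""
    return (1.0 - beta ** n) / (1.0 - beta)

def simulate_covering(n, N, trials=20000):
    """Monte Carlo estimate of the expected covered volume: each new unit-volume
    sample falls entirely inside the already-covered region with probability
    covered / N, otherwise it contributes one unit of new volume."""
    total = 0.0
    for _ in range(trials):
        covered = 0.0
        for _ in range(n):
            if random.random() >= covered / N:  # entirely outside previous samples
                covered += 1.0
        total += covered
    return total / trials

if __name__ == "__main__":
    N = 10                     # assumed number of unique prototypes for the class
    beta = (N - 1) / N
    for n in (1, 5, 20, 100):
        print(n, round(effective_number(n, beta), 3), round(simulate_covering(n, N), 3))
    # As beta -> 1 (N -> infinity), E_n approaches n itself; at beta = 0 (N = 1),
    # E_n stays at 1 regardless of n.
```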

Another explanation of the effective number E_n is:

E_n = (1−βⁿ)/(1−β) = ∑_{j=1}^{n} β^{j−1}.

This means that the jᵗʰ sample contributes β^{j−1} to the effective number. The total volume N of all possible data in the class can then be calculated as:

N = lim_{n→∞} ∑_{j=1}^{n} β^{j−1} = 1/(1−β).

This is consistent with our definition of β in the proposition, since β = (N−1)/N is equivalent to N = 1/(1−β).

Implication 1 (Asymptotic Properties). E_n = 1 if β = 0 (N = 1); E_n → n as β → 1 (N → ∞).

The asymptotic property of E_n shows that when N is large, the effective number of samples is the same as the actual number of samples n. In this scenario, we think the number of unique prototypes N is large, thus there is no data overlap and every sample is unique. On the other extreme, if N = 1, this means we believe there exists a single prototype, so that all the data in this class can be represented by this prototype via data augmentation, transformations, etc.

Class-Balanced Loss

The class-balanced loss is designed to address the problem of training from imbalanced data by introducing a weighting factor that is inversely proportional to the effective number of samples. The class-balanced loss term can be applied to a wide range of deep networks and loss functions. For an input sample x with label y ∈ {1, 2, ..., C}, where C is the total number of classes, suppose the model's estimated class probabilities are p = [p₁, p₂, ..., p_C]ᵀ, where pᵢ ∈ [0, 1] ∀ i; we denote the loss as L(p, y). Suppose the number of samples for class i is nᵢ; based on Proposition 1, the proposed effective number of samples for class i is E_{nᵢ} = (1−βᵢ^{nᵢ})/(1−βᵢ), where βᵢ = (Nᵢ−1)/Nᵢ.

Without further information about the data in each class, it is difficult to empirically find a good set of hyperparameters Nᵢ for all classes. Therefore, in practice, we assume Nᵢ is only dataset-dependent and set Nᵢ = N, βᵢ = β = (N−1)/N for all classes in a dataset. To balance the loss, we introduce a weighting factor αᵢ that is inversely proportional to the effective number of samples for class i: αᵢ ∝ 1/E_{nᵢ}. To keep the total loss roughly on the same scale when applying αᵢ, we normalize αᵢ so that ∑_{i=1}^{C} αᵢ = C. For simplicity, we abuse the notation 1/E_{nᵢ} to denote the normalized weighting factor in the rest of the paper.

Formally speaking, given a sample from class i that contains nᵢ samples in total, we propose to add a weighting factor (1−β)/(1−β^{nᵢ}) to the loss function, with hyperparameter β ∈ [0, 1). The class-balanced (CB) loss can be written as:

CB(p, y) = (1/E_{n_y})·L(p, y) = ((1−β)/(1−β^{n_y}))·L(p, y),

where n_y is the number of samples in the ground-truth class y.
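
For illustration, the following Python sketch (written for this post; the class counts are made up) computes the normalized per-class weights αᵢ for a few values of β, showing the smooth transition from uniform weights toward inverse-class-frequency weighting:

```python
import numpy as np

def class_balanced_weights(samples_per_class, beta):
    """alpha_i proportional to 1 / E_{n_i} = (1 - beta) / (1 - beta^{n_i}),
    normalized so that the C weights sum to C."""
    counts = np.asarray(samples_per_class, dtype=np.float64)
    effective_num = (1.0 - np.power(beta, counts)) / (1.0 - beta)
    weights = 1.0 / effective_num
    return weights / weights.sum() * len(counts)

counts = [5000, 2000, 500, 100, 10]      # hypothetical long-tailed class counts
for beta in (0.9, 0.99, 0.999, 0.9999):
    print(beta, np.round(class_balanced_weights(counts, beta), 3))
# beta = 0 would give uniform weights (no re-weighting); as beta -> 1 the
# weights approach re-weighting by inverse class frequency.
```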

We visualize the class-balanced loss in Figure 3 as a function of n_y for different β. Note that:

  • β = 0 corresponds to no re-weighting
  • β → 1 corresponds to re-weighting by inverse class frequency

The proposed novel concept of effective number of samples enables us to use a hyperparameter β to smoothly adjust the class-balanced term between no re-weighting and re-weighting by inverse class frequency.

The proposed class-balanced term is:

  1. Model-agnostic: independent of the choice of model architecture
  2. Loss-agnostic: independent of the choice of loss function L and predicted class probabilities p

To demonstrate that the proposed class-balanced loss is generic, we show how to apply the class-balanced term to three commonly used loss functions: softmax cross-entropy loss, sigmoid cross-entropy loss, and focal loss.

1. Class-Balanced Softmax Cross-Entropy Loss

Suppose the predicted output of the model for all classes is z = [z₁, z₂, ..., z_C]ᵀ, where C is the total number of classes. The softmax function regards the classes as mutually exclusive and computes the probability distribution over all classes as pᵢ = exp(zᵢ)/∑_{j=1}^{C} exp(zⱼ), ∀ i ∈ {1, 2, ..., C}. Given a sample with class label y, the softmax cross-entropy (CE) loss for this sample is written as:

CE_softmax(z, y) = −log(p_y) = −log( exp(z_y) / ∑_{j=1}^{C} exp(zⱼ) ).

Suppose class y has n_y training samples; the class-balanced (CB) softmax cross-entropy loss is:

CB_softmax(z, y) = −((1−β)/(1−β^{n_y}))·log( exp(z_y) / ∑_{j=1}^{C} exp(zⱼ) ).
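
As a concrete sanity check, here is a minimal PyTorch sketch of the class-balanced softmax cross-entropy under the notation above (an illustration for this post; the helper name, β value and class counts are assumptions, not the authors' released code):

```python
import torch
import torch.nn.functional as F

def cb_softmax_cross_entropy(logits, labels, samples_per_class, beta=0.9999):
    """logits: (batch, C) raw scores z; labels: (batch,) ground-truth class indices."""
    n = torch.as_tensor(samples_per_class, dtype=torch.float32, device=logits.device)
    weights = (1.0 - beta) / (1.0 - torch.pow(beta, n))          # (1-beta)/(1-beta^{n_i})
    weights = weights / weights.sum() * len(samples_per_class)   # normalize to sum to C
    per_sample = F.cross_entropy(logits, labels, reduction="none")  # -log p_y
    return (weights[labels] * per_sample).mean()

# Example usage with random logits and hypothetical long-tailed class counts:
logits = torch.randn(8, 5)
labels = torch.randint(0, 5, (8,))
loss = cb_softmax_cross_entropy(logits, labels, [5000, 2000, 500, 100, 10])
```

In practice the weights would be precomputed once per dataset; they could also be passed through the weight argument of F.cross_entropy, though its default "mean" reduction then divides by the sum of the per-sample weights rather than by the batch size.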

2. Class-Balanced Sigmoid Cross-Entropy Loss

Different from softmax, the class probabilities calculated by the sigmoid function assume that the classes are independent and not mutually exclusive. When using the sigmoid function, we regard multi-class visual recognition as multiple binary classification tasks, where each output node of the network performs a one-vs-all classification to predict the probability of the target class over the rest of the classes. Compared with softmax, sigmoid presumably has two advantages for real-world datasets:

  1. Sigmoid does not assume mutual exclusiveness among classes, which aligns well with real-world data, where classes may have fine-grained similarities, especially when the number of classes is large.
  2. Since each class is considered independent and has its own predictor, sigmoid unifies single-label classification with multi-label prediction. This is particularly valuable for real-world data, which often contains multiple semantic labels.

Using the same notation as for softmax cross-entropy, for simplicity we define z^t_i as:

z^t_i = zᵢ if i = y, and z^t_i = −zᵢ otherwise.

Then the sigmoid cross-entropy (CE) loss can be written as:

CE_sigmoid(z, y) = −∑_{i=1}^{C} log( sigmoid(z^t_i) ) = −∑_{i=1}^{C} log( 1/(1 + exp(−z^t_i)) ).

The class-balanced (CB) sigmoid cross-entropy loss is:

CB_sigmoid(z, y) = −((1−β)/(1−β^{n_y})) ∑_{i=1}^{C} log( 1/(1 + exp(−z^t_i)) ).
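
A matching PyTorch sketch (again illustrative, reusing the same weight computation) expresses the per-class terms −log(sigmoid(z^t_i)) through binary cross-entropy with logits on one-hot targets:

```python
import torch
import torch.nn.functional as F

def cb_sigmoid_cross_entropy(logits, labels, samples_per_class, beta=0.9999):
    """Each class is an independent one-vs-all binary problem; the per-sample loss
    is the sum over classes, scaled by the weight of the ground-truth class."""
    n = torch.as_tensor(samples_per_class, dtype=torch.float32, device=logits.device)
    weights = (1.0 - beta) / (1.0 - torch.pow(beta, n))
    weights = weights / weights.sum() * len(samples_per_class)
    targets = F.one_hot(labels, num_classes=logits.shape[1]).float()
    # -sum_i log(sigmoid(z^t_i)) equals the sum over classes of BCE-with-logits.
    per_class = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    per_sample = per_class.sum(dim=1)
    return (weights[labels] * per_sample).mean()
```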

3. Class-Balanced Focal Loss

The recently proposed focal loss (FL) [26] adds a modulating factor to the sigmoid cross-entropy loss in order to:

  • Reduce the relative loss for well-classified samples
  • Focus training on difficult samples

Denote p^t_i = sigmoid(z^t_i) = 1/(1 + exp(−z^t_i)). The focal loss can be written as:

FL(z, y) = −∑_{i=1}^{C} (1 − p^t_i)^γ log(p^t_i),

where γ ≥ 0 is the focusing hyperparameter of the focal loss. The class-balanced (CB) focal loss is:

CB_focal(z, y) = −((1−β)/(1−β^{n_y})) ∑_{i=1}^{C} (1 − p^t_i)^γ log(p^t_i).

The original focal loss has an α-balanced variant. The class-balanced focal loss is the same as the α-balanced focal loss when α_t = (1−β)/(1−β^{n_y}). Therefore, the class-balanced term can be viewed as an explicit way to set α_t in the focal loss based on the effective number of samples.
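
To round out the three variants, here is an illustrative PyTorch sketch of the class-balanced focal loss (γ, the class counts and the helper name are assumptions of this post, not the paper's released implementation):

```python
import torch
import torch.nn.functional as F

def cb_focal_loss(logits, labels, samples_per_class, beta=0.9999, gamma=2.0):
    """Sigmoid focal loss summed over classes, scaled by
    (1 - beta) / (1 - beta^{n_y}) for the ground-truth class y."""
    n = torch.as_tensor(samples_per_class, dtype=torch.float32, device=logits.device)
    weights = (1.0 - beta) / (1.0 - torch.pow(beta, n))
    weights = weights / weights.sum() * len(samples_per_class)
    targets = F.one_hot(labels, num_classes=logits.shape[1]).float()
    # z^t flips the sign of the logit for negative classes, so that
    # p^t = sigmoid(z^t) is the probability of the correct decision per class.
    zt = torch.where(targets > 0, logits, -logits)
    pt = torch.sigmoid(zt)
    # -(1 - p^t)^gamma * log(p^t), with a numerically stable log-sigmoid.
    per_class = -((1.0 - pt) ** gamma) * F.logsigmoid(zt)
    per_sample = per_class.sum(dim=1)
    return (weights[labels] * per_sample).mean()
```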
