Source: Bi-Tempered Logistic Loss for Training Neural Nets with Noisy Data from Google Research

Posted by Ehsan Amid, Student Researcher and Rohan Anil, Software Engineer, Google Research

The quality of models produced by machine learning (ML) algorithms directly depends on the quality of the training data, but real-world datasets typically contain some amount of noise that introduces challenges for ML models. Noise in a dataset can take several forms, from corrupted examples (e.g., lens flare in an image of a cat) to examples mislabeled when the data was collected (e.g., an image of a cat mislabeled as a flerken).

The ability of an ML model to deal with noisy training data depends in great part on the loss function used in the training process. For classification tasks, the standard loss function used for training is the logistic loss. However, this particular loss function falls short when handling noisy training examples due to two unfortunate properties:

**Outliers far away can dominate the overall loss:** The logistic loss function is sensitive to outliers, because its value grows without bound as mislabeled examples (outliers) move farther away from the decision boundary. Thus, a single bad example located far from the boundary can penalize the training process to the extent that the final trained model compensates for it by stretching the decision boundary, potentially sacrificing the remaining good examples. This “large-margin” noise issue is illustrated in the left panel of the figure below.

**Mislabeled examples nearby can stretch the decision boundary:** The output of the neural network is a vector of activation values, which reflects the margin between the example and the decision boundary for each class. The softmax transfer function converts these activation values into probabilities that an example belongs to each class. Because the tail of this transfer function decays exponentially fast, the training process tends to stretch the boundary closer to a mislabeled example in order to compensate for its small margin. Consequently, the generalization performance of the network deteriorates immediately, even at a low level of label noise (right panel below).

We visualize the decision surface of a two-layer neural network as it is trained for binary classification. Blue and orange dots represent examples from the two classes. The network is trained with the logistic loss under two types of noisy conditions: (left) large-margin noise and (right) small-margin noise.
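The first failure mode can be made concrete numerically. The sketch below (an illustration, not code from the paper) shows that the logistic loss of a single example grows roughly linearly as its signed margin becomes more negative, and is therefore unbounded:

```python
import math

def logistic_loss(margin):
    """Binary logistic loss log(1 + exp(-m)) as a function of the
    signed margin m = y * f(x); large negative margins correspond to
    badly misclassified points."""
    return math.log(1.0 + math.exp(-margin))

# A single outlier far on the wrong side of the boundary contributes a
# loss that grows roughly linearly with its distance -- unbounded:
for m in [-1.0, -10.0, -100.0]:
    print(f"margin={m:7.1f}  loss={logistic_loss(m):8.3f}")
```

At a margin of −100 the loss is about 100, so one such outlier can outweigh a large number of correctly classified points.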

We tackle these two problems in a recent paper by introducing a “bi-tempered” generalization of the logistic loss, endowed with two tunable parameters that handle those situations well, which we call “temperatures”: *t _{1}*, which characterizes boundedness, and *t _{2}*, which characterizes tail-heaviness.
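Concretely, the two temperatures enter through the tempered logarithm, log_t(x) = (x^(1−t) − 1)/(1 − t), and the tempered exponential, exp_t(x) = [1 + (1 − t)x]₊^(1/(1−t)), both of which reduce to the ordinary log and exp as t → 1. A minimal NumPy sketch of these two primitives (simplified from the paper's definitions):

```python
import numpy as np

def log_t(x, t):
    # Tempered logarithm; reduces to log(x) as t -> 1.
    if t == 1.0:
        return np.log(x)
    return (x ** (1.0 - t) - 1.0) / (1.0 - t)

def exp_t(x, t):
    # Tempered exponential; reduces to exp(x) as t -> 1.
    if t == 1.0:
        return np.exp(x)
    return np.maximum(1.0 + (1.0 - t) * x, 0.0) ** (1.0 / (1.0 - t))

# For t > 1 the tail of exp_t decays polynomially rather than
# exponentially -- much heavier than the standard softmax tail:
for x in [5.0, 20.0, 100.0]:
    print(f"x={x:6.1f}  exp(-x)={np.exp(-x):.2e}  exp_t(-x, 2.0)={exp_t(-x, 2.0):.2e}")
```

The two functions are inverses of each other, which is what lets the tempered loss and the tempered softmax fit together.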

To demonstrate the effect of each temperature, we train a two-layer feed-forward neural network for a binary classification problem on a synthetic dataset that contains a circle of points from the first class and a concentric ring of points from the second class. You can try this yourself in your browser with our interactive visualization. We use the standard logistic loss function, which can be recovered by setting both temperatures equal to 1.0, as well as our bi-tempered logistic loss for training the network. We then demonstrate the effects of each loss function for a clean dataset, a dataset with small-margin noise, one with large-margin noise, and one with random noise.
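For readers who want to reproduce a similar experiment, a sketch of such a synthetic dataset follows; the radii, sample sizes, and noise model here are illustrative assumptions, not the exact values used in our demo:

```python
import numpy as np

def make_circle_and_ring(n=400, noise_frac=0.0, seed=0):
    """Synthetic 2-D dataset: an inner disc of class 0 and a concentric
    ring of class 1 (radii are illustrative). Optionally flips a random
    fraction of labels to simulate label noise."""
    rng = np.random.default_rng(seed)
    # Inner disc of points (class 0), radius < 1.
    r0 = rng.uniform(0.0, 1.0, n // 2)
    a0 = rng.uniform(0.0, 2 * np.pi, n // 2)
    # Concentric ring (class 1), radius between 2 and 3.
    r1 = rng.uniform(2.0, 3.0, n - n // 2)
    a1 = rng.uniform(0.0, 2 * np.pi, n - n // 2)
    r = np.concatenate([r0, r1])
    a = np.concatenate([a0, a1])
    X = np.stack([r * np.cos(a), r * np.sin(a)], axis=1)
    y = np.concatenate([np.zeros(n // 2, int), np.ones(n - n // 2, int)])
    flip = rng.random(n) < noise_frac  # random label noise
    y[flip] = 1 - y[flip]
    return X, y
```

Setting `noise_frac` to 0 gives the clean dataset; restricting the flips to points near or far from the class boundary would give the small-margin and large-margin variants discussed below.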

**Noise-Free Case:**

We show the results of training the model on the noise-free dataset in column (a), using the logistic loss (*top*) and the bi-tempered logistic loss (*bottom*). The white line shows the decision boundary for each model. The values of the temperatures (*t _{1}*, *t _{2}*) used for the bi-tempered loss are indicated with each plot.

**Small-Margin Noise:**

To illustrate the effect of the tail-heaviness of the probabilities, we artificially corrupt a random subset of the examples that are near the decision boundary; that is, we flip the labels of these points to the opposite class. The results of training the networks on data with small-margin noise using the logistic loss as well as the bi-tempered loss are shown in column (b).

As can be seen, the logistic loss, due to the lightness of the softmax tail, stretches the boundary closer to the noisy points to compensate for their low probabilities. On the other hand, the bi-tempered loss, using only the tail-heavy probability transfer function (by adjusting *t _{2}*), successfully avoids the noisy examples. This can be explained by the heavier tail of the tempered exponential function, which assigns reasonably high probability to these points (and thus keeps their loss values small) while maintaining the decision boundary away from the noisy examples.
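The heavy-tailed transfer function described above can be sketched as a "tempered softmax": activations are passed through exp_t, and a normalization constant λ is chosen so the resulting probabilities sum to one. The paper computes λ with a fixed-point iteration; the bisection below is a simpler stand-in that finds the same constant:

```python
import numpy as np

def exp_t(x, t):
    # Tempered exponential; for t > 1 its tail decays polynomially.
    return np.maximum(1.0 + (1.0 - t) * x, 0.0) ** (1.0 / (1.0 - t))

def tempered_softmax(activations, t, n_iter=60):
    """Normalize activations so that sum_i exp_t(a_i - lam) = 1.
    The sum is decreasing in lam and is at least 1 at lam = max(a),
    so bisection over lam >= max(a) converges. Assumes t != 1."""
    a = np.asarray(activations, dtype=float)
    lo, hi = a.max(), a.max() + 1.0
    while exp_t(a - hi, t).sum() > 1.0:  # widen until the sum drops below 1
        hi = lo + 2.0 * (hi - lo)
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if exp_t(a - mid, t).sum() > 1.0:
            lo = mid
        else:
            hi = mid
    return exp_t(a - 0.5 * (lo + hi), t)

# With t = 2.0 the lowest-scoring class keeps far more probability mass
# than under the standard softmax -- the "heavy tail" at work:
print(tempered_softmax([5.0, 0.0, -5.0], 2.0))
```

Because a small-margin mislabeled point already receives a non-negligible probability under this transfer function, the optimizer has little incentive to drag the boundary toward it.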

**Large-Margin Noise:**

Next, we evaluate the performance of the two loss functions for handling large-margin noisy examples. In (c), we randomly corrupt a subset of the examples that are located far away from the decision boundary (on the outer side of the ring, as well as points near the center).

For this case, we only use the boundedness property of the bi-tempered loss, while keeping the softmax probabilities the same as for the logistic loss. The unboundedness of the logistic loss causes the decision boundary to expand towards the noisy points to reduce their loss values. On the other hand, the bi-tempered loss, made bounded by adjusting *t _{1}*, incurs only a finite amount of loss for each noisy example. As a result, the bi-tempered loss can avoid these noisy examples and maintain a good decision boundary.
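The boundedness property has a simple closed form: for *t1* < 1, the tempered penalty −log_t(p) of a single example saturates at 1/(1 − *t1*) as the assigned probability p of the true class goes to zero, whereas the logistic loss −log p diverges. A small numerical sketch:

```python
import numpy as np

def log_t(x, t):
    # Tempered logarithm; for t < 1, -log_t(p) is bounded by 1/(1 - t).
    return (x ** (1.0 - t) - 1.0) / (1.0 - t)

# Even a near-zero probability for the true class incurs only a bounded
# penalty under the tempered loss (here at most 1/(1 - 0.5) = 2):
for p in [1e-1, 1e-4, 1e-8]:
    print(f"p={p:g}  logistic={-np.log(p):6.2f}  tempered(t1=0.5)={-log_t(p, 0.5):.4f}")
```

This cap is exactly why a far-away outlier cannot dominate the total loss: its contribution can never exceed the finite bound.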

**Random Noise:**

Finally, we investigate the effect of random noise in the training data on the two loss functions. Note that random noise comprises both small-margin and large-margin noisy examples. Thus, we use both the boundedness and the tail-heaviness properties of the bi-tempered loss function by setting *t _{1}* < 1.0 (for boundedness) and *t _{2}* > 1.0 (for tail-heaviness).

As can be seen from the results in the last column, (d), the logistic loss is highly affected by the noisy examples and clearly fails to converge to a good decision boundary. On the other hand, the bi-tempered loss can recover a decision boundary that is almost identical to the noise-free case.
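Putting the two pieces together, a minimal NumPy sketch of the per-example bi-tempered loss (an illustration, not our released implementation; the temperature values below are purely illustrative) combines the tempered softmax with the tempered-log matching loss from the paper, which for a one-hot label reduces to −log_{t1}(p_label) − (1 − Σ_i p_i^{2−t1})/(2 − t1):

```python
import numpy as np

def log_t(x, t):
    # Tempered logarithm; reduces to log(x) as t -> 1.
    return np.log(x) if t == 1.0 else (x ** (1.0 - t) - 1.0) / (1.0 - t)

def exp_t(x, t):
    # Tempered exponential with the usual [.]_+ clipping.
    return np.maximum(1.0 + (1.0 - t) * x, 0.0) ** (1.0 / (1.0 - t))

def tempered_softmax(activations, t, n_iter=60):
    # Bisection for the normalization constant (the paper uses a
    # fixed-point iteration instead). Assumes t != 1.
    a = np.asarray(activations, dtype=float)
    lo, hi = a.max(), a.max() + 1.0
    while exp_t(a - hi, t).sum() > 1.0:
        hi = lo + 2.0 * (hi - lo)
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if exp_t(a - mid, t).sum() > 1.0 else (lo, mid)
    return exp_t(a - 0.5 * (lo + hi), t)

def bi_tempered_loss(activations, label, t1, t2):
    """Bi-tempered logistic loss for one example with one-hot label
    index `label`; t1 < 1 bounds the loss, t2 > 1 fattens the tail."""
    probs = tempered_softmax(activations, t2)
    return (-log_t(probs[label], t1)
            - (1.0 - np.sum(probs ** (2.0 - t1))) / (2.0 - t1))

# A far-away mislabeled example (large negative margin) incurs only a
# bounded loss, unlike under the logistic loss:
print(bi_tempered_loss([-10.0, 0.0], 0, t1=0.5, t2=2.0))
```

Setting `t1 = t2 = 1.0` in this form recovers the ordinary softmax cross entropy, which is the sense in which the loss is a generalization of the logistic loss.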

**Conclusion**

In this work we constructed a bounded, tempered loss function that can handle large-margin outliers, and introduced heavy-tailedness in our new tempered softmax function, which can handle small-margin mislabeled examples. Using our bi-tempered logistic loss, we achieve excellent empirical performance training neural networks on a number of large standard datasets (please see our paper for full details). Note that state-of-the-art neural networks have been optimized along with a large variety of variables such as architecture, transfer function, choice of optimizer, and label smoothing, to name just a few. Our method introduces two additional tunable variables, namely the temperatures (*t _{1}*, *t _{2}*).

**Acknowledgements**: *This blog post reflects work with our co-authors Manfred Warmuth, Visiting Researcher, and Tomer Koren, Senior Research Scientist, Google Research. A preprint of our paper is available here; it contains a theoretical analysis of the loss function and empirical results on standard datasets at scale.*
