Adding Training Noise To Improve Detections In Transformers

Denoising, explained.


Modern vision transformers add noise during training to improve the performance of 2D and 3D object detection. In this article we will learn how this mechanism works and discuss its contribution.

Early Vision Transformers

DETR — DEtection TRansformer (Carion, Massa et al., 2020), one of the first transformer architectures for object detection, used learned decoder queries to extract detection information from the image tokens. These queries were randomly initialized, and the architecture did not impose any constraints that forced them to learn anything resembling anchors. While achieving results comparable to Faster R-CNN, its drawback was slow convergence — 500 epochs were required to train it (DN-DETR, Li et al., 2022). More recent DETR-based architectures used deformable aggregation, which enables queries to focus only on certain regions of the image (Zhu et al., Deformable DETR: Deformable Transformers For End-To-End Object Detection, 2020), while others (Liu et al., DAB-DETR: Dynamic Anchor Boxes Are Better Queries For DETR, 2022) used spatial anchors (generated using k-means, similarly to how anchor-based CNNs do it) that were encoded into the initial queries. Skip connections forced the decoder blocks of the transformer to learn boxes as regression values from the anchors, and deformable attention layers used the pre-encoding anchors to sample spatial features from the image and construct tokens for attention. During training, the model learns the optimal anchors to use. This approach teaches the model to explicitly use features like box size in its queries.
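To make the anchor-to-query idea concrete, here is a minimal sketch of how a 4D anchor box could be encoded into the positional part of a decoder query, loosely in the spirit of DAB-DETR. The embedding size and the helper function are illustrative assumptions, not the paper's exact implementation:

```python
import torch

def sinusoidal_embed(x: torch.Tensor, num_feats: int = 128,
                     temperature: float = 10000.0) -> torch.Tensor:
    """Map each scalar coordinate in x to a num_feats-dim sinusoidal embedding."""
    dim_t = torch.arange(num_feats, dtype=torch.float32)
    dim_t = temperature ** (2 * (dim_t // 2) / num_feats)
    pos = x.unsqueeze(-1) / dim_t                          # (..., num_feats)
    return torch.stack((pos[..., 0::2].sin(),
                        pos[..., 1::2].cos()), dim=-1).flatten(-2)

# Hypothetical anchors: (num_queries, 4) in normalized (cx, cy, w, h).
anchors = torch.rand(300, 4)

# Encode each coordinate and concatenate into the positional part of the
# query. The decoder then predicts offsets relative to these anchors, so
# position and box size are explicit in the query rather than something
# each query must discover from scratch.
query_pos = torch.cat([sinusoidal_embed(anchors[:, i]) for i in range(4)], dim=-1)
print(query_pos.shape)  # torch.Size([300, 512])
```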

Figure 1. DETR, basic diagram. The yellow and purple queries ideally lead to detections with very low confidence or detections with class “No object”. Source: The author.

Prediction To Ground Truth Matching

In order to calculate the loss, the trainer first needs to match the model’s predictions with ground truth (GT) boxes. While anchor-based CNNs have relatively easy solutions to this problem (e.g., during training each anchor can only be matched with GT boxes in its grid cell, and at inference non-maximum suppression removes overlapping detections), the standard for transformers, set by DETR, is to use a bipartite matching method known as the Hungarian algorithm. In each iteration, the algorithm finds the best prediction-to-GT matching (one that optimizes some cost function, like the mean squared distance between box corners, summed over all the boxes). The loss is then calculated between each matched prediction-GT pair and can be back-propagated. Excess predictions (predictions with no matching GT) incur a separate loss that encourages them to decrease their confidence scores.
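As a concrete illustration, here is a minimal sketch of this matching step using SciPy’s linear_sum_assignment (an implementation of the Hungarian algorithm). The pure L1 box cost is a simplification: DETR’s actual matching cost also includes classification and generalized IoU terms.

```python
import torch
from scipy.optimize import linear_sum_assignment

# Hypothetical predictions and GT, both in normalized (cx, cy, w, h).
pred_boxes = torch.rand(300, 4)   # one box per decoder query
gt_boxes = torch.rand(5, 4)

# Pairwise L1 cost between every prediction and every GT box.
cost = torch.cdist(pred_boxes, gt_boxes, p=1)              # (300, 5)

# Hungarian algorithm: the globally optimal one-to-one assignment.
pred_idx, gt_idx = linear_sum_assignment(cost.numpy())

# The box loss is computed only on matched pairs; the remaining 295
# predictions are pushed toward the "no object" class instead.
matched_loss = cost[torch.as_tensor(pred_idx), torch.as_tensor(gt_idx)].mean()
```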

The Problem

The time complexity of the Hungarian algorithm is O(n³). Interestingly, this isn’t necessarily the bottleneck in training quality: it has been shown (The Stable Marriage Problem: An Interdisciplinary Review From The Physicist’s Perspective, Fenoaltea et al., 2021) that the algorithm is unstable, in the sense that a small change in its objective function may lead to a dramatic change in the resulting matching — leading to inconsistent query training objectives. The practical implication for transformer training is that object queries can jump between objects and take a long time to learn the best features for convergence.
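A toy example makes this instability tangible: nudging a single cost entry by a few percent flips the entire assignment. The numbers here are invented purely for illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Two predictions, two GT boxes, nearly symmetric costs.
cost = np.array([[1.00, 1.01],
                 [1.01, 1.00]])
print(linear_sum_assignment(cost)[1])   # [0 1]: prediction i matches GT i

# Nudging a single entry by 0.03 flips BOTH matches at once.
cost[0, 0] += 0.03
print(linear_sum_assignment(cost)[1])   # [1 0]: the assignment is reversed
```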

DN-DETR

An elegant solution to the unstable matching problem was proposed by Li et al. and later adopted by many other works, including DINO, Mask DINO, and Group DETR.

The main idea in DN-DETR is to boost training by creating fictitious, easy-to-regress-from anchors that skip the matching process. This is done during training by adding a small amount of noise to GT boxes and feeding these noised-up boxes as anchors to the decoder queries. The DN queries are masked from the organic queries and vice versa, to avoid cross-attention that would interfere with the training. The detections generated by these queries are already matched with their source GT boxes and do not require bipartite matching. The authors of DN-DETR have shown that, measured at the validation stages at epoch ends (where denoising is turned off), this improves the stability of the model compared to DETR and DAB-DETR, in the sense that more queries are consistently matched with the same GT object in successive epochs (see Figure 2).
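Here is a rough sketch of the idea, assuming normalized (cx, cy, w, h) boxes. The noise scheme and mask layout are simplified relative to the paper’s actual implementation:

```python
import torch

def make_dn_anchors(gt_boxes: torch.Tensor, noise_scale: float = 0.4) -> torch.Tensor:
    """Jitter GT boxes (normalized cx, cy, w, h) into easy-to-regress-from
    denoising anchors. The noise model is a simplification of DN-DETR's."""
    noise = (torch.rand_like(gt_boxes) * 2 - 1) * noise_scale   # U(-s, s)
    noised = gt_boxes.clone()
    noised[:, :2] += noise[:, :2] * gt_boxes[:, 2:]   # shift center by a fraction of w, h
    noised[:, 2:] *= 1 + noise[:, 2:]                 # rescale width and height
    return noised.clamp(0, 1)

num_gt, num_matching = 5, 300
gt_boxes = torch.rand(num_gt, 4)
dn_anchors = make_dn_anchors(gt_boxes)   # matched to their GT by construction

# Attention mask (True = blocked): DN queries and the organic matching
# queries must not attend to each other, so that GT information does not
# leak from the DN queries into the ordinary ones.
total = num_gt + num_matching
attn_mask = torch.zeros(total, total, dtype=torch.bool)
attn_mask[:num_gt, num_gt:] = True   # DN queries can't see matching queries
attn_mask[num_gt:, :num_gt] = True   # matching queries can't see DN queries
```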

The authors show that using DN both accelerates convergence and achieves better detection results (see Figure 3). Their ablation study exhibits an increase of 1.9 AP on the COCO detection dataset compared to the previous SOTA (DAB-DETR, 42.2 AP) when using ResNet-50 as the backbone.

Figure 2. Illustration of instability during training as measured during validation. Based on data provided in DN-DETR (Li et al., 2022). Image source: The author.
Figure 3. DN-DETR’s performance quickly surpasses DETR’s maximum performance in 1/10 of the training epochs. Based on data in DN-DETR (Li et al., 2022). Image source: The author.

DINO And Contrastive Denoising

DINO took this idea further and added contrastive learning to the denoising mechanism: in addition to the positive example, DINO creates another noised-up version of each GT box, mathematically constructed to be more distant from the GT than the positive example (see Figure 4). That version is used as a negative example for the training: the model learns to accept the detection nearer to the ground truth and reject the one that is farther away (by learning to predict the class “no object”).

In addition, DINO enables multiple contrastive denoising (CDN) groups — multiple noised-up anchors per GT object — getting more out of each training iteration.
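A hedged sketch of what one CDN group might look like, with noise magnitudes below λ1 for positives and between λ1 and λ2 for negatives. The jitter function and hyperparameter values are illustrative assumptions, not DINO’s exact code:

```python
import torch

def make_cdn_group(gt_boxes: torch.Tensor, lambda1: float = 0.5, lambda2: float = 1.0):
    """One contrastive denoising group: a positive and a negative noised
    copy per GT box. Simplified relative to DINO's actual scheme."""
    def jitter(boxes, lo, hi):
        mag = lo + torch.rand_like(boxes) * (hi - lo)     # noise magnitude in [lo, hi)
        sign = torch.randint_like(boxes, 0, 2) * 2 - 1    # random sign per coordinate
        out = boxes.clone()
        out[:, :2] += sign[:, :2] * mag[:, :2] * boxes[:, 2:]
        out[:, 2:] *= 1 + sign[:, 2:] * mag[:, 2:]
        return out.clamp(0, 1)

    pos = jitter(gt_boxes, 0.0, lambda1)      # near the GT: regress to the GT box
    neg = jitter(gt_boxes, lambda1, lambda2)  # guaranteed farther: predict "no object"
    return pos, neg

gt_boxes = torch.rand(5, 4)
# Several CDN groups per image squeeze more supervision out of each iteration.
groups = [make_cdn_group(gt_boxes) for _ in range(4)]
```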

DINO’s authors report an AP of 49% (on COCO val2017) when using CDN.

Recent temporal models that need to keep track of objects from frame to frame, like Sparse4Dv3, use CDN and add temporal denoising groups, in which some of the successful DN anchors are stored (along with the learned, non-DN anchors) for use in later frames, enhancing the model’s performance in object tracking.
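The mechanics might look roughly like the following toy sketch (my own simplification, not Sparse4Dv3’s actual implementation): keep the most confident anchors from the previous frame and prepend them to the next frame’s query set.

```python
import torch

class AnchorBank:
    """Toy temporal memory: carry the top-k most confident anchors across
    frames. A rough sketch of the idea only, not Sparse4Dv3's design."""
    def __init__(self, k: int = 50):
        self.k = k
        self.stored = torch.empty(0, 4)

    def update(self, anchors: torch.Tensor, scores: torch.Tensor) -> None:
        top = scores.topk(min(self.k, scores.numel())).indices
        self.stored = anchors[top].detach()   # keep this frame's best anchors

    def queries_for_next_frame(self, fresh_anchors: torch.Tensor) -> torch.Tensor:
        # Propagated anchors preserve object identity across frames;
        # fresh anchors remain available to catch newly appearing objects.
        return torch.cat([self.stored, fresh_anchors], dim=0)
```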

Figure 4. Denoising Illustrated. A snapshot of the training process. Green boxes are the current anchors (either learned from previous images or fixed). The blue box is a ground truth (GT) box of a bird object. The yellow box is a positive example generated by adding noise to the GT box (which changes both position and dimensions). The red box is a negative example, guaranteed to be farther away (in the x, y, w, h space) from the GT than the positive example. Source: The author.

Discussion

Denoising (DN) seems to improve the convergence speed and final performance of vision transformer detectors. But examining the evolution of the methods described above raises the following questions:

  1. DN improves models that use learnable anchors. But are learnable anchors really so important? Would DN also improve models that use non-learnable anchors?
  2. The main contribution of DN to the training is in adding stability to the gradient descent process by bypassing the bipartite matching. But the bipartite matching seems to be there mainly because the standard in transformer works is to avoid spatial constraints on queries. So, if we manually constrained queries to specific image locations and gave up bipartite matching (or used a simplified version of bipartite matching that runs on each image patch separately) — would DN still improve results?

I couldn’t find works that provide clear answers to these questions. My hypothesis is that a model that uses non-learnable anchors (provided the anchors are not too sparse) and spatially constrained queries (1) would not require a bipartite matching algorithm, and (2) would not benefit from DN in training, since the anchors are already known and there is no profit in learning to regress from other, evanescent anchors.

If the anchors are fixed but sparse, then I can see how evanescent anchors that are easier to regress from can provide a warm start to the training process.

Anchor-DETR (Wang et al., 2021) compares the spatial distribution of learnable and non-learnable anchors, and the performance of the respective models; in my opinion, learnability does not add that much value to the model’s performance. Notably, they use the Hungarian algorithm in both settings, so it’s unclear whether they could give up the bipartite matching and retain the performance.

One consideration is that there may be production reasons to avoid NMS at inference, which encourages the use of the Hungarian algorithm in training.

Where can denoising really be significant? In my opinion — in tracking. In tracking, the model is fed a video stream and is required not only to detect multiple objects across successive frames, but also to preserve the unique identity of each detected object. Temporal transformer models, i.e. models that utilize the sequential nature of the video stream, don’t process individual frames independently. Instead, they maintain a bank that stores previous detections. During training, the tracking model is encouraged to regress from an object’s previous detection (or more precisely, from the anchor attached to the object’s previous detection) rather than from simply the nearest anchor. And since the previous detection is not constrained to some fixed anchor grid, it is plausible that the flexibility DN induces is beneficial. I would very much like to read future works that address these issues.

That’s it for denoising and its contribution to vision transformers! If you liked my article, you’re welcome to visit some of my other articles on deep learning, machine learning, and computer vision!
