[Multi-object tracking] TrackFormer took three days to translate a single sentence!!!

TrackFormer: Multi-Object Tracking with Transformers

Abstract

The challenging task of multi-object tracking (MOT) requires simultaneous reasoning about track initialization, identity, and spatio-temporal trajectories. We formulate this task as a frame-to-frame set prediction problem and introduce TrackFormer, an end-to-end trainable MOT approach based on an encoder-decoder Transformer architecture. Our model achieves data association between frames via attention by evolving a set of track predictions through a video sequence. The Transformer decoder initializes new tracks from static object queries and autoregressively follows existing tracks in space and time with the conceptually new and identity-preserving track queries. Both query types benefit from self- and encoder-decoder attention on global frame-level features, thereby omitting any additional graph optimization or modeling of motion and/or appearance. TrackFormer introduces a new tracking-by-attention paradigm and, while simple in its design, is able to achieve state-of-the-art performance on the task of multi-object tracking (MOT17 and MOT20) and segmentation (MOTS20).

1. Introduction

Humans need to focus their attention to track objects in space and time, for example, when playing a game of tennis, golf, or pong. This challenge is only increased when tracking not one, but multiple objects, in crowded and real-world scenarios. Following this analogy, we demonstrate the effectiveness of Transformer [51] attention for the task of multi-object tracking (MOT) in videos.

The goal in MOT is to follow the trajectories of a set of objects, e.g., pedestrians, while keeping their identities discriminated as they are moving throughout a video sequence. Due to the advances in image-level object detection [7, 39], most approaches follow the two-step tracking-by-detection paradigm: (i) detecting objects in individual video frames and (ii) associating sets of detections between frames, thereby creating individual object tracks over time. Traditional tracking-by-detection methods associate detections via temporally sparse [23, 26] or dense [19, 22] graph optimization, or apply convolutional neural networks to predict matching scores between detections [8, 24].

Recent works [4, 6, 29, 69] suggest a variation of the traditional paradigm, coined tracking-by-regression [12]. In this approach, the object detector not only provides frame-wise detections, but replaces the data association step with a continuous regression of each track to the changing position of its object. These approaches achieve track association implicitly, but provide top performance only by relying either on additional graph optimization [6, 29] or motion and appearance models [4]. This is largely due to the isolated and local bounding box regression which lacks any notion of object identity or global communication between tracks.

We present a first straightforward instantiation of tracking-by-attention, TrackFormer, an end-to-end trainable Transformer [51] encoder-decoder architecture. It encodes frame-level features from a convolutional neural network (CNN) [18] and decodes queries into bounding boxes associated with identities. The data association is performed through the novel and simple concept of track queries. Each query represents an object and follows it in space and time over the course of a video sequence in an autoregressive fashion. New objects entering the scene are detected by static object queries as in [7, 71] and subsequently transform to future track queries. At each frame, the encoder-decoder computes attention between the input image features and the track as well as object queries, and outputs bounding boxes with assigned identities. Thereby, TrackFormer performs tracking-by-attention and achieves detection and data association jointly without relying on any additional track matching, graph optimization, or explicit modeling of motion and/or appearance. In contrast to tracking-by-detection/regression, our approach detects and associates tracks simultaneously in a single step via attention (and not regression). TrackFormer extends the recently proposed set prediction objective for object detection [7, 48, 71] to multi-object tracking.

Summary: TrackFormer relies purely on attention for detection and data association; it does not depend on common matching machinery such as the Hungarian algorithm or Kalman filtering.
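For contrast, here is a minimal sketch of the frame-wise association step that tracking-by-detection pipelines perform and that TrackFormer removes: matching on an IoU cost, with a greedy stand-in for the Hungarian algorithm. The function names, the threshold, and the `[x1, y1, x2, y2]` box format are illustrative assumptions, not code from the paper.

```python
def iou(a, b):
    # Intersection-over-union of two boxes in [x1, y1, x2, y2] format.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def associate(tracks, detections, iou_min=0.5):
    # Greedy IoU matching: repeatedly take the highest-IoU unmatched pair.
    # Real trackers typically solve this optimally with the Hungarian
    # algorithm; the greedy version keeps the sketch dependency-free.
    pairs = sorted(
        ((iou(t_box, d_box), ti, di)
         for ti, t_box in enumerate(tracks)
         for di, d_box in enumerate(detections)),
        reverse=True)
    matches, used_t, used_d = {}, set(), set()
    for score, ti, di in pairs:
        if score >= iou_min and ti not in used_t and di not in used_d:
            matches[ti] = di
            used_t.add(ti)
            used_d.add(di)
    return matches  # track index -> detection index
```

TrackFormer replaces exactly this explicit matching step: the assignment is instead produced implicitly by attention in the decoder.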

We evaluate TrackFormer on the MOT17 [30] and MOT20 [13] benchmarks where it achieves state-of-the-art performance for public and private detections. Furthermore, we demonstrate the extension with a mask prediction head and show state-of-the-art results on the Multi-Object Tracking and Segmentation (MOTS20) challenge [52]. We hope this simple yet powerful baseline will inspire researchers to explore the potential of the tracking-by-attention paradigm. In summary, we make the following contributions:

An end-to-end trainable multi-object tracking approach which achieves detection and data association in a new tracking-by-attention paradigm.

The concept of autoregressive track queries which embed an object's spatial position and identity, thereby tracking it in space and time.

New state-of-the-art results on three challenging multi-object tracking (MOT17 and MOT20) and segmentation (MOTS20) benchmarks.

2. Related work

Tracking-by-regression:

This line of work refrains from associating detections between frames but instead accomplishes tracking by regressing past object locations to their new positions in the current frame. Previous efforts [4, 15] use regression heads on region-pooled object features. In [69], objects are represented as center points which allow for an association by a distance-based greedy matching algorithm. To overcome their lacking notion of object identity and global track reasoning, additional re-identification and motion models [4], as well as traditional [29] and learned [6] graph methods, have been necessary to achieve top performance.

Tracking-by-segmentation:

This line of work not only predicts object masks but leverages the pixel-level information to mitigate issues with crowdedness and ambiguous backgrounds. Prior attempts used category-agnostic image segmentation [31], applied Mask R-CNN [17] with 3D convolutions [52], mask pooling layers [38], or represented objects as unordered point clouds [58] and cost volumes [57]. However, the scarcity of annotated MOT segmentation data makes modern approaches still rely on bounding boxes.

In contrast, TrackFormer casts the entire tracking objective into a single set prediction problem, applying attention not only for the association step. It jointly reasons about track initialization, identity, and spatio-temporal trajectories. We only rely on feature-level attention and avoid additional graph optimization and appearance/motion models.

3. TrackFormer

We present TrackFormer, an end-to-end trainable multi-object tracking (MOT) approach based on an encoder-decoder Transformer [51] architecture. This section describes how we cast MOT as a set prediction problem and introduce the new tracking-by-attention paradigm. Furthermore, we explain the concept of track queries and their application for frame-to-frame data association.

3.1 MOT as a set prediction problem
Given a video sequence with $K$ individual object identities, MOT describes the task of generating ordered tracks $T_k = (b^k_{t_1}, b^k_{t_2}, \dots)$ with bounding boxes $b_t$ and track identities $k$. The subset $(t_1, t_2, \dots)$ of total frames $T$ indicates the time span between an object entering and leaving the scene. This includes all frames for which an object is occluded by either the background or other objects.

In order to cast MOT as a set prediction problem, we leverage an encoder-decoder Transformer architecture. Our model performs online tracking and yields per-frame object bounding boxes and class predictions associated with identities in four consecutive steps:

  • Frame-level feature extraction with a common CNN backbone, e.g., ResNet-50 [18]
  • Encoding of frame features with self-attention in a Transformer encoder
  • Decoding of queries with self- and encoder-decoder attention in a Transformer decoder
  • Mapping of queries to bounding box and class predictions using multilayer perceptrons (MLPs)
Objects are implicitly represented in the decoder queries, which are embeddings used by the decoder to output bounding box coordinates and class predictions. The decoder alternates between two types of attention: (i) self-attention over all queries, which allows for joint reasoning about the objects in a scene, and (ii) encoder-decoder attention, which gives queries global access to the visual information of the encoded features. The output embeddings accumulate bounding box and class information over multiple decoding layers. The permutation invariance of Transformers requires additive feature and object encodings for the frame features and decoder queries, respectively.
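The four-step pipeline above can be sketched at the level of tensor shapes. Every module is stubbed out; `D`, the downsampling factor, the function names, and the query counts are hypothetical, not the authors' implementation.

```python
# Shape-level sketch of one TrackFormer frame step (names hypothetical).
D = 256  # assumed embedding dimension

def cnn_backbone(image_hw):
    # Stand-in for ResNet-50: downsample by 32, flatten to feature tokens.
    h, w = image_hw
    return [(0.0,) * D for _ in range((h // 32) * (w // 32))]

def transformer_encoder(tokens):
    # Self-attention preserves the number of tokens and their dimension.
    return tokens

def transformer_decoder(queries, memory):
    # Self- plus encoder-decoder attention: one output embedding per query.
    return [(0.0,) * D for _ in queries]

def mlp_heads(embeddings):
    # Each output embedding is mapped to (class prediction, bounding box).
    return [("class", (0.0, 0.0, 0.0, 0.0)) for _ in embeddings]

def frame_step(image_hw, object_queries, track_queries):
    # Both query types are decoded jointly against the same frame memory.
    memory = transformer_encoder(cnn_backbone(image_hw))
    outputs = transformer_decoder(list(object_queries) + list(track_queries), memory)
    return mlp_heads(outputs)
```

The key structural point is the concatenation in `frame_step`: object and track queries pass through the same decoder, so self-attention can reason over both sets at once.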
3.2 Tracking-by-attention with queries
The total set of output embeddings is initialized with two types of query encodings: (i) static object queries, which allow the model to initialize tracks at any frame of the video, and (ii) autoregressive track queries, which are responsible for tracking objects across frames

The simultaneous decoding of object and track queries allows our model to perform detection and tracking in a unified way, thereby introducing a new tracking-by-attention paradigm. Different tracking-by-X approaches are defined by their key component responsible for track generation. For tracking-by-detection, the tracking is performed by computing/modelling distances between frame-wise object detections. The tracking-by-regression paradigm also performs object detection, but tracks are generated by regressing each object box to its new position in the current frame. Technically, our TrackFormer also performs regression in the mapping of object embeddings with MLPs. However, the actual track association happens earlier via attention in the Transformer decoder. A detailed architecture overview which illustrates the integration of track and object queries into the Transformer decoder is shown in the appendix.

Track initialization

New objects appearing in the scene are detected by a fixed number of output embeddings, each initialized with a static, learned object encoding referred to as an object query [7]. Intuitively, each object query learns to predict objects at certain spatial locations and with certain box sizes. The decoder's self-attention relies on the object encodings to avoid duplicate detections and to reason about spatial and categorical relations of objects. The number of object queries should exceed the maximum number of objects per frame.

Track queries

To achieve frame-to-frame track generation, we introduce the concept of track queries to the decoder. Track queries follow objects through a video sequence, carrying their identity information while adapting to their changing positions in an autoregressive manner.

For this purpose, each new object detection initializes a track query with the corresponding output embedding of the previous frame. The Transformer encoder-decoder performs attention on frame features and decoder queries, continuously updating the instance-specific representation of an object's identity and location in each track query embedding. Self-attention over the joint set of both query types allows for the detection of new objects while simultaneously avoiding re-detection of already tracked objects.
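A minimal sketch of the per-frame query bookkeeping described above. The threshold `sigma_det`, the identity counter, and all names are illustrative assumptions; real track queries are learned embeddings, represented here by opaque values.

```python
def step_queries(track_queries, detections, sigma_det=0.5):
    """One frame of track-query bookkeeping (illustrative, not the paper's code).

    track_queries: list of (identity, embedding) carried over from frame t-1.
    detections: per object-query output for frame t, as (score, embedding).
    Returns the track queries to decode at frame t+1.
    """
    # Existing tracks follow their objects autoregressively: their
    # (updated) embeddings are simply carried to the next frame.
    next_queries = list(track_queries)
    next_id = max((i for i, _ in track_queries), default=-1) + 1
    for score, emb in detections:
        if score >= sigma_det:
            # A confident new detection's output embedding initializes
            # a track query with a fresh identity.
            next_queries.append((next_id, emb))
            next_id += 1
    return next_queries
```

Self-attention over the joint query set is what prevents an object query from re-detecting an object that a track query already covers; this sketch only shows the identity plumbing around it.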

The ability to decode an arbitrary number of track queries allows for an attention-based short-term re-identification process. We keep decoding the queries of previously removed tracks for a maximum number of frames. During this patience window, track queries are considered inactive and do not contribute to the trajectory until a classification score higher than σ_track triggers a re-identification. The spatial information embedded in each track query prevents the application of this mechanism to long-term occlusions with large object movement.
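The patience-window logic can be sketched as follows. The values of `sigma_track` and `patience`, and all names, are assumptions for illustration, not hyperparameters from the paper.

```python
def update_track_state(tracks, scores, sigma_track=0.4, patience=5):
    """One frame of active/inactive bookkeeping (illustrative sketch).

    tracks: identity -> frames spent inactive so far (0 means active).
    scores: identity -> classification score its track query produced
    at the current frame.
    """
    out = {}
    for tid, inactive in tracks.items():
        if scores.get(tid, 0.0) >= sigma_track:
            out[tid] = 0             # active, or re-identified in the window
        elif inactive + 1 <= patience:
            out[tid] = inactive + 1  # keep decoding, but output no box
        # else: drop the query; long-term occlusion is out of scope
    return out
```

An inactive query is still decoded every frame, which is what makes the re-identification attention-based rather than a separate matching stage.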

3.3 TrackFormer training
For track queries to work in interaction with object queries and follow objects to the next frame, TrackFormer requires dedicated frame-to-frame tracking training. As indicated in Figure 2, we train on two adjacent frames and optimize the entire MOT objective at once. The loss for frame $t$ measures the set prediction of all output embeddings $N = N_{\text{object}} + N_{\text{track}}$ with respect to the ground-truth objects in terms of class and bounding box prediction.

Summary: the set prediction loss is assigned in two steps: tracking + detection.
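A sketch of this two-step target assignment: track queries are supervised by the ground-truth object whose identity they already carry, and the remaining (new) ground-truth objects are assigned to object queries. The paper uses Hungarian matching for the second step; a first-fit assignment stands in for it here, and all names are hypothetical.

```python
def match_targets(track_ids, num_object_queries, gt_boxes):
    """Assign ground-truth objects to the N = N_object + N_track outputs.

    track_ids: identity carried by each track query (list, index = slot).
    gt_boxes: dict mapping ground-truth identity -> box at frame t.
    Returns (track_targets, object_targets), each: query slot -> identity.
    """
    # Step 1 (tracking): a track query is matched to the ground-truth
    # object with its carried identity, if that object is still visible.
    track_targets = {slot: k for slot, k in enumerate(track_ids) if k in gt_boxes}
    # Step 2 (detection): identities not covered by any track query go to
    # object queries. The paper solves this with bipartite (Hungarian)
    # matching over class and box costs; first-fit is used for brevity.
    new_ids = [k for k in gt_boxes if k not in set(track_ids)]
    object_targets = dict(zip(range(num_object_queries), new_ids))
    return track_targets, object_targets
```

Unmatched slots of either type are supervised toward the background class, which is how the model learns both to drop vanished tracks and to keep unused object queries silent.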

4. Experiments
