NeurIPS 2025 Track on Datasets and Benchmarks
tl;dr We present ITTO, a benchmark suite for evaluating and diagnosing point tracking methods on complex, long-range motions.
We introduce ITTO, a challenging new benchmark suite for evaluating and diagnosing the capabilities and limitations of point tracking methods. Our videos are sourced from existing datasets and egocentric real-world recordings, with high-quality human annotations collected through a multi-stage pipeline. ITTO captures the motion complexity, occlusion patterns, and object diversity characteristic of real-world scenes -- factors that are largely absent from current benchmarks. We conduct a rigorous analysis of state-of-the-art tracking methods on ITTO, breaking down performance along key axes of motion complexity. Our findings reveal that existing trackers struggle with these challenges, particularly with re-identifying points after occlusion, highlighting critical failure modes. These results point to the need for new modeling approaches tailored to real-world dynamics. We envision ITTO as a foundational testbed for advancing point tracking and guiding the development of more robust tracking algorithms.
Grounded in real-world scenes, ITTO features multiple objects undergoing diverse motions, including repeated appearance and disappearance, rapid position changes, and non-rigid motion. Quantitative comparison shows that ITTO covers the diversity and nuance of the real world and is significantly more challenging than all existing benchmarks, with 14× as many per-track reappearances, 4.5× as many moving points, and 2× as many tracked objects per video.
| Dataset | Real? | Static Points | Reappearances per Track | Objects per Video | Track Duration (frames) | Occlusion Rate |
|---|---|---|---|---|---|---|
| TAP-Vid DAVIS | ✓ | 82.3% | 0.10 | 3.3 | 44.0 | 31.0% |
| TAP-Vid Kinetics | ✓ | 88.1% | 0.27 | 5.1 | 146.0 | 40.6% |
| TAP-Vid RGB Stacking | ✗ | 94.2% | 0.08 | — | 167.0 | 35.0% |
| Dynamic Replica | ✗ | 99.0% | 1.02 | — | 215.4 | 28.0% |
| ITTO | ✓ | 30.7% | 5.86 | 5.86 | 221.6 | 58.1% |
To better understand how individual models perform on ITTO and how robust they are to different aspects of motion complexity, we partition ITTO tracks along two axes: track motion and reappearance frequency. These partitions yield a quantitative measure of a model's robustness to each aspect of motion difficulty and expose potential breaking points for real-world deployment.
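The partitioning above can be sketched as follows. This is a minimal illustration, not the paper's exact procedure: the function name, the path-length definition of track motion, and the bucket thresholds are our assumptions.

```python
import numpy as np

def partition_track(xy, visible, motion_thresh=50.0, reapp_thresh=2):
    """Bucket a single track along two axes of motion difficulty.

    xy:      (T, 2) per-frame point positions in pixels
    visible: (T,) boolean visibility flags
    Returns ('low'|'high' motion, 'low'|'high' reappearance) labels.
    """
    # Axis 1 -- track motion: total path length over visible frames.
    vis_xy = xy[visible]
    motion = float(np.linalg.norm(np.diff(vis_xy, axis=0), axis=1).sum())

    # Axis 2 -- reappearance frequency: occluded -> visible transitions.
    trans = np.diff(visible.astype(int))
    reapp = int((trans == 1).sum())
    if not visible[0] and reapp > 0:
        reapp -= 1  # the first 0 -> 1 transition is the initial appearance

    motion_bucket = "high" if motion > motion_thresh else "low"
    reapp_bucket = "high" if reapp >= reapp_thresh else "low"
    return motion_bucket, reapp_bucket
```

Counting only occluded-to-visible transitions (and discounting the initial appearance) matches the intuition that a "reappearance" requires the point to have been visible, lost, and then found again.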
All trackers struggle on tracks with high motion:
All trackers also struggle on tracks with frequent reappearances:
How well can models track through the motion difficulties of ITTO videos? We examine tracking failure as a function of frame index (time). We define the track failure rate as the percentage of tracks whose predictions deviate from ground truth by more than 2, 4, or 6 pixels. We find that tracks degrade quickly for frames beyond a model's native temporal window and remain broken for the rest of the video. Our experiments suggest that this error drift is irrecoverable: if a point stays occluded beyond the model's window, the model fails to recover the track upon reappearance. One exception to this behavior is LocoTrack, whose more sophisticated track initialization, based on correlations across all frames of the video, may explain its ability to recover from errors:
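The failure-rate-vs-time metric above can be sketched as follows; this is our reading of the definition (function and argument names are assumptions, not the paper's API), computing the fraction of tracks per frame whose pixel error exceeds each threshold.

```python
import numpy as np

def failure_rate_per_frame(pred, gt, valid, thresholds=(2, 4, 6)):
    """Per-frame fraction of tracks deviating from ground truth.

    pred, gt: (N, T, 2) predicted / ground-truth positions in pixels
    valid:    (N, T) boolean, True where ground truth is annotated
    Returns a dict {threshold: (T,) array of failure rates}.
    """
    err = np.linalg.norm(pred - gt, axis=-1)      # (N, T) pixel error
    rates = {}
    for th in thresholds:
        fail = (err > th) & valid                 # failed and annotated
        rates[th] = fail.sum(axis=0) / np.maximum(valid.sum(axis=0), 1)
    return rates
```

Plotting each `rates[th]` curve against frame index reproduces the failure-over-time view described above: a track that drifts past the threshold contributes to every subsequent frame it remains wrong in.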
@inproceedings{Demler_2025_Neurips,
  author    = {Demler, Ilona and Chauhan, Saumya and Gkioxari, Georgia},
  title     = {Is This Tracker On? A Benchmark Protocol for Dynamic Tracking},
  booktitle = {39th Conference on Neural Information Processing Systems (NeurIPS 2025) Track on Datasets and Benchmarks},
  month     = {December},
  year      = {2025},
  pages     = {19446--19455}
}