NeurIPS 2025 Track on Datasets and Benchmarks
tl;dr We present ITTO, a benchmark suite for evaluating and diagnosing point tracking methods on complex, long-range motions.
We introduce ITTO, a challenging new benchmark suite for evaluating and diagnosing the capabilities and limitations of point tracking methods. Our videos are sourced from existing datasets and egocentric real-world recordings, with high-quality human annotations collected through a multi-stage pipeline. ITTO captures the motion complexity, occlusion patterns, and object diversity characteristic of real-world scenes -- factors that are largely absent from current benchmarks. We conduct a rigorous analysis of state-of-the-art tracking methods on ITTO, breaking down performance along key axes of motion complexity. Our findings reveal that existing trackers struggle with these challenges, particularly with re-identifying points after occlusion, highlighting critical failure modes. These results point to the need for new modeling approaches tailored to real-world dynamics. We envision ITTO as a foundational testbed for advancing point tracking and guiding the development of more robust tracking algorithms.
Grounded in real-world scenes, ITTO features multiple objects undergoing diverse motions, including repeated appearance and disappearance, rapid position changes, and non-rigid motion. Quantitative comparison shows that ITTO covers the diversity and nuance of the real world and is significantly more challenging than all existing benchmarks, with 14× as many per-track reappearances, 4.5× as many moving points, and 2× as many tracked objects per video.
| Dataset | Real? | Static Points | Reappearances per Track | Objects per Video | Track Duration (frames) | Occlusion Rate |
|---|---|---|---|---|---|---|
| TAP-Vid DAVIS | ✓ | 82.3% | 0.10 | 3.3 | 44.0 | 31.0% |
| TAP-Vid Kinetics | ✓ | 88.1% | 0.27 | 5.1 | 146.0 | 40.6% |
| TAP-Vid RGB Stacking | ✗ | 94.2% | 0.08 | — | 167.0 | 35.0% |
| Dynamic Replica | ✗ | 99.0% | 1.02 | — | 215.4 | 28.0% |
| ITTO | ✓ | 30.7% | 5.86 | 5.86 | 221.6 | 58.1% |
To better understand how individual models perform on ITTO and how robust they are to different aspects of motion complexity, we partition ITTO tracks along two axes: track motion and reappearance frequency. These partitions yield a quantitative measure of a model's robustness to each aspect of motion difficulty and expose potential breaking points for real-world deployment.
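The partitioning above can be sketched as follows. This is a minimal illustration, not the paper's exact procedure: the function name, the path-length definition of track motion, and the bucket thresholds are our assumptions.

```python
import numpy as np

def partition_track(xy, visible, motion_thresh=50.0, reapp_thresh=2):
    """Bucket a single track along two axes of motion difficulty.

    xy:      (T, 2) per-frame point positions in pixels
    visible: (T,) boolean visibility flags
    Returns ('low'|'high' motion, 'low'|'high' reappearance) labels.
    """
    # Axis 1 -- track motion: total path length over visible frames.
    vis_xy = xy[visible]
    motion = float(np.linalg.norm(np.diff(vis_xy, axis=0), axis=1).sum())

    # Axis 2 -- reappearance frequency: occluded -> visible transitions.
    trans = np.diff(visible.astype(int))
    reapp = int((trans == 1).sum())
    if not visible[0] and reapp > 0:
        reapp -= 1  # the first 0 -> 1 transition is the initial appearance

    motion_bucket = "high" if motion > motion_thresh else "low"
    reapp_bucket = "high" if reapp >= reapp_thresh else "low"
    return motion_bucket, reapp_bucket
```

Counting only occluded-to-visible transitions (and discounting the initial appearance) matches the intuition that a "reappearance" requires the point to have been visible, lost, and then found again.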
All trackers struggle on tracks with high motion:
All trackers also struggle on tracks with frequent reappearances:
How well can models track through the motion difficulties of ITTO videos? We examine tracking failure as a function of frame index (time). We define the track failure rate as the percentage of tracks whose predictions deviate from ground truth by more than 2, 4, or 6 pixels. We find that tracks degrade quickly for frames beyond a model's native temporal window and remain broken for the rest of the video. Our experiments suggest that this error drift is irrecoverable: if a point stays occluded beyond the model's window, the model fails to recover the track upon reappearance. One exception to this behavior is LocoTrack, whose more sophisticated track initialization, based on correlations across all frames of the video, may explain its ability to recover from errors:
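The failure-rate-vs-time metric above can be sketched as follows; this is our reading of the definition (function and argument names are assumptions, not the paper's API), computing the fraction of tracks per frame whose pixel error exceeds each threshold.

```python
import numpy as np

def failure_rate_per_frame(pred, gt, valid, thresholds=(2, 4, 6)):
    """Per-frame fraction of tracks deviating from ground truth.

    pred, gt: (N, T, 2) predicted / ground-truth positions in pixels
    valid:    (N, T) boolean, True where ground truth is annotated
    Returns a dict {threshold: (T,) array of failure rates}.
    """
    err = np.linalg.norm(pred - gt, axis=-1)      # (N, T) pixel error
    rates = {}
    for th in thresholds:
        fail = (err > th) & valid                 # failed and annotated
        rates[th] = fail.sum(axis=0) / np.maximum(valid.sum(axis=0), 1)
    return rates
```

Plotting each `rates[th]` curve against frame index reproduces the failure-over-time view described above: a track that drifts past the threshold contributes to every subsequent frame it remains wrong in.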
@inproceedings{Demler_2025_Neurips,
  author    = {Demler, Ilona and Chauhan, Saumya and Gkioxari, Georgia},
  title     = {Is This Tracker On? A Benchmark Protocol for Dynamic Tracking},
  booktitle = {39th Conference on Neural Information Processing Systems (NeurIPS 2025) Track on Datasets and Benchmarks},
  month     = {December},
  year      = {2025},
  pages     = {19446--19455}
}