A fine-tuning approach based on spatio-temporal features for few-shot video object detection

Cores Costa, Daniel; Seidenari, Lorenzo; Bimbo, Alberto del; Brea Sánchez, Víctor Manuel; Mucientes Molina, Manuel

doi:10.1016/j.engappai.2025.110198

Files

2025_engappai_cores_finetunning.pdf (2.47 MB)

Identifiers

URI: https://hdl.handle.net/10347/43086

ISSN: 0952-1976

E-ISSN: 1873-6769

DOI: 10.1016/j.engappai.2025.110198

Publication date

2025-04-15

Authors

Cores Costa, Daniel

Seidenari, Lorenzo

Bimbo, Alberto del

Brea Sánchez, Víctor Manuel

Mucientes Molina, Manuel

Publisher

Elsevier

Abstract

This paper describes a new Fine-Tuning approach for Few-Shot object detection in Videos (FTFSVid) that exploits spatio-temporal information to boost detection precision. Despite the progress made in the single-image domain in recent years, the few-shot video object detection problem remains almost unexplored. A few-shot detector must quickly adapt to a new domain with a limited number of annotations per category. With so few annotations, it is not possible to include videos in the training set, which hinders the spatio-temporal learning process. We propose augmenting each training image with synthetic frames to train the spatio-temporal module of our method. This module employs attention mechanisms to mine relationships between proposals across frames, effectively leveraging spatio-temporal information. A spatio-temporal double head then localizes objects in the current frame while classifying them using both context from nearby frames and information from the current frame. Finally, the predicted scores are fed into a long-term object-linking method that generates object tubes across the video. By optimizing the classification score based on these tubes, our approach ensures spatio-temporal consistency. Classification is the primary challenge in few-shot object detection. Our results show that spatio-temporal information helps to mitigate this issue, paving the way for future research in this direction. FTFSVid achieves 41.9 AP50 on the Few-Shot Video Object Detection (FSVOD-500) dataset and 42.9 AP50 on the Few-Shot YouTube Video (FSYTV-40) dataset, surpassing our spatial baseline by 4.3 and 2.5 points, respectively. Additionally, FTFSVid outperforms previous few-shot video object detectors by 3.2 points on FSVOD-500 and 14.5 points on FSYTV-40, setting a new state of the art.
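
The last step in the abstract, linking detections into object tubes and adjusting classification scores per tube, can be illustrated with a minimal sketch. The abstract does not specify the linking or rescoring rule, so everything below is an assumption made for illustration: greedy linking by IoU between consecutive frames, a threshold of 0.5, and replacing each score with the tube mean. The names `Detection`, `link_tubes`, and `rescore` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


@dataclass
class Detection:
    frame: int    # index of the frame this detection belongs to
    box: Box
    score: float  # classification confidence for one category


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def link_tubes(frames: List[List[Detection]], iou_thr: float = 0.5) -> List[List[Detection]]:
    """Greedy linking (an assumption, not the paper's method).

    A tube is extended only if its last detection lies in the previous frame
    and overlaps a still-unmatched detection by at least iou_thr.
    """
    tubes: List[List[Detection]] = []
    for t, dets in enumerate(frames):
        unmatched = list(dets)
        for tube in tubes:
            if tube[-1].frame != t - 1 or not unmatched:
                continue
            best = max(unmatched, key=lambda d: iou(tube[-1].box, d.box))
            if iou(tube[-1].box, best.box) >= iou_thr:
                tube.append(best)
                unmatched.remove(best)
        # Every detection left over starts a new tube.
        tubes.extend([d] for d in unmatched)
    return tubes


def rescore(tubes: List[List[Detection]]) -> None:
    """Replace each detection's score with the mean score of its tube (assumed rule)."""
    for tube in tubes:
        mean = sum(d.score for d in tube) / len(tube)
        for d in tube:
            d.score = mean


if __name__ == "__main__":
    video = [
        [Detection(0, (0, 0, 10, 10), 0.9)],
        [Detection(1, (1, 0, 11, 10), 0.3), Detection(1, (50, 50, 60, 60), 0.2)],
        [Detection(2, (1, 1, 11, 11), 0.9)],
    ]
    tubes = link_tubes(video)
    rescore(tubes)
    for tube in tubes:
        print([(d.frame, round(d.score, 3)) for d in tube])
```

In this toy video, the weak middle-frame detection is pulled up by its confident neighbours in the same tube, which is the kind of spatio-temporal consistency the abstract refers to; the isolated detection keeps its own score.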

Keywords

Few-shot object detection; Video object detection; Few-shot learning

Bibliographic citation

Cores, D., Seidenari, L., Bimbo, A. D., Brea, V. M., & Mucientes, M. (2025). A fine-tuning approach based on spatio-temporal features for few-shot video object detection. Engineering Applications of Artificial Intelligence, 146, 110198. https://doi.org/10.1016/j.engappai.2025.110198

Publisher version

https://doi.org/10.1016/j.engappai.2025.110198

Rights

© 2025 The Authors. Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/)
Attribution 4.0 International

Collections

Electrónica e Computación
Centro de Investigación en Tecnoloxías Intelixentes da USC (CiTIUS)
