A fine-tuning approach based on spatio-temporal features for few-shot video object detection

Cores Costa, Daniel; Seidenari, Lorenzo; Bimbo, Alberto del; Brea Sánchez, Víctor Manuel; Mucientes Molina, Manuel

doi:10.1016/j.engappai.2025.110198

dc.contributor.affiliation	Universidade de Santiago de Compostela. Departamento de Electrónica e Computación
dc.contributor.affiliation	Universidade de Santiago de Compostela. Centro de Investigación en Tecnoloxías Intelixentes da USC (CiTIUS)
dc.contributor.author	Cores Costa, Daniel
dc.contributor.author	Seidenari, Lorenzo
dc.contributor.author	Bimbo, Alberto del
dc.contributor.author	Brea Sánchez, Víctor Manuel
dc.contributor.author	Mucientes Molina, Manuel
dc.date.accessioned	2025-10-15T07:26:00Z
dc.date.available	2025-10-15T07:26:00Z
dc.date.issued	2025-04-15
dc.description.abstract	This paper describes FTFSVid, a new Fine-Tuning approach for Few-Shot object detection in Videos that exploits spatio-temporal information to boost detection precision. Despite the progress made in the single-image domain in recent years, the few-shot video object detection problem remains almost unexplored. A few-shot detector must quickly adapt to a new domain with a limited number of annotations per category. Therefore, it is not possible to include videos in the training set, which hinders the spatio-temporal learning process. We propose augmenting each training image with synthetic frames to train the spatio-temporal module of our method. This module employs attention mechanisms to mine relationships between proposals across frames, effectively leveraging spatio-temporal information. A spatio-temporal double head then localizes objects in the current frame while classifying them using both context from nearby frames and information from the current frame. Finally, the predicted scores are fed into a long-term object-linking method that generates object tubes across the video. By optimizing the classification score based on these tubes, our approach ensures spatio-temporal consistency. Classification is the primary challenge in few-shot object detection. Our results show that spatio-temporal information helps to mitigate this issue, paving the way for future research in this direction. FTFSVid achieves 41.9 AP50 on the Few-Shot Video Object Detection (FSVOD-500) dataset and 42.9 AP50 on the Few-Shot YouTube Video (FSYTV-40) dataset, surpassing our spatial baseline by 4.3 and 2.5 points, respectively. Additionally, FTFSVid outperforms previous few-shot video object detectors by 3.2 points on FSVOD-500 and 14.5 points on FSYTV-40, setting a new state of the art.
dc.description.peerreviewed	Yes
dc.description.sponsorship	This research was partially funded by the Spanish Ministerio de Ciencia e Innovación (grant number PID2020-112623GB-I00), and the Galician Consellería de Cultura, Educación e Universidade (grant numbers ED431C 2018/29, ED431C 2021/048, ED431G 2019/04). These grants are co-funded by the European Regional Development Fund (ERDF).
dc.identifier.citation	Cores, D., Seidenari, L., Bimbo, A. D., Brea, V. M., & Mucientes, M. (2025). A fine-tuning approach based on spatio-temporal features for few-shot video object detection. Engineering Applications of Artificial Intelligence, 146, 110198. 10.1016/j.engappai.2025.110198
dc.identifier.doi	10.1016/j.engappai.2025.110198
dc.identifier.essn	1873-6769
dc.identifier.issn	0952-1976
dc.identifier.uri	https://hdl.handle.net/10347/43086
dc.journal.title	Engineering Applications of Artificial Intelligence
dc.language.iso	eng
dc.page.final	11
dc.page.initial	1
dc.publisher	Elsevier
dc.relation.projectID	info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2017-2020/PID2020-112623GB-I00/ES/IA RESPONSABLE PARA MINERIA DE PROCESOS 2.0
dc.relation.publisherversion	https://doi.org/10.1016/j.engappai.2025.110198
dc.rights	© 2025 The Authors. Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/)
dc.rights	Attribution 4.0 International
dc.rights.accessRights	open access
dc.rights.uri	http://creativecommons.org/licenses/by/4.0/
dc.subject	Few-shot object detection
dc.subject	Video object detection
dc.subject	Few-shot learning
dc.title	A fine-tuning approach based on spatio-temporal features for few-shot video object detection
dc.type	journal article
dc.type.hasVersion	VoR
dc.volume.number	146
dspace.entity.type	Publication
relation.isAuthorOfPublication	3daa2166-1c2d-4b3d-bbb0-3d0036bd8cf2
relation.isAuthorOfPublication	22d4aeb8-73ba-4743-a84e-9118799ab1f2
relation.isAuthorOfPublication	21112b72-72a3-4a96-bda4-065e7e2bb262
relation.isAuthorOfPublication.latestForDiscovery	3daa2166-1c2d-4b3d-bbb0-3d0036bd8cf2

Files

Original bundle

Name: 2025_engappai_cores_finetunning.pdf
Size: 2.47 MB
Format: Adobe Portable Document Format

Collections

Electrónica e Computación
Centro de Investigación en Tecnoloxías Intelixentes da USC (CiTIUS)