Large language models as first-pass filters for corpus annotation: semantic disambiguation of Galician pobo
Publisher
De Gruyter
Abstract
This paper demonstrates the use of LLMs as first-pass filters in corpus annotation, focusing on semantic disambiguation, a task more challenging than form-based classification because it depends on context. Taking the polysemous Galician noun pobo ‘people/village’ as a case study, it shows that LLM-assisted annotation is applicable to low-resource languages. A total of 300 examples were annotated by three human coders and four LLMs (Claude 4 Sonnet, Claude 4 Opus, Claude 4.5 Sonnet, and Claude 4.5 Opus) using a static, single-phase prompting approach. Since first-pass filters should capture as many actual occurrences of the target phenomenon as possible, recall was prioritized over precision. Accordingly, the paper argues for F2, a recall-focused metric, over commonly used alternatives such as F1 or MCC for validating LLM performance in filtering tasks. Claude 4.5 Opus with pretraining achieved the best performance against the human consensus (F2 = 0.944, recall = 100%), resulting in substantial workload reduction with no information loss. The study concludes that LLMs can serve as effective first-pass filters for semantic annotation in corpus linguistics, extending their applicability to low-resource languages.
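The preference for a recall-focused metric can be illustrated with a minimal sketch. It assumes the standard F-beta and MCC definitions computed from confusion counts. The two filters and their counts are hypothetical, chosen only for illustration, and are not taken from the paper. The function names f_beta and mcc are likewise illustrative.

```python
import math


def f_beta(tp, fp, fn, beta=2.0):
    """F-beta from confusion counts; beta > 1 weights recall above precision."""
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient from confusion counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical counts over 300 examples (NOT results from the paper).
# Filter A misses nothing but lets through some false positives.
filter_a = {"tp": 100, "fp": 40, "fn": 0, "tn": 160}
# Filter B is perfectly precise but misses 20 true occurrences.
filter_b = {"tp": 80, "fp": 0, "fn": 20, "tn": 200}

for name, c in (("A (high recall)", filter_a), ("B (high precision)", filter_b)):
    f1 = f_beta(c["tp"], c["fp"], c["fn"], beta=1.0)
    f2 = f_beta(c["tp"], c["fp"], c["fn"], beta=2.0)
    m = mcc(c["tp"], c["fp"], c["fn"], c["tn"])
    print(f"{name}: F1={f1:.3f}  F2={f2:.3f}  MCC={m:.3f}")
```

With these made-up counts, F1 and MCC both rank filter B higher, while F2 ranks filter A higher. Filter A is the one that loses no true occurrences, which is the property a first-pass filter is meant to have.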
Description
Bibliographic citation
Míguez-Rego, Vítor (2026). Large language models as first-pass filters for corpus annotation: semantic disambiguation of Galician pobo. Open Linguistics, 12(1). https://doi.org/10.1515/opli-2025-0078
Publisher version
https://doi.org/10.1515/opli-2025-0078/html
Sponsors
This work was funded by the Spanish Ministry of Science, Innovation and Universities (MICIU) and the State Research Agency (AEI) under grant PID2022-137170OB-I00 (10.13039/501100011033), and by the European Regional Development Fund (ERDF/EU).
Rights
© 2026 the author(s), published by De Gruyter. This work is licensed under the Creative Commons Attribution 4.0 International License.
Attribution 4.0 International