Author: Míguez, Vítor
Dates: 2026-02-19; 2026-02-19; 2026-02-09
Citation: Míguez-Rego, Vítor (2026). Large language models as first-pass filters for corpus annotation: semantic disambiguation of Galician pobo. Open Linguistics, 12(1). https://doi.org/10.1515/opli-2025-0078
Handle: https://hdl.handle.net/10347/45973

Abstract: This paper demonstrates the use of LLMs as first-pass filters in corpus annotation, with a focus on semantic disambiguation, a task more challenging than form-based classification due to its context-dependence. Using as a case study the polysemous Galician noun pobo ‘people/village’, the study demonstrates the applicability of LLM-assisted annotation to low-resource languages. 300 examples were annotated by three human coders and four LLMs (Claude 4 Sonnet, Claude 4 Opus, Claude 4.5 Sonnet, and Claude 4.5 Opus) using a static, single-phase prompting approach. Since first-pass filters should capture as many actual occurrences of the target phenomenon as possible, priority was given to recall over precision. Accordingly, the paper argues for F2, a recall-focused metric, over commonly used alternatives like F1 or MCC for validating LLM performance in filtering tasks. Claude 4.5 Opus with pretraining achieved the best performance against the human consensus (F2 = 0.944, recall = 100%), resulting in substantial workload reduction with no information loss. The study demonstrates that LLMs can serve as effective first-pass filters for semantic annotation in corpus linguistics, extending their applicability to low-resource languages.

Language: eng
Rights: © 2026 the author(s), published by De Gruyter. This work is licensed under the Creative Commons Attribution 4.0 International License.
License: Attribution 4.0 International, http://creativecommons.org/licenses/by/4.0/
Keywords: Corpus linguistics; Corpus annotation; Semantic disambiguation; Large language models; Galician
Subject classification: 570104 Lingüística informatizada (computational linguistics)
Title: Large language models as first-pass filters for corpus annotation: semantic disambiguation of Galician pobo
Type: journal article
DOI: 10.1515/opli-2025-0078
ISSN: 2300-9969
Access: open access
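
The abstract's preference for F2 over F1 in filtering tasks can be illustrated with the standard F-beta formula, in which beta = 2 weights recall more heavily than precision. The following is a minimal sketch, not the paper's evaluation code: the function name `f_beta` and the counts used are hypothetical and chosen only for illustration.

```python
def f_beta(precision: float, recall: float, beta: float = 2.0) -> float:
    """Standard F-beta score: weighted harmonic mean of precision and recall.

    beta > 1 favours recall; beta = 1 gives the usual F1.
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when nothing is retrieved correctly
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical confusion counts for a first-pass filter (not from the paper).
tp, fp, fn = 90, 30, 10

precision = tp / (tp + fp)  # share of flagged examples that are true hits
recall = tp / (tp + fn)     # share of true hits that the filter flagged

f1 = f_beta(precision, recall, beta=1.0)
f2 = f_beta(precision, recall, beta=2.0)

print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f} F2={f2:.3f}")
```

With these illustrative counts, recall exceeds precision, so F2 comes out higher than F1. This matches the filtering rationale in the abstract: a missed occurrence is lost for good, whereas a false positive only costs the annotator some extra review.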