Large language models as first-pass filters for corpus annotation: semantic disambiguation of Galician pobo

Míguez, Vítor

doi:10.1515/opli-2025-0078

dc.contributor.affiliation	Universidade de Santiago de Compostela. Instituto da Lingua Galega (ILG)
dc.contributor.affiliation	Universidade de Santiago de Compostela. Departamento de Filoloxía Galega
dc.contributor.author	Míguez, Vítor
dc.date.accessioned	2026-02-19T07:12:04Z
dc.date.available	2026-02-19T07:12:04Z
dc.date.issued	2026-02-09
dc.description.abstract	This paper demonstrates the use of LLMs as first-pass filters in corpus annotation, with a focus on semantic disambiguation, a task more challenging than form-based classification because it depends on context. Using the polysemous Galician noun pobo ‘people/village’ as a case study, the paper shows that LLM-assisted annotation is applicable to low-resource languages. A total of 300 examples were annotated by three human coders and four LLMs (Claude 4 Sonnet, Claude 4 Opus, Claude 4.5 Sonnet, and Claude 4.5 Opus) using a static, single-phase prompting approach. Since first-pass filters should capture as many actual occurrences of the target phenomenon as possible, recall was prioritized over precision. Accordingly, the paper argues for F2, a recall-focused metric, over commonly used alternatives such as F1 or MCC for validating LLM performance in filtering tasks. Claude 4.5 Opus with pretraining achieved the best performance against the human consensus (F2 = 0.944, recall = 100%), resulting in a substantial workload reduction with no information loss. The study concludes that LLMs can serve as effective first-pass filters for semantic annotation in corpus linguistics, including for low-resource languages.
dc.description.peerreviewed	Yes
dc.description.sponsorship	This work was funded by the Spanish Ministry of Science, Innovation and Universities (MICIU) and the State Research Agency (AEI) under grant PID2022-137170OB-I00 (10.13039/501100011033), and by the European Regional Development Fund (ERDF/EU).
dc.identifier.citation	Míguez-Rego, Vítor (2026). Large language models as first-pass filters for corpus annotation: semantic disambiguation of Galician pobo. Open Linguistics, 12(1). https://doi.org/10.1515/opli-2025-0078
dc.identifier.doi	10.1515/opli-2025-0078
dc.identifier.essn	2300-9969
dc.identifier.uri	https://hdl.handle.net/10347/45973
dc.issue.number	1
dc.journal.title	Open Linguistics
dc.language.iso	eng
dc.page.final	16
dc.page.initial	1
dc.publisher	De Gruyter
dc.relation.projectID	info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2021-2023/PID2022-137170OB-I00/ES/ETIQUETADOR SEMANTICO MULTILINGUE AUTOMATICO Y SOSTENIBLE
dc.relation.publisherversion	https://doi.org/10.1515/opli-2025-0078/html
dc.rights	© 2026 the author(s), published by De Gruyter. This work is licensed under the Creative Commons Attribution 4.0 International License.
dc.rights	Attribution 4.0 International	en
dc.rights.accessRights	open access
dc.rights.uri	http://creativecommons.org/licenses/by/4.0/
dc.subject	Corpus linguistics
dc.subject	Corpus annotation
dc.subject	Semantic disambiguation
dc.subject	Large language models
dc.subject	Galician
dc.subject.classification	570104 Computational linguistics
dc.title	Large language models as first-pass filters for corpus annotation: semantic disambiguation of Galician pobo
dc.type	journal article
dc.type.hasVersion	VoR
dc.volume.number	12
dspace.entity.type	Publication
relation.isAuthorOfPublication	57a03e10-d9a2-4a43-b76a-459649b0ceaf
relation.isAuthorOfPublication.latestForDiscovery	57a03e10-d9a2-4a43-b76a-459649b0ceaf
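
The abstract argues for F2 over F1 when validating a first-pass filter. As a minimal sketch of why, the snippet below computes the standard F-beta score from confusion counts. The function name `f_beta` and the two sets of counts are hypothetical illustrations, not data or code from the paper.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 2.0) -> float:
    """Standard F-beta score; beta > 1 weights recall more than precision."""
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts (not from the paper): two filters with mirrored errors.
# Filter A misses nothing but lets 2 false positives through.
# Filter B returns no false positives but misses 2 real occurrences.
filter_a = dict(tp=8, fp=2, fn=0)
filter_b = dict(tp=8, fp=0, fn=2)

for name, counts in (("A", filter_a), ("B", filter_b)):
    f1 = f_beta(**counts, beta=1.0)
    f2 = f_beta(**counts, beta=2.0)
    print(f"filter {name}: F1={f1:.3f}  F2={f2:.3f}")
```

Under these assumed counts, F1 rates the two filters identically, while F2 ranks the filter that misses nothing above the one that drops real occurrences, which is the property a first-pass filter needs.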

Files

Original bundle

Name: 2026_open_miguez_large.pdf
Size: 599.28 KB
Format: Adobe Portable Document Format

Collections

Filoloxía Galega
Instituto da Lingua Galega (ILG)