Large language models as first-pass filters for corpus annotation: semantic disambiguation of Galician pobo

dc.contributor.affiliation: Universidade de Santiago de Compostela. Instituto da Lingua Galega (ILG)
dc.contributor.affiliation: Universidade de Santiago de Compostela. Departamento de Filoloxía Galega
dc.contributor.author: Míguez, Vítor
dc.date.accessioned: 2026-02-19T07:12:04Z
dc.date.available: 2026-02-19T07:12:04Z
dc.date.issued: 2026-02-09
dc.description.abstract: This paper demonstrates the use of LLMs as first-pass filters in corpus annotation, with a focus on semantic disambiguation – a task more challenging than form-based classification due to its context-dependence. Using as a case study the polysemous Galician noun pobo ‘people/village’, the study demonstrates the applicability of LLM-assisted annotation to low-resource languages. 300 examples were annotated by three human coders and four LLMs (Claude 4 Sonnet, Claude 4 Opus, Claude 4.5 Sonnet, and Claude 4.5 Opus) using a static, single-phase prompting approach. Since first-pass filters should capture as many actual occurrences of the target phenomenon as possible, priority was given to recall over precision. Accordingly, the paper argues for F2, a recall-focused metric, over commonly used alternatives like F1 or MCC for validating LLM performance in filtering tasks. Claude 4.5 Opus with pretraining achieved the best performance against the human consensus (F2 = 0.944, recall = 100%), resulting in substantial workload reduction with no information loss. The study demonstrates that LLMs can serve as effective first-pass filters for semantic annotation in corpus linguistics, extending their applicability to low-resource languages.
dc.description.peerreviewed: SI
dc.description.sponsorship: This work was funded by the Spanish Ministry of Science, Innovation and Universities (MICIU) and the State Research Agency (AEI) under grant PID2022-137170OB-I00 (10.13039/501100011033), and by the European Regional Development Fund (ERDF/EU).
dc.identifier.citation: Míguez-Rego, Vítor (2026). Large language models as first-pass filters for corpus annotation: semantic disambiguation of Galician pobo. Open Linguistics, 12(1). https://doi.org/10.1515/opli-2025-0078
dc.identifier.doi: 10.1515/opli-2025-0078
dc.identifier.essn: 2300-9969
dc.identifier.uri: https://hdl.handle.net/10347/45973
dc.issue.number: 1
dc.journal.title: Open Linguistics
dc.language.iso: eng
dc.page.final: 16
dc.page.initial: 1
dc.publisher: De Gruyter
dc.relation.projectID: info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2021-2023/PID2022-137170OB-I00/ES/ETIQUETADOR SEMANTICO MULTILINGUE AUTOMATICO Y SOSTENIBLE
dc.relation.publisherversion: https://doi.org/10.1515/opli-2025-0078/html
dc.rights: © 2026 the author(s), published by De Gruyter. This work is licensed under the Creative Commons Attribution 4.0 International License.
dc.rights: Attribution 4.0 International
dc.rights.accessRights: open access
dc.rights.uri: http://creativecommons.org/licenses/by/4.0/
dc.subject: Corpus linguistics
dc.subject: Corpus annotation
dc.subject: Semantic disambiguation
dc.subject: Large language models
dc.subject: Galician
dc.subject.classification: 570104 Lingüística informatizada
dc.title: Large language models as first-pass filters for corpus annotation: semantic disambiguation of Galician pobo
dc.type: journal article
dc.type.hasVersion: VoR
dc.volume.number: 12
dspace.entity.type: Publication
relation.isAuthorOfPublication: 57a03e10-d9a2-4a43-b76a-459649b0ceaf
relation.isAuthorOfPublication.latestForDiscovery: 57a03e10-d9a2-4a43-b76a-459649b0ceaf
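
Note on the metric named in the abstract: F2 is the F-beta score with beta = 2, which weights recall more heavily than precision. The sketch below is a minimal illustration of that weighting, not the paper's evaluation code. The function name `f_beta` and all counts are hypothetical and chosen only to show the effect.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 2.0) -> float:
    """F-beta from confusion counts; beta > 1 weights recall above precision."""
    b2 = beta ** 2
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


# Hypothetical filter outputs (illustrative counts, NOT results from the paper).
high_recall = {"tp": 50, "fp": 20, "fn": 0}      # misses nothing, some false alarms
high_precision = {"tp": 30, "fp": 0, "fn": 20}   # no false alarms, misses 20 items

for name, counts in [("high recall", high_recall), ("high precision", high_precision)]:
    f1 = f_beta(**counts, beta=1.0)
    f2 = f_beta(**counts, beta=2.0)
    print(f"{name}: F1={f1:.3f} F2={f2:.3f}")
```

With these made-up counts, moving from F1 to F2 raises the score of the filter that misses nothing and lowers the score of the one that drops true occurrences, which is the behaviour a first-pass filter is meant to reward.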

Files

Original bundle

Name: 2026_open_miguez_large.pdf
Size: 599.28 KB
Format: Adobe Portable Document Format