Large language models as first-pass filters for corpus annotation: semantic disambiguation of Galician pobo
| dc.contributor.affiliation | Universidade de Santiago de Compostela. Instituto da Lingua Galega (ILG) | |
| dc.contributor.affiliation | Universidade de Santiago de Compostela. Departamento de Filoloxía Galega | |
| dc.contributor.author | Míguez, Vítor | |
| dc.date.accessioned | 2026-02-19T07:12:04Z | |
| dc.date.available | 2026-02-19T07:12:04Z | |
| dc.date.issued | 2026-02-09 | |
| dc.description.abstract | This paper demonstrates the use of LLMs as first-pass filters in corpus annotation, with a focus on semantic disambiguation – a task more challenging than form-based classification due to its context-dependence. Using as a case study the polysemous Galician noun pobo ‘people/village’, the study demonstrates the applicability of LLM-assisted annotation to low-resource languages. 300 examples were annotated by three human coders and four LLMs (Claude 4 Sonnet, Claude 4 Opus, Claude 4.5 Sonnet, and Claude 4.5 Opus) using a static, single-phase prompting approach. Since first-pass filters should capture as many actual occurrences of the target phenomenon as possible, priority was given to recall over precision. Accordingly, the paper argues for F2, a recall-focused metric, over commonly used alternatives like F1 or MCC for validating LLM performance in filtering tasks. Claude 4.5 Opus with pretraining achieved the best performance against the human consensus (F2 = 0.944, recall = 100%), resulting in substantial workload reduction with no information loss. The study demonstrates that LLMs can serve as effective first-pass filters for semantic annotation in corpus linguistics, extending their applicability to low-resource languages. | |
| dc.description.peerreviewed | SI | |
| dc.description.sponsorship | This work was funded by the Spanish Ministry of Science, Innovation and Universities (MICIU) and the State Research Agency (AEI) under grant PID2022-137170OB-I00 (10.13039/501100011033), and by the European Regional Development Fund (ERDF/EU). | |
| dc.identifier.citation | Míguez-Rego, Vítor (2026). Large language models as first-pass filters for corpus annotation: semantic disambiguation of Galician pobo. Open Linguistics, 12(1). https://doi.org/10.1515/opli-2025-0078 | |
| dc.identifier.doi | 10.1515/opli-2025-0078 | |
| dc.identifier.essn | 2300-9969 | |
| dc.identifier.uri | https://hdl.handle.net/10347/45973 | |
| dc.issue.number | 1 | |
| dc.journal.title | Open Linguistics | |
| dc.language.iso | eng | |
| dc.page.final | 16 | |
| dc.page.initial | 1 | |
| dc.publisher | De Gruyter | |
| dc.relation.projectID | info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2021-2023/PID2022-137170OB-I00/ES/ETIQUETADOR SEMANTICO MULTILINGUE AUTOMATICO Y SOSTENIBLE | |
| dc.relation.publisherversion | https://doi.org/10.1515/opli-2025-0078/html | |
| dc.rights | © 2026 the author(s), published by De Gruyter. This work is licensed under the Creative Commons Attribution 4.0 International License. | |
| dc.rights | Attribution 4.0 International | en |
| dc.rights.accessRights | open access | |
| dc.rights.uri | http://creativecommons.org/licenses/by/4.0/ | |
| dc.subject | Corpus linguistics | |
| dc.subject | Corpus annotation | |
| dc.subject | Semantic disambiguation | |
| dc.subject | Large language models | |
| dc.subject | Galician | |
| dc.subject.classification | 570104 Computational linguistics (Lingüística informatizada) | |
| dc.title | Large language models as first-pass filters for corpus annotation: semantic disambiguation of Galician pobo | |
| dc.type | journal article | |
| dc.type.hasVersion | VoR | |
| dc.volume.number | 12 | |
| dspace.entity.type | Publication | |
| relation.isAuthorOfPublication | 57a03e10-d9a2-4a43-b76a-459649b0ceaf | |
| relation.isAuthorOfPublication.latestForDiscovery | 57a03e10-d9a2-4a43-b76a-459649b0ceaf |
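
The abstract's preference for F2 over F1 in filtering tasks can be illustrated with the standard F-beta formula, in which beta = 2 weights recall more heavily than precision. The following is a minimal sketch, not the author's evaluation code: the function names `precision_recall` and `f_beta` are hypothetical, and the counts in the usage note are invented for illustration rather than taken from the paper.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Compute precision and recall from raw counts (illustrative helper)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def f_beta(precision: float, recall: float, beta: float = 2.0) -> float:
    """Standard F-beta score; beta > 1 favours recall, beta = 1 gives F1."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


if __name__ == "__main__":
    # Hypothetical filter output: every true case kept, half of the kept items wrong.
    p, r = precision_recall(tp=50, fp=50, fn=0)
    print(f"precision={p:.3f} recall={r:.3f}")
    print(f"F1={f_beta(p, r, beta=1.0):.3f} F2={f_beta(p, r, beta=2.0):.3f}")
```

With these invented counts (precision 0.5, recall 1.0), F1 is about 0.667 while F2 is about 0.833: a filter that misses nothing is penalised less for its false positives under F2, which matches the stated priority of recall for a first-pass filter.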
Files
Original bundle
- Name: 2026_open_miguez_large.pdf
- Size: 599.28 KB
- Format: Adobe Portable Document Format