Training and evaluation of vector models for Galician

García González, Marcos

doi:10.1007/s10579-024-09740-0

Files

2024_LangResEv_Garcia_Training.pdf (1.38 MB)

Identifiers

URI: https://hdl.handle.net/10347/45924

E-ISSN: 1574-0218

DOI: 10.1007/s10579-024-09740-0

Publication date

2024-06-04

Authors

García González, Marcos

Publisher

Springer Nature

Abstract

This paper presents a large-scale, systematic assessment of distributional models for Galician. To this end, we first trained and evaluated static word embeddings (e.g., word2vec, GloVe), and then compared their performance with that of current contextualised representations generated by neural language models. We compiled and processed a large corpus for Galician, and created four datasets for word analogies and concept categorisation based on standard resources for other languages. Using this corpus, we trained 760 static vector space models which vary in their input representations (e.g., adjacency-based versus dependency-based approaches), learning algorithms, size of the surrounding contexts, and number of vector dimensions. These models were evaluated both intrinsically, using the newly created datasets, and extrinsically, on POS-tagging, dependency parsing, and named entity recognition. The results provide new insights into the performance of different vector models for Galician and into the impact of several training parameters on each task. In general, fastText embeddings are the static representations with the best performance in the intrinsic evaluations and in named entity recognition, while syntax-based embeddings achieve the highest results in POS-tagging and dependency parsing, indicating that there is no significant correlation between performance in the intrinsic and extrinsic tasks. Finally, we compared the performance of static vector representations with that of BERT-based word embeddings, whose fine-tuning obtains the best performance on named entity recognition. This comparison provides a comprehensive overview of the state of the art for current models in Galician, and we release new transformer-based models for NER.
All the resources used in this research are freely available to the community, and the best models have been incorporated into SemantiGal, an online tool to explore vector representations for Galician.

Keywords

Distributional semantics | Word embeddings | Galician | Intrinsic evaluation | Extrinsic evaluation

Bibliographic citation

Garcia, M. Training and evaluation of vector models for Galician. Lang Resources & Evaluation 58, 1419–1462 (2024). https://doi.org/10.1007/s10579-024-09740-0

Publisher version

https://doi.org/10.1007/s10579-024-09740-0

Collections

Lingua e Literatura Españolas, Teoría da Literatura e Lingüística Xeral
Centro de Investigación en Tecnoloxías Intelixentes da USC (CiTIUS)
