Evaluating the Language Abilities of Large Language Models vs. Humans: Three Caveats

Evelina Leivada; Vittoria Dentella; Fritz Günther

doi:10.5964/bioling.14391

Evelina Leivada
Department of Catalan Philology, Universitat Autònoma de Barcelona, Barcelona, Spain; Institució Catalana de Recerca i Estudis Avançats (ICREA), Barcelona, Spain
Vittoria Dentella
Department of English and German Studies, Universitat Rovira i Virgili, Tarragona, Spain
Fritz Günther
Institut für Psychologie, Humboldt-Universität zu Berlin, Berlin, Germany

Abstract

We identify and analyze three caveats that may arise when analyzing the linguistic abilities of Large Language Models. The problem of unlicensed generalizations refers to the danger of interpreting performance in one task as predictive of the models’ overall capabilities, based on the assumption that because a specific task performance is indicative of certain underlying capabilities in humans, the same association holds for models. The human-like paradox refers to the problem of lacking human comparisons, while at the same time attributing human-like abilities to the models. Last, the problem of double standards refers to the use of tasks and methodologies that either cannot be applied to humans or are evaluated differently in models vs. humans. While we recognize the impressive linguistic abilities of LLMs, we conclude that specific claims about the models’ human-likeness in the grammatical domain are premature.

Published: 19 April 2024
https://doi.org/10.5964/bioling.14391
Issue: Vol. 18 (2024)
Section: Forum
Keywords: Artificial Intelligence, grammaticality, Large Language Models, probabilities

This work is licensed under a Creative Commons Attribution (CC BY) 4.0 International License.
