<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article
  PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD with MathML3 v1.2 20190208//EN" "JATS-journalpublishing1-mathml3.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:ali="http://www.niso.org/schemas/ali/1.0/" article-type="research-article" dtd-version="1.2" xml:lang="en">
<front>
<journal-meta><journal-id journal-id-type="publisher-id">BIOLING</journal-id><journal-id journal-id-type="nlm-ta">Biolinguistics</journal-id>
<journal-title-group>
<journal-title>Biolinguistics</journal-title><abbrev-journal-title abbrev-type="pubmed">Biolinguistics</abbrev-journal-title>
</journal-title-group>
<issn pub-type="epub">1450-3417</issn>
<publisher><publisher-name>PsychOpen</publisher-name></publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">bioling.19021</article-id>
<article-id pub-id-type="doi">10.5964/bioling.19021</article-id>
<article-categories>
<subj-group subj-group-type="heading"><subject>Forum</subject></subj-group>


<subj-group subj-group-type="badge">
<subject>Data</subject>
</subj-group>


</article-categories>
<title-group>
<article-title>Fundamental Principles of Linguistic Structure Are Not Represented by ChatGPT</article-title>
	<alt-title alt-title-type="right-running">Fundamental Principles of Linguistic Structure Are Not Represented by ChatGPT</alt-title>
<alt-title specific-use="APA-reference-style" xml:lang="en">Fundamental principles of linguistic structure are not represented by ChatGPT</alt-title>
</title-group>
<contrib-group>
	<contrib contrib-type="author" corresp="yes"><name name-style="western"><surname>Murphy</surname><given-names>Elliot</given-names></name><xref ref-type="corresp" rid="cor1">*</xref><xref ref-type="aff" rid="aff1"><sup>1</sup></xref><xref ref-type="aff" rid="aff2"><sup>2</sup></xref></contrib>
<contrib contrib-type="author"><name name-style="western"><surname>Leivada</surname><given-names>Evelina</given-names></name><xref ref-type="aff" rid="aff3"><sup>3</sup></xref><xref ref-type="aff" rid="aff4"><sup>4</sup></xref></contrib>
<contrib contrib-type="author"><name name-style="western"><surname>Dentella</surname><given-names>Vittoria</given-names></name><xref ref-type="aff" rid="aff5"><sup>5</sup></xref></contrib>
<contrib contrib-type="author"><name name-style="western"><surname>Montero</surname><given-names>Raquel</given-names></name><xref ref-type="aff" rid="aff3"><sup>3</sup></xref></contrib>
<contrib contrib-type="author"><name name-style="western"><surname>Günther</surname><given-names>Fritz</given-names></name><xref ref-type="aff" rid="aff6"><sup>6</sup></xref></contrib>
<contrib contrib-type="author"><name name-style="western"><surname>Marcus</surname><given-names>Gary</given-names></name><xref ref-type="aff" rid="aff7"><sup>7</sup></xref></contrib>
<contrib contrib-type="editor">
<name>
	<surname>Grohmann</surname>
	<given-names>Kleanthes K.</given-names>
</name>
<xref ref-type="aff" rid="aff8"/>
</contrib>
<aff id="aff1"><label>1</label><institution>Vivian L. Smith Department of Neurosurgery, UTHealth</institution>, <addr-line><city>Houston</city>, <state>TX</state></addr-line>, <country country="US">USA</country></aff>
<aff id="aff2"><label>2</label><institution>Texas Institute for Restorative Neurotechnologies, UTHealth</institution>, <addr-line><city>Houston</city>, <state>TX</state></addr-line>, <country country="US">USA</country></aff>
	<aff id="aff3"><label>3</label>Departament de Filologia Catalana, <institution>Universitat Autònoma de Barcelona</institution>, <addr-line><city>Barcelona</city></addr-line>, <country country="ES">Spain</country></aff>
<aff id="aff4"><label>4</label><institution>Institució Catalana de Recerca i Estudis Avançats (ICREA)</institution>, <addr-line><city>Barcelona</city></addr-line>, <country country="ES">Spain</country></aff>
	<aff id="aff5"><label>5</label>Department of Brain and Behavioral Sciences, <institution>University of Pavia</institution>, <addr-line><city>Pavia</city></addr-line>, <country country="IT">Italy</country></aff>
	<aff id="aff6"><label>6</label>Institut für Psychologie, <institution>Humboldt-Universität zu Berlin</institution>, <addr-line><city>Berlin</city></addr-line>, <country country="DE">Germany</country></aff>
	<aff id="aff7"><label>7</label>Department of Psychology, <institution>New York University</institution>, <addr-line><city>New York</city>, <state>NY</state></addr-line>, <country country="US">USA</country></aff>
	<aff id="aff8"><institution>University of Cyprus</institution>, <addr-line><city>Nicosia</city></addr-line>, <country country="CY">Cyprus</country></aff>
</contrib-group>
<author-notes>
	<corresp id="cor1"><label>*</label>Texas Institute for Restorative Neurotechnologies, 1133 John Freeman Blvd, Houston, TX 77030, USA. <email xlink:href="elliot.murphy@uth.tmc.edu">elliot.murphy@uth.tmc.edu</email></corresp>
</author-notes>
<pub-date date-type="pub" publication-format="electronic"><day>04</day><month>12</month><year>2025</year></pub-date>
	<pub-date pub-type="collection" publication-format="electronic"><year>2025</year></pub-date>
<volume>19</volume>
<elocation-id>e19021</elocation-id>
<history>
<date date-type="received">
<day>24</day>
<month>07</month>
<year>2025</year>
</date>
<date date-type="accepted">
<day>29</day>
<month>09</month>
<year>2025</year>
</date>
</history>
<permissions><copyright-year>2025</copyright-year><copyright-holder>Murphy, Leivada, Dentella et al.</copyright-holder><license license-type="open-access" specific-use="CC BY 4.0" xlink:href="https://creativecommons.org/licenses/by/4.0/"><ali:license_ref>https://creativecommons.org/licenses/by/4.0/</ali:license_ref><license-p>This is an open access article distributed under the terms of the Creative Commons Attribution 4.0 International License, CC BY 4.0, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p></license></permissions>
<abstract>
	<p>A core component of a successful artificial general intelligence would be the rapid creation and manipulation of grounded compositional abstractions and the demonstration of expertise in the family of recursive hierarchical syntactic objects necessary for the creative use of human language. We evaluated the recently released o3 model (OpenAI; o3-mini-high) from ChatGPT and discovered that while it succeeds on some basic linguistic tests relying on linear, surface statistics (e.g., the Strawberry Test), it fails to generalize basic phrase structure rules; it fails with comparative sentences involving semantically illegal cardinality comparisons (‘Escher sentences’); it fails to correctly rate and explain acceptability dynamics; and it fails to distinguish between instructions to generate unacceptable semantic vs. unacceptable syntactic outputs. When tasked with generating simple violations of grammatical rules, it is seemingly incapable of representing multiple parses to evaluate against various possible semantic interpretations. We ran all of these prompts multiple times again through the API and provide basic accuracy scores. In stark contrast to many recent claims that artificial language models are on the verge of replacing the field of linguistics, our results suggest not only that deep learning is hitting a wall with respect to compositionality (<xref ref-type="bibr" rid="r25">Marcus, 2022</xref>), but that it is hitting [<italic>a</italic> [<italic>stubbornly</italic> [<italic>resilient wall</italic>]]] that cannot readily be surmounted to reach human-like compositional reasoning simply through more compute.</p>
</abstract>
<kwd-group kwd-group-type="author"><kwd>compositionality</kwd><kwd>syntax</kwd><kwd>OpenAI</kwd><kwd>o3</kwd><kwd>semantics</kwd></kwd-group>

</article-meta>
</front>
<body>
<sec id="sec1" sec-type="intro"><title>1. Introduction</title>
	<p>Large language models—deep neural nets trained in next-word prediction in a large corpus of text—have proven capable of parsing the complex sequential statistics of written text without many obvious grammatical errors (<xref ref-type="bibr" rid="r4">Besta et al., 2025</xref>; <xref ref-type="bibr" rid="r22">Lindström, 2024</xref>; <xref ref-type="bibr" rid="r54">Russin et al., 2024</xref>; <xref ref-type="bibr" rid="r63">Zhao &amp; Zhang, 2024</xref>). This has spurred many to deem them capable of human-like compositionality, in particular with respect to syntax-semantics (<xref ref-type="bibr" rid="r23">Mahowald et al., 2024</xref>). Some have even claimed that “large language models are better than theoretical linguists at theoretical linguistics” (<xref ref-type="bibr" rid="r64">Ambridge &amp; Blything, 2024</xref>), and that we are facing “the end of (generative) linguistics as we know it” (<xref ref-type="bibr" rid="r6">Chesi Forthcoming</xref>) (although Chesi qualifies that many modern approaches to generative grammar are arguably just as architecturally opaque as LLMs). This would be an extremely consequential state of affairs—if it can be shown to be true. Yet, much recent work indicates that they merely <italic>emulate human language</italic> (<xref ref-type="bibr" rid="r8">Dentella et al., 2023</xref>, <xref ref-type="bibr" rid="r9">2024</xref>; <xref ref-type="bibr" rid="r16">Katzir, 2023</xref>; <xref ref-type="bibr" rid="r55">Schaeffer et al., 2023</xref>) as opposed to being in possession of human-like syntactic competence.</p>
<p>In this report, the most recent reasoning model from OpenAI (o3-mini-high) is evaluated for its ability to assess and generate compositional representations. o3, like other ‘reasoning models’, is based on large language models but includes additional modules to improve certain computational functions and multi-step logical reasoning. Others have already expressed scepticism about the promise of o3. For example, its recent high performance on the ARC-AGI test “is not due to intelligence but due to the application of knowledge and computing resources that together enable an effective search in the given space of possible solutions” (<xref ref-type="bibr" rid="r51">Pfister &amp; Jud, 2025</xref>). We agree in principle with the assessment in <xref ref-type="bibr" rid="r30">Mollica and Piantadosi (2022)</xref> that “Linguistic corpora are a low-dimensional projection of both syntax and thought, so it is not implausible that a smart learning system could recover at least some aspects of these cognitive systems from watching text alone”. The critical challenge, as ever, is to demonstrate this capacity <italic>empirically</italic>.</p>
<p>In our report, a number of basic flaws are discovered and noted with respect to the linguistic capabilities of o3. These pertain to fundamental properties of basic sentence structure building and semantic evaluation.</p></sec>
<sec id="sec2" sec-type="methods"><title>2. Method</title>
<p>We identified a number of basic linguistic processes, and a number of more hierarchically complex computations, to subject to direct investigation. o3-mini (<xref ref-type="bibr" rid="r46">OpenAI, 2025</xref>) was prompted via OpenAI’s API with the reasoning effort set to ‘high’ (as of September 2025, OpenAI does not have a temperature parameter that the user can specify for this model). To maximize reproducibility as much as possible, an integer seed was specified for each prompt and the system reference field was saved. Lastly, to explore the consistency in the responses, the model was asked the same prompt 3 times. Prompts are directly reproduced in highlighted red boxes, and responses are directly reproduced below. Given the preliminary nature of our experimental report, we refrain from conducting analyses such as logistic regressions and provide only elementary descriptive statistics. More systematic analyses will be forthcoming in future work.</p></sec>
<sec id="sec3" sec-type="results"><title>3. Results</title>
<p>We begin with initially unproblematic tests for the model (Prompts 1-7), before moving to more complex tests that proved problematic (Prompts 8-26).</p>
<sec id="sec3.1"><title>3.1. Linear Order and Basic Dependencies</title>
<p>Starting first with some basic linear-based computations that do not involve higher-order compositional reasoning, o3-mini-high was able to successfully generate the following responses.</p>
	
<disp-quote>
<p><bold>Prompt 1</bold></p>
	<p><?disable-indent?>Generate a palindrome that uses the word ‘knight’.</p></disp-quote>
	
	<fig id="f1" position="anchor" fig-type="figure" orientation="portrait"><graphic xlink:href="bioling.19021-f1" position="anchor" orientation="portrait"/></fig>
	
<p>It passed the ‘strawberry test’.</p>
	
<disp-quote>
<p><bold>Prompt 2</bold></p>
	<p><?disable-indent?>How many occurrences of the letter ‘r’ are there in ‘strawberry’?</p></disp-quote>
	
	<fig id="f2" position="anchor" fig-type="figure" orientation="portrait"><graphic xlink:href="bioling.19021-f2" position="anchor" orientation="portrait"/></fig>
	
	<p>The model was prompted with the following series of requests, and delivered reasonable responses for all (Prompts 3-7).</p>
	
<disp-quote>
<p><bold>Prompt 3</bold></p>
	<p><?disable-indent?>In the sentence ‘Professors were mean but teachers were nice, they were likely moody’, who does ‘they’ refer to?</p></disp-quote>
	
	<fig id="f3" position="anchor" fig-type="figure" orientation="portrait"><graphic xlink:href="bioling.19021-f3" position="anchor" orientation="portrait"/></fig>
	
	<disp-quote><p><bold>Prompt 4</bold></p>
	<p><?disable-indent?>In the sentence ‘Teachers were nice but professors were mean, they were likely moody’, who does ‘they’ refer to?</p></disp-quote>
	
	<fig id="f4" position="anchor" fig-type="figure" orientation="portrait"><graphic xlink:href="bioling.19021-f4" position="anchor" orientation="portrait"/></fig>
	
	<disp-quote><p><bold>Prompt 5</bold></p>
	<p><?disable-indent?>In the sentence ‘Bill was happy but Mary was sad, he was probably overworked’, who does ‘he’ refer to?</p></disp-quote>
	
	<fig id="f5" position="anchor" fig-type="figure" orientation="portrait"><graphic xlink:href="bioling.19021-f5" position="anchor" orientation="portrait"/></fig>
	
	<disp-quote><p><bold>Prompt 6</bold></p>
	<p><?disable-indent?>Does this sentence make sense to you?</p>
</disp-quote>
	
	<fig id="f6" position="anchor" fig-type="figure" orientation="portrait"><graphic xlink:href="bioling.19021-f6" position="anchor" orientation="portrait"/></fig>
</sec>
	
<sec id="sec3.2"><title>3.2. Phrase Structure</title>
<p>Next, the model was tested for basic phrase structure representations.</p>
	
<disp-quote>
<p><bold>Prompt 7</bold></p>
	<p><?disable-indent?>Is ‘Dogs dogs dog dog dogs’ grammatical?</p></disp-quote>
	
	<fig id="f7" position="anchor" fig-type="figure" orientation="portrait"><graphic xlink:href="bioling.19021-f7" position="anchor" orientation="portrait"/></fig>
	
<p>Dog/dogs was then replaced with an invented pseudoword (Prompt 8). When presented with an ungrammatical structure (a superfluous ‘glarts’ was added to the grammatical 5-word formula above), the model incorrectly claimed that this was grammatical. The reasoning provided was fallacious, confusing the role of the middle words and misunderstanding the role of the final words.</p>
	
<disp-quote>
<p><bold>Prompt 8</bold></p>
	<p><?disable-indent?>Pretend that ‘glart’ is a word that refers to a group of alien creatures, and can also refer to the action of pleasing. In this context, is ‘Glarts glarts glart glart glarts glarts’ grammatical?</p></disp-quote>
	
	<fig id="f8" position="anchor" fig-type="figure" orientation="portrait"><graphic xlink:href="bioling.19021-f8" position="anchor" orientation="portrait"/></fig>
	
<p>When prompted with an even more preposterous example (adding three additional “glarts” to the end of the initially grammatical “Glarts glarts glart glart glarts”), the model generated an inaccurate tree structure that was not faithful to the string input (by mistakenly including more than two instances of “glart”) and declared it to be grammatical.</p>
	
<disp-quote>
<p><bold>Prompt 9</bold></p>
	<p><?disable-indent?>Given the same context as above, is ‘Glarts glarts glart glart glarts glarts glarts glarts’ grammatical?</p></disp-quote>
	
	<fig id="f9" position="anchor" fig-type="figure" orientation="portrait"><graphic xlink:href="bioling.19021-f9" position="anchor" orientation="portrait"/></fig>
	
	<fig id="f10" position="anchor" fig-type="figure" orientation="portrait"><graphic xlink:href="bioling.19021-f10" position="anchor" orientation="portrait"/></fig>

</sec>
<sec id="sec3.3"><title>3.3. Escher Sentences</title>
<p>Next, we turned to comparative sentences involving semantically illegal cardinality comparisons (sometimes termed ‘Escher sentences’). o3-mini-high failed to parse the comparative illusion, noting only the structural acceptability, despite the sentence being ungrammatical.</p>
	
<disp-quote>
<p><bold>Prompt 10</bold></p>
	<p><?disable-indent?>Is the sentence ‘Fewer athletes have been to Beijing than I have’ acceptable?</p></disp-quote>
	
	<fig id="f11" position="anchor" fig-type="figure" orientation="portrait"><graphic xlink:href="bioling.19021-f11" position="anchor" orientation="portrait"/></fig>

	<disp-quote>
<p><bold>Prompt 11</bold></p>
	<p><?disable-indent?>Is the sentence ‘More women have finished university than he has’ acceptable?</p></disp-quote>
	
	<fig id="f12" position="anchor" fig-type="figure" orientation="portrait"><graphic xlink:href="bioling.19021-f12" position="anchor" orientation="portrait"/></fig></sec>
<sec id="sec3.4"><title>3.4. Center-Embedding</title>
<p>We tested center-embedding acceptability. The model failed to detect ungrammaticality due to a missing verb (or superfluous Noun Phrase). The reasoning provided was flawed and included some hallucination of pronominal elements (although the model helpfully does not recommend this sentence “for everyday use”!).</p>
	
<disp-quote>
<p><bold>Prompt 12</bold></p>
	<p><?disable-indent?>Is ‘The doctor the nurse the hospital had hired met John?’ acceptable?</p></disp-quote>
	
	<fig id="f13" position="anchor" fig-type="figure" orientation="portrait"><graphic xlink:href="bioling.19021-f13" position="anchor" orientation="portrait"/></fig>
	
<p>With the next prompt below, the model fallaciously injects an additional ‘met’ not present in the prompt.</p>
	
<disp-quote>
<p><bold>Prompt 13</bold></p>
	<p><?disable-indent?>Draw me a syntactic tree structure, in line with Minimalist syntax, for the sentence ‘The doctor the nurse the hospital had hired met John’.</p></disp-quote>
	
	<fig id="f14" position="anchor" fig-type="figure" orientation="portrait"><graphic xlink:href="bioling.19021-f14" position="anchor" orientation="portrait"/></fig>
	
	<fig id="f15" position="anchor" fig-type="figure" orientation="portrait"><graphic xlink:href="bioling.19021-f15" position="anchor" orientation="portrait"/></fig>
	
<p>The syntactic tree structure generated above does not successfully map onto the interpretation seemingly intended by the model’s output.</p></sec>
<sec id="sec3.5"><title>3.5. Paradox</title>
<p>When pushed with the below prompt – a more audacious semantic task – the response does not deliver on its stated reasoning plan to inject ‘ambiguity’ and ‘intricacy’. There is no element of surprisal or contradiction beyond the standard liar’s paradox (i.e., “This statement is false”). Of course, it is perhaps unfair and unreasonable to expect the model to create a paradox “more complex and sophisticated than all existing paradoxes or puzzles”, but the point to be made here concerns the means through which the model judged and reasoned on its attempt.</p>
<disp-quote>
<p><bold>Prompt 14</bold></p>
	<p><?disable-indent?>Create for me a semantic or philosophical paradox that is more complex and sophisticated than all existing paradoxes or puzzles.</p></disp-quote><fig id="f16" position="anchor" fig-type="figure" orientation="portrait"><graphic xlink:href="bioling.19021-f16" position="anchor" orientation="portrait"/></fig></sec>
<sec id="sec3.6"><title>3.6. Generation of Syntactic Violations</title>
<p>When asked to generate an ungrammatical sentence (an easy task for humans – one could simply recite a list of random words), the model struggled in a rather peculiar way. It tried to generate a semantically anomalous output but the sentence that was generated was entirely grammatical and acceptable. The final step in its chain of reasoning claimed to ensure the presence of “a series of ungrammatical, confusing phrases”—which were not present in the output. The model’s output may be rather Joycean and surreal in its content, but it does not satisfy the clear instruction to be ungrammatical.</p>
<disp-quote>
<p><bold>Prompt 15</bold></p>
	<p><?disable-indent?>Can you generate for me a complex sentence that merges multiple modifiers and clauses into an ungrammatical output. Please ensure that the sentence is ungrammatical with no room for ambiguity.</p></disp-quote><fig id="f17" position="anchor" fig-type="figure" orientation="portrait"><graphic xlink:href="bioling.19021-f17" position="anchor" orientation="portrait"/></fig>
<p>The model also failed to provide a basic example of an ungrammatical sentence in line with a rather direct task of violation-formation.</p>
<disp-quote>
<p><bold>Prompt 16</bold></p>
	<p><?disable-indent?>Generate an English sentence that violates a recursive application of a grammatical rule. Please choose any syntactic rule you like.</p></disp-quote><fig id="f18" position="anchor" fig-type="figure" orientation="portrait"><graphic xlink:href="bioling.19021-f18" position="anchor" orientation="portrait"/></fig>
<p>The model provided a grammatical English sentence, with fallacious reasoning as to its putative unacceptability (“Who did who see?” is a common multiple <italic>wh</italic>-question seeking the agent and participant of a seeing event). Technically, the model makes a valid point about the presence of multiple <italic>wh</italic>-operators often leading to illegal read-outs, but failed to then reflect on the other possible readings of the simple four-word string it outputted and claimed to be fundamentally ungrammatical. This provides a more stringent test for (the lack of) compositional syntax than the more common tests recently used that simply task language models with dispassionately generating strings of discourse with certain stylistic qualities (<xref ref-type="bibr" rid="r48">Piantadosi, 2024</xref>).</p></sec>
<sec id="sec3.7"><title>3.7. Generation of Multiple Syntactic Violations</title>
<p>Next, o3-mini-high failed in a number of ways with the following prompts designed to test the parsing of multiple, related syntactic representations.</p>
<disp-quote>
<p><bold>Prompt 17</bold></p>
	<p><?disable-indent?>Generate two sentences. The first sentence must contain one type of syntactic violation. The second sentence must continue the discourse content from the first, but must contain a different type of syntactic violation that explicitly is caused by some type of relation or connection with the first sentence. Draw a Minimalist tree structure to map the explicit coordination of these multiple error types.</p></disp-quote><fig id="f19" position="anchor" fig-type="figure" orientation="portrait"><graphic xlink:href="bioling.19021-f19" position="anchor" orientation="portrait"/></fig><fig id="f20" position="anchor" fig-type="figure" orientation="portrait"><graphic xlink:href="bioling.19021-f20" position="anchor" orientation="portrait"/></fig><fig id="f21" position="anchor" fig-type="figure" orientation="portrait"><graphic xlink:href="bioling.19021-f21" position="anchor" orientation="portrait"/></fig>
<p>The model fails to take account of the fact that Sentence 1 (‘The pair of scholars debate their thesis in a hurried conference’) is perfectly grammatical under standard English present tense. It focused only on how ‘pair’ is singular and so ‘would require “debates”’ – seemingly incapable of parsing interactional syntactic dynamics that require multiple steps to construct and evaluate against various possible semantic interpretations. Instead, it seemed limited to evaluating syntactic violations on a <italic>mono-configurational basis</italic>, failing to reflect on how one possible violation type could directly lead to multiple different types of acceptability under standard English syntax. In other words, humans would readily notice that while Sentence 1 may technically violate one typically expected form of agreement relation, it does not preclude the string from being subject to a wholly standard and acceptable interpretation.</p>
<p>Next, consider Sentence 2 (‘Owing to this faulty construction, themselves misinterpreting the rule from the previous discussion, the committee postponed the session’). This sentence is also (awkwardly) grammatical under basic movement applications allowing ‘themselves’ to be interpreted with ‘the committee’. Interestingly, it also appears here that the content of the prompt has influenced the semantics of Sentence 2—which makes reference to some form of rule misinterpretation. The model seems incapable of abstracting away from the basic instruction to generate syntactic violations and provide a semantic representation that is wholly independent from aspects of statistical inferences made from the prompt. On top of this, Sentence 1 and 2 do not in fact form a coherent discourse continuation, as explicitly requested in the prompt.</p>
<p>The accompanying tree structure that was generated does not accurately represent the semantics of the two separate sentences, and appears to try and represent ‘postponed the session’ without any clear syntactic categorization.</p>
<p>Note also that the final explanation for these sentences focuses explicitly on the basic possible agreement relation between two discrete elements (‘The pair’ and ‘themselves’), rather than taking a more global syntactic assessment of the role of these two elements <italic>in the context</italic> of their respective syntactic structures. Not only does the model fail to generate clear syntactic violations, it also fails to provide a level of discourse coherence that is independent of the semantics of the prompt.</p>
<p>When these types of errors were presented to the model (Prompt 18), it provided two sentences that did indeed exhibit a coherent discourse relation. However, it still failed to generate a syntactic violation in Sentence 2 that relied on explicit properties of Sentence&nbsp;1.</p>
<disp-quote>
<p><bold>Prompt 18</bold></p>
	<p><?disable-indent?>Both Sentence 1 and Sentence 2 are grammatical English sentences. For example, Sentence 1 means ‘There are two scholars and they are presently debating their thesis’. Sentence 2 means that the committee - who were misinterpreting the rule from the previous discussion - postponed the session, and that this was due to ‘this faulty construction’. It is also unclear what the discourse relation is between Sentence 1 and 2. Sentence 1 is about monks and theses, and Sentence 2 is about committees and constructions.</p></disp-quote><fig id="f22" position="anchor" fig-type="figure" orientation="portrait"><graphic xlink:href="bioling.19021-f22" position="anchor" orientation="portrait"/></fig><fig id="f23" position="anchor" fig-type="figure" orientation="portrait"><graphic xlink:href="bioling.19021-f23" position="anchor" orientation="portrait"/></fig>
<p>The model highlighted how ‘This reverberation’ in Sentence 2 is related in meaning to the previous sentence—which is irrelevant to the requested task of relating the syntactic violation itself (and not just the semantics) in Sentence 2 back to Sentence 1 (recall that Prompt 17 requests “a different type of syntactic violation that explicitly is <italic>caused by some type of relation</italic> or connection with the first sentence”; the request here is that the violation itself is causally driven by properties of the first sentence, and not simply linked to its meaning).</p>
<p>When these further errors were presented to the model, it ultimately succeeded in generating two separate types of syntactic violations for Sentence 1 and 2. Yet, while the discourse relation between the sentences was salient, the syntactic violation in Sentence 2 still did not satisfy the request of being directly linked to properties of Sentence 1 (this could easily have been achieved via Binding restrictions or 𝜑-feature violations, for example). The tree structure provided was also insufficiently transparent as to the core syntactic relations between elements.</p>
<disp-quote>
<p><bold>Prompt 19</bold></p>
	<p><?disable-indent?>You have simply repeated the same type of violation across both sentences - you have not generated a second sentence whose violation is directly linked to properties of the first sentence.</p></disp-quote><fig id="f24" position="anchor" fig-type="figure" orientation="portrait"><graphic xlink:href="bioling.19021-f24" position="anchor" orientation="portrait"/></fig><fig id="f25" position="anchor" fig-type="figure" orientation="portrait"><graphic xlink:href="bioling.19021-f25" position="anchor" orientation="portrait"/></fig>
<p>When the model was once again corrected on this point, it provided two sentences that had the same type of syntactic violations (rather than different types), and the violation in Sentence 2 was again only related to the meaning of Sentence 1 but had zero connection to its syntactic configuration.</p>
<disp-quote>
<p><bold>Prompt 20</bold></p>
	<p><?disable-indent?>You have only linked Sentence 2’s violation back to discourse features of Sentence 1. I would like you to generate a violation in Sentence 2 that is linked to syntactic properties of Sentence 1.</p></disp-quote><fig id="f26" position="anchor" fig-type="figure" orientation="portrait"><graphic xlink:href="bioling.19021-f26" position="anchor" orientation="portrait"/></fig><fig id="f27" position="anchor" fig-type="figure" orientation="portrait"><graphic xlink:href="bioling.19021-f27" position="anchor" orientation="portrait"/></fig><fig id="f28" position="anchor" fig-type="figure" orientation="portrait"><graphic xlink:href="bioling.19021-f28" position="anchor" orientation="portrait"/></fig>
<p>The model believed that this was a success because the Sentence 2 violation ‘is directly inherited from a syntactic property (the double wh‐extraction) introduced in Sentence 1’ – even though the extraction in Sentence 2 is purely bound to properties of Sentence 2 itself, with no connection to syntactic features in Sentence 1. While the presence of ‘his’ in Sentence 2 does indeed refer back to ‘The senator’, the wh-extraction constitutes a violation for independent reasons, and so does not satisfy the requests of (i) <italic>generating two different types of syntactic violations</italic>, and (ii) <italic>forming the second violation via a direct connection to syntactic properties of Sentence 1</italic>.</p>
<p>To summarize this line of inquiry, we provided in total 6 successive prompts (over Section 3.6-3.7) requesting types of violations, and we plot below the success of the model in satisfying various of these requests as they pertain to elements of structure and meaning.</p>
<table-wrap id="t1" position="anchor" orientation="portrait">
<label>Table 1</label><caption><title>Representation of the Success of o3-Mini-High in Generating Different Types of Syntactic Violations; ‘No’ and ‘Yes’ Indicate Failure or Success</title></caption>
<table frame="hsides" rules="groups">
<col width="13%" align="left"/>
<col width="29%"/>
<col width="29%"/>
<col width="29%"/>
<thead>
<tr>
<th>Prompt #</th>
<th>Unacceptable Structure</th>
<th>Multiple Violation Types</th>
<th>Causally Driven Violation</th>
</tr>
</thead>
<tbody>
<tr>
<td>15</td>
	<td style="background-color: #ED6161">No</td>
<td>N/A</td>
<td>N/A</td>
</tr>
<tr>
<td>16</td>
	<td style="background-color: #ED6161">No</td>
<td>N/A</td>
<td>N/A</td>
</tr>
<tr>
<td>17</td>
	<td style="background-color: #ED6161">No</td>
	<td style="background-color: #ED6161">No</td>
	<td style="background-color: #ED6161">No</td>
</tr>
<tr>
<td>18</td>
	<td style="background-color: #8DEB9B">Yes</td>
	<td style="background-color: #8DEB9B">Yes</td>
	<td style="background-color: #ED6161">No</td>
</tr>
<tr>
<td>19</td>
	<td style="background-color: #8DEB9B">Yes</td>
	<td style="background-color: #8DEB9B">Yes</td>
	<td style="background-color: #ED6161">No</td>
</tr>
<tr>
<td>20</td>
	<td style="background-color: #8DEB9B">Yes</td>
	<td style="background-color: #ED6161">No</td>
	<td style="background-color: #ED6161">No</td>
</tr>
</tbody>
</table>
</table-wrap></sec>
<sec id="sec3.8"><title>3.8. Scope</title>
<p>Next, we turned to scope ambiguities (<xref ref-type="bibr" rid="r15">Kamath et al., 2024</xref>). o3-mini-high correctly identified Option A as the most commonly selected option (Prompt 21), but it did not provide any logical reasoning for why Option B below could in principle be true.</p>
<disp-quote>
<p><bold>Prompt 21</bold></p>
	<p><?disable-indent?>There are exactly six chairs evenly spaced around a circular table. On the basis of this statement alone, and with no further context, there are two options:</p>
	<p><?disable-indent?>A: The six different chairs are all around the same table.</p>
	<p><?disable-indent?>B: The six chairs aren’t all around the same table.</p>
	<p><?disable-indent?>Specifically in relation to this context, which of these two options is most likely?</p></disp-quote><fig id="f29" position="anchor" fig-type="figure" orientation="portrait"><graphic xlink:href="bioling.19021-f29" position="anchor" orientation="portrait"/></fig>
<p>The model’s logic implies that ‘a chair’ must semantically only refer to an absolute singular entity due purely to its grammatical features, which ignores how some interactional property of the syntactic features of the word and its role in a compositional structure could influence an alternative meaning to shift between broad and narrow scope readings (i.e., three chairs could surround Table A, and the other three chairs could surround Table B). This points to a lack of human-like arbitration between possible semantic representations delivered by a grammatical configuration and world knowledge.</p>
<p>Assessing the three bullet points in the explanation: When deciding between Options A and B, (i) there are many sentences that include the string ‘a circular table’ that readily result in an interpretation of multiple different tables (e.g., ‘Each Prince was gifted a circular table’); (ii) the even spacing does not strictly pertain to the decision at hand; and (iii) the model’s description of ‘Contextual Convention’ only begs the question by invoking circular reasoning (i.e., the sentence means X because it means X).</p></sec>
	<sec id="sec3.9"><title>3.9. Assessment of Grammaticality</title>
		<p>We asked the model to assess the acceptability (<xref ref-type="bibr" rid="r58">Tjuatja et al., 2024</xref>) and grammaticality of 16 sentences. Sentences (1)-(11) were ungrammatical, and the model successfully identified these as such. These ungrammatical sentences contained common violations discussed in the literature, such as adjunct islands, <italic>whether</italic>-islands, and binding condition violations. Sentences (12)-(16) were grammatical. However, the model incorrectly claimed that (12), (15) and (16) were ungrammatical, and its explanation for why (14) is grammatical was incorrect. Below the prompt, we focus on the responses pertinent to (12)-(16) since these were the items causing errors. This prompt is assuredly complex, but if artificial models are “better than theoretical linguists at theoretical linguistics” (<xref ref-type="bibr" rid="r64">Ambridge &amp; Blything, 2024</xref>) we might expect some general successes.</p>
<disp-quote>
<p><bold>Prompt 22</bold></p>
	<p><?disable-indent?>Please assess the following sentences for their acceptability and grammaticality. Explain how each of the sentences either does or does not violate any number of linguistic rules.</p>
	<p><?disable-indent?>1) The journalists said that Trump lied about each other.</p>
	<p><?disable-indent?>2) Mike tries will win.</p>
	<p><?disable-indent?>3) The man expected the client to shoot each other.</p>
	<p><?disable-indent?>4) For themselves to decide to go would be absurd.</p>
	<p><?disable-indent?>5) For each other to lose would be disgraceful.</p>
	<p><?disable-indent?>6) Sam believes to be intelligent.</p>
	<p><?disable-indent?>7) Kim expects Saul to like herself.</p>
	<p><?disable-indent?>8) I talked about Dale to himself.</p>
	<p><?disable-indent?>9) Who did Tom talk with Sally after seeing?</p>
	<p><?disable-indent?>10) Who does Diane wonder whether Cooper likes?</p>
	<p><?disable-indent?>11) What did you make the claim that Kyle bought?</p>
	<p><?disable-indent?>12) John likes Mary's picture of himself.</p>
	<p><?disable-indent?>13) John likes Mary's picture of herself.</p>
	<p><?disable-indent?>14) Jimmy expected Saul to win himself.</p>
	<p><?disable-indent?>15) Jimmy expected himself to win Saul.</p>
	<p><?disable-indent?>16) We think that they expected that pictures of each other would be in the room.</p></disp-quote><fig id="f30" position="anchor" fig-type="figure" orientation="portrait"><graphic xlink:href="bioling.19021-f30" position="anchor" orientation="portrait"/></fig><fig id="f31" position="anchor" fig-type="figure" orientation="portrait"><graphic xlink:href="bioling.19021-f31" position="anchor" orientation="portrait"/></fig>
	<p>Below is a summary chart for the accuracy of o3-mini-high in identifying unacceptable and acceptable sentences (<italic>we caveat this by highlighting the limited sample size and non-systematic assessment</italic>).</p><fig id="f32" position="anchor" fig-type="figure" orientation="portrait"><label>Figure 1</label><caption>
<title>Bar Chart Representing Classification Accuracy for o3-Mini-High for Unacceptable and Acceptable Sentences</title></caption><graphic xlink:href="bioling.19021-f32" position="anchor" orientation="portrait"/></fig>
<p>The model incorrectly stated that ‘John’ is ‘too far removed’ to bind with ‘himself’ in (12). With (14), the model incorrectly states that ‘The intended reading is that Saul is expected to win on his own’. (14) can be read as Jimmy expecting Saul to win some potential prize, whereby the prize could be, e.g., some painting of Saul, whereby ‘Saul won himself’ would similarly mean ‘Saul won a painting of himself’.</p>
<p>These arguments also apply to (15), which the model incorrectly identified as ungrammatical, even though Jimmy could, again, be expected to win some painting (or somesuch) of Saul (or, indeed, Saul himself could logically be the prize, e.g., ‘Saul’ could be the name of a pet or robot).</p>
<p>(16) has a dual reading, one under which ‘we’ and ‘each other’ are linked (ungrammatical) and one under which ‘they’ and ‘each other’ are linked (grammatical). The model failed to parse these possibilities.</p></sec>
<sec id="sec3.10"><title>3.10. Assessment of Graded Acceptability</title>
<p>Next, we followed up on the initial indications from Section 3.9 that the model succeeds in identifying ungrammatical sentences but struggles to reliably identify acceptable sentences as such. Instead of presenting only grammatical and ungrammatical sentences (as in Section 3.9), we exploited the gradient cline in acceptability in the constructions below (‘*’ = unacceptable; ‘?’ = partially acceptable for some) collated from some recent linguistics literature (<xref ref-type="bibr" rid="r1">Amiraz, 2022</xref>; <xref ref-type="bibr" rid="r37">Murphy, 2024a</xref>; <xref ref-type="bibr" rid="r59">Toosarvandani, 2014</xref>; <xref ref-type="bibr" rid="r61">Wu, 2025</xref>). Note that the ‘partial acceptability’ rating was motivated directly by prior literature, and was not arbitrarily stipulated by our group. 14 of these sentences were either unacceptable (1-4) or partially acceptable (5-14). We presented these sentences to o3-mini-high (without the below annotated ‘*’ or ‘?’) and asked it to sort them by acceptability. We provide an abridged prompt below, for reasons of space (the sentences were presented below this prompt text in random order, without numbering).</p>
<disp-quote>
<p><bold>Prompt 23</bold></p>
	<p><?disable-indent?>Please sort the sentences below into increasing levels of acceptability: from (1) (wholly unacceptable) to (2) (unacceptable) to (3) (partially acceptable) to (4) (acceptable).</p></disp-quote>
<p>The model identified 7 sentences as unacceptable, only 2 sentences as partially acceptable, and 15 sentences as acceptable, diverging from the acceptability profile provided above. Below, we have marked sentences incorrectly judged by the model with a red cross, and those correctly judged with a green tick. If the model assigned a partially acceptable sentence as (1) (‘wholly unacceptable’) and provided a reasonable explanation, we considered this to be correct and hence assigned it a green tick.</p>

	<fig id="f44" position="anchor" fig-type="figure" orientation="portrait"><graphic xlink:href="bioling.19021-f44" position="anchor" orientation="portrait"/></fig>
	
	<fig id="f33" position="anchor" fig-type="figure" orientation="portrait"><graphic xlink:href="bioling.19021-f33" position="anchor" orientation="portrait"/></fig>
	
	<fig id="f34" position="anchor" fig-type="figure" orientation="portrait"><graphic xlink:href="bioling.19021-f34" position="anchor" orientation="portrait"/></fig>
	
<p>Below is a summary chart for the accuracy of o3-mini-high across the different types of sentences it was tasked with rating (<italic>we again caveat this by highlighting the limited sample size in the present preliminary study</italic>).</p><fig id="f35" position="anchor" fig-type="figure" orientation="portrait"><label>Figure 2</label><caption>
<title>Bar Chart Representing Accuracy for o3-Mini-High Across the Three Types of Sentences Provided in the Prompts</title></caption><graphic xlink:href="bioling.19021-f35" position="anchor" orientation="portrait"/></fig>
<p>The explanation for (14) incorrectly states that this is unacceptable due to more common expectations of the presence of ‘to raise children’ (we note that the model’s own numbering system in its response text seems inconsistent and flawed, so we refer to numbered items in our ‘Gradient Cline in Acceptability’ list above). The model was unable to recognize zeugmatic conceptual coordination as motivating inclusion into either the partially acceptable or unacceptable groups (i.e., (7)). Some of the explanations for the unacceptable sentences – though correctly identified as such – do not provide a coherent explanation for their unacceptable nature. For example, ‘Not three students arrived’ is deemed unacceptable purely because ‘it is odd and ungrammatical’—which raises the question as to why!</p>
<p>Importantly, we wish to stress that we provided to the model four distinct options for acceptability, which were not utilized correctly for some of the partially acceptable sentences – <italic>even when the model explicitly noted in its response that these were in fact not wholly acceptable</italic>. For example, two of the sentences that the model placed in the ‘Acceptable’ group are noted as being ‘odd’ and ‘unexpected’ – ideal criteria to motivate their inclusion into the ‘Partially Acceptable’ group.</p>
<p>Overall, the model succeeded in identifying the most egregiously unacceptable sentences (in both this section and in Section 3.9), and most of the plainly acceptable sentences. However, some of its explanations were either lacking in specificity or were inconsistent with the model’s grouping of the sentences in question. In addition, the model struggled considerably with partially acceptable sentences, classifying only two sentences as partially acceptable out of ten—and one of these two sentences was incorrectly classified (two of the partially acceptable sentences were classified as unacceptable with reasonable explanations, and so we deemed these to be correct judgments). As such, only one sentence out of ten was correctly placed within the ‘partially acceptable’ group. Therefore, we conclude that the kind of acceptability spectrum that humans are acutely sensitive to is not reliably captured by o3.</p></sec>
<sec id="sec3.11"><title>3.11. Modified Jabberwocky</title>
<p>In order to test the potential interaction of lexical and configurational processes, we presented the model with the following prompt.</p>
<disp-quote>
<p><bold>Prompt 24</bold></p>
	<p><?disable-indent?>Can you generate for me three ‘Jabberwocky’-style sentences which have the following properties: First, instead of replacing all content words with pseudowords (the typical way to implement a Jabberwocky sentence), I want you to replace all function words with pseudowords. The second sentence must contain a syntactic violation that must be detectible for English speakers. None of the pseudowords must rhyme with any other pseudoword across the three sentences. Finally, the three sentences must together form a coherent event structure.</p></disp-quote><fig id="f36" position="anchor" fig-type="figure" orientation="portrait"><graphic xlink:href="bioling.19021-f36" position="anchor" orientation="portrait"/></fig>
<p>Breaking down the four requests:</p>
<list id="L1" list-type="order">
<list-item>
<p><bold>Success</bold>: All function words were accurately replaced with pseudowords.</p></list-item>
<list-item>
<p><bold>Failure</bold>: The two neighboring pseudowords that are claimed to create “a clear syntactic violation” are not readily parsed as [Determiner, Determiner], since it is not necessarily ungrammatical to have two co-occurring pseudowords. For example, the sentence could readily be parsed as ‘The explorer followed with the map to a hidden grove’; or ‘across the map’, ‘within the map’, ‘in the map’, ‘on the map’, etc. The prompt requested that the syntactic violation must be detectible by English speakers – the model could have injected a syntactic violation that was more obviously marked on the content words.</p></list-item>
<list-item>
<p><bold>Failure</bold>: The pseudowords ‘flim’ and ‘krim’ rhyme.</p></list-item>
<list-item>
<p><bold>Success</bold>: A coherent narrative structure was provided.</p></list-item>
</list>
<p>Overall, the model was able to generate a series of narratively connected sentences and switch out all function words with pseudowords—operations that rely purely on <italic>lexical statistics, not structure</italic>. It failed with instructions that demanded a level of higher-order syntactic and even phonological inferences. Interestingly, by its own internal logic under which ‘lupn puxit map’ was inferred as the ungrammatical phrase ‘a the map’, the model was correct. But it was seemingly unable to check against other alternative parsings that would render this string of words grammatical. The model made it seem as if placing two pseudowords in the “wrong order” constitutes a syntactic violation since the English words that the model substituted them for would be ungrammatical. But of course, no English speaker would know that these pseudowords were transformed from specific function words.</p></sec>
<sec id="sec3.12"><title>3.12. Syntactic Superposition</title>
<p>The next prompt required the model to represent multiple syntactic violations within a single sentence, but to do so in a manner that nevertheless yielded some interpretable output. Though this is admittedly a difficult challenge, our motivation here was to expose the type of reasoning o3 exhibited when encountered with this challenge of negotiating two distinct syntactic rules in the service of some semantically-related goal.</p>
<disp-quote>
<p><bold>Prompt 25</bold></p>
	<p><?disable-indent?>Generate a list of 10 sentences that exhibit the following property: They all violate two different types of grammatical rules, but violating these two rules simultaneously yields a semantically or syntactically acceptable sentence. Each of the 10 sentences must combine different rule violations.</p></disp-quote><fig id="f37" position="anchor" fig-type="figure" orientation="portrait"><graphic xlink:href="bioling.19021-f37" position="anchor" orientation="portrait"/></fig><fig id="f38" position="anchor" fig-type="figure" orientation="portrait"><graphic xlink:href="bioling.19021-f38" position="anchor" orientation="portrait"/></fig><fig id="f39" position="anchor" fig-type="figure" orientation="portrait"><graphic xlink:href="bioling.19021-f39" position="anchor" orientation="portrait"/></fig><fig id="f40" position="anchor" fig-type="figure" orientation="portrait"><graphic xlink:href="bioling.19021-f40" position="anchor" orientation="portrait"/></fig><fig id="f41" position="anchor" fig-type="figure" orientation="portrait"><graphic xlink:href="bioling.19021-f41" position="anchor" orientation="portrait"/></fig>
<p>The explanation for Sentences 2-5 and 7-10 can be used as a justification for basically all ungrammatical sentences as to why they are ungrammatical. This justification boils down to ‘speakers can just choose to ignore this word’ or ‘some people stutter sometimes’. This is perfectly true, but it is hardly in full compliance with the prompt’s request for a sentence that is “semantically or syntactically acceptable”. Meanwhile, sentences 1 and 6 rely on non-standard forms of English. As such, the model in effect failed to generate a single example of two syntactic rules ‘cancelling out’ (in semantic space or configurational space) to yield some interpretable structure. Perhaps most importantly, the prompt required the model to “combine different rule violations”, yet the general theme of ‘redundancy’, ‘repetition’ and ‘superfluous’ elements cited by the model in its explanations ensured one general violation type became overwhelmingly dominant (i.e., simply repeating a word).</p></sec>
<sec id="sec3.13"><title>3.13. Impossible Objects</title>
<p>Inspired by sentences involving complex forms of polysemy (e.g., “Lunch was delicious but took forever”; “The newspaper on the table was sued by a millionaire”; “The White House issued a statement before being repainted”) involving the combination of categorially distinct semantic types (<xref ref-type="bibr" rid="r13">Gotham, 2017</xref>; <xref ref-type="bibr" rid="r36">Murphy, 2021</xref>, <xref ref-type="bibr" rid="r37">2024a</xref>), we generated the following prompt.</p>
<disp-quote>
<p><bold>Prompt 26</bold></p>
	<p><?disable-indent?>Some sentences involving polysemous words can yield semantically 'impossible' objects, like nouns that are simultaneously referred to as processes or events or concrete tokens. Generate five sentences that each involve reference to a different type of semantically impossible entity, but which is perfectly comprehensible to English speakers as not violating any rules of semantic composition or conceptual combination. In these sentences, you must only refer to the named entity once explicitly. In addition, each sentence must exhibit a different combination of multiple meanings being combined together.</p></disp-quote><fig id="f42" position="anchor" fig-type="figure" orientation="portrait"><graphic xlink:href="bioling.19021-f42" position="anchor" orientation="portrait"/></fig><fig id="f43" position="anchor" fig-type="figure" orientation="portrait"><graphic xlink:href="bioling.19021-f43" position="anchor" orientation="portrait"/></fig>
<p>From these responses, it seems clear that the model has no human-like sense of semantic anomaly. The model is correct that (3) can be interpreted as a piece of information and also a physical text, but the other examples fail to generate any coherent sense of impossibility. For example, it is not ‘semantically impossible’ for something concrete to have an emotional impact. In (2), the model’s intended meaning, of tree bark ‘resounding’, is still not triggering of an impossible entity. With (5) (‘The plot was buried beneath layers of mystery’), the model uses ‘buried’ as metaphorical, such that an abstract plot exhibits some relation to some abstract mystery, hence causing no impossibility. With (1), (4) and (5), the model seems to assume that ‘figurative’ and ‘playful’ meanings suffice to satisfy the prompt’s request for blending semantically distinct meanings.</p>
<p>In addition, the prompt requested ‘a different combination’ across all sentences, but the ‘concrete/physical’ sense was used every time (sometimes twice with one sentence, as in (2)). This task would have been easily achievable if the model had blended <italic>physical</italic>, <italic>event</italic>, <italic>information</italic> and <italic>institution</italic> senses in various ways—instead, it was only able to mix vaguely metaphorical meanings. Notice that, as with some previous prompts above (e.g., Prompt 24), here we gave the model a generous clue as to how to solve this problem, and yet it was still unable to do so.</p></sec>
<sec id="sec3.14"><title>3.14. Summary</title>
<p>We ran these above prompts 3 additional times in o3 via OpenAI’s API, and the accuracy closely reproduced the above effects that we found via the chat interface. In Supplementary Material we provide the full results, broken down by accuracy.</p></sec></sec>
<sec id="sec4" sec-type="discussion"><title>4. Discussion</title>
	<p>As predicted by some previous position papers and experimental reports (<xref ref-type="bibr" rid="r2">Baggio &amp; Murphy, 2024</xref>; <xref ref-type="bibr" rid="r18">Leivada et al., 2023a</xref>, <xref ref-type="bibr" rid="r20">2023b</xref>; <xref ref-type="bibr" rid="r26">Marcus, 2024</xref>), the latest sophisticated reasoning model from OpenAI (o3) falls short of demonstrating human-like expertise in compositional syntax-semantics. It fails to cleanly dissociate conceptual content from structural configuration – a basic requirement of compositional syntax (<xref ref-type="bibr" rid="r11">Evans, 1985</xref>; <xref ref-type="bibr" rid="r32">McCarty et al., 2023</xref>; <xref ref-type="bibr" rid="r40">Murphy, 2025</xref>) – and it provides surreal meanings instead of truly ungrammatical sentences. It was unable to generate a Jabberwocky structure that accurately represented a clear syntactic violation, it was unable to accurately assess the output of applying two distinct syntactic violations to a sentence, and it was unable to represent semantically impossible entities. Our results indicate that the kind of sentence acceptability spectrum that humans are acutely sensitive to (<xref ref-type="bibr" rid="r56">Sprouse &amp; Almeida, 2012</xref>) is not reliably captured by o3. Although we only provide minimal descriptive statistics over a brief sample size (with a more systematic investigation forthcoming), our prompts covered a broad range of grammatical demands, and indicate not only that large language models (LLMs) (like ChatGPT-4o and Large Reasoning Models like o3) have problems with ‘contextual’ and ‘pragmatic’ reasoning, but that they have not yet grasped formal language competence (in contrast to more optimistic assessments in <xref ref-type="bibr" rid="r23">Mahowald et al., 2024</xref>).</p>
<sec id="sec4.1"><title>4.1. Structure or Statistics?</title>
	<p>While <xref ref-type="bibr" rid="r3">Beguš et al. (2025)</xref> report that GPT-4 is capable of recognizing ambiguities, correcting its own analytical errors, and commenting on the feasibility of multiple solutions, we found that the more recent o3 model fails to achieve something much more elementary: It was unable to reliably distinguish between meaning and structure. When <xref ref-type="bibr" rid="r3">Beguš et al. (2025)</xref> focus on OpenAI’s o1 model, they claim that its “ability to construct center-embedded sentences without being explicitly prompted to do so thus suggests that the model acquired grammatical structure beyond the simple distributional tendencies of its training data set”. In contrast, our results cast a more pessimistic light on the grammatical capacities of o3, including explicitly for center-embedding.</p>
	<p>Moreover, our results (see especially Sections 3.6-3.13) help emphasize an apparent lack of meta-linguistic understanding (contra <xref ref-type="bibr" rid="r3">Beguš et al., 2025</xref>). For LLMs, language simply <italic>is</italic> the system it is trying to master, whereas for humans language is exploited as a powerful cognitive and inferential tool. Meta-linguistic understanding is only possible in principle if there is some separate cognitive/generative model or grounding in a world model that language is used to revise/update (<xref ref-type="bibr" rid="r20">Leivada et al., 2023b</xref>; <xref ref-type="bibr" rid="r25">Marcus, 2022</xref>). This does not seem to be the case for o3.</p>
<p>One caveat we wish to highlight here is the possibility that the model’s failure with drawing tree representations may simply be due to issues with interfacing with the drawing module itself, and may not necessarily be driven by issues in syntactic representation. Future work could attempt to have o3 output distinct types of configurational representations, perhaps via formalized languages that may be more approximate to native features of the model. A related caveat is that we have no direct human performance scores to directly make claims about certain ‘human-level’ performance, which will be needed to make such comparisons.</p></sec>
<sec id="sec4.2"><title>4.2. Syntax or Salmon?</title>
<p>Our results support recent hypotheses concerning the ability of language models to represent ‘horizontal’ linguistic information, but their significantly reduced ability to represent ‘vertical’ types of hierarchical compositional syntax-semantics (<xref ref-type="bibr" rid="r38">Murphy, 2024b</xref>, <xref ref-type="bibr" rid="r39">2024c</xref>). Postulating a chain of uni-directional associations between elements (and only showing an ability to deal with mono-configurational assessments, rather than understanding the dynamic relationship between syntactic processes and variable semantic interpretations; i.e., Sections 3.7-3.10) does not entail grammatical understanding. The language system does not fly solo—it is always in the game of driving higher-order inferences, planning, consolidating experience, and aiding directed attention. As suggested by our results, o3-mini-high lacks an ability to handle syntactic inferences <italic>alongside</italic> cognitive model updating, given its clear inability to recognize the various ways in which semantic and syntactic representations dynamically interact. Numerous examples from our report illustrate this. For example, the semantically zeugmatic constructions ‘The salmon was fast and delicious’ and ‘My appointment was long and obnoxious’ were deemed felicitous. The model was likely heavily biased by the lexico-semantic statistics of these constructions rather than by the subtle ways in which the grammar regulates distinct coordinates in conceptual space that differ markedly from the same general meaning being configured in syntactically distinct ways (e.g., compare with ‘The salmon was fast and it was delicious’; <xref ref-type="bibr" rid="r36">Murphy, 2021</xref>).</p>
<p>Our results therefore indicate a strong bias for imposing ‘horizontal’ relations on the part of o3. Humans, in contrast, have a strong bias from an early age to impose hierarchical, compositional structure above and beyond linear relations (<xref ref-type="bibr" rid="r34">Murphy, 2020a</xref>; <xref ref-type="bibr" rid="r47">Perkins &amp; Lidz, 2021</xref>). As reviewed in <xref ref-type="bibr" rid="r39">Murphy (2024c)</xref>, LLMs seem able to capture certain features of dependencies (<xref ref-type="bibr" rid="r57">Tesnière, 1959</xref>), but other fundamental principles of language that regulate how constituency, headedness, and incremental node counts yield semantic instructions during parsing (via the <italic>mapping</italic> of syntactic objects to updates of cognitive models) remain somewhat elusive.</p></sec>
<sec id="sec4.3"><title>4.3. Reasoning or Rambling?</title>
	<p>Though it may represent an advance in “the boundaries of what small models can achieve, delivering exceptional STEM capabilities—with particular strength in science, math, and coding—all while maintaining the low cost and reduced latency of OpenAI o1-mini” (<xref ref-type="bibr" rid="r46">OpenAI, 2025</xref>), this most recent model nevertheless falls short in similar ways to previous models (<xref ref-type="bibr" rid="r9">Dentella et al., 2024</xref>; <xref ref-type="bibr" rid="r39">Murphy, 2024c</xref>). Our work expands on previous results exposing a stark absence of response stability in large language models (<xref ref-type="bibr" rid="r8">Dentella et al., 2023</xref>). Language models can assign probabilities to strings of words, but grammaticality cannot be construed as a phenomenon of transitional probability extracted from lexical items alone (<xref ref-type="bibr" rid="r21">Lenneberg, 1967</xref>). For this reason, recent advances that dispense with the notion of ‘tokenization’ altogether in favour of seeking ‘Large Concept Models’ grounded in semantic representations may potentially be more preferable in some cases (<xref ref-type="bibr" rid="r17">LCM Team et al., 2024</xref>).</p>
<p>Not only does the o3 model fall short in terms of providing a clear path towards artificial general intelligence (<xref ref-type="bibr" rid="r51">Pfister &amp; Jud, 2025</xref>), it also fails to demonstrate a robust grasp of some of the most fundamental elements of compositional linguistic structures. Our brief report provides further reasons for scepticism towards the claim from Microsoft that OpenAI’s recent models “[attain] a form of general intelligence” and show “sparks of artificial general intelligence” (<xref ref-type="bibr" rid="r5">Bubeck et al., 2023</xref>, p. 92). We find claims from the AI team at Apple more reasonable here: A recent assessment found no evidence of formal reasoning in language models, with the team concluding that their behavior is better explained by sophisticated pattern matching (<xref ref-type="bibr" rid="r28">Mirzadeh et al., 2024</xref>). Consulting some of the explanations for acceptability provided by o3 (e.g., Section 3.7-3.10) also reinforces the assessment that ChatGPT is a professional “bullshitter” (<xref ref-type="bibr" rid="r14">Hicks et al., 2024</xref>), “bloviator” and “a fluent spouter of bullshit” (<xref ref-type="bibr" rid="r27">Marcus &amp; Davis, 2020</xref>).</p>
<p>Interestingly, various advocates and proponents of LLMs have recently argued that linguists who claim that sentences such as ‘Dogs dogs dog dog dogs’ are grammatical are offering a psycholinguistically implausible and unhelpful theory of grammar. And yet, in a twist of irony, according to the present results the most advanced model from OpenAI does not appear to agree with this critique, and is seemingly so eager to attempt to parse these types of structures that it readily determines wholly <italic>ungrammatical</italic> cases (such as the “Glarts…” examples in Prompts 8-9) to be grammatical.</p>
<p>In some of our prompts requesting the generation of ungrammatical structures (Section 3.6) or the assessment of complex embedding (Section 3.4), we suspect that o3 was doubtless influenced by lexical statistics to a much greater extent than by any level of hidden states used to support (some format of) grammatical configuration (a bias already documented for text-to-image models; <xref ref-type="bibr" rid="r20">Leivada et al., 2023b</xref>). Yet, the task at hand was explicitly to invoke higher-order hierarchical representations and attempt to de-noise the relevant assessments from any influence from lexico-semantic statistics.</p></sec>
<sec id="sec4.4"><title>4.4. Theories or Tools?</title>
<disp-quote>
<p>“<italic>The best material model for a cat is another, or preferably the same cat</italic>”.</p>
<p>– <xref ref-type="bibr" rid="r53">Rosenblueth and Wiener (1945)</xref></p></disp-quote>
	<p>The fact that o3 was unable to reliably generate basic violations of syntactic rules should motivate some degree of concern and scepticism towards claims that LLMs do better than linguists on every job that syntactic theory was intended to perform. <xref ref-type="bibr" rid="r64">Ambridge and Blything (2024)</xref> argue that “large language models are better than theoretical linguists at theoretical linguistics”—an assessment at odds with our discovery that the most sophisticated reasoning model from OpenAI deems a number of grammatical sentences to constitute violations of binding theory, amongst other things. As pointed out by others, it is also incoherent to claim that LLMs can directly constitute a “theory of language” (<xref ref-type="bibr" rid="r16">Katzir, 2023</xref>; <xref ref-type="bibr" rid="r31">Müller, 2024</xref>). This type of theory-nihilism (and data-ism) has been bolstered by the recent surge of interest in LLMs, but it has yet to be proven capable of being translated into a concrete scientific research program that can replace dominant theories of language acquisition and processing.</p>
<p>Although <xref ref-type="bibr" rid="r48">Piantadosi (2024)</xref> recently attempted to do to Chomsky what Chomsky did to Skinner in 1959 (i.e., refute his research enterprise and much of its philosophical basis), Piantadosi’s arguments proved to be flawed (<xref ref-type="bibr" rid="r16">Katzir, 2023</xref>)<xref ref-type="fn" rid="fn1"><sup>1</sup></xref><fn id="fn1"><label>1</label>
<p>See also a follow-up debate on this topic: “A conversation on large language models: Murphy &amp; Piantadosi”. ActInf GuestStream 041.1 (23 April 2023). <ext-link ext-link-type="uri" xlink:href="https://youtube.com/watch?v=EEyVd9d3D5U">https://youtube.com/watch?v=EEyVd9d3D5U</ext-link> </p></fn>. As pointed out already by <xref ref-type="bibr" rid="r7">Collins (2024)</xref>:</p>
<disp-quote>
<p>“The fundamental reason that LLMs cannot be scientific theories is <italic>not</italic> because they are probabilistic, or because they involve parameter tuning. Nor even does it have to do with their lack of human intelligibility. As Piantadosi notes, such things are common enough among mature sciences. Rather, the issue is that the representational capacities of LLMs (and their connectionist siblings) are <italic>unbounded</italic> in a way that makes their representations arbitrary”.</p></disp-quote>
<p>As a brief aside, it is worth highlighting in this context that it was the human brain during evolution that <italic>created</italic> syntactic structure (<xref ref-type="bibr" rid="r33">Murphy, 2019</xref>, <xref ref-type="bibr" rid="r35">2020b</xref>, <xref ref-type="bibr" rid="r39">2024c</xref>; <xref ref-type="bibr" rid="r45">Murphy et al., 2022</xref>, <xref ref-type="bibr" rid="r42">2023</xref>, <xref ref-type="bibr" rid="r44">2024b</xref>). LLMs, by contrast, being universal function approximators (<xref ref-type="bibr" rid="r62">Yun et al., 2019</xref>), are surely able to reproduce certain aspects of lexico-semantic statistics from the ‘fossilized’ remains of the human generative machine they recover from data (<xref ref-type="bibr" rid="r29">Mitchell &amp; Krakauer, 2023</xref>). But there are very plausible reasons to assume that whatever method LLMs use, it bears little resemblance to the algorithms deployed by human infants (<xref ref-type="bibr" rid="r19">Leivada &amp; Murphy, 2022</xref>; <xref ref-type="bibr" rid="r41">Murphy et al., 2025</xref>), who deploy specialized knowledge rather than solely invoking general token-prediction algorithms. Due to LLMs being a universal approximation method, they are more akin to tools such as generalized Fourier series than scientific theories of human cognition. Relatedly, distributional semantics vectors can certainly be used as a <italic>proxy</italic> for natural language meanings, but they are not to be confused with “the stuff of thought” itself (<xref ref-type="bibr" rid="r50">Pinker, 2007</xref>). This is not even to mention related concerns that hover in the background, like the fact that the back propagation training algorithms used with LLMs are considerably different from human learning mechanisms (<xref ref-type="bibr" rid="r12">Evanson et al., 2023</xref>).</p></sec>
<sec id="sec4.5"><title>4.5. Design or Data?</title>
<p>Instead of scaling to unprecedented levels of compute via architectures that are fundamentally grounded in token prediction, a return to more traditional design features of the human mind (predicate-argument structure, variable binding, constituent structure, minimal compositional binding; <xref ref-type="bibr" rid="r10">Donatelli &amp; Koller, 2023</xref>) may be needed to orchestrate a more reliable expertise in human language (<xref ref-type="bibr" rid="r52">Ramchand, 2024</xref>). This could be implemented by forms of neuro-symbolic approaches.</p>
<p>Still, it is also certainly true that mainstream theoretical linguistics (e.g., the minimalist enterprise) was in some ways ill-equipped to successfully predict which patterns of linguistic activity might be (un)approachable by LLMs. To illustrate, a potential weakness in this direction with respect to recent generative grammar theorizing has been the underestimation of the extent to which lexical information drives composition. This type of information may permit LLMs to abductively infer certain elements of grammatical rules, in whatever format this ultimately takes (<xref ref-type="bibr" rid="r52">Ramchand, 2024</xref>). Future research should more carefully apply the tools of linguistics to isolate specific sub-components of syntax that might be in principle achievable by language models, given specific design features. For instance, with LLMs “complete recovery of syntax might be very difficult computationally” (<xref ref-type="bibr" rid="r24">Marcolli et al., 2025</xref>, p. 13), even if we assume that attention modules can in principle “satisfy the same algebraic structure” as what Marcolli et al. postulate as being necessary for syntax-semantics interface mappings.</p>
<p>More broadly, the currently popular vector approach (<xref ref-type="bibr" rid="r49">Piantadosi et al., 2024</xref>) risks conflating the <italic>implementation medium</italic> with the <italic>computational level</italic>: for a high-dimensional vector space that supposedly encodes structured symbolic derivations, unless the structure is explicitly recoverable and manipulable then the representation is only functionally equivalent in a loose sense. How does a vector-based learner discover abstract, exceptionless rules without relying on statistical accident? <xref ref-type="bibr" rid="r49">Piantadosi et al. (2024)</xref>, and others, risk explaining compositionality post hoc (“it emerges in the geometry”) rather than as a necessary design property. And it is these necessary algebraic properties of language that linguistic theory tries to capture.</p></sec></sec>
<sec id="sec5" sec-type="conclusions"><title>5. Conclusion</title>
<p>In contrast to some recent claims that we may be living through “the end of (generative) linguistics as we know it” (<xref ref-type="bibr" rid="r6">Chesi, Forthcoming</xref>), our results should spur cognitive scientists, psychologists and philosophers to press even further into the reaches of algorithmic and psycholinguistic models of hierarchical syntactic composition. Some recent directions here come from exploiting concepts from statistical physics (<xref ref-type="bibr" rid="r43">Murphy et al., 2024a</xref>) to uncover previously unknown principles of language design (and to provide a potential meta-language to compare and quantify distinct syntactic theories), and from recent attempts to bridge symbolic theories of language with probabilistic-connectionist models of parsing (<xref ref-type="bibr" rid="r39">Murphy, 2024c</xref>) to offer a neurobiologically plausible infrastructure for syntactic inferences.</p>
<p>The goal here should not be to virtuously resist the era of big data from the safety of our theoretical models of syntax, but to learn how best to properly leverage computational methods – not in order to <italic>surrender</italic> to LLMs (<xref ref-type="bibr" rid="r48">Piantadosi, 2024</xref>) but to <italic>utilize</italic> them (<xref ref-type="bibr" rid="r60">van Rooij et al., 2024</xref>) to assess how statistical and symbolic representations interact during the acquisition and processing of language.</p>

</sec>
</body>
<back>
	
	<sec sec-type="ethics-statement">
		<title>Ethics Statement</title>
		<p>This work did not involve human subject data, and no use of generative AI (outside of the main experiments which directly probed ChatGPT) was involved in the preparation of this manuscript.</p>
	</sec>
	
<ref-list><title>References</title>
	
	<ref id="r64"><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name name-style="western"><surname>Ambridge</surname>, <given-names>B.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Blything</surname>, <given-names>L.</given-names></string-name></person-group> (<year>2024</year>). <article-title>Large language models are better than theoretical linguists at theoretical linguistics.</article-title> <source>Theoretical Linguistics</source>, <volume>50</volume>(<issue>1-2</issue>), <fpage>33</fpage>–<lpage>48</lpage>. <pub-id pub-id-type="doi">10.1515/tl-2024-2002</pub-id></mixed-citation></ref>
	
<ref id="r1"><mixed-citation publication-type="confproc">Amiraz, O. (2022). Not all scalar inferences are alike: the effect of existential presuppositions. In Degano, M., Roberts, T., Sbardolini, G., &amp; Schouwstra, M. (Eds.). <italic>Proceedings of the 2022 Amsterdam Colloquium,</italic> 23, 8–14.</mixed-citation></ref>
<ref id="r2"><mixed-citation publication-type="preprint"><person-group person-group-type="author"><string-name name-style="western"><surname>Baggio</surname>, <given-names>G.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Murphy</surname>, <given-names>E.</given-names></string-name></person-group> (<year>2024</year>). <article-title>On the referential capacity of large language models.</article-title> </mixed-citation></ref>
<ref id="r3"><mixed-citation publication-type="other">Beguš, G., Dąbkowski, M., &amp; Rhodes, R. (2025). <italic>Large linguistic models: Analyzing theoretical linguistic abilities of LLMs</italic>. Lingbuzz. <ext-link ext-link-type="uri" xlink:href="https://lingbuzz.net/007269">https://lingbuzz.net/007269</ext-link></mixed-citation></ref>
<ref id="r4"><mixed-citation publication-type="preprint"><person-group person-group-type="author"><string-name name-style="western"><surname>Besta</surname>, <given-names>M.</given-names></string-name>, <string-name name-style="western"><surname>Barth</surname>, <given-names>J.</given-names></string-name>, <string-name name-style="western"><surname>Schreiber</surname>, <given-names>E.</given-names></string-name>, <string-name name-style="western"><surname>Kubicek</surname>, <given-names>A.</given-names></string-name>, <string-name name-style="western"><surname>Catarino</surname>, <given-names>A.</given-names></string-name>, <string-name name-style="western"><surname>Gerstenberger</surname>, <given-names>R.</given-names></string-name>, <string-name name-style="western"><surname>Nyczyk</surname>, <given-names>P.</given-names></string-name>, <string-name name-style="western"><surname>Iff</surname>, <given-names>P.</given-names></string-name>, <string-name name-style="western"><surname>Li</surname>, <given-names>Y.</given-names></string-name>, <string-name name-style="western"><surname>Houliston</surname>, <given-names>S.</given-names></string-name>, <string-name name-style="western"><surname>Sternal</surname>, <given-names>T.</given-names></string-name>, <string-name name-style="western"><surname>Copik</surname>, <given-names>M.</given-names></string-name>, <string-name name-style="western"><surname>Kwaśniewski</surname>, <given-names>G.</given-names></string-name>, <string-name name-style="western"><surname>Müller</surname>, <given-names>J.</given-names></string-name>, <string-name name-style="western"><surname>Flis</surname>, <given-names>L.</given-names></string-name>, <string-name name-style="western"><surname>Erberhard</surname>, <given-names>H.</given-names></string-name>, <string-name name-style="western"><surname>Chen</surname>, <given-names>U.</given-names></string-name>, <string-name name-style="western"><surname>Niewiadomski</surname>, 
<given-names>H.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Hoefler</surname>, <given-names>T.</given-names></string-name></person-group> (<year>2025</year>). <article-title>Reasoning language models: A blueprint.</article-title> </mixed-citation></ref>
<ref id="r5"><mixed-citation publication-type="preprint"><person-group person-group-type="author"><string-name name-style="western"><surname>Bubeck</surname>, <given-names>S.</given-names></string-name>, <string-name name-style="western"><surname>Chandrasekaran</surname>, <given-names>V.</given-names></string-name>, <string-name name-style="western"><surname>Eldan</surname>, <given-names>R.</given-names></string-name>, <string-name name-style="western"><surname>Gehrke</surname>, <given-names>J.</given-names></string-name>, <string-name name-style="western"><surname>Horvitz</surname>, <given-names>E.</given-names></string-name>, <string-name name-style="western"><surname>Kamar</surname>, <given-names>E.</given-names></string-name>, <string-name name-style="western"><surname>Lee</surname>, <given-names>P.</given-names></string-name>, <string-name name-style="western"><surname>Lee</surname>, <given-names>Y. T.</given-names></string-name>, <string-name name-style="western"><surname>Li</surname>, <given-names>Y.</given-names></string-name>, <string-name name-style="western"><surname>Lundberg</surname>, <given-names>S.</given-names></string-name>, <string-name name-style="western"><surname>Nori</surname>, <given-names>H.</given-names></string-name>, <string-name name-style="western"><surname>Palangi</surname>, <given-names>H.</given-names></string-name>, <string-name name-style="western"><surname>Ribeiro</surname>, <given-names>M. T.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Zhang</surname>, <given-names>Y.</given-names></string-name></person-group> (<year>2023</year>). <article-title>Sparks of artificial general intelligence: Early experiments with GPT-4.</article-title> </mixed-citation></ref>
<ref id="r6"><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name name-style="western"><surname>Chesi</surname>, <given-names>C.</given-names></string-name></person-group> (<comment>Forthcoming</comment>). <article-title>Is it the end of (generative) linguistics as we know it?</article-title> <source>Italian Journal of Linguistics</source>.</mixed-citation></ref>
<ref id="r7"><mixed-citation publication-type="web">Collins, J. (2024). <italic>The simple reason LLMs are not scientific models (and what the alternative is for linguistics).</italic> Lingbuzz. <ext-link ext-link-type="uri" xlink:href="https://lingbuzz.net/008026">https://lingbuzz.net/008026</ext-link></mixed-citation></ref>
<ref id="r8"><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name name-style="western"><surname>Dentella</surname>, <given-names>V.</given-names></string-name>, <string-name name-style="western"><surname>Günther</surname>, <given-names>F.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Leivada</surname>, <given-names>E.</given-names></string-name></person-group> (<year>2023</year>). <article-title>Systematic testing of three Language Models reveals low language accuracy, absence of response stability, and a yes-response bias.</article-title> <source>Proceedings of the National Academy of Sciences of the United States of America</source>, <volume>120</volume>(<issue>51</issue>), <elocation-id>e2309583120</elocation-id>. <pub-id pub-id-type="doi">10.1073/pnas.2309583120</pub-id><pub-id pub-id-type="pmid">38091290</pub-id></mixed-citation></ref>
<ref id="r9"><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name name-style="western"><surname>Dentella</surname>, <given-names>V.</given-names></string-name>, <string-name name-style="western"><surname>Günther</surname>, <given-names>F.</given-names></string-name>, <string-name name-style="western"><surname>Murphy</surname>, <given-names>E.</given-names></string-name>, <string-name name-style="western"><surname>Marcus</surname>, <given-names>G.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Leivada</surname>, <given-names>E.</given-names></string-name></person-group> (<year>2024</year>). <article-title>Testing AI on language comprehension tasks reveals insensitivity to underlying meaning.</article-title> <source>Scientific Reports</source>, <volume>14</volume>, <elocation-id>28083</elocation-id>. <pub-id pub-id-type="doi">10.1038/s41598-024-79531-8</pub-id><pub-id pub-id-type="pmid">39543236</pub-id></mixed-citation></ref>
<ref id="r10"><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name name-style="western"><surname>Donatelli</surname>, <given-names>L.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Koller</surname>, <given-names>A.</given-names></string-name></person-group> (<year>2023</year>). <article-title>Compositionality in computational linguistics.</article-title> <source>Annual Review of Linguistics</source>, <volume>9</volume>, <fpage>463</fpage>–<lpage>481</lpage>. <pub-id pub-id-type="doi">10.1146/annurev-linguistics-030521-044439</pub-id></mixed-citation></ref>
<ref id="r11"><mixed-citation publication-type="book">Evans, G. (1985). Semantic theory and tacit knowledge. In <italic>Collected Papers</italic> (pp. 322–342). Oxford University Press.</mixed-citation></ref>
<ref id="r12"><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name name-style="western"><surname>Evanson</surname>, <given-names>L.</given-names></string-name>, <string-name name-style="western"><surname>Lakretz</surname>, <given-names>Y.</given-names></string-name>, &amp; <string-name name-style="western"><surname>King</surname>, <given-names>J. R.</given-names></string-name></person-group> (<year>2023</year>). <article-title>Language acquisition: Do children and language models follow similar learning stages?</article-title> <source>Findings of the Association for Computational Linguistics: ACL</source>, <volume>2023</volume>, <fpage>12205</fpage>–<lpage>12218</lpage>. <pub-id pub-id-type="doi">10.18653/v1/2023.findings-acl.773</pub-id></mixed-citation></ref>
<ref id="r13"><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name name-style="western"><surname>Gotham</surname>, <given-names>M.</given-names></string-name></person-group> (<year>2017</year>). <article-title>Composing criteria of individuation in copredication.</article-title> <source>Journal of Semantics</source>, <volume>34</volume>(<issue>2</issue>), <fpage>333</fpage>–<lpage>371</lpage>.</mixed-citation></ref>
<ref id="r14"><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name name-style="western"><surname>Hicks</surname>, <given-names>M. T.</given-names></string-name>, <string-name name-style="western"><surname>Humphries</surname>, <given-names>J.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Slater</surname>, <given-names>J.</given-names></string-name></person-group> (<year>2024</year>). <article-title>ChatGPT is bullshit.</article-title> <source>Ethics and Information Technology</source>, <volume>26</volume>(<issue>2</issue>), <fpage>1</fpage>–<lpage>10</lpage>. <pub-id pub-id-type="doi">10.1007/s10676-024-09775-5</pub-id></mixed-citation></ref>
<ref id="r15"><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name name-style="western"><surname>Kamath</surname>, <given-names>G.</given-names></string-name>, <string-name name-style="western"><surname>Schuster</surname>, <given-names>S.</given-names></string-name>, <string-name name-style="western"><surname>Vajjala</surname>, <given-names>S.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Reddy</surname>, <given-names>S.</given-names></string-name></person-group> (<year>2024</year>). <article-title>Scope ambiguities in large language models.</article-title> <source>Transactions of the Association for Computational Linguistics</source>, <volume>12</volume>, <fpage>738</fpage>–<lpage>754</lpage>. <pub-id pub-id-type="doi">10.1162/tacl_a_00670</pub-id></mixed-citation></ref>
<ref id="r16"><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name name-style="western"><surname>Katzir</surname>, <given-names>R.</given-names></string-name></person-group> (<year>2023</year>). <article-title>Why large language models are poor theories of human linguistic cognition: A reply to Piantadosi.</article-title> <source>Biolinguistics</source>, <volume>17</volume>, <elocation-id>e13153</elocation-id>. <pub-id pub-id-type="doi">10.5964/bioling.13153</pub-id></mixed-citation></ref>
<ref id="r17"><mixed-citation publication-type="other">LCM Team, Barrault, L., Duquenne, P., Elbayad, M., Kozhevnikov, A., Alastruey, B., Andrews, P., Coria, M., Couairon, G., Costa-jussà, M. R., Dale, D., Elsahar, H., Heffernan, K., Janeiro, J. M., Tran, T., Ropers, C., Sánchez, E., San Roman, R., Mourachko, A., … &amp; Schwenk, H. (2024). <italic>Large Concept Models: Language modeling in a sentence representation space</italic>. Meta AI.</mixed-citation></ref>
<ref id="r18"><mixed-citation publication-type="preprint"><person-group person-group-type="author"><string-name name-style="western"><surname>Leivada</surname>, <given-names>E.</given-names></string-name>, <string-name name-style="western"><surname>Dentella</surname>, <given-names>V.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Murphy</surname>, <given-names>E.</given-names></string-name></person-group> (<year>2023</year><comment>a</comment>). <article-title>The quo vadis of the relationship between language and large language models.</article-title> </mixed-citation></ref>
<ref id="r19"><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name name-style="western"><surname>Leivada</surname>, <given-names>E.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Murphy</surname>, <given-names>E.</given-names></string-name></person-group> (<year>2022</year>). <article-title>A demonstration of the uncomputability of parametric models of language acquisition and a biologically plausible alternative.</article-title> <source>Language Development Research</source>, <volume>2</volume>(<issue>1</issue>), <fpage>105</fpage>–<lpage>138</lpage>.</mixed-citation></ref>
<ref id="r20"><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name name-style="western"><surname>Leivada</surname>, <given-names>E.</given-names></string-name>, <string-name name-style="western"><surname>Murphy</surname>, <given-names>E.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Marcus</surname>, <given-names>G.</given-names></string-name></person-group> (<year>2023</year><comment>b</comment>). <article-title>DALL·E 2 fails to reliably capture common syntactic processes.</article-title> <source>Social Sciences &amp; Humanities Open</source>, <volume>8</volume>(<issue>1</issue>), <elocation-id>100648</elocation-id>. <pub-id pub-id-type="doi">10.1016/j.ssaho.2023.100648</pub-id></mixed-citation></ref>
<ref id="r21"><mixed-citation publication-type="book">Lenneberg, E. H. (1967). <italic>Biological foundations of language</italic>. John Wiley &amp; Sons.</mixed-citation></ref>
<ref id="r22"><mixed-citation publication-type="thesis">Lindström, A. D. (2024). <italic>Learning, reasoning, and compositional generalisation in Multimodal Language Models</italic> [PhD thesis]. Umeå University.</mixed-citation></ref>
<ref id="r23"><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name name-style="western"><surname>Mahowald</surname>, <given-names>K.</given-names></string-name>, <string-name name-style="western"><surname>Ivanova</surname>, <given-names>A.</given-names></string-name>, <string-name name-style="western"><surname>Blank</surname>, <given-names>I. A.</given-names></string-name>, <string-name name-style="western"><surname>Kanwisher</surname>, <given-names>N.</given-names></string-name>, <string-name name-style="western"><surname>Tenenbaum</surname>, <given-names>J. B.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Fedorenko</surname>, <given-names>E.</given-names></string-name></person-group> (<year>2024</year>). <article-title>Dissociating language and thought in Large Language Models: A cognitive perspective.</article-title> <source>Trends in Cognitive Sciences</source>, <volume>28</volume>(<issue>6</issue>), <fpage>517</fpage>–<lpage>540</lpage>. <pub-id pub-id-type="doi">10.1016/j.tics.2024.01.011</pub-id><pub-id pub-id-type="pmid">38508911</pub-id></mixed-citation></ref>
<ref id="r24"><mixed-citation publication-type="book">Marcolli, M., Chomsky, N., &amp; Berwick, R. C. (2025). <italic>Mathematical structure of syntactic merge: An algebraic model for generative linguistics</italic>. MIT Press.</mixed-citation></ref>
<ref id="r25"><mixed-citation publication-type="other">Marcus, G. (2022, March 10). <italic>Deep learning is hitting a wall</italic>. Nautilus.</mixed-citation></ref>
<ref id="r26"><mixed-citation publication-type="book">Marcus, G. (2024). <italic>Taming Silicon Valley: How we can ensure that AI works for us.</italic> MIT Press.</mixed-citation></ref>
<ref id="r27"><mixed-citation publication-type="other">Marcus, G., &amp; Davis, E. (2020, August 22). <italic>GPT-3, bloviator: OpenAI’s language generator has no idea what it’s talking about</italic>. MIT Technology Review.</mixed-citation></ref>
<ref id="r28"><mixed-citation publication-type="preprint"><person-group person-group-type="author"><string-name name-style="western"><surname>Mirzadeh</surname>, <given-names>I.</given-names></string-name>, <string-name name-style="western"><surname>Alizadeh</surname>, <given-names>K.</given-names></string-name>, <string-name name-style="western"><surname>Shahrokhi</surname>, <given-names>H.</given-names></string-name>, <string-name name-style="western"><surname>Tuzel</surname>, <given-names>O.</given-names></string-name>, <string-name name-style="western"><surname>Bengio</surname>, <given-names>S.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Farajtabar</surname>, <given-names>M.</given-names></string-name></person-group> (<year>2024</year>). <article-title>GSM-Symbolic: Understanding the limitations of mathematical reasoning in large language models.</article-title> </mixed-citation></ref>
<ref id="r29"><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name name-style="western"><surname>Mitchell</surname>, <given-names>M.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Krakauer</surname>, <given-names>D.</given-names></string-name></person-group> (<year>2023</year>). <article-title>The debate over understanding in AI’s large language models.</article-title> <source>Proceedings of the National Academy of Sciences of the United States of America</source>, <volume>120</volume>(<issue>13</issue>), <elocation-id>e2215907120</elocation-id>. <pub-id pub-id-type="doi">10.1073/pnas.2215907120</pub-id><pub-id pub-id-type="pmid">36943882</pub-id></mixed-citation></ref>
<ref id="r30"><mixed-citation publication-type="preprint"><person-group person-group-type="author"><string-name name-style="western"><surname>Mollica</surname>, <given-names>F.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Piantadosi</surname>, <given-names>S.</given-names></string-name></person-group> (<year>2022</year>). <article-title>Meaning without reference in large language models.</article-title> </mixed-citation></ref>
<ref id="r31"><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name name-style="western"><surname>Müller</surname>, <given-names>S.</given-names></string-name></person-group> (<year>2024</year>). <article-title>Large language models: The best linguistic theory, a wrong linguistic theory, or no linguistic theory at all.</article-title> <source>Zeitschrift für Sprachwissenschaft</source>, <volume>44</volume>(<issue>1</issue>). <pub-id pub-id-type="doi">10.18148/zs/2025-2001</pub-id></mixed-citation></ref>
<ref id="r32"><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name name-style="western"><surname>McCarty</surname>, <given-names>M. J.</given-names></string-name>, <string-name name-style="western"><surname>Murphy</surname>, <given-names>E.</given-names></string-name>, <string-name name-style="western"><surname>Scherschligt</surname>, <given-names>X.</given-names></string-name>, <string-name name-style="western"><surname>Woolnough</surname>, <given-names>O.</given-names></string-name>, <string-name name-style="western"><surname>Morse</surname>, <given-names>C. W.</given-names></string-name>, <string-name name-style="western"><surname>Snyder</surname>, <given-names>K.</given-names></string-name>, <string-name name-style="western"><surname>Mahon</surname>, <given-names>B. Z.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Tandon</surname>, <given-names>N.</given-names></string-name></person-group> (<year>2023</year>). <article-title>Intraoperative cortical localization of music and language reveals signatures of structural complexity in posterior temporal cortex.</article-title> <source>iScience</source>, <volume>26</volume>, <elocation-id>107223</elocation-id>. <pub-id pub-id-type="doi">10.1016/j.isci.2023.107223</pub-id><pub-id pub-id-type="pmid">37485361</pub-id></mixed-citation></ref>
<ref id="r33"><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name name-style="western"><surname>Murphy</surname>, <given-names>E.</given-names></string-name></person-group> (<year>2019</year>). <article-title>No country for Oldowan men: Emerging factors in language evolution.</article-title> <source>Frontiers in Psychology</source>, <volume>10</volume>, <elocation-id>1448</elocation-id>. <pub-id pub-id-type="doi">10.3389/fpsyg.2019.01448</pub-id><pub-id pub-id-type="pmid">31275219</pub-id></mixed-citation></ref>
<ref id="r34"><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name name-style="western"><surname>Murphy</surname>, <given-names>E.</given-names></string-name></person-group> (<year>2020</year><comment>a</comment>). <article-title>Language design and communicative competence: the minimalist perspective.</article-title> <source>Glossa: A Journal of General Linguistics</source><italic>,</italic> <volume>5</volume>(<issue>1</issue>), <elocation-id>2</elocation-id>. <pub-id pub-id-type="doi">10.5334/gjgl.1081</pub-id></mixed-citation></ref>
<ref id="r35"><mixed-citation publication-type="book">Murphy, E. (2020b). <italic>The oscillatory nature of language</italic>. Cambridge University Press.</mixed-citation></ref>
<ref id="r36"><mixed-citation publication-type="thesis">Murphy, E. (2021). <italic>Linguistic representation and processing of copredication</italic> [PhD thesis]. University College London.</mixed-citation></ref>
<ref id="r37"><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name name-style="western"><surname>Murphy</surname>, <given-names>E.</given-names></string-name></person-group> (<year>2024</year><comment>a</comment>). <article-title>Predicate order and coherence in copredication.</article-title> <source>Inquiry</source>, <volume>67</volume>(<issue>6</issue>), <fpage>1744</fpage>–<lpage>1780</lpage>. <pub-id pub-id-type="doi">10.1080/0020174X.2021.1958054</pub-id></mixed-citation></ref>
<ref id="r38"><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name name-style="western"><surname>Murphy</surname>, <given-names>E.</given-names></string-name></person-group> (<year>2024</year><comment>b</comment>). <article-title>ROSE: A neurocomputational architecture for syntax.</article-title> <source>Journal of Neurolinguistics</source>, <volume>70</volume>, <elocation-id>101180</elocation-id>. <pub-id pub-id-type="doi">10.1016/j.jneuroling.2023.101180</pub-id></mixed-citation></ref>
<ref id="r39"><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name name-style="western"><surname>Murphy</surname>, <given-names>E.</given-names></string-name></person-group> (<year>2024</year><comment>c</comment>). <article-title>ROSE: A universal neural grammar.</article-title> <source>Cognitive Neuroscience</source>. <comment>Advance online publication</comment>. <pub-id pub-id-type="doi">10.1080/17588928.2025.2523875</pub-id><pub-id pub-id-type="pmid">40653898</pub-id></mixed-citation></ref>
<ref id="r40"><mixed-citation publication-type="book">Murphy, E. (2025). The nature of language and the structure of reality. In Benítez-Burraco, A., López, I. F., Fernández-Pérez, M., &amp; Ivanova, O. (Eds.), <italic>Biolinguistics at the cutting edge: Promises, achievements, and challenges</italic> (pp. 207–236). De Gruyter Mouton.</mixed-citation></ref>
<ref id="r41"><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name name-style="western"><surname>Murphy</surname>, <given-names>E.</given-names></string-name>, <string-name name-style="western"><surname>de Villiers</surname>, <given-names>J.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Morales</surname>, <given-names>S. L.</given-names></string-name></person-group> (<year>2025</year>). <article-title>A comparative investigation into compositional syntax and semantics in DALL⋅E and young children.</article-title> <source>Social Sciences &amp; Humanities Open</source>, <volume>11</volume>, <elocation-id>101332</elocation-id>. <pub-id pub-id-type="doi">10.1016/j.ssaho.2025.101332</pub-id></mixed-citation></ref>
<ref id="r42"><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name name-style="western"><surname>Murphy</surname>, <given-names>E.</given-names></string-name>, <string-name name-style="western"><surname>Forseth</surname>, <given-names>K. J.</given-names></string-name>, <string-name name-style="western"><surname>Donos</surname>, <given-names>C.</given-names></string-name>, <string-name name-style="western"><surname>Snyder</surname>, <given-names>K. M.</given-names></string-name>, <string-name name-style="western"><surname>Rollo</surname>, <given-names>P. S.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Tandon</surname>, <given-names>N.</given-names></string-name></person-group> (<year>2023</year>). <article-title>The spatiotemporal dynamics of semantic integration in the human brain.</article-title> <source>Nature Communications</source>, <volume>14</volume>, <elocation-id>6336</elocation-id>. <pub-id pub-id-type="doi">10.1038/s41467-023-42087-8</pub-id><pub-id pub-id-type="pmid">37875526</pub-id></mixed-citation></ref>
<ref id="r43"><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name name-style="western"><surname>Murphy</surname>, <given-names>E.</given-names></string-name>, <string-name name-style="western"><surname>Holmes</surname>, <given-names>E.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Friston</surname>, <given-names>K.</given-names></string-name></person-group> (<year>2024</year><comment>a</comment>). <article-title>Natural language syntax complies with the free-energy principle.</article-title> <source>Synthese</source>, <volume>203</volume>, <elocation-id>154</elocation-id>. <pub-id pub-id-type="doi">10.1007/s11229-024-04566-3</pub-id><pub-id pub-id-type="pmid">38706520</pub-id></mixed-citation></ref>
<ref id="r44"><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name name-style="western"><surname>Murphy</surname>, <given-names>E.</given-names></string-name>, <string-name name-style="western"><surname>Rollo</surname>, <given-names>P. S.</given-names></string-name>, <string-name name-style="western"><surname>Segaert</surname>, <given-names>K.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Hagoort</surname>, <given-names>P.</given-names></string-name></person-group> (<year>2024</year><comment>b</comment>). <article-title>Multiple dimensions of syntactic structure are resolved earliest in posterior temporal cortex.</article-title> <source>Progress in Neurobiology</source>, <volume>241</volume>, <elocation-id>102669</elocation-id>. <pub-id pub-id-type="doi">10.1016/j.pneurobio.2024.102669</pub-id><pub-id pub-id-type="pmid">39332803</pub-id></mixed-citation></ref>
<ref id="r45"><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name name-style="western"><surname>Murphy</surname>, <given-names>E.</given-names></string-name>, <string-name name-style="western"><surname>Woolnough</surname>, <given-names>O.</given-names></string-name>, <string-name name-style="western"><surname>Rollo</surname>, <given-names>P. S.</given-names></string-name>, <string-name name-style="western"><surname>Roccaforte</surname>, <given-names>Z.</given-names></string-name>, <string-name name-style="western"><surname>Segaert</surname>, <given-names>K.</given-names></string-name>, <string-name name-style="western"><surname>Hagoort</surname>, <given-names>P.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Tandon</surname>, <given-names>N.</given-names></string-name></person-group> (<year>2022</year>). <article-title>Minimal phrase composition revealed by intracranial recordings.</article-title> <source>The Journal of Neuroscience</source>, <volume>42</volume>(<issue>15</issue>), <fpage>3216</fpage>–<lpage>3227</lpage>. <pub-id pub-id-type="doi">10.1523/JNEUROSCI.1575-21.2022</pub-id><pub-id pub-id-type="pmid">35232761</pub-id></mixed-citation></ref>
<ref id="r46"><mixed-citation publication-type="web">OpenAI. (2025, January 31). <italic>OpenAI o3-mini</italic>. <ext-link ext-link-type="uri" xlink:href="https://openai.com">https://openai.com</ext-link></mixed-citation></ref>
<ref id="r47"><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name name-style="western"><surname>Perkins</surname>, <given-names>L.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Lidz</surname>, <given-names>J.</given-names></string-name></person-group> (<year>2021</year>). <article-title>Eighteen-month-old infants represent nonlocal syntactic dependencies.</article-title> <source>Proceedings of the National Academy of Sciences of the United States of America</source>, <volume>118</volume>(<issue>41</issue>), <elocation-id>e2026469118</elocation-id>. <pub-id pub-id-type="doi">10.1073/pnas.2026469118</pub-id><pub-id pub-id-type="pmid">34607945</pub-id></mixed-citation></ref>
<ref id="r48"><mixed-citation publication-type="book">Piantadosi, S. T. (2024). Modern language models refute Chomsky’s approach to language. In Gibson, E., &amp; Poliak, M. (Eds.), <italic>From Fieldwork to Linguistic Theory: A Tribute to Daniel Everett</italic> (pp. 353–414). Language Science Press.</mixed-citation></ref>
<ref id="r49"><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name name-style="western"><surname>Piantadosi</surname>, <given-names>S. T.</given-names></string-name>, <string-name name-style="western"><surname>Muller</surname>, <given-names>D. C. Y.</given-names></string-name>, <string-name name-style="western"><surname>Rule</surname>, <given-names>J. S.</given-names></string-name>, <string-name name-style="western"><surname>Kaushik</surname>, <given-names>K.</given-names></string-name>, <string-name name-style="western"><surname>Gorenstein</surname>, <given-names>M.</given-names></string-name>, <string-name name-style="western"><surname>Leib</surname>, <given-names>E. R.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Sanford</surname>, <given-names>E.</given-names></string-name></person-group> (<year>2024</year>). <article-title>Why concepts are (probably) vectors.</article-title> <source>Trends in Cognitive Sciences</source>, <volume>28</volume>(<issue>9</issue>), <fpage>844</fpage>–<lpage>856</lpage>. <pub-id pub-id-type="doi">10.1016/j.tics.2024.06.011</pub-id><pub-id pub-id-type="pmid">39112125</pub-id></mixed-citation></ref>
<ref id="r50"><mixed-citation publication-type="book">Pinker, S. (2007). <italic>The stuff of thought: Language as a window into human nature</italic>. Penguin.</mixed-citation></ref>
<ref id="r51"><mixed-citation publication-type="preprint"><person-group person-group-type="author"><string-name name-style="western"><surname>Pfister</surname>, <given-names>R.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Jud</surname>, <given-names>H.</given-names></string-name></person-group> (<year>2025</year>). <article-title>Understanding and benchmarking artificial general intelligence: OpenAI’s o3 is not AGI.</article-title> </mixed-citation></ref>
<ref id="r52"><mixed-citation publication-type="web">Ramchand, G. (2024). <italic>On LLMs, generative grammar, and how we need theory more than ever</italic>. Lingbuzz. <ext-link ext-link-type="uri" xlink:href="https://lingbuzz.net/008643">https://lingbuzz.net/008643</ext-link></mixed-citation></ref>
<ref id="r53"><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name name-style="western"><surname>Rosenblueth</surname>, <given-names>A.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Wiener</surname>, <given-names>N.</given-names></string-name></person-group> (<year>1945</year>). <article-title>The role of models in science.</article-title> <source>Philosophy of Science</source>, <volume>12</volume>(<issue>4</issue>), <fpage>316</fpage>–<lpage>321</lpage>. <pub-id pub-id-type="doi">10.1086/286874</pub-id></mixed-citation></ref>
<ref id="r54"><mixed-citation publication-type="preprint"><person-group person-group-type="author"><string-name name-style="western"><surname>Russin</surname>, <given-names>J.</given-names></string-name>, <string-name name-style="western"><surname>McGrath</surname>, <given-names>S. W.</given-names></string-name>, <string-name name-style="western"><surname>Williams</surname>, <given-names>D. J.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Elber-Dorozko</surname>, <given-names>L.</given-names></string-name></person-group> (<year>2024</year>). <article-title>From Frege to ChatGPT: compositionality in language, cognition, and deep neural networks.</article-title> </mixed-citation></ref>
<ref id="r55"><mixed-citation publication-type="preprint"><person-group person-group-type="author"><string-name name-style="western"><surname>Schaeffer</surname>, <given-names>R.</given-names></string-name>, <string-name name-style="western"><surname>Miranda</surname>, <given-names>B.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Koyejo</surname>, <given-names>S.</given-names></string-name></person-group> (<year>2023</year>). <article-title>Are emergent abilities of large language models a mirage?</article-title> </mixed-citation></ref>
<ref id="r56"><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name name-style="western"><surname>Sprouse</surname>, <given-names>J.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Almeida</surname>, <given-names>D.</given-names></string-name></person-group> (<year>2012</year>). <article-title>Assessing the reliability of textbook data in syntax: Adger’s Core Syntax.</article-title> <source>Journal of Linguistics</source>, <volume>48</volume>, <fpage>609</fpage>–<lpage>652</lpage>. <pub-id pub-id-type="doi">10.1017/S0022226712000011</pub-id></mixed-citation></ref>
<ref id="r57"><mixed-citation publication-type="book">Tesnière, L. (1959). <italic>Éléments de syntaxe structurale</italic>. Librairie C. Klincksieck. Republished as <italic>Elements of Structural Syntax</italic>. Translated by Timothy Osborne and Sylvain Kahane. John Benjamins.</mixed-citation></ref>
<ref id="r58"><mixed-citation publication-type="preprint"><person-group person-group-type="author"><string-name name-style="western"><surname>Tjuatja</surname>, <given-names>L.</given-names></string-name>, <string-name name-style="western"><surname>Neubig</surname>, <given-names>G.</given-names></string-name>, <string-name name-style="western"><surname>Linzen</surname>, <given-names>T.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Hao</surname>, <given-names>S.</given-names></string-name></person-group> (<year>2024</year>). <article-title>What goes into a LM acceptability judgment? Rethinking the impact of frequency and length.</article-title> </mixed-citation></ref>
<ref id="r59"><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name name-style="western"><surname>Toosarvandani</surname>, <given-names>M.</given-names></string-name></person-group> (<year>2014</year>). <article-title>Contrast and the structure of discourse.</article-title> <source>Semantics and Pragmatics</source>, <volume>7</volume>, <elocation-id>4</elocation-id>. <pub-id pub-id-type="doi">10.3765/sp.7.4</pub-id></mixed-citation></ref>
<ref id="r60"><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name name-style="western"><surname>van Rooij</surname>, <given-names>I.</given-names></string-name>, <string-name name-style="western"><surname>Guest</surname>, <given-names>O.</given-names></string-name>, <string-name name-style="western"><surname>Adolfi</surname>, <given-names>F.</given-names></string-name>, <string-name name-style="western"><surname>de Haan</surname>, <given-names>R.</given-names></string-name>, <string-name name-style="western"><surname>Kolokova</surname>, <given-names>A.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Rich</surname>, <given-names>P.</given-names></string-name></person-group> (<year>2024</year>). <article-title>Reclaiming AI as a theoretical tool for cognitive science.</article-title> <source>Computational Brain &amp; Behavior</source>, <volume>7</volume>, <fpage>616</fpage>–<lpage>636</lpage>. <pub-id pub-id-type="doi">10.1007/s42113-024-00217-5</pub-id></mixed-citation></ref>
<ref id="r61"><mixed-citation publication-type="web">Wu, D. (2025). <italic>Constituent negation requires entailment of an alternative</italic>. Lingbuzz. <ext-link ext-link-type="uri" xlink:href="https://lingbuzz.net/008781">https://lingbuzz.net/008781</ext-link></mixed-citation></ref>
<ref id="r62"><mixed-citation publication-type="preprint"><person-group person-group-type="author"><string-name name-style="western"><surname>Yun</surname>, <given-names>C.</given-names></string-name>, <string-name name-style="western"><surname>Bhojanapalli</surname>, <given-names>S.</given-names></string-name>, <string-name name-style="western"><surname>Rawat</surname>, <given-names>A. S.</given-names></string-name>, <string-name name-style="western"><surname>Reddi</surname>, <given-names>S. J.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Kumar</surname>, <given-names>S.</given-names></string-name></person-group> (<year>2019</year>). <article-title>Are transformers universal approximators of sequence-to-sequence functions?</article-title> </mixed-citation></ref>
<ref id="r63"><mixed-citation publication-type="confproc">Zhao, J., &amp; Zhang, X. (2024). Large Language Model is not a (multilingual) compositional relation reasoner. <italic>Proceedings of the First Conference on Language Modeling</italic>. Philadelphia, United States.</mixed-citation></ref>
</ref-list>
	<sec sec-type="data-availability" id="das"><title>Data Availability</title>
		<p>Prompts and responses from o3 are available in Supplementary Materials (see <xref ref-type="bibr" rid="sp1_r1">Murphy et al., 2025</xref>).</p>
	</sec>	

	
	
	
	<sec sec-type="supplementary-material" id="sp1"><title>Supplementary Materials</title>
		<p>For this article, prompts and responses from o3 are available as Supplementary Materials (see <xref ref-type="bibr" rid="sp1_r1">Murphy et al., 2025</xref>).</p>
		<ref-list content-type="supplementary-material" id="suppl-ref-list">
			<ref id="sp1_r1">
				<mixed-citation publication-type="supplementary-material">
					<person-group person-group-type="author">
							<name name-style="western">
								<surname>Murphy</surname>
								<given-names>E.</given-names>
							</name>
							<name name-style="western">
								<surname>Leivada</surname>
								<given-names>E.</given-names>
							</name>
							<name name-style="western">
								<surname>Dentella</surname>
								<given-names>V.</given-names>
							</name>
							<name name-style="western">
								<surname>Montero</surname>
								<given-names>R.</given-names>
							</name>
							<name name-style="western">
								<surname>Günther</surname>
								<given-names>F.</given-names>
							</name>
							<name name-style="western">
								<surname>Marcus</surname>
								<given-names>G.</given-names>
							</name>
					</person-group> (<year>2025</year>). <source>Supplementary materials to "Fundamental principles of linguistic structure are not represented by ChatGPT"</source> <comment>[Data]</comment>. <publisher-name>PsychOpen GOLD</publisher-name>. <pub-id pub-id-type="doi" xlink:href="https://doi.org/10.23668/psycharchives.21439">10.23668/psycharchives.21439</pub-id>		
				</mixed-citation>
			</ref>
		</ref-list>
	</sec>
			

<fn-group>
<fn fn-type="financial-disclosure"><p>The authors have no funding to report.</p></fn>
</fn-group>
<fn-group>
<fn fn-type="conflict"><p>The authors have declared that no competing interests exist.</p></fn>
</fn-group>
<ack>
<p>The authors have no additional (i.e., non-financial) support to report.</p>
</ack>
</back>
</article>
