1. Introduction
Large language models—deep neural nets trained on next-word prediction over large corpora of text—have proven capable of capturing the complex sequential statistics of written text, generating output largely free of obvious grammatical errors (Besta et al., 2025; Lindström, 2024; Russin et al., 2024; Zhao & Zhang, 2024). This has spurred many to deem them capable of human-like compositionality, in particular with respect to syntax-semantics (Mahowald et al., 2024). Some have even claimed that “large language models are better than theoretical linguists at theoretical linguistics” (Ambridge & Blything, 2024), and that we are facing “the end of (generative) linguistics as we know it” (Chesi, Forthcoming) (although Chesi qualifies that many modern approaches to generative grammar are arguably just as architecturally opaque as LLMs). This would be an extremely consequential state of affairs—if it can be shown to be true. Yet, much recent work indicates that these models merely emulate human language (Dentella et al., 2023, 2024; Katzir, 2023; Schaeffer et al., 2023) rather than possessing human-like syntactic competence.
In this report, the most recent reasoning model from OpenAI (o3-mini-high) is evaluated for its ability to assess and generate compositional representations. o3, like other ‘reasoning models’, is based on large language models but includes additional modules to improve certain computational functions and multi-step logical reasoning. Others have already expressed scepticism about the promise of o3. For example, its recent high performance on the ARC-AGI test “is not due to intelligence but due to the application of knowledge and computing resources that together enable an effective search in the given space of possible solutions” (Pfister & Jud, 2025). We agree in principle with the assessment in Mollica and Piantadosi (2022) that “Linguistic corpora are a low-dimensional projection of both syntax and thought, so it is not implausible that a smart learning system could recover at least some aspects of these cognitive systems from watching text alone”. The critical challenge, as ever, is to demonstrate this capacity empirically.
In our report, we document a number of basic flaws in the linguistic capabilities of o3, pertaining to fundamental properties of basic sentence structure building and semantic evaluation.
2. Method
We identified a number of basic linguistic processes, and a number of more hierarchically complex computations, to subject to direct investigation. o3-mini (OpenAI, 2025) was prompted via OpenAI’s API with the reasoning effort set to ‘high’ (as of September 2025, OpenAI does not expose a user-specifiable temperature parameter for this model). To maximize reproducibility, an integer seed was specified for each prompt and the system fingerprint field of each response was saved. Lastly, to explore the consistency of the responses, the model was asked the same prompt three times. Prompts are directly reproduced in highlighted red boxes, and responses are directly reproduced below. Given the preliminary nature of our experimental report, we refrain from conducting analyses such as logistic regressions and provide only elementary descriptive statistics. More systematic analyses will be forthcoming in future work.
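A minimal sketch of this prompting setup is given below, assuming the OpenAI Python SDK; the model name, seed value, and prompt text are illustrative placeholders rather than the exact values used in our runs.

```python
# Minimal sketch of the prompting procedure described above (assumes the
# OpenAI Python SDK v1.x and an OPENAI_API_KEY in the environment).
# The seed and prompt are illustrative placeholders.
from openai import OpenAI

client = OpenAI()

PROMPT = "Is 'Dogs dogs dog dog dogs' grammatical?"

for run in range(3):  # each prompt was issued three times to probe consistency
    response = client.chat.completions.create(
        model="o3-mini",
        reasoning_effort="high",  # o3-mini exposes reasoning effort, not temperature
        seed=1234,                # integer seed specified for each prompt
        messages=[{"role": "user", "content": PROMPT}],
    )
    # Save the system fingerprint alongside each response, since a seed alone
    # does not guarantee identical completions across backend changes.
    print(run, response.system_fingerprint)
    print(response.choices[0].message.content)
```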
3. Results
We begin with tests that proved unproblematic for the model (Prompts 1-7), before moving to more complex tests that proved problematic (Prompts 8-26).
3.1. Linear Order and Basic Dependencies
Starting with some basic linear-order computations that do not involve higher-order compositional reasoning, o3-mini-high successfully generated the following responses.
Prompt 1
Generate a palindrome that uses the word ‘knight’.
It also passed the ‘strawberry test’ (Prompt 2).
Prompt 2
How many occurrences of the letter ‘r’ are there in ‘strawberry’?
The model was prompted with the following series of requests, and delivered reasonable responses for all (Prompts 3-7).
Prompt 3
In the sentence ‘Professors were mean but teachers were nice, they were likely moody’, who does ‘they’ refer to?
Prompt 4
In the sentence ‘Teachers were nice but professors were mean, they were likely moody’, who does ‘they’ refer to?
Prompt 5
In the sentence ‘Bill was happy but Mary was sad, he was probably overworked’, who does ‘he’ refer to?
Prompt 6
Does this sentence make sense to you?
3.2. Phrase Structure
Next, the model was tested for basic phrase structure representations.
Prompt 7
Is ‘Dogs dogs dog dog dogs’ grammatical?
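For reference, the standard parse of this well-known construction can be bracketed as follows (our own illustration, not part of the prompt or the model’s output); the sentence means ‘Dogs that dogs dog, themselves dog dogs’:

$$[_{\mathrm{NP}}\ \text{Dogs}\ [_{\mathrm{RC}}\ (\text{that})\ \text{dogs dog}\,]\,]\ \ [_{\mathrm{VP}}\ \text{dog}\ \text{dogs}\,]$$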
An invented pseudoword was then substituted for dog/dogs (Prompt 8). When presented with an ungrammatical structure (a superfluous ‘glarts’ was added to the grammatical 5-word formula above), the model incorrectly claimed that this was grammatical. The reasoning provided was fallacious, confusing the role of the middle words and misunderstanding the role of the final words.
Prompt 8
Pretend that ‘glart’ is a word that refers to a group of alien creatures, and can also refer to the action of pleasing. In this context, is ‘Glarts glarts glart glart glarts glarts’ grammatical?
When prompted with an even more preposterous example (adding three additional “glarts” to the end of the initially grammatical “Glarts glarts glart glart glarts”), the model generated an inaccurate tree structure that was not faithful to the string input (by mistakenly including more than two instances of “glart”) and declared it to be grammatical.
Prompt 9
Given the same context as above, is ‘Glarts glarts glart glart glarts glarts glarts glarts’ grammatical?
3.3. Escher Sentences
Next, we turned to comparative sentences involving semantically illegal cardinality comparisons (sometimes termed ‘Escher sentences’). o3-mini-high failed to detect the comparative illusion, noting only the structural acceptability of these sentences despite their being ungrammatical.
Prompt 10
Is the sentence ‘Fewer athletes have been to Beijing than I have’ acceptable?
Prompt 11
Is the sentence ‘More women have finished university than he has’ acceptable?
3.4. Center-Embedding
We tested center-embedding acceptability. The model failed to detect ungrammaticality due to a missing verb (or superfluous Noun Phrase). The reasoning provided was flawed and included some hallucination of pronominal elements (although the model helpfully does not recommend this sentence “for everyday use”!).
Prompt 12
Is ‘The doctor the nurse the hospital had hired met John?’ acceptable?
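To make the missing-verb diagnosis explicit (our own bracketing, not the model’s output): the string contains three subject NPs (‘the doctor’, ‘the nurse’, ‘the hospital’) but only two predicates (‘had hired’, ‘met’), leaving the middle relative clause without a verb:

$$\text{The doctor}\ [\ \text{the nurse}\ [\ \text{the hospital had hired}\ ]\ \underline{\hspace{1.5em}}\ ]\ \text{met John}$$

A grammatical counterpart would fill the gap, e.g., ‘The doctor [the nurse [the hospital had hired] paged] met John’.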
With the next prompt below, the model erroneously injects an additional ‘met’ not present in the prompt.
Prompt 13
Draw me a syntactic tree structure, in line with Minimalist syntax, for the sentence ‘The doctor the nurse the hospital had hired met John’.
The syntactic tree structure generated above does not successfully map onto the interpretation seemingly intended by the model’s output.
3.5. Paradox
When pushed with the prompt below – a more audacious semantic task – the response does not deliver on its stated reasoning plan to inject ‘ambiguity’ and ‘intricacy’. There is no element of surprisal or contradiction beyond the standard liar’s paradox (i.e., “This statement is false”). Of course, it is perhaps unfair and unreasonable to expect the model to create a paradox “more complex and sophisticated than all existing paradoxes or puzzles”, but the point to be made here concerns the means through which the model judged and reasoned about its attempt.
Prompt 14
Create for me a semantic or philosophical paradox that is more complex and sophisticated than all existing paradoxes or puzzles.
3.6. Generation of Syntactic Violations
When asked to generate an ungrammatical sentence (an easy task for humans – one could simply recite a list of random words), the model struggled in a rather peculiar way. It tried to generate a semantically anomalous output, but the sentence that was generated was entirely grammatical and acceptable. The final step in its chain of reasoning claimed to ensure the presence of “a series of ungrammatical, confusing phrases”—which were not present in the output. The model’s output may be rather Joycean and surreal in its content, but it does not satisfy the clear instruction to be ungrammatical.
Prompt 15
Can you generate for me a complex sentence that merges multiple modifiers and clauses into an ungrammatical output. Please ensure that the sentence is ungrammatical with no room for ambiguity.
The model also failed to provide a basic example of an ungrammatical sentence in line with a rather direct task of violation-formation.
Prompt 16
Generate an English sentence that violates a recursive application of a grammatical rule. Please choose any syntactic rule you like.
The model provided a grammatical English sentence, with fallacious reasoning as to its putative unacceptability (“Who did who see?” is a common multiple wh-question seeking the agent and participant of a seeing event). Technically, the model makes a valid point about the presence of multiple wh-operators often leading to illegal read-outs, but failed to then reflect on the other possible readings of the simple four-word string it outputted and claimed to be fundamentally ungrammatical. This provides a more stringent test for (the lack of) compositional syntax than the more common tests recently used that simply task language models with dispassionately generating strings of discourse with certain stylistic qualities (Piantadosi, 2024).
3.7. Generation of Multiple Syntactic Violations
Next, o3-mini-high failed in a number of ways with the following prompts designed to test the parsing of multiple, related syntactic representations.
Prompt 17
Generate two sentences. The first sentence must contain one type of syntactic violation. The second sentence must continue the discourse content from the first, but must contain a different type of syntactic violation that explicitly is caused by some type of relation or connection with the first sentence. Draw a Minimalist tree structure to map the explicit coordination of these multiple error types.
The model fails to account for the fact that Sentence 1 (‘The pair of scholars debate their thesis in a hurried conference’) is perfectly grammatical under standard English present tense. It focused only on how ‘pair’ is singular and so ‘would require “debates”’ – seemingly incapable of parsing interactional syntactic dynamics that require multiple steps to construct and evaluate against various possible semantic interpretations. Instead, it seemed limited to evaluating syntactic violations on a mono-configurational basis, failing to reflect on how one possible violation type could directly lead to multiple different types of acceptability under standard English syntax. In other words, humans would readily notice that while Sentence 1 may technically violate one typically expected form of agreement relation, this does not preclude the string from being subject to a wholly standard and acceptable interpretation.
Next, consider Sentence 2 (‘Owing to this faulty construction, themselves misinterpreting the rule from the previous discussion, the committee postponed the session’). This sentence is also (awkwardly) grammatical under basic movement applications allowing ‘themselves’ to be interpreted with ‘the committee’. Interestingly, it also appears here that the content of the prompt has influenced the semantics of Sentence 2—which makes reference to some form of rule misinterpretation. The model seems incapable of abstracting away from the basic instruction to generate syntactic violations and provide a semantic representation that is wholly independent from aspects of statistical inferences made from the prompt. On top of this, Sentence 1 and 2 do not in fact form a coherent discourse continuation, as explicitly requested in the prompt.
The accompanying tree structure that was generated does not accurately represent the semantics of the two separate sentences, and appears to try to represent ‘postponed the session’ without any clear syntactic categorization.
Note also that the final explanation for these sentences focuses explicitly on the basic possible agreement relation between two discrete elements (‘The pair’ and ‘themselves’), rather than taking a more global syntactic assessment of the role of these two elements in the context of their respective syntactic structures. Not only does the model fail to generate clear syntactic violations, it also fails to provide a level of discourse coherence that is independent of the semantics of the prompt.
When these types of errors were presented to the model (Prompt 18), it provided two sentences that did indeed exhibit a coherent discourse relation. However, it still failed to generate a syntactic violation in Sentence 2 that relied on explicit properties of Sentence 1.
Prompt 18
Both Sentence 1 and Sentence 2 are grammatical English sentences. For example, Sentence 1 means ‘There are two scholars and they are presently debating their thesis’. Sentence 2 means that the committee - who were misinterpreting the rule from the previous discussion - postponed the session, and that this was due to ‘this faulty construction’. It is also unclear what the discourse relation is between Sentence 1 and 2. Sentence 1 is about monks and theses, and Sentence 2 is about committees and constructions.
The model highlighted how ‘This reverberation’ in Sentence 2 is related in meaning to the previous sentence—which is irrelevant to the requested task of relating the syntactic violation itself (and not just the semantics) in Sentence 2 back to Sentence 1 (recall that Prompt 17 requests “a different type of syntactic violation that explicitly is caused by some type of relation or connection with the first sentence”; the request here is that the violation itself is causally driven by properties of the first sentence, and not simply linked to its meaning).
When these further errors were presented to the model, it ultimately succeeded in generating two separate types of syntactic violations for Sentences 1 and 2. Yet, while the discourse relation between the sentences was salient, the syntactic violation in Sentence 2 still did not satisfy the request of being directly linked to properties of Sentence 1 (this could easily have been achieved via Binding restrictions or φ-feature violations, for example). The tree structure provided was also insufficiently transparent as to the core syntactic relations between elements.
Prompt 19
You have simply repeated the same type of violation across both sentences - you have not generated a second sentence whose violation is directly linked to properties of the first sentence.
When the model was once again corrected on this point, it provided two sentences that had the same type of syntactic violations (rather than different types), and the violation in Sentence 2 was again only related to the meaning of Sentence 1 but had zero connection to its syntactic configuration.
Prompt 20
You have only linked Sentence 2’s violation back to discourse features of Sentence 1. I would like you to generate a violation in Sentence 2 that is linked to syntactic properties of Sentence 1.
The model believed that this was a success because the Sentence 2 violation ‘is directly inherited from a syntactic property (the double wh‐extraction) introduced in Sentence 1’ – even though the extraction in Sentence 2 is purely bound to properties of Sentence 2 itself, with no connection to syntactic features in Sentence 1. While the presence of ‘his’ in Sentence 2 does indeed refer back to ‘The senator’, the wh-extraction constitutes a violation for independent reasons, and so does not satisfy the requests of (i) generating two different types of syntactic violations, and (ii) forming the second violation via a direct connection to syntactic properties of Sentence 1.
To summarize this line of inquiry, we provided six successive prompts in total (over Sections 3.6-3.7) requesting types of violations, and in Table 1 we summarize the model’s success in satisfying these requests as they pertain to elements of structure and meaning.
Table 1
Representation of the Success of o3-Mini-High in Generating Different Types of Syntactic Violations; ‘No’ and ‘Yes’ Indicate Failure or Success
| Prompt # | Unacceptable Structure | Multiple Violation Types | Causally Driven Violation |
|---|---|---|---|
| 15 | No | N/A | N/A |
| 16 | No | N/A | N/A |
| 17 | No | No | No |
| 18 | Yes | Yes | No |
| 19 | Yes | Yes | No |
| 20 | Yes | No | No |
3.8. Scope
Next, we turned to scope ambiguities (Kamath et al., 2024). o3-mini-high correctly identified Option A as the most commonly selected option (Prompt 21), but it did not provide any logical reasoning for why Option B below could in principle be true.
Prompt 21
There are exactly six chairs evenly spaced around a circular table. On the basis of this statement alone, and with no further context, there are two options:
A: The six different chairs are all around the same table.
B: The six chairs aren’t all around the same table.
Specifically in relation to this context, which of these two options is most likely?
The model’s logic implies that ‘a circular table’ must refer to a single entity purely in virtue of its singular grammatical features, ignoring how the phrase’s role in a compositional structure can license a shift between narrow and wide scope readings (i.e., three chairs could surround Table A, and the other three chairs could surround Table B). This points to a lack of human-like arbitration between possible semantic representations delivered by a grammatical configuration and world knowledge.
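Schematically, the two construals can be rendered in simplified first-order terms (our own illustration, abstracting away from ‘evenly spaced’):

Option A (the indefinite takes wide scope; one shared table):

$$\exists t\,\big[\mathrm{table}(t) \wedge |\{\,c : \mathrm{chair}(c) \wedge \mathrm{around}(c,t)\,\}| = 6\big]$$

Option B (the indefinite takes narrow scope; the table may vary with each chair):

$$|\{\,c : \mathrm{chair}(c) \wedge \exists t\,[\mathrm{table}(t) \wedge \mathrm{around}(c,t)]\,\}| = 6$$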
Assessing the three bullet points in the explanation: When deciding between Options A and B, (i) there are many sentences that include the string ‘a circular table’ that readily result in an interpretation of multiple different tables (e.g., ‘Each Prince was gifted a circular table’); (ii) the even spacing does not strictly pertain to the decision at hand; and (iii) the model’s description of ‘Contextual Convention’ only begs the question by invoking circular reasoning (i.e., the sentence means X because it means X).
3.9. Assessment of Grammaticality
We asked the model to assess the acceptability (Tjuatja et al., 2024) and grammaticality of 16 sentences. Sentences (1)-(11) were ungrammatical, and the model successfully identified these as such. These ungrammatical sentences contained common violations discussed in the literature, such as adjunct islands, whether-islands, and binding condition violations. Sentences (12)-(16) were grammatical. However, the model incorrectly claimed that (12), (15) and (16) were ungrammatical, and its explanation for why (14) is grammatical was incorrect. Below the prompt, we focus on the responses pertinent to (12)-(16) since these were the items causing errors. This prompt is assuredly complex, but if artificial models are “better than theoretical linguists at theoretical linguistics” (Ambridge & Blything, 2024) we might expect some general successes.
Prompt 22
Please assess the following sentences for their acceptability and grammaticality. Explain how each of the sentences either does or does not violate any number of linguistic rules.
1) The journalists said that Trump lied about each other.
2) Mike tries will win.
3) The man expected the client to shoot each other.
4) For themselves to decide to go would be absurd.
5) For each other to lose would be disgraceful.
6) Sam believes to be intelligent.
7) Kim expects Saul to like herself.
8) I talked about Dale to himself.
9) Who did Tom talk with Sally after seeing?
10) Who does Diane wonder whether Cooper likes?
11) What did you make the claim that Kyle bought?
12) John likes Mary's picture of himself.
13) John likes Mary's picture of herself.
14) Jimmy expected Saul to win himself.
15) Jimmy expected himself to win Saul.
16) We think that they expected that pictures of each other would be in the room.
Below is a summary chart for the accuracy of o3-mini-high in identifying unacceptable and acceptable sentences (we caveat this by highlighting the limited sample size and non-systematic assessment).
Figure 1
Bar Chart Representing Classification Accuracy for o3-Mini-High for Unacceptable and Acceptable Sentences
The model incorrectly stated that ‘John’ is ‘too far removed’ to bind with ‘himself’ in (12). With (14), the model incorrectly states that ‘The intended reading is that Saul is expected to win on his own’. (14) can be read as Jimmy expecting Saul to win some potential prize; the prize could be, e.g., a painting of Saul, in which case ‘Saul won himself’ would mean ‘Saul won a painting of himself’.
These arguments also apply to (15), which the model incorrectly identified as ungrammatical, even though Jimmy could, again, be expected to win some painting (or some such) of Saul (or, indeed, Saul himself could logically be the prize, e.g., ‘Saul’ could be the name of a pet or robot).
(16) has a dual reading, one under which ‘we’ and ‘each other’ are linked (ungrammatical) and one under which ‘they’ and ‘each other’ are linked (grammatical). The model failed to parse these possibilities.
3.10. Assessment of Graded Acceptability
Next, we followed up on the initial indications from Section 3.9 that the model succeeds in identifying ungrammatical sentences but struggles to reliably identify acceptable sentences as such. Instead of presenting only grammatical and ungrammatical sentences (as in Section 3.9), we exploited the gradient cline in acceptability in the constructions below (‘*’ = unacceptable; ‘?’ = partially acceptable for some) collated from some recent linguistics literature (Amiraz, 2022; Murphy, 2024a; Toosarvandani, 2014; Wu, 2025). Note that the ‘partial acceptability’ rating was motivated directly by prior literature, and was not arbitrarily stipulated by our group. 14 of these sentences were either unacceptable (1-4) or partially acceptable (5-14). We presented these sentences to o3-mini-high (without the below annotated ‘*’ or ‘?’) and asked it to sort them by acceptability. We provide an abridged prompt below, for reasons of space (the sentences were presented below this prompt text in random order, without numbering).
Prompt 23
Please sort the sentences below into increasing levels of acceptability: from (1) (wholly unacceptable) to (2) (unacceptable) to (3) (partially acceptable) to (4) (acceptable).
The model identified 7 sentences as unacceptable, only 2 sentences as partially acceptable, and 15 sentences as acceptable, diverging from the acceptability profile provided above. Below, we have marked sentences incorrectly judged by the model with a red cross, and those correctly judged with a green tick. If the model assigned a partially acceptable sentence as (1) (‘wholly unacceptable’) and provided a reasonable explanation, we considered this to be correct and hence assigned it a green tick.
Below is a summary chart for the accuracy of o3-mini-high across the different types of sentences it was tasked with rating (we again caveat this by highlighting the limited sample size in the present preliminary study).
Figure 2
Bar Chart Representing Accuracy for o3-Mini-High Across the Three Types of Sentences Provided in the Prompts
The explanation for (14) incorrectly states that this is unacceptable due to more common expectations of the presence of ‘to raise children’ (we note that the model’s own numbering system in its response text seems inconsistent and flawed, so we refer to numbered items in our ‘Gradient Cline in Acceptability’ list above). The model was unable to recognize zeugmatic conceptual coordination as motivating inclusion into either the partially acceptable or unacceptable groups (i.e., (7)). Some of the explanations for the unacceptable sentences – though correctly identified as such – do not provide a coherent explanation for their unacceptable nature. For example, ‘Not three students arrived’ is deemed unacceptable purely because ‘it is odd and ungrammatical’—which raises the question as to why!
Importantly, we wish to stress that we provided to the model four distinct options for acceptability, which were not utilized correctly for some of the partially acceptable sentences – even when the model explicitly noted in its response that these were in fact not wholly acceptable. For example, two of the sentences that the model placed in the ‘Acceptable’ group are noted as being ‘odd’ and ‘unexpected’ – ideal criteria to motivate their inclusion into the ‘Partially Acceptable’ group.
Overall, the model succeeded in identifying the most egregiously unacceptable sentences (in both this section and in Section 3.9), and most of the plainly acceptable sentences. However, some of its explanations were either lacking in specificity or were inconsistent with the model’s grouping of the sentences in question. In addition, the model struggled considerably with partially acceptable sentences, classifying only two sentences as partially acceptable out of ten—and one of these two sentences was incorrectly classified (two of the partially acceptable sentences were classified as unacceptable with reasonable explanations, and so we deemed these to be correct judgments). As such, only one sentence out of ten was correctly placed within the ‘partially acceptable’ group. Therefore, we conclude that the kind of acceptability spectrum that humans are acutely sensitive to is not reliably captured by o3.
3.11. Modified Jabberwocky
In order to test the potential interaction of lexical and configurational processes, we presented the model with the following prompt.
Prompt 24
Can you generate for me three ‘Jabberwocky’-style sentences which have the following properties: First, instead of replacing all content words with pseudowords (the typical way to implement a Jabberwocky sentence), I want you to replace all function words with pseudowords. The second sentence must contain a syntactic violation that must be detectible for English speakers. None of the pseudowords must rhyme with any other pseudoword across the three sentences. Finally, the three sentences must together form a coherent event structure.
Breaking down the four requests:
Success: All function words were accurately replaced with pseudowords.
Failure: The two neighboring pseudowords that are claimed to create “a clear syntactic violation” are not readily parsed as [Determiner, Determiner], since it is not necessarily ungrammatical to have two co-occurring pseudowords. For example, the sentence could readily be parsed as ‘The explorer followed with the map to a hidden grove’; or ‘across the map’, ‘within the map’, ‘in the map’, ‘on the map’, etc. The prompt requested that the syntactic violation must be detectible by English speakers – the model could have injected a syntactic violation that was more obviously marked on the content words.
Failure: The pseudowords ‘flim’ and ‘krim’ rhyme.
Success: A coherent narrative structure was provided.
Overall, the model was able to generate a series of narratively connected sentences and switch out all function words with pseudowords—operations that rely purely on lexical statistics, not structure. It failed with instructions that demanded a level of higher-order syntactic and even phonological inferences. Interestingly, by its own internal logic under which ‘lupn puxit map’ was inferred as the ungrammatical phrase ‘a the map’, the model was correct. But it was seemingly unable to check against other alternative parsings that would render this string of words grammatical. The model made it seem as if placing two pseudowords in the “wrong order” constitutes a syntactic violation since the English words that the model substituted them for would be ungrammatical. But of course, no English speaker would know that these pseudowords were transformed from specific function words.
3.12. Syntactic Superposition
The next prompt required the model to represent multiple syntactic violations within a single sentence, but to do so in a manner that nevertheless yielded some interpretable output. Though this is admittedly a difficult challenge, our motivation here was to expose the type of reasoning o3 exhibited when confronted with the challenge of negotiating two distinct syntactic rules in the service of some semantically related goal.
Prompt 25
Generate a list of 10 sentences that exhibit the following property: They all violate two different types of grammatical rules, but violating these two rules simultaneously yields a semantically or syntactically acceptable sentence. Each of the 10 sentences must combine different rule violations.
The explanations for Sentences 2-5 and 7-10 could be used to justify treating virtually any ungrammatical sentence as acceptable. These justifications boil down to ‘speakers can just choose to ignore this word’ or ‘some people stutter sometimes’. This is perfectly true, but it is hardly in full compliance with the prompt’s request for a sentence that is “semantically or syntactically acceptable”. Meanwhile, Sentences 1 and 6 rely on non-standard forms of English. As such, the model in effect failed to generate a single example of two syntactic rules ‘cancelling out’ (in semantic space or configurational space) to yield some interpretable structure. Perhaps most importantly, the prompt required the model to “combine different rule violations”, yet the general theme of ‘redundancy’, ‘repetition’ and ‘superfluous’ elements cited by the model in its explanations ensured that one general violation type became overwhelmingly dominant (i.e., simply repeating a word).
3.13. Impossible Objects
Inspired by sentences involving complex forms of polysemy (e.g., “Lunch was delicious but took forever”; “The newspaper on the table was sued by a millionaire”; “The White House issued a statement before being repainted”) involving the combination of categorially distinct semantic types (Gotham, 2017; Murphy, 2021, 2024a), we generated the following prompt.
Prompt 26
Some sentences involving polysemous words can yield semantically 'impossible' objects, like nouns that are simultaneously referred to as processes or events or concrete tokens. Generate five sentences that each involve reference to a different type of semantically impossible entity, but which is perfectly comprehensible to English speakers as not violating any rules of semantic composition or conceptual combination. In these sentences, you must only refer to the named entity once explicitly. In addition, each sentence must exhibit a different combination of multiple meanings being combined together.
From these responses, it seems clear that the model has no human-like sense of semantic anomaly. The model is correct that (3) can be interpreted as a piece of information and also a physical text, but the other examples fail to generate any coherent sense of impossibility. For example, it is not ‘semantically impossible’ for something concrete to have an emotional impact. In (2), the model’s intended meaning, of tree bark ‘resounding’, still does not trigger an impossible entity. With (5) (‘The plot was buried beneath layers of mystery’), the model uses ‘buried’ metaphorically, such that an abstract plot exhibits some relation to some abstract mystery, hence causing no impossibility. With (1), (4) and (5), the model seems to assume that ‘figurative’ and ‘playful’ meanings suffice to satisfy the prompt’s request for blending semantically distinct meanings.
In addition, the prompt requested ‘a different combination’ across all sentences, but the ‘concrete/physical’ sense was used every time (sometimes twice with one sentence, as in (2)). This task would have been easily achievable if the model had blended physical, event, information and institution senses in various ways—instead, it was only able to mix vaguely metaphorical meanings. Notice that, as with some previous prompts above (e.g., Prompt 24), here we gave the model a generous clue as to how to solve this problem, and yet it was still unable to do so.
3.14. Summary
We ran these above prompts 3 additional times in o3 via OpenAI’s API, and the accuracy closely reproduced the above effects that we found via the chat interface. In Supplementary Material we provide the full results, broken down by accuracy.
4. Discussion
As predicted by some previous position papers and experimental reports (Baggio & Murphy, 2024; Leivada et al., 2023a, 2023b; Marcus, 2024), the latest sophisticated reasoning model from OpenAI (o3) falls short of demonstrating human-like expertise in compositional syntax-semantics. It fails to cleanly dissociate conceptual content from structural configuration – a basic requirement of compositional syntax (Evans, 1985; McCarty et al., 2023; Murphy, 2025) – and it provides surreal meanings instead of truly ungrammatical sentences. It was unable to generate a Jabberwocky structure that accurately represented a clear syntactic violation, it was unable to accurately assess the output of applying two distinct syntactic violations to a sentence, and it was unable to represent semantically impossible entities. Our results indicate that the kind of sentence acceptability spectrum that humans are acutely sensitive to (Sprouse & Almeida, 2012) is not reliably captured by o3. Although we provide only minimal descriptive statistics over a small sample (with a more systematic investigation forthcoming), our prompts covered a broad range of grammatical demands, and indicate not only that large language models (LLMs, like ChatGPT-4o) and large reasoning models (like o3) have problems with ‘contextual’ and ‘pragmatic’ reasoning, but that they have not yet grasped formal language competence (in contrast to more optimistic assessments in Mahowald et al., 2024).
4.1. Structure or Statistics?
While Beguš et al. (2025) report that GPT-4 is capable of recognizing ambiguities, correcting its own analytical errors, and commenting on the feasibility of multiple solutions, we found that the more recent o3 model fails to achieve something much more elementary: It was unable to reliably distinguish between meaning and structure. When Beguš et al. (2025) focus on OpenAI’s o1 model, they claim that its “ability to construct center-embedded sentences without being explicitly prompted to do so thus suggests that the model acquired grammatical structure beyond the simple distributional tendencies of its training data set”. In contrast, our results cast a more pessimistic light on the grammatical capacities of o3, including explicitly for center-embedding.
Moreover, our results (see especially Sections 3.6-3.13) help emphasize an apparent lack of meta-linguistic understanding (contra Beguš et al., 2025). For LLMs, language simply is the system it is trying to master, whereas for humans language is exploited as a powerful cognitive and inferential tool. Meta-linguistic understanding is only possible in principle if there is some separate cognitive/generative model or grounding in a world model that language is used to revise/update (Leivada et al., 2023b; Marcus, 2022). This does not seem to be the case for o3.
One caveat we wish to highlight here is the possibility that the model’s failure with drawing tree representations may simply be due to issues with interfacing with the drawing module itself, and may not necessarily be driven by issues in syntactic representation. Future work could attempt to have o3 output distinct types of configurational representations, perhaps via formalized languages closer to the model’s native representational formats. A related caveat is that we have no direct human performance scores against which to make claims about ‘human-level’ performance; such scores will be needed to license these comparisons.
4.2. Syntax or Salmon?
Our results support recent hypotheses according to which language models can represent ‘horizontal’ linguistic information but have a significantly reduced ability to represent ‘vertical’ types of hierarchical compositional syntax-semantics (Murphy, 2024b, 2024c). Postulating a chain of uni-directional associations between elements (and only showing an ability to deal with mono-configurational assessments, rather than understanding the dynamic relationship between syntactic processes and variable semantic interpretations; i.e., Sections 3.7-3.10) does not entail grammatical understanding. The language system does not fly solo—it is always in the game of driving higher-order inferences, planning, consolidating experience, and aiding directed attention. As suggested by our results, o3-mini-high lacks an ability to handle syntactic inferences alongside cognitive model updating, given its clear inability to recognize the various ways in which semantic and syntactic representations dynamically interact. Numerous examples from our report illustrate this. For example, the semantically zeugmatic constructions ‘The salmon was fast and delicious’ and ‘My appointment was long and obnoxious’ were deemed felicitous. The model was likely heavily biased by the lexico-semantic statistics of these constructions rather than by the subtle ways in which the grammar regulates distinct coordinates in conceptual space that differ markedly from the same general meaning being configured in syntactically distinct ways (e.g., compare with ‘The salmon was fast and it was delicious’; Murphy, 2021).
Our results therefore indicate a strong bias for imposing ‘horizontal’ relations on the part of o3. Humans, in contrast, have a strong bias from an early age to impose hierarchical, compositional structure above and beyond linear relations (Murphy, 2020a; Perkins & Lidz, 2021). As reviewed in Murphy (2024c), LLMs seem able to capture certain features of dependencies (Tesnière, 1959), but other fundamental principles of language that regulate how constituency, headedness, and incremental node counts yield semantic instructions during parsing (via the mapping of syntactic objects to updates of cognitive models) remain somewhat elusive.
4.3. Reasoning or Rambling?
Though it may represent an advance in “the boundaries of what small models can achieve, delivering exceptional STEM capabilities—with particular strength in science, math, and coding—all while maintaining the low cost and reduced latency of OpenAI o1-mini” (OpenAI, 2025), this most recent model nevertheless falls short in similar ways to previous models (Dentella et al., 2024; Murphy, 2024c). Our work expands on previous results exposing a stark absence of response stability in large language models (Dentella et al., 2023). Language models can assign probabilities to strings of words, but grammaticality cannot be construed as a phenomenon of transitional probability extracted from lexical items alone (Lenneberg, 1967). For this reason, recent advances that dispense with the notion of ‘tokenization’ altogether in favour of ‘Large Concept Models’ grounded in semantic representations may prove preferable in some cases (LCM Team et al., 2024).
Not only does the o3 model fall short in terms of providing a clear path towards artificial general intelligence (Pfister & Jud, 2025), it also fails to demonstrate a robust grasp of some of the most fundamental elements of compositional linguistic structures. Our brief report provides further reasons for scepticism towards the claim from Microsoft that OpenAI’s recent models “[attain] a form of general intelligence” and show “sparks of artificial general intelligence” (Bubeck et al., 2023, p. 92). We find claims from the AI team at Apple more reasonable here: A recent assessment found no evidence of formal reasoning in language models, with the team concluding that their behavior is better explained by sophisticated pattern matching (Mirzadeh et al., 2024). Consulting some of the explanations for acceptability provided by o3 (e.g., Sections 3.7-3.10) also reinforces the assessment that ChatGPT is a professional “bullshitter” (Hicks et al., 2024), “bloviator” and “a fluent spouter of bullshit” (Marcus & Davis, 2020).
Interestingly, various advocates and proponents of LLMs have recently argued that linguists who claim that sentences such as ‘Dogs dogs dog dog dogs’ are grammatical are offering a psycholinguistically implausible and unhelpful theory of grammar. And yet, in a twist of irony, according to the present results the most advanced model from OpenAI does not appear to agree with this critique, and is seemingly so eager to attempt to parse these types of structures that it readily determines wholly ungrammatical cases (such as the “Glarts…” examples in Prompts 8-9) to be grammatical.
In some of our prompts requesting the generation of ungrammatical structures (Section 3.6) or the assessment of complex embedding (Section 3.4), we suspect that o3 was influenced by lexical statistics to a much greater extent than by any level of hidden states used to support (some format of) grammatical configuration (a bias already documented for text-to-image models; Leivada et al., 2023b). Yet, the task at hand was explicitly to invoke higher-order hierarchical representations and attempt to de-noise the relevant assessments from any influence of lexico-semantic statistics.
4.4. Theories or Tools?
“The best material model for a cat is another, or preferably the same cat” (Rosenblueth & Wiener, 1945).
The fact that o3 was unable to reliably generate basic violations of syntactic rules should motivate some degree of concern and scepticism towards claims that LLMs do better than linguists on every job that syntactic theory was intended to perform. Ambridge and Blything (2024) argue that “large language models are better than theoretical linguists at theoretical linguistics”—an assessment at odds with our discovery that the most sophisticated reasoning model from OpenAI deems a number of grammatical sentences to constitute violations of binding theory, amongst other things. As pointed out by others, it is also incoherent to claim that LLMs can directly constitute a “theory of language” (Katzir, 2023; Müller, 2024). This type of theory-nihilism (and data-ism) has been bolstered by the recent surge of interest in LLMs, but it has yet to be proven capable of being translated into a concrete scientific research program that can replace dominant theories of language acquisition and processing.
Although Piantadosi (2024) recently attempted to do to Chomsky what Chomsky did to Skinner in 1959 (i.e., refute his research enterprise and much of its philosophical basis), Piantadosi’s arguments proved to be flawed (Katzir, 2023). As pointed out already by Collins (2024):
“The fundamental reason that LLMs cannot be scientific theories is not because they are probabilistic, or because they involve parameter tuning. Nor even does it have to do with their lack of human intelligibility. As Piantadosi notes, such things are common enough among mature sciences. Rather, the issue is that the representational capacities of LLMs (and their connectionist siblings) are unbounded in a way that makes their representations arbitrary”.
As a brief aside, it is worth highlighting in this context that it was the human brain during evolution that created syntactic structure (Murphy, 2019, 2020b, 2024c; Murphy et al., 2022, 2023, 2024b). LLMs, by contrast, being universal function approximators (Yun et al., 2019), are surely able to reproduce certain aspects of lexico-semantic statistics from the ‘fossilized’ remains of the human generative machine that they recover from data (Mitchell & Krakauer, 2023). But there are very plausible reasons to assume that whatever method LLMs use bears little resemblance to the algorithms deployed by human infants (Leivada & Murphy, 2022; Murphy et al., 2025), who deploy specialized knowledge rather than solely invoking general token-prediction algorithms. Because LLMs are a universal approximation method, they are more akin to tools such as generalized Fourier series than to scientific theories of human cognition. Relatedly, distributional semantics vectors can certainly be used as a proxy for natural language meanings, but they are not to be confused with “the stuff of thought” itself (Pinker, 2007). This is not even to mention related concerns that hover in the background, like the fact that the backpropagation training algorithms used with LLMs are considerably different from human learning mechanisms (Evanson et al., 2023).
4.5. Design or Data?
Instead of scaling to unprecedented levels of compute via architectures that are fundamentally grounded in token prediction, a return to more traditional design features of the human mind (predicate-argument structure, variable binding, constituent structure, minimal compositional binding; Donatelli & Koller, 2023) may be needed to orchestrate a more reliable expertise in human language (Ramchand, 2024). This could be implemented via neuro-symbolic approaches.
Still, it is also certainly true that mainstream theoretical linguistics (e.g., the minimalist enterprise) was in some ways ill-equipped to predict which patterns of linguistic activity might be (un)approachable by LLMs. To illustrate, recent generative grammar theorizing has arguably underestimated the extent to which lexical information drives composition. This type of information may permit LLMs to abductively infer certain elements of grammatical rules, in whatever format this ultimately takes (Ramchand, 2024). Future research should more carefully apply the tools of linguistics to isolate specific sub-components of syntax that might in principle be achievable by language models, given specific design features. For instance, with LLMs “complete recovery of syntax might be very difficult computationally” (Marcolli et al., 2025, p. 13), even if we assume that attention modules can in principle “satisfy the same algebraic structure” as what Marcolli et al. postulate as being necessary for syntax-semantics interface mappings.
More broadly, the currently popular vector approach (Piantadosi et al., 2024) risks conflating the implementation medium with the computational level: if a high-dimensional vector space supposedly encodes structured symbolic derivations, then unless that structure is explicitly recoverable and manipulable, the representation is only functionally equivalent in a loose sense. How does a vector-based learner discover abstract, exceptionless rules without relying on statistical accident? Piantadosi et al. (2024), and others, risk explaining compositionality post hoc (“it emerges in the geometry”) rather than as a necessary design property. And it is these necessary algebraic properties of language that linguistic theory tries to capture.
5. Conclusion
In contrast to some recent claims that we may be living through “the end of (generative) linguistics as we know it” (Chesi, Forthcoming), our results should spur cognitive scientists, psychologists and philosophers to press even further into the reaches of algorithmic and psycholinguistic models of hierarchical syntactic composition. Some recent directions here come from exploiting concepts from statistical physics (Murphy et al., 2024a) to uncover previously unknown principles of language design (and to provide a potential meta-language to compare and quantify distinct syntactic theories), and from recent attempts to bridge symbolic theories of language with probabilistic-connectionist models of parsing (Murphy, 2024c) to offer a neurobiologically plausible infrastructure for syntactic inferences.
The goal here should not be to virtuously resist the era of big data from the safety of our theoretical models of syntax, but to learn how best to properly leverage computational methods – not in order to surrender to LLMs (Piantadosi, 2024) but to utilize them (van Rooij et al., 2024) to assess how statistical and symbolic representations interact during the acquisition and processing of language.