1 Introduction
Children acquire at an early age the ability to organize syntactic and morphological relations between and within words. They include various rule-governed structures, from flat structures to non-adjacent dependencies, and syntactic frames, (semi-)fixed expressions, morphological case markers, etc. We call these elaborated syntactic and morphological structures extended syntax or extended morphology. Here, we are motivated to enhance our understanding of how these extended structures emerged in the hominin lineage. According to the minimalist account of generative syntax, the syntactic capacity emerged from a single macro-mutation as early as 100–200 kya in modern humans (e.g., Berwick & Chomsky, 2016, 2019; Chomsky, 2017). This mutation brought about the Merge operation. It is said that the single operation Merge is critical for syntax, defines human language per se, and recursive application generates infinite structures from finite means in a binary fashion (Chomsky, 1995).
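The claim that a single binary operation, applied recursively, yields unbounded structure from finite means can be made concrete with a minimal sketch. It assumes the common set-theoretic reading of Merge as forming the unordered set {X, Y}. The function names and the example words are illustrative assumptions, not part of the cited accounts.

```python
def merge(x, y):
    """Combine two syntactic objects into the unordered set {x, y}.

    Assumes x and y are distinct objects, so the result has two members.
    """
    return frozenset({x, y})


def depth(obj):
    """Count the levels of embedding produced by repeated Merge."""
    if isinstance(obj, str):  # a lexical item has no internal structure
        return 0
    return 1 + max(depth(part) for part in obj)


# Hypothetical example: build "children read the book" bottom-up.
noun_phrase = merge("the", "book")
verb_phrase = merge("read", noun_phrase)
clause = merge("children", verb_phrase)

print(depth(clause))
```

Because the output of `merge` is itself a valid input, the same operation can be reapplied without limit, and every resulting node is binary.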
Indirect anthropogenetic findings, however, favor a Darwinian scenario (e.g., Christiansen & Kirby, 2003; Christiansen et al., 2009; de Boer et al., 2020; Foley, 2001; Fujita & Fujita, 2022; Hillert, 2015, 2021; Pinker & Bloom, 1990; Zuberbühler, 2019). Nevertheless, the apparent debate is less controversial than is often stipulated since the Merge- or recursion-only hypothesis reserves adaptive evolutionary processes for the broad language faculty (Hauser et al., 2002; Pinker & Jackendoff, 2005).
Our approach considers neurobiological changes in the hominin lineage and evidence from reverse linguistic analysis. We discuss various pre-syntactic conditions and view the biological Merge capacity as a property that may have emerged early in our hominin ancestors, such as Homo erectus. The actual cognitive implementation and use of this capacity in language may be a late cultural byproduct of complex syntactic branching, particularly in the context of the development of writing and reading.
The signal system of extant nonhuman primates has been fairly well assessed. Monkeys use referential vocal signals and respond accordingly to different call types (Fitch, 2000; Gifford et al., 2005; Seyfarth et al., 1980). The signal patterns are fixed in monkeys but less constrained and more open in nonhuman apes (e.g., Botha, 2003; Coudé et al., 2011; Deacon, 1989; Ferrari et al., 2017). One of the earliest forms of hominin communication may have consisted of a signal exchange system based on the perception of events. The signals that resembled our ancestors’ environment were presumably mainly iconic and holistically stored and retrieved from memory. Then, a transition took place from discrete references to sense, a distinction already introduced by Frege (1892). The development of discrete concepts and nonverbal action-based event-structures is one of the most critical steps for creating words, that is, conventionalized vocal or gestural signs. Pantomimes certainly played an important role in referring not only to perceived events but also to imagined ones (e.g., Arbib, 2012; Gärdenfors, 2017; Zywiczynski et al., 2018). Specifically, pantomimes include the ability to imagine past and future events. The process of imagination is a precondition for the invention of conventionalized symbols. We discuss a possible scenario in more detail below, but non-referential concepts seem not to be verifiable in nonhuman primates.
Here, we consider extended syntax and morphology as an optional late cultural product. In contrast to the minimalist account, we assume that the biological Merge-capacity is not necessarily human-specific. One reason is that the level of cortical development in Homo erectus was virtually comparable to early modern humans, that is, quite advanced. Another reason is that we do not consider the iterative application of Merge to be critical for the initial stages of language. Instead, we suggest a pragmatic grammar without function words or inflections that exclusively relies on contextual information. The emergence of basic lexical grouping is semantically motivated (see Goldberg, 2005; Hillert, 2023; Jackendoff, 1997). Others propose an intermediate stage of proto-Merge (Progovac, 2010, 2015; Progovac & Locke, 2009) or an initial stage of core-Merge (Suzuki & Matsumoto, 2022).
We agree here with the idea of one (or more) intermediate precursor stages of modern language, such as pragmatic grammar. The critical stages involve the development of symbolic elements and their grouping within a perceived event-context. These lexical groupings can be considered as semantic frames. Syntax may not have played a cardinal role at this stage. We find evidence for this argument in contact languages (signed or spoken) and in the analysis of modern languages (e.g., Bickerton, 1990; Gil, 1994, 2005, 2013; Gil & Shen, 2019; Jackendoff, 1999; Jackendoff & Wittenberg, 2014, 2017; Willer Gold et al., 2018).
A pragmatic grammar may have been the semantic foundation for organizing words hierarchically, including binary branching, such as Merge, or n-ary tree structures. We believe that the ability to use a pragmatic grammar may be associated with properties of a new genotype. This new genotype can be associated with late Homo erectus if we consider neurobiological factors and archeological reports about their cognitive behaviors. Since we find structures of a pragmatic grammar in first language acquisition (e.g., telegraphic speech) and language breakdowns (e.g., agrammatic Broca’s aphasia) but also in different types of contact languages (e.g., second language acquisition or home-signing), we assume that there are no genotype differences in the ability to use a pragmatic or an extended syntax.
Extended syntax and morphology are exclusively a cultural product and took thousands of years to develop. However, as mentioned before, the biological capacity for a pragmatic grammar may have been deeply rooted in the survival strategy of an early hominin group refining the dominant social group structures which can be studied even today in extant nonhuman primates. We agree that binary branching such as Merge is an elegant computational mechanism to generate complex structure with a single operation. According to our account, however, the human brain is more flexible and therefore more efficient in generating lexical groupings in multiple ways rather than relying on a rigid single operation. Cognitive and neurobiological factors must be considered to understand how syntax and other properties of language emerged. In particular, rehearsal of specific sound patterns associated with discrete concepts expanded the workspace capacity at the cortical level.
In sum, the shift from a pragmatic grammar to non-binary or binary branching served the purpose of internalizing rules for expressing complex thoughts. Morphosyntactic rules replaced complex storytelling at the pragmatic grammar level. The internalization of morphosyntactic rules may have been related to population growth and to the urge for complex social bonding and coordination (Dunbar, 1996). It took modern humans about 16,000 generations to develop a modern language (assuming that a generation lasts, on average, 18 years). However, the timeline may be much longer if we consider different precursor stages. In addition, we assume that the genotype, here the innate capacity supporting a pragmatic grammar and its modern derivatives, is not unique to early modern humans but shared with their closest extinct relatives, such as Neanderthals, Denisovans and late Homo erectus, if we adopt the classical species taxonomy and do not reclassify them as variations of the same species, that is, Homo sapiens sensu lato (Bräuer, 2008).
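The generation estimate can be converted into calendar time with simple arithmetic. The sketch below uses only the two figures given above (16,000 generations, 18 years per generation). The alternative generation length of 25 years is a hypothetical value, included only to show how sensitive the estimate is to this assumption.

```python
def generations_to_years(generations, years_per_generation):
    """Convert a number of generations into elapsed years."""
    return generations * years_per_generation


GENERATIONS = 16_000           # figure stated in the text
YEARS_PER_GENERATION = 18      # average generation length assumed in the text

elapsed = generations_to_years(GENERATIONS, YEARS_PER_GENERATION)
print(elapsed)                 # 288000 years, i.e. roughly 288 kya

# Hypothetical alternative: a 25-year generation (not from the text).
print(generations_to_years(GENERATIONS, 25))  # 400000 years
```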
2 Cortical Changes in the Hominin Lineage
Human language is strikingly different from animal communication because a grammar system organizes words in sentences and discourse to convey complex meanings. One reason is that language is a cultural product that is ready-made and available to the child in its surrounding world. Another reason is that the child's brain is language-ready. The brain is ready to be selectively sculptured according to the perceived input. After birth, synaptic density significantly increases and peaks at 1–2 years of age. It drops sharply during adolescence and stabilizes during adulthood. Evolution selected neural excess and pruning in the lineage of hominins as an efficient and robust mechanism to shape distributed networks for cognition (e.g., Navlakha et al., 2015). Typically developing children acquire the basic properties of language at the latest by the age of four or five. However, the acquisition of extended syntax involving non-canonical structures and long-distance dependencies takes much longer, until late childhood or early adolescence (e.g., Dabrowska et al., 2009; Skeide et al., 2016; Skeide & Friederici, 2016). Moreover, a prolonged acquisition process applies also to certain non-literal expressions, including sarcasm and novel metaphors (e.g., Glenwright & Pexman, 2010; Van Herwegen et al., 2013). In contrast to macaques or chimpanzees, cortical synaptogenesis is significantly delayed in humans (e.g., Huttenlocher & Dabholkar, 1997; Liu et al., 2012). In-born acquisition abilities enable the child to use finite means to produce infinite structures. To what extent these innate abilities are language-specific syntactic parameters (Chomsky, 1986) or properties of general cognition, such as mind reading and perceptual and cognitive strategies, remains to be seen (Tomasello, 2003).
We find qualitative and quantitative differences in comparing the neural properties of the human brain against the brain of (nonhuman) great apes. In humans, the hubs of the language circuit are Broca’s area with the Brodmann areas (BAs) 44 & 45 (respectively pars opercularis and pars triangularis) and Wernicke’s area with the posterior sections of BAs 21 & 22 of the superior and middle temporal gyrus (S/MTG). Moreover, prosodic information and metaphoric expressions are primarily processed in the right hemisphere (e.g., Bottini et al., 1994), but idiomatic strings are not (Hillert & Buračas, 2009), and auditory and fine-grained articulatory processes involve subcortico-cortical structures, particularly the basal ganglia. Furthermore, neuropsychological data reveal an ambiguous picture. Some studies report that the language circuit is engaged not only during language processing but also in the context of actions, music, or calculations (e.g., Bookheimer, 2002; Fadiga et al., 2009; Nishitani et al., 2005; Ruck, 2014; Wakita, 2014). Other studies show language-specific activations in the left inferior frontal gyrus (e.g., anterior BA 44) which do not overlap with non-linguistic processes (e.g., Campbell & Tyler, 2018; Fedorenko et al., 2011; Jouravlev et al., 2019; Papitto et al., 2020). Some methodological issues are associated with this debate: any two different tasks will recruit different cortical activations. The question is how specifically we define the relevant cortical region. The narrower the definition of a cortical region, the higher the probability of finding activations exclusively for a specific task within the predefined region of interest. It is an empirical question whether specific cortical regions are recruited by the type of computation rather than by domain specificity. 
Furthermore, it has been argued that activation differences may be related to differences in workspace demands and integrative control functions associated with a syntactic structure (e.g., Hillert, 2014; Kaan & Swaab, 2002; Novick et al., 2010; Saur et al., 2008). We use the term workspace here to refer to a buffer that holds memory traces for about two seconds, but rehearsal operations prevent fading (Baddeley & Hitch, 1974; Cowan, 2001). This approach is consistent with the global workspace hypothesis. Dynamic global workspace functions are supported by prefrontal and posterior regions (e.g., Baars et al., 2013; Dehaene & Changeux, 2011). Most interesting is the finding that the white-matter fiber streams of the language circuits seem to engage different types of computations. The dorsal streams, which connect Broca’s area and the prefrontal motor cortex with the parietotemporal junction (PTJ) and posterior STG, mainly consist of the arcuate fasciculus (AF) and the superior longitudinal fasciculus (SLF). In principle, SLF connects frontal and parietal regions, and AF is a frontotemporal tract extending towards the parietal under SLF. The streams differ in their endpoints: AF terminates (directly or indirectly) in BA 44, SLF in BA6 of the premotor cortex (e.g., Bernal & Ardila, 2009; Friederici & Gierhan, 2013). AF is particularly implicated in complex, hierarchical syntactic processing but also in phonological processing. SLF is involved in speech processing, including rehearsal operations (Catani & Mesulam, 2008; Hickok & Poeppel, 2007).
The ventral streams connect the posterior temporal lobe (STG and MTG) to Broca’s area via the extreme capsule (EC) and the uncinate fasciculus (UF). The primary function of the ventral streams is to transfer lexical information (form and meaning), local phrase structures, and treelets to the inferior frontal gyrus (e.g., Bajada et al., 2015; DeWitt & Rauschecker, 2012; Hillert, 2014; Hodgson et al., 2021; Matchin & Hickok, 2020; Pillay et al., 2017; Ralph et al., 2017; van der Lely & Pinker, 2014). Finally, two further streams, the inferior longitudinal fasciculus (ILF) and the inferior fronto-occipital fasciculus (IFOF), connect the occipital lobe with the frontal lobe via temporal regions. Little is known about their precise functions, but it has been suggested that they are involved in processes associated with lexical semantics, goal-orientation and possibly theory of mind (e.g., Almairac et al., 2015; Catani & Thiebaut de Schotten, 2008; Glasser & Rilling, 2008). Compared to nonhuman great apes, the anterior prefrontal and precentral regions of the human cortex increased in size (Semendeferi et al., 2001; Schoenemann et al., 2005).
Again, left-sided asymmetric regions homologous to Broca’s and Wernicke’s areas have been identified in nonhuman primates (monkeys, chimpanzees, bonobos, gorillas, orangutans). However, these homologs differ in size, degree of laterality, cortical connectivity, and microstructure. In general, Broca’s area (BAs 44 & 45) and the anterior (Heschl’s gyrus) and posterior portions of Wernicke’s area (BA 22 corresponds to Tpt) show a more pronounced left-sided asymmetry consisting of larger cortical mini-column spacing for better connectivity (e.g., Buxhoeveden et al., 2001; Golestani et al., 2007; Tzourio-Mazoyer & Mazoyer, 2017). AF of the dorsal stream projects further into the middle and inferior temporal cortex. Wider-spaced mini-columns enable a higher resolution of phonological processing, while a denser structure with a lower resolution is associated with holistic-like processes (e.g., Hopkins et al., 2009; Palomero-Gallagher & Zilles, 2019; Schenker et al., 2008, 2010; Spocter et al., 2010; Wilson & Petkov, 2011). Furthermore, it has been reported that BA 44, but not BA 45, is left-over-right asymmetric in individual adult human brains (n = 10; Amunts et al., 1999), but asymmetry of BAs 44 & 45 seems to change throughout the lifespan, and BA 44 is later maturing (Amunts et al., 2003). Left-sided asymmetry of neuropil spacing (space between neurons and glial cells) in the gray matter has, however, also been found in other cortical regions, such as the visual and primary motor cortex (Amunts et al., 1996, 2007; Seldon, 1981a, 1981b). It is, therefore, plausible to assume that some cortical changes are a direct outcome of environmental factors and relate to different behavioral-cognitive activities that are not necessarily language-specific. Moreover, MTG and ITG expanded in the hominin lineage. In macaques (Macaca) and chimpanzees (Pan troglodytes), AF reaches posterior STG. In modern humans, however, AF also projects massively into MTG and ITG (e.g., de Schotten et al., 2012; Rilling, 2014; Sousa et al., 2017).
The precise neuroanatomical changes associated with extinct ancestral hominins are difficult to reconstruct as the only evidence relies on endocasts (Holloway, 1978). Australopithecus (A.) species mainly lived between 4.4 and 1.4 mya in eastern and southern Africa during the Pliocene and Pleistocene cooling periods. The fossil remains of the bipedal A. afarensis show hybrid anatomical features (such as dentition and shape of skeletal structure) between Homo species and nonhuman great apes. Paleoneurological evidence points to an expansion of the superior and inferior parietal region at about 3 mya, which may have caused rewiring of the temporoparietal junction, including a region homologous to Wernicke’s area (Bruner et al., 2023). Endocasts of A. afarensis show that the lunate sulcus, which separates area V1 of the occipital lobe from the angular gyrus of the parietal lobe, is placed more posterior (Armstrong et al., 1991; Dart, 1925; Holloway et al., 2004). Endocasts of Homo erectus, from which anatomically modern humans are descended, show significant brain expansion up to 1,000 cc and a pronounced Broca’s cap. This bulge can be seen in an endocast at the level of the temporal pole. A more recent endocranial morphology study supports the view that the frontoparietal areas expanded in concert rather than separately (Ponce de León et al., 2021).
In sum, we assume that with the expansion and interconnectivity of the neural networks, signals became more discrete at the low end of the iconic-symbolic spectrum. Semantic relations between symbolic signals may have initially referred to perceptual criteria without syntactic constraints. Not only did experiences become internalized in symbolic representations, so did the relations between these concepts in terms of action-based event-structures. A critical role was certainly played by the increasing workspace and rehearsal capacity, but also by cortical control of signing and vocalization. We assume that abstract semantic categories are required for higher-order branching in general, whereas the n-ary Merge operation may be a byproduct of those properties. Before we discuss in more detail how extended syntax emerged from symbolic representations, let us briefly review the concept of a pragmatic grammar.
3 Reverse Linguistic Analysis
We find evidence that a precursor stage of extended syntax is rooted in simple syntax (Culicover & Jackendoff, 2005). We introduce the term pragmatic grammar here as the semantic or syntactic relations between lexical elements that are implicitly provided by pragmatics rather than by syntactic markers or word order. Asymmetric semantic relations may be based on non-verbal strategies, such as agent-first, and preference attributes may be mentally stored along with a symbolic unit. In general, the interpretation relies mainly on contextual information, prosody, or default strategies. A pragmatic grammar can be found in certain stages of first and second language acquisition, in agrammatic aphasia, in grammar acquisition of feral children, in contact languages and emerging sign languages (e.g., Bickerton, 1981, 1990; Jackendoff, 1997, 1999; Jackendoff & Wittenberg, 2014; Klein & Perdue, 1997; Progovac & Locke, 2009; Sebba, 1997; Tallerman, 2014).
An often-quoted example is the Malayan dialect of Riau Indonesian, which served in its history as a lingua franca. It is considered to be mono-categorial: it has virtually no syntactic categories, and the word order is based on pragmatic or prosodic strategies provided by an association operator (Gil, 2005, 2013, 2014). Depending on the context, listeners interpret ayam makan (chicken eating) or makan ayam (eating chicken) as we eat chicken, someone is eating chicken, someone eats chicken because of the chicken, the chicken is eating, etc. Presented out of context, the default associative strategy may be to understand chicken as the theme and not as the agent. Otherwise, the speaker has the option to use a grammatical marker. The existential marker ada in ada makan can be understood as there is an eating, someone’s eating, or he did eat although context is still required for a more precise interpretation.
Moreover, an interesting lexical pattern can be found at around 18 to 24 months during first-language acquisition. Children produce two-word utterances with a mean length of utterance (MLU) of two morphemes (range 1.75–2.25), such as give toy or Daddy go, whereas inflections and function words are rarely produced. In general, MLU gradually increases during acquisition, but different grammar stages can be differentiated (Brown, 1973). One study showed that workspace span abilities in 3-year-olds are a better predictor of MLU than age is (Blake et al., 1994). Children also go through these stages when the acquisition is delayed, as in the case of two deaf children who were not exposed to a first sign language until the age of six years (Berk & Lillo-Martin, 2012). A study with “post-childhood” first language learners of American Sign Language (ASL) with at least 9 years of language experience shows that pragmatic grammar (event knowledge) overrides word order, independent of the subject’s animacy. In contrast to the control groups, deaf native ASL signers and hearing second-language ASL signers consistently relied on word order (Cheng & Mayberry, 2019, 2021). In the case of restricted language experience in early childhood, a structural magnetic resonance imaging (MRI) study reveals negative changes in adjusted grey matter volume and cortical thickness in bilateral frontotemporal regions. However, no anatomical changes are reported when deaf infant signers are compared to hearing infant speakers (Cheng et al., 2019, 2023).
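MLU is the total number of morphemes divided by the number of utterances. The following minimal sketch shows the calculation, assuming the utterances have already been segmented into morphemes by hand. The first two utterances are the examples given above; the remaining three are hypothetical and serve only to illustrate the arithmetic.

```python
def mean_length_of_utterance(utterances):
    """Return the mean number of morphemes per utterance.

    Each utterance is a list of morphemes that has already been segmented.
    """
    if not utterances:
        raise ValueError("at least one utterance is required")
    total_morphemes = sum(len(utterance) for utterance in utterances)
    return total_morphemes / len(utterances)


sample = [
    ["give", "toy"],            # example from the text
    ["Daddy", "go"],            # example from the text
    ["more", "juice"],          # hypothetical
    ["ball"],                   # hypothetical
    ["want", "that", "cookie"], # hypothetical
]

mlu = mean_length_of_utterance(sample)
print(mlu)  # 10 morphemes / 5 utterances = 2.0
```

With this sample the value falls inside the 1.75–2.25 range reported for the two-word stage.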
Again, deaf children, who create home signs to communicate with their hearing parents, rely on a relatively fixed word order by distinguishing the agent role and placing the action in the final position of a sequence. Similar to spoken language, deaf children go through two gestural stages, and their developed home sign systems are more complex than the gestures used to support speech (Feldman et al., 1978; Goldin-Meadow, 2003; Goldin-Meadow & Yang, 2017). The well-known case of the Nicaraguan Sign Language also illustrates a gradual process from a basic to a more extended grammar. Initially, the deaf children used a word order based on pragmatic principles. The younger deaf children elaborated on these basic structures acquired from the older children and developed grammatical markers to express syntactic relations or verb agreement (Senghas et al., 2004, 2005). Further examples are the emerging Al-Sayyid Bedouin Sign Language (Sandler et al., 2005) and the isolated village sign language Central Taurus Sign Language (Caselli et al., 2014) which indicate similar basic-to-extended grammar patterns.
Again, adults who learn a second language without explicit instruction show a canonical linguistic competence called the basic variety across all examined pairs of first and second languages (Jackendoff, 1999; Klein & Perdue, 1997). Initially, second-language speakers tend to acquire words without inflections and rely on a word order based on pragmatic strategies. For example, the agent-first strategy, which often applies together with the focus-last strategy, is efficient in interpreting trigrams such as hit girl boy as The girl hit the boy rather than The boy hit the girl. The focus-last element often represents the result or significance caused by the agent. However, pragmatics typically tells us the intended meaning. For example, the strings drink milk Bob and drink Bob milk will always be understood as Bob drinks milk.
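The interplay of the two strategies with world knowledge can be sketched as a simple decision procedure: if world knowledge singles out exactly one plausible agent, use it; otherwise fall back on agent-first, with the remaining element in focus-last position. The word lists and the animacy criterion below are hypothetical simplifications introduced for illustration and are not taken from the cited studies.

```python
# Hypothetical mini-lexicon standing in for world knowledge.
ANIMATE = {"girl", "boy", "Bob"}
ACTIONS = {"hit", "drink"}


def interpret(words):
    """Assign roles to an uninflected word string.

    Assumes the string contains one action word and at least one entity.
    """
    action = next(w for w in words if w in ACTIONS)
    entities = [w for w in words if w not in ACTIONS]
    animate = [w for w in entities if w in ANIMATE]
    if len(animate) == 1:
        agent = animate[0]      # pragmatics: only one plausible agent
    else:
        agent = entities[0]     # default: agent-first
    rest = [w for w in entities if w != agent]
    focus = rest[-1] if rest else None  # default: focus-last
    return {"action": action, "agent": agent, "focus": focus}


print(interpret(["hit", "girl", "boy"]))    # agent: girl, focus: boy
print(interpret(["drink", "milk", "Bob"]))  # agent: Bob, focus: milk
print(interpret(["drink", "Bob", "milk"]))  # agent: Bob, focus: milk
```

In the reversible string hit girl boy, word order decides; in the two drink strings, world knowledge yields the same reading regardless of order.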
Individuals who suffer from brain lesions show systematic linguistic deficits. In the case of agrammatic aphasia, patients often fall back on the agent-first strategy since they have particular difficulties with function words and assigning thematic roles. Accordingly, they have a high error rate in understanding reversible passive sentences or object-relative clauses in which the patient is mentioned first (e.g., Caplan et al., 1985; Caramazza & Zurif, 1976). Another example is feral children who have difficulties acquiring the grammatical competence of native speakers. Genie, a well-known victim of severe child abuse, was not exposed to language until the age of 13 years. She quickly acquired words after her rescue, but her grammar remained far behind despite many years of intensive training (Curtiss, 1977).
A pragmatic grammar also resurfaces in standard fully-fledged languages. These structures include the agent-first strategy, minimal attachment of modifiers, literal and figurative lexical collocations, and syntactically freely placed adverbial expressions. If pragmatic grammar was a precursor stage in evolution, its interpretative processes relied on contextual information, world knowledge, theory of mind about subjective intentions or social conventions, and on additional gestural, vocal, facial, or postural cues. All these aspects are still part of spontaneous speech today. We assume that the refinement of grammatical structure, including extended syntax, is closely related to the implications of the social brain hypothesis. The social brain hypothesis implies a correlation between social group size and neocortex size in primates. In modern humans at least, this correlation is mediated by mentalizing skills and associated with the theory of mind network that links the prefrontal cortex with the temporal lobe (e.g., Dor, 2015; Dunbar, 1996, 2005, 2009; Dunbar et al., 2015; Roberts et al., 2022). The social brain hypothesis is consistent with the previously mentioned concept of global workspace functions. These functions are considered here to be significant for the development of extended syntax in languages.
4 The Emergence of Semantics and Syntax
An answer to how the capacity for extended syntax and morphology emerged remains speculative. However, indirect evidence from various disciplines, particularly paleoanthropology, lets us sketch a plausible scenario. Our starting point is the signal exchanges of our closest extant relatives, monkeys and genus Pan. Monkeys combine no more than two vocal signals, and the meanings seem to be idiomatic-like or combinatory rather than compositional (e.g., Arnold & Zuberbühler, 2008; Cheney & Seyfarth, 1990; Seyfarth & Cheney, 2003; Zuberbühler, 2019; Zuberbühler & Bickel, 2022). Again, trained or enculturated chimpanzees occasionally produce flexible bigrams to express immediate needs (e.g., Crockford & Boesch, 2005; Girard-Buttoz et al., 2022; Goodall, 1986; Savage-Rumbaugh et al., 1986).
Apart from fossilized bones, the most striking clues about cognitive behavior in the hominin lineage are the development of the lithic tool industry, from basic pounding tools to flint knapping. A. afarensis already engaged in habitual tool manufacture as early as 3.4 mya (Skinner et al., 2015), while flint-knapping as part of the Acheulean assemblage was a domain of Homo erectus. But what kind of abilities do these tools indicate concerning the evolution of language? The oldest hominin tool users were individuals of the species A. afarensis. This species lived about 3 million years ago and applied Oldowan techniques.
These techniques require only basic goal-oriented behavior, consisting of a few percussions, and indicate sequential steps. In contrast, the Acheulean techniques are associated with Homo erectus, a species with a significant increase in cortical mass and connectivity (up to 1000 cc) compared to Australopithecus (450 cc). In particular, manufacturing a hand-axe at around 1.6 mya required more than 50 percussions, from which several goal-oriented steps can be inferred (e.g., Gowlett, 2006; Holloway, 2008, 2012). The manufacturing steps of the Acheulean techniques were removing the core's surface layer, detaching large flakes for bifacial thinning, finer thinning and shaping, and preparing the edge. Finishing work was done with wooden or bone hammers to control the flaking process better. These techniques imply visual affordance and manual actions to be planned and sequentially combined. Moreover, it is also argued that the late Acheulean techniques (< 800k years ago) imply hierarchical steps and nested part-whole structures (Stout, 2011; Stout et al., 2008). More recently it has been argued, however, that action grammar is sequential in nature and shows weak compositionality (Coopmans et al., 2023).
Two aspects are of particular interest here. First, we find a significant increase in the complexity of toolmaking from Oldowan to late Acheulean. Second, functional MRI studies simulating late Acheulean toolmaking steps and language production both activate the inferior frontal gyrus, including Broca’s area (e.g., Molenberghs et al., 2009; Stout et al., 2021; Uomini & Meyer, 2013). This finding supports the thesis that Broca’s area is involved in processing more complex intentional actions (Fedorenko et al., 2012; Koechlin & Jubault, 2006). To what extent BA 44 or BA 45 or further subdivisions thereof are specifically involved in action grammar, much like for symbolic computations, requires further research. Furthermore, as mentioned before, we can find dominant social group structures or structured representations in great apes’ cognition (Planer & Sterelny, 2021). Since action grammar developed during a period of more than 2 million years, it is a plausible assumption that behavioral changes had an incremental impact on early hominins’ cognition, brain structure, and circuits. Thus, it is possible that action grammar initially provided the neurophysiological foundation for symbolic grammar. We argue here that not only Broca’s area and its subdivisions may have gradually emerged, but also the complete frontotemporal circuit, which provided a substantial increase in workspace.
Another plausible link between action and symbolic grammar is suggested by the technology hypothesis, which states that skills of stone-tool making were culturally transmitted by gestural language (e.g., Corballis, 2003; Fazio et al., 2009; Fitch, 2014; Fujita, 2009; Fujita & Fujita, 2022; Lombao et al., 2017; Morgan et al., 2015; Stout & Chaminade, 2012). Thus, imitation and pantomimes were cardinal for teaching tool manufacturing and informing about weather conditions, predators, or locations of food resources (e.g., Arbib, 2011, 2012; Gärdenfors, 2017, 2021). In particular, the teacher-student relationship may have played a significant role. The partial transfer to vocal instructions was a success story. Although the following evolutionary steps of pragmatic grammar lack direct empirical evidence, they seem plausible, and most are debated in the literature.
Initially, the signing was iconic and holistic (including pantomimes) in both the gestural and vocal domains, and imitated sounds and shapes of the perceived habitat. An onomatopoeia that resembles the sounds that it describes is perhaps one residual. Another type is sound-shape congruency, such as the bouba-kiki effect, which shows that sounds may be linked to shapes across cultures in a way that is non-arbitrary. For example, speakers associate the nonce word bouba with a round shape and kiki with a spiky shape (e.g., Ćwiek et al., 2022). The development of discrete concepts includes a gradual dissociation from iconicity towards symbolism. The role of iconicity in ASL indicates that the development towards sensory-independent meanings might also be motivated by easing process demands. In one study, only new, hearing ASL-learners benefited from sign iconicity, in contrast to proficient ASL-English bilinguals. Different factors might be related to this outcome. One explanation is that bilinguals’ iconic sign computations are conceptually mediated, slowing down processing time. One possible conclusion is that ASL-English bilinguals process symbolic (non-iconic) signs more efficiently than iconic signs since concepts can be directly accessed (Baus et al., 2013; Emmorey, 2014). The evolution of a non-iconic semantic network may therefore be motivated not only by the increasing number of lexical options but also by having direct access points to discrete concepts. Other important factors may have contributed to this emerging process, such as gossiping, grooming, motherese, or pair bonding (e.g., Számadó & Szathmáry, 2006).
However, the most challenging question is how vocalizations associated with emotional arousal acquired a phonetic, speech-like format. As a first step, vocalizations may have been used intentionally before the sound patterns became arbitrary and were applied to speech. Thus, the segmentation process was also gradually implemented at the sound level before a speech-like format was developed. One idea is that the segmentation of holistic chunks of sound patterns produced distinct syllables (MacNeilage, 2008). Moreover, the increasing demand for more (content) words called for affixation and hierarchically organized structures of the sound patterns (e.g., Carstairs-McCarthy, 1999; Jackendoff, 1999; Wray, 1998). The duality of patterning was born (Hockett, 1959).
We suggest that semantics, along with phonology, was born before basic or extended syntax. Collocations of two or three words may have been the standard pattern of early pragmatic grammar. Further gradual and incremental developments can be assumed at the phrasal level to create asymmetric relations between words. These relations are conceptually grounded, such that the action follows the entity causing the action. Along with population growth in the hominin lineage, social bonding and cooperation in all aspects of life were mutual, reciprocal processes (e.g., Scott-Phillips, 2007). At all linguistic levels, compositional structures became prominent. Concepts and their semantic relations were mentally consolidated through argument structures, thematic roles, and phrase structures.
When extended syntax, including Merge, emerged in the hominin lineage remains controversial. We agree with the generative model that it emerged relatively recently and may coincide with the appearance of behavioral modernity. However, we assume that the capacity for extended syntax was already in place in Homo erectus but was not used because pragmatic grammar was sufficient for their socioecological needs. Restricting any form of language-readiness to archaic or modern humans seems anthropocentric considering the long history of hominin evolution. Our assumptions are based on the following.
According to conservative estimates, the species Homo erectus was around for about 1.8 my, but within its lineage, there are significant anatomical variations. The brain volume of late Homo erectus increased to 1,000 cc, and the brain had human-like prefrontal and temporoparietal regions (Wynn, 1998). Fossil records show, moreover, a Broca’s cap morphology. Homo erectus not only developed Acheulean tools (e.g., Shea, 2016) but also traveled long distances to the south and north of Africa and out of Africa to the Middle East and China. Moreover, they built water-transport crafts to reach the island of Java (Dubois, 1894). Since this species had the social and technological skills to build boats, these large-scale social group activities imply that individuals had the ability to plan for the future and to make predictions about new habitats. Above all, it implies that they presumably developed a language-like system, such as pragmatic grammar, to share knowledge (Everett, 2017; Gil, 2008). Further support for this assumption is the discovery of the 700–230 ky old Berekhat Ram figurine, which appears to demonstrate symbolism. This figurine has been associated with Homo erectus (d’Errico & Nowell, 2000).
These technological and aesthetic skills point to a more sophisticated social culture quite distinguishable from any cultural activities seen before in the hominin lineage. At the same time, it is obvious that extended syntax, argument structures, and rich morphology were not needed in the context of the socioecological conditions in which Homo erectus lived. However, this species may have had the innate capacity to generate those extended linguistic structures on the basis of a pragmatic grammar. We assume, furthermore, that the binary branching implied by Merge does not play a crucial role in modern languages and does not exclusively define syntax or language (Pinker & Jackendoff, 2005). Finally, since Homo erectus increasingly used manual skills, we believe that vocalizations successively replaced gestures while the latter kept their supplemental function. Human language consists of multiple components that evolved separately or in concert. It is, therefore, difficult to single out a particular hominin species that was equipped in a single step with these various basic and extended cognitive and language-related components.
5 Conclusion
We suggest different evolutionary milestones in the evolution of syntax. Nonhuman primates, including monkeys and nonhuman apes, primarily produce vocal signals to express states of emotional arousal, occasionally combining two signals. Although their brains share homologous structures with the brains of modern humans due to common evolutionary ancestry, their neural mass and their connectivity at the synaptic level and between cortical and subcortical regions are not specifically designed for elaborate symbolic mentalizing. In contrast, the human brain supports cortical control of conceptualizations, whereas rehearsal operations provide maintenance and updates of these representations and increase workspace capacities. We assume, furthermore, that ventral streams mainly support pragmatic grammar while extended hierarchical branching is associated with dorsal streams. Here, Broca’s area seems to work like a buffer in which information is unified and linearized for output. In contrast, semantic and syntactic structures are generated in posterior regions, including Wernicke’s area and PTJ (e.g., Boeckx et al., 2014; van der Lely & Pinker, 2014).
The evolution of language in functional terms implies several milestones. Although various scenarios are possible, the general picture we suggest is as follows: Early hominins may have mainly relied on iconic and holistic signals, including pantomimes, which resembled emotional arousal states and information perceived in the environment. Subsequently, segmentation took place at different levels. Concepts became discrete, and sound patterns symbolic. Two or three words were combined according to action-based event structures. This pragmatic grammar stage, also indicated by reverse linguistic analysis, can presumably be associated with critical genotype changes in Homo erectus that provided the foundation of extended symbolic computations.
The externalization of thoughts was an overwhelming benefit for our ancestors. Along with population growth and the increasing demand for social collaboration, semantic roles as found in non-verbal event structures of action grammar became internalized. The externalization of these semantic structures in the form of symbolic representations brought about pragmatic grammar. We do not believe that genotype differences between Homo erectus and Homo sapiens sensu lato (anatomically modern humans, (pre-)archaic Homo sapiens, Homo heidelbergensis, Neanderthals, and Denisovans) were critical for the development of extended syntactic and morphological structures. Instead, these structures can be considered cultural accumulations.
The development of extended syntax may have started with the generation of treelets, that is, small templates of syntactic nodes that are typically underspecified in some respects of sentential tree structure. These treelets can be readily accessed and integrated into larger structures (J. D. Fodor, 1998; Sakas & J. D. Fodor, 2012). The path was set for basic and extended syntactic branching, which includes hierarchical structures. Such structures were also implemented at the phonological and morphological levels. The binary branching of Merge and its iterative application is only one form of possible syntactic branching. Other strategies to organize phrases and sentences are equally important, including idiomatic collocations, metaphoric expressions, treelets, and n-ary branching. After all, the beauty of language is its diversity.