Cognitive Phonetics: The Transduction of Distinctive Features at the Phonology-Phonetics Interface

We propose that the interface between phonology and phonetics is mediated by a transduction process that converts elementary units of phonological computation, features, into temporally coordinated neuromuscular patterns, called ‘True Phonetic Representations’, which are directly interpretable by the motor system of speech production. Our view of the interface is constrained by substance-free generative phonological assumptions and by insights gained from psycholinguistic and phonetic models of speech production. To distinguish transduction of abstract phonological units into planned neuromuscular patterns from the biomechanics of speech production usually associated with physiological phonetics, we have termed this interface theory ‘Cognitive Phonetics’ (CP). The inner workings of CP are described in terms of Marr’s (1982/2010) tri-level approach, which we used to construct a linking hypothesis relating formal phonology to neurobiological activity. Potential neurobiological correlates supporting various parts of CP are presented. We also argue that CP augments the study of certain phonetic phenomena, most notably coarticulation, and suggest that some phenomena usually considered phonological (e.g., naturalness and gradience) receive better explanations within CP.


Introduction
This paper aims to elucidate the nature of a cognitive system that takes as its input a representation consisting of distinctive features (i.e., the output of the phonological module) and generates a representation directly interpretable by the neuromuscular system associated with speech production. This system we will call 'Cognitive Phonetics' and the representations it generates 'True Phonetic Representations'.¹ This paper draws on both the phonological and phonetic literature. Unsurprisingly, as generative linguists, our interpretation of these two traditions conflicts rather sharply with that of more phonetically oriented scholars. Thanks to the critical comments of two such reviewers, we have tried to clarify our assumptions and inferences about both phonetics and phonology.
Even if these perspectives remain incommensurable, we hope to have made the sources of disagreement and incompatibility more evident in light of the reviews we received.
Here we will concentrate solely on speech (pre)production, leaving the perceptual direction of this system aside whenever possible. In line with the theme of this volume, our inquiry is a resuscitation of certain proposals made by Eric Lenneberg 50 years ago (see section 2), recast in the modern biolinguistic research program advocated by David Poeppel and colleagues as an attempt to unify theoretical linguistics and cognitive neuroscience.
Our point of departure is a fairly well-established claim: Surface (also known as 'phonetic' or 'output') representations of the phonological component of a generative grammar are matrices of distinctive features (where columns represent segments).² During most of the 1960s, it was usually assumed that the features of underlying and surface representations are entities of a different kind, the former being binary, the latter gradual scales (Chomsky & Halle 1968: 297). However, one aspect of Postal's (1968) 'naturalness condition' seems to have been, often tacitly, adopted over the following decades, after a brief period of uncertainty: the statement that a surface representation is identical to its underlying representation (and therefore composed from the same set of representational elements) except as requested otherwise by phonological rules. Thus in the early 1970s, in an influential compendium on the contemporary issues in phonological theory, Maran (1973), discussing classificatory (phonological) and phonetic features, concluded that
"[w]e do not, however, claim at this stage that the set of abstract phonological features is identical in membership to the set of phonetic features. There are many things which remain unclear." (Maran 1973: 73)
But already by the late 1970s a consensus seems to have emerged that underlying and surface representations do consist of the same vocabulary of features:
"Assuming that utterances are best represented as a string of feature matrixes at the phonetic level, we can raise the question of how sounds are represented for the purpose of phonological description (i.e., in the UR and at all intermediate levels). [. . .] [A] fundamental tenet of generative phonology has been that sounds are most properly represented at these levels in the same way they are phonetically, namely, as feature matrixes in which each feature describes an articulatory and/or acoustic property of the sound." (Kenstowicz & Kisseberth 1979: 239)
If we assume that URs and SRs belong to the same cognitive module, that is, the phonological module, and if we assume that a 'module' may operationally be defined as an encapsulated computational system that operates over a particular kind of abstract units (Boeckx 2009: 125-127), it follows that all levels of phonological representation are built from the same set of primitives (Hale & Kissock 2007: 83). Thus the output of the phonological module, the surface representation, also consists of matrixes of distinctive features.
1 [. . .] linguistic knowledge. We use the term in a broader sense, as a scientific abstraction in general, similar to how H2O 'represents' water in the formal statement of chemical processes. The main difference between a surface representation and a true phonetic representation, as will be shown in greater detail in section 4, is that the former represents knowledge (competence), and the latter represents information feeding speech production. This more general sense of usage is in line with Marr's (1982/2010: 20) definition of 'representation' as "a formal system for making explicit certain entities or types of information together with a specification of how the system does this".
2 Other data structures have been proposed, such as the feature geometry trees of Sagey (1986) and related work, but the simpler feature matrix structure is sufficient for our discussion.
We understand distinctive features here as a particular kind of substance-free units of mental representation, neither articulatory nor acoustic in themselves, but rather having articulatory and acoustic correlates, as Halle (1983/2002: 108-109) and Reiss (2018, chapter 15.7) have pointed out. Many influential phonological texts have stated over the last several decades that features serve as a bundle of information that the brain sends to the articulators (if speech is the chosen modality). Here are three examples of such statements:
"In articulatory terms each feature might be viewed as information the brain sends to the vocal apparatus to perform whatever operations are involved in the production of the sound, while acoustically a feature may be viewed as the information the brain looks for in the sound wave to identify a particular segment as an instance of a particular sound." (Kenstowicz & Kisseberth 1979: 239)
"[. . .] [T]he distinctive features correspond to controls in the central nervous system which are connected in specific ways to the human motor and auditory systems. [. . .] In producing speech, instructions are sent from higher centers in the nervous system to the different feature boxes in the middle part of (5) ['tone', 'vocal', 'labial' etc.; vv & cr] about the utterance to be produced." (Halle 1983/2002: 109)
"The [. . .] featurally specified representation constitutes the format that is both the endpoint of perception - but which is also the set of instructions for articulation." (Poeppel & Idsardi 2011: 179)
If one thinks about how exactly features engage the articulatory system, it becomes apparent that there is a substantial conceptual gap between features and neural structures or activities. At present there is no way to link either the general concept 'distinctive feature' or any of the particular features (e.g., [CORONAL]) to any known neural structure (e.g., dendron, neuron, cortical column etc.) or activity (e.g., long term potentiation, oscillation, synchronization etc.; see Embick & Poeppel 2015). In fact, there seems to be very little understanding of how the brain exactly represents and computes any of the units or processes that are part of linguistic competence (Chomsky 2000a; Gallistel & King 2010; Mausfeld 2012). In other words, the units of linguistic computation and the units of neurological computation, as currently understood, are mostly incommensurable. This problem was therefore dubbed 'the ontological incommensurability problem' by Poeppel & Embick (2005). The proposed solution to it is to decompose a particular linguistic domain (e.g., phonology) into formal units and operations that are as basic and as generic as possible, and then formulate biologically plausible and scientifically productive 'linking hypotheses' across the fields of linguistics and neuroscience (Poeppel & Embick 2005, Poeppel 2012, Embick & Poeppel 2015).
The main goal of this paper is to formulate a hypothesis about the 'intermodular bridge' (Pylyshyn 1984: 147) from the symbolic and substance free (phonology) to the physical and substantive (phonetics). By pursuing this line of inquiry a modest attempt is made to formulate a theory of the phonology-phonetics interface³ in strict biolinguistic terms, that is, in such a fashion that it can be linked to the kind of neurobiological activity that we might plausibly find in a neuromuscular system.
Distinctive feature theory was initially outlined by Roman Jakobson in a lecture delivered in 1928 (see Jakobson 1971: 3-6) and in an often overlooked paper from the late 1930s (Jakobson 1939), and subsequently elaborated by Jakobson, Fant & Halle (1952) and Jakobson & Halle (1956). The idea of a 'distinctive feature' was founded upon purely phonological (that is, non-biological and non-cognitive) insights about phonemic oppositions in the vein of Trubetzkoy (1939/1969), as shown in the following passage:
"Any minimal distinction carried by the message confronts the listener with a two-choice situation. Within a given language each of these oppositions has a specific property which differentiates it from all the others. The listener is obliged to choose either between two polar qualities of the same category, such as grave vs. acute, compact vs. diffuse, or between the presence and absence of a certain quality, such as voiced vs. unvoiced, nasalized vs. non-nasalized, sharpened vs. non-sharpened (plain). The choice between the two opposites may be termed distinctive feature. The distinctive features are the ultimate distinctive entities of language since no one of them can be broken down into smaller linguistic units. The distinctive features combined into one simultaneous or [. . .] concurrent bundle form a phoneme." (Jakobson, Fant & Halle 1952: 2)
Despite many revisions of the theory during the following decades (e.g., Chomsky & Halle 1968: 298-329, Halle & Clements 1983, Clements 1985, Clements & Hume 1995), it stands to reason that distinctive feature theory was never meant to face one of the more difficult questions of modern biolinguistics and of cognitive neuroscience in general, namely, how to bridge the gap between a cognitive faculty, in this case phonological competence partly represented by features, and the brain. The existence of features themselves should not be in question: they have withstood almost a century of rational and empirical scrutiny and are considered "to be a scientific achievement on the order of the discovery and verification of the periodic table in chemistry" (Jackendoff 1994: 60). Also clear is the fact that features are somehow interpreted by the sensorimotor (SM) system because utterances are effectively externalized and perceived/parsed. Therefore, a question that logically follows from these facts is how exactly to get from discrete, timeless, abstract cognitive entities (features), on the one hand, to temporally arranged articulatory movements and ultimately to continuously varying sound waves, on the other.
Here we will adopt the position that cognition, including linguistic cognition, is best understood as a set of modules (see Chomsky 1984 and Curtiss 2013 for justification), each of which is characterized by mappings involving inputs and outputs in a particular format (Reiss 2007, section 2.1). Modules are connected via 'interfaces', that is, configurations in which the outputs of one module serve as the inputs to another module. We argue that the interface between the phonological component of the grammar and phonetics (in this case starting with the neurophonetics of speech production, that is, with sending efferent neural commands to speech organs) is mediated by a system that transduces features into True Phonetic Representations: arrays of temporally coordinated neuromuscular information directly interpretable by the motor system in charge of speech production. An assumption implicit in this proposal is that distinctive features, as currently conceived in modern literature, are not directly intelligible to the SM system. It is a non-trivial matter to show why this is so, and we return to this issue in section 3.
3 An influential source on this topic is the collection of papers in the special issue of the Journal of Phonetics (1990) dedicated to the relationship between phonetics and phonology. In the course of this paper, we will address what we consider to be some shortcomings of these previous discussions of the phonology-phonetics interface.
Thus, our research question is that of transduction of distinctive features at the phonology-phonetics interface, which necessarily precedes speech production. A convenient and productive way to fractionate this question and begin to approach it is to adopt Marr's (1982/2010) three-level perspective that specifies, for any cognitive information-processing system, its computational level ('What is computed and why?'), algorithmic level ('How is it computed?'), and implementational level ('How is it realized physically?'). It should be noted that these three levels of analysis do not state some fundamental truth about cognitive systems in general (e.g., that every cognitive system consists of three levels); rather, these are explanatory devices that provide a convenient way of dividing a cognitive system in order to study it, or in Marr's (1982/2010: 24) words, these are "the different levels at which an information-processing device must be understood before one can be said to have understood it completely". Since the cognitive system under study is an information-processing device, we will frame our discussion in Marr's terms.
The rest of the paper is structured as follows. In section 2 we revisit Lenneberg's (1967) Chapter Three where he introduces abstract neuromuscular schemata to account for the transformation of basic phonological units, segments in his case, into muscular events. In section 3, we state in more detail some general properties of Cognitive Phonetics, our proposed interface theory; we show how it can be constrained by both phonological and phonetic considerations; and we provide arguments for why features need to be transduced before a representation can be legible to the SM system. In section 4, we define the transduction of features into True Phonetic Representations following Marr's (1982) tri-level approach and we explore its neurobiological substrate. In section 5, we pursue several direct consequences of viewing the phonology-phonetics interface this way and introduce the concept of 'intrasegmental coarticulation'. We conclude (section 6) by summarizing our results and by pointing out some further research strategies that follow directly from our insights.

2. Lenneberg's Neuromuscular Schemata
Lenneberg (1967: 89-90) was well aware of the complexity of the relationship between discrete, logically ordered phonological units (phonemes, segments) on the one hand, and continuous articulatory movements with concomitant acoustic results on the other. He recognized that although some acoustic discontinuities corresponding to segment transitions are detectable in a spectrogram, in general, these boundaries are not apparent, and the acoustic record of speech provides very limited information about phonological organization. This complexity is of course mirrored in speech production, since discrete sequences of segments correspond to continuous movements of physical systems: "[w]hen we think of the entire musculature of the speech apparatus in activity, we realize that there is a continuous waxing and waning in states of contraction throughout these muscles" (Lenneberg 1967: 90). The relation between phonological units and articulatory movements is further complicated by various directions, scopes and types of segmental coarticulation: "[t]he muscular activity associated with one phoneme is influenced by the phonemes that precede and follow it" (Lenneberg 1967: 92). As was already understood at that time (Öhman 1966, 1967), and as subsequent research has confirmed (Hardcastle & Hewlett 1999), coarticulation is a ubiquitous phenomenon that obliterates the neat, beads-on-a-string-like succession of phonological segments. A further problem that Lenneberg emphasized is that the order and duration of events at different levels of phonetic organization (perceptual, acoustic, neural) are not perfectly aligned:
"The perceptual order of speech sounds need not be identical with the order of acoustic correlates (we may ignore or fail to hear certain acoustic phenomena); the order of acoustic events need not be identical with the order of motor or articulatory events (movements occur that do not produce sound or sound-changes); the order of central neuronal events may be different from the order of peripheral motor events (certain nervous impulses must be initiated in advance of others because traveling time to the periphery is longer for some pathways [e.g., the recurrent nerve supplying the muscles of the larynx; vv & cr] than others [e.g., the trigeminal nerve innervating the muscles of the jaw; vv & cr])." (Lenneberg 1967: 93)
Lenneberg's discussion illustrates how segmental units of surface representations radically differ from their realizations. The former are discrete, timeless, neatly ordered mental abstractions, the latter continuous, dynamic, overlapping, coordinated movements of respiratory, phonatory and articulatory organs. The magnitude of this mismatch is even greater when we take into account the tremendous complexity of the neuromuscular mechanisms by which mental representations are realized. The production of speech is the most complex neuromuscular activity human beings ever come to master, requiring temporal coordination of over 100 muscles controlled by more than 1400 motor commands per second (Stetson 1951, Lenneberg 1967: 91-92, Laver 1994: 1). Stated this way, it becomes apparent that the mental unit represented as [t] on the one hand, and the sound of producing that unit on the other, are separated by a considerable gap. The problem, then, is to explicitly relate the two sides, taking into account their fundamentally different natures.
Lenneberg (1967: 98-107) proposed a two-step process which, essentially, transmutes segments into real-time muscular activity. A few caveats are due before sketching his proposals. First, Lenneberg's discussion is based on the production of idealized utterances. His examples are not drawn from observed speech, but are models of the process of speech production applied to hypothetical tokens. A related second point is that Lenneberg's proposal is not intended as part of a psycholinguistic theory of language use, what is sometimes called a 'psychologically real' model of speech production. Similar to the components of Marr's (1982) tri-level analysis, the components of Lenneberg's model are "theoretical stages that help us visualize the complications of speech production" (Lenneberg 1967: 99). Third, Lenneberg takes segments, not distinctive features, to be the basic phonological units, and uses a traditional structuralist terminology: 'phonemes' for abstract segmental distinctive units, 'phones' for their intended realizations. One of our primary goals in this paper is to show how Lenneberg's insights can be further developed by combining them with a finer level of phonological representation using distinctive features.
Lenneberg's model, as shown in Figure 1, takes a string of phones as its input and applies two operations: (1) it assigns muscle activity to each phone; (2) it orders that muscle activity temporally. Both medial processes of Figure 1 may be represented in the form of a schema. Lenneberg represented the assignment of muscle activity to each phone with a table where columns stand for successive phones, and rows for muscles relevant for their production (Figure 2). This schema is intended as a matrix indicating which muscles are to be contracted in order to produce a given speech sound. Rows correspond to specific muscles (abstractly labeled from a to f), columns to phones; '+' means contraction of a given muscle, '0' means relaxation. For example, the schema in Figure 2 indicates that in order to produce phone IV it will be necessary to contract muscles b, c, d, e. Naturally, in actual cases of realization of phones, many more muscles are involved. The next step in transduction is to order muscular activity from Figure 2 temporally. This process is illustrated in Figure 3. A simplifying assumption is that the relevant muscles may be grouped into classes, here denoted as α through δ, ranked according to the time it takes neural impulses to travel from the brain stem and to reach the muscles in each class. Thus the α class of muscles has an activation latency that is four times greater than the δ class, three times greater than γ, and two times greater than β. A further simplification is that in this schema all phones are assumed to be of equal duration.⁴
Based on the classification of relevant muscles into latency groups, shown in the left table of Figure 3, the schema from Figure 2 is rearranged to obey this relative temporal order. The table on the right in Figure 3 shows that if a string of phones I to VI is to be realized correctly, then the first neuromuscular event to occur is the firing of impulses for contraction of muscle e; after that muscles b and c contract but e relaxes, and so on. Due to temporal shifting of the muscles associated with particular phones, the columns in this schema can no longer be put into one-to-one correspondence with the segments in the phonological string. It is here that the phonemic 'Easter eggs' are smashed (Hockett 1955: 210) and coarticulatory effects begin to emerge.⁵ Therefore, each column in the right schema of Figure 3 corresponds to a 'temporal segment' which indicates, for a given point in time, which muscles need to be contracted or relaxed. Unfortunately, Lenneberg does not discuss the details of this temporal arrangement. For example, he leaves unresolved the question of how much time one cell denotes: 5 ms, 10 ms, 20 ms? Time is represented abstractly in Figure 3, from 1 to 9, a reflection of the hypothetical and tentative nature of his discussion, that is, "merely stat[ing] what the neuronal firing order is on some given level in the brain" (Lenneberg 1967: 102).
4 This is a curious assumption/simplification on Lenneberg's part since four pages prior to describing the transduction of segments into neuromuscular schemata he discusses timing problems arising from differences in segmental duration (cf. Lenneberg 1967: 96-97). In fact, temporal discrepancies on various levels of phonetic organization are what initially prompted him to devise such a model of transduction.
5 The corresponding quote referenced here is as follows: "Imagine a row of Easter eggs carried along a moving belt; the eggs are of various sizes, and variously colored, but not boiled. At a certain point, the belt carries the row of eggs between the two rollers of a wringer, which quite effectively smash them and rub them more or less into each other. The flow of eggs before the wringer represents the series of impulses from the phoneme source; the mess that emerges from the wringer represents the output of the speech transmitter. At a subsequent point, we have an inspector [i.e., a hearer; vv & cr] whose task it is to examine the passing mess and decide, on the basis of the broken and unbroken yolks, the variously spread-out albumen, and the variously colored bits of shell, the nature of the flow of eggs which previously arrived at the wringer. Note that he does not have to try to put the eggs together again - a manifest physical impossibility - but only to identify." (Hockett 1955: 210)
The result of both steps in the transduction of phones into a neuromuscular schema is given in Figure 4. For each unit of time (abstractly denoted here as a temporal segment), the schema specifies which muscle needs to contract and across how many such units, that is, for how long. Within each column, events are assumed to be simultaneous. Notice that, for example, a_I in the fourth temporal segment, which is a muscle contraction associated with the phone ordered first in the string of Figure 2, is preceded by four muscle contractions unrelated to that phone (b_II, b_III, c_II, c_III). The anticipation of future events emphasizes the need for a model of speech preproduction that feeds the sensorimotor system with "a hierarchic plan in which events are selected [. . .] as an integration of all elements within units of several seconds duration" (Lenneberg 1967: 103). For reasons discussed at length (see especially Lenneberg 1967: 102-107), Lenneberg on page 106 explains that a 'sequential chain model' that scans the surface representation from 'left to right', interpreting linearly ordered segments, is not a viable model for relating phonology to phonetics. Instead, what is needed is a 'central plan model' of speech preproduction, which Lenneberg described as follows:
"On the lowest level, muscular contractions belonging to different speech sounds intermingle and therefore their sequencing cannot be programmed without considering the order of the speech sounds to which they belong. But the choice and sequencing of speech sounds cannot take place without knowledge of the sequence of morphemes to which the sounds belong. [Compare the two different pronunciations of the article the depending on whether the following morpheme begins with a consonant or a vowel; vv & cr] [. . .] On the next higher level, the level of morphemes, we encounter again the phenomenon of intermingling of elements and an impossibility to plan the sequence without insight into the syntactic structure of higher constituents. [. . .] On a still higher level, the level of immediate constituents, [. . .] syntactic elements cannot be ordered without knowledge of the entire sentence." (Lenneberg 1967: 106)
The need for a hierarchical central plan for speech production is thus just a specific example of a more general requirement for all levels of linguistic computation and behavior, a requirement that probably extends into other behavioral domains such as navigating through space.
In summary, Lenneberg (1967, chapter 3) already recognized the complexities involved in transforming a mental representation of a string of phones into a temporally coordinated sequence of muscular contractions. The result of this transduction may be understood as a neuromuscular schema such as given in Figure 4. The sequential arrangements of muscular events require preplanning with anticipation of later events. Therefore, the occurrence of some events is contingent upon other events yet to come, which may be adduced as proof that sequencing on a neuromuscular level is not accomplished by a sequential chain model (i.e., by scanning and interpreting a string of segments), but rather by a complex central plan model. The observed interdigitation of muscular correlates of a given phone is mirrored on higher levels of organization, for which a central plan model is also required. The importance of Lenneberg's work, foundational to biolinguistics, derives from his capacity to invoke and synthesize concepts and results from domains as diverse as phonology, phonetics, physiology and neurology.
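
Lenneberg's two steps (assigning muscle activity to each phone, then ordering that activity temporally) can be made concrete with a small sketch. The Python code below is purely illustrative: the muscle labels, the phone-to-muscle assignments, and the lead values are hypothetical placeholders and do not reproduce the contents of Lenneberg's figures. It assumes, as Lenneberg does, that all phones are of equal duration, and it models latency simply as the number of abstract time units by which a command must precede the nominal slot of its phone.

```python
from collections import defaultdict

# Step 1 (hypothetical data): which muscles contract for each phone.
# Phones are listed in phonological order; muscle names are abstract.
ASSIGNMENT = [
    ("I",   {"a", "d"}),
    ("II",  {"b", "c"}),
    ("III", {"b", "c", "e"}),
]

# Hypothetical latency classes: number of time units by which the
# command for a muscle must precede the nominal slot of its phone.
LEAD = {"a": 0, "d": 0, "b": 2, "c": 2, "e": 3}


def temporal_schema(assignment, lead):
    """Step 2: reorder muscle commands according to latency.

    Returns a list of temporal segments. Each segment is a sorted list
    of (muscle, phone) pairs whose commands are issued simultaneously.
    """
    raw = defaultdict(list)
    for position, (phone, muscles) in enumerate(assignment):
        for muscle in muscles:
            # Long-latency muscles are commanded earlier than the
            # nominal position of the phone they belong to.
            raw[position - lead[muscle]].append((muscle, phone))
    first, last = min(raw), max(raw)
    return [sorted(raw.get(t, [])) for t in range(first, last + 1)]


schema = temporal_schema(ASSIGNMENT, LEAD)
for t, events in enumerate(schema, start=1):
    print(t, events)
```

With these placeholder values, the commands for muscles belonging to phones II and III are issued before any command belonging to phone I, so the temporal segments no longer line up one-to-one with the phones. This is the kind of anticipation that motivates a central plan model over a sequential chain model.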

Phonology-Phonetics Interface (PPI)
One of the points that emerged from the previous discussion is that relating phonology and phonetics is a non-trivial and complex task. Lenneberg's views were generally a step in the right direction because he understood the need to explicitly address the conceptual gap between the units and operations characteristic of these two systems. Yet, there is room for further improvement by adopting ideas and findings that were mostly unavailable in the 1960s. In particular, the discussion of the phonology-phonetics interface (PPI) can be constrained from 'both sides', that is, by strictly adopting a constrained phonological theory which feeds the interface in production (section 3.1), and by using insights from modern models of speech production which are fed by this interface (section 3.2).

Phonology
On the phonological side, we assume a generative substance-free approach (Hale & Reiss 2000a, 2000b, 2008, Reiss 2018, Bale & Reiss 2018). Phonology is understood here as a component of the language faculty that involves formal computations over discrete symbolic units such as distinctive features, syllables, feet etc. Since phonology is a part of the knowledge of language, by definition "all the work in phonology is internal to the mind/brain" (Chomsky 2012: 48). Furthermore, representations involved in phonology are abstract and symbolic, that is, devoid of articulatory, acoustic, typological, statistical etc. information; computations involved in phonology treat features and other phonological units as arbitrary symbols (Hale & Reiss 2008: 169). All representational levels of the phonological component of a generative grammar (underlying, surface, and intermediate) consist of distinctive features (and perhaps markers of other segmental and suprasegmental structure, such as syllable or foot boundaries, which need not detain us here). This means that features are part of the 'representational alphabet' of the phonological module.
Representational levels are related by ordered phonological rules which serve as the computational aspect of phonology (Vaux 2008).
It is important to distinguish between computation and transduction. Computation is the formal manipulation (reordering, regrouping, deletion, addition, etc.) of representational elements within a module, and without a change in the representational alphabet. Transduction is a process of converting an element in one form into a distinct form, that is, a mapping between dissimilar formats. For example, in the process of hearing, air pressure differentials are transduced into biomechanical vibrations of the tympanic membrane and the ossicles of the middle ear, which are transduced via the oval window into fluidic movements within the cochlea, which are in turn transduced by the organ of Corti into electrical signals which are passed on for further processing in the nervous system. The distinction between computation and transduction facilitates conceptualizing the notion of modularity. A module can be thought of as a device which takes input representations and computes over them, generating thereby an output in the same representational alphabet. Modules of the mind (and of organic systems more generally) are linked by transducers which convert information in one form into a form required by the computational module fed by the conversion process. An interface between modules is therefore defined by (1) the form of the input, (2) the form of the output, and (3) a set of transformations that relate (1) to (2).
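
The contrast between computation and transduction can be illustrated at the level of types. The Python sketch below is a toy under stated assumptions: the feature names, the devoicing rule, and the `MotorCommand` format are hypothetical illustrations chosen for brevity, not a proposal about the actual content of phonological rules or of True Phonetic Representations. What it is meant to show is only the formal point: a computation maps feature matrices to feature matrices (same representational alphabet), whereas a transducer maps feature matrices to objects of a different format.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# A segment is a bundle of binary feature values; a representation is
# a sequence of segments (the columns of a feature matrix).
Segment = Dict[str, bool]
FeatureMatrix = List[Segment]


@dataclass(frozen=True)
class MotorCommand:
    """Hypothetical output format, distinct from the feature alphabet."""
    articulator: str
    onset: int      # abstract time unit
    duration: int   # abstract time units


# Computation: input and output share one representational alphabet.
Computation = Callable[[FeatureMatrix], FeatureMatrix]
# Transduction: input and output are in dissimilar formats.
Transducer = Callable[[FeatureMatrix], List[MotorCommand]]


def final_devoicing(matrix: FeatureMatrix) -> FeatureMatrix:
    """Toy rule: set the 'voice' feature to False on the last segment."""
    if not matrix:
        return []
    *rest, last = matrix
    return [dict(s) for s in rest] + [{**last, "voice": False}]


def toy_transducer(matrix: FeatureMatrix) -> List[MotorCommand]:
    """Toy mapping (an assumption): one command per voiced segment."""
    return [
        MotorCommand("larynx", onset=i, duration=1)
        for i, seg in enumerate(matrix)
        if seg.get("voice")
    ]


underlying = [{"voice": True, "nasal": False}, {"voice": True, "nasal": False}]
surface = final_devoicing(underlying)   # still a feature matrix
commands = toy_transducer(surface)      # a different kind of object
```

In terms of the three-part definition of an interface, `FeatureMatrix` plays the role of the form of the input, `List[MotorCommand]` the form of the output, and `toy_transducer` the transformation relating the two.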
By virtue of the form of its representations and operations, each module imposes 'legibility conditions' at its interfaces: If some information is to be legible to a given module, that information must come in a specific form in which that module operates (Chomsky 2000a: 9-14). Otherwise, that information would either not be received by that module at all or would be treated as noise (perhaps as human speech is noise to dogs, which lack the needed cognitive modules and transducers, even though their auditory system is far superior to that of humans).⁶ The SM system imposes certain legibility conditions on phonology, the component of the grammar with which it interfaces, most notably the condition that information must have a linear arrangement (one cannot produce eleven words in parallel) with certain temporal properties (one cannot produce a polysyllabic word in three nanoseconds). Linearity is a complex notion (see Cairns & Raimy 2011, Idsardi & Raimy 2013). For example, in phonological representations, several tiers may be distinguished (segmental, moraic, prosodic, etc.), leading to a kind of multilinearity characteristic of autosegmental phonology; also, in speech, many overlapping articulatory events may be detected, as will be shown in more detail in section 4. Nonetheless, the general idea of linearity, namely, that sequential ordering and precedence relations among basic units play an important role, seems to hold for both phonology and phonetics, unlike for syntax (Chomsky 1995: 334-340, Everaert et al. 2015). Another condition, to which we will return in more detail below, is the condition of bi-directionality: If the same phonological architecture is to be employed in both language comprehension and in speaking, that is, if it is not the case that humans use completely different grammatical devices for each direction,⁷ then the atomic representational units of phonology, features, must integrate acoustic and articulatory correlates.⁸ If a feature were defined exclusively in terms of, say, its articulatory correlates, as the feature [CORONAL] is, then in principle such a feature could not be used in phonological decoding.
As Chomsky put it: To be usable, the expressions of the language faculty (at least some of them), have to be legible by the outside systems. So the sensorimotor system and the conceptual-intentional system have to be able to access, to 'read' the expressions; otherwise the system wouldn't even know it is there. (Chomsky 2000b: 17)
In the phonological theory we adopt, features themselves are substance-free cognitive units (see Reiss 2018: chapter 15.7 for justification), that is, they do not contain information on the temporal coordination of muscle contractions, on the spectral configuration of the acoustic target to be reached, and so on. Yet without this information, the respiratory, phonatory and articulatory systems cannot produce speech. The motor system for speech production requires information about substance and time in order to arrange the articulatory score; this information therefore has to be integrated into a representation before being fed to the motor system. The most plausible way to escape this deadlock (i.e., phonology is substance free, but the SM system needs information about substance to produce speech) is to abandon the idea of a direct, unmediated interface between grammar/phonology and the SM system, and to posit a cognitive phonetic transduction system that converts distinctive feature matrices into True Phonetic Representations that provide the SM system with the legible information needed to produce speech.
In summary:
• Outputs of the phonological module, surface representations (SRs) consisting of substance-free features, do not contain substantial and temporal information.
• The SM system requires articulatory, auditory and temporal information in order to produce speech.
∴ SRs are not legible to the SM system and phonology cannot in principle feed speech production directly.
∴ The interface between phonology and the SM system is mediated by transduction.
Before turning to the nature of this transduction system, let us review how modern models of speech production further constrain our approach to the PPI.

Speech Production
On the side of speech production, modern models such as DIVA (Guenther 1995a, 1995b, Guenther et al. 1998, 2006, Tourville & Guenther 2011, Guenther & Vladusich 2012), HSFC (Hickok 2012), LRM (Levelt et al. 1999, Indefrey & Levelt 2004), and MAPL (Poeppel & Idsardi 2011) provide several theoretical and empirical constraints on the nature of representations that directly feed the SM system during speech. In constructing his model of transduction of phones into neuromuscular schemata, Lenneberg (1967, chapter 3) made the assumption that this process involves reaching specific articulatory targets and took into consideration only the distribution of muscle contractions in time. However, more recent research showed that these targets include auditory information as well. Speech production is a mechanism in which feedforward and feedback processes are tightly and intricately related, as witnessed by the general architecture of the Directions Into Velocities of Articulators (DIVA) model, currently the most elaborate and empirically validated model of speech production (see Figure 58.3 of Guenther & Hickok 2016: 728).
[...]ticulation. It is known that a substantial decline in articulation can occur in such a case, but not a complete inability to articulate (Cowie & Douglas-Cowie 1992, Lane et al. 1997). Also, healthy speakers articulate intelligibly while their hearing is blocked by loud masking noise (Lombard 1911, Lane & Tranel 1971). We therefore remain unconvinced that there are no articulatory correlates of features.
Manipulating a speaker's auditory feedback during speech production results in substantial compensatory changes in motor speech acts compared to undisturbed speech (Yates 1963, Guenther et al. 1998, Houde & Jordan 1998, Larson et al. 2001, Purcell & Munhall 2006, Hickok & Poeppel 2016, chapter 25, section 2.2.1). For example, if a subject is asked to produce one vowel and the feedback that she or he hears is manipulated so that it sounds like another vowel, then the subject will change the vocal tract configuration so that the feedback sounds like the original vowel. In other words, speakers will readily modify their articulations to hit an auditory target, suggesting that the goal of speech production involves an intricate relation between articulatory and auditory configurations. Furthermore, although individuals who become deaf as adults can remain intelligible for years after they lose their hearing, they show some speech production impairments immediately, including the inability to adjust pitch and loudness in different listening conditions, and over time they can exhibit substantial articulatory decline (Walstein 1990, Perkell et al. 2000). The fact that speakers are able to repeat speech acts that they heard, even when the given speech acts are ad hoc inventions such as "zlurb", suggests that people effortlessly map between articulatory and auditory systems (see the work on the Memory-Action-Perception Loop by Poeppel & Idsardi (2011) for further discussion).
The Hierarchical State Feedback Control (HSFC) model (Hickok 2012) provides further corroboration for the view that features integrate both articulatory and auditory information by showing that speech production involves parallel activation of both auditory and motor units corresponding to the information provided by an appropriate mental representation, and also a sensory-motor coordinate transform network mediating auditory and acoustic programs. It has been well established that surface representations of the phonological module, spelled out in terms of features, serve as both the starting point of speech production and as the end-point of speech perception (Poeppel & Idsardi 2011, Idsardi & Monahan 2016). In an indirect manner, the groundwork for these findings was already laid by the Motor Theory of Speech Perception (Liberman et al. 1967, Liberman & Mattingly 1985), which posits that speech perception involves translating acoustic signals into the motor gestures that produce them, and by the Acoustic Theory of Speech Production (Fant 1960, Stevens 1998), which highlights the importance of acoustic or auditory targets in the process of speech production. It follows logically from all this that distinctive features allow for mapping from auditory input to words and from words to action, and therefore must properly be defined via abstract articulatory and auditory correlates.
Modern neuropsychological and neurophysiological evidence indicates that the cognitive aspect of externalizing language through speech has two distinct stages, phonological and phonetic, lending further support for the necessity of cognitive phonetics as a mediating system between phonology and the SM system. The LRM model, named after its creators Levelt, Roelofs & Meyer (1999), explicates the successive stages of spoken word production, and clearly distinguishes between cognitive phonological computation and cognitive phonetic encoding. Indefrey & Levelt (2004) reviewed data from 82 imaging experiments and found that phonological operations are independently conducted within the average time window of 205 ms, followed by an average of 145 ms of cognitive phonetic processing. Evidence from aphasia also supports the dichotomy between phonological and phonetic cognitive processing (Buchwald & Miozzo 2011, 2012). Consider the words pill and spill in English. Both are assumed to contain the segment /p/ in their underlying representations; in the surface representation the former has [pʰ] and the latter [p]. It is of interest to determine what exactly happens when an aphasic patient simplifies a consonant cluster so that /s/ does not get realized in a word like spill. Will the resultant realization of /p/ be aspirated, consistent with the notion that the deletion of /s/ occurred within the phonological module (i.e., before motor plans for a cluster are implemented), or will it be produced without aspiration, reflecting the conception that the phonological mapping /sp/ → [sp] was left intact and that the deletion of the fricative occurred after phonological computation? Buchwald & Miozzo (2011) measured VOT productions of two aphasic patients who did not realize /s/ in /sp/, /st/, /sk/ clusters and compared these with realizations of correctly produced consonants. Results showed two different patterns of production, with one patient producing the initial stop consonant with a long VOT ([pʰ]), and the other producing it with a short VOT ([p]). These findings have been taken to suggest that the errors of the former patient were phonologically based and the errors of the latter patient were phonetically based and "are consistent with an account of spoken production containing at least two processing levels that can be selectively impaired by brain damage: one processing stage [i.e., cognitive phonological] with context independent representations and another [i.e., cognitive phonetic] with context-specific representations" (Buchwald & Miozzo 2011: 1118). Similar results emerged in examination of durational properties of nasal consonants when deleted in /sn/ and /sm/ clusters (Buchwald & Miozzo 2012).
In summary, modern research into speech production, and to a lesser extent speech perception, constrains our approach to the PPI insofar as it shows (1) that the target of speech production is a complex representation that integrates both articulatory and auditory information; (2) that speech production is strongly influenced by auditory and somatosensory feedback; (3) that features have abstract articulatory and acoustic correlates, as demanded by (1) and (2); (4) that cognitive aspects of externalizing language through speech have two distinct stages: a substance-free computational stage (phonology) and a substantial transduction stage (cognitive phonetics).

An Interface Theory: Cognitive Phonetics
Cognitive Phonetics (CP) is a theory of the phonology-phonetics interface (PPI). It is motivated by the conceptual distance between the characteristics of phonology as shown in section 3.1 on the one hand, and the characteristics of the speech production mechanism as shown in section 3.2 on the other. CP proposes that the output of the grammar is transduced into a representation that contains substance-related information required by the SM system in order to externalize language through speech. Figure 5 illustrates the general architecture of the PPI and the place of CP within it.
Recall that our present focus on speech externalization, without discussion of speech perception and phonological comprehension, is a matter of expository convenience, not a claim about the purview of CP. As the interface between phonology and phonetics, CP is a bi-directional system, thus also relevant for transduction in the direction of perception, that is, for decomposing, parsing, and mentally representing the sound of speech (Reiss 2007, section 2.5, Poeppel et al. 2008). Therefore, in the 'input' direction, CP serves as "the bridge from the physical to the symbolic" (Pylyshyn 1984: 152). In the 'output' direction, which is our focus here, CP is the bridge from the symbolic to the physical, relating the substance-free (phonology) to the substance-laden (physiological phonetics).
CP is fed by the output of the phonological grammar, and directly feeds the sensorimotor (SM) system associated with speech production. CP is substance-infusing in the sense that it provides the means to externalize language through speech in real time using human neurophysiological machinery. The movements of various organs and the subsequent acoustic consequences comprise the substance-laden aspect of speech traditionally associated with articulatory and acoustic phonetics. CP is a transduction system, which means it changes inputs of one ontological type into outputs of another. The input to CP is a mental representation comprised in part of abstract distinctive features. The output is a representation that contains information on the auditory target to be reached, the muscles necessary to realize a given input, and their temporal arrangement. Outputs of phonology are interchangeably called in the literature 'surface' representations and 'phonetic' representations, while representations from which these are derived are called 'underlying' or 'phonological' representations (Kenstowicz 1994: 60). Since both are phonological representations, that is, encoded in the primitives of the phonological module, it is misleading to call only one representational level phonological. Therefore, in line with our ideas regarding the PPI and CP, we propose a terminological clarification. Inputs to phonology, typically conceived of as strings of concatenated morphemes, we will call 'underlying phonological representations' (UPR); outputs of phonology, which are the inputs to CP, will be called 'surface phonological representations' (SPR); and outputs of CP 'true phonetic representations' (TPR); or, for short, 'underlying representations' (UR), 'surface representations' (SR), and 'phonetic representations' (PR), respectively. URs and SRs are part of phonology; PRs are extragrammatical, non-phonological entities.
It is an understatement to say that progress in solving the ontological incommensurability problem in all cognitive domains has been modest. In this light, the fact that we are still talking about theoretical abstractions (e.g., PRs) and not solely in terms of neurobiological processes does not reflect a commitment to any sort of dualism. It reflects instead the position that theoretical cognitive models are crucial for understanding the neurobiology of any cognitive domain, including language (Gallistel & King 2010, Poeppel 2012). However, provided that we decompose models of various aspects of cognition (language and speech programing included; Boeckx et al. 2014) into elementary units and operations, it is a logical necessity that for these units and operations to be 'real' in any coherent sense of that word, they must have a neurobiological substrate, as reflected by Figure 5. For phonology, works like Phillips et al. (2000), Binder et al. (2000), Hickok & Poeppel (2000a, 2004, 2007), Indefrey & Levelt (2004), Obleser et al. (2004), Mesgarani et al. (2008, 2014), Idsardi & Raimy (2013), Monahan et al. (2013), Idsardi & Monahan (2016) provide information on what this substrate might be and how to look for it. For the neurobiological substrate of cognitive aspects of speech perception and production see Hickok & Poeppel (2000b, 2016), Poeppel et al. (2008), Poeppel & Hackl (2008), Poeppel & Monahan (2008), Poeppel & Idsardi (2011), Blumstein & Baum (2016), Guenther & Hickok (2016), Tremblay et al. (2016). The neurobiological substrate for CP will be explored in section 4.
CP shares its name and some conceptual commitments with the theory of cognitive phonetics by Tatham (1984, 1987, 1990) and Morton (1987), although there are substantial differences. While both approaches reject the notion of a direct interface between phonology and phonetics, and argue for a cognitive approach to certain phonetic phenomena, their theory (henceforth 'CP-TM') offers a different view of what phonology is and how it works. Although CP-TM was somewhat sympathetic to contemporary developments in generative phonology (Tatham 1990, section 3.1), the most important difference from our approach is that CP-TM did not fully commit to the generative architecture of the human language faculty, and therefore did not inherit all the implications (and results) that the generative framework entails. In particular, while CP-TM acknowledges the existence and phonological importance of features (ibid.), as soon as the phonetic level (albeit a cognitive one) is reached, CP-TM, like most phonetic models, tacitly shifts attention to the realization of segments (Tatham 1990, section 6). In contrast, we are interested in decomposing SRs into phonological primitives, features, and in exploring how these might be implemented neurobiologically in real time. A further difference is that CP-TM has no commitments to neurobiology and keeps the discussion strictly in the cognitive domain. In fact, CP-TM resolutely banishes neurobiological considerations and maintains an "extreme dualist view" (Tatham 1990: 11).
The positing of a cognitive aspect of phonetics in no way blurs the competence/performance distinction. Phonology is competence; phonetics, even its cognitive aspect, is performance by definition, since only mental grammar is defined as competence. The transduction process modeled by CP (see section 4) does not entail 'knowledge' (e.g., 'knowing how' to produce speech) in any useful sense of the word (see Chomsky (1980: 101-102) for a relevant discussion on this matter). Transduction of SRs into PRs entails a set of neuromuscular processes. Its ontogenetic development most likely follows the development of performance systems in general (Lenneberg 1967, section 4.II). These processes are most properly conceived as 'automatic synergisms', "whole trains of events that are preprogrammed and run off automatically", and that "form the basis of all motor phenomena in vertebrates" (Lenneberg 1967: 92; see also Lorenz & Tinbergen 1957, 1970 for the seminal investigation of innate egg rolling automatisms in greylag geese). That they are cognitive, at least partially, despite being part of performance should also not be controversial.⁹ CP by definition has access to cognitive representations generated by phonology, as shown by the left portion of Figure 5, and it is in this respect that the epithet 'cognitive' is justified; what CP generates, phonetic representations (PRs), are instructions for the SM system on how to execute neuromuscular commands, which are no longer cognitive. One of the main characteristics of a transducer is that it changes the format of its input, and in our case the input is a cognitive entity.

The Inner Workings of CP: Transduction
In this section, we turn to the primary research question for Cognitive Phonetics (CP): How are phonological features related to human neurobiological structures? In other words, how can we bridge the symbolic and the physical in the domain of speech? As we have indicated, this means exploring the structure of the transducer that converts SR-type information into PR-type information. Clearly, our chances of understanding a transducer are better if we have a good understanding of the transducer's inputs and outputs. The relatively robust results of generative phonology, as compared with other domains of cognition, provide us with an anchor for such explorations: we have a fairly explicit model of the nature of SR-type information as linearly ordered strings of feature matrices. Models of comparable detail are not available for the other two aspects of CP, the transduction procedures and PR-type information, and it is to those topics that we now turn. Marr's (1982/2010) tri-level theory, which we will adopt in further discussion, has been widely accepted as a means to gain insight into information processing systems (IPS) such as CP. Marr proposes that IPSs are best analyzed in terms of three conceptual levels, each corresponding to a specific set of questions. These levels include the 'computational level', the 'representational and algorithmic level', and the 'implementational level' (Marr 1982/2010: 22-27), defined by the following questions:
• Computational level: What does the process do? Why does the process do it?
• Representational/algorithmic level: How does the process work? In particular, what are the input and output representations and what is the algorithm for the transformation?
• Implementational level: How are the output representation and the algorithm realized physically? In particular, what is the neurobiological substrate of the mapping in question?
Before proceeding, let us clarify a confusing terminological ambiguity. The fact that we are describing transduction, as distinct from computation, and yet still can talk about the computational level of a transducer does not reflect an intellectual inconsistency, but rather just two different uses of a term. As was stated in section 3.1, the main difference between a computational module and a transducer is that the former is a mapping between entities in the same format (e.g., feature matrices to feature matrices), and the latter is a mapping between entities of dissimilar formats (e.g., feature matrices to muscle commands, or sound vibrations to neural impulses). However, both modules and transducers are IPSs, therefore both are amenable to Marr's tri-level analysis, and both can be analyzed at the computational level in Marr's sense.
So, what implications does Marr's theory have for our research question? First, it calls for maximal conceptual decomposition of the representations and operations posited by linguistics. For a long time, the cognitive neuroscience of language was (and to a certain extent perhaps still is) focused on exploring the neurobiological correlates of rather complex linguistic entities or domains, such as syntax (so for example, "Broca's area underlies syntax" would be a common assertion in such a tradition), phonology, lexical semantics, and so on (Poeppel 2012: 36-49). However, Marr (1982/2010) argued that IPSs are best studied by decomposing them into representational and computational primitives, and then by building a bottom-up understanding of them. It is partly from this method that the success of his theory of vision derives, and it is a success that has inspired much of the recent work in computational neuroscience of language. Second, Marr's theory encourages us to seek an explanation for an IPS's nature from several different sources (for example, linguistics, cognitive science more broadly, neurobiology, formal computational theory) and facilitates explicitly connecting cognitive primitives with neurobiological structures. Therefore, it serves as a general framework for positing linking hypotheses across the fields of linguistics and neurobiology.

The Computational Level
Let us now turn to defining transduction, the operational aspect of CP, at the phonology-phonetics interface in terms of these three levels. Firstly, we want to address the 'what' and 'why' questions of the computational level. What does transduction in CP do? It transforms a representational format that is necessary for the coding of phonological knowledge into a representational format adequate for instructing the neuromuscular system on what it must accomplish in articulatory terms. Why does CP carry out transduction? In general, the answer to this question follows directly from the theoretical and empirical considerations of section 3.1, namely, that outputs of phonology, SRs consisting of substance-free features, lack crucial substantial and temporal information and are thus not legible to the SM system; therefore, phonology cannot in principle feed speech production directly, but only through transduction. The very fact that phonology and phonetics constitute two distinct domains that share an interface logically implies the necessity of transduction between them. In the absence of CP, a mental expression could not be externalized through the human SM system. The transduction maps between properties of the mind (mental representations composed at the most basic level of discrete, timeless, symbolic elements) and the functioning of the motor system, which works in terms of gradual, dynamic, temporally arranged neuromuscular activity. Since we do speak, the existence of transduction is confirmed.

The Representational and Algorithmic Level
We now turn to the question of how the transduction process works in CP.¹⁰ The first step at this level is to state the representations involved in transduction. The input representation, SR, is a matrix of distinctive features. Each feature is transduced and receives interpretation by the SM system. Features are elementary units of phonological computation, stored in long term memory, that represent articulatory and acoustic information in a highly abstract manner.¹¹ Each feature may abstractly be schematized as shown in Figure 6, which is an extension of the Memory-Action-Perception Loop of Poeppel & Idsardi (2011). The input representation thus involves a set of idealized acoustic targets at which the neuromuscular system will aim, as corroborated by studies discussed in section 3.2, and a set of idealized articulatory configurations needed to achieve these goals. It should be emphasized that these 'targets' are not precise, physically invariant acoustic measurements, as features are substance-free units; they are coarse mental representations of acoustic spaces. It is a basic finding of psychoacoustic phonetics that what a speaker deems a repetition of the same category may in fact reflect a wildly different acoustic signal (Liberman 1957). The cognitive unity between acoustic and articulatory correlates of features seems to be so strong that hearing the speech of another person excites a corresponding motor program, regardless of whether the hearer has the intention to also speak (Cooper & Lauritsen 1974, Fadiga et al. 2002).
¹⁰ Here we will make two simplifying assumptions. We will assume that features within a single bundle (segment) are parts of an unordered and unstructured set and are not grouped hierarchically so as to mimic the composition of the vocal apparatus. We will also abstract away from the possibility, strongly suggested by evidence presented in Keating (1988) and Hale & Kissock (2007), that featurally underspecified segments persevere into SRs. Integrating perseverant underspecification into CP will be left aside for future research.
The output representation, called 'True Phonetic Representation', or 'Phonetic Representation' (PR) for short, is a complex array of neural commands that activate muscles involved in speech production. As pointed out in section 2, uttering even a single syllable involves hundreds of neuromuscular connections; a detailed description of every neuromuscular event for every single and interacting feature is therefore far beyond the scope of this paper. Our modest goal here is to sketch the fate of a transduced feature in a few simple and idealized cases. Take, for example, the feature [+ROUND]. Since lip rounding is known to have systematically varying muscular expression (due to interaction with other features, to which we will return below), the Phonetic Representation (PR) has to allow for this variation across contexts.
The transduced form of [+ROUND], call it PR[+ROUND], engages at least four muscles: orbicularis oris, buccinator, mentalis, levator labii superioris. The idealized expression, assuming no directly interfering articulatory movements (a relatively rare case in actual speech), is simultaneous contraction of the superior and inferior parts of orbicularis oris, contraction of mentalis (for protruding the lower lip) and levator labii superioris (for protruding the upper lip), and relaxation of buccinator. This is the case observed in pronouncing [u]. In [y], on the other hand, PR[+ROUND] in addition to contracting the aforementioned muscles also involves a compressing movement (lips drawn together horizontally) caused by the contraction of the buccinator. The difference between protrusion and compression in PR[+ROUND] is dependent on whether PR[+ROUND] is interacting with PR[+BACK] or PR[-BACK] (Catford 1982: 172-173). Of course, various other complications exist, but this suffices to illustrate the general idea. The exact and fully detailed characterization of PR[+ROUND] will thus be possible only after thoroughly studying various possible interactions of transduced features, no doubt a massive phonetic undertaking.
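The context dependence just described can be sketched in a few lines of Python. This is a deliberately coarse illustration under stated assumptions: the function name `transduce_round`, the string labels, and the binary 'contract'/'relax' states are hypothetical simplifications, and the only facts encoded are those given in the paragraph above (the four muscles, and the buccinator difference between the back and non-back contexts).

```python
def transduce_round(bundle):
    """Return toy muscle states for [+ROUND] given the co-occurring features.

    `bundle` is a set of feature labels such as {"+ROUND", "+BACK"}.
    The 'contract'/'relax' states are an idealization, not measured values.
    """
    if "+ROUND" not in bundle:
        return {}
    # Idealized expression (the [u]-type case): protrusion, buccinator relaxed.
    states = {
        "orbicularis oris": "contract",
        "mentalis": "contract",
        "levator labii superioris": "contract",
        "buccinator": "relax",
    }
    # Interaction with [-BACK] (the [y]-type case): compression via buccinator.
    if "-BACK" in bundle:
        states["buccinator"] = "contract"
    return states
```

For example, `transduce_round({"+ROUND", "+BACK"})` leaves the buccinator relaxed, while `transduce_round({"+ROUND", "-BACK"})` contracts it. A fuller model would need graded activation levels and many more interacting features, which this sketch does not attempt.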
Note that PRs are still abstractly related to speech; they are not hi-fi encodings of speech-sound articulations, although they are less abstractly related to speech than SRs. This is because what is actually externalized is further complicated by a great number of factors. As Hale & Kissock (2007: 85) point out, transduction is followed by other performance factors that have no bearing on either grammar or transduction, factors like speech rate, loudness, interruptions due to sneezing, and many other situational effects. We will also have nothing to say here about how other aspects of SRs (e.g., prosodic elements like tone) are transduced.
The algorithm that transforms SRs into PRs has two steps, echoing Lenneberg's (1967) proposals outlined in section 2. In the first step (A1), a feature is related to muscles which need to be contracted in order to produce an appropriate acoustic effect. Since speech occurs in real time, the second step (A2) will entail temporal coordination of muscular activity demanded by A1. A tremendous amount of complexity arises in relation to the second step of transduction. The main resultant phenomenon of this step is coarticulation (see Hardcastle & Hewlett 1999, Farnetani & Recasens 2013, and Volenec 2015 for surveys), that is, temporal overlapping of various aspects of PRs. Neurobiological studies on speech perception have uncovered that the human perceptual system consistently uses two time scales to analyze a continuous speech signal, a segmental time-frame of roughly 10-80 ms, and a syllabic time-frame of 100-500 ms (Poeppel et al. 2008, Poeppel & Idsardi 2011, Chait et al. 2015):
There are two critically important chunk-sizes that seem universally instantiated in spoken languages: segments and syllables. Temporal coordination of distinctive features overlapping for relatively brief amounts of time (10-80 ms) comprise segments; longer coordinated movements (100-500 ms) constitute syllabic prosodies. (Poeppel & Idsardi 2011: 182)
However, transduced features often 'spill over' these temporal borders, crossing segmental and sometimes even syllabic boundaries in both directions, thus leading to coarticulation. Our decision to examine the transduction of [+ROUND] to PR[+ROUND] is useful since this aspect of speech relies on several muscles and is known to show great propensity for temporal overextending, especially in the anticipatory direction. Lisker (1978: 133) states that "lip-rounding and nasalization are segmental features of English that refuse to be contained within their 'proper' segmental boundaries, as these are commonly placed". (Note that Lisker's example should not be specific to English if it derives from universal transducer properties.) Likewise, according to Benguerel & Cowan (1974), PR[+ROUND] may be evident several consonants in advance of the rounded vowel for which it is required: In French, labial coarticulation can extend up to 6 segments in the anticipatory direction. Lubker et al. (1975) showed, using electromyography, that in Swedish PR[+ROUND] can start up to 600 ms ahead of a rounded vowel. Both directions of temporal overextending of PR[+ROUND] are observed in English, as demonstrated by Laver's (1994: 321) clever example [hʷudʷ tʃʷuzʷ pʷɹʷunʷ dʒʷusʷ] (Who'd choose prune juice?).
The neurobiological mechanisms underlying transduction algorithms are universal properties of the human species, as witnessed by the fact that humans, in all non-pathological cases, use them without fail (see Dronkers 1996 for an example of a pathological case demonstrating a disruption of A₁). However, although the transduction algorithms are biologically universal in humans, CP will still show great output variability due to these two transduction steps being applied to SRs that reflect featurally distinct utterances. Here it is critical to distinguish between the status of the output of the transduction system (True Phonetic Representations) and the system itself (Cognitive Phonetics): The output of the system is, trivially, I-language-dependent because CP is fed by surface representations of that I-language (more precisely then, the output is surface-representation-dependent); the system itself is part of the human biological make-up and is therefore a universal property of the human species. Although this stance is somewhat controversial in phonetics, in our view it is the only biolinguistically coherent approach to the study of the PPI. The universality of CP is merely a reflection of the fact that there exists a biological object we may call 'the phonetic implementational system', of which CP is one part and the SM system another. The question of whether there are variations in individual phonetic implementational systems among humans need not detain us here, just as the fact that no two humans have identical eyes does not hinder biologists in studying a biological object called 'the human eye'. On the other hand, rejection of language-specific phonetics in no way precludes the possibility that certain sets of similar I-languages (sets that can roughly correspond to the geosociopolitical notions 'language' and 'dialect'; see Chomsky 1986, section 2) show recurrent (co)articulatory patterns. In our view, for example, the recurrent difference in pronunciation of English [i] and German [i] is to be attributed to representational (featural) differences present in I-languages of English and German speakers, not to language-specific phonetics.12 In general, our position is that all recurrent or linguistically relevant differences in pronunciation result from representational differences in the lexicon and from differences in the phonological rule component. This position is parallel to the Minimalist idea that cross-linguistic syntactic differences arise from differences in lexicon and functional heads, and not from languages having different syntaxes (Chomsky 1995).

The Implementational Level
The implementational level is concerned with the neurobiological substrate of CP (see Figure 5). How is transduction of features at the PPI instantiated in the human brain? Many mysteries still surround this question and proposed answers are ever-changing. At a relatively gross neuroanatomical level, speech production engages a widely distributed neural network. In a meta-analysis of overt speech production, Eickhoff et al. (2009) reported consistent activation in left inferior frontal gyrus (IFG), ventral precentral gyrus (motor and premotor cortex), ventral postcentral gyrus (somatosensory cortex), superior temporal gyrus (STG; i.e., auditory cortex), supplementary motor area (SMA), anterior insula, superior paravermal cerebellum (lobules V and VI), basal ganglia and thalamus. Of particular importance for transduction is the 'dorsal stream', usually stated to have an "auditory-motor integration function" (Hickok & Poeppel 2007: 394) and to be "involved in mapping sound representations onto articulatory-based representations" (Hickok & Poeppel 2004: 72). The dorsal stream comprises structures in the posterior frontal lobe and the posterior dorsal-most part of the temporal lobe and parietal operculum. The dorsal stream is strongly left-dominant, which is why production deficits result predominantly from dorsal temporal and frontal lesions. The specifics of these general findings lend support to various aspects of CP.
12 A reviewer raises the question of how many features would be needed in our approach in order to describe "minute differences between neighboring dialects", for example "the differences between the English accents in the US", arguing that the twenty-something features that are usually assumed to exist are not enough. The objection is mathematically unjustified since assuming 20 features (Odden 2013) and surface underspecification (Hale & Kissock 2007) will yield 3^20 (≈ 3.5 billion) different segments that can feed CP, which seems to be more than enough not only for the description of a non-technical notion such as 'English accents in the US', but also for accounting for all possible recurrent (co)articulatory patterns. Of course, any increase in the feature set, even one that maintains the same order of magnitude as the usually assumed '20 or so', yields explosive increases in descriptive typological power: For example, 30 features yield 3^30, which is about 206 trillion different segments. The reviewer's worries reflect the normal human lack of intuition with respect to combinatoric explosion. Any linear increase in what we attribute to UG results in exponential growth in descriptive capacity, clearly a welcome result (Reiss 2012).
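The arithmetic behind the estimate in footnote 12 can be checked directly. The following is a minimal sketch in Python, assuming (as the footnote does) that each feature independently takes one of three surface states: +, -, or unspecified. The function name `segment_space` is introduced here purely for illustration.

```python
def segment_space(n_features: int, states_per_feature: int = 3) -> int:
    """Count the distinct feature bundles available when every feature
    independently takes one of `states_per_feature` values
    (here: '+', '-', or unspecified)."""
    return states_per_feature ** n_features


# 20 features: 3**20 = 3486784401, i.e. about 3.5 billion bundles
print(segment_space(20))

# 30 features: 3**30 = 205891132094649, i.e. about 206 trillion bundles
print(segment_space(30))
```

Adding ten features multiplies the space by 3^10 = 59,049, which is the exponential growth the footnote appeals to.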
The articulatory motor programs for executing features are coded in posterior IFG of the left hemisphere, traditionally known as Broca's area. More specifically, Hickok (2012: 138) reports that pars opercularis (BA44) and the ventral-most part of BA6 store articulatory programs needed to reach the auditory targets imposed by features. BA44 and BA6 are thus the most likely candidates for storing articulatory aspects of features (see Figure 6). The anterior insula, a cortical area beneath the frontal and temporal lobes of the left hemisphere, is reported to be involved in preparation of speech, that is, in "translating a phonetic 'concept' obtained from left IFG into articulatory motor patterns" (Blumstein & Baum 2016: 649, Eickhoff et al. 2009), roughly corresponding to our A₁. Dronkers (1996) showed that lesions to that part of the brain lead to apraxia of speech, the inability to assign muscular activity to a phonological representation. Dronkers' results are rather robust and show a clear disruption of A₁, since all 25 examined stroke patients suffering from apraxia of speech had the same lesion, while the anterior insula was spared in all 19 healthy participants. By way of the dorsal stream, information from the anterior insula is transmitted to the pre-SMA, often implicated in articulatory initiation and sequencing of neuromuscular activity (Alario et al. 2006, Guenther et al. 2006, Bohland & Guenther 2006), and then projected to the primary motor cortex. The pre-SMA also receives temporal information from the cerebellum and the basal ganglia (see below). It can therefore be hypothesized that the pre-SMA integrates information from A₁ and A₂, and forms a finalized True Phonetic Representation. From the primary motor cortex, neurons send signals to the brainstem and spinal cord that ultimately result in muscle contractions.
Important structures for the temporal organization of speech (corresponding to A₂) include the cerebellum and basal ganglia. Information from the insula (corresponding to A₁) is directly transmitted to the cerebellum and basal ganglia, structures that are well-established constituents of cortical-subcortical loops for movement preparation (Jueptner & Krukenberg 2001). More specifically, selection and sequencing of motor programs for articulation is mediated through basal ganglia, and the conversion of the discretely prepared sequences into a fluent, temporally distributed action is carried out by the cerebellum (Eickhoff et al. 2009: 2416). Cerebellar dysfunction affects temporal aspects of speech production and results in a dysarthria characterized by improper timing of cognitively discrete elements (such as feature bundles), substantial aberrations in their total and relative duration, disrupted coordination of orofacial and laryngeal movements, slowed/delayed execution of articulatory movements, etc. (Ackerman et al. 2007). Information from the cerebellum and basal ganglia ties into the pre-SMA, presumably where A₁ and A₂ are integrated to form a True Phonetic Representation directly interpretable by the primary motor cortex (PMC) which sends efferent neuromuscular commands.13
13 According to Eickhoff et al.: The basal ganglia and the cerebellum both forward their information to the PMC which precedes M1 in a serial fashion. The parallel engagement of the subcortical motor loops is thus followed by a sequentially organized common final pathway: the PMC first combines the processed information about selected movement programs and their temporal sequencing provided by the basal ganglia and the cerebellum, respectively, into a final movement representation. These are then forwarded to M1 for the generation of the final output to lower motor neurons and hence execution. (Eickhoff et al. 2009: 2416; emphasis added)
Features also have acoustic correlates (see Figure 6) that serve as targets for articulatory movements. There is accumulating evidence and a convergence of opinion that portions of the superior temporal sulcus (STS), bilaterally but perhaps with a mild leftward bias, are important for encoding acoustic/auditory aspects of phonological representations (Indefrey & Levelt 2004, Buchsbaum et al. 2001). In an attempt to pinpoint this region more narrowly, Hickok & Poeppel (2007: 398) suggest "that the crucial portion of the STS that is involved in phonological-level processes is bounded anteriorly by the most anterolateral aspect of Heschl's gyrus and posteriorly by the posterior-most extent of the Sylvian fissure". Mesgarani et al. (2014) showed that acoustic phonetic information is represented in the STS and is distributed along five distinct areas, each roughly corresponding to a general 'manner of articulation' class of speech sounds. By measuring the responses in implanted electrical cortical grids placed along the superior-most part of the temporal gyrus, they found that their electrode e1 responded selectively to stops, e2 to sibilant fricatives, e3 to low back vowels, e4 to high front vowels and a palatal glide, and e5 to nasals (Mesgarani et al. 2014: 1009). Similarly, Bouchard et al. (2013) constructed an auditory-based 'place of articulation' cortical map in the STG, confirming labial, coronal and dorsal 'places' with different electrodes, and cutting across various manner classifications. Scharinger et al. (2012) found, using magnetoencephalography, neural correlates of three phonologically relevant vowel variables (height, frontness and roundness, spelled out in terms of the first three formants), again localizing them in the superior temporal gyrus.
STS and STG project auditory representations to an area in the Sylvian fissure at the boundary between the parietal and temporal lobes (called 'Spt'), where they are integrated with articulatory representations (Hickok et al. 2009, 2011, Gow 2012). Activity in Spt is highly correlated with activity in the pars opercularis (Buchsbaum et al. 2001, 2005), the posterior sector of Broca's region implicated in storage of articulatory motor programs. White matter tracts identified via diffusion tensor imaging suggest that Spt and the pars opercularis are densely connected neuroanatomically (Hickok et al. 2009). Spt therefore appears to be involved in sensorimotor integration, that is, in translation between auditory and articulatory correlates of features.

Interim Summary
At the beginning of this section, we stated that the main goal of this paper is to gain a better understanding of how phonological features relate to neurobiological structures. Let us summarize our proposals. Recent neuroscience evidence is consistent with the idea that Cognitive Phonetics transduces abstract features (elements of SRs) into temporally distributed neuromuscular activities (elements of PRs), relating the phonological grammar to the vastly different SM system. This is carried out by assigning each feature a specific set of muscular contractions (A₁) and by ordering them temporally (A₂). Neurolinguistic evidence outlined in section 4.3 suggests that transduction is implemented by a widely distributed neural network which engages the inferior frontal gyrus (stores articulatory correlates), the superior temporal gyrus (stores auditory correlates), the Spt (sensorimotor integration), the anterior insula (A₁), the cerebellum and basal ganglia (A₂), the supplementary motor area (integrates A₁ and A₂), and the primary motor cortex (sends efferent neural commands to the muscles).

Implications
We have stressed the importance of adhering to phonological facts in phonetic theorizing because decisions on phonological grounds will have considerable impact on phonetic analysis. In particular, this means that we give serious consideration to the following notions: (1) the most basic unit of phonology is the distinctive feature; (2) features are abstract (yet real), cognitive, substance-free units; and (3) features are transduced at the phonology-phonetics interface (PPI) by being converted into temporally coordinated muscular activity. Several theoretical and empirical implications follow from Cognitive Phonetics (CP), our theory of this interface.

Coarticulation
The concept of coarticulation, exemplified by the lip rounding during production of [s] before the rounded vowel of soon, rests upon two premises: (a) that discrete units, segments, underlie the continuous, gradient speech signal (Hammarberg 1976: 357), and (b) that these segments are converted into articulatory gestures (Farnetani & Recasens 2013: 317f).14 The temporal overlapping of articulatory gestures pertaining to different linearly ordered segments can thus be dubbed 'intersegmental coarticulation'. However, if premise (a) is modified to be in line with much of modern phonology (see section 3.1), that is, if the phonological feature is taken as the atomic underlying unit, it follows that (c) features are converted into something more basic than segment-bound articulatory gestures (see section 4.2), and (d) that interaction in realization of features within a single segment is also possible, leading to what we will call 'intrasegmental coarticulation'. Here we will briefly sketch the consequences of approaching coarticulation from the framework of CP, assuming (c) and (d) instead of (just) the usual (a) and (b). CP performs the mapping SR → PR, or, in terms of individual valued features, [F] → PR [F]. We will therefore take transduced features (in a general format PR [F], where [F] stands for an individual valued feature) to be the basic units that enter speech production. To illustrate intrasegmental coarticulation, consider the interaction of PR [HIGH] and PR [NASAL] observed, for example, in Lakhota (Boas & Deloria 1941), Yoruba (Ogunbowale 1970), and Koyra Chiini (Heath 1999), with sketches in Figure 7 based on Beddor (1983) and Ladefoged & Johnson (2010). In principle, PR [+NASAL] entails the opening of the velar port and PR [+HIGH] the raising of the tongue dorsum. In sketch (1) PR [+NASAL] can be observed in a 'default', non-coarticulated state, that is, with a substantial degree of velum lowering. The tongue dorsum is not raised due to PR [-HIGH], leaving more space in the oral cavity for the velar port to open. In (2) PR [+HIGH] pushes the tongue dorsum upward, leaving less space for the velum to lower.15 The velar port is still opened as the realization of PR [+NASAL], but to a substantially lesser extent than in (1). In other words, PR [NASAL] is coarticulated with PR [HIGH] and shows variation depending on the specification (+ or -) of PR [HIGH]. This effect can be observed by comparing how features are transduced within different segments; PR [NASAL] and PR [HIGH] interact differently within, say, [ã] than within [ũ]. Such variation in how individual features within a segment's feature matrix are transduced depending on the specification of other features in the matrix is 'intrasegmental coarticulation', as illustrated in Figure 7. This is distinct from variation in transduction of features due to influence of features from other matrices, which constitutes 'intersegmental coarticulation'.
14 [...]nition that there is a phenomenon of coarticulation requiring explanation." (Fowler 1980: 114)
15 Hajek & Maeda (2000: 6) offer a different explanation as to why the velum is lowered to a lesser degree if the tongue body is elevated compared to when the tongue body is not elevated. They argue that a given velopharyngeal opening has a greater acoustic effect in high vowels because the oral tract is more constricted, and as a result, less velum lowering is required in high vowels in order to realize perceptible nasalization.
In CP, intrasegmental coarticulation results from the workings of A₁, while intersegmental coarticulation arises from the effects of A₂. As defined in section 4.2, A₁ takes a feature from the phonological SR and converts it into a neuromuscular pattern. For each feature, this pattern is partially determined by specifications of other features within the same bundle, as shown in Figure 7. Therefore, A₁ will assign a different neuromuscular pattern to [+NASAL] depending on how the feature [HIGH] is specified. If one imagines a certain SR (say, [dɒg]) as a feature matrix where columns stand for segments and rows for features, then A₁ takes all columns (that were loaded into CP) at once, determines the specification of each feature in each column, and generates a full set of corresponding PR [F]s. Intrasegmental coarticulation, that is, contextual variation in transduction of features, arises when different features in the same column impose conflicting demands on A₁. Information from A₁, transmitted via a pathway connecting anterior insula to cerebellum and basal ganglia, is further manipulated by A₂. A₂ arranges PR [F]s created by A₁ temporally, but more importantly for this discussion, A₂ extends certain PR [F]s over boundaries of their original column. This leads to intersegmental coarticulation. A familiar example is labial intersegmental coarticulation, where A₂ takes PR [+ROUND], typically originating from a rounded vowel, and overextends it in the regressive (anticipatory) direction. This can be observed in the word soon, where PR [+ROUND] from the vowel is overextended to produce a labialized fricative. A₂ can also overextend PR [F]s in the progressive (perseverative) direction. This can be observed in the word seek, where the PR [-BACK] of [i] is overextended to influence the following [k], yielding [siːk̟], with a somewhat fronted velar stop.16 Neurobiological studies suggest (see section 4.3) that the results of A₁ and A₂ are integrated into a final true phonetic representation in a region of the supplementary motor cortex at its boundary with the primary motor cortex, from which efferent commands are issued to the musculature of speech organs. However, it would seem that further experimentation is needed in order to establish whether A₁ precedes A₂ or whether there is overlapping in their real-time neural implementation.
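The division of labor between the two steps can be made concrete with a toy sketch. The Python code below is only an illustration under strong simplifying assumptions, not a formalization proposed in this paper: a PR is modeled as a plain record rather than a neuromuscular pattern, time is approximated by column indices, the feature values in `SR_SOON` are toy values, and the names `PR`, `a1`, `a2`, `spreading` and `reach` are hypothetical.

```python
from dataclasses import dataclass

# An SR as a feature matrix: one dict per segment (column), mapping feature
# names to '+' or '-'. The feature values are toy values for illustration.
SR_SOON = [
    {"ROUND": "-", "HIGH": "-", "NASAL": "-"},  # [s]
    {"ROUND": "+", "HIGH": "+", "NASAL": "-"},  # [u]
    {"ROUND": "-", "HIGH": "-", "NASAL": "+"},  # [n]
]


@dataclass
class PR:
    feature: str    # e.g. "ROUND"
    value: str      # "+" or "-"
    context: tuple  # the other valued features of the same column
    origin: int     # index of the column the feature came from
    span: tuple     # (first column, last column) covered by this PR


def a1(sr):
    """Step A1 (sketch): turn every valued feature into a PR that records
    the other features of its own column, since those co-determine its
    realization (intrasegmental coarticulation)."""
    prs = []
    for i, column in enumerate(sr):
        for feat, val in column.items():
            context = tuple(sorted((f, v) for f, v in column.items() if f != feat))
            prs.append(PR(feat, val, context, origin=i, span=(i, i)))
    return prs


def a2(prs, spreading=(("ROUND", "+"),), reach=1):
    """Step A2 (sketch): arrange PRs in time and extend the selected ones
    `reach` columns in the anticipatory direction (intersegmental
    coarticulation)."""
    out = []
    for pr in prs:
        start, end = pr.span
        if (pr.feature, pr.value) in spreading:
            start = max(0, start - reach)
        out.append(PR(pr.feature, pr.value, pr.context, pr.origin, (start, end)))
    return out


prs = a2(a1(SR_SOON))
rounding = [p for p in prs if (p.feature, p.value) == ("ROUND", "+")][0]
print(rounding.span)     # (0, 1): rounding now covers [s] as well as [u]
print(rounding.context)  # (('HIGH', '+'), ('NASAL', '-'))
```

In this sketch the overextended PR keeps the context recorded by `a1`, so whatever shaped the rounding inside its own column travels with it when `a2` extends it over a neighboring column.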
A great deal of variation in the execution of PR [F] s is of course to be expected among speakers, especially given that after transduction, various other nonlinguistic and non-phonetic factors influence the actual acoustic output of the human body.The output of CP is dependent on utterance-specific SRs that feed it and on the neurophysiological structures that serve as its physical implementation.Various other situational factors are introduced after transduction, which we have put aside due to their irrelevance for the general nature of CP, but it is important to keep in mind that, if not somehow recognized, these factors will 'contaminate' all experimental results (of neural imaging techniques, for example), thus leading to the impression of even greater variation in observed speech output.
The architecture of CP opens the possibility of simultaneously exploring coarticulation along two dimensions instead of just one, which leads to interesting empirical consequences. Here we will merely state a hypothetical situation to illustrate CP's potential empirical coverage.
Let us suppose that in some language we have detected that PR [+ROUND] is different in [u] than in [o] (see Linker (1982) for analogous examples from English, Cantonese, Finnish, French, and Swedish). In other words, A₁ assigns a slightly different configuration to [+ROUND] depending on whether it has to take into account [+HIGH] or [-HIGH] within the same bundle. This kind of intrasegmental coarticulation can clearly be observed in Figure 8. [...] of the latter token will systematically differ, since PR [+ROUND] of the former will carry with it the effect of intrasegmental coarticulation due to A₁, namely, the effect of PR [+HIGH], while the latter will carry the effect of PR [-HIGH]. To reiterate, intersegmental coarticulation reflects the effects of intrasegmental coarticulation. If we consider only SRs, then there can be no explanation for a systematic difference in the realization of the rounding on the two [l]s, since in both cases [l] precedes [+ROUND]. CP allows us to account for these subtle phonetic variations in an explicit and straightforward way: they follow naturally from its transduction algorithms. Thus, A₁ and A₂ are not just mechanisms that transduce features into information directly interpretable by the SM system, they are also mechanisms from which both types of coarticulation follow automatically, simply by adhering to the minimal architecture of CP.
Our discussion has focused on the variable neuromuscular realization of a given property, such as the rounding of the vowels [u] and [o]. It is worth remembering that such a discussion of phonetic variability is predicated upon acceptance of the existence of a logically prior phonological category of vowels containing the feature [+ROUND]: it only makes sense to talk about variable realizations of x once we accept that x is a category.17 Why do we accept the existence of such a category? Because the two segments [o] and [u] behave alike with respect to linguistic phenomena. For example, in Turkish, a process called 'vowel harmony' generates different suffix vowels depending on the preceding root vowel. As we see in Table 1, the [+ROUND] root vowels [u] and [o] both trigger a suffix form with [u], whereas the corresponding [-ROUND] vowels [ɯ] and [ɑ] trigger a suffix form with [ɯ] (see Isac & Reiss (2013, section 6.4) for a more comprehensive analysis).
17 This is an extension to the feature level of Hammarberg's (1976) argument for phonological segments as "logically and epistemologically prior" to their phonetic correlates.
Table 1: Schematic of vowel harmony as found in Turkish.
As photographs (of a Turkish speaker) in Figure 8 show, the lip rounding on two vowels is realized differently, but we treat the vowels as members of a category [+ROUND] because of their phonological behavior. Such considerations explain why we must recognize a distinction between phonetics and phonology. Since the two domains are different but interact with each other, there must be a transduction between them. That transduction is CP.
We fully recognize that the properties of CP outlined in this paper are too general to serve immediately as a full model of coarticulation. Not only the properties of the two component transduction algorithms, A₁ and A₂, but also the basic inventory of distinctive features must be made more explicit if CP is to be an empirically testable model. In principle, however, CP offers a theoretically coherent way to account for both intra- and intersegmental coarticulation, and their complex interactions, while maintaining theoretical and empirical insights of generative phonology.

The (Illusory) Naturalness of Phonological Processes
The nature of the PPI as understood in CP shows the need to strictly distinguish between phonology and phonetics. This has implications for the idea of 'naturalness' in phonology. Naturalness is an elusive notion, but it usually entails explaining linguistic phenomena in terms of directly observable empirical facts grounded in acoustics, articulation, statistics, behavior, communication, etc. Donegan & Stampe (1979), proponents of Natural Phonology, suggest that the same notion of naturalness plays a role in explaining synchronic phonological patterns, diachronic phonology, as well as patterns of speech development in children:
Natural Phonology is a modern development of the oldest explanatory theory of phonology. [. . .] Its basic thesis is that the living sound patterns of language, in their development in each individual as well as in their evolution over the centuries, are governed by forces implicit in human vocalization and perception. (Donegan & Stampe 1979: 126)
We follow Hale (2007, section 11.1) in denying any significance to apparent parallels among synchronic, diachronic and developmental 'sound patterns'; therefore we will restrict our discussion to the 'naturalness' of synchronic phonology, as determined by phonetic facts. It is not difficult to find, on superficial inspection, phonological processes that seem natural in this sense. Why does [s] assimilate in voicing before adjacent [b] in a language L? Because it is easier for the human vocal system to maintain, and not to rapidly change, the laryngeal configuration. Since voicing assimilation is indubitably a well-attested phonological process, and since this process receives an explanation from the efficient workings of "human vocalization" (Donegan & Stampe 1979: 126) [...] Donegan & Stampe (1979), and many other phonologists more recently, proposed to offer phonetic explanations for phonological phenomena, but despite ongoing efforts in a variety of phonological frameworks (for example, see Hayes et al. (2004) for attempts within Optimality Theory), this enterprise has not been convincing:
The attempts by those who are interested in psychological phonological grammars and in finding ways to represent phonological processes [. . .] in phonetically natural ways have been abysmal failures [. . .]. One possible solution to this is not to put more phonetic sophistication into psychological grammars but rather to abandon phonetic naturalness as a necessary feature of them. (Ohala 2003: 685)
Ohala's perspective (see also Ohala 1990) is not only that efforts to build naturalness into phonology have failed, but also that we would not want them to succeed, on grounds of scientific elegance. If certain recurrent phonological phenomena have a perfectly good phonetic explanation, then we do not get a better theory by duplicating the explanation inside phonological grammar: in science, it is not better to have two explanations than one. If naturalness (e.g., the prevalence of voicing assimilation) receives a perfectly fine phonetic explanation, then it is not better to posit another, quasi-phonological explanation, especially not if the latter explanation offers no new insight. We suggest that phonological naturalness is an illusion that arises when inspecting phonetic data with the purpose of understanding phonological processes. In other words, 'naturalness' is introduced into data in the process of externalization (and internalization in speech perception). Since we cannot have direct access to phonological representations and computations, all of our observations are of phonetic data, that is, data from actual utterances resulting from language use, which reflects many different factors. As we argued in sections 3 and 4, CP is the first step in externalization, so understanding CP can hopefully provide insight into what is mistakenly taken as phonological naturalness. Attaining such an insight removes the need for attributing naturalness to the phonological grammar, leading to a more parsimonious and elegant phonological theory.
Once we remove the traditional 'why' questions of Natural Phonology and its derivatives from the purview of phonology, we will be better prepared to answer the proper 'why' questions related to the phonological domain. At this level of inquiry, we will be uncovering the biological foundations, not of speech, but of language, the study of which is Universal Grammar. The 'why' questions about the phonological grammar are answerable only in terms of the neurobiological substrate of the phonological faculty.

Gradience
Phonology is computation over discrete, categorical symbols. At the lowest taxonomic level, these symbols are features. However, the phonological literature is full of case studies showing the graded nature of 'phonological' units and processes (see Ernestus 2011 for an informative survey). We believe that the rejection of discreteness in phonology reflects a failure to distinguish the object of study from the data used to draw inferences about that object.
The following is a fairly standard definition of 'categoricality' vs. 'gradience', and by emphasizing certain words in it, we wish to draw the reader's attention to the conceptual level at which the definition is given:
[C]ategorical sounds [. . .] are stable and represent clear distinct phonological categories (e.g. sounds showing all characteristics of voiced segments throughout their realizations) [. . .]; gradient sounds [. . .] may change during their realization and may simultaneously represent different phonological categories (e.g. sounds that start as voiced and end as voiceless). (Ernestus 2011: 2115)
While we have no objection to such a characterization of categoricality vs. gradience, from the emphasized words it is obvious that the definition is immersed in the domain of the substance-laden and temporal, that is, speech (performance), not grammar (competence). The problem arises when phonetic data is used to make inferences about phonology directly and reflexively, as if every idiosyncratic datum recorded in speech or found in a corpus is relevant for phonology, without acknowledging the distance between competence and performance. Consider another passage from Ernestus (2011: 2118):
Ellis and Hardcastle (2002) found [by using electropalatography and electromagnetic articulography - vv & cr] that four of their eight English speakers showed categorical place assimilation of /n/ to following velars in all tokens, two speakers showed either no or categorical assimilation, and two speakers showed gradient assimilation. Together, the data show that place assimilation processes [. . .] may be gradient in nature. These processes cannot simply be accounted for by the categorical spreading of a phonological feature from one segment to another.
What is to be inferred from these findings that is relevant for phonology? In our view, very little (see below). The cited results, showing inter- and intra-speaker variation, as well as both discrete and gradient effects, may constitute a salient illustration of the ubiquitous lack of uniformity in the behavior of members of a speech community, but it is not in the purview of phonology to provide an explanation of such phenomena. The fact that such variation "cannot simply be accounted for by the categorical spreading of a phonological feature from one segment to another" (ibid.), a claim most certainly true, does not automatically mean there is something wrong with phonology conceived as categorical symbol manipulation. It is important to clearly distinguish between the object of study of phonology and the sources of evidence for that study. The object of phonological study is the human knowledge of externalizable aspects of I-language and the cognitive capacity required to construct that knowledge on exposure to limited experience. One of the sources of evidence bearing upon that object of inquiry, perhaps the primary one, is spoken utterances. Therefore, to a certain degree, it can be said that both phonology and phonetics draw from the same pool of evidence, namely, the analysis of speech.
The point is merely that not all data from that pool is relevant for phonology, and a phonologist qua cognitive scientist needs to peel off the various complications that were introduced in the process of externalization from the underlying system of linguistic knowledge she or he is studying.
As understood here, gradience is introduced by CP's A₂, which is responsible for the temporal coordination of muscular activity specified by A₁; that is, gradience is not a phonological phenomenon. Notice the references to time highlighted in the above quote from Ernestus (2011: 2118), for example, "during" and "start as . . . end as". Gradience involves change over time. If we think of human phonology as involving a representational system (features and the like) that encodes the phonological portion of morphemes stored in the lexicon, and a computational system that can be thought of as a complex function of, say, composed rules (Bale & Reiss 2018), then there is no temporal aspect to phonology. (Questions about gradience in phonology are like questions about how fast a wh-element moves in syntax; both reflect a category error.) In this way phonology mirrors other competence modules, for the same reasons discussed at length by Chomsky (1980, 1986, 1988, 2000a), Anderson & Lightfoot (2002), and others. A fundamental property of the human language faculty is that on all analytical levels it fractionates language-related aspects of an analog signal into discrete elements to which formal operations apply.¹⁸ Even vastly different, mostly incompatible linguistic theories have acknowledged discreteness as a defining property of language: it can be found in Martinet's (1949: 30) notion of 'Double Articulation', Hockett's (1959: 32) 'Duality of Patterning', and Chomsky's (2016: 4) 'Basic Property'. Adopting such a position not only preserves a clear distinction between competence and performance, a necessity on many different grounds, but it also facilitates disentangling phonological conclusions from phonetic conclusions even though both are drawn from the same data. The only kind of conclusion a phonologist can draw from the Ellis & Hardcastle experiment cited by Ernestus is that the I-language of (some) English speakers contains the following rule: [+NASAL, CORONAL] → [+NASAL, DORSAL] / __ [DORSAL]. Phonologists can draw only this kind of conclusion because their theory both provides and determines the limits of their descriptive vocabulary. Phonological theory does not provide us with the vocabulary to describe a nasal consonant as 'kind of dorsal'. We pointed out above (section 5.1) that [o] and [u] behave phonologically the same, and that both must be analyzed as [+ROUND] vowels, despite the involvement of different muscles in realizing this feature, due to intrasegmental coarticulation with [-HIGH] and [+HIGH], respectively. Again, phonologists do not have, and do not want, the vocabulary to describe a segment as 'kind of round'.¹⁹
If a featural assimilation rule correctly models a part of a speaker's implicit phonological knowledge, a phonetician can then posit hypotheses as to why such a pattern exists, why there is variability in the externalization of this knowledge, what the limits of its variation are, whether the variation is purely biomechanical or partly/mostly/solely cognitive, and so on. For example, the first of these questions might be answered by arguing that the demands of the PR[+DORSAL] override the demands of the PR[+CORONAL] because of the robustness and mechanical inertness of the relatively massive dorsal part of the tongue compared to the less constrained, more mobile coronal part. Therefore, the velar exerts its coarticulatory influence over the nasal. Taken this way, the relationship between assimilation and coarticulation is parallel to that of phonology and phonetics in general, that is, the former is a discretely and abstractly constructed mental representation of, or an implicit knowledge of, the latter (provided that the latter has been phonologized).
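The categorical character of the nasal place assimilation rule discussed above can be made concrete with a minimal sketch. The feature bundles, the names `VOWEL`, `N`, `NG`, `K`, and the function `assimilate_place` are all hypothetical simplifications introduced here for illustration; they are not the feature inventory or formalism of any particular analysis. The sketch treats a segment as a set of feature symbols and the rule as a set-rewriting operation.

```python
# Hypothetical, heavily simplified feature bundles (illustrative assumptions only).
VOWEL = frozenset({"-CONSONANTAL"})
N = frozenset({"+NASAL", "CORONAL"})   # stands in for /n/
NG = frozenset({"+NASAL", "DORSAL"})   # stands in for a velar nasal
K = frozenset({"-NASAL", "DORSAL"})    # stands in for /k/


def assimilate_place(segments):
    """Apply [+NASAL, CORONAL] -> [+NASAL, DORSAL] / __ [DORSAL] to a list of bundles."""
    output = list(segments)
    for i in range(len(segments) - 1):
        target, trigger = segments[i], segments[i + 1]
        if {"+NASAL", "CORONAL"} <= target and "DORSAL" in trigger:
            # Categorical rewrite: CORONAL is replaced by DORSAL outright.
            output[i] = (target - {"CORONAL"}) | {"DORSAL"}
    return output


surface = assimilate_place([VOWEL, N, K])
unchanged = assimilate_place([VOWEL, N, VOWEL])
```

In this toy representation a nasal is either a member of the CORONAL class or a member of the DORSAL class after the rule applies. There is no value the data structure could take that would correspond to 'kind of dorsal', which is the sense in which the descriptive vocabulary limits the conclusions that can be stated.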
In brief, the data most often used to make inferences about phonology comes from spoken utterances. But spoken utterances are not the object of phonological study. Therefore, it does not follow that gradience of phonetic objects automatically translates to gradience of phonological objects.

Speech Planning and the Case of the Intervocalic /j/ in Croatian
Anticipatory coarticulation is widely adduced as proof that coarticulation is not merely a reflection of biomechanical properties (e.g., inertness) of speech organs (Farnetani & Recasens 2013). In order for a coarticulatory effect of, say, labialization ([ʷ]) to influence a unit preceding a rounded vowel from which the effect derives, it is necessary that some cognitive planning is involved. As we see it, phonology provides the knowledge about the discretely constructed form about to be loaded into the speech production mechanism, and CP the means to plan the coarticulatory effect. An example may be drawn from findings presented by Volenec (2013).
The purpose of that study was to see whether there is a statistically significant difference between the acoustic properties of a Croatian intervocalic palatal glide [j] present in the underlying representation, as in /pijem/ → [pijem] 'I am drinking', and a (supposedly) epenthesized palatal glide that is not present underlyingly, as in /vidio/ → [vidijo] 'I saw'. In the latter case, the glide is supposed to surface only when adjacent to a front vowel (Škarić 2007: 75); therefore, only intervocalic environments containing at least one front vowel were compared. For the comparison the study used minimal or subminimal pairs such as /gleda ix/ 'he looks at them' ∼ /gledaj ix/ 'look at them', and /priaɲati/ 'to stick (to)' ∼ /prijaʋiti/ 'to report'. The first result was that in both cases none of the typical acoustic correlates of palatal glides (lowering of F1 and heightening of F2 compared to adjacent vowels, lowering of the intensity between F1 and F2; see Stevens 1998, section 9.2.1) were found in the intervocalic position. This would suggest that the correct derivations are actually /pijem/ → [piem] and /vidio/ → [vidio], that is, with deletion, not epenthesis, intervocalically. However, the second result showed that in words with underlying /j/, vowels preceding the palatal glide had their F1 significantly lowered, suggesting that the glide exerted anticipatory coarticulatory influence on the vowel, despite not being otherwise present in the acoustic signal. In words with no underlying /j/, this lowering of F1 of the preceding vowel was not present.
¹⁹ The idea that one's theoretical apparatus determines the range of possible observations that can be made is an old one in the philosophy of science, discussed with particular reference to the domains of phonetics and phonology by Hammarberg (1976) and Bale & Reiss (2018).
²⁰ This is the main idea behind the 'degree of articulatory constraint' (DAC) model of lingual coarticulation (Recasens et al. 1997), which states that the degree of coarticulatory influence and resistance of a phonetic unit rises in proportion to the degree of tongue dorsum involvement in the production of that unit.
We argue that this case shows a dissociation among three levels of analysis: phonological, cognitive phonetic, and articulatory phonetic. Since there is no incontrovertible evidence of discrete phonological alternations in any of these cases, the most plausible derivations are /pijem/ → [pijem] and /vidio/ → [vidio], despite the fact that the spectrogram corresponding to [pijem] contains no time span that independently corresponds to a segment [j]. Note that segments are abbreviations for feature bundles. The A₁ of CP receives features and transduces them into PR[F]s. Identical adjacent PR[F]s are fused to make a continuum; the palatal glide and front vowels share many distinctive features, and therefore many PR[F]s. CP's A₂ temporally overextends the only PR[F] discriminating between the glide and front vowels (the neuromuscular command responsible for the narrowing of the palatal constriction, which results in the lowering of F1) to serve as an acoustic cue for the glide. The articulatory system then produces something like ♪piem♪, but with ♪i♪'s F1 lowered (as compared to a 'normal' /i/ that is not in the context of an underlying /j/). The hearer usually picks up this cue, which explains why native Croatian speakers consistently report vaguely hearing some sort of [j] in these cases (Škarić 2008: 206-212).
Two conclusions can be drawn from this. First, what enters the articulatory system is not the output of phonology (which is [pijem]); if it were, we would expect to find at least some independent glide-like acoustic properties between the vowels, but there are none. Therefore, a cognitive phonetic stage, distinct from both phonology and articulatory phonetics, is needed for transduction and planning. Second, the phonetic transformations that CP introduces target features, which correspond to a finer level of granularity than segments. The phenomenon presented here makes sense only if the input to CP consists of features, and not indivisible segments; and if the output of CP does not consist of segment-bound articulatory gestures, but PR[F]s. This suggests that neither articulatory gestures nor segments, but transduced features (PR[F]s) are the basic units of speech production. The apparent necessity of units at this intervening level serves as yet another justification of our CP model.
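The account of the Croatian glide given above (feature-by-feature transduction, fusion of identical adjacent PR[F]s, and temporal overextension of the one discriminating PR[F]) can be sketched as a toy pipeline. Everything in the sketch is an assumption made for illustration: the function names `transduce`, `fuse`, and `schedule`, the placeholder feature labels (including `GLIDE`, which merely stands in for whatever feature distinguishes the glide from front vowels), the uniform 100 ms slot duration, and the 60 ms lead are invented and are neither measurements nor part of the CP proposal itself.

```python
from itertools import groupby


def transduce(segments):
    """A1 (toy version): relabel each feature F in each segment slot as 'PR[F]'."""
    return [{f"PR[{feature}]" for feature in segment} for segment in segments]


def fuse(slots):
    """Merge identical PRs in adjacent slots into (start_slot, end_slot) spans."""
    spans = {}
    for pr in sorted(set().union(*slots)):
        present = [pr in slot for slot in slots]
        index = 0
        for is_on, run in groupby(present):
            length = len(list(run))
            if is_on:
                spans.setdefault(pr, []).append((index, index + length))
            index += length
    return spans


def schedule(spans, slot_ms=100, lead_ms=None):
    """A2 (toy version): convert slot spans to millisecond intervals.

    PRs listed in lead_ms start earlier than their slot by the given amount,
    which models anticipatory overextension. All durations are arbitrary.
    """
    lead_ms = lead_ms or {}
    timed = {}
    for pr, runs in spans.items():
        lead = lead_ms.get(pr, 0)
        timed[pr] = [(max(0, start * slot_ms - lead), end * slot_ms)
                     for start, end in runs]
    return timed


# Placeholder bundles for the sequence i-j-e; not a real feature analysis.
i_vowel = {"-BACK", "+HIGH"}
j_glide = {"-BACK", "+HIGH", "GLIDE"}
e_vowel = {"-BACK", "-HIGH"}

spans = fuse(transduce([i_vowel, j_glide, e_vowel]))
timed = schedule(spans, slot_ms=100, lead_ms={"PR[GLIDE]": 60})
```

Under these assumptions the shared PRs form continuous spans across the vowel and the glide, while the single glide-specific PR begins during the preceding vowel's slot. The units being fused and shifted are individual PR[F]s rather than whole segments, which is the property the second conclusion above relies on.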

Conclusion
In this paper, we have argued that the interface between phonology and phonetics (PPI) consists of a transduction process that converts elementary units of phonological computation, features, into temporally specified neuromuscular patterns, which are directly interpretable by the motor system of speech production. Our inquiry is inspired by Lenneberg's magisterial book Biological Foundations of Language (1967), in which he discussed the transformation of phones (segments) into neuromuscular schemata. Our view of the PPI is constrained by substance-free generative phonological assumptions (section 3.1), on the one hand, and by insights gained from psycholinguistic and phonetic models of speech production (section 3.2), on the other. To distinguish transduction of abstract phonological units into planned neuromuscular patterns, arguably the very first step in speech production, from the biomechanics of speech production usually associated with physiological (or more narrowly, articulatory) phonetics, we have termed our theory 'Cognitive Phonetics' (CP). The inner workings of CP (section 4) are described in terms of Marr's (1982/2010) tri-level approach, which we used to construct a 'bridge' from a formal phonological model to activity one might plausibly find in a human nervous system. In order to connect the substance-free and timeless (phonology) with the substance-laden and temporally coordinated (the SM system used in speech), CP takes features of phonological SRs and relates them to neuromuscular activity (A₁) and arranges that activity temporally (A₂), thus generating an array of information (in a format which we call 'True Phonetic Representation') directly interpretable by the SM system. We have also presented some potential neurobiological correlates of various parts of CP (section 4.3). Finally, we have explored some of the implications of CP (section 5), showing how such an approach might inform the study of certain phonetic phenomena, most notably coarticulation, and suggesting that CP provides better explanations of some phenomena often considered to fall within the purview of phonology, such as phonetic naturalness and gradience. Further development of CP as an explanatory model of coarticulation and other PPI phenomena will require sharpening the details of both steps of the transduction algorithm (A₁ and A₂) and of CP's output units (PR[F]). We posit CP as a model intervening between phonology (grammar) and physiological phonetics, and it is not surprising that such ideas have implications for the nature of the adjacent systems. On the phonological side, CP calls for a reassessment of distinctive feature theory in a strict biolinguistic manner. Also, the transduction of other aspects of phonological structure (e.g., prosody) should be explored. Ideally, these further developments of CP should be driven by theoretically sound models of phonological representation and computation on the one hand, and should be grounded in neurobiological findings on the other, thus reducing the conceptual distance between formal linguistics and cognitive neuroscience.

Figure 2: Schema of the process of assigning muscle activity to a string of phones. Based on Lenneberg (1967: 100).

Figure 3: Schema of the process of temporal ordering of muscle activity for a given string of phones. Based on Lenneberg (1967: 101).

Figure 4: A neuromuscular schema as a result of transduction of a string of phones into information directly interpretable by the SM system. Based on Lenneberg (1967: 102).

Figure 5: The architecture of the phonology-phonetics interface and the place of Cognitive Phonetics within it.

Figure 6: A schematization of a distinctive feature. Features serve as the cognitive basis of the bi-directional translation between speech production and perception, and are part of the long-term memory representation for the phonological content of morphemes, thus forming a memory-action-perception loop (Poeppel & Idsardi 2011) at the lowest conceptual level.

Figure 7: Intrasegmental coarticulation based on the interaction of PR[NASAL] and PR[HIGH].

Figure 8: Intrasegmental labial coarticulation. Notice the difference in lip rounding corresponding to [u] on the left, and to [o] on the right.
, naturalness must obviously be a part of phonology. However, this reasoning suffers from a failure to separate 'what' from 'why'. The 'what' and the 'why' do not have the same status in linguistic theory. If the goal of linguistics, phonology included, is to explicitly model the speaker's knowledge of language, that is, to model linguistic competence, then linguistics, phonology included, is to be concerned with the 'what' questions: 'What is it that a speaker knows when she or he is said to know phonology?' and 'What are the rules and representations of particular phonological grammars?' The 'why' question ('Why is phonology (or some aspect of it) the way it is?') does not enter into discussion at this level of inquiry (but see below). Simply put, 'what' is part of competence, but 'why' is not.