Being ahead of time. A number of neural network simulations exploring the anticipation of clause-final heads

Thesis (M.A.) 2004 133 Pages

German Studies - Linguistics



1. Introduction

2. Complexity Measures
2.1. Why complexity measures?
2.2. Yngve
2.3. Bottom-up parsing
2.4. Left-corner parsing
2.5. Chomsky and Miller
2.6. Fodor and Garrett
2.7. Other complexity measures
2.8. Gibson

3. The idea of anticipation
3.1. Shannon and Weaver
3.2. Entropy and anticipation
3.3. SOUL
3.4. New developments
3.5. My concept of anticipation

4. Empirical evidence for anticipation

5. Language processing and Working Memory
5.1. Baddeley’s model of Working Memory
5.2. Individual differences: reading span
5.3. The one-resource model
5.3.1. King and Just (1991)
5.3.2. Just and Carpenter (1992): CCREADER
5.4. The two-resource model
5.4.1. Waters and Caplan (1996)
5.4.2. Caplan and Waters (1999)
5.5. The alternative: Neural Networks
5.5.1. The model by MacDonald and Christiansen (2002)
5.5.2. Postscript to MacDonald and Christiansen’s model

6. A short break: Integration – anticipation – neural networks

7. The simulations
7.1. Introduction to Neural Network modeling
7.1.1. Simple Recurrent Networks
7.1.2. Training and testing the network
7.1.3. Evaluating performance
7.1.4. The grammars
7.1.5. Details of the simulations
7.1.6. Overview of the seven series
7.2. First series
7.3. Second series: Anticipation is modeled
7.4. Third series: Even distribution of verbs
7.5. Fourth series: Easy versus difficult context
7.5.1. Subject and object relative clauses
7.5.2. Simply- and doubly-embedded sentences
With determiners
Without determiners
7.6. Fifth series: Positive and negative evidence
7.6.1. The Restricted GPE
7.6.2. A simple grammar
7.6.3. A complicated grammar
7.7. Sixth series: Early and late evidence
7.7.1. A simple grammar
7.7.2. A complicated grammar
Was the grammar too easy?
Two effects?
7.8. Seventh series: “pure“ anticipation

8. General Discussion
8.1. Series 1 to
8.2. Series
8.3. Series 5 and 6
8.4. Series
8.5. Implications for the anticipation hypothesis
8.6. Connection to other theories

9. Conclusion

10. References

11. Appendix
11.1. Appendix A: Abbreviations
11.2. Appendix B: The grammars
11.2.1. Series
11.2.2. Series
11.2.3. Series
11.2.4. Series 4, first simulation
11.2.5. Series 4, second simulation
11.2.6. Series 4, third simulation
11.2.7. Series 5, first simulation
11.2.8. Series 5, second simulation
11.2.9. Series 6, first simulation
11.2.10. Series 6, second simulation
11.2.11. Series 6, third simulation
11.2.12. Series 6, fourth simulation
11.2.13. Series

“What next !?“

(Dante Hicks in “Clerks“)

1. Introduction

Natural language is a complicated thing. When processing a sentence, the human parser has to keep track of the structure of the sentence; this requires remembering the input string, integrating new words into already built structures, and many other things, all of which has to be done on-line. If the sentence becomes too difficult, the parser will lose control, and processing becomes slow or may eventually break down.

There have been a number of complexity measures for natural language; the most influential one at the moment is Gibson’s (2000) Dependency Locality Theory (DLT). However, in a recent experiment, Konieczny and Döring (2003) found that reading times on clause-final verbs were faster, not slower, when the number of verb arguments was increased. This was taken as evidence against DLT’s integration cost hypothesis and for the anticipation hypothesis originally developed by Konieczny (1996): During language processing, a listener / reader anticipates what is about to come – he is “ahead of time“.

This paper presents a series of simulations modeling anticipation. Because Simple Recurrent Networks (SRNs; Elman 1990) seem to be the most adequate device for modeling verbal working memory (MacDonald & Christiansen 2002), neural networks were used for the simulations.

In seven series of simulations, I managed to model the anticipation effect. In addition to a deeper understanding of anticipation, insights into the way SRNs function could be gained.

The paper is organized as follows. First, I will give an overview of different complexity measures; then the experiment mentioned above will be described. Third, I will briefly discuss existing models of verbal working memory. After a short introduction to neural network modeling, the core part of the paper, the seven simulation series, will be presented in detail. In the Discussion I will argue that SRNs represent a good model for anticipation; implications for the anticipation hypothesis as well as implications for SRNs in general will be considered. Finally, predictions for further experiments will be discussed.

All grammars used in the simulations can be found in the Appendix, as well as a list of abbreviations.

I want to thank Lars Konieczny for accompanying the whole process of development of this paper; Jürgen Dittmann for his advice; Nicolas Ruh for lots of helpful advice concerning the simulations; Sven Hiss, Heidi Fischer, and Simone Burgi for our teamwork in an early simulation; Sarah Schimke and Daniel Müller for fruitful discussions and the nice working climate in the IIG in Freiburg; Gerhard Strube for reading all this; Neal O’Donoghue and my Mom for all the trouble of helping me fight English orthography and syntax; and Sophie for making the last weeks I worked on this paper a lot of fun.

2. Complexity measures

2.1 Why complexity measures?

One of the most basic properties of natural language is seriality. Whether we read or listen, we perceive language on a serial word-by-word basis. However, natural language is certainly hierarchically structured. Consider sentence (1):

(1) The girl hit the boy.

It seems very reasonable that the words the and girl somehow belong together. For example, one might argue that the structure of (1) could be expressed by

(2) [S [NP [Det The][N girl]] [VP [V hit][NP [Det the][N boy]]]]

Now, some sentences cause greater processing difficulties than others. Sentences differ in complexity: the more complex a sentence, the more difficult it is to process.[1] However, the complexity of a sentence is only very weakly correlated with its length: it depends on the structure of the sentence.[2]

If we want to determine how complex a sentence is, there will always be two things that need to be explained: first, how does the human parser function, and second, how is complexity measured with regard to the states the parser encounters when processing the sentence.

If an algorithm could be found that precisely predicts just how difficult a given sentence is to process, much insight into human language processing would be gained. A syntactic complexity metric tries to do exactly this. In particular, such a metric would have to predict processing differences between sentences that apparently have the same meaning; for example, compare (3) and (4):

(3) # The cotton clothing is made of grows in Mississippi.[3]

(4) The cotton which clothing is made of grows in Mississippi.

Also, an appropriate complexity metric should not only predict which sentences or structures are more difficult to process than others, but it should also be embedded in a theory that can explain why these structures are more difficult.

This chapter will present a number of traditional approaches to complexity. Before we begin, some terminology should be clarified. Sentences that are not correct sentences of a language will be referred to as ungrammatical; sentences that are correct but unprocessable are called unacceptable. According to Chomsky (1965), a speaker’s competence decides on the grammaticality of a sentence, while the acceptability (measured on a continuum) depends on a speaker’s performance [4].

For example, sentence (1) is grammatical and acceptable, sentence (5) is ungrammatical, and (6) and (7) are grammatical but unacceptable (but presumably for different reasons).

(5) * The girl hit.

(6) # Men women dogs bite love like fish.

(7) # The horse raced past the barn fell.

We can also find ungrammatical sentences that are deemed acceptable in an acceptability rating task, such as (8) (taken from Gibson & Thomas 1997):

(8) * The apartment that the maid who the service had sent over was well decorated.

The chapter is organized as follows: first, I will briefly sketch a number of classic complexity models; for all these accounts, I will roughly follow the overview in Gibson (1991). After that I will give a more detailed account of Gibson’s (1998) Syntactic Prediction Locality Theory, the most influential complexity metric at the moment.

2.2 Yngve

Yngve’s (1960) complexity measure concentrates on center-embedded sentences. His model consists of a top-down parser that uses a set of phrase structure rules and a stack. Sentences are produced strictly top-down and depth-first. Complexity is measured by counting the “total number of categories that the sentence generator has to keep track of at the point in generating the required node“[5] (Gibson 1991:44).

Processing breakdown occurs when a certain limit is exceeded; following Miller (1956), this limit is supposed to be at 7 ± 2 memory units.

Yngve’s metric correctly predicts that center-embedded sentences like (9)

(9) # The man the dog bit likes fish.

are harder to process than right-branching sentences like (10):

(10) The man hates women that love dogs that like fish.

However, due to the fact that complexity is measured strictly top-down, left-branching structures like (11):

(11) Fred’s aunt’s dog’s tail was very hairy.

are predicted to be very difficult to process, which is not the case. Moreover, there is no empirical evidence that predominantly left-branching languages such as Japanese are any harder to process than predominantly right-branching languages such as English. Therefore, Yngve’s complexity metric has to be rejected.
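The stack count behind Yngve's metric can be made concrete with a small sketch. It assumes trees written as nested tuples and counts, for each word, how many predicted categories are still waiting to be expanded when a top-down, depth-first generator reaches that word. The function name and the two toy trees are illustrative assumptions, not analyses taken from Yngve (1960).

```python
# Toy sketch of a Yngve-style depth count. Trees are nested tuples
# (label, child, ...); a preterminal is (label, "word"). The tree shapes
# below are illustrative assumptions, not analyses from Yngve (1960).

def yngve_depths(tree, pending=0):
    """Return (word, depth) pairs for a top-down, depth-first generator.

    depth = number of categories that have been predicted but not yet
    expanded (right siblings still waiting) when the word is produced.
    """
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return [(children[0], pending)]
    pairs = []
    for i, child in enumerate(children):
        waiting = len(children) - 1 - i  # right siblings still on the stack
        pairs.extend(yngve_depths(child, pending + waiting))
    return pairs


left_branching = (
    "NP",
    ("NP", ("NP", ("N", "Fred's")), ("N", "aunt's")),
    ("N", "dog"),
)
right_branching = (
    "NP",
    ("N", "dogs"),
    ("RC", ("C", "that"), ("VP", ("V", "like"), ("NP", ("N", "fish")))),
)

left_max = max(depth for _, depth in yngve_depths(left_branching))
right_max = max(depth for _, depth in yngve_depths(right_branching))
print(left_max, right_max)
```

In this sketch the count grows with every additional level of left-branching, whereas it stays constant for right-branching trees of any length, which is the prediction criticized above.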

2.3 Bottom-up parsing

The problems in Yngve’s complexity metric were due to the fact that parsing worked exclusively top-down. However, a bottom-up account does not fare much better. With a top-down parsing algorithm, processing difficulties occur whenever categories are predicted but not yet found (which is the case in left-branching structures). In bottom-up parsing, by contrast, processing difficulties are encountered when categories have to be stored but cannot yet be put together, that is, cannot be integrated[6]. Hence, bottom-up parsing correctly predicts that center-embedded structures are more difficult than left-branching structures, but it also predicts that right-branching structures can become infinitely complex:

(12) The man that saw the cat that chased the mouse that ate the cake.

Empirically, this sentence is easy; for any purely bottom-up parser, it will be difficult.

2.4 Left-corner parsing

An improvement on these two algorithms is not hard to find. If neither top-down nor bottom-up parsing works on its own, their advantages should be combined: the most prominent algorithm capable of doing this is called left-corner parsing (for a formal account see Aho & Ullman 1972; for psycholinguistic accounts see Kimball 1973; 1975; Frazier & Fodor 1978; Johnson-Laird 1983).

A left-corner parser “parses the leftmost category of the right hand side (RHS) of a grammar rule from the bottom up, and the rest of the grammar rule from the top down“ (Gibson 1991:51). Consider the following phrase structure rule:

(13) L → R1 R2 R3

The leftmost category of the right-hand side, which is R1, is parsed from the bottom up; once this structure R1 is built, L is predicted and R2 and R3 can be predicted top-down.

The advantage of left-corner parsing lies in the fact that not much stack space is needed for either left-branching or right-branching structures. Center-embedded structures, in contrast, are predicted to be difficult to process, just as desired. Thus, we could propose that a left-corner parser, together with the assumption that complexity is measured by counting the categories that have to be kept track of at a certain point in the sentence, provides a good model for human sentence processing and can thus adequately predict how difficult a sentence is to process.

However, consider the following two sentences:

(14) # Men women dogs bite love like fish.

(15) The farmer chased the fox the dog bit into the woods. (from Gibson 1991:52)

If we just count the number of categories that need to be stored locally, both sentences receive a measure of 6. However, (15) is much easier to process.

This does not mean that left-corner parsing has to be rejected along with top-down and bottom-up parsing: maybe we just need a different way to calculate complexity than just counting the categories that need to be stored. The following section presents several suggestions that were made to tackle this problem.

2.5 Chomsky and Miller

Miller and Chomsky (1963) adopted a left-corner parser. They suggested that the complexity of a given sentence could be measured by calculating the ratio of terminal to nonterminal nodes. However, this predicts the same complexity for center-embedded and left- and right-branching structures, and therefore cannot be an appropriate measure.

Chomsky and Miller (1963) proposed that a metric could “consist of the number of interruptions in grammatical role assignment“ (Gibson 1991:56). As this idea was not specified in much detail, it is hard to evaluate. However, it presents the underlying idea of Gibson’s (1998) model, which will be discussed in much detail below.

2.6 Fodor and Garrett

Similar to Miller and Chomsky (1963), Fodor and Garrett (1967) proposed a ratio-based measure; they suggested, however, that the ratio for calculating complexity should not be that of terminal to nonterminal nodes, but that of terminal nodes (i.e. words) to sentential nodes: the higher the ratio, the easier the sentence is to process. However, this strict dependency on terminals predicts great differences between sentences such as

(16) The man saw the girl.


(17) Shem saw Shaun.

Therefore, it cannot be regarded as an appropriate complexity measure.
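The problem with sentences (16) and (17) can be checked with a few lines of arithmetic. The sketch below assumes that each of the two sentences contains exactly one sentential node; the function name is hypothetical.

```python
def words_per_sentential_node(words, n_sentential_nodes):
    """Ratio of terminal nodes (words) to sentential (S) nodes."""
    return len(words) / n_sentential_nodes


# Assumption: both sentences consist of a single clause (one S node).
sentence_16 = "The man saw the girl".split()
sentence_17 = "Shem saw Shaun".split()

ratio_16 = words_per_sentential_node(sentence_16, 1)
ratio_17 = words_per_sentential_node(sentence_17, 1)
print(ratio_16, ratio_17)
```

Under this measure (17) would come out as considerably harder than (16), although the two sentences differ only in whether the noun phrases contain determiners.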

2.7 Other complexity measures

There are still a great number of measures that will not be discussed in detail here.

Kimball (1973) proposed a hybrid algorithm that combines bottom-up and top-down parsing, called over-the-top parsing. Complexity is measured by how many sentential nodes are being considered at one point during parsing.

The Sausage Machine developed by Frazier and Fodor (1978) is a two-stage model: the first stage parses a part of the input string in a window of approximately five to seven words and assigns a structure to these portions, the second stage puts these structures together.

Frazier (1985) suggested that complexity should not be measured for a single parse state, but in terms of complexity over more than one parse state. She proposes a local nonterminal count which “is the sum of the value of all nonterminals introduced over any three adjacent terminals, where nonterminals are defined to be all nodes except for the immediate projections of lexical items“ (Gibson 1991:62); thus, a maximal local nonterminal count can be calculated, the maximum sum for a given sentence, that represents complexity.
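A minimal sketch of the windowing step in Frazier's measure follows. It assumes that the value of the nonterminals introduced at each word is already known; the per-word values used here are hypothetical and not derived from any particular sentence or from Frazier's weighting scheme.

```python
def max_local_nonterminal_count(values_per_word, window=3):
    """Largest sum of nonterminal values over `window` adjacent words."""
    if len(values_per_word) <= window:
        return sum(values_per_word)
    return max(
        sum(values_per_word[i:i + window])
        for i in range(len(values_per_word) - window + 1)
    )


# Hypothetical nonterminal values introduced at each of six words.
values = [3, 0, 2, 1, 0, 1]
peak = max_local_nonterminal_count(values)
print(peak)
```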

For a more detailed discussion of all these models, as well as a number of reasons why all of them have to be rejected, see Gibson (1991).

2.8 Gibson

The most influential complexity theory at the moment seems to be Edward Gibson’s (2000) Dependency Locality Theory (DLT). The DLT has evolved out of Gibson’s (1998) Syntactic Prediction Locality Theory (SPLT), with two major changes: memory costs are now called storage costs, and these storage costs are not locality-based any more. Since these changes don’t affect the argument in this paper, I will present the model in the 1998 SPLT version.

Following Gibson (1991), Gibson (1998) assumes a ranked parallel parser (Kurtzman 1985; Gorrell 1987; Jurafsky 1996). The parser works with two thresholds, T1 and T2, with T1 > T2. Each discourse structure for a given input string receives a certain activation. The parser seeks to activate a representation above the target activation threshold T1; this will be the favored interpretation for the input string. If another representation receives an activation equal to or greater than T2, it will also be retained in the active representation set, on which the parser continues to work. Representations that have an activation below T2 are not kept in the active representation set. All representations with an activation equal to or greater than T2 use resources available for language processing, while representations with an activation lower than T2 do not use resources. Hence, whether a possible representation is kept in working memory does not depend on how many representations are already active, but on the difference in activation between representations: the allowed activation difference between different structures is limited, while the maximum number of structures itself is not. Garden-path effects can be captured using this limited parallel approach.
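The role of the two thresholds can be illustrated with a toy sketch. The activation values, the threshold values and the function name are assumptions made for illustration only; Gibson (1998) does not specify them in this form.

```python
def split_by_thresholds(activations, t1, t2):
    """Return (active_set, favored) for a ranked parallel parser.

    active_set: representations with activation >= t2 (they use resources)
    favored:    representations with activation above t1
    """
    if not t1 > t2:
        raise ValueError("T1 must be greater than T2")
    active_set = {name: a for name, a in activations.items() if a >= t2}
    favored = [name for name, a in active_set.items() if a > t1]
    return active_set, favored


# Hypothetical activations after reading "The horse raced ...".
activations = {"main-verb reading": 0.9, "reduced-relative reading": 0.2}
active_set, favored = split_by_thresholds(activations, t1=0.8, t2=0.3)
print(active_set, favored)
```

With these made-up numbers the reduced-relative reading falls below T2 and is dropped, so it is no longer available when a later word requires it; this is how a garden-path effect would arise in such a model.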

Two kinds of costs compete for the limited pool of available resources during processing: memory costs and integration costs. As long as the input does not exceed the resource limits, processing will be successful; whenever the input is difficult, processing becomes slow or may eventually break down.

Memory costs are caused by incomplete structures. For example, if a sentence starts with the determiner “the“, then the parser will predict a noun phrase, and together with that it will predict a noun. The noun phrase remains incomplete until the noun is encountered. Thus, “there is a memory cost associated with remembering each category that is required to complete the current input string as a grammatical sentence“ (Gibson 1998:13).

Items are kept in memory until they can be integrated into a higher category, that is, until they can be attached to their syntactic head. The only exception is the matrix verb: as it is always predicted, its prediction does not cause any memory costs.

Unlike the approaches discussed above, Gibson calculates memory costs (as well as integration costs) by counting new discourse referents. The first and second person are never counted because they are always assumed to be present in the discourse space.

So here is the definition of “syntactic prediction memory cost“ (Gibson 1998:15):

"The prediction of the matrix predicate, V0, is associated with no memory cost. For each required syntactic head Ci other than V0, associate a memory cost of M(n) memory units MUs where M(n) is a monotone increasing function and n is the number of new discourse referents that have been processed since Ci was initially predicted.“[7]

Intuitive complexity judgements are supposed to be determined by the maximal memory complexity reached during processing of a given sentence, not by the average amount of complexity.

While memory costs arise whenever words are encountered that cannot be linked to their syntactic heads, integration costs arise when words that already have been parsed have to be integrated into other categories. Again, integration costs are measured over new discourse referents. Put simply, integration costs are calculated by counting how many categories have to be integrated at a certain point in the sentence. The exact definition of integration cost is as follows (Gibson 1998:12f):

"The integration cost associated with integrating a new input head h2 with a head h1 that is part of the current structure for the input consists of two parts: (1) a cost dependent on the complexity of the integration (e.g. constructing a new discourse referent); plus (2) a distance-based cost: a monotone increasing function I(n) energy units (EUs) of the number of new discourse referents that have been processed since h1 was last highly activated.“

Although both the memory and the integration function are supposedly non-linear, linearity is assumed for simplicity.

Together with Just and Carpenter (1992) and Waters and Caplan (1996), Gibson assumes that processing and storage access the same pool of working memory resources. For comparing estimated costs with reading times, Gibson introduces time units (TUs). To calculate how much energy is necessary for performing an integration, as defined in energy units, Gibson gives the following equation:

An energy unit (EU) = memory unit (MU) * time unit (TU) (Gibson 1998:15)

If there is a pool of 10 available MUs, and an integration requiring 5 EUs has to be performed, then the time required will be 5 EUs / 10 MUs = 0.5 TUs. If 10 EUs are needed for an integration, but only 5 MUs are available, then the time required for the integration is 10 EUs / 5 MUs = 2 TUs.
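The two calculations above follow directly from solving the equation for the time unit. As a minimal sketch (the function name is hypothetical):

```python
def integration_time(energy_units, available_memory_units):
    """Time units needed for an integration: TU = EU / MU."""
    return energy_units / available_memory_units


fast = integration_time(5, 10)   # 5 EUs with 10 MUs available
slow = integration_time(10, 5)   # 10 EUs with only 5 MUs available
print(fast, slow)
```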

For the purpose of the argument in this paper, I will focus exclusively on integration costs.

Here is an example for how to calculate integration costs over a sentence for an object-extracted relative clause (ORC) (Gibson 1998:20):

(18) The reporter who the senator attacked admitted the error.

[Table of per-word integration costs for (18), in EUs; not reproduced in this excerpt]

The first point where an integration has to be performed is at the word reporter. Since no new discourse referents have been processed between the and reporter, the integration cost is I(0) EUs at this point. However, the and reporter together form a new discourse referent, which also consumes some integration resources. These costs are ignored for simplicity.

The next word who attaches to reporter; since no new discourse referents have been processed in between, the cost is again I(0) EUs. The same holds for the and senator. However, when reaching the word attacked, more integration costs arise:

"Processing the next word, ‚attacked,’ involves two integration steps. First, the verb ‚attacked,’ is attached as the verb for the NP ‚the senator’. This attachment involves assigning the agent thematic role from the verb ‚attacked’ to the NP ‚the senator’. This integration consumes I(1) EUs, because one new discourse referent (‚attacked’) has been processed since the subject NP ‚the senator’ was processed. The second integration consists of attaching an empty-category as object of the verb ‚attacked’ and co-indexing it with the relative pronoun ‚who’. Two new discourse referents have been processed since ‚who’ was input – the object referent ‚the senator’ and the event referent ‚attacked’ – so this integration consumes an additional I(2) EUs, for a total cost of I(1) + I(2) EUs.“ (Gibson 1998:20)

The next word admitted has to be attached to reporter, which requires I(3) EUs, since three new discourse referents had to be processed in between: the senator, attacked and admitted. Similarly, the integration costs for the last two words can be calculated.

Now consider the following subject relative clause (SRC):

(19) The reporter who attacked the senator admitted the error.

[Table of per-word integration costs for (19); not reproduced in this excerpt]

Costs are calculated analogously to the ORC.

If we assume that I(n) = n, the points where integration cost is greatest are the verbs attacked and admitted for the ORC. Since I(n) is supposed to be a monotone increasing function, reading times should be higher on the matrix verb admitted than on attacked. Furthermore, integration cost on the matrix verb should be greater in the ORC when the complexity of the intervening integrations (which are not captured by the simplifying formula) is taken into account.
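The ORC costs described above can be tallied in a short sketch under the linear assumption I(n) = n. Only the words whose integrations are spelled out in the text are included; the data layout is my own and simply lists, for each word, the n values of the integrations performed at that word.

```python
def I(n):
    """Linear simplification of the integration cost function: I(n) = n."""
    return n


# For each word of (18) up to the matrix verb: the numbers of new discourse
# referents crossed by each integration performed at that word.
orc_integrations = [
    ("The", []),
    ("reporter", [0]),
    ("who", [0]),
    ("the", []),
    ("senator", [0]),
    ("attacked", [1, 2]),   # I(1) + I(2)
    ("admitted", [3]),      # I(3)
]

orc_costs = [(word, sum(I(n) for n in ns)) for word, ns in orc_integrations]
peak = max(cost for _, cost in orc_costs)
peak_words = [word for word, cost in orc_costs if cost == peak]
print(orc_costs)
print(peak_words)
```

Under strict linearity the two verbs receive the same cost, so any predicted difference between them has to come from the shape of I(n) or from costs that the simplified formula leaves out.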

King and Just (1991) compared ORCs and SRCs in a self-paced word-by-word reading experiment.[8] A similar experiment was conducted by Gibson and Ko (1998), who obtained similar results (differing only in details). The correlation between SPLT predictions for ORCs and SRCs and the data in the two experiments is high. Thus one can state that, for this type of sentence, the SPLT seems to be a good complexity metric.

The SPLT is later extended in a way that not only new discourse referents, but also new discourse predicates cause additional costs; however, as this is of no importance to the argument here, I will not be discussing this issue.

Gibson (1998) shows that the SPLT can explain a wide range of psycholinguistic phenomena such as unacceptability of multiple center-embedded sentences, lower complexity of cross-serial dependencies (as in Dutch) compared to center-embedded dependencies, heaviness effects, ambiguity effects, and others.

One can conclude that at the moment, the SPLT is the best complexity metric available and the best model for explaining processing load effects.

3. The idea of anticipation

The SPLT is able to explain a wide range of data. As in most complexity measures and most parsing models, complexity in the SPLT depends only on costs. These costs arise either from storing something in memory or from integrating stored items into already built structures. Both memory and integration costs don’t look ahead to what is about to come: whether a current input word could have been predicted in advance or not doesn’t play a role for measuring complexity. Predictions are made, but they are costly, and they don’t facilitate the later processing of what has been predicted.

This is where the principle of anticipation comes into play. The anticipation hypothesis roughly says that the difficulty of processing a word is influenced by the string that has been processed up to the current word, in the sense that processing can be facilitated by predictions made earlier. I take anticipation as a purely syntactic principle: When talking about anticipation, I mean anticipation of syntactic heads. Thus, anticipation benefits from exactly that which causes integration costs.

This chapter is organized as follows: First, I will describe the roots of the anticipation idea that date back to Shannon and Weaver’s communication theory. Then I will sketch the first appearance of the anticipation hypothesis similar to its current form. After that I will briefly mention a couple of other trends related to the anticipation idea; finally, I will give a formal definition of the anticipation hypothesis.

3.1 Shannon and Weaver

The idea of anticipation can be traced back to Shannon and Weaver’s (1949) Mathematical Theory of Communication. Shannon and Weaver (henceforth SW) were interested in what they call the technical problem of communication, which can be stated as “how accurately can the symbols of communication be transmitted“ (Shannon & Weaver 1949:4). Figure 1 shows the way every message has to go from its source to its destination:

[Figure not reproduced in this excerpt]

Fig. 1 (after SW:7)

A message carries information from a sender to a receiver. Information does not have anything to do with meaning; SW define information as “a measure of one’s freedom of choice when one selects a message [out of a pool of possible messages]“ (SW:9). This definition is crucial: the amount of information is greatest when freedom of choice is maximal, and thus when determination is minimal.

Technically, information is calculated as the logarithm to the base 2 of the number of possible choices, and is measured in bits (binary digits). For example, take the situation of throwing a fair die. There are 6 possible events (choices, or messages), all equally probable. Then the information of, say, having thrown a “2“ is found by solving 2^x = 6, that is, x = log2 6 ≈ 2.58 bits.
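The calculation for equally probable choices can be reproduced as follows; the function name is hypothetical.

```python
import math


def information_bits(n_choices):
    """Information (in bits) of one out of n equally probable choices."""
    return math.log2(n_choices)


die = information_bits(6)    # one face of a fair die
coin = information_bits(2)   # one side of a fair coin
print(round(die, 2), coin)
```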

However, in most cases the given choices are not equally probable. When talking about language, one will always need to calculate with probabilities. For example, the probability that the letter “j“ is followed by a “g“ in English is zero. The sequence “colorless green ideas sleep furiously“ is highly improbable (in non-linguist language) but not impossible, whereas the sequence “Lukas hates gherkins“ is of much higher probability.

Language can be characterized as a system that produces sequences of symbols according to certain probabilities, and therefore is a stochastic process. Since probabilities of events depend on previous events, these stochastic processes are Markoff processes or Markoff chains. SW are interested in Markoff processes that generate messages; these are called ergodic processes.

Now, the main achievement of SW is that they discovered that the notion of information turns out to be exactly the same as the notion of entropy in thermodynamics. The second law of thermodynamics states that entropy always increases in physical systems. Entropy is a measure of the randomness of a situation: hence, physical systems have the tendency to become less and less organized.

Once one has calculated the information of a message, this value can be compared with the maximum possible value; this is called the relative entropy of the message. One minus the relative entropy is called the redundancy, which denotes the “fraction of the structure of the message which is determined not by the free choice of the sender, but rather by the accepted statistical rules governing the use of the symbols in question“ (SW:13). For example, natural languages have a redundancy of about 50 percent.

Finally, here is the definition of entropy H for a set (e.g. a language) with n independent symbols (or messages), whose probabilities of choices are p1, p2, ..., pn:

$$H = -\sum_{i=1}^{n} p_i \log_2 p_i$$
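Entropy, relative entropy and redundancy as described above can be computed in a few lines. The sketch assumes logarithms to the base 2, so that H is measured in bits, and takes the maximum possible value to be the entropy of n equally probable symbols; the example distribution is made up.

```python
import math


def entropy(probs):
    """Shannon entropy H in bits for symbol probabilities p1, ..., pn."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def redundancy(probs):
    """One minus the relative entropy (H divided by its maximum, log2 n)."""
    if len(probs) < 2:
        raise ValueError("need at least two symbols")
    return 1 - entropy(probs) / math.log2(len(probs))


uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.5, 0.25, 0.125, 0.125]   # made-up probabilities
print(entropy(uniform), redundancy(uniform))
print(entropy(skewed), redundancy(skewed))
```

The uniform distribution has maximal entropy and therefore no redundancy, while any deviation from equal probabilities lowers the entropy and raises the redundancy.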

While SW’s ideas about noise and other phenomena are of no importance for us here, there is one more thing that should be mentioned. A useful device for evaluating a model’s performance is the “Series of Approximations“ (SW:43ff). These approximations take probabilities of letters and/or words as found in corpora. Given these probabilities, one can generate sequences of letters/words and compare these sequences to a model’s output. These are common approximations:

- zero-order approximation: all symbols independent and equi-probable
- first-order approximation: all symbols independent but with frequencies of text
- second-order approximation: digram structure (letter combinations) as in natural language
- third-order approximation: trigram structure (combinations of three letters) as in natural language
- first-order word approximation: words chosen independently but with frequencies as in natural language
- second-order word approximation: probabilities of word transitions as in natural language

A typical sequence for a second-order word approximation in English would be:


From about sixth-order word approximation on, almost all produced sequences are correct.
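How a second-order word approximation is generated can be sketched with a tiny bigram model. The corpus below is a made-up toy text, the function names are hypothetical, and the fixed random seed serves only to make the run repeatable.

```python
import random
from collections import defaultdict


def bigram_table(tokens):
    """Map each word to the list of words that follow it in the corpus.

    Repeated successors stay in the list, so sampling from it reproduces
    the transition frequencies of the corpus.
    """
    table = defaultdict(list)
    for prev, nxt in zip(tokens, tokens[1:]):
        table[prev].append(nxt)
    return table


def generate(table, start, length, seed=0):
    """Generate up to `length` words, each chosen given only the previous one."""
    rng = random.Random(seed)
    words = [start]
    while len(words) < length and table.get(words[-1]):
        words.append(rng.choice(table[words[-1]]))
    return words


corpus = "the girl hit the boy and the boy hit the dog".split()
table = bigram_table(corpus)
sample = generate(table, "the", 8)
print(" ".join(sample))
```

Every pair of adjacent words in the output occurs in the corpus, but nothing guarantees that the sequence as a whole is a sentence of the language, which is why low-order approximations produce locally plausible but globally incoherent strings.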

A third-order approximation usually provides a good measure for calculating a model’s performance. However, for the neural network simulations described below, no comparison was made, since either the model performed very well, or, in a few cases, learning did not converge at all.

3.2 Entropy and anticipation

So how does this idea connect to the anticipation hypothesis? Information, in SW’s terms, is defined as freedom of choice. When processing a sentence, one starts with maximum freedom of choice, and finally ends up with no freedom of choice at the end of the sentence (in an unambiguous sentence). Entropy is reduced at every step (every word) throughout the sentence. The basic idea is that when freedom of choice is small, processing is easy, and vice versa.[9] Contrary to the approaches described in the previous chapter, the focus lies on what can come next at some point in the sentence. For example, consider the following two sentence fragments:

(21) Die Einsicht, dass der Freund des Polizisten ...

the insight that the friend-nom the policeman-gen ...

“the insight, that the friend of the policeman ...“

(22) Die Einsicht, dass der Freund den Polizisten ...

the insight, that the friend-nom the policeman-acc ...

“the insight that the friend ... the policeman ...“

In (22), the accusative “den Polizisten“ reduces entropy in the sense that no intransitive verb is possible in the subordinate clause, whereas no such reduction takes place in fragment (21). The principle of anticipation of heads states that the listener/reader can make use of this kind of limitation of continuations; hence, reading times on the subordinate verb should be lower in (22) than in (21). Since the reduction is based solely on the case of “Polizist“, this is not a semantic, but a purely syntactic process.
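The contrast between (21) and (22) can be illustrated with a toy calculation. The verbs and their probabilities are hypothetical and not estimated from any corpus; the point is only that ruling out a verb class lowers the entropy over possible clause-final verbs.

```python
import math


def entropy(dist):
    """Shannon entropy in bits of a {symbol: probability} distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def restrict(dist, allowed):
    """Keep only the allowed symbols and renormalize the probabilities."""
    kept = {s: p for s, p in dist.items() if s in allowed}
    total = sum(kept.values())
    return {s: p / total for s, p in kept.items()}


# Hypothetical probabilities for the clause-final verb.
verbs = {"schläft": 0.3, "lacht": 0.2, "sieht": 0.3, "kennt": 0.2}
transitive = {"sieht", "kennt"}

h_genitive = entropy(verbs)                           # (21): nothing ruled out
h_accusative = entropy(restrict(verbs, transitive))   # (22): intransitives ruled out
print(round(h_genitive, 3), round(h_accusative, 3))
```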

In contrast, the SPLT would predict the exact opposite, namely higher reading times for (22) because integration costs are greater (two arguments instead of one need to be integrated at the subordinate verb).
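
The entropy reduction in (22) as compared to (21) can be made concrete with a small calculation. This is a sketch under strong assumptions: the four-verb lexicon, the frame labels, and the uniform probabilities over compatible verbs are invented for illustration, and the sketch pretends that the verb is the only thing still to be predicted.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical mini-lexicon: verb -> subcategorization frame (assumption).
LEXICON = {
    "schlief": "intransitive",
    "lachte": "intransitive",
    "sah": "transitive",
    "verhaftete": "transitive",
}

def verb_entropy(accusative_seen):
    """Entropy over clause-final verbs, assuming all compatible verbs are equally likely."""
    if accusative_seen:
        # An accusative object rules out intransitive verbs, as in fragment (22).
        candidates = [v for v, frame in LEXICON.items() if frame == "transitive"]
    else:
        # No constraint yet, as in fragment (21).
        candidates = list(LEXICON)
    p = 1.0 / len(candidates)
    return entropy([p] * len(candidates))

h21 = verb_entropy(accusative_seen=False)
h22 = verb_entropy(accusative_seen=True)
print(h21, h22)
```

With this toy lexicon, the accusative halves the set of candidate verbs and thereby lowers the entropy at the verb position by one bit. The anticipation hypothesis links this kind of reduction to easier processing of the verb.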

While the anticipation hypothesis only takes some ideas from SW, other approaches, like that of John Hale, stick much more closely to SW’s original ideas. This will be touched on in the General Discussion.

3.3 SOUL

The anticipation hypothesis was first developed by Lars Konieczny (1996) in his doctoral thesis. Konieczny developed a model called Semantics-Oriented Unification-based Language system (SOUL) in an HPSG framework (Head-driven Phrase Structure Grammar; Pollard & Sag 1994). This parser was mainly developed for modeling reading time data in ambiguity resolution. However, in the chapter about attachment preferences in verb-final constructions, he discusses the idea of anticipation (here called projection).

Roughly, the idea is as follows. HPSG works with feature structures. These feature structures include open slots for arguments being expected in the course of the sentence. What makes anticipation (or, projection) possible is that feature structures can be co-indexed, or shared, through unification.[10] Every new argument integrated into the sentence structure adds information to the verb prediction and thus constrains the class of possible verbs. Consider the following verb-final clause fragment:

(23) ..., daß Peter das Buch las

..., that Peter-nom the book-acc read

..., “that Peter read the book“ (Konieczny 1996:208)

When the parser encounters “das Buch“, it modifies the subordinate clause’s feature structure in such a way that intransitive verbs are excluded (through “COMP-DTRS“). When the verb is encountered, fewer possibilities need to be taken into account, the prediction is richer, and thus shorter reading times are predicted.

In SOUL, only possible sentences (or sentence fragments) are represented (unlike in neural networks). Every new word can potentially exclude structures that it cannot be integrated into. Hence, this limitation works bottom-up: an element seeks to attach to a higher structure; if a structure that is still represented does not allow this (i.e., if there are no open slots for the element), the structure is discarded.

Here, anticipation works by making predictions about possible continuations more and more precise. Konieczny calls this incremental prediction refinement. The focus lies on clause-final verbs: at the beginning of a subordinate clause, a finite verb is always predicted, and by successively encountering verb arguments the class of possible finite verbs can be reduced. The more verb arguments a verb-final clause contains, the more restricted are the possibilities at the verb position, and the easier the verb is to process.
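
The logic of incremental prediction refinement can be sketched as follows. Note that this is not SOUL and does not implement HPSG unification; plain set filtering is swapped in as a stand-in. The lexicon, the case frames, and the function names are hypothetical and serve illustration only.

```python
# Hypothetical lexicon: verb -> set of cases its arguments bear (assumption).
VERB_FRAMES = {
    "schlief": {"nom"},
    "las": {"nom", "acc"},
    "half": {"nom", "dat"},
    "gab": {"nom", "dat", "acc"},
}

def refine(candidates, case):
    """Keep only verbs whose frame offers a slot for an argument with this case."""
    return {verb for verb in candidates if case in VERB_FRAMES[verb]}

def predict_incrementally(cases):
    """Return the sorted candidate verb set before and after each argument."""
    candidates = set(VERB_FRAMES)
    history = [sorted(candidates)]
    for case in cases:
        candidates = refine(candidates, case)
        history.append(sorted(candidates))
    return history

# Arguments of fragment (23): "Peter" (nominative), "das Buch" (accusative).
steps = predict_incrementally(["nom", "acc"])
for step in steps:
    print(step)
```

In this toy setting the nominative leaves all four verbs available, while the accusative removes the verbs without an accusative slot. The set of candidates at the verb position thus shrinks with every argument, which is the property the anticipation hypothesis relies on.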

3.4 New developments

At the moment, the concept of anticipation seems to have become quite popular. Since this is not the place for a complete review of the literature, I only want to mention three approaches. First, Jos van Berkum (van Berkum, Brown, Hagoort, & Zwitserlood 2003; van Berkum, Brown, & Hagoort 1999) found syntactic anticipation in ERP studies. Second, the visual world paradigm is currently en vogue: numerous studies have observed anticipatory eye movements in a task where subjects heard a sentence while looking at a “visual world“ scene (Altmann & Kamide 1999; Tanenhaus, Magnuson, Dahan, & Chambers 2000; Dahan, Magnuson, & Tanenhaus 2001; Kambe, Rayner, & Duffy 2001; among others). Finally, the already mentioned frequency-based approach by John Hale will be discussed in chapter 8.

3.5 My concept of anticipation

The main goal of this paper will be to provide insights into a phenomenon labeled “anticipation“ by modeling the data of an eye-tracking experiment with neural network simulations. In the following, anticipation always means anticipation of syntactic heads. So here is a definition of what I take to be anticipation:

Anticipation hypothesis

The processing of a syntactic head is facilitated by syntactic constraints, especially concerning the head’s argument structure, imposed on it by the foregoing input string.

This means that not only clause-final verbs, but all final syntactic heads benefit from anticipation. In this paper, I have only studied verbs; however, it is assumed that the anticipation effect can be found for nouns as well.

It should be pointed out once more that the anticipation hypothesis is a principle opposite to SPLT’s integration costs; the relationship between these two will be one major issue in this paper.

4. Empirical evidence for anticipation

Konieczny and Döring (2003) compared the predictions of the anticipation hypothesis with those of Gibson’s SPLT in an eye-tracking study. We tested German sentences that included a subordinate clause with the verb in clause-final position, and varied the number of verb arguments. While SPLT predicts higher reading times for sentences with more verb arguments due to integration costs, the anticipation hypothesis predicts the exact opposite: more arguments should facilitate verb processing.

We compared sentences of the type:

(24) NPnom - “that“ - NPnom - NPgen/dat - NPacc - PPNmod/Vmod - verbsubord - verbmatrix - NPacc.

The second NP of the subordinate clause could either be a genitive, modifying the preceding subject NP, or a dative, hence belonging to the verb. Additionally, a PP right before the verb could be either noun-modifying or verb-modifying. Thus, sentences were always equal in length, but differed with respect to the number of verb arguments in the subordinate clause:

- two (in case of NPgen and PPNmod)
- three (in case of NPgen and PPVmod or in case of NPdat and PPNmod)
- or four (in case of NPdat and PPVmod)

Here is an example:

(25) Die Einsicht, dass - der Freund - des/dem Kunden - das Auto - aus Plastik/aus Freude - verkaufte, - erheiterte - die anderen.

The insight, that - the friend - the client-gen / the client-dat - the car - made of plastic / with pleasure - sold, - amused - the others.

”The insight that the friend sold the car made of plastic to the client amused the others.” OR

”The insight that the client’s friend sold the car with pleasure amused the others.”

Eye movements were recorded with a Generation 5.5 Dual Purkinje Image Eye-tracker; performance was measured by regression path durations (RPDs) for every phrase as indicated by the dashes in sentence (25). We were interested in performance at the subordinate verb.

Figure 2 illustrates mean reading times across the embedded clause and at the matrix verb. RPDs were shorter (229 ms on average) when a dative, instead of a genitive, was read beforehand. PP type, however, had no reliable impact on reading the embedded verb.

Fig. 2: Average regression path durations (RPD) per word for the subordinate clause and the matrix verb.

The results disconfirm the integration cost hypothesis, and support the anticipation hypothesis: an additional verb argument facilitated verb processing.

The lack of a PP effect is compatible with neither the integration cost hypothesis nor the (unrestricted) anticipation hypothesis. However, the null effect may have several causes. First, adverbials generally impose weaker constraints on verbs than arguments do. Second, the PP directly preceded the verb, so the time left for it to take effect might have been too short. Third, the PPs in the materials may not have modified the verb or the noun as unambiguously as intended. Note that PPs, as opposed to NPs, are not morphologically marked as verb arguments in German.

The results are in line with the data reported by Konieczny (2000), who found that reading times on clause-final verbs were shorter when integration had to cross a longer distance to the verb’s arguments. Distance was manipulated by attaching a relative clause to the direct object, and by adding an adverbial PP. Thus, the processing of a clause-final head was facilitated when information was added to one of its arguments (by the relative clause), or when an argument itself (the PP) was added.

Furthermore, there is evidence for anticipation by Vasishth (2002), who found that clause-final verbs in Hindi were read faster when an adverb was added.

The data reported support the anticipation hypothesis and disconfirm integration cost as the main component of cognitive load. Given this encouraging result, I decided to try to implement a model for the anticipation hypothesis. The next chapter will review a number of already existing models of human working memory; the chapter following will present my own approach.

5. Language processing and Working Memory

5.1 Baddeley’s model of Working Memory

It is generally assumed that language processing takes place in working memory. The term “working memory“ (henceforth WM) was introduced by Baddeley and Hitch (1974). While the much older term short-term memory (Hebb 1949) was used for a device for storage and retrieval only, WM is responsible for both storage and processing (in terms of symbol manipulation). Baddeley and Hitch suggested that “working memory plays an important role in supporting a whole range of complex everyday cognitive activities including reasoning, language comprehension, long-term learning, and mental arithmetic.“ (Gathercole & Baddeley 1993:2).

The WM model by Baddeley and Hitch consists of three basic parts: a central coordinating unit called the Central Executive (CE), and two slave systems, the Visuo-Spatial Sketchpad (VSSP) and the Phonological Loop (PL). The two slave systems handle the short-term storage and processing of visuo-spatial and phonological information, respectively; all other processes take place in the CE. Language processing is supposed to start at the PL when hearing an utterance; everything beyond putting sounds together into words happens in the CE. All three components are connected to a long-term store (long-term memory).

This model of WM has been predominant over the last three decades. Evidence for the model mostly came from dual-task experiments testing phenomena like the phonological similarity effect (Conrad 1964; Conrad & Hull 1964; Murray 1965), articulatory suppression (Murray 1967; 1968), or word length effects (Baddeley, Thomson, & Buchanan 1975; Ellis & Hennelly 1980; Naveh-Benjamin & Ayres 1986). For critical views on different aspects of Baddeley’s model see Cowan (1995), Penney (1989), or Mohr (1996), among others.

The two most influential symbolic models of verbal working memory, those by Just and Carpenter (1992) and Caplan and Waters (1999), are roughly based on Baddeley and Hitch’s model of WM. Both models are located in the CE. Before discussing these models, however, I want to sketch one experimental device that plays an important role in them, namely the reading span test developed by Daneman and Carpenter (1980).

5.2 Individual differences: reading span

Much attention has been paid to individual differences in verbal WM capacity. Since older measures like the digit span or the word span seem to be only weakly correlated with reading ability (Perfetti & Lesgold 1977), Daneman and Carpenter (1980) developed the reading span test to measure reading comprehension performance.

Daneman and Carpenter assume one working memory component that is responsible for both storage and processing. As a “minimal hypothesis“, they suggest that the “processes of good and poor readers differ only in some quantitative way“ and thus that the “more efficient processes of the good reader could be functionally equivalent to a larger storage capacity“ (451). Hence, faster processing of a sentence is viewed as a purely quantitative phenomenon that could be measured by storage capacity as well, because these two domains use the same pool of WM resources. If this pool is not sufficient for a given task, a trade-off between storage and processing is assumed.

In the Daneman and Carpenter reading span test, subjects had to read a number of sentences. After that, they had to recall the last words of each of the sentences. The presentation of the sentences was rather quick, such that readers could not re-read a sentence. Throughout the experiment, the number of sentences was steadily increased.

Comprehension of the sentences was tested by recalling facts in one experiment, and by requiring the reader to compute pronominal reference in a second experiment. Reading span was defined as the number of sentence-final words that could be correctly recalled. Reading comprehension was measured by the US-standard Verbal Scholastic Aptitude Test (SAT) and two other, more specific comprehension tests.

Daneman and Carpenter found high correlations in both experiments between reading span and reading comprehension. They conclude that the reading span test is an adequate measure for working memory capacity.

Although the reading span test only measures a quantitative component (namely verbal working memory capacity), the authors claim that readers with a high reading span are good readers partly because they employ more efficient processes. Language processing and storage are thought to depend on exactly the same pool of resources, and hence to be subject to the same individual memory limitations.

In the end, however, Daneman and Carpenter admit that good readers and bad readers make different kinds of mistakes, and that there might be more to working memory than just capacity, namely qualitative differences between subjects. Still, they conclude that these qualitative differences could at least partly be due to quantitative differences; for example, good readers might use different chunking strategies based on their greater working memory capacity.

Finally, it should be mentioned that the reading span test has proved to have very low reliability. Reading span is also said to improve greatly when, for example, a subject has been reading a book before taking the test.

Over the years the reading span test became a very popular measure; most studies concerned with language processing and working memory since then have paid great attention to individual differences in WM capacity. Thus, I will also focus on reading span when discussing models for verbal working memory, although reading span itself will only play a very minor role in my own model.

5.3 The one-resource model

5.3.1 King and Just (1991)

The first model to be discussed is that developed by King and Just (1991) and Just and Carpenter (1992).

King and Just compared center-embedded object relative clauses (ORCs, such as (26)) with center-embedded subject relatives (SRCs, such as (27)).

(26) The reporter that the senator attacked admitted the error.

(27) The reporter that attacked the senator admitted the error.

They state two kinds of demands that make sentences containing an object-relative clause harder to process: assigning thematic roles is more difficult with two consecutive noun phrases (as in the ORC) than with a verb between the two NPs (as in the SRC) (Holmes & O’Regan 1981); in addition, assigning two different thematic roles to one constituent (reporter is the agent of the matrix clause and the patient of the subordinate clause) causes difficulties (Bever 1970); the shift of perspective when having a different agent in the subordinate clause also seems to tax cognitive resources (MacWhinney & Pleh 1988).[11]

The more complex ORC places greater demands on verbal working memory than the SRC. King and Just hypothesize that these differences interact with individual differences in working memory capacity.

In order to test their hypothesis, King and Just designed two experiments using the reading span test described above. In the first experiment, they imposed an extra load on working memory during sentence processing; in the second, they supplied additional pragmatic information to make comprehension easier. In the first case, performance should become worse, whereas in the second case the extra information should make understanding easier.

For data analysis, the sentences were divided into four areas as follows (the sentences were extended by a couple of words at the end for measurement purposes):

illustration not visible in this excerpt

Subjects were divided into high-, mid-, and low-span readers, but only high- and low-span readers were compared. In both experiments, high-span readers showed better comprehension and faster reading times than low-span readers. For high-span readers, an increase in memory load had little effect on subject relatives and a large effect on object relatives, while no interaction between sentence type and memory load was found for low-span readers. Pragmatic information, on the other hand, helped all readers, but improved the low-span readers’ performance to a greater degree, supporting the hypothesis that additional information is more useful when resources are scarce.

Based on these results, a model of human verbal working memory was developed. This model, called Capacity Constrained Reader (CCREADER), is only briefly touched on in King and Just (1991); a detailed description can be found in Just and Carpenter (1992). In the following section I will give a short overview of what their model looks like, how it works, and what predictions it makes.

5.3.2 Just and Carpenter (1992): CCREADER

The model developed by King and Just (1991) and Just and Carpenter (1992) was based on the data found by King and Just (1991) and a number of further findings (Ferreira & Clifton 1986; Kemper 1986, 1988; MacDonald, Just, & Carpenter 1992; Carpenter & Just 1989). The model was intended to replicate these data and support Just and Carpenter’s argument that capacity is a basic feature of human verbal working memory.

CCREADER is built using the CAPS architecture, which combines symbolic and connectionist elements. It is supposed to be located in the Central Executive of Baddeley and Hitch’s WM model. All processes use the same pool of resources; hence, this is a one-resource model.

CCREADER has a procedural and declarative memory and works with rules that have a conditional and an activational side. Rules may look like:

“If the word the occurs in a sentence, then increment the activation level of the proposition stating that a noun phrase is beginning at that word“.

Rules can fire in parallel (the connectionist element). Performance of the model is measured by counting the cycles the model needs to elevate at least one rule above a certain threshold; according to the authors, this measure can directly be mapped onto reading times. Thus, “being kept in working memory“ means that the activation (of a certain rule) is above threshold; being removed from working memory is not realized through simple displacement by other elements, but such that the activation level sinks back below the threshold.

The parser resembles the one developed by Winograd (1983); it is “fairly limited, with a focus on handling embedded clauses that modify the sentence subject“ (Just & Carpenter 1992:137). The model can be compared to an augmented transition network (ATN), with nodes corresponding to syntactic constituents and arcs linking the nodes corresponding to syntactic and sequential properties of the nodes.

Additionally, CCREADER has a second kind of threshold, which limits the overall amount of activation in the model: if too many rules are active, all of them are cut back equally so that the total activation does not exceed the threshold. At the same time, the activation of rules decays over time.

The overall activation threshold is essential: Just and Carpenter claim it to be equivalent to the capacity of human verbal working memory. Low-span readers are modeled using a low overall activation threshold, and high-span readers using a high threshold.
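
How the two thresholds interact can be sketched in a few lines. This is not CCREADER or the CAPS architecture itself. The function names, all numeric values, the decay rate, and the choice of proportional scaling as one reading of cutting all activations equally are assumptions made for illustration.

```python
def step(activations, increments, capacity, decay):
    """One cycle: decay, add rule-driven increments, then enforce the overall cap."""
    updated = {name: level * decay for name, level in activations.items()}
    for name, gain in increments.items():
        updated[name] = updated.get(name, 0.0) + gain
    total = sum(updated.values())
    if total > capacity:
        # Scale every element by the same factor (assumed reading of "cut equally").
        factor = capacity / total
        updated = {name: level * factor for name, level in updated.items()}
    return updated

def cycles_to_threshold(stored, increments, target, capacity,
                        threshold=1.0, decay=0.95, max_cycles=100):
    """Count the cycles until the target element reaches the activation threshold."""
    activations = dict(stored)
    for cycle in range(1, max_cycles + 1):
        activations = step(activations, increments, capacity, decay)
        if activations.get(target, 0.0) >= threshold:
            return cycle
    return None  # threshold not reached within max_cycles

# Hypothetical memory state: two elements already held, one new element being built up.
stored = {"stored_np": 1.5, "stored_clause": 1.5}
increments = {"target": 0.5}

high_span_cycles = cycles_to_threshold(stored, increments, "target", capacity=10.0)
low_span_cycles = cycles_to_threshold(stored, increments, "target", capacity=2.0)
print(high_span_cycles, low_span_cycles)
```

With these toy values the high-capacity run needs three cycles and the low-capacity run four. In the low-capacity run the stored elements are also pushed below the threshold, which mirrors the assumed trade-off between storage and processing.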


[1] The term “complexity“ is used in this sense throughout the paper.

[2] In this paper, complexity will always be referred to as syntactic complexity.

[3] Ungrammatical sentences will be marked with a *, sentences that are grammatical but unacceptable will be marked with a #.

[4] Although the strict distinction between grammatical and ungrammatical is barely maintained today, it will be used in this paper; however, no argument depends on that.

[5] This will be the complexity measure for the following models as well.

[6] If we assume again that complexity is measured by the categories that have to be kept track of at a certain point in the sentence.

[7] As mentioned above, in the DLT version M(n) is no longer a monotone increasing function; the distance plays no role, memory (that is, storage) costs stay the same until the item can be integrated.

[8] Chapter 5 will give a detailed description of their experiment.

[9] Note that this principle is not taken absolutely here: Otherwise, a sentence like “Fred sleeps.“ should be rather difficult.

[10] These co-indexed structures are token-identical.

[11] We have already seen in Chapter 2 how Gibson’s SPLT, which was developed a few years later, explains the greater complexity of the ORC.


Institution / College
University of Freiburg – Germanistik