Current parsing techniques - an overview

Term Paper (Advanced Seminar), 2005, 19 pages

English Language and Literature Studies - Linguistics



1 Introduction
1.1 Introduction
1.2 Definitions
1.3 Why parse?

2 Input / data
2.1 What can be parsed?
2.1.1 Texts
2.1.2 Spoken language
2.2 Tagging
2.2.1 Tagging techniques
2.2.2 Output

3 Grammar
3.1 Representing syntax
3.1.1 Dependency Syntax
3.1.2 Constituent Structure Syntax
3.2 Grammar Types
3.2.1 Context Free Grammars
3.2.2 Probabilistic Context Free Grammars

4 Parsing
4.1 Algorithms
4.1.1 Direction of processing
4.1.2 Direction of analysis
4.1.3 Search strategy
4.1.4 Backtracking vs. Chart parsing
4.2 Approaches to parsing
4.2.1 Robust parsing
4.2.2 Shallow parsing
4.2.3 Integrative vs. sequential architectures
4.2.4 Probabilistic approaches
4.3 Ambiguity
4.3.1 Types of ambiguity
4.3.2 Disambiguation techniques
4.4 Evaluating parsing systems
4.4.1 Coverage

5 Conclusion

6 References

7 Appendix
7.1 Acronyms used

1 Introduction

1.1 Introduction

This paper intends to provide a brief introduction to the current techniques used in syntactic parsing. It presents different techniques used in representing grammars, conducting searches and resolving ambiguities. Different approaches to robust parsing are presented, and a brief look is taken at the evaluation of parser performance.

Although sample projects are mentioned throughout the text, they will not be presented in full due to limitations of time and space.

1.2 Definitions

In this work, the definition of parsing given by Carroll will be used, namely "using a grammar to assign a (more or less detailed) syntactical analysis to a string of words, a lattice of word hypotheses output by a speech recognizer or similar" (Carroll 2003: 233). The process of annotating a text with lexical information, while sometimes referred to as parsing as well, has come to be called tagging by most people.

There is a difference between a parser, which annotates a sentence with syntactic information, and a recognizer, which merely decides whether or not a sentence (or part of it) is grammatical according to a given grammar.

1.3 Why parse?

Methods for automatic syntactic analysis are becoming increasingly important in a number of areas. These range from natural language interfaces (NLIs), both text- and speech-based, which can be used where a graphical user interface (GUI) would be impractical or too restricted in its functions (e.g. public information systems or database querying), to applications where very large amounts of data have to be searched (with responses evaluated and ranked for relevance) or where information is to be extracted from texts. Another area where syntactic analysis is quite useful is machine translation: by extending the purely lexical approach to include a syntactic analysis, the quality of machine-generated translations can be improved dramatically. Programs for the automatic correction of spelling or punctuation profit from syntactic information as well (Langer 2001: 203).

Finally, syntactic analysis is a prerequisite step if texts are to be analysed semantically, which is necessary for many of the applications discussed above.
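The difference between a recognizer and a parser noted in the definitions above can be sketched with a toy grammar. This is a purely hypothetical illustration; the grammar and function names are invented for this sketch and do not come from any particular system:

```python
# Toy grammar: S -> NP VP, NP -> "she" | "the cat", VP -> "sleeps"
# A recognizer answers yes/no; a parser additionally returns a structure.

GRAMMAR_NPS = {("she",), ("the", "cat")}
GRAMMAR_VPS = {("sleeps",)}

def recognize(words):
    """Return True iff the word sequence is an S (an NP followed by a VP)."""
    for i in range(1, len(words)):
        if tuple(words[:i]) in GRAMMAR_NPS and tuple(words[i:]) in GRAMMAR_VPS:
            return True
    return False

def parse(words):
    """Return a bracketed tree for the sentence, or None if ungrammatical."""
    for i in range(1, len(words)):
        np, vp = tuple(words[:i]), tuple(words[i:])
        if np in GRAMMAR_NPS and vp in GRAMMAR_VPS:
            return ("S", ("NP",) + np, ("VP",) + vp)
    return None
```

Both functions accept exactly the same sentences; only the parser makes the syntactic analysis available to later processing stages.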

2 Input / data

2.1 What can be parsed?

Following the definition of parsing given in the introduction, both natural languages (e.g. English) and artificial languages (e.g. C++) can be parsed syntactically. In the following, I will consider only the parsing of natural language, but most of the techniques and algorithms can be (and are) used for the processing of artificial languages as well.

Both the written and the spoken form of language are parsed, but each form presents its own difficulties and there are differences in processing them.

2.1.1 Texts

Texts come in both electronic and handwritten form, the digital input usually being less problematic. Problems with digital input include typos, character encodings and general questions of document formats. Processing of handwritten language is similar to that of spoken language, since the first step in both cases is to convert the input into machine-processable symbols. As with spoken language, syntactic analysis can help to disambiguate input hypotheses.

2.1.2 Spoken language

Parsing spoken language is more problematic than processing text. This is due to the fact that the first step in the syntactic analysis of speech is to determine what has actually been said, which involves dealing with all the known problems of speech recognition (ungrammaticality of input, problems of register, self-correction, false starts, restarts, poor signal quality etc.). When processing spoken language, syntactic analysis can help to resolve ambiguities that arise during the formulation of word hypotheses.

The most notable example of a system which parses spoken language is the Verbmobil project for translating spontaneous speech, conducted by a number of German companies and academic partners. [INT 1]

2.2 Tagging

Tagging, or part-of-speech tagging, is the annotation of texts with lexical information (e.g. noun, verb, symbol etc.). Most parsers either rely on pre-tagged input or perform the lexical analysis themselves before proceeding to the syntactic analysis.

2.2.1 Tagging techniques

There are two main approaches to tagging texts, rule-based and stochastic tagging. Both kinds of taggers use a lexicon, which contains the most frequent words and their possible parts-of-speech. Whenever a word cannot be tagged unambiguously or is missing from the lexicon, a rule-based tagger relies on some kind of grammar to determine the correct tag. In that case, the tagger's quality is directly influenced by the accuracy and completeness of the rules. A stochastic tagger bases its decisions on the probability of a word class occurring in context with the word classes of its neighbours. These probabilities are computed by analysing a corpus of (hand-) tagged texts. Good stochastic taggers can reach accuracy rates between 90 and 97% (Evert, Fitschen 2001: 374).
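The stochastic approach can be sketched in its simplest (unigram) form: for each word, choose the tag it received most often in the training corpus. This is a minimal illustration under the assumption of a tiny hand-tagged corpus; the corpus, tagset and function names are invented for the sketch, and real stochastic taggers also condition on neighbouring word classes:

```python
from collections import Counter, defaultdict

# A tiny hand-tagged "corpus" (illustrative only; real taggers are
# trained on large corpora such as hand-tagged newspaper text).
TRAINING = [("the", "DET"), ("dog", "N"), ("barks", "V"),
            ("the", "DET"), ("cat", "N"), ("sleeps", "V"),
            ("sleeps", "V"), ("bark", "N")]

def train(tagged_words):
    """Build a lexicon mapping each word to its most frequent tag."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag(sentence, lexicon, default="N"):
    """Tag each word; unknown words fall back to a default tag.

    A rule-based tagger would consult grammar rules at this point
    instead of guessing the most common open word class.
    """
    return [(w, lexicon.get(w, default)) for w in sentence]
```

Even this crude model resolves ambiguous words such as "bark" by frequency alone; the accuracy figures cited above are reached only by taggers that also take the surrounding context into account.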

2.2.2 Output

The kind of output produced by a tagger depends on the purpose of the process. The precision of the information returned by the tagger varies with the tagset (i.e. the number of distinct categories): the smaller the tagset, the lower the potential for errors or ambiguities. The output can range from a short list giving all information in textual form to an array of data passed on to the next processing system, the parser. So-called combined models do not strictly separate the processes of tagging and parsing, but use a parser to resolve lexical ambiguities during the tagging process.
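The two ends of that output range can be sketched as follows, assuming a common word/TAG textual convention and a list of (word, tag) pairs as the "array" handed to the parser (the function names are illustrative, not from any particular toolkit):

```python
def to_text(tagged):
    """Render (word, tag) pairs in the common word/TAG textual format."""
    return " ".join(f"{w}/{t}" for w, t in tagged)

def from_text(line):
    """Parse word/TAG text back into (word, tag) pairs for a parser."""
    return [tuple(tok.rsplit("/", 1)) for tok in line.split()]
```

A pipeline architecture would pass the structured pairs directly to the parser, while the textual form is what a human reader, or a loosely coupled sequential system, would see.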



Institution: University of Marburg
Course: Current Human Language Technologies