What is computational linguistics?
1.1 The objectives of computational linguistics
Computational linguistics is the study of computer systems for understanding
and generating natural language. In this volume we shall be particularly
interested in the structure of such systems, and the design of algorithms for the
various components of such systems.
Why should we be interested in such systems? Although the objectives of
research in computational linguistics are widely varied, a primary motivation
has always been the development of specific practical systems which involve
natural language. Three classes of applications which have been central in the
development of computational linguistics are
Machine translation. Work on machine translation began in the late
1950s with high hopes and little realization of the difficulties
involved. Problems in machine translation stimulated work in both
linguistics and computational linguistics, including some of the
earliest parsers. Extensive work was done in the early 1960s, but a
lack of success, and in particular a realization that fully-automatic
high-quality translation would not be possible without fundamental
work on text 'understanding', led to a cutback in funding. Only a
few of the current projects in computational linguistics in the
United States are addressed toward machine translation, although
there are substantial projects in Europe and Japan (Slocum 1984,
1985; Tucker 1984).
Information retrieval. Because so much of the information we use
appears in natural language form - books, journals, reports -
another application in which interest developed was automatic
information retrieval from natural language texts. In response to a
query, the system was to extract the relevant text from a corpus and
either display the text or use the text to answer the query directly.
Because the texts in most domains of interest (particularly technical
and scientific reports) are quite complex, there was little immediate
success in this area, but it led to research in knowledge representation.
Automatic information retrieval is now being pursued by a
few research groups (Sager 1978; Hirschman and Sager 1982;
Montgomery 1983).
Man-machine interfaces. Natural language seems the most convenient
mode for communication with interactive systems (such as
data base retrieval and command language applications), particularly
for people other than computer specialists. It has several
advantages over the first two application areas as a test for natural
language interfaces. First, the input to such systems is typically
simpler (both syntactically and semantically) than the texts to be
processed for machine translation or information retrieval. Second,
the interactive nature of the application allows the system to
be useable even if it occasionally rejects an input ('please rephrase',
'what does ... mean?'). As a result, a greater measure of success has
been obtained here than in other applications. We are reaching the
point where such systems are being used for real (albeit simple)
applications, and not just for demonstrations. Most computational
linguistics work since the early 1970s has involved interactive
interfaces.
In addition to these 'engineering', applications-oriented motives for work in
computational linguistics, most investigators have some 'scientific' research
objectives which are independent of any particular application. One natural
function for computational linguistics would be the testing of grammars
proposed by theoretical linguists. Because of the complex interactions possible
in transformational grammars, it would be desirable to use the computer to
verify that a proposed set of rules actually works. At least one such system has
been described - Friedman's Transformational Grammar Tester (Friedman
1971). This system generated sentences in accordance with a proposed
transformational grammar, so that linguists could verify that their grammars
did in fact generate only grammatical sentences. However, much of the formal
framework of linguistic theory (the nature of movement rules, the constraints
on transformations, the form of semantic interpretation rules) is being
questioned, and the emphasis in theoretical linguistics is not on the building of
substantial grammars for which computerized testing would be suitable. As a
result, there has been little use of computers as a test vehicle for linguistic
theories.
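The idea behind such a grammar tester can be illustrated with a minimal sketch. The toy grammar below is context-free rather than transformational, and its rules, the names GRAMMAR and generate, and the restriction to non-recursive rules are all assumptions made for illustration; it does not reproduce Friedman's system. It enumerates every sentence the rules generate, so that the output can be inspected for ungrammatical strings.

```python
from itertools import product

# Hypothetical toy grammar: each nonterminal maps to a list of alternatives,
# and each alternative is a sequence of symbols. Words absent from the
# dictionary are treated as terminals.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["the", "N"]],
    "VP": [["V"], ["V", "NP"]],
    "N": [["horse"], ["barn"]],
    "V": [["sleeps"], ["sees"]],
}


def generate(symbol, grammar=GRAMMAR):
    """Return every word sequence derivable from symbol.

    The grammar must be non-recursive, otherwise this never terminates.
    """
    if symbol not in grammar:  # terminal symbol: a word
        return [[symbol]]
    results = []
    for alternative in grammar[symbol]:
        # Expand each symbol of the alternative, then combine the expansions.
        expansions = [generate(s, grammar) for s in alternative]
        for combo in product(*expansions):
            results.append([word for part in combo for word in part])
    return results


sentences = [" ".join(words) for words in generate("S")]
for sentence in sentences:
    print(sentence)
```

The output of this sketch includes 'the horse sleeps the barn': the toy rules overgenerate because they do not distinguish transitive from intransitive verbs, which is the kind of defect a tester of this sort is meant to expose.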
On the other hand, the need to develop complete 'understanding' systems
has forced computational linguists to develop areas of research which had
been inadequately explored by the traditional sciences. Two of these areas are
Procedural models of the psychological processes of language understanding.
While traditional linguists have sought to focus on
particular aspects of language, such as grammaticality, some
computational linguists have tried to look on the understanding
process as a whole. They have tried to model these processes, as yet
very crudely, and tried to mimic some aspects of human performance.
An example of this is Marcus's parser (Marcus 1980), which
was designed to mimic human performance on 'garden path'
sentences (a 'garden path' sentence is one where people get stuck
and have to retrace their steps in analyzing a sentence, such as 'The
horse raced past the barn fell.'). These efforts, together with those
of psychologists and other researchers, have led to the creation of a
new subfield, cognitive science.
Representation of knowledge. The recognition that language processors
must make reference to large amounts of 'real-world knowledge'
and the need to translate natural language into some formal
representation to facilitate such operations as search and
inferencing have led computational linguists to study the problems
of knowledge representation. Many general suggestions for structuring
information - frames, scripts, information formats - have
developed since the early 1970s; some of these will be discussed in
the chapter on discourse analysis.
Engineering and scientific objectives, of course, usually go hand in hand.
The needs of practical systems may lead to research in and better understanding
of linguistic processes, which in turn produces better natural language
systems. In some areas, such as syntactic analysis, a distinction can be made
between systems oriented towards psychological modeling and those designed
for a particular application. In other areas, however, which have been less
intensively studied, there is as yet no clear division between psychologically motivated
and applications-oriented approaches. To the extent that a division
can be made, we shall emphasize applications-oriented approaches.
1.2 Computational and theoretical linguistics
Although both are ultimately concerned with understanding linguistic
processes, computational and theoretical linguists have rather different
approaches and outlooks. Computational linguists have been concerned with
developing procedures for handling a useful range of natural language input.
They are (in general) willing to accept approximate solutions which cover most
sentences of interest, and put up with a system which fails on a few peculiar
inputs. The requirement of constructing complete, working systems has led
them to seek an understanding of the entire process of natural language
comprehension and generation.
Theoretical linguists, in contrast, have focused primarily on one aspect of
language performance, grammatical competence - how people come to accept
some sentences as grammatical and reject others as ungrammatical. They are
concerned with language universals - principles of grammar which apply to all
natural languages - and are interested in finding the simplest, computationally
most restricted theory of grammar which can account for natural language.
They hope thereby to gain some insight into the innate language mechanisms
which enable people to learn and use languages so readily. In their efforts to
evaluate alternative theories, they are often led to study peculiar sentences
which some computational linguists would regard as pathological.
Despite these differences in outlook, theoretical linguistics can provide
valuable input to computational linguists, an input which is too often ignored.
Questions of grammaticality are important, because experience has shown
that a grammatical constraint which in one case determines whether a sentence
is or is not acceptable will in other cases be needed to choose between correct
and incorrect analyses of a sentence. The relations between sets of sentences,
which are a prime focus of transformational grammar, particularly in the
Harrisian framework, are essential to language analysis procedures, since they
enable a large variety of sentences to be reduced to a relatively small number of
structures. Formal rules of semantic interpretation, studied by Montague and
his disciples and increasingly by other linguists, are also beginning to make a
significant contribution to computational linguistics.
On the other hand, one should not assume that a 'solution' in an area of
theoretical linguistics (e.g., a formal, concise grammar of English) is per se a
solution to the corresponding problem of computational linguistics. As we
shall see in our discussion of early transformational parsers, direct implementations
of simple theories do not always lead to effective analysis procedures.
As in many areas of science, considerable effort may be required to translate an
elegant formal theory into a computable one.
1.3 Computational linguistics as engineering
Constructing a fluent, robust natural language interface is a difficult and
complex task. Perhaps as our understanding of the language faculty improves,
we will be able to construct simpler natural language systems. For the present,
however, much of the challenge of building such a system lies in integrating
many different types of knowledge - syntactic knowledge, semantic knowledge,
knowledge of the domain of discourse - and using them effectively in
language processing. In this respect, the building of natural language systems -
like other large computer systems - is a major task of engineering.
As with other system building tasks, there are certain general techniques we
can use to make our job easier. One of these is modularity: dividing our system's
knowledge into relatively independent components. Dividing the problem
allows us to attack the subproblems independently (or nearly so), so that we
are not overwhelmed by the task before us. If the modules are carefully
designed, we may find that the division reduces not just the size of the
individual components but also the size of the total system (Grishman 1980).
Another technique for simplifying complex systems is the use of formal
models. Large programs are difficult to design, modify, or understand. Our
odds of developing a successful program are much increased if we can create a
relatively simple abstract model and then develop our system as an implementation
of that model. The use of a simple model will also increase the chances
that our work will be understood by our colleagues, so that we can contribute
to the development of the field as a whole.
As our exposition of computational linguistics proceeds, we shall return to
these issues from time to time, considering how alternative approaches impact
the task of system design.
1.4 The structure of this survey - a tree diagram
[Tree diagram: computational linguistics divides into language analysis and language generation.]
Analysis and generation
Most natural language systems require the ability to both analyze and generate
language. Analysis has been regarded as the more crucial component for
practical systems. If a natural language system is to be successful, it must be
able to recognize many paraphrases for the same command or information; on
the other hand, it is sufficient if it is able to generate any one of these forms. We
shall therefore devote most of our time to language analysis. However, as we
shall see, there may be substantial symmetry between the analysis and
generation procedures.
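The asymmetry between analysis and generation can be made concrete with a small sketch. Everything in it is an assumption for illustration: the paraphrase patterns, the LIST command, and the functions analyze and generate are hypothetical, and pattern matching stands in for real language analysis. The analyzer accepts several wordings of one request, while the generator needs only a single canonical wording.

```python
import re

# Hypothetical paraphrase patterns, all mapping to one internal command.
PATTERNS = [
    re.compile(r"^(?:please )?(?:list|show|display) (?:me )?(?:all )?(?:the )?(\w+)$"),
    re.compile(r"^what are the (\w+)$"),
    re.compile(r"^which (\w+) are there$"),
]


def analyze(text):
    """Map any recognized paraphrase to a (command, argument) pair, or None."""
    normalized = text.lower().strip().rstrip("?.!")
    for pattern in PATTERNS:
        match = pattern.match(normalized)
        if match:
            return ("LIST", match.group(1))
    return None


def generate(command):
    """Produce one canonical wording for a command."""
    name, argument = command
    if name == "LIST":
        return f"List the {argument}."
    raise ValueError(f"unknown command: {name}")


for request in ["Please show me all the ships", "What are the ships?",
                "Which ships are there?"]:
    print(request, "=>", analyze(request))
print(generate(("LIST", "ships")))
```

In this sketch the generated sentence is itself one of the forms the analyzer accepts, a very small instance of the symmetry between the two procedures.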
[Tree diagram: language analysis divides into sentence analysis and discourse and dialog structure.]
Sentence and discourse analysis
Much more is known about the processing of individual sentences than about
the determination of discourse structure, and presumably any analysis of
discourse structure presupposes an analysis of the meaning of individual
sentences. Furthermore, for many simple applications an analysis of discourse
or dialog structure is not essential (even when references are made to earlier
sentences in the discourse, they can often be understood without a thorough
analysis of the discourse structure). As a result, we shall concentrate first on the
processing of individual sentences, and follow this with a less detailed study of
discourse and dialog.
[Tree diagram: sentence analysis divides into syntax analysis and semantic analysis.]
Syntax and semantic analysis
The overall objective of sentence analysis is to determine what a sentence
'means'. In practice, this involves translating the natural language input into a
language with a simple semantics (e.g., a formal logic) or into a language which
can be interpreted by an existing computer system (e.g., a data base retrieval
command language). In most systems, the first stage of this translation is
syntax analysis - the determination (and possible regularization) of the
sentence structure. This stage was also historically the first to be developed by
computational linguists. We will therefore begin our survey with an examination
of syntax analysis.
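The two-stage translation described above can be sketched as follows. This is a deliberately tiny illustration, not a method from this survey: the lexicon, the single sentence pattern (determiner, noun, verb), the tree encoding as nested tuples, and the functions parse and translate are all assumptions. The first stage determines the sentence structure; the second maps that structure into a formula of a simple logic.

```python
# Hypothetical toy lexicon; the words and categories are illustrative only.
LEXICON = {
    "every": "DET", "a": "DET",
    "ship": "N", "sailor": "N",
    "sails": "V", "sleeps": "V",
}


def parse(sentence):
    """Stage 1, syntax analysis: build a tree for sentences of the form DET N V."""
    words = sentence.lower().rstrip(".").split()
    categories = [LEXICON.get(word) for word in words]
    if categories != ["DET", "N", "V"]:
        raise ValueError(f"cannot parse: {sentence!r}")
    det, noun, verb = words
    return ("S", ("NP", ("DET", det), ("N", noun)), ("VP", ("V", verb)))


def translate(tree):
    """Stage 2: map the syntax tree to a logical formula, written as a string."""
    _, (_, (_, det), (_, noun)), (_, (_, verb)) = tree
    if det == "every":
        return f"forall x ({noun}(x) -> {verb}(x))"
    return f"exists x ({noun}(x) & {verb}(x))"


print(translate(parse("Every ship sails.")))
print(translate(parse("A sailor sleeps.")))
```

Keeping the two stages separate means the translation rules refer only to the tree, not to the word string, so either stage could be replaced without rewriting the other.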