I got recently interested in the mathematical approach to linguistics. My interest sparkled from the desire to play with MCMC methods (Markov Chain Monte Carlo) over something relevant, and computational linguistics is one of the possible fields that profit from those methods. I started studying a few things and will condense some of them in here.
First of all, even before specifying what a grammar is I think it’s useful to state what a grammar is for. A grammar is a synthetic description of the admissible sentences in a given language: it tells whether to accept a sentence or not. It doesn’t say anything about the specific content of the sentence, but only cares about the structure. It’s synthetic in that it is smaller than the language itself, or of a list of all possible sentences, so to say.
Of course a list of all the acceptable sentences in any non finite language (such are the human languages for example) would be impossible to produce and mostly irrelevant since we wouldn’t learn anything from it. When dealing with countable sets one very basic (and vague) question is if these sets are enumerable in some way. That means: is there some concise algorithm with a finite number of steps that allows to produce every admissible element of the list, given enough time? That’s what a grammar of, say, German tries to do: it tells you how phrases are built in german, that the verb occupies the second relevant position in the sentence, that adjectives are conjugated, etc. After reading a german grammar you are (well suppose you’re very good at this) able to tell if a given sentence is correct in german or not. But you are also able to produce phrases: you learned how to write german. So that there are two approaches to test if a sentence is properly formed: one can go backward and break it down into pieces, checking at every step that the grammar is not violated anywhere. The other method is to go forward, i.e. start from an empty sentence and using the rules of grammar generate all the possible sentences until we recognize the one we’ve been given (which is finite, so at a certain point if we haven’t found it, then it’s not acceptable). If it’s a sentence we could produce, then it’s acceptable.
Anyway, books of grammars fail to be acceptable algorithms, because they have too many exceptions and fuzzy notions. Also, a whole book is not really so small for us to call it concise.
It turns out that asking for an algorithm is too vague to have an answer. Some computational machines are more powerful than others, and some languages that can’t be enumerated by a machine could be enumerable by a smarter machine (I use the adjective “smarter” just to scare the humanities). Formal grammar is about the amount of computational power you need to enumerate a language, related to how complex the language is.
Enough with the gibberish.
Definition 1 A grammar is a quadruple , where
- is a set called set of Non-Terminal symbols (we use capital Latin letters for them);
- is a set called set of Terminal symbols, which is disjoint from ;
- is a privileged element of , called start symbol;
- is a finite set of substitution/production rules, each of the form
where is the Kleene star. In other words, is subset of
Let us remind the reader that the Kleene star is the set-theoretic operator that applied to a set outputs the (infinite) set of all finite strings with alphabet the elements of the original set. Thus , and denotes the empty string.
We describe how to read substitution rules and how to apply them. Take a string in , that is a string formed by nonterminal and/or terminal symbols mixed. If is a rule in , where are strings and is a non terminal symbol, then if contains the string we can replace the in with the string . Thus becomes after applying the rule.
We said what a grammar is, but we didn’t say anything about languages. We adopt here a broad definition of language:
Definition 2 A language associated to a grammar is a subset of the subset of given by all the strings that can be produced starting from and applying rules until all the nonterminal symbols have disappeared.
So we accept as languages even those who discard certain strings without apparent reason. That spares us questions like “is this the minimal set of rules that generates ”, which we’ll deal with separately.
We want to be a little more rigorous, so we introduce a binary relation on strings of . This relation is written as (or abbreviated as ) and is read as “produces in one step in ”, and it’s defined by
that is we can get by applying rule to . From we can define , which is read “produces in a finite number of steps in ” and means
Then the set
is called the set of sentential forms of , and its subset
is the maximal language we defined.
Now, the cornerstone of the subject was put in place by Noam Chomsky, as almost everybody knows, and it’s a hierarchical classification of the grammars. Let’s start with an arbitrary – for now – subdivision:
Definition 3 The grammars are subdivided in the following decreasing levels:
- Type 0 grammars: any grammar safistying the definition given above;
- Type 1 grammars, or context-sensitive grammars: grammars with rules of the form
where , and :
- Type 2 grammars, or context-free grammars: grammars with rules of the form
where and ;
- Type 3 grammars, or right-regular grammars: grammars with rules of the form
where , .
We notice here that some authors restrict context-sensitive grammars to those where are strings of terminal elements. Also, of course Type 3 grammars also include the mirror image of right-regular grammars, that is left-regular grammars, defined in the obvious way.
Of course the subdivision is not based on Chomsky’s personal taste but involves the computational power required to produce those languages, and results in a beautiful theory with many insights into complexity theory, the power of memory and so on.
Let’s see some examples of grammars from Type 1 to Type 3. First of all, the following is a regular grammar
with terminal symbols , nonterminal symbols , start symbol . That the rules are those of a right-regular grammar can be seen directly. What are the sentences produces by this grammar? notice there are two rules for the substitution of , which means that whenever we have an we can choose. The process of generating sentences involves choices, which makes it non-deterministic. An example of sentence can be
where we applied the first rule to (it was the only possibility), then we chose to apply second rule to the nonterminal , which produced a further , and we chose to apply second rule again to this new , and finally used third rule to substitute with the terminal , which ends the process because there are no more occurrences of nonterminal symbols.
The typical way to see the structure of the process is to draw trees. The root of the tree is the start symbol , and every time we apply a substitution rule to a nonterminal symbols we add leaves to that symbols, one for each new symbol and in the correct order. Thus the tree for the previous example of sentence generation is
Notice that because of the nature of rules of regular grammars the tree happens to be a binary tree, but for more complex grammars this isn’t true anymore.
Another sentence is
where we iterated the second rule one more time. It’s easy to see that all the possible sentences produced by this grammar are of the form , which we shorten as . One sees clearly that sentences always start with because of the first rule, they always end with because the third rule is the only one that doesn’t produce new nonterminal symbols, and second rule produces the same symbol it applies to, allowing for recursion.
Before moving to Type 2 grammars, observe how the following right-regular grammar
is ill-posed, since there are no rules that allow elimination of , and therefore a sentential form containing never generates a sentence.
An example of a context-free grammar is
which generates trees like
The sentence is read from left to right and is therefore . It’s not hard to see that the sentences generated by this grammar are those of the form
where is a string of consecutive -s and means or . Notice the amount of -s on the two sides are the same.
This one is another context-free grammar that produces sentences in with an even number of -s in them:
The sentence is (the is the empty string and is thus ignored). We changed style of tree just to see the resulting sentence better.
Finally, this is a context-sensitive grammar, or Type 3 grammar:
Thus this is an example of sentence
and this is another
Notice that because of the fifth rule we can have all sentences but we can’t have say. The grammar is making a distinction. Anyway, everytime we have things like , the sentence is accepted as well.
Enough for today.