# Generative grammars I

I recently got interested in the mathematical approach to linguistics. My interest was sparked by the desire to play with MCMC (Markov chain Monte Carlo) methods on something relevant, and computational linguistics is one of the fields that profit from those methods. I started studying a few things and will condense some of them here.

First of all, even before specifying what a grammar is, I think it’s useful to state what a grammar is for. A grammar is a synthetic description of the admissible sentences in a given language: it tells us whether to accept a sentence or not. It doesn’t say anything about the specific content of a sentence, only about its structure. It’s synthetic in that it is smaller than the language itself, that is, smaller than a list of all its possible sentences.

Of course a list of all the acceptable sentences in any infinite language (such as the human languages, for example) would be impossible to produce, and mostly irrelevant, since we wouldn’t learn anything from it. When dealing with countable sets, one very basic (and vague) question is whether these sets are enumerable in some way. That is: is there some concise algorithm, with a finite number of steps, that produces every admissible element of the list, given enough time? That’s what a grammar of, say, German tries to do: it tells you how phrases are built in German, that the verb occupies the second relevant position in the sentence, that adjectives are declined, and so on. After reading a German grammar you are (well, suppose you’re very good at this) able to tell whether a given sentence is correct German or not. But you are also able to produce phrases: you have learned how to write German. So there are two approaches to test whether a sentence is properly formed. One is to go backward and break it down into pieces, checking at every step that the grammar is not violated anywhere. The other is to go forward: start from an empty sentence and, using the rules of the grammar, generate all the possible sentences until we recognize the one we’ve been given (which is finite, so once we have generated everything up to its length and haven’t found it, it’s not acceptable). If it’s a sentence we could produce, then it’s acceptable.

Anyway, grammar books fail to be acceptable algorithms, because they have too many exceptions and fuzzy notions. Besides, a whole book is not really small enough for us to call it concise.

It turns out that asking for “an algorithm” is too vague a question to have an answer. Some computational machines are more powerful than others, and some languages that can’t be enumerated by one machine can be enumerated by a smarter machine (I use the adjective “smarter” just to scare the humanities). Formal grammar theory is about the amount of computational power you need to enumerate a language, which reflects how complex the language is.

Enough with the gibberish.

Definition 1 A grammar ${G}$ is a quadruple ${G=(V_N, V_T, S, R)}$, where

• ${V_N}$ is a set called set of Non-Terminal symbols (we use capital Latin letters for them);
• ${V_T}$ is a set called set of Terminal symbols, which is disjoint from ${V_N}$;
• ${S}$ is a privileged element of ${V_N}$, called start symbol;
• ${R}$ is a finite set of substitution/production rules, each of the form

$\displaystyle \left(V_T \cup V_N\right)^{\ast} V_N \left(V_T \cup V_N\right)^{\ast} \rightarrow \left(V_T \cup V_N\right)^{\ast},$

where ${\ast}$ is the Kleene star. In other words, ${R}$ is a finite subset of

$\displaystyle \left(V_T \cup V_N\right)^{\ast}\times V_N \times \left(V_T \cup V_N\right)^{\ast} \times \left(V_T \cup V_N\right)^{\ast}.$

Let us remind the reader that the Kleene star is the set-theoretic operator that, applied to a set, outputs the (infinite) set of all finite strings over the alphabet given by the elements of the original set. Thus ${\{a,b\}^\ast = \{\epsilon, a,b,aa,ab,ba,bb,aaa,aab,aba,\ldots\}}$, where ${\epsilon}$ denotes the empty string.
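For concreteness, the Kleene star can be enumerated by listing strings in order of increasing length. A small Python sketch (the function name `kleene_star` is mine, not a standard one):

```python
from itertools import count, islice, product

def kleene_star(alphabet):
    """Yield every finite string over `alphabet`, shortest first,
    beginning with the empty string epsilon."""
    for n in count(0):                      # n = string length: 0, 1, 2, ...
        for letters in product(alphabet, repeat=n):
            yield "".join(letters)

# The first few elements of {a, b}*:
print(list(islice(kleene_star("ab"), 7)))
# -> ['', 'a', 'b', 'aa', 'ab', 'ba', 'bb']
```

The generator never terminates, matching the fact that the Kleene star of a nonempty set is infinite, yet it reaches any given string after finitely many steps: exactly the enumerability discussed above.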

We describe how to read substitution rules and how to apply them. Take a string ${s}$ in ${\left(V_T \cup V_N\right)^{\ast}}$, that is, a string of terminal and/or nonterminal symbols mixed together. If ${(\alpha, A, \beta, \gamma)}$ is a rule in ${R}$, where ${\alpha, \beta,\gamma}$ are strings and ${A}$ is a nonterminal symbol, then whenever ${s}$ contains the substring ${\alpha A \beta}$ we can replace that occurrence of ${A}$ with the string ${\gamma}$. Thus ${s=\ldots \alpha A \beta\ldots}$ becomes ${\ldots \alpha\gamma\beta\ldots}$ after applying the rule.
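Since a rule ${\alpha A \beta \rightarrow \alpha\gamma\beta}$ just rewrites the flat string ${\alpha A \beta}$ into ${\alpha\gamma\beta}$, applying it amounts to substring replacement. A minimal sketch in Python, treating a rule as a plain pair of strings `(lhs, rhs)`:

```python
def apply_rule(s, lhs, rhs):
    """Return every string obtained from s by rewriting exactly one
    occurrence of lhs into rhs (one result per occurrence)."""
    results = []
    start = 0
    while True:
        i = s.find(lhs, start)
        if i == -1:
            return results
        results.append(s[:i] + rhs + s[i + len(lhs):])
        start = i + 1          # keep scanning for further occurrences

# Applying the rule A -> c to aAbA, which contains A twice:
print(apply_rule("aAbA", "A", "c"))   # -> ['acbA', 'aAbc']
```

Returning all the one-step rewrites, rather than one, is deliberate: as we will see, the generation process is non-deterministic.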

We said what a grammar is, but we didn’t say anything about languages. We adopt here a broad definition of language:

Definition 2 A language ${L}$ associated to a grammar ${G}$ is a subset of the set of strings in ${V_T^{\ast}}$ that can be produced starting from ${S}$ and applying rules from ${R}$ until all the nonterminal symbols have disappeared.

So we accept as languages even those that discard certain strings without apparent reason. That spares us questions like “is this the minimal set of rules that generates ${L}$?”, which we’ll deal with separately.

We want to be a little more rigorous, so we introduce a binary relation on strings of ${\left(V_T \cup V_N\right)^{\ast}}$. This relation is written as ${\vdash_G}$ (or abbreviated as ${\vdash}$) and is read as “produces in one step in ${G}$”, and it’s defined by

$\displaystyle x \vdash y \;\;\Leftrightarrow\;\; \exists \alpha,\beta,\gamma,\delta \in \left(V_T \cup V_N\right)^{\ast}\,: (x=\alpha\gamma\beta)\wedge (y=\alpha\delta\beta)\wedge ((\gamma,\delta)\in R),$

that is we can get ${y}$ by applying rule ${\gamma\rightarrow \delta}$ to ${x}$. From ${\vdash_G}$ we can define ${\vdash_G^\ast}$, which is read “produces in a finite number of steps in ${G}$” and means

$\displaystyle x \vdash^\ast y \;\;\Leftrightarrow\;\; \exists n\in\mathbb{N}, z_1,\ldots,z_n\in \left(V_T \cup V_N\right)^{\ast}\,:\, x\vdash z_1\vdash\ldots \vdash z_n\vdash y.$

Then the set

$\displaystyle \{s\in \left(V_T \cup V_N\right)^{\ast} \,:\, S\vdash_G^\ast s\}$

is called the set of sentential forms of ${G}$, and its subset

$\displaystyle \{s\in V_T^{\ast} \,:\, S\vdash_G^\ast s\}$

is the maximal language we defined.
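The forward, generate-everything idea from the beginning of the post can now be sketched in a few lines. The toy implementation below assumes the convention, used in all the examples that follow, that uppercase letters are nonterminals and lowercase letters are terminals, and writes a rule as a pair of strings `(lhs, rhs)`; bounding the length of sentential forms keeps the search finite:

```python
from collections import deque

def sentences(rules, start="S", max_len=6):
    """Enumerate the sentences (strings without nonterminals) derivable
    from `start`, exploring sentential forms of length <= max_len."""
    seen = {start}
    queue = deque([start])
    out = set()
    while queue:
        s = queue.popleft()
        if not any(c.isupper() for c in s):
            out.add(s)                       # no nonterminals left: a sentence
            continue
        for lhs, rhs in rules:               # try every rule at every position
            i = s.find(lhs)
            while i != -1:
                t = s[:i] + rhs + s[i + len(lhs):]
                if len(t) <= max_len and t not in seen:
                    seen.add(t)
                    queue.append(t)
                i = s.find(lhs, i + 1)
    return out

# A tiny grammar we will meet again below: S -> aA, A -> bA, A -> c.
print(sorted(sentences([("S", "aA"), ("A", "bA"), ("A", "c")], max_len=5), key=len))
# -> ['ac', 'abc', 'abbc', 'abbbc']
```

The breadth-first queue visits shorter sentential forms first, so every sentence within the bound is eventually found.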

Now, the cornerstone of the subject was put in place by Noam Chomsky, as almost everybody knows, and it’s a hierarchical classification of grammars. Let’s start with a subdivision that is, for now, arbitrary:

Definition 3 The grammars are subdivided into the following decreasing levels:

• Type 0 grammars: any grammar satisfying the definition given above;
• Type 1 grammars, or context-sensitive grammars: grammars with rules of the form

$\displaystyle \alpha A \beta \rightarrow \alpha\gamma\beta$

where ${\alpha, \beta,\gamma \in \left(V_T \cup V_N\right)^{\ast}}$, ${A\in V_N}$ and ${\gamma \neq \epsilon}$;

• Type 2 grammars, or context-free grammars: grammars with rules of the form

$\displaystyle A\rightarrow \alpha,$

where ${\alpha\in \left(V_T \cup V_N\right)^{\ast}}$ and ${A\in V_N}$;

• Type 3 grammars, or right-regular grammars: grammars with rules of the form

$\displaystyle \begin{array}{rcl} && A\rightarrow aB \\ && A\rightarrow a, \end{array}$

where ${a\in V_T}$, ${A,B \in V_N}$.

We note here that some authors restrict context-sensitive grammars to those where ${\alpha,\beta}$ are strings of terminal symbols. Also, of course, Type 3 grammars include the mirror image of right-regular grammars as well, that is, left-regular grammars, defined in the obvious way.

Of course the subdivision is not based on Chomsky’s personal taste but involves the computational power required to produce those languages, and results in a beautiful theory with many insights into complexity theory, the power of memory and so on.
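As a rough illustration, classifying a single rule can be mechanized under the same uppercase-nonterminal convention. This is only a sketch, not a complete decision procedure: it ignores ${\epsilon}$-rules, sticks to right-regular rules for Type 3, and recognizes a Type 1 rule only when it is literally written in the ${\alpha A \beta \rightarrow \alpha\gamma\beta}$ shape:

```python
def rule_type(lhs, rhs):
    """Most restrictive Chomsky type (3, 2, 1 or 0) that a rule fits."""
    if len(lhs) == 1 and lhs.isupper():
        if len(rhs) == 1 and rhs.islower():
            return 3                              # A -> a
        if len(rhs) == 2 and rhs[0].islower() and rhs[1].isupper():
            return 3                              # A -> aB
        return 2                                  # A -> alpha (context-free)
    # Look for a split lhs = alpha A beta with rhs = alpha gamma beta, gamma != eps.
    for i, c in enumerate(lhs):
        if c.isupper():
            alpha, beta = lhs[:i], lhs[i + 1:]
            if (rhs.startswith(alpha) and rhs.endswith(beta)
                    and len(rhs) > len(alpha) + len(beta)):
                return 1                          # context-sensitive
    return 0                                      # unrestricted

def grammar_type(rules):
    """A grammar is only as restricted as its least restricted rule."""
    return min(rule_type(lhs, rhs) for lhs, rhs in rules)

print(grammar_type([("S", "aA"), ("A", "bA"), ("A", "c")]))  # -> 3
```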

Let’s see some examples of grammars from Type 1 to Type 3. First of all, the following is a regular grammar

$\displaystyle \begin{array}{rcl} && S \rightarrow aA \\ && A \rightarrow bA \\ && A \rightarrow c \end{array}$

with terminal symbols ${V_T=\{a,b,c\}}$, nonterminal symbols ${V_N=\{S,A\}}$, and start symbol ${S}$. That the rules are those of a right-regular grammar can be seen directly. What are the sentences produced by this grammar? Notice there are two rules for substituting ${A}$, which means that whenever we have an ${A}$ we can choose. The process of generating sentences involves choices, which makes it non-deterministic. An example derivation is

$\displaystyle S \rightarrow aA \rightarrow abA \rightarrow abbA \rightarrow abbc,$

where we applied the first rule to ${S}$ (the only possibility), then chose to apply the second rule to the nonterminal ${A}$, which produced a further ${A}$; we chose to apply the second rule again to this new ${A}$, and finally used the third rule to substitute ${A}$ with the terminal ${c}$, which ends the process because no occurrences of nonterminal symbols remain.

The typical way to see the structure of the process is to draw trees. The root of the tree is the start symbol ${S}$, and every time we apply a substitution rule to a nonterminal symbol we add leaves to that symbol, one for each new symbol and in the correct order. In the tree for the previous derivation, for instance, the root ${S}$ has children ${a}$ and ${A}$, that ${A}$ has children ${b}$ and ${A}$, and so on, down to a final leaf ${c}$.

Notice that, because of the shape of the rules of regular grammars, the tree happens to be binary, but for more complex grammars this is no longer true.

Another sentence is

$\displaystyle S \rightarrow aA \rightarrow abA \rightarrow abbA \rightarrow abbbA \rightarrow abbbc,$

where we iterated the second rule one more time. It’s easy to see that all the possible sentences produced by this grammar are of the form ${ac, abc, abbc, abbbc, abbbbc, \ldots}$, which we shorten as ${a b^\ast c}$. One sees clearly that sentences always start with ${a}$ because of the first rule, always end with ${c}$ because the third rule is the only one that doesn’t produce new nonterminal symbols, and the second rule produces the same symbol it applies to, allowing for recursion.
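Since generation involves choices, it is natural to sample them at random, which is a first step toward putting probability (and eventually MCMC) on grammars. A sketch for grammars whose left-hand sides are single nonterminals, assuming every nonterminal can eventually be eliminated (otherwise the loop may not terminate):

```python
import random

def derive(rules, start="S", rng=random):
    """Expand the leftmost nonterminal with a uniformly random applicable
    rule until no nonterminals (uppercase letters) remain."""
    s = start
    while any(c.isupper() for c in s):
        i = next(j for j, c in enumerate(s) if c.isupper())
        options = [rhs for lhs, rhs in rules if lhs == s[i]]
        s = s[:i] + rng.choice(options) + s[i + 1:]
    return s

rules = [("S", "aA"), ("A", "bA"), ("A", "c")]
print(derive(rules))   # e.g. 'abbc' -- always of the form a b* c
```

Each run makes the choices described above for us: every result starts with ${a}$, ends with ${c}$, and carries a random number of ${b}$-s in between.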

Before moving to Type 2 grammars, observe how the following right-regular grammar

$\displaystyle \begin{array}{rcl} && S \rightarrow aA \\ && A \rightarrow bA \\ && A \rightarrow cB \\ && A \rightarrow a \\ && B \rightarrow bB \end{array}$

is ill-posed: there is no rule that eliminates ${B}$, and therefore a sentential form containing ${B}$ never generates a sentence.

An example of a context-free grammar is

$\displaystyle \begin{array}{rcl} && S \rightarrow Aa \\ && A \rightarrow bAb \\ && A \rightarrow cBB \\ && B \rightarrow a \\ && B \rightarrow c, \end{array}$

which generates, for example, the derivation

$\displaystyle S \rightarrow Aa \rightarrow bAba \rightarrow bcBBba \rightarrow bcaBba \rightarrow bcacba.$

The resulting sentence is ${bcacba}$. It’s not hard to see that the sentences generated by this grammar are those of the form

$\displaystyle b^n c (a|c)(a|c) b^n a$

where ${b^n}$ is a string of ${n}$ consecutive ${b}$-s and ${(a|c)}$ means ${a}$ or ${c}$. Notice the number of ${b}$-s on the two sides is the same.
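Incidentally, the constraint that the two runs of ${b}$-s have equal length already puts this language beyond what a Type 3 grammar can generate. A pure regular expression cannot check it either, but Python’s `re` backreferences (a feature that goes beyond regular languages) can, so a quick membership test is:

```python
import re

# (b*) captures the first run of b-s; \1 demands the very same string again,
# which forces the two runs to have equal length.
PATTERN = re.compile(r"(b*)c[ac][ac]\1a")

def in_language(s):
    """Membership in { b^n c (a|c)(a|c) b^n a : n >= 0 }."""
    return PATTERN.fullmatch(s) is not None

print(in_language("bcacba"))    # -> True  (n = 1)
print(in_language("bcacbba"))   # -> False (mismatched runs of b-s)
```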

Here is another context-free grammar, one that produces sentences in ${\{a,b\}^\ast}$ with an even number of ${b}$-s in them:

$\displaystyle \begin{array}{rcl} && S \rightarrow A \\ && A \rightarrow ABABA \\ && A \rightarrow aA \\ && A \rightarrow \epsilon \\ && B \rightarrow b \end{array}$

for example

$\displaystyle S \rightarrow A \rightarrow ABABA \rightarrow BABA \rightarrow bABA \rightarrow baABA \rightarrow baaABA \rightarrow baaBA \rightarrow baabA \rightarrow baabaA \rightarrow baabaaA \rightarrow baabaa.$

The sentence is ${baabaa}$ (the steps using ${A\rightarrow\epsilon}$ simply erase an ${A}$, since ${\epsilon}$ is the empty string).

Finally, this is a context-sensitive grammar, or Type 1 grammar:

$\displaystyle \begin{array}{rcl} && S \rightarrow aA \\ && A \rightarrow aA \\ && A \rightarrow Ab \\ && A \rightarrow c \\ && aaAbb \rightarrow aadbb. \end{array}$

Thus this is an example of a sentence:

$\displaystyle S \rightarrow aA \rightarrow aAb \rightarrow acb,$

and this is another, where the last step uses the fifth rule:

$\displaystyle S \rightarrow aA \rightarrow aaA \rightarrow aaAb \rightarrow aaAbb \rightarrow aadbb.$

Notice that the sentences not involving ${d}$ are exactly those of the form ${a^m c b^n}$ with ${m\geq 1}$, while the fifth rule, the only one producing a ${d}$, requires at least two ${a}$-s before the ${d}$ and two ${b}$-s after it: we can have ${aadbb}$ but not ${adbb}$, say. The grammar is making a distinction based on the context. Anyway, every time we have a sentence like ${aaadbb}$, the sentence ${aaacbb}$ is accepted as well.
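These claims are easy to confirm mechanically by computing every string derivable within a length bound, using the same string-rewriting convention as before (uppercase = nonterminal, a rule as a pair of strings). A toy sketch, not an efficient parser:

```python
from collections import deque

# The context-sensitive example grammar above.
rules = [("S", "aA"), ("A", "aA"), ("A", "Ab"), ("A", "c"), ("aaAbb", "aadbb")]

def reachable(rules, start="S", max_len=6):
    """All strings (sentential forms included) derivable from `start`
    without ever exceeding max_len characters."""
    seen = {start}
    queue = deque([start])
    while queue:
        s = queue.popleft()
        for lhs, rhs in rules:
            i = s.find(lhs)
            while i != -1:
                t = s[:i] + rhs + s[i + len(lhs):]
                if len(t) <= max_len and t not in seen:
                    seen.add(t)
                    queue.append(t)
                i = s.find(lhs, i + 1)
    return seen

forms = reachable(rules)
print("aadbb" in forms, "adbb" in forms)   # -> True False
```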

Enough for today.