Generative grammars I

I got recently interested in the mathematical approach to linguistics. My interest sparkled from the desire to play with MCMC methods (Markov Chain Monte Carlo) over something relevant, and computational linguistics is one of the possible fields that profit from those methods. I started studying a few things and will condense some of them in here.

First of all, even before specifying what a grammar is I think it’s useful to state what a grammar is for. A grammar is a synthetic description of the admissible sentences in a given language: it tells whether to accept a sentence or not. It doesn’t say anything about the specific content of the sentence, but only cares about the structure. It’s synthetic in that it is smaller than the language itself, or of a list of all possible sentences, so to say.

Of course a list of all the acceptable sentences in any non finite language (such are the human languages for example) would be impossible to produce and mostly irrelevant since we wouldn’t learn anything from it. When dealing with countable sets one very basic (and vague) question is if these sets are enumerable in some way. That means: is there some concise algorithm with a finite number of steps that allows to produce every admissible element of the list, given enough time? That’s what a grammar of, say, German tries to do: it tells you how phrases are built in german, that the verb occupies the second relevant position in the sentence, that adjectives are conjugated, etc. After reading a german grammar you are (well suppose you’re very good at this) able to tell if a given sentence is correct in german or not. But you are also able to produce phrases: you learned how to write german. So that there are two approaches to test if a sentence is properly formed: one can go backward and break it down into pieces, checking at every step that the grammar is not violated anywhere. The other method is to go forward, i.e. start from an empty sentence and using the rules of grammar generate all the possible sentences until we recognize the one we’ve been given (which is finite, so at a certain point if we haven’t found it, then it’s not acceptable). If it’s a sentence we could produce, then it’s acceptable.

Anyway, books of grammars fail to be acceptable algorithms, because they have too many exceptions and fuzzy notions. Also, a whole book is not really so small for us to call it concise.

It turns out that asking for an algorithm is too vague to have an answer. Some computational machines are more powerful than others, and some languages that can’t be enumerated by a machine could be enumerable by a smarter machine (I use the adjective “smarter” just to scare the humanities). Formal grammar is about the amount of computational power you need to enumerate a language, related to how complex the language is.

Enough with the gibberish.

Definition 1 A grammar {G} is a quadruple {G=\{V_N, V_T, S, R\}}, where

  • {V_N} is a set called set of Non-Terminal symbols (we use capital Latin letters for them);
  • {V_T} is a set called set of Terminal symbols, which is disjoint from {V_N};
  • {S} is a privileged element of {V_N}, called start symbol;
  • {R} is a finite set of substitution/production rules, each of the form

    \displaystyle \left(V_T \cup V_N\right)^{\ast} V_N \left(V_T \cup V_N\right)^{\ast} \rightarrow \left(V_T \cup V_N\right)^{\ast},

    where {\ast} is the Kleene star. In other words, {R} is subset of

    \displaystyle \left(V_T \cup V_N\right)^{\ast}\times V_N \times \left(V_T \cup V_N\right)^{\ast} \times \left(V_T \cup V_N\right)^{\ast}.

Continue reading