A peek at Markov chains

Last time, we introduced the basic concepts of our name generator and we saw how webcrawlers can help us gather information easily and efficiently. Now, it’s time to use this information to actually produce some words! To do so, we will rely on statistics and Markov chains.


As explained on the Wikipedia page, “a Markov chain is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event”. It basically depicts the evolution of a finite state machine where state transitions are not deterministic but rather depend on probabilities (they are stochastic).

Originally conceptualized by the Russian mathematician Andrey Markov, Markov processes satisfy the Markov property; this property roughly states that the system is “memoryless”, meaning you can always derive the next step of the chain just by looking at its current state (you don’t need the full history of the chain to know its future behavior).

As we will see later in the article, the property can be adapted to approximate a “short-term memory” by using an N-step Markov chain instead of a single-step chain…

Since Markov chains are a discrete version of Markov processes (with only a fixed number of available states and/or a discrete time), they are also closely related to random walks.

A Markov chain is like a random walk on a finite state machine with stochastic transitions

Markov chains can be used to model numerous situations in various fields: physics, thermodynamics, chemistry, language analysis, economics, genetics, music… (e.g.: to forecast the motion of particles in a fluid with Brownian motion, to model the random scattering of points in a mathematical space with a Poisson process, or for Google’s PageRank algorithm).

I worked on this project after seeing another great video from David Louapre on his YouTube channel “Science Etonnante” called “La machine à inventer des mots” (I would translate it as: “The word-generating machine”). I’ve already mentioned him before, but I really think he is one of the best French science popularizers around today. If you speak French and are into science, don’t hesitate to check out his work!

Here is the code for this project (it doesn’t contain the webcrawler we talked about in the last article, only the word generator).

Project info: Python · 1 person · 3 days

Basic concepts on Markov chains

To better understand the vocabulary associated with Markov chains, let’s work on a little example. This diagram represents a very simple 2-state Markov chain:

As you can see, we define a chain as a set of states and transitions. The circled letters are the states, the arrows represent the possible transitions, and the numbers give the probability of each transition occurring. Here, we have:

  • 2 states: A and B
  • 4 transitions: A to A (50% chance), A to B (50% chance), B to A (90% chance) and B to B (10% chance)

We also need an initial state to know where to start from. Let’s say that we are in A to begin with. Then we have a 50% chance of staying in this state and a 50% chance of switching to state B. Assume we moved to B. Now, it is very probable we will instantly go back to state A (90% chance!); still, there is a small probability (10%) that we stay in B.

A question you often want to answer when working with a Markov chain is what its long-term behavior looks like, i.e. what will happen if you run the chain for a large number of steps. In particular:

  • how long will it take on average to reach a given state in the chain (for the first time)?
  • how long will it take on average to return to a state in the chain?
  • what state will the chain be in for the longest time during a run of X steps?

Given our transitions, we clearly see that we will stay in A more often than in B and that we will usually revert back to A quickly after arriving in B. So we can guess that, on average: we will reach A and B rapidly, we will return to A faster than we return to B, and we will spend more time in A than in B during a run of X steps.
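To make this intuition concrete, here is a minimal Python sketch (independent from the project’s code) that simulates this little chain for many steps and measures how often each state is visited and how long it takes, on average, to come back to it:

```python
import random

# Transition probabilities of the toy chain: P[current_state][next_state]
P = {
    "A": {"A": 0.5, "B": 0.5},
    "B": {"A": 0.9, "B": 0.1},
}

def step(state):
    """Pick the next state at random, weighted by the transition probabilities."""
    targets, weights = zip(*P[state].items())
    return random.choices(targets, weights=weights)[0]

def simulate(start="A", n_steps=100_000):
    """Estimate occupancy and mean return time of each state by brute-force simulation."""
    state = start
    occupancy = {"A": 0, "B": 0}
    last_visit, return_times = {}, {"A": [], "B": []}
    for t in range(n_steps):
        occupancy[state] += 1
        if state in last_visit:
            return_times[state].append(t - last_visit[state])
        last_visit[state] = t
        state = step(state)
    for s in ("A", "B"):
        print(f"{s}: visited {occupancy[s] / n_steps:.1%} of the time, "
              f"mean return time ≈ {sum(return_times[s]) / len(return_times[s]):.2f} steps")

simulate()
```

Running it confirms the intuition: the chain spends roughly two thirds of its time in A, and coming back to A takes noticeably fewer steps than coming back to B.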

These are just a few properties we can study for a Markov chain; there are plenty of others! Typically, you can study the periodicity, transience and recurrence of states, whether some states are inaccessible, etc.

Moreover, it is interesting to note that these mathematical objects can be used both for data representation and analysis (like studying the periodicity and recurrence of a particular state) and for prediction (like computing a “probable” sequence of states starting from an initial state). Here, we focus on this second usage and look at Markov chains as tools for procedural generation (in the context of generated data that depends on sequences).

Using Markov chains for our name generator: simple version

Let’s now try and apply the Markov chain concepts to name generation. The idea is to use the data we crawled to learn the most probable letter transitions and build a chain with a “realistic” behavior; in other words, we want our chain to mostly produce the usual sequences of letters and as few unlikely sequences as possible.

For example, we rarely see the pattern “wz” in a name, but we often see “wy”. So, if our chain’s current state was “w”, we would like it to have a higher probability of switching to “y” than to “z”.

Analyzing data to learn letter sequences

To discover those common patterns, we first need to parse the data we got thanks to our spiders and to analyze it with some statistics. Nothing fancy or too complex, don’t worry! We will essentially go through all the names we stored and search for 3 things:

  1. the possible word starts
  2. the possible word ends
  3. the frequency of each letter sequence (renormalized)

The following figure shows one example of a part of the analysis of a reference word, the city name “Birmingham”:

Example of reference word analysis (with extraction of word start, word end and of one pattern in the middle)

So, for each word, we remember the beginning and the end, and we count the number of times a particular sequence is read (like how many “aa”, “ab”, “ac”… there are in total in our training list). Then, in the end, we normalize the counts to have probabilities (i.e. numbers between 0 and 1) and that will give us our transition matrix.
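In practice, this analysis boils down to a few dictionary updates per word. Here is a minimal sketch of what it could look like (names and data structures are illustrative, not the project’s actual code: the project stores the result as a matrix, while I use nested dictionaries here for brevity, and I arbitrarily keep a 1-letter word start and a 2-letter word end):

```python
from collections import defaultdict

def analyze(names):
    """Extract word starts, word ends and normalized letter-transition probabilities."""
    starts, ends = [], set()
    counts = defaultdict(lambda: defaultdict(int))
    for name in names:
        name = name.lower()
        starts.append(name[0])                 # remember how the word begins
        ends.add(name[-2:])                    # ... and how it ends
        for cur, nxt in zip(name, name[1:]):
            counts[cur][nxt] += 1              # count every 2-letter sequence
    # Renormalize each row so that it holds probabilities summing to 1
    transitions = {
        cur: {nxt: n / sum(row.values()) for nxt, n in row.items()}
        for cur, row in counts.items()
    }
    return starts, ends, transitions

starts, ends, transitions = analyze(["birmingham", "bristol", "brighton"])
print(transitions["b"])   # {'i': 0.333..., 'r': 0.666...}
```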

After renormalization, this matrix is a 2D 42×42 table – because we also consider some special characters with accents – that contains the probability of going from one letter to another (the row being the chain’s current state and the column the next state). To make things clearer, suppose the top-left corner of our matrix looks like this (this is not from any real data):

      a     b     c     d
a   0.01  0.08  0.12  0.11
b   0.20  0.10  0.03  0.07
c   0.12  0.08  0.10  0.03
d   0.10  0.03  0.03  0.04

If the generated name is currently “Tec”, meaning that our current state is “c”, we see that the letter our chain would most probably pick next is “a” (the highest number in row “c”). Then, from “a”, we would likely get “c” or “d”, and so on.

If for a given letter “α” the number in the column “β” is 0, it means we never encountered the sequence “αβ” in our dataset and this transition is considered impossible. On the other hand, if it is 1, it means the only transition we know from “α” is to “β” and this transition is considered certain.

Note: because probabilities must sum to 1, each row of this matrix sums to 1 (since a row holds all the possible transitions from a given state).

Generating new words

Now that we have this matrix, to create a brand new word, we just have to:

  • pick one word start at random
  • look at the row corresponding to the last letter in our word and choose from the most probable characters (i.e. the letters corresponding to the columns where the numbers are the highest)
  • keep on adding letters until we reach a known word end or a maximal length

So, to know when to stop the word, we consider two criteria: either we bluntly cut off the word after a maximum number of letters (but this can lead to strange endings), or we wait until we see a recorded word end. In the version I offer here, both methods are used, but the max length is kind of a “last resort”. Plus, to avoid getting short words that stop as soon as they hit a word-ending sequence, we add a stop_threshold variable that gives the generator a chance to keep working on the word even after seeing a word end.
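Put together, a naive version of the generation loop could look like this (reusing the starts, ends and transitions computed in the previous sketch; the way stop_threshold is used here is my own interpretation of its role, not the project’s exact code):

```python
import random

def generate_word(starts, ends, transitions, max_length=12, stop_threshold=0.4):
    """Generate one word: random start, then weighted picks until a word end or the max length."""
    word = random.choice(starts)
    while len(word) < max_length:
        row = transitions.get(word[-1])
        if not row:                      # dead end: no known transition from this letter
            break
        letters, weights = zip(*row.items())
        word += random.choices(letters, weights=weights)[0]
        # If we just produced a known word ending, stop... but only part of the time,
        # so that words get a chance to grow past the first ending they hit.
        if word[-2:] in ends and random.random() > stop_threshold:
            break
    return word

print(generate_word(starts, ends, transitions))   # e.g. "bringham" with the tiny dataset above
```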

Using Markov chains for our name generator: advanced version

This first version of our name generator produces okay results, but not as good as expected. There are still some unusual sequences because, in truth, our language doesn’t really work on a “2-letter sequence” basis but rather on sequences of 3 or more letters.

Let’s say you have gradually composed a word and eventually reached the sequence: “cr”. It is quite alright: so far so good. But now, what stops our chain from examining the last character (“r”) and switching back to “c”? After all, many city names contain the sequence “rc”: “Arcachon”, “Christchurch”, “Murcia”… However, overall, the sequence “crc” is pretty unlikely. Instead, we would rather have our chain produce something like “cra” or “cre”. As human beings, we can differentiate between consonants and vowels easily and avoid these kinds of patterns; but since our chain is trained independently of this criterion, it cannot use this rule to pick the best answer.

To improve its results, we can transform our Markov chain to make it a 2-step chain: with this new definition, the next state of our chain will depend on its last and its last-but-one states. So, for example, if our word currently ends with “cr”, we will look at the most common letters after “cr”, not only after “r”. Therefore, it is more probable that a vowel will be picked (because in our dataset, the sequence “cr” is usually followed by a vowel and not a consonant).

To actually make this change in our code, the idea is to replace our 2D transition matrix by a 3D one: instead of just considering a row as the current state, you look both at depth and row to get your last-but-one and last states, and then you pick one of the most probable columns for your new state. This diagram roughly sums up the process with our new 2-step Markov chain on a small example:

Example of a letter sequence production with the 3D transition matrix (we search for the most probable letter after the sequence “pa”)
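Concretely, whether you picture it as a 3D matrix or as a table indexed by letter pairs, the change to the analysis and the lookup is small. Here is a hedged sketch (again with illustrative names, not the project’s code):

```python
from collections import defaultdict
import random

def analyze_pairs(names):
    """Count which letter follows each 2-letter sequence, then normalize to probabilities."""
    counts = defaultdict(lambda: defaultdict(int))
    for name in names:
        name = name.lower()
        for i in range(len(name) - 2):
            pair, nxt = name[i:i + 2], name[i + 2]
            counts[pair][nxt] += 1
    return {
        pair: {nxt: n / sum(row.values()) for nxt, n in row.items()}
        for pair, row in counts.items()
    }

transitions2 = analyze_pairs(["arcachon", "christchurch", "murcia"])

# The next letter now depends on the *last two* letters of the word being built
word = "arc"
row = transitions2[word[-2:]]                    # probabilities of what follows "rc"
letters, weights = zip(*row.items())
word += random.choices(letters, weights=weights)[0]
print(word)                                      # "arca", "arch" or "arci"
```

The same idea extends naturally to the N-step chains mentioned earlier, by keying on the last N letters, at the cost of a much sparser table.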

This yields way better results and actually allows us to generate credible names: “Zanzherma“, “Shimo”, “Losh”, “Rio Sur”, “Ulante”…

Note: the code I offer at the top of the article implements this advanced version only.

How to use the WordsGenerator class?

The downloadable archive contains a little main.py script to show you how to use the WordsGenerator class and a data/ subfolder that contains some CSV files used by these examples.

In particular, it provides several examples of WordsGenerator instances so you can get a feel for the different parameters available. The main ideas are that:

  • you create a word generator from a file containing your reference dataset (here, the various CSV files) or from another generator (see below)
  • you can define a specific parsing function to better control how this reference list is read
  • you can also define a specific post-processing step so that, once generated with the Markov chain, your words are treated in a custom way (you can add, remove or replace specific characters…)

And when you want to create new words, you can do so by calling one of the following 3 methods:

  • generate_words: returns a list of generated words of a given length
  • makeiter: returns a Python generator of generated words of a given length (this takes less memory than a completely computed list, but it forces you to go through the iterator to get the items)
  • output: directly prints a list of generated words of given length to the given stream (by default, it is the shell but you can pass in an open file)

As you can see in the scripts, for these 3 functions, you can specify the number of words to generate, their maximal length, the value of the stop_threshold variable and whether or not the generator is allowed to produce existing words (disabled by default).

The WordsGenerator also allows you to set a WordsCasing to have a given casing on your outputs: lowercase (by default), uppercase, camelcase or capital case (respectively: WordsCasing.DFT, WordsCasing.UPPER, WordsCasing.CAMEL, WordsCasing.CAPITAL). For example, with these four casing types, the string “hello world” would become: “hello world”, “HELLO WORLD”, “Hello World” and “Hello world”.
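To give a rough idea, a usage script might look like the snippet below. Only the class and method names (WordsGenerator, WordsCasing, generate_words, makeiter, output) come from this article; the module name, the CSV file name and the keyword arguments are assumptions of mine, so refer to the bundled main.py for the exact calls.

```python
# Illustrative only: module name, file name and keyword arguments are assumptions,
# not the verbatim API - see main.py in the downloadable archive for the real calls.
from words_generator import WordsGenerator, WordsCasing

# Build a generator from a reference CSV file and ask for capitalized outputs
generator = WordsGenerator("data/cities.csv", casing=WordsCasing.CAPITAL)

# Get 10 new words of at most 12 letters as a list...
words = generator.generate_words(10, max_length=12)

# ...or stream them lazily with a Python generator to save memory...
for word in generator.makeiter(10, max_length=12):
    print(word)

# ...or print them directly to a stream (the shell by default, or an open file)
with open("new_names.txt", "w") as output_file:
    generator.output(10, max_length=12, stream=output_file)
```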

Finally, another feature is the import-export system. When you have a WordsGenerator instance, you can easily export its “computors” (meaning its matrix, its word starts and its word ends) to formatted text files in a specific folder. These computors can then be imported just as easily to create a brand new WordsGenerator object. This allows you to transfer some generators from one script to another, or to work with two similar – but slightly different – generators without losing the first one (e.g.: two generators with different word casing).

Just a little issue with re-imported generators: since you only get back the transition matrix, the word starts and the word ends, but not the original reference list, you cannot tell your new generator to create only non-existing words… it may output some that we already know!

Final notes

Given how our chain “trains” on the dataset we give it, it will only produce results as good as the data it is fed! In particular, the more limited your number of references, the more limited the possible outcomes (if you only give a dozen names to your generator to learn from and ask it to create thousands of new names, chances are it will start to go in circles quite fast). And even if it is less likely with a large dataset, it is still possible that some sequences have so few possible continuations that we get loops anyway. This can result in names that begin well but eventually repeat one syllable (that is not a word end) over and over until they reach the maximal length.

Another obvious trend for our WordsGenerator is, if our data contains several types of words, to produce words that resemble the most represented type (e.g.: if we have a lot more Asian city names than European ones, then our name generator will output more “Asian-sounding” city names and fewer “European-sounding” ones). This can be counteracted by providing specific reference subsets, but by default you need to be aware that an imbalance in your data transfers directly to your generator!

The great advantage of generating words this way is that we can very easily change our training dataset and therefore get a new matrix and new sequences. So it is simple to have French, German or Swahili sounding names, assuming you have references for each of these word “types”.

This was just a very short and rough overview of Markov chains. The topic is far deeper than that, and I encourage anyone interested in statistics and modeling to dive deeper into the theory!

As pointed out in one of the comments on David Louapre’s video, a word generator like this one, trained on a specific dictionary, can even be used to create passwords that are not too difficult to remember but don’t contain any real words, and thus cannot be broken by the most basic dictionary attacks.

Some projects also use Markov chains to analyse or to produce music that is composed automatically but follows the rules of harmony and thus sounds good. For example, I came across the paper “Musical Markov chains” by Volchenkov and Dawin (2011), and the thesis “Music Improvisation using Markov Chains” by Linskens (2014)… but there are many more projects on this topic around the net! That said, nowadays, recurrent neural networks are the most common way to tackle this problem, because the “memorylessness” of Markov chains kind of goes against the long-term dependencies usually essential to musical pieces!

References
    1. Science Etonnante (David Louapre)’s website, in French: https://sciencetonnante.wordpress.com/
    2. “La machine à inventer des mots”, Science Etonnante (David Louapre)’s video on word generation with Markov chains, in French: https://www.youtube.com/watch?v=YsR7r2378j0
    3. Wikimedia Foundation, “Markov chain” (https://en.wikipedia.org/wiki/Markov_chain), October 2018. [Online; last accessed 8-November-2018].
    4. Wikimedia Foundation, “Random walk” (https://en.wikipedia.org/wiki/Random_walk), November 2018. [Online; last accessed 4-November-2018].
    5. Wikimedia Foundation, “Brownian motion” (https://en.wikipedia.org/wiki/Brownian_motion), October 2018. [Online; last accessed 4-November-2018].
    6. Wikimedia Foundation, “Poisson point process” (https://en.wikipedia.org/wiki/Poisson_point_process), October 2018. [Online; last accessed 4-November-2018].
    7. Wikimedia Foundation, “PageRank” (https://en.wikipedia.org/wiki/PageRank), October 2018. [Online; last accessed 4-November-2018].
    8. D. Volchenkov, J. R. Dawin, “Musical Markov chains” (https://www.worldscientific.com/doi/pdf/10.1142/S2010194512007829), 2011. [Online; last accessed 8-November-2018].
    9. E. Linskens, “Music Improvisation using Markov Chains” (https://dke.maastrichtuniversity.nl/gm.schoenmakers/wp-content/uploads/2015/09/Linskens-Final-Draft.pdf), 2014. [Online; last accessed 8-November-2018].
