Refining my math-based name generator…

Because data-driven algorithms rely a lot on… data, who could have guessed?

This article is also available on Medium.

Earlier this week, I talked about a little JavaScript tool I made recently, rangen-name (available as an npm package or on GitHub). As explained in my previous post, this lib is a basic random name generator that uses Markov chains to create credible names.

Today, I want to talk about my latest update of the package that allows users to pass in some custom references to better “control the randomness” and get completely different names:

How does it work?

For firstnames

The firstname generator in this lib is fairly simple: it picks one firstname at random from the given lists of male/female firstnames, depending on the gender of the name you asked for (or with a 50/50 chance of a male or female name if you didn’t specify anything).
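As a rough illustration of that logic, here is a minimal sketch. It is not the actual rangen-name source: the function name `pickFirstname`, its parameters and the injectable `rng` argument are all assumptions made for the example.

```javascript
// Hypothetical helper illustrating the idea; not the rangen-name implementation.
// `rng` returns a float in [0, 1) and can be swapped for reproducible runs.
function pickFirstname(maleList, femaleList, gender, rng = Math.random) {
  // No gender requested: flip a fair coin between the two lists.
  const resolved = gender ?? (rng() < 0.5 ? "male" : "female");
  const list = resolved === "male" ? maleList : femaleList;
  // Uniform pick inside the chosen list.
  return list[Math.floor(rng() * list.length)];
}

// Toy lists, just for the demo.
console.log(pickFirstname(["Jiro", "Daido"], ["Aiko", "Yumi"], "female"));
```

Passing your own lists here plays the same role as the custom reference lists in the lib: the picker can only ever return what is in the list it was given.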

By default, the generator is loaded up with a list of common firstnames but you can now use the maleFirstnameReferenceList and femaleFirstnameReferenceList options to specify your own:

For lastnames

The lastname generator is a bit more complex: it relies on Markov chains.

We saw last time that Markov chains are a specific kind of stochastic state machine where you have a specific probability distribution for each state transition; in our case, this means that we can model letter sequence probability and chain letters in a “logical” way.

For example, in a French text, we’ll rarely see a “z” after a “g”, but we’ll see lots of “ss” sequences: this implies that a probability matrix based on the French language would have a low probability for the “z” if the current letter is “g” but a high probability for the “s” if the current letter is “s”.

So, something crucial in determining these transition distributions is the initial list of words you take as reference, because it’s by reading each item in this list that you’ll gradually build an (n + 1)-D probability matrix of probable letter sequences (here, n = 2 because we have a 2-step Markov chain, so the matrix is 3D).
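To make this more concrete, here is a minimal sketch of how such a matrix could be built from a reference list. It is an assumption-laden illustration, not the lib's code: `buildTransitions` is a made-up name, the matrix is stored as nested objects rather than a real 3D array, and word starts/ends are ignored for simplicity.

```javascript
// Sketch only: count every 3-letter sequence, then normalise per (c1, c2) pair
// so that probs[c1][c2][c3] = P(c3 | c1, c2).
function buildTransitions(words) {
  const counts = {};
  for (const word of words) {
    const w = word.toLowerCase();
    // Words shorter than 3 letters contribute nothing.
    for (let i = 0; i + 2 < w.length; i++) {
      const c1 = w[i], c2 = w[i + 1], c3 = w[i + 2];
      counts[c1] = counts[c1] || {};
      counts[c1][c2] = counts[c1][c2] || {};
      counts[c1][c2][c3] = (counts[c1][c2][c3] || 0) + 1;
    }
  }
  const probs = {};
  for (const [c1, rows] of Object.entries(counts)) {
    probs[c1] = {};
    for (const [c2, row] of Object.entries(rows)) {
      // Each (c1, c2) row is divided by its own total so it sums to 1.
      const total = Object.values(row).reduce((a, b) => a + b, 0);
      probs[c1][c2] = Object.fromEntries(
        Object.entries(row).map(([c3, n]) => [c3, n / total])
      );
    }
  }
  return probs;
}

// After "an", this toy list has seen "n" once and "d" once.
console.log(buildTransitions(["anna", "andy"]).a.n); // { n: 0.5, d: 0.5 }
```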

In the initial release, I’d constructed this matrix based on a list of common English surnames that I’d gathered from the net using a webscraper, and you couldn’t change it afterwards. So you were “forced” to get names resembling this initial reference that all sounded English…

But in my latest update, I’ve improved rangen-name to also accept user-defined reference lists for the firstnames and the lastnames! This way, you can change the probability matrix and impact the probable letter sequences… which in turn means that you’ll get completely different results! 🙂

In the above example, I showed how passing in some custom firstname or lastname references had a huge influence on the names that were produced:

# with default parameters ("English-sounding" names)
Sheridan Lint
Cindy Pown
Randi Baye

# with custom parameters ("Japanese-sounding" names)
Daido Igata
Jiro Seki
Banzan Ashi

That’s because when the generator first initialises and fills its probability matrix, it’s going to encounter very different letter sequences and therefore store different probabilities for the Markov chain state transitions.
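For the curious, walking a 2-step chain could look roughly like the sketch below. Everything here is hypothetical (the `toyProbs` table, `sampleNext`, `generate`, the two-letter seed and the length cap); the real lib may well pick its starting letters and stopping condition differently.

```javascript
// Hypothetical transition table: toyProbs[c1][c2][c3] = P(c3 | c1, c2).
const toyProbs = {
  s: { e: { k: 1 } },
  e: { k: { i: 1 } },
};

// Draw one letter from a { letter: probability } row.
function sampleNext(row, rng = Math.random) {
  let r = rng();
  for (const [letter, p] of Object.entries(row)) {
    r -= p;
    if (r < 0) return letter;
  }
  // Fallback in case rounding errors leave a tiny remainder.
  const letters = Object.keys(row);
  return letters[letters.length - 1];
}

// Extend `seed` (at least 2 letters) one letter at a time.
function generate(probs, seed, maxLength, rng = Math.random) {
  let name = seed;
  while (name.length < maxLength) {
    const c1 = name[name.length - 2];
    const c2 = name[name.length - 1];
    const row = probs[c1] && probs[c1][c2];
    if (!row) break; // this pair was never seen in the references: dead end
    name += sampleNext(row, rng);
  }
  return name;
}

console.log(generate(toyProbs, "se", 6)); // "seki"
```

With a different table, the very same walk produces a very different name, which is the whole point of swapping the reference list.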

Of course, the longer your list of reference words the better, because if it’s too short you run the risk of learning from only a handful of examples and getting badly skewed probability distributions. Here’s what can happen if you just give 5 reference lastnames:

As you can see, the generator just keeps on repeating the same letter sequences again and again, and once it’s hit a letter it’s usually doomed to always continue the same way. That’s because, as with all data-driven algorithms, the knowledge your program has of the world is limited to the dataset you fed it…
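One way to put a number on this effect is to measure how many 2-letter contexts have only a single possible follow-up letter. The sketch below is an illustration under my own assumptions (the `deterministicShare` helper does not exist in the lib):

```javascript
// Sketch: share of 2-letter contexts that allow exactly one next letter.
// A value close to 1 means the chain is almost fully "forced".
function deterministicShare(words) {
  const successors = new Map(); // "c1c2" -> Set of observed c3
  for (const word of words) {
    const w = word.toLowerCase();
    for (let i = 0; i + 2 < w.length; i++) {
      const key = w.slice(i, i + 2);
      if (!successors.has(key)) successors.set(key, new Set());
      successors.get(key).add(w[i + 2]);
    }
  }
  let forced = 0;
  for (const set of successors.values()) {
    if (set.size === 1) forced++;
  }
  return successors.size === 0 ? 0 : forced / successors.size;
}

// With a single reference word, every context is forced.
console.log(deterministicShare(["seki"])); // 1
```

The longer and more varied the reference list, the lower this share tends to get, and the less repetitive the generated names feel.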

Let’s visualise our Markov chain!

Now, the problem is that talking about 3D probability matrices is quite abstract and sometimes hard to grasp. It’s often valuable to have a tool to visualise your data in a user-friendly and readable way. As a dev, this can be particularly useful if you have a bug somewhere and you’re not sure whether the problem comes from your algorithm or the data itself; as a user, it’s nice to better identify the different steps of the process and the inner logic of the lib (it’s a workaround to avoid the usual “black-box” issue).

So! In addition to having this new entry-point, we can now also see the probability matrix that was generated in your program! 🙂

If you want to understand how the Markov chain of your generator is currently configured, you can export the 3-D matrix as a series of PNG images and visualise how probable each 3-letter sequence is.

The thing is: we need to visualise a 3-dimensional object, which is never handy. We could try and make a big 3D cube with small voxels that create a 3D heatmap, but it is actually a lot more readable as “slices”! For each symbol in the alphabet, we can compute and save the corresponding 2D sub-probability matrix (i.e. the probability of each next 2-letter sequence if the current character is the one we chose).
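Extracting one slice can be sketched as follows. This assumes the matrix is stored as nested objects (`probs[c1][c2][c3]`), which is a choice made for the example and not necessarily how the lib stores it; `sliceFor` and the toy data are hypothetical too.

```javascript
// Sketch: turn the c1 "slice" into a plain 2D grid.
// Rows are the candidates for c2, columns the candidates for c3.
function sliceFor(probs, c1, alphabet) {
  const rows = probs[c1] || {};
  return alphabet.map((c2) =>
    // Transitions that were never observed are shown as 0.
    alphabet.map((c3) => (rows[c2] && rows[c2][c3]) || 0)
  );
}

// Tiny made-up example with a 3-letter alphabet.
const demoAlphabet = ["a", "c", "n"];
const demoProbs = { a: { a: { c: 1 }, n: { a: 0.5, n: 0.5 } } };
console.log(sliceFor(demoProbs, "a", demoAlphabet));
// [ [ 0, 1, 0 ], [ 0, 0, 0 ], [ 0.5, 0, 0.5 ] ]
```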

Each image is a 2D heatmap grid that shows with white-to-red cells the probability each transition has of occurring (white is very low, red is very high):

  • in the bottom-right corner, you see the current “slice”, i.e. the current character c1 you’re at in your generated name
  • then, the rows list the possibilities for the next character c2
  • and for each row, the color of the cell at a given column c3 corresponds to the probability of having the c1-c2-c3 sequence in your word
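The white-to-red scale itself can be as simple as a linear fade. The exact colour ramp the lib uses is not described here, so take the mapping below as one plausible assumption (and `probabilityToRgb` as a made-up name):

```javascript
// Sketch: map a probability in [0, 1] to an [r, g, b] colour,
// fading linearly from white (0) to pure red (1).
function probabilityToRgb(p) {
  const clamped = Math.min(1, Math.max(0, p));
  const fade = Math.round(255 * (1 - clamped));
  return [255, fade, fade];
}

console.log(probabilityToRgb(0)); // [ 255, 255, 255 ]
console.log(probabilityToRgb(1)); // [ 255, 0, 0 ]
```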

For example, here, we have the heatmap for c1 = “a”. We see that:

  • if c1 = “a” and c2 = “a” also, then we’re forced to have c3 = “c”: in our reference list, only the sequence “aac” exists
  • but if c2 = “n”, we have lots of different possibilities that are each about as probable: “and”, “ann”, “ans”, “ant”…
  • we apparently have never encountered a sequence with “ao” in this reference list: there is no possible transition after that!

What’s really interesting is to compare the probability matrices for two different reference lists. For example, here, I compare the “a” slice of my 3D probability matrix for the default reference list of English surnames, and for my reference list of Japanese surnames:

We clearly see that, even though we’re using the same letters, the letter sequences are completely different! Meaning that, in turn, the generated names will be widely different too 🙂

(For example, the sequence “aka” is quite common in Japanese and non-existent in English, while “aha” is quite common in both reference lists!)
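If you prefer text to pictures, the same comparison can be sketched by listing the 3-letter sequences that appear in one reference list but not in the other. The helpers and the two tiny lists below are only there for illustration; they are not the lists shipped with the lib.

```javascript
// Sketch: count every 3-letter sequence found in a list of words.
function trigramCounts(words) {
  const counts = new Map();
  for (const word of words) {
    const w = word.toLowerCase();
    for (let i = 0; i + 3 <= w.length; i++) {
      const tri = w.slice(i, i + 3);
      counts.set(tri, (counts.get(tri) || 0) + 1);
    }
  }
  return counts;
}

// Sequences seen in the first list and never in the second one.
function onlyInFirst(wordsA, wordsB) {
  const a = trigramCounts(wordsA);
  const b = trigramCounts(wordsB);
  return [...a.keys()].filter((tri) => !b.has(tri)).sort();
}

// Toy lists, far too small to be representative.
console.log(onlyInFirst(["Tanaka", "Nakamura"], ["Graham", "Abraham"]));
```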

Conclusion

With this new feature, rangen-name is now way more customisable: by passing in your own reference list of surnames, you can modify how the probability matrix gets built and directly influence the names the generator can create.

Also, we can visualise the probability matrix to easily see how our reference list was interpreted and why specific letter sequences appear more frequently than others in the results.

Feel free to test this tool and give me some feedback! Also, don’t hesitate to post comments with ideas of other tools you’d like to see in the rangen suite 🙂
