N-grams are a well-established method in natural language processing. They can be used for predictive text, sentiment analysis and other useful tasks. I have used n-grams to create a small Jupyter notebook for generating plausible synthetic names. These could be useful for role-playing games, for fiction writing, or for inserting into textbook examples to avoid using stereotypical names.
How it Works
In essence, the script downloads a corpus of real names from the US Social Security Administration, which makes them publicly available. To speed things up, it only downloads the corpus if it does not already find the zip file locally. It then uses the Python NLTK library to obtain n-grams from these real names.
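The download-once step could look something like the sketch below. It is not the notebook's actual code: the function name, the local file name and the URL are my assumptions for illustration (check the SSA site for the current location of the archive), and the downloader is passed in as a parameter only so the caching logic can be tried without touching the network.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Assumed location of the national baby-names archive; verify before relying on it.
NAMES_URL = "https://www.ssa.gov/oact/babynames/names.zip"


def fetch_corpus(url=NAMES_URL, dest="names.zip", downloader=urlretrieve):
    """Return the path to the corpus zip, downloading it only if it is missing."""
    path = Path(dest)
    if not path.exists():
        # Only reached on the first run; later runs reuse the local copy.
        downloader(url, str(path))
    return path
```

Calling `fetch_corpus()` a second time returns immediately, because the zip is already on disk.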
We then count how often each letter appears in the last position of these n-grams. From this we build a dictionary whose keys are the first n-1 letters of each n-gram and whose values are lists of possible next letters, each repeated in proportion to its frequency in our names corpus.
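To make the dictionary structure concrete, here is a minimal sketch. It is an illustration under stated assumptions rather than the notebook's code: it uses plain string slicing instead of NLTK so that it has no dependencies, it counts each name once, and the names `char_ngrams` and `build_table` are made up for this example.

```python
from collections import defaultdict


def char_ngrams(word, n):
    """Return every run of n consecutive characters in word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def build_table(names, n=3):
    """Map each (n-1)-letter prefix to the list of letters that followed it.

    A letter is appended once per occurrence, so choosing uniformly from
    the list reproduces the frequencies seen in the input names.
    """
    table = defaultdict(list)
    for name in names:
        for gram in char_ngrams(name.lower(), n):
            table[gram[:-1]].append(gram[-1])
    return dict(table)


table = build_table(["anna", "anne", "hanna"], n=3)
print(table["nn"])  # ['a', 'e', 'a']
```

In this toy corpus "nn" is followed by "a" twice and "e" once, so a random pick from the list returns "a" two times out of three.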
This dictionary can then be used to generate new names via a simple iterative process. We use > and < as stop characters to mark where a name begins and ends, since the letter distribution in n-grams is likely to depend on the n-gram's position in the word. You can see the details of this in the notebook.
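The iterative process can be sketched roughly as follows. Several details here are assumptions on my part, not taken from the notebook: that ">" pads the start of a name and "<" marks its end, the `max_len` guard, and the function names. The table builder is repeated in compact form so the example runs on its own.

```python
import random
from collections import defaultdict

START, END = ">", "<"  # assumption: ">" pads the start, "<" marks the end


def build_table(names, n=3):
    """Map each (n-1)-character context to the characters that followed it."""
    table = defaultdict(list)
    for name in names:
        padded = START * (n - 1) + name.lower() + END
        for i in range(len(padded) - n + 1):
            gram = padded[i:i + n]
            table[gram[:-1]].append(gram[-1])
    return dict(table)


def generate(table, n=3, max_len=12, rng=random):
    """Grow a name one letter at a time until the end marker is drawn."""
    context = START * (n - 1)
    letters = []
    while len(letters) < max_len:  # guard against very long names
        nxt = rng.choice(table[context])
        if nxt == END:
            break
        letters.append(nxt)
        context = (context + nxt)[1:]  # slide the window along by one
    return "".join(letters).capitalize()


table = build_table(["anna", "hanna", "joanne", "jordan", "leon"])
print([generate(table) for _ in range(5)])
```

Because the padded start context only ever precedes first letters, and the end marker only ever follows final letters, the generated names begin and end the way real names in the corpus do.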
I noticed that the generation process quite often produces English words. To prevent this, I included a simple dictionary check that excludes English words from the returned names (existing names are not filtered out). The names corpus I am using also records the gender for each name occurrence, so I divide the names to generate synthetic male, female and gender-neutral names.
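A dictionary check of this kind can be as simple as the sketch below. Where the English word list comes from is left open here, since I have not shown which one the notebook uses; a system word file or a corpus word list would both work. The function name and the tiny word set are purely illustrative.

```python
def drop_english_words(candidates, english_words):
    """Keep only the candidates that are not in the English word list."""
    vocabulary = {word.lower() for word in english_words}
    return [name for name in candidates if name.lower() not in vocabulary]


# Hypothetical word list, just large enough to show the effect.
words = {"villain", "lark", "rose"}
print(drop_english_words(["Jord", "Villain", "Calyn", "Lark"], words))
# ['Jord', 'Calyn']
```

Note that a general-purpose word list may itself contain some given names, which would then be dropped too.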
Results
Not every name this generator produces is great, but a surprising number seem plausible.
For instance, it generates Ferard, Braul and Jord as male names. However, it also generates Kwask, which seems less plausible, and Leon, which already exists (I decided not to filter out pre-existing names).
Female names generated included Calyn, Susele and Gilada. There were some homophones of existing female names, such as Haylee and Kaelea, and also some existing names like Gina and Marian. It did not always get things right, though. Perhaps Vilain is best avoided unless you favour nominative determinism in your evil NPCs.
My gender-neutral names were generated using n-grams from names that are used for both genders. I set a threshold of a ten-to-one ratio either way. This gave a rather smaller pool of names to create n-grams from, so the names tend to be less plausible. However, while Shtor and Alogales seem a little strange, Rashajen, Trie and Skylorig all have a certain charm.
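One way to read the ten-to-one threshold is that a name counts as gender neutral when it appears for both genders and neither outnumbers the other by more than ten to one. The sketch below implements that reading. The interpretation, the function name, the (name, sex, count) record layout and the sample counts are all assumptions for illustration, not details taken from the notebook or from the real corpus.

```python
from collections import Counter


def gender_neutral_names(records, max_ratio=10):
    """Return names used for both genders within max_ratio either way.

    records is an iterable of (name, sex, count) tuples, with sex given
    as "M" or "F" (assumed layout).
    """
    male, female = Counter(), Counter()
    for name, sex, count in records:
        (male if sex == "M" else female)[name] += count
    shared = male.keys() & female.keys()
    return {
        name
        for name in shared
        if max(male[name], female[name]) <= max_ratio * min(male[name], female[name])
    }


# Made-up counts, not real corpus figures.
sample = [
    ("Riley", "F", 600), ("Riley", "M", 400),
    ("James", "M", 5000), ("James", "F", 20),
    ("Mary", "F", 4000),
]
print(sorted(gender_neutral_names(sample)))  # ['Riley']
```

Raising `max_ratio` would admit more names into the pool, at the cost of including names that are mostly used for one gender.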
The script is available from my GitHub. By all means download it and give it a go.
Further Reading
There are plenty of interesting articles about n-grams on the web if you want to learn more. Here are just a few to get you started:
Understanding Word N-grams and N-gram Probability in Natural Language Processing