Generating text as a Markov chain
Wikipedia : “A Markov chain is a sequence of random variables X1, X2, X3, … with Markov property, namely that, given the present state, the future and past states are independent.”
Less formally, if we have a string of words like –
“The quick brown fox jumps…”
then, if we assume this sentence to be a Markov chain, we can calculate how probable it is for any given word to appear next, based only on a fixed number of words preceding it. Those preceding words are called the context, and you can choose various sizes of context, like “jumps”, or “brown fox jumps”, or even “” – a context with zero elements.
Some words have a tendency to appear in certain contexts. For example, “jumps” is likely to be followed by “over”, but “cabbage” is probably not going to appear there. It is possible to analyze a text and determine how often words appear in specific contexts. These statistics can then be used for various purposes.
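To make this concrete, here is a minimal Python sketch of the statistics-gathering idea (the article's actual implementation is in Delphi; all names here are illustrative):

```python
from collections import defaultdict

def build_stats(words, n=2):
    """Map each context (up to n preceding words) to the words seen after it."""
    stats = defaultdict(list)
    for i in range(len(words)):
        context = tuple(words[max(0, i - n):i])  # up to n preceding words
        stats[context].append(words[i])
    return stats

words = "the quick brown fox jumps over the lazy dog".split()
stats = build_stats(words, n=1)
# With a one-word context, "jumps" has been followed only by "over"
```

Counting how often each word shows up in `stats[context]` gives exactly the context–word frequencies described above.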
This method is useful for various tasks in natural language processing, but the fun part is you can use it to generate a random text that appears to be meaningful, and that’s what I’m going to do 🙂
(Source available – see at the end of article)
I decided to create a subclass of TList to implement the essential functionality. It has two main methods – AddWord(Context,Word:string) to gather statistics, and GetRandomWord(Context:string) to generate text.
The statistics need to record every context–word pair found in the text, and how often each occurs. The easiest way to do that is probably a huge array. As we don’t know in advance how many records there’ll be, it might be a good idea to allocate new records as needed and only store pointers. The array should also be kept sorted, as we’ll need to quickly find all records for a given context.
cMaxWordLen=20; //should probably be lower for English
“Total” and “First” will store the total number of records with that context and the index of the first matching record, respectively. This will come in handy when we’re generating text.
Initially I simply created every new record by calling New() and storing the pointers in TList’s list, but it turned out that my application was using far more memory than it was supposed to. When my calculations indicated that it should use 1 MB, it actually showed up as 35 MB in Task Manager – probably overhead from the memory manager allocating many small blocks. So instead I allocated one big chunk of memory and stored all my records there. The procedure to add a word now looked like this:
if MemPosition>=cMaxStatItems then begin
Error('StatList size exceeded, can''t add item %d',[MemPosition]);
//Find where it should be placed
mMetadataReady:=false; //indicates that Total and First haven't been updated
I use binary search to find the position where each new record should be inserted. To save time, “Total” and “First” are only updated after all the records have been added. This is done by a simple procedure that I will not show here; you can look it up in the source.
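The sorted-insert idea can be sketched in a few lines of Python (again, only a rough stand-in for the Delphi record array; the names and the sentinel trick below are my own, not the article's code):

```python
import bisect

records = []  # list of (context, word) tuples, kept sorted at all times

def add_word(context, word):
    # Binary-search insert keeps the array sorted by context, then by word
    bisect.insort(records, (context, word))

def first_and_total(context):
    """Return (index of first record with this context, count of such records),
    i.e. the "First" and "Total" fields described above."""
    lo = bisect.bisect_left(records, (context,))
    # "\x00" appended to the context sorts just past every (context, word) pair
    hi = bisect.bisect_left(records, (context + "\x00",))
    return lo, hi - lo
```

Because the records are sorted, both lookups are O(log n), which is what makes generation fast once the statistics are in place.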
When statistics have been gathered, generating random text is very easy. To generate a single word we can use something like this:
if not mMetadataReady then exit; //Metadata not available
if not found then exit; //Couldn’t find context
//Select a random word matching this context
i:=Item.First + cardinal(Random(Item.Total));
Initially we start with an empty context and add each generated word to it, up to N words (the size of the context). Usually N=4 is the most you’ll need, and 3 or 2 seem to give the best results with small amounts of statistics. When the context is full, the first word is removed and the newly generated word is added at the end. It may happen that a given context doesn’t appear anywhere in the statistics – if so, try calling GetRandomWord with progressively smaller contexts until one of them works.
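The whole generation loop, including the fallback to shorter contexts, can be sketched like this in Python (a hypothetical translation of the approach, using the `stats` mapping from the earlier sketch; `generate` and its parameters are my names, not the article's):

```python
import random

def generate(stats, n=2, length=20, seed=None):
    """Generate up to `length` words, backing off to shorter contexts as needed.
    `stats` maps a tuple of context words to the list of words seen after it."""
    rng = random.Random(seed)
    context, out = [], []
    for _ in range(length):
        # Try the full context first, then progressively shorter suffixes of it
        for k in range(len(context), -1, -1):
            candidates = stats.get(tuple(context[len(context) - k:]), [])
            if candidates:
                break
        if not candidates:
            break  # even the empty context is unknown; give up
        word = rng.choice(candidates)
        out.append(word)
        context = (context + [word])[-n:]  # keep at most n words of context
    return " ".join(out)
```

Because duplicate entries are kept in each candidate list, picking uniformly at random automatically weights words by how often they appeared in that context.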
Of course there is more to it than I showed here – you need to extract the contexts and words from input files, shuffle them around, implement binary search algorithms, etc. Some of these are demonstrated in my implementation. Note – it is very slow, because I do lots of string manipulation, which isn’t very fast in Delphi.
Source (8 KB)
Required : Delphi 7 or above (just because it uses XPManifest)
Compiled application (179 KB)
Required : Any Windows OS, around 20 MB RAM
Things to do
- Store every context-word pair only once and add a new field to count the number of occurrences. I’ve done that in a different implementation and it can decrease memory requirements notably.
- Use a tree-like structure for statistics – single words as nodes, branching out to show other words that have occurred in their context.
- Create a proper database to store statistics. I had made a PHP+MySQL implementation of this algorithm, but I lost it during OS reinstallation.
- Throw in some formal grammar (production rules) to create texts that are more realistic.
I like some of your plugins. I’m interested in a wordpress markov chain plugin.
If you know one or want to build one for payment, please contact me with price etc.
Check out this PHP Markov chain class, it should be easy enough to create a plugin based on it.
Personally, I am focusing on other matters at the moment, so I don’t have time to make a plugin like that.
This has an application in intelligence and military operations – assigning operational names which are memorable – but don’t accidentally give away the story by revealing the associational reasoning by which people pick names.
In less adversarial contexts – I’m involved in disaster planning – it might be quite useful to be able to design memorable word-combinations in order to designate locations, grid coordinates, unit designations, etc. – especially if “sound-alike” combinations could be eliminated – since they’ll often be transmitted by voice – under less than optimal conditions.
Even more so – if we were able to take the outputs, assign definitions, and generate directories. Some of the entries, rather than names, might be designations of text strings – much like the telegraph abbreviations of the late 19th and early 20th centuries. I cannot estimate what its market value would be. But if it were reliable and user-friendly – I’d certainly promote it in disaster-prep circles – and in a disaster best practices compilation which is currently underway.
However – to the extent I could afford it – I’d certainly be willing to pay for a license.
Just a thought. Nice work.
An interesting thought. There are already algorithms for determining how similar two words sound, so that would be doable. However, is there really any demand for special software that basically generates and maintains a list of made-up words?
On the other hand, I’ll admit I know nothing about disaster-preparedness, so perhaps there is a market. If you think the idea is worth pursuing, feel free to contact me (see the link in the main menu) with a more detailed description of how the app should work. I might cook something up in my free time.
Nice article, you explained “a Markov chain is a sequence of random variables” very well.