Wikipedia : “A Markov chain is a sequence of random variables X1, X2, X3, … with Markov property, namely that, given the present state, the future and past states are independent.”
Less formally, if we have a string of words like –
“The quick brown fox jumps…”
then, if we assume this sentence to be a Markov chain, we can calculate how probable it is for any word to be the next to appear in it, based only on the knowledge of a fixed number of words preceding it. The words already in the sentence are called the context, and you can choose various sizes of context, like “jumps”, or “brown fox jumps”, or even “” – a context that has zero elements.
Some words have a tendency to appear in certain contexts. For example, “jumps” is likely to be followed by “over”, but “cabagge” is probably not going to appear there. It is possible to analyze a text and determine how often words appear in specific contexts. These statistics can then be used for various purposes.
This method is useful for various tasks in natural language processing, but the fun part is you can use it to generate a random text that appears to be meaningful, and that’s what I’m going to do 🙂
(Source available – see at the end of article)
I decided to create a subclass of TList implement the essential functionality. It has two main methods – AddWord(Context,Word:string) – to gather statistics – and GetRandomWord(Context:string) – to generate text.
The statistics need to show all pairs of context – word found in the text, and how often they occur. The easiest way to do that is probably a huge array. As we don’t know how many records there’ll be, it might be a good idea to allocate new records as needed and only store pointers. The array should also be sorted, as we’ll need to be able to quickly find a given context.
cMaxWordLen=20; //should probably be lower for English
“Total” and “First” will store the total amount of records with that context and the index of first matching record, respectively. This will come handy when we’re generating text.
Initially I simply created every new record by calling New() and storing the pointers in TList’s list, but it turned out that somehow my application was using way more memory than it was supposed to. When my calculations indicated that it should use 1 MB, it actually showed up as 35 MB in Task Manager. The memory manager was probably getting confused, or something like that. So I instead allocated a big chunk of memory and stored all my records there. The procedure to add a word now looked like this :
if MemPosition>=cMaxStatItems then begin
Error(‘StatList size exceeded, can”t add item %d‘,MemPosition);
//Find where it should be placed
mMetadataReady:=false; //indicates that Total and First haven’t been updated
I use binary search to find the position where new records should be inserted. To save time, “Total” and “First” are only updated when all the records have been added. This is done by a simple procedure that I will not show here; you can look it up in the source.
When statistics have been gathered, generating random text is very easy. To generate a single word we can use something like this :
if not mMetadataReady then exit; //Metadata not available
if not found then exit; //Couldn’t find context
//Select a random word matching this context
i:=Item.First + cardinal(Random(Item.Total));
Initially we start with an empty context and add each generated word to it, up to N words (the size of the context). Usually you N=4 is the most you’ll need, and 3 or 2 seem to get the best results with small amounts of statistics. When the context is full, the first word is removed and the newly generated are added at the end. It might be possible that a given context doesn’t appear anywhere in the statistics – if that happens you should try calling GetRandomWord with progressively smaller contexts until one of them works.
Of course there is more to it than I showed here – you need to extract the contexts and words from input files, shuffle them around, implement binary search algorithms, etc. Some of these are demonstrated in my implementation. Note – it is very slow, because I do lots of string manipulation, which isn’t very fast in Delphi.
Source (8 KB)
Required : Delphi 7 or above (just because it uses XPManifest)
Compiled application (179 KB)
Required : Any Windows OS, around 20 Mb RAM
Things to do
- Store every context-word pair only once and add a new field to count the number of occurences. I’ve done that in a different implementation and it can decrease memory requirements notably.
- Use a tree-like structure for statistics – single words as nodes, branching out to show other words that have occured in theri context.
- Create a proper database to store statistics. I had made a PHP+MySQL implementation of this algorithm, but I lost it during OS reinstallation.
- Throw in some formal grammar (production rules) to create texts that are more realistic.