Markov Chain Algorithm in Python
19 Apr 2018
In my humble opinion, Kernighan and Pike's The Practice of Programming [1] is a book every programmer should read (and not just because I'm a fan of all things C and UNIX). A few years ago I was reading Chapter 3, Design and Implementation, whichs examines how programming problems influence the way data structures are used, and recognizes that “[p]rogram design can certainly be colored by a language but is not usually dominated by it." [2] To put some (virtual) flesh on these assertions, the authors chose to implement the Markov chain algorithm in five programming languages (C, Java, C++, Awk, and Perl). [3]
This particular Markov chain algorithm reads English text and generates (sometimes humorous) output that resembles English. Input text is broken up into three-word tuples consisting of a two-word prefix (w1 and w2 shown below) followed by a single suffix word (w3):
set w1 and w2 to the first two words in the text print w1 and w2 loop: randomly choose w3, one of the successors of prefix w1 w2 in the text print w3 replace w1 and w2 by w2 and w3 repeat loop [4](Obviously, the last two words in the input will not have a suffix.) A word consists of all non-whitespace characters (including leading and/or trailing punctuation) contained between whitespace and/or EOF.
For instance, given this input text: [5]
Show your flowcharts and conceal your tables and I will be mystified. Show your tables and your flowcharts will be obvious.the algorithm will discover the following prefixes (left column) and suffixes (right column):
Show your flowcharts tables your flowcharts and will flowcharts and conceal and conceal your conceal your tables your tables and and tables and I your and I will I will be will be mystified. obvious. be mystified. Show mystified Show your and your flowcharts flowcharts will be be obvious. EOFand then print the first prefix “Show your", randomly select, say, “tables", print this word, create a new prefix “your tables", and so on until the final prefix, “be obvious." is reached (and printed).
At the time, I knew enough Python to be dangerous, so I decided to implement the Markov chain algorithm in this scripting language (in particular, Python 2.7). The rest of the article explains what I came up with and why.
Source Code
Not surprisingly, my implementation [6] resembles the ones found in the book, although the Python-isms made it more streamlined. The code presented in the following sections are not necessarily in the order required by Python; I chose this method for pedagogical reasons.
First Things
The markov-tpop.py script requires the services of the following standard Python modules:
import sys import random import string
Next, two global variables are defined:
NPREF = 2 NONWORD = '\n'A prefix consists of two words, because for English at least, “using two words to select a third is a good compromise: it seems to recreate the flavor of the input while adding its own whimsical touch." [7]
We also define a variable that represents a non-word, something that is not seen by the input text processing code. In addition to acting as a “marker word to terminate the algorithm," [8] this variable plays the part of a sentinal value, in that the programmatic appending of a non-word to the input data “simplifies the main processing loops of the program significantly." [9] Finally, to handle the case of insufficient input (e.g., a file consisting of a single word) without the use of ugly special-case code, prior to reading the input, a prefix consisting of two non-words is constructed, “which guarantees there is always enough input for the program... This has the nice benefit that the first word of the input file will be the first suffix of the fake prefix, so the generation loop needs to print only the suffixes it produces." [10]
Data Structures
In their C implementation, the authors had to create their own data structures (linked lists, hash tables, etc.) by hand, but Python's native lists and dictionaries seem almost tailor-made for programs like ours. We merely have to define a single empty dictionary:
statetab = {}
A prefix and all of its possible suffixes is known as a state, “which is standard terminology for Markov algorithms," [11] hence the name of this global variable. In our script, a prefix is a Python list that serves as a key to look up in statetab the corresponding suffix(es), which reside in a separate list.
The Driver
The “main" function looks like this:
if __name__ == '__main__': MAXGEN = 10000 prefix = [] for i in range(NPREF): prefix.append(NONWORD) build(prefix) add(prefix, NONWORD) generate(MAXGEN) sys.exit(0)
(Note the lack of special-purpose code to catch edge cases. Our script can handle empty input files, files with a single word, and files with two words.) The initial prefix is constructed, which is then used to kick off input file processing. A non-word is made the synthetic suffix for the last prefix prior to generating the output text. For grins, we limit the program output to 10,000 generated words.
Processing Input
The work of processing the input file is divided between two functions. The simplest way to obtain input data is to force the user to provide the contents of a file to stdin. (On the Unix command line, this may be accomplished by cat-ing a text file and then piping the output to our Python script.)
def build(prefix): for line in sys.stdin.readlines(): if len(line.strip( string.whitespace)) == 0: continue for word in line.split(): add(prefix, word) prefix.pop(0) prefix.append(word) return
Each input line is broken up into zero or more words (as defined above). We skip lines that have no words. Otherwise, each word encountered is associated with the current prefix. Do not miss the fact that every time build() is called, prefix is modified (think “call by reference") for use by the subsequent build() invocation.
If word is the first suffix encountered by prefix, we “create" a new statetab record, otherwise the new suffix is appended to the existing list. Note that duplicate suffixes are legal. Python does not allow mutable objects like lists to serve as dictionary keys, so we have to employ immutable tuples.
def add(prefix, word): key = tuple(prefix) if key in statetab: statetab[key].append(word) else: statetab[key] = [word] return(In their various implementations, Kernighan and Pike modify prefix in add(), but we decided to edit that parameter in build() instead, since the final call to add() in build() has no need to modify prefix.)
If we were to execute print prefix, statetab[prefix] before the return statement in add(), the following diagnostic output will be produced given our sample input sentence:
['\n', '\n'] ['Show'] ['\n', 'Show'] ['your'] ['Show', 'your'] ['flowcharts'] ['your', 'flowcharts'] ['and'] ['flowcharts', 'and'] ['conceal'] ['and', 'conceal'] ['your'] ['conceal', 'your'] ['tables'] ['your', 'tables'] ['and'] ['tables', 'and'] ['I'] ['and', 'I'] ['will'] ['I', 'will'] ['be'] ['will', 'be'] ['mystified.'] ['be', 'mystified.'] ['Show'] ['mystified.', 'Show'] ['your'] ['Show', 'your'] ['flowcharts', 'tables'] ['your', 'tables'] ['and', 'and'] ['tables', 'and'] ['I', 'your'] ['and', 'your'] ['flowcharts'] ['your', 'flowcharts'] ['and', 'will'] ['flowcharts', 'will'] ['be'] ['will', 'be'] ['mystified.', 'obvious.'] ['be', 'obvious.'] ['\n']
Generating Output
In the diagnostic output, the first record in statetab consists of two non-words with a single suffix that happens to be the first input word, and the second record contains the second input word. This is an ingenious method on the authors' part for arranging the printing of the first input prefix without the use of special-purpose code.
def generate(nwords): prefix = [] for i in range(NPREF): prefix.append(NONWORD) random.seed() for i in range(nwords): suffixes = statetab[tuple(prefix)] word = random.choice(suffixes) if word == NONWORD: break print word, prefix.pop(0) prefix.append(word) return
The random Python module does a passable job of enabling generate() to choose one of the suffixes from the list. The comma in the print statement “suppresses the end-of-line character normally added at the end of the printed text." [12] In practice, word is printed along with a single space character, except for the last word to be printed, which is followed by a non-word, the ‘\n' character, which just so happens to be the “suffix" associated with the final prefix, one reason why the main driver makes that call to add(). The other reason is that processing will halt if this suffix is encountered before nwords have been printed.
So our original sample input:
Show your flowcharts and conceal your tables and I will be mystified. Show your tables and your flowcharts will be obvious.may be rewritten many different ways:
$ ./markov-tpop.py < brooks.txt | fmt -w 40 Show your flowcharts will be obvious. $ ./markov-tpop.py < brooks.txt | fmt -w 40 Show your flowcharts and conceal your tables and I will be mystified. Show your tables and I will be obvious. $ ./markov-tpop.py < brooks.txt | fmt -w 40 Show your tables and your flowcharts and conceal your tables and your flowcharts will be obvious.
Fun Example
Chapter 1 of Jane Austin's Pride and Prejudice opens with this famous sentence (you may have to click twice on the link to get the actual text):
It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.and ends like this:
The business of her life was to get her daughters married; its solace was visiting and news.
When we ran our script against Chapter 1, the following was generated:
$ ./markov-tpop.py < pridech1.txt | fmt -w 40 It is a truth universally acknowledged, that a single man in possession of a wife. However little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is to take possession before Michaelmas, and some of his servants are to be sure! A single man in possession of a wife. However little known the feelings or views of such a way? You take delight in vexing me. You have no compassion for my little Lizzy.” “I desire you will do no such thing. Lizzy is not a bit better than the others; and I am thinking of her life was to get her daughters married; its solace was visiting and news.
Notes
Copyright © 2018, 2020, 2023 by Paul Eissen. Powered by w3.css