Markov Chain Algorithm in Python

In my humble opinion, Kernighan and Pike's The Practice of Programming [1] is a book every programmer should read (and not just because I'm a fan of all things C and UNIX). A few years ago I was reading Chapter 3, Design and Implementation, whichs examines how programming problems influence the way data structures are used, and recognizes that “[p]rogram design can certainly be colored by a language but is not usually dominated by it." [2] To put some (virtual) flesh on these assertions, the authors chose to implement the Markov chain algorithm in five programming languages (C, Java, C++, Awk, and Perl). [3]

This particular Markov chain algorithm reads English text and generates (sometimes humorous) output that resembles English. Input text is broken up into three-word tuples consisting of a two-word prefix (w1 and w2 shown below) followed by a single suffix word (w3):

set w1 and w2 to the first two words
    in the text
print w1 and w2
loop:
    randomly choose w3, one of the
        successors of prefix w1 w2
        in the text
    print w3
    replace w1 and w2 by w2 and w3
    repeat loop [4]

(Obviously, the last two words in the input will not have a suffix.) A word consists of all non-whitespace characters (including leading and/or trailing punctuation) contained between whitespace and/or EOF.

For instance, given this input text: [5]

Show your flowcharts and conceal your tables and I will be mystified. Show your tables and your flowcharts will be obvious.

the algorithm will discover the following prefixes (left column) and suffixes (right column):

Show your       flowcharts tables
your flowcharts and will
flowcharts and  conceal
and conceal     your
conceal your    tables
your tables     and and
tables and      I your
and I           will
I will          be
will be         mystified. obvious.
be mystified.   Show
mystified Show  your
and your        flowcharts
flowcharts will be
be obvious.     EOF

and then print the first prefix “Show your", randomly select, say, “tables", print this word, create a new prefix “your tables", and so on until the final prefix, “be obvious." is reached (and printed).

At the time, I knew enough Python to be dangerous, so I decided to implement the Markov chain algorithm in this scripting language (in particular, Python 2.7). The rest of the article explains what I came up with and why.

Source Code

Not surprisingly, my implementation [6] resembles the ones found in the book, although the Python-isms made it more streamlined. The code presented in the following sections are not necessarily in the order required by Python; I chose this method for pedagogical reasons.

First Things

The markov-tpop.py script requires the services of the following standard Python modules:

import sys
import random
import string

Next, two global variables are defined:

NPREF = 2
NONWORD = '\n'

A prefix consists of two words, because for English at least, “using two words to select a third is a good compromise: it seems to recreate the flavor of the input while adding its own whimsical touch." [7]

We also define a variable that represents a non-word, something that is not seen by the input text processing code. In addition to acting as a “marker word to terminate the algorithm," [8] this variable plays the part of a sentinal value, in that the programmatic appending of a non-word to the input data “simplifies the main processing loops of the program significantly." [9] Finally, to handle the case of insufficient input (e.g., a file consisting of a single word) without the use of ugly special-case code, prior to reading the input, a prefix consisting of two non-words is constructed, “which guarantees there is always enough input for the program... This has the nice benefit that the first word of the input file will be the first suffix of the fake prefix, so the generation loop needs to print only the suffixes it produces." [10]

Data Structures

In their C implementation, the authors had to create their own data structures (linked lists, hash tables, etc.) by hand, but Python's native lists and dictionaries seem almost tailor-made for programs like ours. We merely have to define a single empty dictionary:

statetab = {}

A prefix and all of its possible suffixes is known as a state, “which is standard terminology for Markov algorithms," [11] hence the name of this global variable. In our script, a prefix is a Python list that serves as a key to look up in statetab the corresponding suffix(es), which reside in a separate list.

The Driver

The “main" function looks like this:

if __name__ == '__main__':
    MAXGEN = 10000

    prefix = []

    for i in range(NPREF):
        prefix.append(NONWORD)

    build(prefix)
    add(prefix, NONWORD)

    generate(MAXGEN)

    sys.exit(0)

(Note the lack of special-purpose code to catch edge cases. Our script can handle empty input files, files with a single word, and files with two words.) The initial prefix is constructed, which is then used to kick off input file processing. A non-word is made the synthetic suffix for the last prefix prior to generating the output text. For grins, we limit the program output to 10,000 generated words.

Processing Input

The work of processing the input file is divided between two functions. The simplest way to obtain input data is to force the user to provide the contents of a file to stdin. (On the Unix command line, this may be accomplished by cat-ing a text file and then piping the output to our Python script.)

def build(prefix):
    for line in sys.stdin.readlines():
        if len(line.strip(
                string.whitespace)) == 0:
            continue
        for word in line.split():
            add(prefix, word)
            prefix.pop(0)
            prefix.append(word)
    return

Each input line is broken up into zero or more words (as defined above). We skip lines that have no words. Otherwise, each word encountered is associated with the current prefix. Do not miss the fact that every time build() is called, prefix is modified (think “call by reference") for use by the subsequent build() invocation.

If word is the first suffix encountered by prefix, we “create" a new statetab record, otherwise the new suffix is appended to the existing list. Note that duplicate suffixes are legal. Python does not allow mutable objects like lists to serve as dictionary keys, so we have to employ immutable tuples.

def add(prefix, word):
    key = tuple(prefix)
    if key in statetab:
        statetab[key].append(word)
    else:
        statetab[key] = [word]
    return

(In their various implementations, Kernighan and Pike modify prefix in add(), but we decided to edit that parameter in build() instead, since the final call to add() in build() has no need to modify prefix.)

If we were to execute print prefix, statetab[prefix] before the return statement in add(), the following diagnostic output will be produced given our sample input sentence:

['\n', '\n'] ['Show']
['\n', 'Show'] ['your']
['Show', 'your'] ['flowcharts']
['your', 'flowcharts'] ['and']
['flowcharts', 'and'] ['conceal']
['and', 'conceal'] ['your']
['conceal', 'your'] ['tables']
['your', 'tables'] ['and']
['tables', 'and'] ['I']
['and', 'I'] ['will']
['I', 'will'] ['be']
['will', 'be'] ['mystified.']
['be', 'mystified.'] ['Show']
['mystified.', 'Show'] ['your']
['Show', 'your'] ['flowcharts', 'tables']
['your', 'tables'] ['and', 'and']
['tables', 'and'] ['I', 'your']
['and', 'your'] ['flowcharts']
['your', 'flowcharts'] ['and', 'will']
['flowcharts', 'will'] ['be']
['will', 'be'] ['mystified.', 'obvious.']
['be', 'obvious.'] ['\n']

Generating Output

In the diagnostic output, the first record in statetab consists of two non-words with a single suffix that happens to be the first input word, and the second record contains the second input word. This is an ingenious method on the authors' part for arranging the printing of the first input prefix without the use of special-purpose code.

def generate(nwords):
    prefix = []

    for i in range(NPREF):
        prefix.append(NONWORD)

    random.seed()

    for i in range(nwords):
        suffixes = statetab[tuple(prefix)]
        word = random.choice(suffixes)
        if word == NONWORD:
            break
        print word,
        prefix.pop(0)
        prefix.append(word)

    return

The random Python module does a passable job of enabling generate() to choose one of the suffixes from the list. The comma in the print statement “suppresses the end-of-line character normally added at the end of the printed text." [12] In practice, word is printed along with a single space character, except for the last word to be printed, which is followed by a non-word, the ‘\n' character, which just so happens to be the “suffix" associated with the final prefix, one reason why the main driver makes that call to add(). The other reason is that processing will halt if this suffix is encountered before nwords have been printed.

So our original sample input:

Show your flowcharts and conceal your tables and I will be mystified. Show your tables and your flowcharts will be obvious.

may be rewritten many different ways:

$ ./markov-tpop.py < brooks.txt | fmt -w 40
Show your flowcharts will be obvious.

$ ./markov-tpop.py < brooks.txt | fmt -w 40
Show your flowcharts and conceal your
tables and I will be mystified. Show
your tables and I will be obvious.

$ ./markov-tpop.py < brooks.txt | fmt -w 40
Show your tables and your flowcharts
and conceal your tables and your
flowcharts will be obvious.

Fun Example

Chapter 1 of Jane Austin's Pride and Prejudice opens with this famous sentence (you may have to click twice on the link to get the actual text):

It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.

and ends like this:

The business of her life was to get her daughters married; its solace was visiting and news.

When we ran our script against Chapter 1, the following was generated:

$ ./markov-tpop.py < pridech1.txt | fmt -w 40
It is a truth universally acknowledged,
that a single man in possession of a
wife. However little known the feelings
or views of such a man may be on his
first entering a neighbourhood, this
truth is so well fixed in the minds of
the surrounding families, that he is
to take possession before Michaelmas,
and some of his servants are to be
sure! A single man in possession
of a wife. However little known the
feelings or views of such a way? You
take delight in vexing me. You have no
compassion for my little Lizzy.” “I
desire you will do no such thing. Lizzy
is not a bit better than the others;
and I am thinking of her life was to
get her daughters married; its solace
was visiting and news.

Notes

^ Brian W. Kernighan and Rob Pike, The Practice of Programming (Reading, MA: Addison Wesley, 1999).
^ Ibid., p. 61.
^ The authors kindly placed their source code here.
^ K&P, p. 62.
^ Ibid. This is a paraphrase of a quotation from Frederick P. Brooks, Jr., The Mythical Man-Month, another great book you should read.
^ My script (and sample text files) may be downloaded from my Github page.
^ K&P, p. 63.
^ Ibid., p. 69.
^ Ibid.
^ Ibid.
^ Ibid., p. 64.
^ Mark Lutz, Learning Python, 5th edition (Sebastopol, CA: O'Reilly Media, 2013), p. 362. To support both Python 2.7 and 3.x, replace the print word, statement with print(word, end=' ') and add from __future__ import print_function to the top of markov-tpop.py.