chuckwondo / MarkovText / 0.1.0

Overview

Generates sentences from the words of an input text using a very simple Markov chain algorithm with equal weighting. For an approachable explanation, including how weighting can be used, see "Automated text generator using Markov Chain".

This algorithm uses only a single input text to construct a dictionary that maps each ngram to the list of words that immediately follow each occurrence of that ngram in the text. By default, the ngram size is 2 and the minimum output size is 50 words. Both values can be specified. Once the output reaches the specified minimum number of words, generation continues until the end of a sentence is reached.
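The dictionary construction and stopping rule described above can be sketched in Python roughly as follows. This is a minimal illustration, not the algorithm's actual source: the names build_chain, ends_sentence, generate, and hard_limit are hypothetical. Splitting on whitespace, starting from the first ngram of the text, restarting there on a dead end, and treating ".", "!", or "?" as the end of a sentence are all assumptions.

```python
import random
from collections import defaultdict


def build_chain(text, ngram_size=2):
    """Map each ngram (a tuple of words) to the list of words that follow it."""
    words = text.split()  # assumption: words are separated by whitespace
    chain = defaultdict(list)
    for i in range(len(words) - ngram_size):
        ngram = tuple(words[i:i + ngram_size])
        # One entry is appended per occurrence of the ngram in the text.
        chain[ngram].append(words[i + ngram_size])
    return dict(chain)


def ends_sentence(word):
    """Assumption: a word closes a sentence if it ends in '.', '!' or '?'."""
    return word.endswith((".", "!", "?"))


def generate(text, ngram_size=2, min_words=50, hard_limit=1000, rng=None):
    """Generate at least min_words words, then stop at the next sentence end.

    hard_limit is a hypothetical safety cap for texts in which no sentence
    end can be reached.
    """
    rng = rng or random.Random()
    words = text.split()
    if len(words) <= ngram_size:
        return text  # too short to build any ngram -> follower mapping
    chain = build_chain(text, ngram_size)
    start = tuple(words[:ngram_size])  # assumption: begin with the first ngram
    output = list(start)
    while len(output) < hard_limit:
        if len(output) >= min_words and ends_sentence(output[-1]):
            break
        followers = chain.get(tuple(output[-ngram_size:]))
        if not followers:
            # Dead end (the final ngram of the text): restart from the beginning.
            output.extend(start)
            continue
        output.append(rng.choice(followers))
    return " ".join(output)
```

Under these assumptions, generate(source_text) uses the defaults of bigrams and a 50-word minimum, while passing ngram_size=1 switches to unigrams.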

Usage

Input may be either the source text from which to generate sentences, or an object of the following form:

{
    "text": String,
    "file": String,
    "min_words": Integer,
    "ngram_size": Integer
}

Either text or file must be specified. If both are specified, file is ignored.

  • text: source text from which to generate sentences
  • file: URI of the data source containing the source text
  • min_words (optional, default: 50): minimum number of words to generate in output (output may contain more than the minimum, since generation stops at the first end of sentence after the minimum is reached)
  • ngram_size (optional, default: 2): size of the ngrams used for generating the output

When the source text is provided alone, it is equivalent to providing the following object, and thus assumes the default values mentioned above:

{
    "text": "source text"
}

where source text is what was provided alone.
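The input rules above (bare text versus an object, default values, and text taking precedence over file) could be handled along these lines. This is only a sketch: normalize_input and DEFAULTS are hypothetical names, and the error handling is an assumption rather than the algorithm's documented behavior.

```python
# Default option values as stated in this README.
DEFAULTS = {"min_words": 50, "ngram_size": 2}


def normalize_input(payload):
    """Turn either a bare source text or an options object into a full options dict."""
    if isinstance(payload, str):
        # Bare text is equivalent to {"text": "source text"}.
        payload = {"text": payload}
    if not isinstance(payload, dict):
        raise TypeError("input must be a string or an object")
    if "text" not in payload and "file" not in payload:
        raise ValueError("either 'text' or 'file' must be specified")
    options = {**DEFAULTS, **payload}
    if "text" in options:
        # When both are given, file is ignored.
        options.pop("file", None)
    return options
```

For example, normalize_input("source text") would yield an object containing the text plus min_words 50 and ngram_size 2.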

The output is one or more generated sentences using words from the source text.

Examples

Given the following input:

The cat ate the fish. The dog barked and bit the cat. The dog ate John's bone.

the following is a possible output:

The dog barked and bit the cat. The dog barked and bit the cat. The dog barked
and bit the cat. The dog ate John's bone. The dog ate John's bone. The dog
barked and bit the cat. The dog barked and bit the cat. The dog barked and bit
the cat.

Of course, given the short text and the default use of bigrams, this is not very interesting.

Here's another example with the same input text, but using unigrams instead:

{
    "text": "The cat ate the fish. The dog barked and bit the cat. The dog ate John's bone.",
    "ngram_size": 1,
    "min_words": 25
}

yielding the following possible output, which is still not very interesting due to the short input text:

The cat ate John's bone. The cat ate the cat. The dog barked and bit the cat.
The cat ate John's bone. The dog ate the fish.

A data source is better suited for a large source text. Here's an example:

{
    "file": "data://chuckwondo/unstructured/siddhartha.txt"
}

with the following possible output:

"You've experienced suffering, Siddhartha, but I have lived in the bath of
repentance, sacrificing in the yellow, in the process." Siddhartha opened his
eyes, how still and beautiful is the arrow's target, That one should incessantly
hit. After the usual path of salvation? Would you like to keep him from his hair
the water quietly flowing, did not really feel like oil or soap, and others like
leaves, others like sand, and every day, a servant prepared a bath for him.