CS / SubsetSequenceFrequency / 0.1.0

Fast algorithm to compute the most common prefixes in a large dataset.

AKA "Starting Pattern Occurrence Frequency"

Results are ranked by frequency, most common first.

Applications:
  • Next letter prediction
  • Word completion
  • DNA sequencing
  • Protein sequencing
  • Computational linguistics
  • Compression algorithms

Optional parameters:

  • minScore - frequency cutoff; prefixes that occur less often are left out of the results [default: 2]
  • minLength - minimum prefix length [default: 1]
  • startsWith - fixed prefix filter, useful for predictions (not used by default)
  • maxResults - return at most this many results (by default return all matching)
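
To make the parameters above concrete, here is a minimal Python sketch of prefix frequency counting. It is not the implementation behind this algorithm; the function name `prefix_frequency`, the snake_case argument names, the inclusive `min_score` comparison, and the alphabetical tie-breaking are all assumptions made for illustration.

```python
from collections import Counter


def prefix_frequency(dataset, min_score=2, min_length=1,
                     starts_with=None, max_results=None):
    """Count how often each prefix occurs across the entries of `dataset`.

    Hypothetical sketch: the argument names mirror minScore, minLength,
    startsWith and maxResults from the README, but the exact semantics
    (inclusive cutoff, tie order) are assumptions.
    """
    counts = Counter()
    for item in dataset:
        # With a fixed prefix filter, skip entries that do not match it.
        if starts_with and not item.startswith(starts_with):
            continue
        # Only prefixes at least as long as the filter can start with it.
        shortest = max(min_length, len(starts_with) if starts_with else 0, 1)
        for end in range(shortest, len(item) + 1):
            counts[item[:end]] += 1

    # Keep prefixes that reach the cutoff (assumed inclusive), then rank by
    # frequency, most common first; ties are broken alphabetically.
    ranked = sorted(
        ((prefix, n) for prefix, n in counts.items() if n >= min_score),
        key=lambda pair: (-pair[1], pair[0]),
    )
    if max_results is not None:
        ranked = ranked[:max_results]
    return dict(ranked)
```

For example, `prefix_frequency(["John", "Joseph", "James", "Jo"], min_length=2)` returns `{'Jo': 3}`, since "Jo" is the only prefix of length 2 or more shared by at least two entries. This sketch enumerates every prefix of every entry, so it is meant to show the semantics of the parameters rather than to be fast on large datasets.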

Examples:

{
"minLength": 4,
"maxResults": 10,
"dataset": ["John", "William", "James", "Charles", "George", "Frank", "Joseph", ...]
}

Returns the top 10 baby-name prefixes of minimum length 4 from a 20th-century US baby names list (2.5 MB; the dataset shown above is trimmed). Result:

{"Mari": 1941, "Fran": 1420, "Chris": 1227, "Chri": 1227, "Juli": 1167, "Will": 1151, "Char": 1066, "Christ": 1057, "Marg": 1041, "Kath": 983}

Sample input and output against a "buzzwords" list: