nlp / FuseNGrams / 1.0.0

This routine preprocesses text before tagging (with an algorithm such as https://algorithmia.com/algorithms/kenny/LDA or https://algorithmia.com/algorithms/nlp/KeywordsForDocumentSet). It fuses important multi-word terms (n-grams) into single words by replacing spaces with underscores, so that "machine learning" becomes "machine_learning". Tagging routines then "recognize" the term as a unit instead of treating it as a collection of unrelated words. You can either supply a designated set of terms to fuse, or let the routine automatically detect which n-grams are likely to be important and fuse those.

If you already know the terms you want to fuse, supply the text to process together with a list of the terms to fuse. For example, if you enter

["Machine learning is the future of technology!",["machine learning","future of technology"]]

you will get

"machine_learning is the future_of_technology!"

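The known-terms behavior can be approximated with the short Python sketch below. It is not the algorithm's actual source: the function name `fuse_known_terms` and the regex-based matching are assumptions made for illustration. It lower-cases the text and joins each listed term with underscores, which reproduces the example output above.

```python
import re

def fuse_known_terms(text, terms):
    """Lower-case `text` and join each listed multi-word term with underscores.

    Hypothetical helper for illustration, not the FuseNGrams implementation.
    """
    fused = text.lower()
    # Longest terms first, so a long term is fused before any shorter term it contains.
    for term in sorted(terms, key=len, reverse=True):
        words = term.lower().split()
        if len(words) < 2:
            continue  # a single word has nothing to fuse
        # Match the words separated by any whitespace, on word boundaries.
        pattern = r"\b" + r"\s+".join(re.escape(w) for w in words) + r"\b"
        fused = re.sub(pattern, lambda _m, w=words: "_".join(w), fused)
    return fused

print(fuse_known_terms(
    "Machine learning is the future of technology!",
    ["machine learning", "future of technology"],
))
# machine_learning is the future_of_technology!
```
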
If you want to automatically discover terms that should be fused, input

  • The input text, either a String or a String[].
  • n, the size of the n-grams to detect. 2 is the most common choice; for example, "machine learning" and "big data" are important 2-grams (bigrams).
  • A List<String> of known n-grams. If you don't know what these should be just use an empty list.
  • An int for the maximum number of n-grams to consider (default is 5).
  • An int for the minimum frequency of the n-gram in the text, that is, the number of times it must appear to be considered (default is 5). If the first argument is a String, the frequency is the total number of times the n-gram appears in it; if it is a String[], counts are taken across all entries.

The output is the input with all relevant n-grams fused. All strings are converted to lower case for both processing and output.
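
One way to picture the automatic mode is the Python sketch below. It rests on several assumptions, because the README does not say how importance is measured: here n-grams are ranked by raw frequency alone, the minimum frequency is treated as inclusive, and words are split with a simple regular expression. The names `find_frequent_ngrams`, `fuse`, and `auto_fuse` are hypothetical and are not part of the FuseNGrams API.

```python
import re
from collections import Counter

def find_frequent_ngrams(texts, n=2, max_ngrams=5, min_freq=5):
    """Return up to `max_ngrams` n-grams seen at least `min_freq` times across `texts`."""
    counts = Counter()
    for text in texts:
        # Assumed tokenizer: lower-case runs of letters, digits and apostrophes.
        words = re.findall(r"[a-z0-9']+", text.lower())
        counts.update(" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    # Assumption: rank by raw count; the real importance measure is not documented.
    frequent = [gram for gram, count in counts.most_common() if count >= min_freq]
    return frequent[:max_ngrams]

def fuse(text, terms):
    """Lower-case `text` and join each term's words with underscores."""
    fused = text.lower()
    for term in sorted(terms, key=len, reverse=True):
        words = term.lower().split()
        pattern = r"\b" + r"\s+".join(re.escape(w) for w in words) + r"\b"
        fused = re.sub(pattern, lambda _m, w=words: "_".join(w), fused)
    return fused

def auto_fuse(text_or_texts, n=2, known=(), max_ngrams=5, min_freq=5):
    """Fuse known and detected n-grams; returns a str for a str, a list for a list."""
    single = isinstance(text_or_texts, str)
    texts = [text_or_texts] if single else list(text_or_texts)
    # Counts are taken across all entries, as described for String[] input.
    terms = list(known) + find_frequent_ngrams(texts, n, max_ngrams, min_freq)
    fused = [fuse(t, terms) for t in texts]
    return fused[0] if single else fused

docs = ["Big data is big. Big data helps.", "We love big data and big data tools."]
print(auto_fuse(docs, max_ngrams=1, min_freq=3))
# ['big_data is big. big_data helps.', 'we love big_data and big_data tools.']
```

In this sketch "big data" occurs four times across the two entries, so it clears the threshold of 3, while every other bigram occurs once and is left alone. A frequency-only ranking will also pick up common filler pairs such as "of the" in larger texts, which is one reason the real routine may use a different criterion.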

If you input just a String or String[], the defaults listed above apply: the algorithm searches for a maximum of 5 bigrams, considering only those that meet the default minimum frequency of 5, and returns, respectively, a String or String[] with all eligible bigrams fused.