This routine preprocesses text before tagging (with an algorithm such as https://algorithmia.com/algorithms/kenny/LDA or https://algorithmia.com/algorithms/nlp/KeywordsForDocumentSet). It fuses important multi-word terms (n-grams) into single tokens by replacing spaces with underscores, so "machine learning" becomes "machine_learning". Tagging routines then treat the term as one unit rather than as a collection of unrelated words. The routine can fuse a designated set of terms, or it can automatically detect which n-grams are likely to be important and fuse those.
If you already know the terms you want to fuse, supply the text to process together with a list of those terms. For example, if you enter
["Machine learning is the future of technology!",["machine learning","future of technology"]]
you will get
"machine_learning is the future_of_technology!"
If you want to automatically discover terms that should be fused, input:
- The input text, either a single String or an array of Strings.
- n for the desired type of n-gram. 2 is the most common; for example, "machine learning" and "big data" are important 2-grams (bigrams).
- A List<String> of known n-grams. If you don't know what these should be, just use an empty list.
- An int for the maximum number of n-grams to consider (default is 5).
- An int for the minimum frequency of the n-gram in the text (the number of times it appears) that will be considered (default is 5). If the first argument is a single String, frequency is the total number of times the n-gram appears in it; if it is an array of Strings, counts are taken across all entries.
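To make the parameters above concrete, here is a rough sketch of frequency-based n-gram discovery. How the routine actually scores or filters candidates is not documented here, so everything in the sketch is an assumption: the name detect_ngrams is hypothetical, candidates are ranked by raw count only, and "minimum frequency" is read as "at least that many occurrences". The list of known n-grams is left out for brevity.

```python
import re
from collections import Counter

def detect_ngrams(texts, n=2, max_ngrams=5, min_freq=5):
    """Return up to `max_ngrams` n-grams that occur at least `min_freq` times."""
    if isinstance(texts, str):
        texts = [texts]  # treat a single string as a one-entry collection
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z0-9]+", text.lower())
        # Slide a window of n words over each entry; counts span all entries.
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    # Keep only n-grams that reach the threshold, most frequent first.
    frequent = [gram for gram, c in counts.most_common() if c >= min_freq]
    return frequent[:max_ngrams]

docs = (["Big data needs machine learning."] * 3
        + ["Machine learning needs big data."] * 2)
print(detect_ngrams(docs, n=2, min_freq=5))
```

In this toy input, only "big data" and "machine learning" reach the threshold of 5, because their counts are summed across all five entries; every other bigram appears at most 3 times.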
The output is the input with all relevant n-grams fused. All strings are converted to lower case for both processing and output.
If you input just a single String or an array of Strings, the defaults described above apply: the algorithm searches for a maximum of 5 bigrams, using a minimum frequency of 5, and returns, respectively, a String or an array of Strings with all eligible bigrams fused.
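The way the output mirrors the shape of the input can be sketched as follows. This is an illustration under assumptions rather than the routine's real code: fuse_frequent_bigrams is a hypothetical name, bigrams are chosen by raw count, and the threshold is read as "at least 5 occurrences".

```python
import re
from collections import Counter

def fuse_frequent_bigrams(data, max_ngrams=5, min_freq=5):
    """Fuse frequent bigrams; return a str for str input, a list for list input."""
    entries = [data] if isinstance(data, str) else list(data)
    entries = [entry.lower() for entry in entries]
    counts = Counter()
    for entry in entries:
        words = re.findall(r"[a-z0-9]+", entry)
        counts.update(zip(words, words[1:]))  # adjacent word pairs
    chosen = [pair for pair, c in counts.most_common() if c >= min_freq]
    chosen = chosen[:max_ngrams]
    fused = []
    for entry in entries:
        for first, second in chosen:
            pattern = r"\b" + re.escape(first) + r"\s+" + re.escape(second) + r"\b"
            entry = re.sub(pattern, first + "_" + second, entry)
        fused.append(entry)
    # Mirror the input shape: one string in, one string out.
    return fused[0] if isinstance(data, str) else fused

docs = ["We like big data.", "Big data is everywhere.",
        "Nobody ignores big data.", "Big data grows.", "Trust big data."]
print(fuse_frequent_bigrams(docs))            # list in, list out
print(fuse_frequent_bigrams(" ".join(docs)))  # string in, string out
```

Here "big data" appears once per entry, so it only reaches the default threshold because counts are taken across all five entries of the list.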