nlp / KeywordSetSimilarity / 0.1.4

Determines similarity between sets of weighted keywords. 

How it works:

Each keyword set is represented as a Map<String,Double>, where the String is the keyword and the Double is its weight. The similarity of two sets is the sum of the products of the weights of their shared keywords. For instance, if set A has keywords "dog", "cat", and "mouse" with weights 1, 2, and 2, respectively, and set B has keywords "dog", "cat", and "moose" with weights 1.5, 3, and 4, their similarity by this metric is 1*1.5 + 2*3 = 7.5. Keywords that appear in only one set ("mouse" and "moose" here) contribute nothing. This can be thought of as the inner product of word vectors.
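
The metric described above can be sketched in a few lines of Java. This is a minimal illustration, not the algorithm's actual source: the class name KeywordSimilaritySketch and the method name similarity are hypothetical, and Map.of assumes Java 9 or later.

```java
import java.util.Map;

class KeywordSimilaritySketch {
    // Sum of the products of the weights of keywords present in both sets.
    // Hypothetical helper; the real implementation may differ.
    static double similarity(Map<String, Double> a, Map<String, Double> b) {
        // Iterate over the smaller map so fewer lookups are needed.
        Map<String, Double> small = a.size() <= b.size() ? a : b;
        Map<String, Double> large = (small == a) ? b : a;
        double sum = 0.0;
        for (Map.Entry<String, Double> e : small.entrySet()) {
            Double other = large.get(e.getKey());
            if (other != null) {
                // Shared keyword: add the product of its two weights.
                sum += e.getValue() * other;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, Double> a = Map.of("dog", 1.0, "cat", 2.0, "mouse", 2.0);
        Map<String, Double> b = Map.of("dog", 1.5, "cat", 3.0, "moose", 4.0);
        System.out.println(similarity(a, b)); // prints 7.5
    }
}
```

The main method reproduces the dog/cat/mouse example from the paragraph above.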

Input format:

[{id1:{word1:weight1, word2:weight2}, id2:{word3:weight3}}, 2]

The most convenient input format for a set of keyword sets is Map<String,Map<String,Double>>, where the first String key is an identifier for the keyword set, and its value, a Map<String,Double>, is the set of keywords with their respective weights as values. The algorithm also requires an int that determines the maximum number of similar sets to return for each keyword set. 

You do not have to name the sets. If you provide just a List<Map<String,Double>>, the algorithm uses the index of each set as its id and returns the output accordingly.
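
As an illustration of the unnamed form, the sketch below turns a list of keyword sets into the named form by using each index as the id. The class and method names are hypothetical, and zero-based indices rendered as decimal strings are an assumption; the README does not say how the index is converted to an id.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class IndexIdsSketch {
    // Hypothetical helper: names each keyword set after its position in the list.
    // Assumes zero-based indices; the actual algorithm may number them differently.
    static Map<String, Map<String, Double>> withIndexIds(List<Map<String, Double>> sets) {
        // LinkedHashMap keeps the ids in list order.
        Map<String, Map<String, Double>> named = new LinkedHashMap<>();
        for (int i = 0; i < sets.size(); i++) {
            named.put(String.valueOf(i), sets.get(i));
        }
        return named;
    }
}
```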

Output format:

The output is a Map<String,Set<String>>. Each key is the identifier of a keyword set, and its value is the set of identifiers of the keyword sets most similar to it, containing at most the requested maximum number of entries.
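
To show how the input and output fit together, here is a rough sketch that scores every pair of sets and keeps the top matches for each id. It is not the algorithm's actual source. The names MostSimilarSketch, mostSimilar, and max are hypothetical, and several behaviors are assumptions the README does not specify: a set is never listed as similar to itself, sets with zero similarity may still be returned, and ties are broken by input order.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class MostSimilarSketch {
    // Inner product over the keywords the two sets share.
    static double similarity(Map<String, Double> a, Map<String, Double> b) {
        double sum = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double w = b.get(e.getKey());
            if (w != null) sum += e.getValue() * w;
        }
        return sum;
    }

    // For each id, returns the ids of up to `max` other sets, most similar first.
    static Map<String, Set<String>> mostSimilar(Map<String, Map<String, Double>> sets, int max) {
        Map<String, Set<String>> result = new LinkedHashMap<>();
        for (Map.Entry<String, Map<String, Double>> current : sets.entrySet()) {
            List<Map.Entry<String, Double>> scored = new ArrayList<>();
            for (Map.Entry<String, Map<String, Double>> other : sets.entrySet()) {
                // Assumption: a set is not compared with itself.
                if (other.getKey().equals(current.getKey())) continue;
                scored.add(Map.entry(other.getKey(),
                        similarity(current.getValue(), other.getValue())));
            }
            // Highest similarity first; the sort is stable, so ties keep input order.
            scored.sort(Map.Entry.<String, Double>comparingByValue().reversed());
            Set<String> top = new LinkedHashSet<>();
            for (Map.Entry<String, Double> s : scored) {
                if (top.size() >= max) break;
                top.add(s.getKey());
            }
            result.put(current.getKey(), top);
        }
        return result;
    }
}
```

This brute-force version compares every pair of sets, so its cost grows quadratically with the number of sets. It is meant only to make the input and output shapes concrete.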