nlp / KeywordsForDocumentSet / 0.1.7

Given a set of documents (represented as a List<String>) and a maximum number of keywords to return per document, returns a list with one entry per document. Each entry contains the most relevant keywords (as measured by weight) for the corresponding document.

How it works:

Within a given document, a keyword's weight increases with the number of times it appears in that document and decreases with the number of other documents in which it appears. This is a simple implementation of tf-idf (term frequency-inverse document frequency) scoring.
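
The scoring described above can be sketched in a few lines of Python. This is only an illustration, not the algorithm's actual source: the function name `keywords_for_document_set`, the lowercase whitespace tokenization, the exact weight formula (raw count times `log(N / df)`), and the alphabetical tie-breaking are all assumptions made for the example.

```python
import math
from collections import Counter


def keywords_for_document_set(documents, max_keywords):
    """Return up to max_keywords top-weighted terms per document, in input order.

    Hypothetical helper for illustration; not the published implementation.
    """
    # Assumption: tokens are lowercased, whitespace-separated words.
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: the number of documents each term occurs in.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    results = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        # Assumption: weight = term count * log(N / document frequency).
        weights = {
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        }
        # Highest weight first; ties are broken alphabetically so the
        # result is deterministic.
        ranked = sorted(weights, key=lambda term: (-weights[term], term))
        results.append(ranked[:max_keywords])
    return results


docs = ["word1 word2 word3", "word4 word5", "word2 word5"]
print(keywords_for_document_set(docs, 2))
# [['word1', 'word3'], ['word4', 'word5'], ['word2', 'word5']]
```

Under this formula a term that occurs in every document gets a weight of zero, so it is only returned when a document has too few distinctive terms to fill its quota. The real algorithm may tokenize, weight, or break ties differently.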

Input format:

A two-element list: the set of documents (represented as a List<String>), followed by the maximum number of keywords to return per document:

e.g. [["word1 word2 word3", "word4 word5", "word2 word5"], 2]

Output format:

A list with one entry per input document, in the same order as the input. Each entry contains the keywords that are most relevant for that document.