nlp / KeywordAnalysisForReviews / 0.1.0
This algorithm identifies the important terms that differentiate good reviews from bad reviews. It is based on tf-idf scoring (a version of which is implemented for general multiple-document keyword identification in /nlp/KeywordsForDocumentSet), but unlike that version, it calculates term frequencies over the entire set of reviews sharing a given rating. It takes as input:
  • The set of reviews (as a String[])
  • The rating of each review (as an Integer[]). The i-th review in the first argument corresponds to the i-th rating in the second.
  • The number of rating options (generally 5, for a 1-5 star system)
  • The maximum number of keywords to return for each rating.
It returns the highest-weighted keywords for each rating, listed in order from the lowest rating to the highest. Terms that appear frequently across all rating classes are assigned low weight and are therefore unlikely to appear as keywords.
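The idea of pooling term frequencies per rating can be illustrated with a minimal Python sketch. This is not the algorithm's actual implementation: the function name keywords_by_rating, the regex tokenizer, the log-based idf formula, the alphabetical tie-breaking, and the assumption that ratings run from 1 to the number of rating options are all choices made here for illustration.

```python
import math
import re
from collections import Counter


def keywords_by_rating(reviews, ratings, num_ratings=5, max_keywords=10):
    """Return one keyword list per rating, from rating 1 up to num_ratings.

    Hypothetical helper: ratings are assumed to be integers in 1..num_ratings.
    """
    if len(reviews) != len(ratings):
        raise ValueError("reviews and ratings must have the same length")

    # Term counts pooled over every review that shares a rating.
    counts = [Counter() for _ in range(num_ratings)]
    for text, rating in zip(reviews, ratings):
        # Assumed tokenizer: lowercase runs of letters and apostrophes.
        counts[rating - 1].update(re.findall(r"[a-z']+", text.lower()))

    # Number of rating classes that actually contain reviews.
    n_classes = sum(1 for c in counts if c)

    # Class frequency: in how many rating classes each term occurs.
    class_freq = Counter()
    for c in counts:
        class_freq.update(c.keys())

    result = []
    for c in counts:
        total = sum(c.values())
        # tf (share of the class's tokens) times idf (log of class rarity).
        scores = {
            term: (n / total) * math.log(n_classes / class_freq[term])
            for term, n in c.items()
        }
        # Terms present in every class score 0 and are dropped.
        ranked = sorted(
            (t for t in scores if scores[t] > 0),
            key=lambda t: (-scores[t], t),  # highest weight first, then A-Z
        )
        result.append(ranked[:max_keywords])
    return result


reviews = ["great phone great battery", "terrible phone broken screen"]
ratings = [5, 1]
for stars, kws in enumerate(keywords_by_rating(reviews, ratings, 5, 2), start=1):
    print(stars, kws)
```

In this toy input, "phone" occurs in both the 1-star and 5-star classes, so its weight is zero and it is not returned for either rating, which mirrors the behaviour described above for terms common to all rating classes.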