mheimann / WMD / 0.1.0

Implementation of techniques from Kusner et al., 2015: From Word Embeddings to Document Distances.  Documents are represented as lists of word vectors using the celebrated word2vec embedding.  Implements the heuristic distance metrics described in the paper (word centroid distance and relaxed word mover's distance) to compute distances between documents for k-nearest-neighbor classification.  (The original paper showed that RWMD outperformed a wide variety of other text representations on a number of benchmark document classification tasks.)
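As a rough illustration of the two heuristics (not this package's actual API; the function names and signatures below are hypothetical), word centroid distance compares the mean word vectors of two documents, while relaxed WMD lower-bounds the full transport problem by letting each word send all of its mass to the nearest word in the other document:

```python
import numpy as np
from scipy.spatial.distance import cdist

def word_centroid_distance(doc1, doc2):
    """WCD: Euclidean distance between the mean word vectors of two documents.
    doc1, doc2: (n_words, dim) arrays of word2vec vectors."""
    return np.linalg.norm(doc1.mean(axis=0) - doc2.mean(axis=0))

def relaxed_wmd(doc1, doc2, w1=None, w2=None):
    """RWMD: the max of the two one-sided relaxations of the WMD transport
    problem; each word moves all its mass to its nearest counterpart.
    w1, w2: normalized bag-of-words weights (uniform if omitted)."""
    if w1 is None:
        w1 = np.full(len(doc1), 1.0 / len(doc1))
    if w2 is None:
        w2 = np.full(len(doc2), 1.0 / len(doc2))
    D = cdist(doc1, doc2)           # pairwise word-to-word distances
    l1 = np.dot(w1, D.min(axis=1))  # relax the constraints on doc2's side
    l2 = np.dot(w2, D.min(axis=0))  # relax the constraints on doc1's side
    return max(l1, l2)
```

Taking the max of the two relaxations gives the tighter of the two lower bounds on the true WMD, which is what the paper uses for pruning in kNN search.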

This algorithm has a preprocess mode, which takes a text file in which each line contains a label followed by the text of a document, with the two separated by a tab.  The user must also pass in the path to an ID map file in a data collection.  This file holds a map from human-readable document labels (e.g. "sports", "politics") to numeric, algorithm-readable labels (e.g. 0, 1).  If a file already exists at this path, the labels are converted using that ID map, so that test data is converted the same way the training data was, which is important.  If no file exists at this path, one is built while preprocessing the data and written to that path.  Currently, the preprocess mode assumes that even test data has a tab-separated label before the document text on each line, though the labels need not be used at test time (unless the user wants to compute accuracy on the test set).  If test data is unlabeled, dummy labels may be added to each line of the data file to satisfy the required format.  The preprocessed data is written to the same data collection as the original data, and a message to this effect is returned.
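The label-mapping logic can be sketched as follows (a minimal sketch, assuming the tab-separated format described above; the function name and the use of pickle for the ID map are illustrative, not this package's actual implementation):

```python
import os
import pickle

def preprocess(data_path, id_map_path):
    """Read lines of the form "<label>\t<text>", converting human-readable
    labels to numeric IDs.  Reuses an existing ID map file if one is found
    at id_map_path (e.g. one built from the training split); otherwise a
    new map is built and written there."""
    if os.path.exists(id_map_path):
        with open(id_map_path, "rb") as f:
            id_map = pickle.load(f)  # reuse the map built from training data
    else:
        id_map = {}

    labels, docs = [], []
    with open(data_path) as f:
        for line in f:
            label, text = line.rstrip("\n").split("\t", 1)
            if label not in id_map:
                id_map[label] = len(id_map)  # assign the next numeric label
            labels.append(id_map[label])
            docs.append(text.split())

    with open(id_map_path, "wb") as f:
        pickle.dump(id_map, f)  # persist so the test split maps identically
    return labels, docs
```

Running this on the training file first, then on the test file with the same `id_map_path`, guarantees that "sports" maps to the same integer in both splits.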

Additionally, this algorithm may be run in performance mode, where it performs kNN classification.  Here the user must pass in training data (a pickle file as returned by this algorithm's preprocess mode), training labels, test data, and the number of neighbors to use for kNN classification.  Optionally, the user may also pass in the test labels, in which case the program returns the classification accuracy; otherwise it returns a message informing the user that the predicted labels for the test data have been written to the data collection.  (Either way, the predicted labels are written to the data collection.)
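The classification step amounts to standard majority-vote kNN under one of the document distances above.  A minimal sketch, with a pluggable distance function standing in for WCD or RWMD (names here are illustrative, not the package's API):

```python
import numpy as np
from collections import Counter

def knn_predict(train_docs, train_labels, test_docs, k, dist):
    """Predict a label for each test document by majority vote among its
    k nearest training documents under the distance function `dist`
    (e.g. word centroid distance or relaxed WMD)."""
    preds = []
    for doc in test_docs:
        d = np.array([dist(doc, t) for t in train_docs])
        nearest = np.argsort(d)[:k]               # indices of k closest docs
        votes = Counter(train_labels[i] for i in nearest)
        preds.append(votes.most_common(1)[0][0])  # majority-vote label
    return preds
```

Given the predicted labels, accuracy (when test labels are supplied) is just the fraction of predictions that match.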