mahout / RandomForestApply / 0.3.5

This routine applies a previously trained Mahout Random Forest classifier to a set of test data. It takes as input a JSON array of four items. The first three are Data API URLs for, respectively, the (unlabeled) test data, the model file, and the (labeled) data used to train the model. The fourth is a descriptor that details the type of each field in the dataset. The routine outputs the predicted label of each instance in the test set. Note that the test and training files are assumed to be CSVs.
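As a rough illustration of the input described above, the following Python sketch assembles the four-item JSON array in the stated order. The helper name build_input and the three URLs are hypothetical placeholders, not real paths or part of this routine; substitute your own Data API URLs.

```python
import json


def build_input(test_url, model_url, train_url, descriptor):
    """Serialize the four inputs as a JSON array, in the order the routine expects:
    test data URL, model file URL, training data URL, descriptor."""
    return json.dumps([test_url, model_url, train_url, descriptor])


# Hypothetical URLs, for illustration only.
payload = build_input(
    "data://example_user/forest/test.csv",
    "data://example_user/forest/model.seq",
    "data://example_user/forest/train.csv",
    "L C C N I",
)
print(payload)
```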

Data Format and Descriptor

We assume that the first entry of every instance is the label, though Mahout itself supports other placements. The descriptor must have the form "L X X X ...", where L designates the label and each X designates the type of its respective field: I (ignored), N (numerical), or C (categorical). Think of the descriptor as a header for the data. For example, a dataset with four attributes beyond the label, where the first two are categorical, the third is numerical, and the last is ignored, has the descriptor "L C C N I".
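The descriptor rules can be checked before submitting a job. The sketch below is a minimal validator written for illustration; parse_descriptor is a hypothetical helper, not part of Mahout or of this routine, and it only enforces the label-first convention described here.

```python
VALID_TYPES = {"I", "N", "C"}  # ignored, numerical, categorical


def parse_descriptor(descriptor):
    """Return the list of attribute types that follow the leading label token.

    Raises ValueError if the descriptor does not start with "L" or contains
    a token other than I, N, or C after it.
    """
    tokens = descriptor.split()
    if not tokens or tokens[0] != "L":
        raise ValueError("descriptor must start with the label token 'L'")
    bad = [t for t in tokens[1:] if t not in VALID_TYPES]
    if bad:
        raise ValueError("unknown field types: " + ", ".join(bad))
    return tokens[1:]


attribute_types = parse_descriptor("L C C N I")
print(attribute_types)
```

A labeled training row should then have one more field than the returned list, and an unlabeled test row exactly as many.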

Test data has no label, and by convention Mahout expects a "-" in its place. Our code currently handles this for you, as long as the label is the first field of your training data and that field is absent from the test data.
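You do not need to do this yourself, but to make the convention concrete, here is a small Python sketch of what inserting the placeholder looks like. The function add_placeholder_label and the sample rows are hypothetical and are not taken from this routine's implementation.

```python
import csv
import io


def add_placeholder_label(unlabeled_csv_text, placeholder="-"):
    """Prepend a placeholder label to every non-empty row of unlabeled CSV text."""
    reader = csv.reader(io.StringIO(unlabeled_csv_text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        if row:  # skip blank lines
            writer.writerow([placeholder] + row)
    return out.getvalue()


# Two made-up test rows matching the descriptor "L C C N I" minus the label.
sample = "red,small,3.5,x\nblue,large,1.2,y\n"
print(add_placeholder_label(sample))
```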