nlp / DocumentClassifier / 1.1.1

Text categorization. (Header word-cloud image made using https://www.wordclouds.com/.)

Introduction

This algorithm classifies documents using pre-defined labels and document vectors. Under the hood, it uses a document embedding model and k-nearest neighbours to find the most relevant neighbours in document space, along with a k-d tree index to reduce prediction lookup time.

Things to note

  • This algorithm can take a long time during construction operations, so be sure to set a custom timeout on your algorithm call.
  • Your dataset doesn't need to be entirely labelled. When you provide a document without a label, it will automatically be labelled 'N/A'.
  • You should prepare the algorithm's model state with data before running any prediction tasks.
  • Constructing a new dataset costs approximately 2.25 credits per document; keep that in mind before you decide to construct your own model state.
  • In a similar vein, a maximum of 70,000 training samples can be provided at once, to prevent the algorithm from reaching the timeout limit of 50 minutes. We recommend constructing large datasets iteratively, in batches of 70,000 elements or fewer, so that you don't waste any credits.
  • If your data is unclean and contains HTML tags or other content that can gum up NLP operations, take a look at our new CleanDocuments algorithm. It can format difficult documents into a form that's directly ingestible by this algorithm.
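The batching advice above can be sketched in a few lines of Python. This is an illustrative helper, not part of the algorithm's API; the dataset and batch contents are hypothetical, and in practice each batch would be uploaded as a JSON file and referenced by URL in the `data` field (inline lists are limited to fewer than 1k datapoints).

```python
MAX_BATCH = 70_000  # per-call training-sample limit noted above

def batches(datapoints, size=MAX_BATCH):
    """Yield successive chunks of at most `size` datapoints."""
    for start in range(0, len(datapoints), size):
        yield datapoints[start:start + size]

# Hypothetical dataset of 150,000 labelled documents.
dataset = [{"text": f"document {i}", "label": "demo"} for i in range(150_000)]

# 150,000 documents split into chunks of 70,000, 70,000 and 10,000;
# each chunk would be serialized to JSON, uploaded, and passed to one
# construct call.
chunks = list(batches(dataset))
```

Constructing iteratively this way means a single failed call only wastes the credits for one batch, not the whole dataset.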

Example Models

We provide a few toy/demo classification and recommendation models for you to work with.

| namespace | description |
| --- | --- |
| data://nlp/arxiv_model | Created from scraping arxiv.org abstracts; supervised using the search queries that produced the abstracts. |
| data://nlp/customer_support_model | Taken from Algorithmia's Intercom chat support data; labelled using the typical assignee (or no assignee) for a message. |
| data://nlp/amazon_reviews_model | Taken from the Amazon reviews dataset (http://jmcauley.ucsd.edu/data/amazon/); unsupervised only, unlabelled documents. |

Accuracy

We provide several accuracy options to help you fit the algorithm to your accuracy / performance requirements. Below is a comparison table constructed using our arxiv abstract query dataset.

| option | accuracy | avg compute per document (ms) |
| --- | --- | --- |
| very accurate | 82.42% | 97.8 |
| accurate | 79.27% | 57.92 |
| fast | 76.71% | 42.69 |
| very fast | 62.79% | 34.99 |

I/O

Input

{
   "mode":String,
   "mixtureSize":Int,
   "returnDocs":Bool,
   "withPrivacy":Bool,
   "option":String,
   "namespace":String,
   "n":Int,
   "data":String / Array[
      {
         "text":String,
         "label":String
      }
   ]
}
  • mode - (required) - A functionality switch between training and prediction; pass construct for construction and predict for prediction.
  • data - (required) - Either a list of datapoints (if fewer than 1k), or a URL to a JSON file containing a list of datapoints, hosted over http:// or on a data API (data://, s3://, dropbox://, etc).
  • namespace - (optional) - The data collection (or namespace) used for storing state (model files, etc). If not present, this defaults to the algorithm's temp directory.
  • mixtureSize - (optional) - When using the construct mode with the default vectorization backend, this determines the maximum number of words used to construct document vectors. If not present, this defaults to 6.
  • n - (optional) - When using the predict mode, the number of elements to return with each prediction. If not present, this defaults to 5.
  • returnDocs - (optional) - When using the predict mode, setting this value to true will return the best neighbours' documents in your topN output.
  • withPrivacy - (optional) - When using the construct mode, setting this value to true ensures that your model's documents are never returned to users, even if they set returnDocs to true. If your data is private, please remember to set this variable to true; by default it's false.
  • option - (optional) - When using the predict mode, acts as a performance / accuracy slider. Options are very fast, fast, accurate, and very accurate. For more information, see the Accuracy section.
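Putting the parameters together, a predict request is just a JSON object with these fields. A minimal sketch in Python (the namespace path and document text are illustrative, not real resources):

```python
import json

# Assemble a predict request using the fields documented above.
request = {
    "mode": "predict",
    "namespace": "data://.my/classifier",  # assumed collection name
    "option": "accurate",
    "n": 3,
    "returnDocs": True,
    "data": [{"text": "A sample document to classify."}],
}

# This JSON string is the payload the algorithm receives.
payload = json.dumps(request)
```

Optional fields can simply be omitted; the defaults listed above then apply.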

Construct Output

{  
    "indexFiles":List[String]
}

  • indexFiles - A list of the index files created during this construction event.

Predict Output

{
   "predictions":Array[
      {
         "specimen":String,
         "topN":Array[
            {
               "prediction":String,
               "distance":Double,
               "text":String
            }
         ]
      }
   ]
}
  • predictions - An array of prediction objects, one per document.
  • specimen - The same input text you provided for this prediction, making it easy to match each result back to its source document.
  • topN - An array of the nearest neighbours predicted by the model.
  • prediction - The label, in string form, for this predicted class. If this neighbour has no label, the prediction will be "N/A".
  • distance - The Euclidean distance between the specimen and this neighbour, normalized to a range between 0 and 1.
  • text - Returned only if returnDocs was set to true; the document that this neighbour references.
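Since topN is ordered by distance (smaller is closer), picking a single winning label per specimen is a one-liner. A minimal consumption sketch, with a made-up response for illustration:

```python
# A hypothetical predict response, shaped like the schema above.
response = {
    "predictions": [
        {
            "specimen": "some input text",
            "topN": [
                {"prediction": "machine learning", "distance": 0.21},
                {"prediction": "N/A", "distance": 0.35},
            ],
        }
    ]
}

# Map each specimen to the label of its nearest neighbour
# (the topN entry with the smallest distance).
best = {
    p["specimen"]: min(p["topN"], key=lambda n: n["distance"])["prediction"]
    for p in response["predictions"]
}
```

Unlabelled neighbours come back as "N/A", so you may want to filter those out before choosing a winner.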

Datapoint

The main data storage object used for construction and predicting.

{  
   "text":String,
   "label":Option[String]
}
  • text - (required) - The document you wish to process. Be careful to scrub the document of any special characters that might interfere with processing; in particular, if you build the JSON string by hand, remove or escape quotation marks (" and '), as they break the JSON string.
  • label - (optional) - The label you define for this document; used during training, but optional.
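If you serialize datapoints with a JSON library rather than concatenating strings by hand, embedded quotation marks are escaped automatically and won't break the payload. A small sketch (the sample text is arbitrary):

```python
import json

# Building the datapoint with a JSON library escapes embedded quotes,
# so they survive a round trip intact.
datapoint = {"text": 'He said "hello" and it\'s fine.', "label": "greeting"}
encoded = json.dumps(datapoint)   # quotes become \" inside the JSON string
decoded = json.loads(encoded)
```

Hand-scrubbing is then only needed for characters that interfere with the NLP processing itself, not with the JSON encoding.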

Examples

Files

Here are a couple of examples of how to construct training and prediction JSON files for large datasets:

Training:
https://gist.github.com/zeryx/0c3ca0d5e6aea1f981a76f55e40a8ef3
Prediction:
https://gist.github.com/zeryx/d185f16d8e8fe5a2130e971affd0bec3

Construct - list of datapoints

{
   "data":[
      {
         "text":"Attention-based neural encoder-decoder frameworks have been widely adopted for image captioning. Most methods force visual attention to be active for every generated word. However, the decoder likely requires little to no visual information from the image to predict non-visual words such as the and of. Other words that may seem visual can often be predicted reliably just from the language model e.g., sign after behind a red stop or phone following talking on a cell. In this paper, we propose a novel adaptive attention model with a visual sentinel.At each time step, our model decides whether to attend to the image (and if so, to which regions) or to the visual sentinel.  The model decides whether to attend to the image and where, in order to extract meaningful information for sequential word generation.   We test our method on the COCO image captioning 2015 challenge dataset and Flickr30K.   Our approach sets the new state-of-the-art by a significant margin",
         "label":"machine learning"
      }
   ],
   "mode":"construct"
}

Construct - url to json file

{  
   "data":"data://.my/collection/training_data.json",
   "namespace":"data://.my/classifier",
   "mode":"construct"
}

Output:

{"indexFiles": ["data://.my/classifier/index-0.bin"]}

Predict - list of datapoints

{
   "mode":"predict",
   "n":5,
   "namespace":"data://.my/classifier",
   "option":"very accurate",
   "data":[
      {
         "text":"This is a english sentence that contains information useful for testing."
      }
   ]
}

Predict - url to json file

{  
   "data":"data://nlp/arxiv_model/arxiv_10_testing.json",
   "namespace":"data://nlp/arxiv_model",
   "mode":"predict",
   "n":4
}

Output - single:

{  
   "predictions":[  
      {  
         "specimen":"This is a english sentence that contains information useful for testing.",
         "topN":[  
            {  
               "distance":0.25484830141067505,
               "prediction":"Diego"
            },
            {  
               "distance":0.25508666038513184,
               "prediction":"No assignee"
            },
            {  
               "distance":0.3290671706199646,
               "prediction":"Stephanie Kim"
            },
            {  
               "distance":0.32909464836120605,
               "prediction":"James Sutton"
            },
            {  
               "distance":0.35733461380004883,
               "prediction":"Jon Peck"
            }
         ]
      }
   ]
}

Output - multiple:

{  
   "predictions":[  
      {  
         "specimen":"Deep holes play an important role in the decoding of generalized ReedSolomon codes. Recently Wu and Hong citeWH found a new class of deep holes for standard ReedSolomon codes. In the present paper we give a concise method to obtain a new class of deep holes for generalized ReedSolomon codes. In particular for standard ReedSolomon codes we get the new class of deep holes given in citeWH.   Li and Wan citeL.W1 studied deep holes of generalized ReedSolomon codes GRSkfD and characterized deep holes defined by polynomials of degree k1. They showed that this problem is reduced to be a subset sum problem in finite fields. Using the method of Li and Wan we obtain some new deep holes for special ReedSolomon codes over finite fields with even characteristic. Furthermore we study deep holes of the extended ReedSolomon code i.e. Df and show polynomials of degree k2 can not define deep holes.",
         "topN":[  
            {  
               "distance":0.1830906867980957,
               "prediction":"electron"
            },
            {  
               "distance":0.2650790810585022,
               "prediction":"learning"
            },
            {  
               "distance":0.40913957357406616,
               "prediction":"economics"
            },
            {  
               "distance":0.4253190755844116,
               "prediction":"nuclear"
            }
         ]
      },
			...
      {  
         "specimen":"Various types of galaxies observed in the cosmological scales show PCygni type profiles in the Lyman alpha emission lines. A Monte Carlo code is developed to investigate the Lyman alpha line transfer in an optically thick and moving medium with a careful consideration of the scattering in the damping wing. The main features in emergent line profiles include a primary emission peak in the red part and a much weaker secondary peak in the blue part. The primary peak recedes to the red and the width of the feature increases as NHI increases. The PCygni type profile in the Lyman alpha emission line of DLA 2233131 z3.15 is noted to be similar to those found in the nearby galaxies and distant 3z3.5 H II galaxies. Our numerical results are applied to show that the DLA system may possess an expanding H I supershell with bulk flow of sim 200 kms and that the H I column density NHI is approximately 1020 cm2. From the observed Lyman alpha flux and adopting a typical size of the emission region sim 1 kpc we estimate the electron density of the H II region to be sim 1 cm3 and the mass of H II region sim 108 msun. We also conclude that it requires sim 103 O5 stars for photoionization which is comparable to firstranked H II regions found in nearby spiral and irregular galaxies.",
         "topN":[  
            {  
               "distance":0.094332754611969,
               "prediction":"electron"
            },
            {  
               "distance":0.1840466856956482,
               "prediction":"nuclear"
            },
            {  
               "distance":0.2911069989204407,
               "prediction":"economics"
            },
            {  
               "distance":0.36049818992614746,
               "prediction":"learning"
            }
         ]
      }
   ]
}