nlp / CleanDocuments / 0.1.2

Overview

This algorithm cleans your raw documents and prepares them for other nlp algorithms. Specifically, it removes HTML tags, emojis, and other non-standard UTF-8 characters, along with all non-punctuation symbols, to ensure our downstream algorithms are able to handle the text.
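The cleaning steps described above can be sketched roughly as follows. This is a minimal illustration, not the algorithm's actual implementation; the regexes and the exact character set kept are assumptions.

```python
import re
import string

def clean_document(text):
    """Approximate the described cleaning: strip HTML tags, drop emojis and
    other non-ASCII characters, and keep only letters, digits, whitespace,
    and punctuation. A sketch, not the algorithm's real code."""
    # Strip HTML tags.
    text = re.sub(r"<[^>]+>", " ", text)
    # Drop emojis and any other non-ASCII characters.
    text = text.encode("ascii", errors="ignore").decode("ascii")
    # Keep only letters, digits, whitespace, and punctuation.
    allowed = set(string.ascii_letters + string.digits
                  + string.whitespace + string.punctuation)
    text = "".join(ch for ch in text if ch in allowed)
    # Collapse runs of whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()
```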

Applicable Scenarios and Problems

This algorithm is designed to work with the Document Classifier, Doc2Vec, Word2Vec, and LDA suite of algorithms. If your data is unclean, it's recommended to preprocess it here first.

The output from this algorithm can be used directly with the Document Classifier for both construction and prediction.

Usage

Input

| Parameter | Description |
| --------- | ----------- |
| input URL | A data collection URI pointing to your raw data file. |

Your documents must be separated by the delimiter: """. If no delimiters are found, the algorithm will return an exception.
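The delimiter behaviour can be sketched like this; the exact splitting semantics (e.g. whether empty segments are discarded) are an assumption.

```python
DELIMITER = '"""'

def split_documents(raw):
    """Split a raw input file into documents on the triple-quote delimiter,
    raising if no delimiter is present (a sketch of the documented behaviour)."""
    if DELIMITER not in raw:
        # Mirrors the documented error case: input with no delimiters.
        raise ValueError("no document delimiters found in input file")
    # Assumption: surrounding whitespace is trimmed and empty segments dropped.
    return [doc.strip() for doc in raw.split(DELIMITER) if doc.strip()]
```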

An example file can be found here: https://gist.github.com/zeryx/ae1b91b94e7d09269b7f9a134edb5005

Output

| Parameter | Description |
| --------- | ----------- |
| output URL | A data collection URI pointing to your cleaned JSON file. |

The file name is randomly generated and points to your "algorithm temporary data collection" folder. For more information, please read the data docs.
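Based on the example output path below, the generated file name appears to be a random UUID inside the algorithm's temporary collection. A hypothetical reconstruction of that naming scheme:

```python
import uuid

def temp_output_path(algo_name="nlp/CleanDocuments"):
    # Hypothetical reconstruction of the output naming scheme: a random
    # UUID .json file in the algorithm's temporary data collection.
    return "data://.algo/{}/temp/{}.json".format(algo_name, uuid.uuid4())

path = temp_output_path()
```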

Example

Input

https://gist.github.com/zeryx/ae1b91b94e7d09269b7f9a134edb5005

"data://nlp/classification_datasets/example_file.txt"

Output

https://gist.github.com/zeryx/9148a2b347a2b771c78a239b61043bae

"data://.algo/nlp/CleanDocuments/temp/2f1e6f56-abf5-4b57-ac96-76df6ae9c568.json"