This algorithm cleans your raw documents and prepares them for other NLP algorithms. Specifically, it removes HTML tags, emojis and other special UTF-8 characters, and all symbols other than punctuation, so that downstream algorithms can handle the text.
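To make the cleaning steps concrete, here is a minimal sketch of this kind of cleanup in Python. It is not the algorithm's actual implementation: the function name `clean_document`, the exact set of punctuation that is kept, and the choice to drop every non-ASCII character are all assumptions made for illustration.

```python
import html
import re
import string

# Assumption: a simple regex is enough to strip tags for illustration purposes.
TAG_RE = re.compile(r"<[^>]+>")

# Assumption: "punctuation" means this small set; the real algorithm may keep more or less.
KEPT_PUNCTUATION = ".,;:!?'\"()-"
ALLOWED = set(string.ascii_letters + string.digits + string.whitespace + KEPT_PUNCTUATION)


def clean_document(text: str) -> str:
    """Hypothetical helper: strip tags, emojis, and non-punctuation symbols."""
    text = TAG_RE.sub(" ", text)   # remove HTML tags
    text = html.unescape(text)     # turn entities such as &amp; into characters
    # Drop anything outside the allowed set (emojis, other non-ASCII, symbols).
    text = "".join(ch for ch in text if ch in ALLOWED)
    # Collapse the whitespace left behind by removed characters.
    return re.sub(r"\s+", " ", text).strip()


print(clean_document("<p>Hello, world! 😀 © 2017</p>"))
```

Stripping tags before unescaping entities keeps escaped markup such as `&lt;b&gt;` from being treated as a real tag.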
Applicable Scenarios and Problems
The output from this algorithm can be used directly with the Document Classifier for both construction and prediction.
|input URL||a data collection URI pointing to your raw data file.|
Your documents must be separated by the delimiter """. If no delimiter is found, the algorithm will return an exception.
An example file can be found here: https://gist.github.com/zeryx/ae1b91b94e7d09269b7f9a134edb5005
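The delimiter rule can be sketched roughly as follows. This is an illustration only, assuming the raw file is read into a single string; the name `split_documents` and the use of `ValueError` are hypothetical, and the algorithm's real exception type is not stated here.

```python
from typing import List

DELIMITER = '"""'  # the document delimiter described above


def split_documents(raw: str) -> List[str]:
    """Hypothetical helper: split a raw file into individual documents."""
    if DELIMITER not in raw:
        # Mirrors the documented behaviour: no delimiter means an error.
        raise ValueError("no document delimiter found in input")
    # Discard empty fragments so the result is the same whether the
    # delimiter sits between documents or wraps each one.
    return [part.strip() for part in raw.split(DELIMITER) if part.strip()]


docs = split_documents('first document\n"""\nsecond document\n"""\n')
print(docs)
```

Filtering out empty fragments is a design choice of this sketch, so a trailing delimiter at the end of the file does not produce an empty document.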
|output URL||a data collection URI pointing to your cleaned JSON file.|
The file name is randomly generated, and the URL will point to your "algorithm temporary data collection" folder. For more information, please read the data docs.