nlp / SentimentTimeSeries / 0.1.0

Introduction

This algorithm combines the Social Sentiment Analysis algorithm with R time series to produce a sentiment plot showing positive, negative, and neutral trends. It also produces a JSON file containing the frequency counts for each sentiment along with their corresponding dates, for use in the Forecast algorithm or any other algorithm that requires frequency counts.

Inputs

This algorithm takes a JSON object as its input, and all fields are required.

The input_file field is required and is the path to the CSV file containing your timestamps and comments. The file must not contain a header row, and the timestamp must be the first column.

"input_file" : "data://username/data_collection_name/time_comments.csv"

Note: The timestamp must be either in Unix epoch format or in a date format such as 2016/12/09. The latter can be any of various date formats, including ones with a time component such as 2016/10/19 18:42:46; you pass the format in as a separate argument (dt_format, described below). The first column must be the timestamp for the algorithm to run. The benefit of using Unix epoch timestamps is that you can pass in any date format you want, such as just the month and day.

9/12/2016, "Some message comment"
9/14/2016, "Some message comment"
1433392388, "Some message comment"
1432900542, "Some message comment"

Remember, other date formats are accepted, but whichever format you use must be passed in the JSON object under the dt_format field.
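The two timestamp styles described above can be told apart with a small check. The following is a minimal sketch, not the algorithm's actual parsing code: the helper name parse_timestamp is hypothetical, it assumes that dt_format uses strptime-style directives, and it assumes that any all-digit value is a Unix epoch timestamp in seconds.

```python
from datetime import datetime, timezone


def parse_timestamp(value, dt_format):
    """Parse a first-column value as Unix epoch seconds or a formatted date."""
    text = value.strip()
    if text.isdigit():
        # Assumption: an all-digit value is seconds since the Unix epoch (UTC).
        return datetime.fromtimestamp(int(text), tz=timezone.utc)
    # Otherwise parse with the strptime-style pattern given in dt_format.
    return datetime.strptime(text, dt_format).replace(tzinfo=timezone.utc)


from_epoch = parse_timestamp("1433392388", "%m/%d/%Y")
from_date = parse_timestamp("9/12/2016", "%m/%d/%Y")
```

Under this heuristic a compact date format with no separators (for example %Y%m%d) would be misread as an epoch value, so a real check would need to be stricter.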

The output_plot field is required and is the location where the sentiment time series plot will be stored. It must be a location in your data collections. For more information, see the Docs.

"output_plot": "data://username/data_collection_name/sent_timeseries_plot.png"

The output_file field is required and is the location where the sentiment time series file holding the JSON data will be stored. It must be a location in your data collections. For more information, see the Docs.

"output_file": "data://username/data_collection_name/sent_freq_file.json"

The start field is required and is a JSON array holding the year and month in which the time series begins, while the end field holds the year and month in which the dataset ends. You can also pass in only the year. These fields are used to create the sentiment time series plot.

"start": [2015, 4]
"end": [2016, 9]

The freq field is also required and is defined as the number of observations per unit of time (a year in these examples). If your data was collected once per day, freq would be 365; if it was collected once per month, freq would be 12; and so on.
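To sanity-check start, end, and freq before submitting, it can help to count how many observations they imply. This is a rough sketch that assumes the fields follow the conventions of R's ts() function, where the second element is a 1-based position within the year (the month when freq is 12) and a missing second element means position 1. The helper count_periods is hypothetical and not part of the algorithm.

```python
def count_periods(start, end, freq):
    """Count observations from start to end, inclusive, at freq per year."""
    # Assumption: a year-only value such as [2015] means position 1 of that year.
    start_year, start_pos = (list(start) + [1])[:2]
    end_year, end_pos = (list(end) + [1])[:2]
    return (end_year - start_year) * freq + (end_pos - start_pos) + 1


# April 2015 through September 2016 with monthly data.
n_months = count_periods([2015, 4], [2016, 9], 12)
print(n_months)  # 18
```

If the count differs greatly from the number of distinct periods in your CSV, the start, end, or freq value is probably not what you intended.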

The dt_format field is required and must match the date format of your data. For example, if your date is 5/26/2016, then dt_format should be:

"dt_format": "%m/%d/%Y"

The tm_zone field is required and is the time zone in which your data was collected (for example, "GMT").

Here is a full sample JSON input:

{
    "input_file": "data://username/data_collection_name/time_comments.csv",
    "output_plot": "data://username/data_collection_name/sent_timeseries_plot.png",
    "output_file": "data://username/data_collection_name/sent_freq_file.json",
    "start": [2015, 4],
    "end": [2016, 9],
    "freq": 12,
    "dt_format": "%m/%Y",
    "tm_zone": "GMT"
}
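Because every field is required, it may be worth checking the input object before sending it. The sketch below builds the sample input as a Python dictionary and reports any missing fields. The names REQUIRED_FIELDS and missing_fields are hypothetical helpers for illustration, and the check covers only the presence of each field, not its value.

```python
import json

# Field names taken from the sample input in this README.
REQUIRED_FIELDS = (
    "input_file", "output_plot", "output_file",
    "start", "end", "freq", "dt_format", "tm_zone",
)


def missing_fields(payload):
    """Return the required field names that are absent from payload."""
    return [name for name in REQUIRED_FIELDS if name not in payload]


payload = {
    "input_file": "data://username/data_collection_name/time_comments.csv",
    "output_plot": "data://username/data_collection_name/sent_timeseries_plot.png",
    "output_file": "data://username/data_collection_name/sent_freq_file.json",
    "start": [2015, 4],
    "end": [2016, 9],
    "freq": 12,
    "dt_format": "%m/%Y",
    "tm_zone": "GMT",
}

missing = missing_fields(payload)
if missing:
    raise ValueError("Missing required fields: " + ", ".join(missing))

# Serialize to the JSON text that would be passed as the algorithm input.
payload_json = json.dumps(payload)
```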

Outputs

This algorithm outputs two files: an R-generated plot of the sentiment time series and a JSON file containing the sentiment frequencies.

Sentiment Time Series Plot

The sentiment time series plot shows the frequency on the y-axis and the time in numeric form (2016.05 for May of 2016) on the x-axis. The positive sentiment line is shown in green, the negative in red, and the neutral in blue.

Sentiment Time Series Frequency JSON File

The sentiment time series frequency JSON file is intended for use with any forecasting algorithm. It is split into positive, negative, and neutral time series, each with its dates and the corresponding frequencies.

{
    "pos": {"tm": ["01/01/2016", "01/21/2016"], "freq": [2, 1]},
    "neg": {"tm": ["02/13/2016", "02/18/2016"], "freq": [1, 1]},
    "neu": {"tm": ["01/02/2016", "01/05/2016"], "freq": [1, 1]}
}
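Before handing this file to a forecasting step, you may want to pair each date with its count. The following sketch assumes the file has exactly the structure of the sample above (keys pos, neg, and neu, each holding parallel tm and freq arrays). The function name to_series is hypothetical, and the sample text is inlined here rather than read from a data collection.

```python
import json

# Inlined copy of the sample output shown in this README.
sample = """
{
    "pos": {"tm": ["01/01/2016", "01/21/2016"], "freq": [2, 1]},
    "neg": {"tm": ["02/13/2016", "02/18/2016"], "freq": [1, 1]},
    "neu": {"tm": ["01/02/2016", "01/05/2016"], "freq": [1, 1]}
}
"""


def to_series(freq_json):
    """Map each sentiment label to a list of (date, frequency) pairs."""
    data = json.loads(freq_json)
    return {
        label: list(zip(series["tm"], series["freq"]))
        for label, series in data.items()
    }


series = to_series(sample)
for label, points in series.items():
    print(label, points)
```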

Note: The locations of both the plot and the file outputs are set in the input JSON object.