This algorithm finds the most unexpected events in a set of geographic events relative to a reference set of events. It takes as input a Python dictionary with the following entries:
- "reference" - a reference set of events as a list of latitude/longitude pairs
- "data" - the evaluation set, also a list of latitude/longitude pairs
- "n" - the maximum number of events to return as outliers.
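As an illustration, an input dictionary might look like the one below. The coordinates are invented for the example, the variable name is hypothetical, and the ordering of each pair as latitude first is an assumption based on the wording above.

```python
# Hypothetical input; all coordinates are made up for illustration.
# Each pair is assumed to be ordered [latitude, longitude].
example_input = {
    "reference": [
        [40.71, -74.00],
        [40.72, -74.01],
        [40.73, -73.99],
        [40.70, -74.02],
    ],
    "data": [
        [40.71, -74.01],   # close to the reference events
        [34.05, -118.24],  # far from the reference events
    ],
    "n": 1,  # return at most one outlier
}
```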
The algorithm uses the reference set to train a probabilistic model of event occurrences, specifically a kernel density estimate with Gaussian kernels. It then evaluates the log probability of every event in the evaluation set under that model and returns a dictionary with the following entries:
- 'outliers' - whose value is a list of the indices of the n least probable events in the evaluation set
- 'logprobs' - itself a dictionary whose keys are the indices returned above, with corresponding values being the log probabilities of the events.
- 'all_lp' - a list of the log probabilities of the events in the evaluation set in the original order. The purpose of this is to allow the user to make their own determination of what counts as an outlier. For instance, if the log probabilities of the outliers do not differ significantly from those of the rest of the set, the user may decide to ignore the classification.
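One way a user might apply 'all_lp' to second-guess the classification is sketched below. The function name `confirmed_outliers`, the parameter `k`, and the rule itself (keep an outlier only if its log probability lies more than `k` standard deviations below the mean of the whole evaluation set) are assumptions made for illustration, not part of the algorithm.

```python
import statistics


def confirmed_outliers(result, k=1.5):
    """Filter the returned outliers with a user-chosen rule.

    Hypothetical helper: keeps an index only if its log probability is
    more than k standard deviations below the mean of all_lp. The
    default k is an arbitrary choice and should be tuned per data set.
    """
    all_lp = result["all_lp"]
    mean = statistics.fmean(all_lp)
    spread = statistics.pstdev(all_lp)
    cutoff = mean - k * spread
    # Look values up in all_lp by index, so the key type of 'logprobs'
    # (int or str) does not matter.
    return [i for i in result["outliers"] if all_lp[i] < cutoff]
```

With a result whose flagged event is far less probable than the rest, the index is kept; when all log probabilities are close together, the helper returns an empty list and the classification can be ignored.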
The implementation is based on scikit-learn's kernel density estimation.
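The steps described above can be sketched with scikit-learn's `KernelDensity` class. This is a minimal sketch under stated assumptions rather than the actual implementation: the function name `find_outliers`, the `bandwidth` value, and the use of plain Euclidean distance on raw degree coordinates are all choices made for the example.

```python
import numpy as np
from sklearn.neighbors import KernelDensity


def find_outliers(params, bandwidth=0.05):
    """Sketch of the described procedure (hypothetical function name).

    bandwidth is an assumed value in degrees; the source does not say
    which bandwidth or distance metric the algorithm uses.
    """
    reference = np.asarray(params["reference"], dtype=float)
    data = np.asarray(params["data"], dtype=float)
    n = min(int(params["n"]), len(data))

    # Fit a Gaussian kernel density estimate to the reference events.
    kde = KernelDensity(kernel="gaussian", bandwidth=bandwidth)
    kde.fit(reference)

    # score_samples returns the log density of each evaluation event.
    all_lp = kde.score_samples(data)

    # Indices of the n events with the lowest log density.
    outliers = [int(i) for i in np.argsort(all_lp)[:n]]
    return {
        "outliers": outliers,
        "logprobs": {i: float(all_lp[i]) for i in outliers},
        "all_lp": [float(v) for v in all_lp],
    }
```

Treating latitude and longitude as flat coordinates is only reasonable over small areas; for widely spread events a great-circle metric would be a more careful choice.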