Affinity analysis is an analytical technique that aims to discover relationships between the activities and preferences of specific individuals. Once recorded information has been analysed, future behaviour can be predicted statistically. For a general overview see http://en.wikipedia.org/wiki/Affinity_analysis. Specific applications include clickstream analysis and market basket analysis.
One important area of application is market basket analysis, which has widespread use in planning promotions, designs and sales strategies. Market basket analysis is necessarily somewhat open-ended, but one of the more useful angles of attack is the extraction of association rules.
Input: [url, options]
The program takes a DataAPI URL pointing to a file with one session per line. A session represents the entities that were bought, used or visited in a single recorded event. This could be the URLs seen in a given browsing session or the items bought in a single visit to a store. For example:
bread milk eggs
bread bottled_water hot_dogs lemonade
The items are not ordered and there is no customer identification. The items are separated by whitespace, and the only constraints on the format are that each item must be uniquely identified by its string and that the string may not contain whitespace.
You can also use a CSV file instead, in which values can contain any character except a comma:
bread,bottled water,hot dogs,lemonade
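The two input formats above can be read into sessions along the following lines. This is only a minimal sketch and not the service's own parser: the function name `parse_sessions` and its `csv` flag are hypothetical, and trimming spaces around CSV values is an assumption. Each session becomes a set because the items are unordered.

```python
def parse_sessions(text, csv=False):
    """Parse one session per line into a list of item sets.

    Hypothetical helper: whitespace-separated by default,
    comma-separated when csv=True.
    """
    sessions = []
    for line in text.splitlines():
        if csv:
            # Assumption: surrounding spaces are not part of a CSV value.
            items = [item.strip() for item in line.split(",")]
        else:
            items = line.split()
        # Items are unordered, so a set is a natural representation.
        items = {item for item in items if item}
        if items:
            sessions.append(items)
    return sessions


whitespace_sessions = parse_sessions(
    "bread milk eggs\nbread bottled_water hot_dogs lemonade"
)
csv_sessions = parse_sessions("bread,bottled water,hot dogs,lemonade", csv=True)
```

Note that the CSV form allows item names with spaces, such as "hot dogs", which the whitespace form cannot express.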
You can customise the behaviour of the FP-Growth algorithm by supplying your own custom parameters:
- -I <max items> The maximum number of items to include in large items sets (and rules). (default = -1, i.e. no limit.)
- -N <required number of rules> The required number of rules. (default = 10)
- -T <0=confidence | 1=lift | 2=leverage | 3=Conviction> The metric by which to rank rules. (default = confidence)
- -C <minimum metric score of a rule> The minimum metric score of a rule. (default = 0.9)
- -U <upper bound for minimum support> Upper bound for minimum support. (default = 1.0)
- -M <lower bound for minimum support> The lower bound for the minimum support. (default = 0.1)
- -D <delta for minimum support> The delta by which the minimum support is decreased in each iteration. (default = 0.05)
- -S Find all rules that meet the lower bound on minimum support and the minimum metric constraint. Turning this mode on will disable the iterative support reduction procedure to find the specified number of rules.
- -transactions <comma separated list of attribute names> Only consider transactions that contain these items. (default = no restriction)
- -rules <comma separated list of attribute names> Only print rules that contain these items. (default = no restriction)
- -use-or Use OR instead of AND for must contain list(s). Use in conjunction with -transactions and/or -rules.
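The interplay of -N, -C, -U, -M and -D can be illustrated with a small sketch of the iterative support reduction: start at the upper bound for minimum support and lower it by the delta until the required number of rules is found or the lower bound is reached. This is an illustration under stated assumptions, not the actual implementation. It uses a brute-force search in place of FP-Growth (suitable for tiny data only), ranks by confidence only, and all function names are hypothetical. The exact starting point and stopping details of the real algorithm may differ.

```python
from itertools import combinations

EPS = 1e-9  # tolerance for floating-point comparisons


def support(itemset, sessions):
    """Fraction of sessions that contain every item in itemset."""
    return sum(itemset <= s for s in sessions) / len(sessions)


def rules_at_support(sessions, min_support, min_confidence):
    """Brute-force rule search; a stand-in for FP-Growth on tiny data."""
    items = sorted(set().union(*sessions))
    rules = []
    for size in range(2, len(items) + 1):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            sup = support(itemset, sessions)
            if sup + EPS < min_support:
                continue
            # Try every split of the itemset into premise and consequence.
            for k in range(1, size):
                for premise in combinations(combo, k):
                    premise = frozenset(premise)
                    confidence = sup / support(premise, sessions)
                    if confidence + EPS >= min_confidence:
                        rules.append((premise, itemset - premise, confidence))
    return rules


def find_rules(sessions, n_rules=10, min_confidence=0.9,
               upper=1.0, lower=0.1, delta=0.05):
    """Lower the minimum support step by step until enough rules are found."""
    min_support = upper
    rules = []
    while min_support + EPS >= lower:
        rules = rules_at_support(sessions, min_support, min_confidence)
        if len(rules) >= n_rules:
            break
        min_support -= delta
    rules.sort(key=lambda rule: rule[2], reverse=True)
    return rules[:n_rules]


sessions = [
    {"bread", "milk"},
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"eggs"},
]
rules = find_rules(sessions, n_rules=2)
```

With -S, the loop would be skipped and the search run once at the lower bound, returning every rule that meets the minimum metric score.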
Output: an array of rules, where each rule contains:
- confidence: the proportion of the examples covered by the premise that are also covered by the consequence. Alternatively, this can be described as the probability that a rule is correct for a new transaction.
- lift: confidence divided by the proportion of all examples that are covered by the consequence. This is a measure of the importance of the association that is independent of support.
- leverage: the proportion of additional examples covered by both the premise and consequence above those expected if the premise and consequence were independent of each other.
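- conviction: another measure of departure from independence, commonly defined as P(premise) × P(not consequence) / P(premise and not consequence). A value of 1 indicates independence; higher values indicate a stronger rule.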
- premise: the 'premise' part of the rule.
- consequence: the 'consequence' part of the rule.
- premiseSupport: the number of transactions (sessions) in the data set which contain the premise items.
- consequenceSupport: the number of transactions (sessions) in the data set which contain the consequence items.
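As a worked illustration of the four metrics, the sketch below computes them from raw transaction counts. The function name `rule_metrics` and its arguments are hypothetical, and `joint_support` (the number of sessions containing both premise and consequence) is an assumed input that is not one of the output fields listed above. Conviction uses the commonly cited definition P(premise) × P(not consequence) / P(premise and not consequence), which may differ in detail from what the service computes.

```python
def rule_metrics(total, premise_support, consequence_support, joint_support):
    """Compute rule metrics from transaction counts (illustrative only)."""
    p_premise = premise_support / total
    p_consequence = consequence_support / total
    p_joint = joint_support / total

    # Share of premise-covered sessions that also contain the consequence.
    confidence = p_joint / p_premise
    # Confidence relative to the baseline frequency of the consequence.
    lift = confidence / p_consequence
    # Extra joint coverage beyond what independence would predict.
    leverage = p_joint - p_premise * p_consequence
    # Conviction is unbounded when the rule never fails.
    if premise_support == joint_support:
        conviction = float("inf")
    else:
        conviction = p_premise * (1 - p_consequence) / (p_premise - p_joint)

    return {
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
    }


# Example: 100 sessions, premise in 40, consequence in 50, both in 30.
metrics = rule_metrics(100, 40, 50, 30)
```

In this example the rule holds in 30 of the 40 sessions that contain the premise, so confidence is 0.75, and because the consequence appears in half of all sessions the lift is about 1.5.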