Intelligence_lab

Intelligence_lab / ReferenceGraphBlockingAlgorithm / 0.1.0

README.md

Overview

The reference graph blocking algorithm is a graph based algorithm that takes an input of proper names (strings) and generates a reference graph based on the similarity between names. The reference graph utilizes an onomastic gazetteer (a collection of relationships between proper names from specific origins) to support this task before the graph is optimally pruned using graph topology, and finally each name is classified using community detection. The output is a numeric vector of 'keys' relating to the proper names provided as the input.

Optimization

As this algorithm is optimized for blocking purposes keys, representing smaller groups of proper names, will be combined together in an agglomerative fashion as the goal is balanced between making sure the most similar names have the same key (i.e. are in the same block) and the blocks are optimally sized. The blocks should ideally be sufficiently small so pairwise computation is efficient (e.g. splitting a list of 7,000 family names into 70 blocks of 100) & sufficiently large so latency (the amount of time between following instructions rather than computing) does not become a problem (e.g. splitting a list of 7,000 family names into 700 blocks of 10 is less efficient due to latency). The naturally occurring result of this optimization is that small to mid sized sets of proper names (e.g. when the input is less than about 10,000 proper names) will generate blocks (i.e. sets of names with the same key) that may include a diverse range of names - this is due to the agglomerative nature of optimizing the blocks.

Background

Entity resolution is the detection of "pairs of entities in the data that actually represent the same real-world entity, and then taking some action to take this into account (Benjelloun, et al., 2009)" (Robinson & Scogings, 2017). Pairwise computation is an intractable problem and therefore approaches to reduce the number of pairwise comparisons is required - these approaches are commonly known as blocking algorithms.

Entity resolution is a problem that is fundamentally heterogeneous and therefore any single abstract or generic approach is unlikely to produce high quality results. The reference graph blocking algorithm is specifically designed to be a solution to unintentional/intentional typographical error - the most pervasive source of ambiguity. There are many other sources which will be addressed through a complementary suite of algorithms (e.g see the Unique Ordered Characters blocking algorithm).

Performance

Reference graph technology has been tested and been found to consistently outperform other blocking algorithms (Robinson, 2016). Initial published benchmarking done by Robinson (2016) indicates a consistent 1-5% improvement across a variety of data sources.

For more detail please refer to:

Robinson D. (2016) The Use of Reference Graphs in the Entity Resolution of Criminal Networks. In: Chau M., Wang G., Chen H. (eds) Intelligence and Security Informatics. PAISI 2016. Lecture Notes in Computer Science, vol 9650. Springer, Cham.

Usage

Input

The input is a list where each component is a string (character) of a length of one (i.e. there is only one element per component).

Output

The output is a numeric vector.

Examples

Input

in R

list("Robinson", "Roberts", "Robertson", "Smith", "Smithson", "Smythe", "Robins", "Gibson", "Gibbons", "Gibbins")

[[1]] [1] "Robinson"

[[2]] [1] "Roberts"

[[3]] [1] "Robertson"

[[4]] [1] "Smith"

[[5]] [1] "Smithson"

[[6]] [1] "Smythe"

[[7]] [1] "Robins"

[[8]] [1] "Gibson"

[[9]] [1] "Gibbons"

[[10]] [1] "Gibbins"

... or in JSON

[["Robinson"],["Roberts"],["Robertson"],["Smith"],["Smithson"],["Smythe"],["Robins"],["Gibson"],["Gibbons"],["Gibbins"]]

Output:

in R

c(2,2,2,3,3,3,2,1,1,1)

[1] 2 2 2 3 3 3 2 1 1 1

... or in JSON

[2,2,2,3,3,3,2,1,1,1]