PetiteProgrammer / TextSimilarity / 1.0.0

This algorithm compares documents of any kind and reports which pairs of documents are the most similar.

Some example uses for this algorithm:

  • Plagiarism detection (natural language, programming source, etc.)
  • Removal of near-duplicate copies within a directory
  • Analysis and clustering of documents

Example

Input:

{
    "files": [
        ["doc1", "this is an example input"],
        ["doc2", "this is another example input"],
        ["doc3", "the third document is not like the others"]
    ]
}

Output:

[
    [0.6825611979794738, "doc1", "doc2"],
    [0.1303428532021814, "doc2", "doc3"],
    [0.05714684431258296, "doc1", "doc3"]
]


Input

argument      type                 description

files         [[String, String]]   list of [document id, document content] pairs
num_results   Int (optional)       maximum number of results, default = 100 (fewer if fewer document pairs can be computed)
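
The cap on results can be illustrated with a small sketch. It assumes each unordered pair of documents is scored once, which matches the example above (3 documents, 3 results). The helper name `max_results` is hypothetical and not part of the algorithm's API.

```python
from math import comb


def max_results(num_documents: int, num_results: int = 100) -> int:
    """Upper bound on returned rows, assuming one row per unordered pair.

    Hypothetical helper for illustration only.
    """
    # n documents yield n * (n - 1) / 2 distinct pairs.
    possible_pairs = comb(num_documents, 2)
    return min(num_results, possible_pairs)


print(max_results(3))      # three documents, default num_results
print(max_results(20))     # 190 possible pairs, capped by the default
print(max_results(20, 5))  # explicit num_results
```

Under this assumption, fewer than `num_results` rows come back whenever the input has too few documents to form that many pairs.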


Output

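[[Float, String, String]]: a list of [similarity value, document id 1, document id 2] entries, one per document pair. In the example above, entries are ordered from most to least similar.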
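
To make the input and output shapes concrete, here is a minimal local sketch in Python. This README does not state which similarity measure the algorithm uses, so the sketch swaps in Jaccard similarity over whitespace-separated words as a stand-in; its scores will not match the values in the example above. The names `jaccard` and `rank_pairs` are hypothetical.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Stand-in measure: shared words divided by total distinct words."""
    words_a, words_b = set(a.split()), set(b.split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def rank_pairs(files, num_results=100):
    """Return [[score, id_1, id_2], ...] sorted from most to least similar."""
    scored = [
        [jaccard(text_a, text_b), id_a, id_b]
        for (id_a, text_a), (id_b, text_b) in combinations(files, 2)
    ]
    # Highest similarity first, then keep at most num_results rows.
    scored.sort(key=lambda row: row[0], reverse=True)
    return scored[:num_results]


request = {
    "files": [
        ["doc1", "this is an example input"],
        ["doc2", "this is another example input"],
        ["doc3", "the third document is not like the others"],
    ]
}

for score, id_1, id_2 in rank_pairs(request["files"]):
    print(f"{score:.4f}  {id_1}  {id_2}")
```

Even with a different measure, the doc1/doc2 pair ranks first here because those two documents share most of their words.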

