nlp / Doc2Vec / 0.6.0

doc2vec

New: Now with file saving and Word Mover's Distance support!

Introduction

This algorithm creates a vector representation of an input text of arbitrary length (a document). It uses LDA to detect topic keywords and Word2Vec to generate word vectors, then combines the keyword vectors into a single document vector. The output document vector lies in the same vector space as the Word2Vec word vectors (300 dimensions, with values in the range [-1, +1]). Notice: NLTK lemmatization support was recently added for the detected LDA keywords; this should improve overall accuracy.
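
The sketch below is a rough, assumption-laden illustration of that pipeline in Python using gensim and NLTK; the tokenization, keyword count, and model file name are placeholders, and it is not the algorithm's actual implementation.

import numpy as np
from gensim.corpora import Dictionary
from gensim.models import LdaModel, KeyedVectors
from nltk.stem import WordNetLemmatizer  # requires NLTK's wordnet data

def document_vector(doc, w2v, num_keywords=10):
    # Very naive tokenization (the real algorithm also strips HTML, non-ASCII, and new lines).
    tokens = [t.lower() for t in doc.split() if t.isalpha()]
    # Fit a one-topic LDA model on the single document to surface topic keywords.
    dictionary = Dictionary([tokens])
    lda = LdaModel([dictionary.doc2bow(tokens)], num_topics=1, id2word=dictionary)
    # Lemmatize the detected keywords before looking them up in Word2Vec.
    lemmatizer = WordNetLemmatizer()
    keywords = [lemmatizer.lemmatize(w) for w, _ in lda.show_topic(0, topn=num_keywords)]
    vectors = [w2v[w] for w in keywords if w in w2v]
    # Centroid of the keyword vectors -> one 300-dimensional document vector.
    return np.mean(vectors, axis=0) if vectors else np.zeros(w2v.vector_size)

# w2v = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)
# doc_vec = document_vector("This algorithm creates a vector representation ...", w2v)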

I/O

Single Input

input: String

  • input - (required) - A single document input. No cleaning is required; HTML, non-ASCII characters, and new lines are cleaned automatically.

Batch Input

{
    "docs": String[] or String,
    "mode": String,
    "save_path": String
}

  • docs - (required) - An array of documents (or a single document) to vectorize; the output is in the same order as the input docs.
  • mode - (optional) - Changes the mode of document vectorization. Has two modes, defaulting to wcd:
    • wmd (word mover's distance) - preserves word separation and returns the bag-of-words (nBOW) model for each document.
    • wcd (word centroid distance) - calculates the weighted centroid word vector.
  • save_path - (optional) - A path to save the results to in addition to returning them (the file-saving support noted above).
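
For illustration, a batch call through the Algorithmia Python client might look like the following; the API key is a placeholder, and the response shape matches the output sections below.

import Algorithmia

client = Algorithmia.client("YOUR_API_KEY")  # placeholder key
algo = client.algo("nlp/Doc2Vec/0.6.0")

response = algo.pipe({
    "docs": ["first document ...", "second document ..."],
    "mode": "wcd"
}).result

for entry in response["vectors"]:
    print(entry["doc"][:40], len(entry["vector"]))  # each vector has 300 dimensions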

Output - word centroid distance

{  
   "vectors":[  
      {  
         "doc": String,
         "vector": Array[Float]
      }
   ]
}

  • doc - The provided document used to create the document model.
  • vector - A 300-dimensional document vector; results are in the same order as the input docs.
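
As a usage sketch (not part of the algorithm's output), two wcd document vectors can be compared with cosine similarity; the response variable below is assumed to come from a batch call like the one shown earlier.

import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two document vectors.
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# response = algo.pipe({"docs": [doc_a, doc_b], "mode": "wcd"}).result
# v1, v2 = (entry["vector"] for entry in response["vectors"])
# print(cosine_similarity(v1, v2))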

Output - word movers distance

{"vectors": [
  {
    "doc": String
    "nbow": [
     "vector": Array[Float],
     "weight": Float,
     "word": String
    ]
  }
 ]
}

  • doc - The provided document used to create the document model.
  • vector - A 300-dimensional word vector.
  • weight - The proportional weight for this word in the model.
  • word - The word token defining this component in the nBOW model.
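
One hypothetical way to consume this output is the relaxed Word Mover's Distance lower bound (Kusner et al., 2015), where each weighted word in one document moves entirely to its nearest word in the other. The sketch below illustrates that relaxation; it is not the algorithm's internal WMD computation.

import numpy as np

def relaxed_wmd(nbow_a, nbow_b):
    # Each weighted word in document A travels to its nearest word in document B.
    vecs_b = np.array([w["vector"] for w in nbow_b])
    total = 0.0
    for w in nbow_a:
        dists = np.linalg.norm(vecs_b - np.asarray(w["vector"]), axis=1)
        total += w["weight"] * dists.min()
    return total

# nbow_a, nbow_b = (entry["nbow"] for entry in response["vectors"])
# print(relaxed_wmd(nbow_a, nbow_b))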

Examples

Single Input Example

"This algorithm creates a vector representation of an input text of arbitrary length (a document) by using LDA to detect topic keywords and Word2Vec to generate word vectors, and finally concatenating the word vectors together to form a document vector. The output document vector is within the same vector space as each Word2Vec word (300 dimensions, with a range of -1, +1)."

{  
   "vectors":[  
      {  
         "doc":"This algorithm creates a vector representation of an input text of arbitrary length (a document) by using LDA to detect topic keywords and Word2Vec to generate word vectors, and finally concatenating the word vectors together to form a document vector. The output document vector is within the same vector space as each Word2Vec word (300 dimensions, with a range of -1, +1).",
         "vector":[  
            0.027143991901539266,
            0.01981183455791325,
            0.025684720138087872,
            0.008205269230529666,
            -0.05381948390277103,
            0.013424265664070845,
            0.006839196765213275,
            ...
            0.010006775264628228,
            0.03212420269846917,
            0.01704227679874748,
            -0.00718907653936185,
            0.035260906035546206,
            -0.0422189606470056,
            0.006693855626508591,
            -0.02193701290525497,
            -0.027333557925885547,
            0.012483243364840742,
            -0.0030233282013796274,
            0.00449293394922279,
            -0.031209250737447288,
            -0.031159733771346503,
            0.009491354343481365
         ]
      }
   ]
}

Nearest words from Word2Vec to the resulting document vector:

"word": [
        ["vectors", 0.6540068387985232],
        ["vector", 0.6501382589340211],
        ["bitmap_graphics", 0.5863461494445801],
        ["dimensions", 0.5720823407173155],
        ["polyline", 0.558686375617981],
        ["tuple", 0.553295373916626],
        ["bitmap_images", 0.5529519915580752],
        ["affordances", 0.552589237689972],
        ["XSLT_stylesheet", 0.5489472746849061],
        ["raster_images", 0.5400040149688723]
      ]
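
A list like this can be reproduced, as an assumption-based sketch, with gensim's similar_by_vector applied to the returned document vector.

import numpy as np

# Assumes w2v is the same KeyedVectors model used earlier and doc_vector is the
# "vector" field returned above.
def nearest_words(w2v, doc_vector, topn=10):
    return w2v.similar_by_vector(np.asarray(doc_vector, dtype=np.float32), topn=topn)

# for word, score in nearest_words(w2v, doc_vector):
#     print(word, score)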

Batch Input Example

{
    "docs": [
        "This algorithm creates a vector representation of an input text of arbitrary length (a document) by using LDA to detect topic keywords and Word2Vec to generate word vectors, and finally concatenating the word vectors together to form a document vector. The output document vector is within the same vector space as each Word2Vec word (300 dimensions, with a range of -1, +1).",
        "The JSON Formatter was created to help with debugging. As JSON data is often output without line breaks to save space, it is extremely difficult to actually read and make sense of it. This little tool hoped to solve the problem by formatting the JSON data so that it is easy to read and debug by human beings."
    ]
}

{  
   "vectors":[  
      {  
         "doc":"This algorithm creates a vector representation of an input text of arbitrary length (a document) by using LDA to detect topic keywords and Word2Vec to generate word vectors, and finally concatenating the word vectors together to form a document vector. The output document vector is within the same vector space as each Word2Vec word (300 dimensions, with a range of -1, +1).",
         "vector":[  
            0.027143991901539266,
            0.01981183455791325,
            0.025684720138087872,
            0.008205269230529666,
            -0.05381948390277103,
            0.013424265664070845,
            0.006839196765213275,
            ...
            0.006693855626508591,
            -0.02193701290525497,
            -0.027333557925885547,
            0.012483243364840742,
            -0.0030233282013796274,
            0.00449293394922279,
            -0.031209250737447288,
            -0.031159733771346503,
            0.009491354343481365
         ]
      },
      {  
         "doc":"The JSON Formatter was created to help with debugging. As JSON data is often output without line breaks to save space, it is extremely difficult to actually read and make sense of it. This little tool hoped to solve the problem by formatting the JSON data so that it is easy to read and debug by human beings.",
         "vector":[  
            0.034090492757968605,
            0.0010834950953721937,
            -0.017244145623408258,
            0.02305529569275678,
            -0.06192096314043738,
            0.01593235426116735,
            0.019022508524358286,
            0.03136622009333223,
            ...
            0.05408464418724179,
            -0.01748840426444077,
            0.02004028484225273,
            -0.029898532608058307,
            0.024512667965609587,
            -0.029175986095651755,
            0.015900821512332186,
            -0.010429314308566973,
            -0.043275312287732966,
            -0.024509177892468873,
            0.009128060191869732
         ]
      }
   ]
}
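
To tie the pieces together, a hypothetical end-to-end comparison of the two example documents in wmd mode could look like this, reusing the algo client and the relaxed_wmd sketch from earlier sections.

docs = [
    "This algorithm creates a vector representation of an input text ...",
    "The JSON Formatter was created to help with debugging. ..."
]
response = algo.pipe({"docs": docs, "mode": "wmd"}).result
nbow_a, nbow_b = (entry["nbow"] for entry in response["vectors"])
print("relaxed WMD:", relaxed_wmd(nbow_a, nbow_b))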