deeplearning / Parsey / 1.1.1

sawman

Note: Now introducing batch mode! Parse all your documents at lightning speed!

Introduction

Parsey McParseface is a language parsing tool that tags each word's part of speech and syntactic role within a sentence and produces a parse (in tree, graph or conll format) for other NLP algorithms to use.

I/O

Input

{  
"src": String or List[String],
"format": String,
"language": String
}
  • src - (required) - The source text(s) to parse. Sentences should end with !, ? or . and are parsed sequentially in the output. If you provide a list of strings, the algorithm runs in batch mode.
  • format - (optional) - The output format of the parse: tree, graph or conll. Defaults to tree. In batch mode, the only available formats are graph and conll.
  • language - (optional) - The language of the text you wish to parse. This selects the Parsey model for that language. Defaults to English.
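
The input rules above can be sketched as a small request builder. This is only an illustration: build_request is a hypothetical helper, not part of the algorithm, and requiring an explicit format in batch mode is a cautious assumption, since the README does not say what happens when format is omitted for a list input.

```python
import json

VALID_FORMATS = {"tree", "graph", "conll"}
BATCH_FORMATS = {"graph", "conll"}  # tree is not available in batch mode


def build_request(src, fmt=None, language=None):
    """Build the input object described above (hypothetical helper)."""
    is_batch = isinstance(src, list)
    if fmt is not None and fmt not in VALID_FORMATS:
        raise ValueError(f"unknown format: {fmt!r}")
    if is_batch:
        # Assumption: ask for an explicit format, because the default
        # (tree) is not one of the batch-mode formats.
        if fmt not in BATCH_FORMATS:
            raise ValueError("batch mode supports only graph and conll")
    payload = {"src": src}
    if fmt is not None:
        payload["format"] = fmt
    if language is not None:
        payload["language"] = language
    return payload


single = build_request("Bob brought the pizza to Alice.", fmt="tree")
batch = build_request(["One. Two.", "Three!"], fmt="conll", language="English")
print(json.dumps(single))
print(json.dumps(batch))
```

The resulting dictionaries can be serialized and sent with whichever client you normally use to call the algorithm.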

Batch mode

When you provide multiple strings as a list, the algorithm runs in batch mode. In this mode we return a list of outputs, one for each input string, and the top-level element is called batchOutput instead of output. Only the conll and graph formats are supported.

Performance strongly favors batching of sentences. Give it a shot and let us know what you think!
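
Because the top-level key differs between the two modes, client code may want to normalize responses before processing them. The sketch below assumes that each element of batchOutput has the same shape as a single output object; the README does not spell this out, so treat it as an assumption. outputs_of is a hypothetical helper name.

```python
def outputs_of(response):
    """Return a list of per-input outputs for single and batch responses.

    Assumption: each batchOutput element looks like a single "output" object.
    """
    if "batchOutput" in response:
        return list(response["batchOutput"])
    return [response["output"]]


# Hand-written stand-ins for real responses, for illustration only.
single_response = {"output": {"sentences": []}}
batch_response = {"batchOutput": [{"sentences": []}, {"sentences": []}]}

print(len(outputs_of(single_response)))
print(len(outputs_of(batch_response)))
```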

Outputs

There are three output formats: conll, tree and graph.

Conll Format

This is the CoNLL-U style of output; more information about CoNLL-U can be found in the Universal Dependencies documentation. The conll output is parsed to JSON and has the following format:

{
 "output": {
   "sentences": [
     {
       "words": [
         {
           "lemma": String,
           "head": Int,
           "features": {...},
           "universal_pos": String,
           "extra_deps": [...],
           "language_pos": String,
           "misc": String,
           "dep_relation": String,
           "form": String,
           "index": Int
         }
       ]
     }
   ]
 }
}

  • output - A wrapper object around the main JSON body.

  • sentences - A JSON array with one entry per sentence passed in as input. Each sentence has a single field, "words".

  • words - A JSON array with one entry per word in the sentence. Conll is word based, so this is the main structure of the parse.

  • form - The specific word or punctuation symbol.

  • index - The position of this word or punctuation symbol in the sentence.

  • lemma - If the language model uses base words, then this refers to the lemma.

  • head - The index of this word's syntactic head (its parent in the parse) within the sentence. A value of 0 means this word is the root of the sentence.

  • features - Morphological features from the universal feature inventory or from a defined language-specific extension; empty if unavailable.

  • universal_pos - This word's universal part of speech tag.

  • extra_deps - If supported by the language model, a list of secondary dependencies (head-deprel pairs).

  • dep_relation - Universal Stanford dependency relation to the HEAD (root iff HEAD = 0) or a defined language-specific subtype of one.

  • language_pos - This word's language-specific part of speech tag; empty string if unavailable.

  • misc - Any other annotation.
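
To illustrate how index, head and dep_relation fit together, the following sketch resolves each word's head index to the head word itself. The dependencies function is a hypothetical helper, and the sample sentence is trimmed to the fields it needs.

```python
def dependencies(sentence):
    """Return (word, relation, head word) triples for one conll sentence."""
    by_index = {word["index"]: word for word in sentence["words"]}
    triples = []
    for word in sentence["words"]:
        head = by_index.get(word["head"])  # head 0 has no entry: the root
        head_form = head["form"] if head is not None else None
        triples.append((word["form"], word["dep_relation"], head_form))
    return triples


# Trimmed to the fields used here; real output carries the full field set.
sentence = {
    "words": [
        {"form": "Algorithmia", "index": 1, "head": 2, "dep_relation": "name"},
        {"form": "Go!", "index": 2, "head": 0, "dep_relation": "ROOT"},
    ]
}

for form, relation, head_form in dependencies(sentence):
    print(form, relation, head_form)
```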

Graph Format

This format is a directed graph abstraction of the conll format, using the JSON Graph Specification:

{
 "output": {
   "sentences": [
     {
       "nodes": [
         {
           "lemma": String,
           "features": {...},
           "universal_pos": String,
           "language_pos": String,
           "id": Int,
           "misc": String,
           "form": String
         },
       ],
       "edges": [
         {"target": Int,"source": Int,"relationship": String},
       ]
     }
   ]
 }
}
  • output - A wrapper object around the main JSON body.
  • sentences - A JSON array with one entry per sentence passed in as input. Each sentence has two arrays, nodes and edges.
  • nodes - A list of word vertices.
  • edges - A list of dependency edges denoting the relationships between word tokens.

nodes

  • form - The specific word or punctuation symbol.
  • id - The position of this word or punctuation symbol in the sentence; it also serves as the vertex ID.
  • lemma - If the language model uses base words, then this refers to the lemma.
  • features - Morphological features from the universal feature inventory or from a defined language-specific extension; empty if unavailable.
  • universal_pos - This word's universal part of speech tag.
  • language_pos - This word's language-specific part of speech tag; empty string if unavailable.
  • misc - any other annotation.

edges

  • source - The ID of the vertex the edge leaves, equivalent to the "head" in the conll format.
  • target - The ID of the vertex the edge enters, equivalent to the "index" in the conll format.
  • relationship - the specific dependency relationship between the source vertex and the target.
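
Since source and target correspond to the conll head and index, a graph sentence can be mapped back to a head table. A minimal sketch follows, assuming that a node with no incoming edge is the root; it is recorded here with head 0 and the label "ROOT" to mirror the conll output, which is a choice of this example rather than something the graph format states. heads_from_graph is a hypothetical helper.

```python
def heads_from_graph(sentence):
    """Map each node id to (head id, relationship) for one graph sentence."""
    # Assumption: nodes without an incoming edge are roots (head 0).
    heads = {node["id"]: (0, "ROOT") for node in sentence["nodes"]}
    for edge in sentence["edges"]:
        heads[edge["target"]] = (edge["source"], edge["relationship"])
    return heads


# Nodes trimmed to the fields used here.
sentence = {
    "nodes": [
        {"id": 1, "form": "Bob"},
        {"id": 2, "form": "brachte"},
    ],
    "edges": [
        {"target": 1, "source": 2, "relationship": "nsubj"},
    ],
}

for node_id, (head, relationship) in heads_from_graph(sentence).items():
    print(node_id, head, relationship)
```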

Tree Format

This output format comes directly from Parsey as a string. It is not parsable as JSON, but its compact nature makes it much easier to read:

Input: Bob brought the pizza to Alice .
Parse:
brought VBD ROOT
 +-- Bob NNP nsubj
 +-- pizza NN dobj
 |   +-- the DT det
 +-- to IN prep
 |   +-- Alice NNP pobj
 +-- . . punct

How to read

The sentence Bob brought the pizza to Alice. is parsed as follows:

  • The syntactic root word of the sentence is brought, meaning the rest of the structure hangs off it. Grammatically, brought is a past tense verb (VBD).
  • The first branch from brought is Bob, which is a proper noun (NNP) and syntactically a nominal subject (nsubj).
  • The second branch from brought is pizza, which is a singular noun (NN) and syntactically a direct object (dobj). It heads the phrase the pizza.
  • The only branch from pizza is the, which is a determiner both grammatically (DT) and syntactically (det).
  • The third branch from brought is to, which is a preposition (IN). It heads the prepositional phrase to Alice.
  • The only branch from to is Alice, which is a proper noun like Bob; syntactically, however, it is the object of a preposition (pobj).
  • The final branch from brought is the period, which is attached as punctuation (punct).
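
The reading rules above can also be applied programmatically. The sketch below recovers each word's parent from the indentation of the tree string. It assumes every nesting level indents by four characters, as in the example; that is inferred from the sample output and not from a documented grammar. parse_tree is a hypothetical helper.

```python
SAMPLE = """Input: Bob brought the pizza to Alice .
Parse:
brought VBD ROOT
 +-- Bob NNP nsubj
 +-- pizza NN dobj
 |   +-- the DT det
 +-- to IN prep
 |   +-- Alice NNP pobj
 +-- . . punct
"""


def parse_tree(text):
    """Return one dict per word, with the row index of its parent."""
    rows = []
    stack = []  # stack[d] is the row index of the latest node at depth d
    base = 0    # indentation of the root line
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith(("Input:", "Parse:")):
            continue
        marker = line.find("+--")
        if marker == -1:  # the root line has no branch marker
            base = len(line) - len(line.lstrip())
            depth, body = 0, stripped
        else:
            # Assumption: each level adds four characters of indentation.
            depth = (marker - base - 1) // 4 + 1
            body = line[marker + 3:].strip()
        word, tag, relation = body.split()
        parent = stack[depth - 1] if depth > 0 else None
        rows.append({"word": word, "tag": tag,
                     "relation": relation, "parent": parent})
        del stack[depth:]
        stack.append(len(rows) - 1)
    return rows


rows = parse_tree(SAMPLE)
for row in rows:
    parent = rows[row["parent"]]["word"] if row["parent"] is not None else None
    print(row["word"], row["tag"], row["relation"], parent)
```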

Examples

Tree Example - English

{"src":"Bob brought the pizza to Alice.", "format":"tree", "language":"English"}
  Input: Bob brought the pizza to Alice .
  Parse:
  brought VBD ROOT
   +-- Bob NNP nsubj
   +-- pizza NN dobj
   |   +-- the DT det
   +-- to IN prep
   |   +-- Alice NNP pobj
   +-- . . punct

Graph Example - German

{"src":"Bob brachte die Pizza zu Alice.", "format":"graph", "language":"german"}
{
 "output": {
   "sentences": [
     {
       "nodes": [
         {
           "lemma": "",
           "features": {"fPOS": "PROPN++"},
           "universal_pos": "PROPN",
           "language_pos": "",
           "id": 1,
           "misc": "",
           "form": "Bob"
         },
         {
           "lemma": "",
           "features": {"fPOS": "VERB++"},
           "universal_pos": "VERB",
           "language_pos": "",
           "id": 2,
           "misc": "",
           "form": "brachte"
         },
         {
           "lemma": "",
           "features": {"fPOS": "DET++"},
           "universal_pos": "DET",
           "language_pos": "",
           "id": 3,
           "misc": "",
           "form": "die"
         },
         {
           "lemma": "",
           "features": {"fPOS": "NOUN++"},
           "universal_pos": "NOUN",
           "language_pos": "",
           "id": 4,
           "misc": "",
           "form": "Pizza"
         },
         {
           "lemma": "",
           "features": {"fPOS": "ADP++"},
           "universal_pos": "ADP",
           "language_pos": "",
           "id": 5,
           "misc": "",
           "form": "zu"
         },
         {
           "lemma": "",
           "features": {"fPOS": "PROPN++"},
           "universal_pos": "PROPN",
           "language_pos": "",
           "id": 6,
           "misc": "",
           "form": "Alice."
         }
       ],
       "edges": [
         {"target": 1,"source": 2,"relationship": "nsubj"},
         {"target": 3,"source": 4,"relationship": "det"},
         {"target": 4,"source": 2,"relationship": "dobj"},
         {"target": 5,"source": 6,"relationship": "case"},
         {"target": 6,"source": 4,"relationship": "nmod"}
       ]
     }
   ]
 }
}

Conll Example - Turkish

{"src":"bir, iki, üç! Algorithmia Go!","format":"conll","language":"turkish"}
 {
 "output": {
   "sentences": [
     {
       "words": [
         {
           "lemma": "",
           "head": 2,
           "features": {
             "NumType": "Ord",
             "fPOS": "NUM++ANum"
           },
           "universal_pos": "NUM",
           "extra_deps": [""],
           "language_pos": "NNum",
           "misc": "",
           "dep_relation": "amod",
           "form": "bir,",
           "index": 1
         },
         {
           "lemma": "",
           "head": 3,
           "features": {
             "NumType": "Ord",
             "fPOS": "NUM++ANum"
           },
           "universal_pos": "NUM",
           "extra_deps": [""],
           "language_pos": "NNum",
           "misc": "",
           "dep_relation": "amod",
           "form": "iki,",
           "index": 2
         },
         {
           "lemma": "",
           "head": 0,
           "features": {
             "NumType": "Ord",
             "fPOS": "NUM++ANum"
           },
           "universal_pos": "NUM",
           "extra_deps": [""],
           "language_pos": "NNum",
           "misc": "",
           "dep_relation": "ROOT",
           "form": "üç!",
           "index": 3
         }
       ]
     },
     {
       "words": [
         {
           "lemma": "",
           "head": 2,
           "features": {
             "Case": "Nom",
             "Number": "Sing",
             "Person": "3",
             "fPOS": "PROPN++Prop"
           },
           "universal_pos": "PROPN",
           "extra_deps": [""],
           "language_pos": "Prop",
           "misc": "",
           "dep_relation": "name",
           "form": "Algorithmia",
           "index": 1
         },
         {
           "lemma": "",
           "head": 0,
           "features": {
             "Case": "Loc",
             "Number": "Sing",
             "Person": "3",
             "fPOS": "PROPN++Prop"
           },
           "universal_pos": "PROPN",
           "extra_deps": [""],
           "language_pos": "Prop",
           "misc": "",
           "dep_relation": "ROOT",
           "form": "Go!",
           "index": 2
         }
       ]
     }
   ]
 }
}

Languages

We've added a bunch of new languages with the TensorFlow release of Parsey's Cousins.

The newly supported languages are:

  • Ancient Greek
  • Arabic
  • Basque
  • Bulgarian
  • Catalan
  • Chinese
  • Croatian
  • Czech
  • Danish
  • Dutch
  • English
  • Estonian
  • Finnish
  • French
  • Galician
  • German
  • Gothic
  • Greek
  • Hebrew
  • Hindi
  • Hungarian
  • Indonesian
  • Irish
  • Italian
  • Kazakh
  • Latin
  • Latvian
  • Norwegian
  • Old_Church_Slavonic
  • Persian
  • Polish
  • Portuguese
  • Portuguese-BR (Brazilian Portuguese)
  • Romanian
  • Russian
  • Slovenian
  • Spanish
  • Swedish
  • Tamil
  • Turkish

Credits

This algorithm is built on SyntaxNet, a TensorFlow implementation of the models described in Andor et al. (2016). SyntaxNet is entirely open source, and its code is available in the SyntaxNet GitHub repository.