Property/data extraction from PDF files

Description
<p>I would like to extract properties of popular microcontrollers, such as <a href="http://www.atmel.com/images/atmel-8271-8-bit-avr-microcontroller-atmega48a-48pa-88a-88pa-168a-168pa-328-328p_datasheet_complete.pdf">ATmega328</a>, <a href="http://www.atmel.com/images/atmel-7766-8-bit-avr-atmega16u4-32u4_datasheet.pdf">ATmega32u4</a>, and <a href="http://www.atmel.com/images/atmel-2586-avr-8-bit-microcontroller-attiny25-attiny45-attiny85_datasheet.pdf">ATtiny85</a>, from their PDF datasheets. Ultimately, this would let me compare different microcontrollers side by side.</p><p><span>Ideally, I would provide a search term, and the algorithm would parse the microcontrollers' datasheets to generate a set of values that it thinks are the answers.</span></p><p><span>For example, if I give &#34;clock speed&#34;, the algorithm would return something like this:</span></p><ul><li>ATmega328 - 20 MHz</li><li>ATmega32u4 - 16 MHz</li><li>ATtiny85 - 20 MHz</li></ul><p>This could be treated as a simple text-parsing problem. However, a heuristic approach that uses natural language processing would be really nice. For example, &#34;clock speed&#34; and &#34;clock frequency&#34; would be interpreted the same way, because the algorithm knows that the words &#34;speed&#34; and &#34;frequency&#34; are synonymous.</p>
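<p>One possible starting point for the synonym-aware lookup described above is sketched below. It is only an illustration under several assumptions: the datasheet text has already been extracted from the PDF by some other tool, the synonym table is hand-written (a real solution might draw on a thesaurus or word embeddings instead), and the sample lines in <code>SAMPLE_TEXT</code> are invented for the example rather than quoted from the actual datasheets. All function and variable names are hypothetical.</p>

```python
import re

# Hypothetical, hand-written synonym table (an assumption for this sketch).
SYNONYMS = {
    "speed": {"speed", "frequency"},
    "frequency": {"frequency", "speed"},
}

# Matches a number followed by a frequency unit, e.g. "20 MHz" or "16mhz".
VALUE_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*(k|M|G)?Hz", re.IGNORECASE)

# Invented sample lines standing in for text extracted from each datasheet.
SAMPLE_TEXT = {
    "ATmega328": "Operating voltage: 1.8 - 5.5 V\nMaximum clock frequency: 20 MHz",
    "ATmega32u4": "Maximum clock speed: 16 MHz\nFlash: 32 KB",
}


def expand_query(query):
    """Return every phrasing of the query obtained by swapping in synonyms."""
    variants = [[]]
    for word in query.lower().split():
        options = sorted(SYNONYMS.get(word, {word}))
        variants = [v + [o] for v in variants for o in options]
    return [" ".join(v) for v in variants]


def find_values(text, query):
    """Return frequency values found on lines that mention any query variant."""
    phrases = expand_query(query)
    hits = []
    for line in text.splitlines():
        if any(phrase in line.lower() for phrase in phrases):
            hits.extend(m.group(0) for m in VALUE_PATTERN.finditer(line))
    return hits


def compare(datasheets, query):
    """Map each microcontroller name to the candidate values for the query."""
    return {name: find_values(text, query) for name, text in datasheets.items()}


if __name__ == "__main__":
    for name, values in compare(SAMPLE_TEXT, "clock speed").items():
        print(name, "-", ", ".join(values))
```

<p>Because the query is expanded before matching, searching for &#34;clock speed&#34; also finds lines that say &#34;clock frequency&#34;. The value pattern here only recognizes frequencies; other properties would need their own patterns or a more general extraction step.</p>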
Status
Active
Bounty expired
Bounty
0
Tags
(no tags)