bubble / GetCommons / 0.1.2

Overview

Extracts the following from a given text:

  • email addresses
  • URLs
  • phone numbers
  • dates
  • times
  • credit card numbers
  • street addresses
  • PO boxes
  • zip codes
  • Bitcoin addresses
  • Litecoin addresses
  • IP addresses (IPv4 and IPv6)
  • prices
  • acronyms
  • hex colors

Possible Uses

  • Use in conjunction with web crawling.
  • Extract info from crawled online material.
  • Parse resumes (CVs).
  • Parse (OCRed?) donation forms.
  • Structure unstructured data (e.g. bodies of text).

Usage

Input

Text. Yep, just text.

Examples: https://raw.githubusercontent.com/indir1/public/master/getcommons-text-examples.txt

Example 1

Input:

9:00am April 25, 2017 Mr. Jeffrey Jones, Recruiting Manager, Council on Foreign Relations, 58 East 68th Street, New York, NY 10065.

Output:

('date', 'April 25, 2017', '9:00am April 25, 2017 Mr. Jeffrey Jo', 7, 21)

('time', '9:00am', '9:00am April 25, 2017', 0, 6)

('time', '2017', '00am April 25, 2017 Mr. Jeffrey Jo', 17, 21)

('street_address', ' 58 East 68th Street', 'eign Relations, 58 East 68th Street, New York, NY ', 90, 110)

('zip_code', '10065', ', New York, NY 10065.', 125, 130)

('acronym', 'NY', 'eet, New York, NY 10065.', 122, 124)

Explained:

(Category, Extracted, Surrounding context, Start position, End position)

Category: Category of the extracted text (e.g. date, time, street_address).

Extracted: The extracted text (e.g. 'April 25, 2017').

Surrounding context: The extracted text together with some of the text before and after it (e.g. '9:00am April 25, 2017 Mr. Jeffrey Jo').

Start position and End position: Zero-based start and end indexes of the extracted text within the input. The end index is exclusive (e.g. 7, 21 means the extracted text occupies indexes 7 through 20 of the input, 14 characters in total).
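
The positions can be checked by slicing the input with them. The snippet below is a minimal sketch in Python that takes the Example 1 input and a few of its result tuples and confirms that text[start:end] gives back the extracted string. The helper name span_matches is hypothetical and is not part of GetCommons.

```python
# Example 1 input, copied from above.
text = ("9:00am April 25, 2017 Mr. Jeffrey Jones, Recruiting Manager, "
        "Council on Foreign Relations, 58 East 68th Street, New York, NY 10065.")

# A few of the Example 1 result tuples, copied from above.
results = [
    ("date", "April 25, 2017", "9:00am April 25, 2017 Mr. Jeffrey Jo", 7, 21),
    ("time", "9:00am", "9:00am April 25, 2017", 0, 6),
    ("street_address", " 58 East 68th Street",
     "eign Relations, 58 East 68th Street, New York, NY ", 90, 110),
    ("zip_code", "10065", ", New York, NY 10065.", 125, 130),
]

def span_matches(text, result):
    """Hypothetical helper: True if text[start:end] equals the extracted string."""
    category, extracted, context, start, end = result
    return text[start:end] == extracted

for result in results:
    print(result[0], span_matches(text, result))
```

Because the end index works like the end of a Python slice, end - start is the length of the extracted text.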

Example 2

Input:

ONLINE POSTS IN THE WILD

1-12-15 Hey guys! to buy xxx drug send 4 BTC to 1LgvButDNV2rVHe9DATt6WqE8tKZEKvaK2.

2/22/2015 BLah blah blah blah scam scam scam 3 LTC to get 4 LTC in return. LKKSCYdyWP7fJDMZ1KUDbpj3yPmQ22MQrv shill wanna buy lambo. 1,000,000 € yay yay blah. :)

17.01.2017 Selling 25 blah blah at $123 !! come get it while it's hot.

4 apr 18 FB at $150. Buy blah blah blah. Tech sector at a bargain MSFT, GOOG, NVDIA, BABA. Much wow. buy buy buy. :D :D :D

11 sep Cheap and trustworthy accommodation for 10000 ¥ ! Prev. at 12000¥ per night.

Output:

('date', '1-12-15', '\n 1. 1-12-15 Blah blah blah', 8, 15)

('date', '2/22/2015', '8tKZEKvaK2.\n2. 2/22/2015 BLah blah blah', 92, 101)

('date', '17.01.2017', 'ay blah. :)\n3. 17.01.2017 Selling 25 bl', 257, 267)

('date', '4 apr 18', "e it's hot.\n4. 4 apr 18 FB at $150. B", 332, 340)

('date', '11 sep ', 'y. :D :D :D\n5. 11 sep Cheap and trust', 459, 467)

('time', '2015', 'KvaK2.\n2. 2/22/2015 BLah blah blah', 97, 101)

('time', '17.01', 'ay blah. :)\n3. 17.01.2017 Selling ', 257, 262)

('time', '2017 ', 'h. :)\n3. 17.01.2017 Selling 25 bla', 263, 268)

('time', '0000 ', 'mmodation for 10000 ¥ ! Prev. at 12', 508, 513)

('time', '2000', ' ¥ ! Prev. at 12000¥ per night.', 527, 531)

('price', '1,000,000 €', 'nna buy lambo. 1,000,000 € yay yay blah. ', 225, 236)

('price', '$123', '5 blah blah at $123 !! come get it', 293, 297)

('price', '$150', ' apr 18 FB at $150. Buy blah blah', 348, 352)

('price', '10000 ¥', 'ommodation for 10000 ¥ ! Prev. at 120', 507, 514)

('price', '12000¥', '0 ¥ ! Prev. at 12000¥ per night.', 526, 532)

('btc_address', '1LgvButDNV2rVHe9DATt6WqE8tKZEKvaK2', ' send 4 BTC to 1LgvButDNV2rVHe9DATt6WqE8tKZEKvaK2.\n2. 2/22/2015 ', 53, 87)

('ltc_address', 'LgvButDNV2rVHe9DATt6WqE8tKZEKvaK2', 'send 4 BTC to 1LgvButDNV2rVHe9DATt6WqE8tKZEKvaK2.\n2. 2/22/2015 ', 54, 87)

('ltc_address', 'LKKSCYdyWP7fJDMZ1KUDbpj3yPmQ22MQrv', 'LTC in return. LKKSCYdyWP7fJDMZ1KUDbpj3yPmQ22MQrv shill wanna bu', 167, 201)

...

('acronym', 'BTC', 'buy xyz send 4 BTC to 1LgvButDNV2', 46, 49)

...

('acronym', 'LTC', 'am scam scam 3 LTC to get 4 LTC i', 139, 142)

('acronym', 'LTC', '3 LTC to get 4 LTC in return. LKK', 152, 155)

...

('acronym', 'FB', '.\n4. 4 apr 18 FB at $150. Buy b', 342, 344)

('acronym', 'MSFT', 'r at a bargain MSFT, GOOG, NVDIA, ', 399, 403)

('acronym', 'GOOG', ' bargain MSFT, GOOG, NVDIA, BABA. ', 405, 409)

('acronym', 'NVDIA', 'in MSFT, GOOG, NVDIA, BABA. Much wo', 411, 416)

('acronym', 'BABA', ', GOOG, NVDIA, BABA. Much wow. buy', 418, 422)
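
Some of the results above sit inside a longer match: the 'time' hit '2015' falls inside the date '2/22/2015', and the first 'ltc_address' hit falls inside the 'btc_address'. A caller who only wants the widest match could post-filter the tuples. The following is a rough sketch of one way to do that. It is not part of GetCommons, drop_nested is a hypothetical helper, and it will also discard any nested result that is actually wanted.

```python
# A handful of Example 2 result tuples, copied from above.
results = [
    ("btc_address", "1LgvButDNV2rVHe9DATt6WqE8tKZEKvaK2",
     " send 4 BTC to 1LgvButDNV2rVHe9DATt6WqE8tKZEKvaK2.\n2. 2/22/2015 ", 53, 87),
    ("ltc_address", "LgvButDNV2rVHe9DATt6WqE8tKZEKvaK2",
     "send 4 BTC to 1LgvButDNV2rVHe9DATt6WqE8tKZEKvaK2.\n2. 2/22/2015 ", 54, 87),
    ("ltc_address", "LKKSCYdyWP7fJDMZ1KUDbpj3yPmQ22MQrv",
     "LTC in return. LKKSCYdyWP7fJDMZ1KUDbpj3yPmQ22MQrv shill wanna bu", 167, 201),
    ("date", "2/22/2015", "8tKZEKvaK2.\n2. 2/22/2015 BLah blah blah", 92, 101),
    ("time", "2015", "KvaK2.\n2. 2/22/2015 BLah blah blah", 97, 101),
]

def drop_nested(results):
    """Hypothetical helper: drop results whose span lies inside a longer result's span."""
    kept = []
    for r in results:
        start, end = r[3], r[4]
        nested = any(
            o[3] <= start and end <= o[4] and (o[4] - o[3]) > (end - start)
            for o in results
        )
        if not nested:
            kept.append(r)
    return kept

for category, extracted, _context, start, end in drop_nested(results):
    print(category, extracted, start, end)
```

With the five tuples above, the nested 'ltc_address' (54, 87) and 'time' (97, 101) results are dropped and the other three are kept in their original order.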

Future Work and Thoughts

  • Option to output results in a .csv file.
  • Improve precision (as in "precision vs. recall"), i.e. fewer false positives such as a bare year reported as a time.
  • Expand coverage, e.g. support more currency symbols.
  • Feel free to give suggestions.
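
Until a .csv option exists, the result tuples can be written out with Python's standard csv module. This is only a sketch of what a caller could do today, not a planned GetCommons feature. The column names in FIELDS and the function results_to_csv are assumptions chosen to mirror the tuple layout explained under Example 1.

```python
import csv
import io

# Assumed column names, mirroring (Category, Extracted, Surrounding context, Start, End).
FIELDS = ["category", "extracted", "context", "start", "end"]

def results_to_csv(results, out):
    """Hypothetical helper: write a header row and one row per result tuple."""
    writer = csv.writer(out)
    writer.writerow(FIELDS)
    writer.writerows(results)

# Write to an in-memory buffer here; a file opened with newline="" works the same way.
buffer = io.StringIO()
results_to_csv(
    [("zip_code", "10065", ", New York, NY 10065.", 125, 130)],
    buffer,
)
print(buffer.getvalue())
```

The csv module quotes fields that contain commas or newlines, so the surrounding-context column survives a round trip through csv.reader.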