Extracting email addresses from an actual paper list with Google Vision API

So you got this amazing list of email addresses that your marketing team is literally drooling over, but you can't feed it into mailchimp/sendgrid/{insert-your-favorite-mass-mailer-here} because THAT LIST IS ON PAPER.

There you think: "WTH? Printing a list of emails on paper? Who does that?"

Obviously someone did, since one day I came face to face with a 200-page pile of dead tree covered with finely printed email addresses. That was the input. Expected output: a CSV list of email addresses in an actual computer file.

This article shows the steps I took from input to output.

The secret sauce: Poppler for PDF manipulation, Pillow to handle the images extracted from the PDFs, and Google Cloud's Vision API to extract text from those images.

Step 1: Scan the papers

You need access to a scanner that can produce a pdf file. Any modern corporate printer does that. The pdf file will actually contain images, one image for every page scanned. Select a reasonably high resolution/dpi, otherwise Google Vision will have a hard time identifying letters.

Step 2: Get your tools

Get yourself a Linux machine, preferably Ubuntu, running natively or in a VM (VirtualBox or similar).

Install poppler:

sudo apt-get install cmake libpoppler-cpp-dev python-poppler

Clone pdf2emails and install its dependencies:

git clone git@github.com:erwan-lemonnier/pdf2emails.git
cd pdf2emails
pip install -r requirements.txt

Put your pdf files in your working directory.

Step 3: Set up your Google Cloud credentials

That's the hard part.

We need a Google cloud account to call the Vision API in text annotation mode, to extract text data from the scanned image of each paper page.

I'll just assume you are familiar with setting up the Google Cloud console tools. You'll need a Google service account credentials JSON file that looks like this:

{
"auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
"auth_uri": "https://accounts.google.com/o/oauth2/auth",
"client_email": <your google client email>,
"client_id": <your google client id>,
"client_x509_cert_url": <your cert url>,
"private_key": <your private key>,
"private_key_id": <your private key id>,
"project_id": <your google cloud project name>,
"token_uri": "https://oauth2.googleapis.com/token",
"type": "service_account"
}

You'll also need a Google cloud storage bucket, in which to put images fetched by the Vision api.
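
Since this is the step where things tend to go wrong, it can help to sanity-check the credentials file before running anything. The sketch below is not part of pdf2emails: it only assumes the key names shown in the sample above, and check_service_account_file is a made-up helper name.

```python
import json

# Keys taken from the sample credentials file shown above.
REQUIRED_KEYS = {"type", "project_id", "private_key", "client_email", "token_uri"}


def check_service_account_file(path):
    """Load a credentials JSON file and return its project id, or raise ValueError."""
    with open(path) as fh:
        creds = json.load(fh)
    missing = sorted(REQUIRED_KEYS - creds.keys())
    if missing:
        raise ValueError(f"credentials file is missing keys: {missing}")
    if creds["type"] != "service_account":
        raise ValueError(f"expected a service_account file, got {creds['type']!r}")
    return creds["project_id"]
```

Calling check_service_account_file on your JSON file either returns the project name or tells you which key is absent.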

Step 4: Run pdf2emails

python pdf2emails.py \
--gcloud-json-cred <path to json creds> \
--bucket-name <name of your storage bucket> \
--pdf my.pdf | grep '@' > list.csv

And voila, a csv file!

So how does it work?

pdf2emails uses poppler to extract the scanned image from each page in the PDF.
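
The script's own extraction code is not reproduced here. As a rough stand-in, the sketch below shells out to pdfimages, a command-line tool that ships with poppler-utils. That is an assumption on my side: Step 2 does not install it, so you would need `sudo apt-get install poppler-utils` first. The function names are hypothetical.

```python
import subprocess
from pathlib import Path


def pdfimages_command(pdf_path, out_prefix):
    """Build the pdfimages command line that writes every embedded image as PNG."""
    return ["pdfimages", "-png", str(pdf_path), str(out_prefix)]


def extract_page_images(pdf_path, out_dir):
    """Run pdfimages and return the PNG files it produced, in page order."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    prefix = out_dir / "page"
    # pdfimages names its output <prefix>-000.png, <prefix>-001.png, ...
    subprocess.run(pdfimages_command(pdf_path, prefix), check=True)
    return sorted(out_dir.glob("page-*.png"))
```

For a scanned document, where each page is a single image, extract_page_images("my.pdf", "scans") should return one path per page.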

Then each image is converted to PNG and uploaded to the Google storage bucket, for later access by the Vision API.
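
A minimal sketch of that convert-and-upload step, assuming the Pillow and google-cloud-storage packages are installed. The helper names (to_png, open_bucket, upload_png) are mine, not the script's, and the blob is simply named after the local file.

```python
from pathlib import Path


def to_png(src_path, dst_path):
    """Re-encode any image Pillow can read as a PNG file."""
    from PIL import Image  # imported lazily: only needed for the conversion

    with Image.open(src_path) as im:
        im.save(dst_path, format="PNG")
    return dst_path


def open_bucket(cred_path, bucket_name):
    """Return a bucket handle authenticated with the service account JSON file."""
    from google.cloud import storage  # imported lazily: only needed for real uploads

    client = storage.Client.from_service_account_json(cred_path)
    return client.bucket(bucket_name)


def upload_png(bucket, png_path):
    """Upload one PNG and return the gs:// URI to hand to the Vision API."""
    blob = bucket.blob(Path(png_path).name)
    blob.upload_from_filename(str(png_path), content_type="image/png")
    return f"gs://{bucket.name}/{blob.name}"
```

The gs:// URI returned by upload_png is what lets the Vision API read the image straight from the bucket instead of receiving the bytes in the request.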

Next we call the Vision API to extract text from that image. The tricky part is extracting the email strings from the text the Vision API identifies.
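
Here is a hedged sketch of both halves. It assumes google-cloud-vision 2.x or later (older releases expose the same types under different names), and it uses a deliberately simple regular expression that will not cover every legal address. detect_text and extract_emails are illustrative names, not the script's own functions.

```python
import re

# Pragmatic pattern: local part, '@', dotted domain ending in a 2+ letter TLD.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def detect_text(cred_path, gcs_uri):
    """Return all the text Vision detects in the image stored at gcs_uri."""
    from google.cloud import vision  # imported lazily: only needed for the API call

    client = vision.ImageAnnotatorClient.from_service_account_json(cred_path)
    image = vision.Image(source=vision.ImageSource(image_uri=gcs_uri))
    response = client.text_detection(image=image)
    if response.error.message:
        raise RuntimeError(response.error.message)
    annotations = response.text_annotations
    # The first annotation carries the whole detected text; the rest are single words.
    return annotations[0].description if annotations else ""


def extract_emails(text):
    """Return the unique email-looking strings in text, lowercased, in first-seen order."""
    seen = {}
    for match in EMAIL_RE.findall(text):
        seen.setdefault(match.lower(), None)
    return list(seen)
```

Running extract_emails over the text of each page and printing one address per line gives you something you can redirect straight into a CSV file.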

Of course, there are numerous possible improvements to that method. One is to ensure that the scanner takes high-resolution scans of your documents. Low-res scans will result in typos, like when the Vision API mistakes 'l' for '|' or 'g' for '8'.
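
Some of those confusions can be patched after the fact. The sketch below handles only the '|' case, on the assumption that a pipe inside a token containing '@' was really a lowercase 'l'. The 'g' versus '8' case cannot be fixed this way, since both characters are perfectly valid in an address. repair_pipes is a hypothetical helper name.

```python
def repair_pipes(text):
    """Replace '|' with 'l' inside whitespace-separated tokens that contain '@'."""
    repaired = []
    for token in text.split():
        if "@" in token:
            token = token.replace("|", "l")
        repaired.append(token)
    # Note: tokens are re-joined with single spaces, so line breaks are not preserved.
    return " ".join(repaired)
```

Apply it to the raw OCR text before looking for addresses, since a pattern that only accepts the usual email characters would otherwise cut the address at the pipe.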

And depending on the actual content of the document, you'll probably have to write completely different code to extract the emails from the text_annotation objects.

You could also call the Vision API for all pages simultaneously, via a batch call or by using asyncio, instead of one page at a time.
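
As an illustration of the asyncio variant, the sketch below runs a blocking per-page function in worker threads, with a cap on the number of requests in flight. ocr_page is a placeholder standing in for the real Vision call, the limit of 8 is an arbitrary choice, and asyncio.to_thread requires Python 3.9 or later.

```python
import asyncio


def ocr_page(gcs_uri):
    """Placeholder for the blocking Vision API call; returns fake text here."""
    return f"text of {gcs_uri}"


async def ocr_all(gcs_uris, max_in_flight=8):
    """OCR every page concurrently and return the results in input order."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def one(uri):
        async with semaphore:
            # Run the blocking call in a worker thread so the event loop stays free.
            return await asyncio.to_thread(ocr_page, uri)

    return await asyncio.gather(*(one(uri) for uri in gcs_uris))


if __name__ == "__main__":
    pages = ["gs://my-bucket/page-000.png", "gs://my-bucket/page-001.png"]
    for text in asyncio.run(ocr_all(pages)):
        print(text)
```

Keeping the semaphore small is a cautious default: it avoids firing 200 requests at once and running into API quota limits.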

With the script above, extracting 10,000 emails from a PDF takes about 2–3 minutes.

CTO at GoFrendly. Fullstack developer turned entrepreneur. Ex-employee of Spotify and Trustly. Author of the PyMacaron microservice framework.
