Extracting email addresses from an actual paper list with Google Vision API

Step 1: Scan the papers

You need access to a scanner that can produce a PDF file. Any modern corporate printer does that. The PDF file will actually contain images, one for every page scanned. Select a reasonably high resolution (DPI), otherwise Google Vision will have a hard time identifying letters.

Step 2: Get your tools

Get yourself a Linux machine, preferably Ubuntu, running natively or in a VM (VirtualBox or similar).

sudo apt-get install cmake libpoppler-cpp-dev python-poppler
git clone git@github.com:erwan-lemonnier/pdf2emails.git
cd pdf2emails
pip install -r requirements.txt

Step 3: Set up your Google Cloud credentials

That's the hard part. You need the JSON credentials file of a Google Cloud service account, with the following fields:

"auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
"auth_uri": "https://accounts.google.com/o/oauth2/auth",
"client_email": <your google client email>,
"client_id": <your google client id>,
"client_x509_cert_url": <your cert url>,
"private_key": <your private key>,
"private_key_id": <your private key id>,
"project_id": <your google cloud project name>,
"token_uri": "https://oauth2.googleapis.com/token",
"type": "service_account"
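
Before running anything, it can save time to check that the credentials file is complete. The following is a minimal sketch, not part of pdf2emails: the function names are hypothetical, and the list of expected fields is simply copied from the listing above.

```python
import json

# Field names taken from the credentials listing above.
REQUIRED_FIELDS = {
    "auth_provider_x509_cert_url",
    "auth_uri",
    "client_email",
    "client_id",
    "client_x509_cert_url",
    "private_key",
    "private_key_id",
    "project_id",
    "token_uri",
    "type",
}


def missing_credential_fields(creds):
    """Return the sorted list of expected fields that are absent or empty."""
    return sorted(f for f in REQUIRED_FIELDS if not creds.get(f))


def check_credentials_file(path):
    """Load a JSON credentials file; raise ValueError if it looks incomplete."""
    with open(path) as fh:
        creds = json.load(fh)
    missing = missing_credential_fields(creds)
    if missing:
        raise ValueError("credentials file is missing: " + ", ".join(missing))
    if creds["type"] != "service_account":
        raise ValueError("expected type 'service_account', got %r" % creds["type"])
    return creds
```

Calling check_credentials_file on your JSON file either returns the parsed credentials or tells you which fields still need to be filled in.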

Step 4: Run pdf2emails

python pdf2emails.py \
--gcloud-json-cred <path to json creds> \
--bucket-name <name of your storage bucket> \
--pdf my.pdf | grep '@' > list.csv
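
The grep '@' filter keeps every output line that contains an at sign, which may include stray text that OCR placed on the same line. If you want only the addresses, a stricter post-processing pass could look like the sketch below. It is not part of pdf2emails: the function name is hypothetical, and the regular expression is a deliberately simple assumption that does not cover every valid address.

```python
import re

# A deliberately simple pattern; it is an assumption of this sketch and
# does not cover every address allowed by the email RFCs.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def extract_emails(text):
    """Return unique, lowercased email addresses in order of first appearance."""
    seen = set()
    emails = []
    for match in EMAIL_RE.findall(text):
        email = match.lower()
        if email not in seen:
            seen.add(email)
            emails.append(email)
    return emails
```

Lowercasing before deduplication means the same address written with different capitalization on the paper list is kept only once.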

So how does it work?

pdf2emails uses poppler to extract the scanned image from each page in the PDF.
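
To give an idea of what this extraction step involves, here is a rough sketch that uses a different route than pdf2emails itself: it shells out to poppler's pdfimages command line tool instead of the Python bindings. It assumes the poppler-utils package is installed, which step 2 does not cover, and the function names are hypothetical.

```python
import subprocess
from pathlib import Path


def build_pdfimages_command(pdf_path, out_prefix):
    """Build the pdfimages invocation that writes embedded images as PNG files."""
    return ["pdfimages", "-png", str(pdf_path), str(out_prefix)]


def extract_page_images(pdf_path, out_dir):
    """Run pdfimages and return the sorted paths of the images it produced."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    prefix = out_dir / "page"
    # check=True raises CalledProcessError if pdfimages exits with an error.
    subprocess.run(build_pdfimages_command(pdf_path, prefix), check=True)
    # pdfimages names its output files <prefix>-NNN.png.
    return sorted(out_dir.glob("page-*.png"))
```

For a scanned PDF, each returned file should correspond to one scanned page, ready to be sent for text recognition.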



Erwan Lemonnier


CTO. Fullstack developer turned entrepreneur. Ex-employee of Spotify and Trustly. Author of the PyMacaron microservice framework.