Extracting email addresses from an actual paper list with Google Vision API

Step 1: Scan the papers

You need access to a scanner that can produce a PDF file; any modern corporate printer can. The PDF will contain one image for every page scanned. Select a reasonably high resolution (dpi), otherwise Google Vision will have a hard time identifying letters.

Step 2: Get your tools

Get yourself a Linux system, preferably Ubuntu, running natively or in a VM (VirtualBox or similar).

sudo apt-get install cmake libpoppler-cpp-dev python-poppler
git clone git@github.com:erwan-lemonnier/pdf2emails.git
cd pdf2emails
pip install -r requirements.txt

Step 3: Set up your Google Cloud credentials

That's the hard part. You need a JSON credentials file for a Google Cloud service account, containing the following fields:

{
  "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
  "auth_uri": "https://accounts.google.com/o/oauth2/auth",
  "client_email": <your google client email>,
  "client_id": <your google client id>,
  "client_x509_cert_url": <your cert url>,
  "private_key": <your private key>,
  "private_key_id": <your private key id>,
  "project_id": <your google cloud project name>,
  "token_uri": "https://oauth2.googleapis.com/token",
  "type": "service_account"
}

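Since a missing field tends to show up only as an authentication error later on, it can help to check the credentials file before running anything. The following is a minimal sketch, not part of pdf2emails: the function name `missing_credential_fields` is hypothetical, and the list of expected fields is simply copied from the listing above.

```python
import json

# Field names copied from the credentials listing above (an assumption:
# pdf2emails itself may not require all of them).
REQUIRED_FIELDS = {
    "auth_provider_x509_cert_url", "auth_uri", "client_email", "client_id",
    "client_x509_cert_url", "private_key", "private_key_id", "project_id",
    "token_uri", "type",
}

def missing_credential_fields(path):
    """Return the sorted names of expected fields absent from the JSON file at `path`."""
    with open(path, encoding="utf-8") as f:
        creds = json.load(f)
    return sorted(REQUIRED_FIELDS - set(creds))
```

An empty list means every expected field is present; it says nothing about whether the values are valid.
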
Step 4: Run pdf2emails

python pdf2emails.py \
--gcloud-json-cred <path to json creds> \
--bucket-name <name of your storage bucket> \
--pdf my.pdf | grep '@' > list.csv
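
The `grep '@'` filter keeps every output line containing an `@`, including any surrounding text on that line. If you want only the addresses themselves, a stricter filter could look like the sketch below. It is not part of pdf2emails: `extract_emails` is a hypothetical helper, and the regular expression is a deliberately simple approximation of an email address, not a full validator.

```python
import re

# Simple approximation of an email address: local part, "@", domain, dot, TLD.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def extract_emails(text):
    """Return unique email-like strings, lowercased, in order of first appearance."""
    seen = []
    for match in EMAIL_RE.findall(text):
        email = match.lower()
        if email not in seen:
            seen.append(email)
    return seen
```

OCR mistakes (a misread letter, a stray space inside an address) will still slip through or break a match, so the resulting list is worth a manual review.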

So how does it work?

pdf2emails uses poppler to extract the scanned image from each page in the PDF:
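
As a rough illustration of this step, and explicitly not pdf2emails' own code, the sketch below shells out to `pdfimages`, a command-line tool built on poppler, instead of using the Python bindings. It assumes `pdfimages` is installed (on Ubuntu it ships in the `poppler-utils` package); the function names are hypothetical.

```python
import subprocess

def pdfimages_command(pdf_path, out_prefix):
    """Build the pdfimages invocation that writes embedded images as PNG files."""
    return ["pdfimages", "-png", pdf_path, out_prefix]

def extract_page_images(pdf_path, out_prefix="page"):
    """Run pdfimages; files are written as <out_prefix>-000.png, <out_prefix>-001.png, ..."""
    # check=True raises CalledProcessError if pdfimages exits with an error.
    subprocess.run(pdfimages_command(pdf_path, out_prefix), check=True)
```

Calling `extract_page_images("my.pdf")` on a scanned PDF should leave one PNG per scanned page in the current directory, ready to be sent for text detection.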



Erwan Lemonnier

CTO at GoFrendly. Fullstack developer turned entrepreneur. Ex-employee of Spotify and Trustly. Author of the PyMacaron microservice framework.