How to collect thousands of email addresses using Python Scrapy?
By the end of this post you will be able to extract data from websites. Specifically, we will learn how to extract email addresses from job posting websites. This script was written to collect data from a few real websites; the site names and real email addresses have been changed for learning purposes.
Authors: Adnan Kaya
Date: 09 November 2022
Keywords: data mining, web crawling, scrapy, scraping email address, css selectors
Table of Contents
- Installation
- Project Initializing and Generating Spider
- Scrapy Shell and CSS Selectors
- Final Code
- Running Spider and Exporting Emails to CSV
Scrapy is a Python framework that helps us create web crawlers. It is quite helpful for data mining and data extraction from websites. It also has an easy-to-understand tutorial; you can check the official documentation.
Before installing Scrapy, create a virtualenv.
$ python3.11 -m venv env311
$ source env311/bin/activate
# now install Scrapy
(env311) $ pip install Scrapy
# optional install
(env311) $ pip install ipython
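To verify the installation, you can ask Scrapy for its version; you should see something like:
(env311) $ scrapy version
Scrapy 2.7.1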
Let's get started by creating the Scrapy project. Open up your terminal and type scrapy startproject email_fetcher. After running this command you will get an output message like this:
$ scrapy startproject email_fetcher
New Scrapy project 'email_fetcher', using template directory '/home/adnan/myscrapy/env311/lib/python3.11/site-packages/scrapy/templates/project', created in:
/home/adnan/myscrapy/email_fetcher
You can start your first spider with:
cd email_fetcher
scrapy genspider example example.com
To see the files and folders of our project, the tree command can be used:
$ tree email_fetcher/
email_fetcher/
├── email_fetcher
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders  # We put our spiders into this folder and customize them for every website.
│       └── __init__.py
└── scrapy.cfg

2 directories, 7 files
Navigate to the spiders folder with cd email_fetcher/spiders/ and generate your first spider by running the following command: scrapy genspider EmailFetchingSpider example.com
You will see output like this:
(env311) $ scrapy genspider EmailFetchingSpider example.com
Created spider 'EmailFetchingSpider' using template 'basic' in module:
email_fetcher.spiders.EmailFetchingSpider
(env311) $ ls
EmailFetchingSpider.py __init__.py
Let's take a look at the EmailFetchingSpider.py file:
import scrapy


class EmailfetchingspiderSpider(scrapy.Spider):
    name = 'EmailFetchingSpider'  # our spider name
    allowed_domains = ['example.com']  # domain list that our spider will crawl
    start_urls = ['http://example.com/job-search/']  # the urls that we fetch data from

    def parse(self, response):
        pass
This is a simple template for creating spiders. Now let's head over to the terminal again and run the scrapy shell command.
$ scrapy shell example.com/job-search/
You will get output like this:
2022-11-09 01:22:20 [scrapy.utils.log] INFO: Scrapy 2.7.1 started (bot: email_fetcher)
........
...........
[s] view(response) View response in a browser
2022-11-09 01:22:22 [asyncio] DEBUG: Using selector: EpollSelector
In [1]: # Now you are here to play with the response data. The response is the page you received from example.com/job-search/
In [1]:
In [1]: # We will use css selectors to access data we want in response
In [1]:
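Before writing selectors, it is worth poking at the response a little. These are standard Scrapy shell objects and helpers (the 200 status below assumes the page loaded successfully):

In [1]: response.status
Out[1]: 200

In [2]: response.url
Out[2]: 'https://www.example.com/job-search/'

In [3]: view(response)  # opens the downloaded page in your browser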
Let's access the urls that are listed on example.com/job-search/:
In [1]: urls = response.css("div.jobs-listing > ul > li > div.jobs-content > div.cs-text > div.post-title > h5 > a::attr(href)").extract()
It seems a little bit complex, so let's break it down. First of all, let's see what our HTML page looks like. I opened the page we want to crawl and saw the job listing part in the middle of the page. Right click -> Inspect will show the HTML source code. For our case it looks like this:
<div class="jobs-listing">
<ul>
<li>
<div class="jobs-content">
<div class="cs-media">
<figure>
<a href="https://www.example.com/jobs/217035102/ontario/mississauga/bookkeeper/">
<img alt="" src="https://www.example.com/wp-content/plugins/wp-jobhunt/assets/images/img-not-found16x9.jpg">
</a>
</figure>
</div>
<div class="cs-text">
<div class="post-title">
<h5>
<a href="https://www.example.com/jobs/217035102/ontario/mississauga/bookkeeper/">Bookkeeper</a>
</h5>
<span class="cs-post-date">2 days ago</span>
<span id="employer">by HNM Logistics Inc.</span>
</div>
<div class="post-options"> <span class="cs-position cs-color">Accounting & Finance</span>
<span class="cs-location">Mississauga, Ontario </span>
</div>
<div class="job-post">
<a href="https://www.example.com/job-search/?job_type=full-time" class="jobtype-btn"
style="border-color:#b76935;color:#b76935;">Full-Time</a> <a
class="heart-btn heart-btn shortlist" data-toggle="tooltip" data-placement="top" title=""
onclick="trigger_func('#btn-header-main-login');" data-original-title="Add to Shortlist"><i
class="icon-heart-o"></i>
</a>
</div>
</div>
</div>
</li>
<li>...
</li>
<li>...
</li>
<li>...
</li>
.........
</ul>
</div>
We want to access the urls on this page. If we look at the HTML element structure, we will see that the <div class="jobs-listing"> element contains a ul element, which contains li elements; each li contains a <div class="jobs-content">, which contains a <div class="cs-text">, which contains a <div class="post-title">, which contains an h5, which contains the a tag holding the url. So to access that url we use a CSS selector like this:
In [1]: response.css("div.jobs-listing > ul > li > div.jobs-content > div.cs-text > div.post-title > h5 > a::attr(href)").extract_first()
Out[1]: 'https://www.example.com/jobs/217035102/ontario/mississauga/bookkeeper/'
We accessed the first url by using extract_first(). To access all urls we can simply use extract():
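As a side note, extract_first() and extract() are the older method names; recent Scrapy versions recommend the equivalent get() and getall(), which return exactly the same results:

first_url = response.css("div.jobs-listing > ul > li > div.jobs-content > div.cs-text > div.post-title > h5 > a::attr(href)").get()     # same as .extract_first()
urls = response.css("div.jobs-listing > ul > li > div.jobs-content > div.cs-text > div.post-title > h5 > a::attr(href)").getall()       # same as .extract()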
In [2]: urls = response.css("div.jobs-listing > ul > li > div.jobs-content > div.cs-text > div.post-title > h5 > a::attr(href)").extract()
In [3]: urls
Out[3]:
['https://www.example.com/jobs/217035102/ontario/mississauga/bookkeeper/',
'https://www.example.com/jobs/211997326/alberta/calgary/bookkeeper/',
'https://www.example.com/jobs/559003805/british-columbia/surrey/administrative-assistant/',
'https://www.example.com/jobs/928937789/british-columbia/delta-bc/weaver-metal-products-manufacturing/',
'https://www.example.com/jobs/741516188/british-columbia/surrey/office-administrator/',
'https://www.example.com/jobs/791951657/british-columbia/salmon-arm/food-service-supervisor/',
'https://www.example.com/jobs/849248443/british-columbia/richmond-bc/kitchen-helper/',
'https://www.example.com/jobs/919265832/alberta/edmonton/shipper-receiver/',
'https://www.example.com/jobs/878494760/british-columbia/vancouver/restaurant-hosthostess/',
'https://www.example.com/jobs/291216205/ontario/guelph/shift-manager-fast-food-restaurant/']
Every url navigates to a job detail page which contains a contact email address. We want that email address from every detail page.
Let's open up another terminal and use the scrapy shell detail_page_url command:
$ scrapy shell https://www.example.com/jobs/217035102/ontario/mississauga/bookkeeper/
We will be in the shell:
In [1]: # Use query selectors to access email address
If you go to the detail page, find the email address, and then right click + Inspect the element, you can see the HTML element structure. For our case the HTML structure looks like this:
<li class="col-lg-6 col-md-6 col-sm-12 col-xs-12">
<div class="listing-inner">
<i class="icon-envelope-o"></i>
<div class="cs-text">
<span>How to apply:</span>
<strong> email to bcc@gmail.com</strong>
</div>
</div>
</li>
There are lots of ways to access the email address. I will find the email icon, then its sibling element, and then the sibling's inner element data. So to access the email address we will find <i class="icon-envelope-o"></i>, then go to its parent, then to its sibling <div class="cs-text">, and finally to its <strong> tag which contains the email address.
Back in the shell:
In [2]: response.xpath("//i[contains(@class,'icon-envelope-o')]/parent::div").css("div.cs-text > strong").extract()
Out[2]: ['<strong>\temail to <a href="/cdn-cgi/l/email-protection" class="__cf_email__" data-cfemail="6d0f0e0e2d0a000c0401430e0200">[email\xa0protected]</a></strong>']
Email Protection!
As you can see, there is email protection! This is Cloudflare, which offers a feature to obfuscate email addresses.
To access the email we need to get the data-cfemail value, as you can see in the output:
In [11]: encoded_email = (response.xpath("//i[contains(@class,'icon-envelope-o')]/parent::div").css("div.cs-text > strong > a::attr(data-cfemail)").get())
In [12]: encoded_email
Out[12]: '6d0f0e0e2d0a000c0401430e0200'
To solve this problem we will use a custom function to decode the email address:
def decode_email(e_mail: str) -> str:
    de = ""
    if not e_mail:
        return de
    k = int(e_mail[:2], 16)  # the first hex byte is the XOR key
    for i in range(2, len(e_mail) - 1, 2):
        de += chr(int(e_mail[i : i + 2], 16) ^ k)
    return de
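Why does this work? In Cloudflare's data-cfemail encoding, the first hex byte is a key and every following byte is one plaintext character XORed with that key. For our value the key is 0x6d; the first data byte gives 0x0f ^ 0x6d = 0x62, which is 'b', the next two 0x0e bytes both give 0x63 ('c'), and 0x2d gives 0x40 ('@').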
On the shell:
In [13]: decode_email('6d0f0e0e2d0a000c0401430e0200')
Out[13]: 'bcc@gmail.com'
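If you want to test decode_email without hitting a live page, here is a minimal sketch of the inverse operation; encode_email is a hypothetical helper (not part of Scrapy or Cloudflare) that applies the same XOR scheme:

def encode_email(email: str, key: int = 0x6D) -> str:
    """Hypothetical inverse of decode_email, for local testing only."""
    encoded = f"{key:02x}"  # the key itself becomes the first hex byte
    for ch in email:
        encoded += f"{ord(ch) ^ key:02x}"  # XOR each character with the key
    return encoded

assert decode_email(encode_email("bcc@gmail.com")) == "bcc@gmail.com"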
Now we know how to get the urls, visit every detail page, and extract the email address from each one. Let's edit EmailFetchingSpider.py like the following:
import scrapy
from urllib.parse import urlencode
import time


def decode_email(e_mail: str) -> str:
    """Example usage
    In [13]: decode_email('6d0f0e0e2d0a000c0401430e0200')
    Out[13]: 'bcc@gmail.com'
    """
    de = ""
    if not e_mail:
        return de
    k = int(e_mail[:2], 16)  # the first hex byte is the XOR key
    for i in range(2, len(e_mail) - 1, 2):
        de += chr(int(e_mail[i : i + 2], 16) ^ k)
    return de


class EmailfetchingspiderSpider(scrapy.Spider):
    name = 'EmailFetchingSpider'
    allowed_domains = ["example.com"]
    start_urls = ["https://example.com/job-search/"]

    def parse(self, response):
        # access the detail page urls on the listing page
        urls = response.css(
            "div.jobs-listing > ul > li > div.jobs-content > div.cs-text > div.post-title > h5 > a::attr(href)"
        ).extract()
        for i, url in enumerate(urls):
            # pause briefly at every 100th record
            if i % 100 == 0:
                time.sleep(10 / 1000)
            # make a request for the detail page
            yield scrapy.Request(url=url, callback=self.parse_details)
        # pagination link, e.g. 'https://www.example.com/job-search/?page_job=2'
        next_page_url = response.css(
            "ul.pagination > li > a[aria-label='Next']::attr(href)"
        ).extract_first()
        # ask for 200 items on every page
        http_params = {"pagesize": 200}
        if next_page_url:
            next_page_url = response.urljoin(next_page_url)
            url = f"{next_page_url}&{urlencode(http_params)}"
            yield scrapy.Request(url=url, callback=self.parse)

    def parse_details(self, response):
        encoded_email = (
            response.xpath("//i[contains(@class,'icon-envelope-o')]/parent::div")
            .css("div.cs-text > strong > a::attr(data-cfemail)")
            .get()
        )
        yield {
            "email": decode_email(encoded_email),
            "url": response.url,
        }
The script first gets the urls on the job listing page. In the parse method we loop over the extracted urls and request each detail page with parse_details as the callback. parse_details fetches a specific url, extracts the encoded email, decodes it with decode_email, and yields the data we want to export to a CSV file. We can export more data by using CSS selectors, as shown in the sketch below.
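For example, assuming the detail page renders the job title inside a post-title block like the listing page does (this selector is an assumption; check the real markup), parse_details could yield an extra field:

def parse_details(self, response):
    encoded_email = (
        response.xpath("//i[contains(@class,'icon-envelope-o')]/parent::div")
        .css("div.cs-text > strong > a::attr(data-cfemail)")
        .get()
    )
    yield {
        "email": decode_email(encoded_email),
        "url": response.url,
        # assumed selector: adapt it to the detail page's actual structure
        "title": response.css("div.post-title > h5::text").get(default="").strip(),
    }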
Running the spider
In the project directory, run the following command (the .csv extension of the output file determines the export format):
$ scrapy crawl EmailFetchingSpider -o EmailFetchingSpider.csv
Output
email,url
asd@gmail.com,https://www.example.com/jobs/cook/
qwe.qwe@gmail.com,https://www.example.com/jobs/12344/us-columbia/surrey/administrative-assistant/
ytyu@gmail.com,https://www.example.com/jobs/12345/alberta/calgary/bookkeeper/
tyut.mjm@gmail.com,https://www.example.com/jobs/12346/us-columbia/vancouver/restaurant-hosthostess/
DISCLAIMER:
The information in this post is for general informational purposes only. The author makes no representation or warranty, express or implied. Your use of this code is solely at your own risk. This post may contain links to third party content, which we do not warrant, endorse, or assume liability for.