How to collect thousands of email addresses using Python Scrapy?
By the end of this post you will be able to extract data from websites. Specifically, we will learn how to extract email addresses from job posting websites. This script was written to collect data from a few real websites; the site names and real email addresses have been changed for learning purposes.
Authors: Adnan Kaya
Date: 09 November 2022
Keywords: data mining, web crawling, scrapy, scraping email address, css selectors
Table of Contents
- Installation
- Project Initializing and Generating Spider
- Scrapy Shell and CSS Selectors
- Final Code
- Running Spider and Exporting Emails to CSV
Scrapy is a Python framework that helps us create web crawlers. It is quite helpful for data mining and data extraction from websites. It also has an easy-to-understand tutorial; you can check the official documentation.
Before installing Scrapy, create a virtualenv.
$ python3.11 -m venv env311
$ source env311/bin/activate
# now install Scrapy
(env311) $ pip install Scrapy
# optional install
(env311) $ pip install ipython
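To verify the installation, you can ask Scrapy for its version; you should see something like:
(env311) $ scrapy version
Scrapy 2.7.1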
Let's get started by creating the Scrapy project. Open up your terminal and type scrapy startproject email_fetcher. After running this command you will get an output message like this:
$ scrapy startproject email_fetcher
New Scrapy project 'email_fetcher', using template directory '/home/adnan/myscrapy/env311/lib/python3.11/site-packages/scrapy/templates/project', created in:
/home/adnan/myscrapy/email_fetcher
You can start your first spider with:
cd email_fetcher
scrapy genspider example example.com
To see the files and folders of our project, the tree command can be used:
$ tree email_fetcher/
email_fetcher/
├── email_fetcher
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders  # We put our spiders into this folder and customize them for every website.
│       └── __init__.py
└── scrapy.cfg

2 directories, 7 files
Navigate to the spiders folder with cd email_fetcher/spiders/ and generate your first spider by running the following command: scrapy genspider EmailFetchingSpider example.com
You will see output like this:
(env311) $ scrapy genspider EmailFetchingSpider example.com
Created spider 'EmailFetchingSpider' using template 'basic' in module:
email_fetcher.spiders.EmailFetchingSpider
(env311) $ ls
EmailFetchingSpider.py __init__.py
Let's take a look at the EmailFetchingSpider.py file:
import scrapy


class EmailfetchingspiderSpider(scrapy.Spider):
    name = 'EmailFetchingSpider'  # our spider name
    allowed_domains = ['example.com']  # domain list that our spider will crawl
    start_urls = ['http://example.com/job-search/']  # the urls that we fetch data from

    def parse(self, response):
        pass
This is a simple template for creating spiders. Now let's head over to the terminal again and run the scrapy shell command.
$ scrapy shell example.com/job-search/
You will get output like this:
2022-11-09 01:22:20 [scrapy.utils.log] INFO: Scrapy 2.7.1 started (bot: email_fetcher)
........
...........
[s] view(response) View response in a browser
2022-11-09 01:22:22 [asyncio] DEBUG: Using selector: EpollSelector
In [1]: # Now you are here to play with the response data. The response is the page you received from example.com/job-search/
In [1]:
In [1]: # We will use css selectors to access data we want in response
In [1]:
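Before writing selectors, it is worth poking at the response a little. These are standard Scrapy shell objects and helpers (the 200 status below assumes the page loaded successfully):

In [1]: response.status
Out[1]: 200

In [2]: response.url
Out[2]: 'https://www.example.com/job-search/'

In [3]: view(response)  # opens the downloaded page in your browser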
Let's access the urls that are listed on example.com/job-search/:
In [1]: urls = response.css("div.jobs-listing > ul > li > div.jobs-content > div.cs-text > div.post-title > h5 > a::attr(href)").extract()
It seems a little bit complex, so let's break it down. First of all, let's see what our HTML page looks like. I opened the page we want to crawl and saw the job listing part in the middle of the page. Right click -> Inspect will show the HTML source code. For our case it looks like this:
<div class="jobs-listing">
<ul>
<li>
<div class="jobs-content">
<div class="cs-media">
<figure>
<a href="https://www.example.com/jobs/217035102/ontario/mississauga/bookkeeper/">
<img alt="" src="https://www.example.com/wp-content/plugins/wp-jobhunt/assets/images/img-not-found16x9.jpg">
</a>
</figure>
</div>
<div class="cs-text">
<div class="post-title">
<h5>
<a href="https://www.example.com/jobs/217035102/ontario/mississauga/bookkeeper/">Bookkeeper</a>
</h5>
<span class="cs-post-date">2 days ago</span>
<span id="employer">by HNM Logistics Inc.</span>
</div>
<div class="post-options"> <span class="cs-position cs-color">Accounting & Finance</span>
<span class="cs-location">Mississauga, Ontario </span>
</div>
<div class="job-post">
<a href="https://www.example.com/job-search/?job_type=full-time" class="jobtype-btn"
style="border-color:#b76935;color:#b76935;">Full-Time</a> <a
class="heart-btn heart-btn shortlist" data-toggle="tooltip" data-placement="top" title=""
onclick="trigger_func('#btn-header-main-login');" data-original-title="Add to Shortlist"><i
class="icon-heart-o"></i>
</a>
</div>
</div>
</div>
</li>
<li>...
</li>
<li>...
</li>
<li>...
</li>
.........
</ul>
</div>
We want to access the urls on this page. If we look at the HTML element structure, we will see that the <div class="jobs-listing"> element contains a ul element, which contains li elements; each li contains a <div class="jobs-content">, which contains a <div class="cs-text">, which contains a <div class="post-title">, which contains an h5, which contains the a tag holding the url. So to access that url we use a CSS selector like this:
In [1]: response.css("div.jobs-listing > ul > li > div.jobs-content > div.cs-text > div.post-title > h5 > a::attr(href)").extract_first()
Out[1]: 'https://www.example.com/jobs/217035102/ontario/mississauga/bookkeeper/'
We accessed the first url by using extract_first(). To access all urls we can simply use extract():
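As a side note, extract_first() and extract() are the older method names; recent Scrapy versions recommend the equivalent get() and getall(), which return exactly the same results:

first_url = response.css("div.jobs-listing > ul > li > div.jobs-content > div.cs-text > div.post-title > h5 > a::attr(href)").get()     # same as .extract_first()
urls = response.css("div.jobs-listing > ul > li > div.jobs-content > div.cs-text > div.post-title > h5 > a::attr(href)").getall()       # same as .extract()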
In [2]: urls = response.css("div.jobs-listing > ul > li > div.jobs-content > div.cs-text > div.post-title > h5 > a::attr(href)").extract()
In [3]: urls
Out[3]:
['https://www.example.com/jobs/217035102/ontario/mississauga/bookkeeper/',
'https://www.example.com/jobs/211997326/alberta/calgary/bookkeeper/',
'https://www.example.com/jobs/559003805/british-columbia/surrey/administrative-assistant/',
'https://www.example.com/jobs/928937789/british-columbia/delta-bc/weaver-metal-products-manufacturing/',
'https://www.example.com/jobs/741516188/british-columbia/surrey/office-administrator/',
'https://www.example.com/jobs/791951657/british-columbia/salmon-arm/food-service-supervisor/',
'https://www.example.com/jobs/849248443/british-columbia/richmond-bc/kitchen-helper/',
'https://www.example.com/jobs/919265832/alberta/edmonton/shipper-receiver/',
'https://www.example.com/jobs/878494760/british-columbia/vancouver/restaurant-hosthostess/',
'https://www.example.com/jobs/291216205/ontario/guelph/shift-manager-fast-food-restaurant/']
Every url navigates to a job detail page which contains a contact email address. We want that email address from every detail page.
Let's open up another terminal and use the scrapy shell detail_page_url command:
$ scrapy shell https://www.example.com/jobs/217035102/ontario/mississauga/bookkeeper/
We will be in the shell:
In [1]: # Use query selectors to access email address
If you go to the detail page, find the email address, and then right click + Inspect the element, you can see the HTML element structure. For our case the HTML structure looks like this:
<li class="col-lg-6 col-md-6 col-sm-12 col-xs-12">
<div class="listing-inner">
<i class="icon-envelope-o"></i>
<div class="cs-text">
<span>How to apply:</span>
<strong> email to bcc@gmail.com</strong>
</div>
</div>
</li>
There are lots of ways to access the email address. I will find the email icon, then its sibling element, and then the sibling's inner element data. So to access the email address we will find <i class="icon-envelope-o"></i>, then go to its parent, then to its sibling <div class="cs-text">, and finally to its <strong> tag which contains the email address.
Back in the shell:
In [2]: response.xpath("//i[contains(@class,'icon-envelope-o')]/parent::div").css("div.cs-text > strong").extract()
Out[2]: ['<strong>\temail to <a href="/cdn-cgi/l/email-protection" class="__cf_email__" data-cfemail="6d0f0e0e2d0a000c0401430e0200">[email\xa0protected]</a></strong>']
Email Protection!
As you can see, there is email protection! This is Cloudflare, which offers a feature to obfuscate email addresses.
To access the email we need to get the data-cfemail value, as you can see in the output:
In [11]: encoded_email = (response.xpath("//i[contains(@class,'icon-envelope-o')]/parent::div").css("div.cs-text > strong > a::attr(data-cfemail)").get())
In [12]: encoded_email
Out[12]: '6d0f0e0e2d0a000c0401430e0200'
To solve this problem we will use a custom function to decode the email address:
def decode_email(e_mail: str) -> str:
    de = ""
    if not e_mail:
        return de
    k = int(e_mail[:2], 16)  # the first hex byte is the XOR key
    for i in range(2, len(e_mail) - 1, 2):
        de += chr(int(e_mail[i : i + 2], 16) ^ k)
    return de
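Why does this work? In Cloudflare's data-cfemail encoding, the first hex byte is a key and every following byte is one plaintext character XORed with that key. For our value the key is 0x6d; the first data byte gives 0x0f ^ 0x6d = 0x62, which is 'b', the next two 0x0e bytes both give 0x63 ('c'), and 0x2d gives 0x40 ('@').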
On the shell:
In [13]: decode_email('6d0f0e0e2d0a000c0401430e0200')
Out[13]: 'bcc@gmail.com'
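If you want to test decode_email without hitting a live page, here is a minimal sketch of the inverse operation; encode_email is a hypothetical helper (not part of Scrapy or Cloudflare) that applies the same XOR scheme:

def encode_email(email: str, key: int = 0x6D) -> str:
    """Hypothetical inverse of decode_email, for local testing only."""
    encoded = f"{key:02x}"  # the key itself becomes the first hex byte
    for ch in email:
        encoded += f"{ord(ch) ^ key:02x}"  # XOR each character with the key
    return encoded

assert decode_email(encode_email("bcc@gmail.com")) == "bcc@gmail.com"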
Now we know how to get the urls, visit every detail page, and extract the email address from each one. Let's edit EmailFetchingSpider.py like the following:
import scrapy
from urllib.parse import urlencode
import time


def decode_email(e_mail: str) -> str:
    """Example usage
    In [13]: decode_email('6d0f0e0e2d0a000c0401430e0200')
    Out[13]: 'bcc@gmail.com'
    """
    de = ""
    if not e_mail:
        return de
    k = int(e_mail[:2], 16)  # the first hex byte is the XOR key
    for i in range(2, len(e_mail) - 1, 2):
        de += chr(int(e_mail[i : i + 2], 16) ^ k)
    return de


class EmailfetchingspiderSpider(scrapy.Spider):
    name = 'EmailFetchingSpider'
    allowed_domains = ["example.com"]
    start_urls = ["https://example.com/job-search/"]

    def parse(self, response):
        # access the detail page urls on the listing page
        urls = response.css(
            "div.jobs-listing > ul > li > div.jobs-content > div.cs-text > div.post-title > h5 > a::attr(href)"
        ).extract()
        for i, url in enumerate(urls):
            # pause briefly at every 100th record
            if i % 100 == 0:
                time.sleep(10 / 1000)
            # make a request for the detail page
            yield scrapy.Request(url=url, callback=self.parse_details)
        # pagination link, e.g. 'https://www.example.com/job-search/?page_job=2'
        next_page_url = response.css(
            "ul.pagination > li > a[aria-label='Next']::attr(href)"
        ).extract_first()
        # ask for 200 items on every page
        http_params = {"pagesize": 200}
        if next_page_url:
            next_page_url = response.urljoin(next_page_url)
            url = f"{next_page_url}&{urlencode(http_params)}"
            yield scrapy.Request(url=url, callback=self.parse)

    def parse_details(self, response):
        encoded_email = (
            response.xpath("//i[contains(@class,'icon-envelope-o')]/parent::div")
            .css("div.cs-text > strong > a::attr(data-cfemail)")
            .get()
        )
        yield {
            "email": decode_email(encoded_email),
            "url": response.url,
        }
The script first gets the urls on the job listing page. In the parse method we loop over the extracted urls and request each detail page with parse_details as the callback. parse_details fetches a specific url, extracts the encoded email, decodes it with decode_email, and yields the data we want to export to a CSV file. We can export more data by using CSS selectors, as shown in the sketch below.
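For example, assuming the detail page renders the job title inside a post-title block like the listing page does (this selector is an assumption; check the real markup), parse_details could yield an extra field:

def parse_details(self, response):
    encoded_email = (
        response.xpath("//i[contains(@class,'icon-envelope-o')]/parent::div")
        .css("div.cs-text > strong > a::attr(data-cfemail)")
        .get()
    )
    yield {
        "email": decode_email(encoded_email),
        "url": response.url,
        # assumed selector: adapt it to the detail page's actual structure
        "title": response.css("div.post-title > h5::text").get(default="").strip(),
    }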
Running the spider
In the project directory, run the following command (the .csv extension of the output file determines the export format):
$ scrapy crawl EmailFetchingSpider -o EmailFetchingSpider.csv
Output
email,url
asd@gmail.com,https://www.example.com/jobs/cook/
qwe.qwe@gmail.com,https://www.example.com/jobs/12344/us-columbia/surrey/administrative-assistant/
ytyu@gmail.com,https://www.example.com/jobs/12345/alberta/calgary/bookkeeper/
tyut.mjm@gmail.com,https://www.example.com/jobs/12346/us-columbia/vancouver/restaurant-hosthostess/
DISCLAIMER:
The information in this post is for general informational purposes only. The author makes no representation or warranty, express or implied. Your use of this code is solely at your own risk. This post may contain links to third party content, which we do not warrant, endorse, or assume liability for.