How to collect thousands of email addresses using Python Scrapy?


By the end of this post you will be able to extract data from websites. Specifically, we will learn how to extract email addresses from job posting websites. This script was written to collect data from a few real websites; the site names and real email addresses have been changed for this tutorial.


Author: Adnan Kaya

Date: 09 November 2022

Keywords: data mining, web crawling, scrapy, scraping email address, css selectors


Table of Contents

  • Installation
  • Project Initializing and Generating Spider
  • Scrapy Shell and CSS Selectors
  • Final Code
  • Running Spider and Exporting Emails to CSV

Scrapy is a Python framework for building web crawlers. It is quite helpful for data mining and extracting data from websites.

It has an easy-to-follow tutorial and well-written official documentation, which you can check for details.

Before installing Scrapy, create and activate a virtualenv.
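For example (assuming Python 3 and pip are available; the environment name .venv is arbitrary):

```shell
# Create an isolated environment for the project
python3 -m venv .venv
# Activate it (on Windows: .venv\Scripts\activate)
. .venv/bin/activate
# Install Scrapy inside the virtualenv
pip install scrapy
```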

Let's get started by creating the Scrapy project. Open up your terminal and type scrapy startproject email_fetcher. This command scaffolds a new project and prints a short confirmation message.

To see the files and folders of our project, the tree command can be used.
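The generated layout should look roughly like this (the standard Scrapy project skeleton):

```
email_fetcher/
├── scrapy.cfg            # deploy configuration
└── email_fetcher/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        └── __init__.py
```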

Navigate to the email_fetcher folder, then cd email_fetcher/spiders/ and generate your first spider by running the following command:

scrapy genspider EmailFetchingSpider example.com

Scrapy will confirm that the spider was created from the basic template.

Let's take a look at the EmailFetchingSpider.py file.

This is a simple template for creating spiders. Now let's head over to the terminal again and run the scrapy shell command.

You will get some output listing the objects available in the shell, such as the request and response.

Let's access the URLs that are listed on example.com/job-search/.

It seems a little bit complex, so let's simplify it. First of all, let's see what our HTML page looks like. I opened the page we want to crawl and saw the job listing part in the middle of the page. Right click -> Inspect shows the HTML source code for that part.

We want to access the URLs on this page. Looking at the HTML element structure, the <div class="jobs-listing"> element contains a ul, each li holds a <div class="jobs-content">, which contains a <div class="cs-text">, then a <div class="post-title">, then an h5 whose a tag carries the URL. So to reach that URL we chain those elements in a CSS selector ending with a::attr(href).

We accessed the first URL by using extract_first(). To access all URLs we can simply use extract().

Every URL navigates to a job detail page which contains a contact email address. We want that email address from every detail page.

Let's open up another terminal and run the scrapy shell detail_page_url command, where detail_page_url is one of the extracted URLs.

We will be in the shell

If you go to the detail page, find the email address, and then right click + Inspect the element, you can see the HTML element structure. For our case it looks as follows.

There are lots of ways to access the email address. I will find the email icon, then its sibling element, and the sibling's inner element data. So to reach the email address we will find <i class="icon-envelope-o"></i>, go to its parent, then to the parent's sibling <div class="cs-text">, and finally to its <strong> tag, which holds the email address.

Return to the shell.

As you can see, there is email protection! This is Cloudflare, which offers a feature that obfuscates email addresses in the page source.

To access the email we need to get the data-cfemail attribute value that you can see in the output.

To fix this problem we will use a custom function to decode the email address.
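Cloudflare's scheme XORs each byte of the address with a one-byte key stored in the first two hex digits of data-cfemail. A small decoder looks like this (the encoded sample below was constructed for illustration, not taken from a real site):

```python
def decode_email(encoded: str) -> str:
    """Decode a Cloudflare data-cfemail value.

    The first hex byte is the XOR key; every following hex byte is
    one character of the address XORed with that key.
    """
    key = int(encoded[:2], 16)
    return "".join(
        chr(int(encoded[i:i + 2], 16) ^ key)
        for i in range(2, len(encoded), 2)
    )

# Illustrative value encoding "jobs@example.com" with key 0x42
print(decode_email("42282d203102273a232f322e276c212d2f"))  # jobs@example.com
```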

In the shell:

Now we know how to get the URLs, visit every detail page, and extract the email address from each one.

Let's edit EmailFetchingSpider.py to put everything together.

The script first collects the URLs from the job listing page. In the parse method we loop over the extracted URLs and request each detail page with parse_details as the callback. That method extracts the encoded email from the response, decodes it with decode_email, and yields the data we want to export to a CSV file. We could export more fields by adding further CSS selectors.

Running the spider

In the project directory run the following command
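Assuming the spider name above, the crawl can be started and the results exported with Scrapy's built-in feed export (this must be run inside the project directory; the output file name is arbitrary):

```shell
# -o appends the yielded items to emails.csv;
# the export format is inferred from the file extension
scrapy crawl EmailFetchingSpider -o emails.csv
```

Other formats such as .json or .jsonl work the same way; in recent Scrapy versions, -O overwrites the file instead of appending.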

You will see log output from the crawl, and the extracted emails will be written to the CSV file.

 

DISCLAIMER:

The information on this post is for general informational purposes only. The author makes no representation or warranty, express or implied. Your use of this code is solely at your own risk. This code may contain links to third party content, which we do not warrant, endorse, or assume liability for.
