Web scraping with Scrapy - Printable Version
Python Forum (https://python-forum.io) - Web Scraping (https://python-forum.io/forum-43.html) - Thread: Web scraping with Scrapy (/thread-326.html)
Web scraping with Scrapy - metulburr - Oct-05-2016

Originally posted by setrofim. Please do not PM me.

Introduction

This tutorial shows how to use the Scrapy framework to quickly write web scrapers. As an example, I will implement a simple scraper to extract comic image links and the associated alt text and transcripts from xkcd.com. This tutorial assumes that you are comfortable with Python. It also assumes a basic understanding of HTTP and some familiarity with XPath notation.

Installation

Scrapy works with Python 2.6 or 2.7. If you're on Windows, you will also have to install OpenSSL. Once you have the prerequisites, the easiest way to install Scrapy is with pip:

    pip install scrapy

If this does not work for you, please refer to the installation guide in the Scrapy documentation.

Concepts

Scrapy is a framework, which means that it implements a lot of the "boilerplate" functionality for you; all you need to do is implement the bits specific to your application. Scrapy breaks these bits down into several categories. For this tutorial, we'll only focus on the following:

Items - definitions of the data you want to extract.
Spiders - classes that define how a site is crawled and how data is extracted from its pages.
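Since the tutorial assumes some XPath familiarity, here is a quick refresher. This sketch uses the standard library's xml.etree.ElementTree, which supports the small XPath subset used below (paths and attribute predicates); the HTML snippet and values in it are made up for illustration:

```python
# Quick XPath refresher using the standard library. ElementTree handles
# the subset this tutorial needs: element paths, attribute predicates
# like [@id='...'], and reading attributes via .get().
import xml.etree.ElementTree as ET

html = """
<html><body>
  <div id="comic"><img src="/comics/example.png" title="alt text here"/></div>
  <a rel="next" href="/1141/">Next</a>
</body></html>
"""

root = ET.fromstring(html)
# Equivalent in spirit to the XPath //div[@id="comic"]/img
img = root.find(".//div[@id='comic']/img")
print(img.get('src'))    # the image URL
print(img.get('title'))  # the alt text
```

Scrapy's own selectors use full XPath, so anything that works here works there too.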
The bits you implement need to be located somewhere Scrapy can find them, so your project must follow the structure that Scrapy expects. Luckily, Scrapy can generate the project directory for you. So let's start by creating a project:

    ~/projects$ scrapy startproject xkcd_scraper
    ~/projects$ tree xkcd_scraper/
    xkcd_scraper/
    |-- scrapy.cfg
    `-- xkcd_scraper
        |-- __init__.py
        |-- items.py
        |-- pipelines.py
        |-- settings.py
        `-- spiders
            `-- __init__.py

    2 directories, 6 files

The top-level project directory contains the Scrapy configuration file scrapy.cfg (which we will not need to worry about in this tutorial) and the Python package with the code for the project (with the same name as the project). Within the package, there are files for defining the various parts of the scraper. In this tutorial, we're only concerned with items and spiders.

Writing the Scraper

Item

First, we need to define what it is that we want to scrape from the web site. We do this by implementing an Item to describe the data, inside xkcd_scraper/xkcd_scraper/items.py. This is as easy as subclassing Item and creating a Field for each individual bit of data we want to scrape:

    from scrapy.item import Item, Field

    class XkcdComicItem(Item):
        image_url = Field()
        alt_text = Field()
        transcript = Field()

Here, we're saying that we want to extract the image URL, alt text, and transcript for each xkcd comic.
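If it helps to see what declaring Fields buys you: an Item instance behaves like a dict that only accepts the declared keys. The following is a stdlib-only stand-in sketching that idea (the real scrapy.Item is built with a metaclass, and the URL below is made up), so you can run it even without Scrapy installed:

```python
# Minimal stand-in for scrapy's Item/Field, to illustrate the behaviour:
# an Item works like a dict, but assigning to a key that was not declared
# as a Field raises a KeyError. Sketch only, not the real implementation.
class Field(object):
    pass

class Item(object):
    fields = {}

    def __init__(self):
        self._values = {}

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError('%s is not a declared field' % key)
        self._values[key] = value

    def __getitem__(self, key):
        return self._values[key]

class XkcdComicItem(Item):
    fields = {'image_url': Field(), 'alt_text': Field(), 'transcript': Field()}

item = XkcdComicItem()
item['image_url'] = 'http://imgs.xkcd.com/comics/example.png'  # hypothetical URL
print(item['image_url'])
```

Trying `item['bogus'] = 'x'` would raise a KeyError, which catches typos in field names early.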
Spider

Now let's define how we are going to extract the data by creating a spider inside xkcd_scraper/xkcd_scraper/spiders/__init__.py:

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import HtmlXPathSelector

    from xkcd_scraper.items import XkcdComicItem

    class XkcdComicSpider(CrawlSpider):
        name = 'xkcd-comics'
        start_urls = ['http://xkcd.com/1140/']
        rules = (
            Rule(SgmlLinkExtractor(restrict_xpaths='//a[@rel="next"]'),
                 follow=True, callback='parse_comic'),
        )

        def parse_comic(self, response):
            hxs = HtmlXPathSelector(response)
            image = hxs.select('//div[@id="comic"]/img')

            item = XkcdComicItem()
            item['image_url'] = image.select('@src').extract()
            item['alt_text'] = image.select('@title').extract()
            item['transcript'] = hxs.select('//div[@id="transcript"]/text()').extract()
            return item

OK, this is a bit more complicated, so let's break it down.

    class XkcdComicSpider(CrawlSpider):

We're subclassing the CrawlSpider class. A CrawlSpider will start with the initial set of URLs and will crawl from there according to a set of rules.

    name = 'xkcd-comics'
    start_urls = ['http://xkcd.com/1140/']
    rules = (
        Rule(SgmlLinkExtractor(restrict_xpaths='//a[@rel="next"]'),
             follow=True, callback='parse_comic'),
    )

Here, we're defining the behaviour of the crawler: name identifies the spider, start_urls gives the first page to fetch, and the single Rule tells the crawler to follow each page's "next" link and to pass every fetched page to the parse_comic callback.
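To make the crawling behaviour concrete, here is a stdlib-only sketch of what that Rule amounts to: on each page, find the <a rel="next"> link and queue it, stopping when there isn't one. The pages dict below is a made-up in-memory stand-in for real HTTP fetches:

```python
# Conceptual sketch of the CrawlSpider Rule above: follow the rel="next"
# link from page to page until none is found. Fake in-memory "pages"
# stand in for HTTP responses; parse_comic would run on each one.
import xml.etree.ElementTree as ET

pages = {
    '/1140/': '<html><body><a rel="next" href="/1141/">Next</a></body></html>',
    '/1141/': '<html><body><a rel="next" href="/1142/">Next</a></body></html>',
    '/1142/': '<html><body><p>No next link, so crawling stops.</p></body></html>',
}

def next_link(html):
    # Same restriction the SgmlLinkExtractor applies: //a[@rel="next"]
    link = ET.fromstring(html).find(".//a[@rel='next']")
    return link.get('href') if link is not None else None

crawled, url = [], '/1140/'
while url is not None:
    crawled.append(url)          # this is where parse_comic would be invoked
    url = next_link(pages[url])

print(crawled)
```

The real crawler also deduplicates URLs and fetches pages asynchronously; Scrapy handles all of that for you.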
    def parse_comic(self, response):
        hxs = HtmlXPathSelector(response)
        image = hxs.select('//div[@id="comic"]/img')

        item = XkcdComicItem()
        item['image_url'] = image.select('@src').extract()
        item['alt_text'] = image.select('@title').extract()
        item['transcript'] = hxs.select('//div[@id="transcript"]/text()').extract()
        return item

This is the callback that will get invoked for each page the spider crawls, and this is where the actual scraping happens. All we're doing here is using a selector to find the relevant data in the HTML returned in the HTTP response and populating an instance of the Item we created earlier with that data. We're using Scrapy's HtmlXPathSelector here, but you could use something like lxml if you are more comfortable with that.

OK, that's it. We're done. We now have a fully functioning web scraper. It's time to take it for a spin.

Running the Scraper

To run the scraper, navigate to the top-level project directory (the one with the scrapy.cfg file) in your favorite shell and run scrapy like so:

    ~/projects/xkcd_scraper$ scrapy crawl xkcd-comics -t json -o xkcd-comics.json

Here, we're telling Scrapy that we want it to crawl using the xkcd-comics spider (the name we gave our spider earlier), and we want the output formatted as JSON and written to the xkcd-comics.json file in the current directory. Once you type that in and hit enter, you'll see a whole bunch of log output (by default, the verbosity level is set to DEBUG) telling you exactly what Scrapy is doing. When all available comics have been scraped, Scrapy will print a summary and then exit, leaving the JSON file with the output. One of the cool things about Scrapy is that you can hit CTRL-C at any point to abort the crawling, and you'll still get a well-formatted JSON file with the data that has been scraped so far.

Batteries Included

This tutorial focused on how to write a web scraper with the minimum amount of fuss.
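Once the crawl finishes (or you abort it), the JSON file can be loaded straight back into Python. The record below is a made-up example of the shape you can expect: a list of item dicts, with each value itself a list, because extract() returns a list of matches:

```python
# Loading the scraper's JSON output. The sample record is invented to
# show the expected shape; in practice you would use
# json.load(open('xkcd-comics.json')) on the real output file.
import json

sample = '''[
  {"image_url": ["http://imgs.xkcd.com/comics/example.png"],
   "alt_text": ["An example alt text."],
   "transcript": ["An example transcript."]}
]'''

items = json.loads(sample)
for item in items:
    # extract() returned lists, hence the [0] to get the single match
    print(item['image_url'][0], '-', item['alt_text'][0])
```

If you prefer single values to one-element lists, you can post-process them here, or do it in the spider with something like extract()[0] (guarding against empty matches).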
As such, it barely scrapes the surface of the functionality available in Scrapy. If there is demand for it (and if I have the time/motivation), I might cover some of the more advanced features in a future tutorial. For now, here is a subset of the features that are available:

Item pipelines for cleaning, validating, and storing scraped data.
An interactive shell (scrapy shell) for trying out selectors against a live page.
Spider and downloader middlewares for customising how requests and responses are processed.
Built-in export to other formats, such as CSV and XML.
Politeness settings, such as download delays and obeying robots.txt.