Web scraping with Scrapy - Printable Version
Python Forum (https://python-forum.io) - Web Scraping (https://python-forum.io/forum-43.html) - Thread: Web scraping with Scrapy (/thread-326.html)
Web scraping with Scrapy - metulburr - Oct-05-2016

Originally posted by setrofim. Please do not PM me.

Introduction

This tutorial shows how to use the Scrapy framework to quickly write web scrapers. As an example, I will implement a simple scraper to extract comic image links and the associated alt text and transcripts from xkcd.com. This tutorial assumes that you are comfortable with Python. It also assumes a basic understanding of HTTP and some familiarity with XPath notation.

Installation

Scrapy works with Python 2.6 or 2.7. If you're on Windows, you will also have to install OpenSSL. Once you have the prerequisites, the easiest way to install Scrapy is with pip:

    pip install scrapy

If this does not work for you, please refer to the installation guide in the Scrapy documentation.

Concepts

Scrapy is a framework, which means that it implements a lot of the "boilerplate" functionality for you; all you need to do is implement the bits specific to your application. Scrapy breaks these bits down into several categories. For this tutorial, we'll only focus on the following:

Items - definitions of the data you want to extract.
Spiders - classes that define how a site is crawled and how data is extracted from its pages.
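Since the tutorial assumes some XPath familiarity, here is a quick refresher. This sketch uses the standard library's xml.etree.ElementTree, which supports the small XPath subset used below (paths and attribute predicates); the HTML snippet and values in it are made up for illustration:

```python
# Quick XPath refresher using the standard library. ElementTree handles
# the subset this tutorial needs: element paths, attribute predicates
# like [@id='...'], and reading attributes via .get().
import xml.etree.ElementTree as ET

html = """
<html><body>
  <div id="comic"><img src="/comics/example.png" title="alt text here"/></div>
  <a rel="next" href="/1141/">Next</a>
</body></html>
"""

root = ET.fromstring(html)
# Equivalent in spirit to the XPath //div[@id="comic"]/img
img = root.find(".//div[@id='comic']/img")
print(img.get('src'))    # the image URL
print(img.get('title'))  # the alt text
```

Scrapy's own selectors use full XPath, so anything that works here works there too.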
The bits you implement need to be located somewhere Scrapy can find them, so your project must follow the structure that Scrapy expects. Luckily, Scrapy can generate the project directory for you. So let's start by creating a project:

    ~/projects$ scrapy startproject xkcd_scraper
    ~/projects$ tree xkcd_scraper/
    xkcd_scraper/
    |-- scrapy.cfg
    `-- xkcd_scraper
        |-- __init__.py
        |-- items.py
        |-- pipelines.py
        |-- settings.py
        `-- spiders
            `-- __init__.py

    2 directories, 6 files

The top-level project directory contains the Scrapy configuration file scrapy.cfg (which we will not need to worry about in this tutorial) and the Python package with the code for the project (with the same name as the project). Within the package, there are files for defining the various parts of the scraper. In this tutorial, we're only concerned with items and spiders.

Writing the Scraper

Item

First, we need to define what it is that we want to scrape from the web site. We do this by implementing an Item to describe the data, inside xkcd_scraper/xkcd_scraper/items.py. This is as easy as subclassing Item and creating a Field for each individual bit of data we want to scrape:

    from scrapy.item import Item, Field

    class XkcdComicItem(Item):
        image_url = Field()
        alt_text = Field()
        transcript = Field()

Here, we're saying that we want to extract the image URL, alt text, and transcript for each xkcd comic.
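If it helps to see what declaring Fields buys you: an Item instance behaves like a dict that only accepts the declared keys. The following is a stdlib-only stand-in sketching that idea (the real scrapy.Item is built with a metaclass, and the URL below is made up), so you can run it even without Scrapy installed:

```python
# Minimal stand-in for scrapy's Item/Field, to illustrate the behaviour:
# an Item works like a dict, but assigning to a key that was not declared
# as a Field raises a KeyError. Sketch only, not the real implementation.
class Field(object):
    pass

class Item(object):
    fields = {}

    def __init__(self):
        self._values = {}

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError('%s is not a declared field' % key)
        self._values[key] = value

    def __getitem__(self, key):
        return self._values[key]

class XkcdComicItem(Item):
    fields = {'image_url': Field(), 'alt_text': Field(), 'transcript': Field()}

item = XkcdComicItem()
item['image_url'] = 'http://imgs.xkcd.com/comics/example.png'  # hypothetical URL
print(item['image_url'])
```

Trying `item['bogus'] = 'x'` would raise a KeyError, which catches typos in field names early.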
Spider

Now let's define how we are going to extract the data by creating a spider inside xkcd_scraper/xkcd_scraper/spiders/__init__.py:

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import HtmlXPathSelector

    from xkcd_scraper.items import XkcdComicItem

    class XkcdComicSpider(CrawlSpider):
        name = 'xkcd-comics'
        start_urls = ['http://xkcd.com/1140/']
        rules = (
            Rule(SgmlLinkExtractor(restrict_xpaths='//a[@rel="next"]'),
                 follow=True, callback='parse_comic'),
        )

        def parse_comic(self, response):
            hxs = HtmlXPathSelector(response)
            image = hxs.select('//div[@id="comic"]/img')

            item = XkcdComicItem()
            item['image_url'] = image.select('@src').extract()
            item['alt_text'] = image.select('@title').extract()
            item['transcript'] = hxs.select('//div[@id="transcript"]/text()').extract()
            return item

OK, this is a bit more complicated, so let's break it down.

    class XkcdComicSpider(CrawlSpider):

We're subclassing the CrawlSpider class. A CrawlSpider will start with the initial set of URLs and will crawl from there according to a set of rules.

    name = 'xkcd-comics'
    start_urls = ['http://xkcd.com/1140/']
    rules = (
        Rule(SgmlLinkExtractor(restrict_xpaths='//a[@rel="next"]'),
             follow=True, callback='parse_comic'),
    )

Here, we're defining the behaviour of the crawler: name identifies the spider, start_urls gives the first page to fetch, and the single Rule tells the crawler to follow each page's "next" link and to pass every fetched page to the parse_comic callback.
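To make the crawling behaviour concrete, here is a stdlib-only sketch of what that Rule amounts to: on each page, find the <a rel="next"> link and queue it, stopping when there isn't one. The pages dict below is a made-up in-memory stand-in for real HTTP fetches:

```python
# Conceptual sketch of the CrawlSpider Rule above: follow the rel="next"
# link from page to page until none is found. Fake in-memory "pages"
# stand in for HTTP responses; parse_comic would run on each one.
import xml.etree.ElementTree as ET

pages = {
    '/1140/': '<html><body><a rel="next" href="/1141/">Next</a></body></html>',
    '/1141/': '<html><body><a rel="next" href="/1142/">Next</a></body></html>',
    '/1142/': '<html><body><p>No next link, so crawling stops.</p></body></html>',
}

def next_link(html):
    # Same restriction the SgmlLinkExtractor applies: //a[@rel="next"]
    link = ET.fromstring(html).find(".//a[@rel='next']")
    return link.get('href') if link is not None else None

crawled, url = [], '/1140/'
while url is not None:
    crawled.append(url)          # this is where parse_comic would be invoked
    url = next_link(pages[url])

print(crawled)
```

The real crawler also deduplicates URLs and fetches pages asynchronously; Scrapy handles all of that for you.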
    def parse_comic(self, response):
        hxs = HtmlXPathSelector(response)
        image = hxs.select('//div[@id="comic"]/img')

        item = XkcdComicItem()
        item['image_url'] = image.select('@src').extract()
        item['alt_text'] = image.select('@title').extract()
        item['transcript'] = hxs.select('//div[@id="transcript"]/text()').extract()
        return item

This is the callback that will get invoked for each page the spider crawls, and this is where the actual scraping happens. All we're doing here is using a selector to find the relevant data in the HTML returned in the HTTP response and populating an instance of the Item we created earlier with that data. We're using Scrapy's HtmlXPathSelector here, but you could use something like lxml if you are more comfortable with that.

OK, that's it. We're done. We now have a fully functioning web scraper. It's time to take it for a spin.

Running the Scraper

To run the scraper, navigate to the top-level project directory (the one with the scrapy.cfg file) in your favorite shell and run scrapy like so:

    ~/projects/xkcd_scraper$ scrapy crawl xkcd-comics -t json -o xkcd-comics.json

Here, we're telling Scrapy that we want it to crawl using the xkcd-comics spider (the name we gave our spider earlier), and we want the output formatted as JSON and written to the xkcd-comics.json file in the current directory. Once you type that in and hit enter, you'll see a whole bunch of log output (by default, the verbosity level is set to DEBUG) telling you exactly what Scrapy is doing. When all available comics have been scraped, Scrapy will print a summary and then exit, leaving the JSON file with the output. One of the cool things about Scrapy is that you can hit CTRL-C at any point to abort the crawling, and you'll still get a well-formatted JSON file with the data that has been scraped so far.

Batteries Included

This tutorial focused on how to write a web scraper with the minimum amount of fuss.
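Once the crawl finishes (or you abort it), the JSON file can be loaded straight back into Python. The record below is a made-up example of the shape you can expect: a list of item dicts, with each value itself a list, because extract() returns a list of matches:

```python
# Loading the scraper's JSON output. The sample record is invented to
# show the expected shape; in practice you would use
# json.load(open('xkcd-comics.json')) on the real output file.
import json

sample = '''[
  {"image_url": ["http://imgs.xkcd.com/comics/example.png"],
   "alt_text": ["An example alt text."],
   "transcript": ["An example transcript."]}
]'''

items = json.loads(sample)
for item in items:
    # extract() returned lists, hence the [0] to get the single match
    print(item['image_url'][0], '-', item['alt_text'][0])
```

If you prefer single values to one-element lists, you can post-process them here, or do it in the spider with something like extract()[0] (guarding against empty matches).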
As such, it barely scrapes the surface of the functionality available in Scrapy. If there is demand for it (and if I have the time/motivation), I might cover some of the more advanced features in a future tutorial. For now, here is a subset of the features that are available:

Item pipelines for cleaning, validating, and storing scraped data.
An interactive shell (scrapy shell) for trying out selectors against a live page.
Spider and downloader middlewares for customising how requests and responses are processed.
Built-in export to other formats, such as CSV and XML.
Politeness settings, such as download delays and obeying robots.txt.