A quick introduction to web crawling using Scrapy - Part I

“The soup is not that beautiful.”

— Peter Winter

Web crawling is usually the very first step of data research. In the past, people have relied on various software packages for this job. In Python, urllib2 and BeautifulSoup are widely used, despite the effort it takes to extract structured information with these tools. The reason people fought with stones is that they didn't have bronze and iron.

Scrapy is a Python package that aims at easy, fast, and automated web crawling, and it has recently gained much popularity. I first heard the name from Adam Pah, decided to give it a try, and fell in love with it.

Here I'm going to show a brief step-by-step example of crawling the website metacritic.com with Scrapy to get the Metascores of PC games.

1. Installation
Scrapy is offered via pip. Use the following command to get it:

sudo pip install Scrapy
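
If the installation succeeded, the scrapy command should now be available on your path; a quick way to check is:

scrapy version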

2. Start a Scrapy project
Unlike most Python packages, you DON'T simply IMPORT Scrapy into an existing Python script: Scrapy works as a stand-alone framework, and you drive it through its own command-line tool. For simplicity, we use the project name “metacritic”. To start our project, type the command:

scrapy startproject metacritic

This will create a directory /metacritic/.
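
Inside it, Scrapy generates a project skeleton roughly like the following (the exact files may differ slightly between Scrapy versions):

metacritic/
    scrapy.cfg            # project configuration file
    metacritic/           # the project's Python module
        __init__.py
        items.py          # item definitions (edited in the next step)
        pipelines.py
        settings.py
        spiders/          # where our spiders will live
            __init__.py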

3. Define the data structure
Scrapy uses a class called Item as a container for the crawled data. To define our own crawled item, we write a class derived from the basic Item class. Edit /metacritic/items.py to add the class MetacriticItem:

from scrapy.item import Item, Field

class MetacriticItem(Item):
    '''
    Class for the item retrieved by Scrapy.
    '''
    # Fields that will be crawled and stored
    title = Field()   # Game title
    link = Field()    # Link to the individual game page
    cscore = Field()  # Critic score
    uscore = Field()  # User score
    date = Field()    # Release date
    desc = Field()    # Description of the game
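
A MetacriticItem behaves much like a Python dictionary, so fields can be assigned and read back by name. A quick sketch, continuing from the class defined above and using a made-up title purely for illustration:

item = MetacriticItem()
item['title'] = 'Some Game'  # assign a field, just like a dict key
print(item['title'])         # and read it back the same way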

4. Define the crawler
The class that actually does the crawling is called Spider (for obvious reasons). We feed the spider a list of starting URLs. The spider visits each URL, extracts the data we want, and stores it as a list of instances of the class MetacriticItem. Let's go into the directory /metacritic/spiders/ and create a file named metacritic_spider.py with the following contents:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from metacritic.items import MetacriticItem
class MetacriticSpider(BaseSpider):
    name = "metacritic" # Name of the spider, to be used when crawling
    allowed_domains = ["metacritic.com"] # Where the spider is allowed to go
    start_urls = [
        "http://www.metacritic.com/browse/games/title/pc?page=0"
    ]
    def parse(self, response):
        pass # To be changed later

This piece of code inherits from the BaseSpider class to create a class tailored to our needs. Its member variables tell the spider to start from each URL in start_urls and to stay within the domain “metacritic.com”, so it doesn't follow any links that go outside this domain.
Here comes the question of how to extract the data we want. With BeautifulSoup, we have to navigate the HTML tree ourselves, and sometimes that is a huge pain. To avoid this, Scrapy uses XPath, a language for locating information within XML and HTML documents. A few examples:

  • /html/body/li selects <li> elements that are direct children of the <body> element.
  • //li selects ALL <li> elements within the HTML.
  • //li[@class="game"] selects ALL <li> elements whose class attribute is exactly "game", i.e. <li class="game">.
  • //li[contains(@class, "game")] selects ALL <li> elements whose class attribute contains the string "game", e.g. <li class="game pc">, <li class="game xbox360">, <li class="game ps3">, etc.

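A handy way to try out XPath expressions before writing any spider code is Scrapy's interactive shell. Note that the objects it exposes depend on your Scrapy version; older releases provide an HtmlXPathSelector named hxs, which is what the sketch below assumes (newer versions use response.xpath instead):

scrapy shell "http://www.metacritic.com/browse/games/title/pc?page=0"
# then, at the interactive prompt:
>>> hxs.select('//li[contains(@class, "product game_product")]')
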
After inspecting the source of http://www.metacritic.com/browse/games/title/pc?page=0, we find that each game is contained within tags of the form

<li class="product game_product">
  <div class="product_wrap">
    Game content
  </div>
</li>

We use XPath to select the games and put them in a list (here hxs is the selector that we will construct from the response in the parse method below):

sites = hxs.select('//li[contains(@class, "product game_product")]/div[@class="product_wrap"]')

For each game in the list, we create an item to store it:

item = MetacriticItem()

We then extract each field we want, again with XPath. For example, to get the title of the game:

item['title'] = site.select('div[@class="basic_stat product_title"]/a/text()').extract()

With XPath, we can fill out the function parse(self, response):

    def parse(self, response):
        hxs = HtmlXPathSelector(response)  # The XPath selector
        sites = hxs.select('//li[contains(@class, "product game_product")]/div[@class="product_wrap"]')
        items = []
        for site in sites:
            item = MetacriticItem()
            item['title'] = site.select('div[@class="basic_stat product_title"]/a/text()').extract()
            item['link'] = site.select('div[@class="basic_stat product_title"]/a/@href').extract()
            item['cscore'] = site.select('div[@class="basic_stat product_score brief_metascore"]/div/div/'
                                         'span[contains(@class, "data metascore score")]/text()').extract()
            item['uscore'] = site.select('div[@class="more_stats condensed_stats"]/ul/li/'
                                         'span[contains(@class, "data textscore textscore")]/text()').extract()
            item['date'] = site.select('div[@class="more_stats condensed_stats"]/ul/li/'
                                       'span[@class="data"]/text()').extract()
            items.append(item)
        return items
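
Note that extract() returns a list of strings, and the extracted text often carries extra whitespace or newlines. If you want cleaner output, one option is to strip every field inside the loop, just before items.append(item); a minimal sketch:

            # Strip whitespace from every field that was set on the item;
            # each field is a list of strings, as returned by extract()
            for key in item.keys():
                item[key] = [value.strip() for value in item[key]]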

5. Crawl the web & Store the data
We've finished setting up the code. Now we can crawl the website and store the data as JSON. Yes, as JSON.
Go to the project directory, i.e. /metacritic/, and enter the command:

scrapy crawl metacritic -o metacritic.json -t json

That is to say, “Run the spider named metacritic and save the retrieved items in JSON format to a file named metacritic.json.”

Woohoo! Now we've got all the games, with titles, critic scores, and user scores, saved as JSON objects! You can store the items in MongoDB, visualize them with D3, and do whatever you want with them!
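
For example, once the crawl has finished, loading the results back into Python is plain JSON handling (assuming the metacritic.json file produced by the command above):

import json

with open('metacritic.json') as f:
    games = json.load(f)

print(len(games))         # number of games crawled
print(games[0]['title'])  # each field is a list of strings, as returned by extract()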

That's basically where I am now; I'll continue to play with Scrapy and maybe write future posts about other features like pipelines, callbacks, etc.

Enjoy!

Han