“The soup is not that beautiful.”
— Peter Winter
Web crawling is usually the very first step of data research. In the past, people relied on various software packages for this job. In Python, urllib2 and BeautifulSoup are widely used, despite the effort it takes to extract structured information with them. The reason people fought with stones is that they didn’t have bronze and iron.
Scrapy is a new Python package for easy, fast, and automated web crawling that has recently gained much popularity. I first heard the name from Adam Pah, decided to give it a try, and fell in love with it.
Here I’m gonna show a brief step-by-step example of crawling the website metacritic.com with Scrapy to get the meta scores of PC games.
1. Install Scrapy
Scrapy is offered via pip. Use the following command to get it:
sudo pip install Scrapy
2. Start a Scrapy project
Unlike most other Python packages, you DON’T IMPORT Scrapy into an existing Python project; Scrapy works as a stand-alone package with its own project. For simplicity, we use the project name “metacritic”. To start our project, type the command:
scrapy startproject metacritic
This will create a directory
3. Define the data structure
Scrapy uses a class called Item as a container for the crawled data. To define our crawled item, we write our own class derived from the basic Item class. Edit /metacritic/items.py to add the class MetacriticItem:
from scrapy.item import Item, Field

class MetacriticItem(Item):
    '''
    Class for the item retrieved by scrapy.
    '''
    # Here are the fields that will be crawled and stored
    title = Field()   # Game title
    link = Field()    # Link to individual game page
    cscore = Field()  # Critic score
    uscore = Field()  # User score
    date = Field()    # Release date
    desc = Field()    # Description of game
4. Define the crawler
The class that actually does the crawling is called Spider (for obvious reasons). We feed the spider a list of starting URLs. The spider goes to each of the URLs, extracts the desired data, and stores it as a list of instances of the class MetacriticItem. Let’s go into the directory /metacritic/spiders/ and create a file named metacritic_spider.py with the following contents:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from metacritic.items import MetacriticItem

class MetacriticSpider(BaseSpider):
    name = "metacritic"  # Name of the spider, to be used when crawling
    allowed_domains = ["metacritic.com"]  # Where the spider is allowed to go
    start_urls = [
        "http://www.metacritic.com/browse/games/title/pc?page=0"
    ]

    def parse(self, response):
        pass  # To be changed later
This piece of code inherits from the BaseSpider class and creates a class that is tailored to our needs. Its member variables indicate that the spider starts from each URL in start_urls and is only allowed to navigate within the domain “metacritic.com”, so it doesn’t follow any links that go outside this domain.
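To make the role of allowed_domains a bit more concrete, here is a rough stand-alone sketch (Python 3) of the kind of check it implies: a URL passes if its host is an allowed domain or a subdomain of one. The helper name is_allowed is made up for illustration, and this is not Scrapy's own code, just plain Python that mimics the idea.

```python
from urllib.parse import urlparse

ALLOWED_DOMAINS = ["metacritic.com"]  # same value as in the spider above


def is_allowed(url, allowed_domains=ALLOWED_DOMAINS):
    """Hypothetical helper: True if the URL's host is an allowed domain
    or a subdomain of one. Illustration only, not part of Scrapy."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


if __name__ == "__main__":
    for url in [
        "http://www.metacritic.com/browse/games/title/pc?page=0",
        "http://example.com/metacritic.com",  # domain name only appears in the path
    ]:
        print(url, is_allowed(url))
```

Note that the comparison is done on the host name, not on the whole URL, so a link that merely mentions metacritic.com in its path is still treated as off-site.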
Here comes the question of how to extract the data we want. In BeautifulSoup, we need to deal with the HTML ourselves, and sometimes it is a huge pain to navigate through the HTML tree. To avoid this, Scrapy uses XPath, a query language for selecting parts of an XML or HTML document. A few examples:
<li>element within the
//li selects all <li> elements within the html.
//li[@class="game"] selects all <li> elements that have attribute ‘class=”game”’, i.e. <li class="game">.
//li[contains(@class, "game")] selects ALL <li> elements whose “class” attribute contains the string “game”, i.e. <li class="game pc">, <li class="game xbox360">, <li class="game ps3">, etc.
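You can play with these ideas without Scrapy. The sketch below uses xml.etree.ElementTree from Python's standard library, which supports only a small subset of XPath (no functions such as contains()), so the substring query is emulated with a plain Python test. The HTML snippet is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed snippet (ElementTree needs valid XML)
DOC = """
<html><body><ul>
  <li class="game">Plain game</li>
  <li class="game pc">PC game</li>
  <li class="game ps3">PS3 game</li>
  <li class="movie">Not a game</li>
</ul></body></html>
""".strip()

root = ET.fromstring(DOC)

# Every <li> anywhere below the root
all_items = root.findall(".//li")

# Only <li> elements whose class attribute is exactly "game"
exact = root.findall(".//li[@class='game']")

# Emulates contains(@class, "game"): substring test on the attribute value
containing = [li for li in all_items if "game" in li.get("class", "")]

print(len(all_items), len(exact), len(containing))
```

The exact match finds a single element, while the substring test also picks up "game pc" and "game ps3", which is why the contains() form is the handy one when an element carries several classes.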
After inspecting the source of http://www.metacritic.com/browse/games/title/pc?page=0, we find that all the games are contained within the tags
<li class="product game_product">
    <div class="product_wrap">
        Game content
    </div>
</li>
We use XPath to select the games and put them in a list:
sites = hxs.select('//li[contains(@class, "product game_product")]')
For each game in the list, we create an item to store it:
item = MetacriticItem()
We find all the fields that we want, again with XPath. For example, we wish to get the title of the game:
item['title'] = site.select('div[@class="basic_stat product_title"]/a/text()').extract()
With XPath, we can fill out the function parse(self, response):
def parse(self, response):
    hxs = HtmlXPathSelector(response)  # The XPath selector
    # One selector per game on the page
    sites = hxs.select('//li[contains(@class, "product game_product")]/div[@class="product_wrap"]')
    items = []
    for site in sites:
        item = MetacriticItem()
        item['title'] = site.select('div[@class="basic_stat product_title"]/a/text()').extract()
        item['link'] = site.select('div[@class="basic_stat product_title"]/a/@href').extract()
        item['cscore'] = site.select('div[@class="basic_stat product_score brief_metascore"]/div/div/span[contains(@class, "data metascore score")]/text()').extract()
        item['uscore'] = site.select('div[@class="more_stats condensed_stats"]/ul/li/span[contains(@class, "data textscore textscore")]/text()').extract()
        item['date'] = site.select('div[@class="more_stats condensed_stats"]/ul/li/span[@class="data"]/text()').extract()
        items.append(item)
    return items
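One thing to keep in mind: extract() returns a list of strings, so every field of the item ends up as a list, even when only one value matched. A small post-processing step can tidy that up. The following is a minimal sketch in plain Python 3; the helper names first_text and to_number and the sample values are invented for illustration, and nothing here is part of Scrapy.

```python
def first_text(values):
    """Return the first non-empty, stripped string from a list, or None."""
    for value in values:
        text = value.strip()
        if text:
            return text
    return None


def to_number(text):
    """Convert a score string to float; None for missing or non-numeric text."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None


# Hypothetical raw item, shaped like what parse() produces (lists of strings)
raw = {
    "title": ["\n  Some Game\n"],
    "cscore": ["85"],
    "uscore": ["tbd"],  # invented placeholder for a score that is not a number
    "date": [],         # nothing matched the XPath
}

clean = {
    "title": first_text(raw["title"]),
    "cscore": to_number(first_text(raw["cscore"])),
    "uscore": to_number(first_text(raw["uscore"])),
    "date": first_text(raw["date"]),
}
print(clean)
```

Keeping this cleanup outside of parse() leaves the spider itself short, at the cost of one extra pass over the data.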
5. Crawl the web & Store the data
We’ve finished setting up the code. Now we can crawl the website and store the data as JSON. Yes, as JSON.
Go to the project directory, i.e. /metacritic/, and enter the command:
scrapy crawl metacritic -o metacritic.json -t json
That is to say, “Run the spider named metacritic and save the retrieved items in JSON format to a file named metacritic.json.”
Woohoo! Now we’ve got the games on the crawled page, with titles, critic scores, and user scores, saved as JSON objects! You can save the items in MongoDB, visualize them with D3, and do whatever you want with them!
That’s basically where I am now; I’ll continue to play with Scrapy and maybe write future blog posts about other features like pipelines, callbacks, etc.
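The output file is a JSON array with one object per item, so it can be read back with Python's json module. Here is a small sketch (Python 3), assuming each field holds a list of strings, which is what extract() returns. The sample records and the helper name top_by_cscore are invented so that the snippet runs on its own.

```python
import json


def top_by_cscore(items, n=3):
    """Return up to n (title, score) pairs, highest critic score first.
    Items without a title or with a non-numeric score are skipped."""
    scored = []
    for item in items:
        titles = item.get("title") or []
        scores = item.get("cscore") or []
        if titles and scores and scores[0].strip().isdigit():
            scored.append((titles[0].strip(), int(scores[0])))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]


# Invented sample standing in for the contents of metacritic.json
SAMPLE = '''[
  {"title": ["Game A"], "cscore": ["71"]},
  {"title": ["Game B"], "cscore": ["93"]},
  {"title": ["Game C"], "cscore": ["tbd"]}
]'''

items = json.loads(SAMPLE)
# With a real crawl you would instead read the file:
#   with open("metacritic.json") as f:
#       items = json.load(f)
print(top_by_cscore(items))
```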