
Crawl Reddit.com with Python scrapy


Nowadays there is a lot of information on the internet, but as long as it is scattered across separate web pages, no analysis can be done on it. In this article, we will visit Reddit.com, download posts, parse the information, and store it in a database with Python Scrapy.


The steps are:

1. Visiting URL

This includes finding the specific URL that holds the information you want and interacting with the website's server via GET or POST. In practice, that can mean sending a username and password, setting cookies, and so on (see the sketch after this list). At the end of this step, you get an HTML file with all the contents.

2. Parsing information

The previous step gives us an HTML file, which we could simply store as text. But that is not very helpful for further analysis because it is full of irrelevant content; we want to extract just the information. Doing so requires some background knowledge of HTML and CSS: identify which HTML element contains the information and parse it out!

3. Store information in database

This is an optional step. If you're only crawling a small number of pages, you can simply print the output to the console or save it in text files. But for big crawls, there are three reasons to use a database. First, it's safer: if the information is stored only in program variables, it's gone once the program terminates, and a crawler can get stuck or be terminated by network issues. That happens often! Second, it's more efficient: web data can be huge, so regularly writing to disk and clearing RAM is a better way to avoid memory issues. Third, it's easier to manage: data in a database can be inspected, copied, and converted more easily.
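As an illustration of step 1, below is a minimal sketch of how a Scrapy spider might POST a username and password and set a cookie. The URL, the form field names, and the after_login callback are hypothetical placeholders for illustration; they are not part of the Reddit crawler built later in this article.

import scrapy


class LoginSpider(scrapy.Spider):
    name = "login_demo"
    # Hypothetical login page, purely for illustration.
    start_urls = ['https://example.com/login']

    def parse(self, response):
        # POST a username and password; the form field names
        # depend on the target site's login form.
        yield scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'me', 'password': 'secret'},
            callback=self.after_login,
        )

    def after_login(self, response):
        # Cookies can also be attached to individual requests.
        yield scrapy.Request(
            'https://example.com/private',
            cookies={'sessionid': 'abc123'},
            callback=self.parse_private,
        )

    def parse_private(self, response):
        self.log('Visited %s' % response.url)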


Before I turned to Python Scrapy, I used Java with Jsoup. It can POST a username and password, set cookies, and parse information well, and I crawled a couple of websites with it. But Jsoup has a big weakness: it can't deal with JavaScript. When doing a GET or POST, it sends a request to the server and gets the response back. However, for modern web pages the response often contains JavaScript code (or links to JavaScript code) that needs to be executed to finish loading the page. Jsoup downloads that code without running it and ends up with an unloaded or half-loaded page.

What is Scrapy and how do we use it?

Scrapy is a powerful Python framework designed for crawling websites. I started by following this tutorial step by step. Really helpful!
https://doc.scrapy.org/en/latest/intro/tutorial.html
Basically, once the setup is finished, we no longer need to worry about implementation details. We only focus on the three steps: deciding the URLs, parsing information, and storing the output. In this example, everything can be done in the spider file.

Reddit Crawler

The following code is all in the spider file. I didn't change anything in the other files; they're the same as in the tutorial.
We'll need some other modules:

1. bs4 (BeautifulSoup), for parsing HTML.

2. pyodbc, for connecting to and writing into MS Access.

Before crawling, it's important to look at our target website, reddit.com. When it comes to Reddit, people might go to https://www.reddit.com/ and browse the boards (subreddits). Unfortunately, that won't work: I tried it myself and ended up wasting a whole week! Looking carefully at each board, we find that it only lists RECENT posts. Posts that are too old (for some boards, anything older than the current month) can't be found.
According to this discussion, we need to use ~/board/search...
Searching through a time range involves a special date-time format: Unix time. A converter to and from the human-readable format is Epoch Converter.
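Unix time can also be computed directly in Python. The following is a minimal sketch that converts a date range to Unix timestamps and assembles a timestamp search URL of the same shape as the one used in the spider below; the specific dates are just example values.

from datetime import datetime, timezone

# Convert human-readable dates to Unix timestamps (seconds since 1970-01-01 UTC).
start = int(datetime(2015, 3, 24, tzinfo=timezone.utc).timestamp())
end = int(datetime(2015, 7, 24, tzinfo=timezone.utc).timestamp())

# Build a timestamp search URL like the one used below.
url = ('https://www.reddit.com/r/LifeProTips/search?q=timestamp%3A'
       + str(start) + '..' + str(end)
       + '&sort=new&restrict_sr=on&syntax=cloudsearch')
print(url)

# And back again: from a Unix timestamp to a human-readable UTC date.
print(datetime.fromtimestamp(start, tz=timezone.utc))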

Now let's get to the code!

Start from the tutorial code

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

The above is the tutorial code.
QuotesSpider is a class that inherits from scrapy.Spider. It has a member, name, which is required for executing the spider. All spiders must have unique names, otherwise Scrapy won't know which spider is being called when you run it (note that the name variable is the identifier, not the class name). The start_requests(self) function is called when the spider is executed; inside it, requests are yielded with parse(self, response) as their callback.
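The tutorial runs this spider from the command line with scrapy crawl quotes (using the name, not the class name). If you prefer to launch it from a plain Python script instead, Scrapy also provides CrawlerProcess; a minimal sketch, assuming QuotesSpider is defined or imported in the same script:

from scrapy.crawler import CrawlerProcess

# Assumes the QuotesSpider class from the tutorial code above is available here.
process = CrawlerProcess()
process.crawl(QuotesSpider)
process.start()  # blocks until the crawl is finished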

Edit just a little bit

Now we can keep the structure while editing a few things to make it a Reddit crawler.
# -*- coding: utf-8 -*-
import scrapy
 
 
class RedditSpider(scrapy.Spider):
    name = "test1"
    allowed_domains = ["www.reddit.com"]  # domains only, not full URLs
    start_urls = (
        'https://www.reddit.com/r/LifeProTips/search?q=timestamp%3A1427232122..1437773560&sort=new&restrict_sr=on&syntax=cloudsearch',
    )
 
    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, self.parse, meta={
                'splash': {
                    'endpoint': 'render.html',
                    'args': { 'wait': 0.5 }
                }
            })

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = '%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)


The name is arbitrary; I call it "test1" temporarily. allowed_domains is a list of all domains the crawler is allowed to visit; if it is not set, ALL websites are allowed. start_urls is a list of URLs to visit. Here we only use one; we'll add more later. Now take a look at the start_requests(self) function. We iterate through all URLs in start_urls and call scrapy.Request(url, self.parse) on each of them. This sends a web request to the server, gets the response back, and passes the response to the parse function. The meta parameter is a dictionary with details about the request. Here we use the splash plugin for running JavaScript: endpoint specifies that the rendered HTML should be returned, and args asks Splash to wait 0.5 seconds so the JavaScript can run before the page is rendered.
Now move on to the parse function. Here, just to show what we crawled, we save the result in a .html file. Note that response.body is the HTML.
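A note on the splash meta key: it only has an effect if the scrapy-splash plugin is installed and a Splash server is running (for example via Docker on localhost:8050); plain Scrapy simply ignores it. If you do want the JavaScript rendering, the scrapy-splash documentation suggests settings along these lines in settings.py; this is a sketch assuming that setup, not something changed in this article's project files.

# settings.py additions for scrapy-splash (assumes a Splash server at localhost:8050)
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'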
After running this spider, we should see a file LifeProTips.html in the same directory as scrapy.cfg. Opening it in a browser, we'll see:





It's ugly indeed, because the CSS is not applied, but it contains all the information we need!
This is the minimum functional code.

How about NEXT button?

There's another problem. Scrolling down, we'll see that not all posts are presented; there's a "next" button linking to the next page.


Right-click on it and choose Inspect, and we see it links to a URL.
So now the question is: how do we make our crawler move on to the next page?

Before we can do so, we of course need to find out what the next page's URL is. Here, we'll need bs4. Documentation is available on the official website; it's not hard to learn.

    # requires: from bs4 import BeautifulSoup (at the top of the spider file)
    def parse(self, response):

        soup = BeautifulSoup(response.body, 'html.parser')

        # parse the HTML for the "next" link
        nextEle = soup.findAll("a", {"rel": "nofollow next"})
        if len(nextEle) != 0:
            link = nextEle[0]['href']
            print(link)
            # follow the link; parse will be called again on the next page
            yield scrapy.Request(link, self.parse, meta={
                'splash': {
                    'endpoint': 'render.html',
                    'args': {'wait': 0.5}
                }
            })


We first make a BeautifulSoup object, soup, from response.body, which is the HTML. We grab the HTML element that contains the next page's link as nextEle. If it exists (i.e. there is a next-page button), we extract the link as link; we can print it out to check that it is correct.
Then how do we visit the next page? Just yield a new scrapy.Request on that URL. By doing so, parse is called again on each following page, effectively recursing until all next pages have been visited.

Write into database

Here we use MS Access as our database. First we make a new table called LPT (since we'll store LifeProTips data) BY HAND. Then we add another member function, writedb(self, soup). It takes a BeautifulSoup object, parses the information, and stores it in the database. This function is called from parse.


    def parse(self, response):
        # same as above, but additionally call self.writedb(soup) on each page
        ...

    # requires: import pyodbc (at the top of the spider file)
    def writedb(self, soup):
        print("connecting...")
        conn_str = (
            r'DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};'
            r'DBQ=C:\Users\david_000\AppData\Local\Programs\Python\Python35\Scripts\tutorial\reddit.accdb;' 
        )
        cnxn = pyodbc.connect(conn_str)
        cursor = cnxn.cursor()
        print("connected")
        
        for post in soup.findAll("div", {"class" : "contents"})[0].children:
            
            titlesEle = post.findAll("a", { "class" : "search-title may-blank" })[0]
            pointEle = post.findAll("span", { "class" : "search-score" })[0]
            commentEle = post.findAll("a", {"class" : "search-comments may-blank"})[0]
            timeEle = post.findAll("time")[0]
            authorEle = post.findAll("span", { "class" : "search-author" })[0]
            contentEles = post.findAll("div", {"class" : "md"})
            if len(contentEles) != 0:
                contentEle = post.findAll("div", {"class" : "md"})[0]
                content = contentEle.text.replace("'", "''")
            else:
                content = ""
            
            title = titlesEle.text.replace("'", "''")
            point = pointEle.text.replace(" point", "")
            point = point.replace("s", "")
            comment = commentEle.text.replace(" comment", "")
            comment = comment.replace("s", "")
            time = timeEle['datetime'][0:10].replace("-", "")
            author = authorEle.text[3:]
            
            SQL = ("INSERT INTO LPT (title, point, comment, dt, author, content) VALUES (" 
                  "'" + title + "'" 
                  ",'" + point + "'" 
                  ",'" + comment + "'" 
                  ",'" + time + "'" 
                  ",'" + author + "'" 
                  ",'" + content + "'" 
                  ")")
                           
            cursor.execute(SQL)
            cnxn.commit()
        
        cursor.close()
        cnxn.close()

The first few lines connect to the database: cnxn is a connection object, and cursor is used to interact with the database. The first half of the for loop is nothing more than identifying the elements that hold the information and parsing them. SQL holds the SQL command for writing into LPT. Remember to call cnxn.commit() after execute, otherwise the table won't be updated.
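As an aside, pyodbc also supports parameterized queries, which avoid the manual quote escaping done above; a minimal sketch of the same insert, assuming the same LPT table and the variables parsed in the loop:

# Same insert as above, but with pyodbc parameter placeholders (?),
# so quotes in the text no longer need to be escaped by hand.
SQL = ("INSERT INTO LPT (title, point, comment, dt, author, content) "
       "VALUES (?, ?, ?, ?, ?, ?)")
cursor.execute(SQL, (title, point, comment, time, author, content))
cnxn.commit()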

Conclusion

This article only presents the basic parts; the full code is available on my GitHub. Note that this crawler doesn't record comments, though it could be modified to do so. To use it, make an Access database reddit.accdb with a table that has the columns title, point, comment, dt, author, content, and set the board name and date range in the constructor.
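If you would rather create the table from code than by hand, something along these lines should work through pyodbc. The column types here are my own assumption (the article does not specify them), with LONGTEXT used for the longer text fields, and the DBQ path is a placeholder to adjust.

import pyodbc

conn_str = (
    r'DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};'
    r'DBQ=C:\path\to\reddit.accdb;'  # adjust to your database location
)
cnxn = pyodbc.connect(conn_str)
cursor = cnxn.cursor()

# Column types are an assumption; the article creates this table by hand.
cursor.execute(
    "CREATE TABLE LPT ("
    "title LONGTEXT, point TEXT(50), comment TEXT(50), "
    "dt TEXT(20), author TEXT(100), content LONGTEXT)"
)
cnxn.commit()
cursor.close()
cnxn.close()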
