Web Scraping With Beautiful Soup

Beautiful Soup is an excellent library for parsing a web page and building a structured representation of it. It lets you reach any element of the DOM by tag name, ID, class, or any other attribute.

I use it a lot in my data science projects because it lets me build datasets from almost any website. It is a tool every data scientist should know.

In this post, I will show you how to:

1- Install Beautiful Soup

2- Scrape Box Office Mojo website

3- Create a Dataset of movies

This tutorial assumes that you have already installed Python >= 2.7.2 and pip, a tool for installing and managing packages from the Python Package Index. I also assume that you have basic knowledge of HTML, CSS, the DOM and Python.


I) How to install Beautiful Soup

You can easily install Beautiful Soup using pip or any other Python package manager. If you don’t have pip, run through a quick tutorial on installing Python modules to get it running. Then run the following command in your terminal:

Beautiful Soup Installation
$ pip install beautifulsoup4

Once you have installed Beautiful Soup, you can see it in action in your Python interpreter:

Beautiful Soup Test
>>> import urllib2
>>> from bs4 import BeautifulSoup
>>> page = urllib2.urlopen('http://www.boxofficemojo.com/movies/?id=matrix.htm')
>>> soup = BeautifulSoup(page, 'html.parser')
>>> links = soup('a')  # shorthand for soup.find_all('a')
>>> links[0]
<a href="/goto.php?a=5" target="4"><font face="Verdana" size="3"><b>'Guardians of the Galaxy' passes $300 million... &gt;</b></font><br/></a>
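
To make the lookups by tag name, ID, class and attribute concrete, here is a minimal sketch that runs offline. The HTML snippet is hypothetical markup invented for this example, not a copy of a Box Office Mojo page, so treat it as an illustration of the calls rather than of the real site.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this example.
html = """
<html><body>
  <h1 id="title">The Matrix</h1>
  <p class="info">Genre: Sci-Fi</p>
  <p class="info">Runtime: 136 min</p>
  <a href="/people/?id=1" target="_blank">Cast</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

by_tag = soup.find("a")                        # first element with this tag name
by_id = soup.find(id="title")                  # element whose id is "title"
by_class = soup.find_all("p", class_="info")   # every <p> with class "info"
by_attr = soup.find("a", target="_blank")      # match on any other attribute

print(by_tag["href"])
print(by_id.text)
print([p.text for p in by_class])
print(by_attr.text)
```

find returns the first match (or None), while find_all returns a list of every match, which is why the class lookup above is iterated.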

II) Scraping Box Office Mojo

Box Office Mojo is an online movie publication and box office reporting service. I use it a lot when I need data on movies, such as title, cast, director, domestic total gross and opening weekend gross.

Let’s start by getting data on one of my favorite movies: The Matrix.

Create a new file called “bomojoscraper.py” and insert the following code:

bomojoscraper.py
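import re
import urllib2

from bs4 import BeautifulSoup


class BOMojoScraper(object):
    def __init__(self, url):
        self.url = url
        self.soup = self.connect()

    def connect(self):
        '''
        Downloads the page and returns the BeautifulSoup object.
        '''
        page = urllib2.urlopen(self.url)
        # Name the parser explicitly so the result does not depend on
        # which parsers happen to be installed.
        soup = BeautifulSoup(page, 'html.parser')
        return soup

    def get_movie_director(self):
        '''
        Returns the director's name, or None if no director link is found.
        '''
        # Raw string: the backslash escapes '?' for the regex engine.
        anchor = self.soup.find('a', href=re.compile(r'^/people/chart/\?view=Director'))
        director = None
        try:
            director = anchor.text
        except AttributeError:
            # find() returned None: the page has no director link.
            pass
        return director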

The BOMojoScraper class has a method, get_movie_director, that searches for an anchor tag whose href matches the regular expression '^/people/chart/\?view=Director', that is, a link starting with /people/chart/?view=Director. It then tries to extract the link text and returns it, or None if no such link exists.
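
To see what that pattern accepts and rejects, here is a small sketch that uses only the re module. The href values are made up for illustration. Beautiful Soup applies a compiled pattern with its search() method, so it is the leading '^' that restricts the match to the start of the href.

```python
import re

# The pattern from get_movie_director, written as a raw string.
DIRECTOR_LINK = re.compile(r'^/people/chart/\?view=Director')

# Without the backslash, '?' makes the preceding '/' optional
# instead of matching a literal question mark.
UNESCAPED = re.compile(r'^/people/chart/?view=Director')

# Hypothetical href values, invented for illustration.
hrefs = [
    '/people/chart/?view=Director&id=example.htm',  # director link
    '/people/chart/?view=Actor&id=example.htm',     # same path, other view
    '/movies/people/chart/?view=Director',          # prefix not at the start
]

matches = [h for h in hrefs if DIRECTOR_LINK.search(h)]
print(matches)

# The unescaped pattern fails on the very link it was meant to find.
print(UNESCAPED.search(hrefs[0]) is None)
```

Only the first href survives the filter, and the second print shows why the question mark has to be escaped.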

You can test this by running the following commands in your Python interpreter:

In [1]: from bomojoscraper import BOMojoScraper

In [2]: url = 'http://www.boxofficemojo.com/movies/?id=matrix.htm'

In [3]: scraper = BOMojoScraper(url)

In [4]: scraper.get_movie_director()
Out[4]: u'Andy & Lana Wachowski'

The same approach extends to other fields on the page, for example the production budget:

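# Methods of BOMojoScraper: indent them inside the class body.
def get_budget(self):
    '''
    Gets the budget as a string, extracts the number, then converts it to an integer.
    Returns -1 if the budget cannot be found or parsed.
    '''
    label = self.soup.find(text=re.compile('Production Budget: '))
    budget = -1
    try:
        budget_string = label.find_next_sibling().text
        # e.g. '$63 million' -> amount '63', unit 'million'
        amount, unit = budget_string.split('$')[1].split(' ')[:2]
        multipliers = {'million': 1000000, 'thousand': 1000}
        # Multiply instead of appending zeros so that decimal amounts
        # such as '1.5 million' are converted correctly.
        budget = int(round(float(amount) * multipliers[unit]))
    except (AttributeError, IndexError, ValueError, KeyError):
        # Label missing or unexpected format: keep the -1 default.
        pass
    return budget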

def get_domestic_(self):
    '''
    Same lookup as get_budget: searches for the 'Production Budget: ' label
    and returns the amount as an integer, or -1 on failure.
    '''
    return self.get_budget()
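
The string-to-integer step in get_budget can be tried on its own, without downloading a page. The sketch below is a standalone variant, not the method above: parse_budget and UNITS are hypothetical names, the sample strings are made up, and it additionally assumes that an amount without a unit word is a plain dollar figure.

```python
import re

# Multipliers for the unit words that get_budget handles.
UNITS = {'million': 10 ** 6, 'thousand': 10 ** 3}


def parse_budget(budget_string):
    """Convert a string such as '$63 million' to an integer, or -1 on failure."""
    match = re.search(r'\$([\d.,]+)(?:\s+(million|thousand))?', budget_string)
    if match is None:
        return -1
    try:
        # Drop thousands separators before converting.
        amount = float(match.group(1).replace(',', ''))
    except ValueError:
        return -1
    # Assumption: no unit word means the amount is already in dollars.
    multiplier = UNITS.get(match.group(2), 1)
    # round() guards against floating-point results such as 2299999.9999.
    return int(round(amount * multiplier))


for sample in ['$63 million', '$1.5 million', '$500 thousand', 'N/A']:
    print(sample + ' -> ' + str(parse_budget(sample)))
```

Multiplying the parsed number by the unit keeps decimal amounts correct, which appending zeros to the string does not.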
