Beautiful Soup is an excellent library for parsing a web page and building a structured representation of it. It lets you access any element of the DOM by type, ID, class or any other tag attribute.
I use it a lot in my Data Science projects, as it allows me to create datasets from any website. It is a tool that every good data scientist needs to know.
This tutorial assumes that you have already installed Python >= 2.7.2 and pip, the tool for installing and managing Python packages from the Python Package Index.
I also assume you have basic knowledge of HTML, CSS, the DOM and Python.
I) How to install Beautiful Soup
You can easily install Beautiful Soup using pip or any Python package manager.
If you don’t have pip, run through a quick tutorial on installing Python modules to get it running.
Run the following command in your terminal:
Beautiful Soup Installation
    $ pip install beautifulsoup4
Once you have installed Beautiful Soup, you can see it in action in your Python interpreter:
Beautiful Soup Test
    >>> import urllib2
    >>> from bs4 import BeautifulSoup
    >>> page = urllib2.urlopen('http://www.boxofficemojo.com/movies/?id=matrix.htm')
    >>> soup = BeautifulSoup(page)
    >>> links = soup('a')
    >>> links[0]
    <a href="/goto.php?a=5" target="4"><font face="Verdana" size="3"><b>'Guardians of the Galaxy' passes $300 million...></b></font><br/></a>
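The session above calls soup('a') to collect every anchor tag on a live page. To try the other kinds of lookup (by ID, by class, by attribute) without a network connection, here is a minimal sketch that parses an inline HTML string instead. The markup, the id, the class name and the people in it are all made up for illustration and are not taken from Box Office Mojo. It passes 'html.parser' explicitly so that Beautiful Soup uses the parser bundled with Python.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this example only.
html = """
<html><body>
  <h1 id="title">The Matrix</h1>
  <p class="credit">Directed by <a href="/people/1">Someone</a></p>
  <p class="credit">Written by <a href="/people/2">Someone Else</a></p>
</body></html>
"""

soup = BeautifulSoup(html, 'html.parser')

by_type = soup.find_all('a')                    # every anchor tag; soup('a') is shorthand for this
by_id = soup.find(id='title')                   # the element whose id is "title"
by_class = soup.find_all('p', class_='credit')  # paragraphs whose class is "credit"
by_attr = soup.find('a', href='/people/2')      # first anchor with this exact href

print(len(by_type))
print(by_id.text)
print(len(by_class))
print(by_attr.text)
```

Note the trailing underscore in class_: class is a reserved word in Python, so Beautiful Soup uses class_ as the keyword argument.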
II) Scraping Box Office Mojo
Box Office Mojo is an online movie publication and box office reporting service.
I use it a lot when I need to get data on movies, such as title, cast, director, domestic total gross, opening weekend gross, etc.
Let’s start by getting data on one of my favorite movies: The Matrix.
Create a new file called “bomojoscraper.py” and insert the following code:
bomojoscraper.py
    import re
    import urllib2

    from bs4 import BeautifulSoup


    class BOMojoScraper(object):

        def __init__(self, url):
            self.url = url
            self.soup = self.connect()

        def connect(self):
            ''' returns the BeautifulSoup object '''
            page = urllib2.urlopen(self.url)
            soup = BeautifulSoup(page)
            return soup

        def get_movie_director(self):
            # Find the first anchor whose href starts with /people/chart/?view=Director
            anchor = self.soup.find('a', href=re.compile(r'^/people/chart/\?view=Director'))
            director = None
            try:
                director = anchor.text
            except AttributeError:
                # find() returned None: no matching anchor on the page
                pass
            return director
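The BOMojoScraper class has a method, get_movie_director, that searches for an anchor tag whose href matches the regular expression '^/people/chart/\?view=Director', that is, a link that starts with /people/chart/?view=Director. It then tries to extract the anchor text and returns it, or returns None if no such anchor is found.
You can test this by running the following commands in your Python interpreter (an IPython session is shown here):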
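To see what that pattern accepts and rejects, here is a small sketch that applies it to a few href values using only the re module. The hrefs are hypothetical, made up to show one match and two non-matches. Beautiful Soup filters attribute values with the pattern's search() method, so the same call is used here.

```python
import re

# Same pattern the scraper passes to soup.find().
director_link = re.compile(r'^/people/chart/\?view=Director')

# Hypothetical href values, invented for this example.
hrefs = [
    '/people/chart/?view=Director&id=example.htm',  # starts with the prefix
    '/people/chart/?view=Actor&id=example.htm',     # different view
    '/movies/people/chart/?view=Director',          # prefix is not at the start
]

matches = [href for href in hrefs if director_link.search(href)]
print(matches)
```

Only the first href is kept. The ^ anchors the match to the start of the string, and the backslash before ? makes it a literal question mark instead of a regex quantifier.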
    In [1]: from bomojoscraper import BOMojoScraper

    In [2]: url = 'http://www.boxofficemojo.com/movies/?id=matrix.htm'

    In [3]: scraper = BOMojoScraper(url)

    In [4]: scraper.get_movie_director()
    Out[4]: u'Andy & Lana Wachowski'
Other fields can be extracted the same way. The following methods, added to the BOMojoScraper class, look for the “Production Budget: ” label and convert the value that follows it into an integer:

        def get_budget(self):
            ''' Gets the budget as a string, extracts the number, then converts it into an integer. '''
            label = self.soup.find(text=re.compile('Production Budget: '))
            budget = -1  # returned when the budget is missing or in an unexpected format
            try:
                budget_string = label.find_next_sibling().text
                # Split a string such as '$63 million' into the amount and its unit
                amount, unit = budget_string.split('$')[1].split(' ')[:2]
                if unit == 'million':
                    budget = int(round(float(amount) * 1000000))
                elif unit == 'thousand':
                    budget = int(round(float(amount) * 1000))
            except (AttributeError, IndexError, ValueError):
                pass
            return budget

        def get_domestic_(self):
            ''' Gets the budget as a string, extracts the number, then converts it into an integer. '''
            label = self.soup.find(text=re.compile('Production Budget: '))
            budget = -1
            try:
                budget_string = label.find_next_sibling().text
                amount, unit = budget_string.split('$')[1].split(' ')[:2]
                if unit == 'million':
                    budget = int(round(float(amount) * 1000000))
                elif unit == 'thousand':
                    budget = int(round(float(amount) * 1000))
            except (AttributeError, IndexError, ValueError):
                pass
            return budget
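The string handling in get_budget can be tried on its own, without fetching a page. Below is a minimal sketch of that conversion as a standalone function. The name parse_budget and the sample strings are assumptions made for this example and do not come from the scraper or from Box Office Mojo. It multiplies the amount as a float, so a decimal amount such as '$1.5 million' becomes 1500000.

```python
def parse_budget(budget_string):
    """Convert a string such as '$63 million' into an integer number of dollars.

    Returns -1 when the string is not in the expected format.
    """
    multipliers = {'million': 1000000, 'thousand': 1000}
    try:
        # '$63 million' -> '63 million' -> ['63', 'million']
        amount, unit = budget_string.split('$')[1].split(' ')[:2]
        return int(round(float(amount) * multipliers[unit]))
    except (IndexError, ValueError, KeyError):
        # No '$', no unit, a non-numeric amount, or an unknown unit
        return -1


# Hypothetical inputs, made up to exercise each branch.
print(parse_budget('$63 million'))
print(parse_budget('$1.5 million'))
print(parse_budget('$500 thousand'))
print(parse_budget('N/A'))
```

Returning -1 for anything unexpected mirrors the default used by get_budget, so missing budgets are easy to filter out of a dataset later.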