How to use simple Python libraries and built-in capabilities to scrape the web for movie information and store it in a local SQLite database.
Often, after a few introductory courses in Python, beginners wonder how to write a cool Python program that demonstrates somewhat advanced capabilities of the language, such as web scraping or database manipulation. In this article, I will show how to use simple Python libraries and built-in capabilities to scrape the web for movie information and store it in a local SQLite database, which can later be queried for movie data analytics. Think of this as a project to build your own mini IMDb database!
This type of data engineering task (gathering data from the web and building a database connection) is often the first step in a data analytics project. Before you do any cool predictive modeling, you need to master this step. It is often messy and unstructured, i.e. there is no one-shot formula or one-stop-shop library that does it all for you. You have to extract the data from the web, examine its structure, and build your code to crawl through it flawlessly.
Specifically, this demo will show the usage of the following features:
- Python urllib library
- Web API service (with a secret key) for retrieving data
- Python json library
- Python OS module
- Python SQLite library
Brief descriptions of these are given below.
Python urllib module
The gateway from Python to the web is the urllib module. It is a Python module for fetching URLs (Uniform Resource Locators). It offers a very simple interface in the form of the urlopen function, which is capable of fetching URLs using a variety of different protocols. It also offers a slightly more complex interface for handling common situations, like basic authentication, cookies, and proxies. These are provided by objects called handlers and openers.
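For instance, fetching a web page can be as simple as the following (a minimal sketch; the URL is just an example):

```python
from urllib.request import urlopen

# Fetch a page and read its contents as text.
with urlopen("http://www.example.com") as response:
    html = response.read().decode("utf-8")

print(html[:200])  # first 200 characters of the page
```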
Web API service (with a secret key) for retrieving data
Web scraping is often done through API services hosted by external websites. Think of them as repositories or remote databases that you can query by sending a search string from your own little program. In this particular example, we will take help from the Open Movie Database (OMDb) website, which gives an API key to registered users for downloading information about movies. Because it is a free service, it has a restriction of 1000 requests per day. Note that you have to register on their website and get your own API key for making requests from your Python program.
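For instance, a request URL for this service can be built as follows (a minimal sketch; t and apikey are OMDb's query parameters, and the key shown is a placeholder):

```python
from urllib.parse import urlencode

# Build an OMDb request URL. Replace PLACEHOLDER with your own key.
serviceurl = "http://www.omdbapi.com/?"
params = {"t": "Titanic", "apikey": "PLACEHOLDER"}
url = serviceurl + urlencode(params)
print(url)  # http://www.omdbapi.com/?t=Titanic&apikey=PLACEHOLDER
```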
The data obtained from this API service comes back as JSON. Therefore, we need to parse/convert it into a Python object that we can work with easily.
Python json module
JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy for humans to read and write. It is easy for machines to parse and generate. It is based on a subset of the JavaScript Programming Language, Standard ECMA-262 3rd Edition — December 1999. JSON is a text format that is completely language independent but uses conventions that are familiar to programmers of the C-family of languages, including C, C++, C#, Java, JavaScript, Perl, Python, and many others. These properties make JSON an ideal data-interchange language.
The json library can parse JSON from strings or files. The library parses JSON into a Python dictionary or list. It can also convert Python dictionaries or lists into JSON strings. It is an extremely useful module and very simple to learn. This module is likely to be used in any Python-based web data analytics program, as the majority of web services nowadays use JSON as the primary format when returning data.
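A quick illustration of the round trip between JSON text and Python objects:

```python
import json

# A string of JSON text, as returned by a web service
json_text = '{"Title": "Titanic", "Year": "1997"}'

movie = json.loads(json_text)       # JSON text -> Python dict
print(movie["Title"])               # Titanic

text = json.dumps(movie, indent=2)  # Python dict -> JSON text
print(text)
```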
Python OS module
This module provides a portable way of using operating system dependent functionality. If you just want to read or write a file, see open(); if you want to manipulate paths, see the os.path module; and if you want to read all the lines in all the files on the command line, see the fileinput module. For creating temporary files and directories, see the tempfile module, and for high-level file and directory handling, see the shutil module. In this demo, we will use OS module methods for checking for an existing directory and manipulating files to save some data.
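For example, checking for (and creating) a local directory to hold downloaded posters might look like this (the directory name is just illustrative):

```python
import os

# Build a path to a 'Posters' folder under the current working
# directory and create it if it does not already exist.
savelocation = os.path.join(os.getcwd(), "Posters")
if not os.path.isdir(savelocation):
    os.makedirs(savelocation)
```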
SQLite and Python sqlite3
SQLite is a C library that provides a lightweight disk-based database that doesn’t require a separate server process and allows accessing the database using a nonstandard variant of the SQL query language. Some applications can use SQLite for internal data storage. It’s also possible to prototype an application using SQLite and then port the code to a larger database such as PostgreSQL or Oracle. The sqlite3 module of Python provides a SQL interface compliant with the DB-API 2.0 specification.
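A minimal sketch of opening a local database file and creating a table (the file and table names are illustrative):

```python
import sqlite3

# Open (or create) a local database file and create a table.
conn = sqlite3.connect("movieinfo.sqlite")
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS MovieInfo (Title TEXT, Year TEXT)")
conn.commit()
conn.close()
```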
Main Flow of the Program
The flow of the program is shown below. Please note that the boilerplate code is available in my GitHub repository. Please download/fork/star it if you like it.
Fig: Main flow of the program
The basic idea is to send a request to the external API with a movie title entered by the user. The program then tries to download the data and, if successful, prints it out.
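A sketch of this step, combining the urllib and json pieces shown above (the function name and error handling here are illustrative; the exact code is in the repository):

```python
import json
import urllib.error
from urllib.parse import urlencode
from urllib.request import urlopen

def search_movie(title, apikey):
    """Query OMDb for a movie title and return the decoded
    JSON data as a dictionary, or None on failure."""
    url = "http://www.omdbapi.com/?" + urlencode({"t": title, "apikey": apikey})
    try:
        with urlopen(url) as response:
            data = json.loads(response.read().decode("utf-8"))
    except urllib.error.URLError as e:
        print("Could not reach the API:", e)
        return None
    # OMDb sets "Response" to "True"/"False" and supplies an
    # "Error" message when the lookup fails.
    if data.get("Response") == "True":
        print(json.dumps(data, indent=2))
        return data
    print("Error encountered:", data.get("Error"))
    return None
```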
Just as an example, the retrieved JSON data looks like the following:
Fig: An example JSON file retrieved from the API.
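For reference in text form, a trimmed response might look like the following (field names follow the OMDb format; the values are only for illustration):

```json
{
  "Title": "Titanic",
  "Year": "1997",
  "Rated": "PG-13",
  "Runtime": "194 min",
  "Genre": "Drama, Romance",
  "Director": "James Cameron",
  "Country": "USA",
  "imdbRating": "7.9",
  "Poster": "https://.../titanic.jpg",
  "Response": "True"
}
```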
If the program finds a link to an image file for the poster of the movie, it asks the user whether (s)he wants to download it. If the user says OK, it downloads the image file to a local directory, with the movie title as the file name.
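A sketch of the poster-download step (Poster is the field OMDb uses for the image link; the function and directory names are illustrative):

```python
import os
from urllib.request import urlopen

def save_poster(json_data, save_dir="Posters"):
    """Download the poster image linked in the movie data and save
    it locally, using the movie title as the file name."""
    poster_url = json_data.get("Poster")
    if not poster_url or poster_url == "N/A":
        return  # no poster available for this movie
    os.makedirs(save_dir, exist_ok=True)
    extension = os.path.splitext(poster_url)[1]  # keep e.g. '.jpg'
    filename = os.path.join(save_dir, json_data["Title"] + extension)
    with urlopen(poster_url) as response, open(filename, "wb") as f:
        f.write(response.read())
```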
Next, it asks the user whether (s)he wants to save some basic information about the movie in a local database. If the user gives the nod, it creates a SQLite database (if one does not already exist) and inserts a subset of the downloaded movie information into it.
The notebook defines a function to save the data in the database.
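The exact version lives in the repository; a minimal sketch, assuming a table named MovieInfo and an illustrative subset of columns, looks like this:

```python
import sqlite3

def save_in_database(json_data, db_name="movieinfo.sqlite"):
    """Store a subset of the downloaded movie fields in a
    local SQLite table."""
    conn = sqlite3.connect(db_name)
    cur = conn.cursor()
    cur.execute("""CREATE TABLE IF NOT EXISTS MovieInfo
                   (Title TEXT, Year TEXT, Runtime TEXT,
                    Country TEXT, imdbRating TEXT)""")
    cur.execute("INSERT INTO MovieInfo VALUES (?, ?, ?, ?, ?)",
                (json_data.get("Title"), json_data.get("Year"),
                 json_data.get("Runtime"), json_data.get("Country"),
                 json_data.get("imdbRating")))
    conn.commit()
    conn.close()
```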
The notebook also contains a function to save the information in an Excel file from an existing database.
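One way such a function might work, e.g. with pandas (assuming it and an Excel engine such as openpyxl are installed; the exact code is in the notebook):

```python
import sqlite3
import pandas as pd

def save_in_excel(filename="movieinfo.xlsx", db_name="movieinfo.sqlite"):
    """Read the MovieInfo table into a DataFrame and write it
    out as an Excel file."""
    conn = sqlite3.connect(db_name)
    df = pd.read_sql_query("SELECT * FROM MovieInfo", conn)
    conn.close()
    df.to_excel(filename, index=False)
```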
A word about the secret API key
You will notice that the program uses a secret API key for accessing the data. This key can be obtained for free from the OMDb website and used up to 1000 times a day. Using a secret (user-specific) key is a very common practice in web scraping. The way I protect the integrity of my personal API key is to create a small JSON file, called APIkeys.json, in the same directory as the Jupyter notebook. The contents of this file are hidden from the external user who sees my code. My Jupyter notebook reads this JSON file as a dictionary, copies the key corresponding to the movie website, and appends it to the encoded URL request string that is sent by the urllib.request method.
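A sketch of this scheme (the file name APIkeys.json is the one used in this demo; the dictionary key inside it is hypothetical):

```python
import json
from urllib.parse import urlencode

# Read the secret key from a local file that is kept out of
# version control. "OMDBapi" is a hypothetical key name; use
# whatever name you stored your key under.
with open("APIkeys.json") as f:
    keys = json.load(f)
omdbapi = keys["OMDBapi"]

serviceurl = "http://www.omdbapi.com/?"
apikey = "&apikey=" + omdbapi  # appended to the encoded request string
url = serviceurl + urlencode({"t": "Titanic"}) + apikey
```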
Summary
This article goes over a demo Python notebook to illustrate how to retrieve basic information about movies using a free API service and to save the movie posters and the downloaded information in a lightweight SQLite database.
Above all, it demonstrates simple usage of Python libraries such as urllib, json, and sqlite3, which are extremely useful (and powerful) tools for data analytics and web data mining tasks.