Step-by-step guide to build your own ‘mini IMDB’ database

Tirthajyoti Sarkar
February 15, 2019 Big Data, Cloud & DevOps


How to use simple Python libraries and built-in capabilities to scrape the web for movie information and store it in a local SQLite database.

Often, after a few introductory courses in Python, beginners wonder how to write a cool Python program that demonstrates somewhat advanced capabilities of the language, such as web scraping or database manipulation. In this article, I will show how to use simple Python libraries and built-in capabilities to scrape the web for movie information and store it in a local SQLite database, which can later be queried for movie data analytics. Think of this as a project to build your own mini IMDB database!

This type of data engineering task, gathering data from the web and building a database connection, is often the first step in a data analytics project. Before you do any cool predictive modeling, you need to master this step. It is often messy and unstructured, i.e. there is no one-shot formula or one-stop-shop library that does it all for you. So, you have to extract the data from the web, examine its structure, and build your code to crawl through it flawlessly.

Specifically, this demo will show the usage of the following features:

  • Python urllib library
  • Web API service (with a secret key) for retrieving data
  • Python json library
  • Python OS module
  • Python SQLite library

Brief descriptions of these are given below.

Python urllib module

The gateway from Python to the web is the urllib module. It is a Python module for fetching URLs (Uniform Resource Locators). It offers a very simple interface, in the form of the urlopen function, which is capable of fetching URLs using a variety of different protocols. It also offers a slightly more complex interface for handling common situations such as basic authentication, cookies, and proxies. These are provided by objects called handlers and openers.
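
For instance, a minimal fetch, using example.com as a stand-in URL, looks like this:

```python
import urllib.error
import urllib.request

# Fetch a page and read its contents as text
try:
    with urllib.request.urlopen("http://www.example.com") as response:
        html = response.read().decode("utf-8")  # bytes -> str
        print(html[:300])                       # print the first 300 characters
except urllib.error.URLError as e:
    print("Could not reach the server:", e.reason)
```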

Web API service (with a secret key) for retrieving data

Web scraping is often done through API services hosted by external websites. Think of them as a repository or remote database that you can query by sending a search string from your own little program. In this particular example, we will take help from the Open Movie Database (OMDB) website, which gives registered users an API key for downloading information about movies. Because it is a free service, there is a restriction of 1,000 requests per day. Note that you have to register on their website and get your own API key to make requests from your Python program.

The data obtained from this API service comes back as JSON. Therefore, we need to parse/convert the JSON into a Python object that we can work with easily.
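
A sketch of such a query (the placeholder YOUR_API_KEY below stands in for your own registered key):

```python
import json
import urllib.parse
import urllib.request

# Build the OMDB query URL; 'YOUR_API_KEY' is a placeholder
params = urllib.parse.urlencode({"apikey": "YOUR_API_KEY", "t": "Titanic"})
url = "http://www.omdbapi.com/?" + params

with urllib.request.urlopen(url) as response:
    data = json.loads(response.read().decode("utf-8"))  # JSON text -> dict

# OMDB reports success or failure in the 'Response' field
if data.get("Response") == "True":
    print(data["Title"], data["Year"], data.get("imdbRating"))
else:
    print("Error:", data.get("Error"))
```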


Python json module

JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy for humans to read and write. It is easy for machines to parse and generate. It is based on a subset of the JavaScript Programming Language, Standard ECMA-262 3rd Edition — December 1999. JSON is a text format that is completely language independent but uses conventions that are familiar to programmers of the C-family of languages, including C, C++, C#, Java, JavaScript, Perl, Python, and many others. These properties make JSON an ideal data-interchange language.

The json library can parse JSON from strings or files, turning it into a Python dictionary or list. It can also convert Python dictionaries or lists into JSON strings. It is an extremely useful module and very simple to learn. It is likely to be used in any Python-based web data analytics program, as the majority of web services nowadays return data as JSON.
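
The round trip is illustrated below with a hand-written record (the Title/Year fields merely mimic the shape of an OMDB response):

```python
import json

# Parse a JSON string into a Python dictionary
record = json.loads('{"Title": "Titanic", "Year": "1997"}')
print(record["Title"])             # -> Titanic

# Serialize the dictionary back into a (pretty-printed) JSON string
print(json.dumps(record, indent=2))
```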

Python OS module

This module provides a portable way of using operating system dependent functionality. If you just want to read or write a file, see open(); if you want to manipulate paths, see the os.path module; and if you want to read all the lines in all the files on the command line, see the fileinput module. For creating temporary files and directories, see the tempfile module, and for high-level file and directory handling, see the shutil module. In this demo, we will use os module methods to check for an existing directory and to manipulate file paths while saving some data.
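
For example, here is a sketch of the directory check done before saving posters (the "Posters" directory name is my own choice for illustration):

```python
import os

# Check for (and create, if missing) a local directory for saved posters
poster_dir = os.path.join(os.getcwd(), "Posters")
if not os.path.isdir(poster_dir):
    os.makedirs(poster_dir)

# Build a platform-independent file path from a movie title
poster_path = os.path.join(poster_dir, "Titanic.jpg")
print(poster_path)
```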

SQLite and Python SQLite3

SQLite is a C library that provides a lightweight disk-based database that doesn’t require a separate server process and allows accessing the database using a nonstandard variant of the SQL query language. Some applications can use SQLite for internal data storage. It’s also possible to prototype an application using SQLite and then port the code to a larger database such as PostgreSQL or Oracle. The sqlite3 module of Python provides a SQL interface compliant with the DB-API 2.0 specification.
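
The basic pattern, connect, create a table, insert, query, takes only a few lines (the file, table, and column names below are illustrative):

```python
import sqlite3

# Connect to a database file (it is created if it does not exist)
conn = sqlite3.connect("movieinfo.sqlite")
cur = conn.cursor()

cur.execute("""CREATE TABLE IF NOT EXISTS MovieInfo
               (Title TEXT PRIMARY KEY, Year INTEGER, imdbRating REAL)""")
cur.execute("INSERT OR REPLACE INTO MovieInfo VALUES (?, ?, ?)",
            ("Titanic", 1997, 7.9))
conn.commit()

for row in cur.execute("SELECT * FROM MovieInfo"):
    print(row)

conn.close()
```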

Main Flow of the Program

The flow of the program is shown below. Please note that the boilerplate code is available in my GitHub repository. Please download/fork/star if you like it.

Fig: Main flow of the program

The basic idea is to send a request to the external API with a movie title entered by the user. The program then tries to download the data and, if successful, prints it out.
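
Wrapping the earlier request-and-parse snippet into a function, this step might look like the following sketch (the function name search_movie is my own, not necessarily the notebook's):

```python
import json
import urllib.error
import urllib.parse
import urllib.request

def search_movie(title, apikey):
    """Query OMDB for a title; print and return the record if found."""
    params = urllib.parse.urlencode({"apikey": apikey, "t": title})
    url = "http://www.omdbapi.com/?" + params
    try:
        with urllib.request.urlopen(url) as response:
            data = json.loads(response.read().decode("utf-8"))
    except urllib.error.URLError as e:
        print("Request failed:", e.reason)
        return None
    if data.get("Response") == "True":
        for key, value in data.items():
            print(f"{key}: {value}")
        return data
    print("Error:", data.get("Error"))
    return None

movie = search_movie(input("Enter a movie title: "), "YOUR_API_KEY")
```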

Just as an example, the retrieved JSON looks like the following:

Fig: An example JSON file retrieved from the API.

If the program finds a link to an image file for the movie poster, it asks the user whether (s)he wants to download it. If the user says OK, it downloads the image file to a local directory, using the movie title as the file name.
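
A minimal sketch of that download step, assuming the OMDB record's Poster field holds the image URL (OMDB sets it to 'N/A' when no poster exists):

```python
import os
import urllib.request

def save_poster(data, poster_dir="Posters"):
    """Download the poster image linked in an OMDB record (a sketch)."""
    poster_url = data.get("Poster", "N/A")
    if poster_url == "N/A":          # no poster available for this title
        print("No poster available.")
        return
    os.makedirs(poster_dir, exist_ok=True)
    # Use the movie title as the file name, keeping the URL's extension
    ext = os.path.splitext(poster_url)[1] or ".jpg"
    filename = os.path.join(poster_dir, data["Title"] + ext)
    with urllib.request.urlopen(poster_url) as response:
        with open(filename, "wb") as f:
            f.write(response.read())
    print("Poster saved as", filename)
```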


Next, it asks the user whether (s)he wants to save some basic information about the movie in a local database. If the user gives the nod, it inserts a subset of the downloaded movie information into a SQLite database, creating the database first if it does not exist.

Here is the function definition for saving a record in the database.
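
The exact code lives in the notebook; the following is a minimal sketch of such a function, with an assumed subset of fields (Title, Year, Runtime, Country, imdbRating) and illustrative table and file names:

```python
import sqlite3

def save_in_database(data, db_file="movieinfo.sqlite"):
    """Store a subset of the downloaded record in a SQLite table (a sketch)."""
    conn = sqlite3.connect(db_file)
    cur = conn.cursor()
    cur.execute("""CREATE TABLE IF NOT EXISTS MovieInfo
                   (Title TEXT PRIMARY KEY, Year TEXT,
                    Runtime TEXT, Country TEXT, imdbRating REAL)""")
    # OMDB returns 'N/A' for missing values, so guard the float conversion
    rating = data.get("imdbRating", "N/A")
    rating = float(rating) if rating not in ("N/A", None) else None
    cur.execute("INSERT OR REPLACE INTO MovieInfo VALUES (?, ?, ?, ?, ?)",
                (data.get("Title"), data.get("Year"),
                 data.get("Runtime"), data.get("Country"), rating))
    conn.commit()
    conn.close()
```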

The notebook also contains a function to save the information from an existing database into an Excel file.
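
One convenient way to do this, a sketch assuming pandas (with an Excel writer such as openpyxl) is installed, is to read the table into a DataFrame and call to_excel:

```python
import sqlite3
import pandas as pd

def save_to_excel(db_file="movieinfo.sqlite", excel_file="movieinfo.xlsx"):
    """Dump the MovieInfo table into an Excel file (a sketch)."""
    conn = sqlite3.connect(db_file)
    df = pd.read_sql_query("SELECT * FROM MovieInfo", conn)
    conn.close()
    df.to_excel(excel_file, index=False)
    print(f"Wrote {len(df)} rows to {excel_file}")
```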

A word about the secret API key

You will notice that the program uses a secret API key to access the data. This key can be obtained for free from the OMDB website and used for up to 1,000 requests per day. Using a secret (user-specific) key is a very common practice in web scraping. I protect the integrity of my personal API key by keeping it in a small JSON file, APIkeys.json, in the same directory as the Jupyter notebook. The contents of this file are hidden from the external user who sees only my code. My Jupyter notebook reads this JSON file as a dictionary, copies the key corresponding to the movie website, and appends it to the encoded URL request string that is sent by the urllib.request method.
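
In code, that might look like the following sketch (the "OMDBapi" dictionary key is my own naming; yours may differ):

```python
import json

# APIkeys.json is assumed to look like: {"OMDBapi": "your_key_here"}
with open("APIkeys.json") as f:
    keys = json.load(f)

omdbapi = keys["OMDBapi"]
serviceurl = "http://www.omdbapi.com/?"
apikey = "&apikey=" + omdbapi  # appended to the encoded search string
```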

Summary

This article walks through a demo Python notebook to illustrate how to retrieve basic information about movies using a free API service and how to save the movie posters and the downloaded information in a lightweight SQLite database.

Above all, it demonstrates the simple use of Python libraries such as urllib, json, and sqlite3, which are extremely useful (and powerful) tools for data analytics and web data mining tasks.
