It will be useful to know why and how SQL holds such an important stature in the field of data science.
Data science is a field that uses many scientific methods and processes to collect data and get useful insights from them. Doing this creates new opportunities, designs, business cases and even revolutionary innovations.
The future of AI lies in Data science. Data science involves use of algorithms, software programs and statistics to collate and process huge complex chunks of data. The complexity of data comes from its unstructured nature – data today can be in any form – text, image, output from sensors and instruments and what not!
As of today, the benefits of data science can be felt in domains like e-commerce, healthcare, manufacturing, banking, finance and transport.
SQL or Structured Query Language is a ‘programming language’ that manages data in a relational database through ‘queries’. By using SQL, we can insert, update, delete and select data based on various filters and conditions. SQL also helps run certain scheduled scripts automatically, that need to run regularly.
With SQL, Developers and data analysts can easily code and run queries. For example, select * from customer; is a simple query that gives details of the customer table.
Analysts can access huge amounts of data, process it without having to copy or store data anywhere else.
It is easy to analyze data using SQL queries than using excel sheets or any other methods. There are a lot of aggregation functions in SQL that can work on large datasets and multiple tables.
To understand how SQL is used in data science, we need to know the different phases of data science.
There are 3 important phases in the lifecycle of Data Science –
SQL can also help with data cleaning, which means irrelevant, incorrect or corrupt records can be removed periodically by automatic batch scripts without any manual intervention. You can also use R for data cleaning and transformation, however using SQL in conjunction with R can reduce the amount of code to be written and bring in more flexibility.
Each of these can be a table in the relational DBMS and the relationship between them can be established using a common key like the customer_id, phone_number or subscription_id which can be unique. Thus, we are creating entities (tables), attributes (columns), relationships (customer_id linked to subscription_id or phone_number etc…) and integrities (customer_id should be unique, phone_number and subscription_id together can be a composite key and so on).
Data mining is a process where huge sets of data is examined for patterns and trends for evaluating data to solve a business purpose. The process becomes easier when large datasets are organized and structured.
As a continuation, let us say we want to identify what are the value-added services that most users are interested in. Since every person nowadays has at least one mobile phone and the number of services that can be added are unlimited, this will be a huge data to consolidate, sort and examine.
We can also use in-built SQL functions to list the data in ascending order based on service_end_date, group by a certain order and so on.
You might want to argue that excel sheet with huge amount of data can also fetch the results and do some filtering operations as well. It is true, but not comparable to the scale at which SQL stores its data. But how much SQL is enough for you to be a data scientist?
Most of the time, the database and schema will be already present with all the relationships established. As a data scientist, you will need to understand the schema, relationships and how you can query the database to get the best of the results from a business perspective.
Let us create some data and then work our way to fetch this data using the different techniques we will learn below.
Below are the most important statements and commands that you will use on a daily basis.
Selecting rows and columns
This will fetch something like –
first_name deptt date_of_birth
Maria ECE 17-11-1984
Naman ECE 06-06-1984
Keshav CSE 14-12-1983
It can so happen that there may be similar records and we want to get only one of them. For example, the deptt (departments) column. If we want to list all the departments present in the database, we can use the keyword distinct. This will give all the unique values in the column.
Same way, we can get the count of students based on certain conditions, find sum, average, find records that have a particular date of birth and many more things.
Aggregate functions are used with group by, having clause to get a scalar value from a select statement. The most common functions are – SUM, AVG, COUNT, MAX, MIN. Except count function, all others ignore a NULL value. In the above example, if we want to get count of students in ECE deptt, we can form a query as –
Similarly, if we want to get the average percentage obtained by CSE students of all grades, we can use the AVG function and so on.
A relational database, as we know has multiple small tables that can be mapped to each other. To obtain data from multiple tables in a single result, we join the tables and display the required columns. There are different types of joins –
If you would like to get into details of each or any of the above, this interactive course caters to all the SQL that is required for you to be a data scientist and all that we have discussed above.
There are many sub fields in data science, but irrespective of that, SQL remains an important ingredient in the Data science dish. Without the essence of SQL, your entry into this field would be incomplete. Learning and implementing SQL will go a long way in helping you think of more creative ideas and turn your data into useful business use cases or insights.