On this blog we often explore intricate connections between state-of-the-art technologies or dive into the depths of a new technique. However, AI and data science are not only about showing off exciting new methods that boost accuracy by 2% (which is a big gain); they are about making data and technology work for you. They can help you increase sales, understand your customers, predict future faults in process lines, make an insightful presentation, submit a term project, or have a good time with your friends working on a new idea that will change the world. In this sense, everyone can, and to some extent should, become a data scientist.
We already discussed what makes a good data scientist (part 1) and what you should learn before you set out on a real project. In this post, we’ll walk you through the process of building a backbone data project in simple steps.
Find a story behind an idea
You have an excellent idea in your head: perhaps the one you have cherished since you were a child about a toy-cleaning robot, or the one that just came to mind about reaching the customers in your shop by sending them fortune cookies with predictions based on their purchase preferences. However, to make your idea work you need the attention of others. Find a compelling narrative for it; make sure that it has a hook or a captivating purpose, and that it is up to date and relevant. Finding the narrative structure will help you decide whether you actually have a story to tell.
Such a narrative will be the basis for your business model. Ask yourself: What is it that you develop, what resources do you need, and what value do you provide to the customer? For what values are customers going to pay?
A nice way to do this is the business model canvas. It is simple and cheap: you can create it on a sheet of paper.
Prepare the data
The first practical step is collecting data to fuel your project. Depending on your field and goals, you can search for ready-made datasets available on the Internet, such as this collection. You can also scrape data from websites or access data from social networks through public APIs. For the latter option, you need to write a small program, in the programming language you feel most comfortable with, that downloads data from social networks. If you want to run it in the cloud, you can spin up a simple AWS EC2 Linux instance (nano or micro) and run your software on it.
A simple and practical way to store the data is a plain .csv file, with each line holding the text and meta information, such as the person, timestamp, replies, and likes.
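As a minimal sketch of that storage format, the snippet below writes and reads such records with Python's standard csv module. The column names and the sample records are made up for illustration; use whatever fields your data source actually provides.

```python
import csv
import io

# Hypothetical column names, following the fields mentioned above.
FIELDS = ["person", "timestamp", "replies", "likes", "text"]


def write_posts(posts, out):
    """Write an iterable of dicts to the file-like object `out` as CSV."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    for post in posts:
        writer.writerow(post)


def read_posts(src):
    """Read rows back as dicts; note that every value comes back as a string."""
    return list(csv.DictReader(src))


# Made-up sample records, for illustration only.
sample = [
    {"person": "alice", "timestamp": "2020-01-01T12:00:00Z",
     "replies": 2, "likes": 10, "text": "Hello, world"},
    {"person": "bob", "timestamp": "2020-01-01T12:05:00Z",
     "replies": 0, "likes": 3, "text": 'He said "hi"\nand left'},
]

# An in-memory buffer stands in for a real file here; with a real file,
# open it with newline="" as the csv module documentation recommends.
buffer = io.StringIO(newline="")
write_posts(sample, buffer)
buffer.seek(0)
rows = read_posts(buffer)
```

Letting the csv module do the quoting matters for text data: commas, quotes, and line breaks inside a post would otherwise corrupt a hand-built file.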
As to the amount of data needed, the rule of thumb is to get as much data as possible in a reasonable time, for example, a few days of running your program. Another important consideration is to collect only as much data as the machine you are using for analytics can handle. How much data to get is not an exact science; it depends on the technical limitations and the question you have.
Finally, in collecting and managing data it is crucial to avoid bias and not to be selective about the inclusion or exclusion of data. Such selectivity includes using discrete values when the data is continuous; how you deal with missing, outlier, and out-of-range values; arbitrary temporal ranges; and capped values, volumes, ranges, and intervals. Even if your goal is to persuade, your argument should be based on what the data says, not on what you want it to say.
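One way to keep such choices visible is to count what would be excluded, and why, before removing anything. The sketch below is only an illustration: the function name, the valid range, and the sample readings are assumptions, not part of any particular toolkit.

```python
import math


def audit_values(values, low, high):
    """Count values by status without discarding any of them.

    `low` and `high` describe the range the analyst considers valid;
    they are a modelling decision and should be reported with the results.
    """
    report = {"ok": 0, "missing": 0, "out_of_range": 0}
    for value in values:
        if value is None or (isinstance(value, float) and math.isnan(value)):
            report["missing"] += 1
        elif value < low or value > high:
            report["out_of_range"] += 1
        else:
            report["ok"] += 1
    return report


# Made-up temperature readings, for illustration only.
readings = [21.5, None, 19.0, 250.0, float("nan"), 22.1]
report = audit_values(readings, low=-40.0, high=60.0)
```

Publishing these counts next to your findings lets readers judge whether the exclusions could have shaped the result.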
Choose the right tools
To perform a valid analysis, you need the proper tools. Once you have the data, select the tool you will use to explore it. To make a choice, you can write down a list of analytics features you think you need and compare available tools. Sometimes you can use user-friendly graphical tools like Orange, RapidMiner, or KNIME. In other cases, you’ll have to write the analysis on your own in a language such as Python or R.
Prove your theory
With the data and tools available, you can prove your theory. In Data Science, theories are statements of how the world should be or is, and are derived from axioms that are assumptions about the world, or precedent theories (Das, 2013). Models are implementations of the theory; in data science, they are often algorithms based on theories that are run on data. The results of running a model lead to a deeper understanding of the world based on theory, model, and data.
As a first step in assessing your theory, in line with conventional content analysis, you can pinpoint trends present in the data. One approach we use quite a lot is to select significant events that have already been reported and then try to create an analytics process that finds these trends. If the analytics can find the trends you specified, you are on the right track. Next, look for instances where the analytics finds new trends, and confirm them, for instance by searching the internet. The results are not going to be reliable 100% of the time, so you’ll need to decide how many falsely reported trends (the error rate) you are willing to tolerate.
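The check described above can be sketched in a few lines. Here the error rate is taken to be the share of reported trends that could not be confirmed, which is one reasonable reading of the term; the trend labels and the tolerance are made up for illustration.

```python
def trend_report(detected, confirmed):
    """Compare trends reported by the analytics with confirmed events."""
    detected = set(detected)
    confirmed = set(confirmed)
    false_reports = detected - confirmed  # reported, but not confirmed
    missed = confirmed - detected         # confirmed, but not reported
    # Share of reported trends that turned out to be false.
    error_rate = len(false_reports) / len(detected) if detected else 0.0
    return {
        "hits": detected & confirmed,
        "false_reports": false_reports,
        "missed": missed,
        "error_rate": error_rate,
    }


# Hypothetical labels: what the analytics reported vs. what you could confirm.
detected = ["product launch", "storm warning", "celebrity rumor", "price drop"]
confirmed = ["product launch", "storm warning", "price drop", "strike"]

report = trend_report(detected, confirmed)
tolerance = 0.3  # assumed acceptable error rate; choose your own
acceptable = report["error_rate"] <= tolerance
```

Tracking the missed trends alongside the false reports is useful as well: a process that reports almost nothing has a low error rate but little value.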
Build a minimum viable product
When you have your business model and a proven theory, it is time to build the first version of your product, the so-called minimum viable product (MVP). This can be the first version that you offer to customers. Because an MVP is a product with just enough features to satisfy early customers and to provide feedback for future development, it should focus only on the core functionality without any fancy solutions. Stick to simple functions that work from the beginning and expand your system later. At this stage, the system could look something like this:
Automate your system
In principle, your focus should be on the future development of your product, not on operating the system. For this, you need to automate as much as possible: uploading to S3, starting the analysis, or storing data. In this article we discussed automation in more detail.
The other face of automation is logging. When everything is automated, you can feel that you are losing control over your system and no longer know how it performs. Besides, you need to know what to develop next, both in terms of new features and fixing problems. For this, you need to set up a system for logging, monitoring, and measuring all meaningful data. For instance, you should log statistics for the download of your data or the upload to S3, the duration of the analytics process, and the users’ behavior.
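A minimal sketch of such logging, using only Python's standard library, might time each pipeline step and record failures. The step names and the stand-in workloads are assumptions for illustration; in a real system the bodies would be your download, upload, and analytics code.

```python
import logging
import time
from contextlib import contextmanager

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
log = logging.getLogger("pipeline")


@contextmanager
def timed(step, stats):
    """Log how long a step takes and store the duration in `stats`."""
    start = time.perf_counter()
    try:
        yield
    except Exception:
        # Record the failure with its traceback, then let it propagate.
        log.exception("step=%s failed", step)
        raise
    finally:
        elapsed = time.perf_counter() - start
        stats[step] = elapsed
        log.info("step=%s seconds=%.3f", step, elapsed)


stats = {}

with timed("download", stats):
    records = [{"id": i} for i in range(1000)]  # stand-in for a real download

with timed("analytics", stats):
    total = sum(record["id"] for record in records)  # stand-in for the analysis
```

Writing the step name and duration as key=value pairs keeps the log lines easy to search and to feed into a monitoring tool later.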
There are multiple tools to help you log server statistics like CPU, RAM, network, code-level performance, and error monitoring, many of which have a user-friendly interface.
Reiterate and expand
You probably know that AI, Machine Learning, Data Science and other new developments are all about iteration and fine-tuning. So, once your MVP is running and automation and monitoring are in place, you can start enhancing your system. It is time to get rid of weaknesses, optimize the overall performance and stability, and add new functions. Implementing new features will also allow you to offer new services or products.
Present your product
Finally, when your product is ready, you need to present it to the customers. This is where the story behind your data and your business model come in.
First of all, think about your target audience. Who are your customers and how are you going to sell your product to them? What does the audience you are presenting to already know about the topic? The story needs to be framed around the level of information the audience already has, correct and incorrect:
- Novice: first exposure to the subject, but doesn’t want oversimplification
- Generalist: aware of the topic, but looking for an overview understanding and major themes
- Managerial: in-depth, actionable understanding of intricacies and interrelationships with access to detail
- Expert: more exploration and discovery and less storytelling with great detail
- Executive: only has time to glean the significance and conclusions of weighted probabilities
Afterwards, visualize your data and weave the trends, significance, and proportions you built your project on into a narrative. Your story about the product should never end with a fixed event, but rather with a set of options or questions that trigger action from the audience. Never forget that the goal of data storytelling is to encourage and energize critical thinking that leads to business decisions or to the purchase of your product.