This article is the fifth part of a ten-part series on Digital Transformation Debt, post-Covid-19. Part 1 focused on Culture. Part 2 delved deeper into Operational Excellence and inter-enterprise Value Streams As A Service. Part 3 explained the spectrum of Automation and the shifts post-Covid-19. Part 4 demonstrated how a new harvest of Low Code/No Code platforms is empowering Citizen Developers.
Due to Covid-19, organizations need to be agile and responsive. They need to understand trends and predict actions leveraging enterprise, sensor, customer, and partner Data. They also need to be in motion and autonomic. This article focuses on another critical dimension for alleviating Digital Transformation Debt: the emergence of the Citizen Data Scientist. Mining patterns from increasingly exploding data lakes and then acting upon them in real time is critical for survival post-Covid-19.
By any estimate, the digital era is facing an unprecedented explosion of information. Digital technologies, solutions, and content generate 2.5 quintillion bytes of data each day! However, mirroring the IT application development bottleneck, organizations face an even more severe shortage of Data Scientists. Organizations are hoarding data – but mining and benefiting from heterogeneous data lakes is often a challenge. A new harvest of productivity, self-service, drag-and-drop data tools is emerging, allowing citizens to discover and deploy analytical models – predictive, machine learning, or even deep learning. These are nothing short of Artificial Intelligence platforms for the masses. We are witnessing the emergence of easy-to-use Citizen AI tools for customer engagement, with proven results.
In the Covid-19 era, Data is becoming even more critical. The application of the models mined from the Covid-19 infection databases is obvious. Equally important are the supply chain, societal interaction, and overall economic trends amid shifts and transformation. The Covid-19 era is also accelerating the “Process + Data” narrative, where organizations need to complement and balance data-centricity with the digitization and Automation of value streams. Bottom line – pre- or post-Covid-19 – it is not just about the data. The insights need to be mined, discovered, or harvested from the vast, often messy lakes of data. Raw data to insights should be the mantra. Once insights are discovered, they need to be acted upon.
Database Management Systems (DBMS)
DBMSs that separated the management of the data from the application started to appear in the 1970s with navigational hierarchical and network models. In the 1980s, we saw a significant evolution to relational databases, which became quite popular, especially with SQL’s emergence as the de facto query language for databases. The evolution beyond relational included Object-Oriented Databases, which combined Object-Oriented and Database capabilities for persistent storage of objects, and Object-Relational Databases, which combine the characteristics of both relational and object-oriented databases.
More recently – especially for handling large unstructured multi-media data in new digital applications – we saw the emergence of NoSQL to handle the demands of Big Data: large volume, variety, velocity, and veracity. This new generation of databases focuses on the explosion of heterogeneous data and on its storage and management for innovative Internet applications (especially IoT). Still, by and large, most transactional data for mission-critical systems of record (which require transactional integrity) remains relational. All these trends are culminating in intelligent DBMSs.
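To make the relational-versus-NoSQL distinction concrete, here is a minimal sketch contrasting a fixed-schema relational row with a schema-less document. Python's built-in sqlite3 stands in for a relational DBMS, and plain JSON stands in for the document model; the table and field names are invented for illustration.

```python
import json
import sqlite3

# Relational model: a fixed schema enforced by the database
# (SQLite as a stand-in for a full relational DBMS).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL)")
conn.execute("INSERT INTO orders VALUES (1, 'Acme', 99.50)")
row = conn.execute("SELECT customer, total FROM orders WHERE id = 1").fetchone()

# Document model (the NoSQL style): schema-less JSON where each record
# can differ and hold nested data that a flat row cannot.
doc = {"id": 1, "customer": "Acme", "total": 99.50,
       "clickstream": ["home", "pricing", "checkout"]}
restored = json.loads(json.dumps(doc))
print(row, restored["clickstream"])
```

The trade-off in miniature: the relational table guarantees structure and transactional integrity, while the document accommodates the heterogeneous, nested data typical of new Internet and IoT applications.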
Recently we have also seen the emergence of “Data Lakes.” Here is how AWS explains “Data Lakes:”
Faced with massive volumes and heterogeneous types of data, organizations are finding that in order to deliver insights in a timely manner, they need a data storage and analytics solution that offers more agility and flexibility than traditional data management systems… Data Lake allows an organization to store all their data, structured and unstructured, in one, centralized repository.
The following illustrates the key components and capabilities of a Data Lake.
The emergence of many heterogeneous data sources is at the core of the Data Lake. According to Aberdeen, there is a clear distinction in business execution between Data Lake leaders and followers (aka laggards). Strategic Data Lake investments and maturity characterize the leaders.
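As a toy illustration of the “one centralized repository” idea, the sketch below builds a miniature file-based data lake holding structured CSV and semi-structured JSON side by side, then catalogs both. The directory layout and file names are purely illustrative – not a real AWS Data Lake configuration.

```python
import csv
import json
import tempfile
from pathlib import Path

# A toy "data lake": one flat store holding structured and
# unstructured data side by side (illustrative layout only).
lake = Path(tempfile.mkdtemp())
(lake / "raw").mkdir()

# Structured data lands as CSV...
with open(lake / "raw" / "sales.csv", "w", newline="") as f:
    csv.writer(f).writerows([["region", "amount"], ["west", "120"], ["east", "80"]])

# ...while semi-structured data lands as JSON, in the same repository.
(lake / "raw" / "tweets.json").write_text(
    json.dumps([{"user": "a", "text": "great product"}]))

# A minimal catalog: list everything in the lake regardless of format.
catalog = sorted(p.name for p in (lake / "raw").iterdir())
print(catalog)  # → ['sales.csv', 'tweets.json']
```

The point of the sketch: the lake stores everything in one place and defers schema decisions to analysis time, which is exactly what gives it more agility than a traditional, schema-first data management system.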
The Data Scientist
The sections above illustrate the complexity of Data in enterprises – too many databases, repositories, sources, and strategies. The Data Scientist role is a relatively new one. Many assumptions that we had taken for granted in the management of databases, including integrity and the independence of the data from the application, are now being challenged. The past couple of decades have created powerful gatekeepers of the enterprise data – the Database Administrators (DBAs) – who sometimes block the agility and speed of change needed to sustain business requirements. The world – or I should say the digital world – is changing. The introduction of NoSQL databases, especially for Big Data, has added complexity in maintaining consistency across heterogeneous DBMSs. This transformational change emanates from the need to engage customers directly. It also results from the explosion of information on the Internet, especially with the Internet of Things. But more importantly, the mining of business value through analysis and machine learning techniques has given rise to this new – and sometimes DBA-evolved – role in the enterprise, namely the “Data Scientist.”
Data Science is complicated. Data Science is multi-disciplinary. Here is a definition of the role of a Data Scientist from a business perspective:
A data scientist identifies important questions, collects relevant data from various sources, stores and organizes data, deciphers useful information, and finally translates it into business solutions and communicates the findings to affect the business positively.
Data Science involves many disciplines. Data Scientists need to have many skills – from mathematics, statistics, and machine learning to programming, and more. Perhaps more importantly, Data Scientists need to communicate and present their findings in clear terms that the business understands. They also need to be subject matter experts and creative – one role spanning this entire spectrum. No wonder Data Scientists are in great demand! Here is a great illustration of Data Science:
The Data Scientist’s continuous activities span three fundamental areas: Data Analysis, Programming, and Business Analysis for concrete business results. Unfortunately, poor data quality complicates the Data Scientist’s tasks and objectives. About 70% of their effort goes into ingesting, preparing, and cleansing the data.
In my interactions with Data Scientists, they sometimes object to this estimate – arguing it is even more. In other words, only 10%–30% of their time is spent discovering meaningful insights and business value from the often unruly and heterogeneous data sets!
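A minimal sketch of why cleansing dominates the effort: even a four-record toy data set needs normalization, type coercion, and deduplication before any analysis can start. All names and fields below are invented for illustration.

```python
# Raw records as they might arrive from heterogeneous sources:
# inconsistent casing, stray whitespace, missing values, duplicates.
raw = [
    {"name": "Alice ", "age": "34", "city": "NYC"},
    {"name": "alice", "age": "34", "city": "nyc"},    # duplicate, different casing
    {"name": "Bob", "age": "", "city": "Boston"},     # missing age
    {"name": "Carol", "age": "29.0", "city": "Chicago"},
]

def clean(records):
    """Normalize text fields, coerce types, and deduplicate."""
    seen, out = set(), []
    for r in records:
        name = r["name"].strip().title()
        city = r["city"].strip().title()
        # Coerce "29.0"-style strings; keep gaps explicit as None.
        age = int(float(r["age"])) if r["age"] else None
        key = (name, city)
        if key in seen:
            continue  # drop duplicate rows
        seen.add(key)
        out.append({"name": name, "age": age, "city": city})
    return out

cleaned = clean(raw)
print(len(raw), "->", len(cleaned))  # 4 -> 3
```

Multiply these few normalization rules across dozens of sources and formats, and the 70%+ estimate stops looking surprising – which is precisely the drudgery the data-preparation tools discussed below aim to automate.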
The Citizen Data Scientist
As indicated above, Data Science involves many disciplines. Gartner defines a “citizen data scientist” as “a person who creates or generates models that use advanced diagnostic analytics or predictive and prescriptive capabilities, but whose primary job function is outside the field of statistics and analytics.” The 2017 article predicted 40% of Data Science tasks would be automated by 2020! Well, we are in 2020 and not even close to that level of Automation. Data Scientists still spend 70% or more of their time cleaning and preparing data for analysis and discovery.
Despite many technological advances, methodologies, and techniques, most organizations still suffer from Business-Technical Developers-Operations silos. The trend towards empowered Citizens who can achieve Data Science objectives is not hype. It is also not a panacea. It does have challenges.
The good news is that some emerging tools and platforms are addressing the requirements of Data Scientists. Intelligence and Automation in all the milestones and phases of the Data Science workflow make it real for Citizen Data Scientists.
Here are some productivity, intelligence, and automation technologies that are targeting the inevitable trends towards a Citizen Data Scientist platform:
- Automation of Data Preparation: This is the most crucial category, as cleaning and preparing the data constitutes more than 70% of the Data Scientists’ effort. We are starting to see some tools addressing these needs. Tableau Prep, for example, “… changes the way traditional data prep is performed in an organization. By providing a visual and direct way to combine, shape, and clean data, Tableau Prep makes it easier for analysts and business users to start their analysis, faster.”
- Low Code/No Code Data Integration: Several emerging and robust tools automate data integration and aggregation from different sources. Most structured and unstructured databases have Application Programming Interfaces (APIs). These productivity and automation tools provide easy to use drag and drop capabilities for Data integration. Parabola is an example of a Low Code/No Code platform for automating integration.
- Automating Machine Learning (AutoML): Automation in data integration and preparation is a prerequisite for analysis and machine learning. Machine Learning leverages Artificial Intelligence (AI) algorithms to discover patterns in the data. It is critical in the overall Data Science process. When we shift to Citizen Data Scientists, it becomes critical to automate Machine Learning itself. Here is one definition of AutoML – a bit extreme, but it drives home the objective: “Automated machine learning, or AutoML, aims to reduce or eliminate the need for skilled data scientists to build machine learning and deep learning models. Instead, an AutoML system allows you to provide the labeled training data as input and receive an optimized model as output.” Several vendors are positioning their advanced AI automation tools as AutoML – this includes Google’s Cloud AutoML and IBM Watson’s AutoAI.
- End-To-End Citizen Data Science Tools: As described earlier, the multi-discipline Data Science has many phases. The overall workflow involves data sourcing, preparation, analysis, modeling, prioritizing the models, and then deployment. One example of such a platform is DataRobot. Here is how they describe their support for Citizen Data Scientists: “Citizen data scientists can upload a dataset to DataRobot and pick a target variable based on the practical business problem they wish to solve. The platform automatically applies best practices for data preparation and preprocessing, feature engineering, and model training and validation.” The following illustrates the end-to-end workflow for Citizen Data Scientists.
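The AutoML idea described above can be reduced to a small sketch: fit several candidate models on labeled training data, score each on a held-out validation split, and keep the winner. This is a plain-Python toy with invented data – not the API of any of the vendors mentioned.

```python
# Toy labeled data: y is roughly 2x + 1 with a little noise.
train = [(1, 3.1), (2, 4.9), (3, 7.2), (4, 9.0)]
valid = [(5, 11.1), (6, 12.8)]  # held-out validation split

def fit_mean(data):
    """Baseline model: always predict the training mean."""
    m = sum(y for _, y in data) / len(data)
    return lambda x: m

def fit_linear(data):
    """Ordinary least squares for y = a*x + b, closed form."""
    n = len(data)
    sx = sum(x for x, _ in data); sy = sum(y for _, y in data)
    sxx = sum(x * x for x, _ in data); sxy = sum(x * y for x, y in data)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return lambda x: a * x + b

def mse(model, data):
    """Mean squared error of a fitted model on a data split."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

# The "AutoML" loop: train every candidate, pick the best validation score.
candidates = {"mean": fit_mean, "linear": fit_linear}
best = min(candidates, key=lambda name: mse(candidates[name](train), valid))
print(best)  # → linear
```

Real AutoML platforms extend this same loop with automated feature engineering, far richer model families, and hyperparameter search – but the core mechanism the Citizen Data Scientist relies on is exactly this train-score-select cycle.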
With these platforms, the dream of the Citizen Data Scientist – spanning Automation and self-service with intuitive drag-and-drop productivity tools – is slowly becoming a reality. We still have a long way to go.
Recommendation: Citizen Data Scientists for the Data-Centric Enterprise
Data Science is complicated. The solution market is fragmented and confusing. Yet these solutions provide tremendous advantages when developing and deploying innovative applications. The speed of development could be existential – especially in the post-Covid-19 era.
Covid-19 presents a robust opportunity to rethink roles and tools for innovation and to become a startup or an enterprise in motion. Here are the top recommendations:
- Citizen Data Scientist Culture: This is extremely important. Some business stakeholders in enterprises or founders in startups might be reluctant to get involved in “Data Science.” Given the complexity of Data Science, this will most likely be a partnership between conventional technical data science roles and business-savvy Citizen Data Scientists for specific data science workflow milestones.
- Data Cleansing and Preparation Automation: The first place to start automating self-service Data Science is the data cleansing and preparation phase, which typically consumes 70%+ of the Data Scientists’ efforts. Given the heterogeneous data sources, this is quite complex, but it is critical for success. It typically needs a partnership between technical Data Scientists and Citizen Data Scientists – with most of the technical tasks assigned to the former and the data schemata assigned to the latter.
- Reskill and Upskill for Data Visualization and AutoML: Organizations need to leverage their employees, especially for Data Visualization and the increasingly important area of AutoML or AutoAI. The visualization market is quite mature, with tools such as Tableau. AutoML is more challenging but also more promising in terms of business value. Many software vendors are starting to provide robust solutions for AutoML. Therefore, a path of re-skilling Citizen Data Scientists from Visualization to AutoML is critical.
- Digital Design Sprints – being lean and effective: Check the following article on the Sprint methodology. There is a perfect fit – either during or immediately after the 4-5 day methodology – to leverage Low Code/No Code for a Minimum Viable Product (MVP). The end-user testing can – and most likely will – end up with enhancements that could be easily and speedily achieved with a Low Code/No Code platform.
Request: Let me know if you have case studies and best practices leveraging emerging Data Science and AutoML platforms.