The Case for Using Data Simulators to Drive Big Data Success


Business Rationale for a Data Simulator

Much has been said and written about the business value of unifying data silos for insights, yet Big Data solution providers often struggle to convince their customers to break down these silos. It is not that customers are unwilling to find data in-house or procure third-party data that can be brought together; rather, they need to be convinced that the extra effort will result in significantly better business outcomes. The speed of vendor access to data in a complex corporate environment is inversely related to the number of legacy systems and proprietary mechanisms in the company’s IT operations: the more of them there are, the longer access takes. Case studies and generic demos help start a conversation but not close the deal.

As a solutions provider trying to get work done, we found this a challenging issue, and to break the impasse we started using Data Simulators. A simulator allowed us to generate synthetic data in numerous types, shapes, and values to suit most business cases. It also helped our Proofs of Concept (POCs) focus more on insights and action, and less on ingestion and ETL processes. While ingestion of data is a real problem that needs to be solved, we found that our customers prefer deferring it from the POC phase to the production phase. This way the insight horse was always ahead of the ingestion cart! Data simulators are also accelerating our deployments in newer domains where there is little historical data or where data is hard to get: wearables, the industrial internet, expensive third-party data, and more.

Having outlined the business value of a Data Simulator, I present below a deep dive into the architecture and implementation of our simulator, BigSim.

Under The Hood – BigSim

BigSim is designed to provide flexibility and control in generating large data sets through templates and minimal coding. Users only need to provide the data specifications in an XML template defining the semantic type, range, volume, velocity, and shape. Since much of the data generation process is independent, multiple simulator instances can run in parallel on different machines, creating large data sets that can be pushed to a common data store or streamed. These simulated data sets can be used for capacity planning, what-if scenario testing, extrapolating small data sets with a certain amount of randomness to simulate real-world data sets, filling in missing data in incomplete data sets, and similar tasks.
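
BigSim's actual template syntax is not reproduced here, so the following is only a minimal Python sketch of the general idea of specification-driven generation. The `FieldSpec` class, the `generate_records` function, and the field names are all hypothetical and are not part of BigSim; they simply show how a declared type, range, and volume could drive record creation.

```python
import random
from dataclasses import dataclass


@dataclass
class FieldSpec:
    """Hypothetical field specification: a name, a kind, and a value range."""
    name: str
    kind: str            # "int", "float", or "choice"
    low: float = 0
    high: float = 1
    choices: tuple = ()


def generate_records(specs, count, seed=None):
    """Generate `count` records, one value per field spec."""
    rng = random.Random(seed)  # a seed makes the synthetic data repeatable
    records = []
    for _ in range(count):
        row = {}
        for spec in specs:
            if spec.kind == "int":
                row[spec.name] = rng.randint(int(spec.low), int(spec.high))
            elif spec.kind == "float":
                row[spec.name] = rng.uniform(spec.low, spec.high)
            elif spec.kind == "choice":
                row[spec.name] = rng.choice(spec.choices)
            else:
                raise ValueError(f"unknown kind: {spec.kind}")
        records.append(row)
    return records


# Illustrative smart-meter style specification (assumed, not from BigSim).
specs = [
    FieldSpec("meter_id", "int", 1, 500),
    FieldSpec("reading_kwh", "float", 0.0, 12.5),
    FieldSpec("region", "choice", choices=("north", "south", "east", "west")),
]
records = generate_records(specs, count=1000, seed=42)
```

Because each call depends only on its specification and seed, several such generators could run on separate machines with different seeds, which is the property the paragraph above relies on for scaling out.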

Key Features of BigSim

Extensibility and Adaptability

The simulator can easily be extended and adapted to generate custom data patterns using a library of pre-built primitive and user-defined types. The XML snippets below show examples of how this can be done.
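
As a rough illustration of how primitive and user-defined types might be combined, here is a Python sketch built around a small type registry. This is an assumption about one possible design, not BigSim's implementation; the `register` decorator, the `GENERATORS` table, and the `order_id` type are hypothetical names.

```python
import random

# Hypothetical registry mapping a type name to a generator function.
GENERATORS = {}


def register(type_name):
    """Decorator that adds a generator function to the registry."""
    def decorator(func):
        GENERATORS[type_name] = func
        return func
    return decorator


@register("integer")
def gen_integer(rng, low=0, high=100):
    """Primitive type: a uniformly distributed integer in [low, high]."""
    return rng.randint(low, high)


@register("ipv4")
def gen_ipv4(rng):
    """Primitive type: a dotted-quad IPv4 address."""
    return ".".join(str(rng.randint(0, 255)) for _ in range(4))


@register("order_id")
def gen_order_id(rng, prefix="ORD"):
    """User-defined type composed from the 'integer' primitive."""
    number = GENERATORS["integer"](rng, 100000, 999999)
    return f"{prefix}-{number}"


def generate(type_name, rng, **params):
    """Look up a type by name and produce one value."""
    return GENERATORS[type_name](rng, **params)


rng = random.Random(7)
sample_order = generate("order_id", rng, prefix="WEB")
sample_ip = generate("ipv4", rng)
```

In a design like this, adding a new data pattern means registering one more function, and the rest of the generator does not need to change.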

Fine Grain Control

A robust simulation platform should make it easy to control the volume and velocity of the data in order to support multiple usage scenarios. Smart grids, Black Friday sales, high-frequency trading, and the Twitter firehose all generate data of varying types, volumes, and velocities. BigSim provides the dials and knobs to deal with such needs.

The load distribution template shown below generates data records for an hour, with varying loads distributed across different time slices.
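
To make the idea of a load distribution concrete, the Python sketch below spreads one hour of records over time slices with different rates. The slice lengths and rates in `LOAD_PROFILE` are invented for illustration, and the function names are hypothetical; BigSim's own template format is not shown here.

```python
import random

# Assumed profile: (slice length in minutes, records per minute).
# The four slices add up to 60 minutes, with a burst in the second slice.
LOAD_PROFILE = [(15, 100), (10, 1000), (20, 250), (15, 50)]


def build_schedule(profile):
    """Expand the profile into one (minute index, rate) pair per minute."""
    schedule = []
    minute = 0
    for length, rate in profile:
        for _ in range(length):
            schedule.append((minute, rate))
            minute += 1
    return schedule


def timestamps_for_minute(minute, rate, rng):
    """Return `rate` sorted second offsets spread randomly over one minute."""
    start = minute * 60
    return sorted(start + rng.uniform(0, 60) for _ in range(rate))


schedule = build_schedule(LOAD_PROFILE)
total_records = sum(rate for _, rate in schedule)
```

Here `total_records` is the volume for the hour, while the per-slice rates control velocity; changing a single tuple in the profile is enough to model a different burst pattern.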

Support for Data in Motion and Data at Rest

With streaming analytics gaining popularity alongside batch analytics, simulators are expected to generate large volumes of data to support both forms of analytics. BigSim can push data into a CSV file and into various SQL and NoSQL databases. It can also stream the generated data in real time or at desired intervals for consumption by stream-based services.

The snippet below shows the configuration for batch (CSV, Cassandra) and streaming data generation.
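
The difference between the two output modes can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not BigSim's configuration: it uses only the standard library, writes the batch output to an in-memory CSV buffer rather than Cassandra, and the `write_batch_csv` and `stream` functions are hypothetical.

```python
import csv
import io
import time


def write_batch_csv(records, fileobj):
    """Data at rest: write all records to a CSV file object in one pass."""
    records = list(records)
    if not records:
        return 0
    writer = csv.DictWriter(fileobj, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
    return len(records)


def stream(records, interval_seconds=0.0):
    """Data in motion: yield records one at a time, pausing between them."""
    for record in records:
        yield record
        if interval_seconds > 0:
            time.sleep(interval_seconds)


sample = [
    {"sensor": "s1", "value": 21.5},
    {"sensor": "s2", "value": 19.0},
]

buffer = io.StringIO()          # stands in for a real file on disk
written = write_batch_csv(sample, buffer)

# A consumer would normally process each record as it arrives.
streamed = list(stream(sample, interval_seconds=0.01))
```

The same generated records feed both paths, so a single simulated data set can exercise a batch pipeline and a streaming pipeline side by side.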

Conclusion

For a long time now, simulators have played a vital role in engineering domains, with offerings such as wind tunnels, flight simulators, and load and stress testers. These have without a doubt helped bring innovative and safer products to market faster. Our experience has shown that rolling out data-driven products and services targeting both enterprises and consumers can be accelerated through a robust data simulator. Big Data projects no longer have to be stymied by insufficient, inaccessible, missing, or incorrect data.