Originally posted on The BigchainDB blog
From Data Audit Trails to a Universal Data Exchange
Big Data is Big Business
Big data arose in the early and mid 2000s to meet internet-scale computation needs: ZooKeeper at Yahoo, BigTable and MapReduce at Google, Cassandra at Facebook, and so on. Then came open source projects like the Hadoop Distributed File System (HDFS), Hadoop MapReduce, Cassandra, and more.
By the late 2000s and early 2010s, startups like MongoDB, Cloudera, and DataStax had created businesses to transform the open source successes into enterprise-grade offerings.
Now, big data technology is quietly transforming every enterprise backend on the planet. For example, in many places “data warehouses” of relational databases are getting replaced by “data lakes” running big data software. More than $100B annually is going towards big iron compute clusters, the software on top, and the services to keep it all running smoothly.
Big Data Challenges
But big data has its challenges, which include control, data authenticity, and monetization.
First, who controls the infrastructure when there are multiple actors involved? For example:
- If you’re a multinational enterprise, how do you share data around the planet? If you have multiple copies, how do you know which one is the most up-to-date? How do you reconcile a different system administrator role at each regional office?
- If you’re an industry consortium, how do you share control of the ecosystem infrastructure among the companies in your consortium? This is especially hard if those companies are competitors!
- Why can’t there be data just “out there” as a single shared source of truth that no one on the planet owns or controls, per se? Rather, data would be a public utility like electricity or the internet itself.
Second, how well can you trust the data? For example:
- If you generate the data yourself, how do you prove you were the originator? If you get data from others, how do you know it was truly them?
- What about crashes and malicious behavior? Machines crash, glitches happen, bits flip. Zombie IoT toasters might be inputting garbage. So after all your fancy Spark calculations, is it still just garbage out?
Finally, how do you monetize the data? For example:
- How do you transfer the rights of the data, or buy rights from others?
- There’s a long-standing dream of a universal data marketplace; how do we finally build one?
A New Tool For Big Data: Blockchain Technology
The recent surge in blockchain technology was sparked by Bitcoin. Technically, all blockchains are simply databases, but databases with “blue ocean” benefits: decentralized / shared control, immutability / audit trails, and native assets / exchanges.
By modern database standards, traditional blockchains have terrible scalability and don’t even have query languages; nonetheless, the blue ocean benefits were enough to capture the imagination of the globe.
Better yet, more recent technology, the BigchainDB blockchain database, combines the benefits of distributed databases (scale, queryability) and blockchains (decentralized control, immutability / audit trails, assets / exchanges).
This new blockchain database technology has the scalability needed in big data environments, because it builds on top of best-in-class distributed databases like MongoDB. This unlocks highly interesting applications in big data: shared control of infrastructure, audit trails on data, and the possibility of a universal data exchange. Let’s explore each in detail.
Shared Control of Big Data Infrastructure
Summary. Being a blockchain database means that control of the database infrastructure is shared across the participating entities, whether within an enterprise, within a consortium, or across the planet.
How. A big data blockchain database like BigchainDB is decentralized, which means that its control can be shared.
That sharing can happen in one of many contexts:
- Across offices within an enterprise. That is, you get shared control of a big data database across geographically-spread offices.
- Across companies within an ecosystem. That is, you get shared control of a big data database among companies (even competitors) in an ecosystem.
- On a planetary level. Shared control of an open, public big data database means “data as a utility” like air or the internet. Such a database is getting rolled out now: it’s called IPDB. IPDB is both a network (powered by BigchainDB) and a nonprofit foundation.
Being a big data database, it has the scale to actually hold the data itself, unlike traditional blockchains. As that database fills up, one can add more databases and connect them with the Interledger protocol for interoperability.
Benefits. Let’s see how this addresses the problems I’d described earlier.
- Problem: If you’re a multinational enterprise, how do you share data around the planet? If you have multiple copies, how do you know which one is the most up-to-date? How do you reconcile a different system administrator role at each regional office?
- A: Each regional office with its own sysadmin controls one node of the overall database. So they control the database collectively. The decentralized nature also means that if a sysadmin or two goes rogue, or a regional office is hacked, the data is still protected. (Assuming encryption is in place too, of course).
- Problem: If you’re an industry consortium, how do you share control of the ecosystem infrastructure among the companies in your consortium? This is especially hard if those companies are competitors!
- A: Similar to above, each company controls one node in the overall database.
- Problem: Why can’t there be data just “out there” as a single shared source of truth that no one on the planet owns or controls per se? Rather, data would be a public utility like electricity or the internet itself.
- A: IPDB, the Interplanetary Database, is getting rolled out now.
Audit Trails on Data
Summary. Blockchain technology allows us to have audit trails on data, to improve the trustworthiness of the data. You get authenticated data stories.
How 1. Here’s how it works. Let’s say that you have a data pipeline: IoT sensors → Kinesis / Event Hub + stream analytics → HDFS storage → Spark data cleaning → Spark normalization → MongoDB storage → Tableau analytics.
Before each data pipeline step starts, time-stamp the input data as follows:
- Create a transaction, shaped as a JSON document, that includes a hash of the data, hashes of each row and column if you like, and any metadata you wish to include (e.g. where you got the data from, or the precise hashing recipe).
- Cryptographically sign the transaction with your private key. This is a classic digital signature.
- Write the transaction to the blockchain database (BigchainDB). It will automatically time-stamp the transaction. Now you have immutable evidence that you had access to that data at that point in time, which others can cryptographically verify based on your public key.
After each data pipeline step is done, time-stamp the step’s output data using the same three steps.
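To make this concrete, here’s a minimal sketch in Python using the bigchaindb_driver package. The node URL, the step name, and the exact asset / metadata fields are illustrative assumptions, not a fixed schema; adapt them to your own pipeline.

```python
import hashlib
import json

from bigchaindb_driver import BigchainDB
from bigchaindb_driver.crypto import generate_keypair

# Connect to a BigchainDB node (placeholder URL; use your own node).
bdb = BigchainDB('https://bigchaindb.example.com')

# Your identity is an Ed25519 keypair; guard the private key.
me = generate_keypair()

def timestamp_step(data_bytes, step_name):
    """Hash one pipeline step's data and anchor the hash in BigchainDB."""
    digest = hashlib.sha256(data_bytes).hexdigest()

    # 1. Create a transaction: the hash plus whatever metadata you want.
    tx = bdb.transactions.prepare(
        operation='CREATE',
        signers=me.public_key,
        asset={'data': {'pipeline_step': step_name, 'sha256': digest}},
        metadata={'hash_recipe': 'sha256 over raw bytes'},
    )

    # 2. Cryptographically sign it with your private key.
    signed_tx = bdb.transactions.fulfill(tx, private_keys=me.private_key)

    # 3. Write it; the network timestamps it on commit.
    #    (Older driver versions use .send() instead of .send_commit().)
    return bdb.transactions.send_commit(signed_tx)

# Example: timestamp the input to the Spark data-cleaning step.
receipt = timestamp_step(json.dumps({'rows': []}).encode(), 'spark-cleaning')
print(receipt['id'])  # keep the transaction id for later verification
```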
Side note: dsensors are complementary here; they are IoT sensors that no single entity owns or controls.
How 2. There’s an even simpler way for some steps, if you’re using a distributed database that BigchainDB already wraps (e.g. MongoDB, RethinkDB). You then simply swap out that database (e.g. MongoDB) with its blockchain-ified version (e.g. MongoDB wrapped by BigchainDB).
There’s no need for hashing or anything, because it’s all implicit. Note that BigchainDB does not expose the whole interface of the wrapped database, though over time it will expose more based on user feedback.
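For illustration, here’s a sketch of that read path, assuming direct access to the underlying MongoDB of a BigchainDB node. The host, port, and the bigchain database name follow BigchainDB’s documented defaults, but may differ in your deployment; the pipeline_step field reuses the illustrative schema from the sketch above.

```python
from pymongo import MongoClient

# Connect straight to the MongoDB instance that BigchainDB wraps.
# Defaults shown here; check your node's configuration.
client = MongoClient('localhost', 27017)
db = client.bigchain

# Full MongoDB query power over assets written through BigchainDB.
for asset in db.assets.find({'data.pipeline_step': 'spark-cleaning'}):
    print(asset['id'], asset['data'])
```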
Benefits. Let’s see how this addresses the problems I’d described earlier.
- Problem: If you generate the data yourself, how do you prove you were the originator?
- A: People who have your public key can see that you cryptographically signed it.
- Problem: If you get data from others, how do you know it was truly from them?
- A: You can verify the transaction against that person’s public key.
- Problem: What about crashes and malicious behavior? Machines crash, glitches happen, bits flip.
- A: You can run periodic processes to re-hash the data stored in the pipeline. If the new hash doesn’t match the previous hash, something’s wrong. (See the sketch after this list.)
- Problem: Zombie IoT toasters might be inputting garbage. So after all your fancy Spark calculations, is it still just garbage out?
- A: First, use IoT devices with proper security; no need to take down the DNS again :) Those IoT devices should have a way to sign data without exposing their private keys. Then, as before, you can verify the IoT device’s data-input transaction against its public key.
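Here is a hedged sketch of such a periodic check, reusing the illustrative field names from the earlier timestamping sketch (the node URL is again a placeholder):

```python
import hashlib

from bigchaindb_driver import BigchainDB

bdb = BigchainDB('https://bigchaindb.example.com')  # placeholder URL

def recheck(txid, current_data_bytes, expected_public_key):
    """Verify pipeline data against the hash anchored in BigchainDB."""
    tx = bdb.transactions.retrieve(txid)

    # Integrity: does the data still hash to what was anchored?
    anchored = tx['asset']['data']['sha256']
    fresh = hashlib.sha256(current_data_bytes).hexdigest()
    if fresh != anchored:
        raise ValueError('data changed since it was timestamped')

    # Origin: nodes validated the signature on write; here we just
    # confirm the transaction was signed by the key we expected.
    if expected_public_key not in tx['inputs'][0]['owners_before']:
        raise ValueError('unexpected signer')
    return True
```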
Universal Data Exchange
Summary. We can build a universal data marketplace, helping to break down data silos. A scalable blockchain database speaking the protocol of IP rights transfer enables data to be bought and sold as an asset. It would be collectively controlled by a public ecosystem, and people could build data exchanges on top to suit their needs.
How. Here’s how it works.
You need a global public blockchain database, which exists in the form of IPDB. There could even be multiple networks, with assets flowing among them via the Interledger protocol for interoperability.
The asset is the data rights, backed by copyright law. The asset “lives” on the blockchain database. You own the asset if you control the private key. The asset can be sliced & diced, and transferred to others, using a modern, flexible, blockchain-friendly IP protocol. This is also a recent invention, called Coala IP.
So, the stack is BigchainDB software + IPDB network + Coala IP protocol. With this, we have the substrate on which creative hackers and entrepreneurs can build data exchanges of various shapes and sizes.
Benefits. Let’s see how this addresses the problems I’d described earlier.
- Problem: How do you transfer the rights of the data, or buy rights from others?
- A: Create a transaction to transfer rights to another person, speaking the language of the Coala IP protocol. Sign it. Write it to the database. (A minimal sketch follows.)
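To illustrate, here’s a sketch of a rights transfer as a BigchainDB TRANSFER transaction. The node URL and transaction id are placeholders, and the rights terms are carried as plain metadata here; a real system would express them in the Coala IP vocabulary.

```python
from bigchaindb_driver import BigchainDB
from bigchaindb_driver.crypto import generate_keypair

bdb = BigchainDB('https://bigchaindb.example.com')  # placeholder URL

# In practice the seller is the current rights holder's existing keypair.
seller, buyer = generate_keypair(), generate_keypair()

# The CREATE transaction that registered the data rights as an asset
# (placeholder id; see the earlier timestamping sketch for asset creation).
create_tx = bdb.transactions.retrieve('<create-transaction-id>')

# Build a TRANSFER transaction that spends the creation's first output.
output = create_tx['outputs'][0]
transfer_input = {
    'fulfillment': output['condition']['details'],
    'fulfills': {'output_index': 0, 'transaction_id': create_tx['id']},
    'owners_before': output['public_keys'],
}
transfer_tx = bdb.transactions.prepare(
    operation='TRANSFER',
    asset={'id': create_tx['id']},
    inputs=transfer_input,
    recipients=buyer.public_key,
    # Simplified rights terms; a full system would use Coala IP's vocabulary.
    metadata={'rights': 'exclusive license, worldwide'},
)
signed = bdb.transactions.fulfill(transfer_tx, private_keys=seller.private_key)
bdb.transactions.send_commit(signed)  # the buyer now controls the asset
```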
Conclusion
Big data is big bucks. Blockchain database technology can help resolve three of big data’s outstanding challenges: how to share control of infrastructure, how to trust the data, and how to build a universal data exchange.
Acknowledgements
Thanks to the following reviewers: Adam Drake, Donald Gossen, Franta Polach, Bruce Pon, Carly Sheridan, and Gaston Besanson. (If you have more feedback, please let me know and I’ll be happy to update the article.)