It’s common knowledge that using the right data model for the right job provides important advantages for developers, software architects and data scientists. Mapping data natively into the database, tailoring queries precisely for optimal performance and leveraging model-specific optimizations are only a few of them.
In the early 2010s, the concept of using multiple single-model database technologies to harness the advantages of each data model became popular quite quickly. Polyglot Persistence was the architectural pattern that spread from large tech companies to the software development ecosystem. The concept provided developers with abundant choices to pick the right data tool for the right job, promoting the use of relational, document or graph databases where they fit best. But using separate databases for each model comes at a high cost for development, maintenance and data consistency.
Today, large corporations easily have 25-30 different databases in their production stack, which requires a substantial investment to maintain the infrastructure and substantial skilled resources to run their application landscapes. This complexity is not exclusive to blue chip corporations; mid-sized and young companies also have to either invest in a bloated infrastructure or push all their data into a single-model database. The latter leads to contortions in the application layer and potentially inefficient query processing.
As a result, multi-model databases, those that can store information in a variety of data models, have rapidly gained importance in the market. In many ways, multi-model databases have been a response to Polyglot Persistence, delivering the benefits of flexibility without the cost and operational complexity. But are multi-model databases the right choice for your project or enterprise?
Today, developers can choose from two different approaches to multi-model databases: “layered” multi-model databases, available from established players who have added data access layers to their existing databases, and “native” multi-model databases, which provide one core and one query language for multiple data models.
Layered Multi-Model Databases
Layered multi-model databases follow the approach of having a straightforward data model as the foundation and provide additional “layers” on top to enable users to treat their data as documents, relational tables, graphs, etc.
Let’s take a quick look at the basic architecture of a typical layered multi-model database.
Layered multi-model databases can be a fit for developers who are familiar with a general-purpose database that has optional capabilities for serving multiple models. This approach can reduce the learning effort, and developers can build upon a technology that they already use and trust. For example, some vendors provide a key/value store as a persistence layer and add different data layers or modules connected via an API. Another offers a columnar store as the persistence layer and has mounted a graph access layer on top, also connected via an API.
This configuration brings up an obvious question for anyone familiar with even moderately complex queries: do the different layers “know” about each other? It is not unusual in today’s applications that the intermediate result of one query has to be used for another. For example, the result of a graph query identifies the records to aggregate upon, or the result of a JOIN operation is used as the starting point for a graph traversal.
In that case, different layers have to have some knowledge about each other to execute these connected queries most efficiently. The data flow of such connected layers might look like what’s shown in the schema below (orange arrows):
In order to provide this flow in the most convenient and efficient way, it makes sense to allow access to all data models with the same query language. But right now, that is not available in layered multi-model databases. For instance, some offer a series of different APIs and others provide one query language for columnar data and another for graphs.
To enable users to combine various data access patterns within a single query, a developer managing a layered multi-model database has to develop a rather complex system of APIs among the data layers to allow the knowledge exchange. This is the case in basically all layered multi-model databases. In addition, some logic for optimizing query processing would be preferable to achieve acceptable performance.
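The connected query described in this section, where the result of a graph query identifies the records to aggregate upon, can be sketched as follows. This is a minimal illustration, not any vendor's API: the two "layers" are plain in-memory Python structures, and the names `FOLLOWS`, `ORDERS`, `graph_layer_traverse`, `relational_layer_sum_orders` and `connected_query` are hypothetical. The point is that the application has to carry the intermediate result from one layer to the other itself.

```python
from collections import defaultdict

# Hypothetical "graph layer": who follows whom (adjacency list).
FOLLOWS = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": ["dave", "erin"],
    "dave": [],
    "erin": [],
}

# Hypothetical "relational layer": order rows as (customer, amount).
ORDERS = [
    ("bob", 20.0),
    ("carol", 35.0),
    ("dave", 15.0),
    ("dave", 5.0),
    ("erin", 50.0),
    ("alice", 99.0),
]


def graph_layer_traverse(start, max_depth):
    """Return all vertices reachable from start within max_depth hops."""
    seen = {start}
    frontier = [start]
    for _ in range(max_depth):
        next_frontier = []
        for vertex in frontier:
            for neighbor in FOLLOWS.get(vertex, []):
                if neighbor not in seen:
                    seen.add(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    seen.discard(start)  # the start vertex itself is not part of the result
    return seen


def relational_layer_sum_orders(customers):
    """Aggregate order totals for the given customers."""
    totals = defaultdict(float)
    for customer, amount in ORDERS:
        if customer in customers:
            totals[customer] += amount
    return dict(totals)


def connected_query(start, max_depth):
    # Step 1: the graph layer produces an intermediate result.
    reachable = graph_layer_traverse(start, max_depth)
    # Step 2: the application hands that result to the other layer.
    return relational_layer_sum_orders(reachable)


print(connected_query("alice", 2))
```

In a real layered system, step 2 would typically be a second call through a different API or query language, and neither layer could optimize across the boundary, because each sees only its own half of the query.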
Native Multi-Model Databases
The native multi-model approach offers one database core, and a single query language for all the supported data models. By nature, it means developers have fewer datastores to learn, manage and maintain, inherently reducing the complexity of a tech stack while keeping the advantage of native data mapping into the persistence layer. And since it’s native, built from the ground up to be a multi-model database, developers can execute a wider range of queries and tailor their request precisely to their needs, making it a good fit for modern and agile application development.
Architecture of a native multi-model database
Compared to layered approaches, the architecture of a native multi-model database allows developers to work against a single API and work with the very same dataset. The integrated data access layer has full knowledge about the queries at any given time and can therefore automatically optimize even combined queries efficiently.
Another key benefit of native multi-model databases is scalability. The more complex a data model and its related access patterns are, the harder it is to scale the related queries horizontally.
Scaling vertically with graph or relational data is comparatively easy, as data can be accessed in-memory. But as soon as the data needed during query execution resides on different machines, it becomes a different story. Distributed data from a query perspective means not only complex query planning but, most importantly, network hops during query processing. As a reminder of the latencies involved: an in-memory lookup takes ~50 nanoseconds, while a network round-trip within the same datacenter needs ~200-300 microseconds. Join operations or graph traversals, even in their simplest form, potentially have to jump many times between servers.
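To make these latency figures concrete, here is a back-of-the-envelope estimate in Python. It is a rough sketch under simplifying assumptions: lookups run strictly one after another with no batching or pipelining, the round-trip cost is taken as 250 microseconds (the midpoint of the range quoted above), and the function name `traversal_time` and the `remote_fraction` parameter are made up for this illustration.

```python
IN_MEMORY_LOOKUP_S = 50e-9     # ~50 nanoseconds, as quoted above
NETWORK_ROUND_TRIP_S = 250e-6  # assumed midpoint of ~200-300 microseconds


def traversal_time(lookups, remote_fraction):
    """Estimate seconds for a traversal in which each lookup is either
    served from local memory or requires one network round-trip."""
    remote = lookups * remote_fraction
    local = lookups - remote
    return local * IN_MEMORY_LOOKUP_S + remote * NETWORK_ROUND_TRIP_S


for fraction in (0.0, 0.1, 1.0):
    seconds = traversal_time(1000, fraction)
    print(f"{fraction:.0%} remote lookups: {seconds * 1000:.3f} ms")
```

Under these assumed numbers, one round-trip costs about as much as 5,000 in-memory lookups, so even a small share of remote lookups dominates the total time of a traversal.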
To circumvent the problem of these network hops, a developer needs control over data locality to allow optimal sharding of data for a given use case. This cannot be done by a simple access layer; it calls for optimizations deep down within the persistence layer. If the persistence layer and the integration of the access layer have not been harmonized, query execution can show poor performance.
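The effect of data locality on network hops can be illustrated with a toy graph. The sketch below is only an illustration under stated assumptions: the graph, the two sharding functions `shard_by_modulo` and `shard_by_community`, and the helper `cross_shard_edges` are hypothetical, and counting edges that cross a shard boundary stands in for the network hops a traversal would need.

```python
# Toy graph: two tightly connected groups (vertices 0-3 and 4-7)
# joined by a single bridge edge (3, 4).
EDGES = [
    (0, 1), (1, 2), (2, 3), (0, 3),
    (4, 5), (5, 6), (6, 7), (4, 7),
    (3, 4),
]


def shard_by_modulo(vertex, shard_count=2):
    """Locality-unaware placement: spread vertices evenly by id."""
    return vertex % shard_count


def shard_by_community(vertex, community_size=4):
    """Locality-aware placement: keep each group on one shard."""
    return vertex // community_size


def cross_shard_edges(edges, shard_of):
    """Count edges whose endpoints live on different shards."""
    return sum(1 for a, b in edges if shard_of(a) != shard_of(b))


print(cross_shard_edges(EDGES, shard_by_modulo))     # 9
print(cross_shard_edges(EDGES, shard_by_community))  # 1
```

In this example, the evenly spread placement puts every edge across a shard boundary, while the locality-aware placement leaves only the bridge edge to cross the network. Real data rarely splits this cleanly, which is why the sharding has to be chosen for a given use case.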
Even when running a database cluster, developers and users still expect the performance to be close to a single instance, even for complex queries. Hence, a database that promises to be suitable for both graph and relational workloads has to provide solutions for these operations at scale. With native multi-model, designed for today’s challenges, developers can build high-performance applications and scale horizontally with all supported data models.
Conclusion
While there are plenty of advantages, multi-model is not the ultimate solution for every situation. It is not a way to force developers to use a variety of data models, nor can one layered or native multi-model database integrate every data model efficiently. It’s more about enabling developers to leverage the advantages of different models for different aspects of their applications. In many cases, this can be done more efficiently with one technology.
The good news is that developers have choices. The future of data is all about flexibility. Enterprises are seeing projects blossom from 10s to 100s to 1,000s of apps, dashboards and unique systems, which require a variety of data models. And developers building applications know that those applications will keep evolving, and that requirements to query, analyze and serve up data will become more advanced.
Originally published on IT Toolbox.