While “software is [still actively] eating the world”, it’s also clear that open source is taking over software.
Simply put, open source is a superior approach at building and distributing software because it provides important guaranties around how software can be discovered, tried, operated, collaborated on and packaged. For those reasons, it is not surprising that it has taken over most of the modern data stack: infrastructure, databases, orchestration, data processing, AI/ML and beyond.
Looking back, the main reason why I originally created both Apache Airflow and Apache Superset while I was at Airbnb in 2014-17 is because the vendors in the data space were failing to:
- Keep up with the pace of innovation in the data ecosystem
- Give power to users who wanted to satisfy their more advanced use cases
As it is often the case with open source, the capacity to integrate and to extend were always at the core of how we approached the architecture of those two projects.
Headaches with Tableau
More specifically for Superset, the main driver to start the project at the time was the fact that Tableau (our main data visualization solution at the time) couldn’t connect natively to Apache Druid and Trino / Presto, our data engines of choice that provided the properties and guaranties that we needed to satisfy our data use cases.
With Tableau’s “Live Mode” misbehaving in intricate ways at the time (I won’t get into this!), we were steered towards using Tableau Extracts. Extracts crumbled under the data volumes we had at Airbnb, creating a whole lot of challenges around non-additive metrics (think distinct user counts) and forcing us to intricately pre-compute multiple “grouping sets” which broke down some of the Tableau paradigms and confused users. Secondarily, we had a limited number of licenses for Tableau, and generally had an order of magnitude more employees that wanted/needed access to our internal than our contract allowed. That’s without mentioning the fact that for a cloud-native company, Tableau’s Windows-centric approach at the time didn’t work well for the team.
Some of the premises mentioned above have since changed, but the power of open source and the core principles on which it’s built have only grown. In this blog post, I will explain why the future of business intelligence is open source.
Benefits of Open Source
If I could only use a single word to describe why the time is right for organizations to adopt open source BI, the word would be freedom. Flowing from the principle of freedom comes a few more concrete superpowers for an organization:
- the power to customize, extend and integrate
- the power of the community
- avoid vendor lock-in
Extend, customize and integrate
Airbnb wanted to integrate in-house tools like Dataportal and Minerva with a dashboarding tool to enable democratization of data within their organization. Because Superset is open source and Airbnb actively contributes to the project, they were able to supercharge Superset with in-house components with relative ease.
On the visualization side, organizations like Nielsen are creating new visualizations and deploying in their Superset environments. They’re going a step further by empowering their engineers to contribute to the customizability and extensibility of Superset. The Superset platform is now flexible enough so that anyone can build their own custom visualization plugins, a benefit that is unmatched in the marketplace.
Within the wider community, many report using the rich REST API that ships with Superset, allowing them full programmatic control over all aspects of the platform. Given that pretty much everything that user can do in Superset can be done through the API, sky is the limit when it comes to automating processes in and around Superset.
Around the topic of integration, members from the Superset community have added support for over 30 databases (and growing!) by submitting code and documentation contributions. Because the core contributors bet on the right open source components (SQLAlchemy and Python DB-API 2.0), the Superset community both gives and receives to/from the wider Python community.
The power of the community
Open source communities are composed of a diverse group of people that come together over a similar set of needs, and this group is empowered to contribute to the common good. Vendors, on the other hand, tend to focus on their most important customers. Open source is a fundamentally different model that’s much more collaborative and frictionless. Communities, as a result of this fundamentally de-centralized model, are very resilient to changes that vendor-led products struggle with. As contributors and organizations come and go, the community lives on!
At the core of the community are the active contributors that typically operate as a dynamic meritocracy. Network effects attracts attention and talent, and communities welcome and offer guidance to newcomers because their goals are aligned. With the rise of platforms like GitHub, software is pretty unique in that engineers and developers from around the world seem to be able to come together and work collaboratively with minimal overhead. Those dynamics are fairly well understood and accepted as a disruptive paradigm shift in how people collaborate to build modern software.
Beyond the software at the core of the project, dynamic communities are contributing in all sorts of ways that provide even more value to the community. Here are some examples:
- rich and up-to-date documentation
- example use cases and testimonials, often in the form of blog posts
- bug reports and bug fixes, contributing to stability and quality
- ever-growing knowledge bases and FAQs in places like GitHub issues and StackOverflow
- how-to videos and conference talks on Youtube
- a real time support network of enthusiasts and experts on Slack
- dynamic mailing lists where core contributors propose and debate over complex issues
- feedback loops, ways to propose features and influence roadmaps
Avoid lock-in
Recently, Atlassian acquired the proprietary BI platform Chart.io, started to downsize the Chart.io team, and announced their intention to shut down the platform. Their customers now have to scramble and find a new home for their analytics assets that they now have to rebuild.
This isn’t a new phenomenon. Given how mature and dynamic the BI market is, consolidation has been accelerating over the past few years:
- Tableau was acquired by Salesforce
- Looker was acquired by Google Cloud
- Periscope was acquired by Sisense
- Zoomdata was acquired by Logi Analytics
While consolidation is likely to continue, these concerns don’t arise when your BI platform is open source. If you’re self hosting, you are essentially immune to vendors lock in. If you choose to partner with a commercial open source (COSS), you should have an array of options from alternative vendors to hiring expertise in the marketplace, all the way to taking ownership and operating the software on your own.
For example, if you were using the Amazon Managed Workflows for Apache Airflow service to take care of your Airflow needs, and somehow Amazon decided to shut down the service (this is totally hypothetical!), you’d be left with a set of viable options:
- Select and migrate to another service provider in the space: most likely Astronomer or Google Cloud Composer
- Hire or consult: find Airflow talent that can help you take control. The community has fostered a large number of professionals who know and love Airflow and can help your organization
- Learn and act: take control and tap into the amazing resources provided by the community to run the software on your own (Docker, Helm, k8s operator, …)
Even at Preset, where we offer a cloud hosted version of Superset, we don’t fork the Superset code and instead run the same Superset that’s available to everyone. In Preset cloud, you can freely import and export data sources, charts, and dashboards. This is not unique to Preset, COSS vendors understand that “no lock-in!” is core to their value proposition and are incentivized to provide clear guarantees around this.
To conclude
Clearly open source is extremely disruptive as it provides freedom and a set of guaranties that really matter when it comes to adopting software. Even more clearly, these guaranties fully apply when it comes to business intelligence.
More specifically around business intelligence, Apache Superset has matured to a level where it’s a very compelling choice over the proprietary solution. Since the original creation in 2015 at an Airbnb hackathon, the project has come a very long way. Here are some recent highlights:
- Reached the major milestone of 1.0 (2021), which represents the project and product reaching adulthood and being fully ready to compete in the BI marketplace. Superset is ready for mass adoption.
- Has grown to become one of the most popular open source project, and asserted itself as the most popular open source BI platform by miles
- Has been adopted by thousands of organizations of all shapes and sizes, some of whom have self-identified here
- Is accelerating while benefiting from network effects unique to massively successful open source projects
- Graduated as a top-level project at the Apache Software Foundation (2020), which provides a clear and proven governance model for the project that guarantees a certain maturity
All in all, Apache Superset offers a combination of features and guaranties that is unique to open source BI. To learn more, please visit and join our growing community!
Originally published at https://preset.io.