Data Ingestion Is (Almost) A Solved Problem

Gianluca Gindro
December 30, 2020 AI & Machine Learning

A review of two out-of-the-box tools that solve a traditionally tedious problem

Ask anyone who has worked in a data-related job over the past 10–15 years which boring task they would most rather avoid, and chances are many would answer ‘data ingestion’.

Everyone would like to have data ready to be analysed. Sure, data cleaning and pre-processing are also good candidates to avoid, but there is something uniquely annoying about spending time purely on replicating data, without supposedly adding any value to it. At least data cleaning gives you that sense of purpose of ‘putting your house in order’.

Yet data ingestion is a fundamental task, and, until some time ago, you had to work with a combination of API calls, CSV GET requests, webhooks, incremental load rules, streaming services and ODBC connections just to replicate external data into your data warehouse or local system.

I’ve faced many of these challenges myself over the years, which is why I was particularly excited when, in my last project, I had the chance to explore in detail how data ingestion tools have transformed this space.

There are two data ingestion tools that I am going to review here:

  • Fivetran
  • Stitch Data

When is a data ingestion tool suitable?

The ideal scenario for using a data ingestion tool is when:

  • You have a data warehouse and want to enrich it with new data sources
  • You do not need real-time synchronisation
  • Your sources are within a relatively large set of standard pre-defined databases or SaaS tools (lists here and here)
  • Your source data is already structured in a relational format or in simple JSON structures
  • Your data pipeline requirements fit an ELT approach (Extract-Load-Transform), rather than an ETL one

The ELT requirement is a crucial one here. While ELT vs ETL is, I would argue, often down to individual preference, there are scenarios where it is preferable to transform data before ingestion: for example, when a lot of cleaning, transformation or filtering is required upfront, to avoid unnecessarily complicated processing after the data has already been loaded into the DB.
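To make the ELT distinction concrete, here is a minimal sketch in Python, using an in-memory SQLite database as a stand-in for the warehouse; the table names, columns and values are purely illustrative.

```python
import sqlite3

# Stand-in for the warehouse: load raw data first (the "EL" steps),
# then transform inside the database (the "T" step).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (id INTEGER, amount_cents INTEGER, status TEXT)")

# Extract + Load: replicate the source rows as-is, with no cleaning or filtering yet.
source_rows = [(1, 1250, "completed"), (2, 900, "cancelled"), (3, 4300, "completed")]
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", source_rows)

# Transform: done after loading, using the warehouse's own SQL engine.
conn.execute("""
    CREATE TABLE orders AS
    SELECT id, amount_cents / 100.0 AS amount
    FROM raw_orders
    WHERE status = 'completed'
""")
print(conn.execute("SELECT * FROM orders").fetchall())  # [(1, 12.5), (3, 43.0)]
```

In an ETL pipeline, by contrast, the filtering and unit conversion would happen before the rows ever reach the warehouse.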

When is this approach NOT suitable?

While the scenarios above cover a vast range of use cases, by exclusion of the points listed, this approach won’t fit cases where:

  • You need to have a real-time replication
  • You have unstructured or heavily nested JSON source data
  • You have unsupported technologies as data source systems or data warehouse
  • You want to transform your source data before ingestion

Now let’s compare the two tools on a number of key dimensions.

Replication methods

An obvious advantage of using a data ingestion tool is that it handles incremental loads for you. If your data sources are small enough to be replicated in full every day, you of course don’t need to worry too much about this point.

Generally speaking, both Fivetran and Stitch Data perform log-based replication: they read the source systems’ change logs to consume incremental changes in the data and replicate them downstream. This happens under the hood, so you don’t have to worry about it, apart from making sure your source systems have change logs enabled (in Postgres, for example, this means setting wal_level = logical).
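Conceptually, log-based replication just means replaying the source’s ordered change log against the destination. Here is a toy sketch of the idea; it is entirely illustrative and not how either vendor implements it.

```python
# Toy change log: ordered (operation, primary_key, row) entries, as a
# log-based replicator would read them from the source system's log.
change_log = [
    ("insert", 1, {"id": 1, "name": "Ada"}),
    ("insert", 2, {"id": 2, "name": "Grace"}),
    ("update", 1, {"id": 1, "name": "Ada L."}),
    ("delete", 2, None),
]

destination = {}  # stand-in for the warehouse table, keyed by primary key
for op, pk, row in change_log:
    if op in ("insert", "update"):
        destination[pk] = row   # upsert by primary key
    elif op == "delete":
        destination.pop(pk, None)

print(destination)  # {1: {'id': 1, 'name': 'Ada L.'}}
```

Note how the upsert relies on the primary key, which is exactly why the next point matters.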

The other point to consider is whether your data has a primary key. You want one, because it is the only way the tools can recognise that an existing record has been updated. Without it, they might insert a new copy of the record each time it is modified, which is obviously undesired behaviour.

Fivetran here is a bit more sophisticated in that it generates a synthetic primary key even when source tables have none.

Stitch Data on the other hand resorts to an append-only replication method when no primary key is present at source, as it explains in its documentation.
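For intuition, a synthetic key can be built by hashing the full contents of a row, so that re-delivered identical rows map to the same key. The sketch below is my own illustration of the general idea, not Fivetran’s actual implementation.

```python
import hashlib
import json

def synthetic_key(row: dict) -> str:
    # Serialise the row deterministically and hash it; identical rows
    # produce the same key, so duplicate deliveries overwrite themselves.
    payload = json.dumps(row, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

row = {"event": "page_view", "user": "u1", "ts": "2020-12-01T10:00:00"}
print(synthetic_key(row)[:16])  # stable surrogate key usable for upserts
```

One caveat with any content-based key: if the source legitimately emits two identical rows, they collapse into one at the destination.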

In my experience, most practical cases involve sources that do have a primary key (and if you have control over the source databases, you should make sure the source tables have one).

However, there are situations where this is not the case. A notable example I have encountered is the Stitch Data Mixpanel integration, which lacks a primary key. In this case, the same record can sometimes be inserted multiple times, which means you have to dedupe it at the destination, for example by taking a unique combination of some fields (such as event_type, user_distinct_id, page_url and timestamp).

This is far from ideal, but it should not cause too much trouble as long as you are careful with the queries you use to consume the data.
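As a minimal sketch of that deduplication step, here is one way to do it with pandas, assuming the Mixpanel events have landed in a DataFrame with the four fields above; the values are made up for illustration.

```python
import pandas as pd

# Events as they might land under append-only replication, including a duplicate.
events = pd.DataFrame([
    {"event_type": "page_view", "user_distinct_id": "u1",
     "page_url": "/home", "timestamp": "2020-12-01T10:00:00"},
    {"event_type": "page_view", "user_distinct_id": "u1",
     "page_url": "/home", "timestamp": "2020-12-01T10:00:00"},  # duplicate row
    {"event_type": "signup", "user_distinct_id": "u2",
     "page_url": "/join", "timestamp": "2020-12-01T10:05:00"},
])

# Treat the combination of these fields as a surrogate primary key
# and keep a single row per combination.
deduped = events.drop_duplicates(
    subset=["event_type", "user_distinct_id", "page_url", "timestamp"]
)
print(len(events), "->", len(deduped))  # 3 -> 2
```

In practice you would run the equivalent query in the warehouse itself, but the logic is the same.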

Handling JSON data sources

JSON data sources can in general be a pain to handle, but how much of a pain depends on the actual data model and the tools used.

This scenario is commonly found if you have a NoSQL data source such as MongoDB, but it also applies if you have a relational data source with some JSON fields, such as a Postgres database.

The most critical case is source tables (i.e. collections) that are truly unstructured, that is, each containing a high number of different fields. This could be the case, for example, if each object represents a book and its fields represent the chapter names: chances are the chapter names would differ for every book. Cases like this do not map well to a JSON-to-relational conversion, and, with both Fivetran and Stitch Data, you would find your destination tables having as many columns as there are unique fields in the source collection. This is not only impractical, it would also throw an error above a certain threshold: Postgres, for example, has a limit of 1,600 columns per table.
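A quick way to check whether a collection falls into this trap is to count the distinct field names across its documents, since that is roughly the number of destination columns a flat JSON-to-relational mapping would create. A minimal sketch with pymongo, where the connection string, database, collection and sample size are all illustrative:

```python
from pymongo import MongoClient

# Connection details and names here are purely illustrative.
client = MongoClient("mongodb://localhost:27017")
collection = client["mydb"]["books"]

# Collect every distinct top-level field name across a sample of documents.
unique_fields = set()
for doc in collection.find().limit(10_000):  # sample to keep the scan cheap
    unique_fields.update(doc.keys())

print(f"{len(unique_fields)} unique top-level fields")
if len(unique_fields) > 1600:  # Postgres limit on columns per table
    print("A flat JSON-to-relational mapping would not even load into Postgres")
```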

If instead you have a limited number of unique fields in each collection (say, up to a few hundred at the very most), the behaviour changes depending on the tool:

  • Fivetran will map each JSON field at source to a JSON field at destination, as long as the destination DB handles dictionary-type structures
  • Stitch Data will do the same, but only for a limited set of destinations, such as Snowflake; it will not, for example, do so with Postgres, which is quite annoying given that Postgres natively handles JSON fields. More details are in Stitch’s documentation, and bear in mind that this behaviour also affects your row count for pricing (more on that later).

In summary, no tool can work magic if you have truly unstructured source data; however, if your data does have some structure, under certain conditions these tools will allow you to preserve the dictionary-like format in the destination database.

Tables and field selection

Both Fivetran and Stitch Data allow you to import only certain tables from a source schema, but Stitch Data also offers the ability to selectively import only chosen fields. This can be a useful advantage if you have tables with lots of fields, and particularly if some of those fields contain nested JSON that could complicate the target structure.

Row count and charging methods

On this point, Fivetran and Stitch Data have two different approaches:

  • Fivetran charges on unique monthly rows updated
  • Stitch Data charges on overall monthly rows updated (that is, the same row can be counted multiple times)

Why does this matter? If your source stream is mainly a sequence of newly appended records, it won’t make much difference; but if your source systems tend to update the same records over and over, Fivetran can offer the advantage of clearer, more predictable pricing.

Another point to be very careful about here is historical backfills and table schema changes. If, for example, you have a large table, add one column to it and populate it for the entire history, both systems will spot the changes and update the full record history, even if you are not genuinely interested in that additional field. This can generate quite high unexpected charges with both tools.
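To make the difference between the two counting methods concrete, here is a small sketch that tallies billable rows under each model for a hypothetical stream of change events; the numbers are made up for illustration.

```python
# A hypothetical month of change events, one (table, primary_key) per change.
# Row u1 is updated four times; the other rows change once each.
changes = [("users", "u1")] * 4 + [("users", "u2"), ("orders", "o1")]

total_rows = len(changes)        # Stitch-style: every modification counts
unique_rows = len(set(changes))  # Fivetran-style: each row counts once a month

print(f"Overall monthly rows updated (Stitch-style): {total_rows}")    # 6
print(f"Unique monthly rows updated (Fivetran-style): {unique_rows}")  # 3

# A backfill that touches every existing row inflates both counts at once:
backfill = [("users", f"u{i}") for i in range(1_000_000)]
print(f"{len(set(backfill)):,} extra billable rows from a single backfill")
```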

Pricing

Given the charging methods described above, it isn’t possible to compare the two tools exactly like for like; however, in the common scenario where the vast majority of source changes are new inserts, Stitch Data proves to be significantly cheaper than Fivetran.

As an example, as of this writing:

  • Stitch Data has a pricing plan that starts at $100/month for 5M modified rows per month, with a month-by-month commitment and 10 integration sources. The next tier is $180 for 10M rows.
  • Fivetran’s pricing is a bit more obscure, and you need to navigate their documentation pages a bit to properly understand it. In practice, it starts at around $1,000–1,500 a month for 1M unique rows updated per month, but it then increases more gently at higher consumption volumes.

Replication frequency

The replication frequency you can set up depends on both the tool and the source system. Fivetran claims a lower limit of 5 minutes; Stitch Data has a lower limit of 30 minutes for most sources, although some sources allow a lower value. In general, be careful that two subsequent runs do not overlap.

Your replication frequency will also affect the number of rows modified, which particularly matters for Stitch Data’s pricing.

Transformation capabilities

Let’s remember we are talking about ingestion, not transformation, so it’s not surprising that neither of these tools focuses on data transformation capabilities.

That said, Fivetran offers an integration with dbt (data build tool), and Stitch Data is part of the Talend family, traditionally a big player in the ETL domain.

However, there are no obvious advantages in using their sister data transformation tools, so that part can be genuinely decoupled from ingestion.

Outcome

Both Fivetran and Stitch Data are great tools for data ingestion. Which one fits the bill will depend on your actual use case and needs.

Fivetran has somewhat more sophisticated replication capabilities, but this also comes at an increased cost.

Stitch Data is a great entry-level tool, which can also handle replication at scale for a number of standard scenarios.

Fivetran’s pricing is quite steep at the entry level, but it becomes more justifiable at very high ingestion volumes (e.g. billions of unique rows a month), and its unique-rows pricing model also offers more predictability.
