The context: ETL and data integration are powering the cloud economy
Cloud-based ETL (or, as it’s evolving, ELT) SaaS tools emerged about a decade ago, driven by the growing popularity of cloud data warehousing solutions like AWS Redshift, Google BigQuery, and later the massive success of Snowflake. A cloud data warehouse (CDW) is exactly as useful and valuable as the data it stores, so getting the right data into the CDW is critical. Popular CDWs have little to no native capability to connect to and ingest data from databases, SaaS applications, or even public data sources, which is a major shortcoming. Legacy ETL tools like Informatica, Talend, and a few others were not well suited to cloud-first architectures, so several new cloud-native vendors emerged, including Fivetran, Xplenty, and Matillion. These new vendors offered only a handful of data sources at first, but combined with the rapid growth of the CDW market, it was enough for them to grow quickly.
Historically, ETL tools focused on popular relational databases (RDBMS) like Oracle, SQL Server, and MySQL, a few very popular SaaS applications like Salesforce, and the ubiquitous FTP (File Transfer Protocol), but that was about it. Over the last 15 years or so, FTP has all but disappeared, replaced by the rise of the API (Application Programming Interface), now the default method by which virtually all modern cloud applications share data with users, vendors, partners, and other applications. Much has been written about the rise of the API economy, and this fits squarely within that model. If a new company builds a SaaS HR app, an accounting app, or anything else, it creates an API for users to access the data. It’s worth noting that APIs are like fingerprints: no two are the same. You can start to see the challenge this creates if you need to access these APIs and load the data into your CDW.
Custom connectors and brute force are public enemy #1
Enter the modern ETL vendors noted above, with their handful of connectors. It’s a good start, but with the exploding number of SaaS applications, they simply can’t keep up. Virtually all of these vendors claim they can “connect any data source to your CDW.” What this really means is “any data” on an impractically short and slow-growing list of supported apps. If your desired source is not on that list, you can’t use the solution. As expected, the focus for new connectors has essentially been a popularity contest: if lots of people use an app and need its data, it’s worth the considerable time and effort it takes these vendors to build a new connector. The rest of the world’s SaaS applications fall into what’s known as the “long tail.” Of course, if you need the data from a long-tail app, you don’t see it that way; you just need the data. And what makes this problem so much worse is that there are thousands of SaaS applications and public APIs that companies widely use and depend on, not just the 100-150 that existing vendors support.

So why don’t these vendors just build more connectors? It turns out that creating a new connector for an API is still a manual process. Data engineers write code to pull the data from the API, restructure it from its complicated native object form, and turn it into normalized, SQL-ready tables you can load into a data warehouse. And since APIs vary widely in complexity, a single new connector can take weeks to months to deliver. When a senior executive at a leading cloud ETL company was recently asked how he planned to create all the connectors his company might need, his reply was “brute force.” It’s an understandable answer given their approach, but not exactly inspiring to a user. Of course, this also assumes the vendor wants to support the application at all.
In some cases, they simply deem the application not important enough to justify the effort, which raises the question, “Important to whom?”
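To make the manual work concrete, here is a minimal Python sketch of the kind of hand-written flattening code a data engineer produces for a single API: taking one nested API response and restructuring it into flat, SQL-ready rows. The invoice fields and shape are invented for illustration; a real connector would also handle pagination, authentication, type coercion, and schema drift.

```python
# Hypothetical example: hand-flattening one nested API response into
# rows suitable for a relational table. All field names are invented.

def flatten_invoice(response: dict) -> list:
    """Turn one nested 'invoice' object into flat line-item rows."""
    rows = []
    for item in response["line_items"]:
        rows.append({
            "invoice_id": response["id"],            # lift parent id onto each row
            "customer": response["customer"]["name"], # denormalize nested object
            "sku": item["sku"],
            "amount": item["amount"],
        })
    return rows

sample = {
    "id": "INV-1",
    "customer": {"name": "Acme"},
    "line_items": [
        {"sku": "A-100", "amount": 25.0},
        {"sku": "B-200", "amount": 40.0},
    ],
}

rows = flatten_invoice(sample)
```

Multiply this by every endpoint of every API, each with its own nesting, and the weeks-to-months estimate above starts to make sense.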
We’re at a fork in the road
Unfortunately, the delivery of cloud-based ETL has not lived up to the promise. I speak with companies every week struggling with some data integration challenge. When I ask what they use now, the answer is always the same: “Oh, we have three different solutions.” I ask why, but I already know the answer: it’s the connector problem. They needed connector X, and only vendor Y has it; they needed connector Z, and only vendor W offers it. So now the user has multiple solutions to manage, each with a different pricing structure and a different support agreement. In other words, complexity, all just to get the data they need into their cloud data warehouse.
This is wrong!
How many other problems can you think of where you need to buy three different products to solve just one of them? Do you have three data warehousing solutions, or three BI solutions? No.
The radically simple future of data integration
At Precog, we understood all of these issues deeply and had experienced them first-hand. So we asked ourselves: how should this problem be solved holistically? The short answer is simple: build a single solution that can handle all of your ingest and ETL/ELT needs. The longer answer is this: think about the problem in an entirely different way, and then build sophisticated, elegant software that solves it in a scalable way. If you need to hand-build each and every new connector using expensive data engineering labor, it’s not going to work. What’s needed is an intelligent, generalized solution that creates the connector automatically from the source data (the API response) using AI-like principles, without writing a single line of code and with nothing to maintain.
This is Precog.
With the constant explosion of new data sources and applications, both public and private, the old way doesn’t scale, period. What does scale is Precog’s concept of “just-in-time data sources.” With Precog, a new API data source can be added in as little as one hour, and rarely longer than one day. These intelligent connectors transform the raw object data into a SQL-ready relational schema on the fly, and they adapt to changes, too. It’s an AI-enabled analytic pipeline handled by the Precog engine, not thousands of lines of custom code for each and every new connector. Under the hood, an entire Precog “intelligent connector” is a 2 KB config file with zero lines of code; the file simply tells the Precog engine what to do with a new data source to make it analytic-ready.
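The core idea of deriving structure from the data itself, rather than hand-coding it, can be sketched in a few lines. The toy function below walks one sample API record and proposes a relational layout: scalars become columns, nested objects are folded into the parent table, and lists of objects become child tables. This is purely an illustration of the principle, not Precog’s actual engine or config format.

```python
# Toy sketch of schema inference from a sample record (illustrative only):
# derive table/column definitions automatically instead of hand-coding them.

def infer_schema(record: dict, table: str = "root") -> dict:
    """Map one sample record to {table_name: {column: type_name}}."""
    tables = {table: {}}
    for key, value in record.items():
        if isinstance(value, list) and value and isinstance(value[0], dict):
            # a list of objects becomes its own child table
            tables.update(infer_schema(value[0], table=f"{table}_{key}"))
        elif isinstance(value, dict):
            # a nested object is flattened into the parent table
            for k, v in value.items():
                tables[table][f"{key}_{k}"] = type(v).__name__
        else:
            tables[table][key] = type(value).__name__
    return tables

sample = {
    "id": "INV-1",
    "customer": {"name": "Acme"},
    "line_items": [{"sku": "A-100", "amount": 25.0}],
}

schema = infer_schema(sample, table="invoice")
```

Because the schema is computed from the source data, a new or changed API shape yields a new schema automatically; nothing endpoint-specific is hard-coded.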
Yes, a single connector for all of your data
So in a very real sense, the idea of connectors goes away with Precog. It’s a platform that simply lets you connect to the sources you need, when you need to, and load that data into your data warehouse or any other analytic infrastructure. And it’s a single solution with a highly predictable pricing model. This is what customers want: all their data, predictable pricing, and no need to use multiple vendors to solve the same problem.