Written by — Kimmo Kantojärvi, Data Architect Partner
Selecting the right technology for your business case is a crucial but tricky task. In this piece, Kimmo shares his experiences of data platform selection and offers tips on what to consider when faced with choosing a technology.
Redshift, Snowflake, Synapse, Firebolt, DW, MPP, data platform, data cloud - you’ve probably heard all the magic words already. Working as a trusted consultant for large enterprises, I have experienced my fair share of different technologies. Over the years, I have come to notice that selecting the right technology from that list is not as straightforward as you may think.
In the following, I will share my own experiences of finding and selecting data platform technology and then offer tips on what to consider when doing so. By the end, you will understand some of the quirks of these data platforms and how to approach choosing the right one.
Firstly, to truly grasp why finding the right data platform is tricky, it is worthwhile to understand how different data platforms have developed over the years. So let’s back up a little to 2012, the year when the most significant shift towards cloud-based data platforms started. This was the year when AWS released Redshift, the first massively parallel processing (MPP) data platform that could be purchased with the swipe of a credit card. Not only were you able to avoid hefty price negotiations and long contracts, but Redshift also offered the elasticity of various instance sizes that everyone was looking for. What’s more, Redshift eased DBA tasks by providing automatic backups and other similar functionality - perhaps with the belief that some organizations would not need the DBA role at all anymore.
Although Redshift delivered lots of value to organizations, it had its setbacks as well. Redshift was based on technology from a company called ParAccel, into which Amazon had earlier invested. Of course, AWS modified the technology to fit its ecosystem better, but the principles of ParAccel remained. ParAccel was not, however, mentioned in AWS sales materials. Why was it then important to know that ParAccel was hiding in Redshift’s shadow?
Well, the ParAccel database was based on a so-called shared-nothing architecture, in which each node processes its own slice of the data independently of the other nodes. This was the standard approach at the time and was also used by other database companies such as Teradata and Oracle (one could even argue that Redshift was just the old on-premise Teradata running in the cloud). In practice, this shared-nothing architecture caused performance restrictions. Also, because data processing capacity and data storage capacity were tightly coupled, whichever of the two you needed more of dictated the cost. As a customer, you simply could not choose massive data storage with low compute power, or vice versa. As a result, customers were forced to overspend to get adequate performance.
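To make that coupling concrete, here is a minimal sketch in which every price and capacity is a made-up assumption (not an actual Redshift or Snowflake rate): in a shared-nothing cluster you buy nodes that bundle compute and storage, so whichever dimension you need more of dictates how many nodes - and how much of the other dimension - you end up paying for.

```python
# Illustrative sketch only: every price and capacity below is a made-up
# assumption, not an actual Redshift or Snowflake rate.
import math

# Hypothetical bundled node (shared-nothing cluster): compute and storage
# come together, whether you need both or not.
NODE_COMPUTE_UNITS = 8
NODE_STORAGE_TB = 2
NODE_PRICE_PER_MONTH = 1500

# Hypothetical decoupled pricing: compute and storage billed separately.
COMPUTE_UNIT_PRICE = 120
STORAGE_TB_PRICE = 25


def coupled_cost(storage_tb: float, compute_units: float) -> float:
    """Shared-nothing: buy enough nodes to cover the larger of the two needs."""
    nodes_for_storage = math.ceil(storage_tb / NODE_STORAGE_TB)
    nodes_for_compute = math.ceil(compute_units / NODE_COMPUTE_UNITS)
    return max(nodes_for_storage, nodes_for_compute) * NODE_PRICE_PER_MONTH


def decoupled_cost(storage_tb: float, compute_units: float) -> float:
    """Separated storage and compute: each dimension is billed on its own."""
    return storage_tb * STORAGE_TB_PRICE + compute_units * COMPUTE_UNIT_PRICE


# A storage-heavy, compute-light workload: the coupled model forces you to
# buy 20 nodes just for the disks, dragging unused compute along.
print(coupled_cost(storage_tb=40, compute_units=4))    # 20 * 1500 = 30000
print(decoupled_cost(storage_tb=40, compute_units=4))  # 40*25 + 4*120 = 1480
```

The absolute numbers are invented; the point is simply that bundling the two dimensions lets the larger one drive the whole bill.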
I came to face this issue most often when building data platforms based on a data vault. The data vault consumed a lot of processing capacity and was thus expensive even with smaller data volumes. And because everyone was initially so excited about Redshift’s value promise, people ended up using it in ways it was not meant to be used.
While customers were a bit frustrated with Redshift’s performance on complex data models, a new star was rising from developers with Oracle and Vectorwise backgrounds. The first public announcement of Snowflake came in 2015, after it had been privately available for some time. Although Snowflake was a similar service to Redshift, it was built for the cloud from the get-go. In addition to utilizing the elasticity and scalability of the cloud, Snowflake’s most remarkable benefit for customers was the separation of storage and compute, meaning you could spend more money on computing power and save on storage costs, or vice versa. In addition, billing moved to a per-second basis in 2017, meaning you did not have to pay for any unused processing time. These properties gave customers the flexibility they wanted.
On the other hand, Snowflake still required additional performance tuning. For example, files had to be created in specific sizes to optimize data loading speed, which meant additional work for developers. Customers were also puzzled by estimating the costs of running a Snowflake environment. With Snowflake’s ability to meet peak usage periods came fluctuating costs. Typically, organizations want to understand and plan budgets for their solutions, but this was challenging in Snowflake’s case. Precisely where Snowflake was able to scale to meet user needs, managing the costs became the new challenge.
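As a rough illustration of why budgeting is hard under usage-based, per-second billing, the sketch below totals warehouse run time into credits using hypothetical rates; real Snowflake pricing depends on edition, region, and warehouse size, so treat every number here as an assumption.

```python
# Illustrative sketch only: credit rates and the price per credit are
# assumptions, not actual Snowflake pricing.

CREDITS_PER_HOUR = {"XS": 1, "S": 2, "M": 4, "L": 8}  # assumed rates per size
PRICE_PER_CREDIT = 3.0                                 # assumed price


def monthly_cost(runs):
    """Total cost for a month of per-second billed warehouse usage.

    runs: list of (warehouse_size, seconds_running) tuples.
    """
    credits = sum(
        CREDITS_PER_HOUR[size] * seconds / 3600 for size, seconds in runs
    )
    return credits * PRICE_PER_CREDIT


# A quiet month: only a nightly two-hour load on a small warehouse.
quiet_month = [("S", 2 * 3600)] * 30
# A busy month: the same loads plus 80 ad-hoc 45-minute analyst sessions
# on a large warehouse that spins up on demand.
busy_month = quiet_month + [("L", 45 * 60)] * 80

print(monthly_cost(quiet_month))  # 360.0
print(monthly_cost(busy_month))   # 1800.0
```

The gap between the quiet and busy months is exactly the budgeting problem: with an elastic platform, the bill tracks usage rather than a fixed capacity.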
At the end of 2019, Microsoft released Azure Synapse Analytics, which some dubbed the “Snowflake killer.” It soon became evident that Synapse was partly built on top of existing technologies, making it a not-so-new offering. The dedicated resource side of the offering was entirely based on the existing Azure SQL DW, and it therefore carried that product’s legacy in performance and scalability. Despite the new brand and sales slides, the Azure SQL DW limitations continued with Synapse. That being said, the serverless on-demand offering of Synapse was a valued new addition. Nevertheless, instead of being the “Snowflake killer,” Synapse could have been better described as a new self-service analytics offering based on partly existing and partly new solutions.
At the same time, AWS released new RA3 instances for Redshift, which separated compute and storage, bringing Redshift into the same game as Snowflake. One may now ask: is there actually any real difference between Redshift and Snowflake anymore?
One of the things that make technology choices so tricky is the need to satisfy the requirements of all key stakeholders. Business is typically interested in the costs and results of the data platform - i.e. being able to fulfill their business data needs. On the other hand, procurement is interested in various contractual topics, and IT is concerned with the integration possibilities of the data platform and existing technological choices.
"One of the things that make technology choices so tricky is the need to satisfy the requirements of all key stakeholders."
Snowflake has its challenges, as it is not an integrated part of any major cloud ecosystem, whereas Synapse (Azure) and Redshift (AWS) are. So from a procurement point of view, it might be easier to get buy-in for Synapse than for Snowflake. However, Synapse might not fulfill the business needs as well. In practice, one cannot select the most cost-effective and best-performing technology if company policies do not allow it.
As you might have already noticed, the sales slides of different technology vendors don’t always reveal the small details behind solutions, which causes unexpected problems once the technology has been taken into use. In addition, technologies are ever-evolving - what used to be the standard earlier is now already replaced with something new. So keeping up-to-speed with the changes is important. What’s more, stakeholders have different needs for the solution, so discussing with all relevant stakeholders is crucial before choosing the technology.
"Get testing, keep up-to-date with the latest developments, and ask around - this will benefit you as well as your business."
The best way to discover the truth behind the sales slides is to test vendors’ technologies independently. It also provides an excellent opportunity to get a feeling of how the technology works. Are some things really cumbersome to do, or perhaps unexpectedly easy to manage? In short, we recommend carrying out the following steps:
Define the most important and potentially complex use cases for your data platform
Run a small PoC project to test these use cases, considering functionality, performance, and security aspects (a minimal timing harness is sketched after this list)
Get the best available experts to help with the PoC project. They can utilize their previous experience to find out potential issues with different technologies.
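As one possible starting point for such a PoC, here is a minimal sketch of a timing harness that runs the same use-case queries against each candidate platform. The query texts and the per-platform run_query callables are placeholders (assumptions) that you would wire up to the vendors’ own Python connectors; this is a sketch, not a working integration.

```python
# Minimal PoC timing harness sketch. The queries and the per-platform
# run_query callables are placeholders (assumptions) to be wired to the
# vendors' own Python connectors.
import time
from typing import Callable, Dict, List

# One representative query per important use case (step 1 above).
USE_CASE_QUERIES: Dict[str, str] = {
    "incremental_load_merge": "SELECT ...",    # placeholder SQL
    "data_vault_point_in_time": "SELECT ...",  # placeholder SQL
    "dashboard_aggregation": "SELECT ...",     # placeholder SQL
}


def benchmark(
    platforms: Dict[str, Callable[[str], None]],
    queries: Dict[str, str],
    repeats: int = 3,
) -> Dict[str, Dict[str, float]]:
    """Run every query on every platform and keep the best wall-clock time."""
    results: Dict[str, Dict[str, float]] = {}
    for platform_name, run_query in platforms.items():
        results[platform_name] = {}
        for query_name, sql in queries.items():
            timings: List[float] = []
            for _ in range(repeats):
                start = time.perf_counter()
                run_query(sql)  # e.g. cursor.execute(sql) on the vendor's connector
                timings.append(time.perf_counter() - start)
            results[platform_name][query_name] = min(timings)
    return results


# Usage: pass one callable per candidate platform, for example
#   results = benchmark(
#       {"redshift": run_on_redshift, "snowflake": run_on_snowflake},
#       USE_CASE_QUERIES,
#   )
```

Wall-clock timing of a handful of representative queries will not replace a full evaluation, but it is usually enough to surface the cumbersome or unexpectedly easy parts of each platform early.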
As mentioned earlier, it is also critical to discuss and agree on the potential solution with all relevant stakeholders. When company policies prevent selecting the best technology available, it is worthwhile to work on this inside the organization and aim to change those policies together with the relevant stakeholders.
All in all, one simply cannot solely trust the sales slides. Get testing, keep up-to-date with the latest developments, and ask around - this will benefit you as well as your business. And once you’re up and running with your data platform, be sure to check out our Data Engineering Manifesto, which shares four valuable principles on how to do Data Engineering well.
We're on the constant lookout for data engineers, data architects and data consultants to join our troops. Take a look at our open positions and send us your application today!