Written by — Mika Heino, Data Architect
After the GenAI boom in 2023, vendors rolled out those features as GA in 2024 into their platforms. Storage Wars saw Delta Lake's dethroning as de-facto open table format as Apache Iceberg rose to prominence, joined by challengers like XTable and Uniform. Let's dive into what Snowflake, Amazon, Google, Microsoft and Databricks introduced across their platforms and offerings during 2024.
It's that time of year again.
Time for a recap of the year 2024.
I considered not writing this year, as the whole internet is full of blog posts written by LLMs, meaning that text pieces that have a human touch are being overshadowed by auto-generated content. This means that getting your voice heard is hard and it might be mixed with hallucinated ChatGPT content. As a professional in the data field and a rookie content creator, I would say that's not a motivating situation.
This means this year's content is lighter than last year's. I'll focus on the top 5 important product releases and updates from the major vendors, and you can check the technical details on the vendors' sites. So without further ado, if you are considering investing in a new data platform and want to know which one offers the best value for your money, or you haven't had the time to check what features your favorite data platform added this year, this blog post is just for you.
Personal note: Most of this blog post was written in late December 2024, and I checked the status of each release (both preview and GA) at the time of writing. I double-checked these statuses before publishing the blog. However, there may still be some errors, for which I apologize. I have included links to the features, allowing you to view the correct status by visiting the vendors' sites and release notes.
Also: If you're on mobile, turn your phone sideways for a better reading experience.
If the previous year was a whirlwind for Generative AI and new vendors promising new AI features to their platforms, 2024 has been the year of delivering those promises. Speaking of which, like 2023, I must point out that the term 'data warehouse' is somewhat misleading nowadays. It's more appropriate to refer to these platforms as 'data platforms,' given that their offerings extend beyond mere storage and computation capabilities. However, I'll stick to 'data warehouse' in the title for consistency (see below) and to differentiate this conversation from customer data platforms (CDP).
This year we did not get any 'massive' updates or overhauls to existing products. Feature-wise, the year was delightfully calm, as vendors focused on delivering the preview features announced in 2023. Instead, the 'Storage Wars', which I already wrote about in my previous blog, continued. Apache Iceberg started getting noticed around the industry through vendor adoption: Snowflake announced the general availability of Iceberg tables, Google BigQuery announced the preview of Iceberg tables, Databricks acquired Tabular, and finally AWS released S3 tables.
All this led to a situation where Delta Lake got a death blow as the de-facto data lake table format, and we were ready for internet arguments about table format superiority similar to Blu-ray vs HD DVD (or, depending on your age, Betamax vs VHS). To make things clear, by death blow I mean that Delta Lake is suffering from disruptor karma and more table formats have entered the discussion as the possible de-facto open table format. The Delta Lake community introduced UniForm in Delta Lake 3.0, and Apache XTable is gaining visibility. This is to say that Apache Iceberg is just one of the possible table formats of the future and the 'Storage Wars' are far from over. Alex Merced's blog "Despite Uniform and Apache XTable, your choice of Table Format still matters (Apache Iceberg, Apache Hudi and Delta Lake)" lists several reasons why the discussion about the open lakehouse stack is far from over, and what nuances and trade-offs you should note in your architecture if you wish to implement a data lake in 2025.
That being said, the data lakehouse pattern seems to be here to stay, but at the same time, it has led some people to wonder what this technical solution solves, other than handling streaming better than traditional data warehouses. There is a good blog that you should read on the topic: Dani's "Apache Iceberg: The Hadoop of the Modern Data Stack?" highlights that you shouldn't rush into building a data lake without strategic goals, and I agree. I'm not against data lakes, even though it sometimes looks like it, but I would like to focus more on data quality instead of new technologies. Otherwise we, as a data community, will shoot ourselves in the foot. In the future, the ability to provide better data quality will be more important than whether you can configure a data lake or not. The business is interested in data quality and how data is used, not in how you store it.
Leaving the data lakehouse discussion behind, in this article we'll explore cloud-based data platforms that provide not only storage and computation services but also support for various programming languages (such as SQL, Python, Java, R, and more), provide a GenAI toolset, and even come equipped with frameworks and integrated development environments (IDEs) for machine learning.
What data platforms are in the market then, and how do they differ? The list is rather long: Amazon Redshift, Google BigQuery, Microsoft Fabric, Snowflake, Databricks, Firebolt, Oracle Autonomous Data Warehouse, SAP Data Warehouse Cloud, Teradata Vantage, IBM Db2 Warehouse on Cloud and Cloudera Data Platform are all products on the market that offer capabilities for data warehousing, machine learning, and real-time analytics. We could also discuss MotherDuck and DuckDB, which I mentioned last year, but determining which products to talk about is hard, as including DuckDB raises the question of why I don't talk about ClickHouse, Chroma, or Materialize, and the list goes on.
Instead, to keep this article simple, I'll focus on the first five platforms, which are particularly influential in European markets and, to be honest with the audience, are the platforms that our customers ask for when it comes to greenfield or migration projects.
If you want to check only specific data warehouses and their new product updates, I've created the following table of contents. Additionally, I've included a product comparison matrix where I have placed all the new releases in their respective categories.
Product comparison matrix
Azure / Microsoft Fabric
Databricks
Snowflake
AWS / Amazon Redshift
Google BigQuery
Conclusion
Product comparison matrix
The following product comparison matrix highlights two aspects: how extensive each platform's feature set is, and how stable this year was compared to 2023. As previously stated, this year services went GA and we were (gladly) not introduced to new services, as you can see when comparing this matrix to last year's.
This product comparison matrix is not complete in the sense that it does not list every service per feature per data platform. Instead, the idea is to highlight whether the platform has the capability or not.
| Feature | Snowflake | Databricks | Azure | Google | AWS |
|---|---|---|---|---|---|
| Serverless Compute "A service where compute is separated from storage" | Virtual Warehouses | SQL Serverless | SQL Analytics endpoint | BigQuery | Redshift Serverless |
| ML "Ability to pre-process, train, manage and deploy ML models" | Container Services / Snowpark ML / Snowflake Notebooks with Container Runtime (new) | Databricks Runtime / Runtime for ML | Synapse Data Science | Vertex AI / Vertex AI API | Amazon SageMaker / Amazon Bedrock |
| Application runtime "Ability to host & run any kind of workload in data platform" | Native Apps / Container Services | Container Services | Azure Containers / AKS | Containers | Amazon ECS / EKS / Fargate |
| Generative AI "Ability to leverage LLM based services in a simpler manner" | Cortex | Lakehouse AI | Azure OpenAI services | Duet AI | Amazon Q / Amazon Bedrock |
| User Assistance "Ability to assist end users to generate SQL and code with natural language" | Snowflake Copilot | Databricks Assistant | Copilot for several services including Power BI and Fabric Databases | Duet AI in BigQuery | Amazon CodeWhisperer / Amazon Q |
| User Search "Ability to search data assets with natural language" | Universal Search | Lakehouse IQ / Semantic Search | N/A | N/A | N/A |
| Programmatic Access "Ability to use data platform services through code" | SQL REST API / SnowSQL | SQL Statement Execution API / CLI | Azure CLI | BigQuery REST API / bq command-line tool | AWS CLI |
| OLTP functionalities "Ability to serve same data assets in columnar and row format while enforcing integrity" | Hybrid Tables | N/A | N/A | N/A | N/A |
| Notebook Experience "Ability to run code in Notebook manner" | Snowflake Notebooks / Snowflake Notebooks with Container Runtime (new) | Databricks Notebooks | Synapse Data Engineering | Vertex AI Notebooks | Amazon SageMaker |
| Marketplace "Ability to buy, sell and search data products or add-ons to your data platform" | YES | YES | YES | YES | YES |
| ETL "Ability to ingest data and process it without the need of 3rd party services" | Streams and Tasks / Container Services | Delta Live Tables / Containers | Data Factory, Azure Functions | Dataflow, Data Fusion, Cloud Run & Cloud Functions | AWS Glue, Step Functions |
| Data Visualisation "Ability to visualize data for application & reporting usage" | Streamlit | Dashboards | Power BI | Looker | Amazon QuickSight |
| Streaming "Ability to ingest & process real-time data" | Snowpipe Streaming | Spark Structured Streaming | Synapse Real-Time Analytics, Event Hub, Azure Stream Analytics | Storage Write API, Dataflow, Pub/Sub | Amazon Kinesis Data Streams |
Azure / Microsoft Fabric
Fabric has now been with us for about a year and a half, and my initial impression of it is still one of confusion. Historically, Microsoft hasn't been the best at providing clear architecture or documentation on how its products should be used. By that I mean that, for example with Azure and Fabric, you can leverage a multitude of services for the same workload. This year's release of Fabric Databases is a good example of this: when should you choose Fabric Databases over Synapse Analytics? Anyway, Microsoft Fabric is best compared to Databricks because they both leverage the same data lakehouse pattern. Differences arise in the pricing model (Fabric is flat), and Fabric is also geared more towards end-users (low-code) with a more comprehensive stack including Power BI, Data Factory, and Purview.
Currently, I would say that Fabric is a good option if you're in Azure and you need an all-in-one solution, as it provides all the necessary bits and pieces for it. Just bear in mind that Fabric itself is a new product and has a long list of features to add (just look, for example, at the features that were only recently added to Fabric warehouse, including the TRUNCATE command and support for JSON documents).
1. Fabric Databases
As noted above, Fabric Databases are a bit confusing. According to Microsoft documentation, they are meant to be used for OLTP workloads in Fabric, but we already have Azure SQL Database (based on the trusted SQL Server). Fabric Databases are based on Azure SQL Database, and they also replicate all the data stored in a Fabric Database into OneLake in Parquet file format. The use case for Fabric Databases seems to revolve around using the same data in OneLake from a multitude of services (Notebooks, Spark, etc.). The jury is still out on whether we need Fabric Databases, because Microsoft / Azure already provides so many database options (not to mention their data lakehouse patterns).
2. Fabric CI/CD tools
The data landscape is moving more and more towards continuous integration and deployment. Tools such as dbt are good examples of this, and now Microsoft is adding CI/CD tools into Fabric itself. With the latest addition, Fabric CI/CD supports data pipelines, warehouses, and Spark environments. These CI/CD pipelines can be triggered using APIs and integrated into Azure DevOps for end-to-end automation, which is good news for us DevOps users. These features extend the CI/CD capabilities that previously existed in, for example, Power BI deployment pipelines.
3. Copilot in Fabric Databases and Power BI
These features should be self-explanatory. Microsoft keeps pushing Copilot into all its products (even the physical ones), and now we have Copilot within the features we data people use a lot: databases and, most especially, Power BI.
4. Open Mirroring
Open mirroring is an extension to the mirroring features that enable replication of data from various Fabric sources into OneLake. The idea behind mirroring is that data is continuously synchronized using CDC and converted into Delta tables for analysis.
Now this has been extended into Open mirroring, which provides an API layer for non-Microsoft vendors to push data into OneLake in a similar manner (CDC included). In the first phase, well-known partners for this feature include Oracle and MongoDB.
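To make the CDC idea concrete, here is a minimal Python sketch of applying a stream of change events to a replica. It is only an illustration of the general technique: the `ChangeEvent` type, the `apply_changes` function and the event format are my own assumptions, not the Fabric or Open mirroring API.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class ChangeEvent:
    """Hypothetical CDC event; real mirroring formats will differ."""
    op: str                               # "insert", "update" or "delete"
    key: int                              # primary key of the changed row
    row: Optional[Dict[str, Any]] = None  # changed columns (None for delete)


def apply_changes(replica: Dict[int, Dict[str, Any]],
                  events: List[ChangeEvent]) -> Dict[int, Dict[str, Any]]:
    """Replay change events in order so the replica follows the source."""
    for event in events:
        if event.op in ("insert", "update"):
            # Merge the changed columns into whatever the replica already holds.
            replica[event.key] = {**replica.get(event.key, {}), **(event.row or {})}
        elif event.op == "delete":
            replica.pop(event.key, None)
        else:
            raise ValueError(f"unknown operation: {event.op}")
    return replica


source_changes = [
    ChangeEvent("insert", 1, {"name": "Alice", "city": "Helsinki"}),
    ChangeEvent("insert", 2, {"name": "Bob", "city": "Oulu"}),
    ChangeEvent("update", 1, {"city": "Tampere"}),
    ChangeEvent("delete", 2),
]

replica = apply_changes({}, source_changes)
print(replica)  # {1: {'name': 'Alice', 'city': 'Tampere'}}
```

The point is that only the changes travel, not full table copies, which is what makes continuous synchronization affordable.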
5. Domains in OneLake
Domains in OneLake is a feature that is heavily aligned with data mesh architecture. With domains, you can group data into business areas such as marketing, sales, or HR. Domains work on top of existing user rights, which means that you can grant domain-specific policies and also create sub-domains. From a UI perspective, all this means that you can now find all sales data through one link that gathers it into one place.
Databricks
A product filled with excellent features, but somehow unable to admit that it's essentially a mix of independently developed excellent components that would need a reboot (in the same manner we got Mad Max: Fury Road).
The new excellent Serverless features are a good example of this. With the new Serverless SQL, you don't need any other products for data warehousing if you are using Databricks for ML/AI purposes, and yet these features are mixed with the legacy ones. Still, somehow Databricks and its enthusiasts cleverly reframe this situation (of independently developed components) as optimization possibilities and at the same time, they disclose who is paying for this fine-tuning and admin work. That being said, Databricks is the yin to Snowflake's yang. Both are needed because a monopoly is not good for anybody and competition creates innovation.
1. Databricks Apps
Do you know Streamlit inside Snowflake or Streamlit apps in the Streamlit community cloud?
Well, this is almost the same feature that Snowflake released last year, but for Databricks. Now you can run Streamlit, Gradio, Dash, you name it, within Databricks. The difference is that in Databricks you're running templated containers (Ubuntu with preinstalled libraries) on serverless compute, whereas Snowflake is focused on running Streamlit in a similar manner. Does it matter? No, unless you're a pure Streamlit fan, in which case Snowflake provides you with the most hassle-free experience. The reason is that you can also run Gradio, Dash, and Flask within Snowflake / AWS / GCP / Azure if you use containers.
2. Serverless Compute GA for Notebooks, Workflows and DLT
Future is now, old man. Last year Databricks released a highly awaited feature, Serverless Compute for SQL. This year the same trend continues, and now you can (finally) use serverless compute for Notebooks, Workflows, and Delta Live Tables. If serverless compute does not ring a bell, it means you're old.
3. Databricks AI/BI
Probably the hardest one to explain. Now you can create Dashboards (reports, if you will) with minimal coding using Databricks AI/BI. This includes Genie, a conversational assistant that you can ask business questions in plain language. Databricks AI/BI claims to learn continuously from your data and usage patterns, providing smarter, more accurate answers over time. You can add this to the category of GenAI assistants.
4. Databricks Assistant and AI-generated Comments
Databricks Assistant is what it sounds like. You can ask for help with your SQL or notebook-related questions, and the assistant will try to answer you in the best way it can. Having tried out both Assistant and Snowflake Copilot, I can say that they are both "ok-ish". If you don't know SQL at all, they are good, but if you have experience in any language, you'll soon see that the help of these Assistants/Copilots is... comparable to Google on steroids. They can help with a specific solution, but they are unable to understand the whole problem that you're trying to solve. For some reason, SQL assistants are not as good as pure coding assistants (think of Cursor, for example).
5. Cost tracking features
Now you can create policies and budgets for Databricks features that you use. Simple as that, but as the data platforms get bigger and bigger, these features are crucial.
Spark 4.0
Although Spark features are not Databricks features, it's hard to speak about Databricks without mentioning Spark 4.0. Spark 4.0 promises to follow ANSI SQL even more closely, making it easier to migrate traditional SQL workflows into Spark. The features that make Spark easier for SQL developers do not end there, as collation support is introduced in the latest version. For Snowflake developers, Spark 4.0 also introduces the VARIANT datatype, which works similarly to the one in Snowflake: it gives you the possibility to store semi-structured and unstructured data within the database.
Snowflake
Compared to the competition, Snowflake is an incredibly polished product, which makes it easy for everyone to use. Snowflake has a tightly integrated architecture where everything happens under the same hood and UI experience, but this has come at a cost: sometimes it has left little room for flexibility, requiring creative workarounds for some of the more complex ML/AI pipelines. Snowflake has now started to address this, first with Snowpark and lately with Notebooks and Container Services. These new features work excellently, but they currently feel like bolted-on features rather than an evolution. Yet, besides these minor improvement needs, Snowflake excels in governance, optimization, and cost predictability, making it the best platform on which to start your data platform journey.
1. Snowflake Notebooks and Notebooks on Containers
Snowflake released Notebooks in the summer. Out of the box, Notebooks offer familiar Snowflake role-based access controls and built-in integrations to Snowflake ML, Streamlit, and Git, making it easy to develop pipelines using the notebook experience (and also making Hex obsolete). Later in the autumn, Snowflake added the capability to run Notebooks within containers, opening up the possibility to schedule and automate notebook workflows. Container Runtime also introduced GPU support, as well as the possibility to use open-source packages through pip or to leverage pre-installed images containing libraries and frameworks such as PyTorch, XGBoost, LightGBM, and scikit-learn.
2. Apache Iceberg tables
Snowflake joined the Apache Iceberg family in the summer of 2024, and with it the data lakehouse architecture. Now you can leverage the Apache Iceberg format as an external table and benefit from Iceberg's table metadata for optimizing queries. Apache Iceberg is covered in many blogs already, but in short, it offers ACID transactions, schema evolution, and the possibility of time travel because of this metadata layer.
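If the link between a metadata layer and time travel feels abstract, the following toy Python sketch may help. It is a heavily simplified illustration of the snapshot idea, assuming a table is just a list of immutable snapshots that each point to a set of data files. The `ToyTable` and `Snapshot` names are made up for this example and are not Iceberg's or Snowflake's API.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Snapshot:
    """Immutable description of which data files make up the table."""
    snapshot_id: int
    data_files: Tuple[str, ...]


@dataclass
class ToyTable:
    """Hypothetical table whose metadata is a growing list of snapshots."""
    snapshots: List[Snapshot] = field(default_factory=list)

    def commit(self, added: Sequence[str] = (), removed: Sequence[str] = ()) -> Snapshot:
        # A write never edits an old snapshot; it appends a new one.
        current = self.snapshots[-1].data_files if self.snapshots else ()
        files = tuple(f for f in current if f not in removed) + tuple(added)
        snapshot = Snapshot(len(self.snapshots) + 1, files)
        self.snapshots.append(snapshot)
        return snapshot

    def files_as_of(self, snapshot_id: Optional[int] = None) -> Tuple[str, ...]:
        # Time travel is simply reading through an older snapshot.
        if snapshot_id is None:
            return self.snapshots[-1].data_files
        for snapshot in self.snapshots:
            if snapshot.snapshot_id == snapshot_id:
                return snapshot.data_files
        raise KeyError(f"no snapshot with id {snapshot_id}")


table = ToyTable()
table.commit(added=["a.parquet"])
table.commit(added=["b.parquet"])
table.commit(added=["c.parquet"], removed=["a.parquet"])

print(table.files_as_of())   # ('b.parquet', 'c.parquet')
print(table.files_as_of(1))  # ('a.parquet',)
```

Because old snapshots are never modified, readers always see a consistent version of the table, which is also the intuition behind the ACID guarantees.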
3. Hybrid tables
A long time in the making, Hybrid tables finally saw the daylight. Hybrid tables were first introduced at Snowflake Summit in June 2022, and now, at the end of 2024, they are GA in AWS (yes, not yet in Azure or GCP). If Hybrid tables are unknown to you, they are meant to solve the problem of serving transactional and analytical workloads from the same table. Not much is mentioned on how Hybrid tables work, but they seem to have a dual-storage architecture that stores the data in both row-based and columnar storage (this also means an extra storage cost and extra costs for the serverless resources managing the row storage clusters). The tables are meant for real-time data integrations, which makes them ideal for single-row operations, and because of that, they require primary keys and use row-level locking.
4. Snowpark pandas API
Snowpark has its own API for DataFrame, but we all know that pandas is the go-to data processing library. Because of this Snowflake added Snowpark pandas API into Snowpark. Now you don't need to use the to_pandas() command anymore. By using Modin as a frontend, pandas operations are converted into SQL queries for Snowflake's compute and you'll benefit from Snowflake's distributed architecture scalability and performance.
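The idea of turning DataFrame-style calls into SQL can be sketched in a few lines of Python. Note that this is a toy illustration under my own assumptions: the `LazyFrame` class and its methods are hypothetical and are not the Snowpark pandas or Modin API. It only shows the principle that operations are collected first and compiled into one query that the warehouse then runs.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class LazyFrame:
    """Hypothetical DataFrame-like object that records operations lazily."""
    table: str
    columns: Tuple[str, ...] = ("*",)
    filters: Tuple[str, ...] = ()

    def where(self, condition: str) -> "LazyFrame":
        # Nothing is executed here; the filter is only remembered.
        return LazyFrame(self.table, self.columns, self.filters + (condition,))

    def select(self, *columns: str) -> "LazyFrame":
        return LazyFrame(self.table, tuple(columns), self.filters)

    def to_sql(self) -> str:
        # Compile all recorded operations into a single SQL statement.
        sql = f"SELECT {', '.join(self.columns)} FROM {self.table}"
        if self.filters:
            sql += " WHERE " + " AND ".join(self.filters)
        return sql


query = LazyFrame("orders").where("amount > 100").select("customer_id", "amount")
print(query.to_sql())  # SELECT customer_id, amount FROM orders WHERE amount > 100
```

Since only the final SQL is sent to the platform, the data never has to be pulled into the client's memory, which is where the scalability benefit comes from.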
5. Snowflake Cortex
The Snowflake Cortex family saw a multitude of GenAI additions, and these are only a few of the releases.
AWS / Amazon Redshift
Redshift stands out as one of the oldest data platforms on the market (BigQuery came earlier), and AWS has managed to keep adding features to the core product, but the underlying technology, PostgreSQL, has shown its limitations. The new RA3 clusters and Redshift Serverless are still valid reasons to lean on Redshift as your data platform if you've already invested heavily in the AWS ecosystem. For a new greenfield project that leans on a mix of products, however, Redshift falls behind in the list of 'most desirable data platform to start with': as a standalone product, Redshift is looking outdated.
1. Redshift multi-data warehouse writes
Finally, after all these years, Redshift has isolated compute similar to the competition. This release, albeit oddly named, means you can write data into Redshift databases from multiple Redshift instances (warehouses) with varying instance sizes. This release is an older one, dating back to November 26th, 2023, and I just missed it in my previous yearly blog. The feature has been explained in this blog in AWS lingo (because it uses data sharing), but generally, we are speaking of different compute sizes for different workloads, and it went GA in AWS a year after its release. To use this feature, though, you must be using the latest RA3 clusters and have supported data sharing available.
2. Bedrock Generative AI integration
Bedrock was released last year and provides functionality similar to Snowflake Cortex. This means that Bedrock gives you the ability to use foundation models, such as Claude, Llama 2, Mistral AI, and Amazon's own Titan, within AWS services. With this Redshift integration, we now get functionality similar to what other data platform providers have introduced: a means to use Bedrock within Redshift SQL for generative AI purposes such as summarization, sentiment analysis, or customer classification using LLMs. "Not-so-great" examples, I know. You could leverage spaCy or NLTK for these, but these are the examples that AWS provides in their blog post. Also, don't ask me why Amazon's Generative AI assistant is called "Q" when everything else is Bedrock.
3. Automatic and Incremental Refresh of Materialized Views
The ability to refresh materialized views automatically and incrementally has its use cases in data warehousing, and now Redshift has joined the data platforms supporting this feature with its late December 2024 release. For context, this feature already exists in Microsoft SQL Server, Oracle, Snowflake, and Google BigQuery (but not in Databricks).
4. AI-Driven Scaling and Optimization
"AI-Driven Scaling and Optimization" sounds fancy, but honestly, it just means that Redshift Serverless compute units (RPUs) are scaled up and down using more metrics, such as concurrent users, data volume changes, and query complexity. Under the hood, it might be even simpler if-else patterns, but as the marketing goes, this feature learns your data warehouse load patterns. This might be true or false, I'll let you test it out as the AI-Driven Scaling and Optimization feature is available in all AWS regions where Redshift Serverless is available.
5. S3 tables and S3 metadata
The big fuss. AWS released a managed Iceberg in early December 2024, and it got quite a lot of attention. What AWS has done is pretty similar to what it has done with other open-source Apache services such as Airflow: taking open-source code and making it a managed service without contributing to the source. This time they have created a managed version of Apache Iceberg and thus also picked their side and contributed to the Storage Wars. What S3 tables bring is simplified maintenance of Iceberg tables and native AWS integration, at a cost. To a Snowflake or Databricks user this sounds obsolete, but you need to bear in mind that there is a multitude of AWS data platforms that benefit from this. AWS's blog states that users of S3 tables will benefit from "3x faster query performance and up to 10x more transactions per second".
The S3 metadata release is bundled with S3 tables. Metadata of S3 bucket objects is stored in S3 tables as soon as the objects land in S3 buckets and is constantly updated with the latest data.
Before you get all excited about S3 tables, I encourage you to read Daniel Beach's blog, "AWS S3 Tables?! The Iceberg Cometh", where he shows step by step how underdeveloped S3 tables actually are, starting with the fact that AWS expects you to create them using an EMR cluster, which highlights that these tables are meant for AWS usage.
Google BigQuery
I could write this introduction in a similar manner as I wrote the Redshift one. BigQuery (BQ) stands as the oldest new-style data platform, where compute and storage have been separated. Whereas Redshift is built on technology that is not ideally suited for large-scale computing, BigQuery uses Google's proprietary technologies such as Dremel and Borg, and those technologies still hold up almost 15 years after BigQuery's introduction at Google I/O in 2010. So why am I comparing BigQuery to Redshift? The answer is the ecosystem. BigQuery is the best option when everything else you have is Google and you need a data platform that provides just the necessary features.
Why wouldn't I choose BigQuery as my greenfield data platform if I just said the technology still holds up? The ecosystem. Even though BigQuery provides all the necessary features, even faster and better than Redshift, BQ is not Google's main product. BQ will get the latest features, but later than Snowflake and Databricks, as they focus solely on data platform features. So if you're focused solely on data platform features, as this blog is, BQ is not the place to start a greenfield project.
1. Vector Search capabilities
By now, you should know how GenAI capabilities work and that you need a database to host vector embeddings for semantic search, similarity detection, and so on. Now BigQuery has added these capabilities to its core product, including an inverted file (IVF) index. As with competing data platform products, this means that you don't need a separate database beside your data platform to host vector embeddings.
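For those wondering what an inverted file index actually does, here is a minimal pure-Python sketch of the general IVF technique: vectors are grouped under their nearest centroid, and a search only scans the groups closest to the query. The centroids are hard-coded and all function names are my own assumptions for illustration; this is not how BigQuery exposes or implements its index.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def nearest_centroids(query: Vector, centroids: List[Vector], n: int) -> List[int]:
    """Return the positions of the n centroids closest to the query."""
    ranked = sorted(range(len(centroids)), key=lambda i: math.dist(query, centroids[i]))
    return ranked[:n]


def build_index(vectors: Dict[str, Vector], centroids: List[Vector]) -> Dict[int, List[str]]:
    """Assign every vector to the list ("inverted file") of its nearest centroid."""
    lists: Dict[int, List[str]] = {i: [] for i in range(len(centroids))}
    for item_id, vector in vectors.items():
        lists[nearest_centroids(vector, centroids, 1)[0]].append(item_id)
    return lists


def search(query: Vector, vectors: Dict[str, Vector], centroids: List[Vector],
           lists: Dict[int, List[str]], top_k: int = 2, nprobe: int = 1) -> List[str]:
    """Scan only the nprobe closest lists instead of every vector."""
    candidates = [item for c in nearest_centroids(query, centroids, nprobe)
                  for item in lists[c]]
    candidates.sort(key=lambda item: math.dist(query, vectors[item]))
    return candidates[:top_k]


# Toy data: in practice the centroids would come from clustering (e.g. k-means).
centroids = [(0.0, 0.0), (10.0, 10.0)]
vectors = {"a": (0.5, 0.2), "b": (1.0, 1.5), "c": (9.0, 9.5), "d": (10.5, 10.0)}

index = build_index(vectors, centroids)
print(index)                                               # {0: ['a', 'b'], 1: ['c', 'd']}
print(search((9.2, 9.0), vectors, centroids, index))       # ['c', 'd']
```

The trade-off is speed versus recall: a small `nprobe` scans fewer vectors but can miss a neighbor that happens to sit in another list.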
2. Gemini in BigQuery
Gemini in BigQuery is similar to Databricks Assistant and Snowflake Copilot, with a slightly different UI experience. That is to say, Gemini does not provide SQL functions in the way Snowflake Cortex or Bedrock do, where you can use LLM base models through separate SQL functions, for example for summarizing text.
3. Continuous SQL Queries
Continuous SQL queries give BigQuery the ability to handle event streaming data, whereas previously this had to be done with Dataflow or Pub/Sub to achieve the same result. Now you can create this whole pipeline within BigQuery itself. If you know Streams and Tasks in Snowflake, this is a similar feature.
4. BigQuery Data Canvas
Data Canvas is a visual, drag-and-drop interface aimed at less technical users. It helps users do data exploration, analysis, and visualization with BigQuery. With Data Canvas, you can prepare datasets, run queries, and create dashboards without understanding SQL.
5. Time-series analysis enhancements
BigQuery got time series analysis enhancements that help a lot, from a developer's point of view, with time series-related operations such as time windowing and gap filling. These operations work using SQL, so you don't need to learn any new syntax. To understand the meaning of these functions, we first need to understand the problem. Gap filling addresses the problem of missing data, for example when a network error interrupts data delivery. The new GAP_FILL function enables you to fill the missing gaps with different modes, such as linear interpolation or last observation carried forward (locf).
The windowing function helps you to easily map individual data points to specified output windows, with specific time duration. In BigQuery this is handled with new time bucket functions.
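To show what gap filling and time bucketing mean in practice, here is a small plain-Python sketch of the two concepts. It mirrors the ideas only: the `time_bucket` and `gap_fill` functions below are my own hypothetical helpers, not the signatures of BigQuery's SQL functions, and the sketch assumes that all readings sit on a regular one-minute grid.

```python
from datetime import datetime, timedelta
from typing import Dict


def time_bucket(ts: datetime, width: timedelta) -> datetime:
    """Map a timestamp to the start of its fixed-width window."""
    epoch = datetime(1970, 1, 1)
    return epoch + ((ts - epoch) // width) * width


def gap_fill(points: Dict[datetime, float], step: timedelta,
             method: str = "locf") -> Dict[datetime, float]:
    """Fill missing grid points between the first and last observation."""
    times = sorted(points)
    filled: Dict[datetime, float] = {}
    t = times[0]
    while t <= times[-1]:
        if t in points:
            filled[t] = points[t]
        elif method == "locf":
            # Last observation carried forward: reuse the previous value.
            filled[t] = filled[t - step]
        elif method == "linear":
            # Interpolate between the surrounding real observations.
            prev_t = max(x for x in times if x < t)
            next_t = min(x for x in times if x > t)
            fraction = (t - prev_t) / (next_t - prev_t)
            filled[t] = points[prev_t] + fraction * (points[next_t] - points[prev_t])
        else:
            raise ValueError(f"unknown method: {method}")
        t += step
    return filled


minute = timedelta(minutes=1)
start = datetime(2024, 12, 1, 10, 0)
# Readings at 10:00, 10:01 and 10:04; 10:02 and 10:03 were lost.
readings = {start: 10.0, start + minute: 12.0, start + 4 * minute: 18.0}

locf = gap_fill(readings, minute, method="locf")
linear = gap_fill(readings, minute, method="linear")
print([locf[t] for t in sorted(locf)])      # 12.0 is repeated for the two gaps
print([linear[t] for t in sorted(linear)])  # gaps rise evenly from 12.0 towards 18.0
print(time_bucket(datetime(2024, 12, 1, 10, 7, 42), timedelta(minutes=5)))
```

Which mode to choose depends on the data: carrying the last value forward suits state-like signals (a switch being on or off), while interpolation suits smoothly changing measurements such as temperature.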
Conclusion
Honestly, after reviewing all the features released in 2024, I would say that the cloud vendors AWS, Microsoft, and Google are all developing their data capabilities, albeit at different paces.
⭐ AWS with Redshift Serverless is a good base for a data platform, but, for example, their S3 tables are not a polished solution. It's more like "we had to implement something fast for the data lake market," indicating their development is chasing, not leading.
⭐ Google continues to provide a reliable data platform, but BigQuery might not have the latest features that other ones have. We can see this with feature sets they have released (Gemini, vector capabilities), but, as with AWS, they follow the market, not define it.
That being said, if you're in AWS or Google and everything is working, there aren't many reasons* to migrate (*assuming that you're using Redshift Serverless). When it comes to Microsoft Fabric, the story is a bit different.
⭐ Microsoft is bubbling under, but the product is still underdeveloped. It looks like a jack-of-all-trades, not (yet) being a leader or competitor in any feature set. Unless you want a seat on the "when will this be ready" train, I would consider other vendors.
Then we come to the pure data platform vendors: Snowflake and Databricks.
⭐ Databricks is good at everything, but when it comes to providing tools for non-technical persons, Databricks is still a bit too technical. That's their approach, which should let you tweak everything, but it also comes with a slightly messy developer experience because of the multiple toolsets (as I noted in the Databricks section).
⭐ Snowflake is also good at everything, but they have taken a different kind of approach: make everything easy under a single UI. As I noted in the Snowflake section, this approach comes at a cost that Snowflake has been trying to mitigate in the past years. The number of configuration possibilities might not be at the level of the competition, but the usability is unquestioned.
So which toolset would I propose for 2025? Before saying anything, I honestly do think both are superior products, and it kills me inside to see constant LinkedIn battles around the products. I would like to see more discussions where we praise each other for good achievements over complaints of which platform is more costly in "this-and-that" comparisons.
That being said, I would choose Snowflake over Databricks, not because Snowflake is somehow technically superior, but because of you, my reader. With the developments of GenAI, we've seen a multitude of new personas joining the data field, many of them newcomers, and that's a good thing. GenAI is disrupting the traditional data field, and tools that provide an easy way to start and continue building your data products, or that just help you get to your data, will ultimately win.
We Finns should know because of Nokia that being technically the most superior product is not always the road to success. So whether you, my reader, are a newcomer or just starting your data career, I trust that with Snowflake you can't go wrong, now and in the future.
With that said, see you next year (or by the end of this year), and let's find out if AI agents have replaced me in fixing my Beetle during 2025.
Did you enjoy the article? In creating this blog, I read too much about how AI agents will replace me, listened to Michael Scott from the Office episodes (I need background noise to work), utilized various tools such as Grammarly to fix my grammar, tried to convince Dall-E to create some images for editing, Paint+ for image creation and editing, and finally my trusty Wacom board for drawing. At Recordly, we encourage the use of any necessary tools to expand your knowledge and enhance your creative freedom.
THE STATE OF CLOUD DATA WAREHOUSES - 2024 EDITION