Written by — Tuomas Melin, Data Architect Partner
The Data Engineering Manifesto presents four valuable principles on how to do Data Engineering well.
Data Engineering as a profession has been around for quite some time already. We have seen it shake off the oddities of "ETL Developer" or "Data Integration Specialist", and we have witnessed the term grow out of the shadow of the Data Scientist, which was once labeled the "sexiest job title of the 21st century". Nowadays, Data Engineers are among the most sought-after professionals. The title and profession are respected in their own right, and the expertise keeps growing.
However, I haven't yet seen any attempts to codify "How To Do Data Engineering Really Well", so here is my proposition of what the Data Engineering Manifesto should contain. Read this as you would read the famous Agile Manifesto and its well-known quote "...while there is value in the items on the right, we value the items on the left more".
Even though this seems self-evident, it really is the most important rule, so no going home without addressing it. It also needs to be highlighted partly because of the large spectrum of different types of Data Engineers out there. This rule is vital especially if the Data Engineer sits between business and IT, or works as part of a centralized data team or data platform. I've seen many cases where the true business need gets lost in communication hurdles, or worse, is never articulated downstream at all. And I've seen cases where there wasn't a business case to begin with! But the more the data engineering function knows about the business need, the better it can design data pipelines, data storage, and so on.
If you are a business stakeholder, here's some free advice: make sure your Data Engineering team knows why you want that shiny new data pipeline up and running. You will get more motivated people and a better-designed solution.
This take might seem a bit controversial, especially if you take it out of context. But when you are building, fixing, and maintaining data solutions, you first and foremost want someone to use the solution. I've seen too many examples of whole data platforms that were built with the grandiose idea of, for example, unifying enterprise BI capability once and for all. Only to see the platform development efforts put on hold because there were no users. It doesn't matter how good your data quality is if you are solving the wrong problem.
But don't get me wrong. Data Quality is really important. It is one of the most important aspects of a data solution with any business value. And Data Quality is really closely linked to the utilization of the data, as faulty data will quickly drive the utilization close to zero. Both are important. Both are needed. But users come first, quality right after that.
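To make the link between quality and utilization concrete, here is a minimal sketch of the kind of lightweight quality check that protects trust in the data before users touch it. It assumes pandas, and the dataset and column names (orders, order_id, order_ts) are purely hypothetical.

```python
# A minimal sketch of a data quality check that keeps utilization up.
# The table and column names (orders, order_id, order_ts) are hypothetical.
import pandas as pd

def basic_quality_report(df: pd.DataFrame, key_column: str, timestamp_column: str) -> dict:
    """Return a few simple quality signals for a dataset before users touch it."""
    return {
        "row_count": len(df),
        "duplicate_keys": int(df[key_column].duplicated().sum()),
        "null_rate_per_column": df.isna().mean().round(3).to_dict(),
        "latest_record": df[timestamp_column].max(),
    }

if __name__ == "__main__":
    orders = pd.DataFrame(
        {
            "order_id": [1, 2, 2, 4],
            "order_ts": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-02", None]),
            "amount": [10.0, None, 25.5, 40.0],
        }
    )
    print(basic_quality_report(orders, key_column="order_id", timestamp_column="order_ts"))
```

Even a handful of signals like these, surfaced before faulty data reaches the users, helps keep utilization from sliding toward zero.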
It's tempting to try to fix every problem with a tailor-made solution. Especially for me, since I come from a software development background where anything can be solved with some Python. And glue. With this background, you know exactly what the solution needs to be, and you have your Python, Scala, etc. at hand. The tricky thing is that the problems with this approach don't show up early on. With any luck, it'll be smooth sailing for a long time. But if you repeat it again and again, you'll soon be maintaining custom software solutions instead of operating your business. If you're keen to learn more about choosing the right technology for your business needs, check out our blog post on data platform selection.
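To make the trap concrete, here is a hypothetical sketch of what "some Python and glue" often looks like. The endpoint, schema, and table below are made up; the point is that every line of it becomes custom software you now own and maintain.

```python
# A hypothetical "Python and glue" pipeline: fetch a CSV export and load it into a database.
# The URL, schema, and table are made up; everything below is now yours to maintain:
# retries, schema drift, credentials, scheduling, alerting, backfills...
import csv
import io
import sqlite3
import urllib.request

SOURCE_URL = "https://example.com/export/customers.csv"  # hypothetical endpoint

def run_pipeline(db_path: str = "warehouse.db") -> None:
    # No retries, no auth rotation, no schema validation: all future maintenance work.
    with urllib.request.urlopen(SOURCE_URL, timeout=30) as response:
        rows = list(csv.DictReader(io.TextIOWrapper(response, encoding="utf-8")))

    with sqlite3.connect(db_path) as conn:
        conn.execute("CREATE TABLE IF NOT EXISTS customers (id TEXT, name TEXT)")
        conn.executemany(
            "INSERT INTO customers (id, name) VALUES (:id, :name)",
            rows,  # breaks the day the source renames or drops a column
        )

if __name__ == "__main__":
    run_pipeline()
```

One script like this is harmless; a few dozen of them, each slightly different, is a software product your team is now running on the side.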
This point is especially important when you are consulting and planning a solution for someone else. The situation is different when you are also the one building, maintaining, and updating the solution. But sensible tooling and make-or-buy decisions are relevant in every setup, especially given how rapidly data tooling is evolving right now. Having the right tools, and updating or replacing them as business needs change, will be the better option in the long run.
Some of the frontrunners even call this part "Data Supply Chain Management". You know what I mean. When you push your data solution to production, you're not done yet. You are only turning a new page, and on that page you will need Data Management. In fact, the need for Data Management arises even earlier, and in some respects you are already late.
Data solutions (and therefore businesses) cannot survive "in the wild" without Data Management. Before any initiative even starts, Data Management should direct the way initiatives are launched: what data they will include, who owns that data, who the users are, and why they need access to the data. Data Management describes the way businesses organize around data and orchestrate the whole data lifecycle. It's a key building block in any data journey. And it's not business as usual if you haven't done it before.
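As a rough illustration of what "directing the way initiatives are launched" can mean in practice, here is a minimal sketch of an ownership record for a dataset. The fields and example values are hypothetical, not a standard; the point is that ownership, users, access reasoning, and lifecycle are written down before the pipeline ships.

```python
# A minimal sketch of the ownership metadata Data Management asks for
# before a dataset enters production. Fields and example values are hypothetical.
from dataclasses import dataclass, field

@dataclass
class DatasetRecord:
    name: str
    owner: str                      # accountable business owner, not just the engineer
    source_system: str
    consumers: list[str] = field(default_factory=list)
    access_reason: str = ""         # why the consumer groups need this data
    retention: str = "unspecified"  # lifecycle: how long the data may be kept

orders_dataset = DatasetRecord(
    name="sales.orders",
    owner="head_of_sales@company.example",
    source_system="ERP",
    consumers=["bi_team", "demand_forecasting"],
    access_reason="Revenue reporting and demand planning",
    retention="7 years",
)
print(orders_dataset)
```

Whether this lives in a data catalog, a YAML file, or a spreadsheet matters less than that someone owns the answers before the solution hits production.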
We at Recordly want to do data engineering right. If you want us to join you on your data quest, get in touch!