Two years ago, I developed an interest in data engineering, and fortunately, I recently got the chance to work as a data engineer. In this article, I will share the takeaways from my experience and what I have learned so far.
What Is Data Engineering?
Data Engineering is a subset of software engineering, or in other words, a more specialized version of it where the main focus is data. Your goal as a data engineer is to design and implement systems that make raw data usable, and to do so efficiently.
Your main customers are the data science and analytics teams.
Examples of Data Engineering Tasks:
Move data from multiple sources (MongoDB, Kafka, MySQL) into a single place (HDFS, Redshift) and make it queryable
Transform operational data into schemas that can be queried easily and efficiently
Create aggregates to speed up reports
Add efficient validation to the pipeline
Other examples include writing scripts, designing pipelines and fine-tuning tools.
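To make one of these tasks concrete, here is a minimal sketch of the "create aggregates to speed up reports" idea. It uses the stdlib sqlite3 module purely for illustration (the table and column names are made up); a real pipeline would build the same kind of summary table in a warehouse like Redshift or Hive.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (user_id INT, event_date TEXT, amount REAL);
    INSERT INTO events VALUES
        (1, '2023-01-01', 10.0),
        (2, '2023-01-01', 5.0),
        (1, '2023-01-02', 7.5);

    -- reports query this small aggregate instead of scanning raw events
    CREATE TABLE daily_totals AS
        SELECT event_date, COUNT(*) AS n_events, SUM(amount) AS total
        FROM events
        GROUP BY event_date;
""")

for row in conn.execute("SELECT * FROM daily_totals ORDER BY event_date"):
    print(row)
```

The point is the pattern, not the tool: pre-computing the GROUP BY once means every downstream report reads a few summary rows instead of the full event history.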
Tools
Data Engineering has a TON of tools. A lot more than I expected.
This is both a blessing and a curse: the stack can be very flexible, but you can easily get stuck in a loop of comparing tools, searching for the perfect one.
Tools include:
Spark: A multi-source engine for processing large amounts of data in memory
Hadoop: A collection of tools for processing, analyzing and storing data
Flink: Stream processing engine
Hive: Distributed data warehouse
Presto: Distributed SQL engine
Airflow: Workflow orchestration and scheduling tool
Literally ALL POSSIBLE DATA SOURCES
And the list goes on...
As for languages, the most commonly used languages are Python, Scala and Java. Mastering any of the three languages should be enough to get started.
Now that we're done with the basics, let's move on to the less obvious tips that I learned so far.
1. General DevOps Knowledge is a Must
I found myself SSH-ing into the servers a lot more than I expected.
Right off the bat, you need to set up each of the tools in your stack. In most cases, every tool has different deployment/orchestration methods that you need to understand, choose what fits and implement it.
Afterwards, you need to fine-tune these tools at a low level, which means relying on many of the utilities and methods normally used by DevOps engineers.
You also need to understand how the tools work under the hood so you can optimize them and make them resilient and secure.
Learn about Unix commands and deployment and orchestration tools, and get comfortable using the terminal.
2. Be Conscious of Hardware Resources
As a backend engineer, I paid limited attention to hardware and simply assumed that whatever I implemented would work, which it usually did. As a data engineer, however, I realized that the amount of available resources is crucial.
Most of these tools are hungry for resources since they deal with large amounts of data, so you need to keep the available resources in mind when choosing the stack, designing pipelines, configuring tools or writing scripts.
Know what resources each tool needs, and configure it to be efficient enough given what is actually available.
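As a small, stdlib-only sketch of this habit: before sizing a job, check what the host actually has. The function name, thresholds and the "leave one core for the OS" rule below are illustrative assumptions, not a recipe.

```python
import os
import shutil

def plan_job(data_dir: str = ".") -> dict:
    """Rough sizing: inspect the host before deciding how parallel
    a local processing job should be."""
    disk = shutil.disk_usage(data_dir)
    cpus = os.cpu_count() or 1
    return {
        "cpus": cpus,
        "free_disk_mb": disk.free // (1024 * 1024),
        # leave one core free for the OS and other processes
        "suggested_parallelism": max(1, cpus - 1),
    }

print(plan_job())
```

Cluster tools make the same decision through configuration (executor memory, cores per task, shuffle partitions), but the underlying question is identical: how much work fits on this machine?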
3. The JVM is Important
Many of the tools run on the JVM, which means that understanding how it works is a good investment. Learning how the JVM works, and how to tune it, proved very helpful for setting up, debugging and optimizing the tools.
Spend some time learning the basics of the JVM and how garbage collection works.
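For a sense of what this tuning looks like in practice, here is an illustrative fragment in the style of a Spark spark-defaults.conf. The values are examples only, not recommendations; memory sizes and GC choices have to be tuned against your own workload and JDK version.

```
# Illustrative executor sizing and GC settings (example values):
spark.executor.memory            4g
spark.executor.extraJavaOptions  -XX:+UseG1GC -verbose:gc
```

Knowing what heap sizing, garbage collection pauses and GC logs mean is exactly the JVM knowledge that makes settings like these more than copy-paste.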
4. Learn File Formats
There are many file formats you will use, each with its own pros and cons: ORC, Parquet, Avro, Delta, JSON and many more. Each stores data differently, which affects performance and resource consumption.
Learning about the different file formats translates directly into using your resources efficiently.
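A tiny stdlib-only demonstration of why format choice matters: the same records serialized as JSON lines versus CSV. JSON repeats every field name on every record, so the CSV output is smaller; columnar formats like Parquet and ORC go much further by grouping values per column, compressing them and allowing readers to skip columns entirely.

```python
import csv
import io
import json

records = [{"user_id": i, "event": "click", "ms": 100 + i} for i in range(1000)]

# JSON lines: one object per row, keys repeated on every line
json_buf = "\n".join(json.dumps(r) for r in records)

# CSV: field names written once, in the header
csv_out = io.StringIO()
writer = csv.DictWriter(csv_out, fieldnames=["user_id", "event", "ms"])
writer.writeheader()
writer.writerows(records)

json_size = len(json_buf.encode())
csv_size = len(csv_out.getvalue().encode())
print(f"JSON lines: {json_size} bytes, CSV: {csv_size} bytes")
```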
5. Distributed Systems
Naturally, most of the tools are distributed. Whether it's processing, storing or even presenting data, most tools are designed to run as distributed systems, dividing the work between different nodes.
Learning about distributed systems is important to better understand the tools and maximize their performance.
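The core pattern these tools share is scatter-gather: partition the data, process each partition independently, then merge the partial results. Here is a toy local analogue using threads in place of cluster nodes; it is only a sketch of the idea, since real engines like Spark and Flink do this across machines with shuffles, fault tolerance and schedulers.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_partition(lines):
    """'Map' step: count words within a single partition."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

lines = ["a b a", "b c", "a c c"] * 100
# split the input into 4 partitions, round-robin
partitions = [lines[i::4] for i in range(4)]

# process partitions independently (threads stand in for nodes)
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(count_partition, partitions))

# 'Reduce' step: merge the partial counts into the final result
total = sum(partials, Counter())
print(total)
```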
Conclusion
So, that's data engineering for you. Data engineers build and maintain systems that collect, process and store data, making sure it is clean, reliable and easy to use for analytics and data science.
It's a mix of backend and DevOps work, which makes the required skill set wide and offers huge opportunity for growth.
For me, data engineering is the next step after backend engineering.