#Looking to learn more about ETL Pipelines!


thin swift
#

Hey Everyone!

I'm currently exploring ETL pipeline development and am looking for guidance on best practices and resources to learn more effectively. I'm a senior undergrad going into software engineering, with a little MLOps experience from an internship. I want to learn more about ETL because I was underprepared for my last internship.

After researching a bit, I've seen several tools and frameworks like Apache Airflow, Apache NiFi, and Talend, which are commonly used for ETL processes. However, I'm unsure about the best starting points for a beginner in this field.

Here are my specific questions:

What foundational concepts should I prioritize in learning about ETL pipelines?
Can you recommend any comprehensive beginner-friendly resources (e.g., books, courses, tutorials) that cover the essentials of ETL pipeline development?
Are there any particular tools or frameworks that are recommended for beginners to start with? Why?

Any insights or guidance from your experiences would be really appreciated. Thank you in advance for your help!

ripe bison
#

If you are a beginner, learning SQL will be far more general and helpful for you than any specific tool

#

Jennifer Widom (Stanford) has several great SQL and data modeling courses on edX

solid elbow
#

Depends on whether you want to use SQL or Python for data transformation.

For ingestion, you’ll probably have to use Python
For transformations, you can use either SQL or a dataframe library (polars, PySpark, etc). Don’t use pandas since it doesn’t reliably respect data types (for example, an integer column with missing values silently becomes float by default)

For loading, either use SQL to create a new table, or use Python to write to the destination
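
To make the three stages above concrete, here is a minimal sketch using only the Python standard library. It assumes a made-up CSV of orders and an in-memory SQLite database standing in for a real source and destination; the table and column names (`raw_orders`, `customer_totals`, etc.) are hypothetical and not from this thread. Python handles ingestion, SQL handles the transformation, and the load step is a SQL `CREATE TABLE ... AS`.

```python
import csv
import io
import sqlite3

# Hypothetical raw input; in practice this would come from a file or an API.
RAW_CSV = """order_id,customer,amount
1,alice,10.50
2,bob,3.25
3,alice,7.00
"""


def extract(text):
    """Extract: parse CSV text into a list of dicts (all values are strings)."""
    return list(csv.DictReader(io.StringIO(text)))


def stage(conn, rows):
    """Stage the raw rows in a table, casting to explicit types on the way in."""
    conn.execute(
        "CREATE TABLE raw_orders (order_id INTEGER, customer TEXT, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO raw_orders VALUES (?, ?, ?)",
        [(int(r["order_id"]), r["customer"], float(r["amount"])) for r in rows],
    )


def transform_and_load(conn):
    """Transform with SQL and load the result into a new table."""
    conn.execute(
        """
        CREATE TABLE customer_totals AS
        SELECT customer,
               SUM(amount) AS total_amount,
               COUNT(*)    AS order_count
        FROM raw_orders
        GROUP BY customer
        """
    )


def run_pipeline(text):
    """Run extract, stage, transform, and load, then return the final table."""
    conn = sqlite3.connect(":memory:")
    try:
        stage(conn, extract(text))
        transform_and_load(conn)
        return conn.execute(
            "SELECT customer, total_amount, order_count "
            "FROM customer_totals ORDER BY customer"
        ).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in run_pipeline(RAW_CSV):
        print(row)
```

Swapping SQLite for a real warehouse mostly changes the connection and the load step; the shape of the pipeline stays the same.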

solid elbow
#

ETL stands for extract, transform, load. You’ll need to learn all three either way

ripe bison
#

At least to me, ETL is more about knowing how to quickly use whatever tools are at your disposal to move things around (plus observability/monitoring) than about a specific skill set
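
As one illustration of the observability/monitoring point, the sketch below wraps a pipeline step so that every run logs how many rows went in, how many came out, and how long it took. The decorator name `monitored` and the example step are hypothetical; real setups usually send these numbers to a metrics or alerting system rather than only to a log.

```python
import logging
import time
from functools import wraps

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("etl")


def monitored(step):
    """Log input rows, output rows, and duration for one pipeline step."""

    @wraps(step)
    def wrapper(rows):
        start = time.perf_counter()
        result = step(rows)
        elapsed = time.perf_counter() - start
        log.info(
            "%s: %d rows in, %d rows out, %.3fs",
            step.__name__, len(rows), len(result), elapsed,
        )
        return result

    return wrapper


@monitored
def drop_missing_amounts(rows):
    """Hypothetical transform step: keep only rows that have an amount."""
    return [r for r in rows if r.get("amount") is not None]


if __name__ == "__main__":
    drop_missing_amounts([{"amount": 1}, {"amount": None}, {}])
```

A sudden change in the rows-in versus rows-out ratio is often the first sign that an upstream source changed.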

#

You could go through an AWS Glue tutorial or an AWS cert, their tutorials are pretty okay

solid elbow
#

Fundamentals of Data Engineering is a good start. It's a book from O'Reilly

thin swift
#

Amazing, that's a good start. I'll check out the O'Reilly book. I've also been recommended that Databricks book from O'Reilly about Apache Spark, I think. Thanks so much guys!

solid elbow
#

You don't necessarily have to use Spark. polars (like pandas but faster) or SQL would do just fine,
unless you are talking about big data, then you need Spark

thin swift
#

I've never heard of polars. I do want to do stuff with big data though, so Spark could be cool to learn.

#

Dang, I feel so behind on the stuff I need to learn. There's no way to start without just diving in though

solid elbow
#

The tools are less important than knowing what you need to do for the task at hand

Big data is fine, but you might want to work on data modeling before jumping into learning the tools.
After all, you still use the same transformation logic, just with different tools
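
A small sketch of the "same transformation logic, different tools" point: the same group-and-sum written once in SQL (via the standard-library `sqlite3` module) and once in plain Python. The sales data and function names are made up for illustration. A dataframe library or Spark would express the same group-by with different syntax.

```python
import sqlite3
from collections import defaultdict

# Hypothetical input: (region, units_sold) pairs.
ROWS = [("north", 120), ("south", 80), ("north", 30)]


def totals_sql(rows):
    """Total units per region, expressed as a SQL GROUP BY."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE sales (region TEXT, units INTEGER)")
        conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
        return conn.execute(
            "SELECT region, SUM(units) FROM sales GROUP BY region ORDER BY region"
        ).fetchall()
    finally:
        conn.close()


def totals_python(rows):
    """The same logic with a dictionary accumulator instead of SQL."""
    totals = defaultdict(int)
    for region, units in rows:
        totals[region] += units
    return sorted(totals.items())


if __name__ == "__main__":
    # Both approaches should agree on the result.
    print(totals_sql(ROWS) == totals_python(ROWS))
```

The part worth learning first is the logic (what to group by, what to aggregate), since it carries over between tools.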

#

And setting up Spark is not trivial. Unless you know what you are doing, go with one of those Jupyter-based Docker images with Spark bundled in

solid elbow
#

The classic book on this would be Kimball. But tl;dr: domain knowledge

solid elbow
#

yes