Available for new roles
Junior Data Engineer

NAM DUONG HUU

I build pipelines that assume failure.

Idempotency · late-arriving data · deduplication · self-healing recovery
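As a rough illustration of what those four ideas can mean in practice, here is a minimal in-memory sketch. The `Event` type, its field names, and `apply_batch` are hypothetical and chosen only for this example; a production pipeline would typically do the same thing with a keyed merge in the warehouse.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    event_id: str    # unique per logical event; used for deduplication
    key: str         # entity the event updates
    event_time: int  # when it happened at the source (epoch seconds)
    value: float


def apply_batch(state: dict, seen: set, batch: list) -> None:
    """Merge a batch into state so that replaying the same batch changes nothing."""
    for ev in batch:
        if ev.event_id in seen:
            continue  # deduplication: this event was already applied
        seen.add(ev.event_id)
        current = state.get(ev.key)
        # Late-arriving data: an older event must not overwrite a newer one.
        if current is None or ev.event_time >= current.event_time:
            state[ev.key] = ev


state, seen = {}, set()
batch = [
    Event("e1", "user-1", 100, 10.0),
    Event("e2", "user-1", 200, 20.0),
    Event("e3", "user-1", 150, 15.0),  # arrives late, older than e2
    Event("e2", "user-1", 200, 20.0),  # duplicate delivery
]

apply_batch(state, seen, batch)
first_run = dict(state)

# Recovery after a failure often means re-running the whole batch.
# Because the merge is idempotent, the retry leaves the state unchanged.
apply_batch(state, seen, batch)
```

The retry is safe because every decision depends only on the event's own id and timestamp, never on how many times the batch has run.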

120+
Pipelines shipped
8PB+
Data processed
99.9%
Avg. uptime
01 / Stack

Tech Stack

The tools I use to move data reliably, at scale.

Languages
Python SQL Scala Bash
Ingestion & Streaming
Apache Kafka Flink Kinesis Debezium
Processing & Transform
Apache Spark dbt Databricks Pandas
Orchestration
Apache Airflow Dagster Prefect
Storage & Warehouse
Snowflake BigQuery Delta Lake S3
Cloud & Infra
AWS Terraform Docker Kubernetes
02 / Architecture

Anatomy of a Pipeline

How raw events become decision-ready data — watch it flow through every stage.

Stages: Sources → Ingestion → Processing → Lakehouse → Serving

Sources
  • App DB: PostgreSQL
  • Events: Kafka topics
  • SaaS APIs: REST · GraphQL
  • Files: S3 · CSV · JSON
  • Clickstream: Web · Mobile

Ingestion
  • CDC: Debezium
  • Stream Ingest: Kafka Connect
  • Batch Load: Airbyte

Processing
  • Streaming: Flink, < 300ms
  • Batch: Spark, ETL · ML prep

Lakehouse (Snowflake + Delta Lake)
  • Bronze: Raw / Landing, immutable delta tables
  • Silver: Cleansed · dbt, conformed · tested
  • Gold: Marts, aggregates · metrics

Serving
  • BI Dashboards: Looker · Tableau
  • ML Platform: Features · Models
  • Data API: Reverse ETL

Across all stages
  • Apache Airflow · Orchestration
  • Observability · Data Quality · Lineage · dbt Tests
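The Bronze, Silver and Gold layers in the diagram can be sketched in a few lines of plain Python. This is only an illustration of the layering idea: the sample rows, column names, and the `to_silver` / `to_gold` helpers are assumptions made for this example, standing in for Delta tables and dbt models.

```python
from collections import defaultdict

# Bronze: raw records kept exactly as they landed (hypothetical sample data).
# A tuple is used so the landing layer is never modified in place.
bronze = (
    {"order_id": "1", "country": "vn", "amount": "120.5"},
    {"order_id": "2", "country": "VN", "amount": "80"},
    {"order_id": "3", "country": "sg", "amount": "not-a-number"},
    {"order_id": "2", "country": "VN", "amount": "80"},  # duplicate row
)


def to_silver(rows):
    """Cleanse: cast types, normalize values, drop invalid and duplicate rows."""
    seen, out = set(), []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # fails the type check, so it never reaches Silver
        if row["order_id"] in seen:
            continue  # keep the first occurrence of each order
        seen.add(row["order_id"])
        out.append({
            "order_id": row["order_id"],
            "country": row["country"].upper(),
            "amount": amount,
        })
    return out


def to_gold(rows):
    """Aggregate cleansed rows into a small mart: revenue per country."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["country"]] += row["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
```

Keeping Bronze untouched means Silver and Gold can always be rebuilt from it if a cleansing rule changes.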
03 / Work

Selected Work

A few pipelines I'm proud of.

P-01 · streaming

Real-time Fraud Detection

Sub-second fraud scoring on 1.2M events/s using Kafka, Flink and a feature store feeding online ML models.

1.2M/s
throughput
<300ms
latency
↓40%
false pos.
Kafka Flink Snowflake
P-02 · migration

Lakehouse Migration

Migrated a 2PB legacy warehouse to a Delta Lakehouse on Databricks, cutting compute cost 38% and query times in half.

2PB
migrated
↓38%
cost
2×
faster
Spark Databricks Delta Lake
P-03 · platform

Self-serve Analytics

Built a dbt + Airflow + BigQuery platform with 300+ models and automated tests, giving 200 analysts trustworthy self-serve data.

300+
dbt models
200
analysts
99%
tests pass
dbt Airflow BigQuery
P-04 · cdc

CDC Streaming Sync

Change-data-capture pipeline syncing 80+ Postgres tables to Snowflake in near real time with Debezium and Kafka Connect.
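A minimal sketch of how change events might be applied on the target side, assuming at-least-once delivery. The event shape is a simplified, flattened stand-in for a Debezium change event (`op`, `before`, `after`, plus a log sequence number); the `apply_change` helper and the in-memory `replica` are hypothetical and only illustrate the idea.

```python
def apply_change(replica: dict, applied_lsn: dict, event: dict) -> None:
    """Apply one change event to the replica, skipping stale or repeated events."""
    op = event["op"]  # "c" = create, "u" = update, "d" = delete
    row = event["before"] if op == "d" else event["after"]
    pk = row["id"]
    lsn = event["lsn"]  # position of the change in the source log

    if lsn <= applied_lsn.get(pk, -1):
        return  # already applied: redelivery must not change the replica

    if op == "d":
        replica.pop(pk, None)
    else:
        replica[pk] = row
    applied_lsn[pk] = lsn


events = [
    {"op": "c", "before": None, "after": {"id": 1, "name": "a"}, "lsn": 1},
    {"op": "u", "before": {"id": 1, "name": "a"}, "after": {"id": 1, "name": "b"}, "lsn": 2},
    {"op": "c", "before": None, "after": {"id": 2, "name": "x"}, "lsn": 3},
    {"op": "u", "before": {"id": 1, "name": "a"}, "after": {"id": 1, "name": "b"}, "lsn": 2},  # redelivered
    {"op": "d", "before": {"id": 2, "name": "x"}, "after": None, "lsn": 4},
]

replica, applied_lsn = {}, {}
for ev in events:
    apply_change(replica, applied_lsn, ev)
```

Tracking the last applied position per key is one simple way to make replays harmless, which matters for a pipeline that reports zero data loss.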

80+
tables
~5s
lag
0
data loss
Debezium Kafka Connect Postgres
04 / Experience

Experience

Senior Data Engineer · FinScale

2023 – Present
  • Lead the real-time data platform powering fraud detection across 1.2M events/s.
  • Introduced observability and data contracts, cutting pipeline incidents by 70%.

Data Engineer · Tiki

2020 – 2023
  • Built a self-serve analytics platform with dbt and Airflow for 200+ analysts.
  • Led a 2PB lakehouse migration that reduced compute cost by 38%.

Data Engineer · VNG

2018 – 2020
  • Developed batch ETL pipelines on Spark for player-behavior data.
  • Automated data-quality checks, improving the reliability of reporting.
05 / Credentials

Certifications

AWS Certified Data Engineer
Amazon Web Services · 2024
Databricks Data Engineer Pro
Databricks · 2024
GCP Professional Data Engineer
Google Cloud · 2023
SnowPro Advanced: Data Engineer
Snowflake · 2023
Get in touch

Let's build something that scales.

Open to senior data engineering and platform roles.

duongnamhaui@gmail.com GitHub LinkedIn