Apr 21, 2026  ·  5 min read 

Backfills won't hurt you if you do this

A backfill that goes wrong is one of the most stressful events in data engineering: you’re touching historical data, downstream consumers may be reading it live, and a mistake can corrupt months of analytics. The engineers who backfill confidently aren’t braver — they’ve just built the right guardrails.

Three things make backfills safe: idempotency so any partition can be rerun without doubling results, explicit partition boundaries so you process one day at a time rather than opening a full-table lock, and a throttle to avoid overwhelming production systems. Add an audit log of which partitions ran and when, and a backfill becomes a repeatable, observable operation rather than a manual job you hold your breath through.

Why backfills go wrong

Two failure modes dominate.

Thundering herd. You kick off a Spark job targeting 365 partitions simultaneously. Every executor competes for the same HDFS namenode, the same Hive metastore, the same downstream Kafka topic. CPU and I/O spike; live pipelines start missing SLAs within minutes. You kill the job, but the damage to in-flight records is already done.

Double counting. You run a job that does INSERT INTO instead of INSERT OVERWRITE. The job fails at partition 47 of 365 and you restart it from the beginning. Partitions 1–46 now contain two copies of every row. Aggregates that run on top of that table are silently wrong — sometimes for weeks before anyone notices.

Both failures share a root cause: the job treats the full historical range as a single unit of work with no checkpointing and no atomic partition semantics.

Idempotent, partition-scoped writes

Make every write a full overwrite of exactly one partition. In Hive/Spark this is INSERT OVERWRITE TABLE … PARTITION (dt='2025-03-15') SELECT …. In BigQuery it is WRITE_TRUNCATE on a date-sharded or range-partitioned table. In Delta Lake it is replaceWhere.

The invariant: running the same partition twice produces the same result as running it once. That lets you retry any failed partition freely.

from pyspark.sql import SparkSession
import time
from datetime import date, timedelta

spark = SparkSession.builder.appName("backfill").getOrCreate()

START = date(2024, 1, 1)
END   = date(2025, 1, 1)          # exclusive
THROTTLE_SECONDS = 4              # pause between partitions
CHECKPOINT_TABLE = "audit.backfill_log"

def already_done(dt: str) -> bool:
    result = spark.sql(f"""
        SELECT 1 FROM {CHECKPOINT_TABLE}
        WHERE job = 'events_daily'
          AND partition_dt = '{dt}'
          AND status = 'SUCCESS'
        LIMIT 1
    """)
    return result.count() > 0

def run_partition(dt: str):
    spark.sql(f"""
        INSERT OVERWRITE TABLE prod.events_daily
        PARTITION (dt = '{dt}')
        SELECT user_id, event_type, COUNT(*) AS cnt
        FROM raw.events
        WHERE DATE(event_ts) = '{dt}'
        GROUP BY user_id, event_type
    """)
    spark.sql(f"""
        INSERT INTO {CHECKPOINT_TABLE}
        VALUES ('events_daily', '{dt}', 'SUCCESS', current_timestamp())
    """)

current = START
while current < END:
    dt_str = current.isoformat()
    if not already_done(dt_str):
        print(f"processing {dt_str}")
        run_partition(dt_str)
        time.sleep(THROTTLE_SECONDS)   # throttle
    else:
        print(f"skip {dt_str} (already done)")
    current += timedelta(days=1)

Key points in the loop:

INSERT OVERWRITE … PARTITION — atomic, idempotent.
already_done() check — skip partitions that succeeded in a prior run.
time.sleep(THROTTLE_SECONDS) — explicit pause between partitions.

Figure 1 — throttle valve isolates backfill load; each partition write is atomic and idempotent

Throttle and isolate from live traffic

A 4-second pause between partitions sounds trivial. Over 365 partitions that is 24 minutes of idle time — easily worth it to keep production query latency stable. Tune the value based on how long each partition takes: a good target is sleep ≥ 0.2 × partition_runtime. For heavier jobs, run the backfill off-peak or on a dedicated cluster that does not share a metastore connection pool with live pipelines.

Additional isolation levers:

Separate queue / resource pool. In YARN, submit backfill jobs to a low-priority queue capped at 25% of cluster capacity. In Databricks, use a separate job cluster rather than an interactive cluster shared with analysts.
Rate-limit metastore calls. If you’re hitting Hive metastore, set hive.metastore.client.connect.retry.delay and cap partition batch sizes to avoid metastore timeouts that cascade into live ETL failures.
Read from a replica. When the source is a transactional database, point the backfill at a read replica, not the primary. A full-scan over 12 months of events can generate enough I/O to push a primary into swap.

Track progress and make it resumable

The audit.backfill_log table in the code above is the simplest form of a checkpoint store. Every row records (job, partition_dt, status, completed_at). The loop reads it before processing each partition, so a restart from any point is safe.

Figure 2 — checkpoint log surfaces partition state; a restart skips DONE, resumes at RUNNING

Useful queries to run against the log:

-- how many partitions are left?
SELECT status, COUNT(*) AS n
FROM audit.backfill_log
WHERE job = 'events_daily'
GROUP BY status;

-- find any partitions that were attempted but never marked SUCCESS
SELECT partition_dt
FROM audit.backfill_log
WHERE job = 'events_daily'
  AND status != 'SUCCESS'
ORDER BY partition_dt;

Keep the log table. Once the backfill is done, it becomes evidence that every partition was processed — useful when a stakeholder asks why February numbers changed.

Closing takeaway

Backfills fail when they’re treated as one big job. They become safe when each partition is a self-contained unit: written atomically, verified against a checkpoint before running, and paced to leave room for live traffic. The pattern above handles 365 partitions in an overnight run without a single on-call page. The throttle is not a performance tax — it is the mechanism that makes the whole thing boring, which is exactly what you want.

Một backfill sai là một trong những sự kiện căng thẳng nhất trong data engineering: bạn đang chạm vào dữ liệu lịch sử, consumer hạ nguồn có thể đang đọc trực tiếp, và một sai lầm có thể làm hỏng nhiều tháng analytics. Những kỹ sư backfill tự tin không phải vì họ dũng cảm hơn — họ đã xây dựng đúng các biện pháp bảo vệ.

Ba thứ giúp backfill an toàn: idempotency để mọi phân vùng có thể chạy lại mà không nhân đôi kết quả, ranh giới phân vùng rõ ràng để xử lý từng ngày một thay vì mở khóa toàn bảng, và van điều tiết để tránh làm quá tải hệ thống production. Thêm audit log ghi lại phân vùng nào đã chạy và khi nào, backfill sẽ trở thành thao tác có thể lặp lại và quan sát được, thay vì một thao tác thủ công mà bạn nín thở chờ đợi.

Tại sao backfill hay xảy ra sự cố

Có hai kiểu thất bại phổ biến nhất.

Thundering herd. Bạn khởi động một Spark job nhắm vào 365 phân vùng cùng lúc. Mọi executor tranh nhau cùng một HDFS namenode, cùng một Hive metastore, cùng một Kafka topic hạ nguồn. CPU và I/O tăng vọt; các pipeline đang chạy trực tiếp bắt đầu miss SLA trong vài phút. Bạn tắt job, nhưng thiệt hại trên các record đang xử lý đã xảy ra.

Double counting. Bạn chạy một job dùng INSERT INTO thay vì INSERT OVERWRITE. Job thất bại ở phân vùng thứ 47 trong 365 và bạn khởi động lại từ đầu. Phân vùng 1–46 giờ chứa hai bản sao của mỗi row. Các aggregation chạy trên bảng đó bị sai một cách im lặng — đôi khi hàng tuần trước khi ai đó nhận ra.

Cả hai thất bại đều có chung nguyên nhân: job coi toàn bộ dải lịch sử là một đơn vị công việc duy nhất, không có checkpointing và không có ngữ nghĩa phân vùng nguyên tử.

Ghi dữ liệu idempotent theo phạm vi phân vùng

Mỗi lần ghi phải là overwrite đầy đủ đúng một phân vùng. Trong Hive/Spark đó là INSERT OVERWRITE TABLE … PARTITION (dt='2025-03-15') SELECT …. Trong BigQuery là WRITE_TRUNCATE trên bảng date-sharded hoặc range-partitioned. Trong Delta Lake là replaceWhere.

Nguyên tắc bất biến: chạy cùng một phân vùng hai lần cho ra kết quả giống như chạy một lần. Điều đó cho phép bạn retry bất kỳ phân vùng thất bại nào một cách tự do.

from pyspark.sql import SparkSession
import time
from datetime import date, timedelta

spark = SparkSession.builder.appName("backfill").getOrCreate()

START = date(2024, 1, 1)
END   = date(2025, 1, 1)          # exclusive
THROTTLE_SECONDS = 4              # pause between partitions
CHECKPOINT_TABLE = "audit.backfill_log"

def already_done(dt: str) -> bool:
    result = spark.sql(f"""
        SELECT 1 FROM {CHECKPOINT_TABLE}
        WHERE job = 'events_daily'
          AND partition_dt = '{dt}'
          AND status = 'SUCCESS'
        LIMIT 1
    """)
    return result.count() > 0

def run_partition(dt: str):
    spark.sql(f"""
        INSERT OVERWRITE TABLE prod.events_daily
        PARTITION (dt = '{dt}')
        SELECT user_id, event_type, COUNT(*) AS cnt
        FROM raw.events
        WHERE DATE(event_ts) = '{dt}'
        GROUP BY user_id, event_type
    """)
    spark.sql(f"""
        INSERT INTO {CHECKPOINT_TABLE}
        VALUES ('events_daily', '{dt}', 'SUCCESS', current_timestamp())
    """)

current = START
while current < END:
    dt_str = current.isoformat()
    if not already_done(dt_str):
        print(f"processing {dt_str}")
        run_partition(dt_str)
        time.sleep(THROTTLE_SECONDS)   # throttle
    else:
        print(f"skip {dt_str} (already done)")
    current += timedelta(days=1)

Các điểm chính trong vòng lặp:

INSERT OVERWRITE … PARTITION — nguyên tử, idempotent.
Kiểm tra already_done() — bỏ qua các phân vùng đã thành công trong lần chạy trước.
time.sleep(THROTTLE_SECONDS) — tạm dừng rõ ràng giữa các phân vùng.

Hình 1 — van điều tiết cô lập tải backfill; mỗi lần ghi phân vùng là nguyên tử và idempotent

Điều tiết và cô lập khỏi traffic trực tiếp

Tạm dừng 4 giây giữa các phân vùng nghe có vẻ nhỏ. Nhưng với 365 phân vùng, đó là 24 phút thời gian rảnh — hoàn toàn xứng đáng để giữ ổn định latency của query production. Điều chỉnh giá trị này dựa trên thời gian mỗi phân vùng cần: mục tiêu tốt là sleep >= 0.2 × thời_gian_phân_vùng. Với job nặng hơn, hãy chạy backfill ngoài giờ cao điểm hoặc trên cluster riêng không chia sẻ connection pool metastore với pipeline đang chạy trực tiếp.

Các đòn bẩy cô lập thêm:

Queue / resource pool riêng. Trong YARN, submit backfill job vào queue ưu tiên thấp giới hạn 25% capacity cluster. Trong Databricks, dùng job cluster riêng thay vì interactive cluster chia sẻ với analyst.
Giới hạn tốc độ gọi metastore. Nếu bạn đang dùng Hive metastore, hãy set hive.metastore.client.connect.retry.delay và giới hạn kích thước batch phân vùng để tránh timeout metastore lan sang lỗi ETL trực tiếp.
Đọc từ replica. Khi nguồn là database transactional, trỏ backfill vào read replica, không phải primary. Full-scan 12 tháng sự kiện có thể tạo đủ I/O để đẩy primary vào swap.

Theo dõi tiến độ và đảm bảo có thể tiếp tục

Bảng audit.backfill_log trong code trên là dạng đơn giản nhất của checkpoint store. Mỗi row ghi lại (job, partition_dt, status, completed_at). Vòng lặp đọc nó trước khi xử lý từng phân vùng, vì vậy khởi động lại từ bất kỳ điểm nào đều an toàn.

Hình 2 — checkpoint log hiển thị trạng thái phân vùng; khi khởi động lại sẽ bỏ qua DONE, tiếp tục từ RUNNING

Các query hữu ích để chạy trên log:

-- còn bao nhiêu phân vùng chưa xong?
SELECT status, COUNT(*) AS n
FROM audit.backfill_log
WHERE job = 'events_daily'
GROUP BY status;

-- tìm các phân vùng đã thử nhưng chưa được đánh dấu SUCCESS
SELECT partition_dt
FROM audit.backfill_log
WHERE job = 'events_daily'
  AND status != 'SUCCESS'
ORDER BY partition_dt;

Giữ lại bảng log. Sau khi backfill hoàn thành, nó trở thành bằng chứng rằng mỗi phân vùng đã được xử lý — hữu ích khi stakeholder hỏi tại sao số liệu tháng Hai thay đổi.

Kết luận

Backfill thất bại khi bị coi là một job lớn. Nó trở nên an toàn khi mỗi phân vùng là một đơn vị độc lập: ghi nguyên tử, kiểm tra checkpoint trước khi chạy, và được điều tiết để còn chỗ cho traffic trực tiếp. Pattern trên xử lý 365 phân vùng trong một lần chạy qua đêm mà không có một lần on-call nào. Van điều tiết không phải là thuế hiệu năng — đó là cơ chế làm cho toàn bộ quá trình trở nên nhàm chán, và đó chính xác là điều bạn muốn.