Chapter · Sep 2024 – Jan 2025

Data Engineer
Nielsen

Bangalore, India

Television measurement is a strange, beautiful problem. Millions of viewing events arrive every day, each one a small whisper that must be aggregated, weighted, and turned into the ratings that decide what gets made next season. My brief at Nielsen was to keep that whisper coherent, on time, and within budget.

The pipeline ran on Databricks, fed by Kafka, scheduled by Airflow, and backed by Redshift, S3, and EC2. When I joined, throughput sat stubbornly around 1.6 million records a day, twenty percent below target. The fix wasn't a bigger cluster; it was better mechanical sympathy. Broadcast joins where the smaller side genuinely was small, key salting for the few skewed publishers that were dragging entire stages, and a careful pass of multi-threading on the I/O-bound steps.
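The salting trick is worth spelling out, because it looks odd the first time you see it. A minimal sketch of the idea in plain Python, with hypothetical publisher IDs and a salt count chosen for illustration: a hot key gets a random suffix so its records spread across several partitions, and a second pass folds the partial counts back together.

```python
import random
from collections import defaultdict

N_SALTS = 8  # illustrative salt count; in practice, tune to observed skew

def salted_key(publisher_id, skewed_ids, n_salts=N_SALTS):
    """Append a random salt to keys known to be skewed, so one hot
    publisher no longer lands entirely in a single partition."""
    if publisher_id in skewed_ids:
        return (publisher_id, random.randrange(n_salts))
    return (publisher_id, 0)

# Simulate a skewed stream: one publisher dominates the traffic.
events = ["pub_hot"] * 8000 + ["pub_a"] * 100 + ["pub_b"] * 100

# First pass: aggregate per salted key (this is the stage that used
# to stall on the single hot partition).
partitions = defaultdict(int)
for pid in events:
    partitions[salted_key(pid, {"pub_hot"})] += 1

# Second pass: drop the salt and sum the partial counts back together.
totals = defaultdict(int)
for (pid, _salt), count in partitions.items():
    totals[pid] += count
```

The hot key now occupies up to `N_SALTS` partitions instead of one, at the cost of an extra aggregation step; in Spark the same two-phase shape applies, just expressed over DataFrames.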

Throughput moved past two million records a day within the same budget envelope, and the end-to-end ETL got thirty percent faster after a sweep of predicate pushdown, materialised Parquet snapshots, and a quietly satisfying cluster rightsizing exercise. None of it was glamorous; it was the kind of work that makes a downstream analyst's morning dashboard load before their coffee does.
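Why predicate pushdown pays off is easy to see once you remember that Parquet stores min/max statistics per row group. A toy sketch, with invented row-group metadata: a statistics-aware reader skips any group whose value range cannot match the filter, so most of the file is never decoded at all.

```python
# Hypothetical row-group metadata for a date-partitioned viewing table;
# real Parquet footers carry exactly this kind of min/max statistic.
row_groups = [
    {"min_date": "2024-09-01", "max_date": "2024-09-30", "rows": 1_000_000},
    {"min_date": "2024-10-01", "max_date": "2024-10-31", "rows": 1_100_000},
    {"min_date": "2024-11-01", "max_date": "2024-11-30", "rows": 950_000},
]

def rows_scanned(groups, lo, hi):
    """Rows a statistics-aware reader actually decodes for a
    lo <= date <= hi filter: groups whose [min, max] range cannot
    overlap the filter are skipped entirely."""
    return sum(
        g["rows"]
        for g in groups
        if not (g["max_date"] < lo or g["min_date"] > hi)
    )

full_scan = sum(g["rows"] for g in row_groups)          # no pushdown
pruned = rows_scanned(row_groups, "2024-11-01", "2024-11-30")
```

Here a one-month filter touches fewer than a third of the rows the naive scan would decode; the materialised snapshots compound the effect by keeping those statistics tight.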