Catching silent data loss using BigQuery's native ML and Write API
Your Pipeline Succeeded. Your Data Didn't.

How we catch silent data loss across hundreds of BigQuery tables using only features you already have. (Data volume anomaly detection)

Robert Sahlin · Mar 11, 2026

Every data platform team eventually faces the same question: "Is our data actually arriving?" Not whether the pipeline succeeded, your orchestrator already tells you that, but whether the right amount of data landed. A pipeline can succeed while writing zero rows. An upstream system can silently halve its output.

The uncomfortable truth is that missing 20% of your data is far harder to spot than missing all of it. A complete outage is obvious: dashboards go blank, stakeholders notice, someone pages you. But a partial drop? The dashboards still render. The numbers look plausible. The pipeline shows a green checkmark. By the time anyone notices, you're days into serving incorrect data to the business.

At Nordnet, we ingest data from our microservice landscape into BigQuery using over 200 Pub/Sub BigQuery subscriptions, which use the Storage Write API under the hood. We needed a way to automatically detect when a table's ingestion volume deviates significantly from its expected pattern, without per-table rules, without training custom ML models, and without introducing external monitoring infrastructure.

This post describes the solution we built: a single dbt model that monitors hundreds of BigQuery tables for volume anomalies using only BigQuery-native capabilities. No external services. No custom model training. No additional infrastructure. If you use BigQuery and the Storage Write API, you already have access to everything described here.

In a follow-up post, I'll show what the detected anomalies look like in practice and walk through the dashboard we built on top of this model. The full code is available as a reference implementation.
The data source you already have

Every BigQuery project that uses the Storage Write API automatically populates an INFORMATION_SCHEMA view called WRITE_API_TIMELINE_BY_PROJECT (or WRITE_API_TIMELINE_BY_FOLDER if you organize by folder). This view records metadata about every write operation: which table was written to, when, how many rows, whether the operation succeeded or failed, and more.

Most teams never look at this view. It sits there, quietly accumulating a detailed log of every row written to every table in your project. This is the foundation of our anomaly detection system. No additional instrumentation is needed; BigQuery is already tracking the information we need.

We start by materializing this INFORMATION_SCHEMA data into a dbt incremental model (storageWriteApiTimeline). This gives us a stable, partitioned table containing the write history across all our BigQuery projects, refreshed hourly. The raw INFORMATION_SCHEMA view has retention limits and can be slow to query at scale, so materializing it into our own table gives us faster access and longer history.

```sql
-- Simplified: the timeline model unions WRITE_API_TIMELINE across projects
SELECT
  project_id,
  dataset_id,
  table_id,
  start_timestamp,
  total_rows,
  error_code
FROM `your-project.region-xx.INFORMATION_SCHEMA.WRITE_API_TIMELINE_BY_PROJECT`
```

This is the primary data source, the volume signal. The model also reads table labels (for per-table configuration) and a seed file (for overrides), but the write timeline is the only input that matters for detection. Everything else is configuration.

Design principles

Before diving into the implementation, it's worth laying out the principles that shaped the solution:

1. Zero-configuration by default, tunable when needed

The system should work out of the box for any table with sufficient history.
But when a table needs special treatment (a wider observation window, a different sensitivity threshold, a different alerting strategy), you should be able to configure that without touching the SQL.

2. Use the platform, not external tools

BigQuery has a built-in anomaly detection function (AI.DETECT_ANOMALIES) powered by Google's TimesFM foundation model. Rather than exporting data to an external system, training a model, and importing predictions back, we use the ML capability that's already available in our SQL engine.

3. Minimize false positives aggressively

A monitoring system that cries wolf trains people to ignore alerts. We'd rather miss an occasional true anomaly than generate noise. Every design decision around thresholds, holiday handling, significance filters, and consecutive-window requirements exists to suppress false positives.

4. Only detect drops, not spikes

We deliberately ignore volume increases. A spike is usually a backfill, a re-import, or a sudden external factor driving product usage (e.g. tariff announcements, market events, campaign launches). These are legitimate and don't represent data quality issues. Missing data is the problem we're solving: the rows that should have arrived but didn't. Every part of the system (the lower bound extraction, the significance filters, the severity logic) is oriented around detecting less data than expected, not more.

5. Produce a severity spectrum, not a binary flag

Operators need to know whether something is probably fine (WARNING), needs investigation (ERROR), or almost certainly broken (CRITICAL). A single "is_anomaly" boolean doesn't give them that context.

The architecture in layers

The solution is structured as a pipeline of CTEs (common table expressions) within a single dbt model. Each CTE handles one concern, and each maps to one of the sections below. Let's walk through them.
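As a rough picture, the layered structure can be thought of as one query of chained CTEs. This is only an illustrative skeleton with the bodies elided; the CTE names for layers 5 and 6 are assumptions, not taken from the actual model:

```sql
-- Illustrative skeleton only: names follow the layers described below,
-- bodies are elided (the real SQL is in the reference implementation).
WITH
TableProfiles   AS (SELECT ...),  -- Layer 1: profiling and automatic eligibility
WindowSpine     AS (SELECT ...),  -- Layer 2: one row per window per table
ConfigMerged    AS (SELECT ...),  -- Layer 3: defaults, labels, seed overrides
Results_TimesFM AS (SELECT ...),  -- Layer 4: AI.DETECT_ANOMALIES output
Suppressed      AS (SELECT ...),  -- Layer 5: false positive suppression (name assumed)
Classified      AS (SELECT ...)   -- Layer 6: severity matrix (name assumed)
SELECT * FROM Classified
```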
Layer 1: Table profiling and automatic eligibility

Not every table is a good candidate for anomaly detection. A table that was created last week doesn't have enough history. A table that receives one batch load per month has too few data points for statistical analysis. The first layer profiles every table automatically:

```sql
TableProfiles AS (
  SELECT
    table_fqn,
    MAX(real_volume) AS peak_volume,
    TIMESTAMP_DIFF(MAX(event_timestamp), MIN(event_timestamp), DAY) AS history_days,
    SAFE_DIVIDE(STDDEV(real_volume), AVG(real_volume)) AS cv_score,
    MAX(TIMESTAMP_DIFF(event_timestamp, prev_ts, HOUR)) AS max_gap_hours,
    ROUND(
      SAFE_DIVIDE(
        COUNT(*),
        TIMESTAMP_DIFF(MAX(event_timestamp), MIN(event_timestamp), HOUR)
      ),
      3
    ) AS liveness_score
    ...
  HAVING history_days >= 21
)
```

Three metrics drive the profiling:

Coefficient of variation (cv_score): How noisy is this table's volume? A table with cv_score = 0.3 has low variance; its hourly volumes are fairly consistent. A table with cv_score = 13.5 is wildly variable. Both are valid candidates for anomaly detection, but they may need different window sizes and thresholds.

Liveness score: The ratio of hours where data actually arrived to total hours observed. A liveness of 0.99 means the table is written to almost every hour, likely a streaming or near-real-time pipeline. A liveness of 0.01 means it's an infrequent batch job. This informs the choice of observation window.

Max gap hours: The longest observed gap between writes. A table with a 72-hour gap has weekends where nothing arrives. This is normal behavior that shouldn't trigger alerts.

Tables must have at least 21 days of history to be eligible. This ensures the model has enough training data to learn the table's patterns.

Layer 2: Configurable observation windows

Different tables need different observation windows. A high-frequency streaming table produces enough signal in a 1-hour window.
A batch job that runs a few times per day might need a 6-hour or even 24-hour window to separate signal from noise. The solution supports arbitrary window sizes that divide evenly into 24 hours (1, 2, 3, 4, 6, 8, 12, 24).

The window spine generates one time slot per window per table:

```sql
WindowSpine AS (
  SELECT DISTINCT
    P.table_fqn,
    P.window_hours,
    -- Map each hour to its containing window
    TIMESTAMP_TRUNC(
      TIMESTAMP_ADD(
        TIMESTAMP_TRUNC(TS.hour_timestamp, DAY),
        INTERVAL (DIV(EXTRACT(HOUR FROM ...), P.window_hours) * P.window_hours) HOUR
      ),
      HOUR
    ) AS window_start,
    ...
  FROM TableProfiles AS P
  CROSS JOIN UNNEST(GENERATE_TIMESTAMP_ARRAY(...)) AS TS
)
```

The key insight here is the SELECT DISTINCT. We generate an hourly timestamp array and then map each hour to its containing window using integer division. Multiple hours map to the same window, so DISTINCT collapses them into one row per window. This gives us a clean spine: one row per window per table, regardless of whether any data arrived during that window.

Hourly volumes are then aggregated into these windows with a LEFT JOIN, producing actual_rows = 0 for windows where nothing was written. This is critical: the absence of data is itself a signal, and the system must not silently drop empty windows.

Layer 3: Three-tier configuration hierarchy

How does the system know which window size or threshold to use for a given table? Through a three-tier configuration hierarchy:

1. Smart defaults. Every table starts with sensible defaults: a 1-hour window, a 0.99 anomaly threshold (very conservative), and consecutive-window confirmation required. No configuration needed.

2. BigQuery labels. Table owners can set labels directly on their BigQuery tables: anomaly-detection, anomaly-window, anomaly-threshold, anomaly-require-consecutive.
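As an illustration, a table owner could attach such a label with DDL. This is a hedged sketch: the label keys come from the post, but the table name and values are assumptions:

```sql
-- Hypothetical example: opt this table into a 6-hour observation window.
-- Note that BigQuery label values may only contain lowercase letters,
-- digits, underscores, and dashes, so a threshold like 0.99 would need
-- an encoding the model parses (an assumption, not shown in the post).
ALTER TABLE `my-project.my_dataset.my_table`
SET OPTIONS (
  labels = [('anomaly-detection', 'enabled'), ('anomaly-window', '6')]
);
```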
These labels are read automatically via a model that queries the organization's Cloud Asset Export, a BigQuery dataset that Google Cloud can populate with metadata about all resources across your GCP organization. Alternatively, you could use INFORMATION_SCHEMA.TABLE_OPTIONS per project if you don't have Cloud Asset Export set up. This is the self-service path: teams configure their own tables through Terraform or the BigQuery console without needing to know about the monitoring system's internals.

3. Seed overrides. For edge cases where labels can't be applied (external tables, tables not managed in Terraform, temporary testing), a CSV seed file provides the highest-priority overrides. This is the escape hatch for the platform team.

The configuration merge is a straightforward COALESCE:

```sql
ConfigMerged AS (
  SELECT
    COALESCE(SO.override_window_hours, TLC.label_window_hours, 1) AS window_hours,
    COALESCE(SO.override_threshold, TLC.label_threshold, 0.99) AS custom_threshold,
    ...
  FROM TableLabelConfig AS TLC
  FULL OUTER JOIN SeedOverrides AS SO
    ON TLC.table_fqn = SO.table_fqn
)
```

This design means the system works on day one with zero configuration, and teams gradually tune individual tables as they learn which ones need attention. A few examples from our seed overrides illustrate the range.

Not every table is a good fit for anomaly detection. Some tables have inherently random ingestion patterns: event-driven data that spikes and dips unpredictably based on user behavior or external factors. Rather than tolerating noisy alerts, you can set detection_mode to disabled via either a BigQuery label or the seed file, and the system will skip them entirely.

Layer 4: TimesFM anomaly detection

With the windowed, aggregated volumes in hand, we feed them into BigQuery's built-in AI.DETECT_ANOMALIES function using Google's TimesFM 2.5 foundation model.
```sql
Results_TimesFM AS (
  SELECT
    table_fqn,
    time_series_timestamp AS event_timestamp,
    is_anomaly,
    anomaly_probability,
    lower_bound
  FROM AI.DETECT_ANOMALIES(
    (SELECT * FROM AnomalyInput WHERE event_timestamp < start_date),   -- training
    (SELECT * FROM AnomalyInput WHERE event_timestamp >= start_date),  -- detection
    id_cols => ['table_fqn'],
    data_col => 'log_volume',
    timestamp_col => 'event_timestamp',
    model => 'TimesFM 2.5',
    anomaly_prob_threshold => 0.95
  )
)
```

A few important choices here:

Log-scale volumes. We feed LOG10(raw_volume + 100) rather than raw volumes. This compresses the dynamic range (volumes can span from 0 to millions) and makes the model less sensitive to absolute magnitude. The + 100 offset prevents log(0) issues and provides a baseline so that small tables with volumes of 10-50 rows aren't dominated by noise.

Separation of training and detection data. AI.DETECT_ANOMALIES takes two subqueries: one for training (historical patterns) and one for detection (recent windows to evaluate). We use a 28-day training window. The function learns each table's hourly/daily patterns from this history and then evaluates the recent windows against those learned patterns.

Per-table time series via id_cols. The id_cols => ['table_fqn'] parameter tells the function to treat each table as an independent time series. One function call handles all tables simultaneously; no need to loop or call the function per table.

Lower bound extraction. Besides the binary anomaly flag and probability, the function returns a lower_bound: the minimum expected volume in log space. We convert this back to linear space (POW(10, lower_bound) - 100) to produce a human-readable expected_min that operators can compare against actual_rows in dashboards and alerts.

Layer 5: False positive suppression

The raw output from TimesFM is a starting point, not the final answer. Several filters reduce false positives:

Holiday suppression.
Public holidays in Nordic countries (SE, NO, DK, FI) and the US cause legitimate volume drops that shouldn't trigger alerts. We also suppress alerts on the afternoon before holidays (13:00 onward), when business activity typically winds down:

```sql
CASE
  WHEN is_holiday THEN FALSE
  WHEN is_day_before_holiday AND EXTRACT(HOUR FROM event_timestamp) >= 13 THEN FALSE
  ...
END AS is_bad_window
```

Absolute significance filter. If the gap between expected and actual volume is less than 0.1% of the table's peak volume, it's noise regardless of what the probability says. A table that normally writes 10 million rows per hour dropping by 5,000 rows is not operationally significant:

```sql
WHEN (expected_min - actual_rows) < (peak_volume * 0.001) THEN FALSE
```

Relative significance filter. Similarly, if the relative deviation is less than 25%, it's not worth alerting on. This catches cases where the model is highly confident about a small deviation:

```sql
WHEN SAFE_DIVIDE((expected_min - actual_rows), expected_min) < 0.25 THEN FALSE
```

Per-table threshold. The TimesFM function uses a global anomaly_prob_threshold of 0.95, but each table has its own threshold (defaulting to 0.99). The per-table threshold is applied after the model runs, so tables with high variance get a more conservative threshold while stable tables can use the default.

Layer 6: Severity matrix

After all filtering, each window is classified into a severity level. The PENDING state is key to the consecutive-window strategy. For small observation windows (1-4 hours), a single anomalous window might just be a brief dip in traffic. Requiring two consecutive anomalous windows before escalating to WARNING/ERROR/CRITICAL dramatically reduces false positives for these high-frequency time series. For larger windows (6+ hours), a single anomalous window is already significant, so consecutive confirmation is disabled by default.

The PENDING rows are filtered from the final output; they're an intermediate state, not an alert.
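The severity logic might be sketched roughly as follows. The post does not show the actual matrix, so the boundaries and the prev_window_is_bad / relative_drop column names here are illustrative assumptions:

```sql
-- Hedged sketch of a severity classification, not the actual matrix.
CASE
  WHEN NOT is_bad_window THEN NULL                                  -- no anomaly
  WHEN require_consecutive AND NOT prev_window_is_bad THEN 'PENDING'-- await confirmation
  WHEN actual_rows = 0 THEN 'CRITICAL'                              -- total silence
  WHEN relative_drop >= 0.75 THEN 'ERROR'                           -- large drop
  ELSE 'WARNING'                                                    -- probably fine, flagged
END AS severity
```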
If the next window is also anomalous, the system will escalate. If it recovers, the moment passes without noise.

Why a single dbt model?

This is a reasonable question. Why implement all of this in one SQL model rather than splitting it into multiple models?

Atomic evaluation. Anomaly detection requires comparing a table's current window against its historical pattern. If you split the pipeline across models, you risk the history and the current evaluation running against different snapshots of the underlying data. A single model ensures everything evaluates atomically.

AI.DETECT_ANOMALIES requires co-located data. The BigQuery ML function needs both training and detection data in the same query. You can't materialize the training data in one model and reference it from another model's ML function call; the function takes subqueries, not table references.

Incremental efficiency. The model runs hourly as an incremental merge. Each run processes only the last 2 days of data (with a 28-day lookback for training context) and merges the results into the existing table. The hourly cost is negligible because INFORMATION_SCHEMA views are small and AI.DETECT_ANOMALIES scales with the number of time series points, not the number of tables.

The "dense grid" problem

One subtle but important design challenge deserves its own section: ensuring that windows with zero data volume appear in the output. Consider a table that normally receives data every hour. On a given day, the upstream system goes down and writes nothing for 24 hours. This is exactly the scenario we want to detect. But if we only aggregate the volume data that exists, those 24 hours are simply absent from the dataset; there are no rows to aggregate. This is the "dense grid" problem.

The solution has three parts:

1. A window spine independent of the data. The WindowSpine CTE generates one row per window per table by crossing TableProfiles with a generated timestamp array.
This spine exists regardless of whether any data arrived.

2. A LEFT JOIN for aggregation. When aggregating raw volumes into windows, we use a LEFT JOIN from the spine to the data, with COALESCE(SUM(volume), 0). Windows with no matching data get actual_rows = 0 instead of being dropped.

3. A LEFT JOIN for ML results. The AI.DETECT_ANOMALIES function might not return results for every window (if a table has too little training data, or if the function internally decides it can't make a prediction). We LEFT JOIN from the dense grid to the ML results, so windows without predictions still appear in the output with NULL anomaly probability and is_anomaly = FALSE.

Without all three of these, the system has a blind spot: it can only detect anomalies in data it can see, and zero-volume windows are invisible.

Operational considerations

Late-arriving data

The Storage Write API INFORMATION_SCHEMA views can have delayed ingestion: a write that happened at 14:00 might not appear in the view until 14:30 or later. The model handles this by looking back 2 days on each incremental run rather than just processing the current hour. This ensures late-arriving data is captured and previously evaluated windows are re-evaluated with complete information.

The full_refresh: false guard

The model is configured with full_refresh: false in its dbt YAML config. This is deliberate. A full refresh would reprocess all historical data through AI.DETECT_ANOMALIES, which is expensive and unnecessary. The incremental merge strategy means the table accumulates results over time, and each run only processes the recent detection window.

Hourly cadence

The model runs hourly. This means the fastest detection latency is approximately 1-2 hours: one hour for the current window to complete, plus the next scheduled run to evaluate it. For tables with multi-hour windows, the latency scales accordingly.
This is an acceptable tradeoff for a system that monitors hundreds of tables without per-table infrastructure.

What makes this generic

The entire solution rests on a data source that every BigQuery user has: INFORMATION_SCHEMA.WRITE_API_TIMELINE. If you write data to BigQuery via the Storage Write API, which includes BigQuery subscriptions from Pub/Sub, Dataflow jobs, and many ETL tools, you already have this data.

The solution doesn't require:

- External data monitoring services
- Custom ML model training or hosting
- Per-table configuration (though it supports it)
- Any infrastructure beyond what BigQuery and dbt provide

The only BigQuery-specific dependency is AI.DETECT_ANOMALIES with the TimesFM model, which is available in all BigQuery editions that support BQML. The holiday suppression uses an open-source BigQuery UDF (bigfunctions.is_public_holiday), but this could be replaced with any holiday calendar.

To adapt this for your own environment, you would:

- Point the timeline model at your own project's INFORMATION_SCHEMA views
- Adjust the holiday countries to match your operational geography
- Optionally configure per-table overrides for tables with unusual patterns
- Connect the output table to your alerting and incident management system

What's next

In the next post in this series, I'll show real anomalies the system has detected and walk through the dashboard we use to triage alerts. We'll look at cases where the system correctly identified upstream outages, partial data deliveries, and schema-change-induced volume shifts, and cases where we had to tune the configuration to suppress false positives.

Further ahead, I'll cover how we close the loop: how anomaly detections feed into our incident management system, how we surface data health badges with lineage context directly in our Looker dashboards, and how that changes the way teams respond to data quality issues.

This solution was built at Nordnet as part of our data platform governance layer.
The full code is available as a reference implementation.