TimescaleDB: How to downsample data
Use Case
A number of sensors continuously generate data points at regular intervals (every 2 seconds). Each reading includes a humidity value and a temperature value.
CREATE TABLE IF NOT EXISTS dht_readings (
    time TIMESTAMPTZ NOT NULL,
    device_id TEXT,
    temperature FLOAT NOT NULL,
    humidity FLOAT NOT NULL
);

SELECT create_hypertable('dht_readings', 'time', if_not_exists => TRUE);
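If you want to experiment along with this post, you can seed a few readings by hand (the device IDs and values here are made up):

-- Hypothetical sample readings for two devices.
INSERT INTO dht_readings (time, device_id, temperature, humidity) VALUES
    (now() - INTERVAL '4 seconds', 'dht-01', 21.3, 48.2),
    (now() - INTERVAL '2 seconds', 'dht-01', 21.4, 48.0),
    (now(), 'dht-01', 21.4, 47.9),
    (now(), 'dht-02', 19.8, 55.1);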
What Is Downsampling
Downsampling reduces the granularity of a data set, turning a fine-grained time series into a coarser one. As an example, if our sensors compute and send a reading every 2 seconds, we can downsample to 5-minute, 1-hour, and 1-day averages.
In our example each sensor generates (60 / 2) samples per minute * 60 minutes * 24 hours = 43,200 samples per day per device. When we downsample to 1-hour buckets we keep only 24 samples per day per device.
We are trading data resolution for compactness, increasing the span of time we can cover with the same amount of storage.
To do this we use PostgreSQL aggregate functions to compute the rollups, continuous aggregates to keep those rollups up to date and materialize the results, and data retention policies to remove the raw data after a period of time.
Which PostgreSQL aggregate functions fit our use case best? For sensor readings like these, AVG, MIN, and MAX are the natural choices, and they are what the views below use.
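Before materializing anything, it helps to see the rollup as a plain query. A minimal sketch using time_bucket and these aggregates (the interval and time range are arbitrary examples):

-- Ad-hoc, non-materialized downsampling of the last day into 5-minute buckets.
SELECT
    time_bucket('5 minutes', time) AS bucket,
    device_id,
    AVG(temperature) AS temperature_avg,
    MIN(temperature) AS temperature_min,
    MAX(temperature) AS temperature_max
FROM dht_readings
WHERE time > now() - INTERVAL '1 day'
GROUP BY bucket, device_id
ORDER BY bucket, device_id;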
Process
Continuous Aggregates
In this step we create a view that rolls up averages (plus minimums and maximums) of our values over a time bucket coarser than the original sampling rate.
CREATE VIEW dht_readings_5m
WITH (
    timescaledb.continuous,
    timescaledb.ignore_invalidation_older_than = '5d',
    timescaledb.refresh_lag = '-30m',
    timescaledb.refresh_interval = '5m'
)
AS SELECT
    time_bucket('5m', time) AS bucket,
    device_id,
    AVG(humidity) AS humidity_avg,
    MAX(humidity) AS humidity_max,
    MIN(humidity) AS humidity_min,
    AVG(temperature) AS temperature_avg,
    MAX(temperature) AS temperature_max,
    MIN(temperature) AS temperature_min
FROM dht_readings
GROUP BY bucket, device_id;
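The view refreshes on the schedule set by refresh_interval. If you do not want to wait, the TimescaleDB 1.x API used throughout this post also lets you refresh a continuous aggregate by hand:

-- Manually materialize any pending data for the 5-minute aggregate.
REFRESH MATERIALIZED VIEW dht_readings_5m;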
You can find the time_bucket documentation, with a set of sample queries that give good insight into how to use it, in the TimescaleDB API docs.
We should create matching aggregates for our coarser buckets of 1 hour and 1 day, as sketched below.
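A minimal sketch of the 1-hour aggregate; the 1-day view (dht_readings_1d) is identical apart from the bucket width, the view name, and refresh settings scaled to the bucket size:

CREATE VIEW dht_readings_1h
WITH (
    timescaledb.continuous,
    timescaledb.ignore_invalidation_older_than = '5d',
    timescaledb.refresh_interval = '1h'
)
AS SELECT
    time_bucket('1h', time) AS bucket,
    device_id,
    AVG(humidity) AS humidity_avg,
    MAX(humidity) AS humidity_max,
    MIN(humidity) AS humidity_min,
    AVG(temperature) AS temperature_avg,
    MAX(temperature) AS temperature_max,
    MIN(temperature) AS temperature_min
FROM dht_readings
GROUP BY bucket, device_id;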
Data Retention Policy
To reclaim the space taken by the raw data set, we create a retention policy that deletes the fine-grained sensor data once it has been rolled up.
SELECT add_drop_chunks_policy('dht_readings', INTERVAL '5 days', cascade_to_materializations => FALSE);
We are dropping dht_readings data older than 5 days but we keep the aggregated views dht_readings_5m, dht_readings_1h, and dht_readings_1d: cascade_to_materializations => FALSE ensures that dropping raw chunks does not also drop the materialized aggregates.
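To preview which chunks the policy would remove, you can list the chunks older than the cutoff:

-- Lists the chunks of dht_readings that fall entirely before the 5-day cutoff.
SELECT show_chunks('dht_readings', older_than => INTERVAL '5 days');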
You can get some information regarding the continuous aggregation pipeline by issuing the following query:
SELECT * FROM timescaledb_information.continuous_aggregate_stats;
view_name | completed_threshold | invalidation_threshold | job_id | last_run_started_at | last_successful_finish | last_run_status | job_status | last_run_duration | next_scheduled_run | total_runs | total_successes | total_failures | total_crashes
---|---|---|---|---|---|---|---|---|---|---|---|---|---
dht_readings_5m | 2020-07-23 01:40:00+00 | 2020-07-23 01:40:00+00 | 1002 | 2020-07-23 01:36:53.076732+00 | 2020-07-23 01:36:53.158439+00 | Success | Scheduled | 00:00:00.081707 | 2020-07-23 01:41:53.158439+00 | 1339 | 1339 | 0 | 0
dht_readings_1h | 2020-07-23 01:00:00+00 | 2020-07-23 01:40:00+00 | 1003 | 2020-07-23 01:29:37.550259+00 | 2020-07-23 01:29:37.674958+00 | Success | Scheduled | 00:00:00.124699 | 2020-07-23 02:29:37.674958+00 | 15 | 15 | 0 | 0
dht_readings_1d | 2020-07-23 00:00:00+00 | 2020-07-23 01:40:00+00 | 1004 | 2020-07-23 00:29:47.522506+00 | 2020-07-23 00:29:48.247258+00 | Success | Scheduled | 00:00:00.724752 | 2020-07-24 00:29:48.247258+00 | 1 | 1 | 0 | 0
Querying
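With the aggregates in place, reports and dashboards query the views instead of the raw hypertable. A minimal sketch against the dht_readings_1h view defined above:

-- Hourly averages per device over the past week.
SELECT bucket, device_id, temperature_avg, humidity_avg
FROM dht_readings_1h
WHERE bucket > now() - INTERVAL '7 days'
ORDER BY bucket, device_id;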
Data Backfill
See time_bucket_gapfill, which buckets data like time_bucket but also emits rows for empty buckets, so gaps can be filled with functions such as locf() or interpolate().
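A sketch of gap filling over the raw readings, carrying the last observed average forward across empty buckets (the interval and time range are arbitrary examples):

-- 5-minute buckets over the last day; locf() repeats the previous
-- bucket's average wherever a bucket has no readings.
SELECT
    time_bucket_gapfill('5m', time, now() - INTERVAL '1 day', now()) AS bucket,
    device_id,
    locf(AVG(temperature)) AS temperature_avg
FROM dht_readings
WHERE time > now() - INTERVAL '1 day'
  AND time < now()
GROUP BY bucket, device_id;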
Notes
Resources:
- How to proactively manage long-term data storage with downsampling: Published on January 23, 2020.
- Downsampling Timescale Data with Continuous Aggregations: Published on May 07, 2019.