PingThings
This blog post explores how to maximize your efficiency in working with large datasets using `windows`, `aligned_windows`, and `values` queries.
Laurel Dunn
June 23, 2021
This tutorial offers a guide to using the PredictiveGrid to work with VERY big data sets.
When working with high-resolution time series data, seemingly simple tasks can quickly become intractable. The reason for this is that the volume of data exceeds the computational limits of most computing environments.
Here, we’ll describe three methods for querying data in PredictiveGrid. In practice none of these is “better” than another — there is a time and a place for each. This post will weigh the relative advantages of each approach.
stream.values()
stream.windows()
stream.aligned_windows()
import btrdb
import pandas as pd
import numpy as np
from btrdb.utils.timez import ns_delta, currently_as_ns
from datetime import datetime, timedelta
from matplotlib import pyplot as plt

# connect to the platform; credentials are read from the environment
db = btrdb.connect()
One definition of the term “Big Data” helps to put the problem in context:
The term “big data” refers to data that is so large, fast or complex that it’s difficult or impossible to process using traditional methods.
Let’s dig into what this means by looking at the size of the sunshine data set.
n_points = 0
for stream in db.streams_in_collection('sunshine'):
    n_points += stream.count()
print("That's a total of %.2f Billion points!" % (n_points / 1e9))
That's a total of 279.60 Billion points!
To illustrate what’s meant by BIG DATA, let’s investigate the very simple task of querying data from a single stream.
Your first thought might be to say: Give me all the data!
But what will that yield?
streams = db.streams_in_collection('sunshine/PMU1', tags={'name': 'L1MAG'})
stream = streams[0]
print('collection:\t', stream.collection)
print('stream name:\t', stream.name)
# How many points is that?
print('num points:\t', stream.count()/1e9, 'Billion')
collection: sunshine/PMU1
stream name: L1MAG
num points: 5.143168296 Billion
# What is that in gigabytes? Each point is a 64-bit timestamp plus a 64-bit value, i.e. 16 bytes.
print('%.2f Billion points is %.2f gigabytes of data!' % (stream.count() / 1e9, stream.count() * 64 * 2 / 8 / 1e9))
5.14 Billion points is 82.29 gigabytes of data!
That volume of data will almost certainly overload your local computing environment. Even if you’re working in the cloud, it would be expensive to maintain an environment with that much memory. It’s worth checking that you actually need that much data before asking for an environment large enough to hold it.
Most importantly, it’ll likely take quite a long time to get the data back to you. Waiting for data is NOT worth your time.
Below, we time each query with currently_as_ns() so the approaches can be compared; a small helper function captures the pattern.
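As a sketch (the timed name and signature here are our own, not part of btrdb), such a helper just wraps a query callable and reports its wall-clock runtime:

def timed(query, *args, **kwargs):
    """Run a query callable and print how long it took."""
    t0 = currently_as_ns()
    result = query(*args, **kwargs)
    print('Runtime: %.2f seconds' % ((currently_as_ns() - t0) / 1e9))
    return result

# e.g. statpoints = timed(stream.windows, start.time, end.time, window)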
windows
Windows queries provide statistical aggregates, or “summary statistics”, of the raw data points in a given time interval. A windows query returns a time series of StatPoint objects, which can be used to explore summary statistics of the raw values over time.
t0 = currently_as_ns()
start, _ = stream.earliest()  # first point in the stream (RawPoint, version)
end, _ = stream.latest()      # last point in the stream
window = ns_delta(days=5)     # window width, in nanoseconds
statpoints = stream.windows(start.time, end.time, window)
print('Runtime: %.2f seconds'%((currently_as_ns()-t0)/1e9))
Runtime: 1.45 seconds
The stream.windows() query scanned through 18 months of data to return StatPoint objects in the intervals specified by window. Summarizing all 18 months took well under two seconds. That’s pretty fast! Let’s visualize the results:
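Each element returned by stream.windows() is a (StatPoint, version) pair, so a minimal plotting sketch looks like this:

times = [datetime.utcfromtimestamp(sp.time / 1e9) for sp, _ in statpoints]
means = [sp.mean for sp, _ in statpoints]
mins = [sp.min for sp, _ in statpoints]
maxs = [sp.max for sp, _ in statpoints]

fig, ax = plt.subplots(figsize=(12, 4))
ax.plot(times, means, label='mean')  # 5-day mean
ax.fill_between(times, mins, maxs, alpha=0.3, label='min/max envelope')
ax.set_xlabel('time')
ax.set_ylabel(stream.name)
ax.legend()
plt.show()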
t0 = currently_as_ns()
window = ns_delta(days=1)
statpoints = stream.windows(start.time, end.time, window)
print('Runtime: %.2f seconds'%((currently_as_ns()-t0)/1e9))
Runtime: 5.99 seconds
t0 = currently_as_ns()
window = ns_delta(hours=6)
statpoints = stream.windows(start.time, end.time, window)
print('Runtime: %.2f seconds'%((currently_as_ns()-t0)/1e9))
Runtime: 26.49 seconds
That one took a while! Don’t worry, there’s a better way.
aligned_windows
Aligned windows queries return results that look very much like windows queries. The only difference is that the timestamps are adjusted to align with the time windows stored inherently in the database. Where windows queries may need to re-compute statistical aggregates over the requested time window, aligned_windows queries can leverage pre-computed values.
Let’s look at the difference in performance.
window = ns_delta(hours=6)
pw = int(np.log2(window))  # pointwidth: each window spans 2**pw nanoseconds
t0 = currently_as_ns()
statpoints = stream.aligned_windows(start.time, end.time, pointwidth=pw)
print('Runtime: %.2f seconds'%((currently_as_ns()-t0)/1e9))
Runtime: 0.04 seconds
That’s much faster! The only thing to note is that the time increment of an aligned_windows query is snapped to a power-of-two number of nanoseconds (2**pointwidth), the increments stored inherently in the database.
print('Increment requested: %i hours' % (window / 1e9 / 3600))
print('Increment used: %.2f hours' % (2**pw / 1e9 / 3600))
Increment requested: 6 hours
Increment used: 4.89 hours
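If the rounding matters for your analysis, you can choose the pointwidth whose window lands closest to the interval you want. A small sketch:

target = ns_delta(hours=1)
pw_floor = int(np.log2(target))  # largest pointwidth whose window fits inside the target
# pick whichever neighboring power of two is nearer the target interval
pw = pw_floor + 1 if (2**(pw_floor + 1) - target) < (target - 2**pw_floor) else pw_floor
print('pointwidth %i -> %.2f hours per window' % (pw, 2**pw / 1e9 / 3600))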
It is worthwhile to become familiar with aligned_windows queries because they can be much faster. If you don’t need aggregates returned in precisely 1-hour increments (for example), aligned_windows queries will let you query more data, at finer time increments, than you could using windows queries.
window = ns_delta(hours=6)
pw = int(np.log2(window))
t0 = currently_as_ns()
statpoints = stream.aligned_windows(start.time, end.time, pointwidth=pw)
print('Runtime: %.2f seconds'%((currently_as_ns()-t0)/1e9))
Runtime: 0.05 seconds
window = ns_delta(minutes=30)
pw = int(np.log2(window))
t0 = currently_as_ns()
statpoints = stream.aligned_windows(start.time, end.time, pointwidth=pw)
print('Runtime: %.2f seconds'%((currently_as_ns()-t0)/1e9))
Runtime: 1.74 seconds
window = ns_delta(minutes=1)
pw = int(np.log2(window))
t0 = currently_as_ns()
statpoints = stream.aligned_windows(start.time, end.time, pointwidth=pw)
print('Runtime: %.2f seconds'%((currently_as_ns()-t0)/1e9))
Runtime: 60.34 seconds
That last query took a while! Let’s make note of that…
dt = (end.time-start.time)/1e9/3600/24
pw = btrdb.utils.general.pointwidth(pw)
print("Note to self: Don't try to query %i days of data at %i second granularity"%(dt, pw))
Note to self: Don't try to query 561 days of data at 35 second granularity
values
Many analyses can be done using StatPoints alone, either to summarize steady-state characteristics of the data at the time scale of interest or to identify intervals that contain an “event”. Here, we’ll simply explore the point at which values queries become too slow to be practical.
window = ns_delta(minutes=1)
start_time = start.time
end_time = start_time + window
t0 = currently_as_ns()
points = stream.values(start_time, end_time)
print('Runtime: %.2f seconds'%((currently_as_ns()-t0)/1e9))
Runtime: 0.27 seconds
window = ns_delta(minutes=10)
start_time = start.time
end_time = start_time + window
t0 = currently_as_ns()
points = stream.values(start_time, end_time)
print('Runtime: %.2f seconds'%((currently_as_ns()-t0)/1e9))
Runtime: 0.23 seconds
window = ns_delta(hours=1)
start_time = start.time
end_time = start_time + window
t0 = currently_as_ns()
points = stream.values(start_time, end_time)
print('Runtime: %.2f seconds'%((currently_as_ns()-t0)/1e9))
Runtime: 1.15 seconds
window = ns_delta(hours=6)
start_time = start.time
end_time = start_time + window
t0 = currently_as_ns()
points = stream.values(start_time, end_time)
print('Runtime: %.2f seconds'%((currently_as_ns()-t0)/1e9))
Runtime: 6.86 seconds
When running values queries, be sure to check how much working memory is available in your JupyterHub instance. Bringing large amounts of data into memory can easily crash your Jupyter server! If your kernel crashes repeatedly, you may need to shut down and move to a larger instance.
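Before issuing a large values query, a rough pre-flight check helps. This sketch counts the points the query would return using a coarse aligned_windows pass and estimates their size at 16 bytes per point (a 64-bit timestamp plus a 64-bit value):

start_time = start.time
end_time = start_time + ns_delta(days=7)
# sum the per-window point counts at a coarse pointwidth (2**44 ns, ~4.9 hour windows)
n = sum(sp.count for sp, _ in stream.aligned_windows(start_time, end_time, pointwidth=44))
print('~%.1f million points, roughly %.2f GB' % (n / 1e6, n * 16 / 1e9))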
aligned_windows queries in action
Here are some examples where we use StatPoints to hone in on time intervals that are known (or likely) to be of interest for a given analytic:
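For example, here is a sketch of the pattern (the 5% sag threshold is our own assumption, not one of the original analytics): scan coarse aggregates for windows whose minimum dips well below the long-run mean, and record their time bounds for follow-up.

pw = 38  # 2**38 ns is roughly 4.6 minutes per window
statpoints = stream.aligned_windows(start.time, end.time, pointwidth=pw)
overall_mean = np.mean([sp.mean for sp, _ in statpoints])
# flag windows whose minimum dips more than 5% below the long-run mean
sags = [(sp.time, sp.time + 2**pw) for sp, _ in statpoints
        if sp.min < 0.95 * overall_mean]
print('%i candidate intervals flagged' % len(sags))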
values queries in action
Here are examples where we use values queries to examine events that warrant full-resolution data:
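For instance, continuing the sketch above, a full-resolution values query over a single flagged interval is cheap, and the raw points fit comfortably in a pandas DataFrame:

event_start, event_end = sags[0]  # first interval flagged by the sketch above
points = stream.values(event_start, event_end)
df = pd.DataFrame([(p.time, p.value) for p, _ in points],
                  columns=['time_ns', stream.name])
print(df.describe())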
Laurel Dunn
Laurel is a Civil Engineer specializing in data-driven risk assessment and decision analysis for power systems. She has engaged with utilities, regulators, and policymakers about issues like grid modernization and climate resilience. She hopes to help society realize the benefits of scientific advancements by being a facilitator for knowledge transfer among institutions and people.