Splines, piecewise polynomials segmented by discrete points, are known to be a good approximation for many real-world scenarios. Data scientists often use spline interpolation to produce smooth graphs and estimate missing values by “filling in” the space between discrete points of data.

Consider the following graph, which shows the temperature every four hours in Yosemite Park in early July.

The graph appears jagged as we have data at 4 hour intervals. Spline interpolation can be used to fill in the areas between points while producing a much smoother graph:

Here we explore how to interpolate the data with spline interpolation using just SQL. We make use of Common-Table Expressions (CTEs) and time windowing functions.

The method

Spline interpolation produces a piecewise cubic polynomial function for each pair of points. These distinct polynomial functions join smoothly at each data point. This is distinct from a rolling average, where the line rarely intersects a data point. In practice, a rolling average is good to smooth high variance data, while spline interpolation is used to fill in sparse data. This wikipedia article on spline interpolation is a good introduction to the topic, and we’ll be referring to the various equations in that article here.

Our primary task is to solve for the variables in three equations — equation 15 constrains the polynomial functions to be smooth at each data point, and equations 16 and 17 constrain the curvature at the end points to be straight lines.

As the article shows, these three equations can be represented in matrix form. Conveniently, the coefficient matrix is a tridiagonal matrix which can be solved using the Thomas Algorithm, as explained here.

The interpolation algorithm normally uses the entire dataset to determine the polynomial functions. In our implementation, however, we consider a subset of four nearby data points when determining the coefficients of each polynomial (also called windowing).

The resulting matrices for our study (n=4) are useful for explanatory purposes and produce the smooth graphs we are looking for, but you may find the need to increase the window size when using real-world data.

The source data

To begin, we first download temperature data from the National Oceanic and Atmospheric Administration (NOAA) website, cut it down to the first two days of July 2017, and insert into a table as follows:

insert into temperatures (hour, temperature) values
(1,19.5),
(2,19.5),
(3,19.7),
…
(176,14.6);

In order to simulate a rough sampling, we extract the values every four hours:

with xy as (
select
cast(hour as double precision) as x
, temperature as y
from temperatures
where
cast(hour as integer) % 4 = 0
order by
hour
limit
12
)

This is what we plot to get the graph in fig. 1 above.

with recursive
, x_series(input_x) as (
select
min(x)
from
xy
union all
select
input_x + 0.1
from
x_series
where
input_x + 0.1 < (select max(x) from x)
)
, x_coordinate as (
select
input_x
, max(x) over(order by input_x) as previous_x
from
x_series
left join
xy on abs(x_series.input_x - xy.x) < 0.001
)

Preparing the Input Variables — The 4 Points in Each Window

As mentioned, we will use a window of four data points around each data point. This is easy to do by creating columns for the (x, y) coordinate for each of the four data points by leveraging the lag and lead functions.

fullxy as (
select
x
, y
, lag(x, 1) over (order by x) as x0
, lag(y, 1) over (order by x) as y0
, x as x1
, y as y1
, lead(x, 1) over (order by x) as x2
, lead(y, 1) over (order by x) as y2
, lead(x, 2) over (order by x) as x3
, lead(y, 2) over (order by x) as y3
from
xy
order by
x
)

This produces a table like the following:

As we can see, for each point (x, y), we also duplicate the (x, y) coordinates of the point before it (x0, y0) and the two points following it (x2, y2) and (x3, y3), for a total of four data points per row.

Producing the Tridiagonal Coefficients

, tridiagonal as (
select
*
, 3*(y1-y0) /power(x1-x0, 2) as d0
, 3*((y1-y0)/(power(x1-x0, 2)) + (y2-y1)/(power(x2-x1, 2))) as d1
, 3*((y2-y1)/(power(x2-x1, 2)) + (y3-y2)/(power(x3-x2, 2))) as d2
, 3*(y3-y2) /power(x3-x2, 2) as d3
, 1 / (x1-x0) as a1
, 1 / (x2-x1) as a2
, 1 / (x3-x2) as a3
, 2 / (x1-x0) as b0
, 2 * (1/(x1-x0) + 1/(x2-x1)) as b1
, 2 * (1/(x2-x1) + 1/(x3-x2)) as b2
, 2 / (x3-x2) as b3
, 1 / (x1-x0) as c0
, 1 / (x2-x1) as c1
, 1 / (x3-x2) as c2
from fullxy
order by x)

Now, for each row, we have the necessary a, b, c, and d vectors for the tridiagonal matrix.

Finding the coefficients for the tridiagonal matrix using the Thomas Algorithm

The Thomas Algorithm can now be used to solve the tridiagonal matrix we created. The algorithm uses a series of “sweeps” across the matrix to find a solution. First there is a forward sweep, where the c prime and d prime values are calculated, and then a backward substitution, where the solution slope vector is solved. We unroll the calculation into distinct sweeps and solve in a series of CTEs behaving like nested subqueries.

forward_sweep_1 as (
select
*
, c0 / b0 as c_prime_0
, c1 / (b1 - a1 * c0 / b0) as c_prime_1
, c2 / (b2 - a2 * (c1 / (b1 - a1 * c0 / b0))) as c_prime_2
, d0 / b0 as d_prime_0
, (d1 - a1 * d0 / b0) / (b1 - a1 * c0 / b0) as d_prime_1
from
tridiagonal
)
, forward_sweep_2 as (
select
*
, (d2 - a2 * d_prime_1) / (b2 - a2 * c_prime_1) as d_prime_2
, (d3 - a3 * ((d2 - a2 * d_prime_1) / (b2 - a2 * c_prime_1)))
/ (b3 - a3 * c_prime_2) as d_prime_3
from
forward_sweep_1
)
, backwards_substitution_1 as (
select
*
, d_prime_3 as k3
, d_prime_2 - c_prime_2 *(d_prime_3) as k2
, d_prime_1 - c_prime_1 * (d_prime_2 - c_prime_2 *(d_prime_3)) as k1
from
forward_sweep_2
)
, backwards_substitution_2 as (
select
*
, d_prime_0 - c_prime_0 * k1 as k0
from
backwards_substitution_1
)

Determining the polynomial coefficients

Now, we’re ready to determine the polynomial coefficients a and b (equations 10 & 11).

, coefficients as (
select
x, y, x0, y0, x1, y1, x2, y2, k0, k1, k2, k3
k1  * (x2-x1) - (y2-y1)  as a,
-k2 * (x2-x1) + (y2-y1) as b
from
backwards_substitution_2
)

Final step: calculating the temperature at each interpolated point

Here we calculate t (equation 2) and output y values (equation 1).

, interpolated_table as (
select
*
, (input_x - x1) / (x2 - x1) as t
from
x_coordinate,
coefficients
where
abs(coefficients.x1 - x_coordinate.previous_x) < 0.001
)
, final_table as (
select
*
, (1-t) * y1 + t*y2 + t*(1-t)*(a*(1-t)+b*t) as output_y
from
interpolated_table
order by input_x
)

The following is the final select clause. output_y contains the interpolated values. We also include the original (x, y) coordinate as well as scatter_y so we can display the original data points on a scatter plot.

select
round(cast(input_x as numeric), 2) as input_x
, x
, y
, round(cast(output_y as numeric), 2) as output_y
, case
when abs(input_x - x) < 0.01 then y
else null
end as scatter_y
from
final_table
order by
x

Final considerations

And there you have it! Using only SQL, we have used spline interpolation to generate smooth graphs. For the sake of brevity, we excluded data points on the left and right most of the data set but they can be added back into your analysis. Using splines for your graphs makes time series easier to read and pushes the limits of SQL.

Happy scripting!

Tags: