Python Referencing data by interpolation - python

I have two datasets. The first has a time array of datetime.datetime objects and x, y, z coordinate arrays for those times, e.g. time[0]=datetime.datetime(2000,1,21,0,7,25), x[0]=-6.7, etc.
I'd like to calculate something from the coordinates, but that needs another parameter (Ma) which depends on time. The second dataset has another time array in the same datetime form and the parameter recorded at those times, e.g. time[0]=datetime.datetime(2000,1,1,0,3), Ma[0]=2.73.
The problem is that the time arrays of the two datasets are different (though their ranges are similar).
So I want to interpolate the parameter's value at each time of dataset 1, so that Ma[0] corresponds to index 0 of dataset 1's time array rather than dataset 2's.
How can I do that?
PS. Can I convert the times to a simpler form? datetime.datetime seems quite cumbersome.

The following is an example of how to interpolate your values. The coord_ and ma_ arrays stand in for your imported data.
The first thing the script does is build some more sensible data structures from your separate 1-dimensional arrays. The part that you're actually looking for is the call to np.interp, which is documented in the NumPy reference.
import numpy as np
import datetime
import time
# NumPy cannot interpolate between datetimes
# This function converts a datetime to a timestamp
def to_ts(dt):
    return time.mktime(dt.timetuple())
coord_dts = np.array([
datetime.datetime(2000, 1, 1, 12),
datetime.datetime(2000, 1, 2, 12),
datetime.datetime(2000, 1, 3, 12),
datetime.datetime(2000, 1, 4, 12)
])
coord_xs = np.array([3, 5, 8, 13])
coord_ys = np.array([2, 3, 5, 7])
coord_zs = np.array([1, 3, 6, 10])
ma_dts = np.array([
datetime.datetime(2000, 1, 1),
datetime.datetime(2000, 1, 2),
datetime.datetime(2000, 1, 3),
datetime.datetime(2000, 1, 4)
])
ma_vals = np.array([1, 2, 3, 4])
# Handling the data as separate arrays will be painful.
# This builds an array of dictionaries with the form:
# [ { 'time': timestamp, 'x': x coordinate, 'y': y coordinate, 'z': z coordinate }, ... ]
coords = np.array([
{ 'time': to_ts(coord_dts[idx]), 'x': coord_xs[idx], 'y': coord_ys[idx], 'z': coord_zs[idx] }
for idx, _ in enumerate(coord_dts)
])
# Build array of timestamps from ma datetimes
ma_ts = [ to_ts(dt) for dt in ma_dts ]
for coord in coords:
    print("ma interpolated value", np.interp(coord['time'], ma_ts, ma_vals))
    print("at coordinates:", coord['x'], coord['y'], coord['z'])
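On the PS about a simpler time format: one option, sketched here under the assumption that all datetimes are naive and in the same time zone, is to let NumPy convert them to datetime64 and then to plain seconds. np.interp then handles every time of dataset 1 in a single vectorized call. The helper name to_seconds and the sample values are made up for illustration.

```python
import datetime
import numpy as np

def to_seconds(dts):
    """Convert a sequence of datetime.datetime objects to float seconds since the epoch."""
    return np.array(dts, dtype="datetime64[s]").astype("int64").astype(float)

# Hypothetical sample data: coordinates sampled at noon, Ma sampled at midnight
coord_dts = [datetime.datetime(2000, 1, day, 12) for day in range(1, 5)]
ma_dts = [datetime.datetime(2000, 1, day) for day in range(1, 5)]
ma_vals = np.array([1.0, 2.0, 3.0, 4.0])

# One interpolated Ma value per time of dataset 1, indexed like dataset 1
ma_at_coords = np.interp(to_seconds(coord_dts), to_seconds(ma_dts), ma_vals)
print(ma_at_coords)
```

Note that np.interp does not extrapolate: times outside the range of the second dataset get the first or last Ma value, which is why the last entry here is simply 4.0.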

Related

perform numpy mean over matrix using labels as indicators

import numpy as np
arr = np.random.random((5, 3))
labels = [1, 1, 2, 2, 3]
arr
Out[136]:
array([[0.20349907, 0.1330621 , 0.78268978],
[0.71883378, 0.24783927, 0.35576746],
[0.17760916, 0.25003952, 0.29058267],
[0.90379712, 0.78134806, 0.49941208],
[0.08025936, 0.01712403, 0.53479622]])
labels
Out[137]: [1, 1, 2, 2, 3]
Assume I have this dataset.
I would like to use the labels as indicators and perform np.mean over the rows.
(The labels indicate the class of each row.
They could also be [0, 1, 1, 0, 4, 1, 4], so make no assumptions about them.)
So the output here would be the average over:
the 1st and 2nd rows,
the 3rd and 4th rows,
the 5th row,
in the most efficient way NumPy offers, like so:
[np.mean(arr[:2], axis=0),
np.mean(arr[2:4], axis=0),
np.mean(arr[4:], axis=0)]
Out[180]:
[array([0.46116642, 0.19045069, 0.56922862]),
array([0.54070314, 0.51569379, 0.39499737]),
array([0.08025936, 0.01712403, 0.53479622])]
(in real life scenario the matrix dimensions could be (100000, 256))
First, we sort the labels and the matrix by label:
labels = np.array(labels)
# Getting the indices of a sorted array
sorted_indices = np.argsort(labels)
# Use the indices to sort both labels and matrix
sorted_labels = labels[sorted_indices]
sorted_matrix = matrix[sorted_indices]
Then we calculate the "steps", i.e. the (from, to) pairs of indices we want to average over. We sum the rows in each group and divide by the group's row count.
# Here we get the number of rows per label to average (over sorted_matrix).
# In fact, we get the start and end indices of each label.
label_indices = np.concatenate(([0], np.where(np.diff(sorted_labels) != 0)[0] + 1, [len(sorted_labels)]))
# using add + reduceat to add all rows with regard to the label indices
group_sums = np.add.reduceat(sorted_matrix, label_indices[:-1], axis=0)
# getting count for each group using the diff in label_indices
group_counts = np.diff(label_indices)
# Calculating the mean
group_means = group_sums / group_counts[:, np.newaxis]
Example:
matrix
Out[265]:
array([[0.69524902, 0.22105336, 0.65631557, 0.54823511, 0.25248685],
[0.61675048, 0.45973729, 0.22410694, 0.71403135, 0.02391662],
[0.02559926, 0.41640708, 0.27931808, 0.29139379, 0.76402121],
[0.27166955, 0.79121862, 0.23512671, 0.32568048, 0.38712154],
[0.94519182, 0.99834516, 0.23381289, 0.40722346, 0.95857389],
[0.01685432, 0.8395658 , 0.73460083, 0.08056013, 0.02522956],
[0.27274409, 0.64602305, 0.05698037, 0.23214598, 0.75130743],
[0.65069115, 0.32383729, 0.86316629, 0.69659358, 0.26667206],
[0.91971818, 0.02011127, 0.91776206, 0.79474582, 0.39678431],
[0.94645805, 0.18057829, 0.23292538, 0.93111373, 0.44815706]])
labels
Out[266]: array([3, 3, 2, 3, 1, 0, 2, 0, 2, 5])
group_means
Out[267]:
array([[0.33377274, 0.58170155, 0.79888356, 0.38857686, 0.14595081],
[0.94519182, 0.99834516, 0.23381289, 0.40722346, 0.95857389],
[0.40602051, 0.36084713, 0.41802017, 0.43942853, 0.63737099],
[0.52788969, 0.49066976, 0.37184974, 0.52931565, 0.221175 ],
[0.94645805, 0.18057829, 0.23292538, 0.93111373, 0.44815706]])
The rows of group_means correspond, in order, to the labels in np.unique(sorted_labels):
np.unique(sorted_labels)
Out[271]: array([0, 1, 2, 3, 5])
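For convenience, the steps above can be wrapped into one function. This is only a sketch of the same sort + reduceat idea; the function name group_means_by_label and the small test matrix are made up for illustration, and it assumes at least one row.

```python
import numpy as np

def group_means_by_label(matrix, labels):
    """Return (unique_labels, means), where means[i] averages the rows labelled unique_labels[i]."""
    labels = np.asarray(labels)
    order = np.argsort(labels, kind="stable")
    sorted_labels = labels[order]
    sorted_matrix = matrix[order]
    # Start index of each run of equal labels in the sorted array
    starts = np.concatenate(([0], np.flatnonzero(np.diff(sorted_labels)) + 1))
    # Sum the rows of each run, then divide by the run length
    sums = np.add.reduceat(sorted_matrix, starts, axis=0)
    counts = np.diff(np.concatenate((starts, [len(sorted_labels)])))
    return sorted_labels[starts], sums / counts[:, np.newaxis]

arr = np.arange(15, dtype=float).reshape(5, 3)
unique_labels, means = group_means_by_label(arr, [1, 1, 2, 2, 3])
print(unique_labels)
print(means)
```

Returning the labels next to the means avoids having to remember that the output rows follow the sorted label order.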
I did not understand the labels part of your question, but there is a way to calculate the mean of each row in a matrix:
use np.mean(arr, axis=1).
If the labels are to be used, please go through the script below.
import numpy as np
arr = np.array([[1,2,3],
[4,5,6],
[7,8,9],
[1,2,3],
[4,5,6]])
labels =np.array([0, 1, 1, 0, 4])
#print(arr)
#print('LABEL IS :', labels)
#print('MEAN VALUES ARE : ',np.mean(arr[:2], axis = 1))
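order = labels.argsort()  # indices that sort the rows by label
eq_lal = labels[order]
print(eq_lal)
# Index with the sort order, not with the label values themselves
print(arr[order])
print(np.mean(arr[order], axis = 1))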

Clipping a datetime series along the y-axis

I have a list of tuples, where each tuple is a datetime and float. I wish to clip the float values so that they are all above a threshold value. For example if I have:
a = [
(datetime.datetime(2021, 11, 1, 0, 0, tzinfo=tzutc()), 100),
(datetime.datetime(2021, 11, 1, 1, 0, tzinfo=tzutc()), 9.0),
(datetime.datetime(2021, 11, 1, 2, 0, tzinfo=tzutc()), 100.0)
]
and if I want to clip at 10.0, this would give me:
b = [
(datetime.datetime(2021, 11, 1, 0, 0, tzinfo=tzutc()), 100),
(datetime.datetime(2021, 11, 1, 0, ?, tzinfo=tzutc()), 10.0),
(datetime.datetime(2021, 11, 1, 1, ?, tzinfo=tzutc()), 10.0),
(datetime.datetime(2021, 11, 1, 2, 0, tzinfo=tzutc()), 100.0)
]
So if I were to plot the a data (before clipping), I would get a V shaped graph. However, if I clip the data at 10.0 to give me the b data, and plot, I will have a \_/ shaped graph instead. There is a bit of math involved in calculating the new times so I'm hoping there is already functionality available to do this kind of thing. The datetimes are sorted in order and are unique. I can fix the data so the difference between consecutive times is equal, should that be necessary.
Apologies for not putting a full answer yesterday, my SO account is still rate-limited.
I have made a bit more complex custom dataset to showcase several values in a row being below threshold.
import pandas as pd
from datetime import datetime
from matplotlib import pyplot as plt
from scipy.interpolate import InterpolatedUnivariateSpline
df = pd.DataFrame([
(datetime(2021, 10, 31, 23, 0), 0),
(datetime(2021, 11, 1, 0, 0), 80),
(datetime(2021, 11, 1, 1, 0), 100),
(datetime(2021, 11, 1, 2, 0), 6),
(datetime(2021, 11, 1, 3, 0), 105),
(datetime(2021, 11, 1, 4, 0), 70),
(datetime(2021, 11, 1, 5, 0), 200),
(datetime(2021, 11, 1, 6, 0), 0),
(datetime(2021, 11, 1, 7, 0), 7),
(datetime(2021, 11, 1, 8, 0), 0),
(datetime(2021, 11, 1, 9, 0), 20),
(datetime(2021, 11, 1, 10, 0), 100),
(datetime(2021, 11, 1, 11, 0), 0)
], columns=['time', 'whatever'])
THRESHOLD = 10
The first thing to do here is to express index in terms of timedelta so that it behaves as any usual number we can then do all kinds of calculations with. For convenience, I am also expressing it as Series - an even better approach would be to create it as such from the get go, save the initial timestamp and reindex.
start_time = df['time'][0]
df.set_index((df['time'] - start_time).dt.total_seconds(), inplace=True)
series = df['whatever']
Then, I've tried InterpolatedUnivariateSpline from scipy:
roots = InterpolatedUnivariateSpline(df.index, series.values - THRESHOLD).roots()
threshold_crossings = pd.Series([THRESHOLD] * len(roots), index=roots)
new_series = pd.concat([series[series > THRESHOLD], threshold_crossings]).sort_index()
Let's test it out:
fig, ax = plt.subplots(figsize=(12, 8))
ax.plot(series)
ax.plot(df.index, [THRESHOLD] * len(df.index), 'k-.', label='threshold')
ax.plot(new_series)
ax.set_xlabel('$t-t_0$, s')
axins = ax.inset_axes([0.6, 0.6, 0.35, 0.3])
axins.plot(series)
axins.plot(df.index, [THRESHOLD] * len(df.index), 'k-.')
axins.plot(new_series)
axins.set_ylim(0, 20)
ax.indicate_inset_zoom(axins, edgecolor="black")
ax.set_ylabel('whatever, a.u.')
ax.legend(loc='upper left')
ax.set_title('Roots from InterpolatedUnivariateSpline')
Not so great. The roots found by the spline are quite a bit off (after all, it uses a cubic B-spline under the hood, and it cannot find roots if the order is set to 1). Ah well. For monotonic functions we could just invert the interpolation, but that is not the case here. I hope someone finds a better way to do it, but my next step was rolling out a custom function:
def my_interp(series: pd.Series, thr: float) -> pd.Series:
    needs_interp = series > thr
    # XOR means we are only considering transition points
    needs_interp = (needs_interp ^ needs_interp.shift(-1)).fillna(False)
    # The last point will never be interpolated
    x = series.index.to_series()
    k = series.diff(periods=-1) / x.diff(periods=-1)
    b = series - k * x
    x_fill = ((thr - b) / k)[needs_interp]
    fill_series = pd.Series(data=[thr] * x_fill.size, index=x_fill.values)
    # NB! needs_interp is a wrong mask to use for series here
    return pd.concat([series[series > thr], fill_series]).sort_index()
new_series = my_interp(series, THRESHOLD)
It achieves what you want to do with good precision:
To get back to timestamp representation, one would simply do
new_series.index = (start_time + pd.to_timedelta(new_series.index, unit='s'))
With that said, there are a couple caveats:
The function above assumes the timestamps are sorted (can be achieved by sort_index) and that no duplicates are present in the series.
Edge conditions are nasty as usual. I have tested the function a little bit, the logic seems sound and it does not break if either side of the series is above/below the threshold, and it handles irregular data just fine, but still - watch out for NaNs in your data and consider how you should handle all the edge conditions, sorting etc.
There is no logic dedicated to handling data points exactly at threshold or ensuring there is any regularity in new timestamps. This could lead to bugs, too: e.g. if some portion of your code relies on having at least 2 data points every day, it might not hold after the transformation.
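For comparison, the same linear-crossing idea can be sketched with plain NumPy arrays. This is not a drop-in replacement for the function above, only an illustration under the same assumptions (sorted, unique x values); clip_below is a made-up name, and x is taken as seconds since the first sample.

```python
import numpy as np

def clip_below(x, y, thr):
    """Keep the points with y > thr and add a point at thr wherever the
    piecewise-linear curve through (x, y) crosses the threshold."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    above = y > thr
    # Indices i such that the segment from point i to point i + 1 crosses thr
    cross = np.flatnonzero(above[:-1] != above[1:])
    # Fraction of the segment at which the line reaches thr
    frac = (thr - y[cross]) / (y[cross + 1] - y[cross])
    x_cross = x[cross] + frac * (x[cross + 1] - x[cross])
    new_x = np.concatenate((x[above], x_cross))
    new_y = np.concatenate((y[above], np.full(x_cross.size, float(thr))))
    order = np.argsort(new_x, kind="stable")
    return new_x[order], new_y[order]

# The V-shaped example from the question, one sample per hour
new_x, new_y = clip_below([0, 3600, 7200], [100, 9, 100], 10)
print(new_x)
print(new_y)
```

The division is safe because a crossing segment always has one end above the threshold and the other at or below it, so the two y values differ.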

How does xarray's interp nearest method choose the nearest center?

I have a 2-dimensional xarray dataset that I want to interpolate on the lon and lat coordinates so that I get a higher resolution, but with values that correspond exactly to the original values at each coordinate.
I thought the excellent xr.interp function would be able to do this, but following the example I see some discrepancy between the original and interpolated values. I am increasing the longitude and latitude resolution by a factor of 4, and thus would expect every air value that occurs once in the original dataset to occur 16 times in the interpolated dataset, but this is not the case.
Does anyone know why the original and interpolated datasets do not align, and how I could solve it?
import numpy as np
import xarray as xr
import matplotlib.pyplot as plt
ds = xr.tutorial.open_dataset("air_temperature").isel(time=0)
fig, axes = plt.subplots(ncols=2, figsize=(10, 4))
ds_sel=ds.sel(lon=slice(250,260),lat=slice(40,30))
ds.air.plot(ax=axes[0],xlim=(250,260),ylim=(30,40))
axes[0].set_title("Raw data")
# Interpolated data
new_lon = np.linspace(ds.lon[0], ds.lon[-1], ds.dims["lon"] * 4)
new_lat = np.linspace(ds.lat[0], ds.lat[-1], ds.dims["lat"] * 4)
dsi = ds.interp(lat=new_lat, lon=new_lon,method="nearest")
dsi_sel=dsi.sel(lon=slice(250,260),lat=slice(40,30))
dsi.air.plot(ax=axes[1],xlim=(250,260),ylim=(30,40))
axes[1].set_title("Interpolated data")
Showing the unique values with
unique, counts = np.unique(ds_sel.air.values, return_counts=True)
print("original values",dict(zip(unique, counts)))
unique, counts = np.unique(dsi_sel.air.values, return_counts=True)
print("interpolated values",dict(zip(unique, counts)))
I get
original values {262.1: 1, 263.1: 1, 263.9: 1, 264.4: 1, 265.19998: 1, 266.6: 1, 266.79: 1, 266.9: 2, 268.29: 1, 269.79: 1, 270.4: 1, 273.0: 1, 273.6: 1, 275.19998: 1, 276.29: 1, 278.0: 1, 278.5: 1, 278.6: 1, 281.5: 1, 282.1: 1, 282.29: 1, 284.6: 1, 286.79: 1, 288.0: 1}
interpolated values {262.1: 4, 263.1: 8, 263.9: 8, 264.4: 8, 265.19998: 4, 266.6: 16, 266.79: 16, 266.9: 24, 268.29: 8, 269.79: 20, 270.4: 10, 273.0: 20, 273.6: 16, 275.19998: 8, 276.29: 20, 278.0: 16, 278.5: 10, 278.6: 8, 281.5: 4, 282.1: 16, 282.29: 8, 284.6: 8, 286.79: 8, 288.0: 4}
I think you're conceptually running up against a fencepost error (see the section on this page: https://en.wikipedia.org/wiki/Off-by-one_error)
You should interpret the xarray coordinates as "midpoints", not as the cell boundaries.
Your new_lon isn't nicely divided into 1/2, 1/4, 1/8, etc.:
print(new_lon)
[200. 200.61611374 201.23222749 201.84834123 202.46445498
203.08056872 203.69668246 204.31279621 204.92890995]
And it doesn't include all the original coordinates.
Taking the "off-by-ones" into account:
new_lon = np.linspace(ds.lon[0], ds.lon[-1], (ds.dims["lon"] - 1) * 4 + 1)
new_lat = np.linspace(ds.lat[0], ds.lat[-1], (ds.dims["lat"] - 1) * 4 + 1)
print(new_lon)
[200. 200.625 201.25 201.875 202.5 203.125 203.75 204.375 205. ]
You can then e.g. inspect the part of the first row of the original and the interpolated one:
selection = ds["air"][0, :3]
selection_i = dsi["air"][0, :9]
print(selection["lon"])
print(selection.values)
print(selection_i["lon"])
print(selection_i.values)
This looks good to me:
[200. 202.5 205. ]
[241.2 242.5 243.5]
[200. 200.625 201.25 201.875 202.5 203.125 203.75 204.375 205. ]
[241.2 241.2 241.2 242.5 242.5 242.5 242.5 243.5 243.5]
Of course, when doing nearest interpolation, you might end up with ties:
0.5 is as far from 0.0 as it is from 1.0, so you inevitably have to bias either "up" or "down" to get a single nearest value.
Also note that the .plot() command, which draws a Matplotlib QuadMesh has to infer boundaries from midpoints somehow. This can sometimes lead to boundaries being drawn slightly differently from what you might have in mind (especially if the coordinate is unevenly spaced).
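The fencepost effect can be reproduced with NumPy alone, without the dataset. The sketch below uses the first three longitudes printed above and a refinement factor of 4; the variable names are just for illustration.

```python
import numpy as np

orig = np.array([200.0, 202.5, 205.0])  # original midpoints
factor = 4

# Naive: n * factor points, so the spacing is 5 / 11 and 202.5 is skipped
naive = np.linspace(orig[0], orig[-1], orig.size * factor)

# Fencepost-aware: (n - 1) intervals, each split into `factor` parts, plus the end point
fixed = np.linspace(orig[0], orig[-1], (orig.size - 1) * factor + 1)

print(naive)
print(fixed)
print(np.allclose(fixed[::factor], orig))
```

With the corrected grid, every original coordinate reappears at every fourth position, which is what nearest-neighbour interpolation needs in order to line up with the original cells.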

How to plot and visualize the forecasts of time series future values with auto-regression?

I want to plot a time series together with forecasts of its future values. I am using an autoregression model, and I am not sure whether my strategy is correct. I want to estimate the time series for 15 timestamps in the future, and I expect to have a fill between the different possibilities:
import numpy as np
from numpy import convolve
import matplotlib.pyplot as plt
plt.style.use('ggplot')
def moving_average(y, period):
    buffer = []
    for i in range(period, len(y)):
        buffer.append(y[i - period : i].mean())
    return buffer
def auto_regressive(y, p, d, q, future_count):
    """
    p = the order (number of time lags)
    d = degree of differencing
    q = the order of the moving-average
    """
    buffer = np.copy(y).tolist()
    for i in range(future_count):
        ma = moving_average(np.array(buffer[-p:]), q)
        forecast = buffer[-1]
        for n in range(0, len(ma), d):
            forecast -= buffer[-1 - n] - ma[n]
        buffer.append(forecast)
    return buffer
y=[60, 2, 0, 0, 1, 1, 0, -1, -2, 0, -2, 6, 0, 2, 0, 4, 0, 1, 3, 2, 1, 2, 1, 0, 2, 2, 0, 1, 0, 1, 3, -1, 0, 2, 2, 1, 3, 2, 4, 2, 3, 0, 0, 2, 2, 0, 3, 1, 0, 2]
x=[1549984749, 1549984751, 1549984755, 1549984761, 1549984768, 1549984769, 1549984770, 1549984774, 1549984780, 1549984783, 1549984786, 1549984787, 1549984788,
1549984794, 1549984797, 1549984855, 1549984923, 1549984930, 1549984955, 1549985006, 1549985008, 1549985027, 1549985086, 1549985091, 1549985101, 1549985115,
1549985116, 1549985118, 1549985130, 1549985130, 1549985139, 1549985141, 1549985146, 1549985154, 1549985178, 1549985192, 1549985203, 1549985217, 1549985245,
1549985288, 1549985311, 1549985316, 1549985425, 1549985447, 1549985460, 1549985463, 1549985489, 1549985561, 1549985595, 1549985610]
x=np.array(x)
print(np.size(x))
y=np.array(y)
print(np.size(y))
future_count = 15
predicted_15 = auto_regressive(y,20,1,2,future_count)
plt.plot(x[len(x) - len(predicted_15):], predicted_15)
plt.plot(x, y, 'o-')
plt.show()
But I got this error:
"have shapes {} and {}".format(x.shape, y.shape))
ValueError: x and y must have same first dimension, but have shapes (15,) and (65,)
You are getting an error because predicted_15 contains the original y as well as your forecasted values (so predicted_15 has length 65). You want to plot only the forecasted values (length 15).
plt.plot(x[len(x) - len(predicted_15):], predicted_15[len(x):])
Having said this, you need to consider what x-values these predicted y values correspond to.
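One possible way to get x-values for the forecasts, sketched here as an assumption rather than as the right answer for this data, is to continue the time axis past the last observation using the median spacing of the existing timestamps. The helper future_timestamps and the toy numbers are made up; with irregular sampling like yours, another spacing rule may fit better.

```python
import numpy as np

def future_timestamps(x, count):
    """Extend x by `count` points spaced by the median gap between existing points."""
    x = np.asarray(x)
    step = np.median(np.diff(x))
    return x[-1] + step * np.arange(1, count + 1)

# Toy stand-ins: 5 observed timestamps, and a buffer holding history plus 3 forecasts
x = np.array([0, 2, 4, 6, 8])
history_and_forecast = [0, 1, 2, 3, 4, 10, 11, 12]

# Keep only the forecasted tail and give it its own x-values
forecast = np.asarray(history_and_forecast[len(x):])
future_x = future_timestamps(x, len(forecast))
print(future_x)
print(forecast)
```

The forecasts can then be drawn with plt.plot(future_x, forecast), so that they continue to the right of the observed series instead of overlapping its last 15 points.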

Computing average for numpy array

I have a 2D numpy array (6 x 6 elements). I want to create another 2D array from it, where each element is the average of all elements within a blocksize window. Currently, I have the following code:
import numpy
def avg_func(data, blocksize = 2):
    # Takes data, and averages all positive (only numerical) numbers in blocks
    dimensions = data.shape
    height = int(numpy.floor(dimensions[0]/blocksize))
    width = int(numpy.floor(dimensions[1]/blocksize))
    averaged = numpy.zeros((height, width))
    for i in range(0, height):
        # Progress indicator
        print(i*1.0/height)
        for j in range(0, width):
            block = data[i*blocksize:(i+1)*blocksize,j*blocksize:(j+1)*blocksize]
            if block.any():
                averaged[i][j] = numpy.average(block[block>0])
    return averaged
arr = numpy.random.random((6,6))
avgd = avg_func(arr, 3)
Is there any way I can make it more pythonic? Perhaps numpy has something which does it already?
UPDATE
Based on M. Massias's solution below, here is an update with the fixed values replaced by variables. I am not sure it is coded right, but it does seem to work:
dimensions = data.shape
height = int(numpy.floor(dimensions[0]/block_size))
width = int(numpy.floor(dimensions[1]/block_size))
t = data.reshape([height, block_size, width, block_size])
avrgd = numpy.mean(t, axis=(1, 3))
To compute some operation slice by slice in numpy, it is very often useful to reshape your array and use extra axes.
To explain the process we'll use here: you can reshape your array, take the mean, reshape it again and take the mean again.
Here I assume blocksize is 2
t = np.array([[0, 1, 2, 3, 4, 5], [0, 1, 2, 3, 4, 5],[0, 1, 2, 3, 4, 5],[0, 1, 2, 3, 4, 5],[0, 1, 2, 3, 4, 5],[0, 1, 2, 3, 4, 5],])
t = t.reshape([6, 3, 2])
t = np.mean(t, axis=2)
t = t.reshape([3, 2, 3])
np.mean(t, axis=1)
outputs
array([[ 0.5, 2.5, 4.5],
[ 0.5, 2.5, 4.5],
[ 0.5, 2.5, 4.5]])
Now that it's clear how this works, you can do it in a single pass. Starting again from the original 6 x 6 array t:
t = t.reshape([3, 2, 3, 2])
np.mean(t, axis=(1, 3))
This works too (and should be quicker, since the mean is computed only once, I guess). I'll let you substitute height/blocksize, width/blocksize and blocksize accordingly.
See @askewcan's nice remark on how to generalize this to any dimension.
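Putting the substitution together, a sketch of the single-pass version with a variable block size could look like this. The name block_mean is made up; it trims rows and columns that do not fill a whole block (an assumption about what you want at the edges), and unlike the loop in the question it averages all values, not only the positive ones.

```python
import numpy as np

def block_mean(data, block_size):
    """Average non-overlapping block_size x block_size blocks of a 2D array."""
    height = data.shape[0] // block_size
    width = data.shape[1] // block_size
    # Drop trailing rows/columns that do not fill a complete block
    trimmed = data[:height * block_size, :width * block_size]
    # Axes: (block row, row within block, block column, column within block)
    blocks = trimmed.reshape(height, block_size, width, block_size)
    return blocks.mean(axis=(1, 3))

t = np.tile(np.arange(6.0), (6, 1))  # six rows of [0, 1, 2, 3, 4, 5]
result = block_mean(t, 2)
print(result)
```

For the 6 x 6 example with a block size of 2 this gives the same 3 x 3 array of 0.5, 2.5, 4.5 rows as the two-step version.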
