I have a list of tuples, where each tuple is a datetime and float. I wish to clip the float values so that they are all above a threshold value. For example if I have:
a = [
(datetime.datetime(2021, 11, 1, 0, 0, tzinfo=tzutc()), 100),
(datetime.datetime(2021, 11, 1, 1, 0, tzinfo=tzutc()), 9.0),
(datetime.datetime(2021, 11, 1, 2, 0, tzinfo=tzutc()), 100.0)
]
and if I want to clip at 10.0, this would give me:
b = [
(datetime.datetime(2021, 11, 1, 0, 0, tzinfo=tzutc()), 100),
(datetime.datetime(2021, 11, 1, 0, ?, tzinfo=tzutc()), 10.0),
(datetime.datetime(2021, 11, 1, 1, ?, tzinfo=tzutc()), 10.0),
(datetime.datetime(2021, 11, 1, 2, 0, tzinfo=tzutc()), 100.0)
]
So if I were to plot the a data (before clipping), I would get a V shaped graph. However, if I clip the data at 10.0 to give me the b data, and plot, I will have a \_/ shaped graph instead. There is a bit of math involved in calculating the new times so I'm hoping there is already functionality available to do this kind of thing. The datetimes are sorted in order and are unique. I can fix the data so the difference between consecutive times is equal, should that be necessary.
Apologies for not putting a full answer yesterday, my SO account is still rate-limited.
I have made a bit more complex custom dataset to showcase several values in a row being below threshold.
import pandas as pd
from datetime import datetime
from matplotlib import pyplot as plt
from scipy.interpolate import InterpolatedUnivariateSpline
df = pd.DataFrame([
(datetime(2021, 10, 31, 23, 0), 0),
(datetime(2021, 11, 1, 0, 0), 80),
(datetime(2021, 11, 1, 1, 0), 100),
(datetime(2021, 11, 1, 2, 0), 6),
(datetime(2021, 11, 1, 3, 0), 105),
(datetime(2021, 11, 1, 4, 0), 70),
(datetime(2021, 11, 1, 5, 0), 200),
(datetime(2021, 11, 1, 6, 0), 0),
(datetime(2021, 11, 1, 7, 0), 7),
(datetime(2021, 11, 1, 8, 0), 0),
(datetime(2021, 11, 1, 9, 0), 20),
(datetime(2021, 11, 1, 10, 0), 100),
(datetime(2021, 11, 1, 11, 0), 0)
], columns=['time', 'whatever'])
THRESHOLD = 10
The first thing to do here is to express the index as seconds elapsed since the first timestamp, so that it behaves like any ordinary number we can do all kinds of calculations with. For convenience, I am also pulling the values out as a Series; an even better approach would be to create it as such from the get-go, save the initial timestamp and reindex.
start_time = df['time'][0]
df.set_index((df['time'] - start_time).dt.total_seconds(), inplace=True)
series = df['whatever']
Then, I've tried InterpolatedUnivariateSpline from scipy:
roots = InterpolatedUnivariateSpline(df.index, series.values - THRESHOLD).roots()
threshold_crossings = pd.Series([THRESHOLD] * len(roots), index=roots)
new_series = pd.concat([series[series > THRESHOLD], threshold_crossings]).sort_index()
Let's test it out:
fig, ax = plt.subplots(figsize=(12, 8))
ax.plot(series)
ax.plot(df.index, [THRESHOLD] * len(df.index), 'k-.', label='threshold')
ax.plot(new_series)
ax.set_xlabel('$t-t_0$, s')
axins = ax.inset_axes([0.6, 0.6, 0.35, 0.3])
axins.plot(series)
axins.plot(df.index, [THRESHOLD] * len(df.index), 'k-.')
axins.plot(new_series)
axins.set_ylim(0, 20)
ax.indicate_inset_zoom(axins, edgecolor="black")
ax.set_ylabel('whatever, a.u.')
ax.legend(loc='upper left')
ax.set_title('Roots from InterpolatedUnivariateSpline')
Not so great. The roots found by the spline are quite a bit off: InterpolatedUnivariateSpline fits a cubic B-spline by default, and roots() is not available if the order is set to 1. Ah well. For monotonic functions, we could just invert the interpolation, but this is not the case here. I hope someone finds a better way to do it, but my next step was rolling out a custom function:
def my_interp(series: pd.Series, thr: float) -> pd.Series:
    needs_interp = series > thr
    # XOR means we are only considering transition points
    needs_interp = (needs_interp ^ needs_interp.shift(-1)).fillna(False)
    # The last point will never be interpolated
    x = series.index.to_series()
    k = series.diff(periods=-1) / x.diff(periods=-1)
    b = series - k * x
    x_fill = ((thr - b) / k)[needs_interp]
    fill_series = pd.Series(data=[thr] * x_fill.size, index=x_fill.values)
    # NB! needs_interp is a wrong mask to use for series here
    return pd.concat([series[series > thr], fill_series]).sort_index()
new_series = my_interp(series, THRESHOLD)
It achieves what you want to do with good precision:
To get back to timestamp representation, one would simply do
new_series.index = (start_time + pd.to_timedelta(new_series.index, unit='s'))
With that said, there are a couple caveats:
The function above assumes the timestamps are sorted (can be achieved by sort_index), and no duplicates are present in the series.
Edge conditions are nasty as usual. I have tested the function a little bit, the logic seems sound and it does not break if either side of the series is above/below the threshold, and it handles irregular data just fine, but still - watch out for NaNs in your data and consider how you should handle all the edge conditions, sorting etc.
There is no logic dedicated to handling data points exactly at threshold or ensuring there is any regularity in new timestamps. This could lead to bugs, too: e.g. if some portion of your code relies on having at least 2 data points every day, it might not hold after the transformation.
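For comparison, the same linear-interpolation idea can be sketched in plain NumPy. This is only a sketch under the same assumptions as above (sorted, unique times already expressed as plain numbers such as seconds since the first sample); the name clip_with_crossings is made up for illustration and is not part of any library.

```python
import numpy as np

def clip_with_crossings(t, y, thr):
    """Keep points with y > thr and add a point at thr wherever the
    piecewise-linear curve crosses the threshold (hypothetical helper)."""
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    above = y > thr
    # Indices i where the segment (i, i+1) changes side of the threshold
    cross = np.flatnonzero(above[:-1] != above[1:])
    # Fraction of each crossing segment at which y reaches thr; the
    # denominator is never zero because the two ends are on different sides
    frac = (thr - y[cross]) / (y[cross + 1] - y[cross])
    t_cross = t[cross] + frac * (t[cross + 1] - t[cross])
    t_all = np.concatenate([t[above], t_cross])
    y_all = np.concatenate([y[above], np.full(t_cross.size, float(thr))])
    order = np.argsort(t_all, kind="stable")
    return t_all[order], y_all[order]

# The V-shaped example from the question, with times in seconds
t_new, y_new = clip_with_crossings([0, 3600, 7200], [100, 9.0, 100.0], 10.0)
print(t_new)
print(y_new)
```

As with my_interp, points lying exactly on the threshold get no special treatment here, so check that case against your own data.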
I have a 2-dimensional xarray dataset that I want to interpolate on the lon and lat coordinates such that I have a higher resolution, but the values correspond exactly with the original values at each coordinate.
I thought the excellent xr.interp function would be able to do this, but following the example I see some discrepancy between the original and interpolated values. I am increasing the longitude and latitude resolution by a factor of 4, and thus would expect every air value that occurs once in the original dataset to occur 16 times in the interpolated dataset, but this is not the case.
Does anyone know what the cause is that the original and interpolated dataset do not align and how I could solve it?
import numpy as np
import xarray as xr
from matplotlib import pyplot as plt
ds = xr.tutorial.open_dataset("air_temperature").isel(time=0)
fig, axes = plt.subplots(ncols=2, figsize=(10, 4))
ds_sel=ds.sel(lon=slice(250,260),lat=slice(40,30))
ds.air.plot(ax=axes[0],xlim=(250,260),ylim=(30,40))
axes[0].set_title("Raw data")
# Interpolated data
new_lon = np.linspace(ds.lon[0], ds.lon[-1], ds.dims["lon"] * 4)
new_lat = np.linspace(ds.lat[0], ds.lat[-1], ds.dims["lat"] * 4)
dsi = ds.interp(lat=new_lat, lon=new_lon,method="nearest")
dsi_sel=dsi.sel(lon=slice(250,260),lat=slice(40,30))
dsi.air.plot(ax=axes[1],xlim=(250,260),ylim=(30,40))
axes[1].set_title("Interpolated data")
Showing the unique values with
unique, counts = np.unique(ds_sel.air.values, return_counts=True)
print("original values",dict(zip(unique, counts)))
unique, counts = np.unique(dsi_sel.air.values, return_counts=True)
print("interpolated values",dict(zip(unique, counts)))
I get
original values {262.1: 1, 263.1: 1, 263.9: 1, 264.4: 1, 265.19998: 1, 266.6: 1, 266.79: 1, 266.9: 2, 268.29: 1, 269.79: 1, 270.4: 1, 273.0: 1, 273.6: 1, 275.19998: 1, 276.29: 1, 278.0: 1, 278.5: 1, 278.6: 1, 281.5: 1, 282.1: 1, 282.29: 1, 284.6: 1, 286.79: 1, 288.0: 1}
interpolated values {262.1: 4, 263.1: 8, 263.9: 8, 264.4: 8, 265.19998: 4, 266.6: 16, 266.79: 16, 266.9: 24, 268.29: 8, 269.79: 20, 270.4: 10, 273.0: 20, 273.6: 16, 275.19998: 8, 276.29: 20, 278.0: 16, 278.5: 10, 278.6: 8, 281.5: 4, 282.1: 16, 282.29: 8, 284.6: 8, 286.79: 8, 288.0: 4}
I think you're conceptually running up against a fencepost error (see the section on this page: https://en.wikipedia.org/wiki/Off-by-one_error)
You should interpret the xarray coordinates as "midpoints", not as the cell boundaries.
Your new_lon isn't nicely divided into 1/2, 1/4, 1/8, etc.:
print(new_lon)
[200. 200.61611374 201.23222749 201.84834123 202.46445498
203.08056872 203.69668246 204.31279621 204.92890995]
And it doesn't include all the original coordinates.
Taking the "off-by-ones" into account:
new_lon = np.linspace(ds.lon[0], ds.lon[-1], (ds.dims["lon"] - 1) * 4 + 1)
new_lat = np.linspace(ds.lat[0], ds.lat[-1], (ds.dims["lat"] - 1) * 4 + 1)
print(new_lon)
[200. 200.625 201.25 201.875 202.5 203.125 203.75 204.375 205. ]
You can then e.g. inspect the part of the first row of the original and the interpolated one:
selection = ds["air"][0, :3]
selection_i = dsi["air"][0, :9]
print(selection["lon"])
print(selection.values)
print(selection_i["lon"])
print(selection_i.values)
This looks good to me:
[200. 202.5 205. ]
[241.2 242.5 243.5]
[200. 200.625 201.25 201.875 202.5 203.125 203.75 204.375 205. ]
[241.2 241.2 241.2 242.5 242.5 242.5 242.5 243.5 243.5]
Of course, when doing nearest interpolation, you might end up with ties:
0.5 is equally far removed from 0.0 as it is from 1.0, so you inevitably have to bias either "up" or "down" to get a single nearest value.
Also note that the .plot() command, which draws a Matplotlib QuadMesh, has to infer boundaries from midpoints somehow. This can sometimes lead to boundaries being drawn slightly differently from what you might have in mind (especially if the coordinate is unevenly spaced).
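The fencepost effect can be checked on a tiny coordinate array without loading the dataset. The following is a minimal sketch using only NumPy; the three-point lon array and the factor of 4 are picked for illustration.

```python
import numpy as np

lon = np.array([200.0, 202.5, 205.0])  # three original midpoints
factor = 4

# Naive grid: n * factor points, so the spacing is not original_spacing / factor
naive = np.linspace(lon[0], lon[-1], lon.size * factor)
# Fencepost-aware grid: (n - 1) intervals, each split into `factor` pieces
aligned = np.linspace(lon[0], lon[-1], (lon.size - 1) * factor + 1)

def contains_all(grid, values):
    """True if every original coordinate appears on the new grid."""
    return all(np.isclose(grid, v).any() for v in values)

print(contains_all(naive, lon))    # the interior point 202.5 is missed
print(contains_all(aligned, lon))  # every original coordinate is kept
```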
I have two datasets. The first has a time array of datetime.datetime objects and the x, y, z coordinate arrays at those times, like time[0]=datetime.datetime(2000,1,21,0,7,25), x[0]=-6.7, etc.
I'd like to calculate something from the coordinates, but that needs another parameter (Ma) which depends on time. The second dataset has its own time array in the same datetime form, and the parameter recorded at those times, like time[0]=datetime.datetime(2000,1,1,0,3), Ma[0]=2.73.
The problem is that the time arrays of the two datasets are different (though the ranges are similar).
So I want to interpolate the parameter's value at each time of dataset 1, so that Ma[0] corresponds to index 0 of dataset 1's time array rather than dataset 2's.
How can I do that?
PS. Can I convert the times to a simpler form? datetime.datetime seems quite cumbersome.
The following is an example of how to interpolate your values. The coord_ and ma_ arrays will be your imported data.
The first thing the script does is build some more sensible data structures from your disparate 1 dimensional arrays. The part that you're actually looking for is the call to np.interp, documented here.
import numpy as np
import datetime
import time
# Numpy cannot interpolate between datetimes
# This function converts a datetime to a timestamp
def to_ts(dt):
    return time.mktime(dt.timetuple())
coord_dts = np.array([
datetime.datetime(2000, 1, 1, 12),
datetime.datetime(2000, 1, 2, 12),
datetime.datetime(2000, 1, 3, 12),
datetime.datetime(2000, 1, 4, 12)
])
coord_xs = np.array([3, 5, 8, 13])
coord_ys = np.array([2, 3, 5, 7])
coord_zs = np.array([1, 3, 6, 10])
ma_dts = np.array([
datetime.datetime(2000, 1, 1),
datetime.datetime(2000, 1, 2),
datetime.datetime(2000, 1, 3),
datetime.datetime(2000, 1, 4)
])
ma_vals = np.array([1, 2, 3, 4])
# Handling the data as separate arrays will be painful.
# This builds an array of dictionaries with the form:
# [ { 'time': timestamp, 'x': x coordinate, 'y': y coordinate, 'z': z coordinate }, ... ]
coords = np.array([
    { 'time': to_ts(coord_dts[idx]), 'x': coord_xs[idx], 'y': coord_ys[idx], 'z': coord_zs[idx] }
    for idx, _ in enumerate(coord_dts)
])
# Build array of timestamps from ma datetimes
ma_ts = [ to_ts(dt) for dt in ma_dts ]
for coord in coords:
    print("ma interpolated value", np.interp(coord['time'], ma_ts, ma_vals))
    print("at coordinates:", coord['x'], coord['y'], coord['z'])
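On the PS about a simpler time representation: one option is to work in seconds elapsed since a reference datetime. time.mktime interprets naive datetimes as local time, whereas subtracting two datetimes does not involve the local time zone at all. The sketch below assumes naive datetimes like those above; the helper seconds_since and the choice of reference are illustrative only.

```python
import datetime
import numpy as np

coord_dts = [datetime.datetime(2000, 1, d, 12) for d in (1, 2, 3, 4)]
ma_dts = [datetime.datetime(2000, 1, d) for d in (1, 2, 3, 4)]
ma_vals = np.array([1.0, 2.0, 3.0, 4.0])

# Any common reference works; here the earliest time of either dataset
ref = min(coord_dts[0], ma_dts[0])

def seconds_since(dts, ref):
    """Convert datetimes to float seconds elapsed since `ref` (hypothetical helper)."""
    return np.array([(dt - ref).total_seconds() for dt in dts])

# np.interp is vectorised, so all coordinate times are handled in one call.
# Times outside the range of ma_dts are clamped to the first/last ma value.
ma_at_coords = np.interp(seconds_since(coord_dts, ref),
                         seconds_since(ma_dts, ref),
                         ma_vals)
print(ma_at_coords)
```

The result has one Ma value per entry of dataset 1, so ma_at_coords[0] lines up with coord_dts[0].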
I need some help. I am learning matplotlib and numpy. I am simply reproducing a piece of code from "Intraday candlestick charts using Matplotlib" with my own csv file to learn from that code. The part of my code that differs from it is the following:
import numpy as np
import matplotlib.pyplot as plt
import datetime
from mpl_finance import candlestick_ohlc
from matplotlib.dates import num2date
# data in a text file, 5 columns: time, opening, close, high, low
# note that I'm using the time you formatted into an ordinal float
data = np.loadtxt("/Users/paul/Documents/python/Quant/INTC.csv", delimiter=",")
I am getting an error that says
ValueError: could not convert string to float: b'Date'.
I even tried using this line, and it still gives me the same error message:
data = np.genfromtxt("/Users/paul/Documents/python/Quant/INTC.csv", delimiter=",", skip_header=1, usecols=[0,1,2,3,4], dtype=(dt, float,float,float, float))
It's probably a basic concept that I am not understanding. Some guidance would be much appreciated.
sample data:
Date,Open,High,Low,Close,Volume,Adj Close
2017-11-06,46.599998,46.740002,46.090000,46.700001,46.700001,34035000
2017-11-07,46.700001,47.090000,46.389999,46.779999,46.779999,24461400
2017-11-08,46.619999,46.700001,46.279999,46.700001,46.700001,21565800
2017-11-09,46.049999,46.389999,45.650002,46.299999,46.299999,25570400
2017-11-10,46.040001,46.090000,45.380001,45.580002,45.580002,24095400
2017-11-13,45.259998,45.939999,45.250000,45.750000,45.750000,18999000
You can import the "Date" column (a string) by converting it to a datetime object. Once you have it as a datetime object, you can filter out the weekends as shown below. You can then plot the filtered data in matplotlib, since matplotlib understands datetime objects. Hope that helps.
'''
data in csv
Date,Open,High,Low,Close,Volume,Adj Close
2017-12-06,46.599998,46.740002,46.090000,46.700001,46.700001,34035000
2017-12-07,46.700001,47.090000,46.389999,46.779999,46.779999,24461400
2017-12-08,46.619999,46.700001,46.279999,46.700001,46.700001,21565800
2017-12-09,46.049999,46.389999,45.650002,46.299999,46.299999,25570400
2017-12-10,46.040001,46.090000,45.380001,45.580002,45.580002,24095400
2017-12-13,45.259998,45.939999,45.250000,45.750000,45.750000,18999000
'''
import numpy as np
from datetime import datetime
# use a converter to turn the date string into a datetime object. Note dtype is object for all columns
# encoding='utf-8' makes genfromtxt pass str (not bytes) to the converter on Python 3
data = np.genfromtxt(r'stock.csv', delimiter=',', names=True, encoding='utf-8',
                     converters={0: lambda x: datetime.strptime(x, "%Y-%m-%d")}, dtype=object)
print(data)
'''
[ (datetime.datetime(2017, 12, 6, 0, 0), '46.599998', '46.740002', '46.090000', '46.700001', '46.700001', '34035000')
(datetime.datetime(2017, 12, 7, 0, 0), '46.700001', '47.090000', '46.389999', '46.779999', '46.779999', '24461400')
(datetime.datetime(2017, 12, 8, 0, 0), '46.619999', '46.700001', '46.279999', '46.700001', '46.700001', '21565800')
(datetime.datetime(2017, 12, 9, 0, 0), '46.049999', '46.389999', '45.650002', '46.299999', '46.299999', '25570400')
(datetime.datetime(2017, 12, 10, 0, 0), '46.040001', '46.090000', '45.380001', '45.580002', '45.580002', '24095400')
(datetime.datetime(2017, 12, 13, 0, 0), '45.259998', '45.939999', '45.250000', '45.750000', '45.750000', '18999000')]
'''
# check if a day is a weekday or not
def check_weekday_or_not(datetime_object):
    # datetime.weekday() returns 5 and 6 for Saturday and Sunday
    return datetime_object.weekday() not in [5, 6]
# create a function to apply on each row of the matrix
vfunc = np.vectorize(check_weekday_or_not)
filter_mask = vfunc(data['Date'])
print(filter_mask)
#[ True True True False False True]
# Apply the filter mask to obtain an array without weekends.
print(data[filter_mask])
'''
array([ (datetime.datetime(2017, 12, 6, 0, 0), '46.599998', '46.740002', '46.090000', '46.700001', '46.700001', '34035000'),
(datetime.datetime(2017, 12, 7, 0, 0), '46.700001', '47.090000', '46.389999', '46.779999', '46.779999', '24461400'),
(datetime.datetime(2017, 12, 8, 0, 0), '46.619999', '46.700001', '46.279999', '46.700001', '46.700001', '21565800'),
(datetime.datetime(2017, 12, 13, 0, 0), '45.259998', '45.939999', '45.250000', '45.750000', '45.750000', '18999000')],
dtype=[('Date', 'O'), ('Open', 'O'), ('High', 'O'), ('Low', 'O'), ('Close', 'O'), ('Volume', 'O'), ('Adj_Close', 'O')])
'''
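If the dates are held as NumPy datetime64 values rather than datetime objects, the weekend mask can also be built without np.vectorize. This is a minimal sketch of that alternative using np.is_busday, whose default week mask treats Monday to Friday as business days; the dates are copied from the sample above, and holidays are not considered.

```python
import numpy as np

dates = np.array(
    ["2017-12-06", "2017-12-07", "2017-12-08",
     "2017-12-09", "2017-12-10", "2017-12-13"],
    dtype="datetime64[D]",
)

# True for Monday to Friday, False for Saturday and Sunday
weekday_mask = np.is_busday(dates)
weekdays_only = dates[weekday_mask]

print(weekday_mask)
print(weekdays_only)
```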
Suppose I have two arrays indicating the x and y coordinates of a calibration curve.
X = [1,2,3,4,5,6,7,8,9,10,12,14,16,18,20,30,40,50]
Y = [2,4,6,8,10,12,14,16,18,20,24,28,32,36,40,60,80,100]
My example arrays above contain 18 points. You'll notice that the x values are not linearly spaced; there are more points at lower values of x.
Let's suppose I need to reduce the number of points in my calibration curve to 13 points. Obviously, I could just remove the first five or the last five points, but that would shorten my overall range of x values. To maintain range and minimise the space between x values I would preferentially remove values x= 2,4,6,8,10. Removing these x points and their respective y values would leave 13 points in the curve as required.
How could I do this point selection and removal automatically in Python? I.e. Is there an algorithm to pick the best x points from a list, where "best" is defined as keeping the points as close as possible while keeping the overall range and adhering to the new number of points.
Please note that the points remaining must be in the original lists, so I can't interpolate the 18 points on to a 13 point grid.
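This maximizes the sum of the square roots of the distances between consecutive chosen points. Because the square root is concave, this in some sense spreads the points as far apart as possible.
import itertools
# (b - a) is positive because each combination of sorted(X) is itself sorted
list(max(itertools.combinations(sorted(X), 13),
         key=lambda l: sum((b - a) ** 0.5 for a, b in zip(l, l[1:]))))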
Note that this is only feasible for small problems. The time complexity for selecting k points is O(k * (len(X) choose k)), so basically O(exp(len(X))). So don't even think about using this for, e.g., len(X) == 100 and k == 10.
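To get a feel for that growth, the number of candidate subsets can be counted directly with math.comb (available from Python 3.8). This is just an illustrative calculation, not part of the selection code itself.

```python
import math

# Number of 13-point subsets of the 18 calibration points
small = math.comb(18, 13)
# Number of 10-point subsets of a 100-point curve
large = math.comb(100, 10)

print(small)  # 8568
print(large)  # on the order of 10**13
```

Scoring a few thousand subsets is instantaneous; scoring more than ten trillion is not.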
X = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14, 16, 18, 20, 30, 40, 50]
Y = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 24, 28, 32, 36, 40, 60, 80, 100]
assert len(X) == len(set(X)), "Duplicate X values found"
points = list(zip(X, Y))
points.sort() # sorts by X
while len(points) > 13:
    # Find index whose neighbouring X values are closest together
    i = min(range(1, len(points) - 1), key=lambda p: points[p + 1][0] - points[p - 1][0])
    points.pop(i)
print(points)
Output:
[(1, 2), (3, 6), (5, 10), (7, 14), (10, 20), (12, 24), (14, 28), (16, 32), (18, 36), (20, 40), (30, 60), (40, 80), (50, 100)]
If you want the original series again:
X, Y = zip(*points)
An algorithm that would achieve that:
1. Convert each number into the sum of the absolute differences to the number on its left and the number on its right. If a neighbour is missing (the first and last cases), use MAX_INT. For example, 1 would become MAX_INT, 2 would become 2, and 10 would become 3.
2. Remove the first case with the lowest sum.
3. If you need to remove more numbers, go to 1.
This would remove 2,4,6,8,10,3,...
Here is a recursive approach that repeatedly removes the point which will be the least missed:
def mostRedundantPoint(x):
    #returns the index, i, in the range 0 < i < len(x) - 1
    #that minimizes x[i+1] - x[i-1]
    #assumes len(x) > 2 and that x
    #is sorted in ascending order
    gaps = [x[i+1] - x[i-1] for i in range(1,len(x)-1)]
    i = gaps.index(min(gaps))
    return i+1
def reduceList(x,k):
    if len(x) <= k:
        return x
    else:
        i = mostRedundantPoint(x)
        return reduceList(x[:i]+x[i+1:],k)
X = [1,2,3,4,5,6,7,8,9,10,12,14,16,18,20,30,40,50]
print(reduceList(X,13))
#prints [1, 3, 5, 7, 10, 12, 14, 16, 18, 20, 30, 40, 50]
This list essentially agrees with your intended output, since keeping 9 vs. 10 has the same net effect. It is reasonably quick in the sense that it is almost instantaneous in reducing sorted([random.randint(1,10**6) for i in range(1000)]) from 1000 elements to 100 elements. The fact that it is recursive implies that it will blow the stack if you try to remove many more points than that, but with what seems to be your intended problem size that shouldn't be an issue. If need be, you could of course replace the recursion by a loop.
I have lists of datetimes and values like this:
import datetime
x = [datetime.datetime(2016, 9, 26, 0, 0), datetime.datetime(2016, 9, 27, 0, 0),
datetime.datetime(2016, 9, 28, 0, 0), datetime.datetime(2016, 9, 29, 0, 0),
datetime.datetime(2016, 9, 30, 0, 0), datetime.datetime(2016, 10, 1, 0, 0)]
y = [26060, 23243, 22834, 22541, 22441, 23248]
And can plot them like this:
import matplotlib.pyplot as plt
plt.plot(x, y)
I would like to be able to plot a smooth version using more x-points. So first I do this:
delta_t = max(x) - min(x)
N_points = 300
xnew = [min(x) + i*delta_t/N_points for i in range(N_points)]
Then attempting a spline fit with scipy:
from scipy.interpolate import spline
ynew = spline(x, y, xnew)
TypeError: Cannot cast array data from dtype('O') to dtype('float64') according to the rule 'safe'
What is the best way to proceed? I am open to solutions involving other libraries such as pandas or plotly.
You're trying to pass a list of datetimes to the spline function, which are Python objects (hence dtype('O')). You need to convert the datetimes to a numeric format first, and then convert them back if you wish:
int_x = [i.total_seconds() for i in x]
ynew = spline(int_x, y, xnew)
Edit: total_seconds() is actually a timedelta method, not for datetimes. However it looks like you sorted it out so I'll leave this answer as is.
Figured something out:
x_ts = [x_.timestamp() for x_ in x]
xnew_ts = [x_.timestamp() for x_ in xnew]
ynew = spline(x_ts, y, xnew_ts)
plt.plot(xnew, ynew)
This works very nicely, but I'm still open to ideas for simpler methods.
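Note that scipy.interpolate.spline was deprecated and has since been removed from SciPy, so the import above fails on current versions. A possible replacement is make_interp_spline, sketched below under the assumption that a cubic interpolating spline is what is wanted. The times are converted to seconds since the first sample (a choice made here for illustration, instead of timestamp()) and converted back for plotting.

```python
import datetime
import numpy as np
from scipy.interpolate import make_interp_spline

x = [datetime.datetime(2016, 9, 26) + datetime.timedelta(days=i) for i in range(6)]
y = [26060, 23243, 22834, 22541, 22441, 23248]

# Numeric time axis: seconds elapsed since the first sample
t = np.array([(d - x[0]).total_seconds() for d in x])

# Cubic B-spline that passes through every (t, y) pair
spline_fit = make_interp_spline(t, y, k=3)

t_new = np.linspace(t[0], t[-1], 300)
y_new = spline_fit(t_new)

# Back to datetimes, e.g. for plt.plot(x_new, y_new)
x_new = [x[0] + datetime.timedelta(seconds=float(s)) for s in t_new]
```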