Numpy structured array with datetime - python

I tried to build a structured array with a datetime coloumn
import numpy as np
na_trades = np.zeros(2, dtype = 'datetime64,i4')
na_trades[0] = (np.datetime64('1970-01-01 00:00:00'),0)
TypeError: Cannot cast NumPy timedelta64 scalar from metadata [s] to according to the rule 'same_kind'
Is there a way to fix this?

You have to specify that the datetime64 is in seconds when you create the array because the one you parse and try to assign is a datetime64[s]:
na_trades = np.zeros(2, dtype='datetime64[s],i4')
na_trades[0] = (np.datetime64('1971-01-01 00:00:00'), 0)
The error you get means that the datetime64 object that you specified is not same_kind as the one you try to assing. You try to assign a seconds resolution one, and you created a different one when you constructed the array (by default I think it's nanoseconds).

Try following:
>>> na_trades = np.zeros(2, dtype=[('dt', 'datetime64[s]'), ('vol', 'i4')])
>>> na_trades
array([(datetime.datetime(1970, 1, 1, 0, 0), 0),
(datetime.datetime(1970, 1, 1, 0, 0), 0)],
dtype=[('dt', ('<M8[s]', {})), ('vol', '<i4')])
>>> na_trades[0] = (np.datetime64('1970-01-02 00:00:00'),1)
>>> na_trades
array([(datetime.datetime(4707, 11, 29, 0, 0), 1),
(datetime.datetime(1970, 1, 1, 0, 0), 0)],
dtype=[('dt', ('<M8[s]', {})), ('vol', '<i4')])

Related

Clipping a datatime series along the y-axis

I have a list of tuples, where each tuple is a datetime and float. I wish to clip the float values so that they are all above a threshold value. For example if I have:
a = [
(datetime.datetime(2021, 11, 1, 0, 0, tzinfo=tzutc()), 100),
(datetime.datetime(2021, 11, 1, 1, 0, tzinfo=tzutc()), 9.0),
(datetime.datetime(2021, 11, 1, 2, 0, tzinfo=tzutc()), 100.0)
]
and if I want to clip at 10.0, this would give me:
b = [
(datetime.datetime(2021, 11, 1, 0, 0, tzinfo=tzutc()), 100),
(datetime.datetime(2021, 11, 1, 0, ?, tzinfo=tzutc()), 10.0),
(datetime.datetime(2021, 11, 1, 1, ?, tzinfo=tzutc()), 10.0),
(datetime.datetime(2021, 11, 1, 2, 0, tzinfo=tzutc()), 100.0)
]
So if I were to plot the a data (before clipping), I would get a V shaped graph. However, if I clip the data at 10.0 to give me the b data, and plot, I will have a \_/ shaped graph instead. There is a bit of math involved in calculating the new times so I'm hoping there is already functionality available to do this kind of thing. The datetimes are sorted in order and are unique. I can fix the data so the difference between consecutive times is equal, should that be necessary.
Apologies for not putting a full answer yesterday, my SO account is still rate-limited.
I have made a bit more complex custom dataset to showcase several values in a row being below threshold.
import pandas as pd
from datetime import datetime
from matplotlib import pyplot as plt
from scipy.interpolate import InterpolatedUnivariateSpline
df = pd.DataFrame([
(datetime(2021, 10, 31, 23, 0), 0),
(datetime(2021, 11, 1, 0, 0), 80),
(datetime(2021, 11, 1, 1, 0), 100),
(datetime(2021, 11, 1, 2, 0), 6),
(datetime(2021, 11, 1, 3, 0), 105),
(datetime(2021, 11, 1, 4, 0), 70),
(datetime(2021, 11, 1, 5, 0), 200),
(datetime(2021, 11, 1, 6, 0), 0),
(datetime(2021, 11, 1, 7, 0), 7),
(datetime(2021, 11, 1, 8, 0), 0),
(datetime(2021, 11, 1, 9, 0), 20),
(datetime(2021, 11, 1, 10, 0), 100),
(datetime(2021, 11, 1, 11, 0), 0)
], columns=['time', 'whatever'])
THRESHOLD = 10
The first thing to do here is to express index in terms of timedelta so that it behaves as any usual number we can then do all kinds of calculations with. For convenience, I am also expressing it as Series - an even better approach would be to create it as such from the get go, save the initial timestamp and reindex.
start_time = df['time'][0]
df.set_index((df['time'] - start_time).dt.total_seconds(), inplace=True)
series = df['whatever']
Then, I've tried InterpolatedUnivariateSpline from scipy:
roots = InterpolatedUnivariateSpline(df.index, series.values - THRESHOLD).roots()
threshold_crossings = pd.Series([THRESHOLD] * len(roots), index=roots)
new_series = pd.concat([series[series > THRESHOLD], threshold_crossings]).sort_index()
Let's test it out:
fig, ax = plt.subplots(figsize=(12, 8))
ax.plot(series)
ax.plot(df.index, [THRESHOLD] * len(df.index), 'k-.', label='threshold')
ax.plot(new_series)
ax.set_xlabel('$t-t_0$, s')
axins = ax.inset_axes([0.6, 0.6, 0.35, 0.3])
axins.plot(series)
axins.plot(df.index, [THRESHOLD] * len(df.index), 'k-.')
axins.plot(new_series)
axins.set_ylim(0, 20)
ax.indicate_inset_zoom(axins, edgecolor="black")
ax.set_ylabel('whatever, a.u.')
ax.legend(loc='upper left')
ax.set_title('Roots from InterpolatedUnivariateSpline')
Not so great. Spline roots interpolation is quite a bit off (after all, it uses a cubic B-spline under the hood and can't find roots if setting order to 1). Ah well. For monotonic functions, we could just inverse the interpolation, but this is not the case here. I hope someone finds a better way to do it, but my next step was rolling out a custom function:
def my_interp(series: pd.Series, thr: float) -> pd.Series:
needs_interp = series > thr
# XOR means we are only considering transition points
needs_interp = (needs_interp ^ needs_interp.shift(-1)).fillna(False)
# The last point will never be interpolated
x = series.index.to_series()
k = series.diff(periods=-1) / x.diff(periods=-1)
b = series - k * x
x_fill = ((thr - b) / k)[needs_interp]
fill_series = pd.Series(data=[thr] * x_fill.size, index=x_fill.values)
# NB! needs_interp is a wrong mask to use for series here
return pd.concat([series[series > thr], fill_series]).sort_index()
new= my_interp(series, THRESHOLD)
It achieves what you want to do with good precision:
To get back to timestamp representation, one would simply do
new_series.index = (start_time + pd.to_timedelta(new_series.index, unit='s'))
With that said, there are a couple caveats:
The function above assumes the timestamps are sorted (can be achieved
by sort_index), and no duplicates are present in the series
Edge conditions are nasty as usual. I have tested the function a little bit, the logic seems sound and it does not break if either side of the series is above/below the threshold, and it handles irregular data just fine, but still - watch out for NaNs in your data and consider how you should handle all the edge conditions, sorting etc.
There is no logic dedicated to handling data points exactly at threshold or ensuring there is any regularity in new timestamps. This could lead to bugs, too: e.g. if some portion of your code relies on having at least 2 data points every day, it might not hold after the transformation.

What is JaxNumpy-compatible equivalent to this Python function?

How do I implement the below in a JAX-compatable way (e.g., using jax.numpy)?
def actions(state: tuple[int, ...]) -> list[tuple[int, ...]]:
l = []
iterables = [range(1, i+1) for i in state]
ns = list(range(len(iterables)))
for i, iterable in enumerate(iterables):
for value in iterable:
action = tuple(value if n == i else 0 for n in ns)
l.append(action)
return l
>>> state = (3, 1, 2)
>>> actions(state)
[(1, 0, 0), (2, 0, 0), (3, 0, 0), (0, 1, 0), (0, 0, 1), (0, 0, 2)]
Jax, like numpy, cannot efficiently operate on Python container types like lists and tuples, so there's not really any JAX-compatible way to create a function with the exact signature you specify above.
But if you're alright with the return value being a two-dimensional array, you could do something like this, based on jnp.vstack:
from typing import Tuple
import jax.numpy as jnp
from jax import jit, partial
#partial(jit, static_argnums=0)
def actions(state: Tuple[int, ...]) -> jnp.ndarray:
return jnp.vstack([
jnp.zeros((val, len(state)), int).at[:, i].set(jnp.arange(1, val + 1))
for i, val in enumerate(state)])
>>> state = (3, 1, 2)
>>> actions(state)
DeviceArray([[1, 0, 0],
[2, 0, 0],
[3, 0, 0],
[0, 1, 0],
[0, 0, 1],
[0, 0, 2]], dtype=int32)
Note that because the size of the output array depends on the content of state, state must be a static quantity, so a tuple is a good option for the input.

Python Referencing data by interpolation

I have two datasets, One of which has time array in datetime.datetime form, and x,y,z coordinates array of that time, like time[0]=datetime.datetime(2000,1,21,0,7,25), x[0]=-6.7, etc.
I'd like to calculate something from the coordinates, but that needs another parameter (Ma) which depend on time. Second data set has another time array in same datetime form, and the parameter recorded at that time, like time[0]=datetime.datetime(2000,1,1,0,3), Ma[0]=2.73
The problem is that the time array of two data set is different (though the ranges are similar)
So I want to interpolate the parameter's value at each time of data set 1, like Ma[0], but 0 is not index of time of dataset2, but corresponds to index of dataset 1.
How can I do that?
PS. Can I convert the time form to simpler one? datetime.datetime seems quite cumbersome.
The following is an example of how to interpolate your values. The coord_ and ma_ arrays will be your imported data.
The first thing the script does is build some more sensible data structures from your disparate 1 dimensional arrays. The part that you're actually looking for is the call to np.interp, documented here.
import numpy as np
import datetime
import time
# Numpy cannot interpolate between datetimes
# This function converts a datetime to a timestamp
def to_ts(dt):
return time.mktime(dt.timetuple())
coord_dts = np.array([
datetime.datetime(2000, 1, 1, 12),
datetime.datetime(2000, 1, 2, 12),
datetime.datetime(2000, 1, 3, 12),
datetime.datetime(2000, 1, 4, 12)
])
coord_xs = np.array([3, 5, 8, 13])
coord_ys = np.array([2, 3, 5, 7])
coord_zs = np.array([1, 3, 6, 10])
ma_dts = np.array([
datetime.datetime(2000, 1, 1),
datetime.datetime(2000, 1, 2),
datetime.datetime(2000, 1, 3),
datetime.datetime(2000, 1, 4)
])
ma_vals = np.array([1, 2, 3, 4])
# Handling the data as separate arrays will be painful.
# This builds an array of dictionaries with the form:
# [ { 'time': timestamp, 'x': x coordinate, 'y': y coordinate, 'z': z coordinate }, ... ]
coords = np.array([
{ 'time': to_ts(coord_dts[idx]), 'x': coord_xs[idx], 'y': coord_ys[idx], 'z': coord_zs[idx] }
for idx, _ in enumerate(coord_dts)
])
# Build array of timestamps from ma datetimes
ma_ts = [ to_ts(dt) for dt in ma_dts ]
for coord in coords:
print("ma interpolated value", np.interp(coord['time'], ma_ts, ma_vals))
print("at coordinates:", coord['x'], coord['y'], coord['z'])

candlestick charts using Matplotlib skip weekends

I need some help. I am learning between matplotlib and numpy. I am simply reproducing a piece of code from "Intraday candlestick charts using Matplotlib" with my own csv file to learn from that code. The different part of my code that is different from it is the following:
import numpy as np
import matplotlib.pyplot as plt
import datetime
from mpl_finance import candlestick_ohlc
from matplotlib.dates import num2date
# data in a text file, 5 columns: time, opening, close, high, low
# note that I'm using the time you formated into an ordinal float data =
np.loadtxt("/Users/paul/Documents/python/Quant/INTC.csv", delimiter=",")
I am getting an error that says
ValueError: could not convert string to float: b'Date'.
I even try to use this line and still gives me the same error message
data = np.genfromtxt("/Users/paul/Documents/python/Quant/INTC.csv", delimiter=",", skip_header=1, usecols=[0,1,2,3,4], dtype=(dt, float,float,float, float))"
it's probably a basic concept that I am not understanding. much appriciated for some guidance.
sample data:
Date,Open,High,Low,Close,Volume,Adj Close
2017-11-06,46.599998,46.740002,46.090000,46.700001,46.700001,34035000
2017-11-07,46.700001,47.090000,46.389999,46.779999,46.779999,24461400
2017-11-08,46.619999,46.700001,46.279999,46.700001,46.700001,21565800
2017-11-09,46.049999,46.389999,45.650002,46.299999,46.299999,25570400
2017-11-10,46.040001,46.090000,45.380001,45.580002,45.580002,24095400
2017-11-13,45.259998,45.939999,45.250000,45.750000,45.750000,18999000
You can import "Date column' (string) by converting it to a datetime object. Once you have it as a datetime object you can filter out the weekends as shown below. you can plot the filtered data in matplotlib as matplotlib understands datetime objects. hope that helps.
'''
data in csv
Date,Open,High,Low,Close,Volume,Adj Close
2017-12-06,46.599998,46.740002,46.090000,46.700001,46.700001,34035000
2017-12-07,46.700001,47.090000,46.389999,46.779999,46.779999,24461400
2017-12-08,46.619999,46.700001,46.279999,46.700001,46.700001,21565800
2017-12-09,46.049999,46.389999,45.650002,46.299999,46.299999,25570400
2017-12-10,46.040001,46.090000,45.380001,45.580002,45.580002,24095400
2017-12-13,45.259998,45.939999,45.250000,45.750000,45.750000,18999000
'''
import numpy as np
from datetime import datetime
# use converter to convert a string object to datetime object. Note dtype is object for all columns
data = np.genfromtxt(r'stock.csv', delimiter = ',', names = True,
converters={0: lambda x: datetime.strptime(x, "%Y-%m-%d")}, dtype=object)
print data
'''
[ (datetime.datetime(2017, 12, 6, 0, 0), '46.599998', '46.740002', '46.090000', '46.700001', '46.700001', '34035000')
(datetime.datetime(2017, 12, 7, 0, 0), '46.700001', '47.090000', '46.389999', '46.779999', '46.779999', '24461400')
(datetime.datetime(2017, 12, 8, 0, 0), '46.619999', '46.700001', '46.279999', '46.700001', '46.700001', '21565800')
(datetime.datetime(2017, 12, 9, 0, 0), '46.049999', '46.389999', '45.650002', '46.299999', '46.299999', '25570400')
(datetime.datetime(2017, 12, 10, 0, 0), '46.040001', '46.090000', '45.380001', '45.580002', '45.580002', '24095400')
(datetime.datetime(2017, 12, 13, 0, 0), '45.259998', '45.939999', '45.250000', '45.750000', '45.750000', '18999000')]
'''
# check if a day is a weekday or not
def check_weekday_or_not(datetime_object):
if datetime_object.weekday() not in [5,6]:
# datetime.weekday() returns 5 and 6 for saturday and Sunday
return True
else:
return False
# create a function to apply on each row of the matrix
vfunc =np.vectorize(check_weekday_or_not)
filter_mask = vfunc(data['Date'])
print filter_mask
#[ True True True False False True]
# Apply the filter mask to obtain an array without weekends.
print data[filter_mask]
'''
array([ (datetime.datetime(2017, 12, 6, 0, 0), '46.599998', '46.740002', '46.090000', '46.700001', '46.700001', '34035000'),
(datetime.datetime(2017, 12, 7, 0, 0), '46.700001', '47.090000', '46.389999', '46.779999', '46.779999', '24461400'),
(datetime.datetime(2017, 12, 8, 0, 0), '46.619999', '46.700001', '46.279999', '46.700001', '46.700001', '21565800'),
(datetime.datetime(2017, 12, 13, 0, 0), '45.259998', '45.939999', '45.250000', '45.750000', '45.750000', '18999000')],
dtype=[('Date', 'O'), ('Open', 'O'), ('High', 'O'), ('Low', 'O'), ('Close', 'O'), ('Volume', 'O'), ('Adj_Close', 'O')])
'''

Create new numpy array-scalar of flexible dtype

I have a working solution to my problem, but when trying different things I was astounded there wasn't a better solution that I could find. It all boils down to creating a single flexible dtype value for comparing and inserting into an array.
I have an RGB 24-bit image (so 8-bits for each R, G, and B) image array. It turns out for some actions it is best to use it as a 3D array with HxWx3 other times it is best to use it as a structured array with the dtype([('R',uint8),('G',uint8),('B',uint8)]). One example is when trying to relabel the image colors so that every unique color is given a different value. I do this with the following code:
# Given im as an array of HxWx3, dtype=uint8
from numpy import dtype, uint8, unique, insert, searchsorted
rgb_dtype = dtype([('R',uint8),('G',uint8),('B',uint8)]))
im = im.view(dtype=rgb_dtype).squeeze() # need squeeze to remove the third dim
values = unique(im)
if tuple(values[0]) != (0, 0, 0):
values = insert(values, 0, 0) # value 0 needs to always be (0, 0, 0)
labels = searchsorted(values, im)
This works beautifully, however I am tried to make the if statement look nicer and just couldn't find a way. So lets look at the comparison first:
>>> values[0]
(0, 0, 0)
>>> values[0] == 0
False
>>> values[0] == (0, 0, 0)
False
>>> values[0] == array([0, 0, 0])
False
>>> values[0] == array([uint8(0), uint8(0), uint8(0)]).view(dtype=rgb_dtype)[0]
True
>>> values[0] == zeros((), dtype=rgb_dtype)
True
But what if you wanted something besides (0, 0, 0) or (1, 1, 1) and something that was not ridiculous looking? It seems like there should be an easier way to construct this, like rgb_dtype.create((0,0,0)).
Next with the insert statement, you need to insert 0 for (0, 0, 0). For other values this really does not work, for example inserting (1, 2, 3) actually inserts (1, 1, 1), (2, 2, 2), (3, 3, 3).
So in the end, is there a nicer way? Thanks!
I could make insert() work for your case doing (note that instead of 0 it is used [0]):
values = insert(values, [0], (1,2,3))
giving (for example):
array([(0, 1, 3), (0, 0, 0), (0, 0, 4), ..., (255, 255, 251), (255, 255, 253), (255, 255, 255)],
dtype=[('R', 'u1'), ('G', 'u1'), ('B', 'u1')])
Regarding another way to do your if, you can do this:
str(values[0]) == str((0,0,0))
or, perhaps more robust:
eval(str(values[0])) == eval(str(0,0,0))

Categories

Resources