How to resample timedeltas? - python

I have been running an experiment that outputs data with two columns:
seconds since start of experiment (float)
a measurement. (float)
I would now like to load this into Pandas to resample and plot the measurements. I've done this before, but in those cases my timestamps were seconds since the epoch or in datetime (YYYY-MM-DD HH:mm:ss) format. If I load my first column as plain numbers I'm unable to do
data.resample('5Min', how='mean')
It also does not seem possible if I convert my first column with timedelta(seconds=...). My question is: is it possible to resample this data without resorting to epoch conversion?

You can use groupby with time // period to do this:
import pandas as pd
import numpy as np
t = np.random.rand(10000)*3600  # seconds since start of experiment
t.sort()
v = np.random.rand(10000)       # measurements
df = pd.DataFrame({"time": t, "value": v})
period = 5*60                                   # bucket width in seconds
s = df.groupby(df.time // period).value.mean()  # mean per 5-minute bucket
s.index *= period                               # index = bucket start time (s)

I have the same structure of sensor data: the first column is seconds since the start of the experiment and the rest of the columns are values.
Here is the data structure:
time x y z
0 0.015948 0.403931 0.449005 -0.796860
1 0.036006 0.403915 0.448029 -0.795395
2 0.055885 0.404907 0.446548 -0.795853
Here is what worked for me:
Convert the time to a timedelta:
df.time = pd.to_timedelta(df.time, unit="s")
Set the time as the index:
df.set_index("time", inplace=True)
Resample to the frequency you want:
df.resample("40ms").mean()
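Put together, the three steps above make a short self-contained sketch (using the sample rows from the answer):

```python
import pandas as pd

# The three steps run end to end on the small sample above
df = pd.DataFrame({
    "time": [0.015948, 0.036006, 0.055885],
    "x": [0.403931, 0.403915, 0.404907],
    "y": [0.449005, 0.448029, 0.446548],
    "z": [-0.796860, -0.795395, -0.795853],
})

df.time = pd.to_timedelta(df.time, unit="s")  # float seconds -> Timedelta
df = df.set_index("time")                     # TimedeltaIndex enables resample
out = df.resample("40ms").mean()              # average into 40 ms bins

print(out)
```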

Related

Filter Dask DataFrame rows by specific values of index

Is there an efficient way to select specific rows in a Dask DataFrame?
I would like to get only those rows whose index is in a given set (using the isin function is not efficient enough for me).
Are there any more efficient alternatives to
ddf.loc[ddf.index.isin(list_of_index_values)]
ddf.loc[~ddf.index.isin(list_of_index_values)]
?
You can use the query method. You haven't provided a usable example, but the format would be something like this (note the @ prefix for referencing a local variable):
list_of_index_values = [6, 3]
ddf.query('column in @list_of_index_values')
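For reference, here is a minimal runnable pandas version of that pattern on synthetic data (Dask's query forwards the expression to pandas; the @ prefix lets the expression reference a local Python variable):

```python
import pandas as pd

# Hypothetical small frame standing in for the Dask DataFrame
df = pd.DataFrame({"column": [1, 3, 5, 6, 8]})

list_of_index_values = [6, 3]
kept = df.query("column in @list_of_index_values")         # rows to keep
dropped = df.query("column not in @list_of_index_values")  # the inverse

print(kept["column"].tolist())     # [3, 6]
print(dropped["column"].tolist())  # [1, 5, 8]
```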
EDIT: Just for fun, I benchmarked this in pandas, but I wouldn't expect much variance in Dask.
No clue what's stored in the index, so I assumed ints.
from random import randint
import pandas as pd
from datetime import datetime as dt
# build huge random dataset
lst = []
for i in range(100000000):
    lst.append(randint(0, 100000))
# build huge random index
index = []
for i in range(1000000):
    index.append(randint(0, 100000))
df = pd.DataFrame(lst, columns=['values'])
isin = dt.now()
df[df['values'].isin(index)]
print(f'total execution time for isin {dt.now()-isin}')
query = dt.now()
df.query('values in @index')
print(f'total execution time for query {dt.now()-query}')
# total execution time for isin 0:01:22.914507
# total execution time for query 0:01:13.794499
If your index is sequential, however:
time = dt.now()
df[df['values'] > 100000]
print(dt.now()-time)
# 0:00:00.128209
It's not even close. You can even build out a range:
time = dt.now()
df[(df['values'] > 100000) | (df['values'] < 500)]
print(dt.now()-time)
# 0:00:00.650321
Obviously the third method isn't always an option, but it is something to keep in mind if speed is a priority and you just need the index between two values or some such.

How to average 2D array columns based on date header

I am working on some glacier borehole temperature data consisting of ~1,000 rows by 700 columns. The vertical index is depth (i.e. as you move down the array, depth increases) and the column headers are datetime values (i.e. as you move right along the array, you move forwards in time).
I am looking for a way to average all temperatures in the columns depending on a date sampling rate. For example, the early datetimes have a spacing of 10 minutes, but the later datetimes have a spacing of six hours.
It would be good to be able to put in the sampling as an input and get out data based on that sampling rate so that I can see which one works best.
It would also be good that if I choose say 3 hour sampling this is simply ignored for spacing of above 3 hours and no change to the data is made in this case (i.e. datetime spacings of 10 minutes are averaged, but datetime spacings of 6 hours are left unaffected).
All of this needs to come out in either a pandas dataframe with date as column headers and depth as the index, or as a numpy array and separate list of datetimes.
I'm fairly new to Python, and this is my first question on stackoverflow!! Thanks :)
(I know the following is not totally correct use of Pandas, but it works for the figure slider I've produced!)
import numpy as np
import pandas as pd
#example array
T = np.array([[-2, -2, -2, -2.1, -2.3, -2.6],
              [-2.2, -2.3, -3, -3.1, -3.3, -3.3],
              [-4, -4, -4.5, -4.4, -4.6, -4.5]])
#example headers at 4 and then 12 hour spacing
headers = (pd.date_range(start='2018-04-24 00:00:00', end='2018-04-24 08:00:00', periods=3).tolist() +
           pd.date_range(start='2018-04-24 12:00:00', end='2018-04-25 12:00:00', periods=3).tolist())
#pandas dataframe in same setup as much larger one I'm using
T_df = pd.DataFrame(T, columns=headers)
One trick you can use is to convert your time series to a numeric series and then use the groupby method.
For instance, imagine you have
df = pd.DataFrame([['10:00:00', 1.], ['10:10:00', 2.], ['10:20:00', 3.], ['10:30:00', 4.]], columns=['Time', 'Value'])
df.Time = pd.to_datetime(df.Time, format='%X')
You can convert your time series to elapsed seconds (starting at 0) by:
df['DeltaT'] = (df.Time - df.Time.iloc[0]).dt.total_seconds().astype(int)
Then use the groupby method. You can for instance create a new column to floor the time interval you want:
myInterval = 1200.
df['group'] = (df['DeltaT']/myInterval).astype(int)
So you can use groupby followed by mean() (or a function you define):
df.groupby('group').mean()
Hope this helps!
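For the wide layout in the question itself (depth as the index, dates as the column headers), one hedged sketch is to transpose so the dates become a DatetimeIndex, resample along it, and transpose back; the frequencies here are illustrative, not from the original post:

```python
import numpy as np
import pandas as pd

# Depth rows, datetime column headers, as in the question
T = np.array([[-2.0, -2.0, -2.0, -2.1, -2.3, -2.6],
              [-2.2, -2.3, -3.0, -3.1, -3.3, -3.3],
              [-4.0, -4.0, -4.5, -4.4, -4.6, -4.5]])
headers = (pd.date_range('2018-04-24 00:00:00', periods=3, freq='4h').tolist()
           + pd.date_range('2018-04-24 12:00:00', periods=3, freq='12h').tolist())
T_df = pd.DataFrame(T, columns=headers)

# Transpose -> dates become the index -> average in 3-hour bins -> transpose
# back; empty bins are dropped so sparse stretches pass through untouched.
resampled = T_df.T.resample('3h').mean().T.dropna(axis=1, how='all')
print(resampled.shape)  # (3, 6): no two columns here fall in the same 3 h bin
```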

Python Pandas: fill a column using values from rows at an earlier timestamps

I have a dataframe df where one column is timestamp and one is A. Column A contains decimals.
I would like to add a new column B and fill it with the current value of A divided by the value of A one minute earlier. That is:
df['B'] (at time t) = df['A'] (at time t) / df['A'] (at time t - 1 min)
NOTE: The data does not come in exactly every 1 minute so "the row one minute earlier" means the row whose timestamp is the closest to (current - 1 minute).
Here is how I do it:
First, I use the timestamp as the index in order to use get_loc, and I create a new dataframe new_df starting from 1 minute after df. This way I'm sure I have all the data when I look 1 minute earlier within the first minute of data.
new_df = df.loc[df['timestamp'] > df.timestamp[0] + delta] # delta = 1 min timedelta
values = []
for index, row in new_df.iterrows():
    v = row.A / df.iloc[df.index.get_loc(row.timestamp - delta, method='nearest')]['A']
    values.append(v)
v_ser = pd.Series(values)
new_df['B'] = v_ser.values
I'm afraid this is not that great. It takes a long time for large dataframes. Also, I am not 100% sure the above is completely correct. Sometimes I get this message:
A value is trying to be set on a copy of a slice from a DataFrame. Try
using .loc[row_indexer,col_indexer] = value instead
What is the best / most efficient way to do the task above? Thank you.
PS. If someone can think of a better title please let me know. It took me longer to write the title than the post and I still don't like it.
You could try to use .asof() if the DataFrame has been indexed correctly by the timestamps (if not, use .set_index() first).
Simple example here:
import pandas as pd
import numpy as np
n_vals = 50
# Create a DataFrame with random values and 'unusual' times
df = pd.DataFrame(data=np.random.randint(low=1, high=6, size=n_vals),
                  index=pd.date_range(start=pd.Timestamp.now(),
                                      freq='23s', periods=n_vals),
                  columns=['value'])
# Demonstrate how to use .asof() to get the value that was the 'state' at
# the time 1 min before each index entry. Note the .values call
df['value_one_min_ago'] = df['value'].asof(df.index - pd.Timedelta('1m')).values
# Note that there will be some NaNs to deal with; consider .fillna()
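The ratio column the question asked for then follows by a simple division. Here is a self-contained variant of the example (with a fixed start time instead of Timestamp.now(), so it is reproducible):

```python
import pandas as pd
import numpy as np

# Self-contained version of the .asof() recipe, ending with the ratio
# column "B": current value divided by the value ~1 minute earlier.
n_vals = 50
df = pd.DataFrame(
    data=np.random.randint(low=1, high=6, size=n_vals),
    index=pd.date_range(start='2021-01-01', freq='23s', periods=n_vals),
    columns=['value'],
)

df['value_one_min_ago'] = df['value'].asof(df.index - pd.Timedelta('1m')).values
df['B'] = df['value'] / df['value_one_min_ago']  # NaN within the first minute

print(df.head())
```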

pandas out of bounds nanosecond timestamp after offset rollforward plus adding a month offset

I am confused how pandas blew out of bounds for datetime objects with these lines:
import pandas as pd
BOMoffset = pd.tseries.offsets.MonthBegin()
# here some code sets the all_treatments dataframe and the newrowix, micolix, mocolix counters
all_treatments.iloc[newrowix,micolix] = BOMoffset.rollforward(all_treatments.iloc[i,micolix] + pd.tseries.offsets.DateOffset(months = x))
all_treatments.iloc[newrowix,mocolix] = BOMoffset.rollforward(all_treatments.iloc[newrowix,micolix]+ pd.tseries.offsets.DateOffset(months = 1))
Here all_treatments.iloc[i,micolix] is a datetime set by pd.to_datetime(all_treatments['INDATUMA'], errors='coerce',format='%Y%m%d'), and INDATUMA is date information in the format 20070125.
This logic seems to work on mock data (no errors, dates make sense), so at the moment I cannot reproduce the issue, but on my full data it fails with the following error:
pandas.tslib.OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 2262-05-01 00:00:00
Since pandas represents timestamps in nanosecond resolution, the timespan that can be represented using a 64-bit integer is limited to approximately 584 years
In [54]: pd.Timestamp.min
Out[54]: Timestamp('1677-09-21 00:12:43.145225')
In [55]: pd.Timestamp.max
Out[55]: Timestamp('2262-04-11 23:47:16.854775807')
And your value, 2262-05-01 00:00:00, is outside this range, hence the out-of-bounds error.
Straight out of: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timestamp-limitations
Workaround:
This will force the dates which are outside the bounds to NaT:
pd.to_datetime(date_col_to_force, errors='coerce')
Setting the errors parameter in pd.to_datetime to 'coerce' causes replacement of out of bounds values with NaT. Quoting the docs:
If ‘coerce’, then invalid parsing will be set as NaT
E.g.:
datetime_variable = pd.to_datetime(datetime_variable, errors = 'coerce')
This does not fix the data (obviously), but still allows processing the non-NaT data points.
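A quick runnable illustration on synthetic values (the first date is past pd.Timestamp.max, the second is in range):

```python
import pandas as pd

raw = pd.Series(['3000-01-01', '2000-06-19'])

# Out-of-bounds dates become NaT; in-range dates convert normally
converted = pd.to_datetime(raw, errors='coerce')
print(converted.isna().tolist())  # [True, False]
```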
The reason you are seeing this error message
"OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 3000-12-23 00:00:00" is that the pandas timestamp data type stores dates in nanosecond resolution (from the docs).
Which means the date values have to be in the range
pd.Timestamp.min (1677-09-21 00:12:43.145225) and
pd.Timestamp.max (2262-04-11 23:47:16.854775807)
Even if you only want the date with a resolution of seconds or microseconds, pandas will still store it internally in nanoseconds. There is no option in pandas to store a timestamp outside of the above-mentioned range.
This is surprising because databases like SQL Server and libraries like numpy allow storing dates beyond this range, and at most 64 bits are used in most cases to store the date.
But here is the difference.
SQL Server stores dates in nanosecond resolution, but only up to an accuracy of 100 ns (as opposed to 1 ns in pandas). Since the space is limited (64 bits), it's a matter of range vs. accuracy. With the pandas timestamp we have higher accuracy but a lower date range.
In the case of numpy's datetime64 data type (pandas is built on top of numpy):
if the date falls in the above-mentioned range, you can store it in nanoseconds, which is similar to pandas;
or you can give up the nanosecond resolution and go with, e.g., microseconds, which will give you a much larger range. This is something that is missing in the pandas timestamp type.
However, if you choose to store in nanoseconds and the date is outside the range, then numpy will silently wrap around this date and you might get unexpected results (referenced below in the 4th solution).
np.datetime64("3000-06-19T08:17:14.073456178", "ns")
> numpy.datetime64('1831-05-11T09:08:06.654352946')
Now with pandas we have the options below.
import pandas as pd
data = {'Name': ['John', 'Sam'], 'dob': ['3000-06-19T08:17:14', '2000-06-19T21:17:14']}
my_df = pd.DataFrame(data)
1) If you are OK with losing the data which is out of range, then simply use the param below to convert the out-of-range dates to NaT (not a time):
my_df['dob'] = pd.to_datetime(my_df['dob'], errors='coerce')
2) If you don't want to lose the data, you can convert the values into Python datetime objects. Here the column "dob" is of pandas type object, but the individual values will be of type Python datetime. However, doing this we lose the benefit of vectorized functions:
import datetime as dt
my_df['dob'] = my_df['dob'].apply(lambda x: dt.datetime.strptime(x, '%Y-%m-%dT%H:%M:%S') if type(x) == str else pd.NaT)
print(type(my_df.iloc[0][1]))
> <class 'datetime.datetime'>
3) Another option is to use numpy instead of a pandas series, if possible. In the case of a pandas dataframe, you can convert a series (or column in a df) to a numpy array, process the data separately, and then join it back to the dataframe.
4) We can also use pandas timespans as suggested in the docs. Do check out the difference between timestamp and period before using this data type. Date range and frequency here work similarly to numpy (mentioned above in the numpy section):
my_df['dob'] = my_df['dob'].apply(lambda x: pd.Period(x, freq='ms'))
You can try strptime() from the datetime library along with a lambda expression to convert text to date values in a series object:
Example:
df['F'].apply(lambda x: datetime.datetime.strptime(x, '%m/%d/%Y %I:%M:%S') if type(x) == str else np.NaN)
None of the above is great, because coercing deletes your data. But you can keep the values and adjust the conversion instead:
# converting from epoch to datetime, keeping the nanosecond timestamp
xbarout = pd.to_datetime(xbarout.iloc[:,0], unit='ns')
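For instance, on a hypothetical epoch-nanoseconds value, to_datetime with unit='ns' keeps the full sub-second precision:

```python
import pandas as pd

# Hypothetical epoch-nanoseconds reading (not from the original post)
ns = pd.Series([1_171_502_618_012_345_678])

ts = pd.to_datetime(ns, unit='ns')  # nanosecond precision is preserved
print(ts.iloc[0].microsecond, ts.iloc[0].nanosecond)  # 12345 678
```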

Time Series plot of timestamps in monthly buckets using Python/Pandas [duplicate]

I'm trying to calculate daily sums of values using pandas. Here's the test file - http://pastebin.com/uSDfVkTS
This is the code I came up so far:
import numpy as np
import datetime as dt
import pandas as pd
f = np.genfromtxt('test', dtype=[('datetime', '|S16'), ('data', '<i4')], delimiter=',')
dates = [dt.datetime.strptime(i, '%Y-%m-%d %H:%M') for i in f['datetime']]
s = pd.Series(f['data'], index = dates)
d = s.resample('D', how='sum')
Using the given test file this produces:
2012-01-02 1128
Freq: D
First problem is that calculated sum corresponds to the next day. I've been able to solve that by using parameter loffset='-1d'.
Now the actual problem is that the data may start not from 00:30 of a day but at any time of a day. Also the data has gaps filled with 'nan' values.
That said, is it possible to set a lower threshold of number of values that are necessary to calculate daily sums? (e.g. if there're less than 40 values in a single day, then put NaN instead of a sum)
I believe that it is possible to define a custom function to do that and refer to it in 'how' parameter, but I have no clue how to code the function itself.
You can do it directly in Pandas:
s = pd.read_csv('test', header=None, index_col=0, parse_dates=True)
d = s.groupby(lambda x: x.date()).aggregate(lambda x: sum(x) if len(x) >= 40 else np.nan)
X.2
2012-01-01 1128
A much easier way is to use pd.Grouper:
d = s.groupby(pd.Grouper(freq='1D')).sum()
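On recent pandas, the "at least 40 values per day" threshold can also be expressed without a lambda: sum(min_count=...) yields NaN for any bin with fewer than that many non-NA values. A sketch on synthetic data (not the original pastebin file):

```python
import numpy as np
import pandas as pd

# Two days of half-hourly data: day 1 complete, day 2 mostly missing.
# min_count makes the sparse day come out as NaN instead of a partial sum.
idx = pd.date_range('2012-01-01 00:30', periods=96, freq='30min')
s = pd.Series(1.0, index=idx)
s.iloc[50:] = np.nan  # knock out most of the second day

d = s.resample('D').sum(min_count=40)
print(d)  # day 1: 47.0, later days: NaN
```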
