I'm having trouble using a custom transformation with an expanding window. I have a function:
def minutes_since_previous(x):
    '''given datetime-indexed series x,
    returns time difference in minutes between the last datetime
    and the datetime of the previous nonnull value in each column.'''
    x = pd.Series(x)  # assumes already sorted, else add .sort_index()
    tfinal = x.index[-1]
    allprev = x[:-1].dropna()  # all previous non-nulls
    if len(allprev) > 0:
        tprev = allprev.index[-1]
        return tfinal - tprev
    else:
        return np.nan
and a series:
x
Out[116]:
datetime
2017-06-05 22:01:20.000 81.070099
2017-06-05 22:25:24.235 NaN
2017-06-05 22:33:13.000 NaN
2017-06-05 23:10:06.208 NaN
2017-06-05 23:11:40.437 NaN
2017-06-05 23:30:32.911 NaN
2017-06-06 06:19:41.934 NaN
2017-06-06 06:37:11.432 NaN
2017-06-06 06:41:37.000 93.681000
dtype: float32
The function works as expected on the series as a whole:
In [117]:
minutes_since_previous(x)
Out[117]:
Timedelta('0 days 08:40:17')
But when applied to the expanding window, I lose the datetime nature of the index, and instead of giving timedeltas since the first timestamp, I just get integer differences:
In [118]:
x.expanding().apply(minutes_since_previous)
Out[118]:
datetime
2017-06-05 22:01:20.000 NaN
2017-06-05 22:25:24.235 1.0
2017-06-05 22:33:13.000 2.0
2017-06-05 23:10:06.208 3.0
2017-06-05 23:11:40.437 4.0
2017-06-05 23:30:32.911 5.0
2017-06-06 06:19:41.934 6.0
2017-06-06 06:37:11.432 7.0
2017-06-06 06:41:37.000 8.0
dtype: float64
Perhaps .expanding() resets the index internally, but if so, how can I get what I'm looking for?
I searched "pandas expanding index" and found a question about multiple columns,
and many, many that were really about various kinds of resampling, but none dealing with losing index information.
Thanks ...
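One way to sidestep `.expanding().apply()` altogether is to work on the index directly. The following is only a minimal sketch, assuming a sorted, unique DatetimeIndex; the names `stamps`, `prev_valid`, and `since_prev` are made up for illustration, and the series is rebuilt by hand from the output shown above.

```python
import numpy as np
import pandas as pd

# Rebuild the example series from the question (values copied from the output above).
idx = pd.to_datetime([
    "2017-06-05 22:01:20.000", "2017-06-05 22:25:24.235",
    "2017-06-05 22:33:13.000", "2017-06-05 23:10:06.208",
    "2017-06-05 23:11:40.437", "2017-06-05 23:30:32.911",
    "2017-06-06 06:19:41.934", "2017-06-06 06:37:11.432",
    "2017-06-06 06:41:37.000",
])
x = pd.Series([81.070099] + [np.nan] * 7 + [93.681], index=idx, dtype="float32")

# Timestamp of each row where the value is non-null, NaT elsewhere.
stamps = x.index.to_series().where(x.notna())

# Shift by one row so a row never counts itself, then carry the
# most recent non-null timestamp forward.
prev_valid = stamps.shift(1).ffill()

# Timedelta since the previous non-null observation (NaT for the first row).
since_prev = x.index.to_series() - prev_valid

# The same result expressed in minutes, as a float series.
minutes = since_prev.dt.total_seconds() / 60
```

Because everything stays on the original index, the datetime information is never lost, and no Python-level function is called per window.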
I have a pd.Series object with a pd.DatetimeIndex containing dates. I would like to calculate the difference from a past value, for example the one month before. The values are not exactly aligned to the months, so I cannot simply add a monthly date offset. There might also be missing data.
So I would like to match the previous value using an offset and a tolerance. One way to do this is using the .reindex() method with method='nearest' which matches the previous data point almost like I want to:
shifted = data.copy()
shifted.index = shifted.index + pd.DateOffset(months=1)
shifted = shifted.reindex(
    data.index,
    method="nearest",
    tolerance=timedelta(days=100),
)
return data - shifted
Here we calculate the difference from the value one month before, tolerating a match up to 100 days away from that timestamp.
This is almost what I want, but I want to avoid subtracting the value from itself. I always want to subtract a value in the past, or no value at all.
For example: if this is the data
2020-01-02 1.0
2020-02-03 2.0
2020-04-05 3.0
and I use the code above, the last data point, 3.0, is subtracted from itself, because its date (2020-04-05) is closer to its own shifted date, 2020-05-05, than to 2020-03-03. The result is
2020-01-02 0.0
2020-02-03 1.0
2020-04-05 0.0
While the goal is to get
2020-01-02 NaN
2020-02-03 1.0
2020-04-05 1.0
Additional edit after Baron Legendre's answer (thanks for pointing out the flaw in my question):
The tolerance variable is also important to me. So let's say there is a gap of a year in the data, that falls outside the tolerance of 100 days, and the result should be NaN:
2015-12-04 10.0
2020-01-02 1.0
2020-02-03 2.0
2020-04-05 3.0
Should result in:
2015-12-04 NaN (because there is no past value to subtract)
2020-01-02 NaN (because the past value is too far back)
2020-02-03 1.0
2020-04-05 1.0
Hope that explains the problem well enough. Any ideas on how to do this efficiently, without looping over every single data point?
ser
###
2015-12-04 10
2020-01-02 1
2020-02-03 2
2020-04-05 3
dtype: int64
df = ser.reset_index()
tdiff = df['index'].diff().dt.days
ser[:] = np.where(tdiff > 100, np.nan, ser - ser.shift())
ser
###
2015-12-04 NaN
2020-01-02 NaN
2020-02-03 1.0
2020-04-05 1.0
dtype: float64
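If the one-month offset from the question matters, a different approach is `pd.merge_asof`. This is a hedged sketch rather than a drop-in replacement: it uses a backward search on the shifted dates instead of `method="nearest"`, so it picks the most recent value that is at least one month old, and the column names `date`, `value`, and `past` are made up for illustration.

```python
import pandas as pd

# Example data from the question, including the 2015 outlier.
data = pd.Series(
    [10.0, 1.0, 2.0, 3.0],
    index=pd.to_datetime(["2015-12-04", "2020-01-02", "2020-02-03", "2020-04-05"]),
)

# Left side: the observations as they are.
left = data.rename("value").rename_axis("date").reset_index()

# Right side: the same observations moved one month into the future.
right = left.rename(columns={"value": "past"})
right["date"] = right["date"] + pd.DateOffset(months=1)
right = right.sort_values("date")

# For each observation, take the latest shifted row at or before it,
# but only if it lies within the 100-day tolerance.
matched = pd.merge_asof(
    left,
    right,
    on="date",
    direction="backward",
    tolerance=pd.Timedelta(days=100),
)

result = pd.Series((matched["value"] - matched["past"]).to_numpy(), index=data.index)
```

A row can never match itself here, because its own shifted date always lies one month after it and the search only looks backward.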
I have a dataframe similar to the one below.
dry_bulb_tmp_mean Time Timedelta
Date
2011-01-14 -11.245833 2011-01-14 NaN
2011-01-15 -12.608333 2011-01-15 1.0
2011-01-16 -15.700000 2011-01-16 1.0
2011-01-17 -19.954167 2011-01-17 1.0
2011-01-20 -13.654167 2011-01-20 3.0
2011-01-21 -11.887500 2011-01-21 1.0
2011-01-22 -17.866667 2011-01-22 1.0
2011-01-23 -23.250000 2011-01-23 1.0
2011-01-24 -23.654167 2011-01-24 1.0
2011-01-25 -16.254167 2011-01-25 1.0
2011-01-30 -12.233333 2011-01-30 5.0
2011-01-31 -19.041667 2011-01-31 1.0
I am tasked with creating a new dataframe that gives me the lengths for different runs. Basically, a run is however many consecutive days occur in the dataframe. For example, from the 14th, to the 17th I get a run of 4, but then at the 20th I get a run of 1. I have attempted to do this as follows.
if temp_persis33['Timedelta'].iloc[row] == 1:
    length += 1
Every time a value greater than 1 is found in the Timedelta column, it will append the counter to a list, and then reset the counter. However, I am not sure how to compare the values in the dataframe. I have tried a few different things and nothing has worked. Any help would be appreciated, thank you.
IIUC try grouping on a boolean array where Timedelta does not equal 1
df.groupby(df['Timedelta'].ne(1).cumsum())['Time'].count().to_numpy()
# array([4, 6, 2])
You don't even need to create the Timedelta column. You can apply diff directly to the Time column, since Series.diff() works on datetime values; convert the resulting timedeltas to days before comparing with 1:
df.groupby(df['Time'].diff().dt.days.ne(1).cumsum())['Time'].count()
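To check the grouping idea end to end, here is a small self-contained sketch. It assumes `Time` is a datetime64 column; the dates are copied from the question, while `gap_days`, `run_id`, and `run_lengths` are illustrative names.

```python
import pandas as pd

# Dates from the question's dataframe.
df = pd.DataFrame({"Time": pd.to_datetime([
    "2011-01-14", "2011-01-15", "2011-01-16", "2011-01-17",
    "2011-01-20", "2011-01-21", "2011-01-22", "2011-01-23",
    "2011-01-24", "2011-01-25", "2011-01-30", "2011-01-31",
])})

# Gap to the previous row in whole days (NaN for the first row).
gap_days = df["Time"].diff().dt.days

# A new run starts wherever the gap is not exactly one day;
# the cumulative sum turns those starts into run labels.
run_id = gap_days.ne(1).cumsum()

# Number of rows in each run, in order of appearance.
run_lengths = df.groupby(run_id)["Time"].count().tolist()
print(run_lengths)  # [4, 6, 2]
```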
Given previous datetime values in a Pandas DataFrame--either as an index or as values in a column--is there a way to "autofill" remaining time increments based on the previous fixed increments?
For example, given:
import pandas as pd
import numpy as np
df = pd.DataFrame({'B': [0, 1, 2, np.nan, 4]},
index = [pd.Timestamp('20130101 09:00:00'),
pd.Timestamp('20130101 09:00:05'),
pd.Timestamp('20130101 09:00:10'),
np.nan,
np.nan])
I would like to apply a function to yield:
                       B
2013-01-01 09:00:00  0.0
2013-01-01 09:00:05  1.0
2013-01-01 09:00:10  2.0
2013-01-01 09:00:15  NaN
2013-01-01 09:00:20  4.0
Where I have missing timesteps for my last two data points. Here, timesteps are fixed in 5 second increments.
This will be for thousands of rows. While I might reset_index and then create a function to apply to each row, this seems cumbersome. Is there a slick or built-in way to do this that I'm not finding?
Assuming the first index value is a valid datetime and all the values are spaced 5s apart, you could do the following:
df.index = pd.date_range(df.index[0], periods=len(df), freq='5s')
>>> df
B
2013-01-01 09:00:00 0.0
2013-01-01 09:00:05 1.0
2013-01-01 09:00:10 2.0
2013-01-01 09:00:15 NaN
2013-01-01 09:00:20 4.0
This solution might work for you, but it also uses the reset_index() function.
new_dateindex = pd.Series(pd.date_range(start=pd.Timestamp('20130101 09:00:00'), periods=1000, freq='5s'), name='Date')
# 'periods=1000' can be changed to 'periods=len(df.index)'
df.reset_index().join(new_dateindex,how='right')
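If the step size is not known in advance, it could be inferred from the timestamps that are present. This is a minimal sketch, assuming the first two index values are valid and the spacing is constant throughout; `idx` and `step` are illustrative names.

```python
import numpy as np
import pandas as pd

# Frame from the question: the last two index entries are missing.
df = pd.DataFrame(
    {"B": [0, 1, 2, np.nan, 4]},
    index=[
        pd.Timestamp("20130101 09:00:00"),
        pd.Timestamp("20130101 09:00:05"),
        pd.Timestamp("20130101 09:00:10"),
        np.nan,
        np.nan,
    ],
)

# Normalise the index to datetimes (missing entries become NaT).
idx = pd.to_datetime(df.index)

# Infer the fixed increment from the first two timestamps.
step = idx[1] - idx[0]

# Rebuild the whole index from the first timestamp and the inferred step.
df.index = pd.date_range(idx[0], periods=len(df), freq=step)
```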
I have a dataframe (df) where column A holds the drug units dosed at the time given by Timestamp. I want to fill the missing values (NaN) with the drug concentration implied by the half-life of the drug (180 minutes). I am struggling with the code in pandas and would really appreciate help and insight. Thanks in advance.
df
A
Timestamp
1991-04-21 09:09:00 9.0
1991-04-21 3:00:00 NaN
1991-04-21 9:00:00 NaN
1991-04-22 07:35:00 10.0
1991-04-22 13:40:00 NaN
1991-04-22 16:56:00 NaN
Given that the half-life of the drug is 180 minutes, I want to fill the NaN values as a function of the time elapsed and the half-life,
something like:
Timestamp A
1991-04-21 09:00:00 9.0
1991-04-21 3:00:00 ~2.25
1991-04-21 9:00:00 ~0.55
1991-04-22 07:35:00 10.0
1991-04-22 13:40:00 ~2.5
1991-04-22 16:56:00 ~0.75
Your timestamps are not sorted and I'm assuming this was a typo. I fixed it below.
import pandas as pd
import numpy as np
from io import StringIO
text = """TimeStamp  A
1991-04-21 09:09:00  9.0
1991-04-21 13:00:00  NaN
1991-04-21 19:00:00  NaN
1991-04-22 07:35:00  10.0
1991-04-22 13:40:00  NaN
1991-04-22 16:56:00  NaN"""
df = pd.read_csv(StringIO(text), sep=r'\s{2,}', engine='python', parse_dates=[0])
This is the magic code.
# half-life of 180 minutes is 10,800 seconds
# we need to calculate lamda (intentionally mis-spelled)
lamda = 10800 / np.log(2)
# returns time difference for each element
# relative to first element
def time_diff(x):
    return x - x.iloc[0]
# create partition of non-nulls with subsequent nulls
partition = df.A.notnull().cumsum()
# calculate time differences in seconds for each
# element relative to most recent non-null observation
# use .dt accessor and method .total_seconds()
tdiffs = df.TimeStamp.groupby(partition, group_keys=False).apply(time_diff).dt.total_seconds()
# apply exponential decay
decay = np.exp(-tdiffs / lamda)
# finally, forward fill the observations and multiply by decay
decay * df.A.ffill()
0 9.000000
1 3.697606
2 0.924402
3 10.000000
4 2.452325
5 1.152895
dtype: float64
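The same decay can also be written directly in terms of the half-life, since exp(-t / lamda) with lamda = T / ln 2 equals 0.5 ** (t / T). The sketch below is a variant under that identity, not the answer's exact code; `half_life`, `last_dose_time`, `elapsed`, and `filled` are illustrative names.

```python
import numpy as np
import pandas as pd

# Corrected data from the answer above.
df = pd.DataFrame({
    "TimeStamp": pd.to_datetime([
        "1991-04-21 09:09:00", "1991-04-21 13:00:00", "1991-04-21 19:00:00",
        "1991-04-22 07:35:00", "1991-04-22 13:40:00", "1991-04-22 16:56:00",
    ]),
    "A": [9.0, np.nan, np.nan, 10.0, np.nan, np.nan],
})

half_life = pd.Timedelta(minutes=180)

# Time of the most recent dose for every row (the row itself if it is a dose).
last_dose_time = df["TimeStamp"].where(df["A"].notna()).ffill()

# Elapsed time since that dose; zero on the dose rows themselves.
elapsed = df["TimeStamp"] - last_dose_time

# Carry the dose forward and halve it once per elapsed half-life.
filled = df["A"].ffill() * 0.5 ** (elapsed / half_life)
```

Dividing a timedelta series by a `Timedelta` yields plain floats, so no conversion to seconds is needed.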
Let's say I have financial data in a pandas.Series, called fin_series.
Here's a peek at fin_series.
In [565]: fin_series
Out[565]:
Date
2008-05-16 1000.000000
2008-05-19 1001.651747
2008-05-20 1004.137434
...
2014-12-22 1158.085200
2014-12-23 1150.139126
2014-12-24 1148.934665
Name: Close, Length: 1665
I'm interested in looking at the quarterly endpoints of the data. However, not all financial trading days fall exactly on the 'end of the quarter.'
For example:
In [566]: fin_series.asfreq('q')
Out[566]:
2008-06-30 976.169624
2008-09-30 819.518923
2008-12-31 760.429261
...
2009-06-30 795.768956
2009-09-30 870.467121
2009-12-31 886.329978
...
2011-09-30 963.304679
2011-12-31 NaN
2012-03-31 NaN
....
2012-09-30 NaN
2012-12-31 1095.757137
2013-03-31 NaN
2013-06-30 NaN
...
2014-03-31 1138.548881
2014-06-30 1168.248194
2014-09-30 1147.000073
Freq: Q-DEC, Name: Close, dtype: float64
Here's a little function that accomplishes what I'd like, along with the desired end result.
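import numpy
def bmg_qt_asfreq(series):
    # True where the next observation falls in a different quarter
    ind = series[1:].index.quarter != series[:-1].index.quarter
    # the final observation always closes its quarter
    ind = numpy.append(ind, True)
    return series[ind]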
which gives me:
In [15]: bmg_qt_asfreq(tmp)
Out[15]:
Date
2008-06-30 976.169425
2008-09-30 819.517607
2008-12-31 760.428770
...
2011-09-30 963.252831
2011-12-30 999.742132
2012-03-30 1049.848583
...
2012-09-28 1086.689824
2012-12-31 1093.943357
2013-03-28 1117.111859
Name: Close, dtype: float64
Note that I'm preserving the dates of the "closest prior price," instead of simply using pandas.asfreq(freq = 'q', method = 'ffill'), as the preservation of dates that exist within the original Series.Index is crucial.
This seems like a silly problem that many people have had and must be addressed by all of the pandas time manipulation functionality, but I can't figure out how to do it with resample or asfreq.
Anyone who could show me the builtin pandas functionality to accomplish this would be greatly appreciated.
Regards,
Assuming the input is a pandas Series, first do
import pandas as pd
fin_series.resample("q").apply(pd.Series.last_valid_index)
to get a series with the last non-NA index for each quarter. Then
fin_series.resample("q").last()
for the last non-NA value. You can then join these together. As you suggested in your comment:
fin_series[fin_series.resample("q").apply(pd.Series.last_valid_index)]
df.asfreq('d').interpolate().asfreq('q')
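Another option that keeps the original trading dates is to group by calendar quarter and take the last row of each group. This is a sketch on synthetic data, not the asker's series: `prices` is a made-up stand-in built on business days, and `quarter_ends` is an illustrative name.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for fin_series: one value per business day.
dates = pd.bdate_range("2011-10-03", "2012-06-29")
prices = pd.Series(np.linspace(1000.0, 1100.0, len(dates)), index=dates, name="Close")

# Label every observation with its calendar quarter.
quarters = prices.index.to_period("Q")

# Keep the last observation in each quarter; tail() preserves
# the original index, so the real trading dates survive.
quarter_ends = prices.groupby(quarters).tail(1)
```

On this synthetic calendar the selected dates are 2011-12-30, 2012-03-30, and 2012-06-29, the last business days of their quarters, rather than the calendar quarter ends.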