I'm having trouble using a custom transformation with an expanding window. I have a function:
def minutes_since_previous(x):
    '''given datetime-indexed series x,
    returns time difference in minutes between the last datetime
    and the datetime of the previous nonnull value in each column.'''
    x = pd.Series(x)  # assumes already sorted, else add .sort_index()
    tfinal = x.index[-1]  # last timestamp in the window
    allprev = x[:-1].dropna()  # all previous nonnulls
    if len(allprev) > 0:
        tprev = allprev.index[-1]
        return tfinal - tprev
    return np.nan
and a series:
2017-06-05 22:01:20.000 81.070099
2017-06-05 22:25:24.235 NaN
2017-06-05 22:33:13.000 NaN
2017-06-05 23:10:06.208 NaN
2017-06-05 23:11:40.437 NaN
2017-06-05 23:30:32.911 NaN
2017-06-06 06:19:41.934 NaN
2017-06-06 06:37:11.432 NaN
2017-06-06 06:41:37.000 93.681000
dtype: float32
The function works as expected on the series as a whole:
In [117]:
Timedelta('0 days 08:40:17')
But when applied to the expanding window, I lose the datetime nature of the index, and instead of giving timedeltas since the first timestamp, I just get integer differences:
In [118]:
2017-06-05 22:01:20.000 NaN
2017-06-05 22:25:24.235 1.0
2017-06-05 22:33:13.000 2.0
2017-06-05 23:10:06.208 3.0
2017-06-05 23:11:40.437 4.0
2017-06-05 23:30:32.911 5.0
2017-06-06 06:19:41.934 6.0
2017-06-06 06:37:11.432 7.0
2017-06-06 06:41:37.000 8.0
dtype: float64
Perhaps .expanding() resets the index internally, but if so, how can I get what I'm looking for?
I searched "pandas expanding index" and found a question about multiple columns,
and many, many that were really about various kinds of resampling, but none dealing with losing index information.
Thanks ...
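One way to sidestep the problem is to skip .expanding() altogether and work on the index directly. The following is a minimal sketch, not a fix to expanding itself; the helper name time_since_previous_valid is made up for illustration, and the sample data is the series shown above.

```python
import numpy as np
import pandas as pd

def time_since_previous_valid(s):
    """For each row, time elapsed since the latest earlier non-null value."""
    stamps = s.index.to_series()
    # keep a timestamp only where the value is non-null, then carry it forward
    last_valid = stamps.where(s.notna()).ffill()
    # shift by one row so a row never counts itself as "previous"
    return stamps - last_valid.shift()

idx = pd.to_datetime([
    "2017-06-05 22:01:20.000", "2017-06-05 22:25:24.235",
    "2017-06-05 22:33:13.000", "2017-06-05 23:10:06.208",
    "2017-06-05 23:11:40.437", "2017-06-05 23:30:32.911",
    "2017-06-06 06:19:41.934", "2017-06-06 06:37:11.432",
    "2017-06-06 06:41:37.000",
])
values = [81.070099] + [np.nan] * 7 + [93.681]
ser = pd.Series(values, index=idx, dtype="float32")

result = time_since_previous_valid(ser)
# result is a timedelta Series; minutes would be result.dt.total_seconds() / 60
```

The first row has no earlier value, so it comes out as NaT, and the last row reproduces the Timedelta obtained from the whole-series call.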
I have a pd.Series object with a pd.DatetimeIndex containing dates. I would like to calculate the difference from a past value, for example the one month before. The values are not exactly aligned to the months, so I cannot simply add a monthly date offset. There might also be missing data.
So I would like to match the previous value using an offset and a tolerance. One way to do this is using the .reindex() method with method='nearest' which matches the previous data point almost like I want to:
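def diff_from_past(data):  # wrapper name is illustrative; the original def line was lost
    shifted = data.copy()
    shifted.index = shifted.index + pd.DateOffset(months=1)
    # match each original date to the nearest shifted date, within 100 days
    shifted = shifted.reindex(
        data.index, method='nearest', tolerance=pd.Timedelta(days=100))
    return data - shifted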
Here we calculate the difference from the value one month before, but we tolerate finding a value 100 days around that timestamp.
This is almost what I want, but I want to avoid subtracting the value from itself. I always want to subtract a value in the past, or no value at all.
For example: if this is the data
2020-01-02 1.0
2020-02-03 2.0
2020-04-05 3.0
And I use the code above, the last data point, 3.0 will be subtracted from itself, since its date is closer to 2020-05-05 than to 2020-03-03. And the result will be
2020-01-02 0.0
2020-02-03 1.0
2020-04-05 0.0
While the goal is to get
2020-01-02 NaN
2020-02-03 1.0
2020-04-05 1.0
Additional edit after Baron Legendre's answer (thanks for pointing out the flaw in my question):
The tolerance variable is also important to me. So let's say there is a gap of a year in the data, that falls outside the tolerance of 100 days, and the result should be NaN:
2015-12-04 10.0
2020-01-02 1.0
2020-02-03 2.0
2020-04-05 3.0
Should result in:
2015-12-04 NaN (because there is no past value to subtract)
2020-01-02 NaN (because the past value is too far back)
2020-02-03 1.0
2020-04-05 1.0
Hope that explains the problem well enough. Any ideas on how to do this efficiently, without looping over every single data point?
2015-12-04 10
2020-01-02 1
2020-02-03 2
2020-04-05 3
dtype: int64
df = ser.reset_index()
tdiff = df['index'].diff().dt.days
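# build a new float series; writing NaN into the int64 series in place is deprecated in recent pandas
ser = pd.Series(np.where(tdiff > 100, np.nan, ser - ser.shift()), index=ser.index)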
2015-12-04 NaN
2020-01-02 NaN
2020-02-03 1.0
2020-04-05 1.0
dtype: float64
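If the one-month offset from the question should be kept, an as-of merge is another option. This is only a sketch under assumptions: it swaps the "nearest" match for a "backward" match, meaning it takes the last observation at or before the offset date, so a point can never be matched with itself or a later one. The function name diff_from_past_asof and the column names are made up for illustration.

```python
import pandas as pd

def diff_from_past_asof(data, offset=pd.DateOffset(months=1),
                        tolerance=pd.Timedelta(days=100)):
    """Subtract the last value observed at or before (date - offset)."""
    # left side: the date we want to look up for every observation
    left = pd.DataFrame({"target": data.index - offset,
                         "value": data.to_numpy()})
    # right side: the observations themselves
    right = pd.DataFrame({"date": data.index, "past": data.to_numpy()})
    merged = pd.merge_asof(left, right, left_on="target", right_on="date",
                           direction="backward", tolerance=tolerance)
    diff = merged["value"] - merged["past"]
    return pd.Series(diff.to_numpy(), index=data.index)

data = pd.Series(
    [10.0, 1.0, 2.0, 3.0],
    index=pd.to_datetime(["2015-12-04", "2020-01-02",
                          "2020-02-03", "2020-04-05"]),
)
result = diff_from_past_asof(data)
```

Both inputs to merge_asof must be sorted by their keys, so this assumes the series is already sorted by date. On the example data it gives NaN, NaN, 1.0, 1.0.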
I have a dataframe similar to the one below.
dry_bulb_tmp_mean Time Timedelta
2011-01-14 -11.245833 2011-01-14 NaN
2011-01-15 -12.608333 2011-01-15 1.0
2011-01-16 -15.700000 2011-01-16 1.0
2011-01-17 -19.954167 2011-01-17 1.0
2011-01-20 -13.654167 2011-01-20 3.0
2011-01-21 -11.887500 2011-01-21 1.0
2011-01-22 -17.866667 2011-01-22 1.0
2011-01-23 -23.250000 2011-01-23 1.0
2011-01-24 -23.654167 2011-01-24 1.0
2011-01-25 -16.254167 2011-01-25 1.0
2011-01-30 -12.233333 2011-01-30 5.0
2011-01-31 -19.041667 2011-01-31 1.0
I am tasked with creating a new dataframe that gives me the lengths for different runs. Basically, a run is however many consecutive days occur in the dataframe. For example, from the 14th, to the 17th I get a run of 4, but then at the 20th I get a run of 1. I have attempted to do this as follows.
if temp_persis33['Timedelta'].iloc[row] == 1:
    length += 1
Every time a value greater than 1 is found in the Timedelta column, it will append the counter to a list, and then reset the counter. However, I am not sure how to compare the values in the dataframe. I have tried a few different things and nothing has worked. Any help would be appreciated, thank you.
IIUC try grouping on a boolean array where Timedelta does not equal 1
# array([4, 6, 2])
You don't even need to create the Timedelta column. You can apply diff directly to the Time column, since diff() works directly with datetime values.
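The code for these two answers did not survive, so the following is a reconstruction of the idea rather than the original snippet: take the day gap with diff(), start a new run wherever the gap is not 1, and count rows per run. The variable names are illustrative, and the dates are the ones from the question.

```python
import pandas as pd

dates = pd.to_datetime([
    "2011-01-14", "2011-01-15", "2011-01-16", "2011-01-17",
    "2011-01-20", "2011-01-21", "2011-01-22", "2011-01-23",
    "2011-01-24", "2011-01-25", "2011-01-30", "2011-01-31",
])
df = pd.DataFrame({"Time": dates})

# gap in days to the previous row (NaN for the first row)
gap = df["Time"].diff().dt.days
# a new run starts wherever the gap is not exactly one day
run_id = gap.ne(1).cumsum()
# number of rows in each run
run_lengths = df["Time"].groupby(run_id).size().to_numpy()
```

The first row has a NaN gap, which also counts as "not equal to 1", so it correctly opens the first run.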
Given previous datetime values in a Pandas DataFrame--either as an index or as values in a column--is there a way to "autofill" remaining time increments based on the previous fixed increments?
For example, given:
import pandas as pd
import numpy as np
df = pd.DataFrame({'B': [0, 1, 2, np.nan, 4]},
                  index=[pd.Timestamp('20130101 09:00:00'),
                         pd.Timestamp('20130101 09:00:05'),
                         pd.Timestamp('20130101 09:00:10'),
                         pd.NaT,   # missing timestep (assumed placeholder)
                         pd.NaT])  # missing timestep (assumed placeholder)
I would like to apply a function to yield:
2013-01-01 09:00:00
2013-01-01 09:00:05
2013-01-01 09:00:10
2013-01-01 09:00:15
2013-01-01 09:00:20
Where I have missing timesteps for my last two data points. Here, timesteps are fixed in 5 second increments.
This will be for thousands of rows. While I might reset_index and then create a function to apply to each row, this seems cumbersome. Is there a slick or built-in way to do this that I'm not finding?
Assuming the first index value is a valid datetime and all the values are spaced 5s apart, you could do the following:
df.index = pd.date_range(df.index[0], periods=len(df), freq='5s')
>>> df
2013-01-01 09:00:00 0.0
2013-01-01 09:00:05 1.0
2013-01-01 09:00:10 2.0
2013-01-01 09:00:15 NaN
2013-01-01 09:00:20 4.0
This solution might work for you, but it also uses the reset_index() function.
new_dateindex = pd.Series(pd.date_range(start=pd.Timestamp('20130101 09:00:00'), periods=1000, freq='5s'), name='Date')
# 'periods=1000' can be changed to 'periods=len(df.index)'
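If the 5-second step should not be hard-coded, it can be inferred from the timestamps that are present. A minimal sketch, assuming the first two index values are valid and that the spacing really is constant; the missing entries are represented as pd.NaT here, which is an assumption about the original data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"B": [0, 1, 2, np.nan, 4]},
    index=[pd.Timestamp("2013-01-01 09:00:00"),
           pd.Timestamp("2013-01-01 09:00:05"),
           pd.Timestamp("2013-01-01 09:00:10"),
           pd.NaT,
           pd.NaT],
)

# infer the fixed increment from the first two known timestamps
step = df.index[1] - df.index[0]
# rebuild the whole index from the first timestamp and the inferred step
df.index = pd.date_range(start=df.index[0], periods=len(df), freq=step)
```

No reset_index() or row-wise apply is needed, since the index is replaced in one assignment.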
I have a dataframe (df) where column A holds the drug units dosed at the time point given by Timestamp. I want to fill the missing values (NaN) with the drug concentration implied by the half-life of the drug (180 minutes). I am struggling with the code in pandas and would really appreciate help and insight. Thanks in advance.
1991-04-21 09:09:00 9.0
1991-04-21 3:00:00 NaN
1991-04-21 9:00:00 NaN
1991-04-22 07:35:00 10.0
1991-04-22 13:40:00 NaN
1991-04-22 16:56:00 NaN
Given that the half-life of the drug is 180 minutes, I want to fill the NaN values as a function of the time elapsed and the half-life of the drug,
something like
Timestamp A
1991-04-21 09:00:00 9.0
1991-04-21 3:00:00 ~2.25
1991-04-21 9:00:00 ~0.55
1991-04-22 07:35:00 10.0
1991-04-22 13:40:00 ~2.5
1991-04-22 16:56:00 ~0.75
Your timestamps are not sorted and I'm assuming this was a typo. I fixed it below.
import pandas as pd
import numpy as np
from io import StringIO  # Python 3; in Python 2 this was "from StringIO import StringIO"

text = """TimeStamp            A
1991-04-21 09:09:00  9.0
1991-04-21 13:00:00  NaN
1991-04-21 19:00:00  NaN
1991-04-22 07:35:00  10.0
1991-04-22 13:40:00  NaN
1991-04-22 16:56:00  NaN"""
# columns are separated by two or more spaces
df = pd.read_csv(StringIO(text), sep=r'\s{2,}', engine='python', parse_dates=[0])
This is the magic code.
# half-life of 180 minutes is 10,800 seconds
# we need to calculate lamda (intentionally mis-spelled)
lamda = 10800 / np.log(2)
# returns time difference for each element
# relative to first element
def time_diff(x):
    return x - x.iloc[0]
# create partition of non-nulls with subsequent nulls
partition = df.A.notnull().cumsum()
# calculate time differences in seconds for each
# element relative to most recent non-null observation
# use .dt accessor and method .total_seconds()
# group_keys=False keeps the original row index so the result aligns with df.A
tdiffs = df.TimeStamp.groupby(partition, group_keys=False).apply(time_diff).dt.total_seconds()
# apply exponential decay
decay = np.exp(-tdiffs / lamda)
# finally, forward fill the observations and multiply by decay
decay * df.A.ffill()
0 9.000000
1 3.697606
2 0.924402
3 10.000000
4 2.452325
5 1.152895
dtype: float64
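The same calculation can also be written without groupby and apply, working in minutes and powers of one half. This is a sketch of the same model as above, so it likewise assumes each dose resets the level and that doses do not accumulate; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "TimeStamp": pd.to_datetime([
        "1991-04-21 09:09:00", "1991-04-21 13:00:00", "1991-04-21 19:00:00",
        "1991-04-22 07:35:00", "1991-04-22 13:40:00", "1991-04-22 16:56:00",
    ]),
    "A": [9.0, np.nan, np.nan, 10.0, np.nan, np.nan],
})

HALF_LIFE_MIN = 180.0

# time of the most recent dose at or before each row
last_dose_time = df["TimeStamp"].where(df["A"].notna()).ffill()
# minutes elapsed since that dose (0 on the dose rows themselves)
elapsed_min = (df["TimeStamp"] - last_dose_time).dt.total_seconds() / 60.0
# the amount halves once per half-life
filled = df["A"].ffill() * 0.5 ** (elapsed_min / HALF_LIFE_MIN)
```

Since 0.5 ** (t / 180) equals exp(-t * ln(2) / 180), this matches the exponential-decay output shown above.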
Let's say I have financial data in a pandas.Series, called fin_series.
Here's a peek at fin_series.
In [565]: fin_series
2008-05-16 1000.000000
2008-05-19 1001.651747
2008-05-20 1004.137434
2014-12-22 1158.085200
2014-12-23 1150.139126
2014-12-24 1148.934665
Name: Close, Length: 1665
I'm interested in looking at the quarterly endpoints of the data. However, not all financial trading days fall exactly on the 'end of the quarter.'
For example:
In [566]: fin_series.asfreq('q')
2008-06-30 976.169624
2008-09-30 819.518923
2008-12-31 760.429261
2009-06-30 795.768956
2009-09-30 870.467121
2009-12-31 886.329978
2011-09-30 963.304679
2011-12-31 NaN
2012-03-31 NaN
2012-09-30 NaN
2012-12-31 1095.757137
2013-03-31 NaN
2013-06-30 NaN
2014-03-31 1138.548881
2014-06-30 1168.248194
2014-09-30 1147.000073
Freq: Q-DEC, Name: Close, dtype: float64
Here's a little function that accomplishes what I'd like, along with the desired end result.
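import numpy

def bmg_qt_asfreq(series):
    # True where the next observation falls in a different quarter
    ind = series[1:].index.quarter != series[:-1].index.quarter
    ind = numpy.append(ind, True)  # always keep the final observation
    return series[ind]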
which gives me:
In [15]: bmg_qt_asfreq(tmp)
2008-06-30 976.169425
2008-09-30 819.517607
2008-12-31 760.428770
2011-09-30 963.252831
2011-12-30 999.742132
2012-03-30 1049.848583
2012-09-28 1086.689824
2012-12-31 1093.943357
2013-03-28 1117.111859
Name: Close, dtype: float64
Note that I'm preserving the dates of the "closest prior price" instead of simply using pandas.asfreq(freq = 'q', method = 'ffill'), because preserving dates that exist within the original Series.Index is crucial.
This seems like a silly problem that many people have had and must be addressed by all of the pandas time manipulation functionality, but I can't figure out how to do it with resample or asfreq.
I would greatly appreciate it if anyone could show me the builtin pandas functionality to accomplish this.
Assuming the input is a pandas Series, first do
import pandas as pd
to get a series with the last non-NA index for each quarter. Then
for the last non-NA value. You can then join these together. As you suggested in your comment:
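The calls in this answer were lost, so the following is a sketch of the approach it describes, not the original code: group the non-NA observations by calendar quarter, take the last date and the last value in each group, and join them. The function name last_per_quarter and the sample business-day data are made up for illustration.

```python
import numpy as np
import pandas as pd

def last_per_quarter(series):
    """Last non-NA observation of each quarter, keeping its original date."""
    s = series.dropna()
    quarters = s.index.to_period("Q")
    # last date actually present in each quarter
    last_dates = s.index.to_series().groupby(quarters).last()
    # last non-NA value in each quarter
    last_values = s.groupby(quarters).last()
    # join the two: values indexed by the dates they were observed on
    return pd.Series(last_values.to_numpy(),
                     index=pd.DatetimeIndex(last_dates.to_numpy()),
                     name=series.name)

days = pd.bdate_range("2011-10-03", "2012-06-29")
prices = pd.Series(np.arange(len(days), dtype=float), index=days, name="Close")
result = last_per_quarter(prices)
```

Grouping by s.index.to_period("Q") rather than by s.index.quarter keeps the same quarter in different years apart.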