Pandas, Filling between dates with average change between previous rows - python

I think this is best illustrated with examples. Lets say we have a DataFrame like such:
295340 299616
2014-11-02 304.904110 157.123288
2014-12-02 597.303413 305.488493
2015-01-02 896.310372 454.614630
2015-02-02 1192.379580 599.588466
2015-02-04 1211.285484 NaN
2015-03-02 NaN 726.622932
Now let's say I want to reindex this, like such:
rng = pd.date_range(df.index[0], df.index[-1])
df.reindex(rng)
295340 299616
2014-11-02 304.904110 157.123288
2014-11-03 NaN NaN
2014-11-04 NaN NaN
2014-11-05 NaN NaN
...
2014-11-29 NaN NaN
2014-11-30 NaN NaN
2014-12-01 NaN NaN
2014-12-02 597.303413 305.488493
Now if we look at 295340, we see the difference between their values is (597.30-304.90) = 292.39.
The amount of days between the two values is 31. So the average increase is 9.43 a day.
So what I would want is something like such:
295340 299616
2014-11-02 304.904110 157.123288
2014-11-03 314.336345 NaN
2014-11-04 323.768581 NaN
2014-11-05 333.200816 NaN
The way I calculated that was:
304.904110 + (((597.303413-304.904110) / 31) * N)
Where N is 1 for the first row since row 1, 2 after, etc.
I would obviously want all the columns filled this way, so 299616 with the same method and such.
Any ideas for something that is as efficient as possible? I know of ways to do this, but nothing seems efficient and it seems like there should be some type of fillna() or something that works for this type of finance related problem.
NOTE: The columns will not all be spaced out the same. Each one can have numbers anywhere within the range of dates, so I can't just assume that the next number for each column will be at X date.

You can use DataFrame.interpolate with the "time" method after a resample. (It won't give quite the numbers you gave, because there are only 30 days between 2 Nov and 2 Dec, not 31):
>>> dnew = df.resample("1d").interpolate("time")
>>> dnew.head(100)
295340 299616
2014-11-02 304.904110 157.123288
2014-11-03 314.650753 162.068795
[...]
2014-11-28 558.316839 285.706466
2014-11-29 568.063483 290.651972
2014-11-30 577.810126 295.597479
2014-12-01 587.556770 300.542986
2014-12-02 597.303413 305.488493
2014-12-03 606.948799 310.299014
[...]
2014-12-30 867.374215 440.183068
2014-12-31 877.019600 444.993589
2015-01-01 886.664986 449.804109
2015-01-02 896.310372 454.614630
[...]
2015-02-01 1182.828960 594.911891
2015-02-02 1192.379580 599.588466
[...]
The downside here is that it'll extrapolate using the last value at the end:
[...]
2015-01-31 1173.278341 590.235315
2015-02-01 1182.828960 594.911891
2015-02-02 1192.379580 599.588466
2015-02-03 1201.832532 604.125411
2015-02-04 1211.285484 608.662356
2015-02-05 1211.285484 613.199302
2015-02-06 1211.285484 617.736247
[...]
So you'd have to decide how you want to handle that.

Related

Filter the single column of an Dataframe from another dataframe column

DataFrame1 :
origin 2001-01-01 00:00:00 2002-01-01 00:00:00 2003-01-01 00:00:00 2004-01-01 00:00:00 ... 2008-01-01 00:00:00 2009-01-01 00:00:00 2010-01-01 00:00:00 Grand Total
Simulation 1 1.597942e+13 NaN 1.114312e+20 4.370424e+26 ... 3.633710e+52 3.388095e+58 1.103886e+64 3.159025e+71
Simulation 2 1.852542e+13 NaN 1.280181e+20 4.958904e+26 ... 7.830853e+52 1.077502e+59 5.605342e+64 1.852667e+72
Simulation 3 1.978941e+13 NaN 1.024391e+20 5.038746e+26 ... 6.922672e+52 9.431727e+58 5.947689e+63 4.921311e+71
Simulation 4 1.845122e+13 NaN 1.050210e+20 4.305396e+26 ... 6.529340e+52 1.004737e+59 4.311079e+63 6.250895e+71
Simulation 5 1.733954e+13 NaN 1.082353e+20 4.400699e+26 ... 4.554812e+52 2.587384e+58 5.571276e+63 1.459044e+71
Im trying to filter the cloumn Grand Total from the above Dataframe1.
DataFrame2 :
CI Var
0 60.0 2.059017e+72
1 70.0 2.402186e+72
2 80.0 2.745356e+72
3 90.5 3.105684e+72
In DataFrame2, in Column Var first value is 2.059017e+72, now we have to collect the values from Grand Total column of DataFrame1 which is greater than 2.059017e+72 and store it in the separate dataframe, for each value of var..
You can filter the columns like this:
DataFrame3 = DataFrame1.loc[DataFrame2['Var'][0] < DataFrame1['Grand Total']]
Do you want to print the values or save them as an extra column of df2?

Pandas DataFrame.resample monthly offset from particular day of month

I have a DataFrame df with sporadic daily business day rows (i.e., there is not always a row for every business day.)
For each row in df I want to create a historical resampled mean dfm going back one month at a time. For example, if I have a row for 2018-02-22 then I want rolling means for rows in the following date ranges:
2018-01-23 : 2018-02-22
2017-12-23 : 2018-01-22
2017-11-23 : 2017-12-22
etc.
But I can't see a way to keep this pegged to the particular day of the month using conventional offsets. For example, if I do:
dfm = df.resample('30D').mean()
Then we see two problems:
It references the beginning of the DataFrame. In fact, I can't find a way to force .resample() to peg itself to the end of the DataFrame – even if I have it operate on df_reversed = df.loc[:'2018-02-22'].iloc[::-1]. Is there a way to "peg" the resampling to something other than the earliest date in the DataFrame? (And ideally pegged to each particular row as I run some lambda on the associated historical resampling from each row's date?)
It will drift over time, because not every month is 30 days long. So as I go back in time I will find that the interval 12 "months" prior ends 2017-02-27, not 2017-02-22 like I want.
Knowing that I want to resample by non-overlapping "months," the second problem can be well-defined for month days 29-31: For example, if I ask to resample for '2018-03-31' then the date ranges would end at the end of each preceding month:
2018-03-01 : 2018-03-31
2018-02-01 : 2018-02-28
2018-01-01 : 2018-02-31
etc.
Though again, I don't know: is there a good or easy way to do this in pandas?
tl;dr:
Given something like the following:
someperiods = 20 # this can be a number of days covering many years
somefrequency = '8D' # this can vary from 1D to maybe 10D
rng = pd.date_range('2017-01-03', periods=someperiods, freq=somefrequency)
df = pd.DataFrame({'x': rng.day}, index=rng) # x in practice is exogenous data
from pandas.tseries.offsets import *
df['MonthPrior'] = df.index.to_pydatetime() + DateOffset(months=-1)
Now:
For each row in df: calculate df['PreviousMonthMean'] = rolling average of all df.x in range [df.MonthPrior, df.index). In this example the resulting DataFrame would be:
Index x MonthPrior PreviousMonthMean
2017-01-03 3 2016-12-03 NaN
2017-01-11 11 2016-12-11 3
2017-01-19 19 2016-12-19 7
2017-01-27 27 2016-12-27 11
2017-02-04 4 2017-01-04 19
2017-02-12 12 2017-01-12 16.66666667
2017-02-20 20 2017-01-20 14.33333333
2017-02-28 28 2017-01-28 12
2017-03-08 8 2017-02-08 20
2017-03-16 16 2017-02-16 18.66666667
2017-03-24 24 2017-02-24 17.33333333
2017-04-01 1 2017-03-01 16
2017-04-09 9 2017-03-09 13.66666667
2017-04-17 17 2017-03-17 11.33333333
2017-04-25 25 2017-03-25 9
2017-05-03 3 2017-04-03 17
2017-05-11 11 2017-04-11 15
2017-05-19 19 2017-04-19 13
2017-05-27 27 2017-04-27 11
2017-06-04 4 2017-05-04 19
If we can get that far, then I need to find an efficient way to iterate that so that for each row in df I can aggregate consecutive but non-overlapping df['PreviousMonthMean'] values going back one calendar month at a time from the given DateTimeIndex....

How to efficiently fill a column of dataframe with values calculated from other dataframes

I am trying to fill a dataframe (elist) with cumulative returns of firms and cumulative market returns. This can be done by looping through the elist dataframe using iterrows, see this link. However, this is slow.
I am looking for a more potent, faster solution.
The raw returns that are the inputs for the cumulative return calculations originate from two dataframes (ri, rm). The results should then be recorded in columns of elist. See example below, using the data in this file.
Before running the iterrows loop, elist looks like:
permno begdat enddat return vwretd
0 11628 2012-03-31 2013-03-31 NaN NaN
1 11628 2012-06-30 2013-06-30 NaN NaN
2 11628 2012-09-30 2013-09-30 NaN NaN
3 11628 2012-12-31 2013-12-31 NaN NaN
4 11628 2013-03-31 2014-03-31 NaN NaN
After running the loop elist should look like:
permno begdat enddat return vwretd
0 11628 2012-03-31 2013-03-31 0.212355 0.133429
1 11628 2012-06-30 2013-06-30 0.274788 0.198380
2 11628 2012-09-30 2013-09-30 0.243590 0.198079
3 11628 2012-12-31 2013-12-31 0.299277 0.304479
4 11628 2013-03-31 2014-03-31 0.303147 0.208454
Here is the code that relies on iterrows, which is slow:
import os,sys
import pandas as pd
import numpy as np
rm = pd.read_csv('rm_so.csv') # market return
ri = pd.read_csv('ri_so.csv') # firm return
elist = pd.read_csv('elist_so.csv') # table to be filled with cumlative returns over a period (begdat to enddat)
for index, row in elist.iterrows():
#fill cumulative market return
elist.loc[index, 'vwretd']=rm.loc[(rm['date']>row['begdat']) & (rm['date']<=row['enddat']),'vwretd'].product()-1
#fill cumulative firm return
r = ri.loc[(ri['permno']==row['permno']),]
elist.loc[index, 'return'] = r.loc[(r['date']>row['begdat']) & (r['date']<=row['enddat']),'ret'].product()-1
It would be great to see this process running faster!

Python -Pandas Downsampling with first returns NaN

I am trying use pandas to resample vessel tracking data from seconds to minutes using how='first'. The dataframe is called hg1s. The unique ID is called MMSI. The datetime index is TX_DTTM. Here is a data sample:
TX_DTTM MMSI LAT LON NS
2013-10-01 00:00:02 367542760 29.660550 -94.974195 15
2013-10-01 00:00:04 367542760 29.660550 -94.974195 15
2013-10-01 00:00:07 367451120 29.614161 -94.954459 0
2013-10-01 00:00:15 367542760 29.660210 -94.974069 15
2013-10-01 00:00:13 367542760 29.660210 -94.974069 15
The code to resample:
hg1s1min = hg1s.groupby('MMSI').resample('1Min', how='first')
And a data sample of the output:
hg1s1min[20000:20004]
MMSI TX_DTTM NS LAT LON
367448060 2013-10-21 00:42:00 NaN NaN NaN
2013-10-21 00:43:00 NaN NaN NaN
2013-10-21 00:44:00 NaN NaN NaN
2013-10-21 00:45:00 NaN NaN NaN
It's safe to assume that there are several data points within each minute, so I don't understand why this isn't picking up the first record for that method. I looked at this link: Pandas Downsampling Issue because it seemed similar to my problem. I tried passing label='left' and label='right', neither worked.
How do I return the first record in every minute for each MMSI?
As it turns out, the problem isn't with the method, but with my assumption about the data. The large data set is a month, or 44640 minutes. While every record in my dataset has the relevant values, there isn't 100% overlap in time. In this case MMSI = 367448060 is present at 2013-10-17 23:24:31 and again at 2013-10-29 20:57:32. between those two data points, there isn't data to sample, resulting in a NaN, which is correct.

pandas asfreq returns NaN if exact date DNE

Let's say I have financial data in a pandas.Series, called fin_series.
Here's a peek at fin_series.
In [565]: fin_series
Out[565]:
Date
2008-05-16 1000.000000
2008-05-19 1001.651747
2008-05-20 1004.137434
...
2014-12-22 1158.085200
2014-12-23 1150.139126
2014-12-24 1148.934665
Name: Close, Length: 1665
I'm interested in looking at the quarterly endpoints of the data. However, not all financial trading days fall exactly on the 'end of the quarter.'
For example:
In [566]: fin_series.asfreq('q')
Out[566]:
2008-06-30 976.169624
2008-09-30 819.518923
2008-12-31 760.429261
...
2009-06-30 795.768956
2009-09-30 870.467121
2009-12-31 886.329978
...
2011-09-30 963.304679
2011-12-31 NaN
2012-03-31 NaN
....
2012-09-30 NaN
2012-12-31 1095.757137
2013-03-31 NaN
2013-06-30 NaN
...
2014-03-31 1138.548881
2014-06-30 1168.248194
2014-09-30 1147.000073
Freq: Q-DEC, Name: Close, dtype: float64
Here's a little function that accomplishes what I'd like, along with the desired end result.
def bmg_qt_asfreq(series):
ind = series[1:].index.quarter != series[:-1].index.quarter
ind = numpy.append(ind, True)
return tmp[ind]
which gives me:
In [15]: bmg_asfreq(tmp)
Out[15]:
Date
2008-06-30 976.169425
2008-09-30 819.517607
2008-12-31 760.428770
...
2011-09-30 963.252831
2011-12-30 999.742132
2012-03-30 1049.848583
...
2012-09-28 1086.689824
2012-12-31 1093.943357
2013-03-28 1117.111859
Name: Close, dtype: float64
Note that I'm preserving the dates of the "closest prior price," instead of simply using pandas.asfreq(freq = 'q', method = 'ffill'), as the preservations of dates that exist within the original Series.Index is crucial.
This seems like a silly problem that many people have had and must be addressed by all of the pandas time manipulation functionality, but I can't figure out how to do it with resample or asfreq.
Anyone who could show me the builtin pandas functionality to accomplish this would be greatly appreciated.
Regards,
Assuming the input is a dataframe Series , first do
import pandas as pd
fin_series.resample("q",pd.Series.last_valid_index)
to get a series with the last non-NA index for each quarter. Then
fin_series.resample("q","last")
for the last non-NA value. You can then join these together. As you suggested in your comment:
fin_series[fin_series.resample("q",pd.Series.last_valid_index)]
df.asfreq('d').interpolate().asfreq('q')

Categories

Resources