Pandas: decompress date range to individual dates - python

Dataset: I have a 1GB dataset of stocks, each of which has a value over a date range. The date ranges do not overlap, and the dataset is sorted on (ticker, start_date).
>>> df.head()
start_date end_date val
ticker
AAPL 2014-05-01 2014-05-01 10.0000000000
AAPL 2014-06-05 2014-06-10 20.0000000000
GOOG 2014-06-01 2014-06-15 50.0000000000
MSFT 2014-06-16 2014-06-16 None
TWTR 2014-01-17 2014-05-17 10.0000000000
Goal: I want to decompress the dataframe so that I have individual dates instead of date ranges. For example, the AAPL rows would go from being only 2 rows to 7 rows:
>>> AAPL_decompressed.head()
val
date
2014-05-01 10.0000000000
2014-06-05 20.0000000000
2014-06-06 20.0000000000
2014-06-07 20.0000000000
2014-06-08 20.0000000000
I'm hoping there's a nice optimized method from pandas like resample that can do this in a couple lines.

A bit more than a couple of lines, but I think it gives the result you asked for:
Starting with your dataframe:
In [70]: df
Out[70]:
start_date end_date val row
ticker
AAPL 2014-05-01 2014-05-01 10 0
AAPL 2014-06-05 2014-06-10 20 1
GOOG 2014-06-01 2014-06-15 50 2
MSFT 2014-06-16 2014-06-16 NaN 3
TWTR 2014-01-17 2014-05-17 10 4
First, I reshape this dataframe into one with a single date column, so every row appears twice: once for its start_date and once for its end_date. I also add a counter column called row:
In [60]: df['row'] = range(len(df))
In [61]: starts = df[['start_date', 'val', 'row']].rename(columns={'start_date': 'date'})
In [62]: ends = df[['end_date', 'val', 'row']].rename(columns={'end_date':'date'})
In [63]: df_decomp = pd.concat([starts, ends])
In [64]: df_decomp = df_decomp.set_index('row', append=True)
In [65]: df_decomp.sort_index()
Out[65]:
date val
ticker row
AAPL 0 2014-05-01 10
0 2014-05-01 10
1 2014-06-05 20
1 2014-06-10 20
GOOG 2 2014-06-01 50
2 2014-06-15 50
MSFT 3 2014-06-16 NaN
3 2014-06-16 NaN
TWTR 4 2014-01-17 10
4 2014-05-17 10
Based on this new dataframe, I can group by ticker and row, apply a daily resample to each group, and fill the NaNs with method 'pad' (forward fill):
In [66]: df_decomp = df_decomp.groupby(level=[0,1]).apply(lambda x: x.set_index('date').resample('D').fillna(method='pad'))
In [67]: df_decomp = df_decomp.reset_index(level=1, drop=True)
The last command was to drop the now superfluous row index level.
When we access the AAPL rows, it gives your desired output:
In [69]: df_decomp.loc['AAPL']
Out[69]:
val
date
2014-05-01 10
2014-06-05 20
2014-06-06 20
2014-06-07 20
2014-06-08 20
2014-06-09 20
2014-06-10 20
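For reference, on more recent pandas versions (0.25+) a similar decompression can be written more compactly with a per-row date_range plus DataFrame.explode. This is a hedged sketch of an alternative, not part of the answer above, and it assumes the column names from the question:
import pandas as pd

# Rebuild the AAPL part of the question's frame (ticker is the index).
df = pd.DataFrame(
    {'start_date': pd.to_datetime(['2014-05-01', '2014-06-05']),
     'end_date': pd.to_datetime(['2014-05-01', '2014-06-10']),
     'val': [10.0, 20.0]},
    index=pd.Index(['AAPL', 'AAPL'], name='ticker'))

# One list of daily dates per row, then explode to one row per date.
out = (df.assign(date=[list(pd.date_range(s, e))
                       for s, e in zip(df['start_date'], df['end_date'])])
         .explode('date')
         .set_index('date', append=True)[['val']])
print(out.loc['AAPL'])  # 7 rows: 2014-05-01 and 2014-06-05 .. 2014-06-10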

I think you can do this in five steps:
1) filter the ticker column to find the stock you want
2) use pandas.bdate_range to build a list of date ranges between start and end
3) flatten this list using reduce
4) reindex your new filtered dataframe
5) fill nans using the method pad
Here is the code:
>>> import pandas as pd
>>> import datetime
>>> data = [('AAPL', datetime.date(2014, 4, 28), datetime.date(2014, 5, 2), 90),
('AAPL', datetime.date(2014, 5, 5), datetime.date(2014, 5, 9), 80),
('MSFT', datetime.date(2014, 5, 5), datetime.date(2014, 5, 9), 150),
('AAPL', datetime.date(2014, 5, 12), datetime.date(2014, 5, 16), 85)]
>>> df = pd.DataFrame(data=data, columns=['ticker', 'start', 'end', 'val'])
>>> df_new = df[df['ticker'] == 'AAPL']
>>> df_new.name = 'AAPL'
>>> df_new.index = df_new['start']
>>> df_new.index.name = 'date'
>>> df_new.index = pd.to_datetime(df_new.index)  # Index.to_datetime() was removed in newer pandas
>>> from functools import reduce  # needed on Python 3, where reduce is no longer a builtin
>>> new_index = [pd.bdate_range(**d) for d in df_new[['start', 'end']].to_dict('records')]
>>> new_index_flat = reduce(pd.DatetimeIndex.append, new_index)
>>> df_new = df_new.reindex(new_index_flat)
>>> df_new = df_new.fillna(method='pad')
>>> df_new
ticker start end val
2014-04-28 AAPL 2014-04-28 2014-05-02 90
2014-04-29 AAPL 2014-04-28 2014-05-02 90
2014-04-30 AAPL 2014-04-28 2014-05-02 90
2014-05-01 AAPL 2014-04-28 2014-05-02 90
2014-05-02 AAPL 2014-04-28 2014-05-02 90
2014-05-05 AAPL 2014-05-05 2014-05-09 80
2014-05-06 AAPL 2014-05-05 2014-05-09 80
2014-05-07 AAPL 2014-05-05 2014-05-09 80
2014-05-08 AAPL 2014-05-05 2014-05-09 80
2014-05-09 AAPL 2014-05-05 2014-05-09 80
2014-05-12 AAPL 2014-05-12 2014-05-16 85
2014-05-13 AAPL 2014-05-12 2014-05-16 85
2014-05-14 AAPL 2014-05-12 2014-05-16 85
2014-05-15 AAPL 2014-05-12 2014-05-16 85
2014-05-16 AAPL 2014-05-12 2014-05-16 85
[15 rows x 4 columns]
Hope it helps!
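One caveat on the approach above: pd.bdate_range produces business days only, so weekends inside a range are skipped. If calendar days are wanted (as the question's ranges suggest), swapping in pd.date_range should work; a hedged variation of the two relevant lines:
>>> # Assumption: calendar days (weekends included) are wanted instead of business days.
>>> new_index = [pd.date_range(**d) for d in df_new[['start', 'end']].to_dict('records')]
>>> new_index_flat = reduce(pd.DatetimeIndex.append, new_index)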

Here is a hacky way to do it - I post this poor answer (remember - I can't code :-) ) because I am new to pandas and would not mind someone improving it.
This reads a file containing the data originally posted, then builds a MultiIndex from the ticker and the dates spanned by each (start_date, end_date) range. The lookup at the end uses index.searchsorted, which behaves like map::upper_bound in C++: it finds the position where the query date would be inserted to keep the index sorted, i.e. it locates the end_date closest to, but not before, the date in question. That row holds the value we want.
Then I take a cross-section of a Series built on this MultiIndex for the ticker 'AAPL', flatten its tuples of dates into a single list, and use those dates as both the index and the values of a new Series. Finally, I map that Series through the searchsorted lookup to grab the desired stock price.
I know this is probably wrong...but...happy to learn.
I would not be surprised to find out that there is an easy way to inflate a data frame like this that uses some fill forward interpolation method...
stocks = pd.read_csv('stocks2.csv', parse_dates=['start_date', 'end_date'], index_col='ticker')
# (ticker, tuple of all dates in the row's range) pairs for the MultiIndex
mi = list(zip(stocks.index,
              pd.Series(list(zip(stocks['start_date'], stocks['end_date'])))
                .map(lambda z: tuple(pd.date_range(start=z[0], end=z[1])))))
mi = pd.MultiIndex.from_tuples(mi)
ticker = 'AAPL'
s = pd.Series(index=mi, data=0)
s = list(s.xs(key=ticker).index)
l = []
for dates in s:  # flatten the tuples of dates (map() is lazy in Python 3, so use a loop)
    l.extend(dates)
s = pd.Series(index=l, data=l)
stocks_byticker = stocks[stocks.index == ticker].set_index('end_date')
# .ix is gone from modern pandas; use a positional lookup with searchsorted instead
print(s.map(lambda x: stocks_byticker['val'].iloc[stocks_byticker.index.searchsorted(x)]))
2014-05-01 10
2014-06-05 20
2014-06-06 20
2014-06-07 20
2014-06-08 20
2014-06-09 20
2014-06-10 20
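To make the map::upper_bound analogy concrete, here is a small hedged illustration of the searchsorted lookup on made-up dates (a binary search for the first end_date at or after the query date):
import pandas as pd

# Hypothetical, sorted end dates for one ticker.
end_dates = pd.DatetimeIndex(['2014-05-01', '2014-06-10'])

# searchsorted returns the insertion position that keeps the index sorted,
# i.e. the position of the first end_date at or after the query date.
print(end_dates.searchsorted(pd.Timestamp('2014-06-07')))  # 1 -> row with end_date 2014-06-10
print(end_dates.searchsorted(pd.Timestamp('2014-05-01')))  # 0 -> exact match (side='left')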

Here's a slightly more general way to do this that expands on joris's good answer, but allows this to work with any number of additional columns:
import pandas as pd
df['join_id'] = range(len(df))
starts = df[['start_date', 'join_id']].rename(columns={'start_date': 'date'})
ends = df[['end_date', 'join_id']].rename(columns={'end_date': 'date'})
start_end = pd.concat([starts, ends]).set_index('date')
fact_table = start_end.groupby("join_id").apply(lambda x: x.resample('D').fillna(method='pad'))
del fact_table["join_id"]
fact_table = fact_table.reset_index()
final = fact_table.merge(df, on='join_id', how='left')
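For completeness, here is a minimal self-contained run of the same idea on a small made-up frame; the column names follow the original question, and .ffill() is used as the modern spelling of fillna(method='pad'). Treat it as a sketch rather than a drop-in implementation:
import pandas as pd

# A made-up two-row frame with the question's column names; assumes each range
# spans more than one day.
df = pd.DataFrame({'ticker': ['AAPL', 'GOOG'],
                   'start_date': pd.to_datetime(['2014-06-05', '2014-06-01']),
                   'end_date': pd.to_datetime(['2014-06-10', '2014-06-15']),
                   'val': [20.0, 50.0]})

df['join_id'] = range(len(df))
starts = df[['start_date', 'join_id']].rename(columns={'start_date': 'date'})
ends = df[['end_date', 'join_id']].rename(columns={'end_date': 'date'})
start_end = pd.concat([starts, ends]).set_index('date')

# Daily resample within each join_id, forward-filling between the two dates.
fact_table = (start_end.groupby('join_id')
                       .resample('D')
                       .ffill()
                       .drop(columns='join_id', errors='ignore')  # newer pandas may already exclude it
                       .reset_index())

# The merge restores every original column, however many there are.
final = fact_table.merge(df, on='join_id', how='left')
print(final[['ticker', 'date', 'val']])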

Related

group datetime column by 5 minutes increment only for time of day (ignoring date) and count

I have a dataframe with one column timestamp (of type datetime) and some other columns whose content doesn't matter. I'm trying to group by 5-minute intervals and count, ignoring the date and caring only about the time of day.
One can generate an example dataframe using this code:
import numpy as np
import pandas as pd

def get_random_dates_df(
    n=10000,
    start=pd.to_datetime('2015-01-01'),
    period_duration_days=5,
    seed=None,
):
    if not seed:  # from piR's answer
        np.random.seed(0)
    end = start + pd.Timedelta(period_duration_days, 'd')
    n_seconds = int(period_duration_days * 3600 * 24)
    random_dates = pd.to_timedelta(n_seconds * np.random.rand(n), unit='s') + start
    return pd.DataFrame(data={"timestamp": random_dates}).reset_index()
df = get_random_dates_df()
it would look like this:
   index                     timestamp
0      0 2015-01-03 17:51:27.433696604
1      1 2015-01-04 13:49:21.806272885
2      2 2015-01-04 00:19:53.778462950
3      3 2015-01-03 17:23:09.535054659
4      4 2015-01-03 02:50:18.873314407
I think I have a working solution but it seems overly complicated:
gpd_df = df.groupby(pd.Grouper(key="timestamp", freq="5min")).agg(
    count=("index", "count")
).reset_index()
gpd_df["time_of_day"] = gpd_df["timestamp"].dt.time
res_df = gpd_df.groupby("time_of_day").sum()
Output:
count
time_of_day
00:00:00 38
00:05:00 39
00:10:00 48
00:15:00 33
00:20:00 27
... ...
23:35:00 34
23:40:00 38
23:45:00 37
23:50:00 41
23:55:00 41
[288 rows x 1 columns]
Is there a better way to solve this?
You could groupby the floored 5Min datetime's time portion:
df2 = df.groupby(df['timestamp'].dt.floor('5Min').dt.time)['index'].count()
I'd suggest something like this, to avoid trying to merge the results of two groupbys together:
gpd_df = df.copy()
gpd_df["time_of_day"] = gpd_df["timestamp"].apply(lambda x: x.replace(year=2000, month=1, day=1))
gpd_df = gpd_df.set_index("time_of_day")
res_df = gpd_df.resample("5min").size()
It works by setting the year/month/day to fixed values and applying the built-in resampling function.
What about flooring the datetimes to 5min, extracting the time only and using value_counts:
out = (df['timestamp']
       .dt.floor('5min')
       .dt.time.value_counts(sort=False)
       .sort_index()
)
Output:
00:00:00 38
00:05:00 39
00:10:00 48
00:15:00 33
00:20:00 27
..
23:35:00 34
23:40:00 38
23:45:00 37
23:50:00 41
23:55:00 41
Name: timestamp, Length: 288, dtype: int64
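As a side note, the floor-based answers hinge on dt.floor('5min'), which truncates each timestamp down to the start of its 5-minute bin before the date is stripped off; a quick hedged illustration on two made-up timestamps:
import pandas as pd

ts = pd.Series(pd.to_datetime(['2015-01-03 17:51:27', '2015-01-04 00:19:53']))
# floor('5min') rounds down to the bin start; .dt.time then drops the date,
# so counts from different days land in the same time-of-day bucket.
print(ts.dt.floor('5min').dt.time.tolist())
# [datetime.time(17, 50), datetime.time(0, 15)]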

Elegant way to shift multiple date columns - Pandas

I have a dataframe like as shown below
df = pd.DataFrame({'person_id': [11, 11, 11, 21, 21],
                   'offset': ['-131 days', '29 days', '142 days', '20 days', '-200 days'],
                   'date_1': ['05/29/2017', '01/21/1997', '7/27/1989', '01/01/2013', '12/31/2016'],
                   'dis_date': ['05/29/2017', '01/24/1999', '7/22/1999', '01/01/2015', '12/31/1991'],
                   'vis_date': ['05/29/2018', '01/27/1994', '7/29/2011', '01/01/2018', '12/31/2014']})
df['date_1'] = pd.to_datetime(df['date_1'])
df['dis_date'] = pd.to_datetime(df['dis_date'])
df['vis_date'] = pd.to_datetime(df['vis_date'])
I would like to shift all the dates for each subject by that subject's offset.
Though my code works (credit: SO), I am looking for a more elegant approach. You can see I am repeating almost the same line three times.
df['offset_to_shift'] = pd.to_timedelta(df['offset'])  # the strings already carry a unit, so unit='d' is not needed
# trying to make the lines below more elegant/efficient
df['shifted_date_1'] = df['date_1'] + df['offset_to_shift']
df['shifted_dis_date'] = df['dis_date'] + df['offset_to_shift']
df['shifted_vis_date'] = df['vis_date'] + df['offset_to_shift']
I expect my output to be as shown below.
Use DataFrame.add along with DataFrame.add_prefix and DataFrame.join:
cols = ['date_1', 'dis_date', 'vis_date']
df = df.join(df[cols].add(df['offset_to_shift'], axis=0).add_prefix('shifted_'))
Or, it is also possible to use pd.concat:
df = pd.concat([df, df[cols].add(df['offset_to_shift'], axis=0).add_prefix('shifted_')], axis=1)
Or, we can assign the new shifted columns directly to the dataframe:
df[['shifted_' + col for col in cols]] = df[cols].add(df['offset_to_shift'], axis=0)
Result:
# print(df)
person_id offset date_1 dis_date vis_date offset_to_shift shifted_date_1 shifted_dis_date shifted_vis_date
0 11 -131 days 2017-05-29 2017-05-29 2018-05-29 -131 days 2017-01-18 2017-01-18 2018-01-18
1 11 29 days 1997-01-21 1999-01-24 1994-01-27 29 days 1997-02-19 1999-02-22 1994-02-25
2 11 142 days 1989-07-27 1999-07-22 2011-07-29 142 days 1989-12-16 1999-12-11 2011-12-18
3 21 20 days 2013-01-01 2015-01-01 2018-01-01 20 days 2013-01-21 2015-01-21 2018-01-21
4 21 -200 days 2016-12-31 1991-12-31 2014-12-31 -200 days 2016-06-14 1991-06-14 2014-06-14
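If hard-coding the cols list is also something to avoid, the datetime columns can be discovered automatically with select_dtypes; a small hedged sketch building on the same frame (it assumes every datetime column is a date to be shifted):
# Assumption: all datetime64 columns should be shifted by the offset.
cols = df.select_dtypes(include='datetime').columns
df[['shifted_' + col for col in cols]] = df[cols].add(df['offset_to_shift'], axis=0)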

Pandas drop rows by time duration

I would like to drop dataframe rows by a time condition (ignoring the date). My data contains around 100 million rows. I have around 100 columns, and each column has a different sampling frequency.
I prepared the following snippet of code that takes into account the different sampling frequencies:
import pandas as pd
# leave_duration=0.01 seconds
# drop_duration=0.1 seconds
i = pd.date_range('2018-01-01', periods=1000, freq='2ms')
i=i.append(pd.date_range('2018-01-01', periods=1000, freq='3ms'))
i=i.append(pd.date_range('2018-01-01', periods=1000, freq='0.5ms'))
df = pd.DataFrame({'A': range(len(i))}, index=i)
df=df.sort_index()
print(df)
# drop by duration....
In this simple example, there is data that lasts for around 1 second and has 3 different sampling frequencies. The goal is to drop rows that last for (e.g.) 0.1 second and keep rows of (e.g.) 0.01 second duration. How can I do it with a one-liner?
With df = df.loc['2018-01-01 00:00:00.000000':'2018-01-01 00:00:00.000500'] you get a new df whose data lie between 2018-01-01 00:00:00.000000 and 2018-01-01 00:00:00.000500.
Now you can apply this filter for the dates you want.
import pandas as pd
# leave_duration=0.01 seconds
# drop_duration=0.1 seconds
i = pd.date_range('2018-01-01', periods=1000, freq='2ms')
i=i.append(pd.date_range('2018-01-01', periods=1000, freq='3ms'))
i=i.append(pd.date_range('2018-01-01', periods=1000, freq='0.5ms'))
df = pd.DataFrame({'A': range(len(i))}, index=i)
df=df.sort_index()
print(df)
# filter data between 2018-01-01 00:00:00.000000 and 2018-01-01 00:00:00.000500
df = df.loc['2018-01-01 00:00:00.000000':'2018-01-01 00:00:00.000500']
print(df)
Output:
Before the filter is applied:
A
2018-01-01 00:00:00.000000 0
2018-01-01 00:00:00.000000 2000
2018-01-01 00:00:00.000000 1000
2018-01-01 00:00:00.000500 2001
2018-01-01 00:00:00.001000 2002
... ...
2018-01-01 00:00:02.985000 1995
2018-01-01 00:00:02.988000 1996
2018-01-01 00:00:02.991000 1997
2018-01-01 00:00:02.994000 1998
2018-01-01 00:00:02.997000 1999
[3000 rows x 1 columns]
After the filter is applied:
A
2018-01-01 00:00:00.000000 0
2018-01-01 00:00:00.000000 2000
2018-01-01 00:00:00.000000 1000
2018-01-01 00:00:00.000500 2001
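The slice above keeps one fixed absolute range. If the goal is a time-of-day window that ignores the date entirely, DataFrame.between_time is one option; a hedged sketch reusing the df built above:
import datetime

# between_time compares only the time of day, ignoring the date part of the index,
# so this keeps the first 10 ms after midnight on every day present in df.
df_filtered = df.between_time(datetime.time(0, 0, 0),
                              datetime.time(0, 0, 0, 10_000))  # 10 ms
print(df_filtered)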

Pulling specific value from Pandas DataReader dataframe?

Here is the code I am running:
def competitor_stock_data_report():
    import datetime
    import pandas_datareader.data as web

    date_time = datetime.datetime.now()
    date = date_time.date()
    stocklist = ['LAZ', 'AMG', 'BEN', 'LM', 'EVR', 'GHL', 'HLI', 'MC', 'PJT', 'MS', 'GS', 'JPM', 'AB']
    start = datetime.datetime(date.year - 1, date.month, date.day - 1)
    end = datetime.datetime(date.year, date.month, date.day - 1)
    for x in stocklist:
        df = web.DataReader(x, 'google', start, end)
        print(df)
        print(df.loc[df['Date'] == start]['Close'].values)
The problem is in the last line: how do I pull the 'Close' value for the specified date?
Open High Low Close Volume
Date
2016-08-02 35.22 35.25 33.66 33.75 861111
2016-08-03 33.57 34.72 33.42 34.25 921401
2016-08-04 33.89 34.22 33.77 34.07 587016
2016-08-05 34.55 34.94 34.31 34.35 463317
2016-08-08 34.54 34.75 34.31 34.74 958230
2016-08-09 34.68 35.12 34.64 34.87 732959
I would like to get 33.75, for example, but the date changes dynamically.
Any suggestions?
IMO the easiest way to get a column's value in the first row:
In [40]: df
Out[40]:
Open High Low Close Volume
Date
2016-08-03 767.18 773.21 766.82 773.18 1287421
2016-08-04 772.22 774.07 768.80 771.61 1140254
2016-08-05 773.78 783.04 772.34 782.22 1801205
2016-08-08 782.00 782.63 778.09 781.76 1107857
2016-08-09 781.10 788.94 780.57 784.26 1318894
... ... ... ... ... ...
2017-07-27 951.78 951.78 920.00 934.09 3212996
2017-07-28 929.40 943.83 927.50 941.53 1846351
2017-07-31 941.89 943.59 926.04 930.50 1970095
2017-08-01 932.38 937.45 929.26 930.83 1277734
2017-08-02 928.61 932.60 916.68 930.39 1824448
[252 rows x 5 columns]
In [41]: df.iat[0, df.columns.get_loc('Close')]
Out[41]: 773.17999999999995
Last row:
In [42]: df.iat[-1, df.columns.get_loc('Close')]
Out[42]: 930.38999999999999
Recommended
df.at[df.index[-1], 'Close']
df.iat[-1, df.columns.get_loc('Close')]
df.loc[df.index[-1], 'Close']
df.iloc[-1, df.columns.get_loc('Close')]
Not intended as public api, but works
df.get_value(df.index[-1], 'Close')
df.get_value(-1, df.columns.get_loc('Close'), takeable=True)
Not recommended, chained indexing
There could be more, but do I really need to add them?
df.iloc[-1].at['Close']
df.loc[:, 'Close'].iat[-1]
All yield
34.869999999999997
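Back to the original code: 'Date' is the index of the DataReader frame, not a column, which is why df['Date'] fails. A hedged sketch of the dynamic-date lookup, assuming df and start come from the question's code above (Series.asof falls back to the most recent prior trading day if start has no row):
# Filter on the index rather than a nonexistent 'Date' column.
close_on_start = df.loc[df.index == start, 'Close']

# If start is a weekend/holiday with no row, asof returns the last value at or before it.
close_or_previous = df['Close'].asof(start)
print(close_on_start.values, close_or_previous)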

pick month start and end data in python

I have stock data downloaded from Yahoo Finance. I want to pick up the rows corresponding to the start and end of each month. I am trying to do this with a pandas DataFrame, but I am not finding the right method to get the start and end of the month. I would be grateful if somebody could help me solve this.
Please note that if the 1st of the month is a holiday and there is no data for it, I need to pick up the 2nd day's data. The same rule applies to the last day of the month. Thanks in advance.
Example data is
2016-01-05,222.80,222.80,217.00,217.75,15074800,217.75
2016-01-04,226.95,226.95,220.05,220.70,14092000,220.70
2015-12-31,225.95,226.55,224.00,224.45,11558300,224.45
2015-12-30,229.00,229.70,224.85,225.80,11702800,225.80
2015-12-29,228.85,229.95,227.50,228.20,7263200,228.20
2015-12-28,229.05,229.95,228.00,228.90,8756800,228.90
........
........
2015-12-04,240.00,242.15,238.05,241.10,11115100,241.10
2015-12-03,244.15,244.50,240.40,241.10,7155600,241.10
2015-12-02,250.55,250.65,243.75,244.60,10881700,244.60
2015-11-30,249.65,253.00,245.00,250.20,12865400,250.20
2015-11-27,243.00,250.50,242.80,249.70,15149900,249.70
2015-11-26,241.95,244.90,241.00,242.50,13629800,242.50
First, convert your date column to datetime format, then group by month, sort each group by date, and take the first/last row using the head/tail methods, like so:
In [37]: df
Out[37]:
0 1 2 3 4 5 6
0 2016-01-05 222.80 222.80 217.00 217.75 15074800 217.75
1 2016-01-04 226.95 226.95 220.05 220.70 14092000 220.70
2 2015-12-31 225.95 226.55 224.00 224.45 11558300 224.45
3 2015-12-30 229.00 229.70 224.85 225.80 11702800 225.80
4 2015-12-29 228.85 229.95 227.50 228.20 7263200 228.20
5 2015-12-28 229.05 229.95 228.00 228.90 8756800 228.90
In [25]: import datetime
In [29]: df[0] = df[0].apply(lambda x: datetime.datetime.strptime(x, '%Y-%m-%d'))
In [36]: df.groupby(df[0].apply(lambda x: x.month)).apply(lambda x: x.sort_values(0).head(1))
Out[36]:
0 1 2 3 4 5 6
0
1 1 2016-01-04 226.95 226.95 220.05 220.7 14092000 220.7
12 5 2015-12-28 229.05 229.95 228.00 228.9 8756800 228.9
In [38]: df.groupby(df[0].apply(lambda x: x.month)).apply(lambda x: x.sort_values(0).tail(1))
Out[38]:
0 1 2 3 4 5 6
0
1 0 2016-01-05 222.80 222.80 217.0 217.75 15074800 217.75
12 2 2015-12-31 225.95 226.55 224.0 224.45 11558300 224.45
You can combine the two result dataframes using pd.concat().
For the first / last day of each month, you can use .resample() with 'BMS' and 'BM' for Business Month (Start) like so (using pandas 0.18 syntax):
df.resample('BMS').first()
df.resample('BM').last()
This assumes that your data have a DatetimeIndex, as is usual when downloaded from Yahoo using pandas_datareader:
from datetime import datetime
from pandas_datareader.data import DataReader
df = DataReader('FB', 'yahoo', datetime(2015, 1, 1), datetime(2015, 3, 31))['Open']
df.head()
Date
2015-01-02 78.580002
2015-01-05 77.980003
2015-01-06 77.230003
2015-01-07 76.760002
2015-01-08 76.739998
Name: Open, dtype: float64
df.tail()
Date
2015-03-25 85.500000
2015-03-26 82.720001
2015-03-27 83.379997
2015-03-30 83.809998
2015-03-31 82.900002
Name: Open, dtype: float64
do:
df.resample('BMS').first()
Date
2015-01-01 78.580002
2015-02-02 76.110001
2015-03-02 79.000000
Freq: BMS, Name: Open, dtype: float64
and
df.resample('BM').last()
to get:
Date
2015-01-30 78.000000
2015-02-27 80.680000
2015-03-31 82.900002
Freq: BM, Name: Open, dtype: float64
Assuming you have downloaded data from Yahoo:
> import pandas.io.data as web  # in newer pandas this module moved to the separate pandas_datareader package
> import datetime
> start = datetime.datetime(2016,1,1)
> end = datetime.datetime(2016,5,1)
> df = web.DataReader("AAPL", "yahoo", start, end)
You simply pick the month end and start rows with:
df[df.index.is_month_end]
df[df.index.is_month_start]
If you want to access a specific row, such as the first of the selected month-start rows, you simply do:
df[df.index.is_month_start].iloc[0]
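Note that is_month_start / is_month_end test calendar days, so a month whose 1st (or last day) was a holiday is simply missed, which is the case the question raises. A hedged alternative is to group by month period and take the first/last available row:
# First and last *available* trading rows of each month; a holiday on the 1st or
# the last calendar day just falls through to the next/previous trading day.
monthly_first = df.groupby(df.index.to_period('M')).first()
monthly_last = df.groupby(df.index.to_period('M')).last()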
