Selecting multiple ranges of dates from dataframe - python

I have a dataframe with dates and prices (as below).
import pandas as pd

df = pd.DataFrame({'date': ['2015-01-01', '2015-01-02', '2015-01-03',
                            '2016-01-01', '2016-01-02', '2016-01-03',
                            '2017-01-01', '2017-01-02', '2017-01-03',
                            '2018-01-01', '2018-01-02', '2018-01-03'],
                   'price': [78, 87, 52, 94, 55, 45, 68, 76, 65, 75, 78, 21]})
# the format string must match the input: '%Y-%m-%d', not '%Y%m%d'
# (with errors='ignore' the mismatch silently left the column as strings)
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')
select_dates = df.set_index(['date'])
I want to select a range of specific dates to add to a new dataframe. For example, I would like to select prices for the first quarter of 2015 and the first quarter of 2016. I have provided data for a shorter time period for the example, so in this case, I would like to select the first 2 days of 2015 and the first 2 days of 2016.
I would like to end up with a dataframe like this (with date as the index).
            price
date
2015-01-01     78
2015-01-02     87
2016-01-01     94
2016-01-02     55
I have been using this method to select dates, but I don't know how to select more than one range at a time:
select_dates2 = select_dates.loc['2015-01-01':'2015-01-02']
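One straightforward way to combine several ranges is to take each .loc slice separately and concatenate the pieces. A minimal sketch, assuming the select_dates frame built above (the ranges list is just an illustration):
import pandas as pd

# slice each date range from the DatetimeIndex, then stack the pieces
ranges = [('2015-01-01', '2015-01-02'),
          ('2016-01-01', '2016-01-02')]
out = pd.concat([select_dates.loc[start:end] for start, end in ranges])
print(out)
#             price
# date
# 2015-01-01     78
# 2015-01-02     87
# 2016-01-01     94
# 2016-01-02     55
This scales to whole quarters as well: use ('2015-01-01', '2015-03-31') and ('2016-01-01', '2016-03-31') as the range endpoints.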

Another way:
df['date'] = pd.to_datetime(df['date'])
df[df.date.dt.year.isin([2015, 2016]) & df.date.dt.day.lt(3)]
        date  price
0 2015-01-01     78
1 2015-01-02     87
3 2016-01-01     94
4 2016-01-02     55

One option is to use the dt accessor to select certain years and month-days, then use isin to create a boolean mask to filter df.
df['date'] = pd.to_datetime(df['date'])
out = df[df['date'].dt.year.isin([2015, 2016]) & df['date'].dt.strftime('%m-%d').isin(['01-01','01-02'])]
Output:
        date  price
0 2015-01-01     78
1 2015-01-02     87
3 2016-01-01     94
4 2016-01-02     55

One option is to make the index a MultiIndex of date parts; this allows for a relatively easy selection on multiple levels (in this case, year and day):
(df
 .assign(year=df.date.dt.year, day=df.date.dt.day)
 .set_index(['year', 'day'])
 .loc(axis=0)[2015:2016, :2]
)
               date  price
year day
2015 1   2015-01-01     78
     2   2015-01-02     87
2016 1   2016-01-01     94
     2   2016-01-02     55

Related

Calculating values from time series in pandas multi-indexed pivot tables

I've got a dataframe in pandas that stores the Id of a person, the quality of interaction, and the date of the interaction. A person can have multiple interactions across multiple dates, so to help visualise and plot this I converted it into a pivot table grouping first by Id then by date to analyse the pattern over time.
e.g.
import pandas as pd

df = pd.DataFrame({'Id': ['A4G8','A4G8','A4G8','P9N3','P9N3','P9N3','P9N3','C7R5','L4U7'],
                   'Date': ['2016-1-1','2016-1-15','2016-1-30','2017-2-12','2017-2-28',
                            '2017-3-10','2019-1-1','2018-6-1','2019-8-6'],
                   'Quality': [2,3,6,1,5,10,10,2,2]})
pt = df.pivot_table(values='Quality', index=['Id','Date'])
print(pt)
Leads to this:
                Quality
Id   Date
A4G8 2016-1-1         2
     2016-1-15        3
     2016-1-30        6
P9N3 2017-2-12        1
     2017-2-28        5
     2017-3-10       10
     2019-1-1        10
C7R5 2018-6-1         2
L4U7 2019-8-6         2
However, I'd also like to...
Measure the time from the first interaction for each interaction per Id
Measure the time from the previous interaction with the same Id
So I'd get a table similar to the one below:
                Quality Time From First Time To Prev
Id   Date
A4G8 2016-1-1         2          0 days      NA days
     2016-1-15        3         14 days      14 days
     2016-1-30        6         29 days      15 days
P9N3 2017-2-12        1          0 days      NA days
     2017-2-28        5         16 days      16 days
     2017-3-10       10         26 days      10 days
The Id column is a string; I've converted the Date column to datetime and the Quality column to an integer.
The dataframe is rather large (>10,000 unique Ids), so for performance reasons I'm trying to avoid for loops. I'm guessing the solution somehow involves pd.eval, but I'm stuck as to how to apply it correctly.
Apologies, I'm a Python, pandas, & Stack Overflow noob and I haven't found the answer anywhere yet, so even some pointers on where to look would be great :-).
Many thanks in advance
Convert Date to datetimes, then subtract the minimal datetime per group (computed with GroupBy.transform) from the Date column; for the second new column use DataFrameGroupBy.diff:
df['Date'] = pd.to_datetime(df['Date'])
df['Time From First'] = df['Date'].sub(df.groupby('Id')['Date'].transform('min'))
df['Time To Prev'] = df.groupby('Id')['Date'].diff()
print (df)
     Id       Date  Quality Time From First Time To Prev
0  A4G8 2016-01-01        2          0 days          NaT
1  A4G8 2016-01-15        3         14 days      14 days
2  A4G8 2016-01-30        6         29 days      15 days
3  P9N3 2017-02-12        1          0 days          NaT
4  P9N3 2017-02-28        5         16 days      16 days
5  P9N3 2017-03-10       10         26 days      10 days
6  P9N3 2019-01-01       10        688 days     662 days
7  C7R5 2018-06-01        2          0 days          NaT
8  L4U7 2019-08-06        2          0 days          NaT
df["Date"] = pd.to_datetime(df.Date)
df = df.merge(
df.groupby(["Id"]).Date.first(),
on="Id",
how="left",
suffixes=["", "_first"]
)
df["Time From First"] = df.Date-df.Date_first
df['Time To Prev'] = df.groupby('Id').Date.diff()
df.set_index(["Id", "Date"], inplace=True)
df
output:
                 Quality Date_first Time From First Time To Prev
Id   Date
A4G8 2016-01-01        2 2016-01-01          0 days          NaT
     2016-01-15        3 2016-01-01         14 days      14 days
     2016-01-30        6 2016-01-01         29 days      15 days
P9N3 2017-02-12        1 2017-02-12          0 days          NaT
     2017-02-28        5 2017-02-12         16 days      16 days
     2017-03-10       10 2017-02-12         26 days      10 days
     2019-01-01       10 2017-02-12        688 days     662 days
C7R5 2018-06-01        2 2018-06-01          0 days          NaT
L4U7 2019-08-06        2 2019-08-06          0 days          NaT

How to find the row with the highest value for each day in pandas and get the categorical percentages?

Here is my dataset:
date              CAT_A  CAT_B  CAT_C
2018-01-01 5:00      12    223    155
2018-01-01 6:00     199     68     72
...
2018-12-31 23:00     56     92    237
The data shows every hour for every day of the year. So I want to know in pandas how I can find the highest-value row for each day, and then get the categorical percentages at that hour. For example, if the highest hour for day 01-01 was 5:00, then CAT_A: 3.08%, CAT_B: 57.18%, CAT_C: 39.74%.
We sum the three columns:
df["sum_categories"] = df.sum(axis=1)
We group on a daily basis and obtain the index of the max daily row:
idx = df.resample("D")["sum_categories"].idxmax()
We select the rows with this index and calculate the proportions (axis=0 divides each row by its own sum):
df.loc[idx, ["CAT_A", "CAT_B", "CAT_C"]].div(df.loc[idx, "sum_categories"].values, axis=0)
Build a Series of row sums, get the index of each day's maximum with idxmax (grouping by pd.Grouper), filter those rows with DataFrame.loc, divide with DataFrame.div, multiply by 100 and round:
#if necessary DatetimeIndex
#df = df.set_index('date')
s = df.sum(axis=1)
idx = s.groupby(pd.Grouper(freq="D")).idxmax()
df = df.loc[idx].div(s.loc[idx], axis=0).mul(100).round(2)
print (df)
                     CAT_A  CAT_B  CAT_C
date
2018-01-01 05:00:00   3.08  57.18  39.74

Elegant way to shift multiple date columns - Pandas

I have a dataframe as shown below:
df = pd.DataFrame({'person_id': [11, 11, 11, 21, 21],
                   'offset': ['-131 days', '29 days', '142 days', '20 days', '-200 days'],
                   'date_1': ['05/29/2017', '01/21/1997', '7/27/1989', '01/01/2013', '12/31/2016'],
                   'dis_date': ['05/29/2017', '01/24/1999', '7/22/1999', '01/01/2015', '12/31/1991'],
                   'vis_date': ['05/29/2018', '01/27/1994', '7/29/2011', '01/01/2018', '12/31/2014']})
df['date_1'] = pd.to_datetime(df['date_1'])
df['dis_date'] = pd.to_datetime(df['dis_date'])
df['vis_date'] = pd.to_datetime(df['vis_date'])
I would like to shift all the dates of each subject based on their offset.
Though my code works (credit - SO), I am looking for a more elegant approach. You can see I am repeating almost the same line three times.
# unit= must not be passed when the input already contains strings like '-131 days'
df['offset_to_shift'] = pd.to_timedelta(df['offset'])
# am trying to make the lines below more elegant/efficient
df['shifted_date_1'] = df['date_1'] + df['offset_to_shift']
df['shifted_dis_date'] = df['dis_date'] + df['offset_to_shift']
df['shifted_vis_date'] = df['vis_date'] + df['offset_to_shift']
I expect my output to look like the result shown below.
Use DataFrame.add along with DataFrame.add_prefix and DataFrame.join:
cols = ['date_1', 'dis_date', 'vis_date']
df = df.join(df[cols].add(df['offset_to_shift'], axis=0).add_prefix('shifted_'))
Or, it is also possible to use pd.concat:
df = pd.concat([df, df[cols].add(df['offset_to_shift'], axis=0).add_prefix('shifted_')], axis=1)
Or, we can directly assign the new shifted columns to the dataframe:
df[['shifted_' + col for col in cols]] = df[cols].add(df['offset_to_shift'], axis=0)
Result:
# print(df)
   person_id    offset     date_1   dis_date   vis_date offset_to_shift shifted_date_1 shifted_dis_date shifted_vis_date
0         11 -131 days 2017-05-29 2017-05-29 2018-05-29       -131 days     2017-01-18       2017-01-18       2018-01-18
1         11   29 days 1997-01-21 1999-01-24 1994-01-27         29 days     1997-02-19       1999-02-22       1994-02-25
2         11  142 days 1989-07-27 1999-07-22 2011-07-29        142 days     1989-12-16       1999-12-11       2011-12-18
3         21   20 days 2013-01-01 2015-01-01 2018-01-01         20 days     2013-01-21       2015-01-21       2018-01-21
4         21 -200 days 2016-12-31 1991-12-31 2014-12-31       -200 days     2016-06-14       1991-06-14       2014-06-14

How to convert this for-loop to a pandas lambda function, to increase speed

This for-loop will take 3 days to complete. How can I increase the speed?
for i in range(df.shape[0]):
    df.loc[df['Creation date'] >= pd.to_datetime(str(df['Original conf GI dte'].iloc[i])), 'delivered'] += df['Sale order item'].iloc[i]
I think the for-loop is enough to understand: if Creation date is greater than or equal to Original conf GI date, then add the Sale order item value to the delivered column.
Each row's date is "Date Accepted" (Date Delivered is a future date). Input is Order Quantity, Date Accepted & Date Delivered; output is the Delivered column.
Order Quantity Date Accepted Date Delivered Delivered
            20    01-05-2010     01-02-2011         0
            10    01-11-2010     01-03-2011         0
           300    01-12-2010     01-09-2011         0
             5    01-03-2011     01-03-2012        30
            20    01-04-2012     01-11-2013       335
            10    01-07-2013     01-12-2014       335
Convert the values to numpy arrays with Series.to_numpy, compare them with broadcasting, pick the order values with numpy.where and finally sum:
import numpy as np

date1 = df['Date Accepted'].to_numpy()
date2 = df['Date Delivered'].to_numpy()
order = df['Order Quantity'].to_numpy()
# older pandas versions
# date1 = df['Date Accepted'].values
# date2 = df['Date Delivered'].values
# order = df['Order Quantity'].values
df['Delivered1'] = np.where(date1[:, None] >= date2, order, 0).sum(axis=1)
print (df)
   Order Quantity Date Accepted Date Delivered  Delivered  Delivered1
0              20    2010-01-05     2011-01-02          0           0
1              10    2010-01-11     2011-01-03          0           0
2             300    2010-01-12     2011-01-09          0           0
3               5    2011-01-03     2012-01-03         30          30
4              20    2012-01-04     2013-01-11        335         335
5              10    2013-01-07     2014-01-12        335         335
If I understand correctly, you can use np.where() for speed. Currently you are looping over the dataframe rows, whereas numpy operations are designed to operate on entire columns:
cond = df['Creation date'].ge(pd.to_datetime(df['Original conf GI dte']))
df['delivered'] = np.where(cond, df['delivered'] + df['Sale order item'], df['delivered'])

Day number of a quarter for a given date in pandas

I have created a dataframe of dates as follows:
import pandas as pd

timespan = 366
# pd.datetime has been removed from pandas; use pd.Timestamp instead
df = pd.DataFrame({'Date': pd.date_range(pd.Timestamp.today(), periods=timespan).tolist()})
I'm struggling to identify the day number in a quarter. For example
date        expected_value
2017-01-01               1  # First day in Q1
2017-01-02               2  # Second day in Q1
2017-02-01              32  # 32nd day in Q1
2017-04-01               1  # First day in Q2
May I have your suggestions? Thank you in advance.
>>> df.assign(
...     days_in_quarter=[(date - ts.start_time).days + 1
...                      for date, ts in zip(df['Date'],
...                                          pd.PeriodIndex(df['Date'], freq='Q'))])
          Date  days_in_quarter
0   2017-01-01                1
1   2017-01-02                2
2   2017-01-03                3
...
363 2017-12-30               91
364 2017-12-31               92
365 2018-01-01                1
This is around 250x faster than Alexander's solution:
df['day_qtr'] = (df.Date - pd.PeriodIndex(df.Date, freq='Q').start_time).dt.days + 1
One way is to create a new df based on dates and a quarter cumcount, then map the values back to the real df, i.e.
timespan = 5000
ndf = pd.DataFrame({'Date': pd.date_range('2015-01-01', periods=timespan).tolist()})
ndf['q'] = ndf['Date'].dt.to_period('Q')
ndf['new'] = ndf.groupby('q').cumcount() + 1
maps = dict(zip(ndf['Date'].dt.date, ndf['new'].values.tolist()))
Map the values
df['expected'] = df.Date.dt.date.map(maps)
Output:
                          Date  expected
0   2017-09-12 09:42:14.324492        74
1   2017-09-13 09:42:14.324492        75
2   2017-09-14 09:42:14.324492        76
3   2017-09-15 09:42:14.324492        77
4   2017-09-16 09:42:14.324492        78
...
143 2018-02-02 09:42:14.324492        33
...
201 2018-04-01 09:42:14.324492         1
Hope it helps.
Start with:
from datetime import datetime

day_of_year = datetime.now().timetuple().tm_yday
from "Convert Year/Month/Day to Day of Year in Python".
You can get the day-of-year of the first day of each quarter the same way, then subtract that value to get the day of the quarter, as in the sketch below.
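A minimal sketch of that idea in plain Python, assuming standard calendar quarters starting in January, April, July and October (the day_of_quarter helper is just an illustration):
from datetime import date

def day_of_quarter(d: date) -> int:
    # first month of d's quarter: 1, 4, 7 or 10
    q_start_month = 3 * ((d.month - 1) // 3) + 1
    q_start = date(d.year, q_start_month, 1)
    # difference of day-of-year values, +1 so the quarter starts at day 1
    return d.timetuple().tm_yday - q_start.timetuple().tm_yday + 1

print(day_of_quarter(date(2017, 2, 1)))  # 32 -> 32nd day of Q1
print(day_of_quarter(date(2017, 4, 1)))  # 1  -> first day of Q2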
