I have created a dataframe of dates as follows:
import pandas as pd
timespan = 366
df = pd.DataFrame({'Date':pd.date_range(pd.Timestamp.today(), periods=timespan).tolist()})
I'm struggling to identify the day number within a quarter. For example:
date expected_value
2017-01-01 1 # First day in Q1
2017-01-02 2 # Second day in Q1
2017-02-01 32 # 32nd day in Q1
2017-04-01 1 # First day in Q2
May I have your suggestions? Thank you in advance.
>>> df.assign(
...     days_in_quarter=[(date - ts.start_time).days + 1
...                      for date, ts in zip(df['Date'],
...                                          pd.PeriodIndex(df['Date'], freq='Q'))])
Date days_in_quarter
0 2017-01-01 1
1 2017-01-02 2
2 2017-01-03 3
...
363 2017-12-30 91
364 2017-12-31 92
365 2018-01-01 1
This is around 250x faster than Alexander's solution:
df['day_qtr']=(df.Date - pd.PeriodIndex(df.Date,freq='Q').start_time).dt.days + 1
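As a quick check of the vectorized idea, here is a minimal sketch run on the four sample dates from the question. The variable names (dates, quarter_start, day_in_quarter) are only illustrative, and it uses the .dt accessor instead of pd.PeriodIndex, which should be equivalent here.

```python
import pandas as pd

# The four sample dates from the question.
dates = pd.Series(pd.to_datetime(['2017-01-01', '2017-01-02',
                                  '2017-02-01', '2017-04-01']))

# First calendar day of the quarter each date falls in.
quarter_start = dates.dt.to_period('Q').dt.start_time

# 1-based day number within the quarter.
day_in_quarter = (dates - quarter_start).dt.days + 1

print(day_in_quarter.tolist())  # [1, 2, 32, 1]
```

The values match the expected_value column in the question.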
One way is to build a lookup dataframe covering the dates, number the days within each quarter with groupby and cumcount, and then map those values onto the real df, i.e.
timespan = 5000
ndf = pd.DataFrame({'Date':pd.date_range('2015-01-01', periods=timespan).tolist()})
ndf['q'] = ndf['Date'].dt.to_period('Q')
ndf['new'] = ndf.groupby('q').cumcount()+1
maps = dict(zip(ndf['Date'].dt.date, ndf['new'].values.tolist()))
Then map the values onto the original dataframe:
df['expected'] = df.Date.dt.date.map(maps)
Output:
Date expected
0 2017-09-12 09:42:14.324492 74
1 2017-09-13 09:42:14.324492 75
2 2017-09-14 09:42:14.324492 76
3 2017-09-15 09:42:14.324492 77
4 2017-09-16 09:42:14.324492 78
.
.
143 2018-02-02 09:42:14.324492 33
.
.
201 2018-04-01 09:42:14.324492 1
Hope it helps.
Start with the day of the year:
day_of_year = datetime.now().timetuple().tm_yday
(from "Convert Year/Month/Day to Day of Year in Python").
You can get the first day of each quarter the same way, then subtract its day-of-year value from the date's day-of-year value and add 1 to get the day of the quarter.
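A minimal sketch of that day-of-year subtraction, using only the standard library. The helper name day_of_quarter is made up for illustration and is not part of any library.

```python
from datetime import date


def day_of_quarter(d: date) -> int:
    """Return the 1-based day number of `d` within its calendar quarter."""
    # First month of the quarter that contains d: 1, 4, 7 or 10.
    first_month = 3 * ((d.month - 1) // 3) + 1
    quarter_start = date(d.year, first_month, 1)
    # Subtract the quarter start's day-of-year from the date's, then add 1.
    return d.timetuple().tm_yday - quarter_start.timetuple().tm_yday + 1


print(day_of_quarter(date(2017, 2, 1)))   # 32
print(day_of_quarter(date(2017, 4, 1)))   # 1
```

On a dataframe this could be applied with df['Date'].apply(day_of_quarter), although a vectorized pandas expression will usually be faster.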
Related
I have 5 years of daily volume data. I want to create a new column in pandas dataframe where it produces the values of YoY Growth for that particular day. For e.g. compares 2018-01-01 with 2017-01-01, and 2019-01-01 compares with 2018-01-01
I have 364 records for each year (except for the year 2020 there are 365 days)
How can I create the column YoY_Growth as below in pandas dataframe.
# It's more convenient to index the dataframe by Date for our algorithm.
df = df.set_index("Date")
is_leap_day = (df.index.month == 2) & (df.index.day == 29)
# Leap day is an edge case, since you can't find Feb 29 of the previous year.
# pandas handles this by shifting to Feb 28 of the previous year:
# 2020-02-29 -> 2019-02-28
# 2020-02-28 -> 2019-02-28
# This creates a duplicate for Feb 28. So we need to handle leap day separately.
volume_last_year = df.loc[~is_leap_day, "Volume"].shift(freq=pd.DateOffset(years=1))
# For non leap days
df["YoY_Growth"] = df["Volume"] / volume_last_year - 1
# For leap days
df.loc[is_leap_day, "YoY_Growth"] = (
df.loc[is_leap_day, "Volume"] / volume_last_year.shift(freq=pd.DateOffset(days=1))
- 1
)
Result (Volume was randomly generated):
df.loc[["2019-01-01", "2019-02-28", "2020-01-01", "2020-02-28", "2020-02-29"], :]
Volume YoY_Growth
Date
2019-01-01 45 NaN
2019-02-28 23 NaN
2020-01-01 10 -0.777778 # = 10 / 45 - 1
2020-02-28 34 0.478261 # = 34 / 23 - 1
2020-02-29 76 2.304348 # = 76 / 23 - 1
I have a dataframe with dates and prices (as below).
df=pd.DataFrame({'date':['2015-01-01','2015-01-02','2015-01-03',
'2016-01-01','2016-01-02','2016-01-03',
'2017-01-01','2017-01-02','2017-01-03',
'2018-01-01','2018-01-02','2018-01-03'],
'price':[78,87,52,94,55,45,68,76,65,75,78,21]
})
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')
select_dates = df.set_index(['date'])
I want to select a range of specific dates to add to a new dataframe. For example, I would like to select prices for the first quarter of 2015 and the first quarter of 2016. I have provided data for a shorter time period for the example, so in this case, I would like to select the first 2 days of 2015 and the first 2 days of 2016.
I would like to end up with a dataframe like this (with date as the index).
date        price
2015-01-01  78
2015-01-02  87
2016-01-01  94
2016-01-02  55
I have been using this method to select dates, but I don't know how to select more than one range at a time:
select_dates2=select_dates.loc['2015-01-01':'2015-01-02']
Another way:
df['date'] = pd.to_datetime(df['date'])
df[df.date.dt.year.isin([2015, 2016]) & df.date.dt.day.lt(3)]
date price
0 2015-01-01 78
1 2015-01-02 87
3 2016-01-01 94
4 2016-01-02 55
One option is to use the dt accessor to select certain years and month-days, then use isin to create a boolean mask to filter df.
df['date'] = pd.to_datetime(df['date'])
out = df[df['date'].dt.year.isin([2015, 2016]) & df['date'].dt.strftime('%m-%d').isin(['01-01','01-02'])]
Output:
date price
0 2015-01-01 78
1 2015-01-02 87
3 2016-01-01 94
4 2016-01-02 55
One option is to build a MultiIndex from the year and day components of the date; this allows for a relatively easy selection on multiple levels (in this case, year and day):
(df
.assign(year = df.date.dt.year, day = df.date.dt.day)
.set_index(['year', 'day'])
.loc(axis = 0)[2015:2016, :2]
)
date price
year day
2015 1 2015-01-01 78
2 2015-01-02 87
2016 1 2016-01-01 94
2 2016-01-02 55
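If you would rather keep the .loc slicing from the question, one more possibility is to slice each range separately and concatenate the pieces. This is only a sketch: the ranges list and the name selected are illustrative, and the data is a shortened copy of the question's.

```python
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2015-01-01', '2015-01-02', '2015-01-03',
                            '2016-01-01', '2016-01-02', '2016-01-03']),
    'price': [78, 87, 52, 94, 55, 45],
})
select_dates = df.set_index('date')

# Each tuple is an inclusive (start, end) range; add as many as needed.
ranges = [('2015-01-01', '2015-01-02'), ('2016-01-01', '2016-01-02')]

# Slice every range with .loc and stack the results into one dataframe.
selected = pd.concat([select_dates.loc[start:end] for start, end in ranges])
print(selected)
```

For real data, the tuples could be quarter boundaries such as ('2015-01-01', '2015-03-31').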
I have a dataframe like as shown below
df = pd.DataFrame({'person_id': [11,11,11,21,21],
'offset' :['-131 days','29 days','142 days','20 days','-200 days'],
'date_1': ['05/29/2017', '01/21/1997', '7/27/1989','01/01/2013','12/31/2016'],
'dis_date': ['05/29/2017', '01/24/1999', '7/22/1999','01/01/2015','12/31/1991'],
'vis_date':['05/29/2018', '01/27/1994', '7/29/2011','01/01/2018','12/31/2014']})
df['date_1'] = pd.to_datetime(df['date_1'])
df['dis_date'] = pd.to_datetime(df['dis_date'])
df['vis_date'] = pd.to_datetime(df['vis_date'])
I would like to shift all the dates of each subject based on their offset.
Though my code works (credit: SO), I am looking for a more elegant approach. As you can see, I am repeating almost the same line three times.
df['offset_to_shift'] = pd.to_timedelta(df['offset'])  # the strings already carry the unit ("days")
#am trying to make the below lines elegant/efficient
df['shifted_date_1'] = df['date_1'] + df['offset_to_shift']
df['shifted_dis_date'] = df['dis_date'] + df['offset_to_shift']
df['shifted_vis_date'] = df['vis_date'] + df['offset_to_shift']
I expect my output to be like as shown below
Use DataFrame.add along with DataFrame.add_prefix and DataFrame.join:
cols = ['date_1', 'dis_date', 'vis_date']
df = df.join(df[cols].add(df['offset_to_shift'], 0).add_prefix('shifted_'))
OR, it is also possible to use pd.concat:
df = pd.concat([df, df[cols].add(df['offset_to_shift'], 0).add_prefix('shifted_')], axis=1)
OR, we can also directly assign the new shifted columns to the dataframe:
df[['shifted_' + col for col in cols]] = df[cols].add(df['offset_to_shift'], 0)
Result:
# print(df)
person_id offset date_1 dis_date vis_date offset_to_shift shifted_date_1 shifted_dis_date shifted_vis_date
0 11 -131 days 2017-05-29 2017-05-29 2018-05-29 -131 days 2017-01-18 2017-01-18 2018-01-18
1 11 29 days 1997-01-21 1999-01-24 1994-01-27 29 days 1997-02-19 1999-02-22 1994-02-25
2 11 142 days 1989-07-27 1999-07-22 2011-07-29 142 days 1989-12-16 1999-12-11 2011-12-18
3 21 20 days 2013-01-01 2015-01-01 2018-01-01 20 days 2013-01-21 2015-01-21 2018-01-21
4 21 -200 days 2016-12-31 1991-12-31 2014-12-31 -200 days 2016-06-14 1991-06-14 2014-06-14
I have a dataset of user IDs and all the times they used a particular pass. I need to find out how many days have passed since each of them first used the pass. I was thinking of running through the dataset, storing the first use in a dictionary, and subtracting it from today's date. I can't seem to get it to work.
Userid Start use Day
1712 2019-01-04 Friday
1712 2019-01-05 Saturday
9050 2019-01-04 Friday
9050 2019-01-04 Friday
9050 2019-01-06 Sunday
9409 2019-01-05 Saturday
9683 2019-05-20 Monday
8800 2019-05-17 Friday
8800 2019-05-17 Friday
This is part of the dataset. The date format is Y-m-d.
usedict={}
keys = df.user_id
values = df.start_date
for i in keys:
    if (usedict[i] == keys):
        continue
    else:
        usedict[i] = values[i]
prints(usedict)
user_id use_count days_used Ave Daily Trips register_date days_since_reg
12 42 23 1.826087 NaT NaT
17 28 13 2.153846 NaT NaT
114 54 24 2.250000 2019-02-04 107 days
169 31 17 1.823529 NaT NaT
1414 49 20 2.450000 NaT NaT
1712 76 34 2.235294 NaT NaT
2388 24 12 2.000000 NaT NaT
6150 10 5 2.000000 2019-02-05 106 days
You can achieve what you want with the following. I have used only 2 user ids from the example given by you, but the same will apply to all.
import pandas as pd
import datetime
df = pd.DataFrame([{'Userid':'1712','use_date':'2019-01-04'},
{'Userid':'1712','use_date':'2019-01-05'},
{'Userid':'9050','use_date':'2019-01-04'},
{'Userid':'9050','use_date':'2019-01-04'},
{'Userid':'9050','use_date':'2019-01-06'}])
df.use_date = pd.to_datetime(df.use_date).dt.date
group_df = df.sort_values(by='use_date').groupby('Userid', as_index=False).agg({'use_date':'first'}).rename(columns={'use_date':'first_use_date'})
group_df['diff_from_today'] = datetime.datetime.today().date() - group_df.first_use_date
The output is:
print(group_df)
Userid first_use_date diff_from_today
0 1712 2019-01-04 139 days
1 9050 2019-01-04 139 days
Check sort_values and groupby for more details.
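For comparison, here is a shorter sketch of the same idea that takes the minimum date per user and returns a dictionary, as the question intended. The column names user_id and start_date follow the question's code. The fixed reference date is an assumption made only so the output is reproducible; in practice you would use pd.Timestamp.today().normalize().

```python
import pandas as pd

df = pd.DataFrame({
    'user_id': [1712, 1712, 9050, 9050, 9050, 9409, 9683, 8800, 8800],
    'start_date': pd.to_datetime(['2019-01-04', '2019-01-05', '2019-01-04',
                                  '2019-01-04', '2019-01-06', '2019-01-05',
                                  '2019-05-20', '2019-05-17', '2019-05-17']),
})

# Assumption: a fixed "today" so the numbers below do not change over time.
reference = pd.Timestamp('2019-05-23')

# Earliest use per user, then whole days between that date and the reference.
first_use = df.groupby('user_id')['start_date'].min()
days_since_first_use = (reference - first_use).dt.days

result = days_since_first_use.to_dict()
print(result)
```

With this reference date, users 1712 and 9050 both come out at 139 days, which agrees with the output shown above.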
I am only looking at two columns, but you could find the min for each id with groupby and then use apply to get the difference (I have computed the difference in days):
import pandas as pd
import datetime
user_id = [1712, 1712, 9050, 9050, 9050, 9409, 9683, 8800, 8800]
start = ['2019-01-04', '2019-01-05', '2019-01-04', '2019-01-04', '2019-01-06', '2019-01-05', '2019-05-20', '2019-05-17', '2019-05-17']
df = pd.DataFrame(list(zip(user_id, start)), columns = ['UserId', 'Start'])
df['Start']= pd.to_datetime(df['Start'])
# pd.np was removed in pandas 2.0; the string 'min' gives the same aggregation
df = df.groupby('UserId')['Start'].agg(['min'])
now = datetime.datetime.now()
df['days'] = df['min'].apply(lambda x: (now - x).days)
a_dict = pd.Series(df.days.values,index = df.index).to_dict()
print(a_dict)
References:
to_dict() method taken from @jeff
Output:
I'm trying to figure out how to add 3 months to a date in a Pandas dataframe, while keeping it in the date format, so I can use it to look up a range.
This is what I've tried:
#create dataframe
df = pd.DataFrame([pd.Timestamp('20161011'),
pd.Timestamp('20161101') ], columns=['date'])
#create a future month period
plus_month_period = 3
#calculate date + future period
df['future_date'] = plus_month_period.astype("timedelta64[M]")
However, I get the following error:
AttributeError: 'int' object has no attribute 'astype'
You could use pd.DateOffset
In [1756]: df.date + pd.DateOffset(months=plus_month_period)
Out[1756]:
0 2017-01-11
1 2017-02-01
Name: date, dtype: datetime64[ns]
Details
In [1757]: df
Out[1757]:
date
0 2016-10-11
1 2016-11-01
In [1758]: plus_month_period
Out[1758]: 3
Suppose you have a dataframe of the following format, where you have to add integer months to a date column.
Start_Date  Months_to_add
2014-06-01  23
2014-06-01  4
2000-10-01  10
2016-07-01  3
2017-12-01  90
2019-01-01  2
In such a scenario, Zero's code or mattblack's code won't help directly, because the number of months differs from row to row. One option is to apply a function over the rows with a lambda, where the function takes 2 arguments:
A date to which the months are added
The number of months, as an integer
You can use the following function:
# Importing required modules
from dateutil.relativedelta import relativedelta
# Defining the function
def add_months(start_date, delta_period):
end_date = start_date + relativedelta(months=delta_period)
return end_date
After this you can use the following code snippet to add months to the Start_Date column. It uses progress_apply, which tqdm adds to pandas once tqdm.pandas() has been called; a plain df.apply works the same way without the progress bar. Refer to this Stackoverflow answer on progress_apply: Progress indicator during pandas operations.
from tqdm import tqdm
tqdm.pandas()
df["End_Date"] = df.progress_apply(lambda row: add_months(row["Start_Date"], row["Months_to_add"]), axis = 1)
Here's the full code from dataset creation, for your reference:
import pandas as pd
from dateutil.relativedelta import relativedelta
from tqdm import tqdm
tqdm.pandas()
# Initialize a new dataframe
df = pd.DataFrame()
# Add Start Date column
df["Start_Date"] = ['2014-06-01T00:00:00.000000000',
'2014-06-01T00:00:00.000000000',
'2000-10-01T00:00:00.000000000',
'2016-07-01T00:00:00.000000000',
'2017-12-01T00:00:00.000000000',
'2019-01-01T00:00:00.000000000']
# To convert the date column to a datetime format
df["Start_Date"] = pd.to_datetime(df["Start_Date"])
# Add months column
df["Months_to_add"] = [23, 4, 10, 3, 90, 2]
# Defining the Add Months function
def add_months(start_date, delta_period):
end_date = start_date + relativedelta(months=delta_period)
return end_date
# Apply function on the dataframe using lambda operation.
df["End_Date"] = df.progress_apply(lambda row: add_months(row["Start_Date"], row["Months_to_add"]), axis = 1)
You will have the final output dataframe as follows.
Start_Date  Months_to_add  End_Date
2014-06-01  23             2016-05-01
2014-06-01  4              2014-10-01
2000-10-01  10             2001-08-01
2016-07-01  3              2016-10-01
2017-12-01  90             2025-06-01
2019-01-01  2              2019-03-01
Please add to comments if there are any issues with the above code.
All the best!
I believe the simplest and fastest way to solve this is to convert the dates to monthly periods with .dt.to_period('M'), add the values of the Months_to_add column, and then convert the result back to datetime with .dt.to_timestamp().
Using the sample data created by @Aruparna Maity:
Start_Date  Months_to_add
2014-06-01  23
2014-06-20  4
2000-10-01  10
2016-07-05  3
2017-12-15  90
2019-01-01  2
df['End_Date'] = ((df['Start_Date'].dt.to_period('M')) + df['Months_to_add']).dt.to_timestamp()
df.head(6)
#output
Start_Date Months_to_add End_Date
0 2014-06-01 23 2016-05-01
1 2014-06-20 4 2014-10-01
2 2000-10-01 10 2001-08-01
3 2016-07-05 3 2016-10-01
4 2017-12-15 90 2025-06-01
5 2019-01-01 2 2019-03-01
If the exact day is needed, repeat the process with daily periods, adding back the original day of the month:
df['End_Date'] = ((df['End_Date'].dt.to_period('D')) + df['Start_Date'].dt.day -1).dt.to_timestamp()
#output:
Start_Date Months_to_add End_Date
0 2014-06-01 23 2016-05-01
1 2014-06-20 4 2014-10-20
2 2000-10-01 10 2001-08-01
3 2016-07-05 3 2016-10-05
4 2017-12-15 90 2025-06-15
5 2019-01-01 2 2019-03-01
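One caveat with the day-offset step: for a start date near the end of a month (say the 31st) and a shorter target month, adding the day count rolls over into the following month. If you want the result clipped to the last valid day instead, a possible sketch is to add a pd.DateOffset per row. The third row below (2019-01-31 plus 1 month) is a hypothetical example added to show the clipping; it is not in the original data.

```python
import pandas as pd

df = pd.DataFrame({
    'Start_Date': pd.to_datetime(['2014-06-20', '2016-07-05', '2019-01-31']),
    'Months_to_add': [4, 3, 1],
})

# Add a calendar-aware offset row by row. DateOffset(months=n) keeps the day
# of the month and clips it to the last day when the target month is shorter.
df['End_Date'] = [
    start + pd.DateOffset(months=int(n))
    for start, n in zip(df['Start_Date'], df['Months_to_add'])
]

print(df)
```

This loops in Python, so it will be slower than the period arithmetic on large frames; the trade-off is calendar-correct end-of-month handling.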
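Another way is to use numpy's timedelta64. Note that this is not a calendar-month shift: as the output shows, older pandas versions treated the 'M' unit as an average month length (about 30.44 days), and newer pandas versions may reject the unit altogether.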
df['date'] + np.timedelta64(plus_month_period, 'M')
0 2017-01-10 07:27:18
1 2017-01-31 07:27:18
Name: date, dtype: datetime64[ns]