What formula can I use to determine, based on a given date, the pay period that falls between a start date and an end date? In other words, I have a date in df1 that I need to compare against the start and end dates in df2, and then produce the pay period those dates fall within in a new data set.
What I've tried so far is below. Keep in mind I'm not a seasoned programmer:
1st try:
def calc(row):
    if pf2(row['actn_dt']) >= pp_calendar([0], and pf2(row['actn_dt']) <= df2([1]):
        return df2([2,3])
2nd try:
pf2['pay'] = np.where (pf2['actn_dt'] >= df2[0]) | (pf2['actn_dt'] <= pf2[1]), pp_calendar[2]
3rd try:
def calc(row):
    if pf2(row['actn_dt']) >= df2(row[1]) | pf2(row['actn_dt']) <= df2(row[2]):
        return df2(row[3])
pf2['pay'] = pf2.apply(lambda row: calc(row), axis=1)
print df:
actn_dt
16 2008-09-28 00:00:00
17 2008-03-16 00:00:00
18 2009-08-30 00:00:00
43 2008-06-22 00:00:00
89 2009-08-16 00:00:00
106 2009-03-29 00:00:00
244 2009-08-30 00:00:00
371 2009-09-13 00:00:00
400 2009-07-19 00:00:00
439 2007-12-23 00:00:00
print df2:
START_DATE END_DATE PAY_PERIOD CALENDAR_YEAR
0 2008-09-28 2008-10-11 10 2008
1 2008-03-16 2008-03-16 06 2008
2 2009-08-30 2009-09-12 18 2009
3 2008-06-22 2008-06-22 13 2008
4 2009-03-29 2009-04-11 07 2009
Expected Result:
actn_dt START_DATE END_DATE PAY_PERIOD CALENDAR_YEAR
16 2008-09-28 2008-09-28 2008-10-11 10 2008
17 2008-03-16 2008-03-16 2008-03-29 06 2008
18 2009-08-30 2009-08-30 2009-09-12 18 2009
43 2008-06-22 2008-06-22 2008-07-05 13 2008
89 2009-08-16 2009-08-16 2009-08-29 17 2009
106 2009-03-29 2009-03-29 2009-04-11 07 2009
244 2009-08-30 2009-08-30 2009-09-12 18 2009
Thank you for your knowledge and time!
Solution
def find_pay_period(date, df):
    df = df[(df.START_DATE <= date) & (date <= df.END_DATE)].iloc[0, :]
    df['actn_dt'] = date
    return df

df1.actn_dt.apply(lambda x: find_pay_period(x, df2))
Explanation
Start with:
# apply() will take date in df1 and find the first row in df2
# such that the date is between START_DATE and END_DATE and
# then return the row.
df1.actn_dt.apply(lambda x: find_pay_period(x, df2))
Now this:
def find_pay_period(date, df):
    # df[] uses a boolean mask to filter.
    # .iloc[0, :] grabs the first row of the filtered DataFrame.
    # Keep in mind this is a Series.
    df = df[(df.START_DATE <= date) & (date <= df.END_DATE)].iloc[0, :]
    # add back your date
    df['actn_dt'] = date
    return df
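For larger frames, the row-by-row apply can be replaced with a single pd.merge_asof join. This is only a sketch, under the assumption that df2's pay periods are sorted and non-overlapping; the sample data below is made up from the question's tables:

```python
import pandas as pd

# Made-up sample frames mirroring the question's df and df2.
df = pd.DataFrame({"actn_dt": pd.to_datetime(["2008-09-28", "2009-08-30"])})
df2 = pd.DataFrame({
    "START_DATE": pd.to_datetime(["2008-09-28", "2009-08-30"]),
    "END_DATE": pd.to_datetime(["2008-10-11", "2009-09-12"]),
    "PAY_PERIOD": ["10", "18"],
    "CALENDAR_YEAR": [2008, 2009],
})

# merge_asof pairs each actn_dt with the latest START_DATE <= actn_dt;
# both frames must be sorted on their join keys.
out = pd.merge_asof(
    df.sort_values("actn_dt"),
    df2.sort_values("START_DATE"),
    left_on="actn_dt",
    right_on="START_DATE",
)
# Discard dates that fall after the matched period's END_DATE.
out = out[out["actn_dt"] <= out["END_DATE"]]
```

A date that falls in no period simply drops out of the result, instead of raising the IndexError the iloc[0, :] lookup would.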
Related
I have 5 years of daily volume data. I want to create a new column in a pandas dataframe that holds the YoY growth for that particular day, e.g. 2018-01-01 is compared with 2017-01-01, and 2019-01-01 with 2018-01-01.
I have 364 records for each year (except for 2020, which has 365 days).
How can I create the YoY_Growth column shown below in a pandas dataframe?
# It's more convenient to index the dataframe with the Date for our algorithm.
df = df.set_index("Date")
is_leap_day = (df.index.month == 2) & (df.index.day == 29)
# Leap day is an edge case, since you can't find Feb 29 of the previous year.
# pandas handles this by shifting to Feb 28 of the previous year:
# 2020-02-29 -> 2019-02-28
# 2020-02-28 -> 2019-02-28
# This creates a duplicate for Feb 28. So we need to handle leap day separately.
volume_last_year = df.loc[~is_leap_day, "Volume"].shift(freq=pd.DateOffset(years=1))
# For non leap days
df["YoY_Growth"] = df["Volume"] / volume_last_year - 1
# For leap days
df.loc[is_leap_day, "YoY_Growth"] = (
df.loc[is_leap_day, "Volume"] / volume_last_year.shift(freq=pd.DateOffset(days=1))
- 1
)
Result (Volume was randomly generated):
df.loc[["2019-01-01", "2019-02-28", "2020-01-01", "2020-02-28", "2020-02-29"], :]
Volume YoY_Growth
Date
2019-01-01 45 NaN
2019-02-28 23 NaN
2020-01-01 10 -0.777778 # = 10 / 45 - 1
2020-02-28 34 0.478261 # = 34 / 23 - 1
2020-02-29 76 2.304348 # = 76 / 23 - 1
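For reference, the answer's steps can be run end-to-end on a made-up series (volumes 1, 2, 3, ... so the growth values are easy to verify by hand):

```python
import numpy as np
import pandas as pd

# Synthetic daily data covering 2019-2020, including leap day 2020-02-29.
dates = pd.date_range("2019-01-01", "2020-12-31", freq="D")
df = pd.DataFrame({"Date": dates, "Volume": np.arange(len(dates)) + 1}).set_index("Date")

is_leap_day = (df.index.month == 2) & (df.index.day == 29)
# Shift non-leap days forward one year so each date lines up with last year's volume.
volume_last_year = df.loc[~is_leap_day, "Volume"].shift(freq=pd.DateOffset(years=1))
df["YoY_Growth"] = df["Volume"] / volume_last_year - 1
# Leap days compare against Feb 28 of the previous year.
df.loc[is_leap_day, "YoY_Growth"] = (
    df.loc[is_leap_day, "Volume"]
    / volume_last_year.shift(freq=pd.DateOffset(days=1))
    - 1
)
```

Here 2020-01-01 (volume 366) against 2019-01-01 (volume 1) gives a growth of 365.0, and 2019 rows are NaN since there is no prior year in the data.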
I would like to create two columns "Year" and "Month" from a Date column that contains different year and month arrangements. Some are YY-Mmm and the others are Mmm-YY.
import pandas as pd
dataSet = {
"Date": ["18-Jan", "18-Jan", "18-Feb", "18-Feb", "Oct-17", "Oct-17"],
"Quantity": [3476, 20, 789, 409, 81, 640],
}
df = pd.DataFrame(dataSet, columns=["Date", "Quantity"])
My attempt is as follows:
Date1 = []
Date2 = []
for dt in df.Date:
    Date1.append(dt.split("-")[0])
    Date2.append(dt.split("-")[1])

Year = []
try:
    for yr in Date1:
        Year.append(int(yr.Date1))
except:
    for yr in Date2:
        Year.append(int(yr.Date2))
You can make use of the extract dataframe string method to split the date strings up. Since the year can precede or follow the month, we can get a bit creative and have a Year1 column and a Year2 column for either position, then use np.where to create a single Year column that pulls from whichever of those is populated.
For example:
import numpy as np
split_dates = df["Date"].str.extract(r"(?P<Year1>\d+)?-?(?P<Month>\w+)-?(?P<Year2>\d+)?")
split_dates["Year"] = np.where(
split_dates["Year1"].notna(),
split_dates["Year1"],
split_dates["Year2"],
)
split_dates = split_dates[["Year", "Month"]]
With result for split_dates:
Year Month
0 18 Jan
1 18 Jan
2 18 Feb
3 18 Feb
4 17 Oct
5 17 Oct
Then you can merge back with your original dataframe with pd.merge, like so:
pd.merge(df, split_dates, how="inner", left_index=True, right_index=True)
Which yields:
Date Quantity Year Month
0 18-Jan 3476 18 Jan
1 18-Jan 20 18 Jan
2 18-Feb 789 18 Feb
3 18-Feb 409 18 Feb
4 Oct-17 81 17 Oct
5 Oct-17 640 17 Oct
Thank you for your help. I managed to get it working with what I've learned so far (for loops, if-else, and split()) and with the help of another expert.
# Split the Date column and store the parts in a list
dA = []
for dP in df.Date:
    dA.append(dP.split("-"))

# Append month and year to the respective lists based on if conditions
Month = []
Year = []
for moYr in dA:
    if len(moYr[0]) == 2:
        Month.append(moYr[1])
        Year.append(moYr[0])
    else:
        Month.append(moYr[0])
        Year.append(moYr[1])
This took me hours!
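To round that off, the two lists can be attached back to the frame as columns. A small self-contained sketch of the same loop (using isdigit for the "year first" test, so a two-letter month could never be mistaken for a year):

```python
import pandas as pd

# The question's sample data.
df = pd.DataFrame({
    "Date": ["18-Jan", "18-Jan", "18-Feb", "18-Feb", "Oct-17", "Oct-17"],
    "Quantity": [3476, 20, 789, 409, 81, 640],
})

Month, Year = [], []
for dP in df.Date:
    part = dP.split("-")
    if part[0].isdigit():   # year first, e.g. "18-Jan"
        Year.append(part[0])
        Month.append(part[1])
    else:                   # month first, e.g. "Oct-17"
        Month.append(part[0])
        Year.append(part[1])

df["Year"] = Year
df["Month"] = Month
```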
Try using Python datetime strptime(<date>, "%y-%b") on the date column to convert it to a Python datetime.
from datetime import datetime

def parse_dt(x):
    try:
        return datetime.strptime(x, "%y-%b")
    except ValueError:
        return datetime.strptime(x, "%b-%y")

df['timestamp'] = df['Date'].apply(parse_dt)
df
Date Quantity timestamp
0 18-Jan 3476 2018-01-01
1 18-Jan 20 2018-01-01
2 18-Feb 789 2018-02-01
3 18-Feb 409 2018-02-01
4 Oct-17 81 2017-10-01
5 Oct-17 640 2017-10-01
Then you can just use .month and .year attributes, or if you prefer the month as its abbreviated form, use Python datetime.strftime('%b').
df['year'] = df.timestamp.apply(lambda x: x.year)
df['month'] = df.timestamp.apply(lambda x: x.strftime('%b'))
df
Date Quantity timestamp year month
0 18-Jan 3476 2018-01-01 2018 Jan
1 18-Jan 20 2018-01-01 2018 Jan
2 18-Feb 789 2018-02-01 2018 Feb
3 18-Feb 409 2018-02-01 2018 Feb
4 Oct-17 81 2017-10-01 2017 Oct
5 Oct-17 640 2017-10-01 2017 Oct
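The same two-format fallback can also be done without apply, using pd.to_datetime with errors='coerce' to turn non-matching rows into NaT and then combining the two parses. A sketch on a trimmed copy of the sample data:

```python
import pandas as pd

df = pd.DataFrame({"Date": ["18-Jan", "18-Feb", "Oct-17"]})

# Parse with each candidate format; rows that don't match become NaT.
yb = pd.to_datetime(df["Date"], format="%y-%b", errors="coerce")
by = pd.to_datetime(df["Date"], format="%b-%y", errors="coerce")
df["timestamp"] = yb.fillna(by)
```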
I have a DataFrame in this way:
shop_id item_price item_cnt_day day month year
59 9.00 1.0 02 01 2013
59 8.00 2.0 02 01 2013
25 10.00 4.0 05 02 2013
25 17.00 1.0 06 01 2013
25 10.00 1.0 15 01 2013
And I try to get the result like following DataFrame:
shop_id all_revenue month year
59 25.00 01 2013
25 27.00 01 2013
I mean I want to get each shop's revenue in January 2013.
BUT, I don't know how to code in Pandas. Any help would be appreciated.
eval + groupby + sum
You can assign a series via eval, then use groupby:
res = df.eval('revenue=item_price * item_cnt_day')\
.groupby(['shop_id', 'month', 'year'], as_index=False)['revenue'].sum()
You can, if you wish, query for January 2013 (before or after the above operations):
res = res.query('month == 1 & year == 2013')
print(res)
shop_id month year revenue
0 25 1 2013 27.0
2 59 1 2013 25.0
I like filtering the dataframe first, to reduce the number of unnecessary calculations:
df.query('month == 1 and year == 2013')\
    .assign(all_revenue=lambda d: d.item_price * d.item_cnt_day)\
    .groupby(['shop_id', 'month', 'year'], as_index=False)['all_revenue'].sum()
Output:
shop_id month year all_revenue
0 25 1 2013 27.0
1 59 1 2013 25.0
Note: Because your column names are "friendly" (no spaces or special characters), you can use the query method. If that doesn't hold for your column names, you need to use boolean indexing instead.
df[(df['month'] == 1) & (df['year'] == 2013)]\
    .assign(all_revenue=lambda d: d.item_price * d.item_cnt_day)\
    .groupby(['shop_id', 'month', 'year'], as_index=False)['all_revenue'].sum()
I have created a dataframe of dates as follows:
import pandas as pd
timespan = 366
df = pd.DataFrame({'Date':pd.date_range(pd.datetime.today(), periods=timespan).tolist()})
I'm struggling to identify the day number in a quarter. For example
date expected_value
2017-01-01 1 # First day in Q1
2017-01-02 2 # Second day in Q1
2017-02-01 32 # 32nd day in Q1
2017-04-01 1 # First day in Q2
May I have your suggestions? Thank you in advance.
>>> df.assign(
        days_in_quarter=[(date - ts.start_time).days + 1
                         for date, ts in zip(df['Date'],
                                             pd.PeriodIndex(df['Date'], freq='Q'))])
          Date  days_in_quarter
0 2017-01-01 1
1 2017-01-02 2
2 2017-01-03 3
...
363 2017-12-30 91
364 2017-12-31 92
365 2018-01-01 1
This is around 250x faster than Alexander's solution:
df['day_qtr']=(df.Date - pd.PeriodIndex(df.Date,freq='Q').start_time).dt.days + 1
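As a quick sanity check of that approach against the question's expected values, written with the equivalent .dt.to_period accessor on a handful of fixed dates:

```python
import pandas as pd

# The example dates from the question.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2017-01-01", "2017-01-02", "2017-02-01", "2017-04-01"])
})
# Subtract each date's quarter start; +1 makes the first day of a quarter day 1.
df["day_qtr"] = (df.Date - df.Date.dt.to_period("Q").dt.start_time).dt.days + 1
```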
One way is to build a new df of dates with a per-quarter cumcount, then map those values onto the real df, i.e.:
timespan = 5000
ndf = pd.DataFrame({'Date':pd.date_range('2015-01-01', periods=timespan).tolist()})
ndf['q'] = ndf['Date'].dt.to_period('Q')
ndf['new'] = ndf.groupby('q').cumcount()+1
maps = dict(zip(ndf['Date'].dt.date, ndf['new'].values.tolist()))
Map the values
df['expected'] = df.Date.dt.date.map(maps)
Output:
Date expected
0 2017-09-12 09:42:14.324492 74
1 2017-09-13 09:42:14.324492 75
2 2017-09-14 09:42:14.324492 76
3 2017-09-15 09:42:14.324492 77
4 2017-09-16 09:42:14.324492 78
.
.
143 2018-02-02 09:42:14.324492 33
.
.
201 2018-04-01 09:42:14.324492 1
Hope it helps.
Start with:
day_of_year = datetime.now().timetuple().tm_yday
from
Convert Year/Month/Day to Day of Year in Python
You can get the first day of each quarter that way, then subtract its value to get the day of the quarter.
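That bracket-and-subtract idea can be sketched with the standard library alone (the helper name here is made up):

```python
from datetime import date

def day_of_quarter(d: date) -> int:
    # The first month of d's quarter is 1, 4, 7 or 10.
    q_start = date(d.year, 3 * ((d.month - 1) // 3) + 1, 1)
    # Subtract the quarter start; +1 makes the first day of the quarter day 1.
    return (d - q_start).days + 1
```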
I have a df1 that I want to group, with the following columns:
retailer object
store_id int64
visit_date datetime64[ns]
categories_euclidean float64
probes_euclidean float64
I want to group it by the following df2:
pk start_date end_date
name
Cycle 01 1 2016-02-24 2016-03-13
Cycle 02 2 2016-03-14 2016-03-27
Cycle 03 3 2016-03-28 2016-04-10
Cycle 04 4 2016-04-11 2016-04-24
Cycle 05 5 2016-04-25 2016-05-08
Cycle 06 6 2016-05-09 2016-05-22
The filtering condition is df2.start_date <= df1.visit_date <= df2.end_date,
and the group names should come from df2's name index.
Any idea how to do so?
Try:
def filt(x):
    cond1 = df2.start_date <= x
    cond2 = df2.end_date >= x
    return df2[cond1 & cond2].index[0]

df1['name'] = df1.visit_date.apply(filt)
df1.groupby('name')
I couldn't play around with this. But it might work.
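Since the cycles don't overlap, an alternative (not from the answer above) is to label every visit in one shot with pd.IntervalIndex.get_indexer. A sketch on made-up fragments of df1 and df2:

```python
import pandas as pd

# Made-up fragments of the question's frames.
df2 = pd.DataFrame(
    {
        "start_date": pd.to_datetime(["2016-02-24", "2016-03-14"]),
        "end_date": pd.to_datetime(["2016-03-13", "2016-03-27"]),
    },
    index=pd.Index(["Cycle 01", "Cycle 02"], name="name"),
)
df1 = pd.DataFrame({"visit_date": pd.to_datetime(["2016-03-01", "2016-03-20"])})

# Build closed intervals from the cycle bounds; get_indexer returns, for each
# visit, the position of the interval containing it. This assumes every visit
# falls inside some cycle (unmatched dates would come back as -1).
iv = pd.IntervalIndex.from_arrays(df2["start_date"], df2["end_date"], closed="both")
df1["name"] = df2.index[iv.get_indexer(df1["visit_date"])]
grouped = df1.groupby("name")
```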