I have 5 years of daily volume data. I want to create a new column in pandas dataframe where it produces the values of YoY Growth for that particular day. For e.g. compares 2018-01-01 with 2017-01-01, and 2019-01-01 compares with 2018-01-01
I have 364 records for each year (except for the year 2020 there are 365 days)
How can I create the column YoY_Growth as below in pandas dataframe.
# It's more convenient to index the dataframe with the Date for our algorith,
df = df.set_index("Date")
is_leap_day = (df.index.month == 2) & (df.index.day == 29)
# Leap day is an edge case, since you can't find Feb 29 of the previous year.
# pandas handles this by shifting to Feb 28 of the previous year:
# 2020-02-29 -> 2019-02-28
# 2020-02-28 -> 2019-02-28
# This creates a duplicate for Feb 28. So we need to handle leap day separately.
volume_last_year = df.loc[~is_leap_day, "Volume"].shift(freq=pd.DateOffset(years=1))
# For non leap days
df["YoY_Growth"] = df["Volume"] / volume_last_year - 1
# For leap days
df.loc[is_leap_day, "YoY_Growth"] = (
df.loc[is_leap_day, "Volume"] / volume_last_year.shift(freq=pd.DateOffset(days=1))
- 1
)
Result (Volume was randomly generated):
df.loc[["2019-01-01", "2019-02-28", "2020-01-01", "2020-02-28", "2020-02-29"], :]
Volume YoY_Growth
Date
2019-01-01 45 NaN
2019-02-28 23 NaN
2020-01-01 10 -0.777778 # = 10 / 45 - 1
2020-02-28 34 0.478261 # = 34 / 23 - 1
2020-02-29 76 2.304348 # = 76 / 23 - 1
Related
I am trying to create 2 columns based of a column that contains numerical values.
Value
0
4
10
24
null
49
Expected Output:
Value Day Hour
0 Sunday 12:00am
4 Sunday 4:00am
10 Sunday 10:00am
24 Monday 12:00am
null No Day No Time
49 Tuesday 1:00am
Continued.....
Code I am trying out:
value = df.value.unique()
Sunday_Starting_Point = pd.to_datetime('Sunday 2015')
(Sunday_Starting_Point + pd.to_timedelta(Value, 'h')).dt.strftime('%A %I:%M%P')
Thanks for looking!
I think unique values are not necessary, you can use 2 times dt.strftime for 2 columns with replace with NaT values:
Sunday_Starting_Point = pd.to_datetime('Sunday 2015')
x = pd.to_numeric(df.Value, errors='coerce')
s = Sunday_Starting_Point + pd.to_timedelta(x, unit='h')
df['Day'] = s.dt.strftime('%A').replace('NaT','No Day')
df['Hour'] = s.dt.strftime('%I:%M%p').replace('NaT','No Time')
print (df)
Value Day Hour
0 0.0 Sunday 12:00AM
1 4.0 Sunday 04:00AM
2 10.0 Sunday 10:00AM
3 24.0 Monday 12:00AM
4 NaN No Day No Time
5 49.0 Tuesday 01:00AM
I would like to create two columns "Year" and "Month" from a Date column that contains different year and month arrangements. Some are YY-Mmm and the others are Mmm-YY.
import pandas as pd
dataSet = {
"Date": ["18-Jan", "18-Jan", "18-Feb", "18-Feb", "Oct-17", "Oct-17"],
"Quantity": [3476, 20, 789, 409, 81, 640],
}
df = pd.DataFrame(dataSet, columns=["Date", "Quantity"])
My attempt is as follows:
Date1 = []
Date2 = []
for dt in df.Date:
Date1.append(dt.split("-")[0])
Date2.append(dt.split("-")[1])
Year = []
try:
for yr in Date1:
Year.append(int(yr.Date1))
except:
for yr in Date2:
Year.append(int(yr.Date2))
You can make use of the extract dataframe string method to split the date strings up. Since the year can precede or follow the month, we can get a bit creative and have a Year1 column and Year2 columns for either position. Then use np.where to create a single Year column pulls from each of these other year columns.
For example:
import numpy as np
split_dates = df["Date"].str.extract(r"(?P<Year1>\d+)?-?(?P<Month>\w+)-?(?P<Year2>\d+)?")
split_dates["Year"] = np.where(
split_dates["Year1"].notna(),
split_dates["Year1"],
split_dates["Year2"],
)
split_dates = split_dates[["Year", "Month"]]
With result for split_dates:
Year Month
0 18 Jan
1 18 Jan
2 18 Feb
3 18 Feb
4 17 Oct
5 17 Oct
Then you can merge back with your original dataframe with pd.merge, like so:
pd.merge(df, split_dates, how="inner", left_index=True, right_index=True)
Which yields:
Date Quantity Year Month
0 18-Jan 3476 18 Jan
1 18-Jan 20 18 Jan
2 18-Feb 789 18 Feb
3 18-Feb 409 18 Feb
4 Oct-17 81 17 Oct
5 Oct-17 640 17 Oct
Thank you for your help. I managed to get it working with what I've learned so far, i.e. for loop, if-else and split() and with the help of another expert.
# Split the Date column and store it in an array
dA = []
for dP in df.Date:
dA.append(dP.split("-"))
# Append month and year to respective lists based on if conditions
Month = []
Year = []
for moYr in dA:
if len(moYr[0]) == 2:
Month.append(moYr[1])
Year.append(moYr[0])
else:
Month.append(moYr[0])
Year.append(moYr[1])
This took me hours!
Try using Python datetime strptime(<date>, "%y-%b") on the date column to convert it to a Python datetime.
from datetime import datetime
def parse_dt(x):
try:
return datetime.strptime(x, "%y-%b")
except:
return datetime.strptime(x, "%b-%y")
df['timestamp'] = df['Date'].apply(parse_dt)
df
Date Quantity timestamp
0 18-Jan 3476 2018-01-01
1 18-Jan 20 2018-01-01
2 18-Feb 789 2018-02-01
3 18-Feb 409 2018-02-01
4 Oct-17 81 2017-10-01
5 Oct-17 640 2017-10-01
Then you can just use .month and .year attributes, or if you prefer the month as its abbreviated form, use Python datetime.strftime('%b').
df['year'] = df.timestamp.apply(lambda x: x.year)
df['month'] = df.timestamp.apply(lambda x: x.strftime('%b'))
df
Date Quantity timestamp year month
0 18-Jan 3476 2018-01-01 2018 Jan
1 18-Jan 20 2018-01-01 2018 Jan
2 18-Feb 789 2018-02-01 2018 Feb
3 18-Feb 409 2018-02-01 2018 Feb
4 Oct-17 81 2017-10-01 2017 Oct
5 Oct-17 640 2017-10-01 2017 Oct
I'm working on a dataframe named df that contains a year of daily information for a float variable (balance) for many account values (used as main key). I'm trying to create a new column expected_balance by matching the date of previous months, calculating an average and using it as expected future value. I'll explain in detail now:
The dataset is generated after appending and parsing multiple json values, once I finish working on it, I get this:
date balance account day month year fdate
0 2018-04-13 470.57 SP014 13 4 2018 201804
1 2018-04-14 375.54 SP014 14 4 2018 201804
2 2018-04-15 375.54 SP014 15 4 2018 201804
3 2018-04-16 229.04 SP014 16 4 2018 201804
4 2018-04-17 216.62 SP014 17 4 2018 201804
... ... ... ... ... ... ... ...
414857 2019-02-24 381.26 KO012 24 2 2019 201902
414858 2019-02-25 181.26 KO012 25 2 2019 201902
414859 2019-02-26 160.82 KO012 26 2 2019 201902
414860 2019-02-27 0.82 KO012 27 2 2019 201902
414861 2019-02-28 109.50 KO012 28 2 2019 201902
Each account value has 365 values (a starting date when the information was obtained and a year of info), resampled by day. After that, I'm splitting this dataframe into train and test. Train consists of all previous values except for the last 2 months of information and test are these last 2 months (the last month is not necesarilly full, if the last/max date value is 20-04-2019, then train will be from 20-04-2018 to 31-03-2019 and test 01-03-2019 to 20-04-2019). This is how I manage:
df_test_1 = df[df.fdate==df.groupby('account').fdate.transform('max')].copy()
dft = df.drop(df_test_1.index)
df_test_2 = dft[dft.fdate==dft.groupby('account').fdate.transform('max')].copy()
df_train = dft.drop(df_test_2.index)
df_test = pd.concat([df_test_2,df_test_1])
#print("Shape df: ",df.shape) #for validation purposes
#print("Shape test: ",df_test.shape) #for validation purposes
#print("Shape train: ",df_train.shape) #for validation purposes
What I need to do now is create a new column exp_bal (expected balance) for each date in the df_test that's calculated by averaging all train values for the particular day (this is the method requested so I must follow the instructions).
Here is an example of an expected output/result, I'm only printing account's AA001 values for a specific day for the last 2 train months (suppose these values always repeat for the other 8 months):
date balance account day month year fdate
... ... ... ... ... ... ... ...
0 2019-03-20 200.00 AA000 20 3 2019 201903
1 2019-04-20 100.00 AA000 20 4 2019 201904
I should be able to use this information to append a new column for each day that is the average of the same day value for all months of df_train
date balance account day month year fdate exp_bal
0 2018-05-20 470.57 AA000 20 5 2018 201805 150.00
30 2019-06-20 381.26 AA000 20 6 2019 201906 150.00
So then I can calculate a mse for the that prediction for that account.
First of all I'm using this to iterate over each account:
ids = list(df['account'].unique())
for i in range(0,len(ids)):
dft_train = df_train[df_train['account'] == ids[i]]
dft_test = df_test[df_test['account'] == ids[i]]
first_date = min(dft_test['date'])
last_date = max(df_ttest['date'])
dft_train = dft_train.set_index('date')
dft_test = dft_train.set_index('date')
And after this I'm lost on how to use the dft_train values to create this average for a given day that will be appended in a new column in dft_test.
I appreciate any help or suggestion, also feel free to ask for clarification/ more info, I'll gladly edit this. Thanks in advance!
Not sure if it's the only question you have with the above, but this is how to calculate the expected balance of the train data:
import pandas as pd, numpy as np
# make test data
n = 60
df = pd.DataFrame({'Date': np.tile(pd.date_range('2018-01-01',periods=n).values, 2), 'Account': np.repeat(['A', 'B'], n), 'Balance': range(2*n)})
df['Day'] = df.Date.dt.day
# calculate expected balance
df['exp_bal'] = df.groupby(['Account', 'Day']).Balance.transform('mean')
# example output for day 5
print(df[df.Day==5])
Output:
Date Account Balance Day exp_bal
4 2018-01-05 A 4 5 19.5
35 2018-02-05 A 35 5 19.5
64 2018-01-05 B 64 5 79.5
95 2018-02-05 B 95 5 79.5
I have create a dataframe of dates as follows:
import pandas as pd
timespan = 366
df = pd.DataFrame({'Date':pd.date_range(pd.datetime.today(), periods=timespan).tolist()})
I'm struggling to identify the day number in a quarter. For example
date expected_value
2017-01-01 1 # First day in Q1
2017-01-02 2 # Second day in Q1
2017-02-01 32 # 32nd day in Q1
2017-04-01 1 # First day in Q2
May I have your suggestions? Thank you in advance.
>>> df.assign(
days_in_qurater=[(date - ts.start_time).days + 1
for date, ts in zip(df['Date'],
pd.PeriodIndex(df['Date'], freq='Q'))])
Date days_in_qurater
0 2017-01-01 1
1 2017-01-02 2
2 2017-01-03 3
...
363 2017-12-30 91
364 2017-12-31 92
365 2018-01-01 1
This is around 250x faster than Alexander's solution:
df['day_qtr']=(df.Date - pd.PeriodIndex(df.Date,freq='Q').start_time).dt.days + 1
One of way is by creating a new df based on dates and quarter cumcount then map the values to the real df i.e
timespan = 5000
ndf = pd.DataFrame({'Date':pd.date_range('2015-01-01', periods=timespan).tolist()})
ndf['q'] = ndf['Date'].dt.to_period('Q')
ndf['new'] = ndf.groupby('q').cumcount()+1
maps = dict(zip(ndf['Date'].dt.date, ndf['new'].values.tolist()))
Map the values
df['expected'] = df.Date.dt.date.map(maps)
Output:
Date expected
0 2017-09-12 09:42:14.324492 74
1 2017-09-13 09:42:14.324492 75
2 2017-09-14 09:42:14.324492 76
3 2017-09-15 09:42:14.324492 77
4 2017-09-16 09:42:14.324492 78
.
.
143 2018-02-02 09:42:14.324492 33
.
.
201 2018-04-01 09:42:14.324492 1
Hope it helps.
start with:
day_of_year = datetime.now().timetuple().tm_yday
from
Convert Year/Month/Day to Day of Year in Python
you can get 1st day of each quarter that way and bracket / subtract the value of the first day to get the day of the quarter
What formula can I use to determine based off of a certain date the pay period number that falls between a start date and end date? In other words, from df1 I have a date that I need to compare against df2 start dates and end dates and then produce the pay period that those dates fall within in a new data set.
The formula I've used so far is the following. Keep in mind I'm not a seasoned programmer:
1st try:
def calc(row):
if pf2(row['actn_dt']) >= pp_calendar([0], and pf2(row['actn_dt']) <= df2([1]):
return df2([2,3])
2nd try:
pf2['pay'] = np.where (pf2['actn_dt'] >= df2[0]) | (pf2['actn_dt'] <= pf2[1]), pp_calendar[2]
3rd try:
def calc(row):
if pf2(row['actn_dt']) >= df2(row[1]) | pf2(row['actn_dt']) <= df2(row[2]):
return df2(row[3])
pf2['pay'] = pf2.apply (lambda row: calc(row), axis=1)
print df:
actn_dt
16 2008-09-28 00:00:00
17 2008-03-16 00:00:00
18 2009-08-30 00:00:00
43 2008-06-22 00:00:00
89 2009-08-16 00:00:00
106 2009-03-29 00:00:00
244 2009-08-30 00:00:00
371 2009-09-13 00:00:00
400 2009-07-19 00:00:00
439 2007-12-23 00:00:00
print df2:
START_DATE END_DATE PAY_PERIOD CALENDAR_YEAR
0 2008-09-28 2008-10-11 10 2008
1 2008-03-16 2008-03-16 06 2008
2 2009-08-30 2009-09-12 18 2009
3 2008-06-22 2008-06-22 13 2008
4 2009-03-29 2009-04-11 07 2009
Expected Result:
actn_dt START_DATE END_DATE PAY_PERIOD CALENDAR_YEAR
16 2008-09-28 2008-09-28 2008-10-11 10 2008
17 2008-03-16 2008-03-16 2008-03-29 06 2008
18 2009-08-30 2009-08-30 2009-09-12 18 2009
43 2008-06-22 2008-06-22 2008-07-05 13 2008
89 2009-08-16 2009-08-16 2008-08-29 17 2009
106 2009-03-29 2009-03-29 2009-04-11 07 2009
244 2009-08-30 2009-08-30 2009-09-12 18 2009
Thank you for your knowledge and time!
Head Line Solution
def find_pay_period(date, df):
df = df[(df.START_DATE <= date) & (date <= df.END_DATE)].iloc[0, :]
df['actn_dt'] = date
return df
df1.actn_dt.apply(lambda x: find_pay_period(x, df2))
Explanation
start with
# apply() will take date in df1 and find the first row in df2
# such that the date is between START_DATE and END_DATE and
# then return the row.
df1.actn_dt.apply(lambda x: find_pay_period(x, df2))
now this
def find_pay_period(date, df):
# df[] use boolean mask to filter
# .iloc[0, :] grabs first row of filtered DataFrame.
# Keep in mind this is a Series.
df = df[(df.START_DATE <= date) & (date <= df.END_DATE)].iloc[0, :]
# add back your date
df['actn_dt'] = date
return df