In a pandas DataFrame, I have dates of a given month in the first column and an Amount in the second column. How can I add the days of that month that are missing from the first column, with a value of 0 in the second column?
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Date': ['5/23/2019', '5/9/2019'],
    'Amount': np.random.choice([10000])
})
I would like the result to look like the following:
Expected Output
    Date        Amount
0   5/01/2019        0
1   5/02/2019        0
.
.
.   5/23/2019    10000
.   5/24/2019        0
Look at date_range from pandas.
I'm assuming that 5/31/2019 is not in your expected output because you only want the range between the min and max dates?
I convert the Date column to a datetime type, pass the min and max dates to date_range, store that in a DataFrame, and then do a left join.
df['Date'] = pd.to_datetime(df['Date'])
date_range = pd.DataFrame(pd.date_range(start=df['Date'].min(), end=df['Date'].max()), columns=['Date'])
final_df = pd.merge(date_range, df, how='left')
Date Amount
0 2019-05-09 10000.0
1 2019-05-10 NaN
2 2019-05-11 NaN
3 2019-05-12 NaN
4 2019-05-13 NaN
5 2019-05-14 NaN
6 2019-05-15 NaN
7 2019-05-16 NaN
8 2019-05-17 NaN
9 2019-05-18 NaN
10 2019-05-19 NaN
11 2019-05-20 NaN
12 2019-05-21 NaN
13 2019-05-22 NaN
14 2019-05-23 10000.0
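If you want 0 instead of NaN, as in your expected output, a small follow-up (assuming final_df from above) could be:

# fill the days that had no transactions with 0
final_df['Amount'] = final_df['Amount'].fillna(0)
# if you also want the range to run through the end of the month (5/31) rather than
# stopping at the max date, you could instead build date_range with
# end=df['Date'].max() + pd.offsets.MonthEnd(0)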
I have a dataframe, df, that looks like this:
Date Value
10/1/2019 5
10/2/2019 10
10/3/2019 15
10/4/2019 20
10/5/2019 25
10/6/2019 30
10/7/2019 35
I would like to calculate the delta for a period of 7 days
Desired output:
Date Delta
10/1/2019 30
This is what I am doing; a user helped me with a variation of the code below, but it does not produce the output I want:
df = df.assign(
    Delta=df.iloc[0:, 1].sub(df.iloc[6:, 1]),
    Date=pd.Series(pd.date_range(pd.Timestamp('2019-10-01'), periods=7, freq='7d'))
)[['Delta', 'Date']]
Any suggestions are appreciated.
Let us try shift:
# make sure Date is a datetime type so shift(freq=...) works on the index
df['Date'] = pd.to_datetime(df['Date'])
s = df.set_index('Date')['Value']
df['New'] = s.shift(freq='-6D').reindex(s.index).values
df['DIFF'] = df['New'] - df['Value']
df
Out[39]:
Date Value New DIFF
0 2019-10-01 5 35.0 30.0
1 2019-10-02 10 NaN NaN
2 2019-10-03 15 NaN NaN
3 2019-10-04 20 NaN NaN
4 2019-10-05 25 NaN NaN
5 2019-10-06 30 NaN NaN
6 2019-10-07 35 NaN NaN
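Another sketch (not part of the answer above, and assuming exactly one row per calendar day): resample into 7-day bins anchored at the first date and take last minus first within each bin, reusing s from above.

# group the dates into consecutive 7-day bins and compute last - first per bin
delta = s.resample('7D').agg(lambda x: x.iloc[-1] - x.iloc[0])
print(delta)
# Delta for the bin starting 2019-10-01: 30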
I have a Pandas dataframe representing stock price close data that has 2 indices: Date and Day of Week (where 0,1,2,3,4 = Monday, Tuesday, Wednesday, Thursday, Friday). It looks like this:
Date Day Close
2019-04-08 0 283.01
2019-04-09 1 281.56
2019-04-10 2 282.52
2019-04-11 3 282.44
2019-04-12 4 284.35
...
2020-04-02 3 251.83
2020-04-03 4 248.19
2020-04-06 0 262.35
I'd like to convert this into the following:
Week Of 0 1 2 3 4
2019-04-08 283.01 281.56 282.52 282.44 284.35
...
2020-03-30 .. 257.75 246.15 251.83 248.19
2020-04-06 262.35 N/A N/A N/A N/A
I thought the best way to do this might be through pandas pivot table functionality, but I'm running into issues with its aggfunc.
You can use to_period to get the week and dt.weekday to get the weekdays, then pivot:
# convert to datetime type if it isn't already
df['Date'] = pd.to_datetime(df['Date'])
# extract the new index and columns
df['Day'] = df['Date'].dt.weekday
df['Week'] = df['Date'].dt.to_period('W')
result = df.pivot(index='Week', columns='Day', values='Close')
Output:
Day 0 1 2 3 4
Week
2019-04-08/2019-04-14 283.01 281.56 282.52 282.44 284.35
2020-03-30/2020-04-05 NaN NaN NaN 251.83 248.19
2020-04-06/2020-04-12 262.35 NaN NaN NaN NaN
Or if you only want Mondays as the Week, you can just subtract the Day from the Date:
df['Day'] = df['Date'].dt.weekday
df['Week'] = df['Date'] - pd.to_timedelta(df['Day'], unit='D')
result = df.pivot(index='Week', columns='Day', values='Close')
Output:
Day 0 1 2 3 4
Week
2019-04-08 283.01 281.56 282.52 282.44 284.35
2020-03-30 NaN NaN NaN 251.83 248.19
2020-04-06 262.35 NaN NaN NaN NaN
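As for the aggfunc issues you mention: pivot raises a ValueError if the same (Week, Day) pair occurs more than once. If your data can contain such duplicates, a hedged alternative is pivot_table with an explicit aggfunc, for example:

# keep the last Close seen for each (Week, Day) pair instead of erroring on duplicates
result = df.pivot_table(index='Week', columns='Day', values='Close', aggfunc='last')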
I have a dataframe with date-wise withdraw/credit and closing balance information.
date withdraw credit closing_balance
02/06/17 2,500.00 nan 6,396.77
03/06/17 nan 36,767.00 43,163.77
05/06/17 1,770.00 nan 41,393.77
05/06/17 6000.00 nan 35393.77
05/06/17 278.00 nan 35115.77
07/06/17 1812.00 nan 33303.77
Now we can see that two days' entries are missing from this table, i.e. 04/06/17 and 06/06/17, since there were no transactions on those days.
What I'm looking to do is add dummy rows to the dataframe for these dates, the 4th and the 6th, with
the withdraw column as 0, the credit column as 0,
and the closing_balance column equal to the last closing balance entry of the previous day.
Expected output
date withdraw credit closing_balance
02/06/17 2,500.00 nan 6,396.77
03/06/17 nan 36,767.00 43,163.77
04/06/17 nan(or 0) nan(or 0) 43,163.77
05/06/17 1,770.00 nan 41,393.77
05/06/17 6000.00 nan 35393.77
05/06/17 278.00 nan 35115.77
06/06/17 nan(or 0) nan(or 0) 35115.77
07/06/17 1812.00 nan 33303.77
Is there a Pythonic way of doing this?
What I thought was to first find the missing dates, then create a temporary dataframe for those dates, concatenate it with the main dataframe, and then sort.
But I'm having trouble getting the previous day's last closing balance entry to fill in the missing days' closing balance.
The idea is to add all missing datetimes via a left merge with another DataFrame created from the minimal and maximal datetimes with date_range, then forward fill the missing closing_balance values and set 0 for the new datetimes:
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%y')
df1 = pd.DataFrame({'Date':pd.date_range(df['Date'].min(), df['Date'].max())})
df2 = df1.merge(df, how='left')
df2['closing_balance'] = df2['closing_balance'].ffill()
df2.loc[~df2['Date'].isin(df['Date']), ['withdraw','credit']] = 0
print (df2)
Date withdraw credit closing_balance
0 2017-06-02 2,500.00 NaN 6,396.77
1 2017-06-03 NaN 36,767.00 43,163.77
2 2017-06-04 0 0 43,163.77
3 2017-06-05 1,770.00 NaN 41,393.77
4 2017-06-05 6000.00 NaN 35393.77
5 2017-06-05 278.00 NaN 35115.77
6 2017-06-06 0 0 35115.77
7 2017-06-07 1812.00 NaN 33303.77
A similar idea, with a different condition for setting the 0 values, using merge with the indicator parameter:
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%y')
df1 = pd.DataFrame({'Date':pd.date_range(df['Date'].min(), df['Date'].max())})
df2 = df1.merge(df, how='left', indicator=True)
df2['closing_balance'] = df2['closing_balance'].ffill()
df2.loc[df2.pop('_merge').eq('left_only'), ['withdraw','credit']] = 0
print (df2)
Date withdraw credit closing_balance
0 2017-06-02 2,500.00 NaN 6,396.77
1 2017-06-03 NaN 36,767.00 43,163.77
2 2017-06-04 0 0 43,163.77
3 2017-06-05 1,770.00 NaN 41,393.77
4 2017-06-05 6000.00 NaN 35393.77
5 2017-06-05 278.00 NaN 35115.77
6 2017-06-06 0 0 35115.77
7 2017-06-07 1812.00 NaN 33303.77
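Note that the withdraw/credit/closing_balance values shown contain thousands separators, so they are probably stored as strings. If you need them as numbers, a small cleaning sketch (run before the steps above, assuming the format shown) could be:

# strip the thousands separators and convert to numeric; bad/missing values become NaN
for col in ['withdraw', 'credit', 'closing_balance']:
    df[col] = pd.to_numeric(df[col].astype(str).str.replace(',', '', regex=False), errors='coerce')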
Pandas: select DF rows based on another DF is the closest answer I can find to my question, but I don't believe it quite solves it.
Anyway, I am working with two very large pandas dataframes (so speed is a consideration), df_emails and df_trips, both of which are already sorted by CustID and then by date.
df_emails includes the date we sent a customer an email and it looks like this:
CustID DateSent
0 2 2018-01-20
1 2 2018-02-19
2 2 2018-03-31
3 4 2018-01-10
4 4 2018-02-26
5 5 2018-02-01
6 5 2018-02-07
df_trips includes the dates a customer came to the store and how much they spent, and it looks like this:
CustID TripDate TotalSpend
0 2 2018-02-04 25
1 2 2018-02-16 100
2 2 2018-02-22 250
3 4 2018-01-03 50
4 4 2018-02-28 100
5 4 2018-03-21 100
6 8 2018-01-07 200
Basically, what I need to do is find the number of trips and total spend for each customer in between each email sent. If it is the last time an email is sent for a given customer, I need to find the total number of trips and total spend after the email, but before the end of the data (2018-04-01). So the final dataframe would look like this:
CustID DateSent NextDateSentOrEndOfData TripsBetween TotalSpendBetween
0 2 2018-01-20 2018-02-19 2.0 125.0
1 2 2018-02-19 2018-03-31 1.0 250.0
2 2 2018-03-31 2018-04-01 0.0 0.0
3 4 2018-01-10 2018-02-26 0.0 0.0
4 4 2018-02-26 2018-04-01 2.0 200.0
5 5 2018-02-01 2018-02-07 0.0 0.0
6 5 2018-02-07 2018-04-01 0.0 0.0
Though I have tried my best to do this in a Python/Pandas friendly way, the only accurate solution I have been able to implement is through an np.where, shifting, and looping. The solution looks like this:
df_emails["CustNthVisit"] = df_emails.groupby("CustID").cumcount()+1
df_emails["CustTotalVisit"] = df_emails.groupby("CustID")["CustID"].transform('count')
df_emails["NextDateSentOrEndOfData"] = pd.to_datetime(df_emails["DateSent"].shift(-1)).where(df_emails["CustNthVisit"] != df_emails["CustTotalVisit"], pd.to_datetime('04-01-2018'))
for i in df_emails.index:
    df_emails.at[i, "TripsBetween"] = len(df_trips[(df_trips["CustID"] == df_emails.at[i, "CustID"]) & (df_trips["TripDate"] > df_emails.at[i, "DateSent"]) & (df_trips["TripDate"] < df_emails.at[i, "NextDateSentOrEndOfData"])])

for i in df_emails.index:
    df_emails.at[i, "TotalSpendBetween"] = df_trips[(df_trips["CustID"] == df_emails.at[i, "CustID"]) & (df_trips["TripDate"] > df_emails.at[i, "DateSent"]) & (df_trips["TripDate"] < df_emails.at[i, "NextDateSentOrEndOfData"])].TotalSpend.sum()
df_emails.drop(['CustNthVisit',"CustTotalVisit"], axis=1, inplace=True)
However, a %%timeit has revealed that this takes 10.6ms on just the seven rows shown above, which makes this solution pretty much infeasible on my actual datasets of about 1,000,000 rows. Does anyone know a solution here that is faster and thus feasible?
Add the next date sent column to the emails:
df_emails["NextDateSent"] = df_emails.groupby("CustID")["DateSent"].shift(-1)
Sort both frames for merge_asof, then merge each trip backward to the nearest preceding email to create a trip lookup table:
df_emails = df_emails.sort_values("DateSent")
df_trips = df_trips.sort_values("TripDate")
df_lookup = pd.merge_asof(df_trips, df_emails, by="CustID", left_on="TripDate",right_on="DateSent", direction="backward")
Aggregate the lookup table for the data you want.
df_lookup = df_lookup.loc[:, ["CustID", "DateSent", "TotalSpend"]].groupby(["CustID", "DateSent"]).agg(["count","sum"])
Left join it back to the email table.
df_merge = df_emails.join(df_lookup, on=["CustID", "DateSent"]).sort_values("CustID")
I choose to leave NaNs as NaNs because I don't like filling in default values (you can always do that later if you prefer, but you can't easily distinguish between things that existed and things that didn't if you put defaults in early).
CustID DateSent NextDateSent (TotalSpend, count) (TotalSpend, sum)
0 2 2018-01-20 2018-02-19 2.0 125.0
1 2 2018-02-19 2018-03-31 1.0 250.0
2 2 2018-03-31 NaT NaN NaN
3 4 2018-01-10 2018-02-26 NaN NaN
4 4 2018-02-26 NaT 2.0 200.0
5 5 2018-02-01 2018-02-07 NaN NaN
6 5 2018-02-07 NaT NaN NaN
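If you want the exact column names and the zeros of your expected output, a small follow-up sketch (assuming df_merge from above and 2018-04-01 as the end of the data) could be:

# flatten the column names, then fill the gaps the way the expected output shows them
df_merge.columns = ["CustID", "DateSent", "NextDateSentOrEndOfData", "TripsBetween", "TotalSpendBetween"]
df_merge["NextDateSentOrEndOfData"] = df_merge["NextDateSentOrEndOfData"].fillna(pd.Timestamp("2018-04-01"))
df_merge[["TripsBetween", "TotalSpendBetween"]] = df_merge[["TripsBetween", "TotalSpendBetween"]].fillna(0)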
This would be an easy case of merge_asof if it could handle the max_date, so I go a longer way:
max_date = pd.to_datetime('2018-04-01')
# set_index for easy extraction by id
df_emails.set_index('CustID', inplace=True)
# we want this later in the final output
df_emails['NextDateSentOrEndOfData'] = df_emails.groupby('CustID')['DateSent'].shift(-1).fillna(max_date)
# cuts function for groupby
def cuts(df):
    custID = df.CustID.iloc[0]
    bins = list(df_emails.loc[[custID], 'DateSent']) + [max_date]
    return pd.cut(df.TripDate, bins=bins, right=False)
# bin the dates:
s = df_trips.groupby('CustID', as_index=False, group_keys=False).apply(cuts)
# aggregate the info:
new_df = (df_trips.groupby([df_trips.CustID, s])
          .TotalSpend.agg(['sum', 'size'])
          .reset_index())
# get the right limit:
new_df['NextDateSentOrEndOfData'] = new_df.TripDate.apply(lambda x: x.right)
# drop the unnecessary info
new_df.drop('TripDate', axis=1, inplace=True)
# merge:
df_emails.reset_index().merge(new_df,
                              on=['CustID', 'NextDateSentOrEndOfData'],
                              how='left')
Output:
CustID DateSent NextDateSentOrEndOfData sum size
0 2 2018-01-20 2018-02-19 125.0 2.0
1 2 2018-02-19 2018-03-31 250.0 1.0
2 2 2018-03-31 2018-04-01 NaN NaN
3 4 2018-01-10 2018-02-26 NaN NaN
4 4 2018-02-26 2018-04-01 200.0 2.0
5 5 2018-02-01 2018-02-07 NaN NaN
6 5 2018-02-07 2018-04-01 NaN NaN
I have a DF in which I am calculating and filling the EMI value field:
account Total Start Date End Date EMI
211829 107000 05/19/17 01/22/19 5350
320563 175000 08/04/17 10/30/18 12500
648336 246000 02/26/17 08/25/19 8482.7586206897
109996 175000 11/23/17 11/27/19 7291.6666666667
121213 317000 09/07/17 04/12/18 45285.7142857143
Then, based on the date range, I create new fields like Jan17, Feb17, Mar17, etc. and fill them with the code below.
jant17 = pd.to_datetime('2017-01-01')
febt17 = pd.to_datetime('2017-02-01')
mart17 = pd.to_datetime('2017-03-01')
jan17 = pd.to_datetime('2017-01-31')
feb17 = pd.to_datetime('2017-02-28')
mar17 = pd.to_datetime('2017-03-31')
df.loc[(df['Start Date'] <= jan17) & (df['End Date'] >= jant17), 'Jan17'] = df['EMI']
But the drawback is that when I have to forecast until 2019 or 2020, there are too many lines of code to write, and whenever there is an update I need to modify too many of them. To reduce the lines of code I tried an alternate method using for loops, but the code started taking very long to execute.
monthend = { 'Jan17' : pd.to_datetime('2017-01-31'),
'Feb17' : pd.to_datetime('2017-02-28'),
'Mar17' : pd.to_datetime('2017-03-31')}
monthbeg = { 'Jant17' : pd.to_datetime('2017-01-01'),
'Febt17' : pd.to_datetime('2017-02-01'),
'Mart17' : pd.to_datetime('2017-03-01')}
for mend in monthend.values():
    for mbeg in monthbeg.values():
        for coln in colnames:
            df.loc[(df['Start Date'] <= mend) & (df['End Date'] >= mbeg), coln] = df['EMI']
This greatly reduced the number of lines of code but increased the execution time from 3-4 minutes to over an hour. Is there a better way to code this with fewer lines and less processing time?
I think you can create a helper df with start dates, end dates, and column names, then loop over its rows and create the new columns of the original df:
dates = pd.DataFrame({'start':pd.date_range('2017-01-01', freq='MS', periods=10),
'end':pd.date_range('2017-01-01', freq='M', periods=10)})
dates['names'] = dates.start.dt.strftime('%b%y')
print (dates)
end start names
0 2017-01-31 2017-01-01 Jan17
1 2017-02-28 2017-02-01 Feb17
2 2017-03-31 2017-03-01 Mar17
3 2017-04-30 2017-04-01 Apr17
4 2017-05-31 2017-05-01 May17
5 2017-06-30 2017-06-01 Jun17
6 2017-07-31 2017-07-01 Jul17
7 2017-08-31 2017-08-01 Aug17
8 2017-09-30 2017-09-01 Sep17
9 2017-10-31 2017-10-01 Oct17
#if necessary convert to datetimes
df['Start Date'] = pd.to_datetime(df['Start Date'])
df['End Date'] = pd.to_datetime(df['End Date'])
def f(x):
    df.loc[(df['Start Date'] <= x.start) & (df['End Date'] >= x.end), x.names] = df['EMI']

dates.apply(f, axis=1)
print (df)
account Total Start Date End Date EMI Jan17 Feb17 \
0 211829 107000 2017-05-19 2019-01-22 5350.000000 NaN NaN
1 320563 175000 2017-08-04 2018-10-30 12500.000000 NaN NaN
2 648336 246000 2017-02-26 2019-08-25 8482.758621 NaN NaN
3 109996 175000 2017-11-23 2019-11-27 7291.666667 NaN NaN
4 121213 317000 2017-09-07 2018-04-12 45285.714286 NaN NaN
Mar17 Apr17 May17 Jun17 Jul17 \
0 NaN NaN NaN 5350.000000 5350.000000
1 NaN NaN NaN NaN NaN
2 8482.758621 8482.758621 8482.758621 8482.758621 8482.758621
3 NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN
Aug17 Sep17 Oct17
0 5350.000000 5350.000000 5350.000000
1 NaN 12500.000000 12500.000000
2 8482.758621 8482.758621 8482.758621
3 NaN NaN NaN
4 NaN NaN 45285.714286
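To forecast further out (e.g. through 2019 or 2020), it should be enough to extend the helper frame rather than adding more code; a small sketch, assuming 48 monthly periods to cover 2017 through 2020:

# same helper as above, just with a longer horizon
dates = pd.DataFrame({'start': pd.date_range('2017-01-01', freq='MS', periods=48),
                      'end': pd.date_range('2017-01-01', freq='M', periods=48)})
dates['names'] = dates.start.dt.strftime('%b%y')
dates.apply(f, axis=1)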