Fill multiple rows in between pandas dataframe rows on condition - python

I have a dataframe with date-wise withdraw/credit and closing balance information.
date withdraw credit closing_balance
02/06/17 2,500.00 nan 6,396.77
03/06/17 nan 36,767.00 43,163.77
05/06/17 1,770.00 nan 41,393.77
05/06/17 6000.00 nan 35393.77
05/06/17 278.00 nan 35115.77
07/06/17 1812.00 nan 33303.77
We can see that entries for two days, 04/06/17 and 06/06/17, are missing from this table because there were no transactions on those days.
What I'm looking to do is add dummy rows to the dataframe for these dates (the 4th and the 6th), with
the withdraw column as 0, the credit column as 0,
and the closing_balance column set to the last closing balance entry of the previous day.
Expected output
date withdraw credit closing_balance
02/06/17 2,500.00 nan 6,396.77
03/06/17 nan 36,767.00 43,163.77
04/06/17 nan(or 0) nan(or 0) 43,163.77
05/06/17 1,770.00 nan 41,393.77
05/06/17 6000.00 nan 35393.77
05/06/17 278.00 nan 35115.77
06/06/17 nan(or 0) nan(or 0) 35115.77
07/06/17 1812.00 nan 33303.77
Is there a Pythonic way of doing this?
What I thought was to first find the missing dates, create a temporary dataframe for those dates, concatenate it with the main dataframe, and then sort.
But I'm having trouble working out how to get the previous day's last closing balance entry to fill in the missing days' closing balance.

The idea is to add all the missing dates with a left join (merge) against another DataFrame built from the minimal and maximal dates via date_range, then forward fill the missing closing_balance values and set 0 for the newly added dates:
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%y')

# full daily range between the first and last date
df1 = pd.DataFrame({'Date': pd.date_range(df['Date'].min(), df['Date'].max())})

# left join keeps duplicate dates from df and adds the missing days as NaN rows
df2 = df1.merge(df, how='left')
df2['closing_balance'] = df2['closing_balance'].ffill()

# rows whose date was not in the original frame get 0 withdraw/credit
df2.loc[~df2['Date'].isin(df['Date']), ['withdraw','credit']] = 0
print(df2)
Date withdraw credit closing_balance
0 2017-06-02 2,500.00 NaN 6,396.77
1 2017-06-03 NaN 36,767.00 43,163.77
2 2017-06-04 0 0 43,163.77
3 2017-06-05 1,770.00 NaN 41,393.77
4 2017-06-05 6000.00 NaN 35393.77
5 2017-06-05 278.00 NaN 35115.77
6 2017-06-06 0 0 35115.77
7 2017-06-07 1812.00 NaN 33303.77
A similar idea, with a different condition for setting the 0 values, using merge with the indicator parameter:
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%y')
df1 = pd.DataFrame({'Date': pd.date_range(df['Date'].min(), df['Date'].max())})

# indicator=True adds a _merge column marking rows that only exist in df1
df2 = df1.merge(df, how='left', indicator=True)
df2['closing_balance'] = df2['closing_balance'].ffill()
df2.loc[df2.pop('_merge').eq('left_only'), ['withdraw','credit']] = 0
print(df2)
Date withdraw credit closing_balance
0 2017-06-02 2,500.00 NaN 6,396.77
1 2017-06-03 NaN 36,767.00 43,163.77
2 2017-06-04 0 0 43,163.77
3 2017-06-05 1,770.00 NaN 41,393.77
4 2017-06-05 6000.00 NaN 35393.77
5 2017-06-05 278.00 NaN 35115.77
6 2017-06-06 0 0 35115.77
7 2017-06-07 1812.00 NaN 33303.77
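For completeness, the approach sketched in the question (find the missing dates, build a temporary frame for them, concatenate, sort, forward fill) also works. Below is a minimal sketch of that route, assuming the df with the parsed 'Date' column used above:

import pandas as pd

# dates that should exist but are missing from the statement
full = pd.date_range(df['Date'].min(), df['Date'].max())
missing = full.difference(df['Date'])

# temporary frame for the missing days, then concat, sort and forward fill
tmp = pd.DataFrame({'Date': missing, 'withdraw': 0, 'credit': 0})
out = (pd.concat([df, tmp], sort=False)
         .sort_values('Date', kind='stable')   # stable sort keeps same-day rows in order
         .reset_index(drop=True))
out['closing_balance'] = out['closing_balance'].ffill()
print(out)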

Counting Number of Occurrences Between Dates (Given an ID value) From Another Dataframe

Pandas: select DF rows based on another DF is the closest answer I can find to my question, but I don't believe it quite solves it.
Anyway, I am working with two very large pandas dataframes (so speed is a consideration), df_emails and df_trips, both of which are already sorted by CustID and then by date.
df_emails includes the date we sent a customer an email and it looks like this:
CustID DateSent
0 2 2018-01-20
1 2 2018-02-19
2 2 2018-03-31
3 4 2018-01-10
4 4 2018-02-26
5 5 2018-02-01
6 5 2018-02-07
df_trips includes the dates a customer came to the store and how much they spent, and it looks like this:
CustID TripDate TotalSpend
0 2 2018-02-04 25
1 2 2018-02-16 100
2 2 2018-02-22 250
3 4 2018-01-03 50
4 4 2018-02-28 100
5 4 2018-03-21 100
6 8 2018-01-07 200
Basically, what I need to do is find the number of trips and total spend for each customer in between each email sent. If it is the last time an email is sent for a given customer, I need to find the total number of trips and total spend after the email, but before the end of the data (2018-04-01). So the final dataframe would look like this:
CustID DateSent NextDateSentOrEndOfData TripsBetween TotalSpendBetween
0 2 2018-01-20 2018-02-19 2.0 125.0
1 2 2018-02-19 2018-03-31 1.0 250.0
2 2 2018-03-31 2018-04-01 0.0 0.0
3 4 2018-01-10 2018-02-26 0.0 0.0
4 4 2018-02-26 2018-04-01 2.0 200.0
5 5 2018-02-01 2018-02-07 0.0 0.0
6 5 2018-02-07 2018-04-01 0.0 0.0
Though I have tried my best to do this in a Python/Pandas friendly way, the only accurate solution I have been able to implement is through an np.where, shifting, and looping. The solution looks like this:
df_emails["CustNthVisit"] = df_emails.groupby("CustID").cumcount()+1
df_emails["CustTotalVisit"] = df_emails.groupby("CustID")["CustID"].transform('count')
df_emails["NextDateSentOrEndOfData"] = pd.to_datetime(df_emails["DateSent"].shift(-1)).where(df_emails["CustNthVisit"] != df_emails["CustTotalVisit"], pd.to_datetime('04-01-2018'))
for i in df_emails.index:
    df_emails.at[i, "TripsBetween"] = len(df_trips[(df_trips["CustID"] == df_emails.at[i, "CustID"]) &
                                                   (df_trips["TripDate"] > df_emails.at[i, "DateSent"]) &
                                                   (df_trips["TripDate"] < df_emails.at[i, "NextDateSentOrEndOfData"])])
for i in df_emails.index:
    df_emails.at[i, "TotalSpendBetween"] = df_trips[(df_trips["CustID"] == df_emails.at[i, "CustID"]) &
                                                    (df_trips["TripDate"] > df_emails.at[i, "DateSent"]) &
                                                    (df_trips["TripDate"] < df_emails.at[i, "NextDateSentOrEndOfData"])].TotalSpend.sum()
df_emails.drop(['CustNthVisit',"CustTotalVisit"], axis=1, inplace=True)
However, a %%timeit has revealed that this takes 10.6ms on just the seven rows shown above, which makes this solution pretty much infeasible on my actual datasets of about 1,000,000 rows. Does anyone know a solution here that is faster and thus feasible?
Add the next date column to emails
df_emails["NextDateSent"] = df_emails.groupby("CustID").shift(-1)
Sort both frames for merge_asof, then merge each trip to its nearest preceding email to create a trip lookup table:
df_emails = df_emails.sort_values("DateSent")
df_trips = df_trips.sort_values("TripDate")
df_lookup = pd.merge_asof(df_trips, df_emails, by="CustID", left_on="TripDate",right_on="DateSent", direction="backward")
Aggregate the lookup table for the data you want.
df_lookup = df_lookup.loc[:, ["CustID", "DateSent", "TotalSpend"]].groupby(["CustID", "DateSent"]).agg(["count","sum"])
Left join it back to the email table.
df_merge = df_emails.join(df_lookup, on=["CustID", "DateSent"]).sort_values("CustID")
I choose to leave NaNs as NaNs because I don't like filling in default values (you can always do that later if you prefer, but you can't easily distinguish between things that existed and things that didn't if you put defaults in early).
CustID DateSent NextDateSent (TotalSpend, count) (TotalSpend, sum)
0 2 2018-01-20 2018-02-19 2.0 125.0
1 2 2018-02-19 2018-03-31 1.0 250.0
2 2 2018-03-31 NaT NaN NaN
3 4 2018-01-10 2018-02-26 NaN NaN
4 4 2018-02-26 NaT 2.0 200.0
5 5 2018-02-01 2018-02-07 NaN NaN
6 5 2018-02-07 NaT NaN NaN
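To match the asker's expected output exactly, the remaining NaT/NaN placeholders can be filled in afterwards. A small follow-up sketch, assuming the join above produced exactly the five columns shown and using the 2018-04-01 end-of-data date from the question:

end_of_data = pd.Timestamp('2018-04-01')

# flatten the column names, then fill the gaps
df_merge.columns = ['CustID', 'DateSent', 'NextDateSentOrEndOfData',
                    'TripsBetween', 'TotalSpendBetween']
df_merge['NextDateSentOrEndOfData'] = df_merge['NextDateSentOrEndOfData'].fillna(end_of_data)
df_merge[['TripsBetween', 'TotalSpendBetween']] = (
    df_merge[['TripsBetween', 'TotalSpendBetween']].fillna(0))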
This would be an easy case for merge_asof if I could handle the max_date, so I go the long way:
max_date = pd.to_datetime('2018-04-01')
# set_index for easy extraction by id
df_emails.set_index('CustID', inplace=True)
# we want this later in the final output
df_emails['NextDateSentOrEndOfData'] = df_emails.groupby('CustID')['DateSent'].shift(-1).fillna(max_date)
# cuts function for groupby
def cuts(df):
    custID = df.CustID.iloc[0]
    bins = list(df_emails.loc[[custID], 'DateSent']) + [max_date]
    return pd.cut(df.TripDate, bins=bins, right=False)
# bin the dates:
s = df_trips.groupby('CustID', as_index=False, group_keys=False).apply(cuts)
# aggregate the info:
new_df = (df_trips.groupby([df_trips.CustID, s])
                  .TotalSpend.agg(['sum', 'size'])
                  .reset_index()
          )
# get the right limit:
new_df['NextDateSentOrEndOfData'] = new_df.TripDate.apply(lambda x: x.right)
# drop the unnecessary info
new_df.drop('TripDate', axis=1, inplace=True)
# merge:
df_emails.reset_index().merge(new_df,
                              on=['CustID', 'NextDateSentOrEndOfData'],
                              how='left'
                              )
Output:
CustID DateSent NextDateSentOrEndOfData sum size
0 2 2018-01-20 2018-02-19 125.0 2.0
1 2 2018-02-19 2018-03-31 250.0 1.0
2 2 2018-03-31 2018-04-01 NaN NaN
3 4 2018-01-10 2018-02-26 NaN NaN
4 4 2018-02-26 2018-04-01 200.0 2.0
5 5 2018-02-01 2018-02-07 NaN NaN
6 5 2018-02-07 2018-04-01 NaN NaN
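As with the first answer, the leftover NaNs can be mapped onto the asker's expected column names afterwards. A short sketch, assuming the final merge above is assigned to a variable (the original snippet only displays it):

out = df_emails.reset_index().merge(new_df,
                                    on=['CustID', 'NextDateSentOrEndOfData'],
                                    how='left')
out = out.rename(columns={'size': 'TripsBetween', 'sum': 'TotalSpendBetween'})
out[['TripsBetween', 'TotalSpendBetween']] = out[['TripsBetween', 'TotalSpendBetween']].fillna(0)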

Adding extra days for each month in pandas

In a pandas df, I have dates for a given month in the first column and Amount in the second column. How can I add the days that are not in there for that month in the first column and give them the value 0 in the second column?
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Date': ['5/23/2019', '5/9/2019'],
    'Amount': np.random.choice([10000])
})
I would like the result to look like the following:
Expected Output
Date Amount
0 5/01/2019 0
1 5/02/2019 0
.
.
. 5/23/2019 10000
. 5/24/2019 0
Look at date_range from pandas.
I'm assuming that 5/31/2019 is not in your output (as the comment asks) because you only want the range between the min and max dates?
I convert the Date column to a datetime type, pass the min and max dates to date_range and store that in a dataframe, then do a left join.
df['Date'] = pd.to_datetime(df['Date'])
date_range = pd.DataFrame(pd.date_range(start=df['Date'].min(), end=df['Date'].max()), columns=['Date'])
final_df = pd.merge(date_range, df, how='left')
Date Amount
0 2019-05-09 10000.0
1 2019-05-10 NaN
2 2019-05-11 NaN
3 2019-05-12 NaN
4 2019-05-13 NaN
5 2019-05-14 NaN
6 2019-05-15 NaN
7 2019-05-16 NaN
8 2019-05-17 NaN
9 2019-05-18 NaN
10 2019-05-19 NaN
11 2019-05-20 NaN
12 2019-05-21 NaN
13 2019-05-22 NaN
14 2019-05-23 10000.0
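If the goal is to cover the whole calendar month (the expected output starts at 5/01) and to show 0 instead of NaN, the range can be widened with the standard MonthEnd offset and the gaps filled. A sketch under that assumption, reusing the datetime-converted df from above:

# first and last day of the month(s) spanned by the data
start = df['Date'].min().to_period('M').to_timestamp()
end = df['Date'].max() + pd.offsets.MonthEnd(0)

full_range = pd.DataFrame({'Date': pd.date_range(start, end)})
final_df = full_range.merge(df, how='left').fillna({'Amount': 0})
print(final_df)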

Assistance needed in python pandas to reduce lines of code and cycle time

I have a DataFrame where I am calculating and filling in the EMI value in fields:
account Total Start Date End Date EMI
211829 107000 05/19/17 01/22/19 5350
320563 175000 08/04/17 10/30/18 12500
648336 246000 02/26/17 08/25/19 8482.7586206897
109996 175000 11/23/17 11/27/19 7291.6666666667
121213 317000 09/07/17 04/12/18 45285.7142857143
Then, based on the date range, I create new fields like Jan17, Feb17, Mar17, etc. and fill them up with the code below.
jant17 = pd.to_datetime('2017-01-01')
febt17 = pd.to_datetime('2017-02-01')
mart17 = pd.to_datetime('2017-03-01')
jan17 = pd.to_datetime('2017-01-31')
feb17 = pd.to_datetime('2017-02-28')
mar17 = pd.to_datetime('2017-03-31')
df.loc[(df['Start Date'] <= jan17) & (df['End Date'] >= jant17), 'Jan17'] = df['EMI']
But the drawback is when I have to do a forecast till 2019 or 2020 They become too many lines of code to write and when there is any update I need to modify too many lines of code. To reduce the lines of code I tried an alternate method with using for loop but the code started taking very long to execute.
monthend = {'Jan17': pd.to_datetime('2017-01-31'),
            'Feb17': pd.to_datetime('2017-02-28'),
            'Mar17': pd.to_datetime('2017-03-31')}
monthbeg = {'Jant17': pd.to_datetime('2017-01-01'),
            'Febt17': pd.to_datetime('2017-02-01'),
            'Mart17': pd.to_datetime('2017-03-01')}
for mend in monthend.values():
    for mbeg in monthbeg.values():
        for coln in colnames:
            df.loc[(df['Start Date'] <= mend) & (df['End Date'] >= mbeg), coln] = df['EMI']
This greatly reduced the number of lines of code but increased the execution time from 3-4 minutes to over an hour. Is there a better way to code this with fewer lines and less processing time?
I think you can create a helper DataFrame with start dates, end dates and column names, loop over its rows and create the new columns of the original df:
dates = pd.DataFrame({'start': pd.date_range('2017-01-01', freq='MS', periods=10),
                      'end': pd.date_range('2017-01-01', freq='M', periods=10)})
dates['names'] = dates.start.dt.strftime('%b%y')
print (dates)
end start names
0 2017-01-31 2017-01-01 Jan17
1 2017-02-28 2017-02-01 Feb17
2 2017-03-31 2017-03-01 Mar17
3 2017-04-30 2017-04-01 Apr17
4 2017-05-31 2017-05-01 May17
5 2017-06-30 2017-06-01 Jun17
6 2017-07-31 2017-07-01 Jul17
7 2017-08-31 2017-08-01 Aug17
8 2017-09-30 2017-09-01 Sep17
9 2017-10-31 2017-10-01 Oct17
#if necessary convert to datetimes
df['Start Date'] = pd.to_datetime(df['Start Date'])
df['End Date'] = pd.to_datetime(df['End Date'])
def f(x):
    df.loc[(df['Start Date'] <= x.start) & (df['End Date'] >= x.end), x.names] = df['EMI']

dates.apply(f, axis=1)
print (df)
account Total Start Date End Date EMI Jan17 Feb17 \
0 211829 107000 2017-05-19 2019-01-22 5350.000000 NaN NaN
1 320563 175000 2017-08-04 2018-10-30 12500.000000 NaN NaN
2 648336 246000 2017-02-26 2019-08-25 8482.758621 NaN NaN
3 109996 175000 2017-11-23 2019-11-27 7291.666667 NaN NaN
4 121213 317000 2017-09-07 2018-04-12 45285.714286 NaN NaN
Mar17 Apr17 May17 Jun17 Jul17 \
0 NaN NaN NaN 5350.000000 5350.000000
1 NaN NaN NaN NaN NaN
2 8482.758621 8482.758621 8482.758621 8482.758621 8482.758621
3 NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN
Aug17 Sep17 Oct17
0 5350.000000 5350.000000 5350.000000
1 NaN 12500.000000 12500.000000
2 8482.758621 8482.758621 8482.758621
3 NaN NaN NaN
4 NaN NaN 45285.714286
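If even the row-wise apply over the dates helper becomes slow for a long forecast horizon, the same condition can be evaluated in one shot with NumPy broadcasting. This is only a sketch of that idea (not part of the original answer); it reuses the dates frame built above and mirrors the condition used in f:

import numpy as np
import pandas as pd

starts = dates['start'].to_numpy()   # shape (n_months,)
ends = dates['end'].to_numpy()

# (n_loans, n_months) boolean mask, same condition as f evaluated for all months at once
active = ((df['Start Date'].to_numpy()[:, None] <= starts) &
          (df['End Date'].to_numpy()[:, None] >= ends))

monthly = pd.DataFrame(np.where(active, df['EMI'].to_numpy()[:, None], np.nan),
                       index=df.index, columns=list(dates['names']))
df = pd.concat([df, monthly], axis=1)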

Removing rows with any column containing NaN, NaTs, and nans

Currently I have data as below:
df_all.head()
Out[2]:
Unnamed: 0 Symbol Date Close Weight
0 4061 A 2016-01-13 36.515889 (0.000002)
1 4062 AA 2016-01-14 36.351784 0.000112
2 4063 AAC 2016-01-15 36.351784 (0.000004)
3 4064 AAL 2016-01-19 36.590483 0.000006
4 4065 AAMC 2016-01-20 35.934062 0.000002
df_all.tail()
Out[3]:
Unnamed: 0 Symbol Date Close Weight
1252498 26950320 nan NaT 9.84 NaN
1252499 26950321 nan NaT 10.26 NaN
1252500 26950322 nan NaT 9.99 NaN
1252501 26950323 nan NaT 9.11 NaN
1252502 26950324 nan NaT 9.18 NaN
df_all.dtypes
Out[4]:
Unnamed: 0 int64
Symbol object
Date datetime64[ns]
Close float64
Weight object
dtype: object
As can be seen, I am getting the string 'nan' in Symbol, NaT in Date, and NaN in Weight.
MY GOAL: I want to remove any row that has ANY column containing nan, NaT or NaN, and have a new df_clean as the result.
I don't seem to be able to apply the appropriate filter. I am not sure if I have to convert the datatypes first (although I tried this as well).
You can use
df_clean = df_all.replace({'nan': None})
df_clean = df_clean[~pd.isnull(df_clean).any(axis=1)]
This is because isnull recognizes both NaN and NaT as "null" values, and replacing the string 'nan' with None turns it into a real missing value as well.
The string 'nan' in Symbol is not caught by dropna() or isnull(), so you need to cast the 'nan' symbol to np.nan first.
Try this:
df_all["Symbol"] = np.where(df_all["Symbol"] == 'nan', np.nan, df_all["Symbol"])
df_clean = df_all.dropna()
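Either way, a compact version that produces the requested df_clean could look like this (a sketch that assumes the literal string 'nan' is the only non-standard missing marker in the frame):

import numpy as np

# turn the string 'nan' into a real missing value, then drop any row containing NaN/NaT
df_clean = df_all.replace('nan', np.nan).dropna()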

Combine_first and null values in Pandas

df1:
0 1
0 nan 3.00
1 -4.00 nan
2 nan 7.00
df2:
0 1 2
1 -42.00 nan 8.00
2 -5.00 nan 4.00
df3 = df1.combine_first(df2)
df3:
0 1 2
0 nan 3.00 nan
1 -4.00 nan 8.00
2 -5.00 7.00 4.00
This is what I'd like df3 to be:
0 1 2
0 nan 3.00 nan
1 -4.00 nan 8.00
2 nan 7.00 4.00
(The difference is at df3.loc[2, 0].)
That is, if the column and index are the same for any cell in both df1 and df2, I'd like df1's value to prevail, even if that value is nan. combine_first does that, except when the value in df1 is nan.
Here's a bit of a hacky way to do it. First, align df2 with df1, which creates a frame indexed with the union of df1/df2, filled with df2's values. Then assign back df1's values.
In [325]: df3, _ = df2.align(df1)
In [327]: df3.loc[df1.index, df1.columns] = df1
In [328]: df3
Out[328]:
0 1 2
0 NaN 3 NaN
1 -4 NaN 8
2 NaN 7 4
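A self-contained version of the same trick, with the example frames reconstructed from the values in the question (a sketch, for illustration only):

import numpy as np
import pandas as pd

df1 = pd.DataFrame({0: [np.nan, -4.0, np.nan], 1: [3.0, np.nan, 7.0]})
df2 = pd.DataFrame({0: [-42.0, -5.0], 1: [np.nan, np.nan], 2: [8.0, 4.0]}, index=[1, 2])

# align builds the union of indexes/columns, filled with df2's values ...
df3, _ = df2.align(df1)
# ... and the .loc assignment lets df1's values (including its NaNs) win on the overlap
df3.loc[df1.index, df1.columns] = df1
print(df3)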
