NaN values when adding two columns - python

I have two dataframes with different indexing, and I want to sum the same column from both.
I tried the following, but it gives NaN values:
result['Anomaly'] = df['Anomaly'] + tmp['Anomaly']
df
date Anomaly
0 2018-12-06 0
1 2019-01-07 0
2 2019-02-06 1
3 2019-03-06 0
4 2019-04-06 0
tmp
date Anomaly
0 2018-12-06 0
1 2019-01-07 1
4 2019-04-06 0
result
date Anomaly
0 2018-12-06 0.0
1 2019-01-07 NaN
2 2019-02-06 1.0
3 2019-03-06 0.0
4 2019-04-06 0.0
What I want is actually:
result
date Anomaly
0 2018-12-06 0
1 2019-01-07 1
2 2019-02-06 1
3 2019-03-06 0
4 2019-04-06 0

Here it is necessary to align by dates, so first use DataFrame.set_index on the date column and then use Series.add with fill_value:
df = df.set_index('date')
tmp = tmp.set_index('date')
result = df['Anomaly'].add(tmp['Anomaly'], fill_value=0).reset_index()
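For context (my addition), the NaNs come from index alignment: + matches rows by index label, and any label missing from one Series produces NaN there. fill_value=0 tells add to treat those missing labels as 0. A quick illustration on toy data (my own example, not the frames above):
import pandas as pd
# index labels 2 and 3 exist only in `a`
a = pd.Series([0, 0, 1, 0, 0], index=[0, 1, 2, 3, 4])
b = pd.Series([0, 1, 0], index=[0, 1, 4])
print(a + b)                   # NaN wherever a label is missing from one side
print(a.add(b, fill_value=0))  # missing labels are treated as 0 instead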

You can try this
pd.concat([df, tmp]).groupby('date', as_index=False)["Anomaly"].sum()
date Anomaly
0 2018-12-06 0
1 2019-01-07 1
2 2019-02-06 1
3 2019-03-06 0
4 2019-04-06 0

Using combine_first():
res = pd.DataFrame({'date':df.date,'Anomaly':tmp.Anomaly.combine_first(df.Anomaly)})
print(res)
date Anomaly
0 2018-12-06 0.0
1 2019-01-07 1.0
2 2019-02-06 1.0
3 2019-03-06 0.0
4 2019-04-06 0.0
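Note (my addition): combine_first does not sum the two columns; it takes tmp's value where present and falls back to df's. It matches the desired output here only because the overlapping Anomaly values in df are all 0. With overlapping non-zero values, only tmp's value survives:
s1 = pd.Series([2], index=[0])
s2 = pd.Series([5], index=[0])
print(s1.combine_first(s2))  # 2, not 7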

You must first set correct indices on your dataframes, and then add using the date indices:
tmp1 = tmp.set_index('date')
result = df.set_index('date')
result.loc[tmp1.index] += tmp1
result.reset_index(inplace=True)
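One caveat (my addition, not from the answer): result.loc[tmp1.index] += tmp1 assumes every date in tmp also exists in df; otherwise .loc raises a KeyError for the missing labels. A defensive variant, applied before the reset_index step, restricts the update to the common dates:
common = tmp1.index.intersection(result.index)
result.loc[common, 'Anomaly'] += tmp1.loc[common, 'Anomaly']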

Related

Conditional groupby on dates using pandas

I'm in a bit of a pickle. I've been working on a problem all day without seeing any real results. I'm working in Python and using Pandas for handling data.
What I'm trying to achieve is to sum up each type of interaction based on the customer's previous interactions. The timestamp of the interaction should be less than the timestamp of the survey. Ideally, I would only like to sum the interactions within some period - e.g. less than 5 years before the survey.
The first dataframe contains a customer ID, the segmentation of that customer in that survey (e.g. 1 being "happy", 2 being "sad") and a timestamp for the time of the recorded segment, i.e. the time of that survey.
import pandas as pd
#Generic example
customers = pd.DataFrame({"customerID":[1,1,1,2,2,3,4,4],"customerSeg":[1,2,2,1,2,3,3,3],"timestamp":['1999-01-01','2000-01-01','2000-06-01','2001-01-01','2003-01-01','1999-01-01','2005-01-01','2008-01-01']})
customers
Which yields something like:
customerID  customerSeg  timestamp
1           1            1999-01-01
1           1            2000-01-01
1           1            2000-06-01
2           2            2001-01-01
2           2            2003-01-01
3           3            1999-01-01
4           4            2005-01-01
4           4            2008-01-01
The other dataframe contains interactions with that customer eg. at service and a phonecall.
interactions = pd.DataFrame({"customerID":[1,1,1,1,2,2,2,2,4,4,4],"timestamp":['1999-07-01','1999-11-01','2000-03-01','2001-04-01','2000-12-01','2002-01-01','2004-03-01','2004-05-01','2000-01-01','2004-01-01','2009-01-01'],"service":[1,0,1,0,1,0,1,1,0,1,1],"phonecall":[0,1,1,1,1,1,0,1,1,0,1]})
interactions
Output:
customerID  timestamp   service  phonecall
1           1999-07-01  1        0
1           1999-11-01  0        1
1           2000-03-01  1        1
1           2001-04-01  0        1
2           2000-12-01  1        1
2           2002-01-01  0        1
2           2004-03-01  1        0
2           2004-05-01  1        1
4           2000-01-01  0        1
4           2004-01-01  1        0
4           2009-01-01  1        1
Result for all previous interactions (ideally, I would like only the last 5 years):
customerID  customerSeg  timestamp   service  phonecall
1           1            1999-01-01  0        0
1           1            2000-01-01  1        1
1           1            2000-06-01  2        2
2           2            2001-01-01  1        1
2           2            2003-01-01  1        2
3           3            1999-01-01  0        0
4           4            2005-01-01  1        1
4           4            2008-01-01  1        1
I've tried almost everything I could come up with, so I would really appreciate some input. I'm pretty much confined to using Pandas and Python, since it's the language I'm most familiar with, but also because I need to read a CSV file of the customer segmentation.
I think you need several steps for transforming your data.
First of all, we convert the timestamp columns in both dataframes to datetime, so we can calculate the desired interval and do the comparisons:
customers['timestamp'] = pd.to_datetime(customers['timestamp'])
interactions['timestamp'] = pd.to_datetime(interactions['timestamp'])
After that, we create a new column that contains the start date (e.g. 5 years before the timestamp):
customers['start_date'] = customers['timestamp'] - pd.DateOffset(years=5)
Now we join the customers dataframe with the interactions dataframe on the customerID:
result = customers.merge(interactions, on='customerID', how='outer')
This yields
customerID customerSeg timestamp_x start_date timestamp_y service phonecall
0 1 1 1999-01-01 1994-01-01 1999-07-01 1.0 0.0
1 1 1 1999-01-01 1994-01-01 1999-11-01 0.0 1.0
2 1 1 1999-01-01 1994-01-01 2000-03-01 1.0 1.0
3 1 1 1999-01-01 1994-01-01 2001-04-01 0.0 1.0
4 1 2 2000-01-01 1995-01-01 1999-07-01 1.0 0.0
5 1 2 2000-01-01 1995-01-01 1999-11-01 0.0 1.0
6 1 2 2000-01-01 1995-01-01 2000-03-01 1.0 1.0
7 1 2 2000-01-01 1995-01-01 2001-04-01 0.0 1.0
...
Now the condition is evaluated: we only want to use the service and phonecall values from rows that meet it (timestamp_y lies in the interval between start_date and timestamp_x), so we replace the others with zero:
result['service'] = result.apply(lambda x: x.service if (x.timestamp_y >= x.start_date) and (x.timestamp_y <= x.timestamp_x) else 0, axis=1)
result['phonecall'] = result.apply(lambda x: x.phonecall if (x.timestamp_y >= x.start_date) and (x.timestamp_y <= x.timestamp_x) else 0, axis=1)
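As a side note (my addition), the same masking can be done without apply, which is usually much faster on large frames; Series.between is inclusive on both ends, so it matches the >= and <= comparisons above:
in_window = result['timestamp_y'].between(result['start_date'], result['timestamp_x'])
result.loc[~in_window, ['service', 'phonecall']] = 0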
Finally we group the dataframe, summing up the service and phonecall interactions:
result = result.groupby(['customerID', 'timestamp_x', 'customerSeg'])[['service', 'phonecall']].sum()
Result:
service phonecall
customerID timestamp_x customerSeg
1 1999-01-01 1 0.0 0.0
2000-01-01 2 1.0 1.0
2000-06-01 2 2.0 2.0
2 2001-01-01 1 1.0 1.0
2003-01-01 2 1.0 2.0
3 1999-01-01 3 0.0 0.0
4 2005-01-01 3 1.0 1.0
2008-01-01 3 1.0 0.0
(Note that your customerSeg data in the sample code seems not quite to match the data in the table.)
One option is to use the conditional_join from pyjanitor to compute the rows that match the criteria, before grouping and summing:
# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import pandas as pd
import janitor
customers['timestamp'] = pd.to_datetime(customers['timestamp'])
interactions['timestamp'] = pd.to_datetime(interactions['timestamp'])
customers['start_date'] = customers['timestamp'] - pd.DateOffset(years=5)
(interactions
    .conditional_join(
        customers,
        # column from left, column from right, comparison operator
        ('timestamp', 'timestamp', '<='),
        ('timestamp', 'start_date', '>='),
        ('customerID', 'customerID', '=='),
        how='right')
    # drop irrelevant columns
    .drop(columns=[('left', 'customerID'),
                   ('left', 'timestamp'),
                   ('right', 'start_date')])
    # return to single index
    .droplevel(0, 1)
    .groupby(['customerID', 'customerSeg', 'timestamp'])
    .sum()
)
service phonecall
customerID customerSeg timestamp
1 1 1999-01-01 0.0 0.0
2 2000-01-01 1.0 1.0
2000-06-01 2.0 2.0
2 1 2001-01-01 1.0 1.0
2 2003-01-01 1.0 2.0
3 3 1999-01-01 0.0 0.0
4 3 2005-01-01 1.0 1.0
2008-01-01 1.0 0.0

Add rows to fill date gaps

I need to insert rows in my dataframe:
This is my df:
I want this result, grouped by client. I mean, I have to create this for every client present in my dataframe
Try something like this:
df['month'] = pd.to_datetime(df.month, format='%d/%m/%Y',dayfirst=True ,errors='coerce')
df.set_index(['month']).groupby(['client']).resample('M').asfreq().drop('client', axis=1).reset_index()
client month col1
0 1 2017-03-31 20.0
1 1 2017-04-30 NaN
2 1 2017-05-31 90.0
3 1 2017-06-30 NaN
4 1 2017-07-31 NaN
5 1 2017-08-31 NaN
6 1 2017-09-30 NaN
7 1 2017-10-31 NaN
8 1 2017-11-30 NaN
9 1 2017-12-31 100.0
10 2 2018-09-30 NaN
11 2 2018-10-31 7.0
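The question's input dataframe was only shown as an image, so here is a small self-contained sketch of the same pattern on made-up data (my own example, chosen to be consistent with the output above):
import pandas as pd

df = pd.DataFrame({'client': [1, 1, 1, 2, 2],
                   'month': ['31/03/2017', '31/05/2017', '31/12/2017', '30/09/2018', '31/10/2018'],
                   'col1': [20, 90, 100, None, 7]})
df['month'] = pd.to_datetime(df.month, format='%d/%m/%Y', dayfirst=True, errors='coerce')
out = (df.set_index('month')
         .groupby('client')
         .resample('M')        # one row per month-end within each client's date range
         .asfreq()
         .drop('client', axis=1)
         .reset_index())
print(out)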

How to return CUMSUM of Days in Dates - Python

How do I return the CumSum number of days for the dates provided?
import pandas as pd
df = pd.DataFrame({
    'date': ['2019-01-01', '2019-01-03', '2019-01-05',
             '2019-01-06', '2019-01-07', '2019-01-08',
             '2019-01-09', '2019-01-12', '2019-01-13']})
df['date'].cumsum() does not work here.
Desired dataframe:
date Cumsum days
0 2019-01-01 0
1 2019-01-03 2
2 2019-01-05 4
3 2019-01-06 5
4 2019-01-07 6
5 2019-01-08 7
6 2019-01-09 8
7 2019-01-12 11
8 2019-01-13 12
Another way is calling diff, fillna and cumsum (after converting the column to datetime):
df['date'] = pd.to_datetime(df['date'])
df['cumsum days'] = df['date'].diff().dt.days.fillna(0).cumsum()
Out[2044]:
date cumsum days
0 2019-01-01 0.0
1 2019-01-03 2.0
2 2019-01-05 4.0
3 2019-01-06 5.0
4 2019-01-07 6.0
5 2019-01-08 7.0
6 2019-01-09 8.0
7 2019-01-12 11.0
8 2019-01-13 12.0
Thanks, this should work:
def generate_time_delta_column(df, time_column, date_first_online_column):
    return (df[time_column] - df[date_first_online_column]).dt.days
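Since the cumulative count is just the number of days since the first date, it can also be computed directly, without diff/cumsum (an alternative sketch, assuming df['date'] has already been converted with pd.to_datetime):
df['cumsum days'] = (df['date'] - df['date'].iloc[0]).dt.days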

Counting Number of Occurrences Between Dates (Given an ID value) From Another Dataframe

Pandas: select DF rows based on another DF is the closest answer I can find to my question, but I don't believe it quite solves it.
Anyway, I am working with two very large pandas dataframes (so speed is a consideration), df_emails and df_trips, both of which are already sorted by CustID and then by date.
df_emails includes the date we sent a customer an email and it looks like this:
CustID DateSent
0 2 2018-01-20
1 2 2018-02-19
2 2 2018-03-31
3 4 2018-01-10
4 4 2018-02-26
5 5 2018-02-01
6 5 2018-02-07
df_trips includes the dates a customer came to the store and how much they spent, and it looks like this:
CustID TripDate TotalSpend
0 2 2018-02-04 25
1 2 2018-02-16 100
2 2 2018-02-22 250
3 4 2018-01-03 50
4 4 2018-02-28 100
5 4 2018-03-21 100
6 8 2018-01-07 200
Basically, what I need to do is find the number of trips and total spend for each customer in between each email sent. If it is the last time an email is sent for a given customer, I need to find the total number of trips and total spend after the email, but before the end of the data (2018-04-01). So the final dataframe would look like this:
CustID DateSent NextDateSentOrEndOfData TripsBetween TotalSpendBetween
0 2 2018-01-20 2018-02-19 2.0 125.0
1 2 2018-02-19 2018-03-31 1.0 250.0
2 2 2018-03-31 2018-04-01 0.0 0.0
3 4 2018-01-10 2018-02-26 0.0 0.0
4 4 2018-02-26 2018-04-01 2.0 200.0
5 5 2018-02-01 2018-02-07 0.0 0.0
6 5 2018-02-07 2018-04-01 0.0 0.0
Though I have tried my best to do this in a Python/Pandas friendly way, the only accurate solution I have been able to implement is through an np.where, shifting, and looping. The solution looks like this:
df_emails["CustNthVisit"] = df_emails.groupby("CustID").cumcount()+1
df_emails["CustTotalVisit"] = df_emails.groupby("CustID")["CustID"].transform('count')
df_emails["NextDateSentOrEndOfData"] = pd.to_datetime(df_emails["DateSent"].shift(-1)).where(df_emails["CustNthVisit"] != df_emails["CustTotalVisit"], pd.to_datetime('04-01-2018'))
for i in df_emails.index:
    df_emails.at[i, "TripsBetween"] = len(df_trips[(df_trips["CustID"] == df_emails.at[i, "CustID"]) & (df_trips["TripDate"] > df_emails.at[i, "DateSent"]) & (df_trips["TripDate"] < df_emails.at[i, "NextDateSentOrEndOfData"])])
for i in df_emails.index:
    df_emails.at[i, "TotalSpendBetween"] = df_trips[(df_trips["CustID"] == df_emails.at[i, "CustID"]) & (df_trips["TripDate"] > df_emails.at[i, "DateSent"]) & (df_trips["TripDate"] < df_emails.at[i, "NextDateSentOrEndOfData"])].TotalSpend.sum()
df_emails.drop(['CustNthVisit',"CustTotalVisit"], axis=1, inplace=True)
However, a %%timeit has revealed that this takes 10.6ms on just the seven rows shown above, which makes this solution pretty much infeasible on my actual datasets of about 1,000,000 rows. Does anyone know a solution here that is faster and thus feasible?
Add the next date column to emails
df_emails["NextDateSent"] = df_emails.groupby("CustID").shift(-1)
Sort both frames for merge_asof, then merge each trip to the nearest preceding email to create a trip lookup table:
df_emails = df_emails.sort_values("DateSent")
df_trips = df_trips.sort_values("TripDate")
df_lookup = pd.merge_asof(df_trips, df_emails, by="CustID", left_on="TripDate",right_on="DateSent", direction="backward")
Aggregate the lookup table for the data you want.
df_lookup = df_lookup.loc[:, ["CustID", "DateSent", "TotalSpend"]].groupby(["CustID", "DateSent"]).agg(["count","sum"])
Left join it back to the email table.
df_merge = df_emails.join(df_lookup, on=["CustID", "DateSent"]).sort_values("CustID")
I choose to leave NaNs as NaNs because I don't like filling default values (you can always do that later if you prefer, but you can't easily distinguish between things that existed vs things that didn't if you put defaults in early)
CustID DateSent NextDateSent (TotalSpend, count) (TotalSpend, sum)
0 2 2018-01-20 2018-02-19 2.0 125.0
1 2 2018-02-19 2018-03-31 1.0 250.0
2 2 2018-03-31 NaT NaN NaN
3 4 2018-01-10 2018-02-26 NaN NaN
4 4 2018-02-26 NaT 2.0 200.0
5 5 2018-02-01 2018-02-07 NaN NaN
6 5 2018-02-07 NaT NaN NaN
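If you do want the exact layout asked for in the question, one possible cleanup (my addition, not part of the answer) is to flatten the column names, fill the missing next dates with the end-of-data date, and fill the missing aggregates with 0:
df_merge.columns = ['CustID', 'DateSent', 'NextDateSentOrEndOfData', 'TripsBetween', 'TotalSpendBetween']
df_merge['NextDateSentOrEndOfData'] = df_merge['NextDateSentOrEndOfData'].fillna(pd.Timestamp('2018-04-01'))
df_merge[['TripsBetween', 'TotalSpendBetween']] = df_merge[['TripsBetween', 'TotalSpendBetween']].fillna(0)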
This would be an easy case for merge_asof if it could handle the max_date, so I take the long way round:
max_date = pd.to_datetime('2018-04-01')
# set_index for easy extraction by id
df_emails.set_index('CustID', inplace=True)
# we want this later in the final output
df_emails['NextDateSentOrEndOfData'] = df_emails.groupby('CustID').shift(-1).fillna(max_date)
# cuts function for groupby
def cuts(df):
    custID = df.CustID.iloc[0]
    bins = list(df_emails.loc[[custID], 'DateSent']) + [max_date]
    return pd.cut(df.TripDate, bins=bins, right=False)
# bin the dates:
s = df_trips.groupby('CustID', as_index=False, group_keys=False).apply(cuts)
# aggregate the info:
new_df = (df_trips.groupby([df_trips.CustID, s])
                  .TotalSpend.agg(['sum', 'size'])
                  .reset_index()
          )
# get the right limit:
new_df['NextDateSentOrEndOfData'] = new_df.TripDate.apply(lambda x: x.right)
# drop the unnecessary info
new_df.drop('TripDate', axis=1, inplace=True)
# merge:
df_emails.reset_index().merge(new_df,
                              on=['CustID', 'NextDateSentOrEndOfData'],
                              how='left')
Output:
CustID DateSent NextDateSentOrEndOfData sum size
0 2 2018-01-20 2018-02-19 125.0 2.0
1 2 2018-02-19 2018-03-31 250.0 1.0
2 2 2018-03-31 2018-04-01 NaN NaN
3 4 2018-01-10 2018-02-26 NaN NaN
4 4 2018-02-26 2018-04-01 200.0 2.0
5 5 2018-02-01 2018-02-07 NaN NaN
6 5 2018-02-07 2018-04-01 NaN NaN

How to take difference of several Timestamp series in a dataframe in Pandas?

I want to obtain the timedelta interval between several timestamp columns in a dataframe. Also, several entries are NaN.
Original DF:
0 1 2 3 4 5
0 date1 date2 NaN NaN NaN NaN
1 date3 date4 date5 date6 date7 date8
Desired Output:
0 1 2 3 4
0 date2-date1 NaN NaN NaN NaN
1 date4-date3 date5-date4 date6-date5 date7-date6 date8-date7
I think you can use this approach if consecutive NaNs only appear at the end of each row:
import numpy as np

df = pd.DataFrame([['2015-01-02', '2015-01-03', np.nan, np.nan],
                   ['2015-01-02', '2015-01-05', '2015-01-07', '2015-01-12']])
print (df)
0 1 2 3
0 2015-01-02 2015-01-03 NaN NaN
1 2015-01-02 2015-01-05 2015-01-07 2015-01-12
df = df.apply(pd.to_datetime).ffill(axis=1).diff(axis=1)
print (df)
0 1 2 3
0 NaT 1 days 0 days 0 days
1 NaT 3 days 2 days 5 days
Details:
First convert all columns to datetimes:
print (df.apply(pd.to_datetime))
0 1 2 3
0 2015-01-02 2015-01-03 NaT NaT
1 2015-01-02 2015-01-05 2015-01-07 2015-01-12
Replace NaNs by forward filling last value per rows:
print (df.apply(pd.to_datetime).ffill(axis=1))
0 1 2 3
0 2015-01-02 2015-01-03 2015-01-03 2015-01-03
1 2015-01-02 2015-01-05 2015-01-07 2015-01-12
Get difference by diff:
print (df.apply(pd.to_datetime).ffill(axis=1).diff(axis=1))
0 1 2 3
0 NaT 1 days 0 days 0 days
1 NaT 3 days 2 days 5 days
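To get the layout from the question (differences starting in column 0, with one column fewer), a small follow-up sketch: drop the leading all-NaT column and relabel the remaining columns:
res = df.apply(pd.to_datetime).ffill(axis=1).diff(axis=1)
res = res.iloc[:, 1:]              # drop the first column, which is always NaT
res.columns = range(res.shape[1])  # relabel as 0..n-2
print(res)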
