I'm in a bit of a pickle. I've been working on a problem all day without seeing any real results. I'm working in Python and using Pandas for handling data.
What I'm trying to achieve is to sum each type of interaction based on the customer's previous interactions. Only interactions whose timestamp is earlier than the survey's timestamp should count. Ideally, I would like to sum the interactions for the customer over a limited period, e.g. the 5 years before the survey.
The first dataframe contains a customer ID, the customer's segment in that survey (e.g. 1 being "happy", 2 being "sad") and a timestamp for when the segment was recorded, i.e. the time of the survey.
import pandas as pd
#Generic example
customers = pd.DataFrame({"customerID":[1,1,1,2,2,3,4,4],"customerSeg":[1,2,2,1,2,3,3,3],"timestamp":['1999-01-01','2000-01-01','2000-06-01','2001-01-01','2003-01-01','1999-01-01','2005-01-01','2008-01-01']})
customers
Which yields something like:
customerID  customerSeg  timestamp
1           1            1999-01-01
1           1            2000-01-01
1           1            2000-06-01
2           2            2001-01-01
2           2            2003-01-01
3           3            1999-01-01
4           4            2005-01-01
4           4            2008-01-01
The other dataframe contains interactions with that customer, e.g. a service visit or a phone call.
interactions = pd.DataFrame({"customerID":[1,1,1,1,2,2,2,2,4,4,4],"timestamp":['1999-07-01','1999-11-01','2000-03-01','2001-04-01','2000-12-01','2002-01-01','2004-03-01','2004-05-01','2000-01-01','2004-01-01','2009-01-01'],"service":[1,0,1,0,1,0,1,1,0,1,1],"phonecall":[0,1,1,1,1,1,0,1,1,0,1]})
interactions
Output:
customerID  timestamp   service  phonecall
1           1999-07-01  1        0
1           1999-11-01  0        1
1           2000-03-01  1        1
1           2001-04-01  0        1
2           2000-12-01  1        1
2           2002-01-01  0        1
2           2004-03-01  1        0
2           2004-05-01  1        1
4           2000-01-01  0        1
4           2004-01-01  1        0
4           2009-01-01  1        1
Result for all previous interactions (ideally, I would like only the last 5 years):
customerID  customerSeg  timestamp   service  phonecall
1           1            1999-01-01  0        0
1           1            2000-01-01  1        1
1           1            2000-06-01  2        2
2           2            2001-01-01  1        1
2           2            2003-01-01  1        2
3           3            1999-01-01  0        0
4           4            2005-01-01  1        1
4           4            2008-01-01  1        1
I've tried almost everything I could come up with, so I would really appreciate some input. I'm pretty much confined to Pandas and Python, since it's the language I'm most familiar with, but also because I need to read a CSV file of the customer segmentation.
I think you need several steps for transforming your data.
First of all, we convert the timestamp columns in both dataframes to datetime, so we can calculate the desired interval and do the comparisons:
customers['timestamp'] = pd.to_datetime(customers['timestamp'])
interactions['timestamp'] = pd.to_datetime(interactions['timestamp'])
After that, we create a new column that contains the start date (e.g. 5 years before the timestamp):
customers['start_date'] = customers['timestamp'] - pd.DateOffset(years=5)
Now we join the customers dataframe with the interactions dataframe on the customerID:
result = customers.merge(interactions, on='customerID', how='outer')
This yields
customerID customerSeg timestamp_x start_date timestamp_y service phonecall
0 1 1 1999-01-01 1994-01-01 1999-07-01 1.0 0.0
1 1 1 1999-01-01 1994-01-01 1999-11-01 0.0 1.0
2 1 1 1999-01-01 1994-01-01 2000-03-01 1.0 1.0
3 1 1 1999-01-01 1994-01-01 2001-04-01 0.0 1.0
4 1 2 2000-01-01 1995-01-01 1999-07-01 1.0 0.0
5 1 2 2000-01-01 1995-01-01 1999-11-01 0.0 1.0
6 1 2 2000-01-01 1995-01-01 2000-03-01 1.0 1.0
7 1 2 2000-01-01 1995-01-01 2001-04-01 0.0 1.0
...
Now here is how the condition is evaluated - what we want is that only those service and phonecall interactions will be used that are in rows that meet the condition (timestamp_y is in the interval between start_date and timestamp_x), so we replace the others by zero:
result['service'] = result.apply(lambda x: x.service if (x.timestamp_y >= x.start_date) and (x.timestamp_y <= x.timestamp_x) else 0, axis=1)
result['phonecall'] = result.apply(lambda x: x.phonecall if (x.timestamp_y >= x.start_date) and (x.timestamp_y <= x.timestamp_x) else 0, axis=1)
Finally we group the dataframe, summing up the service and phonecall interactions:
result = result.groupby(['customerID', 'timestamp_x', 'customerSeg'])[['service', 'phonecall']].sum()
Result:
service phonecall
customerID timestamp_x customerSeg
1 1999-01-01 1 0.0 0.0
2000-01-01 2 1.0 1.0
2000-06-01 2 2.0 2.0
2 2001-01-01 1 1.0 1.0
2003-01-01 2 1.0 2.0
3 1999-01-01 3 0.0 0.0
4 2005-01-01 3 1.0 1.0
2008-01-01 3 1.0 0.0
(Note that your customerSeg data in the sample code seems not quite to match the data in the table.)
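The row-wise apply calls above can be slow on large frames. As a minimal sketch of the same merge, filter and sum steps with a vectorised boolean mask instead (the suffix names `_survey` and `_event` and the variable names are my own choices, and the window is inclusive on both ends like the apply version):

```python
import pandas as pd

# Sample data copied from the question
customers = pd.DataFrame({
    "customerID": [1, 1, 1, 2, 2, 3, 4, 4],
    "customerSeg": [1, 2, 2, 1, 2, 3, 3, 3],
    "timestamp": ["1999-01-01", "2000-01-01", "2000-06-01", "2001-01-01",
                  "2003-01-01", "1999-01-01", "2005-01-01", "2008-01-01"],
})
interactions = pd.DataFrame({
    "customerID": [1, 1, 1, 1, 2, 2, 2, 2, 4, 4, 4],
    "timestamp": ["1999-07-01", "1999-11-01", "2000-03-01", "2001-04-01",
                  "2000-12-01", "2002-01-01", "2004-03-01", "2004-05-01",
                  "2000-01-01", "2004-01-01", "2009-01-01"],
    "service": [1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1],
    "phonecall": [0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1],
})

customers["timestamp"] = pd.to_datetime(customers["timestamp"])
interactions["timestamp"] = pd.to_datetime(interactions["timestamp"])
customers["start_date"] = customers["timestamp"] - pd.DateOffset(years=5)

# A left merge keeps customers without any interaction (customer 3)
merged = customers.merge(interactions, on="customerID", how="left",
                         suffixes=("_survey", "_event"))

# True where the interaction falls inside [start_date, survey timestamp];
# rows with a missing interaction timestamp compare as False
in_window = merged["timestamp_event"].between(merged["start_date"],
                                              merged["timestamp_survey"])

# Zero out everything outside the window, then sum per survey row
cols = ["service", "phonecall"]
merged.loc[~in_window, cols] = 0
summary = (merged
           .groupby(["customerID", "timestamp_survey", "customerSeg"])[cols]
           .sum()
           .astype(int)
           .reset_index())
print(summary)
```

The sums should match the grouped result shown above; only the filtering step differs.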
One option is to use the conditional_join from pyjanitor to compute the rows that match the criteria, before grouping and summing:
# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import pandas as pd
import janitor
customers['timestamp'] = pd.to_datetime(customers['timestamp'])
interactions['timestamp'] = pd.to_datetime(interactions['timestamp'])
customers['start_date'] = customers['timestamp'] - pd.DateOffset(years=5)
(interactions
.conditional_join(
customers,
# column from left, column from right, comparison operator
('timestamp', 'timestamp', '<='),
('timestamp', 'start_date', '>='),
('customerID', 'customerID', '=='),
how='right')
# drop irrelevant columns
.drop(columns=[('left', 'customerID'),
('left', 'timestamp'),
('right', 'start_date')])
# return to single index
.droplevel(0,1)
.groupby(['customerID', 'customerSeg', 'timestamp'])
.sum()
)
service phonecall
customerID customerSeg timestamp
1 1 1999-01-01 0.0 0.0
2 2000-01-01 1.0 1.0
2000-06-01 2.0 2.0
2 1 2001-01-01 1.0 1.0
2 2003-01-01 1.0 2.0
3 3 1999-01-01 0.0 0.0
4 3 2005-01-01 1.0 1.0
2008-01-01 1.0 0.0
I have a Pandas dataframe in the following format:
id name timestamp time_diff <=30min
1 movie3 2009-05-04 18:00:00+00:00 NaN False
1 movie5 2009-05-05 18:15:00+00:00 00:15:00 True
1 movie1 2009-05-05 22:00:00+00:00 03:45:00 False
2 movie7 2009-05-04 09:30:00+00:00 NaN False
2 movie8 2009-05-05 12:00:00+00:00 02:30:00 False
3 movie1 2009-05-04 17:45:00+00:00 NaN False
3 movie7 2009-05-04 18:15:00+00:00 00:30:00 True
3 movie6 2009-05-04 18:30:00+00:00 00:15:00 True
3 movie6 2009-05-04 19:00:00+00:00 00:30:00 True
4 movie1 2009-05-05 12:45:00+00:00 NaN False
5 movie7 2009-05-04 11:00:00+00:00 NaN False
5 movie8 2009-05-04 11:15:00+00:00 00:15:00 True
The data shows the movies watched on a video streaming platform. id is the user id, name is the name of the movie and timestamp is the time at which the movie started. <=30min indicates whether the user started the movie within 30 minutes of the previous movie they watched.
A movie session consists of one or more movies played by a single user, where each movie started within 30 minutes of the previous movie's start time (basically, a session is a run of consecutive rows in which df['<=30min'] == True).
The length of a session is defined as the timestamp of the last consecutive row with df['<=30min'] == True minus the timestamp of the first True row of the session.
How can I find the 3 longest sessions (in minutes) in the data, and the movies played during the sessions?
As a first step, I have tried something like this:
df.groupby((df['<=30min'] == False).cumsum())['time_diff'].fillna(pd.Timedelta(seconds=0)).cumsum()
But it doesn't work (the cumsum does not reset when df['<=30min'] is False), and it looks very slow.
Also, I think it would make my life harder when I have to select the 3 longest sessions, since I could get multiple values for the same session, and more than one of them could end up among the top 3.
Not sure I understood you correctly. If I did, then this may work.
Coerce timestamp to datetime:
df['timestamp']=pd.to_datetime(df['timestamp'])
Filter for the True values, which indicate consecutive watching. Group by id while calculating the difference between the maximum and minimum timestamp. The result is then joined to the main df:
df.join(df[df['<=30min']==True].groupby('id')['timestamp'].transform(lambda x:x.max()-x.min()).to_frame().rename(columns={'timestamp':'Max'}))
id name timestamp time_diff <=30min Max
0 1 movie3 2009-05-04 18:00:00+00:00 NaN False NaT
1 1 movie5 2009-05-05 18:15:00+00:00 00:15:00 True 00:00:00
2 1 movie1 2009-05-05 22:00:00+00:00 03:45:00 False NaT
3 2 movie7 2009-05-04 09:30:00+00:00 NaN False NaT
4 2 movie8 2009-05-05 12:00:00+00:00 02:30:00 False NaT
5 3 movie1 2009-05-04 17:45:00+00:00 NaN False NaT
6 3 movie7 2009-05-04 18:15:00+00:00 00:30:00 True 00:45:00
7 3 movie6 2009-05-04 18:30:00+00:00 00:15:00 True 00:45:00
8 3 movie6 2009-05-04 19:00:00+00:00 00:30:00 True 00:45:00
9 4 movie1 2009-05-05 12:45:00+00:00 NaN False NaT
10 5 movie7 2009-05-04 11:00:00+00:00 NaN False NaT
11 5 movie8 2009-05-04 11:15:00+00:00 00:15:00 True 00:00:00
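Note that grouping by id alone would merge two separate sessions of the same user into one. A minimal sketch of one way to label each session separately and pick the three longest, following the question's definition (length measured from the first to the last True row). The sample rows are illustrative, with the dates for id 1 adjusted so the gaps are consistent, and the names `within`, `session_id` and `top3` are my own:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 1, 3, 3, 3, 3, 5, 5],
    "name": ["movie3", "movie5", "movie1",
             "movie1", "movie7", "movie6", "movie6",
             "movie7", "movie8"],
    "timestamp": pd.to_datetime([
        "2009-05-04 18:00", "2009-05-04 18:15", "2009-05-04 22:00",
        "2009-05-04 17:45", "2009-05-04 18:15", "2009-05-04 18:30",
        "2009-05-04 19:00",
        "2009-05-04 11:00", "2009-05-04 11:15",
    ], utc=True),
}).sort_values(["id", "timestamp"])

# Gap to the previous movie of the same user (NaT for a user's first movie)
gap = df.groupby("id")["timestamp"].diff()
within = gap <= pd.Timedelta(minutes=30)  # NaT compares as False

# Every False row opens a new block, so the running count of False rows
# gives each run of True rows its own label
session_id = (~within).cumsum().rename("session")

# Keep only the True rows, as in the question's definition of a session
sessions = (df[within]
            .groupby(session_id[within])
            .agg(user=("id", "first"),
                 start=("timestamp", "min"),
                 end=("timestamp", "max"),
                 movies=("name", list)))
sessions["minutes"] = (sessions["end"] - sessions["start"]).dt.total_seconds() / 60

top3 = sessions.nlargest(3, "minutes")
print(top3)
```

Dropping the `[within]` filter would count the movie that opens each session as part of it, which may be closer to what a "session" means in practice.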
I am trying to find a way to calculate an inverse cumsum in pandas, i.e. applying cumsum from bottom to top. The problem I'm facing is that I'm trying to number the workable days of each month for Spain both from top to bottom (1st workable day = 1, 2nd = 2, 3rd = 3, etc.) and from bottom to top (last workable day = 1, day before last = 2, etc.).
So far I have managed to get the top-to-bottom order to work, but I can't get the inverse order to work. I've searched a lot and couldn't find a way to perform an inverse cumulative sum:
import pandas as pd
from datetime import date
from workalendar.europe import Spain
import numpy as np
cal = Spain()
#print(cal.holidays(2019))
rng = pd.date_range('2019-01-01', periods=365, freq='D')
df = pd.DataFrame({ 'Date': rng})
df['flag_workable'] = df['Date'].apply(lambda x: cal.is_working_day(x))
df_workable = df[df['flag_workable'] == True].copy()  # copy so the new columns below do not trigger SettingWithCopyWarning
df_workable['month'] = df_workable['Date'].dt.month
df_workable['workable_day'] = df_workable.groupby('month')['flag_workable'].cumsum()
print(df)
print(df_workable.head(30))
Output for January:
Date flag_workable month workable_day
1 2019-01-02 True 1 1.0
2 2019-01-03 True 1 2.0
3 2019-01-04 True 1 3.0
6 2019-01-07 True 1 4.0
7 2019-01-08 True 1 5.0
Example for last days of January:
Date flag_workable month workable_day
24 2019-01-25 True 1 18.0
27 2019-01-28 True 1 19.0
28 2019-01-29 True 1 20.0
29 2019-01-30 True 1 21.0
30 2019-01-31 True 1 22.0
This would be the expected output after applying the inverse cumulative sum:
Date flag_workable month workable_day inv_workable_day
1 2019-01-02 True 1 1.0 22.0
2 2019-01-03 True 1 2.0 21.0
3 2019-01-04 True 1 3.0 20.0
6 2019-01-07 True 1 4.0 19.0
7 2019-01-08 True 1 5.0 18.0
Last days of January:
Date flag_workable month workable_day inv_workable_day
24 2019-01-25 True 1 18.0 5.0
27 2019-01-28 True 1 19.0 4.0
28 2019-01-29 True 1 20.0 3.0
29 2019-01-30 True 1 21.0 2.0
30 2019-01-31 True 1 22.0 1.0
Invert the row order of the DataFrame prior to grouping so that the cumsum is calculated in reverse order within each month.
df['inv_workable_day'] = df[::-1].groupby('month')['flag_workable'].cumsum()
df['workable_day'] = df.groupby('month')['flag_workable'].cumsum()
# Date flag_workable month inv_workable_day workable_day
#1 2019-01-02 True 1 5.0 1.0
#2 2019-01-03 True 1 4.0 2.0
#3 2019-01-04 True 1 3.0 3.0
#6 2019-01-07 True 1 2.0 4.0
#7 2019-01-08 True 1 1.0 5.0
#8 2019-02-01 True 2 1.0 1.0
Solution
Whichever column you want to apply cumsum to, you have two options:
1. Sort a copy of that column by index in descending order, apply cumsum, then sort it back into ascending index order. Finally, assign it back to the data frame column.
2. Use numpy:
import numpy as np
array = df.column_data.to_numpy()
array = np.flip(array) # to flip the order
array = np.cumsum(array)
array = np.flip(array) # to flip back to original order
df['column_data_cumsum'] = array  # bracket assignment is needed to create a new column
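A minimal, self-contained sketch of the reversed group-wise cumsum, assuming plain business days from `pd.bdate_range` as a stand-in for the Spanish calendar (so public holidays are not excluded here). Because every remaining row is a workable day, `cumcount(ascending=False) + 1` gives the same numbering and is shown for comparison:

```python
import pandas as pd

# Stand-in for the filtered workable days: Mon-Fri only, holidays ignored
days = pd.DataFrame({"Date": pd.bdate_range("2019-01-01", "2019-02-28")})
days["month"] = days["Date"].dt.month
days["flag_workable"] = 1

# Top to bottom: 1, 2, 3, ... within each month
days["workable_day"] = days.groupby("month")["flag_workable"].cumsum()

# Bottom to top: reverse the rows, cumsum per month, and let index
# alignment put the values back on the original rows
days["inv_workable_day"] = days[::-1].groupby("month")["flag_workable"].cumsum()

# Same numbering via cumcount, counting down to the end of each group
days["inv_count"] = days.groupby("month").cumcount(ascending=False) + 1

print(days.head())
print(days[days["month"] == 1].tail())
```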
I'm struggling with a simple pandas algo for stock market trading. Nothing serious or complicated, I just want to learn how to do it in python.
What I am trying to do is:
When signal turns True, buy stocks and pay for them with the cash (so cash goes down and the stock position goes up).
When signal turns False, sell the stocks and add the proceeds to the cash.
But I can't get this to work. I could get this to work with looping, but that would be too time consuming. Any suggestions?
## data set
close=[21.02,21.05,21.10,21.22, 22.17,22.13,22.07]
signal=[False,True,True,True,False,True,True]
data={'close':close, 'signal':signal}
df=pd.DataFrame.from_dict(data)
df['cash']=1000
df['trade']=0
df['pos']=0
## if signal turns True, buy stocks
buysubset = ((df.signal==True) & (df.signal.shift(1)==False))
sellsubset = ((df.signal==False) & df.signal.shift(1)==True)
df.loc[buysubset,'trade']=(df.cash/df.close).astype(int)
df.loc[buysubset,'cash']=df.cash-(df.trade*df.close)
df.loc[sellsubset,'trade']=-df.pos.shift(1)
## if previous row has position, keep the position if the signal is still True
df['pos']=df.trade.mask(df.signal & (df.trade == 0)).ffill().astype(int)
I get this as a result:
close signal cash trade pos
0 21.02 False 1000.00 0.0 0
1 21.05 True 10.65 47.0 47
2 21.10 True 1000.00 0.0 47
3 21.22 True 1000.00 0.0 47
4 22.17 False 1000.00 -0.0 0
5 23.34 True 4.15 45.0 45
But would like to get this :
close signal cash trade pos
0 21.02 False 1000.00 0.0 0
1 21.05 True 10.65 47.0 47
2 21.10 True 10.65 0.0 47
3 21.22 True 10.65 0.0 47
4 22.17 False 1052.62 -47.0 0
5 23.34 True 2.57 45.0 45
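Because cash and position on each row depend on the trades made on earlier rows, the bookkeeping is inherently sequential. For reference, a minimal sketch of the plain loop that the vectorised code is trying to reproduce, using the close and signal lists from the data set above. It assumes a starting cash of 1000, whole shares only, no fees, and that "signal turns True" can be approximated by "signal is True while no position is held"; all variable names are my own:

```python
import pandas as pd

df = pd.DataFrame({
    "close": [21.02, 21.05, 21.10, 21.22, 22.17, 22.13, 22.07],
    "signal": [False, True, True, True, False, True, True],
})

cash, pos = 1000.0, 0
rows = []
for close, signal in zip(df["close"], df["signal"]):
    trade = 0
    if signal and pos == 0:
        # Buy as many whole shares as the cash allows
        trade = int(cash // close)
    elif not signal and pos > 0:
        # Signal is off: sell the whole position
        trade = -pos
    cash -= trade * close  # buying lowers cash, selling raises it
    pos += trade
    rows.append((round(cash, 2), trade, pos))

book = pd.DataFrame(rows, columns=["cash", "trade", "pos"], index=df.index)
result = df.join(book)
print(result)
```

For a few thousand rows a loop like this is usually fast enough, and it can serve as a reference to check any vectorised version against.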
df.groupby([df.index.month, df.index.day])[vars_rs].transform(lambda y: y.fillna(y.median()))
I am filling missing values in a dataframe with median values from climatology. The days range from Jan 1 2010 to Dec 31st 2016. However, I only want to fill in missing values for days before current date (say Oct 1st 2016). How do I modify the statement?
The algorithm would be:
Get a part of the data frame which contains only rows filtered by date with a boolean mask
Perform required replacements on it
Append the rest of the initial data frame to the end of the resulting data frame.
Dummy data:
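import numpy as np
import pandas as pd
# month-end frequency; on pandas >= 2.2 use freq='ME' instead of 'M'
df = pd.DataFrame(np.zeros((5, 2)),columns=['A', 'B'],index=pd.date_range('2000',periods=5,freq='M'))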
A B
2000-01-31 0.0 0.0
2000-02-29 0.0 0.0
2000-03-31 0.0 0.0
2000-04-30 0.0 0.0
2000-05-31 0.0 0.0
The code
vars_rs = ['A', 'B']
mask = df.index < '2000-03-31'
early = df[mask]
early = early.groupby([early.index.month, early.index.day])[vars_rs].transform(lambda y: y.replace(0.0, 1)) # replace with your code
result = pd.concat([early, df[~mask]])  # DataFrame.append was removed in pandas 2.0
So the result is
A B
2000-01-31 1.0 1.0
2000-02-29 1.0 1.0
2000-03-31 0.0 0.0
2000-04-30 0.0 0.0
2000-05-31 0.0 0.0
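As a variation on the split-and-recombine steps above, the filled values can be written back in place with a boolean mask, which keeps the original row order without any concatenation. A minimal sketch on made-up daily data; the cutoff date, the column name and the sample values are assumptions for illustration, and the medians are computed only from rows before the cutoff, as in the code above:

```python
import numpy as np
import pandas as pd

# Made-up daily data for 2010-2012: each value is simply the year
idx = pd.date_range("2010-01-01", "2012-12-31", freq="D")
df = pd.DataFrame({"A": np.asarray(idx.year, dtype=float)}, index=idx)

# One gap before the cutoff and one after it
df.loc[pd.Timestamp("2011-03-01"), "A"] = np.nan
df.loc[pd.Timestamp("2012-03-01"), "A"] = np.nan

vars_rs = ["A"]
cutoff = pd.Timestamp("2012-01-01")
before = df.index < cutoff  # boolean mask over the rows

# Fill gaps with the per-calendar-day median, using early rows only
early = df.loc[before, vars_rs]
filled = (early
          .groupby([early.index.month, early.index.day])
          .transform(lambda y: y.fillna(y.median())))

# Write back only the rows before the cutoff; later rows stay untouched
df.loc[before, vars_rs] = filled

print(df.loc[[pd.Timestamp("2011-03-01"), pd.Timestamp("2012-03-01")]])
```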
Use np.where, example:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':['a','a','b','b','c','c'],'B':[1,2,3,4,5,6],'C':[1,np.nan,np.nan,np.nan,np.nan,np.nan]})
# .ix was removed in pandas 1.0, so select the column directly
df['C'] = np.where((df.A != 'c')&(df.B < 4)&(pd.isnull(df.C)),-99,df['C'])
This way you can modify the desired column directly, using boolean expressions that may involve any of the columns.
Original dataframe:
A B C
0 a 1 1.0
1 a 2 NaN
2 b 3 NaN
3 b 4 NaN
4 c 5 NaN
5 c 6 NaN
Modified dataframe:
A B C
0 a 1 1.0
1 a 2 -99.0
2 b 3 -99.0
3 b 4 NaN
4 c 5 NaN
5 c 6 NaN
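The same conditional update can also be written as an assignment through a boolean mask with `.loc`, which only touches the selected rows. A minimal sketch on the same sample frame (the variable name `mask` is my own):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": ["a", "a", "b", "b", "c", "c"],
    "B": [1, 2, 3, 4, 5, 6],
    "C": [1, np.nan, np.nan, np.nan, np.nan, np.nan],
})

# Rows where A is not 'c', B is below 4 and C is missing
mask = (df["A"] != "c") & (df["B"] < 4) & df["C"].isna()

# Assign only to the selected rows of column C
df.loc[mask, "C"] = -99
print(df)
```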