Slicing a Pandas df based on column values - python

How would you do the following?
For customers with churn=1, take the average of their last 3 purchases up to and including the month they leave. E.g. if Churn_Month=3, then average the last 3 purchases from Mar, Feb and Jan where available. Sometimes there will be only 2 or 1 purchases.
For customers with churn=0, take the average of their last 3 purchases where available; again, sometimes there will be only 2 or 1 purchases.
Then put it all in one pandas DataFrame. See Expected Output.
Available information
df1: Here you'll find transactions with customer id, date, purchase1 and purchase2.
ID DATE P1 P2
0 1 2003-04-01 449 55
1 4 2003-02-01 406 213
2 3 2003-11-01 332 372
3 1 2003-03-01 61 336
4 3 2003-10-01 428 247
5 3 2003-12-01 335 339
6 3 2003-09-01 367 41
7 2 2003-01-01 11 270
8 1 2003-01-01 55 102
9 2 2003-02-01 244 500
10 1 2003-02-01 456 272
11 5 2003-03-01 240 180
12 4 2002-12-01 156 152
13 5 2003-01-01 144 185
14 4 2003-01-01 246 428
15 1 2003-05-01 492 97
16 5 2003-02-01 371 66
17 5 2003-04-01 246 428
18 5 2003-05-01 406 213
df2: Here you'll find the customer ID, whether they left the company or not, and the month when they left (e.g. 3.0 = March)
ID Churn Churn_Month
0 1 1 3.0
1 2 0 0.0
2 3 1 12.0
3 4 0 0.0
4 5 1 4.0
Expected Output:
Mean of P1 and P2 by ID, merged with df2 information. ID will be the new index.
ID P1 P2 Churn Churn_Month
1 190.6 236.6 1 3.0
2 127.5 385 0 0.0
3 365 319.3 1 12.0
4 269.3 264.3 0 0.0
5 285.6 224.6 1 4.0

Some extra details were necessary here. First, when Churn == 1, assume that the customer left. Using df2 you can determine which month they left and remove any data that occurred after it. From there it's pretty straightforward in terms of grouping, aggregating, and filtering the data.
import pandas as pd

# merge df1 with df2 on the shared ID column
df3 = df1.merge(df2)
# convert DATE to datetime
df3.DATE = pd.to_datetime(df3.DATE)
# filter rows where month of (DATE is <= Churn_Month and Churn == 1)
# or Churn == 0
df3 = df3.loc[
((df3.Churn == 1) & (df3.DATE.dt.month <= df3.Churn_Month)) |
(df3.Churn == 0)
].copy()
# sort values ascending
df3.sort_values([
'ID',
'DATE',
'P1',
'P2',
'Churn',
'Churn_Month'
], inplace=True)
# groupby ID, Churn
# take last 3 DATEs
# merge with original to filter rows
# group on ID, Churn, and Churn_Month
# average P1 and P2
# reset_index to get columns back
# round results to 1 decimal at the end
df3.groupby([
'ID',
'Churn'
]).DATE.nth([
-1, -2, -3
]).reset_index().merge(df3).groupby([
'ID',
'Churn',
'Churn_Month'
])[[
'P1',
'P2'
]].mean().reset_index().round(1)
Results
ID Churn Churn_Month P1 P2
0 1 1 3.0 190.7 236.7
1 2 0 0.0 127.5 385.0
2 3 1 12.0 365.0 319.3
3 4 0 0.0 269.3 264.3
4 5 1 4.0 285.7 224.7
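If you want the result shaped exactly like the expected output (ID as the index, with P1 and P2 before the churn columns), a small reindexing step can follow. A minimal sketch, where `res` stands for the DataFrame produced by the chain above:

```python
import pandas as pd

# hypothetical result of the groupby/mean step above
res = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5],
    'Churn': [1, 0, 1, 0, 1],
    'Churn_Month': [3.0, 0.0, 12.0, 0.0, 4.0],
    'P1': [190.7, 127.5, 365.0, 269.3, 285.7],
    'P2': [236.7, 385.0, 319.3, 264.3, 224.7],
})
# set ID as the index and reorder columns to match the expected output
out = res.set_index('ID')[['P1', 'P2', 'Churn', 'Churn_Month']]
```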

Related

Add a portion of a dataframe to another dataframe

Suppose you have two dataframes, df1 and df2, with an equal number of columns but a different number of rows, e.g:
df1 = pd.DataFrame([(1,2),(3,4),(5,6),(7,8),(9,10),(11,12)], columns=['a','b'])
a b
1 1 2
2 3 4
3 5 6
4 7 8
5 9 10
6 11 12
df2 = pd.DataFrame([(100,200),(300,400),(500,600)], columns=['a','b'])
a b
1 100 200
2 300 400
3 500 600
I would like to add df2 to the df1 tail (df1.loc[df2.shape[0]:]), thus obtaining:
a b
1 1 2
2 3 4
3 5 6
4 107 208
5 309 410
6 511 612
Any idea?
Thanks!
If df1 has at least as many rows as df2, it's possible to use DataFrame.iloc and convert the values to a NumPy array to avoid index alignment (different indices would otherwise create NaNs):
df1.iloc[-df2.shape[0]:] += df2.to_numpy()
print (df1)
a b
0 1 2
1 3 4
2 5 6
3 107 208
4 309 410
5 511 612
For a general solution working with any number of rows (with unique indices in both DataFrames), use rename together with DataFrame.add:
df = df1.add(df2.rename(dict(zip(df2.index[::-1], df1.index[::-1]))), fill_value=0)
print (df)
a b
0 1.0 2.0
1 3.0 4.0
2 5.0 6.0
3 107.0 208.0
4 309.0 410.0
5 511.0 612.0
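The rename trick may be easier to follow when the mapping it builds is spelled out. A sketch with the same data: zipping both indices reversed pairs df2's last row with df1's last row, so df2 is relabeled onto df1's tail before the aligned add:

```python
import pandas as pd

df1 = pd.DataFrame([(1, 2), (3, 4), (5, 6), (7, 8), (9, 10), (11, 12)],
                   columns=['a', 'b'])
df2 = pd.DataFrame([(100, 200), (300, 400), (500, 600)], columns=['a', 'b'])

# pair reversed indices: df2's index 2 maps to df1's index 5, 1 -> 4, 0 -> 3
mapping = dict(zip(df2.index[::-1], df1.index[::-1]))

# relabel df2 onto df1's tail, then add with alignment; fill_value=0
# keeps df1's untouched head rows unchanged
df = df1.add(df2.rename(mapping), fill_value=0)
```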

How to filter a dataframe and identify records based on a condition on multiple other columns

id zone price
0 0000001 1 33.0
1 0000001 2 24.0
2 0000001 3 34.0
3 0000001 4 45.0
4 0000001 5 51.0
I have the above pandas dataframe; there are multiple ids (only 1 id is shown here). The dataframe consists of ids, each with 5 zones and 5 prices. These prices should follow the pattern below:
p1 (price of zone 1) < p2 < p3 < p4 < p5
If anything is out of order, we should identify and print the anomalous records to a file.
Here in this example p3 < p4 < p5, but p1 and p2 are erroneous (p1 > p2, whereas p1 < p2 is expected).
Therefore the first 2 records should be printed to a file.
Likewise this has to be done across the entire dataframe for all unique ids in it.
My dataframe is huge; what is the most efficient way to do this filtering and identify erroneous records?
You can compute the diff per group after sorting the values to ensure the zones are increasing. If the diff is ≤ 0 the price is not strictly increasing and the rows should be flagged:
s = (df.sort_values(by=['id', 'zone']) # sort rows
.groupby('id') # group by id
['price'].diff() # compute the diff
.le(0) # flag those ≤ 0 (not increasing)
)
df[s|s.shift(-1)] # slice flagged rows + previous row
Example output:
id zone price
0 1 1 33.0
1 1 2 24.0
Example input:
id zone price
0 1 1 33.0
1 1 2 24.0
2 1 3 34.0
3 1 4 45.0
4 1 5 51.0
5 2 1 20.0
6 2 2 24.0
7 2 3 34.0
8 2 4 45.0
9 2 5 51.0
Saving to a file:
df[s|s.shift(-1)].to_csv('incorrect_prices.csv')
Another way would be to first sort your dataframe by id and zone in ascending order, then compare each price with the previous one using groupby.shift(), creating a new column. Then you can just print out the prices that have fallen in value:
import numpy as np
import pandas as pd

# note: sort_values returns a copy, so assign it back
df = df.sort_values(by=['id', 'zone'], ascending=True)
df['increase'] = np.where(df.zone.eq(1), 'no change',
                          np.where(df.groupby('id')['price'].shift(1) < df['price'], 'inc', 'dec'))
>>> df
id zone price increase
0 1 1 33 no change
1 1 2 24 dec
2 1 3 34 inc
3 1 4 45 inc
4 1 5 51 inc
5 2 1 34 no change
6 2 2 56 inc
7 2 3 22 dec
8 2 4 55 inc
9 2 5 77 inc
10 3 1 44 no change
11 3 2 55 inc
12 3 3 44 dec
13 3 4 66 inc
14 3 5 33 dec
>>> df.loc[df.increase.eq('dec')]
id zone price increase
1 1 2 24 dec
7 2 3 22 dec
12 3 3 44 dec
14 3 5 33 dec
I have added some extra IDs to try and mimic your real data.

Pandas DataFrame Change Values Based on Values in Different Rows

I have a DataFrame of store sales for 1115 stores with dates over about 2.5 years. The StateHoliday column is a categorical variable indicating the type of holiday it is. See the piece of the df below. As can be seen, b is the code for Easter. There are other codes for other holidays.
[Piece of DF: screenshot not included]
My objective is to analyze sales before and during a holiday. The way I seek to do this is to change the value of the StateHoliday column to something unique for the few days before a particular holiday. For example, b is the code for Easter, so I could change the value to b- indicating that the day is shortly before Easter. The only way I can think to do this is to go through and manually change these values for certain dates. There aren't THAT many holidays, so it wouldn't be that hard to do. But still very annoying!
Tom, see if this works for you; if not, please provide additional information:
In the file I have the following data:
Store,Sales,Date,StateHoliday
1,6729,2013-03-25,0
1,6686,2013-03-26,0
1,6660,2013-03-27,0
1,7285,2013-03-28,0
1,6729,2013-03-29,b
1115,10712,2015-07-01,0
1115,11110,2015-07-02,0
1115,10500,2015-07-03,0
1115,12000,2015-07-04,c
import pandas as pd
fname = r"D:\workspace\projects\misc\data\holiday_sales.csv"
df = pd.read_csv(fname)
df["Date"] = pd.to_datetime(df["Date"])
holidays = df[df["StateHoliday"]!="0"].copy(deep=True) # taking only holidays
dictDate2Holiday = dict(zip(holidays["Date"].tolist(), holidays["StateHoliday"].tolist()))
look_back = 2 # how many days back you want to go
holiday_look_back = []
# building a list of pairs (prev days, holiday code)
for dt, h in dictDate2Holiday.items():
prev = dt
holiday_look_back.append((prev, h))
for i in range(1, look_back+1):
prev = prev - pd.Timedelta(days=1)
holiday_look_back.append((prev, h))
dfHolidayLookBack = pd.DataFrame(holiday_look_back, columns=["Date", "StateHolidayNew"])
df = df.merge(dfHolidayLookBack, how="left", on="Date")
df["StateHolidayNew"].fillna("0", inplace=True)
print(df)
The column StateHolidayNew should have the info you need to start analyzing your data.
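The same look-back idea can be tried without the local CSV path by reading the sample rows from an in-memory buffer. A self-contained sketch (the `pairs`/`lookup` names are mine, not from the original answer):

```python
import io
import pandas as pd

csv = io.StringIO("""Store,Sales,Date,StateHoliday
1,6729,2013-03-25,0
1,6686,2013-03-26,0
1,6660,2013-03-27,0
1,7285,2013-03-28,0
1,6729,2013-03-29,b
""")
df = pd.read_csv(csv, dtype={"StateHoliday": str})
df["Date"] = pd.to_datetime(df["Date"])

look_back = 2  # how many days before the holiday to flag
holidays = df[df["StateHoliday"] != "0"]

# build (date, holiday code) pairs covering each holiday and the
# look_back days immediately before it
pairs = []
for dt, h in zip(holidays["Date"], holidays["StateHoliday"]):
    for i in range(look_back + 1):
        pairs.append((dt - pd.Timedelta(days=i), h))

lookup = pd.DataFrame(pairs, columns=["Date", "StateHolidayNew"])
df = df.merge(lookup, how="left", on="Date")
df["StateHolidayNew"] = df["StateHolidayNew"].fillna("0")
```

With look_back = 2, the rows for 2013-03-27 and 2013-03-28 pick up the Easter code "b" alongside the holiday itself.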
Assuming you have a dataframe like this:
Store Sales Date StateHoliday
0 2 4205 2016-11-15 0
1 1 684 2016-07-13 0
2 2 8946 2017-04-15 0
3 1 6929 2017-02-02 0
4 2 8296 2017-10-30 b
5 1 8261 2015-10-05 0
6 2 3904 2016-08-22 0
7 1 2613 2017-12-30 0
8 2 1324 2016-08-23 0
9 1 6961 2015-11-11 0
10 2 15 2016-12-06 a
11 1 9107 2016-07-05 0
12 2 1138 2015-03-29 0
13 1 7590 2015-06-24 0
14 2 5172 2017-04-29 0
15 1 660 2016-06-21 0
16 2 2539 2017-04-25 0
What you can do is group the values between the different letters which represent the holidays, and then use groupby to find the sales for each group. An improvement would be to backfill the holiday code onto the groups that precede it, e.g. group 0.0 would become b_0, which would make it easier to understand which holiday each group leads up to, but I am not sure how to do that.
df['StateHolidayBool'] = df['StateHoliday'].str.isalpha().fillna(False).replace({False: 0, True: 1})
df = df.assign(group = (df[~df['StateHolidayBool'].between(1,1)].index.to_series().diff() > 1).cumsum())
df = df.assign(groups = np.where(df.group.notna(), df.group, df.StateHoliday)).drop(['StateHolidayBool', 'group'], axis=1)
df[~df['groups'].str.isalpha().fillna(False)].groupby('groups').sum()
Output:
Store Sales
groups
0.0 6 20764
1.0 7 23063
2.0 9 26206
Final DataFrame:
Store Sales Date StateHoliday groups
0 2 4205 2016-11-15 0 0.0
1 1 684 2016-07-13 0 0.0
2 2 8946 2017-04-15 0 0.0
3 1 6929 2017-02-02 0 0.0
4 2 8296 2017-10-30 b b
5 1 8261 2015-10-05 0 1.0
6 2 3904 2016-08-22 0 1.0
7 1 2613 2017-12-30 0 1.0
8 2 1324 2016-08-23 0 1.0
9 1 6961 2015-11-11 0 1.0
10 2 15 2016-12-06 a a
11 1 9107 2016-07-05 0 2.0
12 2 1138 2015-03-29 0 2.0
13 1 7590 2015-06-24 0 2.0
14 2 5172 2017-04-29 0 2.0
15 1 660 2016-06-21 0 2.0
16 2 2539 2017-04-25 0 2.0
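On the open question of labeling each group by the holiday that follows it: one possible sketch is to keep only the holiday letters and backfill them, so every row before a holiday inherits that holiday's code (the `next_holiday` column name is made up here, and the sample data is abbreviated):

```python
import pandas as pd

df = pd.DataFrame({
    'Date': pd.to_datetime(['2017-10-27', '2017-10-28', '2017-10-29',
                            '2017-10-30', '2017-12-04', '2017-12-05',
                            '2017-12-06']),
    'StateHoliday': ['0', '0', '0', 'b', '0', '0', 'a'],
})
# keep only the holiday letters (NaN elsewhere), then backfill so each
# pre-holiday stretch carries the code of the upcoming holiday
hol = df['StateHoliday'].where(df['StateHoliday'].str.isalpha())
df['next_holiday'] = hol.bfill()
```

The backfilled column could then be combined with the numeric group labels above to produce names like b_0.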

Drop rows after maximum value in a grouped Pandas dataframe

I've got a date-ordered dataframe that can be grouped. What I am attempting to do is groupby a variable (Person), determine the maximum (weight) for each group (person), and then drop all rows that come after (date) the maximum.
Here's an example of the data:
df = pd.DataFrame({'Person': [1,1,1,1,1,2,2,2,2,2], 'Date': ['1/1/2015','2/1/2015','3/1/2015','4/1/2015','5/1/2015','6/1/2011','7/1/2011','8/1/2011','9/1/2011','10/1/2011'], 'MonthNo': [1,2,3,4,5,1,2,3,4,5], 'Weight': [100,110,115,112,108,205,210,211,215,206]})
Date MonthNo Person Weight
0 1/1/2015 1 1 100
1 2/1/2015 2 1 110
2 3/1/2015 3 1 115
3 4/1/2015 4 1 112
4 5/1/2015 5 1 108
5 6/1/2011 1 2 205
6 7/1/2011 2 2 210
7 8/1/2011 3 2 211
8 9/1/2011 4 2 215
9 10/1/2011 5 2 206
Here's what I want the result to look like:
Date MonthNo Person Weight
0 1/1/2015 1 1 100
1 2/1/2015 2 1 110
2 3/1/2015 3 1 115
5 6/1/2011 1 2 205
6 7/1/2011 2 2 210
7 8/1/2011 3 2 211
8 9/1/2011 4 2 215
I think it's worth noting there can be disjoint start dates, and the maximum may appear at different times.
My idea was to find the maximum for each group, obtain the MonthNo the maximum was in for that group, and then discard any rows with MonthNo greater than the max-weight MonthNo. So far I've been able to obtain the max by group, but cannot get past doing a comparison based on that.
Please let me know if I can edit/provide more information; I haven't posted many questions here! Thanks for the help, and sorry if my formatting/question isn't clear.
Using idxmax with groupby:
df.groupby('Person', sort=False).apply(
    lambda x: x.reset_index(drop=True).iloc[:x.reset_index(drop=True).Weight.idxmax() + 1, :]
)
Out[131]:
Date MonthNo Person Weight
Person
1 0 1/1/2015 1 1 100
1 2/1/2015 2 1 110
2 3/1/2015 3 1 115
2 0 6/1/2011 1 2 205
1 7/1/2011 2 2 210
2 8/1/2011 3 2 211
3 9/1/2011 4 2 215
You can use groupby.transform with idxmax. The first 2 steps may not be necessary depending on how your dataframe is structured.
# convert Date to datetime
df['Date'] = pd.to_datetime(df['Date'])
# sort by Person and Date to make index usable for next step
df = df.sort_values(['Person', 'Date']).reset_index(drop=True)
# filter for index less than idxmax transformed by group
df = df[df.index <= df.groupby('Person')['Weight'].transform('idxmax')]
print(df)
Date MonthNo Person Weight
0 2015-01-01 1 1 100
1 2015-02-01 2 1 110
2 2015-03-01 3 1 115
5 2011-06-01 1 2 205
6 2011-07-01 2 2 210
7 2011-08-01 3 2 211
8 2011-09-01 4 2 215
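As another possible approach, not from the answers above, the row's position within its group can be compared with the position of the group maximum, which avoids relying on the index entirely. A sketch under the assumption that rows are already date-ordered within each Person:

```python
import pandas as pd

df = pd.DataFrame({
    'Person': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'Date': ['1/1/2015', '2/1/2015', '3/1/2015', '4/1/2015', '5/1/2015',
             '6/1/2011', '7/1/2011', '8/1/2011', '9/1/2011', '10/1/2011'],
    'MonthNo': [1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
    'Weight': [100, 110, 115, 112, 108, 205, 210, 211, 215, 206],
})
# position of each row within its Person group...
pos = df.groupby('Person').cumcount()
# ...and the (broadcast) position of the group's maximum weight
max_pos = df.groupby('Person')['Weight'].transform(lambda s: s.to_numpy().argmax())
# keep rows up to and including the group maximum
out = df[pos <= max_pos]
```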

Pandas: Add new column based on comparison of two DFs

I have 2 dataframes that I am wanting to compare one to the other and add a 'True/False' to a new column in the first based on the comparison.
My data resembles:
DF1:
cat sub-cat low high
3 3 1 208 223
4 3 1 224 350
8 4 1 223 244
9 4 1 245 350
13 5 1 232 252
14 5 1 253 350
DF2:
Cat Sub-Cat Rating
0 5 1 246
1 5 2 239
2 8 1 203
3 8 2 218
4 K 1 149
5 K 2 165
6 K 1 171
7 K 2 185
8 K 1 157
9 K 2 171
The desired result would be for DF2 to have an additional column with True or False depending on whether, based on the cat and sub-cat, the rating is between low.min() and high.max(), or Null if no matches are found to compare against.
Have been running rounds with this for far too long with no results to speak of.
Thank you in advance for any assistance.
Update:
First row would look something like:
Cat Sub-Cat Rating In-Spec
0 5 1 246 True
As it falls within the min low and the max high.
Example: There are two rows in DF1 for cat = 5 and sub-cat = 2. I need to get the minimum low and the maximum high from those 2 rows and then check if the rating from row 0 in DF2 falls within the minimum low and maximum high from the two matching rows in DF1
join after groupby.agg:
d2 = DF2.join(
DF1.groupby(
['cat', 'sub-cat']
).agg(dict(low='min', high='max')),
on=['Cat', 'Sub-Cat']
)
d2
Cat Sub-Cat Rating high low
0 5 1 246 350.0 232.0
1 5 2 239 NaN NaN
2 8 1 203 NaN NaN
3 8 2 218 NaN NaN
4 K 1 149 NaN NaN
5 K 2 165 NaN NaN
6 K 1 171 NaN NaN
7 K 2 185 NaN NaN
8 K 1 157 NaN NaN
9 K 2 171 NaN NaN
assign with .loc
DF2.loc[d2.eval('low <= Rating <= high'), 'In-Spec'] = True
DF2
Cat Sub-Cat Rating In-Spec
0 5 1 246 True
1 5 2 239 NaN
2 8 1 203 NaN
3 8 2 218 NaN
4 K 1 149 NaN
5 K 2 165 NaN
6 K 1 171 NaN
7 K 2 185 NaN
8 K 1 157 NaN
9 K 2 171 NaN
To add a new column based on a boolean expression would involve something along the lines of:
temp = boolean code involving inequality
df2['new column name'] = temp
However I'm not sure I understand, the first row in your DF2 table for instance, has a rating of 246, which means it's true for row 13 of DF1, but false for row 14. What would you like it to return?
You can do it like this (using .loc rather than chained indexing, so the assignment actually propagates):
df2['In-Spec'] = 'False'
df2.loc[(df2['Rating'] > df1['low']) & (df2['Rating'] < df1['high']), 'In-Spec'] = 'True'
But which rows should be compared with each others? Do you want them to compare by their index or by their cat & subcat names?
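Pulling the pieces of the accepted approach together, here is a self-contained sketch that produces True/False where a (Cat, Sub-Cat) match exists in DF1 and NaN otherwise (the sample data is trimmed from the question, and the `bounds` name is mine):

```python
import pandas as pd

DF1 = pd.DataFrame({
    'cat': [3, 3, 4, 4, 5, 5],
    'sub-cat': [1, 1, 1, 1, 1, 1],
    'low': [208, 224, 223, 245, 232, 253],
    'high': [223, 350, 244, 350, 252, 350],
})
DF2 = pd.DataFrame({
    'Cat': [5, 5, 8, 8],
    'Sub-Cat': [1, 2, 1, 2],
    'Rating': [246, 239, 203, 218],
})

# collapse DF1 to one (min low, max high) row per cat/sub-cat, then join
bounds = DF1.groupby(['cat', 'sub-cat']).agg(low=('low', 'min'),
                                             high=('high', 'max'))
d2 = DF2.join(bounds, on=['Cat', 'Sub-Cat'])

# True/False where bounds exist; mask rows with no match back to NaN
DF2['In-Spec'] = d2['Rating'].between(d2['low'], d2['high']).mask(d2['low'].isna())
```

Row 0 (Cat 5, Sub-Cat 1, Rating 246) falls inside [232, 350] and comes out True; the other rows have no matching bounds and stay NaN.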
