I've got a date-ordered dataframe that can be grouped. What I am attempting to do is group by a variable (Person), determine the maximum Weight for each group (person), and then drop all rows that come after (by Date) the maximum.
Here's an example of the data:
df = pd.DataFrame({'Person': [1,1,1,1,1,2,2,2,2,2], 'Date': ['1/1/2015','2/1/2015','3/1/2015','4/1/2015','5/1/2015','6/1/2011','7/1/2011','8/1/2011','9/1/2011','10/1/2011'], 'MonthNo': [1,2,3,4,5,1,2,3,4,5], 'Weight': [100,110,115,112,108,205,210,211,215,206]})
Date MonthNo Person Weight
0 1/1/2015 1 1 100
1 2/1/2015 2 1 110
2 3/1/2015 3 1 115
3 4/1/2015 4 1 112
4 5/1/2015 5 1 108
5 6/1/2011 1 2 205
6 7/1/2011 2 2 210
7 8/1/2011 3 2 211
8 9/1/2011 4 2 215
9 10/1/2011 5 2 206
Here's what I want the result to look like:
Date MonthNo Person Weight
0 1/1/2015 1 1 100
1 2/1/2015 2 1 110
2 3/1/2015 3 1 115
5 6/1/2011 1 2 205
6 7/1/2011 2 2 210
7 8/1/2011 3 2 211
8 9/1/2011 4 2 215
I think it's worth noting that there can be disjoint start dates and the maximum may appear at different times for each person.
My idea was to find the maximum for each group, obtain the MonthNo the maximum was in for that group, and then discard any rows with MonthNo greater than the max-weight MonthNo. So far I've been able to obtain the max by group, but cannot get past doing a comparison based on that.
Please let me know if I can edit/provide more information, haven't posted many questions here! Thanks for the help, sorry if my formatting/question isn't clear.
Using idxmax with groupby
df.groupby('Person',sort=False).apply(lambda x : x.reset_index(drop=True).iloc[:x.reset_index(drop=True).Weight.idxmax()+1,:])
Out[131]:
Date MonthNo Person Weight
Person
1 0 1/1/2015 1 1 100
1 2/1/2015 2 1 110
2 3/1/2015 3 1 115
2 0 6/1/2011 1 2 205
1 7/1/2011 2 2 210
2 8/1/2011 3 2 211
3 9/1/2011 4 2 215
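For what it's worth, the same rows can be selected without the double reset_index by taking a positional argmax inside the lambda; a sketch of the same idea (the group-level index labels will be the original ones rather than reset):
# keep each group's rows up to and including the positional maximum of Weight
df.groupby('Person', sort=False).apply(
    lambda x: x.iloc[:x['Weight'].values.argmax() + 1]
)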
You can use groupby.transform with idxmax. The first 2 steps may not be necessary depending on how your dataframe is structured.
# convert Date to datetime
df['Date'] = pd.to_datetime(df['Date'])
# sort by Person and Date to make index usable for next step
df = df.sort_values(['Person', 'Date']).reset_index(drop=True)
# filter for index less than idxmax transformed by group
df = df[df.index <= df.groupby('Person')['Weight'].transform('idxmax')]
print(df)
Date MonthNo Person Weight
0 2015-01-01 1 1 100
1 2015-02-01 2 1 110
2 2015-03-01 3 1 115
5 2011-06-01 1 2 205
6 2011-07-01 2 2 210
7 2011-08-01 3 2 211
8 2011-09-01 4 2 215
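The MonthNo-based approach described in the question works too; a minimal sketch, assuming the default RangeIndex from the example so that idxmax returns usable row labels:
# row label of the maximum Weight for each Person
max_rows = df.groupby('Person')['Weight'].idxmax()
# MonthNo in which each Person's maximum occurred
max_month = df.loc[max_rows, ['Person', 'MonthNo']].rename(columns={'MonthNo': 'MaxMonth'})
# keep only rows at or before that month
out = df.merge(max_month, on='Person')
out = out[out['MonthNo'] <= out['MaxMonth']].drop(columns='MaxMonth')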
Suppose you have two dataframes, df1 and df2, with an equal number of columns but a different number of rows, e.g.:
df1 = pd.DataFrame([(1,2),(3,4),(5,6),(7,8),(9,10),(11,12)], columns=['a','b'])
a b
1 1 2
2 3 4
3 5 6
4 7 8
5 9 10
6 11 12
df2 = pd.DataFrame([(100,200),(300,400),(500,600)], columns=['a','b'])
a b
1 100 200
2 300 400
3 500 600
I would like to add df2 to the tail of df1 (i.e. to df1.iloc[-df2.shape[0]:]), thus obtaining:
a b
1 1 2
2 3 4
3 5 6
4 107 208
5 309 410
6 511 612
Any idea?
Thanks!
If df1 has at least as many rows as df2, you can use DataFrame.iloc and convert df2's values to a NumPy array to avoid index alignment (different indices would create NaNs):
df1.iloc[-df2.shape[0]:] += df2.to_numpy()
print (df1)
a b
0 1 2
1 3 4
2 5 6
3 107 208
4 309 410
5 511 612
For a general solution working with any number of rows (as long as both DataFrames have unique indices), rename df2's index and use DataFrame.add:
df = df1.add(df2.rename(dict(zip(df2.index[::-1], df1.index[::-1]))), fill_value=0)
print (df)
a b
0 1.0 2.0
1 3.0 4.0
2 5.0 6.0
3 107.0 208.0
4 309.0 410.0
5 511.0 612.0
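To see what the rename is doing, the dict built by zip maps df2's index labels onto the last labels of df1, so the two frames line up row for row before adding:
mapping = dict(zip(df2.index[::-1], df1.index[::-1]))
print(mapping)
# {2: 5, 1: 4, 0: 3}
print(df2.rename(mapping))  # same data, now indexed 3, 4, 5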
I have a dataframe in the following format:
time  parameter  TimeDelta
1     123        -
2     456        1
4     122        2
7     344        3
8     344        1
How can I build an additional column with a label that changes once TimeDelta is greater than e.g. 1.5?
And also apply that label to the following rows until TimeDelta is again greater than 1.5?
time  parameter  TimeDelta  Label
1     123        -          1
2     456        1          1
4     122        2          2
7     344        3          3
8     344        1          3
I do not want to loop over every row, which is extremely slow.
Maybe it is possible with cumsum() to flag all the following rows up to the next value above threshold?
You can use part of the solution from the previous answer: add 1 to the cumulative sum and assign it to a new column:
df['Label'] = pd.to_numeric(df['TimeDelta'], errors='coerce').gt(1.5).cumsum().add(1)
print (df)
time parameter TimeDelta Label
0 1 123 - 1
1 2 456 1 1
2 4 122 2 2
3 7 344 3 3
4 8 344 1 3
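To see what each step contributes, the chain can be unpacked; a sketch using the same column names as the example:
delta = pd.to_numeric(df['TimeDelta'], errors='coerce')  # '-' becomes NaN
over = delta.gt(1.5)            # True where the gap exceeds the threshold
group_id = over.cumsum()        # increments at every threshold crossing
df['Label'] = group_id.add(1)   # start labelling at 1 instead of 0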
How would you do the following?
For customers with churn=1, take the average of their last 3 purchases up to the month they leave. E.g. if Churn_Month=3, average the last 3 purchases from Mar, Feb and Jan if available. Sometimes there will only be 2 purchases, or 1.
For customers with churn=0, take the average of their last 3 purchases when available; again, sometimes there will only be 2 purchases, or 1.
And put it all in one pandas DataFrame. See the expected output below.
Available information
df1: Here you'll find transactions with customer id, date, purchase1 and purchase2.
ID DATE P1 P2
0 1 2003-04-01 449 55
1 4 2003-02-01 406 213
2 3 2003-11-01 332 372
3 1 2003-03-01 61 336
4 3 2003-10-01 428 247
5 3 2003-12-01 335 339
6 3 2003-09-01 367 41
7 2 2003-01-01 11 270
8 1 2003-01-01 55 102
9 2 2003-02-01 244 500
10 1 2003-02-01 456 272
11 5 2003-03-01 240 180
12 4 2002-12-01 156 152
13 5 2003-01-01 144 185
14 4 2003-01-01 246 428
15 1 2003-05-01 492 97
16 5 2003-02-01 371 66
17 5 2003-04-01 246 428
18 5 2003-05-01 406 213
df2: Here you'll find customer ID, whether they leave the company or not and the month when they left (E.g. 3.0 = March)
ID Churn Churn_Month
0 1 1 3.0
1 2 0 0.0
2 3 1 12.0
3 4 0 0.0
4 5 1 4.0
Expected Output:
Mean of P1 and P2 by ID, merged with df2 information. ID will be the new index.
ID P1 P2 Churn Churn_Month
1 190.6 236.6 1 3.0
2 127.5 385 0 0.0
3 365 319.3 1 12.0
4 269.3 264.3 0 0.0
5 285.6 224.6 1 4.0
Some extra details were necessary here. First, when Churn == 1, assume that the customer left. Using df2 you can determine which month they left and remove any data that occurred after it. From there it's pretty straightforward in terms of grouping, aggregating, and filtering the data.
# merge
df3 = df1.merge(df2)
# convert DATE to datetime
df3.DATE = pd.to_datetime(df3.DATE)
# filter rows where month of (DATE is <= Churn_Month and Churn == 1)
# or Churn == 0
df3 = df3.loc[
((df3.Churn == 1) & (df3.DATE.dt.month <= df3.Churn_Month)) |
(df3.Churn == 0)
].copy()
# sort values ascending
df3.sort_values([
'ID',
'DATE',
'P1',
'P2',
'Churn',
'Churn_Month'
], inplace=True)
# groupby ID, Churn
# take last 3 DATEs
# merge with original to filter rows
# group on ID, Churn, and Churn_Month
# average P1 and P2
# reset_index to get columns back
# round results to 1 decimal at the end
df3.groupby([
'ID',
'Churn'
]).DATE.nth([
-1, -2, -3
]).reset_index().merge(df3).groupby([
'ID',
'Churn',
'Churn_Month'
])[[
'P1',
'P2'
]].mean().reset_index().round(1)
Results
ID Churn Churn_Month P1 P2
0 1 1 3.0 190.7 236.7
1 2 0 0.0 127.5 385.0
2 3 1 12.0 365.0 319.3
3 4 0 0.0 269.3 264.3
4 5 1 4.0 285.7 224.7
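Since df3 is already sorted by ID and DATE, the nth([-1, -2, -3]) plus merge step can also be replaced with groupby.tail(3); a hedged alternative sketch that should produce the same averages:
# last 3 rows per customer, then average P1/P2 per ID, Churn, Churn_Month
(df3.groupby('ID').tail(3)
    .groupby(['ID', 'Churn', 'Churn_Month'])[['P1', 'P2']]
    .mean()
    .reset_index()
    .round(1))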
I have 2 dataframes that I want to compare to one another, adding a True/False to a new column in one of them based on the comparison.
My data resembles:
DF1:
cat sub-cat low high
3 3 1 208 223
4 3 1 224 350
8 4 1 223 244
9 4 1 245 350
13 5 1 232 252
14 5 1 253 350
DF2:
Cat Sub-Cat Rating
0 5 1 246
1 5 2 239
2 8 1 203
3 8 2 218
4 K 1 149
5 K 2 165
6 K 1 171
7 K 2 185
8 K 1 157
9 K 2 171
The desired result would be for DF2 to have an additional column containing True or False depending on whether, based on the cat and sub-cat, the rating falls between low.min() and high.max(), or Null if no matches are found to compare against.
I have been going in circles with this for far too long with no results to speak of.
Thank you in advance for any assistance.
Update:
First row would look something like:
Cat Sub-Cat Rating In-Spec
0 5 1 246 True
As it falls within the min low and the max high.
Example: there are two rows in DF1 for cat = 5 and sub-cat = 1. I need to get the minimum low and the maximum high from those 2 rows and then check whether the rating from row 0 in DF2 falls within that minimum low and maximum high.
join after groupby.agg
d2 = DF2.join(
DF1.groupby(
['cat', 'sub-cat']
).agg(dict(low='min', high='max')),
on=['Cat', 'Sub-Cat']
)
d2
Cat Sub-Cat Rating high low
0 5 1 246 350.0 232.0
1 5 2 239 NaN NaN
2 8 1 203 NaN NaN
3 8 2 218 NaN NaN
4 K 1 149 NaN NaN
5 K 2 165 NaN NaN
6 K 1 171 NaN NaN
7 K 2 185 NaN NaN
8 K 1 157 NaN NaN
9 K 2 171 NaN NaN
assign with .loc
DF2.loc[d2.eval('low <= Rating <= high'), 'In-Spec'] = True
DF2
Cat Sub-Cat Rating In-Spec
0 5 1 246 True
1 5 2 239 NaN
2 8 1 203 NaN
3 8 2 218 NaN
4 K 1 149 NaN
5 K 2 165 NaN
6 K 1 171 NaN
7 K 2 185 NaN
8 K 1 157 NaN
9 K 2 171 NaN
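If you'd rather have an explicit False when a match exists but the rating is out of range (and keep NaN where there is no match in DF1), one possible follow-up using the joined frame d2:
in_spec = d2['Rating'].between(d2['low'], d2['high'])  # False where bounds are NaN
DF2['In-Spec'] = in_spec.where(d2['low'].notna())      # NaN for unmatched rows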
To add a new column based on a boolean expression would involve something along the lines of (with some_threshold standing in for whatever bound you need):
temp = df2['Rating'] > some_threshold  # a boolean Series built from your inequality
df2['new column name'] = temp
However, I'm not sure I understand: the first row in your DF2 table, for instance, has a rating of 246, which means it's true for row 13 of DF1 but false for row 14. What would you like it to return?
You can do it like this
df2['In-Spec'] = False
df2.loc[(df2['Rating'] > df1['low']) & (df2['Rating'] < df1['high']), 'In-Spec'] = True
But which rows should be compared with each other? Do you want to compare them by their index or by their cat & sub-cat names?
In python, given a list of ratings as:
import pandas as pd
path = 'ratings_ml100k.csv'
data = pd.read_csv(path,sep= ',')
print(data)
user_id item_id rating
28422 100 690 4
32020 441 751 4
15819 145 265 5
where the items are:
print(itemsTrain)
[ 690 751 265 ..., 1650 1447 1507]
For each item, I would like to compute the number of ratings. Is there any way to do this without resorting to a loop? All ideas are appreciated.
data is a pandas DataFrame. The desired output should look like this:
pop =
item_id rating_count
690 120
751 10
265 159
... ...
Note that itemsTrain contains the unique item_ids in the ratings dataset data.
You can do it this way:
In [200]: df = pd.DataFrame(np.random.randint(0,8,(15,2)),columns=['id', 'rating'])
In [201]: df
Out[201]:
id rating
0 4 6
1 0 1
2 2 4
3 2 5
4 2 7
5 3 5
6 6 1
7 4 3
8 4 3
9 3 2
10 2 4
11 7 7
12 3 1
13 2 7
14 7 3
In [202]: df.groupby('id').rating.count()
Out[202]:
id
0 1
2 5
3 3
4 3
6 1
7 2
Name: rating, dtype: int64
If you want the result as a DataFrame (you can also name the count column as you wish):
In [206]: df.groupby('id').rating.count().to_frame('count').reset_index()
Out[206]:
id count
0 0 1
1 2 5
2 3 3
3 4 3
4 6 1
5 7 2
You can also count the number of unique ratings:
In [203]: df.groupby('id').rating.nunique()
Out[203]:
id
0 1
2 3
3 3
4 2
6 1
7 2
Name: rating, dtype: int64
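To get the counts in the same order as itemsTrain and with the column names from the question, one possible sketch (items with no ratings get a count of 0):
pop = (data.groupby('item_id')['rating'].count()
           .reindex(itemsTrain, fill_value=0)   # align to itemsTrain order
           .rename('rating_count')
           .reset_index())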
You can use the method df.groupby() to group the rows by item_id and then use the method count() to count the ratings.
Do as follows:
# df is your dataframe
# .groupby('item_id') groups the rows by the feature "item_id",
#   so each resulting group corresponds to one unique item_id
# .rating selects the column to aggregate
# .count() counts the rating values in each group
df.groupby('item_id').rating.count()