Pandas: groupby and get tail based on some column value - python

I have a dataframe that looks like this:
id week value
1 1 15
1 2 29
1 3 49
1 3 19
2 6 10
2 7 99
2 8 53
How can I extract the rows for the last 2 weeks of each id?
It's like tail, but based on the week values rather than on the number of records.
Desired output:
id week value
1 2 29
1 3 49
1 3 19
2 7 99
2 8 53

This is more like factorizing the weeks, then picking the last two distinct weeks of each group:
m = df.iloc[::-1].groupby('id')['week'].transform(lambda x: x.factorize()[0]).isin([0, 1])
out = df[m]
id week value
1 1 2 29
2 1 3 49
3 1 3 19
5 2 7 99
6 2 8 53
Or we can fix the tail approach with drop_duplicates:
df.merge(df.drop_duplicates(['id','week']).groupby('id').tail(2).drop(columns='value'))
id week value
0 1 2 29
1 1 3 49
2 1 3 19
3 2 7 99
4 2 8 53

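Assuming the data is already sorted by id and week, groupby tail returns the last 2 rows of each id:
df.groupby('id').tail(2)
Revision: to get the last 2 distinct weeks instead, deduplicate (id, week) first, take the tail, then merge back: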
(df[['id', 'week']]
    .drop_duplicates()
    .groupby('id')
    .tail(2)
    .merge(df)
)
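As an alternative that does not depend on row order, a dense rank of week within each id can pick out the last two distinct weeks. This is a minimal sketch on the sample data from the question, not one of the original answers; the names week_rank and out are illustrative.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "id":    [1, 1, 1, 1, 2, 2, 2],
    "week":  [1, 2, 3, 3, 6, 7, 8],
    "value": [15, 29, 49, 19, 10, 99, 53],
})

# Dense rank in descending order: the latest week of each id gets rank 1,
# the next distinct week gets rank 2, and tied weeks share a rank.
week_rank = df.groupby("id")["week"].rank(method="dense", ascending=False)

# Keep every row that belongs to one of the two latest weeks of its id.
out = df[week_rank <= 2]
print(out)
```

Because the rank is computed from the week values themselves, duplicated weeks (such as week 3 for id 1) are all kept, and the original index is preserved.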

Related

Python Dataframe GroupBy Function

I am having a hard time understanding what the code below does. I initially thought it was counting the unique appearances of the values in (Weight, Age) and (Weight, Height); however, when I ran this example, I found out it was doing something else.
import pandas as pd
data = [[0,33,15,4],[1,44,12,3],[0,44,12,5],[1,33,15,4],[0,77,13,4],[1,33,15,4],[1,99,40,7],[0,58,45,4],[1,11,13,4]]
df = pd.DataFrame(data, columns=["Lbl","Weight","Age","Height"])
print(df)
def group_fea(df, key, target):
    '''
    Adds columns for feature combinations
    '''
    tmp = df.groupby(key, as_index=False)[target].agg({
        key + target + '_nunique': 'nunique',
    }).reset_index()
    del tmp['index']
    print("****{}****".format(target))
    return tmp
# Add feature combinations
feature_key = ['Weight']
feature_target = ['Age','Height']
for key in feature_key:
    for target in feature_target:
        tmp = group_fea(df, key, target)
        df = df.merge(tmp, on=key, how='left')
print(df)
Lbl Weight Age Height
0 0 33 15 4
1 1 44 12 3
2 0 44 12 5
3 1 33 15 4
4 0 77 13 4
5 1 33 15 4
6 1 99 40 7
7 0 58 45 4
8 1 11 13 4
****Age****
****Height****
Lbl Weight Age Height WeightAge_nunique WeightHeight_nunique
0 0 33 15 4 1 1
1 1 44 12 3 1 2
2 0 44 12 5 1 2
3 1 33 15 4 1 1
4 0 77 13 4 1 1
5 1 33 15 4 1 1
6 1 99 40 7 1 1
7 0 58 45 4 1 1
8 1 11 13 4 1 1
I want to understand what the values in WeightAge_nunique and WeightHeight_nunique mean.
The value of WeightAge_nunique on a given row is the number of unique Ages that have the same Weight. The corresponding thing is true of WeightHeight_nunique. E.g., for people of Weight=44, there is only 1 unique age (12), hence WeightAge_nunique=1 on those rows, but there are 2 unique Heights (3 and 5), hence WeightHeight_nunique=2 on those same rows.
You can see that this happens because the grouping function groups by the "key" column (Weight), then performs the "nunique" aggregation function on the "target" column (either Age or Height).
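To see the per-group counts on their own, before they are attached to the rows, the aggregation can be run directly. This is a minimal sketch on the data from the question; it calls nunique on the grouped column rather than the dict-based agg used in group_fea, and the names age_counts and height_counts are illustrative.

```python
import pandas as pd

data = [[0,33,15,4],[1,44,12,3],[0,44,12,5],[1,33,15,4],[0,77,13,4],
        [1,33,15,4],[1,99,40,7],[0,58,45,4],[1,11,13,4]]
df = pd.DataFrame(data, columns=["Lbl", "Weight", "Age", "Height"])

# One value per Weight: how many distinct Ages / Heights occur with it.
age_counts = df.groupby("Weight")["Age"].nunique()
height_counts = df.groupby("Weight")["Height"].nunique()
print(age_counts)
print(height_counts)

# Attach the counts to every row by looking up each row's Weight.
df["WeightAge_nunique"] = df["Weight"].map(age_counts)
df["WeightHeight_nunique"] = df["Weight"].map(height_counts)
print(df)
```

For Weight 44 the lookup yields 1 distinct Age and 2 distinct Heights, which matches the rows discussed above.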
Let us try transform instead:
g = df.groupby('Weight').transform('nunique')
df['WeightAge_nunique'] = g['Age']
df['WeightHeight_nunique'] = g['Height']
df
Out[196]:
Lbl Weight Age Height WeightAge_nunique WeightHeight_nunique
0 0 33 15 4 1 1
1 1 44 12 3 1 2
2 0 44 12 5 1 2
3 1 33 15 4 1 1
4 0 77 13 4 1 1
5 1 33 15 4 1 1
6 1 99 40 7 1 1
7 0 58 45 4 1 1
8 1 11 13 4 1 1

How do you parse out data from a dataframe for each ID when an adjacent column contains a certain value?

I have a large dataframe in the following format. For each ID, I need to keep only the rows starting at the first row where values == 1 and continuing through the end of that ID. This should reset on each ID, so that the selection starts at the first 1 within an ID and ends when that ID ends.
import pandas as pd
d = {'ID': [1,1,1,1,1,2,2,2,2,2,3,3,3,3,4,4,4,4,4,4,4,4,4,5,5,5,5,5],
     'values': [0,0,0,1,0,1,0,1,1,1,0,1,0,0,0,0,0,0,1,1,0,1,0,1,1,1,1,1]}
df = pd.DataFrame(data=d)
df
# Desired result
ND = {'ID': [1,1,2,2,2,2,2,3,3,3,4,4,4,4,4,5,5,5,5,5],
      'values': [1,0,1,0,1,1,1,1,0,0,1,1,0,1,0,1,1,1,1,1]}
df_final = pd.DataFrame(ND)
df_final
IIUC,
df[df.groupby('ID')['values'].transform('cummax')==1]
Output:
ID values
3 1 1
4 1 0
5 2 1
6 2 0
7 2 1
8 2 1
9 2 1
11 3 1
12 3 0
13 3 0
18 4 1
19 4 1
20 4 0
21 4 1
22 4 0
23 5 1
24 5 1
25 5 1
26 5 1
27 5 1
Details: cummax keeps the value at 1 once the first 1 has been found within each ID. Comparing the result to 1 creates a boolean Series, which is then used for boolean indexing.
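The intermediate result of cummax makes the mechanism easier to follow. This is a minimal sketch on the data from the question; the names seen_one and kept are illustrative.

```python
import pandas as pd

d = {'ID': [1,1,1,1,1,2,2,2,2,2,3,3,3,3,4,4,4,4,4,4,4,4,4,5,5,5,5,5],
     'values': [0,0,0,1,0,1,0,1,1,1,0,1,0,0,0,0,0,0,1,1,0,1,0,1,1,1,1,1]}
df = pd.DataFrame(d)

# Running maximum within each ID: stays 0 until the first 1, then stays 1.
seen_one = df.groupby('ID')['values'].cummax()

# For ID 1 the values are [0, 0, 0, 1, 0]; the running maximum is shown here.
print(seen_one[df['ID'] == 1].tolist())  # [0, 0, 0, 1, 1]

# Rows at or after the first 1 of their ID.
kept = df[seen_one == 1]
print(kept)
```

The running maximum restarts for every ID because it is computed inside the groupby, which is what makes the selection reset per ID.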
If your values column contains only 0 and 1, you can use groupby.cummax, which turns every 0 that comes after a 1 within the same ID into 1, then use the result as a boolean mask:
df_ = df[df.groupby('ID')['values'].cummax().astype(bool).to_numpy()]
print(df_)
ID values
3 1 1
4 1 0
5 2 1
6 2 0
7 2 1
8 2 1
9 2 1
11 3 1
12 3 0
13 3 0
18 4 1
19 4 1
20 4 0
21 4 1
22 4 0
23 5 1
24 5 1
25 5 1
26 5 1
27 5 1

Remove duplicates after a certain number of occurrences

How do we filter the dataframe below to remove all rows for an ID after a certain number of occurrences of that ID? For example, remove all rows with ID == 0 after the 3rd occurrence of ID == 0.
Thanks
pd.DataFrame(np.random.randint(0,10,size=(100, 2)), columns=['ID', 'Value']).sort_values('ID')
Output:
ID Value
0 7
0 8
0 5
0 5
... ... ...
9 7
9 7
9 1
9 3
Desired Output for filter_count = 3:
Output:
ID Value
0 7
0 8
0 5
1 7
1 7
1 1
2 3
If you want to do this for all IDs, use:
df.groupby("ID").head(3)
For a single ID, you can assign a new column using cumcount and then filter by condition:
df["count"] = df.groupby("ID")["Value"].cumcount()
print(df.loc[df["ID"].ne(0) | (df["ID"].eq(0) & (df["count"] < 3))])
ID Value count
64 0 6 0
77 0 6 1
83 0 0 2
44 1 7 0
58 1 5 1
40 1 2 2
35 1 7 3
89 1 9 4
19 1 7 5
10 1 3 6
45 2 4 0
68 2 1 1
74 2 4 2
75 2 8 3
34 2 4 4
60 2 6 5
78 2 0 6
31 2 8 7
97 2 9 8
2 2 6 9
93 2 8 10
13 2 2 11
...
I would do it without groupby:
df = pd.concat([df.loc[df.ID==0].head(3),df.loc[df.ID!=0]])
Thanks Henry,
I modified your code and I think this should work as well.
Your df.groupby("ID").head(3) is great. Thanks.
df["count"] = df.groupby("ID")["Value"].cumcount()
df.loc[df["count"]<3].drop(['count'], axis=1)
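The approaches above can be compared on a small fixed frame. This is a minimal sketch; the sample data and the names occurrence, capped_all and capped_zero are made up for illustration, and the cumcount result is used directly as a mask instead of being stored in a helper column.

```python
import pandas as pd

# Small deterministic example (illustrative data, already sorted by ID).
df = pd.DataFrame({
    "ID":    [0, 0, 0, 0, 1, 1, 2, 2, 2, 2],
    "Value": [7, 8, 5, 5, 7, 1, 3, 4, 6, 9],
})

# 0-based position of each row within its ID group.
occurrence = df.groupby("ID").cumcount()

# Cap every ID at 3 rows; same result as df.groupby("ID").head(3).
capped_all = df[occurrence < 3]

# Cap only ID == 0 at 3 rows and leave all other IDs untouched.
capped_zero = df[(df["ID"] != 0) | (occurrence < 3)]

print(capped_all)
print(capped_zero)
```

Using the mask directly avoids having to drop the count column afterwards.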

Find average based upon two conditions; create column from these averages

I have a df with weather reporting data. It has over 2 million rows and the following columns.
ID MONTH TEMP
1 1 0
1 1 10
2 1 50
2 1 60
3 1 80
3 1 90
1 2 0
1 2 10
2 2 50
2 2 60
3 2 80
3 2 90
I am looking to create a column for the average monthly temperature. I need a faster way than for-loops. The values for the average monthly temperature come from the TEMP column. I would like them to be specific to each ID for each MONTH.
ID MONTH TEMP AVE MONTHLY TEMP
1 1 0 5
1 1 10 5
2 1 50 55
2 1 60 55
3 1 80 85
3 1 90 85
1 2 0 5
1 2 10 5
2 2 50 55
2 2 60 55
3 2 80 85
3 2 90 85
Use groupby.transform:
df['AVE MONTHLY TEMP']=df.groupby(['ID','MONTH'])['TEMP'].transform('mean')
print(df)
Output
ID MONTH TEMP AVE MONTHLY TEMP
0 1 1 0 5
1 1 1 10 5
2 2 1 50 55
3 2 1 60 55
4 3 1 80 85
5 3 1 90 85
6 1 2 0 5
7 1 2 10 5
8 2 2 50 55
9 2 2 60 55
10 3 2 80 85
11 3 2 90 85
I think this solution may work better if you have millions of rows, since the (ID, MONTH) groupings repeat. It assumes that the rows of each (ID, MONTH) group are always contiguous and that neighbouring groups never share the same ID, as in your sample data. I'm trying to think outside the box here, since you said you have over 2 million rows:
df['AVG MONTHLY TEMP'] = df.groupby(df['ID'].ne(df['ID'].shift()).cumsum(), as_index=False)['TEMP'].transform('mean')
Also, if your temperatures are ALWAYS grouped in pairs, you can use this as well:
df.groupby(np.arange(len(df))//2)['TEMP'].transform('mean')
output:
ID MONTH TEMP AVG MONTHLY TEMP
0 1 1 0 5
1 1 1 10 5
2 2 1 50 55
3 2 1 60 55
4 3 1 80 85
5 3 1 90 85
6 1 2 0 5
7 1 2 10 5
8 2 2 50 55
9 2 2 60 55
10 3 2 80 85
11 3 2 90 85
I hope this helps or gives you some ideas, since millions of rows is a lot of data.
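The ordering assumption behind the consecutive-run trick can be checked on small data. This is a minimal sketch: the first frame is the sample from the question, the second frame (df2) is a made-up counter-example, and all variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "ID":    [1, 1, 2, 2, 3, 3, 1, 1, 2, 2, 3, 3],
    "MONTH": [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
    "TEMP":  [0, 10, 50, 60, 80, 90, 0, 10, 50, 60, 80, 90],
})

# Explicit grouping by both keys.
by_key = df.groupby(["ID", "MONTH"])["TEMP"].transform("mean")

# Grouping by runs of consecutive equal IDs: a new run starts whenever
# the ID differs from the previous row.
run_id = df["ID"].ne(df["ID"].shift()).cumsum()
by_run = df.groupby(run_id)["TEMP"].transform("mean")
print(by_key.tolist() == by_run.tolist())  # True on the sample data

# Hypothetical counter-example: one ID with two adjacent months forms a
# single run, so the run-based mean mixes both months.
df2 = pd.DataFrame({"ID": [1, 1, 1, 1], "MONTH": [1, 1, 2, 2], "TEMP": [0, 10, 20, 30]})
by_key2 = df2.groupby(["ID", "MONTH"])["TEMP"].transform("mean")
by_run2 = df2.groupby(df2["ID"].ne(df2["ID"].shift()).cumsum())["TEMP"].transform("mean")
print(by_key2.tolist(), by_run2.tolist())
```

Grouping by ['ID', 'MONTH'] gives the intended result regardless of row order, so the run-based variants are only worth considering when the layout of the data is guaranteed.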

Pandas series Max value for column based on index

I am trying to extract the max value for a column based on the index. I have this series:
Hour Values
1 0
1 3
1 1
2 0
2 5
2 4
...
23 3
23 4
23 2
24 1
24 9
24 2
and am looking to add a new column 'Max Value' that holds, for each row, the maximum of the 'Values' column within that row's Hour:
Hour Values Max Value
1 0 3
1 3 3
1 1 3
2 0 5
2 5 5
2 4 5
...
23 3 4
23 4 4
23 2 4
24 1 9
24 9 9
24 2 9
I can do this in Excel, but I am new to pandas. The closest I have come is this rough attempt, but I get a syntax error on the first '=':
df['Max Value'] = 0
df['Max Value'][(df['Hour'] =1)] = df['Value'].max()
Use the transform('max') method:
In [61]: df['Max Value'] = df.groupby('Hour')['Values'].transform('max')
In [62]: df
Out[62]:
Hour Values Max Value
0 1 0 3
1 1 3 3
2 1 1 3
3 2 0 5
4 2 5 5
5 2 4 5
6 23 3 4
7 23 4 4
8 23 2 4
9 24 1 9
10 24 9 9
11 24 2 9
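For reference, the syntax error in the attempt from the question comes from using = (assignment) where == (comparison) is needed inside the brackets. The sketch below shows a corrected single-hour lookup next to the transform approach; it uses a shortened version of the sample data, and the names is_hour_1 and hour_1_max are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Hour":   [1, 1, 1, 2, 2, 2],
    "Values": [0, 3, 1, 0, 5, 4],
})

# One hour at a time: == builds a boolean mask, .loc selects with it.
is_hour_1 = df["Hour"] == 1
hour_1_max = df.loc[is_hour_1, "Values"].max()
print(hour_1_max)  # 3

# All hours at once: transform broadcasts each group's max back to its rows.
df["Max Value"] = df.groupby("Hour")["Values"].transform("max")
print(df)
```

The transform version avoids looping over hours and also avoids chained indexing such as df['Max Value'][mask] = ..., which pandas may not apply to the original frame.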
