select multiple nth values in grouping with conditional aggregate - pandas - python

I've got a pd.DataFrame with four columns:
df = pd.DataFrame({'id': [1, 1, 1, 1, 1, 2, 2, 2, 2],
                   'A': ['H', 'H', 'E', 'E', 'H', 'E', 'E', 'H', 'H'],
                   'B': [4, 5, 2, 7, 6, 1, 3, 1, 0],
                   'C': ['M', 'D', 'M', 'D', 'M', 'M', 'M', 'D', 'D']})
   id  A  B  C
0   1  H  4  M
1   1  H  5  D
2   1  E  2  M
3   1  E  7  D
4   1  H  6  M
5   2  E  1  M
6   2  E  3  M
7   2  H  1  D
8   2  H  0  D
I'd like to group by id and get the value of B for the nth (let's say second) occurrence of A == 'H' for each id in agg_B1, and the value of B for the nth (let's say first) occurrence of C == 'M' in agg_B2:
desired output:
   id  agg_B1  agg_B2
0   1       5       4
1   2       0       1
desired_output = df.groupby('id').agg(
    agg_B1=('B', lambda x: x[df.loc[x.index].loc[df.A == 'H'][1]]),
    agg_B2=('B', lambda x: x[df.loc[x.index].loc[df.C == 'M'][0]])
).reset_index()
TypeError: Indexing a Series with DataFrame is not supported, use the appropriate DataFrame column
Obviously, I'm doing something wrong with the indexing.
Edit: if possible, I'd like to use aggregate with a lambda function, because there are multiple other aggregate outputs that I'd like to extract at the same time.

Your solution is possible with a small change, if you need GroupBy.agg:
desired_output = df.groupby('id').agg(
    agg_B1=('B', lambda x: x[df.loc[x.index, 'A'] == 'H'].iat[1]),
    agg_B2=('B', lambda x: x[df.loc[x.index, 'C'] == 'M'].iat[0])
).reset_index()
print (desired_output)
   id  agg_B1  agg_B2
0   1       5       4
1   2       0       1
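To see why the corrected indexing works: inside agg, x is the group's slice of column B, and its index can be used to look up the aligned rows of the original frame. A quick illustration (not part of the answer) for the group id == 1:
g = df.loc[df['id'] == 1, 'B']        # this is x inside the lambda
mask = df.loc[g.index, 'A'] == 'H'    # A values for the same rows
print(g[mask].tolist())               # [4, 5, 6] -> B where A == 'H'
print(g[mask].iat[1])                 # 5, the second occurrence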
But if performance is important, or you are not sure a second 'H' match always exists for the first condition, I suggest processing each condition separately and then joining the results to the other aggregated values:
# some sample aggregations
df0 = df.groupby('id').agg({'B':'sum', 'C':'last'})
df1 = df[df['A'].eq('H')].groupby("id")['B'].nth(1).rename('agg_B1')
df2 = df[df['C'].eq('M')].groupby("id")['B'].first().rename('agg_B2')
desired_output = pd.concat([df0, df1, df2], axis=1)
print (desired_output)
     B  C  agg_B1  agg_B2
id
1   24  M       5       4
2    5  D       0       1
EDIT1: If you need GroupBy.agg, it is possible to test whether the indexing fails and return a missing value instead:
# for the second value, the sample works nicely
import numpy as np

def f1(x):
    try:
        return x[df.loc[x.index, 'A'] == 'H'].iat[1]
    except IndexError:
        return np.nan
desired_output = df.groupby('id').agg(
    agg_B1=('B', f1),
    agg_B2=('B', lambda x: x[df.loc[x.index, 'C'] == 'M'].iat[0])
).reset_index()
print (desired_output)
   id  agg_B1  agg_B2
0   1       5       4
1   2       0       1
# a third value does not exist, so the missing value NaN is added
def f1(x):
    try:
        return x[df.loc[x.index, 'A'] == 'H'].iat[2]
    except IndexError:
        return np.nan
desired_output = df.groupby('id').agg(
    agg_B1=('B', f1),
    agg_B2=('B', lambda x: x[df.loc[x.index, 'C'] == 'M'].iat[0])
).reset_index()
print (desired_output)
   id  agg_B1  agg_B2
0   1     6.0       4
1   2     NaN       1
Which works the same as:
df1 = df[df['A'].eq('H')].groupby("id")['B'].nth(2).rename('agg_B1')
df2 = df[df['C'].eq('M')].groupby("id")['B'].first().rename('agg_B2')
desired_output = pd.concat([df1, df2], axis=1)
print (desired_output)
    agg_B1  agg_B2
id
1      6.0       4
2      NaN       1

Filter for the rows where A equals 'H', then grab the second row per group with the nth function:
df.query("A=='H'").groupby("id").nth(1)
    A  B
id
1   H  5
2   H  0
Python uses zero-based indexing, so the second row is nth(1).
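Note that in recent pandas (2.0+) GroupBy.nth behaves as a filter and keeps the original row index instead of the group keys. A version-robust sketch that always yields one value per id (nth_value is a hypothetical helper, not part of either answer):
import numpy as np

# Hypothetical helper: take the n-th element of a group, NaN if the group is too short
def nth_value(s, n):
    return s.iat[n] if len(s) > n else np.nan

agg_B1 = df.query("A == 'H'").groupby('id')['B'].apply(nth_value, n=1)
print(agg_B1)
# id
# 1    5
# 2    0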

Related

Filter rows with more than 1 value in a set and count their occurrence pandas python

Let's assume I have the following data frame.
Id  Combinations
1   (A,B)
2   (C,)
3   (A,D)
4   (D,E,F)
5   (F)
I would like to filter the Combinations column down to sets with more than one value, something like below, and I would like to count each value's number of occurrences across the whole Combinations column. For example, Ids 2 and 5 should be removed since their sets contain only one value.
The result I am looking for is:
ID  Combination  Frequency
1   A            2
1   B            1
3   A            2
3   D            2
4   D            2
4   E            1
4   F            2
Can anyone help to get the above result in Python pandas?
First, if necessary, convert the values to lists:
df['Combinations'] = df['Combinations'].str.strip('(,)').str.split(',')
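For example, assuming the raw column holds strings like '(A,B)' (a hypothetical input, since the question displays the values as tuples), the conversion yields lists:
import pandas as pd

df = pd.DataFrame({'Id': [1, 2, 3, 4, 5],
                   'Combinations': ['(A,B)', '(C,)', '(A,D)', '(D,E,F)', '(F)']})
df['Combinations'] = df['Combinations'].str.strip('(,)').str.split(',')
print(df['Combinations'].tolist())
# [['A', 'B'], ['C'], ['A', 'D'], ['D', 'E', 'F'], ['F']]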
If you need the counts after filtering out the single-value rows (by Series.str.len with boolean indexing), use DataFrame.explode and count the values with Series.map plus Series.value_counts:
df1 = df[df['Combinations'].str.len().gt(1)].explode('Combinations')
df1['Frequency'] = df1['Combinations'].map(df1['Combinations'].value_counts())
print (df1)
  Id Combinations  Frequency
0  1            A          2
0  1            B          1
2  3            A          2
2  3            D          2
3  4            D          2
3  4            E          1
3  4            F          1
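An equivalent way to get the counts, if you prefer groupby over map, is transform('size') on the exploded column (a sketch, same result):
df1['Frequency'] = df1.groupby('Combinations')['Combinations'].transform('size')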
Or, if you need the counts before removing rows, filter by Series.duplicated in the last step:
df2 = df.explode('Combinations')
df2['Frequency'] = df2['Combinations'].map(df2['Combinations'].value_counts())
df2 = df2[df2['Id'].duplicated(keep=False)]
Alternative:
df2 = df2[df2.groupby('Id').Id.transform('size') > 1]
Or:
df2 = df2[df2['Id'].map(df2['Id'].value_counts()) > 1]
print (df2)
  Id Combinations  Frequency
0  1            A          2
0  1            B          1
2  3            A          2
2  3            D          2
3  4            D          2
3  4            E          1
3  4            F          2

How do I sort a Pandas dataframe Excel import?

I have imported the following Excel file but would like to sort it by Frequency descending, with 'Other', 'No data' and 'All' (the total) at the bottom in that order. Is this possible?
table1 = pd.read_excel("table1.xlsx")
table1
Use:
df = pd.DataFrame({
    'generalenq': list('abcdef'),
    'percentage': [1, 3, 5, 7, 1, 0],
    'frequency': [5, 3, 6, 9, 2, 4],
})
df.loc[0, 'generalenq'] = 'All'
df.loc[2, 'generalenq'] = 'No data'
df.loc[3, 'generalenq'] = 'Other'
print (df)
  generalenq  percentage  frequency
0        All           1          5
1          b           3          3
2    No data           5          6
3      Other           7          9
4          e           1          2
5          f           0          4
First create a dictionary that maps the special labels to integers for ordering. Then create a mask by membership with Series.isin, and sort the non-matched rows, selected with ~ (inverted mask) and boolean indexing:
d = {'Other':0,'No data':1,'All':2}
mask = df['generalenq'].isin(list(d.keys()))
df1 = df[~mask].sort_values('frequency', ascending=False)
print (df1)
  generalenq  percentage  frequency
5          f           0          4
1          b           3          3
4          e           1          2
Then filter the matched rows by the mask, create a helper column by mapping the dict, and sort by it:
df2 = df[mask].assign(new=lambda x: x['generalenq'].map(d)).sort_values('new').drop('new', axis=1)
print (df2)
  generalenq  percentage  frequency
3      Other           7          9
2    No data           5          6
0        All           1          5
And finally join them together with concat:
df = pd.concat([df1, df2], ignore_index=True)
print (df)
  generalenq  percentage  frequency
0          f           0          4
1          b           3          3
2          e           1          2
3      Other           7          9
4    No data           5          6
5        All           1          5
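A one-pass alternative (a sketch, not from the original answer): map the special labels to positive ranks, everything else to 0, then sort by rank first and descending frequency second:
order = {'Other': 1, 'No data': 2, 'All': 3}
out = (df.assign(grp=df['generalenq'].map(order).fillna(0),  # 0 for normal rows
                 neg=-df['frequency'])                       # negate for descending sort
         .sort_values(['grp', 'neg'])
         .drop(columns=['grp', 'neg'])
         .reset_index(drop=True))
print(out)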

Subset df by last valid item

I can return the index of the last valid item, but I'm hoping to subset a df using the same method. For instance, the code below returns the index of the last time 2 appears in the df, but I want to return the df up to that index.
import pandas as pd
df = pd.DataFrame({
    'Number': [2, 3, 2, 4, 2, 1],
    'Code': ['x', 'a', 'b', 'c', 'f', 'y'],
})
df_last = df[df['Number'] == 2].last_valid_index()
print(df_last)
4
Intended Output:
   Number Code
0       2    x
1       3    a
2       2    b
3       4    c
4       2    f
You can use loc, but this solution works only if there is at least one value 2 in the column:
df = df.loc[:df[df['Number'] == 2].last_valid_index()]
print (df)
   Number Code
0       2    x
1       3    a
2       2    b
3       4    c
4       2    f
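To see the caveat: when no 2 is present, last_valid_index() returns None, and df.loc[:None] silently returns the whole frame rather than an empty one (a sketch with hypothetical data):
tmp = pd.DataFrame({'Number': [3, 4], 'Code': ['a', 'b']})
idx = tmp[tmp['Number'] == 2].last_valid_index()
print(idx)            # None
print(tmp.loc[:idx])  # the whole frame, not the empty subset you might expect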
A general solution is:
df = df[(df['Number'] == 2)[::-1].cumsum().ne(0)[::-1]]
print (df)
   Number Code
0       2    x
1       3    a
2       2    b
3       4    c
4       2    f
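Why the general solution works, step by step (a sketch of the intermediates):
mask = df['Number'] == 2          # [True, False, True, False, True, False]
counts = mask[::-1].cumsum()      # counts the 2s from the bottom row upward
keep = counts.ne(0)[::-1]         # zero (dropped) only after the last 2
print(keep.tolist())              # [True, True, True, True, True, False]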

How to drop float values from a column - pandas

I have the dataframe shown below:
df = pd.DataFrame({
    'subject_id': [1, 1, 1, 1, 1, 1],
    'val': [5, 6.4, 5.4, 6, 6, 6]
})
It looks like this:
   subject_id  val
0           1  5.0
1           1  6.4
2           1  5.4
3           1  6.0
4           1  6.0
5           1  6.0
I would like to drop the values from the val column which end in .[1-9]; basically, I would like to retain values like 5.0 and 6.0 and drop values like 5.4, 6.4, etc.
Though I tried the below, it isn't accurate:
df['val'] = df['val'].astype(int)
df.drop_duplicates()  # doesn't give the expected output
I expect my output to be as shown below.
The first idea is to compare the original values with the column cast to integer, and assign the integers back for the expected output (integer column):
s = df['val']
df['val'] = df['val'].astype(int)
df = df[df['val'] == s]
print (df)
   subject_id  val
0           1    5
3           1    6
4           1    6
5           1    6
Another idea is to test is_integer:
mask = df['val'].apply(lambda x: x.is_integer())
df['val'] = df['val'].astype(int)
df = df[mask]
print (df)
   subject_id  val
0           1    5
3           1    6
4           1    6
5           1    6
If you need floats in the output you can use:
df1 = df[ df['val'].astype(int) == df['val']]
print (df1)
   subject_id  val
0           1  5.0
3           1  6.0
4           1  6.0
5           1  6.0
Use mod 1 to get the remainder. If the remainder is 0, the number is an integer. Then use the result as a mask to select only those rows:
df.loc[df.val.mod(1).eq(0)].astype(int)
   subject_id  val
0           1    5
3           1    6
4           1    6
5           1    6
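If the floats are produced by arithmetic, exact tests like mod(1).eq(0) can be defeated by rounding error (e.g. 0.1 + 0.2 != 0.3); a tolerance-based variant (a sketch, assuming the original df):
import numpy as np

# Keep values within floating-point tolerance of a whole number
mask = np.isclose(df['val'], df['val'].round())
print(df[mask])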

Pandas comparing multiindex dataframes without looping

I want to compare two multiindex dataframes and add another column to show the difference in values (where all index values match between the first dataframe and the second dataframe) without using loops:
import numpy as np
import pandas as pd

index_a = [1, 2, 2, 3, 3, 3]
index_b = [0, 0, 1, 0, 1, 2]
index_c = [1, 2, 2, 4, 4, 4]
index = pd.MultiIndex.from_arrays([index_a, index_b], names=('a', 'b'))
index_1 = pd.MultiIndex.from_arrays([index_c, index_b], names=('a', 'b'))
df1 = pd.DataFrame(np.random.rand(6), index=index, columns=['p'])
df2 = pd.DataFrame(np.random.rand(6), index=index_1, columns=['q'])
df1
          p
a b
1 0  0.4655
2 0  0.8600
  1  0.9010
3 0  0.0652
  1  0.5686
  2  0.8965
df2
          q
a b
1 0  0.6591
2 0  0.5684
  1  0.5689
4 0  0.9898
  1  0.3656
  2  0.6989
The resulting frame (df1 - df2) should look like:
          p      diff
a b
1 0  0.4655   -0.1936
2 0  0.8600    0.2916
  1  0.9010    0.3321
3 0  0.0652  No Match
  1  0.5686  No Match
  2  0.8965  No Match
Use reindex_like or reindex for the intersection of indices:
df1['new'] = (df1['p'] - df2['q'].reindex_like(df1)).fillna('No Match')
#alternative
#df1['new'] = (df1['p'] - df2['q'].reindex(df1.index)).fillna('No Match')
print (df1)
            p       new
a b
1 0  0.955587  0.924466
2 0  0.312497 -0.310224
  1  0.306256  0.231646
3 0  0.575613  No Match
  1  0.674605  No Match
  2  0.462807  No Match
Another idea with Index.intersection and DataFrame.loc:
df1['new'] = (df1['p'] - df2.loc[df2.index.intersection(df1.index), 'q']).fillna('No Match')
Or use merge with a left join:
df = pd.merge(df1, df2, how='left', left_index=True, right_index=True)
df['new'] = (df['p'] - df['q']).fillna('No Match')
print (df)
            p         q       new
a b
1 0  0.789693  0.665148  0.124544
2 0  0.082677  0.814190 -0.731513
  1  0.762339  0.235435  0.526905
3 0  0.727695       NaN  No Match
  1  0.903596       NaN  No Match
  2  0.315999       NaN  No Match
Use the following to get the difference of matched indexes. Unmatched indices will be NaN:
diff = df1['p'] - df2['q']
# Output
a  b
1  0   -0.666542
2  0   -0.389033
   1    0.064986
3  0         NaN
   1         NaN
   2         NaN
4  0         NaN
   1         NaN
   2         NaN
dtype: float64
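To align this plain difference back to df1 and label the unmatched rows, a short sketch mirroring the first answer:
df1['diff'] = diff.reindex(df1.index).fillna('No Match')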
