My df looks like:
group value
A 1
B 1
A 1
B 1
B 0
B 0
A 0
I want to create a df
value 0 1
group
A a b
B c d
where a,b,c,d are the counts of 0s and 1s in groups A and B respectively.
I tried df.groupby('group').size(), but that gave an overall count per group and did not split the 0s and 1s. I tried a groupby count method too, but have not been able to achieve the target data frame.
Use pd.crosstab:
pd.crosstab(df['group'], df['value'])
Output:
value 0 1
group
A 1 2
B 2 2
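If you prefer to stay with groupby, the step missing from the size() attempt is unstacking the value level into columns. A small sketch on the same df (the fill_value=0 only matters if a group happens to lack one of the values):
# count each (group, value) pair, then pivot the value level into columns
res = df.groupby(['group', 'value']).size().unstack(fill_value=0)
print(res)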
Use pd.pivot_table for this:
res = pd.pivot_table(df, index='group', columns='value', aggfunc='size')
>>> print(res)
value 0 1
group
A 1 2
B 2 2
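A side note (an assumption about data this answer has not seen): if some group never contains one of the values, aggfunc='size' leaves NaN in that cell; pivot_table's fill_value parameter keeps the counts as integers:
res = pd.pivot_table(df, index='group', columns='value', aggfunc='size', fill_value=0)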
I'm using PySpark. I have two dataframes, call them df1 and df2. I want to add a new column to df1 that flags which rows of df1's columns (A, B) exist and do not exist in df2's columns (D, E): 1 marking existence and 0 otherwise. An example of the transformation is:
df1

A  B  C
0  0  1
0  0  1
0  0  1

df2

D  E  F  G
1  2  1  2
0  0  1  2
1  2  1  2

Resulting df1

A  B  C  Exist
0  0  1  0
0  0  1  1
0  0  1  0
The focus columns from df1 are A, B and from df2 are D, E. Only the second row of these columns matches, so only that row of df1 has its newly created Exist column marked as 1.
How can I achieve this?
df1.createOrReplaceTempView("table1")
df2.createOrReplaceTempView("table2")
spark.sql("""
    SELECT a, b, c,
           CASE WHEN d IS NULL AND e IS NULL THEN 0 ELSE 1 END AS exist
    FROM table1
    LEFT OUTER JOIN table2
      ON a = d AND b = e
""").show()
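For reference, a DataFrame-API version of the same left join is sketched below; the dropDuplicates on (D, E) is an assumption to guard against df2 containing repeated key pairs, which would otherwise multiply rows of df1:
from pyspark.sql import functions as F

# keep only the join keys from df2, deduplicated so the left join cannot duplicate df1 rows
df2_keys = df2.select("D", "E").dropDuplicates()

result = (
    df1.join(df2_keys, (df1["A"] == df2_keys["D"]) & (df1["B"] == df2_keys["E"]), "left")
       .withColumn("Exist", F.when(F.col("D").isNull(), 0).otherwise(1))
       .drop("D", "E")
)
result.show()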
I've got a pd.DataFrame with four columns:
df = pd.DataFrame({'id':[1,1,1,1,1,2,2,2,2]
, 'A':['H','H','E','E','H','E','E','H','H']
, 'B':[4,5,2,7,6,1,3,1,0]
, 'C':['M','D','M','D','M','M','M','D','D']})
id A B C
0 1 H 4 M
1 1 H 5 D
2 1 E 2 M
3 1 E 7 D
4 1 H 6 M
5 2 E 1 M
6 2 E 3 M
7 2 H 1 D
8 2 H 0 D
I'd like to group by id and, for each id, get the value of B at the nth (let's say second) occurrence of A == 'H' in agg_B1, and the value of B at the nth (let's say first) occurrence of C == 'M' in agg_B2.
desired output:
id agg_B1 agg_B2
0 1 5 4
1 2 0 1
desired_output = df.groupby('id').agg(
    agg_B1=('B', lambda x: x[df.loc[x.index].loc[df.A == 'H'][1]]),
    agg_B2=('B', lambda x: x[df.loc[x.index].loc[df.C == 'M'][0]])
).reset_index()
TypeError: Indexing a Series with DataFrame is not supported, use the appropriate DataFrame column
Obviously, I'm doing something wrong with the indexing.
Edit: if possible, I'd like to use aggregate with lambda function, because there are multiple aggregate outputs of other sorts that I'd like to extract at the same time.
Your solution is possible with a small change if you need GroupBy.agg:
desired_output = df.groupby('id').agg(
    agg_B1=('B', lambda x: x[df.loc[x.index, 'A'] == 'H'].iat[1]),
    agg_B2=('B', lambda x: x[df.loc[x.index, 'C'] == 'M'].iat[0])
).reset_index()
print (desired_output)
id agg_B1 agg_B2
0 1 5 4
1 2 0 1
But if performance is important, or if you are not sure that a second value matching 'H' always exists for the first condition, I suggest processing each condition separately and then joining the results to the other aggregated values:
#some sample aggregations
df0 = df.groupby('id').agg({'B':'sum', 'C':'last'})
df1 = df[df['A'].eq('H')].groupby("id")['B'].nth(1).rename('agg_B1')
df2 = df[df['C'].eq('M')].groupby("id")['B'].first().rename('agg_B2')
desired_output = pd.concat([df0, df1, df2], axis=1)
print (desired_output)
B C agg_B1 agg_B2
id
1 24 M 5 4
2 5 D 0 1
EDIT1: If you need GroupBy.agg, it is possible to test whether the indexing fails and then add a missing value instead:
# the second value exists in the sample, so this works nicely
import numpy as np

def f1(x):
    try:
        return x[df.loc[x.index, 'A'] == 'H'].iat[1]
    except IndexError:
        return np.nan

desired_output = df.groupby('id').agg(
    agg_B1=('B', f1),
    agg_B2=('B', lambda x: x[df.loc[x.index, 'C'] == 'M'].iat[0])
).reset_index()
print (desired_output)
id agg_B1 agg_B2
0 1 5 4
1 2 0 1
# a third value does not exist, so NaN is added as the missing value
def f1(x):
    try:
        return x[df.loc[x.index, 'A'] == 'H'].iat[2]
    except IndexError:
        return np.nan

desired_output = df.groupby('id').agg(
    agg_B1=('B', f1),
    agg_B2=('B', lambda x: x[df.loc[x.index, 'C'] == 'M'].iat[0])
).reset_index()
print (desired_output)
id agg_B1 agg_B2
0 1 6.0 4
1 2 NaN 1
Which works the same as:
df1 = df[df['A'].eq('H')].groupby("id")['B'].nth(2).rename('agg_B1')
df2 = df[df['C'].eq('M')].groupby("id")['B'].first().rename('agg_B2')
desired_output = pd.concat([df1, df2], axis=1)
print (desired_output)
agg_B1 agg_B2
id
1 6.0 4
2 NaN 1
Filter for rows where A equals 'H', then grab the second row with the nth function:
df.query("A=='H'").groupby("id").nth(1)
A B
id
1 H 5
2 H 0
Python uses zero-based indexing, so the second row is nth(1).
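To also fill agg_B2 from the desired output, the same filter-then-nth pattern applies to C == 'M'; a sketch assuming the GroupBy.nth behaviour shown above, where the result is indexed by id:
agg_B1 = df.query("A == 'H'").groupby("id")["B"].nth(1)
agg_B2 = df.query("C == 'M'").groupby("id")["B"].nth(0)
pd.concat([agg_B1.rename("agg_B1"), agg_B2.rename("agg_B2")], axis=1).reset_index()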
So the data looks like this:
type value
A 0
A 6
B 5
C 0
A 3
C 0
I want to get the number of zeros in value column for each type in type column. Preferably in a new dataframe. So it would look like this:
type zero_count
A 1
B 0
C 2
What's the most efficient way to do this?
Compare the column with Series.eq (the == operator), convert the booleans to integers 0/1 with Series.view or Series.astype, and then aggregate by the df['type'] column with sum:
df1 = df['value'].eq(0).view('i1').groupby(df['type']).sum().reset_index(name='zero_count')
df1 = df['value'].eq(0).astype(int).groupby(df['type']).sum().reset_index(name='zero_count')
print (df1)
type zero_count
0 A 1
1 B 0
2 C 2
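If you prefer a single chained expression, the boolean column can also be added with assign and summed per group; a sketch on the same df, where is_zero is just an illustrative helper name:
df1 = (df.assign(is_zero=df['value'].eq(0))
         .groupby('type', as_index=False)['is_zero'].sum()
         .rename(columns={'is_zero': 'zero_count'}))
print(df1)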
So, I have the following dataframe:
id value
0 a 1
1 a 1
2 a 2
3 b 3
4 b 3
I want to delete the rows whose value is not the minimum for their id. For example, for rows with id 'a' the minimum value is 1, so the row with value 2 would be deleted; for id 'b' the minimum value is 3, so no rows would be deleted.
Output:
id value
0 a 1
1 a 1
2 b 3
3 b 3
So far, I've only grouped the rows with the same id and found their lowest values, but couldn't find a way to delete the expected rows.
I've used the following command:
min_values = df.loc[df.groupby(['id'])['value'].idxmin()]['value']
Using transform (idxmin will only return the first index of the min value; in your case you have duplicates, so it would not return all the indices):
df[df.value==df.groupby('id').value.transform('min')]
Out[257]:
id value
0 a 1
1 a 1
3 b 3
4 b 3
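For intuition, transform('min') broadcasts each group's minimum back onto every row, so the boolean mask lines up with the original index; a small sketch on the same df:
group_min = df.groupby('id')['value'].transform('min')
print(group_min.tolist())   # [1, 1, 1, 3, 3]
df[df['value'] == group_min]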