Aggregating using arbitrary precedence in pandas - python

Given the dataframe
Column1 Column2 Column3
0 a foo 1
1 a bar 2
2 b baz 12
3 b foo 4
4 c bar 6
5 c foo 3
6 c baz 7
7 d foo 9
I'd like to group by Column1, using an arbitrary order of precedence on the Column2 labels to decide which row's Column3 value to keep.
For example, if the order of precedence is:
baz
bar
foo
then I would expect the output to show as
Column3
Column1
a 2
b 12
c 7
d 9
with the "a" group keeping the "bar" value because there is no "baz" for the "a" group, "b" group keeping the "baz" value, and so on.
What's the most elegant way to do this? Right now I'm applying a series of apply lambdas to work through each item, but it feels sloppy.
EDIT:
What if the precedence goes across multiple columns?
Ex.
Column1 Column2 Column3 Column4
0 a foo john 1
1 a bar jim 2
2 b baz jack 12
3 b foo jim 4
4 c bar john 6
5 c foo john 3
6 c baz jack 7
7 d foo jack 9
If the order of precedence across both Column2 and Column3 is:
jim
baz
foo
then I would expect the output to show as
Column2 Column3
Column1
a jim 2
b jim 4
c baz 7
d foo 9

You can try the logic below with map then groupby + transform:
order = ['baz','bar','foo']
d = {v: k for k, v in enumerate(order)}   # label -> precedence rank (lower = higher precedence)
out = df.assign(k=df['Column2'].map(d))   # attach each row's rank
# keep the rows whose rank equals the minimum rank within their Column1 group
print(df[out['k'].eq(out.groupby("Column1")['k'].transform("min"))])
Column1 Column2 Column3
1 a bar 2
2 b baz 12
6 c baz 7
7 d foo 9
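If you want to collapse this to a single Column3 value per group, as in the expected output, a small follow-up (a sketch reusing df and out from above) could be:
kept = df[out['k'].eq(out.groupby('Column1')['k'].transform('min'))]   # same filter as above
result = kept.set_index('Column1')['Column3']                          # Series indexed by Column1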
EDIT: for multiple columns, using the same logic as above, here is a way:
order = ['jim','baz','foo']
d = {i: e for e, i in enumerate(order)}   # label -> precedence rank
# rank both columns, coerce unranked labels to NaN, and take the best (lowest) rank per row
s = df[['Column2','Column3']].replace(d).apply(pd.to_numeric, errors='coerce').min(axis=1)
# keep rows whose rank is the group minimum, then map the ranks back to labels
out = (s[s.eq(s.groupby(df['Column1']).transform("min"))]
         .replace(dict(enumerate(order))).rename("Col"))
df.loc[out.index, ["Column1","Column4"]].join(out)
Column1 Column4 Col
1 a 2 jim
3 b 4 jim
6 c 7 baz
7 d 9 foo
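If you want the layout shown in the question (indexed by Column1), a small follow-up sketch on top of out from above:
result = (df.loc[out.index, ['Column1', 'Column4']]
            .join(out)
            .set_index('Column1'))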

If you have an order for all values in 'Column2' you can use loc after setting the index to impose your custom order, then drop_duplicates to keep only the highest precedence.
order = ['baz', 'bar', 'foo']
df.set_index('Column2').loc[order].drop_duplicates('Column1')
Column1 Column3
Column2
baz b 12
baz c 7
bar a 2
foo d 9
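To reduce that to the single kept Column3 value per Column1 group, as in the expected output, you could finish with (a small sketch under the same assumption that order covers every Column2 value):
result = (df.set_index('Column2')
            .loc[order]
            .drop_duplicates('Column1')
            .set_index('Column1')['Column3'])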
In your second case, where the precedence runs across multiple columns, we first melt so that Column2 and Column3 are stacked into one long Series, and the rest follows the same approach as above:
order = ['jim', 'baz', 'foo']
(df.melt(id_vars=['Column4', 'Column1'], value_vars=['Column2', 'Column3'])
   .drop(columns='variable')
   .set_index('value')
   .loc[order]
   .drop_duplicates('Column1')
)
Column4 Column1
value
jim 2 a
jim 4 b
baz 7 c
foo 9 d

You can try converting Column2 to categorical:
df['Column2'] = pd.Categorical(df['Column2'], ordered=True, categories=['baz','bar','foo'])
df.sort_values(['Column1','Column2']).drop_duplicates('Column1')
Output:
Column1 Column2 Column3
1 a bar 2
2 b baz 12
6 c baz 7
7 d foo 9
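Since the categories are ordered, sorting by Column2 puts the highest-precedence label first within each Column1 group, so drop_duplicates keeps exactly that row. To match the shape of the expected output you could then finish with (a small sketch):
result = (df.sort_values(['Column1', 'Column2'])
            .drop_duplicates('Column1')
            .set_index('Column1')['Column3'])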

Related

How to optimize turning a group of wide pandas columns into two long pandas columns

I have a process that takes a dataframe and turns a set of wide pandas columns into two long pandas columns, like so:
original wide:
wide = pd.DataFrame(
    {
        'id': ['foo'],
        'a': [1],
        'b': [2],
        'c': [3],
        'x': [4],
        'y': [5],
        'z': [6]
    }
)
wide
id a b c x y z
0 foo 1 2 3 4 5 6
desired long:
lon = pd.DataFrame(
    {
        'id': ['foo', 'foo', 'foo', 'foo', 'foo', 'foo'],
        'type': ['a', 'b', 'c', 'x', 'y', 'z'],
        'val': [1, 2, 3, 4, 5, 6]
    }
)
lon
id type val
0 foo a 1
1 foo b 2
2 foo c 3
3 foo x 4
4 foo y 5
5 foo z 6
I found a way to do this by chaining the following pandas operations:
(wide
 .set_index('id')
 .T
 .unstack()
 .reset_index()
 .rename(columns={'level_1': 'type', 0: 'val'})
)
id type val
0 foo a 1
1 foo b 2
2 foo c 3
3 foo x 4
4 foo y 5
5 foo z 6
But when I scale up to my real data this approach starts posing performance issues. I'm looking for an alternative to what I've already accomplished that is faster/more computationally efficient.
I think you are looking for the pandas melt function.
Assuming your original dataframe is called wide, then:
df = pd.melt(wide, id_vars="id")
df.columns = ['id', 'type', 'val']
print(df)
Output:
id type val
0 foo a 1
1 foo b 2
2 foo c 3
3 foo x 4
4 foo y 5
5 foo z 6
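As a minor variation (not in the original answer), melt can also name the output columns directly via its var_name and value_name parameters, which avoids the separate rename step:
df = pd.melt(wide, id_vars='id', var_name='type', value_name='val')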

Remove rows if they exist in the previous group

I have a GroupBy object. I want to remove rows from the current group if the same row exists in the previous group. Let's say this is the (n-1)th group:
A B
0 foo 0
1 baz 1
2 foo 1
3 bar 1
And this is the n-th group:
A B
0 foo 2
1 foo 1
2 baz 1
3 baz 3
After dropping all duplicates, the result for the n-th group:
A B
0 foo 2
3 baz 3
EDIT:
I would like to achieve it without loop if possible
I am using merge with indicator here
yourdf = dfn.merge(df1, indicator=True, how='left').loc[lambda x: x['_merge'] != 'both']
yourdf
A B _merge
0 foo 2 left_only
3 baz 3 left_only
# yourdf = yourdf.drop(columns='_merge')
Since it is a GroupBy object, you can do this with a for loop, applying the code above to each consecutive pair of groups.
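A sketch of that loop (the grouped name and the group ordering are assumptions, not part of the original answer):
groups = [g for _, g in grouped]           # materialize the groups in order
cleaned = [groups[0]]                      # nothing precedes the first group
for prev, curr in zip(groups, groups[1:]):
    kept = (curr.merge(prev.drop_duplicates(), indicator=True, how='left')
                .loc[lambda x: x['_merge'] != 'both']
                .drop(columns='_merge'))
    cleaned.append(kept)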

Select rows from a Pandas DataFrame with same values in one column but different value in the other column

Say I have the pandas DataFrame below:
A B C D
1 foo one 0 0
2 foo one 2 4
3 foo two 4 8
4 cat one 8 4
5 bar four 6 12
6 bar three 7 14
7 bar four 7 14
I would like to select all the rows that have equal values in A but differing values in B. So I would like the output of my code to be:
A B C D
1 foo one 0 0
3 foo two 4 8
5 bar three 7 14
6 bar four 7 14
What's the most efficient way to do this? I have approximately 11,000 rows with a lot of variation in the column values, but this situation comes up a lot. In my dataset, if elements in column A are equal then the corresponding column B values should also be equal; however, due to mislabeling this is not the case and I would like to fix it. It would be impractical to do this one by one.
You can try groupby() + filter + drop_duplicates():
>>> df.groupby('A').filter(lambda g: len(g) > 1).drop_duplicates(subset=['A', 'B'], keep="first")
A B C D
0 foo one 0 0
2 foo two 4 8
4 bar four 6 12
5 bar three 7 14
OR, in case you only want to drop duplicates over the subset of columns A and B, you can use the below, but note it will also keep the row with cat:
>>> df.drop_duplicates(subset=['A', 'B'], keep="first")
A B C D
0 foo one 0 0
2 foo two 4 8
3 cat one 8 4
4 bar four 6 12
5 bar three 7 14
Use groupby + filter + head:
result = df.groupby('A').filter(lambda g: len(g) > 1).groupby(['A', 'B']).head(1)
print(result)
Output
A B C D
0 foo one 0 0
2 foo two 4 8
4 bar four 6 12
5 bar three 7 14
The first groupby and filter remove the rows whose A value is not duplicated (i.e. cat); the second groups by A and B and, for each of those groups, takes the first element.
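An alternative along the same lines (a different sketch, not from the answers above) is to keep only rows whose A group contains more than one distinct B, then drop duplicate (A, B) pairs:
mask = df.groupby('A')['B'].transform('nunique') > 1   # A values whose B varies
result = df[mask].drop_duplicates(['A', 'B'])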
The current answers are correct and may be more sophisticated too. If you have complex criteria, the filter function will be very useful. If you are like me and want to keep things simple, I feel the following is a more beginner-friendly way:
>>> df = pd.DataFrame({
        'A': ['foo', 'foo', 'foo', 'cat', 'bar', 'bar', 'bar'],
        'B': ['one', 'one', 'two', 'one', 'four', 'three', 'four'],
        'C': [0, 2, 4, 8, 6, 7, 7],
        'D': [0, 4, 8, 4, 12, 14, 14]
    }, index=[1, 2, 3, 4, 5, 6, 7])
>>> df = df.drop_duplicates(['A', 'B'], keep='last')
A B C D
2 foo one 2 4
3 foo two 4 8
4 cat one 8 4
6 bar three 7 14
7 bar four 7 14
>>> df = df[df.duplicated(['A'], keep=False)]
A B C D
2 foo one 2 4
3 foo two 4 8
6 bar three 7 14
7 bar four 7 14
keep='last' is optional here

Get the count for each subgroup in a multiple-grouped pandas.DataFrame aggregated on one group

I have a DataFrame with two columns "A" and "B".
A B
0 foo one
1 bar one
2 foo two
3 bar one
4 foo two
5 bar two
6 foo one
7 foo one
8 xyz one
For each group in "A", I'm trying to get the count of each value of "B", i.e. each sub-group of B, but aggregated on the grouping of "A".
The result should look like this:
A B countOne countTwo
0 foo one 3 2
1 bar one 2 1
2 foo two 3 2
3 bar one 2 1
4 foo two 3 2
5 bar two 2 1
6 foo one 3 2
7 foo one 3 2
8 xyz one 1 0
I have tried several approaches to no avail; so far I'm using this approach:
A_grouped = df.groupby(['A', 'B'])['A'].count()
A_grouped_ones = A_grouped[:,'one']
A_grouped_twos = A_grouped[:,'two']
df['countOne'] = df['A'].map(lambda a: A_grouped_ones[a] if a in A_grouped_ones else 0)
df['countTwo'] = df['A'].map(lambda a: A_grouped_twos[a] if a in A_grouped_twos else 0)
However, this seems horribly inefficient to me. Is there a better solution?
You can use unstack with add_prefix to build a new DataFrame, then join it to the original:
df1 = df.groupby(['A', 'B'])['A'].count().unstack(fill_value=0).add_prefix('count_')
print (df1)
B count_one count_two
A
bar 2 1
foo 3 2
xyz 1 0
df = df.join(df1, on='A')
print (df)
A B count_one count_two
0 foo one 3 2
1 bar one 2 1
2 foo two 3 2
3 bar one 2 1
4 foo two 3 2
5 bar two 2 1
6 foo one 3 2
7 foo one 3 2
8 xyz one 1 0
Another alternative is to use size:
df1 = df.groupby(['A', 'B']).size().unstack(fill_value=0).add_prefix('count_')
The difference is that size includes NaN values while count does not.
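Another option along the same lines (a separate suggestion, not part of the original answer) is pd.crosstab, which builds the same count table directly:
df1 = pd.crosstab(df['A'], df['B']).add_prefix('count_')
df = df.join(df1, on='A')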

Assigning value to new column ['E'] based on column ['A'] value using dataframes

In the example below I am trying to generate a column 'E' that is assigned either 1 or 2 depending on a conditional statement on column A.
I've tried various options but they throw a slicing error. Should it not be something like this to assign a value to the new column 'E'?
df2= df.loc[df['A'] == 'foo']['E'] = 1
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                   'B': 'one one two three two two one three'.split(),
                   'C': np.arange(8), 'D': np.arange(8) * 2})
print(df)
# A B C D
# 0 foo one 0 0
# 1 bar one 1 2
# 2 foo two 2 4
# 3 bar three 3 6
# 4 foo two 4 8
# 5 bar two 5 10
# 6 foo one 6 12
# 7 foo three 7 14
print('Filter the content')
df2= df.loc[df['A'] == 'foo']
print(df2)
# A B C D E
# 0 foo one 0 0 1
# 2 foo two 2 4 1
# 4 foo two 4 8 1
# 6 foo one 6 12 1
# 7 foo three 7 14 1
df3= df.loc[df['A'] == 'bar']
print(df3)
# A B C D E
# 1 bar one 1 2 2
# 3 bar three 3 6 2
# 5 bar two 5 10 2
# Combine df2 and df3 back into df and print df
print(df)
# A B C D E
# 0 foo one 0 0 1
# 1 bar one 1 2 2
# 2 foo two 2 4 1
# 3 bar three 3 6 2
# 4 foo two 4 8 1
# 5 bar two 5 10 2
# 6 foo one 6 12 1
# 7 foo three 7 14 1
What about simply this?
df['E'] = np.where(df['A'] == 'foo', 1, 2)
This does what I think you're trying to do. Create a column E in your dataframe that is 1 if A==foo, and 2 if A!=foo.
df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                   'B': 'one one two three two two one three'.split(),
                   'C': np.arange(8), 'D': np.arange(8) * 2})
df['E'] = np.ones(df.shape[0]) * 2          # default every row to 2
df.loc[df.A == 'foo', 'E'] = 1              # overwrite rows where A == 'foo'
df.E = df.E.astype(int)
print(df)
Note: Your suggested solution df2= df.loc[df['A'] == 'foo']['E'] = 1 uses serial slicing, rather than taking advantage of loc. To slice df rows by the first conditional and return the column E, you should instead use df.loc[df['A']=='foo','E']
Note II: If you have more than one conditional, you could also use .replace() and pass in a dictionary. In this case mapping foo to 1, bar to 2, and so on.
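For example (a tiny sketch of that approach; the mapping here is just illustrative):
mapping = {'foo': 1, 'bar': 2}          # illustrative mapping, extend as needed
df['E'] = df['A'].replace(mapping)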
for brevity (characters)
df.assign(E=df.A.ne('foo')+1)
A B C D E
0 foo one 0 0 1
1 bar one 1 2 2
2 foo two 2 4 1
3 bar three 3 6 2
4 foo two 4 8 1
5 bar two 5 10 2
6 foo one 6 12 1
7 foo three 7 14 1
for brevity (time)
df.assign(E=(df.A.values != 'foo') + 1)
A B C D E
0 foo one 0 0 1
1 bar one 1 2 2
2 foo two 2 4 1
3 bar three 3 6 2
4 foo two 4 8 1
5 bar two 5 10 2
6 foo one 6 12 1
7 foo three 7 14 1
