I have a GroupBy object. I want to remove rows from the current group if the same row exists in the previous group. Say this is the (n-1)th group:
A B
0 foo 0
1 baz 1
2 foo 1
3 bar 1
And this is the n-th group:
A B
0 foo 2
1 foo 1
2 baz 1
3 baz 3
After dropping all duplicates, the result for the n-th group should be:
A B
0 foo 2
3 baz 3
EDIT:
I would like to achieve it without loop if possible
I am using merge with indicator here:
yourdf=dfn.merge(df1,indicator=True,how='left').loc[lambda x : x['_merge']!='both']
yourdf
A B _merge
0 foo 2 left_only
3 baz 3 left_only
#yourdf.drop('_merge', axis=1, inplace=True)
Since it is a GroupBy object, you can wrap the code above in a for loop and apply it to each pair of consecutive groups.
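A sketch of that loop, assuming the groups come from a groupby on a hypothetical `grp` column (the column name and the data here are illustrative, not from the question):

```python
import pandas as pd

# Illustrative frame: 'grp' marks consecutive groups (an assumed column name)
df = pd.DataFrame({
    'grp': [1, 1, 1, 1, 2, 2, 2, 2],
    'A': ['foo', 'baz', 'foo', 'bar', 'foo', 'foo', 'baz', 'baz'],
    'B': [0, 1, 1, 1, 2, 1, 1, 3],
})

groups = [g for _, g in df.groupby('grp')]
cleaned = [groups[0]]  # the first group has no predecessor
for prev, cur in zip(groups, groups[1:]):
    # a left merge with indicator marks rows also present in the previous group
    m = cur[['A', 'B']].merge(prev[['A', 'B']], indicator=True, how='left')
    cleaned.append(m.loc[m['_merge'] == 'left_only', ['A', 'B']])
```

With the sample groups above, the second entry of `cleaned` keeps only `(foo, 2)` and `(baz, 3)`, matching the expected output.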
Given the dataframe
Column1 Column2 Column3
0 a foo 1
1 a bar 2
2 b baz 12
3 b foo 4
4 c bar 6
5 c foo 3
6 c baz 7
7 d foo 9
I'd like to group by Column1, using an arbitrary order of precedence on the Column2 values to decide which row's Column3 value to keep.
For example, if the order of precedence is:
baz
bar
foo
then I would expect the output to show as
Column3
Column1
a 2
b 12
c 7
d 9
with the "a" group keeping the "bar" value because there is no "baz" for the "a" group, "b" group keeping the "baz" value, and so on.
What's the most elegant way to do this? Right now I'm applying a series of apply lambdas to work through each item, but it feels sloppy.
EDIT:
What if the precedence goes across multiple columns?
Ex.
Column1 Column2 Column3 Column4
0 a foo john 1
1 a bar jim 2
2 b baz jack 12
3 b foo jim 4
4 c bar john 6
5 c foo john 3
6 c baz jack 7
7 d foo jack 9
If the order of precedence across both Column2 and Column3 is:
jim
baz
foo
then I would expect the output to show as
Column2 Column3
Column1
a jim 2
b jim 4
c baz 7
d foo 9
You can try the below logic with map, then groupby + transform:
order = ['baz', 'bar', 'foo']
# map each value to its precedence rank (lower rank = higher precedence)
d = {v: k for k, v in enumerate(order)}
out = df.assign(k=df['Column2'].map(d))
# keep rows whose rank equals the group-wise minimum
print(df[out['k'].eq(out.groupby('Column1')['k'].transform('min'))])
Column1 Column2 Column3
1 a bar 2
2 b baz 12
6 c baz 7
7 d foo 9
EDIT: for multiple columns, using the same logic as above, here is one way:
order = ['jim', 'baz', 'foo']
d = {v: k for k, v in enumerate(order)}
# rank both columns, then take the best (lowest) rank per row
s = df[['Column2', 'Column3']].replace(d).apply(pd.to_numeric, errors='coerce').min(1)
# keep rows matching the group-wise best rank, then map ranks back to labels
out = (s[s.eq(s.groupby(df['Column1']).transform('min'))]
         .replace(dict(enumerate(order))).rename('Col'))
df.loc[out.index, ['Column1', 'Column4']].join(out)
Column1 Column4 Col
1 a 2 jim
3 b 4 jim
6 c 7 baz
7 d 9 foo
If you have an order for all values in 'Column2', you can set it as the index and use loc to impose your custom order, then drop_duplicates to keep only the highest precedence.
order = ['baz', 'bar', 'foo']
df.set_index('Column2').loc[order].drop_duplicates('Column1')
Column1 Column3
Column2
baz b 12
baz c 7
bar a 2
foo d 9
In your second case, where the precedence spans multiple columns, we first melt so that Column2 and Column3 are stacked into one long Series; the rest follows the same as above:
order = ['jim', 'baz', 'foo']
(df.melt(id_vars=['Column4', 'Column1'], value_vars=['Column2', 'Column3'])
.drop(columns='variable')
.set_index('value')
.loc[order]
.drop_duplicates('Column1')
)
Column4 Column1
value
jim 2 a
jim 4 b
baz 7 c
foo 9 d
You can try converting Column2 to categorical:
df['Column2'] = pd.Categorical(df['Column2'], ordered=True, categories=['baz','bar','foo'])
df.sort_values(['Column1','Column2']).drop_duplicates('Column1')
Output:
Column1 Column2 Column3
1 a bar 2
2 b baz 12
6 c baz 7
7 d foo 9
Given this DF:
a b c d
1 2 1 4
4 3 4 2
foo bar foo yes
What is the best way to delete duplicate columns that have different names in a large pandas DF? For example:
a b d
1 2 4
4 3 2
foo bar yes
Column c was removed from the above dataframe because a and c were the same column but with different names. So far I tried:
df = df.iloc[:, ~df.columns.duplicated()]
However, that only checks for duplicated column names; it is not clear to me how to compare the row values inside the DF.
Use transpose, as below:
df.T.drop_duplicates().T
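One caveat (my note, not part of the original answer): transposing a mixed-dtype frame upcasts everything to object, so you may want to re-infer dtypes afterwards:

```python
import pandas as pd

# The sample frame from the question
df = pd.DataFrame({'a': [1, 4, 'foo'], 'b': [2, 3, 'bar'],
                   'c': [1, 4, 'foo'], 'd': [4, 2, 'yes']})

# Drop duplicate rows of the transpose (i.e. duplicate columns), then
# restore the orientation and let pandas re-infer column dtypes
out = df.T.drop_duplicates().T.infer_objects()
```

Here column c is dropped because it duplicates column a.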
I tried a straightforward approach: loop through the column names and compare each column with the rest, using np.all for an exact match. This approach took only 336 ms.
repeated_columns = []
for i, column in enumerate(df.columns):
    r_columns = df.columns[i+1:]
    for r_c in r_columns:
        if np.all(df[column] == df[r_c]):
            repeated_columns.append(r_c)

new_columns = [x for x in df.columns if x not in repeated_columns]
df[new_columns]
It will give you the following output:
a b d
0 1 2 4
1 4 3 2
2 foo bar yes
df.loc[:,~df.T.duplicated()]
a b d
0 1 2 4
1 4 3 2
2 foo bar yes
I have data with >100k rows and I need to efficiently regroup it from the left DataFrame to the multi-indexed right one, whose indices are sorted by the sum of values in the 3rd column; within each index, the 2nd-column values are sorted by the 3rd-column values. All sorts are descending.
I have no idea how to do it correctly and have already spent a whole day figuring it out.
a b c a sum b c %
foo one 1 foo 5 one 3 3/5
foo two 2 two 2 2/5
bar one 1 => baz 4 two 3 3/4
baz one 1 one 1 1/4
baz two 3 bar 3 six 2 2/3
foo one 2 one 1 1/3
bar six 2
UPDATE:
The code given by @jezrael works really well, but it outputs this:
%
a sum b c
foo 5 one 3 0.60
two 2 0.40
six NaN NaN
baz 4 two 3 0.75
one 1 0.25
six NaN NaN
bar 1 one 1 1.00
two NaN NaN
six NaN NaN
Is it possible to get rid of these rows with NaN?
UPDATE #2:
I've found what causes the NaN problem: the 'category' data type. How it affects the behavior of the code I don't know; I'm just pointing out the cause.
I believe you need:
#aggregate sum by a, b columns
df = df.groupby(['a','b'], as_index=False)['c'].sum()
print (df)
a b c
0 bar one 1
1 baz one 1
2 baz two 3
3 foo one 3
4 foo two 2
#create new column by position with transform sum per a column
df.insert(1, 'sum', df.groupby('a')['c'].transform('sum'))
#division of columns
df['%'] = df['c'].div(df['sum'])
print (df)
a sum b c %
0 bar 1 one 1 1.00
1 baz 4 one 1 0.25
2 baz 4 two 3 0.75
3 foo 5 one 3 0.60
4 foo 5 two 2 0.40
#sorting by multiple columns and create MultiIndex
df = df.sort_values(['sum','c'], ascending=False).set_index(['a','sum','b', 'c'])
print (df)
%
a sum b c
foo 5 one 3 0.60
two 2 0.40
baz 4 two 3 0.75
one 1 0.25
bar 1 one 1 1.00
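Regarding the NaN rows in the update: when the grouping columns have 'category' dtype, groupby emits every category combination, including unobserved ones. Passing observed=True (a standard groupby parameter) should drop them; a minimal sketch with made-up data:

```python
import pandas as pd

# Small illustrative frame with categorical grouping columns
df = pd.DataFrame({'a': ['foo', 'foo', 'bar'],
                   'b': ['one', 'two', 'one'],
                   'c': [3, 2, 1]})
df[['a', 'b']] = df[['a', 'b']].astype('category')

# observed=True keeps only category combinations that actually occur,
# avoiding the extra rows produced for unobserved pairs like (bar, two)
res = df.groupby(['a', 'b'], as_index=False, observed=True)['c'].sum()
```

With observed=True the result has one row per observed (a, b) pair; without it, the unobserved (bar, two) pair appears as well.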
I have a DataFrame with two columns "A" and "B".
A B
0 foo one
1 bar one
2 foo two
3 bar one
4 foo two
5 bar two
6 foo one
7 foo one
8 xyz one
For each group in "A", I'm trying to get the count of each value of "B", i.e. each sub-group of B, but aggregated on the grouping of "A".
The result should look like this:
A B countOne countTwo
0 foo one 3 2
1 bar one 2 1
2 foo two 3 2
3 bar one 2 1
4 foo two 3 2
5 bar two 2 1
6 foo one 3 2
7 foo one 3 2
8 xyz one 1 0
I have tried several approaches to no avail, so far I'm using this approach:
A_grouped = df.groupby(['A', 'B'])['A'].count()
A_grouped_ones = A_grouped[:,'one']
A_grouped_twos = A_grouped[:,'two']
df['countOne'] = df['A'].map(lambda a: A_grouped_ones[a] if a in A_grouped_ones else 0)
df['countTwo'] = df['A'].map(lambda a: A_grouped_twos[a] if a in A_grouped_twos else 0)
However, this seems horribly inefficient to me. Is there a better solution?
You can use unstack with add_prefix to build a new DataFrame, then join it to the original:
df1 = df.groupby(['A', 'B'])['A'].count().unstack(fill_value=0).add_prefix('count_')
print (df1)
B count_one count_two
A
bar 2 1
foo 3 2
xyz 1 0
df = df.join(df1, on='A')
print (df)
A B count_one count_two
0 foo one 3 2
1 bar one 2 1
2 foo two 3 2
3 bar one 2 1
4 foo two 3 2
5 bar two 2 1
6 foo one 3 2
7 foo one 3 2
8 xyz one 1 0
Another alternative is to use size:
df1 = df.groupby(['A', 'B']).size().unstack(fill_value=0).add_prefix('count_')
The difference is that size includes NaN values while count does not - check this answer.
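A quick illustration of that difference, on a tiny made-up frame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': ['x', 'x', 'y'], 'B': ['one', np.nan, 'one']})

sizes = df.groupby('A').size()         # counts rows, NaN included
counts = df.groupby('A')['B'].count()  # counts non-NaN values only
```

For group 'x', size reports 2 rows while count reports only the 1 non-NaN value of B.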
How do I draw a sample (say, 10% randomly, or alternatively every nth row) from within each group of a dataframe?
e.g. when grouping by 'name':
name a b
foo 1 1
foo 4 1
foo 3 3
bar 2 1
bar 3 7
bar 4 3
bar 1 2
I want to get something like:
name a b
foo 4 1
bar 3 7
bar 1 2
many thanks
You can use groupby to group by your name column and then apply sample to randomly get samples from the subgroups.
First, let's see the dummy data:
print(df)
name a b
0 foo 1 1
1 foo 4 1
2 foo 3 3
3 bar 2 1
4 bar 3 7
5 bar 4 3
6 bar 1 2
fraction defines the fraction of each group to sample. It is set to 0.5 here for your small dummy data set:
fraction = 0.5
result = df.groupby("name", group_keys=False).apply(lambda x: x.sample(frac=fraction))
print(result)
name a b
3 bar 2 1
6 bar 1 2
0 foo 1 1
2 foo 3 3
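The question also asked for every nth row per group. One way (my sketch, not part of the answer above) is cumcount, which numbers rows within each group and lets you select every nth without apply:

```python
import pandas as pd

# The dummy data from the question
df = pd.DataFrame({'name': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar', 'bar'],
                   'a': [1, 4, 3, 2, 3, 4, 1],
                   'b': [1, 1, 3, 1, 7, 3, 2]})

n = 2
# cumcount gives the 0-based position of each row within its group;
# keeping positions divisible by n selects every nth row per group
nth = df[df.groupby('name').cumcount() % n == 0]
```

This preserves the original row order, which apply-based approaches may not.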