How to label duplicate groups in pandas? - python

I have a DataFrame:
>>> df
A
0 foo
1 bar
2 foo
3 baz
4 foo
5 bar
I need to find all the duplicate groups and label them with sequential dgroup_id's:
>>> df
A dgroup_id
0 foo 1
1 bar 2
2 foo 1
3 baz
4 foo 1
5 bar 2
(This means that foo belongs to the first group of duplicates, bar to the second group of duplicates, and baz is not duplicated.)
I did this:
import pandas as pd
df = pd.DataFrame({'A': ('foo', 'bar', 'foo', 'baz', 'foo', 'bar')})
duplicates = df.groupby('A').size()
duplicates = duplicates[duplicates>1]
# Yes, this is ugly, but I didn't know how to do it otherwise:
duplicates[duplicates.reset_index().index] = duplicates.reset_index().index
df.insert(1, 'dgroup_id', df['A'].map(duplicates))
This leads to:
>>> df
A dgroup_id
0 foo 1.0
1 bar 0.0
2 foo 1.0
3 baz NaN
4 foo 1.0
5 bar 0.0
Is there a simpler/shorter way to achieve this in pandas? I read that maybe pandas.factorize could be of help here, but I don't know how to use it... (the pandas documentation on this function is of no help)
Also: I don't mind either the 0-based group count or the weird sorting order, but I would like to have the dgroup_id's as ints, not floats.

You can make a list of the duplicated values with get_duplicates() and then set dgroup_id from each value's position in that list:
def find_index(string):
    if string in duplicates:
        return duplicates.index(string) + 1
    else:
        return 0

df = pd.DataFrame({'A': ('foo', 'bar', 'foo', 'baz', 'foo', 'bar')})
duplicates = df.set_index('A').index.get_duplicates()
df['dgroup_id'] = df['A'].apply(find_index)
df
Output:
A dgroup_id
0 foo 2
1 bar 1
2 foo 2
3 baz 0
4 foo 2
5 bar 1
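Note: Index.get_duplicates() was deprecated and has been removed in newer pandas releases. A minimal sketch of the same idea using Index.duplicated() instead (the list order, and therefore the group numbers, may differ from the output above):
idx = df.set_index('A').index
# unique values that occur more than once, replacing get_duplicates()
duplicates = idx[idx.duplicated()].unique().tolist()
df['dgroup_id'] = df['A'].apply(find_index)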

Use a chained operation to first get the value count for each A, compute a sequence number for each duplicated group, and then join back to the original DataFrame.
import numpy as np

(
    pd.merge(df,
             df.A.value_counts().apply(lambda x: 1 if x > 1 else np.nan)
               .cumsum().rename('dgroup_id').to_frame(),
             left_on='A', right_index=True)
      .sort_index()
)
Out[49]:
A dgroup_id
0 foo 1.0
1 bar 2.0
2 foo 1.0
3 baz NaN
4 foo 1.0
5 bar 2.0
If you need NaN for unique groups, you can't have int as the datatype, which is a pandas limitation at the moment. If you are OK with 0 for unique groups, you can do something like:
(
    pd.merge(df,
             df.A.value_counts().apply(lambda x: 1 if x > 1 else np.nan)
               .cumsum().rename('dgroup_id').to_frame().fillna(0).astype(int),
             left_on='A', right_index=True)
      .sort_index()
)
A dgroup_id
0 foo 1
1 bar 2
2 foo 1
3 baz 0
4 foo 1
5 bar 2
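Newer pandas versions also offer a nullable integer dtype ('Int64'), which lifts that limitation; a minimal sketch, assuming a recent pandas where astype('Int64') accepts floats with missing values:
dgroup = (df.A.value_counts()
            .apply(lambda x: 1 if x > 1 else np.nan)
            .cumsum()
            .rename('dgroup_id')
            .astype('Int64'))  # nullable integers: unique groups stay <NA>
pd.merge(df, dgroup.to_frame(), left_on='A', right_index=True).sort_index()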

Use duplicated to identify where dups are. Use where to replace singletons with ''. Use categorical to factorize.
dups = df.A.duplicated(keep=False)
df.assign(dgroup_id=df.A.where(dups, '').astype('category').cat.codes)
A dgroup_id
0 foo 2
1 bar 1
2 foo 2
3 baz 0
4 foo 2
5 bar 1
If you insist on the zeros being ''
dups = df.A.duplicated(keep=False)
df.assign(
    dgroup_id=df.A.where(dups, '').astype('category').cat.codes.replace(0, ''))
A dgroup_id
0 foo 2
1 bar 1
2 foo 2
3 baz
4 foo 2
5 bar 1
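Since the question mentions pandas.factorize, here is a minimal sketch of the same masking idea using it directly (1-based codes in order of appearance, 0 for non-duplicated values):
dups = df.A.duplicated(keep=False)
# factorize only the duplicated values; +1 makes the codes 1-based
df['dgroup_id'] = 0
df.loc[dups, 'dgroup_id'] = pd.factorize(df.loc[dups, 'A'])[0] + 1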

You could go for:
import pandas as pd
import numpy as np
df = pd.DataFrame(['foo', 'bar', 'foo', 'baz', 'foo', 'bar',], columns=['name'])
# Create the groups order
ordered_names = df['name'].drop_duplicates().tolist() # ['foo', 'bar', 'baz']
# Find index of each element in the ordered list
df['duplication_index'] = df['name'].apply(lambda x: ordered_names.index(x) + 1)
# Discard non-duplicated entries
df.loc[~df['name'].duplicated(keep=False), 'duplication_index'] = np.nan
print(df)
# name duplication_index
# 0 foo 1.0
# 1 bar 2.0
# 2 foo 1.0
# 3 baz NaN
# 4 foo 1.0
# 5 bar 2.0
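If integer codes are preferred, as in the original question, the NaN entries could be filled and the column cast afterwards; a small sketch where 0 marks the non-duplicated rows:
df['duplication_index'] = df['duplication_index'].fillna(0).astype(int)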

df = pd.DataFrame({'A': ('foo', 'bar', 'foo', 'baz', 'foo', 'bar')})
key_set = set(df['A'])
df_a = pd.DataFrame(list(key_set))
df_a['dgroup_id'] = df_a.index
result = pd.merge(df,df_a,left_on='A',right_on=0,how='left')
In [32]: result.drop(0,axis=1)
Out[32]:
A dgroup_id
0 foo 2
1 bar 0
2 foo 2
3 baz 1
4 foo 2
5 bar 0
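Note that this numbers every distinct value, including the non-duplicated baz, and the set order is arbitrary. A hedged variation that numbers only the duplicated values, assuming the same df as above:
dups = df['A'].duplicated(keep=False)
df_a = pd.DataFrame({'A': sorted(set(df.loc[dups, 'A']))})
df_a['dgroup_id'] = df_a.index
result = pd.merge(df, df_a, on='A', how='left')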

Related

How to add a list of strings to a new column in Pandas?

Given a string array, a = ['foo', 'bar', 'foo2'], how would you add this as a new column to an existing dataframe, df?
The shape of the df before adding:
a b
0 3 3
1 3 3
2 3 3
after adding:
a b new_column
0 3 3 foo
1 3 3 bar
2 3 3 foo2
Just assign it in.
>>> import pandas as pd
>>> df = pd.DataFrame({"a": [1,2,3], "b": [4,5,6]})
>>> df
a b
0 1 4
1 2 5
2 3 6
>>> df["c"] = ["foo", "bar", "foo2"]
>>> df
a b c
0 1 4 foo
1 2 5 bar
2 3 6 foo2
>>>
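The list has to contain exactly one element per row. As a small hedged aside, assigning a Series instead aligns on the index rather than on position, and assign() does the same without mutating the original:
df["c"] = pd.Series(["foo", "bar", "foo2"], index=df.index)  # aligned by index
df2 = df.assign(c=["foo", "bar", "foo2"])                    # returns a new DataFrame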

Python: dataframe manipulation and aggregation in pandas

I have the following df:
dfdict = {'letter': ['a', 'a', 'a', 'b', 'b'], 'category': ['foo', 'foo', 'bar', 'bar', 'spam']}
df1 = pd.DataFrame(dfdict)
category letter
0 foo a
1 foo a
2 bar a
3 bar b
4 spam b
I want it to output me an aggregated count df like this:
a b
foo 2 0
bar 1 1
spam 0 1
This seems like it should be an easy operation. I have figured out how to use
df1 = df1.groupby(['category','letter']).size() to get:
category letter
bar a 1
b 1
foo a 2
spam b 1
This is closer, except now I need the letters a, b along the top and the counts coming down.
You can use crosstab:
pd.crosstab(df1.category,df1.letter)
Out[554]:
letter a b
category
bar 1 1
foo 2 0
spam 0 1
To fix your code, add unstack:
df1.groupby(['category','letter']).size().unstack(fill_value=0)
Out[556]:
letter a b
category
bar 1 1
foo 2 0
spam 0 1
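A third option along the same lines, sketched with pivot_table (aggfunc='size' counts the rows that fall into each cell):
df1.pivot_table(index='category', columns='letter', aggfunc='size', fill_value=0)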

Creating a new column based on condition of an index

I have df:
arrays = [np.array(['bar', 'bar', 'bar', 'bar', 'bar', 'bar', 'baz', 'baz', 'baz', 'baz', 'baz', 'baz', 'foo', 'foo', 'foo', 'foo', 'foo', 'foo']),
          np.array(['one', 'two'] * 9),
          np.array([1, 2, 3] * 6)]
df = pd.DataFrame(np.random.randn(18, 2), index=arrays, columns=['col1', 'col2'])
col1 col2
bar one 1 0.872359 -1.115871
two 2 -0.937908 -0.528563
one 3 -0.118874 0.286595
two 1 -0.507698 1.364643
one 2 1.507611 1.379498
two 3 -1.398019 -1.603056
baz one 1 1.498263 0.412380
two 2 -0.930022 -1.483657
one 3 -0.438157 1.465089
two 1 0.161887 1.346587
one 2 0.167086 1.246322
two 3 0.276344 -1.206415
foo one 1 -0.045389 -0.759927
two 2 0.087999 -0.435753
one 3 -0.232054 -2.221466
two 1 -1.299483 1.697065
one 2 0.612211 -1.076738
two 3 -1.482573 0.907826
And now I want to create a 'NEW' column such that:
for 'bar':
    if index.level(2) > 1:
        "NEW" = col1
    else:
        "NEW" = col2
for 'baz': the same with > 2
for 'foo': the same with > 3
How can I do this without Python loops?
You can use get_level_values to select index values by level and then numpy.where to build the new column:
# if possible, use a dictionary
d = {'bar':1, 'baz':2, 'foo':3}
m = df.index.get_level_values(2) > df.rename(d).index.get_level_values(0)
df['NEW'] = np.where(m, df.col1, df.col2)
For a more general solution use Series.rank:
a = df.index.get_level_values(2)
b = df.index.get_level_values(0).to_series().rank(method='dense')
df['NEW'] = np.where(a > b, df.col1, df.col2)
Detail:
print (b)
bar 1.0
bar 1.0
bar 1.0
bar 1.0
bar 1.0
bar 1.0
baz 2.0
baz 2.0
baz 2.0
baz 2.0
baz 2.0
baz 2.0
foo 3.0
foo 3.0
foo 3.0
foo 3.0
foo 3.0
foo 3.0
dtype: float64
I think you can avoid looping over the level-0 values by using pd.factorize:
In [40]: thresh = pd.factorize(df.index.get_level_values(0).values)[0] + 1
In [41]: mask = df.index.get_level_values(2) > thresh
In [42]: df['NEW'] = np.where(mask, df.col1, df.col2)
In [43]: df
Out[43]:
col1 col2 NEW
bar one 1 0.247222 -0.270104 -0.270104
two 2 0.429196 -1.385352 0.429196
one 3 0.782293 -1.565623 0.782293
two 1 0.392214 1.023960 1.023960
one 2 -1.628410 -0.484275 -1.628410
two 3 0.256757 0.529373 0.256757
baz one 1 -0.568608 -0.776577 -0.776577
two 2 2.142408 -0.815413 -0.815413
one 3 0.860080 0.501965 0.860080
two 1 -0.267029 -0.025360 -0.025360
one 2 0.187145 -0.063436 -0.063436
two 3 0.351296 -2.050649 0.351296
foo one 1 0.704941 0.176698 0.176698
two 2 -0.380353 1.027745 1.027745
one 3 -1.337364 -0.568359 -0.568359
two 1 -0.588601 -0.800426 -0.800426
one 2 1.513358 -0.616237 -0.616237
two 3 0.244831 1.027109 1.027109
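An equivalent way to build the per-row threshold is to map the dictionary onto the level-0 values; a small sketch assuming d = {'bar': 1, 'baz': 2, 'foo': 3} as defined above and a pandas version where Index.map accepts a dict:
thresh = df.index.get_level_values(0).map(d)   # bar -> 1, baz -> 2, foo -> 3
df['NEW'] = np.where(df.index.get_level_values(2) > thresh, df.col1, df.col2)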

Unstack dataframe and keep columns

I have a DataFrame that is in an overly "compact" form. The DataFrame currently looks like this:
> import numpy as np
> import pandas as pd
> df = pd.DataFrame({'foo': ['A', 'B'],
                     'bar': ['1', '2'],
                     'baz': [np.nan, '3']})
bar baz foo
0 1 NaN A
1 2 3 B
And I need to "unstack" it to be like so:
> df = pd.DataFrame({'foo': ['A', 'B', 'B'],
                     'type': ['bar', 'bar', 'baz'],
                     'value': ['1', '2', '3']})
foo type value
0 A bar 1
1 B bar 2
2 B baz 3
No matter how I try to pivot, I can't get it right.
Use melt() method:
In [39]: pd.melt(df, id_vars='foo', value_vars=['bar','baz'], var_name='type')
Out[39]:
foo type value
0 A bar 1
1 B bar 2
2 A baz NaN
3 B baz 3
or
In [38]: pd.melt(df, id_vars='foo', value_vars=['bar','baz'], var_name='type').dropna()
Out[38]:
foo type value
0 A bar 1
1 B bar 2
3 B baz 3
set your index to foo, then stack:
df.set_index('foo').stack()
foo
A bar 1
B bar 2
baz 3
dtype: object
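To get from this stacked Series to the exact three-column layout asked for, a short sketch continuing from the code above:
out = (df.set_index('foo')
         .stack()                      # drops the NaN in 'baz' automatically
         .rename('value')
         .rename_axis(['foo', 'type'])
         .reset_index())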

Pandas count null values in a groupby function

df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C' : [np.nan, 'bla2', np.nan, 'bla3', np.nan, np.nan, np.nan, np.nan]})
Output:
A B C
0 foo one NaN
1 bar one bla2
2 foo two NaN
3 bar three bla3
4 foo two NaN
5 bar two NaN
6 foo one NaN
7 foo three NaN
I would like to use groupby to count the number of NaN's for the different combinations of A and B.
Expected Output (EDIT):
A B C D
0 foo one NaN 2
1 bar one bla2 0
2 foo two NaN 2
3 bar three bla3 0
4 foo two NaN 2
5 bar two NaN 1
6 foo one NaN 2
7 foo three NaN 1
Currently I am trying this:
df['count']=df.groupby(['A'])['B'].isnull().transform('sum')
But this is not working...
Thank You
I think you need groupby with sum of NaN values:
df2 = df.C.isnull().groupby([df['A'],df['B']]).sum().astype(int).reset_index(name='count')
print(df2)
A B count
0 bar one 0
1 bar three 0
2 bar two 1
3 foo one 2
4 foo three 1
5 foo two 2
Notice that .isnull() is applied to the original DataFrame column, not to the groupby object. The groupby object does not have .isnull(), but if it did, it would be expected to give the same result as .isnull() on the original DataFrame.
If need filter first add boolean indexing:
df = df[df['A'] == 'foo']
df2 = df.C.isnull().groupby([df['A'],df['B']]).sum().astype(int)
print(df2)
A B
foo one 2
three 1
two 2
Or simpler:
df = df[df['A'] == 'foo']
df2 = df['B'].value_counts()
print(df2)
one 2
two 2
three 1
Name: B, dtype: int64
EDIT: Solution is very similar, only add transform:
df['D'] = df.C.isnull().groupby([df['A'],df['B']]).transform('sum').astype(int)
print(df)
A B C D
0 foo one NaN 2
1 bar one bla2 0
2 foo two NaN 2
3 bar three bla3 0
4 foo two NaN 2
5 bar two NaN 1
6 foo one NaN 2
7 foo three NaN 1
Similar solution:
df['D'] = df.C.isnull()
df['D'] = df.groupby(['A','B'])['D'].transform('sum').astype(int)
print(df)
A B C D
0 foo one NaN 2
1 bar one bla2 0
2 foo two NaN 2
3 bar three bla3 0
4 foo two NaN 2
5 bar two NaN 1
6 foo one NaN 2
7 foo three NaN 1
df[df.A == 'foo'].groupby('B').agg({'C': lambda x: x.isnull().sum()})
returns:
=> C
B
one 2
three 1
two 2
Just add the parameter dropna=False:
df.groupby(['A', 'B','C'], dropna=False).size()
Check the documentation:
dropna : bool, default True
    If True, and if group keys contain NA values, NA values together with row/column will be dropped. If False, NA values will also be treated as the key in groups.
