I have the following df:
dfdict = {'letter': ['a', 'a', 'a', 'b', 'b'], 'category': ['foo', 'foo', 'bar', 'bar', 'spam']}
df1 = pd.DataFrame(dfdict)
category letter
0 foo a
1 foo a
2 bar a
3 bar b
4 spam b
I want the output to be an aggregated count df like this:
a b
foo 2 0
bar 1 1
spam 0 1
This seems like it should be an easy operation. I have figured out how to use
df1 = df1.groupby(['category','letter']).size() to get:
category letter
bar a 1
b 1
foo a 2
spam b 1
This is closer, except now I need the letters a, b along the top and the counts coming down.
You can use crosstab:
pd.crosstab(df1.category,df1.letter)
Out[554]:
letter a b
category
bar 1 1
foo 2 0
spam 0 1
To fix your code, add unstack:
df1.groupby(['category','letter']).size().unstack(fill_value=0)
Out[556]:
letter a b
category
bar 1 1
foo 2 0
spam 0 1
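For completeness, pivot_table can express the same reshape in one call; a minimal sketch, assuming your pandas version accepts aggfunc='size' without a values column:

import pandas as pd

dfdict = {'letter': ['a', 'a', 'a', 'b', 'b'],
          'category': ['foo', 'foo', 'bar', 'bar', 'spam']}
df1 = pd.DataFrame(dfdict)

# Count rows per (category, letter) pair and spread the letters across the columns.
out = df1.pivot_table(index='category', columns='letter', aggfunc='size', fill_value=0)
print(out)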
I'd like to group several columns in my dataframe, then append a new column to the original dataframe with a non-aggregated value determined by a condition in another column that falls outside of the grouping. For example:
import pandas as pd
import numpy as np
df = pd.DataFrame({'cat': ['foo', 'foo', 'foo', 'foo', 'foo', 'foo',
                           'bar', 'bar', 'bar', 'bar', 'bar', 'bar'],
                   'subcat': ['a', 'a', 'a', 'b', 'b', 'b',
                              'c', 'c', 'c', 'd', 'd', 'd'],
                   'bin': [1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1],
                   'value': [2, 5, 7, 6, 3, 9, 8, 3, 2, 1, 2, 4]
                   })
I'd like to group by both 'cat' and 'subcat', and I'm hoping to append the corresponding 'value' as a new column where 'bin' == 1.
This is my desired output:
df = pd.DataFrame({'cat': ['foo', 'foo', 'foo', 'foo', 'foo', 'foo',
                           'bar', 'bar', 'bar', 'bar', 'bar', 'bar'],
                   'subcat': ['a', 'a', 'a', 'b', 'b', 'b',
                              'c', 'c', 'c', 'd', 'd', 'd'],
                   'bin': [1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1],
                   'value': [2, 5, 7, 6, 3, 9, 8, 3, 2, 1, 2, 4],
                   'new_value': [2, 2, 2, 3, 3, 3, 2, 2, 2, 4, 4, 4]
                   })
I've tried various approaches including the following, but my merge yields more rows than expected, so I am hoping to find a different route.
vals = df[df['bin'] == 1].loc[:,('cat', 'subcat', 'value')]
df_merged = pd.merge(left = df, right = vals, how = "left", on = ('cat','subcat'))
Thanks!
Try loc with groupby and idxmax:
df['new_value'] = df.loc[df.groupby(['subcat'])['bin'].transform('idxmax'), 'value'].reset_index(drop=True)
print(df)
Output:
cat subcat bin value new_value
0 foo a 1 2 2
1 foo a 0 5 2
2 foo a 0 7 2
3 foo b 0 6 3
4 foo b 1 3 3
5 foo b 0 9 3
6 bar c 0 8 2
7 bar c 0 3 2
8 bar c 1 2 2
9 bar d 0 1 4
10 bar d 0 2 4
11 bar d 1 4 4
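If you want to follow the question's wording and group by both keys, the same idea should carry over; a sketch assuming each (cat, subcat) group contains exactly one bin == 1 row:

import pandas as pd

df = pd.DataFrame({'cat': ['foo'] * 6 + ['bar'] * 6,
                   'subcat': ['a'] * 3 + ['b'] * 3 + ['c'] * 3 + ['d'] * 3,
                   'bin': [1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1],
                   'value': [2, 5, 7, 6, 3, 9, 8, 3, 2, 1, 2, 4]})

# idxmax of 'bin' within each (cat, subcat) group is the label of its bin == 1 row;
# transform broadcasts that label to every row of the group.
idx = df.groupby(['cat', 'subcat'])['bin'].transform('idxmax')

# Look up 'value' at those labels and assign the result positionally.
df['new_value'] = df.loc[idx, 'value'].to_numpy()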
First let's suppose I have a pandas dataframe with a single index. If I use .loc[] to select index 'A' twice, it will return a dataframe with index 'A' repeated twice:
df_1 = pd.DataFrame([1,2,3], index=['A','B','C'], columns=['Col_1'])
df_1
Col_1
A 1
B 2
C 3
df_1.loc[['A','A','B']]
Col_1
A 1
A 1
B 2
Now let's suppose we have a dataframe with a multi-index. If I use .loc[] to select index 'A' twice, it will return a dataframe with index 'A' included only once:
ix = pd.MultiIndex.from_product([['A', 'B', 'C'], ['foo', 'bar']], names=['Idx1', 'Idx2'])
data = np.arange(len(ix))
df_2 = pd.DataFrame(data, index=ix, columns=['Col_1'])
df_2
Col_1
Idx1 Idx2
A foo 0
bar 1
B foo 2
bar 3
C foo 4
bar 5
df_2.loc[['A','A','B']]
Col_1
Idx1 Idx2
A foo 0
bar 1
B foo 2
bar 3
Is there any way to select repeated values of a multi-index level using .loc?
Pandas tries to keep the levels of a MultiIndex unique. When you use loc with a list that refers to values of the first level of the MultiIndex, it will keep things unique. If you want something different, you'll need to be explicit and use tuples.
specific_index_values = (
    [('A', 'foo'), ('A', 'bar')] * 2 +
    [('B', 'foo'), ('B', 'bar')]
)
df_2.loc[specific_index_values, :]
Col_1
Idx1 Idx2
A foo 0
bar 1
foo 0
bar 1
B foo 2
bar 3
pandas.concat
I find this distasteful but...
pd.concat([df_2.loc[[x]] for x in ['A', 'A', 'B']])
Col_1
Idx1 Idx2
A foo 0
bar 1
foo 0
bar 1
B foo 2
bar 3
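If you prefer to avoid concat, another workaround is to translate each first-level key into integer positions and select with iloc; a sketch assuming MultiIndex.get_locs is available in your pandas version:

import numpy as np

keys = ['A', 'A', 'B']

# get_locs returns the integer positions of every row under a first-level key;
# concatenating keeps the repeats, and iloc selects positionally.
positions = np.concatenate([df_2.index.get_locs(key) for key in keys])
df_2.iloc[positions]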
I have a DataFrame:
>>> df
A
0 foo
1 bar
2 foo
3 baz
4 foo
5 bar
I need to find all the duplicate groups and label them with sequential dgroup_id's:
>>> df
A dgroup_id
0 foo 1
1 bar 2
2 foo 1
3 baz
4 foo 1
5 bar 2
(This means that foo belongs to the first group of duplicates, bar to the second group of duplicates, and baz is not duplicated.)
I did this:
import pandas as pd
df = pd.DataFrame({'A': ('foo', 'bar', 'foo', 'baz', 'foo', 'bar')})
duplicates = df.groupby('A').size()
duplicates = duplicates[duplicates>1]
# Yes, this is ugly, but I didn't know how to do it otherwise:
duplicates[duplicates.reset_index().index] = duplicates.reset_index().index
df.insert(1, 'dgroup_id', df['A'].map(duplicates))
This leads to:
>>> df
A dgroup_id
0 foo 1.0
1 bar 0.0
2 foo 1.0
3 baz NaN
4 foo 1.0
5 bar 0.0
Is there a simpler/shorter way to achieve this in pandas? I read that maybe pandas.factorize could be of help here, but I don't know how to use it... (the pandas documentation on this function is of no help)
Also: I don't mind either the 0-based group count or the weird sorting order, but I would like to have the dgroup_id's as ints, not floats.
You can make a list of duplicated values with get_duplicates(), then set the dgroup_id from each value's position in that list:
def find_index(string):
    if string in duplicates:
        return duplicates.index(string) + 1
    else:
        return 0
df = pd.DataFrame({'A': ('foo', 'bar', 'foo', 'baz', 'foo', 'bar')})
duplicates = df.set_index('A').index.get_duplicates()
df['dgroup_id'] = df['A'].apply(find_index)
df
Output:
A dgroup_id
0 foo 2
1 bar 1
2 foo 2
3 baz 0
4 foo 2
5 bar 1
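Note that Index.get_duplicates() has since been removed from pandas; if it is unavailable in your version, an equivalent list (sorted here so the ids match the output above) might be built like this:

idx = df.set_index('A').index
# Values that occur more than once, as a plain sorted list.
duplicates = sorted(idx[idx.duplicated()].unique())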
Use a chained operation to first get the value_counts for each A, calculate the sequence number for each group, and then join back to the original DF.
(
    pd.merge(df,
             df.A.value_counts().apply(lambda x: 1 if x > 1 else np.nan)
               .cumsum().rename('dgroup_id').to_frame(),
             left_on='A', right_index=True)
      .sort_index()
)
Out[49]:
A dgroup_id
0 foo 1.0
1 bar 2.0
2 foo 1.0
3 baz NaN
4 foo 1.0
5 bar 2.0
If you need NaN for unique groups, you can't have int as the datatype, which is a pandas limitation at the moment. If you are ok with setting 0 for unique groups, you can do something like:
(
    pd.merge(df,
             df.A.value_counts().apply(lambda x: 1 if x > 1 else np.nan)
               .cumsum().rename('dgroup_id').to_frame().fillna(0).astype(int),
             left_on='A', right_index=True)
      .sort_index()
)
A dgroup_id
0 foo 1
1 bar 2
2 foo 1
3 baz 0
4 foo 1
5 bar 2
Use duplicated to identify where dups are. Use where to replace singletons with ''. Use categorical to factorize.
dups = df.A.duplicated(keep=False)
df.assign(dgroup_id=df.A.where(dups, '').astype('category').cat.codes)
A dgroup_id
0 foo 2
1 bar 1
2 foo 2
3 baz 0
4 foo 2
5 bar 1
If you insist on the zeros being ''
dups = df.A.duplicated(keep=False)
df.assign(
    dgroup_id=df.A.where(dups, '').astype('category').cat.codes.replace(0, ''))
A dgroup_id
0 foo 2
1 bar 1
2 foo 2
3 baz
4 foo 2
5 bar 1
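Since the question explicitly mentions pandas.factorize, here is a minimal sketch along the same lines, assuming 1-based ids and NaN for singletons are acceptable (the NaN forces a float column unless you fill it):

import pandas as pd

df = pd.DataFrame({'A': ('foo', 'bar', 'foo', 'baz', 'foo', 'bar')})

dups = df.A.duplicated(keep=False)          # True wherever the value occurs more than once
codes, _ = pd.factorize(df.loc[dups, 'A'])  # 0, 1, ... in order of first appearance
df.loc[dups, 'dgroup_id'] = codes + 1       # 1-based ids; singletons stay NaN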
You could go for:
import pandas as pd
import numpy as np
df = pd.DataFrame(['foo', 'bar', 'foo', 'baz', 'foo', 'bar',], columns=['name'])
# Create the groups order
ordered_names = df['name'].drop_duplicates().tolist() # ['foo', 'bar', 'baz']
# Find index of each element in the ordered list
df['duplication_index'] = df['name'].apply(lambda x: ordered_names.index(x) + 1)
# Discard non-duplicated entries
df.loc[~df['name'].duplicated(keep=False), 'duplication_index'] = np.nan
print(df)
# name duplication_index
# 0 foo 1.0
# 1 bar 2.0
# 2 foo 1.0
# 3 baz NaN
# 4 foo 1.0
# 5 bar 2.0
df = pd.DataFrame({'A': ('foo', 'bar', 'foo', 'baz', 'foo', 'bar')})
key_set = set(df['A'])
df_a = pd.DataFrame(list(key_set))
df_a['dgroup_id'] = df_a.index
result = pd.merge(df,df_a,left_on='A',right_on=0,how='left')
In [32]: result.drop(0,axis=1)
Out[32]:
A dgroup_id
0 foo 2
1 bar 0
2 foo 2
3 baz 1
4 foo 2
5 bar 0
I have a DataFrame that is in an overly "compact" form. The DataFrame is currently like this:
> import numpy as np
> import pandas as pd
> df = pd.DataFrame({'foo': ['A', 'B'],
                     'bar': ['1', '2'],
                     'baz': [np.nan, '3']})
bar baz foo
0 1 NaN A
1 2 3 B
And I need to "unstack" it to be like so :
> df = pd.DataFrame({'foo': ['A', 'B', 'B'],
                     'type': ['bar', 'bar', 'baz'],
                     'value': ['1', '2', '3']})
foo type value
0 A bar 1
1 B bar 2
2 B baz 3
No matter how I try to pivot, I can't get it right.
Use the melt() method:
In [39]: pd.melt(df, id_vars='foo', value_vars=['bar','baz'], var_name='type')
Out[39]:
foo type value
0 A bar 1
1 B bar 2
2 A baz NaN
3 B baz 3
or
In [38]: pd.melt(df, id_vars='foo', value_vars=['bar','baz'], var_name='type').dropna()
Out[38]:
foo type value
0 A bar 1
1 B bar 2
3 B baz 3
Set your index to foo, then stack:
df.set_index('foo').stack()
foo
A bar 1
B bar 2
baz 3
dtype: object
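To get from that stacked Series back to the exact three-column layout in the question, something like this should work (the level_1 and 0 labels are the defaults reset_index produces for an unnamed level and an unnamed Series, which is an assumption about your pandas version):

out = (df.set_index('foo')
         .stack()          # drops the NaN in 'baz' by default
         .reset_index()
         .rename(columns={'level_1': 'type', 0: 'value'}))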
Say I have the following DataFrame:
arrays = [['foo', 'foo', 'bar', 'bar'],
          ['A', 'B', 'C', 'D']]
tuples = list(zip(*arrays))
columnValues = pd.MultiIndex.from_tuples(tuples)
df = pd.DataFrame(np.random.rand(4,4), columns = columnValues)
print(df)
foo bar
A B C D
0 0.037362 0.470010 0.315396 0.333798
1 0.339038 0.396307 0.487242 0.064883
2 0.691654 0.793609 0.044490 0.384154
3 0.605801 0.967021 0.156839 0.123816
I want to produce the following output:
foo bar
A B C D
0 0 0 0.315396 0.333798
1 0 0 0.487242 0.064883
2 0 0 0.044490 0.384154
3 0 0 0.156839 0.123816
I think I can use pd.DataFrame.where() for this; however, I don't see how to pass the column name bar as a condition.
EDIT: I'm looking for a way to specifically use bar instead of foo to produce the desired outcome, as foo would actually be many columns
EDIT2: Unfortunately list comprehension breaks if the list contains all the column labels. Explicitly writing out the for loop does work though.
So instead of this:
df.loc[:, [col for col in df.columns.levels[0] if col != 'bar']] = 0
I use this:
for col in df.columns.levels[0]:
    if col not in nameList:
        df.loc[:, col] = 0
Use slicing to set your data. Here, you can access the sub-columns (A, B) under foo.
In [12]: df
Out[12]:
foo bar
A B C D
0 0.040251 0.119267 0.170111 0.582362
1 0.978192 0.592043 0.515702 0.630627
2 0.762532 0.667234 0.450505 0.103858
3 0.871375 0.397503 0.966837 0.870184
In [13]: df.loc[:, 'foo'] = 0
In [14]: df
Out[14]:
foo bar
A B C D
0 0 0 0.170111 0.582362
1 0 0 0.515702 0.630627
2 0 0 0.450505 0.103858
3 0 0 0.966837 0.870184
If you want to set all columns except bar, you could do:
In [15]: df.loc[:, [col for col in df.columns.levels[0] if col != 'bar']] = 0
You could use get_level_values, I guess:
>>> df
foo bar
A B C D
0 0.039728 0.065875 0.825380 0.240403
1 0.617857 0.895751 0.484237 0.506315
2 0.332381 0.047287 0.011291 0.346073
3 0.216224 0.024978 0.834353 0.500970
>>> df.loc[:, df.columns.get_level_values(0) != "bar"] = 0
>>> df
foo bar
A B C D
0 0 0 0.825380 0.240403
1 0 0 0.484237 0.506315
2 0 0 0.011291 0.346073
3 0 0 0.834353 0.500970
df.columns.droplevel(1) != "bar" should also work, although I don't like it as much even though it's shorter because it inverts the selection logic.
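The question also asked about DataFrame.where; one hedged way to feed it a column-level condition is to expand the per-column test to the frame's full shape first (expanded explicitly because I am not certain every pandas version broadcasts a 1-D column mask in where):

import numpy as np

# Repeat the per-column test for every row so the mask matches df's shape,
# then keep the 'bar' values and replace everything else with 0.
keep = df.columns.get_level_values(0) == 'bar'
mask = np.tile(keep, (len(df), 1))
out = df.where(mask, 0)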
Easier, without loc
df['foo'] = 0
If you happen not to have this MultiIndex, you can use:
df.loc[:, ['A', 'B']] = 0
This automatically replaces the values in your columns 'A' and 'B' with 0.