I have a pandas dataframe df like this, say
ID activity date
1 A 4
1 B 8
1 A 12
1 C 12
2 B 9
2 A 10
3 A 3
3 D 4
and I would like to return a table that counts the number of occurrences of each activity in a given list, say l = ['A', 'B'] in this case, so that
ID activity(count)_A activity(count)_B
1 2 1
2 1 1
3 1 0
is what I need.
What is the quickest way to perform this, ideally without a for loop?
Thanks!
Edit: I know there is a pivot function for this kind of job, but in my case I have many more activity types than the ones I actually need to count in the list l. Is pivot still optimal?
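For reference, here is a minimal reproduction of the sample frame (column names taken from the table above):
import pandas as pd

df = pd.DataFrame({
    'ID':       [1, 1, 1, 1, 2, 2, 3, 3],
    'activity': ['A', 'B', 'A', 'C', 'B', 'A', 'A', 'D'],
    'date':     [4, 8, 12, 12, 9, 10, 3, 4],
})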
You can use isin with boolean indexing as a first step and then pivot. The fastest should be groupby + size + unstack, then pivot_table, and last crosstab, but it is best to test each solution with your real data:
df2 = (df[df['activity'].isin(['A','B'])]
.groupby(['ID','activity'])
.size()
.unstack(fill_value=0)
.add_prefix('activity(count)_')
.reset_index()
.rename_axis(None, axis=1))
print (df2)
ID activity(count)_A activity(count)_B
0 1 2 1
1 2 1 1
2 3 1 0
Or:
df1 = df[df['activity'].isin(['A','B'])]
df2 = (pd.crosstab(df1['ID'], df1['activity'])
.add_prefix('activity(count)_')
.reset_index()
.rename_axis(None, axis=1))
Or:
df2 = (df[df['activity'].isin(['A','B'])]
.pivot_table(index='ID', columns='activity', aggfunc='size', fill_value=0)
.add_prefix('activity(count)_')
.reset_index()
.rename_axis(None, axis=1))
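To follow the advice above and actually benchmark the three variants on your real data, a minimal sketch using the standard timeit module (it assumes df and pd from the question's setup, and skips the cosmetic renaming steps):
import timeit

mask = df['activity'].isin(['A', 'B'])

candidates = {
    'groupby+size+unstack': lambda: (df[mask].groupby(['ID', 'activity'])
                                             .size().unstack(fill_value=0)),
    'crosstab':             lambda: pd.crosstab(df.loc[mask, 'ID'],
                                                df.loc[mask, 'activity']),
    'pivot_table':          lambda: df[mask].pivot_table(index='ID',
                                                         columns='activity',
                                                         aggfunc='size',
                                                         fill_value=0),
}

for name, func in candidates.items():
    # run each candidate 1000 times; adjust number to suit your data size
    print(name, timeit.timeit(func, number=1000))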
I believe df.groupby(['ID', 'activity']).size().reset_index(name='count')
should do what you expect, although it returns the counts in long format rather than the wide table shown above.
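To reach the wide layout in the question, one extra pivot is needed; a sketch, filtering to l first as the other answers do:
l = ['A', 'B']
counts = (df[df['activity'].isin(l)]
            .groupby(['ID', 'activity'])
            .size()
            .reset_index(name='count'))
wide = (counts.pivot(index='ID', columns='activity', values='count')
              .fillna(0).astype(int))  # missing ID/activity pairs become 0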
Just aggregate with Counter and use the default pd.DataFrame constructor:
from collections import Counter
agg_ = df.groupby('ID')['activity'].agg(Counter).tolist()
ndf = pd.DataFrame(agg_)
A B C D
0 2 1.0 1.0 NaN
1 1 1.0 NaN NaN
2 1 NaN NaN 1.0
If you have l = ['A', 'B'], just filter
ndf[l]
A B
0 2 1.0
1 1 1.0
2 1 NaN
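To match the question's expected output exactly (zeros instead of NaN and an explicit ID column), a small cleanup sketch on top of the above:
out = ndf[l].fillna(0).astype(int)
out.insert(0, 'ID', sorted(df['ID'].unique()))  # groupby('ID') emits groups in sorted order
print(out)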
Related
I have a pandas DataFrame of teams and the place each has achieved (1, 2, or 3):
Team place
A    1
A    1
A    1
A    2
A    3
A    1
A    1
B    2
B    2
I want to manipulate the df to look like the table below, i.e. a count of how often each team has achieved each place.
Team 1 2 3
A    5 1 1
B    0 2 0
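For anyone following along, a minimal reproduction of the input (names taken from the tables above):
import pandas as pd

df = pd.DataFrame({
    'Team':  ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B'],
    'place': [1, 1, 1, 2, 3, 1, 1, 2, 2],
})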
You could use pandas.crosstab:
pd.crosstab(df['Team'], df['place'])
or a simple groupby+size and unstack:
(df.groupby(['Team', 'place']).size()
.unstack('place', fill_value=0)
)
output:
place 1 2 3
Team
A 5 1 1
B 0 2 0
To get everything as columns:
(pd.crosstab(df['Team'], df['place'])
.rename_axis(columns=None)
.reset_index()
)
output:
Team 1 2 3
0 A 5 1 1
1 B 0 2 0
You can get the value counts for each group and then unstack the index. The rest is twiddling to get your exact output.
(df.groupby('Team')['place']
.value_counts()
.unstack(fill_value=0)
.reset_index()
.rename_axis(None, axis=1)
)
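Output, for the sample above:
   Team  1  2  3
0     A  5  1  1
1     B  0  2  0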
I have a DataFrame that looks like the one below
Index Category Class
0 1 A
1 1 A
2 1 B
3 2 A
4 3 B
5 3 B
And I would like to get an output DataFrame that groups by Category and has one column for each of the classes, counting the occurrences of that class in each category, such as the one below
Index Category A B
0 1 2 1
1 2 1 0
2 3 0 2
So far I've tried various combinations of the groupby and agg methods, but I still can't get what I want. I've also tried df.pivot_table(index='Category', columns='Class', aggfunc='count'), but that returns a DataFrame without columns. Any ideas of what could work in this case?
You can use aggfunc="size" to achieve your desired result:
>>> df.pivot_table(index='Category', columns='Class', aggfunc='size', fill_value=0)
Class A B
Category
1 2 1
2 1 0
3 0 2
Alternatively, you can use .groupby(...).size() to get the counts, and then unstack to reshape your data as well:
>>> df.groupby(["Category", "Class"]).size().unstack(fill_value=0)
Class A B
Category
1 2 1
2 1 0
3 0 2
Assign a dummy value to count:
out = df.assign(val=1).pivot_table('val', 'Category', 'Class',
aggfunc='count', fill_value=0).reset_index()
print(out)
# Output
Class Category A B
0 1 2 1
1 2 1 0
2 3 0 2
import pandas as pd
df = pd.DataFrame({'Index':[0,1,2,3,4,5],
'Category': [1,1,1,2,3,3],
'Class':['A','A','B','A','B','B'],
})
df = df.groupby(['Category', 'Class']).count()
df = df.pivot_table(index='Category', columns='Class')
print(df)
output:
Index
Class A B
Category
1 2.0 1.0
2 1.0 NaN
3 NaN 2.0
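To clean that up into the integer table the question asks for, a possible follow-up (a sketch):
df.columns = df.columns.droplevel(0)  # drop the leftover 'Index' level from count()
df = df.fillna(0).astype(int)
print(df)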
Use crosstab:
pd.crosstab(df['Category'], df['Class']).reset_index()
output:
Class Category A B
0 1 2 1
1 2 1 0
2 3 0 2
I have a pandas DataFrame like this:
source text_column
0 a abcdefghi
1 a abcde
2 b qwertyiop
3 c plmnkoijb
4 a NaN
5 c abcde
6 b qwertyiop
7 b qazxswedcdcvfr
and I would like to get the max length of text_column after grouping by the source column, like below:
source something
a 9
b 14
c 9
Here's what I have tried so far, and all of them raise errors:
>>> # first creating the group by object
>>> text_group = mydf.groupby(by=['source'])
>>> # now try to get the max length of "text_column" by each "source"
>>> text_group['text_column'].map(len).max()
>>> text_group['text_column'].len().max()
>>> text_group['text_column'].str.len().max()
How do I get the max length of text_column grouped by another column?
And, to avoid creating a new question: how do I also get the 2nd biggest length and the respective values (the 1st and 2nd largest sentences in text_column)?
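A minimal reproduction of this frame (values taken from the table, with a real NaN):
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'source': ['a', 'a', 'b', 'c', 'a', 'c', 'b', 'b'],
    'text_column': ['abcdefghi', 'abcde', 'qwertyiop', 'plmnkoijb',
                    np.nan, 'abcde', 'qwertyiop', 'qazxswedcdcvfr'],
})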
A first idea is to use a lambda function with Series.str.len and max:
df = (df.groupby('source')['text_column']
.agg(lambda x: x.str.len().max())
.reset_index(name='something'))
print (df)
source something
0 a 9.0
1 b 14.0
2 c 9.0
Or you can first use Series.str.len and then aggregate max:
df = (df['text_column'].str.len()
.groupby(df['source'])
.max()
.reset_index(name='something'))
print (df)
Also, if you need integers, first use DataFrame.dropna:
df = (df.dropna(subset=['text_column'])
.assign(text_column=lambda x: x['text_column'].str.len())
.groupby('source', as_index=False)['text_column']
.max())
print (df)
source text_column
0 a 9
1 b 14
2 c 9
EDIT: for the first and second top values, use DataFrame.sort_values with GroupBy.head:
df1 = (df.dropna(subset=['text_column'])
.assign(something=lambda x: x['text_column'].str.len())
.sort_values(['source','something'], ascending=[True, False])
.groupby('source', as_index=False)
.head(2))
print (df1)
source text_column something
0 a abcdefghi 9
1 a abcde 5
7 b qazxswedcdcvfr 14
2 b qwertyiop 9
3 c plmnkoijb 9
5 c abcde 5
An alternative solution with SeriesGroupBy.nlargest, though obviously slower:
df1 = (df.dropna(subset=['text_column'])
.assign(something=lambda x: x['text_column'].str.len())
.groupby('source')['something']
.nlargest(2)
.reset_index(level=1, drop=True)
.reset_index())
print (df1)
source something
0 a 9
1 a 5
2 b 14
3 b 9
4 c 9
5 c 5
A last solution, for new top1/top2 columns:
df=df.dropna(subset=['text_column']).assign(something=lambda x: x['text_column'].str.len())
df = df.sort_values(['source','something'], ascending=[True, False])
df['g'] = df.groupby('source').cumcount().add(1)
df = (df[df['g'].le(2)].pivot(index='source', columns='g', values='something')
.add_prefix('top')
.rename_axis(index=None, columns=None))
print (df)
top1 top2
a 9 5
b 14 9
c 9 5
Just get the lengths first with assign and str.len:
df.assign(text_column=df['text_column'].str.len()).groupby('source', as_index=False).max()
source text_column
0 a 9.0
1 b 14.0
2 c 9.0
The easiest solution, to me, looks something like this (tested); you do not actually need a groupby:
df['str_len'] = df.text_column.str.len()
df.sort_values(['str_len'], ascending=False)\
.drop_duplicates(['source'])\
.drop(columns='text_column')
source str_len
7 b 14.0
0 a 9.0
3 c 9.0
Regarding your 2nd question, I think a groupby serves you well:
top_x = 2
df.groupby('source', as_index=False)\
.apply(lambda sourcedf: sourcedf.sort_values('str_len').nlargest(top_x, columns='str_len', keep='all'))\
.drop(columns='text_column')
I have a pandas DataFrame with some labels for n classes. Now I want to add a column that stores how many rows lie between two consecutive elements of the same class.
Class
0 0
1 1
2 1
3 1
4 0
and I want to get this:
Class Shift
0 0 NaN
1 1 NaN
2 1 1.0
3 1 1.0
4 0 4.0
This is the code I used:
df = pd.DataFrame({'Class':[0,1,1,1,0]})
df['Shift'] = np.nan
for item in df.Class.unique():
_df = df[df['Class'] == item]
_df = _df.reset_index().rename({'index':'idx'}, axis=1)
df.loc[_df.idx, 'Shift'] = _df['idx'].diff().values
df
This seems circuitous to me. Is there a more elegant way of producing this output?
You could do:
df['shift'] = np.arange(len(df))
df['shift'] = df.groupby('Class')['shift'].diff()
print(df)
Output
Class shift
0 0 NaN
1 1 NaN
2 1 1.0
3 1 1.0
4 0 4.0
As an alternative:
df['shift'] = df.assign(shift=np.arange(len(df))).groupby('Class')['shift'].diff()
The idea is to create a column with consecutive values, group by the Class column and compute the diff on the new column.
If there is a default RangeIndex, use Index.to_series, group it by the df['Class'] column, and apply DataFrameGroupBy.diff:
df['Shift'] = df.index.to_series().groupby(df['Class']).diff()
A similar alternative is to create a helper column:
df['Shift'] = df.assign(tmp = df.index).groupby('Class')['tmp'].diff()
print (df)
Class Shift
0 0 NaN
1 1 NaN
2 1 1.0
3 1 1.0
4 0 4.0
Your solution with resetting the index can be simplified to:
df['Shift'] = df.reset_index().groupby('Class')['index'].diff().to_numpy()
I am still new to pandas' pivot_table and I'm trying to reshape the data to get a binary indicator of whether a value appears in a certain observation. I have followed some previous examples and got encouraging results; however, instead of the ones and zeros of my ideal result, I get a sum. Please see a small sample data set below:
ID SKILL NUM
1 A 1
1 A 1
1 B 1
2 C 1
3 C 1
3 C 1
3 E 1
The result I am aiming for is:
ID A B C E
1 1 1 0 0
2 0 0 1 0
3 0 0 1 1
My code atm gets the following result:
ID A B C E
1 2 1 0 0
2 0 0 1 0
3 0 0 2 1
Should I remove the duplicates first?
The code I'm using atm is below:
df_pivot = df.pivot_table(index='ID', columns='SKILL', aggfunc=len, fill_value=0)
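For reference, a minimal reproduction of the sample (column names taken from the table above):
import pandas as pd

df = pd.DataFrame({
    'ID':    [1, 1, 1, 2, 3, 3, 3],
    'SKILL': ['A', 'A', 'B', 'C', 'C', 'C', 'E'],
    'NUM':   [1, 1, 1, 1, 1, 1, 1],
})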
You can use get_dummies with set_index for indicator columns and then get max values per index:
df = pd.get_dummies(df.set_index('ID')['SKILL']).groupby(level=0).max()
For better performance, remove duplicates with drop_duplicates and reshape with set_index plus unstack:
df = df.drop_duplicates(['ID','SKILL']).set_index(['ID','SKILL'])['NUM'].unstack(fill_value=0)
A solution with pivot, but then it is necessary to replace NaN with 0:
df = df.drop_duplicates(['ID','SKILL']).pivot(index='ID', columns='SKILL', values='NUM').fillna(0).astype(int)
If you want to use your solution, just remove the duplicates first. unstack is better here because, once the ID/SKILL pairs are deduplicated, the data never actually need aggregating:
df2 = df.drop_duplicates(['ID','SKILL'])
df_pivot = (df2.pivot_table(index='ID',
columns='SKILL',
values='NUM',
aggfunc=len,
fill_value=0))
print (df_pivot)
SKILL A B C E
ID
1 1 1 0 0
2 0 0 1 0
3 0 0 1 1
Try like this:
df.pivot_table(index='ID', columns='SKILL', values='NUM', aggfunc=lambda x: len(x.unique()), fill_value=0)
Or this:
df.pivot_table(index='ID', columns='SKILL',aggfunc=lambda x: int(x.any()), fill_value=0)
Whichever suits you best.
You can use aggfunc='any' and convert to int as a separate step. This avoids having to use a lambda / custom function, and may be more efficient.
df_pivot = df.pivot_table(index='ID', columns='SKILL',
aggfunc='any', fill_value=0).astype(int)
print(df_pivot)
NUM
SKILL A B C E
ID
1 1 1 0 0
2 0 0 1 0
3 0 0 1 1
The same would work with aggfunc=len plus a conversion to bool and then int, but this is likely to be more expensive.
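For completeness, a sketch of that len-based variant; the intermediate bool step is what collapses duplicate counts down to 1:
df_pivot = (df.pivot_table(index='ID', columns='SKILL',
                           aggfunc=len, fill_value=0)
              .astype(bool).astype(int))
print(df_pivot)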