import pandas as pd
test = pd.DataFrame({'Area': ['Tipperary', 'Tipperary', 'Cork', 'Dublin'],
                     'Deaths': [11, 33, 44, 55]})
I have this problem on a much larger scale, but for readability I have created a smaller version. What groupby logic do I need to group by the Area column and sum, so that I end up with 3 rows instead of 4 (Tipperary is in there twice)? And if I had 6 columns altogether, how would I do this while keeping my existing dataframe as it is, i.e. just reducing the row count because of the duplicated values in 'Area'?
If the other columns contain more than just numbers, you can use .groupby and .agg with a different function for each column. If you do not want the grouping column moved to the index, set the parameter as_index=False in groupby.
import pandas as pd
test = pd.DataFrame({'Area': ['Tipperary', 'Tipperary', 'Cork', 'Dublin'],
                     'Deaths': [11, 33, 44, 55],
                     'Text': ['a', 'b', 'c', 'd'],
                     'Numbers': [1, 4, 3, 2]})
out = test.groupby('Area', as_index=False).agg(
    {'Deaths': 'sum', 'Text': ','.join, 'Numbers': 'max'})
print(out)
Prints:
        Area  Deaths Text  Numbers
0       Cork      44    c        3
1     Dublin      55    d        2
2  Tipperary      44  a,b        4
You can simply use the .groupby method
import pandas as pd
test = pd.DataFrame({'Area': ['Tipperary', 'Tipperary', 'Cork', 'Dublin'],
                     'Deaths': [11, 33, 44, 55]})
test.groupby('Area').sum()
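This moves Area into the index and sums Deaths per area:

           Deaths
Area
Cork           44
Dublin         55
Tipperary      44

If you would rather keep Area as an ordinary column, as the question asks, pass as_index=False; with more columns, sum() (or agg, as in the answer above) is applied per column:

test.groupby('Area', as_index=False).sum()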
I am trying to subset pandas dataframe by a group / category, calculate a statistic and apply it to the original dataframe for missing values in the group.
df = pd.DataFrame({
    'City': ['SF', 'NYC', 'SF', 'NYC', 'SF', 'CHI', 'LA', 'LA', 'CHI'],
    'Val': [2, 4, 0, 0, 7, 4, 3, 5, 6]
})
for name, group in df.groupby('City'):
    dff = df[df['City'] == name]
    # Calculate the mean of the non-zero values in the group
    non_zero = dff[dff['Val'] != 0]
    mean_val = int(non_zero['Val'].mean())
Now, I need to apply mean_val to all 0s in the subset.
We could mask out the 0 values, use groupby transform to calculate the group means, fillna to put the means back, and lastly convert the column to int using astype:
# Replace 0s with NaN so they are excluded from the mean
s = df['Val'].mask(df['Val'].eq(0))
# Fill the NaNs with each city's mean of the remaining values
df['Val'] = s.fillna(s.groupby(df['City']).transform('mean')).astype(int)
Or we can build a boolean mask where Val is 0, mask out those values, and assign the result of groupby transform back using loc:
m = df['Val'].eq(0)
df.loc[m, 'Val'] = (
    df['Val'].mask(m)
             .groupby(df['City']).transform('mean')
             .astype(int)
)
Both produce:
df:

  City  Val
0   SF    2
1  NYC    4
2   SF    4
3  NYC    4
4   SF    7
5  CHI    4
6   LA    3
7   LA    5
8  CHI    6
We could also keep the original loop, filtering dff to get the index locations relative to df and assigning the mean back:
for name, group in df.groupby('City'):
    dff = df[df['City'] == name]
    # Calculate the mean of the non-zero values in the group
    non_zero = dff[dff['Val'] != 0]
    mean_val = int(non_zero['Val'].mean())
    # Assign mean back to `df` at index locations where Val is 0 in group
    df.loc[dff[dff['Val'] == 0].index, 'Val'] = mean_val
Looping is highly discouraged in pandas, however, because of the increased runtime.
That said, if we are going to iterate over the groupby at all, we should use the group it returns instead of re-filtering df:
for name, group in df.groupby('City'):
    # Boolean mask of the non-zero rows in this group
    m = group['Val'] != 0
    # Calculate the mean from the grouped dataframe `group`
    mean_val = int(group.loc[m, 'Val'].mean())
    # Assign mean back to `df` at index locations where Val is 0 in group
    df.loc[group[~m].index, 'Val'] = mean_val
DataFrame and imports:
import pandas as pd
df = pd.DataFrame({
    'City': ['SF', 'NYC', 'SF', 'NYC', 'SF', 'CHI', 'LA', 'LA', 'CHI'],
    'Val': [2, 4, 0, 0, 7, 4, 3, 5, 6],
})
I have a pandas dataframe that contains duplicates according to one column (ID), but has differing values in several other columns. My goal is to remove the duplicates based on ID, but to concatenate the information from the other columns.
Here is an example of what I'm working with:
ID   Age  Gender  Form        Signature  Level
000  30   M       Paper       Yes        A
000  30   M       Electronic  No         B
001  42           Paper       No         B
After processing, I would like the data to look like this:
ID   Age  Gender  Form               Signature  Level
000  30   M       Paper, Electronic  Yes, No    A, B
001  42           Paper              No         B
First, I filled the NaN cells with "Not Noted" so that I can use the groupby function. I tried the following code:
df = df.groupby(['ID', 'Age', 'Gender'])['Form'].apply(set).reset_index()
This takes care of concatenating the Form column, but I cannot figure out how to incorporate the Signature and Level columns as well. Does anyone have any suggestions?
You can do this by aggregating each column separately and then concatenating the results with a basic list comprehension and the pd.concat function.
g = df.groupby(['ID', 'Age', 'Gender'])
concat_cols = ['Form', 'Signature', 'Level']  # columns whose values should be concatenated
df = pd.concat([g[c].apply(set) for c in concat_cols], axis=1).reset_index()
print(df)
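Assuming the sample frame from the answer below (with a blank Gender for ID 001), this prints each extra column collapsed to a set; the display order within each set may vary:

    ID  Age Gender                 Form  Signature   Level
0  000   30      M  {Electronic, Paper}  {No, Yes}  {B, A}
1  001   42                      {Paper}       {No}     {B}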
You can do it like this:
import pandas as pd
df = pd.DataFrame({'ID': ['000', '000', '001'],
                   'Age': [30, 30, 42],
                   'Gender': ['M', 'M', ''],
                   'Form': ['Paper', 'Electronic', 'Paper'],
                   'Signature': ['Yes', 'No', 'No'],
                   'Level': ['A', 'B', 'B']})
df = df.groupby(['ID', 'Age', 'Gender']).agg(
    {'Form': set, 'Signature': set, 'Level': set}).reset_index()
print(df)
Output:
    ID  Age Gender                 Form  Signature   Level
0  000   30      M  {Electronic, Paper}  {No, Yes}  {B, A}
1  001   42                      {Paper}       {No}     {B}
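If you want comma-separated strings rather than sets, matching the desired output in the question exactly, ', '.join can serve as the aggregator instead (this assumes the concatenated columns hold strings):

df = df.groupby(['ID', 'Age', 'Gender']).agg(
    {c: ', '.join for c in ['Form', 'Signature', 'Level']}).reset_index()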
df = pd.DataFrame([['A', 7], ['A', 5], ['B', 6]], columns=['group', 'value'])
If I want to keep one row per group, the one having the minimum value, I use:
df[df['value'] == df.groupby('group')['value'].transform('min')]
However, if I want to keep the row with the lowest index, the following does not work:
df[df.index == df.groupby('group').index.transform('min')]
I know I could just use reset_index() and deal with the index as a column, but can I avoid this:
df[df.reset_index()['index'] == df.reset_index().groupby('group')['index'].transform('min')]
You can sort by index (if it's not already sorted) and then take the first row in each group:
df.sort_index().groupby('group').first()
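Note that this moves group into the index and keeps each group's first row of values; for the sample frame it returns:

       value
group
A          7
B          6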
You could do:
import pandas as pd
df = pd.DataFrame([['A', 7], ['A', 5], ['B', 6]], columns=['group', 'value'])
# Index label of the row with the smallest original index in each group
idxs = df.reset_index().groupby('group')['index'].idxmin()
result = df.loc[idxs]
print(result)
Output
  group  value
0     A      7
2     B      6
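Alternatively, to avoid reset_index altogether while keeping the original rows and labels, you can sort by index and take the first row per group (a sketch combining the two ideas above):

result = df.sort_index().groupby('group').head(1)
print(result)
#   group  value
# 0     A      7
# 2     B      6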
Given a data frame that looks like this:

GROUP  VALUE
    1      5
    2      2
    1     10
    2     20
    1      7
I would like to compute the difference between the largest and smallest value within each group. That is, the result should be:

GROUP  DIFF
    1     5
    2    18
What is an easy way to do this in Pandas?
What is a fast way to do this in Pandas for a data frame with about 2 million rows and 1 million groups?
Using @unutbu's df. Per timing, @unutbu's solution is best over large data sets:
import pandas as pd
import numpy as np
df = pd.DataFrame({'GROUP': [1, 2, 1, 2, 1], 'VALUE': [5, 2, 10, 20, 7]})
df.groupby('GROUP')['VALUE'].agg(np.ptp)
GROUP
1     5
2    18
Name: VALUE, dtype: int64
np.ptp ("peak to peak") returns the range of an array, i.e. its maximum minus its minimum; see the NumPy docs.
timing

small df: the 5-row example above

large df:

df = pd.DataFrame(dict(GROUP=np.arange(1000000) % 100, VALUE=np.random.rand(1000000)))

large df, many groups:

df = pd.DataFrame(dict(GROUP=np.arange(1000000) % 10000, VALUE=np.random.rand(1000000)))
groupby/agg generally performs best when you take advantage of the built-in aggregators such as 'max' and 'min'. So to obtain the difference, first compute the max and min and then subtract:
import pandas as pd
df = pd.DataFrame({'GROUP': [1, 2, 1, 2, 1], 'VALUE': [5, 2, 10, 20, 7]})
result = df.groupby('GROUP')['VALUE'].agg(['max', 'min'])
result['diff'] = result['max'] - result['min']
print(result[['diff']])
yields
       diff
GROUP
1         5
2        18
Note: this will get the job done, but @piRSquared's answer has faster methods.
You can use groupby(), min(), and max():
df.groupby('GROUP')['VALUE'].apply(lambda g: g.max() - g.min())
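For the sample frame this yields the same result as the np.ptp approach:

GROUP
1     5
2    18
Name: VALUE, dtype: int64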
Question
How do I get the following result without having to assign a function dictionary for every column?
df.groupby(level=0).agg({'one': {'SUM': 'sum', 'HowMany': 'count'},
                         'two': {'SUM': 'sum', 'HowMany': 'count'}})
What I've done so far
Consider the df:
import pandas as pd
import numpy as np
idx = pd.MultiIndex.from_product([['A', 'B'], ['One', 'Two']],
                                 names=['Alpha', 'Numeric'])
df = pd.DataFrame(np.arange(8).reshape(4, 2), idx, ['one', 'two'])
df
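which displays as:

               one  two
Alpha Numeric
A     One        0    1
      Two        2    3
B     One        4    5
      Two        6    7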
I want to use groupby().agg() where I run the set of functions and rename their output columns.
This works fine.
df.groupby(level=0).agg({'one': {'SUM': 'sum', 'HowMany': 'count'}})
But I want to do this for all columns. I could do this:
df.groupby(level=0).agg(['sum', 'count'])
But I'm missing the great renaming I've done. I'd hoped that this would work:
df.groupby(level=0).agg({'SUM': 'sum', 'HowMany': 'count'})
But it doesn't. I get this error:
KeyError: 'SUM'
This makes sense. Pandas is looking at the keys of the passed dictionary for column names. It's how I got the example at the start to work.
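For instance, this plain dict form works precisely because its keys are real columns:

# Keys name existing columns; there is no renaming here.
df.groupby(level=0).agg({'one': 'sum', 'two': 'count'})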
You can use set_levels:
g = df.groupby(level=0).agg(['sum', 'count'])
# Newer pandas removed the inplace option from set_levels, so assign back
g.columns = g.columns.set_levels(['SUM', 'HowMany'], level=1)
g
      one             two
      SUM HowMany     SUM HowMany
Alpha
A       2       2       4       2
B      10       2      12       2
Is using .rename() an option for you?
In [7]: df.groupby(level=0).agg(['sum', 'count']).rename(columns=dict(sum='SUM', count='HowMany'))
Out[7]:
      one             two
      SUM HowMany     SUM HowMany
Alpha
A       2       2       4       2
B      10       2      12       2
This is an ugly answer:
gb = df.stack(0).groupby(level=[0, -1])
# Dict-based renaming on a SeriesGroupBy was removed in pandas 1.0;
# named aggregation achieves the same renaming
df1 = gb.agg(SUM='sum', HowMany='count')
df1.unstack().swaplevel(0, 1, axis=1).sort_index(axis=1, level=0)
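For completeness: newer pandas (0.25+) has named aggregation, which arrived after the nested-dict renaming from the question was deprecated. It handles the renaming directly, at the cost of spelling out each output and producing flat rather than MultiIndex columns; a minimal sketch on the same df:

import pandas as pd
import numpy as np

idx = pd.MultiIndex.from_product([['A', 'B'], ['One', 'Two']],
                                 names=['Alpha', 'Numeric'])
df = pd.DataFrame(np.arange(8).reshape(4, 2), idx, ['one', 'two'])

# Each keyword names one output column: new_name=(column, aggfunc)
out = df.groupby(level=0).agg(
    one_SUM=('one', 'sum'), one_HowMany=('one', 'count'),
    two_SUM=('two', 'sum'), two_HowMany=('two', 'count'),
)
print(out)
#        one_SUM  one_HowMany  two_SUM  two_HowMany
# Alpha
# A            2            2        4            2
# B           10            2       12            2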