Is there a way in Pandas to create a new column that is a function of two other columns' aggregates, so that it is preserved under any arbitrary grouping? This would be functionally similar to creating a calculated column in Excel and pivoting by labels.
df1 = pd.DataFrame({'lab': ['lab1', 'lab2'] * 5, 'A': [1, 2] * 5, 'B': [4, 5] * 5})
df1['C'] = df1.apply(lambda x: x['A'] / x['B'], axis=1)
pd.pivot_table(df1, index='lab', aggfunc={'A': sum, 'B': sum, 'C': lambda x: x['A'] / x['B']})  # pseudocode: aggfunc only ever sees one column at a time
should return:
| lab  | A  | B  | C   |
|------|----|----|-----|
| lab1 | 5  | 20 | .25 |
| lab2 | 10 | 25 | .4  |
I'd like to aggregate by 'lab' (or any combination of labels) and have the DataFrame return the aggregation without having to redefine the column calculation. I realize this is trivial to code manually, but it gets repetitive when you have many columns.
There are two ways you can do this, using apply or agg:
import numpy as np
import pandas as pd
# Method 1
df1.groupby('lab').apply(lambda df: pd.Series({'A': df['A'].sum(),
                                               'B': df['B'].sum(),
                                               'C': df['C'].unique()[0]})).reset_index()
# Method 2
df1.groupby('lab').agg({'A': 'sum',
                        'B': 'sum',
                        'C': lambda x: np.unique(x)[0]}).reset_index()  # [0] reduces the unique array to a scalar
# output
lab A B C
0 lab1 5 20 0.25
1 lab2 10 25 0.40
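If the goal is to reuse the same derived-column definition under arbitrary groupings, another option is to store each calculation once and reapply it after aggregating. This is my own sketch, not part of the answer above; the helper name and the derived dict are illustrative:
import pandas as pd

df1 = pd.DataFrame({'lab': ['lab1', 'lab2'] * 5, 'A': [1, 2] * 5, 'B': [4, 5] * 5})

# define each derived column once, as a function of the aggregated frame
derived = {'C': lambda d: d['A'] / d['B']}

def agg_with_derived(df, by):
    # aggregate the base columns, then reapply every derived column
    out = df.groupby(by)[['A', 'B']].sum()
    return out.assign(**derived)

agg_with_derived(df1, 'lab')
       A   B     C
lab
lab1   5  20  0.25
lab2  10  25  0.40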
Related
I have a Pyspark DataFrame and I want to factorize the entire df instead of each column separately, to avoid two different values in two columns being assigned the same factorized code. I could do it with pandas as follows:
_, b = pd.factorize(df.values.T.reshape(-1))
df = df.apply(lambda x: pd.Categorical(x, b).codes)
df = df.replace(-1, np.nan)
Does anyone know how to do the same in Pyspark? Thank you very much.
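No Pyspark answer appears in this thread, but one possible approach is to collect the distinct values across all columns into a single code table and map every column through it. This is a sketch under the assumption that the set of distinct values is small enough to collect to the driver; df here is the Pyspark DataFrame from the question:
from pyspark.sql import functions as F

cols = df.columns

# stack every column into one and collect the distinct non-null values:
# this is the shared vocabulary across all columns
uniques = (df.select(F.explode(F.array(*cols)).alias('v'))
             .where(F.col('v').isNotNull())
             .distinct()
             .collect())
codes = {row['v']: i for i, row in enumerate(uniques)}

# build a literal map expression and apply it to every column;
# values missing from the map (nulls) stay null, like the NaN above
mapping = F.create_map(*[F.lit(x) for kv in codes.items() for x in kv])
df = df.select(*[mapping[F.col(c)].alias(c) for c in cols])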
I want to use Pandas apply to create a new column, and I want this functionality to be fail-safe even if the DataFrame is empty. Here is a minimal example that works as expected:
df = pd.DataFrame(np.array([[1, 2], [3, 4]]), columns=['a', 'b'])  # two columns
add = lambda x: x['a'] + x['b']  # add columns a and b
df['c'] = df.apply(add, axis=1)  # creates new column c, as anticipated
However, it gets problematic when df happens to be empty. Consider the following example, where the DataFrame is now empty but otherwise identical:
df = pd.DataFrame(columns=['a', 'b'])  # two columns, but no values
df['c'] = df.apply(add, axis=1)  # raises an error!
How can I execute this last line safely, so that it simply appends a column 'c' to the DataFrame even when df is empty?
Interestingly enough, this works:
df.apply(add, axis=1)
but the result cannot be assigned as column 'c'.
If you want to create a new column c based on the sum of columns a and b, then you can just do the following:
df['c'] = df['a'] + df['b']  # creates new column c, as anticipated :)
That way, you don't need to bind a lambda expression to the name add (PEP 8 recommends against assigning lambda expressions to names).
import numpy as np
import pandas as pd
df = pd.DataFrame(np.array([[1, 2], [3, 4]]), columns=['a', 'b']) # two columns
print(df)
a b
0 1 2
1 3 4
df['c'] = df['a'] + df['b'] # creates new column c, as anticipated
print(df)
a b c
0 1 2 3
1 3 4 7
df = pd.DataFrame(columns=['a', 'b']) # two columns, but no values
df['c'] = df['a'] + df['b'] # creates new column c, as anticipated
print(df)
Empty DataFrame
Columns: [a, b, c]
Index: []
Even if the DataFrame is empty, the method above still works.
If one axis (rows or columns) is empty, the apply function returns an empty result. Your lambda function returns a pandas.Series, so to handle an empty pandas.DataFrame you need to be explicit about the result type of the apply method and use the reduce mode.
‘reduce’ : returns a Series if possible rather than expanding list-like results. This is the opposite of ‘expand’.
This will work:
df = pd.DataFrame(columns=['a','b'])
df['c'] = df.apply(add, axis=1, result_type='reduce')
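The same call behaves exactly as before on a non-empty frame, so it can be used unconditionally; a quick check (my own sketch):
df_full = pd.DataFrame({'a': [1, 3], 'b': [2, 4]})
df_full['c'] = df_full.apply(add, axis=1, result_type='reduce')
print(df_full)
   a  b  c
0  1  2  3
1  3  4  7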
I want to group this dataframe by column a, and create a new column (d) with all values from both column b and column c.
data_dict = {'a': list('aabbcc'),
             'b': list('123456'),
             'c': list('xxxyyy')}
df = pd.DataFrame(data_dict)
From this:
   a  b  c
0  a  1  x
1  a  2  x
2  b  3  x
3  b  4  y
4  c  5  y
5  c  6  y
to this:
       d
a
a  1x,2x
b  3x,4y
c  5y,6y
I've figured out one way of doing it,
df['d'] = df['b'] + df['c']
df.groupby('a').agg({'d': lambda x: ','.join(x)})
but is there a more pandas way?
I think "more pandas" is hard to define, but you are able to groupby agg directly on the series if you're trying to avoid the temp column:
g = (df['b'] + df['c']).groupby(df['a']).agg(','.join).to_frame('d')
g:
d
a
a 1x,2x
b 3x,4y
c 5y,6y
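If you'd rather keep the intermediate column but avoid mutating df, the same idea also reads well as a method chain with assign; this is just a stylistic variant of the approach in the question:
(df.assign(d=df['b'] + df['c'])
   .groupby('a')
   .agg({'d': ','.join}))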
I have a dataframe, and one of its columns contains a nested dictionary. I want to write a function that takes each row plus a column name and json_normalizes that column into a dataframe. However, I keep getting the error 'function takes 2 positional arguments but 6 were given'. There are more than 6 columns in the dataframe and more than 6 columns in row[col] (see below), so I am confused as to how 6 arguments are being provided.
import pandas as pd
from pandas import json_normalize  # pandas >= 1.0; formerly pandas.io.json

def fix_row_(row, col):
    if isinstance(row[col], list):
        df = json_normalize(row[col])
        df['id'] = row['id']
    else:
        df = pd.DataFrame()
    return df

new_df = data.apply(lambda x: fix_row_(x, 'Items'), axis=1)
So new_df will be a Series of dataframes. In the example below, each element would just be a dataframe with A, B, C as columns and 1, 2, 3 as the values.
Quasi-reproducible example:
my_dict = {'A': 1, 'B': 2, 'C': 3}
ids = pd.Series(['id1', 'id2', 'id3'], name='ids')
data = pd.DataFrame(ids)
data['my_column'] = ''
m = data['ids'].eq('id1')
data.loc[m, 'my_column'] = [my_dict] * m.sum()
Just pass the row along with your column name, using axis=1:
data.apply(lambda x: fix_row_(x, 'my_column'), axis=1)
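Since apply with axis=1 returns one dataframe per row here, you will usually want to concatenate them into a single frame afterwards; a short sketch, assuming fix_row_ is defined as in the question:
frames = data.apply(lambda x: fix_row_(x, 'my_column'), axis=1)
new_df = pd.concat(frames.tolist(), ignore_index=True)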
Question
How do I get the following result without having to assign a function dictionary for every column?
df.groupby(level=0).agg({'one': {'SUM': 'sum', 'HowMany': 'count'},
                         'two': {'SUM': 'sum', 'HowMany': 'count'}})
What I've done so far
Consider the df:
import pandas as pd
import numpy as np
idx = pd.MultiIndex.from_product([['A', 'B'], ['One', 'Two']],
                                 names=['Alpha', 'Numeric'])
df = pd.DataFrame(np.arange(8).reshape(4, 2), idx, ['one', 'two'])
df
I want to use groupby().agg() where I run the set of functions and rename their output columns.
This works fine.
df.groupby(level=0).agg({'one': {'SUM': 'sum', 'HowMany': 'count'}})
But I want to do this for all columns. I could do this:
df.groupby(level=0).agg(['sum', 'count'])
But I'm missing the great renaming I've done. I'd hoped that this would work:
df.groupby(level=0).agg({'SUM': 'sum', 'HowMany': 'count'})
But it doesn't. I get this error:
KeyError: 'SUM'
This makes sense. Pandas is looking at the keys of the passed dictionary for column names. It's how I got the example at the start to work.
You can use set_levels:
g = df.groupby(level=0).agg(['sum', 'count'])
g.columns = g.columns.set_levels(['SUM', 'HowMany'], level=1)
g
      one          two
      SUM HowMany  SUM HowMany
Alpha
A       2       2    4       2
B      10       2   12       2
Is using .rename() an option for you?
In [7]: df.groupby(level=0).agg(['sum', 'count']).rename(columns=dict(sum='SUM', count='HowMany'))
Out[7]:
one two
SUM HowMany SUM HowMany
Alpha
A 2 2 4 2
B 10 2 12 2
This is an ugly answer:
gb = df.stack(0).groupby(level=[0, -1])
df1 = gb.agg(SUM='sum', HowMany='count')
df1.unstack().swaplevel(0, 1, axis=1).sort_index(axis=1, level=0)
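For completeness: the dict-of-dicts renaming from the question was deprecated in pandas 0.20 and removed in 1.0. On modern pandas, named aggregation can produce the renamed outputs for every column without spelling each one out by hand; a sketch (the spec dict is my own construction, and the output columns come out flat, e.g. one_SUM):
spec = {f'{c}_{name}': (c, func)
        for c in df.columns
        for name, func in [('SUM', 'sum'), ('HowMany', 'count')]}
df.groupby(level=0).agg(**spec)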