I have the following DataFrame, to which I apply groupby() and sum():
import numpy as np
import pandas as pd

d = {'col1': ["A", "A", "A", "B", "B", "B", "C", "C", "C"], 'col2': [1, 2, 3, 4, 5, 6, np.nan, np.nan, np.nan]}
df = pd.DataFrame(data=d)
df.groupby("col1").sum()
This results in the following:
      col2
col1
A      6.0
B     15.0
C      0.0
I want C to show NaN instead of 0, since all of the values for C are NaN. How can I accomplish this? apply() with a lambda function? Any help would be appreciated.
Use this:
df.groupby('col1').apply(pd.DataFrame.sum, skipna=False).reset_index(drop=True)
# or: df.groupby('col1', as_index=False).apply(pd.DataFrame.sum, skipna=False)
Output (summing the string column col1 concatenates each group's labels, which is why they show as AAA, BBB, CCC):
  col1  col2
0  AAA   6.0
1  BBB  15.0
2  CCC   NaN
Without the apply(), thanks to @piRSquared and @Alollz: min_count=1 makes all-NaN groups return NaN, while groups that contain at least one value still return their sum:
df.set_index('col1').sum(level=0, min_count=1).reset_index()
Thanks to @piRSquared, @Alollz, and @anky_91, you can also do it without setting and resetting the index:
d = {'col1': ["A", "A", "A", "B", "B", "B", "C", "C","C"], 'col2': [1,2,3,4,5,6, np.nan, np.nan, np.nan]}
df = pd.DataFrame(data=d)
df.groupby("col1", as_index=False).sum(min_count=1)
Output:
col1 col2
0 A 6.0
1 B 15.0
2 C NaN
Make the call to sum() pass the parameter skipna=False.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sum.html
That link should provide the documentation you need, and I expect it will fix your problem.
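A quick sketch of what skipna and min_count do on a plain Series.sum call, which is what the answers above ultimately rely on (expected results shown as comments):
import numpy as np
import pandas as pd

s = pd.Series([np.nan, np.nan])
print(s.sum())              # 0.0 -- NaNs are skipped by default
print(s.sum(skipna=False))  # nan -- any NaN makes the whole sum NaN
print(s.sum(min_count=1))   # nan -- fewer than one valid value yields NaN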
Related
I am trying to reindex the rows of a DataFrame, but it's displaying NaN values, and I am not able to understand why.
import pandas as pd

data = {
"age": [50, 40, 30, 40],
"qualified": [True, False, False, False]
}
index = ["P", "Q", "R", "S"]
df = pd.DataFrame(data, index=index)
new = ["A", "B", "C", "D"]
newdf = df.reindex(new)
print(newdf)
Output:
age qualified
A NaN NaN
B NaN NaN
C NaN NaN
D NaN NaN
I think you need DataFrame.set_index, with a nested list if you need to replace the index values with new ones:
new = ["A", "B", "C", "D"]
newdf = df.set_index([new])
# alternative:
# newdf.index = new
print(newdf)
age qualified
A 50 True
B 40 False
C 30 False
D 40 False
DataFrame.reindex works differently: it creates a new index from the list and aligns the data to it. That means it first matches the existing index values against the new list new, and creates NaNs for the values that do not match:
data = {
"age": [50, 40, 30, 40],
"qualified": [True, False, False, False]
}
index = ["A", "Q", "D", "C"]
df = pd.DataFrame(data, index=index)
new = ["A", "B", "C"]
newdf = df.reindex(new)
print(newdf)
age qualified
A 50.0 True
B NaN NaN
C 40.0 False
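As a side note not covered in the original answer, reindex also accepts fill_value if you want something other than NaN for the missing labels (a small sketch; exact dtypes may vary):
newdf = df.reindex(new, fill_value=0)
print(newdf)
#    age qualified
# A   50      True
# B    0         0
# C   40     False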
I have a DataFrame in Python like the one below:
import pandas as pd

df = pd.DataFrame({"col1": ["a", "b", "c"], "col2": ["a", "d", "e"], "col3": ["r", "t", "g"]})
And I would like to check which columns contain the value "a" (of course "col1" and "col2"). How can I check it?
(df=='a').any()
col1     True
col2     True
col3    False
dtype: bool
If you need the column names as a list, compare all values with DataFrame.eq, check with DataFrame.any whether there is at least one True (match) per column, and finally filter the column names:
c = df.columns[df.eq('a').any()].tolist()
print (c)
['col1', 'col2']
If you need to filter the matching columns into a new DataFrame, use DataFrame.loc:
df1 = df.loc[:, df.eq('a').any()]
print (df1)
col1 col2
0 a a
1 b d
2 c e
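If you also want to know which rows match, one option not in the original answer is to stack the boolean mask and keep the True entries (a sketch):
s = df.eq('a').stack()
print(s[s].index.tolist())
# [(0, 'col1'), (0, 'col2')] -- (row, column) pairs where the value is "a"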
Sample data frame in Python:
import pandas as pd

d = {'col1': ["a", "a", "a", "b", "b", "b", "c", "c", "c"],
     'col2': [3, 4, 5, 1, 3, 9, 5, 7, 23]}
df = pd.DataFrame(data=d)
Now I want to get the same output in Python with pandas as I get in R with the code below, i.e. the percentage change in col2 within each group defined by col1.
data.frame(col1 = c("a", "a", "a", "b", "b", "b", "c", "c", "c"),
col2 = c(3, 4, 5, 1, 3, 9, 16, 18, 23)) -> df
df %>%
dplyr::group_by(col1) %>%
dplyr::mutate(perc = (dplyr::last(col2) - col2[1]) / col2[1])
In Python, I tried:
def perc_change(column):
index_1 = tu_in[column].iloc[0]
index_2 = tu_in[column].iloc[-1]
perc_change = (index_2 - index_1) / index_1
return(perc_change)
d = {'col1': ["a", "a", "a", "b", "b", "b", "c", "c", "c"],
'col2': [3, 4, 5, 1, 3, 9, 5, 7, 23]}
df = pd.DataFrame(data=d)
df.assign(perc_change = lambda x: x.groupby["col1"]["col2"].transform(perc_change))
But it gives me an error saying: 'method' object is not subscriptable.
I am new to python and trying to convert some R code into python. How can I solve this in an elegant way? Thank you!
You don't want transform here. transform is typically used when your aggregation returns a scalar value per group and you want to broadcast that result to all rows that belong to that group in the original DataFrame. Because GroupBy.pct_change already returns a result indexed like the original, you can assign it straight back:
df['perc_change'] = df.groupby('col1')['col2'].pct_change()
# col1 col2 perc_change
#0 a 3 NaN
#1 a 4 0.333333
#2 a 5 0.250000
#3 b 1 NaN
#4 b 3 2.000000
#5 b 9 2.000000
#6 c 5 NaN
#7 c 7 0.400000
#8 c 23 2.285714
But if what you need instead is the overall percentage change within a group, i.e. the difference between the last and first values divided by the first value, then you do want transform:
df.groupby('col1')['col2'].transform(lambda x: (x.iloc[-1] - x.iloc[0])/x.iloc[0])
0 0.666667
1 0.666667
2 0.666667
3 8.000000
4 8.000000
5 8.000000
6 3.600000
7 3.600000
8 3.600000
Name: col2, dtype: float64
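As an aside, the error in the question ('method' object is not subscriptable) comes from x.groupby["col1"]: groupby is a method and needs parentheses, not square brackets. A fixed sketch of the original assign attempt, reusing the overall-change logic from above:
df = df.assign(
    perc_change=lambda x: x.groupby("col1")["col2"]
                           .transform(lambda s: (s.iloc[-1] - s.iloc[0]) / s.iloc[0])
)
The helper function from the question would also work if it operated on the Series it receives rather than on the global tu_in.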
I want to calculate means by group, leaving out the value of the row itself.
import pandas as pd
d = {'col1': ["a", "a", "b", "a", "b", "a"], 'col2': [0, 4, 3, -5, 3, 4]}
df = pd.DataFrame(data=d)
I know how to return means by group:
df.groupby('col1').agg({'col2': 'mean'})
Which returns:
Out[247]:
      col2
col1
a     0.75
b     3.00
But what I want is the mean by group, leaving out each row's own value. E.g. for the first row:
df.query('col1 == "a"')[1:4].mean()
which returns:
Out[251]:
col2 1.0
dtype: float64
Edit:
Expected output is a dataframe of the same format as df above, with a column mean_excl_own which is the mean across all other members in the group, excluding the row's own value.
You could group by col1 and transform: subtracting each row's value from the group sum and dividing by the group size minus one gives the mean of the other rows:
df['mean_excl_own'] = df.groupby('col1')['col2'].transform(lambda x: (x.sum() - x) / (x.count() - 1))
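For the sample data above this yields the following (matching the 1.0 computed manually for the first row):
print(df['mean_excl_own'])
# 0    1.000000
# 1   -0.333333
# 2    3.000000
# 3    2.666667
# 4    3.000000
# 5   -0.333333
# Name: mean_excl_own, dtype: float64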
Thanks for all your input. I ended up using the approach linked to by @VnC.
Here's how I solved it:
import pandas as pd
d = {'col1': ["a", "a", "b", "a", "b", "a"], 'col2': [0, 4, 3, -5, 3, 4]}
df = pd.DataFrame(data=d)
group_summary = df.groupby('col1', as_index=False)['col2'].agg(['mean', 'count'])
df = pd.merge(df, group_summary, on = 'col1')
df['other_sum'] = df['count'] * df['mean'] - df['col2']  # group sum minus the row's own value
df['result'] = df['other_sum'] / (df['count'] - 1)
Check out the final result:
df['result']
Which prints:
Out:
0 1.000000
1 -0.333333
2 2.666667
3 -0.333333
4 3.000000
5 3.000000
Name: result, dtype: float64
Edit: I previously had some trouble with column names, but I fixed it using this answer.
Note: my question isn't this one, but something a little more subtle.
Say I have a dataframe that looks like this
df =
A B C
0 3 3 1
1 2 1 9
df[["A", "B", "D"]] will raise a KeyError.
Is there a python pandas way to let df[["A", "B", "D"]] == df[["A", "B"]]? (Ie: just select the columns that exist.)
One solution might be
good_columns = list(set(df.columns).intersection(["A", "B", "D"]))
mydf = df[good_columns]
But this has two problems:
It's clunky and inelegant.
The ordering of mydf.columns could be ["A", "B"] or ["B", "A"].
You can use DataFrame.filter; this will just ignore any keys that don't exist:
df.filter(["A","B","D"])
A B
0 3 3
1 2 1
You can use a conditional list comprehension, which also preserves the order of target_cols:
target_cols = ['A', 'B', 'D']
df[[c for c in target_cols if c in df]]
A B
0 3 3
1 2 1
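A third option, not from the original answers: Index.intersection avoids the set() round-trip and, with its default sort=False, keeps the columns in df's own order (a sketch):
mydf = df[df.columns.intersection(["A", "B", "D"])]
print(mydf)
#    A  B
# 0  3  3
# 1  2  1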