I'm wondering how to aggregate data within a grouped pandas DataFrame using a function that takes into account the value stored in some other column of the DataFrame. This would be useful for operations where the order of the operands matters, such as division.
For example I have:
In [8]: df
Out[8]:
class cat xer
0 a 1 2
1 b 1 4
2 c 1 9
3 a 2 6
4 b 2 8
5 c 2 3
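For reference, the frame can be reproduced with:
import pandas as pd

df = pd.DataFrame({'class': ['a', 'b', 'c', 'a', 'b', 'c'],
                   'cat':   [1, 1, 1, 2, 2, 2],
                   'xer':   [2, 4, 9, 6, 8, 3]})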
I want to group by class and, for each class, divide the xer value corresponding to cat == 1 by the one for cat == 2. In other words, the entries in the final output should be:
class div
0 a 0.33 (i.e. 2/6)
1 b 0.5 (i.e. 4/8)
2 c 3 (i.e. 9/3)
Is this possible to do using groupby? I can't quite figure out how to do it without manually iterating through each class and even so it's not clean or fun.
Without doing anything too clever:
In [11]: one = df[df["cat"] == 1].set_index("class")["xer"]
In [12]: two = df[df["cat"] == 2].set_index("class")["xer"]
In [13]: one / two
Out[13]:
class
a 0.333333
b 0.500000
c 3.000000
Name: xer, dtype: float64
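To match the desired output shape, you can rename the result on the way out (a small sketch):
(one / two).reset_index(name='div')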
Given your DataFrame, you can use the following:
from functools import reduce
import numpy as np

df.groupby('class').agg({'xer': lambda L: reduce(np.divide, L)})
Which gives you:
xer
class
a 0.333333
b 0.500000
c 3.000000
This caters for more than two values per group (if need be), but you should make sure your df is sorted by cat first so the values appear in the right order, as sketched below.
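A sketch of that, sorting before grouping (same frame and column names as above):
import numpy as np
from functools import reduce

result = (df.sort_values('cat')
            .groupby('class')
            .agg({'xer': lambda vals: reduce(np.divide, vals)}))
With more than two cat levels per class, reduce(np.divide, [a, b, c]) evaluates left to right as (a / b) / c.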
You may want to rearrange your data to make it easier to view:
df2 = df.set_index(['class', 'cat']).unstack()
>>> df2
xer
cat 1 2
class
a 2 6
b 4 8
c 9 3
You can then do the following to get your desired result:
>>> df2.iloc[:, 0].div(df2.iloc[:, 1])
class
a 0.333333
b 0.500000
c 3.000000
Name: (xer, 1), dtype: float64
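Equivalently, you can select the unstacked columns by label rather than position; the keys below are the ('xer', cat) pairs created by the unstack:
df2[('xer', 1)] / df2[('xer', 2)]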
This is one approach, step by step:
# get cat==1 and cat==2 merged by class
grouped = df[df['cat'] == 1].merge(df[df['cat'] == 2], on='class')
# calculate div
grouped['div'] = grouped.xer_x / grouped.xer_y
# return the final dataframe
grouped[['class', 'div']]
which yields:
class div
0 a 0.333333
1 b 0.500000
2 c 3.000000
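If you prefer, the same steps chain into one expression (a sketch, assuming exactly one cat==1 and one cat==2 row per class):
out = (df[df['cat'] == 1]
         .merge(df[df['cat'] == 2], on='class')
         .assign(div=lambda g: g['xer_x'] / g['xer_y'])
         [['class', 'div']])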
I am using apply to leverage one dataframe to manipulate a second dataframe and return results. Here is a simplified example that I realize could be more easily answered with "in" logic, but for now let's keep the use of .apply() as a constraint:
import pandas as pd
df1 = pd.DataFrame({'Name':['A','B'],'Value':range(1,3)})
df2 = pd.DataFrame({'Name':['A']*3+['B']*4+['C'],'Value':range(1,9)})
def filter_df(x, df):
    return df[df['Name'] == x['Name']]
df1.apply(filter_df, axis=1, args=(df2, ))
Which is returning:
0 Name Value
0 A 1
1 A 2
2 ...
1 Name Value
3 B 4
4 B 5
5 ...
dtype: object
What I would like to see instead is one formatted DataFrame with Name and Value headers. All advice appreciated!
Name Value
0 A 1
1 A 2
2 A 3
3 B 4
4 B 5
5 B 6
6 B 7
In my opinion, this cannot be done with apply alone; you need pandas.concat:
result = pd.concat(df1.apply(filter_df, axis=1, args=(df2,)).to_list())
print(result)
Output
Name Value
0 A 1
1 A 2
2 A 3
3 B 4
4 B 5
5 B 6
6 B 7
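For comparison, the "in" logic the question set aside would reduce to a single filter (a sketch without the apply constraint):
result = df2[df2['Name'].isin(df1['Name'])]
which keeps exactly the A and B rows shown above.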
I am trying to select single rows from a bunch of dataframes and build a new dataframe by concatenating them together.
Here is a simple example
x=pd.DataFrame([[1,2,3],[1,2,3]],columns=["A","B","C"])
A B C
0 1 2 3
1 1 2 3
a=x.loc[0,:]
A 1
B 2
C 3
Name: 0, dtype: int64
b=x.loc[1,:]
A 1
B 2
C 3
Name: 1, dtype: int64
c=pd.concat([a,b])
I end up with this:
A 1
B 2
C 3
A 1
B 2
C 3
Name: 0, dtype: int64
Whereas I would expect the original data frame:
A B C
0 1 2 3
1 1 2 3
I can get the values and create a new dataframe, but this doesn't seem like the way to do it.
If you want to stack two Series vertically as the rows of a frame, one option is a concat and transpose, as sketched below.
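For example, with the a and b from the question:
pd.concat([a, b], axis=1).T

   A  B  C
0  1  2  3
1  1  2  3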
Another is using np.vstack:
import numpy as np

pd.DataFrame(np.vstack([a, b]), columns=a.index)
A B C
0 1 2 3
1 1 2 3
Since you are slicing by index, I'd use .iloc, and note the difference between [[]] and [], which return a DataFrame and a Series respectively.*
a = x.iloc[[0]]
b = x.iloc[[1]]
pd.concat([a, b])
# A B C
#0 1 2 3
#1 1 2 3
To still use .loc, you'd do something like
a = x.loc[[0,]]
b = x.loc[[1,]]
*There's a small caveat that if index 0 is duplicated in x then x.loc[0,:] will return a DataFrame and not a Series.
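To see that caveat in action, here is a hypothetical frame with a duplicated index:
y = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["A", "B", "C"], index=[0, 0])
type(y.loc[0, :])   # DataFrame, because label 0 matches two rows
type(y.iloc[0])     # Series; positional access is unambiguous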
It looks like you want to make a new dataframe from a collection of records. There's a method for that:
import pandas as pd
x = pd.DataFrame([[1,2,3],[1,2,3]], columns=["A","B","C"])
a = x.loc[0,:]
b = x.loc[1,:]
c = pd.DataFrame.from_records([a, b])
print(c)
# A B C
# 0 1 2 3
# 1 1 2 3
After encountering code along the lines of df[c] = df[c].apply(lambda x: (x - x.min()) / (x.max() - x.min())), I was confused about the usage of both .apply and lambda. Firstly, does .apply apply the desired change to all elements of all the specified columns at once, or to each column one by one? Secondly, does the x in lambda x: iterate through every element of the specified columns, or through the columns separately? Thirdly, do x.min() and x.max() give the minimum and maximum over all the elements in the specified columns, or of each column separately? Any answer explaining the whole process would make me more than grateful.
Thanks.
I think it is best to avoid apply here (it loops under the hood) and instead work with a subset of the DataFrame selected by a list of columns:
df = pd.DataFrame({'A': list('abcdef'),
                   'B': [4, 5, 4, 5, 5, 4],
                   'C': [7, 8, 9, 4, 2, 3],
                   'D': [1, 3, 5, 7, 1, 0],
                   'E': [5, 3, 6, 9, 2, 4],
                   'F': list('aaabbb')})
print(df)

c = ['B', 'C', 'D']
First select the minimum of the chosen columns (and similarly the maximum):
print (df[c].min())
B 4
C 2
D 0
dtype: int64
Then subtract and divide:
print ((df[c] - df[c].min()))
B C D
0 0 5 1
1 1 6 3
2 0 7 5
3 1 2 7
4 1 0 1
5 0 1 0
print (df[c].max() - df[c].min())
B 1
C 7
D 7
dtype: int64
df[c] = (df[c] - df[c].min()) / (df[c].max() - df[c].min())
print (df)
A B C D E F
0 a 0.0 0.714286 0.142857 5 a
1 b 1.0 0.857143 0.428571 3 a
2 c 0.0 1.000000 0.714286 6 a
3 d 1.0 0.285714 1.000000 9 b
4 e 1.0 0.000000 0.142857 2 b
5 f 0.0 0.142857 0.000000 4 b
EDIT:
To debug what apply does, it is best to create a custom function:
def f(x):
    # each call receives one column as a Series
    print(x)
    # scalar - the column minimum
    print(x.min())
    # new Series - the normalised column
    print((x - x.min()) / (x.max() - x.min()))
    return (x - x.min()) / (x.max() - x.min())

df[c] = df[c].apply(f)
print(df)
Check whether the data are really being normalised: if apply ends up passing single values rather than whole columns, x.min() and x.max() simply return that single value, and no normalisation occurs.
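One way to check is to compare what x actually is in each case (a sketch using the df and c from the answer above; only the types matter here):
# DataFrame.apply passes whole columns, so x is a Series and
# x.min() is the per-column minimum:
print(df[c].apply(lambda x: x.min()))

# Series.apply passes one element at a time, so x is a scalar and
# x.min() is just x itself -- no aggregation, hence no normalisation:
print(df['B'].apply(lambda x: x.min()))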
I just tried sorting my dataframe and used the following function:
df[df.count >= df.count.quantile(.95)]
It returned the error:
AttributeError: 'function' object has no attribute 'quantile'
But bracketing the series works fine:
df[df['count'] >= df['count'].quantile(.95)]
It's not the first time I've gotten different results based on this distinction, though usually the two behave the same, and I always thought they were identical objects.
Why does this happen?
Because count is one of the DataFrame's built-in methods, when you use . it is resolved to that method instead of the column count; i.e., dot access prioritizes built-in attributes over columns:
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [2, 3, 4],
    'count': [4, 5, 6]
})
df.count()
#A 3
#B 3
#count 3
#dtype: int64
df.count
#<bound method DataFrame.count of A B count
#0 1 2 4
#1 2 3 5
#2 3 4 6>
Another distinction between dot and bracket is that you cannot use dot to create a new column; i.e., if the column doesn't exist, df.column = ... won't work, and you have to use brackets, as in df[column] = .... Using the above dummy data frame:
# original data frame
df
# A B count
#0 1 2 4
#1 2 3 5
#2 3 4 6
Using dot to create a new column won't work; C is set as an attribute instead of a column:
df.C = 2
df
# A B count
#0 1 2 4
#1 2 3 5
#2 3 4 6
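The value is still reachable, just as an attribute rather than a column (a quick check):
df.C                 # 2
'C' in df.columns    # False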
While bracket is the standard way to add a new column to the data frame:
df['C'] = 2
df
# A B count C
#0 1 2 4 2
#1 2 3 5 2
#2 3 4 6 2
If a column already exists, it's valid to modify it with dot, assuming the data frame doesn't have an attribute with the same name (as is the case with count above):
df.B = 3
df
# A B count C
#0 1 3 4 2
#1 2 3 5 2
#2 3 3 6 2
Let's say I create the following dataframe with df.set_index(['Class', 'subclass']), bearing in mind there are multiple Classes with subclasses, A through Z.
Class subclass
A a
A b
A c
A d
B a
B b
How would I count the subclasses in each Class and create a separate column named No. of classes, so that I can see the Class with the greatest number of subclasses? I was thinking of some sort of for loop which runs through the Class letters and counts the subclasses while the Class letter stays the same. However, this seems counterintuitive for such a problem. Would there be a simpler approach, such as df.groupby(...).count()?
The desired output would be:
Class subclass No. of classes
A a 4
A b
A c
A d
B a 2
B b
I have tried the level parameter as shown in "group multi-index pandas dataframe", but this doesn't seem to work for me.
EDIT:
I did not mention that I wanted a return of the Class with the greatest number of subclasses. I achieved this with:
df.reset_index().groupby('Class')['subclass'].nunique().idxmax()
You can use transform, but you get duplicated values:
df['No. of classes'] = df.groupby(level='Class')['val'].transform('size')
print (df)
val No. of classes
Class subclass
A a 1 4
b 4 4
c 5 4
d 4 4
B a 1 2
b 2 2
But if you need empty values (import numpy as np for the NaNs):
df['No. of classes'] = (df.groupby(level='Class')
                          .apply(lambda x: pd.Series([len(x)] + [np.nan] * (len(x) - 1)))
                          .values)
print (df)
val No. of classes
Class subclass
A a 1 4.0
b 4 NaN
c 5 NaN
d 4 NaN
B a 1 2.0
b 2 NaN
Another solution to get the Class with the greatest number is:
df = (df.groupby(level='Class')
        .apply(lambda x: x.index.get_level_values('subclass').nunique())
        .idxmax())
print (df)
A
You can use transform to add an aggregated calculation back to the original df as a new column:
In [165]:
df['No. of classes'] = df.groupby('Class')['subclass'].transform('count')
df
Out[165]:
Class subclass No. of classes
0 A a 4
1 A b 4
2 A c 4
3 A d 4
4 B a 2
5 B b 2
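If you also want the Class with the greatest count, as in the question's edit, one option building on this column is (a sketch):
df.groupby('Class')['subclass'].count().idxmax()   # -> 'A'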