Selecting rows from pandas by a subset of a MultiIndex

I have a multiindex dataframe in pandas, with 4 columns in the index, and some columns of data. An example is below:
import pandas as pd
import numpy as np
cnames = ['K1', 'K2', 'K3', 'K4', 'D1', 'D2']
rdata = pd.DataFrame(np.random.randint(1, 3, size=(8, len(cnames))), columns=cnames)
rdata.set_index(cnames[:4], inplace=True)
rdata.sortlevel(inplace=True)
print(rdata)
             D1  D2
K1 K2 K3 K4
1  1  1  1    1   2
         1    1   2
      2  1    2   1
   2  1  2    2   1
      2  1    2   1
2  1  2  2    2   1
   2  1  2    1   1
         2    1   1

[8 rows x 2 columns]
What I want to do is select the rows where there are exactly 2 values at the K3 level. Not 2 rows, but two distinct values. I've found how to generate a sort of mask for what I want:
filterFunc = lambda x: len(set(x.index.get_level_values('K3'))) == 2
mask = rdata.groupby(level=cnames[:2]).apply(filterFunc)
print(mask)
K1  K2
1   1      True
    2      True
2   1     False
    2     False
dtype: bool
And I'd hoped that since rdata.loc[1, 2] allows you to match on just part of the index, it would be possible to do the same thing with a boolean vector like this. Unfortunately, rdata.loc[mask] fails with IndexingError: Unalignable boolean Series key provided.
This question seemed similar, but the answer given there doesn't work for anything other than the top level index, since index.get_level_values only works on a single level, not multiple ones.
Following the suggestion here I managed to accomplish what I wanted with
rdata[[mask.loc[k1, k2] for k1, k2, k3, k4 in rdata.index]]
However, both getting the count of distinct values using len(set(index.get_level_values(...))) and building the boolean vector afterwards by iterating over every row feel more like I'm fighting the framework to achieve something that should be simple in a multiindex setup. Is there a better solution?
This is using pandas 0.13.1.

There might be something better, but you could at least bypass defining mask by using groupby-filter:
rdata.groupby(level=cnames[:2]).filter(
    lambda grp: (grp.index.get_level_values('K3')
                 .unique().size) == 2)
Out[83]:
             D1  D2
K1 K2 K3 K4
1  1  1  1    1   2
         1    1   2
      2  1    2   1
   2  1  2    2   1
      2  1    2   1

[5 rows x 2 columns]
It is faster than my previous suggestions. It does really well for small DataFrames:
In [84]: %timeit rdata.groupby(level=cnames[:2]).filter(lambda grp: grp.index.get_level_values('K3').unique().size == 2)
100 loops, best of 3: 3.84 ms per loop
In [76]: %timeit rdata2.groupby(level=cnames[:2]).filter(lambda grp: grp.groupby(level=['K3']).ngroups == 2)
100 loops, best of 3: 11.9 ms per loop
In [77]: %timeit rdata2.groupby(level=cnames[:2]).filter(lambda grp: len(set(grp.index.get_level_values('K3'))) == 2)
100 loops, best of 3: 13.4 ms per loop
and is still the fastest for large DataFrames, though not by as much:
In [78]: rdata2 = pd.concat([rdata]*100000)
In [85]: %timeit rdata2.groupby(level=cnames[:2]).filter(lambda grp: grp.index.get_level_values('K3').unique().size == 2)
1 loops, best of 3: 756 ms per loop
In [79]: %timeit rdata2.groupby(level=cnames[:2]).filter(lambda grp: grp.groupby(level=['K3']).ngroups == 2)
1 loops, best of 3: 772 ms per loop
In [80]: %timeit rdata2.groupby(level=cnames[:2]).filter(lambda grp: len(set(grp.index.get_level_values('K3'))) == 2)
1 loops, best of 3: 1 s per loop
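If you'd rather keep the explicit mask from the question, you can also avoid the row-by-row list comprehension by broadcasting it back onto the full index. A sketch with newer pandas in mind, not verified on 0.13.1:
# Drop the inner levels so each row carries only its (K1, K2) labels,
# align the mask with those labels, and use the result as a boolean array.
aligned = mask.reindex(rdata.index.droplevel(['K3', 'K4']))
rdata[aligned.values]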


Pandas count NAs with a groupby for all columns [duplicate]

This question shows how to count NAs in a dataframe for a particular column C. How do I count NAs for all columns (that aren't the groupby column)?
Here is some test code that doesn't work:
#!/usr/bin/env python3
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [1, 1, 2, 2],
                   'b': [1, np.nan, 2, np.nan],
                   'c': [1, np.nan, 2, 3]})
# result = df.groupby('a').isna().sum()
# AttributeError: Cannot access callable attribute 'isna' of 'DataFrameGroupBy' objects, try using the 'apply' method
# result = df.groupby('a').transform('isna').sum()
# AttributeError: Cannot access callable attribute 'isna' of 'DataFrameGroupBy' objects, try using the 'apply' method
result = df.isna().groupby('a').sum()
print(result)
# result:
#          b    c
# a
# False  2.0  1.0
result = df.groupby('a').apply(lambda _df: df.isna().sum())
print(result)
# result:
#    a  b  c
# a
# 1  0  2  1
# 2  0  2  1
Desired output:
   b  c
a
1  1  1
2  1  0
It's best to avoid groupby.apply in favor of the basic cythonized functions, as these scale much better with many groups and give a large increase in performance. In this case, first call isnull() on the entire DataFrame, then groupby + sum.
df[df.columns.difference(['a'])].isnull().groupby(df.a).sum().astype(int)
#    b  c
# a
# 1  1  1
# 2  1  0
To illustrate the performance gain:
import pandas as pd
import numpy as np
N = 50000
df = pd.DataFrame({'a': [*range(N//2)]*2,
                   'b': np.random.choice([1, np.nan], N),
                   'c': np.random.choice([1, np.nan], N)})
%timeit df[df.columns.difference(['a'])].isnull().groupby(df.a).sum().astype(int)
#7.89 ms ± 187 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df.groupby('a')[['b', 'c']].apply(lambda x: x.isna().sum())
#9.47 s ± 111 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Your question already contains the answer (you mistyped _df as df):
result = df.groupby('a')['b', 'c'].apply(lambda _df: _df.isna().sum())
result
   b  c
a
1  1  1
2  1  0
Using apply with isna and sum. We also select only the relevant columns, so we don't get the unnecessary a column:
Note: apply can be slow; it's recommended to use one of the vectorized solutions, see the answers of WenYoBen, Anky or ALollz.
df.groupby('a')[['b', 'c']].apply(lambda x: x.isna().sum())
Output
   b  c
a
1  1  1
2  1  0
Another way would be set_index() on a and groupby on the index and sum:
df.set_index('a').isna().groupby(level=0).sum()*1
Or:
df.set_index('a').isna().groupby(level=0).sum().astype(int)
Or without groupby courtesy #WenYoBen:
df.set_index('a').isna().sum(level=0).astype(int)
   b  c
a
1  1  1
2  1  0
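(In more recent pandas versions, sum(level=0) was deprecated and eventually removed, so the groupby(level=0).sum() spelling above is the one that keeps working.)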
I would use count and then subtract it from value_counts: count only counts non-NA values, so the difference per group is the number of NAs. The reason I did not use apply is that it usually has bad performance.
df.groupby('a')[['b','c']].count().rsub(df.a.value_counts(dropna=False),axis=0)
Out[78]:
   b  c
1  1  1
2  1  0
Alternative
df.isna().drop('a', axis=1).astype(int).groupby(df['a']).sum()
Out[83]:
   b  c
a
1  1  1
2  1  0
You need to drop the column after using apply.
df.groupby('a').apply(lambda x: x.isna().sum()).drop('a', axis=1)
Output:
   b  c
a
1  1  1
2  1  0
Another quick-and-dirty option:
df.set_index('a').isna().astype(int).groupby(level=0).sum()
Output:
   b  c
a
1  1  1
2  1  0
You could write your own aggregation function as follows:
df.groupby('a').agg(lambda x: x.isna().sum())
which results in
     b    c
a
1  1.0  1.0
2  1.0  0.0
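If you want integer counts rather than the floats shown above, you can cast the result the same way as in the other answers, e.g.:
df.groupby('a').agg(lambda x: x.isna().sum()).astype(int)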

Adding a column of weights to a pandas DF by a condition on the DF's column

What's the most pythonic way to add a column (of weights) to an existing pandas DataFrame "df" based on a condition on one of df's columns?
Small example:
df = pd.DataFrame({'A' : [1, 2, 3], 'B' : [4, 5, 6]})
df
Out[110]:
   A  B
0  1  4
1  2  5
2  3  6
I'd like to add a "weight" column where df['weight'] = 20 if df['B'] >= 6, and df['weight'] = 1 otherwise.
So my output will be:
   A  B  weight
0  1  4       1
1  2  5       1
2  3  6      20
Approach #1
Here's one with type-conversion and scaling -
df['weight'] = (df['B'] >= 6)*19+1
Approach #2
Another possibly faster one with using the underlying array data -
df['weight'] = (df['B'].values >= 6)*19+1
Approach #3
Leverage multi-cores with numexpr module -
import numexpr as ne
val = df['B'].values
df['weight'] = ne.evaluate('(val >= 6)*19+1')
Timings on 500k rows (as requested by the OP in the comments), with random data in the range [0, 9), for the vectorized methods posted so far -
In [149]: np.random.seed(0)
...: df = pd.DataFrame({'B' : np.random.randint(0,9,(500000))})
# #jpp's soln
In [150]: %timeit df['weight1'] = np.where(df['B'] >= 6, 20, 1)
100 loops, best of 3: 3.57 ms per loop
# #jpp's soln with array data
In [151]: %timeit df['weight2'] = np.where(df['B'].values >= 6, 20, 1)
100 loops, best of 3: 3.27 ms per loop
In [154]: %timeit df['weight3'] = (df['B'] >= 6)*19+1
100 loops, best of 3: 2.73 ms per loop
In [155]: %timeit df['weight4'] = (df['B'].values >= 6)*19+1
1000 loops, best of 3: 1.76 ms per loop
In [156]: %%timeit
...: val = df['B'].values
...: df['weight5'] = ne.evaluate('(val >= 6)*19+1')
1000 loops, best of 3: 1.14 ms per loop
One last one ...
With the output being either 1 or 20, we can safely use lower precision (uint8) for a further speedup over the approaches already discussed, like so -
In [208]: %timeit df['weight6'] = (df['B'].values >= 6)*np.uint8(19)+1
1000 loops, best of 3: 428 µs per loop
You can use numpy.where for a vectorised solution:
df['weight'] = np.where(df['B'] >= 6, 20, 1)
Result:
   A  B  weight
0  1  4       1
1  2  5       1
2  3  6      20
Here's a method using df.apply
df['weight'] = df.apply(lambda row: 20 if row['B'] >= 6 else 1, axis=1)
Output:
In [6]: df
Out[6]:
   A  B  weight
0  1  4       1
1  2  5       1
2  3  6      20
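Note that a row-wise apply like this calls the Python lambda once per row, so on larger frames the vectorized np.where / boolean-scaling answers above should be considerably faster.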

pandas rolling max with groupby

I have a problem getting the rolling function of Pandas to do what I wish. I want, for each row, to calculate the maximum so far within its group. Here is an example:
df = pd.DataFrame([[1,3], [1,6], [1,3], [2,2], [2,1]], columns=['id', 'value'])
looks like
   id  value
0   1      3
1   1      6
2   1      3
3   2      2
4   2      1
Now I wish to obtain the following DataFrame:
   id  value
0   1      3
1   1      6
2   1      6
3   2      2
4   2      2
The problem is that when I do
df.groupby('id')['value'].rolling(1).max()
I get the same DataFrame back. And when I do
df.groupby('id')['value'].rolling(3).max()
I get a DataFrame with Nans. Can someone explain how to properly use rolling or some other Pandas function to obtain the DataFrame I want?
It looks like you need cummax() instead of .rolling(N).max()
In [29]: df['new'] = df.groupby('id').value.cummax()
In [30]: df
Out[30]:
   id  value  new
0   1      3    3
1   1      6    6
2   1      3    6
3   2      2    2
4   2      1    2
Timing (using brand new Pandas version 0.20.1):
In [3]: df = pd.concat([df] * 10**4, ignore_index=True)
In [4]: df.shape
Out[4]: (50000, 2)
In [5]: %timeit df.groupby('id').value.apply(lambda x: x.cummax())
100 loops, best of 3: 15.8 ms per loop
In [6]: %timeit df.groupby('id').value.cummax()
100 loops, best of 3: 4.09 ms per loop
NOTE: from Pandas 0.20.0 what's new
Improved performance of groupby().cummin() and groupby().cummax() (GH15048, GH15109, GH15561, GH15635)
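If you specifically want something from the rolling/window family, expanding() is the variant whose window grows from the start of each group and it gives the same result as cummax. A sketch against a newer pandas API; note it returns floats and the group level has to be dropped before assigning back:
# expanding().max() is "the maximum so far" within each group
df['new'] = (df.groupby('id')['value']
               .expanding().max()
               .reset_index(level=0, drop=True))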
Using apply will be a tiny bit faster:
# Using apply
df['output'] = df.groupby('id').value.apply(lambda x: x.cummax())
%timeit df['output'] = df.groupby('id').value.apply(lambda x: x.cummax())
1000 loops, best of 3: 1.57 ms per loop
Other method:
df['output'] = df.groupby('id').value.cummax()
%timeit df['output'] = df.groupby('id').value.cummax()
1000 loops, best of 3: 1.66 ms per loop

Pandas - get a summarized pivot DataFrame from row counts

Given this DataFrame:
  bowl     cookie
0  one  chocolate
1  two  chocolate
2  two  chocolate
3  two    vanilla
4  one    vanilla
5  one    vanilla
6  one    vanilla
7  one    vanilla
8  one    vanilla
9  two  chocolate
I'd like to obtain the following summarized DataFrame:
     vanilla  chocolate
one        5          1
two        1          3
Apart from proceeding manually with:
vanilla_bowl1 = len(df_picks[(df_picks['bowl'] == 'one') & (df_picks['cookie'] == 'vanilla')])
vanilla_bowl2 = len(df_picks[(df_picks['bowl'] == 'two') & (df_picks['cookie'] == 'vanilla')])
chocolate_bowl1 = ...
chocolate_bowl2 = ...
Is there a way to do that in a single operation with Pandas?
Note: I've had a look at df.pivot() and this would work provided that I add a column count equal to 1 in each row:
  bowl     cookie  count
0  one  chocolate      1
1  two  chocolate      1
2  two  chocolate      1
3  two    vanilla      1
4  one    vanilla      1
5  one    vanilla      1
6  one    vanilla      1
7  one    vanilla      1
8  one    vanilla      1
9  two  chocolate      1
And then
df.pivot(index='bowl', columns='cookie', values='count')
However, I'm wondering if there is a more direct method, that wouldn't require adding the count column in the first place.
The most concise way is probably the pandas.crosstab function:
>>> pandas.crosstab(d.bowl, d.cookie)
cookie  chocolate  vanilla
bowl
one             1        5
two             3        1
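crosstab returns the columns in sorted order; if you want them in the order from the question, just reindex the result, for example:
>>> pandas.crosstab(d.bowl, d.cookie)[['vanilla', 'chocolate']]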
you can use the pivot_table() method:
In [33]: df.pivot_table(index='bowl', columns='cookie', aggfunc='size', fill_value=0)
Out[33]:
cookie  chocolate  vanilla
bowl
one             1        5
two             3        1
alternatively you can use groupby(), size() and unstack() - that's how pivot_table() does it under the hood:
In [36]: df.groupby(['bowl', 'cookie']).size().unstack('cookie', fill_value=0)
Out[36]:
cookie  chocolate  vanilla
bowl
one             1        5
two             3        1
Timing for 100K rows DF:
In [48]: big = pd.concat([df] * 10**4, ignore_index=True)
In [49]: big.shape
Out[49]: (100000, 2)
In [50]: %timeit pd.crosstab(big.bowl, big.cookie)
10 loops, best of 3: 58 ms per loop
In [51]: %timeit big.pivot_table(index='bowl', columns='cookie', aggfunc='size', fill_value=0)
10 loops, best of 3: 38.4 ms per loop
In [52]: %timeit big.groupby(['bowl', 'cookie']).size().unstack('cookie', fill_value=0)
10 loops, best of 3: 34.2 ms per loop
In [118]: %timeit pir(big)
1 loop, best of 3: 631 ms per loop
In [119]: big.shape
Out[119]: (100000, 2)
Timing for 1M rows DF:
In [53]: big = pd.concat([big] * 10, ignore_index=True)
In [54]: big.shape
Out[54]: (1000000, 2)
In [55]: %timeit pd.crosstab(big.bowl, big.cookie)
1 loop, best of 3: 446 ms per loop
In [56]: %timeit big.pivot_table(index='bowl', columns='cookie', aggfunc='size', fill_value=0)
1 loop, best of 3: 333 ms per loop
In [57]: %timeit big.groupby(['bowl', 'cookie']).size().unstack('cookie', fill_value=0)
1 loop, best of 3: 327 ms per loop
In [121]: %timeit pir(big)
1 loop, best of 3: 7.08 s per loop
In [122]: big.shape
Out[122]: (1000000, 2)
a numpy approach
from itertools import product
import pandas as pd
import numpy as np

def pir(df):
    # unique bowls / cookies become the row / column labels
    ub = pd.Index(np.unique(df.values[:, 0]), name='bowl')
    uc = pd.Index(np.unique(df.values[:, 1]), name='cookie')
    # every (bowl, cookie) combination, compared against every row at once
    u = np.array(list(product(ub.values, uc.values)))
    e = u[:, None] == df.values
    return pd.DataFrame(
        e.all(2).sum(1).reshape(-1, 2),
        ub, uc
    )

pir(df)
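For reference, the broadcast comparison u[:, None] == df.values materializes a (combinations x rows x 2) boolean array, which is why this approach falls behind the groupby-based ones in the timings above as the frame grows.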

A better way to aggregate data and keep table structure and column names with Pandas

Suppose I have a dataset like the following
df = pd.DataFrame({'x1':['a','a','b','b'], 'x2':[True, True, True, False], 'x3':[1,1,1,1]})
df
  x1     x2  x3
0  a   True   1
1  a   True   1
2  b   True   1
3  b  False   1
I often want to perform a groupby-aggregate operation where I group by multiple columns and apply multiple functions to one column. Furthermore, I usually don't want a multi-indexed, multi-level table. To accomplish this, it's taking me three lines of code which seems excessive.
For example
bg = df.groupby(['x1', 'x2']).agg({'x3': {'my_sum':np.sum, 'my_mean':np.mean}})
bg.columns = bg.columns.droplevel(0)
bg.reset_index()
Is there a better way? Not to gripe, but I'm coming from an R/data.table background where something like this is a nice one-liner like
df[, list(my_sum=sum(x3), my_mean=mean(x3)), by=list(x1, x2)]
How about this:
In [81]: bg = df.groupby(['x1', 'x2'], as_index=False)['x3'].agg({'my_sum':np.sum, 'my_mean':np.mean})
In [82]: print bg
  x1     x2  my_sum  my_mean
0  a   True       2        1
1  b  False       1        1
2  b   True       1        1
You could use #Happy01's answer, but instead of as_index=False you could add reset_index() at the end:
In [1331]: df.groupby(['x1', 'x2'])['x3'].agg( {'my_sum':np.sum, 'my_mean':np.mean}).reset_index()
Out[1331]:
  x1     x2  my_mean  my_sum
0  a   True        1       2
1  b  False        1       1
2  b   True        1       1
Benchmarking: the reset_index version is faster:
In [1333]: %timeit df.groupby(['x1', 'x2'], as_index=False)['x3'].agg({'my_sum':np.sum, 'my_mean':np.mean})
100 loops, best of 3: 3.18 ms per loop
In [1334]: %timeit df.groupby(['x1', 'x2'])['x3'].agg( {'my_sum':np.sum, 'my_mean':np.mean}).reset_index()
100 loops, best of 3: 2.82 ms per loop
You could also do the same as your solution in one line: transpose the dataframe, reset_index to drop the x3 level (level 0 of the columns), then transpose back and reset_index again to get the desired output:
In [1374]: df.groupby(['x1', 'x2']).agg({'x3': {'my_sum':np.sum, 'my_mean':np.mean}}).T.reset_index(level=0, drop=True).T.reset_index()
Out[1374]:
  x1     x2  my_mean  my_sum
0  a   True        1       2
1  b  False        1       1
2  b   True        1       1
But it works slower:
In [1375]: %timeit df.groupby(['x1', 'x2']).agg({'x3': {'my_sum':np.sum, 'my_mean':np.mean}}).T.reset_index(level=0, drop=True).T.reset_index()
100 loops, best of 3: 5.13 ms per loop
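As a newer-pandas footnote (not part of the original answers): the dict-of-dicts renaming used above was deprecated and later removed, and from pandas 0.25 named aggregation gives the flat one-liner directly:
df.groupby(['x1', 'x2'], as_index=False).agg(my_sum=('x3', 'sum'),
                                             my_mean=('x3', 'mean'))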
