How to access pandas groupby dataframe by key - python

How do I access the corresponding groupby dataframe in a groupby object by the key?
With the following groupby:
import numpy as np
import pandas as pd

rand = np.random.RandomState(1)
df = pd.DataFrame({'A': ['foo', 'bar'] * 3,
                   'B': rand.randn(6),
                   'C': rand.randint(0, 20, 6)})
gb = df.groupby(['A'])
I can iterate through it to get the keys and groups:
In [11]: for k, gp in gb:
    ...:     print('key=' + str(k))
    ...:     print(gp)
key=bar
A B C
1 bar -0.611756 18
3 bar -1.072969 10
5 bar -2.301539 18
key=foo
A B C
0 foo 1.624345 5
2 foo -0.528172 11
4 foo 0.865408 14
I would like to be able to access a group by its key:
In [12]: gb['foo']
Out[12]:
A B C
0 foo 1.624345 5
2 foo -0.528172 11
4 foo 0.865408 14
But when I try doing that with gb[('foo',)] I get this weird pandas.core.groupby.DataFrameGroupBy object thing which doesn't seem to have any methods that correspond to the DataFrame I want.
The best I could think of is:
In [13]: def gb_df_key(gb, key, orig_df):
    ...:     ix = gb.indices[key]
    ...:     return orig_df.iloc[ix]
    ...: gb_df_key(gb, 'foo', df)
Out[13]:
A B C
0 foo 1.624345 5
2 foo -0.528172 11
4 foo 0.865408 14
but this is kind of nasty, considering how nice pandas usually is at these things.
What's the built-in way of doing this?

You can use the get_group method:
In [21]: gb.get_group('foo')
Out[21]:
A B C
0 foo 1.624345 5
2 foo -0.528172 11
4 foo 0.865408 14
Note: This doesn't require creating an intermediary dictionary or a copy of every sub-DataFrame for every group, so it will be much more memory-efficient than creating the naive dictionary with dict(iter(gb)). It uses data structures already available in the groupby object.
You can select different columns using the groupby slicing:
In [22]: gb[["A", "B"]].get_group("foo")
Out[22]:
A B
0 foo 1.624345
2 foo -0.528172
4 foo 0.865408
In [23]: gb["C"].get_group("foo")
Out[23]:
0 5
2 11
4 14
Name: C, dtype: int64
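Note that get_group raises a KeyError when the key doesn't exist. If you'd rather get an empty DataFrame back, a minimal sketch of a wrapper (get_group_or_empty is a hypothetical helper, not a pandas method):
def get_group_or_empty(gb, key, df):
    # return the group for `key`, or an empty frame with df's columns
    try:
        return gb.get_group(key)
    except KeyError:
        return df.iloc[0:0]

get_group_or_empty(gb, 'baz', df)  # empty DataFrame with columns A, B, C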

Wes McKinney (pandas' author) in Python for Data Analysis provides the following recipe:
groups = dict(list(gb))
which returns a dictionary whose keys are your group labels and whose values are DataFrames, i.e.
groups['foo']
will yield what you are looking for:
A B C
0 foo 1.624345 5
2 foo -0.528172 11
4 foo 0.865408 14

Rather than
gb.get_group('foo')
I prefer using gb.groups:
df.loc[gb.groups['foo']]
because this way you can also choose multiple columns. For example:
df.loc[gb.groups['foo'], ['A', 'B']]
Since gb.groups maps each key to the index labels of its rows, the label-based df.loc (not df.iloc) is the right indexer here.

gb = df.groupby(['A'])
gb_groups = gb.groups
If you are looking only for selected groups, inspect the available keys with gb_groups.keys() and put the desired ones into key_list:
gb_groups.keys()
key_list = [key1, key2, key3, ...]  # the keys you want
for key, values in gb_groups.items():
    if key in key_list:
        print(df.loc[values], "\n")

I was looking for a way to sample a few members of the GroupBy obj - had to address the posted question to get this done.
import random

# create groupby object based on some_key column
grouped = df.groupby('some_key')

# pick N group keys at random (random.sample needs a sequence, not a dict view)
sampled_df_i = random.sample(list(grouped.indices), N)

# grab the groups
df_list = [grouped.get_group(df_i) for df_i in sampled_df_i]

# optionally - turn it all back into a single dataframe object
sampled_df = pd.concat(df_list, axis=0, join='outer')
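An alternative sketch using numpy to pick the group keys (np.random.choice with replace=False avoids drawing the same group twice; grouped and N as above):
import numpy as np

keys = list(grouped.groups)  # all group labels
picked = np.random.choice(keys, size=N, replace=False)
sampled_df = pd.concat(grouped.get_group(k) for k in picked)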

df.groupby('A').get_group('foo')
is equivalent to:
df[df['A'] == 'foo']
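A quick way to convince yourself of the equivalence (a minimal check using the df from the first question above):
left = df.groupby('A').get_group('foo')
right = df[df['A'] == 'foo']
left.equals(right)  # True - same rows, same index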

Related

Python 3 pandas.groupby.filter

I am trying to perform a groupby filter that is very similar to the example in this documentation: pandas groupby filter
>>> df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
... 'foo', 'bar'],
... 'B' : [1, 2, 3, 4, 5, 6],
... 'C' : [2.0, 5., 8., 1., 2., 9.]})
>>> grouped = df.groupby('A')
>>> grouped.filter(lambda x: x['B'].mean() > 3.)
A B C
1 bar 2 5.0
3 bar 4 1.0
5 bar 6 9.0
I am trying to return a DataFrame that has all 3 columns, but only 2 rows. Those 2 rows contain the minimum values of column B, after grouping by column A. I tried the following line of code:
grouped.filter(lambda x: x['B'] == x['B'].min())
But this doesn't work, and I get this error:
TypeError: filter function returned a Series, but expected a scalar bool
The DataFrame I am trying to return should look like this:
A B C
0 foo 1 2.0
1 bar 2 5.0
I would appreciate any help you can provide. Thank you, in advance, for your help.
The short answer:
grouped.apply(lambda x: x[x['B'] == x['B'].min()])
... and the longer one:
Your grouped object has 2 groups:
In[25]: for df in grouped:
    ...:     print(df)
    ...:
('bar',
A B C
1 bar 2 5.0
3 bar 4 1.0
5 bar 6 9.0)
('foo',
A B C
0 foo 1 2.0
2 foo 3 8.0
4 foo 5 2.0)
The filter() method of a GroupBy object is for filtering groups as whole entities, NOT for filtering their individual rows. So using the filter() method, you may obtain only 4 possible results:
an empty DataFrame (0 rows),
rows of the group 'bar' (3 rows),
rows of the group 'foo' (3 rows),
rows of both groups (6 rows)
Nothing else, regardless of the Boolean function you pass to filter(), as the quick illustration below shows.
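A minimal sketch, reusing grouped from above:
grouped.filter(lambda g: False)                # empty DataFrame (0 rows)
grouped.filter(lambda g: g['B'].mean() > 3.)   # rows of group 'bar' (3 rows)
grouped.filter(lambda g: g['B'].mean() <= 3.)  # rows of group 'foo' (3 rows)
grouped.filter(lambda g: True)                 # rows of both groups (6 rows)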
So you have to use some other method. An appropriate one is the very flexible apply() method, which lets you apply an arbitrary function that
takes a DataFrame (one group of the GroupBy object) as its only parameter, and
returns either a Pandas object or a scalar.
In your case that function should return (for each of your 2 groups) the 1-row DataFrame having the minimal value in the column 'B', so we will use the Boolean mask
group['B'] == group['B'].min()
for selecting such a row (or - maybe - more rows):
In[26]: def select_min_b(group):
    ...:     return group[group['B'] == group['B'].min()]
Now, passing this function to the apply() method of the GroupBy object grouped, we obtain
In[27]: grouped.apply(select_min_b)
Out[27]:
A B C
A
bar 1 bar 2 5.0
foo 0 foo 1 2.0
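If the extra 'A' index level that apply() adds is unwanted, groupby can be told not to add it (group_keys is a standard groupby parameter; the output shown is a sketch):
In [28]: df.groupby('A', group_keys=False).apply(select_min_b)
Out[28]:
     A  B    C
1  bar  2  5.0
0  foo  1  2.0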
Note:
The same, but as only one command (using the lambda function):
grouped.apply(lambda group: group[group['B'] == group['B'].min()])
There's a fundamental difference: In the documentation example, there is a single Boolean value per group. That is, you return the entire group if the mean is greater than 3. In your example, you want to filter specific rows within a group.
For your task the usual trick is to sort values and use .head or .tail to filter to the row with the smallest or largest value respectively:
df.sort_values('B').groupby('A').head(1)
# A B C
#0 foo 1 2.0
#1 bar 2 5.0
For more complicated queries you can use .transform or .apply to create a Boolean Series to slice. This approach is also safer if multiple rows share the minimum and you need all of them:
df[df.groupby('A').B.transform(lambda x: x == x.min())]
# A B C
#0 foo 1 2.0
#1 bar 2 5.0
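For instance, with a small made-up frame where two rows tie for the group minimum, the transform version keeps them both:
df_ties = pd.DataFrame({'A': ['foo', 'foo', 'bar'],
                        'B': [1, 1, 2],
                        'C': [2.0, 8.0, 5.0]})
df_ties[df_ties.groupby('A').B.transform(lambda x: x == x.min())]
#      A  B    C
# 0  foo  1  2.0
# 1  foo  1  8.0
# 2  bar  2  5.0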
No need for groupby :-)
df.sort_values('B').drop_duplicates('A')
Out[288]:
A B C
0 foo 1 2.0
1 bar 2 5.0
>>> # sort=False to return the rows in the order they originally occurred
>>> df.loc[df.groupby("A", sort=False)["B"].idxmin()]
A B C
0 foo 1 2.0
1 bar 2 5.0
Another option returns just columns B and C for each group's minimal-B row (note that idxmin picks only the first row if there is a tie):
df.groupby('A').apply(lambda x: x.loc[x['B'].idxmin(), ['B', 'C']]).reset_index()

Sum values in df column based on partial name of another column

Given the dataframe
a b
foo123 5
foo456 8
bar234 1
bar324 6
How do I add the values from b based only on the first several characters of a? The output I'm looking for is:
a b
foo 13
bar 7
There are too many entries for column a to set manually, so something like the following won't work:
if df['a'].startswith('foo'):
    sum(b)
I'm thinking something more like if df['a'] has first three characters that match, add all the corresponding rows for b.
If your substrings do not all have the same length, use str.extract to extract the relevant portions from a, and then use that to perform a groupby + sum operation on b:
# assuming your frame is df1
df1.groupby(df1['a'].str.extract(r'^(\D+)', expand=False))['b'].sum().reset_index()
a b
0 bar 7
1 foo 13
For more performance, pre-assign a first:
df1['a'] = df1['a'].str.extract(r'^(\D+)', expand=False)
df1.groupby('a', as_index=False)['b'].sum()
a b
0 bar 7
1 foo 13
If all substrings are of the same size, just slice and groupby:
df1.groupby(df1['a'].str[:3])['b'].sum().reset_index()
a b
0 bar 7
1 foo 13
Or replace the digits with an empty string (recent pandas defaults str.replace to literal matching, so regex=True is needed here):
df.groupby(df.a.str.replace(r'\d+', '', regex=True)).b.sum()
Out[1353]:
a
bar 7
foo 13
Name: b, dtype: int64

Pandas Dataframe Reshaping

I have a dataframe as shown below
>> df
A 1
B 2
A 5
B 6
A 7
B 8
How do I reformat it to make it
A 1 5 7
B 2 6 8
Thanks
Given a data frame like this
df = pd.DataFrame(dict(one=list('ABABAB'), two=range(6)))
you can do
df.groupby('one').two.apply(lambda s: s.reset_index(drop=True)).unstack()
# 0 1 2
# one
# A 0 2 4
# B 1 3 5
or (slightly slower, and giving a slightly different result)
df.groupby('one').apply(lambda d: d.two.reset_index(drop=True))
# two 0 1 2
# one
# A 0 2 4
# B 1 3 5
The first approach works with a DataFrameGroupBy, the second uses a SeriesGroupBy.
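Applied to the data from the question (same column names as in this answer), the first approach reproduces the requested layout:
df = pd.DataFrame(dict(one=list('ABABAB'), two=[1, 2, 5, 6, 7, 8]))
df.groupby('one').two.apply(lambda s: s.reset_index(drop=True)).unstack()
#      0  1  2
# one
# A    1  5  7
# B    2  6  8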
You can grab the Series values and use np.reshape to get the correct dimensions.
order='F' fills the array column by column (as in Fortran); order='C' fills it row by row (as in C).
Then the result goes back into a DataFrame:
df = pd.DataFrame(data=np.arange(10), columns=['a'])
data = df['a'].values.reshape((2, 5), order='F')
df = pd.DataFrame(data=data, index=['a', 'b'])
How did you generate this data frame? It could be generated from a dictionary instead:
d = {'A': [1, 5, 7], 'B': [2, 6, 8]}
df = pd.DataFrame(data=d, index=['p1', 'p2', 'p3'])
and then you can use df.T to transpose your dataframe if you need to.
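For the frame above, the transpose already gives the layout asked for:
df.T
#    p1  p2  p3
# A   1   5   7
# B   2   6   8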

How to avoid multiple columns on Pandas.Merge

Imagine I have the following DataFrames on Pandas:
In [7]: A= pd.DataFrame([['foo'],['bar'],['quz'],['baz']],columns=['key'])
In [8]: A['value'] = 'None'
In [9]: A
Out[9]:
key value
0 foo None
1 bar None
2 quz None
3 baz None
In [10]: B = pd.DataFrame([['foo',5],['bar',6],['quz',7]],columns= ['key','value'])
In [11]: B
Out[11]:
key value
0 foo 5
1 bar 6
2 quz 7
In [12]: pd.merge(A,B, on='key', how='outer')
Out[12]:
key value_x value_y
0 foo None 5
1 bar None 6
2 quz None 7
3 baz None NaN
But what I want is (avoiding the repeat column basically):
key value
0 foo 5
1 bar 6
2 quz 7
3 baz NaN
I suppose I can take the output, drop the _x column, and rename the _y, but that seems like overkill. In SQL this would be trivial.
EDIT:
John has recommended using:
In [1]: A.set_index('key', inplace=True)
   ...: A.update(B.set_index('key'), join='left', overwrite=True)
   ...: A.reset_index(inplace=True)
This works and does what I asked for.
In the example you are merging two dataframes with the same column name, where one contains strings ('None') and the other integers; pandas doesn't know which column's values you want to keep and which should be replaced, so it creates a column for both.
You can use update instead
In [10]: A.update(B, join='left', overwrite=True)
In [11]: A
Out[11]:
key value
0 foo 5
1 bar 6
2 quz 7
3 baz NaN
Another solution would be to just state the values that you want for the given column:
In [15]: A.loc[B.index, 'value'] = B.value
In [16]: A
Out[16]:
key value
0 foo 5
1 bar 6
2 quz 7
3 baz NaN
Personally I prefer the second solution because I know exactly what is happening, but the first is probably closer to what you are looking for in your question.
EDIT:
If the indices don't match, I'm not quite sure how to make this happen. Hence I would suggest making them match:
In [1]: A.set_index('key', inplace=True)
A.update(B.set_index('key'), join='left', overwrite=True)
A.reset_index(inplace=True)
It may be that there is a better way to do this, but I don't believe pandas has a way to perform this operation outright.
The second solution can also be used with the updated index:
In [24]: A.set_index('key', inplace=True)
   ...: A.loc[B.key, 'value'] = B.value.tolist()
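Putting the recommended route together as one self-contained sketch (using NaN instead of the string 'None' so the missing value displays cleanly; join='left' and overwrite=True are update()'s defaults):
import numpy as np
import pandas as pd

A = pd.DataFrame({'key': ['foo', 'bar', 'quz', 'baz'], 'value': np.nan})
B = pd.DataFrame({'key': ['foo', 'bar', 'quz'], 'value': [5, 6, 7]})

A = A.set_index('key')
A.update(B.set_index('key'))
A = A.reset_index()
#    key  value
# 0  foo    5.0
# 1  bar    6.0
# 2  quz    7.0
# 3  baz    NaN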

How to df.groupby(cols).apply(my_func) for some columns, while leaving a few columns untouched?

Suppose I have a Pandas dataframe df with columns a, b, c, d ... z. I want to run df.groupby('a').apply(my_func) for columns d-z, while leaving columns 'b' and 'c' unchanged. How do I do that?
I notice Pandas can apply a different function to each column by passing a dict, but I have a long column list and just want a parameter or trick that tells Pandas to bypass some columns and apply my_func() to the rest (otherwise I'd have to build a long dict).
One simple (and general) approach is to create a view of the dataframe with the subset you are interested in (or, for your case, a view with all columns except the ones you want to ignore), and then use apply() on that view.
In [116]: df
Out[116]:
     a  b         c   d        f
0  one  3  0.493808  40      bob
1  two  8  0.150585  50    alice
2  one  6  0.641816  56  michael
3  two  5  0.935653  56      joe
4  one  1  0.521159  48     kate
Use your favorite method to create the view you need. You could select a range of columns like so: df_view = df.loc[:, 'b':'d'], but the following might be more useful for your scenario:
# I want all columns except two
cols = df.columns.tolist()
mycols = [x for x in cols if x not in ['a', 'f']]
df_view = df[mycols]
Apply your function to that view. (Note this doesn't yet change anything in df.)
In [158]: df_view.apply(lambda x: x / 2)
Out[158]:
b c d
0 1 0.246904 20
1 4 0.075293 25
2 3 0.320908 28
3 2 0.467827 28
4 0 0.260579 24
Update the df using update()
In [156]: df.update(df_view.apply(lambda x: x/2))
In [157]: df
Out[157]:
a b c d f
0 one 1 0.246904 20 bob
1 two 4 0.075293 25 alice
2 one 3 0.320908 28 michael
3 two 2 0.467827 28 joe
4 one 0 0.260579 24 kate
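A compact alternative to the view-plus-update dance, under the same assumption (halve every column except 'a' and 'f'), is to assign into the column subset directly (columns.difference is standard pandas):
cols = df.columns.difference(['a', 'f'])
df[cols] = df[cols] / 2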
