I am trying to perform a groupby filter that is very similar to the example in this documentation: pandas groupby filter
>>> df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
... 'foo', 'bar'],
... 'B' : [1, 2, 3, 4, 5, 6],
... 'C' : [2.0, 5., 8., 1., 2., 9.]})
>>> grouped = df.groupby('A')
>>> grouped.filter(lambda x: x['B'].mean() > 3.)
A B C
1 bar 2 5.0
3 bar 4 1.0
5 bar 6 9.0
I am trying to return a DataFrame that has all 3 columns, but only 2 rows. Those 2 rows contain the minimum values of column B, after grouping by column A. I tried the following line of code:
grouped.filter(lambda x: x['B'] == x['B'].min())
But this doesn't work, and I get this error:
TypeError: filter function returned a Series, but expected a scalar bool
The DataFrame I am trying to return should look like this:
A B C
0 foo 1 2.0
1 bar 2 5.0
I would appreciate any help you can provide. Thank you, in advance, for your help.
The short answer:
grouped.apply(lambda x: x[x['B'] == x['B'].min()])
... and the longer one:
Your grouped object has 2 groups:
In[25]: for df in grouped:
...: print(df)
...:
('bar',
A B C
1 bar 2 5.0
3 bar 4 1.0
5 bar 6 9.0)
('foo',
A B C
0 foo 1 2.0
2 foo 3 8.0
4 foo 5 2.0)
The filter() method of a GroupBy object filters groups as whole entities; it is NOT for filtering their individual rows. So using the filter() method, you can obtain only one of 4 possible results:
an empty DataFrame (0 rows),
rows of the group 'bar' (3 rows),
rows of the group 'foo' (3 rows),
rows of both groups (6 rows)
Nothing else, regardless of the Boolean function you pass to filter().
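For instance (a minimal sketch using the df and grouped objects above), even a condition built from the per-group minimum keeps or drops whole groups only:
grouped.filter(lambda g: g['B'].min() < 2)
     A  B    C
0  foo  1  2.0
2  foo  3  8.0
4  foo  5  2.0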
So you have to use some other method. An appropriate one is the very flexible apply() method, which lets you apply an arbitrary function that
takes a DataFrame (a group of the GroupBy object) as its only parameter, and
returns either a Pandas object or a scalar.
In your case that function should return (for each of your 2 groups) the 1-row DataFrame holding the minimal value in the column 'B', so we will use the Boolean mask
group['B'] == group['B'].min()
to select such a row (or possibly several rows):
In[26]: def select_min_b(group):
...: return group[group['B'] == group['B'].min()]
Now, passing this function to the apply() method of the GroupBy object grouped, we obtain
In[27]: grouped.apply(select_min_b)
Out[27]:
A B C
A
bar 1 bar 2 5.0
foo 0 foo 1 2.0
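If that extra 'A' index level is unwanted, one option (a sketch, assuming a pandas version that provides DataFrame.droplevel) is to drop it afterwards:
grouped.apply(select_min_b).droplevel('A')
     A  B    C
1  bar  2  5.0
0  foo  1  2.0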
Note:
The same, but as a single command (using a lambda function):
grouped.apply(lambda group: group[group['B'] == group['B'].min()])
There's a fundamental difference: In the documentation example, there is a single Boolean value per group. That is, you return the entire group if the mean is greater than 3. In your example, you want to filter specific rows within a group.
For your task the usual trick is to sort values and use .head or .tail to filter to the row with the smallest or largest value respectively:
df.sort_values('B').groupby('A').head(1)
# A B C
#0 foo 1 2.0
#1 bar 2 5.0
For more complicated queries, you can use .transform or .apply to create a Boolean Series to slice with. This approach is also safer if multiple rows share the minimum and you need all of them:
df[df.groupby('A').B.transform(lambda x: x == x.min())]
# A B C
#0 foo 1 2.0
#1 bar 2 5.0
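When the condition is a plain minimum or maximum, passing the built-in name to transform is typically faster than a Python lambda (a sketch of the same idea):
df[df['B'] == df.groupby('A')['B'].transform('min')]
#    A  B    C
#0  foo  1  2.0
#1  bar  2  5.0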
No need for groupby :-)
df.sort_values('B').drop_duplicates('A')
Out[288]:
A B C
0 foo 1 2.0
1 bar 2 5.0
>>> # sort=False to return the rows in the order they originally occurred
>>> df.loc[df.groupby("A", sort=False)["B"].idxmin()]
A B C
0 foo 1 2.0
1 bar 2 5.0
df.groupby('A').apply(lambda x: x.loc[x['B'].idxmin(), ['B','C']]).reset_index()
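A hedged variant of that idea: wrapping the idxmin result in a list keeps each selected row as a one-row DataFrame, so all three columns and their dtypes survive (a sketch; group_keys=False just prevents the group key from being added to the index):
df.groupby('A', group_keys=False).apply(lambda x: x.loc[[x['B'].idxmin()]])
     A  B    C
1  bar  2  5.0
0  foo  1  2.0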
Related
I have a DataFrame with many missing values in columns which I wish to groupby:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': ['1', '2', '3'], 'b': ['4', np.NaN, '6']})
In [4]: df.groupby('b').groups
Out[4]: {'4': [0], '6': [2]}
You can see that Pandas has dropped the rows with NaN target values. (I want to include these rows!)
Since I need many such operations (many columns have missing values) and use more complicated functions than just medians (typically random forests), I want to avoid writing overly complicated pieces of code.
Any suggestions? Should I write a function for this or is there a simple solution?
pandas >= 1.1
From pandas 1.1 onwards you have better control over this behavior: NA values are now allowed in the grouper using dropna=False:
pd.__version__
# '1.1.0.dev0+2004.g8d10bfb6f'
# Example from the docs
df
a b c
0 1 2.0 3
1 1 NaN 4
2 2 1.0 3
3 1 2.0 2
# without NA (the default)
df.groupby('b').sum()
a c
b
1.0 2 3
2.0 2 5
# with NA
df.groupby('b', dropna=False).sum()
a c
b
1.0 2 3
2.0 2 5
NaN 1 4
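The same flag also works when grouping by several columns (a sketch against the docs example above; exact output formatting may differ):
df.groupby(['a', 'b'], dropna=False).sum()
         c
a b
1 2.0    5
  NaN    4
2 1.0    3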
This is mentioned in the Missing Data section of the docs:
NA groups in GroupBy are automatically excluded. This behavior is consistent with R
One workaround is to use a placeholder before doing the groupby (e.g. -1):
In [11]: df.fillna(-1)
Out[11]:
a b
0 1 4
1 2 -1
2 3 6
In [12]: df.fillna(-1).groupby('b').sum()
Out[12]:
a
b
-1 2
4 1
6 3
That said, this feels like a pretty awful hack... perhaps there should be an option to include NaN in groupby (see this github issue - which uses the same placeholder hack).
However, as described in another answer, from pandas 1.1 you have better control over this behavior: NA values are now allowed in the grouper using dropna=False.
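If you are stuck on an older pandas and do use the placeholder hack, one way to restore the NaN label afterwards is a sketch like this (assuming -1 never occurs as a real key):
out = df.fillna(-1).groupby('b').sum()
out = out.rename(index={-1: np.nan})  # map the placeholder back to NaN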
Ancient topic, but if someone still stumbles over this: another workaround is to convert the column to string via .astype(str) before grouping. That will preserve the NaNs.
df = pd.DataFrame({'a': ['1', '2', '3'], 'b': ['4', np.NaN, '6']})
df['b'] = df['b'].astype(str)
df.groupby(['b']).sum()
a
b
4 1
6 3
nan 2
I am not able to add a comment to M. Kiewisch's answer since I do not have enough reputation points (I only have 41 but need more than 50 to comment).
Anyway, I just want to point out that M. Kiewisch's solution does not work as is and may need more tweaking. Consider for example
>>> df = pd.DataFrame({'a': [1, 2, 3, 5], 'b': [4, np.NaN, 6, 4]})
>>> df
a b
0 1 4.0
1 2 NaN
2 3 6.0
3 5 4.0
>>> df.groupby(['b']).sum()
a
b
4.0 6
6.0 3
>>> df.astype(str).groupby(['b']).sum()
a
b
4.0 15
6.0 3
nan 2
which shows that for group b=4.0, the corresponding value is 15 instead of 6. Here it is just concatenating '1' and '5' as strings instead of adding them as numbers.
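A safer variant of the string trick is to convert only the grouping key, not the whole frame, so the numeric columns keep being added as numbers (a sketch against the example above):
>>> df.groupby(df['b'].astype(str))['a'].sum()
b
4.0    6
6.0    3
nan    2
Name: a, dtype: int64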
All answers provided thus far result in potentially dangerous behavior, as it is quite possible you select a dummy value that is actually part of the dataset. This is increasingly likely as you create groups with many attributes. Simply put, the approach doesn't always generalize well.
A less hacky solution is to use DataFrame.drop_duplicates() to create a unique index of value combinations, each with its own ID, and then group on that ID. It is more verbose but does get the job done:
def safe_groupby(df, group_cols, agg_dict):
    # set name of group col to unique value
    group_id = 'group_id'
    while group_id in df.columns:
        group_id += 'x'
    # get final order of columns
    agg_col_order = (group_cols + list(agg_dict.keys()))
    # create unique index of grouped values
    group_idx = df[group_cols].drop_duplicates()
    group_idx[group_id] = np.arange(group_idx.shape[0])
    # merge unique index on dataframe
    df = df.merge(group_idx, on=group_cols)
    # group dataframe on group id and aggregate values
    df_agg = df.groupby(group_id, as_index=True)\
               .agg(agg_dict)
    # merge grouped value index to results of aggregation
    df_agg = group_idx.set_index(group_id).join(df_agg)
    # rename index
    df_agg.index.name = None
    # return reordered columns
    return df_agg[agg_col_order]
Note that you can now simply do the following:
from collections import OrderedDict

data_block = [np.tile([None, 'A'], 3),
np.repeat(['B', 'C'], 3),
[1] * (2 * 3)]
col_names = ['col_a', 'col_b', 'value']
test_df = pd.DataFrame(data_block, index=col_names).T
grouped_df = safe_groupby(test_df, ['col_a', 'col_b'],
OrderedDict([('value', 'sum')]))
This will return the successful result without having to worry about overwriting real data that is mistaken as a dummy value.
One small point regarding Andy Hayden's solution – it doesn't work (anymore?) because np.nan == np.nan yields False, so a replace-based approach doesn't actually do anything.
What worked for me was this:
df['b'] = df['b'].apply(lambda x: x if not np.isnan(x) else -1)
(At least that's the behavior for Pandas 0.19.2. Sorry to add it as a different answer, I do not have enough reputation to comment.)
I answered this already, but for some reason the answer was converted to a comment. Nevertheless, this is the most efficient solution:
Not being able to include (and propagate) NaNs in groups is quite aggravating. Citing R is not convincing, as this behavior is not consistent with a lot of other things. Anyway, the dummy hack is also pretty bad. However, the size (which includes NaNs) and the count (which ignores NaNs) of a group will differ if there are NaNs.
dfgrouped = df.groupby(['b']).a.agg(['sum','size','count'])
dfgrouped.loc[dfgrouped['size'] != dfgrouped['count'], 'sum'] = None
When these differ, you can set the value back to None for the result of the aggregation function for that group.
If you came here looking for information on how to
merge a DataFrame and Series on the index, please look at this
answer.
The OP's original intention was to ask how to assign series elements
as columns to another DataFrame. If you are interested in knowing the
answer to this, look at the accepted answer by EdChum.
Best I can come up with is
df = pd.DataFrame({'a':[1, 2], 'b':[3, 4]}) # see EDIT below
s = pd.Series({'s1':5, 's2':6})
for name in s.index:
    df[name] = s[name]
a b s1 s2
0 1 3 5 6
1 2 4 5 6
Can anybody suggest better syntax / faster method?
My attempts:
df.merge(s)
AttributeError: 'Series' object has no attribute 'columns'
and
df.join(s)
ValueError: Other Series must have a name
EDIT The first two answers posted highlighted a problem with my question, so please use the following to construct df:
df = pd.DataFrame({'a':[np.nan, 2, 3], 'b':[4, 5, 6]}, index=[3, 5, 6])
with the final result
a b s1 s2
3 NaN 4 5 6
5 2 5 5 6
6 3 6 5 6
Update
From v0.24.0 onwards, you can merge a DataFrame and a Series, as long as the Series is named.
df.merge(s.rename('new'), left_index=True, right_index=True)
# If series is already named,
# df.merge(s, left_index=True, right_index=True)
Nowadays, you can simply convert the Series to a DataFrame with to_frame(). So (if joining on index):
df.merge(s.to_frame(), left_index=True, right_index=True)
You could construct a dataframe from the series and then merge with the dataframe.
So you build the data by repeating the Series values, set the columns to the Series index, and set the left_index and right_index parameters to True:
In [27]:
df.merge(pd.DataFrame(data = [s.values] * len(s), columns = s.index), left_index=True, right_index=True)
Out[27]:
a b s1 s2
0 1 3 5 6
1 2 4 5 6
EDIT for the situation where you want the index of your constructed df from the series to use the index of the df then you can do the following:
df.merge(pd.DataFrame(data = [s.values] * len(df), columns = s.index, index=df.index), left_index=True, right_index=True)
This assumes that the indices match in length.
Here's one way:
df.join(pd.DataFrame(s).T).fillna(method='ffill')
To break down what happens here...
pd.DataFrame(s).T creates a one-row DataFrame from s which looks like this:
s1 s2
0 5 6
Next, join concatenates this new frame with df:
a b s1 s2
0 1 3 5 6
1 2 4 NaN NaN
Lastly, the NaN values at index 1 are filled with the previous values in the column using fillna with the forward-fill (ffill) argument:
a b s1 s2
0 1 3 5 6
1 2 4 5 6
To avoid using fillna, it's possible to use pd.concat to repeat the rows of the DataFrame constructed from s. In this case, the general solution is:
df.join(pd.concat([pd.DataFrame(s).T] * len(df), ignore_index=True))
Here's another solution to address the indexing challenge posed in the edited question:
df.join(pd.DataFrame(s.repeat(len(df)).values.reshape((len(df), -1), order='F'),
columns=s.index,
index=df.index))
s is transformed into a DataFrame by repeating the values and reshaping (specifying 'Fortran' order), and also passing in the appropriate column names and index. This new DataFrame is then joined to df.
Nowadays, a much simpler and more concise solution can achieve the same task. Leveraging the ability of DataFrame.apply() to turn a returned Series into columns of the resulting DataFrame, we can use:
df.join(df.apply(lambda x: s, axis=1))
Result:
a b s1 s2
3 NaN 4 5 6
5 2.0 5 5 6
6 3.0 6 5 6
Here, we used DataFrame.apply() with a simple lambda function as the applied function on axis=1. The applied lambda function simply returns the Series s:
df.apply(lambda x: s, axis=1)
Result:
s1 s2
3 5 6
5 5 6
6 5 6
The result has already inherited the row index of the original DataFrame df. Consequently, we can simply join df with this interim result using DataFrame.join() to get the desired final result (since they share the same row index).
This ability of DataFrame.apply() to turn a returned Series into columns of the resulting DataFrame is documented in the official documentation as follows:
By default (result_type=None), the final return type is inferred from
the return type of the applied function.
The default behaviour (result_type=None) depends on the return value of the
applied function: list-like results will be returned as a Series of
those. However if the apply function returns a Series these are
expanded to columns.
The official documentation also includes an example of such usage:
Returning a Series inside the function is similar to passing
result_type='expand'. The resulting column names will be the Series
index.
df.apply(lambda x: pd.Series([1, 2], index=['foo', 'bar']), axis=1)
foo bar
0 1 2
1 1 2
2 1 2
May I suggest setting up your dataframe like this (with auto-indexing):
df = pd.DataFrame({'a':[np.nan, 1, 2], 'b':[4, 5, 6]})
then you can set up your s1 and s2 values thus (using df.shape[0] to get the number of rows of df):
s = pd.DataFrame({'s1':[5]*df.shape[0], 's2':[6]*df.shape[0]})
then the result you want is easy:
display(df.merge(s, left_index=True, right_index=True))
Alternatively, just add the new values to your dataframe df:
df = pd.DataFrame({'a':[np.nan, 1, 2], 'b':[4, 5, 6]})
df['s1']=5
df['s2']=6
display(df)
Both return:
a b s1 s2
0 NaN 4 5 6
1 1.0 5 5 6
2 2.0 6 5 6
If you have another list of data (instead of just a single value to apply), and you know it is in the same sequence as df, eg:
s1=['a','b','c']
then you can attach this in the same way:
df['s1']=s1
returns:
a b s1
0 NaN 4 a
1 1.0 5 b
2 2.0 6 c
You can easily set a pandas.DataFrame column to a constant. This constant can be an int such as in your example. If the column you specify isn't in the df, then pandas will create a new column with the name you specify. So after your dataframe is constructed, (from your question):
df = pd.DataFrame({'a':[np.nan, 2, 3], 'b':[4, 5, 6]}, index=[3, 5, 6])
You can just run:
df['s1'], df['s2'] = 5, 6
You could write a loop or comprehension to make it do this for all the elements in a list of tuples, or keys and values in a dictionary depending on how you have your real data stored.
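For example, a one-line sketch of that loop using assign (which broadcasts each scalar to every row and returns a new DataFrame rather than modifying df in place):
df.assign(**s.to_dict())   # adds columns s1=5 and s2=6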
If df is a pandas.DataFrame, then df['new_col'] = list_object (a list or Series of length len(df)) will add that list or Series as a column named 'new_col'. df['new_col'] = scalar (such as 5 or 6 in your case) also works and is equivalent to df['new_col'] = [scalar]*len(df).
So a two-line loop serves the purpose:
df = pd.DataFrame({'a':[1, 2], 'b':[3, 4]})
s = pd.Series({'s1':5, 's2':6})
for x in s.index:
    df[x] = s[x]
Output:
a b s1 s2
0 1 3 5 6
1 2 4 5 6
I have a dataframe with several columns (the features).
>>> print(df)
col1 col2
a 1 1
b 2 2
c 3 3
d 3 2
I would like to compute the mode of one of them. This is what happens:
>>> print(df['col1'].mode())
0 3
dtype: int64
I would like to output simply the value 3.
This behaviour is quite strange, if you consider that the following very similar code works:
>>> print(df['col1'].mean())
2.25
So two questions: why does this happen? How can I obtain the pure mode value as it happens for the mean?
Because Series.mode() can return multiple values.
Consider the following DataFrame:
In [77]: df
Out[77]:
col1 col2
a 1 1
b 2 2
c 3 3
d 3 2
e 2 3
In [78]: df['col1'].mode()
Out[78]:
0 2
1 3
dtype: int64
From docstring:
Empty if nothing occurs at least 2 times. Always returns Series
even if only one value is returned.
If you want to choose the first value:
In [83]: df['col1'].mode().iloc[0]
Out[83]: 2
In [84]: df['col1'].mode()[0]
Out[84]: 2
I agree that it's too cumbersome:
df['col1'].mode().values[0]
A series can have one mean(), but a series can have more than one mode(). For example, [2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 6, 7, 8] has the modes 2, 3 and 4, so the output has to be indexed (returned as a Series).
mode() will return all values that tie for the most frequent value.
In order to support that functionality, it must return a collection, which takes the form of a DataFrame or Series.
For example, if you had a series:
[2, 2, 3, 3, 5, 5, 6]
Then the most frequent values each occur twice. The result would then be the Series [2, 3, 5], since each of these occurs twice.
If you want to collapse this into a single value, you can access the first value, compute the max(), min(), or whatever makes most sense for your application.
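For example (a sketch; pick whichever reduction fits your data):
modes = df['col1'].mode()
modes.iloc[0]   # the first returned mode
modes.max()     # the largest mode
modes.min()     # the smallest mode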
Here is my dataframe
import pandas as pd
df = pd.DataFrame({'A': ['one', 'one', 'two', 'two', 'one'] ,
'B': ['Ar', 'Br', 'Cr', 'Ar','Ar'] ,
'C': ['12/15/2011', '11/11/2001', '08/30/2015', '07/3/1999','03/03/2000' ],
'D':[1,7,3,4,5]})
My goal is to group by column A and sort within grouped results by column B.
Here is what I came up with:
sort_group = df.sort_values('B').groupby('A')
I was hoping that the grouping operation would not distort the order, but this does not work, and it also returns a GroupBy object rather than a DataFrame:
<pandas.core.groupby.DataFrameGroupBy object at 0x0000000008B190B8>
Any suggestions?
You cannot apply sort_values directly to a GroupBy object; you need to use apply instead:
df.groupby('A').apply(lambda x: x.sort_values('B'))
gives you the desired output:
A B C D
A
one 0 one Ar 12/15/2011 1
4 one Ar 03/03/2000 5
1 one Br 11/11/2001 7
two 3 two Ar 07/3/1999 4
2 two Cr 08/30/2015 3
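If you don't want the extra 'A' index level that apply adds, a sketch with group_keys=False gives the same rows with a flat index:
df.groupby('A', group_keys=False).apply(lambda x: x.sort_values('B'))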
I usually just use sort_values to indirectly group the values by column A and sort within the groups by column B:
sort_group = df.sort_values(['A', 'B'])
which will give you this:
A B C D
0 one Ar 12/15/2011 1
4 one Ar 03/03/2000 5
1 one Br 11/11/2001 7
3 two Ar 07/3/1999 4
2 two Cr 08/30/2015 3
This returns a normal DataFrame on which you can continue your analysis.
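If the two columns need different sort directions, sort_values accepts a list for ascending (a sketch):
df.sort_values(['A', 'B'], ascending=[True, False])   # A ascending, B descending within A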
How do I access the corresponding groupby dataframe in a groupby object by the key?
With the following groupby:
rand = np.random.RandomState(1)
df = pd.DataFrame({'A': ['foo', 'bar'] * 3,
'B': rand.randn(6),
'C': rand.randint(0, 20, 6)})
gb = df.groupby(['A'])
I can iterate through it to get the keys and groups:
In [11]: for k, gp in gb:
    ...:     print('key=' + str(k))
    ...:     print(gp)
key=bar
A B C
1 bar -0.611756 18
3 bar -1.072969 10
5 bar -2.301539 18
key=foo
A B C
0 foo 1.624345 5
2 foo -0.528172 11
4 foo 0.865408 14
I would like to be able to access a group by its key:
In [12]: gb['foo']
Out[12]:
A B C
0 foo 1.624345 5
2 foo -0.528172 11
4 foo 0.865408 14
But when I try doing that with gb[('foo',)] I get a weird pandas.core.groupby.DataFrameGroupBy object that doesn't seem to have any methods corresponding to the DataFrame I want.
The best I could think of is:
In [13]: def gb_df_key(gb, key, orig_df):
    ...:     ix = gb.indices[key]
    ...:     return orig_df.iloc[ix]  # .indices holds positional indices
gb_df_key(gb, 'foo', df)
Out[13]:
A B C
0 foo 1.624345 5
2 foo -0.528172 11
4 foo 0.865408 14
but this is kind of nasty, considering how nice pandas usually is at these things.
What's the built-in way of doing this?
You can use the get_group method:
In [21]: gb.get_group('foo')
Out[21]:
A B C
0 foo 1.624345 5
2 foo -0.528172 11
4 foo 0.865408 14
Note: This doesn't require creating an intermediary dictionary or a copy of every sub-DataFrame for every group, so it will be much more memory-efficient than creating the naive dictionary with dict(iter(gb)). This is because it uses data structures already available in the groupby object.
You can select different columns using the groupby slicing:
In [22]: gb[["A", "B"]].get_group("foo")
Out[22]:
A B
0 foo 1.624345
2 foo -0.528172
4 foo 0.865408
In [23]: gb["C"].get_group("foo")
Out[23]:
0 5
2 11
4 14
Name: C, dtype: int64
Wes McKinney (pandas' author) in Python for Data Analysis provides the following recipe:
groups = dict(list(gb))
which returns a dictionary whose keys are your group labels and whose values are DataFrames, i.e.
groups['foo']
will yield what you are looking for:
A B C
0 foo 1.624345 5
2 foo -0.528172 11
4 foo 0.865408 14
Rather than
gb.get_group('foo')
I prefer using gb.groups
df.loc[gb.groups['foo']]
Because this way you can also select multiple columns. For example:
df.loc[gb.groups['foo'], ['A', 'B']]
gb = df.groupby(['A'])
gb_groups = gb.groups
If you are only after selected groups, list the available keys with gb_groups.keys() and put the desired ones into key_list:
gb_groups.keys()
key_list = [key1, key2, key3 and so on...]
for key, values in gb_groups.items():
    if key in key_list:
        print(df.loc[values], "\n")
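A more compact sketch for pulling several groups at once into a single frame (assuming key_list holds valid group keys):
pd.concat([gb.get_group(k) for k in key_list if k in gb.groups])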
I was looking for a way to sample a few members of the GroupBy object - I had to address the posted question to get this done.
Create a groupby object based on the some_key column:
grouped = df.groupby('some_key')
Pick N group keys at random (random.sample needs a sequence, so wrap the dict keys in a list):
import random
sampled_df_i = random.sample(list(grouped.indices), N)
Grab the groups:
df_list = [grouped.get_group(df_i) for df_i in sampled_df_i]
Optionally, turn it all back into a single dataframe object:
sampled_df = pd.concat(df_list, axis=0, join='outer')
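Note that this samples whole groups. If you instead want a few random rows from every group, pandas >= 1.1 provides GroupBy.sample (a sketch):
df.groupby('some_key').sample(n=2)   # 2 random rows per group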
df.groupby('A').get_group('foo')
is equivalent to:
df[df['A'] == 'foo']
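And if you need the rows of several groups at once, a boolean-mask sketch:
df[df['A'].isin(['foo', 'bar'])]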