Semantics of DataFrame groupby method - python

I find the behavior of the groupby method on a DataFrame object unexpected.
Let me explain with an example.
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': np.random.randn(5),
                   'data2': np.random.randn(5)})
data1 = df['data1']
data1
# Out[14]:
# 0 1.989430
# 1 -0.250694
# 2 -0.448550
# 3 0.776318
# 4 -1.843558
# Name: data1, dtype: float64
data1 does not have the 'key1' column anymore.
So I would expect to get an error if I applied the following operation:
grouped = data1.groupby(df['key1'])
But I don't, and I can further apply the mean method on grouped to get the expected result.
grouped.mean()
# Out[13]:
# key1
# a -0.034941
# b 0.163884
# Name: data1, dtype: float64
However, the above operation does create a group using the 'key1' column of df.
How can this happen? Does the interpreter store information of the originating DataFrame (df in this case) with the created DataFrame/series (data1 in this case)?
Thank you.

It is only syntactic sugar - check the documentation on selecting a column (Series) from a GroupBy, which says:
"This is mainly syntactic sugar for the alternative and much more verbose"
s = df['data1'].groupby(df['key1']).mean()
print (s)
key1
a 0.565292
b 0.106360
Name: data1, dtype: float64
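For completeness, the shorthand the quoted sentence refers to groups the whole DataFrame first and then selects the column; a small sketch (exact numbers differ because the data is random):
# Shorthand form: group the DataFrame, then select the column to aggregate.
s = df.groupby('key1')['data1'].mean()
print(s)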

Although the grouping columns are typically from the same dataframe or series, they don't have to be.
Your statement data1.groupby(df['key1']) is equivalent to data1.groupby(['a', 'a', 'b', 'b', 'a']). In fact, you can inspect the actual groups:
>>> data1.groupby(['a', 'a', 'b', 'b', 'a']).groups
{'a': [0, 1, 4], 'b': [2, 3]}
This means that your groupby on data1 will have a group a using rows 0, 1, and 4 from data1 and a group b using rows 2 and 3.
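To illustrate that the grouping key only needs to align with data1's index, here is a small sketch using a Series built from scratch (the key values are made up):
# Any aligned Series works as a grouping key, not just a column of df.
key = pd.Series(['x', 'y', 'x', 'y', 'x'])
print(data1.groupby(key).mean())  # group 'x' -> rows 0, 2, 4; group 'y' -> rows 1, 3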

Related

Identify the columns which contain zero and output its location

Suppose I have a dataframe where some columns contain a zero value as one of their elements (or potentially more than one zero). I don't specifically want to retrieve these columns or discard them (I know how to do that) - I just want to locate them. For instance: if there are zeros somewhere in the 4th, 6th and 23rd columns, I want a list with the output [4,6,23].
You could iterate over the columns, checking whether 0 occurs in each column's values:
[i for i, c in enumerate(df.columns) if 0 in df[c].values]
Use any() for the fastest, vectorized approach.
For instance,
df = pd.DataFrame({'col1': [1, 2, 3],
                   'col2': [0, 100, 200],
                   'col3': ['a', 'b', 'c']})
Then,
>>> s = df.eq(0).any()
>>> s
col1    False
col2     True
col3    False
dtype: bool
From here, it's easy to get the indexes. For example,
>>> s[s].tolist()
['col2']
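If you want integer positions (as in the [4,6,23] example from the question) rather than column names, one option is to take the nonzero positions of the boolean Series; a sketch, assuming numpy is imported as np:
# Positions of the columns whose flag is True (col2 is at position 1 here).
positions = np.flatnonzero(s).tolist()
print(positions)  # [1]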
There are many ways to retrieve the indexes from a pd.Series of booleans.
Here is an approach that leverages a couple of lambda functions:
d = {'a': np.random.randint(10, size=100),
     'b': np.random.randint(1, 10, size=100),
     'c': np.random.randint(10, size=100),
     'd': np.random.randint(1, 10, size=100)}
df = pd.DataFrame(d)
df.apply(lambda x: (x==0).any())[lambda x: x].reset_index().index.to_list()
[0, 2]
Another idea based on #rafaelc's slick answer (but returning the relative locations of the columns instead of column names):
df.eq(0).any().reset_index()[lambda x: x[0]].index.to_list()
[0, 2]
Or with the column names instead of locations:
df.apply(lambda x: (x==0).any())[lambda x: x].index.to_list()
['a', 'c']

Filter pandas df by boolean series

I have a dataframe foo and a True/False series bar:
foo = pd.DataFrame([['a', 1], ['b', 2], ['a', 3]],
                   index=[0, 1, 2], columns=['col1', 'col2'])
bar = pd.Series({'a': True, 'b': False})
I want to filter foo on col1 based on the truthiness of bar. Here are some approaches that work:
foo[foo['col1'].isin(bar.where(bar == True).dropna().index)]
foo[foo['col1'].isin([k for k, v in bar.to_dict().items() if v])]
# desired result
col1 col2
0 a 1
2 a 3
However, I think both approaches are a bit messy / not so intuitive to read, and I was wondering whether I'm missing any basic Pandas filtering concepts that would allow for a simpler approach.
Use Series.map and index with the result:
foo[foo.col1.map(bar)]
col1 col2
0 a 1
2 a 3
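One caveat, as a hedged note: if col1 ever contains a value that is missing from bar's index, map yields NaN for those rows, so you may want to fill and cast before indexing:
# Treat values of col1 that bar does not know about as False (i.e. drop them).
mask = foo['col1'].map(bar).fillna(False).astype(bool)
foo[mask]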

How can I get the name of grouping columns from a Pandas GroupBy object?

Suppose I have the following dataframe:
df = pd.DataFrame(dict(Foo=['A', 'A', 'B', 'B'], Bar=[1, 2, 3, 4]))
i.e.:
Bar Foo
0 1 A
1 2 A
2 3 B
3 4 B
Then I create a pandas.GroupBy object:
g = df.groupby('Foo')
How can I get, from g, the fact that g is grouped by a column originally named Foo?
If I do g.groups I get:
{'A': Int64Index([0, 1], dtype='int64'),
'B': Int64Index([2, 3], dtype='int64')}
That tells me the values that the Foo column takes ('A' and 'B') but not the original column name.
Now, I can just do something like:
g.first().index.name
But it seems odd that there's not an attribute of g with the group name in it, so I feel like I must be missing something. In particular, if g was grouped by multiple columns, then the above doesn't work:
df = pd.DataFrame(dict(Foo=['A', 'A', 'B', 'B'], Baz=['C', 'D', 'C', 'D'], Bar=[1, 2, 3, 4]))
g = df.groupby(['Foo', 'Baz'])
g.first().index.name # returns None, because it's a MultiIndex
g.first().index.names # returns ['Foo', 'Baz']
For context, I am trying to do some plotting with a grouped dataframe, and I want to be able to label each facet (which is plotting a single group) with the name of that group as well as the group label.
Is there a better way?
Query the names attribute of the GroupBy object's grouper (a BaseGrouper) to get a list of all grouping names:
df.groupby('Foo').grouper.names
Which gives,
['Foo']
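The same attribute appears to cover the multi-column case from the question as well:
df = pd.DataFrame(dict(Foo=['A', 'A', 'B', 'B'], Baz=['C', 'D', 'C', 'D'], Bar=[1, 2, 3, 4]))
df.groupby(['Foo', 'Baz']).grouper.names  # ['Foo', 'Baz']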

How to replace values in multiple categoricals in a pandas DataFrame

I want to replace certain values in a dataframe containing multiple categoricals.
df = pd.DataFrame({'s1': ['a', 'b', 'c'], 's2': ['a', 'c', 'd']}, dtype='category')
If I apply .replace on a single column, the result is as expected:
>>> df.s1.replace('a', 1)
0 1
1 b
2 c
Name: s1, dtype: object
If I apply the same operation to the whole dataframe, an error is shown (short version):
>>> df.replace('a', 1)
ValueError: Cannot setitem on a Categorical with a new category, set the categories first
During handling of the above exception, another exception occurred:
ValueError: Wrong number of dimensions
If the dataframe contains integers as categories, the following happens:
df = pd.DataFrame({'s1': [1, 2, 3], 's2': [1, 3, 4]}, dtype='category')
>>> df.replace(1, 3)
s1 s2
0 3 3
1 2 3
2 3 4
But,
>>> df.replace(1, 2)
ValueError: Wrong number of dimensions
What am I missing?
Without digging, that seems to be buggy to me.
My Work Around
pd.DataFrame.apply with pd.Series.replace
This has the advantage that you don't need to mess with changing any types.
df = pd.DataFrame({'s1': [1, 2, 3], 's2': [1, 3, 4]}, dtype='category')
df.apply(pd.Series.replace, to_replace=1, value=2)
s1 s2
0 2 2
1 2 3
2 3 4
Or
df = pd.DataFrame({'s1': ['a', 'b', 'c'], 's2': ['a', 'c', 'd']}, dtype='category')
df.apply(pd.Series.replace, to_replace='a', value=1)
s1 s2
0 1 1
1 b c
2 c d
#cᴏʟᴅsᴘᴇᴇᴅ's Work Around
df = pd.DataFrame({'s1': ['a', 'b', 'c'], 's2': ['a', 'c', 'd']}, dtype='category')
df.applymap(str).replace('a', 1)
s1 s2
0 1 1
1 b c
2 c d
The reason for this behavior is that each column has a different set of categories:
In [224]: df.s1.cat.categories
Out[224]: Index(['a', 'b', 'c'], dtype='object')
In [225]: df.s2.cat.categories
Out[225]: Index(['a', 'c', 'd'], dtype='object')
so if you replace with a value that is already in both columns' categories, it will work:
In [226]: df.replace('d','a')
Out[226]:
s1 s2
0 a a
1 b c
2 c a
As a solution, you might want to make your columns categorical manually, using:
pd.Categorical(..., categories=[...])
where categories contains all possible values for all columns...
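A minimal sketch of that suggestion, assuming the full set of values (including anything you will replace with) is known up front:
# Give both columns the same category set, containing every possible value.
all_cats = ['a', 'b', 'c', 'd', 'x']
df = pd.DataFrame({'s1': pd.Categorical(['a', 'b', 'c'], categories=all_cats),
                   's2': pd.Categorical(['a', 'c', 'd'], categories=all_cats)})
print(df.replace('a', 'x'))  # 'x' is already a known category in both columns, so no error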

Issues with adding new rows to a pandas dataframe

Apologies if the formatting on this is strange; it's the first time I've posted anything. I've created a multi-index data frame in Python, which works fine:
arrays = [['one', 'one', 'two', 'two'],
          ['A', 'B', 'A', 'B']]
tuples = list(zip(*arrays))
mindex = pd.MultiIndex.from_tuples(tuples)
s = pd.DataFrame(data=np.random.randn(4), index=mindex, columns=['Values'])
s
This works fine, except that I think I should be able to add new rows by simply typing
s['Values'].loc[('Three', 'A')] = 1
s['Values'].loc[('Three','B')]= 2
This returns no error message, and I can check it has worked by entering
s['Values'].loc[('Three', 'A')]
Which gives me 1. So all as expected.
However, I can't see the 'Three' data in the Jupyter notebook - if I simply type
s
then it only shows me the original one, two, A & B rows. This is probably because the new row is not in the index:
s.index
returns
MultiIndex(levels=[['one', 'two'], ['A', 'B']],
labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
Can anyone please give me a hint as to what's going on here? I'd like rows I subsequently add to appear in the index. Should I be using the .append function instead? It seems a bit cumbersome and other posts have recommended using the .loc approach above to add rows.
Thanks!
I believe you need to select the column(s) inside DataFrame.loc:
s.loc[('Three', 'A'), 'Values'] = 1
s.loc[('Three', 'B'), 'Values'] = 2
print (s)
Values
one A -0.808372
B 0.904552
two A -0.443619
B 1.157234
Three A 1.000000
B 2.000000
print (s.index)
MultiIndex(levels=[['one', 'two', 'Three'], ['A', 'B']],
labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]])
because your solution adds values to the column (a Series), but not to the DataFrame:
s['Values'].loc[('Three', 'A')] = 1
print (s['Values'])
one A -0.808372
B 0.904552
two A -0.443619
B 1.157234
Three A 1.000000
Name: Values, dtype: float64
print (s)
Values
one A -0.808372
B 0.904552
two A -0.443619
B 1.157234
