Could you explain to me what the purpose of the 'DataFrame.columns.name' attribute is?
I unintentionally got it after creating a pivot table and resetting the index.
import pandas as pd
df = pd.DataFrame(['a', 'b'])
print(df.head())
# OUTPUT:
#    0
# 0  a
# 1  b
df.columns.name = 'temp'
print(df.head())
# OUTPUT:
# temp  0
# 0     a
# 1     b
Giving names to column levels can be useful in many ways when you manipulate your data.
A simple example is when you use `stack()`:
df = pd.DataFrame([['a', 'b'], ['d', 'e']], columns=['hello', 'world'])
print(df.stack())
0  hello    a
   world    b
1  hello    d
   world    e
df.columns.name = 'temp'
print(df.stack())
   temp
0  hello    a
   world    b
1  hello    d
   world    e
dtype: object
As you can see, the stacked df has kept the level name of the columns. In a multi-index / multi-level dataframe this can be very useful.
A slightly more complex example (from the docs):
import numpy as np

tuples = [('bar', 'one'),
          ('bar', 'two'),
          ('baz', 'one'),
          ('baz', 'two'),
          ('foo', 'one'),
          ('foo', 'two'),
          ('qux', 'one'),
          ('qux', 'two')]
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
print(index)
MultiIndex(levels=[['bar', 'baz', 'foo', 'qux'], ['one', 'two']],
           labels=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]],
           names=['first', 'second'])
s = pd.Series(np.random.randn(8), index=index)
print(s)
first  second
bar    one      -0.9166
       two       1.0698
baz    one      -0.8749
       two       1.3895
foo    one       0.5333
       two       0.1014
qux    one      -1.2350
       two      -0.6479
dtype: float64
s.unstack()
second      one     two
first
bar     -0.9166  1.0698
baz     -0.8749  1.3895
foo      0.5333  0.1014
qux     -1.2350 -0.6479
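A small side note (my addition, not part of the docs excerpt): the level names survive the reshape, so you can read them back off the unstacked frame, roughly like this:

wide = s.unstack()
print(wide.index.name)    # 'first'
print(wide.columns.name)  # 'second'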
Suppose I have the following dataframe:
df = pd.DataFrame(dict(Foo=['A', 'A', 'B', 'B'], Bar=[1, 2, 3, 4]))
i.e.:
Bar Foo
0 1 A
1 2 A
2 3 B
3 4 B
Then I create a pandas.GroupBy object:
g = df.groupby('Foo')
How can I get, from g, the fact that g is grouped by a column originally named Foo?
If I do g.groups I get:
{'A': Int64Index([0, 1], dtype='int64'),
'B': Int64Index([2, 3], dtype='int64')}
That tells me the values that the Foo column takes ('A' and 'B') but not the original column name.
Now, I can just do something like:
g.first().index.name
But it seems odd that there's not an attribute of g with the group name in it, so I feel like I must be missing something. In particular, if g was grouped by multiple columns, then the above doesn't work:
df = pd.DataFrame(dict(Foo=['A', 'A', 'B', 'B'], Baz=['C', 'D', 'C', 'D'], Bar=[1, 2, 3, 4]))
g = df.groupby(['Foo', 'Baz'])
g.first().index.name # returns None, because it's a MultiIndex
g.first().index.names # returns ['Foo', 'Baz']
For context, I am trying to do some plotting with a grouped dataframe, and I want to be able to label each facet (which is plotting a single group) with the name of that group as well as the group label.
Is there a better way?
Query the groupby object's grouper.names (a BaseGrouper attribute) to get a list of all grouping keys:
df.groupby('Foo').grouper.names
Which gives,
['Foo']
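The same works with multiple grouping keys (my own sketch, reusing the two-key df from the question; note that grouper is an internal attribute, so it may change between pandas versions):

df = pd.DataFrame(dict(Foo=['A', 'A', 'B', 'B'], Baz=['C', 'D', 'C', 'D'], Bar=[1, 2, 3, 4]))
df.groupby(['Foo', 'Baz']).grouper.names
# ['Foo', 'Baz']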
I find the behavior of the groupby method on a DataFrame object unexpected.
Let me explain with an example.
import numpy as np
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': np.random.randn(5),
                   'data2': np.random.randn(5)})
data1 = df['data1']
data1
# Out[14]:
# 0 1.989430
# 1 -0.250694
# 2 -0.448550
# 3 0.776318
# 4 -1.843558
# Name: data1, dtype: float64
data1 does not have the 'key1' column anymore.
So I would expect to get an error if I applied the following operation:
grouped = data1.groupby(df['key1'])
But I don't, and I can further apply the mean method on grouped to get the expected result.
grouped.mean()
# Out[13]:
# key1
# a -0.034941
# b 0.163884
# Name: data1, dtype: float64
However, the operation above does group data1 using the 'key1' column of df.
How can this happen? Does the interpreter store information of the originating DataFrame (df in this case) with the created DataFrame/series (data1 in this case)?
Thank you.
It is only syntactic sugar; see the pandas groupby docs on selecting a column (Series) separately, which say:
This is mainly syntactic sugar for the alternative and much more verbose:
s = df['data1'].groupby(df['key1']).mean()
print (s)
key1
a 0.565292
b 0.106360
Name: data1, dtype: float64
Although the grouping columns are typically from the same dataframe or series, they don't have to be.
Your statement data1.groupby(df['key1']) is equivalent to data1.groupby(['a', 'a', 'b', 'b', 'a']). In fact, you can inspect the actual groups:
>>> data1.groupby(['a', 'a', 'b', 'b', 'a']).groups
{'a': [0, 1, 4], 'b': [2, 3]}
This means that your groupby on data1 will have a group a using rows 0, 1, and 4 from data1 and a group b using rows 2 and 3.
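To make the point concrete (a sketch of my own, not from the answer above): the grouper only has to align with data1's index, so it can be any equal-length sequence, or even a function applied to the index labels:

# group rows by whether their integer index label is even or odd
data1.groupby(lambda i: i % 2).mean()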
I know that we can use .nunique() on a groupby column to find out the unique number of elements in the column like below:
df = pd.DataFrame({'c1':['foo', 'bar', 'foo', 'foo'], 'c2': ['A', 'B', 'A', 'B'], 'c3':[1, 2, 1, 1]})
c1 c2 c3
0 foo A 1
1 bar B 2
2 foo A 1
3 foo B 1
df.groupby('c1')['c2'].nunique()
c1
bar 1
foo 2
Name: c2, dtype: int64
However, now I have a groupby object that contains multiple columns; is there any way to find out the number of unique rows?
df.groupby('c1')['c2', 'c3'].???
Update:
So the end result I want is the number of unique rows within each group that's grouped based on the 'c1' column, such as this:
foo 2
bar 1
Update 2:
Here's a new test dataframe:
df = pd.DataFrame({'c1': ['foo', 'bar', 'foo', 'foo', 'bar'],
                   'c2': ['A', 'B', 'A', 'B', 'A'],
                   'c3': [1, 2, 1, 1, 1]})
UPDATE:
In [131]: df.groupby(['c1','c2','c3']).size().rename('count').reset_index()[['c1','count']].drop_duplicates(subset=['c1'])
Out[131]:
c1 count
0 bar 1
1 foo 2
OLD answer:
IIUC you need this:
In [43]: df.groupby(['c1','c2','c3']).size()
Out[43]:
c1   c2  c3
bar  B   2     1
foo  A   1     2
     B   1     1
dtype: int64
Finally figured out how to do this!
import pandas as pd
import numpy as np
df = pd.DataFrame({'c1': ['foo', 'bar', 'foo', 'foo', 'bar'],
                   'c2': ['A', 'B', 'A', 'B', 'A'],
                   'c3': [1, 2, 1, 1, 1]})

def check_unique(df):
    return len(df.groupby(list(df.columns.values)))

print(df.groupby('c1').apply(check_unique))
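A similar variant (my sketch, not part of the original answer) counts the distinct rows per group with drop_duplicates instead of a nested groupby:

# for each 'c1' group, count its distinct (c2, c3) rows
print(df.groupby('c1').apply(lambda g: len(g[['c2', 'c3']].drop_duplicates())))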
If you need the nunique of the concatenated columns c2 and c3, the easier way is:
df['c'] = df.c2 + df.c3.astype(str)
print (df.groupby('c1')['c'].nunique())
c1
bar 1
foo 2
Name: c, dtype: int64
Or group the Series c by the column df.c1:
c = df.c2.astype(str) + df.c3.astype(str)
print (c.groupby([df.c1]).nunique())
c1
bar 2
foo 2
dtype: int64
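One caveat with the concatenation approach (my note, not from the answer above): plain string concatenation can collide, e.g. 'A' + '12' and 'A1' + '2' both give 'A12', so inserting a separator between the columns is safer:

c = df.c2.astype(str) + '_' + df.c3.astype(str)
print(c.groupby(df.c1).nunique())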
Given I have this multiindexed dataframe:
>>> import pandas as p
>>> import numpy as np
...
>>> arrays = [np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo']),
...           np.array(['one', 'two', 'one', 'two', 'one', 'two'])]
...
>>> s = p.Series(np.random.randn(6), index=arrays)
>>> s
bar  one   -1.046752
     two    2.035839
baz  one    1.192775
     two    1.774266
foo  one   -1.716643
     two    1.158605
dtype: float64
How should I eliminate the index label bar?
I tried drop:
>>> s1 = s.drop('bar')
>>> s1
baz  one    1.192775
     two    1.774266
foo  one   -1.716643
     two    1.158605
dtype: float64
Seems OK but bar is still there in some bizarre way:
>>> s1.index
MultiIndex(levels=[[u'bar', u'baz', u'foo'], [u'one', u'two']],
labels=[[1, 1, 2, 2], [0, 1, 0, 1]])
>>> s1['bar']
Series([], dtype: float64)
>>>
How could I get rid of any residue from this index label?
Definitely looks like a bug.
s1.index.tolist() returns the expected value without "bar".
>>> s1.index.tolist()
[('baz', 'one'), ('baz', 'two'), ('foo', 'one'), ('foo', 'two')]
s1["bar"] returns a null Series.
>>> s1["bar"]
Series([], dtype: float64)
The standard ways to delete it don't seem to work either:
>>> del s1["bar"]
>>> s1["bar"]
Series([], dtype: float64)
>>> s1.__delitem__("bar")
>>> s1["bar"]
Series([], dtype: float64)
However, as expected, trying to grab a nonexistent key raises a KeyError:
>>> s1["booz"]
... KeyError: 'booz'
The main difference shows up when you actually look at the source code for the two in pandas/core/index.py:
class MultiIndex(Index):
...
def _get_levels(self):
return self._levels
...
def _get_labels(self):
return self._labels
# ops compat
def tolist(self):
"""
return a list of the Index values
"""
return list(self.values)
So index.tolist() and _labels aren't accessing the same piece of shared information; in fact, they aren't even close.
So, we can use this to manually update the resulting indexer.
>>> s1.index.labels
FrozenList([[1, 1, 2, 2], [0, 1, 0, 1]])
>>> s1.index._levels
FrozenList([[u'bar', u'baz', u'foo'], [u'one', u'two']])
>>> s1.index.values
array([('baz', 'one'), ('baz', 'two'), ('foo', 'one'), ('foo', 'two')], dtype=object)
If we compare this to the initial multindexed index, we get
>>> s.index.labels
FrozenList([[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]])
>>> s.index._levels
FrozenList([[u'bar', u'baz', u'foo'], [u'one', u'two']])
So the _levels attribute isn't updated, while the values are.
EDIT: Overriding it wasn't as easy as I thought.
EDIT: I wrote a custom function to fix this behavior:
from pandas.core.base import FrozenList, FrozenNDArray
def drop(series, level, index_name):
    # make new tmp series
    new_series = series.drop(index_name)
    # grab all indexing labels, levels, attributes
    levels = new_series.index.levels
    labels = new_series.index.labels
    index_pos = levels[level].tolist().index(index_name)
    # now need to reset the actual levels
    level_names = levels[level]
    # has no __delitem__, so... need to remake
    tmp_names = FrozenList([i for i in level_names if i != index_name])
    levels = FrozenList([j if i != level else tmp_names
                         for i, j in enumerate(levels)])
    # need to turn off validation
    new_series.index.set_levels(levels, verify_integrity=False, inplace=True)
    # reset the labels
    level_labels = labels[level].tolist()
    tmp_labels = FrozenNDArray([i-1 if i > index_pos else i
                                for i in level_labels])
    labels = FrozenList([j if i != level else tmp_labels
                         for i, j in enumerate(labels)])
    new_series.index.set_labels(labels, verify_integrity=False, inplace=True)
    return new_series
Example usage:
>>> s1 = drop(s, 0, "bar")
>>> s1.index
MultiIndex(levels=[[u'baz', u'foo'], [u'one', u'two']],
labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
>>> s1.index.tolist()
[('baz', 'one'), ('baz', 'two'), ('foo', 'one'), ('foo', 'two')]
>>> s1["bar"]
...
KeyError: 'bar'
EDIT: This seems to be specific to dataframes/series with multiindexing, as the standard pandas.core.index.Index class does not have the same limitations. I would recommend filing a bug report.
Consider the same series with a standard index:
>>> s = p.Series(np.random.randn(6))
>>> s.index
Int64Index([0, 1, 2, 3, 4, 5], dtype='int64')
>>> s.drop(0, inplace=True)
>>> s.index
Int64Index([1, 2, 3, 4, 5], dtype='int64')
The same is true for a dataframe
>>> df = p.DataFrame([np.random.randn(6), np.random.randn(6)])
>>> df.index
Int64Index([0, 1], dtype='int64')
>>> df.drop(0, inplace=True)
>>> df.index
Int64Index([1], dtype='int64')
See long discussion here.
Bottom line, it's not obvious when to recompute the levels, as the operation a user is doing is unknown (think of it from the Index's perspective). For example, say you are dropping, then adding a value back to a level (e.g. via indexing). Recomputing on every such step would be very wasteful and somewhat compute intensive.
In [11]: s1.index
Out[11]:
MultiIndex(levels=[[u'bar', u'baz', u'foo'], [u'one', u'two']],
labels=[[1, 1, 2, 2], [0, 1, 0, 1]])
Here is the actual index itself.
In [12]: s1.index.values
Out[12]: array([('baz', 'one'), ('baz', 'two'), ('foo', 'one'), ('foo', 'two')], dtype=object)
In [13]: s1.index.get_level_values(0)
Out[13]: Index([u'baz', u'baz', u'foo', u'foo'], dtype='object')
In [14]: s1.index.get_level_values(1)
Out[14]: Index([u'one', u'two', u'one', u'two'], dtype='object')
If you really feel it is necessary to 'get rid' of the removed level, then simply recreate the index. However, keeping it around is not harmful at all. These factorizations (e.g. the labels) are hidden from the user (yes, they are displayed, but that is, to be honest, more of a confusing pain point, hence this question).
In [15]: pd.MultiIndex.from_tuples(s1.index.values)
Out[15]:
MultiIndex(levels=[[u'baz', u'foo'], [u'one', u'two']],
labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
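For what it's worth, newer pandas versions (0.20+, if I recall correctly) expose this recreation directly as MultiIndex.remove_unused_levels(), so a shorter route would be something like:

In [16]: s1.index = s1.index.remove_unused_levels()

In [17]: s1.index
Out[17]:
MultiIndex(levels=[[u'baz', u'foo'], [u'one', u'two']],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]])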
I am trying to subset hierarchical data that has two row ids.
Say I have data in hdf
import numpy as np
from pandas import DataFrame, MultiIndex

index = MultiIndex(levels=[['foo', 'bar', 'baz', 'qux'],
                           ['one', 'two', 'three']],
                   labels=[[0, 0, 0, 1, 1, 2, 2, 3, 3, 3],
                           [0, 1, 2, 0, 1, 1, 2, 0, 1, 2]])
hdf = DataFrame(np.random.randn(10, 3), index=index,
                columns=['A', 'B', 'C'])
hdf
And I wish to subset it so that I see only foo and qux, only sub-row two, and columns A and C.
I can do this in two steps as follows:
sub1 = hdf.ix[['foo','qux'], ['A', 'C']]
sub1.xs('two', level=1)
Is there a single-step way to do this?
thanks
In [125]: hdf[hdf.index.get_level_values(0).isin(['foo', 'qux']) & (hdf.index.get_level_values(1) == 'two')][['A', 'C']]
Out[125]:
A C
foo two -0.113320 -1.215848
qux two 0.953584 0.134363
Much more complicated, but this approach works better if you have many different values you want to choose from in the first level.
It doesn't look the nicest, but use tuples to get the rows you want and then square brackets to select the columns.
In [36]: hdf.loc[[('foo', 'two'), ('qux', 'two')]][['A', 'C']]
Out[36]:
A C
foo two -0.356165 0.565022
qux two -0.701186 0.026532
loc could be swapped out for ix here.
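As a small variation (my sketch), loc also accepts the row and column indexers in a single call, which avoids the chained selection:

In [37]: hdf.loc[[('foo', 'two'), ('qux', 'two')], ['A', 'C']]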
itertools to the rescue:
>>> from itertools import product
>>>
>>> def _p(*iterables):
...     return list(product(*iterables))
...
>>> hdf.ix[ _p(('foo','qux'),('two',)), ['A','C'] ]
A C
foo two 1.125401 1.389568
qux two 1.051455 -0.271256
>>>
Thanks everyone for your help. I also hit upon this solution:
hdf.ix[['bar','qux'], ['A', 'C']].xs('two', level=1)
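For completeness, on newer pandas versions a single loc call with pd.IndexSlice can also express the whole selection in one step (my sketch, assuming the index is lexsorted, which it is for the hdf constructed above):

import pandas as pd

idx = pd.IndexSlice
hdf.loc[idx[['foo', 'qux'], 'two'], ['A', 'C']]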