Completely remove one index label from a MultiIndex in a DataFrame - python

Given I have this multiindexed dataframe:
>>> import pandas as pd
>>> import numpy as np
...
>>> arrays = [np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo']),
...           np.array(['one', 'two', 'one', 'two', 'one', 'two'])]
...
>>> s = pd.Series(np.random.randn(6), index=arrays)
>>> s
bar  one   -1.046752
     two    2.035839
baz  one    1.192775
     two    1.774266
foo  one   -1.716643
     two    1.158605
dtype: float64
How should I eliminate the index label bar?
I tried drop:
>>> s1 = s.drop('bar')
>>> s1
baz  one    1.192775
     two    1.774266
foo  one   -1.716643
     two    1.158605
dtype: float64
Seems OK but bar is still there in some bizarre way:
>>> s1.index
MultiIndex(levels=[[u'bar', u'baz', u'foo'], [u'one', u'two']],
           labels=[[1, 1, 2, 2], [0, 1, 0, 1]])
>>> s1['bar']
Series([], dtype: float64)
>>>
How can I get rid of any residue of this index label?

Definitely looks like a bug.
s1.index.tolist() returns the expected value, without "bar".
>>> s1.index.tolist()
[('baz', 'one'), ('baz', 'two'), ('foo', 'one'), ('foo', 'two')]
s1["bar"] returns a null Series.
>>> s1["bar"]
Series([], dtype: float64)
The standard deletion methods don't seem to work either:
>>> del s1["bar"]
>>> s1["bar"]
Series([], dtype: float64)
>>> s1.__delitem__("bar")
>>> s1["bar"]
Series([], dtype: float64)
However, as expected, trying to grab an unknown key raises a KeyError:
>>> s1["booz"]
... KeyError: 'booz'
The main difference shows up when you look at the source code for the two classes in pandas/core/index.py:
class MultiIndex(Index):
    ...
    def _get_levels(self):
        return self._levels
    ...
    def _get_labels(self):
        return self._labels

    # ops compat
    def tolist(self):
        """
        return a list of the Index values
        """
        return list(self.values)
So index.tolist() and the _labels aren't accessing the same piece of shared information; in fact, they aren't even close.
So, we can use this to manually update the resulting indexer.
>>> s1.index.labels
FrozenList([[1, 1, 2, 2], [0, 1, 0, 1]])
>>> s1.index._levels
FrozenList([[u'bar', u'baz', u'foo'], [u'one', u'two']])
>>> s1.index.values
array([('baz', 'one'), ('baz', 'two'), ('foo', 'one'), ('foo', 'two')], dtype=object)
If we compare this to the initial multiindexed index, we get:
>>> s.index.labels
FrozenList([[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]])
>>> s.index._levels
FrozenList([[u'bar', u'baz', u'foo'], [u'one', u'two']])
So the _levels attribute isn't updated, while the values are.
EDIT: Overriding it wasn't as easy as I thought.
EDIT: Wrote a custom function to fix this behavior:
from pandas.core.base import FrozenList, FrozenNDArray

def drop(series, level, index_name):
    # make new tmp series
    new_series = series.drop(index_name)
    # grab all indexing labels, levels, attributes
    levels = new_series.index.levels
    labels = new_series.index.labels
    index_pos = levels[level].tolist().index(index_name)
    # now need to reset the actual levels
    level_names = levels[level]
    # has no __delitem__, so... need to remake
    tmp_names = FrozenList([i for i in level_names if i != index_name])
    levels = FrozenList([j if i != level else tmp_names
                         for i, j in enumerate(levels)])
    # need to turn off validation
    new_series.index.set_levels(levels, verify_integrity=False, inplace=True)
    # reset the labels
    level_labels = labels[level].tolist()
    tmp_labels = FrozenNDArray([i - 1 if i > index_pos else i
                                for i in level_labels])
    labels = FrozenList([j if i != level else tmp_labels
                         for i, j in enumerate(labels)])
    new_series.index.set_labels(labels, verify_integrity=False, inplace=True)
    return new_series
Example usage:
>>> s1 = drop(s, 0, "bar")
>>> s1.index
MultiIndex(levels=[[u'baz', u'foo'], [u'one', u'two']],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
>>> s1.index.tolist()
[('baz', 'one'), ('baz', 'two'), ('foo', 'one'), ('foo', 'two')]
>>> s1["bar"]
...
KeyError: 'bar'
EDIT: This seems to be specific to dataframes/series with multiindexing, as the standard pandas.core.index.Index class does not have the same limitations. I would recommend filing a bug report.
Consider the same series with a standard index:
>>> s = pd.Series(np.random.randn(6))
>>> s.index
Int64Index([0, 1, 2, 3, 4, 5], dtype='int64')
>>> s.drop(0, inplace=True)
>>> s.index
Int64Index([1, 2, 3, 4, 5], dtype='int64')
The same is true for a DataFrame:
>>> df = pd.DataFrame([np.random.randn(6), np.random.randn(6)])
>>> df.index
Int64Index([0, 1], dtype='int64')
>>> df.drop(0, inplace=True)
>>> df.index
Int64Index([1], dtype='int64')

See long discussion here.
Bottom line: it's not obvious when to recompute the levels, as the operation the user is performing is unknown (think about it from the Index's perspective). For example, say you are dropping, then adding a value to a level (e.g. via indexing). Recomputing at every step would be very wasteful and somewhat compute intensive.
In [11]: s1.index
Out[11]:
MultiIndex(levels=[[u'bar', u'baz', u'foo'], [u'one', u'two']],
           labels=[[1, 1, 2, 2], [0, 1, 0, 1]])
Here is the actual index itself.
In [12]: s1.index.values
Out[12]: array([('baz', 'one'), ('baz', 'two'), ('foo', 'one'), ('foo', 'two')], dtype=object)
In [13]: s1.index.get_level_values(0)
Out[13]: Index([u'baz', u'baz', u'foo', u'foo'], dtype='object')
In [14]: s1.index.get_level_values(1)
Out[14]: Index([u'one', u'two', u'one', u'two'], dtype='object')
If you really feel it is necessary to 'get rid' of the removed level, then simply recreate the index. However, the leftover level is not harmful at all. These factorizations (e.g. the labels) are meant to be hidden from the user (yes, they are displayed, but to be honest that is more of a confusing pain point, hence this question).
In [15]: pd.MultiIndex.from_tuples(s1.index.values)
Out[15]:
MultiIndex(levels=[[u'baz', u'foo'], [u'one', u'two']],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
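
For reference, later pandas versions grew a method that performs exactly this recomputation: MultiIndex.remove_unused_levels(). A minimal sketch, assuming pandas >= 0.20 (where the method was added):

In [16]: s1.index = s1.index.remove_unused_levels()  # recompute levels/labels

In [17]: s1.index
Out[17]:
MultiIndex(levels=[[u'baz', u'foo'], [u'one', u'two']],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]])

After this, s1['bar'] raises a KeyError, which is what the question asked for.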

Related

Weird behaviour of enumerate while using a pandas DataFrame

I have a dataframe (df):
df = pd.DataFrame({'a':[1],'l':[2],'m':[3],'k':[4],'s':[5],'f':[6]},index=[0])
I am using enumerate on the rows:
res = [tuple(x) for x in enumerate(df.values)]
print(res)
>>> [(1, 1, 6, 4, 2, 3, 5)] ### the elements are int type
Now when I change the datatype of one column of my DataFrame df:
df2 = pd.DataFrame({'a':[1],'l':[2],'m':[3],'k':[4],'s':[5.5],'f':[6]},index=[0])
and again use enumerate, I get:
res2 = [tuple(x) for x in enumerate(df2.values)]
print(res2)
>>> [(1, 1.0, 6.0, 4.0, 2.0, 3.0, 5.5)] ### the elements data type has changed
I don't understand why. I am also looking for a solution where each element keeps its own datatype, for example:
res = [(1, 1, 6, 4, 2, 3, 5.5)]
How can I achieve this?
This has nothing to do with enumerate; that's a red herring. The issue is that you are looking for mixed-type output, whereas Pandas prefers storing homogeneous data.
What you are looking for is not recommended with Pandas. Your data type should be int or float, not a combination. This has performance repercussions, since the only straightforward alternative is to use object dtype series, which only permit operations at Python speed. Converting to object dtype is inefficient.
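You can see the coercion directly: the frame stores per-column dtypes, but .values has to produce a single 2-D numpy array with one dtype (a quick illustration on the df2 from the question):

>>> df2.values.dtype   # one dtype for the whole array,
dtype('float64')       # so the int64 columns are upcast to float64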
So here's what you can do:
res2 = df2.astype(object).values.tolist()[0]
print(res2)
[1, 6, 4, 2, 3, 5.5]
One method which avoids the object conversion:
from itertools import chain
from operator import itemgetter, methodcaller
iter_series = map(itemgetter(1), df2.items())
res2 = list(chain.from_iterable(map(methodcaller('tolist'), iter_series)))
[1, 6, 4, 2, 3, 5.5]
Performance benchmarking
If you want a list of tuples as output, one tuple for each row, then the series-based solution performs better:
# Python 3.6.0, Pandas 0.19.2
df2 = pd.DataFrame({'a':[1],'l':[2],'m':[3],'k':[4],'s':[5.5],'f':[6]},index=[0])

from itertools import chain
from operator import itemgetter, methodcaller

n = 10**5
df2 = pd.concat([df2]*n)

def jpp_series(df2):
    iter_series = map(itemgetter(1), df2.items())
    return list(zip(*map(methodcaller('tolist'), iter_series)))

def jpp_object1(df2):
    return df2.astype(object).values.tolist()

def jpp_object2(df2):
    return list(map(tuple, df2.astype(object).values.tolist()))

assert jpp_series(df2) == jpp_object2(df2)

%timeit jpp_series(df2)   # 39.7 ms per loop
%timeit jpp_object1(df2)  # 43.7 ms per loop
%timeit jpp_object2(df2)  # 68.2 ms per loop
The issue is that calling df2.values will cause df2's data to be returned as a numpy array having a single dtype, where all the integers are also coerced to float.
You can prevent this coercion by operating on object arrays.
Use astype(object) to convert the underlying numpy array to object and prevent type coercion:
>>> [(i, *x) for i, x in df2.astype(object).iterrows()]
[(0, 1, 2, 3, 4, 5.5, 6)]
Or,
>>> [(i, *x) for i, x in enumerate(df2.astype(object).values)]
[(0, 1, 2, 3, 4, 5.5, 6)]
Or, on older Python versions (without the extended unpacking used above),
>>> [(i,) + tuple(x) for i, x in enumerate(df2.astype(object).values)]
[(0, 1, 2, 3, 4, 5.5, 6)]
Your df2 has mixed dtypes:
In [23]: df2 = pd.DataFrame({'a':[1],'l':[2],'m':[3],'k':[4],'s':[5.5],'f':[6]},index=[0])
...:
In [24]: df2.dtypes
Out[24]:
a      int64
f      int64
k      int64
l      int64
m      int64
s    float64
dtype: object
Therefore, using .values will "upcast" to the lowest common denominator. From the docs:
The dtype will be a lower-common-denominator dtype (implicit upcasting); that is to say if the dtypes (even of numeric types) are mixed, the one that accommodates all will be chosen. Use this with care if you are not dealing with the blocks.
It looks like you actually just want .itertuples:
In [25]: list(df2.itertuples())
Out[25]: [Pandas(Index=0, a=1, f=6, k=4, l=2, m=3, s=5.5)]
Note, this conveniently returns a list of namedtuple objects; if you really just want plain tuples, map tuple onto it:
In [26]: list(map(tuple, df2.itertuples()))
Out[26]: [(0, 1, 6, 4, 2, 3, 5.5)]
But there's really no need.

Can an empty MultiIndex be used to slice a DataFrame?

I have a pandas MultiIndex that I'm trying to use to slice a DataFrame. When the MultiIndex is empty, it results in a ValueError:
(Pdb) p _ix
MultiIndex(levels=[[], [], [], []],
           labels=[[], [], [], []],
           names=['foo', 'bar', 'baz', 'raz'])
(Pdb) p df.index
MultiIndex(levels=[['adni'], ['123', '234'], ['M12_s1', 'M24_s1'], ['CRB', 'CRB_crop', 'PON']],
           labels=[[0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [1, 1, 1, 1, 1], [0, 0, 0, 0, 0]],
           names=['foo', 'bar', 'baz', 'raz'])
(Pdb) p df.loc[_ix]
*** ValueError: operands could not be broadcast together with shapes (0,) (4,) (0,)
The names of the two indices match, so from what I understand this slicing should be fine.
When _ix isn't empty, this works as expected. I can't find anything in the documentation that describes an empty indexer like this as unsupported, though. Am I missing something obvious?
Edit: adding concrete example:
$ cat so_qc.csv
foo,bar,baz,raz,qc
pd,andrew,M24_s1,CRB,True
pd,andrew,M24_s1,CRB_crop,True
$ cat so_df.csv
foo,bar,baz,raz,value
pd,andrew,M24_s1,CRB,0.701794977111406
pd,andrew,M24_s1,CRB,0.309406238674409
$ python
import pandas as pd
qc = pd.read_csv('so_qc.csv', index_col=[0,1,2,3], squeeze=True)
df = pd.read_csv('so_df.csv', index_col=[0,1,2,3])
# This is OK
df.loc[ qc.index.intersection(df.index) ]
# When I select only False elements from `qc` (which is none of them), ValueError about broadcasting
df.loc[ qc[~qc].index.intersection(df.index) ]
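
Until the underlying issue is resolved, a possible workaround (a sketch that sidesteps the bug rather than fixing it) is to guard against the empty indexer yourself:

ix = qc[~qc].index.intersection(df.index)
# df.iloc[0:0] is an empty frame with the same columns and index
# structure, i.e. what the empty .loc lookup should have returned
result = df.loc[ix] if len(ix) else df.iloc[0:0]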

What is DataFrame.columns.name?

Could you explain to me what the purpose of the 'DataFrame.columns.name' attribute is?
I unintentionally got it after creating a pivot table and resetting the index.
import pandas as pd
df = pd.DataFrame(['a', 'b'])
print(df.head())
# OUTPUT:
#    0
# 0  a
# 1  b

df.columns.name = 'temp'
print(df.head())
# OUTPUT:
# temp  0
# 0     a
# 1     b
Giving a name to column levels can be useful in many ways when you manipulate your data.
A simple example would be when you use stack():
df = pd.DataFrame([['a', 'b'], ['d', 'e']], columns=['hello', 'world'])
print(df.stack())
0  hello    a
   world    b
1  hello    d
   world    e
dtype: object

df.columns.name = 'temp'
print(df.stack())
   temp
0  hello    a
   world    b
1  hello    d
   world    e
dtype: object
As you can see, the stacked df has kept the level name of the columns. In a multi-index / multi-level dataframe this can be very useful.
A slightly more complex example (from the docs):
tuples = [('bar', 'one'),
          ('bar', 'two'),
          ('baz', 'one'),
          ('baz', 'two'),
          ('foo', 'one'),
          ('foo', 'two'),
          ('qux', 'one'),
          ('qux', 'two')]
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
print(index)
MultiIndex(levels=[['bar', 'baz', 'foo', 'qux'], ['one', 'two']],
           labels=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]],
           names=['first', 'second'])
s = pd.Series(np.random.randn(8), index=index)
print(s)
first  second
bar    one      -0.9166
       two       1.0698
baz    one      -0.8749
       two       1.3895
foo    one       0.5333
       two       0.1014
qux    one      -1.2350
       two      -0.6479
dtype: float64
s.unstack()
second     one     two
first
bar    -0.9166  1.0698
baz    -0.8749  1.3895
foo     0.5333  0.1014
qux    -1.2350 -0.6479
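
This also shows where the attribute from the question comes from: after unstack() (or a pivot), the columns are built from an index level and inherit that level's name. A quick check:

s.unstack().columns
# Index([u'one', u'two'], dtype='object', name=u'second')
s.unstack().columns.name
# 'second'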

concatenate arrays with mixed types

Consider the np.array a:
a = np.concatenate(
    [np.arange(2).reshape(-1, 1),
     np.array([['a'], ['b']])],
    axis=1)

a
array([['0', 'a'],
       ['1', 'b']],
      dtype='|S11')
How can I execute this concatenation such that the first column of a remains integers?
You can mix types in a numpy array by using numpy.object as the dtype:
>>> import numpy as np
>>> a = np.empty((2, 0), dtype=np.object)
>>> a = np.append(a, np.arange(2).reshape(-1,1), axis=1)
>>> a = np.append(a, np.array([['a'],['b']]), axis=1)
>>> a
array([[0, 'a'],
[1, 'b']], dtype=object)
>>> type(a[0,0])
<type 'int'>
>>> type(a[0,1])
<type 'str'>
A suggested duplicate recommends making a recarray or structured array.
Store different datatypes in one NumPy array?
In this case:
In [324]: a = np.rec.fromarrays((np.arange(2).reshape(-1,1), np.array([['a'],['b']])))

In [325]: a
Out[325]:
rec.array([[(0, 'a')],
           [(1, 'b')]],
          dtype=[('f0', '<i4'), ('f1', '<U1')])

In [326]: a['f0']
Out[326]:
array([[0],
       [1]])

In [327]: a['f1']
Out[327]:
array([['a'],
       ['b']],
      dtype='<U1')
(I have reopened this because I think both approaches need to be acknowledged. Plus the object answer was already given and accepted.)
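
For completeness, since the duplicate mentions structured arrays as well as recarrays, here is a sketch of the plain structured-array variant (the field names 'f0'/'f1' simply mirror the defaults above):

In [328]: b = np.zeros(2, dtype=[('f0', '<i4'), ('f1', 'U1')])  # one record per row

In [329]: b['f0'] = np.arange(2)   # the integer field stays integer

In [330]: b['f1'] = ['a', 'b']     # the string field stays string

In [331]: b
Out[331]:
array([(0, 'a'), (1, 'b')],
      dtype=[('f0', '<i4'), ('f1', '<U1')])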

subsetting hierarchical data in pandas

I am trying to subset hierarchical data that has two row ids.
Say I have data in hdf
index = MultiIndex(levels=[['foo', 'bar', 'baz', 'qux'],
                           ['one', 'two', 'three']],
                   labels=[[0, 0, 0, 1, 1, 2, 2, 3, 3, 3],
                           [0, 1, 2, 0, 1, 1, 2, 0, 1, 2]])
hdf = DataFrame(np.random.randn(10, 3), index=index,
                columns=['A', 'B', 'C'])
hdf
And I wish to subset so that I see foo and qux, only sub-row two, and only columns A and C.
I can do this in two steps as follows:
sub1 = hdf.ix[['foo','qux'], ['A', 'C']]
sub1.xs('two', level=1)
Is there a single-step way to do this?
Thanks.
In [125]: hdf[hdf.index.get_level_values(0).isin(['foo', 'qux']) & (hdf.index.get_level_values(1) == 'two')][['A', 'C']]
Out[125]:
                A         C
foo two -0.113320 -1.215848
qux two  0.953584  0.134363
This is more complicated, but it works better if you have many different values to choose from in the first level.
It doesn't look the nicest, but use tuples to get the rows you want, then square brackets to select the columns.
In [36]: hdf.loc[[('foo', 'two'), ('qux', 'two')]][['A', 'C']]
Out[36]:
                A         C
foo two -0.356165  0.565022
qux two -0.701186  0.026532
loc could be swapped out for ix here.
itertools to the rescue:
>>> from itertools import product
>>>
>>> def _p(*iterables):
...     return list(product(*iterables))
...
>>> hdf.ix[ _p(('foo','qux'),('two',)), ['A','C'] ]
                A         C
foo two  1.125401  1.389568
qux two  1.051455 -0.271256
>>>
Thanks everyone for your help. I also hit upon this solution:
hdf.ix[['bar','qux'], ['A', 'C']].xs('two', level=1)
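
For what it's worth, on a reasonably modern pandas (0.14 and later) pd.IndexSlice makes this a genuine one-liner. A sketch (it assumes import pandas as pd, and may emit a PerformanceWarning if the MultiIndex is not lexsorted):

import pandas as pd

idx = pd.IndexSlice
# rows: first level in {'foo', 'qux'} and second level == 'two';
# columns: A and C -- all in a single .loc call
hdf.loc[idx[['foo', 'qux'], 'two'], ['A', 'C']]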
