subsetting hierarchical data in pandas - python

I am trying to subset hierarchical data that has two row ids.
Say I have data in hdf
index = MultiIndex(levels=[['foo', 'bar', 'baz', 'qux'],
                           ['one', 'two', 'three']],
                   labels=[[0, 0, 0, 1, 1, 2, 2, 3, 3, 3],
                           [0, 1, 2, 0, 1, 1, 2, 0, 1, 2]])
hdf = DataFrame(np.random.randn(10, 3), index=index,
                columns=['A', 'B', 'C'])
hdf
And I wish to subset so that I see only foo and qux, only sub-row two, and only columns A and C.
I can do this in two steps as follows:
sub1 = hdf.ix[['foo','qux'], ['A', 'C']]
sub1.xs('two', level=1)
Is there a single-step way to do this?
thanks

In [125]: hdf[hdf.index.get_level_values(0).isin(['foo', 'qux']) & (hdf.index.get_level_values(1) == 'two')][['A', 'C']]
Out[125]:
                A         C
foo two -0.113320 -1.215848
qux two  0.953584  0.134363
Much more complicated, but it works better if you have many different values to choose in level one.
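A self-contained version of this mask-based approach (the frame construction is copied from the question; note that newer pandas renamed the labels= argument to codes=, and the random values will of course differ):

```python
import numpy as np
import pandas as pd

# Rebuild the question's frame (labels= is called codes= in newer pandas).
index = pd.MultiIndex(
    levels=[['foo', 'bar', 'baz', 'qux'], ['one', 'two', 'three']],
    codes=[[0, 0, 0, 1, 1, 2, 2, 3, 3, 3],
           [0, 1, 2, 0, 1, 1, 2, 0, 1, 2]])
hdf = pd.DataFrame(np.random.randn(10, 3), index=index,
                   columns=['A', 'B', 'C'])

# Combine both level conditions into one boolean mask, then pick columns.
mask = (hdf.index.get_level_values(0).isin(['foo', 'qux'])
        & (hdf.index.get_level_values(1) == 'two'))
sub = hdf[mask][['A', 'C']]
```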

Doesn't look the nicest, but use tuples to get the rows you want and then square brackets to select the columns.
In [36]: hdf.loc[[('foo', 'two'), ('qux', 'two')]][['A', 'C']]
Out[36]:
                A         C
foo two -0.356165  0.565022
qux two -0.701186  0.026532
loc could be swapped out for ix here.

itertools to the rescue:
>>> from itertools import product
>>>
>>> def _p(*iterables):
...     return list(product(*iterables))
...
>>> hdf.ix[_p(('foo', 'qux'), ('two',)), ['A', 'C']]
                A         C
foo two  1.125401  1.389568
qux two  1.051455 -0.271256

Thanks everyone for your help. I also hit upon this solution:
hdf.ix[['foo','qux'], ['A', 'C']].xs('two', level=1)
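On pandas versions where .ix has been removed, one single-step alternative is .loc with pd.IndexSlice (a sketch; the frame construction is repeated from the question, labels= renamed to codes= on newer pandas, and sort_index() is there because partial MultiIndex selection is happiest on a lexsorted index):

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex(
    levels=[['foo', 'bar', 'baz', 'qux'], ['one', 'two', 'three']],
    codes=[[0, 0, 0, 1, 1, 2, 2, 3, 3, 3],
           [0, 1, 2, 0, 1, 1, 2, 0, 1, 2]])
hdf = pd.DataFrame(np.random.randn(10, 3), index=index,
                   columns=['A', 'B', 'C'])

# IndexSlice lets you spell both levels and the columns in one .loc call.
idx = pd.IndexSlice
sub = hdf.sort_index().loc[idx[['foo', 'qux'], 'two'], ['A', 'C']]
```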


How to get the position of certain columns in dataframe - Python [duplicate]

In R when you need to retrieve a column index based on the name of the column you could do
idx <- which(names(my_data)==my_colum_name)
Is there a way to do the same with pandas dataframes?
Sure, you can use .get_loc():
In [45]: df = DataFrame({"pear": [1,2,3], "apple": [2,3,4], "orange": [3,4,5]})
In [46]: df.columns
Out[46]: Index([apple, orange, pear], dtype=object)
In [47]: df.columns.get_loc("pear")
Out[47]: 2
although to be honest I don't often need this myself. Usually access by name does what I want it to (df["pear"], df[["apple", "orange"]], or maybe df.columns.isin(["orange", "pear"])), although I can definitely see cases where you'd want the index number.
Here is a solution through list comprehension. cols is the list of columns to get index for:
[df.columns.get_loc(c) for c in cols if c in df]
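A runnable sketch of that comprehension; the missing column "kiwi" is my own addition to show it is skipped silently (note that on recent pandas the columns keep dict insertion order, unlike the sorted order shown in the older output above):

```python
import pandas as pd

df = pd.DataFrame({"pear": [1, 2, 3], "apple": [2, 3, 4], "orange": [3, 4, 5]})
cols = ["pear", "kiwi", "apple"]  # "kiwi" is deliberately absent

# `c in df` tests column membership, so missing names are skipped.
positions = [df.columns.get_loc(c) for c in cols if c in df]
```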
DSM's solution works, but if you wanted a direct equivalent to which you could do (df.columns == name).nonzero()
For returning multiple column indices, I recommend using the pandas.Index method get_indexer, if you have unique labels:
df = pd.DataFrame({"pear": [1, 2, 3], "apple": [2, 3, 4], "orange": [3, 4, 5]})
df.columns.get_indexer(['pear', 'apple'])
# Out: array([0, 1], dtype=int64)
If you have non-unique labels in the index (columns only support unique labels), use get_indexer_for. It takes the same args as get_indexer:
df = pd.DataFrame(
    {"pear": [1, 2, 3], "apple": [2, 3, 4], "orange": [3, 4, 5]},
    index=[0, 1, 1])
df.index.get_indexer_for([0, 1])
# Out: array([0, 1, 2], dtype=int64)
Both methods also support non-exact indexing, e.g. for float values, taking the nearest value within a tolerance. If two indices have the same distance to the specified label, or are duplicates, the one with the larger index value is selected:
df = pd.DataFrame(
    {"pear": [1, 2, 3], "apple": [2, 3, 4], "orange": [3, 4, 5]},
    index=[0, .9, 1.1])
df.index.get_indexer([0, 1])
# array([ 0, -1], dtype=int64)
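A concrete sketch of the nearest-match mode described above (the index values and the tolerance are my own; method='nearest' requires a monotonic index):

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]}, index=[0.0, 0.9, 1.2])

# 1 is closer to 0.9 than to 1.2, so it maps to position 1.
near = df.index.get_indexer([0, 1], method='nearest')

# With a tolerance, anything farther away than 0.5 comes back as -1.
capped = df.index.get_indexer([2], method='nearest', tolerance=0.5)
```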
When you might be looking to find multiple column matches, a vectorized solution using the searchsorted method could be used. Thus, with df as the dataframe and query_cols as the column names to be searched for, an implementation would be -
def column_index(df, query_cols):
    cols = df.columns.values
    sidx = np.argsort(cols)
    return sidx[np.searchsorted(cols, query_cols, sorter=sidx)]
Sample run -
In [162]: df
Out[162]:
   apple  banana  pear  orange  peach
0      8       3     4       4      2
1      4       4     3       0      1
2      1       2     6       8      1
In [163]: column_index(df, ['peach', 'banana', 'apple'])
Out[163]: array([4, 1, 0])
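For completeness, the same function with the numpy import it assumes, on a frame with the column order above (the cell values are dummies; only the column order matters here):

```python
import numpy as np
import pandas as pd

def column_index(df, query_cols):
    # argsort gives a sorter so searchsorted can run on unsorted column names
    cols = df.columns.values
    sidx = np.argsort(cols)
    return sidx[np.searchsorted(cols, query_cols, sorter=sidx)]

df = pd.DataFrame(np.zeros((2, 5), dtype=int),
                  columns=['apple', 'banana', 'pear', 'orange', 'peach'])
out = column_index(df, ['peach', 'banana', 'apple'])
```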
Update: per the pandas docs, this accessor style is deprecated since version 0.25.0; use np.asarray(..) or .to_numpy() instead.
In case you want the column name from the column location (the other way around from the OP's question), you can use (note values is an attribute, not a method):
>>> df.columns.values[location]
Using @DSM's example:
>>> df = DataFrame({"pear": [1,2,3], "apple": [2,3,4], "orange": [3,4,5]})
>>> df.columns
Index(['apple', 'orange', 'pear'], dtype='object')
>>> df.columns.values[1]
'orange'
Other ways:
df.iloc[:,1].name
df.columns[location]  # (thanks to @roobie-nuby for pointing that out in the comments)
To modify DSM's answer a bit: get_loc has some weird properties depending on the type of index in the current version of pandas (1.1.5), so depending on your index type you might get back an integer position, a boolean mask, or a slice. This is somewhat frustrating for me because I don't want to handle all three cases just to extract one variable's index. Much simpler is to avoid the function altogether:
list(df.columns).index('pear')
Very straightforward and probably fairly quick.
how about this:
df = DataFrame({"pear": [1,2,3], "apple": [2,3,4], "orange": [3,4,5]})
out = np.argwhere(df.columns.isin(['apple', 'orange'])).ravel()
print(out)
# [1 2]
When the column might or might not exist, the following variant of the above works.
ix = None
try:
    ix = list(df.columns).index('Col_X')
except ValueError:
    ix = None
if ix is None:
    # do something
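The same guard reads a bit more cleanly wrapped in a small helper (the name find_col is my own):

```python
import pandas as pd

def find_col(df, name):
    """Return the integer position of `name`, or None if it is absent."""
    try:
        return list(df.columns).index(name)
    except ValueError:
        return None

df = pd.DataFrame({"pear": [1, 2, 3], "apple": [2, 3, 4]})
```

Here find_col(df, 'apple') gives 1, and find_col(df, 'kiwi') gives None instead of raising.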
import random

def char_range(c1, c2):  # question 7001144
    for c in range(ord(c1), ord(c2) + 1):
        yield chr(c)

df = pd.DataFrame()
for c in char_range('a', 'z'):
    df[f'{c}'] = random.sample(range(10), 3)  # random data

rearranged = random.sample(range(26), 26)  # random order
df = df.iloc[:, rearranged]
print(df.iloc[:, :15])  # 15-column view

for col in df.columns:  # list of indices and columns
    print(str(df.columns.get_loc(col)) + '\t' + col)

How can I get the name of grouping columns from a Pandas GroupBy object?

Suppose I have the following dataframe:
df = pd.DataFrame(dict(Foo=['A', 'A', 'B', 'B'], Bar=[1, 2, 3, 4]))
i.e.:
Bar Foo
0 1 A
1 2 A
2 3 B
3 4 B
Then I create a pandas.GroupBy object:
g = df.groupby('Foo')
How can I get, from g, the fact that g is grouped by a column originally named Foo?
If I do g.groups I get:
{'A': Int64Index([0, 1], dtype='int64'),
'B': Int64Index([2, 3], dtype='int64')}
That tells me the values that the Foo column takes ('A' and 'B') but not the original column name.
Now, I can just do something like:
g.first().index.name
But it seems odd that there's not an attribute of g with the group name in it, so I feel like I must be missing something. In particular, if g was grouped by multiple columns, then the above doesn't work:
df = pd.DataFrame(dict(Foo=['A', 'A', 'B', 'B'], Baz=['C', 'D', 'C', 'D'], Bar=[1, 2, 3, 4]))
g = df.groupby(['Foo', 'Baz'])
g.first().index.name # returns None, because it's a MultiIndex
g.first().index.names # returns ['Foo', 'Baz']
For context, I am trying to do some plotting with a grouped dataframe, and I want to be able to label each facet (which is plotting a single group) with the name of that group as well as the group label.
Is there a better way?
Query the grouper's names attribute to get a list of all grouping names:
df.groupby('Foo').grouper.names
Which gives,
['Foo']
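Note that .grouper is a semi-private attribute (newer pandas releases reportedly deprecate it), so here is a sketch that stays on the public API and also handles multiple keys, along the lines of the question's own fallback:

```python
import pandas as pd

df = pd.DataFrame(dict(Foo=['A', 'A', 'B', 'B'],
                       Baz=['C', 'D', 'C', 'D'],
                       Bar=[1, 2, 3, 4]))
g = df.groupby(['Foo', 'Baz'])

# index.names works for both single and multiple grouping keys
names = list(g.first().index.names)
```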

Issues with adding new rows to a pandas dataframe

Apologies if the formatting on this is strange, it's the first time I've posted anything. I've created a multi-index data frame in Python, which works fine:
arrays = [['one', 'one', 'two', 'two'],
          ['A', 'B', 'A', 'B']]
tuples = list(zip(*arrays))
mindex = pd.MultiIndex.from_tuples(tuples)
s = pd.DataFrame(data=np.random.randn(4), index=mindex, columns=['Values'])
s
This works fine, except that I think I should be able to add new rows by simply typing
s['Values'].loc[('Three', 'A')] = 1
s['Values'].loc[('Three','B')]= 2
This returns no error message, and I can check it has worked by entering
s['Values'].loc[('Three', 'A')]
Which gives me 1. So all as expected.
However, I can't see the 'Three' data in Jupyter notebook - if simply type
s
then it only shows me the original one, two, A & B rows. This is probably because the new row is not in the index:
s.index
returns
MultiIndex(levels=[['one', 'two'], ['A', 'B']],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
Can anyone please give me a hint as to what's going on here? I'd like rows I subsequently add to appear in the index. Should I be using the .append function instead? It seems a bit cumbersome and other posts have recommended using the .loc approach above to add rows.
Thanks!
I believe you need to select the column(s) inside DataFrame.loc:
s.loc[('Three', 'A'), 'Values'] = 1
s.loc[('Three', 'B'), 'Values'] = 2
print (s)
            Values
one   A  -0.808372
      B   0.904552
two   A  -0.443619
      B   1.157234
Three A   1.000000
      B   2.000000
print (s.index)
MultiIndex(levels=[['one', 'two', 'Three'], ['A', 'B']],
           labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]])
because your solution adds values to the column (a Series), not to the DataFrame:
s['Values'].loc[('Three', 'A')] = 1
print (s['Values'])
one    A   -0.808372
       B    0.904552
two    A   -0.443619
       B    1.157234
Three  A    1.000000
Name: Values, dtype: float64
print (s)
          Values
one A  -0.808372
    B   0.904552
two A  -0.443619
    B   1.157234
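A minimal sketch of the difference: assigning through the DataFrame's own .loc with a (row-tuple, column) key enlarges the frame, whereas going through the s['Values'] Series only mutates a temporary that never lands back in the frame's index:

```python
import numpy as np
import pandas as pd

arrays = [['one', 'one', 'two', 'two'], ['A', 'B', 'A', 'B']]
mindex = pd.MultiIndex.from_tuples(list(zip(*arrays)))
s = pd.DataFrame(np.random.randn(4), index=mindex, columns=['Values'])

# Row key and column selected inside a single .loc call: the frame is enlarged.
s.loc[('Three', 'A'), 'Values'] = 1
```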

Semantics of DataFrame groupby method

I find the behavior of the groupby method on a DataFrame object unexpected.
Let me explain with an example.
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': np.random.randn(5),
                   'data2': np.random.randn(5)})
data1 = df['data1']
data1
# Out[14]:
# 0 1.989430
# 1 -0.250694
# 2 -0.448550
# 3 0.776318
# 4 -1.843558
# Name: data1, dtype: float64
data1 does not have the 'key1' column anymore.
So I would expect to get an error if I applied the following operation:
grouped = data1.groupby(df['key1'])
But I don't, and I can further apply the mean method on grouped to get the expected result.
grouped.mean()
# Out[13]:
# key1
# a -0.034941
# b 0.163884
# Name: data1, dtype: float64
However, the above operation does create a group using the 'key1' column of df.
How can this happen? Does the interpreter store information of the originating DataFrame (df in this case) with the created DataFrame/series (data1 in this case)?
Thank you.
It is only syntactic sugar; check the docs here, on selecting a column (a Series) for grouping separately:
This is mainly syntactic sugar for the alternative, much more verbose:
s = df['data1'].groupby(df['key1']).mean()
print (s)
key1
a    0.565292
b    0.106360
Name: data1, dtype: float64
Although the grouping columns are typically from the same dataframe or series, they don't have to be.
Your statement data1.groupby(df['key1']) is equivalent to data1.groupby(['a', 'a', 'b', 'b', 'a']). In fact, you can inspect the actual groups:
>>> data1.groupby(['a', 'a', 'b', 'b', 'a']).groups
{'a': [0, 1, 4], 'b': [2, 3]}
This means that your groupby on data1 will have a group a using rows 0, 1, and 4 from data1 and a group b using rows 2 and 3.
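A fixed-value sketch of grouping a Series by an external array (the values are my own so the group means are checkable):

```python
import pandas as pd

data1 = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], name='data1')
keys = ['a', 'a', 'b', 'b', 'a']  # any array-like of matching length works

means = data1.groupby(keys).mean()
# group 'a' averages rows 0, 1, 4; group 'b' averages rows 2, 3
```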

Completely remove one index label from a multiindex, in a dataframe

Given I have this multiindexed dataframe:
>>> import pandas as p
>>> import numpy as np
>>>
>>> arrays = [np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo']),
...           np.array(['one', 'two', 'one', 'two', 'one', 'two'])]
>>>
>>> s = p.Series(np.random.randn(6), index=arrays)
>>> s
>>> s
bar  one   -1.046752
     two    2.035839
baz  one    1.192775
     two    1.774266
foo  one   -1.716643
     two    1.158605
dtype: float64
How should I eliminate the index label bar?
I tried with drop
>>> s1 = s.drop('bar')
>>> s1
baz  one    1.192775
     two    1.774266
foo  one   -1.716643
     two    1.158605
dtype: float64
Seems OK but bar is still there in some bizarre way:
>>> s1.index
MultiIndex(levels=[[u'bar', u'baz', u'foo'], [u'one', u'two']],
           labels=[[1, 1, 2, 2], [0, 1, 0, 1]])
>>> s1['bar']
Series([], dtype: float64)
>>>
How can I get rid of any residue from this index label?
Definitely looks like a bug.
s1.index.tolist() returns the expected value without "bar".
>>> s1.index.tolist()
[('baz', 'one'), ('baz', 'two'), ('foo', 'one'), ('foo', 'two')]
s1["bar"] returns a null Series.
>>> s1["bar"]
Series([], dtype: float64)
The standard methods to override this don't seem to work either:
>>> del s1["bar"]
>>> s1["bar"]
Series([], dtype: float64)
>>> s1.__delitem__("bar")
>>> s1["bar"]
Series([], dtype: float64)
However, as expected, trying to grab a new key invokes a KeyError:
>>> s1["booz"]
... KeyError: 'booz'
The main difference appears when you look at the source code for the two in pandas/core/index.py:
class MultiIndex(Index):
    ...
    def _get_levels(self):
        return self._levels
    ...
    def _get_labels(self):
        return self._labels

    # ops compat
    def tolist(self):
        """
        return a list of the Index values
        """
        return list(self.values)
So, index.tolist() and the _labels aren't accessing the same piece of shared information; in fact, they aren't even close.
So, we can use this to manually update the resulting indexer.
>>> s1.index.labels
FrozenList([[1, 1, 2, 2], [0, 1, 0, 1]])
>>> s1.index._levels
FrozenList([[u'bar', u'baz', u'foo'], [u'one', u'two']])
>>> s1.index.values
array([('baz', 'one'), ('baz', 'two'), ('foo', 'one'), ('foo', 'two')], dtype=object)
If we compare this to the initial multiindexed index, we get
>>> s.index.labels
FrozenList([[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]])
>>> s.index._levels
FrozenList([[u'bar', u'baz', u'foo'], [u'one', u'two']])
So the _levels attribute isn't updated, while values is.
EDIT: Overriding it wasn't as easy as I thought.
EDIT: Wrote a custom function to fix this behavior
from pandas.core.base import FrozenList, FrozenNDArray

def drop(series, level, index_name):
    # make new tmp series
    new_series = series.drop(index_name)
    # grab all indexing labels, levels, attributes
    levels = new_series.index.levels
    labels = new_series.index.labels
    index_pos = levels[level].tolist().index(index_name)
    # now need to reset the actual levels
    level_names = levels[level]
    # has no __delitem__, so... need to remake
    tmp_names = FrozenList([i for i in level_names if i != index_name])
    levels = FrozenList([j if i != level else tmp_names
                         for i, j in enumerate(levels)])
    # need to turn off validation
    new_series.index.set_levels(levels, verify_integrity=False, inplace=True)
    # reset the labels
    level_labels = labels[level].tolist()
    tmp_labels = FrozenNDArray([i - 1 if i > index_pos else i
                                for i in level_labels])
    labels = FrozenList([j if i != level else tmp_labels
                         for i, j in enumerate(labels)])
    new_series.index.set_labels(labels, verify_integrity=False, inplace=True)
    return new_series
Example usage:
>>> s1 = drop(s, 0, "bar")
>>> s1.index
MultiIndex(levels=[[u'baz', u'foo'], [u'one', u'two']],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
>>> s1.index.tolist()
[('baz', 'one'), ('baz', 'two'), ('foo', 'one'), ('foo', 'two')]
>>> s1["bar"]
...
KeyError: 'bar'
EDIT: This seems to be specific to dataframes/series with multiindexing, as the standard pandas.core.index.Index class does not have the same limitations. I would recommend filing a bug report.
Consider the same series with a standard index:
>>> s = p.Series(np.random.randn(6))
>>> s.index
Int64Index([0, 1, 2, 3, 4, 5], dtype='int64')
>>> s.drop(0, inplace=True)
>>> s.index
Int64Index([1, 2, 3, 4, 5], dtype='int64')
The same is true for a dataframe
>>> df = p.DataFrame([np.random.randn(6), np.random.randn(6)])
>>> df.index
Int64Index([0, 1], dtype='int64')
>>> df.drop(0, inplace=True)
>>> df.index
Int64Index([1], dtype='int64')
See long discussion here.
Bottom line, it's not obvious when to recompute the levels, as the operation a user is doing is unknown (think of it from the Index's perspective). For example, say you are dropping, then adding a value to a level (e.g. via indexing). Recomputing on every such operation would be very wasteful and somewhat compute intensive.
In [11]: s1.index
Out[11]:
MultiIndex(levels=[[u'bar', u'baz', u'foo'], [u'one', u'two']],
           labels=[[1, 1, 2, 2], [0, 1, 0, 1]])
Here is the actual index itself.
In [12]: s1.index.values
Out[12]: array([('baz', 'one'), ('baz', 'two'), ('foo', 'one'), ('foo', 'two')], dtype=object)
In [13]: s1.index.get_level_values(0)
Out[13]: Index([u'baz', u'baz', u'foo', u'foo'], dtype='object')
In [14]: s1.index.get_level_values(1)
Out[14]: Index([u'one', u'two', u'one', u'two'], dtype='object')
If you really feel it is necessary to 'get rid' of the removed level, then simply recreate the index. However, it is not harmful at all. These factorizations (e.g. the labels) are hidden from the user (yes, they are displayed, but that is, to be honest, more of a confusing pain point, hence this question).
In [15]: pd.MultiIndex.from_tuples(s1.index.values)
Out[15]:
MultiIndex(levels=[[u'baz', u'foo'], [u'one', u'two']],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
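Later pandas versions added MultiIndex.remove_unused_levels() for exactly this cleanup, which avoids rebuilding from tuples (a sketch, assuming a pandas new enough to have the method):

```python
import numpy as np
import pandas as pd

arrays = [np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo']),
          np.array(['one', 'two', 'one', 'two', 'one', 'two'])]
s = pd.Series(np.random.randn(6), index=arrays)

s1 = s.drop('bar')
# drop() leaves 'bar' behind in the levels; this recomputes them.
s1.index = s1.index.remove_unused_levels()
```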
