Pandas .loc with Tuple column names - python

I am currently working with a pandas DataFrame that uses tuples for column names. When I attempt to use .loc as I would for normal columns, the tuple names cause it to error out.
Test code is below:
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.random.randn(6,4),
                   columns=[('a','1'), ('b','2'), ('c','3'), 'nontuple'])
df1.loc[:3, 'nontuple']
df1.loc[:3, ('c','3')]
The first .loc line works as expected and displays the column 'nontuple' for rows 0 through 3. The second .loc line does not work and instead gives the error:
KeyError: "None of [('c', '3')] are in the [columns]"
Any idea how to resolve this issue short of not using tuples as column names?
Also, I have found that the code below works even though the .loc doesn't:
df1.ix[:3][('c','3')]

Documentation
access by tuple, returns DF:
In [508]: df1.loc[:3, [('c', '3')]]
Out[508]:
(c, 3)
0 1.433004
1 -0.731705
2 -1.633657
3 0.565320
access by non-tuple column, returns series:
In [514]: df1.loc[:3, 'nontuple']
Out[514]:
0 0.783621
1 1.984459
2 -2.211271
3 -0.532457
Name: nontuple, dtype: float64
access by non-tuple column, returns DF:
In [517]: df1.loc[:3, ['nontuple']]
Out[517]:
nontuple
0 0.783621
1 1.984459
2 -2.211271
3 -0.532457
access any column by its number, returns series:
In [515]: df1.iloc[:3, 2]
Out[515]:
0 1.433004
1 -0.731705
2 -1.633657
Name: (c, 3), dtype: float64
access any column(s) by its number, returns DF:
In [516]: df1.iloc[:3, [2]]
Out[516]:
(c, 3)
0 1.433004
1 -0.731705
2 -1.633657
NOTE: pay attention to the differences between .loc[] and .iloc[] - they filter rows differently!
this works like Python's slicing:
In [531]: df1.iloc[0:2]
Out[531]:
(a, 1) (b, 2) (c, 3) nontuple
0 0.650961 -1.130000 1.433004 0.783621
1 0.073805 1.907998 -0.731705 1.984459
this includes the right index boundary:
In [532]: df1.loc[0:2]
Out[532]:
(a, 1) (b, 2) (c, 3) nontuple
0 0.650961 -1.130000 1.433004 0.783621
1 0.073805 1.907998 -0.731705 1.984459
2 -1.511939 0.167122 -1.633657 -2.211271
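In short, .loc appears to interpret a bare tuple as a multi-level (row, column) or MultiIndex key, so wrapping the tuple in a list is the workaround. A minimal sketch of the two options (the list-wrapped .loc call is taken from the examples above; the plain indexing lookup is standard column access and avoids the deprecated .ix):
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.randn(6, 4),
                   columns=[('a', '1'), ('b', '2'), ('c', '3'), 'nontuple'])

# Wrapping the tuple in a list makes .loc treat it as a single column
# label; this returns a one-column DataFrame.
sub = df1.loc[:3, [('c', '3')]]

# Plain indexing also accepts the tuple as a single column label and
# returns a Series.
col = df1[('c', '3')]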

Related

Quirky behavior of pandas.DataFrame.equals

I have noticed a quirky thing. Let's say A and B are DataFrames.
A is:
A
a b c
0 x 1 a
1 y 2 b
2 z 3 c
3 w 4 d
B is:
B
a b c
0 1 x a
1 2 y b
2 3 z c
3 4 w d
As we can see above, the elements under column a in A and B are different, yet A.equals(B) yields True.
A==B correctly shows that the elements are not equal:
A==B
a b c
0 False False True
1 False False True
2 False False True
3 False False True
Question: Can someone please explain why .equals() yields True? I researched this topic on SO, and as per the contract of pandas.DataFrame.equals, pandas should return False. I am a beginner, so I'd appreciate any help.
Here are the JSON representations and ._data of A and B:
A
A.to_json()
Out[114]: '{"a":{"0":"x","1":"y","2":"z","3":"w"},"b":{"0":1,"1":2,"2":3,"3":4},"c":{"0":"a","1":"b","2":"c","3":"d"}}'
and A._data is
BlockManager
Items: Index(['a', 'b', 'c'], dtype='object')
Axis 1: RangeIndex(start=0, stop=4, step=1)
IntBlock: slice(1, 2, 1), 1 x 4, dtype: int64
ObjectBlock: slice(0, 4, 2), 2 x 4, dtype: object
B
B's json format:
B.to_json()
'{"a":{"0":1,"1":2,"2":3,"3":4},"b":{"0":"x","1":"y","2":"z","3":"w"},"c":{"0":"a","1":"b","2":"c","3":"d"}}'
B._data
BlockManager
Items: Index(['a', 'b', 'c'], dtype='object')
Axis 1: RangeIndex(start=0, stop=4, step=1)
IntBlock: slice(0, 1, 1), 1 x 4, dtype: int64
ObjectBlock: slice(1, 3, 1), 2 x 4, dtype: object
As an alternative to sacul's and U9-Forward's answers, I've done some further analysis, and it looks like the reason you are seeing True rather than the False you expected has more to do with this line of the docs:
This function requires that the elements have the same dtype as their respective elements in the other Series or DataFrame.
With the above dataframes, plus two more for comparison (C with the same shape as B but different elements, and D with the same shape and elements as A but with per-column dtypes that match A's), this is what .equals() returns:
>>> A.equals(B)
Out: True
>>> B.equals(C)
Out: False
These two align with what the other answers are saying: A and B have the same shape and the same elements, so they are considered equal, while B and C have the same shape but different elements, so they are not.
On the other hand:
>>> A.equals(D)
Out: False
Here A and D have the same shape and the same elements, yet equals returns False. The difference between this case and the one above is that all of the dtypes in the comparison match up, as the docs quote above requires: A and D both have the dtypes str, int, str.
As in the answer you linked in your question, essentially the behaviour of pandas.DataFrame.equals mimics numpy.array_equal.
The docs for np.array_equal state that it returns:
True if two arrays have the same shape and elements, False otherwise.
Which your two dataframes satisfy.
From the docs:
Determines if two NDFrame objects contain the same elements. NaNs in the same location are considered equal.
Determines if two NDFrame objects contain the same elements!!!
ELEMENTS, not COLUMNS.
So that's why it returns True.
If you want it to return False and compare the columns element-wise, do:
print((A==B).all().all())
Output:
False
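For reference, a minimal reproduction of A and B as printed above (a sketch; the comment on block grouping reflects the behaviour on the pandas versions discussed here, and later releases changed it):
import pandas as pd

A = pd.DataFrame({'a': ['x', 'y', 'z', 'w'],
                  'b': [1, 2, 3, 4],
                  'c': ['a', 'b', 'c', 'd']})
B = pd.DataFrame({'a': [1, 2, 3, 4],
                  'b': ['x', 'y', 'z', 'w'],
                  'c': ['a', 'b', 'c', 'd']})

print(A.equals(B))           # True: internal blocks are compared after grouping by dtype
print((A == B).all().all())  # False: element-wise, position-by-position comparison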

Rolling over multiple columns returning one result in Pandas

I'm stuck rolling a window over multiple columns in pandas. What I have is:
df = pd.DataFrame({'A':[1,2,3,4],'B':[5,6,7,8]})
def test(ts):
    print(ts.shape)

df.rolling(2).apply(test)
However, the problem is that ts.shape prints (2,) and I wanted it to print (2, 2), that is, to include the whole window of both rows and columns.
What is wrong with my intuition about how rolling works, and how can I get the result I'm after using pandas?
You can use a little hack: get the number of numeric columns with select_dtypes and use that scalar value. (rolling applies the function to each column separately as a 1-D Series, which is why ts.shape is (2,).)
import numpy as np
import pandas as pd

df = pd.DataFrame({'A':[1,2,3,4],'B':[5,6,7,8], 'C':list('abcd')})
print (df)
A B C
0 1 5 a
1 2 6 b
2 3 7 c
3 4 8 d
cols = len(df.select_dtypes(include=[np.number]).columns)
print (cols)
2
def test(ts):
    print(tuple((ts.shape[0], cols)))
    return ts.sum()

df = df.rolling(2).apply(test)
(2, 2)
(2, 2)
(2, 2)
(2, 2)
(2, 2)
(2, 2)
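As an aside, not part of the original answer: newer pandas can hand the function the whole window. A sketch assuming pandas >= 1.3 with numba installed, since method='table' requires engine='numba' and raw=True:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8]})

def window_sum(x):
    # x is the full 2-D window (rows x columns) as a numpy array
    return x.sum(axis=0)  # one result per column

out = df.rolling(2, method='table').apply(window_sum, raw=True, engine='numba')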

Python Pandas: Inconsistent behaviour of boolean indexing on a Series using the len() method

I have a Series of strings and I need to apply boolean indexing using len() on it.
In one case it works, in another case it does not:
The working case is a groupby on a dataframe, followed by a unique() on the resulting Series and an apply(str) to change the resulting numpy.ndarray entries into strings:
import pandas as pd
df = pd.DataFrame({'A':['a','a','a','a','b','b','b','b'],'B':[1,2,2,3,4,5,4,4]})
dg = df.groupby('A')['B'].unique().apply(str)
db = dg[len(dg) > 2]
This just works fine and yields the desired result:
>>db
Out[119]: '[1 2 3]'
The following however throws KeyError: True:
ss = pd.Series(['a','b','cc','dd','eeee','ff','ggg'])
ls = ss[len(ss) > 2]
Both objects dg and ss are just Series of strings:
>>type(dg)
Out[113]: pandas.core.series.Series
>>type(ss)
Out[114]: pandas.core.series.Series
>>type(dg['a'])
Out[115]: str
>>type(ss[0])
Out[116]: str
I'm following the syntax as described in the docs: http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing
I can see a potential conflict because len(ss) on its own returns the length of the Series itself and now that exact command is used for boolean indexing ss[len(ss) > 2], but then I'd expect neither of the two examples to work.
Right now this behaviour seems inconsistent, unless I'm missing something obvious.
I think you need str.len, because you need the length of each value of the Series:
ss = pd.Series(['a','b','cc','dd','eeee','ff','ggg'])
print (ss.str.len())
0 1
1 1
2 2
3 2
4 4
5 2
6 3
dtype: int64
print (ss.str.len() > 2)
0 False
1 False
2 False
3 False
4 True
5 False
6 True
dtype: bool
ls = ss[ss.str.len() > 2]
print (ls)
4 eeee
6 ggg
dtype: object
If you use len, you get the length of the whole Series:
print (len(ss))
7
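That also explains why the first snippet only appeared to work: len(dg) > 2 is the single boolean False, and on the pandas version in the question a plain False key on a string-indexed Series seems to fall back to positional lookup of element 0 (my reading of the behaviour, not a documented API):
dg = df.groupby('A')['B'].unique().apply(str)  # from the question
print(len(dg) > 2)  # False - dg has only two elements, one per group
print(dg[False])    # falls back to position 0 -> '[1 2 3]'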
Another solution is to apply len:
ss = pd.Series(['a','b','cc','dd','eeee','ff','ggg'])
ls = ss[ss.apply(len) > 2]
print (ls)
4 eeee
6 ggg
dtype: object
Your first script is also wrong; you need apply(len) there too:
df = pd.DataFrame({'A':['a','a','a','a','b','b','b','b'],'B':[1,2,2,2,4,5,4,6]})
dg = df.groupby('A')['B'].unique()
print (dg)
A
a [1, 2]
b [4, 5, 6]
Name: B, dtype: object
db = dg[dg.apply(len) > 2]
print (db)
A
b [4, 5, 6]
Name: B, dtype: object
If you cast the arrays to str, you get a different length (the length of the data plus the brackets plus the whitespace):
dg = df.groupby('A')['B'].unique().apply(str)
print (dg)
A
a [1 2]
b [4 5 6]
Name: B, dtype: object
print (dg.apply(len))
A
a 5
b 7
Name: B, dtype: int64

Pandas create multiple aggregations

Trying to see how hard or easy this is to do with Pandas.
Let's say one has two columns with data such as:
Cat1 Cat2
A 1
A 2
A 3
B 1
B 2
C 1
C 2
C 3
D 4
As you can see, A and C have three common elements: 1, 2, 3. B, however, has only two elements, 1 and 2. D has only one element: 4.
How would one programmatically get to this result? The idea is to have each group returned somehow: so one result will be [A, C] with [1, 2, 3], then [B] with [1, 2], and [D] with [4].
I know a program can be written to do this so I am trying to figure out if there is something on Pandas to do it without having to build stuff from scratch.
Thanks!
You can use groupby twice to achieve this.
df = df.groupby('Cat1')['Cat2'].apply(lambda x: tuple(set(x))).reset_index()
df = df.groupby('Cat2')['Cat1'].apply(lambda x: tuple(set(x))).reset_index()
I'm using tuple because pandas needs elements to be hashable in order to do a groupby. The code above doesn't distinguish between (1, 2, 3) and (1, 1, 2, 3). If you want to make this distinction, replace set with sorted (see the sketch after the output below).
The resulting output:
Cat2 Cat1
0 (1, 2) (B,)
1 (1, 2, 3) (A, C)
2 (4,) (D,)
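A sketch of the sorted variant mentioned above, which distinguishes (1, 2, 3) from (1, 1, 2, 3) by keeping duplicates in order:
df = df.groupby('Cat1')['Cat2'].apply(lambda x: tuple(sorted(x))).reset_index()
df = df.groupby('Cat2')['Cat1'].apply(lambda x: tuple(sorted(x))).reset_index()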
You could also:
df = df.set_index('Cat1', append=True).unstack().loc[:, 'Cat2']
df = pd.Series({col: tuple(values.dropna()) for col, values in df.items()})
df = df.groupby(df.values).apply(lambda x: list(x.index))
to get
Cat1
(1.0, 2.0) [B]
(1.0, 2.0, 3.0) [A, C]
(4.0,) [D]

pandas: slice a MultiIndex by range of secondary index

I have a series with a MultiIndex like this:
import numpy as np
import pandas as pd
buckets = np.repeat(['a','b','c'], [3,5,1])
sequence = [0,1,5,0,1,2,4,50,0]
s = pd.Series(
    np.random.randn(len(sequence)),
    index=pd.MultiIndex.from_tuples(list(zip(buckets, sequence)))
)
# In [6]: s
# Out[6]:
# a 0 -1.106047
# 1 1.665214
# 5 0.279190
# b 0 0.326364
# 1 0.900439
# 2 -0.653940
# 4 0.082270
# 50 -0.255482
# c 0 -0.091730
I'd like to get the s['b'] values where the second index ('sequence') is between 2 and 10.
Slicing on the first index works fine:
s['a':'b']
# Out[109]:
# bucket value
# a 0 1.828176
# 1 0.160496
# 5 0.401985
# b 0 -1.514268
# 1 -0.973915
# 2 1.285553
# 4 -0.194625
# 5 -0.144112
But not on the second, at least by what seems to be the two most obvious ways:
1) This returns elements at positions 1 through 4, which has nothing to do with the index values:
# In [61]: s['b'][1:10]
# Out[61]:
# 1 0.900439
# 2 -0.653940
# 4 0.082270
# 50 -0.255482
2) However, if I reverse the index so that the first level is the integer and the second is the string, it works:
In [26]: s
Out[26]:
0 a -0.126299
1 a 1.810928
5 a 0.571873
0 b -0.116108
1 b -0.712184
2 b -1.771264
4 b 0.148961
50 b 0.089683
0 c -0.582578
In [25]: s[0]['a':'b']
Out[25]:
a -0.126299
b -0.116108
As Robbie-Clarken answers, since 0.14 you can pass a slice in the tuple you pass to loc:
In [11]: s.loc[('b', slice(2, 10))]
Out[11]:
b 2 -0.65394
4 0.08227
dtype: float64
Indeed, you can pass a slice for each level:
In [12]: s.loc[(slice('a', 'b'), slice(2, 10))]
Out[12]:
a 5 0.27919
b 2 -0.65394
4 0.08227
dtype: float64
Note: the slice is inclusive.
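An equivalent spelling (a side note, not from the original answer) uses pd.IndexSlice, which builds the same per-level slices with ordinary slicing syntax; sort the index first if pandas complains about lexsort depth:
idx = pd.IndexSlice
s = s.sort_index()  # range slicing on a MultiIndex needs a lexsorted index
s.loc[idx['a':'b', 2:10]]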
Old answer:
You can also do this using:
s.ix[1:10, "b"]
(It's good practice to do this in a single ix/loc/iloc call, since this form allows assignment.)
This answer was written prior to the introduction of iloc in early 2013, i.e. position/integer location - which may be preferred in this case. The reason it was created was to remove the ambiguity from integer-indexed pandas objects, and be more descriptive: "I'm slicing on position".
s["b"].iloc[1:10]
That said, I kinda disagree with the docs that ix is the
"most robust and consistent way"
It's not; the most consistent way is to describe what you're doing:
use loc for labels
use iloc for position
use ix for both (if you really have to)
Remember the Zen of Python:
explicit is better than implicit
Since pandas 0.15.0 this works:
s.loc['b', 2:10]
Output:
b 2 -0.503023
4 0.704880
dtype: float64
With a DataFrame it's slightly different (source):
df.loc(axis=0)['b', 2:10]
As of pandas 0.14.0 it is possible to slice multi-indexed objects by providing .loc a tuple containing slice objects:
In [2]: s.loc[('b', slice(2, 10))]
Out[2]:
b 2 -1.206052
4 -0.735682
dtype: float64
The best way I can think of is to use select in this case, although even the docs say that "This method should be used only when there is no more direct way."
Indexing and selecting data
In [116]: s
Out[116]:
a 0 1.724372
1 0.305923
5 1.780811
b 0 -0.556650
1 0.207783
4 -0.177901
50 0.289365
c 0 1.168115
In [117]: s.select(lambda x: x[0] == 'b' and 2 <= x[1] <= 10)
Out[117]: b 4 -0.177901
Not sure if this is ideal, but it works by selecting the matching index tuples with a generator expression:
In [59]: s.index
Out[59]:
MultiIndex
[('a', 0) ('a', 1) ('a', 5) ('b', 0) ('b', 1) ('b', 2) ('b', 4)
('b', 50) ('c', 0)]
In [77]: s[(tpl for tpl in s.index if 2<=tpl[1]<=10 and tpl[0]=='b')]
Out[77]:
b 2 -0.586568
4 1.559988
EDIT: hayden's solution is the way to go.
