Quirky behavior of pandas.DataFrame.equals - python

I have noticed a quirky thing. Let's say A and B are dataframes.
A is:
A
   a  b  c
0  x  1  a
1  y  2  b
2  z  3  c
3  w  4  d
B is:
B
   a  b  c
0  1  x  a
1  2  y  b
2  3  z  c
3  4  w  d
As we can see above, the data under columns a and b is swapped between A and B, so the elements under column a differ, yet A.equals(B) yields True.
A==B correctly shows that the elements are not equal:
A==B
       a      b     c
0  False  False  True
1  False  False  True
2  False  False  True
3  False  False  True
Question: can someone please explain why .equals() yields True? I researched this topic on SO, and per the contract of pandas.DataFrame.equals, pandas should return False. I am a beginner, so I'd appreciate any help.
Here are the JSON representations and ._data of A and B.
A
`A.to_json()`
Out[114]: '{"a":{"0":"x","1":"y","2":"z","3":"w"},"b":{"0":1,"1":2,"2":3,"3":4},"c":{"0":"a","1":"b","2":"c","3":"d"}}'
and A._data is
BlockManager
Items: Index(['a', 'b', 'c'], dtype='object')
Axis 1: RangeIndex(start=0, stop=4, step=1)
IntBlock: slice(1, 2, 1), 1 x 4, dtype: int64
ObjectBlock: slice(0, 4, 2), 2 x 4, dtype: object
B
B's json format:
B.to_json()
'{"a":{"0":1,"1":2,"2":3,"3":4},"b":{"0":"x","1":"y","2":"z","3":"w"},"c":{"0":"a","1":"b","2":"c","3":"d"}}'
B._data
BlockManager
Items: Index(['a', 'b', 'c'], dtype='object')
Axis 1: RangeIndex(start=0, stop=4, step=1)
IntBlock: slice(0, 1, 1), 1 x 4, dtype: int64
ObjectBlock: slice(1, 3, 1), 2 x 4, dtype: object

As an alternative to sacul and U9-Forward's answers: I've done some further analysis, and it looks like the reason you are seeing True rather than the False you expected has more to do with this line of the docs:
This function requires that the elements have the same dtype as their respective elements in the other Series or DataFrame.
With the above dataframes, plus two further test frames, C (the same shape as B but with different elements) and D (the same shape and elements as A but with different dtypes), df.equals() returns:
>>> A.equals(B)
Out: True
>>> B.equals(C)
Out: False
These two results align with what the other answers say: A and B have the same shape and elements, so they compare equal, while B and C have the same shape but different elements, so they don't.
On the other hand:
>>> A.equals(D)
Out: False
Here A and D have the same shape and the same elements, yet equals returns False. The difference between this case and the first one is that not all of the dtypes in the comparison match up, which is what the docs quote above is getting at: A and B both contain the dtypes str, int, str (just assigned to different columns), whereas D's dtypes differ from A's.

As in the answer you linked in your question, essentially the behaviour of pandas.DataFrame.equals mimics numpy.array_equal.
The docs for np.array_equal state that it returns:
True if two arrays have the same shape and elements, False otherwise.
which your two dataframes satisfy.
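np.array_equal's notion of "same shape and elements" can be checked directly; a minimal sketch:

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([1, 2, 3])

# Same shape and same elements -> True
print(np.array_equal(a, b))                 # True

# Different shapes -> False
print(np.array_equal(a, a[:2]))             # False

# NaN != NaN, so arrays containing NaN never compare equal by default
print(np.array_equal([np.nan], [np.nan]))   # False
```

The NaN case is why DataFrame.equals only mimics, rather than delegates to, array_equal: equals explicitly treats NaNs in the same location as equal.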

From the docs:
Determines if two NDFrame objects contain the same elements. NaNs in the same location are considered equal.
Determines if two NDFrame objects contain the same elements!!!
ELEMENTS, not COLUMNS.
So that's why it returns True.
If you want a check that returns False here and respects the column labels, do:
print((A==B).all().all())
Output:
False
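For reproducibility, here is a sketch that constructs A and B as shown in the question and runs both comparisons. Note that the True result from A.equals(B) is what the question's pandas version produced; newer pandas releases compare column by column, so equals may return False there:

```python
import pandas as pd

# Columns a and b hold swapped data in the two frames.
A = pd.DataFrame({'a': list('xyzw'), 'b': [1, 2, 3, 4], 'c': list('abcd')})
B = pd.DataFrame({'a': [1, 2, 3, 4], 'b': list('xyzw'), 'c': list('abcd')})

# Element-wise comparison respects column labels: a and b differ, c matches.
print(A == B)

# Reduce to a single boolean: False, as expected.
print((A == B).all().all())     # False

# The quirk from the question; the result depends on the pandas version.
print(A.equals(B))
```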

Related

Python language construction when filtering the array

I can see many questions on SO about how to filter an array (usually using pandas or numpy). The examples are very simple:
df = pd.DataFrame({ 'val': [1, 2, 3, 4, 5, 6, 7] })
a = df[df.val > 3]
Intuitively I understand the df[df.val > 3] statement. But it confuses me from the syntax point of view. In other languages, I would expect a lambda function instead of df.val > 3.
The question is: why is this style possible, and what is going on under the hood?
Update 1:
To be more clear about the confusing part: in other languages I have seen syntax like:
df.filter(row => row.val > 3)
I understand here what is going on under the hood: For every iteration, the lambda function is called with the row as an argument and returns a boolean value. But here df.val > 3 doesn't make sense to me because df.val looks like a column.
Moreover, I can write df[df > 3] and it executes successfully, which baffles me, because I don't understand how a whole DataFrame object can be compared to a number.
Create an array and dataframe from it:
In [104]: arr = np.arange(1,8); df = pd.DataFrame({'val':arr})
In [105]: arr
Out[105]: array([1, 2, 3, 4, 5, 6, 7])
In [106]: df
Out[106]:
val
0 1
1 2
2 3
3 4
4 5
5 6
6 7
numpy arrays have methods and operators that operate on the whole array. For example you can multiply the array by a scalar, or add a scalar to all elements. Or in this case compare each element to a scalar. That's all implemented by the class (numpy.ndarray), not by Python syntax.
In [107]: arr>3
Out[107]: array([False, False, False, True, True, True, True])
Similarly, pandas implements these methods (or uses the numpy methods on the underlying arrays). Select a column of the frame with `df['val']` or:
In [108]: df.val
Out[108]:
0 1
1 2
2 3
3 4
4 5
5 6
6 7
Name: val, dtype: int32
This is a pandas Series (slight difference in display).
It can be compared to a scalar - as with the array:
In [110]: df.val>3
Out[110]:
0 False
1 False
2 False
3 True
4 True
5 True
6 True
Name: val, dtype: bool
And the boolean array can be used to index the frame:
In [111]: df[arr>3]
Out[111]:
val
3 4
4 5
5 6
6 7
The boolean Series also works:
In [112]: df[df.val>3]
Out[112]:
val
3 4
4 5
5 6
6 7
Boolean array indexing works the same way:
In [113]: arr[arr>3]
Out[113]: array([4, 5, 6, 7])
Here I use the indexing to fetch values; setting values is analogous.
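The mechanism can be sketched with a toy class (hypothetical names, not pandas internals): a comparison operator returns a boolean mask, and __getitem__ accepts that mask. No lambda is needed because Python evaluates s > 3 first, and only then passes the resulting mask to s[...]:

```python
class ToySeries:
    """Toy sketch of the operator overloading behind df[df.val > 3].

    Not pandas internals; just the same idea in miniature.
    """

    def __init__(self, values):
        self.values = list(values)

    def __gt__(self, scalar):
        # 's > 3' builds a boolean mask, one entry per element.
        return [v > scalar for v in self.values]

    def __getitem__(self, mask):
        # 's[mask]' keeps only the elements where the mask is True.
        return ToySeries(v for v, keep in zip(self.values, mask) if keep)


s = ToySeries([1, 2, 3, 4, 5, 6, 7])
print(s > 3)            # [False, False, False, True, True, True, True]
print(s[s > 3].values)  # [4, 5, 6, 7]
```

The same two hooks (plus __and__, __or__, etc. for combined conditions) are what numpy and pandas implement on their array classes.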

Python Pandas: Inconsistent behaviour of boolean indexing on a Series using the len() method

I have a Series of strings and I need to apply boolean indexing using len() on it.
In one case it works, in another case it does not:
The working case is a groupby on a dataframe, followed by unique() on the resulting Series and an apply(str) to turn the resulting numpy.ndarray entries into strings:
import pandas as pd
df = pd.DataFrame({'A':['a','a','a','a','b','b','b','b'],'B':[1,2,2,3,4,5,4,4]})
dg = df.groupby('A')['B'].unique().apply(str)
db = dg[len(dg) > 2]
This just works fine and yields the desired result:
>>db
Out[119]: '[1 2 3]'
The following however throws KeyError: True:
ss = pd.Series(['a','b','cc','dd','eeee','ff','ggg'])
ls = ss[len(ss) > 2]
Both objects dg and ss are just Series of strings:
>>type(dg)
Out[113]: pandas.core.series.Series
>>type(ss)
Out[114]: pandas.core.series.Series
>>type(dg['a'])
Out[115]: str
>>type(ss[0])
Out[116]: str
I'm following the syntax as described in the docs: http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing
I can see a potential conflict because len(ss) on its own returns the length of the Series itself and now that exact command is used for boolean indexing ss[len(ss) > 2], but then I'd expect neither of the two examples to work.
Right now this behaviour seems inconsistent, unless I'm missing something obvious.
I think you need str.len, because you need the length of each value of the Series:
ss = pd.Series(['a','b','cc','dd','eeee','ff','ggg'])
print (ss.str.len())
0 1
1 1
2 2
3 2
4 4
5 2
6 3
dtype: int64
print (ss.str.len() > 2)
0 False
1 False
2 False
3 False
4 True
5 False
6 True
dtype: bool
ls = ss[ss.str.len() > 2]
print (ls)
4 eeee
6 ggg
dtype: object
If you use len, you get the length of the Series itself:
print (len(ss))
7
Another solution is to apply len:
ss = pd.Series(['a','b','cc','dd','eeee','ff','ggg'])
ls = ss[ss.apply(len) > 2]
print (ls)
4 eeee
6 ggg
dtype: object
The first script is actually wrong; you need apply(len) there too. It only appears to work: len(dg) is 2, so dg[len(dg) > 2] is dg[False], and since dg has a string index, the boolean scalar falls back to positional indexing and silently returns element 0, the string '[1 2 3]'. With ss, len(ss) is 7, so ss[len(ss) > 2] is ss[True], and because ss has an integer index, True is looked up as a label and raises KeyError: True. The corrected version:
df = pd.DataFrame({'A':['a','a','a','a','b','b','b','b'],'B':[1,2,2,2,4,5,4,6]})
dg = df.groupby('A')['B'].unique()
print (dg)
A
a [1, 2]
b [4, 5, 6]
Name: B, dtype: object
db = dg[dg.apply(len) > 2]
print (db)
A
b [4, 5, 6]
Name: B, dtype: object
If you cast the arrays to str, you get a different len (the length of the digits plus the brackets plus the whitespace):
dg = df.groupby('A')['B'].unique().apply(str)
print (dg)
A
a [1 2]
b [4 5 6]
Name: B, dtype: object
print (dg.apply(len))
A
a 5
b 7
Name: B, dtype: int64

contract of pandas.DataFrame.equals

I have a simple test case of a function which returns a df that can potentially contain NaN. I was testing if the output and expected output were equal.
>>> output
Out[1]:
r t ts tt ttct
0 2048 30 0 90 1
1 4096 90 1 30 1
2 0 70 2 65 1
[3 rows x 5 columns]
>>> expected
Out[2]:
r t ts tt ttct
0 2048 30 0 90 1
1 4096 90 1 30 1
2 0 70 2 65 1
[3 rows x 5 columns]
>>> output == expected
Out[3]:
r t ts tt ttct
0 True True True True True
1 True True True True True
2 True True True True True
However, I can't simply rely on the == operator because of NaNs. I was under the impression that the appropriate way to resolve this was by using the equals method. From the documentation:
pandas.DataFrame.equals
DataFrame.equals(other)
Determines if two NDFrame objects contain the same elements. NaNs in the same location are considered equal.
Nonetheless:
>>> expected.equals(output)
Out[4]: False
A little digging around reveals the difference in the frames:
>>> output._data
Out[5]:
BlockManager
Items: Index([u'r', u't', u'ts', u'tt', u'ttct'], dtype='object')
Axis 1: Int64Index([0, 1, 2], dtype='int64')
FloatBlock: [r], 1 x 3, dtype: float64
IntBlock: [t, ts, tt, ttct], 4 x 3, dtype: int64
>>> expected._data
Out[6]:
BlockManager
Items: Index([u'r', u't', u'ts', u'tt', u'ttct'], dtype='object')
Axis 1: Int64Index([0, 1, 2], dtype='int64')
IntBlock: [r, t, ts, tt, ttct], 5 x 3, dtype: int64
Force that output float block to int, or force the expected int block to float, and the test passes.
Obviously, there are different senses of equality, and the sort of test that DataFrame.equals performs could be useful in some cases. Nonetheless, the disparity between == and DataFrame.equals is frustrating to me and seems like an inconsistency. In pseudo-code, I would expect its behavior to match:
(self.index == other.index).all() \
and (self.columns == other.columns).all() \
and (self.values.fillna(SOME_MAGICAL_VALUE) == other.values.fillna(SOME_MAGICAL_VALUE)).all().all()
However, it doesn't. Am I wrong in my thinking, or is this an inconsistency in the Pandas API? Moreover, what IS the test I should be performing for my purposes, given the possible presence of NaN?
.equals() does just what it says: it tests for exact equality among elements, positioning of NaNs (and NaTs), dtype equality, and index equality. Think of this as a df is df2 type of test, except they don't have to actually be the same object; IOW, df.equals(df.copy()) IS always True.
Your example fails because the different dtypes are not equal (though they may be equivalent). So you can use com.array_equivalent for this, or (df == df2).all().all() if you don't have NaNs.
This is a replacement for np.array_equal which is broken for nan positional detections (and object dtypes).
It is mostly used internally. That said, if you'd like an enhancement for equivalence (e.g. the elements are equivalent in the == sense and NaN positions match), please open an issue on GitHub (and even better, submit a PR!).
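As a practical follow-up sketch: pandas.testing.assert_frame_equal has a check_dtype flag, which gives exactly the "NaN-safe, dtype-insensitive" test the question asks for (the frames below are stand-ins, not the question's data):

```python
import numpy as np
import pandas as pd
from pandas.testing import assert_frame_equal

# 't' is int64 in one frame and float64 in the other; 'r' has matching NaNs.
output = pd.DataFrame({'r': [2048.0, 4096.0, np.nan], 't': [30, 90, 70]})
expected = pd.DataFrame({'r': [2048.0, 4096.0, np.nan], 't': [30.0, 90.0, 70.0]})

# Strict: False, because the dtypes of 't' differ.
print(output.equals(expected))          # False

# NaN-safe and dtype-insensitive: raises AssertionError only on real mismatch.
assert_frame_equal(output, expected, check_dtype=False)

# Manual equivalent: equal where ==, or NaN in both places.
equiv = ((output == expected) | (output.isna() & expected.isna())).all().all()
print(equiv)                            # True
```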
I used a workaround, digging into the MagicMock instance:
assert mock_instance.call_count == 1
call_kwargs = mock_instance.call_args[1]
pd.testing.assert_frame_equal(call_kwargs['dataframe'], pd.DataFrame())

Comparing pandas.Series for equality when they are in different orders

Pandas automatically aligns data indices of Series objects before applying the binary operators such as addition and subtraction, but this is not done when checking for equality. Why is this, and how do I overcome it?
Consider the following example:
In [15]: x = pd.Series(index=["A", "B", "C"], data=[1,2,3])
In [16]: y = pd.Series(index=["C", "B", "A"], data=[3,2,1])
In [17]: x
Out[17]:
A 1
B 2
C 3
dtype: int64
In [18]: y
Out[18]:
C 3
B 2
A 1
dtype: int64
In [19]: x==y
Out[19]:
A False
B True
C False
dtype: bool
In [20]: x-y
Out[20]:
A 0
B 0
C 0
dtype: int64
I am using pandas 0.12.0.
You can overcome it with:
In [5]: x == y.reindex(x.index)
Out[5]:
A True
B True
C True
dtype: bool
or
In [6]: x.sort_index() == y.sort_index()
Out[6]:
A True
B True
C True
dtype: bool
The 'why' is explained here: https://github.com/pydata/pandas/issues/1134#issuecomment-5347816
Update: there is an issue that discusses this (https://github.com/pydata/pandas/issues/1134) and a PR to fix it (https://github.com/pydata/pandas/pull/6860).
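Both workarounds in a runnable sketch (note that later pandas versions raise ValueError for x == y when the indexes are not identically ordered, rather than comparing positionally, so explicit alignment is the safe route):

```python
import pandas as pd

x = pd.Series(index=["A", "B", "C"], data=[1, 2, 3])
y = pd.Series(index=["C", "B", "A"], data=[3, 2, 1])

# Align y to x's index, then compare label-by-label.
print((x == y.reindex(x.index)).all())           # True

# Or put both into the same (sorted) order first.
print((x.sort_index() == y.sort_index()).all())  # True
```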

pandas: slice a MultiIndex by range of secondary index

I have a series with a MultiIndex like this:
import numpy as np
import pandas as pd
buckets = np.repeat(['a','b','c'], [3,5,1])
sequence = [0,1,5,0,1,2,4,50,0]
s = pd.Series(
    np.random.randn(len(sequence)),
    index=pd.MultiIndex.from_tuples(list(zip(buckets, sequence)))
)
# In [6]: s
# Out[6]:
# a 0 -1.106047
# 1 1.665214
# 5 0.279190
# b 0 0.326364
# 1 0.900439
# 2 -0.653940
# 4 0.082270
# 50 -0.255482
# c 0 -0.091730
I'd like to get the s['b'] values where the second index ('sequence') is between 2 and 10.
Slicing on the first index works fine:
s['a':'b']
# Out[109]:
# bucket value
# a 0 1.828176
# 1 0.160496
# 5 0.401985
# b 0 -1.514268
# 1 -0.973915
# 2 1.285553
# 4 -0.194625
# 5 -0.144112
But not on the second, at least not via the two most obvious approaches:
1) This slices by position, returning elements 1 through 4 regardless of the index values:
s['b'][1:10]
# In [61]: s['b'][1:10]
# Out[61]:
# 1 0.900439
# 2 -0.653940
# 4 0.082270
# 50 -0.255482
However, if I reverse the levels, so that the first index is the integer and the second is the string, it works:
In [26]: s
Out[26]:
0 a -0.126299
1 a 1.810928
5 a 0.571873
0 b -0.116108
1 b -0.712184
2 b -1.771264
4 b 0.148961
50 b 0.089683
0 c -0.582578
In [25]: s[0]['a':'b']
Out[25]:
a -0.126299
b -0.116108
As Robbie-Clarken answers, since 0.14 you can pass a slice in the tuple you pass to loc:
In [11]: s.loc[('b', slice(2, 10))]
Out[11]:
b 2 -0.65394
4 0.08227
dtype: float64
Indeed, you can pass a slice for each level:
In [12]: s.loc[(slice('a', 'b'), slice(2, 10))]
Out[12]:
a 5 0.27919
b 2 -0.65394
4 0.08227
dtype: float64
Note: the slice is inclusive.
Old answer:
You can also do this using:
s.ix[1:10, "b"]
(It's good practice to do this in a single ix/loc/iloc call, since that allows assignment.)
This answer was written prior to the introduction of iloc in early 2013, i.e. position/integer location - which may be preferred in this case. The reason it was created was to remove the ambiguity from integer-indexed pandas objects, and be more descriptive: "I'm slicing on position".
s["b"].iloc[1:10]
That said, I kinda disagree with the docs that ix is:
most robust and consistent way
it's not, the most consistent way is to describe what you're doing:
use loc for labels
use iloc for position
use ix for both (if you really have to)
Remember the zen of python:
explicit is better than implicit
Since pandas 0.15.0 this works:
s.loc['b', 2:10]
Output:
b 2 -0.503023
4 0.704880
dtype: float64
With a DataFrame it's slightly different (source):
df.loc(axis=0)['b', 2:10]
As of pandas 0.14.0 it is possible to slice multi-indexed objects by providing .loc a tuple containing slice objects:
In [2]: s.loc[('b', slice(2, 10))]
Out[2]:
b 2 -1.206052
4 -0.735682
dtype: float64
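On current pandas the same slice can also be written with pd.IndexSlice, which reads more naturally than nested slice objects. A sketch with deterministic data (np.arange instead of the question's random values) so the result is checkable; note the second-level slice requires a lexsorted MultiIndex, which this one is:

```python
import numpy as np
import pandas as pd

buckets = np.repeat(['a', 'b', 'c'], [3, 5, 1])
sequence = [0, 1, 5, 0, 1, 2, 4, 50, 0]
s = pd.Series(
    np.arange(len(sequence)),
    index=pd.MultiIndex.from_tuples(list(zip(buckets, sequence)))
)

idx = pd.IndexSlice
res = s.loc[idx['b', 2:10]]    # bucket 'b', second level between 2 and 10
print(res)                     # the entries at ('b', 2) and ('b', 4)
```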
The best way I can think of is to use 'select' in this case. Although it even says in the docs that "This method should be used only when there is no more direct way."
Indexing and selecting data
In [116]: s
Out[116]:
a 0 1.724372
1 0.305923
5 1.780811
b 0 -0.556650
1 0.207783
4 -0.177901
50 0.289365
0 1.168115
In [117]: s.select(lambda x: x[0] == 'b' and 2 <= x[1] <= 10)
Out[117]: b 4 -0.177901
Not sure if this is ideal, but it works by creating a mask:
In [59]: s.index
Out[59]:
MultiIndex
[('a', 0) ('a', 1) ('a', 5) ('b', 0) ('b', 1) ('b', 2) ('b', 4)
('b', 50) ('c', 0)]
In [77]: s[(tpl for tpl in s.index if 2<=tpl[1]<=10 and tpl[0]=='b')]
Out[77]:
b 2 -0.586568
4 1.559988
EDIT: hayden's solution is the way to go.
