contract of pandas.DataFrame.equals

I have a simple test case of a function that returns a DataFrame which can potentially contain NaNs. I was testing whether the output and the expected output were equal.
>>> output
Out[1]:
r t ts tt ttct
0 2048 30 0 90 1
1 4096 90 1 30 1
2 0 70 2 65 1
[3 rows x 5 columns]
>>> expected
Out[2]:
r t ts tt ttct
0 2048 30 0 90 1
1 4096 90 1 30 1
2 0 70 2 65 1
[3 rows x 5 columns]
>>> output == expected
Out[3]:
r t ts tt ttct
0 True True True True True
1 True True True True True
2 True True True True True
However, I can't simply rely on the == operator because of NaNs. I was under the impression that the appropriate way to resolve this was by using the equals method. From the documentation:
pandas.DataFrame.equals
DataFrame.equals(other)
Determines if two NDFrame objects contain the same elements. NaNs in the same location are considered equal.
Nonetheless:
>>> expected.equals(output)
Out[4]: False
A little digging around reveals the difference in the frames:
>>> output._data
Out[5]:
BlockManager
Items: Index([u'r', u't', u'ts', u'tt', u'ttct'], dtype='object')
Axis 1: Int64Index([0, 1, 2], dtype='int64')
FloatBlock: [r], 1 x 3, dtype: float64
IntBlock: [t, ts, tt, ttct], 4 x 3, dtype: int64
>>> expected._data
Out[6]:
BlockManager
Items: Index([u'r', u't', u'ts', u'tt', u'ttct'], dtype='object')
Axis 1: Int64Index([0, 1, 2], dtype='int64')
IntBlock: [r, t, ts, tt, ttct], 5 x 3, dtype: int64
Force that output float block to int, or force the expected int block to float, and the test passes.
Obviously, there are different senses of equality, and the sort of test that DataFrame.equals performs could be useful in some cases. Nonetheless, the disparity between == and DataFrame.equals is frustrating to me and seems like an inconsistency. In pseudo-code, I would expect its behavior to match:
(self.index == other.index).all() \
and (self.columns == other.columns).all() \
and (self.values.fillna(SOME_MAGICAL_VALUE) == other.values.fillna(SOME_MAGICAL_VALUE)).all().all()
However, it doesn't. Am I wrong in my thinking, or is this an inconsistency in the Pandas API? Moreover, what IS the test I should be performing for my purposes, given the possible presence of NaN?

.equals() does just what it says. It tests for exact equality among elements, positioning of NaNs (and NaTs), dtype equality, and index equality. Think of this as a df is df2 type of test, except the two don't have to be the same object; in other words, df.equals(df.copy()) IS always True.
Your example fails because different dtypes are not equal (though they may be equivalent). You can use com.array_equivalent for this, or (df == df2).all().all() if you don't have NaNs.
array_equivalent is a replacement for np.array_equal, which is broken for NaN positional detection (and object dtypes).
It is mostly used internally. That said, if you would like an enhancement for equivalence (e.g. the elements are equivalent in the == sense and NaN positions match), please open an issue on GitHub (and even better, submit a PR!).
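For the original use case, comparing a result frame against an expected frame when NaNs may be present without tripping over dtype differences, something along these lines should work in more recent pandas versions; this is a sketch, not the only option:
import pandas as pd

# dtype-insensitive test that still treats NaNs in the same location as equal
pd.testing.assert_frame_equal(output, expected, check_dtype=False)

# or as a plain boolean expression (labels and shape must already match)
equal = ((output == expected) | (output.isna() & expected.isna())).all().all()
assert_frame_equal raises an informative AssertionError on mismatch, which is usually what you want in a test.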

I used a workaround that digs into the MagicMock instance:
assert mock_instance.call_count == 1
call_args = mock_instance.call_args[0]      # positional arguments of the call
call_kwargs = mock_instance.call_args[1]    # keyword arguments of the call
# compare the DataFrame that was actually passed against the expected one
pd.testing.assert_frame_equal(call_kwargs['dataframe'], pd.DataFrame())

How does loc know which row to update when setting values?

df = pd.DataFrame({
'Product': ['Umbrella', 'Matress', 'Badminton',
'Shuttle', 'Sofa', 'Football'],
'MRP': [1200, 1500, 1600, 352, 5000, 500],
'Discount': [0, 10, 0, 10, 20, 40]
})
# Print the dataframe
print(df)
df.loc[df.MRP >= 1500, "Discount"] = -1
print(df)
I want to understand how loc works. The purpose of loc is to get rows by label search. But in the above code, it seems to iterate over each row and insert -1 into the Discount column wherever the boolean is True. Does it still do a label search?
The only "real" indexing on a DataFrame are the positional indexes (the 0 indexed values which correspond to the underlying structures).
loc, therefore, always has to "Convert a potentially-label-based key into a positional indexer." _get_setitem_indexer.
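As a rough illustration of that conversion (not the actual internal code path), Index.get_loc and Index.get_indexer expose the same kind of label-to-position translation:
import pandas as pd

s = pd.Series([10, 20, 30], index=['x', 'y', 'z'])
s.index.get_loc('y')             # 1: the position behind the label 'y'
s.index.get_indexer(['x', 'z'])  # array([0, 2]): positions for a list of labels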
Stepping out from under the hood the docs on pandas.DataFrame.loc explicitly allow:
A single label, e.g. 5 or 'a', (note that 5 is interpreted as a label of the index, and never as an integer position along the index).
A list or array of labels, e.g. ['a', 'b', 'c'].
A slice object with labels, e.g. 'a':'f'.
A boolean array of the same length as the axis being sliced, e.g. [True, False, True].
An alignable boolean Series. The index of the key will be aligned before masking.
An alignable Index. The Index of the returned selection will be the input.
A callable function with one argument (the calling Series or DataFrame) and that returns valid output for indexing (one of the above).
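For illustration, here are a few of those accepted forms applied to the df from the question (a sketch just showing the kinds of input loc accepts):
df.loc[0]                           # a single label
df.loc[[0, 2, 4]]                   # a list of labels
df.loc[0:3]                         # a label slice (inclusive of both endpoints)
df.loc[df['MRP'] >= 1500]           # an alignable boolean Series
df.loc[lambda d: d['MRP'] >= 1500]  # a callable returning a valid indexer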
The benefit of loc is that it is extraordinarily flexible, particularly in its ability to be chained with other operations. See:
df.groupby('Discount')['MRP'].agg(sum)
Discount
0 2800
10 1852
20 5000
40 500
Name: MRP, dtype: int64
Filtering this with Series.loc can be written as:
df.groupby('Discount')['MRP'].agg(sum).loc[lambda s: s >= 1500]
Discount
0 2800
10 1852
20 5000
Name: MRP, dtype: int64
Another huge benefit of loc is its ability to index both dimensions:
df.loc[df['MRP'] >= 1500, ['Product', 'Discount']] = np.nan
Product MRP Discount
0 Umbrella 1200 0.0
1 NaN 1500 NaN
2 NaN 1600 NaN
3 Shuttle 352 10.0
4 NaN 5000 NaN
5 Football 500 40.0
TL;DR: the power of loc is its ability to translate various kinds of input into positional indexers, while the drawback is the overhead of those conversions.
The first line of the documentation for DataFrame.loc states:
Access a group of rows and columns by label(s) or a boolean array.
.loc[] is primarily label based, but may also be used with a boolean array
Let's take a look at the expression df.MRP >= 1500. This is a boolean series with the same index as the dataframe:
>>> df.MRP >= 1500
0 False
1 True
2 True
3 False
4 True
5 False
Name: MRP, dtype: bool
So clearly there is at least an opportunity to match labels. What happens when you remove the labels?
>>> df.loc[(df.MRP >= 1500).to_numpy(), "Discount"]
1 10
2 0
4 20
Name: Discount, dtype: int64
So .loc will use the ordering of the DataFrame when labels are not available. This makes sense. But does it use order or labels when the labels don't match?
Make a Series like df.MRP >= 1500 but out of order to see what gets selected:
>>> ind1 = pd.Series([True, True, True, False, False, False], index=[1, 2, 4, 0, 3, 5])
>>> df.loc[ind1, "Discount"]
1 10
2 0
4 20
Name: Discount, dtype: int64
So clearly, label matching happens when labels are available. When they are not, order is used instead:
>>> df.loc[ind1.to_numpy(), "Discount"]
0 0
1 10
2 0
Name: Discount, dtype: int64
Another interesting point is that the labels of the indexing expression must be a superset, not a subset, of the DataFrame's index. For example, if you shorten ind1 by one element, this is what happens:
>>> ind2 = pd.Series([True, True, True, False, False], index=[1, 2, 4, 0, 3])
>>> df.loc[ind2, "Discount"]
...
IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).
and
>>> df.loc[ind2.to_numpy(), "Discount"]
...
IndexError: Boolean index has wrong length: 5 instead of 6
Adding an extra element when doing label matching is OK, however:
>>> ind3 = pd.Series([True, True, True, False, False, False, True], index=[1, 2, 4, 0, 3, 5, 6])
>>> df.loc[ind3, "Discount"]
1 10
2 0
4 20
Name: Discount, dtype: int64
Notice that the element at index 6, which is not in the DataFrame's index, is ignored in the output.
And of course without labels, longer arrays are not acceptable either:
>>> df.loc[ind3.to_numpy(), "Discount"]
...
IndexError: Boolean index has wrong length: 7 instead of 6
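To tie this back to the original assignment: since the docs quoted above say the key's index is aligned before masking, the same label alignment should apply when setting values. That can be checked with a deliberately reordered mask (a sketch, rebuilding df so it is self-contained):
import pandas as pd

df = pd.DataFrame({
    'Product': ['Umbrella', 'Matress', 'Badminton', 'Shuttle', 'Sofa', 'Football'],
    'MRP': [1200, 1500, 1600, 352, 5000, 500],
    'Discount': [0, 10, 0, 10, 20, 40]
})

# reorder the boolean mask; loc still updates the right rows,
# because the mask is aligned on its labels before masking
mask = (df.MRP >= 1500).sort_values()
df.loc[mask, "Discount"] = -1
df["Discount"].tolist()   # [0, -1, -1, 10, -1, 40]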

How to correctly identify float values [0, 1] containing a dot, in DataFrame object dtype?

I have a dataframe like so, where my values are object dtype:
df = pd.DataFrame(data=['A', '290', '0.1744175757', '1', '1.0000000000'], columns=['Value'])
df
Out[65]:
Value
0 A
1 290
2 0.1744175757
3 1
4 1.0000000000
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 1 columns):
Value 5 non-null object
dtypes: object(1)
memory usage: 120.0+ bytes
What I want to do is select only percentages, in this case the values 0.1744175757 and 1.0000000000, which in my data always happen to contain a period/dot. This is a key point: I need to be able to differentiate between an integer 1 and a percentage 1.0000000000, as well as between 0 and 0.0000000000.
I've tried looking for the presence of the dot character, but this doesn't work: it returns True for every value, and I'm unclear why.
df[df['Value'].str.contains('.')]
Out[67]:
Value
0 A
1 290
2 0.1744175757
3 1
4 1.0000000000
I've also tried isdecimal(), but this isn't quite what I want:
df[df['Value'].str.isdecimal()]
Out[68]:
Value
1 290
3 1
The closest I've come is with a function:
def isPercent(x):
    if pd.isnull(x):
        return False
    try:
        x = float(x)
        return x % 1 != 0
    except:
        return False
df[df['Value'].apply(isPercent)]
Out[74]:
Value
2 0.1744175757
but this fails to correctly identify scenarios of 1.0000000000 (and 0.0000000000).
I have two questions:
Why doesn't str.contains('.') work in this context? It seems like the easiest way, since in my data it would always get me what I need, but it returns True even when there is clearly no '.' character in the value.
How might I correctly identify all values [0, 1] that have a dot character in the value?
str.contains performs a regex-based search by default, and '.' matches any character in a regex. To disable regex matching, use regex=False:
df[df['Value'].str.contains('.', regex=False)]
Value
2 0.1744175757
4 1.0000000000
You can also escape it to treat it literally:
df[df['Value'].str.contains(r'\.')]
Value
2 0.1744175757
4 1.0000000000
If you really want to pick up only float-looking numbers, try a regex that is a little more robust:
df[df['Value'].str.contains(r'\d+\.\d+')].astype(float)
Value
2 0.174418
4 1.000000
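If the goal is specifically values in [0, 1] that are written with a dot, one sketch is to combine the literal-dot test with a numeric range check (Series.between is inclusive on both ends by default):
has_dot = df['Value'].str.contains('.', regex=False)      # literal dot, regex disabled
as_number = pd.to_numeric(df['Value'], errors='coerce')   # non-numeric entries become NaN
df[has_dot & as_number.between(0, 1)]
#           Value
# 2  0.1744175757
# 4  1.0000000000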

Quirky behavior of pandas.DataFrame.equals

I have noticed a quirky thing. Let's say A and B are DataFrames.
A is:
A
a b c
0 x 1 a
1 y 2 b
2 z 3 c
3 w 4 d
B is:
B
a b c
0 1 x a
1 2 y b
2 3 z c
3 4 w d
As we can see above, the elements under columns a and b differ between A and B, but A.equals(B) yields True.
A==B correctly shows that the elements are not equal:
A==B
a b c
0 False False True
1 False False True
2 False False True
3 False False True
Question: Can someone please explain why .equals() yields True? I researched this topic on SO, and as per contract of pandas.DataFrame.equals, pandas should return False. I am a beginner, so I'd appreciate any help.
Here are the JSON representation and ._data of A and B.
A
A.to_json()
Out[114]: '{"a":{"0":"x","1":"y","2":"z","3":"w"},"b":{"0":1,"1":2,"2":3,"3":4},"c":{"0":"a","1":"b","2":"c","3":"d"}}'
A._data is:
BlockManager
Items: Index(['a', 'b', 'c'], dtype='object')
Axis 1: RangeIndex(start=0, stop=4, step=1)
IntBlock: slice(1, 2, 1), 1 x 4, dtype: int64
ObjectBlock: slice(0, 4, 2), 2 x 4, dtype: object
B
B's json format:
B.to_json()
'{"a":{"0":1,"1":2,"2":3,"3":4},"b":{"0":"x","1":"y","2":"z","3":"w"},"c":{"0":"a","1":"b","2":"c","3":"d"}}'
B._data
BlockManager
Items: Index(['a', 'b', 'c'], dtype='object')
Axis 1: RangeIndex(start=0, stop=4, step=1)
IntBlock: slice(0, 1, 1), 1 x 4, dtype: int64
ObjectBlock: slice(1, 3, 1), 2 x 4, dtype: object
As an alternative to sacul's and U9-Forward's answers, I've done some further analysis, and it looks like the reason you are seeing True rather than the False you expected has more to do with this line of the docs:
This function requires that the elements have the same dtype as their respective elements in the other Series or DataFrame.
With the above dataframes (plus two further example frames, C and D, not reproduced here), running .equals() returns:
>>> A.equals(B)
Out: True
>>> B.equals(C)
Out: False
These two results align with what the other answers are saying: A and B have the same shape and the same elements, so they compare equal, while B and C have the same shape but different elements, so they do not.
On the other hand:
>>> A.equals(D)
Out: False
Here A and D have the same shape and the same elements, yet the comparison still returns False. The difference between this case and the one above is that all of the dtypes in the comparison match up, as the docs quote above says: A and D both have the dtypes str, int, str.
As in the answer you linked in your question, essentially the behaviour of pandas.DataFrame.equals mimics numpy.array_equal.
The docs for np.array_equal state that it returns:
True if two arrays have the same shape and elements, False otherwise.
Which your two dataframes satisfy.
From the docs:
Determines if two NDFrame objects contain the same elements. NaNs in the same location are considered equal.
Determines if two NDFrame objects contain the same elements!
ELEMENTS, not columns.
So that's why it returns True.
If you want it to return False and check the columns as well, do:
print((A==B).all().all())
Output:
False
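To make the dtype point concrete, here is the comparison spelled out for A and B from the question. Whether A.equals(B) itself returns True here depends on the pandas version, since the block-level behaviour described above has changed over time, so treat this as a sketch:
import pandas as pd

A = pd.DataFrame({'a': ['x', 'y', 'z', 'w'], 'b': [1, 2, 3, 4], 'c': ['a', 'b', 'c', 'd']})
B = pd.DataFrame({'a': [1, 2, 3, 4], 'b': ['x', 'y', 'z', 'w'], 'c': ['a', 'b', 'c', 'd']})

A.dtypes.tolist()      # object, int64, object
B.dtypes.tolist()      # int64, object, object -- same blocks, different columns

(A == B).all().all()   # False: element-wise, column-by-column comparison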

Testing if value is contained in Pandas Series with mixed types

I have a pandas series, for example: x = pandas.Series([-1,20,"test"]).
Now I would like to test whether -1 is contained in x without looping over the whole series. I could convert the whole series to string and then test if "-1" in x, but sometimes I have -1.0 and sometimes -1 and so on, so this is not a good approach.
Is there another way to do this?
What about
x.isin([-1])
output:
0 True
1 False
2 False
dtype: bool
Or if you want to have a count of how many instances:
x.isin([-1]).sum()
Output:
1
I think you can do something like this to handle data that mixes string-like and integer-like values. A pandas Series has a single dtype; with mixed values it falls back to object.
x = pd.Series([-1,20,"test","-1.0"])
print(x)
0 -1
1 20
2 test
3 -1.0
dtype: object
(pd.to_numeric(x, errors='coerce') == -1).sum()
Note: any value that cannot be cast to a number becomes NaN.
Output
2
If you just want to see whether -1 appears in x at all, you can use
(pd.to_numeric(x, errors='coerce') == -1).sum() > 0
Output:
True
x.isin([-1])
Gives me:
0 True
1 False
2 False
dtype: bool
You can refer to the docs for more info.
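Putting the two approaches together, a small helper along these lines treats -1, -1.0 and "-1.0" as the same value; the name contains_number is made up here for illustration:
import pandas as pd

def contains_number(series, value):
    """True if `value` occurs in `series`, also matching numeric strings like "-1.0"."""
    numeric = pd.to_numeric(series, errors='coerce')   # non-numeric entries become NaN
    return bool((numeric == value).any() or series.isin([value]).any())

x = pd.Series([-1, 20, "test", "-1.0"])
contains_number(x, -1)   # True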

Python Pandas: Inconsistent behaviour of boolean indexing on a Series using the len() method

I have a Series of strings and I need to apply boolean indexing using len() on it.
In one case it works, in another case it does not:
The working case is a groupby on a dataframe, followed by a unique() on the resulting Series and an apply(str) to change the resulting numpy.ndarray entries into strings:
import pandas as pd
df = pd.DataFrame({'A':['a','a','a','a','b','b','b','b'],'B':[1,2,2,3,4,5,4,4]})
dg = df.groupby('A')['B'].unique().apply(str)
db = dg[len(dg) > 2]
This just works fine and yields the desired result:
>>db
Out[119]: '[1 2 3]'
The following, however, throws KeyError: True:
ss = pd.Series(['a','b','cc','dd','eeee','ff','ggg'])
ls = ss[len(ss) > 2]
Both objects, dg and ss, are just Series of strings:
>>type(dg)
Out[113]: pandas.core.series.Series
>>type(ss)
Out[114]: pandas.core.series.Series
>>type(dg['a'])
Out[115]: str
>>type(ss[0])
Out[116]: str
I'm following the syntax as described in the docs: http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing
I can see a potential conflict: len(ss) on its own returns the length of the Series itself, and that exact expression is now used for boolean indexing in ss[len(ss) > 2]. But then I'd expect neither of the two examples to work.
Right now this behaviour seems inconsistent, unless I'm missing something obvious.
I think you need str.len, because you need the length of each value of the Series:
ss = pd.Series(['a','b','cc','dd','eeee','ff','ggg'])
print (ss.str.len())
0 1
1 1
2 2
3 2
4 4
5 2
6 3
dtype: int64
print (ss.str.len() > 2)
0 False
1 False
2 False
3 False
4 True
5 False
6 True
dtype: bool
ls = ss[ss.str.len() > 2]
print (ls)
4 eeee
6 ggg
dtype: object
If you use len, you get the length of the Series itself:
print (len(ss))
7
Another solution is to apply len:
ss = pd.Series(['a','b','cc','dd','eeee','ff','ggg'])
ls = ss[ss.apply(len) > 2]
print (ls)
4 eeee
6 ggg
dtype: object
Your first script is also wrong; you need to apply len there too:
df = pd.DataFrame({'A':['a','a','a','a','b','b','b','b'],'B':[1,2,2,2,4,5,4,6]})
dg = df.groupby('A')['B'].unique()
print (dg)
A
a [1, 2]
b [4, 5, 6]
Name: B, dtype: object
db = dg[dg.apply(len) > 2]
print (db)
A
b [4, 5, 6]
Name: B, dtype: object
If you cast the array to str, you get a different length (the length of the string representation: the data plus the brackets and whitespace):
dg = df.groupby('A')['B'].unique().apply(str)
print (dg)
A
a [1 2]
b [4 5 6]
Name: B, dtype: object
print (dg.apply(len))
A
a 5
b 7
Name: B, dtype: int64
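The shapes involved make the difference clear; a quick sketch (the KeyError is the one from the question):
ss = pd.Series(['a', 'b', 'cc', 'dd', 'eeee', 'ff', 'ggg'])

len(ss) > 2            # True: one Python bool for the whole Series,
                       # so ss[len(ss) > 2] is ss[True], a scalar lookup (hence KeyError: True)
ss.str.len() > 2       # an element-wise boolean Series: what boolean indexing needs
ss[ss.str.len() > 2]   # 4    eeee
                       # 6     ggg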
