Pandas DataFrames with NaNs equality comparison - python

In the context of unit testing some functions, I'm trying to establish the equality of 2 DataFrames using python pandas:
ipdb> expect
1 2
2012-01-01 00:00:00+00:00 NaN 3
2013-05-14 12:00:00+00:00 3 NaN
ipdb> df
identifier 1 2
timestamp
2012-01-01 00:00:00+00:00 NaN 3
2013-05-14 12:00:00+00:00 3 NaN
ipdb> df[1][0]
nan
ipdb> df[1][0], expect[1][0]
(nan, nan)
ipdb> df[1][0] == expect[1][0]
False
ipdb> df[1][1] == expect[1][1]
True
ipdb> type(df[1][0])
<type 'numpy.float64'>
ipdb> type(expect[1][0])
<type 'numpy.float64'>
ipdb> (list(df[1]), list(expect[1]))
([nan, 3.0], [nan, 3.0])
ipdb> df1, df2 = (list(df[1]), list(expect[1])) ;; df1 == df2
False
Given that I'm trying to test all of expect against all of df, including NaN positions, what am I doing wrong?
What is the simplest way to compare equality of Series/DataFrames including NaNs?

You can use assert_frame_equal with check_names=False (so as not to check the index/column names), which will raise if they are not equal:
In [11]: from pandas.testing import assert_frame_equal
In [12]: assert_frame_equal(df, expected, check_names=False)
You can wrap this in a function with something like:
def frames_equal(df, expected):
    try:
        assert_frame_equal(df, expected, check_names=False)
        return True
    except AssertionError:
        return False
In more recent pandas this functionality has been added as .equals:
df.equals(expected)
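For example, a minimal sketch (frame contents made up) showing that .equals treats NaNs in matching positions as equal, while plain == does not:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [np.nan, 3.0], 'b': [3.0, np.nan]})
expected = df.copy()

print(df.equals(expected))            # True: aligned NaNs count as equal
print((df == expected).all().all())   # False: NaN == NaN is False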

One of the properties of NaN is that NaN != NaN is True.
Check out this answer for a nice way to do this using numexpr.
(a == b) | ((a != a) & (b != b))
says this (in pseudocode):
a == b or (isnan(a) and isnan(b))
So, either a equals b, or both a and b are NaN.
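As a minimal runnable sketch of that expression with plain pandas (no numexpr; data made up):
import numpy as np
import pandas as pd

a = pd.DataFrame([[np.nan, 1.0], [2.0, np.nan]])
b = a.copy()

# True where values are equal, or where both sides are NaN (x != x only for NaN)
mask = (a == b) | ((a != a) & (b != b))
print(mask.values.all())  # True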
If you have small frames then assert_frame_equal will be okay. However, for large frames (10M rows) assert_frame_equal is pretty much useless. I had to interrupt it, it was taking so long.
In [1]: df = DataFrame(rand(int(1e7), 15))  # rand from numpy.random; newer numpy requires integer dimensions
In [2]: df = df[df > 0.5]
In [3]: df2 = df.copy()
In [4]: df
Out[4]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000000 entries, 0 to 9999999
Columns: 15 entries, 0 to 14
dtypes: float64(15)
In [5]: timeit (df == df2) | ((df != df) & (df2 != df2))
1 loops, best of 3: 598 ms per loop
timeit of the (presumably) desired single bool indicating whether the two DataFrames are equal:
In [9]: timeit ((df == df2) | ((df != df) & (df2 != df2))).values.all()
1 loops, best of 3: 687 ms per loop
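If you want that as a single reusable predicate, a small sketch (the function name nan_equal is my own):
def nan_equal(df1, df2):
    # elementwise equal, or both NaN (x != x is True only for NaN)
    mask = (df1 == df2) | ((df1 != df1) & (df2 != df2))
    return bool(mask.values.all())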

Like @PhillipCloud's answer, but more written out
In [26]: df1 = DataFrame([[np.nan,1],[2,np.nan]])
In [27]: df2 = df1.copy()
They really are equivalent
In [28]: result = df1 == df2
In [29]: result[pd.isnull(df1) & pd.isnull(df2)] = True
In [30]: result
Out[30]:
0 1
0 True True
1 True True
A nan in df2 that doesn't exist in df1
In [31]: df2 = DataFrame([[np.nan,1],[np.nan,np.nan]])
In [32]: result = df1 == df2
In [33]: result[pd.isnull(df1) & pd.isnull(df2)] = True
In [34]: result
Out[34]:
0 1
0 True True
1 False True
You can also fill with a value you know not to be in the frame
In [38]: df1.fillna(-999) == df1.fillna(-999)
Out[38]:
0 1
0 True True
1 True True

Any equality comparison using == with np.NaN is False, even np.NaN == np.NaN is False.
Simply, df1.fillna('NULL') == df2.fillna('NULL'), if 'NULL' is not a value in the original data.
To be safe, do the following:
Example a) Compare two dataframes with NaN values
bools = (df1 == df2)
bools[pd.isnull(df1) & pd.isnull(df2)] = True
assert bools.all().all()
Example b) Filter rows in df1 that do not match with df2
bools = (df1 != df2)
bools[pd.isnull(df1) & pd.isnull(df2)] = False
df_outlier = df1[bools.all(axis=1)]
(Note: using == there instead would be wrong: bools[pd.isnull(df1) == pd.isnull(df2)] = False)
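A self-contained run of both examples (data made up for illustration):
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'x': [1.0, np.nan, 3.0], 'y': [np.nan, 5.0, 6.0]})
df2 = pd.DataFrame({'x': [1.0, np.nan, 3.0], 'y': [np.nan, 5.0, 7.0]})

# Example a: full equality, treating aligned NaNs as equal
bools = (df1 == df2)
bools[pd.isnull(df1) & pd.isnull(df2)] = True
print(bools.all().all())  # False: the last row differs in 'y'

# Example b: rows of df1 where every column differs from df2
bools = (df1 != df2)
bools[pd.isnull(df1) & pd.isnull(df2)] = False
print(df1[bools.all(axis=1)])  # empty here: no row differs in every column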

df.fillna(0) == df2.fillna(0)
You can use fillna(). Documentation here.
from pandas import DataFrame
# create a dataframe with NaNs
df = DataFrame([{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}])
df2 = df
# comparison fails!
print(df == df2)
# all is well
print(df.fillna(0) == df2.fillna(0))
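One caveat worth a quick sketch: the fill value must not occur in the real data, otherwise a genuine 0-versus-NaN difference gets masked (made-up data):
import numpy as np
from pandas import DataFrame

a = DataFrame([{'v': 0.0}])
b = DataFrame([{'v': np.nan}])

# 0 and NaN are genuinely different, but fillna(0) hides that
print((a.fillna(0) == b.fillna(0)).all().all())  # True, misleadingly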

Related

check columns in DataFrame for constant values explanation

I want to check a big DataFrame for constant columns and build two lists: the first with the names of columns containing only zeros, the second with the names of columns with constant values (excluding 0).
I found a solution (A in the code below) at Link, but I don't understand it. A does what I want, but I don't know how it works or how to get the lists from it.
import numpy as np
import pandas as pd
data = [[0,1,1],[0,1,2],[0,1,3]]
df = pd.DataFrame(data, columns=['A', 'B', 'C'])
A = df.loc[:, (df != df.iloc[0]).any()]
Use:
m1 = (df == 0).all()
m2 = (df == df.iloc[0]).all()
a = df.columns[m1].tolist()
b = df.columns[~m1 & m2].tolist()
print (a)
['A']
print (b)
['B']
Explanation:
First compare all values by 0:
print (df == 0)
A B C
0 True False False
1 True False False
2 True False False
Then test whether all values are True with DataFrame.all:
print ((df == 0).all())
A True
B False
C False
dtype: bool
Then compare every row against the first row, selected with DataFrame.iloc:
print (df == df.iloc[0])
A B C
0 True True True
1 True True False
2 True True False
And test again with all:
print ((df == df.iloc[0]).all())
A True
B True
C False
dtype: bool
To exclude the all-zero columns, invert the first mask with ~ and chain it to the second with & (bitwise AND):
print (~m1 & m2)
A False
B True
C False
dtype: bool
This seems like a clean way to do what you want:
m1 = df.eq(0).all()
m2 = df.nunique().eq(1) & ~m1
m1[m1].index, m2[m2].index
# (Index(['A'], dtype='object'), Index(['B'], dtype='object'))
m1 gives you a boolean of columns that all have zeros:
m1
A True
B False
C False
dtype: bool
m2 gives you all columns with a single unique value, excluding the all-zero ones (the second condition reuses the first mask)
m2
A False
B True
C False
dtype: bool
Deriving your lists is trivial from these masks.
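For instance, a sketch of turning those masks into plain lists:
zeros = m1[m1].index.tolist()     # ['A']: columns that are all zeros
constant = m2[m2].index.tolist()  # ['B']: constant columns, excluding all-zero ones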

Python Pandas - Cannot recognize a string from a column in another dataframe column

I've a dataframe with the following data:
Now I am trying to use the isin method to produce a new column saying whether col_a is contained in col_b. So in this case I am trying to produce the following output:
For this I am using this code:
df['res'] = df.col_a.isin(df.col_b)
But it always returns False. I also tried df['res'] = df.col_b.isin(df.col_a),
but with the same result: all the rows are False.
What am I doing wrong?
Thanks!
You can check whether the value in col_a is contained in col_b row by row with apply:
df['res'] = df.apply(lambda x: x.col_a in x.col_b, axis=1)
Or by list comprehension:
df['res'] = [a in b for a, b in zip(df.col_a, df.col_b)]
EDIT: The error obviously means there are missing values, so an if-else statement is necessary:
import numpy as np
import pandas as pd

df = pd.DataFrame({'col_a': ['SQL', 'Java', 'C#', np.nan, 'Python', np.nan],
                   'col_b': ['I.like_SQL_since_i_used_to_ETL',
                             'I like_programming_SQL.too',
                             'I prefer Java',
                             'I like beer',
                             np.nan,
                             np.nan]})
print (df)
# x == x is False only for NaN, so this skips rows with missing values
df['res'] = df.apply(lambda x: x.col_a in x.col_b
                     if (x.col_a == x.col_a) and (x.col_b == x.col_b)
                     else False, axis=1)
df['res1'] = [a in b if (a == a) and (b == b) else False
              for a, b in zip(df.col_a, df.col_b)]
print (df)
col_a col_b res res1
0 SQL I.like_SQL_since_i_used_to_ETL True True
1 Java I like_programming_SQL.too False False
2 C# I prefer Java False False
3 NaN I like beer False False
4 Python NaN False False
5 NaN NaN False False

Python Compare rows in two columns and write a result conditionally

I've been searching for quite a while but not getting anywhere close to what I wanted to do...
I have a pandas dataframe in which I want to compare the value of column A to B and write a 1 or 0 in a new column if A and B are equal.
I could write an ugly for loop, but I know this is not very Pythonic.
I'm pretty sure there is a way to do this with apply() but I'm not getting anywhere.
I'd like to be able to compare columns that contain integers as well as columns containing strings.
Thanks in advance for your help.
If df is a Pandas DataFrame, then
df['newcol'] = (df['A'] == df['B']).astype('int')
For example,
In [20]: df = pd.DataFrame({'A': [1,2,'foo'], 'B': [1,99,'foo']})
In [21]: df
Out[21]:
A B
0 1 1
1 2 99
2 foo foo
In [22]: df['newcol'] = (df['A'] == df['B']).astype('int')
In [23]: df
Out[23]:
A B newcol
0 1 1 1
1 2 99 0
2 foo foo 1
df['A'] == df['B'] returns a boolean Series:
In [24]: df['A'] == df['B']
Out[24]:
0 True
1 False
2 True
dtype: bool
astype('int') converts the True/False values to integers -- 0 for False and 1 for True.

matching of columns between two pandas dataframe

import pandas as pd
from numpy import arange

temp1 = pd.DataFrame(index=arange(10), columns=['a', 'b'])
temp1['a'] = [1, 2, 2, 3, 3, 4, 4, 4, 9, 11]
temp1['b'] = 'B'

temp2 = pd.DataFrame(index=arange(10), columns=['a', 'b'])
temp2['a'] = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
temp2['b'] = 'B'
Given the script above, I want to pick out the rows of temp1 whose column a values do not appear in temp2. In R I can do this easily with %in%; how can I do it in pandas?
Update: the output should be one row, in which column a is 11 and column b is B.
You can use isin to perform boolean indexing:
isin will produce a boolean index:
In [95]:
temp1.a.isin(temp2.a)
Out[95]:
0 True
1 True
2 True
3 True
4 True
5 True
6 True
7 True
8 True
9 False
Name: a, dtype: bool
This can then be used as a mask in the final output:
In [94]:
# note the ~, which negates the result, i.e. a logical NOT
temp1[~temp1.a.isin(temp2.a)]
Out[94]:
a b
9 11 B
You can use isin to get the indices that are seen, and then negate the boolean indices:
temp1[~temp1.a.isin(temp2.a)]

How to check whether a pandas DataFrame is empty?

How to check whether a pandas DataFrame is empty? In my case I want to print some message in terminal if the DataFrame is empty.
You can use the attribute df.empty to check whether it's empty or not:
if df.empty:
    print('DataFrame is empty!')
Source: Pandas Documentation
I use the len function. It's much faster than empty. len(df.index) is even faster.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10000, 4), columns=list('ABCD'))
def empty(df):
    return df.empty

def lenz(df):
    return len(df) == 0

def lenzi(df):
    return len(df.index) == 0
%timeit empty(df)
10000 loops, best of 3: 13.9 µs per loop

%timeit lenz(df)
100000 loops, best of 3: 2.34 µs per loop

%timeit lenzi(df)
1000000 loops, best of 3: 695 ns per loop

len on the index seems to be the fastest.
To see if a dataframe is empty, I argue that one should test for the length of a dataframe's columns index:
if len(df.columns) == 0:
    ...  # completely empty: no rows and no columns
Reason:
According to the Pandas Reference API, there is a distinction between:
an empty dataframe with 0 rows and 0 columns
an empty dataframe with 0 rows but one or more columns that persist (and whose rows re-surface as NaN when data is added)
Arguably, they are not the same. The other answers are imprecise in that df.empty, len(df), and len(df.index) make no distinction: the index length is 0 and empty is True in both cases.
Examples
Example 1: An empty dataframe with 0 rows and 0 columns
In [1]: import pandas as pd
df1 = pd.DataFrame()
df1
Out[1]: Empty DataFrame
Columns: []
Index: []
In [2]: len(df1.index) # or len(df1)
Out[2]: 0
In [3]: df1.empty
Out[3]: True
Example 2: A dataframe which is emptied to 0 rows but still retains n columns
In [4]: df2 = pd.DataFrame({'AA' : [1, 2, 3], 'BB' : [11, 22, 33]})
df2
Out[4]: AA BB
0 1 11
1 2 22
2 3 33
In [5]: df2 = df2[df2['AA'] == 5]
df2
Out[5]: Empty DataFrame
Columns: [AA, BB]
Index: []
In [6]: len(df2.index) # or len(df2)
Out[6]: 0
In [7]: df2.empty
Out[7]: True
Now, building on the previous examples, in which the index length is 0 and empty is True: when reading the length of the columns index for the first dataframe df1, it returns 0 columns, proving that it is indeed empty.
In [8]: len(df1.columns)
Out[8]: 0
In [9]: len(df2.columns)
Out[9]: 2
Critically, while the second dataframe df2 contains no data, it is not completely empty, because its now-empty columns persist.
Why it matters
Let's add a new column to these dataframes to understand the implications:
# As expected, the truly empty df1 ends up with just the single new series
In [10]: df1['CC'] = [111, 222, 333]
df1
Out[10]: CC
0 111
1 222
2 333
In [11]: len(df1.columns)
Out[11]: 1
# Note the persisting series with rows containing `NaN` values in df2
In [12]: df2['CC'] = [111, 222, 333]
df2
Out[12]: AA BB CC
0 NaN NaN 111
1 NaN NaN 222
2 NaN NaN 333
In [13]: len(df2.columns)
Out[13]: 3
It is evident that the original columns in df2 have re-surfaced. Therefore, it is prudent to instead read the length of the columns index, len(df.columns), to see whether a dataframe is truly empty.
Practical solution
# New dataframe df
In [1]: df = pd.DataFrame({'AA' : [1, 2, 3], 'BB' : [11, 22, 33]})
df
Out[1]: AA BB
0 1 11
1 2 22
2 3 33
# This filter matches no rows, so the result is an empty df
# (0 rows) that nevertheless retains its columns
In [2]: df = df[df['AA'] == 5]
df
Out[2]: Empty DataFrame
Columns: [AA, BB]
Index: []
# NOTE: the df is empty, BUT the columns are persistent
In [3]: len(df.columns)
Out[3]: 2
# And accordingly, the other answers on this page
In [4]: len(df.index) # or len(df)
Out[4]: 0
In [5]: df.empty
Out[5]: True
# SOLUTION: conditionally check for empty columns
In [6]: if len(df.columns) != 0:  # <--- here
            # Do something, e.g. drop any columns whose rows are all `NaN`
            # to make the df really empty
            df = df.dropna(how='all', axis=1)
        df
Out[6]: Empty DataFrame
Columns: []
Index: []
# Testing shows it is indeed empty now
In [7]: len(df.columns)
Out[7]: 0
Adding a new data series now works as expected, without the re-surfacing of empty columns (that is, without any series whose rows contained only NaN):
In [8]: df['CC'] = [111, 222, 333]
df
Out[8]: CC
0 111
1 222
2 333
In [9]: len(df.columns)
Out[9]: 1
I prefer going the long route. These are the checks I follow to avoid using a try-except clause:
check that the variable is not None
then check that it's a DataFrame, and
make sure it's not empty
Here, DATA is the suspect variable -
DATA is not None and isinstance(DATA, pd.DataFrame) and not DATA.empty
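Wrapped up as a reusable predicate, a small sketch (the function name is my own):
import pandas as pd

def is_usable_dataframe(obj):
    # True only for a real, non-empty DataFrame
    return obj is not None and isinstance(obj, pd.DataFrame) and not obj.empty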
If a DataFrame contains both NaN and non-null values and you want to know whether it effectively holds any data, try this approach.
When can this situation happen?
It happens when a single function is used to plot more than one DataFrame, passed in as parameters. In that situation the function tries to plot the data even when a DataFrame is empty, and thus plots an empty figure. It makes more sense to simply display a 'DataFrame has no data' message.
Why?
If a DataFrame contains no data at all (mind you, a DataFrame containing only NaN values is still considered non-empty by df.empty), then it is desirable not to plot it but to put out a message instead.
Suppose we have two DataFrames, df1 and df2.
The function myfunc takes any DataFrame (df1 or df2 in this case) and prints a message if the DataFrame holds no data (instead of plotting):
df1                 df2
  col1  col2          col1  col2
   NaN     2           NaN   NaN
     2   NaN           NaN   NaN
and the function:
def myfunc(df):
    # count the total number of non-NaN values; 0 if the DataFrame holds no data
    if df.count().sum() > 0:
        print('not empty')
        df.plot(kind='barh')
    else:
        # display a message instead of plotting if it is empty
        print('empty')
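A minimal run with the two frames above (a sketch; the plotting branch assumes matplotlib is installed):
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'col1': [np.nan, 2], 'col2': [2, np.nan]})
df2 = pd.DataFrame({'col1': [np.nan, np.nan], 'col2': [np.nan, np.nan]})

myfunc(df1)  # prints 'not empty' and plots
myfunc(df2)  # prints 'empty'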
