Numpy/Pandas clean way to check if a specific value is NaN - python

How can I check if a given value is NaN?
e.g. if (a == np.NaN) (doesn't work)
Please note that:
NumPy's isnan function raises an error on non-numeric data types such as strings
The Pandas docs only cover methods to drop rows containing NaNs, or ways to check whether a DataFrame contains NaNs. I'm asking about checking if a specific value is NaN.
Relevant Stackoverflow questions and Google search results seem to be about checking "if any value is NaN" or "which values in a DataFrame"
There must be a clean way to check if a given value is NaN?

You can use the innate property that NaN != NaN:
a == a will return False if a is NaN.
This works even for strings.
Example:
In[52]:
s = pd.Series([1, np.NaN, '', 1.0])
s
Out[52]:
0      1
1    NaN
2
3      1
dtype: object
for val in s:
    print(val == val)
True
False
True
True
This can be done in a vectorised manner:
In[54]:
s==s
Out[54]:
0 True
1 False
2 True
3 True
dtype: bool
You can also use the isnull method on the whole series:
In[55]:
s.isnull()
Out[55]:
0 False
1 True
2 False
3 False
dtype: bool
UPDATE
As noted by piRSquared, None == None returns True, so the a == a trick will not flag None as missing, whereas pd.isnull(None) returns True. So depending on whether you want to treat None as NaN, you can use == for the comparison or pd.isnull.
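To make the distinction concrete, a quick sketch:

```python
import numpy as np
import pandas as pd

a = np.nan
print(a == a)        # False: NaN is the only value not equal to itself
print(pd.isnull(a))  # True

b = None
print(b == b)        # True: the == trick does not catch None
print(pd.isnull(b))  # True: pd.isnull treats None as missing
```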

Pandas has isnull, notnull, isna, and notna
These functions work for arrays or scalars.
Setup
a = np.array([[1, np.nan],
              [None, '2']])
Pandas functions
pd.isna(a)
# same as
# pd.isnull(a)
array([[False,  True],
       [ True, False]])
pd.notnull(a)
# same as
# pd.notna(a)
array([[ True, False],
       [False,  True]])
DataFrame (or Series) methods
b = pd.DataFrame(a)
b.isnull()
# same as
# b.isna()
       0      1
0  False   True
1   True  False
b.notna()
# same as
# b.notnull()
       0      1
0   True  False
1  False   True
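These functions also work on scalars, which answers the original question directly; a quick sketch:

```python
import numpy as np
import pandas as pd

# pd.isna handles NaN, None, and ordinary values alike
print(pd.isna(np.nan))  # True
print(pd.isna(None))    # True
print(pd.isna('a'))     # False
print(pd.notna(3.14))   # True
```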

Related

Pandas: find matching rows in two dataframes (without using `merge`)

Let's suppose I have these two dataframes with the same number of columns, but possibly a different number of rows:
tmp = np.arange(0,12).reshape((4,3))
df = pd.DataFrame(data=tmp)
tmp2 = {'a':[3,100,101], 'b':[4,4,100], 'c':[5,100,3]}
df2 = pd.DataFrame(data=tmp2)
print(df)
   0   1   2
0  0   1   2
1  3   4   5
2  6   7   8
3  9  10  11
print(df2)
     a    b    c
0    3    4    5
1  100    4  100
2  101  100    3
I want to verify if the rows of df2 are matching any rows of df, that is I want to obtain a series (or an array) of boolean values that gives this result:
0 True
1 False
2 False
dtype: bool
I thought something like the isin method should work, but it returns a DataFrame of element-wise matches, which is not what I want:
print(df2.isin(df))
       a      b      c
0  False  False  False
1  False  False  False
2  False  False  False
As a constraint, I wish to not use the merge method, since what I am doing is in fact a check on the data before applying merge itself.
Thank you for your help!
You can use numpy.isin, which tests each element of one array for membership in the other array and returns True or False element-wise.
Then calling all() on each row gives the desired output, since all returns True only when every element of the row is True:
>>> pd.Series([m.all() for m in np.isin(df2.values,df.values)])
0 True
1 False
2 False
dtype: bool
Breakdown of what is happening:
# np.isin
>>> np.isin(df2.values,df.values)
Out[139]:
array([[ True,  True,  True],
       [False,  True, False],
       [False, False,  True]])
# all()
>>> [m.all() for m in np.isin(df2.values,df.values)]
Out[140]: [True, False, False]
# pd.Series()
>>> pd.Series([m.all() for m in np.isin(df2.values,df.values)])
Out[141]:
0 True
1 False
2 False
dtype: bool
Use np.in1d (an older alias of np.isin):
>>> df2.apply(lambda x: all(np.in1d(x, df)), axis=1)
0 True
1 False
2 False
dtype: bool
Another way: use frozenset (note that this ignores element order and duplicates within a row):
>>> df2.apply(frozenset, axis=1).isin(df.apply(frozenset, axis=1))
0 True
1 False
2 False
dtype: bool
You can use a MultiIndex (expensive IMO):
pd.MultiIndex.from_frame(df2).isin(pd.MultiIndex.from_frame(df))
Out[32]: array([ True, False, False])
Another option is to create a dictionary, and run isin:
df2.isin({key : array.array for key, (_, array) in zip(df2, df.items())}).all(1)
Out[45]:
0 True
1 False
2 False
dtype: bool
There may be more efficient solutions, but you could append the two dataframes and call duplicated, e.g.:
df.append(df2).duplicated().iloc[df.shape[0]:]
(On pandas 2.0+, where DataFrame.append was removed, use pd.concat([df, df2]) instead.)
This assumes that all rows in each DataFrame are distinct. Here are some benchmarks:
tmp1 = np.arange(0,12).reshape((4,3))
df1 = pd.DataFrame(data=tmp1, columns=["a", "b", "c"])
tmp2 = {'a':[3,100,101], 'b':[4,4,100], 'c':[5,100,3]}
df2 = pd.DataFrame(data=tmp2)
df1 = pd.concat([df1] * 10_000).reset_index()
df2 = pd.concat([df2] * 10_000).reset_index()
%timeit df1.append(df2).duplicated().iloc[df1.shape[0]:]
# 100 loops, best of 5: 4.16 ms per loop
%timeit pd.Series([m.all() for m in np.isin(df2.values,df1.values)])
# 10 loops, best of 5: 74.9 ms per loop
%timeit df2.apply(frozenset, axis=1).isin(df1.apply(frozenset, axis=1))
# 1 loop, best of 5: 443 ms per loop
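A minimal runnable sketch of the duplicated approach on the question's data, using pd.concat since DataFrame.append was removed in pandas 2.0 (column names are aligned here, which the approach requires):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=['a', 'b', 'c'])
df2 = pd.DataFrame({'a': [3, 100, 101], 'b': [4, 4, 100], 'c': [5, 100, 3]})

# Stack df on top of df2; a row of df2 that repeats a row of df is marked True
mask = pd.concat([df, df2]).duplicated().iloc[df.shape[0]:]
print(mask.tolist())  # [True, False, False]
```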
Try converting each row to a tuple and testing membership:
df2.apply(tuple, axis=1).isin(df.apply(tuple, axis=1))
To get the rows of df that do not appear in df2, invert the mask instead:
df[~df.apply(tuple, axis=1).isin(df2.apply(tuple, axis=1))]

Inverting a boolean Series yields -1 for False and -2 for True in Pandas

I have a subset of Series in Pandas dataframe populated with bool value of True and False. I am trying to invert the series by using ~.
This is the original subset of the Series.
7 True
8 False
14 True
38 False
72 False
...
Name: Status, Length: 197, dtype: object
Now I am using the following code to invert the values.
mask = ~subset_df['Status']
But the result I actually get is
7 -2
8 -1
14 -2
38 -1
72 -1
...
Name: Status, Length: 197, dtype: object
but what I really want is the following output:
7 False
8 True
14 False
38 True
72 True
...
Name: Status, Length: 197, dtype: object
I would really appreciate if you could let me know how to invert a boolean Series without converting them into -1 and -2. Thank you so much.
For some reason, you have a series of object dtype, filled with what are probably ordinary Python bools. Applying ~ to such a series goes through elementwise and applies the ordinary ~ operator to each element, and ordinary Python bools inherit ~ from int - they do not perform logical negation for this operation.
You can convert your series to boolean dtype first before applying ~ to get logical negation:
~series.astype(bool)
On a sufficiently recent Pandas version (1.0 and up), you may wish to instead use the new nullable boolean dtype with astype('boolean') instead of astype(bool).
You should also figure out why your series has object dtype in the first place - it's likely that the correct place to address this issue is somewhere earlier in your code, not here. Perhaps you built it wrong, or you tried to use NaN or None to represent missing values.
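A short sketch of the fix, assuming an object-dtype Series of Python bools like the one in the question:

```python
import pandas as pd

s = pd.Series([True, False, True, False], dtype=object)

# Cast to a real boolean dtype first so that ~ performs logical negation
mask = ~s.astype(bool)
print(mask.tolist())  # [False, True, False, True]
```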
You can use the replace function of pandas Series as:
subset_df['Status'].replace(to_replace=[True, False], value=[False, True])
This will return another series with replaced values. But if you want to change the actual dataframe, then you can add a parameter 'inplace=True' as:
subset_df['Status'].replace(to_replace=[True, False], value=[False, True], inplace=True)
EDIT:
If the Status column is of object/string type, you can map the strings directly: here we replace the string 'True' with the boolean False, and 'False' with True.
df = pd.DataFrame({
    'Status': ['True', 'False', 'True', 'False', 'False']
})
df.Status = np.where(df['Status'] == 'True', False, True)
df
Output
Status
0 False
1 True
2 False
3 True
4 True
If the series is of boolean dtype, the options below can be used.
s = pd.Series([True, False, True, False, False])
Two options:
~s
or
np.invert(s)
(~ is the idiomatic boolean negation operator in pandas; unary - on boolean data is not reliable across versions.)
Output
0 False
1 True
2 False
3 True
4 True
dtype: bool
What you tried with ~ works for a DataFrame column as well
df=pd.DataFrame({
'Status':[True, False, True, False, False]
})
df['Status'] = ~df['Status']
df
Output
Status
0 False
1 True
2 False
3 True
4 True

Equivalent of Dataframe "diff" with strings

Dataframes in pandas have some functions to perform computations between different rows, like diff (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.diff.html).
However, this only works with numeric data (or at least objects that support the - operation).
Is there a way to perform a diff between strings that returns a boolean indicating whether consecutive strings are equal?
For example:
>>> s = pd.Series(list("ABCCDEF"))
>>> s.str_diff()
0 NaN
1 False
2 False
3 True
4 False
5 False
6 False
dtype: bool
Thanks to Quang Hoang for pointing out the answer.
You just need to compare the series with a shifted copy of itself:
>>> s = pd.Series(list("ABBCDDEF"))
# If you search for strings that differ from the previous one
>>> s.ne(s.shift())
# If you search for strings equal to the previous one
>>> s.eq(s.shift())
0 False
1 False
2 True
3 False
4 False
5 True
6 False
7 False
dtype: bool
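Applied to the question's original series, eq plus shift reproduces the desired result (with False rather than NaN in the first position, since eq always returns booleans):

```python
import pandas as pd

s = pd.Series(list("ABCCDEF"))

# Compare each element with the one before it
result = s.eq(s.shift())
print(result.tolist())  # [False, False, False, True, False, False, False]
```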

Pandas boolean algebra: True if True in both columns

I would like to make a boolean vector that is created by the comparison of two input boolean vectors. I can use a for loop, but is there a better way to do this?
My ideal solution would look like this:
df['A'] = [True, False, False, True]
df['B'] = [True, False, False, False]
C = ((df['A']==True) or (df['B']==True)).as_matrix()
print C
>>> True, False, False, True
I think this is what you are looking for:
C = (df['A']) | (df['B'])
C
0 True
1 False
2 False
3 True
dtype: bool
You could then leave this as a series or convert it to a list or array
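A minimal sketch; note that the question's .as_matrix() was removed in newer pandas, and .to_numpy() is the current equivalent:

```python
import pandas as pd

df = pd.DataFrame({'A': [True, False, False, True],
                   'B': [True, False, False, False]})

# Element-wise OR; Python's `or` raises an error on Series, so use |
C = df['A'] | df['B']
print(C.tolist())  # [True, False, False, True]
print(C.to_numpy())
```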
Alternatively you could use the any method with axis=1 to check row-wise across the columns. It will also work for any number of columns containing True values:
In [1105]: df
Out[1105]:
       B      A
0   True   True
1  False  False
2  False  False
3  False   True
In [1106]: df.any(axis=1)
Out[1106]:
0 True
1 False
2 False
3 True
dtype: bool

Is there a `in` like statement for a Pandas Series?

I can easily check whether each element is equal to a number:
In [20]: s = pd.Series([1, 2, 3])
In [21]: s == 1
Out[21]:
0 True
1 False
2 False
My problem is, is there a function like s._in([1, 2]) and output something like
0 True
1 True
2 False
Yes, it is called isin. Do s.isin([1, 2]).
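For completeness, a quick runnable check:

```python
import pandas as pd

s = pd.Series([1, 2, 3])

# isin tests each element for membership in the given sequence
print(s.isin([1, 2]).tolist())  # [True, True, False]
```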
