I've been trying to replace missing values in a Pandas dataframe, but without success. I tried the .fillna method and also tried to loop through the entire data set, checking each cell and replacing NaNs with a chosen value. However, in both cases, Python executes the script without throwing up any errors, but the NaN values remain.
When I dug a bit deeper, I discovered behaviour that seems erratic to me, best demonstrated with an example:
In[ ] X['Smokinginpregnancy'].head()
Out[ ]
Index
E09000002 NaN
E09000003 5.216126
E09000004 10.287496
E09000005 3.090379
E09000006 6.080041
Name: Smokinginpregnancy, dtype: float64
I know for a fact that the first item in this column is missing and pandas recognises it as NaN. In fact, if I call this item on its own, python tells me it's NaN:
In [ ] X['Smokinginpregnancy'][0]
Out [ ]
nan
However, when I test whether it's NaN, python returns False.
In [ ] X['Smokinginpregnancy'][0] == np.nan
Out [ ] False
I suspect that when .fillna is being executed, python checks whether the item is NaN but gets back a False, so it continues, leaving the cell alone.
Does anyone know what's going on? Any solutions? (apart from opening the csv file in excel and then manually replacing the values.)
I'm using Anaconda's Python 3 distribution.
You are doing:
X['Smokinginpregnancy'][0] == np.nan
This is guaranteed to return False because all NaNs compare unequal to everything by IEEE754 standard:
>>> x = float('nan')
>>> x == x
False
>>> x == 1
False
>>> x == float('nan')
False
See also here.
You have to use math.isnan to check for NaNs:
>>> math.isnan(x)
True
Or numpy.isnan
So use:
numpy.isnan(X['Smokinginpregnancy'][0])
Regarding pandas.fillna note that this function returns the filled array. Maybe you did something like:
X.fillna(...)
without reassigning X? Alternatively you must pass inplace=True to mutate the dataframe on which you are calling the method.
NaN in pandas can be check function pandas.isnull. I created boolean mask and return subset with NaN values.
Function filnna can be used for one column Smokinginpregnancy (more info in doc):
X['Smokinginpregnancy'] = X['Smokinginpregnancy'].fillna('100')
or
X['Smokinginpregnancy'].fillna('100', inplace=True)
Warning:
Sometimes inplace=True can be ignored, better is not use. - link, github, github 3 comments.
All together:
print X['Smokinginpregnancy'].head()
#Index
#E09000002 NaN
#E09000003 5.216126
#E09000004 10.287496
#E09000005 3.090379
#E09000006 6.080041
#check NaN in column Smokinginpregnancy by boolean mask
mask = pd.isnull(X['Smokinginpregnancy'])
XNaN = X[mask]
print XNaN
# Smokinginpregnancy
#Index
#E09000002 NaN
#use function fillna for column Smokinginpregnancy
#X['Smokinginpregnancy'] = X['Smokinginpregnancy'].fillna('100')
X['Smokinginpregnancy'].fillna('100', inplace=True)
print X
# Smokinginpregnancy
#Index
#E09000002 100
#E09000003 5.216126
#E09000004 10.2875
#E09000005 3.090379
#E09000006 6.080041
More information, why comparison doesn't work:
One has to be mindful that in python (and numpy), the nan's don’t compare equal, but None's do. Note that Pandas/numpy uses the fact that np.nan != np.nan, and treats None like np.nan. More info in Bakuriu's answer.
In [11]: None == None
Out[11]: True
In [12]: np.nan == np.nan
Out[12]: False
Related
My code is like:
a = pd.DataFrame([np.nan, True])
b = pd.DataFrame([True, np.nan])
c = a|b
print(c)
I don't know the result of logical operation result when one element is np.nan, but I expect them to be the same whatever the oder. But I got the result like this:
0
0 False
1 True
Why? Is this about short circuiting in pandas? I searched the doc of pandas but did not find answer.
My pandas version is 1.1.3
This is behaviour that is tied to np.nan, not pandas. Take the following examples:
print(True or np.nan)
print(np.nan or True)
Output:
True
nan
When performing the operation, dtype ends up mattering and the way that np.nan functions within the numpy library is what leads to this strange behaviour.
To get around this quirk, you can fill NaN values with False for example or some other token value which evaluates to False (using pandas.DataFrame.fillna()).
What could be the reason, why the max and min of my numpy array is nan?
I checked my array with:
for i in range(data[0]):
if data[i] == numpy.nan:
print("nan")
And there is no nan in my data.
Is my search wrong?
If not: What could be the reason for max and min being nan?
Here you go:
import numpy as np
a = np.array([1, 2, 3, np.nan, 4])
print(f'a.max() = {a.max()}')
print(f'np.nanmax(a) = {np.nanmax(a)}')
print(f'a.min() = {a.min()}')
print(f'np.nanmin(a) = {np.nanmin(a)}')
Output:
a.max() = nan
np.nanmax(a) = 4.0
a.min() = nan
np.nanmin(a) = 1.0
Balaji Ambresh showed precisely how to find min / max even
if the source array contains NaN, there is nothing to add on this matter.
But your code sample contains also other flaws that deserve to be pointed out.
Your loop contains for i in range(data[0]):.
You probably wanted to execute this loop for each element of data,
but your loop will be executed as many times as the value of
the initial element of data.
Variations:
If it is e.g. 1, it will be executed only once.
If it is 0 or negative, it will not be executed at all.
If it is >= than the size of data, IndexError exception
will be raised.
If your array contains at least 1 NaN, then the whole array
is of float type (NaN is a special case of float) and you get
TypeError exception: 'numpy.float64' object cannot be interpreted
as an integer.
Remedium (one of possible variants): This loop should start with
for elem in data: and the code inside should use elem as the
current element of data.
The next line contains if data[i] == numpy.nan:.
Even if you corrected it to if elem == np.nan:, the code inside
the if block will never be executed.
The reason is that np.nan is by definition not equal to any
other value, even it this other value is another np.nan.
Remedium: Change to if np.isnan(elem): (Balaji wrote in his comment
how to change your code, I added why).
And finally: How to check quickly an array for NaNs:
To get a detailed list, whether each element is NaN, run np.isnan(data)
and you will get a bool array.
To get a single answer, whether data contains at least one NaN,
no matter where, run np.isnan(data).any().
This code is shorter and runs significantly faster.
The reason is that np.nan == x is always False, even when x is np.nan . This is aligned with the NaN definition in Wikipedia.
Check yourself:
In [4]: import numpy as np
In [5]: np.nan == np.nan
Out[5]: False
If you want to check if a number x is np.nan, you must use
np.isnan(x)
If you want to get max/min of an np.array with nan's, use np.nanmax()/ np.nanmin():
minval = np.nanmin(data)
Easy use np.nanmax(variable_name) and np.nanmin(variable_name)
import numpy as np
z=np.arange(10,20)
z=np.where(z<15,np.nan,z)#Making below 15 z value as nan.
print(z)
print("z max value excluding nan :",np.nanmax(z))
print("z min value excluding nan :",np.nanmin(z))
I am trying to figure out whether or not a column in a pandas dataframe is boolean or not (and if so, if it has missing values and so on).
In order to test the function that I created I tried to create a dataframe with a boolean column with missing values. However, I would say that missing values are handled exclusively 'untyped' in python and there are some weird behaviours:
> boolean = pd.Series([True, False, None])
> print(boolean)
0 True
1 False
2 None
dtype: object
so the moment you put None into the list, it is being regarded as object because python is not able to mix the types bool and type(None)=NoneType back into bool. The same thing happens with math.nan and numpy.nan. The weirdest things happen when you try to force pandas into an area it does not want to go to :-)
> boolean = pd.Series([True, False, np.nan]).astype(bool)
> print(boolean)
0 True
1 False
2 True
dtype: bool
So 'np.nan' is being casted to 'True'?
Questions:
Given a data table where one column is of type 'object' but in fact it is a boolean column with missing values: how do I figure that out? After filtering for the non-missing values it is still of type 'object'... do I need to implement a try-catch-cast of every column into every imaginable data type in order to see the true nature of columns?
I guess that there is a logical explanation of why np.nan is being casted to True but this is an unwanted behaviour of the software pandas/python itself, right? So should I file a bug report?
Q1: I would start with combining
np.any(pd.isna(boolean))
to identify if there are any None Values in a column, and with
set(boolean)
You can identify, if there are only True, False and Nones inside. Combining with filtering (and if you prefer to also typcasting) you should be done.
Q2: see comment of #WeNYoBen
I've hit the same problem. I came up with the following solution:
from pandas import Series
def is_boolean_series(col: Series):
val = col[~col.isna()].iloc[0]
return type(val) == bool
This question already has answers here:
in operator, float("NaN") and np.nan
(2 answers)
Closed 5 years ago.
Consider the following pandas.Series:
import pandas as pd
import numpy as np
s = pd.Series([np.nan, 1, 1, np.nan])
s
0 NaN
1 1.0
2 1.0
3 NaN
dtype: float64
I want to find only unique values in this particular series using the built-in set function:
unqs = set(s)
unqs
{nan, 1.0, nan}
Why are there duplicate NaNs in the resultant set? Using a similar function (pandas.unique) does not produce this result, so what's the difference, here?
pd.unique(s)
array([ nan, 1.])
Like in Java, and JavaScript, nan in numpy does not equal itself.
>>> np.nan == np.nan
False
This means when the set constructor checks "do I have an instance of nan in this set yet?" it alwasy returns False
So… why?
nan in both cases means "value which cannot be represented by 'float'". This means an attempt to convert it to float necessarily fails. It's also unable to be sorted, because there's no way to tell if nan is supposed to be larger or smaller than any number.
After all, which is bigger "cat" or 7? And is "goofy" == "pluto"?
SO… what do I do?
There are a couple of ways to resolve this problem. Personally, I generally try to fill nan before processing: DataFrame.fillna will help with that, and I would always use df.unique() to get a set of unique values.
no_nas = s.dropna().unique()
with_nas = s.unique()
with_replaced_nas = s.fillna(-1).unique() # using a placeholder
(note: all of the above can be passed into the set constructor.
What if I don't want to use the Pandas way?
There are reasons not to use Pandas, or to rely on native objects instead of Pandas. These should suffice.
Your other option is to filter and remove the nan.
unqs = set(item for item in s if not np.isnan(item))
You could also replace things inline:
placeholder = '{placeholder}' # There are a variety of placeholder options.
unqs = set(item if not np.isnan(item) else placeholder for item in s)
I have an if statement where it checks if the data frame is not empty. The way I do it is the following:
if dataframe.empty:
pass
else:
#do something
But really I need:
if dataframe is not empty:
#do something
My question is - is there a method .not_empty() to achieve this? I also wanted to ask if the second version is better in terms of performance? Otherwise maybe it makes sense for me to leave it as it is i.e. the first version?
Just do
if not dataframe.empty:
# insert code here
The reason this works is because dataframe.empty returns True if dataframe is empty. To invert this, we can use the negation operator not, which flips True to False and vice-versa.
.empty returns a boolean value
>>> df_empty.empty
True
So if not empty can be written as
if not df.empty:
#Your code
Check pandas.DataFrame.empty
, might help someone.
You can use the attribute dataframe.empty to check whether it's empty or not:
if not dataframe.empty:
#do something
Or
if len(dataframe) != 0:
#do something
Or
if len(dataframe.index) != 0:
#do something
As already clearly explained by other commentators, you can negate a boolean expression in Python by simply prepending the not operator, hence:
if not df.empty:
# do something
does the trick.
I only want to clarify the meaning of "empty" in this context, because it was a bit confusing for me at first.
According to the Pandas documentation, the DataFrame.empty method returns True if any of the axes in the DataFrame are of length 0.
As a consequence, "empty" doesn't mean zero rows and zero columns, like someone might expect. A dataframe with zero rows (axis 1 is empty) but non-zero columns (axis 2 is not empty) is still considered empty:
> df = pd.DataFrame(columns=["A", "B", "C"])
> df.empty
True
Another interesting point highlighted in the documentation is a DataFrame that only contains NaNs is not considered empty.
> df = pd.DataFrame(columns=["A", "B", "C"], index=['a', 'b', 'c'])
> df
A B C
a NaN NaN NaN
b NaN NaN NaN
c NaN NaN NaN
> df.empty
False
No doubt that the use of empty is the most comprehensive in this case (explicit is better than implicit).
However, the most efficient in term of computation time is through the usage of len :
if not len(df.index) == 0:
# insert code here
Source : this answer.
Another way:
if dataframe.empty == False:
#do something`