Unique values from pandas.Series [duplicate] - python

This question already has answers here:
in operator, float("NaN") and np.nan
(2 answers)
Closed 5 years ago.
Consider the following pandas.Series:
import pandas as pd
import numpy as np
s = pd.Series([np.nan, 1, 1, np.nan])
s
0 NaN
1 1.0
2 1.0
3 NaN
dtype: float64
I want to find only unique values in this particular series using the built-in set function:
unqs = set(s)
unqs
{nan, 1.0, nan}
Why are there duplicate NaNs in the resultant set? Using a similar function (pandas.unique) does not produce this result, so what's the difference, here?
pd.unique(s)
array([ nan, 1.])

As in Java and JavaScript, nan in numpy does not equal itself:
>>> np.nan == np.nan
False
This means that when the set constructor checks "do I have an instance of nan in this set yet?", the equality test always returns False, so each nan object gets added as if it were new.
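A minimal illustration of that mechanism (not part of the original answer): Python's set checks identity before equality, so a single nan object is deduplicated, but two distinct nan objects, which never compare equal, are both kept. Iterating over the Series yields a separate nan object for each missing entry, which is why set(s) ends up with duplicates.
>>> a = float("nan")
>>> b = float("nan")
>>> a == b          # NaN never equals NaN
False
>>> {a, a}          # same object: the identity check deduplicates it
{nan}
>>> {a, b}          # two distinct objects that never compare equal: both kept
{nan, nan}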
So… why?
nan in both cases means "not a number": a float placeholder for a value that cannot be represented as a meaningful number. Any comparison with it fails, and it cannot be sorted, because there is no way to tell whether nan should be larger or smaller than any other number.
After all, which is bigger, "cat" or 7? And is "goofy" == "pluto"?
SO… what do I do?
There are a couple of ways to resolve this problem. Personally, I generally try to fill nan before processing: DataFrame.fillna will help with that, and I would use .unique() on the Series to get the unique values.
no_nas = s.dropna().unique()
with_nas = s.unique()
with_replaced_nas = s.fillna(-1).unique() # using a placeholder
(Note: all of the above can be passed into the set constructor.)
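For example, with the Series s from the question, the sets come out already deduplicated:
set(no_nas)    # {1.0}
set(with_nas)  # {nan, 1.0} - only one nan, because pandas already deduplicated it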
What if I don't want to use the Pandas way?
There are reasons not to use Pandas, or to rely on native objects instead of Pandas. In that case, the following should suffice.
Your other option is to filter and remove the nan.
unqs = set(item for item in s if not np.isnan(item))
You could also replace things inline:
placeholder = '{placeholder}' # There are a variety of placeholder options.
unqs = set(item if not np.isnan(item) else placeholder for item in s)
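With the question's Series, these yield (a quick check, not from the original answer):
set(item for item in s if not np.isnan(item))              # {1.0}
set(item if not np.isnan(item) else placeholder for item in s)
# {1.0, '{placeholder}'}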


Why does pandas use "NaN" from numpy, instead of its own null value?

This is somewhat of a broad topic, but I will try to pare it down to some specific questions.
In starting to answer questions on SO, I have found myself sometimes running into a silly error like this when making toy data:
In[0]:
import pandas as pd
df = pd.DataFrame({"values":[1,2,3,4,5,6,7,8,9]})
df[df < 5] = np.nan
Out[0]:
NameError: name 'np' is not defined
I'm so used to automatically importing numpy with pandas that this doesn't usually occur in real code. However, it did make me wonder why pandas doesn't have its own value/object for representing null values.
I only recently realized that you could just use the Python None instead for a similar situation:
import pandas as pd
df = pd.DataFrame({"values":[1,2,3,4,5,6,7,8,9]})
df[df < 5] = None
This works as expected and doesn't produce an error. However, the convention I have seen on SO is to use np.nan, and people usually mean np.nan when discussing null values (which is perhaps why I hadn't realized None could be used, though maybe that was my own idiosyncrasy).
Briefly looking into this, I see now that pandas has had a pandas.NA value since 1.0.0, but I have never seen anyone use it in a post:
In[0]:
import pandas as pd
import numpy as np
df = pd.DataFrame({'values':np.random.rand(20,)})
df['above'] = df['values']
df['below'] = df['values']
df['above'][df['values']>0.7] = np.nan
df['below'][df['values']<0.3] = pd.NA
df['names'] = ['a','b','c','a','b','c','a','b','c','a']*2
df.loc[df['names']=='a','names'] = pd.NA
df.loc[df['names']=='b','names'] = np.nan
df.loc[df['names']=='c','names'] = None
df
Out[0]:
values above below names
0 0.323531 0.323531 0.323531 <NA>
1 0.690383 0.690383 0.690383 NaN
2 0.692371 0.692371 0.692371 None
3 0.259712 0.259712 NaN <NA>
4 0.473505 0.473505 0.473505 NaN
5 0.907751 NaN 0.907751 None
6 0.642596 0.642596 0.642596 <NA>
7 0.229420 0.229420 NaN NaN
8 0.576324 0.576324 0.576324 None
9 0.823715 NaN 0.823715 <NA>
10 0.210176 0.210176 NaN <NA>
11 0.629563 0.629563 0.629563 NaN
12 0.481969 0.481969 0.481969 None
13 0.400318 0.400318 0.400318 <NA>
14 0.582735 0.582735 0.582735 NaN
15 0.743162 NaN 0.743162 None
16 0.134903 0.134903 NaN <NA>
17 0.386366 0.386366 0.386366 NaN
18 0.313160 0.313160 0.313160 None
19 0.695956 0.695956 0.695956 <NA>
So it seems that for numerical values, the distinction between these different null values doesn't matter, but they are represented differently for strings (and perhaps for other data types?).
My questions based on the above:
Is it conventional to use np.nan (rather than None) to represent null values in pandas?
Why did pandas not have its own null value for most of its lifetime (until last year)? What was the motivation for adding it?
In cases where you can have multiple types of missing values in one Series or column, is there any difference between them? Why are they not represented identically (as with numerical data)?
I fully anticipate that I may have a flawed interpretation of things and the distinction between pandas and numpy, so please correct me.
A main dependency of pandas is numpy; in other words, pandas is built on top of numpy. Because pandas inherits and uses many of the numpy methods, it makes sense to keep things consistent: missing numeric data are represented with np.NaN.
(This choice to build upon numpy has consequences for other things too. For instance date and time operations are built upon the np.timedelta64 and np.datetime64 dtypes, not the standard datetime module.)
One thing you may not have known is that numpy has always shipped along with pandas:
import pandas as pd
pd.np?
pd.np.nan
Though you might find this convenient because you don't have to import numpy yourself, it is discouraged and will soon be deprecated in favor of importing numpy directly:
FutureWarning: The pandas.np module is deprecated and will be removed
from pandas in a future version. Import numpy directly instead
Is it conventional to use np.nan (rather than None) to represent null values in pandas?
If the data are numeric then yes, you should use np.NaN. None requires the dtype to be object, and with pandas you want numeric data stored in a numeric dtype. pandas will generally coerce to the proper null type upon creation or import so that it can use the correct dtype:
pd.Series([1, None])
#0 1.0
#1 NaN <- None became NaN so it can have dtype: float64
#dtype: float64
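By contrast, and as a small aside not in the original answer, an object-dtype Series keeps None as-is while pandas still recognises it as missing:
pd.Series(['a', None])
#0 a
#1 None
#dtype: object
pd.Series(['a', None]).isna()
#0 False
#1 True
#dtype: bool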
Why did pandas not have its own null value for most of its lifetime (until last year)? What was the motivation for adding it?
pandas did not have its own null value because it got by with np.NaN, which worked for the majority of circumstances. However, missing data are very common with pandas; an entire section of the documentation is devoted to them. NaN, being a float, does not fit into an integer container, which means that any numeric Series with missing data is upcast to float. This can become problematic because of floating-point math: some integers cannot be represented perfectly by a floating-point number, so joins or merges could fail as a result.
# Gets upcast to float
pd.Series([1,2,np.NaN])
#0 1.0
#1 2.0
#2 NaN
#dtype: float64
# Can safely do merges/joins/math because things are still Int
pd.Series([1,2,np.NaN]).astype('Int64')
#0 1
#1 2
#2 <NA>
#dtype: Int64
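To make the practical difference between the markers concrete, here is a short sketch (based on current pandas behaviour, not taken from the original answer): np.nan comparisons always return False, pd.NA propagates as "unknown", and pd.isna treats every flavour of missing value the same way.
np.nan == 1                                      # False
pd.NA == 1                                       # <NA> - the comparison itself is "unknown"
pd.isna(np.nan), pd.isna(None), pd.isna(pd.NA)   # (True, True, True)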
First, you could unify the nan values with a filter function that maps them all to a single value, say None.
My guess is that the distinction exists to keep values distinguishable when mining data that comes from numpy calculations and the like, so the pandas nan can mean something different. Maybe that does not matter in your particular case, but it can carry meaning in others.
That's a great question!
My hunch is that this has to do with the fact that NumPy functions are implemented in C, which is what makes them so fast. Python's None might not give you the same efficiency (or is probably translated into np.nan), while pandas's pd.NA would likely be translated into NumPy's np.nan anyway, since pandas requires NumPy.
I haven't found resources to support my claims yet, though.

faster replacement of -1 and 0 to NaNs in column for a large dataset

'azdias' is a dataframe holding my main dataset; its metadata (a feature summary) lives in the dataframe 'feat_info'. For each column, 'feat_info' lists the values that should be treated as NaN.
For example, column1 has [-1,0] listed as NaN values, so my job is to find -1 and 0 in column1 and replace them with NaN.
azdias dataframe:
feat_info dataframe:
I have tried the following in a Jupyter notebook:
def NAFunc(x, miss_unknown_list):
    x_output = x
    for i in miss_unknown_list:
        try:
            miss_unknown_value = float(i)
        except ValueError:
            miss_unknown_value = i
        if x == miss_unknown_value:
            x_output = np.nan
            break
    return x_output

for cols in azdias.columns.tolist():
    NAList = feat_info[feat_info.attribute == cols]['missing_or_unknown'].values[0]
    azdias[cols] = azdias[cols].apply(lambda x: NAFunc(x, NAList))
Question 1: I am trying to impute NaN values, but my code is very slow. I wish to speed up the execution.
I have attached sample of both dataframes:
azdias_sample
AGER_TYP ALTERSKATEGORIE_GROB ANREDE_KZ CJT_GESAMTTYP FINANZ_MINIMALIST
0 -1 2 1 2.0 3
1 -1 1 2 5.0 1
2 -1 3 2 3.0 1
3 2 4 2 2.0 4
4 -1 3 1 5.0 4
feat_info_sample
attribute information_level type missing_or_unknown
AGER_TYP person categorical [-1,0]
ALTERSKATEGORIE_GROB person ordinal [-1,0,9]
ANREDE_KZ person categorical [-1,0]
CJT_GESAMTTYP person categorical [0]
FINANZ_MINIMALIST person ordinal [-1]
If the azdias dataset is obtained from read_csv or a similar IO function, the na_values keyword argument can be used to specify column-specific missing-value representations, so that the returned data frame already has NaN values in place from the very beginning. Sample code is shown below.
from ast import literal_eval
feat_info.set_index("attribute", inplace=True)
# A more concise but less efficient alternative is
# na_dict = feat_info["missing_or_unknown"].apply(literal_eval).to_dict()
na_dict = {attr: literal_eval(val) for attr, val in feat_info["missing_or_unknown"].items()}
df_azdias = pd.read_csv("azdias.csv", na_values=na_dict)
As for the data type, there is no built-in NaN representation for integer data types. Hence a float data type is needed. If the missing values are imputed using fillna, the downcast argument can be specified to make the returned series or data frame have an appropriate data type.
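A minimal sketch of that fillna/downcast idea (the fill value 0 is just a placeholder, and the downcast argument may be deprecated in newer pandas versions):
s = pd.Series([1.0, np.nan, 3.0])     # float64 only because of the NaN
s.fillna(0, downcast='infer')
#0 1
#1 0
#2 3
#dtype: int64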
Try using the DataFrame's replace method. How about this?
for c in azdias.columns.tolist():
    replace_list = feat_info[feat_info['attribute'] == c]['missing_or_unknown'].values
    azdias[c] = azdias[c].replace(to_replace=list(replace_list), value=np.nan)
A couple things I'm not sure about without being able to execute your code:
In your example, you used .values[0]. Don't you want all the values?
I'm not sure if it's necessary to do to_replace=list(replace_list), it may work to just use to_replace=replace_list.
In general, I recommend thinking to yourself "surely Pandas has a function to do this for me." Often, they do. For performance with Pandas generally, avoid looping over and setting things. Vectorized methods tend to be much faster.
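In that spirit, here is a vectorized sketch, assuming (as in the question) that missing_or_unknown stores string-encoded lists such as "[-1,0]"; Series.mask sets the matching positions to NaN without applying a Python function to every cell:
from ast import literal_eval

na_map = feat_info.set_index('attribute')['missing_or_unknown'].apply(literal_eval)
for col, na_vals in na_map.items():
    # isin builds one boolean mask per column; mask() turns those positions into NaN
    azdias[col] = azdias[col].mask(azdias[col].isin(na_vals))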

Average of a numpy array returns NaN

I have an np.array with over 330,000 rows. I simply try to take the average of it and it returns NaN. Even if I try to filter out any potential NaN values in my array (there shouldn't be any anyways), average returns NaN. Am I doing something totally wacky?
My code is here:
average(ngma_heat_daily)
Out[70]: nan
average(ngma_heat_daily[ngma_heat_daily != nan])
Out[71]: nan
try this:
>>> np.nanmean(ngma_heat_daily)
This function drops NaN values from your array before taking the mean.
Edit: the reason that average(ngma_heat_daily[ngma_heat_daily != nan]) doesn't work is because of this:
>>> np.nan == np.nan
False
According to the IEEE floating-point standard, NaN is never equal to itself, so the != nan comparison keeps every element (including the NaNs). You could do this instead to implement the same idea:
>>> average(ngma_heat_daily[~np.isnan(ngma_heat_daily)])
np.isnan, np.isinf, and similar functions are very useful for this type of data masking.
Also, there is a function named nanmedian which ignores NaN values. Its signature is: numpy.nanmedian(a, axis=None, out=None, overwrite_input=False, keepdims=<no value>)
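A tiny illustration of the difference (made-up numbers):
>>> arr = np.array([1.0, np.nan, 3.0])
>>> np.mean(arr)       # the plain mean propagates the NaN
nan
>>> np.nanmean(arr)    # NaN entries are ignored
2.0
>>> np.nanmedian(arr)
2.0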

Erratic NaN behaviour in numpy/pandas

I've been trying to replace missing values in a Pandas dataframe, but without success. I tried the .fillna method and also tried to loop through the entire data set, checking each cell and replacing NaNs with a chosen value. However, in both cases, Python executes the script without throwing up any errors, but the NaN values remain.
When I dug a bit deeper, I discovered behaviour that seems erratic to me, best demonstrated with an example:
In[ ] X['Smokinginpregnancy'].head()
Out[ ]
Index
E09000002 NaN
E09000003 5.216126
E09000004 10.287496
E09000005 3.090379
E09000006 6.080041
Name: Smokinginpregnancy, dtype: float64
I know for a fact that the first item in this column is missing and pandas recognises it as NaN. In fact, if I call this item on its own, python tells me it's NaN:
In [ ] X['Smokinginpregnancy'][0]
Out [ ]
nan
However, when I test whether it's NaN, python returns False.
In [ ] X['Smokinginpregnancy'][0] == np.nan
Out [ ] False
I suspect that when .fillna is being executed, python checks whether the item is NaN but gets back a False, so it continues, leaving the cell alone.
Does anyone know what's going on? Any solutions? (apart from opening the csv file in excel and then manually replacing the values.)
I'm using Anaconda's Python 3 distribution.
You are doing:
X['Smokinginpregnancy'][0] == np.nan
This is guaranteed to return False because all NaNs compare unequal to everything under the IEEE 754 standard:
>>> x = float('nan')
>>> x == x
False
>>> x == 1
False
>>> x == float('nan')
False
See also here.
You have to use math.isnan to check for NaNs:
>>> math.isnan(x)
True
Or numpy.isnan
So use:
numpy.isnan(X['Smokinginpregnancy'][0])
Regarding pandas.fillna, note that this method returns the filled DataFrame; it does not modify the original by default. Maybe you did something like:
X.fillna(...)
without reassigning X? Alternatively, you must pass inplace=True to mutate the dataframe on which you call the method.
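In other words, a small sketch (with 0 standing in for whatever fill value you want):
X = X.fillna(0)             # reassign the returned, filled dataframe
# or
X.fillna(0, inplace=True)   # mutate X in place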
NaN values in pandas can be checked with the function pandas.isnull. Below I create a boolean mask and return the subset of rows with NaN values.
The function fillna can be used on the single column Smokinginpregnancy (more info in the docs):
X['Smokinginpregnancy'] = X['Smokinginpregnancy'].fillna('100')
or
X['Smokinginpregnancy'].fillna('100', inplace=True)
Warning:
Sometimes inplace=True can be silently ignored, so it is better not to rely on it.
All together:
print(X['Smokinginpregnancy'].head())
#Index
#E09000002 NaN
#E09000003 5.216126
#E09000004 10.287496
#E09000005 3.090379
#E09000006 6.080041
#check NaN in column Smokinginpregnancy by boolean mask
mask = pd.isnull(X['Smokinginpregnancy'])
XNaN = X[mask]
print(XNaN)
# Smokinginpregnancy
#Index
#E09000002 NaN
#use function fillna for column Smokinginpregnancy
#X['Smokinginpregnancy'] = X['Smokinginpregnancy'].fillna('100')
X['Smokinginpregnancy'].fillna('100', inplace=True)
print(X)
# Smokinginpregnancy
#Index
#E09000002 100
#E09000003 5.216126
#E09000004 10.2875
#E09000005 3.090379
#E09000006 6.080041
More information on why the comparison doesn't work:
One has to be mindful that in Python (and numpy), nan's don't compare equal, but None's do. Note that pandas/numpy uses the fact that np.nan != np.nan, and treats None like np.nan. More info is in Bakuriu's answer.
In [11]: None == None
Out[11]: True
In [12]: np.nan == np.nan
Out[12]: False

Proper way to use "opposite boolean" in Pandas data frame boolean indexing

I wanted to use a boolean indexing, checking for rows of my data frame where a particular column does not have NaN values. So, I did the following:
import pandas as pd
my_df.loc[pd.isnull(my_df['col_of_interest']) == False].head()
to see a snippet of that data frame, including only the values that are not NaN (most values are NaN).
It worked, but seems less-than-elegant. I'd want to type:
my_df.loc[!pd.isnull(my_df['col_of_interest'])].head()
However, that generated an error. I also spend a lot of time in R, so maybe I'm confusing things. In Python, I usually reach for not where I can, for instance if x is not None:, but I couldn't really do that here. Is there a more elegant way? I don't like having to put in a senseless comparison.
In general with pandas (and numpy), we use the bitwise NOT ~ instead of ! or not (whose behaviour can't be overridden by types).
While in this case we have notnull, ~ can come in handy in situations where there's no special opposite method.
>>> df = pd.DataFrame({"a": [1, 2, np.nan, 3]})
>>> df.a.isnull()
0 False
1 False
2 True
3 False
Name: a, dtype: bool
>>> ~df.a.isnull()
0 True
1 True
2 False
3 True
Name: a, dtype: bool
>>> df.a.notnull()
0 True
1 True
2 False
3 True
Name: a, dtype: bool
(For completeness I'll note that -, the unary negative operator, will also work on a boolean Series but ~ is the canonical choice, and - has been deprecated for numpy boolean arrays.)
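For the original use case, the inverted mask can then be used directly for row selection (same toy df as above):
>>> df[~df.a.isnull()]    # equivalent to df[df.a.notnull()]
     a
0  1.0
1  2.0
3  3.0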
Instead of using pandas.isnull(), you should use pandas.notnull() to find the rows where the column has non-null values. Example -
import pandas as pd
my_df.loc[pd.notnull(my_df['col_of_interest'])].head()
pandas.notnull() is the boolean inverse of pandas.isnull(), as given in the documentation:
See also: pandas.notnull - boolean inverse of pandas.isnull
