I am trying to figure out whether a column in a pandas dataframe is boolean (and, if so, whether it has missing values and so on).
To test the function I wrote, I tried to create a dataframe with a boolean column containing missing values. However, it seems that missing values are handled essentially 'untyped' in Python, and there is some weird behaviour:
> boolean = pd.Series([True, False, None])
> print(boolean)
0 True
1 False
2 None
dtype: object
So the moment you put None into the list, the column is treated as object, because pandas cannot fold the types bool and type(None)=NoneType back into bool. The same thing happens with math.nan and numpy.nan. The weirdest things happen when you try to force pandas into an area it does not want to go to :-)
> boolean = pd.Series([True, False, np.nan]).astype(bool)
> print(boolean)
0 True
1 False
2 True
dtype: bool
So 'np.nan' is being cast to 'True'?
Questions:
1. Given a data table where one column is of type 'object' but is in fact a boolean column with missing values: how do I figure that out? After filtering for the non-missing values it is still of type 'object'... Do I need to implement a try-catch-cast of every column into every imaginable data type in order to see the true nature of the columns?
2. I guess there is a logical explanation for why np.nan is cast to True, but this is unwanted behaviour of pandas/Python itself, right? So should I file a bug report?
Q1: I would start by combining
np.any(pd.isna(boolean))
to check whether there are any missing values in the column, with
set(boolean)
to check whether it contains only True, False and None. Combined with filtering (and, if you prefer, typecasting), you should be done.
Q2: see the comment by @WeNYoBen
I've hit the same problem. I came up with the following solution:
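from pandas import Series
def is_boolean_series(col: Series) -> bool:
    non_missing = col[~col.isna()]
    # An all-missing column has no value to inspect
    if non_missing.empty:
        return False
    # Only the first non-missing value is checked
    return isinstance(non_missing.iloc[0], bool)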
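The two ideas above (drop the missing values, then inspect what is left) can be combined into one check that looks at every remaining value rather than only the first. A minimal sketch, assuming a hypothetical helper named is_boolean_like (it is not part of pandas) and made-up sample data:

```python
import numpy as np
import pandas as pd


def is_boolean_like(col: pd.Series) -> bool:
    """Hypothetical helper: True if all non-missing values in `col` are booleans."""
    # Native bool and nullable "boolean" columns are boolean by dtype.
    if pd.api.types.is_bool_dtype(col.dtype):
        return True
    non_missing = col.dropna()
    # An empty or all-missing column gives no evidence either way.
    if non_missing.empty:
        return False
    return all(isinstance(v, (bool, np.bool_)) for v in non_missing)


with_none = pd.Series([True, False, None])   # stored as object
numeric = pd.Series([0, 1, None])            # stored as float64

print(is_boolean_like(with_none))  # True
print(is_boolean_like(numeric))    # False

# Once confirmed, the column can be moved to the nullable boolean dtype,
# which keeps the missing value instead of turning it into True or False.
converted = with_none.astype("boolean")
print(converted.isna().sum())      # 1
```

Checking every value costs a full pass over the column, but it avoids classifying a mixed column as boolean just because its first value happens to be a bool.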
I have a dataframe whose columns hold attributes of males and females. There is a column df['long_hair'] with 0=no and 1=yes. I want to fill the missing values in this column depending on gender. This is my code. The problem is that inplace does not work on a conditionally selected column. So how can I do this?
df[df['Male']==1]['long_hair'].fillna(0,inplace=True)
This code means get the people who are male and fill the missing values with 0 (meaning they don't have long hair).
You can work around this by assigning the result of pd.Series.where. From the documentation:
cond bool Series/DataFrame, array-like, or callable
Where cond is True, keep the original value. Where False, replace with corresponding value from other. If cond is callable, it is computed on the Series/DataFrame and should return boolean Series/DataFrame or array. The callable must not change input Series/DataFrame (though pandas doesn’t check it).
df.long_hair = df.long_hair.where((df.Male != 1) | df.long_hair.notnull(), 0)
This keeps the value when the person is not male or the value is not null; the remaining cells (males with a missing value) are replaced with 0.
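To make the masking logic concrete, here is a small sketch on made-up data. The column names Male and long_hair come from the question; the values are invented for illustration. It shows Series.where with an explicit replacement value next to an equivalent .loc assignment:

```python
import numpy as np
import pandas as pd

# Invented sample: two males and two females, one missing value each.
df = pd.DataFrame({
    "Male": [1, 1, 0, 0],
    "long_hair": [np.nan, 1.0, np.nan, 1.0],
})

# Option A: keep the value where the condition holds, otherwise use 0.
keep = (df["Male"] != 1) | df["long_hair"].notna()
filled_where = df["long_hair"].where(keep, 0)

# Option B: assign through a single .loc call, which avoids chained indexing.
filled_loc = df.copy()
mask = (filled_loc["Male"] == 1) & filled_loc["long_hair"].isna()
filled_loc.loc[mask, "long_hair"] = 0

print(filled_where.tolist())  # [0.0, 1.0, nan, 1.0]
```

Both variants leave the missing value for the female row untouched, which is the point of filling per gender.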
Why does the value None convert to both True and False in this series?
Env:
Jupyter Notebook 6.0.3 in JupyterLab
Python 3.7.6
Imports:
from IPython.display import display
import pandas as pd
Converts None to True:
df_test1 = pd.DataFrame({'test_column':[0,1,None]})
df_test1['test_column'] = df_test1.test_column.astype(bool)
display(df_test1)
Converts None to False:
df_test2 = pd.DataFrame({'test_column':[0,1,None,'test']})
df_test2['test_column'] = df_test2.test_column.astype(bool)
display(df_test2)
Is this expected behavior?
Yes, this is expected behaviour; it follows from the initial dtype (storage type) of each series (column). The first input results in a series of floating point numbers, while the second contains references to Python objects:
>>> pd.Series([0,1,None]).dtype
dtype('float64')
>>> pd.Series([0,1,None,'test']).dtype
dtype('O')
The float version of None is NaN, or Not a Number, which converts to True when interpreted as a boolean (as it is not equal to 0):
>>> pd.Series([0,1,None])[2]
nan
>>> bool(pd.Series([0,1,None])[2])
True
In the other case, the original None object was preserved, which converts to False:
>>> pd.Series([0,1,None,'test'])[2] is None
True
>>> bool(None)
False
So this comes down to automatic type inference, i.e. what type Pandas thinks is best suited for each column; see the DataFrame.infer_objects() method. The goal is to minimise storage requirements and maximise operation performance; storing numbers as native 64-bit floating point values leads to faster numeric operations and a smaller memory footprint, while still being able to represent 'missing' values as NaN.
However, when you pass in a mix of numbers and strings, Pandas can't use a dedicated specialised array type and so falls back to the "Python object" type, which stores references to the original Python objects.
Instead of letting Pandas guess which type you need, you could explicitly specify the type to be used. You could use one of the nullable integer types (which use pandas.NA instead of NaN); in the pandas version used here, converting these to booleans results in missing values converting to False:
>>> pd.Series([0,1,None], dtype=pd.Int64Dtype()).astype(bool)
0 False
1 True
2 False
dtype: bool
Another option is to convert to a nullable boolean type, and so preserve the None / NaN indicators of missing data:
>>> pd.Series([0,1,None]).astype("boolean")
0 False
1 True
2 <NA>
dtype: boolean
Also see the Working with missing data section in the user guide, as well as the nullable integer and nullable boolean data type pages.
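If a plain bool column is needed in the end, one way to avoid the implicit NaN-to-True conversion is to go through the nullable boolean type and then state explicitly what a missing value should become. A minimal sketch, where the variable names are made up for illustration:

```python
import pandas as pd

raw = pd.Series([0, 1, None])        # inferred as float64, None becomes NaN
nullable = raw.astype("boolean")     # False, True, <NA>

# Decide explicitly how missing values are treated before dropping to bool.
missing_as_false = nullable.fillna(False).astype(bool)
missing_as_true = nullable.fillna(True).astype(bool)

print(missing_as_false.tolist())  # [False, True, False]
print(missing_as_true.tolist())   # [False, True, True]
```

This makes the treatment of missing data a visible choice in the code rather than a side effect of the column's inferred dtype.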
Note that the Pandas notion of the NA value, representing missing data, is still considered experimental, which is why it is not yet the default. But if you want to 'opt in' for dataframes you just created, you can call the DataFrame.convert_dtypes() method right after creating the frame:
>>> df = pd.DataFrame({'prime_member':[0,1,None]}).convert_dtypes()
>>> df.prime_member
0 0
1 1
2 <NA>
Name: prime_member, dtype: Int64
The difference is in the dtypes: before the conversion, df_test1.test_column.dtype is float64, so you don't have None but NaN, and bool(np.nan) is True.
However, when you have a mixed-type column, df_test2.test_column.dtype is object. Then None remains None in the data, and bool(None) is False.
Problem:
I access the same row of a DataFrame in two different ways and would like to check whether the two results are the same.
Data:
DATA link for copy and paste: API_link_to_data='https://raw.githubusercontent.com/jenfly/opsd/master/opsd_germany_daily.csv'
energyDF = pd.read_csv(API_link_to_data)
row3_LOC = energyDF.loc[[3],:]
row3_ILOC = energyDF.iloc[[3],:]
This code compares element-wise:
row3_LOC == row3_ILOC
and returns a DataFrame of booleans.
What I would like to get is a single True, since row3_LOC and row3_ILOC are the same.
Thanks
If you check, both row3_LOC and row3_ILOC are themselves dataframes.
print(type(row3_LOC))
print(type(row3_ILOC))
results in:
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
You can check if they are equal using row3_ILOC.equals(row3_LOC). See the documentation for DataFrame.equals.
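The practical difference between == and equals shows up when the row contains missing values. The sketch below uses a tiny made-up frame (the column names echo the dataset, the numbers are invented) so that it runs without downloading the CSV:

```python
import numpy as np
import pandas as pd

# Invented stand-in for the energy data, with a missing Wind value.
frame = pd.DataFrame({
    "Consumption": [1069.2, 1380.5],
    "Wind": [np.nan, np.nan],
})

row_loc = frame.loc[[1], :]
row_iloc = frame.iloc[[1], :]

# Element-wise comparison: NaN == NaN is False, so the Wind cell is False.
elementwise = row_loc == row_iloc
print(elementwise.all().all())    # False

# equals() treats NaNs in the same position as equal and returns one bool.
print(row_loc.equals(row_iloc))   # True
```

Note that equals also requires matching dtypes and index labels, so it is a stricter check than comparing values alone.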
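You can also reduce the element-wise comparison with all():
(row3_LOC == row3_ILOC).all().all()
The first all() reduces each column, the second reduces the resulting Series to a single value. As soon as one of the values doesn't match, you will get False. Note that NaN never compares equal to NaN, so rows with missing values will not pass this check. You may also be interested in .any(), which checks whether at least one value is True.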
Fill the NaNs with 'NULL' before comparing, since NaN never compares equal to itself:
energyDF.fillna('NULL')
energyDF = energyDF.fillna('NULL')
energyDF.loc[[3],:] == (energyDF.iloc[[3],:])
   Date  Consumption  Wind  Solar  Wind+Solar
3  True         True  True   True        True
I'm trying to clean test data from Kaggle's Titanic dataset, specifically the columns - sex, fare, pclass, and age. In order to do this, I'd like to find out if any of these columns have empty values. I load the data:
import pandas as pd
test_address = r'path_to_data\test.csv'  # raw string, so that \t is not read as a tab
test = pd.read_csv(test_address)
When I try to look for empty values in the columns,
True in test['Sex'].isna()
outputs True.
However,
test['Sex'].isna().value_counts()
outputs
False 418
Name: Sex, dtype: int64
which should mean there aren't any empty values (I could confirm this by visually scanning the csv file). These commands on test['Pclass'] have similar outputs.
Why is the 'True in' command giving me the wrong answer?
The operator in, when applied to a Series, checks if its left operand is in the index of the right operand. Since there is a row #1 (the numeric representation of True) in the series, the operator evaluates to True.
For the same reason, False in test['Sex'].isna() is True, but False in test['Sex'][1:].isna() is False (there is no row #0 in the latter slice).
You should check True in test['Sex'].isna().values instead, or simply call test['Sex'].isna().any().
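The index-versus-values distinction is easiest to see with string labels, where the two cannot be confused. A short sketch on made-up data (the series below are invented; only the Sex idea is borrowed from the question):

```python
import pandas as pd

# `in` on a Series looks at the index labels, not at the stored values.
labelled = pd.Series([1, 2], index=["a", "b"])
print("a" in labelled)          # True: "a" is an index label
print(1 in labelled)            # False: 1 is a value, not a label
print(1 in labelled.values)     # True: membership test on the values

# Checking for missing values: test the values, or use any().
sex = pd.Series(["male", "female", "male"])
missing = sex.isna()
print(True in missing.values)   # False
print(missing.any())            # False
```

any() is usually the more readable choice, and sex.isna().sum() gives the number of missing entries if a count is needed.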
I wanted to use a boolean indexing, checking for rows of my data frame where a particular column does not have NaN values. So, I did the following:
import pandas as pd
my_df.loc[pd.isnull(my_df['col_of_interest']) == False].head()
to see a snippet of that data frame, including only the values that are not NaN (most values are NaN).
It worked, but seems less-than-elegant. I'd want to type:
my_df.loc[!pd.isnull(my_df['col_of_interest'])].head()
However, that generated an error. I also spend a lot of time in R, so maybe I'm confusing things. In Python, I usually use "not" where I can, for instance if x is not None:, but I couldn't really do it here. Is there a more elegant way? I don't like having to put in a senseless comparison.
In general with pandas (and numpy), we use the bitwise NOT ~ instead of ! or not (whose behaviour can't be overridden by types).
While in this case we have notnull, ~ can come in handy in situations where there's no special opposite method.
>>> df = pd.DataFrame({"a": [1, 2, np.nan, 3]})
>>> df.a.isnull()
0 False
1 False
2 True
3 False
Name: a, dtype: bool
>>> ~df.a.isnull()
0 True
1 True
2 False
3 True
Name: a, dtype: bool
>>> df.a.notnull()
0 True
1 True
2 False
3 True
Name: a, dtype: bool
(For completeness I'll note that -, the unary negative operator, will also work on a boolean Series but ~ is the canonical choice, and - has been deprecated for numpy boolean arrays.)
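Applied to the filtering from the question, the options look as follows. This is a sketch on an invented frame; my_df and col_of_interest are the names from the question, while the data and the other column are made up:

```python
import numpy as np
import pandas as pd

my_df = pd.DataFrame({
    "col_of_interest": [1.0, np.nan, 3.0],
    "other": ["x", "y", "z"],
})

mask = my_df["col_of_interest"].isna()

via_invert = my_df.loc[~mask]                              # bitwise NOT
via_notna = my_df.loc[my_df["col_of_interest"].notna()]    # dedicated inverse
via_dropna = my_df.dropna(subset=["col_of_interest"])      # no mask needed

# Python's `not` asks for a single truth value, which a Series cannot give.
try:
    not mask
except ValueError as err:
    print("not mask failed:", err)

print(via_invert.index.tolist())  # [0, 2]
```

All three selections return the same rows; ~ is the general tool, while notna() and dropna() read more naturally for this specific case.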
Instead of using pandas.isnull(), you should use pandas.notnull() to find the rows where the column has non-null values. Example:
import pandas as pd
my_df.loc[pd.notnull(my_df['col_of_interest'])].head()
pandas.notnull() is the boolean inverse of pandas.isnull(), as stated in the documentation:
See also
pandas.notnull
boolean inverse of pandas.isnull