how to test if a variable is pd.NaT? - python

I'm trying to test whether one of my variables is pd.NaT. I know it is NaT, and still it won't pass the test. As an example, the following code prints nothing:
a = pd.NaT
if a == pd.NaT:
    print("a is NaT")
Does anyone have a clue? Is there a way to effectively test whether a is NaT?

Pandas NaT behaves like a floating-point NaN, in that it's not equal to itself. Instead, you can use pandas.isnull:
In [21]: pandas.isnull(pandas.NaT)
Out[21]: True
This also returns True for None and NaN.
Technically, you could also check for Pandas NaT with x != x, following a common pattern used for floating-point NaN. However, this is likely to cause issues with NumPy NaTs, which look very similar and represent the same concept, but are actually a different type with different behavior:
In [29]: x = pandas.NaT
In [30]: y = numpy.datetime64('NaT')
In [31]: x != x
Out[31]: True
In [32]: y != y
/home/i850228/.local/lib/python3.6/site-packages/IPython/__main__.py:1: FutureWarning: In the future, NAT != NAT will be True rather than False.
# encoding: utf-8
Out[32]: False
numpy.isnat, the function to check for NumPy NaT, also fails with a Pandas NaT:
In [33]: numpy.isnat(pandas.NaT)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-33-39a66bbf6513> in <module>()
----> 1 numpy.isnat(pandas.NaT)
TypeError: ufunc 'isnat' is only defined for datetime and timedelta.
pandas.isnull works for both Pandas and NumPy NaTs, so it's probably the way to go:
In [34]: pandas.isnull(pandas.NaT)
Out[34]: True
In [35]: pandas.isnull(numpy.datetime64('NaT'))
Out[35]: True

Since pd.NaT is a singleton, an identity check also works:
pd.NaT is pd.NaT
# True
This works for me.

You can also use pandas.isna() for pandas.NaT, numpy.nan or None:
import pandas as pd
import numpy as np
x = (pd.NaT, np.nan, None)
[pd.isna(i) for i in x]
Output:
[True, True, True]

If it's in a Series (e.g. DataFrame column) you can also use .isna():
pd.Series(pd.NaT).isna()
# 0 True
# dtype: bool
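The same approach works for a datetime column inside a DataFrame; a small sketch with made-up data:
import pandas as pd

df = pd.DataFrame({"when": [pd.Timestamp("2020-01-01"), pd.NaT]})
mask = df["when"].isna()  # 0 False, 1 True
print(df[mask])           # only the row where "when" is NaT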

This is what works for me:
>>> a = pandas.NaT
>>> type(a) == pandas._libs.tslibs.nattype.NaTType
True
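A variant of the same idea that avoids reaching into the private pandas._libs modules is to take the type from the public pd.NaT singleton itself:
>>> isinstance(a, type(pandas.NaT))
True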

Related

why is (df1[col].dtype != 'category') running successfully on one computer while giving an error on another, when trying to exclude the category attribute? [duplicate]

This question is a duplicate of "Check if dataframe column is Categorical" below; see that question for the answers.

Numpy select returning boolean error message

I would like to find matching strings in a path and use np.select to create a new column with labels dependent on the matches I found.
This is what I have written:
import numpy as np

conditions = [
    a["properties_path"].str.contains('blog'),
    a["properties_path"].str.contains('credit-card-readers/|machines|poss|team|transaction_fees'),
    a["properties_path"].str.contains('signup|sign-up|create-account|continue|checkout'),
    a["properties_path"].str.contains('complete'),
    a["properties_path"] == '/za/|/',
    a["properties_path"].str.contains('promo'),
]
choices = ["blog", "info_pages", "signup", "completed", "home_page", "promo"]
a["page_type"] = np.select(conditions, choices, default=np.nan)
However, when I run this code, I get this error message:
ValueError: invalid entry 0 in condlist: should be boolean ndarray
Here is a sample of my data:
3124465 /blog/ts-st...
3124466 /card-machines
3124467 /card-machines
3124468 /card-machines
3124469 /promo/our-gift-to-you
3124470 /create-account/v1
3124471 /za/signup/
3124472 /create-account/v1
3124473 /sign-up
3124474 /za/
3124475 /sign-up/cart
3124476 /checkout/
3124477 /complete
3124478 /card-machines
3124479 /continue
3124480 /blog/article/get-car...
3124481 /blog/article/get-car...
3124482 /za/signup/
3124483 /credit-card-readers
3124484 /signup
3124485 /credit-card-readers
3124486 /create-account/v1
3124487 /credit-card-readers
3124488 /point-of-sale-app
3124489 /create-account/v1
3124490 /point-of-sale-app
3124491 /credit-card-readers
The .str methods operate on object columns. It's possible to have non-string values in such columns, and for those rows pandas returns NaN instead of False. NumPy then complains because NaN is not a Boolean.
Luckily, there's an argument to handle this: na=False
a["properties_path"].str.contains('blog', na=False)
Alternatively, you could change your conditions to:
a["properties_path"].str.contains('blog') == True
#or
a["properties_path"].str.contains('blog').fillna(False)
Sample
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [1, 'foo', 'bar']})
conds = df.a.str.contains('f')
#0 NaN
#1 True
#2 False
#Name: a, dtype: object
np.select([conds], ['XX'])
#ValueError: invalid entry 0 in condlist: should be boolean ndarray
conds = df.a.str.contains('f', na=False)
#0 False
#1 True
#2 False
#Name: a, dtype: bool
np.select([conds], ['XX'])
#array(['0', 'XX', '0'], dtype='<U11')
Your data seem to contain NaNs, so the conditions contain NaNs too, which breaks np.select. To fix this, you can do:
s = a["properties_path"].fillna('')
and replace a['properties_path'] with s in each condition.
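Putting the fix together, here is a minimal self-contained sketch; the toy frame and the trimmed set of conditions stand in for the question's a, they are not the original data:
import numpy as np
import pandas as pd

# The None row is what would produce a NaN condition without na=False.
a = pd.DataFrame({"properties_path": ["/blog/post", "/za/signup/", None, "/promo/x"]})

path = a["properties_path"]
conditions = [
    path.str.contains("blog", na=False),
    path.str.contains("signup|sign-up|create-account|continue|checkout", na=False),
    path.str.contains("promo", na=False),
]
choices = ["blog", "signup", "promo"]
a["page_type"] = np.select(conditions, choices, default="other")
print(a)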

TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced

I am trying to convert a CSV file into a numpy array. In the numpy array, I am replacing a few elements with NaN. Then, I want to find the indices of the NaN elements in the array. The code is:
import pandas as pd
import matplotlib.pyplot as plyt
import numpy as np

filename = 'wether.csv'
df = pd.read_csv(filename, header=None)
list = df.values.tolist()
labels = list[0]
wether_list = list[1:]
year = []
month = []
day = []
max_temp = []
for i in wether_list:
    year.append(i[1])
    month.append(i[2])
    day.append(i[3])
    max_temp.append(i[5])

mid = len(max_temp) // 2
temps = np.array(max_temp[mid:])
temps[np.where(np.array(temps) == -99.9)] = np.nan
plyt.plot(temps, marker='.', color='black', linestyle='none')
# plyt.show()
print(np.where(np.isnan(temps))[0])
# print(len(pd.isnull(np.array(temps))))
When I execute this, I get a warning and an error. The warning is:
wether.py:26: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
temps[np.where(np.array(temps) == -99.9)] = np.nan
The error is:
Traceback (most recent call last):
File "wether.py", line 30, in <module>
print(np.where(np.isnan(temps))[0])
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
This is a part of the dataset which I am using:
83168,2014,9,7,0.00000,89.00000,78.00000, 83.50000
83168,2014,9,22,1.62000,90.00000,72.00000, 81.00000
83168,2014,9,23,0.50000,87.00000,74.00000, 80.50000
83168,2014,9,24,0.35000,82.00000,73.00000, 77.50000
83168,2014,9,25,0.60000,85.00000,75.00000, 80.00000
83168,2014,9,26,0.76000,89.00000,77.00000, 83.00000
83168,2014,9,27,0.00000,89.00000,79.00000, 84.00000
83168,2014,9,28,0.00000,90.00000,81.00000, 85.50000
83168,2014,9,29,0.00000,90.00000,79.00000, 84.50000
83168,2014,9,30,0.50000,89.00000,75.00000, 82.00000
83168,2014,10,1,0.02000,91.00000,75.00000, 83.00000
83168,2014,10,2,0.03000,93.00000,77.00000, 85.00000
83168,2014,10,3,1.40000,93.00000,75.00000, 84.00000
83168,2014,10,4,0.06000,89.00000,75.00000, 82.00000
83168,2014,10,5,0.22000,91.00000,68.00000, 79.50000
83168,2014,10,6,0.00000,84.00000,68.00000, 76.00000
83168,2014,10,7,0.17000,85.00000,73.00000, 79.00000
83168,2014,10,8,0.06000,84.00000,73.00000, 78.50000
83168,2014,10,9,0.00000,87.00000,73.00000, 80.00000
83168,2014,10,10,0.00000,88.00000,80.00000, 84.00000
83168,2014,10,11,0.00000,87.00000,80.00000, 83.50000
83168,2014,10,12,0.00000,88.00000,80.00000, 84.00000
83168,2014,10,13,0.00000,88.00000,81.00000, 84.50000
83168,2014,10,14,0.04000,88.00000,77.00000, 82.50000
83168,2014,10,15,0.00000,88.00000,77.00000, 82.50000
83168,2014,10,16,0.09000,89.00000,72.00000, 80.50000
83168,2014,10,17,0.00000,85.00000,67.00000, 76.00000
83168,2014,10,18,0.00000,84.00000,65.00000, 74.50000
83168,2014,10,19,0.00000,84.00000,65.00000, 74.50000
83168,2014,10,20,0.00000,85.00000,69.00000, 77.00000
83168,2014,10,21,0.77000,87.00000,76.00000, 81.50000
83168,2014,10,22,0.69000,81.00000,71.00000, 76.00000
83168,2014,10,23,0.31000,82.00000,72.00000, 77.00000
83168,2014,10,24,0.71000,79.00000,73.00000, 76.00000
83168,2014,10,25,0.00000,81.00000,68.00000, 74.50000
83168,2014,10,26,0.00000,82.00000,67.00000, 74.50000
83168,2014,10,27,0.00000,83.00000,64.00000, 73.50000
83168,2014,10,28,0.00000,83.00000,66.00000, 74.50000
83168,2014,10,29,0.03000,86.00000,76.00000, 81.00000
83168,2014,10,30,0.00000,85.00000,69.00000, 77.00000
83168,2014,10,31,0.00000,85.00000,69.00000, 77.00000
83168,2014,11,1,0.00000,86.00000,59.00000, 72.50000
83168,2014,11,2,0.00000,77.00000,52.00000, 64.50000
83168,2014,11,3,0.00000,70.00000,52.00000, 61.00000
83168,2014,11,4,0.00000,77.00000,59.00000, 68.00000
83168,2014,11,5,0.02000,79.00000,73.00000, 76.00000
83168,2014,11,6,0.02000,82.00000,75.00000, 78.50000
83168,2014,11,7,0.00000,83.00000,66.00000, 74.50000
83168,2014,11,8,0.00000,84.00000,65.00000, 74.50000
83168,2014,11,9,0.00000,84.00000,65.00000, 74.50000
83168,2014,11,10,1.20000,72.00000,65.00000, 68.50000
83168,2014,11,11,0.08000,77.00000,61.00000, 69.00000
83168,2014,11,12,0.00000,80.00000,61.00000, 70.50000
83168,2014,11,13,0.00000,83.00000,63.00000, 73.00000
83168,2014,11,14,0.00000,83.00000,65.00000, 74.00000
83168,2014,11,15,0.00000,82.00000,64.00000, 73.00000
83168,2014,11,16,0.00000,83.00000,64.00000, 73.50000
83168,2014,11,17,0.07000,84.00000,64.00000, 74.00000
83168,2014,11,18,0.00000,86.00000,71.00000, 78.50000
83168,2014,11,19,0.57000,78.00000,55.00000, 66.50000
83168,2014,11,20,0.05000,72.00000,56.00000, 64.00000
83168,2014,11,21,0.05000,77.00000,63.00000, 70.00000
83168,2014,11,22,0.22000,77.00000,69.00000, 73.00000
83168,2014,11,23,0.06000,79.00000,76.00000, 77.50000
83168,2014,11,24,0.02000,84.00000,78.00000, 81.00000
83168,2014,11,25,0.00000,86.00000,78.00000, 82.00000
83168,2014,11,26,0.07000,85.00000,77.00000, 81.00000
83168,2014,11,27,0.21000,82.00000,55.00000, 68.50000
83168,2014,11,28,0.00000,73.00000,53.00000, 63.00000
83168,2015,1,8,0.00000,80.00000,57.00000,
83168,2015,1,9,0.05000,72.00000,56.00000,
83168,2015,1,10,0.00000,72.00000,57.00000,
83168,2015,1,11,0.00000,80.00000,57.00000,
83168,2015,1,12,0.05000,80.00000,59.00000,
83168,2015,1,13,0.85000,81.00000,69.00000,
83168,2015,1,14,0.05000,81.00000,68.00000,
83168,2015,1,15,0.00000,81.00000,64.00000,
83168,2015,1,16,0.00000,78.00000,63.00000,
83168,2015,1,17,0.00000,73.00000,55.00000,
83168,2015,1,18,0.00000,76.00000,55.00000,
83168,2015,1,19,0.00000,78.00000,55.00000,
83168,2015,1,20,0.00000,75.00000,56.00000,
83168,2015,1,21,0.02000,73.00000,65.00000,
83168,2015,1,22,0.00000,80.00000,64.00000,
83168,2015,1,23,0.00000,80.00000,71.00000,
83168,2015,1,24,0.00000,79.00000,72.00000,
83168,2015,1,25,0.00000,79.00000,49.00000,
83168,2015,1,26,0.00000,79.00000,49.00000,
83168,2015,1,27,0.10000,75.00000,53.00000,
83168,2015,1,28,0.00000,68.00000,53.00000,
83168,2015,1,29,0.00000,69.00000,53.00000,
83168,2015,1,30,0.00000,72.00000,60.00000,
83168,2015,1,31,0.00000,76.00000,58.00000,
83168,2015,2,1,0.00000,76.00000,58.00000,
83168,2015,2,2,0.05000,77.00000,58.00000,
83168,2015,2,3,0.00000,84.00000,56.00000,
83168,2015,2,4,0.00000,76.00000,56.00000,
I am unable to rectify the error. How can I overcome the warning about line 26, and how can this error be solved?
Update:
When I try the same thing in a different way, reading the dataset from the file directly instead of converting it to a DataFrame, I do not get the error. What could be the reason for that? The code is:
weather_filename = 'wether.csv'
weather_file = open(weather_filename)
weather_data = weather_file.read()
weather_file.close()
# Break the weather records into lines
lines = weather_data.split('\n')
labels = lines[0]
values = lines[1:]
n_values = len(values)
# Break the list of comma-separated value strings
# into lists of values.
year = []
month = []
day = []
max_temp = []
j_year = 1
j_month = 2
j_day = 3
j_max_temp = 5
for i_row in range(n_values):
    split_values = values[i_row].split(',')
    if len(split_values) >= j_max_temp:
        year.append(int(split_values[j_year]))
        month.append(int(split_values[j_month]))
        day.append(int(split_values[j_day]))
        max_temp.append(float(split_values[j_max_temp]))

# Isolate the recent data.
i_mid = len(max_temp) // 2
temps = np.array(max_temp[i_mid:])
year = year[i_mid:]
month = month[i_mid:]
day = day[i_mid:]
temps[np.where(temps == -99.9)] = np.nan
# Remove all the nans.
# Trim both ends and fill nans in the middle.
# Find the first non-nan.
i_start = np.where(np.logical_not(np.isnan(temps)))[0][0]
temps = temps[i_start:]
year = year[i_start:]
month = month[i_start:]
day = day[i_start:]
i_nans = np.where(np.isnan(temps))[0]
print(i_nans)
What is wrong in the first code, and why doesn't the second one even give a warning?
Posting this as it might help future users.
As correctly pointed out by others, np.isnan won't work for object or string dtypes. If you're using pandas, as mentioned here you can directly use pd.isnull, which should work in your case.
import pandas as pd
import numpy as np
var1 = ''
var2 = np.nan
>>> type(var1)
<class 'str'>
>>> type(var2)
<class 'float'>
>>> pd.isnull(var1)
False
>>> pd.isnull(var2)
True
Try replacing np.isnan with pd.isna. Pandas' isna supports category dtypes as well.
What's the dtype of temps? I can reproduce your warning and error with a string dtype:
In [26]: temps = np.array([1,2,'string',0])
In [27]: temps
Out[27]: array(['1', '2', 'string', '0'], dtype='<U21')
In [28]: temps==-99.9
/usr/local/bin/ipython3:1: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
#!/usr/bin/python3
Out[28]: False
In [29]: np.isnan(temps)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-29-2ff7754ed926> in <module>()
----> 1 np.isnan(temps)
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
First, comparing strings with the number gives this future warning.
Second, testing for nan produces the error.
Note that given the dtype, the nan assignment assigns a string value, not a float (np.nan is a float).
In [30]: temps[-1] = np.nan
In [31]: temps
Out[31]: array(['1', '2', 'string', 'nan'], dtype='<U21')
np.isnan(ndarray) fails when the ndarray dtype is object.
np.isnan(ndarray.astype(float)) would work, but strings such as 'string' cannot be coerced to float.
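The underlying fix for the question's first script, then, is to build temps as a float array before the comparison and the NaN assignment; a minimal sketch with made-up values:
import numpy as np

max_temp = ['83.5', '81.0', '-99.9']     # strings, as they come out of df.values.tolist()
temps = np.array(max_temp, dtype=float)  # force float64 instead of a string dtype
temps[temps == -99.9] = np.nan           # the comparison is now elementwise
print(np.where(np.isnan(temps))[0])      # [2]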
This is likely the result of an unwanted float-to-string conversion. To repair it, just reverse it by adding a string-to-float conversion (assuming the data is convertible to a number), using float or np.float64:
np.isnan(float(str(np.nan)))
True
or
np.isnan(float(str("nan")))
True
rather than:
np.isnan(str(np.nan))
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In [164], line 1
----> 1 np.isnan(str(np.nan))
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
Note that if your data is NOT convertible to numbers (floats), you need to use a string-compatible function such as pd.isna instead of np.isnan.
I came across this error when trying to transform my dataset using sklearn.preprocessing.OneHotEncoder. The error was thrown by _check_unknown function defined in sklearn.utils._encode.
This was caused by the fact that, at transform time, one of the columns to be transformed had a type float64 as opposed to object - in my case an entire column was NaN.
The solution was to cast the dataframe to object type before invoking transform:
ohe.transform(data.astype("O"))
Note: This answer is somewhat related to the title of the question, because this error comes up when working with Decimal types.
I got the same error when considering Decimal-type values. For some reason, one column of the dataframe I'm considering comes as Decimal. For example, when calling .unique() on this column I got
[Decimal('0'), Decimal('95'), Decimal('38'), Decimal('25'),
Decimal('42'), Decimal('11'), Decimal('18'), Decimal('22'),
.....Decimal('220'), Decimal('724')]
As the traceback of the error showed me that it failed when calling some numpy function, I managed to reproduce the error by considering the min and max values of the above array:
from decimal import Decimal
xmin, xmax = Decimal('0'), Decimal('724')
np.isnan([xmin, xmax])
which raises the error:
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
The solution in this case was to cast all these values to int:
df.astype({col: int for col in desired_columns_to_convert})
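For reference, casting the Decimal values to float instead also makes np.isnan usable directly; a quick sketch:
from decimal import Decimal
import numpy as np

vals = [Decimal('0'), Decimal('724')]
print(np.isnan(np.array(vals, dtype=float)))  # [False False]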

Role of name of pandas.Series while doing difference

I have two pandas.Series objects, say a and b, having the same index, and when performing the difference a - b I get the error
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
and I don't understand where it is coming from.
The Series a is obtained as a slice of a DataFrame whose index is a MultiIndex, and when I do a renaming
a.name = 0
the operation works fine (but if I rename to a tuple I get the same error).
Unfortunately, I am not able to reproduce a minimal example of the phenomenon (taking the difference of ad-hoc Series whose names are tuples seems to work fine).
Any ideas on why this is happening?
If relevant, pandas version is 0.22.0
EDIT
The full traceback of the error:
----------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-15-e4efbf202d3c> in <module>()
----> 1 one - two
~/venv/lib/python3.4/site-packages/pandas/core/ops.py in wrapper(left, right, name, na_op)
727
728 if isinstance(rvalues, ABCSeries):
--> 729 name = _maybe_match_name(left, rvalues)
730 lvalues = getattr(lvalues, 'values', lvalues)
731 rvalues = getattr(rvalues, 'values', rvalues)
~/venv/lib/python3.4/site-packages/pandas/core/common.py in _maybe_match_name(a, b)
137 b_has = hasattr(b, 'name')
138 if a_has and b_has:
--> 139 if a.name == b.name:
140 return a.name
141 else:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
EDIT 2
Some more details on how a and b are obtained:
I have a DataFrame df whose index is a MultiIndex (year, id_).
I have a Series factors whose index consists of the columns of df (something like the standard deviation of the columns).
Then:
tmp = df.loc[(year, id_)]
a = tmp[factors != 0]
b = factors[factors != 0]
diff = a - b
and executing the last line the error happens.
EDIT 3
And it keeps happening even if I reduce the data: the original df has around 1000 rows and columns, but after reducing it to the last few rows and columns, the problem persists!
For example, by doing
df = df.iloc[-10:][df.columns[-5:]]
line = df.iloc[-3]
factors = factors[df.columns]
a = line[factors != 0]
b = factors[factors != 0]
diff = a - b
I keep getting the same error, and printing a and b I obtain
a:
end_bin_68.750_100.000 0.002413
end_bin_75.000_100.000 0.002614
end_bin_81.250_100.000 0.001810
end_bin_87.500_100.000 0.002313
end_bin_93.750_100.000 0.001609
Name: (2015, 10000030), dtype: float64
b:
end_bin_68.750_100.000 0.001244
end_bin_75.000_100.000 0.001242
end_bin_81.250_100.000 0.000918
end_bin_87.500_100.000 0.000659
end_bin_93.750_100.000 0.000563
Name: 1, dtype: float64
While if I manually create df and factors with these same values (also in the indices) the error does not happen.
EDIT 4
While debugging, when one gets to the function _maybe_match_name one obtains the following:
ipdb> type(a.name)
<class 'tuple'>
ipdb> type(b.name)
<class 'numpy.int64'>
ipdb> a.name == b.name
a = end_bin_68.750_100.000 0.002413
end_bin_75.000_100.000 0.002614
end_bin_81.250_100.000 0.001810
end_bin_87.500_100.000 0.002313
end_bin_93.750_100.000 0.001609
Name: (2015, 10000030), dtype: float64
b = end_bin_68.750_100.000 0.001244
end_bin_75.000_100.000 0.001242
end_bin_81.250_100.000 0.000918
end_bin_87.500_100.000 0.000659
end_bin_93.750_100.000 0.000563
Name: 1, dtype: float64
ipdb> (a.name == b.name)
array([False, False])
EDIT 5
Finally I got to a minimal example:
a = pd.Series([1, 2, 3])
a.name = np.int64(13)
b = pd.Series([4, 5, 6])
b.name = (123, 789)
a - b
this raises the error for me, with np.__version__ == '1.14.0' and pd.__version__ == '0.22.0'.
When an operation is performed between two pandas Series, pandas tries to give a name to the resulting Series.
s1 = pd.Series(np.random.randn(5))
s2 = pd.Series(np.random.randn(5))
s1.name = "hello"
s2.name = "hello"
s3 = s1-s2
s3.name
>>> "hello"
If the names are not the same, then the resulting Series has no name.
s1 = pd.Series(np.random.randn(5))
s2 = pd.Series(np.random.randn(5))
s1.name = "hello"
s2.name = "goodbye"
s3 = s1-s2
s3.name
>>>
This is done by comparing the Series names with the function _maybe_match_name(), which lives in pandas/core/common.py on GitHub.
In your case the comparison apparently pits a tuple name against a NumPy integer name; the comparison returns an array rather than a single boolean, which is why the ValueError exception is raised (I haven't been able to reproduce the error).
I guess it is a bug; what is weird is that np.int64(42) == ("A", "B") doesn't raise an exception for me.
But I do get a FutureWarning from numpy:
FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison.
Which makes me think that you are using an extremely recent numpy version (did you compile it from the master branch on GitHub?).
The bug will likely be corrected in the next pandas release, as it is the result of a coming change in the behavior of numpy.
My guess is that the best thing to do is simply to rename your Series before performing the operation, as you already did (b.name = None), or to change your numpy version (1.15.0 works well).
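To make that workaround concrete, here is the minimal example from EDIT 5 with the names cleared before subtracting; a sketch:
import numpy as np
import pandas as pd

a = pd.Series([1, 2, 3])
a.name = np.int64(13)
b = pd.Series([4, 5, 6])
b.name = (123, 789)

a.name = None  # clear both names so _maybe_match_name has nothing to compare
b.name = None
print(a - b)   # -3, -3, -3 -- no ValueError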

Check if dataframe column is Categorical

I can't seem to get a simple dtype check working with Pandas' improved Categoricals in v0.15+. Basically I just want something like is_categorical(column) -> True/False.
import pandas as pd
import numpy as np
import random
df = pd.DataFrame({
    'x': np.linspace(0, 50, 6),
    'y': np.linspace(0, 20, 6),
    'cat_column': random.sample('abcdef', 6)
})
df['cat_column'] = pd.Categorical(df['cat_column'])
We can see that the dtype for the categorical column is 'category':
df.cat_column.dtype
Out[20]: category
And normally we can do a dtype check by just comparing to the name of the dtype:
df.x.dtype == 'float64'
Out[21]: True
But this doesn't seem to work when trying to check if the x column is categorical:
df.x.dtype == 'category'
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-22-94d2608815c4> in <module>()
----> 1 df.x.dtype == 'category'
TypeError: data type "category" not understood
Is there any way to do these types of checks in pandas v0.15+?
Use the name property to do the comparison instead; it will always work because it's just a string:
>>> import numpy as np
>>> arr = np.array([1, 2, 3, 4])
>>> arr.dtype.name
'int64'
>>> import pandas as pd
>>> cat = pd.Categorical(['a', 'b', 'c'])
>>> cat.dtype.name
'category'
So, to sum up, you can end up with a simple, straightforward function:
def is_categorical(array_like):
    return array_like.dtype.name == 'category'
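Applied to the question's DataFrame, the helper behaves as hoped; a quick usage sketch:
is_categorical(df.cat_column)  # True
is_categorical(df.x)           # False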
First, the string representation of the dtype is 'category' and not 'categorical', so this works:
In [41]: df.cat_column.dtype == 'category'
Out[41]: True
But indeed, as you noticed, this comparison raises a TypeError for other dtypes, so you would have to wrap it in a try .. except .. block.
Other ways to check using pandas internals:
In [42]: isinstance(df.cat_column.dtype, pd.api.types.CategoricalDtype)
Out[42]: True
In [43]: pd.api.types.is_categorical_dtype(df.cat_column)
Out[43]: True
For non-categorical columns, those statements will return False instead of raising an error. For example:
In [44]: pd.api.types.is_categorical_dtype(df.x)
Out[44]: False
For much older versions of pandas, replace pd.api.types in the above snippet with pd.core.common.
Just putting this here because pandas.DataFrame.select_dtypes() is what I was actually looking for:
df['column'].name in df.select_dtypes(include='category').columns
Thanks to @Jeff.
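A related trick: select_dtypes can also list every categorical column at once; a small sketch using the question's df:
df.select_dtypes(include='category').columns.tolist()  # ['cat_column']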
In my pandas version (v1.0.3), a shorter version of joris' answer is available.
df = pd.DataFrame({'noncat': [1, 2, 3], 'categ': pd.Categorical(['A', 'B', 'C'])})
print(isinstance(df.noncat.dtype, pd.CategoricalDtype)) # False
print(isinstance(df.categ.dtype, pd.CategoricalDtype)) # True
print(pd.CategoricalDtype.is_dtype(df.noncat)) # False
print(pd.CategoricalDtype.is_dtype(df.categ)) # True
I ran into this thread looking for the exact same functionality, and also found another option, right from the pandas documentation here.
It looks like the canonical way to check if a pandas dataframe column is a categorical Series should be the following:
hasattr(column_to_check, 'cat')
So, as per the example given in the initial question, this would be:
hasattr(df.cat_column, 'cat')  # True
hasattr(df.x, 'cat')           # False
Nowadays you can use:
pandas.api.types.is_categorical_dtype(series)
Docs here: https://pandas.pydata.org/docs/reference/api/pandas.api.types.is_categorical_dtype.html
Available since at least pandas 1.0
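Be aware that recent pandas releases (2.1+) deprecate is_categorical_dtype; the replacement suggested by the deprecation warning is an isinstance check against pd.CategoricalDtype, as shown in the answers above:
isinstance(series.dtype, pd.CategoricalDtype)  # preferred in pandas 2.x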
Building on @Jeff Tratner's answer: since df.cat_column.dtype == 'category' does not need to be True for a column to be treated as categorical, I propose treating every dtype in the categorical_dtypes list below as categorical:
def is_cat(column):
    categorical_dtypes = ['object', 'category', 'bool']
    return column.dtype.name in categorical_dtypes
