How to test if a variable is pd.NaT? - Python
I'm trying to test whether one of my variables is pd.NaT. I know it is NaT, and still it won't pass the test. As an example, the following code prints nothing:
a = pd.NaT
if a == pd.NaT:
    print("a is NaT")
Does anyone have a clue? Is there a way to effectively test whether a is NaT?
Pandas NaT behaves like a floating-point NaN, in that it's not equal to itself. Instead, you can use pandas.isnull:
In [21]: pandas.isnull(pandas.NaT)
Out[21]: True
This also returns True for None and NaN.
Technically, you could also check for Pandas NaT with x != x, following a common pattern used for floating-point NaN. However, this is likely to cause issues with NumPy NaTs, which look very similar and represent the same concept, but are actually a different type with different behavior:
In [29]: x = pandas.NaT
In [30]: y = numpy.datetime64('NaT')
In [31]: x != x
Out[31]: True
In [32]: y != y
/home/i850228/.local/lib/python3.6/site-packages/IPython/__main__.py:1: FutureWarning: In the future, NAT != NAT will be True rather than False.
Out[32]: False
numpy.isnat, the function to check for NumPy NaT, also fails with a Pandas NaT:
In [33]: numpy.isnat(pandas.NaT)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-33-39a66bbf6513> in <module>()
----> 1 numpy.isnat(pandas.NaT)
TypeError: ufunc 'isnat' is only defined for datetime and timedelta.
pandas.isnull works for both Pandas and NumPy NaTs, so it's probably the way to go:
In [34]: pandas.isnull(pandas.NaT)
Out[34]: True
In [35]: pandas.isnull(numpy.datetime64('NaT'))
Out[35]: True
pd.NaT is a singleton, so an identity check also works:

>>> pd.NaT is pd.NaT
True

This works for me.
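One caveat worth noting (my addition): the identity check only catches the pandas singleton, not NumPy's NaT, so prefer pd.isnull if both can occur. A small sketch:

import pandas as pd
import numpy as np

print(pd.NaT is pd.NaT)                 # True
print(np.datetime64('NaT') is pd.NaT)   # False: a different object entirely
print(pd.isnull(np.datetime64('NaT')))  # True: isnull catches both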
You can also use pandas.isna() for pandas.NaT, numpy.nan or None:
import pandas as pd
import numpy as np
x = (pd.NaT, np.nan, None)
[pd.isna(i) for i in x]
Output:
[True, True, True]
If it's in a Series (e.g. DataFrame column) you can also use .isna():
pd.Series(pd.NaT).isna()
# 0 True
# dtype: bool
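The same method makes the common follow-up easy, for example keeping only the rows that have a real timestamp. A short sketch with a made-up column name 'when':

import pandas as pd

df = pd.DataFrame({'when': [pd.Timestamp('2021-01-01'), pd.NaT]})
print(df[df['when'].notna()])  # drops the NaT row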
This is what works for me:

>>> a = pandas.NaT
>>> type(a) == pandas._libs.tslibs.nattype.NaTType
True
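A variant of the same idea that avoids reaching into the private pandas._libs module (a sketch, since type(pd.NaT) is not a documented API, though it resolves to the same NaTType):

import pandas as pd

a = pd.NaT
print(isinstance(a, type(pd.NaT)))                        # True
print(isinstance(pd.Timestamp('2020-01-01'), type(pd.NaT)))  # False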
Related
Why is (df1[col].dtype != 'category') running successfully on one computer but giving an error on another, while trying to exclude category columns? [duplicate of "Check if dataframe column is Categorical" below]
Numpy select returning boolean error message
I would like to find matching strings in a path and use np.select to create a new column with labels dependent on the matches I found. This is what I have written:

import numpy as np

conditions = [
    a["properties_path"].str.contains('blog'),
    a["properties_path"].str.contains('credit-card-readers/|machines|poss|team|transaction_fees'),
    a["properties_path"].str.contains('signup|sign-up|create-account|continue|checkout'),
    a["properties_path"].str.contains('complete'),
    a["properties_path"] == '/za/|/',
    a["properties_path"].str.contains('promo'),
]
choices = ["blog", "info_pages", "signup", "completed", "home_page", "promo"]
a["page_type"] = np.select(conditions, choices, default=np.nan)

However, when I run this code, I get this error message:

ValueError: invalid entry 0 in condlist: should be boolean ndarray

Here is a sample of my data:

3124465    /blog/ts-st...
3124466    /card-machines
3124467    /card-machines
3124468    /card-machines
3124469    /promo/our-gift-to-you
3124470    /create-account/v1
3124471    /za/signup/
3124472    /create-account/v1
3124473    /sign-up
3124474    /za/
3124475    /sign-up/cart
3124476    /checkout/
3124477    /complete
3124478    /card-machines
3124479    /continue
3124480    /blog/article/get-car...
3124481    /blog/article/get-car...
3124482    /za/signup/
3124483    /credit-card-readers
3124484    /signup
3124485    /credit-card-readers
3124486    /create-account/v1
3124487    /credit-card-readers
3124488    /point-of-sale-app
3124489    /create-account/v1
3124490    /point-of-sale-app
3124491    /credit-card-readers
The .str methods operate on object columns. It's possible to have non-string values in such columns, and as a result pandas returns NaN for these rows instead of False. np.select then complains because this is not a Boolean. Luckily, there's an argument to handle this: na=False.

a["properties_path"].str.contains('blog', na=False)

Alternatively, you could change your conditions to:

a["properties_path"].str.contains('blog') == True
# or
a["properties_path"].str.contains('blog').fillna(False)

Sample:

import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1, 'foo', 'bar']})

conds = df.a.str.contains('f')
# 0      NaN
# 1     True
# 2    False
# Name: a, dtype: object

np.select([conds], ['XX'])
# ValueError: invalid entry 0 in condlist: should be boolean ndarray

conds = df.a.str.contains('f', na=False)
# 0    False
# 1     True
# 2    False
# Name: a, dtype: bool

np.select([conds], ['XX'])
# array(['0', 'XX', '0'], dtype='<U11')
Your data seem to contain NaN, so the conditions contain NaN as well, which breaks np.select. To fix this, you can do:

s = a["properties_path"].fillna('')

and replace a["properties_path"] with s in each condition.
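A minimal sketch of this fix on made-up data (the real DataFrame a and its full condition list are as in the question):

import numpy as np
import pandas as pd

a = pd.DataFrame({'properties_path': ['/blog/x', None, '/sign-up']})
s = a['properties_path'].fillna('')  # NaN -> '' so .str.contains stays boolean
conditions = [s.str.contains('blog'),
              s.str.contains('signup|sign-up')]
choices = ['blog', 'signup']
a['page_type'] = np.select(conditions, choices, default='other')
print(a)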
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced
I am trying to convert a CSV into a numpy array. In the numpy array, I am replacing a few elements with NaN. Then I want to find the indices of the NaN elements in the numpy array. The code is:

import pandas as pd
import matplotlib.pyplot as plyt
import numpy as np

filename = 'wether.csv'
df = pd.read_csv(filename, header=None)
list = df.values.tolist()
labels = list[0]
wether_list = list[1:]

year = []
month = []
day = []
max_temp = []
for i in wether_list:
    year.append(i[1])
    month.append(i[2])
    day.append(i[3])
    max_temp.append(i[5])

mid = len(max_temp) // 2
temps = np.array(max_temp[mid:])
temps[np.where(np.array(temps) == -99.9)] = np.nan
plyt.plot(temps, marker='.', color='black', linestyle='none')
# plyt.show()

print(np.where(np.isnan(temps))[0])
# print(len(pd.isnull(np.array(temps))))

When I execute this, I am getting a warning and an error. The warning is:

wether.py:26: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  temps[np.where(np.array(temps) == -99.9)] = np.nan

The error is:

Traceback (most recent call last):
  File "wether.py", line 30, in <module>
    print(np.where(np.isnan(temps))[0])
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

This is a part of the dataset which I am using:

83168,2014,9,7,0.00000,89.00000,78.00000, 83.50000
83168,2014,9,22,1.62000,90.00000,72.00000, 81.00000
83168,2014,9,23,0.50000,87.00000,74.00000, 80.50000
83168,2014,9,24,0.35000,82.00000,73.00000, 77.50000
83168,2014,9,25,0.60000,85.00000,75.00000, 80.00000
83168,2014,9,26,0.76000,89.00000,77.00000, 83.00000
83168,2014,9,27,0.00000,89.00000,79.00000, 84.00000
83168,2014,9,28,0.00000,90.00000,81.00000, 85.50000
83168,2014,9,29,0.00000,90.00000,79.00000, 84.50000
83168,2014,9,30,0.50000,89.00000,75.00000, 82.00000
83168,2014,10,1,0.02000,91.00000,75.00000, 83.00000
83168,2014,10,2,0.03000,93.00000,77.00000, 85.00000
83168,2014,10,3,1.40000,93.00000,75.00000, 84.00000
83168,2014,10,4,0.06000,89.00000,75.00000, 82.00000
83168,2014,10,5,0.22000,91.00000,68.00000, 79.50000
83168,2014,10,6,0.00000,84.00000,68.00000, 76.00000
83168,2014,10,7,0.17000,85.00000,73.00000, 79.00000
83168,2014,10,8,0.06000,84.00000,73.00000, 78.50000
83168,2014,10,9,0.00000,87.00000,73.00000, 80.00000
83168,2014,10,10,0.00000,88.00000,80.00000, 84.00000
83168,2014,10,11,0.00000,87.00000,80.00000, 83.50000
83168,2014,10,12,0.00000,88.00000,80.00000, 84.00000
83168,2014,10,13,0.00000,88.00000,81.00000, 84.50000
83168,2014,10,14,0.04000,88.00000,77.00000, 82.50000
83168,2014,10,15,0.00000,88.00000,77.00000, 82.50000
83168,2014,10,16,0.09000,89.00000,72.00000, 80.50000
83168,2014,10,17,0.00000,85.00000,67.00000, 76.00000
83168,2014,10,18,0.00000,84.00000,65.00000, 74.50000
83168,2014,10,19,0.00000,84.00000,65.00000, 74.50000
83168,2014,10,20,0.00000,85.00000,69.00000, 77.00000
83168,2014,10,21,0.77000,87.00000,76.00000, 81.50000
83168,2014,10,22,0.69000,81.00000,71.00000, 76.00000
83168,2014,10,23,0.31000,82.00000,72.00000, 77.00000
83168,2014,10,24,0.71000,79.00000,73.00000, 76.00000
83168,2014,10,25,0.00000,81.00000,68.00000, 74.50000
83168,2014,10,26,0.00000,82.00000,67.00000, 74.50000
83168,2014,10,27,0.00000,83.00000,64.00000, 73.50000
83168,2014,10,28,0.00000,83.00000,66.00000, 74.50000
83168,2014,10,29,0.03000,86.00000,76.00000, 81.00000
83168,2014,10,30,0.00000,85.00000,69.00000, 77.00000
83168,2014,10,31,0.00000,85.00000,69.00000, 77.00000
83168,2014,11,1,0.00000,86.00000,59.00000, 72.50000
83168,2014,11,2,0.00000,77.00000,52.00000, 64.50000
83168,2014,11,3,0.00000,70.00000,52.00000, 61.00000
83168,2014,11,4,0.00000,77.00000,59.00000, 68.00000
83168,2014,11,5,0.02000,79.00000,73.00000, 76.00000
83168,2014,11,6,0.02000,82.00000,75.00000, 78.50000
83168,2014,11,7,0.00000,83.00000,66.00000, 74.50000
83168,2014,11,8,0.00000,84.00000,65.00000, 74.50000
83168,2014,11,9,0.00000,84.00000,65.00000, 74.50000
83168,2014,11,10,1.20000,72.00000,65.00000, 68.50000
83168,2014,11,11,0.08000,77.00000,61.00000, 69.00000
83168,2014,11,12,0.00000,80.00000,61.00000, 70.50000
83168,2014,11,13,0.00000,83.00000,63.00000, 73.00000
83168,2014,11,14,0.00000,83.00000,65.00000, 74.00000
83168,2014,11,15,0.00000,82.00000,64.00000, 73.00000
83168,2014,11,16,0.00000,83.00000,64.00000, 73.50000
83168,2014,11,17,0.07000,84.00000,64.00000, 74.00000
83168,2014,11,18,0.00000,86.00000,71.00000, 78.50000
83168,2014,11,19,0.57000,78.00000,55.00000, 66.50000
83168,2014,11,20,0.05000,72.00000,56.00000, 64.00000
83168,2014,11,21,0.05000,77.00000,63.00000, 70.00000
83168,2014,11,22,0.22000,77.00000,69.00000, 73.00000
83168,2014,11,23,0.06000,79.00000,76.00000, 77.50000
83168,2014,11,24,0.02000,84.00000,78.00000, 81.00000
83168,2014,11,25,0.00000,86.00000,78.00000, 82.00000
83168,2014,11,26,0.07000,85.00000,77.00000, 81.00000
83168,2014,11,27,0.21000,82.00000,55.00000, 68.50000
83168,2014,11,28,0.00000,73.00000,53.00000, 63.00000
83168,2015,1,8,0.00000,80.00000,57.00000,
83168,2015,1,9,0.05000,72.00000,56.00000,
83168,2015,1,10,0.00000,72.00000,57.00000,
83168,2015,1,11,0.00000,80.00000,57.00000,
83168,2015,1,12,0.05000,80.00000,59.00000,
83168,2015,1,13,0.85000,81.00000,69.00000,
83168,2015,1,14,0.05000,81.00000,68.00000,
83168,2015,1,15,0.00000,81.00000,64.00000,
83168,2015,1,16,0.00000,78.00000,63.00000,
83168,2015,1,17,0.00000,73.00000,55.00000,
83168,2015,1,18,0.00000,76.00000,55.00000,
83168,2015,1,19,0.00000,78.00000,55.00000,
83168,2015,1,20,0.00000,75.00000,56.00000,
83168,2015,1,21,0.02000,73.00000,65.00000,
83168,2015,1,22,0.00000,80.00000,64.00000,
83168,2015,1,23,0.00000,80.00000,71.00000,
83168,2015,1,24,0.00000,79.00000,72.00000,
83168,2015,1,25,0.00000,79.00000,49.00000,
83168,2015,1,26,0.00000,79.00000,49.00000,
83168,2015,1,27,0.10000,75.00000,53.00000,
83168,2015,1,28,0.00000,68.00000,53.00000,
83168,2015,1,29,0.00000,69.00000,53.00000,
83168,2015,1,30,0.00000,72.00000,60.00000,
83168,2015,1,31,0.00000,76.00000,58.00000,
83168,2015,2,1,0.00000,76.00000,58.00000,
83168,2015,2,2,0.05000,77.00000,58.00000,
83168,2015,2,3,0.00000,84.00000,56.00000,
83168,2015,2,4,0.00000,76.00000,56.00000,

I am unable to rectify the error. How can I overcome the warning about line 26, and how can I solve this error?

Update: when I try the same thing in a different way, reading the dataset from the file directly instead of going through dataframes, I do not get the error. What would be the reason for that? The code is:

weather_filename = 'wether.csv'
weather_file = open(weather_filename)
weather_data = weather_file.read()
weather_file.close()

# Break the weather records into lines
lines = weather_data.split('\n')
labels = lines[0]
values = lines[1:]
n_values = len(values)

# Break the list of comma-separated value strings
# into lists of values.
year = []
month = []
day = []
max_temp = []
j_year = 1
j_month = 2
j_day = 3
j_max_temp = 5
for i_row in range(n_values):
    split_values = values[i_row].split(',')
    if len(split_values) >= j_max_temp:
        year.append(int(split_values[j_year]))
        month.append(int(split_values[j_month]))
        day.append(int(split_values[j_day]))
        max_temp.append(float(split_values[j_max_temp]))

# Isolate the recent data.
i_mid = len(max_temp) // 2
temps = np.array(max_temp[i_mid:])
year = year[i_mid:]
month = month[i_mid:]
day = day[i_mid:]
temps[np.where(temps == -99.9)] = np.nan

# Remove all the nans.
# Trim both ends and fill nans in the middle.
# Find the first non-nan.
i_start = np.where(np.logical_not(np.isnan(temps)))[0][0]
temps = temps[i_start:]
year = year[i_start:]
month = month[i_start:]
day = day[i_start:]
i_nans = np.where(np.isnan(temps))[0]
print(i_nans)

What is wrong in the first code, and why does the second not even give a warning?
Posting as it might help future users. As correctly pointed out by others, np.isnan won't work for object or string dtypes. If you're using pandas, as mentioned here, you can directly use pd.isnull, which should work in your case:

import pandas as pd
import numpy as np

var1 = ''
var2 = np.nan

>>> type(var1)
<class 'str'>
>>> type(var2)
<class 'float'>
>>> pd.isnull(var1)
False
>>> pd.isnull(var2)
True
Try replacing np.isnan with pd.isna: pandas' isna supports category dtypes.
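A quick sketch of why that helps, on a small category Series (np.isnan would raise a TypeError here, as the answer above implies):

import pandas as pd

s = pd.Series(['a', None, 'b'], dtype='category')
print(pd.isna(s).tolist())  # [False, True, False]
# np.isnan(s) raises TypeError for the category dtype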
What's the dtype of temps? I can reproduce your warning and error with a string dtype:

In [26]: temps = np.array([1, 2, 'string', 0])

In [27]: temps
Out[27]: array(['1', '2', 'string', '0'], dtype='<U21')

In [28]: temps == -99.9
/usr/local/bin/ipython3:1: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
Out[28]: False

In [29]: np.isnan(temps)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-29-2ff7754ed926> in <module>()
----> 1 np.isnan(temps)

TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

First, comparing strings with the number gives this future warning. Second, testing for nan produces the error. Note that given the dtype, the nan assignment assigns a string value, not a float (np.nan is a float):

In [30]: temps[-1] = np.nan

In [31]: temps
Out[31]: array(['1', '2', 'string', 'nan'], dtype='<U21')
isnan(ndarray) fails when the ndarray has dtype "object". You can use isnan(ndarray.astype(np.float64)) instead, but strings cannot be coerced to float.
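A short sketch of that cast on an object array (assuming the values really are numeric):

import numpy as np

temps = np.array([1.0, np.nan, 3.0], dtype=object)
# np.isnan(temps) raises TypeError on the object dtype;
# casting to float first makes it work:
print(np.isnan(temps.astype(np.float64)))  # [False  True False]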
This is likely a result of an unwanted float-to-string conversion. To repair it, just reverse it by adding a string-to-float conversion (assuming the data is convertible to a number) using float or np.float64:

np.isnan(float(str(np.nan)))
# True

or

np.isnan(float(str("nan")))
# True

rather than:

np.isnan(str(np.nan))
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In [164], line 1
----> 1 np.isnan(str(np.nan))

TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

Note that if your data is NOT convertible to numbers (floats), you need to use a string-compatible function such as pd.isna instead of np.isnan.
I came across this error when trying to transform my dataset using sklearn.preprocessing.OneHotEncoder. The error was thrown by the _check_unknown function defined in sklearn.utils._encode. It was caused by the fact that, at transform time, one of the columns to be transformed had dtype float64 as opposed to object: in my case an entire column was NaN. The solution was to cast the dataframe to object type before invoking transform:

ohe.transform(data.astype("O"))
Note: this answer is somewhat related to the title of the question, because this error prompts when working with Decimal types.

I got the same error when considering Decimal type values. For some reason, one column of the dataframe I'm considering comes as decimal. For example, when calling .unique() on this column I got

[Decimal('0'), Decimal('95'), Decimal('38'), Decimal('25'),
 Decimal('42'), Decimal('11'), Decimal('18'), Decimal('22'),
 .....
 Decimal('220'), Decimal('724')]

As the traceback of the error showed me that it failed when calling some numpy function, I managed to reproduce the error by considering the min and max values of the above array:

from decimal import Decimal
xmin, xmax = Decimal('0'), Decimal('724')
np.isnan([xmin, xmax])

It will prompt the error:

TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

The solution in this case was to cast all these values to int:

df.astype({col: int for col in desired_columns_to_convert})
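If casting to int is not an option (for example because the column contains missing values), a hedged alternative is pd.isna, which accepts arbitrary objects, including Decimal. A small sketch:

from decimal import Decimal
import pandas as pd

vals = [Decimal('0'), Decimal('724'), None]
# np.isnan(vals) raises TypeError on the object values;
# pd.isna handles them:
print(pd.isna(vals))  # [False False  True]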
Role of name of pandas.Series while doing difference
I have two pandas.Series objects, say a and b, having the same index, and when performing the difference a - b I get the error

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

and I don't understand where it is coming from. The Series a is obtained as a slice of a DataFrame whose index is a MultiIndex, and when I do a renaming a.name = 0 the operation works fine (but if I rename to a tuple I get the same error). Unfortunately, I am not able to reproduce a minimal example of the phenomenon (the difference of ad-hoc Series with a tuple as name seems to work fine). Any ideas on why this is happening? If relevant, the pandas version is 0.22.0.

EDIT

The full traceback of the error:

----------------------------------------------------------------------
ValueError                           Traceback (most recent call last)
<ipython-input-15-e4efbf202d3c> in <module>()
----> 1 one - two

~/venv/lib/python3.4/site-packages/pandas/core/ops.py in wrapper(left, right, name, na_op)
    727
    728         if isinstance(rvalues, ABCSeries):
--> 729             name = _maybe_match_name(left, rvalues)
    730             lvalues = getattr(lvalues, 'values', lvalues)
    731             rvalues = getattr(rvalues, 'values', rvalues)

~/venv/lib/python3.4/site-packages/pandas/core/common.py in _maybe_match_name(a, b)
    137     b_has = hasattr(b, 'name')
    138     if a_has and b_has:
--> 139         if a.name == b.name:
    140             return a.name
    141         else:

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

EDIT 2

Some more details on how a and b are obtained:

- I have a DataFrame df whose index is a MultiIndex (year, id_)
- I have a Series factors whose index is the columns of df (something like the standard deviation of the columns)

Then:

tmp = df.loc[(year, id_)]
a = tmp[factors != 0]
b = factors[factors != 0]
diff = a - b

and executing the last line the error happens.

EDIT 3

And it keeps happening even if I reduce the columns: the original df has around 1000 rows and columns, but reducing to the last 5 lines and columns, the problem persists! For example, by doing

df = df.iloc[-10:][df.columns[-5:]]
line = df.iloc[-3]
factors = factors[df.columns]
a = line[factors != 0]
b = factors[factors != 0]
diff = a - b

I keep getting the same error, while printing a and b I obtain:

a:

end_bin_68.750_100.000    0.002413
end_bin_75.000_100.000    0.002614
end_bin_81.250_100.000    0.001810
end_bin_87.500_100.000    0.002313
end_bin_93.750_100.000    0.001609
Name: (2015, 10000030), dtype: float64

b:

end_bin_68.750_100.000    0.001244
end_bin_75.000_100.000    0.001242
end_bin_81.250_100.000    0.000918
end_bin_87.500_100.000    0.000659
end_bin_93.750_100.000    0.000563
Name: 1, dtype: float64

While if I manually create df and factors with these same values (also in the indices) the error does not happen.
EDIT 4

While debugging, when one gets to the function _maybe_match_name, one obtains the following (where a and b are the two Series printed in EDIT 3):

ipdb> type(a.name)
<class 'tuple'>
ipdb> type(b.name)
<class 'numpy.int64'>
ipdb> (a.name == b.name)
array([False, False])

EDIT 5

Finally I got to a minimal example:

a = pd.Series([1, 2, 3])
a.name = np.int64(13)
b = pd.Series([4, 5, 6])
b.name = (123, 789)
a - b

This raises the error for me, with np.__version__ == 1.14.0 and pd.__version__ == 0.22.0.
When an operation is made between two pandas Series, pandas tries to give a name to the resulting Series:

s1 = pd.Series(np.random.randn(5))
s2 = pd.Series(np.random.randn(5))
s1.name = "hello"
s2.name = "hello"
s3 = s1 - s2
s3.name
>>> "hello"

If the names are not the same, then the resulting Series has no name:

s1 = pd.Series(np.random.randn(5))
s2 = pd.Series(np.random.randn(5))
s1.name = "hello"
s2.name = "goodbye"
s3 = s1 - s2
s3.name
>>>

This is done by comparing Series names with the function _maybe_match_name(), which is here on GitHub. In your case the comparison operator apparently compares an array with a tuple, which is not possible (I haven't been able to reproduce the error), and raises the ValueError exception.

I guess it is a bug. What is weird is that np.int64(42) == ("A", "B") doesn't raise an exception for me, but I do get a FutureWarning from numpy:

FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison.

which makes me think that you are using an extremely recent numpy version (did you compile it from the master branch on GitHub?). The bug will likely be corrected in the next pandas release, as it is a result of a future change in the behavior of numpy. My guess is that the best thing to do is to rename your Series before making the operation, as you already did with b.name = None, or to change your numpy version (1.15.0 works well).
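A minimal sketch of that workaround applied to the reproducing example from the question:

import numpy as np
import pandas as pd

a = pd.Series([1, 2, 3])
a.name = np.int64(13)
b = pd.Series([4, 5, 6])
b.name = (123, 789)

a.name = b.name = None  # sidestep the int64-vs-tuple name comparison
print(a - b)            # works on the affected versions too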
Check if dataframe column is Categorical
I can't seem to get a simple dtype check working with pandas' improved Categoricals in v0.15+. Basically I just want something like is_categorical(column) -> True/False.

import pandas as pd
import numpy as np
import random

df = pd.DataFrame({
    'x': np.linspace(0, 50, 6),
    'y': np.linspace(0, 20, 6),
    'cat_column': random.sample('abcdef', 6)
})
df['cat_column'] = pd.Categorical(df['cat_column'])

We can see that the dtype for the categorical column is 'category':

df.cat_column.dtype
Out[20]: category

And normally we can do a dtype check by just comparing to the name of the dtype:

df.x.dtype == 'float64'
Out[21]: True

But this doesn't seem to work when trying to check if the x column is categorical:

df.x.dtype == 'category'
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-22-94d2608815c4> in <module>()
----> 1 df.x.dtype == 'category'

TypeError: data type "category" not understood

Is there any way to do these types of checks in pandas v0.15+?
Use the name property to do the comparison instead; it should always work because it's just a string:

>>> import numpy as np
>>> arr = np.array([1, 2, 3, 4])
>>> arr.dtype.name
'int64'

>>> import pandas as pd
>>> cat = pd.Categorical(['a', 'b', 'c'])
>>> cat.dtype.name
'category'

So, to sum up, you can end up with a simple, straightforward function:

def is_categorical(array_like):
    return array_like.dtype.name == 'category'
First, the string representation of the dtype is 'category' and not 'categorical', so this works:

In [41]: df.cat_column.dtype == 'category'
Out[41]: True

But indeed, as you noticed, this comparison gives a TypeError for other dtypes, so you would have to wrap it with a try .. except .. block.

Other ways to check using pandas internals:

In [42]: isinstance(df.cat_column.dtype, pd.api.types.CategoricalDtype)
Out[42]: True

In [43]: pd.api.types.is_categorical_dtype(df.cat_column)
Out[43]: True

For non-categorical columns, those statements will return False instead of raising an error. For example:

In [44]: pd.api.types.is_categorical_dtype(df.x)
Out[44]: False

For much older versions of pandas, replace pd.api.types in the above snippet with pd.core.common.
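For completeness, a short sketch of the try .. except .. wrapper mentioned above:

def is_categorical(series):
    try:
        return series.dtype == 'category'
    except TypeError:  # raised for e.g. float64 columns on older pandas
        return False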
Just putting this here because pandas.DataFrame.select_dtypes() is what I was actually looking for:

df['column'].name in df.select_dtypes(include='category').columns

Thanks to @Jeff.
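A quick usage sketch of the select_dtypes approach on a toy frame:

import pandas as pd

df = pd.DataFrame({'x': [1.0, 2.0], 'c': pd.Categorical(['a', 'b'])})
print('c' in df.select_dtypes(include='category').columns)  # True
print('x' in df.select_dtypes(include='category').columns)  # False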
In my pandas version (v1.0.3), a shorter version of joris' answer is available:

df = pd.DataFrame({'noncat': [1, 2, 3], 'categ': pd.Categorical(['A', 'B', 'C'])})

print(isinstance(df.noncat.dtype, pd.CategoricalDtype))  # False
print(isinstance(df.categ.dtype, pd.CategoricalDtype))   # True

print(pd.CategoricalDtype.is_dtype(df.noncat))  # False
print(pd.CategoricalDtype.is_dtype(df.categ))   # True
I ran into this thread looking for the exact same functionality, and also found another option, right from the pandas documentation here.

It looks like the canonical way to check if a pandas DataFrame column is a categorical Series should be the following:

hasattr(column_to_check, 'cat')

So, as per the example given in the initial question, this would be:

hasattr(df.x, 'cat')           # False, x is a float column
hasattr(df.cat_column, 'cat')  # True
Nowadays you can use:

pandas.api.types.is_categorical_dtype(series)

Docs here: https://pandas.pydata.org/docs/reference/api/pandas.api.types.is_categorical_dtype.html

Available since at least pandas 1.0. (Note that in pandas 2.1+ this function is deprecated in favor of isinstance(dtype, pd.CategoricalDtype), as shown in the answers above.)
Taking a look at @Jeff Tratner's answer: since df.cat_column.dtype == 'category' does not need to be True for a column to be treated as categorical in practice, I propose considering as categorical any column whose dtype name is in a 'categorical_dtypes' list:

def is_cat(column):
    categorical_dtypes = ['object', 'category', 'bool']
    return column.dtype.name in categorical_dtypes