I can't seem to get a simple dtype check working with Pandas' improved Categoricals in v0.15+. Basically I just want something like is_categorical(column) -> True/False.
import pandas as pd
import numpy as np
import random
df = pd.DataFrame({
'x': np.linspace(0, 50, 6),
'y': np.linspace(0, 20, 6),
'cat_column': random.sample('abcdef', 6)
})
df['cat_column'] = pd.Categorical(df['cat_column'])
We can see that the dtype for the categorical column is 'category':
df.cat_column.dtype
Out[20]: category
And normally we can do a dtype check by just comparing to the name
of the dtype:
df.x.dtype == 'float64'
Out[21]: True
But this doesn't seem to work when trying to check if the x column
is categorical:
df.x.dtype == 'category'
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-22-94d2608815c4> in <module>()
----> 1 df.x.dtype == 'category'
TypeError: data type "category" not understood
Is there any way to do these types of checks in pandas v0.15+?
Use the name property to do the comparison instead; it should always work because it's just a string:
>>> import numpy as np
>>> arr = np.array([1, 2, 3, 4])
>>> arr.dtype.name
'int64'
>>> import pandas as pd
>>> cat = pd.Categorical(['a', 'b', 'c'])
>>> cat.dtype.name
'category'
So, to sum up, you can end up with a simple, straightforward function:
def is_categorical(array_like):
    return array_like.dtype.name == 'category'
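For example, applied to the DataFrame from the question (assuming df is built as above), the check behaves as expected:
>>> is_categorical(df.cat_column)
True
>>> is_categorical(df.x)
False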
First, the string representation of the dtype is 'category' and not 'categorical', so this works:
In [41]: df.cat_column.dtype == 'category'
Out[41]: True
But indeed, as you noticed, this comparison gives a TypeError for other dtypes, so you would have to wrap it with a try .. except .. block.
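A minimal sketch of such a wrapper (the function name is just illustrative, not a pandas API):
def dtype_is_category(series):
    try:
        return series.dtype == 'category'
    except TypeError:
        # comparing a plain numpy dtype against 'category' may raise TypeError
        return False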
Other ways to check using pandas internals:
In [42]: isinstance(df.cat_column.dtype, pd.api.types.CategoricalDtype)
Out[42]: True
In [43]: pd.api.types.is_categorical_dtype(df.cat_column)
Out[43]: True
For non-categorical columns, those statements will return False instead of raising an error. For example:
In [44]: pd.api.types.is_categorical_dtype(df.x)
Out[44]: False
For much older versions of pandas, replace pd.api.types in the above snippet with pd.core.common.
Just putting this here because pandas.DataFrame.select_dtypes() is what I was actually looking for:
df['column'].name in df.select_dtypes(include='category').columns
Thanks to @Jeff.
In my pandas version (v1.0.3), a shorter version of joris' answer is available.
df = pd.DataFrame({'noncat': [1, 2, 3], 'categ': pd.Categorical(['A', 'B', 'C'])})
print(isinstance(df.noncat.dtype, pd.CategoricalDtype)) # False
print(isinstance(df.categ.dtype, pd.CategoricalDtype)) # True
print(pd.CategoricalDtype.is_dtype(df.noncat)) # False
print(pd.CategoricalDtype.is_dtype(df.categ)) # True
I ran into this thread looking for the exact same functionality, and found another option right in the pandas documentation here.
It looks like the canonical way to check if a pandas dataframe column is a categorical Series should be the following:
hasattr(column_to_check, 'cat')
So, as per the example given in the initial question, this would be:
hasattr(df.cat_column, 'cat')  # True
hasattr(df.x, 'cat')           # False
Nowadays you can use:
pandas.api.types.is_categorical_dtype(series)
Docs here: https://pandas.pydata.org/docs/reference/api/pandas.api.types.is_categorical_dtype.html
Available since at least pandas 1.0
Building on @Jeff Tratner's answer: a column does not need to have dtype == 'category' to hold categorical data, so I propose this function, which treats as categorical any dtype in the categorical_dtypes list:
def is_cat(column):
    categorical_dtypes = ['object', 'category', 'bool']
    return column.dtype.name in categorical_dtypes
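For illustration, using the question's df (just a sketch; counting object and bool dtypes as categorical is a design choice of this answer):
is_cat(df.cat_column)  # True  ('category')
is_cat(df.x)           # False ('float64')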
I need the index to start at 1 rather than 0 when writing a Pandas DataFrame to CSV.
Here's an example:
In [1]: import pandas as pd
In [2]: result = pd.DataFrame({'Count': [83, 19, 20]})
In [3]: result.to_csv('result.csv', index_label='Event_id')
Which produces the following output:
In [4]: !cat result.csv
Event_id,Count
0,83
1,19
2,20
But my desired output is this:
In [5]: !cat result2.csv
Event_id,Count
1,83
2,19
3,20
I realize that this could be done by adding a sequence of integers shifted by 1 as a column to my data frame, but I'm new to Pandas and I'm wondering if a cleaner way exists.
The index is an object, and the default index starts from 0:
>>> result.index
Int64Index([0, 1, 2], dtype=int64)
You can shift this index by 1 with
>>> result.index += 1
>>> result.index
Int64Index([1, 2, 3], dtype=int64)
Just set the index before writing to CSV.
df.index = np.arange(1, len(df) + 1)
And then write it normally.
source: In Python pandas, start row index from 1 instead of zero without creating additional column
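Put together for the question's example, a minimal sketch could be:
import numpy as np
import pandas as pd

result = pd.DataFrame({'Count': [83, 19, 20]})
result.index = np.arange(1, len(result) + 1)          # 1-based index
result.to_csv('result.csv', index_label='Event_id')   # Event_id,Count / 1,83 / 2,19 / 3,20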
Working example:
import pandas as pdas
dframe = pdas.read_csv(open(input_file))
dframe.index = dframe.index + 1
Another way in one line (note that shift() moves the values down one row, so the last row of data is dropped):
df.shift()[1:]
In my opinion best practice is to set the index with a RangeIndex
import pandas as pd
result = pd.DataFrame(
{'Count': [83, 19, 20]},
index=pd.RangeIndex(start=1, stop=4, name='index')
)
>>> result
Count
index
1 83
2 19
3 20
I prefer this, because you can define the range and a possible step and a name for the index in one line.
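If you'd rather not hard-code stop=4, a small variation (just a sketch; the Event_id label is taken from the question) derives the stop value from the data length:
import pandas as pd

counts = [83, 19, 20]
result = pd.DataFrame(
    {'Count': counts},
    index=pd.RangeIndex(start=1, stop=len(counts) + 1, name='Event_id')
)
result.to_csv('result.csv')  # the named index is written as the Event_id column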
This worked for me
df.index = np.arange(1, len(df)+1)
You can use this one:
import pandas as pd
result = pd.DataFrame({'Count': [83, 19, 20]})
result.index += 1
print(result)
or this one, by getting the help of numpy library like this:
import pandas as pd
import numpy as np
result = pd.DataFrame({'Count': [83, 19, 20]})
result.index = np.arange(1, len(result)+1)
print(result)
np.arange creates a numpy array with values in the half-open interval [1, len(result)+1), i.e. 1 through len(result), and that array is then assigned to result.index.
Use this:
df.index = np.arange(1, len(df)+1)
Add ".shift()[1:]" while creating a data frame
data = pd.read_csv(r"C:\Users\user\path\data.csv").shift()[1:]
Forking from the original answer to add my two cents: if I'm not mistaken, starting from version 0.23 the default index object is of type RangeIndex.
From the official doc:
RangeIndex is a memory-saving special case of Int64Index limited to representing monotonic ranges. Using RangeIndex may in some instances improve computing speed.
For a huge index range this makes sense: only the compact representation of the index is stored instead of materializing the whole index at once (saving memory).
Therefore, an example (using Series, but it applies to DataFrame also):
>>> import pandas as pd
>>>
>>> countries = ['China', 'India', 'USA']
>>> ds = pd.Series(countries)
>>>
>>>
>>> type(ds.index)
<class 'pandas.core.indexes.range.RangeIndex'>
>>> ds.index
RangeIndex(start=0, stop=3, step=1)
>>>
>>> ds.index += 1
>>>
>>> ds.index
RangeIndex(start=1, stop=4, step=1)
>>>
>>> ds
1 China
2 India
3 USA
dtype: object
>>>
As you can see, incrementing the index object changes the start and stop parameters.
This adds a column that accomplishes what you want
df.insert(0,"Column Name", np.arange(1,len(df)+1))
Following on from TomAugspurger's answer, we could use a list comprehension rather than np.arange(), which removes the need to import numpy. You can use the following instead:
df.index = [i+1 for i in range(len(df))]
I would like to find matching strings in a path and use np.select to create a new column with labels dependant on the matches I found.
This is what I have written
import numpy as np
conditions = [a["properties_path"].str.contains('blog'),
a["properties_path"].str.contains('credit-card-readers/|machines|poss|team|transaction_fees'),
a["properties_path"].str.contains('signup|sign-up|create-account|continue|checkout'),
a["properties_path"].str.contains('complete'),
a["properties_path"] == '/za/|/',
a["properties_path"].str.contains('promo')]
choices = [ "blog","info_pages","signup","completed","home_page","promo"]
a["page_type"] = np.select(conditions, choices, default=np.nan)
However, when I run this code, I get this error message:
ValueError: invalid entry 0 in condlist: should be boolean ndarray
Here is a sample of my data
3124465 /blog/ts-st...
3124466 /card-machines
3124467 /card-machines
3124468 /card-machines
3124469 /promo/our-gift-to-you
3124470 /create-account/v1
3124471 /za/signup/
3124472 /create-account/v1
3124473 /sign-up
3124474 /za/
3124475 /sign-up/cart
3124476 /checkout/
3124477 /complete
3124478 /card-machines
3124479 /continue
3124480 /blog/article/get-car...
3124481 /blog/article/get-car...
3124482 /za/signup/
3124483 /credit-card-readers
3124484 /signup
3124485 /credit-card-readers
3124486 /create-account/v1
3124487 /credit-card-readers
3124488 /point-of-sale-app
3124489 /create-account/v1
3124490 /point-of-sale-app
3124491 /credit-card-readers
The .str methods operate on object columns. Such columns can contain non-string values, and for those rows pandas returns NaN instead of False. np.select then complains because the condition array is not boolean.
Luckily, there's an argument to handle this: na=False
a["properties_path"].str.contains('blog', na=False)
Alternatively, you could change your conditions to:
a["properties_path"].str.contains('blog') == True
#or
a["properties_path"].str.contains('blog').fillna(False)
Sample
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [1, 'foo', 'bar']})
conds = df.a.str.contains('f')
#0 NaN
#1 True
#2 False
#Name: a, dtype: object
np.select([conds], ['XX'])
#ValueError: invalid entry 0 in condlist: should be boolean ndarray
conds = df.a.str.contains('f', na=False)
#0 False
#1 True
#2 False
#Name: a, dtype: bool
np.select([conds], ['XX'])
#array(['0', 'XX', '0'], dtype='<U11')
Your data seem to contain NaN, so the conditions contain NaN as well, which breaks np.select. To fix this, you can do:
s = a["properties_path"].fillna('')
and replace a['properties_path'] in each condition with s.
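Applied to the question's code, that would look roughly like this (a is the question's DataFrame; the conditions are unchanged apart from reading from the NaN-free s):
s = a["properties_path"].fillna('')

conditions = [
    s.str.contains('blog'),
    s.str.contains('credit-card-readers/|machines|poss|team|transaction_fees'),
    s.str.contains('signup|sign-up|create-account|continue|checkout'),
    s.str.contains('complete'),
    s == '/za/|/',
    s.str.contains('promo'),
]
choices = ["blog", "info_pages", "signup", "completed", "home_page", "promo"]
a["page_type"] = np.select(conditions, choices, default=np.nan)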
I'm trying to test if one of my variables is pd.NaT. I know it is NaT, and still it won't pass the test. As an example, the following code prints nothing :
a=pd.NaT
if a == pd.NaT:
    print("a not NaT")
Does anyone have a clue? Is there a way to effectively test whether a is NaT?
Pandas NaT behaves like a floating-point NaN, in that it's not equal to itself. Instead, you can use pandas.isnull:
In [21]: pandas.isnull(pandas.NaT)
Out[21]: True
This also returns True for None and NaN.
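For instance, continuing in the same session:
In [22]: pandas.isnull(None)
Out[22]: True
In [23]: pandas.isnull(float('nan'))
Out[23]: True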
Technically, you could also check for Pandas NaT with x != x, following a common pattern used for floating-point NaN. However, this is likely to cause issues with NumPy NaTs, which look very similar and represent the same concept, but are actually a different type with different behavior:
In [29]: x = pandas.NaT
In [30]: y = numpy.datetime64('NaT')
In [31]: x != x
Out[31]: True
In [32]: y != y
/home/i850228/.local/lib/python3.6/site-packages/IPython/__main__.py:1: FutureWarning: In the future, NAT != NAT will be True rather than False.
# encoding: utf-8
Out[32]: False
numpy.isnat, the function to check for NumPy NaT, also fails with a Pandas NaT:
In [33]: numpy.isnat(pandas.NaT)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-33-39a66bbf6513> in <module>()
----> 1 numpy.isnat(pandas.NaT)
TypeError: ufunc 'isnat' is only defined for datetime and timedelta.
pandas.isnull works for both Pandas and NumPy NaTs, so it's probably the way to go:
In [34]: pandas.isnull(pandas.NaT)
Out[34]: True
In [35]: pandas.isnull(numpy.datetime64('NaT'))
Out[35]: True
This works for me:
pd.NaT is pd.NaT  # True
You can also use pandas.isna() for pandas.NaT, numpy.nan or None:
import pandas as pd
import numpy as np
x = (pd.NaT, np.nan, None)
[pd.isna(i) for i in x]
Output:
[True, True, True]
If it's in a Series (e.g. DataFrame column) you can also use .isna():
pd.Series(pd.NaT).isna()
# 0 True
# dtype: bool
This is what works for me
>>> a = pandas.NaT
>>> type(a) == pandas._libs.tslibs.nattype.NaTType
True