I would like to find matching strings in a path and use np.select to create a new column with labels dependent on the matches I found.
This is what I have written
import numpy as np
conditions = [
    a["properties_path"].str.contains('blog'),
    a["properties_path"].str.contains('credit-card-readers/|machines|poss|team|transaction_fees'),
    a["properties_path"].str.contains('signup|sign-up|create-account|continue|checkout'),
    a["properties_path"].str.contains('complete'),
    a["properties_path"] == '/za/|/',
    a["properties_path"].str.contains('promo')]
choices = ["blog", "info_pages", "signup", "completed", "home_page", "promo"]
a["page_type"] = np.select(conditions, choices, default=np.nan)
However, when I run this code, I get this error message:
ValueError: invalid entry 0 in condlist: should be boolean ndarray
Here is a sample of my data
3124465 /blog/ts-st...
3124466 /card-machines
3124467 /card-machines
3124468 /card-machines
3124469 /promo/our-gift-to-you
3124470 /create-account/v1
3124471 /za/signup/
3124472 /create-account/v1
3124473 /sign-up
3124474 /za/
3124475 /sign-up/cart
3124476 /checkout/
3124477 /complete
3124478 /card-machines
3124479 /continue
3124480 /blog/article/get-car...
3124481 /blog/article/get-car...
3124482 /za/signup/
3124483 /credit-card-readers
3124484 /signup
3124485 /credit-card-readers
3124486 /create-account/v1
3124487 /credit-card-readers
3124488 /point-of-sale-app
3124489 /create-account/v1
3124490 /point-of-sale-app
3124491 /credit-card-readers
The .str methods operate on object columns. Such columns can contain non-string values, and for those rows pandas returns NaN instead of False. np.select then complains because the condition array is not boolean.
Luckily, there's an argument to handle this: na=False
a["properties_path"].str.contains('blog', na=False)
Alternatively, you could change your conditions to:
a["properties_path"].str.contains('blog') == True
#or
a["properties_path"].str.contains('blog').fillna(False)
Sample
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [1, 'foo', 'bar']})
conds = df.a.str.contains('f')
#0 NaN
#1 True
#2 False
#Name: a, dtype: object
np.select([conds], ['XX'])
#ValueError: invalid entry 0 in condlist: should be boolean ndarray
conds = df.a.str.contains('f', na=False)
#0 False
#1 True
#2 False
#Name: a, dtype: bool
np.select([conds], ['XX'])
#array(['0', 'XX', '0'], dtype='<U11')
Your data seem to contain NaN, so the conditions also contain NaN, which breaks np.select. To fix this, you can do:
s = a["properties_path"].fillna('')
and replace a['properties_path'] in each condition with s.
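A shortened sketch of that idea, using just two of the patterns (assuming a is the DataFrame from the question):
s = a["properties_path"].fillna('')   # NaN becomes '', which matches none of the patterns
conditions = [s.str.contains('blog'),
              s.str.contains('promo')]
choices = ["blog", "promo"]
a["page_type"] = np.select(conditions, choices, default=np.nan)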
Related
I can't seem to get a simple dtype check working with Pandas' improved Categoricals in v0.15+. Basically I just want something like is_categorical(column) -> True/False.
import pandas as pd
import numpy as np
import random
df = pd.DataFrame({
'x': np.linspace(0, 50, 6),
'y': np.linspace(0, 20, 6),
'cat_column': random.sample('abcdef', 6)
})
df['cat_column'] = pd.Categorical(df['cat_column'])
We can see that the dtype for the categorical column is 'category':
df.cat_column.dtype
Out[20]: category
And normally we can do a dtype check by just comparing to the name
of the dtype:
df.x.dtype == 'float64'
Out[21]: True
But this doesn't seem to work when trying to check if the x column
is categorical:
df.x.dtype == 'category'
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-22-94d2608815c4> in <module>()
----> 1 df.x.dtype == 'category'
TypeError: data type "category" not understood
Is there any way to do these types of checks in pandas v0.15+?
Use the name property to do the comparison instead; it will always work because it's just a string:
>>> import numpy as np
>>> arr = np.array([1, 2, 3, 4])
>>> arr.dtype.name
'int64'
>>> import pandas as pd
>>> cat = pd.Categorical(['a', 'b', 'c'])
>>> cat.dtype.name
'category'
So, to sum up, you can end up with a simple, straightforward function:
def is_categorical(array_like):
    return array_like.dtype.name == 'category'
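For example, with the DataFrame from the question, this behaves as follows (a small sketch using the columns defined above):
is_categorical(df.cat_column)   # True
is_categorical(df.x)            # False (plain float64 column)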
First, the string representation of the dtype is 'category' and not 'categorical', so this works:
In [41]: df.cat_column.dtype == 'category'
Out[41]: True
But indeed, as you noticed, this comparison gives a TypeError for other dtypes, so you would have to wrap it with a try .. except .. block.
Other ways to check using pandas internals:
In [42]: isinstance(df.cat_column.dtype, pd.api.types.CategoricalDtype)
Out[42]: True
In [43]: pd.api.types.is_categorical_dtype(df.cat_column)
Out[43]: True
For non-categorical columns, those statements will return False instead of raising an error. For example:
In [44]: pd.api.types.is_categorical_dtype(df.x)
Out[44]: False
For much older versions of pandas, replace pd.api.types in the above snippet with pd.core.common.
Just putting this here because pandas.DataFrame.select_dtypes() is what I was actually looking for:
df['column'].name in df.select_dtypes(include='category').columns
Thanks to @Jeff.
In my pandas version (v1.0.3), a shorter version of joris' answer is available.
df = pd.DataFrame({'noncat': [1, 2, 3], 'categ': pd.Categorical(['A', 'B', 'C'])})
print(isinstance(df.noncat.dtype, pd.CategoricalDtype)) # False
print(isinstance(df.categ.dtype, pd.CategoricalDtype)) # True
print(pd.CategoricalDtype.is_dtype(df.noncat)) # False
print(pd.CategoricalDtype.is_dtype(df.categ)) # True
I ran into this thread looking for the exact same functionality, and also found out another option, right from the pandas documentation here.
It looks like the canonical way to check if a pandas dataframe column is a categorical Series should be the following:
hasattr(column_to_check, 'cat')
So, as per the example given in the initial question, this would be:
hasattr(df.cat_column, 'cat')  # True
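And a non-categorical column does not have the .cat accessor, so the same check returns False:
hasattr(df.x, 'cat')   # False, x is a plain float64 column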
Nowadays you can use:
pandas.api.types.is_categorical_dtype(series)
Docs here: https://pandas.pydata.org/docs/reference/api/pandas.api.types.is_categorical_dtype.html
Available since at least pandas 1.0
Taking a look at @Jeff Tratner's answer: since the condition df.cat_column.dtype == 'category' does not need to be True for a column to be considered categorical, I propose treating as categorical any dtype within the 'categorical_dtypes' list:
def is_cat(column):
    categorical_dtypes = ['object', 'category', 'bool']
    if column.dtype.name in categorical_dtypes:
        return True
    else:
        return False
An example dataset I'm working with
df = pd.DataFrame({"competitorname": ["3 Musketeers", "Almond Joy"], "winpercent": [67.602936, 50.347546] }, index = [1, 2])
I am trying to see whether 3 Musketeers or Almond Joy has a higher winpercent. The code I wrote is:
more_popular = '3 Musketeers' if df.loc[df["competitorname"] == '3 Musketeers', 'winpercent'].values[0] > df.loc[df["competitorname"] == 'Almond Joy', 'winpercent'].values[0] else 'Almond Joy'
My question is:
Can I select the values I am interested in without Python returning a Series? Is there a way to just do
df[df["competitorname"] == 'Almond Joy', 'winpercent']
and then it would return a simple
50.347546
?
I know this doesn't make my code significantly shorter but I feel like I am missing something about getting values from pandas that would help me avoid constantly adding
.values[0]
The underlying issue is that there could be multiple matches, so we will always need to extract the match(es) at some point in the pipeline:
Use Series.idxmax on the boolean mask
Since False is 0 and True is 1, using Series.idxmax on the boolean mask will give you the index of the first True:
df.loc[df['competitorname'].eq('Almond Joy').idxmax(), 'winpercent']
# 50.347546
This assumes there is at least 1 True match, otherwise it will return the first False.
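If it's possible that the competitor is missing entirely, one way to guard against silently returning the wrong row is to check the mask first (a sketch, not part of the original answer):
mask = df['competitorname'].eq('Almond Joy')
value = df.loc[mask.idxmax(), 'winpercent'] if mask.any() else None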
Or use Series.item on the result
This is essentially a stricter equivalent of Series.values[0]:
df.loc[df['competitorname'].eq('Almond Joy'), 'winpercent'].item()
# 50.347546
This assumes there is exactly 1 True match, otherwise it will throw a ValueError.
How about simply sorting the dataframe by "winpercent" and then taking the top row?
df.sort_values(by="winpercent", ascending=False, inplace=True)
then to see the winner's row
df.head(1)
or to get the values
df.iloc[0]["winpercent"]
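Applied to the original question, this could look something like the sketch below; filtering to the two candidates first matters if the full dataset contains more rows than the sample shown:
candidates = df[df["competitorname"].isin(['3 Musketeers', 'Almond Joy'])]
more_popular = candidates.sort_values("winpercent", ascending=False).iloc[0]["competitorname"]
# '3 Musketeers' for the sample data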
If you're sure that the returned Series has a single element, you can simply use .item() to get it:
import pandas as pd
df = pd.DataFrame({
"competitorname": ["3 Musketeers", "Almond Joy"],
"winpercent": [67.602936, 50.347546]
}, index = [1, 2])
s = df.loc[df["competitorname"] == 'Almond Joy', 'winpercent'] # a pandas Series
print(s)
# output
# 2 50.347546
# Name: winpercent, dtype: float64
v = df.loc[df["competitorname"] == 'Almond Joy', 'winpercent'].item() # a scalar value
print(v)
# output
# 50.347546
import pandas as pd
df = pd.DataFrame({'RMDS': ['10.686000','NYSE_XNAS','0.472590','qrtr'], 'Mstar': ['10.690000', 'NYSE_XNAS', '0.473590','mnthly']})
Dataframe df will look like this:
Mstar RMDS
0 10.690000 10.686000
1 NYSE_XNAS NYSE_XNAS
2 0.473590 0.472590
3 mnthly qrtr
I want to compare the values of 'RMDS' with 'Mstar'. The dtype of the dataframe is 'object'; it is a huge dataframe, and I need to compare rounded values:
mask = np.around(pd.to_numeric(df.Mstar), 2) != np.around(pd.to_numeric(df.RMDS), 2)
df_Difference = df[mask]
Since the values in the columns are not consistent, the logic above fails whenever a string value like 'qrtr' appears, because I am using pd.to_numeric. But I still want to compare 'qrtr' in 'RMDS' with 'mnthly' in 'Mstar'.
Is there any way to handle this type of situation?
Use pd.to_numeric to convert what you can, then .fillna to get everything back that wasn't converted.
import pandas as pd
import numpy as np
df = np.round(df.apply(pd.to_numeric, errors='coerce'),2).fillna(df)
# RMDS Mstar
#0 10.69 10.69
#1 NYSE_XNAS NYSE_XNAS
#2 0.47 0.47
#3 qrtr mnthly
df.RMDS == df.Mstar
#0 True
#1 True
#2 True
#3 False
#dtype: bool
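To recover the differing rows (the df_Difference from the question), the same comparison can be reused; a sketch, assuming df has already been rounded and filled as above:
df_Difference = df[df.RMDS != df.Mstar]
#    RMDS   Mstar
# 3  qrtr  mnthly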
Alternatively, define your own function and use .applymap
def my_round(x):
    try:
        return np.round(float(x), 2)
    except ValueError:
        return x
df = df.applymap(my_round)
I'm trying to test if one of my variables is pd.NaT. I know it is NaT, and still it won't pass the test. As an example, the following code prints nothing:
a = pd.NaT
if a == pd.NaT:
    print("a not NaT")
Does anyone have a clue? Is there a way to effectively test whether a is NaT?
Pandas NaT behaves like a floating-point NaN, in that it's not equal to itself. Instead, you can use pandas.isnull:
In [21]: pandas.isnull(pandas.NaT)
Out[21]: True
This also returns True for None and NaN.
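For instance, both of the following return True as well:
pandas.isnull(None)           # True
pandas.isnull(float('nan'))   # True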
Technically, you could also check for Pandas NaT with x != x, following a common pattern used for floating-point NaN. However, this is likely to cause issues with NumPy NaTs, which look very similar and represent the same concept, but are actually a different type with different behavior:
In [29]: x = pandas.NaT
In [30]: y = numpy.datetime64('NaT')
In [31]: x != x
Out[31]: True
In [32]: y != y
/home/i850228/.local/lib/python3.6/site-packages/IPython/__main__.py:1: FutureWarning: In the future, NAT != NAT will be True rather than False.
# encoding: utf-8
Out[32]: False
numpy.isnat, the function to check for NumPy NaT, also fails with a Pandas NaT:
In [33]: numpy.isnat(pandas.NaT)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-33-39a66bbf6513> in <module>()
----> 1 numpy.isnat(pandas.NaT)
TypeError: ufunc 'isnat' is only defined for datetime and timedelta.
pandas.isnull works for both Pandas and NumPy NaTs, so it's probably the way to go:
In [34]: pandas.isnull(pandas.NaT)
Out[34]: True
In [35]: pandas.isnull(numpy.datetime64('NaT'))
Out[35]: True
pd.NaT is pd.NaT
True
This works for me.
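So the check from the question can be written as an identity test instead; a small sketch (pd.NaT is a single shared object, which is why `is` works here, though pd.isna is safer for values pulled out of arrays):
import pandas as pd

a = pd.NaT
if a is pd.NaT:
    print("a is NaT")   # this prints, unlike the == comparison in the question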
You can also use pandas.isna() for pandas.NaT, numpy.nan or None:
import pandas as pd
import numpy as np
x = (pd.NaT, np.nan, None)
[pd.isna(i) for i in x]
Output:
[True, True, True]
If it's in a Series (e.g. DataFrame column) you can also use .isna():
pd.Series(pd.NaT).isna()
# 0 True
# dtype: bool
This is what works for me:
>>> a = pandas.NaT
>>> type(a) == pandas._libs.tslibs.nattype.NaTType
True