I am trying to do the following:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Name': ('Harry','Sally','Megan'), 'Age': (30, 31,'NN')})
a={'target':"Age2",'check':"==30",'iftrue':["Is"]}
condis = [
    df['Age'] a['check']
]
df[a['target']] = np.select(condis, a['iftrue'], default=" ")
print(df)
I am stuck trying to convert the a['check'] parameter, received as a string, into an expression, so that this:
df['Age'] a['check']
should resolve/compile to
df['Age'] ==30
Could someone give me any ideas on how to achieve this? Maybe I am missing something very basic and simple here.
Thanks.
You can use eval to convert string to condition:
check = "==30"
age = "20"
print(eval(age+check))
>>> False
But it's not recommended: eval is a function to use very carefully, since it can execute arbitrary code, which creates security issues and makes debugging hard.
A more proper solution would be, for example, to have one argument for the comparison operator and one for the comparison value:
check_op = np.equal
check_arg = 30
print(check_op(check_arg, 20))
>>> False
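For instance, here's a minimal sketch of that idea applied to the original question's DataFrame, using the standard operator module; the ops mapping and the check_op / check_arg key names are my own illustration, not part of the question:
import operator
import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': ('Harry', 'Sally', 'Megan'), 'Age': (30, 31, 'NN')})

# Map operator strings to functions instead of eval-ing them
ops = {'==': operator.eq, '!=': operator.ne, '>': operator.gt, '<': operator.lt}

a = {'target': 'Age2', 'check_op': '==', 'check_arg': 30, 'iftrue': ['Is']}

# Look up the comparison function and apply it to the column
condis = [ops[a['check_op']](df['Age'], a['check_arg'])]
df[a['target']] = np.select(condis, a['iftrue'], default=" ")
print(df)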
What I want to do is delete certain parts of a string, take only the number near ACoS, and insert it into a new column.
import pandas as pd
data = [{"Campaign" : "Sf l Spy l Branded l ACoS 20 l Manual NX"}]
df = pd.DataFrame(data)
df.insert(1,"targetAcos", 0)
df["targetAcos"] = df["Campaign"].str.replace(r' l ACoS \(.*)\l', r'\1', regex=True)
print(df["targetAcos"])
But I guess I am kinda bad at this; I couldn't get it right, so I hope you can explain how to do it.
I think the Pandas function you want to be using here is str.extract:
df["targetAcos"] = df["Campaign"].str.extract(r'\bl ACoS (\d+) l')
Or perhaps a more generic regex would be:
df["targetAcos"] = df["Campaign"].str.extract(r'\bACoS (\d+)\b')
I can't seem to get a simple dtype check working with Pandas' improved Categoricals in v0.15+. Basically I just want something like is_categorical(column) -> True/False.
import pandas as pd
import numpy as np
import random
df = pd.DataFrame({
    'x': np.linspace(0, 50, 6),
    'y': np.linspace(0, 20, 6),
    'cat_column': random.sample('abcdef', 6)
})
df['cat_column'] = pd.Categorical(df['cat_column'])
We can see that the dtype for the categorical column is 'category':
df.cat_column.dtype
Out[20]: category
And normally we can do a dtype check by just comparing to the name
of the dtype:
df.x.dtype == 'float64'
Out[21]: True
But this doesn't seem to work when trying to check if the x column
is categorical:
df.x.dtype == 'category'
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-22-94d2608815c4> in <module>()
----> 1 df.x.dtype == 'category'
TypeError: data type "category" not understood
Is there any way to do these types of checks in pandas v0.15+?
Use the name property to do the comparison instead; it should always work because it's just a string:
>>> import numpy as np
>>> arr = np.array([1, 2, 3, 4])
>>> arr.dtype.name
'int64'
>>> import pandas as pd
>>> cat = pd.Categorical(['a', 'b', 'c'])
>>> cat.dtype.name
'category'
So, to sum up, you can end up with a simple, straightforward function:
def is_categorical(array_like):
    return array_like.dtype.name == 'category'
First, the string representation of the dtype is 'category' and not 'categorical', so this works:
In [41]: df.cat_column.dtype == 'category'
Out[41]: True
But indeed, as you noticed, this comparison gives a TypeError for other dtypes, so you would have to wrap it with a try .. except .. block.
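For example, a small wrapper along those lines (my own sketch, not part of the original answer):
def is_categorical(series):
    # The equality check raises TypeError for most non-categorical dtypes
    try:
        return series.dtype == 'category'
    except TypeError:
        return False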
Other ways to check using pandas internals:
In [42]: isinstance(df.cat_column.dtype, pd.api.types.CategoricalDtype)
Out[42]: True
In [43]: pd.api.types.is_categorical_dtype(df.cat_column)
Out[43]: True
For non-categorical columns, those statements will return False instead of raising an error. For example:
In [44]: pd.api.types.is_categorical_dtype(df.x)
Out[44]: False
For much older versions of pandas, replace pd.api.types in the above snippet with pd.core.common.
Just putting this here because pandas.DataFrame.select_dtypes() is what I was actually looking for:
df['column'].name in df.select_dtypes(include='category').columns
Thanks to @Jeff.
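For instance, with the df from the question, this check would look like (my own illustration):
print('cat_column' in df.select_dtypes(include='category').columns)  # True
print('x' in df.select_dtypes(include='category').columns)           # False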
In my pandas version (v1.0.3), a shorter version of joris' answer is available.
df = pd.DataFrame({'noncat': [1, 2, 3], 'categ': pd.Categorical(['A', 'B', 'C'])})
print(isinstance(df.noncat.dtype, pd.CategoricalDtype)) # False
print(isinstance(df.categ.dtype, pd.CategoricalDtype)) # True
print(pd.CategoricalDtype.is_dtype(df.noncat)) # False
print(pd.CategoricalDtype.is_dtype(df.categ)) # True
I ran into this thread looking for the exact same functionality, and found another option, right from the pandas documentation here.
It looks like the canonical way to check if a pandas dataframe column is a categorical Series should be the following:
hasattr(column_to_check, 'cat')
So, as per the example given in the initial question, this would be:
hasattr(df.cat_column, 'cat')  # True
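And the same check on a non-categorical column returns False, since the .cat accessor only exists on categorical Series:
hasattr(df.x, 'cat')  # False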
Nowadays you can use:
pandas.api.types.is_categorical_dtype(series)
Docs here: https://pandas.pydata.org/docs/reference/api/pandas.api.types.is_categorical_dtype.html
Available since at least pandas 1.0
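A quick usage sketch:
import pandas as pd

s = pd.Series(['a', 'b', 'c'], dtype='category')
print(pd.api.types.is_categorical_dtype(s))              # True
print(pd.api.types.is_categorical_dtype(s.astype(str)))  # False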
Taking a look at @Jeff Tratner's answer: since df.cat_column.dtype == 'category' does not need to be True for a column to be treated as categorical,
I propose this, which considers categorical the dtypes within the categorical_dtypes list:
def is_cat(column):
    # Treat object, category and bool dtypes as categorical
    categorical_dtypes = ['object', 'category', 'bool']
    return column.dtype.name in categorical_dtypes
Using pandas.read_csv with parse_dates option and a custom date parser, I find Pandas has a mind of its own about the data type it's reading.
Sample csv:
"birth_date", "name"
"","Dr. Who"
"1625", "Rembrandt"
"1533", "Michel"
The actual datecleaner is here, but what I do boils down to this:
import pandas as pd
def dateclean(date):
    return str(int(date))  # Note: we return A STRING

df = pd.read_csv(
    'my.csv',
    parse_dates=['birth_date'],
    date_parser=dateclean,
    engine='python'
)
print(df.birth_date)
Output:
0 NaN
1 1625.0
2 1533.0
Name: birth_date, dtype: float64
I get type float64, even though I specified str. Also, if I take out the first row of the CSV, the one with the empty birth_date, I get type int. The workaround is easy:
return '"{}"'.format(int(date))
Is there a better way?
In data analysis, I can imagine it's useful that Pandas will say 'Hey dude, you thought you were reading strings, but in fact they're numbers'. But what's the rationale for overruling me when I tell it not to?
Using parse_dates / date_parser looks a bit complicated to me, unless you want to generalise your import across many date columns. I think you have more control with the converters parameter, where you can plug in your dateclean() function. You can also experiment with the dtype parameter.
The problem with the original dateclean() function is that it fails on the "" value, because int("") raises a ValueError. Pandas seems to fall back to its standard import when it encounters this problem, but with converters it will fail explicitly.
Below is the code to demonstrate a fix:
import pandas as pd
from pathlib import Path
doc = """"birth_date", "name"
"","Dr. Who"
"1625", "Rembrandt"
"1533", "Michel"
"""
Path('my.csv').write_text(doc)
def dateclean(date):
    try:
        return str(int(date))
    except ValueError:
        return ''

df = pd.read_csv(
    'my.csv',
    parse_dates=['birth_date'],
    date_parser=dateclean,
    engine='python'
)

df2 = pd.read_csv(
    'my.csv',
    converters={'birth_date': dateclean}
)
print(df2.birth_date)
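For completeness, here is a sketch of the dtype route mentioned above (my own addition): forcing the column to str at read time. Note that the empty field still comes back as NaN unless you also disable the default NA handling:
df3 = pd.read_csv(
    'my.csv',
    dtype={'birth_date': str},
    keep_default_na=False  # keep "" as an empty string instead of NaN
)
print(df3.birth_date)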
Hope it helps.
The problem is date_parser is designed specifically for conversion to datetime:
date_parser : function, default None
    Function to use for converting a sequence of string columns to an array of datetime instances.
There is no reason you should expect this parameter to work for other types. Instead, you can use the converters parameter. Here we use toolz.compose to apply int and then str. Alternatively, you can use lambda x: str(int(x)).
from io import StringIO
import pandas as pd
from toolz import compose
mystr = StringIO('''"birth_date", "name"
"","Dr. Who"
"1625", "Rembrandt"
"1533", "Michel"''')
df = pd.read_csv(mystr,
                 converters={'birth_date': compose(str, int)},
                 engine='python')
print(df.birth_date)
0 NaN
1 1625
2 1533
Name: birth_date, dtype: object
If you need to replace NaN with empty strings, you can post-process with fillna:
print(df.birth_date.fillna(''))
0
1 1625
2 1533
Name: birth_date, dtype: object
I need to polish a CSV dataset, but it seems the changes are not applied to the dataset itself.
CSV is in this format:
ID, TRACK_LINK
761607, https://mylink.com//track/...
This is my script:
import pandas as pd
df = pd.read_csv('./file.csv').fillna('')
# remove double // from TRACK_LINK
def polish_track_link(track_link):
    return track_link.replace("//track", "/track")

df['TRACK_LINK'].apply(polish_track_link)
print(df)
this prints something like:
...
761607 https://mylink.com//track/...
note the //track
If I do print(df['TRACK_LINK'].apply(polish_track_link)) I get:
...
761607, https://mylink.com/track/...
So the function polish_track_link works but it's not applied to the dataset. Any idea why?
You need to assign it back:
df['TRACK_LINK'] = df['TRACK_LINK'].apply(polish_track_link)
But it is better to use the pandas function str.replace, or replace with regex=True, to replace substrings:
df['TRACK_LINK'] = df['TRACK_LINK'].str.replace("//track", "/track")
Or:
df['TRACK_LINK'] = df['TRACK_LINK'].replace("//track", "/track", regex=True)
print(df)
ID TRACK_LINK
0 761607 https://mylink.com/track/
When running the following code:
for row, hit in hits.iterrows():
    forwardRows = data[data.index.values > row]
I get this error:
TypeError: Cannot compare type 'Timestamp' with type 'int'
If I look into what is being compared here I have these variables:
type(row)
pandas.tslib.Timestamp
row
Timestamp('2015-09-01 09:30:00')
is being compared with:
type(data.index.values[0])
numpy.datetime64
data.index.values[0]
numpy.datetime64('2015-09-01T10:30:00.000000000+0100')
I would like to understand whether this is something that can be easily fixed, or whether I should upload a subset of my data. Thanks!
Although this isn't a direct answer to your question, I have a feeling that this is what you're looking for: pandas.DataFrame.truncate
You could use it as follows:
for row, hit in hits.iterrows():
    forwardRows = data.truncate(before=row)
Here's a little toy example of how you might use it in general:
import pandas as pd
import numpy as np

# let's create some data to play with
df = pd.DataFrame(
    index=pd.date_range(start='2016-01-01', end='2016-06-01', freq='M'),
    columns=['x'],
    data=np.random.random(5)
)
# example: truncate rows before Mar 1
df.truncate(before='2016-03-01')
# example: truncate rows after Mar 1
df.truncate(after='2016-03-01')
When you use .values you move into NumPy world, where the datetime64 array can no longer be compared with a pandas Timestamp. Instead, compare against the index itself:
for row, hit in hits.iterrows():
    forwardRows = data[data.index > row]
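Here is a self-contained toy example of that fix, with made-up data (the names and values are illustrative only):
import pandas as pd

data = pd.DataFrame(
    {'v': [1, 2, 3]},
    index=pd.date_range('2015-09-01 09:30', periods=3, freq='h')
)
row = pd.Timestamp('2015-09-01 09:30:00')

# Comparing a DatetimeIndex with a Timestamp works without a TypeError
print(data[data.index > row])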