Using str.contains on pandas dataframe [duplicate] - python

This question already has answers here:
Filter pandas DataFrame by substring criteria
(17 answers)
Closed 3 years ago.
This pandas python code generates the error message,
"TypeError: bad operand type for unary ~: 'float'"
I have no idea why, because I'm trying to manipulate a str object:
df_Anomalous_Vendor_Reasons[~df_Anomalous_Vendor_Reasons['V'].str.contains("File*|registry*")] # filters, leaving only cases where reason is NOT File or Registry
Anybody got any ideas?

Credit to Davtho1983's comment above; I thought I'd add some color to it for clarity.
For anyone stumbling on this later with the same error (like me).
It's a very simple fix. The documentation from pandas shows
Series.str.contains(pat, case=True, flags=0, na=nan, regex=True)
What's happening is that the contains() method isn't applied to NaN values in the DataFrame; they remain NaN in the result. You just need to fill the NaN values with a Boolean so the invert operator ~ can be applied.
With the example above one should use
df_Anomalous_Vendor_Reasons[~df_Anomalous_Vendor_Reasons['V'].str.contains("File*|registry*", na=False)]
Of course one should choose False or True for the na argument based on the intended logic, keeping in mind that whichever Boolean value you fill in for na will also be inverted by ~.
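A minimal sketch of the fix on made-up data (the column name 'V' is from the question; the values are invented, and the simpler pattern 'File|registry' is used since contains already matches substrings):

import numpy as np
import pandas as pd

df = pd.DataFrame({'V': ['File deleted', 'registry key set', np.nan, 'Network scan']})

# Without na=False the NaN row stays NaN in the mask, and applying ~ raises
# "TypeError: bad operand type for unary ~: 'float'".
mask = df['V'].str.contains('File|registry', na=False)  # NaN rows become False
print(df[~mask])  # keeps 'Network scan' and, since False is inverted, the NaN row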

Related

How to use apply and lambda to change a pandas data frame column [duplicate]

This question already has answers here:
How to deal with SettingWithCopyWarning in Pandas
(20 answers)
Closed 2 years ago.
I cannot figure this out. I want to change the "type" column in this dataset to 0/1 values.
url = "http://www.stats.ox.ac.uk/pub/PRNN/pima.tr"
Pima_training = pd.read_csv(url, sep=r'\s+')
Pima_training["type"] = Pima_training["type"].apply(lambda x : 1 if x == 'Yes' else 0)
I get the following error:
A value is trying to be set on a copy of a slice from a DataFrame.
This is a warning and won't break your code. It happens when pandas detects chained assignment: using multiple indexing operations in sequence, so that it's ambiguous whether you are modifying the original DataFrame or a copy of it. More experienced programmers have explained it in depth in another SO thread, so feel free to give it a read for a further explanation.
In your particular example, you don't need .apply at all (see this question for why not; using apply on a single column is inefficient because it loops over the rows internally). I think it makes more sense to use .replace instead and pass a dictionary.
Pima_training['type'] = Pima_training['type'].replace({"No":0,"Yes":1})
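For illustration, a minimal sketch of how the warning arises and how the dictionary-based replace sidesteps .apply (the frame and values here are made up; the question doesn't show how Pima_training was created, but a prior slice is the usual culprit):

import pandas as pd

df = pd.DataFrame({'type': ['Yes', 'No', 'Yes'], 'npreg': [6, 1, 8]})

subset = df[df['npreg'] > 2]         # chained step: may be a view or a copy
subset['type'] = 0                   # can trigger SettingWithCopyWarning

subset = df[df['npreg'] > 2].copy()  # explicit copy removes the ambiguity
subset['type'] = subset['type'].replace({'No': 0, 'Yes': 1})
print(subset)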

pandas - is there any difference between df.loc[df['column_label'] == filter_value] and df[df['column_label'] == filter_value] [duplicate]

This question already has an answer here:
Pandas, loc vs non loc for boolean indexing
(1 answer)
Closed 2 years ago.
I am learning pandas and want to know the best practice for filtering rows of a DataFrame by column values.
According to https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html, the recommendation is to use optimized pandas data access methods such as .loc
An example from https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html -
df.loc[df['shield'] > 6]
However, according to https://pandas.pydata.org/docs/getting_started/comparison/comparison_with_sql.html#where, a construction like tips[tips['time'] == 'Dinner'] could be used.
Why is the recommended .loc omitted? Is there any difference?
With .loc you can also correctly set a value; leaving it out when assigning is what leads to the "A value is trying to be set on a copy of a slice from a DataFrame" warning. For getting data out of your DataFrame there might be performance differences, but I don't know of any.
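A small sketch of both points on made-up data (column names borrowed from the .loc docs example):

import pandas as pd

df = pd.DataFrame({'shield': [3, 7, 9], 'name': ['a', 'b', 'c']})

# Reading: both spellings select the same rows.
print(df.loc[df['shield'] > 6])
print(df[df['shield'] > 6])

# Writing: a single .loc call with the row mask and column label is the safe form;
# the chained df[df['shield'] > 6]['name'] = 'strong' would trigger the warning.
df.loc[df['shield'] > 6, 'name'] = 'strong'
print(df)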

Conditional Boolean Loc Indexing VS ISIN Indexing [duplicate]

This question already has answers here:
How do I select rows from a DataFrame based on column values?
(16 answers)
Closed 4 years ago.
I get the same indexing results using these two methods:
data.loc[data['loan_status']== 'Default'].head()
data[data['loan_status'].isin(['Default'])].head()
Is there an advantage of one over the other?
Also, is there a reason why isin needs a ([]) parameter to work, whereas most methods simply need a ()?
.isin allows you to supply a list of values to check. For instance, if you were looking for 'Default' or 'Defaulted', or something like that, you could say:
data[data['loan_status'].isin(['Default', 'Defaulted'])].head()
Whereas otherwise, you would have to give it multiple conditions, like:
data.loc[(data['loan_status'] == 'Default') | (data['loan_status'] == 'Defaulted')]
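As for the ([]) syntax: isin takes a single argument, an iterable of values, so the inner brackets are just a list literal rather than special syntax. Any iterable works; a set is a common choice (a minimal sketch on made-up data):

import pandas as pd

data = pd.DataFrame({'loan_status': ['Default', 'Defaulted', 'Fully Paid']})

# isin(values) accepts any iterable: list, set, tuple, Series, ...
print(data[data['loan_status'].isin({'Default', 'Defaulted'})])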

pandas query None values [duplicate]

This question already has answers here:
Querying for NaN and other names in Pandas
(7 answers)
Closed 2 years ago.
I have a dataframe with columns of different dtypes and I need to use pandas.query to filter the columns.
Columns can include missing values (NaN, None and NaT), and I need to display the rows that contain such values. Is there a way to do this in an expression passed to pandas.query? I am aware that it can be done using other methods, but I need to know if it is doable through query.
For boolean columns I was able to use a workaround by stating:
df.query('col not in (True, False)')
but this won't work for other types of columns. Any help is appreciated including workarounds.
NaN is not equal to itself, so you can select the missing rows by testing whether a column is unequal to itself. This also seems to work for None, although I'm not sure why; it may be getting cast to NaN at some point during the evaluation.
df.query('col != col')
For datetimes, a membership test against pd.NaT works, but it feels pretty hacky and there might be a better way.
df.query('col in [@pd.NaT]')
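Putting both together on a toy frame (the column names num, obj and dt are made up):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'num': [1.0, np.nan],
    'obj': ['x', None],
    'dt': pd.to_datetime(['2020-01-01', None]),
})

print(df.query('num != num'))       # row where num is NaN
print(df.query('obj != obj'))       # row where obj is None
print(df.query('dt in [@pd.NaT]'))  # row where dt is NaT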

Get row-index values of Pandas DataFrame as list? [duplicate]

This question already has answers here:
How do I convert a Pandas series or index to a NumPy array? [duplicate]
(8 answers)
Closed 4 years ago.
I'm probably using poor search terms when trying to find this answer. Right now, before indexing a DataFrame, I'm getting a list of values in a column this way...
list = list(df['column'])
...then I'll set_index on the column. This seems like a wasted step. When trying the above on an index, I get a KeyError.
How can I grab the values in an index (both single and multi) and put them in a list or a list of tuples?
To get the index values as a list/list of tuples for Index/MultiIndex do:
df.index.values.tolist() # an ndarray method, you probably shouldn't depend on this
or
list(df.index.values) # this will always work in pandas
If you're only getting these to pass them into df.set_index(), that's unnecessary: just call df.set_index('your_col_name', drop=False) directly.
It's very rare in pandas that you need to get an index as a Python list (unless you're doing something pretty funky, or else passing them back to NumPy), so if you're doing this a lot, it's a code smell that you're doing something wrong.
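A quick sketch of both cases on made-up data:

import pandas as pd

# Flat index: a plain list of labels.
df = pd.DataFrame({'a': [1, 2]}, index=['x', 'y'])
print(list(df.index.values))   # ['x', 'y']

# MultiIndex: a list of tuples.
mdf = pd.DataFrame({'a': [1, 2]},
                   index=pd.MultiIndex.from_tuples([('x', 1), ('y', 2)]))
print(list(mdf.index.values))  # [('x', 1), ('y', 2)]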
