Conditional Boolean Loc Indexing VS ISIN Indexing [duplicate] - python

This question already has answers here:
How do I select rows from a DataFrame based on column values?
(16 answers)
Closed 4 years ago.
I get the same indexing results using these two methods:
data.loc[data['loan_status']== 'Default'].head()
data[data['loan_status'].isin(['Default'])].head()
Is there an advantage of one over the other?
Also, is there a reason why isin needs a ([]) argument to work, whereas most methods simply need a ()?

.isin allows you to supply a list of values to check. For instance, if you were looking for 'Default' or 'Defaulted', you could say:
data[data['loan_status'].isin(['Default', 'Defaulted'])].head()
Whereas otherwise, you would have to give it multiple conditions, like:
data.loc[(data['loan_status'] == 'Default') | (data['loan_status'] == 'Defaulted')]
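
As for the second part of the question: isin doesn't need a special ([]) syntax. It takes a single argument, which must be a list-like of values (a list, tuple, set, Series, or array), so the inner brackets are just you building a one-element list inline. A minimal sketch with toy data:
import pandas as pd

data = pd.DataFrame({'loan_status': ['Default', 'Defaulted', 'Paid']})

# isin accepts any list-like, not only a list literal
data[data['loan_status'].isin(['Default', 'Defaulted'])]  # list
data[data['loan_status'].isin({'Default', 'Defaulted'})]  # a set works too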


How to use apply and lambda to change a pandas data frame column [duplicate]

This question already has answers here:
How to deal with SettingWithCopyWarning in Pandas
(20 answers)
Closed 2 years ago.
I cannot figure this out. I want to change the "type" column in this dataset to 0/1 values.
url = "http://www.stats.ox.ac.uk/pub/PRNN/pima.tr"
Pima_training = pd.read_csv(url, sep=r'\s+')
Pima_training["type"] = Pima_training["type"].apply(lambda x : 1 if x == 'Yes' else 0)
I get the following error:
A value is trying to be set on a copy of a slice from a DataFrame.
This is just a warning and won't break your code. pandas raises it when it detects chained assignment, i.e. when you chain multiple indexing operations and it's ambiguous whether you are modifying the original DataFrame or a copy of it. Other, more experienced programmers have explained it in depth in another SO thread, so feel free to give that a read for a fuller explanation.
In your particular example, you don't need .apply at all (see this question for why not; in short, applying a Python function to a single column is inefficient because it loops over the rows internally). I think it makes more sense to use .replace instead and pass a dictionary:
Pima_training['type'] = Pima_training['type'].replace({"No":0,"Yes":1})
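Two more standard options, sketched below: .map with the same dictionary, or a plain vectorized comparison if you're sure the column only ever holds 'Yes'/'No'. Note that .map leaves NaN for values missing from the dict, whereas .replace keeps them unchanged.
# .map: unmapped values become NaN
Pima_training['type'] = Pima_training['type'].map({'No': 0, 'Yes': 1})

# or a vectorized comparison, since there are only two values
Pima_training['type'] = (Pima_training['type'] == 'Yes').astype(int)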

pandas - is there any difference between df.loc[df['column_label'] == filter_value] and df[df['column_label'] == filter_value] [duplicate]

This question already has an answer here:
Pandas, loc vs non loc for boolean indexing
(1 answer)
Closed 2 years ago.
I am learning pandas and want to know the best practice for filtering rows of a DataFrame by column values.
According to https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html, the recommendation is to use optimized pandas data access methods such as .loc
An example from https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html -
df.loc[df['shield'] > 6]
However, according to https://pandas.pydata.org/docs/getting_started/comparison/comparison_with_sql.html#where, a construction like tips[tips['time'] == 'Dinner'] could be used.
Why is the recommended .loc omitted? Is there any difference?
With .loc you can also correctly set a value; doing the same without it triggers the "A value is trying to be set on a copy of a slice from a DataFrame" warning. For getting rows out of your DataFrame, there might be performance differences, but I haven't measured them.
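To make the setting case concrete, a minimal sketch with a toy DataFrame:
import pandas as pd

df = pd.DataFrame({'time': ['Dinner', 'Lunch'], 'tip': [3.0, 2.0]})

# chained assignment: assigns into a temporary copy, triggers the
# SettingWithCopyWarning, and may leave df unchanged
df[df['time'] == 'Dinner']['tip'] = 5.0

# .loc selects rows and assigns in one indexing operation,
# so it reliably modifies df itself
df.loc[df['time'] == 'Dinner', 'tip'] = 5.0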

Best way to select columns in python pandas dataframe [duplicate]

This question already has answers here:
How do I select rows from a DataFrame based on column values?
(16 answers)
Closed 4 years ago.
I have two ways in my script to select specific rows from a DataFrame:
1.
df2 = df1[(df1['column_x']=='some_value')]
2.
df2 = df1.loc[df1['column_x'].isin(['some_value'])]
From an efficiency perspective, and from a Pythonic perspective (as in, what is the most Pythonic way of coding), which method of selecting specific rows is preferred?
P.S. Also, I feel there are probably even more ways to achieve the same.
P.P.S. I feel this question has already been asked, but I couldn't find it. Please reference it if this is a duplicate.
They are different. df1[(df1['column_x']=='some_value')] is fine if you're just looking for a single value. The advantage of isin is that you can pass it multiple values. For example: df1.loc[df1['column_x'].isin(['some_value', 'another_value'])]
It's interesting to see that from a performance perspective, the first method (using ==) actually seems to be significantly slower than the second (using isin):
import timeit
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': np.random.choice(['a', 'b', 'c'], 10000)})

def method1(df=df):
    return df[df['x'] == 'b']

def method2(df=df):
    return df[df['x'].isin(['b'])]
>>> timeit.timeit(method1,number=1000)/1000
0.001710233046906069
>>> timeit.timeit(method2,number=1000)/1000
0.0008507879299577325

Using str.contains on pandas dataframe [duplicate]

This question already has answers here:
Filter pandas DataFrame by substring criteria
(17 answers)
Closed 3 years ago.
This pandas python code generates the error message,
"TypeError: bad operand type for unary ~: 'float'"
I have no idea why, because I'm trying to manipulate a str object.
df_Anomalous_Vendor_Reasons[~df_Anomalous_Vendor_Reasons['V'].str.contains("File*|registry*")] # filters, leaving only cases where the reason is NOT File or Registry
Anybody got any ideas?
Credit to Davtho1983's comment above; I thought I'd add some color to it for clarity.
For anyone stumbling on this later with the same error (like me).
It's a very simple fix. The documentation from pandas shows
Series.str.contains(pat, case=True, flags=0, na=nan, regex=True)
What's happening is that the contains() method isn't applied to NaN values in the DataFrame, so they remain NaN. You just need to fill the NaN values with Boolean values so that you can use the invert operator ~.
With the example above one should use
df_Anomalous_Vendor_Reasons[~df_Anomalous_Vendor_Reasons['V'].str.contains("File*|registry*", na=False)]
Of course one should choose False or True for the na argument based on intended logic. Whichever Boolean value you choose for filling na will be inverted.
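A minimal reproduction of both the error and the fix (toy data, simplified pattern):
import numpy as np
import pandas as pd

s = pd.Series(['File not found', 'registry key', np.nan])

# without na=..., s.str.contains(...) returns [True, True, nan], and
# applying ~ to the nan raises
# "TypeError: bad operand type for unary ~: 'float'"
mask = ~s.str.contains('File|registry', na=False)
print(s[mask])  # only the NaN row passes the filter here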

Get row-index values of Pandas DataFrame as list? [duplicate]

This question already has answers here:
How do I convert a Pandas series or index to a NumPy array? [duplicate]
(8 answers)
Closed 4 years ago.
I'm probably using poor search terms when trying to find this answer. Right now, before indexing a DataFrame, I'm getting a list of values in a column this way...
list = list(df['column'])
...then I'll set_index on that column. This seems like a wasted step. When trying the above on an index, I get a KeyError.
How can I grab the values in an index (both single and multi) and put them in a list or a list of tuples?
To get the index values as a list/list of tuples for Index/MultiIndex do:
df.index.values.tolist() # an ndarray method, you probably shouldn't depend on this
or
list(df.index.values) # this will always work in pandas
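A quick sketch of both cases with toy data:
import pandas as pd

df = pd.DataFrame({'a': [1, 2]}, index=pd.Index(['x', 'y'], name='k'))
print(list(df.index.values))   # ['x', 'y']

mi = pd.MultiIndex.from_tuples([('x', 1), ('y', 2)], names=['k1', 'k2'])
df2 = pd.DataFrame({'a': [3, 4]}, index=mi)
print(list(df2.index.values))  # [('x', 1), ('y', 2)] - a list of tuples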
If you're only getting these to manually pass into df.set_index(), that's unnecessary. Just do df.set_index('your_col_name', drop=False) directly.
It's very rare in pandas that you need to get an index as a Python list (unless you're doing something pretty funky, or else passing them back to NumPy), so if you're doing this a lot, it's a code smell that you're doing something wrong.
