This question already has an answer here:
index a Python Pandas dataframe with multiple conditions SQL like where statement
(1 answer)
Closed 7 years ago.
I'm trying to run the following SQL statement (obviously in Python code) in pandas, but am getting nowhere:
select year, contraction, indiv_count, total_words from dataframe
where contraction in ("i'm","we're","we've","it's","they're")
Here contraction is char, and year, indiv_count, and total_words are int.
I'm not too familiar with pandas. How do I create a similar statement in Python?
I'd recommend reading the docs listed in Anton's comment if you haven't already, but they lack documentation of the .isin() method, which is what you will need to replicate the SQL IN clause.
df[df['contraction'].isin(["i'm","we're","we've","it's","they're"])]
The column selection can then be done with .loc[] or whatever your favorite method for that is (there are many).
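Putting the row filter and the column selection together, a minimal sketch might look like this (the column and contraction names come from the question; the data itself is made up):

import pandas as pd

# Toy data mirroring the question's columns (values are invented).
df = pd.DataFrame({
    'year': [2000, 2001, 2002],
    'contraction': ["i'm", "you're", "we're"],
    'indiv_count': [10, 5, 7],
    'total_words': [1000, 800, 900],
})

wanted = ["i'm", "we're", "we've", "it's", "they're"]

# WHERE contraction IN (...) -> .isin(); the column list plays the SELECT role.
result = df.loc[df['contraction'].isin(wanted),
                ['year', 'contraction', 'indiv_count', 'total_words']]
print(result)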
This question already has an answer here:
Pandas, loc vs non loc for boolean indexing
(1 answer)
Closed 2 years ago.
I am learning pandas and want to know the best practice for filtering rows of a DataFrame by column values.
According to https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html, the recommendation is to use optimized pandas data access methods such as .loc
An example from https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html -
df.loc[df['shield'] > 6]
However, according to https://pandas.pydata.org/docs/getting_started/comparison/comparison_with_sql.html#where, a construction like tips[tips['time'] == 'Dinner'] could be used.
Why is the recommended .loc omitted? Is there any difference?
With .loc you can also correctly set a value; without it, chained indexing raises the SettingWithCopyWarning ("A value is trying to be set on a copy of a slice from a DataFrame"). For getting something out of your DataFrame there might be performance differences, but I don't know of any.
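As a rough illustration (the column names and values here are invented), compare the two ways of setting a value:

import pandas as pd

df = pd.DataFrame({'time': ['Dinner', 'Lunch'], 'tip': [3.0, 2.0]})

# Chained indexing: may assign to a temporary copy and trigger SettingWithCopyWarning.
df[df['time'] == 'Dinner']['tip'] = 5.0

# .loc does the row selection and the assignment in one step, on the original frame.
df.loc[df['time'] == 'Dinner', 'tip'] = 5.0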
This question already has answers here:
How do I select rows from a DataFrame based on column values?
(16 answers)
Closed 4 years ago.
I have two ways in my script to select specific rows from a DataFrame:
1.
df2 = df1[(df1['column_x']=='some_value')]
2.
df2 = df1.loc[df1['column_x'].isin(['some_value'])]
From an efficiency perspective, and from a Pythonic perspective (as in, what is the most Pythonic way of coding), which method of selecting specific rows is preferred?
P.S. Also, I feel there are probably even more ways to achieve the same thing.
P.P.S. I feel this question has already been asked, but I couldn't find it. Please link it if this is a duplicate.
They are different. df1[(df1['column_x']=='some_value')] is fine if you're just looking for a single value. The advantage of isin is that you can pass it multiple values. For example: df1.loc[df1['column_x'].isin(['some_value', 'another_value'])]
It's interesting to see that from a performance perspective, the first method (using ==) actually seems to be significantly slower than the second (using isin):
import timeit
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': np.random.choice(['a', 'b', 'c'], 10000)})

def method1(df=df):
    return df[df['x'] == 'b']

def method2(df=df):
    return df[df['x'].isin(['b'])]
>>> timeit.timeit(method1,number=1000)/1000
0.001710233046906069
>>> timeit.timeit(method2,number=1000)/1000
0.0008507879299577325
This question already has answers here:
How do I select rows from a DataFrame based on column values?
(16 answers)
Closed 4 years ago.
I get the same indexing results using these two methods:
data.loc[data['loan_status']== 'Default'].head()
data[data['loan_status'].isin(['Default'])].head()
Is there an advantage of one over the other?
Also, is there a reason why isin needs a list argument ([...]) to work, whereas most methods simply take a plain value?
.isin allows you to supply a list of values to check. For instance, if you were looking for 'Default' or 'Defaulted', or something like that, you could say:
data[data['loan_status'].isin(['Default', 'Defaulted'])].head()
Whereas otherwise, you would have to give it multiple conditions, like:
data.loc[(data['loan_status'] == 'Default') | (data['loan_status'] == 'Defaulted')]
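On the second part of the question: .isin takes a single argument, which just has to be an iterable of the values to test against, so the [...] is an ordinary list you pass in rather than special syntax. A small sketch (column name reused from the question, data invented):

import pandas as pd

data = pd.DataFrame({'loan_status': ['Default', 'Defaulted', 'Paid']})

# Any iterable works: a list, a set, even another Series.
data[data['loan_status'].isin(['Default', 'Defaulted'])]
data[data['loan_status'].isin({'Default', 'Defaulted'})]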
This question already has answers here:
How to exclude multiple columns in Spark dataframe in Python
(4 answers)
Closed 4 years ago.
I am working with pyspark 2.0.
My code is :
for col in to_exclude:
    df = df.drop(col)
I cannot directly do df = df.drop(*to_exclude), because in 2.0 the drop method accepts only one column at a time.
Is there a way to change my code and remove the for loop ?
First of all, not to worry. Even if you do it in a loop, it does not mean Spark executes a separate query for each drop. Queries are lazy, so it will build one big execution plan first and then execute everything at once (but you probably know that anyway).
However, if you still want to get rid of the loop within the 2.0 API, I'd go with something opposite to what you've implemented: instead of dropping columns, select only the ones you need:
df.select([col for col in df.columns if col not in to_exclude])
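For context, a minimal sketch of that approach (assuming a local SparkSession and a made-up DataFrame; only select and df.columns from the 2.0 API are used):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Toy DataFrame; to_exclude names the columns to remove.
df = spark.createDataFrame([(1, "a", 10.0), (2, "b", 20.0)], ["id", "label", "score"])
to_exclude = ["label", "score"]

# Keep the complement of to_exclude instead of dropping in a loop.
df.select([c for c in df.columns if c not in to_exclude]).show()  # only 'id' remains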
This question already has answers here:
Filter pandas DataFrame by substring criteria
(17 answers)
Closed 7 years ago.
I have a DataFrame like this:
col1,col2
Sam,NL
Man,NL-USA
ho,CA-CN
And I would like to select the rows whose second column contains the word 'NL', something like the SQL LIKE command. Does anybody know of a similar command in Python pandas?
Let df be your DataFrame:
df[df['col2'].str.contains('NL')]
This will select all the records that contain 'NL'.
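If the match needs to be a bit more forgiving, str.contains also accepts case, na and regex keyword arguments; a small sketch with the same made-up data:

import pandas as pd

df = pd.DataFrame({'col1': ['Sam', 'Man', 'ho'], 'col2': ['NL', 'NL-USA', 'CA-CN']})

# Case-insensitive, treat missing values as non-matches, plain substring (no regex).
df[df['col2'].str.contains('nl', case=False, na=False, regex=False)]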