This question already has answers here:
Querying for NaN and other names in Pandas
(7 answers)
Closed 2 years ago.
I have a DataFrame with columns of different dtypes, and I need to use pandas.query to filter it.
The columns can include missing values: NaN, None and NaT, and I need to display the rows that contain such values. Is there a way to do this in an expression passed to pandas.query? I am aware that it can be done with other methods, but I need to know whether it is doable through query.
For boolean columns I was able to use a workaround by stating:
df.query('col not in (True, False)')
but this won't work for other column types. Any help is appreciated, including workarounds.
NaN is not equal to itself, so you can simply test whether a column is equal to itself to filter on it. This also seems to work for None, although I'm not sure why; it may be getting cast to NaN at some point during the evaluation.
df.query('col == col')
That keeps the rows where col is not missing; invert it to col != col to show only the rows that do contain missing values.
For datetimes this works, but it feels pretty hacky; there might be a better way (the @ prefix lets query look up pd.NaT from the calling namespace):
df.query('col not in [@pd.NaT]')
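To illustrate, here is a minimal sketch on a made-up DataFrame (the column names num, obj and ts exist only for this example), showing both the self-comparison trick and the NaT-list form flipped around so that the missing rows are the ones displayed:
import numpy as np
import pandas as pd

# toy frame with a float NaN, a None and a NaT
df = pd.DataFrame({
    'num': [1.0, np.nan, 3.0],
    'obj': ['a', None, 'c'],
    'ts': pd.to_datetime(['2021-01-01', None, '2021-01-03']),
})

print(df.query('num != num'))        # row where num is NaN
print(df.query('obj != obj'))        # row where obj is None
print(df.query('ts in [@pd.NaT]'))   # row where ts is NaT (the hack above, inverted)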
This question already has answers here:
How to deal with SettingWithCopyWarning in Pandas
(20 answers)
Closed 2 years ago.
I cannot figure this out. I want to change the "type" column in this dataset to 0/1 values.
url = "http://www.stats.ox.ac.uk/pub/PRNN/pima.tr"
Pima_training = pd.read_csv(url,sep = '\s+')
Pima_training["type"] = Pima_training["type"].apply(lambda x : 1 if x == 'Yes' else 0)
I get the following error:
A value is trying to be set on a copy of a slice from a DataFrame.
This is a warning, not an error, and it won't break your code. It happens when pandas detects chained assignment, i.e. when you use multiple indexing operations in a row and it is ambiguous whether you are modifying the original DataFrame or a copy of it. Other, more experienced programmers have explained it in depth in another SO thread, so feel free to give it a read for further explanation.
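To make chained assignment concrete, here is a small hypothetical sketch (the frame and threshold are made up, not from the question):
import pandas as pd

df = pd.DataFrame({'type': ['Yes', 'No'], 'npreg': [5, 3]})

# Chained assignment: two indexing operations in a row. Pandas may warn,
# because the write could land on a temporary copy rather than on df.
# df[df['npreg'] > 4]['type'] = 1        # triggers SettingWithCopyWarning

# Single .loc call: row and column selectors in one operation, so pandas
# knows the assignment targets the original frame.
df.loc[df['npreg'] > 4, 'type'] = 1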
In your particular example you don't need .apply at all (see this question for why not; in short, applying a Python function to a single column is inefficient because it loops over the rows internally). I think it makes more sense to use .replace instead and pass a dictionary.
Pima_training['type'] = Pima_training['type'].replace({"No":0,"Yes":1})
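As a quick sanity check, here is a hedged toy example (made-up values) of what the dictionary-based replace does:
import pandas as pd

s = pd.Series(['Yes', 'No', 'No', 'Yes'], name='type')
print(s.replace({'No': 0, 'Yes': 1}).tolist())   # [1, 0, 0, 1]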
This question already has an answer here:
Pandas, loc vs non loc for boolean indexing
(1 answer)
Closed 2 years ago.
I am learning pandas and want to know the best practice for filtering rows of a DataFrame by column values.
According to https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html, the recommendation is to use optimized pandas data access methods such as .loc
An example from https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html -
df.loc[df['shield'] > 6]
However, according to https://pandas.pydata.org/docs/getting_started/comparison/comparison_with_sql.html#where, a construction like tips[tips['time'] == 'Dinner'] could be used.
Why is the recommended .loc omitted? Is there any difference?
With .loc you can also correctly set a value; without it, assignment can trigger the 'A value is trying to be set on a copy of a slice from a DataFrame' warning. For simply getting something out of your DataFrame there might be performance differences, but I don't know of any.
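Here is a small, hypothetical illustration of both cases (the shield/name columns follow the .loc docs example; the values are made up):
import pandas as pd

df = pd.DataFrame({'shield': [3, 7, 9], 'name': ['a', 'b', 'c']})

# Getting: both forms return the same rows.
print(df.loc[df['shield'] > 6].equals(df[df['shield'] > 6]))   # True

# Setting: use a single .loc call; the chained form below may only
# modify a copy and raise the warning mentioned above.
df.loc[df['shield'] > 6, 'name'] = 'strong'
# df[df['shield'] > 6]['name'] = 'strong'   # warns, change may not stick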
This question already has answers here:
How do I select rows from a DataFrame based on column values?
(16 answers)
Closed 4 years ago.
I have two ways in my script of selecting specific rows from a DataFrame:
1.
df2 = df1[(df1['column_x']=='some_value')]
2.
df2 = df1.loc[df1['column_x'].isin(['some_value'])]
From an efficiency perspective, and from a Pythonic perspective (as in, what is the most Pythonic way of coding), which method of selecting specific rows is preferred?
P.S. Also, I feel there are probably even more ways to achieve the same.
P.P.S. I feel this question has already been asked, but I couldn't find it. Please reference it if this is a duplicate.
They are different. df1[(df1['column_x']=='some_value')] is fine if you're just looking for a single value. The advantage of isin is that you can pass it multiple values. For example: df1.loc[df1['column_x'].isin(['some_value', 'another_value'])]
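A minimal runnable sketch of that difference, with made-up values:
import pandas as pd

df1 = pd.DataFrame({'column_x': ['some_value', 'another_value', 'other']})

print(df1[df1['column_x'] == 'some_value'])                            # one value
print(df1.loc[df1['column_x'].isin(['some_value', 'another_value'])])  # several values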
It's interesting to see that from a performance perspective, the first method (using ==) actually seems to be significantly slower than the second (using isin):
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({'x': np.random.choice(['a', 'b', 'c'], 10000)})

def method1(df=df):
    return df[df['x'] == 'b']

def method2(df=df):
    return df[df['x'].isin(['b'])]
>>> timeit.timeit(method1,number=1000)/1000
0.001710233046906069
>>> timeit.timeit(method2,number=1000)/1000
0.0008507879299577325
I have a DataFrame which has a lot of NAs. pandas's groupby operation is ignoring any combinations with NA in it. Is there a way to include NAs in groups? If not, what are the alternatives to pandas groupby? I really don't want to fill in NAs because the fact that something is missing is useful information.
Edit: I noticed that my question is exactly the same issue reported in groupby columns with NaN (missing) values
Have there been any developments since then to get around this issue?
I would use some kind of non-NA representation for the NAs, just for the groupby, that can't be confused with real data (e.g. -999999 or 'missing'):
df.fillna(-999999).groupby(...)
Since the inplace argument defaults to False, your original DataFrame will not be affected.
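A minimal sketch of that approach on made-up data (the sentinel 'missing' is arbitrary):
import numpy as np
import pandas as pd

df = pd.DataFrame({'key': ['a', np.nan, 'a', np.nan], 'val': [1, 2, 3, 4]})

# NaN keys would normally be dropped by groupby; the sentinel keeps them as a group.
print(df.fillna('missing').groupby('key')['val'].sum())
# key
# a          4
# missing    6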
This question already has answers here:
Filter pandas DataFrame by substring criteria
(17 answers)
Closed 3 years ago.
This pandas Python code generates the error message
"TypeError: bad operand type for unary ~: 'float'"
I have no idea why, because I'm trying to manipulate a str object:
df_Anomalous_Vendor_Reasons[~df_Anomalous_Vendor_Reasons['V'].str.contains("File*|registry*")] #sorts, leaving only cases where reason is NOT File or Registry
Anybody got any ideas?
Credit to Davtho1983's comment above; I thought I'd add some color to the comment for clarity.
For anyone stumbling on this later with the same error (like me).
It's a very simple fix. The documentation from pandas shows
Series.str.contains(pat, case=True, flags=0, na=nan, regex=True)
What's happening is that the contains() method isn't applied to NaN values in the DataFrame; they remain NaN. You just need to fill the NaN values with a Boolean so that you can use the invert operator ~.
With the example above one should use
df_Anomalous_Vendor_Reasons[~df_Anomalous_Vendor_Reasons['V'].str.contains("File*|registry*", na=False)]
Of course one should choose False or True for the na argument based on intended logic. Whichever Boolean value you choose for filling na will be inverted.
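To see the fix in isolation, here is a hedged toy example (the reason strings are made up):
import numpy as np
import pandas as pd

s = pd.Series(['File not found', 'registry key missing', np.nan, 'other reason'])

mask = s.str.contains("File*|registry*", na=False)   # NaN becomes False instead of staying NaN
print(s[~mask])   # no TypeError; the NaN row is kept because its mask value is False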