I want to loop over all the rows in a df, checking that two conditions hold and, if they do, replace the value in a column with something else. I've attempted to do this two different ways:
if (sales.iloc[idx]['shelf'] in ("DRY BLENDS", "LIQUID BLENDS")) & np.isnan(sales.iloc[idx]['traceable_blend']):
    sales.iloc[idx]['traceable_blend'] = False
and:
if (sales.iloc[idx]['shelf'] in ("DRY BLENDS", "LIQUID BLENDS")) & (sales.iloc[idx]['traceable_blend'] == np.NaN):
    sales.iloc[idx]['traceable_blend'] = False
By including print statements we've verified that the if statement is actually functional, but no assignment ever takes place. Once we've run the loop, there are True and NaN values in the 'traceable_blend' column, but never False. Somehow the assignment is failing.
It looks like this might've worked:
if (sales.iloc[idx]['shelf'] in ("DRY BLENDS", "LIQUID BLENDS")) & np.isnan(sales.iloc[idx]['traceable_blend']):
    sales.at[idx, 'traceable_blend'] = False
But I would still like to understand what's happening.
This, sales.iloc[idx]['traceable_blend'] = False, is chained indexing, and will almost never work: the first lookup (sales.iloc[idx]) returns a temporary copy of the row, so the assignment modifies that copy and the original DataFrame is left untouched. In fact, you don't need to loop:
sales['traceable_blend'] = sales['traceable_blend'].fillna(sales['shelf'].isin(['DRY BLENDS', 'LIQUID BLENDS']))
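A quick runnable sketch of that one-liner, using a made-up miniature version of the sales frame (the shelf values here are assumptions, chosen so the fill produces both True and False):

```python
import numpy as np
import pandas as pd

# Hypothetical mini version of the question's frame: True and NaN values only
sales = pd.DataFrame({
    'shelf': ['DRY BLENDS', 'CEREAL', 'CEREAL'],
    'traceable_blend': [np.nan, True, np.nan],
})

# isin() builds a boolean mask; fillna() uses it only where the column is NaN
mask = sales['shelf'].isin(['DRY BLENDS', 'LIQUID BLENDS'])
sales['traceable_blend'] = sales['traceable_blend'].fillna(mask)
print(sales['traceable_blend'].tolist())  # [True, True, False]
```

Note that existing non-NaN values (the True in row 1) are preserved; only the gaps are filled from the mask.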
Pandas offers two functions for checking for missing data (NaN or null): isnull() and notnull(). Both return a boolean Series. I suggest trying these instead of isnan().
You can also determine whether any value in your Series is missing by chaining .values.any().
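For example, on a small throwaway Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

print(s.isnull().tolist())             # [False, True, False]
print(s.notnull().tolist())            # [True, False, True]
print(bool(s.isnull().values.any()))   # True: at least one value is missing
```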
Related
In a given dataframe in pandas, is there a way to see all the booleans present in filt in the code below?
filt = dataframe['tag1'] =='ABC'
filt
TLDR
It's possible. I think you should use indexing, which is extensively described here. To be more specific, you can use boolean indexing.
The code should look like this:
filt = df[df.loc[:,"tag1"] == 'ABC']
Now, what actually happens here:
df.loc[:,"tag1"] returns all rows (the : character) but limits the columns to just "tag1". Next, df.loc[:,"tag1"] == 'ABC' compares the returned rows with the value 'ABC', and as the result a grid of True/False values is created: True where the row was equal to 'ABC', and so on. Now the grand finale: whenever you pass a grid of logical values to a dataframe, they are treated as indicators of whether or not to include the corresponding row in the result. So let's say the value at [0,0] in the passed grid is True; that row will therefore be included in the result.
I understand it's hard to wrap one's head around this concept, but once you get it it's super useful. The best way is to just play around with the iloc[] and loc[] functions.
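To answer the "can I see all the booleans" part directly: filt is itself a plain boolean Series, so you can print or list it. A minimal sketch with a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({'tag1': ['ABC', 'XYZ', 'ABC']})

filt = df.loc[:, 'tag1'] == 'ABC'   # a boolean Series, one value per row
print(filt.tolist())                # [True, False, True]

# Passing the mask back into df keeps only the rows where filt is True
print(df[filt]['tag1'].tolist())    # ['ABC', 'ABC']
```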
I'm trying to test two conditions, but only want to check the second condition if the first one fails. One example could be:
if df is None or len(df) == 0:
    # do something
I know I can use two separate if statements or a try...except block. However, I want to know if there is a more elegant, Pythonic way of doing it, or if two separate if statements are the only way.
or is a short-circuit operator in Python 3. This means the second condition (here len(df) == 0) is evaluated only if the first one (here df is None) is False.
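A quick sketch showing the short-circuit in action (the helper name is made up):

```python
import pandas as pd

def is_empty(df):
    # `or` short-circuits: len(df) is evaluated only when `df is None` is
    # False, so passing None never raises a TypeError
    return df is None or len(df) == 0

print(is_empty(None))                      # True (len() is never called)
print(is_empty(pd.DataFrame()))            # True
print(is_empty(pd.DataFrame({'a': [1]})))  # False
```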
You can run your code like this:
if df is None:
    # Do this
elif len(df) == 0:
    # Do this
else:
    # Do this
It works fine and can definitely help.
I get an error stating
ValueError: The truth value of a Series is ambiguous
for the if condition.
with the following function:
for i, row in train_news.iterrows():
    if train_news.iloc[:,0].isin(['mostly-true','half-true','true']):
        train_news.iloc[:,0] = "true"
    else:
        train_news.iloc[:,0] = "false"
The problem is in your if statement -
if train_news.iloc[:,0].isin(['mostly-true','half-true','true'])
Think about what this does -
Let's say train_news.iloc[:,0] looks like this -
mostly-true
not-true
half-true
Now if you do train_news.iloc[:,0].isin(['mostly-true','half-true','true']), this will check iteratively whether each element is present in the list ['mostly-true','half-true','true']
So, this will yield another pandas.Series which looks like this -
True
False
True
The if statement in Python, being the simpleton, expects one bool value, and you are just confusing it by providing a bunch of boolean values. So you need to use .all() or .any() (those are the usual to-dos) at the end, depending on what you want.
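In this particular case, though, no scalar if is needed at all: the boolean Series from isin can drive a vectorized replacement directly. A sketch with a made-up train_news frame (the column name and label values are assumptions):

```python
import numpy as np
import pandas as pd

train_news = pd.DataFrame({'label': ['mostly-true', 'not-true', 'half-true']})

# One boolean per row, instead of trying to feed the whole Series to `if`
mask = train_news.iloc[:, 0].isin(['mostly-true', 'half-true', 'true'])

# np.where picks 'true' where the mask is True and 'false' elsewhere
train_news.iloc[:, 0] = np.where(mask, 'true', 'false')
print(train_news['label'].tolist())  # ['true', 'false', 'true']
```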
pd.DataFrame.all and pd.DataFrame.any convert all values to bool and then assert all identities with the keyword True. This is OK as long as we are fine with the fact that non-empty lists and strings evaluate to True. However, let's assume that this is not the case.
>>> pd.DataFrame([True, 'a']).all().item()
True # Wrong
A workaround is to assert equality with True, but a comparison to True does not sound pythonic.
>>> (pd.DataFrame([True, 'a']) == True).all().item()
False # Right
Question: can we assert for identity with True without using == True
First of all, I do not advise this. Please do not use mixed dtypes inside your dataframe columns - that defeats the purpose of dataframes and they are no more useful than lists and no more performant than loops.
Now, addressing your actual question: spoiler alert, you can't get over the ==. But you can hide it using the eq function. You may use
df.eq(True).all()
Or,
df.where(df.eq(True), False).all()
Note that
df.where(df.eq(True), False)
0
0 True
1 False
Which you may find useful if you want to convert non-"True" values to False for any other reason.
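Putting the two behaviors side by side on the question's own example frame:

```python
import pandas as pd

df = pd.DataFrame([True, 'a'])

# all() only checks truthiness, so the non-empty string 'a' passes
print(bool(df.all().item()))            # True  ('a' is truthy)

# eq(True) compares values, so 'a' fails the check
print(bool(df.eq(True).all().item()))   # False ('a' is not the value True)

# where() keeps matches and replaces everything else with False
print(df.where(df.eq(True), False)[0].tolist())  # [True, False]
```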
I would actually use
(pd.DataFrame([True, 'a']) == True).all().item()
This way, you're explicitly checking for the value of the object, not just whether it's truthy, which seems perfectly Pythonic to me.
import pandas as pd
businesses = pd.read_json(businesses_filepath, lines=True, encoding='utf_8')
restaurantes = businesses['Restaurants' in businesses['categories']]
I would like to remove the lines that do not have Restaurants in the categories column. This column contains lists; however, the code gave the error 'KeyError: False', and I would like to understand why, and how to solve it.
The expression 'Restaurants' in businesses['categories'] returns the single boolean value False (in on a Series checks membership in the index, not the values). This is passed to the brackets indexing operator for the DataFrame businesses, which does not contain a column called False and thus raises a KeyError.
What you are looking to do is something called boolean indexing which works like this.
businesses[businesses['categories'] == 'Restaurants']
If you find that your data contains spelling variations or alternative restaurant-related terms, the following may be of benefit. Essentially, you put your restaurant-related terms in restaurant_lst. The lambda function returns True if any of the items in restaurant_lst are contained within a given row of the categories series. The .loc indexer then filters out the rows for which the lambda function returns False.
restaurant_lst = ['Restaurant','restaurantes','diner','bistro']
restaurant = businesses.loc[businesses['categories'].apply(lambda x: any(restaurant_str in x for restaurant_str in restaurant_lst))]
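A runnable sketch of this list-based filter; the frame, names, and terms here are made up to match the question's shape (a categories column holding lists):

```python
import pandas as pd

businesses = pd.DataFrame({
    'name': ["Joe's", 'BookWorld', 'La Cantina'],
    'categories': [['Restaurants', 'Bars'], ['Shopping'], ['restaurantes']],
})

restaurant_lst = ['Restaurant', 'restaurantes', 'diner', 'bistro']

# True where any term appears as a substring of any category in the row's list
mask = businesses['categories'].apply(
    lambda cats: any(term in cat for cat in cats for term in restaurant_lst)
)
restaurant = businesses.loc[mask]
print(restaurant['name'].tolist())  # ["Joe's", 'La Cantina']
```

The substring check is what lets 'Restaurant' match 'Restaurants' and catches the Spanish 'restaurantes' as well.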
The reason for this is that the Series class implements a custom in operator that returns a single bool (membership in the index) rather than an elementwise boolean Series the way == does; here's a workaround:
businesses[['Restaurants' in c for c in list(businesses['categories'])]]
hopefully this helps someone where you're looking for a substring in the column and not a full match.
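A small sketch of that list-comprehension approach; the frame here is made up, with categories as plain strings so the substring check is visible:

```python
import pandas as pd

businesses = pd.DataFrame({
    'name': ["Joe's", 'BookWorld'],
    'categories': ['Restaurants;Bars', 'Shopping'],
})

# Plain-Python substring test per row gives a list of booleans...
mask = ['Restaurants' in c for c in list(businesses['categories'])]
print(mask)  # [True, False]

# ...and indexing a DataFrame with a boolean list keeps only the True rows
print(businesses[mask]['name'].tolist())  # ["Joe's"]
```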
I think what you meant was :
businesses = businesses.loc[businesses['categories'] == 'Restaurants']
That will only keep the rows whose category is Restaurants.
None of the answers here actually worked for me:
businesses[businesses['categories'] == 'Restaurants']
obviously won't work, since the value in 'categories' is not a string but a list, meaning the comparison will always fail.
What does, however, work, is converting the column into tuples instead of strings:
businesses['categories'] = businesses['categories'].apply(tuple)
That allows you to use the standard .loc thing:
businesses.loc[businesses['categories'] == ('Restaurants',)]