I have an if statement where it checks if the data frame is not empty. The way I do it is the following:
if dataframe.empty:
    pass
else:
    # do something
But really I need:
if dataframe is not empty:
    # do something
My question is: is there a method like .not_empty() to achieve this? I also wanted to ask whether the second version is better in terms of performance. Otherwise, maybe it makes sense for me to leave it as it is, i.e. the first version?
Just do
if not dataframe.empty:
    # insert code here
This works because dataframe.empty returns True if the dataframe is empty. To invert it, we can use the negation operator not, which flips True to False and vice versa.
.empty returns a boolean value
>>> df_empty.empty
True
So if not empty can be written as
if not df.empty:
    # your code
Check pandas.DataFrame.empty; it might help someone.
You can use the attribute dataframe.empty to check whether it's empty or not:
if not dataframe.empty:
    # do something
Or
if len(dataframe) != 0:
    # do something
Or
if len(dataframe.index) != 0:
    # do something
As already clearly explained by other commentators, you can negate a boolean expression in Python by simply prepending the not operator, hence:
if not df.empty:
    # do something
does the trick.
I only want to clarify the meaning of "empty" in this context, because it was a bit confusing for me at first.
According to the Pandas documentation, the DataFrame.empty attribute returns True if any of the axes in the DataFrame are of length 0.
As a consequence, "empty" doesn't mean zero rows and zero columns, like someone might expect. A dataframe with zero rows (axis 0 is empty) but non-zero columns (axis 1 is not empty) is still considered empty:
> df = pd.DataFrame(columns=["A", "B", "C"])
> df.empty
True
Another interesting point highlighted in the documentation is that a DataFrame containing only NaNs is not considered empty.
> df = pd.DataFrame(columns=["A", "B", "C"], index=['a', 'b', 'c'])
> df
A B C
a NaN NaN NaN
b NaN NaN NaN
c NaN NaN NaN
> df.empty
False
No doubt that using empty is the most readable option here (explicit is better than implicit).
However, the most efficient in terms of computation time is to use len:
if len(df.index) != 0:
    # insert code here
Source : this answer.
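For what it's worth, here is a rough micro-benchmark sketch you can run yourself (the data size and repetition count are arbitrary, and absolute numbers depend on your pandas version and machine); both checks are effectively constant-time, so any difference is a small fixed overhead:
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10_000, 4))

# Time both emptiness checks; neither scans the data, so both are O(1).
print(timeit.timeit(lambda: df.empty, number=100_000))
print(timeit.timeit(lambda: len(df.index) == 0, number=100_000))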
Another way:
if dataframe.empty == False:
    # do something
My code is like:
a = pd.DataFrame([np.nan, True])
b = pd.DataFrame([True, np.nan])
c = a|b
print(c)
I don't know what the result of the logical operation is when one element is np.nan, but I expect it to be the same whatever the order. But I got a result like this:
       0
0  False
1   True
Why? Is this about short-circuiting in pandas? I searched the pandas docs but did not find an answer.
My pandas version is 1.1.3
This is behaviour that is tied to np.nan, not pandas. Take the following examples:
print(True or np.nan)
print(np.nan or True)
Output:
True
nan
When the operation is performed, the dtype ends up mattering, and the way np.nan behaves inside the numpy library is what leads to this order-dependent behaviour.
To get around this quirk, you can fill the NaN values with False (or some other token value which evaluates to False) using pandas.DataFrame.fillna(), for example as sketched below.
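A minimal sketch of that workaround using the frames from the question (the astype(bool) cast is just to make the object-dtype columns explicitly boolean before the OR):
import numpy as np
import pandas as pd

a = pd.DataFrame([np.nan, True])
b = pd.DataFrame([True, np.nan])

# Replace NaN with False so the element-wise OR no longer depends on operand order.
c = a.fillna(False).astype(bool) | b.fillna(False).astype(bool)
print(c)
# Both rows are now True, regardless of which side held the NaN.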
I have a dataframe containing two columns: one filled with a string (irrelevant), and the other one is (a reference to) a dataframe.
Now I want to keep only the rows where the dataframes in the second column have entries, i.e. len(df.index) > 0 (there should be rows left; I don't care about columns).
I know that sorting out rows like this works perfectly fine for me if I use it in a list comprehension and can do it on every entry by its own, like in the following example:
[do_x for a, inner_df
in zip(outer_df.index, outer_df["inner"])
if len(inner_df.index) > 0]
But if I try using it for conditional indexing to create a shorter version of the dataframe, it will produce the error KeyError: True.
I thought that wrapping it in len() could be the problem, so I also tried different approaches to check for zero rows. In the following I show four examples of how I tried it:
# a) with the length of the index
outer_df = outer_df.loc[len(outer_df["inner"].index) > 0, :]
# b) same, but with a lambda just like in the pandas docs user guide
# I used it on the other versions too, with no change in result
outer_df = outer_df.loc[lambda df: len(df["inner"]) > 0, :]
# c) switching
outer_df = outer_df.loc[outer_df["inner"].index.size > 0, :]
# d) even "shorter" version
outer_df = outer_df.loc[not outer_df["inner"].empty, :]
So... where is my error and can I even do it with conditional indexing or do I need to find another way?
Edit: Changed and added some sentences above for more clarity plus added all below.
I know that the filtering here works by creating a boolean Series of the same length as the dataframe (True/False for each row after a comparison), and only the rows with a True are kept.
I do not, however, see a fundamental difference between my attempt to create such a mask and the following examples (source: https://www.geeksforgeeks.org/selecting-rows-in-pandas-dataframe-based-on-conditions/):
# 1. difference: the resulting Series is *not* altered
# it just gets compared directly with the value 80 here
# -> I thought this might be the problem, but then there is also #2
df = df[df['Percentage'] > 80]
# or
df = df.loc[df['Percentage'] > 80]
# 2. Here the entry is checked in a similar way to my c and d
options = ['x', 'y']
df = df[df['Stream'].isin(options)]
# or
df = df.loc[df['Stream'].isin(options)]
In both example 2 here and my versions c & d, the entry in the cell (a string // a dataframe) is checked for something (is part of a list // is empty).
Not sure if I understand your question or where you are stuck. However, I will just write my comment in this answer so that I can easily edit the post.
First, let's try typing in myvar = df['Percentage'] > 80 and see what myvar is. See if the content of myvar makes sense to you.
There is really only one true rule for .loc[]: it takes a boolean mask (a "truth table" with one True/False per row).
Because the df[stuff] expression always appears inside .loc[ df[stuff] expression ], you might get the impression that df[stuff] has some special meaning. For example, df[df['Percentage'] > 80] is asking for any Percentage greater than 80, which looks quite intuitive! So... df['Percentage'] > 80 must be a "special syntax"? In reality, df['Percentage'] > 80 isn't anything special; it is just another truth table. The expression inside the brackets will always be a truth table, that's it.
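Applied to the original question, here is a minimal sketch of that idea (the column name "inner" and the toy data are only mirroring the question; .at is used just to store DataFrame objects in individual cells):
import pandas as pd

# Hypothetical setup mirroring the question: a column "inner" holding DataFrames.
outer_df = pd.DataFrame({"label": ["a", "b"], "inner": None})
outer_df.at[0, "inner"] = pd.DataFrame({"x": [1, 2]})  # non-empty inner frame
outer_df.at[1, "inner"] = pd.DataFrame()               # empty inner frame

# .loc[] needs a boolean mask (one True/False per row), not a single scalar,
# which is why len(outer_df["inner"].index) > 0 produced KeyError: True.
mask = outer_df["inner"].apply(lambda inner: len(inner.index) > 0)
outer_df = outer_df.loc[mask]
print(outer_df)  # only the row whose inner dataframe has rows remains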
I have a function that generates different dataframes; the third dataframe causes an error because it contains a final row of NaN values at the bottom.
I tried an if-else conditional statement to remove the row of NaN values, but every time I do, it keeps outputting the NaN values.
ma = 1
year = 3
df
if ma > 0 and year == 3:
    df[0:-1]
else:
    df
I also tried a nested if statement, but that produced the same output of NaN values.
ma_path = "SMA"
year_path = "YEAR_3"
if ma_path == ["SMA"]:
if year_path == ["YEAR_3"]:
df[0:-1]
else:
df
I'm sure it's something simple that I've missed. Can anyone help? Thanks in advance.
df[0:-1] does not change the values that df currently contains. If you want to remove the last item of df, you need to assign the slice back to the name:
df = df[0:-1]
If df was an ordinary list, you could also remove items with pop.
df.pop()
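For the original example, a minimal sketch of that fix with a toy frame standing in for df (the column name "value" is made up for illustration):
import numpy as np
import pandas as pd

# Hypothetical frame mirroring the question: a trailing row of NaNs.
df = pd.DataFrame({"value": [1.0, 2.0, np.nan]})
ma = 1
year = 3

if ma > 0 and year == 3:
    df = df[0:-1]  # reassign, otherwise the slice is computed and then discarded
print(df)  # the NaN row at the bottom is gone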
I am trying to access a scalar value in a multi column dataframe via a lookup as follows:
targetDate = '2016-01-01'
df['revenue'][df['date']== targetDate].values[0]
Now, in my case there is nothing found in the dataframe for the targetDate. So I get the following index error:
IndexError: ('index 0 is out of bounds for axis 0 with size 0', 'occurred at index 69322')
Is there a pandas built-in way to gracefully result in np.nan in such cases? How would you handle such a situation?
I don't want my script to fail when nothing is found.
You can check if the Series is empty and then use if-else:
targetDate = '2016-01-01'
a = df.loc[df['date']== targetDate, 'revenue']
print (a)
Series([], Name: revenue, dtype: int32)
if len(a) == 0:
    print('empty')
else:
    first_val = a.values[0]
Similar solution with Series.empty:
if a.empty:
    first_val = np.nan
else:
    first_val = a.values[0]
print(first_val)
If you precede values with head(1) and remove the subscript on values, that will avoid the error message, although it won't fill in a nan (it will just be an empty numpy array).
df['revenue'][df['date']== targetDate].head(1).values
But you could do something like this to get a nan instead of empty.
df['revenue'][df['date']== targetDate].append(pd.Series(np.nan)).head(1).values
Or do it as a try/except or an if/else as #jezrael does. Lots of ways to do this, of course, just depends on what is convenient for you.
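If you want to hide the check behind a function, here is a minimal helper sketch (the function name is made up; it assumes 'date' and 'revenue' columns as in the question):
import numpy as np
import pandas as pd

def first_revenue_or_nan(df, target_date):
    # Return the first matching revenue, or np.nan when no row matches.
    matches = df.loc[df['date'] == target_date, 'revenue']
    return matches.iloc[0] if not matches.empty else np.nan

# Usage with a tiny hypothetical frame: no row matches, so we get nan back.
df = pd.DataFrame({'date': ['2016-02-01'], 'revenue': [100]})
print(first_revenue_or_nan(df, '2016-01-01'))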
I've been trying to replace missing values in a Pandas dataframe, but without success. I tried the .fillna method and also tried to loop through the entire data set, checking each cell and replacing NaNs with a chosen value. However, in both cases, Python executes the script without throwing up any errors, but the NaN values remain.
When I dug a bit deeper, I discovered behaviour that seems erratic to me, best demonstrated with an example:
In[ ] X['Smokinginpregnancy'].head()
Out[ ]
Index
E09000002 NaN
E09000003 5.216126
E09000004 10.287496
E09000005 3.090379
E09000006 6.080041
Name: Smokinginpregnancy, dtype: float64
I know for a fact that the first item in this column is missing and pandas recognises it as NaN. In fact, if I call this item on its own, python tells me it's NaN:
In [ ] X['Smokinginpregnancy'][0]
Out [ ]
nan
However, when I test whether it's NaN, python returns False.
In [ ] X['Smokinginpregnancy'][0] == np.nan
Out [ ] False
I suspect that when .fillna is being executed, python checks whether the item is NaN but gets back a False, so it continues, leaving the cell alone.
Does anyone know what's going on? Any solutions? (apart from opening the csv file in excel and then manually replacing the values.)
I'm using Anaconda's Python 3 distribution.
You are doing:
X['Smokinginpregnancy'][0] == np.nan
This is guaranteed to return False because all NaNs compare unequal to everything under the IEEE 754 standard:
>>> x = float('nan')
>>> x == x
False
>>> x == 1
False
>>> x == float('nan')
False
See also here.
You have to use math.isnan to check for NaNs:
>>> math.isnan(x)
True
Or numpy.isnan
So use:
numpy.isnan(X['Smokinginpregnancy'][0])
Regarding pandas.fillna, note that this function returns the filled DataFrame rather than modifying it in place. Maybe you did something like:
X.fillna(...)
without reassigning X? Alternatively, you must pass inplace=True to mutate the dataframe on which you are calling the method.
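A minimal sketch of that reassignment point, with toy data standing in for X from the question:
import numpy as np
import pandas as pd

# Toy frame standing in for X.
X = pd.DataFrame({'Smokinginpregnancy': [np.nan, 5.216126, 10.287496]})

X = X.fillna(0)              # fillna returns a new object, so reassign it...
# X.fillna(0, inplace=True)  # ...or mutate the existing frame in place instead
print(X)                     # the NaN is now 0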
NaN values in pandas can be checked with the function pandas.isnull. I created a boolean mask and returned the subset with the NaN values.
The function fillna can be used for the single column Smokinginpregnancy (more info in the docs):
X['Smokinginpregnancy'] = X['Smokinginpregnancy'].fillna('100')
or
X['Smokinginpregnancy'].fillna('100', inplace=True)
Warning:
Sometimes inplace=True can be ignored; it is better not to use it (see the linked GitHub discussions).
All together:
print(X['Smokinginpregnancy'].head())
#Index
#E09000002 NaN
#E09000003 5.216126
#E09000004 10.287496
#E09000005 3.090379
#E09000006 6.080041
#check NaN in column Smokinginpregnancy by boolean mask
mask = pd.isnull(X['Smokinginpregnancy'])
XNaN = X[mask]
print(XNaN)
# Smokinginpregnancy
#Index
#E09000002 NaN
#use function fillna for column Smokinginpregnancy
#X['Smokinginpregnancy'] = X['Smokinginpregnancy'].fillna('100')
X['Smokinginpregnancy'].fillna('100', inplace=True)
print(X)
# Smokinginpregnancy
#Index
#E09000002 100
#E09000003 5.216126
#E09000004 10.2875
#E09000005 3.090379
#E09000006 6.080041
More information on why the comparison doesn't work:
One has to be mindful that in Python (and numpy), NaNs don't compare equal, but Nones do. Note that pandas/numpy uses the fact that np.nan != np.nan, and treats None like np.nan. More info in Bakuriu's answer.
In [11]: None == None
Out[11]: True
In [12]: np.nan == np.nan
Out[12]: False