Is the condition None == None true or false?
I have two pandas DataFrames:
import pandas as pd
df1 = pd.DataFrame({'id':[1,2,3,4,5], 'value':[None,20,None,40,50]})
df2 = pd.DataFrame({'index':[1,2,3], 'value':[None,20,None]})
In [42]: df1
Out[42]:
   id  value
0   1    NaN
1   2   20.0
2   3    NaN
3   4   40.0
4   5   50.0
In [43]: df2
Out[43]:
   index  value
0      1    NaN
1      2   20.0
2      3    NaN
When I execute a merge, it looks like None == None is True:
In [37]: df3 = df1.merge(df2, on='value', how='inner')
In [38]: df3
Out[38]:
   id  value  index
0   1    NaN      1
1   1    NaN      3
2   3    NaN      1
3   3    NaN      3
4   2   20.0      2
but when I do this:
In [39]: df4 = df3[df3['value']==df3['value']]
In [40]: df4
Out[40]:
   id  value  index
4   2   20.0      2
In [41]: df3['value'] == df3['value']
Out[41]:
0    False
1    False
2    False
3    False
4     True
It shows that None == None is false.
Pandas uses the floating point Not a Number value, NaN, to indicate that something is missing in a series of numbers, because NaN is easier to handle in the internal representation of the data. You don't have any None objects in your series. Even so, with dtype=object data, None is used to encode a missing value. See Working with missing data.
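A small sketch of that conversion (the values here are illustrative):

```python
import pandas as pd

# In a numeric series, None is converted to the float NaN on construction.
s = pd.Series([None, 20, None], dtype=float)
print(s.isnull().tolist())  # [True, False, True]

# With dtype=object, the original None objects are kept as-is.
s_obj = pd.Series([None, 20, None], dtype=object)
print(s_obj[0] is None)  # True
```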
Not that it matters here, but NaN is always, by definition, not equal to NaN:
>>> float('NaN') == float('NaN')
False
When merging or broadcasting, pandas knows what 'missing' means: no Python equality test is run on the NaN or None values in a series. Missing values are handled explicitly, which is why the NaN keys above were matched with each other.
If you want to test whether a value is null or not, use the series.isnull() and series.notnull() methods instead.
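For example, on a frame similar to the merged df3 above (rebuilt here with illustrative values), the null-aware tests look like this:

```python
import pandas as pd

# A frame resembling df3 above, with two missing values and one real one.
df3 = pd.DataFrame({'value': [float('nan'), float('nan'), 20.0]})

print(df3['value'].isnull().tolist())   # [True, True, False]
print(df3['value'].notnull().tolist())  # [False, False, True]

# Keep only the rows with a real value:
print(df3[df3['value'].notnull()])
```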
I am trying to generate a pandas Dataframe where a column will have numerical values based on the values of a column in another dataframe. Below is an example:
I want to generate another dataframe based on a column of dataframe df_
ipdb> df_ = pd.DataFrame({'c1':[False, True, False, True]})
ipdb> df_
      c1
0  False
1   True
2  False
3   True
Using df_ another dataframe df1 is generated with columns as below.
ipdb> df1
   col1  col2
0     0   NaN
1     1     0
2     2   NaN
3     3     1
Here, 'col1' has normal index values and 'col2' has NaN in the rows where df_['c1'] was False, and sequentially incrementing values where it was True.
To generate this dataframe, below is what I have tried.
ipdb> df_[df_['c1']==True].reset_index().reset_index()
   level_0  index    c1
0        0      1  True
1        1      3  True
However, I feel there should be a better way to generate a dataframe with the two columns as in df1.
I think you need cumsum, then subtract 1 to start counting from 0:
df_ = pd.DataFrame({'c1':[False, True, False, True]})
df_['col2'] = df_.loc[df_['c1'], 'c1'].cumsum().sub(1)
print (df_)
      c1  col2
0  False   NaN
1   True   0.0
2  False   NaN
3   True   1.0
Another solution is to count the occurrences of True values with sum, build the positions with numpy.arange, and assign back to the filtered DataFrame:
import numpy as np

df_.loc[df_['c1'], 'col2'] = np.arange(df_['c1'].sum())
print (df_)
      c1  col2
0  False   NaN
1   True   0.0
2  False   NaN
3   True   1.0
Details:
print (df_['c1'].sum())
2
print (np.arange(df_['c1'].sum()))
[0 1]
Another way to solve this:
df_.loc[df_['c1'], 'col2'] = range(len(df_[df_['c1']]))
Output:
      c1  col2
0  False   NaN
1   True   0.0
2  False   NaN
3   True   1.0
I would like to delete the rows in which every value is between 10 and 25, i.e. rows with no value less than 10 and none greater than 25. My sample dataframe looks like this:
 a   b   c
 1   2   3
 4   5  16
11  24  22
26  50  65
Expected Output:
 a   b   c
 1   2   3
 4   5  16
26  50  65
So if the row contains any value less than 10 or greater than 25, then the row will stay in dataframe, otherwise, it needs to be dropped.
Is there any way I can achieve this with Pandas instead of iterating through all the rows?
You can call apply and return the results to a new column called 'Keep'. You can then use this column to drop rows that you don't need.
import pandas as pd
l = [[1,2,3],[4,5,6],[11,24,22],[26,50,65]]
df = pd.DataFrame(l, columns = ['a','b','c']) #Set up sample dataFrame
df['keep'] = df.apply(lambda row: int(any((x < 10) or (x > 25) for x in row)), axis=1)
The any() function returns True as soon as any element of the iterable is truthy; wrapping the result in int() stores it as 1 or 0.
See the Python documentation on how any() works.
The apply function still iterates over all the rows like a for loop, but the code looks cleaner this way. I cannot think of a way to do this without iterating over all the rows.
Output:
    a   b   c  keep
0   1   2   3     1
1   4   5   6     1
2  11  24  22     0
3  26  50  65     1
df = df[df['keep'] == 1] #Drop unwanted rows
You can use pandas boolean indexing
dropped_df = df.loc[((df < 10) | (df > 25)).any(axis=1)]
df < 10 returns a boolean DataFrame
| is the OR operator
.any(axis=1) returns True for each row that contains at least one True element (see the documentation)
df.loc[] then filters the dataframe based on that boolean Series
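Put together on the question's sample data, those steps can be sketched as:

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame([[1, 2, 3], [4, 5, 16], [11, 24, 22], [26, 50, 65]],
                  columns=['a', 'b', 'c'])

# True for every row that has at least one value <10 or >25.
keep = ((df < 10) | (df > 25)).any(axis=1)
print(keep.tolist())  # [True, True, False, True]

# Rows where no value matched are dropped.
dropped_df = df.loc[keep]
print(dropped_df)
```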
I really like using masking for things like this; it's clean, so you can go back and read your code. It's also faster than using .apply, which is effectively a for loop, and it avoids SettingWithCopyWarning.
This uses boolean indexing like Prageeth's answer. The difference is that you can save the boolean index as a separate variable for re-use later. I often do that so I don't have to modify the original dataframe or create a new one, and just use df[mask] wherever I want that cropped view of the dataframe.
df = pd.DataFrame(
[[1,2,3],
[4,5,16],
[11,24,22],
[26,50,65]],
columns=['a','b','c']
)
#use a mask to create a fully indexed boolean dataframe,
#which avoids the SettingWithCopyWarning:
#https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
mask = (df > 10) & (df < 25)
print(mask)
"""
a b c
0 False False False
1 False False True
2 True True True
3 False False False
"""
print(df[mask])
"""
a b c
0 NaN NaN NaN
1 NaN NaN 16.0
2 11.0 24.0 22.0
3 NaN NaN NaN
"""
print(df[mask].dropna())
"""
a b c
2 11.0 24.0 22.0
"""
#one neat thing about using masks is that you can invert them too with a '~'
print(~mask)
"""
a b c
0 True True True
1 True True False
2 False False False
3 True True True
"""
print( df[~mask].dropna())
"""
a b c
0 1.0 2.0 3.0
3 26.0 50.0 65.0
"""
#you can also combine masks
mask2 = mask & (df < 24)
print(mask2)
"""
a b c
0 False False False
1 False False True
2 True False False
3 False False False
"""
#and the resulting dataframe (without dropping the rows that are nan or contain any false mask)
print(df[mask2])
"""
a b c
0 NaN NaN NaN
1 NaN NaN 16.0
2 11.0 NaN 22.0
3 NaN NaN NaN
"""
I want to use .notnull() on several columns of a dataframe to eliminate the rows which contain "NaN" values.
Let say I have the following df:
     A    B    C
0    1    1    1
1    1  NaN    1
2    1  NaN  NaN
3  NaN    1    1
I tried this syntax, but it does not work. Do you know what I am doing wrong?
df[[df.A.notnull()],[df.B.notnull()],[df.C.notnull()]]
I get this Error:
TypeError: 'Series' objects are mutable, thus they cannot be hashed
What should I do to get the following output?
A B C
0 1 1 1
Any idea?
You can first select the subset of columns with df[['A','B','C']], then apply notnull and require that all values in the mask are True:
print (df[['A','B','C']].notnull())
A B C
0 True True True
1 True False True
2 True False False
3 False True True
print (df[['A','B','C']].notnull().all(1))
0 True
1 False
2 False
3 False
dtype: bool
print (df[df[['A','B','C']].notnull().all(1)])
A B C
0 1.0 1.0 1.0
Another solution, from Ayhan's comment, is dropna:
print (df.dropna(subset=['A', 'B', 'C']))
A B C
0 1.0 1.0 1.0
which is the same as:
print (df.dropna(subset=['A', 'B', 'C'], how='any'))
and means: drop all rows in which at least one of those columns is NaN.
You can apply multiple conditions by combining them with the & operator (this works not only for the notnull() function).
df[(df.A.notnull() & df.B.notnull() & df.C.notnull())]
A B C
0 1.0 1.0 1.0
Alternatively, you can just drop all rows which contain NaN. The original DataFrame is not modified; instead, a copy is returned.
df.dropna()
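A quick sketch showing that dropna returns a new DataFrame and leaves the original intact (data mirrors the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, np.nan],
                   'B': [1, np.nan, np.nan, 1],
                   'C': [1, 1, np.nan, 1]})

clean = df.dropna()          # rows containing any NaN are dropped
print(len(df), len(clean))   # 4 1 -- the original still has all four rows
print(clean)
```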
You can simply do:
df.dropna()
I'm trying to sum across columns of a Pandas dataframe, and when I have NaNs in every column I'm getting sum = zero; I'd expected sum = NaN based on the docs. Here's what I've got:
In [136]: df = pd.DataFrame()
In [137]: df['a'] = [1,2,np.nan,3]
In [138]: df['b'] = [4,5,np.nan,6]
In [139]: df
Out[139]:
a b
0 1 4
1 2 5
2 NaN NaN
3 3 6
In [140]: df['total'] = df.sum(axis=1)
In [141]: df
Out[141]:
a b total
0 1 4 5
1 2 5 7
2 NaN NaN 0
3 3 6 9
The pandas.DataFrame.sum docs say "If an entire row/column is NA, the result will be NA", so I don't understand why "total" = 0 and not NaN for index 2. What am I missing?
From the pandas.DataFrame.sum documentation:
DataFrame.sum(self, axis=None, skipna=None, level=None, numeric_only=None, min_count=0, **kwargs)
min_count: int, default 0
The required number of valid values to
perform the operation. If fewer than min_count non-NA values are
present the result will be NA.
New in version 0.22.0: Added with the default being 0. This means the
sum of an all-NA or empty Series is 0, and the product of an all-NA or
empty Series is 1.
As the pandas docs quoted above say, min_count defaults to 0, so the sum of an all-NA series is 0.
If you pass min_count=1, the result of the sum will be NaN instead.
Here is an example:
import numpy as np
import pandas as pd

df1 = pd.DataFrame()
df1['a'] = [1, 2, np.nan, 3]
df1['b'] = [np.nan, 2, np.nan, 3]
df1
Out[4]:
a b
0 1.0 NaN
1 2.0 2.0
2 NaN NaN
3 3.0 3.0
df1.sum(axis=1, skipna=False)
Out[6]:
0 NaN
1 4.0
2 NaN
3 6.0
dtype: float64
df1.sum(axis=1, skipna=True)
Out[7]:
0 1.0
1 4.0
2 0.0
3 6.0
dtype: float64
df1.sum(axis=1, min_count=1)
Out[7]:
0 1.0
1 4.0
2 NaN
3 6.0
dtype: float64
A solution would be to select the rows where all summed columns are NaN, then set their total to NaN:
df['total'] = df.sum(axis=1)
df.loc[df['a'].isnull() & df['b'].isnull(),'total']=np.nan
or
df['total'] = df.sum(axis=1)
df.loc[df[['a','b']].isnull().all(1),'total']=np.nan
The latter option is probably more practical, because you can create a list of columns ['a','b', ... , 'z'] which you may want to sum.
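As a sketch of that idea combined with the min_count parameter described earlier (pandas 0.22+), you can also sum just the listed columns directly and get NaN for the all-NaN rows without a second step:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, np.nan, 3],
                   'b': [4, 5, np.nan, 6]})

cols = ['a', 'b']  # the list of columns you want to sum
df['total'] = df[cols].sum(axis=1, min_count=1)
print(df['total'].tolist())  # [5.0, 7.0, nan, 9.0]
```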
I got around this by casting the series to a numpy array, which computes the answer correctly.
print(np.array([np.nan,np.nan,np.nan]).sum()) # nan
print(pd.Series([np.nan,np.nan,np.nan]).sum()) # 0.0
print(pd.Series([np.nan,np.nan,np.nan]).to_numpy().sum()) # nan
How do I drop a row if any of the values in the row equal zero?
I would normally use df.dropna() for NaN values but not sure how to do it with "0" values.
I think the easiest way is to look at the rows where all values are not equal to 0:
df[(df != 0).all(1)]
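For example, on a small frame (a quick sketch):

```python
import pandas as pd

df = pd.DataFrame([[1, 0, 2], [1, 2, 3], [0, 1, 2], [4, 5, 6]])

# True only for rows where every value is non-zero.
mask = (df != 0).all(axis=1)
print(df[mask])  # keeps rows 1 and 3
```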
You could make a boolean frame and then use any:
>>> df = pd.DataFrame([[1,0,2],[1,2,3],[0,1,2],[4,5,6]])
>>> df
0 1 2
0 1 0 2
1 1 2 3
2 0 1 2
3 4 5 6
>>> df == 0
0 1 2
0 False True False
1 False False False
2 True False False
3 False False False
>>> df = df[~(df == 0).any(axis=1)]
>>> df
0 1 2
1 1 2 3
3 4 5 6
Although it is late, someone else might find it helpful.
I had a similar issue, but the following worked best for me.
df = pd.read_csv(r'your file')
df = df[df['your column name'] != 0]
reference:
Drop rows with all zeros in pandas data frame
see #ikbel benabdessamad
Assume a simple DataFrame as below:
df=pd.DataFrame([1,2,0,3,4,0,9])
Picking the non-zero values turns all zeros into NaN, and dropna then removes those rows:
df=df[df!=0].dropna()
df
Output:
0
0 1.0
1 2.0
3 3.0
4 4.0
6 9.0