Is there an `in`-like statement for a Pandas Series? - python

I can easily check whether each element is equal to a number:
In [20]: s = pd.Series([1, 2, 3])
In [21]: s == 1
Out[21]:
0     True
1    False
2    False
dtype: bool
My problem is: is there a function like s._in([1, 2]) that outputs something like
0 True
1 True
2 False

Yes, it is called isin. Do s.isin([1, 2]).
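For instance, with the Series s defined above:
>>> s.isin([1, 2])
0     True
1     True
2    False
dtype: bool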

Related

Pandas: find matching rows in two dataframes (without using `merge`)

Let's suppose I have these two dataframes with the same number of columns, but a possibly different number of rows:
import numpy as np
import pandas as pd

tmp = np.arange(0, 12).reshape((4, 3))
df = pd.DataFrame(data=tmp)
tmp2 = {'a': [3, 100, 101], 'b': [4, 4, 100], 'c': [5, 100, 3]}
df2 = pd.DataFrame(data=tmp2)
print(df)
0 1 2
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
print(df2)
a b c
0 3 4 5
1 100 4 100
2 101 100 3
I want to verify whether the rows of df2 match any rows of df; that is, I want to obtain a series (or an array) of boolean values that gives this result:
0 True
1 False
2 False
dtype: bool
I think something like the isin method should work, but I get this result, which is a dataframe and is wrong:
print(df2.isin(df))
a b c
0 False False False
1 False False False
2 False False False
As a constraint, I wish not to use the merge method, since what I am doing is in fact a check on the data before applying merge itself.
Thank you for your help!
You can use numpy.isin, which compares each element of df2 against all elements of df and returns True or False element-wise. (Your df2.isin(df) attempt returns all False because DataFrame.isin with a DataFrame argument requires both the index and the column labels to match.)
Then calling all() on each row gives your desired output, since all() returns True only if every element is true:
>>> pd.Series([m.all() for m in np.isin(df2.values,df.values)])
0 True
1 False
2 False
dtype: bool
Breakdown of what is happening:
# np.isin
>>> np.isin(df2.values,df.values)
Out[139]:
array([[ True, True, True],
[False, True, False],
[False, False, True]])
# all()
>>> [m.all() for m in np.isin(df2.values,df.values)]
Out[140]: [True, False, False]
# pd.Series()
>>> pd.Series([m.all() for m in np.isin(df2.values,df.values)])
Out[141]:
0 True
1 False
2 False
dtype: bool
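One caveat worth noting: np.isin tests each element for membership anywhere in df, not within the corresponding row, so a row can be flagged True even when no single row of df contains all of its values together:
# 2, 4 and 0 each occur somewhere in df, but never in the same row
>>> np.isin([[2, 4, 0]], df.values)
array([[ True,  True,  True]])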
Use np.in1d:
>>> df2.apply(lambda x: all(np.in1d(x, df)), axis=1)
0 True
1 False
2 False
dtype: bool
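Note that NumPy 2.0 deprecates np.in1d in favor of np.isin; an equivalent sketch using np.isin would be:
>>> df2.apply(lambda x: np.isin(x, df.to_numpy()).all(), axis=1)
0     True
1    False
2    False
dtype: bool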
Another way, use frozenset:
>>> df2.apply(frozenset, axis=1).isin(df.apply(frozenset, axis=1))
0 True
1 False
2 False
dtype: bool
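Keep in mind that frozenset ignores column order and duplicate values within a row, so two rows containing the same values in different columns would still match. If that matters, a stricter sketch of the same idea compares rows as tuples instead:
>>> df2.apply(tuple, axis=1).isin(df.apply(tuple, axis=1))
0     True
1    False
2    False
dtype: bool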
You can use a MultiIndex (expensive IMO):
pd.MultiIndex.from_frame(df2).isin(pd.MultiIndex.from_frame(df))
Out[32]: array([ True, False, False])
Another option is to create a dictionary, and run isin:
df2.isin({key : array.array for key, (_, array) in zip(df2, df.items())}).all(1)
Out[45]:
0 True
1 False
2 False
dtype: bool
There may be more efficient solutions, but you could append the two dataframes and call duplicated, e.g.:
df.append(df2).duplicated().iloc[df.shape[0]:]
This assumes that all rows in each DataFrame are distinct. Here are some benchmarks:
tmp1 = np.arange(0,12).reshape((4,3))
df1 = pd.DataFrame(data=tmp1, columns=["a", "b", "c"])
tmp2 = {'a':[3,100,101], 'b':[4,4,100], 'c':[5,100,3]}
df2 = pd.DataFrame(data=tmp2)
df1 = pd.concat([df1] * 10_000).reset_index(drop=True)
df2 = pd.concat([df2] * 10_000).reset_index(drop=True)
%timeit df1.append(df2).duplicated().iloc[df1.shape[0]:]
# 100 loops, best of 5: 4.16 ms per loop
%timeit pd.Series([m.all() for m in np.isin(df2.values,df1.values)])
# 10 loops, best of 5: 74.9 ms per loop
%timeit df2.apply(frozenset, axis=1).isin(df1.apply(frozenset, axis=1))
# 1 loop, best of 5: 443 ms per loop
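Note that DataFrame.append was removed in pandas 2.0; the same approach with pd.concat would be:
pd.concat([df1, df2]).duplicated().iloc[df1.shape[0]:]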
Try:
df[~df.apply(tuple, axis=1).isin(df2.apply(tuple, axis=1))]
This returns the rows of df that do not appear in df2.

Pandas IF statement referencing other column value

I can't for the life of me find an example of this operation in pandas. I am trying to write an IF statement saying: if my Check column is True, then pull the value from my Proficiency_Value column, and if it's False, then default to 1000.
report_skills['Check'] = report_skills.apply(lambda x: x.Skill_Specialization in x.Specialization, axis=1)
report_skills = report_skills.loc[report_skills['Check'] == True, 'Proficiency_Value'] = 1000
Any ideas why this is not working? I'm sure this is an easy fix
Let's create a small example DataFrame like this:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Check': [True, False, False, True],
                   'Proficiency_Value': range(4)})
>>> df
Check Proficiency_Value
0 True 0
1 False 1
2 False 2
3 True 3
If you now use the np.where() function, you get the result you are asking for.
df['Proficiency_Value'] = np.where(df['Check']==True, df['Proficiency_Value'], 1000)
>>> df
Check Proficiency_Value
0 True 0
1 False 1000
2 False 1000
3 True 3
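If you would rather modify the column in place, a boolean-mask sketch with .loc (using the same example df) gives the same result:
df.loc[~df['Check'], 'Proficiency_Value'] = 1000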

Count number of rows with NaN in a pandas DataFrame?

Having the following running code:
import datetime as dt
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
my_funds = [1, 2, 5, 7, 9, 11]
my_time = ['2020-01', '2019-12', '2019-11', '2019-10', '2019-09', '2019-08']
df = pd.DataFrame({'TIME': my_time, 'FUNDS':my_funds})
for x in range(2, 3):
    df.insert(len(df.columns), f'x**{x}', df["FUNDS"]**x)
df = df.replace([1, 7, 9, 25],float('nan'))
print(df.isnull().values.ravel().sum()) #5 (obviously counting NaNs in total)
print(sum(map(any, df.isnull()))) #3 (I guess counting the NaNs in the left column)
I am getting the dataframe below. I want to get the count of rows with 1 or more NaN, which in my case is 4 (rows [0, 2, 3, 4]).
Use:
print (df.isna().any(axis=1).sum())
4
Explanation: First compare missing values by DataFrame.isna:
print (df.isna())
TIME FUNDS x**2
0 False True True
1 False False False
2 False False True
3 False True False
4 False True False
5 False False False
And test whether at least one value per row is True by DataFrame.any:
print (df.isna().any(axis=1))
0 True
1 False
2 True
3 True
4 True
5 False
dtype: bool
And last, count the Trues with sum.
Another option:
nan_rows = len(df[df["FUNDS"].isna() | df["x**2"].isna()])
Another new option is Series.clip, which caps each row's NaN count at one when there is more than one NaN per row:
df.isna().sum(axis=1).clip(upper=1).sum()
#4
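Breakdown of the clip approach: df.isna().sum(axis=1) counts the NaNs per row, and clip(upper=1) caps each count at one before the final sum:
>>> df.isna().sum(axis=1)
0    2
1    0
2    1
3    1
4    1
5    0
dtype: int64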

Comparing two columns in pandas dataframe

I have a pandas dataframe where I would like to verify that column A is greater than column B (row-wise). I am doing something like:
tmp = df['B'] - df['A']
if any([v for v in tmp if v > 0]):
    ....
I was wondering if there was a better (more concise) way of doing it, or if pandas dataframes have any built-in routines to accomplish this.
df = pd.DataFrame({'A': [1, 2, 3], 'B': [3, 1, 1]})
temp = df['B'] - df['A']
print(temp)
0 2
1 -1
2 -2
Now you can create a Boolean series using temp > 0:
print(temp > 0)
0 True
1 False
2 False
dtype: bool
This boolean series can be fed to any and therefore you can use:
if any(temp > 0):
    print('juhu!')
Or simply (which avoids temp):
if any(df['B'] > df['A']):
    print('juhu')
using the same logic of creating a Boolean series first:
print(df['B'] > df['A'])
0 True
1 False
2 False
dtype: bool
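Equivalently, the boolean Series has its own .any() method, so you can avoid Python's built-in any altogether:
if (df['B'] > df['A']).any():
    print('juhu')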
df['B'] > df['A'] will be a pandas Series with boolean dtype.
>>> (df['B']>df['A']).dtype
dtype('bool')
For example
>>> df['B']>df['A']
0 True
1 False
2 False
3 True
4 True
dtype: bool
The any() function returns True if any item in an iterable is true:
>>> if any(df['B'] > df['A']):
...     print(True)
...
True
I guess you wanted to check whether any df['B'] > df['A'] and then do something.
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [2, 0, 6, 3]})
if np.where(df['B'] > df['A'], 1, 0).sum():
    print('do something')

Numpy/Pandas clean way to check if a specific value is NaN

How can I check if a given value is NaN?
e.g. if (a == np.NaN) (doesn't work)
Please note that:
Numpy's isnan method throws errors with data types like string
Pandas docs only provide methods to drop rows containing NaNs, or ways to check if/when DataFrame contains NaNs. I'm asking about checking if a specific value is NaN.
Relevant Stackoverflow questions and Google search results seem to be about checking "if any value is NaN" or "which values in a DataFrame"
There must be a clean way to check if a given value is NaN?
You can use the innate property that NaN != NaN: a == a will return False if a is NaN.
This works even for strings.
Example:
In[52]:
s = pd.Series([1, np.NaN, '', 1.0])
s
Out[52]:
0 1
1 NaN
2
3 1
dtype: object
for val in s:
    print(val == val)
True
False
True
True
This can be done in a vectorised manner:
In[54]:
s==s
Out[54]:
0 True
1 False
2 True
3 True
dtype: bool
but you can still use the method isnull on the whole series:
In[55]:
s.isnull()
Out[55]:
0 False
1 True
2 False
3 False
dtype: bool
UPDATE
As noted by @piRSquared, None == None returns True (so the a == a trick will not flag None), but pd.isnull(None) also returns True. So depending on whether you want to treat None as NaN, you can still use == for the comparison, or pd.isnull if you want None treated as NaN.
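A quick illustration of that difference:
>>> None == None
True
>>> float('nan') == float('nan')
False
>>> pd.isnull(None)
True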
Pandas has isnull, notnull, isna, and notna
These functions work for arrays or scalars.
Setup
a = np.array([[1, np.nan],
              [None, '2']])
Pandas functions
pd.isna(a)
# same as
# pd.isnull(a)
array([[False, True],
[ True, False]])
pd.notnull(a)
# same as
# pd.notna(a)
array([[ True, False],
[False, True]])
DataFrame (or Series) methods
b = pd.DataFrame(a)
b.isnull()
# same as
# b.isna()
0 1
0 False True
1 True False
b.notna()
# same as
# b.notnull()
0 1
0 True False
1 False True
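Since the question asks about a single value, note that these functions accept scalars as well, including types that would make np.isnan raise:
>>> pd.isna(np.nan)
True
>>> pd.isna(None)
True
>>> pd.isna('a string')
False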
