Pandas: find matching rows in two dataframes (without using `merge`)

Let's suppose I have these two dataframes with the same number of columns, but possibly different number of rows:
import numpy as np
import pandas as pd

tmp = np.arange(0, 12).reshape((4, 3))
df = pd.DataFrame(data=tmp)
tmp2 = {'a': [3, 100, 101], 'b': [4, 4, 100], 'c': [5, 100, 3]}
df2 = pd.DataFrame(data=tmp2)
print(df)
   0   1   2
0  0   1   2
1  3   4   5
2  6   7   8
3  9  10  11
print(df2)
     a    b    c
0    3    4    5
1  100    4  100
2  101  100    3
I want to check whether each row of df2 matches any row of df; that is, I want a Series (or an array) of boolean values giving this result:
0     True
1    False
2    False
dtype: bool
I thought something like the isin method would work, but it returns a DataFrame, and the result is wrong:
print(df2.isin(df))
       a      b      c
0  False  False  False
1  False  False  False
2  False  False  False
As a constraint, I'd like to avoid the merge method, since what I am doing is in fact a check on the data before applying merge itself.
Thank you for your help!

You can use numpy.isin, which tests each element of the first array for membership in the second and returns True or False per element, preserving the shape.
Then calling all() on each row gives the desired output, since all() returns True only if every element is True:
>>> pd.Series([m.all() for m in np.isin(df2.values, df.values)])
0     True
1    False
2    False
dtype: bool
Breakdown of what is happening:
# np.isin
>>> np.isin(df2.values, df.values)
array([[ True,  True,  True],
       [False,  True, False],
       [False, False,  True]])

# all()
>>> [m.all() for m in np.isin(df2.values, df.values)]
[True, False, False]

# pd.Series()
>>> pd.Series([m.all() for m in np.isin(df2.values, df.values)])
0     True
1    False
2    False
dtype: bool
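One caveat worth knowing: np.isin tests whether each element of df2 appears anywhere in df, not whether the whole row appears as a row of df, so a row assembled from values scattered across different rows of df would also pass. A fully row-wise check can be sketched with NumPy broadcasting (a sketch, fine for small frames; memory grows with rows(df2) x rows(df) x columns):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3))
df2 = pd.DataFrame({'a': [3, 100, 101], 'b': [4, 4, 100], 'c': [5, 100, 3]})

# compare every row of df2 against every row of df:
# all(-1) -> the entire row matches; any(-1) -> it matches at least one row of df
matches = (df2.values[:, None, :] == df.values[None, :, :]).all(-1).any(-1)
print(pd.Series(matches))  # 0 True, 1 False, 2 False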

Use np.in1d (the older, flattened form of np.isin):
>>> df2.apply(lambda x: all(np.in1d(x, df)), axis=1)
0     True
1    False
2    False
dtype: bool
Another way: compare rows as frozensets (note this ignores element order and duplicates within a row):
>>> df2.apply(frozenset, axis=1).isin(df.apply(frozenset, axis=1))
0     True
1    False
2    False
dtype: bool

You can use a MultiIndex (expensive IMO):
pd.MultiIndex.from_frame(df2).isin(pd.MultiIndex.from_frame(df))
Out[32]: array([ True, False, False])
Another option is to build a dictionary mapping each column label of df2 to the values of the corresponding (positional) column of df, and run isin with it:
df2.isin({key: array.array for key, (_, array) in zip(df2, df.items())}).all(1)
Out[45]:
0     True
1    False
2    False
dtype: bool

There may be more efficient solutions, but you could append the two dataframes and call duplicated, e.g.:
df.append(df2).duplicated().iloc[df.shape[0]:]
This assumes that all rows in each DataFrame are distinct. Here are some benchmarks:
tmp1 = np.arange(0,12).reshape((4,3))
df1 = pd.DataFrame(data=tmp1, columns=["a", "b", "c"])
tmp2 = {'a':[3,100,101], 'b':[4,4,100], 'c':[5,100,3]}
df2 = pd.DataFrame(data=tmp2)
df1 = pd.concat([df1] * 10_000).reset_index()
df2 = pd.concat([df2] * 10_000).reset_index()
%timeit df1.append(df2).duplicated().iloc[df1.shape[0]:]
# 100 loops, best of 5: 4.16 ms per loop
%timeit pd.Series([m.all() for m in np.isin(df2.values,df1.values)])
# 10 loops, best of 5: 74.9 ms per loop
%timeit df2.apply(frozenset, axis=1).isin(df1.apply(frozenset, axis=1))
# 1 loop, best of 5: 443 ms per loop
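Note that DataFrame.append was removed in pandas 2.0; the same check translates directly to pd.concat (same logic, not re-benchmarked here):

import pandas as pd

# identical to df1.append(df2).duplicated().iloc[df1.shape[0]:],
# spelled with pd.concat for pandas >= 2.0
mask = pd.concat([df1, df2]).duplicated().iloc[df1.shape[0]:]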

Try comparing the rows as tuples:
df2.apply(tuple, 1).isin(df.apply(tuple, 1))
0     True
1    False
2    False
dtype: bool
The inverted form, df[~df.apply(tuple, 1).isin(df2.apply(tuple, 1))], instead returns the rows of df that have no match in df2.
Related

Reverse boolean column in python pandas [duplicate]

I have a pandas Series object containing boolean values. How can I get a series containing the logical NOT of each value?
For example, consider a series containing:
True
True
True
False
The series I'd like to get would contain:
False
False
False
True
This seems like it should be reasonably simple, but apparently I've misplaced my mojo =(
To invert a boolean Series, use ~s:
In [7]: s = pd.Series([True, True, False, True])
In [8]: ~s
Out[8]:
0 False
1 False
2 True
3 False
dtype: bool
Using Python2.7, NumPy 1.8.0, Pandas 0.13.1:
In [119]: s = pd.Series([True, True, False, True]*10000)
In [10]: %timeit np.invert(s)
10000 loops, best of 3: 91.8 µs per loop
In [11]: %timeit ~s
10000 loops, best of 3: 73.5 µs per loop
In [12]: %timeit (-s)
10000 loops, best of 3: 73.5 µs per loop
As of Pandas 0.13.0, Series are no longer subclasses of numpy.ndarray; they are now subclasses of pd.NDFrame. This might have something to do with why np.invert(s) is no longer as fast as ~s or -s.
Caveat: timeit results may vary depending on many factors including hardware, compiler, OS, Python, NumPy and Pandas versions.
@unutbu's answer is spot on; just a warning that your mask needs to be dtype bool, not object. That is, your mask can't ever have contained any NaNs: even if it is NaN-free now, its dtype will remain object.
Inverting an object series won't throw an error; instead you'll get a garbage mask of ints that won't work as you expect.
In[1]: df = pd.DataFrame({'A': [True, False, np.nan], 'B': [True, False, True]})
In[2]: df.dropna(inplace=True)
In[3]: df['A']
Out[3]:
0     True
1    False
Name: A, dtype: object
In[4]: ~df['A']
Out[4]:
0    -2
1    -1
Name: A, dtype: object
After speaking with colleagues about this one, I have an explanation: it looks like pandas falls back to Python's bitwise complement operator, applied element-wise:
In [1]: ~True
Out[1]: -2
As @geher says, you can convert it to bool with astype before you invert with ~ (note that the order of the two operations matters):
~df['A'].astype(bool)
0 False
1 True
Name: A, dtype: bool
(~df['A']).astype(bool)
0 True
1 True
Name: A, dtype: bool
I just gave it a shot:
In [9]: s = pd.Series([True, True, True, False])
In [10]: s
Out[10]:
0 True
1 True
2 True
3 False
In [11]: -s
Out[11]:
0    False
1    False
2    False
3     True
(Note: in recent pandas and NumPy versions, unary minus on a boolean Series raises a TypeError, so prefer ~s.)
You can also use numpy.invert:
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: s = pd.Series([True, True, False, True])
In [4]: np.invert(s)
Out[4]:
0 False
1 False
2 True
3 False
EDIT: The difference in performance appears on Ubuntu 12.04, Python 2.7, NumPy 1.7.0 - doesn't seem to exist using NumPy 1.6.2 though:
In [5]: %timeit (-s)
10000 loops, best of 3: 26.8 us per loop
In [6]: %timeit np.invert(s)
100000 loops, best of 3: 7.85 us per loop
In [7]: %timeit ~s
10000 loops, best of 3: 27.3 us per loop
In support of the excellent answers here, and for future convenience: there may be a case where you want to flip the truth values in a column while leaving other values (NaN, for instance) unchanged.
In[1]: series = pd.Series([True, np.nan, False, np.nan])
In[2]: series = series[series.notna()] #remove nan values
In[3]: series # without nan
Out[3]:
0 True
2 False
dtype: object
# Out[4] expected to be inverse of Out[3], pandas applies bitwise complement
# operator instead as in `lambda x : (-1*x)-1`
In[4]: ~series
Out[4]:
0 -2
2 -1
dtype: object
As a simple non-vectorized solution you can (1) check the type and (2) invert only the bools:
In[1]: series = pd.Series([True, np.nan, False, np.nan])
In[2]: series.apply(lambda x: not x if isinstance(x, bool) else x)
Out[2]:
0    False
1      NaN
2     True
3      NaN
dtype: object
NumPy is slower because it casts the input to boolean values (so None and 0 become False and everything else becomes True).
import pandas as pd
import numpy as np
s = pd.Series([True, None, False, True])
np.logical_not(s)
gives you
0 False
1 True
2 True
3 False
dtype: object
whereas ~s would crash on the None. In most cases that makes the tilde a safer choice than NumPy: it fails loudly instead of silently coercing missing values.
Pandas 0.25, NumPy 1.17
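A minimal sketch contrasting the two behaviours on an object-dtype Series containing None:

import numpy as np
import pandas as pd

s = pd.Series([True, None, False, True])  # object dtype because of the None

print(np.logical_not(s))  # None is truth-tested and silently flipped to True
try:
    print(~s)
except TypeError as err:  # ~None is undefined, so the inversion fails loudly
    print('~s raised TypeError:', err)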

Numpy/Pandas clean way to check if a specific value is NaN

How can I check if a given value is NaN?
e.g. if (a == np.NaN) (doesn't work)
Please note that:
Numpy's isnan method throws errors with data types like string
Pandas docs only provide methods to drop rows containing NaNs, or ways to check if/when DataFrame contains NaNs. I'm asking about checking if a specific value is NaN.
Relevant Stackoverflow questions and Google search results seem to be about checking "if any value is NaN" or "which values in a DataFrame"
There must be a clean way to check if a given value is NaN?
You can use the innate property that NaN != NaN: a == a will return False if a is NaN.
This will work even for strings
Example:
In[52]:
s = pd.Series([1, np.NaN, '', 1.0])
s
Out[52]:
0 1
1 NaN
2
3 1
dtype: object
for val in s:
    print(val == val)
True
False
True
True
This can be done in a vectorised manner:
In[54]:
s==s
Out[54]:
0 True
1 False
2 True
3 True
dtype: bool
but you can still use the method isnull on the whole series:
In[55]:
s.isnull()
Out[55]:
0 False
1 True
2 False
3 False
dtype: bool
UPDATE
As noted by @piRSquared: None == None returns True, yet pd.isnull(None) also returns True. So depending on whether you want to treat None as missing, you can use == (which flags only NaN, the one value not equal to itself) or pd.isnull (which treats None as NaN too).
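Wrapped up as a tiny helper (a sketch; the name is_nan is my own), the trick and the None caveat look like this:

import numpy as np
import pandas as pd

def is_nan(value):
    """True only for a real NaN: NaN is the one value not equal to itself."""
    return value != value

print(is_nan(np.nan))   # True
print(is_nan('abc'))    # False, and no TypeError for strings
print(is_nan(None))     # False -- None == None holds
print(pd.isnull(None))  # True  -- use this if None should count as missing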
Pandas has isnull, notnull, isna, and notna
These functions work for arrays or scalars.
Setup
a = np.array([[1, np.nan],
              [None, '2']])

Pandas functions

pd.isna(a)
# same as
# pd.isnull(a)
array([[False,  True],
       [ True, False]])

pd.notnull(a)
# same as
# pd.notna(a)
array([[ True, False],
       [False,  True]])
DataFrame (or Series) methods
b = pd.DataFrame(a)

b.isnull()
# same as
# b.isna()
       0      1
0  False   True
1   True  False

b.notna()
# same as
# b.notnull()
       0      1
0   True  False
1  False   True
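For the scalar case the question asks about, pd.isna can be called directly and, unlike np.isnan, does not raise on strings:

import numpy as np
import pandas as pd

print(pd.isna(np.nan))  # True
print(pd.isna(None))    # True
print(pd.isna('abc'))   # False -- np.isnan('abc') would raise a TypeError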

Pandas not dropping columns

Hi, I've tried to drop columns based on a boolean array, but for some odd reason pandas does not seem to be dropping the columns at all.
The boolean array has shape (376,) and only contains True and False values.
for x in range(0, len(analysis) - 1):
    if analysis[x] == False:
        col = dtest.columns[x]
        dtest.drop(dtest.columns[x], 1)
This is my code for dropping the columns; the length of the analysis array equals the number of columns in dtest.
dtest has shape (4209, 376) and is a pandas.core.frame.DataFrame.
I have tried debugging: it does detect the Falses in analysis and can print out the col variable accurately, but it just won't drop the columns for some reason.
Would greatly appreciate any help! Thanks :)
IIUC you don't need a loop:
dtest = dtest.loc[:, analysis]
Demo:
In [320]: df = pd.DataFrame(np.random.rand(5, 10), columns=list(range(1, 11)))
In [321]: df
Out[321]:
1 2 3 4 5 6 7 8 9 10
0 0.332792 0.927047 0.899874 0.294391 0.762800 0.861521 0.988783 0.475127 0.033096 0.980141
1 0.447273 0.268828 0.951633 0.947425 0.020006 0.808608 0.607091 0.712309 0.383256 0.248582
2 0.169946 0.951702 0.671014 0.514326 0.607129 0.227021 0.831474 0.696117 0.799418 0.224851
3 0.724165 0.748455 0.452430 0.941572 0.873344 0.877872 0.925788 0.183115 0.113217 0.072717
4 0.303488 0.426459 0.750076 0.225662 0.298983 0.729585 0.692489 0.934778 0.124634 0.274208
In [322]: analysis = np.random.choice([True, False], 10)
In [323]: analysis
Out[323]: array([ True, True, True, False, True, True, True, False, False, True], dtype=bool)
In [324]: df = df.loc[:, analysis]
In [325]: df
Out[325]:
1 2 3 5 6 7 10
0 0.332792 0.927047 0.899874 0.762800 0.861521 0.988783 0.980141
1 0.447273 0.268828 0.951633 0.020006 0.808608 0.607091 0.248582
2 0.169946 0.951702 0.671014 0.607129 0.227021 0.831474 0.224851
3 0.724165 0.748455 0.452430 0.873344 0.877872 0.925788 0.072717
4 0.303488 0.426459 0.750076 0.298983 0.729585 0.692489 0.274208
You need to assign the output back:
cols = []
for x in range(0, len(analysis)):
    if analysis[x] == False:
        col = dtest.columns[x]
        cols.append(col)

dtest = dtest.drop(cols, axis=1)
print(dtest)
   0  2
0  1  3
But better is to select only the columns flagged True in the mask, as in the other answer.
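If you do want drop, selecting by mask and dropping the inverted mask are equivalent; a self-contained sketch (dtest and analysis invented for the demo):

import numpy as np
import pandas as pd

dtest = pd.DataFrame(np.arange(12).reshape(3, 4), columns=list('abcd'))
analysis = np.array([True, False, True, False])

kept = dtest.loc[:, analysis]                           # keep columns where the mask is True
dropped = dtest.drop(columns=dtest.columns[~analysis])  # drop columns where the mask is False
assert kept.equals(dropped)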

How to convert keywords in a dataframe cell to their own columns

I have a dataframe like the following:
In[8]: df = pd.DataFrame({'transport': ['Car;Bike;Horse','Car','Car;Bike', 'Horse;Car']})
df
Out[8]:
transport
0 Car;Bike;Horse
1 Car
2 Car;Bike
3 Horse;Car
And I want to convert it to something like this:
In[9]: df2 = pd.DataFrame({'transport_car': [True,True,True,True],'transport_bike': [True,False,True,False], 'transport_horse': [True,False,False,True]} )
df2
Out[10]:
   transport_bike  transport_car  transport_horse
0            True           True             True
1           False           True            False
2            True           True            False
3           False           True             True
I have a solution, but it feels very 'hacked' and 'unpythonic' (it does work for my fairly small data set):
In[11]:
# get the set of all possible values
new_columns = set()
for element in set(df.transport.unique()):
    for transkey in str(element).split(';'):
        new_columns.add(transkey)
print(new_columns)

# use broadcasting to initialize all columns with a default value
for col in new_columns:
    df['trans_' + str(col).lower()] = False

# set cells according to the keywords
for index, row in df.iterrows():
    for key in new_columns:
        if key in row.transport:
            # df.at sets a single cell (modern replacement for the removed set_value)
            df.at[index, 'trans_' + str(key).lower()] = True
df
Out[11]:
        transport  trans_bike  trans_car  trans_horse
0  Car;Bike;Horse        True       True         True
1             Car       False       True        False
2        Car;Bike        True       True        False
3       Horse;Car       False       True         True
My goal is to use the second representation to perform some evaluation to answer questions like: "How often is car used?", "How often is car used together with horse", etc.
Related answers suggest that pivot and eval might be the way to go, but I'm not sure.
So what would be the best way, to convert a DataFrame from first representation to the second?
You can use apply to construct a Series for each entry, with the split fields as its index; the result is a DataFrame whose columns come from that index:
df.transport.apply(lambda x: pd.Series(True, x.split(";"))).fillna(False)
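For completeness, pandas has a built-in for exactly this split-into-indicators pattern; a short sketch using Series.str.get_dummies (the column prefix is added manually here):

import pandas as pd

df = pd.DataFrame({'transport': ['Car;Bike;Horse', 'Car', 'Car;Bike', 'Horse;Car']})

dummies = df['transport'].str.get_dummies(sep=';').astype(bool)
dummies.columns = ['transport_' + c.lower() for c in dummies.columns]
print(dummies)  # transport_bike, transport_car, transport_horse columns of booleans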
I decided to extend the great @Metropolis's answer with a working example:
In [249]: %paste
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(min_df=1)
X = vectorizer.fit_transform(df.transport.str.replace(';',' '))
r = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
## -- End pasted text --
In [250]: r
Out[250]:
   bike  car  horse
0     1    1      1
1     0    1      0
2     1    1      0
3     0    1      1
now you can join it back to the source DF:
In [251]: df.join(r)
Out[251]:
        transport  bike  car  horse
0  Car;Bike;Horse     1    1      1
1             Car     0    1      0
2        Car;Bike     1    1      0
3       Horse;Car     0    1      1
Timing: for 40K rows DF:
In [254]: df = pd.concat([df] * 10**4, ignore_index=True)
In [255]: df.shape
Out[255]: (40000, 1)
In [256]: %timeit df.transport.apply(lambda x: pd.Series(True, x.split(";"))).fillna(False)
1 loop, best of 3: 33.8 s per loop
In [257]: %%timeit
...: vectorizer = CountVectorizer(min_df=1)
...: X = vectorizer.fit_transform(df.transport.str.replace(';',' '))
...: r = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
...:
1 loop, best of 3: 732 ms per loop
I would consider using the Count Vectorizer provided by Scikit-learn. The vectorizer will construct a vector where each index refers to a term and the value refers to the number of appearances of that term in the record.
Advantages over the home-rolled approaches suggested in the other answers are efficiency on large datasets and generalizability. The disadvantage, obviously, is bringing in an extra dependency.

Pandas boolean algebra: True if True in both columns

I would like to make a boolean vector that is created by the comparison of two input boolean vectors. I can use a for loop, but is there a better way to do this?
My ideal solution would look like this:
df['A'] = [True, False, False, True]
df['B'] = [True, False, False, False]
C = ((df['A']==True) or (df['B']==True)).as_matrix()
print C
>>> True, False, False, True
I think this is what you are looking for:
C = (df['A']) | (df['B'])
C
0 True
1 False
2 False
3 True
dtype: bool
You could then leave this as a series or convert it to a list or array
Alternatively you could use the any method with axis=1 to check along each row; it works for any number of columns holding True values:
In [1105]: df
Out[1105]:
       B      A
0   True   True
1  False  False
2  False  False
3  False   True
In [1106]: df.any(axis=1)
Out[1106]:
0 True
1 False
2 False
3 True
dtype: bool
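The example above is an OR; if you literally want "True if True in both columns", as in the title, the element-wise AND follows the same pattern:

import pandas as pd

df = pd.DataFrame({'A': [True, False, False, True],
                   'B': [True, False, False, False]})

print(df['A'] | df['B'])  # element-wise OR, as in the accepted answer
print(df['A'] & df['B'])  # element-wise AND: True only where both are True
print(df.all(axis=1))     # AND across any number of boolean columns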
