Pandas dataframes equality test - python

How do I write a function that checks whether two input dataframes are equal as long as the rows in both dataframes are equal, disregarding index positions and column order? I can't use df.equals(), since it also requires the data types to match, which is not what I need.
import pandas as pd
from io import StringIO

canonical_in_csv = """,c,a,b
2,hat,x,1
0,rat,y,4
3,cat,x,2
1,bat,x,2"""
with StringIO(canonical_in_csv) as fp:
    df1 = pd.read_csv(fp, index_col=0)

canonical_soln_csv = """,a,b,c
0,x,1,hat
1,x,2,bat
2,x,2,cat
3,y,4,rat"""
with StringIO(canonical_soln_csv) as fp:
    df2 = pd.read_csv(fp, index_col=0)
df1:
c a b
2 hat x 1
0 rat y 4
3 cat x 2
1 bat x 2
df2:
a b c
0 x 1 hat
1 x 2 bat
2 x 2 cat
3 y 4 rat
My attempt:
temp1 = (df1 == df2).all()
temp2 = temp1.all()
temp2
ValueError: Can only compare identically-labeled DataFrame objects

You can sort both frames by index and by column labels with sort_index first, then merge and compare with eq (==) or equals:
df11 = df1.sort_index().sort_index(axis=1)
df22 = df2.sort_index().sort_index(axis=1)
print (df11.merge(df22))
a b c
0 y 4 rat
1 x 2 bat
2 x 1 hat
3 x 2 cat
print (df11.merge(df22).eq(df11))
a b c
0 True True True
1 True True True
2 True True True
3 True True True
a = df11.merge(df22).eq(df11).values.all()
#alternative
#a = df11.merge(df22).equals(df11)
print (a)
True
Your function can then be rewritten as:
def checkequality(A, B):
    df11 = A.sort_index(axis=1)
    df11 = df11.sort_values(df11.columns.tolist()).reset_index(drop=True)
    df22 = B.sort_index(axis=1)
    df22 = df22.sort_values(df22.columns.tolist()).reset_index(drop=True)
    return (df11 == df22).values.all()
a = checkequality(df1, df2)
print (a)
True
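Note that this also satisfies the dtype requirement from the question: unlike equals, an elementwise == comparison ignores dtype differences. A minimal sketch:
dfa = pd.DataFrame({'x': [1, 2]})       # int64 column
dfb = pd.DataFrame({'x': [1.0, 2.0]})   # float64 column
print (dfa.equals(dfb))             # False - equals enforces matching dtypes
print ((dfa == dfb).values.all())   # True - values match elementwise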

Your request to disregard the row index is difficult to fulfil, as the DataFrame type is not optimized for that operation. The column-order issue, fortunately, is easy to handle:
df1.values == df2[df1.columns].values
where df2[df1.columns] syncs the column order and .values converts to NumPy arrays for comparison. I still recommend against row re-ordering and matching, as that can be very taxing for bigger datasets.
If matching rows by index is acceptable, this may be what you are looking for:
df1.values==df2.reindex(df1.index.values.tolist())[df1.columns].values
Update
As pointed out by @Dark, a cleaner, label-aligned comparison can be done like this:
df1.loc[df2.index,df2.columns] == df2
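To reduce the elementwise result to a single boolean (a small sketch, assuming both frames share the same index and column labels):
aligned = df1.loc[df2.index, df2.columns] == df2
print(aligned.all().all())  # True only if every aligned cell matches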

I figured it out:
def checkequality(A, B):
    var_names = sorted(A.columns)
    Y = A[var_names].copy()
    Y.sort_values(by=var_names, inplace=True)
    Y.set_index([list(range(0, len(Y)))], inplace=True)
    var_names2 = sorted(B.columns)
    Y2 = B[var_names2].copy()
    Y2.sort_values(by=var_names2, inplace=True)
    Y2.set_index([list(range(0, len(Y2)))], inplace=True)
    return (Y == Y2).all().all()
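A quick check (my addition) with the frames from the question:
print(checkequality(df1, df2))
True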

Related

Pandas: Determine if a string in one column is a substring of a string in another column

Consider these series:
>>> a = pd.Series('abc a abc c'.split())
>>> b = pd.Series('a abc abc a'.split())
>>> df = pd.concat((a, b), axis=1)
>>> df
0 1
0 abc a
1 a abc
2 abc abc
3 c a
>>> unknown_operation(a, b)
0 False
1 True
2 True
3 False
The desired logic is to determine if the string in the left column is a substring of the string in the right column. pd.Series.str.contains does not accept another Series, and pd.Series.isin checks if the value exists in the other series (not in the same row specifically). I'm interested to know if there's a vectorized solution (not using .apply or a loop), but it may be that there isn't one.
Let us try numpy.core.defchararray.find, which is vectorized:
from numpy.core.defchararray import find
find(df[1].values.astype(str), df[0].values.astype(str)) != -1
Out[740]: array([False, True, True, False])
IIUC,
df[1].str.split('', expand=True).eq(df[0], axis=0).any(axis=1) | df[1].eq(df[0])
Output:
0 False
1 True
2 True
3 False
dtype: bool
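One caveat (my observation, not from the answer): the character-split trick only detects single-character substrings or exact matches, so a longer partial match slips through:
s0 = pd.Series(['ab'])
s1 = pd.Series(['abc'])
print('ab' in 'abc')  # True
print((s1.str.split('', expand=True).eq(s0, axis=0).any(axis=1) | s1.eq(s0))[0])  # False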
I tested various functions with a randomly generated DataFrame of 1,000,000 five-letter entries.
Running on my machine, the averages of 3 runs were:
zip > v_find > to_list > any > apply
0.21s > 0.79s > 1s > 3.55s > 8.6s
Hence, I would recommend using zip:
[x[0] in x[1] for x in zip(df['A'], df['B'])]
or vectorized find (as proposed by BENY)
np.char.find(df['B'].values.astype(str), df['A'].values.astype(str)) != -1
My test setup:
import random
import string

import numpy as np
import pandas as pd

def generate_string(length):
    return ''.join(random.choices(string.ascii_uppercase + string.digits, k=length))

n = 1_000_000
A = [generate_string(5) for x in range(n)]
B = [generate_string(5) for y in range(n)]
df = pd.DataFrame({"A": A, "B": B})

to_list = pd.Series([a in b for a, b in df[['A', 'B']].values.tolist()])
apply = df.apply(lambda s: s["A"] in s["B"], axis=1)
v_find = np.char.find(df['B'].values.astype(str), df['A'].values.astype(str)) != -1
any = df["B"].str.split('', expand=True).eq(df["A"], axis=0).any(axis=1) | df["B"].eq(df["A"])
zip = [x[0] in x[1] for x in zip(df['A'], df['B'])]
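For reference, the quoted averages presumably came from something like the following timeit loop (a sketch, not the author's original harness; run it in a fresh session, since the names zip, any and apply above shadow builtins):
import timeit

def bench_zip():
    return [a in b for a, b in zip(df['A'], df['B'])]

def bench_find():
    return np.char.find(df['B'].values.astype(str), df['A'].values.astype(str)) != -1

for fn in (bench_zip, bench_find):
    print(fn.__name__, timeit.timeit(fn, number=3) / 3, 's per run')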

Multiplying values from a string column in Pandas

I have a column with land dimensions in Pandas. It looks like this:
df.LotSizeDimensions.value_counts(dropna=False)
40.00X150.00 2
57.00X130.00 2
27.00X117.00 2
63.00X135.00 2
37.00X108.00 2
65.00X134.00 2
57.00X116.00 2
33x124x67x31x20x118 1
55.00X160.00 1
63.00X126.00 1
36.00X105.50 1
In rows where there is only one X, I would like to create a separate column that multiplies the two values. In rows where there is more than one X, I would like to return a zero. This is the code I came up with:
def dimensions_split(df: pd.DataFrame):
    df.LotSizeDimensions = df.LotSizeDimensions.str.strip()
    df.LotSizeDimensions = df.LotSizeDimensions.str.upper()
    df.LotSizeDimensions = df.LotSizeDimensions.str.strip('`"M')
    if df.LotSizeDimensions.count('X') > 1:
        return 0
    df['LotSize'] = map(int(df.LotSizeDimensions.str.split("X", 1).str[0])*int(df.LotSizeDimensions.str.split("X", 1).str[1]))
This is coming back with the following error:
TypeError: cannot convert the series to <class 'int'>
I would also like to add a line so that, if there are any non-numeric characters other than X, it returns a zero.
The idea is to first strip and uppercase the LotSizeDimensions column, then use Series.str.split with expand=True to split it into a DataFrame of parts, and then multiply the parts where there is exactly one X, otherwise return 0:
s = df.LotSizeDimensions.str.strip('`"M ').str.upper()
df1 = s.str.split('X', expand=True).astype(float)
#general alternative if the data contain non-numeric values
#df1 = s.str.split('X', expand=True).apply(lambda x: pd.to_numeric(x, errors='coerce'))
df['LotSize'] = np.where(s.str.count('X').eq(1), df1[0] * df1[1], 0)
print (df)
LotSizeDimensions LotSize
0 40.00X150.00 6000.0
1 57.00X130.00 7410.0
2 27.00X117.00 3159.0
3 37.00X108.00 3996.0
4 63.00X135.00 8505.0
5 65.00X134.00 8710.0
6 57.00X116.00 6612.0
7 33x124x67x31x20x118 0.0
8 55.00X160.00 8800.0
9 63.00X126.00 7938.0
10 36.00X105.50 3798.0
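To also cover the follow-up request (zero whenever there are non-numeric characters other than X), one option is to validate the whole string with a regex first. A sketch, assuming pandas >= 1.1 for Series.str.fullmatch:
valid = s.str.fullmatch(r'\d+(?:\.\d+)?X\d+(?:\.\d+)?')
#coerce the split parts so bad values become NaN instead of raising
df1 = s.str.split('X', expand=True).apply(lambda x: pd.to_numeric(x, errors='coerce'))
df['LotSize'] = np.where(valid, df1[0] * df1[1], 0)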
I get this using a list comprehension:
import pandas as pd
df = pd.DataFrame(['40.00X150.00','57.00X130.00',
'27.00X117.00',
'37.00X108.00',
'63.00X135.00' ,
'65.00X134.00' ,
'57.00X116.00' ,
'33x124x67x31x20x118',
'55.00X160.00',
'63.00X126.00',
'36.00X105.50'])
df[1] = [float(p[0]) * float(p[1]) if len(p) == 2 else None
         for p in (str_data.strip().split("X") for str_data in df[0])]

Compare if two python tables are tibble equivalent

I want to write a function to compare whether 2 tables are tibble-equivalent (identical variables and observations). For example, the first two tables below are equivalent; the third one isn't.
a b c
x 1 hat
y 2 cat
z 3 bat
w 4 rat
b c a
2 cat y
3 bat z
1 hat x
4 rat w
a b c
2 y cat
3 z bat
1 x hat
4 w rat
I decided to solve this by comparing the max values. How do I properly refer to the first, second, etc. column and compare the max values for each one?
def equal(A, B):
    A_names = sorted(A.columns)
    X = A[A_names].copy()
    B_names = sorted(B.columns)
    Y = B[B_names].copy()
    if A[0].max() == B[0].max() and A[1].max() == B[1].max():
        return True
    else:
        return False
This raises an error: KeyError: 0
This task can be solved with the equals method of the DataFrame object and some preprocessing:
def compare_dataframes(df1, df2):
    df1_cols = df1.columns.tolist()
    df2_cols = df2.columns.tolist()
    # column names and shapes should be equal for both dataframes
    if set(df1_cols).symmetric_difference(set(df2_cols)) or (df1.shape != df2.shape):
        return False
    # sort both frames by the same column list so the row orders agree
    df1_sorted = df1.sort_values(by=df1_cols).reset_index(drop=True)
    df2_sorted = df2.sort_values(by=df1_cols).reset_index(drop=True)
    df2_sorted = df2_sorted[df1_sorted.columns]
    return df1_sorted.equals(df2_sorted)
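A quick check against the example tables (my reconstruction of the frames from the question):
t1 = pd.DataFrame({'a': list('xyzw'), 'b': [1, 2, 3, 4], 'c': ['hat', 'cat', 'bat', 'rat']})
t2 = pd.DataFrame({'b': [2, 3, 1, 4], 'c': ['cat', 'bat', 'hat', 'rat'], 'a': list('yzxw')})
t3 = pd.DataFrame({'a': [2, 3, 1, 4], 'b': list('yzxw'), 'c': ['cat', 'bat', 'hat', 'rat']})
print(compare_dataframes(t1, t2))  # True
print(compare_dataframes(t1, t3))  # False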
Another option along the same lines, as a standalone function:
def equal(A, B):
    A_var_names = sorted(A.columns)
    AA = A[A_var_names].copy()  # column order
    AA = AA.sort_values(by=A_var_names).reset_index(drop=True)  # value order
    B_var_names = sorted(B.columns)
    BB = B[B_var_names].copy()
    BB = BB.sort_values(by=B_var_names).reset_index(drop=True)
    return AA.equals(BB)

How to convert keyword in cell of dataframe to own column each

I have a dataframe like the following:
In[8]: df = pd.DataFrame({'transport': ['Car;Bike;Horse','Car','Car;Bike', 'Horse;Car']})
df
Out[8]:
transport
0 Car;Bike;Horse
1 Car
2 Car;Bike
3 Horse;Car
And I want to convert it to something like this:
In[9]: df2 = pd.DataFrame({'transport_car': [True,True,True,True],'transport_bike': [True,False,True,False], 'transport_horse': [True,False,False,True]} )
df2
Out[10]:
transport_bike transport_car transport_horse
0 True True True
1 False True False
2 True True False
3 False True True
I have a solution, but it feels very 'hacked' and 'unpythonic'. (It works for my fairly small data set.)
In[11]:
# get the set of all possible values
new_columns = set()
for element in set(df.transport.unique()):
    for transkey in str(element).split(';'):
        new_columns.add(transkey)
print(new_columns)
# use broadcasting to initialize all columns with a default value
for col in new_columns:
    df['trans_' + str(col).lower()] = False
# change cells according to the keywords
for index, row in df.iterrows():
    for key in new_columns:
        if key in row.transport:
            # set_value was removed in pandas 1.0; df.at[index, col] = True is the modern equivalent
            df.set_value(index, 'trans_' + str(key).lower(), True)
df
Out[11]:
transport trans_bike trans_car trans_horse
0 Car;Bike;Horse True True True
1 Car False True False
2 Car;Bike True True False
3 Horse;Car False True True
My goal is to use the second representation to perform some evaluation to answer questions like: "How often is car used?", "How often is car used together with horse", etc.
Some answers suggest that pivot and eval might be the way to go, but I'm not sure.
So what would be the best way, to convert a DataFrame from first representation to the second?
You can use apply and construct a Series for each entry with the split fields as the index. This results in a DataFrame with those fields as columns:
df.transport.apply(lambda x: pd.Series(True, x.split(";"))).fillna(False)
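A pandas-native alternative (my addition, not from the original answers) is Series.str.get_dummies, which does the split-and-indicator step in one call; the evaluation questions from the post then become simple aggregations:
d = df['transport'].str.get_dummies(sep=';').astype(bool).add_prefix('transport_')
print(d['transport_Car'].sum())                           # how often is Car used?
print((d['transport_Car'] & d['transport_Horse']).sum())  # Car together with Horse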
I decided to extend the great @Metropolis's answer with a working example:
In [249]: %paste
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(min_df=1)
X = vectorizer.fit_transform(df.transport.str.replace(';',' '))
r = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names())  # use get_feature_names_out() in scikit-learn >= 1.2
## -- End pasted text --
In [250]: r
Out[250]:
bike car horse
0 1 1 1
1 0 1 0
2 1 1 0
3 0 1 1
now you can join it back to the source DF:
In [251]: df.join(r)
Out[251]:
transport bike car horse
0 Car;Bike;Horse 1 1 1
1 Car 0 1 0
2 Car;Bike 1 1 0
3 Horse;Car 0 1 1
Timing, for a 40K-row DataFrame:
In [254]: df = pd.concat([df] * 10**4, ignore_index=True)
In [255]: df.shape
Out[255]: (40000, 1)
In [256]: %timeit df.transport.apply(lambda x: pd.Series(True, x.split(";"))).fillna(False)
1 loop, best of 3: 33.8 s per loop
In [257]: %%timeit
...: vectorizer = CountVectorizer(min_df=1)
...: X = vectorizer.fit_transform(df.transport.str.replace(';',' '))
...: r = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names())
...:
1 loop, best of 3: 732 ms per loop
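If the boolean layout from the question is preferred over raw counts, the count frame converts directly (a small follow-up sketch):
r.astype(bool).add_prefix('transport_')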
I would consider using the CountVectorizer provided by scikit-learn. The vectorizer constructs a vector where each index refers to a term and the value is the number of appearances of that term in the record.
Advantages over the home-rolled approaches suggested in the other answers are efficiency for large datasets and generalizability. The disadvantage is, obviously, bringing in an extra dependency.

Filter a pandas dataframe using values from a dict

I need to filter a data frame with a dict, constructed with the key being the column name and the value being the value I want to filter on:
filter_v = {'A':1, 'B':0, 'C':'This is right'}
# this would be the normal approach
df[(df['A'] == 1) & (df['B'] ==0)& (df['C'] == 'This is right')]
But I want to do something along the lines of:
for column, value in filter_v.items():
    df[df[column] == value]
but this filters the data frame one value at a time rather than applying all the filters at once. Is there a way to do it programmatically?
EDIT: an example:
df1 = pd.DataFrame({'A':[1,0,1,1, np.nan], 'B':[1,1,1,0,1], 'C':['right','right','wrong','right', 'right'],'D':[1,2,2,3,4]})
filter_v = {'A':1, 'B':0, 'C':'right'}
df1.loc[df1[filter_v.keys()].isin(filter_v.values()).all(axis=1), :]
gives
A B C D
0 1 1 right 1
1 0 1 right 2
3 1 0 right 3
but the expected result was
A B C D
3 1 0 right 3
only the last one should be selected.
IIUC, you should be able to do something like this:
>>> df1.loc[(df1[list(filter_v)] == pd.Series(filter_v)).all(axis=1)]
A B C D
3 1 0 right 3
This works by making a Series to compare against:
>>> pd.Series(filter_v)
A 1
B 0
C right
dtype: object
Selecting the corresponding part of df1:
>>> df1[list(filter_v)]
A C B
0 1 right 1
1 0 right 1
2 1 wrong 1
3 1 right 0
4 NaN right 1
Finding where they match:
>>> df1[list(filter_v)] == pd.Series(filter_v)
A B C
0 True False True
1 False False True
2 True False False
3 True True True
4 False False True
Finding where they all match:
>>> (df1[list(filter_v)] == pd.Series(filter_v)).all(axis=1)
0 False
1 False
2 False
3 True
4 False
dtype: bool
And finally using this to index into df1:
>>> df1.loc[(df1[list(filter_v)] == pd.Series(filter_v)).all(axis=1)]
A B C D
3 1 0 right 3
An abstraction of the above for the case of passing an array of filter values rather than a single value (analogous to pandas.core.series.Series.isin()). Using the same example:
df1 = pd.DataFrame({'A':[1,0,1,1, np.nan], 'B':[1,1,1,0,1], 'C':['right','right','wrong','right', 'right'],'D':[1,2,2,3,4]})
filter_v = {'A':[1], 'B':[1,0], 'C':['right']}
##Start with array of all True
ind = [True] * len(df1)
##Loop through filters, updating index
for col, vals in filter_v.items():
    ind = ind & (df1[col].isin(vals))
##Return filtered dataframe
df1[ind]
##Returns
A B C D
0 1.0 1 right 1
3 1.0 0 right 3
Here is a way to do it:
df.loc[df[filter_v.keys()].isin(filter_v.values()).all(axis=1), :]
UPDATE:
Since isin matches a value appearing in any listed column, you can instead compare each row against the dict values directly with a filtering function like this:
# Create your filtering function:
def filter_dict(df, dic):
    return df[df[list(dic)].apply(
        lambda x: x.equals(pd.Series(dic.values(), index=x.index, name=x.name)), axis=1)]
# Use it on your DataFrame:
filter_dict(df1, filter_v)
Which yields:
A B C D
3 1 0 right 3
If it something that you do frequently you could go as far as to patch DataFrame for an easy access to this filter:
pd.DataFrame.filter_dict_ = filter_dict
And then use this filter like this:
df1.filter_dict_(filter_v)
Which would yield the same result.
BUT, it is not the right way to do it, clearly.
I would use DSM's approach.
For Python 2, @primer's answer is fine. But you should be careful in Python 3 because of dict_keys. For instance,
>> df.loc[df[filter_v.keys()].isin(filter_v.values()).all(axis=1), :]
>> TypeError: unhashable type: 'dict_keys'
The correct way in Python 3:
df.loc[df[list(filter_v.keys())].isin(list(filter_v.values())).all(axis=1), :]
Here's another way:
filterSeries = pd.Series(np.ones(df.shape[0], dtype=bool))
for column, value in filter_v.items():
    filterSeries = ((df[column] == value) & filterSeries)
This gives:
>>> df[filterSeries]
A B C D
3 1 0 right 3
To follow up on DSM's answer, you can also use any() to turn your query into an OR operation (instead of AND):
df1.loc[(df1[list(filter_v)] == pd.Series(filter_v)).any(axis=1)]
You can also create a query
query_string = ' and '.join(
    f'({key} == "{val}")' if isinstance(val, str) else f'({key} == {val})'
    for key, val in filter_v.items()
)
df1.query(query_string)
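One caveat (my note): query parses the keys as Python identifiers, so column names containing spaces or punctuation need backtick quoting, which pandas supports since 0.25:
query_string = ' and '.join(
    f'(`{key}` == "{val}")' if isinstance(val, str) else f'(`{key}` == {val})'
    for key, val in filter_v.items()
)
df1.query(query_string)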
Combining previous answers, here's a function you can feed to df1.loc. It allows AND/OR (via how='all'/'any'), plus comparisons other than == via the op keyword, if desired.
import operator

def quick_mask(df, filters, how='all', op=operator.eq) -> pd.Series:
    if how == 'all':
        comb = pd.DataFrame.all
    elif how == 'any':
        comb = pd.DataFrame.any
    return comb(op(df[[*filters]], pd.Series(filters)), axis=1)
# Usage
df1.loc[quick_mask(df1, filter_v)]
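A hypothetical usage with a different comparison, keeping rows where each listed column is at least its threshold:
thresholds = {'A': 1, 'D': 3}  # hypothetical cutoffs
df1.loc[quick_mask(df1, thresholds, op=operator.ge)]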
I had an issue because my dictionary holds multiple allowed values per key, e.g. filter_v = {'A': [1], 'B': [1, 0], 'C': ['right']}. I was able to change DSM's query to:
df1.loc[df1[list(filter_v)].isin(filter_v).all(axis=1), :]
