Compare one row of a dataframe with rows of other dataframe? - python

I have two dataframes say df and thresh_df. The shape of df is say 1000*200 and thresh_df is 1*200.
I need to compare the thresh_df row with each row of df element wise respectively and I have to fetch the corresponding column number whose values are less than the values of thresh_df.
I tried the following
compared_df = df.apply(lambda x : np.where(x < thresh_df.values))
But I get an empty dataframe! If question is unclear and need any explanations,please let me know in the comments.

I think apply is not necessary only compare one row DataFrame converted to Series by selecting first row:
df = pd.DataFrame({
'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
})
thresh_df = pd.DataFrame({
'B':[4],
'C':[7],
'D':[4],
'E':[5],
})
compared_df = df < thresh_df.iloc[0]
print (compared_df)
B C D E
0 False False True False
1 False False True True
2 False False False False
3 False True False False
4 False True True True
5 False True True True
Then use DataFrame.any for filter at least one True per row and filter index values:
idx = df.index[compared_df.any(axis=1)]
print (idx)
Int64Index([0, 1, 3, 4, 5], dtype='int64')
Detail:
print (compared_df.any(axis=1))
0 True
1 True
2 False
3 True
4 True
5 True
dtype: bool

Related

how to count and change value in dataframe by row

I want to count the number of times a value False appears in my dataframe and get the number of how many times False appears in a row.
So here is how my table should look initially:
A
B
C
D
count
First
row
True
False
0
Second
row
False
True
0
Third
row
True
True
0
Fourth
row
False
False
0
This is how it should look:
A
B
C
D
count
First
row
True
False
1
Second
row
False
True
1
Third
row
True
True
0
Fourth
row
False
False
2
This is my code, I have tried to count for at least one column to begin with something, but it does not change the value in count column.
import pandas as pd
data = {'A': ['One', None, 'One', None], 'B': [None, 'Two', None, 'Two'], 'C': [True, False, True, False],
'D': [False, True, True, False], 'count': [0, 0, 0, 0]}
df = pd.DataFrame(data)
for index, row in df.iterrows():
if row['C'] is False:
row['count'] += 1
print(df.head(4))
IIUC, you want to count the False (boolean) values per row?
You can subset the boolean columns with select_dtypes, then invert the boolean value with ~ (so that False becomes True and is equivalent to 1), then sum per row:
df['count'] = (~df.select_dtypes('boolean')).sum(axis=1)
output:
A B C D count
0 One None True False 1
1 None Two False True 1
2 One None True True 0
3 None Two False False 2
Select columns 'C' and 'D', flip/invert the booleans (~) and then sum across both columns:
df['count'] = (~df[['C', 'D']]).sum(axis='columns')

How can I insert a column of a dataframe in pandas as a list of a cell into another dataframe?

I have several of dataframes (df, tmp_df and sub_df) and I want to enter a column of tmp_df into a cell of sub_df as a list. My code and dataframes are shown as below. But the loop part is not working correctly:
import pandas as pd
df = pd.read_csv('myfile.csv')
tmp_df = pd.DataFrame()
sub_df = pd.DataFrame()
tmp_df = df[df['Type'] == True]
for c in tmp_df['Category']:
sub_df['Data'] , sub_df ['Category'], sub_df['Type'] = [list(set(tmp_df['Data']))],
tmp_df['Category'], tmp_df['Type']
df:
Data
Category
Type
30275
A
True
35881
C
False
28129
C
True
30274
D
False
30351
D
True
35886
A
True
39900
C
True
35887
A
False
35883
A
True
35856
D
True
35986
C
False
30350
D
False
28129
C
True
31571
C
True
tmp_df:
Data
Category
Type
30275
A
True
28129
C
True
30351
D
True
35886
A
True
39900
C
True
35883
A
True
35856
D
True
28129
C
True
31571
C
True
What should I do if I want the following result?
sub_df:
Data
Category
Type
[30275,35886,35883]
A
True
[28129,39900,28129,31571]
C
True
[30351,35856]
D
True
you can select the rows withquery, then groupby+agg:
(df.query('Type') # or 'Type == "True"' if strings
.groupby('Category', as_index=False)
.agg({'Data': list, 'Type': 'first'})
)
output:
Category Data Type
0 A [30275, 35886, 35883] True
1 C [28129, 39900, 28129, 31571] True
2 D [30351, 35856] True

check columns in DataFrame for constant values explanation

I want to check a big DataFrame for constant columns and make a 2 list. The first for the columnnames with only zeros the second with the columnnames of constant values (excluding 0)
I found a solution (A in code) at Link but I dont understand it. A is making what i want but i dont know how and how i can get the list.
import numpy as np
import pandas as pd
data = [[0,1,1],[0,1,2],[0,1,3]]
df = pd.DataFrame(data, columns=['A', 'B', 'C'])
A =df.loc[:, (df != df.iloc[0]).any()]
Use:
m1 = (df == 0).all()
m2 = (df == df.iloc[0]).all()
a = df.columns[m1].tolist()
b = df.columns[~m1 & m2].tolist()
print (a)
['A']
print (b)
['B']
Explanation:
First compare all values by 0:
print (df == 0)
A B C
0 True False False
1 True False False
2 True False False
Then test if all values are Trues by DataFrame.all:
print ((df == 0).all())
A True
B False
C False
dtype: bool
Then compare first values of row by DataFrame.iloc:
print (df == df.iloc[0])
A B C
0 True True True
1 True True False
2 True True False
And test again by all:
print ((df == df.iloc[0]).all())
A True
B True
C False
dtype: bool
because exclude 0 chain inverted first mask by ~ with & for bitwise AND:
print (~m1 & m2)
A False
B True
C False
dtype: bool
This seems like a clean way to do what you want:
m1 = df.eq(0).all()
m2 = df.nunique().eq(1) & ~m1
m1[m1].index, m2[m2].index
# (Index(['A'], dtype='object'), Index(['B'], dtype='object'))
m1 gives you a boolean of columns that all have zeros:
m1
A True
B False
C False
dtype: bool
m2 gives you all columns with unique values, but not zeros (second condition re-uses the first)
m2
A False
B True
C False
dtype: bool
Deriving your lists is trivial from these masks.

Check if two rows in pandas DataFrame has same set of values regard & regardless of column order

I have two dataframe with same index but different column names. Number of columns are the same. I want to check, index by index, 1) whether they have same set of values regardless of column order, and 2) whether they have same set of values regarding column order.
ind = ['aaa', 'bbb', 'ccc']
df1 = pd.DataFrame({'old1': ['A','A','A'], 'old2': ['B','B','B'], 'old3': ['C','C','C']}, index=ind)
df2 = pd.DataFrame({'new1': ['A','A','A'], 'new2': ['B','C','B'], 'new3': ['C','B','D']}, index=ind)
This is the output I need.
OpX OpY
-------------
aaa True True
bbb False True
ccc False False
Could anyone help me with OpX and OpY?
Using tuple and set: keep the order or tuple , and reorder with set
s1=df1.apply(tuple,1)==df2.apply(tuple,1)
s2=df1.apply(set,1)==df2.apply(set,1)
pd.concat([s1,s2],1)
Out[746]:
0 1
aaa True True
bbb False True
ccc False False
Since cs95 mentioned apply have problem here
s=np.equal(df1.values,df2.values).all(1)
t=np.equal(np.sort(df1.values,1),np.sort(df2.values,1)).all(1)
pd.DataFrame(np.column_stack([s,t]),index=df1.index)
Out[754]:
0 1
aaa True True
bbb False True
ccc False False
Here's a solution that is performant and should scale. First, align the DataFrames on the index so you can compare them easily.
df3 = df2.set_axis(df1.columns, axis=1, inplace=False)
df4, df5 = df1.align(df3)
For req 1, simply call DataFrame.equals (or just use the == op):
u = (df4 == df5).all(axis=1)
u
aaa True
bbb False
ccc False
dtype: bool
Req 2 is slightly more complex, sort them along the first axis, then compare.
v = pd.Series((np.sort(df4) == np.sort(df5)).all(axis=1), index=u.index)
v
aaa True
bbb True
ccc False
dtype: bool
Concatenate the results,
pd.concat([u, v], axis=1, keys=['X', 'Y'])
X Y
aaa True True
bbb False True
ccc False False
For item 2):
(df1.values == df2.values).all(axis=1)
This checks element-wise equality of the dataframes, and gives True when all entries in a row are equal.
For item 1), sort the values along each row first:
import numpy as np
(np.sort(df1.values, axis=1) == np.sort(df2.values, axis=1)).all(axis=1)
Construct a new DataFrame and check the equality:
df3 = pd.DataFrame(index=ind)
df3['OpX'] = (df1.values == df2.values).all(1)
df3['OpY'] = (df1.apply(np.sort, axis=1).values == df2.apply(np.sort, axis=1).values).all(1)
print(df3)
Output:
OpX OpY
aaa True True
bbb False True
ccc False False

matching of columns between two pandas dataframe

import pandas as pd
temp1 = pd.DataFrame(index=arange(10), columns=['a','b'])
temp1['a'] = [1,2,2,3,3,4,4,4,9,11]
temp1['b'] = 'B'
temp2 = pd.DataFrame(index=arange(10), columns=['a','b'])
temp2['a'] = [1,2,3,4,5,6,7,8,9,10]
temp2['b'] = 'B'
As the script above, I want to pickup rows from temp1 that column a was not seen at temp2. I can use %in% in R to do it easily, how can I do it in pandas?
update 01
the output should be one row which column a is 11 and column b is B
You can use isin to perform boolean indexing:
isin will produce a boolean index:
In [95]:
temp1.a.isin(temp2.a)
Out[95]:
0 True
1 True
2 True
3 True
4 True
5 True
6 True
7 True
8 True
9 False
Name: a, dtype: bool
This can then be used as a mask in the final output:
In [94]:
# note the ~ this negates the result so equivalent of NOT
temp1[~temp1.a.isin(temp2.a)]
Out[94]:
a b
9 11 B
You can use isin to get the indices that are seen, and then negate the boolean indices:
temp1[~temp1.a.isin(temp2.a)]

Categories

Resources