Check columns in a DataFrame for constant values (explanation) - python

I want to check a big DataFrame for constant columns and build two lists: the first with the names of columns containing only zeros, the second with the names of columns holding any other constant value (excluding 0).
I found a solution (A in the code below) at Link, but I don't understand it. A produces what I want, but I don't know how it works or how I can get the lists from it.
import numpy as np
import pandas as pd
data = [[0,1,1],[0,1,2],[0,1,3]]
df = pd.DataFrame(data, columns=['A', 'B', 'C'])
A = df.loc[:, (df != df.iloc[0]).any()]

Use:
m1 = (df == 0).all()
m2 = (df == df.iloc[0]).all()
a = df.columns[m1].tolist()
b = df.columns[~m1 & m2].tolist()
print (a)
['A']
print (b)
['B']
Explanation:
First, compare all values to 0:
print (df == 0)
A B C
0 True False False
1 True False False
2 True False False
Then test whether all values per column are True with DataFrame.all:
print ((df == 0).all())
A True
B False
C False
dtype: bool
Then compare every row with the first row, selected by DataFrame.iloc:
print (df == df.iloc[0])
A B C
0 True True True
1 True True False
2 True True False
And test again with all:
print ((df == df.iloc[0]).all())
A True
B True
C False
dtype: bool
Because the all-zero columns must be excluded, chain the inverted first mask (~m1) with the second mask, using & for bitwise AND:
print (~m1 & m2)
A False
B True
C False
dtype: bool

This seems like a clean way to do what you want:
m1 = df.eq(0).all()
m2 = df.nunique().eq(1) & ~m1
m1[m1].index, m2[m2].index
# (Index(['A'], dtype='object'), Index(['B'], dtype='object'))
m1 gives you a boolean mask of the columns that contain only zeros:
m1
A True
B False
C False
dtype: bool
m2 gives you the columns with a single unique value, excluding the all-zero ones (the second condition re-uses the first mask):
m2
A False
B True
C False
dtype: bool
Deriving your lists is trivial from these masks.
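For example, to turn the masks into the two lists (zero_cols and const_cols are my own names, just for illustration):
zero_cols = m1[m1].index.tolist()     # columns containing only zeros
const_cols = m2[m2].index.tolist()    # constant columns, zeros excluded
print(zero_cols)   # ['A']
print(const_cols)  # ['B']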

Related

How can I insert a column of a pandas dataframe into a cell of another dataframe as a list?

I have several dataframes (df, tmp_df and sub_df), and I want to enter a column of tmp_df into a cell of sub_df as a list. My code and dataframes are shown below, but the loop part is not working correctly:
import pandas as pd
df = pd.read_csv('myfile.csv')
tmp_df = pd.DataFrame()
sub_df = pd.DataFrame()
tmp_df = df[df['Type'] == True]
for c in tmp_df['Category']:
    sub_df['Data'], sub_df['Category'], sub_df['Type'] = \
        [list(set(tmp_df['Data']))], tmp_df['Category'], tmp_df['Type']
df:
    Data Category   Type
   30275        A   True
   35881        C  False
   28129        C   True
   30274        D  False
   30351        D   True
   35886        A   True
   39900        C   True
   35887        A  False
   35883        A   True
   35856        D   True
   35986        C  False
   30350        D  False
   28129        C   True
   31571        C   True
tmp_df:
    Data Category  Type
   30275        A  True
   28129        C  True
   30351        D  True
   35886        A  True
   39900        C  True
   35883        A  True
   35856        D  True
   28129        C  True
   31571        C  True
What should I do if I want the following result?
sub_df:
                           Data Category  Type
          [30275,35886,35883]          A  True
    [28129,39900,28129,31571]          C  True
                [30351,35856]          D  True
You can select the rows with query, then groupby + agg:
(df.query('Type')   # or 'Type == "True"' if the column holds strings
   .groupby('Category', as_index=False)
   .agg({'Data': list, 'Type': 'first'})
)
output:
Category Data Type
0 A [30275, 35886, 35883] True
1 C [28129, 39900, 28129, 31571] True
2 D [30351, 35856] True
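If you also want to drop duplicates inside each list (the original attempt used set, although the expected output above keeps the duplicate 28129), a possible variant as a sketch:
(df.query('Type')
   .groupby('Category', as_index=False)
   .agg({'Data': lambda s: sorted(set(s)), 'Type': 'first'})
)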

Compare one row of a dataframe with the rows of another dataframe?

I have two dataframes, say df and thresh_df. The shape of df is say 1000*200 and thresh_df is 1*200.
I need to compare the thresh_df row with each row of df element-wise and fetch the corresponding column numbers whose values are less than the values of thresh_df.
I tried the following:
compared_df = df.apply(lambda x : np.where(x < thresh_df.values))
But I get an empty dataframe! If the question is unclear and needs any explanation, please let me know in the comments.
I think apply is not necessary; just compare with the one-row DataFrame converted to a Series by selecting its first row:
import pandas as pd

df = pd.DataFrame({
    'B': [4, 5, 4, 5, 5, 4],
    'C': [7, 8, 9, 4, 2, 3],
    'D': [1, 3, 5, 7, 1, 0],
    'E': [5, 3, 6, 9, 2, 4],
})
thresh_df = pd.DataFrame({
    'B': [4],
    'C': [7],
    'D': [4],
    'E': [5],
})
compared_df = df < thresh_df.iloc[0]
print (compared_df)
B C D E
0 False False True False
1 False False True True
2 False False False False
3 False True False False
4 False True True True
5 False True True True
Then use DataFrame.any to keep the rows with at least one True and filter the index values:
idx = df.index[compared_df.any(axis=1)]
print (idx)
Int64Index([0, 1, 3, 4, 5], dtype='int64')
Detail:
print (compared_df.any(axis=1))
0 True
1 True
2 False
3 True
4 True
5 True
dtype: bool
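If you also need the matching column positions per row, as the question mentions "column number", one possible sketch (positions is my own name):
import numpy as np

# positional indices of the columns that fall below the threshold, per row
positions = [np.where(row)[0].tolist() for _, row in compared_df.iterrows()]
print(positions)
# [[2], [2, 3], [], [1], [1, 2, 3], [1, 2, 3]]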

Check if two rows in pandas DataFrames have the same set of values, with and without regard to column order

I have two dataframes with the same index but different column names. The number of columns is the same. I want to check, index by index, 1) whether they have the same set of values regardless of column order, and 2) whether they have the same set of values respecting column order.
ind = ['aaa', 'bbb', 'ccc']
df1 = pd.DataFrame({'old1': ['A','A','A'], 'old2': ['B','B','B'], 'old3': ['C','C','C']}, index=ind)
df2 = pd.DataFrame({'new1': ['A','A','A'], 'new2': ['B','C','B'], 'new3': ['C','B','D']}, index=ind)
This is the output I need.
OpX OpY
-------------
aaa True True
bbb False True
ccc False False
Could anyone help me with OpX and OpY?
Using tuple and set: tuple keeps the order, set disregards it.
s1 = df1.apply(tuple, axis=1) == df2.apply(tuple, axis=1)
s2 = df1.apply(set, axis=1) == df2.apply(set, axis=1)
pd.concat([s1, s2], axis=1)
Out[746]:
0 1
aaa True True
bbb False True
ccc False False
Since cs95 mentioned that apply can be problematic here, a NumPy alternative:
s=np.equal(df1.values,df2.values).all(1)
t=np.equal(np.sort(df1.values,1),np.sort(df2.values,1)).all(1)
pd.DataFrame(np.column_stack([s,t]),index=df1.index)
Out[754]:
0 1
aaa True True
bbb False True
ccc False False
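To get the OpX/OpY headers the question asks for, just pass column names as well:
pd.DataFrame(np.column_stack([s, t]), index=df1.index, columns=['OpX', 'OpY'])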
Here's a solution that is performant and should scale. First, align the DataFrames on the index so you can compare them easily.
df3 = df2.set_axis(df1.columns, axis=1)
df4, df5 = df1.align(df3)
For req 2 (respecting column order), compare element-wise with the == op and reduce with DataFrame.all:
u = (df4 == df5).all(axis=1)
u
aaa True
bbb False
ccc False
dtype: bool
Req 1 (ignoring column order) is slightly more complex: sort the values within each row, then compare.
v = pd.Series((np.sort(df4) == np.sort(df5)).all(axis=1), index=u.index)
v
aaa True
bbb True
ccc False
dtype: bool
Concatenate the results,
pd.concat([u, v], axis=1, keys=['X', 'Y'])
X Y
aaa True True
bbb False True
ccc False False
For item 2):
(df1.values == df2.values).all(axis=1)
This checks element-wise equality of the dataframes, and gives True when all entries in a row are equal.
For item 1), sort the values along each row first:
import numpy as np
(np.sort(df1.values, axis=1) == np.sort(df2.values, axis=1)).all(axis=1)
Construct a new DataFrame and check the equality:
df3 = pd.DataFrame(index=ind)
df3['OpX'] = (df1.values == df2.values).all(1)
df3['OpY'] = (df1.apply(np.sort, axis=1).values == df2.apply(np.sort, axis=1).values).all(1)
print(df3)
Output:
OpX OpY
aaa True True
bbb False True
ccc False False

Create a new column by comparing rows pandas

My dataframe looks like this
df = pd.DataFrame({'a': ["10001", "10001", "10002", "10002", "10002"], 'b': ['hello', 'hello', 'hola', 'hello', 'hola']})
I want to create a new column 'c' of boolean values with the following condition:
If the values of 'a' are the same (i.e. 1st and 2nd rows; 3rd, 4th and 5th rows), check whether the values of 'b' in those rows are the same (the 2nd row returns True, the 4th row returns False).
If the values of 'a' are not the same, skip.
My current code is the following:
def check_consistency(col1, col2):
    df['match'] = df[col1].eq(df[col1].shift())
    t = []
    for i in df['match']:
        if i == True:
            t.append(df[col2].eq(df[col2].shift()))

check_consistency('a', 'b')
And it returns an error.
I think this is a job for groupby:
df.groupby('a').b.apply(lambda x : x==x.shift())
Out[431]:
0 False
1 True
2 False
3 False
4 False
Name: b, dtype: bool
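To store that result as the new column, a sketch (group_keys=False is my addition; it keeps the original index so the assignment aligns):
df['c'] = df.groupby('a', group_keys=False).b.apply(lambda x: x == x.shift())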
A bitwise & should do, checking whether both conditions are satisfied:
df['c'] = (df.a == df.a.shift()) & (df.b == df.b.shift())
df.c
#0 False
#1 True
#2 False
#3 False
#4 False
#Name: c, dtype: bool
Alternatively, if you want to make your current code work, you can do something like this (essentially doing the same check as above):
def check_consistency(col1, col2):
    df['match'] = df[col1].eq(df[col1].shift())
    for i in range(len(df['match'])):
        if df['match'][i] == True:
            df.loc[i, 'match'] = (df.loc[i, col2] == df.loc[i-1, col2])

check_consistency('a', 'b')

Matching of columns between two pandas dataframes

import numpy as np
import pandas as pd

temp1 = pd.DataFrame(index=np.arange(10), columns=['a', 'b'])
temp1['a'] = [1, 2, 2, 3, 3, 4, 4, 4, 9, 11]
temp1['b'] = 'B'
temp2 = pd.DataFrame(index=np.arange(10), columns=['a', 'b'])
temp2['a'] = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
temp2['b'] = 'B'
As in the script above, I want to pick up the rows from temp1 whose column a values were not seen in temp2. In R I can do this easily with %in%; how can I do it in pandas?
Update: the output should be one row, where column a is 11 and column b is B.
You can use isin to perform boolean indexing; it produces a boolean mask:
In [95]:
temp1.a.isin(temp2.a)
Out[95]:
0 True
1 True
2 True
3 True
4 True
5 True
6 True
7 True
8 True
9 False
Name: a, dtype: bool
This can then be used as a mask in the final output:
In [94]:
# note the ~ this negates the result so equivalent of NOT
temp1[~temp1.a.isin(temp2.a)]
Out[94]:
a b
9 11 B
You can use isin to get the indices that are seen, and then negate the boolean indices:
temp1[~temp1.a.isin(temp2.a)]
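An alternative sketch using merge with indicator=True (the _merge column is added by pandas; drop_duplicates on temp2 avoids multiplying rows if a contains repeats):
out = temp1.merge(temp2[['a']].drop_duplicates(), on='a', how='left', indicator=True)
print(out[out['_merge'] == 'left_only'].drop(columns='_merge'))
#     a  b
# 9  11  B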
