Select columns with specific values in pandas DataFrame - python

I have some columns in my DataFrame that contain only the values 0 and 1:
    name  a  b  c  d  e
0    one  1  0  1  0  0
1    two  0  0  1  0  0
2  three  0  0  1  0  1
How can I select the columns where at least one value is 1? Other columns (those that hold strings, or that take values besides 0 and 1) must be kept as well.
I tried this expression:
df.iloc[:, [(clm == 'name') | (1 in df[clm].unique()) for clm in df.columns]]
Out:
    name  a  c  e
0    one  1  1  0
1    two  0  1  0
2  three  0  1  1
But it seems clumsy, because I have to name the column 'name' explicitly.

If you want to remove all columns that contain only 0 values, compare for inequality with DataFrame.ne and keep the columns that have at least one True, using DataFrame.any inside DataFrame.loc:
df = df.loc[:, df.ne(0).any()]
print (df)
    name  a  c  e
0    one  1  1  0
1    two  0  1  0
2  three  0  1  1
Details:
print (df.ne(0))
   name      a      b     c      d      e
0  True   True  False  True  False  False
1  True  False  False  True  False  False
2  True  False  False  True  False   True
print (df.ne(0).any())
name     True
a        True
b       False
c        True
d       False
e        True
dtype: bool
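If the frame mixes string and 0/1 columns, a variant of the same idea (a sketch, assuming the non-numeric columns should always be kept) tests only the numeric columns, so a string column can never be dropped by accident:
import pandas as pd

df = pd.DataFrame({'name': ['one', 'two', 'three'],
                   'a': [1, 0, 0], 'b': [0, 0, 0], 'c': [1, 1, 1],
                   'd': [0, 0, 0], 'e': [0, 0, 1]})

# keep a column if it is non-numeric, or if it has at least one non-zero value
mask = [not pd.api.types.is_numeric_dtype(df[c]) or df[c].ne(0).any()
        for c in df.columns]
df = df.loc[:, mask]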

Related

How to count the number of rows that have the value 1 for all the columns in a dataframe?

The following is an example dataframe for this issue:
name  gender  address  phone no.
--------------------------------
   1       1        1          1
   1       0        0          0
   1       1        1          1
   1       1        0          1
The desired output here is 2 because the number of rows containing all 1s is 2.
Can anyone please help me with this issue?
Thanks.
Use eq(1) to identify the values equal to 1, then aggregate per row with all to get True when every value in the row is True, and finally sum the Trues, taking advantage of the True/1 equivalence:
df.eq(1).all(axis=1).sum()
output: 2
Intermediates:
df.eq(1)
   name  gender  address  phone no.
0  True    True     True       True
1  True   False    False      False
2  True    True     True       True
3  True    True    False       True
df.eq(1).all(axis=1)
0     True
1    False
2     True
3    False
dtype: bool
Let's do
l = sum(df.eq(1).all(axis=1))
print(l)
2
Assuming the above dataframe is a binary table, i.e. all values are either 1 or 0, rows whose df.sum(axis=1) equals the number of columns (here 4) are exactly the rows where all values are 1:
df[df.sum(axis=1) == len(df.columns)]
   name  gender  address  phone no.
0     1       1        1          1
2     1       1        1          1
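To get the count the question asks for rather than the rows themselves, wrap the filter in len() (equivalent to the eq/all/sum idiom above):
len(df[df.sum(axis=1) == len(df.columns)])
output: 2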

drop the row only if all columns contain 0

I am trying to drop rows that have 0 in all 3 columns. I tried the code below, but it dropped every row that has 0 in any one of the 3 columns instead.
indexNames = news[ news['contain1']&news['contain2'] &news['contain3']== 0 ].index
news.drop(indexNames , inplace=True)
My CSV file
contain1  contain2  contain3
       1         0         0
       0         0         0
       0         1         1
       1         0         1
       0         0         0
       1         1         1
With that code, all of my rows get deleted. Below is the result I want instead:
contain1  contain2  contain3
       1         0         0
       0         1         1
       1         0         1
       1         1         1
First compare with DataFrame.ne for not equal to 0, then keep rows with at least one True using DataFrame.any, so only all-0 rows are removed:
df = news[news.ne(0).any(axis=1)]
#cols = ['contain1','contain2','contain3']
#if necessary filter only columns by list
#df = news[news[cols].ne(0).any(axis=1)]
print (df)
   contain1  contain2  contain3
0         1         0         0
2         0         1         1
3         1         0         1
5         1         1         1
Details:
print (news.ne(0))
   contain1  contain2  contain3
0      True     False     False
1     False     False     False
2     False      True      True
3      True     False      True
4     False     False     False
5      True      True      True
print (news.ne(0).any(axis=1))
0     True
1    False
2     True
3     True
4    False
5     True
dtype: bool
If this is a pandas dataframe you can sum each row with .sum(axis=1):
news_sums = news.sum(axis=1)  # row sums; 0 means every value in the row is 0
indexNames = news.loc[news_sums == 0].index
news.drop(indexNames, inplace=True)
(note: not tested, just from memory)
A simple solution is to filter on the row sums: news[news.sum(axis=1) != 0]. Note this assumes the values are non-negative (as with 0/1 indicators), since otherwise a mixed-sign row could sum to zero.
Hope this will help you :)
You might want to try this.
news[(news.T != 0).any()]
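For completeness, a minimal self-contained sketch (the CSV data rebuilt inline rather than read from a file) showing the accepted mask end to end:
import pandas as pd

news = pd.DataFrame({'contain1': [1, 0, 0, 1, 0, 1],
                     'contain2': [0, 0, 1, 0, 0, 1],
                     'contain3': [0, 0, 1, 1, 0, 1]})

# keep only rows where at least one of the three columns is non-zero
news = news[news.ne(0).any(axis=1)]
print(news)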

Compare columns of 2 dataframes with a combination of index and row value

There are quite a few similar questions out there, but I am not sure if there is one that tackles both index and row values. (relevant to binary classification df)
So what I am trying to do is compare the columns that share a name and check that they match in both values and index; if not, simply raise an error.
Let's say DataFrame df has columns a, b and c and df_orginal has columns from a to z.
How can we first find the columns that have the same name between those 2 DataFrames, and then check that the contents of those columns match row by row in value and index between a, b and c from df and df_orginal?
The contents of all the columns are numerical; that's why I want to compare the combination of index and values.
Demo:
In [1]: df
Out[1]:
   a  b  c
0  0  1  2
1  1  2  0
2  0  1  0
3  1  1  0
4  3  1  0
In [3]: df_orginal
Out[3]:
   a  b  c  d  e  f  g ......
0  4  3  1  1  0  0  0
1  3  1  2  1  1  2  1
2  1  2  1  1  1  2  1
3  3  4  1  1  1  2  1
4  0  3  0  0  1  1  1
In the above example, for the columns that share a name, compare the combination of index and value, and flag an error if any pair does not match.
common_cols = df.columns.intersection(df_original.columns)
for col in common_cols:
    df1_ind_val_pair = df[col].index.astype(str) + ' ' + df[col].astype(str)
    df2_ind_val_pair = df_original[col].index.astype(str) + ' ' + df_original[col].astype(str)
    if any(df1_ind_val_pair != df2_ind_val_pair):
        print('Found one or more unequal (index, value) pairs in col {}'.format(col))
IIUC:
Use pd.DataFrame.align with a join method of 'inner', then pass the resulting tuple, unpacked, to pd.DataFrame.eq:
pd.DataFrame.eq(*df.align(dfo, 'inner'))
       a      b      c
0  False  False  False
1  False  False  False
2  False  False  False
3  False  False  False
4  False  False   True
To see rows that have all columns True, filter with this mask:
pd.DataFrame.eq(*df.align(dfo, 'inner')).all(1)
0    False
1    False
2    False
3    False
4    False
dtype: bool
With the sample data, however, the result will be empty:
df[pd.DataFrame.eq(*df.align(dfo, 'inner')).all(1)]
Empty DataFrame
Columns: [a, b, c]
Index: []
Same answer, but with clearer code:
def eq(d1, d2):
    d1, d2 = d1.align(d2, 'inner')
    return d1 == d2
eq(df, dfo)
       a      b      c
0  False  False  False
1  False  False  False
2  False  False  False
3  False  False  False
4  False  False   True
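Since the question asks to raise an error on a mismatch, pandas' own testing helper can do the comparison and the raising in one call. A sketch, assuming both frames share the same row index and using the demo names df and df_orginal:
from pandas.testing import assert_frame_equal

common = df.columns.intersection(df_orginal.columns)
# raises an AssertionError describing the first column whose (index, value) pairs differ
assert_frame_equal(df[common], df_orginal[common], check_dtype=False)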

Python count all boolean values in pivot table

Does anybody have an idea how to count all boolean values (including the False ones) in a pivot table?
passed_exam is a column of boolean values.
This code performs the task only for the True values, which is great:
table = pd.pivot_table(df,index=["student","semester"], values=["passed_exam"],aggfunc=np.sum)
But I also want a column that counts all boolean values.
Thank you in advance!
I think you need groupby with size, then reshape by unstack:
import pandas as pd

df = pd.DataFrame({'student': ['a'] * 4 + ['b'] * 6,
                   'semester': [1, 1, 2, 2, 1, 1, 2, 2, 2, 2],
                   'passed_exam': [True, False] * 5})
print (df)
  passed_exam  semester student
0        True         1       a
1       False         1       a
2        True         2       a
3       False         2       a
4        True         1       b
5       False         1       b
6        True         2       b
7       False         2       b
8        True         2       b
9       False         2       b
table = (df.groupby(["student", "semester", "passed_exam"])
           .size()
           .unstack(fill_value=0)
           .rename_axis(None, axis=1)
           .reset_index())
print (table)
  student  semester  False  True
0       a         1      1      1
1       a         2      1      1
2       b         1      1      1
3       b         2      2      2
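If you would rather stay with pivot_table, passing two aggregation functions gives both the True count (sum of booleans) and the total count of values in one call. A sketch against the same sample frame:
table = pd.pivot_table(df, index=['student', 'semester'],
                       values='passed_exam', aggfunc=['sum', 'count'])
The 'sum' column counts the True values and the 'count' column counts all boolean values, True and False alike.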

Add number to column each time a different column has group of True bools

I have two columns I am working with. The first column is populated with zeros and the second column is populated with booleans.
column 1  column 2
       0      True
       0      True
       0     False
       0      True
       0      True
       0     False
       0     False
       0      True
There are millions of rows, so I am trying to figure out an efficient process that looks at column 2 and, for each consecutive group of True bools, writes that group's number into column 1.
column 1  column 2
       1      True
       1      True
       0     False
       2      True
       2      True
       0     False
       0     False
       3      True
Any help is much appreciated!
One trick which often comes in handy when vectorizing operations on contiguous groups is the shift-cumsum pattern:
>>> c = df["column 2"]
>>> c * (c & (c != c.shift())).cumsum()
0    1
1    1
2    0
3    2
4    2
5    0
6    0
7    3
Name: column 2, dtype: int32
df['column 3'] = (df['column 2'] & (df['column 2'].shift() != True))
df['column 4'] = df['column 3'].cumsum()
df['column 1'] = df['column 2'] * df['column 4']
print(df)
   column 1  column 2  column 3  column 4
0         1      True      True         1
1         1      True     False         1
2         0     False     False         1
3         2      True      True         2
4         2      True     False         2
5         0     False     False         2
6         0     False     False         2
7         3      True      True         3
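Put together as a self-contained snippet (sample data rebuilt from the question):
import pandas as pd

df = pd.DataFrame({'column 2': [True, True, False, True,
                                True, False, False, True]})

c = df['column 2']
# a group starts wherever the value is True and differs from the previous row
starts = c & (c != c.shift())
# the cumulative count of group starts numbers the True groups;
# multiplying by c zeroes out the False rows
df['column 1'] = c * starts.cumsum()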
