Inconsistent any vs all on a pandas DataFrame - Python

This has been asked in other threads, but with a focus on NaN.
I have a simple dataframe:
import pandas as pd

y = [[1, 2, 3, 4, 1], [1, 2, 0, 4, 5]]
df = pd.DataFrame(y)
I am having difficulty understanding how any and all work. According to the pandas documentation, any returns "...whether any element is True over requested axis".
If I use:
~(df == 0)
Out[77]:
      0     1      2     3     4
0  True  True   True  True  True
1  True  True  False  True  True
~(df == 0).any(1)
Out[78]:
0     True
1    False
dtype: bool
From my understanding, the second command means: return True if any element is True over the requested axis. So it should return True, True for both rows (since each row contains at least one True value), but instead I get True, False. Why is that?

You need an extra pair of parentheses because of operator precedence:
print (df == 0)
       0      1      2      3      4
0  False  False  False  False  False
1  False  False   True  False  False
print (~(df == 0))
      0     1      2     3     4
0  True  True   True  True  True
1  True  True  False  True  True
print ((~(df == 0)).any(1))
0    True
1    True
dtype: bool
Because:
print ((df == 0).any(1))
0    False
1     True
dtype: bool
print (~(df == 0).any(1))
0     True
1    False
dtype: bool
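A related identity worth keeping in mind (a minimal sketch using the keyword form axis=1 and relying only on De Morgan's laws):
# ~any(x) is the same as all(~x), so these two Series are equal:
assert (~(df == 0).any(axis=1)).equals((df != 0).all(axis=1))
# negating element-wise first keeps the any-reduction:
assert ((~(df == 0)).any(axis=1)).equals((df != 0).any(axis=1))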

Python interprets your call as:
~ ( (df == 0).any(1) )
So it evaluates any first. Now if we take a look at df == 0, we see:
>>> df == 0
       0      1      2      3      4
0  False  False  False  False  False
1  False  False   True  False  False
This means that the first row contains no True value at all, whereas the second row does, so:
>>> (df == 0).any(1)
0    False
1     True
dtype: bool
Now we negate this with ~, so False becomes True and vice versa:
>>> ~ (df == 0).any(1)
0     True
1    False
dtype: bool
If we instead negate first, we see:
>>> (~ (df == 0)).any(1)
0    True
1    True
dtype: bool
Both are True, since each row has at least one column that is True.
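To make the precedence explicit in code (a small sketch; axis=1 is the keyword spelling of the positional 1 used above):
# A method call binds tighter than the unary ~ operator, so these are identical:
assert (~(df == 0).any(axis=1)).equals(~((df == 0).any(axis=1)))
# Parenthesizing the negation changes the result on this df:
assert (~(df == 0)).any(axis=1).tolist() == [True, True]
assert (~(df == 0).any(axis=1)).tolist() == [True, False]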

Related

How to satisfy a condition on 2 columns in different rows at the same time

My logic is like this: the cond2 column is True before the expected column, and the cond1 column is True before the cond2 column; then the expected column can be True.
Input:
import pandas as pd
import numpy as np

d = {'cond1': [False, False, True, False, False, False, False, True, False, False],
     'cond2': [False, True, False, True, True, False, False, False, True, False]}
df = pd.DataFrame(d)
Expected result table:
   cond1  cond2  expected
0  FALSE  FALSE
1  FALSE  TRUE
2  TRUE   FALSE
3  FALSE  TRUE
4  FALSE  TRUE
5  FALSE  FALSE  TRUE
6  FALSE  FALSE  TRUE
7  TRUE   FALSE
8  FALSE  TRUE
9  FALSE  FALSE  TRUE
I have an idea: count the number of rows from the last True in cond1 up to the present row, and then use the cumsum function to check whether the number of rows where cond2 is True in that stretch is greater than 0.
But how do I count the number of rows from the last True in cond1 up to the present?
The description is not fully clear. It looks like you need a cummax per group, where each group starts at a True in cond1:
m = df.groupby(df['cond1'].cumsum())['cond2'].cummax()
df['expected'] = df['cond2'].ne(m)
Output:
   cond1  cond2  expected
0  False  False     False
1  False   True     False
2   True  False     False
3  False   True     False
4  False   True     False
5  False  False      True
6  False  False      True
7   True  False     False
8  False   True     False
9  False  False      True
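To see what the grouping does, here is a sketch of the intermediate values (the names groups and m are illustrative):
groups = df['cond1'].cumsum()             # a new group starts at every True in cond1
m = df.groupby(groups)['cond2'].cummax()  # has cond2 been True yet within this group?
df['expected'] = df['cond2'].ne(m)        # True once cond2 has fired but is False again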
It's not very clear what you're looking for, but this reproduces the expected output:
df['expected'] = ((df.index > df.idxmax().max())
                  & ~df.any(axis=1))
# Output:
   cond1  cond2  expected
0  False  False     False
1  False   True     False
2   True  False     False
3  False   True     False
4  False   True     False
5  False  False      True
6  False  False      True
7   True  False     False
8  False   True     False
9  False  False      True
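For reference, a sketch of what the building blocks of this mask evaluate to on the original two-column df:
df.idxmax()        # index of the first True per column: cond1 -> 2, cond2 -> 1
df.idxmax().max()  # 2, so only rows after index 2 are candidates
~df.any(axis=1)    # True on rows where both cond1 and cond2 are False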

Selecting DataFrame rows from one condition to another condition

I have a dataframe with two columns:
        A      B
0   False  False
1   False  False
2   False  False
3    True  False
4   False  False
5   False  False
6   False   True
7   False  False
8   False  False
9   False  False
10   True  False
11  False  False
12  False  False
I would like to create a new Boolean column "C" that turns on (=True) each time B turns on, and turns off each time A turns on (e.g. here from index 6 to index 10).
Ex: for this df, the output will be:
        A      B      C
0   False  False  False
1   False  False  False
2   False  False  False
3    True  False  False
4   False  False  False
5   False  False  False
6   False   True   True
7   False  False   True
8   False  False   True
9   False  False   True
10   True  False   True
11  False  False  False
12  False  False  False
I wrote this code with a for loop and a "switch", but I'm pretty sure there is a faster and easier way to do the same thing for large dataframes. I appreciate your help.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': [False, False, False, True, False, False, False, False, False, False, True, False, False],
    'B': [False, False, False, False, False, False, True, False, False, False, False, False, False]
})
df["C"] = False
switch = False
for i in df.index:
    if df.B.iloc[i]:
        switch = True
    if switch:
        df.loc[i, "C"] = True
    else:
        df.loc[i, "C"] = False
    if df.A.iloc[i]:
        switch = False
print(df)
Alternative approach using ffill:
df.loc[df['A'], 'C'] = False             # rows where A is True switch "off"
df.loc[df['B'], 'C'] = True              # rows where B is True switch "on"
df['C'] = df['C'].ffill().fillna(False)  # propagate the last switch state; start "off"
Combine the two columns, take the cumulative sum, subtract 1, then filter out negative and even numbers:
x = (df['A'] | df['B']).cumsum().sub(1)
df['C'] = (x >= 0) & (x % 2 == 1)
Output:
>>> df
        A      B      C
0   False  False  False
1   False  False  False
2   False  False  False
3    True  False  False
4   False  False  False
5   False  False  False
6   False   True   True  <
7   False  False   True  <
8   False  False   True  <
9   False  False   True  <
10   True  False  False
11  False  False  False
12  False  False  False
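For reference, here is how the counter evolves on the sample data (a sketch; note the parity argument relies on A and B events alternating, starting with an A, as they do here):
x = (df['A'] | df['B']).cumsum().sub(1)
# rows 0-2:  x = -1  (before any event)         -> C False
# rows 3-5:  x =  0  (after A at row 3, even)   -> C False
# rows 6-9:  x =  1  (after B at row 6, odd)    -> C True
# rows 10+:  x =  2  (after A at row 10, even)  -> C False
df['C'] = (x >= 0) & (x % 2 == 1)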

Filter pandas dataframe rows based on the value of columns in a list

I have a dataframe df defined as below:
df = pd.DataFrame()
df["A"] = [True, True, True, True, True]
df["B"] = [True, False, True, False, True]
df["C"] = [False, True, False, True, False]
df["D"] = [True, True, False, False, False]
df["E"] = [False, True, True, False, True]
df["F"] = ['HI', 'HI', 'HI', 'HI', 'HI']
>>> df
       A      B      C      D      E   F
0   True   True  False   True  False  HI
1   True  False   True   True   True  HI
2   True   True  False  False   True  HI
3   True  False   True  False  False  HI
4   True   True  False  False   True  HI
and a list
lst = ["A","C"]
I would like to filter the rows in df based on the value being True for every column in lst.
That is, I would like to get my resultant dataframe as:
       A      B     C      D      E   F
1   True  False  True   True   True  HI
3   True  False  True  False  False  HI
Instead of looping through the column names in the list and filtering one by one, is there a better solution for this?
Use DataFrame.all along the columns (axis=1):
df[df[lst].all(axis=1)]
       A      B     C      D      E   F
1   True  False  True   True   True  HI
3   True  False  True  False  False  HI
Details:
We select the columns in scope with df[lst], then use all to check which rows have all values True:
df[lst].all(axis=1)
0    False
1     True
2    False
3     True
4    False
dtype: bool
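If you instead want the rows where at least one of the listed columns is True, the same pattern works with any (a sketch under the same setup):
df[df[lst].any(axis=1)]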

Pandas get one hot encodings from a column as booleans

I'm working with a pandas DataFrame and would like an efficient way to create the second DataFrame below from the first.
import pandas as pd
data = {"column":[0,1,2,0,1,2,0]}
df = pd.DataFrame(data)
column
0
1
2
0
1
2
0
column0  column1  column2
true     false    false
false    true     false
false    false    true
true     false    false
false    true     false
false    false    true
true     false    false
This is a get_dummies problem, but you will additionally need to specify dtype=bool to get columns of bools:
pd.get_dummies(df['column'], dtype=bool)
       0      1      2
0   True  False  False
1  False   True  False
2  False  False   True
3   True  False  False
4  False   True  False
5  False  False   True
6   True  False  False
pd.get_dummies(df['column'], dtype=bool).dtypes
0    bool
1    bool
2    bool
dtype: object
# carbon copy of the expected output, with the strings 'true'/'false'
import numpy as np

dummies = pd.get_dummies(df['column'], dtype=bool)
dummies[:] = np.where(dummies, 'true', 'false')
dummies.add_prefix('column')
  column0 column1 column2
0    true   false   false
1   false    true   false
2   false   false    true
3    true   false   false
4   false    true   false
5   false   false    true
6    true   false   false
Like cs95, I also use get_dummies. However, I use str.get_dummies, and concatenate the word column in front before calling it. Finally, I replace 1/0 with 'true'/'false':
('column'+df.column.astype(str)).str.get_dummies().replace({1:'true', 0:'false'})
Out[2164]:
  column0 column1 column2
0    true   false   false
1   false    true   false
2   false   false    true
3    true   false   false
4   false    true   false
5   false   false    true
6    true   false   false
factorize and slice assignment:
import numpy as np

i, u = pd.factorize(df.column)         # i: integer codes per row, u: unique values
a = np.empty((len(i), len(u)), '<U5')  # string array wide enough to hold 'false'
a.fill('false')
a[np.arange(len(i)), i] = 'true'       # flip the matching column in each row
pd.DataFrame(a).add_prefix('column')
  column0 column1 column2
0    true   false   false
1   false    true   false
2   false   false    true
3    true   false   false
4   false    true   false
5   false   false    true
6    true   false   false
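As a side note (a sketch; prefix and prefix_sep are long-standing get_dummies parameters, but verify on your pandas version), the column prefix can be attached directly instead of calling add_prefix afterwards:
pd.get_dummies(df['column'], dtype=bool, prefix='column', prefix_sep='')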

Pandas: How to delete rows containing a required string

I want to delete all rows containing a required string.
Suppose I have the following dataframe:
A B C
1 a x
w g n
3 l p
j p v
I want to delete all rows containing the string p. I have searched for this, but most answers rely on a known column name; in my case I won't know the column, since the string can be present in any of the columns.
The output dataframe should be:
A B C
1 a x
w g n
For filtering strings:
df = df[(df != 'p').all(axis=1)]
Compare for not equal:
print (df != 'p')
      A      B      C
0  True   True   True
1  True   True   True
2  True   True  False
3  True  False   True
And test whether all values in a row are True:
print ((df != 'p').all(axis=1))
0     True
1     True
2    False
3    False
dtype: bool
Or:
df = df[~(df == 'p').any(axis=1)]
Test for equal:
print (df == 'p')
       A      B      C
0  False  False  False
1  False  False  False
2  False  False   True
3  False   True  False
Then test for at least one True per row:
print ((df == 'p').any(axis=1))
0    False
1    False
2     True
3     True
dtype: bool
Invert the boolean mask:
print (~(df == 'p').any(axis=1))
0     True
1     True
2    False
3    False
dtype: bool
For filtering substrings, use contains with apply:
df = df[~df.apply(lambda x: x.astype(str).str.contains('p')).any(axis=1)]
Or:
df = df[~df.stack().astype(str).str.contains('p').unstack().any(axis=1)]
print (df)
   A  B  C
0  1  a  x
1  w  g  n
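For exact matches there is also an isin-based equivalent (a sketch; it generalizes to several forbidden values at once):
df = df[~df.isin(['p']).any(axis=1)]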
