Pandas: how to delete rows containing a given string - python

I want to delete all rows that contain a given string.
Suppose I have the following dataframe:
A B C
1 a x
w g n
3 l p
j p v
I want to delete all rows containing the string p. I have searched for this, but most answers filter on a specific column name. In my case I do not know the column in advance; the string can appear in any column.
Output dataframe should be
A B C
1 a x
w g n

To filter on exact (whole-cell) string matches:
df = df[(df != 'p').all(axis=1)]
Compare for not equal:
print ((df != 'p'))
A B C
0 True True True
1 True True True
2 True True False
3 True False True
And test for all Trues per row:
print ((df != 'p').all(axis=1))
0 True
1 True
2 False
3 False
dtype: bool
Or:
df = df[~(df == 'p').any(axis=1)]
Test for equal:
print ((df == 'p'))
A B C
0 False False False
1 False False False
2 False False True
3 False True False
Test at least one True per row:
print ((df == 'p').any(axis=1))
0 False
1 False
2 True
3 True
dtype: bool
Invert boolean mask:
print (~(df == 'p').any(axis=1))
0 True
1 True
2 False
3 False
dtype: bool
For filtering substrings use contains with apply:
df = df[~df.apply(lambda x: x.astype(str).str.contains('p')).any(axis=1)]
Or:
df = df[~df.stack().astype(str).str.contains('p').unstack().any(axis=1)]
print (df)
A B C
0 1 a x
1 w g n
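The exact-match and substring variants above can be wrapped in one helper. This is a minimal sketch rather than part of the original answer: the function name drop_rows_containing and its substring flag are hypothetical, and regex=False is assumed so that the search text is matched literally instead of as a regular expression.

```python
import pandas as pd

def drop_rows_containing(df, text, substring=False):
    """Return df without the rows in which any cell matches `text`.

    Hypothetical helper for illustration; it is not a pandas API.
    """
    if substring:
        # Cast every cell to str, then look for `text` anywhere inside it.
        hits = df.astype(str).apply(lambda col: col.str.contains(text, regex=False))
    else:
        # Exact, whole-cell comparison.
        hits = df == text
    # Keep only the rows with no hit in any column.
    return df[~hits.any(axis=1)]

df = pd.DataFrame({'A': [1, 'w', 3, 'j'],
                   'B': ['a', 'g', 'l', 'p'],
                   'C': ['x', 'n', 'p', 'v']})
exact = drop_rows_containing(df, 'p')
partial = drop_rows_containing(df, 'p', substring=True)
print(exact)
```

On this sample both calls keep the rows labelled 0 and 1. They would differ on data where p only appears inside a longer cell value such as 'apple'.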

Related

Count number of consecutive True in column, restart when False

I work with the following column in a pandas df:
A
True
True
True
False
True
True
I want to add a column B that counts the number of consecutive True values in A. The count should restart every time a False comes up. Desired output:
A B
True 1
True 2
True 3
False 0
True 1
True 2
Use cumsum on the negated column to identify the blocks of rows in which column A stays True (every False starts a new block), then group column A by these blocks and take the cumulative sum within each block to number the rows:
df['B'] = df['A'].groupby((~df['A']).cumsum()).cumsum()
A B
0 True 1
1 True 2
2 True 3
3 False 0
4 True 1
5 True 2
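To see why this works, it helps to look at the block ids on their own. The sketch below rebuilds the question's column and keeps the intermediate result in a variable named blocks (a name chosen here for illustration). The astype(int) cast is an added precaution so that the cumulative sum is clearly computed on 0/1 integers.

```python
import pandas as pd

df = pd.DataFrame({'A': [True, True, True, False, True, True]})

# Each False bumps the running total, so it starts a new block id.
blocks = (~df['A']).cumsum()

# Within a block, the running sum of the 0/1 flags counts the consecutive True values.
df['B'] = df['A'].astype(int).groupby(blocks).cumsum()

print(df.assign(block=blocks))
```

The False row opens its own block with a sum of 0, and the True rows that follow it count up from 1 again.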
Using a simple, plain-Python approach
(it worked fine on a small code sample):
import pandas as pd
df = pd.DataFrame({'A': [True, False, True, True, True, False, True, True]})
class ToNums:
    counter = 0  # running count of consecutive True values
    @staticmethod
    def convert(bool_val):
        if bool_val:
            ToNums.counter += 1
        else:
            ToNums.counter = 0
        return ToNums.counter
df['B'] = df.A.map(ToNums.convert)
df
A B
0 True 1
1 False 0
2 True 1
3 True 2
4 True 3
5 False 0
6 True 1
7 True 2
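One caveat with a class-level counter is that it keeps its value between calls, so mapping a second column would continue from wherever the previous run ended. A minimal stateless sketch of the same idea follows; run_lengths is a hypothetical helper name, not something from the answer above.

```python
import pandas as pd

def run_lengths(flags):
    """Hypothetical helper: length of the current run of True values at each position."""
    counts = []
    current = 0
    for flag in flags:
        # Extend the run on True, reset it on False.
        current = current + 1 if flag else 0
        counts.append(current)
    return counts

df = pd.DataFrame({'A': [True, False, True, True, True, False, True, True]})
df['B'] = run_lengths(df['A'])
print(df)
```

Because the counter is a local variable, calling run_lengths again always starts from zero.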
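Here's an example using an explicit loop (it assumes the default 0..n-1 index):
v = 0
df["C"] = 0  # create the column up front so it keeps an integer dtype
for i, val in enumerate(df['A']):
    if val:  # the column holds booleans, so test the value directly, not against the string "True"
        v += 1
    else:
        v = 0
    df.loc[i, "C"] = v
df.head()
This will give the desired output: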
A C
0 True 1
1 True 2
2 True 3
3 False 0
4 True 1
You can use a combination of groupby, cumsum, and cumcount:
df['B'] = (df.groupby((df['A'] &
                       ~df['A'].shift(1, fill_value=False)  # row is True and previous row is False
                       )
                      .cumsum()  # make group id
                      )
             .cumcount().add(1)  # make cumulated count
           * df['A']  # multiply by 0 where initially False, 1 otherwise
           )
output:
A B
0 True 1
1 True 2
2 True 3
3 False 0
4 True 1
5 True 2
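The chained expression is easier to follow when each step gets its own name. The sketch below is one way to split it up. The variable names starts, group_id and position are illustrative, and shift(1, fill_value=False) is assumed so that the shifted column stays boolean.

```python
import pandas as pd

df = pd.DataFrame({'A': [True, True, True, False, True, True]})

# True only on the first row of each run of True values.
starts = df['A'] & ~df['A'].shift(1, fill_value=False)

# A new group id begins at every run start.
group_id = starts.cumsum()

# 1-based position of each row inside its group.
position = df.groupby(group_id).cumcount().add(1)

# Multiplying by the boolean column zeroes the rows where A is False.
df['B'] = position * df['A']
print(df)
```

Note that the False row still belongs to the first group and gets position 4; only the final multiplication turns it into 0.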

Filter pandas dataframe columns based on value for columns in a list

I have a dataframe df defined as below :
df = pd.DataFrame()
df["A"] = ['True','True','True','True','True']
df["B"] = ['True','False','True','False','True']
df["C"] = ['False','True','False','True','False']
df["D"] = ['True','True','False','False','False']
df["E"] = ['False','True','True','False','True']
df["F"] = ['HI','HI','HI','HI','HI']
>> df
A B C D E F
0 True True False True False HI
1 True False True True True HI
2 True True False False True HI
3 True False True False False HI
4 True True False False True HI
and a list
lst = ["A","C"]
I would like to filter the rows in df based on the values being 'True' for the columns in lst.
That is, I would like to get my resultant dataframe as :
A B C D E F
1 True False True True True HI
3 True False True False False HI
Instead of looping through the column names from the list and filtering it, is there any better solution for this?
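Because the columns hold the strings 'True' and 'False' rather than booleans, compare them with 'True' first, then use DataFrame.all over the column axis (axis=1):
df[df[lst].eq('True').all(axis=1)]
A B C D E F
1 True False True True True HI
3 True False True False False HI
Details:
We select the columns in scope with df[lst], turn them into booleans with eq('True'), then use all to check which rows are True in every selected column:
df[lst].eq('True').all(axis=1)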
0 False
1 True
2 False
3 True
4 False
dtype: bool
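A self-contained sketch of the row filter, assuming as in the question that the columns hold the strings 'True' and 'False'. Any non-empty string counts as truthy in Python, so the strings are compared against 'True' before all is applied. Only three of the question's columns are rebuilt here to keep the example short.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['True', 'True', 'True', 'True', 'True'],
    'C': ['False', 'True', 'False', 'True', 'False'],
    'F': ['HI', 'HI', 'HI', 'HI', 'HI'],
})
lst = ['A', 'C']

# Convert the string flags to real booleans for the columns of interest.
is_true = df[lst].eq('True')

# Keep the rows where every listed column is True.
result = df[is_true.all(axis=1)]
print(result)
```

If the columns already hold real booleans, the eq('True') step is unnecessary and df[df[lst].all(axis=1)] is enough.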

inconsistent any vs all pd dataframe

This was asked in other forums but with focus on nan.
I have a simple dataframe:
y=[[1,2,3,4,1],[1,2,0,4,5]]
df = pd.DataFrame(y)
I am having difficulties understanding how any and all work. According to the pandas documentation 'any' returns "...whether any element is True over requested axis".
If I use:
~(df == 0)
Out[77]:
0 1 2 3 4
0 True True True True True
1 True True False True True
~(df == 0).any(1)
Out[78]:
0 True
1 False
dtype: bool
From my understanding the second command means: Return 'True' if any element is True over requested axis, and it should return True, True for both rows (since both contain at least one true value) but instead I get True, False. Why is that?
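You need an extra pair of parentheses because of operator precedence: the method call .any(1) binds more tightly than the unary ~, so it is evaluated first: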
print (df == 0)
0 1 2 3 4
0 False False False False False
1 False False True False False
print (~(df == 0))
0 1 2 3 4
0 True True True True True
1 True True False True True
print ((~(df == 0)).any(1))
0 True
1 True
dtype: bool
Because:
print ((df == 0).any(1))
0 False
1 True
dtype: bool
print (~(df == 0).any(1))
0 True
1 False
dtype: bool
Python interprets your call as:
~ ( (df == 0).any(1) )
So it evaluates any first. Now if we take a look at df == 0, we see:
>>> df == 0
0 1 2 3 4
0 False False False False False
1 False False True False False
So this means that in the first row, there is no such True, in the second one there is, so:
>>> (df == 0).any(1)
0 False
1 True
dtype: bool
Now we negate this with ~, so False becomes True and vice versa:
>>> ~ (df == 0).any(1)
0 True
1 False
dtype: bool
In case we first negate, we see:
>>> (~ (df == 0)).any(1)
0 True
1 True
dtype: bool
Both are True, since in both rows there is at least one column that is True.
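The two readings can be compared side by side in a short script. This is a sketch using the question's data. The variable names are illustrative, and axis is passed by keyword on the assumption that newer pandas versions expect it that way.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4, 1], [1, 2, 0, 4, 5]])

# any() runs first, then ~: "this row contains no zero".
no_zero_in_row = ~(df == 0).any(axis=1)

# ~ runs first, then any(): "this row contains at least one non-zero".
some_nonzero_in_row = (~(df == 0)).any(axis=1)

# De Morgan: "not any zero" is the same as "all non-zero".
same_as_all = (df != 0).all(axis=1)

print(no_zero_in_row.tolist(), some_nonzero_in_row.tolist())
```

Writing the condition as (df != 0).all(axis=1) or (df != 0).any(axis=1) avoids the ~ operator entirely, and with it the precedence question.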

What's the fastest way to loop through a DataFrame and count occurrences within the DataFrame while some condition is fulfilled (in Python)?

I have a dataframe with two Boolean fields (as below).
import pandas as pd
d = [{'a1':False, 'a2':False}, {'a1':True, 'a2':False}, {'a1':True, 'a2':False}, {'a1':False, 'a2':False}, {'a1':False, 'a2':True},
{'a1': False, 'a2': False}, {'a1':False, 'a2':False}, {'a1':True, 'a2':False}, {'a1':False, 'a2':True}, {'a1':False, 'a2':False},]
df = pd.DataFrame(d)
df
Out[1]:
a1 a2
0 False False
1 True False
2 True False
3 False False
4 False True
5 False False
6 False False
7 True False
8 False True
9 False False
I am trying to find the fastest and most "Pythonic" way of achieving the following:
If a1==True, count instances from current row where a2==False (e.g. row 1: a1=True, a2 is False for three rows from row 1)
At first instance of a2==True, stop counting (e.g. row 4, count = 3)
Set value of 'count' to new df column 'a3' on row where counting began (e.g. 'a3' = 3 on row 1)
Target result set as follows.
a1 a2 a3
0 False False 0
1 True False 3
2 True False 2
3 False False 0
4 False True 0
5 False False 0
6 False False 0
7 True False 1
8 False True 0
9 False False 0
I have been trying to accomplish this using for loops, iterrows and while loops and so far haven't been able to produce a good nested combination which provides the results I want. Any help appreciated. I apologize if the problem is not totally clear.
How about this:
df['a3'] = df.apply(lambda x: 0 if not x.a1 else len(df.a2[x.name:df.a2.tolist()[x.name:].index(True)+x.name]), axis=1)
So, if a1 is False, write 0; otherwise write the length of the slice of a2 that runs from that row up to the next True. Note that list.index raises a ValueError if no True follows the row.
This will do the trick:
df['a3'] = 0
# loop through every position in 'a1'
for i in range(len(df['a1'])):
    # if 'a1' at position i is True...
    if df['a1'][i]:
        count = 0
        # loop over the remaining items in 'a2', starting at position i
        for j in range(i, len(df['a2'])):
            # if the value of 'a2' is False...
            if not df['a2'][j]:
                # ...count the consecutive False values...
                count += 1
            else:
                # ...and stop at the first True
                break
        # write the count at position i in 'a3'
        df.loc[i, 'a3'] = count
and produce the following output:
a1 a2 a3
0 False False 0
1 True False 3
2 True False 2
3 False False 0
4 False True 0
5 False False 0
6 False False 0
7 True False 1
8 False True 0
9 False False 0
Edit: added comments in the code
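Since the question asks for the fastest way, a loop-free variant may be worth trying. The following is a sketch, not one of the answers above: it reverses a2, numbers the runs of False with the groupby/cumsum pattern, and flips the result back, so that each row holds the number of consecutive a2 == False rows starting at that row. The variable names are illustrative. If no True follows a row, the count runs to the end of the column, which matches the explicit loop.

```python
import pandas as pd

d = [{'a1': False, 'a2': False}, {'a1': True, 'a2': False}, {'a1': True, 'a2': False},
     {'a1': False, 'a2': False}, {'a1': False, 'a2': True}, {'a1': False, 'a2': False},
     {'a1': False, 'a2': False}, {'a1': True, 'a2': False}, {'a1': False, 'a2': True},
     {'a1': False, 'a2': False}]
df = pd.DataFrame(d)

# Walk the column bottom-up: True where a2 is False.
rev = (~df['a2']).iloc[::-1]

# Every a2 == True row (False in rev) starts a new block.
blocks = (~rev).cumsum()

# Running count of a2 == False rows inside each block, restored to the original order.
forward_run = rev.astype(int).groupby(blocks).cumsum().sort_index()

# Keep the count only on rows where a1 is True.
df['a3'] = forward_run.where(df['a1'], 0)
print(df)
```

On the sample data this reproduces the target column; timing it against the loop on realistic data would be needed before calling it faster.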

Making new column in pandas DataFrame based on filter

Given this DataFrame:
df = pandas.DataFrame({"a": [1,10,20,3,10], "b": [50,60,55,0,0], "c": [1,30,1,0,0]})
What is the best way to make a new column, "filter" that has value "pass" if the values at columns a and b are both greater than x and value "fail" otherwise?
It can be done by iterating through rows but it's inefficient and inelegant:
c = []
for x, v in df.iterrows():
    if v["a"] >= 20 and v["b"] >= 20:
        c.append("pass")
    else:
        c.append("fail")
df["filter"] = c
One way would be to create a column of boolean values like this:
>>> df['filter'] = (df['a'] >= 20) & (df['b'] >= 20)
a b c filter
0 1 50 1 False
1 10 60 30 False
2 20 55 1 True
3 3 0 0 False
4 10 0 0 False
You can then change the boolean values to 'pass' or 'fail' using replace:
>>> df['filter'].astype(object).replace({False: 'fail', True: 'pass'})
0 fail
1 fail
2 pass
3 fail
4 fail
You can extend this to more columns using all. For example, to find rows across the columns with entries greater than 0:
>>> cols = ['a', 'b', 'c'] # a list of columns to test
>>> df[cols] > 0
a b c
0 True True True
1 True True True
2 True True True
3 True False False
4 True False False
Using all across axis 1 of this DataFrame creates the new column:
>>> (df[cols] > 0).all(axis=1)
0 True
1 True
2 True
3 False
4 False
dtype: bool
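The boolean mask and the 'pass'/'fail' labels can also be produced in one step with numpy.where. This is a minimal sketch: threshold stands in for the question's x and is assumed to be 20, as in the question's loop, and cols lists the columns to test.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 10, 20, 3, 10], "b": [50, 60, 55, 0, 0], "c": [1, 30, 1, 0, 0]})

threshold = 20     # assumed value of x, taken from the question's loop
cols = ["a", "b"]  # columns that must all reach the threshold

# True where every listed column is >= threshold.
passed = (df[cols] >= threshold).all(axis=1)

# Map the mask straight to labels.
df["filter"] = np.where(passed, "pass", "fail")
print(df)
```

Adding more columns to cols extends the test without changing the rest of the code.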
