Count occurrences of strings in a row Pandas - python

I'm trying to count the number of instances of a certain string in a row of a pandas dataframe.
In the example here I used a lambda function and pandas .count() to try to count the number of times 'True' exists in each row.
But instead of a count of 'True' it just returns a boolean for whether or not the string exists in the row...
import pandas as pd

#create dataframe
d = {'Period': [1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4],
'Result': ['True','None','False','True','False','True','False','True','True','False','False','True','False','True','False','False'],
'Result1': ['True','None','False','True','False','True','False','True','True','False','False','True','False','True','False','False'],
'Result2': ['True','None','False','True','False','True','False','True','True','False','False','True','False','True','False','False']}
df = pd.DataFrame(data=d)
#count instances of 'True' in each row
df['Count'] = df.apply(lambda row: row.astype(str).str.count('True').any(), axis=1)
print(df)
The desired outcome is:
Period Result Result1 Result2 Count
1      True   True    True    3
2      None   None    None    0
3      False  False   False   0
4      True   True    True    3
1      False  False   False   0
2      True   True    True    3
3      False  False   False   0
...    ...    ...     ...     ...

You can use np.where:
df['count'] = np.where(df == 'True', 1, 0).sum(axis=1)
Regarding why your apply returns a boolean: both any and all return booleans, not numbers.
Edit: You can include df.isin for multiple conditions:
df['count'] = np.where(df.isin(['True', 'False']), 1, 0).sum(axis=1)
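As a quick sanity check of this approach (on a small hypothetical frame, not the question's full data):

```python
import numpy as np
import pandas as pd

# Tiny frame (made-up data) to sanity-check the counting logic
df = pd.DataFrame({'Result': ['True', 'None'],
                   'Result1': ['True', 'False']})

# 1 where a cell equals the string 'True', 0 elsewhere, then sum per row
counts = np.where(df == 'True', 1, 0).sum(axis=1)
print(counts.tolist())  # [2, 0]
```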

Use eq with sum:
df.eq("True").sum(axis=1)
Or use apply with a lambda function:
df.apply(lambda x: x.eq("True").sum(), axis=1)
For matching more than one string, try:
df.iloc[:,1:].apply(lambda x: x.str.contains("True|False")).sum(axis=1)

Avoid using the apply function, as it can be slow:
df[["Result", "Result1", "Result2"]].sum(axis=1).str.count("True")
This also works when you have strings like:
"this sentence contains True"
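Why this works: summing string columns row-wise concatenates them, so str.count then finds the substring anywhere in the joined row. A small sketch with made-up strings:

```python
import pandas as pd

# Hypothetical frame mixing exact and embedded matches
df = pd.DataFrame({'Result': ['True', 'it is True'],
                   'Result1': ['False', 'True too']})

# Row-wise sum of object columns concatenates the strings,
# then str.count counts 'True' substrings in each joined row
counts = df[['Result', 'Result1']].sum(axis=1).str.count('True')
print(counts.tolist())  # [1, 2]
```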

Your lambda is not working correctly; try this:
import pandas as pd
#create dataframe
d = {'Period': [1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4],
'Result': ['True','None','False','True','False','True','False','True','True','False','False','True','False','True','False','False'],
'Result1': ['True','None','False','True','False','True','False','True','True','False','False','True','False','True','False','False'],
'Result2': ['True','None','False','True','False','True','False','True','True','False','False','True','False','True','False','False']}
df = pd.DataFrame(data=d)
#count instances of 'True' in each row
df['Count'] = df.apply(lambda row: sum(row[1:4] == 'True') ,axis=1)
print(df)
# Output:
# >> Period Result Result1 Result2 Count
# >> 0 1 True True True 3
# >> 1 2 None None None 0
# >> 2 3 False False False 0
# >> 3 4 True True True 3
# >> 4 1 False False False 0
# >> 5 2 True True True 3
# >> 6 3 False False False 0
# >> 7 4 True True True 3
# >> 8 1 True True True 3
# >> 9 2 False False False 0
# >> 10 3 False False False 0
# >> 11 4 True True True 3
# >> 12 1 False False False 0
# >> 13 2 True True True 3
# >> 14 3 False False False 0
# >> 15 4 False False False 0

Related

Pandas dataframe conditional statement didn't give me what I expected

I have a dataframe like this
import numpy as np
import pandas as pd
lbl = [0, 1, 2, 3]
lbl2 = [0, 1, 2, 3, 4, 5]
label = lbl + lbl2
df = pd.DataFrame({"label":label})
#matching lbl and lbl2
pairs = []
for i in range(3):
    pair = (i, i + 1)
    pairs.append(pair)
When I ran this in the terminal
(df.loc[:,'label'][df.index > (num_old_lbl - 1)]) & (df['label'] == pairs[0][1])
I got what I expected, but when I changed the line to
(df.loc[:,'label'][df.index > (num_old_lbl - 1)]) & (df['label'] == pairs[1][1])
every value shows False. I expected row 6 to be True.
Using the following form (with num_old_lbl set to 3), we get the expected result:
>>> (df.index > (num_old_lbl - 1)) & (df['label'] == pairs[1][1])
0 False
1 False
2 False
3 False
4 False
5 False
6 True
7 False
8 False
9 False
Name: label, dtype: bool
It also works as expected with the given example:
>>> (df.index > (num_old_lbl - 1)) & (df['label'] == pairs[0][1])
0 False
1 False
2 False
3 False
4 False
5 True
6 False
7 False
8 False
9 False
Name: label, dtype: bool
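The underlying issue is index alignment: slicing the label column first yields a shorter Series, and & aligns the two operands on their indexes before combining them. Keeping both operands full-length avoids this; a minimal sketch using the question's data (with num_old_lbl assumed to be 3):

```python
import pandas as pd

# Same labels as in the question: lbl + lbl2
df = pd.DataFrame({"label": [0, 1, 2, 3] + [0, 1, 2, 3, 4, 5]})
num_old_lbl = 3

# df.index > ... is a full-length boolean array, so & pairs up
# with df['label'] == 2 row by row instead of realigning labels
mask = (df.index > (num_old_lbl - 1)) & (df['label'] == 2)
print(mask[mask].index.tolist())  # [6]
```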

find first non-null & non-empty string value

I was using this to find the first non null value of a string:
def get_first_non_null_values(df):
    first_non_null_values = []
    try:
        kst = df['kst'].loc[df['kst'].first_valid_index()]
        first_non_null_values.append(kst)
    except:
        kst = df['kst22'].loc[df['kst22'].first_valid_index()]
        first_non_null_values.append(kst)
    return first_non_null_values
first_non_null_values = get_first_non_null_values(df_merged)
This worked, but in my new dataset I have some null values and some "" empty strings. How can I modify this so that I can extract the first value which is neither null nor an empty string?
I think you need:
df = pd.DataFrame({'col': ['', np.nan, '', 1, 2, 3]})
print(df['col'].loc[df['col'].replace('', np.nan).first_valid_index()])
You can use a combination of notnull/astype(bool) and idxmax:
(df['col'].notnull()&df['col'].astype(bool)).idxmax()
Example input:
>>> df = pd.DataFrame({'col': ['', float('nan'), False, None, 0, 'A', 3]})
>>> df
col
0
1 NaN
2 False
3 None
4 0
5 A
6 3
output: 5
Null and truthy states:
     col  notnull  astype(bool)   both
0           True      False      False
1    NaN   False       True      False
2  False    True      False      False
3   None   False      False      False
4      0    True      False      False
5      A    True       True       True
6      3    True       True       True
First non-empty string value:
If you're only interested in strings that are not empty:
df['col'].str.len().gt(0).idxmax()
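Putting the notnull/astype(bool) idea together on the example input above (a sketch; note that astype(bool) treats NaN as truthy, which is exactly why the notnull check is needed):

```python
import pandas as pd

df = pd.DataFrame({'col': ['', float('nan'), False, None, 0, 'A', 3]})

# Keep rows that are both non-null and truthy; idxmax returns
# the label of the first True in the combined mask
pos = (df['col'].notnull() & df['col'].astype(bool)).idxmax()
print(pos)  # 5
```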

add a sequence number to every sub-block of True value in column

I have a boolean column. Every time a block of True values occurs in the column, I want to count that as one event. I tried cumcount and groupby.ngroup to see if I could find a solution there. cumcount doesn't count False values, which is useful, but I was unable to figure out where to go from there. So I want to create a new column event_number in a dataframe.
Assumption: The data is correctly sorted.
import pandas as pd
mydata = {"event_bool": [False,False,False,True,True,True,True,False,False,True,True,True,
False,False,False,True,True,True,True,False,False,True,True,True]}
# expected result - '0' can be something else like NaN
myresult = {"event_bool": [False,False,False,True,True,True,True,False,False,True,True,True,
False,False,False,True,True,True,True,False,False,True,True,True],
"event_number": [0,0,0,1,1,1,1,0,0,2,2,2,0,0,0,3,3,3,3,0,0,4,4,4]}
df = pd.DataFrame(myresult)
Using pandas and numpy.where:
import pandas as pd
import numpy as np
df = pd.DataFrame(mydata)
df['event_number'] = np.where(
    df['event_bool'].eq(True),
    (df['event_bool'].ne(df['event_bool'].shift()) & df['event_bool'].eq(True)).cumsum(),
    0)
print(df)
event_bool event_number
0 False 0
1 False 0
2 False 0
3 True 1
4 True 1
5 True 1
6 True 1
7 False 0
8 False 0
9 True 2
10 True 2
11 True 2
12 False 0
13 False 0
14 False 0
15 True 3
16 True 3
17 True 3
18 True 3
19 False 0
20 False 0
21 True 4
22 True 4
23 True 4
df.to_dict(orient='list')
{'event_bool': [False, False, False, True, True, True, True, False, False, True, True, True, False, False, False, True, True, True, True, False, False, True, True, True], 'event_number': [0, 0, 0, 1, 1, 1, 1, 0, 0, 2, 2, 2, 0, 0, 0, 3, 3, 3, 3, 0, 0, 4, 4, 4]}
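The core idiom in this answer is edge detection with shift plus numbering with cumsum; stripped down to a short, made-up Series it looks like this:

```python
import pandas as pd

s = pd.Series([False, True, True, False, True])

# A block start is a True whose previous value was not True
edges = s & ~s.shift(fill_value=False)
# Number the starts, then zero out the False rows
event_number = edges.cumsum().where(s, 0)
print(event_number.tolist())  # [0, 1, 1, 0, 2]
```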
try:
df = pd.DataFrame(mydata)
df = df.reset_index()
df['rank'] = df[df.event_bool]['index'].diff().ne(1).cumsum()
df.drop('index', axis=1, inplace=True)
df['rank'] = df['rank'].fillna(0).astype(int)
    event_bool  rank
0        False     0
1        False     0
2        False     0
3         True     1
4         True     1
5         True     1
6         True     1
7        False     0
8        False     0
9         True     2
10        True     2
11        True     2
12       False     0
13       False     0
14       False     0
15        True     3
16        True     3
17        True     3
18        True     3
19       False     0
20       False     0
21        True     4
22        True     4
23        True     4
Is this what you want?
mydata = {"event_bool": [False,False,False,True,True,True,True,False,False,True,True,True,False,False,False,True,True,True,True,False,False,True,True,True]}
event_number = []
true = 1
for i in range(len(mydata['event_bool'])):
    if mydata['event_bool'][i] == True:
        if mydata['event_bool'][i-1] == False:
            true = true + 1
            event_number.append(true - 1)
        else:
            event_number.append(true - 1)
    else:
        event_number.append(0)
print(event_number)
Output:
[0, 0, 0, 1, 1, 1, 1, 0, 0, 2, 2, 2, 0, 0, 0, 3, 3, 3, 3, 0, 0, 4, 4, 4]

Pandas: How to create a column that indicates when a value is present in another column a set number of rows in advance?

I'm trying to work out how to create a column that indicates, X rows in advance, when the next occurrence of a value in another column will occur, in essence performing the following (here X = 3):
df
rowid event indicator
1 True 1 # Event occurs
2 False 0
3 False 0
4 False 1 # Starts indicator
5 False 1
6 True 1 # Event occurs
7 False 0
Apart from doing an iterative/recursive loop through every row:
i = df.index[df['event']==True]
dfx = [df.index[z-X:z] for z in i]
df['indicator'][dfx]=1
df['indicator'].fillna(0)
However this seems inefficient; is there a more succinct method of achieving the above? Thanks
Here's a NumPy based approach using flatnonzero:
X = 3
# ndarray of indices where indicator should be set to one
nd_ixs = np.flatnonzero(df.event)[:,None] - np.arange(X-1, -1, -1)
# flatten the indices
ixs = nd_ixs.ravel()
# filter out negative indices and set to 1
df['indicator'] = 0
df.loc[ixs[ixs>=0], 'indicator'] = 1
print(df)
rowid event indicator
0 1 True 1
1 2 False 0
2 3 False 0
3 4 False 1
4 5 False 1
5 6 True 1
6 7 False 0
Where nd_ixs is obtained through the broadcasted subtraction of the indices where event is True and an arange up to X:
print(nd_ixs)
array([[-2, -1, 0],
[ 3, 4, 5]], dtype=int64)
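To see the broadcasting step in isolation (same event pattern as the example, X = 3):

```python
import numpy as np

event = np.array([True, False, False, False, False, True, False])
X = 3

# Each event index, minus [2, 1, 0], gives the X positions whose
# indicator should be 1, ending at the event row itself
nd_ixs = np.flatnonzero(event)[:, None] - np.arange(X - 1, -1, -1)
print(nd_ixs.tolist())  # [[-2, -1, 0], [3, 4, 5]]
```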
A pandas and numpy solution:
# Make a variable shift:
def var_shift(series, X):
    return [series] + [series.shift(i) for i in range(-X + 1, 0, 1)]
X = 3
# Set indicator to default to 1
df["indicator"] = 1
# Use pd.Series.where and np.logical_or with the
# var_shift function to get a bool array, setting
# 0 when False
df["indicator"] = df["indicator"].where(
    np.logical_or.reduce(var_shift(df["event"], X)),
    0,
)
# rowid event indicator
# 0 1 True 1
# 1 2 False 0
# 2 3 False 0
# 3 4 False 1
# 4 5 False 1
# 5 6 True 1
# 6 7 False 0
In [77]: np.logical_or.reduce(var_shift(df["event"], 3))
Out[77]: array([True, False, False, True, True, True, nan], dtype=object)

Find first 'True' value in blocks in pandas data frame

I have a dataframe, where one column contains only True or False values in blocks. For example:
df =
b
0 False
1 True
2 True
3 False
4 True
5 True
6 True
7 True
8 False
9 False
10 False
11 False
12 False
13 True
14 True
15 True
I need to find the beginning of block with True:
>> find_first_true(df)
>> array([1, 4, 13])
Any elegant solutions?
EDIT
Thanks for the proposed solution. I am wondering, what's the easiest way to extract blocks of a certain length, starting from the indices I found?
For example, I need to take blocks (a number of rows) of length 4 before the indices. So, if my indices (found previously) are
index = array([1, 4, 13])
then I need blocks:
[df.loc[0:4], df.loc[9:13]]
or
b
0 False
1 True
2 True
3 False
4 True
9 False
10 False
11 False
12 False
13 True
I am looping over the indices, but wonder about a more pandas-style solution.
In [2]: df = pd.read_clipboard()
In [3]: df
Out[3]:
b
0 False
1 True
2 True
3 False
4 True
5 True
6 True
7 True
8 False
9 False
10 False
11 False
12 False
13 True
14 True
15 True
In [11]: np.where(((df.b != df.b.shift(1)) & df.b).values)[0]
Out[11]: array([ 1, 4, 13], dtype=int64)
def find_first_true(df):
    # finds 1-based indexes of true elements, then shifts back to 0-based
    a = list(map(lambda e: e[0] + 1 if e[1] else 0, enumerate(df['b'])))
    a = list(filter(bool, a))
    a = list(map(lambda x: x - 1, a))
    # removes consecutive elements, keeping only each block's start
    ta = [0] + list(filter(lambda x: a[x] - a[x-1] != 1, range(1, len(a))))
    a = list(map(lambda x: a[x], ta))
    return a
find_first = []
# stop one row early so df.loc[i+1, 'b'] stays in range
for i in range(len(df) - 1):
    if (df.loc[i, 'b'] == False and df.loc[i+1, 'b'] == True):
        find_first.append(i+1)
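For the follow-up in the EDIT, one possible sketch: take the shift-based block starts, then slice a window ending at each start with loc, skipping starts without enough preceding rows. This reproduces the [df.loc[0:4], df.loc[9:13]] example; whether the start at index 1 should be dropped or padded is an assumption:

```python
import numpy as np
import pandas as pd

b = [False, True, True, False, True, True, True, True,
     False, False, False, False, False, True, True, True]
df = pd.DataFrame({'b': b})

# Indices where a True block begins
starts = np.where(((df.b != df.b.shift(1)) & df.b).values)[0]
print(starts.tolist())  # [1, 4, 13]

# Window of the 4 rows before each start, plus the start row itself;
# starts too close to the top are skipped
blocks = [df.loc[s - 4:s] for s in starts if s - 4 >= 0]
print([blk.index.tolist() for blk in blocks])  # [[0, 1, 2, 3, 4], [9, 10, 11, 12, 13]]
```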
