find first non-null & non-empty string value

I was using this to find the first non-null value of a string column:
def get_first_non_null_values(df):
    first_non_null_values = []
    try:
        kst = df['kst'].loc[df['kst'].first_valid_index()]
        first_non_null_values.append(kst)
    except:
        kst = df['kst22'].loc[df['kst22'].first_valid_index()]
        first_non_null_values.append(kst)
    return first_non_null_values
first_non_null_values = get_first_non_null_values(df_merged)
This worked, but my new dataset has some null values and some "" empty strings. How can I modify this so that I extract the first value that is neither null nor an empty string?

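I think you need:
import numpy as np
import pandas as pd
df = pd.DataFrame({'col': ['', np.nan, '', 1, 2, 3]})
# treat empty strings as missing, then look up the first valid index
print(df['col'].loc[df['col'].replace('', np.nan).first_valid_index()])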

You can use a combination of notnull/astype(bool) and idxmax:
(df['col'].notnull() & df['col'].astype(bool)).idxmax()
Example input:
>>> df = pd.DataFrame({'col': ['', float('nan'), False, None, 0, 'A', 3]})
>>> df
     col
0
1    NaN
2  False
3   None
4      0
5      A
6      3
Output: 5
Null and truthy states:
     col  notnull  astype(bool)   both
0            True         False  False
1    NaN    False          True  False
2  False     True         False  False
3   None    False         False  False
4      0     True         False  False
5      A     True          True   True
6      3     True          True   True
First non-empty string value:
If you're only interested in strings that are not empty:
df['col'].str.len().gt(0).idxmax()
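One caveat with the idxmax approaches above: idxmax on an all-False boolean Series still returns the first label, so a column with no usable value is not detected. A minimal sketch of a guard, using a plain boolean mask rather than either answer's exact method (the helper name first_non_empty is made up for illustration):

```python
import numpy as np
import pandas as pd

def first_non_empty(series):
    """Return the first value that is neither null nor '', or None if there is none.

    Hypothetical helper, not part of pandas.
    """
    # True where the value is present and is not the empty string
    mask = series.notna() & series.ne('')
    if not mask.any():
        return None
    return series[mask].iloc[0]

print(first_non_empty(pd.Series(['', np.nan, '', 'abc', 'def'])))
print(first_non_empty(pd.Series(['', np.nan], dtype=object)))
```

Returning None for the "nothing found" case is only one possible design; raising an exception may suit the original try/except fallback between 'kst' and 'kst22' better.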

Related

Count occurrences of strings in a row Pandas

I'm trying to count the number of instances of a certain string in each row of a pandas dataframe.
In the example here I used a lambda function and pandas .count() to try to count the number of times 'True' occurs in each row.
But instead of a count of 'True', it just returns a boolean indicating whether or not it exists in the row...
import pandas as pd
# create dataframe
d = {'Period': [1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4],
     'Result': ['True','None','False','True','False','True','False','True','True','False','False','True','False','True','False','False'],
     'Result1': ['True','None','False','True','False','True','False','True','True','False','False','True','False','True','False','False'],
     'Result2': ['True','None','False','True','False','True','False','True','True','False','False','True','False','True','False','False']}
df = pd.DataFrame(data=d)
# count instances of True or False in each row
df['Count'] = df.apply(lambda row: row.astype(str).str.count('True').any(), axis=1)
print(df)
The desired outcome is:
Period  Result  Result1  Result2  Count
     1    True     True     True      3
     2    None     None     None      0
     3   False    False    False      0
     4    True     True     True      3
     1   False    False    False      0
     2    True     True     True      3
     3   False    False    False      0
   ...     ...      ...      ...    ...

You can use np.where:
df['count'] = np.where(df == 'True', 1, 0).sum(axis=1)
As for why your apply returns a boolean: both any and all return booleans, not numbers.
Edit: You can include df.isin for multiple conditions:
df['count'] = np.where(df.isin(['True', 'False']), 1, 0).sum(axis=1)
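To make the difference between any and sum concrete, here is a small sketch on a single row (the row values are made up for illustration):

```python
import pandas as pd

# hypothetical row of string values, for illustration only
row = pd.Series(['True', 'None', 'True'])

hits = row.str.count('True')   # per-cell number of matches: 1, 0, 1
has_match = hits.any()         # collapses the row to a single boolean
n_matches = hits.sum()         # adds up the per-cell counts

print(has_match, n_matches)
```

Swapping .any() for .sum() in the original lambda is therefore the smallest change that yields a count.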

Use eq with sum:
df.eq("True").sum(axis=1)
Or use apply with a lambda function:
df.apply(lambda x: x.eq("True").sum(), axis=1)
To match more than one string, try:
df.iloc[:,1:].apply(lambda x: x.str.contains("True|False")).sum(axis=1)
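A runnable sketch of the eq/sum idea on a shortened version of the question's data, restricted to the result columns so that the numeric Period column is not compared against a string (the names result_cols and CountEither are assumptions added here):

```python
import pandas as pd

# shortened, slightly varied version of the question's frame
df = pd.DataFrame({
    'Period': [1, 2, 3, 4],
    'Result': ['True', 'None', 'False', 'True'],
    'Result1': ['True', 'None', 'False', 'True'],
    'Result2': ['True', 'None', 'False', 'False'],
})

result_cols = ['Result', 'Result1', 'Result2']

# exact matches of the string 'True' per row
df['Count'] = df[result_cols].eq('True').sum(axis=1)

# exact matches of either 'True' or 'False' per row
df['CountEither'] = df[result_cols].isin(['True', 'False']).sum(axis=1)

print(df)
```

Both eq and isin test for exact equality, unlike str.contains, which also matches substrings.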

Avoid using apply, as it can be slow. Summing string columns row-wise concatenates them, so you can count on the joined string:
df[["Result", "Result1", "Result2"]].sum(axis=1).str.count("True")
This also works when the strings merely contain the word, like:
"this sentence contains True"

Your lambda is not working correctly, try this:
import pandas as pd
# create dataframe
d = {'Period': [1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4],
     'Result': ['True','None','False','True','False','True','False','True','True','False','False','True','False','True','False','False'],
     'Result1': ['True','None','False','True','False','True','False','True','True','False','False','True','False','True','False','False'],
     'Result2': ['True','None','False','True','False','True','False','True','True','False','False','True','False','True','False','False']}
df = pd.DataFrame(data=d)
# count instances of 'True' in the three result columns of each row
df['Count'] = df.apply(lambda row: sum(row[1:4] == 'True'), axis=1)
print(df)
# Output:
# >> Period Result Result1 Result2 Count
# >> 0 1 True True True 3
# >> 1 2 None None None 0
# >> 2 3 False False False 0
# >> 3 4 True True True 3
# >> 4 1 False False False 0
# >> 5 2 True True True 3
# >> 6 3 False False False 0
# >> 7 4 True True True 3
# >> 8 1 True True True 3
# >> 9 2 False False False 0
# >> 10 3 False False False 0
# >> 11 4 True True True 3
# >> 12 1 False False False 0
# >> 13 2 True True True 3
# >> 14 3 False False False 0
# >> 15 4 False False False 0

Filter pandas dataframe columns based on value for columns in a list

I have a dataframe df defined as below :
df = pd.DataFrame()
df["A"] = ['True','True','True','True','True']
df["B"] = ['True','False','True','False','True']
df["C"] = ['False','True','False','True','False']
df["D"] = ['True','True','False','False','False']
df["E"] = ['False','True','True','False','True']
df["F"] = ['HI','HI','HI','HI','HI']
>> df
      A      B      C      D      E   F
0  True   True  False   True  False  HI
1  True  False   True   True   True  HI
2  True   True  False  False   True  HI
3  True  False   True  False  False  HI
4  True   True  False  False   True  HI
and a list
lst = ["A","C"]
I would like to filter the rows in df based on the values being 'True' for the columns in lst.
That is, I would like to get my resultant dataframe as :
      A      B     C      D      E   F
1  True  False  True   True   True  HI
3  True  False  True  False  False  HI
Instead of looping through the column names from the list and filtering it, is there any better solution for this?

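The columns hold the strings 'True' and 'False' rather than booleans, and any non-empty string is truthy, so compare with 'True' first and then use DataFrame.all over the column axis (axis=1):
df[df[lst].eq('True').all(axis=1)]
      A      B     C      D      E   F
1  True  False  True   True   True  HI
3  True  False  True  False  False  HI
Details:
We get the columns in scope with df[lst], compare them with 'True', then use all to check which rows are True in every selected column:
df[lst].eq('True').all(axis=1)
0    False
1     True
2    False
3     True
4    False
dtype: bool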
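Because the question's frame stores the strings 'True' and 'False' rather than booleans, the filter needs an explicit comparison before all is applied. A self-contained sketch of that filter on the question's data (the variable names mask and filtered are added here for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['True', 'True', 'True', 'True', 'True'],
    'B': ['True', 'False', 'True', 'False', 'True'],
    'C': ['False', 'True', 'False', 'True', 'False'],
    'D': ['True', 'True', 'False', 'False', 'False'],
    'E': ['False', 'True', 'True', 'False', 'True'],
    'F': ['HI', 'HI', 'HI', 'HI', 'HI'],
})
lst = ['A', 'C']

# True only where every listed column equals the string 'True'
mask = df[lst].eq('True').all(axis=1)
filtered = df[mask]

print(filtered)
```

If the columns were real booleans, df[lst].all(axis=1) on its own would be enough.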

inconsistent any vs all pd dataframe

This was asked in other forums but with focus on nan.
I have a simple dataframe:
y=[[1,2,3,4,1],[1,2,0,4,5]]
df = pd.DataFrame(y)
I am having difficulties understanding how any and all work. According to the pandas documentation 'any' returns "...whether any element is True over requested axis".
If I use:
~(df == 0)
Out[77]:
0 1 2 3 4
0 True True True True True
1 True True False True True
~(df == 0).any(1)
Out[78]:
0 True
1 False
dtype: bool
From my understanding the second command means: Return 'True' if any element is True over requested axis, and it should return True, True for both rows (since both contain at least one true value) but instead I get True, False. Why is that?

You need one more pair of parentheses because of operator precedence:
print (df == 0)
0 1 2 3 4
0 False False False False False
1 False False True False False
print (~(df == 0))
0 1 2 3 4
0 True True True True True
1 True True False True True
print ((~(df == 0)).any(1))
0 True
1 True
dtype: bool
Because:
print ((df == 0).any(1))
0 False
1 True
dtype: bool
print (~(df == 0).any(1))
0 True
1 False
dtype: bool

Python interprets your call as:
~ ( (df == 0).any(1) )
So it evaluates any first. Now if we take a look at df == 0, we see:
>>> df == 0
0 1 2 3 4
0 False False False False False
1 False False True False False
So this means that in the first row, there is no such True, in the second one there is, so:
>>> (df == 0).any(1)
0 False
1 True
dtype: bool
Now we negate this with ~, so False becomes True and vice versa:
>>> ~ (df == 0).any(1)
0 True
1 False
dtype: bool
In case we first negate, we see:
>>> (~ (df == 0)).any(1)
0 True
1 True
dtype: bool
Both are True, since in both rows there is at least one column that is True.
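The two readings can be compared side by side. The sketch below passes the axis as a keyword (axis=1), since, as far as I know, recent pandas releases no longer accept it positionally; the variable names are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4, 1], [1, 2, 0, 4, 5]])
is_zero = df == 0

# method call binds tighter than ~: "the row contains no zero"
no_zero_in_row = ~is_zero.any(axis=1)

# negate first, then reduce: "the row contains at least one non-zero"
some_nonzero_in_row = (~is_zero).any(axis=1)

# by De Morgan's law, "no zero" is the same as "all non-zero"
all_nonzero_in_row = (~is_zero).all(axis=1)

print(no_zero_in_row.tolist())
print(some_nonzero_in_row.tolist())
print(all_nonzero_in_row.tolist())
```

The last line shows where all comes in: negating any over a mask is equivalent to all over the negated mask.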

What's the fastest way to loop through a DataFrame and count occurrences within the DataFrame while some condition is fulfilled (in Python)?

I have a dataframe with two Boolean fields (as below).
import pandas as pd
d = [{'a1':False, 'a2':False}, {'a1':True, 'a2':False}, {'a1':True, 'a2':False}, {'a1':False, 'a2':False}, {'a1':False, 'a2':True},
{'a1': False, 'a2': False}, {'a1':False, 'a2':False}, {'a1':True, 'a2':False}, {'a1':False, 'a2':True}, {'a1':False, 'a2':False},]
df = pd.DataFrame(d)
df
Out[1]:
a1 a2
0 False False
1 True False
2 True False
3 False False
4 False True
5 False False
6 False False
7 True False
8 False True
9 False False
I am trying to find the fastest and most "Pythonic" way of achieving the following:
If a1==True, count instances from current row where a2==False (e.g. row 1: a1=True, a2 is False for three rows from row 1)
At first instance of a2==True, stop counting (e.g. row 4, count = 3)
Set value of 'count' to new df column 'a3' on row where counting began (e.g. 'a3' = 3 on row 1)
Target result set as follows.
a1 a2 a3
0 False False 0
1 True False 3
2 True False 2
3 False False 0
4 False True 0
5 False False 0
6 False False 0
7 True False 1
8 False True 0
9 False False 0
I have been trying to accomplish this using for loops, iterrows and while loops and so far haven't been able to produce a good nested combination which provides the results I want. Any help appreciated. I apologize if the problem is not totally clear.

How about this:
df['a3'] = df.apply(lambda x: 0 if not x.a1 else len(df.a2[x.name:df.a2.tolist()[x.name:].index(True)+x.name]), axis=1)
So, if a1 is False, write 0; otherwise write the length of the slice of a2 that runs from that row up to the next True. Note that list.index raises ValueError if no True follows the row in a2.

This will do the trick:
df['a3'] = 0
# loop through every value of 'a1'
for i in range(len(df)):
    # if 'a1' at position i is True...
    if df['a1'][i]:
        count = 0
        # loop over the remaining items in 'a2'
        # remaining: len(df) - i
        # i: position of the True value in 'a1'
        for j in range(len(df) - i):
            # if the value of 'a2' is False...
            if not df['a2'][j + i]:
                # count the occurrences of False values in a row...
                count += 1
            else:
                # ... if it's not False, break the loop
                break
        # write the number of occurrences at the right position (i) in 'a3'
        df.loc[i, 'a3'] = count
This produces the following output:
a1 a2 a3
0 False False 0
1 True False 3
2 True False 2
3 False False 0
4 False True 0
5 False False 0
6 False False 0
7 True False 1
8 False True 0
9 False False 0
Edit: added comments in the code
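Since the question asks for the fastest way, here is a loop-free sketch of the same counting rule. It is not taken from either answer: it walks a2 backwards with cumsum and groupby, and the helper name count_false_run is an assumption. Like the loop above, it counts to the end of the frame when no later True exists in a2.

```python
import pandas as pd

def count_false_run(df):
    """For rows where a1 is True, count consecutive False values in a2 from that row on.

    Hypothetical helper; rows where a1 is False get 0.
    """
    is_false = ~df['a2']
    rev = is_false[::-1]                  # read the column bottom-up
    block = (~rev).cumsum()               # new block id at every True in a2
    run = rev.astype(int).groupby(block).cumsum()
    run = run.reindex(df.index)           # restore the original row order
    return run.where(df['a1'], 0)

a1 = [False, True, True, False, False, False, False, True, False, False]
a2 = [False, False, False, False, True, False, False, False, True, False]
df = pd.DataFrame({'a1': a1, 'a2': a2})
df['a3'] = count_false_run(df)
print(df)
```

The reversed cumulative sum gives every row the length of the False run that starts there, and where then blanks out the rows in which a1 is False.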

Find first 'True' value in blocks in pandas data frame

I have a dataframe, where one column contains only True or False values in blocks. For example:
df =
b
0 False
1 True
2 True
3 False
4 True
5 True
6 True
7 True
8 False
9 False
10 False
11 False
12 False
13 True
14 True
15 True
I need to find the beginning of block with True:
>> find_first_true(df)
>> array([1, 4, 13])
Any elegant solutions?
EDIT
Thanks for the proposed solution. I am wondering, what's the easiest way to extract blocks of a certain length, starting from the indices I found?
For example, I need to take blocks (number of rows) of length 4 before the indices. So, if my indices (found previously)
index = array([1, 4, 13])
then I need blocks:
[df.loc[0:4], df.loc[9:13]]
or
b
0 False
1 True
2 True
3 False
4 True
9 False
10 False
11 False
12 False
13 True
I am looping over the indices, but I wonder whether there is a more pandas-like solution.

In [2]: df = pd.read_clipboard()
In [3]: df
Out[3]:
b
0 False
1 True
2 True
3 False
4 True
5 True
6 True
7 True
8 False
9 False
10 False
11 False
12 False
13 True
14 True
15 True
In [11]: np.where(((df.b != df.b.shift(1)) & df.b).values)[0]
Out[11]: array([ 1, 4, 13], dtype=int64)
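The same shift comparison can be wrapped in a function. This sketch uses shift with fill_value=False so the shifted column stays boolean and a block starting on the very first row is still detected; the function returns positions rather than index labels:

```python
import numpy as np
import pandas as pd

def find_first_true(col):
    """Return the positions where a run of True values begins in a boolean Series."""
    # previous row's value; the first row is treated as preceded by False
    prev = col.shift(1, fill_value=False)
    starts = col & ~prev
    return np.flatnonzero(starts.to_numpy())

b = [False, True, True, False, True, True, True, True,
     False, False, False, False, False, True, True, True]
df = pd.DataFrame({'b': b})

print(find_first_true(df['b']))
```

With the default integer index used in the question, positions and index labels coincide.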

def find_first_true(col):
    # col is the boolean column, e.g. df['b']
    # finds indexes of true elements (shifted by 1 so that index 0 survives the filter)
    a = list(map(lambda e: e[0] + 1 if e[1] else 0, enumerate(col)))
    a = list(filter(bool, a))
    a = list(map(lambda x: x - 1, a))
    # removes consecutive elements, keeping the positions in a where a new block starts
    ta = [0] + list(filter(lambda x: a[x] - a[x-1] != 1, range(1, len(a))))
    a = list(map(lambda x: a[x], ta))
    return a

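find_first = []
# a block that starts on the very first row has no preceding False
if df.loc[0, 'b']:
    find_first.append(0)
# stop one row early so that i+1 stays inside the frame
for i in range(len(df) - 1):
    if not df.loc[i, 'b'] and df.loc[i+1, 'b']:
        find_first.append(i+1)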
