identify blocks of consecutive True values with tolerance - python

I have a boolean pandas DataFrame:
import pandas as pd
w=pd.DataFrame(data=[True,False,True,True,True,False,False,True,False,True,True,False,True])
I am trying to identify the blocks of True values that are at least N long.
I can do that (as suggested elsewhere on SO) with
N=3.0
b = w.ne(w.shift()).cumsum() *w
m = b[0].map(b[0].mask(b[0] == 0).value_counts()) >= N
which works fine and returns
m
0 False
1 False
2 True
3 True
4 True
5 False
6 False
7 False
8 False
9 False
10 False
11 False
12 False
Now I need to do the same but allow for some tolerance in determining the blocks: I would like to identify all blocks that are at least N long, while allowing for M values (arbitrarily placed within the block) to be False.
For example, given w, N=3 and M=1:
w
0 True
1 False
2 True
3 True
4 True
5 False
6 False
7 True
8 False
9 True
10 True
11 False
12 True
the result should differ from the previous one at the values marked with **:
desired=
0 **True**
1 **True**
2 True
3 True
4 True
5 False
6 False
7 True
8 **True**
9 True
10 True
11 **True**
12 True

I believe you can reuse that solution: invert m with ~, apply the same run-length counting to the inverted mask to pick the runs of length M, and finally chain both conditions with | (or):
N = 3.0
M = 1
b = w.ne(w.shift()).cumsum() *w
m = b[0].map(b[0].mask(b[0] == 0).value_counts()) <= N
w1 = ~m
b1 = w1.ne(w1.shift()).cumsum() * w1
m1 = b1.map(b1.mask(b1 == 0).value_counts()) == M
m = m | m1
print (m)
0 True
1 True
2 True
3 True
4 True
5 False
6 False
7 True
8 True
9 True
10 True
11 True
12 True
Name: 0, dtype: bool
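For comparison, here is a minimal sketch of one possible reading of "tolerance", not necessarily what the asker intended: every run of False values that is at most M long is filled in first (including runs at the very start or end), and only then are the True blocks of at least N kept. The helper names run_lengths and blocks_with_tolerance are hypothetical, and the sketch works on a Series rather than a one-column DataFrame.

```python
import pandas as pd

def run_lengths(s: pd.Series) -> pd.Series:
    """For each element, the length of the run of equal consecutive values it belongs to."""
    run_id = s.ne(s.shift()).cumsum()          # a new id every time the value changes
    return run_id.map(run_id.value_counts())   # size of each run, broadcast to its members

def blocks_with_tolerance(s: pd.Series, n: int, m: int) -> pd.Series:
    """Assumed interpretation: fill False gaps of length <= m, then keep True runs of length >= n."""
    filled = s | (~s & (run_lengths(s) <= m))
    return filled & (run_lengths(filled) >= n)

w = pd.Series([True, False, True, True, True, False, False,
               True, False, True, True, False, True])
result = blocks_with_tolerance(w, n=3, m=1)
print(result)
```

With m=0 no gap is filled, so the function reduces to the original "blocks of at least N" mask. Note that this tolerates each short gap separately; it does not cap the total number of False values per block.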

Related

In python, how to shift and fill with a specific values for all the shifted rows in DataFrame?

I have the following dataframe.
y = pd.DataFrame(np.zeros((10,1), dtype = 'bool'), columns = ['A'])
y.iloc[[3,5], 0] = True
A
0 False
1 False
2 False
3 True
4 False
5 True
6 False
7 False
8 False
9 False
I want to set the rows following each True to True as well, so that every True covers three rows in total. The expected result is shown below.
A
0 False
1 False
2 False
3 True
4 True
5 True
6 True
7 True
8 False
9 False
I can do that in the following way, but I wonder if there is a smarter way to do so.
y['B'] = y['A'].shift()
y['C'] = y['B'].shift()
y['D'] = y.any(axis = 1)
y['A'] = y['D']
y = y['A']
Thank you for the help in advance.
I use the limit parameter of forward filling: first replace False with missing values, forward fill at most two rows, and finally replace the remaining NaNs with False:
y.A = y.A.replace(False, np.nan).ffill(limit=2).fillna(False)
print (y)
A
0 False
1 False
2 False
3 True
4 True
5 True
6 True
7 True
8 False
9 False
Another idea is to use Rolling.apply with any to test for at least one True per window:
y.A = y.A.rolling(3, min_periods=1).apply(lambda x: x.any()).astype(bool)
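A possible variant of the rolling idea, sketched under the assumption that the column is first cast to integers: the rolling maximum is 1 whenever the window contains at least one True, which avoids calling a Python lambda per window. The names window and spread are made up for this illustration.

```python
import numpy as np
import pandas as pd

# Same frame as in the question
y = pd.DataFrame(np.zeros((10, 1), dtype="bool"), columns=["A"])
y.iloc[[3, 5], 0] = True

window = 3  # each True should cover itself plus the next two rows
# Cast to int so the rolling max is well defined, then back to bool
spread = y["A"].astype(int).rolling(window, min_periods=1).max().astype(bool)
print(spread)
```

The window looks backwards from each row, so a row becomes True if it or one of the two rows before it was True, which is the same as spreading each True forward.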

obtain values in dataframe instead of boolean

I have a dataframe of calculated distances as follows,
x_y_data = pd.read_csv("x_y_points400_labeled_20pnts_csv.csv")
x = x_y_data.loc[:,'x']
y = x_y_data.loc[:,'y']
xs=x.to_numpy()
ys=y.to_numpy()
result = pd.DataFrame(np.sqrt((xs[:, None] - xs)**2 + (ys[:, None] - ys)**2))
I get the results for all distances,
0 1 2 ... 10 11 12
0 0.000000 16.132750 33.039985 ... 17.628989 27.273213 20.898938
1 16.132750 0.000000 16.912458 ... 16.658800 17.480346 25.375308
2 33.039985 16.912458 0.000000 ... 27.985766 19.625398 37.343842
3 10.140420 25.301309 41.896450 ... 20.173079 32.241763 18.523634
4 9.368331 9.228014 25.210365 ... 10.518585 18.039020 17.464249
Now I want to obtain only the values of the dataframe that are less than 12,
but simply adding result2=result<12 gives me a table of booleans,
result2:
0 1 2 3 4 ... 8 9 10 11 12
0 True False False True True ... False False False False False
1 False True False False True ... False False False False False
2 False False True False False ... True False False False False
3 True False False True False ... False True False False False
4 True True False False True ... False False True False False
whereas I want just the values that are less than 12 and not equal to zero. Can you please help?
Please Try
result[result < 12].fillna('Morethan12')
or
result[result < 12].unstack().fillna('Morethan12')
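To also exclude the zeros, combine both conditions with &, for example result[(result > 0) & (result < 12)]; a chained comparison such as 0 < result < 12 raises a ValueError on a DataFrame.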
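As a small self-contained illustration of keeping the values rather than the booleans, the sketch below uses three made-up points (the CSV from the question is not available, so xs and ys here are assumptions) and keeps only the distances strictly between 0 and 12.

```python
import numpy as np
import pandas as pd

# Hypothetical coordinates standing in for the CSV data
xs = np.array([0.0, 3.0, 20.0])
ys = np.array([0.0, 4.0, 0.0])

# Pairwise distance matrix, built the same way as in the question
result = pd.DataFrame(np.sqrt((xs[:, None] - xs) ** 2 + (ys[:, None] - ys) ** 2))

# Keep the original values where 0 < distance < 12, NaN elsewhere
close = result.where((result > 0) & (result < 12))

# Long format: one row per (row, column) pair that passed the filter
pairs = close.stack().dropna()
print(close)
print(pairs)
```

DataFrame.where keeps the shape of the matrix, while the stacked version lists only the surviving pairs, which may be easier to read for a large matrix.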

How to change the bool values in pd.Series if the count is more than some value in pandas

I have the following bool series and need to check the runs of consecutive False values: if a run contains more than 2 False values it should stay False, but if it contains 2 or fewer, its values should be inverted from False to True.
Expected output: the first run of False has length two, so it changes to True; the run of False after the following True values is longer than two, so those values stay False.
How can I perform this using pandas functions?
True
True
True
True
True
False
False
True
True
True
True
True
True
True
False
False
False
False
False
False
True
True
True
True
Let us try something different
s=df.cumsum().mask(df)
df=df.mask(s.isin(s.value_counts()[s.value_counts()<=2].index),True)
df
0 True
1 True
2 True
3 True
4 True
5 True
6 True
7 True
8 True
9 True
10 True
11 True
12 True
13 True
14 False
15 False
16 False
17 False
18 False
19 False
20 True
21 True
22 True
23 True
Name: a, dtype: bool
Try using groupby and cumsum to generate unique groups of False values, then get the count of each group. If that count is less than three, invert that group using ~ and apply it back to the series with mask:
s.mask(s.groupby((s).cumsum().where(~s)).transform('count') < 3, ~s)
Output:
0 True
1 True
2 True
3 True
4 True
5 True
6 True
7 True
8 True
9 True
10 True
11 True
12 True
13 True
14 False
15 False
16 False
17 False
18 False
19 False
20 True
21 True
22 True
23 True
Name: 0, dtype: bool
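Neither answer shows how the series is built, so here is a reproducible sketch that constructs the 24 values from the question and flips the short False runs using run lengths. The names run_id, run_len and max_gap are illustrative assumptions, not taken from the answers.

```python
import pandas as pd

# The series from the question: 5 True, 2 False, 7 True, 6 False, 4 True
s = pd.Series([True] * 5 + [False] * 2 + [True] * 7 + [False] * 6 + [True] * 4)

max_gap = 2  # False runs of this length or shorter are flipped to True

run_id = s.ne(s.shift()).cumsum()            # a new id every time the value changes
run_len = run_id.map(run_id.value_counts())  # length of the run each element belongs to

# True values stay True; False values become True only in short runs
result = s | (run_len <= max_gap)
print(result)
```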

Difference between `Series.str.contains("|")` and `Series.apply(lambda x:"|" in x)` in pandas?

This is the code for testing:
import numpy as np # requires the numpy package to be installed
import pandas as pd # requires the pandas package to be installed
data = ['Romance|Fantasy|Family|Drama', 'War|Adventure|Science Fiction',
'Action|Family|Science Fiction|Adventure|Mystery', 'Action|Drama',
'Action|Drama|Thriller', 'Drama|Romance', 'Comedy|Drama', 'Action',
'Comedy', 'Crime|Comedy|Action|Adventure',
'Drama|Thriller|History', 'Action|Science Fiction|Thriller']
a = pd.Series(data)
print(a.str.contains("|"))
print(a.apply(lambda x:"|" in x))
print(a)
After executing the code above, you will get three outputs. The output of print(a.str.contains("|")) is:
0 True
1 True
2 True
3 True
4 True
5 True
6 True
7 True
8 True
9 True
10 True
11 True
dtype: bool
The output of print(a.apply(lambda x:"|" in x)) is:
0 True
1 True
2 True
3 True
4 True
5 True
6 True
7 False
8 False
9 True
10 True
11 True
dtype: bool
print(a) shows the original strings: entries 7 and 8 ('Action' and 'Comedy') do not contain |. However, print(a.str.contains("|")) returns True for every entry. What is wrong here?
| has a special meaning in regex (alternation), so an unescaped | matches the empty string in every value; you need to escape it:
In [2]: a.str.contains(r"\|")
Out[2]:
0 True
1 True
2 True
3 True
4 True
5 True
6 True
7 False
8 False
9 True
10 True
11 True
dtype: bool
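Two other ways to search for a literal |, shown as a short sketch on a shortened version of the data: passing regex=False to str.contains so the pattern is treated as plain text, or building the escaped pattern with re.escape instead of writing the backslash by hand.

```python
import re
import pandas as pd

a = pd.Series(["Action|Drama", "Action", "Comedy", "Drama|Romance"])

# Treat the pattern as a plain substring, not as a regular expression
literal = a.str.contains("|", regex=False)

# Or let re.escape produce the escaped pattern
escaped = a.str.contains(re.escape("|"))

# For comparison: the unescaped pattern matches the empty string everywhere
unescaped = a.str.contains("|")

print(literal)
```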

Groupby with boolean condition True in one of the columns in Pandas

This is the dataframe on which I want to use groupby:
Value Boolean1 Boolean2
5.175603 False False
5.415855 False False
5.046997 False False
4.607749 True False
5.140482 False False
1.796552 False False
0.139924 False True
4.157981 False True
4.893860 False False
5.091573 False False
6 True False
6.05 False False
I want to group by the Boolean1 and Boolean2 columns. A group starts at a row where both columns are False and runs through the following rows where either column is True; the next group then starts at the next all-False row. If there are no more True values, the remaining False rows (and their values) can either be ignored or kept.
I want to achieve something similar to this:
Value Boolean1 Boolean2
This is one group
5.175603 False False
5.415855 False False
5.046997 False False
4.607749 True False
This is another one
5.140482 False False
1.796552 False False
0.139924 False True
4.157981 False True
And this is another one
4.893860 False False
5.091573 False False
6 True False
My idea is to flag the rows where both columns are False and start a new group at the beginning of each run of such rows:
# chain the conditions with OR and invert: True where both columns are False
m = ~(df['Boolean1'] | df['Boolean2'])
# a new group starts where m switches to True (AND with m keeps only those starts);
# cumsum turns the starts into consecutive group ids
s = (m.ne(m.shift()) & m).cumsum()
for i, x in df.groupby(s):
    print (x)
Value Boolean1 Boolean2
0 5.175603 False False
1 5.415855 False False
2 5.046997 False False
3 4.607749 True False
Value Boolean1 Boolean2
4 5.140482 False False
5 1.796552 False False
6 0.139924 False True
7 4.157981 False True
Value Boolean1 Boolean2
8 4.893860 False False
9 5.091573 False False
10 6.000000 True False
Value Boolean1 Boolean2
11 6.05 False False
Detail:
print (m)
0 True
1 True
2 True
3 False
4 True
5 True
6 False
7 False
8 True
9 True
10 False
11 True
dtype: bool
print (s)
0 1
1 1
2 1
3 1
4 2
5 2
6 2
7 2
8 3
9 3
10 3
11 4
dtype: int32
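The question says the trailing False rows without a True may be ignored. One way to do that, sketched here with the frame rebuilt from the values in the question, is to keep only groups that contain at least one True. The names any_true, keep and groups are assumptions made for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "Value": [5.175603, 5.415855, 5.046997, 4.607749, 5.140482, 1.796552,
              0.139924, 4.157981, 4.893860, 5.091573, 6.0, 6.05],
    "Boolean1": [False, False, False, True, False, False,
                 False, False, False, False, True, False],
    "Boolean2": [False, False, False, False, False, False,
                 True, True, False, False, False, False],
})

any_true = df["Boolean1"] | df["Boolean2"]   # True where either column is True
m = ~any_true                                # True where both columns are False
s = (m.ne(m.shift()) & m).cumsum()           # group id, incremented at each start of an all-False run

# Keep only rows whose group contains at least one True
keep = any_true.groupby(s).transform("any")
groups = [g for _, g in df[keep].groupby(s[keep])]

for g in groups:
    print(g)
```

With this data the last row (6.05) forms a group of its own without any True, so it is dropped and three groups remain.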
