Pandas and NumPy consecutive non-NaN values - Python

I'm trying to use np.where to flag runs of consecutive non-NaN values longer than a certain length, as shown below:
e.g., if there are 3 or more consecutive non-NaN values, return True.
Would appreciate any help!
value  consecutive
nan    False
nan    False
1      False
1      False
nan    False
4      True
2      True
3      True
nan    False
nan    False
1      True
3      True
3      True
5      True

The idea is to create group labels by taking the cumulative sum of the missing-value mask, then map each label to the size of its group with Series.map and Series.value_counts, counting only the non-NaN rows selected by the inverted mask ~m:
# convert values to numeric
df['value'] = df['value'].astype(float)
m = df['value'].isna()
# group label: increases at each NaN, so every non-NaN run shares one label
s = m.cumsum()
N = 3
# map each label to the size of its non-NaN run and compare with the threshold
df['new'] = s.map(s[~m].value_counts()).ge(N) & ~m
print(df)
value consecutive new
0 NaN False False
1 NaN False False
2 1.0 False False
3 1.0 False False
4 NaN False False
5 4.0 True True
6 2.0 True True
7 3.0 True True
8 NaN False False
9 NaN False False
10 1.0 True True
11 3.0 True True
12 3.0 True True
13 5.0 True True
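To see how the grouping works on the question's data, here is a minimal, self-contained sketch of the same steps (the variable names mirror the answer above; the value_counts ordering may differ when you run it):

import numpy as np
import pandas as pd

# same values as in the question
df = pd.DataFrame({'value': [np.nan, np.nan, 1, 1, np.nan, 4, 2, 3,
                             np.nan, np.nan, 1, 3, 3, 5]})
m = df['value'].isna()
s = m.cumsum()                    # label 2 for rows 2-3, label 3 for rows 5-7, label 5 for rows 10-13
print(s[~m].value_counts())       # run length per label: 2 -> 2, 3 -> 3, 5 -> 4
print(s.map(s[~m].value_counts()).ge(3) & ~m)   # reproduces the 'consecutive' column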

Related

How to set values to np.nan with multiple conditions for series?

Let's say I have a dataframe:
A B C D E F
0 x R i R nan h
1 z g j x a nan
2 z h nan y nan nan
3 x g nan nan nan nan
4 x x h x s f
I want to replace all the cells where:
the value in row 0 is R (df.loc[0] == 'R')
the cell is not 'x' (!= 'x')
only rows 2 and below (2:)
with np.nan.
Essentially I want to do:
df.loc[2:,df.loc[0]=='R']!='x' = np.nan
I get the error:
SyntaxError: can't assign to comparison
I just don't know how the syntax is supposed to be.
I've tried
df[df.loc[2:,df.loc[0]=='R']!='x']
but this doesn't list the values I want.
Solution
mask = df.ne('x') & df.iloc[0].eq('R')
mask.iloc[:2] = False
df.mask(mask)
A B C D E F
0 x R i R NaN h
1 z g j x a NaN
2 z NaN NaN NaN NaN NaN
3 x NaN NaN NaN NaN NaN
4 x x h x s f
Explanation
Build the mask up
df.ne('x') gives
A B C D E F
0 False True True True True True
1 True True True False True True
2 True True True True True True
3 False True True True True True
4 False False True False True True
But we want that in conjunction with df.iloc[0].eq('R'), which is a Series. It turns out that if we just & the two together, pandas aligns the Series index with the columns of the mask above.
A False
B True
C False
D True
E False
F False
Name: 0, dtype: bool
# &
A B C D E F
0 False True True True True True
1 True True True False True True
2 True True True True True True
3 False True True True True True
4 False False True False True True
# GIVES YOU
A B C D E F
0 False True False True False False
1 False True False False False False
2 False True False True False False
3 False True False True False False
4 False False False False False False
Finally, we want to exclude the first two rows from these shenanigans so...
mask.iloc[:2] = False
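Putting the pieces together, a minimal self-contained sketch of this solution (it rebuilds the question's frame with np.nan in the empty cells):

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['x', 'z', 'z', 'x', 'x'],
                   'B': ['R', 'g', 'h', 'g', 'x'],
                   'C': ['i', 'j', np.nan, np.nan, 'h'],
                   'D': ['R', 'x', 'y', np.nan, 'x'],
                   'E': [np.nan, 'a', np.nan, np.nan, 's'],
                   'F': ['h', np.nan, np.nan, np.nan, 'f']})

# cells that are not 'x', restricted to columns whose row-0 value is 'R'
mask = df.ne('x') & df.iloc[0].eq('R')
# leave the first two rows untouched
mask.iloc[:2] = False
print(df.mask(mask))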
Try with:
mask = df.iloc[0] !='R'
df.loc[2:, mask] = df.loc[2:,mask].where(df.loc[2:,mask]=='x')
Output:
A B C D E F
0 x R i R NaN h
1 z g j x a NaN
2 NaN h NaN y NaN NaN
3 x g NaN NaN NaN NaN
4 x x NaN x NaN NaN
By your approach:
df[df.loc[2:,df.loc[0]=='R']!='x']=np.nan
Output:
>>> df
A B C D E F
0 x R i R NaN h
1 z g j x a NaN
2 z NaN NaN NaN NaN NaN
3 x NaN NaN NaN NaN NaN
4 x x h x s f

python split dataframe subset with from to condition

I have a dataframe and I have to split it into subsets with these conditions:
start split: c = 1
end split: c = -1
example:
a      b      c
False  False
False  False  -1
False  False
True   False   1    <- start first subset
False  False
False  False   1
False  False   1
False  False
False  False   1
False  False
False  False   1
False  False
False  True   -1    <- end of first subset
False  False
False  False
False  True   -1
False  False
False  False
True   False   1    <- start second subset
False  False  -1    <- end of second subset
This could be a solution, although I'm not sure if it's the most efficient approach. This basically uses cumsum and some and/or logic.
import pandas as pd
import numpy as np

df = pd.DataFrame({'c': [np.nan, 1, np.nan, 1, np.nan, np.nan,
                         -1, np.nan, np.nan, 1, np.nan, np.nan,
                         1, 1, np.nan, -1, -1, 1, -1]})
c
0 NaN
1 1.0
2 NaN
3 1.0
4 NaN
5 NaN
6 -1.0
7 NaN
8 NaN
9 1.0
10 NaN
11 NaN
12 1.0
13 1.0
14 NaN
15 -1.0
16 -1.0
17 1.0
18 -1.0
(
    df
    .assign(
        # mark the first row of each run of identical 1/-1 markers in c
        start_end=lambda df: df.index.isin(
            df
            .loc[lambda df: df.c.isin([1, -1])]
            .loc[lambda df: df.c.shift(1, fill_value=0) != df.c]
            .index),
        # flag subset starts (c == 1) and subset ends (c == -1)
        start=lambda df: np.where(np.logical_and(df.start_end == True, df.c == 1), 1, 0),
        end=lambda df: np.where(np.logical_and(df.start_end == True, df.c == -1), 1, 0),
        # label rows between a start and its matching end; 0 means outside any subset
        subset=lambda df: np.where(df.start.cumsum() != df.end.shift(1, fill_value=0).cumsum(),
                                   df.start.cumsum(),
                                   0)
    )
    .drop(columns=['start_end', 'start', 'end'])
)
c subset
0 NaN 0
1 1.0 1
2 NaN 1
3 1.0 1
4 NaN 1
5 NaN 1
6 -1.0 1
7 NaN 0
8 NaN 0
9 1.0 2
10 NaN 2
11 NaN 2
12 1.0 2
13 1.0 2
14 NaN 2
15 -1.0 2
16 -1.0 0
17 1.0 3
18 -1.0 3
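If the goal is to actually materialize the subsets, the non-zero labels can be fed to groupby; a minimal sketch, assuming the result of the pipeline above has been assigned back to df:

# assumes df now carries the 'subset' column computed above
subsets = {label: grp for label, grp in df[df['subset'] != 0].groupby('subset')}
for label, grp in subsets.items():
    print(f'subset {label}:')
    print(grp)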

pandas groupby column and check if group meets multiple conditions

I have a DataFrame that looks like the following:
X Y Date are_equal
0 50.0 10.0 2018-08-19 False
1 NaN 10.0 2018-08-19 False
2 NaN 50.0 2018-08-19 True
3 10.0 NaN 2018-08-21 False
4 1.0 NaN 2018-08-19 False
5 NaN 10.0 2018-08-22 False
6 10.0 NaN 2018-08-21 False
The are_equal column indicates that a value in Y is in X for the same date (in this case 50.0).
I am trying to group by date and find whether X contains a specific value (say 1.0) for a date that contains are_equal True.
My approach was to use df.iterrows() and get the row at next index after meeting the condition df['are_equal'] == True. However, the rows aren't necessarily ordered.
How can I group by Date and check if a date contains True in are_equal and 1.0 in column X for that same date?
The output I'm trying to achieve is a new Boolean column that looks like this:
contains_specific_value
0 False
1 False
2 False
3 False
4 True
5 False
6 False
Let us use apply; this can accommodate more conditions but is slow. You can check the other solution using transform below.
df['New'] = df.groupby('Date').apply(lambda x: (x['X'] == 1) & x['are_equal'].any()).reset_index(level=0, drop=True)
df
Out[101]:
X Y Date are_equal New
0 50.0 10.0 2018-08-19 False False
1 NaN 10.0 2018-08-19 False False
2 NaN 50.0 2018-08-19 True False
3 10.0 NaN 2018-08-21 False False
4 1.0 NaN 2018-08-19 False True
5 NaN 10.0 2018-08-22 False False
6 10.0 NaN 2018-08-21 False False
Or transform
df['X'].eq(1) & df.groupby('Date')['are_equal'].transform('any')
Out[102]:
0 False
1 False
2 False
3 False
4 True
5 False
6 False
dtype: bool
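For the exact output asked for, the transform result can be assigned to a new column; a minimal self-contained sketch (the column name contains_specific_value comes from the question):

import numpy as np
import pandas as pd

# data from the question
df = pd.DataFrame({'X': [50.0, np.nan, np.nan, 10.0, 1.0, np.nan, 10.0],
                   'Y': [10.0, 10.0, 50.0, np.nan, np.nan, 10.0, np.nan],
                   'Date': ['2018-08-19', '2018-08-19', '2018-08-19', '2018-08-21',
                            '2018-08-19', '2018-08-22', '2018-08-21'],
                   'are_equal': [False, False, True, False, False, False, False]})

# dates with at least one are_equal == True, AND'd row-wise with X == 1
df['contains_specific_value'] = df['X'].eq(1) & df.groupby('Date')['are_equal'].transform('any')
print(df)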

Applying values to a DataFrame without using a for-loop

I'm looking for a faster method of applying values to a column in a DataFrame. The value is based on two True and False values in the first and second column. This is my current solution:
df['result'] = df.check1.astype(int)
for i in range(len(df)):
    if df.result[i] != 1:
        df.result[i] = df.result.shift(1)[i] + df.check2[i].astype(int)
Which yields this result:
check1 check2 result
0 True False 1
1 False False 1
2 False False 1
3 False False 1
4 False False 1
5 False False 1
6 False True 2
7 False False 2
8 False True 3
9 False False 3
10 False True 4
11 False False 4
12 False True 5
13 False False 5
14 False True 6
15 False False 6
16 False True 7
17 False False 7
18 False False 7
19 False False 7
20 False True 8
21 False False 8
22 False True 9
23 True False 1
24 False False 1
So the third column needs to be a number based on the value in the row above it.
If check1 is True, the number needs to go back to 1. If check2 is True, 1 needs to be added to the number. Otherwise the number stays the same.
The current code is fine but it's taking too long as I need to apply this to a DataFrame with approx. 70,000 rows. I'm pretty sure it can be improved (I'm guessing using the apply function, but I'm not sure). Any ideas?
Use pandas.DataFrame.groupby.cumsum:
import pandas as pd
df['result'] = df.groupby(df['check1'].cumsum())[['check1', 'check2']].cumsum().sum(1)
Or @Dan's suggestion:
df['result'] = df.groupby(df['check1'].cumsum())['check2'].cumsum().add(1)
Output:
check1 check2 result
0 True False 1.0
1 False False 1.0
2 False False 1.0
3 False False 1.0
4 False False 1.0
5 False False 1.0
6 False True 2.0
7 False False 2.0
8 False True 3.0
9 False False 3.0
10 False True 4.0
11 False False 4.0
12 False True 5.0
13 False False 5.0
14 False True 6.0
15 False False 6.0
16 False True 7.0
17 False False 7.0
18 False False 7.0
19 False False 7.0
20 False True 8.0
21 False False 8.0
22 False True 9.0
23 True False 1.0
24 False False 1.0
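Why the grouping works: df['check1'].cumsum() increases by one at every True in check1, so every stretch of rows starting at a reset gets its own label; within each label, the running count of check2 plus 1 is exactly the counter. A minimal self-contained sketch on made-up data:

import pandas as pd

df = pd.DataFrame({'check1': [True, False, False, False, True, False],
                   'check2': [False, True, False, True, False, True]})

group = df['check1'].cumsum()   # 1, 1, 1, 1, 2, 2 -> one label per reset
df['result'] = df.groupby(group)['check2'].cumsum().add(1)
print(df['result'].tolist())    # [1, 2, 2, 3, 1, 2]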
You want to iterate a dataframe using the value of the preceding row. In that case, the most efficient way is to directly iterate the underlying numpy arrays:
# df holds the boolean check1/check2 columns shown above
df['result'] = df.check1.astype(int)
res = df['result'].values
c1 = df['check1'].values
c2 = df['check2'].values
old = -1
for i in range(len(df)):
    if res[i] != 1:
        res[i] = old + int(c2[i])
    old = res[i]
This works fine because numpy arrays are mutable types, so the changes are reflected in the dataframe.
Timeit says that this is twice as fast as the original solution from @Chris, and still 1.5 times faster after @Dan's improvement.

Python Pandas Boolean Dataframe Where Dataframe Equals False - Returns 0 instead of False?

If I have a Dataframe with True/False values only like this:
df_mask = pd.DataFrame({'AAA': [True] * 4,
                        'BBB': [False] * 4,
                        'CCC': [True, False, True, False]}); print(df_mask)
AAA BBB CCC
0 True False True
1 True False False
2 True False True
3 True False False
Then try to print where the values in the dataframe is equivalent to False like so:
print(df_mask[df_mask == False])
print(df_mask.where(df_mask == False))
My question is about column CCC. Column BBB shows False (as I expect), but why are indices 1 and 3 in column CCC equal to 0 instead of False?
AAA BBB CCC
0 NaN False NaN
1 NaN False 0
2 NaN False NaN
3 NaN False 0
AAA BBB CCC
0 NaN False NaN
1 NaN False 0
2 NaN False NaN
3 NaN False 0
Why doesn't it return a dataframe that looks like this?
AAA BBB CCC
0 NaN False NaN
1 NaN False False
2 NaN False NaN
3 NaN False False
Not entirely sure why, but if you're looking for a quick fix to convert it back to bools you can do the following:
>>> df_bool = df_mask.where(df_mask == False).astype(bool)
>>> df_bool
AAA BBB CCC
0 True False True
1 True False False
2 True False True
3 True False False
This is because the returned dataframe has a different dtype: it's no longer a dataframe of bools. Any column that receives NaN is upcast to float64, so False is stored as 0.0 and displayed as 0.
>>> df2 = df_mask.where(df_mask == False)
>>> df2.dtypes
AAA float64
BBB bool
CCC float64
dtype: object
This even occurs if you force it to a bool dtype from the get-go:
>>> df_mask = pd.DataFrame({'AAA': [True] * 4,
...                         'BBB': [False] * 4,
...                         'CCC': [True, False, True, False]}, dtype=bool); print(df_mask)
AAA BBB CCC
0 True False True
1 True False False
2 True False True
3 True False False
>>> df2 = df_mask.where(df_mask == False)
>>> df2
AAA BBB CCC
0 NaN False NaN
1 NaN False 0
2 NaN False NaN
3 NaN False 0
If you're explicitly worried about memory, astype can also return a reference rather than a copy (via its copy argument), but unless you're discarding the old reference (in which case it shouldn't matter), be careful:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.astype.html
http://docs.scipy.org/doc/numpy/reference/generated/numpy.dtype.html
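As a side note, on newer pandas versions the nullable boolean extension dtype avoids the upcast entirely, so False stays False and missing entries show as <NA>; a minimal sketch, assuming pandas >= 1.0:

import pandas as pd

df_mask = pd.DataFrame({'AAA': [True] * 4,
                        'BBB': [False] * 4,
                        'CCC': [True, False, True, False]})

# 'boolean' is the nullable extension dtype, which can hold <NA> without upcasting to float
out = df_mask.astype('boolean').where(df_mask == False)
print(out)
print(out.dtypes)   # all three columns stay 'boolean'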
