How to retrieve pandas dataframe rows surrounding rows with a True boolean? - python

Suppose I have a df of the following format:
Assuming line 198 is True for rot_mismatch, what would be the best way to retrieve the True line (easy) and the lines above and below it (unsolved)?
I have multiple lines with a True boolean and would like to automatically create a dataframe for closer investigation, always including the True line and its surrounding lines.
Thanks!
Edit for clarification:
exemplary input:
id  name     Bool
1   Sta      False
2   Danny    True
3   Elle     False
4   Rob      False
5   Dan      False
6   Holger   True
7   Mat      True
8   Derrick  False
9   Lisa     False
desired output:
id  name     Bool
1   Sta      False
2   Danny    True
3   Elle     False
5   Dan      False
6   Holger   True
7   Mat      True
8   Derrick  False

Assuming this input:
col1 rot_mismatch
0 A False
1 B True
2 C False
3 D False
4 E False
5 F False
6 G True
7 H True
To get the N rows before/after any True, you can use a rolling operation to compute a mask for boolean indexing:
N = 1
mask = (df['rot_mismatch']
        .rolling(2*N+1, center=True, min_periods=1)
        .max().astype(bool)
        )
df2 = df.loc[mask]
output:
# N = 1
col1 rot_mismatch
0 A False
1 B True
2 C False
5 F False
6 G True
7 H True
# N = 0
col1 rot_mismatch
1 B True
6 G True
7 H True
# N = 2
col1 rot_mismatch
0 A False
1 B True
2 C False
3 D False
4 E False
5 F False
6 G True
7 H True
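Depending on the pandas version, rolling aggregations can refuse to operate on a boolean column; casting to int first is an equivalent, slightly more defensive sketch (same df and N as above):
N = 1
mask = (df['rot_mismatch'].astype(int)
        .rolling(2*N+1, center=True, min_periods=1)
        .max().astype(bool))
df2 = df.loc[mask]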

Try with shift:
>>> df[df["rot_mismatch"]|df["rot_mismatch"].shift()|df["rot_mismatch"].shift(-1)]
dep_ap_sched arr_ap_sched rot_mismatch
120 East Carmen South Nathaniel False
198 South Nathaniel East Carmen True
289 East Carmen Joneshaven False
Output for amended example:
>>> df[df["Bool"]|df["Bool"].shift()|df["Bool"].shift(-1)]
id name Bool
0 1 Sta False
1 2 Danny True
2 3 Elle False
4 5 Dan False
5 6 Holger True
6 7 Mat True
7 8 Derrick False
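The same shift idea extends to N rows on each side by OR-ing several shifts together; a minimal sketch, assuming the amended example and pandas 0.24+ for fill_value:
from functools import reduce

N = 2
mask = reduce(lambda a, b: a | b,
              (df["Bool"].shift(k, fill_value=False) for k in range(-N, N + 1)))
print(df[mask])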

Is this what you want?
df_true = df.loc[df['rot_mismatch'], :]
df_false = df.loc[~df['rot_mismatch'], :]
(If rot_mismatch holds the strings 'True'/'False' rather than booleans, compare with == 'True' / == 'False' instead.)
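Building on that split, the rows above and below each True can then be pulled in with index arithmetic; a hedged sketch, assuming a default RangeIndex:
idx = df.index[df['rot_mismatch']]                # positions of the True rows
keep = idx.union(idx - 1).union(idx + 1)          # add the row above and below
df_context = df.loc[keep.intersection(df.index)]  # drop out-of-range neighbours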

Related

Pandas group by recurring state

I have a dataset with hosts, sorted by time, and a state isCorrect. I would like to get only the hosts that have been rated "False" at least 3 consecutive times; if there is a True in between, the counter should reset.
data = {'time': ['10:01', '10:02', '10:03', '10:15', '10:16', '10:18', '10:20',
                 '10:21', '10:22', '10:23', '10:24', '10:25', '10:26', '10:27'],
        'host': ['A', 'B', 'A', 'A', 'A', 'B', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B'],
        'isCorrect': [True, True, False, True, False, False, True, True,
                      False, False, True, False, False, False]}
time host isCorrect
0 10:01 A True
1 10:02 B True
2 10:03 A False
3 10:15 A True
4 10:16 A False
5 10:18 B False
6 10:20 A True
7 10:21 A True
8 10:22 B False
9 10:23 B False
10 10:24 B True
11 10:25 B False
12 10:26 B False
13 10:27 B False
With this example dataset there should be 2 clusters:
Host B due to rows 5, 8, 9, since they were False 3 times in a row.
Host B due to rows 11, 12, 13.
Note that it should be 2 clusters rather than 1 cluster made of 6 items. Unfortunately my implementation produces exactly that single cluster.
df = pd.DataFrame(data)
df = df[~df['isCorrect']].sort_values(['host','time'])
mask = df['host'].map(df['host'].value_counts()) >= 3
df = df[mask].copy()
df['Group'] = pd.factorize(df['host'])[0]
Which returns
time host isCorrect Group
5 10:18 B False 0
8 10:22 B False 0
9 10:23 B False 0
11 10:25 B False 0
12 10:26 B False 0
13 10:27 B False 0
Expected is an output like so:
time host isCorrect Group
5 10:18 B False 0
8 10:22 B False 0
9 10:23 B False 0
11 10:25 B False 1
12 10:26 B False 1
13 10:27 B False 1
The trick is to generate the new Group column after sorting, as the cumulative sum of the True values (because we test the False rows): every True increments the counter, so each consecutive run of False rows gets a unique group id, which is factorized in the last step:
df = df.sort_values(['host','time'])
df['Group'] = df['isCorrect'].cumsum()
df = df[~df['isCorrect']]
mask = df['Group'].map(df['Group'].value_counts()) >= 3
df = df[mask].copy()
df['Group'] = pd.factorize(df['Group'])[0]
print (df)
time host isCorrect Group
5 10:18 B False 0
8 10:22 B False 0
9 10:23 B False 0
11 10:25 B False 1
12 10:26 B False 1
13 10:27 B False 1
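To see why the cumulative sum separates the runs, a quick sketch printing the intermediate counter (assuming the data dict from the question):
import pandas as pd

df = pd.DataFrame(data)
print(df.sort_values(['host', 'time'])['isCorrect'].cumsum().tolist())
# [1, 1, 2, 2, 3, 4, 5, 5, 5, 5, 6, 6, 6, 6]
# each consecutive run of False rows shares one value (5 for the first B run,
# 6 for the second), so filtering and factorizing yields two groups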
With the threshold raised to 5 consecutive False values, no run qualifies, so the result is empty:
df = df.sort_values(['host','time'])
df['Group'] = df['isCorrect'].cumsum()
df = df[~df['isCorrect']]
mask = df['Group'].map(df['Group'].value_counts()) >= 5
df = df[mask].copy()
df['Group'] = pd.factorize(df['Group'])[0]
print (df)
Empty DataFrame
Columns: [time, host, isCorrect, Group]
Index: []

How to set values to np.nan with multiple conditions for series?

Let's say I have a dataframe:
A B C D E F
0 x R i R nan h
1 z g j x a nan
2 z h nan y nan nan
3 x g nan nan nan nan
4 x x h x s f
I want to replace all the cells where:
the value in row 0 is R (df.loc[0] == 'R')
the cell is not 'x' (!= 'x')
only rows 2 and below (2:)
with np.nan.
Essentially I want to do:
df.loc[2:,df.loc[0]=='R']!='x' = np.nan
I get the error:
SyntaxError: can't assign to comparison
I just don't know how the syntax is supposed to be.
I've tried
df[df.loc[2:,df.loc[0]=='R']!='x']
but this doesn't list the values I want.
Solution
mask = df.ne('x') & df.iloc[0].eq('R')
mask.iloc[:2] = False
df.mask(mask)
A B C D E F
0 x R i R NaN h
1 z g j x a NaN
2 z NaN NaN NaN NaN NaN
3 x NaN NaN NaN NaN NaN
4 x x h x s f
Explanation
Build the mask up
df.ne('x') gives
A B C D E F
0 False True True True True True
1 True True True False True True
2 True True True True True True
3 False True True True True True
4 False False True False True True
But we want that in conjunction with df.iloc[0].eq('R') which is a Series. Turns out that if we just & those two together, it will align the Series index with the columns of the mask in step 1.
A False
B True
C False
D True
E False
F False
Name: 0, dtype: bool
# &
A B C D E F
0 False True True True True True
1 True True True False True True
2 True True True True True True
3 False True True True True True
4 False False True False True True
# GIVES YOU
A B C D E F
0 False True False True False False
1 False True False False False False
2 False True False True False False
3 False True False True False False
4 False False False False False False
Finally, we want to exclude the first two rows from these shenanigans so...
mask.iloc[:2] = False
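The same result as a direct in-place assignment, a minimal sketch assuming the frame above with its default integer index:
rows = df.index[2:]                    # rows 2 and below
cols = df.columns[df.iloc[0].eq('R')]  # columns whose row 0 is 'R'
sub = df.loc[rows, cols]
df.loc[rows, cols] = sub.where(sub.eq('x'))  # keep 'x', NaN everything else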
Try with:
mask = df.iloc[0] !='R'
df.loc[2:, mask] = df.loc[2:,mask].where(df.loc[2:,mask]=='x')
Output:
A B C D E F
0 x R i R NaN h
1 z g j x a NaN
2 NaN h NaN y NaN NaN
3 x g NaN NaN NaN NaN
4 x x NaN x NaN NaN
By your approach (df[mask] = np.nan aligns the smaller boolean frame with df and treats the missing positions as False):
df[df.loc[2:,df.loc[0]=='R']!='x']=np.nan
Output:
>>> df
A B C D E F
0 x R i R NaN h
1 z g j x a NaN
2 z NaN NaN NaN NaN NaN
3 x NaN NaN NaN NaN NaN
4 x x h x s f

Applying values to a DataFrame without using a for-loop

I'm looking for a faster method of applying values to a column in a DataFrame. The value is based on two True and False values in the first and second column. This is my current solution:
df['result'] = df.check1.astype(int)
for i in range(len(df)):
    if df.result[i] != 1:
        df.result[i] = df.result.shift(1)[i] + df.check2[i].astype(int)
Which yields this result:
check1 check2 result
0 True False 1
1 False False 1
2 False False 1
3 False False 1
4 False False 1
5 False False 1
6 False True 2
7 False False 2
8 False True 3
9 False False 3
10 False True 4
11 False False 4
12 False True 5
13 False False 5
14 False True 6
15 False False 6
16 False True 7
17 False False 7
18 False False 7
19 False False 7
20 False True 8
21 False False 8
22 False True 9
23 True False 1
24 False False 1
So the third column needs to be a number based on the value in the row above it.
If check1 is True, the number needs to go back to 1. If check2 is True, 1 needs to be added to the number. Otherwise the number stays the same.
The current code works, but it takes too long: I need to apply this to a DataFrame with approx. 70,000 rows. I'm pretty sure it can be improved (I'm guessing the apply function, but I'm not sure). Any ideas?
Use pandas.DataFrame.groupby.cumsum:
import pandas as pd
df['result'] = df.groupby(df['check1'].cumsum())[['check1', 'check2']].cumsum().sum(1)
Or @Dan's suggestion:
df['result'] = df.groupby(df['check1'].cumsum())['check2'].cumsum().add(1)
Output:
check1 check2 result
0 True False 1.0
1 False False 1.0
2 False False 1.0
3 False False 1.0
4 False False 1.0
5 False False 1.0
6 False True 2.0
7 False False 2.0
8 False True 3.0
9 False False 3.0
10 False True 4.0
11 False False 4.0
12 False True 5.0
13 False False 5.0
14 False True 6.0
15 False False 6.0
16 False True 7.0
17 False False 7.0
18 False False 7.0
19 False False 7.0
20 False True 8.0
21 False False 8.0
22 False True 9.0
23 True False 1.0
24 False False 1.0
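The results above come out as floats; if integers are wanted, a trailing astype restores them (a small sketch based on @Dan's variant):
df['result'] = (df.groupby(df['check1'].cumsum())['check2']
                  .cumsum().add(1).astype(int))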
You want to iterate a dataframe using the value of the preceding row. In that case, the most efficient way is to directly iterate the underlying numpy arrays:
df['result'] = df.check1.astype(int)
res = df['result'].values  # a mutable view into the dataframe
c2 = df['check2'].values
old = -1
for i in range(len(df)):
    if res[i] != 1:
        res[i] = old + int(c2[i])
    old = res[i]
This works fine because numpy arrays are mutable types, so the changes are reflected in the dataframe.
Timeit says that this is twice as fast as the original solution from @Chris, and still 1.5 times faster after @Dan's improvement.
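A quick hedged sanity check that the loop and the vectorized answers agree, assuming the two result columns were computed on copies df_loop and df_vec of the same frame (hypothetical names):
import pandas as pd

pd.testing.assert_series_equal(df_loop['result'].astype(float),
                               df_vec['result'].astype(float),
                               check_names=False)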

How to sum all values in a column only if they are numerical?

I have the following code:
def check(df, columns):
    for col in columns:
        if df[col].sum(axis=0) == 0:
            return True
    return False
This code goes through the columns of df and checks if the sum of all values in a column is equal to 0 (i.e. all values are 0, while ignoring empty fields).
However, it fails if one of the columns in columns is non-numeric. How can I add a condition so that df[col].sum(axis=0) == 0 only runs on numeric columns, ignoring empty rows if any?
Use:
df = pd.DataFrame({
    'A': list('abcdef'),
    'B': [0, 0, np.nan, 0, -0, 0],
    'C': [7, 8, 9, 4, 2, 3],
    'E': [5, 3, 6, 9, 2, 4],
    'F': list('aaabbb')
})
print (df)
A B C E F
0 a 0.0 7 5 a
1 b 0.0 8 3 a
2 c NaN 9 6 a
3 d 0.0 4 9 b
4 e 0.0 2 2 b
5 f 0.0 3 4 b
def check(df, columns):
    return df[columns].select_dtypes(np.number).fillna(0).eq(0).all().any()
print (check(df, df.columns))
True
An alternative that tests for missing values explicitly and chains the boolean DataFrames with | (bitwise OR):
def check(df, columns):
    df1 = df[columns].select_dtypes(np.number)
    return (df1.eq(0) | df1.isna()).all().any()
Explanation:
First select the columns specified in the list (in the sample, all columns) and keep only the numeric ones with DataFrame.select_dtypes:
print (df[columns].select_dtypes(np.number))
B C E
0 0.0 7 5
1 0.0 8 3
2 NaN 9 6
3 0.0 4 9
4 0.0 2 2
5 0.0 3 4
Then replace missing values with 0 using DataFrame.fillna:
print (df[columns].select_dtypes(np.number).fillna(0))
B C E
0 0.0 7 5
1 0.0 8 3
2 0.0 9 6
3 0.0 4 9
4 0.0 2 2
5 0.0 3 4
Compare against 0 with DataFrame.eq (the method form of ==):
print (df[columns].select_dtypes(np.number).fillna(0).eq(0))
B C E
0 True False False
1 True False False
2 True False False
3 True False False
4 True False False
5 True False False
Test whether each column contains only Trues with DataFrame.all:
print (df[columns].select_dtypes(np.number).fillna(0).eq(0).all())
B True
C False
E False
dtype: bool
And lastly, test whether at least one value in the Series is True with Series.any:
print (df[columns].select_dtypes(np.number).fillna(0).eq(0).all().any())
True
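For contrast, a quick hedged check on a frame where no numeric column is all zeros (reusing check and df from above; df_nz is a made-up name):
df_nz = df.assign(B=[1, 2, 3, 4, 5, 6])  # overwrite B so no numeric column sums to 0
print (check(df_nz, df_nz.columns))
False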
You can try this condition as well:
if df[col].dtype == int or df[col].dtype == float:
    # your code
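Folded into the question's function, a minimal sketch using pandas' own dtype check, pd.api.types.is_numeric_dtype, instead of comparing dtypes by hand:
import pandas as pd

def check(df, columns):
    for col in columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            if df[col].sum(axis=0) == 0:  # sum skips NaN by default
                return True
    return False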

How can I perform a value dependent pivot table/Groupby in Pandas?

I have the following dataframe:
Tran ID Category Quantity
0 001 A 5
1 001 B 2
2 001 C 3
3 002 A 4
4 002 C 2
5 003 D 6
I want to transform it into:
Tran ID A B C D Quantity
0 001 True True True False 10
1 002 True False True False 6
2 003 False False False True 6
I know I can use groupby to get the sum of quantity, but I can't figure out how to perform the pivot that I described.
Use get_dummies to build the indicator columns, collapse them per Tran ID with max, and add the new Quantity column with an aggregating sum:
#pandas 0.23+
df1 = pd.get_dummies(df.set_index('Tran ID')['Category'], dtype=bool).max(level=0)
#older pandas versions
#df1 = pd.get_dummies(df.set_index('Tran ID')['Category']).astype(bool).max(level=0)
#recent pandas deprecates max(level=0); use .groupby(level=0).max() instead
s = df.groupby('Tran ID')['Quantity'].sum()
df2 = df1.assign(Quantity = s).reset_index()
print (df2)
Tran ID A B C D Quantity
0 001 True True True False 10
1 002 True False True False 6
2 003 False False False True 6
Or you can use:
print(df.drop('Category',1).join(df['Category'].str.get_dummies().astype(bool)).groupby('Tran ID',as_index=False).sum())
Or, a little easier to read:
df1 = df.drop('Category',1).join(df['Category'].str.get_dummies().astype(bool))
print(df1.groupby('Tran ID',as_index=False).sum())
Both output:
Tran ID Quantity A B C D
0 1 10 True True True False
1 2 6 True False True False
2 3 6 False False False True
pandas.DataFrame.groupby with pandas.Series.str.get_dummies is the way to do it.
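For comparison, pd.crosstab can build the same boolean indicator block; a minimal sketch, assuming the df from the question:
ct = pd.crosstab(df['Tran ID'], df['Category']).astype(bool)  # True where the category occurs
out = ct.assign(Quantity=df.groupby('Tran ID')['Quantity'].sum()).reset_index()
print(out)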
