Loop through dataframe in Python to select specific rows

I have time-series data for 5864 ICU patients, and my dataframe looks like this. Each row is one hour of the ICU stay of the respective patient.
HR   SBP  DBP  ICULOS  Sepsis  P_ID
92   120  80   1       0       0
98   115  85   2       0       0
93   125  75   3       1       0
95   130  90   4       1       0
102  120  80   1       0       1
109  115  75   2       0       1
94   135  100  3       0       1
97   100  70   4       1       1
85   120  80   5       1       1
88   115  75   6       1       1
93   125  85   1       0       2
78   130  90   2       0       2
115  140  110  3       0       2
102  120  80   4       0       2
98   140  110  5       1       2
I want to select the ICULOS at which Sepsis = 1 (the first such hour only) for each patient ID. For example, for P_ID = 0, Sepsis = 1 first occurs at ICULOS = 3. I did this for a single patient (a dataframe holding only that patient's data) using the code:
x = df[df['Sepsis'] == 1]["ICULOS"].values[0]
print("ICULOS at which Sepsis Label = 1 is:", x)
# Output
ICULOS at which Sepsis Label = 1 is: 46
If I want to check this for each P_ID, I would have to do it 5864 times. Can someone help me with the code using a loop? The loop should go to each P_ID and give the ICULOS at which Sepsis = 1. Looking forward to your help.

for x in df['P_ID'].unique():
    # @x references the loop variable inside query; note this raises an
    # IndexError for any patient who never has Sepsis == 1
    print(df.query('P_ID == @x and Sepsis == 1')['ICULOS'].iloc[0])

First, filter the rows which have Sepsis = 1. This automatically drops the P_IDs that never have Sepsis equal to 1, so you will have fewer patients to iterate over.
df1 = df[df.Sepsis == 1]
for pid in df.P_ID.unique():
    if pid not in df1.P_ID.values:
        print(f"P_ID: {pid} - it has no ICULOS at Sepsis Label = 1")
    else:
        iculos = df1[df1.P_ID == pid].ICULOS.values[0]
        print(f"P_ID: {pid} - ICULOS at which Sepsis Label = 1 is: {iculos}")

Related

Randomly drop % of rows by condition in polars

Imagine we have the following polars dataframe:
Feature 1  Feature 2  Labels
100        25         1
150        18         0
200        15         0
230        28         0
120        12         1
130        34         1
150        23         1
180        25         0
Now using polars we want to drop every row with Labels == 0 with 50% probability. An example output would be the following:
Feature 1  Feature 2  Labels
100        25         1
200        15         0
230        28         0
120        12         1
130        34         1
150        23         1
I think filter and sample might be handy... I have something but it is not working:
df = df.drop(df.filter(pl.col("Labels") == 0).sample(frac=0.5))
How can I make it work?
You can use polars.DataFrame.vstack (note that in recent polars versions the sample keyword is fraction rather than frac):
df = (df.filter(pl.col("Labels") == 0).sample(frac=0.5)
        .vstack(df.filter(pl.col("Labels") != 0))
        .sample(frac=1, shuffle=True))
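An alternative sketch that drops each Labels == 0 row independently with 50% probability (the vstack version keeps exactly half of them); this assumes numpy for the random draw:
import numpy as np
import polars as pl

rng = np.random.default_rng()
# True with probability 0.5 for every row
keep = pl.Series("keep", rng.random(df.height) >= 0.5)
df = (df.with_columns(keep)
        .filter((pl.col("Labels") != 0) | pl.col("keep"))
        .drop("keep"))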

Average of every x rows with a step size of y per each subset using pandas

I have a pandas data frame like this:
Subset Position Value
1 1 2
1 10 3
1 15 0.285714
1 43 1
1 48 0
1 89 2
1 132 2
1 152 0.285714
1 189 0.133333
1 200 0
2 1 0.133333
2 10 0
2 15 2
2 33 2
2 36 0.285714
2 72 2
2 132 0.133333
2 152 0.133333
2 220 3
2 250 8
2 350 6
2 750 0
How can I get the mean of the values for every x rows with a step size of y, per subset, in pandas?
For example, the mean of every 5 rows (step size = 2) for the Value column in each subset, like this:
Subset Start_position End_position Mean
1 1 48 1.2571428
1 15 132 1.0571428
1 48 189 0.8838094
2 1 36 0.8838094
2 15 132 1.2838094
2 36 220 1.110476
2 132 350 3.4533332
Is this what you were looking for?
df = pd.DataFrame({'Subset': [1]*10+[2]*12,
                   'Position': [1,10,15,43,48,89,132,152,189,200,1,10,15,33,36,72,132,152,220,250,350,750],
                   'Value': [2,3,.285714,1,0,2,2,.285714,.133333,0,.133333,0,2,2,.285714,2,.133333,.133333,3,8,6,0]})
window = 5
step_size = 2
rows = []
for subset in df.Subset.unique():
    subset_df = df[df.Subset==subset].reset_index(drop=True)
    for i in range(0, len(subset_df), step_size):  # iterate within the subset, not the whole df
        window_rows = subset_df.iloc[i:i+window]
        if len(window_rows) < window:
            continue
        rows.append({'Subset': window_rows.Subset.iloc[0],
                     'Start_position': window_rows.Position.iloc[0],
                     'End_position': window_rows.Position.iloc[-1],
                     'Mean': window_rows.Value.mean()})
# DataFrame.append was removed in pandas 2.0, so collect dicts and build the frame once
averaged_df = pd.DataFrame(rows)
Some notes about the code:
It assumes all subsets are in order in the original df (1,1,2,1,2,2 will behave as if it was 1,1,1,2,2,2)
If there is a group left that's smaller than a window, it is skipped (e.g. the partial last window of Subset 1, which would give the row 1 132 200 0.60476, is not included)
One version-specific answer, using pandas.api.indexers.FixedForwardWindowIndexer, introduced in pandas 1.1.0:
>>> window=5
>>> step=2
>>> indexer = pd.api.indexers.FixedForwardWindowIndexer(window_size=window)
>>> df2 = df.join(df.Position.shift(-(window-1)), lsuffix='_start', rsuffix='_end')
>>> df2 = df2.assign(Mean=df2.pop('Value').rolling(window=indexer).mean()).iloc[::step]
>>> df2 = df2[df2.Position_start.lt(df2.Position_end)].dropna()
>>> df2['Position_end'] = df2['Position_end'].astype(int)
>>> df2
Subset Position_start Position_end Mean
0 1 1 48 1.257143
2 1 15 132 1.057143
4 1 48 189 0.883809
10 2 1 36 0.883809
12 2 15 132 1.283809
14 2 36 220 1.110476
16 2 132 350 3.453333

Python - Grouping and Assigning Exception Rules

I would like to assign Group 1 if the closest negative Diff to 0 within an ID belongs to Location 86, Group 2 if it belongs to Location 90, and Group 3 if Locations 86 and 90 are tied for closest. After this set is run, I would rerun the code, and anywhere a Group has not yet been assigned it begins assigning from Group 4 onwards, so as not to override the previous group assignments.
The groupby is occurring based on ID, Location, and closest to the Anchor column.
Note in the below example, we skip over Location 66 as an exception, where I would use df['diff'].where(df['diff'].le(0)&df['Anchor Date'].ne('Y')&df['Location'].ne(66))
Input:
ID Location Anchor Date Diff
111 86 N 5/2/2020 -1
111 87 Y 5/3/2020 0
111 90 N 5/4/2020 -2
111 90 Y 5/6/2020 0
123 86 N 1/4/2020 -1
123 90 N 1/4/2020 -1
123 91 Y 1/5/2020 0
456 64 N 2/3/2020 -2
456 66 N 2/4/2020 -1
456 91 Y 2/5/2020 0
Output:
ID Location Anchor Date Diff Group
111 86 N 5/2/2020 -1 1
111 87 Y 5/3/2020 0
111 90 N 5/4/2020 -2 2
111 90 Y 5/6/2020 0
123 86 N 1/4/2020 -1 3
123 90 N 1/4/2020 -1 3
123 91 Y 1/5/2020 0
456 64 N 2/3/2020 -2 4
456 66 N 2/4/2020 -1
456 91 Y 2/5/2020 0
Among your exception rules, the one with both 86 and 90 adds some complexity, as one needs a single value for a group composed of two locations. In general, the fact that you want to catch several locations when they share the same diff is the hard part. Here is one way: create series with the different group values and masks.
import numpy as np

# catch each group per ID, up until an anchor row (Anchor == 'Y')
gr = (df['ID'].ne(df['ID'].shift()) | df['Anchor'].shift().eq('Y')).cumsum()
# where the Diff per group is equal to the last value possible before the anchor
mask_last = (df['Diff'].where(df['Diff'].le(0) & df['Anchor'].ne('Y') & df['Location'].ne(66))
               .groupby(gr).transform('last')
               .eq(df['Diff']))
# needed to create a unique fake Location value, especially if there are several
loc_max = df['Location'].max() + 1
# create groups based on Location value
gr2 = (df['Location'].where(mask_last).groupby(gr)
         .transform(lambda x: (x.dropna().sort_values()
                               * loc_max**np.arange(len(x.dropna()))).sum()))
Now you can create the groups:
# now create the group column
d_exception = {86: 1, 90: 2, 86 + 90*loc_max: 3}  # you can add more
df['group'] = ''
# exceptions
for key, val in d_exception.items():
    df.loc[mask_last & gr2.eq(key), 'group'] = val
# the rest of the groups
idx = df.index[mask_last & ~gr2.isin(d_exception.keys())]
df.loc[idx, 'group'] = pd.factorize(df.loc[idx, 'Location'])[0] + len(d_exception) + 1
print(df)
ID Location Anchor Date Diff group
0 111 86 N 5/2/2020 -1 1
1 111 87 Y 5/3/2020 0
2 111 90 N 5/4/2020 -2 2
3 111 90 Y 5/6/2020 0
4 123 86 N 1/4/2020 -1 3
5 123 90 N 1/4/2020 -1 3
6 123 91 Y 1/5/2020 0
7 456 64 N 2/3/2020 -2 4
8 456 66 N 2/4/2020 -1
9 456 91 Y 2/5/2020 0

Get row numbers based on column values from numpy array

I am new to numpy and need some help solving my problem.
I read records from a binary file using dtypes, then select 3 columns:
df = pd.DataFrame(np.array([(124,90,5),(125,90,5),(126,90,5),(127,90,0),(128,91,5),(129,91,5),(130,91,5),(131,91,0)]), columns = ['atype','btype','ctype'] )
which gives
atype btype ctype
0 124 90 5
1 125 90 5
2 126 90 5
3 127 90 0
4 128 91 5
5 129 91 5
6 130 91 5
7 131 91 0
'atype' is of no interest to me for now.
But what I want is the row numbers when
(x,90,5) appears in 2nd and 3rd columns
(x,90,0) appears in 2nd and 3rd columns
when (x,91,5) appears in 2nd and 3rd columns
and (x,91,0) appears in 2nd and 3rd columns
etc
There are 7 values (90, 91, 92, 93, 94, 95, 96) in the 2nd column, and correspondingly the 3rd column holds either 5 or 0.
There are a million entries, so is there any way to find these without a for loop?
Using pandas you could try the following.
df[(df['btype'].between(90, 96)) & (df['ctype'].isin([0, 5]))]
Using your example, if some of the values are changed such that df becomes
atype btype ctype
0 124 90 5
1 125 90 5
2 126 0 5
3 127 90 100
4 128 91 5
5 129 0 5
6 130 91 5
7 131 91 0
then using the solution above, the following is returned.
atype btype ctype
0 124 90 5
1 125 90 5
4 128 91 5
6 130 91 5
7 131 91 0
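Since the question asks for row numbers rather than the rows themselves, here is a small sketch on top of that filter; the pair (90, 5) is just an example:
import numpy as np

# row numbers where btype == 90 and ctype == 5
rows_90_5 = np.flatnonzero((df['btype'] == 90) & (df['ctype'] == 5))

# or the row labels for every (btype, ctype) combination at once
rows_per_pair = df.groupby(['btype', 'ctype']).groups
# rows_per_pair[(90, 5)] -> index of all rows with btype == 90 and ctype == 5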

How to find out if there was weekend between days?

I have two data frames. One representing when an order was placed and arrived, while the other one represents the working days of the shop.
Days are given as days of the year, i.e. 32 = 1st February.
orders = DataFrame({'placed':[100,103,104,105,108,109], 'arrived':[103,104,105,106,111,111]})
Out[25]:
arrived placed
0 103 100
1 104 103
2 105 104
3 106 105
4 111 108
5 111 109
calendar = DataFrame({'day':['100','101','102','103','104','105','106','107','108','109','110','111','112','113','114','115','116','117','118','119','120'], 'closed':[0,1,1,0,0,0,0,0,1,1,0,0,0,0,0,1,1,0,0,0,0]})
Out[21]:
closed day
0 0 100
1 1 101
2 1 102
3 0 103
4 0 104
5 0 105
6 0 106
7 0 107
8 1 108
9 1 109
10 0 110
11 0 111
12 0 112
13 0 113
14 0 114
15 1 115
16 1 116
17 0 117
18 0 118
19 0 119
20 0 120
What I want to do is compute the difference between placed and arrived
x = orders['arrived'] - orders['placed']
Out[24]:
0 3
1 1
2 1
3 1
4 3
5 2
dtype: int64
and subtract one for each day between placed and arrived (inclusive) on which the shop was closed.
I.e. in the first row the order is placed on day 100 and arrives on day 103, so the days involved are 100, 101, 102, 103. The difference between 103 and 100 is 3, but since the shop is closed on days 101 and 102, I want to subtract 1 for each: 3 - 1 - 1 = 1. Finally, I want to append this result to the orders df.
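This question has no answer in the thread, so here is a minimal sketch of one approach; it assumes calendar['day'] is first cast to int so it matches the integer days in orders:
calendar['day'] = calendar['day'].astype(int)
closed_days = set(calendar.loc[calendar['closed'] == 1, 'day'])

def adjusted_delta(row):
    # raw difference minus the number of closed days in [placed, arrived]
    closed = sum(d in closed_days for d in range(row['placed'], row['arrived'] + 1))
    return row['arrived'] - row['placed'] - closed

orders['delta'] = orders.apply(adjusted_delta, axis=1)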
