Comparing values row-wise in a pandas data frame - Python

My data frame (almost 10M rows) looks like this -
date value1 value2
01/02/2019 10 120
02/02/2019 21 130
03/02/2019 0 140
04/02/2019 24 150
05/02/2019 29 160
06/02/2019 32 160
07/02/2019 54 160
08/02/2019 32 180
01/02/2019 -3 188
My final output looks like -
date value1 value2 result
01/02/2019 10 120 1
02/02/2019 21 130 1
03/02/2019 0 140 0
04/02/2019 24 150 1
05/02/2019 29 160 1
06/02/2019 32 160 0
07/02/2019 54 160 0
08/02/2019 32 180 1
01/02/2019 -3 188 0
My logic is: if value1 <= 0, or value2 stays the same for 3 consecutive rows, the result is 0; otherwise it is 1.
How can I do this in pandas?

You can define your own function that handles the consecutive values and checks where value1 is above 0, then group by a custom series marking runs of consecutive value2, and finally apply the custom function:
import pandas as pd
from io import StringIO
s = '''date,value1,value2
01/02/2019,10,120
02/02/2019,21,130
03/02/2019,0,140
04/02/2019,24,150
05/02/2019,29,160
06/02/2019,32,160
07/02/2019,54,160
08/02/2019,32,180
01/02/2019,-3,188'''
df = pd.read_csv(StringIO(s), header=0, index_col=0)
def fun(group_df):
    if group_df.shape[0] >= 3:
        return pd.Series([0] * group_df.shape[0], index=group_df.index)
    else:
        return group_df.value1 > 0

consecutives = (df.value2 != df.value2.shift()).cumsum()
df['results'] = df.groupby(consecutives).apply(fun).reset_index(level=0, drop=True)
Here fun checks, per group of consecutive value2, whether the run is 3 rows or longer (returning 0 for the whole run), and otherwise whether value1 is greater than 0. The results are:
print(df)
# value1 value2 results
# date
# 01/02/2019 10 120 1
# 02/02/2019 21 130 1
# 03/02/2019 0 140 0
# 04/02/2019 24 150 1
# 05/02/2019 29 160 0
# 06/02/2019 32 160 0
# 07/02/2019 54 160 0
# 08/02/2019 32 180 1
# 01/02/2019 -3 188 0

Something like this:
import numpy as np
df['result'] = np.where(df.value1.le(0) | df.value2.diff().eq(0), 0, 1)
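Note that diff().eq(0) only flags rows equal to the one directly above. A minimal sketch that also enforces the stated three-consecutive-rows rule, assuming runs of equal value2 are labelled with a cumulative sum of change points:

import numpy as np

# label each run of consecutive equal value2 and measure its length
run_id = df['value2'].ne(df['value2'].shift()).cumsum()
run_len = df.groupby(run_id)['value2'].transform('size')

# 0 where value1 <= 0 or the row belongs to a run of 3+ equal value2, else 1
df['result'] = np.where(df['value1'].le(0) | run_len.ge(3), 0, 1)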

Related

Randomly drop % of rows by condition in polars

Imagine we have the following polars dataframe:
Feature 1  Feature 2  Labels
100        25         1
150        18         0
200        15         0
230        28         0
120        12         1
130        34         1
150        23         1
180        25         0
Now using polars we want to drop every row with Labels == 0 with 50% probability. An example output would be the following:
Feature 1  Feature 2  Labels
100        25         1
200        15         0
230        28         0
120        12         1
130        34         1
150        23         1
I think filter and sample might be handy... I have something but it is not working:
df = df.drop(df.filter(pl.col("Labels") == 0).sample(frac=0.5))
How can I make it work?
You can use polars.DataFrame.vstack:
df = (
    df.filter(pl.col("Labels") == 0).sample(frac=0.5)
      .vstack(df.filter(pl.col("Labels") != 0))
      .sample(frac=1, shuffle=True)
)
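If you prefer to keep the original row order instead of stacking and reshuffling, here is a small sketch of my own (not from the answer above) that filters with a boolean mask, assuming NumPy for the randomness:

import numpy as np
import polars as pl

rng = np.random.default_rng()
# keep every Labels != 0 row; keep each Labels == 0 row with 50% probability
keep = (df["Labels"] != 0) | pl.Series(rng.random(df.height) >= 0.5)
df = df.filter(keep)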

Loop through dataframe in python to select specific row

I have time-series data of 5864 ICU patients and my dataframe looks like this. Each row is one hour of the ICU stay of the respective patient.
HR   SBP  DBP  ICULOS  Sepsis  P_ID
92   120  80   1       0       0
98   115  85   2       0       0
93   125  75   3       1       0
95   130  90   4       1       0
102  120  80   1       0       1
109  115  75   2       0       1
94   135  100  3       0       1
97   100  70   4       1       1
85   120  80   5       1       1
88   115  75   6       1       1
93   125  85   1       0       2
78   130  90   2       0       2
115  140  110  3       0       2
102  120  80   4       0       2
98   140  110  5       1       2
I want to select the ICULOS where Sepsis = 1 (first hour only) based on patient ID. Like in P_ID = 0, Sepsis = 1 at ICULOS = 3. I did this on a single patient (the dataframe having data of only a single patient) using the code:
x = df[df['Sepsis'] == 1]["ICULOS"].values[0]
print("ICULOS at which Sepsis Label = 1 is:", x)
# Output
ICULOS at which Sepsis Label = 1 is: 46
If I want to check it for each P_ID, I would have to do this 5864 times. Can someone help me with the code using a loop? The loop should go to each P_ID and then give the ICULOS where Sepsis = 1. Looking forward to your help.
for x in df['P_ID'].unique():
    print(df.query('P_ID == @x and Sepsis == 1')['ICULOS'].iloc[0])
First, filter the rows which have Sepsis = 1. This automatically drops the P_IDs which never have Sepsis = 1, so there are fewer patients to iterate over.
df1 = df[df.Sepsis == 1]
for pid in df.P_ID.unique():
    if pid not in df1.P_ID.values:
        print(f"P_ID: {pid} - it has no ICULOS at Sepsis Label = 1")
    else:
        iculos = df1[df1.P_ID == pid].ICULOS.values[0]
        print(f"P_ID: {pid} - ICULOS at which Sepsis Label = 1 is: {iculos}")

Python Pandas calculate total volume with last article volume

I have the following problem and do not know how to solve it in a performant way:
Input Pandas DataFrame:
timestep  article  volume
35        1        20
37        2        5
123       2        12
155       3        10
178       2        23
234       1        17
478       1        28
Output Pandas DataFrame:
timestep  volume
35        20
37        25
123       32
178       53
234       50
478       61
Calculation Example for timestep 478:
28 (last article 1 volume) + 23 (last article 2 volume) + 10 (last article 3 volume) = 61
What is the best way to do this in pandas?
Try with ffill:
# sort if needed
df = df.sort_values("timestep")
df["volume"] = (df["volume"].where(df["article"].eq(1)).ffill().fillna(0) +
                df["volume"].where(df["article"].eq(2)).ffill().fillna(0))
output = df.drop("article", axis=1)
>>> output
timestep volume
0 35 20.0
1 37 25.0
2 123 32.0
3 178 43.0
4 234 40.0
5 478 51.0
Group By article & Take last element & Sum
df.groupby(['article']).tail(1)["volume"].sum()
You can assign a group number to each run of consecutive article values with .cumsum(). Then get the last value of the previous group via .map() on GroupBy.last(). Finally, add this previous last value to volume, as follows:
# Get group number of consecutive `article`
g = df['article'].ne(df['article'].shift()).cumsum()
# Add `volume` to previous group last
df['volume'] += g.sub(1).map(df.groupby(g)['volume'].last()).fillna(0, downcast='infer')
Result:
print(df)
timestep article volume
0 35 1 20
1 37 2 25
2 123 2 32
3 178 2 43
4 234 1 40
5 478 1 51
Breakdown of steps
Previous group last values:
g.sub(1).map(df.groupby(g)['volume'].last()).fillna(0, downcast='infer')
0 0
1 20
2 20
3 20
4 43
5 43
Name: article, dtype: int64
Try:
df["new_volume"] = (
df.loc[df["article"] != df["article"].shift(-1), "volume"]
.reindex(df.index, method='ffill')
.shift()
+ df["volume"]
).fillna(df["volume"])
df
Output:
timestep article volume new_volume
0 35 1 20 20.0
1 37 2 5 25.0
2 123 2 12 32.0
3 178 2 23 43.0
4 234 1 17 40.0
5 478 1 28 51.0
Explained:
Find the last record of each group by comparing 'article' with the next row, then reindex that series to align with the original dataframe, forward-fill, and shift so each row sees the previous group's last 'volume'. Add this to the current row's 'volume' and fill the first value with the original 'volume'.
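For comparison, here is a more general sketch of my own (not one of the answers above) that handles any number of articles by forward-filling the last seen volume per article and summing across articles, assuming at most one row per (timestep, article). Note it also produces a row for timestep 155, which the expected output above omits:

# wide frame: one column per article, forward-filled with the last seen volume
wide = (
    df.pivot(index='timestep', columns='article', values='volume')
      .ffill()
      .fillna(0)
)
totals = wide.sum(axis=1).rename('volume').reset_index()
print(totals)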

Average of every x rows with a step size of y per each subset using pandas

I have a pandas data frame like this:
Subset Position Value
1 1 2
1 10 3
1 15 0.285714
1 43 1
1 48 0
1 89 2
1 132 2
1 152 0.285714
1 189 0.133333
1 200 0
2 1 0.133333
2 10 0
2 15 2
2 33 2
2 36 0.285714
2 72 2
2 132 0.133333
2 152 0.133333
2 220 3
2 250 8
2 350 6
2 750 0
How can I get the mean of the values for every x rows with a step size of y, per subset, in pandas?
For example, mean of every 5 rows (step size =2) for value column in each subset like this:
Subset Start_position End_position Mean
1 1 48 1.2571428
1 15 132 1.0571428
1 48 189 0.8838094
2 1 36 0.8838094
2 15 132 1.2838094
2 36 220 1.110476
2 132 350 3.4533332
Is this what you were looking for:
df = pd.DataFrame({'Subset': [1]*10+[2]*12,
                   'Position': [1,10,15,43,48,89,132,152,189,200,1,10,15,33,36,72,132,152,220,250,350,750],
                   'Value': [2,3,.285714,1,0,2,2,.285714,.1333333,0,0.133333,0,2,2,.285714,2,.133333,.133333,3,8,6,0]})
averaged_df = pd.DataFrame(columns=['Subset', 'Start_position', 'End_position', 'Mean'])
window = 5
step_size = 2
for subset in df.Subset.unique():
    subset_df = df[df.Subset == subset].reset_index(drop=True)
    for i in range(0, len(subset_df), step_size):
        window_rows = subset_df.iloc[i:i+window]
        if len(window_rows) < window:
            continue
        window_average = {'Subset': window_rows.Subset.loc[0+i],
                          'Start_position': window_rows.Position[0+i],
                          'End_position': window_rows.Position.iloc[-1],
                          'Mean': window_rows.Value.mean()}
        averaged_df = averaged_df.append(window_average, ignore_index=True)
Some notes about the code:
It assumes all subsets are in order in the original df (1,1,2,1,2,2 will behave as if it was 1,1,1,2,2,2)
If there is a leftover group smaller than a window, it is skipped (e.g. the window starting at Position 132 in Subset 1, mean 0.60476, is not included)
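DataFrame.append was removed in pandas 2.0, so for newer versions here is a sketch of the same loop that collects the rows in a list and builds the frame once (same logic, only the result assembly differs):

records = []
for subset in df.Subset.unique():
    subset_df = df[df.Subset == subset].reset_index(drop=True)
    for i in range(0, len(subset_df), step_size):
        window_rows = subset_df.iloc[i:i + window]
        if len(window_rows) < window:
            continue
        records.append({'Subset': subset,
                        'Start_position': window_rows.Position.iloc[0],
                        'End_position': window_rows.Position.iloc[-1],
                        'Mean': window_rows.Value.mean()})
averaged_df = pd.DataFrame(records)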
A version-specific answer, using pandas.api.indexers.FixedForwardWindowIndexer, introduced in pandas 1.1.0:
>>> window=5
>>> step=2
>>> indexer = pd.api.indexers.FixedForwardWindowIndexer(window_size=window)
>>> df2 = df.join(df.Position.shift(-(window-1)), lsuffix='_start', rsuffix='_end')
>>> df2 = df2.assign(Mean=df2.pop('Value').rolling(window=indexer).mean()).iloc[::step]
>>> df2 = df2[df2.Position_start.lt(df2.Position_end)].dropna()
>>> df2['Position_end'] = df2['Position_end'].astype(int)
>>> df2
Subset Position_start Position_end Mean
0 1 1 48 1.257143
2 1 15 132 1.057143
4 1 48 189 0.883809
10 2 1 36 0.883809
12 2 15 132 1.283809
14 2 36 220 1.110476
16 2 132 350 3.453333

Selecting rows which match condition of group

I have a Pandas DataFrame df which looks as follows:
ID Timestamp x y
1 10 322 222
1 12 234 542
1 14 22 523
2 55 222 76
2 56 23 87
2 58 322 5436
3 100 322 345
3 150 22 243
3 160 12 765
3 170 78 65
Now, I would like to keep all rows where the Timestamp is between 12 and 155. This I could do with df[(df["Timestamp"] >= 12) & (df["Timestamp"] <= 155)]. But I would like to include only rows where all timestamps in the corresponding ID group are within the range. So in the example above it should result in the following dataframe:
ID Timestamp x y
2 55 222 76
2 56 23 87
2 58 322 5436
For ID == 1 and ID == 3, not all timestamps of the rows are in the range, which is why they are not included.
How can this be done?
You can combine groupby("ID") and filter:
df.groupby("ID").filter(lambda x: x.Timestamp.between(12, 155).all())
ID Timestamp x y
3 2 55 222 76
4 2 56 23 87
5 2 58 322 5436
Use transform with groupby and all() to check whether all items in the group match the condition:
df[df.groupby('ID').Timestamp.transform(lambda x: x.between(12,155).all())]
ID Timestamp x y
3 2 55 222 76
4 2 56 23 87
5 2 58 322 5436
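A lambda-free variant of the transform idea (a sketch of my own, same result) computes the boolean check once and broadcasts the group-wise all() back onto the rows:

# True where the row's Timestamp is in range, then require the whole ID group to be True
in_range = df['Timestamp'].between(12, 155)
df[in_range.groupby(df['ID']).transform('all')]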
