Randomly drop % of rows by condition in polars - python

Imagine we have the following polars dataframe:
Feature 1  Feature 2  Labels
100        25         1
150        18         0
200        15         0
230        28         0
120        12         1
130        34         1
150        23         1
180        25         0
Now using polars we want to drop every row with Labels == 0 with 50% probability. An example output would be the following:
Feature 1  Feature 2  Labels
100        25         1
200        15         0
230        28         0
120        12         1
130        34         1
150        23         1
I think filter and sample might be handy... I have something but it is not working:
df = df.drop(df.filter(pl.col("Labels") == 0).sample(frac=0.5))
How can I make it work?

You can use polars.DataFrame.vstack (note that recent polars versions renamed the sample parameter frac to fraction):

df = (df.filter(pl.col("Labels") == 0).sample(frac=0.5)
        .vstack(df.filter(pl.col("Labels") != 0))
        .sample(frac=1, shuffle=True))

Related

Identify a code by quantity intervals in a pandas DataFrame

Given the following DataFrame in pandas:
avg_time_1  avg_time_2  avg_time_3
1200        34          1
90          45          3600
0           4           1
0           4           50
80          4           60
82          40          65
I want to get a new DataFrame from the previous one, such that it assigns a code to each row depending on how the values of the three avg_time columns compare with the following thresholds:
CODE-1: All values are less than 5.
CODE-2: Some value is between 5 and 100.
CODE-3: All values are between 5 and 100.
CODE-4: Some value is higher than 1000.
Applying the function, we will obtain the following DataFrame.
avg_time_1  avg_time_2  avg_time_3  codes
1200        34          1           4
90          45          3600        4
0           4           1           1
0           4           50          2
80          4           60          2
82          40          65          3
Thank you in advance for your response.
You can try np.select; note that you should put the higher-priority condition first:

import numpy as np

df['codes'] = np.select(
    [df.lt(5).all(1), df.gt(1000).any(1),
     df.apply(lambda col: col.between(5, 100)).all(1),
     df.apply(lambda col: col.between(5, 100)).any(1)],
    [1, 4, 3, 2],
    default=0
)
print(df)
avg_time_1 avg_time_2 avg_time_3 codes
0 1200 34 1 4
1 90 45 3600 4
2 0 4 1 1
3 0 4 50 2
4 80 4 60 2
5 82 40 65 3
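The priority rule is worth emphasizing: np.select evaluates the conditions in order and, for each element, the first condition that is True wins. A minimal sketch:

```python
import numpy as np

x = np.array([3, 7, 1500])

# Conditions are checked in order; the first True wins per element.
# 3 matches "< 5" -> 1; 7 matches "between 5 and 100" -> 3; 1500 matches "> 1000" -> 4.
codes = np.select(
    [x < 5, x > 1000, (x >= 5) & (x <= 100)],
    [1, 4, 3],
    default=2,
)
print(codes.tolist())  # [1, 3, 4]
```

If the "> 1000" condition were listed after the "between" one, nothing would change here, but with overlapping conditions the ordering decides which code a row receives.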

Loop through dataframe in python to select specific row

I have time-series data for 5864 ICU patients, and my dataframe looks like this. Each row is one hour of the respective patient's ICU stay.
HR   SBP  DBP  ICULOS  Sepsis  P_ID
92   120  80   1       0       0
98   115  85   2       0       0
93   125  75   3       1       0
95   130  90   4       1       0
102  120  80   1       0       1
109  115  75   2       0       1
94   135  100  3       0       1
97   100  70   4       1       1
85   120  80   5       1       1
88   115  75   6       1       1
93   125  85   1       0       2
78   130  90   2       0       2
115  140  110  3       0       2
102  120  80   4       0       2
98   140  110  5       1       2
I want to select the ICULOS where Sepsis = 1 (first hour only) based on patient ID. Like in P_ID = 0, Sepsis = 1 at ICULOS = 3. I did this on a single patient (the dataframe having data of only a single patient) using the code:
x = df[df['Sepsis'] == 1]["ICULOS"].values[0]
print("ICULOS at which Sepsis Label = 1 is:", x)
# Output
ICULOS at which Sepsis Label = 1 is: 46
If I want to check this for each P_ID, I would have to do it 5864 times. Can someone help me with the code using a loop? The loop should go to each P_ID and return the ICULOS where Sepsis = 1. Looking forward to your help.
for x in df['P_ID'].unique():
    # note: @ (not #) references a Python variable inside query();
    # this raises IndexError for patients that never reach Sepsis == 1
    print(df.query('P_ID == @x and Sepsis == 1')['ICULOS'].iloc[0])
First, filter the rows which have Sepsis = 1. That automatically drops the P_IDs that never have Sepsis equal to 1, so you have fewer patients to iterate over.

df1 = df[df.Sepsis == 1]
for pid in df.P_ID.unique():
    if pid not in df1.P_ID.values:
        print(f"P_ID: {pid} - has no ICULOS at Sepsis Label = 1")
    else:
        iculos = df1[df1.P_ID == pid].ICULOS.values[0]
        print(f"P_ID: {pid} - ICULOS at which Sepsis Label = 1 is: {iculos}")
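If the loop is only needed to collect one value per patient, a vectorized groupby does the same job in a single pass with no explicit iteration over the 5864 patients. A sketch on a trimmed-down version of the sample data:

```python
import pandas as pd

# First two patients from the example table.
df = pd.DataFrame({
    'ICULOS': [1, 2, 3, 4, 1, 2, 3, 4, 5, 6],
    'Sepsis': [0, 0, 1, 1, 0, 0, 0, 1, 1, 1],
    'P_ID':   [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
})

# Earliest hour at which Sepsis == 1, per patient; patients that never
# reach Sepsis == 1 simply do not appear in the result.
first_sepsis = df[df.Sepsis == 1].groupby('P_ID')['ICULOS'].min()
print(first_sepsis.to_dict())  # {0: 3, 1: 4}
```

Using min() rather than first() makes the result independent of row order within each patient.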

Average of every x rows with a step size of y per each subset using pandas

I have a pandas data frame like this:
Subset Position Value
1 1 2
1 10 3
1 15 0.285714
1 43 1
1 48 0
1 89 2
1 132 2
1 152 0.285714
1 189 0.133333
1 200 0
2 1 0.133333
2 10 0
2 15 2
2 33 2
2 36 0.285714
2 72 2
2 132 0.133333
2 152 0.133333
2 220 3
2 250 8
2 350 6
2 750 0
I want to know how can I get the mean of values for every "x" row with "y" step size per subset in pandas?
For example, mean of every 5 rows (step size =2) for value column in each subset like this:
Subset Start_position End_position Mean
1 1 48 1.2571428
1 15 132 1.0571428
1 48 189 0.8838094
2 1 36 0.8838094
2 15 132 1.2838094
2 36 220 1.110476
2 132 350 3.4533332
Is this what you were looking for:

df = pd.DataFrame({'Subset': [1]*10 + [2]*12,
                   'Position': [1, 10, 15, 43, 48, 89, 132, 152, 189, 200, 1, 10, 15, 33, 36, 72, 132, 152, 220, 250, 350, 750],
                   'Value': [2, 3, .285714, 1, 0, 2, 2, .285714, .1333333, 0, 0.133333, 0, 2, 2, .285714, 2, .133333, .133333, 3, 8, 6, 0]})

window = 5
step_size = 2
results = []
for subset in df.Subset.unique():
    subset_df = df[df.Subset == subset].reset_index(drop=True)
    # iterate over the subset, not the whole frame
    for i in range(0, len(subset_df), step_size):
        window_rows = subset_df.iloc[i:i + window]
        if len(window_rows) < window:
            continue
        results.append({'Subset': window_rows.Subset.iloc[0],
                        'Start_position': window_rows.Position.iloc[0],
                        'End_position': window_rows.Position.iloc[-1],
                        'Mean': window_rows.Value.mean()})
# DataFrame.append was removed in pandas 2.0, so collect rows in a list instead
averaged_df = pd.DataFrame(results)
Some notes about the code:
It assumes all subsets are contiguous in the original df (1,1,2,1,2,2 will behave as if it was 1,1,1,2,2,2)
If a trailing group is smaller than a window, it is skipped (e.g. Subset 1, positions 132-200, mean 0.60476, is not included)
A version-specific alternative uses pandas.api.indexers.FixedForwardWindowIndexer, introduced in pandas 1.1.0:
>>> window=5
>>> step=2
>>> indexer = pd.api.indexers.FixedForwardWindowIndexer(window_size=window)
>>> df2 = df.join(df.Position.shift(-(window-1)), lsuffix='_start', rsuffix='_end')
>>> df2 = df2.assign(Mean=df2.pop('Value').rolling(window=indexer).mean()).iloc[::step]
>>> df2 = df2[df2.Position_start.lt(df2.Position_end)].dropna()
>>> df2['Position_end'] = df2['Position_end'].astype(int)
>>> df2
Subset Position_start Position_end Mean
0 1 1 48 1.257143
2 1 15 132 1.057143
4 1 48 189 0.883809
10 2 1 36 0.883809
12 2 15 132 1.283809
14 2 36 220 1.110476
16 2 132 350 3.453333

Comparing row-wise values in a pandas data frame

My data frame looks like this (almost 10M rows):
date value1 value2
01/02/2019 10 120
02/02/2019 21 130
03/02/2019 0 140
04/02/2019 24 150
05/02/2019 29 160
06/02/2019 32 160
07/02/2019 54 160
08/02/2019 32 180
01/02/2019 -3 188
My final output looks like -
date value1 value2 result
01/02/2019 10 120 1
02/02/2019 21 130 1
03/02/2019 0 140 0
04/02/2019 24 150 1
05/02/2019 29 160 1
06/02/2019 32 160 0
07/02/2019 54 160 0
08/02/2019 32 180 1
01/02/2019 -3 188 0
My logic is: if value1 <= 0, or value2 is the same across 3 consecutive rows, then result is 0; otherwise 1.
How can I do this in pandas?
You can try defining your own function that handles runs of consecutive values and checks where value1 is above 0, then group by a custom series identifying those runs, and finally apply the custom function:
import pandas as pd
from io import StringIO

s = '''date,value1,value2
01/02/2019,10,120
02/02/2019,21,130
03/02/2019,0,140
04/02/2019,24,150
05/02/2019,29,160
06/02/2019,32,160
07/02/2019,54,160
08/02/2019,32,180
01/02/2019,-3,188'''
df = pd.read_csv(StringIO(s), header=0, index_col=0)

def fun(group_df):
    # runs of 3 or more equal value2 rows get 0; otherwise check value1 > 0
    if group_df.shape[0] >= 3:
        return pd.Series([0] * group_df.shape[0], index=group_df.index)
    else:
        return group_df.value1 > 0

consecutives = (df.value2 != df.value2.shift()).cumsum()
df['results'] = (df.groupby(consecutives).apply(fun)
                   .reset_index(level=0, drop=True).astype(int))
Here fun checks whether a run of equal value2 values is 3 rows or longer, and otherwise whether value1 is greater than 0; the results are:
print(df)
# value1 value2 results
# date
# 01/02/2019 10 120 1
# 02/02/2019 21 130 1
# 03/02/2019 0 140 0
# 04/02/2019 24 150 1
# 05/02/2019 29 160 0
# 06/02/2019 32 160 0
# 07/02/2019 54 160 0
# 08/02/2019 32 180 1
# 01/02/2019 -3 188 0
Something like this
np.where((df.value1.le(0)) | (df.value2.diff().eq(0)), 0, 1)
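Spelled out with the sample data (a sketch using the column names from the question), this one-liner reproduces the expected result column, since diff().eq(0) flags only the second and later rows of a run of equal value2:

```python
import numpy as np
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({'value1': [10, 21, 0, 24, 29, 32, 54, 32, -3],
                   'value2': [120, 130, 140, 150, 160, 160, 160, 180, 188]})

# 0 where value1 <= 0 or value2 equals the previous row's value, else 1.
df['result'] = np.where(df.value1.le(0) | df.value2.diff().eq(0), 0, 1)
print(df.result.tolist())  # [1, 1, 0, 1, 1, 0, 0, 1, 0]
```

Note this matches the expected output above exactly: the first row of the 160-run (05/02/2019) stays 1, and only the repeats get 0.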

Create a text file using the pandas dataframes

I am new to Python. I have the following dataframe:
Document_ID OFFSET PredictedFeature word
0 0 2000 abcd
0 8 2000 is
0 16 2200 a
0 23 2200 good
0 25 315 XXYYZZ
1 0 2100 but
1 5 2100 it
1 7 2100 can
1 10 315 XXYYZZ
Now, what I am trying to do with this dataframe is produce a file in a readable format like:

abcd is 2000, a good 2200
but it can 2100,

PredictedData feature offset endoffset
abcd is 2000 0 8
a good 2200 16 23
NewLine 315 25 25
but it can 2100 0 7

That is, while the same PredictedFeature keeps appearing on consecutive rows, I concatenate the words that share its value; when the feature is 315, I start a new line.
So, is there any way I can do this? Any help will be appreciated.
Thanks
IIUC, you can do groupby():

(df.groupby(['Document_ID', 'PredictedFeature'], as_index=False)
   .agg({'word': ' '.join,
         'OFFSET': ('min', 'max')})
)
Output:
Document_ID PredictedFeature word OFFSET
join min max
0 0 315 XXYYZZ 25 25
1 0 2000 abcd is 0 8
2 0 2200 a good 16 23
3 1 315 XXYYZZ 10 10
4 1 2100 but it can 0 7
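A variant of the same idea with named aggregation gives flat column names matching the desired PredictedData/offset/endoffset layout (a sketch using the sample data; sort=False keeps the groups in first-appearance order):

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({'Document_ID': [0, 0, 0, 0, 0, 1, 1, 1, 1],
                   'OFFSET': [0, 8, 16, 23, 25, 0, 5, 7, 10],
                   'PredictedFeature': [2000, 2000, 2200, 2200, 315,
                                        2100, 2100, 2100, 315],
                   'word': ['abcd', 'is', 'a', 'good', 'XXYYZZ',
                            'but', 'it', 'can', 'XXYYZZ']})

# Named aggregation flattens the MultiIndex columns that the dict form produces.
out = (df.groupby(['Document_ID', 'PredictedFeature'], as_index=False, sort=False)
         .agg(PredictedData=('word', ' '.join),
              offset=('OFFSET', 'min'),
              endoffset=('OFFSET', 'max')))
```

out.to_string(index=False) (or out.to_csv) can then be written to the text file.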
