Pandas: duplicating dataframe entries while column higher or equal to 0 - python

I have a dataframe containing clinical readings of hospital patients, for example a similar dataframe could look like this
heartrate pid time
0 67 151 0.0
1 75 151 1.2
2 78 151 2.5
3 99 186 0.0
In reality there are many more columns, but I will just keep those 3 to make the example more concise.
I would like to "expand" the dataset. In short, I would like to be able to give an argument n_times_back and another argument interval.
For each iteration, which corresponds to for i in range (n_times_back + 1), we do the following:
Create a new, unique pid [OLD ID | i] (Although as long as the new
pid is unique for each duplicated entry, the exact name isn't
really important to me so feel free to change this if it makes it
easier)
For every patient (pid), remove the rows with time column which is
more than the final time of that patient - i * interval. For
example if i * interval = 2.0 and the times associated to one pid
are [0, 0.5, 1.5, 2.8], the new times will be [0, 0.5], as final
time - 2.0 = 0.8
iterate
Since I realize that explaining this textually is a bit messy, here is an example.
With the dataset above, if we let n_times_back = 1 and interval=1 then we get
heartrate pid time
0 67 15100 0.0
1 75 15100 1.2
2 78 15100 2.5
3 67 15101 0.0
4 75 15101 1.2
5 99 18600 0.0
For n_times_back = 2, the result would be
heartrate pid time
0 67 15100 0.0
1 75 15100 1.2
2 78 15100 2.5
3 67 15101 0.0
4 75 15101 1.2
5 67 15102 0.0
6 99 18600 0.0
n_times_back = 3 and above would lead to the same result as n_times_back = 2, as no patient data goes below that point in time
I have written code for this.
def expand_df(df, n_times_back, interval):
for curr_patient in df['pid'].unique():
patient_data = df[df['pid'] == curr_patient]
final_time = patient_data['time'].max()
for i in range(n_times_back + 1):
new_data = patient_data[patient_data['time'] <= final_time - i * interval]
new_data['pid'] = patient_data['pid'].astype(str) + str(i).zfill(2)
new_data['pid'] = new_data['pid'].astype(int)
#check if there is any time index left, if not don't add useless entry to dataframe
if(new_data['time'].count()>0):
df = df.append(new_data)
df = df[df['pid'] != curr_patient] # remove original patient data, now duplicate
df.reset_index(inplace = True, drop = True)
return df
As far as functionality goes, this code works as intended. However, it is very slow. I am working with a dataframe of 30'000 patients and the code has been running for over 2 hours now.
Is there a way to use pandas operations to speed this up? I have looked around but so far I haven't managed to reproduce this functionality with high level pandas functions

ended up using a groupby function and breaking when no more times were available, as well as creating an "index" column that I merge with the "pid" column at the end.
def expand_df(group, n_times, interval):
df = pd.DataFrame()
final_time = group['time'].max()
for i in range(n_times + 1):
new_data = group[group['time'] <= final_time - i * interval]
new_data['iteration'] = str(i).zfill(2)
#check if there is any time index left, if not don't add useless entry to dataframe
if(new_data['time'].count()>0):
df = df.append(new_data)
else:
break
return df
new_df = df.groupby('pid').apply(lambda x : expand_df(x, n_times_back, interval))
new_df = new_df.reset_index(drop=True)
new_df['pid'] = new_df['pid'].map(str) + new_df['iteration']

Related

Summarising features with multiple values in Python for Machine Learning model

I have a data file containing different foetal ultrasound measurements. The measurements are collected at different points during pregnancy, like so:
PregnancyID MotherID gestationalAgeInWeeks abdomCirc
0 0 14 150
0 0 21 200
1 1 20 294
1 1 25 315
1 1 30 350
2 2 8 170
2 2 9 180
2 2 18 NaN
As you can see from the table above, I have multiple measurements per pregnancy (between 1 and 26 observations each).
I want to summarise the ultrasound measurements somehow such that I can replace the multiple measurements with a fixed amount of features per pregnancy. So I thought of creating 3 new features, one for each trimester of pregnancy that would hold the maximum measurement recorded during that trimester:
abdomCirc1st: this feature would hold the maximum value of all abdominal circumference measurements measured between 0 to 13 Weeks
abdomCirc2nd: this feature would hold the maximum value of all abdominal circumference measurements measured between 14 to 26 Weeks
abdomCirc3rd: this feature would hold the maximum value of all abdominal circumference measurements measured between 27 to 40 Weeks
So my final dataset would look like this:
PregnancyID MotherID abdomCirc1st abdomCirc2nd abdomCirc3rd
0 0 NaN 200 NaN
1 1 NaN 315 350
2 2 180 NaN NaN
The reason for using the maximum here is that a larger abdominal circumference is associated with the adverse outcome I am trying to predict.
But I am quite confused about how to go about this. I have used the groupby function previously to derive certain statistical features from the multiple measurements, however this is a more complex task.
What I want to do is the following:
Group all abdominal circumference measurements that belong to the same pregnancy into 3 trimesters based on gestationalAgeInWeeks value
Compute the maximum value of all abdominal circumference measurements within each trimester, and assign this value to the relevant feature; abdomCirc1st, abdomCir2nd or abdomCirc3rd.
I think I have to do something along the lines of:
df["abdomCirc1st"] = df.groupby(['MotherID', 'PregnancyID', 'gestationalAgeInWeeks'])["abdomCirc"].transform('max')
But this code does not check what trimester the measurement was taken in (gestationalAgeInWeeks). I would appreciate some help with this task.
You can try this. a bit of a complicated query but it seems to work:
(df.groupby(['MotherID', 'PregnancyID'])
.apply(lambda d: d.assign(tm = (d['gestationalAgeInWeeks']+ 13 - 1 )// 13))
.groupby('tm')['abdomCirc']
.apply(max))
.unstack()
)
produces
tm 1 2 3
MotherID PregnancyID
0 0 NaN 200.0 NaN
1 1 NaN 294.0 350.0
2 2 180.0 NaN NaN
Let's unpick this a bit. First we groupby on MontherId, PregnancyID. Then we apply a function to each grouped dataframe (d)
For each d, we create a 'trimester' column 'tm' via assign (I assume I got the math right here, but correct it if it is wrong!), then we groupby by 'tm' and apply max. For each sub-dataframe d then we obtain a Series which is tm:max(abdomCirc).
Then we unstack() that moves tm to the column names
You may want to rename this columns later, but I did not bother
Solution 2
Come to think of it you can simplify the above a bit:
(df.assign(tm = (df['gestationalAgeInWeeks']+ 13 - 1 )// 13))
.drop(columns = 'gestationalAgeInWeeks')
.groupby(['MotherID', 'PregnancyID','tm'])
.agg('max')
.unstack()
)
similar idea, same output.
There is a magic command called query. This should do your work for now:
abdomCirc1st = df.query('MotherID == 0 and PregnancyID == 0 and gestationalAgeInWeeks <= 13')['abdomCirc'].max()
abdomCirc2nd = df.query('MotherID == 0 and PregnancyID == 0 and gestationalAgeInWeeks >= 14 and gestationalAgeInWeeks <= 26')['abdomCirc'].max()
abdomCirc3rd = df.query('MotherID == 0 and PregnancyID == 0 and gestationalAgeInWeeks >= 27 and gestationalAgeInWeeks <= 40')['abdomCirc'].max()
If you want something more automatic (and not manually changing the values of your ID's: MotherID and PregnancyID, every time for each different group of rows), you have to combine it with groupby (as you did on your own)
Check this as well: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html

Average between points based on time

I'm trying to use Python to get time taken, as well as average speed between an object traveling between points.
The data looks somewhat like this,
location initialtime id speed distance
1 2020-09-18T12:03:14.485952Z car_uno 72 9km
2 2020-09-18T12:10:14.485952Z car_uno 83 8km
3 2020-09-18T11:59:14.484781Z car_duo 70 9km
7 2020-09-18T12:00:14.484653Z car_trio 85 8km
8 2020-09-18T12:12:14.484653Z car_trio 70 7.5km
The function I'm using currently is essentially like this,
Speeds.index = pd.to_datetime(Speeds.index)
..etc
Now if I were doing this usually, I would just take the unique values of the id's,
for x in speeds.id.unique():
Speeds[speeds.id=="x"]...
But this method really isn't working.
What is the best approach for simply seeing if there are multiple id points over time, then taking the average of the speeds by that time given? Otherwise just returning the speed itself if there are not multiple values.
Is there a simpler pandas filter I could use?
Expected output is simply,
area - id - initial time - journey time - average speed.
the point is to get the average time and journey time for a vehicle going past two points
To get the average speed and journey times you can use groupby() and pass in the columns that determine one complete journey, like id or area.
import pandas as pd
from io import StringIO
data = StringIO("""
area initialtime id speed
1 2020-09-18T12:03:14.485952Z car_uno 72
2 2020-09-18T12:10:14.485952Z car_uno 83
3 2020-09-18T11:59:14.484781Z car_duo 70
7 2020-09-18T12:00:14.484653Z car_trio 85
8 2020-09-18T12:12:14.484653Z car_trio 70
""")
df = pd.read_csv(data, delim_whitespace=True)
df["initialtime"] = pd.to_datetime(df["initialtime"])
# change to ["id", "area"] if need more granular aggregation
group_cols = ["id"]
time = df.groupby(group_cols)["initialtime"].agg([max, min]).eval('max-min').reset_index(name="journey_time")
speed = df.groupby(group_cols)["speed"].mean().reset_index(name="average_speed")
pd.merge(time, speed, on=group_cols)
id journey_time average_speed
0 car_duo 00:00:00 70.0
1 car_trio 00:12:00 77.5
2 car_uno 00:07:00 77.5
I tryed to use a very intuitive solution. I'm assuming the data has already been loaded to df.
df['initialtime'] = pd.to_datetime(df['initialtime'])
result = []
for car in df['id'].unique():
_df = df[df['id'] == car].sort_values('initialtime', ascending=True)
# Where the car is leaving "from" and where it's heading "to"
_df['From'] = _df['location']
_df['To'] = _df['location'].shift(-1, fill_value=_df['location'].iloc[0])
# Auxiliary columns
_df['end_time'] = _df['initialtime'].shift(-1, fill_value=_df['initialtime'].iloc[0])
_df['end_speed'] = _df['speed'].shift(-1, fill_value=_df['speed'].iloc[0])
# Desired columns
_df['journey_time'] = _df['end_time'] - _df['initialtime']
_df['avg_speed'] = (_df['speed'] + _df['end_speed']) / 2
_df = _df[_df['journey_time'] >= pd.Timedelta(0)]
_df.drop(['location', 'distance', 'speed', 'end_time', 'end_speed'],
axis=1, inplace=True)
result.append(_df)
final_df = pd.concat(result).reset_index(drop=True)
The final DataFrame is as follows:
initialtime id From To journey_time avg_speed
0 2020-09-18 12:03:14.485952+00:00 car_uno 1 2 0 days 00:07:00 77.5
1 2020-09-18 11:59:14.484781+00:00 car_duo 3 3 0 days 00:00:00 70.0
2 2020-09-18 12:00:14.484653+00:00 car_trio 7 8 0 days 00:12:00 77.5
Here is another approach. My results are different that other posts, so I may have misunderstood the requirements. In brief, I calculated each average speed as total distance divided by total time (for each car).
from io import StringIO
import pandas as pd
# speed in km / hour; distance in km
data = '''location initial-time id speed distance
1 2020-09-18T12:03:14.485952Z car_uno 72 9
2 2020-09-18T12:10:14.485952Z car_uno 83 8
3 2020-09-18T11:59:14.484781Z car_duo 70 9
7 2020-09-18T12:00:14.484653Z car_trio 85 8
8 2020-09-18T12:12:14.484653Z car_trio 70 7.5
'''
Now create data frame and perform calculations
# create data frame
df = pd.read_csv(StringIO(data), delim_whitespace=True)
df['elapsed-time'] = df['distance'] / df['speed'] # in hours
# utility function
def hours_to_hms(elapsed):
''' Convert `elapsed` (in hours) to hh:mm:ss (round to nearest sec)'''
h, m = divmod(elapsed, 1)
m *= 60
_, s = divmod(m, 1)
s *= 60
hms = '{:02d}:{:02d}:{:02d}'.format(int(h), int(m), int(round(s, 0)))
return hms
# perform calculations
start_time = df.groupby('id')['initial-time'].min()
journey_hrs = df.groupby('id')['elapsed-time'].sum().rename('elapsed-hrs')
hms = journey_hrs.apply(lambda x: hours_to_hms(x)).rename('hh:mm:ss')
ave_speed = ((df.groupby('id')['distance'].sum()
/ df.groupby('id')['elapsed-time'].sum())
.rename('ave speed (km/hr)')
.round(2))
# assemble results
result = pd.concat([start_time, journey_hrs, hms, ave_speed], axis=1)
print(result)
initial-time elapsed-hrs hh:mm:ss \
id
car_duo 2020-09-18T11:59:14.484781Z 0.128571 00:07:43
car_trio 2020-09-18T12:00:14.484653Z 0.201261 00:12:05
car_uno 2020-09-18T12:03:14.485952Z 0.221386 00:13:17
ave speed (km/hr)
id
car_duo 70.00
car_trio 77.01
car_uno 76.79
You should provide a better dataset (ie with identical time points) so that we understand better the inputs, and an exemple of expected output so that we understand the computation of the average speed.
Thus I'm just guessing that you may be looking for df.groupby('initialtime')['speed'].mean() if df is a dataframe containing your input data.

Filter rows lesser than the cumulative maximum

I'm trying to filter rows based on a relative simple criteria. If the value for Open is less than the max value for the column until that row, it gets dropped, otherwise the row stays and is the reference value for the new max.
This is the starting example dataframe:
import pandas as pd
import numpy as np
d = {'Date':['22-01-2019','23-01-2019','24-01-2019','25-01-2019','26-01-2019'],'Open': [40,54,54,79,67], 'Close': [43,53,65,65,61]}
df = pd.DataFrame(data=d)
print(df)
In this case I would like to do the filtering on the column Open:
Date Open Close
0 22-01-2019 40 43 #Max is 40
1 23-01-2019 54 53 #54 is higher than 40 so it stays
2 24-01-2019 54 65 #This is not higher than the previous max, should get dropped
3 25-01-2019 79 80 #This is higher than 54, so it stays
4 26-01-2019 67 61 #This is not higher than 79, should get dropped
The only way I could come up to solve the problem with a for loop iterating over each row in particular, defining an auxiliary variable that records does comparison, and returns a boolean series. However it's extremely inefficient when dealing with more than 100k rows. The final goal is to perform the same filter on the Close column and join them to know in which days (the original data is every 15 minutes) both Open and Close values have risen above the highest value ever (previously) recorded.
Finally the output should look like this:
Date Open Close
0 22-01-2019 40 43
1 23-01-2019 54 53
3 25-01-2019 79 80
If doing the same operation for the Close column it should look like:
Date Open Close
0 22-01-2019 40 43
1 23-01-2019 54 53
2 24-01-2019 54 65
3 25-01-2019 79 80
The final goal (which I would know how to do once the I can get through the filtering part, but just sharing for the sake of the full case) is:
Date Open Close
0 22-01-2019 40 43
1 23-01-2019 54 53
3 25-01-2019 79 80
My solution is:
max_v = 0
list_for_filtering = []
for i, value in df.iterrows():
if value['Open'] > max_v:
max_v = value['Open']
list_for_filtering.append(True)
else:
pass
list_for_filtering.append(False)
df['T/F'] = list_for_filtering
And filter keeping only the True values
One simple solution is to compare "Open" with the shifted cummax:
# thanks to Andy L. for the simplification!
df[df['Open'] > df['Open'].cummax().shift(fill_value=-np.inf)]
Date Open Close
0 22-01-2019 40 43
1 23-01-2019 54 53
3 25-01-2019 79 65
Where,
df['Open'].cummax().shift()
0 NaN
1 40.0
2 54.0
3 54.0
4 79.0
Name: Open, dtype: float64

Computing using two dataframes in Pandas

I'm trying to compute the following:
When there are
df1 (dataframe that has speed of characters(char_speed) of subtitle that starts at start_time and ends at end_time):
char_speed start_time end_time
0 34 3 15
1 19 15 21
2 9 21 28
...
and
df2 (dataframe that has user's listening log that starts at start_time and ends at end_time with the speed that the user listened to at that interval):
start_time end_time speed
0 9.23 20.929 1.0
1 1.4 20.26 1.5
2 20.0 27.6 1.25
...
then compute the total character count during each interval:
start_time end_time speed total_char
0 9.23 20.929 1.0
1 1.4 20.26 1.5
2 20.0 27.6 1.25
...
For example, df2['total_char'].iloc[0] would be
((15-9.23)*34) + ((20.929-15)*19)
as among time period of 9.23 ~ 20.929,
during 9.23 ~ 15, the speed would be 34,
during 15 ~ 20.929, the speed would be 19
and df2['total_char'].iloc[1] would be
(3-1.4)*0 + ((15-3)*34) + ((20.26-15)*19)
as among time period of 1.4 ~ 20.26,
during 1.4 ~ 3, the speed is not found in df1, so 0
during 3 ~ 15, the speed would be 34
during 15 ~ 20.26, the speed would be 19
I'm a newbie in Pandas and I've been recently mesmerized by how Pandas can be efficient in short and simple codings, but I'm not sure if there's a way to compute this in a short and simple coding. Right now, I can only think of an way to do it without utilizing Pandas functions: calling each row of df2 and then searching through each row in df1 and then compute it.
It would be helpful if you could tell me a way to efficiently code this using Pandas. Or any recommendation of functions would be helpful too!
Thanks in advance! :)
If you aren't opposed to merging the dataframes then apply makes it easy.
df2 = pd.concat([df1, df2], axis=1, sort=False)
def speed_calc(row):
return ((row['end_time1']-row['start_time1'])*row['char_speed']) + \
((row['end_time2']-row['end_time1'])*row['char_speed'])
df2['total_char'] = df2.apply(speed_calc, axis=1)
This would require you to adjust the header names.

Pandas timeseries bins and indexing

I have some experimental data collected from a number of samples at set time intervals, in a dataframe organised like so:
Studynumber Time Concentration
1 20 80
1 40 60
1 60 40
2 15 95
2 44 70
2 65 30
Although the time intervals are supposed to be fixed, there is some variation in the data based on when they were actually collected. I want to create bins of the Time column, calculate an 'average' concentration, and then compare the difference between actual concentration and average concentration for each studynumber, at each time.
To do this, I created a column called 'roundtime', then used a groupby to calculate the mean:
data['roundtime']=data['Time'].round(decimals=-1)
meanconc = data.groupby('roundtime')['Concentration'].mean()
This gives a pandas series of the mean concentrations, with roundtime as the index. Then I want to get this back into the main frame to calculate the difference between each actual concentration and the mean concentration:
data['meanconcentration']=meanconc.loc[data['roundtime']].reset_index()['Concentration']
This works for the first 60 or so values, but then returns NaN for each entry, I think because the index of data is longer than the index of meanconcentration.
On the one hand, this looks like an indexing issue - equally, it could be that I'm just approaching this the wrong way. So my question is: a) can this method work? and b) is there another/better way of doing it? All advice welcome!
Use transform to add a column from a groupby aggregation, this will create a Series with it's index aligned to the original df so you can assign it back correctly:
In [4]:
df['meanconcentration'] = df.groupby('roundtime')['Concentration'].transform('mean')
df
Out[4]:
Studynumber Time Concentration roundtime meanconcentration
0 1 20 80 20 87.5
1 1 40 60 40 65.0
2 1 60 40 60 35.0
3 2 15 95 20 87.5
4 2 44 70 40 65.0
5 2 65 30 60 35.0

Categories

Resources