Input

| Tech ID | 4th April'22        | 3rd April'22        | 2nd April'22        |
|---------|---------------------|---------------------|---------------------|
| 123     | 2022-04-04 05:03:00 | 2022-04-03 05:08:00 | 2022-04-02 05:10:00 |
| 345     | 2022-04-04 05:37:00 | 2022-04-03 05:18:00 | 2022-04-02 05:12:00 |
| 678     | 2022-04-04 05:42:00 | 2022-04-03 05:25:00 | 2022-04-02 05:30:00 |
| 901     | 2022-04-04 05:48:00 | 2022-04-03 05:45:00 | 2022-04-02 04:08:00 |
| 367     | 2022-04-04 05:32:00 | 2022-04-03 06:08:00 | 2022-04-02 06:08:00 |

Output

| Tech ID | 4th April'22        | 3rd April'22        | 2nd April'22        | Mean   |
|---------|---------------------|---------------------|---------------------|--------|
| 123     | 2022-04-04 05:03:00 | 2022-04-03 05:08:00 | 2022-04-02 05:10:00 | result |
| 345     | 2022-04-04 05:37:00 | 2022-04-03 05:18:00 | 2022-04-02 05:12:00 | result |
| 678     | 2022-04-04 05:42:00 | 2022-04-03 05:25:00 | 2022-04-02 05:30:00 | result |
| 901     | 2022-04-04 05:48:00 | 2022-04-03 05:45:00 | 2022-04-02 04:08:00 | result |
| 367     | 2022-04-04 05:32:00 | 2022-04-03 06:08:00 | 2022-04-02 06:08:00 | result |
Depending on the version of pandas that you are using, you may be able to use the DataFrame.mean method to compute the row-wise mean.
You will need to set numeric_only to False, per the pandas documentation:
numeric_only: bool, default None
Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
df['mean'] = df.mean(axis=1, numeric_only=False)
This works for me on pandas 1.3.4, assuming the dtypes of the three source columns are all datetime64[ns].
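A minimal, self-contained sketch of the same idea (column names taken from the question; restricting the mean to the three date columns keeps Tech ID out of the calculation, and the behaviour is as described for the pandas 1.3.x line cited above):

import pandas as pd

# Hypothetical reconstruction of part of the question's frame.
df = pd.DataFrame({
    'Tech ID': [123, 345],
    "4th April'22": pd.to_datetime(['2022-04-04 05:03:00', '2022-04-04 05:37:00']),
    "3rd April'22": pd.to_datetime(['2022-04-03 05:08:00', '2022-04-03 05:18:00']),
    "2nd April'22": pd.to_datetime(['2022-04-02 05:10:00', '2022-04-02 05:12:00']),
})

date_cols = ["4th April'22", "3rd April'22", "2nd April'22"]
# Row-wise mean over the datetime columns only; all three must already be datetime64[ns].
df['Mean'] = df[date_cols].mean(axis=1, numeric_only=False)
print(df)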
Related
I have a dataframe that stores the dates of creation and upgrades of files. I would like to write a function that creates a column named 'version' keeping track of the file versions.
original dataset:
| fileID | creationDate | upgradeDate |
|--------|----------------------|----------------------|
| file_1 | 02/02/2022 15:00:00 | 02/02/2022 15:00:00 |
| file_1 | 02/02/2022 15:00:00 | 02/02/2022 15:30:00 |
| file_1 | 02/02/2022 15:00:00 | 10/02/2022 10:00:00 |
| file_2 | 03/02/2022 15:00:00 | 03/02/2022 15:00:00 |
| file_3 | 04/02/2022 15:00:00 | 04/02/2022 15:00:00 |
| file_3 | 04/02/2022 15:00:00 | 05/02/2022 15:00:00 |
what I would like:
| fileID | creationDate | upgradeDate | version |
|--------|----------------------|----------------------|---------|
| file_1 | 02/02/2022 15:00:00 | 02/02/2022 15:00:00 | 1 |
| file_1 | 02/02/2022 15:00:00 | 02/02/2022 15:30:00 | 2 |
| file_1 | 02/02/2022 15:00:00 | 10/02/2022 10:00:00 | 3 |
| file_2 | 03/02/2022 15:00:00 | 03/02/2022 15:00:00 | 1 |
| file_3 | 04/02/2022 15:00:00 | 04/02/2022 15:00:00 | 1 |
| file_3 | 04/02/2022 15:00:00 | 05/02/2022 15:00:00 | 2 |
I was able to do this with a for loop:
ids = df['fileID'].unique().tolist()
for id in ids:
    df.loc[df['fileID'] == id, 'version'] = range(1, len(df.loc[df['fileID'] == id]) + 1)
But I would prefer not to use a for loop and do it with a function or a list comprehension. Any idea?
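One common vectorized alternative (a sketch of the idea, not necessarily the answer the poster settled on) is groupby().cumcount(), which numbers the rows within each fileID group:

import pandas as pd

# Hypothetical reconstruction of a slice of the question's data.
df = pd.DataFrame({
    'fileID': ['file_1', 'file_1', 'file_1', 'file_2'],
    'upgradeDate': ['02/02/2022 15:00:00', '02/02/2022 15:30:00',
                    '10/02/2022 10:00:00', '03/02/2022 15:00:00'],
})

# cumcount() numbers rows 0, 1, 2, ... within each fileID group,
# so add 1 to start the version counter at 1.
df['version'] = df.groupby('fileID').cumcount() + 1

This assumes the rows are already ordered by upgradeDate within each fileID, as they are in the sample.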
Given the following pandas DataFrame in Python:
| ID | date |
|--------------|---------------------------------------|
| ESP | 2022-03-02 07:24:19+01:00 |
| ESP | 2022-03-02 07:24:19+01:00 |
| ESP | 2022-03-02 08:00:00+01:00 |
| UK | 2022-03-02 08:08:30+01:00 |
| ESP | 2022-03-02 09:11:50+01:00 |
| USA | 2022-03-02 10:19:11+01:00 |
| UK | 2022-03-02 10:12:11+01:00 |
| USA | 2022-03-03 08:33:22+01:00 |
| USA | 2022-03-03 09:23:22+01:00 |
| UK | 2022-03-03 12:13:22+01:00 |
| UK | 2022-03-03 12:35:22+01:00 |
With the following code, I get the DataFrame below:
def create_dataframe(df):
    df['date'] = pd.to_datetime(df['date'].astype(str).str.split('+').str[0])
    string = df['date'].groupby(df['date'].dt.floor('H')).count()
    df = pd.DataFrame({'date': string.index.date, 'start_interval': string.index.time,
                       'end_interval': (string.index + pd.DateOffset(hours=1)).time,
                       'total_rows': string.to_numpy()})
    return df
| date | start_interval | end_interval | total_rows |
|-----------------------|-------------------|-------------------|------------|
| 2022-03-02 | 07:00:00 | 08:00:00 | 2 |
| 2022-03-02 | 08:00:00 | 09:00:00 | 2 |
| 2022-03-02 | 09:00:00 | 10:00:00 | 1 |
| 2022-03-02 | 10:00:00 | 11:00:00 | 2 |
| 2022-03-03 | 08:00:00 | 09:00:00 | 1 |
| 2022-03-03 | 09:00:00 | 10:00:00 | 1 |
| 2022-03-03 | 12:00:00 | 13:00:00 | 2 |
I would like to add to the table the information provided by the 'ID' column, i.e. get this DataFrame:
| ID | date | start_interval | end_interval | total_rows |
|--------|-----------------------|-------------------|-------------------|------------|
| ESP | 2022-03-02 | 07:00:00 | 08:00:00 | 2 |
| ESP | 2022-03-02 | 08:00:00 | 09:00:00 | 1 |
| UK | 2022-03-02 | 08:00:00 | 09:00:00 | 1 |
| ESP | 2022-03-02 | 09:00:00 | 10:00:00 | 1 |
| USA | 2022-03-02 | 10:00:00 | 11:00:00 | 1 |
| UK | 2022-03-02 | 10:00:00 | 11:00:00 | 1 |
| USA | 2022-03-03 | 08:00:00 | 09:00:00 | 1 |
| USA | 2022-03-03 | 09:00:00 | 10:00:00 | 1 |
| UK | 2022-03-03 | 12:00:00 | 13:00:00 | 2 |
How could I modify the supplied code to obtain the resulting table? Thank you in advance for your help.
Does this produce what you are looking for:
result = (
    df
    .groupby(['ID', df['date'].dt.floor('H')]).agg(total_rows=('date', 'count'))
    .reset_index()
    .assign(
        start_interval=lambda df: df['date'].dt.time,
        end_interval=lambda df: (df['date'] + pd.Timedelta(hours=1)).dt.time,
        date=lambda df: df['date'].dt.date
    )
)
Result:
ID date total_rows start_interval end_interval
0 ESP 2022-03-02 2 07:00:00 08:00:00
1 ESP 2022-03-02 1 08:00:00 09:00:00
2 ESP 2022-03-02 1 09:00:00 10:00:00
3 UK 2022-03-02 1 08:00:00 09:00:00
4 UK 2022-03-02 1 10:00:00 11:00:00
5 UK 2022-03-03 2 12:00:00 13:00:00
6 USA 2022-03-02 1 10:00:00 11:00:00
7 USA 2022-03-03 1 08:00:00 09:00:00
8 USA 2022-03-03 1 09:00:00 10:00:00
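If the row order of the desired table matters (chronological rather than grouped by ID), the result can be re-sorted afterwards, for example:

result = result.sort_values(['date', 'start_interval'], ignore_index=True)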
I have a dataframe that has this overall structure:
(I know it could be better, but this is what I have to work with.)
| patient_id | inclusion_timestamp | pre_event_1 | post_event_1 | post_event_2 |
|------------|---------------------|------------------|------------------|------------------|
| 1 | NaN | 27-06-2020 12:26 | NaN | NaN |
| 1 | 28-06-2020 13:05 | NaN | NaN | NaN |
| 1 | NaN | NaN | 29-06-2020 14:00 | NaN |
| 1 | NaN | NaN | NaN | 29-06-2020 23:57 |
| 2 | NaN | 29-06-2020 10:11 | NaN | NaN |
| 2 | 29-06-2020 18:26 | NaN | NaN | NaN |
| 2 | NaN | NaN | 30-06-2020 19:36 | NaN |
| 2 | NaN | NaN | NaN | 31-06-2020 21:20 |
| 3 | NaN | 29-06-2020 06:35 | NaN | NaN |
| 3 | NaN | 29-06-2020 07:28 | NaN | NaN |
| 3 | 30-06-2020 09:06 | NaN | NaN | NaN |
| 3 | NaN | NaN | NaN | 01-07-2020 12:10 |
and so forth.
The only way I know to do calculations relative to the inclusion_timestamp is to forward-fill it. However, this yields wrong results for the pre_event_1 field, since its value typically appears in a row before the inclusion_timestamp.
Is there any way to do forward and backward fill, but only within the same patient_id?
This way, the resulting dataframe will look like so:
| patient_id | inclusion_timestamp | pre_event_1 | post_event_1 | post_event_2 |
|------------|---------------------|------------------|------------------|------------------|
| 1 | 28-06-2020 13:05 | 27-06-2020 12:26 | NaN | NaN |
| 1 | 28-06-2020 13:05 | NaN | NaN | NaN |
| 1 | 28-06-2020 13:05 | NaN | 29-06-2020 14:00 | NaN |
| 1 | 28-06-2020 13:05 | NaN | NaN | 29-06-2020 23:57 |
| 2 | 29-06-2020 18:26 | 29-06-2020 10:11 | NaN | NaN |
| 2 | 29-06-2020 18:26 | NaN | NaN | NaN |
| 2 | 29-06-2020 18:26 | NaN | 30-06-2020 19:36 | NaN |
| 2 | 29-06-2020 18:26 | NaN | NaN | 31-06-2020 21:20 |
| 3 | 30-06-2020 09:06 | 29-06-2020 06:35 | NaN | NaN |
| 3 | 30-06-2020 09:06 | 29-06-2020 07:28 | NaN | NaN |
| 3 | 30-06-2020 09:06 | NaN | NaN | NaN |
| 3 | 30-06-2020 09:06 | NaN | NaN | 01-07-2020 12:10 |
I think the answer is to iterate over the index column and then apply forward and backward fill within each patient_id, but I can't get my code to work...
Use DataFrame.groupby on the patient_id column and apply to ffill and bfill:
df['inclusion_timestamp'] = df.groupby('patient_id')['inclusion_timestamp']\
.apply(lambda x: x.ffill().bfill())
Or another idea using DataFrame.groupby with Series.combine_first:
g = df.groupby('patient_id')['inclusion_timestamp']
df['inclusion_timestamp'] = g.ffill().combine_first(g.bfill())
Another idea using two successive Series.groupby:
df['inclusion_timestamp'] = df['inclusion_timestamp'].groupby(df['patient_id'])\
.ffill().groupby(df['patient_id']).bfill()
Result:
patient_id inclusion_timestamp pre_event_1 post_event_1 post_event_2
0 1 28-06-2020 13:05 27-06-2020 12:26 NaN NaN
1 1 28-06-2020 13:05 NaN NaN NaN
2 1 28-06-2020 13:05 NaN 29-06-2020 14:00 NaN
3 1 28-06-2020 13:05 NaN NaN 29-06-2020 23:57
4 2 29-06-2020 18:26 29-06-2020 10:11 NaN NaN
5 2 29-06-2020 18:26 NaN NaN NaN
6 2 29-06-2020 18:26 NaN 30-06-2020 19:36 NaN
7 2 29-06-2020 18:26 NaN NaN 31-06-2020 21:20
8 3 30-06-2020 09:06 29-06-2020 06:35 NaN NaN
9 3 30-06-2020 09:06 29-06-2020 07:28 NaN NaN
10 3 30-06-2020 09:06 NaN NaN NaN
11 3 30-06-2020 09:06 NaN NaN 01-07-2020 12:10
Performance (measured using timeit):
df.shape
(1200000, 5)
%%timeit -n10 #Method 1 (Best Method)
263 ms ± 1.72 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit -n10 #Method 2
342 ms ± 1.58 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit -n10 #Method3
297 ms ± 4.83 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
I have the following dataframe:
| start_time | end_time | id |
|---------------------|---------------------|-----|
| 2017-03-30 01:00:00 | 2017-03-30 01:15:30 |1 |
| 2017-03-30 02:02:00 | 2017-03-30 03:30:00 |4 |
| 2017-03-30 03:37:00 | 2017-03-30 03:39:00 |7 |
| 2017-03-30 03:41:30 | 2017-03-30 04:50:00 |8 |
| 2017-03-30 07:10:00 | 2017-03-30 07:10:30 |10 |
| 2017-03-30 07:11:00 | 2017-03-30 07:20:00 |13 |
| 2017-03-30 07:22:00 | 2017-03-30 08:00:00 |15 |
| 2017-03-30 10:00:00 | 2017-03-30 10:03:00 |20 |
I would like to group rows under the same id when end_time of row "i-1" is at most 900 seconds before start_time of row "i".
Basically, the output for the example above would be:
| start_time | end_time | id |
|---------------------|---------------------|-----|
| 2017-03-30 01:00:00 | 2017-03-30 01:15:30 |1 |
| 2017-03-30 02:02:00 | 2017-03-30 03:30:00 |4 |
| 2017-03-30 03:37:00 | 2017-03-30 03:39:00 |4 |
| 2017-03-30 03:41:30 | 2017-03-30 04:50:00 |4 |
| 2017-03-30 07:10:00 | 2017-03-30 07:10:30 |10 |
| 2017-03-30 07:11:00 | 2017-03-30 07:20:00 |10 |
| 2017-03-30 07:22:00 | 2017-03-30 08:00:00 |10 |
| 2017-03-30 10:00:00 | 2017-03-30 10:03:00 |20 |
I achieved it with the following code, but I'm sure there's a more elegant (and more efficient) way to do it:
df['endTime_delayed'] = df.end_time.shift(1)
df['id_delayed'] = df['id'].shift(1)
for (i, row) in df.iterrows():
    if (row.start_time - row.endTime_delayed).seconds <= 900:
        df.id.iloc[i] = df.id_delayed.iloc[i]
        try:
            df.id_delayed.iloc[i+1] = df.id.iloc[i]
        except:
            break
mask and ffill
diff = df.start_time.sub(df.end_time.shift())
mask = diff < pd.Timedelta(900, unit='s')
df.id.mask(mask).ffill().astype(df.id.dtype)
0 1
1 4
2 4
3 4
4 10
5 10
6 10
7 20
Name: id, dtype: int64
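To keep the regrouped ids, assign the result back to the DataFrame, e.g.:

df['id'] = df.id.mask(mask).ffill().astype(df.id.dtype)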
I would like to add columns to a time-indexed pandas DataFrame which contain the rate of change over the last n hours for each of the existing columns. I have accomplished this with the following code; however, it is too slow for my needs (probably because it loops over every index of each column).
Is there a (computationally) faster way to do this?
roc_hours = 12
tol = 1e-10
for c in ts.columns:
    c_roc = c + ' +++ RoC ' + str(roc_hours) + 'h'
    ts[c_roc] = np.nan
    for i in ts.index[np.isfinite(ts[c])]:
        df = ts[c][i - np.timedelta64(roc_hours, 'h'):i]
        X = (df.index.values - df.index.values.min()).astype('Int64') * 2.77778e-13  # hours back
        Y = df.values
        if Y.std() > tol and X.shape[0] > 1:
            fit = np.polyfit(X, Y, 1)
            ts[c_roc][i] = fit[0]
        else:
            ts[c_roc][i] = 0
Edit: the input dataframe ts is irregularly sampled and can contain NaNs. First few lines of input ts:
+---------------------+-------------------+------+------+--------------------+-------------------+------------------+
| WCT | a | b | c | d | e | f |
+---------------------+-------------------+------+------+--------------------+-------------------+------------------+
| 2011-09-04 20:00:00 | | | | | | |
| 2011-09-04 21:00:00 | | | | | | |
| 2011-09-04 22:00:00 | | | | | | |
| 2011-09-04 23:00:00 | | | | | | |
| 2011-09-05 02:00:00 | 93.0 | 97.0 | 20.0 | 209.0 | 85.0 | 98.0 |
| 2011-09-05 03:00:00 | 74.14285714285714 | 97.0 | 20.0 | 194.14285714285717 | 74.42857142857143 | 98.0 |
| 2011-09-05 04:00:00 | 67.5 | 98.5 | 20.0 | 176.0 | 75.0 | 98.0 |
| 2011-09-05 05:00:00 | 72.0 | 98.5 | 20.0 | 176.0 | 75.0 | 98.0 |
| 2011-09-05 07:00:00 | 80.0 | 93.0 | 19.0 | 186.0 | 71.0 | 97.0 |
| 2011-09-05 08:00:00 | 80.0 | 93.0 | 19.0 | 186.0 | 71.0 | 97.0 |
| 2011-09-05 09:00:00 | 78.5 | 98.0 | 19.0 | 186.0 | 71.0 | 97.0 |
| 2011-09-05 10:00:00 | 73.0 | 98.0 | 19.0 | 186.0 | 71.0 | 97.0 |
| 2011-09-05 11:00:00 | 77.0 | 98.0 | 18.0 | 175.0 | 87.0 | 97.0999984741211 |
| 2011-09-05 12:00:00 | 78.0 | 98.0 | 19.0 | 163.0 | 57.0 | 98.4000015258789 |
| 2011-09-05 15:00:00 | 78.0 | 98.0 | 19.0 | 163.0 | 57.0 | 98.4000015258789 |
+---------------------+-------------------+------+------+--------------------+-------------------+------------------+
Edit 2
After profiling, the bottleneck is in the slicing step: df = ts[c][i - np.timedelta64(roc_hours, 'h'):i]. This line pulls out observations time-stamped between now-roc_hours and now. It's very handy syntax, but is taking up the bulk of the compute time.
Works on a dataset of mine, haven't checked on yours:
import pandas as pd
from numpy import polyfit
from matplotlib import style
style.use('ggplot')

# ... acquire a dataframe named *water* with a column *value*
WINDOW = 10

ax = water.value.plot()
roll = pd.rolling_mean(water.value, WINDOW)
roll.plot(ax=ax)

def lintrend(df):
    df = df.tolist()
    m, b = polyfit(range(len(df)), df, 1)
    return m

linny = pd.rolling_apply(water.value, WINDOW, lintrend)
linny.plot(ax=ax)
Casting to a list inside lintrend, after rolling_apply has already handed it a numpy.ndarray, seems inelegant. Suggestions?
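For reference, pd.rolling_mean and pd.rolling_apply were removed in later pandas releases; a rough sketch of the same idea with the current Series.rolling API (water and value are the hypothetical names used in the answer above) would look like:

import numpy as np
import pandas as pd

# Hypothetical stand-in for the *water* frame, only to make the sketch runnable.
water = pd.DataFrame(
    {'value': np.sin(np.linspace(0, 6, 60))},
    index=pd.date_range('2011-09-04 20:00', periods=60, freq='H'),
)

WINDOW = 10

def lintrend(values):
    # With raw=True, rolling.apply passes a plain numpy array,
    # so no list conversion is needed before the fit.
    slope, _intercept = np.polyfit(np.arange(len(values)), values, 1)
    return slope

roll = water['value'].rolling(WINDOW).mean()                      # replaces pd.rolling_mean
linny = water['value'].rolling(WINDOW).apply(lintrend, raw=True)  # replaces pd.rolling_apply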