How to group rows based on difference with previous row? - python

I have the following dataframe :
| start_time | end_time | id |
|---------------------|---------------------|-----|
| 2017-03-30 01:00:00 | 2017-03-30 01:15:30 |1 |
| 2017-03-30 02:02:00 | 2017-03-30 03:30:00 |4 |
| 2017-03-30 03:37:00 | 2017-03-30 03:39:00 |7 |
| 2017-03-30 03:41:30 | 2017-03-30 04:50:00 |8 |
| 2017-03-30 07:10:00 | 2017-03-30 07:10:30 |10 |
| 2017-03-30 07:11:00 | 2017-03-30 07:20:00 |13 |
| 2017-03-30 07:22:00 | 2017-03-30 08:00:00 |15 |
| 2017-03-30 10:00:00 | 2017-03-30 10:03:00 |20 |
I would like to give rows the same id when the end_time of row "i-1" is at most 900 seconds before the start_time of row "i".
For the example above, the result would be:
| start_time | end_time | id |
|---------------------|---------------------|-----|
| 2017-03-30 01:00:00 | 2017-03-30 01:15:30 |1 |
| 2017-03-30 02:02:00 | 2017-03-30 03:30:00 |4 |
| 2017-03-30 03:37:00 | 2017-03-30 03:39:00 |4 |
| 2017-03-30 03:41:30 | 2017-03-30 04:50:00 |4 |
| 2017-03-30 07:10:00 | 2017-03-30 07:10:30 |10 |
| 2017-03-30 07:11:00 | 2017-03-30 07:20:00 |10 |
| 2017-03-30 07:22:00 | 2017-03-30 08:00:00 |10 |
| 2017-03-30 10:00:00 | 2017-03-30 10:03:00 |20 |
I achieved it through the following code but I'm sure there's a more elegant (and efficient) way to do so :
df['endTime_delayed'] = df.end_time.shift(1)
df['id_delayed'] = df['id'].shift(1)
for (i,row) in df.iterrows():
    if (row.start_time-row.endTime_delayed).seconds <= 900 :
        df.id.iloc[i] = df.id_delayed.iloc[i]
        try :
            df.id_delayed.iloc[i+1] = df.id.iloc[i]
        except :
            break

mask and ffill
diff = df.start_time.sub(df.end_time.shift())
mask = diff <= pd.Timedelta(900, unit='s')
df.id.mask(mask).ffill().astype(df.id.dtype)
0 1
1 4
2 4
3 4
4 10
5 10
6 10
7 20
Name: id, dtype: int64
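An alternative sketch, not taken from the answer above: label each run of close rows with a cumulative sum, then broadcast the first id of each run. The frame is rebuilt from the question's table so the snippet runs on its own; the names gap, same_group and group are illustrative only.

```python
import pandas as pd

# Rebuild the example frame from the question.
df = pd.DataFrame({
    "start_time": pd.to_datetime([
        "2017-03-30 01:00:00", "2017-03-30 02:02:00", "2017-03-30 03:37:00",
        "2017-03-30 03:41:30", "2017-03-30 07:10:00", "2017-03-30 07:11:00",
        "2017-03-30 07:22:00", "2017-03-30 10:00:00",
    ]),
    "end_time": pd.to_datetime([
        "2017-03-30 01:15:30", "2017-03-30 03:30:00", "2017-03-30 03:39:00",
        "2017-03-30 04:50:00", "2017-03-30 07:10:30", "2017-03-30 07:20:00",
        "2017-03-30 08:00:00", "2017-03-30 10:03:00",
    ]),
    "id": [1, 4, 7, 8, 10, 13, 15, 20],
})

# Gap between this row's start and the previous row's end (NaT for the first row).
gap = df["start_time"] - df["end_time"].shift()

# True where the row continues the previous group (gap of at most 900 seconds).
same_group = gap <= pd.Timedelta(seconds=900)

# Every row that does NOT continue a group starts a new label.
group = (~same_group).cumsum()

# Give every row the id of the first row in its group.
df["id"] = df.groupby(group)["id"].transform("first")
print(df)
```

Compared with mask and ffill, this keeps an explicit group label around, which may be handy if you also want per-group aggregates such as the overall start and end time.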

Related

How to generate a dataframe column to keep track of a file version

I have a dataframe that stores the dates of creation and upgrades of files. I would like to write a function in order to create a column named 'version' keeping track of the files versions.
original dataset:
| fileID | creationDate | upgradeDate |
|--------|----------------------|----------------------|
| file_1 | 02/02/2022 15:00:00 | 02/02/2022 15:00:00 |
| file_1 | 02/02/2022 15:00:00 | 02/02/2022 15:30:00 |
| file_1 | 02/02/2022 15:00:00 | 10/02/2022 10:00:00 |
| file_2 | 03/02/2022 15:00:00 | 03/02/2022 15:00:00 |
| file_3 | 04/02/2022 15:00:00 | 04/02/2022 15:00:00 |
| file_3 | 04/02/2022 15:00:00 | 05/02/2022 15:00:00 |
what I would like:
| fileID | creationDate | upgradeDate | version |
|--------|----------------------|----------------------|---------|
| file_1 | 02/02/2022 15:00:00 | 02/02/2022 15:00:00 | 1 |
| file_1 | 02/02/2022 15:00:00 | 02/02/2022 15:30:00 | 2 |
| file_1 | 02/02/2022 15:00:00 | 10/02/2022 10:00:00 | 3 |
| file_2 | 03/02/2022 15:00:00 | 03/02/2022 15:00:00 | 1 |
| file_3 | 04/02/2022 15:00:00 | 04/02/2022 15:00:00 | 1 |
| file_3 | 04/02/2022 15:00:00 | 05/02/2022 15:00:00 | 2 |
I was able to do this through a for loop :
ids = df['fileID'].unique().tolist()
for id in ids:
    df.loc[df['fileID'] == id, 'version'] = range(1, len(df.loc[df['fileID'] == id]) + 1)
But I would prefer to avoid the for loop and use a vectorized function or a list comprehension instead. Any ideas?
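One possible loop-free sketch uses groupby with cumcount, which numbers the rows inside each group starting at 0. It assumes the dates are day-first strings, as they appear in the table, and that the version should follow the order of upgradeDate within each fileID.

```python
import pandas as pd

# Rebuild the example frame from the question (dates are day/month/year).
df = pd.DataFrame({
    "fileID": ["file_1", "file_1", "file_1", "file_2", "file_3", "file_3"],
    "creationDate": ["02/02/2022 15:00:00"] * 3
                    + ["03/02/2022 15:00:00"]
                    + ["04/02/2022 15:00:00"] * 2,
    "upgradeDate": ["02/02/2022 15:00:00", "02/02/2022 15:30:00",
                    "10/02/2022 10:00:00", "03/02/2022 15:00:00",
                    "04/02/2022 15:00:00", "05/02/2022 15:00:00"],
})
for col in ["creationDate", "upgradeDate"]:
    df[col] = pd.to_datetime(df[col], format="%d/%m/%Y %H:%M:%S")

# Sort so that versions follow the upgrade order within each file.
df = df.sort_values(["fileID", "upgradeDate"])

# cumcount() numbers rows within each group from 0, so add 1.
df["version"] = df.groupby("fileID").cumcount() + 1
print(df)
```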

Create a DataFrame with the total number of rows for each time interval grouped by ID

Given the following DataFrame of pandas in Python:
| ID | date |
|--------------|---------------------------------------|
| ESP | 2022-03-02 07:24:19+01:00 |
| ESP | 2022-03-02 07:24:19+01:00 |
| ESP | 2022-03-02 08:00:00+01:00 |
| UK | 2022-03-02 08:08:30+01:00 |
| ESP | 2022-03-02 09:11:50+01:00 |
| USA | 2022-03-02 10:19:11+01:00 |
| UK | 2022-03-02 10:12:11+01:00 |
| USA | 2022-03-03 08:33:22+01:00 |
| USA | 2022-03-03 09:23:22+01:00 |
| UK | 2022-03-03 12:13:22+01:00 |
| UK | 2022-03-03 12:35:22+01:00 |
With the following code implemented in Python, I get the following DataFrame:
def create_dataframe(df):
    df['date'] = pd.to_datetime(df['date'].astype(str).str.split('+').str[0])
    string = df['date'].groupby(df['date'].dt.floor('H')).count()
    df = pd.DataFrame({'date': string.index.date, 'start_interval': string.index.time,
                       'end_interval': (string.index + pd.DateOffset(hours=1)).time,
                       'total_rows': string.to_numpy()})
    return df
| date | start_interval | end_interval | total_rows |
|-----------------------|-------------------|-------------------|------------|
| 2022-03-02 | 07:00:00 | 08:00:00 | 2 |
| 2022-03-02 | 08:00:00 | 09:00:00 | 2 |
| 2022-03-02 | 09:00:00 | 10:00:00 | 1 |
| 2022-03-02 | 10:00:00 | 11:00:00 | 2 |
| 2022-03-03 | 08:00:00 | 09:00:00 | 1 |
| 2022-03-03 | 09:00:00 | 10:00:00 | 1 |
| 2022-03-03 | 12:00:00 | 13:00:00 | 2 |
I would like to add to the table the information provided by the 'ID' column, i.e. get this DataFrame:
| ID | date | start_interval | end_interval | total_rows |
|--------|-----------------------|-------------------|-------------------|------------|
| ESP | 2022-03-02 | 07:00:00 | 08:00:00 | 2 |
| ESP | 2022-03-02 | 08:00:00 | 09:00:00 | 1 |
| UK | 2022-03-02 | 08:00:00 | 09:00:00 | 1 |
| ESP | 2022-03-02 | 09:00:00 | 10:00:00 | 1 |
| USA | 2022-03-02 | 10:00:00 | 11:00:00 | 1 |
| UK | 2022-03-02 | 10:00:00 | 11:00:00 | 1 |
| USA | 2022-03-03 | 08:00:00 | 09:00:00 | 1 |
| USA | 2022-03-03 | 09:00:00 | 10:00:00 | 1 |
| UK | 2022-03-03 | 12:00:00 | 13:00:00 | 2 |
How could I modify the supplied code to obtain the resulting table? Thank you in advance for your help.
Does this produce what you are looking for:
result = (
    df
    .groupby(['ID', df['date'].dt.floor('H')]).agg(total_rows=('date', 'count'))
    .reset_index()
    .assign(
        start_interval=lambda df: df['date'].dt.time,
        end_interval=lambda df: (df['date'] + pd.Timedelta(hours=1)).dt.time,
        date=lambda df: df['date'].dt.date
    )
)
Result:
ID date total_rows start_interval end_interval
0 ESP 2022-03-02 2 07:00:00 08:00:00
1 ESP 2022-03-02 1 08:00:00 09:00:00
2 ESP 2022-03-02 1 09:00:00 10:00:00
3 UK 2022-03-02 1 08:00:00 09:00:00
4 UK 2022-03-02 1 10:00:00 11:00:00
5 UK 2022-03-03 2 12:00:00 13:00:00
6 USA 2022-03-02 1 10:00:00 11:00:00
7 USA 2022-03-03 1 08:00:00 09:00:00
8 USA 2022-03-03 1 09:00:00 10:00:00

Pandas timestamp time mean

Input
Tech ID | 4th April'22 | 3rd April'22 | 2nd April'22
123 | 2022-04-04 05:03:00 | 2022-04-03 05:08:00 | 2022-04-02 05:10:00
345 | 2022-04-04 05:37:00 | 2022-04-03 05:18:00 | 2022-04-02 05:12:00
678 | 2022-04-04 05:42:00 | 2022-04-03 05:25:00 | 2022-04-02 05:30:00
901 | 2022-04-04 05:48:00 | 2022-04-03 05:45:00 | 2022-04-02 04:08:00
367 | 2022-04-04 05:32:00 | 2022-04-03 06:08:00 | 2022-04-02 06:08:00
Output
Tech ID | 4th April'22 | 3rd April'22 | 2nd April'22 | Mean
123 | 2022-04-04 05:03:00 | 2022-04-03 05:08:00 | 2022-04-02 05:10:00 | result
345 | 2022-04-04 05:37:00 | 2022-04-03 05:18:00 | 2022-04-02 05:12:00 | result
678 | 2022-04-04 05:42:00 | 2022-04-03 05:25:00 | 2022-04-02 05:30:00 | result
901 | 2022-04-04 05:48:00 | 2022-04-03 05:45:00 | 2022-04-02 04:08:00 | result
367 | 2022-04-04 05:32:00 | 2022-04-03 06:08:00 | 2022-04-02 06:08:00 | result
Depending on the version of pandas you are using, you may be able to use the DataFrame.mean method to compute the mean across each row.
You will need to set numeric_only to False, per the pandas documentation:
numeric_only: bool, default None
Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
df['mean'] = df.mean(axis=1, numeric_only=False)
This works for me using pandas 1.3.4, assuming the dtype of all 3 source columns is datetime64[ns].
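If DataFrame.mean does not accept datetime columns in your pandas version, one fallback sketch is to average the offsets from a reference timestamp and add the reference back. The column names, the two sample rows and the reference date below are assumptions made for illustration; this computes the mean of the full datetimes, as in the answer above, not the mean time of day.

```python
import pandas as pd

# Two rows from the question's table; column names here are illustrative.
df = pd.DataFrame({
    "Tech ID": [123, 345],
    "2022-04-04": pd.to_datetime(["2022-04-04 05:03:00", "2022-04-04 05:37:00"]),
    "2022-04-03": pd.to_datetime(["2022-04-03 05:08:00", "2022-04-03 05:18:00"]),
    "2022-04-02": pd.to_datetime(["2022-04-02 05:10:00", "2022-04-02 05:12:00"]),
})
date_cols = ["2022-04-04", "2022-04-03", "2022-04-02"]

# Any fixed reference works; it only keeps the numbers small.
ref = pd.Timestamp("2022-04-01")

# Convert each datetime column to seconds elapsed since the reference.
offsets = df[date_cols].apply(lambda col: (col - ref).dt.total_seconds())

# Average the offsets row-wise and turn them back into timestamps.
df["Mean"] = ref + pd.to_timedelta(offsets.mean(axis=1), unit="s")
print(df[["Tech ID", "Mean"]])
```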

Get DataFrame with the number of rows for each time interval

Given the following DataFrame of pandas in Python:
| ID | date |
|--------------|---------------------------------------|
| 2 | 2022-03-02 07:24:19+01:00 |
| 2 | 2022-03-02 07:24:19+01:00 |
| 0 | 2022-03-02 08:00:00+01:00 |
| 0 | 2022-03-02 08:08:30+01:00 |
| 1 | 2022-03-02 09:11:50+01:00 |
| 1 | 2022-03-02 10:19:11+01:00 |
| 1 | 2022-03-02 10:12:11+01:00 |
| 3 | 2022-03-03 08:33:22+01:00 |
| 3 | 2022-03-03 09:23:22+01:00 |
| 3 | 2022-03-03 12:13:22+01:00 |
| 3 | 2022-03-03 12:35:22+01:00 |
I need to create a new DataFrame containing the total number of rows for each day in a given time interval, specified by parameter. Let's assume 1 hour for this example. Example of the DataFrame I want to obtain:
| date | start_interval | end_interval | total_rows |
|-----------------------|-------------------|-------------------|------------|
| 2022-03-02 | 00:00:00 | 01:00:00 | 0 |
| 2022-03-02 | 01:00:00 | 02:00:00 | 0 |
| 2022-03-02 | 02:00:00 | 03:00:00 | 0 |
| 2022-03-02 | 03:00:00 | 04:00:00 | 0 |
| 2022-03-02 | 04:00:00 | 05:00:00 | 0 |
| 2022-03-02 | 05:00:00 | 06:00:00 | 0 |
| 2022-03-02 | 06:00:00 | 07:00:00 | 0 |
| 2022-03-02 | 07:00:00 | 08:00:00 | 2 |
| 2022-03-02 | 08:00:00 | 09:00:00 | 2 |
| 2022-03-02 | 09:00:00 | 10:00:00 | 1 |
| 2022-03-02 | 10:00:00 | 11:00:00 | 2 |
| 2022-03-02 | 11:00:00 | 12:00:00 | 0 |
| 2022-03-02 | 12:00:00 | 13:00:00 | 0 |
| 2022-03-02 | 13:00:00 | 14:00:00 | 0 |
| 2022-03-02 | 14:00:00 | 15:00:00 | 0 |
| 2022-03-02 | 15:00:00 | 16:00:00 | 0 |
| 2022-03-02 | 16:00:00 | 17:00:00 | 0 |
| 2022-03-02 | 17:00:00 | 18:00:00 | 0 |
| 2022-03-02 | 18:00:00 | 19:00:00 | 0 |
| 2022-03-02 | 19:00:00 | 20:00:00 | 0 |
| 2022-03-02 | 20:00:00 | 21:00:00 | 0 |
| 2022-03-02 | 21:00:00 | 22:00:00 | 0 |
| 2022-03-02 | 22:00:00 | 23:00:00 | 0 |
| 2022-03-02 | 23:00:00 | 00:00:00 | 0 |
| 2022-03-03 | 00:00:00 | 01:00:00 | 0 |
| 2022-03-03 | 01:00:00 | 02:00:00 | 0 |
| 2022-03-03 | 02:00:00 | 03:00:00 | 0 |
| 2022-03-03 | 03:00:00 | 04:00:00 | 0 |
| 2022-03-03 | 04:00:00 | 05:00:00 | 0 |
| 2022-03-03 | 05:00:00 | 06:00:00 | 0 |
| 2022-03-03 | 06:00:00 | 07:00:00 | 0 |
| 2022-03-03 | 07:00:00 | 08:00:00 | 0 |
| 2022-03-03 | 08:00:00 | 09:00:00 | 1 |
| 2022-03-03 | 09:00:00 | 10:00:00 | 1 |
| 2022-03-03 | 10:00:00 | 11:00:00 | 0 |
| 2022-03-03 | 11:00:00 | 12:00:00 | 0 |
| 2022-03-03 | 12:00:00 | 13:00:00 | 2 |
| 2022-03-03 | 13:00:00 | 14:00:00 | 0 |
| 2022-03-03 | 14:00:00 | 15:00:00 | 0 |
| 2022-03-03 | 15:00:00 | 16:00:00 | 0 |
| 2022-03-03 | 16:00:00 | 17:00:00 | 0 |
| 2022-03-03 | 17:00:00 | 18:00:00 | 0 |
| 2022-03-03 | 18:00:00 | 19:00:00 | 0 |
| 2022-03-03 | 19:00:00 | 20:00:00 | 0 |
| 2022-03-03 | 20:00:00 | 21:00:00 | 0 |
| 2022-03-03 | 21:00:00 | 22:00:00 | 0 |
| 2022-03-03 | 22:00:00 | 23:00:00 | 0 |
| 2022-03-03 | 23:00:00 | 00:00:00 | 0 |
My idea is to finally delete all rows containing a 0 in the total_rows column.
df = df[df['total_rows'] != 0]
| date | start_interval | end_interval | total_rows |
|-----------------------|-------------------|-------------------|------------|
| 2022-03-02 | 07:00:00 | 08:00:00 | 2 |
| 2022-03-02 | 08:00:00 | 09:00:00 | 2 |
| 2022-03-02 | 09:00:00 | 10:00:00 | 1 |
| 2022-03-02 | 10:00:00 | 11:00:00 | 2 |
| 2022-03-03 | 08:00:00 | 09:00:00 | 1 |
| 2022-03-03 | 09:00:00 | 10:00:00 | 1 |
| 2022-03-03 | 12:00:00 | 13:00:00 | 2 |
How could I get this result?
Floor your date column then count number of occurrences:
s = df['date'].groupby(df['date'].dt.floor('H')).count()
out = pd.DataFrame({'date': s.index.date, 'start_interval': s.index.time,
                    'end_interval': (s.index + pd.DateOffset(hours=1)).time,
                    'total_rows': s.to_numpy()})
print(out)
# Output
date start_interval end_interval total_rows
0 2022-03-02 07:00:00 08:00:00 2
1 2022-03-02 08:00:00 09:00:00 2
2 2022-03-02 09:00:00 10:00:00 1
3 2022-03-02 10:00:00 11:00:00 2
4 2022-03-03 08:00:00 09:00:00 1
5 2022-03-03 09:00:00 10:00:00 1
6 2022-03-03 12:00:00 13:00:00 2
That's a nice job for pd.Grouper:
z = df.groupby(
    pd.Grouper(freq='1h', key='date')
).size().to_frame('total_rows').reset_index()
out = z.assign(
    start_interval=z['date'].dt.time,
    end_interval=(z['date'] + pd.Timedelta(1, 'hour')).dt.time,
    date=z['date'].dt.normalize(),
)
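Since the question asks for the interval to be a parameter, here is a minimal sketch that wraps the pd.Grouper idea in a function. The function name count_rows_per_interval and its freq argument are hypothetical, and the sample timestamps are written without the +01:00 offset for simplicity. Note that pd.Grouper also emits the empty bins between the first and last timestamp, so the zero rows are filtered out explicitly.

```python
import pandas as pd

def count_rows_per_interval(df, freq="1h"):
    """Count rows per time bin of width `freq`, dropping empty bins."""
    step = pd.Timedelta(freq)
    # size() returns 0 for bins that contain no rows.
    counts = df.groupby(pd.Grouper(key="date", freq=freq)).size()
    counts = counts[counts > 0]
    return pd.DataFrame({
        "date": counts.index.date,
        "start_interval": counts.index.time,
        "end_interval": (counts.index + step).time,
        "total_rows": counts.to_numpy(),
    })

# Sample data from the question, timezone offset omitted.
df = pd.DataFrame({
    "ID": [2, 2, 0, 0, 1, 1, 1, 3, 3, 3, 3],
    "date": pd.to_datetime([
        "2022-03-02 07:24:19", "2022-03-02 07:24:19", "2022-03-02 08:00:00",
        "2022-03-02 08:08:30", "2022-03-02 09:11:50", "2022-03-02 10:19:11",
        "2022-03-02 10:12:11", "2022-03-03 08:33:22", "2022-03-03 09:23:22",
        "2022-03-03 12:13:22", "2022-03-03 12:35:22",
    ]),
})

out = count_rows_per_interval(df, freq="1h")
print(out)
```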

Rate of change over last n hours using pandas timeseries

I would like to add columns to a time-indexed pandas DataFrame which contain the rate of change over the last n hours for each of the existing columns. I have accomplished this with the following code, however, it is too slow for my needs (probably due to looping over every index of each column?).
Is there a (computationally) faster way to do this?
roc_hours = 12
tol = 1e-10
for c in ts.columns:
    c_roc = c + ' +++ RoC ' + str(roc_hours) + 'h'
    ts[c_roc] = np.nan
    for i in ts.index[np.isfinite(ts[c])]:
        df = ts[c][i - np.timedelta64(roc_hours, 'h'):i]
        X = (df.index.values - df.index.values.min()).astype('int64')*2.77778e-13 # nanoseconds -> hours back
        Y = df.values
        if Y.std() > tol and X.shape[0] > 1:
            fit = np.polyfit(X,Y,1)
            ts.loc[i, c_roc] = fit[0]
        else:
            ts.loc[i, c_roc] = 0
Edit: the input DataFrame ts is irregularly sampled and can contain NaNs. The first few rows of ts:
+---------------------+-------------------+------+------+--------------------+-------------------+------------------+
| WCT | a | b | c | d | e | f |
+---------------------+-------------------+------+------+--------------------+-------------------+------------------+
| 2011-09-04 20:00:00 | | | | | | |
| 2011-09-04 21:00:00 | | | | | | |
| 2011-09-04 22:00:00 | | | | | | |
| 2011-09-04 23:00:00 | | | | | | |
| 2011-09-05 02:00:00 | 93.0 | 97.0 | 20.0 | 209.0 | 85.0 | 98.0 |
| 2011-09-05 03:00:00 | 74.14285714285714 | 97.0 | 20.0 | 194.14285714285717 | 74.42857142857143 | 98.0 |
| 2011-09-05 04:00:00 | 67.5 | 98.5 | 20.0 | 176.0 | 75.0 | 98.0 |
| 2011-09-05 05:00:00 | 72.0 | 98.5 | 20.0 | 176.0 | 75.0 | 98.0 |
| 2011-09-05 07:00:00 | 80.0 | 93.0 | 19.0 | 186.0 | 71.0 | 97.0 |
| 2011-09-05 08:00:00 | 80.0 | 93.0 | 19.0 | 186.0 | 71.0 | 97.0 |
| 2011-09-05 09:00:00 | 78.5 | 98.0 | 19.0 | 186.0 | 71.0 | 97.0 |
| 2011-09-05 10:00:00 | 73.0 | 98.0 | 19.0 | 186.0 | 71.0 | 97.0 |
| 2011-09-05 11:00:00 | 77.0 | 98.0 | 18.0 | 175.0 | 87.0 | 97.0999984741211 |
| 2011-09-05 12:00:00 | 78.0 | 98.0 | 19.0 | 163.0 | 57.0 | 98.4000015258789 |
| 2011-09-05 15:00:00 | 78.0 | 98.0 | 19.0 | 163.0 | 57.0 | 98.4000015258789 |
+---------------------+-------------------+------+------+--------------------+-------------------+------------------+
Edit 2
After profiling, the bottleneck is in the slicing step: df = ts[c][i - np.timedelta64(roc_hours, 'h'):i]. This line pulls out observations time-stamped between now-roc_hours and now. It's very handy syntax, but is taking up the bulk of the compute time.
This works on a dataset of mine; I haven't checked it on yours:
import pandas as pd
from numpy import polyfit
from matplotlib import style
style.use('ggplot')
# ... acquire a dataframe named *water* with a column *value*
WINDOW = 10
ax = water.value.plot()
# rolling mean over a fixed number of rows
roll = water.value.rolling(WINDOW).mean()
roll.plot(ax=ax)
def lintrend(df):
    # slope of a least-squares line through the window
    df = df.tolist()
    m, b = polyfit(range(len(df)), df, 1)
    return m
# raw=True passes each window to lintrend as a numpy.ndarray
linny = water.value.rolling(WINDOW).apply(lintrend, raw=True)
linny.plot(ax=ax)
Casting the numpy.ndarray to a list inside lintrend, after the rolling apply has passed the window in as a numpy.ndarray, seems inelegant. Suggestions?
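Because the series in the question is irregularly sampled, a fixed row count is not the same as "the last 12 hours". A sketch of a time-based variant follows, assuming a sorted DatetimeIndex; it lets rolling build the windows instead of slicing by label in a Python loop. The helper name slope_per_hour and the tiny sample frame are illustrative, and this has not been benchmarked against the original loop.

```python
import numpy as np
import pandas as pd

def slope_per_hour(window, tol=1e-10):
    """Least-squares slope (units per hour) of one time window."""
    window = window.dropna()
    # Mirror the question: flat or single-point windows get a slope of 0.
    if len(window) < 2 or window.std(ddof=0) <= tol:
        return 0.0
    # Hours elapsed since the first observation in the window.
    hours = np.asarray((window.index - window.index[0]) / pd.Timedelta(hours=1),
                       dtype=float)
    return float(np.polyfit(hours, window.to_numpy(), 1)[0])

# Small irregular sample: column a rises by 2 per hour, column b is flat.
ts = pd.DataFrame(
    {"a": [0.0, 2.0, 6.0, 12.0, np.nan], "b": [5.0, 5.0, 5.0, 5.0, 5.0]},
    index=pd.to_datetime([
        "2011-09-05 02:00", "2011-09-05 03:00", "2011-09-05 05:00",
        "2011-09-05 08:00", "2011-09-05 09:00",
    ]),
)

# closed="both" includes both window edges, like the label slice in the question;
# raw=False hands each window over as a Series so its timestamps are available.
roc = ts.rolling("12h", closed="both").apply(slope_per_hour, raw=False)

# Keep NaN where the input itself is NaN, as the original loop does.
roc = roc.where(ts.notna())
print(roc)
```

The per-window function is still plain Python, so any speed-up would come from avoiding the repeated label-based slicing rather than from the fit itself.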
