Get DataFrame with the number of rows for each time interval - python

Given the following DataFrame of pandas in Python:
| ID | date |
|--------------|---------------------------------------|
| 2 | 2022-03-02 07:24:19+01:00 |
| 2 | 2022-03-02 07:24:19+01:00 |
| 0 | 2022-03-02 08:00:00+01:00 |
| 0 | 2022-03-02 08:08:30+01:00 |
| 1 | 2022-03-02 09:11:50+01:00 |
| 1 | 2022-03-02 10:19:11+01:00 |
| 1 | 2022-03-02 10:12:11+01:00 |
| 3 | 2022-03-03 08:33:22+01:00 |
| 3 | 2022-03-03 09:23:22+01:00 |
| 3 | 2022-03-03 12:13:22+01:00 |
| 3 | 2022-03-03 12:35:22+01:00 |
I need to create a new DataFrame containing the total number of rows that fall within each time interval of each day, with the interval length given as a parameter. Let's assume 1 hour for this example. Example of the DataFrame I want to obtain:
| date | start_interval | end_interval | total_rows |
|-----------------------|-------------------|-------------------|------------|
| 2022-03-02 | 00:00:00 | 01:00:00 | 0 |
| 2022-03-02 | 01:00:00 | 02:00:00 | 0 |
| 2022-03-02 | 02:00:00 | 03:00:00 | 0 |
| 2022-03-02 | 03:00:00 | 04:00:00 | 0 |
| 2022-03-02 | 04:00:00 | 05:00:00 | 0 |
| 2022-03-02 | 05:00:00 | 06:00:00 | 0 |
| 2022-03-02 | 06:00:00 | 07:00:00 | 0 |
| 2022-03-02 | 07:00:00 | 08:00:00 | 2 |
| 2022-03-02 | 08:00:00 | 09:00:00 | 2 |
| 2022-03-02 | 09:00:00 | 10:00:00 | 1 |
| 2022-03-02 | 10:00:00 | 11:00:00 | 2 |
| 2022-03-02 | 11:00:00 | 12:00:00 | 0 |
| 2022-03-02 | 12:00:00 | 13:00:00 | 0 |
| 2022-03-02 | 13:00:00 | 14:00:00 | 0 |
| 2022-03-02 | 14:00:00 | 15:00:00 | 0 |
| 2022-03-02 | 15:00:00 | 16:00:00 | 0 |
| 2022-03-02 | 16:00:00 | 17:00:00 | 0 |
| 2022-03-02 | 17:00:00 | 18:00:00 | 0 |
| 2022-03-02 | 18:00:00 | 19:00:00 | 0 |
| 2022-03-02 | 19:00:00 | 20:00:00 | 0 |
| 2022-03-02 | 20:00:00 | 21:00:00 | 0 |
| 2022-03-02 | 21:00:00 | 22:00:00 | 0 |
| 2022-03-02 | 22:00:00 | 23:00:00 | 0 |
| 2022-03-02 | 23:00:00 | 00:00:00 | 0 |
| 2022-03-03 | 00:00:00 | 01:00:00 | 0 |
| 2022-03-03 | 01:00:00 | 02:00:00 | 0 |
| 2022-03-03 | 02:00:00 | 03:00:00 | 0 |
| 2022-03-03 | 03:00:00 | 04:00:00 | 0 |
| 2022-03-03 | 04:00:00 | 05:00:00 | 0 |
| 2022-03-03 | 05:00:00 | 06:00:00 | 0 |
| 2022-03-03 | 06:00:00 | 07:00:00 | 0 |
| 2022-03-03 | 07:00:00 | 08:00:00 | 0 |
| 2022-03-03 | 08:00:00 | 09:00:00 | 1 |
| 2022-03-03 | 09:00:00 | 10:00:00 | 1 |
| 2022-03-03 | 10:00:00 | 11:00:00 | 0 |
| 2022-03-03 | 11:00:00 | 12:00:00 | 0 |
| 2022-03-03 | 12:00:00 | 13:00:00 | 2 |
| 2022-03-03 | 13:00:00 | 14:00:00 | 0 |
| 2022-03-03 | 14:00:00 | 15:00:00 | 0 |
| 2022-03-03 | 15:00:00 | 16:00:00 | 0 |
| 2022-03-03 | 16:00:00 | 17:00:00 | 0 |
| 2022-03-03 | 17:00:00 | 18:00:00 | 0 |
| 2022-03-03 | 18:00:00 | 19:00:00 | 0 |
| 2022-03-03 | 19:00:00 | 20:00:00 | 0 |
| 2022-03-03 | 20:00:00 | 21:00:00 | 0 |
| 2022-03-03 | 21:00:00 | 22:00:00 | 0 |
| 2022-03-03 | 22:00:00 | 23:00:00 | 0 |
| 2022-03-03 | 23:00:00 | 00:00:00 | 0 |
My idea is to then delete all rows containing a 0 in the total_rows column:
df = df[df['total_rows'] != 0]
| date | start_interval | end_interval | total_rows |
|-----------------------|-------------------|-------------------|------------|
| 2022-03-02 | 07:00:00 | 08:00:00 | 2 |
| 2022-03-02 | 08:00:00 | 09:00:00 | 2 |
| 2022-03-02 | 09:00:00 | 10:00:00 | 1 |
| 2022-03-02 | 10:00:00 | 11:00:00 | 2 |
| 2022-03-03 | 08:00:00 | 09:00:00 | 1 |
| 2022-03-03 | 09:00:00 | 10:00:00 | 1 |
| 2022-03-03 | 12:00:00 | 13:00:00 | 2 |
How could I get this result?

Floor your date column then count number of occurrences:
s = df['date'].groupby(df['date'].dt.floor('H')).count()
out = pd.DataFrame({'date': s.index.date, 'start_interval': s.index.time,
                    'end_interval': (s.index + pd.DateOffset(hours=1)).time,
                    'total_rows': s.to_numpy()})
print(out)
# Output
date start_interval end_interval total_rows
0 2022-03-02 07:00:00 08:00:00 2
1 2022-03-02 08:00:00 09:00:00 2
2 2022-03-02 09:00:00 10:00:00 1
3 2022-03-02 10:00:00 11:00:00 2
4 2022-03-03 08:00:00 09:00:00 1
5 2022-03-03 09:00:00 10:00:00 1
6 2022-03-03 12:00:00 13:00:00 2
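If you also want the intermediate table with one row per hour and zero counts (as in the question's first table), one possible sketch is to reindex the hourly counts against a complete hourly grid before building the frame (this assumes df['date'] is already a datetime column):
s = df['date'].groupby(df['date'].dt.floor('H')).count()
# Complete hourly grid covering the observed days (inherits the original timezone)
full = pd.date_range(start=s.index.min().normalize(),
                     end=s.index.max().normalize() + pd.Timedelta(hours=23),
                     freq='H')
s = s.reindex(full, fill_value=0)
out = pd.DataFrame({'date': s.index.date, 'start_interval': s.index.time,
                    'end_interval': (s.index + pd.DateOffset(hours=1)).time,
                    'total_rows': s.to_numpy()})
Filtering with out[out['total_rows'] != 0] then gives the same result as above.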

That's a nice job for pd.Grouper:
z = df.groupby(
    pd.Grouper(freq='1h', key='date')
).size().to_frame('total_rows').reset_index()
out = z.assign(
    start_interval=z['date'].dt.time,
    end_interval=(z['date'] + pd.Timedelta(1, 'hour')).dt.time,
    date=z['date'].dt.normalize(),
)
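Note that pd.Grouper only creates bins between the earliest and latest timestamps, so the leading and trailing empty hours of each day will not appear; since the plan is to drop zero-count rows anyway, that should be fine. To match the question's final step (a small follow-up sketch):
out = out[out['total_rows'] != 0]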

Related

How to generate a dataframe column to keep track of a file version

I have a dataframe that stores the dates of creation and upgrades of files. I would like to write a function that creates a column named 'version' keeping track of the file versions.
original dataset:
| fileID | creationDate | upgradeDate |
|--------|----------------------|----------------------|
| file_1 | 02/02/2022 15:00:00 | 02/02/2022 15:00:00 |
| file_1 | 02/02/2022 15:00:00 | 02/02/2022 15:30:00 |
| file_1 | 02/02/2022 15:00:00 | 10/02/2022 10:00:00 |
| file_2 | 03/02/2022 15:00:00 | 03/02/2022 15:00:00 |
| file_3 | 04/02/2022 15:00:00 | 04/02/2022 15:00:00 |
| file_3 | 04/02/2022 15:00:00 | 05/02/2022 15:00:00 |
what I would like:
| fileID | creationDate | upgradeDate | version |
|--------|----------------------|----------------------|---------|
| file_1 | 02/02/2022 15:00:00 | 02/02/2022 15:00:00 | 1 |
| file_1 | 02/02/2022 15:00:00 | 02/02/2022 15:30:00 | 2 |
| file_1 | 02/02/2022 15:00:00 | 10/02/2022 10:00:00 | 3 |
| file_2 | 03/02/2022 15:00:00 | 03/02/2022 15:00:00 | 1 |
| file_3 | 04/02/2022 15:00:00 | 04/02/2022 15:00:00 | 1 |
| file_3 | 04/02/2022 15:00:00 | 05/02/2022 15:00:00 | 2 |
I was able to do this with a for loop:
ids = df['fileID'].unique().tolist()
for id in ids:
    df.loc[df['fileID'] == id, 'version'] = range(1, len(df.loc[df['fileID'] == id]) + 1)
But I would prefer not to use a for loop and do it with a function or list comprehension. Any idea?
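A loop-free way to get the same numbering is groupby plus cumcount (a sketch using a common pandas idiom, assuming rows are already ordered by upgradeDate within each fileID):
# Number the rows within each fileID group, starting from 1
df['version'] = df.groupby('fileID').cumcount() + 1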

Create a DataFrame with the total number of rows for each time interval grouped by ID

Given the following DataFrame of pandas in Python:
| ID | date |
|--------------|---------------------------------------|
| ESP | 2022-03-02 07:24:19+01:00 |
| ESP | 2022-03-02 07:24:19+01:00 |
| ESP | 2022-03-02 08:00:00+01:00 |
| UK | 2022-03-02 08:08:30+01:00 |
| ESP | 2022-03-02 09:11:50+01:00 |
| USA | 2022-03-02 10:19:11+01:00 |
| UK | 2022-03-02 10:12:11+01:00 |
| USA | 2022-03-03 08:33:22+01:00 |
| USA | 2022-03-03 09:23:22+01:00 |
| UK | 2022-03-03 12:13:22+01:00 |
| UK | 2022-03-03 12:35:22+01:00 |
With the following code implemented in Python, I get the following DataFrame:
def create_dataframe(df):
    df['date'] = pd.to_datetime(df['date'].astype(str).str.split('+').str[0])
    string = df['date'].groupby(df['date'].dt.floor('H')).count()
    df = pd.DataFrame({'date': string.index.date, 'start_interval': string.index.time,
                       'end_interval': (string.index + pd.DateOffset(hours=1)).time,
                       'total_rows': string.to_numpy()})
    return df
| date | start_interval | end_interval | total_rows |
|-----------------------|-------------------|-------------------|------------|
| 2022-03-02 | 07:00:00 | 08:00:00 | 2 |
| 2022-03-02 | 08:00:00 | 09:00:00 | 2 |
| 2022-03-02 | 09:00:00 | 10:00:00 | 1 |
| 2022-03-02 | 10:00:00 | 11:00:00 | 2 |
| 2022-03-03 | 08:00:00 | 09:00:00 | 1 |
| 2022-03-03 | 09:00:00 | 10:00:00 | 1 |
| 2022-03-03 | 12:00:00 | 13:00:00 | 2 |
I would like to add to the table the information provided by the 'ID' column, i.e. get this DataFrame:
| ID | date | start_interval | end_interval | total_rows |
|--------|-----------------------|-------------------|-------------------|------------|
| ESP | 2022-03-02 | 07:00:00 | 08:00:00 | 2 |
| ESP | 2022-03-02 | 08:00:00 | 09:00:00 | 1 |
| UK | 2022-03-02 | 08:00:00 | 09:00:00 | 1 |
| ESP | 2022-03-02 | 09:00:00 | 10:00:00 | 1 |
| USA | 2022-03-02 | 10:00:00 | 11:00:00 | 1 |
| UK | 2022-03-02 | 10:00:00 | 11:00:00 | 1 |
| USA | 2022-03-03 | 08:00:00 | 09:00:00 | 1 |
| USA | 2022-03-03 | 09:00:00 | 10:00:00 | 1 |
| UK | 2022-03-03 | 12:00:00 | 13:00:00 | 2 |
How could I modify the supplied code to obtain the resulting table? Thank you in advance for your help.
Does this produce what you are looking for:
result = (
    df
    .groupby(['ID', df['date'].dt.floor('H')]).agg(total_rows=('date', 'count'))
    .reset_index()
    .assign(
        start_interval=lambda df: df['date'].dt.time,
        end_interval=lambda df: (df['date'] + pd.Timedelta(hours=1)).dt.time,
        date=lambda df: df['date'].dt.date
    )
)
Result:
ID date total_rows start_interval end_interval
0 ESP 2022-03-02 2 07:00:00 08:00:00
1 ESP 2022-03-02 1 08:00:00 09:00:00
2 ESP 2022-03-02 1 09:00:00 10:00:00
3 UK 2022-03-02 1 08:00:00 09:00:00
4 UK 2022-03-02 1 10:00:00 11:00:00
5 UK 2022-03-03 2 12:00:00 13:00:00
6 USA 2022-03-02 1 10:00:00 11:00:00
7 USA 2022-03-03 1 08:00:00 09:00:00
8 USA 2022-03-03 1 09:00:00 10:00:00
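If you want the columns in the same order as in the desired table, you can reorder them afterwards, e.g.:
result = result[['ID', 'date', 'start_interval', 'end_interval', 'total_rows']]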

How to group rows based on difference with previous row?

I have the following dataframe :
| start_time | end_time | id |
|---------------------|---------------------|-----|
| 2017-03-30 01:00:00 | 2017-03-30 01:15:30 |1 |
| 2017-03-30 02:02:00 | 2017-03-30 03:30:00 |4 |
| 2017-03-30 03:37:00 | 2017-03-30 03:39:00 |7 |
| 2017-03-30 03:41:30 | 2017-03-30 04:50:00 |8 |
| 2017-03-30 07:10:00 | 2017-03-30 07:10:30 |10 |
| 2017-03-30 07:11:00 | 2017-03-30 07:20:00 |13 |
| 2017-03-30 07:22:00 | 2017-03-30 08:00:00 |15 |
| 2017-03-30 10:00:00 | 2017-03-30 10:03:00 |20 |
I would like to group rows under the same id when the end_time of row i-1 is at most 900 seconds before the start_time of row i.
Basically, the result for the example above would be:
| start_time | end_time | id |
|---------------------|---------------------|-----|
| 2017-03-30 01:00:00 | 2017-03-30 01:15:30 |1 |
| 2017-03-30 02:02:00 | 2017-03-30 03:30:00 |4 |
| 2017-03-30 03:37:00 | 2017-03-30 03:39:00 |4 |
| 2017-03-30 03:41:30 | 2017-03-30 04:50:00 |4 |
| 2017-03-30 07:10:00 | 2017-03-30 07:10:30 |10 |
| 2017-03-30 07:11:00 | 2017-03-30 07:20:00 |10 |
| 2017-03-30 07:22:00 | 2017-03-30 08:00:00 |10 |
| 2017-03-30 10:00:00 | 2017-03-30 10:03:00 |20 |
I achieved it with the following code, but I'm sure there's a more elegant (and efficient) way to do so:
df['endTime_delayed'] = df.end_time.shift(1)
df['id_delayed'] = df['id'].shift(1)
for (i, row) in df.iterrows():
    if (row.start_time - row.endTime_delayed).seconds <= 900:
        df.id.iloc[i] = df.id_delayed.iloc[i]
        try:
            df.id_delayed.iloc[i+1] = df.id.iloc[i]
        except:
            break
mask and ffill
diff = df.start_time.sub(df.end_time.shift())
mask = diff < pd.Timedelta(900, unit='s')
df.id.mask(mask).ffill().astype(df.id.dtype)
0 1
1 4
2 4
3 4
4 10
5 10
6 10
7 20
Name: id, dtype: int64
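To write the merged ids back into the frame, a small follow-up to the above:
df['id'] = df['id'].mask(mask).ffill().astype(df['id'].dtype)
If "at most 900 seconds" should be inclusive, build the mask with <= instead of <.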

Removing duplicates every 5 minutes [closed]

I am trying to remove duplicate IDs that appear within each 5-minute time frame of the dataset. The data frame looks something like this:
| ID | Date     | Time     |
|----|----------|----------|
| 12 | 2012-1-1 | 00:01:00 |
| 13 | 2012-1-1 | 00:01:30 |
| 12 | 2012-1-1 | 00:04:30 |
| 12 | 2012-1-1 | 00:05:10 |
| 12 | 2012-1-1 | 00:10:00 |
Which should become:
| ID | Date     | Time     |
|----|----------|----------|
| 12 | 2012-1-1 | 00:01:00 |
| 13 | 2012-1-1 | 00:01:30 |
| 12 | 2012-1-1 | 00:05:10 |
| 12 | 2012-1-1 | 00:10:00 |
The second time "12" occurs it should be flagged as duplicate as it appears a second time in the time frame 00:00:00 - 00:05:00.
I am using pandas to clean the current dataset.
Any help is appreciated!
Start by adding a DatTim column (of type datetime), built from the source Date and Time columns:
df['DatTim'] = pd.to_datetime(df.Date + ' ' + df.Time)
Then, assuming that ID is an "ordinary" column (not the index), you should:
- group by the DatTim column with a 5-minute frequency,
- apply drop_duplicates to each group, with subset limited to the ID column,
- and finally drop DatTim from the index.
Expressing the above instruction in Python:
df2 = df.groupby(pd.Grouper(key='DatTim', freq='5min'))\
    .apply(lambda grp: grp.drop_duplicates(subset='ID'))\
    .reset_index(level=0, drop=True)
If you print(df2), you will get:
ID Date Time DatTim
0 12 2012-1-1 00:01:00 2012-01-01 00:01:00
1 13 2012-1-1 00:01:30 2012-01-01 00:01:30
3 12 2012-1-1 00:05:10 2012-01-01 00:05:10
4 12 2012-1-1 00:10:00 2012-01-01 00:10:00
To "clean up", you can drop DatTim column:
df2.drop('DatTim', axis=1)
Edit
If ID is the index, a slight change is required:
df2 = df.groupby(pd.Grouper(key='DatTim', freq='5min'))\
    .apply(lambda grp: grp[~grp.index.duplicated(keep='first')])\
    .reset_index(level=0, drop=True)
And then the printed df2 is:
Date Time DatTim
ID
12 2012-1-1 00:01:00 2012-01-01 00:01:00
13 2012-1-1 00:01:30 2012-01-01 00:01:30
12 2012-1-1 00:05:10 2012-01-01 00:05:10
12 2012-1-1 00:10:00 2012-01-01 00:10:00
Of course, also in this case you can drop DatTim column.
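An alternative that avoids groupby/apply is to bucket each timestamp into its 5-minute window and drop duplicates on ID plus that bucket (a sketch, assuming ID is an ordinary column as above):
# Floor each timestamp to its 5-minute window, then keep the first row per (ID, window)
df2 = (df.assign(bucket=df['DatTim'].dt.floor('5min'))
         .drop_duplicates(subset=['ID', 'bucket'])
         .drop(columns='bucket'))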

Rate of change over last n hours using pandas timeseries

I would like to add columns to a time-indexed pandas DataFrame which contain the rate of change over the last n hours for each of the existing columns. I have accomplished this with the following code; however, it is too slow for my needs (probably due to looping over every index of each column?).
Is there a (computationally) faster way to do this?
roc_hours = 12
tol = 1e-10
for c in ts.columns:
    c_roc = c + ' +++ RoC ' + str(roc_hours) + 'h'
    ts[c_roc] = np.nan
    for i in ts.index[np.isfinite(ts[c])]:
        df = ts[c][i - np.timedelta64(roc_hours, 'h'):i]
        X = (df.index.values - df.index.values.min()).astype('Int64')*2.77778e-13  # hours back
        Y = df.values
        if Y.std() > tol and X.shape[0] > 1:
            fit = np.polyfit(X, Y, 1)
            ts[c_roc][i] = fit[0]
        else:
            ts[c_roc][i] = 0
Edit
Input dataframe ts is irregularly sampled and can contain NaNs. First few lines of input ts:
+---------------------+-------------------+------+------+--------------------+-------------------+------------------+
| WCT | a | b | c | d | e | f |
+---------------------+-------------------+------+------+--------------------+-------------------+------------------+
| 2011-09-04 20:00:00 | | | | | | |
| 2011-09-04 21:00:00 | | | | | | |
| 2011-09-04 22:00:00 | | | | | | |
| 2011-09-04 23:00:00 | | | | | | |
| 2011-09-05 02:00:00 | 93.0 | 97.0 | 20.0 | 209.0 | 85.0 | 98.0 |
| 2011-09-05 03:00:00 | 74.14285714285714 | 97.0 | 20.0 | 194.14285714285717 | 74.42857142857143 | 98.0 |
| 2011-09-05 04:00:00 | 67.5 | 98.5 | 20.0 | 176.0 | 75.0 | 98.0 |
| 2011-09-05 05:00:00 | 72.0 | 98.5 | 20.0 | 176.0 | 75.0 | 98.0 |
| 2011-09-05 07:00:00 | 80.0 | 93.0 | 19.0 | 186.0 | 71.0 | 97.0 |
| 2011-09-05 08:00:00 | 80.0 | 93.0 | 19.0 | 186.0 | 71.0 | 97.0 |
| 2011-09-05 09:00:00 | 78.5 | 98.0 | 19.0 | 186.0 | 71.0 | 97.0 |
| 2011-09-05 10:00:00 | 73.0 | 98.0 | 19.0 | 186.0 | 71.0 | 97.0 |
| 2011-09-05 11:00:00 | 77.0 | 98.0 | 18.0 | 175.0 | 87.0 | 97.0999984741211 |
| 2011-09-05 12:00:00 | 78.0 | 98.0 | 19.0 | 163.0 | 57.0 | 98.4000015258789 |
| 2011-09-05 15:00:00 | 78.0 | 98.0 | 19.0 | 163.0 | 57.0 | 98.4000015258789 |
+---------------------+-------------------+------+------+--------------------+-------------------+------------------+
Edit 2
After profiling, the bottleneck is in the slicing step: df = ts[c][i - np.timedelta64(roc_hours, 'h'):i]. This line pulls out observations time-stamped between now-roc_hours and now. It's very handy syntax, but is taking up the bulk of the compute time.
Works on a dataset of mine, haven't checked on yours:
import pandas as pd
from numpy import polyfit
from matplotlib import style
style.use('ggplot')
# ... acquire a dataframe named *water* with a column *value*
WINDOW = 10
ax = water.value.plot()
roll = pd.rolling_mean(water.value, WINDOW)
roll.plot(ax=ax)
def lintrend(df):
    df = df.tolist()
    m, b = polyfit(range(len(df)), df, 1)
    return m
linny = pd.rolling_apply(water.value, WINDOW, lintrend)
linny.plot(ax=ax)
Casting to a list inside lintrend, after rolling_apply has already cast the window to a numpy.ndarray, seems inelegant. Suggestions?
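For reference, pd.rolling_mean and pd.rolling_apply were removed in later pandas versions; the same idea with the method-based rolling API would look roughly like this (a sketch, not tested on the question's data):
roll = water.value.rolling(WINDOW).mean()
# raw=True passes a numpy array (rather than a Series) to the function
linny = water.value.rolling(WINDOW).apply(lintrend, raw=True)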
