Removing duplicates every 5 minutes [closed] - python

I am trying to remove duplicate IDs that appear within each 5-minute time frame in the dataset. The data frame looks something like this:
| ID | Date     | Time     |
|----|----------|----------|
| 12 | 2012-1-1 | 00:01:00 |
| 13 | 2012-1-1 | 00:01:30 |
| 12 | 2012-1-1 | 00:04:30 |
| 12 | 2012-1-1 | 00:05:10 |
| 12 | 2012-1-1 | 00:10:00 |
Which should become:
| ID | Date     | Time     |
|----|----------|----------|
| 12 | 2012-1-1 | 00:01:00 |
| 13 | 2012-1-1 | 00:01:30 |
| 12 | 2012-1-1 | 00:05:10 |
| 12 | 2012-1-1 | 00:10:00 |
The second time "12" occurs it should be flagged as duplicate as it appears a second time in the time frame 00:00:00 - 00:05:00.
I am using pandas to clean the current dataset.
Any help is appreciated!

Start by adding a DatTim column (of datetime type), built from the Date and Time columns:
df['DatTim'] = pd.to_datetime(df.Date + ' ' + df.Time)
Then, assuming that ID is an "ordinary" column (not the index), you should:
- group by the DatTim column with a 5-minute frequency,
- apply drop_duplicates to each group, with subset limited to the ID column,
- and finally drop DatTim from the index.
Expressing the above in Python:
df2 = df.groupby(pd.Grouper(key='DatTim', freq='5min'))\
    .apply(lambda grp: grp.drop_duplicates(subset='ID'))\
    .reset_index(level=0, drop=True)
If you print(df2), you will get:
   ID      Date      Time              DatTim
0  12  2012-1-1  00:01:00 2012-01-01 00:01:00
1  13  2012-1-1  00:01:30 2012-01-01 00:01:30
3  12  2012-1-1  00:05:10 2012-01-01 00:05:10
4  12  2012-1-1  00:10:00 2012-01-01 00:10:00
To "clean up", you can drop DatTim column:
df2.drop('DatTim', axis=1)
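An equivalent, apply-free approach (a sketch of my own, not part of the original answer): floor each timestamp to its 5-minute bin and keep only the first occurrence of every (ID, bin) pair.
# Sketch, assuming df still has the DatTim column from above.
bins = df['DatTim'].dt.floor('5min')                        # bin labels: 00:00, 00:05, ...
dup = df.assign(bin=bins).duplicated(subset=['ID', 'bin'])  # True for repeats within a bin
df2 = df.loc[~dup].drop(columns='DatTim')                   # keep the first row per (ID, bin)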
Edit
If ID is the index, a slight change is required:
df2 = df.groupby(pd.Grouper(key='DatTim', freq='5min'))\
    .apply(lambda grp: grp[~grp.index.duplicated(keep='first')])\
    .reset_index(level=0, drop=True)
And then the printed df2 is:
        Date      Time              DatTim
ID
12  2012-1-1  00:01:00 2012-01-01 00:01:00
13  2012-1-1  00:01:30 2012-01-01 00:01:30
12  2012-1-1  00:05:10 2012-01-01 00:05:10
12  2012-1-1  00:10:00 2012-01-01 00:10:00
Of course, in this case you can also drop the DatTim column.

Related

Rolling window on timestamped DataFrame with a custom step?

I have been fiddling about with pandas.DataFrame.rolling for some time now and I haven't been able to achieve the result that I am looking for, so before I write a custom windowing function I figured I would ask if I'm missing something.
I have postgresql data with a composite index of (time, node) that has been read into a pandas.DataFrame, where time is a certain hour on a certain date. I need to create windows that contain all entries within the last two calendar dates (or any arbitrary number of days), for example, beginning at 2022-12-26 00:00:00 and ending on 2022-12-27 23:00:00, and then perform operations on that window to return a new, resultant DataFrame. The window should then move forward an entire calendar date, which is where I am failing.
| time | node | value |
| --------------------- | ----- | ------ |
| 2022-12-26 00:00:00 | 123 | low |
| 2022-12-26 01:00:00 | 123 | med |
| 2022-12-26 02:00:00 | 123 | low |
| 2022-12-26 03:00:00 | 123 | high |
| ... | ... | ... |
| 2022-12-26 00:00:00 | 999 | low |
| 2022-12-26 01:00:00 | 999 | low |
| 2022-12-26 02:00:00 | 999 | low |
| 2022-12-26 03:00:00 | 999 | med |
| ... | ... | ... |
| 2022-12-27 00:00:00 | 123 | low |
| 2022-12-27 01:00:00 | 123 | med |
| 2022-12-27 02:00:00 | 123 | low |
| 2022-12-27 03:00:00 | 123 | high |
When I use something akin to df.rolling(window=pd.Timedelta('2days')), the windows move forward hour by hour, as opposed to beginning on the next calendar date.
I've played around with min_periods, but it doesn't seem to work with my data, nor would it be acceptable in the long run because the number of expected observations per window is not fixed anyway. The step parameter also appears to be useless here because I am using an offset rather than an integer for the window.
Is the behaviour I am looking for doable with pandas.DataFrame.rolling or must I look elsewhere/write my own windowing function?
Any guidance would be appreciated. Thanks!
So from what I understand, you want to create windows of length ndays, with each subsequent window starting on the next day.
Given a dataframe spanning roughly 5 days with a 1-hour frequency between indices:
import pandas as pd
import numpy as np

periods = 23 * 5
df = pd.DataFrame(
    {'value': list(range(periods))},
    index=pd.date_range('2022-12-16', periods=periods, freq='H')
)
d = np.random.choice(
    pd.date_range('2022-12-16', periods=periods, freq='H'),
    int(periods * 0.25)
)
df = df.drop(index=d)
df.head(5)
>>>                  value
2022-12-16 00:00:00      0
2022-12-16 01:00:00      1
2022-12-16 02:00:00      2
2022-12-16 04:00:00      4
2022-12-16 05:00:00      5
I randomly dropped some indices to simulate missing data.
We can use df.resample to group the data by day (it works regardless of missing data):
days = df.resample('1d')
print(days.get_group('2022-12-16'))
>>>                  value
2022-12-16 00:00:00      0
2022-12-16 01:00:00      1
2022-12-16 02:00:00      2
2022-12-16 04:00:00      4
2022-12-16 05:00:00      5
2022-12-16 06:00:00      6
2022-12-16 07:00:00      7
2022-12-16 08:00:00      8
2022-12-16 09:00:00      9
2022-12-16 11:00:00     11
2022-12-16 12:00:00     12
2022-12-16 13:00:00     13
2022-12-16 14:00:00     14
2022-12-16 15:00:00     15
2022-12-16 17:00:00     17
2022-12-16 18:00:00     18
2022-12-16 19:00:00     19
2022-12-16 21:00:00     21
2022-12-16 22:00:00     22
2022-12-16 23:00:00     23
Now, we only need to iterate over the days in a "sliding" manner. The package more-itertools has some helpful functions, such as windowed and we can easily control the size of the window (here with ndays):
from more_itertools import windowed

ndays = 2
windows = [
    pd.concat([w[1] for w in window])
    for window in windowed(days, ndays)
]
Printing the first and last index of each window returns:
for window in windows:
    print(window.iloc[[0, -1]])
>>>                  value
2022-12-16 00:00:00      0
2022-12-17 23:00:00     47
                     value
2022-12-17 00:00:00     24
2022-12-18 23:00:00     71
                     value
2022-12-18 00:00:00     48
2022-12-19 23:00:00     95
                     value
2022-12-19 01:00:00     73
2022-12-20 18:00:00    114
Furthermore, you can set step in windowed to control the step size between windows.
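For instance, a quick sketch of my own (reusing days and ndays from above) for non-overlapping two-day windows; note that windowed pads an incomplete trailing window with None, so those entries are filtered out:
# Non-overlapping windows: advance by ndays instead of one day at a time.
windows = [
    pd.concat([item[1] for item in window if item is not None])
    for window in windowed(days, ndays, step=ndays)
]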

How to generate a dataframe column to keep track of a file version

I have a dataframe that stores the creation and upgrade dates of files. I would like to write a function that creates a column named 'version' to keep track of the file versions.
original dataset:
| fileID | creationDate | upgradeDate |
|--------|----------------------|----------------------|
| file_1 | 02/02/2022 15:00:00 | 02/02/2022 15:00:00 |
| file_1 | 02/02/2022 15:00:00 | 02/02/2022 15:30:00 |
| file_1 | 02/02/2022 15:00:00 | 10/02/2022 10:00:00 |
| file_2 | 03/02/2022 15:00:00 | 03/02/2022 15:00:00 |
| file_3 | 04/02/2022 15:00:00 | 04/02/2022 15:00:00 |
| file_3 | 04/02/2022 15:00:00 | 05/02/2022 15:00:00 |
what I would like:
| fileID | creationDate | upgradeDate | version |
|--------|----------------------|----------------------|---------|
| file_1 | 02/02/2022 15:00:00 | 02/02/2022 15:00:00 | 1 |
| file_1 | 02/02/2022 15:00:00 | 02/02/2022 15:30:00 | 2 |
| file_1 | 02/02/2022 15:00:00 | 10/02/2022 10:00:00 | 3 |
| file_2 | 03/02/2022 15:00:00 | 03/02/2022 15:00:00 | 1 |
| file_3 | 04/02/2022 15:00:00 | 04/02/2022 15:00:00 | 1 |
| file_3 | 04/02/2022 15:00:00 | 05/02/2022 15:00:00 | 2 |
I was able to do this with a for loop:
ids = df['fileID'].unique().tolist()
for file_id in ids:
    df.loc[df['fileID'] == file_id, 'version'] = range(1, len(df.loc[df['fileID'] == file_id]) + 1)
But I would prefer not to use a for loop and do it with a function or list comprehension instead. Any ideas?
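One loop-free pattern that fits here (a sketch of my own; this question has no posted answer) is groupby().cumcount(), which numbers the rows within each group starting at 0:
# Assumes the rows are already ordered by upgradeDate within each fileID.
df['version'] = df.groupby('fileID').cumcount() + 1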

Create a DataFrame with the total number of rows for each time interval grouped by ID

Given the following pandas DataFrame in Python:
| ID | date |
|--------------|---------------------------------------|
| ESP | 2022-03-02 07:24:19+01:00 |
| ESP | 2022-03-02 07:24:19+01:00 |
| ESP | 2022-03-02 08:00:00+01:00 |
| UK | 2022-03-02 08:08:30+01:00 |
| ESP | 2022-03-02 09:11:50+01:00 |
| USA | 2022-03-02 10:19:11+01:00 |
| UK | 2022-03-02 10:12:11+01:00 |
| USA | 2022-03-03 08:33:22+01:00 |
| USA | 2022-03-03 09:23:22+01:00 |
| UK | 2022-03-03 12:13:22+01:00 |
| UK | 2022-03-03 12:35:22+01:00 |
With the following code implemented in Python, I get the following DataFrame:
def create_dataframe(df):
    df['date'] = pd.to_datetime(df['date'].astype(str).str.split('+').str[0])
    string = df['date'].groupby(df['date'].dt.floor('H')).count()
    df = pd.DataFrame({'date': string.index.date, 'start_interval': string.index.time,
                       'end_interval': (string.index + pd.DateOffset(hours=1)).time,
                       'total_rows': string.to_numpy()})
    return df
| date | start_interval | end_interval | total_rows |
|-----------------------|-------------------|-------------------|------------|
| 2022-03-02 | 07:00:00 | 08:00:00 | 2 |
| 2022-03-02 | 08:00:00 | 09:00:00 | 2 |
| 2022-03-02 | 09:00:00 | 10:00:00 | 1 |
| 2022-03-02 | 10:00:00 | 11:00:00 | 2 |
| 2022-03-03 | 08:00:00 | 09:00:00 | 1 |
| 2022-03-03 | 09:00:00 | 10:00:00 | 1 |
| 2022-03-03 | 12:00:00 | 13:00:00 | 2 |
I would like to add to the table the information provided by the 'ID' column, i.e. get this DataFrame:
| ID | date | start_interval | end_interval | total_rows |
|--------|-----------------------|-------------------|-------------------|------------|
| ESP | 2022-03-02 | 07:00:00 | 08:00:00 | 2 |
| ESP | 2022-03-02 | 08:00:00 | 09:00:00 | 1 |
| UK | 2022-03-02 | 08:00:00 | 09:00:00 | 1 |
| ESP | 2022-03-02 | 09:00:00 | 10:00:00 | 1 |
| USA | 2022-03-02 | 10:00:00 | 11:00:00 | 1 |
| UK | 2022-03-02 | 10:00:00 | 11:00:00 | 1 |
| USA | 2022-03-03 | 08:00:00 | 09:00:00 | 1 |
| USA | 2022-03-03 | 09:00:00 | 10:00:00 | 1 |
| UK | 2022-03-03 | 12:00:00 | 13:00:00 | 2 |
How could I modify the supplied code to obtain the resulting table? Thank you in advance for your help.
Does this produce what you are looking for?
result = (
    df
    .groupby(['ID', df['date'].dt.floor('H')]).agg(total_rows=('date', 'count'))
    .reset_index()
    .assign(
        start_interval=lambda df: df['date'].dt.time,
        end_interval=lambda df: (df['date'] + pd.Timedelta(hours=1)).dt.time,
        date=lambda df: df['date'].dt.date
    )
)
Result:
    ID        date  total_rows start_interval end_interval
0  ESP  2022-03-02           2       07:00:00     08:00:00
1  ESP  2022-03-02           1       08:00:00     09:00:00
2  ESP  2022-03-02           1       09:00:00     10:00:00
3   UK  2022-03-02           1       08:00:00     09:00:00
4   UK  2022-03-02           1       10:00:00     11:00:00
5   UK  2022-03-03           2       12:00:00     13:00:00
6  USA  2022-03-02           1       10:00:00     11:00:00
7  USA  2022-03-03           1       08:00:00     09:00:00
8  USA  2022-03-03           1       09:00:00     10:00:00
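If the column order should match the requested table exactly, a final selection (a small addition of mine, not from the original answer) reorders it:
result = result[['ID', 'date', 'start_interval', 'end_interval', 'total_rows']]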

Get DataFrame with the number of rows for each time interval

Given the following pandas DataFrame in Python:
| ID | date |
|--------------|---------------------------------------|
| 2 | 2022-03-02 07:24:19+01:00 |
| 2 | 2022-03-02 07:24:19+01:00 |
| 0 | 2022-03-02 08:00:00+01:00 |
| 0 | 2022-03-02 08:08:30+01:00 |
| 1 | 2022-03-02 09:11:50+01:00 |
| 1 | 2022-03-02 10:19:11+01:00 |
| 1 | 2022-03-02 10:12:11+01:00 |
| 3 | 2022-03-03 08:33:22+01:00 |
| 3 | 2022-03-03 09:23:22+01:00 |
| 3 | 2022-03-03 12:13:22+01:00 |
| 3 | 2022-03-03 12:35:22+01:00 |
I need to create a new DataFrame containing the total number of rows for each time interval of each day, with the interval length specified by a parameter. Let's assume 1 hour for this example. Example of the DataFrame I want to obtain:
| date | start_interval | end_interval | total_rows |
|-----------------------|-------------------|-------------------|------------|
| 2022-03-02 | 00:00:00 | 01:00:00 | 0 |
| 2022-03-02 | 01:00:00 | 02:00:00 | 0 |
| 2022-03-02 | 02:00:00 | 03:00:00 | 0 |
| 2022-03-02 | 03:00:00 | 04:00:00 | 0 |
| 2022-03-02 | 04:00:00 | 05:00:00 | 0 |
| 2022-03-02 | 05:00:00 | 06:00:00 | 0 |
| 2022-03-02 | 06:00:00 | 07:00:00 | 0 |
| 2022-03-02 | 07:00:00 | 08:00:00 | 2 |
| 2022-03-02 | 08:00:00 | 09:00:00 | 2 |
| 2022-03-02 | 09:00:00 | 10:00:00 | 1 |
| 2022-03-02 | 10:00:00 | 11:00:00 | 2 |
| 2022-03-02 | 11:00:00 | 12:00:00 | 0 |
| 2022-03-02 | 12:00:00 | 13:00:00 | 0 |
| 2022-03-02 | 13:00:00 | 14:00:00 | 0 |
| 2022-03-02 | 14:00:00 | 15:00:00 | 0 |
| 2022-03-02 | 15:00:00 | 16:00:00 | 0 |
| 2022-03-02 | 16:00:00 | 17:00:00 | 0 |
| 2022-03-02 | 17:00:00 | 18:00:00 | 0 |
| 2022-03-02 | 18:00:00 | 19:00:00 | 0 |
| 2022-03-02 | 19:00:00 | 20:00:00 | 0 |
| 2022-03-02 | 20:00:00 | 21:00:00 | 0 |
| 2022-03-02 | 21:00:00 | 22:00:00 | 0 |
| 2022-03-02 | 22:00:00 | 23:00:00 | 0 |
| 2022-03-02 | 23:00:00 | 00:00:00 | 0 |
| 2022-03-03 | 00:00:00 | 01:00:00 | 0 |
| 2022-03-03 | 01:00:00 | 02:00:00 | 0 |
| 2022-03-03 | 02:00:00 | 03:00:00 | 0 |
| 2022-03-03 | 03:00:00 | 04:00:00 | 0 |
| 2022-03-03 | 04:00:00 | 05:00:00 | 0 |
| 2022-03-03 | 05:00:00 | 06:00:00 | 0 |
| 2022-03-03 | 06:00:00 | 07:00:00 | 0 |
| 2022-03-03 | 07:00:00 | 08:00:00 | 0 |
| 2022-03-03 | 08:00:00 | 09:00:00 | 1 |
| 2022-03-03 | 09:00:00 | 10:00:00 | 1 |
| 2022-03-03 | 10:00:00 | 11:00:00 | 0 |
| 2022-03-03 | 11:00:00 | 12:00:00 | 0 |
| 2022-03-03 | 12:00:00 | 13:00:00 | 2 |
| 2022-03-03 | 13:00:00 | 14:00:00 | 0 |
| 2022-03-03 | 14:00:00 | 15:00:00 | 0 |
| 2022-03-03 | 15:00:00 | 16:00:00 | 0 |
| 2022-03-03 | 16:00:00 | 17:00:00 | 0 |
| 2022-03-03 | 17:00:00 | 18:00:00 | 0 |
| 2022-03-03 | 18:00:00 | 19:00:00 | 0 |
| 2022-03-03 | 19:00:00 | 20:00:00 | 0 |
| 2022-03-03 | 20:00:00 | 21:00:00 | 0 |
| 2022-03-03 | 21:00:00 | 22:00:00 | 0 |
| 2022-03-03 | 22:00:00 | 23:00:00 | 0 |
| 2022-03-03 | 23:00:00 | 00:00:00 | 0 |
My idea is to finally delete all rows containing a 0 in the total_rows column:
df = df[df['total_rows'] != 0]
| date | start_interval | end_interval | total_rows |
|-----------------------|-------------------|-------------------|------------|
| 2022-03-02 | 07:00:00 | 08:00:00 | 2 |
| 2022-03-02 | 08:00:00 | 09:00:00 | 2 |
| 2022-03-02 | 09:00:00 | 10:00:00 | 1 |
| 2022-03-02 | 10:00:00 | 11:00:00 | 2 |
| 2022-03-03 | 08:00:00 | 09:00:00 | 1 |
| 2022-03-03 | 09:00:00 | 10:00:00 | 1 |
| 2022-03-03 | 12:00:00 | 13:00:00 | 2 |
How could I get this result?
Floor your date column, then count the number of occurrences:
s = df['date'].groupby(df['date'].dt.floor('H')).count()
out = pd.DataFrame({'date': s.index.date, 'start_interval': s.index.time,
                    'end_interval': (s.index + pd.DateOffset(hours=1)).time,
                    'total_rows': s.to_numpy()})
print(out)
# Output
         date start_interval end_interval  total_rows
0  2022-03-02       07:00:00     08:00:00           2
1  2022-03-02       08:00:00     09:00:00           2
2  2022-03-02       09:00:00     10:00:00           1
3  2022-03-02       10:00:00     11:00:00           2
4  2022-03-03       08:00:00     09:00:00           1
5  2022-03-03       09:00:00     10:00:00           1
6  2022-03-03       12:00:00     13:00:00           2
That's a nice job for pd.Grouper:
z = df.groupby(
    pd.Grouper(freq='1h', key='date')
).size().to_frame('total_rows').reset_index()
out = z.assign(
    start_interval=z['date'].dt.time,
    end_interval=(z['date'] + pd.Timedelta(1, 'hour')).dt.time,
    date=z['date'].dt.normalize(),
)
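Since pd.Grouper emits every interval in the range, including the empty ones, the zero-count rows can then be dropped exactly as the question proposes; a short follow-up sketch:
# Keep only the non-empty intervals, matching the question's final step.
out = out[out['total_rows'] != 0].reset_index(drop=True)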

Retrieve time difference since last action -- python/pandas

Let's say I have purchase records with two fields, Buy and Time.
What I want is a third column with the time elapsed since the first not-buy, so it looks like:
| buy | time | time difference |
|-----|------|-----------------|
| 1   | 8:00 | NULL            |
| 0   | 9:01 | NULL            |
| 0   | 9:10 | NULL            |
| 0   | 9:21 | NULL            |
| 1   | 9:31 | 0:30            |
| 0   | 9:41 | NULL            |
| 0   | 9:42 | NULL            |
| 1   | 9:53 | 0:12            |
How can I achieve this? It seems to me that it's a mix of pd.groupby() and pd.shift(), but I can't seem to work that out in my head.
IIUC:
df.time = pd.to_datetime(df.time)
df.loc[df.buy == 1, 'DIFF'] = df.groupby(df.buy.cumsum().shift().fillna(0)).time.transform(lambda x: x.iloc[-1] - x.iloc[0])
df
Out[19]:
   buy                time timedifference     DIFF
0    1 2018-02-26 08:00:00            NaN 00:00:00
1    0 2018-02-26 09:01:00            NaN      NaT
2    0 2018-02-26 09:10:00            NaN      NaT
3    0 2018-02-26 09:21:00            NaN      NaT
4    1 2018-02-26 09:31:00           0:30 00:30:00
5    0 2018-02-26 09:41:00            NaN      NaT
6    0 2018-02-26 09:42:00            NaN      NaT
7    1 2018-02-26 09:53:00           0:12 00:12:00
# df.buy.cumsum().shift().fillna(0) builds the grouping key: a new group starts right after each buy
# .transform(lambda x: x.iloc[-1] - x.iloc[0]) computes the elapsed time within each group
# df.loc[df.buy == 1, 'DIFF'] writes the result only at the positions where buy equals 1
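The same trick can be spelled out step by step; a sketch of my own that produces the same DIFF column:
import pandas as pd

t = pd.to_datetime(df['time'])
# A new group begins right after each buy, so each buy row closes its own group.
grp = df['buy'].cumsum().shift(fill_value=0)
first = t.groupby(grp).transform('first')   # first timestamp (the first not-buy) in each group
df.loc[df['buy'] == 1, 'DIFF'] = t - first  # elapsed time, written only at the buy rows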
