Resample by some timeframe - python

I have a dataframe like:
Timestamp Sold
10.01.2017 10:00:20 10
10.01.2017 10:01:55 20
10.01.2017 11:02:11 15
11.01.2017 11:04:30 10
11.01.2017 11:15:35 35
12.01.2017 10:02:01 22
How can I resample it by hour? An ordinary resample covers all hours from the first row to the last, but what I need is to define a timeframe (10:00-11:00) and resample only within that window.
The final df should look like this:
Timestamp Sold
10.01.2017 10:00:00 30
10.01.2017 11:00:00 15
11.01.2017 10:00:00 NaN
11.01.2017 11:00:00 45
12.01.2017 10:00:00 22
12.01.2017 11:00:00 NaN

You could do something like this:
# assumes df.Timestamp is already a datetime64 column
df_out = df.groupby(df.Timestamp.dt.floor('H')).sum()
df_out = df_out.reset_index()
Output:
Timestamp Sold
0 2017-10-01 10:00:00 30
1 2017-10-01 11:00:00 15
2 2017-11-01 11:00:00 45
3 2017-12-01 10:00:00 22
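To also get the NaN rows for the missing hours inside the 10:00-11:00 window, one option is to reindex the grouped result against the full date-by-hour grid (a sketch, again assuming Timestamp is already a datetime column):
import pandas as pd

hourly = df.groupby(df.Timestamp.dt.floor('H'))['Sold'].sum()

# every date present in the data, combined with the 10:00 and 11:00 slots
dates = hourly.index.normalize().unique()
grid = pd.DatetimeIndex([d + pd.Timedelta(hours=h) for d in dates for h in (10, 11)])

# combinations with no data come back as NaN
df_out = hourly.reindex(grid).rename_axis('Timestamp').reset_index()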

Related

Get rolling average without every timestamp

I have data about how many messages each account sends, aggregated to an hourly level. For each row, I would like to add a column with the sum of the previous 7 days' messages divided by 7. I know I can group by account and date to aggregate the number of messages to the daily level, but I'm having a hard time calculating the rolling average because there isn't a row in the data if the account didn't send any messages that day (and I'd like to avoid ballooning my data by adding those rows in, if at all possible). If I could figure out a way to calculate the rolling 7-day average for each day on which each account sent messages, I could then re-join that number back to the hourly data (is my hope). Any suggestions?
Note: For any day not in the data, assume 0 messages sent.
Raw Data:
Account | Messages | Date | Hour
12 5 2022-07-11 09:00:00
12 6 2022-07-13 10:00:00
12 10 2022-07-13 11:00:00
12 9 2022-07-15 16:00:00
12 1 2022-07-19 13:00:00
15 2 2022-07-12 10:00:00
15 13 2022-07-13 11:00:00
15 3 2022-07-17 16:00:00
15 4 2022-07-22 13:00:00
Desired Output:
Account | Messages | Date | Hour | Rolling Previous 7 Day Average
12 5 2022-07-11 09:00:00 0
12 6 2022-07-13 10:00:00 0.714
12 10 2022-07-13 11:00:00 0.714
12 9 2022-07-15 16:00:00 3
12 1 2022-07-19 13:00:00 3.571
15 2 2022-07-12 10:00:00 0
15 13 2022-07-13 11:00:00 0.286
15 3 2022-07-17 16:00:00 2.143
15 4 2022-07-22 13:00:00 0.429
I hope I've understood your question right:
df["Date"] = pd.to_datetime(df["Date"])
df["Messages_tmp"] = df.groupby(["Account", "Date"])["Messages"].transform(
"sum"
)
df["Rolling Previous 7 Day Average"] = (
df.set_index("Date")
.groupby("Account")["Messages_tmp"]
.rolling("7D")
.apply(lambda x: x.loc[~x.index.duplicated()].shift().sum() / 7)
).values
df = df.drop(columns="Messages_tmp")
print(df)
Prints:
Account Messages Date Hour Rolling Previous 7 Day Average
0 12 5 2022-07-11 09:00:00 0.000000
1 12 6 2022-07-13 10:00:00 0.714286
2 12 10 2022-07-13 11:00:00 0.714286
3 12 9 2022-07-15 16:00:00 3.000000
4 12 1 2022-07-19 13:00:00 3.571429
5 15 2 2022-07-12 10:00:00 0.000000
6 15 13 2022-07-13 11:00:00 0.285714
7 15 3 2022-07-17 16:00:00 2.142857
8 15 4 2022-07-22 13:00:00 0.428571
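For reference, a minimal frame to run the snippet against could be built like this (values taken from the question):
import pandas as pd

df = pd.DataFrame({
    "Account": [12, 12, 12, 12, 12, 15, 15, 15, 15],
    "Messages": [5, 6, 10, 9, 1, 2, 13, 3, 4],
    "Date": ["2022-07-11", "2022-07-13", "2022-07-13", "2022-07-15", "2022-07-19",
             "2022-07-12", "2022-07-13", "2022-07-17", "2022-07-22"],
    "Hour": ["09:00:00", "10:00:00", "11:00:00", "16:00:00", "13:00:00",
             "10:00:00", "11:00:00", "16:00:00", "13:00:00"],
})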

Filtering dataframe given a list of dates

I have the following dataframe:
Date Site
0 1999-10-01 12:00:00 65
1 1999-10-01 16:00:00 21
2 1999-10-02 11:00:00 57
3 1999-10-05 12:00:00 53
4 1999-10-10 16:00:00 43
5 1999-10-24 07:00:00 33
6 1999-10-24 08:00:00 21
I have a list of datetimes that I get from tolist() on another dataframe.
[Timestamp('1999-10-01 00:00:00'),
Timestamp('1999-10-02 00:00:00'),
Timestamp('1999-10-24 00:00:00')]
The purpose of the list is to filter the dataframe based on the dates it contains. The end result is:
Date Site
0 1999-10-01 12:00:00 65
1 1999-10-01 16:00:00 21
2 1999-10-02 11:00:00 57
5 1999-10-24 07:00:00 33
6 1999-10-24 08:00:00 21
Where only the rows from 1st, 2nd and 24th Oct appear in the dataframe.
What is the approach for this? I have looked around and only found solutions for filtering between two dates or on a single date.
Thank you.
If you want to compare Timestamps ignoring the time component, use Series.dt.normalize:
df1 = df[df['Date'].dt.normalize().isin(L)]
Or Series.dt.floor:
df1 = df[df['Date'].dt.floor('d').isin(L)]
To compare by date objects, the list has to be converted to dates as well:
df1 = df[df['Date'].dt.date.isin([x.date() for x in L])]
print(df1)
Date Site
0 1999-10-01 12:00:00 65
1 1999-10-01 16:00:00 21
2 1999-10-02 11:00:00 57
5 1999-10-24 07:00:00 33
6 1999-10-24 08:00:00 21
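For a self-contained run, the frame and list can be set up like this (a sketch using the names above):
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime([
        "1999-10-01 12:00:00", "1999-10-01 16:00:00", "1999-10-02 11:00:00",
        "1999-10-05 12:00:00", "1999-10-10 16:00:00", "1999-10-24 07:00:00",
        "1999-10-24 08:00:00",
    ]),
    "Site": [65, 21, 57, 53, 43, 33, 21],
})
L = pd.to_datetime(["1999-10-01", "1999-10-02", "1999-10-24"]).tolist()

df1 = df[df["Date"].dt.normalize().isin(L)]
print(df1)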

How to use pd.interpolate to fill only the gaps with a single missing value

I have a time series data for air pollution with several missing gaps like this:
Date AMB_TEMP CO PM10 PM2.5
2010-01-01 0 8 10 ... 15
2010-01-01 1 10 15 ... 20
...
2010-01-02 0 5 ...
2010-01-02 1 ... 20
...
2010-01-03 1 4 13 ... 34
To specify, here's the data link: shorturl.at/blBN1
The gaps consist of single or consecutive NAs, and there are some helpful summary statistics computed in R:
Length of time series: 87648
Number of Missing Values:746
Percentage of Missing Values: 0.85 %
Number of Gaps: 136
Average Gap Size: 5.485294
Longest NA gap (series of consecutive NAs): 32
Most frequent gap size (series of consecutive NA series): 1(occurring 50 times)
Generally, if I use df.interpolate(limit=1),
the first value of gaps with more than one missing value gets interpolated as well.
So I guess a better way to interpolate only the gaps with a single missing value is to identify each gap.
To do so, I grouped the gaps by size using the following:
cum = df.notna().cumsum()
cum[cum.duplicated()]
and got the result:
PM2.5
2019-01-09 13:00:00 205
2019-01-10 15:00:00 230
2019-01-10 16:00:00 230
2019-01-16 11:00:00 368
2019-01-23 14:00:00 538
...
2019-12-02 10:00:00 7971
2019-12-10 09:00:00 8161
2019-12-16 15:00:00 8310
2019-12-24 12:00:00 8498
2019-12-31 10:00:00 8663
How can I get the index of the first missing value in each gap, together with the gap size, like this?
PM2.5 gap size
2019-01-09 13:00:00 1
2019-01-10 15:00:00 2
2019-01-16 11:00:00 1
2019-01-23 14:00:00 1
...
2019-12-02 10:00:00 1
2019-12-10 09:00:00 1
2019-12-16 15:00:00 1
2019-12-24 12:00:00 1
2019-12-31 10:00:00 1
but when I used cum[cum.duplicated()].groupby(cum[cum.duplicated()]).count(), the original index was lost.
Is there a better solution for this?
Or how can I interpolate the gaps case by case?
Can anyone help?
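One possible approach, sketched below (assuming df has a DatetimeIndex and a PM2.5 column): give every run of consecutive NaNs a gap id via the cumulative count of non-missing values, map each id to its run length, and then mask the interpolation so only single-value gaps stay filled.
import pandas as pd

s = df["PM2.5"]

# every run of consecutive NaNs shares one id, because the cumulative
# count of non-missing values is constant inside a gap
gap_id = s.notna().cumsum()[s.isna()]

# length of the gap each missing timestamp belongs to
gap_size = gap_id.map(gap_id.value_counts())

# first missing timestamp of every gap, together with its size
print(gap_size[~gap_id.duplicated()])

# interpolate everything, then blank out values sitting in gaps longer than 1
filled = s.interpolate().mask(gap_size.reindex(s.index) > 1)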

Python: how to group by for each user?

I have a dataframe that looks like the following
uid timestamp count val
0 ccf7758a-155f-4ebf-8740-68320f279baa 2020-03-17 13:00:00 23 3
1 ccf7758a-155f-4ebf-8740-68320f279baa 2020-03-17 13:00:00 20 2
2 ccf7758a-155f-4ebf-8740-68320f279baa 2020-03-17 15:00:00 10 5
3 16162f81-d745-41c2-a7d6-f11486958e36 2020-03-18 09:00:00 9 6
4 16162f81-d745-41c2-a7d6-f11486958e36 2020-03-18 09:00:00 9 3
I would like to group by uid and hourly timestamp in order to get, for every hour, the sum of count and the average of val.
The result should look like the following:
uid timestamp count val
0 ccf7758a-155f-4ebf-8740-68320f279baa 2020-03-17 13:00:00 43 2.5
2 ccf7758a-155f-4ebf-8740-68320f279baa 2020-03-17 15:00:00 10 5
3 16162f81-d745-41c2-a7d6-f11486958e36 2020-03-18 09:00:00 18 4.5
You can try groupby in combination with agg, using a dictionary to define the aggregation for each column:
import pandas as pd
import numpy as np

df.groupby(['uid', 'timestamp']).agg({"val": np.mean, "count": np.sum})
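If you want uid and timestamp back as regular columns, as_index=False does that, and string aggregation names avoid the NumPy imports (recent pandas versions prefer them in agg). A sketch:
out = df.groupby(["uid", "timestamp"], as_index=False).agg({"count": "sum", "val": "mean"})
print(out)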

Convert Time Column to UTC Time, no date

I have a column of data (in pandas) that holds a time of day:
0 8:00 AM
1 11:00 AM
2 8:00 AM
3 4:00 PM
4 9:00 AM
5
6 9:00 AM
7
8 9:00 AM
9
10 9:00 AM
11
12 9:00 AM
13
14 8:00 AM
15 11:00 AM
16 8:00 AM
17 11:00 AM
18 9:00 AM
19
20 9:00 AM
21
22 9:00 AM
23
24 9:00 AM
25
26 9:00 AM
27
28 9:00 AM
I would like to convert this to something similar to this:
0 2015-11-11 08:00:00
1 2015-11-11 11:00:00
2 2015-11-11 08:00:00
3 2015-11-11 16:00:00
4 2015-11-11 09:00:00
5 NaT
6 2015-11-11 09:00:00
7 NaT
8 2015-11-11 09:00:00
9 NaT
10 2015-11-11 09:00:00
11 NaT
12 2015-11-11 09:00:00
13 NaT
14 2015-11-11 08:00:00
15 2015-11-11 11:00:00
16 2015-11-11 08:00:00
17 2015-11-11 11:00:00
18 2015-11-11 09:00:00
19 NaT
20 2015-11-11 09:00:00
21 NaT
22 2015-11-11 09:00:00
23 NaT
24 2015-11-11 09:00:00
25 NaT
26 2015-11-11 09:00:00
27 NaT
28 2015-11-11 09:00:00
29 NaT
But without the date added to it. I am then trying to combine my pandas columns into a single column that I can iterate through. I have tried converting them with astype(str), with no success in a pd.merge.
Any ideas on how to use the to_datetime function in pandas while just keeping it as UTC time?
Considering the following input Data:
data = ['8:00 AM',
'11:00 AM',
'8:00 AM',
'4:00 PM',
'9:00 AM',
'',
'9:00 AM',
'',
'9:00 AM']
Code:
import pandas as pd

# parse the strings to datetimes, then keep only the time-of-day component
x = pd.to_datetime(data).time
pd.Series(x)
Output:
0 08:00:00
1 11:00:00
2 08:00:00
3 16:00:00
4 09:00:00
5 NaN
6 09:00:00
7 NaN
8 09:00:00
dtype: object
If you have other data in another series you would like to join into the same dataframe:
x = pd.Series(x)
y = pd.Series(range(9))
pd.concat([x, y], axis=1)
0 1
0 08:00:00 0
1 11:00:00 1
2 08:00:00 2
Finally, if you prefer the columns merged as strings, try this:
z = pd.concat([x, y], axis=1)
z[0].astype(str) + ' foo ' + z[1].astype(str)
0 08:00:00 foo 0
1 11:00:00 foo 1
2 08:00:00 foo 2
3 16:00:00 foo 3
4 09:00:00 foo 4
5 nan foo 5
6 09:00:00 foo 6
7 nan foo 7
8 09:00:00 foo 8
dtype: object
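If the parsing should be pinned to the 12-hour clock and tolerate the blanks explicitly, an explicit format string with errors='coerce' is an option (a sketch):
import pandas as pd

s = pd.Series(data)
# blanks become NaT instead of raising; .dt.time keeps only the time of day
times = pd.to_datetime(s, format="%I:%M %p", errors="coerce").dt.time
print(times)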
