Python: how to group by for each user?

I have a dataframe that looks like the following
uid timestamp count val
0 ccf7758a-155f-4ebf-8740-68320f279baa 2020-03-17 13:00:00 23 3
1 ccf7758a-155f-4ebf-8740-68320f279baa 2020-03-17 13:00:00 20 2
2 ccf7758a-155f-4ebf-8740-68320f279baa 2020-03-17 15:00:00 10 5
3 16162f81-d745-41c2-a7d6-f11486958e36 2020-03-18 09:00:00 9 6
4 16162f81-d745-41c2-a7d6-f11486958e36 2020-03-18 09:00:00 9 3
I would like to group by uid in order to get the sum of count for every hour and the average of val.
I would like something like the following:
uid timestamp count val
0 ccf7758a-155f-4ebf-8740-68320f279baa 2020-03-17 13:00:00 43 2.5
2 ccf7758a-155f-4ebf-8740-68320f279baa 2020-03-17 15:00:00 10 5
3 16162f81-d745-41c2-a7d6-f11486958e36 2020-03-18 09:00:00 18 4.5

You can try groupby in combination with agg, using a dictionary that maps each column to its aggregation function:
import pandas as pd
import numpy as np

df.groupby(['uid', 'timestamp']).agg({"val": np.mean, "count": np.sum})
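If you also want uid and timestamp back as regular columns, as in the desired output, named aggregation is a common alternative; a minimal sketch (the out name is just illustrative):

import pandas as pd

# String aggregation names ("sum", "mean") work as well as the numpy
# callables, and as_index=False keeps the group keys as ordinary columns.
out = df.groupby(['uid', 'timestamp'], as_index=False).agg(
    count=('count', 'sum'),
    val=('val', 'mean'),
)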

Related

Get rolling average without every timestamp

I have data about how many messages each account sends, aggregated to an hourly level. For each row, I would like to add a column with the sum of the previous 7 days' messages. I know I can group by account and date and aggregate the number of messages to the daily level, but I'm having a hard time calculating the rolling average because there isn't a row in the data if the account didn't send any messages that day (and I'd like not to balloon my data by adding these rows in, if at all possible). If I could figure out a way to calculate the rolling 7-day average for each day that each account sent messages, I could then re-join that number back to the hourly data (is my hope). Any suggestions?
Note: For any day not in the data, assume 0 messages sent.
Raw Data:
Account | Messages | Date | Hour
12 5 2022-07-11 09:00:00
12 6 2022-07-13 10:00:00
12 10 2022-07-13 11:00:00
12 9 2022-07-15 16:00:00
12 1 2022-07-19 13:00:00
15 2 2022-07-12 10:00:00
15 13 2022-07-13 11:00:00
15 3 2022-07-17 16:00:00
15 4 2022-07-22 13:00:00
Desired Output:
Account | Messages | Date | Hour | Rolling Previous 7 Day Average
12 5 2022-07-11 09:00:00 0
12 6 2022-07-13 10:00:00 0.714
12 10 2022-07-13 11:00:00 0.714
12 9 2022-07-15 16:00:00 3
12 1 2022-07-19 13:00:00 3.571
15 2 2022-07-12 10:00:00 0
15 13 2022-07-13 11:00:00 0.286
15 3 2022-07-17 16:00:00 2.143
15 4 2022-07-22 13:00:00 0.429
I hope I've understood your question right:
df["Date"] = pd.to_datetime(df["Date"])
df["Messages_tmp"] = df.groupby(["Account", "Date"])["Messages"].transform(
"sum"
)
df["Rolling Previous 7 Day Average"] = (
df.set_index("Date")
.groupby("Account")["Messages_tmp"]
.rolling("7D")
.apply(lambda x: x.loc[~x.index.duplicated()].shift().sum() / 7)
).values
df = df.drop(columns="Messages_tmp")
print(df)
Prints:
Account Messages Date Hour Rolling Previous 7 Day Average
0 12 5 2022-07-11 09:00:00 0.000000
1 12 6 2022-07-13 10:00:00 0.714286
2 12 10 2022-07-13 11:00:00 0.714286
3 12 9 2022-07-15 16:00:00 3.000000
4 12 1 2022-07-19 13:00:00 3.571429
5 15 2 2022-07-12 10:00:00 0.000000
6 15 13 2022-07-13 11:00:00 0.285714
7 15 3 2022-07-17 16:00:00 2.142857
8 15 4 2022-07-22 13:00:00 0.428571

pandas datetime giving wrong output

I am working with a pandas dataframe with a date column. I have converted the dtype of this column from object to datetime using pd.to_datetime:
Input:
0 30-11-2019
1 31-12-2019
2 31-12-2019
3 31-12-2019
4 31-12-2019
5 21-01-2020
6 27-01-2020
7 01-02-2020
8 01-02-2020
9 03-02-2020
10 15-02-2020
11 12-03-2020
12 13-03-2020
13 31-03-2020
14 31-03-2020
15 04-04-2020
16 04-04-2020
17 04-04-2020
ta['transaction_date'] = pd.to_datetime(ta['transaction_date'])
Output:
0 2019-11-30
1 2019-12-31
2 2019-12-31
3 2019-12-31
4 2019-12-31
5 2020-01-21
6 2020-01-27
7 2020-01-02
8 2020-01-02
9 2020-03-02
10 2020-02-15
11 2020-12-03
12 2020-03-13
13 2020-03-31
14 2020-03-31
15 2020-04-04
16 2020-04-04
17 2020-04-04
As you can see, the output at index 11 (and several others, such as indices 7-9) is wrong after converting to datetime: the month is swapped with the day. This is affecting my further analysis. How can I sort this out?
Use the dayfirst=True parameter or specify a format, because pandas by default parses the month first when possible:
ta['transaction_date'] = pd.to_datetime(ta['transaction_date'], dayfirst=True)
Or:
ta['transaction_date'] = pd.to_datetime(ta['transaction_date'], format='%d-%m-%Y')
Method 1
pd.to_datetime has a parameter named dayfirst; set it to True.
Method 2
Use the format parameter of the to_datetime function.
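A quick demonstration of the difference (a standalone sketch, using one of the ambiguous values from the question):

import pandas as pd

# '12-03-2020' is ambiguous: month-first gives December 3, day-first March 12.
print(pd.to_datetime('12-03-2020'))                 # 2020-12-03
print(pd.to_datetime('12-03-2020', dayfirst=True))  # 2020-03-12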

Resample by some timeframe

I have dataframe like:
Timestamp Sold
10.01.2017 10:00:20 10
10.01.2017 10:01:55 20
10.01.2017 11:02:11 15
11.01.2017 11:04:30 10
11.01.2017 11:15:35 35
12.01.2017 10:02:01 22
How do I resample it by hour? An ordinary resample covers all hours from the first row to the last, but what I need is to restrict it to a time frame (10:00-11:00) and resample within that time frame.
The final df should look like this:
Timestamp Sold
10.01.2017 10:00:00 30
10.01.2017 11:00:00 15
11.01.2017 10:00:00 NAN
11.01.2017 11:00:00 45
12.01.2017 10:00:00 22
12.01.2017 11:00:00 NAN
You could do something like this:
df_out = df.groupby(df.Timestamp.dt.floor('H')).sum()
df_out = df_out.reset_index()
Output:
Timestamp Sold
0 2017-10-01 10:00:00 30
1 2017-10-01 11:00:00 15
2 2017-11-01 11:00:00 45
3 2017-12-01 10:00:00 22
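Note that groupby drops hour buckets with no data, so the NaN rows from the desired output are missing. A minimal sketch that fills them in by reindexing over the full date x hour grid (assuming the timestamps are day-first dd.mm.yyyy and the window is fixed at 10:00 and 11:00):

import pandas as pd

# parse day-first timestamps, then sum per hour bucket
df['Timestamp'] = pd.to_datetime(df['Timestamp'], dayfirst=True)
hourly = df.groupby(df['Timestamp'].dt.floor('H'))['Sold'].sum()

# build every date x {10:00, 11:00} combination so empty buckets become NaN
days = pd.date_range(df['Timestamp'].min().normalize(),
                     df['Timestamp'].max().normalize(), freq='D')
grid = pd.DatetimeIndex([d + pd.Timedelta(hours=h) for d in days for h in (10, 11)])
result = hourly.reindex(grid).rename_axis('Timestamp').reset_index(name='Sold')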

How to add missing rows in dataframe by comparing values in python

Hi, I am using pandas dataframes in Python. I have data something like the following:
Employee-ID Time-slot Calls-received Prod-sold
1 14:30:00 10 1
1 15:00:00 15 3
1 15:30:00 10 2
1 16:00:00 8 2
1 16:30:00 10 0
2 14:30:00 10 2
2 15:00:00 15 3
2 16:30:00 10 2
2 17:00:00 10 0
I have 10,000 employees, and ideally there should be 16 time slots for each employee, but time slots are missing for some of them; for example, employee 2 is missing the 15:30:00 and 16:00:00 slots. I wish to add new rows with the missing time slots and zero values for Calls-received and Prod-sold, something like this:
2 14:30:00 10 2
2 15:00:00 15 3
2 15:30:00 0 0
2 16:00:00 0 0
2 16:30:00 10 2
2 17:00:00 10 0
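A minimal sketch of one common approach: build the full employee x slot grid and reindex against it, so the missing rows appear with zeros (the slots and full names are illustrative, and slots should really hold the full 16-slot list):

import pandas as pd

# every (Employee-ID, Time-slot) pair that should exist
slots = sorted(df['Time-slot'].unique())  # replace with the full 16-slot list
full = pd.MultiIndex.from_product(
    [df['Employee-ID'].unique(), slots],
    names=['Employee-ID', 'Time-slot'],
)

# reindex adds the missing rows; fill_value=0 zeroes Calls-received/Prod-sold
df = (
    df.set_index(['Employee-ID', 'Time-slot'])
      .reindex(full, fill_value=0)
      .reset_index()
)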

Aggregate to 15min based timestamp to hour and find sum, avg and max for multiple columns in pandas

I have a dataframe with PERIOD_START_TIME at 15-minute intervals, and I need to aggregate to 1 hour and calculate the sum and average for almost every column in the dataframe (it has about 20 columns):
PERIOD_START_TIME ID val1 val2
06.21.2017 22:15:00 12 3 0
06.21.2017 22:30:00 12 5 6
06.21.2017 22:45:00 12 0 3
06.21.2017 23:00:00 12 5 2
...
06.21.2017 22:15:00 15 9 2
06.21.2017 22:30:00 15 0 2
06.21.2017 22:45:00 15 1 5
06.21.2017 23:00:00 15 0 1
...
Desired output:
PERIOD_START_TIME ID val1(avg) val1(sum) val1(max) ...
06.21.2017 22:00:00 12 3.25 13 5
...
06.21.2017 23:00:00 15 2.25 10 9 ...
And likewise for val2 and every other column in the dataframe.
I have no idea how to group by period start time for each hour (rather than for the whole day), or how to start.
I believe you need Series.dt.floor for hours and then aggregate with agg:
df = df.groupby([df['PERIOD_START_TIME'].dt.floor('H'), 'ID']).agg(['mean', 'sum', 'max'])
# flatten the MultiIndex columns
df.columns = df.columns.map('_'.join)
print(df)
val1_mean val1_sum val1_max val2_mean val2_sum \
PERIOD_START_TIME ID
2017-06-21 22:00:00 12 2.666667 8 5 3 9
15 3.333333 10 9 3 9
2017-06-21 23:00:00 12 5.000000 5 5 2 2
15 0.000000 0 0 1 1
val2_max
PERIOD_START_TIME ID
2017-06-21 22:00:00 12 6
15 5
2017-06-21 23:00:00 12 2
15 1
df = df.reset_index()
print (df)
PERIOD_START_TIME ID val1_mean val1_sum val1_max val2_mean val2_sum \
0 2017-06-21 22:00 12 2.666667 8 5 3 9
1 2017-06-21 22:00 15 3.333333 10 9 3 9
2 2017-06-21 23:00 12 5.000000 5 5 2 2
3 2017-06-21 23:00 15 0.000000 0 0 1 1
val2_max
0 6
1 5
2 2
3 1
Very similarly you can convert PERIOD_START_TIME to a pandas Period.
df['PERIOD_START_TIME'] = df['PERIOD_START_TIME'].dt.to_period('H')
df.groupby(['PERIOD_START_TIME', 'ID']).agg(['max', 'min', 'mean']).reset_index()
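Another equivalent pattern worth knowing (a sketch, not from the answers above) is pd.Grouper, which groups on an hourly frequency directly instead of flooring or converting the timestamps:

import pandas as pd

df['PERIOD_START_TIME'] = pd.to_datetime(df['PERIOD_START_TIME'])
out = (
    df.groupby([pd.Grouper(key='PERIOD_START_TIME', freq='H'), 'ID'])
      .agg(['mean', 'sum', 'max'])
)
out.columns = out.columns.map('_'.join)  # flatten the MultiIndex columns
out = out.reset_index()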
