Python: how to group by for each user?

I have a dataframe that looks like the following
uid timestamp count val
0 ccf7758a-155f-4ebf-8740-68320f279baa 2020-03-17 13:00:00 23 3
1 ccf7758a-155f-4ebf-8740-68320f279baa 2020-03-17 13:00:00 20 2
2 ccf7758a-155f-4ebf-8740-68320f279baa 2020-03-17 15:00:00 10 5
3 16162f81-d745-41c2-a7d6-f11486958e36 2020-03-18 09:00:00 9 6
4 16162f81-d745-41c2-a7d6-f11486958e36 2020-03-18 09:00:00 9 3
I would like to group by uid in order to get the sum of count for every hour and the average of val.
I would like something like the following:
uid timestamp count val
0 ccf7758a-155f-4ebf-8740-68320f279baa 2020-03-17 13:00:00 43 2.5
2 ccf7758a-155f-4ebf-8740-68320f279baa 2020-03-17 15:00:00 10 5
3 16162f81-d745-41c2-a7d6-f11486958e36 2020-03-18 09:00:00 18 4.5

You can try groupby in combination with agg, using a dictionary that maps each column to its aggregation function:
import pandas as pd
import numpy as np

df.groupby(['uid', 'timestamp']).agg({"val": np.mean, "count": np.sum})
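If you also want uid and timestamp back as regular columns, as in the desired output, named aggregation is a common alternative; a minimal sketch (the out name is just illustrative):

import pandas as pd

# String aggregation names ("sum", "mean") work as well as the numpy
# callables, and as_index=False keeps the group keys as ordinary columns.
out = df.groupby(['uid', 'timestamp'], as_index=False).agg(
    count=('count', 'sum'),
    val=('val', 'mean'),
)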

Related

Get rolling average without every timestamp

I have data about how many messages each account sends, aggregated to an hourly level. For each row, I would like to add a column with the sum of the previous 7 days' messages. I know I can group by account and date and aggregate the number of messages to the daily level, but I'm having a hard time calculating the rolling average because there isn't a row in the data if the account didn't send any messages that day (and I'd like not to balloon my data by adding these rows in, if at all possible). If I could figure out a way to calculate the rolling 7-day average for each day that each account sent messages, I could then re-join that number back to the hourly data (is my hope). Any suggestions?
Note: For any day not in the data, assume 0 messages sent.
Raw Data:
Account | Messages | Date | Hour
12 5 2022-07-11 09:00:00
12 6 2022-07-13 10:00:00
12 10 2022-07-13 11:00:00
12 9 2022-07-15 16:00:00
12 1 2022-07-19 13:00:00
15 2 2022-07-12 10:00:00
15 13 2022-07-13 11:00:00
15 3 2022-07-17 16:00:00
15 4 2022-07-22 13:00:00
Desired Output:
Account | Messages | Date | Hour | Rolling Previous 7 Day Average
12 5 2022-07-11 09:00:00 0
12 6 2022-07-13 10:00:00 0.714
12 10 2022-07-13 11:00:00 0.714
12 9 2022-07-15 16:00:00 3
12 1 2022-07-19 13:00:00 3.571
15 2 2022-07-12 10:00:00 0
15 13 2022-07-13 11:00:00 0.286
15 3 2022-07-17 16:00:00 2.143
15 4 2022-07-22 13:00:00 0.429
I hope I've understood your question right:
df["Date"] = pd.to_datetime(df["Date"])
df["Messages_tmp"] = df.groupby(["Account", "Date"])["Messages"].transform(
"sum"
)
df["Rolling Previous 7 Day Average"] = (
df.set_index("Date")
.groupby("Account")["Messages_tmp"]
.rolling("7D")
.apply(lambda x: x.loc[~x.index.duplicated()].shift().sum() / 7)
).values
df = df.drop(columns="Messages_tmp")
print(df)
Prints:
Account Messages Date Hour Rolling Previous 7 Day Average
0 12 5 2022-07-11 09:00:00 0.000000
1 12 6 2022-07-13 10:00:00 0.714286
2 12 10 2022-07-13 11:00:00 0.714286
3 12 9 2022-07-15 16:00:00 3.000000
4 12 1 2022-07-19 13:00:00 3.571429
5 15 2 2022-07-12 10:00:00 0.000000
6 15 13 2022-07-13 11:00:00 0.285714
7 15 3 2022-07-17 16:00:00 2.142857
8 15 4 2022-07-22 13:00:00 0.428571

pandas datetime giving wrong output

I am working with a pandas dataframe with a date column. I have converted the dtype of this column from object to datetime using pd.to_datetime:
Input:
0 30-11-2019
1 31-12-2019
2 31-12-2019
3 31-12-2019
4 31-12-2019
5 21-01-2020
6 27-01-2020
7 01-02-2020
8 01-02-2020
9 03-02-2020
10 15-02-2020
11 12-03-2020
12 13-03-2020
13 31-03-2020
14 31-03-2020
15 04-04-2020
16 04-04-2020
17 04-04-2020
ta['transaction_date'] = pd.to_datetime(ta['transaction_date'])
Output:
0 2019-11-30
1 2019-12-31
2 2019-12-31
3 2019-12-31
4 2019-12-31
5 2020-01-21
6 2020-01-27
7 2020-01-02
8 2020-01-02
9 2020-03-02
10 2020-02-15
11 2020-12-03
12 2020-03-13
13 2020-03-31
14 2020-03-31
15 2020-04-04
16 2020-04-04
17 2020-04-04
As you can see, the output at index 11 (and several others, such as indices 7-9) is wrong after converting to datetime: the month is swapped with the day. This is affecting my further analysis. How can I sort this out?
Use the dayfirst=True parameter or specify a format, because pandas by default parses the month first when possible:
ta['transaction_date'] = pd.to_datetime(ta['transaction_date'], dayfirst=True)
Or:
ta['transaction_date'] = pd.to_datetime(ta['transaction_date'], format='%d-%m-%Y')
Method 1
pd.to_datetime has a parameter named dayfirst; set it to True.
Method 2
Use the format parameter of the to_datetime function.
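A quick demonstration of the difference (a standalone sketch, using one of the ambiguous values from the question):

import pandas as pd

# '12-03-2020' is ambiguous: month-first gives December 3, day-first March 12.
print(pd.to_datetime('12-03-2020'))                 # 2020-12-03
print(pd.to_datetime('12-03-2020', dayfirst=True))  # 2020-03-12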

Resample by some timeframe

I have dataframe like:
Timestamp Sold
10.01.2017 10:00:20 10
10.01.2017 10:01:55 20
10.01.2017 11:02:11 15
11.01.2017 11:04:30 10
11.01.2017 11:15:35 35
12.01.2017 10:02:01 22
How do I resample it by hour? An ordinary resample covers all hours from the first row to the last, but what I need is to restrict it to a time frame (10:00-11:00) and resample within that time frame.
The final df should look like this:
Timestamp Sold
10.01.2017 10:00:00 30
10.01.2017 11:00:00 15
11.01.2017 10:00:00 NAN
11.01.2017 11:00:00 45
12.01.2017 10:00:00 22
12.01.2017 11:00:00 NAN
You could do something like this:
df_out = df.groupby(df.Timestamp.dt.floor('H')).sum()
df_out = df_out.reset_index()
Output:
Timestamp Sold
0 2017-10-01 10:00:00 30
1 2017-10-01 11:00:00 15
2 2017-11-01 11:00:00 45
3 2017-12-01 10:00:00 22
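Note that groupby drops hour buckets with no data, so the NaN rows from the desired output are missing. A minimal sketch that fills them in by reindexing over the full date x hour grid (assuming the timestamps are day-first dd.mm.yyyy and the window is fixed at 10:00 and 11:00):

import pandas as pd

# parse day-first timestamps, then sum per hour bucket
df['Timestamp'] = pd.to_datetime(df['Timestamp'], dayfirst=True)
hourly = df.groupby(df['Timestamp'].dt.floor('H'))['Sold'].sum()

# build every date x {10:00, 11:00} combination so empty buckets become NaN
days = pd.date_range(df['Timestamp'].min().normalize(),
                     df['Timestamp'].max().normalize(), freq='D')
grid = pd.DatetimeIndex([d + pd.Timedelta(hours=h) for d in days for h in (10, 11)])
result = hourly.reindex(grid).rename_axis('Timestamp').reset_index(name='Sold')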

How to add missing rows in dataframe by comparing values in python

Hi, I am using pandas dataframes in Python. I have data something like the following:
Employee-ID Time-slot Calls-received Prod-sold
1 14:30:00 10 1
1 15:00:00 15 3
1 15:30:00 10 2
1 16:00:00 8 2
1 16:30:00 10 0
2 14:30:00 10 2
2 15:00:00 15 3
2 16:30:00 10 2
2 17:00:00 10 0
I have 10,000 employees, and ideally there should be 16 time slots for each employee, but time slots are missing for some of them; for example, employee 2 is missing the 15:30:00 and 16:00:00 slots. I wish to add new rows with the missing time slots and zero values for Calls-received and Prod-sold, something like this:
2 14:30:00 10 2
2 15:00:00 15 3
2 15:30:00 0 0
2 16:00:00 0 0
2 16:30:00 10 2
2 17:00:00 10 0
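A minimal sketch of one common approach: build the full employee x slot grid and reindex against it, so the missing rows appear with zeros (the slots and full names are illustrative, and slots should really hold the full 16-slot list):

import pandas as pd

# every (Employee-ID, Time-slot) pair that should exist
slots = sorted(df['Time-slot'].unique())  # replace with the full 16-slot list
full = pd.MultiIndex.from_product(
    [df['Employee-ID'].unique(), slots],
    names=['Employee-ID', 'Time-slot'],
)

# reindex adds the missing rows; fill_value=0 zeroes Calls-received/Prod-sold
df = (
    df.set_index(['Employee-ID', 'Time-slot'])
      .reindex(full, fill_value=0)
      .reset_index()
)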

Aggregate to 15min based timestamp to hour and find sum, avg and max for multiple columns in pandas

I have a dataframe with PERIOD_START_TIME at 15-minute intervals, and I need to aggregate to 1 hour and calculate the sum and average for almost every column in the dataframe (it has about 20 columns):
PERIOD_START_TIME ID val1 val2
06.21.2017 22:15:00 12 3 0
06.21.2017 22:30:00 12 5 6
06.21.2017 22:45:00 12 0 3
06.21.2017 23:00:00 12 5 2
...
06.21.2017 22:15:00 15 9 2
06.21.2017 22:30:00 15 0 2
06.21.2017 22:45:00 15 1 5
06.21.2017 23:00:00 15 0 1
...
Desired output:
PERIOD_START_TIME ID val1(avg) val1(sum) val1(max) ...
06.21.2017 22:00:00 12 3.25 13 5
...
06.21.2017 23:00:00 15 2.25 10 9 ...
And likewise for val2 and every other column in the dataframe.
I have no idea how to group by period start time for each hour (rather than for the whole day), or how to start.
I believe you need Series.dt.floor for hours and then aggregate with agg:
df = df.groupby([df['PERIOD_START_TIME'].dt.floor('H'), 'ID']).agg(['mean', 'sum', 'max'])
# flatten the MultiIndex columns
df.columns = df.columns.map('_'.join)
print(df)
val1_mean val1_sum val1_max val2_mean val2_sum \
PERIOD_START_TIME ID
2017-06-21 22:00:00 12 2.666667 8 5 3 9
15 3.333333 10 9 3 9
2017-06-21 23:00:00 12 5.000000 5 5 2 2
15 0.000000 0 0 1 1
val2_max
PERIOD_START_TIME ID
2017-06-21 22:00:00 12 6
15 5
2017-06-21 23:00:00 12 2
15 1
df = df.reset_index()
print (df)
PERIOD_START_TIME ID val1_mean val1_sum val1_max val2_mean val2_sum \
0 2017-06-21 22:00 12 2.666667 8 5 3 9
1 2017-06-21 22:00 15 3.333333 10 9 3 9
2 2017-06-21 23:00 12 5.000000 5 5 2 2
3 2017-06-21 23:00 15 0.000000 0 0 1 1
val2_max
0 6
1 5
2 2
3 1
Very similarly you can convert PERIOD_START_TIME to a pandas Period.
df['PERIOD_START_TIME'] = df['PERIOD_START_TIME'].dt.to_period('H')
df.groupby(['PERIOD_START_TIME', 'ID']).agg(['max', 'min', 'mean']).reset_index()
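Another equivalent pattern worth knowing (a sketch, not from the answers above) is pd.Grouper, which groups on an hourly frequency directly instead of flooring or converting the timestamps:

import pandas as pd

df['PERIOD_START_TIME'] = pd.to_datetime(df['PERIOD_START_TIME'])
out = (
    df.groupby([pd.Grouper(key='PERIOD_START_TIME', freq='H'), 'ID'])
      .agg(['mean', 'sum', 'max'])
)
out.columns = out.columns.map('_'.join)  # flatten the MultiIndex columns
out = out.reset_index()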
