Pandas GroupingBy and finding repetition by unique IDs


I have a dataframe like this:
userId date new doa
67 23 2018-07-02 1 2
68 23 2018-07-03 1 3
69 23 2018-07-04 1 4
70 23 2018-07-06 1 6
71 23 2018-07-07 1 7
72 23 2018-07-10 1 10
73 23 2018-07-11 1 11
74 23 2018-07-13 1 13
75 23 2018-07-15 1 15
76 23 2018-07-16 1 16
77 23 2018-07-17 1 17
......
194605 448053 2018-08-11 1 11
194606 448054 2018-08-11 1 11
194607 448065 2018-08-11 1 11
df['doa'] stands for day of appearance.
Now I want to find out which unique userIds appear each day, i.e. which userIds appear on day 1, day 2, day 3, and so on. How exactly do I group them? I also want to find the average number of days per month on which unique users open the app.
Finally, I want to find out which users have appeared at least once every day throughout the month.
I want something like this:
userId week_no ndays
23 1 2
23 2 5
23 3 6
.....
1533 1 0
1534 2 1
1534 3 4
1534 4 1
1553 1 1
1553 2 0
1553 3 0
1553 4 0
And so on. ndays is the number of days the user appeared in that week.

You're asking several different questions. None of them is particularly difficult; they just require a couple of groupby and aggregation operations.
Setup
import pandas as pd

df = pd.DataFrame({
    'userId': [1,1,1,1,1,2,2,2,2,3,3,3,3,3],
    'date': ['2018-07-02', '2018-07-03', '2018-08-04', '2018-08-05', '2018-08-06',
             '2018-07-02', '2018-07-03', '2018-08-04', '2018-08-05', '2018-07-02', '2018-07-03',
             '2018-07-04', '2018-07-05', '2018-08-06']
})
# parse the strings as datetimes, then take the day of the month as "day of appearance"
df.date = pd.to_datetime(df.date)
df['doa'] = df.date.dt.day
userId date doa
0 1 2018-07-02 2
1 1 2018-07-03 3
2 1 2018-08-04 4
3 1 2018-08-05 5
4 1 2018-08-06 6
5 2 2018-07-02 2
6 2 2018-07-03 3
7 2 2018-08-04 4
8 2 2018-08-05 5
9 3 2018-07-02 2
10 3 2018-07-03 3
11 3 2018-07-04 4
12 3 2018-07-05 5
13 3 2018-08-06 6
Questions
How do I find the unique visitors per day?
You may use groupby and unique:
df.groupby([df.date.dt.month, 'doa']).userId.unique()
date doa
7 2 [1, 2, 3]
3 [1, 2, 3]
4 [3]
5 [3]
8 4 [1, 2]
5 [1, 2]
6 [1, 3]
Name: userId, dtype: object
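If only the number of distinct visitors per day is needed rather than the list of IDs, nunique can stand in for unique. The following is a minimal sketch that rebuilds the setup data; grouping by the normalized calendar date (instead of month plus doa) and the names day and visitors_per_day are choices made here for illustration, not part of the original answer.

```python
import pandas as pd

# Same sample data as in the setup above
df = pd.DataFrame({
    "userId": [1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3],
    "date": pd.to_datetime([
        "2018-07-02", "2018-07-03", "2018-08-04", "2018-08-05", "2018-08-06",
        "2018-07-02", "2018-07-03", "2018-08-04", "2018-08-05",
        "2018-07-02", "2018-07-03", "2018-07-04", "2018-07-05", "2018-08-06",
    ]),
})

# Strip any time-of-day component so that each calendar day is one group
day = df["date"].dt.normalize().rename("day")

# Count distinct users per calendar day
visitors_per_day = df.groupby(day)["userId"].nunique()
print(visitors_per_day)
```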
How do I find the average number of days per month users open the app?
Using groupby and size:
df.groupby(['userId', df.date.dt.month]).size()
userId date
1 7 2
8 3
2 7 2
8 2
3 7 4
8 1
dtype: int64
This will give you the number of times per month each unique visitor has visited. If you want the average of this, simply apply mean:
df.groupby(['userId', df.date.dt.month]).size().groupby('date').mean()
date
7 2.666667
8 2.000000
dtype: float64
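One caveat: size counts rows, so it equals the number of days only when each user has at most one row per day. If a user can appear several times on the same day, counting distinct days is probably closer to what the question asks. A small sketch under that assumption, with made-up sample data and illustrative names (month, day, days_per_user):

```python
import pandas as pd

# Hypothetical data: user 1 opens the app twice on 2018-07-02
df = pd.DataFrame({
    "userId": [1, 1, 1, 2],
    "date": pd.to_datetime([
        "2018-07-02 08:00", "2018-07-02 21:00", "2018-07-03 09:00",
        "2018-07-02 10:00",
    ]),
})

month = df["date"].dt.month.rename("month")
day = df["date"].dt.normalize().rename("day")

# Distinct days per user and month (duplicate visits on one day count once)
days_per_user = day.groupby([df["userId"], month]).nunique()

# Average of those per-user counts for each month
avg_days = days_per_user.groupby(level="month").mean()
print(avg_days)
```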
The last part of the question was less clear, but it seems that you want the number of days a user was seen per week.
You can group by userId together with a week number derived from your date column, counted from the earliest week in the data, then use size:
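# Series.dt.week was removed in pandas 2.0; dt.isocalendar().week replaces it.
# Note that ISO week numbers restart at the turn of the year.
week = df.date.dt.isocalendar().week
(df.groupby(
    ['userId', (week.sub(week.min())+1).rename('week_no')])
    .size().reset_index(name='ndays')
)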
userId week_no ndays
0 1 1 2
1 1 5 2
2 1 6 1
3 2 1 2
4 2 5 2
5 3 1 4
6 3 6 1
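The question also asks which users appeared at least once on every day of a month, which the steps above do not cover. One possible approach is to count each user's distinct days per month and compare that with the length of the month. This is only a sketch: the sample data and the column names day, month, month_len and days_seen are invented for illustration.

```python
import pandas as pd

# Hypothetical data: user 1 appears on every day of July 2018, user 2 on two days
all_july = pd.date_range("2018-07-01", "2018-07-31", freq="D")
df = pd.DataFrame({
    "userId": [1] * len(all_july) + [2, 2],
    "date": list(all_july) + [pd.Timestamp("2018-07-02"), pd.Timestamp("2018-07-03")],
})

# One row per (user, calendar day), even if a user shows up several times a day
daily = df.assign(day=df["date"].dt.normalize()).drop_duplicates(["userId", "day"])
daily = daily.assign(
    month=daily["day"].dt.to_period("M"),
    month_len=daily["day"].dt.days_in_month,
)

# Days seen per user and month, next to the number of days in that month
summary = (
    daily.groupby(["userId", "month"])
    .agg(days_seen=("day", "size"), month_len=("month_len", "first"))
    .reset_index()
)

# Keep only the (user, month) pairs where the user was seen on every day
every_day = summary.loc[summary["days_seen"] == summary["month_len"], ["userId", "month"]]
print(every_day)
```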

Related

Pandas rolling mean with offset by (not continuously available) date

given the following example table
Index Date Weekday Value
1 05/12/2022 2 10
2 06/12/2022 3 20
3 07/12/2022 4 40
4 09/12/2022 6 10
5 10/12/2022 7 60
6 11/12/2022 1 30
7 12/12/2022 2 40
8 13/12/2022 3 50
9 14/12/2022 4 60
10 16/12/2022 6 20
11 17/12/2022 7 50
12 18/12/2022 1 10
13 20/12/2022 3 20
14 21/12/2022 4 10
15 22/12/2022 5 40
For each row, I want to calculate the average of the last three observations that are at least a week old. I cannot use .shift because some dates are randomly missing, so .shift would not produce a reliable output.
Desired output example for last three rows in the example dataset:
Index 13: Avg of indices 8, 7, 6 = (30+40+50) / 3 = 40
Index 14: Avg of indices 9, 8, 7 = (40+50+60) / 3 = 50
Index 15: Avg of indices 9, 8, 7 = (40+50+60) / 3 = 50
What would be a working solution for this? Thanks!

Mostly inspired by Aidis's answer, you could turn his solution into an apply:
df['mean']=df.apply(lambda y: df["Value"][df['Date'] <= y['Date'] - pd.Timedelta(1, "W")].tail(3).mean(), axis=1)
or slice the data up to the current row on each call, which may run faster if you have lots of data (to be tested):
df['mean']=df.apply(lambda y: df.loc[:y.name, "Value"][ df.loc[:y.name,'Date'] <= y['Date'] - pd.Timedelta(1, "W")].tail(3).mean(), axis=1)
which returns:
Index Date Weekday Value mean
0 1 2022-12-05 2 10 NaN
1 2 2022-12-06 3 20 NaN
2 3 2022-12-07 4 40 NaN
3 4 2022-12-09 6 10 NaN
4 5 2022-12-10 7 60 NaN
5 6 2022-12-11 1 30 NaN
6 7 2022-12-12 2 40 10.000000
7 8 2022-12-13 3 50 15.000000
8 9 2022-12-14 4 60 23.333333
9 10 2022-12-16 6 20 23.333333
10 11 2022-12-17 7 50 36.666667
11 12 2022-12-18 1 10 33.333333
12 13 2022-12-20 3 20 40.000000
13 14 2022-12-21 4 10 50.000000
14 15 2022-12-22 5 40 50.000000
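The row-wise apply rescans the frame for every row. A possible alternative, sketched here and not taken from the answers, is to precompute a three-row rolling mean once and then use numpy.searchsorted to find, for each row, the last observation that is at least seven days old. It assumes the frame is sorted by Date; the names roll, cutoff and pos are illustrative.

```python
import numpy as np
import pandas as pd

# Data from the question, parsed day-first
dates = ["05/12/2022", "06/12/2022", "07/12/2022", "09/12/2022", "10/12/2022",
         "11/12/2022", "12/12/2022", "13/12/2022", "14/12/2022", "16/12/2022",
         "17/12/2022", "18/12/2022", "20/12/2022", "21/12/2022", "22/12/2022"]
values = [10, 20, 40, 10, 60, 30, 40, 50, 60, 20, 50, 10, 20, 10, 40]
df = pd.DataFrame({"Date": pd.to_datetime(dates, format="%d/%m/%Y"), "Value": values})
df = df.sort_values("Date").reset_index(drop=True)

# Mean of up to three consecutive observations ending at each position
roll = df["Value"].rolling(3, min_periods=1).mean().to_numpy()

# For each row, count how many observations are dated on or before (Date - 7 days)
cutoff = (df["Date"] - pd.Timedelta(7, "D")).to_numpy()
pos = np.searchsorted(df["Date"].to_numpy(), cutoff, side="right")

# pos == 0 means nothing is old enough yet; otherwise take the mean ending at pos - 1
df["mean"] = np.where(pos > 0, roll[np.maximum(pos - 1, 0)], np.nan)
print(df)
```

Like the apply versions, this averages fewer than three values when fewer are available.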

I apologize for this ugly code. But it seems to work:
df = df.set_index("Index")
# the dates are day-first (dd/mm/yyyy), so parse them explicitly
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
for idx in df.index:
    # rows up to and including the current one
    dfs = df.loc[:idx]
    # last three values dated at least one week before the current row
    mean = dfs["Value"][dfs['Date'] <= dfs.iloc[-1]['Date'] - pd.Timedelta(1, "W")].tail(3).mean()
    print(idx, mean)
Result:
1 nan
2 nan
3 nan
4 nan
5 nan
6 nan
7 10.0
8 15.0
9 23.333333333333332
10 23.333333333333332
11 36.666666666666664
12 33.333333333333336
13 40.0
14 50.0
15 50.0

Python Monthly Change Calculation (Pandas)

Here is data
id date population
1 2021-5 21
2 2021-5 22
3 2021-5 23
4 2021-5 24
1 2021-4 17
2 2021-4 24
3 2021-4 18
4 2021-4 29
1 2021-3 20
2 2021-3 29
3 2021-3 17
4 2021-3 22
I want to calculate the monthly change in population for each id, so the result will be:
id date delta
1 5 .2353
1 4 -.15
2 5 -.1519
2 4 -.2083
3 5 .2174
3 4 .0556
4 5 -.2083
4 4 .3182
delta := (this month - last month) / last month
How should I approach this in pandas? I'm thinking of groupby but don't know what to do next.
remember there might be more dates. but results is always

Sort by id and date first (newest date first within each id), then use GroupBy.pct_change, and finally remove the rows where delta is missing:
df['date'] = pd.to_datetime(df['date'])
df = df.sort_values(['id','date'], ascending=[True, False])
df['delta'] = df.groupby('id')['population'].pct_change(-1)
df = df.dropna(subset=['delta'])
print (df)
id date population delta
0 1 2021-05-01 21 0.235294
4 1 2021-04-01 17 -0.150000
1 2 2021-05-01 22 -0.083333
5 2 2021-04-01 24 -0.172414
2 3 2021-05-01 23 0.277778
6 3 2021-04-01 18 0.058824
3 4 2021-05-01 24 -0.172414
7 4 2021-04-01 29 0.318182
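To see what pct_change(-1) computes on data sorted newest first, the same delta can be written out by hand with a grouped shift. This is a sketch for a single id from the question; prev and delta_manual are illustrative names, and the dates are written as full ISO strings here to keep parsing unambiguous.

```python
import pandas as pd

# id 1 from the question
df = pd.DataFrame({
    "id": [1, 1, 1],
    "date": pd.to_datetime(["2021-05-01", "2021-04-01", "2021-03-01"]),
    "population": [21, 17, 20],
})
df = df.sort_values(["id", "date"], ascending=[True, False])

# With newest-first ordering, the next row within each id is the previous month
prev = df.groupby("id")["population"].shift(-1)

# delta := (this month - last month) / last month
df["delta_manual"] = (df["population"] - prev) / prev
print(df)
```

The oldest month of each id has no previous month, so its delta is NaN, which is why the answer drops those rows.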

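Try this (it assumes the rows of each id are ordered newest first, as in the question, so x.iloc[0] is this month and x.iloc[1] is last month):
df.groupby('id')['population'].rolling(2).apply(lambda x: (x.iloc[0] - x.iloc[1]) / x.iloc[1]).dropna()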

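Maybe you could try something like this (assuming the rows of each id are ordered newest first, as in the question):
data['delta'] = data.groupby('id')['population'].diff(-1)
data['delta'] /= data.groupby('id')['population'].shift(-1)
With this approach the oldest month of each id is NaN, but for the rest this should work.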

Split up the total of a value when merging dataframes with rows that contain id multiple times

I have two dataframes that I would like to merge. The first dataframe contains a customer id and a column with a value. The second dataframe contains the customer id and a purchase id. When merging, I would like to split the total value in the first dataframe by the number of times the customer id appears in the second dataframe, and give every row its share of the total.
Example: the customer with id 1 has a total value of 3000 but has bought products twice, so the value 3000 should be split when merging so that each row gets 1500.
First dataframe:
import pandas as pd
df_first = pd.DataFrame({'customer_id': [1,2,3,4,5], 'value': [3000,4000,5000,6000,7000]})
df_first.head()
Out[1]:
customer_id value
0 1 3000
1 2 4000
2 3 5000
3 4 6000
4 5 7000
Second dataframe:
df_second = pd.DataFrame({'customer_id': [1,2,3,4,5,1,2,3,4,5], 'purchase_id': [11,12,13,14,15,21,22,23,24,25]})
df_second.head(10)
Out[2]:
customer_id purchase_id
0 1 11
1 2 12
2 3 13
3 4 14
4 5 15
5 1 21
6 2 22
7 3 23
8 4 24
9 5 25
Expected output when merging:
Out[3]:
customer_id value purchase_id
0 1 1500 11
1 1 1500 21
2 2 2000 12
3 2 2000 22
4 3 2500 13
5 3 2500 23
6 4 3000 14
7 4 3000 24
8 5 3500 15
9 5 3500 25

Sort the second dataframe by customer_id and use DataFrame.merge with a left join. Then divide value by the number of rows per customer, obtained by mapping customer_id through Series.value_counts with Series.map:
df = df_second.sort_values('customer_id').merge(df_first, on='customer_id', how='left')
df['value'] /= df['customer_id'].map(df['customer_id'].value_counts())
#alternative
#df['value'] /= df.groupby('customer_id')['customer_id'].transform('size')
print (df)
customer_id purchase_id value
0 1 11 1500.0
1 1 21 1500.0
2 2 12 2000.0
3 2 22 2000.0
4 3 13 2500.0
5 3 23 2500.0
6 4 14 3000.0
7 4 24 3000.0
8 5 15 3500.0
9 5 25 3500.0
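A quick sanity check for this kind of split is that the shares of each customer add back up to the original total. The sketch below uses the transform('size') variant with made-up data in which customers have different numbers of purchases; merged and totals are illustrative names.

```python
import pandas as pd

# Hypothetical data: customer 1 has three purchases, customer 2 has one
df_first = pd.DataFrame({"customer_id": [1, 2], "value": [3000, 4000]})
df_second = pd.DataFrame({"customer_id": [1, 2, 1, 1],
                          "purchase_id": [11, 12, 21, 31]})

merged = df_second.sort_values("customer_id").merge(df_first, on="customer_id", how="left")

# Divide each customer's total by the number of rows that customer has
merged["value"] = merged["value"] / merged.groupby("customer_id")["customer_id"].transform("size")

# The per-customer shares should sum back to the original totals
totals = merged.groupby("customer_id")["value"].sum()
print(merged)
print(totals)
```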

Pandas - Times series multiple slices of a dataframe groupby Id

What I have:
A dataframe, df, consists of 3 columns (Id, Item and Timestamp). Each subject has a unique Id, with an Item recorded at a particular date and time (Timestamp). The second dataframe, df_ref, holds the date-time ranges used for slicing df: a Start and an End for each subject Id.
df:
Id Item Timestamp
0 1 aaa 2011-03-15 14:21:00
1 1 raa 2012-05-03 04:34:01
2 1 baa 2013-05-08 22:21:29
3 1 boo 2015-12-24 21:53:41
4 1 afr 2016-04-14 12:28:26
5 1 aud 2017-05-10 11:58:02
6 2 boo 2004-06-22 22:20:58
7 2 aaa 2005-11-16 07:00:00
8 2 ige 2006-06-28 17:09:18
9 2 baa 2008-05-22 21:28:00
10 2 boo 2017-06-08 23:31:06
11 3 ige 2011-06-30 13:14:21
12 3 afr 2013-06-11 01:38:48
13 3 gui 2013-06-21 23:14:26
14 3 loo 2014-06-10 15:15:42
15 3 boo 2015-01-23 02:08:35
16 3 afr 2015-04-15 00:15:23
17 3 aaa 2016-02-16 10:26:03
18 3 aaa 2016-06-10 01:11:15
19 3 ige 2016-07-18 11:41:18
20 3 boo 2016-12-06 19:14:00
21 4 gui 2016-01-05 09:19:50
22 4 aaa 2016-12-09 14:49:50
23 4 ige 2016-11-01 08:23:18
df_ref:
Id Start End
0 1 2013-03-12 00:00:00 2016-05-30 15:20:36
1 2 2005-06-05 08:51:22 2007-02-24 00:00:00
2 3 2011-05-14 10:11:28 2013-12-31 17:04:55
3 3 2015-03-29 12:18:31 2016-07-26 00:00:00
What I want:
Slice the df dataframe based on the date-time range given for each Id (groupby Id) in df_ref and concatenate the sliced data into a new dataframe. However, a subject could have more than one date-time range (in this example Id=3 has 2 date-time ranges).
df_expected:
Id Item Timestamp
0 1 baa 2013-05-08 22:21:29
1 1 boo 2015-12-24 21:53:41
2 1 afr 2016-04-14 12:28:26
3 2 aaa 2005-11-16 07:00:00
4 2 ige 2006-06-28 17:09:18
5 3 ige 2011-06-30 13:14:21
6 3 afr 2013-06-11 01:38:48
7 3 gui 2013-06-21 23:14:26
8 3 afr 2015-04-15 00:15:23
9 3 aaa 2016-02-16 10:26:03
10 3 aaa 2016-06-10 01:11:15
11 3 ige 2016-07-18 11:41:18
What I have done so far:
I referred to this post (Time series multiple slice) while writing my code. I modified the code since it does not have the groupby element that I need.
My code:
from datetime import datetime
df['Timestamp'] = pd.to_datetime(df.Timestamp, format='%Y-%m-%d %H:%M')
x = pd.DataFrame()
for pid in df_ref.Id.unique():
    selection = df[(df['Id']== pid) & (df['Timestamp']>= df_ref['Start']) & (df['Timestamp']<= df_ref['End'])]
    x = x.append(selection)
Above code give error:
ValueError: Can only compare identically-labeled Series objects

First use merge with the default inner join; for an Id that appears more than once in df_ref this creates every combination of its rows and ranges. Then use between inside DataFrame.loc to filter the rows and select the original df.columns in one step:
df1 = df.merge(df_ref, on='Id')
df2 = df1.loc[df1['Timestamp'].between(df1['Start'], df1['End']), df.columns]
print (df2)
Id Item Timestamp
2 1 baa 2013-05-08 22:21:29
3 1 boo 2015-12-24 21:53:41
4 1 afr 2016-04-14 12:28:26
7 2 aaa 2005-11-16 07:00:00
8 2 ige 2006-06-28 17:09:18
11 3 ige 2011-06-30 13:14:21
13 3 afr 2013-06-11 01:38:48
15 3 gui 2013-06-21 23:14:26
22 3 afr 2015-04-15 00:15:23
24 3 aaa 2016-02-16 10:26:03
26 3 aaa 2016-06-10 01:11:15
28 3 ige 2016-07-18 11:41:18
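A reduced, self-contained version of the same merge-then-filter idea may make the mechanics easier to follow. The data below is a subset of the question's rows, chosen so that Id 3 has two ranges and one of its rows falls in neither.

```python
import pandas as pd

df = pd.DataFrame({
    "Id": [1, 1, 3, 3, 3],
    "Item": ["aaa", "baa", "ige", "loo", "afr"],
    "Timestamp": pd.to_datetime([
        "2011-03-15 14:21:00", "2013-05-08 22:21:29", "2011-06-30 13:14:21",
        "2014-06-10 15:15:42", "2015-04-15 00:15:23",
    ]),
})
df_ref = pd.DataFrame({
    "Id": [1, 3, 3],
    "Start": pd.to_datetime(["2013-03-12 00:00:00", "2011-05-14 10:11:28",
                             "2015-03-29 12:18:31"]),
    "End": pd.to_datetime(["2016-05-30 15:20:36", "2013-12-31 17:04:55",
                           "2016-07-26 00:00:00"]),
})

# Every row of df is paired with every range of the same Id
df1 = df.merge(df_ref, on="Id")

# Keep the pairs whose Timestamp lies inside the paired range (bounds inclusive)
df2 = df1.loc[df1["Timestamp"].between(df1["Start"], df1["End"]), df.columns]
print(df2)
```

If the ranges of one Id can overlap, a row may match more than once, in which case a drop_duplicates on the result would likely be needed.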

Calculate average of every 7 instances in a dataframe column

I have this pandas dataframe with daily asset prices:
Picture of head of Dataframe
I would like to create a pandas series (it could also be an additional column in the dataframe or some other data structure) with the weekly average asset prices. This means I need to calculate the average of every 7 consecutive instances in the column and save it into a series.
Picture of how result should look like
As I am a complete newbie to python (and programming in general, for that matter), I really have no idea how to start.
I am very grateful for every tip!

I believe you need GroupBy.transform, grouping by the integer division of a numpy array created by numpy.arange. This is a general solution that works with any index (e.g. with a DatetimeIndex):
import numpy as np
import pandas as pd

np.random.seed(2018)
rng = pd.date_range('2018-04-19', periods=20)
df = pd.DataFrame({'Date': rng[::-1],
                   'ClosingPrice': np.random.randint(4, size=20)})
#print (df)
# rows 0-6 form group 0, rows 7-13 group 1, and so on
df['weekly'] = df['ClosingPrice'].groupby(np.arange(len(df)) // 7).transform('mean')
print (df)
ClosingPrice Date weekly
0 2 2018-05-08 1.142857
1 2 2018-05-07 1.142857
2 2 2018-05-06 1.142857
3 1 2018-05-05 1.142857
4 1 2018-05-04 1.142857
5 0 2018-05-03 1.142857
6 0 2018-05-02 1.142857
7 2 2018-05-01 2.285714
8 1 2018-04-30 2.285714
9 1 2018-04-29 2.285714
10 3 2018-04-28 2.285714
11 3 2018-04-27 2.285714
12 3 2018-04-26 2.285714
13 3 2018-04-25 2.285714
14 1 2018-04-24 1.666667
15 0 2018-04-23 1.666667
16 3 2018-04-22 1.666667
17 2 2018-04-21 1.666667
18 2 2018-04-20 1.666667
19 2 2018-04-19 1.666667
Detail:
print (np.arange(len(df)) // 7)
[0 0 0 0 0 0 0 1 1 1 1 1 1 1 2 2 2 2 2 2]
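transform repeats the block mean on every row. If a separate series with one value per block of 7 is wanted, as the question describes, the same grouping key can be combined with a plain mean. A minimal sketch with made-up prices; prices, block and weekly are illustrative names.

```python
import numpy as np
import pandas as pd

# Hypothetical prices: ten daily values, so the second block is incomplete
prices = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], name="ClosingPrice")

# Block number for every row: 0 for rows 0-6, 1 for rows 7-13, ...
block = np.arange(len(prices)) // 7

# One average per block; the last block averages whatever rows remain
weekly = prices.groupby(block).mean()
print(weekly)
```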
