How to calculate cumulative groupby counts in Pandas with point in time? - python

I have a df that contains multiple weekly snapshots of JIRA tickets. I want to calculate the YTD counts of tickets.
The df looks like this:
pointInTime ticketId
2008-01-01 111
2008-01-01 222
2008-01-01 333
2008-01-07 444
2008-01-07 555
2008-01-07 666
2008-01-14 777
2008-01-14 888
2008-01-14 999
So if I do df.groupby(['pointInTime'])['ticketId'].count() I can get the count of IDs in every snapshot. But what I want to achieve is to calculate the cumulative sum,
and end up with a df that looks like this:
pointInTime ticketId cumCount
2008-01-01 111 3
2008-01-01 222 3
2008-01-01 333 3
2008-01-07 444 6
2008-01-07 555 6
2008-01-07 666 6
2008-01-14 777 9
2008-01-14 888 9
2008-01-14 999 9
So for 2008-01-07, the ticket count would be the count for 2008-01-07 plus the count for 2008-01-01.
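For reference, here is a minimal sketch that rebuilds this sample frame and the per-snapshot counts (the column names are taken from the table above; the construction itself is my own and not part of the original question):
import pandas as pd

df = pd.DataFrame({
    'pointInTime': ['2008-01-01'] * 3 + ['2008-01-07'] * 3 + ['2008-01-14'] * 3,
    'ticketId': [111, 222, 333, 444, 555, 666, 777, 888, 999],
})
# per-snapshot counts are 3, 3, 3, so the running totals should be 3, 6, 9
print(df.groupby('pointInTime')['ticketId'].count())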

Use GroupBy.count and cumsum, then map the result back to "pointInTime":
df['cumCount'] = (
    df['pointInTime'].map(df.groupby('pointInTime')['ticketId'].count().cumsum()))
df
pointInTime ticketId cumCount
0 2008-01-01 111 3
1 2008-01-01 222 3
2 2008-01-01 333 3
3 2008-01-07 444 6
4 2008-01-07 555 6
5 2008-01-07 666 6
6 2008-01-14 777 9
7 2008-01-14 888 9
8 2008-01-14 999 9
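To see what gets mapped, the intermediate Series (count per snapshot, then cumsum) looks like this; a quick sketch reusing the df above:
counts = df.groupby('pointInTime')['ticketId'].count().cumsum()
print(counts)
# pointInTime
# 2008-01-01    3
# 2008-01-07    6
# 2008-01-14    9
# Name: ticketId, dtype: int64
df['pointInTime'].map(counts) then simply looks up each row's snapshot in this table.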

I am using value_counts:
df.pointInTime.map(df.pointInTime.value_counts().sort_index().cumsum())
Out[207]:
0 3
1 3
2 3
3 6
4 6
5 6
6 9
7 9
8 9
Name: pointInTime, dtype: int64
Or
pd.Series(np.arange(len(df))+1,index=df.index).groupby(df['pointInTime']).transform('last')
Out[216]:
0 3
1 3
2 3
3 6
4 6
5 6
6 9
7 9
8 9
dtype: int32

Here's an approach: transform with the group size, then multiply by the 1-based codes from pd.factorize on pointInTime:
df['cumCount'] = (df.groupby('pointInTime').ticketId
                    .transform('size')
                    .mul(pd.factorize(df.pointInTime)[0] + 1))
pointInTime ticketId cumCount
0 2008-01-01 111 3
1 2008-01-01 222 3
2 2008-01-01 333 3
3 2008-01-07 444 6
4 2008-01-07 555 6
5 2008-01-07 666 6
6 2008-01-14 777 9
7 2008-01-14 888 9
8 2008-01-14 999 9
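For clarity, pd.factorize assigns each distinct pointInTime an integer code in order of first appearance, so the codes plus one act as 1-based snapshot positions; a quick hedged check on the sample data (note that multiplying the group size by the position only matches the running total because every snapshot here has the same number of rows):
codes, uniques = pd.factorize(df.pointInTime)
print(codes + 1)
# [1 1 1 2 2 2 3 3 3]  -> group size 3 times position 1, 2, 3 gives 3, 6, 9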

Related

Create new column with largest number indexes based on values of another column

I have a DataFrame with two columns: 'goods name' and their 'overall sales'. I need to make another column that ranks the sales from 1, 2, 3..., where 1 is the largest number, 2 the second largest, and so on.
Hope you can help me.
My dataframe:
lst = [['Keyboard1', 1860], ['Keyboard2', 1650], ['Keyboard3', 900], ['Keyboard4', 1230], ['Keyboard5', 1150], ['Keyboard6', 1345],
['Mouse1', 3100], ['Mouse2', 2900], ['Mouse3', 3050], ['Mouse4', 2750], ['Mouse5', 4100], ['Mouse6', 3910]]
df = pd.DataFrame(lst, columns = ['Goods', 'Sales'])
Goods Sales
0 Keyboard1 1860
1 Keyboard2 1650
2 Keyboard3 900
3 Keyboard4 1230
4 Keyboard5 1150
5 Keyboard6 1345
6 Mouse1 3100
7 Mouse2 2900
8 Mouse3 3050
9 Mouse4 2750
10 Mouse5 4100
11 Mouse6 3910
I'm trying to use this code:
import pandas as pd
import numpy as np
df = df.sort_values('Sales', ascending = False)
df['Largest'] = np.arange(len(df))+1
But I get the ranking across all goods; I need the ranking for each type of good separately. My result:
Goods Sales Largest
10 Mouse5 4100 1
11 Mouse6 3910 2
6 Mouse1 3100 3
8 Mouse3 3050 4
7 Mouse2 2900 5
9 Mouse4 2750 6
0 Keyboard1 1860 7
1 Keyboard2 1650 8
5 Keyboard6 1345 9
3 Keyboard4 1230 10
4 Keyboard5 1150 11
2 Keyboard3 900 12
Here is the output I need:
Goods Sales Largest
10 Mouse5 4100 1
11 Mouse6 3910 2
6 Mouse1 3100 3
8 Mouse3 3050 4
7 Mouse2 2900 5
9 Mouse4 2750 6
0 Keyboard1 1860 1
1 Keyboard2 1650 2
5 Keyboard6 1345 3
3 Keyboard4 1230 4
4 Keyboard5 1150 5
2 Keyboard3 900 6
Just do:
# remove the trailing digits to get the product group
df['goods_group'] = df['Goods'].str.replace(r'\d+$', '', regex=True)
# sort by the new column and sales
df = df.sort_values(['goods_group', 'Sales'], ascending=False)
# create largest column
df['largest'] = df.groupby('goods_group').cumcount() + 1
# drop the helper column
res = df.drop(columns='goods_group')
print(res)
Output
Goods Sales largest
10 Mouse5 4100 1
11 Mouse6 3910 2
6 Mouse1 3100 3
8 Mouse3 3050 4
7 Mouse2 2900 5
9 Mouse4 2750 6
0 Keyboard1 1860 1
1 Keyboard2 1650 2
5 Keyboard6 1345 3
3 Keyboard4 1230 4
4 Keyboard5 1150 5
2 Keyboard3 900 6
Try adding these lines to the end of the code:
df['new'] = df['Goods'].str[:-1]
df['Largest'] = df.groupby('new').cumcount() + 1
df = df.drop('new', axis=1)
print(df)
Output:
Goods Sales new Largest
10 Mouse5 4100 Mouse 1
11 Mouse6 3910 Mouse 2
6 Mouse1 3100 Mouse 3
8 Mouse3 3050 Mouse 4
7 Mouse2 2900 Mouse 5
9 Mouse4 2750 Mouse 6
0 Keyboard1 1860 Keyboard 1
1 Keyboard2 1650 Keyboard 2
5 Keyboard6 1345 Keyboard 3
3 Keyboard4 1230 Keyboard 4
4 Keyboard5 1150 Keyboard 5
2 Keyboard3 900 Keyboard 6
You could groupby Goods without the digits:
>>> df = df.sort_values('Sales', ascending=False)
>>> df
Goods Sales
10 Mouse5 4100
11 Mouse6 3910
6 Mouse1 3100
8 Mouse3 3050
7 Mouse2 2900
9 Mouse4 2750
0 Keyboard1 1860
1 Keyboard2 1650
5 Keyboard6 1345
3 Keyboard4 1230
4 Keyboard5 1150
2 Keyboard3 900
>>> df['Largest'] = df.groupby(df['Goods'].replace(r'\d+', '', regex=True)).cumcount() + 1
>>> df
Goods Sales Largest
10 Mouse5 4100 1
11 Mouse6 3910 2
6 Mouse1 3100 3
8 Mouse3 3050 4
7 Mouse2 2900 5
9 Mouse4 2750 6
0 Keyboard1 1860 1
1 Keyboard2 1650 2
5 Keyboard6 1345 3
3 Keyboard4 1230 4
4 Keyboard5 1150 5
2 Keyboard3 900 6
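For completeness, a hedged alternative sketch of my own (not one of the answers above) that avoids sorting the whole frame by ranking Sales inside each extracted group:
group = df['Goods'].str.replace(r'\d+$', '', regex=True)   # 'Keyboard', 'Mouse'
df['Largest'] = (df.groupby(group)['Sales']
                   .rank(ascending=False, method='first')
                   .astype(int))
rank(ascending=False) gives 1 to the largest Sales within each group, which matches the requested output.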

Mark duplicates based on time difference between successive rows

There are duplicated transactions in a bank dataframe (DF). ID is the customer ID. A duplicated transaction is a multi-swipe, where a vendor accidentally charges a customer's card multiple times within a short time span (2 minutes here).
DF = pd.DataFrame({
    'ID': ['111', '111', '111', '111', '222', '222', '222', '333', '333', '333', '333', '111'],
    'Dollar': [1, 3, 1, 10, 25, 8, 25, 9, 20, 9, 9, 10],
    'transactionDateTime': ['2016-01-08 19:04:50', '2016-01-29 19:03:55', '2016-01-08 19:05:50',
                            '2016-01-08 20:08:50', '2016-01-08 19:04:50', '2016-02-08 19:04:50',
                            '2016-03-08 19:04:50', '2016-01-08 19:04:50', '2016-03-08 19:05:53',
                            '2016-01-08 19:03:20', '2016-01-08 19:02:15', '2016-02-08 20:08:50']})
DF['transactionDateTime'] = pd.to_datetime(DF['transactionDateTime'])
ID Dollar transactionDateTime
0 111 1 2016-01-08 19:04:50
1 111 3 2016-01-29 19:03:55
2 111 1 2016-01-08 19:05:50
3 111 10 2016-01-08 20:08:50
4 222 25 2016-01-08 19:04:50
5 222 8 2016-02-08 19:04:50
6 222 25 2016-03-08 19:04:50
7 333 9 2016-01-08 19:04:50
8 333 20 2016-03-08 19:05:53
9 333 9 2016-01-08 19:03:20
10 333 9 2016-01-08 19:02:15
11 111 10 2016-02-08 20:08:50
I want to add a column to my dataframe that flags the duplicated transactions (the dollar amount for the same customer ID should be the same, and the transaction times should be less than 2 minutes apart). Please consider the first transaction to be "normal".
ID Dollar transactionDateTime Duplicated?
0 111 1 2016-01-08 19:04:50 No
1 111 3 2016-01-29 19:03:55 No
2 111 1 2016-01-08 19:05:50 Yes
3 111 10 2016-01-08 20:08:50 No
4 222 25 2016-01-08 19:04:50 No
5 222 8 2016-02-08 19:04:50 No
6 222 25 2016-03-08 19:04:50 No
7 333 9 2016-01-08 19:04:50 Yes
8 333 20 2016-03-08 19:05:53 No
9 333 9 2016-01-08 19:03:20 Yes
10 333 9 2016-01-08 19:02:15 No
11 111 10 2016-02-08 20:08:50 No
IIUC, you can groupby and diff to check whether the difference between successive transactions is less than 120 seconds:
DF['Duplicated?'] = (DF.sort_values(['transactionDateTime'])
                       .groupby(['ID', 'Dollar'], sort=False)['transactionDateTime']
                       .diff()
                       .dt.total_seconds()
                       .lt(120))
DF
ID Dollar transactionDateTime Duplicated?
0 111 1 2016-01-08 19:04:50 False
1 111 3 2016-01-29 19:03:55 False
2 111 1 2016-01-08 19:05:50 True
3 111 10 2016-01-08 20:08:50 False
4 222 25 2016-01-08 19:04:50 False
5 222 8 2016-02-08 19:04:50 False
6 222 25 2016-03-08 19:04:50 False
7 333 9 2016-01-08 19:04:50 True
8 333 20 2016-03-08 19:05:53 False
9 333 9 2016-01-08 19:03:20 True
10 333 9 2016-01-08 19:02:15 False
11 111 10 2016-02-08 20:08:50 False
Note that your data isn't sorted, so you must sort it first to get a meaningful result.
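To see why sorting matters, here is a small hedged check of my own using the DF defined in the question: in file order the gaps for customer 333 come out negative, so a "less than 120 seconds" test on unsorted data is meaningless.
print(DF[DF['ID'] == '333']['transactionDateTime'].diff())        # includes negative gaps
sorted_333 = DF[DF['ID'] == '333'].sort_values('transactionDateTime')
print(sorted_333['transactionDateTime'].diff())                   # NaT, 65 s, 90 s, ~60 days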
You can use:
m = (DF.groupby('ID')['transactionDateTime'].diff() / np.timedelta64(1, 'm')).le(2)
DF['Duplicated?'] = np.where(DF.Dollar.duplicated() & m, 'Yes', 'No')
print(DF)
ID Dollar transactionDateTime Duplicated?
0 111 1 2016-01-08 19:04:50 No
1 111 3 2016-01-29 19:03:55 No
2 111 1 2016-01-08 19:05:50 Yes
3 111 10 2016-01-08 20:08:50 No
4 222 25 2016-01-08 19:04:50 No
5 222 8 2016-02-08 19:04:50 No
6 222 25 2016-03-08 19:04:50 No
7 333 9 2016-01-08 19:04:50 No
8 333 20 2016-03-08 19:05:53 No
9 333 9 2016-01-08 19:03:20 Yes
10 333 9 2016-01-08 19:02:15 Yes
11 111 10 2016-02-08 20:08:50 No
We can first mark the duplicate payments in your Dollar column. Then mark per customer if the difference is less than 2 minutes:
DF.sort_values(['ID', 'transactionDateTime'], inplace=True)
m1 = DF.groupby('ID', sort=False)['Dollar'].apply(lambda x: x.duplicated())
m2 = DF.groupby('ID', sort=False)['transactionDateTime'].diff() <= pd.Timedelta(2, unit='minutes')
DF['Duplicated?'] = np.where(m1 & m2, 'Yes', 'No')
ID Dollar transactionDateTime Duplicated?
0 111 1 2016-01-08 19:04:50 No
1 111 1 2016-01-08 19:05:50 Yes
2 111 10 2016-01-08 20:08:50 No
3 111 3 2016-01-29 19:03:55 No
4 111 10 2016-02-08 20:08:50 No
5 222 25 2016-01-08 19:04:50 No
6 222 8 2016-02-08 19:04:50 No
7 222 25 2016-03-08 19:04:50 No
8 333 9 2016-01-08 19:02:15 No
9 333 9 2016-01-08 19:03:20 Yes
10 333 9 2016-01-08 19:04:50 Yes
11 333 20 2016-03-08 19:05:53 No
I create a pd.Timedelta(minutes=2) to compare against the diff():
m2 = pd.Timedelta(minutes=2)
DF['dup'] = (DF.sort_values('transactionDateTime')
               .groupby(['Dollar', 'ID']).transactionDateTime
               .diff().abs().le(m2).astype(int))
Out[272]:
Dollar ID transactionDateTime dup
0 1 111 2016-01-08 19:04:50 0
1 3 111 2016-01-29 19:03:55 0
2 1 111 2016-01-08 19:05:50 1
3 10 111 2016-01-08 20:08:50 0
4 25 222 2016-01-08 19:04:50 0
5 8 222 2016-02-08 19:04:50 0
6 25 222 2016-03-08 19:04:50 0
7 9 333 2016-01-08 19:04:50 1
8 20 333 2016-03-08 19:05:53 0
9 9 333 2016-01-08 19:03:20 1
10 9 333 2016-01-08 19:02:15 0
11 10 111 2016-02-08 20:08:50 0
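Putting the pieces together, a hedged end-to-end sketch of my own that combines the ideas above (sort, diff per (ID, Dollar), 2-minute threshold) and produces the Yes/No column the question asks for; the first transaction of each group has a NaT gap, which compares as False and therefore stays 'No':
gap = (DF.sort_values('transactionDateTime')
         .groupby(['ID', 'Dollar'])['transactionDateTime']
         .diff())
DF['Duplicated?'] = gap.lt(pd.Timedelta(minutes=2)).map({True: 'Yes', False: 'No'})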

How to compress rows after groupby in pandas

I have performed a groupby on my dataframe.
grouped = data_df.groupby(['Cluster','Visit Number Final'])['Visitor_ID'].count()
I am getting the below output:
data_df.groupby(['Cluster','Visit Number Final'])['Visitor_ID'].count()
Out[81]:
Cluster Visit Number Final
0 1 21846
2 1485
3 299
4 95
5 24
6 8
7 3
1 1 33600
2 2283
3 404
4 117
5 34
6 7
2 1 5858
2 311
3 55
4 14
5 6
6 3
7 1
3 1 19699
2 1101
3 214
4 78
5 14
6 8
7 3
4 1 10086
2 344
3 59
4 14
5 3
6 1
Name: Visitor_ID, dtype: int64
Now I want to compress the rows whose Visit Number Final is > 3 (add a new row that holds the sum for Visit Number Final 4, 5, 6, ...). I am trying groupby.filter but not getting the expected output.
My final output should look like
Cluster Visit Number Final
0 1 21846
2 1485
3 299
>=4 130
1 1 33600
2 2283
3 404
>=4 158
2 1 5858
2 311
3 55
>=4 24
3 1 19699
2 1101
3 214
>=4 103
4 1 10086
2 344
3 59
>=4 18
The easiest way is to replace the 'Visit Number Final' values bigger than 3 before you group the dataframe:
data_df.loc[data_df['Visit Number Final'] > 3, 'Visit Number Final'] = '>=4'
data_df.groupby(['Cluster','Visit Number Final'])['Visitor_ID'].count()
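A small hedged variant of the same idea (my own sketch, not part of the original answer): casting the key to string first keeps the group labels in a single dtype, so the grouping never has to compare ints with the '>=4' string.
key = data_df['Visit Number Final'].astype(str).mask(data_df['Visit Number Final'] > 3, '>=4')
data_df.groupby(['Cluster', key])['Visitor_ID'].count()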
Try this (here df is assumed to be the grouped result above, as a DataFrame with the count column named 'Number Final'):
visit_val = df.index.get_level_values(1)
grp = np.where(visit_val > 3, '>=4', visit_val)
(df.groupby(['Cluster', grp])['Number Final'].sum()
   .reset_index().rename(columns={'level_1': 'Visit'}))
Output:
Cluster Visit Number Final
0 0 1 21846
1 0 2 1485
2 0 3 299
3 0 >=4 130
4 1 1 33600
5 1 2 2283
6 1 3 404
7 1 >=4 158
8 2 1 5858
9 2 2 311
10 2 3 55
11 2 >=4 24
12 3 1 19699
13 3 2 1101
14 3 3 214
15 3 >=4 103
16 4 1 10086
17 4 2 344
18 4 3 59
19 4 >=4 18
Or, to get a dataframe that keeps the MultiIndex:
(df.groupby(['Cluster', grp])['Number Final'].sum()
   .rename_axis(['Cluster', 'Visit']).to_frame())
Output:
Number Final
Cluster Visit
0 1 21846
2 1485
3 299
>=4 130
1 1 33600
2 2283
3 404
>=4 158
2 1 5858
2 311
3 55
>=4 24
3 1 19699
2 1101
3 214
>=4 103
4 1 10086
2 344
3 59
>=4 18

Pandas: sum values in some column

I need to group elements and sum one column.
member_id event_path event_duration
0 111 vk.com 1
1 111 twitter.com 4
2 111 facebook.com 56
3 111 vk.com 23
4 222 vesti.ru 6
5 222 facebook.com 23
6 222 vk.com 56
7 333 avito.ru 8
8 333 avito.ru 4
9 444 mail.ru 7
10 444 vk.com 20
11 444 yandex.ru 40
12 111 vk.com 10
13 222 vk.com 20
And I want to unify member_id and event_path and sum event_duration.
Desired output:
member_id event_path event_duration
0 111 vk.com 34
1 111 twitter.com 4
2 111 facebook.com 56
4 222 vesti.ru 6
5 222 facebook.com 23
6 222 vk.com 76
7 333 avito.ru 12
9 444 mail.ru 7
10 444 vk.com 20
11 444 yandex.ru 40
I use
df['event_duration'] = df.groupby(['member_id', 'event_path'])['event_duration'].transform('sum')
but I get
member_id event_path event_duration
0 111 vk.com 34
1 111 twitter.com 4
2 111 facebook.com 56
3 111 vk.com 34
4 222 vesti.ru 6
5 222 facebook.com 23
6 222 vk.com 76
7 333 avito.ru 12
8 333 avito.ru 12
9 444 mail.ru 7
10 444 vk.com 20
11 444 yandex.ru 40
12 111 vk.com 34
13 222 vk.com 76
What am I doing wrong?
You need groupby with parameters sort=False and as_index=False with aggregation sum:
df = df.groupby(['member_id','event_path'],sort=False,as_index=False)['event_duration'].sum()
print (df)
member_id event_path event_duration
0 111 vk.com 34
1 111 twitter.com 4
2 111 facebook.com 56
3 222 vesti.ru 6
4 222 facebook.com 23
5 222 vk.com 76
6 333 avito.ru 12
7 444 mail.ru 7
8 444 vk.com 20
9 444 yandex.ru 40
Another possible solution is to add reset_index:
df = df.groupby(['member_id', 'event_path'],sort=False)['event_duration'].sum().reset_index()
print (df)
member_id event_path event_duration
0 111 vk.com 34
1 111 twitter.com 4
2 111 facebook.com 56
3 222 vesti.ru 6
4 222 facebook.com 23
5 222 vk.com 76
6 333 avito.ru 12
7 444 mail.ru 7
8 444 vk.com 20
9 444 yandex.ru 40
The transform function is used to add an aggregated calculation back to the original df as a new column.
What you are doing wrong is assigning the result back to a column of the original dataframe: transform broadcasts the group sum to every row, so the duplicate rows stay instead of being collapsed into one row per group.
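A minimal hedged sketch of the difference, using the member_id data above: transform keeps one row per original row, while an aggregating sum collapses each group to a single row, which is what the question asks for.
# transform: same length as df, the group total is repeated on every matching row
df.groupby(['member_id', 'event_path'])['event_duration'].transform('sum')
# aggregation: one row per (member_id, event_path) pair
df.groupby(['member_id', 'event_path'], as_index=False)['event_duration'].sum()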

Pandas: union duplicate strings

I have a dataframe
ID url date active_seconds
111 vk.com 12.01.2016 5
111 facebook.com 12.01.2016 4
111 facebook.com 12.01.2016 3
111 twitter.com 12.01.2016 12
222 vk.com 12.01.2016 8
222 twitter.com 12.01.2016 34
111 facebook.com 12.01.2016 5
and I need to get
ID url date active_seconds
111 vk.com 12.01.2016 5
111 facebook.com 12.01.2016 7
111 twitter.com 12.01.2016 12
222 vk.com 12.01.2016 8
222 twitter.com 12.01.2016 34
111 facebook.com 12.01.2016 5
If I try
df.groupby(['ID', 'url'])['active_seconds'].sum()
it merges all matching rows, including the non-consecutive ones. What should I do to get the desired output?
(s != s.shift()).cumsum() is a typical way to identify groups of contiguous identifiers
pd.DataFrame.assign is a convenient way to add a new column to a copy of a dataframe and chain more methods
pivot_table allows us to reconfigure our table and aggregate
args - this is a style preference of mine to keep code cleaner looking. I'll pass these arguments to pivot_table via *args
reset_index * 2 to clean up and get to final result
args = ('active_seconds', ['g', 'ID', 'url', 'date'], None, 'sum')
df.assign(g=df.ID.ne(df.ID.shift()).cumsum()).pivot_table(*args) \
.reset_index([1, 2, 3]).reset_index(drop=True)
ID url date active_seconds
0 111 facebook.com 12.01.2016 7
1 111 twitter.com 12.01.2016 12
2 111 vk.com 12.01.2016 5
3 222 twitter.com 12.01.2016 34
4 222 vk.com 12.01.2016 8
5 111 facebook.com 12.01.2016 5
Solution 1 - cumsum by column url only:
You need to group by a custom Series created by a cumsum over a boolean mask, but the url column then needs to be aggregated with first. Then remove the url level with reset_index and finally reorder the columns with reindex:
g = (df.url != df.url.shift()).cumsum()
print (g)
0 1
1 2
2 2
3 3
4 4
5 5
6 6
Name: url, dtype: int32
g = (df.url != df.url.shift()).cumsum()
#another solution with ne
#g = df.url.ne(df.url.shift()).cumsum()
print (df.groupby([df.ID, df.date, g], sort=False).agg({'active_seconds':'sum', 'url':'first'})
         .reset_index(level='url', drop=True)
         .reset_index()
         .reindex(columns=df.columns))
ID url date active_seconds
0 111 vk.com 12.01.2016 5
1 111 facebook.com 12.01.2016 7
2 111 twitter.com 12.01.2016 12
3 222 vk.com 12.01.2016 8
4 222 twitter.com 12.01.2016 34
5 111 facebook.com 12.01.2016 5
g = (df.url != df.url.shift()).cumsum().rename('tmp')
print (g)
0 1
1 2
2 2
3 3
4 4
5 5
6 6
Name: tmp, dtype: int32
print (df.groupby([df.ID, df.url, df.date, g], sort=False)['active_seconds']
         .sum()
         .reset_index(level='tmp', drop=True)
         .reset_index())
ID url date active_seconds
0 111 vk.com 12.01.2016 5
1 111 facebook.com 12.01.2016 7
2 111 twitter.com 12.01.2016 12
3 222 vk.com 12.01.2016 8
4 222 twitter.com 12.01.2016 34
5 111 facebook.com 12.01.2016 5
Solution 2 - cumsum by columns ID and url:
g = df[['ID','url']].ne(df[['ID','url']].shift()).cumsum()
print (g)
ID url
0 1 1
1 1 2
2 1 2
3 1 3
4 2 4
5 2 5
6 3 6
print (df.groupby([g.ID, df.date, g.url], sort=False)
         .agg({'active_seconds':'sum', 'url':'first'})
         .reset_index(level='url', drop=True)
         .reset_index()
         .reindex(columns=df.columns))
ID url date active_seconds
0 1 vk.com 12.01.2016 5
1 1 facebook.com 12.01.2016 7
2 1 twitter.com 12.01.2016 12
3 2 vk.com 12.01.2016 8
4 2 twitter.com 12.01.2016 34
5 3 facebook.com 12.01.2016 5
And a solution that adds the column df.url, but it is necessary to rename the columns in the helper df:
g = df[['ID','url']].ne(df[['ID','url']].shift()).cumsum()
g.columns = g.columns + '1'
print (g)
ID1 url1
0 1 1
1 1 2
2 1 2
3 1 3
4 2 4
5 2 5
6 3 6
print (df.groupby([df.ID, df.url, df.date, g.ID1, g.url1], sort=False)['active_seconds']
         .sum()
         .reset_index(level=['ID1','url1'], drop=True)
         .reset_index())
ID url date active_seconds
0 111 vk.com 12.01.2016 5
1 111 facebook.com 12.01.2016 7
2 111 twitter.com 12.01.2016 12
3 222 vk.com 12.01.2016 8
4 222 twitter.com 12.01.2016 34
5 111 facebook.com 12.01.2016 5
Timings:
Similar solutions, but pivot_table is slower than groupby:
In [180]: %timeit (df.assign(g=df.ID.ne(df.ID.shift()).cumsum()).pivot_table('active_seconds', ['g', 'ID', 'url', 'date'], None, 'sum').reset_index([1, 2, 3]).reset_index(drop=True))
100 loops, best of 3: 5.02 ms per loop
In [181]: %timeit (df.groupby([df.ID, df.url, df.date, (df.url != df.url.shift()).cumsum().rename('tmp')], sort=False)['active_seconds'].sum().reset_index(level='tmp', drop=True).reset_index())
100 loops, best of 3: 3.62 ms per loop
It looks like you want a cumsum():
In [195]: df.groupby(['ID', 'url'])['active_seconds'].cumsum()
Out[195]:
0 5
1 4
2 7
3 12
4 8
5 34
6 12
Name: active_seconds, dtype: int64
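For reference, a compact hedged sketch of my own distilling the consecutive-block idea above (it assumes the df shown in the question): rows only merge when both ID and url repeat back to back.
block = df[['ID', 'url']].ne(df[['ID', 'url']].shift()).any(axis=1).cumsum()
(df.groupby(block)
   .agg({'ID': 'first', 'url': 'first', 'date': 'first', 'active_seconds': 'sum'})
   .reset_index(drop=True))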
