Mark duplicates based on time difference between successive rows

Mark duplicates based on time difference between successive rows - python

There are duplicated transactions in a bank dataframe(DF). ID is customer IDs. Duplicated transaction is a multi-swipe, where a vendor accidentally charges a customer's card multiple times within a short time span (2 minutes here).
DF = pd.DataFrame({'ID': ['111', '111', '111','111', '222', '222', '222', '333', '333', '333', '333','111'],'Dollar': [1,3,1,10, 25, 8, 25,9,20, 9, 9,10],'transactionDateTime': ['2016-01-08 19:04:50', '2016-01-29 19:03:55', '2016-01-08 19:05:50', '2016-01-08 20:08:50', '2016-01-08 19:04:50', '2016-02-08 19:04:50', '2016-03-08 19:04:50', '2016-01-08 19:04:50', '2016-03-08 19:05:53', '2016-01-08 19:03:20', '2016-01-08 19:02:15', '2016-02-08 20:08:50']})
DF['transactionDateTime'] = pd.to_datetime(DF['transactionDateTime'])
ID Dollar transactionDateTime
0 111 1 2016-01-08 19:04:50
1 111 3 2016-01-29 19:03:55
2 111 1 2016-01-08 19:05:50
3 111 10 2016-01-08 20:08:50
4 222 25 2016-01-08 19:04:50
5 222 8 2016-02-08 19:04:50
6 222 25 2016-03-08 19:04:50
7 333 9 2016-01-08 19:04:50
8 333 20 2016-03-08 19:05:53
9 333 9 2016-01-08 19:03:20
10 333 9 2016-01-08 19:02:15
11 111 10 2016-02-08 20:08:50
I want to add a column to my dataframe, which recognizes the duplicated transactions (dollar amount of same customer ID should be the same, and transaction date time should be less than 2 minutes). Please consider the first transaction to be "normal".
ID Dollar transactionDateTime Duplicated?
0 111 1 2016-01-08 19:04:50 No
1 111 3 2016-01-29 19:03:55 No
2 111 1 2016-01-08 19:05:50 Yes
3 111 10 2016-01-08 20:08:50 No
4 222 25 2016-01-08 19:04:50 No
5 222 8 2016-02-08 19:04:50 No
6 222 25 2016-03-08 19:04:50 No
7 333 9 2016-01-08 19:04:50 Yes
8 333 20 2016-03-08 19:05:53 No
9 333 9 2016-01-08 19:03:20 Yes
10 333 9 2016-01-08 19:02:15 No
11 111 10 2016-02-08 20:08:50 No

IIUC, you can groupby and diff to check whether the difference between successive transactions is less than 120 seconds:
df['Duplicated?'] = (df.sort_values(['transactionDateTime'])
.groupby(['ID', 'Dollar'], sort=False)['transactionDateTime']
.diff()
.dt.total_seconds()
.lt(120))
df
ID Dollar transactionDateTime Duplicated?
0 111 1 2016-01-08 19:04:50 False
1 111 3 2016-01-29 19:03:55 False
2 111 1 2016-01-08 19:05:50 True
3 111 100 2016-01-08 20:08:50 False
4 222 25 2016-01-08 19:04:50 False
5 222 8 2016-02-08 19:04:50 False
6 222 25 2016-03-08 19:04:50 False
7 333 9 2016-01-08 19:04:50 True
8 333 20 2016-03-08 19:05:53 False
9 333 9 2016-01-08 19:03:20 True
10 333 9 2016-01-08 19:02:15 False
11 111 100 2016-02-08 20:08:50 False
Note that your data isn't sorted, so you must sort it first to get a meaningful result.

You can use:
m=(DF.groupby('customerID')['transactionDateTime'].diff()/ np.timedelta64(1, 'm')).le(2)
DF['Duplicated?']=np.where((DF.Dollar.duplicated()&m),'Yes','No')
print(DF)
customerID Dollar transactionDateTime Duplicated?
0 111 1 2016-01-08 19:04:50 No
1 111 3 2016-01-29 19:03:55 No
2 111 1 2016-01-08 19:05:50 Yes
3 111 100 2016-01-08 20:08:50 No
4 222 25 2016-01-08 19:04:50 No
5 222 8 2016-02-08 19:04:50 No
6 222 25 2016-03-08 19:04:50 No
7 333 9 2016-01-08 19:04:50 No
8 333 20 2016-03-08 19:05:53 No
9 333 9 2016-01-08 19:03:20 Yes
10 333 9 2016-01-08 19:02:15 Yes
11 111 100 2016-02-08 20:08:50 No

We can first mark the duplicate payments in your Dollar column. Then mark per customer if the difference is less then 2 minutes:
DF.sort_values(['customerID', 'transactionDateTime'], inplace=True)
m1 = DF.groupby('customerID', sort=False)['Dollar'].apply(lambda x: x.duplicated())
m2 = DF.groupby('customerID', sort=False)['transactionDateTime'].diff() <= pd.Timedelta(2, unit='minutes')
DF['Duplicated?'] = np.where(m1 & m2, 'Yes', 'No')
customerID Dollar transactionDateTime Duplicated?
0 111 1 2016-01-08 19:04:50 No
1 111 1 2016-01-08 19:05:50 Yes
2 111 100 2016-01-08 20:08:50 No
3 111 3 2016-01-29 19:03:55 No
4 111 100 2016-02-08 20:08:50 No
5 222 25 2016-01-08 19:04:50 No
6 222 8 2016-02-08 19:04:50 No
7 222 25 2016-03-08 19:04:50 No
8 333 9 2016-01-08 19:02:15 No
9 333 9 2016-01-08 19:03:20 Yes
10 333 9 2016-01-08 19:04:50 Yes
11 333 20 2016-03-08 19:05:53 No

I made pd.Timedelta(minutes=2) to compare against the diff()
m2 = pd.Timedelta(minutes=2)
DF['dup'] = DF.sort_values('transactionDateTime').groupby(['Dollar','ID']).transactionDateTime.diff().abs().le(m2).astype(int)
Out[272]:
Dollar ID transactionDateTime dup
0 1 111 2016-01-08 19:04:50 0
1 3 111 2016-01-29 19:03:55 0
2 1 111 2016-01-08 19:05:50 1
3 100 111 2016-01-08 20:08:50 0
4 25 222 2016-01-08 19:04:50 0
5 8 222 2016-02-08 19:04:50 0
6 25 222 2016-03-08 19:04:50 0
7 9 333 2016-01-08 19:04:50 1
8 20 333 2016-03-08 19:05:53 0
9 9 333 2016-01-08 19:03:20 1
10 9 333 2016-01-08 19:02:15 0
11 100 111 2016-02-08 20:08:50 0

Related

Pandas: Accumulated Shares Holdings per Ticker per Day from a List of Trades

I have a pd.DataFrame (pandas.core.frame.DataFrame) with some stock trades.
data = {'Date': ['2021-01-15', '2021-01-21', '2021-02-28', '2021-01-30', '2021-02-16', '2021-03-22', '2021-01-08', '2021-03-02', '2021-02-25', '2021-04-04', '2021-03-15', '2021-04-08'], 'Ticker': ['MFST', 'AMZN', 'GOOG', 'AAPL','MFST', 'AMZN', 'GOOG', 'AAPL','MFST', 'AMZN', 'GOOG', 'AAPL'], 'Quantity': [2,3,7,2,6,4,-3,8,-2,9,11,1]}
df = pd.DataFrame(data)
Date Ticker Quantity
0 2021-01-15 MFST 2
1 2021-01-21 AMZN 3
2 2021-02-28 GOOG 7
3 2021-01-30 AAPL 2
4 2021-02-16 MFST 6
5 2021-03-22 AMZN 4
6 2021-01-08 GOOG -3
7 2021-03-02 AAPL 8
8 2021-02-25 MFST -2
9 2021-04-04 AMZN 9
10 2021-03-15 GOOG 11
11 2021-04-08 AAPL 1
Quantity refers to the number of shares bought.
I am looking for an efficient way to create a new df which contains the number of shares for each Ticker per day.
The first trade was on 2021-01-08 and the last on 2021-04-08. I want a new dataframe that contains all days between those to dates as rows and the tickers as columns. Values shall be the number of shares I hold at a specific day. Hence, if I buy 4 shares of a stock at 2021-03-15 (assuming no further buying or selling) I will have them from 2021-03-15 till 2021-04-08 which should be represented as a 4 in every row for this specific ticker. If I decide to buy more shares this number will change on that day and all following days.
Could be something like this:
Date MFST AMZN GOOG APPL
2021-01-08 2 3 1 0
2021-01-09 2 3 1 0
2021-01-10 2 3 1 0
...
2021-04-08 2 3 1 7
My first guess was to create an empty DataFrame and then iterate with two for loops over all its Dates and Tickers. However, I think that is not the most efficient way. I am thankful for any recommendation!

You can use df.pivot() to transform your data into a tabular form, as shown on the expected output layout, as follows:
df.pivot(index='Date', columns='Ticker', values='Quantity').rename_axis(columns=None).reset_index().fillna(0, downcast='infer')
If you need to aggregate Quantity for same date for each stock, you can use df.pivot_table() with parameter aggfunc='sum', as follows:
df.pivot_table(index='Date', columns='Ticker', values='Quantity', aggfunc='sum').rename_axis(columns=None).reset_index().fillna(0, downcast='infer')
Result:
Date AAPL AMZN GOOG MFST
0 2021-01-21 0 3 0 0
1 2021-02-28 0 0 1 0
2 2021-03-15 0 0 0 2
3 2021-04-30 7 0 0 0
Additional Test Case:
To showcase the aggregation function of df.pivot_table(), I have added some data as follows:
data = {'Date': ['2021-03-15',
'2021-01-21',
'2021-01-21',
'2021-02-28',
'2021-02-28',
'2021-04-30',
'2021-04-30'],
'Ticker': ['MFST', 'AMZN', 'AMZN', 'GOOG', 'GOOG', 'AAPL', 'AAPL'],
'Quantity': [2, 3, 4, 1, 2, 7, 2]}
df = pd.DataFrame(data)
Date Ticker Quantity
0 2021-03-15 MFST 2
1 2021-01-21 AMZN 3
2 2021-01-21 AMZN 4
3 2021-02-28 GOOG 1
4 2021-02-28 GOOG 2
5 2021-04-30 AAPL 7
6 2021-04-30 AAPL 2
df.pivot_table(index='Date', columns='Ticker', values='Quantity', aggfunc='sum').rename_axis(columns=None).reset_index().fillna(0, downcast='infer')
Date AAPL AMZN GOOG MFST
0 2021-01-21 0 7 0 0
1 2021-02-28 0 0 3 0
2 2021-03-15 0 0 0 2
3 2021-04-30 9 0 0 0
Edit
Based on latest requirement:
The first trade was on 2021-03-15 and the last on 2021-04-30. I want a
new dataframe that contains all days between those to dates as rows
and the tickers as columns. Values shall be the number of shares I
hold at a specific day. Hence, if I buy 4 shares of a stock at
2021-03-15 (assuming no further buying or selling) I will have them
from 2021-03-15 till 2021-04-30 which should be represented as a 4 in
every row for this specific ticker. If I decide to buy more shares
this number will change on that day and all following days.
Here is the enhanced solution:
data = {'Date': ['2021-01-15', '2021-01-21', '2021-02-28', '2021-01-30', '2021-02-16', '2021-03-22', '2021-01-08', '2021-03-02', '2021-02-25', '2021-04-04', '2021-03-15', '2021-04-08'], 'Ticker': ['MFST', 'AMZN', 'GOOG', 'AAPL','MFST', 'AMZN', 'GOOG', 'AAPL','MFST', 'AMZN', 'GOOG', 'AAPL'], 'Quantity': [2,3,7,2,6,4,-3,8,-2,9,11,1]}
df = pd.DataFrame(data)
df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values('Date')
df1 = df.set_index('Date').asfreq('D')
df1['Ticker'] = df1['Ticker'].ffill().bfill()
df1['Quantity'] = df1['Quantity'].fillna(0)
df2 = df1.pivot_table(index='Date', columns='Ticker', values='Quantity', aggfunc='sum').rename_axis(columns=None).reset_index().fillna(0, downcast='infer')
df3 = df2[['Date']].join(df2.iloc[:,1:].cumsum())
Result:
print(df3)
Date AAPL AMZN GOOG MFST
0 2021-01-08 0 0 -3 0
1 2021-01-09 0 0 -3 0
2 2021-01-10 0 0 -3 0
3 2021-01-11 0 0 -3 0
4 2021-01-12 0 0 -3 0
5 2021-01-13 0 0 -3 0
6 2021-01-14 0 0 -3 0
7 2021-01-15 0 0 -3 2
8 2021-01-16 0 0 -3 2
9 2021-01-17 0 0 -3 2
10 2021-01-18 0 0 -3 2
11 2021-01-19 0 0 -3 2
12 2021-01-20 0 0 -3 2
13 2021-01-21 0 3 -3 2
14 2021-01-22 0 3 -3 2
15 2021-01-23 0 3 -3 2
16 2021-01-24 0 3 -3 2
17 2021-01-25 0 3 -3 2
18 2021-01-26 0 3 -3 2
19 2021-01-27 0 3 -3 2
20 2021-01-28 0 3 -3 2
21 2021-01-29 0 3 -3 2
22 2021-01-30 2 3 -3 2
23 2021-01-31 2 3 -3 2
24 2021-02-01 2 3 -3 2
25 2021-02-02 2 3 -3 2
26 2021-02-03 2 3 -3 2
27 2021-02-04 2 3 -3 2
28 2021-02-05 2 3 -3 2
29 2021-02-06 2 3 -3 2
30 2021-02-07 2 3 -3 2
31 2021-02-08 2 3 -3 2
32 2021-02-09 2 3 -3 2
33 2021-02-10 2 3 -3 2
34 2021-02-11 2 3 -3 2
35 2021-02-12 2 3 -3 2
36 2021-02-13 2 3 -3 2
37 2021-02-14 2 3 -3 2
38 2021-02-15 2 3 -3 2
39 2021-02-16 2 3 -3 8
40 2021-02-17 2 3 -3 8
41 2021-02-18 2 3 -3 8
42 2021-02-19 2 3 -3 8
43 2021-02-20 2 3 -3 8
44 2021-02-21 2 3 -3 8
45 2021-02-22 2 3 -3 8
46 2021-02-23 2 3 -3 8
47 2021-02-24 2 3 -3 8
48 2021-02-25 2 3 -3 6
49 2021-02-26 2 3 -3 6
50 2021-02-27 2 3 -3 6
51 2021-02-28 2 3 4 6
52 2021-03-01 2 3 4 6
53 2021-03-02 10 3 4 6
54 2021-03-03 10 3 4 6
55 2021-03-04 10 3 4 6
56 2021-03-05 10 3 4 6
57 2021-03-06 10 3 4 6
58 2021-03-07 10 3 4 6
59 2021-03-08 10 3 4 6
60 2021-03-09 10 3 4 6
61 2021-03-10 10 3 4 6
62 2021-03-11 10 3 4 6
63 2021-03-12 10 3 4 6
64 2021-03-13 10 3 4 6
65 2021-03-14 10 3 4 6
66 2021-03-15 10 3 15 6
67 2021-03-16 10 3 15 6
68 2021-03-17 10 3 15 6
69 2021-03-18 10 3 15 6
70 2021-03-19 10 3 15 6
71 2021-03-20 10 3 15 6
72 2021-03-21 10 3 15 6
73 2021-03-22 10 7 15 6
74 2021-03-23 10 7 15 6
75 2021-03-24 10 7 15 6
76 2021-03-25 10 7 15 6
77 2021-03-26 10 7 15 6
78 2021-03-27 10 7 15 6
79 2021-03-28 10 7 15 6
80 2021-03-29 10 7 15 6
81 2021-03-30 10 7 15 6
82 2021-03-31 10 7 15 6
83 2021-04-01 10 7 15 6
84 2021-04-02 10 7 15 6
85 2021-04-03 10 7 15 6
86 2021-04-04 10 16 15 6
87 2021-04-05 10 16 15 6
88 2021-04-06 10 16 15 6
89 2021-04-07 10 16 15 6
90 2021-04-08 11 16 15 6

Use df.groupby
df.groupby(['Date']).agg('sum')

How to calculate cumulative groupby counts in Pandas with point in time?

I have a df that contains multiple weekly snapshots of JIRA tickets. I want to calculate the YTD counts of tickets.
the df looks like this:
pointInTime ticketId
2008-01-01 111
2008-01-01 222
2008-01-01 333
2008-01-07 444
2008-01-07 555
2008-01-07 666
2008-01-14 777
2008-01-14 888
2008-01-14 999
So if I df.groupby(['pointInTime'])['ticketId'].count() I can get the count of Ids in every snaphsots. But what I want to achieve is calculate the cumulative sum.
and have a df looks like this:
pointInTime ticketId cumCount
2008-01-01 111 3
2008-01-01 222 3
2008-01-01 333 3
2008-01-07 444 6
2008-01-07 555 6
2008-01-07 666 6
2008-01-14 777 9
2008-01-14 888 9
2008-01-14 999 9
so for 2008-01-07 number of ticket would be count of 2008-01-07 + count of 2008-01-01.

Use GroupBy.count and cumsum, then map the result back to "pointInTime":
df['cumCount'] = (
df['pointInTime'].map(df.groupby('pointInTime')['ticketId'].count().cumsum()))
df
pointInTime ticketId cumCount
0 2008-01-01 111 3
1 2008-01-01 222 3
2 2008-01-01 333 3
3 2008-01-07 444 6
4 2008-01-07 555 6
5 2008-01-07 666 6
6 2008-01-14 777 9
7 2008-01-14 888 9
8 2008-01-14 999 9

I am using value_counts
df.pointInTime.map(df.pointInTime.value_counts().sort_index().cumsum())
Out[207]:
0 3
1 3
2 3
3 6
4 6
5 6
6 9
7 9
8 9
Name: pointInTime, dtype: int64
Or
pd.Series(np.arange(len(df))+1,index=df.index).groupby(df['pointInTime']).transform('last')
Out[216]:
0 3
1 3
2 3
3 6
4 6
5 6
6 9
7 9
8 9
dtype: int32

Here's an approach transforming with the size and multiplying by the result of taking pd.factorize on pointInTime:
df['cumCount'] = (df.groupby('pointInTime').ticketId
.transform('size')
.mul(pd.factorize(df.pointInTime)[0]+1))
pointInTime ticketId cumCount
0 2008-01-01 111 3
1 2008-01-01 222 3
2 2008-01-01 333 3
3 2008-01-07 444 6
4 2008-01-07 555 6
5 2008-01-07 666 6
6 2008-01-14 777 9
7 2008-01-14 888 9
8 2008-01-14 999 9

Multiindex after Groupby and apply if only one group

I am trying to set True or False if some rows (grouped by 'trn_crd_no' and 'loc_code') meet a condition (difference between operations is less than 5 minutes).
Everything goes fine if there are more than one group, but failes when there is only one group ['trn_crd_no', 'loc_code']
BBDD_Patron1:
trn_id trn_date loc_code trn_crd_no prd_acc_no
0 1 28/05/2019 10:29 20004 1111 32
1 2 28/05/2019 10:30 20004 1111 434
2 3 28/05/2019 10:35 20004 1111 24
3 4 28/05/2019 10:37 20004 1111 6453
4 5 28/05/2019 10:39 20004 1111 5454
5 6 28/05/2019 10:40 20004 1111 2132
6 7 28/05/2019 10:41 20004 1111 45
7 8 28/05/2019 13:42 20007 2222 867
8 9 28/05/2019 13:47 20007 2222 765
9 19 28/05/2019 13:54 20007 2222 2334
10 11 28/05/2019 13:56 20007 2222 3454
11 12 28/05/2019 14:03 20007 2222 23
12 13 28/05/2019 15:40 20007 2222 534
13 14 28/05/2019 15:45 20007 2222 13
14 15 28/05/2019 17:05 20007 2222 765
15 16 28/05/2019 17:08 20007 2222 87
16 17 28/05/2019 14:07 10003 2222 4526
#Set trn_date is datetime
BBDD_Patron1['trn_date'] = pd.to_datetime(BBDD_Patron1['trn_date'])
aux = BBDD_Patron1.groupby(['trn_crd_no', 'loc_code'], as_index=False).apply(lambda x: x.trn_date.diff().fillna(0).abs() < pd.Timedelta(5))
aux:
0 0 True
1 False
2 False
3 False
4 False
5 False
6 False
1 16 True
2 7 True
8 False
9 False
10 False
11 False
12 False
13 False
14 False
15 False
Create a new DF copy from the fisrt one, and include the new column with Boolean values
BBDD_Patron1_v = BBDD_Patron1.copy()
BBDD_Patron1_v['consec'] = aux.reset_index(level=0, drop=True)
Results as expected.
BBDD_Patron1_v:
trn_id trn_date loc_code trn_crd_no prd_acc_no consec
0 1 2019-05-28 10:29:00 20004 1111 32 True
1 2 2019-05-28 10:30:00 20004 1111 434 False
2 3 2019-05-28 10:35:00 20004 1111 24 False
3 4 2019-05-28 10:37:00 20004 1111 6453 False
4 5 2019-05-28 10:39:00 20004 1111 5454 False
5 6 2019-05-28 10:40:00 20004 1111 2132 False
6 7 2019-05-28 10:41:00 20004 1111 45 False
7 8 2019-05-28 13:42:00 20007 2222 867 True
8 9 2019-05-28 13:47:00 20007 2222 765 False
9 19 2019-05-28 13:54:00 20007 2222 2334 False
10 11 2019-05-28 13:56:00 20007 2222 3454 False
11 12 2019-05-28 14:03:00 20007 2222 23 False
12 13 2019-05-28 15:40:00 20007 2222 534 False
13 14 2019-05-28 15:45:00 20007 2222 13 False
14 15 2019-05-28 17:05:00 20007 2222 765 False
15 16 2019-05-28 17:08:00 20007 2222 87 False
16 17 2019-05-28 14:07:00 10003 2222 4526 True
PROBLEM: If I have only one group after the groupby:
BBDD_2:
trn_id trn_date loc_code trn_crd_no prd_acc_no
0 1 2019-05-28 10:29:00 20004 1111 32
1 2 2019-05-28 10:30:00 20004 1111 434
2 3 2019-05-28 10:35:00 20004 1111 24
3 4 2019-05-28 10:37:00 20004 1111 6453
4 5 2019-05-28 10:39:00 20004 1111 5454
5 6 2019-05-28 10:40:00 20004 1111 2132
6 7 2019-05-28 10:41:00 20004 1111 45
aux2:
trn_date 0 1 2 3 4 5 6
trn_crd_no loc_code
1111 20004 True False False False False False False
Since the strcutrue of aux is different, I get an error with the following line:
BBDD_Patron1_v['consec'] = aux.reset_index(level=0, drop=True)
ValueError: Wrong number of items passed 7, placement implies 1
I am trying also to set squeeze=True, but it also gives different structure, so I cannot copy into BBDD_Patron1 the Boolean values.
aux = BBDD_Patron1.groupby(['trn_crd_no', 'loc_code'], squeeze=True).apply(lambda x: x.trn_date.diff().fillna(0).abs() < pd.Timedelta(5))
Results when more than one group. Aux =
trn_crd_no loc_code
1111 20004 0 True
1 False
2 False
3 False
4 False
5 False
6 False
2222 10003 16 True
20007 7 True
8 False
9 False
10 False
11 False
12 False
13 False
14 False
15 False
Results when only one group. Aux2 =
0 True
1 False
2 False
3 False
4 False
5 False
6 False

Plot the result of a groupby operation in pandas

I have this sample table:
ID Date Days Volume/Day
0 111 2016-01-01 20 50
1 111 2016-02-01 25 40
2 111 2016-03-01 31 35
3 111 2016-04-01 30 30
4 111 2016-05-01 31 25
5 111 2016-06-01 30 20
6 111 2016-07-01 31 20
7 111 2016-08-01 31 15
8 111 2016-09-01 29 15
9 111 2016-10-01 31 10
10 111 2016-11-01 29 5
11 111 2016-12-01 27 0
0 112 2016-01-01 31 55
1 112 2016-02-01 26 45
2 112 2016-03-01 31 40
3 112 2016-04-01 30 35
4 112 2016-04-01 31 30
5 112 2016-05-01 30 25
6 112 2016-06-01 31 25
7 112 2016-07-01 31 20
8 112 2016-08-01 30 20
9 112 2016-09-01 31 15
10 112 2016-11-01 29 10
11 112 2016-12-01 31 0
I'm trying to make my table final table look like this below after grouping by ID and Date.
ID Date CumDays Volume/Day
0 111 2016-01-01 20 50
1 111 2016-02-01 45 40
2 111 2016-03-01 76 35
3 111 2016-04-01 106 30
4 111 2016-05-01 137 25
5 111 2016-06-01 167 20
6 111 2016-07-01 198 20
7 111 2016-08-01 229 15
8 111 2016-09-01 258 15
9 111 2016-10-01 289 10
10 111 2016-11-01 318 5
11 111 2016-12-01 345 0
0 112 2016-01-01 31 55
1 112 2016-02-01 57 45
2 112 2016-03-01 88 40
3 112 2016-04-01 118 35
4 112 2016-05-01 149 30
5 112 2016-06-01 179 25
6 112 2016-07-01 210 25
7 112 2016-08-01 241 20
8 112 2016-09-01 271 20
9 112 2016-10-01 302 15
10 112 2016-11-01 331 10
11 112 2016-12-01 362 0
Next, I want to be able to extract the first value of Volume/Day per ID, all the CumDays values and all the Volume/Day values per ID and Date. So I can use them for further computation and plotting Volume/Day vs CumDays. Example for ID:111, the first value of Volume/Day will be only 50 and ID:112, it will be only 55. All CumDays values for ID:111 will be 20,45... and ID:112, it will be 31,57...For all Volume/Day --- ID:111, will be 50, 40... and ID:112 will be 55,45...
My solution:
def get_time_rate(grp_df):
t = grp_df['Days'].cumsum()
r = grp_df['Volume/Day']
return t,r
vals = df.groupby(['ID','Date']).apply(get_time_rate)
vals
Doing this, the cumulative calculation doesn't take effect at all. It returns the original Days value. This didn't allow me move further in extracting the first value of Volume/Day, all the CumDays values and all the Volume/Day values I need. Any advice or help on how to go about it will be appreciated. Thanks

Get a groupby object.
g = df.groupby('ID')
Compute columns with transform:
df['CumDays'] = g.Days.transform('cumsum')
df['First Volume/Day'] = g['Volume/Day'].transform('first')
df
ID Date Days Volume/Day CumDays First Volume/Day
0 111 2016-01-01 20 50 20 50
1 111 2016-02-01 25 40 45 50
2 111 2016-03-01 31 35 76 50
3 111 2016-04-01 30 30 106 50
4 111 2016-05-01 31 25 137 50
5 111 2016-06-01 30 20 167 50
6 111 2016-07-01 31 20 198 50
7 111 2016-08-01 31 15 229 50
8 111 2016-09-01 29 15 258 50
9 111 2016-10-01 31 10 289 50
10 111 2016-11-01 29 5 318 50
11 111 2016-12-01 27 0 345 50
0 112 2016-01-01 31 55 31 55
1 112 2016-01-02 26 45 57 55
2 112 2016-01-03 31 40 88 55
3 112 2016-01-04 30 35 118 55
4 112 2016-01-05 31 30 149 55
5 112 2016-01-06 30 25 179 55
6 112 2016-01-07 31 25 210 55
7 112 2016-01-08 31 20 241 55
8 112 2016-01-09 30 20 271 55
9 112 2016-01-10 31 15 302 55
10 112 2016-01-11 29 10 331 55
11 112 2016-01-12 31 0 362 55
If you want grouped plots, you can iterate over each groups after grouping by ID. To plot, first set index and call plot.
fig, ax = plt.subplots(figsize=(8,6))
for i, g in df2.groupby('ID'):
g.plot(x='CumDays', y='Volume/Day', ax=ax, label=str(i))
plt.show()

Pandas - sort and head inside groupby

I have following dataframe:
uniq_id value
2016-12-26 11:03:10 001 342
2016-12-26 11:03:13 004 5
2016-12-26 12:03:13 005 14
2016-12-26 12:03:13 008 114
2016-12-27 11:03:10 009 343
2016-12-27 11:03:13 013 5
2016-12-27 12:03:13 016 124
2016-12-27 12:03:13 018 114
And i need get top N records for each day sorted by value.
Something like this (for N=2):
2016-12-26 001 342
008 114
2016-12-27 009 343
016 124
Please suggest right way to do that in pandas 0.19.x

Unfortunately there is no yet such method as DataFrameGroupBy.nlargest(), which would allow us to do the following:
df.groupby(...).nlargest(2, columns=['value'])
So here is a bit ugly, but working solution:
In [73]: df.set_index(df.index.normalize()).reset_index().sort_values(['index','value'], ascending=[1,0]).groupby('index').head(2)
Out[73]:
index uniq_id value
0 2016-12-26 1 342
3 2016-12-26 8 114
4 2016-12-27 9 343
6 2016-12-27 16 124
PS i think there must be a better one...
UPDATE: if your DF wouldn't have duplicated index values, the following solution should work as well:
In [117]: df
Out[117]:
uniq_id value
2016-12-26 11:03:10 1 342
2016-12-26 11:03:13 4 5
2016-12-26 12:03:13 5 14
2016-12-26 12:33:13 8 114 # <-- i've intentionally changed this index value
2016-12-27 11:03:10 9 343
2016-12-27 11:03:13 13 5
2016-12-27 12:03:13 16 124
2016-12-27 12:33:13 18 114 # <-- i've intentionally changed this index value
In [118]: df.groupby(pd.TimeGrouper('D')).apply(lambda x: x.nlargest(2, 'value')).reset_index(level=1, drop=1)
Out[118]:
uniq_id value
2016-12-26 1 342
2016-12-26 8 114
2016-12-27 9 343
2016-12-27 16 124

df.set_index('uniq_id', append=True) \
.groupby(df.index.date).value.nlargest(2) \
.rename_axis([None, None, 'uniq_id']).reset_index(-1)
uniq_id value
2016-12-26 2016-12-26 11:03:10 1 342
2016-12-26 12:03:13 8 114
2016-12-27 2016-12-27 11:03:10 9 343
2016-12-27 12:03:13 16 124

A solution that is easier to remember might be:
df.sort_values(by='value').groupby('date').head(2)
This will give for each date the two rows with the highest value in value column.
In the example from the OT, one would need to set df['date'] = df.index before, because the column used for grouping happens to be the index.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Mark duplicates based on time difference between successive rows - python

Related

Pandas: Accumulated Shares Holdings per Ticker per Day from a List of Trades

How to calculate cumulative groupby counts in Pandas with point in time?

Multiindex after Groupby and apply if only one group

Plot the result of a groupby operation in pandas

Pandas - sort and head inside groupby

Categories

Resources