How to "groupby" year, column1 and calculate the average on column 2?

How to "groupby" year, column1 and calculate the average on column 2? - python

I have this DataFrame:
year vehicule number_of_passengers
2017-01-09 bus 100
2017-11-02 car 150
2018-08-01 car 180
2016-08-09 bus 100
...
I would like to have something like this (the average number of passengers per year and per vehicule) :
year vehicule avg_number_of_passengers
2018 car 123.5
2018 bus 213.7
2017 ... ...
...
I've tried with some groupby() but can't find the good command. Can you help me ?

Related

Merging two pandas dataframes based on a DateTime column [duplicate]

I have two dataframes, one with news and the other with stock price. Both the dataframes have a "Date" column. I want to merge them on a gap of 5 days.
Lets say my news dataframe is df1 and the other price dataframe as df2.
My df1 looks like this:
News_Dates News
2018-09-29 Huge blow to ABC Corp. as they lost the 2012 tax case
2018-09-30 ABC Corp. suffers a loss
2018-10-01 ABC Corp to Sell stakes
2018-12-20 We are going to comeback strong said ABC CEO
2018-12-22 Shares are down massively for ABC Corp.
My df2 looks like this:
Dates Price
2018-10-04 120
2018-12-24 131
First method of merging I do is:
pd.merge_asof(df1_zscore.sort_values(by=['Dates']), df_n.sort_values(by=['News_Dates']), left_on=['Dates'], right_on=['News_Dates'] \
tolerance=pd.Timedelta('5d'), direction='backward')
The resulting df is:
Dates News_Dates News Price
2018-10-04 2018-10-01 ABC Corp to Sell stakes 120
2018-12-24 2018-12-22 Shares are down massively for ABC Corp. 131
The second way of merging I do is:
pd.merge_asof(df_n.sort_values(by=['Dates']), df1_zscore.sort_values(by=['Dates']), left_on=['News_Dates'], right_no=['Dates'] \
tolerance=pd.Timedelta('5d'), direction='forward').dropna()
And the resulting df as:
News_Dates News Dates Price
2018-09-29 Huge blow to ABC Corp. as they lost the 2012 tax case 2018-10-04 120
2018-09-30 ABC Corp. suffers a loss 2018-10-04 120
2018-10-01 ABC Corp to Sell stakes 2018-10-04 120
2018-12-22 Shares are down massively for ABC Corp. 2018-12-24 131
Both the merging results in separate dfs, however there are values in both the cases which are missing, like for second case for 4th October price, news from 29th, 30th Sept should have been merged. And in case 2 for 24th December price 20th December should also have been merged.
So I'm not quite able to figure out where am I going wrong.
P.S. My objective is to merge the price df with the news df that have come in the last 5 days from the price date.

You can swap the left and right dataframe:
df = pd.merge_asof(
df1,
df2,
left_on='News_Dates',
right_on='Dates',
tolerance=pd.Timedelta('5D'),
direction='nearest'
)
df = df[['Dates', 'News_Dates', 'News', 'Price']]
print(df)
Dates News_Dates News Price
0 2018-10-04 2018-09-29 Huge blow to ABC Corp. as they lost the 2012 t... 120
1 2018-10-04 2018-09-30 ABC Corp. suffers a loss 120
2 2018-10-04 2018-10-01 ABC Corp to Sell stakes 120
3 2018-12-24 2018-12-20 We are going to comeback strong said ABC CEO 131
4 2018-12-24 2018-12-22 Shares are down massively for ABC Corp. 131

Here is my solution using numpy
df_n = pd.DataFrame([('2018-09-29', 'Huge blow to ABC Corp. as they lost the 2012 tax case'), ('2018-09-30', 'ABC Corp. suffers a loss'), ('2018-10-01', 'ABC Corp to Sell stakes'), ('2018-12-20', 'We are going to comeback strong said ABC CEO'), ('2018-12-22', 'Shares are down massively for ABC Corp.')], columns=('News_Dates', 'News'))
df1_zscore = pd.DataFrame([('2018-10-04', '120'), ('2018-12-24', '131')], columns=('Dates', 'Price'))
df_n["News_Dates"] = pd.to_datetime(df_n["News_Dates"])
df1_zscore["Dates"] = pd.to_datetime(df1_zscore["Dates"])
n_dates = df_n["News_Dates"].values
p_dates = df1_zscore[["Dates"]].values
## substract each pair of n_dates and p_dates and create a matrix
mat_date_compare = (p_dates - n_dates).astype('timedelta64[D]')
## get matrix of boolean for which difference is between 0 and 5 day
## to be used as index for original array
comparision = (mat_date_compare <= pd.Timedelta("5d")) & (mat_date_compare >= pd.Timedelta("0d"))
## get cell numbers which is in range 0 to matrix size which meets the condition
ind = np.arange(len(n_dates)*len(p_dates))[comparision.ravel()]
## calculate row and column index from cell number to index the df
pd.concat([df1_zscore.iloc[ind//len(n_dates)].reset_index(drop=True),
df_n.iloc[ind%len(n_dates)].reset_index(drop=True)], sort=False, axis=1)
Result
Dates Price News_Dates News
0 2018-10-04 120 2018-09-29 Huge blow to ABC Corp. as they lost the 2012 t...
1 2018-10-04 120 2018-09-30 ABC Corp. suffers a loss
2 2018-10-04 120 2018-10-01 ABC Corp to Sell stakes
3 2018-12-24 131 2018-12-20 We are going to comeback strong said ABC CEO
4 2018-12-24 131 2018-12-22 Shares are down massively for ABC Corp.

How do I Group By Date and Measure fields to calculate rank?

I have a data set with Student Names, the date of transaction and the amount.
Each student has made multiple transactions.
I want to calculate current month rank and previous month rank based on total amount for each student.
I am able to do a group by Student Name to calculate the total amount for each student using:
transactions['Totals'] = transactions.groupby('Student Name')['Sale Amount'].transform('sum')
How do I extend this to make two different columns that calculate previous month totals and current month totals for each student, so I can assign previous month and current month ranks to them?
The date is in the following format:
09/05/2015 04:18 PM
07/15/2019 09:50 AM
05/18/2018 02:34 PM
08/11/2018 06:29 PM
06/14/2018 07:42 AM
EDIT : Adding dataframe for reference:
Out[15]:
Date of Transaction Student Name Sale Amount
0 09/05/2015 04:18 PM Dan Kelly 4333
1 07/15/2019 09:50 AM Peter Dyer 8805
2 05/18/2018 02:34 PM Natalie Robertson 5640
3 08/11/2018 06:29 PM Sean Miller 6485
4 06/14/2018 07:42 AM Thomas Forsyth 6815
... ... ...
9977 03/15/2018 09:28 PM Grace Vance 6379
9978 08/07/2019 11:14 PM Alexandra Cameron 6688
9979 01/09/2015 10:53 AM Sebastian Vaughan 2262
9980 05/19/2019 10:00 PM Caroline Blake 6977
9981 01/11/2016 04:05 AM Austin Edmunds 3205
[9982 rows x 3 columns]
EDIT : Adding sample expected output:

I've created a dataframe with the minimal data you informed: 'Student Name', 'Sale Amount', 'Date'
My dataframe:
df = pd.DataFrame([['12/05/2019 04:18 PM','Marisa',500],
['11/29/2019 04:18 PM','Marisa',500],
['11/20/2019 04:18 PM','Marisa',800],
['12/04/2019 04:18 PM','Peter',300],
['11/30/2019 04:18 PM','Peter',300],
['12/05/2019 04:18 PM','Debra',400],
['11/28/2019 04:18 PM','Debra',200],
['11/15/2019 04:18 PM','Debra',600],
['10/23/2019 04:18 PM','Debra',200]],columns=['Date','Student Name','Sale Amount']
)
Be sure date is a datetime column.
df.Date = pd.to_datetime(df.Date)
This gives you the total amount per month per student in the original dataframe:
df['Total'] = df.groupby(['Student Name',pd.Grouper(key='Date', freq='1M')])['Sale Amount'].transform('sum')
Date Student Name Sale Amount Total
0 2019-12-05 16:18:00 Marisa 500 500
1 2019-11-29 16:18:00 Marisa 500 1300
2 2019-11-20 16:18:00 Marisa 800 1300
3 2019-12-04 16:18:00 Peter 300 300
4 2019-11-30 16:18:00 Peter 300 300
5 2019-12-05 16:18:00 Debra 400 400
6 2019-11-28 16:18:00 Debra 200 800
7 2019-11-15 16:18:00 Debra 600 800
8 2019-10-23 16:18:00 Debra 200 200
How to print only the selected results?
df is dnew now:
dnew = df
Let's strip datetime to keep months only:
#Strip date to month
dnew['Date'] = dnew['Date'].apply(lambda x:x.date().strftime('%m'))
Drop Sale Amount entries and group by Student Name and Date (new dataframe is "sales"):
#Drop Sale Amount
sales = dnew.drop(['Sale Amount'], axis=1).groupby(['Student Name','Date'])['Total'].max()
print(sales)
Student Name Date
Debra 10 200
11 800
12 400
Marisa 11 1300
12 500
Peter 11 300
12 300
Actually, "sales" is pandas.core.series.Series and it's important to know that
print(sales.index)
MultiIndex([( 'Debra', '10'),
( 'Debra', '11'),
( 'Debra', '12'),
('Marisa', '11'),
('Marisa', '12'),
( 'Peter', '11'),
( 'Peter', '12')],
names=['Student Name', 'Date'])
from datetime import datetime
curMonth = int(datetime.today().strftime('%m')) #transform to integer to perform (curMonth-1)
#12
#months of interest
moi = sales.iloc[(sales.index.get_level_values('Date') == str(curMonth-1)) | (sales.index.get_level_values('Date') == str(curMonth))]
print(moi)
Student Name Date
Debra 11 800
12 400
Marisa 11 1300
12 500
Peter 11 300
12 300

pd.merge_asof() based on Time-Difference not merging all values - Pandas

I have two dataframes, one with news and the other with stock price. Both the dataframes have a "Date" column. I want to merge them on a gap of 5 days.
Lets say my news dataframe is df1 and the other price dataframe as df2.
My df1 looks like this:
News_Dates News
2018-09-29 Huge blow to ABC Corp. as they lost the 2012 tax case
2018-09-30 ABC Corp. suffers a loss
2018-10-01 ABC Corp to Sell stakes
2018-12-20 We are going to comeback strong said ABC CEO
2018-12-22 Shares are down massively for ABC Corp.
My df2 looks like this:
Dates Price
2018-10-04 120
2018-12-24 131
First method of merging I do is:
pd.merge_asof(df1_zscore.sort_values(by=['Dates']), df_n.sort_values(by=['News_Dates']), left_on=['Dates'], right_on=['News_Dates'] \
tolerance=pd.Timedelta('5d'), direction='backward')
The resulting df is:
Dates News_Dates News Price
2018-10-04 2018-10-01 ABC Corp to Sell stakes 120
2018-12-24 2018-12-22 Shares are down massively for ABC Corp. 131
The second way of merging I do is:
pd.merge_asof(df_n.sort_values(by=['Dates']), df1_zscore.sort_values(by=['Dates']), left_on=['News_Dates'], right_no=['Dates'] \
tolerance=pd.Timedelta('5d'), direction='forward').dropna()
And the resulting df as:
News_Dates News Dates Price
2018-09-29 Huge blow to ABC Corp. as they lost the 2012 tax case 2018-10-04 120
2018-09-30 ABC Corp. suffers a loss 2018-10-04 120
2018-10-01 ABC Corp to Sell stakes 2018-10-04 120
2018-12-22 Shares are down massively for ABC Corp. 2018-12-24 131
Both the merging results in separate dfs, however there are values in both the cases which are missing, like for second case for 4th October price, news from 29th, 30th Sept should have been merged. And in case 2 for 24th December price 20th December should also have been merged.
So I'm not quite able to figure out where am I going wrong.
P.S. My objective is to merge the price df with the news df that have come in the last 5 days from the price date.

You can swap the left and right dataframe:
df = pd.merge_asof(
df1,
df2,
left_on='News_Dates',
right_on='Dates',
tolerance=pd.Timedelta('5D'),
direction='nearest'
)
df = df[['Dates', 'News_Dates', 'News', 'Price']]
print(df)
Dates News_Dates News Price
0 2018-10-04 2018-09-29 Huge blow to ABC Corp. as they lost the 2012 t... 120
1 2018-10-04 2018-09-30 ABC Corp. suffers a loss 120
2 2018-10-04 2018-10-01 ABC Corp to Sell stakes 120
3 2018-12-24 2018-12-20 We are going to comeback strong said ABC CEO 131
4 2018-12-24 2018-12-22 Shares are down massively for ABC Corp. 131

Here is my solution using numpy
df_n = pd.DataFrame([('2018-09-29', 'Huge blow to ABC Corp. as they lost the 2012 tax case'), ('2018-09-30', 'ABC Corp. suffers a loss'), ('2018-10-01', 'ABC Corp to Sell stakes'), ('2018-12-20', 'We are going to comeback strong said ABC CEO'), ('2018-12-22', 'Shares are down massively for ABC Corp.')], columns=('News_Dates', 'News'))
df1_zscore = pd.DataFrame([('2018-10-04', '120'), ('2018-12-24', '131')], columns=('Dates', 'Price'))
df_n["News_Dates"] = pd.to_datetime(df_n["News_Dates"])
df1_zscore["Dates"] = pd.to_datetime(df1_zscore["Dates"])
n_dates = df_n["News_Dates"].values
p_dates = df1_zscore[["Dates"]].values
## substract each pair of n_dates and p_dates and create a matrix
mat_date_compare = (p_dates - n_dates).astype('timedelta64[D]')
## get matrix of boolean for which difference is between 0 and 5 day
## to be used as index for original array
comparision = (mat_date_compare <= pd.Timedelta("5d")) & (mat_date_compare >= pd.Timedelta("0d"))
## get cell numbers which is in range 0 to matrix size which meets the condition
ind = np.arange(len(n_dates)*len(p_dates))[comparision.ravel()]
## calculate row and column index from cell number to index the df
pd.concat([df1_zscore.iloc[ind//len(n_dates)].reset_index(drop=True),
df_n.iloc[ind%len(n_dates)].reset_index(drop=True)], sort=False, axis=1)
Result
Dates Price News_Dates News
0 2018-10-04 120 2018-09-29 Huge blow to ABC Corp. as they lost the 2012 t...
1 2018-10-04 120 2018-09-30 ABC Corp. suffers a loss
2 2018-10-04 120 2018-10-01 ABC Corp to Sell stakes
3 2018-12-24 131 2018-12-20 We are going to comeback strong said ABC CEO
4 2018-12-24 131 2018-12-22 Shares are down massively for ABC Corp.

Subtract dates across DataFrames

I just start learning Python and R, so any advice using either of them would be much appreciated.
My data are stored in two dataframes. One is sales data, for each consumer, we can see the date when he purchases something. It is possible the same person purchases more than once:
Date Person ID Product
01-05-2012 1 cereal
01-05-2012 2 apple
02-08-2012 3 beef
03-22-2013 72 pot
07-19-2012 1 cake
The second dataframe has membership data, which tell us when did a person enrolled in the program:
Date Person ID Type Status
06-11-2008 1 Gold New
10-12-2011 2 Gold New
02-08-2011 3 Silver Renewal
02-01-2012 72 Gold Renewal
03-22-2012 1 Gold Renewal
What I want to do is, for the same person, how long does it take before a person purchases something before he enrolls in a program.
For example, person 1 got a new membership on 06-11-2008 and purchased cereal on 01-05-2012. I would like to calculate how many days there are between these two dates.
However, these information are stored in separate dataframes. I don't think they can be append or merged into one dataframe, because one person can have more than one observations in one or both of the dataframes.
What I am thinking is, extract all the dates from sales data, and extract all the dates from the license data. Then merge these two new dataframes into a new dataframe. This will give me:
License Date Person ID Sales Date
06-11-2008 1 01-05-2012
10-12-2011 2 01-05-2012
02-08-2011 3 02-08-2011
02-01-2012 72 03-22-2013
06-11-2008 1 07-19-2012
03-22-2012 1 01-05-2012
03-22-2012 1 07-19-2012
But the problem here is, if a person has two license dates (ex. one new and one renewal), then merge the data will give 2*(sales dates)... but I only want the sales dates for a license that is valid..
For example, person 1 used license 06-11-2008 to buy cereal on 01-05-2012, and used license 03-22-2012 to buy on 07-19-2012. But merging the dataframes will give me 4 records rather than the 2 I want...
The result I would want is the time to purchase for each sale, after he gets the license which he used for that purchase:
License Date Person ID Sales Date TimeToPurchase
06-11-2008 1 01-05-2012 ...
10-12-2011 2 01-05-2012 ...
02-08-2011 3 02-08-2011 ...
02-01-2012 72 03-22-2013 ...
03-22-2012 1 07-19-2012 ...
Is there a better way you suggest I can do?
Thank you very much for the help!

pandas
First your dates need to be saved as datetime, which you can accomplish like this:
sales['Date'] = pd.to_datetime(sales['Date'])
memberships['Date'] = pd.to_datetime(memberships['Date'])
Then you merge them by Person ID and arrive to the format that has duplicates.
m = sales.merge(memberships, left_on='Person ID', right_on='Person ID',
suffixes=('_sales', '_memberships'))
m
Date_sales Person ID Product Date_memberships Type Status
0 2012-01-05 1 cereal 2008-06-11 Gold New
1 2012-01-05 1 cereal 2012-03-22 Gold Renewal
2 2012-07-19 1 cake 2008-06-11 Gold New
3 2012-07-19 1 cake 2012-03-22 Gold Renewal
4 2012-01-05 2 apple 2011-10-12 Gold New
5 2012-02-08 3 beef 2011-02-08 Silver Renewal
6 2013-03-22 72 pot 2012-02-01 Gold Renewal
Now you can calculate the elapsed days between the sales and the membership dates like this:
m['TimeToPurchase'] = (m['Date_sales'] - m['Date_memberships']).dt.days
m
Date_sales Person ID Product Date_memberships Type Status TimeToPurchase
0 2012-01-05 1 cereal 2008-06-11 Gold New 1303
1 2012-01-05 1 cereal 2012-03-22 Gold Renewal -77
2 2012-07-19 1 cake 2008-06-11 Gold New 1499
3 2012-07-19 1 cake 2012-03-22 Gold Renewal 119
4 2012-01-05 2 apple 2011-10-12 Gold New 85
5 2012-02-08 3 beef 2011-02-08 Silver Renewal 365
6 2013-03-22 72 pot 2012-02-01 Gold Renewal 415
From here you can first eliminate the negatives and then get the minimum TimeToPurchase for each Person ID and Date sales.
m = m[m['TimeToPurchase'] >= 0]
keep = m.groupby(['Person ID', 'Date_sales'], as_index=False)['TimeToPurchase'].min()
keep
Person ID Date_sales TimeToPurchase
1 2012-01-05 1303
1 2012-07-19 119
2 2012-01-05 85
3 2012-02-08 365
72 2013-03-22 415
These are the records that you want to keep in your merged table, which you can filter with an inner join:
result = m.merge(keep,
left_on=['Person ID', 'Date_sales', 'TimeToPurchase'],
right_on=['Person ID', 'Date_sales', 'TimeToPurchase'])
result
Date_sales Person ID Product Date_memberships Type Status TimeToPurchase
2012-01-05 1 cereal 2008-06-11 Gold New 1303
2012-07-19 1 cake 2012-03-22 Gold Renewal 119
2012-01-05 2 apple 2011-10-12 Gold New 85
2012-02-08 3 beef 2011-02-08 Silver Renewal 365
2013-03-22 72 pot 2012-02-01 Gold Renewal 415
data.table
Same logic as above, so I'll just paste the code.
# Date types
sales[, Date := as.Date(Date, format = "%m-%d-%Y")]
memberships[, Date := as.Date(Date, format = "%m-%d-%Y")]
# Merge
m <- memberships[sales, on = "Person ID"]
# Calculate elapsed days
m[, TimeToPurchase := as.numeric(m$i.Date - m$Date)]
# Eliminate negatives
m <- m[TimeToPurchase >= 0]
# Calculate records to keep
keep <- m[, .(TimeToPurchase = min(TimeToPurchase)), by = .(`Person ID`, i.Date)]
# Filter result
result <- m[keep, on = c("Person ID", "i.Date", "TimeToPurchase")]
result
Date Person ID Type Status i.Date Product TimeToPurchase
1: 2008-06-11 1 Gold New 2012-01-05 cereal 1303
2: 2011-10-12 2 Gold New 2012-01-05 apple 85
3: 2011-02-08 3 Silver Renewal 2012-02-08 beef 365
4: 2012-02-01 72 Gold Renewal 2013-03-22 pot 415
5: 2012-03-22 1 Gold Renewal 2012-07-19 cake 119

Here is a solution using R and library(data.table) assuming you are only interested in the latest time to purchase:
Edit: after question was updated
library(data.table)
purchaseDT <- data.table(stringsAsFactors=FALSE,
Date = c("01-05-2009", "01-05-2012", "02-08-2012", "03-22-2013"),
PersonID = c(1, 2, 1, 72),
Product = c("cereal", "apple", "beef", "pot")
)
programDT <- data.table(stringsAsFactors=FALSE,
Date = c("06-11-2008", "10-12-2011", "02-08-2011", "02-01-2012"),
PersonID = c(1, 2, 1, 72),
Type = c("Gold", "Gold", "Silver", "Gold"),
Status = c("New", "New", "Renewal", "Renewal")
)
purchaseDT[, PurchaseDate := as.Date(Date, format="%m-%d-%Y")]
programDT[, LicenseDate := as.Date(Date, format="%m-%d-%Y")]
purchaseDT[, Date := NULL]
programDT[, Date := NULL]
mergedDT <- purchaseDT[programDT, on="PersonID"]
mergedDT[, TimeToPurchase := PurchaseDate-LicenseDate]
mergedDT <- mergedDT[TimeToPurchase > 0]
resultDT <- mergedDT[, .(TimeToPurchase = min(TimeToPurchase)), by = c("LicenseDate", "PersonID")]
resultDT[, PurchaseDate := LicenseDate+TimeToPurchase]
print(resultDT)
Result:
LicenseDate PersonID TimeToPurchase PurchaseDate
1: 2008-06-11 1 208 days 2009-01-05
2: 2011-10-12 2 85 days 2012-01-05
3: 2011-02-08 1 365 days 2012-02-08
4: 2012-02-01 72 415 days 2013-03-22

Here is one idea for you. First, I merged the two data sets using Person_ID and Date. In this example, I needed to create a date object (i.e., Date) in the first mutate(). I sorted the data by Person_ID and Date. Then, I created a new grouping variable. What I did was to identify rows where Status is either "New" or "Renewal". This means that I identified when a license became valid for the first time. That row becomes the first row for each license. For each group, I chose the first two rows. The data is arranged by Person_ID and Date, so the 2nd row should be the one that a customer bought something with the valid license for the first time. Finally, I calculated the interval (i.e., time2purchase) using Date.
full_join(df1, df2, by = c("Person_ID", "Date")) %>%
mutate(Date = as.Date(Date, format = "%m-%d-%Y")) %>%
arrange(Person_ID, Date) %>%
mutate(group = findInterval(x = 1:n(), vec = grep(Status, pattern = "New|Renewal"))) %>%
group_by(group) %>%
slice(1:2) %>%
summarize(time2purchase = Date[2]-Date[1])
group time2purchase
<int> <time>
1 1 1303 days
2 2 119 days
3 3 85 days
4 4 365 days
5 5 415 days
To make things clearer, I leave the results below, which you can generate
using mutate() instead of summarize().
Date Person_ID Product Type Status group time2purchase
<date> <int> <chr> <chr> <chr> <int> <time>
1 2008-06-11 1 NA Gold New 1 1303 days
2 2012-03-22 1 NA Gold Renewal 2 119 days
3 2011-10-12 2 NA Gold New 3 85 days
4 2011-02-08 3 NA Silver Renewal 4 365 days
5 2012-02-01 72 NA Gold Renewal 5 415 days
DATA
df1 <-structure(list(Date = c("01-05-2012", "01-05-2012", "02-08-2012",
"03-22-2013", "07-19-2012"), Person_ID = c(1L, 2L, 3L, 72L, 1L
), Product = c("cereal", "apple", "beef", "pot", "cake")), class = "data.frame",
row.names = c(NA,
-5L))
df2 <- structure(list(Date = c("06-11-2008", "10-12-2011", "02-08-2011",
"02-01-2012", "03-22-2012"), Person_ID = c(1L, 2L, 3L, 72L, 1L
), Type = c("Gold", "Gold", "Silver", "Gold", "Gold"), Status = c("New",
"New", "Renewal", "Renewal", "Renewal")), class = "data.frame", row.names = c(NA,
-5L))

Combine two pandas DataFrames where the date fields are within two months of each other

I need to combine 2 pandas dataframes where df1.date is within 2 months previous of df2. I then want to calculate how many traders had traded the same stock during that period and count the total shares purchased.
I have tried using the approach listed below, but found it far to complicated. I believe there would be a smarter/simpler solution.
Pandas: how to merge two dataframes on offset dates?
A sample dataset is below:
DF1 (team_1):
date shares symbol trader
31/12/2013 154 FDX Max
30/06/2016 2367 GOOGL Max
21/07/2015 293 ORCL Max
18/07/2015 304 ORCL Sam
DF2 (team_2):
date shares symbol trader
23/08/2015 345 ORCL John
04/07/2014 567 FB John
06/12/2013 221 ACER Sally
31/11/2012 889 HP John
05/06/2010 445 ABBV Kate
Required output:
date shares symbol trader team_2_traders team_2_shares_bought
23/08/2015 345 ORCL John 2 597
04/07/2014 567 FB John 0 0
06/12/2013 221 ACER Sally 0 0
31/11/2012 889 HP John 0 0
05/06/2010 445 ABBV Kate 0 0
This adds 2 new columns...
'team_2_traders' = count of how many traders from team_1 traded the same stock during the previous 2 months from the date listed on DF2.
'team_2_shares_bought' = count of the total shares purchased by team_1 during the previous 2 months from the date listed on DF2.
If anyone is willing to give this a crack, please use the snippet below to setup the dataframes. Please keep in mind the actual dataset contains millions of rows and 6,000 company stocks.
team_1 = {'symbol':['FDX','GOOGL','ORCL','ORCL'],
'date':['31/12/2013','30/06/2016','21/07/2015','18/07/2015'],
'shares':[154,2367,293,304],
'trader':['Max','Max','Max','Sam']}
df1 = pd.DataFrame(team_1)
team_2 = {'symbol':['ORCL','FB','ACER','HP','ABBV'],
'date':['23/08/2015','04/07/2014','06/12/2013','31/11/2012','05/06/2010'],
'shares':[345,567,221,889,445],
'trader':['John','John','Sally','John','Kate']}
df2 = pd.DataFrame(team_2)
Appreciate the help - thank you.

Please check my solution.
from pandas.tseries.offsets import MonthEnd
df_ = df2.merge(df1, on=['symbol'])
df_['date_x'] = pd.to_datetime(df_['date_x'])
df_['date_y'] = pd.to_datetime(df_['date_y'])
df_2m = df_[df_['date_x'] < df_['date_y'] + MonthEnd(2)] \
.loc[:, ['date_y', 'shares_y', 'symbol', 'trader_y']] \
.groupby('symbol')
df1_ = pd.concat([df_2m['shares_y'].sum(), df_2m['trader_y'].count()], axis=1)
print(df1_)
shares_y trader_y
symbol
ORCL 597 2
print(df2.merge(df1_.reset_index(), on='symbol', how='left').fillna(0))
date shares symbol trader shares_y trader_y
0 23/08/2015 345 ORCL John 597.0 2.0
1 04/07/2014 567 FB John 0.0 0.0
2 06/12/2013 221 ACER Sally 0.0 0.0
3 30/11/2012 889 HP John 0.0 0.0
4 05/06/2010 445 ABBV Kate 0.0 0.0

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.