Subtract dates across DataFrames - python

I just started learning Python and R, so any advice using either of them would be much appreciated.
My data are stored in two dataframes. One holds sales data: for each consumer, we can see the date of each purchase. The same person may purchase more than once:
Date Person ID Product
01-05-2012 1 cereal
01-05-2012 2 apple
02-08-2012 3 beef
03-22-2013 72 pot
07-19-2012 1 cake
The second dataframe holds membership data, which tells us when a person enrolled in the program:
Date Person ID Type Status
06-11-2008 1 Gold New
10-12-2011 2 Gold New
02-08-2011 3 Silver Renewal
02-01-2012 72 Gold Renewal
03-22-2012 1 Gold Renewal
What I want to do is, for the same person, work out how long it takes from enrolling in the program to purchasing something.
For example, person 1 got a new membership on 06-11-2008 and purchased cereal on 01-05-2012. I would like to calculate how many days there are between these two dates.
However, this information is stored in separate dataframes. I don't think they can simply be appended or merged into one dataframe, because one person can have more than one observation in one or both of them.
What I am thinking is: extract all the dates from the sales data, extract all the dates from the license data, and then merge these two new dataframes into one. This would give me:
License Date Person ID Sales Date
06-11-2008 1 01-05-2012
10-12-2011 2 01-05-2012
02-08-2011 3 02-08-2012
02-01-2012 72 03-22-2013
06-11-2008 1 07-19-2012
03-22-2012 1 01-05-2012
03-22-2012 1 07-19-2012
But the problem here is, if a person has two license dates (e.g. one new and one renewal), then merging the data will give 2 × (sales dates)... but I only want each sales date matched to the license that was valid at the time.
For example, person 1 used the 06-11-2008 license to buy cereal on 01-05-2012, and used the 03-22-2012 license to buy cake on 07-19-2012. But merging the dataframes will give me 4 records rather than the 2 I want...
The result I would want is the time to purchase for each sale, measured from the license the person actually used for that purchase:
License Date Person ID Sales Date TimeToPurchase
06-11-2008 1 01-05-2012 ...
10-12-2011 2 01-05-2012 ...
02-08-2011 3 02-08-2012 ...
02-01-2012 72 03-22-2013 ...
03-22-2012 1 07-19-2012 ...
Is there a better way you would suggest?
Thank you very much for the help!

pandas
First your dates need to be converted to datetime, which you can accomplish like this (passing format makes the MM-DD-YYYY layout explicit rather than relying on inference):
sales['Date'] = pd.to_datetime(sales['Date'], format='%m-%d-%Y')
memberships['Date'] = pd.to_datetime(memberships['Date'], format='%m-%d-%Y')
Then you merge them on Person ID and arrive at the format that has duplicates:
m = sales.merge(memberships, on='Person ID',
                suffixes=('_sales', '_memberships'))
m
Date_sales Person ID Product Date_memberships Type Status
0 2012-01-05 1 cereal 2008-06-11 Gold New
1 2012-01-05 1 cereal 2012-03-22 Gold Renewal
2 2012-07-19 1 cake 2008-06-11 Gold New
3 2012-07-19 1 cake 2012-03-22 Gold Renewal
4 2012-01-05 2 apple 2011-10-12 Gold New
5 2012-02-08 3 beef 2011-02-08 Silver Renewal
6 2013-03-22 72 pot 2012-02-01 Gold Renewal
Now you can calculate the elapsed days between the sales and the membership dates like this:
m['TimeToPurchase'] = (m['Date_sales'] - m['Date_memberships']).dt.days
m
Date_sales Person ID Product Date_memberships Type Status TimeToPurchase
0 2012-01-05 1 cereal 2008-06-11 Gold New 1303
1 2012-01-05 1 cereal 2012-03-22 Gold Renewal -77
2 2012-07-19 1 cake 2008-06-11 Gold New 1499
3 2012-07-19 1 cake 2012-03-22 Gold Renewal 119
4 2012-01-05 2 apple 2011-10-12 Gold New 85
5 2012-02-08 3 beef 2011-02-08 Silver Renewal 365
6 2013-03-22 72 pot 2012-02-01 Gold Renewal 415
From here you can first eliminate the negatives (purchases made before the license date) and then take the minimum TimeToPurchase for each Person ID and Date_sales:
m = m[m['TimeToPurchase'] >= 0]
keep = m.groupby(['Person ID', 'Date_sales'], as_index=False)['TimeToPurchase'].min()
keep
Person ID Date_sales TimeToPurchase
1 2012-01-05 1303
1 2012-07-19 119
2 2012-01-05 85
3 2012-02-08 365
72 2013-03-22 415
These are the records that you want to keep in your merged table, which you can filter with an inner join:
result = m.merge(keep, on=['Person ID', 'Date_sales', 'TimeToPurchase'])
result
Date_sales Person ID Product Date_memberships Type Status TimeToPurchase
2012-01-05 1 cereal 2008-06-11 Gold New 1303
2012-07-19 1 cake 2012-03-22 Gold Renewal 119
2012-01-05 2 apple 2011-10-12 Gold New 85
2012-02-08 3 beef 2011-02-08 Silver Renewal 365
2013-03-22 72 pot 2012-02-01 Gold Renewal 415
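As an aside, pandas also has pd.merge_asof, which matches each sale directly to the most recent membership on or before it, per person, without building the full cross product first. A minimal sketch of the same idea (both frames must be sorted by the date key):
# both sides must be sorted by the 'on' key for merge_asof
sales_sorted = sales.sort_values('Date')
memberships_sorted = (memberships
                      .rename(columns={'Date': 'Date_memberships'})
                      .sort_values('Date_memberships'))
# for each sale, pick the latest membership at or before it
# for the same Person ID (direction='backward' is the default)
asof = pd.merge_asof(sales_sorted, memberships_sorted,
                     left_on='Date', right_on='Date_memberships',
                     by='Person ID', direction='backward')
asof['TimeToPurchase'] = (asof['Date'] - asof['Date_memberships']).dt.days
This yields one row per sale, paired with the license that was valid at purchase time, which is exactly the deduplicated shape above.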
data.table
Same logic as above, so I'll just paste the code.
# Date types
sales[, Date := as.Date(Date, format = "%m-%d-%Y")]
memberships[, Date := as.Date(Date, format = "%m-%d-%Y")]
# Merge
m <- memberships[sales, on = "Person ID"]
# Calculate elapsed days
m[, TimeToPurchase := as.numeric(i.Date - Date)]
# Eliminate negatives
m <- m[TimeToPurchase >= 0]
# Calculate records to keep
keep <- m[, .(TimeToPurchase = min(TimeToPurchase)), by = .(`Person ID`, i.Date)]
# Filter result
result <- m[keep, on = c("Person ID", "i.Date", "TimeToPurchase")]
result
Date Person ID Type Status i.Date Product TimeToPurchase
1: 2008-06-11 1 Gold New 2012-01-05 cereal 1303
2: 2011-10-12 2 Gold New 2012-01-05 apple 85
3: 2011-02-08 3 Silver Renewal 2012-02-08 beef 365
4: 2012-02-01 72 Gold Renewal 2013-03-22 pot 415
5: 2012-03-22 1 Gold Renewal 2012-07-19 cake 119

Here is a solution using R and library(data.table), assuming you are only interested in the first purchase after each license date:
Edit: after the question was updated
library(data.table)
purchaseDT <- data.table(stringsAsFactors = FALSE,
  Date = c("01-05-2009", "01-05-2012", "02-08-2012", "03-22-2013"),
  PersonID = c(1, 2, 1, 72),
  Product = c("cereal", "apple", "beef", "pot")
)
programDT <- data.table(stringsAsFactors = FALSE,
  Date = c("06-11-2008", "10-12-2011", "02-08-2011", "02-01-2012"),
  PersonID = c(1, 2, 1, 72),
  Type = c("Gold", "Gold", "Silver", "Gold"),
  Status = c("New", "New", "Renewal", "Renewal")
)
purchaseDT[, PurchaseDate := as.Date(Date, format="%m-%d-%Y")]
programDT[, LicenseDate := as.Date(Date, format="%m-%d-%Y")]
purchaseDT[, Date := NULL]
programDT[, Date := NULL]
mergedDT <- purchaseDT[programDT, on="PersonID"]
mergedDT[, TimeToPurchase := PurchaseDate-LicenseDate]
mergedDT <- mergedDT[TimeToPurchase > 0]
resultDT <- mergedDT[, .(TimeToPurchase = min(TimeToPurchase)), by = c("LicenseDate", "PersonID")]
resultDT[, PurchaseDate := LicenseDate+TimeToPurchase]
print(resultDT)
Result:
LicenseDate PersonID TimeToPurchase PurchaseDate
1: 2008-06-11 1 208 days 2009-01-05
2: 2011-10-12 2 85 days 2012-01-05
3: 2011-02-08 1 365 days 2012-02-08
4: 2012-02-01 72 415 days 2013-03-22

Here is one idea for you. First, I merged the two data sets using Person_ID and Date, creating a proper date object in the first mutate(), and sorted the data by Person_ID and Date. Then I created a new grouping variable: I identified the rows where Status is either "New" or "Renewal", i.e., the rows where a license became valid, and each such row starts a new group. For each group, I took the first two rows. Since the data are arranged by Person_ID and Date, the 2nd row should be the first purchase the customer made with that valid license. Finally, I calculated the interval (i.e., time2purchase) from Date.
library(dplyr)

full_join(df1, df2, by = c("Person_ID", "Date")) %>%
  mutate(Date = as.Date(Date, format = "%m-%d-%Y")) %>%
  arrange(Person_ID, Date) %>%
  mutate(group = findInterval(x = 1:n(), vec = grep(Status, pattern = "New|Renewal"))) %>%
  group_by(group) %>%
  slice(1:2) %>%
  summarize(time2purchase = Date[2] - Date[1])
group time2purchase
<int> <time>
1 1 1303 days
2 2 119 days
3 3 85 days
4 4 365 days
5 5 415 days
To make things clearer, I leave the results below, which you can generate
using mutate() instead of summarize().
Date Person_ID Product Type Status group time2purchase
<date> <int> <chr> <chr> <chr> <int> <time>
1 2008-06-11 1 NA Gold New 1 1303 days
2 2012-03-22 1 NA Gold Renewal 2 119 days
3 2011-10-12 2 NA Gold New 3 85 days
4 2011-02-08 3 NA Silver Renewal 4 365 days
5 2012-02-01 72 NA Gold Renewal 5 415 days
DATA
df1 <- structure(list(Date = c("01-05-2012", "01-05-2012", "02-08-2012",
                               "03-22-2013", "07-19-2012"),
                      Person_ID = c(1L, 2L, 3L, 72L, 1L),
                      Product = c("cereal", "apple", "beef", "pot", "cake")),
                 class = "data.frame", row.names = c(NA, -5L))
df2 <- structure(list(Date = c("06-11-2008", "10-12-2011", "02-08-2011",
                               "02-01-2012", "03-22-2012"),
                      Person_ID = c(1L, 2L, 3L, 72L, 1L),
                      Type = c("Gold", "Gold", "Silver", "Gold", "Gold"),
                      Status = c("New", "New", "Renewal", "Renewal", "Renewal")),
                 class = "data.frame", row.names = c(NA, -5L))

Related

Dataframe Insert Labels if filename starts with a 'b'

I want to create a dataframe and give a label to each file, based on the first letter of the filename.
This is where I created the dataframe, which works out fine:
[IN]
df = pd.read_csv('data.txt', sep="\t", names=['file', 'text', 'label'], header=None, engine='python')
texts = df['text'].values.astype("U")
print(df)
[OUT]
file text label
0 b_001.txt Ad sales boost Time Warner profitQuarterly pro... NaN
1 b_002.txt Dollar gains on Greenspan speechThe dollar has... NaN
2 b_003.txt Yukos unit buyer faces loan claimThe owners of... NaN
3 b_004.txt High fuel prices hit BA's profitsBritish Airwa... NaN
4 b_005.txt Pernod takeover talk lifts DomecqShares in UK ... NaN
... ... ... ...
2220 t_397.txt BT program to beat dialler scamsBT is introduc... NaN
2221 t_398.txt Spam e-mails tempt net shoppersComputer users ... NaN
2222 t_399.txt Be careful how you codeA new European directiv... NaN
2223 t_400.txt US cyber security chief resignsThe man making ... NaN
2224 t_401.txt Losing yourself in online gamingOnline role pl... NaN
Now I want to insert labels based on the filename
for index, row in df.iterrows():
    if row['file'].startswith('b'):
        row['label'] = 0
    elif row['file'].startswith('e'):
        row['label'] = 1
    elif row['file'].startswith('p'):
        row['label'] = 2
    elif row['file'].startswith('s'):
        row['label'] = 3
    else:
        row['label'] = 4
print(df)
[OUT]
file text label
0 b_001.txt Ad sales boost Time Warner profitQuarterly pro... 4
1 b_002.txt Dollar gains on Greenspan speechThe dollar has... 4
2 b_003.txt Yukos unit buyer faces loan claimThe owners of... 4
3 b_004.txt High fuel prices hit BA's profitsBritish Airwa... 4
4 b_005.txt Pernod takeover talk lifts DomecqShares in UK ... 4
... ... ... ...
2220 t_397.txt BT program to beat dialler scamsBT is introduc... 4
2221 t_398.txt Spam e-mails tempt net shoppersComputer users ... 4
2222 t_399.txt Be careful how you codeA new European directiv... 4
2223 t_400.txt US cyber security chief resignsThe man making ... 4
2224 t_401.txt Losing yourself in online gamingOnline role pl... 4
As you can see, every row got the label 4. What did I do wrong?
Here is one way to do it. Instead of a for loop, you can use map to assign the values to the label.
# create a dictionary of key: value map
d={'b':0,'e':1,'p':2,'s':3}
else_val=4
#take the first character from the filename, and map using dictionary
# null values (else condition) will be 4
df['label'] = df['file'].str[:1].map(d).fillna(else_val).astype(int)
print(df)
file text label
0 b_001.txt Ad sales boost Time Warner profitQuarterly pro... 0
1 b_002.txt Dollar gains on Greenspan speechThe dollar has... 0
2 b_003.txt Yukos unit buyer faces loan claimThe owners of... 0
3 b_004.txt High fuel prices hit BA's profitsBritish Airwa... 0
4 b_005.txt Pernod takeover talk lifts DomecqShares in UK ... 0
... ... ... ...
2220 t_397.txt BT program to beat dialler scamsBT is introduc... 4
2221 t_398.txt Spam e-mails tempt net shoppersComputer users ... 4
2222 t_399.txt Be careful how you codeA new European directiv... 4
2223 t_400.txt US cyber security chief resignsThe man making ... 4
2224 t_401.txt Losing yourself in online gamingOnline role pl... 4
According to the documentation, using iterrows() to modify the data frame you are iterating over is not guaranteed to work in all cases, because iterrows() does not preserve dtypes across rows:
You should never modify something you are iterating over. This is not
guaranteed to work in all cases. Depending on the data types, the
iterator returns a copy and not a view, and writing to it will have no
effect.
Therefore do instead as follows.
def label(file):
    if file.startswith('b'):
        return 0
    elif file.startswith('e'):
        return 1
    elif file.startswith('p'):
        return 2
    elif file.startswith('s'):
        return 3
    else:
        return 4

df['label'] = df.apply(lambda row: label(row['file']), axis=1)
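Another vectorized option, if you prefer to keep the conditions explicit rather than building a dict, is numpy.select; a minimal sketch:
import numpy as np

# each condition is a boolean Series; np.select picks the value
# for the first condition that matches, with 4 as the fallback
conditions = [
    df['file'].str.startswith('b'),
    df['file'].str.startswith('e'),
    df['file'].str.startswith('p'),
    df['file'].str.startswith('s'),
]
df['label'] = np.select(conditions, [0, 1, 2, 3], default=4)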

Get latest value looked up from other dataframe

My first data frame
product=pd.DataFrame({
'Product_ID':[101,102,103,104,105,106,107,101],
'Product_name':['Watch','Bag','Shoes','Smartphone','Books','Oil','Laptop','New Watch'],
'Category':['Fashion','Fashion','Fashion','Electronics','Study','Grocery','Electronics','Electronics'],
'Price':[299.0,1350.50,2999.0,14999.0,145.0,110.0,79999.0,9898.0],
'Seller_City':['Delhi','Mumbai','Chennai','Kolkata','Delhi','Chennai','Bengalore','New York']
})
My 2nd data frame has transactions
customer=pd.DataFrame({
'id':[1,2,3,4,5,6,7,8,9],
'name':['Olivia','Aditya','Cory','Isabell','Dominic','Tyler','Samuel','Daniel','Jeremy'],
'age':[20,25,15,10,30,65,35,18,23],
'Product_ID':[101,0,106,0,103,104,0,0,107],
'Purchased_Product':['Watch','NA','Oil','NA','Shoes','Smartphone','NA','NA','Laptop'],
'City':['Mumbai','Delhi','Bangalore','Chennai','Chennai','Delhi','Kolkata','Delhi','Mumbai']
})
I want Price from the 1st data frame to come into the merged dataframe, the common element being 'Product_ID'. Note that against Product_ID 101 there are 2 prices, 299.00 and 9898.00. I want the latter one, 9898.0, in the merged data set (since this is the latest price).
Currently my code is not giving the right answer; it is returning both:
customerpur = pd.merge(customer,product[['Price','Product_ID']], on="Product_ID", how = "left")
customerpur
id name age Product_ID Purchased_Product City Price
0 1 Olivia 20 101 Watch Mumbai 299.0
1 1 Olivia 20 101 Watch Mumbai 9898.0
There is no explicit timestamp, so I assume the row order of the dataframe reflects recency. You can drop duplicates at the end:
customerpur.drop_duplicates(subset = ['id'], keep = 'last')
result:
id name age Product_ID Purchased_Product City Price
1 1 Olivia 20 101 Watch Mumbai 9898.0
2 2 Aditya 25 0 NA Delhi NaN
3 3 Cory 15 106 Oil Bangalore 110.0
4 4 Isabell 10 0 NA Chennai NaN
5 5 Dominic 30 103 Shoes Chennai 2999.0
6 6 Tyler 65 104 Smartphone Delhi 14999.0
7 7 Samuel 35 0 NA Kolkata NaN
8 8 Daniel 18 0 NA Delhi NaN
9 9 Jeremy 23 107 Laptop Mumbai 79999.0
Please note the keep = 'last' argument, since we are keeping only the last price registered.
Deduplication should be done before merging if you care about performance or the dataset is huge:
product = product.drop_duplicates(subset = ['Product_ID'], keep = 'last')
In your data frame there is no indicator of the latest entry, so you might need to first remove the first entry for Product_ID 101 from the product dataframe as follows:
result_product = product.drop_duplicates(subset=['Product_ID'], keep='last')
It will keep the last entry based on Product_ID and you can do the merge as:
pd.merge(result_product, customer, on='Product_ID')
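If the product table did carry a timestamp, you could sort on it first, so that keep='last' is guaranteed to pick the most recent price. A small sketch, assuming a hypothetical updated_at column that is not in the original data:
# 'updated_at' is hypothetical -- sort so the newest row per
# Product_ID comes last, then keep exactly that row
latest = (product.sort_values('updated_at')
                 .drop_duplicates(subset=['Product_ID'], keep='last'))
customerpur = customer.merge(latest[['Product_ID', 'Price']],
                             on='Product_ID', how='left')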

Using Pandas to map results of a groupby.sum() to another dataframe?

I have two dataframes. One is at the micro level, containing all line items purchased across all transactions (df1). The other will be built as a higher-level aggregation that summarizes the revenue generated per transaction, essentially summing up all line items for each transaction (df2).
df1
Out[df1]:
transaction_id item_id amount
0 AJGDO-12304 120 $120
1 AJGDO-12304 40 $10
2 AJGDO-12304 01 $10
3 ODSKF-99130 120 $120
4 ODSKF-99130 44 $30
5 ODSKF-99130 03 $50
df2
Out[df2]
transaction_id location_id customer_id revenue(THIS WILL BE THE ADDED COLUMN!)
0 AJGDO-12304 2131234 1234 $140
1 ODSKF-99130 213124 1345 $200
How would I go about linking the output of a groupby.sum() to df2? The revenue column should hold the aggregated amount from df1 per transaction_id, linked via df2['transaction_id'].
Here is what I have tried so far, but I am struggling to put it together:
results = df1.groupby('transaction_id')['amount'].sum()
df2['revenue'] = df2['transaction_id'].merge(results,how='left').value
Use map:
lookup = df1.groupby(['transaction_id'])['amount'].sum()
df2['revenue'] = df2.transaction_id.map(lookup)
print(df2)
Output
transaction_id location_id customer_id revenue
0 AJGDO-12304 2131234 1234 140
1 ODSKF-99130 213124 1345 200
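One caveat: the amount column in df1 is printed with a $ prefix. If those values are stored as strings, the sum will concatenate them rather than add. A small cleanup sketch, assuming that string format:
# strip the '$' and cast to numeric before aggregating;
# assumes amounts are strings like '$120' as printed above
df1['amount'] = (df1['amount'].astype(str)
                              .str.replace('$', '', regex=False)
                              .astype(float))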

Combine two pandas DataFrames where the date fields are within two months of each other

I need to combine two pandas dataframes where df1.date is within the two months previous to df2.date. I then want to calculate how many traders had traded the same stock during that period and count the total shares purchased.
I have tried the approach listed below, but found it far too complicated. I believe there must be a smarter/simpler solution.
Pandas: how to merge two dataframes on offset dates?
A sample dataset is below:
DF1 (team_1):
date shares symbol trader
31/12/2013 154 FDX Max
30/06/2016 2367 GOOGL Max
21/07/2015 293 ORCL Max
18/07/2015 304 ORCL Sam
DF2 (team_2):
date shares symbol trader
23/08/2015 345 ORCL John
04/07/2014 567 FB John
06/12/2013 221 ACER Sally
30/11/2012 889 HP John
05/06/2010 445 ABBV Kate
Required output:
date shares symbol trader team_2_traders team_2_shares_bought
23/08/2015 345 ORCL John 2 597
04/07/2014 567 FB John 0 0
06/12/2013 221 ACER Sally 0 0
30/11/2012 889 HP John 0 0
05/06/2010 445 ABBV Kate 0 0
This adds 2 new columns...
'team_2_traders' = count of how many traders from team_1 traded the same stock during the previous 2 months from the date listed on DF2.
'team_2_shares_bought' = count of the total shares purchased by team_1 during the previous 2 months from the date listed on DF2.
If anyone is willing to give this a crack, please use the snippet below to setup the dataframes. Please keep in mind the actual dataset contains millions of rows and 6,000 company stocks.
team_1 = {'symbol':['FDX','GOOGL','ORCL','ORCL'],
'date':['31/12/2013','30/06/2016','21/07/2015','18/07/2015'],
'shares':[154,2367,293,304],
'trader':['Max','Max','Max','Sam']}
df1 = pd.DataFrame(team_1)
team_2 = {'symbol':['ORCL','FB','ACER','HP','ABBV'],
'date':['23/08/2015','04/07/2014','06/12/2013','30/11/2012','05/06/2010'],
'shares':[345,567,221,889,445],
'trader':['John','John','Sally','John','Kate']}
df2 = pd.DataFrame(team_2)
Appreciate the help - thank you.
Please check my solution.
from pandas.tseries.offsets import MonthEnd
df_ = df2.merge(df1, on=['symbol'])
# the dates are day-first (DD/MM/YYYY), so be explicit
df_['date_x'] = pd.to_datetime(df_['date_x'], dayfirst=True)
df_['date_y'] = pd.to_datetime(df_['date_y'], dayfirst=True)
df_2m = df_[df_['date_x'] < df_['date_y'] + MonthEnd(2)] \
    .loc[:, ['date_y', 'shares_y', 'symbol', 'trader_y']] \
    .groupby('symbol')
df1_ = pd.concat([df_2m['shares_y'].sum(), df_2m['trader_y'].count()], axis=1)
print(df1_)
shares_y trader_y
symbol
ORCL 597 2
print(df2.merge(df1_.reset_index(), on='symbol', how='left').fillna(0))
date shares symbol trader shares_y trader_y
0 23/08/2015 345 ORCL John 597.0 2.0
1 04/07/2014 567 FB John 0.0 0.0
2 06/12/2013 221 ACER Sally 0.0 0.0
3 30/11/2012 889 HP John 0.0 0.0
4 05/06/2010 445 ABBV Kate 0.0 0.0
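Note that MonthEnd(2) rolls forward to the end of the second month, so the window is slightly wider than a strict two calendar months, and the filter above does not force the team_1 trade to come before the team_2 date. If you want both bounds enforced, a sketch with DateOffset:
from pandas.tseries.offsets import DateOffset

# strict window: the team_1 trade (date_y) falls within the two
# calendar months before the team_2 trade (date_x)
window = ((df_['date_y'] <= df_['date_x']) &
          (df_['date_y'] >= df_['date_x'] - DateOffset(months=2)))
df_2m = df_.loc[window, ['date_y', 'shares_y', 'symbol', 'trader_y']] \
           .groupby('symbol')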

Pandas percentage calculation with groupby operation

I have a data frame in pandas which have following format.
coupon_applied coupon_type dish_id dish_name dish_price
0 Yes Rs 20 off 012 Paneer Biryani 110
dish_quant_bought dish_quantity dish_substitute dish_type
0 50 2 Yes Veg
order_time order_id order_lat year
0 2015-12-05 16:30:04.345 order_1 73.955741 2015
month day time
0 12 Saturday 16:30:04.345000
I want to calculate the dish selling rate for a particular time interval:
dish selling rate = dish_quantity / dish_quant_bought
I am doing the following:
df_final[(df_final['time'] > datetime.time(16,30)) & (df_final['time'] < datetime.time(16,35))].groupby('dish_name').sum()['dish_quantity']
Which gives me the following:
dish_name
Chicken Biryani 2
Chicken Tikka Masala 2
Mutton Biryani 2
Paneer Biryani 2
But I am unable to divide the dish quantity sold by the dish quantity bought.
How can I do it? Please help...
IIUC you can do it very simply:
# on recent pandas you may need .sum(numeric_only=True)
df = df_final[(df_final['time'] > datetime.time(16,30)) &
              (df_final['time'] < datetime.time(16,35))].groupby('dish_name').sum()
df['dish_selling_rate'] = df['dish_quantity'] / df['dish_quant_bought']
print(df.head())
