I want to compare this dataframe df1:
Product Price
0 Waterproof Liner 40
1 Phone Tripod 50
2 Waterproof Pants 0
3 baby Kids play Mat 985
4 Hiking BACKPACKS 34
5 security Camera 160
with df2 as shown below:
Product Id
0 Home Security IP Camera 508760
1 Hiking Backpacks – Spring Products 287950
2 Waterproof Eyebrow Liner 678897
3 Waterproof Pants – Winter Product 987340
4 Baby Kids Water Play Mat – Summer Product 111500
I want to compare the Product column in df1 with the Product column in df2 in order to find the correct Id for each product. If the best similarity score is < 80, it should put 'Remove' in the ID field instead.
NB: the text in the Product columns of df1 and df2 does not match 100%.
Can anyone help me with this, or show me how I can use fuzzywuzzy to get the correct Id?
Here is my code
import pandas as pd
from fuzzywuzzy import process
data1 = {'Product1': ['Waterproof Liner','Phone Tripod','Waterproof Pants','baby Kids play Mat','Hiking BACKPACKS','security Camera'],
'Price':[40,50,0,985,34,160]}
data2 = {'Product2': ['Home Security IP Camera','Hiking Backpacks – Spring Products','Waterproof Eyebrow Liner',
'Waterproof Pants – Winter Product','Baby Kids Water Play Mat – Summer Product'],
'Id': [508760,287950,678897,987340,111500],}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
dfm = pd.DataFrame(df1["Product1"].apply(lambda x: process.extractOne(x, df2["Product2"]))
.tolist(), columns=['Product1',"match_comp", "Id"])
What I got (note that extractOne returns a (match, score, index) tuple, so the columns actually hold the matched df2 product, the score, and the df2 index):
Product1 match_comp Id
0 Waterproof Eyebrow Liner 86 2
1 Waterproof Eyebrow Liner 50 2
2 Waterproof Pants – Winter Product 90 3
3 Baby Kids Water Play Mat – Summer Product 86 4
4 Hiking Backpacks – Spring Products 90 1
5 Home Security IP Camera 86 0
What I expect to get:
Product Price ID
0 Waterproof Liner 40 678897
1 Phone Tripod 50 Remove
2 Waterproof Pants 0 987340
3 baby Kids play Mat 985 111500
4 Hiking BACKPACKS 34 287950
5 security Camera 160 508760
You can make a wrapper function:
def extract(s):
    name, score, _ = process.extractOne(s, df2["Product2"], score_cutoff=0)
    if score < 80:
        return 'Remove'
    return df2.set_index('Product2').loc[name, 'Id']

df1['ID'] = df1["Product1"].apply(extract)
output:
Product1 Price ID
0 Waterproof Liner 40 678897
1 Phone Tripod 50 Remove
2 Waterproof Pants 0 987340
3 baby Kids play Mat 985 111500
4 Hiking BACKPACKS 34 287950
5 security Camera 160 508760
NB: if the output is not exactly what you expect, you would have to explain why rows 4/5 should be dropped.
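As an alternative (my variant, not part of the answer above), you can let fuzzywuzzy do the thresholding for you: process.extractOne accepts a score_cutoff argument and returns None when no choice scores at or above it. A minimal sketch under that assumption, reusing df1/df2 from the question; since extractOne returns a (value, score, key) triple when choices is a Series (which the wrapper above already relies on), the key can be used to look up the Id directly, avoiding re-indexing df2 on every call:

import pandas as pd
from fuzzywuzzy import process

def extract(s):
    # extractOne returns None when no choice scores >= score_cutoff
    match = process.extractOne(s, df2["Product2"], score_cutoff=80)
    if match is None:
        return 'Remove'
    name, score, idx = match  # (value, score, index) when choices is a Series
    return df2.loc[idx, 'Id']

df1['ID'] = df1["Product1"].apply(extract)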
I have just started learning Python and R, so any advice using either of them would be much appreciated.
My data are stored in two dataframes. One holds sales data: for each consumer, we can see the date when they purchased something. The same person may purchase more than once:
Date Person ID Product
01-05-2012 1 cereal
01-05-2012 2 apple
02-08-2012 3 beef
03-22-2013 72 pot
07-19-2012 1 cake
The second dataframe has membership data, which tells us when a person enrolled in the program:
Date Person ID Type Status
06-11-2008 1 Gold New
10-12-2011 2 Gold New
02-08-2011 3 Silver Renewal
02-01-2012 72 Gold Renewal
03-22-2012 1 Gold Renewal
What I want to work out is, for each person, how long it took them to purchase something after they enrolled in the program.
For example, person 1 got a new membership on 06-11-2008 and purchased cereal on 01-05-2012. I would like to calculate how many days there are between these two dates.
However, this information is stored in separate dataframes. I don't think they can simply be appended or merged into one dataframe, because one person can have more than one observation in one or both of the dataframes.
What I am thinking is to extract all the dates from the sales data and all the dates from the license (membership) data, then merge these two new dataframes into one. This would give me:
License Date Person ID Sales Date
06-11-2008 1 01-05-2012
10-12-2011 2 01-05-2012
02-08-2011 3 02-08-2012
02-01-2012 72 03-22-2013
06-11-2008 1 07-19-2012
03-22-2012 1 01-05-2012
03-22-2012 1 07-19-2012
But the problem here is that if a person has two license dates (e.g., one new and one renewal), then merging the data will give 2 × (sales dates)... but I only want each sales date paired with the license that was valid for it.
For example, person 1 used license 06-11-2008 to buy cereal on 01-05-2012, and used license 03-22-2012 to buy on 07-19-2012. But merging the dataframes will give me 4 records rather than the 2 I want...
The result I would want is the time to purchase for each sale, after he gets the license which he used for that purchase:
License Date Person ID Sales Date TimeToPurchase
06-11-2008 1 01-05-2012 ...
10-12-2011 2 01-05-2012 ...
02-08-2011 3 02-08-2012 ...
02-01-2012 72 03-22-2013 ...
03-22-2012 1 07-19-2012 ...
Is there a better way you can suggest to do this?
Thank you very much for the help!
pandas
First your dates need to be saved as datetime, which you can accomplish like this:
sales['Date'] = pd.to_datetime(sales['Date'])
memberships['Date'] = pd.to_datetime(memberships['Date'])
Then you merge them on Person ID and arrive at the format that has the duplicates you described.
m = sales.merge(memberships, on='Person ID',
                suffixes=('_sales', '_memberships'))
m
Date_sales Person ID Product Date_memberships Type Status
0 2012-01-05 1 cereal 2008-06-11 Gold New
1 2012-01-05 1 cereal 2012-03-22 Gold Renewal
2 2012-07-19 1 cake 2008-06-11 Gold New
3 2012-07-19 1 cake 2012-03-22 Gold Renewal
4 2012-01-05 2 apple 2011-10-12 Gold New
5 2012-02-08 3 beef 2011-02-08 Silver Renewal
6 2013-03-22 72 pot 2012-02-01 Gold Renewal
Now you can calculate the elapsed days between the sales and the membership dates like this:
m['TimeToPurchase'] = (m['Date_sales'] - m['Date_memberships']).dt.days
m
Date_sales Person ID Product Date_memberships Type Status TimeToPurchase
0 2012-01-05 1 cereal 2008-06-11 Gold New 1303
1 2012-01-05 1 cereal 2012-03-22 Gold Renewal -77
2 2012-07-19 1 cake 2008-06-11 Gold New 1499
3 2012-07-19 1 cake 2012-03-22 Gold Renewal 119
4 2012-01-05 2 apple 2011-10-12 Gold New 85
5 2012-02-08 3 beef 2011-02-08 Silver Renewal 365
6 2013-03-22 72 pot 2012-02-01 Gold Renewal 415
From here you can first eliminate the negatives (sales made before that membership started) and then take the minimum TimeToPurchase for each Person ID and Date_sales, which picks out the most recent valid membership for each sale.
m = m[m['TimeToPurchase'] >= 0]
keep = m.groupby(['Person ID', 'Date_sales'], as_index=False)['TimeToPurchase'].min()
keep
Person ID Date_sales TimeToPurchase
1 2012-01-05 1303
1 2012-07-19 119
2 2012-01-05 85
3 2012-02-08 365
72 2013-03-22 415
These are the records that you want to keep in your merged table, which you can filter with an inner join:
result = m.merge(keep, on=['Person ID', 'Date_sales', 'TimeToPurchase'])
result
Date_sales Person ID Product Date_memberships Type Status TimeToPurchase
2012-01-05 1 cereal 2008-06-11 Gold New 1303
2012-07-19 1 cake 2012-03-22 Gold Renewal 119
2012-01-05 2 apple 2011-10-12 Gold New 85
2012-02-08 3 beef 2011-02-08 Silver Renewal 365
2013-03-22 72 pot 2012-02-01 Gold Renewal 415
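As a side note (my addition, not part of the answer above): because each sale should be matched to the most recent membership on or before its date, pd.merge_asof can express the whole "valid license" logic in one step, without building and then filtering the duplicated rows. A minimal sketch, assuming the same sales and memberships frames with datetime Date columns:

import pandas as pd

# keep a copy of the membership date, since merge_asof keeps only the left 'Date'
memberships_srt = memberships.sort_values('Date').assign(Date_memberships=lambda d: d['Date'])
sales_srt = sales.sort_values('Date')

# for each sale, take the most recent membership on or before it, per person
result = pd.merge_asof(sales_srt, memberships_srt,
                       on='Date', by='Person ID', direction='backward')
result['TimeToPurchase'] = (result['Date'] - result['Date_memberships']).dt.days

Sales with no membership on or before their date come back with NaT, which you can then drop or flag.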
data.table
Same logic as above, so I'll just paste the code.
# Date types
sales[, Date := as.Date(Date, format = "%m-%d-%Y")]
memberships[, Date := as.Date(Date, format = "%m-%d-%Y")]
# Merge
m <- memberships[sales, on = "Person ID"]
# Calculate elapsed days
m[, TimeToPurchase := as.numeric(i.Date - Date)]
# Eliminate negatives
m <- m[TimeToPurchase >= 0]
# Calculate records to keep
keep <- m[, .(TimeToPurchase = min(TimeToPurchase)), by = .(`Person ID`, i.Date)]
# Filter result
result <- m[keep, on = c("Person ID", "i.Date", "TimeToPurchase")]
result
Date Person ID Type Status i.Date Product TimeToPurchase
1: 2008-06-11 1 Gold New 2012-01-05 cereal 1303
2: 2011-10-12 2 Gold New 2012-01-05 apple 85
3: 2011-02-08 3 Silver Renewal 2012-02-08 beef 365
4: 2012-02-01 72 Gold Renewal 2013-03-22 pot 415
5: 2012-03-22 1 Gold Renewal 2012-07-19 cake 119
Here is a solution using R and library(data.table), assuming you are only interested in the first purchase after each license date:
Edit: after question was updated
library(data.table)
purchaseDT <- data.table(stringsAsFactors=FALSE,
Date = c("01-05-2009", "01-05-2012", "02-08-2012", "03-22-2013"),
PersonID = c(1, 2, 1, 72),
Product = c("cereal", "apple", "beef", "pot")
)
programDT <- data.table(stringsAsFactors=FALSE,
Date = c("06-11-2008", "10-12-2011", "02-08-2011", "02-01-2012"),
PersonID = c(1, 2, 1, 72),
Type = c("Gold", "Gold", "Silver", "Gold"),
Status = c("New", "New", "Renewal", "Renewal")
)
purchaseDT[, PurchaseDate := as.Date(Date, format="%m-%d-%Y")]
programDT[, LicenseDate := as.Date(Date, format="%m-%d-%Y")]
purchaseDT[, Date := NULL]
programDT[, Date := NULL]
mergedDT <- purchaseDT[programDT, on="PersonID"]
mergedDT[, TimeToPurchase := PurchaseDate-LicenseDate]
mergedDT <- mergedDT[TimeToPurchase > 0]
resultDT <- mergedDT[, .(TimeToPurchase = min(TimeToPurchase)), by = c("LicenseDate", "PersonID")]
resultDT[, PurchaseDate := LicenseDate+TimeToPurchase]
print(resultDT)
Result:
LicenseDate PersonID TimeToPurchase PurchaseDate
1: 2008-06-11 1 208 days 2009-01-05
2: 2011-10-12 2 85 days 2012-01-05
3: 2011-02-08 1 365 days 2012-02-08
4: 2012-02-01 72 415 days 2013-03-22
Here is one idea for you. First, I merged the two data sets using Person_ID and Date. In this example, I needed to create a date object (i.e., Date) in the first mutate(). I sorted the data by Person_ID and Date. Then I created a new grouping variable: I identified the rows where Status is either "New" or "Renewal", i.e., the rows where a license became valid. Each such row becomes the first row of its group. For each group, I chose the first two rows; since the data is arranged by Person_ID and Date, the 2nd row is the first purchase the customer made with that valid license. Finally, I calculated the interval (i.e., time2purchase) from the two Date values.
library(dplyr)

full_join(df1, df2, by = c("Person_ID", "Date")) %>%
  mutate(Date = as.Date(Date, format = "%m-%d-%Y")) %>%
  arrange(Person_ID, Date) %>%
  mutate(group = findInterval(x = 1:n(), vec = grep(Status, pattern = "New|Renewal"))) %>%
  group_by(group) %>%
  slice(1:2) %>%
  summarize(time2purchase = Date[2] - Date[1])
group time2purchase
<int> <time>
1 1 1303 days
2 2 119 days
3 3 85 days
4 4 365 days
5 5 415 days
To make things clearer, I leave the results below, which you can generate
using mutate() instead of summarize().
Date Person_ID Product Type Status group time2purchase
<date> <int> <chr> <chr> <chr> <int> <time>
1 2008-06-11 1 NA Gold New 1 1303 days
2 2012-03-22 1 NA Gold Renewal 2 119 days
3 2011-10-12 2 NA Gold New 3 85 days
4 2011-02-08 3 NA Silver Renewal 4 365 days
5 2012-02-01 72 NA Gold Renewal 5 415 days
DATA
df1 <-structure(list(Date = c("01-05-2012", "01-05-2012", "02-08-2012",
"03-22-2013", "07-19-2012"), Person_ID = c(1L, 2L, 3L, 72L, 1L
), Product = c("cereal", "apple", "beef", "pot", "cake")), class = "data.frame",
row.names = c(NA,
-5L))
df2 <- structure(list(Date = c("06-11-2008", "10-12-2011", "02-08-2011",
"02-01-2012", "03-22-2012"), Person_ID = c(1L, 2L, 3L, 72L, 1L
), Type = c("Gold", "Gold", "Silver", "Gold", "Gold"), Status = c("New",
"New", "Renewal", "Renewal", "Renewal")), class = "data.frame", row.names = c(NA,
-5L))
I need to combine two pandas dataframes where df1.date falls within the two months before df2.date. I then want to calculate how many df1 traders had traded the same stock during that period, and the total shares they purchased.
I have tried the approach in the question linked below, but found it far too complicated. I believe there must be a smarter/simpler solution.
Pandas: how to merge two dataframes on offset dates?
A sample dataset is below:
DF1 (team_1):
date shares symbol trader
31/12/2013 154 FDX Max
30/06/2016 2367 GOOGL Max
21/07/2015 293 ORCL Max
18/07/2015 304 ORCL Sam
DF2 (team_2):
date shares symbol trader
23/08/2015 345 ORCL John
04/07/2014 567 FB John
06/12/2013 221 ACER Sally
30/11/2012 889 HP John
05/06/2010 445 ABBV Kate
Required output:
date shares symbol trader team_2_traders team_2_shares_bought
23/08/2015 345 ORCL John 2 597
04/07/2014 567 FB John 0 0
06/12/2013 221 ACER Sally 0 0
30/11/2012 889 HP John 0 0
05/06/2010 445 ABBV Kate 0 0
This adds 2 new columns...
'team_2_traders' = count of how many traders from team_1 traded the same stock during the previous 2 months from the date listed on DF2.
'team_2_shares_bought' = count of the total shares purchased by team_1 during the previous 2 months from the date listed on DF2.
If anyone is willing to give this a crack, please use the snippet below to setup the dataframes. Please keep in mind the actual dataset contains millions of rows and 6,000 company stocks.
import pandas as pd

team_1 = {'symbol':['FDX','GOOGL','ORCL','ORCL'],
'date':['31/12/2013','30/06/2016','21/07/2015','18/07/2015'],
'shares':[154,2367,293,304],
'trader':['Max','Max','Max','Sam']}
df1 = pd.DataFrame(team_1)
team_2 = {'symbol':['ORCL','FB','ACER','HP','ABBV'],
'date':['23/08/2015','04/07/2014','06/12/2013','30/11/2012','05/06/2010'],
'shares':[345,567,221,889,445],
'trader':['John','John','Sally','John','Kate']}
df2 = pd.DataFrame(team_2)
Appreciate the help - thank you.
Please check my solution.
import pandas as pd
from pandas.tseries.offsets import MonthEnd

df_ = df2.merge(df1, on=['symbol'])
# parse day-first dates explicitly, otherwise e.g. '04/07/2014' is read as April 7
df_['date_x'] = pd.to_datetime(df_['date_x'], format='%d/%m/%Y')
df_['date_y'] = pd.to_datetime(df_['date_y'], format='%d/%m/%Y')
# keep only df1 trades on or before the df2 date and within ~2 months of it
df_2m = df_[(df_['date_y'] <= df_['date_x'])
            & (df_['date_x'] < df_['date_y'] + MonthEnd(2))] \
    .loc[:, ['date_y', 'shares_y', 'symbol', 'trader_y']] \
    .groupby('symbol')
df1_ = pd.concat([df_2m['shares_y'].sum(), df_2m['trader_y'].count()], axis=1)
print(df1_)
shares_y trader_y
symbol
ORCL 597 2
print(df2.merge(df1_.reset_index(), on='symbol', how='left').fillna(0))
date shares symbol trader shares_y trader_y
0 23/08/2015 345 ORCL John 597.0 2.0
1 04/07/2014 567 FB John 0.0 0.0
2 06/12/2013 221 ACER Sally 0.0 0.0
3 30/11/2012 889 HP John 0.0 0.0
4 05/06/2010 445 ABBV Kate 0.0 0.0
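As a cross-check (my addition, not part of the answer above), here is a more explicit, row-by-row variant. It makes the two-month cutoff visible, evaluates the window per df2 row rather than per symbol (which matters if the same symbol appears in df2 on several dates), and counts distinct traders with nunique() rather than rows. The window_stats helper name is my own; being a Python-level loop, it is only a sanity check, and on millions of rows you would keep a merge-based approach:

import pandas as pd

df1['date'] = pd.to_datetime(df1['date'], format='%d/%m/%Y')
df2['date'] = pd.to_datetime(df2['date'], format='%d/%m/%Y')

def window_stats(row):
    # team_1 activity in the same symbol during the 2 months before this trade
    start = row['date'] - pd.DateOffset(months=2)
    mask = ((df1['symbol'] == row['symbol'])
            & (df1['date'] >= start)
            & (df1['date'] <= row['date']))
    return pd.Series({'team_2_traders': df1.loc[mask, 'trader'].nunique(),
                      'team_2_shares_bought': df1.loc[mask, 'shares'].sum()})

print(df2.join(df2.apply(window_stats, axis=1)))

For the sample data this reproduces 2 traders and 597 shares for the ORCL row and zeros elsewhere; renaming shares_y/trader_y in the merged solution to team_2_shares_bought/team_2_traders would give the exact column names you asked for.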