Finding one-to-many matches in two Pandas DataFrames - python

I am attempting to put together a generic matching process for financial data. The goal is to take one set of data with larger transactions and match it to a set of data with smaller transactions. Some matches are one-to-many, others are one-to-one.
There are a few cases where the relationship may be reversed, so part of the approach is to feed the mismatches back in inverse order to capture those possible matches.
I have created three modules that iterate across each other to complete the work, but I am not getting consistent results: I see possible matches in my data that should be picked up but are not.
There are no clear matching criteria either, so the assumption is that if I put the datasets in date order and look for matching values, I want to take the first match, since it should be closest in timeframe.
I am using Pandas and itertools, but maybe not in the ideal way. Any help getting consistent matches would be appreciated.
Data examples:
Large Transaction Size:
AID AIssue Date AAmount
1508 3/14/2018 -560
1506 3/27/2018 -35
1500 4/25/2018 5000
Small Transaction Size:
BID BIssue Date BAmount
1063 3/6/2018 -300
1062 3/6/2018 -260
839 3/22/2018 -35
423 4/24/2018 5000
Expected Results
AID AIssue Date AAmount BID BIssue Date BAmount
1508 3/14/2018 -560 1063 3/6/2018 -300
1508 3/14/2018 -560 1062 3/6/2018 -260
1506 3/27/2018 -35 839 3/22/2018 -35
1500 4/25/2018 5000 423 4/24/2018 5000
but I usually get
AID AIssue Date AAmount BID BIssue Date BAmount
1508 3/14/2018 -560 1063 3/6/2018 -300
1508 3/14/2018 -560 1062 3/6/2018 -260
1506 3/27/2018 -35 839 3/22/2018 -35
with the 5000 transactions not matching. This is just one example; positive versus negative sign does not appear to be the factor when looking at the larger dataset.
When reviewing the unmatched results from each side, I see at least one $5000 transaction that I would expect to be a one-to-one match, and it is not in the results.
import itertools
import pandas as pd

def matches(iterable):
    s = list(iterable)
    #Only going up to 4-item combinations (range(5)) to avoid memory overrun on large datasets
    s = list(itertools.chain.from_iterable(itertools.combinations(s, r) for r in range(5)))
    return [list(elem) for elem in s]

def one_to_many(dfL, dfS, dID = 0, dDT = 1, dVal = 2):
    #dfL = dataset with larger values
    #dfS = dataset with smaller values
    #dID = column index of ID record
    #dDT = column index of date record
    #dVal = column index of dollar value record
    S = dfS[dfS.columns[dID]].values.tolist()
    S_amount = dfS[dfS.columns[dVal]].values.tolist()
    S = matches(S)
    S_amount = matches(S_amount)
    #get ID of first large record, the ID to be matched in this module
    L = dfL[dfL.columns[dID]].iloc[0]
    #get value of first large record, this value will be the matching criteria
    L_amount = dfL[dfL.columns[dVal]].iloc[0]
    count_of_sets = len(S)
    for a in range(0, count_of_sets):
        list_of_items = S[a]
        list_of_values = S_amount[a]
        if round(sum(list_of_values), 2) == round(L_amount, 2):
            break
    if round(sum(list_of_values), 2) == round(L_amount, 2):
        retVal = list_of_items
    else:
        retVal = [-1]
    return retVal
def iterate_one_to_many(dfLarge, dfSmall, dID = 0, dDT = 1, dVal = 2):
    #dfLarge = dataset with larger values
    #dfSmall = dataset with smaller values
    #dID = column index of ID record
    #dDT = column index of date record
    #dVal = column index of dollar value record
    #returns a list of dataframes [paired matches, unmatched from dfLarge, unmatched from dfSmall]
    dfLarge = dfLarge.set_index(dfLarge.columns[dID]).sort_values([dfLarge.columns[dDT], dfLarge.columns[dVal]]).reset_index()
    dfSmall = dfSmall.set_index(dfSmall.columns[dID]).sort_values([dfSmall.columns[dDT], dfSmall.columns[dVal]]).reset_index()
    end_row = len(dfLarge.columns[dID]) - 1
    matches_master = pd.DataFrame(data = None, columns = dfLarge.columns.append(dfSmall.columns))
    for lg in range(0, end_row):
        sm_match_id = one_to_many(dfLarge, dfSmall)
        lg_match_id = dfLarge[dfLarge.columns[dID]][lg]
        if sm_match_id != [-1]:
            end_of_matches = len(sm_match_id)
            for sm in range(0, end_of_matches):
                if sm == 0:
                    sm_match = dfSmall.loc[dfSmall[dfSmall.columns[dID]] == sm_match_id[sm]].copy()
                    dfSmall = dfSmall.loc[dfSmall[dfSmall.columns[dID]] != sm_match_id[sm]].copy()
                else:
                    sm_match = sm_match.append(dfSmall.loc[dfSmall[dfSmall.columns[dID]] == sm_match_id[sm]].copy())
                    dfSmall = dfSmall.loc[dfSmall[dfSmall.columns[dID]] != sm_match_id[sm]].copy()
            lg_match = dfLarge.loc[dfLarge[dfLarge.columns[dID]] == lg_match_id].copy()
            sm_match['Match'] = lg
            lg_match['Match'] = lg
            sm_match.set_index('Match', inplace=True)
            lg_match.set_index('Match', inplace=True)
            matches = lg_match.join(sm_match, how='left')
            matches_master = matches_master.append(matches)
        dfLarge = dfLarge.loc[dfLarge[dfLarge.columns[dID]] != lg_match_id].copy()
    return [matches_master, dfLarge, dfSmall]

IIUC, the match is just to find, for each transaction in the small DataFrame, the transaction in the large DataFrame that falls on the same date or the closest future date. You can use pandas.merge_asof() to perform a match based on the closest date in the future.
import pandas as pd
# Ensure your dates are datetime
df_large['AIssue Date'] = pd.to_datetime(df_large['AIssue Date'])
df_small['BIssue Date'] = pd.to_datetime(df_small['BIssue Date'])
merged = pd.merge_asof(df_small, df_large, left_on='BIssue Date',
                       right_on='AIssue Date', direction='forward')
merged is now:
BID BAmount BIssue Date AID AAmount AIssue Date
0 1063 -300 2018-03-06 1508 -560 2018-03-14
1 1062 -260 2018-03-06 1508 -560 2018-03-14
2 839 -35 2018-03-22 1506 -35 2018-03-27
3 423 5000 2018-04-24 1500 5000 2018-04-25
If you expect some rows to never match, you can also pass a tolerance to restrict matches to a smaller window; that way a missing value in one DataFrame doesn't throw everything off.
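For example, a minimal sketch with an assumed 30-day window (the exact tolerance is up to you; note that merge_asof requires both frames to be sorted by their date keys):
# tolerance limits how far forward a match may be; 30 days is an assumption
merged = pd.merge_asof(df_small, df_large, left_on='BIssue Date',
                       right_on='AIssue Date', direction='forward',
                       tolerance=pd.Timedelta(days=30))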

In my module iterate_one_to_many, I was counting the number of rows incorrectly. I needed to replace
end_row = len(dfLarge.columns[dID]) - 1
with
end_row = len(dfLarge.index)

Related

Python: Getting "list index out of range" error; I know why but don't know how to resolve this

I am currently working on a data science project. The idea is to clean the data from "glassdoor_jobs.csv" and present it in a much more understandable manner.
import pandas as pd
df = pd.read_csv('glassdoor_jobs.csv')
#salary parsing
#Removing "-1" Ratings
#Clean up "Founded"
#state field
#Parse out job description
df['hourly'] = df['Salary Estimate'].apply(lambda x: 1 if 'per hour' in x.lower() else 0)
df['employer_provided'] = df['Salary Estimate'].apply(lambda x: 1 if 'employer provided salary' in x.lower() else 0)
df = df[df['Salary Estimate'] != '-1']
Salary = df['Salary Estimate'].apply(lambda x: x.split('(')[0])
minus_Kd = Salary.apply(lambda x: x.replace('K', '').replace('$',''))
minus_hr = minus_Kd.apply(lambda x: x.lower().replace('per hour', '').replace('employer provided salary:', ''))
df['min_salary'] = minus_hr.apply(lambda x: int(x.split('-')[0]))
df['max_salary'] = minus_hr.apply(lambda x: int(x.split('-')[1]))
I am getting the error at that last line. After digging a bit, I found out that in minus_hr, some of the 'Salary Estimate' values only have one number instead of a range:
index  Salary Estimate
0      150
1      58
2      130
3      125-150
4      110-140
5      200
6      67- 77
And so on. Now I'm trying to figure out how to work around the "list index out of range" error, and make max_salary the same as min_salary for the cells with only one value.
I am also trying to get the average of the min and max salary, and if the cell only has a single value, make that value the average.
So in the end, something like index 0 would look like:
index  min  max  average
0      150  150  150
You'll have to add in a conditional statement somewhere.
df['min_salary'] = minus_hr.apply(lambda x: int(x.split('-')[0]) if '-' in x else int(x))
The above might do it, or you can define a function.
def max_salary(cell_value):
    if '-' in cell_value:
        max_sal = int(cell_value.split('-')[1])
    else:
        max_sal = int(cell_value)
    return max_sal

df['max_salary'] = minus_hr.apply(lambda x: max_salary(x))

def avg_salary(cell_value):
    if '-' in cell_value:
        salaries = [int(s) for s in cell_value.split('-')]
        avg = sum(salaries) / len(salaries)
    else:
        avg = int(cell_value)
    return avg

df['avg_salary'] = minus_hr.apply(lambda x: avg_salary(x))
Swap in an equivalent min_salary function and repeat; a sketch follows.
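A possible sketch of that min_salary counterpart, assuming the same minus_hr Series as above:
def min_salary(cell_value):
    # take the value before the dash, or the single value if no range is given
    if '-' in cell_value:
        min_sal = int(cell_value.split('-')[0])
    else:
        min_sal = int(cell_value)
    return min_sal

df['min_salary'] = minus_hr.apply(lambda x: min_salary(x))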
Test the length of x.split('-') before accessing the elements.
def split_salary(x):
    salaries = x.split('-')
    if len(salaries) == 1:
        # only one salary number is given, so use the same value for min and max
        return int(salaries[0]), int(salaries[0])
    # two salary numbers are given
    return int(salaries[0]), int(salaries[1])

df['min_salary'] = minus_hr.apply(lambda x: split_salary(x)[0])
df['max_salary'] = minus_hr.apply(lambda x: split_salary(x)[1])
If you want to avoid .apply()...
Try:
import numpy as np
# extract the two numbers (if there are two numbers) from the 'Salary Estimate' column
sals = df['Salary Estimate'].str.extractall(r'(?P<min_salary>\d+)[^0-9]*(?P<max_salary>\d*)?')
# reset the new frame's index
sals = sals.reset_index()
# join the extracted min/max salary columns to the original dataframe, replacing empty matches with nan
df = df.join(sals[['min_salary', 'max_salary']].replace('', np.nan))
# fill any nan values in the 'max_salary' column with values from the 'min_salary' column
df['max_salary'] = df['max_salary'].fillna(df['min_salary'])
# set the type of the columns to int
df['min_salary'] = df['min_salary'].astype(int)
df['max_salary'] = df['max_salary'].astype(int)
# calculate the average
df['average_salary'] = df.loc[:,['min_salary', 'max_salary']].mean(axis=1).astype(int)
# see what you've got
print(df)
Or without using regex:
import numpy as np
# extract the two numbers (if there are two numbers) from the 'Salary Estimate' column
df['sals'] = df['Salary Estimate'].str.split('-')
# expand the list in sals to two columns filling with nan
df[['min_salary', 'max_salary']] = pd.DataFrame(df.sals.tolist()).fillna(np.nan)
# delete the sals column
del df['sals']
# fill any nan values in the 'max_salary' column with values from the 'min_salary' column
df['max_salary'] = df['max_salary'].fillna(df['min_salary'])
# set the type of the columns to int
df['min_salary'] = df['min_salary'].astype(int)
df['max_salary'] = df['max_salary'].astype(int)
# calculate the average
df['average_salary'] = df.loc[:,['min_salary', 'max_salary']].mean(axis=1).astype(int)
# see what you've got
print(df)
Output:
Salary Estimate min_salary max_salary average_salary
0 150 150 150 150
1 58 58 58 58
2 130 130 130 130
3 125-150 125 150 137
4 110-140 110 140 125
5 200 200 200 200
6 67- 77 67 77 72

Is there a faster way to split a pandas dataframe into two complementary parts?

Good evening all,
I have a situation where I need to split a dataframe into two complementary parts based on the value of one feature.
What I mean by this is that for every row in dataframe 1, I need a complementary row in dataframe 2 that takes on the opposite value of that specific feature.
In my source dataframe, the feature I'm referring to is stored under column "773", and it can take on values of either 0.0 or 1.0.
I came up with the following code that does this sufficiently, but it is remarkably slow. It takes about a minute to split 10,000 rows, even on my all-powerful EC2 instance.
data = chunk.iloc[:, 1:776]
listy1 = []
listy2 = []
for i in range(0, len(data)):
    random_row = data.sample(n=1).iloc[0]
    listy1.append(random_row.tolist())
    if random_row["773"] == 0.0:
        x = data[data["773"] == 1.0].sample(n=1).iloc[0]
        listy2.append(x.tolist())
    else:
        x = data[data["773"] == 0.0].sample(n=1).iloc[0]
        listy2.append(x.tolist())
df1 = pd.DataFrame(listy1)
df2 = pd.DataFrame(listy2)
Note: I don't care about duplicate rows, because this data is being used to train a model that compares two objects to tell which one is "better."
Do you have some insight into why this is so slow, or any suggestions as to make this faster?
A key concept in efficient numpy/scipy/pandas coding is to use library-shipped vectorized functions whenever possible: try to process many rows at once instead of iterating explicitly over rows, i.e. avoid for loops and .iterrows().
The implementation below is a little subtle in terms of indexing, but the vectorization idea is straightforward:
Draw the main dataset at once.
For the complementary dataset: draw the 0-rows at once and the complementary 1-rows at once, then put them into the corresponding rows at once.
Code:
import pandas as pd
import numpy as np
from datetime import datetime

np.random.seed(52)  # reproducibility
n = 10000
df = pd.DataFrame(
    data={
        "773": [0, 1] * int(n / 2),
        "dummy1": list(range(n)),
        "dummy2": list(range(0, 10 * n, 10))
    }
)
t0 = datetime.now()
print("Program begins...")
# 1. draw the main dataset
draw_idx = np.random.choice(n, n)  # draw with replacement
df_main = df.iloc[draw_idx, :].reset_index(drop=True)
# 2. draw the complementary dataset
# (1) count number of 1's and 0's
n_1 = np.count_nonzero(df["773"][draw_idx].values)
n_0 = n - n_1
# (2) split data for drawing
df_0 = df[df["773"] == 0].reset_index(drop=True)
df_1 = df[df["773"] == 1].reset_index(drop=True)
# (3) draw n_1 indexes in df_0 and n_0 indexes in df_1
idx_0 = np.random.choice(len(df_0), n_1)
idx_1 = np.random.choice(len(df_1), n_0)
# (4) broadcast the drawn rows into the complementary dataset
df_comp = df_main.copy()
mask_0 = (df_main["773"] == 0).values
df_comp.iloc[mask_0, :] = df_1.iloc[idx_1, :].values   # df_1 rows go where df_main has 0
df_comp.iloc[~mask_0, :] = df_0.iloc[idx_0, :].values  # df_0 rows go where df_main has 1
print(f"Program ends in {(datetime.now() - t0).total_seconds():.3f}s...")
Check
print(df_main.head(5))
773 dummy1 dummy2
0 0 28 280
1 1 11 110
2 1 13 130
3 1 23 230
4 0 86 860
print(df_comp.head(5))
773 dummy1 dummy2
0 1 19 190
1 0 74 740
2 0 28 280 <- this row is complementary to df_main
3 0 60 600
4 1 37 370
Efficiency gain: 14.23s -> 0.011s (roughly 1300x)

Pandas: filter by date proximity

I have a frame like:
id title date
0 1211 jingle bells 2019-01-15
1 1212 jingle bells 2019-01-15
2 1225 tom boat 2019-06-15
3 2112 tom boat 2019-06-15
4 3122 tom boat 2017-03-15
5 1762 tom boat 2017-03-15
An item is defined as a group of ids with the same title and with dates within 70 days of the first. I need a dictionary of ids grouped by title where the dates are within 70 days of each other. The expected outcome here is:
d = {0: [1211,1212], 1: [1225,2112], 2: [3122,1762]}
Any given title can have an uncapped number of dictionary entries, or just one. ids are unique to one title. At the moment, I do something like:
itemlist = []
for i in list(df.title):
    dates = list(df.loc[df.title == i, 'date'])
    if (max(dates) - min(dates)).days > 70:
        items = []
        while len(dates) > 0:
            extract = [j for j in dates if (j - min(dates)).days < 70]
            items.append(list(df.loc[(df.title == i) & (df.date.isin(extract)), 'id']))
            dates = [j for j in dates if j not in extract]
    else:
        items = [list(df.loc[df.title == i, 'id'])]
    itemlist += items
d = {j: i for i in range(len(itemlist)) for j in itemlist[i]}
It doesn't quite work yet; I'm still bug-fixing. That said, I feel like this is a lot of iteration - any ideas on how to do this better?
Another acceptable output would be a list of dataframes, one per item.
I think sorting your dataframe can help you solve the problem much more efficiently.
df = df.sort_values(['title', 'date'])
itemlist = []
counter = 0  # so ids can be fetched by position in constant time
for title in df.title.unique():  # unique() preserves the sorted order, keeping counter aligned with the rows
    dates = df.loc[df['title'] == title].date.tolist()
    item = []
    min_date = dates[0]
    for date in dates:
        if (date - min_date).days > 70:   # we need a new item
            itemlist.append(item)         # append the finished item
            item = [df.iloc[counter, 0]]  # start a new item
            min_date = date
        else:
            item.append(df.iloc[counter, 0])
        counter += 1
    itemlist.append(item)
d = {i: j for i, j in enumerate(itemlist)}
print(d)
Even though the code became a bit long, there are only two loops (besides the last one that turns the list into a dict), and they loop n_rows times in total, which means every row is looked at only once.
The counter is there so that df.iloc can be used, which works on positional indexes (instead of labels or boolean conditions like df.loc) and hence retrieves each id in O(1).
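If you prefer the list-of-dataframes output mentioned in the question, a small follow-up sketch (assuming the id column is named 'id' as in the sample frame) could be:
# one dataframe per item, in the same order as the dictionary keys
dfs = [df[df['id'].isin(ids)] for ids in d.values()]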

Create Multiple dataframes from a large text file

Using Python, how do I break a text file into dataframes where every 84 rows is a new, different dataframe? The first column x_ft has the same value for every 84 rows, then increments by 5 ft for the next 84 rows. I need each identical x_ft value, along with the corresponding values in the other two columns (depth_ft and vel_ft_s), to end up in the new dataframe too.
My text file is formatted like this:
x_ft depth_ft vel_ft_s
0 270 3535.755 551.735107
1 270 3534.555 551.735107
2 270 3533.355 551.735107
3 270 3532.155 551.735107
4 270 3530.955 551.735107
.
.
33848 2280 3471.334 1093.897339
33849 2280 3470.134 1102.685547
33850 2280 3468.934 1113.144287
33851 2280 3467.734 1123.937134
I have tried many, many different ways but keep running into errors and would really appreciate some help.
I suggest looking into pandas.read_table, which outputs a DataFrame directly. Once you have read the file, you can isolate the rows you are looking to separate (every 84 rows, one block per x_ft value) by doing something like this:
df = ...  # read the txt data table with pandas, e.g. pandas.read_table (see the sketch below)
arr = []
# This gives you an array of all x_ft values in your dataset
for x in range(0, 403):
    val = 270 + 5 * x
    arr.append(val)
# This generates a csv file for every specific x_ft value, with its corresponding columns (depth_ft and vel_ft_s)
for x_value in arr:
    tempdf = df[df['x_ft'] == x_value]
    tempdf.to_csv("df" + str(x_value) + ".csv")
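For the read step, a minimal sketch might look like the following; the filename data.txt and the whitespace delimiter are assumptions about your file:
import pandas as pd

# hypothetical filename; sep=r'\s+' handles whitespace-aligned columns
df = pd.read_table('data.txt', sep=r'\s+')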
You can get indexes to split your data:
rows = 84
datasets = round(len(data) / rows)  # total datasets
index_list = []
for index in data.index:
    x = index % rows
    if x == 0:
        index_list.append(index)
print(index_list)
So, split the original dataset by those indexes:
l_mod = index_list + [len(data)]  # include the final chunk up to the end of the data
dfs_list = [data.iloc[l_mod[n]:l_mod[n+1]] for n in range(len(l_mod) - 1)]
print(len(dfs_list))
Outputs
print(type(dfs_list[1]))
# pandas.core.frame.DataFrame
print(len(dfs_list[0]))
# 84

How do I add the value_counts for two columns, on each row of a DataFrame?

I have this DataFrame that I have taken from another DataFrame. It has the start station for a bike trip and an end station. I plan to add them to a network using networkx and from_pandas_dataframe(). I just need to make another Series/column for the weights.
For each row, I want to find the value_counts of the start station and the end station and add them together as a weight.
So for the first entry I would find the occurrences of stations 3058 and 3082, add them, and place the result in the weight column like this.
EDIT: Adding code as requested:
df = data[['start_station','end_station']]
a = df.start_station.value_counts()
b = df.end_station.value_counts()
pd.options.display.max_rows=300
c = a + b
And here's the dataset: https://ufile.io/cxbov
You could do it like this:
df = pd.read_csv('metro.csv')
s = df[['start_station','end_station']].apply(pd.value_counts).sum(1)
df_out = df[['start_station','end_station']].assign(weight = df['start_station'].map(s) + df['end_station'].map(s))
print(df_out.head())
Output:
start_station end_station weight
0 3058 3082 6248
1 3058 3082 6248
2 4147 4174 496
3 4157 4162 903
4 3013 3013 100
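If the end goal is the networkx graph mentioned in the question, a possible follow-up sketch (using from_pandas_edgelist, the current replacement for from_pandas_dataframe) would be:
import networkx as nx

# build a weighted graph from the start/end station pairs
G = nx.from_pandas_edgelist(df_out, source='start_station',
                            target='end_station', edge_attr='weight')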
