Merging pandas dataframes based on nearest value(s) - python

I have two dataframes, say A and B, that have some columns named attr1, attr2, attrN.
I have a certain distance measure, and I would like to merge the dataframes, such that each row in A is merged with the row in B that has the shortest distance between attributes. Note that rows in B can be repeated when merging.
For example (with one attribute to keep things simple), merging these two tables using the absolute difference distance |A.attr1 - B.attr1|
A | attr1      B | attr1
0 | 10         0 | 15
1 | 20         1 | 27
2 | 30         2 | 80
should yield the following merged table
M | attr1_A | attr1_B
0 | 10      | 15
1 | 20      | 15
2 | 30      | 27
My current approach is slow: it compares each row of A with each row of B. The code is also not clear because I have to preserve indices for the merge. I am not satisfied with it at all, but I cannot come up with a better solution.
How can I perform the merge as above using pandas? Are there any convenience methods or functions that can be helpful here?
EDIT: Just to clarify, in the dataframes there are also other columns which are not used in the distance calculation, but have to be merged as well.

One way you could do it is as follows:
import pandas as pd

A = pd.DataFrame({'attr1': [10, 20, 30]})
B = pd.DataFrame({'attr1': [15, 15, 27]})
Use a cross join to get all combinations
For pandas 1.2+, use how='cross':
merged_AB = A.merge(B, how='cross', suffixes=('_A', '_B'))
For older pandas versions, use a pseudo key:
A = A.assign(key=1)
B = B.assign(key=1)
merged_AB = pd.merge(A, B, on='key', suffixes=('_A', '_B'))
Now let's find the min distances in merged_AB
M = merged_AB.groupby('attr1_A').apply(
    lambda x: abs(x['attr1_A'] - x['attr1_B']) == abs(x['attr1_A'] - x['attr1_B']).min())
merged_AB[M.values].drop_duplicates()  # add .drop('key', axis=1) if you used the pseudo-key version
Output:
   attr1_A  attr1_B
0       10       15
3       20       15
8       30       27
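For the single-attribute, absolute-difference case shown here, pd.merge_asof with direction='nearest' can avoid the cross join entirely. A minimal sketch, assuming both frames are (or can be) sorted on the match column, using the frames from the question:
import pandas as pd

A = pd.DataFrame({'attr1': [10, 20, 30]})
B = pd.DataFrame({'attr1': [15, 27, 80]})

# merge_asof needs both sides sorted on their match keys; renaming first keeps
# both attribute columns visible in the result, and rows of B may be reused.
M = pd.merge_asof(
    A.sort_values('attr1').rename(columns={'attr1': 'attr1_A'}),
    B.sort_values('attr1').rename(columns={'attr1': 'attr1_B'}),
    left_on='attr1_A', right_on='attr1_B',
    direction='nearest')
Any other columns in A and B are carried along automatically. This only handles a single sort key, though; for a genuine multi-attribute distance, the cross-join approach above (or a nearest-neighbour lookup such as scipy.spatial.cKDTree) is still needed.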

Related

Inner join in pandas

I have two dataframes:
The first one was extracted from the manifest database. It describes the value, the route (origin and destination), and the actual SLA:
awb_number   route   value    sla_actual (days)
01           A - B   24,000   2
02           A - C   25,000   3
03           C - B   29,000   5
04           B - D   35,000   6
The second dataframe describes the route (origin and destination) and the internal SLA (3PL SLA):
route   sla_partner (days)
A - B   4
B - A   3
A - C   3
B - D   5
I would like to investigate the gap between the actual SLA and the 3PL SLA, so I join these two dataframes on the route.
I expected the result to look like this:
awb_number   route   value    sla_actual   sla_partner
01           A - B   24,000   2            4
02           A - C   25,000   3            3
03           C - B   29,000   5            NaN
04           B - D   35,000   6            5
What I have done is:
df_sla_check = pd.merge(df_actual, df_sla_partner, on = ['route_city_lazada'], how = 'inner')
The first dataframe has 36,000 rows while the second dataframe has 20,000 rows, but the merge returns over 700,000 rows. Is there something wrong with my logic? Isn't it supposed to return somewhere between 20,000 and 36,000 rows?
Can somebody help me how to do this correctly?
Thank you in advance
Apply a left outer join. I think it will solve the problem.
Following the points raised by @boi-doingthings and @Peddi Santhoshkumar, I would also suggest using a left join, such as the following for your datasets:
df_sla_check = pd.merge(df_actual, df_sla_partner, on=['route'], how='left')
For what you are showing, 'route' may be the appropriate name for your column.
Please confirm the joining field passed in the on argument.
Further, you should check the number of unique keys on which the join is happening.
The most natural cause of the spike in the joined dataframe is that one record of df1 gets mapped to multiple records of df2 and vice-versa.
df1.route.value_counts()
df2.route.value_counts()
The alternative is to change the how parameter to 'left'.
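Note that switching to a left join on its own does not remove the blow-up when the key is duplicated on both sides; each duplicated key still produces a per-key Cartesian product. A small illustrative sketch (with made-up routes, not your data) showing the multiplication and one way to keep a single partner row per route:
import pandas as pd

# Duplicated route keys on both sides multiply rows in the merge result.
df_actual = pd.DataFrame({'awb_number': ['01', '02', '03'],
                          'route': ['A - B', 'A - B', 'C - B']})
df_sla_partner = pd.DataFrame({'route': ['A - B', 'A - B'],
                               'sla_partner': [4, 4]})

merged = pd.merge(df_actual, df_sla_partner, on='route', how='left')
print(len(merged))  # 5 rows, not 3: each 'A - B' shipment matches both partner rows

# Keeping one SLA row per route restores one output row per awb_number.
dedup = df_sla_partner.drop_duplicates(subset='route')
print(len(pd.merge(df_actual, dedup, on='route', how='left')))  # 3 rows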

Pandas.DataFrame: find the index of the row whose value in a given column is closest to (but below) a specified value

In a Pandas.DataFrame, I would like to find the index of the row whose value in a given column is closest to (but below) a specified value. Specifically, say I am given the number 40 and the DataFrame df:
| | x |
|---:|----:|
| 0 | 11 |
| 1 | 15 |
| 2 | 17 |
| 3 | 25 |
| 4 | 54 |
I want to find the index of the row such that df["x"] is lower but as close as possible to 40. Here, the answer would be 3 because df.loc[3, 'x'] == 25 is smaller than the given number 40 but closest to it.
My dataframe has other columns, but I can assume that the column "x" is increasing.
For an exact match, I did (correct me if there is a better method):
matches = df[df.x == number].index.tolist()
if matches:
    result = matches[0]
But for the general case, I do not know how to do it in a "vectorized" way.
Filter rows below 40 with Series.lt in boolean indexing and get the maximal index value with Series.idxmax:
a = df.loc[df['x'].lt(40), 'x'].idxmax()
print (a)
3
To improve performance, it is possible to use numpy.where with np.max; this solution works with a default RangeIndex:
import numpy as np

a = np.max(np.where(df['x'].lt(40))[0])
print (a)
3
If the index is not the default RangeIndex:
df = pd.DataFrame({'x':[11,15,17,25,54]}, index=list('abcde'))
a = np.max(np.where(df['x'].lt(40))[0])
print (a)
3
print (df.index[a])
d
How about this:
import pandas as pd
data = {'x':[0,1,2,3,4,20,50]}
df = pd.DataFrame(data)
# select the rows meeting the condition
sub_df = df[df['x'] < 40]
# get the index of the maximum remaining value
idx = sub_df['x'].idxmax()
print(idx)
Use Series.where to mask values greater than or equal to n, then use Series.idxmax to obtain the closest one:
n=40
val = df['x'].where(df['x'].lt(n)).idxmax()
print(val)
3
We could also use Series.mask:
df['x'].mask(df['x'].ge(40)).idxmax()
or a callable with loc[]:
df['x'].loc[lambda x: x.lt(40)].idxmax()
#alternative
#df.loc[lambda col: col['x'].lt(40),'x'].idxmax()
If the index is not the default RangeIndex:
i = df.loc[lambda col: col['x'].lt(40),'x'].reset_index(drop=True).idxmax()
df.index[i]
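Since the question notes that column 'x' can be assumed increasing, Series.searchsorted is another option: it returns the position where the target would be inserted to keep the column sorted, and the row just before that position holds the closest value below. A sketch, assuming 'x' is sorted ascending and at least one value lies below the target:
import pandas as pd

df = pd.DataFrame({'x': [11, 15, 17, 25, 54]}, index=list('abcde'))

pos = df['x'].searchsorted(40)   # insertion position that keeps 'x' sorted
print(pos - 1)                   # positional index of the closest value below 40 -> 3
print(df.index[pos - 1])         # its label -> 'd'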

How to perform element wise operation on two sets of columns in pandas

I have the dataframe:
c1 | c2 | c3 | c4
 5 |  4 |  9 |  3
How could I perform element-wise division (or some other operation) between the column pair c1, c2 and the column pair c3, c4, so that the outcome is:
.5555 | 1.33333
I've tried:
df[['c1', 'c2']].div(df[['c3', 'c4']], axis='index')
But that just resulted in NaNs.
Pretty straightforward: just divide by the underlying values:
df[['c1', 'c2']]/df[['c3','c4']].values
Order matters, so make sure to use the correct column ordering in the denominator. There is no need to recreate the DataFrame.
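The NaNs in the .div attempt come from pandas aligning on column labels: c1/c2 and c3/c4 share no labels, so every pairing is missing. Using .values above drops the labels from the denominator so the division happens by position; the same idea with the more explicit .to_numpy(), as a sketch:
import pandas as pd

df = pd.DataFrame({'c1': [5], 'c2': [4], 'c3': [9], 'c4': [3]})

# Label-aligned division: no shared column names, so the result is all NaN.
print(df[['c1', 'c2']].div(df[['c3', 'c4']]))

# Positional division: strip the labels from the denominator first.
print(df[['c1', 'c2']].div(df[['c3', 'c4']].to_numpy()))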
One solution is to drop down to NumPy and create a new dataframe:
res = pd.DataFrame(df[['c1', 'c2']].values / df[['c3', 'c4']].values)
print(res)
          0         1
0  0.555556  1.333333
I'm not positive I'm understanding your question correctly, but you can literally just divide the series:
df['c1/c2'] = df['c1'] / df['c2']
See this answer: How to divide two column in a dataframe
EDIT: Okay, I understand what the OP is asking now. Please see the other answers.

Producing every combination of columns from one pandas dataframe in python

I'd like to take a dataframe and visualize how useful each column is in a k-neighbors analysis, so I was wondering if there is a way to loop through dropping columns and analyzing the dataframe in order to produce an accuracy for every single combination of columns. I'm not sure whether there are pandas functions I'm not aware of that could make this easier, or how to loop through the dataframe to produce every combination of the original dataframe. If I have not explained it well, I will try to create a diagram.
a | b | c | labels
1 | 2 | 3 |      0
5 | 6 | 7 |      1
The dataframe above would produce something like this after being run through the splitting and k-neighbors function:
a & b = 43%
a & c = 56%
b & c = 78%
a & b & c = 95%
import itertools

min_size = 2
max_size = df.shape[1]
column_subsets = itertools.chain(
    *map(lambda x: itertools.combinations(df.columns, x),
         range(min_size, max_size + 1)))
for column_subset in column_subsets:
    foo(df[list(column_subset)])
where df is your dataframe and foo is whatever k-neighbors analysis you're doing. Although you said "all combinations", I put min_size at 2 since your example only shows sizes >= 2. And these are more precisely referred to as "subsets" rather than "combinations".
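If it helps, here is a hedged end-to-end sketch of that loop, assuming scikit-learn is available, the target column is named 'labels' as in your example, and foo is a k-neighbors fit-and-score (the name knn_accuracy_per_subset is made up for illustration):
import itertools
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def knn_accuracy_per_subset(df, label_col='labels', min_size=2):
    """Return {feature subset: test accuracy} for every subset of feature columns."""
    features = [c for c in df.columns if c != label_col]
    subsets = itertools.chain.from_iterable(
        itertools.combinations(features, r)
        for r in range(min_size, len(features) + 1))
    results = {}
    for subset in subsets:
        X_train, X_test, y_train, y_test = train_test_split(
            df[list(subset)], df[label_col], random_state=0)
        # The default KNeighborsClassifier uses 5 neighbours, so the
        # training split needs at least 5 rows for this to run.
        model = KNeighborsClassifier().fit(X_train, y_train)
        results[subset] = model.score(X_test, y_test)
    return results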

Pandas (python): max in columns define new value in new column

I have a df with about 50 columns:
Product ID | Cat1 | Cat2 | Cat3 | ... other columns ...
8937456    |    0 |    5 |   10 |
8497534    |   25 |    3 |    0 |
8754392    |    4 |   15 |    7 |
Each Cat column signifies how many units of that product fell into that category. Now I want to add a column "Category" denoting the majority category for a product (ignoring the other columns and considering only the Cat columns).
df_goal:
Product ID | Cat1 | Cat2 | Cat3 | Category | ... other columns ...
8937456    |    0 |    5 |   10 |        3 |
8497534    |   25 |    3 |    0 |        1 |
8754392    |    4 |   15 |    7 |        2 |
I think I need to use max and apply or map?
I found these on Stack Overflow, but they don't address the category assignment. In Excel I renamed the columns from Cat1 to 1 and used INDEX(MATCH(MAX)).
Python Pandas max value of selected columns
How should I take the max of 2 columns in a dataframe and make it another column?
Assign new value in DataFrame column based on group max
Here's a NumPy way with numpy.argmax -
df['Category'] = df.values[:,1:].argmax(1)+1
To restrict the selection to those columns, use the column names specifically, then use idxmax, and finally replace the string 'Cat' with an empty string, like so -
df['Category'] = df[['Cat1','Cat2','Cat3']].idxmax(1).str.replace('Cat','')
numpy.argmax or panda's idxmax basically gets us the ID of max element along an axis.
If we know that the Cat columns occupy the 2nd through 4th columns (positions 1 to 3), we can slice the dataframe with df.iloc[:, 1:4] instead of df[['Cat1','Cat2','Cat3']].
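With roughly 50 columns, listing the Cat columns by hand gets tedious; a sketch that selects them by name prefix instead (this assumes the only existing columns whose names contain 'Cat' are the category counts):
import pandas as pd

df = pd.DataFrame({'Product ID': [8937456, 8497534, 8754392],
                   'Cat1': [0, 25, 4],
                   'Cat2': [5, 3, 15],
                   'Cat3': [10, 0, 7]})

# Select the count columns by substring match; do this before adding
# 'Category', which would itself match 'Cat' on a re-run.
cat_cols = df.filter(like='Cat')
df['Category'] = cat_cols.idxmax(axis=1).str.replace('Cat', '').astype(int)
print(df)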
