Inner join in pandas - python

I have two dataframes:
The first one was extracted from the manifest database. It describes the value, the route (origin and destination), and the actual SLA (in days):
awb_number route value sla_actual (days)
01 A - B 24,000 2
02 A - C 25,000 3
03 C - B 29,000 5
04 B - D 35,000 6
The second dataframe describes the route (origin and destination) and the internal SLA (3PL SLA):
route sla_partner (days)
A - B 4
B - A 3
A - C 3
B - D 5
I would like to investigate the gap between the actual SLA and the 3PL SLA, so I join these two dataframes on the route.
I expected the result to look like this:
awb_number route value sla_actual sla_partner
01 A - B 24,000 2 4
02 A - C 25,000 3 3
03 C - B 29,000 5 NaN
04 B - D 35,000 6 5
What I have done is:
df_sla_check = pd.merge(df_actual, df_sla_partner, on = ['route_city_lazada'], how = 'inner')
The first dataframe has 36,000 rows while the second has 20,000 rows, but the result has over 700,000 rows. Is there something wrong with my logic? Isn't it supposed to return around 20,000-36,000 rows?
Can somebody help me how to do this correctly?
Thank you in advance

Apply a left outer join. I think it will solve the problem.

Following the points raised by @boi-doingthings and @Peddi Santhoshkumar, I would also suggest using a left join, such as the following for your datasets:
df_sla_check = pd.merge(df_actual, df_sla_partner, on=['route'], how='left')
For what you are showing, 'route' may be the appropriate name for your column.

Please confirm the joining field passed in the on argument.
Further, you should check the number of unique keys on which the join is happening.
The most natural cause of the spike in the joined dataframe is that one record of df1 gets mapped to multiple records of df2 and vice-versa.
df1.route.value_counts()   # how often each route appears in the first dataframe
df2.route.value_counts()   # how often each route appears in the second dataframe
Alternatively, change the how parameter to 'left'.
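Putting those checks together, a minimal sketch (assuming the column names route, sla_actual and sla_partner, and that each route should map to exactly one partner SLA) could look like this:
# drop duplicate routes in the partner table so each route maps to a single SLA
df_sla_partner_unique = df_sla_partner.drop_duplicates(subset=['route'])

df_sla_check = pd.merge(
    df_actual,
    df_sla_partner_unique,
    on='route',
    how='left',              # keep every shipment, even routes with no partner SLA
    validate='many_to_one',  # raise if the right side still has duplicate routes
)
df_sla_check['gap_days'] = df_sla_check['sla_actual'] - df_sla_check['sla_partner']
This keeps the row count at 36,000 (one row per shipment) instead of multiplying rows for every duplicated route.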

How can I merge/aggregate two dataframes in Pandas while subtracting column values?

I'm working on a rudimentary inventory system and am having trouble finding a solution to this obstacle. I've got two Pandas dataframes, both sharing two columns: PLU and QTY. PLU acts as an item identifier, and QTY is the quantity of the item in one dataframe, while being the quantity sold in another. Here are two very simple examples of what the data looks like:
final_purch:
PLU QTY
12345678 12
90123456 7
78901234 2
pmix_diff:
PLU QTY
12345678 9
90123456 3
78901234 1
In this case, I'd want to find any matching PLUs and subtract the pmix_diff QTY from the final_purch QTY.
In an earlier part of the project, I used aggregate functions to get rid of duplicates while summing the QTY column. It worked great, but I can't find a way to do something similar here with subtraction. I'm fairly new to Python/Pandas, so any help is greatly appreciated. :)
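For reference, the example frames above can be reconstructed like this, so the answers below can be run as-is (a sketch; names taken from the question):
import pandas as pd

final_purch = pd.DataFrame({'PLU': [12345678, 90123456, 78901234], 'QTY': [12, 7, 2]})
pmix_diff = pd.DataFrame({'PLU': [12345678, 90123456, 78901234], 'QTY': [9, 3, 1]})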
Here is one way to do that,
using assign and merge:
final_purch.assign(QTY=final_purch['QTY'] - final_purch.merge(pmix_diff, on='PLU', suffixes=('', '_y'), how='left')['QTY_y'].fillna(0))
PLU QTY
0 12345678 3
1 90123456 4
2 78901234 1
You may do:
df = final_purch.set_index('PLU').join(pmix_diff.set_index('PLU'), lsuffix='final', rsuffix='pmix')
df['QTYdiff'] = df['QTYfinal']-df['QTYpmix']
output:
QTYfinal QTYpmix QTYdiff
PLU
12345678 12 9 3
90123456 7 3 4
78901234 2 1 1

Is it normal for pandas to need 30 secs to perform calculations on a 50k-row dataframe?

I use the pandas read_excel function to work with data. I have two excel files with 70k rows and 3 columns (the first column is date), and it only takes 4-5 seconds to combine, align the data, delete any rows with incomplete data and return a new dataframe (df) with 50k rows and 4 columns, where date is the index.
Then, I use the code below to perform some calculations and add another 2 columns to my df:
for i, row in df.iterrows():
    df["new_column1"] = df["column1"] - 2 * df["column4"]
    df["new_column2"] = df["column1"] - 2.5 * df["column4"]
It takes approx 30 seconds for the above code to execute, even though the calculations are simple. Is this normal, or is there a way to speed up the execution? (I am on Win 10, 16 GB RAM and an i7-8565U processor.)
I am not particularly interested in adding more columns to my dataframe - getting the two new columns as lists would suffice.
Thanks.
Note that the code in your loop contains neither row nor i.
So drop the for ... row loop and execute just:
df["new_column1"] = df["column1"] - 2 * df["column4"]
df["new_column2"]= df["column1"] - 2.5 * df["column4"]
It is enough to execute the above code only once, not in a loop.
Your code unnecessarily performs the above operations multiple times
(actually as many times as your DataFrame has rows) and this
is why it takes so long.
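If, as the question mentions, plain lists are enough, the vectorized result can be converted directly (a small sketch):
new_column1 = (df["column1"] - 2 * df["column4"]).tolist()
new_column2 = (df["column1"] - 2.5 * df["column4"]).tolist()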
Edit following question as of 18:59Z
To perform vectorized operations like "check one column and do something
to another column", use the following schema, based on boolean indexing.
Assume that the source df contains:
column1 column4
0 1 11
1 2 12
2 3 13
3 4 14
4 5 15
5 6 16
6 7 17
7 8 18
Then if you want to:
select rows with even value in column1,
and add some value (e.g. 200) to column4,
run:
df.loc[df.column1 % 2 == 0, 'column4'] += 200
In this example:
df.column1 % 2 == 0 - provides boolean indexing over rows,
column4 - selects the particular column,
+= 200 - performs the actual operation.
The result is:
column1 column4
0 1 11
1 2 212
2 3 13
3 4 214
4 5 15
5 6 216
6 7 17
7 8 218
But there are more complex cases, when the condition involves calling
some custom code or you want to update several columns.
In such cases you should use either iterrows or apply, but these
operations execute much more slowly.
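As a rough sketch of such a case (the column names are the ones from the example above; the rule itself is made up for illustration), apply over rows can update several columns from custom logic, at the cost of speed:
def custom_update(row):
    # hypothetical rule: for even values in column1, adjust both columns
    if row["column1"] % 2 == 0:
        row["column4"] += 200
        row["column1"] -= 1
    return row

df = df.apply(custom_update, axis=1)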

Subsetting into several data frames by identifier in Python?

I wish to subset a data frame in Python by identifier. For instance, suppose we have the below data:
ID Number
A 50
A 45
A 21
B 78
B 79
B 12
C 15
C 74
C 10
I want to split the data into three separate data frames, i.e. all data for A would be the first data frame, B would be the second, C the third.
I'm having trouble going about this. I've tried using set for unique values but am thinking this is not the way to go about it. Any help appreciated.
Is this what you want? (PS: it auto-assigns a name to each DataFrame.)
variables = locals()
for i in df['ID'].unique():
    variables["df{0}".format(i)] = df.loc[df.ID == i]
dfA
Out[147]:
  ID  Number
0  A      50
1  A      45
2  A      21
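A more conventional sketch (avoiding dynamically created variable names via locals()) keeps the pieces in a dictionary keyed by ID:
frames = {key: group for key, group in df.groupby('ID')}
frames['A']   # all rows where ID == 'A'
A dictionary scales to any number of identifiers and keeps the sub-frames easy to iterate over.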

Merging pandas dataframes based on nearest value(s)

I have two dataframes, say A and B, that have some columns named attr1, attr2, attrN.
I have a certain distance measure, and I would like to merge the dataframes, such that each row in A is merged with the row in B that has the shortest distance between attributes. Note that rows in B can be repeated when merging.
For example (with one attribute to keep things simple), merging these two tables using the absolute difference distance |A.attr1 - B.attr1|
A | attr1        B | attr1
0 | 10           0 | 15
1 | 20           1 | 27
2 | 30           2 | 80
should yield the following merged table
M | attr1_A attr1_B
0 | 10 15
1 | 20 15
2 | 30 27
My current way of doing this is slow: it compares each row of A with each row of B. The code is also not clear, because I have to preserve indices for the merge. I am not satisfied with it, but I cannot come up with a better solution.
How can I perform the merge as above using pandas? Are there any convenience methods or functions that can be helpful here?
EDIT: Just to clarify, in the dataframes there are also other columns which are not used in the distance calculation, but have to be merged as well.
One way you could do it is as follows:
A = pd.DataFrame({'attr1':[10,20,30]})
B = pd.DataFrame({'attr1':[15,15,27]})
Use a cross join to get all combinations
Update: for pandas 1.2+ use how='cross':
merged_AB = A.merge(B, how='cross', suffixes=('_A', '_B'))
Older pandas versions: use a pseudo key...
A = A.assign(key=1)
B = B.assign(key=1)
merged_AB = pd.merge(A, B, on='key', suffixes=('_A', '_B'))
Now let's find the min distances in merged_AB
M = merged_AB.groupby('attr1_A').apply(lambda x: abs(x['attr1_A'] - x['attr1_B']) == abs(x['attr1_A'] - x['attr1_B']).min())
merged_AB[M.values].drop_duplicates().drop('key', axis=1)  # the drop('key', ...) is only needed for the pseudo-key version
Output:
attr1_A attr1_B
0 10 15
3 20 15
8 30 27
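For a single numeric attribute, another sketch (not part of the answer above) uses pd.merge_asof with direction='nearest'; both frames must be sorted on the key, and B's key is copied into a separate column so it survives the merge:
nearest = pd.merge_asof(
    A.sort_values('attr1'),
    B.assign(attr1_B=B['attr1']).sort_values('attr1'),
    on='attr1',
    direction='nearest',
).rename(columns={'attr1': 'attr1_A'})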

NaNs after merging two dataframes

I have two dataframes like the following:
df1
id name
-------------------------
0 43 c
1 23 t
2 38 j
3 9 s
df2
user_id    id
--------------------------------------------------
0 222087 27,26
1 1343649 6,47,17
2 404134 18,12,23,22,27,43,38,20,35,1
3 1110200 9,23,2,20,26,47,37
I want to split all the ids in df2 into multiple rows and join the resultant dataframe to df1 on "id".
I do the following:
b = pd.DataFrame(df2['id'].str.split(',').tolist(), index=df2.user_id).stack()
b = b.reset_index()[[0, 'user_id']]  # the split column is currently labeled 0
b.columns = ['id', 'user_id']
When I try to merge, I get NaNs in the resultant dataframe.
pd.merge(b, df1, on = "id", how="left")
id user name
-------------------------------------
0 27 222087 NaN
1 26 222087 NaN
2 6 1343649 NaN
3 47 1343649 NaN
4 17 1343649 NaN
So, I tried doing the following:
b['name'] = np.nan
for i in range(0, len(df1)):
    b['name'][(b['id'] == df1['id'][i])] = df1['name'][i]
It still gives the same result as above. I am confused as to what could cause this because I am sure both of them should work!
Any help would be much appreciated!
I read similar posts on SO but none seemed to have a concrete answer. I am also not sure whether this is related to my code at all.
Thanks in advance!
The problem is that you need to convert the id column produced by the split (in b) to int, because the output of string functions is always string, even when the values look numeric:
b.id = b.id.astype(int)
Another solution is to convert df1.id to string:
df1.id = df1.id.astype(str)
You get NaNs because there is no match - str values don't match int values.
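On newer pandas (0.25+), a hedged sketch of the same split-and-merge uses explode, with the type conversion applied up front (names taken from the question):
b = (
    df2.assign(id=df2['id'].str.split(','))
       .explode('id')
       .astype({'id': int})
)
result = b.merge(df1, on='id', how='left')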
