Left merging does not work - python

When I merge two simple dataframes, then everything works fine. But when I apply the same code to my real dataframes, then the merging does not work correctly:
I want to merge df1 and df2 on column A using left joining.
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']})
df2 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3','A4','A5'],
'C': ['C0', 'C1', 'C2', 'C3','C4','C5'],
'D': ['D0', 'D1', 'D2', 'D3','D4','A5']})
result = pd.merge(df1, df2[["A","C"]], how='left', on='A')
In this case the result is correct (the number of rows in result is the same as df1).
However when I run the same code on my real data, the number of rows in result is much larger than df1 and is more similar to df2.
result = pd.merge(df1, df2[["ID","EVENT"]], how='left', on='ID')
The field ID is of type String (astype(str)).
What might be the reason for this? I cannot post the real dataset here, but maybe some hints can still be given based on my explanation. Thanks.
UPDATE:
I checked the dataframe result and I can see many duplicated rows having the same ID. Why?

See this slightly modified example (I modified the last two values in column A in df2):
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']})
df2 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3','A0','A0'],
'C': ['C0', 'C1', 'C2', 'C3','C4','C5'],
'D': ['D0', 'D1', 'D2', 'D3','D4','A5']})
result = pd.merge(df1, df2[["A","C"]], how='left', on='A')
Output:
A B C
0 A0 B0 C0
1 A0 B0 C4
2 A0 B0 C5
3 A1 B1 C1
4 A2 B2 C2
5 A3 B3 C3
There is one A0 row for each A0 in df2. This is also what is happening with your data.
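To confirm that duplicate keys are the cause, one option is to count the repeated values in the right-hand key column and let merge check the relationship for you. The following is only a minimal sketch on the toy frames above; keeping the first row per key with drop_duplicates is an assumption about what you want, and your real data may call for an aggregation instead.

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3']})
df2 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3', 'A0', 'A0'],
                    'C': ['C0', 'C1', 'C2', 'C3', 'C4', 'C5']})

# Keys that occur more than once on the right side multiply rows in a left join.
counts = df2['A'].value_counts()
dup_keys = counts[counts > 1]
print(dup_keys)

# validate='many_to_one' makes merge raise if the right-hand keys are not unique.
try:
    pd.merge(df1, df2[['A', 'C']], how='left', on='A', validate='many_to_one')
    raised = False
except pd.errors.MergeError:
    raised = True
print("merge rejected duplicate keys:", raised)

# Assumption: keeping the first row per key is acceptable for your use case.
right_unique = df2[['A', 'C']].drop_duplicates(subset='A', keep='first')
result = pd.merge(df1, right_unique, how='left', on='A', validate='many_to_one')
print(result)
```

With the right-hand keys made unique, the result has exactly one row per row of df1.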

Related

Reorder your dataframe by reordering one column

Having a dataframe which looks like this:
import pandas as pd
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']},
index=[0, 1, 2, 3])
I wonder how to rearrange the dataframe when one column has a different order that should be applied to all the others, for example after changing the A column as in this example:
df2 = pd.DataFrame({'A': ['A3', 'A0', 'A2', 'A1'],
'B': ['B3', 'B0', 'B2', 'B1'],
'C': ['C3', 'C0', 'C2', 'C1'],
'D': ['D3', 'D0', 'D2', 'D1']},
index=[0, 1, 2, 3])
You can use indexing via set_index, reindex and reset_index. Assumes your values in A are unique, which is the only case where such a transformation would make sense.
L = ['A3', 'A0', 'A2', 'A1']
res = df1.set_index('A').reindex(L).reset_index()
print(res)
A B C D
0 A3 B3 C3 D3
1 A0 B0 C0 D0
2 A2 B2 C2 D2
3 A1 B1 C1 D1
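If you would rather not go through the index, another possible approach is to turn A into an ordered categorical and sort on it. This is a sketch, not the method from the answer above; the list named order is the desired ordering and is assumed to contain every value of A exactly once.

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']})

# Desired order of column A (assumed to cover all values in df1['A']).
order = ['A3', 'A0', 'A2', 'A1']

# An ordered categorical sorts by category position, not alphabetically.
as_cat = pd.Categorical(df1['A'], categories=order, ordered=True)
res = (df1.assign(A=as_cat)
          .sort_values('A')
          .reset_index(drop=True))

# Convert A back to plain strings if the categorical dtype is not wanted.
res['A'] = res['A'].astype(str)
print(res)
```

Unlike reindex, this variant also works when values in A repeat, since rows are sorted rather than looked up by label.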
Did you mean to sort one specific row? If so, use:
df1.iloc[:1] = df1.iloc[:1].sort_index(axis=1,ascending=False)
print(df1)
For all columns, use:
df1 = df1.sort_index(axis=0,ascending=False)
For specific columns, use the iloc indexer.
You can use the key parameter from the sorted function:
import pandas as pd
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']},
index=[0, 1, 2, 3])
key = {'A3': 0, 'A0': 1, 'A2' : 2, 'A1': 3}
df1['A'] = sorted(df1.A, key=lambda e: key.get(e, 4))
print(df1)
Output
A B C D
0 A3 B0 C0 D0
1 A0 B1 C1 D1
2 A2 B2 C2 D2
3 A1 B3 C3 D3
By changing the values of key, you can set whatever order you want.
UPDATE
If what you want is to alter the order of the other columns based on the new order of A, you could try something like this:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']},
index=[0, 1, 2, 3])
df2 = pd.DataFrame({'A': ['A3', 'A0', 'A2', 'A1'],
'B': ['B0', 'B1', 'B2', 'B3'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']},
index=[0, 1, 2, 3])
key = [df1.A.values.tolist().index(k) for k in df2.A]
df2.B = df2['B'][key].tolist()
print(df2)
Output
A B C D
0 A3 B3 C0 D0
1 A0 B0 C1 D1
2 A2 B2 C2 D2
3 A1 B1 C3 D3
To alter all the columns, just apply the above to each column. Something like this:
for column in df2.columns.values:
    if column != 'A':
        # Reorder this column with the same positions computed for A.
        df2[column] = df2[column][key].tolist()
print(df2)
Output
A B C D
0 A3 B3 C3 D3
1 A0 B0 C0 D0
2 A2 B2 C2 D2
3 A1 B1 C1 D1

How to insert a row of df1 one time in two rows of df2 in pandas dataframe

I'd like to insert rows of a specific dataframe one time in two rows in another specific dataframe. At the end, I'd like to do this for several columns of df1 and df2 (not only D and E).
I've got two different dataframes:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']},
index=[0, 1, 2, 3])
df2 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3'],
'C': ['C0', 'C1', 'C2', 'C3'],
'E': ['E0', 'E1', 'E2', 'E3']},
index=[0, 1, 2, 3])
And I'd like to merge them like
df3 = pd.DataFrame({'A': ['A0', 'A0', 'A1', 'A1', 'A2', 'A2', 'A3', 'A3'],
'B': ['B0', 'B0', 'B1', 'B1', 'B2', 'B2', 'B3', 'B3'],
'C': ['C0', 'C0', 'C1', 'C1', 'C2', 'C2', 'C3', 'C3'],
'D': ['D0', 'E0', 'D1', 'E1', 'D2', 'E2', 'D3', 'E3']},
index=[0, 1, 2, 3, 4, 5, 6, 7])
1) Using pd.concat and sort_index
In [1006]: (pd.concat([df1, df2.rename(columns={'E': 'D'})])
.sort_index().reset_index(drop=True))
Out[1006]:
A B C D
0 A0 B0 C0 D0
1 A0 B0 C0 E0
2 A1 B1 C1 D1
3 A1 B1 C1 E1
4 A2 B2 C2 D2
5 A2 B2 C2 E2
6 A3 B3 C3 D3
7 A3 B3 C3 E3
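2) Or, using append and sort_index (DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so this works only on older versions)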
In [1007]: df1.append(df2.rename(columns={'E': 'D'})).sort_index().reset_index(drop=True)
Out[1007]:
A B C D
0 A0 B0 C0 D0
1 A0 B0 C0 E0
2 A1 B1 C1 D1
3 A1 B1 C1 E1
4 A2 B2 C2 D2
5 A2 B2 C2 E2
6 A3 B3 C3 D3
7 A3 B3 C3 E3
Test
In [1009]: (pd.concat([df1, df2.rename(columns={'E': 'D'})])
.sort_index().reset_index(drop=True)
.equals(df3))
Out[1009]: True
In [1010]: pd.concat([df1, df2.rename(columns={'E': 'D'})]).equals(df3)
Out[1010]: False
The concat function lets you combine multiple DataFrames:
frames = [df1, df2.rename(columns={'E': 'D'})]
pd.concat(frames)
You can add additional DataFrames to the list, but you will have to rename their columns for them to line up correctly.
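One detail worth noting about the concat and sort_index approach: the interleaving relies on rows with the same index label keeping their concat order after sorting. A sketch that asks for a stable sort explicitly, using the same df1 and df2 as in the question:

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']})
df2 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'E': ['E0', 'E1', 'E2', 'E3']})

# Stack the frames; each index label 0..3 now appears twice.
stacked = pd.concat([df1, df2.rename(columns={'E': 'D'})])

# mergesort is stable, so for each label the df1 row stays ahead of the df2 row.
result = stacked.sort_index(kind='mergesort').reset_index(drop=True)
print(result)
```

The default sort kind may well give the same result on small frames, but a stable kind makes the ordering explicit rather than incidental.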

column concat by specific column pandas

I have two DataFrames like
left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
'B': ['B0', 'B1', 'B2']},
index=['j0', 'j1', 'j2'])
right = pd.DataFrame({'A': ['A1', 'A0', 'A2'],
'D': ['D0', 'D2', 'D3']},
index=['K0', 'K2', 'K3'])
I want to column-bind them by the 'A' column. How can I achieve this?
I want to do this with pd.concat, not pd.merge.
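One possible way to do this with pd.concat is to make A the index of both frames, since concat along axis=1 aligns rows on index labels. This is a sketch that assumes the values in A are unique in each frame; note that the original index labels (j0, K0, ...) are discarded.

```python
import pandas as pd

left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                     'B': ['B0', 'B1', 'B2']},
                    index=['j0', 'j1', 'j2'])
right = pd.DataFrame({'A': ['A1', 'A0', 'A2'],
                      'D': ['D0', 'D2', 'D3']},
                     index=['K0', 'K2', 'K3'])

# concat(axis=1) aligns on the index, so use A as the index on both sides.
result = pd.concat([left.set_index('A'), right.set_index('A')], axis=1)

# Bring A back as an ordinary column.
result = result.rename_axis('A').reset_index()
print(result)
```

Each row of the result pairs the B and D values that share the same A value, regardless of the row order in right.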

A strange error, ValueError: Shape of passed values is (7, 4), indices imply (7, 2)

The code below throws an exception, ValueError: Shape of passed values is (7, 4), indices imply (7, 2).
df4 = pd.DataFrame({'E': ['B2', 'B3', 'B6', 'B7'],
'F': ['D2', 'D3', 'D6', 'D7'],
'G': ['F2', 'F3', 'F6', 'F7']},
index=[2, 2, 6, 7])
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
'B': ['B0', 'B1', 'B2'],
'C': ['C0', 'C1', 'C2'],
'D': ['D0', 'D1', 'D2']},
index=[0, 1, 2])
result00 = pd.concat([df1, df4], axis=1,join='inner')
I am confused about the error. How do I merge the two dataframes?
The result of merging I want is like below
You can use the merge() method:
In [122]: pd.merge(df1, df4, left_index=True, right_index=True)
Out[122]:
A B C D E F G
2 A2 B2 C2 D2 B2 D2 F2
2 A2 B2 C2 D2 B3 D3 F3
On older pandas versions (the join_axes argument was removed in pandas 1.0), you can use pd.concat in the following form:
result00 = pd.concat([df1, df4], axis=1, join_axes = [df4.index], join = 'inner').dropna()
The earlier code did not work because there is a duplicate index label in df4 (the label 2 appears twice). Hope this helps.
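To see the duplicate index for yourself before combining the frames, you can check Index.is_unique. The sketch below also shows DataFrame.join as an index-on-index alternative that tolerates repeated labels; it is one possible approach, not the only one.

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                    'B': ['B0', 'B1', 'B2'],
                    'C': ['C0', 'C1', 'C2'],
                    'D': ['D0', 'D1', 'D2']},
                   index=[0, 1, 2])
df4 = pd.DataFrame({'E': ['B2', 'B3', 'B6', 'B7'],
                    'F': ['D2', 'D3', 'D6', 'D7'],
                    'G': ['F2', 'F3', 'F6', 'F7']},
                   index=[2, 2, 6, 7])

# Check both indexes for repeated labels.
print(df1.index.is_unique, df4.index.is_unique)
dup_labels = df4.index[df4.index.duplicated()].tolist()
print(dup_labels)

# join matches rows on index labels and repeats the df1 row for each match.
result = df1.join(df4, how='inner')
print(result)
```

The joined frame has one row per matching pair of labels, so the df1 row with label 2 appears twice.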

How to merge two data frames based on different column names [duplicate]

import pandas as pd
left = pd.DataFrame({'A': ['A1', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3'],
'C': ['K0', 'K1', 'K0', 'K1']})
right = pd.DataFrame({'AA': ['A1', 'A3'],
'BB': ['B0', 'B3'],
'CC': ['K0', 'K1'],
'DD': ['D0', 'D1']})
I want to join these two data frames by adding column DD to left. The values of DD should be selected based on comparing A and AA, B and BB, C and CC.
The simple joining case would be as shown below, but in my case I need to compare columns with different names, and I want to add only DD to left.
result = left.join(right, on='DD')
The result should be:
result = pd.DataFrame({'A': ['A1', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3'],
'C': ['K0', 'K1', 'K0', 'K1'],
'DD': ['D0', NaN, NaN, 'D1']})
Use pandas merge method with left_on and right_on parameters.
left.merge(right, how='left',
left_on=['A', 'B', 'C'],
right_on=['AA', 'BB', 'CC'])[['A', 'B', 'C', 'DD']]
gets you:
A B C DD
0 A1 B0 K0 D0
1 A1 B1 K1 NaN
2 A2 B2 K0 NaN
3 A3 B3 K1 D1
It looks like you want to merge.
However, at the moment the column names don't match up (A is AA in right).
So first let's normalize them:
In [11]: right.columns = right.columns.map(lambda x: x[0])
Then we can merge on the shared columns:
In [12]: left.merge(right)
Out[12]:
A B C D
0 A1 B0 K0 D0
1 A3 B3 K1 D1
In [13]: left.merge(right, how="outer")
Out[13]:
A B C D
0 A1 B0 K0 D0
1 A1 B1 K1 NaN
2 A2 B2 K0 NaN
3 A3 B3 K1 D1
