I have two DataFrames that I want to concatenate horizontally, matching rows on the value of a column. The pandas documentation (pandas.pydata.org) gives this example:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']},
index=[0, 1, 2, 3])
df4 = pd.DataFrame({'B': ['B2', 'B3', 'B6', 'B7'],
'D': ['D2', 'D3', 'D6', 'D7'],
'F': ['F2', 'F3', 'F6', 'F7']},
index=[2, 3, 6, 7])
df1 =
A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
df4 =
B D F
2 B2 D2 F2
3 B3 D3 F3
6 B6 D6 F6
7 B7 D7 F7
result = pd.concat([df1, df4], axis=1, join='inner')
result =
A B C D B D F
2 A2 B2 C2 D2 B2 D2 F2
3 A3 B3 C3 D3 B3 D3 F3
This works, and I'm happy about it.
So I'm using this trick to merge two DataFrames on the value of a certain column: I set that column as the index of each DataFrame and then do the concatenation.
However, the values in that column are repeated, so I end up with DataFrames that have repeated index values:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']},
index=[0, 3, 3, 2])
df4 = pd.DataFrame({'B': ['B2', 'B3', 'B6', 'B7'],
'D': ['D2', 'D3', 'D6', 'D7'],
'F': ['F2', 'F3', 'F6', 'F7']},
index=[2, 3, 6, 7])
df1 =
A B C D
0 A0 B0 C0 D0
3 A1 B1 C1 D1
3 A2 B2 C2 D2
2 A3 B3 C3 D3
df4 =
B D F
2 B2 D2 F2
3 B3 D3 F3
6 B6 D6 F6
7 B7 D7 F7
I would expect these two DataFrames to join so that I end up with:
result =
A B C D B D F
3 A1 B1 C1 D1 B3 D3 F3
3 A2 B2 C2 D2 B3 D3 F3
2 A3 B3 C3 D3 B2 D2 F2
(Notice that the two rows with index 3 in df1 both join with the row with index 3 in df4.) However, this doesn't work; pd.concat raises:
ValueError: Shape of passed values is (7, 5), indices imply (7, 3)
How can I achieve that? If I could avoid merging by index and specify a column instead, that would be even better.
One possible solution is merge, matching on the index. how='inner' is the default, so it can be omitted:
result = pd.merge(df1, df4, left_index=True, right_index=True)
print (result)
A B_x C D_x B_y D_y F
2 A3 B3 C3 D3 B2 D2 F2
3 A1 B1 C1 D1 B3 D3 F3
3 A2 B2 C2 D2 B3 D3 F3
When the index is duplicated on both sides, it creates every combination of the matching rows:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']},
index=[0, 3, 3, 3])
df4 = pd.DataFrame({'B': ['B2', 'B3', 'B6', 'B7'],
'D': ['D2', 'D3', 'D6', 'D7'],
'F': ['F2', 'F3', 'F6', 'F7']},
index=[2, 3, 3, 7])
print (df1)
A B C D
0 A0 B0 C0 D0
3 A1 B1 C1 D1
3 A2 B2 C2 D2
3 A3 B3 C3 D3
print (df4)
B D F
2 B2 D2 F2
3 B3 D3 F3
3 B6 D6 F6
7 B7 D7 F7
result = pd.merge(df1, df4, left_index=True, right_index=True)
print (result)
A B_x C D_x B_y D_y F
3 A1 B1 C1 D1 B3 D3 F3
3 A1 B1 C1 D1 B6 D6 F6
3 A2 B2 C2 D2 B3 D3 F3
3 A2 B2 C2 D2 B6 D6 F6
3 A3 B3 C3 D3 B3 D3 F3
3 A3 B3 C3 D3 B6 D6 F6
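The row count above follows from the duplicates: each index label contributes (occurrences on the left) × (occurrences on the right) rows. As a rough sketch, the size of an inner merge can be estimated before running it; the trimmed-down frames and the names left_counts, right_counts and expected_rows are only illustrative.

```python
import pandas as pd

# Trimmed-down versions of the frames above (one column each is enough here).
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3']}, index=[0, 3, 3, 3])
df4 = pd.DataFrame({'F': ['F2', 'F3', 'F6', 'F7']}, index=[2, 3, 3, 7])

# How many times each index label occurs on each side.
left_counts = df1.index.value_counts()
right_counts = df4.index.value_counts()

# A label missing on one side is filled with 0, so it contributes no rows.
rows_per_label = left_counts.mul(right_counts, fill_value=0)
expected_rows = int(rows_per_label.sum())

result = pd.merge(df1, df4, left_index=True, right_index=True)
print(expected_rows, len(result))
```

If the estimate is much larger than either input, that is usually a sign the key is not as unique as assumed.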
Another possible solution is to use join:
df1.join(df4,how='inner', lsuffix='_df1', rsuffix='_df4')
Output:
A B_df1 C D_df1 B_df4 D_df4 F
2 A3 B3 C3 D3 B2 D2 F2
3 A1 B1 C1 D1 B3 D3 F3
3 A2 B2 C2 D2 B3 D3 F3
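Since the question also asks about matching on a column rather than the index, here is a minimal sketch of that variant. The column name key and the reduced frames are assumptions made for illustration; validate='many_to_one' is an optional guard that makes merge raise if the right-hand key turns out to be duplicated.

```python
import pandas as pd

# Hypothetical frames: 'key' plays the role of the repeated index values.
df1 = pd.DataFrame({'key': [0, 3, 3, 2],
                    'A': ['A0', 'A1', 'A2', 'A3']})
df4 = pd.DataFrame({'key': [2, 3, 6, 7],
                    'F': ['F2', 'F3', 'F6', 'F7']})

# Many rows in df1 may share a key; each key is expected at most once in df4.
# With validate='many_to_one', pandas raises a MergeError if that does not hold.
result = df1.merge(df4, on='key', how='inner', validate='many_to_one')
print(result)
```

Because the key is an ordinary column here, no suffixes are needed unless the two frames share other column names.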
Not sure if this can be done with pandas or if I need to write a loop with some logic.
I have some data representing chains of pairs of nodes:
pairs = [
# A1 -> B1 -> C1
{'source': 'A1', 'target': 'B1'},
{'source': 'B1', 'target': 'C1'},
# A1 -> D1
{'source': 'A1', 'target': 'D1'},
# C2 -> A2 -> B2
{'source': 'C2', 'target': 'A2'},
{'source': 'A2', 'target': 'B2'},
]
And I want to resolve those chains to create the list of nodes they contain:
results = [
['A1', 'B1', 'C1', 'D1'],
['C2', 'A2', 'B2'],
]
So far I have this code which does allow me to match some of those nodes together:
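import numpy as np
import pandas as pd
df = pd.DataFrame(pairs)  # assumed: df is built from the pairs list above
def pair_nodes(df, src, tgt):
    # Collect the unique partners of each node, then prepend the node itself.
    df = df.groupby([src]).agg({tgt: 'unique'}).reset_index()
    df['nodes'] = df.apply(lambda r: np.append(r[src], r[tgt]), axis=1)
    return df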
df1 = pair_nodes(df, 'source', 'target')
df2 = pair_nodes(df, 'target', 'source')
print(df1)
print(df2)
Which gives me:
source target nodes
0 A1 [B1, D1] [A1, B1, D1]
1 A2 [B2] [A2, B2]
2 B1 [C1] [B1, C1]
3 C2 [A2] [C2, A2]
target source nodes
0 A2 [C2] [A2, C2]
1 B1 [A1] [B1, A1]
2 B2 [A2] [B2, A2]
3 C1 [B1] [C1, B1]
4 D1 [A1] [D1, A1]
And I'm stuck there. What I guess I'm missing is a way to merge rows from df1 and df2 whenever source or target is found in nodes.
I had a look at df.merge but it only seems to work for exact key match.
Can this be achieved with pandas or do I need to write a custom loop/logic to do this?
Producing the desired result by merging DataFrames can be complicated.
The merging logic used above will not handle every kind of graph. Have a look at the method below.
# Create graph
graph = {}
for pair in pairs:
    if pair['source'] in graph:
        graph[pair['source']].append(pair['target'])
    else:
        graph[pair['source']] = [pair['target']]
# Graph
print(graph)
{
'A1': ['B1', 'D1'],
'B1': ['C1'],
'C2': ['A2'],
'A2': ['B2']
}
# Generating list of nodes
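start = 'A1' # Starting node parameter
result = [start]
for each in result:  # the list grows while it is iterated, so newly added nodes are visited too
    if each in graph:
        result.extend(graph[each])
result = sorted(set(result))  # deduplicate; sorting keeps the order deterministic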
# Output
print(result)
['A1', 'B1', 'C1', 'D1']
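The traversal above needs a starting node and only follows edges from source to target, so a chain such as C2 -> A2 -> B2 is found only if you start at C2. One way to get every group at once is to treat the pairs as an undirected graph and collect its connected components. The sketch below is a plain-Python take on that idea, not something pandas provides; the function name connected_components and the choice to order nodes by first appearance are assumptions.

```python
from collections import defaultdict, deque

pairs = [
    {'source': 'A1', 'target': 'B1'},
    {'source': 'B1', 'target': 'C1'},
    {'source': 'A1', 'target': 'D1'},
    {'source': 'C2', 'target': 'A2'},
    {'source': 'A2', 'target': 'B2'},
]

def connected_components(pairs):
    """Group nodes that are linked directly or indirectly by any pair."""
    adjacency = defaultdict(set)
    order = {}  # node -> position of first appearance, used for stable output
    for pair in pairs:
        src, tgt = pair['source'], pair['target']
        adjacency[src].add(tgt)
        adjacency[tgt].add(src)  # undirected: links are followed both ways
        for node in (src, tgt):
            order.setdefault(node, len(order))

    seen = set()
    components = []
    for start in order:
        if start in seen:
            continue
        # Breadth-first search from every node not reached so far.
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(sorted(component, key=order.get))
    return components

results = connected_components(pairs)
print(results)
```

For large inputs a dedicated graph library may be a better fit, but the loop above has no dependencies beyond the standard library.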
Having a dataframe which looks like this:
import pandas as pd
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']},
index=[0, 1, 2, 3])
How can I rearrange the dataframe when one column has been given a different order and I want to apply that order to all the other columns? For example, after changing the order of column A as in this example:
df2 = pd.DataFrame({'A': ['A3', 'A0', 'A2', 'A1'],
'B': ['B3', 'B0', 'B2', 'B1'],
'C': ['C3', 'C0', 'C2', 'C1'],
'D': ['D3', 'D0', 'D2', 'D1']},
index=[0, 1, 2, 3])
You can use indexing via set_index, reindex and reset_index. This assumes the values in A are unique, which is the only case where such a transformation makes sense.
L = ['A3', 'A0', 'A2', 'A1']
res = df1.set_index('A').reindex(L).reset_index()
print(res)
A B C D
0 A3 B3 C3 D3
1 A0 B0 C0 D0
2 A2 B2 C2 D2
3 A1 B1 C1 D1
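As an alternative sketch, the custom order can be expressed as an ordered categorical and the frame sorted by it. Unlike the reindex approach, this should also tolerate repeated values in A, although that case is not exercised here. The variable name order is an assumption.

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])

order = ['A3', 'A0', 'A2', 'A1']  # desired order of the values in column A

# Sorting an ordered categorical follows the category order, not the alphabet.
res = (df1.assign(A=pd.Categorical(df1['A'], categories=order, ordered=True))
          .sort_values('A', kind='mergesort')  # stable sort keeps ties in place
          .reset_index(drop=True))

# Convert A back to plain strings so the result looks like the input.
res['A'] = res['A'].astype(str)
print(res)
```

Values of A that are missing from order become NaN in the categorical and are sorted to the end by default.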
Did you mean to sort one specific row? If so, use:
df1.iloc[:1] = df1.iloc[:1].sort_index(axis=1,ascending=False)
print(df1)
To sort the rows in descending index order across all columns, use:
df1 = df1.sort_index(axis=0,ascending=False)
For specific columns, select them with iloc.
You can use the key parameter of the built-in sorted function:
import pandas as pd
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']},
index=[0, 1, 2, 3])
key = {'A3': 0, 'A0': 1, 'A2' : 2, 'A1': 3}
df1['A'] = sorted(df1.A, key=lambda e: key.get(e, 4))
print(df1)
Output
A B C D
0 A3 B0 C0 D0
1 A0 B1 C1 D1
2 A2 B2 C2 D2
3 A1 B3 C3 D3
By changing the values of key, you can set whatever order you want.
UPDATE
If what you want is to alter the order of the other columns based on the new order of A, you could try something like this:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']},
index=[0, 1, 2, 3])
df2 = pd.DataFrame({'A': ['A3', 'A0', 'A2', 'A1'],
'B': ['B0', 'B1', 'B2', 'B3'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']},
index=[0, 1, 2, 3])
key = [df1.A.values.tolist().index(k) for k in df2.A]
df2.B = df2['B'][key].tolist()
print(df2)
Output
A B C D
0 A3 B3 C0 D0
1 A0 B0 C1 D1
2 A2 B2 C2 D2
3 A1 B1 C3 D3
To alter all the columns, just apply the above to each column. Something like this:
for column in df2.columns.values:
    if column != 'A':
        df2[column] = df2[column][key].tolist()
print(df2)
Output
A B C D
0 A3 B3 C3 D3
1 A0 B0 C0 D0
2 A2 B2 C2 D2
3 A1 B1 C1 D1
I'd like to interleave the rows of one dataframe with the rows of another, so that each row of df1 is followed by the matching row of df2. Eventually I'd like to do this for several columns of df1 and df2 (not only D and E).
I've got two different dataframes:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']},
index=[0, 1, 2, 3])
df2 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3'],
'C': ['C0', 'C1', 'C2', 'C3'],
'E': ['E0', 'E1', 'E2', 'E3']},
index=[0, 1, 2, 3])
And I'd like to merge them like
df3 = pd.DataFrame({'A': ['A0', 'A0', 'A1', 'A1', 'A2', 'A2', 'A3', 'A3'],
'B': ['B0', 'B0', 'B1', 'B1', 'B2', 'B2', 'B3', 'B3'],
'C': ['C0', 'C0', 'C1', 'C1', 'C2', 'C2', 'C3', 'C3'],
'D': ['D0', 'E0', 'D1', 'E1', 'D2', 'E2', 'D3', 'E3']},
index=[0, 1, 2, 3, 4, 5, 6, 7])
1) Using pd.concat and sort_index
In [1006]: (pd.concat([df1, df2.rename(columns={'E': 'D'})])
.sort_index().reset_index(drop=True))
Out[1006]:
A B C D
0 A0 B0 C0 D0
1 A0 B0 C0 E0
2 A1 B1 C1 D1
3 A1 B1 C1 E1
4 A2 B2 C2 D2
5 A2 B2 C2 E2
6 A3 B3 C3 D3
7 A3 B3 C3 E3
2) Or, using append and sort_index (DataFrame.append was removed in pandas 2.0, so this variant needs an older version)
In [1007]: df1.append(df2.rename(columns={'E': 'D'})).sort_index().reset_index(drop=True)
Out[1007]:
A B C D
0 A0 B0 C0 D0
1 A0 B0 C0 E0
2 A1 B1 C1 D1
3 A1 B1 C1 E1
4 A2 B2 C2 D2
5 A2 B2 C2 E2
6 A3 B3 C3 D3
7 A3 B3 C3 E3
Test
In [1009]: (pd.concat([df1, df2.rename(columns={'E': 'D'})])
.sort_index().reset_index(drop=True)
.equals(df3))
Out[1009]: True
In [1010]: pd.concat([df1, df2.rename(columns={'E': 'D'})]).equals(df3)
Out[1010]: False
The concat function lets you combine multiple DataFrames:
frames = [df1, df2.rename(columns={'E': 'D'})]
pd.concat(frames)
You can add additional DataFrames to the list, but you will have to rename columns for them to line up correctly.
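One detail worth hedging: after concat the index contains each label twice, and the interleaving relies on the df1 row staying ahead of the df2 row for the same label. The default sort algorithm does not promise that in general, so a cautious sketch asks for a stable sort explicitly. The variable name interleaved is an assumption.

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])
df2 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'E': ['E0', 'E1', 'E2', 'E3']},
                   index=[0, 1, 2, 3])

# Stack the frames, then sort by the (duplicated) index with a stable
# algorithm so that, for each label, the df1 row stays before the df2 row.
interleaved = (pd.concat([df1, df2.rename(columns={'E': 'D'})])
                 .sort_index(kind='mergesort')
                 .reset_index(drop=True))
print(interleaved)
```

To interleave more than one pair of columns, extend the rename mapping so that every df2 column lands on the df1 column it should alternate with.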
Given two large DataFrames, is there concise and efficient code (without an explicit for loop) that gives me the complement of the two?
The most straightforward way I can think of is to compute union minus intersection, as in the naive example below, but I do not know how to express this elegantly in pandas or NumPy.
df1= pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
'key2': ['K0', 'K1', 'K0', 'K1'],
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']})
df2= pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
'key2': ['K0', 'K0', 'K0', 'K0'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']})
intersection= pd.merge(df1, df2, how='inner',on=['key1', 'key2'])
union=pd.merge(df1, df2, how='outer',on=['key1', 'key2'])
complement=union-intersection
thanks for any comments and answers
Starting with this:
df1= pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
'key2': ['K0', 'K1', 'K0', 'K1'],
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']})
df2= pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
'key2': ['K0', 'K0', 'K0', 'K0'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']})
intersection = pd.merge(df1, df2, how='inner',on=['key1', 'key2'])
union = pd.merge(df1, df2, how='outer',on=['key1', 'key2'])
print(union)
A B key1 key2 C D
0 A0 B0 K0 K0 C0 D0
1 A1 B1 K0 K1 NaN NaN
2 A2 B2 K1 K0 C1 D1
3 A2 B2 K1 K0 C2 D2
4 A3 B3 K2 K1 NaN NaN
5 NaN NaN K2 K0 C3 D3
print(intersection)
A B key1 key2 C D
0 A0 B0 K0 K0 C0 D0
1 A2 B2 K1 K0 C1 D1
2 A2 B2 K1 K0 C2 D2
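For union minus intersection, keep the rows of union that matched on only one side. Provided the original frames contain no NaN of their own, those are exactly the rows with a missing value: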
union[union.isnull().any(axis=1)]
A B key1 key2 C D
1 A1 B1 K0 K1 NaN NaN
4 A3 B3 K2 K1 NaN NaN
5 NaN NaN K2 K0 C3 D3
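The NaN test breaks down if the input frames already contain missing values. A sketch that avoids that assumption uses the indicator parameter of merge, which adds a _merge column saying where each row came from. The variable names merged and complement are assumptions.

```python
import pandas as pd

df1 = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                    'key2': ['K0', 'K1', 'K0', 'K1'],
                    'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3']})
df2 = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                    'key2': ['K0', 'K0', 'K0', 'K0'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']})

# indicator=True adds a '_merge' column with the values
# 'left_only', 'right_only' or 'both' for every row of the outer merge.
merged = pd.merge(df1, df2, how='outer', on=['key1', 'key2'], indicator=True)

# The complement is everything that did not match on both sides.
complement = merged[merged['_merge'] != 'both'].drop(columns='_merge')
print(complement)
```

Filtering on 'left_only' or 'right_only' alone gives the rows that exist in just one of the two frames.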
import pandas as pd
left = pd.DataFrame({'A': ['A1', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3'],
'C': ['K0', 'K1', 'K0', 'K1']})
right = pd.DataFrame({'AA': ['A1', 'A3'],
'BB': ['B0', 'B3'],
'CC': ['K0', 'K1'],
'DD': ['D0', 'D1']})
I want to join these two data frames by adding column DD to left. The values of DD should be selected by comparing A with AA, B with BB, and C with CC.
The simple join would look like the line below, but in my case I need to compare columns with different names, and I only want to add DD to left.
result = left.join(right, on='DD')
The result should be:
import numpy as np
result = pd.DataFrame({'A': ['A1', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3'],
'C': ['K0', 'K1', 'K0', 'K1'],
'DD': ['D0', np.nan, np.nan, 'D1']})
Use the pandas merge method with the left_on and right_on parameters.
left.merge(right, how='left',
left_on=['A', 'B', 'C'],
right_on=['AA', 'BB', 'CC'])[['A', 'B', 'C', 'DD']]
gets you:
A B C DD
0 A1 B0 K0 D0
1 A1 B1 K1 NaN
2 A2 B2 K0 NaN
3 A3 B3 K1 D1
It looks like you want to merge.
However, at the moment the column names don't match up (A is AA in right).
So first let's normalize them:
In [11]: right.columns = right.columns.map(lambda x: x[0])
Then we can merge on the shared columns:
In [12]: left.merge(right)
Out[12]:
A B C D
0 A1 B0 K0 D0
1 A3 B3 K1 D1
In [13]: left.merge(right, how="outer")
Out[13]:
A B C D
0 A1 B0 K0 D0
1 A1 B1 K1 NaN
2 A2 B2 K0 NaN
3 A3 B3 K1 D1
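Note that assigning to right.columns changes right in place and also turns DD into D. If you would rather keep right untouched and keep the name DD, one possible sketch renames only the key columns on a copy and then does a left merge; the explicit mapping and the name renamed are assumptions.

```python
import pandas as pd

left = pd.DataFrame({'A': ['A1', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3'],
                     'C': ['K0', 'K1', 'K0', 'K1']})
right = pd.DataFrame({'AA': ['A1', 'A3'],
                      'BB': ['B0', 'B3'],
                      'CC': ['K0', 'K1'],
                      'DD': ['D0', 'D1']})

# rename returns a new frame, so 'right' itself is left unchanged.
renamed = right.rename(columns={'AA': 'A', 'BB': 'B', 'CC': 'C'})

# how='left' keeps every row of 'left'; rows without a match get NaN in DD.
result = left.merge(renamed, on=['A', 'B', 'C'], how='left')
print(result)
```

Because the key columns share names after the rename, no extra AA, BB or CC columns appear in the result and no column selection is needed afterwards.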