I have 3 dataframes:
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])
df2 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'E': ['E0', 'E1', 'E2', 'E3']},
                   index=[0, 1, 2, 3])
df3 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'F': ['F0', 'F1', 'F2', 'F3']},
                   index=[0, 1, 2, 3])
I want to combine them together to get the following results:
A B C D E F
0 A0 B0 C0 D0 E0 F0
1 A1 B1 C1 D1 E1 F1
2 A2 B2 C2 D2 E2 F2
3 A3 B3 C3 D3 E3 F3
When I try to combine them, I keep getting:
A B C D A E A F
0 A0 B0 C0 D0 A0 E0 A0 F0
1 A1 B1 C1 D1 A1 E1 A1 F1
2 A2 B2 C2 D2 A2 E2 A2 F2
3 A3 B3 C3 D3 A3 E3 A3 F3
The common column (A) is duplicated once for each dataframe passed to the concat call. I have tried various combinations of:
df4 = pd.concat([df1, df2, df3], axis=1, sort=False)
Some variations have been disastrous while some keep giving the undesired result. Any suggestions would be much appreciated. Thanks.
Try
df4 = (pd.concat((df.set_index('A') for df in (df1,df2,df3)), axis=1)
.reset_index()
)
Output:
A B C D E F
0 A0 B0 C0 D0 E0 F0
1 A1 B1 C1 D1 E1 F1
2 A2 B2 C2 D2 E2 F2
3 A3 B3 C3 D3 E3 F3
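An alternative that avoids touching the index is to merge the frames pairwise on the shared column. This is only a sketch of a different technique (chained `merge` via `functools.reduce`), not what the answer above does, and it assumes every frame has the key column 'A' with matching values:

```python
from functools import reduce

import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']})
df2 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'E': ['E0', 'E1', 'E2', 'E3']})
df3 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'F': ['F0', 'F1', 'F2', 'F3']})

# Merge the frames one after another on the shared key column 'A',
# so 'A' appears only once in the result.
df4 = reduce(lambda left, right: left.merge(right, on='A'), [df1, df2, df3])
print(df4)
```

Unlike concat on a shared index, this performs an inner join by default, so rows whose 'A' value is missing from any frame would be dropped.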
I have two pandas DataFrames that look like this:
A0 B0 C0
A1 B1 C1
A2 B2 C2
A3 B3 C3
and
A2 D0 E0
A0 D1 E1
A3 D2 E2
A1 D3 E3
How can I produce this:
A0 B0 C0 D1 E1
A1 B1 C1 D3 E3
A2 B2 C2 D0 E0
A3 B3 C3 D2 E2
You are looking for merge:
df1 = pd.DataFrame( [['A0', 'B0', 'C0'],
['A1', 'B1', 'C1'],
['A2', 'B2', 'C2'],
['A3', 'B3', 'C3']])
df1.columns = ['c1', 'c2', 'c3']
df2 = pd.DataFrame([['A2', 'D0', 'E0'],
['A0', 'D1', 'E1'],
['A3', 'D2', 'E2'],
['A1', 'D3', 'E3']])
df2.columns = ['c1', 'c4', 'c5']
df1.merge(df2, on = 'c1', how = 'left')
Output:
c1 c2 c3 c4 c5
0 A0 B0 C0 D1 E1
1 A1 B1 C1 D3 E3
2 A2 B2 C2 D0 E0
3 A3 B3 C3 D2 E2
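To see what `how='left'` changes, here is a small sketch with a hypothetical variation of the data in which the key 'A1' is missing from the second frame. The left merge keeps every row of df1 and fills the gaps with NaN, while an inner merge drops the unmatched row:

```python
import pandas as pd

df1 = pd.DataFrame({'c1': ['A0', 'A1', 'A2', 'A3'],
                    'c2': ['B0', 'B1', 'B2', 'B3'],
                    'c3': ['C0', 'C1', 'C2', 'C3']})
# Hypothetical variation for illustration: 'A1' is absent on the right
df2 = pd.DataFrame({'c1': ['A2', 'A0', 'A3'],
                    'c4': ['D0', 'D1', 'D2'],
                    'c5': ['E0', 'E1', 'E2']})

left = df1.merge(df2, on='c1', how='left')    # all 4 rows of df1, NaN for 'A1'
inner = df1.merge(df2, on='c1', how='inner')  # only keys present in both frames

print(left)
print(inner)
```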
I'm in the early stages of building my first neural network and I'm brand new to Python.
I am at a roadblock because I don't know how to write code to shuffle my data together with its corresponding labels. I imported my CSV and used NumPy to create a matrix. I also created a matrix for my labels:
filepath = '/My Drive/t_data9(1).csv'
my_data = pd.read_csv('/content/gdrive' + filepath, index_col=0)
my_data_matrix = np.array(my_data)
labels = [0]*5000 + [1]*5000
labels_matrix = np.array(labels)
I can access my data, so it's there. I just need to mix it up before I can separate out some training and validation rows and feed them to the NN I am building with Keras. Please advise.
You can combine the features and labels into a single DataFrame and shuffle the whole sample as follows:
Dummy Example
import pandas as pd
my_data = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']},
index=[0, 1, 2, 3])
my_data.head()
A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
And labels
labels = [0]*2 + [1]*2
my_data['labels'] = labels
my_data.head()
A B C D labels
0 A0 B0 C0 D0 0
1 A1 B1 C1 D1 0
2 A2 B2 C2 D2 1
3 A3 B3 C3 D3 1
And shuffling:
my_data = my_data.sample(frac=1).reset_index(drop=True) # shuffling
my_data.head()
A B C D labels
0 A2 B2 C2 D2 1
1 A0 B0 C0 D0 0
2 A3 B3 C3 D3 1
3 A1 B1 C1 D1 0
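If the data is already in NumPy arrays, as in the question, another option is to index both arrays with the same random permutation. This is a minimal sketch with stand-in data (the real matrix would come from the CSV); the seed is only there to make the run repeatable:

```python
import numpy as np

# Stand-in data: 10 rows of 3 features, first 5 rows labelled 0, last 5 labelled 1
my_data_matrix = np.arange(30).reshape(10, 3)
labels_matrix = np.array([0] * 5 + [1] * 5)

rng = np.random.default_rng(seed=0)
perm = rng.permutation(len(my_data_matrix))  # one shared random row order

# Applying the same permutation to both arrays keeps rows and labels paired
shuffled_data = my_data_matrix[perm]
shuffled_labels = labels_matrix[perm]
```

After this, slicing both arrays at the same position (for example the first 80% of rows) gives matching training and validation sets.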
I want to stack two DataFrames horizontally without re-indexing the first DataFrame (df1), as its indices contain some important information. However, the indices of the second DataFrame (df2) have no significance and can be modified.
I could not find any way without converting the df2 to numpy and passing the indices of df1 at creation. For better understanding please find the below example.
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3'],
'D': ['D0', 'D1', 'D2', 'D3']},
index=[0, 2, 3,4])
df2 = pd.DataFrame({'A1': ['A4', 'A5', 'A6', 'A7'],
'C': ['C4', 'C5', 'C6', 'C7'],
'D2': ['D4', 'D5', 'D6', 'D7']},
index=[ 4, 5, 6 ,7])
print(df1)
print(df2)
A B D
-------------
0 A0 B0 D0
2 A1 B1 D1
3 A2 B2 D2
4 A3 B3 D3
A1 C D2
-------------
4 A4 C4 D4
5 A5 C5 D5
6 A6 C6 D6
7 A7 C7 D7
Result I want:
A B D A1 C D2
--------------------------
0 A0 B0 D0 A4 C4 D4
2 A1 B1 D1 A5 C5 D5
3 A2 B2 D2 A6 C6 D6
4 A3 B3 D3 A7 C7 D7
PS: I would prefer a "one-shot" command to achieve this instead of using loops and adding each value.
Change the index of df2 to the index of df1 and then concatenate the dataframes:
df2.index = df1.index
pd.concat([df1, df2], axis=1)
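If you would rather not overwrite df2's index in place, the same thing can be done in one expression. This sketch uses `DataFrame.set_axis`, which returns a relabelled copy, and assumes both frames have the same number of rows:

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 2, 3, 4])
df2 = pd.DataFrame({'A1': ['A4', 'A5', 'A6', 'A7'],
                    'C': ['C4', 'C5', 'C6', 'C7'],
                    'D2': ['D4', 'D5', 'D6', 'D7']},
                   index=[4, 5, 6, 7])

# Relabel a copy of df2 with df1's index, then stack side by side;
# df2 itself keeps its original index.
result = pd.concat([df1, df2.set_axis(df1.index, axis=0)], axis=1)
print(result)
```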
I have two dataframes with the same columns. Only one column has different values. I want to concatenate the two without duplicating the shared columns.
df2 = pd.DataFrame({'key': ['K0', 'K1', 'K2'],'cat': ['C0', 'C1', 'C2'],'B': ['B0', 'B1', 'B2']})
df1 = pd.DataFrame({'key': ['K0', 'K1', 'K2'],'cat': ['C0', 'C1', 'C2'],'B': ['A0', 'A1', 'A2']})
df1
Out[630]:
key cat B
0 K0 C0 A0
1 K1 C1 A1
2 K2 C2 A2
df2
Out[631]:
key cat B
0 K0 C0 B0
1 K1 C1 B1
2 K2 C2 B2
I tried:
result = pd.concat([df1, df2], axis=1)
result
Out[633]:
key cat B key cat B
0 K0 C0 A0 K0 C0 B0
1 K1 C1 A1 K1 C1 B1
2 K2 C2 A2 K2 C2 B2
The desired output:
key cat B_df1 B_df2
0 K0 C0 A0 B0
1 K1 C1 A1 B1
2 K2 C2 A2 B2
NOTE: I could drop duplicates afterwards and rename columns but that doesn't seem efficient
pd.merge will do the job
pd.merge(df1,df2, on=['key','cat'])
Output
key cat B_x B_y
0 K0 C0 A0 B0
1 K1 C1 A1 B1
2 K2 C2 A2 B2
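The merge above names the overlapping column B_x and B_y. To get the B_df1 and B_df2 names from the question directly, `pd.merge` accepts a `suffixes` argument. A short sketch with the same data:

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['K0', 'K1', 'K2'],
                    'cat': ['C0', 'C1', 'C2'],
                    'B': ['A0', 'A1', 'A2']})
df2 = pd.DataFrame({'key': ['K0', 'K1', 'K2'],
                    'cat': ['C0', 'C1', 'C2'],
                    'B': ['B0', 'B1', 'B2']})

# suffixes are appended to overlapping non-key column names (here only 'B')
result = pd.merge(df1, df2, on=['key', 'cat'], suffixes=('_df1', '_df2'))
print(result)
```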
How can I combine two dataframes, df1 and df2, to get a df3 that contains only the rows of df1 and df2 that have the same index (and the same values in the columns)?
df1 = pd.DataFrame({'A': ['A0', 'A2', 'A3', 'A7'],
'B': ['B0', 'B2', 'B3', 'B7'],
'C': ['C0', 'C2', 'C3', 'C7'],
'D': ['D0', 'D2', 'D3', 'D7']},
index=[0, 2, 3,7])
test 1
df2 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A7'],
'B': ['B0', 'B1', 'B2', 'B7'],
'C': ['C0', 'C1', 'C2', 'C7'],
'D': ['D0', 'D1', 'D2', 'D7']},
index=[0, 1, 2, 7])
test 2
df2 = pd.DataFrame({'A': ['A1'],
'B': ['B1'],
'C': ['C1'],
'D': ['D1']},
index=[1])
Expected output test 1
Out[13]:
A B C D
0 A0 B0 C0 D0
2 A2 B2 C2 D2
7 A7 B7 C7 D7
Expected output test 2
Empty DataFrame
Columns: [A, B, C, D]
Index: []
First, get the intersection of the indices. Next, find the rows where all the columns are identical, and keep only those rows.
idx = df1.index.intersection(df2.index)
mask = (df1.loc[idx] == df2.loc[idx]).all(axis=1)
df_out = df1.loc[idx][mask]
print(df_out)
You can also use df.isin (slightly different from the other answer):
df_out = df1[df1.isin(df2).all(axis=1)]
print(df_out)
Test 1
A B C D
0 A0 B0 C0 D0
2 A2 B2 C2 D2
7 A7 B7 C7 D7
Test 2
Empty DataFrame
Columns: [A, B, C, D]
Index: []
I believe this is a more Pythonic solution:
df1[df2.isin(df1)].dropna()
gives:
A B C D
0 A0 B0 C0 D0
2 A2 B2 C2 D2
7 A7 B7 C7 D7
pd.merge(df1.reset_index(), df2.reset_index()).set_index('index')
This adds the index of each dataframe as a column, then joins on all the columns (which now includes the index) and then sets the index to be back to the original values.
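As a sketch, the merge approach can be wrapped in a small helper and run against both test cases from the question. The name `common_rows` is made up for illustration, and the code assumes the index is unnamed, so that `reset_index()` produces a column called 'index':

```python
import pandas as pd


def common_rows(left, right):
    """Return rows that share both the index label and all column values."""
    # With no `on` given, merge joins on every column the frames share,
    # which here includes the former index.
    merged = pd.merge(left.reset_index(), right.reset_index())
    return merged.set_index('index')


df1 = pd.DataFrame({'A': ['A0', 'A2', 'A3', 'A7'],
                    'B': ['B0', 'B2', 'B3', 'B7'],
                    'C': ['C0', 'C2', 'C3', 'C7'],
                    'D': ['D0', 'D2', 'D3', 'D7']},
                   index=[0, 2, 3, 7])
df2_test1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A7'],
                          'B': ['B0', 'B1', 'B2', 'B7'],
                          'C': ['C0', 'C1', 'C2', 'C7'],
                          'D': ['D0', 'D1', 'D2', 'D7']},
                         index=[0, 1, 2, 7])
df2_test2 = pd.DataFrame({'A': ['A1'], 'B': ['B1'], 'C': ['C1'], 'D': ['D1']},
                         index=[1])

out1 = common_rows(df1, df2_test1)
out2 = common_rows(df1, df2_test2)
print(out1)
print(out2)
```

Note that the resulting index is named 'index'; `rename_axis(None)` would remove the name if that matters.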
Or you can try this.
For test 1
df1['index']=df1.index
df2['index']=df2.index
df1['Mark']=df1.apply(lambda x : ' '.join(x.astype(str)),axis=1)
df2['Mark']=df2.apply(lambda x : ' '.join(x.astype(str)),axis=1)
df1[df1.Mark.isin(df2.Mark)].drop(columns=['Mark', 'index'])
Out[20]:
A B C D
0 A0 B0 C0 D0
2 A2 B2 C2 D2
7 A7 B7 C7 D7
For test 2
Out[28]:
Empty DataFrame
Columns: [A, B, C, D]
Index: []