Merging two DataFrames horizontally without reindexing the first - python

I want to stack two DataFrames horizontally without re-indexing the first DataFrame (df1), as these indices contain some important information. However, the indices of the second DataFrame (df2) have no significance and can be modified.
I could not find any way to do this without converting df2 to NumPy and passing the indices of df1 at creation. For a better understanding, please see the example below.
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 2, 3, 4])
df2 = pd.DataFrame({'A1': ['A4', 'A5', 'A6', 'A7'],
                    'C': ['C4', 'C5', 'C6', 'C7'],
                    'D2': ['D4', 'D5', 'D6', 'D7']},
                   index=[4, 5, 6, 7])
print(df1)
print(df2)
A B D
-------------
0 A0 B0 D0
2 A1 B1 D1
3 A2 B2 D2
4 A3 B3 D3
A1 C D2
-------------
4 A4 C4 D4
5 A5 C5 D5
6 A6 C6 D6
7 A7 C7 D7
Result I want:
A B D A1 C D2
--------------------------
0 A0 B0 D0 A4 C4 D4
2 A1 B1 D1 A5 C5 D5
3 A2 B2 D2 A6 C6 D6
4 A3 B3 D3 A7 C7 D7
PS: I would prefer a "one-shot" command to achieve this instead of using loops and adding each value.

Change the index of df2 to the index of df1 and then concatenate the dataframes:
df2.index = df1.index
pd.concat([df1, df2], axis=1)
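If you would rather not modify df2 in place, a one-shot variant is sketched below; it assumes df1 and df2 have the same number of rows:
import pandas as pd

# set_axis returns a copy of df2 whose index is replaced by df1's,
# so df2 itself is left untouched.
result = pd.concat([df1, df2.set_axis(df1.index)], axis=1)
print(result)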

Related

How to merge 2 dataframes

I have 2 pandas dataframes which look like this:
A0 B0 C0
A1 B1 C1
A2 B2 C2
A3 B3 C3
and
A2 D0 E0
A0 D1 E1
A3 D2 E2
A1 D3 E3
How do I make this:
A0 B0 C0 D1 E1
A1 B1 C1 D3 E3
A2 B2 C2 D0 E0
A3 B3 C3 D2 E2
You are looking for merge
df1 = pd.DataFrame([['A0', 'B0', 'C0'],
                    ['A1', 'B1', 'C1'],
                    ['A2', 'B2', 'C2'],
                    ['A3', 'B3', 'C3']])
df1.columns = ['c1', 'c2', 'c3']
df2 = pd.DataFrame([['A2', 'D0', 'E0'],
                    ['A0', 'D1', 'E1'],
                    ['A3', 'D2', 'E2'],
                    ['A1', 'D3', 'E3']])
df2.columns = ['c1', 'c4', 'c5']
df1.merge(df2, on='c1', how='left')
Output:
c1 c2 c3 c4 c5
0 A0 B0 C0 D1 E1
1 A1 B1 C1 D3 E3
2 A2 B2 C2 D0 E0
3 A3 B3 C3 D2 E2

How do I shuffle the rows of a matrix and its corresponding labels?

I'm in the early stages of building my first neural network and I'm brand new to Python.
I am at a roadblock because I don't know how to write code to shuffle my data with its corresponding labels. I imported my CSV and used NumPy to create a matrix. I also created a matrix for my labels:
filepath = '/My Drive/t_data9(1).csv'
my_data = pd.read_csv('/content/gdrive' + filepath, index_col=0)
my_data_matrix = np.array(my_data)
labels = [0]*5000 + [1]*5000
labels_matrix = np.array(labels)
I can access my data, so it's there. I just need to mix it up before I can separate out some training and validation rows and throw it in the NN I am building with Keras. Please advise.
You may concatenate the features and labels into a single DataFrame and shuffle the whole sample as follows:
Dummy Example
import pandas as pd
my_data = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                        'B': ['B0', 'B1', 'B2', 'B3'],
                        'C': ['C0', 'C1', 'C2', 'C3'],
                        'D': ['D0', 'D1', 'D2', 'D3']},
                       index=[0, 1, 2, 3])
my_data.head()
A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
And labels
labels = [0]*2 + [1]*2
my_data['labels'] = labels
my_data.head()
A B C D labels
0 A0 B0 C0 D0 0
1 A1 B1 C1 D1 0
2 A2 B2 C2 D2 1
3 A3 B3 C3 D3 1
And shuffling:
my_data = my_data.sample(frac=1).reset_index(drop=True) # shuffling
my_data.head()
A B C D labels
0 A2 B2 C2 D2 1
1 A0 B0 C0 D0 0
2 A3 B3 C3 D3 1
3 A1 B1 C1 D1 0
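If you prefer to keep the features and labels as NumPy arrays (as in the question), a minimal sketch, assuming my_data_matrix and labels_matrix both have 10,000 rows, is to apply one shared random permutation; the 8,000/2,000 split below is just an illustrative choice:
import numpy as np

rng = np.random.default_rng()
perm = rng.permutation(len(labels_matrix))   # one permutation reused for both arrays

shuffled_data = my_data_matrix[perm]         # rows reordered
shuffled_labels = labels_matrix[perm]        # labels reordered the same way, so pairs stay aligned

# Illustrative split: first 8,000 rows for training, the rest for validation.
x_train, x_val = shuffled_data[:8000], shuffled_data[8000:]
y_train, y_val = shuffled_labels[:8000], shuffled_labels[8000:]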

Combine pandas dataframes eliminating common columns with python

I have 3 dataframes:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])
df2 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'E': ['E0', 'E1', 'E2', 'E3']},
                   index=[0, 1, 2, 3])
df3 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'F': ['F0', 'F1', 'F2', 'F3']},
                   index=[0, 1, 2, 3])
I want to combine them together to get the following results:
A B C D E F
0 A0 B0 C0 D0 E0 F0
1 A1 B1 C1 D1 E1 F1
2 A2 B2 C2 D2 E2 F2
3 A3 B3 C3 D3 E3 F3
When I try to combine them, I keep getting:
A B C D A E A F
0 A0 B0 C0 D0 A0 E0 A0 F0
1 A1 B1 C1 D1 A1 E1 A1 F1
2 A2 B2 C2 D2 A2 E2 A2 F2
3 A3 B3 C3 D3 A3 E3 A3 F3
The common column (A) is duplicated once for each dataframe used in the concat call. I have tried various combinations of:
df4 = pd.concat([df1, df2, df3], axis=1, sort=False)
Some variations have been disastrous, while others keep giving the undesired result. Any suggestions would be much appreciated. Thanks.
Try:
df4 = (pd.concat((df.set_index('A') for df in (df1, df2, df3)), axis=1)
         .reset_index())
Output:
A B C D E F
0 A0 B0 C0 D0 E0 F0
1 A1 B1 C1 D1 E1 F1
2 A2 B2 C2 D2 E2 F2
3 A3 B3 C3 D3 E3 F3
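An alternative sketch, assuming column A holds the same unique keys in all three frames, is to chain merges so the shared column appears only once:
# Each merge joins on the common key column A, so it is not duplicated.
df4 = df1.merge(df2, on='A').merge(df3, on='A')
print(df4)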

Optimal way of Reshaping Pandas Dataframe

I have a one-dimensional dataframe set up like this:
[A1,B1,C1,A2,B2,C2,A3,B3,C3,A4,B4,C4,A5,B5,C5,A6,B6,C6]
In my program, A1, ..., C6 will be numbers read from a CSV.
I would like to reshape it into a 2d dataframe like this:
[A1,B1,C1]
[A2,B2,C2]
[A3,B3,C3]
[A4,B4,C4]
[A5,B5,C5]
[A6,B6,C6]
I could make this using loops but it will slow the program down a lot since I would be making this transformation many times. What is the optimal command for reshaping data this way? I looked through a bunch of the reshape dataframe questions but couldn't find anything specific to this. Thanks in advance.
Setup
s = "A1,B1,C1,A2,B2,C2,A3,B3,C3,A4,B4,C4,A5,B5,C5,A6,B6,C6".split(',')
Using Numpy
pd.DataFrame(np.array(s).reshape(-1, 3))
0 1 2
0 A1 B1 C1
1 A2 B2 C2
2 A3 B3 C3
3 A4 B4 C4
4 A5 B5 C5
5 A6 B6 C6
Iterator shenanigans
pd.DataFrame([*zip(*[iter(s)]*3)])
0 1 2
0 A1 B1 C1
1 A2 B2 C2
2 A3 B3 C3
3 A4 B4 C4
4 A5 B5 C5
5 A6 B6 C6
Using a stride (step) when parsing the list, assuming the data is in the format you provided.
s = [A1,B1,C1,A2,B2,C2,A3,B3,C3,A4,B4,C4,A5,B5,C5,A6,B6,C6]
Note that if s is initially a dataframe with one row and 18 columns, you can convert it to a list via:
s = s.T.iloc[:, 0].tolist()
Then convert the result into a dataframe of your chosen dimension via:
df = pd.DataFrame({'A': s[::3], 'B': s[1::3], 'C': s[2::3]})
More generally:
s = range(18)
cols = 3
>>> pd.DataFrame([s[n:(n + cols)] for n in range(0, len(s), cols)])
0 1 2
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
5 15 16 17
Using list split
[s[x:x+3] for x in range(0, len(s),3)]
Out[1151]:
[['A1', 'B1', 'C1'],
['A2', 'B2', 'C2'],
['A3', 'B3', 'C3'],
['A4', 'B4', 'C4'],
['A5', 'B5', 'C5'],
['A6', 'B6', 'C6']]
#pd.DataFrame([s[x:x+3] for x in range(0, len(s),3)])
I would reshape the array and ensure that the order argument is set to "A"
mylist = np.array(['a1', 'b1', 'c1', 'a2', 'b2', 'c2', 'a3', 'b3', 'c3', 'a4', 'b4', 'c4', 'a5','b5', 'c5', 'a6', 'b6', 'c6'])
reshapedList = mylist.reshape((6, 3), order = 'A')
print(mylist)
>>> ['a1' 'b1' 'c1' 'a2' 'b2' 'c2' 'a3' 'b3' 'c3' 'a4' 'b4' 'c4' 'a5' 'b5' 'c5' 'a6' 'b6' 'c6']
print(reshapedList)
[['a1' 'b1' 'c1']
['a2' 'b2' 'c2']
['a3' 'b3' 'c3']
['a4' 'b4' 'c4']
['a5' 'b5' 'c5']
['a6' 'b6' 'c6']]
If you want a pandas dataframe, you can get it as follows.
df = pd.DataFrame(mylist.reshape((6, 3), order = 'A'), columns = list('ABC'))
>>> df
A B C
0 a1 b1 c1
1 a2 b2 c2
2 a3 b3 c3
3 a4 b4 c4
4 a5 b5 c5
5 a6 b6 c6
Note:
It is important that you take some time to check the differences between a dataframe and an array. Your question spoke of a dataframe, but what you really meant was an array.
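That said, if the CSV is read into a single-row DataFrame first, a small sketch of going straight to the 6x3 frame follows; the name row_df, the 18-column shape, and the column names are assumptions:
import pandas as pd

# Hypothetical single-row frame standing in for the CSV contents.
row_df = pd.DataFrame([list(range(18))])

# Reshape the underlying values into 6 rows of 3 columns and name the columns.
df = pd.DataFrame(row_df.to_numpy().reshape(-1, 3), columns=['A', 'B', 'C'])
print(df)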

Extracting slices that are identical between two dataframes

How can I combine 2 dataframes df1 and df2 in order to get df3, which has the rows of df1 and df2 that have the same index (and the same values in the columns)?
df1 = pd.DataFrame({'A': ['A0', 'A2', 'A3', 'A7'],
                    'B': ['B0', 'B2', 'B3', 'B7'],
                    'C': ['C0', 'C2', 'C3', 'C7'],
                    'D': ['D0', 'D2', 'D3', 'D7']},
                   index=[0, 2, 3, 7])
test 1
df2 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A7'],
                    'B': ['B0', 'B1', 'B2', 'B7'],
                    'C': ['C0', 'C1', 'C2', 'C7'],
                    'D': ['D0', 'D1', 'D2', 'D7']},
                   index=[0, 1, 2, 7])
test 2
df2 = pd.DataFrame({'A': ['A1'],
                    'B': ['B1'],
                    'C': ['C1'],
                    'D': ['D1']},
                   index=[1])
Expected output test 1
Out[13]:
A B C D
0 A0 B0 C0 D0
2 A2 B2 C2 D2
7 A7 B7 C7 D7
Expected output test 2
Empty DataFrame
Columns: [A, B, C, D]
Index: []
First, get the intersection of indices. Next, find all rows where all the columns are identical, and then just index into either dataframe.
idx = df1.index.intersection(df2.index)
df_out = df1.loc[idx][(df1.loc[idx] == df2.loc[idx]).all(axis=1)]
print(df_out)
You can also use df.isin (slightly different from the other answer):
df_out = df1[df1.isin(df2).all(1)]
print(df_out)
Test 1
A B C D
0 A0 B0 C0 D0
2 A2 B2 C2 D2
7 A7 B7 C7 D7
Test 2
Empty DataFrame
Columns: [A, B, C, D]
Index: []
I believe this is a more pythonic solution:
df1[df2.isin(df1)].dropna()
gives:
A B C D
0 A0 B0 C0 D0
2 A2 B2 C2 D2
7 A7 B7 C7 D7
pd.merge(df1.reset_index(), df2.reset_index()).set_index('index')
This adds the index of each dataframe as a column, then joins on all the columns (which now include the index), and finally sets the index back to the original values.
Or you can try this.
For test 1
df1['index'] = df1.index
df2['index'] = df2.index
df1['Mark'] = df1.apply(lambda x: ' '.join(x.astype(str)), axis=1)
df2['Mark'] = df2.apply(lambda x: ' '.join(x.astype(str)), axis=1)
df1[df1.Mark.isin(df2.Mark)].drop(['Mark', 'index'], axis=1)
Out[20]:
A B C D
0 A0 B0 C0 D0
2 A2 B2 C2 D2
7 A7 B7 C7 D7
For test 2
Out[28]:
Empty DataFrame
Columns: [A, B, C, D]
Index: []
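If you would rather not add the helper index and Mark columns to df1 and df2 themselves, a hedged variant of the same idea builds the row keys on the fly (key1 and key2 are just illustrative names):
# Join the index and every column value into one string per row, leaving df1 / df2 untouched.
key1 = df1.reset_index().astype(str).apply(' '.join, axis=1)
key2 = df2.reset_index().astype(str).apply(' '.join, axis=1)

# Keep the df1 rows whose full row key also appears in df2.
df_out = df1[key1.isin(key2).to_numpy()]
print(df_out)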
