Optimal way of Reshaping Pandas Dataframe - python

I have a one dimensional dataframe setup like this:
[A1,B1,C1,A2,B2,C2,A3,B3,C3,A4,B4,C4,A5,B5,C5,A6,B6,C6]
In my program, A1,...,C6 will be numbers read from a CSV.
I would like to reshape it into a 2d dataframe like this:
[A1,B1,C1]
[A2,B2,C2]
[A3,B3,C3]
[A4,B4,C4]
[A5,B5,C5]
[A6,B6,C6]
I could make this using loops but it will slow the program down a lot since I would be making this transformation many times. What is the optimal command for reshaping data this way? I looked through a bunch of the reshape dataframe questions but couldn't find anything specific to this. Thanks in advance.

Setup
s = "A1,B1,C1,A2,B2,C2,A3,B3,C3,A4,B4,C4,A5,B5,C5,A6,B6,C6".split(',')
Using Numpy
pd.DataFrame(np.array(s).reshape(-1, 3))
0 1 2
0 A1 B1 C1
1 A2 B2 C2
2 A3 B3 C3
3 A4 B4 C4
4 A5 B5 C5
5 A6 B6 C6
Iterator shenanigans
pd.DataFrame([*zip(*[iter(s)]*3)])
0 1 2
0 A1 B1 C1
1 A2 B2 C2
2 A3 B3 C3
3 A4 B4 C4
4 A5 B5 C5
5 A6 B6 C6
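The iterator one-liner is compact but terse, so here is a rough unpacked sketch of what it does. It assumes a shorter six-item list (made up for illustration); the names `it` and `rows` are not from the answer above.

```python
import pandas as pd

s = "A1,B1,C1,A2,B2,C2".split(",")

# [iter(s)] * 3 is a list holding the SAME iterator three times,
# so zip() pulls three consecutive items from it for every tuple.
it = iter(s)
rows = list(zip(it, it, it))

df = pd.DataFrame(rows)
print(rows)
```

Note that zip() stops at the shortest input, so a trailing group with fewer than three items would be dropped silently.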

Using a stride (step) when parsing the list, assuming the data is in the format you provided.
s = [A1,B1,C1,A2,B2,C2,A3,B3,C3,A4,B4,C4,A5,B5,C5,A6,B6,C6]
Note that if s is initially a dataframe with one row and 18 columns, you can convert it to a list via:
s = s.T.iloc[:, 0].tolist()
Then convert the result into a dataframe of your chosen dimension via:
df = pd.DataFrame({'A': s[::3], 'B': s[1::3], 'C': s[2::3]})
More generally:
s = range(18)
cols = 3
>>> pd.DataFrame([s[n:(n + cols)] for n in range(0, len(s), cols)])
0 1 2
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
5 15 16 17
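If the data really does arrive as a one-row DataFrame, a possible shortcut is to skip the intermediate list and reshape the underlying array. This is only a sketch: `wide` is a hypothetical stand-in for the frame read from the CSV, and the column names A, B, C are assumed.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the one-row, 18-column frame read from the CSV.
wide = pd.DataFrame([np.arange(18)])

# to_numpy() gives a (1, 18) array; reshape(-1, 3) regroups it row by row
# and lets NumPy work out the number of rows.
df = pd.DataFrame(wide.to_numpy().reshape(-1, 3), columns=list("ABC"))
print(df)
```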

Using list split
[s[x:x+3] for x in range(0, len(s),3)]
Out[1151]:
[['A1', 'B1', 'C1'],
['A2', 'B2', 'C2'],
['A3', 'B3', 'C3'],
['A4', 'B4', 'C4'],
['A5', 'B5', 'C5'],
['A6', 'B6', 'C6']]
#pd.DataFrame([s[x:x+3] for x in range(0, len(s),3)])

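I would convert the list to a NumPy array and reshape it. Here the order argument is set to "A", which for this 1-D input fills the result row by row, just like the default order "C".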
mylist = np.array(['a1', 'b1', 'c1', 'a2', 'b2', 'c2', 'a3', 'b3', 'c3', 'a4', 'b4', 'c4', 'a5','b5', 'c5', 'a6', 'b6', 'c6'])
reshapedList = mylist.reshape((6, 3), order = 'A')
print(mylist)
>>> ['a1' 'b1' 'c1' 'a2' 'b2' 'c2' 'a3' 'b3' 'c3' 'a4' 'b4' 'c4' 'a5' 'b5' 'c5' 'a6' 'b6' 'c6']
print(reshapedList)
[['a1' 'b1' 'c1']
['a2' 'b2' 'c2']
['a3' 'b3' 'c3']
['a4' 'b4' 'c4']
['a5' 'b5' 'c5']
['a6' 'b6' 'c6']]
If you want a pandas dataframe, you can get it as follows.
df = pd.DataFrame(mylist.reshape((6, 3), order = 'A'), columns = list('ABC'))
>>> df
A B C
0 a1 b1 c1
1 a2 b2 c2
2 a3 b3 c3
3 a4 b4 c4
4 a5 b5 c5
5 a6 b6 c6
Note:
It is important that you take some time to check the differences between a DataFrame and an array. Your question spoke of a DataFrame, but what you really meant was an array.
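To see what the order argument changes, here is a small sketch comparing row-wise ("C") and column-wise ("F") filling on a made-up six-element array. The array and the variable names are for illustration only.

```python
import numpy as np

flat = np.arange(6)

# "C" (the default) fills the result row by row.
row_major = flat.reshape((3, 2), order="C")

# "F" fills the result column by column instead.
col_major = flat.reshape((3, 2), order="F")

print(row_major)
print(col_major)
```

For the layout asked about in the question (A1, B1, C1 on the first row), row-wise filling is the one that is needed.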

Related

How do I shuffle the rows of a matrix and its corresponding labels?

I'm in the early stages of building my first neural network and I'm brand new to python.
I am at a roadblock because I don't know how to write code to shuffle my data with its corresponding labels. I imported my csv, and I used numpy to create a matrix. I also created a matrix for my labels
filepath = '/My Drive/t_data9(1).csv'
my_data = pd.read_csv('/content/gdrive' + filepath, index_col=0)
my_data_matrix = np.array(my_data)
labels = [0]*5000 + [1]*5000
labels_matrix = np.array(labels)
I can access my data, so it's there. I just need to mix it up before I can separate out some training and validation rows and feed it to the NN I am building with Keras. Please advise.
You can add the labels as a column of the feature DataFrame and then shuffle the rows of the combined frame, so that each row keeps its label:
Dummy Example
import pandas as pd
my_data = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']},
index=[0, 1, 2, 3])
my_data.head()
A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
And labels
labels = [0]*2 + [1]*2
my_data['labels'] = labels
my_data.head()
A B C D labels
0 A0 B0 C0 D0 0
1 A1 B1 C1 D1 0
2 A2 B2 C2 D2 1
3 A3 B3 C3 D3 1
And shuffling:
my_data = my_data.sample(frac=1).reset_index(drop=True) # shuffling
my_data.head()
A B C D labels
0 A2 B2 C2 D2 1
1 A0 B0 C0 D0 0
2 A3 B3 C3 D3 1
3 A1 B1 C1 D1 0
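Since the question starts from NumPy arrays, another option might be to keep the two arrays separate and index both with one shared random permutation. This is a sketch, not part of the answer above: `features` and `labels` are small made-up stand-ins for my_data_matrix and labels_matrix, and the seed is there only to make the run repeatable.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seed is an assumption, only for repeatability

# Hypothetical stand-ins for my_data_matrix and labels_matrix.
features = np.arange(20).reshape(10, 2)
labels = np.array([0] * 5 + [1] * 5)

# One random ordering of the row positions, applied to both arrays,
# keeps every row paired with its original label.
perm = rng.permutation(len(features))
shuffled_features = features[perm]
shuffled_labels = labels[perm]
```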

Merging two DataFrames horizontally without reindexing the first

I want to stack two DataFrames horizontally without re-indexing the first DataFrame (df1), as its indices contain some important information. However, the indices of the second DataFrame (df2) have no significance and can be modified.
I could not find any way without converting the df2 to numpy and passing the indices of df1 at creation. For better understanding please find the below example.
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3'],
'D': ['D0', 'D1', 'D2', 'D3']},
index=[0, 2, 3,4])
df2 = pd.DataFrame({'A1': ['A4', 'A5', 'A6', 'A7'],
'C': ['C4', 'C5', 'C6', 'C7'],
'D2': ['D4', 'D5', 'D6', 'D7']},
index=[ 4, 5, 6 ,7])
print(df1)
print(df2)
A B D
-------------
0 A0 B0 D0
2 A1 B1 D1
3 A2 B2 D2
4 A3 B3 D3
A1 C D2
-------------
4 A4 C4 D4
5 A5 C5 D5
6 A6 C6 D6
7 A7 C7 D7
Result I want:
A B D A1 C D2
--------------------------
0 A0 B0 D0 A4 C4 D4
2 A1 B1 D1 A5 C5 D5
3 A2 B2 D2 A6 C6 D6
4 A3 B3 D3 A7 C7 D7
PS: I would prefer a "one-shot" command to achieve this instead of using loops and adding each value.
Change the index of df2 to the index of df1 and then concatenate the dataframes:
df2.index = df1.index
pd.concat([df1, df2], axis=1)
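If df2 should stay untouched, a one-expression variant is possible with DataFrame.set_axis, which returns a relabelled copy. The sketch below uses cut-down frames made up for illustration, and it assumes both frames have the same number of rows.

```python
import pandas as pd

df1 = pd.DataFrame({"A": ["A0", "A1"], "B": ["B0", "B1"]}, index=[0, 2])
df2 = pd.DataFrame({"C": ["C4", "C5"]}, index=[4, 5])

# set_axis() returns a copy of df2 carrying df1's index; df2 itself is not modified.
result = pd.concat([df1, df2.set_axis(df1.index, axis=0)], axis=1)
print(result)
```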

Understanding the FutureWarning on using join_axes when concatenating with Pandas

I have two DataFrames:
df1:
A B C
1 A1 B1 C1
2 A2 B2 C2
df2:
B C D
3 B3 C3 D3
4 B4 C4 D4
Columns B and C are identical for both.
I'd like to concatenate them vertically and keep the columns of the first DataFrame:
pd.concat([df1, df2], join_axes=[df1.columns]):
A B C
1 A1 B1 C1
2 A2 B2 C2
3 NaN B3 C3
4 NaN B4 C4
This works, but raises a
FutureWarning: The join_axes-keyword is deprecated. Use .reindex or .reindex_like on the result to achieve the same functionality.
I couldn't find (either in the documentation or through Google) how to "Use .reindex or .reindex_like on the result to achieve the same functionality".
Colab notebook illustrating issue: https://colab.research.google.com/drive/13EBq2z0Nh05JY7ovrdnLGtfeqdKVvZq0
As the warning suggests, use reindex:
pd.concat([df1,df2.reindex(columns=df1.columns)])
Out[286]:
A B C
1 A1 B1 C1
2 A2 B2 C2
3 NaN B3 C3
4 NaN B4 C4
df1 = pd.DataFrame({'A': ['A1', 'A2'], 'B': ['B1', 'B2'], 'C': ['C1', 'C2']}, index=[1, 2])
df2 = pd.DataFrame({'B': ['B3', 'B4'], 'C': ['C3', 'C4'], 'D': ['D3', 'D4']}, index=[3, 4])
pd.concat([df1, df2], sort=False)[df1.columns]
yields the desired result.
OR...
pd.concat([df1, df2], sort=False).reindex(df1.columns, axis=1)
Output:
A B C
1 A1 B1 C1
2 A2 B2 C2
3 NaN B3 C3
4 NaN B4 C4

Combine pandas dataframes eliminating common columns with python

I have 3 dataframes:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],\
'B': ['B0', 'B1', 'B2', 'B3'],\
'C': ['C0', 'C1', 'C2', 'C3'],\
'D': ['D0', 'D1', 'D2', 'D3']},\
index=[0,1,2,3])
df2 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],\
'E': ['E0', 'E1', 'E2', 'E3']},\
index=[0,1,2,3])
df3 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],\
'F': ['F0', 'F1', 'F2', 'F3']},\
index=[0,1,2,3])
I want to combine them together to get the following results:
A B C D E F
0 A0 B0 C0 D0 E0 F0
1 A1 B1 C1 D1 E1 F1
2 A2 B2 C2 D2 E2 F2
3 A3 B3 C3 D3 E3 F3
When I try to combine them, I keep getting:
A B C D A E A F
0 A0 B0 C0 D0 A0 E0 A0 F0
1 A1 B1 C1 D1 A1 E1 A1 F1
2 A2 B2 C2 D2 A2 E2 A2 F2
3 A3 B3 C3 D3 A3 E3 A3 F3
The common column (A) is duplicated once for each dataframe used in the concat call. I have tried various combinations on:
df4 = pd.concat([df1, df2, df3], axis=1, sort=False)
Some variations have been disastrous while some keep giving the undesired result. Any suggestions would be much appreciated. Thanks.
Try
df4 = (pd.concat((df.set_index('A') for df in (df1,df2,df3)), axis=1)
.reset_index()
)
Output:
A B C D E F
0 A0 B0 C0 D0 E0 F0
1 A1 B1 C1 D1 E1 F1
2 A2 B2 C2 D2 E2 F2
3 A3 B3 C3 D3 E3 F3
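A different technique that may also fit is to merge the frames one after another on the shared column. This is a sketch on cut-down frames made up for illustration. Unlike concat, merge matches rows by the values in A rather than by position, so it assumes those values are unique within each frame.

```python
from functools import reduce

import pandas as pd

df1 = pd.DataFrame({"A": ["A0", "A1"], "B": ["B0", "B1"]})
df2 = pd.DataFrame({"A": ["A0", "A1"], "E": ["E0", "E1"]})
df3 = pd.DataFrame({"A": ["A0", "A1"], "F": ["F0", "F1"]})

# Fold the list of frames into one by repeatedly inner-merging on column A.
df4 = reduce(lambda left, right: left.merge(right, on="A"), [df1, df2, df3])
print(df4)
```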

Extracting slices that are identical between two dataframes

How can I combine 2 dataframe df1 and df2 in order to get df3 that has the rows of df1 and df2 that have the same index (and the same values in the columns)?
df1 = pd.DataFrame({'A': ['A0', 'A2', 'A3', 'A7'],
'B': ['B0', 'B2', 'B3', 'B7'],
'C': ['C0', 'C2', 'C3', 'C7'],
'D': ['D0', 'D2', 'D3', 'D7']},
index=[0, 2, 3,7])
test 1
df2 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A7'],
'B': ['B0', 'B1', 'B2', 'B7'],
'C': ['C0', 'C1', 'C2', 'C7'],
'D': ['D0', 'D1', 'D2', 'D7']},
index=[0, 1, 2, 7])
test 2
df2 = pd.DataFrame({'A': ['A1'],
'B': ['B1'],
'C': ['C1'],
'D': ['D1']},
index=[1])
Expected output test 1
Out[13]:
A B C D
0 A0 B0 C0 D0
2 A2 B2 C2 D2
7 A7 B7 C7 D7
Expected output test 2
Empty DataFrame
Columns: [A, B, C, D]
Index: []
First, get the intersection of indices. Next, find all rows where all the columns are identical, and then just index into either dataframe.
idx = df1.index.intersection(df2.index)
mask = (df1.loc[idx] == df2.loc[idx]).all(axis=1)
df_out = df1.loc[idx][mask]
print(df_out)
You can also use df.isin (slightly different from the other answer):
df_out = df1[df1.isin(df2).all(axis=1)]
print(df_out)
Test 1
A B C D
0 A0 B0 C0 D0
2 A2 B2 C2 D2
7 A7 B7 C7 D7
Test 2
Empty DataFrame
Columns: [A, B, C, D]
Index: []
I believe this is a more Pythonic solution:
df1[df2.isin(df1)].dropna()
gives:
A B C D
0 A0 B0 C0 D0
2 A2 B2 C2 D2
7 A7 B7 C7 D7
pd.merge(df1.reset_index(), df2.reset_index()).set_index('index')
This adds the index of each dataframe as a column, then joins on all the columns (which now includes the index) and then sets the index to be back to the original values.
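The merge approach can be wrapped in a small helper and checked against both test cases. This is a sketch on two-column versions of the question's frames: the helper name `identical_rows` is made up, and it assumes the index is unnamed and that neither frame already has a column called "index".

```python
import pandas as pd

def identical_rows(left, right):
    # reset_index() turns the unnamed index into a column called "index";
    # merge() then inner-joins on every column the two frames share.
    merged = pd.merge(left.reset_index(), right.reset_index())
    # Restore the original labels as the index and drop the "index" name.
    return merged.set_index("index").rename_axis(None)

df1 = pd.DataFrame({"A": ["A0", "A2", "A3", "A7"],
                    "B": ["B0", "B2", "B3", "B7"]}, index=[0, 2, 3, 7])
test1 = pd.DataFrame({"A": ["A0", "A1", "A2", "A7"],
                      "B": ["B0", "B1", "B2", "B7"]}, index=[0, 1, 2, 7])
test2 = pd.DataFrame({"A": ["A1"], "B": ["B1"]}, index=[1])

out1 = identical_rows(df1, test1)
out2 = identical_rows(df1, test2)
```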
Or you can try this.
For test 1
df1['index']=df1.index
df2['index']=df2.index
df1['Mark']=df1.apply(lambda x : ' '.join(x.astype(str)),axis=1)
df2['Mark']=df2.apply(lambda x : ' '.join(x.astype(str)),axis=1)
df1[df1.Mark.isin(df2.Mark)].drop(columns=['Mark', 'index'])
Out[20]:
A B C D
0 A0 B0 C0 D0
2 A2 B2 C2 D2
7 A7 B7 C7 D7
For test 2
Out[28]:
Empty DataFrame
Columns: [A, B, C, D]
Index: []
