I'm in the early stages of building my first neural network, and I'm brand new to Python.
I'm stuck because I don't know how to shuffle my data together with its corresponding labels. I imported my CSV and used NumPy to create a matrix. I also created a matrix for my labels:
filepath = '/My Drive/t_data9(1).csv'
my_data = pd.read_csv('/content/gdrive' + filepath, index_col=0)
my_data_matrix = np.array(my_data)
labels = [0]*5000 + [1]*5000
labels_matrix = np.array(labels)
I can access my data, so it's there. I just need to shuffle it before I split off some training and validation rows and feed it to the NN I am building with Keras. Please advise.
You can add the labels to the feature DataFrame as an extra column, so that each label stays attached to its row, and then shuffle the whole sample as follows:
Dummy Example
import pandas as pd
my_data = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']},
index=[0, 1, 2, 3])
my_data.head()
A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
And labels
labels = [0]*2 + [1]*2
my_data['labels'] = labels
my_data.head()
A B C D labels
0 A0 B0 C0 D0 0
1 A1 B1 C1 D1 0
2 A2 B2 C2 D2 1
3 A3 B3 C3 D3 1
And shuffling:
my_data = my_data.sample(frac=1).reset_index(drop=True) # shuffling
my_data.head()
A B C D labels
0 A2 B2 C2 D2 1
1 A0 B0 C0 D0 0
2 A3 B3 C3 D3 1
3 A1 B1 C1 D1 0
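If the data is already in NumPy arrays, as in the question, another option is to index both arrays with one shared random permutation. The following is only a minimal sketch: the toy `features` and `labels` arrays stand in for `my_data_matrix` and `labels_matrix`, and the seed and the 80/20 split ratio are assumptions chosen for illustration.

```python
import numpy as np

# Toy stand-ins for my_data_matrix and labels_matrix (values are illustrative).
features = np.arange(20).reshape(10, 2)   # 10 rows, 2 feature columns
labels = np.array([0] * 5 + [1] * 5)      # first 5 rows are class 0, the rest class 1

rng = np.random.default_rng(seed=42)      # seeded so the shuffle is reproducible
perm = rng.permutation(len(features))     # one random ordering of the row indices

# Index both arrays with the same permutation so each row keeps its label.
shuffled_features = features[perm]
shuffled_labels = labels[perm]

# Hypothetical 80/20 split into training and validation rows.
split = int(0.8 * len(shuffled_features))
x_train, x_val = shuffled_features[:split], shuffled_features[split:]
y_train, y_val = shuffled_labels[:split], shuffled_labels[split:]
```

Because the same `perm` is applied to both arrays, row i of `shuffled_features` still belongs to entry i of `shuffled_labels`.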
I have two dataframes with the same columns. Only one column has different values. I want to concatenate the two without duplicating the shared columns.
df2 = pd.DataFrame({'key': ['K0', 'K1', 'K2'],'cat': ['C0', 'C1', 'C2'],'B': ['B0', 'B1', 'B2']})
df1 = pd.DataFrame({'key': ['K0', 'K1', 'K2'],'cat': ['C0', 'C1', 'C2'],'B': ['A0', 'A1', 'A2']})
df1
Out[630]:
key cat B
0 K0 C0 A0
1 K1 C1 A1
2 K2 C2 A2
df2
Out[631]:
key cat B
0 K0 C0 B0
1 K1 C1 B1
2 K2 C2 B2
I tried:
result = pd.concat([df1, df2], axis=1)
result
Out[633]:
key cat B key cat B
0 K0 C0 A0 K0 C0 B0
1 K1 C1 A1 K1 C1 B1
2 K2 C2 A2 K2 C2 B2
The desired output:
key cat B_df1 B_df2
0 K0 C0 A0 B0
1 K1 C1 A1 B1
2 K2 C2 A2 B2
NOTE: I could drop duplicates afterwards and rename columns but that doesn't seem efficient
pd.merge will do the job
pd.merge(df1,df2, on=['key','cat'])
Output
key cat B_x B_y
0 K0 C0 A0 B0
1 K1 C1 A1 B1
2 K2 C2 A2 B2
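To get the column names from the desired output (B_df1 and B_df2) rather than the default B_x and B_y, the `suffixes` parameter of `pd.merge` can be set. A short sketch using the frames from the question:

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['K0', 'K1', 'K2'],
                    'cat': ['C0', 'C1', 'C2'],
                    'B': ['A0', 'A1', 'A2']})
df2 = pd.DataFrame({'key': ['K0', 'K1', 'K2'],
                    'cat': ['C0', 'C1', 'C2'],
                    'B': ['B0', 'B1', 'B2']})

# Merge on the shared key columns; the overlapping non-key column 'B'
# gets the given suffixes instead of the default '_x' and '_y'.
result = pd.merge(df1, df2, on=['key', 'cat'], suffixes=('_df1', '_df2'))
print(result)
```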
I am really struggling to understand the "left_index" and "right_index" arguments in pandas.merge. I read the documentation, searched around, experimented with various settings and tried to understand, but I am still confused. Consider this example:
left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
'key2': ['K0', 'K1', 'K0', 'K1'],
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
'key2': ['K0', 'K0', 'K0', 'K0'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3'],
'E': [1,2,3,4]})
Now, when I run the following command:
pd.merge(left, right, left_on=['key2', 'key1'], right_on=['key1', 'key2'], how='outer', indicator=True, left_index=True)
I get:
key1_x key2_x A B key1_y key2_y C D E _merge
0 K0 K0 A0 B0 K0 K0 C0 D0 1.0 both
1 K0 K1 A1 B1 K1 K0 C1 D1 2.0 both
2 K0 K1 A1 B1 K1 K0 C2 D2 3.0 both
3 K1 K0 A2 B2 NaN NaN NaN NaN NaN left_only
3 K2 K1 A3 B3 NaN NaN NaN NaN NaN left_only
3 NaN NaN NaN NaN K2 K0 C3 D3 4.0 right_only
However, running the same with right_index=True gives an error. The same happens if I pass both. More interestingly, running the following merge gives a very unexpected result:
pd.merge(left, right, on=['key1', 'key2'],how='outer', validate = 'one_to_many', indicator=True, left_index = True, right_index = True)
Result is:
key1 key2 A B C D E _merge
0 K0 K0 A0 B0 C0 D0 1 both
1 K0 K1 A1 B1 C1 D1 2 both
2 K1 K0 A2 B2 C2 D2 3 both
3 K2 K1 A3 B3 C3 D3 4 both
As you can see, all information from the right frame for key1 and key2 is completely lost.
Please help me understand the purpose and function of these arguments. Thank you.
Merging can happen in a few ways:
Column-Column Merge: Use left_on, right_on and how.
Example:
# Both calls match on the same keys; the second also adds a _merge column
pd.merge(left, right, left_on=['key1', 'key2'], right_on=['key1', 'key2'], how='outer')
pd.merge(left, right, on=['key1', 'key2'], how='outer', indicator=True)
Index-Index Merge: Set left_index and right_index to True, or pass index level names to on, together with how.
Example:
pd.merge(left, right, how='inner', right_index=True, left_index=True)
# If both data frames have matching, unique multi-indexes, you can do
# pd.merge(left, right, how='inner', on=['indexname1', 'indexname2'])
# In your data frames, the keys contain duplicate values, so you can't do this
# In general, a column with duplicate values does not make a good key
Column-Index Merge: Use left_on + right_index or left_index + right_on and how.
Note: The values in the index and in left_on must have matching types. If your index holds integers and your left_on column holds strings, you get an error. Also, the number of index levels must match the number of left_on columns.
Example:
# If how not specified, inner join is used
pd.merge(left, right, right_on=['E'], left_index = True, how = 'outer')
# Raises an error because the left_on column holds strings and the right index holds integers
pd.merge(left, right, left_on=['key1'], right_index = True, how = 'outer')
# This gave you an error because left_on has two key levels but the right index has only one
pd.merge(left, right, left_on=['key2', 'key1'], right_on=['key1', 'key2'], how='outer', indicator=True, right_index=True)
You mixed up the different types of merges, which is what gave the weird results.
If you can't see how the merging is going to happen conceptually, chances are a computer isn't going to do any better.
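As an illustration of a column-index merge where the types do line up, here is a minimal sketch. The two small frames are hypothetical and are not the ones from the question: the left frame has a string column named key, and the right frame carries matching string labels in its index.

```python
import pandas as pd

# Hypothetical frames: 'key' is a string column on the left,
# and the right frame uses matching string labels as its index.
left = pd.DataFrame({'key': ['K0', 'K1', 'K2'], 'A': ['A0', 'A1', 'A2']})
right = pd.DataFrame({'C': ['C0', 'C2']}, index=['K0', 'K2'])

# Match the values of left['key'] against the index of right,
# keeping every row of left (rows without a match get NaN in 'C').
merged = pd.merge(left, right, left_on='key', right_index=True, how='left')
print(merged)
```

Each side uses exactly one kind of key here: a column on the left and the index on the right.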
If I understand the behavior of merge correctly, you should pick only one option each for the left and right sides (i.e. you should not pass left_on=['x'] and left_index=True at the same time). Otherwise merge cannot tell which key it should actually use, and strange results can follow, as your examples show. (I have not checked the pandas source in detail, and this behavior may differ between versions.) Here is a small experiment.
>>> left
key1 key2 A B
0 K0 K0 A0 B0
1 K0 K1 A1 B1
2 K1 K0 A2 B2
3 K2 K1 A3 B3
>>> right
key1 key2 C D E
0 K0 K0 C0 D0 1
1 K1 K0 C1 D1 2
2 K1 K0 C2 D2 3
3 K2 K0 C3 D3 4
(1) merge using ['key1', 'key2']
>>> pd.merge(left, right, on=['key1', 'key2'], how='outer')
key1 key2 A B C D E
0 K0 K0 A0 B0 C0 D0 1.0
1 K0 K1 A1 B1 NaN NaN NaN
2 K1 K0 A2 B2 C1 D1 2.0
3 K1 K0 A2 B2 C2 D2 3.0
4 K2 K1 A3 B3 NaN NaN NaN
5 K2 K0 NaN NaN C3 D3 4.0
(2) Set ['key1', 'key2'] as left index and merge it using the index and keys
>>> left = left.set_index(['key1', 'key2'])
>>> pd.merge(left, right, left_index=True, right_on=['key1', 'key2'], how='outer').reset_index(drop=True)
A B key1 key2 C D E
0 A0 B0 K0 K0 C0 D0 1.0
1 A1 B1 K0 K1 NaN NaN NaN
2 A2 B2 K1 K0 C1 D1 2.0
3 A2 B2 K1 K0 C2 D2 3.0
4 A3 B3 K2 K1 NaN NaN NaN
5 NaN NaN K2 K0 C3 D3 4.0
(3) Further set ['key1', 'key2'] as right index and merge it using the index
>>> right = right.set_index(['key1', 'key2'])
>>> pd.merge(left, right, left_index=True, right_index=True, how='outer').reset_index()
key1 key2 A B C D E
0 K0 K0 A0 B0 C0 D0 1.0
1 K0 K1 A1 B1 NaN NaN NaN
2 K1 K0 A2 B2 C1 D1 2.0
3 K1 K0 A2 B2 C2 D2 3.0
4 K2 K0 NaN NaN C3 D3 4.0
5 K2 K1 A3 B3 NaN NaN NaN
Please note that (1), (2) and (3) above contain the same rows (only the row and column order differs), and even if ['key1', 'key2'] are set as the index, you can still use left_on=['key1', 'key2'] instead of left_index=True.
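The equivalence of the column-based and index-based merges can also be checked programmatically. This is only a sketch: the helper `as_rows` is a hypothetical name introduced here, and it compares the rows as strings while ignoring row and column order.

```python
import pandas as pd

left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3'],
                      'E': [1, 2, 3, 4]})
keys = ['key1', 'key2']

# (1) merge on the key columns
by_columns = pd.merge(left, right, on=keys, how='outer')

# (3) move the keys into the index on both sides and merge on the index
by_index = pd.merge(left.set_index(keys), right.set_index(keys),
                    left_index=True, right_index=True,
                    how='outer').reset_index()

def as_rows(df):
    """Return the rows as a sorted list of string tuples (hypothetical helper)."""
    cols = sorted(df.columns)
    return sorted(map(tuple, df[cols].astype(str).values.tolist()))

print(as_rows(by_columns) == as_rows(by_index))
```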
Now, if you really want to merge on both ['key1', 'key2'] and the index, one way to achieve this (starting again from the original left and right, before set_index) is:
>>> pd.merge(left.reset_index(), right.reset_index(), on=['index', 'key1', 'key2'], how='outer')
index key1 key2 A B C D E
0 0 K0 K0 A0 B0 C0 D0 1.0
1 1 K0 K1 A1 B1 NaN NaN NaN
2 2 K1 K0 A2 B2 C2 D2 3.0
3 3 K2 K1 A3 B3 NaN NaN NaN
4 1 K1 K0 NaN NaN C1 D1 2.0
5 3 K2 K0 NaN NaN C3 D3 4.0
If you have read this far, you should now be able to achieve the above in several different ways.
Hope this helps.
I'm a little stuck. Can you please help me with this? I've simplified the problem I'm facing to the following:
Input
Desired Output
I know how to handle the case where the dictionaries in column c all have the same keys.
You can build a DataFrame from the dictionaries with the constructor, reshape it with stack, and finally join it back to the original:
df1 = (pd.DataFrame(df.c.values.tolist())
.stack()
.reset_index(level=1)
.rename(columns={0:'val','level_1':'key'}))
print (df1)
key val
0 c00 v00
0 c01 v01
1 c10 v10
2 c20 v20
2 c21 v21
2 c22 v22
df = df.drop(columns='c').join(df1).reset_index(drop=True)
print (df)
a b key val
0 a0 b0 c00 v00
1 a0 b0 c01 v01
2 a1 b1 c10 v10
3 a2 b2 c20 v20
4 a2 b2 c21 v21
5 a2 b2 c22 v22
Here is one way:
import numpy as np
import pandas as pd
from itertools import chain
df = pd.DataFrame([['a0', 'b0', {'c00': 'v00', 'c01': 'v01'}],
['a1', 'b1', {'c10': 'v10'}],
['a2', 'b2', {'c20': 'v20', 'c21': 'v21', 'c22': 'v22'}] ],
columns=['a', 'b', 'c'])
# first convert 'c' to list of tuples
df['c'] = df['c'].apply(lambda x: list(x.items()))
lens = list(map(len, df['c']))
# create dataframe
df_out = pd.DataFrame({'a': np.repeat(df['a'].values, lens),
'b': np.repeat(df['b'].values, lens),
'c': list(chain.from_iterable(df['c'].values))})
# unpack tuple
df_out = df_out.join(df_out['c'].apply(pd.Series))\
.rename(columns={0: 'key', 1: 'val'}).drop(columns='c')
# a b key val
# 0 a0 b0 c00 v00
# 1 a0 b0 c01 v01
# 2 a1 b1 c10 v10
# 3 a2 b2 c20 v20
# 4 a2 b2 c21 v21
# 5 a2 b2 c22 v22
Here is my solution:
import pandas as pd
t=pd.DataFrame([['a0','b0',{'c00':'v00','c01':'v01'}],['a1','b1',{'c10':'v10'}],['a2','b2',{'c20':'v20','c21':'v21','c22':'v22'}]],columns=['a','b','c'])
l2=[]
for i in t.index:
    for j in t.loc[i,'c']:
        l2.append([t.loc[i,'a'],t.loc[i,'b'],j,t.loc[i,'c'][j]])
t2=pd.DataFrame(l2,columns=['a','b','key','val'])
Here 't' is your DataFrame, however you obtained it.
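A further option, not used in the answers above, is `DataFrame.explode`, which is available in pandas 0.25 and later. The sketch below assumes the same input frame as the other answers: each dictionary is turned into a list of (key, value) pairs, the list is exploded into one row per pair, and the pairs are then split into two columns.

```python
import pandas as pd

df = pd.DataFrame([['a0', 'b0', {'c00': 'v00', 'c01': 'v01'}],
                   ['a1', 'b1', {'c10': 'v10'}],
                   ['a2', 'b2', {'c20': 'v20', 'c21': 'v21', 'c22': 'v22'}]],
                  columns=['a', 'b', 'c'])

# Turn each dict into a list of (key, value) tuples, then emit one row per tuple.
exploded = (df.assign(c=df['c'].map(lambda d: list(d.items())))
              .explode('c')
              .reset_index(drop=True))

# Split the tuples into separate 'key' and 'val' columns.
pairs = pd.DataFrame(exploded['c'].tolist(), columns=['key', 'val'])
result = pd.concat([exploded.drop(columns='c'), pairs], axis=1)
print(result)
```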
I have two dataframes,
df1 = pd.DataFrame({'A': ['A1', 'A1', 'A2', 'A3'],
'B': ['121', '345', '123', '146'],
'C': ['K0', 'K1', 'K0', 'K1']})
df2 = pd.DataFrame({'A': ['A1', 'A3'],
'BB': ['B0', 'B3'],
'CC': ['121', '345'],
'DD': ['D0', 'D1']})
Now I need the rows where columns A and B of df1 match columns A and CC of df2.
So I tried several merge options, such as:
both_DFS=pd.merge(df1,df2, how='left',left_on=['A','B'],right_on=['A','CC'])
This does not give me the row information from df2, which is what I need. That is, I get all the column names from df2, but their values are just empty or NaN.
Then I tried:
Both_DFs=pd.merge(df1,df2, how='left',left_on=['A','B'],right_on=['A','CC'])[['A','B','CC']]
This gives me the error:
KeyError: "['B'] not in index"
I am aiming for a merged DataFrame with all columns from both df1 and df2. Any suggestions would be great.
Desired output:
Both_DFs
A B C BB CC DD
0 A1 121 K0 B0 121 D0
So in my data frames (df1 and df2), only one row is an exact match on both columns of interest. That is, only one row of columns A and B in df1 exactly matches a row of columns A and CC in df2.
Well, if you declare column A as index, it works:
Both_DFs = pd.merge(df1.set_index('A', drop=True),df2.set_index('A', drop=True), how='left',left_on=['B'],right_on=['CC'], left_index=True, right_index=True).dropna().reset_index()
This results in:
A B C BB CC DD
0 A1 123 K0 B0 121 D0
1 A1 345 K1 B0 121 D0
2 A3 146 K1 B3 345 D1
EDIT
You just needed:
Both_DFs = pd.merge(df1,df2, how='left',left_on=['A','B'],right_on=['A','CC']).dropna()
Which gives:
A B C BB CC DD
0 A1 121 K0 B0 121 D0
You can also use join (a left join by default) or merge, and then, if necessary, remove the rows with NaNs using dropna:
print (df1.join(df2.set_index('A'), on='A').dropna())
A B C BB CC DD
0 A1 123 K0 B0 121 D0
1 A1 345 K1 B0 121 D0
3 A3 146 K1 B3 345 D1
print (pd.merge(df1, df2, on='A', how='left').dropna())
A B C BB CC DD
0 A1 123 K0 B0 121 D0
1 A1 345 K1 B0 121 D0
3 A3 146 K1 B3 345 D1
EDIT:
I think you need an inner join (it is the default, so how='inner' can be omitted):
Both_DFs = pd.merge(df1,df2, left_on=['A','B'],right_on=['A','CC'])
print (Both_DFs)
A B C BB CC DD
0 A1 121 K0 B0 121 D0
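To see which rows of a left merge actually found a partner, the `indicator=True` option of `pd.merge` adds a `_merge` column. A short sketch using the frames exactly as defined in the question; the variable names `checked` and `matched` are introduced here for illustration only.

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A1', 'A1', 'A2', 'A3'],
                    'B': ['121', '345', '123', '146'],
                    'C': ['K0', 'K1', 'K0', 'K1']})
df2 = pd.DataFrame({'A': ['A1', 'A3'],
                    'BB': ['B0', 'B3'],
                    'CC': ['121', '345'],
                    'DD': ['D0', 'D1']})

# Left merge on both key pairs; '_merge' records where each row came from.
checked = pd.merge(df1, df2, how='left',
                   left_on=['A', 'B'], right_on=['A', 'CC'],
                   indicator=True)
print(checked[['A', 'B', 'CC', '_merge']])

# Keep only the rows that matched on both keys, then drop the helper column.
matched = checked[checked['_merge'] == 'both'].drop(columns='_merge')
print(matched)
```

Rows marked left_only had no partner in df2, which is where the NaN values in a left merge come from.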
I don't know if your example shows exactly your problem, but
when we merge on two key columns, both keys must match in the same pair of rows:
df1['A'] == df2['A'] and df1['B'] == df2['CC']
Here no row matches on both keys.
If we merge on just df1['A'], we get something like this:
Both_DFs=pd.merge(df1, df2, how='left', left_on=['A'], right_on=['A'])
A B C BB CC DD
0 A1 123 K0 B0 121 D0
1 A1 345 K1 B0 121 D0
2 A2 121 K0 NaN NaN NaN
3 A3 146 K1 B3 345 D1
If you want to remove the rows that are not in df2, try changing the 'how' argument to 'inner'.
Both_DFs=pd.merge(df1, df2, how='inner', left_on=['A'], right_on=['A'])
A B C BB CC DD
0 A1 123 K0 B0 121 D0
1 A1 345 K1 B0 121 D0
2 A3 146 K1 B3 345 D1
Is this close to what you're looking for?