Using pandas merge, the resulting columns are confusing:
df1 = pd.DataFrame(np.random.randint(0, 100, size=(5, 5)))
df2 = pd.DataFrame(np.random.randint(0, 100, size=(5, 5)))
df2[0] = df1[0] # matching key on the first column.
# Now the weird part.
pd.merge(df1, df2, left_on=0, right_on=0).shape
Out[96]: (5, 9)
pd.merge(df1, df2, left_index=True, right_index=True).shape
Out[102]: (5, 10)
pd.merge(df1, df2, left_on=0, right_on=1).shape
Out[107]: (0, 11)
The number of columns is not fixed, the column labels are also unstable, and worse yet, this behavior is not clearly documented.
I want to read some columns of the resulting data frame, which has many columns (hundreds). Currently I am using .iloc[] because labeling is too much work. But I am worried that this is error prone due to the weird merged result.
What is the correct way to read some columns in the merged data frame?
Python: 2.7.13, Pandas: 0.19.2
Merge key
1.1 Merge on key when the join-key is a column (this is the right solution for you, as you say "df2[0] = df1[0] # matching key on the first column.")
1.2 Merge on index when the merge-key is the index
==> The reason you get one more column in the second merge (pd.merge(df1, df2, left_index=True, right_index=True).shape) is that the initial join key now appears twice, as '0_x' and '0_y'.
Regarding column names
Column names do not change during a merge, UNLESS there are columns with the same name in both dataframes. In that case the columns are renamed as follows; you get:
'initial_column_name'+'_x' (the suffix '_x' is added to the column of the left dataframe (df1))
'initial_column_name'+'_y' (the suffix '_y' is added to the column of the right dataframe (df2) )
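For illustration, here is a minimal sketch (my own example, not from the question) that uses the suffixes argument, so the duplicated labels are easier to recognise than the default '_x'/'_y':
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.randint(0, 100, size=(5, 5)))
df2 = pd.DataFrame(np.random.randint(0, 100, size=(5, 5)))
df2[0] = df1[0]

# Merging on the index: every column name exists in both frames,
# so every column gets a suffix and the result has 10 columns.
merged = pd.merge(df1, df2, left_index=True, right_index=True, suffixes=('_df1', '_df2'))
print(merged.columns.tolist())  # ['0_df1', ..., '4_df1', '0_df2', ..., '4_df2']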
To deal with the 3 different cases for the number of columns in the merged result, I ended up checking the number of columns and then converting the column number indexes for use in .iloc[]. Here is the code, for future searchers.
This is still the best way I know to deal with a huge number of columns. I will mark a better answer if one appears.
Utility method to convert column number index:
import numpy as np

def get_merged_column_index(num_col_df, num_col_df1, num_col_df2, col_df1=[], col_df2=[], joinkey_df1=[], joinkey_df2=[]):
    """Transform column indexes in the old source dataframes into column indexes in the merged dataframe, checking for the different pandas merge result formats.

    :param num_col_df: number of columns in merged dataframe df.
    :param num_col_df1: number of columns in df1.
    :param num_col_df2: number of columns in df2.
    :param col_df1: (list of int) column positions in df1 to keep (0-based).
    :param col_df2: (list of int) column positions in df2 to keep (0-based).
    :param joinkey_df1: (list of int) column positions (0-based). Not implemented yet.
    :param joinkey_df2: (list of int) column positions (0-based). Not implemented yet.
    :return: (list of int) transformed column indexes, 0-based, in the merged dataframe.
    """
    col_df1 = np.array(col_df1)
    col_df2 = np.array(col_df2)
    if num_col_df == num_col_df1 + num_col_df2:  # merge keeps all original columns
        col_df2 += num_col_df1
    elif num_col_df == num_col_df1 + num_col_df2 + 1:  # merge adds a 'key_0' column at the front
        col_df1 += 1
        col_df2 += num_col_df1 + 1
    elif num_col_df <= num_col_df1 + num_col_df2 - 1:  # merge drops (possibly many) duplicated join-key columns from df2; df1 columns keep their order
        raise ValueError('Format of merged result is too complicated.')
    else:
        raise ValueError('Undefined format of merged result.')
    return np.concatenate((col_df1, col_df2)).astype(int).tolist()
Then:
cols_toextract_df1 = []
cols_toextract_df2 = []
converted_cols = get_merged_column_index(num_col_df=df.shape[1], num_col_df1=df1.shape[1], num_col_df2=df2.shape[1], col_df1=cols_toextract_df1, col_df2=cols_toextract_df2)
extracted_df = df.iloc[:, converted_cols]
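A simpler alternative I would consider (a sketch under the assumption that the columns can be given string labels before merging; this is not part of the solution above) is to name the columns up front, so the merged frame can be read by label instead of by position:
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.randint(0, 100, size=(5, 5)), columns=['key', 'a1', 'a2', 'a3', 'a4'])
df2 = pd.DataFrame(np.random.randint(0, 100, size=(5, 5)), columns=['key', 'b1', 'b2', 'b3', 'b4'])
df2['key'] = df1['key']  # same join key as in the question

merged = pd.merge(df1, df2, on='key')  # columns: key, a1..a4, b1..b4
subset = merged[['a1', 'b3']]          # select by label, no positional bookkeeping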
Related
I have a large data frame df and a small data frame df_right with 2 columns a and b. I want to do a simple left join / lookup on a without copying df.
I came up with this code, but I am not sure how robust it is:
dtmp = pd.merge(df[['a']], df_right, on = 'a', how = "left") #one col left join
df['b'] = dtmp['b'].values
I know it certainly fails when there are duplicated keys: pandas left join - why more results?
Is there better way to do this?
Related:
Outer merging two data frames in place in pandas
What are the exact downsides of copy=False in DataFrame.merge()?
You are almost there.
There are 4 cases to consider:
Both df and df_right do not have duplicated keys
Only df has duplicated keys
Only df_right has duplicated keys
Both df and df_right have duplicated keys
Your code fails in cases 3 and 4, since the merge increases the number of rows compared to df. To make it work, you need to choose what information to drop from df_right prior to merging; the purpose is to force any merge into either case 1 or 2.
For example, if you wish to keep "first" values for each duplicated key in df_right, the following code works for all 4 cases above.
dtmp = pd.merge(df[['a']], df_right.drop_duplicates('a', keep='first'), on='a', how='left')
df['b'] = dtmp['b'].values
Alternatively, if column 'b' of df_right consists of numeric values and you wish to use a summary statistic instead:
dtmp = pd.merge(df[['a']], df_right.groupby('a').mean().reset_index(drop=False), on='a', how='left')
df['b'] = dtmp['b'].values
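As a side note, when df_right has been reduced to one row per key, the same lookup can also be written with Series.map, which skips the intermediate dtmp frame entirely (a sketch, not required for the approach above):
# Build a key -> value Series, then map it onto df['a'].
lookup = df_right.drop_duplicates('a', keep='first').set_index('a')['b']
df['b'] = df['a'].map(lookup)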
I'm new to Python. I'm trying to concat 2 csv files to find out the differences. I'm using the Id column as the index to concatenate the values. Since the csv files have duplicate Ids, I'm getting the error below:
ValueError: Shape of passed values is (17, 4), indices imply (13, 4)
The error is on the line:
df_all_changes = pd.concat([old, new],axis=1,keys=['src','tgt'], join='inner')
Q1: How can I handle/remedy the above error?
Q2: Also, I want to know what the line below does:
df_changed = df_all_changes.groupby(level=0, axis=0).apply(lambda frame: frame.apply(report_diff, axis=1))
Q3: what would happen if I give level=1, axis=1 in the above line?
import pandas as pd
#list of key column(s)
key=['Id']
# Read in the two csv files
old = pd.read_csv('Source.csv')
new = pd.read_csv('Target.csv')
#set index
old=old.set_index(key)
new=new.set_index(key)
#identify dropped rows and added (new) rows
dropped_rows = set(old.index) - set(new.index)
added_rows = set(new.index) - set(old.index)
#print(old.loc[dropped_rows])
#combine data
df_all_changes = pd.concat([old,new],axis=1,keys=['src','tgt'],join='inner')
print(df_all_changes)
#swap column indexes
df_all_changes = df_all_changes.swaplevel(axis='columns')#[new.columns[0:]]
# prepare a function for comparing old and new values
def report_diff(x):
    return x[0] if x[0] == x[1] else '{} ---> {}'.format(*x)
#apply the report_diff function
df_changed = df_all_changes.groupby(level=0, axis=0).apply(lambda frame: frame.apply(report_diff, axis=1))
print(df_changed)
You may want to provide examples of what the dataframes look like.
Q1 - because the two dataframes do not have the same number of rows, instead of using pd.concat, I would suggest you use pd.merge:
df_all_changes = pd.merge(new, old, on=['src', 'tgt'], how='inner')
Q2
level=0 means that you want to group by the first level of the (hierarchical) index; and
axis=0 means that you want to split along the rows (this is the default setting).
You should look at the documentation. .apply() simply applies your custom function, which compares the old and new values, to each row (axis=1).
Q3 - mentioned above in Q2.
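If you would rather keep the concat approach from the question, one possible remedy for Q1 (a sketch, assuming you only want the first row per duplicated Id) is to drop duplicated index labels before concatenating:
# Keep only the first row for each duplicated Id, then align on the index as before.
old_unique = old[~old.index.duplicated(keep='first')]
new_unique = new[~new.index.duplicated(keep='first')]
df_all_changes = pd.concat([old_unique, new_unique], axis=1, keys=['src', 'tgt'], join='inner')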
This question already has answers here:
Difference(s) between merge() and concat() in pandas
Say I have the following 2 pandas dataframes:
import pandas as pd
A = [174,-155,-931,301]
B = [943,847,510,16]
C = [325,914,501,884]
D = [-956,318,319,-83]
E = [767,814,43,-116]
F = [110,-784,-726,37]
G = [-41,964,-67,-207]
H = [-555,787,764,-788]
df1 = pd.DataFrame({"A": A, "B": B, "C": C, "D": D})
df2 = pd.DataFrame({"E": E, "B": F, "C": G, "D": H})
If I do concat with join=outer, I get the following resulting dataframe:
pd.concat([df1, df2], join='outer')
If I do df1.combine_first(df2), I get the following:
df1.set_index('B').combine_first(df2.set_index('B')).reset_index()
If I do pd.merge(df1, df2), I get the following which is identical to the result produced by concat:
pd.merge(df1, df2, on=['B','C','D'], how='outer')
And finally, if I do df1.join(df2, how='outer'), I get the following:
df1.join(df2, how='outer', on='B', lsuffix='_left', rsuffix='_right')
I don't fully understand how and why each produces different results.
concat: append one dataframe to another along the given axis (default axis=0, meaning concat along the index, i.e. put the other dataframe below the given dataframe). Data are aligned on the other axis (i.e. for the default setting, columns are aligned). This is why we get NaNs in the non-matching columns 'A' and 'E'.
combine_first: replace NaNs in dataframe by existing values in other dataframe, where rows and columns are pooled (union of rows and cols from both dataframes). In your example, there are no missing values from the beginning but they emerge due to the union operation as your indices have no common entries. The order of the rows results from the sorted combined index (df1.B and df2.B).
So if there are no missing values in your dataframe you wouldn't normally use combine_first.
merge is a database-style combination of two dataframes that offers more options on how to merge (left, right, specific columns) than concat. In your example, the data of the result are identical, but there's a difference in the index between concat and merge: when merging on columns, the dataframe indices will be ignored and a new index will be created.
join merges df1 and df2 on the index of df1 and the given column (in the example 'B') of df2. In your example this is the same as pd.merge(df1, df2, left_on=df1.index, right_on='B', how='outer', suffixes=('_left', '_right')). As there's no match between the index of df1 and column 'B' of df2 there will be a lot of NaNs due to the outer join.
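To make the index point concrete, here is a small sketch using the frames from the question; both calls produce the same (8, 5) block of data here, but the resulting indices differ:
concat_res = pd.concat([df1, df2], join='outer')
merge_res = pd.merge(df1, df2, on=['B', 'C', 'D'], how='outer')

print(concat_res.index.tolist())  # [0, 1, 2, 3, 0, 1, 2, 3]: the original indices are kept
print(merge_res.index.tolist())   # [0, 1, 2, 3, 4, 5, 6, 7]: merging on columns builds a new index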
I have a dataframe in which the first column contains a list of random size, from 0 to around 10 items in each list. This dataframe also contains several other columns of data.
I would like to insert as many columns as the length of the longest list, and then populate the values across them sequentially, so that each new column holds one item from the list in column one.
I was unsure of a good way to go about this.
sample = [[[0,2,3,7,8,9],2,3,4,5],[[1,2],2,3,4,5],[[1,3,4,5,6,7,8,9,0],2,3,4,5]]
headers = ["col1","col2","col3","col4","col5"]
df = pd.DataFrame(sample, columns = headers)
In this example I would like to add 9 columns after column 1, as this is the maximum length of the list in the third row of the dataframe. These columns would be populated with:
0 2 3 7 8 9 NULL NULL NULL in the first row,
1 2 NULL NULL NULL NULL NULL NULL NULL in the second, etc...
Edit to fit OP's edit
This is how I would do it. First I would pad the lists of the original column so that they are all the same length and easier to work with. Afterwards it's a matter of creating the columns and filling them with the value corresponding to the position in the list (the maximum length is computed from the data, so any list size works):
import numpy as np

df = pd.DataFrame(sample, columns=headers)
df = df.rename(columns={'col1': 'col_of_lists'})
max_length = max(df['col_of_lists'].apply(len))
df['col_of_lists'] = df['col_of_lists'].apply(lambda x: x + [np.nan] * (max_length - len(x)))
for i in range(max_length):
    df['col_' + str(i)] = df['col_of_lists'].apply(lambda x: x[i])
The easiest way to turn a series of lists into separate columns is to use apply to convert them into a Series, which triggers the 'expand' result type:
result = df['col1'].apply(pd.Series)
At this point, we can adjust the columns from the automatically numbered to include the name of the original 'col1', for example:
result.columns = [
    'col1_{}'.format(i + 1)
    for i in result.columns]
Finally, we can join it back to the original DataFrame. Using the fact that this was the first column makes it easy, just joining it to the left of the original frame, dropping the original 'col1' in the process:
result = result.join(df.drop('col1', axis=1))
You can even do it all as a one-liner, by using the rename() method to change column names:
df['col1'].apply(pd.Series).rename(
    lambda i: 'col1_{}'.format(i + 1),
    axis=1,
).join(df.drop('col1', axis=1))
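With the sample data above, this should yield columns col1_1 through col1_9 (padded with NaN where a list is shorter than the longest one), followed by the original col2 through col5.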
I am working with two csv files and imported as dataframe, df1 and df2
df1 has 50000 rows and df2 has 150000 rows.
I want to compare (iterating through each row) the 'time' of df2 with
df1, find the smallest difference in time, and return the values of all columns
of the corresponding rows, saving them in df3 (time synchronization).
For example, 35427949712 (of 'time' in df1) is nearest or equal to
35427949712 (of 'time' in df2), so I would like to return the
contents of df1 ('velocity_x' and 'yaw') and df2 ('velocity' and
'yawrate') and save them in df3.
For this I used two techniques, shown in the code below.
Code 1 takes a very long time to execute (72 hours), which is not practical since I have a lot of csv files.
Code 2 gives me a "memory error" and the kernel dies.
It would be great to get a more robust solution to the problem, considering computational time, memory and power (Intel Core i7-6700HQ, 8 GB RAM).
Here is the sample data,
import pandas as pd
df1 = pd.DataFrame({'time': [35427889701, 35427909854, 35427929709, 35427949712, 35428009860],
                    'velocity_x': [12.5451, 12.5401, 12.5351, 12.5401, 12.5251],
                    'yaw': [-0.0787806, -0.0784749, -0.0794889, -0.0795915, -0.0795472]})
df2 = pd.DataFrame({'time': [35427929709, 35427949712, 35427009860, 35427029728, 35427049705],
                    'velocity': [12.6583, 12.6556, 12.6556, 12.6556, 12.6444],
                    'yawrate': [-0.0750492, -0.0750492, -0.074351, -0.074351, -0.074351]})
df3 = pd.DataFrame(columns=['time','velocity_x','yaw','velocity','yawrate'])
Code1
for index, row in df1.iterrows():
    min = 100000
    for indexer, rows in df2.iterrows():
        if abs(float(row['time']) - float(rows['time'])) < min:
            min = abs(float(row['time']) - float(rows['time']))
            # storing the position
            pos = indexer
    df3.loc[index, 'time'] = df1['time'][pos]
    df3.loc[index, 'velocity_x'] = df1['velocity_x'][pos]
    df3.loc[index, 'yaw'] = df1['yaw'][pos]
    df3.loc[index, 'velocity'] = df2['velocity'][pos]
    df3.loc[index, 'yawrate'] = df2['yawrate'][pos]
Code2
df1['key'] = 1
df2['key'] = 1
df1.rename(index=str, columns ={'time' : 'time_x'}, inplace=True)
df = df2.merge(df1, on='key', how ='left').reset_index()
df['diff'] = df.apply(lambda x: abs(x['time'] - x['time_x']), axis=1)
df.sort_values(by=['time', 'diff'], inplace=True)
df=df.groupby(['time']).first().reset_index()[['time', 'velocity_x', 'yaw', 'velocity', 'yawrate']]
You're looking for pandas.merge_asof. It allows you to combine 2 DataFrames on a key, in this case time, without the requirement that they are an exact match. You can choose a direction for prioritizing the match, but in this case it's obvious that you want 'nearest':
A “nearest” search selects the row in the right DataFrame whose ‘on’ key is closest in absolute distance to the left’s key.
One caveat is that you need to sort things for merge_asof to work.
import pandas as pd
pd.merge_asof(df2.sort_values('time'), df1.sort_values('time'), on='time', direction='nearest')
# time velocity yawrate velocity_x yaw
#0 35427009860 12.6556 -0.074351 12.5451 -0.078781
#1 35427029728 12.6556 -0.074351 12.5451 -0.078781
#2 35427049705 12.6444 -0.074351 12.5451 -0.078781
#3 35427929709 12.6583 -0.075049 12.5351 -0.079489
#4 35427949712 12.6556 -0.075049 12.5401 -0.079591
Just be careful about which DataFrame you choose as the left or right frame, as that changes the result. In this case I'm selecting the time in df1 which is closest in absolute distance to the time in df2.
You also need to be careful if you have duplicated 'on' keys in the right df, because for exact matches merge_asof only merges the last sorted row of the right df to the left df, instead of creating multiple entries for each exact match. If that's a problem, you can instead merge the exact keys first to get all of the combinations, and then merge the remainder with asof.
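If timestamps that are too far apart should not be matched at all, merge_asof also accepts a tolerance argument (an integer here, since the keys are integers); rows with no candidate inside the tolerance get NaN instead of a far-away match. A sketch with an assumed threshold of 100000:
pd.merge_asof(df2.sort_values('time'), df1.sort_values('time'),
              on='time', direction='nearest', tolerance=100000)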
Just a side note (not an answer):
min_delta = 100000
for indexer, rows in df2.iterrows():
    if abs(float(row['time']) - float(rows['time'])) < min_delta:
        min_delta = abs(float(row['time']) - float(rows['time']))
        # storing the position
        pos = indexer
can be written as
diff = np.abs(row['time'] - df2['time'])
pos = np.argmin(diff)
(always avoid for loops)
and don't name your variables after built-ins (min).
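For completeness, a rough sketch of how that vectorized lookup could replace the whole inner loop while still building df3 row by row (keeping the column names from the question; this is only an illustration, not a replacement for merge_asof):
import numpy as np
import pandas as pd

rows_out = []
for _, row in df1.iterrows():
    # position of the df2 row whose 'time' is closest to this df1 row
    pos = np.argmin(np.abs(df2['time'].values - row['time']))
    rows_out.append({'time': row['time'],
                     'velocity_x': row['velocity_x'],
                     'yaw': row['yaw'],
                     'velocity': df2['velocity'].iloc[pos],
                     'yawrate': df2['yawrate'].iloc[pos]})
df3 = pd.DataFrame(rows_out, columns=['time', 'velocity_x', 'yaw', 'velocity', 'yawrate'])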