How to get Python function to update original df [duplicate] - python

I have a large data frame df and a small data frame df_right with 2 columns a and b. I want to do a simple left join / lookup on a without copying df.
I came up with this code, but I am not sure how robust it is:
dtmp = pd.merge(df[['a']], df_right, on='a', how='left')  # one-column left join
df['b'] = dtmp['b'].values
I know it certainly fails when there are duplicated keys: pandas left join - why more results?
Is there better way to do this?
Related:
Outer merging two data frames in place in pandas
What are the exact downsides of copy=False in DataFrame.merge()?

You are almost there.
There are 4 cases to consider:
Both df and df_right do not have duplicated keys
Only df has duplicated keys
Only df_right has duplicated keys
Both df and df_right have duplicated keys
Your code fails in cases 3 and 4 because the merge increases the row count beyond that of df. To make it work, you need to decide which information to drop from df_right before merging. The purpose of this is to force any merge into either case 1 or case 2.
For example, if you wish to keep "first" values for each duplicated key in df_right, the following code works for all 4 cases above.
dtmp = pd.merge(df[['a']], df_right.drop_duplicates('a', keep='first'), on='a', how='left')
df['b'] = dtmp['b'].values
Alternatively, if column 'b' of df_right consists of numeric values and you wish to use a summary statistic instead:
dtmp = pd.merge(df[['a']], df_right.groupby('a').mean().reset_index(drop=False), on='a', how='left')
df['b'] = dtmp['b'].values
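To see why the dedup step matters, here is a minimal sketch with toy data (the frames are made up for illustration) showing case 3 failing with the naive merge and working after drop_duplicates:

```python
import pandas as pd

# Toy data for case 3: df has unique keys, df_right has a duplicated key.
df = pd.DataFrame({'a': [1, 2, 3]})
df_right = pd.DataFrame({'a': [1, 1, 2], 'b': [10, 11, 20]})

# Naive merge grows the row count (key 1 matches twice), so the
# positional assignment df['b'] = dtmp['b'].values would misalign.
naive = pd.merge(df[['a']], df_right, on='a', how='left')
print(len(naive))  # 4 rows, no longer aligned with df's 3 rows

# Dropping duplicated keys first keeps the row count stable.
dtmp = pd.merge(df[['a']], df_right.drop_duplicates('a', keep='first'),
                on='a', how='left')
df['b'] = dtmp['b'].values
print(df['b'].tolist())  # first two rows get 10 and 20; key 3 has no match
```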

Joining two dataframes on subvalue of the key column

I am currently trying to join / merge two dfs on the column Key, where in df1 the key is a standalone value such as 5, but in df2 the key can consist of multiple values such as [5,6,13].
For example like this:
df1 = pd.DataFrame({'key': [["5","6","13"],["10","7"],["6","8"]]})
df2 = pd.DataFrame({'sub_key': ["5","10","6"]})
However, my df are a lot bigger and consist of many columns, so an efficient solution would be great.
As a result I would like to have a table like this:
Key1 | Key2
-----|--------
5    | 5,6,13
10   | 10,7
and so on ....
I already tried to apply this approach to my code, but it didn't work:
df1['join'] = 1
df2['join'] = 1
merged= df1.merge(df2, on='join').drop('join', axis=1)
df2.drop('join', axis=1, inplace=True)
merged['match'] = merged.apply(lambda x: x.key(x.sub_key), axis=1).ge(0)
I also tried to split and explode the column and join on single values, but not all column values were split correctly, and I would need to combine everything back into one cell after joining.
Help would be much appreciated!
If you only want to match the first key:
df1['sub_key'] = df1.key.str[0]
df1.merge(df2)
If you want to match ANY key:
df3 = df1.explode('key').rename(columns={'key':'sub_key'})
df3 = df3.join(df1)
df3.merge(df2)
Edit: First version had a small bug, fixed it.
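Put together with the toy frames from the question, the "match ANY key" approach looks like this (each list element gets its own row, the original list column is joined back in, and the inner merge keeps only rows whose sub-key appears in df2):

```python
import pandas as pd

# Toy frames from the question: list-valued 'key' in df1.
df1 = pd.DataFrame({'key': [["5", "6", "13"], ["10", "7"], ["6", "8"]]})
df2 = pd.DataFrame({'sub_key': ["5", "10", "6"]})

# Explode the lists so each sub-key gets its own row, keep a pointer
# back to the original row via the index, then merge on the scalar key.
df3 = df1.explode('key').rename(columns={'key': 'sub_key'})
df3 = df3.join(df1)      # bring the original list column back
result = df3.merge(df2)  # inner join on 'sub_key'
print(result)
```

Note that a row of df1 produces one output row per matching sub-key, so the first list (["5","6","13"]) appears twice here, once for "5" and once for "6".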

Join in Pandas Dataframe using conditional join statement

I am trying to join two dataframes, df1 and df2 (the sample data was shown as images in the original post).
I want to join these two dataframes with the following condition: if 'col2' of df2 is blank/NULL, the join should occur only on 'column1' of df1 and 'col1' of df2; if it is not blank/NULL, the join should occur on two conditions, i.e. 'column1' and 'column2' of df1 with 'col1' and 'col2' of df2 respectively.
For reference, the final dataframe that I wish to obtain was also shown in the original post.
My current approach is to slice these 2 dataframes into 4 and then join them separately based on the condition. Is there any way to do this without slicing them, or maybe a better way that I'm missing?
The idea is to rename the columns and do a left join on both columns first, then fill the remaining missing values by matching on column1 only. Here it is necessary to remove duplicates with DataFrame.drop_duplicates before Series.map, so that the values in col1 are unique:
df22 = df2.rename(columns={'col1':'column1','col2':'column2'})
df = df1.merge(df22, on=['column1','column2'], how='left')
s = df2.drop_duplicates('col1').set_index('col1')['col3']
df['col3'] = df['col3'].fillna(df['column1'].map(s))
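Since the original frames were posted as images, here is a minimal sketch with hypothetical data (column names kept, values invented) showing the two-step fill: row B of df2 has a blank col2, so it only matches via the column1 fallback:

```python
import pandas as pd
import numpy as np

# Hypothetical data: df2's row for 'B' has a missing col2, so it should
# match on column1 alone; 'C' has no match at all.
df1 = pd.DataFrame({'column1': ['A', 'B', 'C'],
                    'column2': ['x', 'y', 'z']})
df2 = pd.DataFrame({'col1': ['A', 'B'],
                    'col2': ['x', np.nan],
                    'col3': [1, 2]})

df22 = df2.rename(columns={'col1': 'column1', 'col2': 'column2'})
df = df1.merge(df22, on=['column1', 'column2'], how='left')  # exact match
s = df2.drop_duplicates('col1').set_index('col1')['col3']
df['col3'] = df['col3'].fillna(df['column1'].map(s))         # fallback
print(df['col3'].tolist())  # A matched exactly, B via fallback, C stays NaN
```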
EDIT: A general solution working with multiple columns - the first part is the same (a left join); in the second part, a merge on one column is combined with DataFrame.combine_first to replace the missing values:
df22 = df2.rename(columns={'col1':'column1','col2':'column2'})
df = df1.merge(df22, on=['column1','column2'], how='left')
df23 = df22.drop_duplicates('column1').drop('column2', axis=1)
df = df.merge(df23, on='column1', how='left', suffixes=('','_'))
cols = df.columns[df.columns.str.endswith('_')]
df = df.combine_first(df[cols].rename(columns=lambda x: x.strip('_'))).drop(cols, axis=1)
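A small sketch of the general solution on hypothetical two-value-column data (again, the original frames were images, so the values here are invented). The second merge attaches fallback columns with a '_' suffix, and combine_first fills the gaps left by the exact match:

```python
import pandas as pd
import numpy as np

# Hypothetical data with two value columns (col3, col4) to carry over.
df1 = pd.DataFrame({'column1': ['A', 'B'], 'column2': ['x', 'y']})
df2 = pd.DataFrame({'col1': ['A', 'B'], 'col2': ['x', np.nan],
                    'col3': [1, 2], 'col4': [10, 20]})

df22 = df2.rename(columns={'col1': 'column1', 'col2': 'column2'})
df = df1.merge(df22, on=['column1', 'column2'], how='left')   # exact match
df23 = df22.drop_duplicates('column1').drop('column2', axis=1)
df = df.merge(df23, on='column1', how='left', suffixes=('', '_'))
cols = df.columns[df.columns.str.endswith('_')]
# Fill NaNs in col3/col4 from the suffixed fallback columns, then drop them.
df = df.combine_first(df[cols].rename(columns=lambda x: x.strip('_'))).drop(cols, axis=1)
print(df[['col3', 'col4']])
```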

How to take data from another dataframe based on 2 columns

Here is my code
A['period_id'] = A['period_number','Session'].map(B.set_index(['period_number','Session'])['period_id'])
So I want to take data from column period_id of B and add it to A, based on the criterion that two columns (period_number and Session) match. However, it gave me an error. What can I do?
You can use pd.merge:
A_columns = list(A.columns) + ['period_id']  # Index.append does not modify in place, so build a plain list
# merge based on period_number and Session
merged_df = pd.merge(A, B, how='left', left_on=['period_number','Session'], right_on = ['period_number','Session'])
final_df = merged_df[A_columns] # filter for only columns in A + `period_id` from B
Note that if A's column names for period_number and Session are different, you'll have to adjust your left_on, and likewise right_on for B. To be explicit: A is the left dataframe here, and B is the right dataframe.
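As a quick check, here is the merge-based lookup on small made-up frames (the original A and B were not shown, so the columns and values are assumptions):

```python
import pandas as pd

# Toy frames standing in for A and B.
A = pd.DataFrame({'period_number': [1, 2], 'Session': ['am', 'pm']})
B = pd.DataFrame({'period_number': [1, 2], 'Session': ['am', 'pm'],
                  'period_id': [100, 200]})

# Left join on both key columns pulls period_id into A's rows.
merged_df = pd.merge(A, B, how='left', on=['period_number', 'Session'])
print(merged_df['period_id'].tolist())
```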

pandas - why will left join introduce new values and a lot of duplicates?

When performing an operation like
df1 = pd.DataFrame({'idNo':[1,2,3], 'value_1':[0,1,0]})
df2 = pd.DataFrame({'idNo':[1,2,3], 'value_2':[1,1,0]})
merged_data = pd.merge(df1, df2, on='idNo', how='left')
print(df1.shape)
print(merged_data.shape)
merged_data.duplicated(subset=['idNo']).sum()
How can it be that merged_data.duplicated(...).sum() is not 0 (it is 0 for this minimal example)? And if it is > 0, can I safely drop the duplicates? Is pandas joining via the index and messing something up?
For my real data read from a CSV I see the problem that a lot of duplicated values will be introduced for such a left join operation, but do not understand why. Is it safe to simply drop the duplicates?
edit
This basically only concatenates columns. Maybe there is a better operation in pandas that will not cause duplicates?
You have a duplicated 'idNo' in one of your dfs:
df1 = pd.DataFrame({'idNo':[1,2,3], 'value':[0,1,0]})
df2 = pd.DataFrame({'idNo':[1,2,3,3], 'value':[1,1,0,1]})
merged_data = pd.merge(df1, df2, on='idNo', how='left')
print(df1.shape)
print(merged_data.shape)
merged_data.duplicated(subset=['idNo']).sum()
(3, 2)
(4, 3)
1
This makes perfect sense!
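Whether you can "safely" drop the duplicates depends on which of df2's rows you want to keep. One option, a sketch using the same toy frames, is to deduplicate the right frame before merging so the row count of df1 is preserved:

```python
import pandas as pd

# Same toy frames as above: df2 carries a duplicated idNo of 3.
df1 = pd.DataFrame({'idNo': [1, 2, 3], 'value': [0, 1, 0]})
df2 = pd.DataFrame({'idNo': [1, 2, 3, 3], 'value': [1, 1, 0, 1]})

# Decide which duplicate to keep in df2 *before* merging
# (drop_duplicates keeps the first occurrence by default).
merged = pd.merge(df1, df2.drop_duplicates('idNo'), on='idNo', how='left')
print(merged.shape)  # row count matches df1 again
```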

How to read the result of pandas merge?

Using pandas merge, the resulting columns are confusing:
df1 = pd.DataFrame(np.random.randint(0, 100, size=(5, 5)))
df2 = pd.DataFrame(np.random.randint(0, 100, size=(5, 5)))
df2[0] = df1[0] # matching key on the first column.
# Now the weird part.
pd.merge(df1, df2, left_on=0, right_on=0).shape
Out[96]: (5, 9)
pd.merge(df1, df2, left_index=True, right_index=True).shape
Out[102]: (5, 10)
pd.merge(df1, df2, left_on=0, right_on=1).shape
Out[107]: (0, 11)
The number of columns is not fixed, the column labels are unstable, and worse yet, this behavior is not documented clearly.
I want to read some columns of the resulting data frame, which has many columns (hundreds). Currently I am using .iloc[] because labeling is too much work. But I am worried that this is error-prone due to the weird merged result.
What is the correct way to read some columns in the merged data frame?
Python: 2.7.13, Pandas: 0.19.2
Merge key
1.1 Merge on a key column when the join key is a column (this is the right solution for you, since you say "df2[0] = df1[0] # matching key on the first column.")
1.2 Merge on index when the merge key is the index
==> The reason you get one more column in the second merge (pd.merge(df1, df2, left_index=True, right_index=True).shape) is that the initial join key now appears twice, as '0_x' and '0_y'.
Regarding column names
Column names do not change during a merge, UNLESS both dataframes have columns with the same name. Those overlapping columns are renamed as follows:
'initial_column_name' + '_x' (the suffix '_x' is added to the column from the left dataframe, df1)
'initial_column_name' + '_y' (the suffix '_y' is added to the column from the right dataframe, df2)
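The suffixing is easy to see on two tiny named frames (made up for illustration) that share a non-key column:

```python
import pandas as pd

# Both frames have a 'val' column; only 'key' is used as the join key.
df1 = pd.DataFrame({'key': [1, 2], 'val': [10, 20]})
df2 = pd.DataFrame({'key': [1, 2], 'val': [30, 40]})

merged = pd.merge(df1, df2, on='key')
print(list(merged.columns))  # ['key', 'val_x', 'val_y']
```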
To deal with the 3 different cases for the number of columns in the merged result, I ended up checking the number of columns and then converting the column-number indexes for use in .iloc[]. Here is the code, for future searchers.
This is still the best way I know to deal with a huge number of columns. I will mark a better answer if one appears.
Utility method to convert column number index:
def get_merged_column_index(num_col_df, num_col_df1, num_col_df2, col_df1=[], col_df2=[], joinkey_df1=[], joinkey_df2=[]):
"""Transform the column indexes in old source dataframes to column indexes in merged dataframe. Check for different pandas merged result formats.
:param num_col_df: number of columns in merged dataframe df.
:param num_col_df1: number of columns in df1.
:param num_col_df2: number of columns in df2.
:param col_df1: (list of int) column position in df1 to keep (0-based).
:param col_df2: (list of int) column position in df2 to keep (0-based).
:param joinkey_df1: (list of int) column position (0-based). Not implemented now.
:param joinkey_df2: (list of int) column position (0-based). Not implemented now.
:return: (list of int) transformed column indexes, 0-based, in merged dataframe.
"""
col_df1 = np.array(col_df1)
col_df2 = np.array(col_df2)
if num_col_df == num_col_df1 + num_col_df2: # merging keeps same old columns
col_df2 += num_col_df1
elif num_col_df == num_col_df1 + num_col_df2 + 1: # merging add column 'key_0' to the head
col_df1 += 1
col_df2 += num_col_df1 + 1
elif num_col_df <= num_col_df1 + num_col_df2 - 1: # merging deletes (possibly many) duplicated "join-key" columns in df2, keep and do not change order columns in df1.
raise ValueError('Format of merged result is too complicated.')
else:
raise ValueError('Undefined format of merged result.')
return np.concatenate((col_df1, col_df2)).astype(int).tolist()
Then:
cols_toextract_df1 = []
cols_toextract_df2 = []
converted_cols = get_merged_column_index(num_col_df=df.shape[1], num_col_df1=df1.shape[1], num_col_df2=df2.shape[1], col_df1=cols_toextract_df1, col_df2=cols_toextract_df2)
extracted_df = df.iloc[:, converted_cols]
