How to merge two dataframes without getting additional rows? - python

Basically, I have two dataframes, the first one looks like this:
And the second one like this:
I want to take the columns "lat" and "lnt" from the second one and add them to the first one, but only where the city name matches in both dataframes. I tried pd.merge(), but it's creating new rows with duplicated values.
If possible, I would like to put a NaN in the rows which didn't have any match at all, but I don't want to remove nor add rows to the original dataframe.

The Pandas merge function defaults to an inner join. Since you want to merge the columns of df2 into df1, you should use a left join. This will give you all the rows of df1 and the matching values from df2; rows of df1 with no match get NaN in the new columns.
df3 = df1.merge(df2, on='city', how='left')
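As a minimal sketch with made-up data (the city names and the population column are invented for illustration):
import pandas as pd

# hypothetical data, just to show the behaviour
df1 = pd.DataFrame({'city': ['Rio', 'Sao Paulo', 'Recife'], 'population': [6.7, 12.3, 1.6]})
df2 = pd.DataFrame({'city': ['Rio', 'Sao Paulo'], 'lat': [-22.9, -23.5], 'lnt': [-43.2, -46.6]})

# left join keeps every row of df1; cities absent from df2 get NaN in lat/lnt
df3 = df1.merge(df2, on='city', how='left')
print(df3)  # Recife keeps its row, with NaN for lat and lnt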

merged_df = df1.merge(df2, how='inner', on=['City'])

Related

How to check if two pandas dataframes have same values and concatenate those rows?

I have a DataFrame called "df" with 4 numerical columns: [frame, id, x, y].
I made a loop that creates two dataframes called df1 and df2. Both df1 and df2 are subsets of the original dataframe.
What I want to do (and can't figure out how) is this: I want to CHECK whether df1 and df2 have the same VALUES in the column called "id". If they do, I want to concatenate those rows of df2 (the ones with matching id values) to df1.
For example: if df1 has rows with id values (1, 6, 4, 8) and df2 has id values (12, 7, 8, 10), I want to concatenate the df2 rows that have id value = 8 to df1. That is all I need.
This is my code:
for i in range(0, max(df['frame']), 30):
    df1 = df[df['frame'].between(i, i+30)]
    df2 = df[df['frame'].between(i-30, i)]
There are several ways to accomplish what you need.
The simplest one is to get the slice of df2 that contains the values you need with .isin() and concatenate it with df1 in one line.
df3 = pd.concat([df1, df2[df2.id.isin(df1.id)]], axis = 0)
To gain more control and avoid any errors that might stem from df1 and df2 being updated elsewhere, you may want to take this one-liner apart.
look_for_vals = set(df1['id'].tolist())
# do some stuff
need_ix = df2[df2["id"].isin(look_for_vals)].index
# do more stuff
df3 = pd.concat([df1, df2.loc[need_ix,:]], axis=0)
Instead of set() you may also use df1['id'].unique()
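As a quick illustration with toy data based on the id values in the question (the other columns are invented):
import pandas as pd

# toy data: df1 has ids (1, 6, 4, 8), df2 has ids (12, 7, 8, 10)
df1 = pd.DataFrame({'frame': [0, 1, 2, 3], 'id': [1, 6, 4, 8], 'x': [10, 20, 30, 40], 'y': [1, 2, 3, 4]})
df2 = pd.DataFrame({'frame': [30, 31, 32, 33], 'id': [12, 7, 8, 10], 'x': [50, 60, 70, 80], 'y': [5, 6, 7, 8]})

# keep only the df2 rows whose id also appears in df1, then append them to df1
df3 = pd.concat([df1, df2[df2.id.isin(df1.id)]], axis=0)
print(df3)  # df1 plus the single df2 row with id == 8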

Concatenating DataFrames where DataFrame1 contains the missing values of DataFrame2 (Column Specific). DataFrame1 does not have NaN values

I need to concatenate two DataFrames where both dataframes have a column named 'sample ids'. The first dataframe has all the relevant information needed; however, its sample ids column is missing all the sample ids that are in the second dataframe. Is there a way to insert the 'missing' sample ids (IN SEQUENTIAL ORDER) into the first dataframe using the second dataframe?
I have tried the following:
pd.concat([DF1,DF2],axis=1)
this did retain all information from both DataFrames, but the sample ids from both dataframes were separated into different columns.
pd.merge(DF1,DF2,how='outer/inner/left/right')
this did not produce the desired outcome in the least...
I have shown the templates of the two dataframes below. Please help, my brain is exploding!!!
DataFrame 2
DataFrame 1
If you want to:
insert the 'missing' sample ids (IN SEQUENTIAL ORDER) into the first
dataframe using the second dataframe
you can use an outer join by .merge() with how='outer', as follows:
df_out = df1.merge(df2, on="samp_id", how='outer')
To ensure the samp_id values are IN SEQUENTIAL ORDER, you can additionally sort on samp_id using .sort_values(), as follows:
df_out = df1.merge(df2, on="samp_id", how='outer').sort_values('samp_id', ignore_index=True)
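For example, with two small frames (the sample ids and the extra column are invented for the sketch):
import pandas as pd

# hypothetical templates: df1 holds data for some sample ids, df2 lists the missing ones
df1 = pd.DataFrame({'samp_id': [1, 3, 5], 'value': ['a', 'b', 'c']})
df2 = pd.DataFrame({'samp_id': [2, 4]})

# the outer merge keeps every samp_id from both frames; ids missing from df1 get NaN in 'value'
df_out = df1.merge(df2, on='samp_id', how='outer').sort_values('samp_id', ignore_index=True)
print(df_out)  # samp_id 1..5 in order, with NaN in 'value' for ids 2 and 4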
Try this:
df = df1.merge(df2, on="samp_id")

Python - Merge dataframes where column contains a specific value

I'm using Python Pandas to merge two dataframe, like so:
new_df = pd.merge(df1, df2, 'inner', left_on='Zip_Code', right_on='Zip_Code_List')
However, I would like to do this ONLY where another column ('Business_Name') in df2 contains a certain value. How do I do this? So, something like, "When Business Name is Walmart, merge these two dataframes."
# You can filter df2 first and then merge.
pd.merge(df1, df2.query("Business_Name == 'Walmart'"), how='inner', left_on='Zip_Code', right_on='Zip_Code_List')
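A minimal sketch of that, with zip codes and business names invented for illustration:
import pandas as pd

# hypothetical data, only to show the filter-then-merge pattern
df1 = pd.DataFrame({'Zip_Code': [10001, 10002, 10003]})
df2 = pd.DataFrame({'Zip_Code_List': [10001, 10002, 10002],
                    'Business_Name': ['Walmart', 'Target', 'Walmart']})

# restrict df2 to Walmart rows before merging, so only those rows can match
merged = pd.merge(df1, df2.query("Business_Name == 'Walmart'"),
                  how='inner', left_on='Zip_Code', right_on='Zip_Code_List')
print(merged)  # only the Walmart rows for 10001 and 10002 survive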

Pandas join two tables but not sure with which column in df2 ('or' syntax?)

I would like to join two tables together. Both tables are very large (around 1m rows). The problem is, it's not always clear which column the join needs to be made on. Ideally the program should try to join on col_x and, if that fails, try col_y.
I would need to do an "or" logic as follows:
df3 = pd.merge(df1, df2, left_on='col1', right_on='col_x' or 'col_y', how='left')
Any suggestions on how this is best implemented are appreciated.
I would create a new column that contains the values you want to merge on first. Haven't tested, but I think it would be something like
# first create new column
df2['merge_col'] = df2['col_x']
# replace empty values
empty_rows = df2['merge_col'].isnull()
df2.loc[empty_rows, 'merge_col'] = df2.loc[empty_rows, 'col_y']
# merge with the new column
df3 = pd.merge(df1, df2, left_on='col1', right_on='merge_col', how='left')
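If the missing entries in col_x are NaN rather than empty strings, the same merge column can be built in one step with .fillna(), which fills the gaps from col_y by index; this is an equivalent sketch of the idea above, not tested against your data:

# one-step version of the merge column (assumes missing values in col_x are NaN)
df2['merge_col'] = df2['col_x'].fillna(df2['col_y'])
df3 = pd.merge(df1, df2, left_on='col1', right_on='merge_col', how='left')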
Have you tried something like the below or do you require the check to be done within the merge function?
if df3["col_x"] == "":
df3=pd.merge(df1,df2,left_on'col1', right_on='col_y', how='left')
else:
df3=pd.merge(df1,df2,left_on'col1', right_on='col_x', how='left')

How do I join two dataframes (pandas) with different indices?

I'm working on a way to transform sequence/genotype data from a csv format to a genepop format.
I have two dataframes: df1 is empty; its index (rows = samples) is almost the same as df2.index, except that I inserted "POP" in several places (to mark the different populations). df2 holds the data, with loci as columns.
I want to insert the values from df2 into df1, keeping empty rows where df1.index = 'POP'.
I tried join, combine, combine_first and concat, but they all seem to keep only the rows that exist in both dfs.
Is there a way to do this?
It sounds like you want an 'outer' join:
df1.join(df2, how='outer')
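As a sketch with made-up sample names and loci (the real index and columns aren't shown in the question):
import pandas as pd

# hypothetical layout: df1 is empty, but its index holds the sample names plus inserted 'POP' markers
df1 = pd.DataFrame(index=['POP', 'sample1', 'sample2', 'POP', 'sample3'])
df2 = pd.DataFrame({'Locus1': ['0101', '0102', '0201'],
                    'Locus2': ['0303', '0304', '0401']},
                   index=['sample1', 'sample2', 'sample3'])

# the outer join keeps every index label from both frames; the 'POP' rows survive with NaN for every locus
out = df1.join(df2, how='outer')
print(out)
If the join reorders the rows and the original order matters (it usually does for a genepop file), df2.reindex(df1.index) is an alternative that keeps df1's row order, assuming df1's index already lists every sample.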
