Merge two dataframes on one common column - python

I have two data frames, one with three columns and another with two columns. Two columns are common to both data frames.
I have to fill in the Marks column of df1 from df2 only where the value is missing, and keep the existing values in df1 unchanged.
I have tried pd.merge, but the result contains a separate column, which is not what I intended.

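The following worked for me, once the merge had produced the Marks_x and Marks_y columns:
# Keep Marks_x where it is present, otherwise fall back to Marks_y
df1['Marks_x'] = df1['Marks_x'].combine_first(df1['Marks_y'])
# Drop the helper column and restore the original column name
df1 = df1.drop(columns='Marks_y').rename(columns={'Marks_x': 'Marks'})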
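For reference, here is a minimal end-to-end sketch of the same idea, including the merge step that produces Marks_x and Marks_y. The key column name Name, the Grade column and all sample values are assumptions made for illustration; the question does not show the actual data.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "Name" is an assumed key column, not taken from the question
df1 = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Marks": [90.0, np.nan, 70.0],
    "Grade": ["x", "y", "z"],
})
df2 = pd.DataFrame({"Name": ["A", "B", "C"], "Marks": [10.0, 85.0, 20.0]})

# A left merge keeps every row of df1; the overlapping column gets
# the default suffixes _x (from df1) and _y (from df2)
merged = df1.merge(df2, on="Name", how="left")

# Keep df1's value where present, otherwise take df2's value
merged["Marks_x"] = merged["Marks_x"].combine_first(merged["Marks_y"])
result = merged.drop(columns="Marks_y").rename(columns={"Marks_x": "Marks"})
print(result)
```

Only the missing value for B is filled from df2; the existing marks for A and C are left as they were.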

Related

How to check if two pandas dataframes have same values and concatenate those rows?

I have a DataFrame called "df" with 4 numerical columns [frame, id, x, y].
I made a loop that creates two dataframes called df1 and df2. Both df1 and df2 are subsets of the original dataframe.
What I want to do (and I do not understand how) is this: I want to check whether df1 and df2 have the same values in the column called "id". If they do, I want to concatenate those rows of df2 (the ones with the same id values) to df1.
For example: if df1 has rows with the id values (1,6,4,8) and df2 has the id values (12,7,8,10), I want to concatenate the df2 rows that have id = 8 to df1. That is all I need.
This is my code:
for i in range(0, max(df['frame']), 30):
    df1 = df[df['frame'].between(i, i+30)]
    df2 = df[df['frame'].between(i-30, i)]
There are several ways to accomplish what you need.
The simplest one is to get the slice of df2 that contains the values you need with .isin() and concatenate it with df1 in one line.
df3 = pd.concat([df1, df2[df2.id.isin(df1.id)]], axis = 0)
To gain more control and avoid any errors that might stem from updating df1 and df2 elsewhere, you may want to take this one-liner apart.
look_for_vals = set(df1['id'].tolist())
# do some stuff
need_ix = df2[df2["id"].isin(look_for_vals)].index
# do more stuff
df3 = pd.concat([df1, df2.loc[need_ix,:]], axis=0)
Instead of set() you may also use df1['id'].unique()
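As a quick check of the .isin() approach, here is a small sketch using the id values from the question. The x values are made up, and ignore_index=True is an optional addition (not part of the answer above) that renumbers the rows of the result.

```python
import pandas as pd

# Toy data using the id values from the question; the x values are made up
df1 = pd.DataFrame({"id": [1, 6, 4, 8], "x": [0.1, 0.2, 0.3, 0.4]})
df2 = pd.DataFrame({"id": [12, 7, 8, 10], "x": [1.1, 1.2, 1.3, 1.4]})

# Boolean mask: True for df2 rows whose id also occurs in df1
shared = df2["id"].isin(df1["id"])

# Append only the matching df2 rows to df1
df3 = pd.concat([df1, df2[shared]], axis=0, ignore_index=True)
print(df3)
```

Only the df2 row with id 8 is appended, so df3 has five rows.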

How to merge two differently sized dataframes based on date

I am trying to merge two dataframes based on a date column with this code:
data_df = (pd.merge(data, one_min_df, on='date', how='outer'))
The first dataframe has 3784 rows and the second dataframe has 3764. Every date in the second dataframe also appears in the first. I would like to merge the dataframes on the date column, with any dates that only the longer dataframe has left blank or NaN.
The code I have here gives the 3764 values followed by 20 empty rows, rather than matching them correctly.
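Try converting both date columns to datetime so the keys compare equal, and use a left join so that every row of the longer dataframe is kept: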
data['date'] = pd.to_datetime(data['date'])
one_min_df['date'] = pd.to_datetime(one_min_df['date'])
data_df = pd.merge(data, one_min_df, on='date', how='left')
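One possible reason the original merge did not line up is that the two date columns were stored in different formats. The sketch below illustrates that case under assumed data: the same timestamps written as two different string formats, with made-up columns a and b. It may not be the exact cause in the question.

```python
import pandas as pd

# Hypothetical data: the same timestamps written in two different string formats
data = pd.DataFrame({
    "date": ["2021-01-01 09:00", "2021-01-01 09:01", "2021-01-01 09:02"],
    "a": [1, 2, 3],
})
one_min_df = pd.DataFrame({
    "date": ["2021-01-01 09:00:00", "2021-01-01 09:02:00"],
    "b": [10.0, 30.0],
})

# As plain strings, none of the keys are equal, so nothing matches
as_strings = pd.merge(data, one_min_df, on="date", how="left")

# After conversion both columns hold real timestamps, so equal times match
data["date"] = pd.to_datetime(data["date"])
one_min_df["date"] = pd.to_datetime(one_min_df["date"])
data_df = pd.merge(data, one_min_df, on="date", how="left")
print(data_df)
```

The left join keeps all three rows of data, and the 09:01 row, which has no partner in one_min_df, gets NaN in column b.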

Concatenating DataFrames where DataFrame1 contains the missing values of DataFrame2 (Column Specific). DataFrame1 does not have NaN values

I need to concatenate two DataFrames where both dataframes have a column named 'sample ids'. The first dataframe has all the relevant information needed, however the sample ids column in the first dataframe is missing all the sample ids that are within the second dataframe. Is there a way to insert the 'missing' sample ids (IN SEQUENTIAL ORDER) into the first dataframe using the second dataframe?
I have tried the following:
pd.concat([DF1,DF2],axis=1)
This did retain all information from both DataFrames, but the sample ids from the two dataframes were separated into different columns.
pd.merge(DF1,DF2,how='outer/inner/left/right')
this did not produce the desired outcome in the least...
I have shown the templates of the two dataframes below. Please help my brain is exploding!!!
DataFrame 2
DataFrame 1
If you want to:
insert the 'missing' sample ids (IN SEQUENTIAL ORDER) into the first
dataframe using the second dataframe
you can use an outer join by .merge() with how='outer', as follows:
df_out = df1.merge(df2, on="samp_id", how='outer')
To make sure the samp_id values are IN SEQUENTIAL ORDER, you can then sort on samp_id using .sort_values(), as follows:
df_out = df1.merge(df2, on="samp_id", how='outer').sort_values('samp_id', ignore_index=True)
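A small sketch of the outer join plus sort on toy data follows. The info column and all values are made up for illustration; only the samp_id name comes from the answer above.

```python
import pandas as pd

# Hypothetical data: df1 lacks ids 3 and 4, which only df2 contains
df1 = pd.DataFrame({"samp_id": [1, 2, 5], "info": ["a", "b", "e"]})
df2 = pd.DataFrame({"samp_id": [3, 4]})

# Outer join keeps ids from both frames; sorting puts them in sequence
df_out = (
    df1.merge(df2, on="samp_id", how="outer")
       .sort_values("samp_id", ignore_index=True)
)
print(df_out)
```

The ids that came only from df2 appear in their sorted position, with NaN in the columns that df1 alone provides.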
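Try this (note that merge defaults to an inner join, so only samp_id values present in both dataframes are kept):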
df = df1.merge(df2, on="samp_id")

Make two different dataframes using parent dataframe

I want to split one dataframe into two different dataframes based on the value of one of its columns.
E.g. df (the parent dataframe):
df has a column MODE with the values swiggy and zomato.
df1 should contain all the rows where MODE = swiggy.
df2 should contain all the rows where MODE = zomato.
I know it's simple, but I am a beginner. Please help. Thanks.
Use df1 = df[df['MODE'] == 'swiggy'] and df2 = df[df['MODE'] == 'Zomato'].
This way, you filter the dataframe on the MODE column and assign each result to a new variable. The comparison is case-sensitive, so the strings must match the values stored in the column exactly.
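Because the question writes the value both as zomato and as Zomato, a sketch that lowercases the column before comparing may be safer. The amount column and the sample rows are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data: the "amount" column and all values are made up
df = pd.DataFrame({
    "MODE": ["swiggy", "Zomato", "swiggy", "zomato"],
    "amount": [120, 250, 90, 310],
})

# Normalise case once so 'Zomato' and 'zomato' are treated alike
mode = df["MODE"].str.lower()

# Boolean masks select the rows for each value of MODE
df1 = df[mode == "swiggy"]
df2 = df[mode == "zomato"]
print(df1)
print(df2)
```

Each result keeps all columns of df and only the rows for its own MODE value.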

inner join with huge dataframes (~2 million columns)

I am trying to join two data frames (df1 and df2) based on matching values from one column (called 'Name') that is found in each data frame. I have tried this using R's inner_join function as well as Python's pandas merge function, and have been able to get both to work successfully on smaller subsets of my data. I think my problem is with the size of my data frames.
My data frames are as follows:
df1 has the 'Name' column with 5 additional columns and has ~900 rows.
df2 has the 'Name' column with ~2 million additional columns and has ~900 rows.
I have tried (in R):
df3 <- inner_join(x = df1, y = df2, by = 'Name')
I have also tried (in Python where df1 and df2 are Pandas data frames):
df3 = df1.merge(right = df2, how = 'inner', left_on = 1, right_on = 0)
(where the 'Name' column is at index 1 of df1 and at index 0 of df2)
When I apply the above to my full data frames, it runs for a very long time and eventually crashes. Additionally, I suspect that the problem may be with the 2 million columns of my df2, so I tried sub-setting it (row-wise) into smaller data frames. My plan was to join the small subsets of df2 with df1 and then row bind the new data frames together at the end. However, joining even the smaller partitioned df2s was unsuccessful.
I would appreciate any suggestions anyone would be able to provide.
Thank you everyone for your help! Using data.table, as @shadowtalker suggested, sped up the process tremendously. Just for reference, in case anyone is trying to do something similar: df1 was approximately 400 MB and my df2 file was approximately 3 GB.
I was able to accomplish the task as follows:
library(data.table)
df1 <- setDT(df1)
df2 <- setDT(df2)
setkey(df1, Name)
setkey(df2, Name)
df3 <- df1[df2, nomatch = 0]
This is a really ugly workaround where I break up df2's columns and add them piece by piece. Not sure it will work, but it might be worth a try:
import numpy as np

# First, I only grab the "Name" column from df2
df3 = df1.merge(right=df2[["Name"]], how="inner", on="Name")
# Then I save all the column headers (excluding
# the "Name" column) in a separate list
df2_columns = df2.columns[np.logical_not(df2.columns.isin(["Name"]))]
# This determines how many columns are going to get added each time.
num_cols_per_loop = 1000
# And this just calculates how many times you'll need to go through the loop
# given the number of columns you set to get added each loop
num_loops = int(len(df2_columns)/num_cols_per_loop) + 1
for i in range(num_loops):
    # For each run of the loop, we determine which columns will get added
    this_column_sublist = df2_columns[i*num_cols_per_loop : (i+1)*num_cols_per_loop]
    # You also need to add the "Name" column to make sure
    # you get the observations in the right order
    this_column_sublist = np.append("Name", this_column_sublist)
    # Finally, merge with just the subset of df2
    df3 = df3.merge(right=df2[this_column_sublist], how="inner", on="Name")
Like I said, it's an ugly workaround, but it just might work.
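To check the idea on a small scale, the chunked merge can be wrapped in a function and compared against a direct merge. This is only a sketch: merge_in_column_chunks is a hypothetical helper name, the toy frames are made up, and it assumes the values in "Name" are unique in both frames (repeated merges on a non-unique key would multiply rows).

```python
import numpy as np
import pandas as pd

def merge_in_column_chunks(left, right, key, chunk_size):
    """Inner-join `right` onto `left` a few columns at a time (hypothetical helper)."""
    other_cols = [c for c in right.columns if c != key]
    out = left
    for start in range(0, len(other_cols), chunk_size):
        # Each chunk carries the key plus up to chunk_size other columns
        cols = [key] + other_cols[start:start + chunk_size]
        out = out.merge(right[cols], how="inner", on=key)
    return out

# Toy frames standing in for the real data; "Name" values are unique
df1 = pd.DataFrame({"Name": ["a", "b", "c"], "v": [1, 2, 3]})
df2 = pd.DataFrame(
    np.arange(15).reshape(3, 5),
    columns=[f"c{i}" for i in range(5)],
)
df2.insert(0, "Name", ["c", "a", "x"])

chunked = merge_in_column_chunks(df1, df2, key="Name", chunk_size=2)
direct = df1.merge(df2, how="inner", on="Name")
print(chunked.equals(direct))
```

On data this small the two results are identical; whether the chunked version actually avoids the crash on the full 2 million columns is not something this sketch can show.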
