I am trying to merge two dataframes (call them DF1 and DF2) that look roughly like the ones below. My goal is:
I want open/close/low/high to all come from DF1.
I want numEvents and Volume to be the sum of the DF1 and DF2 values.
In cases where DF2 has rows that don't exist in DF1, I want open/close/low/high to be NaN (so I can later backfill them), and numEvents and Volume to come from DF2 as is.
Any help is much appreciated!
Use pd.merge with an outer join, since you want data from both DataFrames:
pd.merge(A, B, how='outer', on=<mutual_key>)
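As a minimal, self-contained sketch of how that maps onto the original question (the sample values and the shared datetime index are made up for illustration): open/close/low/high come only from DF1, numEvents and Volume are summed, and rows that exist only in DF2 get NaN in the price columns so they can be backfilled later.
import pandas as pd

# made-up frames sharing a datetime index; column names follow the question
DF1 = pd.DataFrame({'open': [1.0, 2.0], 'high': [1.2, 2.2],
                    'low': [0.9, 1.9], 'close': [1.1, 2.1],
                    'numEvents': [10, 20], 'Volume': [100, 200]},
                   index=pd.to_datetime(['2021-01-01', '2021-01-02']))
DF2 = pd.DataFrame({'numEvents': [5, 7], 'Volume': [50, 70]},
                   index=pd.to_datetime(['2021-01-02', '2021-01-03']))

# outer merge on the index keeps rows from either frame;
# suffixes keep DF2's numEvents/Volume separate so they can be summed afterwards
merged = DF1.merge(DF2, how='outer', left_index=True, right_index=True,
                   suffixes=('', '_df2'))

# numEvents and Volume = DF1 + DF2, treating missing values as 0;
# open/high/low/close stay as they came from DF1 (NaN where only DF2 has the row)
for col in ['numEvents', 'Volume']:
    merged[col] = merged[col].fillna(0) + merged[col + '_df2'].fillna(0)
merged = merged.drop(columns=['numEvents_df2', 'Volume_df2'])
print(merged)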
Use the left_on and right_on parameters of pd.merge(). They let you choose the key fields to merge on:
DF1.merge(DF2, how='outer', left_on=<left_keys>, right_on=<right_keys>)
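For example, a sketch with hypothetical frames whose key column is named differently on each side:
import pandas as pd

DF1 = pd.DataFrame({'date': ['2021-01-01', '2021-01-02'], 'open': [1.0, 2.0]})
DF2 = pd.DataFrame({'timestamp': ['2021-01-02', '2021-01-03'], 'Volume': [50, 70]})

# left_on/right_on name the key column in the left and right frame respectively
merged = DF1.merge(DF2, how='outer', left_on='date', right_on='timestamp')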
How do I join together 4 DataFrames? The names of the DataFrames are called df1, df2, df3, and df4.
They all have the same number of columns, and I am trying to use an 'inner' join.
How would I modify this code to make it work for all four?
I tried using this code and it worked to combine two of them, but I could not figure out how to write it to work for all four DataFrames.
dfJoin = df1.join(df2,how='inner')
print(dfJoin)
You just have to chain together the joins.
dfJoin = (df1.join(df2, how="inner", on="common_column")
             .join(df3, how="inner", on="common_column")
             .join(df4, how="inner", on="common_column"))
Or, if you have more than four, just put them in a list (say df_list) and iterate through it, as sketched below.
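A sketch of that list version, using pd.merge rather than join so the shared key can stay a regular column (the key name common_column is a placeholder for your own column):
from functools import reduce
import pandas as pd

df_list = [df1, df2, df3, df4]

# fold the list into one frame with successive inner merges on the shared key
dfJoin = reduce(lambda left, right: pd.merge(left, right, how="inner", on="common_column"), df_list)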
Joining or appending multiple DataFrames in one go can be done with pd.concat():
list_of_df = [df1, df2, df3, df4]
df = pd.concat(list_of_df, join="inner")
Your question does not state whether you want them merged column-wise or index-wise, but since you state that they have the same number of columns, I assume you wish to append the DataFrames row-wise. For that case the code above works. If you instead want to make a wide DataFrame, set the axis argument to 1.
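As an illustration of both directions (df1 through df4 are the asker's frames; note that the keyword for inner behaviour in pd.concat is join, not how):
import pandas as pd

list_of_df = [df1, df2, df3, df4]

# row-wise append (axis=0, the default); join="inner" keeps only the shared columns
stacked = pd.concat(list_of_df, join="inner")

# column-wise (wide) concatenation; join="inner" keeps only index labels present in every frame
wide = pd.concat(list_of_df, axis=1, join="inner")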
I have two DataFrames with different numbers of rows and columns. Both tables have a few common columns, including "Customer ID". The tables are 11697 rows × 15 columns and 385839 rows × 6 columns respectively. Customer ID might repeat in the second table. I want to concat both tables and merge the similar columns using Customer ID. How can I do that with pandas?
One table looks like this -
and the other one looks like this -
I am using the code below:
pd.concat([df1, df2], sort=False)
I just want to make sure that I am not losing any information. How can I check whether there are multiple entries with one ID, and how can I combine them into one result?
EDIT -
When I am using the above code, here are the before and after counts of NAs in the dataset -
Can someone tell me where I went wrong?
I believe that DataFrame.merge would work in this case:
# use how='outer' to preserve all information from both DataFrames
df1.merge(df2, how='outer', on='customer_id')
DataFrame.join could also work if both DataFrames had their indexes set to customer_id (it is also simpler):
df1 = df1.set_index('customer_id')
df2 = df2.set_index('customer_id')
df1.join(df2, how='outer')
Documentation for DataFrame.merge
Documentation for DataFrame.join
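For the "multiple entries with one ID" part of the question, a quick check before merging might look like this (the column name customer_id is assumed, as in the snippets above):
# count how many times each Customer ID appears in the larger table
counts = df2['customer_id'].value_counts()
print(counts[counts > 1])               # IDs that occur more than once

# True if any ID is repeated at all
print(df2['customer_id'].duplicated().any())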
pd.concat will do the trick here; just set axis to 1 to concatenate along the second axis (columns). You should set the index to customer_id for both DataFrames first:
import pandas as pd
pd.concat([df1.set_index('customer_id'), df2.set_index('customer_id')], axis=1)
If you want to omit rows with empty values as a result of the concatenation, use dropna:
pd.concat([df1.set_index('customer_id'), df2.set_index('customer_id')], axis=1).dropna()
Hello, I have a simple question to which I haven't found a solution.
I have two dataframes df1 and df2:
df1 has several columns and a MultiIndex of year-month-week;
df2 has a MultiIndex of year-week and only one column.
I would like to create an inner join of df1 and df2, joining on 'year' and 'week'.
I have tried to do the following:
df1['newcol'] = df1.index.get_level_values(2).map(lambda x: df2.newcol[x])
This only joins on month (or year?). Is there any way to expand it so that the merge is actually correct?
Thanks in advance!
(previews of df1 and df2 omitted)
Eventually I solved it by removing the MultiIndex, doing a good old inner join on the two columns, and then recreating the MultiIndex at the end.
Here are the snippets:
df=df.reset_index()
df2=df2.reset_index()
df['year']=df['year'].apply(int)
df2['year']=df2['year'].apply(int)
df['week']=df['week'].apply(int)
df2['week']=df2['week'].apply(int)
result = pd.merge(df, df2, how='left', left_on=['year', 'week'], right_on=['year', 'week'])
result = result.set_index(['year', 'month', 'week', 'day'])
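For what it's worth, recent pandas versions can also join directly on index level names, which would avoid the reset_index round trip; a hedged alternative, assuming df2 is indexed by (year, week) and the levels have matching dtypes in both frames:
# join df1's 'year' and 'week' index levels against df2's (year, week) index
result = df1.join(df2, how='inner', on=['year', 'week'])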
I'm using Python Pandas to merge two dataframe, like so:
new_df = pd.merge(df1, df2, 'inner', left_on='Zip_Code', right_on='Zip_Code_List')
However, I would like to do this ONLY where another column ('Business_Name') in df2 contains a certain value. How do I do this? So, something like, "When Business Name is Walmart, merge these two dataframes."
You can filter df2 first and then merge:
pd.merge(df1, df2.query("Business_Name == 'Walmart'"), how='inner', left_on='Zip_Code', right_on='Zip_Code_List')
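Equivalently, with a plain boolean mask, in case the query-string quoting gets awkward:
# keep only the Walmart rows of df2, then merge as before
walmart_rows = df2[df2['Business_Name'] == 'Walmart']
new_df = pd.merge(df1, walmart_rows, how='inner', left_on='Zip_Code', right_on='Zip_Code_List')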
I'm working on a way to transform sequence/genotype data from a csv format to a genepop format.
I have two dataframes: df1 is empty, and df1.index (rows = samples) is almost the same as df2.index, except that I inserted "POP" in several places (to mark the different populations). df2 holds the data, with loci as columns.
I want to insert the values from df2 into df1, keeping empty rows where df1.index = 'POP'.
I tried join, combine, combine_first and concat, but they all seem to keep only the rows that exist in both DataFrames.
Is there a way to do this?
It sounds like you want an 'outer' join:
df1.join(df2, how='outer')
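A minimal, self-contained sketch of that idea with made-up sample and locus names (your real index and column labels come from the csv): df1 supplies the row order, including the inserted 'POP' markers, and the join leaves those rows empty.
import pandas as pd

# made-up sample/locus names purely for illustration
df1 = pd.DataFrame(index=['POP', 'sample1', 'sample2', 'POP', 'sample3'])
df2 = pd.DataFrame({'Locus1': ['0101', '0102', '0201'],
                    'Locus2': ['0303', '0304', '0305']},
                   index=['sample1', 'sample2', 'sample3'])

# every label from both indexes is kept; the 'POP' rows have no match in df2,
# so their locus columns stay empty (NaN)
result = df1.join(df2, how='outer')
print(result)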