I'm working on a way to transform sequence/genotype data from a csv format to a genepop format.
I have two dataframes: df1 is empty, df1.index (rows = samples) consists of almost the same as df2.index, except I inserted "POP" in several places (to specify the different populations). df2 holds the data, with Loci as columns.
I want to insert the values from df2 into df1, keeping empty rows where df1.index = 'POP'.
I tried join, combine, combine_first and concat, but they all seem to take the rows that exist in both df's.
Is there a way to do this?
It sounds like you want an 'outer' join:
df1.join(df2, how='outer')
Related
I got a DF called "df" with 4 numerical columns [frame,id,x,y]
I made a loop that creates two dataframes called df1 and df2. Both df1 and df2 are subseted of the original dataframe.
What I want to do (and I am not understanding how to do it) is this: I want to CHECK if df1 and df2 have same VALUES in the column called "id". If they do, I want to concatenate those rows of df2 (that have the same id values) to df1.
For example: if df1 has rows with different id values (1,6,4,8) and df2 has this id values (12,7,8,10). I want to concatenate df2 rows that have the id value=8 to df1. That is all I need
This is my code:
for i in range(0,max(df['frame']),30):
df1=df[df['frame'].between(i, i+30)]
df2=df[df['frame'].between(i-30, i)]
There are several ways to accomplish what you need.
The simplest one is to get the slice of df2 that contains the values you need with .isin() and concatenate it with df1 in one line.
df3 = pd.concat([df1, df2[df2.id.isin(df1.id)]], axis = 0)
To gain more control and avoid any errors that might stem from updating df1 and df2 elsewhere, you may want to take the apart this one-liner.
look_for_vals = set(df1['id'].tolist())
# do some stuff
need_ix = df2[df2["id"].isin(look_for_vals )].index
# do more stuff
df3 = pd.concat([df1, df2.loc[need_ix,:]], axis=0)
Instead of set() you may also use df1['id'].unique()
I have a big dataframe with 100 rows and the structure is [qtr_dates<datetime.date>, sales<float>] and a small dataframe with same structure with less than 100 rows. I want to merge these two dfs such that merged df will have all the rows from small df and remaining rows will be taken from big df.
Right now I am doing this
df = big_df.merge(small_df, on=big_df.columns.tolist(), how='outer')
But this is creating a df with duplicate qtr_dates.
Use concat with remove duplicates by DataFrame.drop_duplicates:
pd.concat([small_df, big_df], ignore_index=True).drop_duplicates(subset=['qtr_dates'])
If I understand correctly, you want everything from the bigger dataframe, but if that date is present in the smaller data frame you would want it replaced by the relevant value from the smaller one?
Hence I think you want to do this:
df = big_df.merge(small_df, on=big_df.columns.tolist(),how='left',indicator=True)
df = df[df._merge!= "both"]
df_out = pd.concat([df,small_df],ignore_index=True)
This would remove any rows from the big_df which exist in the small_df in the 2nd step, before then adding the small_df rows by concatenating rather than merging.
If you had more column names that weren't involved with the join you'd have to do some column renaming/dropping though I think.
Hope that's right.
Try maybe join instead of merge.
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html
Basically, I have two dataframes, the first one looks like this:
And the second one like this:
I want to get the columns "lat" and "lnt" of the second one and add to the first one only if the name of the city matches in both dataframes. I tried using pd.merge(), but it's creating new rows with duplicated values.
If possible, I would like to put a NaN in the rows which didn't have any match at all, but I don't want to remove nor add rows to the original dataframe.
The Pandas merge function defaults to an inner join. Since you're looking to merge in the columns of df2 to df1, you should use a left join. This will give you all the rows of df1, and the matching values from df2.
df3 = df1.merge(df2, on = 'city', how = 'left')
merged_df = df1.merge(df2, how = 'inner', on = ['City'])
I have two data frames with different row numbers and columns. Both tables has few common columns including "Customer ID". Both tables look like this with a size of 11697 rows × 15 columns and 385839 rows × 6 columns respectively. Customer ID might be repeating in second table. I want to concat both of the tables and want to merge similar columns using Customer ID. How can I do that with python PANDAS.
One table looks like this -
and the other one looks like this -
I am using below code -
pd.concat([df1, df2], sort=False)
Just wanted to make sure that I am not losing any information ? How can I check if there are multiple entries with one ID and how can I combine it in one result ?
EDIT -
When I am using above code, here is before and after values of NA'S in the dataset -
Can someone tell, where I went wrong ?
I believe that DataFrame.merge would work in this case:
# use how='outer' to preserve all information from both DataFrames
df1.merge(df2, how='outer', on='customer_id')
DataFrame.join could also work if both DataFrames had their indexes set to customer_id (it is also simpler):
df1 = df1.set_index('customer_id')
df2 = df2.set_index('customer_id')
df1.join(df2, how='outer')
Documentation for DataFrame.merge
Documentation for DataFrame.join
pd.concat will do the trick here,just set axis to 1 to concatenate on the second axis(columns), you should set the index to customer_id for both data frames first
import pandas as pd
pd.concat([df1.set_index('customer_id'), df2.set_index('customer_id')], axis = 1)
if you want to omit the rows with empty values as a result of your concatenaton, use dropna:
pd.concat([df1.set_index('customer_id'), df2.set_index('customer_id')], axis = 1).dropna()
I am trying to merge two dataframes (call them DF1 & DF2) that basically look like the below. My goal is:
I want open/close/low/high to all come from DF1.
I want numEvents and Volume = DF1 + DF2.
In cases where DF2 has rows that don't exist in DF1, I want open/close/low/high to be NaN (so I can later backfill them), and numEvents and Volume to come from DF2 as is.
Any help is much appreciated!
use pd.merge:
it's outer join since you want data from both dfs.
pd.merge([A,B],how='outer', on=<mutual_key>)
Use the left_on and right_on attributes of pd.merge(). You choose the fields that you want to merge.
DF1.merge(DF2, how='outer', right_on=<keys>...)