Pandas join vs add column - python

I have 2 dataframes (df1 and df2) with the same MultiIndex. df1 has column A, df2 has column B.
I found 2 ways of 'joining' these dataframes:
df_joined = df1.join(df2, how='inner')
or
df1['B'] = df2['B']
First option takes much longer. Why?
Does option 2 not look at indexes and just 'attaches' the column to the right?
Running this afterwards returns True, so the end result is the same it seems, but perhaps this is because the indexes in df1 and df2 are also in the same order:
df_joined.equals(df1)
Is there any faster way to join the dataframes knowing the indexes are the same?

There is no faster way than df1['B'] = df2['B'] if indices are aligned.
Assigning a series to another series is already well optimised in pandas.
join takes longer than assignment as it explicitly lines up df1.index and df2.index, which is expensive. It is not assumed that indices are in consistent order. As per pd.DataFrame.join documentation, if no column is specified the join will take place on the dataframes' respective indices.
I would be surprised if you find this is a bottleneck in your workflow. If it is, then I suggest you drop down to numpy arrays and avoid pandas altogether.

Related

How should I filter one dataframe by entries from another one in pandas with isin?

I have two dataframes (df1, df2). The columns names and indices are the same (the difference in columns entries). Also, df2 has only 20 entries (which also existed in df1 as i said).
I want to filter df1 by df2 entries, but when i try to do it with isin but nothing happens.
df1.isin(df2) or df1.index.isin(df2.index)
Tell me please what I'm doing wrong and how should I do it..
First of all the isin function in pandas returns a Dataframe of booleans and not the result you want. So it makes sense that the cmds you used did not work.
I am possitive that hte following psot will help
pandas - filter dataframe by another dataframe by row elements
If you want to select the entries in df1 with an index that is also present in df2, you should be able to do it with:
df1.loc[df2.index]
or if you really want to use isin:
df1[df1.index.isin(df2.index)]

Need help converting a merge function from R to Python, shape of resulting df is the same but losing more rows in Python after dropping duplicates

I believe the merge type in R is a left outer join. The merge I implemented in Python returned a dataframe that had the same shape as the resulting merged df in R. Although when I had dropped the duplicates (df2.drop_duplicates), 4000 rows were dropped in Python as opposed to the 50 rows dropped when applying the drop duplicates function to the post-merge R data frame
The dataframe I need to merge are df1 and df2
R:
df2<-merge( df2[ , -which(names(df2) %in% c(column9,column10))], df1[,c(column1,column2,column4,column5)],by.x=c(column1,column2),by.y=c(column2,column4),all.x=T
Python:
df2 = df2[[column1,column2,column3...column8]].merge(df1[[column1,column2,column4,column5]],how='left',left_on=[column1,column2],right_on=[column2,column4]
df2[column1] and df2[column2] are the columns I want to merge on because their names in df1 are df1[column2] and df1[column4] but have the same row values.
My gut tells me that the issue is stemming from this portion of the code that I might be misinterpreting: -which(names(df2) %in% c(column9,column10)
Please feel free to send some tips my way if I'm messing up somewhere
First, the list subset of columns in Pandas is no longer recommended. Instead, use reindex to subset columns which handles missing labels.
And the R translation of -which(names(df2) %in% c(column9, column10)) in Pandas can be ~df2.columns.isin([column9, column10]). And because isin returns a boolean series, to subset consider DataFrame.loc:
df2 = (df.loc[:, ~df2.columns.isin([column9, column10])]
.merge(df1.reindex([column1, column2, column4, column5], axis='columns'),
how='left',
left_on=[column1, column2],
right_on=[column2, column4])
)

Check if two pandas dataframes are equal with mismatching indexes python

I was wondering if it is possible to check the similarity between the two dataframes below. They are the same, however the first and the third rows are flipped. Is there a way to check that these dataframes are the same regardless of the order of the index? Thank you for any help!
You can use merge and then look for a subset of rows that doesn't exist in either dataframe.
df_a = pd.DataFrame([['a','b','c'], ['c','d','e'], ['e','f','g']], columns=['col1','col2','col3'])
df_b = pd.DataFrame([['e','f','g'], ['c','d','e'], ['a','b','c']], columns=['col1','col2','col3'])
df_merged = pd.merge(df_a, df_b, on=df_a.columns.tolist(), how='outer', indicator='Exist')
print(df_merged[(df_merged['Exist'] != 'both')])
Sort the DFs in the same way and then compare, or iterate through all the columns, sort one column at a time and compare it
If this is ok and you need help writing the code, let me know

pandas df.fillna - filling NaNs after outer join with correct values

I have two dataframes, sharing some columns together.
I'm trying to:
1) Merge the two dataframes together, i.e. adding the columns which are different:
diff = df2[df2.columns.difference(df1.columns)]
merged = pd.merge(df1, diff, how='outer', sort=False, on='ID')
Up to here, everything works as expected.
2) Now, to replace the NaN values with the values of df2
merged = merged[~merged.index.duplicated(keep='first')]
merged.fillna(value=df2)
And it is here that I get:
pandas.core.indexes.base.InvalidIndexError
I don't have any duplicates, and I can't find any information as to what can cause this.
The solution to this problem is to use a different method - combine_first()
this way, each row with missing data is filled with data from the other dataframe, as can be seen here Merging together values within Series or DataFrame columns
In case, number of rows changes because of the merge, fillna sometimes cause error. Try the following!
merged.fillna(df2.groupby(level=0).transform("mean"))
related question

How do I join two dataframes (pandas) with different indices?

I'm working on a way to transform sequence/genotype data from a csv format to a genepop format.
I have two dataframes: df1 is empty, df1.index (rows = samples) consists of almost the same as df2.index, except I inserted "POP" in several places (to specify the different populations). df2 holds the data, with Loci as columns.
I want to insert the values from df2 into df1, keeping empty rows where df1.index = 'POP'.
I tried join, combine, combine_first and concat, but they all seem to take the rows that exist in both df's.
Is there a way to do this?
It sounds like you want an 'outer' join:
df1.join(df2, how='outer')

Categories

Resources