I was wondering if it is possible to check the similarity between the two dataframes below. They contain the same rows, but the first and third rows are swapped. Is there a way to check that these dataframes are the same regardless of the order of the index? Thank you for any help!
You can use merge with an indicator column and then look for rows that do not appear in both dataframes.
import pandas as pd

df_a = pd.DataFrame([['a','b','c'], ['c','d','e'], ['e','f','g']], columns=['col1','col2','col3'])
df_b = pd.DataFrame([['e','f','g'], ['c','d','e'], ['a','b','c']], columns=['col1','col2','col3'])

# Outer merge on all columns; the 'Exist' indicator records where each row came from
df_merged = pd.merge(df_a, df_b, on=df_a.columns.tolist(), how='outer', indicator='Exist')

# Rows present in only one frame; empty output means the frames hold the same rows
print(df_merged[df_merged['Exist'] != 'both'])
Sort the DFs in the same way and then compare them, or iterate through the columns, sorting and comparing one column at a time.
If this is OK and you need help writing the code, let me know; a quick sketch follows.
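For the first suggestion, a minimal sketch using the same example frames as the answer above; sorting by all columns is just one reasonable ordering:

import pandas as pd

df_a = pd.DataFrame([['a','b','c'], ['c','d','e'], ['e','f','g']], columns=['col1','col2','col3'])
df_b = pd.DataFrame([['e','f','g'], ['c','d','e'], ['a','b','c']], columns=['col1','col2','col3'])

# Sort both frames by all columns and discard the old index before comparing
sorted_a = df_a.sort_values(by=df_a.columns.tolist()).reset_index(drop=True)
sorted_b = df_b.sort_values(by=df_b.columns.tolist()).reset_index(drop=True)
print(sorted_a.equals(sorted_b))  # True if both frames hold the same rows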
I have two dataframes (df1, df2). The column names and indices are the same (the difference is in the column entries). Also, df2 has only 20 entries (which, as I said, also exist in df1).
I want to filter df1 by df2's entries, but when I try to do it with isin, nothing happens.
df1.isin(df2) or df1.index.isin(df2.index)
Please tell me what I'm doing wrong and how I should do it.
First of all, the isin function in pandas returns a DataFrame of booleans, not the filtered result you want, so it makes sense that the commands you used did not work on their own.
I am positive that the following post will help:
pandas - filter dataframe by another dataframe by row elements
If you want to select the entries in df1 with an index that is also present in df2, you should be able to do it with:
df1.loc[df2.index]
or if you really want to use isin:
df1[df1.index.isin(df2.index)]
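A minimal sketch of both options, using made-up frames (the names and values here are hypothetical):

import pandas as pd

# Hypothetical frames: df2's index is a subset of df1's
df1 = pd.DataFrame({'x': range(5)}, index=list('abcde'))
df2 = pd.DataFrame({'x': [10, 30]}, index=list('ac'))

print(df1.loc[df2.index])              # rows 'a' and 'c'
print(df1[df1.index.isin(df2.index)])  # the same rows, via a boolean mask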
I have two data frames of the format
I want to concatenate the two such that I have a resultant table of the format
Thanks in advance!
Use a pandas outer merge. For example, if your first dataframe is df1 and your second is df2:
result_df = df1.merge(df2, how="outer", left_on="Time 15 Min", right_on="Time Event")
See documentation for more info.
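Since the original tables aren't reproduced here, a minimal sketch with hypothetical data, reusing only the column names from the line above:

import pandas as pd

# Hypothetical stand-ins for the tables shown in the question
df1 = pd.DataFrame({'Time 15 Min': ['00:00', '00:15', '00:30'], 'Value': [1, 2, 3]})
df2 = pd.DataFrame({'Time Event': ['00:15', '00:45'], 'Event': ['start', 'stop']})

# The outer merge keeps every time from both frames, filling gaps with NaN
result_df = df1.merge(df2, how='outer', left_on='Time 15 Min', right_on='Time Event')
print(result_df)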
I have a big dataframe with 100 rows whose structure is [qtr_dates<datetime.date>, sales<float>], and a small dataframe with the same structure but fewer than 100 rows. I want to merge these two dfs such that the merged df has all the rows from the small df, with the remaining rows taken from the big df.
Right now I am doing this
df = big_df.merge(small_df, on=big_df.columns.tolist(), how='outer')
But this is creating a df with duplicate qtr_dates.
Use concat, then remove the duplicates with DataFrame.drop_duplicates:
pd.concat([small_df, big_df], ignore_index=True).drop_duplicates(subset=['qtr_dates'])
Because small_df comes first and drop_duplicates keeps the first occurrence by default, the small_df rows win for any shared qtr_dates.
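A toy run with hypothetical values to show the precedence:

import pandas as pd

# Hypothetical frames matching the [qtr_dates, sales] structure
big_df = pd.DataFrame({'qtr_dates': ['2020-03-31', '2020-06-30', '2020-09-30'],
                       'sales': [1.0, 2.0, 3.0]})
small_df = pd.DataFrame({'qtr_dates': ['2020-06-30'], 'sales': [99.0]})

df = pd.concat([small_df, big_df], ignore_index=True).drop_duplicates(subset=['qtr_dates'])
print(df)  # 2020-06-30 keeps sales 99.0, taken from small_df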
If I understand correctly, you want everything from the bigger dataframe, but if a date is present in the smaller dataframe you want its row replaced by the relevant row from the smaller one?
Hence I think you want to do this:
# Flag the big_df rows whose qtr_dates also appear in small_df
df = big_df.merge(small_df[['qtr_dates']], on='qtr_dates', how='left', indicator=True)
# Keep only the rows unique to big_df, then drop the helper column
df = df[df['_merge'] != 'both'].drop(columns='_merge')
df_out = pd.concat([df, small_df], ignore_index=True)
The first two lines remove any rows from big_df whose qtr_dates exist in small_df (merging on the date alone, so rows with differing sales values still count as matches), and the last line then adds the small_df rows by concatenating rather than merging.
If you had more column names that weren't involved with the join you'd have to do some column renaming/dropping though I think.
Hope that's right.
Maybe try join instead of merge.
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html
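A minimal sketch of the join route, with hypothetical data mirroring the merge answer above:

import pandas as pd

df1 = pd.DataFrame({'Time 15 Min': ['00:00', '00:15'], 'Value': [1, 2]})
df2 = pd.DataFrame({'Time Event': ['00:15', '00:45'], 'Event': ['start', 'stop']})

# join works index-against-index, so move both time columns into the index first
result = df1.set_index('Time 15 Min').join(df2.set_index('Time Event'), how='outer')
print(result)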
I have 2 dataframes (df1 and df2) with the same MultiIndex. df1 has column A, df2 has column B.
I found 2 ways of 'joining' these dataframes:
df_joined = df1.join(df2, how='inner')
or
df1['B'] = df2['B']
First option takes much longer. Why?
Does option 2 not look at indexes and just 'attaches' the column to the right?
Running this afterwards returns True, so it seems the end result is the same, but perhaps this is because the indexes in df1 and df2 are also in the same order:
df_joined.equals(df1)
Is there any faster way to join the dataframes knowing the indexes are the same?
There is no faster way than df1['B'] = df2['B'] if indices are aligned.
Assigning a series to another series is already well optimised in pandas.
join takes longer than assignment because it explicitly lines up df1.index and df2.index, which is expensive; it does not assume the indices are in a consistent order. As per the pd.DataFrame.join documentation, if no column is specified the join will take place on the dataframes' respective indices.
I would be surprised if you find this is a bottleneck in your workflow. If it is, then I suggest you drop down to numpy arrays and avoid pandas altogether.
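A minimal sketch of the two approaches side by side, with a hypothetical MultiIndex:

import numpy as np
import pandas as pd

# Hypothetical frames sharing an identical MultiIndex, as in the question
idx = pd.MultiIndex.from_product([['x', 'y'], [1, 2]], names=['k1', 'k2'])
df1 = pd.DataFrame({'A': np.arange(4)}, index=idx)
df2 = pd.DataFrame({'B': np.arange(4) * 10.0}, index=idx)

df_joined = df1.join(df2, how='inner')  # aligns the two indexes explicitly
df1['B'] = df2['B']                     # plain column assignment
print(df_joined.equals(df1))            # True: same end result here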
I am trying to merge dataframes by index and take only certain columns into the result.
result = pd.concat([self.retailer_categories_probes_df['euclidean_distance'], self.retailers_categories_df['euclidean_distance']])
But in the result I only get the 'euclidean_distance' from the first table?
Any idea what is wrong?
Also, how can I give names to the destination columns?
I think you may need axis=1:
result = pd.concat([self.retailer_categories_probes_df['euclidean_distance'], self.retailers_categories_df['euclidean_distance']], axis=1)
See pd.concat() docs
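To also name the destination columns, one option is the keys argument of pd.concat; a sketch with hypothetical series and names:

import pandas as pd

# Hypothetical series standing in for the two 'euclidean_distance' columns
s1 = pd.Series([0.1, 0.2], name='euclidean_distance')
s2 = pd.Series([0.3, 0.4], name='euclidean_distance')

# axis=1 places them side by side; keys= supplies the destination column names
result = pd.concat([s1, s2], axis=1, keys=['probes_distance', 'retailers_distance'])
print(result)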