Inner Join two dataframes does not sort properly - python

I am trying to merge two dataframes on an inner join and append the values. I was able to perform the join, but for some reason the values are not ordered properly in each column.
To explain more about this:
the screenshot below shows my first dataframe, which has the stored values for each column.
My second dataframe has the string values which need to be replaced with the values stored in dataframe 1 above.
Below is the output that I got, but when you compare the values with dataframe 2, they are not assigned properly. For example, if you consider row 1 in dataframe 2, Column 1 should have the value 1.896552 (i.e. the second column in dataframe 2), but in my output I have something else.
Below is the code I worked with to achieve the above result.
Joined_df_wna_test = pd.DataFrame()
for col in Unseen_cleaned_rmv_un_2:
    Joined_data = pd.merge(Output_df_unseen, my_dataframe_categorical, on=col, how='inner')
    Joined_df = pd.DataFrame(Joined_data)
    Joined_df_wna_test[col] = Joined_df.Value
Joined_df_wna_test
Could someone please help me overcome this issue?

Found the answer for this.
The how='left' part is what actually makes it keep the order:
Joined_data = Unseen_cleaned_rmv_un.merge(my_dataframe_categorical, on=col, how='left')
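As a quick illustration of why this works (with made-up frames and column names, since the originals are only shown as screenshots): a left merge keeps the row order of the calling frame, so the looked-up values line up row by row.

```python
import pandas as pd

# Hypothetical stand-ins for the question's frames:
# 'left' holds the row order we want to keep, 'lookup' maps labels to values.
left = pd.DataFrame({"Col1": ["c", "a", "b"]})
lookup = pd.DataFrame({"Col1": ["a", "b", "c"], "Value": [1.0, 2.0, 3.0]})

# how='left' preserves the row order of the left frame,
# so each Value lands on the row it belongs to
joined = left.merge(lookup, on="Col1", how="left")
print(joined["Value"].tolist())  # [3.0, 1.0, 2.0]
```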

Related

Combining two dataframes with different rows, keeping contents of first dataframe on all rows

Good day All,
I have two data frames that need to be merged, which is a little different from the examples I have found so far, and I could not get it working. I am sure what I am currently getting has to do with the index, as dataframe 1 only has 1 record. I need to copy the contents of dataframe 1 into new columns of dataframe 2 for all rows.
Current problem highlighted in red
I have tried merge, append, reset index etc...
DF 1: [screenshot: Dataframe 1]
DF 2: [screenshot: Dataframe 2]
Output Requirement: [screenshot: Required Output]
Any suggestions would be highly appreciated
Update:
I got it to work using the below statements, is there a more dynamic way than specifying the column names?
mod_df['Type'] = mod_df['Type'].fillna(method="ffill")
mod_df['Date'] = mod_df['Date'].fillna(method="ffill")
mod_df['Version'] = mod_df['Version'].fillna(method="ffill")
Assuming you have a single row in df1, use a cross merge:
out = df2.merge(df1, how='cross')
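A small sketch of the cross merge with hypothetical column names (the real frames are only shown as screenshots): every row of df2 is paired with the single row of df1, so df1's columns are copied onto all rows without naming them one by one.

```python
import pandas as pd

# Hypothetical data: df1 has one row of metadata, df2 has many rows
df1 = pd.DataFrame({"Type": ["A"], "Date": ["2023-01-01"], "Version": [1]})
df2 = pd.DataFrame({"id": [10, 20, 30]})

# how='cross' pairs each row of df2 with the single df1 row,
# broadcasting df1's columns to every row of the result
out = df2.merge(df1, how="cross")
print(out)
```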

How can I extract data from another dataframe for rows with same id?

I am trying to compare dataframes A and B, both with a column "id", and create a new column in dataframe A that holds the value of the "description" column from dataframe B wherever the ids of the two dataframes match. If the id is not found in dataframe B, I would just leave it blank ("").
B is a smaller dataframe than A.
Right now I created a boolean column that tells me if the id is found in dataframe B:
A["found_in_b"] = A["id_a"].isin(B['id_b'])
PS: I tried an approach of comparing the ids with iteritems and then saving the "description" value, but it wouldn't save anything.
Another thing I tried is this:
A.loc[A.found_in_b > 0, 'description'] = B.description[B['id_b'].values == A["id_a"].values]
But it didn't work either. I am stuck at this point and any tip or guidance for extracting the "description" column for rows that have matching ids would help me a lot to finish my first data project.
You can use a left join.
B_tmp = B[["id_b","description"]]
A = pd.merge(A, B_tmp, left_on="id_a", right_on="id_b", how="left")
You will have NaN values where the value in id_a is not in the B dataframe.
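Putting the pieces together with made-up data, including a fillna("") step to get the blank string the question asks for instead of NaN:

```python
import pandas as pd

# Hypothetical data: A is the big frame, B the smaller lookup table
A = pd.DataFrame({"id_a": [1, 2, 3, 4]})
B = pd.DataFrame({"id_b": [2, 4], "description": ["two", "four"]})

# left join keeps every row of A; ids missing from B come back as NaN
A = A.merge(B[["id_b", "description"]], left_on="id_a", right_on="id_b", how="left")
# replace the NaNs with the blank string requested in the question
A["description"] = A["description"].fillna("")
A = A.drop(columns="id_b")
print(A["description"].tolist())  # ['', 'two', '', 'four']
```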
Please give an example to explain your problem. From the problem above, I think left join is what you are looking for. Hope this helps:
df1 = pd.DataFrame({'id':[1,2,3,4,5,6,7,8], 'val': ['a','b','c','d','e','f','g','h']})
df2 = pd.DataFrame({'id':[1,3,4,6,8], 'val': ['a','c','d','f','e']})
df = pd.merge(df1, df2, on='id', how='left')

pandas df.fillna - filling NaNs after outer join with correct values

I have two dataframes, sharing some columns together.
I'm trying to:
1) Merge the two dataframes together, i.e. adding the columns which are different:
diff = df2[df2.columns.difference(df1.columns)]
merged = pd.merge(df1, diff, how='outer', sort=False, on='ID')
Up to here, everything works as expected.
2) Now, to replace the NaN values with the values of df2
merged = merged[~merged.index.duplicated(keep='first')]
merged.fillna(value=df2)
And it is here that I get:
pandas.core.indexes.base.InvalidIndexError
I don't have any duplicates, and I can't find any information as to what can cause this.
The solution to this problem is to use a different method: combine_first().
This way, each row with missing data is filled with data from the other dataframe, as described in Merging together values within Series or DataFrame columns.
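A minimal sketch of combine_first with invented frames (the question's actual columns aren't shown): missing cells, and entire missing columns, in the first frame are filled from the second, aligned on the index rather than by position.

```python
import pandas as pd

# Hypothetical frames sharing an index, with overlapping column 'x'
df1 = pd.DataFrame({"x": [1.0, None]}, index=["a", "b"])
df2 = pd.DataFrame({"x": [10.0, 20.0], "y": [3.0, 4.0]}, index=["a", "b"])

# combine_first keeps df1's values where present and fills the gaps
# (the NaN in 'x' and the whole missing column 'y') from df2
merged = df1.combine_first(df2)
print(merged)
```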
In case the number of rows changes because of the merge, fillna can sometimes raise an error. Try the following:
merged.fillna(df2.groupby(level=0).transform("mean"))

Python : obtain a value by comparing different data frames columns

I need some help, because I'm trying to obtain a value by comparing columns of different dataframes.
First of all, I tried to use a for loop to reach the goal, but I have millions of rows, so it takes a lot of time.
Now, I would like to use numpy.where, in this way:
I have 2 data frames:
- df1, where each row is different from the others (the column ID is the unique primary key) --> df1['ID', 'status', 'boolean']
- df2, which contains few rows, each different from the others --> df2['code', 'segment', 'value']
Now, I need to create a new column for dataframe1 called 'weight'.
I tried to create the column 'weight' in this way:
df1['weight'] = numpy.where(df1['boolean'] == 1, df2[ (df2['code']==df1['ID']) & (df2['segment']==df1['status'])] ['value'], 0)
The combination 'code' + 'segment' is a unique key, so it returns one and only one value.
The program execution show this error:
"ValueError: Can only compare identically-labeled Series objects"
Can anyone help me to understand it?
Thank you.
You could do this with a left join.
Something like this might work; without sample data I can't check it in detail:
df_merged = df1.join(df2.set_index(['code', 'segment']), how='left', on=['ID', 'status'])
df1['weight'] = df_merged['value'].reindex(df1.index).fillna(0)
The set_index() is needed because of how the on parameter of join works:
on : column name, tuple/list of column names, or array-like
Column(s) in the caller to join on the index in other, otherwise joins index-on-index. If multiples columns given, the passed DataFrame must have a MultiIndex. Can pass an array as the join key if not already contained in the calling DataFrame. Like an Excel VLOOKUP operation
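Here is an untested-against-the-real-data sketch on invented sample rows that reuse the question's column names, combining the join above with the numpy.where condition from the question:

```python
import pandas as pd
import numpy as np

# Hypothetical sample data matching the column names in the question
df1 = pd.DataFrame({"ID": [1, 2, 3],
                    "status": ["s1", "s2", "s1"],
                    "boolean": [1, 0, 1]})
df2 = pd.DataFrame({"code": [1, 3],
                    "segment": ["s1", "s1"],
                    "value": [0.5, 0.9]})

# join df1's ('ID', 'status') pairs against df2's ('code', 'segment') key;
# rows without a match get NaN in 'value'
merged = df1.join(df2.set_index(["code", "segment"]), how="left", on=["ID", "status"])

# weight is the looked-up value where boolean == 1, else 0
df1["weight"] = np.where(df1["boolean"] == 1, merged["value"].fillna(0), 0)
print(df1["weight"].tolist())  # [0.5, 0.0, 0.9]
```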

Pandas join two tables but not sure with which column in df2 ('or' syntax?)

I would like to join two tables together. Both tables are very large (around 1m rows). The problem is, it's not always clear which column the join needs to be made on. Ideally the program should try to join with col_x and, if that fails, try col_y.
I would need to do an or logic as follows:
df3 = pd.merge(df1, df2, left_on='col1', right_on='col_x' or 'col_y', how='left')
Any suggestions how this is best implemented are appreciated.
I would first create a new column that contains the values you want to merge on. I haven't tested this, but I think it would be something like:
# first create new column
df2['merge_col'] = df2['col_x']
# replace empty values
empty_rows = df2['merge_col'].isnull()
df2.loc[empty_rows, 'merge_col'] = df2.loc[empty_rows, 'col_y']
# merge with the new column
df3 = pd.merge(df1, df2, left_on='col1', right_on='merge_col', how='left')
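For what it's worth, here is the same idea run on tiny invented frames; fillna builds the fallback key in one step instead of the two-step boolean-mask assignment:

```python
import pandas as pd

# Hypothetical data: col_x is the preferred join key, col_y the fallback
df1 = pd.DataFrame({"col1": ["k1", "k2"], "val1": [10, 20]})
df2 = pd.DataFrame({"col_x": ["k1", None],
                    "col_y": ["x", "k2"],
                    "val2": ["a", "b"]})

# take col_x where present, otherwise fall back to col_y
df2["merge_col"] = df2["col_x"].fillna(df2["col_y"])

df3 = pd.merge(df1, df2, left_on="col1", right_on="merge_col", how="left")
print(df3["val2"].tolist())  # ['a', 'b']
```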
Have you tried something like the below or do you require the check to be done within the merge function?
if (df2["col_x"] == "").all():
    df3 = pd.merge(df1, df2, left_on='col1', right_on='col_y', how='left')
else:
    df3 = pd.merge(df1, df2, left_on='col1', right_on='col_x', how='left')
