Python Merge Two DataFrames Only Retrieve Specific Columns in the Result - python

Hi - I want to merge two python DataFrames, but don't want to bring over ALL of the columns from both dataframes to my new dataframe. In the picture below, if I join df1 and df2 on 'acct' and want to bring back all the columns from df1 and ONLY 'entity' from df2, how would I write that? I don't want to have to drop any columns so doing a normal merge isn't what I'm looking for. Can anyone help? Thanks!

When you perform the merge, you can pass a column subset of df2 instead of the whole dataframe. Selecting df2[['acct', 'entity']] creates a new object, so the underlying objects df1 and df2 remain unchanged. An example would look like this:
df_result = df1.merge(df2[['acct', 'entity']], on='acct')
This will let you do your partial merge without modifying either original dataframe.
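A minimal sketch of the partial merge with made-up data. The columns 'name', 'balance' and 'region' are assumptions for illustration; only 'acct' and 'entity' come from the question. The default merge is an inner join, so pass how='left' if you want to keep every row of df1.

```python
import pandas as pd

# Hypothetical sample data; only 'acct' and 'entity' are named in the question.
df1 = pd.DataFrame({
    "acct": [1, 2, 3],
    "name": ["a", "b", "c"],
    "balance": [10.0, 20.0, 30.0],
})
df2 = pd.DataFrame({
    "acct": [1, 2, 4],
    "entity": ["X", "Y", "Z"],
    "region": ["N", "S", "E"],
})

# Select only the join key and the wanted column from df2 before merging.
df_result = df1.merge(df2[["acct", "entity"]], on="acct")

# 'region' never reaches the result, and df2 itself still has all its columns.
print(df_result)
```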

Related

How to check if two pandas dataframes have same values and concatenate those rows?

I got a DF called "df" with 4 numerical columns [frame,id,x,y]
I made a loop that creates two dataframes called df1 and df2. Both df1 and df2 are subsets of the original dataframe.
What I want to do (and I don't understand how to do it) is this: I want to CHECK if df1 and df2 have the same VALUES in the column called "id". If they do, I want to concatenate those rows of df2 (the ones that have the same id values) to df1.
For example: if df1 has rows with the id values (1,6,4,8) and df2 has the id values (12,7,8,10), I want to concatenate the df2 rows that have the id value 8 to df1. That is all I need.
This is my code:
for i in range(0, max(df['frame']), 30):
    df1 = df[df['frame'].between(i, i+30)]
    df2 = df[df['frame'].between(i-30, i)]
There are several ways to accomplish what you need.
The simplest one is to get the slice of df2 that contains the values you need with .isin() and concatenate it with df1 in one line.
df3 = pd.concat([df1, df2[df2.id.isin(df1.id)]], axis = 0)
To gain more control and avoid any errors that might stem from updating df1 and df2 elsewhere, you may want to take this one-liner apart.
look_for_vals = set(df1['id'].tolist())
# do some stuff
need_ix = df2[df2["id"].isin(look_for_vals )].index
# do more stuff
df3 = pd.concat([df1, df2.loc[need_ix,:]], axis=0)
Instead of set() you may also use df1['id'].unique()
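Here is a small runnable sketch of the one-liner, using the id values from the question. The frame, x and y values are made up, and ignore_index=True is my addition so the result gets a fresh index.

```python
import pandas as pd

# id values taken from the question; frame/x/y values are invented.
df1 = pd.DataFrame({"frame": [1, 2, 3, 4], "id": [1, 6, 4, 8],
                    "x": [0.1, 0.2, 0.3, 0.4], "y": [1.0, 2.0, 3.0, 4.0]})
df2 = pd.DataFrame({"frame": [5, 6, 7, 8], "id": [12, 7, 8, 10],
                    "x": [0.5, 0.6, 0.7, 0.8], "y": [5.0, 6.0, 7.0, 8.0]})

# Rows of df2 whose id also appears in df1.
matching = df2[df2["id"].isin(df1["id"])]

# Append those rows below df1 and renumber the index.
df3 = pd.concat([df1, matching], axis=0, ignore_index=True)
print(df3)
```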

How should I filter one dataframe by entries from another one in pandas with isin?

I have two dataframes (df1, df2). The column names and indices are the same (only the column entries differ). Also, df2 has only 20 entries (which also exist in df1, as I said).
I want to filter df1 by the df2 entries, but when I try to do it with isin, nothing happens.
df1.isin(df2) or df1.index.isin(df2.index)
Tell me please what I'm doing wrong and how should I do it..
First of all, isin in pandas returns booleans (a DataFrame of booleans for df1.isin(df2), a boolean array for df1.index.isin(df2.index)) and not the filtered rows you want. So it makes sense that the commands you used did not work.
I am positive that the following post will help:
pandas - filter dataframe by another dataframe by row elements
If you want to select the entries in df1 with an index that is also present in df2, you should be able to do it with:
df1.loc[df2.index]
or if you really want to use isin:
df1[df1.index.isin(df2.index)]
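A quick sketch of both options on toy data (the labels and values are made up). One difference worth knowing: .loc raises a KeyError if df2.index contains a label that is missing from df1, while the isin mask just skips it.

```python
import pandas as pd

# Toy frames with the same column name; df2's index is a subset of df1's.
df1 = pd.DataFrame({"value": [10, 20, 30, 40]}, index=["a", "b", "c", "d"])
df2 = pd.DataFrame({"value": [21, 41]}, index=["b", "d"])

# Label-based selection: every label in df2.index must exist in df1.
by_loc = df1.loc[df2.index]

# Boolean mask: True where df1's index label also appears in df2's index.
mask = df1.index.isin(df2.index)
by_isin = df1[mask]

print(by_isin)
```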

How to match two columns from different Dataframes and with different length?

I have already Generated df1 and df2.
df1
df2
Both Dataframes have a common column, df1[TB_DIV] and df2[DIV].
I want to generate a new df3 that contains all the info in df1 filtered by all the df2[DIV] which are NOT IN df1.
I tried to use the .isin function to filter df1 with the df2 info, but wasn't able to get the expected values.
m = DIV_LIST.DIV.isin(DIV_TABLE.TB_DIV)
DIV_LIST1 = DIV_LIST[m]
I obtained an empty df3 and, in some cases, errors due to a length mismatch.
Try going about it like this:
df1.loc[df1['TB_DIV'].isin(df2['DIV'])]
To get those that are not in, use:
df1.loc[~df1['TB_DIV'].isin(df2['DIV'])]
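To see both filters side by side, here is a sketch with invented values (the 'info' column is an assumption; only TB_DIV and DIV come from the question). The mask is built from df1's own column, so it always has the same length as df1.

```python
import pandas as pd

# Invented sample values; the two frames have different lengths on purpose.
df1 = pd.DataFrame({"TB_DIV": ["A", "B", "C", "D"], "info": [1, 2, 3, 4]})
df2 = pd.DataFrame({"DIV": ["B", "D", "E"]})

# Mask over df1's rows: is this TB_DIV value present in df2['DIV']?
mask = df1["TB_DIV"].isin(df2["DIV"])

in_df2 = df1.loc[mask]       # rows of df1 whose TB_DIV is in df2
not_in_df2 = df1.loc[~mask]  # rows of df1 whose TB_DIV is not in df2

print(in_df2)
print(not_in_df2)
```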

How can I join two dataframes with different dtypes?

I have a dataframe(df1) with index as a date range and no columns specified and another dataframe(df2) with float values in every column.
I tried joining a specific column from df2 to df1 using .join() method and ended up with all values as NaN in df1. What should I do to solve this?
It's unclear what you mean without any example of the data or their shape, and without more details about what kind of 'join' you're trying to do. It sounds like you are trying to put dataframes side by side without relying on a column or index level to join on. Matching on those is what join and merge do, so if the two dataframes have no common values in the index (or in the on parameter), you'll end up with NaNs. If I'm correct and you just want a concatenation of dataframes, then you can use concat. Note that concat with axis=1 also aligns rows on the index, so the indexes have to match for the values to line up. I can't provide the exact code without more details, but it would look something like this:
new_df = pd.concat([df1, df2[['whatever_column_you_want_to_concatenate']]], axis=1)
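A sketch of one way this could look, assuming df1 and df2 have the same number of rows and should be matched by position. The column names 'value' and 'other' are made up. The selected column is given df1's date index first, so that concat has matching labels to align on.

```python
import pandas as pd

# df1: date-range index and no columns, as described in the question.
df1 = pd.DataFrame(index=pd.date_range("2021-01-01", periods=3, freq="D"))

# df2: float columns on a default integer index (hypothetical column names).
df2 = pd.DataFrame({"value": [1.5, 2.5, 3.5], "other": [0.1, 0.2, 0.3]})

# Assumption: rows correspond by position, so reuse df1's index on the
# selected column before combining.
aligned = df2[["value"]].copy()
aligned.index = df1.index

new_df = pd.concat([df1, aligned], axis=1)
print(new_df)
```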

Pandas how work out updates between two DataFrames

I’m trying to code something that can check between two dataframes and let me know what's different between the two. Unfortunately, I’ve hit a block and don’t know how to code a solution and was hoping someone could help me out..
The dataframes are populated from two CSV files. The structure of each csv is the same, just the data is different. There are 5 columns and the column names are identical in each dataframe.
The first dataframe df1 is latest data, the other dataframe df2 is existing data. I’m trying to work out:
New rows added in df1.
Rows missing from df1 that were in df2.
Rows that are present in both dataframes, but a value has changed.
The way I’ve been approaching this is to create a third dataset df3 and create a merge of df1 & df2 using the following code:
df3 = pd.merge(df1, df2, indicator=True,
on=["A","B","C","D","E"], how="outer")
This gives me a _merge column and I can create two new datasets using the following code
df4 = df3.loc[df3['_merge'] == 'left_only']
df5 = df3.loc[df3['_merge'] == 'right_only']
At this point, in df4, I have a list of rows which are either genuine new rows, or existing rows which have modified values.
I’m stuck on how to work out which is which?
The data consists of 5 columns, only column D will ever change.
I was thinking about looping through each row of df4 and somehow checking if a row is present in df5 where columns A,B,C and E all match? At that point I can mark the row in df4 as an update and not a new row.
I’d really appreciate the help if someone would let me know the correct way of doing this.
Thanks in advance.
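One possible approach, sketched under the assumption that columns A, B, C and E together identify a row and only D can change (as stated in the question): merge on those four columns only, then compare the two D columns. The sample data and the "_new"/"_old" suffixes are made up for illustration.

```python
import pandas as pd

key_cols = ["A", "B", "C", "E"]  # assumed to uniquely identify a row

# Invented sample data: df1 is the latest data, df2 the existing data.
df1 = pd.DataFrame({"A": [1, 2, 3], "B": ["x", "y", "z"], "C": [10, 20, 30],
                    "D": [100, 250, 300], "E": ["p", "q", "r"]})
df2 = pd.DataFrame({"A": [1, 2, 4], "B": ["x", "y", "w"], "C": [10, 20, 40],
                    "D": [100, 200, 400], "E": ["p", "q", "s"]})

# Merge on the key columns only, so D shows up twice: D_new and D_old.
merged = df1.merge(df2, on=key_cols, how="outer",
                   suffixes=("_new", "_old"), indicator=True)

new_rows = merged[merged["_merge"] == "left_only"]       # only in df1
removed_rows = merged[merged["_merge"] == "right_only"]  # only in df2

# Present in both: an update is a row whose D value differs.
both = merged[merged["_merge"] == "both"]
updated_rows = both[both["D_new"] != both["D_old"]]

print(updated_rows)
```

If D can contain NaN, the != comparison would flag NaN against NaN as a change, so that case would need separate handling.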
