Pandas: how to work out updates between two DataFrames - python

I’m trying to code something that can check two dataframes against each other and tell me what’s different between the two. Unfortunately, I’ve hit a block and don’t know how to code a solution, so I was hoping someone could help me out.
The dataframes are populated from two CSV files. The structure of each csv is the same, just the data is different. There are 5 columns and the column names are identical in each dataframe.
The first dataframe, df1, is the latest data; the other dataframe, df2, is the existing data. I’m trying to work out:
New rows added in df1.
Rows missing from df1 that were in df2.
Rows that are present in both dataframes, but where a value has changed.
The way I’ve been approaching this is to create a third dataframe df3 as a merge of df1 & df2 using the following code:
df3 = pd.merge(df1, df2, indicator=True,
               on=["A", "B", "C", "D", "E"], how="outer")
This gives me a _merge column, and I can create two new dataframes using the following code:
df4 = df3.loc[df3['_merge'] == 'left_only']
df5 = df3.loc[df3['_merge'] == 'right_only']
At this point, in df4, I have a list of rows which are either genuine new rows, or existing rows which have modified values.
I’m stuck on how to work out which is which.
The data consists of 5 columns; only column D will ever change.
I was thinking about looping through each row of df4 and somehow checking whether a row is present in df5 where columns A, B, C and E all match (see the sketch below). At that point I can mark the row in df4 as an update and not a new row.
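A minimal sketch of that idea (untested, and assuming df4 and df5 come from the merge above): drop the old indicator, re-merge the two leftover frames on the four stable columns, and let a fresh indicator classify each row.

check = df4.drop(columns="_merge").merge(
    df5.drop(columns="_merge"),
    on=["A", "B", "C", "E"], how="outer",
    suffixes=("_new", "_old"), indicator="status")
# "both"       -> keys matched, so only D changed: an update
# "left_only"  -> genuinely new row in df1
# "right_only" -> row that existed in df2 but is gone from df1
updates = check[check["status"] == "both"]
new_rows = check[check["status"] == "left_only"]
removed = check[check["status"] == "right_only"]

Rows tagged "both" carry D_new and D_old side by side, so you can also see what each value changed from and to.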
I’d really appreciate the help if someone would let me know the correct way of doing this.
Thanks in advance.

Related

Pandas: How to Squash Multiple Rows into One Row with More Columns

I'm looking for a way to convert 5 rows in a pandas dataframe into one row with 5 times the amount of columns (so I have the same information, just squashed into one row). Let me explain:
I'm working with hockey game statistics. Currently, there are 5 rows representing the same game in different situations, each with 111 columns. I want to convert these 5 rows into one row (so that one game is represented by one row) but keep the information contained in the different situations. In other words, I want to convert 5 rows, each with 111 columns into one row with 554 columns (554=111*5 minus one since we're joining on gameId).
My DF head was shown in a screenshot (omitted here).
So, as an example, we can see the first 5 rows have gameId = 2008020001, but each have a different situation (i.e. other, all, 5on5, 4on5, and 5on4). I'd like these 5 rows to be converted into one row with gameId = 2008020001, and with columns labelled according to their situation.
For example, I want columns for all unblockedShotAttemptsAgainst, 5on5 unblockedShotAttemptsAgainst, 5on4 unblockedShotAttemptsAgainst, 4on5 unblockedShotAttemptsAgainst, and other unblockedShotAttemptsAgainst (and the same for every other stat).
Any info would be greatly appreciated. It's also worth mentioning that my dataset is fairly large (177990 rows), so an efficient solution is desired. The resulting dataframe should have one-fifth the rows and 5 times the columns. Thanks in advance!
---- What I've Tried Already ----
I tried to do this using df.apply() and some nested for loops, but it got very ugly very quickly and was incredibly slow. I think pandas has a better way of doing this, but I'm not sure how.
Looking at other SO answers, I initially thought it might have something to do with df.pivot() or df.groupby(), but I couldn't figure it out. Thanks again!
It sounds like what you are looking for is pd.get_dummies()
cols = df.columns
# get dummies for the 'situation' column
df1 = pd.get_dummies(df, columns=['situation'])
# keep only the new dummy columns: drop everything that was already in df
# ('situation' itself is no longer present in df1, so exclude it from the drop)
df1 = df1.drop(columns=[c for c in cols if c != 'situation'])
# add the dummy cols to the original df
df = pd.concat([df, df1], axis=1)
# collapse duplicate rows
df.groupby(list(cols)).first()
For the last line you can also use df.drop_duplicates(): https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html
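Since the question mentions df.pivot(), it is worth noting the reshape can also be done directly with unstack. A sketch, assuming columns named gameId and situation as described (untested against the real data):

wide = df.set_index(['gameId', 'situation']).unstack('situation')
# flatten the resulting (stat, situation) MultiIndex into
# "situation stat" style column names
wide.columns = [f"{situation} {stat}" for stat, situation in wide.columns]
wide = wide.reset_index()

This gives one row per gameId with columns like "5on5 unblockedShotAttemptsAgainst", and it avoids row-wise loops entirely.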

Combining two dataframes with different rows, keeping contents of first dataframe on all rows

Good day All,
I have two data frames that need to be merged; my case is a little different from the ones I have found so far, and I could not get it working. What I am currently getting is, I am sure, to do with the index, as dataframe 1 only has 1 record. I need to copy the contents of dataframe 1 into new columns of dataframe 2 for all rows.
(Screenshot of the current, incorrect result omitted; the problem was highlighted in red.)
I have tried merge, append, reset index etc...
DF 1, DF 2, and the required output were shown as screenshots (omitted here).
Any suggestions would be highly appreciated
Update:
I got it to work using the statements below. Is there a more dynamic way than specifying the column names? (See the sketch after the code.)
mod_df['Type'] = mod_df['Type'].fillna(method="ffill")
mod_df['Date'] = mod_df['Date'].fillna(method="ffill")
mod_df['Version'] = mod_df['Version'].fillna(method="ffill")
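A more dynamic variant (a sketch; it assumes either that every column should be forward-filled, or that the column names are held in a list):

# forward-fill every column at once
mod_df = mod_df.ffill()
# or restrict the fill to a list of columns in one statement
cols = ['Type', 'Date', 'Version']
mod_df[cols] = mod_df[cols].ffill()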
Assuming you have a single row in df1, use a cross merge:
out = df2.merge(df1, how='cross')
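For illustration, with hypothetical frames shaped like the question (df1 holding a single record, df2 holding several rows):

import pandas as pd

df1 = pd.DataFrame({'Type': ['A'], 'Date': ['2021-01-01'], 'Version': [3]})
df2 = pd.DataFrame({'Item': ['x', 'y', 'z'], 'Qty': [1, 2, 3]})

# every row of df2 now carries df1's single row of values
out = df2.merge(df1, how='cross')

Note that how='cross' requires pandas 1.2 or newer.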

Inner Join two dataframes does not sort properly - python

I am trying to merge two dataframes with an inner join and append the values. I was able to perform the join, but for some reason the values are not ordered properly in each column.
To explain more about this,
Please find below (screenshots omitted here) my first dataframe, which has the stored values of each column.
My second dataframe has the string values which need to be replaced with the values stored in my dataframe 1 above.
Below is the output that I got, but when you look at the values and compare them with dataframe 2, they are not assigned properly. For example: if you consider row 1 in dataframe 2, Column 1 should have the value 1.896552 (i.e. the second column in dataframe 2), but in my output I have something else.
Below is the code I worked with to achieve the above result.
Joined_df_wna_test = pd.DataFrame()
for col in Unseen_cleaned_rmv_un_2:
    Joined_data = pd.merge(Output_df_unseen, my_dataframe_categorical, on=col, how='inner')
    Joined_df = pd.DataFrame(Joined_data)
    Joined_df_wna_test[col] = Joined_df.Value
Joined_df_wna_test
Could someone please help me in overcoming this issue?
Found the answer for this. The how='left' part is what actually makes it keep the order:
Joined_data = Unseen_cleaned_rmv_un.merge(my_dataframe_categorical, on=col, how='left')
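A tiny illustration of that behaviour with hypothetical data: a left merge keeps every row of the left frame, in the left frame's original order.

import pandas as pd

left = pd.DataFrame({'k': [3, 1, 2], 'v': ['a', 'b', 'c']})
right = pd.DataFrame({'k': [1, 2, 3], 'w': [10, 20, 30]})

# rows come back in left's original order: k = 3, 1, 2
print(left.merge(right, on='k', how='left'))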

Python Merge Two DataFrames Only Retrieve Specific Columns in the Result

Hi - I want to merge two pandas DataFrames, but I don't want to bring over ALL of the columns from both dataframes to my new dataframe. In the picture below (omitted here), if I join df1 and df2 on 'acct' and want to bring back all the columns from df1 and ONLY 'entity' from df2, how would I write that? I don't want to have to drop any columns afterwards, so a normal merge isn't what I'm looking for. Can anyone help? Thanks!
When you perform the merge operation, you can pass in a trimmed-down selection of df2's columns, which means the underlying objects df1 and df2 remain unchanged. An example would look like this:
df_result = df1.merge(df2[['acct', 'entity']], on='acct')
This will let you do your partial merge without modifying either original dataframe.
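A quick demonstration with hypothetical data:

import pandas as pd

df1 = pd.DataFrame({'acct': [1, 2], 'balance': [100.0, 250.0]})
df2 = pd.DataFrame({'acct': [1, 2], 'entity': ['A', 'B'], 'region': ['N', 'S']})

# only 'acct' and 'entity' come over from df2; 'region' stays behind
df_result = df1.merge(df2[['acct', 'entity']], on='acct')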

inner join with huge dataframes (~2 million columns)

I am trying to join two data frames (df1 and df2) based on matching values from one column (called 'Name') that is found in each data frame. I have tried this using R's inner_join function as well as Python's pandas merge function, and have been able to get both to work successfully on smaller subsets of my data. I think my problem is with the size of my data frames.
My data frames are as follows:
df1 has the 'Name' column with 5 additional columns and has ~900 rows.
df2 has the 'Name' column with ~2 million additional columns and has ~900 rows.
I have tried (in R):
df3 <- inner_join(x = df1, y = df2, by = 'Name')
I have also tried (in Python where df1 and df2 are Pandas data frames):
df3 = df1.merge(right = df2, how = 'inner', left_on = 1, right_on = 0)
(where the 'Name' column is at index 1 of df1 and at index 0 of df2)
When I apply the above to my full data frames, it runs for a very long time and eventually crashes. Additionally, I suspect that the problem may be with the 2 million columns of my df2, so I tried sub-setting it (row-wise) into smaller data frames. My plan was to join the small subsets of df2 with df1 and then row bind the new data frames together at the end. However, joining even the smaller partitioned df2s was unsuccessful.
I would appreciate any suggestions anyone would be able to provide.
Thank you everyone for your help! Using data.table as @shadowtalker suggested sped up the process tremendously. Just for reference in case anyone is trying to do something similar, my df1 file was approximately 400 MB and my df2 file was approximately 3 GB.
I was able to accomplish the task as follows:
library(data.table)
# convert both data frames to data.tables (by reference)
df1 <- setDT(df1)
df2 <- setDT(df2)
# index both tables on the join column
setkey(df1, Name)
setkey(df2, Name)
# keyed join; nomatch = 0 drops non-matching rows, i.e. an inner join
df3 <- df1[df2, nomatch = 0]
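For anyone staying in pandas, a rough analogue of the keyed join above (a sketch, untested at this scale) is to move 'Name' into the index before joining:

# index-aligned join, roughly mirroring setkey + data.table's X[Y]
df3 = df1.set_index('Name').join(df2.set_index('Name'), how='inner').reset_index()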
This is a really ugly workaround where I break up df2's columns and add them piece by piece. Not sure it will work, but it might be worth a try:
import numpy as np

# First, I only grab the "Name" column from df2
df3 = df1.merge(right=df2[["Name"]], how="inner", on="Name")
# Then I save all the column headers (excluding
# the "Name" column) in a separate list
df2_columns = df2.columns[np.logical_not(df2.columns.isin(["Name"]))]
# This determines how many columns are going to get added each time.
num_cols_per_loop = 1000
# And this just calculates how many times you'll need to go through the loop
# given the number of columns you set to get added each loop
num_loops = int(len(df2_columns) / num_cols_per_loop) + 1
for i in range(num_loops):
    # For each run of the loop, we determine which columns will get added
    this_column_sublist = df2_columns[i * num_cols_per_loop : (i + 1) * num_cols_per_loop]
    # You also need to add the "Name" column to make sure
    # you get the observations in the right order
    this_column_sublist = np.append("Name", this_column_sublist)
    # Finally, merge with just the subset of df2
    df3 = df3.merge(right=df2[this_column_sublist], how="inner", on="Name")
Like I said, it's an ugly workaround, but it just might work.
