I have a big dataframe with 100 rows and the structure [qtr_dates<datetime.date>, sales<float>], and a small dataframe with the same structure and fewer than 100 rows. I want to merge these two dfs such that the merged df has all the rows from the small df, with the remaining rows taken from the big df.
Right now I am doing this
df = big_df.merge(small_df, on=big_df.columns.tolist(), how='outer')
But this is creating a df with duplicate qtr_dates.
Use concat, then remove duplicates with DataFrame.drop_duplicates:
pd.concat([small_df, big_df], ignore_index=True).drop_duplicates(subset=['qtr_dates'])
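Because small_df is passed first in the concat, drop_duplicates (which keeps the first occurrence by default) retains the small_df row whenever a qtr_dates value appears in both frames. A minimal sketch with invented values to illustrate:

import datetime
import pandas as pd

# toy frames with one overlapping quarter date (values invented for illustration)
big_df = pd.DataFrame({'qtr_dates': [datetime.date(2021, 3, 31), datetime.date(2021, 6, 30)],
                       'sales': [100.0, 200.0]})
small_df = pd.DataFrame({'qtr_dates': [datetime.date(2021, 6, 30)],
                         'sales': [250.0]})

merged = pd.concat([small_df, big_df], ignore_index=True).drop_duplicates(subset=['qtr_dates'])
# merged keeps sales=250.0 for 2021-06-30 (from small_df) and sales=100.0 for 2021-03-31 (from big_df)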
If I understand correctly, you want everything from the bigger dataframe, but if a date is also present in the smaller dataframe you want that row replaced by the relevant value from the smaller one?
Hence I think you want to do this:
df = big_df.merge(small_df, on=big_df.columns.tolist(), how='left', indicator=True)
df = df[df['_merge'] != 'both'].drop(columns='_merge')
df_out = pd.concat([df, small_df], ignore_index=True)
The second step removes any rows from big_df which also exist in small_df, before the small_df rows are then added back by concatenating rather than merging.
If you had more columns that weren't involved in the join, you'd have to do some column renaming/dropping though, I think.
Hope that's right.
Maybe try join instead of merge.
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html
I'm looking for a way to convert 5 rows in a pandas dataframe into one row with five times as many columns (so I have the same information, just squashed into one row). Let me explain:
I'm working with hockey game statistics. Currently, there are 5 rows representing the same game in different situations, each with 111 columns. I want to convert these 5 rows into one row (so that one game is represented by one row) but keep the information contained in the different situations. In other words, I want to convert 5 rows, each with 111 columns into one row with 554 columns (554=111*5 minus one since we're joining on gameId).
Here is my DF head:
So, as an example, we can see the first 5 rows have gameId = 2008020001, but each have a different situation (i.e. other, all, 5on5, 4on5, and 5on4). I'd like these 5 rows to be converted into one row with gameId = 2008020001, and with columns labelled according to their situation.
For example, I want columns for all unblockedShotAttemptsAgainst, 5on5 unblockedShotAttemptsAgainst, 5on4 unblockedShotAttemptsAgainst, 4on5 unblockedShotAttemptsAgainst, and other unblockedShotAttemptsAgainst (and the same for every other stat).
Any info would be greatly appreciated. It's also worth mentioning that my dataset is fairly large (177990 rows), so an efficient solution is desired. The resulting dataframe should have one-fifth the rows and 5 times the columns. Thanks in advance!
---- What I've Tried Already ----
I tried to do this using df.apply() and some nested for loops, but it got very ugly very quickly and was incredibly slow. I think pandas has a better way of doing this, but I'm not sure how.
Looking at other SO answers, I initially thought it might have something to do with df.pivot() or df.groupby(), but I couldn't figure it out. Thanks again!
It sounds like what you are looking for is pd.get_dummies()
cols = df.columns
# get dummies for the 'situation' column
df1 = pd.get_dummies(df, columns=['situation'])
# drop all of the original columns, keeping only the new dummy columns
# ('situation' itself is already removed by get_dummies, so exclude it from the drop)
df1 = df1.drop(columns=cols.drop('situation'))
# add the dummy cols back onto the original df
df = pd.concat([df, df1], axis=1)
# collapse duplicate rows
df = df.groupby(list(cols)).first()
For the last line you can also use df.drop_duplicates() : https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html
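Since the question mentions df.pivot(), a pivot-based sketch might also be worth a look; it reshapes directly to one row per game, assuming the frame has a gameId column, a situation column, and the remaining stat columns (the column names come from the question, the rest is an assumption about the data layout):

# pivot so each stat column is split out by situation, giving a
# (stat, situation) MultiIndex on the columns
wide = df.pivot(index='gameId', columns='situation')
# flatten the MultiIndex into "situation stat" style names,
# e.g. "5on5 unblockedShotAttemptsAgainst"
wide.columns = [f'{situation} {stat}' for stat, situation in wide.columns]
wide = wide.reset_index()

This avoids row-wise loops entirely, so it should stay reasonably fast even at ~178k rows.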
I have many rows of data and one of the columns is a flag. I have 3 identifiers that need to match between rows.
What I have:
partnumber, datetime1, previousdatetime1, datetime2, previousdatetime2, flag
What I need:
partnumber, datetime1, previousdatetime1, datetime2, previousdatetime2, flag, previous_flag
I need to find the flag from the row where partnumber matches, where previousdatetime1 (current row) == datetime1 (other row*), and where previousdatetime2 (current row) == datetime2 (other row).
*To note, the rows are not necessarily in order, so the matching 'previous' row may come later in the dataframe.
I'm not quite sure where to start. I got this logic working in PBI using a LookUpValue and basically finding where partnumber = Value(partnumber), datetime1 = Value(datetime1), datetime2 = Value(datetime2). Thanks for the help!
Okay, so assuming you've read this in as a pandas dataframe df1:
(1) Make a copy of the dataframe:
df2=df1.copy()
(2) For sanity, drop some columns in df2
df2.drop(['previousdatetime1','previousdatetime2'],axis=1,inplace=True)
Now you have a df2 that has columns:
['partnumber','datetime1','datetime2','flag']
(3) Merge the two dataframes, matching on partnumber plus both previous/current datetime pairs
newdf = df1.merge(df2, how='left', left_on=['partnumber','previousdatetime1','previousdatetime2'], right_on=['partnumber','datetime1','datetime2'], suffixes=('','_previous'))
Now you have a newdf that has columns:
['partnumber','datetime1','previousdatetime1','datetime2','previousdatetime2','flag','partnumber_previous','datetime1_previous','datetime2_previous','flag_previous']
(4) Drop the unnecessary columns
newdf.drop(['partnumber_previous', 'datetime1_previous', 'datetime2_previous'],axis=1,inplace=True)
Now you have a newdf that has columns:
['partnumber','datetime1','previousdatetime1','datetime2','previousdatetime2','flag','flag_previous']
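A minimal end-to-end sketch with invented toy values, to show what the lookup produces:

import pandas as pd

# hypothetical data: two rows where the second row's 'previous' datetimes
# point back at the first row
df1 = pd.DataFrame({
    'partnumber': ['A', 'A'],
    'datetime1': ['2021-01-02', '2021-01-03'],
    'previousdatetime1': ['2021-01-01', '2021-01-02'],
    'datetime2': ['2021-01-12', '2021-01-13'],
    'previousdatetime2': ['2021-01-11', '2021-01-12'],
    'flag': [0, 1],
})

df2 = df1.drop(columns=['previousdatetime1', 'previousdatetime2'])
newdf = df1.merge(df2, how='left',
                  left_on=['partnumber', 'previousdatetime1', 'previousdatetime2'],
                  right_on=['partnumber', 'datetime1', 'datetime2'],
                  suffixes=('', '_previous'))
# the second row picks up flag_previous = 0 from the first row;
# the first row has no earlier match, so its flag_previous is NaN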
I have an array of unique elements and a dataframe.
I want to find out if the elements in the array exist in all the rows of the dataframe.
P.S. I am new to Python.
This is the piece of code I've written.
for i in uniqueArray:
    for index, row in newDF.iterrows():
        if i in row['MKT']:
            # do something to find out if the element i exists in all rows
Also, this way of iterating is quite expensive; is there any better way to do the same?
Thanks in Advance.
Pandas allows you to filter a whole column as if it were Excel:
import pandas
df = pandas.DataFrame(tableData)
Imagine your column names are "Column1", "Column2", etc.
df2 = df[df["Column1"] == "ValueToFind"]
df2 now has only the rows that have "ValueToFind" in df["Column1"]. You can chain several filters together with the AND/OR logical operators (& and |).
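For example, to combine two conditions you wrap each one in parentheses and join them with & (AND) or | (OR); the column names here are just placeholders:

df3 = df[(df["Column1"] == "ValueToFind") & (df["Column2"] > 0)]
df4 = df[(df["Column1"] == "ValueToFind") | (df["Column1"] == "OtherValue")]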
You can try
for i in uniqueArray:
    # assuming 'MKT' holds strings, str.contains mirrors the `i in row['MKT']` check
    if newDF['MKT'].str.contains(i, regex=False).any():
        # do your task
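If the goal is to know which elements occur in every row rather than in any row (as the question suggests), the same idea works with .all() instead of .any(); this sketch also assumes 'MKT' holds strings:

elements_in_all_rows = [i for i in uniqueArray
                        if newDF['MKT'].str.contains(i, regex=False).all()]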
You can use the isin() method of the pd.Series object.
Assuming you have a data frame named df, you can check whether your column 'MKT' includes any items of your uniqueArray:
new_df = df[df.MKT.isin(uniqueArray)].copy()
new_df will only contain the rows where the value of MKT is contained in uniqueArray.
Now do your work on new_df, and join/merge/concat it back to the original df as you wish.
I am trying to join two data frames (df1 and df2) based on matching values from one column (called 'Names') that is found in each data frame. I have tried this using R's inner_join function as well as Python's pandas merge function, and have been able to get both to work successfully on smaller subsets of my data. I think my problem is with the size of my data frames.
My data frames are as follows:
df1 has the 'Names' column with 5 additional columns and has ~900 rows.
df2 has the 'Names' column with ~2 million additional columns and has ~900 rows.
I have tried (in R):
df3 <- inner_join(x = df1, y = df2, by = 'Name')
I have also tried (in Python where df1 and df2 are Pandas data frames):
df3 = df1.merge(right = df2, how = 'inner', left_on = 1, right_on = 0)
(where the 'Name' column is at index 1 of df1 and at index 0 of df2)
When I apply the above to my full data frames, it runs for a very long time and eventually crashes. Additionally, I suspect that the problem may be with the 2 million columns of my df2, so I tried sub-setting it (row-wise) into smaller data frames. My plan was to join the small subsets of df2 with df1 and then row bind the new data frames together at the end. However, joining even the smaller partitioned df2s was unsuccessful.
I would appreciate any suggestions anyone would be able to provide.
Thank you everyone for your help! Using data.table as #shadowtalker suggested sped up the process tremendously. Just for reference in case anyone is trying to do something similar, df1 was approximately 400 MB and my df2 file was approximately 3 GB.
I was able to accomplish the task as follows:
library(data.table)
df1 <- setDT(df1)
df2 <- setDT(df2)
setkey(df1, Name)
setkey(df2, Name)
df3 <- df1[df2, nomatch = 0]
This is a really ugly workaround where I break up df2's columns and add them piece by piece. Not sure it will work, but it might be worth a try:
import numpy as np

# First, I only grab the "Name" column from df2
df3 = df1.merge(right=df2[["Name"]], how="inner", on="Name")
# Then I save all the column headers (excluding
# the "Name" column) in a separate list
df2_columns = df2.columns[np.logical_not(df2.columns.isin(["Name"]))]
# This determines how many columns are going to get added each time.
num_cols_per_loop = 1000
# And this just calculates how many times you'll need to go through the loop
# given the number of columns you set to get added each loop
num_loops = int(len(df2_columns) / num_cols_per_loop) + 1
for i in range(num_loops):
    # For each run of the loop, we determine which columns will get added
    this_column_sublist = df2_columns[i * num_cols_per_loop : (i + 1) * num_cols_per_loop]
    # You also need to add the "Name" column to make sure
    # you get the observations in the right order
    this_column_sublist = np.append("Name", this_column_sublist)
    # Finally, merge with just the subset of df2
    df3 = df3.merge(right=df2[this_column_sublist], how="inner", on="Name")
Like I said, it's an ugly workaround, but it just might work.
I'm working on a way to transform sequence/genotype data from a csv format to a genepop format.
I have two dataframes: df1 is empty, and df1.index (rows = samples) is almost the same as df2.index, except that I inserted "POP" in several places (to specify the different populations). df2 holds the data, with loci as columns.
I want to insert the values from df2 into df1, keeping empty rows where df1.index = 'POP'.
I tried join, combine, combine_first and concat, but they all seem to keep only the rows that exist in both dfs.
Is there a way to do this?
It sounds like you want an 'outer' join:
df1.join(df2, how='outer')
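A minimal sketch with invented sample and locus names, showing that the 'POP' rows survive as empty rows:

import pandas as pd

df1 = pd.DataFrame(index=['POP', 'sample1', 'sample2', 'POP', 'sample3'])
df2 = pd.DataFrame({'Locus1': ['0101', '0102', '0202'],
                    'Locus2': ['0303', '0304', '0404']},
                   index=['sample1', 'sample2', 'sample3'])

out = df1.join(df2, how='outer')
# out contains every row of df1, with NaN in the locus columns for the 'POP' rows

Since df1.index already contains every label in df2.index, how='left' would give the same rows while preserving df1's original row order, which may matter for the genepop layout.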