inner join with huge dataframes (~2 million columns) - python

I am trying to join two data frames (df1 and df2) on matching values from one column (called 'Name') that appears in each data frame. I have tried this using R's inner_join function as well as Python's pandas merge function, and both work successfully on smaller subsets of my data. I think my problem is the size of my data frames.
My data frames are as follows:
df1 has the 'Name' column plus 5 additional columns and has ~900 rows.
df2 has the 'Name' column plus ~2 million additional columns and has ~900 rows.
I have tried (in R):
df3 <- inner_join(x = df1, y = df2, by = 'Name')
I have also tried (in Python where df1 and df2 are Pandas data frames):
df3 = df1.merge(right=df2, how='inner', left_on=df1.columns[1], right_on=df2.columns[0])
(where the 'Name' column is at index 1 of df1 and at index 0 of df2; merge expects column labels rather than positions, hence the columns[] lookups)
When I apply the above to my full data frames, it runs for a very long time and eventually crashes. Additionally, I suspected the problem might be the ~2 million columns of df2, so I tried subsetting it (row-wise) into smaller data frames. My plan was to join each small subset of df2 with df1 and then row-bind the resulting data frames together at the end. However, joining even the smaller partitioned df2s was unsuccessful.
I would appreciate any suggestions anyone would be able to provide.

Thank you everyone for your help! Using data.table as @shadowtalker suggested sped up the process tremendously. Just for reference, in case anyone is trying to do something similar: my df1 file was approximately 400 MB and my df2 file was approximately 3 GB.
I was able to accomplish the task as follows:
library(data.table)

# Convert both data frames to data.tables (setDT converts by reference)
setDT(df1)
setDT(df2)

# Set the join column as the key of each table
setkey(df1, Name)
setkey(df2, Name)

# Keyed join; nomatch = 0 drops non-matching rows, i.e. an inner join
df3 <- df1[df2, nomatch = 0]
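For anyone who would rather stay in Python: a rough pandas analogue of the keyed join above (a sketch only, untested at the ~2-million-column scale) is to move 'Name' into the index and use DataFrame.join, which joins on the index:

import pandas as pd

# Index-based join as a pandas analogue of the keyed data.table join.
# Joining on the index is typically faster than merging on a column.
df3 = df1.set_index('Name').join(df2.set_index('Name'), how='inner')
df3 = df3.reset_index()  # restore 'Name' as an ordinary column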

This is a really ugly workaround where I break up df2's columns and add them piece by piece. Not sure it will work, but it might be worth a try:
import numpy as np

# First, I only grab the "Name" column from df2
df3 = df1.merge(right=df2[["Name"]], how="inner", on="Name")
# Then I save all the column headers (excluding
# the "Name" column) in a separate list
df2_columns = df2.columns[~df2.columns.isin(["Name"])]
# This determines how many columns get added each time.
num_cols_per_loop = 1000
# And this just calculates how many times you'll need to go through the loop,
# given the number of columns added per pass
num_loops = int(len(df2_columns) / num_cols_per_loop) + 1
for i in range(num_loops):
    # For each run of the loop, we determine which columns will get added
    this_column_sublist = df2_columns[i * num_cols_per_loop:(i + 1) * num_cols_per_loop]
    # You also need to add the "Name" column to make sure
    # you get the observations in the right order
    this_column_sublist = np.append("Name", this_column_sublist)
    # Finally, merge with just this subset of df2
    df3 = df3.merge(right=df2[this_column_sublist], how="inner", on="Name")
Like I said, it's an ugly workaround, but it just might work.
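If you want to sanity-check the loop before pointing it at the full-size frames, a toy run like this (made-up data, chunk size of 2, 'Name' values assumed unique in df2) should reproduce a direct inner merge:

import numpy as np
import pandas as pd

# Tiny made-up frames for a sanity check
df1 = pd.DataFrame({"Name": ["a", "b", "c"], "x": [1, 2, 3]})
df2 = pd.DataFrame({"Name": ["b", "c", "d"],
                    "c1": [10, 20, 30], "c2": [40, 50, 60], "c3": [70, 80, 90]})

df3 = df1.merge(right=df2[["Name"]], how="inner", on="Name")
df2_columns = df2.columns[~df2.columns.isin(["Name"])]
num_cols_per_loop = 2  # small chunk size for the toy example
num_loops = int(len(df2_columns) / num_cols_per_loop) + 1
for i in range(num_loops):
    this_column_sublist = df2_columns[i * num_cols_per_loop:(i + 1) * num_cols_per_loop]
    this_column_sublist = np.append("Name", this_column_sublist)
    df3 = df3.merge(right=df2[this_column_sublist], how="inner", on="Name")

# The chunked result should match a direct inner merge
assert df3.equals(df1.merge(df2, how="inner", on="Name"))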

Related

How to check if two pandas dataframes have same values and concatenate those rows?

I have a DataFrame called "df" with 4 numerical columns [frame, id, x, y].
I made a loop that creates two dataframes called df1 and df2. Both df1 and df2 are subsets of the original dataframe.
What I want to do (and am not understanding how to do) is this: I want to CHECK whether df1 and df2 have the same VALUES in the column called "id". If they do, I want to concatenate those rows of df2 (the ones with the same id values) to df1.
For example: if df1 has rows with id values (1,6,4,8) and df2 has these id values (12,7,8,10), I want to concatenate the df2 rows that have id value 8 to df1. That is all I need.
This is my code:
for i in range(0, max(df['frame']), 30):
    df1 = df[df['frame'].between(i, i+30)]
    df2 = df[df['frame'].between(i-30, i)]
There are several ways to accomplish what you need.
The simplest one is to get the slice of df2 that contains the values you need with .isin() and concatenate it with df1 in one line.
df3 = pd.concat([df1, df2[df2.id.isin(df1.id)]], axis = 0)
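As a made-up illustration, using the id values from the question:

import pandas as pd

# df1 and df2 share only id == 8
df1 = pd.DataFrame({"frame": [0, 1, 2, 3], "id": [1, 6, 4, 8],
                    "x": [0.0, 0.1, 0.2, 0.3], "y": [1.0, 1.1, 1.2, 1.3]})
df2 = pd.DataFrame({"frame": [30, 31, 32, 33], "id": [12, 7, 8, 10],
                    "x": [2.0, 2.1, 2.2, 2.3], "y": [3.0, 3.1, 3.2, 3.3]})

df3 = pd.concat([df1, df2[df2.id.isin(df1.id)]], axis=0)
print(df3)  # df1's four rows plus the single df2 row with id == 8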
To gain more control and avoid any errors that might stem from df1 and df2 being updated elsewhere, you may want to take this one-liner apart.
look_for_vals = set(df1['id'].tolist())
# do some stuff
need_ix = df2[df2["id"].isin(look_for_vals)].index
# do more stuff
df3 = pd.concat([df1, df2.loc[need_ix, :]], axis=0)
Instead of set() you may also use df1['id'].unique().

How to merge 2 dataframe rows in a new dataframe row with pandas?

I have 2 variables (dataframes), one 47 columns wide and the other 87; they are DF1 and DF2.
Then I have a variable (dataframe) called full_data. DF1 and DF2 are two different subsets of data that I want to merge together once I find that 2 rows are equal.
I am doing everything I want so far besides appending the right value to the new dataframe.
below is the line of code I have been playing around with:
full_data = full_data.append(pd.concat([df1[i:i+1].copy(), df2[j:j+1].copy()], axis=1), ignore_index=True)
Once I find that the rows in both DF1 and DF2 are equal, I am trying to read both rows and put them one after the other as a single row in the variable full_data. What is happening right now is that the line of code is writing 2 rows instead of the one I want.
What I want is full_data.append(DF1 DF2), and right now I am getting:
full_data(i) = DF1
full_data(i+1) = DF2
Any help would be appreciated.
EM
In the end I solved my problem. Probably I was not clear enough in my question, but what was happening when concatenating was that I was getting duplicated or multiple rows, when the expected result was a single concatenated row.
The issue was found to be with the indexing: the indexes had to be reset, because pd.concat aligns rows by index label rather than by position.
I found an example and explanation here
My solution here:
df3 = df2[j:j+1].copy()
df4 = df1[i:i+1].copy()
full_data = full_data.append(
    pd.concat([df4.reset_index(drop=True), df3.reset_index(drop=True)], axis=1),
    ignore_index=True
)
I first created a copy of my variables and then reset the indexes.
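A minimal illustration of the index alignment that caused the extra rows: pd.concat with axis=1 lines rows up by index label, so two single-row slices with different labels produce two rows padded with NaN instead of one combined row.

import pandas as pd

a = pd.DataFrame({"x": [1, 2, 3]})
b = pd.DataFrame({"y": [4, 5, 6]})

row_a = a[0:1]  # index label 0
row_b = b[2:3]  # index label 2

print(pd.concat([row_a, row_b], axis=1))
#      x    y
# 0  1.0  NaN
# 2  NaN  6.0

print(pd.concat([row_a.reset_index(drop=True),
                 row_b.reset_index(drop=True)], axis=1))
#    x  y
# 0  1  6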

Combining two dataframes with different rows, keeping contents of first dataframe on all rows

Good day All,
I have two data frames that need to be merged, in a way that is a little different from the examples I have found so far, and I could not get it working. What I am currently getting is wrong, and I am sure it has to do with the index, as dataframe 1 only has 1 record. I need to copy the contents of dataframe 1 into new columns of dataframe 2, repeated for all rows.
I have tried merge, append, reset index etc...
DF 1: [screenshot of dataframe 1]
DF 2: [screenshot of dataframe 2]
Output requirement: [screenshot of the required output]
Any suggestions would be highly appreciated
Update:
I got it to work using the statements below; is there a more dynamic way than specifying the column names?
mod_df['Type'] = mod_df['Type'].fillna(method="ffill")
mod_df['Date'] = mod_df['Date'].fillna(method="ffill")
mod_df['Version'] = mod_df['Version'].fillna(method="ffill")
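If the aim is to avoid spelling out the column names, a more dynamic variant (a sketch, assuming it is fine to forward-fill those columns as a group, or even the whole frame) would be:

# Forward-fill a list of columns in one go
# (.ffill() is the modern spelling of fillna(method="ffill"))
cols = ['Type', 'Date', 'Version']
mod_df[cols] = mod_df[cols].ffill()
# ...or forward-fill every column at once
mod_df = mod_df.ffill()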
Assuming you have a single row in df1, use a cross merge:
out = df2.merge(df1, how='cross')
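For example, with a one-row df1 the cross merge repeats df1's values against every row of df2 (made-up values, column names taken from the update above):

import pandas as pd

df1 = pd.DataFrame({'Type': ['A'], 'Date': ['2021-01-01'], 'Version': [1]})
df2 = pd.DataFrame({'value': [10, 20, 30]})

out = df2.merge(df1, how='cross')  # requires pandas >= 1.2
print(out)
#    value Type        Date  Version
# 0     10    A  2021-01-01        1
# 1     20    A  2021-01-01        1
# 2     30    A  2021-01-01        1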

How to merge multiple (6) dataframes together based on one common column in python/pandas?

Before I start, I have found similar questions and tried the responding answers however, I am still running into an issue and can't figure out why.
I have 6 data frames. I want one resulting data frame that merges all 6 into one, based on their common index column Country. Things to note: the data frames have different numbers of rows, and some countries do not have corresponding values, resulting in NaN.
Here is what I have tried:
from functools import reduce

data_frames = [WorldPopulation_df, WorldEconomy_df, WorldEducation_df,
               WorldAggression_df, WorldCorruption_df, WorldCyberCapabilities_df]
df_merged = reduce(lambda left, right: pd.merge(left, right, on=['Country'], how='outer'),
                   data_frames)
This doesn't work, as the final resulting data frame pairs up the wrong values with the wrong country. Any suggestions?
Let's see: pd.merge is used when you want to add new columns from a key.
In case you have 6 dataframes with the same number of columns, in the same order, you can try this:
columns_order = ['country', 'column_1']
concat_ = pd.concat(
    [data_1[columns_order], data_2[columns_order], data_3[columns_order],
     data_4[columns_order], data_5[columns_order], data_6[columns_order]],
    ignore_index=True,
    axis=0
)
From here, if you want to have a single value for the "country" column, you can apply a group by to it:
concat_.groupby(by=['country']).agg({'column_1': 'max'}).reset_index()
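On a small made-up example, the concat-then-groupby pipeline collapses the stacked rows to one row per country:

import pandas as pd

data_1 = pd.DataFrame({'country': ['A', 'B'], 'column_1': [1, 4]})
data_2 = pd.DataFrame({'country': ['A', 'C'], 'column_1': [3, 2]})

columns_order = ['country', 'column_1']
concat_ = pd.concat([data_1[columns_order], data_2[columns_order]],
                    ignore_index=True, axis=0)
result = concat_.groupby(by=['country']).agg({'column_1': 'max'}).reset_index()
print(result)
#   country  column_1
# 0       A         3
# 1       B         4
# 2       C         2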

How to merge a big dataframe with small dataframe?

I have a big dataframe with 100 rows and the structure [qtr_dates<datetime.date>, sales<float>], and a small dataframe with the same structure and fewer than 100 rows. I want to merge these two dfs such that the merged df has all the rows from the small df, with the remaining rows taken from the big df.
Right now I am doing this
df = big_df.merge(small_df, on=big_df.columns.tolist(), how='outer')
But this is creating a df with duplicate qtr_dates.
Use concat, then remove duplicates with DataFrame.drop_duplicates. Because small_df comes first and drop_duplicates keeps the first occurrence by default, the small_df rows win for any shared qtr_dates:
pd.concat([small_df, big_df], ignore_index=True).drop_duplicates(subset=['qtr_dates'])
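A made-up example (plain strings standing in for the datetime.date values):

import pandas as pd

big_df = pd.DataFrame({'qtr_dates': ['2021-03-31', '2021-06-30', '2021-09-30'],
                       'sales': [100.0, 200.0, 300.0]})
small_df = pd.DataFrame({'qtr_dates': ['2021-06-30'], 'sales': [250.0]})

out = pd.concat([small_df, big_df], ignore_index=True).drop_duplicates(subset=['qtr_dates'])
print(out)
#     qtr_dates  sales
# 0  2021-06-30  250.0
# 1  2021-03-31  100.0
# 3  2021-09-30  300.0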
If I understand correctly, you want everything from the bigger dataframe, but if that date is present in the smaller data frame you would want it replaced by the relevant value from the smaller one?
Hence I think you want to do this:
df = big_df.merge(small_df, on=big_df.columns.tolist(), how='left', indicator=True)
df = df[df['_merge'] != 'both'].drop(columns='_merge')
df_out = pd.concat([df, small_df], ignore_index=True)
The second step removes any rows from big_df which also exist in small_df, before the small_df rows are added back by concatenating rather than merging (the indicator column has to be dropped first, or it ends up in the result).
If you had more columns that weren't involved in the join, you'd have to do some column renaming/dropping though, I think.
Hope that's right.
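A sketch of those three steps on made-up data; note this variation joins on the date alone rather than on every column, which sidesteps the exact-row-matching caveat above:

import pandas as pd

big_df = pd.DataFrame({'qtr_dates': ['2021-03-31', '2021-06-30', '2021-09-30'],
                       'sales': [100.0, 200.0, 300.0]})
small_df = pd.DataFrame({'qtr_dates': ['2021-06-30'], 'sales': [250.0]})

# Anti-join: keep big_df rows whose qtr_dates do not appear in small_df,
# then append small_df.
df = big_df.merge(small_df[['qtr_dates']], on='qtr_dates',
                  how='left', indicator=True)
df = df[df['_merge'] != 'both'].drop(columns='_merge')
df_out = pd.concat([df, small_df], ignore_index=True)
print(df_out)
#     qtr_dates  sales
# 0  2021-03-31  100.0
# 1  2021-09-30  300.0
# 2  2021-06-30  250.0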
Maybe try join instead of merge.
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html
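Since join aligns on the index, you would move qtr_dates into the index first. A related index-aligned method, combine_first, expresses the intent here directly: take small_df's values where present and fill the rest from big_df. A sketch with made-up data:

import pandas as pd

big_df = pd.DataFrame({'qtr_dates': ['2021-03-31', '2021-06-30'],
                       'sales': [100.0, 200.0]})
small_df = pd.DataFrame({'qtr_dates': ['2021-06-30'], 'sales': [250.0]})

# small_df values win for shared dates; big_df fills in the rest
out = (small_df.set_index('qtr_dates')
               .combine_first(big_df.set_index('qtr_dates'))
               .reset_index())
print(out)
#     qtr_dates  sales
# 0  2021-03-31  100.0
# 1  2021-06-30  250.0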
