How to merge two dataframes with different lengths in python - python

I am trying to merge two weekly DataFrames, each made up of one column, but with different lengths.
Could I please know how to merge them, maintaining the 'Week' indexing?
[df1]
Week Coeff1
1 -0.456662
1 -0.533774
1 -0.432871
1 -0.144993
1 -0.553376
... ...
53 -0.501221
53 -0.025225
53 1.529864
53 0.044380
53 -0.501221
[16713 rows x 1 columns]
[df2]
Week Coeff
1 0.571707
1 0.086152
1 0.824832
1 -0.037042
1 1.167451
... ...
53 -0.379374
53 1.076622
53 -0.547435
53 -0.638206
53 0.067848
[63265 rows x 1 columns]
I've tried this code:
df3 = pd.merge(df1, df2, how='inner', on='Week')
df3 = df3.drop_duplicates()
df3
But it gave me a new df (df3) with 13386431 rows × 2 columns
Desired outcome: A new df which has 3 columns (week, coeff1, coeff2), as df2 is longer, I expect to have some NaNs in coeff1 to fill the gaps.

I assume your output should look somewhat like this:
Week Coeff1 Coeff2
1 -0.456662 0.571707
1 -0.533774 0.086152
1 -0.432871 0.824832
2 3 3
2 NaN 3
Don't mind the actual numbers though.
The problem is that you won't achieve that with a join on Week, neither left nor inner, because the Week index is not unique.
So, on a left join, pandas is going to join all the Coeff2-Values where df2.Week == 1 on every single row in df1 where df1.Week == 1. And that is why you get these millions of rows.
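To see the row multiplication on a small scale, here is a minimal sketch with made-up frames (the values and sizes are illustrative assumptions, not the data from the question, and Week is treated as an ordinary column):

```python
import pandas as pd

# Toy frames: 'Week' repeats within each frame, as in the question.
df1 = pd.DataFrame({"Week": [1, 1, 2], "Coeff1": [0.1, 0.2, 0.3]})
df2 = pd.DataFrame({"Week": [1, 1, 1, 2, 2], "Coeff2": [1.0, 2.0, 3.0, 4.0, 5.0]})

merged = pd.merge(df1, df2, how="inner", on="Week")

# Week 1: 2 rows x 3 rows = 6 pairs; week 2: 1 row x 2 rows = 2 pairs.
print(len(merged))  # 8
```

Every row of a week on one side is paired with every row of the same week on the other side, so the counts multiply per week instead of lining up.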
I will try and give you a workaround later, but maybe this helps you to think about this problem from another perspective!
Now is later:
What you actually want to do is to concatenate the Dataframes "per week".
You achieve that by iterating over every week, creating a df_subset[week] concatenating df1[week] and df2[week] by axis=1 and then concatenating all these subsets on axis=0 afterwards:
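weekly_dfs = []
for week in df1.Week.unique():
    # Take each week's values and renumber them 0, 1, 2, ... so they align by position
    sub_df1 = df1.loc[df1.Week == week, "Coeff1"].reset_index(drop=True)
    sub_df2 = df2.loc[df2.Week == week, "Coeff2"].reset_index(drop=True)
    # Side by side; the shorter column is padded with NaN
    concat_df = pd.concat([sub_df1, sub_df2], axis=1)
    concat_df["Week"] = week
    weekly_dfs.append(concat_df)
# Stack the weekly pieces on top of each other
df3 = pd.concat(weekly_dfs).reset_index(drop=True)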
The last reset of the index is optional but I recommend it anyways!
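If the loop turns out to be slow, one possible alternative (not part of the answer above, just a sketch) is to number the rows within each week with groupby().cumcount() and merge on the pair (Week, position). The helper column name "pos" and the toy data are assumptions made for illustration:

```python
import pandas as pd

df1 = pd.DataFrame({"Week": [1, 1, 2], "Coeff1": [0.1, 0.2, 0.3]})
df2 = pd.DataFrame({"Week": [1, 1, 1, 2, 2], "Coeff2": [1.0, 2.0, 3.0, 4.0, 5.0]})

# Number the rows within each week: 0, 1, 2, ...
df1["pos"] = df1.groupby("Week").cumcount()
df2["pos"] = df2.groupby("Week").cumcount()

# (Week, pos) is unique in each frame, so the merge pairs rows one-to-one.
df3 = (
    pd.merge(df1, df2, how="outer", on=["Week", "pos"])
    .sort_values(["Week", "pos"])
    .drop(columns="pos")
    .reset_index(drop=True)
)
print(df3)
```

The outer join keeps the extra rows of the longer frame and leaves NaN in the other coefficient column for them.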

Based on your last comment on the question, you may want to concatenate instead of merging the two data frames:
df3 = pd.concat([df1,df2], ignore_index=True, axis=1)
The resulting DataFrame should have 63265 rows and will need some work to get it to the required format (remove the added index columns, rename the remaining columns, etc.), but pd.concat should be a good start.
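Since pd.concat with axis=1 aligns rows by index label, a repeating Week index gets in the way. A minimal sketch, assuming Week is the index of both frames (toy values, not the question's data), is to turn it into a column first so that rows align by position:

```python
import pandas as pd

df1 = pd.DataFrame({"Coeff1": [0.1, 0.2, 0.3]},
                   index=pd.Index([1, 1, 2], name="Week"))
df2 = pd.DataFrame({"Coeff2": [1.0, 2.0, 3.0, 4.0, 5.0]},
                   index=pd.Index([1, 1, 1, 2, 2], name="Week"))

left = df1.reset_index()    # columns: Week, Coeff1
right = df2.reset_index()   # columns: Week, Coeff2

# Rows are paired by their position in the whole frame, not per week.
side_by_side = pd.concat([left, right], axis=1)
print(side_by_side.shape)  # (5, 4)
```

Note that in this sketch the third row pairs a week-2 value of Coeff1 with a week-1 value of Coeff2, so this only gives the desired layout if positional pairing across the whole frame is acceptable.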

According to pandas' merge documentation, the default is an inner join, whereas what you are looking for is a left join. You can change this by passing a different how argument:
df2.merge(df1,how='left', left_on='Week', right_on='Week')
Note that this keeps every row of the bigger df and assigns NaN wherever the shorter df has no matching Week.

Related

Copying (assembling) the column from smaller data frames into the bigger data frame with pandas

I have a data frame with measurements for several groups of participants, and I am doing some calculations for each group. I want to add a column in a big data frame (all participants), from secondary data frames (partial list of participants).
When I do merge a couple of times (merging a new data frame to the existing one), it creates a duplicate of the column instead of one column.
As the size of the dataframes is different I can not compare them directly.
I tried
#df1 - main bigger dataframe, df2 - smaller dataset contains group of df1
for i in range(len(df1)):
    # checking indices to place the data to correct participant:
    if df1.index[i] not in df2['index']:
        pass
    else :
        df1['rate'][i] = list(df2[rate][df2['index']==i])
It does not work properly though. Can you please help with the correct way of assembling the column?
update: where the index of the initial dataframe and the "index" column of the calculation is the same, copy the rate value from the calculation into main df
main dataframe df1
index rate
1 0
2 0
3 0
4 0
5 0
6 0
dataframe with calculated values
index rate
1 value
4 value
6 value
output df
index rate
1 value
2 0
3 0
4 value
5 0
6 value
Try this – using .join() to merge dataframes on their indices and combining two columns using .combine_first():
df = df1.join(df2, lsuffix="_df1", rsuffix="_df2")
df["rate"] = df["rate_df2"].combine_first(df["rate_df1"])
EDIT:
This assumes both dataframes use a matching index. If that is not the case for df2, run this first:
df2 = df2.set_index('index')
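Put together on the tables from the question, the approach might look like this. The numeric rates 0.5, 0.7 and 0.9 are placeholders standing in for "value":

```python
import pandas as pd

# Main frame: participants 1..6, rate initialised to 0.
df1 = pd.DataFrame({"rate": [0, 0, 0, 0, 0, 0]}, index=[1, 2, 3, 4, 5, 6])
# Calculated values for a subset of participants, keyed by an 'index' column.
df2 = pd.DataFrame({"index": [1, 4, 6], "rate": [0.5, 0.7, 0.9]})

df2 = df2.set_index("index")
df = df1.join(df2, lsuffix="_df1", rsuffix="_df2")

# Prefer the calculated rate; fall back to the original where it is missing.
df["rate"] = df["rate_df2"].combine_first(df["rate_df1"])
df = df[["rate"]]
print(df["rate"].tolist())  # [0.5, 0.0, 0.0, 0.7, 0.0, 0.9]
```

Repeating these steps for each secondary frame keeps a single rate column instead of accumulating suffixed duplicates.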

assign value to pandas column based on data in another dataframe

I have 2 dataframes
df1
ID ID2 NUMBER
1 2 null
df2
ID ID2 NUMBER
1 2 1
1 2 2
1 2 3
So when doing merge between df1 and df2 using ID and ID2 I get duplicated columns because df1 has 3 matches in df2. I'd like to assign a random number to df1 and use it for merging, this way I always get 1 to 1 merge.
The problem is that my dataset is rather big and sometimes I have only 1 row in df2 (so merge works properly) and sometimes I have 10+ rows in df2. I'd like to assign a number to df1 using:
rand(1, len(df1[(df1.ID == 1) & (df1.ID2 == 2)]))
I think I found a solution I'm posting it here so others can tell me if there is a better way.
def select_random_row(grp):
    ID = grp.ID.iloc[0]
    ID2 = grp.ID2.iloc[0]
    return random.randint(1, len(df1[(df1.ID == ID) & (df1.ID2 == ID2)]))
df2['g'] = df2.groupby(['ID','ID2']).apply(select_random_row)
EDIT:
This is way too slow to do on a large dataset... I decided to just use drop_duplicates before merging and keep the 1st record. It isn't random like I wanted, but it is better than nothing.
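A minimal sketch of the drop_duplicates approach on the example frames (the empty NUMBER column of df1 is left out here to avoid suffixed column names). The random variant uses groupby().sample(), which I believe needs pandas 1.1 or newer, so treat that line as an assumption about your version:

```python
import pandas as pd

df1 = pd.DataFrame({"ID": [1], "ID2": [2]})
df2 = pd.DataFrame({"ID": [1, 1, 1], "ID2": [2, 2, 2], "NUMBER": [1, 2, 3]})

# Keep only the first df2 row per (ID, ID2) so the merge becomes one-to-one.
first_only = df2.drop_duplicates(subset=["ID", "ID2"], keep="first")
merged = df1.merge(first_only, on=["ID", "ID2"], how="left")

# Variant: pick one random row per (ID, ID2) group instead of the first one.
random_pick = df2.groupby(["ID", "ID2"]).sample(n=1, random_state=0)
merged_random = df1.merge(random_pick, on=["ID", "ID2"], how="left")

print(merged)
```

Both variants work on whole columns rather than calling a Python function per group, which is usually where the time goes on large data.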

Number of rows changes even after `pandas.merge` with `left` option

I am merging two data frames using pandas.merge. Even after specifying how = left option, I found the number of rows of merged data frame is larger than the original. Why does this happen?
panel = pd.read_csv(file1, encoding ='cp932')
before_len = len(panel)
prof_2000 = pd.read_csv(file2, encoding ='cp932').drop_duplicates()
temp_2000 = pd.merge(panel, prof_2000, left_on='Candidate_u', right_on="name2", how="left")
after_len = len(temp_2000)
print(before_len, after_len)
> 12661 13915
This sounds like there is more than one row in the right frame whose 'name2' matches a key in the left frame. Using how='left' with pandas.DataFrame.merge() only means that:
left: use only keys from left frame
However, the actual number of rows in the result object is not necessarily going to be the same as the number of rows in the left object.
Example:
In [359]: df_1
Out[359]:
A B
0 a AAA
1 b BBA
2 c CCF
and then another DF that looks like this (notice that there are more than one entry for your desired key on the left):
In [360]: df_3
Out[360]:
key value
0 a 1
1 a 2
2 b 3
3 a 4
If I merge these two on left.A, here's what happens:
In [361]: df_1.merge(df_3, how='left', left_on='A', right_on='key')
Out[361]:
A B key value
0 a AAA a 1.0
1 a AAA a 2.0
2 a AAA a 4.0
3 b BBA b 3.0
4 c CCF NaN NaN
This happened even though I merged with how='left'. As you can see above, there was simply more than one row to merge for key 'a', so the resulting pd.DataFrame has more rows than the pd.DataFrame on the left.
I hope this helps!
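As a quick check before merging, the row count of a left merge can be predicted from how often each key occurs on the right. A small sketch using the frames from the example above:

```python
import pandas as pd

df_1 = pd.DataFrame({"A": ["a", "b", "c"], "B": ["AAA", "BBA", "CCF"]})
df_3 = pd.DataFrame({"key": ["a", "a", "b", "a"], "value": [1, 2, 3, 4]})

# How many right-hand rows does each key have?
matches = df_3["key"].value_counts()

# Each left row yields one output row per match, or a single NaN row if none.
expected_rows = int(df_1["A"].map(matches).fillna(1).sum())

merged = df_1.merge(df_3, how="left", left_on="A", right_on="key")
print(expected_rows, len(merged))  # 5 5
```

If expected_rows is larger than len(df_1), the right frame has repeated keys and the merge will add rows.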
The problem of rows multiplying after each merge() (whatever the how type) is usually caused by duplicates in the keys, so we need to drop them first:
left_df.drop_duplicates(subset=left_key, inplace=True)
right_df.drop_duplicates(subset=right_key, inplace=True)
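To have pandas itself flag duplicate keys, merge accepts a validate argument. A minimal sketch with hypothetical frames (the column names key, x and y are made up for illustration):

```python
import pandas as pd

left_df = pd.DataFrame({"key": ["a", "b"], "x": [1, 2]})
right_df = pd.DataFrame({"key": ["a", "a", "b"], "y": [10, 20, 30]})

# validate="one_to_one" raises MergeError if a key repeats on either side.
try:
    pd.merge(left_df, right_df, on="key", how="left", validate="one_to_one")
    unique_keys = True
except pd.errors.MergeError:
    unique_keys = False

# After dropping duplicate keys the same merge passes the check.
deduped = right_df.drop_duplicates(subset="key")
merged = pd.merge(left_df, deduped, on="key", how="left", validate="one_to_one")
print(unique_keys, len(merged))  # False 2
```

Keep in mind that drop_duplicates silently discards the extra rows (here y=20), so it is only appropriate when those rows are genuinely redundant.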
If you do not have any duplicates of the kind described in the answer above, you should double-check the names of the removed entries. In my case, I discovered that the names were inconsistent between df1 and df2, and I solved the problem with:
df1["col1"] = df2["col2"]

Merge pandas dataframe, with column operation

I searched the archive, but did not find what I wanted (probably because I don't really know what keywords to use).
Here is my problem: I have a bunch of dataframes that need to be merged; I also want to update the values of a subset of columns with the sum across the dataframes.
For example, I have two dataframes, df1 and df2:
df1=pd.DataFrame([ [1,2],[1,3], [0,4]], columns=["a", "b"])
df2=pd.DataFrame([ [1,5],[0,6]], columns=["a", "b"], index=[0, 2])
df1:
   a  b
0  1  2
1  1  3
2  0  4
df2:
   a  b
0  1  5
2  0  6
after merging, I'd like to have the column 'b' updated with the sum of matched records, while column 'a' should be just like df1 (or df2, don't really care) as before:
a b
0 1 7
1 1 3
2 0 10
Now, expand this to merging three or more data frames.
Are there straightforward, built-in tricks to do this? Or do I need to process them one by one, line by line?
===== Edit / Clarification =====
In the real world example, each data frame may contain indexes that are not in the other data frames. In this case, the merged data frame should have all of them and update the shared entries/indexes with sum (or some other operation).
Only partial, not complete solution yet. But the main point is solved:
df3 = pd.concat([df1, df2], join = "outer", axis=1)
df4 = df3.b.sum(axis=1)
df3 will have two 'a' columns and two 'b' columns. The sum() function on df3.b adds the two 'b' columns and ignores NaNs. Now df4 has column 'b' with the sum of df1's and df2's 'b' columns, and all the indexes.
This did not solve column 'a' though. In my real case, there are quite a few NaNs in df3.a, while the other values in df3.a should be the same. I haven't found a straightforward way to make a column 'a' in df4 and fill it with the non-NaN value. Now searching for a "count" function to get the occurrence of elements in the rows of df3.a (imagine it has a few dozen 'a' columns).
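One way to handle both columns at once, offered as a sketch rather than a definitive answer, is to stack the frames vertically and aggregate per index label: "first" takes the first non-missing 'a', and "sum" adds up 'b'. The frames below are reconstructed from the printed tables in the question, which is an assumption about the intended data:

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 1, 0], "b": [2, 3, 4]}, index=[0, 1, 2])
df2 = pd.DataFrame({"a": [1, 0], "b": [5, 6]}, index=[0, 2])
frames = [df1, df2]  # extend this list with a third, fourth, ... frame

# Stack all frames, then combine the rows that share an index label.
stacked = pd.concat(frames)
result = stacked.groupby(level=0).agg({"a": "first", "b": "sum"})
print(result)
```

Index labels that appear in only one frame are kept as they are, which matches the clarification about indexes missing from some frames.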

Pandas: join 'on' failing

I have two DataFrames, df1:
   ID  value 1
0   5      162
1   7      185
2  11      156
and df2:
   ID  Comment
1   5
2   7  Yes!
6  11
... which I want to join using ID, with a result that looks like this:
ID  value 1  Comment
 5      162
 7      185  Yes!
11      156
The real DataFrames are much larger and contain more columns, and I essentially want to add the Comment column from df2 to df1. I tried using
df1 = df1.join(df2['Comment'], on='ID')
... but that only gets me a new empty Comment column in df1, like .join somehow fails to use the ID column as the index. I have also tried
df1 = df1.join(df2['Comment'])
... but that uses the default indexes, which don't match between the two DataFrames (they also have different lengths), giving me a Comment value on the wrong place.
What am I doing wrong?
You can just do a merge to achieve what you want:
In [30]:
df1.merge(df2, on='ID')
Out[30]:
ID value1 Comment
0 5 162 None
1 7 185 Yes!
2 11 156 None
[3 rows x 3 columns]
The problem with join is that by default it performs a left join against the index of the other frame. With on='ID', the values in df1's ID column are looked up in df2's index, and because none of df2's index values match, your Comment column ends up empty.
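If you would rather keep using join, a minimal sketch is to make ID the index of df2 first, so that join has matching labels to look up. The frames are rebuilt from the question, with the column named value1 as in the output above:

```python
import pandas as pd

df1 = pd.DataFrame({"ID": [5, 7, 11], "value1": [162, 185, 156]})
df2 = pd.DataFrame({"ID": [5, 7, 11], "Comment": [None, "Yes!", None]},
                   index=[1, 2, 6])

# Original attempt: df1.ID is matched against df2's index (1, 2, 6), so nothing lines up.
empty = df1.join(df2["Comment"], on="ID")

# Index df2 by ID so the lookup succeeds.
comments = df2.set_index("ID")["Comment"]
joined = df1.join(comments, on="ID")
print(joined)
```

Here empty["Comment"] is entirely missing, while joined carries "Yes!" on the row with ID 7.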
EDIT
Following on from the comments, if you want to retain all values in df1 and add just the comments that are not empty and have ID's that exist in df1 then you can perform a left merge:
df1.merge(df2.dropna( subset=['Comment']), on='ID', how='left')
This drops any rows of df2 with empty comments, then merges df1 and df2 on the ID column. Because it is a left merge, all rows of the left-hand side are retained and a comment is filled in only where the ID matches. The default is an inner merge, which retains only IDs that are in both the left and right dfs.
Further information on merge and further examples.
