Merge pandas dataframe, with column operation - python

I searched the archive but did not find what I wanted (probably because I don't really know which keywords to use).
Here is my problem: I have a bunch of dataframes that need to be merged; I also want to update the values of a subset of columns with the sum across the dataframes.
For example, I have two dataframes, df1 and df2:
df1 = pd.DataFrame([[1, 2], [1, 3], [0, 4]], columns=["a", "b"])
df2 = pd.DataFrame([[1, 5], [0, 6]], index=[0, 2], columns=["a", "b"])
df1         df2
   a  b        a  b
0  1  2     0  1  5
1  1  3     2  0  6
2  0  4
After merging, I'd like column 'b' updated with the sum of the matched records, while column 'a' should stay as it was in df1 (or df2, I don't really care):
a b
0 1 7
1 1 3
2 0 10
Now, expand this to merging three or more data frames.
Are there straightforward, built-in tricks to do this, or do I need to process the frames one by one, line by line?
===== Edit / Clarification =====
In the real world example, each data frame may contain indexes that are not in the other data frames. In this case, the merged data frame should have all of them and update the shared entries/indexes with sum (or some other operation).

Only a partial solution so far, but the main point is solved:
df3 = pd.concat([df1, df2], join = "outer", axis=1)
df4 = df3.b.sum(axis=1)
df3 will have two 'a' columns and two 'b' columns. Calling sum(axis=1) on df3.b adds the two 'b' columns and ignores NaNs. Now df4 (a Series) holds the sum of df1's and df2's 'b' columns, with all the indexes.
This does not solve column 'a', though. In my real case there are quite a few NaNs in df3.a, while the remaining values in each row of df3.a should be identical. I haven't found a straightforward way to make a column 'a' in df4 and fill it with the non-NaN value. I am now looking for a "count" function to get the occurrences of elements in each row of df3.a (imagine it has a few dozen 'a' columns).
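One possible way to handle both columns at once, offered as a sketch rather than a definitive answer: stack the frames vertically and group by the index, summing 'b' and taking the first non-NaN 'a'. The df2 below is an assumption chosen to match the printed tables (index 0 and 2), and the names frames and merged are only illustrative.

```python
import pandas as pd

# Toy frames; df2's index [0, 2] is assumed from the printed tables
df1 = pd.DataFrame([[1, 2], [1, 3], [0, 4]], columns=["a", "b"])
df2 = pd.DataFrame([[1, 5], [0, 6]], index=[0, 2], columns=["a", "b"])

# Any number of frames can go in this list
frames = [df1, df2]

# Stack the rows, then combine rows that share an index label:
# 'first' keeps the first non-NaN value of 'a', 'sum' adds up 'b'
merged = pd.concat(frames).groupby(level=0).agg({"a": "first", "b": "sum"})
print(merged)
```

Index labels that appear in only one frame are kept as they are, so this should also cover the case from the clarification where the frames do not share all indexes.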

Related

How to merge two dataframes with different lengths in python

I am trying to merge two weekly DataFrames, each made up of one column, but with different lengths.
Could I please know how to merge them, maintaining the 'Week' indexing?
[df1]
Week Coeff1
1 -0.456662
1 -0.533774
1 -0.432871
1 -0.144993
1 -0.553376
... ...
53 -0.501221
53 -0.025225
53 1.529864
53 0.044380
53 -0.501221
[16713 rows x 1 columns]
[df2]
Week Coeff
1 0.571707
1 0.086152
1 0.824832
1 -0.037042
1 1.167451
... ...
53 -0.379374
53 1.076622
53 -0.547435
53 -0.638206
53 0.067848
[63265 rows x 1 columns]
I've tried this code:
df3 = pd.merge(df1, df2, how='inner', on='Week')
df3 = df3.drop_duplicates()
df3
But it gave me a new df (df3) with 13386431 rows × 2 columns
Desired outcome: a new df with 3 columns (week, coeff1, coeff2). As df2 is longer, I expect some NaNs in coeff1 to fill the gaps.
I assume your output should look somewhat like this:
Week  Coeff1     Coeff2
1     -0.456662  0.571707
1     -0.533774  0.086152
1     -0.432871  0.824832
2     3          3
2     NaN        3
Don't mind the actual numbers though.
The problem is that you won't achieve that with a join on Week, whether left or inner, because the Week index is not unique.
So, on a left join, pandas joins every Coeff2 value where df2.Week == 1 onto every single row in df1 where df1.Week == 1. That is why you get these millions of rows.
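A tiny illustration of that row multiplication, using made-up numbers (the values and frame sizes here are assumptions, not the asker's data): two rows for week 1 on the left and three on the right give 2 x 3 = 6 rows after the merge.

```python
import pandas as pd

# Hypothetical miniature versions of the two weekly frames
df1 = pd.DataFrame({"Week": [1, 1], "Coeff1": [0.1, 0.2]})
df2 = pd.DataFrame({"Week": [1, 1, 1], "Coeff2": [1.0, 2.0, 3.0]})

# Every df1 row for week 1 is paired with every df2 row for week 1
merged = pd.merge(df1, df2, how="left", on="Week")
print(len(df1), len(df2), len(merged))
```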
I will try and give you a workaround later, but maybe this helps you to think about this problem from another perspective!
Now is later:
What you actually want to do is to concatenate the Dataframes "per week".
You achieve that by iterating over every week, building a per-week subset by concatenating df1[week] and df2[week] along axis=1, and then concatenating all these subsets along axis=0:
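weekly_dfs = []
for week in df1.Week.unique():
    # Select one week from each frame and drop the old index so rows align by position
    sub_df1 = df1.loc[df1.Week == week, "Coeff1"].reset_index(drop=True)
    sub_df2 = df2.loc[df2.Week == week, "Coeff2"].reset_index(drop=True)
    # Side-by-side concat; the shorter column is padded with NaN
    concat_df = pd.concat([sub_df1, sub_df2], axis=1)
    concat_df["Week"] = week
    weekly_dfs.append(concat_df)
# Stack the weekly pieces back into a single frame
df3 = pd.concat(weekly_dfs).reset_index(drop=True)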
The last reset of the index is optional but I recommend it anyways!
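A loop-free variant of the same per-week idea, as a sketch under assumptions: Week is an ordinary column in both frames, the second column is called Coeff2, and the data are toy values. The helper column name pos is hypothetical; it numbers the rows within each week so that the merge key (Week, pos) becomes unique.

```python
import pandas as pd

# Toy data standing in for the real weekly frames
df1 = pd.DataFrame({"Week": [1, 1, 2], "Coeff1": [0.1, 0.2, 0.3]})
df2 = pd.DataFrame({"Week": [1, 1, 1, 2, 2], "Coeff2": [1.0, 2.0, 3.0, 4.0, 5.0]})

# Number the rows inside each week: 0, 1, 2, ...
left = df1.assign(pos=df1.groupby("Week").cumcount())
right = df2.assign(pos=df2.groupby("Week").cumcount())

# (Week, pos) is now unique per frame, so the outer merge pairs rows one to one
df3 = (
    left.merge(right, on=["Week", "pos"], how="outer")
    .sort_values(["Week", "pos"])
    .drop(columns="pos")
    .reset_index(drop=True)
)
print(df3)
```

The result has as many rows per week as the longer of the two frames, with NaN where the shorter one runs out.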
Based on your last comment on the question, you may want to concatenate instead of merging the two data frames:
df3 = pd.concat([df1,df2], ignore_index=True, axis=1)
The resulting DataFrame should have 63265 rows and will need some work to get it to the required format (remove the added index columns, rename the remaining columns, etc.), but pd.concat should be a good start.
According to the pandas merge documentation, what you are looking for is a left join. However, the default is an inner join. You can change this by passing a different how argument:
df2.merge(df1, how='left', left_on='Week', right_on='Week')
Note that this keeps the rows of the bigger df and assigns NaN to them where the shorter df has no match.

assign value to pandas column based on data in another dataframe

I have 2 dataframes
df1
ID ID2 NUMBER
1 2 null
df2
ID ID2 NUMBER
1 2 1
1 2 2
1 2 3
So when merging df1 and df2 using ID and ID2 I get duplicated rows because df1 has 3 matches in df2. I'd like to assign a random number to df1 and use it for merging, so that I always get a 1-to-1 merge.
The problem is that my dataset is rather big, and sometimes I have only 1 row in df2 (so the merge works properly) and sometimes I have 10+ rows in df2. I'd like to assign a number to df1 using something like:
random.randint(1, len(df1[(df1.ID == 1) & (df1.ID2 == 2)]))
I think I found a solution; I'm posting it here so others can tell me if there is a better way.
import random
def select_random_row(grp):
    ID = grp.ID.iloc[0]
    ID2 = grp.ID2.iloc[0]
    # Random integer between 1 and the number of df1 rows with this (ID, ID2) pair
    return random.randint(1, len(df1[(df1.ID == ID) & (df1.ID2 == ID2)]))
df2['g'] = df2.groupby(['ID','ID2']).apply(select_random_row)
EDIT:
This is way too slow on a large dataset... I decided to just use drop_duplicates before merging and keep the first record. It isn't random like I wanted, but it is better than nothing.
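If a random pick is still wanted, one option that may be faster than apply is to sample one row per (ID, ID2) group from df2 before merging. This is a sketch, assuming pandas 1.1 or later (where groupby(...).sample is available); the toy frames drop df1's empty NUMBER column, and random_state is fixed only to make the run repeatable.

```python
import pandas as pd

# Toy frames modelled on the question; df1's empty NUMBER column is left out
df1 = pd.DataFrame({"ID": [1], "ID2": [2]})
df2 = pd.DataFrame({"ID": [1, 1, 1], "ID2": [2, 2, 2], "NUMBER": [1, 2, 3]})

# Keep exactly one randomly chosen row per (ID, ID2) pair
picked = df2.groupby(["ID", "ID2"]).sample(n=1, random_state=0)

# validate raises an error if the right-hand keys are still not unique
merged = df1.merge(picked, on=["ID", "ID2"], how="left", validate="many_to_one")
print(merged)
```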

Number of rows changes even after `pandas.merge` with `left` option

I am merging two data frames using pandas.merge. Even after specifying the how='left' option, I found that the merged data frame has more rows than the original. Why does this happen?
panel = pd.read_csv(file1, encoding ='cp932')
before_len = len(panel)
prof_2000 = pd.read_csv(file2, encoding ='cp932').drop_duplicates()
temp_2000 = pd.merge(panel, prof_2000, left_on='Candidate_u', right_on="name2", how="left")
after_len = len(temp_2000)
print(before_len, after_len)
> 12661 13915
This sounds like there is more than one row in the right frame under 'name2' that matches a key in the left frame. Using how='left' with pandas.DataFrame.merge() only means that:
left: use only keys from left frame
However, the actual number of rows in the result object is not necessarily going to be the same as the number of rows in the left object.
Example:
In [359]: df_1
Out[359]:
A B
0 a AAA
1 b BBA
2 c CCF
and then another DF that looks like this (notice that there is more than one entry for the key 'a'):
In [360]: df_3
Out[360]:
key value
0 a 1
1 a 2
2 b 3
3 a 4
If I merge these two on left.A, here's what happens:
In [361]: df_1.merge(df_3, how='left', left_on='A', right_on='key')
Out[361]:
A B key value
0 a AAA a 1.0
1 a AAA a 2.0
2 a AAA a 4.0
3 b BBA b 3.0
4 c CCF NaN NaN
This happened even though I merged with how='left': as you can see above, there was simply more than one row to merge for key 'a', so the resulting pd.DataFrame has more rows than the pd.DataFrame on the left.
I hope this helps!
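To check for this situation up front, something along these lines may help. It is a sketch that rebuilds the example frames above, lists the keys that occur more than once on the right, and uses merge's validate argument so that the merge raises an error instead of silently adding rows.

```python
import pandas as pd

# The example frames from above
df_1 = pd.DataFrame({"A": ["a", "b", "c"], "B": ["AAA", "BBA", "CCF"]})
df_3 = pd.DataFrame({"key": ["a", "a", "b", "a"], "value": [1, 2, 3, 4]})

# Keys that appear more than once in the right frame
dup_keys = df_3.loc[df_3["key"].duplicated(keep=False), "key"].unique()
print(dup_keys)

# validate="many_to_one" requires the right-hand keys to be unique
try:
    df_1.merge(df_3, how="left", left_on="A", right_on="key", validate="many_to_one")
    raised = False
except pd.errors.MergeError:
    raised = True
print(raised)
```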
The problem of rows multiplying after each merge() (of any type, e.g. 'inner' or 'left') is usually caused by duplicates in the keys, so we need to drop them first:
left_df.drop_duplicates(subset=left_key, inplace=True)
right_df.drop_duplicates(subset=right_key, inplace=True)
If you do not have any duplicates, as described in the answer above, you should double-check the names of the removed entries. In my case, I discovered that the names of the removed entries were inconsistent between df1 and df2, and I solved the problem by:
df1["col1"] = df2["col2"]

Added column to existing dataframe but entered all numbers as NaN

So I created two dataframes from existing CSV files, both consisting entirely of numbers. The second dataframe consists of an index from 0 to 8783 and one column of numbers, and I want to add it as a new column to the first dataframe, which has an index consisting of a month, day and hour. I tried using append, merge and concat, none of which worked, and then tried simply using:
x1GBaverage['Power'] = x2_cut
where x1GBaverage is the first dataframe and x2_cut is the second. When I did this it added x2_cut on properly but all the values were entered as NaN instead of the numerical values that they should be. How should I be approaching this?
x1GBaverage['Power'] = x2_cut.values
problem solved :)
The thing about pandas is that values are implicitly linked to their indices unless you deliberately specify that you only need the values to be transferred over.
If they're the same row counts and you just want to tack it on the end, the indexes either need to match, or you need to just pass the underlying values. In the example below, columns 3 and 5 are the index matching & value versions, and 4 is what you're running into now:
In [58]: df = pd.DataFrame(np.random.random((3,3)))
In [59]: df
Out[59]:
0 1 2
0 0.670812 0.500688 0.136661
1 0.185841 0.239175 0.542369
2 0.351280 0.451193 0.436108
In [61]: df2 = pd.DataFrame(np.random.random((3,1)))
In [62]: df2
Out[62]:
0
0 0.638216
1 0.477159
2 0.205981
In [64]: df[3] = df2
In [66]: df.index = ['a', 'b', 'c']
In [68]: df[4] = df2
In [70]: df[5] = df2.values
In [71]: df
Out[71]:
0 1 2 3 4 5
a 0.670812 0.500688 0.136661 0.638216 NaN 0.638216
b 0.185841 0.239175 0.542369 0.477159 NaN 0.477159
c 0.351280 0.451193 0.436108 0.205981 NaN 0.205981
If the row counts differ, you'll need to use df.merge and let it know which columns it should be using to join the two frames.
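For the differing-row-count case, a minimal sketch with invented data (the column names hour and load and all values are assumptions; only Power comes from the question): the shared column is passed as the join key, and left rows without a match get NaN.

```python
import pandas as pd

# Hypothetical frames: four hourly rows on the left, only two readings on the right
left = pd.DataFrame({"hour": [0, 1, 2, 3], "load": [10.0, 11.0, 12.0, 13.0]})
right = pd.DataFrame({"hour": [1, 3], "Power": [5.0, 7.0]})

# Join on the shared key column; unmatched left rows get NaN in 'Power'
out = left.merge(right, on="hour", how="left")
print(out)
```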

Apply function to pandas dataframe that returns multiple rows

I would like to apply a function to a pandas DataFrame that splits some of the rows into two. So for example, I may have this as input:
df = pd.DataFrame([{'one': 3, 'two': 'a'}, {'one': 5, 'two': 'b,c'}], index=['i1', 'i2'])
one two
i1 3 a
i2 5 b,c
And I want something like this as output:
one two
i1 3 a
i2_0 5 b
i2_1 5 c
My hope was that I could just use apply() on the data frame, calling a function that returns a dataframe with 1 or more rows itself, which would then get merged back together. However, this does not seem to work at all. Here is a test case where I am just trying to duplicate each row:
dfa = df.apply(lambda s: pd.DataFrame([s.to_dict(), s.to_dict()]), axis=1)
one two
i1 one two
i2 one two
So if I return a DataFrame, the column names of that DataFrame seem to become the contents of the rows. This is obviously not what I want.
There is another question on here that was solved by using .groupby(), however I don't think this applies to my case since I don't actually want to group by anything.
What is the correct way to do this?
You have a messed up database (comma separated string where you should have separate columns). We first fix this:
df2 = pd.concat([df['one'], pd.DataFrame(df.two.str.split(',').tolist(), index=df.index)], axis=1)
Which gives us something neater:
In[126]: df2
Out[126]:
one 0 1
i1 3 a None
i2 5 b c
Now, we can just do
In[125]: df2.set_index('one').unstack().dropna()
Out[125]:
one
0 3 a
5 b
1 5 c
Adjusting the index (if desired) is trivial and left to the reader as an exercise.
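As an alternative sketch, assuming pandas 0.25 or later (where DataFrame.explode is available), the split values can be expanded into rows directly, and the index suffixes from the question can be rebuilt afterwards. The names out, pos and dup are only illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [{"one": 3, "two": "a"}, {"one": 5, "two": "b,c"}], index=["i1", "i2"]
)

# Turn "b,c" into ["b", "c"], then give each list element its own row
out = df.assign(two=df["two"].str.split(",")).explode("two")

# Position of each row within its original index label, and which labels repeat
pos = out.groupby(level=0).cumcount()
dup = out.index.duplicated(keep=False)

# Append _0, _1, ... only to labels that were split into several rows
out.index = [f"{i}_{n}" if d else i for i, n, d in zip(out.index, pos, dup)]
print(out)
```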
