I have 2 dataframes
df1
ID ID2 NUMBER
1 2 null
df2
ID ID2 NUMBER
1 2 1
1 2 2
1 2 3
So when I merge df1 and df2 using ID and ID2, I get duplicated rows because df1 has 3 matches in df2. I'd like to assign a random number to df1 and use it for merging, so that I always get a 1-to-1 merge.
The problem is that my dataset is rather big, and sometimes I have only 1 row in df2 (so the merge works properly) and sometimes I have 10+ rows in df2. I'd like to assign a number to df1 using:
random.randint(1, len(df1[(df1.ID == 1) & (df1.ID2 == 2)]))
I think I found a solution I'm posting it here so others can tell me if there is a better way.
import random
import pandas as pd

def select_random_row(grp):
    # one draw per (ID, ID2) group, broadcast to every row of the group
    ID, ID2 = grp.ID.iloc[0], grp.ID2.iloc[0]
    n = len(df1[(df1.ID == ID) & (df1.ID2 == ID2)])
    return pd.Series(random.randint(1, n), index=grp.index)

df2['g'] = df2.groupby(['ID', 'ID2'], group_keys=False).apply(select_random_row)
EDIT:
This is way too slow on a large dataset... I decided to just use drop_duplicates before merging and keep the first record. It isn't random like I wanted, but it is better than nothing.
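For reference, a minimal sketch of that deduplicate-then-merge workaround (assuming the ID/ID2 column names from the example above):
# keep only the first df2 row per (ID, ID2) key, then merge 1-to-1
df2_first = df2.drop_duplicates(subset=['ID', 'ID2'], keep='first')
merged = df1.merge(df2_first, on=['ID', 'ID2'], how='left')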
I have a data frame with measurements for several groups of participants, and I am doing some calculations for each group. I want to add a column to a big data frame (all participants) from secondary data frames (each with a partial list of participants).
When I merge a couple of times (merging a new data frame into the existing one), it creates a duplicate of the column instead of a single column.
As the sizes of the dataframes are different, I cannot compare them directly.
I tried
# df1 - main, bigger dataframe; df2 - smaller dataframe containing a group of df1
for i in range(len(df1)):
    # check the indices to place the data at the correct participant:
    if df1.index[i] not in df2['index'].values:
        pass
    else:
        df1['rate'][i] = list(df2['rate'][df2['index'] == i])
It does not work properly though. Can you please help with the correct way of assembling the column?
Update: where the index of the initial dataframe matches the "index" column of the calculation dataframe, copy the rate value from the calculation into the main df.
main dataframe df1

index  rate
1      0
2      0
3      0
4      0
5      0
6      0

dataframe with calculated values

index  rate
1      value
4      value
6      value

output df

index  rate
1      value
2      0
3      0
4      value
5      0
6      value
Try this – using .join() to merge dataframes on their indices and combining two columns using .combine_first():
df = df1.join(df2, lsuffix="_df1", rsuffix="_df2")
df["rate"] = df["rate_df2"].combine_first(df["rate_df1"])
EDIT:
This assumes both dataframes use a matching index. If that is not the case for df2, run this first:
df2 = df2.set_index('index')
I'm normally OK on the joining and appending front, but this one has got me stumped.
I've got one dataframe with only one row in it. I have another with multiple rows. I want to append the value from one of the columns of my first dataframe to every row of my second.
df1:

id  Value
1   word

df2:

id  data
1   a
2   b
3   c

Output I'm seeking:

df2

id  data  Value
1   a     word
2   b     word
3   c     word
I figured that this was along the right lines, but it listed out NaN for all rows:
df2 = df2.append(df1[df1['Value'] == 1])
I guess I could just join on the id value and then copy the value to all rows, but I assumed there was a cleaner way to do this.
Thanks in advance for any help you can provide!
Just get the first element of the Value column of df1 and assign it to a Value column of df2:
df2['Value'] = df1['Value'].iloc[0]
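Equivalently, if you prefer not to mutate df2 in place, a sketch using the same column names:
# assign returns a new frame with the scalar broadcast to every row
df2 = df2.assign(Value=df1['Value'].iloc[0])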
I am trying to merge two weekly DataFrames, which are made up of one coefficient column each but have different lengths.
Could I please know how to merge them, maintaining the 'Week' indexing?
[df1]
Week Coeff1
1 -0.456662
1 -0.533774
1 -0.432871
1 -0.144993
1 -0.553376
... ...
53 -0.501221
53 -0.025225
53 1.529864
53 0.044380
53 -0.501221
[16713 rows x 1 columns]
[df2]
Week Coeff
1 0.571707
1 0.086152
1 0.824832
1 -0.037042
1 1.167451
... ...
53 -0.379374
53 1.076622
53 -0.547435
53 -0.638206
53 0.067848
[63265 rows x 1 columns]
I've tried this code:
df3 = pd.merge(df1, df2, how='inner', on='Week')
df3 = df3.drop_duplicates()
df3
But it gave me a new df (df3) with 13386431 rows × 2 columns
Desired outcome: a new df which has 3 columns (Week, Coeff1, Coeff2); as df2 is longer, I expect some NaNs in Coeff1 to fill the gaps.
I assume your output should look somewhat like this:
Week  Coeff1     Coeff2
1     -0.456662  0.571707
1     -0.533774  0.086152
1     -0.432871  0.824832
2     3          3
2     NaN        3
Don't mind the actual numbers though.
The problem is you won't achieve that with a join on Week, neither left nor inner, because the Week index is not unique.
So, on a left join, pandas will join all the Coeff2 values where df2.Week == 1 onto every single row of df1 where df1.Week == 1. That is why you get these millions of rows.
I will try and give you a workaround later, but maybe this helps you to think about this problem from another perspective!
Now is later:
What you actually want to do is concatenate the DataFrames "per week".
You can achieve that by iterating over every week, building a weekly subset that concatenates df1[week] and df2[week] on axis=1, and then concatenating all these subsets on axis=0 afterwards:
weekly_dfs = []
for week in df1.Week.unique():
    sub_df1 = df1.loc[df1.Week == week, "Coeff1"].reset_index(drop=True)
    # df2's column is called "Coeff" in the question; rename it to "Coeff2" here
    sub_df2 = df2.loc[df2.Week == week, "Coeff"].rename("Coeff2").reset_index(drop=True)
    concat_df = pd.concat([sub_df1, sub_df2], axis=1)
    concat_df["Week"] = week
    weekly_dfs.append(concat_df)
df3 = pd.concat(weekly_dfs).reset_index(drop=True)
The last reset of the index is optional, but I recommend it anyway!
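A loop-free alternative is also possible (a sketch, assuming the column names above): number the rows within each week with cumcount and use that as a second join key, so the frames merge 1-to-1 and the shorter one is padded with NaN.
# within-week row number as a secondary join key
df1["n"] = df1.groupby("Week").cumcount()
df2["n"] = df2.groupby("Week").cumcount()
df3 = (df1.merge(df2.rename(columns={"Coeff": "Coeff2"}),
                 on=["Week", "n"], how="outer")
          .sort_values(["Week", "n"])
          .drop(columns="n"))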
Based on your last comment on the question, you may want to concatenate instead of merging the two data frames:
df3 = pd.concat([df1,df2], ignore_index=True, axis=1)
The resulting DataFrame should have 63265 rows and will need some work to get it to the required format (remove the added index columns, rename the remaining columns, etc.), but pd.concat should be a good start.
According to pandas' merge documentation, what you are looking for is a left join. However, the default is an inner join; you can change this by passing a different how argument:
df2.merge(df1, how='left', on='Week')
Note that this keeps all the rows of the bigger df and assigns NaN where the shorter df has no match.
How can we use coalesce with multiple data frames?
columns_List = Emp_Id, Emp_Name, Dept_Id...
I have two data frames used in a Python script: df1[Columns_List] and df2[columns_List]. Both dataframes have the same columns, but with different values.
How can I use coalesce so that, let's say, when Emp_Name is null in df1[Columns_List], I pick Emp_Name from df2[Columns_list]?
I am trying to create an output CSV file.
Apologies if the framing of my question is wrong.
Please find below sample data.
For Dataframe1 -- df1[Columns_List], please find the output below:
EmpID,Emp_Name,Dept_id,DeptName
1,,1,
2,,2,
For Dataframe2 -- df2[Columns_List], please find the output below:
EmpID,Emp_Name,Dept_id,DeptName
1,XXXXX,1,Science
2,YYYYY,2,Maths
My source is a JSON file. Once I have parsed the data with Python, I use 2 dataframes in the same script. In data frame 1 (df1) I have Emp_Name & Dept_Name as null; in that case I want to pick the data from dataframe 2 (df2).
In the above example I have provided a few columns, but I may have n columns; the column ordering and column names will always be the same. I am trying to achieve this in such a way that if any column value from df1 is null, I pick the value from df2.
Is that possible? Please help me with any suggestion...
You can use pandas.DataFrame.combine. This method does what you need: it builds a dataframe taking elements from two dataframes according to a custom function.
You can then write a custom function which picks the element from dataframe one unless that is null, in which case the element is taken from dataframe two.
Consider the two following dataframes. I built them according to your examples, but with a small difference to emphasize that only null values will be replaced:
import numpy as np
import pandas as pd

columnlist = ["EmpID", "Emp_Name", "Dept_id", "DeptName"]
df1 = pd.DataFrame([[1, None, 1, np.nan], [2, np.nan, 2, None]], columns=columnlist)
df2 = pd.DataFrame([[1, "XXX", 2, "Science"], [2, "YYY", 3, "Math"]], columns=columnlist)
They are:
df1
EmpID Emp_Name Dept_id DeptName
0 1 NaN 1 NaN
1 2 NaN 2 NaN
df2
EmpID Emp_Name Dept_id DeptName
0 1 XXX 1 Science
1 2 YYY 3 Math
What you need to do is:
ddf = df1.combine(df2, lambda ss, rep_ss : pd.Series([r if pd.isna(x) else x for x, r in zip(ss, rep_ss)]))
to get ddf:
ddf
EmpID Emp_Name Dept_id DeptName
0 1 XXX 1 Science
1 2 YYY 2 Math
As you can see, only Null values in df1 have been replaced with the corresponding values in df2.
EDIT: A bit deeper explanation
Since I've been asked in the comments, let me give a bit of explanation more on the solution:
ddf = df1.combine(df2, lambda ss, rep_ss : pd.Series([r if pd.isna(x) else x for x, r in zip(ss, rep_ss)]))
It is a bit compact, but there is nothing more to it than some basic Python techniques like list comprehensions, plus the use of pandas.DataFrame.combine. The pandas method is detailed in the docs I linked above. It compares the two dataframes column by column: the columns are passed to a custom function, which must return a pandas.Series. This Series becomes a column in the returned dataframe.
In this case, the custom function is a lambda, which uses a list comprehension to loop over the pairs of elements (one from each column) and pick only one element of the pair (the first if not null, otherwise the second).
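As a side note: when the rule is simply "take df1 unless it is null", pandas.DataFrame.combine_first gives the same result more compactly (a sketch using the frames above):
# combine_first keeps df1's values and fills its nulls from df2,
# aligning on both index and columns
ddf = df1.combine_first(df2)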
You can use a mask to get null values and replace those. The best part is that you don't have to eyeball anything; the function will find what to replace for you.
You can also adjust the pd.DataFrame.select_dtypes() function to suit your needs, or just go through multiple dtypes with appropriate conversion and detection measures being used.
import pandas as pd

ddict1 = {
    'EmpID': [1, 2],
    'Emp_Name': ['', ''],
    'Dept_id': [1, 2],
    'DeptName': ['', ''],
}
ddict2 = {
    'EmpID': [1, 2],
    'Emp_Name': ['XXXXX', 'YYYYY'],
    'Dept_id': [1, 2],
    'DeptName': ['Science', 'Maths'],
}
df1 = pd.DataFrame(ddict1)
df2 = pd.DataFrame(ddict2)

def replace_df_values(df_A, df_B):
    ## Select object dtypes
    for i in df_A.select_dtypes(include=['object']):
        ### Check whether the column contains any empty values
        if (df_A[i] == '').any():
            ### Create mask for zero-length values (or null, your choice)
            mask = df_A[i] == ''
            ### Replace on a 1-for-1 basis using .loc[]
            df_A.loc[mask, i] = df_B.loc[mask, i]

### Pass dataframes in both orders to cover both scenarios
replace_df_values(df1, df2)
replace_df_values(df2, df1)
Initial values for df1:
EmpID Emp_Name Dept_id DeptName
0 1 1
1 2 2
Output for df1 after running function:
EmpID Emp_Name Dept_id DeptName
0 1 XXXXX 1 Science
1 2 YYYYY 2 Maths
I replicated your dataframes:
# df1
EmpID Emp_Name Dept_id DeptName
0 1 1
1 2 2
# df2
EmpID Emp_Name Dept_id DeptName
0 1 XXXXX 1 Science
1 2 YYYYY 2 Maths
If you want to replace missing values (NaN) from df1.column with existing values from df2.column, you could use .fillna(). For example:
df1['Emp_Name'] = df1['Emp_Name'].fillna(df2['Emp_Name'])
# df1
EmpID Emp_Name Dept_id DeptName
0 1 XXXXX 1
1 2 YYYYY 2
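Note that .fillna() only fills actual NaN values. If the blanks are empty strings, as in the sample CSV in the question, convert them first (a sketch):
import numpy as np
# turn empty strings into NaN so that fillna can see them
df1 = df1.replace('', np.nan)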
If you want to replace all values from a given column with the values from the same column of another dataframe, you could use list comprehension.
df1['DeptName'] = list(df2['DeptName'])
EmpID Emp_Name Dept_id DeptName
0 1 XXXXX 1 Science
1 2 YYYYY 2 Maths
I'm sure there's a better way to do this, but I hope this helps!
I searched the archive but did not find what I wanted (probably because I don't really know what keywords to use).
Here is my problem: I have a bunch of dataframes that need to be merged; I also want to update the values of a subset of columns with the sum across the dataframes.
For example, I have two dataframes, df1 and df2:
df1 = pd.DataFrame([[1, 2], [1, 3], [0, 4]], columns=["a", "b"])
df2 = pd.DataFrame([[1, 5], [0, 6]], columns=["a", "b"], index=[0, 2])

   a  b        a  b
0  1  2     0  1  5
1  1  3     2  0  6
2  0  4
After merging, I'd like to have column 'b' updated with the sum of the matched records, while column 'a' should stay as it is in df1 (or df2, I don't really care):

   a  b
0  1  7
1  1  3
2  0  10
Now, expand this to merging three or more data frames.
Are there straightforward, built-in tricks to do this? Or do I need to process them one by one, line by line?
===== Edit / Clarification =====
In the real-world example, each data frame may contain indexes that are not in the other data frames. In this case, the merged data frame should have all of them and update the shared entries/indexes with the sum (or some other operation).
Only a partial, not complete, solution yet, but the main point is solved:
df3 = pd.concat([df1, df2], join="outer", axis=1)
df4 = df3.b.sum(axis=1)
df3 will have two 'a' columns and two 'b' columns. The sum() on df3.b adds the two 'b' columns and ignores NaNs, so df4 now holds column 'b' as the sum of df1's and df2's 'b' columns, over all the indexes.
This did not solve column 'a' though. In my real case there are quite a few NaNs in df3.a, while the other values in df3.a should all be the same. I haven't found a straightforward way to build a column 'a' in df4 filled with the non-NaN values. For now I am searching for a "count" function to get the occurrence of elements in the rows of df3.a (imagine it has a few dozen 'a' columns).
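For what it's worth, a sketch that also recovers column 'a', under the assumption that all non-NaN 'a' values in a row agree: backfill across the duplicate 'a' columns and take the first one, and sum the 'b' columns as before.
df3 = pd.concat([df1, df2], join="outer", axis=1)
# bfill(axis=1) pulls the first non-NaN value of each row into the first column
out = pd.DataFrame({
    "a": df3["a"].bfill(axis=1).iloc[:, 0],
    "b": df3["b"].sum(axis=1),
})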