I am trying to merge several DataFrames on a common column. This happens in a loop, and because the original DataFrame may not contain all of the columns, an outer merge is necessary. However, when I do this across several DataFrames, columns get duplicated with the suffixes _x and _y. What I am looking for is a single DataFrame where the data is filled in and columns are added only if they did not previously exist.
df1=pd.DataFrame({'Company Name':['A','B','C','D'],'Data1':[1,34,23,66],'Data2':[13,54,5354,443]})
Company Name Data1 Data2
0 A 1 13
1 B 34 54
2 C 23 5354
3 D 66 443
A second DataFrame with additional information for some of the companies:
df2 = pd.DataFrame({'Company Name':['A','B'],'Address': ['str1', 'str2'], 'Phone': ['str1a', 'str2a']})
Company Name Address Phone
0 A str1 str1a
1 B str2 str2a
If I combine just these two, they merge successfully into one DataFrame using the on keyword:
df1=pd.merge(df1,df2, on='Company Name', how='outer')
Company Name Data1 Data2 Address Phone
0 A 1 13 str1 str1a
1 B 34 54 str2 str2a
2 C 23 5354 NaN NaN
3 D 66 443 NaN NaN
However, if I run this same command again in a loop, or merge with another DataFrame containing other company information, I end up with duplicate columns similar to the following:
df1=pd.merge(df1,pd.DataFrame({'Company Name':['C'],'Address':['str3'],'Phone':['str3a']}), on='Company Name', how='outer')
Company Name Data1 Data2 Address_x Phone_x Address_y Phone_y
0 A 1 13 str1 str1a NaN NaN
1 B 34 54 str2 str2a NaN NaN
2 C 23 5354 NaN NaN str3 str3a
3 D 66 443 NaN NaN NaN NaN
What I really want is one DataFrame with the same columns, just with any missing data populated:
Company Name Data1 Data2 Address Phone
0 A 1 13 str1 str1a
1 B 34 54 str2 str2a
2 C 23 5354 str3 str3a
3 D 66 443 NaN NaN
Thanks in advance. I have reviewed the previous questions asked here on duplicate columns, as well as the pandas documentation, without any progress.
Since you want to merge one DataFrame at a time in a for loop, here is a way to do it that works whether or not each new DataFrame brings new company names or new columns:
df1 = pd.DataFrame({'Company Name':['A','B','C','D'],
'Data1':[1,34,23,66],'Data2':[13,54,5354,443]})
list_dfo = [pd.DataFrame({'Company Name':['A','B'],
'Address': ['str1', 'str2'], 'Phone': ['str1a', 'str2a']}),
pd.DataFrame({'Company Name':['C'],'Address':['str3'],'Phone':['str3a']})]
for df_other in list_dfo:
    # outer-merge on the common columns, then collapse each company's rows,
    # keeping the first valid value found in every column
    df1 = pd.merge(df1, df_other, how='outer').groupby('Company Name').first().reset_index()
    # and other code
At the end of this example:
print(df1)
Company Name Data1 Data2 Address Phone
0 A 1.0 13.0 str1 str1a
1 B 34.0 54.0 str2 str2a
2 C 23.0 5354.0 str3 str3a
3 D 66.0 443.0 NaN NaN
Instead of first you can use last, which keeps the last valid value in each column per group rather than the first. Which one you need depends on whether you want the data from df1 or the data from df_other when both are available. In the example above it does not change anything, but in the following case you will see the difference:
# company A has a new address
df4 = pd.DataFrame({'Company Name':['A'],'Address':['new_str1']})
# first keeps the value from df1
print(pd.merge(df1, df4, how='outer').groupby('Company Name')
      .first().reset_index())
Out[21]:
Company Name Data1 Data2 Address Phone
0 A 1.0 13.0 str1 str1a #address is str1 from df1
1 B 34.0 54.0 str2 str2a
2 C 23.0 5354.0 str3 str3a
3 D 66.0 443.0 NaN NaN
# while last keeps the value from df4
print(pd.merge(df1, df4, how='outer').groupby('Company Name')
      .last().reset_index())
Out[22]:
Company Name Data1 Data2 Address Phone
0 A 1.0 13.0 new_str1 str1a #address is new_str1 from df4
1 B 34.0 54.0 str2 str2a
2 C 23.0 5354.0 str3 str3a
3 D 66.0 443.0 NaN NaN
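For comparison, the same looped fill can be written with combine_first, which (like first above) prefers the values already in df1. This is just a sketch, assuming df1 and list_dfo as defined above:

for df_other in list_dfo:
    # keep df1's non-null values; fill gaps, new columns and new companies from df_other
    df1 = (df1.set_index('Company Name')
              .combine_first(df_other.set_index('Company Name'))
              .reset_index())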
IIUC, you might try this:
def update_df(df1, df_next):
    # df.update needs matching indexes, so make 'Company Name' the index (only once for df1)
    if 'Company Name' in list(df1):
        df1.set_index('Company Name', inplace=True)
    df_next.set_index('Company Name', inplace=True)
    # add any columns that are new in df_next, filled with the column name as a placeholder
    new_cols = [item for item in set(df_next) if item not in set(df1)]
    for col in new_cols:
        df1[col] = col
    # overwrite df1 with the non-NaN values from df_next
    df1.update(df_next)

update_df(df1, df2)
update_df(df1, df3)
df1
Data1 Data2 Address Phone
Company Name
A 1 13 str1 str1a
B 34 54 str2 str2a
C 23 5354 str3 str3a
D 66 443 Address Phone
note1: to be able to use df.update you have to set_index to 'Company Name'; the function checks this for df1 once, and on later calls it skips that step. Each df passed in gets its index set to 'Company Name'.
note2: next, the function checks whether there are new columns, adds them, and fills them with the column name as a placeholder (you might want to change that).
note3: lastly, df.update overwrites df1 with the values you need from the new df.
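If you would rather have NaN in the new columns (so row D above gets NaN instead of the literal strings Address and Phone), a small variant of the same idea could look like this; a sketch along the lines of the function above, not the original code:

import numpy as np
import pandas as pd

def update_df_nan(df1, df_next):
    # same as update_df, but new columns start out as NaN
    if 'Company Name' in df1.columns:
        df1.set_index('Company Name', inplace=True)
    df_next = df_next.set_index('Company Name')   # df_next is left unmodified here
    for col in df_next.columns.difference(df1.columns):
        df1[col] = np.nan                         # NaN placeholder instead of the column name
    df1.update(df_next)                           # overwrite with non-NaN values from df_next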
Related
I have 2 dataframes as follows:
df1 = pd.DataFrame({'Barcode':[1,2,3,4],'Store':['s1','s2','s3','s4']})
df2 = pd.DataFrame({'Date':['2020-10-10','2020-10-09','2020-10-08','2020-10-07','2020-10-06']})
How can I get a dataframe which has df1 as rows and df2's dates as columns, with the new cells filled with null values? Something like below:
   Barcode Store  2020-10-10  2020-10-09  2020-10-08  2020-10-07  2020-10-06
0        1    s1         NaN         NaN         NaN         NaN         NaN
1        2    s2         NaN         NaN         NaN         NaN         NaN
2        3    s3         NaN         NaN         NaN         NaN         NaN
3        4    s4         NaN         NaN         NaN         NaN         NaN
And the final step is to fill the cells with a join on another table (df4):
df4 = pd.DataFrame({'Barcode':[1,2,3,4],'Store':['s1','s2','s3','s4'],'2020-10-10':[1,2,5,np.nan],'2020-10-09':[np.nan,2,3,0],'2020-10-08':[0,0,2,3],'2020-10-07':[np.nan,1,np.nan,2]})
Final df should look like below:
   Barcode Store  2020-10-10  2020-10-09  2020-10-08  2020-10-07  2020-10-06
0        1    s1         1.0         NaN         0.0         NaN         NaN
1        2    s2         2.0         2.0         0.0         1.0         NaN
2        3    s3         5.0         3.0         2.0         NaN         NaN
3        4    s4         NaN         0.0         3.0         2.0         NaN
Any help is truly appreciated.
I hope I've understood your question right. You have 3 dataframes:
df1 = pd.DataFrame({'Barcode':[1,2,3,4],'Store':['s1','s2','s3','s4']})
df2 = pd.DataFrame({'Date':['2020-10-10','2020-10-09','2020-10-08','2020-10-07','2020-10-06']})
df4 = pd.DataFrame({'Barcode':[1,2,3,4],'Store':['s1','s2','s3','s4'],'2020-10-10':[1,2,5,np.nan],'2020-10-09':[np.nan,2,3,0],'2020-10-08':[0,0,2,3],'2020-10-07':[np.nan,1,np.nan,2]})
Then:
df1 = pd.DataFrame(df1, columns= df1.columns.tolist() + df2['Date'].tolist())
df1 = df1.set_index('Barcode')
df4 = df4.set_index('Barcode')
print(df1.fillna(df4))
Prints:
Store 2020-10-10 2020-10-09 2020-10-08 2020-10-07 2020-10-06
Barcode
1 s1 1.0 NaN 0.0 NaN NaN
2 s2 2.0 2.0 0.0 1.0 NaN
3 s3 5.0 3.0 2.0 NaN NaN
4 s4 NaN 0.0 3.0 2.0 NaN
First create a temporary DataFrame:
wrk = pd.DataFrame('', index=pd.MultiIndex.from_frame(df1),
columns=df2.Date.rename(None)); wrk
It is filled with empty strings and has the required column names taken from df2. For now, Barcode and Store are index columns; this arrangement will be needed soon.
Then update it (in-place) with the data from df4:
wrk.update(df4.set_index(['Barcode', 'Store']))
And the last step is:
result = wrk.reset_index()
The result is:
Barcode Store 2020-10-10 2020-10-09 2020-10-08 2020-10-07 2020-10-06
0 1 s1 1 0
1 2 s2 2 2 0 1
2 3 s3 5 3 2
3 4 s4 0 3 2
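Putting the three steps together, a self-contained version of this approach might look like the following sketch (the frames are copied from the question):

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Barcode': [1, 2, 3, 4], 'Store': ['s1', 's2', 's3', 's4']})
df2 = pd.DataFrame({'Date': ['2020-10-10', '2020-10-09', '2020-10-08',
                             '2020-10-07', '2020-10-06']})
df4 = pd.DataFrame({'Barcode': [1, 2, 3, 4], 'Store': ['s1', 's2', 's3', 's4'],
                    '2020-10-10': [1, 2, 5, np.nan], '2020-10-09': [np.nan, 2, 3, 0],
                    '2020-10-08': [0, 0, 2, 3], '2020-10-07': [np.nan, 1, np.nan, 2]})

# empty grid: (Barcode, Store) as index, the dates from df2 as columns
wrk = pd.DataFrame('', index=pd.MultiIndex.from_frame(df1),
                   columns=df2.Date.rename(None))
wrk.update(df4.set_index(['Barcode', 'Store']))   # fill the cells that exist in df4
result = wrk.reset_index()
print(result)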
# alternative: add the date columns filled with NaN, then fill them from df4
for item in df2.Date.tolist():
    df1[item] = np.nan
dfinal = df1.fillna(df4)   # fillna aligns on index and column labels
dfinal = dfinal.set_index('Barcode')
display(dfinal)
I'm sure there is an elegant solution for this, but I cannot find one. In a pandas dataframe, how do I remove all duplicate values in a column while ignoring one value?
repost_of_post_id title
0 7139471603 Man with an RV needs a place to park for a week
1 6688293563 Land for lease
2 None 2B/1.5B, Dishwasher, In Lancaster
3 None Looking For Convenience? Check Out Cordova Par...
4 None 2/bd 2/ba, Three Sparkling Swimming Pools, Sit...
5 None 1 bedroom w/Closet is bathrooms in Select Unit...
6 None Controlled Access/Gated, Availability 24 Hours...
7 None Beautiful 3 Bdrm 2 & 1/2 Bth Home For Rent
8 7143099582 Need Help Getting Approved?
9 None *MOVE IN READY APT* REQUEST TOUR TODAY!
What I want is to keep all None values in repost_of_post_id, but omit any duplicates of the numerical values, for example if there are duplicates of 7139471603 in the dataframe.
[UPDATE]
I got the desired outcome using this script, but I would like to accomplish this in a one-liner, if possible.
# remove duplicate repost id if present (i.e. don't remove rows where repost_of_post_id value is "None")
# ca_housing is the original dataframe that needs to be cleaned
ca_housing_repost_none = ca_housing.loc[ca_housing['repost_of_post_id'] == "None"]
ca_housing_repost_not_none = ca_housing.loc[ca_housing['repost_of_post_id'] != "None"]
ca_housing_repost_not_none_unique = ca_housing_repost_not_none.drop_duplicates(subset="repost_of_post_id")
ca_housing_unique = ca_housing_repost_none.append(ca_housing_repost_not_none_unique)
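For what it's worth, the same outcome can be expressed as a single boolean mask. A minimal sketch, assuming repost_of_post_id holds the literal string "None" (the small frame below is hypothetical):

import pandas as pd

ca_housing = pd.DataFrame({
    'repost_of_post_id': ['7139471603', 'None', '7139471603', 'None'],
    'title': ['t1', 't2', 't3', 't4'],
})

# keep every "None" row, plus only the first occurrence of each real id
ca_housing_unique = ca_housing[
    (ca_housing['repost_of_post_id'] == "None")
    | ~ca_housing.duplicated(subset='repost_of_post_id')
]
print(ca_housing_unique)  # the second 7139471603 row is dropped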
You could try dropping the None values, then detecting duplicates, and then filtering them out of the original DataFrame.
In [1]: import pandas as pd
...: from string import ascii_lowercase
...:
...: ids = [1,2,3,None,None, None, 2,3, None, None,4,5]
...: df = pd.DataFrame({'id': ids, 'title': list(ascii_lowercase[:len(ids)])})
...: print(df)
...:
...: print(df[~df.index.isin(df.id.dropna().duplicated().loc[lambda x: x].index)])
id title
0 1.0 a
1 2.0 b
2 3.0 c
3 NaN d
4 NaN e
5 NaN f
6 2.0 g
7 3.0 h
8 NaN i
9 NaN j
10 4.0 k
11 5.0 l
id title
0 1.0 a
1 2.0 b
2 3.0 c
3 NaN d
4 NaN e
5 NaN f
8 NaN i
9 NaN j
10 4.0 k
11 5.0 l
You could use drop_duplicates and merge the NaN rows back in, as follows:
df_cleaned = df.drop_duplicates('repost_of_post_id', keep='first').merge(df[df.repost_of_post_id.isnull()], how='outer')
This will keep the first occurrence of each duplicated id and all NaN rows.
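As a quick sanity check, here is that one-liner run on a small hypothetical frame (note this answer assumes the missing ids are NaN rather than the string "None"):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'repost_of_post_id': [1.0, 2.0, np.nan, 2.0, np.nan],
    'title': ['a', 'b', 'c', 'd', 'e'],
})

df_cleaned = (df.drop_duplicates('repost_of_post_id', keep='first')
                .merge(df[df.repost_of_post_id.isnull()], how='outer'))
print(df_cleaned)  # row (2.0, 'd') is gone; both NaN rows remain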
I want to convert this table
  movie   name  rating
0   thg   John     3.0
1   thg  James     4.0
2   mol    NaN     5.0
3   mol    NaN     NaN
4   lob    NaN     NaN
into the following tables:
df1
movie name rating
0 thg John 3.0
1 thg James 4.0
df2
movie rating
2 mol 5.0
df3
movie
3 mol
4 lob
where each dataframe has no NaN values. Also, please tell me the method if I need to separate on blank values instead of NaN.
I think that a new target DataFrame should start not only when the number of NaN values changes (compared to the previous row), but also when this number is the same but the NaN values are in different columns. So I propose the following formula:
dfs = [g.dropna(how='all',axis=1) for _,g in
df.groupby(df.isna().ne(df.isna().shift()).any(axis=1).cumsum())]
You can print the partial DataFrames (any number of them) by running:
for n, grp in enumerate(dfs):
    print(f'\ndf No {n}:\n{grp}')
The advantage of my solution over the other becomes obvious when you add
to the source DataFrame another row containing:
5 NaN NaN 3.0
It also contains 1 non-null value (like the two previous rows), so the other solution will treat all these rows as one partial DataFrame containing:
movie rating
3 mol NaN
4 lob NaN
5 NaN 3.0
As you can see, it still contains NaN values, whereas my solution divides these rows into 2 separate DataFrames, without any NaN.
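To make that concrete, here is a self-contained sketch of the extended example (the frame simply reproduces the question's table plus the extra row 5):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'movie':  ['thg', 'thg', 'mol', 'mol', 'lob', np.nan],
    'name':   ['John', 'James', np.nan, np.nan, np.nan, np.nan],
    'rating': [3.0, 4.0, 5.0, np.nan, np.nan, 3.0],
})

# a new group starts whenever a row's NaN pattern differs from the previous row's
dfs = [g.dropna(how='all', axis=1) for _, g in
       df.groupby(df.isna().ne(df.isna().shift()).any(axis=1).cumsum())]

for n, grp in enumerate(dfs):
    print(f'\ndf No {n}:\n{grp}')   # rows 3-4 and row 5 come out as separate frames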
Create a list of dfs with a groupby and dropna:
dfs = [g.dropna(how='all',axis=1) for _,g in df.groupby(df.isna().sum(axis=1))]
print(dfs[0],'\n\n',dfs[1],'\n\n',dfs[2])
Or dict:
d = {f"df{e+1}": g[1].dropna(how='all',axis=1)
for e,g in enumerate(df.groupby(df.isna().sum(1)))}
print(d['df1'],'\n\n',d['df2'],'\n\n',d['df3']) #read the keys of d
movie name rating
0 thg John 3.0
1 thg James 4.0
movie rating
2 mol 5.0
movie
3 mol
4 lob
quantity:
a b c
3 1 nan
3 2 8
7 5 9
4 8 nan
price
34
I have two dataframes, quantity and price, and I want to join the last row of the quantity dataframe where c is not NaN onto price.
I wrote this query but didn't get the desired output:
price = pd.concat(price,quantity["a","b","c"].tail(1).isnotnull())
What I want is something like:
price a b c
34 7 5 9
If your dfs are these:
df = pd.DataFrame([[3,1,np.nan], [3,2,8], [7,5,9], [4,8,np.nan]], columns=['a','b','c'])
df2 = pd.DataFrame([34], columns=['price'])
You can do it this way:
final_df = pd.concat([df.dropna(subset=['c']).tail(1).reset_index(drop=True), df2], axis=1)
Output:
a b c price
0 7 5 9.0 34
I believe you need to remove the missing values and then select the last row; the double [] returns a one-row DataFrame:
df=pd.concat([price.reset_index(drop=True),
quantity[["a","b","c"]].dropna(subset=['c']).iloc[[-1]].reset_index(drop=True)],
axis=1)
print (df)
price a b c
0 34 7 5 9.0
Detail:
print (quantity[["a","b","c"]].dropna().iloc[[-1]])
a b c
2 7 5 9.0
I would filter the df on not-null, then simply add the price to it:
new_df = df[df['c'].notnull()].copy()  # .copy() avoids a SettingWithCopyWarning
Where c is your column name.
new_df['price'] = 34  # or the price from your df
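Note that to reproduce the exact desired output (a single row), you would presumably also keep only the last matching row. A minimal sketch, with the frame rebuilt from the question:

import numpy as np
import pandas as pd

quantity = pd.DataFrame([[3, 1, np.nan], [3, 2, 8], [7, 5, 9], [4, 8, np.nan]],
                        columns=['a', 'b', 'c'])

# last row where c is present, then attach the price
new_df = quantity[quantity['c'].notnull()].tail(1).copy()
new_df['price'] = 34
print(new_df)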
I've two dataframes in pandas as shown below. EmpID is a primary key in both dataframes.
df_first = pd.DataFrame([[1, 'A',1000], [2, 'B',np.NaN],[3,np.NaN,3000],[4, 'D',8000],[5, 'E',6000]], columns=['EmpID', 'Name','Salary'])
df_second = pd.DataFrame([[1, 'A','HR','Delhi'], [8, 'B','Admin','Mumbai'],[3,'C','Finance',np.NaN],[9, 'D','Ops','Banglore'],[5, 'E','Programming',np.NaN],[10, 'K','Analytics','Mumbai']], columns=['EmpID', 'Name','Department','Location'])
I want to join these two dataframes with EmpID so that
Missing data in one dataframe can be filled with value from another table if exists and key matches
If there are observations with new keys then they should be appended in the resulting dataframe
I've used the below code to achieve this.
merged_df = pd.merge(df_first,df_second,how='outer',on=['EmpID'])
But this code gives me duplicate columns which I don't want, so I kept only the columns unique to the second table (plus the key) for merging.
ColNames = list(df_second.columns.difference(df_first.columns))
ColNames.append('EmpID')
merged_df = pd.merge(df_first, df_second[ColNames], how='outer', on=['EmpID'])
Now I don't get duplicate columns, but I don't get values either in observations where the key matches.
I'll really appreciate it if someone can help me with this.
Regards,
Kailash Negi
It seems you need combine_first with set_index, to match rows by the index created from the EmpID column:
df = df_first.set_index('EmpID').combine_first(df_second.set_index('EmpID')).reset_index()
print (df)
EmpID Department Location Name Salary
0 1 HR Delhi A 1000.0
1 2 NaN NaN B NaN
2 3 Finance NaN C 3000.0
3 4 NaN NaN D 8000.0
4 5 Programming NaN E 6000.0
5 8 Admin Mumbai B NaN
6 9 Ops Banglore D NaN
7 10 Analytics Mumbai K NaN
EDIT:
For a particular column order, you need reindex:
# concatenate all column names together and remove dupes
ColNames = pd.Index(np.concatenate([df_second.columns, df_first.columns])).drop_duplicates()
print (ColNames)
Index(['EmpID', 'Name', 'Department', 'Location', 'Salary'], dtype='object')
df = (df_first.set_index('EmpID')
.combine_first(df_second.set_index('EmpID'))
.reset_index()
.reindex(columns=ColNames))
print (df)
EmpID Name Department Location Salary
0 1 A HR Delhi 1000.0
1 2 B NaN NaN NaN
2 3 C Finance NaN 3000.0
3 4 D NaN NaN 8000.0
4 5 E Programming NaN 6000.0
5 8 B Admin Mumbai NaN
6 9 D Ops Banglore NaN
7 10 K Analytics Mumbai NaN