I wanted to know if there is a way to merge / re-join the missing rows simply by index.
My original approach was to cleanly separate df1 into df1_cleaned and df1_untouched, and then join them back together. But I thought there is probably an easier way to re-join the two dataframes, since I didn't change the index. I tried an outer merge with left_index and right_index, but was left with duplicate columns with suffixes to clean up.
df1
index  colA        colB  colC
0      California  123   abc
1      New York    456   def
2      Texas       789   ghi
df2 (subset of df1 and cleaned)
index  colA        colB  colC
0      California  321   abc
2      Texas       789   ihg
end-result
index  colA        colB  colC
0      California  321   abc
1      New York    456   def
2      Texas       789   ihg
You can use combine_first or update:
df_out = df2.combine_first(df1)
or, pd.DataFrame.update (which is an inplace operation and will overwrite df1):
df1.update(df2)
Output:
colA colB colC
index
0 California 321.0 abc
1 New York 456.0 def
2 Texas 789.0 ihg
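For anyone who wants to try both options, here is a minimal self-contained sketch built from the sample tables in the question. The name df_updated is only an illustrative choice; it is there so that update runs on a copy and df1 stays intact.

```python
import pandas as pd

# Sample data from the question; df2 keeps the original index labels 0 and 2
df1 = pd.DataFrame(
    {"colA": ["California", "New York", "Texas"],
     "colB": [123, 456, 789],
     "colC": ["abc", "def", "ghi"]},
    index=[0, 1, 2],
)
df2 = pd.DataFrame(
    {"colA": ["California", "Texas"],
     "colB": [321, 789],
     "colC": ["abc", "ihg"]},
    index=[0, 2],
)

# combine_first: use df2's values where they exist, fall back to df1 elsewhere
df_out = df2.combine_first(df1)

# update: overwrite matching cells in place, here on a copy so df1 is untouched
df_updated = df1.copy()
df_updated.update(df2)

print(df_out)
print(df_updated)
```

combine_first returns a new dataframe, while update modifies the frame it is called on and returns None, so pick whichever fits how you want to treat df1.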
You can reindex df2 to df1's index, take the difference of the two indexes, and then fill the missing rows in df_result from df1:
df_result = df2.reindex(df1.index)
missing_index = df1.index.difference(df2.index)
df_result.loc[missing_index] = df1.loc[missing_index]
print(df_result)
colA colB colC
0 California 321.0 abc
1 New York 456.0 def
2 Texas 789.0 ihg
Related
I am trying to replace values from a dataframe column with values from another based on a third one and keep the rest of the values from the first df.
# df1
country name value
romania john 100
russia emma 200
sua mark 300
china jack 400
# df2
name value
emma 2
mark 3
Desired result:
# df3
country name value
romania john 100
russia emma 2
sua mark 3
china jack 400
Thank you
One approach could be as follows:
Use Series.map on column name, turning df2 into a Series for mapping by setting its index to name (DataFrame.set_index).
Next, chain Series.fillna to replace NaN values with the original values from df1['value'] (i.e. wherever the mapping did not produce a match) and assign the result to df1['value'].
df1['value'] = df1['name'].map(df2.set_index('name')['value']).fillna(df1['value'])
print(df1)
country name value
0 romania john 100.0
1 russia emma 2.0
2 sua mark 3.0
3 china jack 400.0
N.B. The result will now contain floats. If you prefer integers, chain .astype(int) as well.
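A self-contained sketch of the map / fillna approach with the integer cast included, using the data from the question. The names lookup and df3 are just illustrative choices (df3 mirrors the name of the desired result), and the cast assumes every value is a whole number.

```python
import pandas as pd

df1 = pd.DataFrame({
    "country": ["romania", "russia", "sua", "china"],
    "name": ["john", "emma", "mark", "jack"],
    "value": [100, 200, 300, 400],
})
df2 = pd.DataFrame({"name": ["emma", "mark"], "value": [2, 3]})

# Series indexed by name, used as the lookup table for map
lookup = df2.set_index("name")["value"]

df3 = df1.copy()
# Names missing from the lookup map to NaN; fillna restores the original value,
# and astype(int) undoes the float upcast caused by the temporary NaNs
df3["value"] = df3["name"].map(lookup).fillna(df3["value"]).astype(int)

print(df3)
```

Note that map expects the names in df2 to be unique; with duplicate names the lookup Series would have a non-unique index and map would raise an error.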
Another option could be using pandas.DataFrame.update:
df1.set_index('name', inplace=True)
df1.update(df2.set_index('name'))
df1.reset_index(inplace=True)
name country value
0 john romania 100.0
1 emma russia 2.0
2 mark sua 3.0
3 jack china 400.0
Another option:
df3 = df1.merge(df2, on = 'name', how = 'left')
df3['value'] = df3.value_y.fillna(df3.value_x)
df3.drop(['value_x', 'value_y'], axis = 1, inplace = True)
# country name value
# 0 romania john 100.0
# 1 russia emma 2.0
# 2 sua mark 3.0
# 3 china jack 400.0
Reproducible data:
import pandas as pd
df1 = pd.DataFrame({'country': ['romania', 'russia', 'sua', 'china'], 'name': ['john', 'emma', 'mark', 'jack'], 'value': [100, 200, 300, 400]})
df2 = pd.DataFrame({'name': ['emma', 'mark'], 'value': [2, 3]})
Consider this sample df:
colAnum  colB   colCnum  colD
123      House  456      Book
         Car    789      Table
891      Chair           Porch
I am trying to roll through this df and if the "num" column is an empty string, then make the adjacent column, to the right, empty as well.
This is the expected output:
colAnum  colB   colCnum  colD
123      House  456      Book
                789      Table
891      Chair
I attempted this with variations on this:
for idx, col in enumerate(df.columns):
    if df.iloc[idx, col] == '':
        df[idx+1,col] == ''
I am sure I am missing something simple to make this occur, but cannot work my way around it.
Try shift with mask:
out = df.mask(df.eq('').shift(axis=1).fillna(False),'')
   colAnum   colB colCnum   colD
0    123.0  House   456.0   Book
1                   789.0  Table
2    891.0  Chair
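Here is a runnable sketch of the same shift / mask idea. It assumes every cell is stored as a string and that blanks are empty strings (the question does not show the dtypes). It also passes fill_value=False to shift instead of chaining fillna, so that the condition stays boolean. The names is_empty and blank_right are illustrative.

```python
import pandas as pd

# Sample frame from the question; values assumed to be strings, blanks are ''
df = pd.DataFrame({
    "colAnum": ["123", "", "891"],
    "colB": ["House", "Car", "Chair"],
    "colCnum": ["456", "789", ""],
    "colD": ["Book", "Table", "Porch"],
})

# True wherever a cell is an empty string
is_empty = df.eq("")

# Move each flag one column to the right; the first column has no left
# neighbour, so it is filled with False
blank_right = is_empty.shift(axis=1, fill_value=False).astype(bool)

# Replace every flagged cell with an empty string
out = df.mask(blank_right, "")
print(out)
```

Be aware that this blanks the cell to the right of any empty cell, not only those next to a "num" column. If colB could be empty without colCnum needing to be cleared, restrict is_empty to the "num" columns first.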
I have two pandas dataframes with several rows that are near duplicates of each other, except for one value, which is a timestamp. My goal is to merge these dataframes into a single dataframe and, for these near-duplicate rows, keep the row with the latest timestamp.
Here is an example of what I'm working with:
DF1:
id name created_at
0 1 Cristiano Ronaldo 2020-01-20
1 2 Messi 2020-01-20
2 3 Juarez 2020-01-20
DF2:
id name created_at
0 1 Cristiano Ronaldo 2020-01-20
1 2 Messi 2020-01-20
2 3 Juarez 2020-02-20
And here is what I would like:
id name created_at
3 1 Cristiano Ronaldo 2020-01-20
4 2 Messi 2020-01-20
5 3 Juarez 2020-02-20
For the Juarez row, I get the latest "created_at".
Is that possible?
You can concatenate the two dataframes with pd.concat (DataFrame.append, the older way to do this, was removed in pandas 2.0), sort the result by timestamp, and then drop duplicates.
df_merged = pd.concat([df1, df2], ignore_index = True)
df_merged = df_merged.sort_values('created_at')
df_columns = df_merged.columns.tolist()
df_columns.remove('created_at')
df_merged.drop_duplicates(inplace = True, keep = 'last', subset = df_columns)
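A self-contained version of these steps with the sample data from the question, written with pd.concat. Converting created_at with pd.to_datetime is an extra step assumed here so that the sort is chronological even if the dates are not in ISO format. The names key_cols and df_latest are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["Cristiano Ronaldo", "Messi", "Juarez"],
    "created_at": ["2020-01-20", "2020-01-20", "2020-01-20"],
})
df2 = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["Cristiano Ronaldo", "Messi", "Juarez"],
    "created_at": ["2020-01-20", "2020-01-20", "2020-02-20"],
})

# Stack both frames and make the timestamp column a real datetime
df_merged = pd.concat([df1, df2], ignore_index=True)
df_merged["created_at"] = pd.to_datetime(df_merged["created_at"])

# Rows count as duplicates when every column except the timestamp matches
key_cols = [c for c in df_merged.columns if c != "created_at"]

df_latest = (
    df_merged.sort_values("created_at")                 # oldest first
    .drop_duplicates(subset=key_cols, keep="last")      # keep the newest row
    .sort_values("id")
    .reset_index(drop=True)
)
print(df_latest)
```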
I have two dataframes (with strings) that I am trying to compare with each other. One has a list of areas; the other has a list of areas with longitude/latitude info as well. I am struggling to write code that does the following:
1) Check whether the string in df1 matches (or partially matches) an area name in df2; if so, merge and carry over the Latitude and Longitude columns.
2) If the string in df1 does not match anything in df2, the new columns should contain NaN or zero.
Code:
import pandas as pd
df1 = pd.read_csv('Dubai Communities1.csv')
df1.head()
CNAME_E1
0 Abu Hail
1 Al Asbaq
2 Al Aweer First
3 Al Aweer Second
4 Al Bada
df2 = pd.read_csv('Dubai Communities2.csv')
df2.head()
COMM_NUM CNAME_E2 Latitude Longitude
0 315 UMM HURAIR 55.3237 25.2364
1 917 AL MARMOOM 55.4518 24.9756
2 624 WARSAN 55.4034 25.1424
3 123 AL MUTEENA 55.3228 25.2739
4 813 AL ROWAIYAH 55.3981 25.1053
The output after search and join will look like this:
CName_E1 CName_E3 Latitude Longitude
0 Area1 Area1 22 7.25
1 Area2 Area2 38 71.83
2 Area3 NaN NaN NaN
3 Area4 Area4 35 8.05
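One possible starting point, as a rough sketch only: normalise both name columns to upper case and do a left merge, so that unmatched rows of df1 end up with NaN. This covers case-insensitive exact matches and does not attempt the partial matching mentioned above. The key column is a hypothetical helper, and the "Umm Hurair" row in df1 is made up so that the sketch has one match to show; the other rows are copied from the question.

```python
import pandas as pd

# "Umm Hurair" is a made-up row for illustration; the rest is from the question
df1 = pd.DataFrame({"CNAME_E1": ["Abu Hail", "Al Asbaq", "Umm Hurair"]})
df2 = pd.DataFrame({
    "COMM_NUM": [315, 917],
    "CNAME_E2": ["UMM HURAIR", "AL MARMOOM"],
    "Latitude": [55.3237, 55.4518],
    "Longitude": [25.2364, 24.9756],
})

# Hypothetical helper column: trimmed, upper-cased area name used as join key
df1["key"] = df1["CNAME_E1"].str.strip().str.upper()
df2["key"] = df2["CNAME_E2"].str.strip().str.upper()

# Left merge keeps every row of df1; rows without a match get NaN
result = (
    df1.merge(df2[["key", "CNAME_E2", "Latitude", "Longitude"]],
              on="key", how="left")
       .drop(columns="key")
)
print(result)
```

If zeros are preferred over NaN, chaining fillna(0) on the Latitude and Longitude columns afterwards would do that.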
I have 3 dataframes as below
df1
id first_name surname state
1
88
190
2509
....
df2
id given_name surname state street_num
17 John Doe NY 5
88 Tom Murphy CA 423
190 Dave Casey KY 250
....
df3
id first_name family_name state car
1 John Woods NY ford
74 Tom Kite FL vw
2509 Mike Johnson KY toyota
Some ids from df1 are in df2 and others are in df3. There are also ids in df2 and df3 that are not in df1.
EDIT: there are also some ids in df1 that are not in either df2 or df3.
I want to fill the columns in df1 with the values from whichever dataframe contains the id. However, I do not want all columns (so I think merge is not suitable). I tried to use the isin function, but that way I could not update records individually and got an error. This was my attempt using isin:
df1.loc[df1.index.isin(df2.index), 'first_name'] = df2.given_name
Is there an easy way to do this without iterating through the dataframes checking if index matches?
I think you first need to rename your columns to align the DataFrames in concat and then reindex to filter by df1.index and df1.columns:
df21 = df2.rename(columns={'given_name':'first_name'})
df31 = df3.rename(columns={'family_name':'surname'})
df = pd.concat([df21, df31]).reindex(index=df1.index, columns=df1.columns)
print (df)
first_name surname state
id
1 John Woods NY
88 Tom Murphy CA
190 Dave Casey KY
2509 Mike Johnson KY
EDIT: If you need only the intersection of the indices:
df4 = pd.concat([df21, df31])
df = df4.reindex(index=df1.index.intersection(df4.index), columns=df1.columns)
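To make the above reproducible, here is a small sketch built from the rows shown in the question. It assumes id is the index of all three dataframes (the isin attempt in the question compares indexes) and that an id never appears in both df2 and df3, since reindex needs unique labels.

```python
import pandas as pd

# df1 only has ids so far; the other columns are empty
df1 = pd.DataFrame(
    columns=["first_name", "surname", "state"],
    index=pd.Index([1, 88, 190, 2509], name="id"),
)
df2 = pd.DataFrame(
    {"given_name": ["John", "Tom", "Dave"],
     "surname": ["Doe", "Murphy", "Casey"],
     "state": ["NY", "CA", "KY"],
     "street_num": [5, 423, 250]},
    index=pd.Index([17, 88, 190], name="id"),
)
df3 = pd.DataFrame(
    {"first_name": ["John", "Tom", "Mike"],
     "family_name": ["Woods", "Kite", "Johnson"],
     "state": ["NY", "FL", "KY"],
     "car": ["ford", "vw", "toyota"]},
    index=pd.Index([1, 74, 2509], name="id"),
)

# Align the column names with df1 before stacking
df21 = df2.rename(columns={"given_name": "first_name"})
df31 = df3.rename(columns={"family_name": "surname"})
df4 = pd.concat([df21, df31])

# Keep only df1's ids and df1's columns; ids missing from df4 become NaN rows
df = df4.reindex(index=df1.index, columns=df1.columns)
print(df)
```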