I am attempting to append two DataFrames using pandas, but the result is full of null (NaN) values. How can I resolve this?
Here's the first DataFrame (after loading it into Python):
name State
0 Tom NY
1 Lee CA
Here's the second DataFrame (after loading it into Python), which has no header:
0 1
0 Jon FL
1 Tan NJ
I attempt to append the DataFrames using:
pd.concat([df1,df2])
The result is:
name State 0 1
0 Tom NY NaN NaN
1 Lee CA NaN NaN
0 NaN NaN Jon FL
1 NaN NaN Tan NJ
I want the result to be:
name State
0 Tom NY
1 Lee CA
2 Jon FL
3 Tan NJ
I've made the following attempt, but it doesn't work:
pd.concat([df1,df2], axis=1)
Here is my second unsuccessful attempt:
pd.concat([df1,df2], ignore_index=True)
Align your column names and then use append:
df2.columns = df1.columns
df1.append(df2).reset_index(drop=True)
# Result
name State
0 Tom NY
1 Lee CA
2 Jon FL
3 Tan NJ
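Note: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0; on current versions, replace the last line with pd.concat([df1, df2], ignore_index=True).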
Rename your columns to match and then concat them:
df2.columns = df1.columns
pd.concat([df1, df2], ignore_index=True)
Output:
name State
0 Tom NY
1 Lee CA
2 Jon FL
3 Tan NJ
I am trying to replace values in one dataframe column with values from another dataframe, matching the rows on a third column, while keeping the rest of the values from the first df.
# df1
country name value
romania john 100
russia emma 200
sua mark 300
china jack 400
# df2
name value
emma 2
mark 3
Desired result:
# df3
country name value
romania john 100
russia emma 2
sua mark 3
china jack 400
Thank you
One approach could be as follows:
Use Series.map on column name, turning df2 into a Series for mapping by setting its index to name (df2.set_index('name')['value']).
Next, chain Series.fillna to replace NaN values with the original values from df1['value'] (i.e. wherever the mapping did not find a match) and assign the result to df1['value'].
df1['value'] = df1['name'].map(df2.set_index('name')['value']).fillna(df1['value'])
print(df1)
country name value
0 romania john 100.0
1 russia emma 2.0
2 sua mark 3.0
3 china jack 400.0
N.B. The result will now contain floats. If you prefer integers, chain .astype(int) as well.
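For example:
df1['value'] = df1['name'].map(df2.set_index('name')['value']).fillna(df1['value']).astype(int)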
Another option could be using pandas.DataFrame.update:
df1.set_index('name', inplace=True)   # align both frames on 'name'
df1.update(df2.set_index('name'))     # overwrite matching rows in place
df1.reset_index(inplace=True)
name country value
0 john romania 100.0
1 emma russia 2.0
2 mark sua 3.0
3 jack china 400.0
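Note that DataFrame.update aligns on the index and modifies df1 in place, overwriting only where df2 supplies non-NA values, which is why both frames are indexed by name first.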
Another option:
df3 = df1.merge(df2, on='name', how='left')
df3['value'] = df3.value_y.fillna(df3.value_x)   # prefer df2's value, fall back to df1's
df3.drop(['value_x', 'value_y'], axis=1, inplace=True)
# country name value
# 0 romania john 100.0
# 1 russia emma 2.0
# 2 sua mark 3.0
# 3 china jack 400.0
Reproducible data:
df1=pd.DataFrame({'country':['romania','russia','sua','china'],'name':['john','emma','mark','jack'],'value':[100,200,300,400]})
df2=pd.DataFrame({'name':['emma','mark'],'value':[2,3]})
DOB Name
0 1956-10-30 Anna
1 1993-03-21 Jerry
2 2001-09-09 Peter
3 1993-01-15 Anna
4 1999-05-02 James
5 1962-12-17 Jerry
6 1972-05-04 Kate
In a dataframe similar to the one above, I have duplicate names. I want to add the suffix '_0' to a name if it is a duplicate and its DOB is before 1990.
I am expecting a result like this:
DOB Name
0 1956-10-30 Anna_0
1 1993-03-21 Jerry
2 2001-09-09 Peter
3 1993-01-15 Anna
4 1999-05-02 James
5 1962-12-17 Jerry_0
6 1972-05-04 Kate
I am using the following:
df['Name'] = df[(df['DOB'] < '01-01-1990') & (df['Name'].isin(['Anna','Jerry']))].Name.apply(lambda x: x+'_0')
But I am getting this result
DOB Name
0 1956-10-30 Anna_0
1 1993-03-21 NaN
2 2001-09-09 NaN
3 1993-01-15 NaN
4 1999-05-02 NaN
5 1962-12-17 Jerry_0
6 1972-05-04 NaN
How can I add a suffix to a Name that is duplicated and belongs to someone born before 1990?
The problem with df['Name'] = df[(df['DOB'] < '01-01-1990') & (df['Name'].isin(['Anna','Jerry']))].Name.apply(lambda x: x+'_0') is that df[(df['DOB'] < '01-01-1990') & (df['Name'].isin(['Anna','Jerry']))] is a filtered dataframe with fewer rows than the original. When you assign it back, the rows that were filtered out have no corresponding value in the filtered dataframe, so they become NaN.
You can try mask instead:
m = (df['DOB'] < '1990-01-01') & df['Name'].duplicated(keep=False)  # duplicated name with DOB before 1990
df['Name'] = df['Name'].mask(m, df['Name'] + '_0')                  # append '_0' only where m is True
You can use masks and boolean indexing:
# is the year before 1990?
m1 = pd.to_datetime(df['DOB']).dt.year.lt(1990)
# is the name duplicated?
m2 = df['Name'].duplicated(keep=False)
# if both conditions are True, add '_0' to the name
df.loc[m1&m2, 'Name'] += '_0'
Output:
DOB Name
0 1956-10-30 Anna_0
1 1993-03-21 Jerry
2 2001-09-09 Peter
3 1993-01-15 Anna
4 1999-05-02 James
5 1962-12-17 Jerry_0
6 1972-05-04 Kate
I have a DataFrame with 3 columns: ID, BossID and Name. Each row has a unique ID and has a corresponding name. BossID is the ID of the boss of the person in that row. Suppose I have the following DataFrame:
df = pd.DataFrame({'id':[1,2,3,4,5], 'bossId':[np.nan, 1, 2, 2, 3],
'name':['Anne Boe','Ben Coe','Cate Doe','Dan Ewe','Erin Aoi']})
So here, Anne is Ben's boss and Ben Coe is Cate and Dan's boss, etc.
Now, I want to have another column that has the boss's name for each person.
The desired output is:
id boss name boss_name
0 1 NaN Anne NaN
1 2 1.0 Ben Anne
2 3 2.0 Cate Ben
3 4 2.0 Dan Ben
4 5 3.0 Erin Cate
I can get my output using an ugly double for-loop. Is there a cleaner way to obtain the desired output?
This should work:
bossmap = df.set_index('id')['name']           # id -> name lookup Series
df['boss_name'] = df['bossId'].map(bossmap)    # look up each boss's name
You can set id as the index, then use Series.reindex:
df = df.set_index('id')
df['boss_name'] = df['name'].reindex(df['bossId']).to_numpy()  # or .to_list(); either drops the reindexed index so the values assign positionally
df = df.reset_index()
id bossId name boss_name
0 1 NaN Anne Boe NaN
1 2 1.0 Ben Coe Anne Boe
2 3 2.0 Cate Doe Ben Coe
3 4 2.0 Dan Ewe Ben Coe
4 5 3.0 Erin Aoi Cate Doe
Create a separate dataframe with 'name' and 'id'.
Rename 'name' to 'boss_name' and set 'id' as the index.
.merge df with the new dataframe.
import numpy as np
import pandas as pd
# test dataframe
df = pd.DataFrame({'id': [1,2,3,4,5], 'bossId': [np.nan, 1, 2, 2, 3], 'name': ['Anne Boe','Ben Coe','Cate Doe','Dan Ewe','Erin Aoi']})
# separate dataframe with id and name
names = df[['id', 'name']].dropna().set_index('id').rename(columns={'name': 'boss_name'})
# merge the two
df = df.merge(names, left_on='bossId', right_index=True, how='left')
# df
id bossId name boss_name
0 1 NaN Anne Boe NaN
1 2 1.0 Ben Coe Anne Boe
2 3 2.0 Cate Doe Ben Coe
3 4 2.0 Dan Ewe Ben Coe
4 5 3.0 Erin Aoi Cate Doe
I have a dataset in which I add coordinates to cities based on zip-codes but several of these zip-codes are missing. Also, in some cases cities are missing, states are missing, or both are missing. For example:
ca_df[['OWNER_CITY', 'OWNER_STATE', 'OWNER_ZIP']]
OWNER_CITY OWNER_STATE OWNER_ZIP
495 MIAMI SHORE PA
496 SEATTLE
However, a second dataset has city, state & the matching zip-codes. This one is complete without any missing values.
df_coord.head()
OWNER_ZIP CITY STATE
0 71937 Cove AR
1 72044 Edgemont AR
2 56171 Sherburn MN
I want to fill in the missing zip-codes in the first dataframe if:
Zip-code is empty
City is present
State is present
This is an all-or-nothing operation: either all three criteria are met and the zip-code gets filled, or nothing changes.
However, this is a fairly large dataset with > 50 million records so ideally I want to vectorize the operation by working column-wise.
Technically, that would fit np.where, but as far as I know np.where only takes one condition, in the following format:
df1['OWNER_ZIP'] = np.where(df["cond"] ==X, df_coord['OWNER_ZIP'], "")
How do I ensure I only fill missing zip-codes when all conditions are met?
Given ca_df:
OWNER_CITY OWNER_STATE OWNER_ZIP
0 Miami Shore Florida 111
1 Los Angeles California NaN
2 Houston NaN NaN
and df_coord:
OWNER_ZIP CITY STATE
0 111 Miami Shore Florida
1 222 Los Angeles California
2 333 Houston Texas
You can use pd.notna together with the DataFrame index like this:
inferrable_zips_df = pd.notna(ca_df["OWNER_CITY"]) & pd.notna(ca_df["OWNER_STATE"])
is_inferrable_zip = ca_df.index.isin(df_coord[inferrable_zips_df].index)
ca_df.loc[is_inferrable_zip, "OWNER_ZIP"] = df_coord["OWNER_ZIP"]
with ca_df resulting as:
OWNER_CITY OWNER_STATE OWNER_ZIP
0 Miami Shore Florida 111
1 Los Angeles California 222
2 Houston NaN NaN
I've changed "" to np.nan in the example data, but if you still wish to use "" then you just need to change pd.notna(ca_df[...]) to ca_df[...] != "".
You can combine multiple rules with the & operator inside a single numpy.where call. This gives you the array of row indices which abide by all three rules:
np.where((df["OWNER_ZIP"] == X) & (df["CITY"] == Y) & (df["STATE"] == Z))
Use:
print (df_coord)
OWNER_ZIP CITY STATE
0 71937 Cove AR
1 72044 Edgemont AR
2 56171 Sherburn MN
3 123 MIAMI SHORE PA
4 789 SEATTLE AA
print (ca_df)
OWNER_ZIP OWNER_CITY OWNER_STATE
0 NaN NaN NaN
1 72044 Edgemont AR
2 56171 NaN MN
3 NaN MIAMI SHORE PA
4 NaN SEATTLE NaN
First, it is necessary to check that the matching columns have the same dtypes:
#or convert ca_df['OWNER_ZIP'] to integers
df_coord['OWNER_ZIP'] = df_coord['OWNER_ZIP'].astype(str)
print (df_coord.dtypes)
OWNER_ZIP object
CITY object
STATE object
dtype: object
print (ca_df.dtypes)
OWNER_ZIP object
OWNER_CITY object
OWNER_STATE object
dtype: object
Then filter for each combination of missing and non-missing columns, add the new data by merge, set the merged result's index to match the filtered rows, and assign back:
mask1 = ca_df['OWNER_CITY'].notna() & ca_df['OWNER_STATE'].notna() & ca_df['OWNER_ZIP'].isna()
df1 = ca_df[mask1].drop('OWNER_ZIP', axis=1).merge(df_coord.rename(columns={'CITY':'OWNER_CITY','STATE':'OWNER_STATE'})).set_index(ca_df.index[mask1])
ca_df.loc[mask1, ['OWNER_ZIP','OWNER_CITY','OWNER_STATE']] = df1
mask2 = ca_df['OWNER_CITY'].notna() & ca_df['OWNER_STATE'].isna() & ca_df['OWNER_ZIP'].isna()
df2 = ca_df[mask2].drop(['OWNER_ZIP','OWNER_STATE'], axis=1).merge(df_coord.rename(columns={'CITY':'OWNER_CITY','STATE':'OWNER_STATE'})).set_index(ca_df.index[mask2])
ca_df.loc[mask2, ['OWNER_ZIP','OWNER_CITY','OWNER_STATE']] = df2
mask3 = ca_df['OWNER_CITY'].isna() & ca_df['OWNER_STATE'].notna() & ca_df['OWNER_ZIP'].notna()
df3 = ca_df[mask3].drop(['OWNER_CITY'], axis=1).merge(df_coord.rename(columns={'CITY':'OWNER_CITY','STATE':'OWNER_STATE'})).set_index(ca_df.index[mask3])
ca_df.loc[mask3, ['OWNER_ZIP','OWNER_CITY','OWNER_STATE']] = df3
print (ca_df)
OWNER_ZIP OWNER_CITY OWNER_STATE
0 NaN NaN NaN
1 72044 Edgemont AR
2 56171 Sherburn MN
3 123 MIAMI SHORE PA
4 789 SEATTLE AA
You can do a left join on these dataframes, joining on the columns 'city' and 'state'. That gives you the zip-code corresponding to a city and state whenever both values are non-null in the first dataframe (OWNER_CITY, OWNER_STATE, OWNER_ZIP), and since it is a left join it also preserves the rows which either don't have a zip-code or have null/empty city and state values.
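A minimal sketch of that idea, assuming the column names from the question:
lookup = df_coord.rename(columns={'CITY': 'OWNER_CITY', 'STATE': 'OWNER_STATE'})
merged = ca_df.merge(lookup, on=['OWNER_CITY', 'OWNER_STATE'], how='left', suffixes=('', '_lookup'))
# fill a zip only where it was missing and the lookup found a match
merged['OWNER_ZIP'] = merged['OWNER_ZIP'].fillna(merged['OWNER_ZIP_lookup'])
merged = merged.drop(columns='OWNER_ZIP_lookup')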
I have two different dataframes. First, I had to check whether the data in my df1 matches my df2; if so, add a column "isRep" = true, otherwise false. That gave me a df3.
Now I need to add an "idRep" column to my df3 that holds the index, generated automatically by pandas, where the data is found in df2.
This is df1:
Index Firstname Name Origine
0 Johnny Depp USA
1 Brad Pitt USA
2 Angelina Pitt USA
This is df2:
Index Firstname Name Origine
0 Kidman Nicole AUS
1 Jean Dujardin FR
2 Brad Pitt USA
After the merge with this code:
df = pd.merge(data, dataRep, on=['Firstname', 'Name', 'Origine'], how='left', indicator='IsRep')
df['IsRep'] = np.where(df.IsRep == 'both', True, False)
After this code, I get the following result, which is my df3 (it's the same as df1 but with the column "isRep"):
Index Firstname Name Origine isRep
0 Johnny Depp USA False
1 Brad Pitt USA True
2 Angelina Pitt USA False
Now I need another dataframe with a column named "idRep" containing the corresponding index in df2, like below, but I don't know how to do that:
Index Firstname Name Origine isRep IdRep
0 Johnny Depp USA False -
1 Brad Pitt USA True 2
2 Angelina Pitt USA False -
A bit of a hack would be to reset_index before you merge. Only reset the index on the right DataFrame.
m = dataRep.rename_axis('IdRep').reset_index()
df = pd.merge(data, m, on=['Firstname', 'Name', 'Origine'], how='left', indicator='IsRep')
df['IsRep'] = np.where(df.IsRep == 'both', True, False)
Firstname Name Origine IdRep IsRep
0 Johnny Depp USA NaN False
1 Brad Pitt USA 2.0 True
2 Angelina Pitt USA NaN False
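If you prefer whole numbers over the 2.0 float, plus the '-' placeholder from the desired output, one sketch (the '-' is purely cosmetic) is to go through pandas' nullable integer type:
df['IdRep'] = df['IdRep'].astype('Int64').astype('string').fillna('-')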
Reverse look-up using a dict:
cols = ['Firstname', 'Name', 'Origine']
# map each (Firstname, Name, Origine) tuple in df2 to its index
d = dict(zip(zip(*map(df2.get, cols)), df2.index))
# the same tuples, one per row of df1
z = [*zip(*map(df1.get, cols))]
df1.assign(
    isRep=[*map(d.__contains__, z)],   # True if the tuple appears in df2
    IdRep=[*map(d.get, z)]             # the matching df2 index, else None
)
Firstname Name Origine isRep IdRep
Index
0 Johnny Depp USA False NaN
1 Brad Pitt USA True 2.0
2 Angelina Pitt USA False NaN
A variation that takes advantage of assign arguments being order-dependent: IdRep is computed first, then isRep is derived from it.
cols = ['Firstname', 'Name', 'Origine']
d = dict(zip(zip(*map(df2.get, cols)), df2.index))
z = [*zip(*map(df1.get, cols))]
df1.assign(
IdRep=[*map(d.get, z)],
isRep=lambda d: d.IdRep.notna()
)
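Note that d.get returns None for keys that are missing from df2; pandas stores those as NaN in the IdRep column, which is why IdRep.notna() recovers the isRep flag.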