I have two dataframes as indicated below:
dfA =
Country City Pop
US Washington 1000
US Texas 5000
CH Geneva 500
CH Zurich 500
dfB =
Country City Density (pop/km2)
US Washington 10
US Texas 50
CH Geneva 5
CH Zurich 5
What I want is to compare the columns Country and City from both dataframes, and when these match such as:
US Washington & US Washington in both dataframes, it takes the Pop value and divides it by Density, so as to get a new column area in dfB with the resulting division. Example of the first row result: dfB['area km2'] = 100
I have tried with np.where() but it is not working. Any hints on how to achieve this?
Using index matching and div
match_on = ['Country', 'City']
dfA = dfA.set_index(match_on)
dfA.Pop.div(dfB.set_index(match_on)['Density (pop/km2)'])
Country City
US Washington 100.0
Texas 100.0
CH Geneva 100.0
Zurich 100.0
dtype: float64
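For completeness, here is a minimal self-contained sketch of the index-aligned division, writing the result back into dfB as the question asks. The frames are rebuilt from the question's sample data, the rows of dfB are deliberately shuffled to show that alignment happens on the (Country, City) keys rather than on row position, and the column name 'area km2' is taken from the question.

```python
import pandas as pd

dfA = pd.DataFrame({
    "Country": ["US", "US", "CH", "CH"],
    "City": ["Washington", "Texas", "Geneva", "Zurich"],
    "Pop": [1000, 5000, 500, 500],
})
# Same data as the question, but in a different row order on purpose
dfB = pd.DataFrame({
    "Country": ["CH", "CH", "US", "US"],
    "City": ["Zurich", "Geneva", "Texas", "Washington"],
    "Density (pop/km2)": [5, 5, 50, 10],
})

match_on = ["Country", "City"]

# Both Series share a (Country, City) MultiIndex, so div aligns on the keys
area = dfA.set_index(match_on)["Pop"].div(
    dfB.set_index(match_on)["Density (pop/km2)"]
)

# Attach the result to dfB by looking up each row's keys in the Series index
dfB = dfB.join(area.rename("area km2"), on=match_on)
print(dfB)
```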
You can also use merge to combine the two dataframes and divide as usual:
dfMerge = dfA.merge(dfB, on=['Country', 'City'])
dfMerge['area'] = dfMerge['Pop'].div(dfMerge['Density (pop/km2)'])
print(dfMerge)
Output:
Country City Pop Density (pop/km2) area
0 US Washington 1000 10 100.0
1 US Texas 5000 50 100.0
2 CH Geneva 500 5 100.0
3 CH Zurich 500 5 100.0
You can also use merge like below:
dfB["Area"] = dfB.merge(dfA, on=["Country", "City"], how="left")["Pop"] / dfB["Density (pop/km2)"]
dfB
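The one-liner above relies on the merged frame having the same row order and index as dfB. A slightly more defensive sketch, assuming each (Country, City) pair appears at most once in each frame, keeps everything inside the merged frame and lets pandas check that assumption through the validate argument:

```python
import pandas as pd

dfA = pd.DataFrame({
    "Country": ["US", "US", "CH", "CH"],
    "City": ["Washington", "Texas", "Geneva", "Zurich"],
    "Pop": [1000, 5000, 500, 500],
})
dfB = pd.DataFrame({
    "Country": ["US", "US", "CH", "CH"],
    "City": ["Washington", "Texas", "Geneva", "Zurich"],
    "Density (pop/km2)": [10, 50, 5, 5],
})

keys = ["Country", "City"]

# validate="one_to_one" raises a MergeError if a key pair is duplicated
merged = dfB.merge(dfA, on=keys, how="left", validate="one_to_one")

# Rows of dfB without a match in dfA get NaN in Pop, hence NaN in the area
merged["area km2"] = merged["Pop"] / merged["Density (pop/km2)"]
print(merged)
```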
I am trying to replace values from a dataframe column with values from another based on a third one and keep the rest of the values from the first df.
# df1
country name value
romania john 100
russia emma 200
sua mark 300
china jack 400
# df2
name value
emma 2
mark 3
Desired result:
# df3
country name value
romania john 100
russia emma 2
sua mark 3
china jack 400
Thank you
One approach could be as follows:
Use Series.map on column name and turn df2 into a Series for mapping by setting its index to name (df.set_index).
Next, chain Series.fillna to replace NaN values with original values from df.value (i.e. whenever mapping did not result in a match) and assign to df['value'].
df['value'] = df['name'].map(df2.set_index('name')['value']).fillna(df['value'])
print(df)
country name value
0 romania john 100.0
1 russia emma 2.0
2 sua mark 3.0
3 china jack 400.0
N.B. The result will now contain floats. If you prefer integers, chain .astype(int) as well.
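Put together with the integer conversion, a runnable sketch on the question's data could look like this. It assumes the names in df2 are unique, since Series.map needs a lookup Series with a unique index, and it uses the question's name df1 for the frame that the snippet above calls df.

```python
import pandas as pd

df1 = pd.DataFrame({
    "country": ["romania", "russia", "sua", "china"],
    "name": ["john", "emma", "mark", "jack"],
    "value": [100, 200, 300, 400],
})
df2 = pd.DataFrame({"name": ["emma", "mark"], "value": [2, 3]})

# Lookup Series: index is the name, values are the replacement values
lookup = df2.set_index("name")["value"]

# Names missing from the lookup map to NaN and fall back to the old value
df1["value"] = df1["name"].map(lookup).fillna(df1["value"]).astype(int)
print(df1)
```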
Another option could be using pandas.DataFrame.update:
df1.set_index('name', inplace=True)
df1.update(df2.set_index('name'))
df1.reset_index(inplace=True)
name country value
0 john romania 100.0
1 emma russia 2.0
2 mark sua 3.0
3 jack china 400.0
Another option:
df3 = df1.merge(df2, on = 'name', how = 'left')
df3['value'] = df3.value_y.fillna(df3.value_x)
df3.drop(['value_x', 'value_y'], axis = 1, inplace = True)
# country name value
# 0 romania john 100.0
# 1 russia emma 2.0
# 2 sua mark 3.0
# 3 china jack 400.0
Reproducible data:
df1=pd.DataFrame({'country':['romania','russia','sua','china'],'name':['john','emma','mark','jack'],'value':[100,200,300,400]})
df2=pd.DataFrame({'name':['emma','mark'],'value':[2,3]})
I have a dataset in which I add coordinates to cities based on zip-codes but several of these zip-codes are missing. Also, in some cases cities are missing, states are missing, or both are missing. For example:
ca_df[['OWNER_CITY', 'OWNER_STATE', 'OWNER_ZIP']]
OWNER_CITY OWNER_STATE OWNER_ZIP
495 MIAMI SHORE PA
496 SEATTLE
However, a second dataset has city, state & the matching zip-codes. This one is complete without any missing values.
df_coord.head()
OWNER_ZIP CITY STATE
0 71937 Cove AR
1 72044 Edgemont AR
2 56171 Sherburn MN
I want to fill in the missing zip-codes in the first dataframe if:
Zip-code is empty
City is present
State is present
This is an all-or-nothing operation: either all three criteria are met and the zip-code gets filled, or nothing changes.
However, this is a fairly large dataset with > 50 million records so ideally I want to vectorize the operation by working column-wise.
Technically, that would fit np.where, but as far as I know, np.where only takes one condition in the following format:
df1['OWNER_ZIP'] = np.where(df["cond"] ==X, df_coord['OWNER_ZIP'], "")
How do I ensure I only fill missing zip-codes when all conditions are met?
Given ca_df:
OWNER_CITY OWNER_STATE OWNER_ZIP
0 Miami Shore Florida 111
1 Los Angeles California NaN
2 Houston NaN NaN
and df_coord:
OWNER_ZIP CITY STATE
0 111 Miami Shore Florida
1 222 Los Angeles California
2 333 Houston Texas
You can use pd.notna along with pd.DataFrame#index like this:
inferrable_zips_df = pd.notna(ca_df["OWNER_CITY"]) & pd.notna(ca_df["OWNER_STATE"])
is_inferrable_zip = ca_df.index.isin(df_coord[inferrable_zips_df].index)
ca_df.loc[is_inferrable_zip, "OWNER_ZIP"] = df_coord["OWNER_ZIP"]
with ca_df resulting as:
OWNER_CITY OWNER_STATE OWNER_ZIP
0 Miami Shore Florida 111
1 Los Angeles California 222
2 Houston NaN NaN
I've changed the "" to np.nan, but if you still wish to use "" then you just need to change pd.notna(ca_df[...]) to ca_df[...] != "".
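You can combine multiple rules in a single numpy.where call by joining the boolean conditions with &. Note that chaining separate np.where calls with Python's and does not intersect them; it simply returns the last result. This should give you the row positions that satisfy all three rules:
np.where((df["OWNER_ZIP"] == X) & (df["CITY"] == Y) & (df["STATE"] == Z))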
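As a minimal sketch of combining conditions for the case in the question (zip-code empty, city present, state present), the three tests can be joined with & into one boolean mask and then handed to np.where. The sample rows and the status column are made up for illustration, and missing values are assumed to be stored as NaN/None rather than as empty strings.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows; missing values are None here, not ""
df = pd.DataFrame({
    "OWNER_CITY": ["MIAMI SHORE", "SEATTLE", None, "Cove"],
    "OWNER_STATE": ["PA", None, "MN", "AR"],
    "OWNER_ZIP": [None, None, None, "71937"],
})

# All three criteria must hold at once, so the masks are combined with &
fillable = (
    df["OWNER_ZIP"].isna()
    & df["OWNER_CITY"].notna()
    & df["OWNER_STATE"].notna()
)

# One-argument form: positions of the rows where the mask is True
rows = np.where(fillable)[0]

# Three-argument form: pick a value per row depending on the mask
df["status"] = np.where(fillable, "fill", "keep")
print(rows)
print(df)
```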
Use:
print (df_coord)
OWNER_ZIP CITY STATE
0 71937 Cove AR
1 72044 Edgemont AR
2 56171 Sherburn MN
3 123 MIAMI SHORE PA
4 789 SEATTLE AA
print (ca_df)
OWNER_ZIP OWNER_CITY OWNER_STATE
0 NaN NaN NaN
1 72044 Edgemont AR
2 56171 NaN MN
3 NaN MIAMI SHORE PA
4 NaN SEATTLE NaN
First, it is necessary to check that the matching columns have the same dtypes:
#or convert ca_df['OWNER_ZIP'] to integers
df_coord['OWNER_ZIP'] = df_coord['OWNER_ZIP'].astype(str)
print (df_coord.dtypes)
OWNER_ZIP object
CITY object
STATE object
dtype: object
print (ca_df.dtypes)
OWNER_ZIP object
OWNER_CITY object
OWNER_STATE object
dtype: object
Then, for each combination of missing and non-missing columns, filter the rows, add the new data with merge, set the index of the merged result to match the filtered rows, and assign it back:
mask1 = ca_df['OWNER_CITY'].notna() & ca_df['OWNER_STATE'].notna() & ca_df['OWNER_ZIP'].isna()
df1 = ca_df[mask1].drop('OWNER_ZIP', axis=1).merge(df_coord.rename(columns={'CITY':'OWNER_CITY','STATE':'OWNER_STATE'})).set_index(ca_df.index[mask1])
ca_df.loc[mask1, ['OWNER_ZIP','OWNER_CITY','OWNER_STATE']] = df1
mask2 = ca_df['OWNER_CITY'].notna() & ca_df['OWNER_STATE'].isna() & ca_df['OWNER_ZIP'].isna()
df2 = ca_df[mask2].drop(['OWNER_ZIP','OWNER_STATE'], axis=1).merge(df_coord.rename(columns={'CITY':'OWNER_CITY','STATE':'OWNER_STATE'})).set_index(ca_df.index[mask2])
ca_df.loc[mask2, ['OWNER_ZIP','OWNER_CITY','OWNER_STATE']] = df2
mask3 = ca_df['OWNER_CITY'].isna() & ca_df['OWNER_STATE'].notna() & ca_df['OWNER_ZIP'].notna()
df3 = ca_df[mask3].drop(['OWNER_CITY'], axis=1).merge(df_coord.rename(columns={'CITY':'OWNER_CITY','STATE':'OWNER_STATE'})).set_index(ca_df.index[mask3])
ca_df.loc[mask3, ['OWNER_ZIP','OWNER_CITY','OWNER_STATE']] = df3
print (ca_df)
OWNER_ZIP OWNER_CITY OWNER_STATE
0 NaN NaN NaN
1 72044 Edgemont AR
2 56171 Sherburn MN
3 123 MIAMI SHORE PA
4 789 SEATTLE AA
You can do a left join on these dataframes, joining on the 'city' and 'state' columns. That would give you the zip-code corresponding to a city and state if both values are non-null in the first dataframe (OWNER_CITY, OWNER_STATE, OWNER_ZIP), and since it is a left join, it would also preserve the rows which either don't have a zip-code or have null/empty city and state values.
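A rough sketch of that left join, under a few assumptions: missing values are NaN/None rather than empty strings, zip-codes are stored as strings in both frames, and each (city, state) pair should map to a single zip-code (duplicates in the lookup are dropped so the merge keeps one row per input row). The sample rows and the helper column name ZIP_LOOKUP are made up for illustration.

```python
import pandas as pd

ca_df = pd.DataFrame(
    {
        "OWNER_CITY": ["Edgemont", "MIAMI SHORE", "SEATTLE", None],
        "OWNER_STATE": ["AR", "PA", None, "MN"],
        "OWNER_ZIP": ["72044", None, None, None],
    },
    index=[10, 11, 12, 13],  # non-default index, to show it is preserved
)
df_coord = pd.DataFrame({
    "OWNER_ZIP": ["72044", "123", "789", "56171"],
    "CITY": ["Edgemont", "MIAMI SHORE", "SEATTLE", "Sherburn"],
    "STATE": ["AR", "PA", "AA", "MN"],
})

keys = ["OWNER_CITY", "OWNER_STATE"]

# Rename the lookup columns to the join keys; keep one zip per (city, state)
lookup = df_coord.rename(
    columns={"CITY": "OWNER_CITY", "STATE": "OWNER_STATE", "OWNER_ZIP": "ZIP_LOOKUP"}
).drop_duplicates(keys)

# A left merge keeps every row of ca_df in its original order
found = ca_df.merge(lookup, on=keys, how="left")["ZIP_LOOKUP"].to_numpy()

# Fill only where the zip is missing and both city and state are present
mask = ca_df["OWNER_ZIP"].isna() & ca_df["OWNER_CITY"].notna() & ca_df["OWNER_STATE"].notna()
ca_df.loc[mask, "OWNER_ZIP"] = found[mask.to_numpy()]
print(ca_df)
```

Everything here works column-wise, so no Python-level loop over the rows is needed.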
I have two dataframes (with strings) that I am trying to compare to each other. One has a list of areas, the other has a list of areas with long,lat info as well. I am struggling to write a code to perform the following:
1) Check if the string in df1 matches (or partially matches) area names in df2; if so, merge and carry over the long/lat columns.
2) if df1 does not match with df2, then the new column will have NaN or zero.
Code:
import pandas as pd
df1 = pd.read_csv('Dubai Communities1.csv')
df1.head()
CNAME_E1
0 Abu Hail
1 Al Asbaq
2 Al Aweer First
3 Al Aweer Second
4 Al Bada
df2 = pd.read_csv('Dubai Communities2.csv')
df2.head()
COMM_NUM CNAME_E2 Latitude Longitude
0 315 UMM HURAIR 55.3237 25.2364
1 917 AL MARMOOM 55.4518 24.9756
2 624 WARSAN 55.4034 25.1424
3 123 AL MUTEENA 55.3228 25.2739
4 813 AL ROWAIYAH 55.3981 25.1053
The output after search and join will look like this:
CName_E1 CName_E3 Latitude Longitude
0 Area1 Area1 22 7.25
1 Area2 Area2 38 71.83
2 Area3 NaN NaN NaN
3 Area4 Area4 35 8.05
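One possible starting point for the exact-match part, sketched under assumptions: the names differ only in letter case and surrounding whitespace, so both columns are normalised into a temporary key before a left merge. The rows below are made up from the df2 sample above, the key column is a hypothetical helper, and true partial matching (for example "Al Aweer First" against a shorter name) is not covered.

```python
import pandas as pd

# Hypothetical rows: only the last two names exist in df2
df1 = pd.DataFrame({"CNAME_E1": ["Abu Hail", "Al Muteena", "Warsan"]})
df2 = pd.DataFrame({
    "COMM_NUM": [123, 624],
    "CNAME_E2": ["AL MUTEENA", "WARSAN"],
    "Latitude": [55.3228, 55.4034],
    "Longitude": [25.2739, 25.1424],
})

# Normalise both name columns into a temporary join key
left = df1.assign(key=df1["CNAME_E1"].str.strip().str.upper())
right = df2.assign(key=df2["CNAME_E2"].str.strip().str.upper())

# Left merge: unmatched rows of df1 keep NaN in the columns coming from df2
out = left.merge(right, on="key", how="left").drop(columns="key")
print(out)
```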
I have a dataframe which looks as follows (more columns have been dropped):
memberID shipping_country
264991
264991 Canada
100 USA
5000
5000 UK
I'm trying to fill the blank cells with existing value of shipping country for each user:
memberID shipping_country
264991 Canada
264991 Canada
100 USA
5000 UK
5000 UK
However, I'm not sure what's the most efficient way to do this on a large-scale dataset. Perhaps using a vectorized groupby method?
You can use GroupBy + ffill / bfill:
def filler(x):
    return x.ffill().bfill()
res = df.groupby('memberID')['shipping_country'].apply(filler)
A custom function is necessary as there's no combined Pandas method to ffill and bfill sequentially.
This also caters for the situation where all values are NaN for a specific memberID; in this case they will remain NaN.
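A self-contained sketch of the same idea, assuming the blanks in the question are empty strings that first have to be turned into NaN. It uses transform instead of apply so that the result keeps the original row index and can be assigned straight back; the memberID 54 row is an extra all-blank group added for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "memberID": [264991, 264991, 100, 5000, 5000, 54],
    "shipping_country": ["", "Canada", "USA", "", "UK", ""],
})

# Turn empty strings into NaN so that ffill/bfill treat them as missing
country = df["shipping_country"].mask(df["shipping_country"] == "")

# Fill forward, then backward, within each memberID group
df["shipping_country"] = country.groupby(df["memberID"]).transform(
    lambda s: s.ffill().bfill()
)
print(df)
```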
For the following sample dataframe (I added a memberID group that only contains '' in the shipping_country column):
memberID shipping_country
0 264991
1 264991 Canada
2 100 USA
3 5000
4 5000 UK
5 54
This should work for you. It also has the behavior that if a memberID group only has empty string values ('') in shipping_country, those will be retained in the output df:
df['shipping_country'] = df.replace('',np.nan).groupby('memberID')['shipping_country'].transform('first').fillna('')
Yields:
memberID shipping_country
0 264991 Canada
1 264991 Canada
2 100 USA
3 5000 UK
4 5000 UK
5 54
If you would like to leave the empty strings '' as NaN in the output df, then just remove the fillna(''), leaving:
df['shipping_country'] = df.replace('',np.nan).groupby('memberID')['shipping_country'].transform('first')
You can use chained groupbys, one with forward fill and one with backfill:
# replace blank values with `NaN` first:
df['shipping_country'] = df['shipping_country'].replace('', np.nan)  # pd.np was removed in newer pandas; use numpy directly
df.iloc[::-1].groupby('memberID').ffill().groupby('memberID').bfill()
memberID shipping_country
0 264991 Canada
1 264991 Canada
2 100 USA
3 5000 UK
4 5000 UK
This method will also allow a group made up of all NaN to remain NaN:
>>> df
memberID shipping_country
0 264991
1 264991 Canada
2 100 USA
3 5000
4 5000 UK
5 1
6 1
df['shipping_country'] = df['shipping_country'].replace('', np.nan)  # pd.np was removed in newer pandas; use numpy directly
df.iloc[::-1].groupby('memberID').ffill().groupby('memberID').bfill()
memberID shipping_country
0 264991 Canada
1 264991 Canada
2 100 USA
3 5000 UK
4 5000 UK
5 1 NaN
6 1 NaN
I have two dfs that look like the following:
Df1:
area team score
ontario team 1 60
ontario team 3 30
ontario team 2 50
new york team 1 90
new york team 2 30
Df2:
area team score
ontario team 1 60
ontario team 3 30
ontario team 2 50
new york team 1 90
new york team 2 70
If I do the following:
merge = pd.merge(df1, df2, on=['area', 'team'])
I get:
merge:
area team score_x score_y
ontario team 1 60 60
ontario team 3 30 30
ontario team 2 50 50
new york team 1 90 90
new york team 2 30 70
It can be noted that the score in the last row of both dfs is different.
I would like to find what the percent difference is in between score_x and score_y.
However, I actually have hundreds of metrics such as "score". How can I find the percent difference of each column of the merged df which has the same key, before the merge is done and the _x and _y are appended?
What's the best way to do this? I guess I could just get a list of the common keys, append a _y and _x to each, and then go through the list and check the percent difference of both columns, but is there a better way to do this?
Just set 'area' and 'team' as the frame index and do the "normal" math:
df1.set_index(['area','team'], inplace=True)
df2.set_index(['area','team'], inplace=True)
(df1 - df2) / df1
# score
#area team
#ontario team 1 0.000000
# team 3 0.000000
# team 2 0.000000
#new york team 1 0.000000
# team 2 -1.333333
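Here is a runnable sketch of the same approach with more than one metric, expressed as a percentage. The score values come from the question; the wins column is a hypothetical second metric added only to show that every shared column is handled at once. It avoids inplace=True so the original frames stay untouched.

```python
import pandas as pd

keys = ["area", "team"]
df1 = pd.DataFrame({
    "area": ["ontario", "ontario", "ontario", "new york", "new york"],
    "team": ["team 1", "team 3", "team 2", "team 1", "team 2"],
    "score": [60, 30, 50, 90, 30],
    "wins": [3, 1, 2, 5, 1],  # hypothetical extra metric
})
df2 = pd.DataFrame({
    "area": ["ontario", "ontario", "ontario", "new york", "new york"],
    "team": ["team 1", "team 3", "team 2", "team 1", "team 2"],
    "score": [60, 30, 50, 90, 70],
    "wins": [3, 1, 2, 5, 2],  # hypothetical extra metric
})

a = df1.set_index(keys)
b = df2.set_index(keys)

# Arithmetic aligns on the (area, team) index and on the column names,
# so every metric column is compared in one expression
pct_diff = (a - b) / a * 100
print(pct_diff)
```

A zero in df1 would give inf or NaN for that cell, so those rows may need separate handling.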