Pandas - Merge and Groupby different dataframes and create new columns - python

I have n dataframes, each with several City columns.
df1:
ID City City1 City2 .... CityN
444x Lima DC
222x Rica Dallas
555x Rio London
333x NYC Tokyo
777x SF Nairobi
df2:
ID City City1 City2 .... CityN
000x Lima Miami
888x Cct Texas
999x Delhi
444x Tokyo Ktm
333x Aus Paris
dfN:
ID City City1 City2 .... CityN
444x Lima DC
333x Rica Dallas
555x Rio London
666x NYC Tokyo
777x SF Nairobi
I have tried merging the dataframes one by one, but the City column values get overwritten by the last dataframe's values.
dfOutput=df1.merge(df2, how='left', on='ID')
What I would like is retain all these City1, City2, ...CityN column values. I have listed the example output below.
ID City1 City2 City3 City4 City5 City6
444x Tokyo Lima DC Miami Ktm
333x NYC Tokyo Aus Paris Rica Dallas
And so on for the remaining IDs. I also tried a groupby on ID, taken from another question here on SO:
cities = df.groupby('ID')['City'].apply(lambda x: pd.Series([city for city in x])).unstack()
Thanks for your help.

IIUC you could use pd.merge without the how='left' argument, so the default inner join keeps only IDs present in both frames and suffixes the duplicate columns instead of overwriting them:
In [14]: df1
Out[14]:
ID City City1 City2
0 444x Lima - DC
1 222x Rica Dallas -
2 555x Rio London -
3 333x NYC Tokyo -
4 777x SF - Nairobi
In [15]: df2
Out[15]:
ID City City1 City2
0 000x Lima - Miami
1 888x Cct Texas -
2 999x Delhi - -
3 444x Tokyo Ktm -
4 333x Aus - Paris
In [16]: pd.merge(df1, df2, on='ID')
Out[16]:
ID City_x City1_x City2_x City_y City1_y City2_y
0 444x Lima - DC Tokyo Ktm -
1 333x NYC Tokyo - Aus - Paris
Then you could rename the columns of the merged dataframe (df3 = pd.merge(df1, df2, on='ID') here):
cols = ['ID'] + ['City' + str(i) for i in range(1, len(df3.columns))]
df3.columns = cols
In [21]: cols
Out[21]: ['ID', 'City1', 'City2', 'City3', 'City4', 'City5', 'City6']
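Putting the answer together, a minimal runnable sketch (two small frames with the column names from the example; the number of City columns is assumed):

```python
import pandas as pd

df1 = pd.DataFrame({'ID': ['444x', '333x'],
                    'City': ['Lima', 'NYC'],
                    'City1': ['DC', 'Tokyo']})
df2 = pd.DataFrame({'ID': ['444x', '333x'],
                    'City': ['Tokyo', 'Aus'],
                    'City1': ['Ktm', 'Paris']})

# the default inner merge keeps only IDs present in both frames and
# suffixes the duplicated City columns instead of overwriting them
df3 = pd.merge(df1, df2, on='ID')

# rename every column after ID to City1..CityN
df3.columns = ['ID'] + ['City' + str(i) for i in range(1, len(df3.columns))]
print(df3)
```

With more than two frames, the same merge-then-rename step can be repeated in a loop over the remaining dataframes.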

Related

How to match identical columns' fields from different DataFrames in Python?

I need to match the identical fields of two columns from two separate dataframes and rewrite the original dataframe based on the other one.
So I have this original df:
Original Car Brand Original City
0 Daimler Chicago
1 Mitsubishi LA
2 Tesla Vienna
3 Toyota Zurich
4 Renault Sydney
5 Ford Toronto
6 BMW Hamburg
7 Audi Sport Helsinki
8 Citroen Dublin
9 Chevrolet Brisbane
10 Fiat San Francisco
11 Audi New York City
12 Ferrari Oslo
13 Volkswagen Stockholm
14 Lamborghini Singapore
15 Mercedes Lisbon
16 Jaguar Boston
And this new df:
Car Brand Current City
0 Tesla Amsterdam
1 Renault Paris
2 BMW Munich
3 Fiat Detroit
4 Audi Berlin
5 Ferrari Bruxelles
6 Lamborghini Rome
7 Mercedes Madrid
I need to match the car brands that are identical across the two dataframes above and write the new associated city into the original df, so the result should be this (for example, Tesla is now Amsterdam instead of Vienna):
Original Car Brand Original City
0 Daimler Chicago
1 Mitsubishi LA
2 Tesla Amsterdam
3 Toyota Zurich
4 Renault Paris
5 Ford Toronto
6 BMW Munich
7 Audi Sport Helsinki
8 Citroen Dublin
9 Chevrolet Brisbane
10 Fiat Detroit
11 Audi Berlin
12 Ferrari Bruxelles
13 Volkswagen Stockholm
14 Lamborghini Rome
15 Mercedes Madrid
16 Jaguar Boston
I tried this code to map the columns and rewrite the field, but it doesn't work and I cannot figure out how to make it work:
original_df['Original City'] = original_df['Car Brand'].map(dict(corrected_df[['Car Brand', 'Current City']]))
How can I make it work? Thanks a lot!
P.S.: Code for df:
cars = ['Daimler', 'Mitsubishi','Tesla', 'Toyota', 'Renault', 'Ford','BMW', 'Audi Sport','Citroen', 'Chevrolet', 'Fiat', 'Audi', 'Ferrari', 'Volkswagen','Lamborghini', 'Mercedes', 'Jaguar']
cities = ['Chicago', 'LA', 'Vienna', 'Zurich', 'Sydney', 'Toronto', 'Hamburg', 'Helsinki', 'Dublin', 'Brisbane', 'San Francisco', 'New York City', 'Oslo', 'Stockholm', 'Singapore', 'Lisbon', 'Boston']
data = {'Original Car Brand': cars, 'Original City': cities}
original_df = pd.DataFrame(data, columns=['Original Car Brand', 'Original City'])
---
cars = ['Tesla', 'Renault', 'BMW', 'Fiat', 'Audi', 'Ferrari', 'Lamborghini', 'Mercedes']
cities = ['Amsterdam', 'Paris', 'Munich', 'Detroit', 'Berlin', 'Bruxelles', 'Rome', 'Madrid']
data = {'Car Brand': cars, 'Current City': cities}
corrected_df = pd.DataFrame(data, columns=['Car Brand', 'Current City'])
Use Series.map, then replace the unmatched values with the original column via Series.fillna:
s = corrected_df.set_index('Car Brand')['Current City']
original_df['Original City'] = (original_df['Original Car Brand'].map(s)
.fillna(original_df['Original City']))
print (original_df)
Original Car Brand Original City
0 Daimler Chicago
1 Mitsubishi LA
2 Tesla Amsterdam
3 Toyota Zurich
4 Renault Paris
5 Ford Toronto
6 BMW Munich
7 Audi Sport Helsinki
8 Citroen Dublin
9 Chevrolet Brisbane
10 Fiat Detroit
11 Audi Berlin
12 Ferrari Bruxelles
13 Volkswagen Stockholm
14 Lamborghini Rome
15 Mercedes Madrid
16 Jaguar Boston
Your solution works if you convert both columns to a NumPy array before building the dict:
d = dict(corrected_df[['Car Brand','Current City']].to_numpy())
original_df['Original City'] = (original_df['Original Car Brand'].map(d)
.fillna(original_df['Original City']))
You can use set_index() and assign() method:
resultdf=original_df.set_index('Original Car Brand').assign(OriginalCity=corrected_df.set_index('Car Brand'))
Finally use fillna() method and reset_index() method:
resultdf=resultdf['OriginalCity'].fillna(resultdf['Original City']).reset_index()
You can also use DataFrame.update:
df1 = df1.set_index('Original Car Brand')
df1.update(df2.set_index('Car Brand'))
df1 = df1.reset_index()
A merge can do the work as well:
original_df['Original City'] = original_df.merge(corrected_df,left_on='Original Car Brand', right_on='Car Brand',how='left')['Current City'].fillna(original_df['Original City'])
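For reference, a minimal end-to-end version of the map + fillna answer, using a trimmed subset of the question's data:

```python
import pandas as pd

original_df = pd.DataFrame({
    'Original Car Brand': ['Daimler', 'Tesla', 'Toyota', 'Renault'],
    'Original City': ['Chicago', 'Vienna', 'Zurich', 'Sydney'],
})
corrected_df = pd.DataFrame({
    'Car Brand': ['Tesla', 'Renault'],
    'Current City': ['Amsterdam', 'Paris'],
})

# brand -> current city lookup Series
s = corrected_df.set_index('Car Brand')['Current City']

# map the matched brands; keep the original city for unmatched rows
original_df['Original City'] = (original_df['Original Car Brand'].map(s)
                                .fillna(original_df['Original City']))
print(original_df)
```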

Pandas merge of two dataframes works on one column but not all

I have two dataframes:
df1:
Name Age Gender Phone
John 50 M 123458
James 60 M 1234522
Jenny 40 F 123459
Zoe 51 F 1234566
df2:
Name Age City Country
John 50 Sydney AUS
James 60 London UK
Jenny 40 NY USA
Zoe 51 Melbourne AUS
I run the following merge:
df1= df1.merge(df2, left_on=['Name', 'Age'],right_on=['Name','Age'], how='outer'
, suffixes=('', '_DROP')).filter(regex='^(?!.*_DROP)')
Then I get an output that looks like:
df1:
Name Age Gender Phone City Country
John 50 M 123458 AUS
James 60 M 1234522 UK
Jenny 40 F 123459 USA
Zoe 51 F 1234566 AUS
John Sydney
James London
Jenny NY
Zoe Melbourne
I'm not sure why one column matches (so the join keys partially work) but the other doesn't. Really bizarre, and I'm not sure how to troubleshoot it!
Thanks very much!
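No answer is recorded above, but this symptom (an outer merge that stacks unmatched copies of the rows instead of joining them) usually means the join keys compare unequal, most often because of a dtype mismatch. A sketch of the likely diagnosis and fix, assuming Age is numeric in one frame and a string in the other:

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['John', 'Zoe'],
                    'Age': [50, 51],
                    'Phone': [123458, 1234566]})
# Age arrives as strings here, e.g. after reading from CSV or Excel
df2 = pd.DataFrame({'Name': ['John', 'Zoe'],
                    'Age': ['50', '51'],
                    'City': ['Sydney', 'Melbourne']})

# int64 vs object: 50 != '50', so an outer merge on ['Name', 'Age']
# matches nothing and produces the doubled rows from the question
print(df1['Age'].dtype, df2['Age'].dtype)

# align the key dtypes, then merge normally
df2['Age'] = df2['Age'].astype(df1['Age'].dtype)
merged = df1.merge(df2, on=['Name', 'Age'], how='outer')
print(merged)
```

Checking `df.dtypes` on both frames before merging is the quickest way to confirm whether this is the cause.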

Pandas dataframe Split One column data into 2 using some condition

I have one dataframe which is below-
0
____________________________________
0 Country| India
60 Delhi
62 Mumbai
68 Chennai
75 Country| Italy
78 Rome
80 Venice
85 Milan
88 Country| Australia
100 Sydney
103 Melbourne
107 Perth
I want to split the data into 2 columns so that one column contains the country and the other the city. I have no idea where to start. I want something like below:
0 1
____________________________________
0 Country| India Delhi
1 Country| India Mumbai
2 Country| India Chennai
3 Country| Italy Rome
4 Country| Italy Venice
5 Country| Italy Milan
6 Country| Australia Sydney
7 Country| Australia Melbourne
8 Country| Australia Perth
Any idea how to do this?
Look for rows where | is present, pull them into another column, and fill down on the newly created column:
(
df.rename(columns={"0": "city"})
# this looks for rows that contain '|' and puts them into a
# new column called Country. rows that do not match will be
# null in the new column.
.assign(Country=lambda x: x.loc[x.city.str.contains(r"\|"), "city"])
# fill down on the Country column, this also has the benefit
# of linking the Country with the City,
.ffill()
# here we get rid of duplicate Country entries in city and Country
# this ensures that only Country entries are in the Country column
# and cities are in the City column
.query("city != Country")
# here we reverse the column positions to match your expected output
.iloc[:, ::-1]
)
Country city
60 Country| India Delhi
62 Country| India Mumbai
68 Country| India Chennai
78 Country| Italy Rome
80 Country| Italy Venice
85 Country| Italy Milan
100 Country| Australia Sydney
103 Country| Australia Melbourne
107 Country| Australia Perth
Use DataFrame.insert with Series.where and Series.str.startswith to turn the non-country rows into missing values, forward-fill them with ffill, and then drop the rows where both columns hold the same value using Series.ne (not equal) in boolean indexing:
df.insert(0, 'country', df[0].where(df[0].str.startswith('Country')).ffill())
df = df[df['country'].ne(df[0])].reset_index(drop=True).rename(columns={0:'city'})
print (df)
country city
0 Country|India Delhi
1 Country|India Mumbai
2 Country|India Chennai
3 Country|Italy Rome
4 Country|Italy Venice
5 Country|Italy Milan
6 Country|Australia Sydney
7 Country|Australia Melbourne
8 Country|Australia Perth
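A compact runnable version of the insert/where/ffill answer (with a trimmed copy of the question's data):

```python
import pandas as pd

df = pd.DataFrame({0: ['Country| India', 'Delhi', 'Mumbai',
                       'Country| Italy', 'Rome']})

# rows starting with 'Country' become the country column; all other
# rows become NaN there and are forward-filled from the row above
df.insert(0, 'country', df[0].where(df[0].str.startswith('Country')).ffill())

# drop the rows that only carried the country label, keep the cities
df = (df[df['country'].ne(df[0])]
      .reset_index(drop=True)
      .rename(columns={0: 'city'}))
print(df)
```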

python groupby: how to move hierarchical grouped data from columns into rows?

I've got a python/pandas groupby that is grouped on name and looks like this:
name gender year city city total
jane female 2011 Detroit 1
2015 Chicago 1
dan male 2009 Lexington 1
bill male 2001 New York 1
2003 Buffalo 1
2000 San Francisco 1
and I want it to look like this:
name gender year1 city1 year2 city2 year3 city3 city total
jane female 2011 Detroit 2015 Chicago 2
dan male 2009 Lexington 1
bill male 2000 Chico 2001 NewYork 2003 Buffalo 3
so I want to keep the grouping by name, order by year, and have each name occupy only one row. It's maybe a variation on dummy variables? I'm not even sure how to summarize it.
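No answer is recorded above; one common recipe for this long-to-wide reshape is to number each person's rows with groupby().cumcount() and pivot on that counter. A sketch, assuming the data starts as flat columns name/gender/year/city (the column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'name':   ['jane', 'jane', 'dan'],
    'gender': ['female', 'female', 'male'],
    'year':   [2011, 2015, 2009],
    'city':   ['Detroit', 'Chicago', 'Lexington'],
})

df = df.sort_values(['name', 'year'])
# number each person's entries 1..k in year order
df['k'] = df.groupby('name').cumcount() + 1

wide = df.pivot(index=['name', 'gender'], columns='k', values=['year', 'city'])
# group the columns per entry number, then flatten ('year', 1) -> 'year1'
wide = wide.sort_index(axis=1, level=1)
wide.columns = [f'{col}{k}' for col, k in wide.columns]
wide = wide.reset_index()
wide['city total'] = wide['name'].map(df.groupby('name').size())
print(wide)
```

People with fewer entries than the maximum get NaN in the unused yearN/cityN slots, which matches the blanks in the desired output.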

pandas join gives NaN values

I want to join 2 DataFrames
Zipcode Database (first 10 entries)
0 zip_code City State County Population
0 0 90001 Los Angeles California Los Angeles 54481
1 1 90002 Los Angeles California Los Angeles 44584
2 2 90003 Los Angeles California Los Angeles 58187
3 3 90004 Los Angeles California Los Angeles 67850
4 4 90005 Los Angeles California Los Angeles 43014
5 5 90006 Los Angeles California Los Angeles 62765
6 6 90007 Los Angeles California Los Angeles 45021
7 7 90008 Los Angeles California Los Angeles 30840
8 8 90009 Los Angeles California Los Angeles -
9 9 90010 Los Angeles California Los Angeles 1943
And data (first 10 entries)
buyer zip_code
0 SWEENEY,THOMAS R & MICHELLE H NaN
1 DOUGHERTY,HERBERT III & JENNIFER M NaN
2 WEST COAST RLTY SVCS INC NaN
3 LOVE,JULIE M NaN
4 SAHAR,DAVID NaN
5 SILBERSTERN,BRADLEY E TRUST 91199
6 LEE,SUSAN & JIMMY C 92025
7 FRAZZANO REAL ESTATE I NC NaN
8 RUV INVESTMENTS LLC 91730
9 KAOS KAPITAL LLC NaN
So the final table should have [buyer, zip_code, City, County]. I'm joining with respect to Zip code.
data_2 = data.join(zipcode_database[['City', 'County', 'zip_code']].set_index('zip_code'), on='zip_code')
But the City and County columns are NaN even for the rows in data where a zip code is actually present.
buyer zip_code City County
10 LANDON AVE TRUST 37736 NaN NaN NaN
11 UMAR,AHMAD NaN NaN NaN
12 3 JPS INC 90717 NaN NaN
13 T & L HOLDINGS INC 95610 NaN NaN
14 CAHP HOLDINGS LLC 90808 NaN NaN
15 REBUILDING TOGETHER LONG BEACH 92344 NaN NaN
16 COLFIN AI-CA 4 LLC NaN NaN NaN
17 GUTIERREZ,HUGO 91381 NaN NaN
18 VALBRIDGE CAP GOLDEN GATE FUND NaN NaN NaN
19 SOLARES,OSCAR 92570 NaN NaN
Why is this the case? The zipcode database has all zip codes from 90001 to 99950.
My first thought is the datatype of "zip_code" in both are different:
print(zipcode_database['zip_code'].dtype)
print(data['zip_code'].dtype)
Output:
int64
object
I thought of typecasting with astype, but that does not work with NaN values. Any thoughts?
You can cast NaN values to float types, but not int. In your case I would cast the zip_code field in both DataFrames to a float and then join.
zipcode_database.zip_code = zipcode_database.zip_code.astype(float)
data.zip_code = data.zip_code.astype(float)
data_2 = data.join(zipcode_database[['City', 'County', 'zip_code']].set_index('zip_code'), on='zip_code')
I can't reproduce anything meaningful from your example data (no matching zip codes), but that should fix the issue.
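A small reproducible version of the fix, with synthetic rows standing in for the real data:

```python
import numpy as np
import pandas as pd

zipcode_database = pd.DataFrame({'zip_code': [90001, 90002],   # int64
                                 'City': ['Los Angeles'] * 2,
                                 'County': ['Los Angeles'] * 2})
data = pd.DataFrame({'buyer': ['RUV INVESTMENTS LLC', '3 JPS INC', 'KAOS KAPITAL LLC'],
                     'zip_code': ['90001', '90002', np.nan]})  # object

# cast both key columns to float: int64 vs object keys never match,
# and float (unlike int64) can represent the NaN entries
zipcode_database['zip_code'] = zipcode_database['zip_code'].astype(float)
data['zip_code'] = data['zip_code'].astype(float)

data_2 = data.join(zipcode_database[['City', 'County', 'zip_code']]
                   .set_index('zip_code'), on='zip_code')
print(data_2)
```

Rows whose zip_code is NaN simply keep NaN in City and County, since NaN never matches an index entry.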
