How to match identical columns' fields from different DataFrames in Python?

I need to match the identical fields of two columns from two separate dataframes and update the original dataframe based on the other one.
So I have this original df:
Original Car Brand Original City
0 Daimler Chicago
1 Mitsubishi LA
2 Tesla Vienna
3 Toyota Zurich
4 Renault Sydney
5 Ford Toronto
6 BMW Hamburg
7 Audi Sport Helsinki
8 Citroen Dublin
9 Chevrolet Brisbane
10 Fiat San Francisco
11 Audi New York City
12 Ferrari Oslo
13 Volkswagen Stockholm
14 Lamborghini Singapore
15 Mercedes Lisbon
16 Jaguar Boston
And this new df:
Car Brand Current City
0 Tesla Amsterdam
1 Renault Paris
2 BMW Munich
3 Fiat Detroit
4 Audi Berlin
5 Ferrari Bruxelles
6 Lamborghini Rome
7 Mercedes Madrid
I need to match the car brands that are identical across the two dataframes above and write the new associated city into the original df, so the result should be this (for example, Tesla is now Amsterdam instead of Vienna):
Original Car Brand Original City
0 Daimler Chicago
1 Mitsubishi LA
2 Tesla Amsterdam
3 Toyota Zurich
4 Renault Paris
5 Ford Toronto
6 BMW Munich
7 Audi Sport Helsinki
8 Citroen Dublin
9 Chevrolet Brisbane
10 Fiat Detroit
11 Audi Berlin
12 Ferrari Bruxelles
13 Volkswagen Stockholm
14 Lamborghini Rome
15 Mercedes Madrid
16 Jaguar Boston
I tried this code for mapping the columns and rewriting the field, but it doesn't work and I cannot figure out why:
original_df['Original City'] = original_df['Car Brand'].map(dict(corrected_df[['Car Brand', 'Current City']]))
How can I make it work? Thanks a lot!
P.S.: Code for df:
import pandas as pd

cars = ['Daimler', 'Mitsubishi', 'Tesla', 'Toyota', 'Renault', 'Ford', 'BMW', 'Audi Sport', 'Citroen', 'Chevrolet', 'Fiat', 'Audi', 'Ferrari', 'Volkswagen', 'Lamborghini', 'Mercedes', 'Jaguar']
cities = ['Chicago', 'LA', 'Vienna', 'Zurich', 'Sydney', 'Toronto', 'Hamburg', 'Helsinki', 'Dublin', 'Brisbane', 'San Francisco', 'New York City', 'Oslo', 'Stockholm', 'Singapore', 'Lisbon', 'Boston']
data = {'Original Car Brand': cars, 'Original City': cities}
original_df = pd.DataFrame(data, columns=['Original Car Brand', 'Original City'])
---
cars = ['Tesla', 'Renault', 'BMW', 'Fiat', 'Audi', 'Ferrari', 'Lamborghini', 'Mercedes']
cities = ['Amsterdam', 'Paris', 'Munich', 'Detroit', 'Berlin', 'Bruxelles', 'Rome', 'Madrid']
data = {'Car Brand': cars, 'Current City': cities}
corrected_df = pd.DataFrame(data, columns=['Car Brand', 'Current City'])

Use Series.map and replace the unmatched values with the original column via Series.fillna:
s = corrected_df.set_index('Car Brand')['Current City']
original_df['Original City'] = (original_df['Original Car Brand'].map(s)
                                .fillna(original_df['Original City']))
print(original_df)
Original Car Brand Original City
0 Daimler Chicago
1 Mitsubishi LA
2 Tesla Amsterdam
3 Toyota Zurich
4 Renault Paris
5 Ford Toronto
6 BMW Munich
7 Audi Sport Helsinki
8 Citroen Dublin
9 Chevrolet Brisbane
10 Fiat Detroit
11 Audi Berlin
12 Ferrari Bruxelles
13 Volkswagen Stockholm
14 Lamborghini Rome
15 Mercedes Madrid
16 Jaguar Boston
Your solution fails because calling dict() on a two-column DataFrame maps column names to Series, not brands to cities; convert both columns to a NumPy array before building the dict:
d = dict(corrected_df[['Car Brand','Current City']].to_numpy())
original_df['Original City'] = (original_df['Original Car Brand'].map(d)
                                .fillna(original_df['Original City']))
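An equivalent way to build the mapping without the NumPy round trip (a minor variant, not from the original answer) is zip:
d = dict(zip(corrected_df['Car Brand'], corrected_df['Current City']))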

You can use the set_index() and assign() methods; assign aligns on the index, which is why both frames are indexed by brand here:
resultdf = original_df.set_index('Original Car Brand').assign(OriginalCity=corrected_df.set_index('Car Brand')['Current City'])
Finally, use the fillna() and reset_index() methods:
resultdf = resultdf['OriginalCity'].fillna(resultdf['Original City']).reset_index()

Let us try update. Note that update aligns on both index and column labels, so the city column must be renamed to match first:
df1 = original_df.set_index('Original Car Brand')
df1.update(corrected_df.set_index('Car Brand').rename(columns={'Current City': 'Original City'}))
df1 = df1.reset_index()

Merge can do the work as well:
original_df['Original City'] = (original_df
    .merge(corrected_df, left_on='Original Car Brand', right_on='Car Brand', how='left')
    ['Current City']
    .fillna(original_df['Original City']))

Related

How to insert concatenated columns into a pivot table in pandas

I have a data frame that I am transforming into a pivot table. I want to add concatenated columns as the values within the pivot.
import pandas as pd
import numpy as np
# creating a dataframe
df = pd.DataFrame({'Student': ['John', 'Boby', 'Mina', 'Peter', 'Nicky'],
                   'Grade': ['Masters', 'Graduate', 'Graduate', 'Masters', 'Graduate'],
                   'Major': ['Liberal Arts', 'Business', 'Sciences', 'Education', 'Law'],
                   'Age': [27, 23, 21, 23, 24],
                   'City': ['Boston', 'Brooklyn', 'Camden', 'Chicago', 'Manhattan'],
                   'State': ['MA', 'NY', 'NJ', 'IL', 'NY'],
                   'Years': [2, 4, 3, 3, 4]})
Displays this table
Student Grade Major Age City State Years
0 John Masters Liberal Arts 27 Boston MA 2
1 Boby Graduate Business 23 Brooklyn NY 4
2 Mina Graduate Sciences 21 Camden NJ 3
3 Peter Masters Education 23 Chicago IL 3
4 Nicky Graduate Law 24 Manhattan NY 4
Concatenated Columns
values = pd.concat([df['Age'],df['Years']], axis=1, ignore_index=True)
Displays this result
0 1
0 27 2
1 23 4
2 21 3
3 23 3
4 24 4
I want to add the concatenated columns (values) inside the pivot table, so the table displays Age and Years in adjacent columns rather than as separate pivot tables:
table = pd.pivot_table(df, values=['Age', 'Years'], index=['Student', 'City', 'State'], columns=['Grade', 'Major'], aggfunc=np.sum)
Grade Graduate Masters
Major Business Law Sciences Education Liberal Arts
Student City State
Boby Brooklyn NY 23.0 NaN NaN NaN NaN
John Boston MA NaN NaN NaN NaN 27.0
Mina Camden NJ NaN NaN 21.0 NaN NaN
Nicky Manhattan NY NaN 24.0 NaN NaN NaN
Peter Chicago IL NaN NaN NaN 23.0 NaN
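One way to get Age and Years adjacent for each Grade/Major pair (a sketch, not a verified answer to this question) is to move the value level to the innermost position of the column MultiIndex and re-sort:
table = pd.pivot_table(df, values=['Age', 'Years'], index=['Student', 'City', 'State'], columns=['Grade', 'Major'], aggfunc=np.sum)
# move the 'Age'/'Years' level from the top of the column index to the bottom,
# then sort so each (Grade, Major) pair shows its Age and Years side by side
table = table.reorder_levels([1, 2, 0], axis=1).sort_index(axis=1)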

Pandas dataframe Split One column data into 2 using some condition

I have a dataframe, shown below:
0
____________________________________
0 Country| India
60 Delhi
62 Mumbai
68 Chennai
75 Country| Italy
78 Rome
80 Venice
85 Milan
88 Country| Australia
100 Sydney
103 Melbourne
107 Perth
I want to split the data into 2 columns so that one column holds the country and the other holds the city. I have no idea where to start. I want output like the below:
0 1
____________________________________
0 Country| India Delhi
1 Country| India Mumbai
2 Country| India Chennai
3 Country| Italy Rome
4 Country| Italy Venice
5 Country| Italy Milan
6 Country| Australia Sydney
7 Country| Australia Melbourne
8 Country| Australia Perth
Any idea how to do this?
Look for rows where | is present, pull them into another column, and fill down on the newly created column:
(
df.rename(columns={"0": "city"})
# this looks for rows that contain '|' and puts them into a
# new column called Country. rows that do not match will be
# null in the new column.
.assign(Country=lambda x: x.loc[x.city.str.contains(r"\|"), "city"])
# fill down on the Country column, this also has the benefit
# of linking the Country with the City,
.ffill()
# here we get rid of duplicate Country entries in city and Country
# this ensures that only Country entries are in the Country column
# and cities are in the City column
.query("city != Country")
# here we reverse the column positions to match your expected output
.iloc[:, ::-1]
)
Country city
60 Country| India Delhi
62 Country| India Mumbai
68 Country| India Chennai
78 Country| Italy Rome
80 Country| Italy Venice
85 Country| Italy Milan
100 Country| Australia Sydney
103 Country| Australia Melbourne
107 Country| Australia Perth
Use DataFrame.insert with Series.where and Series.str.startswith to replace non-matching values with missing values, forward fill them with ffill, and then remove rows with the same value in both columns using Series.ne (not equal) in boolean indexing:
df.insert(0, 'country', df[0].where(df[0].str.startswith('Country')).ffill())
df = df[df['country'].ne(df[0])].reset_index(drop=True).rename(columns={0:'city'})
print(df)
country city
0 Country|India Delhi
1 Country|India Mumbai
2 Country|India Chennai
3 Country|Italy Rome
4 Country|Italy Venice
5 Country|Italy Milan
6 Country|Australia Sydney
7 Country|Australia Melbourne
8 Country|Australia Perth
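If you also want to drop the 'Country|' prefix from the country column (an extra step beyond what the question asks for), a plain string replace works:
df['country'] = df['country'].str.replace('Country|', '', regex=False).str.strip()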

How to compare the sum of men's scores to the sum of women's scores to get the count of countries?

Let's say this is my data frame:
country Edition sports Athletes Medal Gender Score
Germany 1990 Aquatics HAJOS, Alfred gold M 3
Germany 1990 Aquatics HIRSCHMANN, Otto silver M 2
Germany 1990 Aquatics DRIVAS, Dimitrios gold W 3
Germany 1990 Aquatics DRIVAS, Dimitrios silver W 2
US 2008 Athletics MALOKINIS, Ioannis gold M 1
US 2008 Athletics HAJOS, Alfred silver M 2
US 2009 Athletics CHASAPIS, Spiridon gold W 3
France 2010 Athletics CHOROPHAS, Efstathios gold W 3
France 2010 Athletics CHOROPHAS, Efstathios gold M 3
France 2010 golf HAJOS, Alfred Bronze M 1
France 2011 golf ANDREOU, Joannis silver W 2
Spain 2011 golf BURKE, Thomas gold M 3
I am trying to find for how many countries the sum of the men's scores equals the sum of the women's scores.
I have tried the following:
sum_men = df[df['Gender']=='M'].groupby('country')['Score'].sum()
sum_women = df[df['Gender']=='W'].groupby('country')['Score'].sum()
Now I don't know how to compare these two and count the countries whose men's score sum equals the women's score sum.
Can anyone please help me with this?
You can do this:
sum_men = df[df['Gender']=='M'].groupby('country')['Score'].sum().reset_index()  # watch the reset_index()
sum_women = df[df['Gender']=='W'].groupby('country')['Score'].sum().reset_index()
new_df = sum_men.merge(sum_women, on='country')
new_df['diff'] = new_df['Score_x'] - new_df['Score_y']
new_df
country Score_x Score_y diff
0 France 4 5 -1
1 Germany 5 5 0
2 US 3 3 0
print(new_df[new_df['diff']==0])
country Score_x Score_y diff
1 Germany 5 5 0
2 US 3 3 0
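To get the count the question asks for, you could finish with len() (a small addition to the answer above):
print(len(new_df[new_df['diff'] == 0]))
# 2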
Not sure if you want to keep the countries that are equal or the ones that are not, but the same logic applies either way:
group = df.groupby(['country', 'Gender'])['Score'].sum().unstack()
not_equal = group[group.M != group.W]
filtered_df = df[df.country.isin(not_equal.index)]
Output:
country Edition sports Athletes Medal Gender Score
7 France 2010 Athletics CHOROPHAS, Efstathios gold W 3
8 France 2010 Athletics CHOROPHAS, Efstathios gold M 3
9 France 2010 golf HAJOS, Alfred Bronze M 1
10 France 2011 golf ANDREOU, Joannis silver W 2
11 Spain 2011 golf BURKE, Thomas gold M 3
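A compact way to get the requested count directly (a sketch using the same unstack idea):
sums = df.groupby(['country', 'Gender'])['Score'].sum().unstack()
print((sums['M'] == sums['W']).sum())  # countries with equal sums -> 2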

How to use groupby() in this case?

Let's say there is a data frame:
country edition sports Athletes Medals
Germany 1990 Aquatics HAJOS, Alfred silver
Germany 1990 Aquatics HIRSCHMANN, Otto silver
Germany 1990 Aquatics DRIVAS, Dimitrios silver
US 2008 Athletics MALOKINIS, Ioannis silver
US 2008 Athletics HAJOS, Alfred silver
US 2009 Athletics CHASAPIS, Spiridon gold
France 2010 Athletics CHOROPHAS, Efstathios gold
France 2010 golf HAJOS, Alfred silver
France 2011 golf ANDREOU, Joannis silver
I want to find out which edition distributed the most silver medals.
So I'm trying to solve it with the groupby function this way:
df.groupby('Edition')[df['Medal']=='Silver'].count().idxmax()
but it's giving me
KeyError: 'Columns not found: False, True'
Can anyone tell me what the issue is?
So here's your pandas dataframe:
import pandas as pd
data = [
    ['Germany', 1990, 'Aquatics', 'HAJOS, Alfred', 'silver'],
    ['Germany', 1990, 'Aquatics', 'IRSCHMANN, Otto', 'silver'],
    ['Germany', 1990, 'Aquatics', 'DRIVAS, Dimitrios', 'silver'],
    ['US', 2008, 'Athletics', 'MALOKINIS, Ioannis', 'silver'],
    ['US', 2008, 'Athletics', 'HAJOS, Alfred', 'silver'],
    ['US', 2009, 'Athletics', 'CHASAPIS, Spiridon', 'gold'],
    ['France', 2010, 'Athletics', 'CHOROPHAS, Efstathios', 'gold'],
    ['France', 2010, 'golf', 'HAJOS, Alfred', 'silver'],
    ['France', 2011, 'golf', 'ANDREOU, Joannis', 'silver']
]
df = pd.DataFrame(data, columns=['country', 'edition', 'sports', 'Athletes', 'Medals'])
print(df)
country edition sports Athletes Medals
0 Germany 1990 Aquatics HAJOS, Alfred silver
1 Germany 1990 Aquatics IRSCHMANN, Otto silver
2 Germany 1990 Aquatics DRIVAS, Dimitrios silver
3 US 2008 Athletics MALOKINIS, Ioannis silver
4 US 2008 Athletics HAJOS, Alfred silver
5 US 2009 Athletics CHASAPIS, Spiridon gold
6 France 2010 Athletics CHOROPHAS, Efstathios gold
7 France 2010 golf HAJOS, Alfred silver
8 France 2011 golf ANDREOU, Joannis silver
Now you can simply filter the silver medals, then groupby edition (note that 'Edition' will throw a KeyError, as opposed to 'edition'), and finally get the count:
df[df.Medals == 'silver'].groupby('edition').count()['Medals'].idxmax()
>>> 1990
You can group by both columns to solve:
df[df['Medals'] == 'silver'].groupby(['edition', 'Medals'], as_index=True)['Athletes'].count().idxmax()
# Outcome:
(1990, 'silver')
df[df['Medals']=='silver'].groupby('edition').size().idxmax()
I tried this and it worked! I just replaced count() with size().
You should count per edition per medal:
>>> df = pd.DataFrame({'edition':[1990,1990,1990,2008,2008,2009,2010,2010,2011],'Medals':['silver','silver','silver','silver','silver','gold','gold','silver','silver']})
>>> df['count'] = df.groupby(['edition','Medals'])['edition'].transform('count')
Then do the filtering on max():
>>> df = df[df['Medals'].isin(['silver'])]
>>> df
edition Medals count
0 1990 silver 3
1 1990 silver 3
2 1990 silver 3
3 2008 silver 2
4 2008 silver 2
7 2010 silver 1
8 2011 silver 1
>>> df = df[df['count'].isin([df['count'].max()])]
>>> df
edition Medals count
0 1990 silver 3
1 1990 silver 3
2 1990 silver 3
or
>>> df[df['count'].isin([df['count'].max()])]['Medals'].unique()[0]
'silver'
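For what it's worth, a compact alternative (equivalent to the size()-based one-liner above):
df.loc[df['Medals'] == 'silver', 'edition'].value_counts().idxmax()
# 1990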

Pandas - Merge and Groupby different dataframes and create new columns

I have n dataframes, each with a number of City columns.
df1:
ID City City1 City2 .... CityN
444x Lima DC
222x Rica Dallas
555x Rio London
333x NYC Tokyo
777x SF Nairobi
df2:
ID City City1 City2 .... CityN
000x Lima Miami
888x Cct Texas
999x Delhi
444x Tokyo Ktm
333x Aus Paris
dfN:
ID City City1 City2 .... CityN
444x Lima DC
333x Rica Dallas
555x Rio London
666x NYC Tokyo
777x SF Nairobi
I have tried merging the dataframes one by one, but the City column values get overwritten by the last dataframe's values.
dfOutput = df1.merge(df2, how='left', on='ID')
What I would like is to retain all of the City1, City2, ... CityN column values. I have listed the example output below.
ID City1 City2 City3 City4 City5 City6
444x Tokyo Lima DC Miami Ktm
333x NYC Tokyo Aus Paris Rica Dallas
And so on for the remaining IDs. I also tried a groupby on ID, taken from another question here on SO.
cities = df.groupby('ID')['City'].apply(lambda x: pd.Series([city for city in x])).unstack()
Thanks for your help.
IIUC you could use pd.merge without the how='left' argument; the default inner join keeps only the IDs present in both frames:
In [14]: df1
Out[14]:
ID City City1 City2
0 444x Lima - DC
1 222x Rica Dallas -
2 555x Rio London -
3 333x NYC Tokyo -
4 777x SF - Nairobi
In [15]: df2
Out[15]:
ID City City1 City2
0 000x Lima - Miami
1 888x Cct Texas -
2 999x Delhi - -
3 444x Tokyo Ktm -
4 333x Aus - Paris
In [16]: pd.merge(df1, df2, on='ID')
Out[16]:
ID City_x City1_x City2_x City_y City1_y City2_y
0 444x Lima - DC Tokyo Ktm -
1 333x NYC Tokyo - Aus - Paris
Then you could rename the columns of the resulting dataframe (df3 here is the merged result):
df3 = pd.merge(df1, df2, on='ID')
cols = ['ID'] + ['City' + str(i) for i in range(1, len(df3.columns))]
In [21]: cols
Out[21]: ['ID', 'City1', 'City2', 'City3', 'City4', 'City5', 'City6']
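To apply the new names, and to extend the idea to more than two frames (a sketch, assuming every frame shares the ID merge key):
from functools import reduce

dfs = [df1, df2]  # append further frames (df3, ..., dfN) as needed
merged = reduce(lambda left, right: pd.merge(left, right, on='ID'), dfs)
merged.columns = ['ID'] + ['City' + str(i) for i in range(1, len(merged.columns))]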
