I'm trying to merge these dataframes so that the final dataframe matches each country/year/GDP row from the first dataframe with its corresponding value from the second dataframe.
first dataframe:
| Country | Country code | year |rgdpe |
|:--------|:------------:|:----:|:-----:|
|country1 | Code1 | year1|rgdpe1 |
|country1 | Code1 | yearn|rgdpen |
|country2 | Code2 | year1|rgdpe1'|
second dataframe:
| countries | value | year |
|:----------|:------:|:----:|
| country1 | value1 | year1|
| country1 | valuen | yearn|
| country2 | Code2 | year1|
combined dataframe:
| Country | Country code | year |rgdpe |value|
|:--------|:------------:|:----:|:-----:|:---:|
|country1 | Code1 | year1|rgdpe1 |value|
|country1 | Code1 | yearn|rgdpen |Value|
|country2 | Code2 | year1|rgdpe1'|Value|
combined=pd.merge(left=df_biofuel_prod, right=df_GDP[['rgdpe']], left_on='Value', right_on='country', how='right')
combined.to_csv('../../combined_test.csv')
The result of this code gives me just the rgdpe column while the other columns are empty.
What would be the most efficient way to merge and match these dataframes?
First, from the data screen cap, it looks like the "country" column in your first dataset "df_GDP" is set as the index. Reset it using "reset_index()". Then merge on multiple columns with left_on=["countries","year"] and right_on=["country","year"]. And since you want to retain all records from your main dataframe "df_biofuel_prod", it should be a "left" join:
combined_df = df_biofuel_prod.merge(df_GDP.reset_index(), left_on=["countries","year"], right_on=["country","year"], how="left")
Full example with dummy data:
import pandas as pd

df_GDP = pd.DataFrame(data=[["USA",2001,400],["USA",2002,450],["CAN",2001,150],["CAN",2002,170]], columns=["country","year","rgdpe"]).set_index("country")
df_biofuel_prod = pd.DataFrame(data=[["USA",400,2001],["USA",450,2003],["CAN",150,2001],["CAN",170,2003]], columns=["countries","Value","year"])
# reset_index() turns the "country" index back into a regular column so it can be merged on
combined_df = df_biofuel_prod.merge(df_GDP.reset_index(), left_on=["countries","year"], right_on=["country","year"], how="left")
[Out]:
countries Value year country rgdpe
0 USA 400 2001 USA 400.0
1 USA 450 2003 NaN NaN
2 CAN 150 2001 CAN 150.0
3 CAN 170 2003 NaN NaN
You see "NaN" where matching data is not available in "df_GDP".
Given the following pandas DataFrame:
| ID     | country | money | other | money_add |
|:-------|:-------:|:-----:|:-----:|:---------:|
| 832932 | France  | 12131 | 19    | 82932     |
| 217#8# | NaN     | NaN   | NaN   | NaN       |
| 1329T2 | NaN     | NaN   | NaN   | NaN       |
| 832932 | France  | NaN   | 30    | NaN       |
| 31728# | NaN     | NaN   | NaN   | NaN       |
I would like to make the following modifications for each row:
If the ID value contains a '#', the row is left unchanged.
If the ID value contains no '#' and country is NaN, "Other" is written to the country column and a 0 to the other column.
Finally, only if the money column is NaN and the other column has a value, we assign money and money_add from the following table:
| other_ID | money | money_add |
|:---------|:-----:|:---------:|
| 19       | 4532  | 723823    |
| 50       | 1213  | 238232    |
| 18       | 1813  | 273283    |
| 30       | 1313  | 83293     |
| 0        | 8932  | 3920      |
Example of the resulting table:
| ID     | country | money | other | money_add |
|:-------|:-------:|:-----:|:-----:|:---------:|
| 832932 | France  | 12131 | 19    | 82932     |
| 217#8# | NaN     | NaN   | NaN   | NaN       |
| 1329T2 | Other   | 8932  | 0     | 3920      |
| 832932 | France  | 1313  | 30    | 83293     |
| 31728# | NaN     | NaN   | NaN   | NaN       |
First set values in both columns where the two conditions match, then align both frames on the lookup key (masking the rows whose ID contains '#') and fill only the missing values with DataFrame.update:
# m1: rows whose ID contains '#'; m2: rows with a missing country
m1 = df['ID'].str.contains('#')
m2 = df['country'].isna()
# rule 2: no '#' in ID and country is NaN -> country = 'Other', other = 0
df.loc[~m1 & m2, ['country','other']] = ['Other', 0]
# align both frames on the lookup key, hiding '#' rows from the match
df1 = df1.set_index(df1['other_ID'])
df = df.set_index(df['other'].mask(m1))
# overwrite=False fills only cells that are NaN in df, keeping existing money values
df.update(df1, overwrite=False)
df = df.reset_index(drop=True)
print (df)
ID country money other money_add
0 832932 France 12131 19.0 82932.0
1  217#8#     NaN     NaN    NaN        NaN
2 1329T2 Other 8932.0 0.0 3920.0
3 832932 France 1313.0 30.0 83293.0
4 31728# NaN NaN NaN NaN
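For comparison, a sketch of the same fill with Series.map instead of update, assuming df and df1 as originally defined (run after the country/other assignment above, in place of the index alignment):
# rows eligible for the lookup: no '#' in ID, money missing, other present
fill = ~df['ID'].str.contains('#') & df['money'].isna() & df['other'].notna()
lookup = df1.set_index('other_ID')
df.loc[fill, 'money'] = df.loc[fill, 'other'].map(lookup['money'])
df.loc[fill, 'money_add'] = df.loc[fill, 'other'].map(lookup['money_add'])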
I'm working on a World Happiness Report project that includes datasets from 2015 to 2019. I concatenated them into a final dataframe to get the average of parameters (economy, health, etc.) for every country across that time span. But what I forgot to add was the region each country is in (e.g. England - Western Europe). How could I add the 'Region' column to my final dataframe and be sure that each region matches its respective country?
Not sure if this is what you are looking for.
You may want to do something like this:
df['Region'] = df['Country'].map(region_df.set_index('Country')['Region'])
Or you can also use a merge statement. The assumption is that each country has a region it can map to.
df = pd.merge(df,region_df,how='left',on = ['Country'])
For better performance, index both DataFrames on Country before you merge.
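For instance, with both frames indexed on Country, DataFrame.join performs the same lookup (a minimal sketch; df and region_df here are the ones built in the data setup below):
# index-aligned left join, equivalent to the column merge above
joined = df.set_index('Country').join(region_df.set_index('Country'), how='left').reset_index()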
data setup
import pandas as pd
c = ['Country','Happiness Score','Other_fields']
d = [['Denmark', 7.5460,1.25],
['Norway',7.5410,1.50],
['Finland',7.5378,1.85]]
region_cols = ['Country','Region']
region_data = [['Denmark','R1'],['Norway','R2'],['Finland','R3']]
df = pd.DataFrame(data = d, columns = c)
region_df = pd.DataFrame(data = region_data, columns = region_cols)
Based on the lookup DataFrame, you can use map on Country to assign Region in df.
df['Region'] = df['Country'].map(region_df.set_index('Country')['Region'])
print (df)
Your result will be as follows:
Base DataFrame:
Country Happiness Score Other_fields
0 Denmark 7.5460 1.25
1 Norway 7.5410 1.50
2 Finland 7.5378 1.85
Lookup DataFrame:
Country Region
0 Denmark R1
1 Norway R2
2 Finland R3
Updated DataFrame:
Country Happiness Score Other_fields Region
0 Denmark 7.5460 1.25 R1
1 Norway 7.5410 1.50 R2
2 Finland 7.5378 1.85 R3
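To be sure every country actually matched a region, a quick check for unmapped rows (a sketch using the frames above):
# any country that did not find a region keeps NaN in the mapped column
missing = df.loc[df['Region'].isna(), 'Country'].unique()
print(missing)  # an empty array means every country matched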
I have a question about merging two tables. Say I have a table A with data consisting of these parameters: Country, City, Zip Code. Also, I have a table B with unique Country names and a column that specifies which continent it is located in (NA, Asia, EU, etc.).
How can I merge the two tables into one such that I will have columns: Country, City, Zip Code and a column that corresponds to the continent of table B?
Many thanks!
You can make use of the pd.merge function.
Example: you have a "country" df with "country", "city" and "zipcode" columns, and a "continent" df with "country" and "continent" columns. Use the pd.merge function on the common column "country":
import pandas as pd

country = pd.DataFrame([['country1','city1','zip1'],['country1','city1','zip2'],['country1','city2','zip3'],['country1','city2','zip4'],
                        ['country2','city3','zip5'],['country2','city3','zip6'],['country2','city4','zip7'],
                        ['country3','city5','zip8'],['country3','city6','zip9']],
                       columns=['country','city','zipcode'])
continent = pd.DataFrame([['country1','A'],['country2','B'],['country3','C'],['country4','D'],['country5','E']],
                         columns=['country','continent'])

country = country.merge(continent, on=['country'])
print(country)
Output:
country city zipcode continent
0 country1 city1 zip1 A
1 country1 city1 zip2 A
2 country1 city2 zip3 A
3 country1 city2 zip4 A
4 country2 city3 zip5 B
5 country2 city3 zip6 B
6 country2 city4 zip7 B
7 country3 city5 zip8 C
8 country3 city6 zip9 C
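Note that merge defaults to an inner join here, so a country missing from the continent table would be dropped from the result. To keep every row of the country table regardless, ask for a left join (a small variation on the call above):
# how='left' keeps all country rows; unmatched continents become NaN
country = country.merge(continent, on=['country'], how='left')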
I've got a dataframe df in Pandas that looks like this:
stores product discount
Westminster 102141 T
Westminster 102142 F
City of London 102141 T
City of London 102142 F
City of London 102143 T
And I'd like to end up with a dataset that looks like this:
stores product_1 discount_1 product_2 discount_2 product_3 discount_3
Westminster 102141 T 102142 F
City of London 102141 T 102142 F 102143 T
How do I do this in pandas?
I think this is some kind of pivot on the stores column, but with multiple value columns. Or perhaps it's an "unmelt" rather than a "pivot"?
I tried:
df.pivot("stores", ["product", "discount"], ["product", "discount"])
But I get TypeError: MultiIndex.name must be a hashable type.
Use DataFrame.unstack to reshape. First create a per-store counter with GroupBy.cumcount to serve as the second index level, then reorder the second column level and flatten the MultiIndex columns with map:
df = (df.set_index(['stores', df.groupby('stores').cumcount().add(1)])
.unstack()
.sort_index(axis=1, level=1))
df.columns = df.columns.map('{0[0]}_{0[1]}'.format)
df = df.reset_index()
print (df)
stores discount_1 product_1 discount_2 product_2 discount_3 \
0 City of London T 102141.0 F 102142.0 T
1 Westminster T 102141.0 F 102142.0 NaN
product_3
0 102143.0
1 NaN
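For comparison, a sketch of the same reshape with DataFrame.pivot, starting again from the original long-format df; this assumes pandas 1.1+ (where pivot accepts a list of value columns), and the counter column n is a helper name introduced here:
# helper counter so each store's repeated rows get distinct column suffixes
df['n'] = df.groupby('stores').cumcount().add(1)
out = df.pivot(index='stores', columns='n', values=['product', 'discount'])
# flatten the (value, counter) MultiIndex into product_1, product_2, discount_1, ...
out.columns = [f'{a}_{b}' for a, b in out.columns]
out = out.reset_index()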
I have three different dataframes of economic measures. The columns are years and the rows are countries. I want to take each country's rows and form a dataframe for each country such that the columns are the three economic measures and the rows are years.
For example: Austria
GDP | CPI | Interest rate
1998 |xxxxxxxxxxx|xxxxxxxxxxx|xxxxxxxxxxxxxx
1999 |xxxxxxxxxxx|xxxxxxxxxxx|xxxxxxxxxxxxxx
I'm having trouble doing this in python because I am not sure how to manipulate rows.
Follow up question:
I now have a dataframe that looks something like this:
by_country: [
GDP | CPI | Interest rate
Country | Austria | Austria | Austria
1998 |xx xx xx xx|xx xx xx|xxxxxxxx
1998 |xx xx xx xx|xx xx xx|xxxxxxxx
......
GDP | CPI | Interest rate
Country | Belgium | Belgium | Belgium
1998 |xx xx xx xxx|xx xx xxx|xxxxxxxx
]
I want to be able to call stuff like this: Austria.GDP, Belgium.CPI, etc. I think the first step would be to define a function that calls the information for a country within the big dataframe such as by_country(Austria).
Essentially, I would like to be able to call country_df(Austria).GDP
Any thoughts on how to do this?
First, you could transpose each data frame so that the rows are the years and the columns are the countries, then take each respective column from the 3 data frames and join them together. Something like this would give you a data frame for each country:
import pandas

# transpose so rows become years and columns become countries
gdp = gdp_df.transpose()
cpi = cpi_df.transpose()
interest = interest_df.transpose()

by_country = {}
# Assumes the same ordering of countries in each data frame
for country in gdp.columns:
    country_df = pandas.concat([gdp[country], cpi[country], interest[country]], axis=1)
    country_df.columns = ['GDP', 'CPI', 'Interest rate']
    by_country[country] = country_df
You can now do something like:
by_country['Austria'].GDP
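An alternative sketch that keeps everything in one frame, using pandas.concat with keys so the columns become a (country, measure) MultiIndex; it assumes the same gdp_df, cpi_df and interest_df as above:
# stack the three transposed frames side by side, keyed by measure name
combined = pandas.concat({'GDP': gdp_df.transpose(), 'CPI': cpi_df.transpose(), 'Interest rate': interest_df.transpose()}, axis=1)
# swap the levels so country is the outer level, then sort for clean slicing
combined = combined.swaplevel(axis=1).sort_index(axis=1)
print(combined['Austria']['GDP'])  # GDP series for Austria, indexed by year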