How to add a categorical variable to a dataframe? - python

I'm working on a World Happiness Report project that includes datasets from 2015 to 2019. I concatenated them into a final dataframe to get the average of each parameter (economy, health, etc.) for every country across that time span. But I forgot to add the respective region that each country is in (e.g. England - Western Europe). How can I add a 'Region' column to my final dataframe and make sure that each region matches its respective country?

Not sure if this is what you are looking for.
You may want to do something like this:
df['Region'] = df['Country'].map(region_df.set_index('Country')['Region'])
Or you can also use merge. The assumption is that each country maps to exactly one region.
df = pd.merge(df, region_df, how='left', on=['Country'])
If performance matters, you can also set Country as the index on both DataFrames and join on the index instead.
Data setup:
import pandas as pd

# base DataFrame
c = ['Country', 'Happiness Score', 'Other_fields']
d = [['Denmark', 7.5460, 1.25],
     ['Norway', 7.5410, 1.50],
     ['Finland', 7.5378, 1.85]]
df = pd.DataFrame(data=d, columns=c)

# lookup DataFrame
region_cols = ['Country', 'Region']
region_data = [['Denmark', 'R1'], ['Norway', 'R2'], ['Finland', 'R3']]
region_df = pd.DataFrame(data=region_data, columns=region_cols)

Based on the lookup DataFrame, you can use map to look up each Country and assign its Region to df:
df['Region'] = df['Country'].map(region_df.set_index('Country')['Region'])
print(df)
Your result will be as follows:
Base DataFrame:
   Country  Happiness Score  Other_fields
0  Denmark           7.5460          1.25
1   Norway           7.5410          1.50
2  Finland           7.5378          1.85
Lookup DataFrame:
   Country Region
0  Denmark     R1
1   Norway     R2
2  Finland     R3
Updated DataFrame:
   Country  Happiness Score  Other_fields Region
0  Denmark           7.5460          1.25     R1
1   Norway           7.5410          1.50     R2
2  Finland           7.5378          1.85     R3
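For completeness, the merge route mentioned above gives the same result on this sample data. A minimal sketch using the df and region_df defined above; the last two lines show the index-based join suggested in the performance note:
# start again from df without the mapped column, then do a column-based left merge
merged = pd.merge(df.drop(columns='Region'), region_df, how='left', on='Country')
print(merged)

# index-based variant: set Country as the index on both frames and join on the index
joined = df.drop(columns='Region').set_index('Country').join(region_df.set_index('Country'), how='left')
print(joined.reset_index())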

Related

LEFT ON Case When in Pandas

I wanted to ask: in SQL I can do JOIN ON CASE WHEN, is there a way to do this in Pandas?
disease = [
    {"City": "CH", "Case_Recorded": 5300, "Recovered": 2839, "Deaths": 2461},
    {"City": "NY", "Case_Recorded": 1311, "Recovered": 521, "Deaths": 790},
    {"City": "TX", "Case_Recorded": 1991, "Recovered": 1413, "Deaths": 578},
    {"City": "AT", "Case_Recorded": 3381, "Recovered": 3112, "Deaths": 269},
    {"City": "TX", "Case_Recorded": 3991, "Recovered": 2810, "Deaths": 1311},
    {"City": "LA", "Case_Recorded": 2647, "Recovered": 2344, "Deaths": 303},
    {"City": "LA", "Case_Recorded": 4410, "Recovered": 3344, "Deaths": 1066}
]
region = {"North": ["AT"], "West": ["TX", "LA"]}
So what I have is two dummy dicts, which I have already converted into DataFrames. The first holds the city names with the case counts, and I'm trying to figure out which region each city belongs to. This is the result I'm after:
Region|City
North|AT
West|TX
West|LA
None|NY
None|CH
What I thought of in SQL was a LEFT JOIN ON CASE WHEN: if the result is null when joining with the North region, then join with the West region.
But if there are 15 or 30 regions in some country, I think that would become a problem.
Use:
# get City without duplicates
df1 = pd.DataFrame(disease)[['City']].drop_duplicates()

# create a DataFrame from the region dictionary
region = {"North": ["AT"], "West": ["TX", "LA"]}
df2 = pd.DataFrame([(k, x) for k, v in region.items() for x in v],
                   columns=['Region', 'City'])

# append the cities that were not matched to df2
out = pd.concat([df2, df1[~df1['City'].isin(df2['City'])]])
print(out)
  Region City
0  North   AT
1   West   TX
2   West   LA
0    NaN   CH
1    NaN   NY
If order is not important:
out = df2.merge(df1, how='right')
print(out)
  Region City
0    NaN   CH
1    NaN   NY
2   West   TX
3  North   AT
4   West   LA
I'm sorry, I'm not exactly sure what your expected result is, can you explain more? If your expected result is just getting each city's region, there is no need for a conditional join. For example, you can transform the city-region table into one row per city and region and directly join it with the main df:
import pandas as pd

disease = [
    {"City": "CH", "Case_Recorded": 5300, "Recovered": 2839, "Deaths": 2461},
    {"City": "NY", "Case_Recorded": 1311, "Recovered": 521, "Deaths": 790},
    {"City": "TX", "Case_Recorded": 1991, "Recovered": 1413, "Deaths": 578},
    {"City": "AT", "Case_Recorded": 3381, "Recovered": 3112, "Deaths": 269},
    {"City": "TX", "Case_Recorded": 3991, "Recovered": 2810, "Deaths": 1311},
    {"City": "LA", "Case_Recorded": 2647, "Recovered": 2344, "Deaths": 303},
    {"City": "LA", "Case_Recorded": 4410, "Recovered": 3344, "Deaths": 1066}
]
region = [
    {'City': 'AT', 'Region': "North"},
    {'City': 'TX', 'Region': "West"},
    {'City': 'LA', 'Region': "West"}
]
df = pd.DataFrame(disease)
df_reg = pd.DataFrame(region)
df.merge(df_reg, on='City', how='left')
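If your region data starts out as the dict of lists from the question rather than the per-city records above, here is a small sketch of my own that flattens it into the same per-city shape before merging:
# flatten {"North": ["AT"], "West": ["TX", "LA"]} into one row per city
region_dict = {"North": ["AT"], "West": ["TX", "LA"]}
df_reg = pd.DataFrame([{'City': city, 'Region': reg}
                       for reg, cities in region_dict.items()
                       for city in cities])
out = df.merge(df_reg, on='City', how='left')  # unmatched cities get NaN in Region
print(out)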

Add header under two columns of a table

I want to add another header under two columns of a table.
I tried this way but I got invalid syntax (SyntaxError):
table = {
    'Color': color,
    'Continent': continent,
    {'Country': country,
     'Nation': season,}
    'Name': name,
    'Surname': surname,
}
You could use a MultiIndex as described here. Given the structure you show in your dictionary, this should do (note the trailing commas: ('Color',) is a one-element tuple, whereas ('Color') is just a string):
df.columns = pd.MultiIndex.from_tuples([('Color',), ('Continent',), ('Country', 'Nation'), ('Country', 'Name'), ('Country', 'Surname')])
For example:
df = pd.DataFrame({'Color': ['Red', 'Green'],
                   'Continent': ['Asia', 'Africa'],
                   'Nation': ['China', 'Egypt'],
                   'Name': ['X', 'Y'],
                   'Surname': ['A', 'B']})
df.columns = pd.MultiIndex.from_tuples([('Color',), ('Continent',), ('Country', 'Nation'),
                                        ('Country', 'Name'), ('Country', 'Surname')])
print(df)
Outputs:
   Color Continent Country
     NaN       NaN  Nation Name Surname
0    Red      Asia   China    X       A
1  Green    Africa   Egypt    Y       B
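Once the columns are a MultiIndex, the whole 'Country' group can be selected at once. A small usage sketch of my own, using the df built just above:
# select every sub-column that sits under the 'Country' header
print(df['Country'])             # columns: Nation, Name, Surname

# select a single nested column with a tuple
print(df[('Country', 'Name')])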

Multiply columns based on two columns conditions from different dataframes?

I have two dataframes as indicated below:
dfA =
Country City Pop
US Washington 1000
US Texas 5000
CH Geneva 500
CH Zurich 500
dfB =
Country City Density (pop/km2)
US Washington 10
US Texas 50
CH Geneva 5
CH Zurich 5
What I want is to compare the Country and City columns from both dataframes, and when they match, such as US Washington appearing in both, take the Pop value and divide it by Density, so as to get a new area column in dfB with the result of the division. For example, the first row gives dfB['area km2'] = 100.
I have tried with np.where() but it is not working. Any hints on how to achieve this?
Using index matching and div:
match_on = ['Country', 'City']
dfA = dfA.set_index(match_on)
dfA = dfA.assign(ratio=dfA['Pop'].div(dfB.set_index(match_on)['Density (pop/km2)']))
print(dfA)

                     Pop  ratio
Country City
US      Washington  1000  100.0
        Texas       5000  100.0
CH      Geneva       500  100.0
        Zurich       500  100.0
You can also use merge to combine the two dataframes and divide as usual:
dfMerge = dfA.merge(dfB, on=['Country', 'City'])
dfMerge['area'] = dfMerge['Pop'].div(dfMerge['Density (pop/km2)'])
print(dfMerge)
Output:
  Country        City   Pop  Density (pop/km2)   area
0      US  Washington  1000                 10  100.0
1      US       Texas  5000                 50  100.0
2      CH      Geneva   500                  5  100.0
3      CH      Zurich   500                  5  100.0
You can also use merge like below:
dfB["Area"] = dfB.merge(dfA, on=["Country", "City"], how="left")["Pop"] / dfB["Density (pop/km2)"]
dfB
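For reference, a minimal construction of dfA and dfB from the data in the question, so the snippets above can be run as-is (my own setup, not part of any answer):
import pandas as pd

dfA = pd.DataFrame({'Country': ['US', 'US', 'CH', 'CH'],
                    'City': ['Washington', 'Texas', 'Geneva', 'Zurich'],
                    'Pop': [1000, 5000, 500, 500]})
dfB = pd.DataFrame({'Country': ['US', 'US', 'CH', 'CH'],
                    'City': ['Washington', 'Texas', 'Geneva', 'Zurich'],
                    'Density (pop/km2)': [10, 50, 5, 5]})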

How do I use a mapping variable to re-index a dataframe?

I have the following data frame:
                population  GDP
country
United Kingdom        4.5m  10m
Spain                   3m   8m
France                  2m   6m
I also have the following information in a 2-column dataframe (happy for this to be made into another data structure if that would be more beneficial, as the plan is that it will be stored in a VARS file):
county code
Spain es
France fr
United Kingdom uk
The 'mapping' data structure will be stored in a random order, as countries will be added/removed at random times.
What is the best way to re-index the data frame from its country name to its country code?
Is there a smart solution that would also work on other columns? For example, if a data frame was indexed on date but one column was df['country'], could you change df['country'] to its country code? Finally, is there a third option that would add an additional column containing the country code, selected based on the country name in another column?
I think you can use Series.map, but it works only with a Series, so you need Index.to_series first. Finally use rename_axis (new in pandas 0.18.0):
df1.index = df1.index.to_series().map(df2.set_index('county').code)
df1 = df1.rename_axis('county')
# pandas below 0.18.0
# df1.index.name = 'county'
print(df1)
       population  GDP
county
uk           4.5m  10m
es             3m   8m
fr             2m   6m
It is the same as mapping by a dictionary:
d = df2.set_index('county').code.to_dict()
print(d)
{'France': 'fr', 'Spain': 'es', 'United Kingdom': 'uk'}

df1.index = df1.index.to_series().map(d)
df1 = df1.rename_axis('county')
# pandas below 0.18.0
# df1.index.name = 'county'
print(df1)
       population  GDP
county
uk           4.5m  10m
es             3m   8m
fr             2m   6m
EDIT:
Another solution with Index.map, so to_series can be omitted:
d = df2.set_index('county').code.to_dict()
print(d)
{'France': 'fr', 'Spain': 'es', 'United Kingdom': 'uk'}

df1.index = df1.index.map(d.get)
df1 = df1.rename_axis('county')
# pandas below 0.18.0
# df1.index.name = 'county'
print(df1)
       population  GDP
county
uk           4.5m  10m
es             3m   8m
fr             2m   6m
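If you want to run the snippets above, here is a minimal reproduction of df1 and df2 from the question (my own setup; the values are the ones shown by the asker):
import pandas as pd

df1 = pd.DataFrame({'population': ['4.5m', '3m', '2m'],
                    'GDP': ['10m', '8m', '6m']},
                   index=['United Kingdom', 'Spain', 'France'])
df1.index.name = 'country'

df2 = pd.DataFrame({'county': ['Spain', 'France', 'United Kingdom'],
                    'code': ['es', 'fr', 'uk']})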
Here are some brief ways to approach your 3 questions. More details below:
1) How to change the index based on a mapping in a separate df:
Use df_with_mapping.to_dict("split") to create a dictionary, then use a dict comprehension to change it into {"old1":"new1",...,"oldn":"newn"} form, then use df.index = df.base_column.map(dictionary) to get the changed index.
2) How to change the index if the new column is in the same df:
df.index = df["column_you_want"]
3) Creating a new column by mapping on an old column:
df["new_column"] = df["old_column"].map({"old1":"new1",...,"oldn":"newn"})
1) Mapping for the current index exists in a separate dataframe, but you don't have the mapped column in the dataframe yet
This is essentially the same as question 2, with the additional step of creating a dictionary for the mapping you want.
# create the mapping dictionary in the form current index : future index
df2 = pd.DataFrame([["es"], ["fr"]], index=["spain", "france"])
interm_dict = df2.to_dict("split")  # creates a dictionary split into index labels, column labels and data
# we only want the first data column and the index, so build a new dict with a comprehension and zip
mapping_dict = {country: data[0] for country, data in zip(interm_dict["index"], interm_dict["data"])}

df["country"] = df.index                          # create a new column if you want to save the index
df.index = pd.Series(df.index).map(mapping_dict)  # change the index
df.index.name = ""                                # blank out the index name
df = df.drop("county code", axis=1)               # drop the county code column to avoid duplicate columns
Before:
       county code language
spain           es  spanish
france          fr   french
After:
   language country
es  spanish   spain
fr   french  france
2) Changing the current index to one of the columns already in the dataframe
df = pd.DataFrame([["es","spanish"],["fr","french"]], columns = ["county code","language"], index = ["spain", "french"])
df["country"] = df.index #if you want to save the original index
df.index = df["county code"] #The only step you actually need
df.index.name = "" #if you want a blank index name
df = df.drop("county code",1) #if you dont want the duplicate column
Before:
county code language
spain es spanish
french fr french
After:
language country
es spanish spain
fr french french
3) Creating an additional column based on another column
This is again essentially the same as step 2, except we create an additional column instead of assigning the created series to .index.
df = pd.DataFrame([["es", "spanish"], ["fr", "french"]],
                  columns=["county code", "language"],
                  index=["spain", "france"])
df["city"] = df["county code"].map({"es": "barcelona", "fr": "paris"})
Before:
       county code language
spain           es  spanish
france          fr   french
After:
       county code language       city
spain           es  spanish  barcelona
france          fr   french      paris
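A possibly shorter route to the same mapping dictionary as in step 1, sketched here as my own variant (assuming df and df2 as defined there): build it directly with dict(zip(...)) instead of going through to_dict("split"):
# df2: index = country names, first column = country codes
mapping_dict = dict(zip(df2.index, df2.iloc[:, 0]))
# {'spain': 'es', 'france': 'fr'}

df.index = pd.Series(df.index).map(mapping_dict)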

How to take rows from python pandas dataframe and make into columns of one new dataframe

I have three different dataframes of economic measures. The columns are years and the rows are countries. I want to take each country's rows and form a dataframe for each country such that the columns are the three economic measures and the rows are years.
For example: Austria
GDP | CPI | Interest rate
1998 |xxxxxxxxxxx|xxxxxxxxxxx|xxxxxxxxxxxxxx
1999 |xxxxxxxxxxx|xxxxxxxxxxx|xxxxxxxxxxxxxx
I'm having trouble doing this in python because I am not sure how to manipulate rows.
Follow up question:
I now have a dataframe that looks something like this:
by_country: [
GDP | CPI | Interest rate
Country | Austria | Austria | Austria
1998 |xx xx xx xx|xx xx xx|xxxxxxxx
1998 |xx xx xx xx|xx xx xx|xxxxxxxx
......
GDP | CPI | Interest rate
Country | Belgium | Belgium | Belgium
1998 |xx xx xx xxx|xx xx xxx|xxxxxxxx
]
I want to be able to call stuff like this: Austria.GDP, Belgium.CPI, etc. I think the first step would be to define a function that pulls the information for one country out of the big dataframe, such as by_country(Austria).
Essentially, I would like to be able to call country_df(Austria).GDP
Any thoughts on how to do this?
First, you could transpose each data frame so that the rows are the years and the columns are the countries, then take each respective column from the 3 data frames and join them together. Something like this would give you a data frame for each country:
import pandas

gdp = gdp_df.transpose()
cpi = cpi_df.transpose()
interest = interest_df.transpose()

by_country = {}
# Assumes the same ordering of countries in each data frame
for country in gdp.columns:
    country_df = pandas.concat([gdp[country], cpi[country], interest[country]], axis=1)
    country_df.columns = ['GDP', 'CPI', 'Interest rate']
    by_country[country] = country_df
You can now do something like:
by_country['Austria'].GDP
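In case it helps to see the shapes involved, here is a tiny made-up setup for gdp_df, cpi_df and interest_df (the numbers are placeholders, standing in for the x's in the question), followed by the attribute-style access the follow-up asks about:
import pandas

# countries as rows, years as columns, as described in the question (placeholder values)
gdp_df = pandas.DataFrame({1998: [1.0, 2.0], 1999: [1.1, 2.1]}, index=['Austria', 'Belgium'])
cpi_df = pandas.DataFrame({1998: [0.5, 0.6], 1999: [0.55, 0.65]}, index=['Austria', 'Belgium'])
interest_df = pandas.DataFrame({1998: [3.0, 3.2], 1999: [2.8, 3.0]}, index=['Austria', 'Belgium'])

# ...run the loop from the answer above, then:
print(by_country['Austria'].GDP)   # GDP series for Austria, indexed by year
print(by_country['Belgium'].CPI)

# a tiny helper if you prefer the country_df(...) call style from the follow-up
def country_df(name):
    return by_country[name]

print(country_df('Austria').GDP)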
