Here's my initial dataset:
sitename  L1         L2       L3
America   Continent  NaN      Others
Mexico    NaN        Spanish  America
This is the reference dataset:
sitename  L1       L2
America   Country  English
Mexico    Country  Spanish
This is the expected output dataset:
sitename  L1         L2       L3
America   Continent  English  Others
Mexico    Country    Spanish  America
The reference data should only be used to fill null values.
To align by the column sitename, convert it to the index in both DataFrames and then fill the missing values:
df = df1.set_index('sitename').fillna(df2.set_index('sitename')).reset_index()
Or:
df = df1.set_index('sitename').combine_first(df2.set_index('sitename')).reset_index()
Or:
df1 = df1.set_index('sitename')
df2 = df2.set_index('sitename')
df1.update(df2, overwrite=False)  # overwrite=False so only missing values are filled
df = df1.reset_index()
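For reference, a minimal self-contained sketch (assuming df1 is the initial DataFrame and df2 is the reference) that reproduces the expected output with the sample data above:

import numpy as np
import pandas as pd

# Initial DataFrame (df1) and reference DataFrame (df2) from the question
df1 = pd.DataFrame({'sitename': ['America', 'Mexico'],
                    'L1': ['Continent', np.nan],
                    'L2': [np.nan, 'Spanish'],
                    'L3': ['Others', 'America']})
df2 = pd.DataFrame({'sitename': ['America', 'Mexico'],
                    'L1': ['Country', 'Country'],
                    'L2': ['English', 'Spanish']})

# Align on 'sitename' and fill only the missing values from the reference
df = df1.set_index('sitename').fillna(df2.set_index('sitename')).reset_index()
print(df)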
You can use:
df.fillna(df2)
If the indexes are not the same, you can set them to be the same and use the same function:
df = df.set_index('sitename')
df.fillna(df2.set_index('sitename'))
I have the dataset shown below. I am trying to sort it so that the columns are in this order: Week End, Australia, Germany, France, etc...
I have tried using loc and assigning each of the subsets to a variable, but when I create a new DataFrame it causes an error. Any help would be appreciated.
This is the data before any changes:
Region          Week End    Value
Australia       2014-01-11  1.480510
Germany         2014-01-11  1.481258
France          2014-01-11  0.986507
United Kingdom  2014-01-11  1.973014
Italy           2014-01-11  0.740629
This is my desired output:
Week End    Australia  Germany   France    United Kingdom  Italy
2014-01-11  1.480510   1.481258  0.986507  1.973014        0.740629
What I've tried:
cols = (['Region','Week End','Value'])
df = GS.loc[GS['Brand'].isin(rows)]
df = df[cols]
AUS = df.loc[df['Region'] == 'Australia']
JPN = df.loc[df['Region'] == 'Japan']
US = df.loc[df['Region'] == 'United States of America']
I think that you could actually just do:
df.pivot(index="Week End", columns="Region", values="Value")
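If it helps, here is a minimal sketch with the sample data from the question (one week only). Note that pivot sorts the resulting columns, so reindex them afterwards if you need a specific order:

import pandas as pd

# Sample data from the question (one week shown)
df = pd.DataFrame({'Region': ['Australia', 'Germany', 'France', 'United Kingdom', 'Italy'],
                   'Week End': ['2014-01-11'] * 5,
                   'Value': [1.480510, 1.481258, 0.986507, 1.973014, 0.740629]})

# Each region becomes a column, indexed by week
wide = df.pivot(index='Week End', columns='Region', values='Value')

# Enforce the desired column order
wide = wide[['Australia', 'Germany', 'France', 'United Kingdom', 'Italy']]
print(wide)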
User 965311532's answer is much more concise, but an alternative approach using dictionaries would be:
new_df = {'Week End': df['Week End'][0]}
new_df.update({region: value for region, value in zip(df['Region'], df['Value'])})
new_df = pd.DataFrame(new_df, index = [0])
As user 965311532 pointed out, the above code will not work if there are more dates. In this case, we could use pandas groupby:
dates = []
for date, group in df.groupby('Week End'):
    date_df = {'Week End': date}
    date_df.update({region: value for region, value in zip(group['Region'], group['Value'])})
    date_df = pd.DataFrame(date_df, index=[0])
    dates.append(date_df)
new_df = pd.concat(dates)
I want to add another header under two columns of a table.
I tried this way but I got a SyntaxError (invalid syntax):
table = {
'Color':color,
'Continent':continent,
{'Country':country,
'Nation':season,}
'Name':name,
'Surname':surname,
}
You could use a MultiIndex for the columns. Given the structure you show in your dictionary, this should do:
df.columns = pd.MultiIndex.from_tuples([('Color',), ('Continent',), ('Country', 'Nation'), ('Country', 'Name'), ('Country', 'Surname')])
For example:
df = pd.DataFrame({'Color': ['Red', 'Green'],
                   'Continent': ['Asia', 'Africa'],
                   'Nation': ['China', 'Egypt'],
                   'Name': ['X', 'Y'],
                   'Surname': ['A', 'B']})
df.columns = pd.MultiIndex.from_tuples([('Color',), ('Continent',), ('Country', 'Nation'), ('Country', 'Name'), ('Country', 'Surname')])
print(df)
Outputs:
Color Continent Country
NaN NaN Nation Name Surname
0 Red Asia China X A
1 Green Africa Egypt Y B
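As an alternative sketch (same sample data, not part of the original answer): if you are building the DataFrame yourself, you can pass tuple keys directly and pandas will create the column MultiIndex for you, using empty strings instead of NaN for the single-level columns:

import pandas as pd

# Tuple keys become a two-level column MultiIndex
df = pd.DataFrame({('Color', ''): ['Red', 'Green'],
                   ('Continent', ''): ['Asia', 'Africa'],
                   ('Country', 'Nation'): ['China', 'Egypt'],
                   ('Country', 'Name'): ['X', 'Y'],
                   ('Country', 'Surname'): ['A', 'B']})
print(df)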
I'm working on a World Happiness Report project that includes datasets from 2015 to 2019. I concatenated them into a final dataframe to get the average of parameters (economy, health, etc.) for every country across that time span. But what I forgot to add was the respective region that each country is in (e.g. England - Western Europe). How can I add the 'Region' column to my final dataframe and make sure that each region matches its respective country?
Not sure if this is what you are looking for.
You may want to do something like this:
df['Region'] = df['Country'].map(region_df.set_index('Country')['Region'])
Or you can use a merge statement. The assumption is that for each country, you have a region it can map to.
df = pd.merge(df,region_df,how='left',on = ['Country'])
Make sure you have indexed both DataFrames on Country before you merge to get the best performance.
Data setup:
import pandas as pd

c = ['Country', 'Happiness Score', 'Other_fields']
d = [['Denmark', 7.5460, 1.25],
     ['Norway', 7.5410, 1.50],
     ['Finland', 7.5378, 1.85]]
region_cols = ['Country', 'Region']
region_data = [['Denmark', 'R1'], ['Norway', 'R2'], ['Finland', 'R3']]
df = pd.DataFrame(data=d, columns=c)
region_df = pd.DataFrame(data=region_data, columns=region_cols)
Based on the lookup DataFrame, you can do a map to check for Country and assign Region to df.
df['Region'] = df['Country'].map(region_df.set_index('Country')['Region'])
print (df)
Your result will be as follows:
Base DataFrame:
Country Happiness Score Other_fields
0 Denmark 7.5460 1.25
1 Norway 7.5410 1.50
2 Finland 7.5378 1.85
Lookup DataFrame:
Country Region
0 Denmark R1
1 Norway R2
2 Finland R3
Updated DataFrame:
Country Happiness Score Other_fields Region
0 Denmark 7.5460 1.25 R1
1 Norway 7.5410 1.50 R2
2 Finland 7.5378 1.85 R3
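If you prefer the merge route mentioned above, a minimal sketch with the same sample data would look like this (the left join keeps every row of df and attaches the matching Region, or NaN if there is no match):

import pandas as pd

c = ['Country', 'Happiness Score', 'Other_fields']
d = [['Denmark', 7.5460, 1.25],
     ['Norway', 7.5410, 1.50],
     ['Finland', 7.5378, 1.85]]
df = pd.DataFrame(data=d, columns=c)
region_df = pd.DataFrame(data=[['Denmark', 'R1'], ['Norway', 'R2'], ['Finland', 'R3']],
                         columns=['Country', 'Region'])

# Left join on Country; rows without a matching region get NaN
df = pd.merge(df, region_df, how='left', on=['Country'])
print(df)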
I have a DataFrame which contains Alpha 2 country codes (UK, ES, SL etc) and I need these to be the country names. I created a second data frame that has all the Alpha 2 country codes in one column and the corresponding names in another.
I'm trying to compare these two columns and then use the index to create the new column. However, I am struggling to do this without using a loop. I feel like there is a more efficient way to do this without looping.
I have tried using a for loop, iterating over:
cube_data = pd.DataFrame({'Country Code':['UK','ES','SL']})
alpha2 = pd.DataFrame({'Code': ['ES','GH','UK','SL'],
                       'Name': ['Spain','Ghana','United Kingdom','Sierra Leone']})
cube_data
Country Code
0 UK
1 ES
2 SL
alpha2
Code Name
0 ES Spain
1 GH Ghana
2 UK United Kingdom
3 SL Sierra Leone
I have used a for loop to iterate through the columns; when the code from cube_data is found in alpha2['Code'], the index is used to create a new series which has alpha2['Name'] at the position corresponding to cube_data.
The end result is:
cube_data
Country Code Name
0 UK United Kingdom
1 ES Spain
2 SL Sierra Leone
Surely there is a better way to do this without looping? I have had a look at series.isin() and series.map() but these do not seem to provide the result I need.
Can this be done without a loop?
You can use pandas merge:
df = alpha2.merge(cube_data, left_on='Code', right_on='Country Code', how='inner').drop('Code', axis=1)
merge works like an SQL join: here we merge alpha2 with cube_data. We use the column 'Code' from alpha2 and 'Country Code' from cube_data to merge the two dataframes together, with 'inner' join logic meaning that only values present in both dataframes are kept. Finally, we drop the column 'Code' from alpha2, which contains the same values as the column 'Country Code'.
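For example, a small self-contained sketch with the sample frames from the question:

import pandas as pd

cube_data = pd.DataFrame({'Country Code': ['UK', 'ES', 'SL']})
alpha2 = pd.DataFrame({'Code': ['ES', 'GH', 'UK', 'SL'],
                       'Name': ['Spain', 'Ghana', 'United Kingdom', 'Sierra Leone']})

# Inner join on the code columns, then drop the redundant 'Code' column
df = alpha2.merge(cube_data, left_on='Code', right_on='Country Code', how='inner').drop('Code', axis=1)
print(df)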
Use map after converting alpha2 to a mappable object.
First we make our map:
>> country_map = alpha2.set_index('Code')['Name'].to_dict()
>> # country_map = dict(alpha2[['Code', 'Name']].values)
>> # country_map = alpha2.set_index('Code')['Name']
>> print(country_map)
{'ES': 'Spain', 'UK': 'United Kingdom', 'GH': 'Ghana', 'SL': 'Sierra Leone'}
Then we map it on the Country Code column:
>> cube_data['Country'] = cube_data['Country Code'].map(country_map)
>> print(cube_data)
Country Code Country
0 UK United Kingdom
1 ES Spain
2 SL Sierra Leone
Have you looked into the pycountry module?
I've changed your 'UK' alpha_2 to 'GB'.
import pandas as pd
import pycountry
cube_data = pd.DataFrame({'Country Code':['GB','ES','SL']})
for alpha2_code in cube_data['Country Code']:
    c = pycountry.countries.get(alpha_2=alpha2_code)
    print(c.name)
output:
United Kingdom
Spain
Sierra Leone
Using a lambda to create a new column:
df = cube_data
df['Name'] = df['Country Code'].apply(lambda x: pycountry.countries.get(alpha_2=x).name)
print(df)
output:
Country Code            Name
0 GB United Kingdom
1 ES Spain
2 SL Sierra Leone
I have the following data frame:
population GDP
country
United Kingdom 4.5m 10m
Spain 3m 8m
France 2m 6m
I also have the following information in a two-column dataframe (happy for this to be made into another data structure if that would be more beneficial, as the plan is that it will be stored in a VARS file):
county code
Spain es
France fr
United Kingdom uk
The 'mapping' data structure will be stored in a random order, as countries will be added/removed at random times.
What is the best way to re-index the data frame to its country code from its country name?
Is there a smart solution that would also work on other columns? For example, if a data frame was indexed on date but one column was df['country'], could you change df['country'] to its country code? Finally, is there a third option that would add an additional column containing the country/code, selecting the right code based on the country name in another column?
I think you can use Series.map, but it works only with a Series, so you need Index.to_series first. Last, use rename_axis (new in pandas 0.18.0):
df1.index = df1.index.to_series().map(df2.set_index('county').code)
df1 = df1.rename_axis('county')
#pandas below 0.18.0
#df1.index.name = 'county'
print (df1)
population GDP
county
uk 4.5m 10m
es 3m 8m
fr 2m 6m
It is the same as mapping by a dict:
d = df2.set_index('county').code.to_dict()
print (d)
{'France': 'fr', 'Spain': 'es', 'United Kingdom': 'uk'}
df1.index = df1.index.to_series().map(d)
df1 = df1.rename_axis('county')
#pandas below 0.18.0
#df1.index.name = 'county'
print (df1)
population GDP
county
uk 4.5m 10m
es 3m 8m
fr 2m 6m
EDIT:
Another solution with Index.map, so to_series is omitted:
d = df2.set_index('county').code.to_dict()
print (d)
{'France': 'fr', 'Spain': 'es', 'United Kingdom': 'uk'}
df1.index = df1.index.map(d.get)
df1 = df1.rename_axis('county')
#pandas below 0.18.0
#df1.index.name = 'county'
print (df1)
population GDP
county
uk 4.5m 10m
es 3m 8m
fr 2m 6m
Here are some brief ways to approach your 3 questions. More details below:
1) How to change index based on mapping in separate df
Use df_with_mapping.to_dict("split") to create a dictionary, then use a dict comprehension to change it into {"old1": "new1", ..., "oldn": "newn"} form, then use df.index = df.base_column.map(dictionary) to get the changed index.
2) How to change index if the new column is in the same df:
df.index = df["column_you_want"]
3) Creating a new column by mapping on a old column:
df["new_column"] = df["old_column"].map({"old1":"new1",...,"oldn":"newn"})
1) Mapping for the current index exists in separate dataframe but you don't have the mapped column in the dataframe yet
This is essentially the same as question 2 with the additional step of creating a dictionary for the mapping you want.
#creating the mapping dictionary in the form of current index : future index
df2 = pd.DataFrame([["es"],["fr"]],index = ["spain","france"])
interm_dict = df2.to_dict("split") #Creates a dictionary split into index labels, column labels and data
mapping_dict = {country: data[0] for country, data in zip(interm_dict["index"], interm_dict["data"])}
#We only want the first column of the data and the index, so we build a new dict with a dict comprehension and zip
df["country"] = df.index #Create a new column if you want to save the index
df.index = pd.Series(df.index).map(mapping_dict) #Change the index
df.index.name = "" #Blanks out the index name
df = df.drop("county code", axis=1) #Drops the county code column to avoid duplicate columns
Before:
county code language
spain es spanish
france fr french
After:
language country
es spanish spain
fr french france
2) Changing the current index to one of the columns already in the dataframe
df = pd.DataFrame([["es","spanish"],["fr","french"]], columns = ["county code","language"], index = ["spain", "french"])
df["country"] = df.index #if you want to save the original index
df.index = df["county code"] #The only step you actually need
df.index.name = "" #if you want a blank index name
df = df.drop("county code", axis=1) #if you don't want the duplicate column
Before:
county code language
spain es spanish
french fr french
After:
language country
es spanish spain
fr french french
3) Creating an additional column based on another column
This is again essentially the same as step 2 except we create an additional column instead of assigning .index to the created series.
df = pd.DataFrame([["es","spanish"],["fr","french"]], columns = ["county code","language"], index = ["spain", "france"])
df["city"] = df["county code"].map({"es":"barcelona","fr":"paris"})
Before:
county code language
spain es spanish
france fr french
After:
county code language city
spain es spanish barcelona
france fr french paris