Yet another question on pandas partial string merge - python

I know, there have been a number of very close examples, but I can't make them work for me. I want to add a column from another dataframe based on partial string match: The one string is contained in the other, but not necessarily at the beginning. Here is an example:
df = pd.DataFrame({'citizenship': ['Algeria', 'Andorra', 'Bahrain', 'Spain']})
df2 = pd.DataFrame({'Country_Name': ['Algeria, Republic of', 'Andorra', 'Kingdom of Bahrain', 'Russia'],
                    'Continent_Name': ['Africa', 'Europe', 'Asia', 'Europe']})
df should get the continent from df2 attached to each 'citizenship' based on the string match / merge. I have been trying to apply the solution from "Pandas: join on partial string match, like Excel VLOOKUP", but cannot get it to work:
def get_continent(x):
    return df2.loc[df2['Country_Name'].str.contains(x), df2['Continent_Name']].iloc[0]
df['Continent_Name'] = df['citizenship'].apply(get_continent)
But it gives me a key error
KeyError: "None of [Index(['Asia', 'Europe', 'Antarctica', 'Africa', 'Oceania', 'Europe', 'Africa',\n 'North America', 'Europe', 'Asia',\n ...\n 'Asia', 'South America', 'Oceania', 'Oceania', 'Asia', 'Africa',\n 'Oceania', 'Asia', 'Asia', 'Asia'],\n dtype='object', length=262)] are in the [columns]"
Does anybody know what is going on here?

I can see two issues with the code in your question:
In the function return line, you'll want to remove the df2[] bit in the second positional argument to df2.loc, to leave the column name as a string: df2.loc[df2['Country_Name'].str.contains(x), 'Continent_Name'].iloc[0]
The code from the linked answer only works when every "citizenship" in df has a match among the "country name" values in df2. When nothing matches (as for 'Spain'), the filtered selection is empty and .iloc[0] raises an IndexError.
So this works for example:
df = pd.DataFrame({'citizenship': ['Algeria', 'Andorra', 'Bahrain', 'Spain']})
df2 = pd.DataFrame({'Country_Name': ['Algeria', 'Andorra', 'Bahrain', 'Spain'],
                    'Continent_Name': ['Africa', 'Europe', 'Asia', 'Europe']})
def get_continent(x):
    return df2.loc[df2['Country_Name'].str.contains(x), 'Continent_Name'].iloc[0]
df['Continent_Name'] = df['citizenship'].apply(get_continent)
# citizenship Continent_Name
# 0 Algeria Africa
# 1 Andorra Europe
# 2 Bahrain Asia
# 3 Spain Europe
If you want to get the original code to work, you could put in a try/except:
df = pd.DataFrame({'citizenship': ['Algeria', 'Andorra', 'Bahrain', 'Spain']})
df2 = pd.DataFrame({'Country_Name': ['Algeria, Republic of', 'Andorra', 'Kingdom of Bahrain', 'Russia'],
                    'Continent_Name': ['Africa', 'Europe', 'Asia', 'Europe']})
def get_continent(x):
    try:
        return df2.loc[df2['Country_Name'].str.contains(x), 'Continent_Name'].iloc[0]
    except IndexError:
        return None
df['Continent_Name'] = df['citizenship'].apply(get_continent)
# citizenship Continent_Name
# 0 Algeria Africa
# 1 Andorra Europe
# 2 Bahrain Asia
# 3 Spain None
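The same idea can be sketched without try/except by checking whether the filtered selection is empty. This is only an illustration under two assumptions that are not in the answer above: the parameter name `name` is arbitrary, and `regex=False` is passed so that characters such as parentheses or dots in a country name are treated literally rather than as regex syntax.

```python
import pandas as pd

df = pd.DataFrame({'citizenship': ['Algeria', 'Andorra', 'Bahrain', 'Spain']})
df2 = pd.DataFrame({'Country_Name': ['Algeria, Republic of', 'Andorra', 'Kingdom of Bahrain', 'Russia'],
                    'Continent_Name': ['Africa', 'Europe', 'Asia', 'Europe']})

def get_continent(name):
    # Literal substring test (regex=False), so special characters are not interpreted
    mask = df2['Country_Name'].str.contains(name, regex=False)
    matches = df2.loc[mask, 'Continent_Name']
    # Return the first match, or None when no country name contains the citizenship
    return matches.iloc[0] if not matches.empty else None

df['Continent_Name'] = df['citizenship'].apply(get_continent)
print(df)
```

Like the original, this takes the first match only, so a citizenship that is a substring of several country names gets whichever row comes first in df2.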

One way you could do this is create a citizenship column in df2 and use that to join the dataframes together. I think the easiest way to make this column would be to use regex.
citizenship_list = df['citizenship'].unique().tolist()
citizenship_regex = r"(" + r"|".join(citizenship_list) + r")"
df2["citizenship"] = df2["Country_Name"].str.extract(citizenship_regex).iloc[:, 0]
joined_df = df.merge(df2, on=["citizenship"], how="left")
print(joined_df)
Then you can reduce this to select just the columns you want.
Also, you probably want to clean both the citizenship and Country_Name columns, for example by lowercasing them with df['citizenship'] = df['citizenship'].str.lower(), so that you don't miss a match due to case.
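Put together on the example data from the question, the regex approach might look like the sketch below. Two details are my own additions rather than part of the answer: `re.escape` guards against regex metacharacters in the names, and `expand=False` makes `str.extract` return a Series directly.

```python
import re
import pandas as pd

df = pd.DataFrame({'citizenship': ['Algeria', 'Andorra', 'Bahrain', 'Spain']})
df2 = pd.DataFrame({'Country_Name': ['Algeria, Republic of', 'Andorra', 'Kingdom of Bahrain', 'Russia'],
                    'Continent_Name': ['Africa', 'Europe', 'Asia', 'Europe']})

# One alternation group containing every (escaped) citizenship value
pattern = "(" + "|".join(re.escape(c) for c in df['citizenship'].unique()) + ")"

# Pull the matching citizenship out of each country name (NaN when nothing matches)
df2['citizenship'] = df2['Country_Name'].str.extract(pattern, expand=False)

# Left join keeps every row of df; unmatched citizenships get NaN
joined_df = df.merge(df2[['citizenship', 'Continent_Name']], on='citizenship', how='left')
print(joined_df)
```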

Related

Deleting groups of countries from dataframe indexes

I have a dataframe, where I have set indexes as countries. However there are also groups of countries such as Sub-Saharan Africa (IDA & IBRD countries) or Middle East & North Africa (IDA & IBRD countries). I want to delete them. I want just countries to stay.
Example input dataframe where indexes are:
Antigua and Barbuda
Angola
Arab World
Wanted output dataframe:
Antigua and Barbuda
Angola
My idea was using pycountry, however it does nothing.
countr=list(pycountry.countries)
for idx in df.index:
    if idx in countr :
        continue
    else:
        df.drop(index=idx)
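Check your list of country names. list(pycountry.countries) holds country objects rather than name strings, so the string labels in your index do not match them. Build a list of the names instead:
countries_list = []
for country in pycountry.countries:
    countries_list.append(country.name)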
Checking output:
>>> print(countries_list[0:5])
['Aruba', 'Afghanistan', 'Angola', 'Anguilla', 'Åland Islands']
You can do this if you want to get a dataframe of countries that are in your countries list:
import pycountry
import pandas as pd
country = {'Country Name': ['Antigua and Barbuda','USSR', 'United States', 'Germany','The Moon']}
df = pd.DataFrame(data=country)
countries_list = []
for country in pycountry.countries:
    countries_list.append(country.name)
new_df = []
for i in df['Country Name']:
    if i in countries_list:
        new_df.append(i)
Checking output:
>>> print(new_df)
['Antigua and Barbuda', 'United States', 'Germany']
Otherwise, for your specific code, try this:
Assuming you have your data in a dataframe 'df':
import pandas as pd
import pycountry
countries_list = []
for country in pycountry.countries:
    countries_list.append(country.name)
for idx in df.index:
    if idx not in countries_list:
        # drop() returns a new dataframe, so assign the result back
        df = df.drop(index=idx)
Let me know if this works for you.
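The loop can also be replaced by a boolean mask on the index. The sketch below is an illustration only: `country_names` is a hand-written stand-in for the names collected from pycountry, and the `value` column is made up so that the dataframe has some content.

```python
import pandas as pd

# Stand-in for the names gathered from pycountry.countries (assumption for this example)
country_names = {'Antigua and Barbuda', 'Angola'}

df = pd.DataFrame({'value': [1, 2, 3]},
                  index=['Antigua and Barbuda', 'Angola', 'Arab World'])

# Keep only the rows whose index label is a known country name
countries_only = df[df.index.isin(country_names)]
print(countries_only)
```

This avoids modifying the dataframe while looping over its index and leaves the original df untouched.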

Using loc to replace values gives error

My code looks like:
import pandas as pd
df = pd.read_excel("Energy Indicators.xls", header=None, footer=None)
c_df = df.copy()
c_df = c_df.iloc[18:245, 2:]
c_df = c_df.rename(columns={2: 'Country', 3: 'Energy Supply', 4:'Energy Supply per Capita', 5:'% Renewable'})
c_df['Energy Supply'] = c_df['Energy Supply'].apply(lambda x: x*1000000)
print(c_df)
c_df = c_df.loc[c_df['Country'] == ('Korea, Rep.')] = 'South Korea'
When I run it, I get the error "'str' has no attribute 'loc'". It seems like it is telling me that I can't use loc on the dataframe. All I want to do is replace the value so if there is an easier way, I am all ears.
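Just do
c_df.loc[c_df['Country'] == 'Korea, Rep.', 'Country'] = 'South Korea'
instead of
c_df = c_df.loc[c_df['Country'] == ('Korea, Rep.')] = 'South Korea'
In the chained assignment, Python first binds the string 'South Korea' to the name c_df, so the following c_df.loc lookup is made on a str, which is what the error reports. Passing 'Country' as the second argument to loc limits the replacement to that column; without it, every column of the matching row is overwritten.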
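As a small runnable sketch on made-up data (the two rows and their numbers are placeholders, not values from the Excel file), selecting both the rows and the column in loc changes only the country name:

```python
import pandas as pd

# Placeholder data standing in for the Excel sheet
c_df = pd.DataFrame({'Country': ['Korea, Rep.', 'Japan'],
                     'Energy Supply': [1, 2]})

# Row selector first, column label second: only 'Country' is modified
c_df.loc[c_df['Country'] == 'Korea, Rep.', 'Country'] = 'South Korea'
print(c_df)
```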
I would suggest using df.replace:
df = df.replace({'c_df':{'Korea, Rep.':'South Korea'}})
The code above replaces Korea, Rep. with South Korea only in the column c_df. Take a look at the df.replace documentation, which explains the nested dictionary syntax I used above as:
Nested dictionaries, e.g., {‘a’: {‘b’: nan}}, are read as follows: look in column ‘a’ for the value ‘b’ and replace it with nan. You can nest regular expressions as well. Note that column names (the top-level dictionary keys in a nested dictionary) cannot be regular expressions.
Example:
# Original dataframe:
>>> df
          c_df whatever
0  Korea, Rep.     abcd
1            x     abcd
2  Korea, Rep.     abcd
3            y     abcd
# After df.replace:
>>> df
          c_df whatever
0  South Korea     abcd
1            x     abcd
2  South Korea     abcd
3            y     abcd
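Applied to the question's own column name, the nested dictionary would be keyed by 'Country'. A minimal sketch, assuming a dataframe named c_df with a 'Country' column as in the question (the rows here are placeholders):

```python
import pandas as pd

# Placeholder rows; only the column name 'Country' comes from the question
c_df = pd.DataFrame({'Country': ['Korea, Rep.', 'Japan'],
                     'Energy Supply': [1, 2]})

# Outer key: column to look in; inner dict: old value -> new value
c_df = c_df.replace({'Country': {'Korea, Rep.': 'South Korea'}})
print(c_df)
```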

Pandas isin() output to string and general code optimisation

I am just starting to use python to do some analysis of data at work, so I could really use some help here :)
I have a df with African countries and a bunch of indicators and another df with dimensions representing groupings, and if a country is in that group, the name of the country is in there.
Here's a snapshot:
# indicators df
df_indicators= pd.DataFrame({'Country': ['Algeria', 'Angola', 'Benin'],
'Commitment to CAADP Process': [np.nan, 0.1429, 0.8571]})
# groupings df
df_groupings= pd.DataFrame({'Fragile': ['Somalia', 'Angola', 'Benin'],
'SSA': ['Angola', 'Benin', 'South Africa']})
# what I would like to have without doing it manually
df_indicators['Fragile'] = ['Not Fragile', 'Fragile', 'Fragile']
df_indicators['SSA'] = ['Not SSA', 'SSA', 'SSA']
df_indicators
and want to add dimensions to this df that tell me if the country is a fragile state or not (and other groupings). So I have another df with the list of countries belonging to that category.
I use the isin method to check for membership, but what I would really like is that, instead of TRUE and FALSE in the new dimension "Fragile" for example, TRUE values would be replaced by "Fragile" and FALSE values by "Not Fragile".
It goes without saying that if you see any way to improve this code, I am very eager to learn from professionals! Especially if you are in the domain of sustainable development goal statistics.
import pandas as pd
import numpy as np
excel_file = 'Downloads/BR_Data.xlsx'
indicators = pd.read_excel(excel_file, sheetname="Indicators", header=1)
groupings = pd.read_excel(excel_file, sheetname="Groupings", header=0)
# Title countries in the Sub-Saharan Africa dimension
decap = groupings["Sub-Saharan Africa (World Bank List)"].str.title()
groupings["Sub-Saharan Africa (World Bank List)"] = decap
# Create list of legal country names
legal_tags = {"Côte d’Ivoire":"Ivory Coast", "Cote D'Ivoire":"Ivory Coast", "Democratic Republic of the Congo":"DR Congo", "Congo, Dem. Rep.":"DR Congo",
"Congo, Repub. of the":"DR Congo", "Congo, Rep.": "DR Congo", "Dr Congo": "DR Congo", "Central African Rep.":"Central African Republic", "Sao Tome & Principe":
"Sao Tome and Principe", "Gambia, The":"Gambia"}
# I am sure there is a way to insert a list of the column names instead of copy pasting the name of every column label 5 times
groupings.replace({"Least Developing Countries in Africa (UN Classification, used by WB)" : legal_tags}, inplace = True)
groupings.replace({"Oil Exporters (McKinsey Global Institute)" : legal_tags}, inplace = True)
groupings.replace({"Sub-Saharan Africa (World Bank List)" : legal_tags}, inplace = True)
groupings.replace({"African Fragile and Conflict Affected Aread (OECD)" : legal_tags}, inplace = True)
groupings
# If the country is df indicator is found in grouping df then assign true to new column [LDC] => CAN I REPLACE TRUE WITH "LDC" etc...?
indicators["LDC"] = indicators["Country"].isin(groupings["Least Developing Countries in Africa (UN Classification, used by WB)"])
indicators["Fragile"] = indicators["Country"].isin(groupings["African Fragile and Conflict Affected Aread (OECD)"])
indicators["Oil"] = indicators["Country"].isin(groupings["Oil Exporters (McKinsey Global Institute)"])
indicators["SSA"] = indicators["Country"].isin(groupings["Sub-Saharan Africa (World Bank List)"])
indicators["Landlock"] = indicators["Country"].isin(groupings['Landlocked (UNCTAD List)'])
# When I concatenate the data frames of the groupings average I get an index with a bunch of true, false, true, false etc...
df = indicators.merge(groupings, left_on = "Country", right_on= "Country", how ="right")
labels = ['African Fragile and Conflict Affected Aread (OECD)', 'Sub-Saharan Africa (World Bank List)', 'Landlocked (UNCTAD List)', 'North Africa (excl. Middle east)', 'Oil Exporters (McKinsey Global Institute)', 'Least Developing Countries in Africa (UN Classification, used by WB)']
df.drop(labels, axis = 1, inplace = True)
df.loc['mean'] = df.mean()
df_regions = df[:].groupby('Regional Group').mean()
df_LDC = df[:].groupby('LDC').mean()
df_Oil = df[:].groupby('Oil').mean()
df_SSA = df[:].groupby('SSA').mean()
df_landlock = df[:].groupby('Landlock').mean()
df_fragile = df[:].groupby('Fragile').mean()
frames = [df_regions, df_Oil, df_SSA, df_landlock, df_fragile]
result = pd.concat(frames)
result
You can apply a function to the Series instead of using isin():
def get_value(x, y, choice):
    if x in y:
        return choice[0]
    else:
        return choice[1]
indicators["Fragile"] = indicators["Country"].apply(get_value, y=groupings["..."].tolist(), choice=["Fragile", "Not Fragile"])
The tolist() call is needed, because the in operator applied to a Series checks its index rather than its values. This code applies the function to every row of your dataframe and returns the first choice if the country is in the list, or the second choice if it is not.
I hope it helps.
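Another option, sketched here on the sample dataframes from the question, keeps isin() and converts the booleans to labels with numpy.where. Looping over the grouping columns also avoids repeating each column name by hand; building the labels as the column name and "Not " plus the column name is an assumption about the naming you want.

```python
import numpy as np
import pandas as pd

df_indicators = pd.DataFrame({'Country': ['Algeria', 'Angola', 'Benin'],
                              'Commitment to CAADP Process': [np.nan, 0.1429, 0.8571]})
df_groupings = pd.DataFrame({'Fragile': ['Somalia', 'Angola', 'Benin'],
                             'SSA': ['Angola', 'Benin', 'South Africa']})

for group in df_groupings.columns:
    # Boolean Series: is each country listed in this grouping column?
    is_member = df_indicators['Country'].isin(df_groupings[group])
    # Map True/False to e.g. 'Fragile' / 'Not Fragile'
    df_indicators[group] = np.where(is_member, group, 'Not ' + group)

print(df_indicators)
```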

Use dictionary to replace a string within a string in Pandas columns

I am trying to use a dictionary to replace strings in a pandas column: wherever a key appears, it should be replaced by its value. However, each cell contains a sentence. Therefore, I must first tokenize the sentences and detect whether a word in the sentence corresponds to a key in my dictionary, then replace the string with the corresponding value.
However, the result I keep getting is None. Is there a better, more pythonic way to approach this problem?
Here is my minimal example for the moment. In the comments, I specified where the issue is happening.
import pandas as pd
data = {'Categories': ['animal','plant','object'],
'Type': ['tree','dog','rock'],
'Comment': ['The NYC tree is very big','The cat from the UK is small','The rock was found in LA.']
}
ids = {'Id':['NYC','LA','UK'],
'City':['New York City','Los Angeles','United Kingdom']}
df = pd.DataFrame(data)
ids = pd.DataFrame(ids)
def col2dict(ids):
    data = ids[['Id', 'City']]
    idDict = data.set_index('Id').to_dict()['City']
    return idDict
def replaceIds(data,idDict):
    ids = idDict.keys()
    types = idDict.values()
    data['commentTest'] = data['Comment']
    words = data['commentTest'].apply(lambda x: x.split())
    for (i,word) in enumerate(words):
        #Here we can see that the words appear
        print word
        print ids
        if word in ids:
            #Here we can see that they are not being recognized. What happened?
            print ids
            print word
            words[i] = idDict[word]
    data['commentTest'] = ' '.apply(lambda x: ''.join(x))
    return data
idDict = col2dict(ids)
results = replaceIds(df, idDict)
Results:
None
I am using python2.7 and when I am printing out the dict, there are u' of Unicode.
My expected outcome is:
  Categories                       Comment  Type                               commentTest
0     animal      The NYC tree is very big  tree        The New York City tree is very big
1      plant  The cat from the UK is small   dog  The cat from the United Kingdom is small
2     object     The rock was found in LA.  rock        The rock was found in Los Angeles.
You can create dictionary and then replace:
ids = {'Id':['NYC','LA','UK'],
'City':['New York City','Los Angeles','United Kingdom']}
ids = dict(zip(ids['Id'], ids['City']))
print (ids)
{'UK': 'United Kingdom', 'LA': 'Los Angeles', 'NYC': 'New York City'}
df['commentTest'] = df['Comment'].replace(ids, regex=True)
print (df)
Categories Comment Type \
0 animal The NYC tree is very big tree
1 plant The cat from the UK is small dog
2 object The rock was found in LA. rock
commentTest
0 The New York City tree is very big
1 The cat from the United Kingdom is small
2 The rock was found in Los Angeles.
It's actually much faster to use str.replace() than replace(), even though str.replace() requires a loop:
ids = {'NYC': 'New York City', 'LA': 'Los Angeles', 'UK': 'United Kingdom'}
for old, new in ids.items():
    df['Comment'] = df['Comment'].str.replace(old, new, regex=False)
#   Categories  Type                                   Comment
# 0     animal  tree        The New York City tree is very big
# 1      plant   dog  The cat from the United Kingdom is small
# 2     object  rock        The rock was found in Los Angeles.
The only time replace() outperforms a str.replace() loop is with small dataframes.
The timing functions for reference:
def Series_replace(df):
    df['Comment'] = df['Comment'].replace(ids, regex=True)
    return df
def Series_str_replace(df):
    for old, new in ids.items():
        df['Comment'] = df['Comment'].str.replace(old, new, regex=False)
    return df
Note that if ids is a dataframe instead of dictionary, you can get the same performance with itertuples():
ids = pd.DataFrame({'Id': ['NYC', 'LA', 'UK'], 'City': ['New York City', 'Los Angeles', 'United Kingdom']})
for row in ids.itertuples():
    df['Comment'] = df['Comment'].str.replace(row.Id, row.City, regex=False)
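Both approaches above replace a key wherever it occurs, so a short key such as LA would also be replaced inside a longer word. If only whole words should change, one possible sketch is a single compiled pattern with word boundaries and a callable replacement. This is a variation of mine, not something either answer proposes, and I have not timed it against them.

```python
import re
import pandas as pd

ids = {'NYC': 'New York City', 'LA': 'Los Angeles', 'UK': 'United Kingdom'}
df = pd.DataFrame({'Comment': ['The NYC tree is very big',
                               'The cat from the UK is small',
                               'The rock was found in LA.']})

# \b...\b restricts matches to whole words; re.escape protects special characters in keys
pattern = re.compile(r'\b(' + '|'.join(re.escape(key) for key in ids) + r')\b')

# The callable receives each match object and looks up its replacement in the dict
df['commentTest'] = df['Comment'].str.replace(pattern, lambda m: ids[m.group(0)], regex=True)
print(df)
```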

Update null values in the country code column of a data frame with reference to a valid country code column from another data frame using python

I have two data frames: Disaster and CountryInfo.
Disaster has a Country_code column which has some null values, for example:
Disaster:
Country - Country_code
India - Null
Afghanistan - AFD
India - IND
United States - Null
CountryInfo:
CountryName - ISO
India - IND
Afghanistan - AFD
United States - US
I need to fill in the missing country codes by looking up the country name. Can anyone suggest a solution for this?
You can simply use map with a Series.
With this approach all values are overwritten, not only the NaN ones.
import numpy as np
import pandas as pd
# Test data
disaster = pd.DataFrame({'Country': ['India', 'Afghanistan', 'India', 'United States'],
                         'Country_code': [np.nan, 'AFD', 'IND', np.nan]})
country = pd.DataFrame({'Country': ['India', 'Afghanistan', 'United States'],
                        'Country_code': ['IND', 'AFD', 'US']})
# Transforming country into a Series in order to be able to map it directly
country_se = country.set_index('Country').loc[:, 'Country_code']
# Map
disaster['Country_code'] = disaster['Country'].map(country_se)
print(disaster)
# Country Country_code
# 0 India IND
# 1 Afghanistan AFD
# 2 India IND
# 3 United States US
You can filter on NaN if you do not want to overwrite all values.
disaster.loc[pd.isnull(disaster['Country_code']),
             'Country_code'] = disaster['Country'].map(country_se)
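A variant of the same idea, sketched with the column names from the question (CountryName and ISO), uses fillna so that existing codes are left alone. The variable names `country_info` and `lookup` are only illustrative.

```python
import numpy as np
import pandas as pd

disaster = pd.DataFrame({'Country': ['India', 'Afghanistan', 'India', 'United States'],
                         'Country_code': [np.nan, 'AFD', 'IND', np.nan]})
country_info = pd.DataFrame({'CountryName': ['India', 'Afghanistan', 'United States'],
                             'ISO': ['IND', 'AFD', 'US']})

# Series indexed by country name, holding the ISO code
lookup = country_info.set_index('CountryName')['ISO']

# Fill only the missing codes; rows that already have a code keep it
disaster['Country_code'] = disaster['Country_code'].fillna(disaster['Country'].map(lookup))
print(disaster)
```

Countries that are missing from the lookup table simply stay NaN.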
