I have a dataframe where the index consists of countries. However, the index also contains groups of countries, such as Sub-Saharan Africa (IDA & IBRD countries) or Middle East & North Africa (IDA & IBRD countries). I want to delete those groups so that only individual countries remain.
Example input dataframe, where the indexes are:
Antigua and Barbuda
Angola
Arab World
Wanted output dataframe:
Antigua and Barbuda
Angola
My idea was to use pycountry, but it does nothing:
countr = list(pycountry.countries)
for idx in df.index:
    if idx in countr:
        continue
    else:
        df.drop(index=idx)
Check your list of country names:
countries_list = []
for country in pycountry.countries:
    countries_list.append(country.name)
Checking output:
>>> print(countries_list[0:5])
['Aruba', 'Afghanistan', 'Angola', 'Anguilla', 'Åland Islands']
You can do this if you want to get a dataframe of countries that are in your countries list:
import pycountry
import pandas as pd

country = {'Country Name': ['Antigua and Barbuda', 'USSR', 'United States', 'Germany', 'The Moon']}
df = pd.DataFrame(data=country)

countries_list = []
for country in pycountry.countries:
    countries_list.append(country.name)

new_df = []
for i in df['Country Name']:
    if i in countries_list:
        new_df.append(i)
Checking output:
>>> print(new_df)
['Antigua and Barbuda', 'United States', 'Germany']
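If you actually need a dataframe rather than a plain list, you could wrap the result (just a sketch, reusing the names above):

# turn the filtered list back into a one-column dataframe
new_df = pd.DataFrame({'Country Name': new_df})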
Otherwise, for your specific code, try this:
Assuming you have your data in a dataframe 'df':
import pandas as pd
import pycountry

countries_list = []
for country in pycountry.countries:
    countries_list.append(country.name)

for idx in df.index:
    if idx in countries_list:
        continue
    else:
        # drop() returns a new dataframe, so reassign it (or pass inplace=True)
        df = df.drop(index=idx)
Let me know if this works for you.
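As a more concise alternative (just a sketch, assuming the index holds plain country-name strings), you can filter the index in one step instead of dropping row by row:

# keep only the rows whose index value is a recognised country name
df = df[df.index.isin(countries_list)]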
I have a pandas dataset like below:
import pandas as pd
data = {'id': ['001', '002', '003'],
        'address': ["William J. Clare\n290 Valley Dr.\nCasper, WY 82604\nUSA, United States",
                    "1180 Shelard Tower\nMinneapolis, MN 55426\nUSA, United States",
                    "William N. Barnard\n145 S. Durbin\nCasper, WY 82601\nUSA, United States"]
        }
df = pd.DataFrame(data)
print(df)
I need to split the address column on \n and create new columns, like Name, Address Line 1, City, State, Zipcode, and Country, as below:
id Name addressline1 City State Zipcode Country
1 William J. Clare 290 Valley Dr. Casper WY 82604 United States
2 null 1180 Shelard Tower Minneapolis MN 55426 United States
3 William N. Barnard 145 S. Durbin Casper WY 82601 United States
I am learning Python and have been working on this since morning. Any help will be greatly appreciated.
Thanks,
Right now, pandas is returning a table with two columns. If you look at the values in the second column, the essential pieces of information are separated by commas. Therefore, assuming you saved your dataframe to df, you can do the following:
df['address_and_city'] = df['address'].apply(lambda x: x.split(',')[0])
df['state_and_postal'] = df['address'].apply(lambda x: x.split(',')[1])
df['country'] = df['address'].apply(lambda x: x.split(',')[2])
Now you have three additional columns in your dataframe; the last one already contains the full country information. From the first two columns you created, you can extract the remaining info in a similar way.
df['address_first_line'] = df['address_and_city'].apply(lambda x: ' '.join(x.split('\n')[:-1]))
df['city'] = df['address_and_city'].apply(lambda x: x.split('\n')[-1])
df['state'] = df['state_and_postal'].apply(lambda x: x.split(' ')[1])
df['postal'] = df['state_and_postal'].apply(lambda x: x.split(' ')[2].split('\n')[0])
Now you should have all the columns you need. You can remove the excess columns with:
df.drop(columns=['address','address_and_city','state_and_postal'], inplace=True)
Of course, it can all be done faster and with fewer lines of code, but I think this is the clearest way of doing it, which I hope you will find useful. If you don't understand what I did there, check the documentation for the split and join string methods, and also for the apply method, which is native to pandas.
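For reference, here is one possible shorter version starting from the original df, using a single regex with str.extract. It is only a sketch and assumes every address ends with "USA, <Country>" and that the city line always looks like "City, ST 12345":

pattern = (r'^(?:(?P<Name>[^\n]+)\n)??'                                    # optional name line
           r'(?P<addressline1>[^\n]+)\n'                                   # street address
           r'(?P<City>[^,\n]+), (?P<State>[A-Z]{2}) (?P<Zipcode>\d{5})\n'  # city, state, zip
           r'USA, (?P<Country>.+)$')                                       # country after "USA, "
parsed = pd.concat([df[['id']], df['address'].str.extract(pattern)], axis=1)
print(parsed)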
I know there have been a number of very close examples, but I can't make them work for me. I want to add a column from another dataframe based on a partial string match: one string is contained in the other, but not necessarily at the beginning. Here is an example:
df = pd.DataFrame({'citizenship': ['Algeria', 'Andorra', 'Bahrain', 'Spain']})
df2 = pd.DataFrame({'Country_Name': ['Algeria, Republic of', 'Andorra', 'Kingdom of Bahrain', 'Russia'],
                    'Continent_Name': ['Africa', 'Europe', 'Asia', 'Europe']})
df should get the continent from df2 attached to each 'citizenship' based on the string match / merge. I have been trying to apply the solution from "Pandas: join on partial string match, like Excel VLOOKUP", but cannot get it to work:
def get_continent(x):
    return df2.loc[df2['Country_Name'].str.contains(x), df2['Continent_Name']].iloc[0]

df['Continent_Name'] = df['citizenship'].apply(get_continent)
But it gives me a KeyError:
KeyError: "None of [Index(['Asia', 'Europe', 'Antarctica', 'Africa', 'Oceania', 'Europe', 'Africa',\n 'North America', 'Europe', 'Asia',\n ...\n 'Asia', 'South America', 'Oceania', 'Oceania', 'Asia', 'Africa',\n 'Oceania', 'Asia', 'Asia', 'Asia'],\n dtype='object', length=262)] are in the [columns]"
Does anybody know what is going on here?
I can see two issues with the code in your question:
In the function return line, you'll want to remove the df2[] bit in the second positional argument to df2.loc, to leave the column name as a string: df2.loc[df2['Country_Name'].str.contains(x), 'Continent_Name'].iloc[0]
It then seems like the code from the linked answer only works when there is always a match between "country name" in df2 and "citizenship" in df.
So this works for example:
df = pd.DataFrame({'citizenship': ['Algeria', 'Andorra', 'Bahrain', 'Spain']})
df2 = pd.DataFrame({'Country_Name': ['Algeria', 'Andorra', 'Bahrain', 'Spain'],
                    'Continent_Name': ['Africa', 'Europe', 'Asia', 'Europe']})

def get_continent(x):
    return df2.loc[df2['Country_Name'].str.contains(x), 'Continent_Name'].iloc[0]

df['Continent_Name'] = df['citizenship'].apply(get_continent)
# citizenship Continent_Name
# 0 Algeria Africa
# 1 Andorra Europe
# 2 Bahrain Asia
# 3 Spain Europe
If you want to get the original code to work, you could put in a try/except:
df = pd.DataFrame({'citizenship': ['Algeria', 'Andorra', 'Bahrain', 'Spain']})
df2 = pd.DataFrame({'Country_Name': ['Algeria, Republic of', 'Andorra', 'Kingdom of Bahrain', 'Russia'],
                    'Continent_Name': ['Africa', 'Europe', 'Asia', 'Europe']})

def get_continent(x):
    try:
        return df2.loc[df2['Country_Name'].str.contains(x), 'Continent_Name'].iloc[0]
    except IndexError:
        return None

df['Continent_Name'] = df['citizenship'].apply(get_continent)
# citizenship Continent_Name
# 0 Algeria Africa
# 1 Andorra Europe
# 2 Bahrain Asia
# 3 Spain None
One way you could do this is to create a citizenship column in df2 and use that to join the dataframes together. I think the easiest way to make this column would be to use a regex.
citizenship_list = df['citizenship'].unique().tolist()
citizenship_regex = r"(" + r"|".join(citizenship_list) + r")"
df2["citizenship"] = df2["Country_Name"].str.extract(citizenship_regex).iloc[:, 0]
joined_df = df.merge(df2, on=["citizenship"], how="left")
print(joined_df)
Then you can reduce this to select just the columns you want.
Also, you probably want to clean both the citizenship and Country_Name columns by running something like df['citizenship'] = df['citizenship'].str.lower() on them, so that you don't miss anything due to case; see the sketch below.
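Putting it together, a rough sketch of the full flow with the case-cleaning applied (reusing the column names from above):

# lowercase both columns first so the regex match is not tripped up by case
df['citizenship'] = df['citizenship'].str.lower()
df2['Country_Name'] = df2['Country_Name'].str.lower()

citizenship_list = df['citizenship'].unique().tolist()
citizenship_regex = r"(" + r"|".join(citizenship_list) + r")"
df2["citizenship"] = df2["Country_Name"].str.extract(citizenship_regex).iloc[:, 0]

# left-join and keep only the columns of interest
joined_df = df.merge(df2, on=["citizenship"], how="left")
joined_df = joined_df[["citizenship", "Continent_Name"]]
print(joined_df)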
My code looks like:
import pandas as pd
df = pd.read_excel("Energy Indicators.xls", header=None, footer=None)
c_df = df.copy()
c_df = c_df.iloc[18:245, 2:]
c_df = c_df.rename(columns={2: 'Country', 3: 'Energy Supply', 4:'Energy Supply per Capita', 5:'% Renewable'})
c_df['Energy Supply'] = c_df['Energy Supply'].apply(lambda x: x*1000000)
print(c_df)
c_df = c_df.loc[c_df['Country'] == ('Korea, Rep.')] = 'South Korea'
When I run it, I get the error "'str' has no attribute 'loc'". It seems to be telling me that I can't use loc on the dataframe. All I want to do is replace the value, so if there is an easier way, I am all ears.
Just do
c_df.loc[c_df['Country'] == ('Korea, Rep.')] = 'South Korea'
instead of
c_df = c_df.loc[c_df['Country'] == ('Korea, Rep.')] = 'South Korea'
I would suggest using df.replace:
df = df.replace({'c_df':{'Korea, Rep.':'South Korea'}})
The code above replaces Korea, Rep. with South Korea only in the column c_df. Take a look at the df.replace documentation, which explains the nested dictionary syntax I used above as:
Nested dictionaries, e.g., {‘a’: {‘b’: nan}}, are read as follows: look in column ‘a’ for the value ‘b’ and replace it with nan. You can nest regular expressions as well. Note that column names (the top-level dictionary keys in a nested dictionary) cannot be regular expressions.
Example:
# Original dataframe:
>>> df
c_df whatever
0 Korea, Rep. abcd
1 x abcd
2 Korea, Rep. abcd
3 y abcd
# After df.replace:
>>> df
c_df whatever
0 South Korea abcd
1 x abcd
2 South Korea abcd
3 y abcd
In my dataframe, there are several countries with numbers and/or parentheses in their names.
I want to remove the parentheses and numbers from these countries' names.
For example:
'Bolivia (Plurinational State of)' should be 'Bolivia',
'Switzerland17' should be 'Switzerland'.
Here is my code, but it does not seem to be working:
import numpy as np
import pandas as pd

def func():
    energy = pd.ExcelFile('Energy Indicators.xls').parse('Energy')
    energy = energy.iloc[16:243][['Environmental Indicators: Energy', 'Unnamed: 3', 'Unnamed: 4', 'Unnamed: 5']].copy()
    energy.columns = ['Country', 'Energy Supply', 'Energy Supply per Capita', '% Renewable']
    o = "..."
    n = np.NaN
    energy = energy.replace('...', np.nan)
    energy['Energy Supply'] = energy['Energy Supply'] * 1000000
    old = ["Republic of Korea", "United States of America", "United Kingdom of "
           + "Great Britain and Northern Ireland", "China, Hong "
           + "Kong Special Administrative Region"]
    new = ["South Korea", "United States", "United Kingdom", "Hong Kong"]
    for i in range(0, 4):
        energy = energy.replace(old[i], new[i])
    # I'm trying to remove it here =====>
    p = "("
    for j in range(16, 243):
        if p in energy.iloc[j]['Country']:
            country = ""
            for c in energy.iloc[j]['Country']:
                while(c != p & !c.isnumeric()):
                    country = c + country
            energy = energy.replace(energy.iloc[j]['Country'], country)
    return energy
Here is the .xls file I'm working on: https://drive.google.com/file/d/0B80lepon1RrYeDRNQVFWYVVENHM/view?usp=sharing
Use str.extract:
energy['country'] = energy['country'].str.extract('(^[a-zA-Z]+)', expand=False)
df
country
0 Bolivia (Plurinational State of)
1 Switzerland17
df['country'] = df['country'].str.extract('(^[a-zA-Z]+)', expand=False)
df
country
0 Bolivia
1 Switzerland
To handle countries with spaces in their names (very common), a small improvement to the regex would be enough.
df
country
0 Bolivia (Plurinational State of)
1 Switzerland17
2 West Indies (foo bar)
df['country'] = df['country'].str.extract('(^[a-zA-Z\s]+)', expand=False).str.strip()
df
country
0 Bolivia
1 Switzerland
2 West Indies
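An alternative sketch, if you prefer to strip the unwanted parts instead of extracting the wanted ones (this also leaves names containing hyphens or commas intact):

# remove any parenthesised suffix and any digits, then trim whitespace
df['country'] = (df['country']
                 .str.replace(r'\s*\(.*?\)|\d+', '', regex=True)
                 .str.strip())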
I have two data frames: Disaster, CountryInfo
Disaster has a column Country_code which has some null values,
for example:
Disaster:
Country       - Country_code
India         - Null
Afghanistan   - AFD
India         - IND
United States - Null
CountryInfo:
CountryName   - ISO
India         - IND
Afganistan    - AFD
United States - US
I need to fill in the country code with reference to the country name. Can anyone suggest a solution for this?
You can simply use map with a Series.
With this approach, all values are overwritten, not only the NaN ones.
import numpy as np
import pandas as pd

# Test data
disaster = pd.DataFrame({'Country': ['India', 'Afghanistan', 'India', 'United States'],
                         'Country_code': [np.nan, 'AFD', 'IND', np.nan]})
country = pd.DataFrame({'Country': ['India', 'Afghanistan', 'United States'],
                        'Country_code': ['IND', 'AFD', 'US']})

# Transforming country into a Series in order to be able to map it directly
country_se = country.set_index('Country').loc[:, 'Country_code']

# Map
disaster['Country_code'] = disaster['Country'].map(country_se)
print(disaster)
# Country Country_code
# 0 India IND
# 1 Afghanistan AFD
# 2 India IND
# 3 United States US
You can filter on NaN if you do not want to overwrite all values.
disaster.loc[pd.isnull(disaster['Country_code']),
'Country_code'] = disaster['Country'].map(country_se)
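Equivalently, as a small alternative sketch, fillna only fills the missing codes and leaves the existing ones untouched:

# fill only the NaN codes, leaving existing ones as they are
disaster['Country_code'] = disaster['Country_code'].fillna(disaster['Country'].map(country_se))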