QUESTION: Is there a function to simplify this whole process?
I was trying to clean some data.
Data source: UN Energy Table (the link automatically downloads an xls file). The data in question is the 'Country' column after the xls file is loaded into a DataFrame.
The task was to remove the parentheses and numbers attached to the country names. What I did was find all country names containing numbers or parentheses, turn them into a list, then find the clean name and replace them one by one in a loop.
# Finding all dirty country names
dirtyNames = df1[df1['Country'].str.contains('[A-Za-z ][0-9/(/)]')==True]
# Changing them to list
dirtyNames = dirtyNames['Country'].tolist()
for name in dirtyNames:
    clean = re.split('[0-9/(/)]', name)[0]
    df1.replace(name, clean, inplace=True)
But is there a function for this? I feel there must be one, rather than having to write a loop.
I tried examples from the internet, adapting them to my dataset, such as
df['first_five_Letter']=df['Country (region)'].str.extract(r'(^\w{5})')
and other similar methods, but I keep getting the error "AttributeError: Can only use .str accessor with string values!".
There is a very straightforward way of doing it by chaining regexes.
import re
# s is some list of country names with numbers and parentheses
for i, x in enumerate(s):
    s[i] = re.sub(r"[0-9]", "", re.sub(r"\)", "", re.sub(r"\(", "", x)))
or, if 's' is a column in the DataFrame 'df':
for i, each in df['s'].items():  # Series.iteritems() was removed in pandas 2.0
    df.loc[i, 's'] = re.sub(r"[0-9]", "", re.sub(r"\)", "", re.sub(r"\(", "", each)))
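As a side note, the three nested re.sub calls can be collapsed into a single call with one character class. This is a minimal sketch; the sample names are made up, and strip_chained and strip_single are illustrative helper names, not part of the answer above.

```python
import re

# Hypothetical sample values; the real ones come from the 'Country' column.
names = ["Australia1", "Bolivia (Plurinational State of)", "Switzerland17", "Chad"]

def strip_chained(name):
    # The chained version: remove "(", then ")", then digits.
    return re.sub(r"[0-9]", "", re.sub(r"\)", "", re.sub(r"\(", "", name)))

def strip_single(name):
    # One pass with a character class matching digits and both parentheses.
    return re.sub(r"[0-9()]", "", name)

cleaned = [strip_single(n) for n in names]
print(cleaned)
```

Both variants only delete the parenthesis characters themselves, so any text inside the parentheses stays in the name.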
You can use
import pandas as pd
import numpy as np
df = pd.DataFrame({'Country':['XXX(12)', 'YYYY5000', '(ZZZ)15', np.nan]})
df.loc[pd.isna(df['Country']), 'Country'] = ""
df['Country'] = df['Country'].astype(str).str.replace(r'[0-9()]+', '', regex=True)
df.loc[df['Country'] == '', 'Country'] = np.nan
Here,
df.loc[pd.isna(df['Country']), 'Country'] = "" - converts all NaN values to empty strings
.astype(str) - converts data to string type
.str.replace(r'[0-9()]+', '', regex=True) - removes all digits, ( and ) chars
df.loc[df['Country'] == '', 'Country'] = np.nan - converts empty strings back to NaN.
Since you want to remove trailing numbers, trailing text in parentheses, or both (with the numbers always coming last) from the end of the string, you can do that with a single regex replacement:
df = pd.read_excel('Energy Indicators.xls', skiprows=17, usecols='C:F',names=['Country', 'Supply', 'per Capita', 'Renewable'], skipfooter=38)
df['Country'] = df['Country'].str.replace(r'(?:\s*\(.*\))?\d*$', '', regex=True)
Tested on the actual dataset, this gives for df['Country']:
[
'Afghanistan', 'Albania', 'Algeria', 'American Samoa', 'Andorra',
'Angola', 'Anguilla', 'Antigua and Barbuda', 'Argentina', 'Armenia',
'Aruba', 'Australia', 'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain',
'Bangladesh', 'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin',
'Bermuda', 'Bhutan', 'Bolivia', 'Bonaire, Sint Eustatius and Saba',
'Bosnia and Herzegovina', 'Botswana', 'Brazil', 'British Virgin Islands',
'Brunei Darussalam', 'Bulgaria', 'Burkina Faso', 'Burundi', 'Cabo Verde',
'Cambodia', 'Cameroon', 'Canada', 'Cayman Islands', 'Central African Republic',
'Chad', 'Chile', 'China', 'China, Hong Kong Special Administrative Region',
'China, Macao Special Administrative Region', 'Colombia', 'Comoros', 'Congo',
'Cook Islands', 'Costa Rica', "Côte d'Ivoire", 'Croatia', 'Cuba', 'Curaçao',
'Cyprus', 'Czech Republic', "Democratic People's Republic of Korea",
'Democratic Republic of the Congo', 'Denmark', 'Djibouti', 'Dominica',
'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador', 'Equatorial Guinea',
'Eritrea', 'Estonia', 'Ethiopia', 'Faeroe Islands', 'Falkland Islands',
'Fiji', 'Finland', 'France', 'French Guiana', 'French Polynesia', 'Gabon',
'Gambia', 'Georgia', 'Germany', 'Ghana', 'Gibraltar', 'Greece', 'Greenland',
'Grenada', 'Guadeloupe', 'Guam', 'Guatemala', 'Guernsey', 'Guinea', 'Guinea-Bissau',
'Guyana', 'Haiti', 'Honduras', 'Hungary', 'Iceland', 'India', 'Indonesia',
'Iran', 'Iraq', 'Ireland', 'Isle of Man', 'Israel', 'Italy', 'Jamaica',
'Japan', 'Jersey', 'Jordan', 'Kazakhstan', 'Kenya', 'Kiribati', 'Kuwait',
'Kyrgyzstan', "Lao People's Democratic Republic", 'Latvia', 'Lebanon',
'Lesotho', 'Liberia', 'Libya', 'Liechtenstein', 'Lithuania', 'Luxembourg',
'Madagascar', 'Malawi', 'Malaysia', 'Maldives', 'Mali', 'Malta', 'Marshall Islands',
'Martinique', 'Mauritania', 'Mauritius', 'Mexico', 'Micronesia', 'Mongolia',
'Montenegro', 'Montserrat', 'Morocco', 'Mozambique', 'Myanmar', 'Namibia',
'Nauru', 'Nepal', 'Netherlands', 'New Caledonia', 'New Zealand', 'Nicaragua',
'Niger', 'Nigeria', 'Niue', 'Northern Mariana Islands', 'Norway', 'Oman',
'Pakistan', 'Palau', 'Panama', 'Papua New Guinea', 'Paraguay', 'Peru',
'Philippines', 'Poland', 'Portugal', 'Puerto Rico', 'Qatar', 'Republic of Korea',
'Republic of Moldova', 'Réunion', 'Romania', 'Russian Federation', 'Rwanda',
'Saint Helena', 'Saint Kitts and Nevis', 'Saint Lucia', 'Saint Pierre and Miquelon',
'Saint Vincent and the Grenadines', 'Samoa', 'Sao Tome and Principe',
'Saudi Arabia', 'Senegal', 'Serbia', 'Seychelles', 'Sierra Leone',
'Singapore', 'Sint Maarten', 'Slovakia', 'Slovenia', 'Solomon Islands',
'Somalia', 'South Africa', 'South Sudan', 'Spain', 'Sri Lanka', 'State of Palestine',
'Sudan', 'Suriname', 'Swaziland', 'Sweden', 'Switzerland', 'Syrian Arab Republic',
'Tajikistan', 'Thailand', 'The former Yugoslav Republic of Macedonia',
'Timor-Leste', 'Togo', 'Tonga', 'Trinidad and Tobago', 'Tunisia', 'Turkey',
'Turkmenistan', 'Turks and Caicos Islands', 'Tuvalu', 'Uganda', 'Ukraine',
'United Arab Emirates', 'United Kingdom of Great Britain and Northern Ireland',
'United Republic of Tanzania', 'United States of America', 'United States Virgin Islands',
'Uruguay', 'Uzbekistan', 'Vanuatu', 'Venezuela', 'Viet Nam',
'Wallis and Futuna Islands', 'Yemen', 'Zambia', 'Zimbabwe'
]
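To see what the pattern (?:\s*\(.*\))?\d*$ does without loading the spreadsheet, it can be tried with the plain re module. This is a small sketch: the sample values are invented in the style of the spreadsheet, and clean_country is an illustrative name.

```python
import re

# An optional "(...)" group with leading whitespace, followed by optional
# trailing digits, anchored at the end of the string.
TRAILING_NOISE = re.compile(r"(?:\s*\(.*\))?\d*$")

def clean_country(name):
    """Strip a trailing parenthesised qualifier and/or trailing digits."""
    return TRAILING_NOISE.sub("", name)

# Hypothetical raw values resembling the 'Country' column.
samples = [
    "Australia1",
    "Bolivia (Plurinational State of)",
    "China, Hong Kong Special Administrative Region3",
    "Falkland Islands (Malvinas)",
    "Chad",
]

for raw in samples:
    print(repr(raw), "->", repr(clean_country(raw)))
```

Unlike removing only the characters "(", ")" and the digits, this also drops the text inside the parentheses and the space before them.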
I have a big list of countries which I want the user to choose one or more countries from.
I found this solution which is fitting my exact need:
How do I enable multiple selection of values from a combobox?
The only thing that is still not ideal is that the dropdown menu is as tall as the screen and still needs scrolling.
Is it possible to limit the number of items shown, e.g. to 10 items, and then use the existing scroll up/down?
I know that this is possible for Tkinter's Combobox, but it doesn't offer multiple selection.
Here is my example code:
countries = ['Afghanistan', 'Albania', 'Algeria', 'Andorra', 'Angola', 'Anguilla', 'Antigua And Barbuda', 'Argentina', 'Armenia', 'Aruba', 'Australia', 'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin', 'Bermuda', 'Bhutan', 'Bolivia', 'Bosnia And Herzegovina', 'Botswana', 'Bouvet Island', 'Brazil', 'British Virgin Islands', 'Brunei', 'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon', 'Canada', 'Cape Verde', 'Cayman Islands', 'Chad', 'Chile', 'China', 'Colombia', 'Comoros', 'Congo', 'Cook Islands', 'Costa Rica', 'Croatia', 'Curacao', 'Cyprus', 'Czech Republic', 'Democratic Republic Of The Congo', 'Denmark', 'Djibouti', 'Dominica', 'Dominican Republic', 'East Timor', 'Ecuador', 'Egypt', 'El Salvador', 'Equatorial Guinea', 'Estonia', 'Ethiopia', 'Faroe Islands', 'Fiji', 'Finland', 'France', 'French Guiana', 'French Polynesia', 'Gabon', 'Gambia', 'Georgia', 'Germany', 'Ghana', 'Gibraltar', 'Greece', 'Greenland', 'Grenada', 'Guadeloupe', 'Guam', 'Guatemala', 'Guernsey', 'Guinea', 'Guinea-Bissau', 'Guyana', 'Haiti', 'Honduras', 'Hong Kong', 'Hungary', 'Iceland', 'India', 'Indonesia', 'Iran', 'Iraq', 'Ireland', 'Isle Of Man', 'Israel', 'Italy', 'Ivory Coast', 'Jamaica', 'Japan', 'Jersey', 'Jordan', 'Kazakhstan', 'Kenya', 'Kosovo', 'Kuwait', 'Kyrgyzstan', 'Laos', 'Latvia', 'Lebanon', 'Lesotho', 'Liberia', 'Libya', 'Libyan Arab Jamahiriya', 'Liechtenstein', 'Lithuania', 'Luxembourg', 'Macao', 'Macau', 'Macedonia', 'Madagascar', 'Malawi', 'Malaysia', 'Maldives', 'Mali', 'Malta', 'Martinique', 'Mauritania', 'Mauritius', 'Mexico', 'Moldova', 'Monaco', 'Mongolia', 'Montenegro', 'Morocco', 'Mozambique', 'Myanmar', 'Namibia', 'Nepal', 'Netherlands', 'Netherlands Antilles', 'New Caledonia', 'New Zealand', 'Nicaragua', 'Niger', 'Nigeria', 'Norway', 'Oman', 'Pakistan', 'Palestine', 'Panama', 'Papua New Guinea', 'Paraguay', 'Peru', 'Philippines', 'Poland', 'Portugal', 'Puerto Rico', 'Qatar', 'Reunion', 
'Romania', 'Russia', 'Russian Federation', 'Rwanda', 'Saint Kitts And Nevis', 'Saint Lucia', 'Saint Martin', 'Saint Pierre And Miquelon', 'Saint Vincent And The Grenadines', 'Samoa', 'San Marino', 'Saudi Arabia', 'Senegal', 'Serbia', 'Seychelles', 'Sierra Leone', 'Singapore', 'Slovakia', 'Slovenia', 'Somalia', 'South Africa', 'South Korea', 'South Sudan', 'Spain', 'Sri Lanka', 'Sudan', 'Suriname', 'Swaziland', 'Sweden', 'Switzerland', 'Taiwan', 'Tajikistan', 'Tanzania', 'Tanzania, United Republic Of', 'Thailand', 'Togo', 'Tonga', 'Trinidad And Tobago', 'Tunisia', 'Turkey', 'Turkmenistan', 'Turks And Caicos Islands', 'U.S. Virgin Islands', 'Uganda', 'Ukraine', 'United Arab Emirates', 'United Kingdom', 'United States', 'Uruguay', 'Uzbekistan', 'Vanuatu', 'Venezuela', 'Vietnam', 'Wallis And Futuna', 'Yemen', 'Zambia', 'Zimbabwe']
def country_confirmation():
    ChooseVAC_choosen_Country = [k for k,v in ChooseVAC_CountryChoices.items() if v.get() == 1]
    if ChooseVAC_choosen_Country != []:
        country_bool = True
    else:
        country_bool = False
    print ("country is / countries are:", ChooseVAC_choosen_Country)
from tkinter import *
root = Tk()
ChooseVAC_choosen_Country = []
country_bool = False
ChooseVAC_countries_StringVar = StringVar()
ChooseVAC_countrySelection = Menubutton(root,text='Choose wisely',indicatoron=True, borderwidth=1, relief="raised")
ChooseVAC_menu = Menu(ChooseVAC_countrySelection, tearoff=False)
ChooseVAC_countrySelection.configure(menu=ChooseVAC_menu)
ChooseVAC_CountryChoices = {}
for CountryChoice in countries:
    ChooseVAC_CountryChoices[CountryChoice] = IntVar(value=0)
    ChooseVAC_menu.add_checkbutton(label=CountryChoice, variable=ChooseVAC_CountryChoices[CountryChoice],
                                   onvalue=1, offvalue=0,
                                   command=country_confirmation)
ChooseVAC_countrySelection.grid(row=0,column=1,sticky='nsew')
root.mainloop()
EDIT:
I used @JohnT's answer but added some functionality so the user doesn't have to confirm each selection:
import tkinter as tk
window = tk.Tk()
#generic window size, showing listbox is smaller than window
window.geometry("600x480")
frame = tk.Frame(window)
frame.pack()
def select(evt):
    event = evt.widget
    output = []
    selection = event.curselection()
    #.curselection() returns a list of the indexes selected
    #need to loop over the list of indexes with .get()
    for i in selection:
        o = listBox.get(i)
        output.append(o)
    print(output)
listBox = tk.Listbox(frame, width=20, height = 5, selectmode='multiple')
#height/width are characters wide and tall,
#height = 20 will show first 20 items in list
#change the font size to scale to the desired height once the number of items shown is set
#i recommend setting width to the length of the name of your longest country name +1
listBox.bind('<<ListboxSelect>>',select)
listBox.pack(side="left", fill="y")
scrollbar = tk.Scrollbar(frame, orient="vertical")
scrollbar.config(command=listBox.yview)
scrollbar.pack(side="right", fill="y")
listBox.config(yscrollcommand=scrollbar.set)
for x in range(100):
    listBox.insert('end', str(x))
window.mainloop()
Here is a link to the post I got the input from: Getting a callback when a Tkinter Listbox selection is changed?
I believe Listbox will handle everything you want it to. More info on this widget here:
http://effbot.org/tkinterbook/listbox.htm
Here's sample code using a listbox that is scrollable and allows multiple selection, printing the selection when a button is pressed:
import tkinter as tk
window = tk.Tk()
#generic window size, showing listbox is smaller than window
window.geometry("600x480")
frame = tk.Frame(window)
frame.pack()
listBox = tk.Listbox(frame, width=20, height=20, selectmode='multiple')
#height/width are characters wide and tall,
#height = 20 will show first 20 items in list
#change the font size to scale to the desired height once the number of items shown is set
#i recommend setting width to the length of the name of your longest country name +1
listBox.pack(side="left", fill="y")
scrollbar = tk.Scrollbar(frame, orient="vertical")
scrollbar.config(command=listBox.yview)
scrollbar.pack(side="right", fill="y")
listBox.config(yscrollcommand=scrollbar.set)
for x in range(100):
    listBox.insert('end', str(x))
def select():
    output = []
    selection = listBox.curselection()
    #.curselection() returns a list of the indexes selected
    #need to loop over the list of indexes with .get()
    for i in selection:
        o = listBox.get(i)
        output.append(o)
    print(output)
outBtn = tk.Button(window, text = 'print selection', command = select)
outBtn.pack()
window.mainloop()
Is there an automated way to check if a country name entered is one of the countries of the world in python (i.e., is there an automated way to get a list of all the countries of the world?)
You can use pycountry to get a list of all the countries:
pip install pycountry
Or you can use this list of (code, name) tuples:
Country = [
('US', 'United States'),
('AF', 'Afghanistan'),
('AL', 'Albania'),
('DZ', 'Algeria'),
('AS', 'American Samoa'),
('AD', 'Andorra'),
('AO', 'Angola'),
('AI', 'Anguilla'),
('AQ', 'Antarctica'),
('AG', 'Antigua And Barbuda'),
('AR', 'Argentina'),
('AM', 'Armenia'),
('AW', 'Aruba'),
('AU', 'Australia'),
('AT', 'Austria'),
('AZ', 'Azerbaijan'),
('BS', 'Bahamas'),
('BH', 'Bahrain'),
('BD', 'Bangladesh'),
('BB', 'Barbados'),
('BY', 'Belarus'),
('BE', 'Belgium'),
('BZ', 'Belize'),
('BJ', 'Benin'),
('BM', 'Bermuda'),
('BT', 'Bhutan'),
('BO', 'Bolivia'),
('BA', 'Bosnia And Herzegowina'),
('BW', 'Botswana'),
('BV', 'Bouvet Island'),
('BR', 'Brazil'),
('BN', 'Brunei Darussalam'),
('BG', 'Bulgaria'),
('BF', 'Burkina Faso'),
('BI', 'Burundi'),
('KH', 'Cambodia'),
('CM', 'Cameroon'),
('CA', 'Canada'),
('CV', 'Cape Verde'),
('KY', 'Cayman Islands'),
('CF', 'Central African Rep'),
('TD', 'Chad'),
('CL', 'Chile'),
('CN', 'China'),
('CX', 'Christmas Island'),
('CC', 'Cocos Islands'),
('CO', 'Colombia'),
('KM', 'Comoros'),
('CG', 'Congo'),
('CK', 'Cook Islands'),
('CR', 'Costa Rica'),
('CI', 'Cote D`ivoire'),
('HR', 'Croatia'),
('CU', 'Cuba'),
('CY', 'Cyprus'),
('CZ', 'Czech Republic'),
('DK', 'Denmark'),
('DJ', 'Djibouti'),
('DM', 'Dominica'),
('DO', 'Dominican Republic'),
('TP', 'East Timor'),
('EC', 'Ecuador'),
('EG', 'Egypt'),
('SV', 'El Salvador'),
('GQ', 'Equatorial Guinea'),
('ER', 'Eritrea'),
('EE', 'Estonia'),
('ET', 'Ethiopia'),
('FK', 'Falkland Islands (Malvinas)'),
('FO', 'Faroe Islands'),
('FJ', 'Fiji'),
('FI', 'Finland'),
('FR', 'France'),
('GF', 'French Guiana'),
('PF', 'French Polynesia'),
('TF', 'French S. Territories'),
('GA', 'Gabon'),
('GM', 'Gambia'),
('GE', 'Georgia'),
('DE', 'Germany'),
('GH', 'Ghana'),
('GI', 'Gibraltar'),
('GR', 'Greece'),
('GL', 'Greenland'),
('GD', 'Grenada'),
('GP', 'Guadeloupe'),
('GU', 'Guam'),
('GT', 'Guatemala'),
('GN', 'Guinea'),
('GW', 'Guinea-bissau'),
('GY', 'Guyana'),
('HT', 'Haiti'),
('HN', 'Honduras'),
('HK', 'Hong Kong'),
('HU', 'Hungary'),
('IS', 'Iceland'),
('IN', 'India'),
('ID', 'Indonesia'),
('IR', 'Iran'),
('IQ', 'Iraq'),
('IE', 'Ireland'),
('IL', 'Israel'),
('IT', 'Italy'),
('JM', 'Jamaica'),
('JP', 'Japan'),
('JO', 'Jordan'),
('KZ', 'Kazakhstan'),
('KE', 'Kenya'),
('KI', 'Kiribati'),
('KP', 'Korea (North)'),
('KR', 'Korea (South)'),
('KW', 'Kuwait'),
('KG', 'Kyrgyzstan'),
('LA', 'Laos'),
('LV', 'Latvia'),
('LB', 'Lebanon'),
('LS', 'Lesotho'),
('LR', 'Liberia'),
('LY', 'Libya'),
('LI', 'Liechtenstein'),
('LT', 'Lithuania'),
('LU', 'Luxembourg'),
('MO', 'Macau'),
('MK', 'Macedonia'),
('MG', 'Madagascar'),
('MW', 'Malawi'),
('MY', 'Malaysia'),
('MV', 'Maldives'),
('ML', 'Mali'),
('MT', 'Malta'),
('MH', 'Marshall Islands'),
('MQ', 'Martinique'),
('MR', 'Mauritania'),
('MU', 'Mauritius'),
('YT', 'Mayotte'),
('MX', 'Mexico'),
('FM', 'Micronesia'),
('MD', 'Moldova'),
('MC', 'Monaco'),
('MN', 'Mongolia'),
('MS', 'Montserrat'),
('MA', 'Morocco'),
('MZ', 'Mozambique'),
('MM', 'Myanmar'),
('NA', 'Namibia'),
('NR', 'Nauru'),
('NP', 'Nepal'),
('NL', 'Netherlands'),
('AN', 'Netherlands Antilles'),
('NC', 'New Caledonia'),
('NZ', 'New Zealand'),
('NI', 'Nicaragua'),
('NE', 'Niger'),
('NG', 'Nigeria'),
('NU', 'Niue'),
('NF', 'Norfolk Island'),
('MP', 'Northern Mariana Islands'),
('NO', 'Norway'),
('OM', 'Oman'),
('PK', 'Pakistan'),
('PW', 'Palau'),
('PA', 'Panama'),
('PG', 'Papua New Guinea'),
('PY', 'Paraguay'),
('PE', 'Peru'),
('PH', 'Philippines'),
('PN', 'Pitcairn'),
('PL', 'Poland'),
('PT', 'Portugal'),
('PR', 'Puerto Rico'),
('QA', 'Qatar'),
('RE', 'Reunion'),
('RO', 'Romania'),
('RU', 'Russian Federation'),
('RW', 'Rwanda'),
('KN', 'Saint Kitts And Nevis'),
('LC', 'Saint Lucia'),
('VC', 'St Vincent/Grenadines'),
('WS', 'Samoa'),
('SM', 'San Marino'),
('ST', 'Sao Tome'),
('SA', 'Saudi Arabia'),
('SN', 'Senegal'),
('SC', 'Seychelles'),
('SL', 'Sierra Leone'),
('SG', 'Singapore'),
('SK', 'Slovakia'),
('SI', 'Slovenia'),
('SB', 'Solomon Islands'),
('SO', 'Somalia'),
('ZA', 'South Africa'),
('ES', 'Spain'),
('LK', 'Sri Lanka'),
('SH', 'St. Helena'),
('PM', 'St.Pierre'),
('SD', 'Sudan'),
('SR', 'Suriname'),
('SZ', 'Swaziland'),
('SE', 'Sweden'),
('CH', 'Switzerland'),
('SY', 'Syrian Arab Republic'),
('TW', 'Taiwan'),
('TJ', 'Tajikistan'),
('TZ', 'Tanzania'),
('TH', 'Thailand'),
('TG', 'Togo'),
('TK', 'Tokelau'),
('TO', 'Tonga'),
('TT', 'Trinidad And Tobago'),
('TN', 'Tunisia'),
('TR', 'Turkey'),
('TM', 'Turkmenistan'),
('TV', 'Tuvalu'),
('UG', 'Uganda'),
('UA', 'Ukraine'),
('AE', 'United Arab Emirates'),
('UK', 'United Kingdom'),
('UY', 'Uruguay'),
('UZ', 'Uzbekistan'),
('VU', 'Vanuatu'),
('VA', 'Vatican City State'),
('VE', 'Venezuela'),
('VN', 'Viet Nam'),
('VG', 'Virgin Islands (British)'),
('VI', 'Virgin Islands (U.S.)'),
('EH', 'Western Sahara'),
('YE', 'Yemen'),
('YU', 'Yugoslavia'),
('ZR', 'Zaire'),
('ZM', 'Zambia'),
('ZW', 'Zimbabwe')
]
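A minimal sketch of how such a list of (code, name) tuples could be used to validate user input. Only a three-entry excerpt is repeated here to keep it short, and COUNTRY_PAIRS, NAME_BY_CODE, KNOWN_NAMES and is_country are illustrative names, not part of the list above.

```python
# A short excerpt in the same (code, name) shape as the full list.
COUNTRY_PAIRS = [
    ('US', 'United States'),
    ('AF', 'Afghanistan'),
    ('NZ', 'New Zealand'),
]

# Build the lookup structures once; casefold() makes name matching
# case-insensitive.
NAME_BY_CODE = {code: name for code, name in COUNTRY_PAIRS}
KNOWN_NAMES = {name.casefold() for _, name in COUNTRY_PAIRS}

def is_country(text):
    """Return True if text matches a known country name or two-letter code."""
    cleaned = text.strip()
    return cleaned.casefold() in KNOWN_NAMES or cleaned.upper() in NAME_BY_CODE

print(is_country('new zealand'))
print(is_country('Narnia'))
```

A set and a dict give constant-time lookups, which matters little for a couple of hundred countries but keeps the check to a single line.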
Update 2021: The module has been updated to address the shortcomings mentioned by @JurajBezručka.
I know this has been asked 8 months ago, but here is a pretty good solution in case you are coming from Google (just like me).
You can use the ISO standard library located here:
https://pypi.python.org/pypi/iso3166/
This piece of code is taken from that link in case you get a 404 Error some time in the future:
Installation:
pip install iso3166
Country Details:
>>> from iso3166 import countries
>>> countries.get('us')
Country(name=u'United States', alpha2='US', alpha3='USA', numeric='840')
>>> countries.get('ala')
Country(name=u'\xc5land Islands', alpha2='AX', alpha3='ALA', numeric='248')
>>> countries.get(8)
Country(name=u'Albania', alpha2='AL', alpha3='ALB', numeric='008')
Countries List:
>>> from iso3166 import countries
>>> for c in countries:
...     print(c)
Country(name=u'Afghanistan', alpha2='AF', alpha3='AFG', numeric='004')
Country(name=u'\xc5land Islands', alpha2='AX', alpha3='ALA', numeric='248')
Country(name=u'Albania', alpha2='AL', alpha3='ALB', numeric='008')
Country(name=u'Algeria', alpha2='DZ', alpha3='DZA', numeric='012')
...
This package is a good fit if you want to follow the standard proposed by ISO. According to Wikipedia:
ISO 3166 is a standard published by the International Organization for Standardization (ISO) that defines codes for the names of countries, dependent territories, special areas of geographical interest, and their principal subdivisions (e.g., provinces or states). The official name of the standard is Codes for the representation of names of countries and their subdivisions.
Hence, I strongly recommend using this library in all your apps in case you are working with Countries.
Hope this piece of data is useful for the community!
Chances are you've already got pytz installed in your project, e.g. if you're using Django.
Here's a note from the pytz documentation:
The Olson database comes with a ISO 3166 country code to English country name mapping that pytz exposes as a dictionary:
>>> print(pytz.country_names['nz'])
New Zealand
So, it may be convenient to use the pytz.country_names dictionary.
I'm not sure how up to date that ISO 3166 table is, but at least pytz itself is well maintained, and it is currently (i.e. June 2020) in the top 20 "most downloaded past month" from PyPI, according to https://pypistats.org/top, so it is probably not a bad one to have, as far as external dependencies go.
Although this post is old and has been answered, I would still like to contribute my solution to the question asked:
I have written a function in Python which can be used to find incorrect country names in a data set.
For example:
We have a list of country names which we want to check for invalid entries:
['UNITED STATES OF AMERICA',
'UNISTED STATES OF AMERICA',
'UNITED KINGDOM',
'UNTED KINGDOM',
'GERMANY',
'MALAYSIA',
....
]
(Note: I have converted the list elements to upper case so that my function can compare them case-insensitively.)
This list has incorrect/misspelled country names such as Unisted States of America and Unted Kingdom.
To catch such anomalies I have written a function which identifies invalid country names.
This function uses the 'pycountry' library, which contains ISO country names. It provides the two-letter country code, three-letter country code, name, common name, official name and numeric country code.
Function definition:
import pycountry as pc

def country_name_check():
    # input_country_list is expected to be defined at module level
    # (the upper-case list of names shown above).
    pycntrylst = list(pc.countries)
    alpha_2 = []
    alpha_3 = []
    name = []
    common_name = []
    official_name = []
    invalid_countrynames = []
    # Alternative spellings that should not be reported as invalid
    tobe_deleted = ['IRAN','SOUTH KOREA','NORTH KOREA','SUDAN','MACAU','REPUBLIC OF IRELAND']
    for i in pycntrylst:
        alpha_2.append(i.alpha_2)
        alpha_3.append(i.alpha_3)
        name.append(i.name)
        if hasattr(i, "common_name"):
            common_name.append(i.common_name)
        else:
            common_name.append("")
        if hasattr(i, "official_name"):
            official_name.append(i.official_name)
        else:
            official_name.append("")
    for j in input_country_list:
        if j not in map(str.upper,alpha_2) and j not in map(str.upper,alpha_3) and j not in map(str.upper,name) and j not in map(str.upper,common_name) and j not in map(str.upper,official_name):
            invalid_countrynames.append(j)
    # Remove duplicates, then drop the accepted alternative spellings
    invalid_countrynames = list(set(invalid_countrynames))
    invalid_countrynames = [item for item in invalid_countrynames if item not in tobe_deleted]
    print(invalid_countrynames)
    return invalid_countrynames
This function compares country name coming in the input list with each of the following provided by pycountry.countries:
alpha_2 : Two character country code
alpha_3 : Three character country code
name: Country name
common name : Common name for the country
official name : Official name for the country
The comparison is done after converting each of the above attributes to upper case, since the input country name list is also in upper case.
Another thing to note is that I have created a list called 'tobe_deleted' in the function definition. This list contains countries for which pycountry uses a different version of the name; we do not want these to be reported as invalid country names when the function is called.
Example:
MACAU is also spelled MACAO, so both names are valid. However, pycountry.countries has only one entry, spelled MACAO. country_name_check() can handle both MACAO and MACAU.
Similarly, pycountry.countries has an entry for IRELAND with name='Ireland'. However, it is also sometimes referred to as 'Republic of Ireland'. country_name_check() can handle both 'Ireland' and 'Republic of Ireland' in the input data set.
I hope this function helps anyone who has faced issues with invalid country names in data sets during data analysis. Thanks for reading my post; any suggestions and feedback to improve this function are welcome.
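The same comparison can be sketched with a single set of valid upper-case strings, which avoids rebuilding the upper-cased lists for every input name. This is an illustration under assumptions: find_invalid_names is a hypothetical helper, and the records below are hand-written stand-ins that only mimic the attribute names used above (with pycountry installed, pycountry.countries would be passed instead).

```python
from types import SimpleNamespace

def find_invalid_names(input_names, records, aliases=()):
    """Return the sorted input names that match no code, name or alias."""
    valid = set()
    for rec in records:
        for attr in ("alpha_2", "alpha_3", "name", "common_name", "official_name"):
            value = getattr(rec, attr, None)  # not every record has every field
            if value:
                valid.add(value.upper())
    # Accepted alternative spellings, like the 'tobe_deleted' list above.
    valid.update(alias.upper() for alias in aliases)
    return sorted({n for n in input_names if n.upper() not in valid})

# Stand-in records (hypothetical data, not loaded from pycountry).
records = [
    SimpleNamespace(alpha_2="US", alpha_3="USA", name="United States",
                    official_name="United States of America"),
    SimpleNamespace(alpha_2="GB", alpha_3="GBR", name="United Kingdom"),
    SimpleNamespace(alpha_2="MO", alpha_3="MAC", name="Macao"),
]

names = ["UNITED STATES OF AMERICA", "UNISTED STATES OF AMERICA",
         "UNITED KINGDOM", "UNTED KINGDOM", "MACAU"]
invalid = find_invalid_names(names, records, aliases=["MACAU"])
print(invalid)
```

Passing the aliases as an argument keeps the list of accepted alternative spellings outside the function, so it can grow without editing the code.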
You can use pycountry to get a list of all countries, along with many other options. Just follow these steps:
pip install pycountry
import pycountry
def get_countries():
    for x in pycountry.countries:
        print(x.alpha_3 + ' -- ' + x.name)
It will print each country's short code with the country name.
It has other fields; you can check them with:
help(pycountry.countries)
You can do it like:
import requests
req = requests.get('https://raw.githubusercontent.com/Miguel-Frazao/world-data/master/countries_data.json').json()
# a list rather than a generator, so it can be reused after printing
countries = [i['name'] for i in req]
print(countries)
You can save them to a file so you don't have to do requests all the time, or just copy and paste into your code.
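Saving the names to a file could look roughly like this. It is only a sketch: load_country_names is a hypothetical helper, and the fetch argument stands for whatever function performs the download (for example one wrapping the requests call above).

```python
import json
from pathlib import Path

def load_country_names(cache_path, fetch):
    """Return country names from cache_path if it exists, else call fetch().

    fetch is any zero-argument callable returning an iterable of names;
    its result is written to cache_path as JSON for later runs.
    """
    path = Path(cache_path)
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    names = list(fetch())
    path.write_text(json.dumps(names), encoding="utf-8")
    return names
```

On the first call the names are fetched and written to disk; later calls read the file and never touch the network. Deleting the file forces a refresh.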
Then, to check if the country exists you can do:
...
country = input('Insert a country')
if country not in countries:
    print('nice try, but invalid')
else:
    print('yooo, your country is {}'.format(country))
There is more data about each country in case you need it; you can check it at the URL requested in the code.
This is a crude start that uses the country names gleaned from https://www.iso.org/obp/ui/#search. The country names still contain some tricky cases. For instance, this code recognises 'Samoa' but it's not really 'seeing' 'American Samoa'.
class Countries:
    def __init__(self):
        self.__countries = [
            'afghanistan', 'aland islands', 'albania', 'algeria', 'american samoa', 'andorra', 'angola', 'anguilla',
            'antarctica', 'antigua and barbuda', 'argentina', 'armenia', 'aruba', 'australia', 'austria', 'azerbaijan',
            'bahamas (the)', 'bahrain', 'bangladesh', 'barbados', 'belarus', 'belgium', 'belize', 'benin', 'bermuda',
            'bhutan', 'bolivia (plurinational state of)', 'bonaire, sint eustatius and saba', 'bosnia and herzegovina',
            'botswana', 'bouvet island', 'brazil', 'british indian ocean territory (the)', 'brunei darussalam',
            'bulgaria', 'burkina faso', 'burundi', 'cabo verde', 'cambodia', 'cameroon', 'canada',
            'cayman islands (the)', 'central african republic (the)', 'chad', 'chile', 'china', 'christmas island',
            'cocos (keeling) islands (the)', 'colombia', 'comoros (the)', 'congo (the democratic republic of the)',
            'congo (the)', 'cook islands (the)', 'costa rica', "cote d'ivoire", 'croatia', 'cuba', 'curacao', 'cyprus',
            'czechia', 'denmark', 'djibouti', 'dominica', 'dominican republic (the)', 'ecuador', 'egypt',
            'el salvador', 'equatorial guinea', 'eritrea', 'estonia', 'ethiopia', 'falkland islands (the) [malvinas]',
            'faroe islands (the)', 'fiji', 'finland', 'france', 'french guiana', 'french polynesia',
            'french southern territories (the)', 'gabon', 'gambia (the)', 'georgia', 'germany', 'ghana', 'gibraltar',
            'greece', 'greenland', 'grenada', 'guadeloupe', 'guam', 'guatemala', 'guernsey', 'guinea',
            'guinea-bissau', 'guyana', 'haiti', 'heard island and mcdonald islands', 'holy see (the)', 'honduras',
            'hong kong', 'hungary', 'iceland', 'india', 'indonesia', 'iran (islamic republic of)', 'iraq', 'ireland',
            'isle of man', 'israel', 'italy', 'jamaica', 'japan', 'jersey', 'jordan', 'kazakhstan', 'kenya',
            'kiribati', "korea (the democratic people's republic of)", 'korea (the republic of)', 'kuwait',
            'kyrgyzstan', "lao people's democratic republic (the)", 'latvia', 'lebanon', 'lesotho', 'liberia',
            'libya', 'liechtenstein', 'lithuania', 'luxembourg', 'macao',
            'macedonia (the former yugoslav republic of)', 'madagascar', 'malawi', 'malaysia', 'maldives', 'mali',
            'malta', 'marshall islands (the)', 'martinique', 'mauritania', 'mauritius', 'mayotte', 'mexico',
            'micronesia (federated states of)', 'moldova (the republic of)', 'monaco', 'mongolia', 'montenegro',
            'montserrat', 'morocco', 'mozambique', 'myanmar', 'namibia', 'nauru', 'nepal', 'netherlands (the)',
            'new caledonia', 'new zealand', 'nicaragua', 'niger (the)', 'nigeria', 'niue', 'norfolk island',
            'northern mariana islands (the)', 'norway', 'oman', 'pakistan', 'palau', 'palestine, state of', 'panama',
            'papua new guinea', 'paraguay', 'peru', 'philippines (the)', 'pitcairn', 'poland', 'portugal',
            'puerto rico', 'qatar', 'reunion', 'romania', 'russian federation (the)', 'rwanda', 'saint barthelemy',
            'saint helena, ascension and tristan da cunha', 'saint kitts and nevis', 'saint lucia',
            'saint martin (french part)', 'saint pierre and miquelon', 'saint vincent and the grenadines', 'samoa',
            'san marino', 'sao tome and principe', 'saudi arabia', 'senegal', 'serbia', 'seychelles', 'sierra leone',
            'singapore', 'sint maarten (dutch part)', 'slovakia', 'slovenia', 'solomon islands', 'somalia',
            'south africa', 'south georgia and the south sandwich islands', 'south sudan', 'spain', 'sri lanka',
            'sudan (the)', 'suriname', 'svalbard and jan mayen', 'swaziland', 'sweden', 'switzerland',
            'syrian arab republic', 'taiwan (province of china)', 'tajikistan', 'tanzania, united republic of',
            'thailand', 'timor-leste', 'togo', 'tokelau', 'tonga', 'trinidad and tobago', 'tunisia', 'turkey',
            'turkmenistan', 'turks and caicos islands (the)', 'tuvalu', 'uganda', 'ukraine',
            'united arab emirates (the)', 'united kingdom of great britain and northern ireland (the)',
            'united states minor outlying islands (the)', 'united states of america (the)', 'uruguay', 'uzbekistan',
            'vanuatu', 'venezuela (bolivarian republic of)', 'viet nam', 'virgin islands (british)',
            'virgin islands (u.s.)', 'wallis and futuna', 'western sahara*', 'yemen', 'zambia', 'zimbabwe']
    def __call__(self, name, strict=3):
        result = False
        name = name.lower()
        if strict==3:
            # exact match
            for country in self.__countries:
                if country==name:
                    return True
            else:
                return result
        elif strict==2:
            # name appears anywhere inside a country name
            for country in self.__countries:
                if name in country:
                    return True
            else:
                return result
        elif strict==1:
            # a country name starts with name
            for country in self.__countries:
                if country.startswith(name):
                    return True
            else:
                return result
        else:
            return result
countries = Countries()
print (countries('germany'))
print (countries('russia'))
print (countries('russia', strict=2))
print (countries('russia', strict=1))
print (countries('samoa', strict=2))
print (countries('samoa', strict=1))
Here are the results:
True
False
True
True
True
True
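Some of the tricky cases come from the bracketed qualifiers in the ISO names, such as 'bahamas (the)', which make an exact match on 'bahamas' fail. One possible way to soften that is to normalise the names before comparing. This is a sketch only: normalise is a hypothetical helper and is not part of the class above.

```python
import re

def normalise(name):
    """Lower-case a name and drop bracketed qualifiers like '(the)' or '[malvinas]'."""
    # Remove every "(...)" or "[...]" group together with the whitespace before it.
    name = re.sub(r"\s*[\(\[].*?[\)\]]", "", name.lower())
    # Drop stray footnote asterisks and surrounding spaces.
    return name.strip(" *")

# A few entries copied from the list above.
raw = ['bahamas (the)', 'falkland islands (the) [malvinas]',
       'cocos (keeling) islands (the)', 'western sahara*',
       'samoa', 'american samoa']
known = {normalise(n) for n in raw}

print(sorted(known))
print('bahamas' in known)
```

The trade-off is that distinct entries can collapse into one: 'congo (the)' and 'congo (the democratic republic of the)' both normalise to 'congo', so this suits loose validation better than identifying a specific country.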
Old question but since it came up during my search and no one provided this alternative I'll add it.
https://github.com/SteinRobert/python-restcountries is a Python wrapper for the REST service https://restcountries.eu/.
The wrapper seems to be maintained (it was recently updated to Python 3) and the REST service is maintained by apilayer, so it should be up to date.
pip install python-restcountries
from restcountries import RestCountryApiV2 as rapi
def foo(name):
    country_list = rapi.get_countries_by_name('France')