How to reaggregate a MultiIndex pandas.DataFrame? - python

I have a MultiIndex pandas.DataFrame dcba with a high level of spatial disaggregation:
region FR \
sector Agriculture Crude coal Crude oil Natural gas
region stressor
FR CO2 4.796711 1.382087e-02 3.149139e-05 2.894532
CH4 15.816831 3.744709e-05 3.567591e-04 0.275431
N2O 9.715682 9.290865e-05 5.603963e-07 0.007834
SF6 0.028011 2.818101e-06 2.607044e-08 0.000477
HFC 1.487352 1.473641e-04 1.475096e-06 0.024675
... ... ... ... ...
RoW Middle East CH4 0.455748 3.566337e-05 7.060048e-04 0.035420
N2O 0.193417 1.176733e-06 7.366779e-07 0.002564
SF6 0.001478 7.465562e-08 2.960808e-08 0.000107
HFC 0.006629 3.190865e-07 1.281020e-07 0.000472
PFC 0.001390 1.491053e-07 3.205249e-08 0.000603
region \
sector Extractive industry Biomass_industry Clothing
region stressor
FR CO2 5.817926e-03 9.866832 0.394570
CH4 3.622520e-04 9.923741 0.075845
N2O 1.267742e-04 6.010542 0.027877
SF6 4.797571e-04 0.036355 0.000561
HFC 2.502868e-02 1.894707 0.028297
... ... ... ...
RoW Middle East CH4 1.844972e-04 0.419346 0.193006
N2O 7.236885e-06 0.062690 0.018240
SF6 9.461463e-07 0.001052 0.000477
HFC 4.114220e-06 0.004652 0.002087
PFC 1.314401e-06 0.002939 0.001726
region ... \
sector Heavy_industry Construction Automobile ...
region stressor ...
FR CO2 13.261457 14.029825 2.479608 ...
CH4 0.632317 2.537475 0.319671 ...
N2O 0.196020 0.968326 0.082451 ...
SF6 0.024654 0.054173 0.003670 ...
HFC 1.641677 2.874809 0.197846 ...
... ... ... ... ...
RoW Middle East CH4 0.677210 0.926126 0.325147 ...
N2O 0.049768 0.034912 0.020158 ...
SF6 0.002112 0.001568 0.000955 ...
HFC 0.009280 0.006824 0.004142 ...
PFC 0.011609 0.006201 0.003916 ...
region RoW Middle East \
sector Heavy_industry Construction Automobile
region stressor
FR CO2 0.580714 0.382980 0.162650
CH4 0.046371 0.114092 0.021962
N2O 0.019406 0.059560 0.007892
SF6 0.001126 0.000872 0.000270
HFC 0.073273 0.049812 0.015326
... ... ... ...
RoW Middle East CH4 2.238149 19.760153 1.079266
N2O 0.222995 2.752258 0.069067
SF6 0.009341 0.162138 0.004313
HFC 0.041137 0.702098 0.018245
PFC 0.057405 0.285898 0.007766
region \
sector Oth transport equipment Machinery Electronics
region stressor
FR CO2 0.116935 0.394273 0.080354
CH4 0.016530 0.048727 0.010756
N2O 0.004032 0.018393 0.004233
SF6 0.000166 0.000665 0.000115
HFC 0.008293 0.036774 0.006075
... ... ... ...
RoW Middle East CH4 0.139413 3.370381 0.650511
N2O 0.009559 0.247341 0.058345
SF6 0.000506 0.013730 0.003265
HFC 0.002176 0.056321 0.013685
PFC 0.001429 0.030418 0.006383
region \
sector Fossil fuels Electricity and heat Transport services
region stressor
FR CO2 0.107540 0.015568 0.058673
CH4 0.018198 0.003783 0.007705
N2O 0.006238 0.001653 0.003543
SF6 0.000204 0.000029 0.000061
HFC 0.010712 0.001534 0.003187
... ... ... ...
RoW Middle East CH4 16.407198 5.020937 2.359744
N2O 0.134513 0.432547 0.510101
SF6 0.009963 0.007036 0.012166
HFC 0.044495 0.031509 0.051611
PFC 0.008458 0.004833 0.006725
region
sector Composite
region stressor
FR CO2 0.801035
CH4 0.311628
N2O 0.150162
SF6 0.001836
HFC 0.094331
... ...
RoW Middle East CH4 119.001176
N2O 8.039872
SF6 0.941479
HFC 3.943134
PFC 0.422255
[294 rows x 833 columns]
The disaggregation is defined by the following lists of regions and sectors:
reg_list = ['FR', 'Austria', 'Belgium', 'Bulgaria', 'Cyprus', 'Czech Republic', 'Germany', 'Denmark', 'Estonia', 'Spain', 'Finland', 'Greece', 'Croatia', 'Hungary', 'Ireland', 'Italy', 'Lithuania', 'Luxembourg', 'Latvia', 'Malta', 'Netherlands', 'Poland', 'Portugal', 'Romania', 'Sweden', 'Slovenia', 'Slovakia', 'United Kingdom', 'United States', 'Japan', 'China', 'Canada', 'South Korea', 'Brazil', 'India', 'Mexico', 'Russia', 'Australia', 'Switzerland', 'Turkey', 'Taiwan', 'Norway', 'Indonesia', 'South Africa', 'RoW Asia and Pacific', 'RoW America', 'RoW Europe', 'RoW Africa', 'RoW Middle East']
sectors_list = ['Agriculture', 'Crude coal', 'Crude oil', 'Natural gas', 'Extractive industry', 'Biomass_industry', 'Clothing', 'Heavy_industry', 'Construction', 'Automobile', 'Oth transport equipment', 'Machinery', 'Electronics', 'Fossil fuels', 'Electricity and heat', 'Transport services', 'Composite']
The DataFrame dcba has the following index and columns:
dcba.index =
MultiIndex([( 'FR', 'CO2'),
( 'FR', 'CH4'),
( 'FR', 'N2O'),
( 'FR', 'SF6'),
( 'FR', 'HFC'),
( 'FR', 'PFC'),
( 'Austria', 'CO2'),
( 'Austria', 'CH4'),
( 'Austria', 'N2O'),
( 'Austria', 'SF6'),
...
( 'RoW Africa', 'N2O'),
( 'RoW Africa', 'SF6'),
( 'RoW Africa', 'HFC'),
( 'RoW Africa', 'PFC'),
('RoW Middle East', 'CO2'),
('RoW Middle East', 'CH4'),
('RoW Middle East', 'N2O'),
('RoW Middle East', 'SF6'),
('RoW Middle East', 'HFC'),
('RoW Middle East', 'PFC')],
names=['region', 'stressor'], length=294)
dcba.columns =
MultiIndex([( 'FR', 'Agriculture'),
( 'FR', 'Crude coal'),
( 'FR', 'Crude oil'),
( 'FR', 'Natural gas'),
( 'FR', 'Extractive industry'),
( 'FR', 'Biomass_industry'),
( 'FR', 'Clothing'),
( 'FR', 'Heavy_industry'),
( 'FR', 'Construction'),
( 'FR', 'Automobile'),
...
('RoW Middle East', 'Heavy_industry'),
('RoW Middle East', 'Construction'),
('RoW Middle East', 'Automobile'),
('RoW Middle East', 'Oth transport equipment'),
('RoW Middle East', 'Machinery'),
('RoW Middle East', 'Electronics'),
('RoW Middle East', 'Fossil fuels'),
('RoW Middle East', 'Electricity and heat'),
('RoW Middle East', 'Transport services'),
('RoW Middle East', 'Composite')],
names=['region', 'sector'], length=833)
I would like to reaggregate this DataFrame at a different level by grouping the regions differently, as defined here:
dict_reag =
{'United Kingdom': ['United Kingdom'],
'United States': ['United States'],
'Asia and Row Europe': ['Japan',
'India',
'Russia',
'Indonesia',
'RoW Europe'],
'Chinafrica': ['China', 'RoW Africa'],
'Turkey and RoW America': ['Canada', 'Turkey', 'RoW America'],
'Pacific and RoW Middle East': ['South Korea',
'Australia',
'Taiwan',
'RoW Middle East'],
'Brazil, Mexico and South Africa': ['Brazil', 'Mexico', 'South Africa'],
'Switzerland and Norway': ['Switzerland', 'Norway'],
'RoW Asia and Pacific': ['RoW Asia and Pacific'],
'EU': ['Austria',
'Belgium',
'Bulgaria',
'Cyprus',
'Czech Republic',
'Germany',
'Denmark',
'Estonia',
'Spain',
'Finland',
'Greece',
'Croatia',
'Hungary',
'Ireland',
'Italy',
'Lithuania',
'Luxembourg',
'Latvia',
'Malta',
'Netherlands',
'Poland',
'Portugal',
'Romania',
'Sweden',
'Slovenia',
'Slovakia'],
'FR': ['FR']}
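The reaggregation would transform this 294x833 DataFrame into a 66x187 DataFrame (11 new regions x 6 stressors in the rows, 11 new regions x 17 sectors in the columns). Each value of the reaggregated DataFrame is the sum of the corresponding values over the subregions that make up the new region.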
I created an empty DataFrame with the correct new level of aggregation :
import numpy as np
import pandas as pd
ghg_list = ['CO2', 'CH4', 'N2O', 'SF6', 'HFC', 'PFC']
# Column MultiIndex: every (new region, sector) pair
multi_reg = []
multi_sec = []
for reg in list(reag_matrix.columns[2:]):
    for sec in sectors_list:
        multi_reg.append(reg)
        multi_sec.append(sec)
arrays = [multi_reg, multi_sec]
new_col = pd.MultiIndex.from_arrays(arrays, names=('region', 'sector'))
# Row MultiIndex: every (new region, stressor) pair
multi_reg2 = []
multi_ghg = []
for reg in list(reag_matrix.columns[2:]):
    for ghg in ghg_list:
        multi_reg2.append(reg)
        multi_ghg.append(ghg)
arrays2 = [multi_reg2, multi_ghg]
new_index = pd.MultiIndex.from_arrays(arrays2, names=('region', 'stressor'))
# Empty (all-zero) DataFrame at the new level of aggregation
new_dcba = pd.DataFrame(np.zeros((len(ghg_list)*len(list(reag_matrix.columns[2:])),
                                  len(sectors_list)*len(list(reag_matrix.columns[2:])))),
                        index=new_index, columns=new_col)
where reag_matrix.columns[2:] corresponds to the new list of regions, as defined in dict_reag :
list(reag_matrix.columns[2:]) = ['FR', 'United Kingdom', 'United States', 'Asia and Row Europe', 'Chinafrica', 'Turkey and RoW America', 'Pacific and RoW Middle East', 'Brazil, Mexico and South Africa', 'Switzerland and Norway', 'RoW Asia and Pacific', 'EU']
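The nested loops above enumerate every (region, sector) and (region, stressor) pair by hand. As a minimal sketch, assuming the new region names are available as a plain list (the shortened `new_regions` below is an illustrative stand-in for `list(reag_matrix.columns[2:])`), `pd.MultiIndex.from_product` produces the same ordering in one call:

```python
import pandas as pd

# Shortened, illustrative lists; the full ones are given in the question.
new_regions = ["FR", "EU", "United Kingdom"]  # assumed stand-in for list(reag_matrix.columns[2:])
sectors_list = ["Agriculture", "Composite"]
ghg_list = ["CO2", "CH4", "N2O", "SF6", "HFC", "PFC"]

# Cartesian products, outer level first, exactly like the nested for-loops.
new_index = pd.MultiIndex.from_product([new_regions, ghg_list], names=["region", "stressor"])
new_col = pd.MultiIndex.from_product([new_regions, sectors_list], names=["region", "sector"])

# A scalar is broadcast to every cell, giving an all-zero float frame.
new_dcba = pd.DataFrame(0.0, index=new_index, columns=new_col)
print(new_dcba.shape)
```

With the full lists from the question this would give the 66x187 frame described above.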
I guess I could use the groupby function, but I could not make it work without losing the sector disaggregation.
Otherwise, I intended to do it iteratively, but I get errors I do not understand. I first tried to copy the French block, which stays the same:
s1 = dcba.loc['FR','FR'].copy()
new_dcba.loc['FR','FR'] = s1
But this last line raises the error "The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()", although it does not seem to involve any boolean. What is my problem here?
Also, to avoid this, I tried to use :
new_dcba.loc['FR','FR'].update(s1, overwrite=True)
But it does not change the values in new_dcba.
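One possible explanation, offered as an assumption and not a confirmed diagnosis: `new_dcba.loc['FR','FR']` returns a new object, so `.update(...)` modifies that temporary object and leaves `new_dcba` untouched. A minimal sketch on toy data (the names `source` and `target` are illustrative) that assigns the block on the frame itself, selecting rows and columns with boolean masks built from the `region` level:

```python
import numpy as np
import pandas as pd

regions = ["FR", "Other"]
sectors = ["Agriculture", "Composite"]
ghgs = ["CO2", "CH4"]
index = pd.MultiIndex.from_product([regions, ghgs], names=["region", "stressor"])
columns = pd.MultiIndex.from_product([regions, sectors], names=["region", "sector"])

# Toy stand-ins for dcba (source) and new_dcba (target).
source = pd.DataFrame(np.arange(16, dtype=float).reshape(4, 4), index=index, columns=columns)
target = pd.DataFrame(0.0, index=index, columns=columns)

# Boolean masks pick the FR rows and FR columns without passing two bare labels to .loc.
row_mask = target.index.get_level_values("region") == "FR"
col_mask = target.columns.get_level_values("region") == "FR"

# Assign on the frame itself; the right-hand side is a plain 2x2 array.
target.loc[row_mask, col_mask] = source.loc[row_mask, col_mask].to_numpy()
print(target)
```

Only the FR x FR block of `target` is filled; every other cell stays at zero.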
Finally, I tried to use .values but then a new error is raised :
new_dcba.loc['FR','FR'] = s1.values
"Must have equal len keys and value when setting with an ndarray"
So, I have two questions:
Can you think of a way to use groupby (and .sum()) for this?
What is the issue raising the first error?
Note that I have gone through https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy and could not find any explanation for my specific problem.
I wrote an MCVE here; it happens to work at a reduced size (the commented-out values), but not at the actual size of the problem:
import numpy as np
import pandas as pd
list_reg_mcve = reg_list #['FR', 'United States', 'United Kingdom']
list_reg_mcve_new = list_reg_reag_new #['FR','Other']
sectors_list_mcve = sectors_list #['Agriculture','Composite']
dict_mcve = dict_reag #{'FR':['FR'],'Other' : ['United States', 'United Kingdom']}
ghg_list_mcve = ['CO2', 'CH4', 'N2O', 'SF6', 'HFC', 'PFC'] #['CO2','CH4']
# Disaggregated DataFrame (original regions)
multi_reg = []
multi_sec = []
for reg in list_reg_mcve:
    for sec in sectors_list_mcve:
        multi_reg.append(reg)
        multi_sec.append(sec)
arrays = [multi_reg, multi_sec]
new_col = pd.MultiIndex.from_arrays(arrays, names=('region', 'sector'))
multi_reg2 = []
multi_ghg = []
for reg in list_reg_mcve:
    for ghg in ghg_list_mcve:
        multi_reg2.append(reg)
        multi_ghg.append(ghg)
arrays2 = [multi_reg2, multi_ghg]
new_index = pd.MultiIndex.from_arrays(arrays2, names=('region', 'stressor'))
dcba_mcve = pd.DataFrame(np.zeros((len(ghg_list_mcve)*len(list_reg_mcve),
                                   len(sectors_list_mcve)*len(list_reg_mcve))),
                         index=new_index, columns=new_col)
# Reaggregated DataFrame (new regions)
multi_reg = []
multi_sec = []
for reg in list_reg_mcve_new:
    for sec in sectors_list_mcve:
        multi_reg.append(reg)
        multi_sec.append(sec)
arrays = [multi_reg, multi_sec]
new_col = pd.MultiIndex.from_arrays(arrays, names=('region', 'sector'))
multi_reg2 = []
multi_ghg = []
for reg in list_reg_mcve_new:
    for ghg in ghg_list_mcve:
        multi_reg2.append(reg)
        multi_ghg.append(ghg)
arrays2 = [multi_reg2, multi_ghg]
new_index = pd.MultiIndex.from_arrays(arrays2, names=('region', 'stressor'))
dcba_mcve_new = pd.DataFrame(np.zeros((len(ghg_list_mcve)*len(list_reg_mcve_new),
                                       len(sectors_list_mcve)*len(list_reg_mcve_new))),
                             index=new_index, columns=new_col)
# Fill the disaggregated DataFrame with random integers
from random import randint
for col in dcba_mcve.columns:
    dcba_mcve[col] = dcba_mcve.apply(lambda x: randint(0,5), axis=1)
print(dcba_mcve)
# Sum the blocks of the original regions into each block of the new regions
for reg_export in dict_mcve:
    list_reg_agg_1 = dict_mcve[reg_export]
    for reg_import in dict_mcve:
        list_reg_agg_2 = dict_mcve[reg_import]
        s1 = pd.DataFrame(np.zeros_like(dcba_mcve_new.loc['FR','FR']), index=dcba_mcve_new.loc['FR','FR'].index, columns=dcba_mcve_new.loc['FR','FR'].columns)
        for reg1 in list_reg_agg_1:
            for reg2 in list_reg_agg_2:
                #print(reg1,reg2)
                s1 += dcba_mcve.loc[reg1,reg2].copy()
        #print(s1)
        dcba_mcve_new.loc[reg_export,reg_import].update(s1)
dcba_mcve_new
Thank you in advance
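For the groupby part of the question, one possible approach is sketched below on toy data, assuming a mapping shaped like dict_reag. The helper name `sum_rows_by_region` and the small frame are illustrative, not part of the original code. The idea is to map each old region to its new region, group the rows by (new region, stressor), then transpose and repeat for (new region, sector), so neither second level is lost:

```python
import numpy as np
import pandas as pd

# Toy stand-in for dcba: 3 regions, 2 sectors, 2 stressors.
regions = ["FR", "United States", "United Kingdom"]
sectors = ["Agriculture", "Composite"]
ghgs = ["CO2", "CH4"]
index = pd.MultiIndex.from_product([regions, ghgs], names=["region", "stressor"])
columns = pd.MultiIndex.from_product([regions, sectors], names=["region", "sector"])
dcba = pd.DataFrame(np.arange(36, dtype=float).reshape(6, 6), index=index, columns=columns)

# Same shape as dict_reag: new region -> list of old regions.
dict_reag = {"FR": ["FR"], "Other": ["United States", "United Kingdom"]}
# Invert it: old region -> new region.
old_to_new = {old: new for new, olds in dict_reag.items() for old in olds}


def sum_rows_by_region(df, mapping):
    """Sum rows whose first-level label maps to the same new region, keeping the second level."""
    second_name = df.index.names[1]
    new_region = df.index.get_level_values(0).map(mapping).to_numpy()
    second = df.index.get_level_values(1).to_numpy()
    out = df.groupby([new_region, second], sort=False).sum()
    return out.rename_axis(index=["region", second_name])


# Aggregate the rows, then transpose to aggregate the columns the same way.
rows_done = sum_rows_by_region(dcba, old_to_new)
new_dcba = sum_rows_by_region(rows_done.T, old_to_new).T
print(new_dcba)
```

On the real data, dict_reag from the question would take the place of the toy mapping. Note that a region missing from the mapping would be mapped to NaN and dropped by groupby's default behaviour, so it is worth checking that the grand total is unchanged after reaggregation.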

Related

Capital city that starts with "a", and ends with "a". Doesn't matter if letter "a" is uppercase or lowercase

Start with a, and ends with a. I have been trying to output capital cities that start and end with the letter "a". Doesn't matter if they start with capital "A"
capitals = ('Kabul', 'Tirana (Tirane)', 'Algiers', 'Andorra la Vella', 'Luanda', "Saint John's", 'Buenos Aires', 'Yerevan', 'Canberra', 'Vienna', 'Baku', 'Nassau', 'Manama', 'Dhaka', 'Bridgetown', 'Minsk', 'Brussels', 'Belmopan', 'Porto Novo', 'Thimphu', 'Sucre', 'Sarajevo', 'Gaborone', 'Brasilia', 'Bandar Seri Begawan', 'Sofia', 'Ouagadougou', 'Gitega', 'Phnom Penh', 'Yaounde', 'Ottawa', 'Praia', 'Bangui', "N'Djamena", 'Santiago', 'Beijing', 'Bogota', 'Moroni', 'Kinshasa', 'Brazzaville', 'San Jose', 'Yamoussoukro', 'Zagreb', 'Havana', 'Nicosia', 'Prague', 'Copenhagen', 'Djibouti', 'Roseau', 'Santo Domingo', 'Dili', 'Quito', 'Cairo', 'San Salvador', 'London', 'Malabo', 'Asmara', 'Tallinn', 'Mbabana', 'Addis Ababa', 'Palikir', 'Suva', 'Helsinki', 'Paris', 'Libreville', 'Banjul', 'Tbilisi', 'Berlin', 'Accra', 'Athens', "Saint George's", 'Guatemala City', 'Conakry', 'Bissau', 'Georgetown', 'Port au Prince', 'Tegucigalpa', 'Budapest', 'Reykjavik', 'New Delhi', 'Jakarta', 'Tehran', 'Baghdad', 'Dublin', 'Jerusalem', 'Rome', 'Kingston', 'Tokyo', 'Amman', 'Nur-Sultan', 'Nairobi', 'Tarawa Atoll', 'Pristina', 'Kuwait City', 'Bishkek', 'Vientiane', 'Riga', 'Beirut', 'Maseru', 'Monrovia', 'Tripoli', 'Vaduz', 'Vilnius', 'Luxembourg', 'Antananarivo', 'Lilongwe', 'Kuala Lumpur', 'Male', 'Bamako', 'Valletta', 'Majuro', 'Nouakchott', 'Port Louis', 'Mexico City', 'Chisinau', 'Monaco', 'Ulaanbaatar', 'Podgorica', 'Rabat', 'Maputo', 'Nay Pyi Taw', 'Windhoek', 'No official capital', 'Kathmandu', 'Amsterdam', 'Wellington', 'Managua', 'Niamey', 'Abuja', 'Pyongyang', 'Skopje', 'Belfast', 'Oslo', 'Muscat', 'Islamabad', 'Melekeok', 'Panama City', 'Port Moresby', 'Asuncion', 'Lima', 'Manila', 'Warsaw', 'Lisbon', 'Doha', 'Bucharest', 'Moscow', 'Kigali', 'Basseterre', 'Castries', 'Kingstown', 'Apia', 'San Marino', 'Sao Tome', 'Riyadh', 'Edinburgh', 'Dakar', 'Belgrade', 'Victoria', 'Freetown', 'Singapore', 'Bratislava', 'Ljubljana', 'Honiara', 'Mogadishu', 'Pretoria, Bloemfontein, Cape Town', 
'Seoul', 'Juba', 'Madrid', 'Colombo', 'Khartoum', 'Paramaribo', 'Stockholm', 'Bern', 'Damascus', 'Taipei', 'Dushanbe', 'Dodoma', 'Bangkok', 'Lome', "Nuku'alofa", 'Port of Spain', 'Tunis', 'Ankara', 'Ashgabat', 'Funafuti', 'Kampala', 'Kiev', 'Abu Dhabi', 'London', 'Washington D.C.', 'Montevideo', 'Tashkent', 'Port Vila', 'Vatican City', 'Caracas', 'Hanoi', 'Cardiff', "Sana'a", 'Lusaka', 'Harare')
This is my code:
for elem in capitals:
    elem = elem.lower()
    ["".join(j for j in i if j not in string.punctuation) for i in capitals]
    if (len(elem) >=4 and elem.endswith(elem[0])):
        print(elem)
My output is:
andorra la vella
saint john's
asmara
addis ababa
accra
saint george's
nur-sultan
abuja
oslo
warsaw
apia
ankara
tashkent
My expected output is:
andorra la vella
asmara
addis ababa
accra
abuja
apia
ankara
You didn't check if the capital starts with 'a'. I also assumed you want to filter out punctuation based on your code, so this is what I ended up with:
import string
for elem in capitals:
    elem = elem.lower()
    for punct in string.punctuation:
        elem = elem.replace(punct, '')
    if elem.startswith('a') and elem.endswith('a'):
        print(elem)
for elem in capitals:
    elem = elem.lower()
    if (elem.startswith('a') and elem.endswith('a')):
        print(elem)
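The same check can also be wrapped in a small function and used in a list comprehension. This is a sketch on a shortened sample of the tuple; the names `starts_and_ends_with_a` and `sample_capitals` are illustrative:

```python
import string


def starts_and_ends_with_a(name):
    """True if the name, ignoring case and punctuation, starts and ends with 'a'."""
    cleaned = name.casefold().translate(str.maketrans("", "", string.punctuation))
    return cleaned.startswith("a") and cleaned.endswith("a")


# Shortened sample of the capitals tuple from the question.
sample_capitals = ("Kabul", "Andorra la Vella", "Asmara", "Oslo", "Sana'a", "Ankara", "Warsaw")
matches = [c for c in sample_capitals if starts_and_ends_with_a(c)]
print(matches)  # ['Andorra la Vella', 'Asmara', 'Ankara']
```

Unlike the loops above, this keeps the original capitalisation in the result and only lowercases for the comparison.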

Colour scatter plot by column Plotly

I would like to create a scatter plot with 3 variables: Age, Value and City. How can I colour the plot by City?
Current output is a simple scatter plot of Value against Age:
Current Code:
import datetime
import plotly.offline as py
import plotly
import plotly.graph_objects as go
fig = go.Figure()
fig.add_trace(go.Scatter(x= data1['Age'], y = data1['Value'], mode='markers', name='lines+markers'))
fig.show()
Update:
Tried:
import plotly.express as px
fig = px.scatter(data1, x=data1['Age'], y=data1['Value'], color=data1['City'])
fig.show()
and caught error:
KeyError: (nan, '', '', '', '')
Update:
Age and Value have been cleaned. Here are some unique values for City(sorry to change the column). There are some messy figures.
['NT', 'WAIKATO', 'VICTORIA', 'South Australia', 'OTHER', 'ON',
'Nsw', 'IL', 'MD - MARYLAND', 'ABU DHABI', 'VIENNA', 'TX',
'VILKAVISKIS', 'NY', 'BALEARES', 'UK', 'GLOUCESTERSHIRE',
'LA MANCHE', 'TEXAS', 'DUBAI', 'ENGLAND', 'ITALY', nan,
'GREATER LONDON', 'BEDFORDSHIRE', 'HEREFORDSHIRE',
'BADEN-WÃ?RTTEMBERG', 'Australian Capital Territory',
'ABERDEENSHIRE', 'OXFORDSHIRE', 'LONDON', 'BC', 'SK',
'NOORD-HOLLAND', 'UNITED KINGDOM', 'New South Wales', 'Brookdale',
'Western Australia', 'GALWAY', 'Queensland', 'TOKYO',
'HAUTE-GARONNE', 'WORCESTERSHIRE', 'CALIFORNIA', 'JAPAN',
'NORTHUMBERLAND', 'NJ - NEW JERSEY', 'GLOS', 'DORSET', 'TENNESSEE',
'BANGKOK', 'CANTERBURY', 'WEXFORD', 'MIDDLESEX', 'SURREY', 'MI',
'NEVADA', 'KENTUCKY', 'NEW YORK', 'ZUID-HOLLAND', 'HONG KONG',
'ESSEX', 'FL', 'LILLEHAMMER', 'DEVON', 'NEW TERRITORIES', 'KENT',
'THAILAND', 'Pyrmont', 'SINGAPORE', 'FRIBOURG', 'CAIRO',
'QUEENSLAND', 'HAMPSHIRE', 'NEW JERSEY', 'WEST MIDLANDS',
'MICHIGAN', 'NONE', 'WI', 'BARNET', 'STAFFS', 'WARWICKSHIRE'...]
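go.Scatter has no top-level color argument; marker colours are set through the marker, e.g. marker=dict(color=...), which expects numbers or colour strings rather than category labels. For a categorical column such as City, either add one go.Scatter trace per category or use plotly.express with column names, e.g. px.scatter(data1, x='Age', y='Value', color='City'). See the Plotly documentation for more information.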
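The nan at the start of the KeyError suggests that missing values in the colour column may be involved, although that is an assumption. A minimal sketch that drops rows with a missing City and normalises the labels before plotting; the helper names and the toy `data1` are illustrative:

```python
import pandas as pd


def prepare_for_colour(df, colour_col="City"):
    """Return a copy with missing categories dropped and labels normalised."""
    out = df.dropna(subset=[colour_col]).copy()
    out[colour_col] = out[colour_col].str.strip().str.title()
    return out


def make_figure(df):
    """Build the scatter plot, passing column names instead of Series."""
    import plotly.express as px  # imported lazily so the cleaning step works without plotly
    return px.scatter(df, x="Age", y="Value", color="City")


# Toy stand-in for data1, with a missing value and inconsistent spelling.
data1 = pd.DataFrame({
    "Age": [30, 41, 25, 52],
    "Value": [10.0, 12.5, 8.0, 20.0],
    "City": ["LONDON", None, "London ", "TOKYO"],
})
clean = prepare_for_colour(data1)
print(clean["City"].tolist())  # ['London', 'London', 'Tokyo']
```

Calling `make_figure(clean).show()` would then draw one colour per distinct City value.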

How to check frequency of every unique value from pandas data-frame?

I have a DataFrame of roughly 2000 rows in which the brand column has 142 unique values, and I want to count the frequency of each of those 142 values. The counts should update dynamically when the data changes.
brand=clothes_z.brand_name
brand.describe(include="all")
unique_brand=brand.unique()
brand.describe(include="all"),unique_brand
Output:
(count 2613
unique 142
top Mango
freq 54
Name: brand_name, dtype: object,
array(['Jack & Jones', 'TOM TAILOR DENIM', 'YOURTURN', 'Tommy Jeans',
'Alessandro Zavetti', 'adidas Originals', 'Volcom', 'Pier One',
'Superdry', 'G-Star', 'SIKSILK', 'Tommy Hilfiger', 'Karl Kani',
'Alpha Industries', 'Farah', 'Nike Sportswear',
'Calvin Klein Jeans', 'Champion', 'Hollister Co.', 'PULL&BEAR',
'Nike Performance', 'Even&Odd', 'Stradivarius', 'Mango',
'Champion Reverse Weave', 'Massimo Dutti', 'Selected Femme Petite',
'NAF NAF', 'YAS', 'New Look', 'Missguided', 'Miss Selfridge',
'Topshop', 'Miss Selfridge Petite', 'Guess', 'Esprit Collection',
'Vero Moda', 'ONLY Petite', 'Selected Femme', 'ONLY', 'Dr.Denim',
'Bershka', 'Vero Moda Petite', 'PULL & BEAR', 'New Look Petite',
'JDY', 'Even & Odd', 'Vila', 'Lacoste', 'PS Paul Smith',
'Redefined Rebel', 'Selected Homme', 'BOSS', 'Brave Soul', 'Mind',
'Scotch & Soda', 'Only & Sons', 'The North Face',
'Polo Ralph Lauren', 'Gym King', 'Selected Woman', 'Rich & Royal',
'Rooms', 'Glamorous', 'Club L London', 'Zalando Essentials',
'edc by Esprit', 'OYSHO', 'Oasis', 'Gina Tricot',
'Glamorous Petite', 'Cortefiel', 'Missguided Petite',
'Missguided Tall', 'River Island', 'INDICODE JEANS',
'Kings Will Dream', 'Topman', 'Esprit', 'Diesel', 'Key Largo',
'Mennace', 'Lee', "Levi's®", 'adidas Performance', 'jordan',
'Jack & Jones PREMIUM', 'They', 'Springfield', 'Benetton', 'Fila',
'Replay', 'Original Penguin', 'Kronstadt', 'Vans', 'Jordan',
'Apart', 'New look', 'River island', 'Freequent', 'Mads Nørgaard',
'4th & Reckless', 'Morgan', 'Honey punch', 'Anna Field Petite',
'Noisy may', 'Pepe Jeans', 'Mavi', 'mint & berry', 'KIOMI', 'mbyM',
'Escada Sport', 'Lost Ink', 'More & More', 'Coffee', 'GANT',
'TWINTIP', 'MAMALICIOUS', 'Noisy May', 'Pieces', 'Rest',
'Anna Field', 'Pinko', 'Forever New', 'ICHI', 'Seafolly', 'Object',
'Freya', 'Wrangler', 'Cream', 'LTB', 'G-star', 'Dorothy Perkins',
'Carhartt WIP', 'Betty & Co', 'GAP', 'ONLY Tall', 'Next', 'HUGO',
'Violet by Mango', 'WEEKEND MaxMara', 'French Connection'],
dtype=object))
This shows only the frequency of Mango (54) because it is the most frequent value. I want the frequency of every value, e.g. Jack & Jones, TOM TAILOR DENIM, YOURTURN and so on, and the counts should update dynamically.
You could simply do,
clothes_z.brand_name.value_counts()
This lists the unique values together with the frequency of each one in that pandas Series.
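A small sketch of what value_counts returns, on a toy Series standing in for clothes_z.brand_name (the sample values are illustrative). It also shows an optional case-insensitive count, since the brand list contains pairs such as 'Jordan' and 'jordan':

```python
import pandas as pd

# Toy stand-in for clothes_z.brand_name.
brand = pd.Series(
    ["Mango", "Jack & Jones", "Mango", "YOURTURN", "jordan", "Jordan"],
    name="brand_name",
)

# Frequency of every unique value, most frequent first.
counts = brand.value_counts()

# Optional: treat 'jordan' and 'Jordan' as the same brand.
counts_ci = brand.str.casefold().value_counts()

# As a two-column table, if a DataFrame is easier to work with.
table = counts.rename_axis("brand_name").reset_index(name="frequency")
print(table)
```

Because the counts are computed from the Series each time, they follow any change in the underlying data.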
from collections import Counter
import pandas as pd
ll = clothes_z.brand_name.tolist()  # your list of brands
c = Counter(ll)
# you can do whatever you want with your counted values
df = pd.DataFrame.from_dict(c, orient='index', columns=['counted'])

Check if a country entered is one of the countries of the world

Is there an automated way to check if a country name entered is one of the countries of the world in python (i.e., is there an automated way to get a list of all the countries of the world?)
You can use pycountry to get a list of all the countries:
pip install pycountry
Or you can use this list of (code, name) pairs:
Country = [
('US', 'United States'),
('AF', 'Afghanistan'),
('AL', 'Albania'),
('DZ', 'Algeria'),
('AS', 'American Samoa'),
('AD', 'Andorra'),
('AO', 'Angola'),
('AI', 'Anguilla'),
('AQ', 'Antarctica'),
('AG', 'Antigua And Barbuda'),
('AR', 'Argentina'),
('AM', 'Armenia'),
('AW', 'Aruba'),
('AU', 'Australia'),
('AT', 'Austria'),
('AZ', 'Azerbaijan'),
('BS', 'Bahamas'),
('BH', 'Bahrain'),
('BD', 'Bangladesh'),
('BB', 'Barbados'),
('BY', 'Belarus'),
('BE', 'Belgium'),
('BZ', 'Belize'),
('BJ', 'Benin'),
('BM', 'Bermuda'),
('BT', 'Bhutan'),
('BO', 'Bolivia'),
('BA', 'Bosnia And Herzegowina'),
('BW', 'Botswana'),
('BV', 'Bouvet Island'),
('BR', 'Brazil'),
('BN', 'Brunei Darussalam'),
('BG', 'Bulgaria'),
('BF', 'Burkina Faso'),
('BI', 'Burundi'),
('KH', 'Cambodia'),
('CM', 'Cameroon'),
('CA', 'Canada'),
('CV', 'Cape Verde'),
('KY', 'Cayman Islands'),
('CF', 'Central African Rep'),
('TD', 'Chad'),
('CL', 'Chile'),
('CN', 'China'),
('CX', 'Christmas Island'),
('CC', 'Cocos Islands'),
('CO', 'Colombia'),
('KM', 'Comoros'),
('CG', 'Congo'),
('CK', 'Cook Islands'),
('CR', 'Costa Rica'),
('CI', 'Cote D`ivoire'),
('HR', 'Croatia'),
('CU', 'Cuba'),
('CY', 'Cyprus'),
('CZ', 'Czech Republic'),
('DK', 'Denmark'),
('DJ', 'Djibouti'),
('DM', 'Dominica'),
('DO', 'Dominican Republic'),
('TP', 'East Timor'),
('EC', 'Ecuador'),
('EG', 'Egypt'),
('SV', 'El Salvador'),
('GQ', 'Equatorial Guinea'),
('ER', 'Eritrea'),
('EE', 'Estonia'),
('ET', 'Ethiopia'),
('FK', 'Falkland Islands (Malvinas)'),
('FO', 'Faroe Islands'),
('FJ', 'Fiji'),
('FI', 'Finland'),
('FR', 'France'),
('GF', 'French Guiana'),
('PF', 'French Polynesia'),
('TF', 'French S. Territories'),
('GA', 'Gabon'),
('GM', 'Gambia'),
('GE', 'Georgia'),
('DE', 'Germany'),
('GH', 'Ghana'),
('GI', 'Gibraltar'),
('GR', 'Greece'),
('GL', 'Greenland'),
('GD', 'Grenada'),
('GP', 'Guadeloupe'),
('GU', 'Guam'),
('GT', 'Guatemala'),
('GN', 'Guinea'),
('GW', 'Guinea-bissau'),
('GY', 'Guyana'),
('HT', 'Haiti'),
('HN', 'Honduras'),
('HK', 'Hong Kong'),
('HU', 'Hungary'),
('IS', 'Iceland'),
('IN', 'India'),
('ID', 'Indonesia'),
('IR', 'Iran'),
('IQ', 'Iraq'),
('IE', 'Ireland'),
('IL', 'Israel'),
('IT', 'Italy'),
('JM', 'Jamaica'),
('JP', 'Japan'),
('JO', 'Jordan'),
('KZ', 'Kazakhstan'),
('KE', 'Kenya'),
('KI', 'Kiribati'),
('KP', 'Korea (North)'),
('KR', 'Korea (South)'),
('KW', 'Kuwait'),
('KG', 'Kyrgyzstan'),
('LA', 'Laos'),
('LV', 'Latvia'),
('LB', 'Lebanon'),
('LS', 'Lesotho'),
('LR', 'Liberia'),
('LY', 'Libya'),
('LI', 'Liechtenstein'),
('LT', 'Lithuania'),
('LU', 'Luxembourg'),
('MO', 'Macau'),
('MK', 'Macedonia'),
('MG', 'Madagascar'),
('MW', 'Malawi'),
('MY', 'Malaysia'),
('MV', 'Maldives'),
('ML', 'Mali'),
('MT', 'Malta'),
('MH', 'Marshall Islands'),
('MQ', 'Martinique'),
('MR', 'Mauritania'),
('MU', 'Mauritius'),
('YT', 'Mayotte'),
('MX', 'Mexico'),
('FM', 'Micronesia'),
('MD', 'Moldova'),
('MC', 'Monaco'),
('MN', 'Mongolia'),
('MS', 'Montserrat'),
('MA', 'Morocco'),
('MZ', 'Mozambique'),
('MM', 'Myanmar'),
('NA', 'Namibia'),
('NR', 'Nauru'),
('NP', 'Nepal'),
('NL', 'Netherlands'),
('AN', 'Netherlands Antilles'),
('NC', 'New Caledonia'),
('NZ', 'New Zealand'),
('NI', 'Nicaragua'),
('NE', 'Niger'),
('NG', 'Nigeria'),
('NU', 'Niue'),
('NF', 'Norfolk Island'),
('MP', 'Northern Mariana Islands'),
('NO', 'Norway'),
('OM', 'Oman'),
('PK', 'Pakistan'),
('PW', 'Palau'),
('PA', 'Panama'),
('PG', 'Papua New Guinea'),
('PY', 'Paraguay'),
('PE', 'Peru'),
('PH', 'Philippines'),
('PN', 'Pitcairn'),
('PL', 'Poland'),
('PT', 'Portugal'),
('PR', 'Puerto Rico'),
('QA', 'Qatar'),
('RE', 'Reunion'),
('RO', 'Romania'),
('RU', 'Russian Federation'),
('RW', 'Rwanda'),
('KN', 'Saint Kitts And Nevis'),
('LC', 'Saint Lucia'),
('VC', 'St Vincent/Grenadines'),
('WS', 'Samoa'),
('SM', 'San Marino'),
('ST', 'Sao Tome'),
('SA', 'Saudi Arabia'),
('SN', 'Senegal'),
('SC', 'Seychelles'),
('SL', 'Sierra Leone'),
('SG', 'Singapore'),
('SK', 'Slovakia'),
('SI', 'Slovenia'),
('SB', 'Solomon Islands'),
('SO', 'Somalia'),
('ZA', 'South Africa'),
('ES', 'Spain'),
('LK', 'Sri Lanka'),
('SH', 'St. Helena'),
('PM', 'St.Pierre'),
('SD', 'Sudan'),
('SR', 'Suriname'),
('SZ', 'Swaziland'),
('SE', 'Sweden'),
('CH', 'Switzerland'),
('SY', 'Syrian Arab Republic'),
('TW', 'Taiwan'),
('TJ', 'Tajikistan'),
('TZ', 'Tanzania'),
('TH', 'Thailand'),
('TG', 'Togo'),
('TK', 'Tokelau'),
('TO', 'Tonga'),
('TT', 'Trinidad And Tobago'),
('TN', 'Tunisia'),
('TR', 'Turkey'),
('TM', 'Turkmenistan'),
('TV', 'Tuvalu'),
('UG', 'Uganda'),
('UA', 'Ukraine'),
('AE', 'United Arab Emirates'),
('UK', 'United Kingdom'),
('UY', 'Uruguay'),
('UZ', 'Uzbekistan'),
('VU', 'Vanuatu'),
('VA', 'Vatican City State'),
('VE', 'Venezuela'),
('VN', 'Viet Nam'),
('VG', 'Virgin Islands (British)'),
('VI', 'Virgin Islands (U.S.)'),
('EH', 'Western Sahara'),
('YE', 'Yemen'),
('YU', 'Yugoslavia'),
('ZR', 'Zaire'),
('ZM', 'Zambia'),
('ZW', 'Zimbabwe')
]
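To check user input against such a list, one option is to build a set of case-folded names and codes once and test membership. This is a sketch; `COUNTRY_PAIRS` is a shortened stand-in for the full list above, and the function names are illustrative:

```python
# Shortened stand-in for the full list of (code, name) pairs.
COUNTRY_PAIRS = [
    ("US", "United States"),
    ("FR", "France"),
    ("NZ", "New Zealand"),
]


def build_lookup(pairs):
    """Collect case-folded names and codes into one set for fast membership tests."""
    names = {name.casefold() for _, name in pairs}
    codes = {code.casefold() for code, _ in pairs}
    return names | codes


def is_known_country(text, lookup):
    """True if the input, ignoring case and surrounding spaces, is a known name or code."""
    return text.strip().casefold() in lookup


LOOKUP = build_lookup(COUNTRY_PAIRS)
print(is_known_country(" france ", LOOKUP))  # True
print(is_known_country("Atlantis", LOOKUP))  # False
```

This only recognises exact names or codes; misspellings and alternative names would need extra handling.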
Update 2021: The module has since been updated to address the shortcomings mentioned by @JurajBezručka.
I know this has been asked 8 months ago, but here is a pretty good solution in case you are coming from Google (just like me).
You can use the ISO standard library located here:
https://pypi.python.org/pypi/iso3166/
This piece of code is taken from that link in case you get a 404 Error some time in the future:
Installation:
pip install iso3166
Country Details:
>>> from iso3166 import countries
>>> countries.get('us')
Country(name=u'United States', alpha2='US', alpha3='USA', numeric='840')
>>> countries.get('ala')
Country(name=u'\xc5land Islands', alpha2='AX', alpha3='ALA', numeric='248')
>>> countries.get(8)
Country(name=u'Albania', alpha2='AL', alpha3='ALB', numeric='008')
Countries List:
>>> from iso3166 import countries
>>> for c in countries:
...     print(c)
Country(name=u'Afghanistan', alpha2='AF', alpha3='AFG', numeric='004')
Country(name=u'\xc5land Islands', alpha2='AX', alpha3='ALA', numeric='248')
Country(name=u'Albania', alpha2='AL', alpha3='ALB', numeric='008')
Country(name=u'Algeria', alpha2='DZ', alpha3='DZA', numeric='012')
...
This package is compliant in case you want to follow the standardization proposed by ISO. According to Wikipedia:
ISO 3166 is a standard published by the International Organization for Standardization (ISO) that defines codes for the names of countries, dependent territories, special areas of geographical interest, and their principal subdivisions (e.g., provinces or states). The official name of the standard is Codes for the representation of names of countries and their subdivisions.
Hence, I strongly recommend using this library in all your apps in case you are working with Countries.
Hope this piece of data is useful for the community!
Chances are you've already got pytz installed in your project, e.g. if you're using Django.
Here's a note from the pytz documentation:
The Olson database comes with a ISO 3166 country code to English country name mapping that pytz exposes as a dictionary:
>>> print(pytz.country_names['nz'])
New Zealand
So, it may be convenient to use the pytz.country_names dictionary.
Not sure how up-to date that ISO 3166 table is, but at least pytz itself is well maintained, and it is currently (i.e. June 2020) in the top 20 "most downloaded past month" from PyPI, according to https://pypistats.org/top, so probably not a bad one to have, as far as external dependencies go.
Although this post is old and has been answered, I would still like to contribute my solution to the question asked:
I have written a Python function that can be used to find incorrect country names in a data set.
For example:
We have a list of country names that we want to check for invalid entries:
['UNITED STATES OF AMERICA',
'UNISTED STATES OF AMERICA',
'UNITED KINGDOM',
'UNTED KINGDOM',
'GERMANY',
'MALAYSIA',
....
]
(Note: I have converted the list elements to upper case so that my function can compare them case-insensitively.)
This list has incorrect or misspelled country names, such as Unisted States of America and Unted Kingdom.
To identify such anomalies, I have written a function that flags invalid country names.
This function uses Python's 'pycountry' library, which contains ISO country names. It provides the two-letter country code, three-letter country code, name, common name, official name and numeric country code.
Function definition:
import pycountry as pc
def country_name_check(input_country_list):
    # Attribute values collected from pycountry, one list per attribute
    pycntrylst = list(pc.countries)
    alpha_2 = []
    alpha_3 = []
    name = []
    common_name = []
    official_name = []
    invalid_countrynames = []
    # Names spelled differently in pycountry that should not be reported as invalid
    tobe_deleted = ['IRAN','SOUTH KOREA','NORTH KOREA','SUDAN','MACAU','REPUBLIC OF IRELAND']
    for i in pycntrylst:
        alpha_2.append(i.alpha_2)
        alpha_3.append(i.alpha_3)
        name.append(i.name)
        if hasattr(i, "common_name"):
            common_name.append(i.common_name)
        else:
            common_name.append("")
        if hasattr(i, "official_name"):
            official_name.append(i.official_name)
        else:
            official_name.append("")
    # Upper-case every known identifier once, for case-insensitive comparison
    known = set(map(str.upper, alpha_2 + alpha_3 + name + common_name + official_name))
    for j in input_country_list:
        if j not in known:
            invalid_countrynames.append(j)
    invalid_countrynames = list(set(invalid_countrynames))
    invalid_countrynames = [item for item in invalid_countrynames if item not in tobe_deleted]
    print(invalid_countrynames)
    return invalid_countrynames
This function compares each country name in the input list with each of the following attributes provided by pycountry.countries:
alpha_2 : Two character country code
alpha_3 : Three character country code
name: Country name
common name : Common name for the country
official name : Official name for the country
The comparison is done after converting each of the above attributes to upper case, since the input list of country names is also in upper case.
Note also that I have created a list called 'tobe_deleted' in the function definition. It contains countries whose names are spelled differently in pycountry, so that they do not appear as invalid country names when the function is called.
Example:
MACAU is also spelled MACAO, so both names are valid. However, pycountry.countries has only one entry, spelled MACAO. country_name_check() can handle both MACAO and MACAU.
Similarly, pycountry.countries has an entry for IRELAND with name='Ireland'. However, it is sometimes referred to as 'Republic of Ireland'. country_name_check() can handle both 'Ireland' and 'Republic of Ireland' in the input data set.
I hope this function helps anyone who has had trouble handling invalid country names in data sets during data analysis. Thanks for reading; suggestions and feedback to improve this function are welcome.
You can use pycountry to get a list of all countries, along with many other fields. Just follow these steps:
pip install pycountry
import pycountry
def get_countries():
    for x in pycountry.countries:
        print(x.alpha_3 + ' -- ' + x.name)
It will print each country's short code with its name.
It has other fields, which you can check with:
help(pycountry.countries)
You can do it like:
import requests
req = requests.get('https://raw.githubusercontent.com/Miguel-Frazao/world-data/master/countries_data.json').json()
countries = [i['name'] for i in req]  # a list, so it can be reused for membership checks
print(countries)
You can save them to a file so you don't have to do requests all the time, or just copy and paste into your code.
Then, to check if the country exists you can do:
...
country = input('Insert a country')
if country not in countries:
    print('nice try, but invalid')
else:
    print('yooo, your country is {}'.format(country))
You have more data about each country in case you need it, you can check it in the link that is being requested in the code
This is a crude start that uses the country names gleaned from https://www.iso.org/obp/ui/#search. The country names still contain some tricky cases. For instance, this code recognises 'Samoa', but it is not really 'seeing' 'American Samoa'.
class Countries:
    def __init__(self):
        self.__countries = ['afghanistan', 'aland islands', 'albania', 'algeria', 'american samoa', 'andorra', 'angola', 'anguilla', 'antarctica', 'antigua and barbuda', 'argentina', 'armenia', 'aruba', 'australia', 'austria', 'azerbaijan', 'bahamas (the)', 'bahrain', 'bangladesh', 'barbados', 'belarus', 'belgium', 'belize', 'benin', 'bermuda', 'bhutan', 'bolivia (plurinational state of)', 'bonaire, sint eustatius and saba', 'bosnia and herzegovina', 'botswana', 'bouvet island', 'brazil', 'british indian ocean territory (the)', 'brunei darussalam', 'bulgaria', 'burkina faso', 'burundi', 'cabo verde', 'cambodia', 'cameroon', 'canada', 'cayman islands (the)', 'central african republic (the)', 'chad', 'chile', 'china', 'christmas island', 'cocos (keeling) islands (the)', 'colombia', 'comoros (the)', 'congo (the democratic republic of the)', 'congo (the)', 'cook islands (the)', 'costa rica', "cote d'ivoire", 'croatia', 'cuba', 'curacao', 'cyprus', 'czechia', 'denmark', 'djibouti', 'dominica', 'dominican republic (the)', 'ecuador', 'egypt', 'el salvador', 'equatorial guinea', 'eritrea', 'estonia', 'ethiopia', 'falkland islands (the) [malvinas]', 'faroe islands (the)', 'fiji', 'finland', 'france', 'french guiana', 'french polynesia', 'french southern territories (the)', 'gabon', 'gambia (the)', 'georgia', 'germany', 'ghana', 'gibraltar', 'greece', 'greenland', 'grenada', 'guadeloupe', 'guam', 'guatemala', 'guernsey', 'guinea', 'guinea-bissau', 'guyana', 'haiti', 'heard island and mcdonald islands', 'holy see (the)', 'honduras', 'hong kong', 'hungary', 'iceland', 'india', 'indonesia', 'iran (islamic republic of)', 'iraq', 'ireland', 'isle of man', 'israel', 'italy', 'jamaica', 'japan', 'jersey', 'jordan', 'kazakhstan', 'kenya', 'kiribati', "korea (the democratic people's republic of)", 'korea (the republic of)', 'kuwait', 'kyrgyzstan', "lao people's democratic republic (the)", 'latvia', 'lebanon', 'lesotho', 'liberia', 'libya', 'liechtenstein', 'lithuania', 'luxembourg', 'macao',
            'macedonia (the former yugoslav republic of)', 'madagascar', 'malawi', 'malaysia', 'maldives', 'mali', 'malta', 'marshall islands (the)', 'martinique', 'mauritania', 'mauritius', 'mayotte', 'mexico', 'micronesia (federated states of)', 'moldova (the republic of)', 'monaco', 'mongolia', 'montenegro', 'montserrat', 'morocco', 'mozambique', 'myanmar', 'namibia', 'nauru', 'nepal', 'netherlands (the)', 'new caledonia', 'new zealand', 'nicaragua', 'niger (the)', 'nigeria', 'niue', 'norfolk island', 'northern mariana islands (the)', 'norway', 'oman', 'pakistan', 'palau', 'palestine, state of', 'panama', 'papua new guinea', 'paraguay', 'peru', 'philippines (the)', 'pitcairn', 'poland', 'portugal', 'puerto rico', 'qatar', 'reunion', 'romania', 'russian federation (the)', 'rwanda', 'saint barthelemy', 'saint helena, ascension and tristan da cunha', 'saint kitts and nevis', 'saint lucia', 'saint martin (french part)', 'saint pierre and miquelon', 'saint vincent and the grenadines', 'samoa', 'san marino', 'sao tome and principe', 'saudi arabia', 'senegal', 'serbia', 'seychelles', 'sierra leone', 'singapore', 'sint maarten (dutch part)', 'slovakia', 'slovenia', 'solomon islands', 'somalia', 'south africa', 'south georgia and the south sandwich islands', 'south sudan', 'spain', 'sri lanka', 'sudan (the)', 'suriname', 'svalbard and jan mayen', 'swaziland', 'sweden', 'switzerland', 'syrian arab republic', 'taiwan (province of china)', 'tajikistan', 'tanzania, united republic of', 'thailand', 'timor-leste', 'togo', 'tokelau', 'tonga', 'trinidad and tobago', 'tunisia', 'turkey', 'turkmenistan', 'turks and caicos islands (the)', 'tuvalu', 'uganda', 'ukraine', 'united arab emirates (the)', 'united kingdom of great britain and northern ireland (the)', 'united states minor outlying islands (the)', 'united states of america (the)', 'uruguay', 'uzbekistan', 'vanuatu', 'venezuela (bolivarian republic of)', 'viet nam', 'virgin islands (british)', 'virgin islands (u.s.)', 'wallis and futuna',
            'western sahara*', 'yemen', 'zambia', 'zimbabwe']
    def __call__(self, name, strict=3):
        name = name.lower()
        if strict == 3:
            # Exact match against a full country name.
            return name in self.__countries
        elif strict == 2:
            # The name appears anywhere inside a country name.
            return any(name in country for country in self.__countries)
        elif strict == 1:
            # A country name starts with the given name.
            return any(country.startswith(name) for country in self.__countries)
        # Unknown strictness level: report no match.
        return False
countries = Countries()
print (countries('germany'))
print (countries('russia'))
print (countries('russia', strict=2))
print (countries('russia', strict=1))
print (countries('samoa', strict=2))
print (countries('samoa', strict=1))
Here are the results:
True
False
True
True
True
True
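One way to make the 'Samoa' versus 'American Samoa' ambiguity visible is to return the matching names instead of a boolean. The following is only a sketch of that idea: find_countries and the short SAMPLE list are hypothetical and stand in for the full list in the class above.

```python
# Small illustrative subset of the country list used above.
SAMPLE = ['american samoa', 'germany', 'russian federation (the)', 'samoa']


def _matches(country, name, strict):
    """Apply the same three strictness levels as the Countries class."""
    if strict == 3:
        return country == name           # exact match
    if strict == 2:
        return name in country           # substring match
    if strict == 1:
        return country.startswith(name)  # prefix match
    return False


def find_countries(name, countries=SAMPLE, strict=3):
    """Return every entry that matches name, so ambiguous hits can be seen."""
    name = name.lower()
    return [c for c in countries if _matches(c, name, strict)]


print(find_countries('samoa', strict=2))  # ['american samoa', 'samoa']
print(find_countries('samoa', strict=1))  # ['samoa']
```

A caller can then treat more than one match as ambiguous rather than silently accepting it.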
This is an old question, but since it came up during my search and no one has provided this alternative, I'll add it.
https://github.com/SteinRobert/python-restcountries is a Python wrapper for the REST service https://restcountries.eu/.
The wrapper seems to be maintained (it was recently updated to Python 3), and the REST service is maintained by apilayer, so it should be up to date.
pip install python-restcountries
from restcountries import RestCountryApiV2 as rapi
def foo(name):
    # Look up the countries matching the given name, e.g. foo('France').
    country_list = rapi.get_countries_by_name(name)
    return country_list

Change True/False value to discrete value in pandas dataframe with np.where()

I am trying to assign a state name to a list of university names:
df = pd.DataFrame({'College': pd.Series(['University of Michigan', 'University of Florida', 'Iowa State'])})
State = ['Michigan', 'Iowa']
df['State'] = np.where(df['College'].str.contains('|'.join(State)),
'state','--')
I would like to replace the "state" value that appears when there is a match with the actual name of the state. Example: University of Michigan -> Michigan (rather than "state"). Ultimately, "State" will have all 50 states so I can't write 50 "np.where" statements for each state name.
Thank you for your help.
You could use str.extract here, instead of np.where:
In [290]: df['State'] = df['College'].str.extract('({})'.format('|'.join(State)), expand=True)
In [291]: df
Out[291]:
                  College     State
0  University of Michigan  Michigan
1   University of Florida       NaN
2              Iowa State      Iowa
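When the list grows to all the states, it may help to build the pattern a little more defensively. The sketch below uses only the standard re module; build_name_pattern is a hypothetical helper, and the extra names in the example are illustrative. Because the result has exactly one capture group, it should be usable with str.extract in the same way as the pattern above.

```python
import re


def build_name_pattern(names):
    """Build a regex with one capture group that matches any of the names."""
    # Longest first, so a name wins over any shorter name it starts with.
    ordered = sorted(names, key=len, reverse=True)
    # re.escape guards against names that contain regex metacharacters.
    return '({})'.format('|'.join(re.escape(n) for n in ordered))


pattern = build_name_pattern(['Michigan', 'Iowa', 'Virginia', 'West Virginia'])
for college in ['University of Michigan', 'University of Florida', 'Iowa State']:
    match = re.search(pattern, college)
    print(college, '->', match.group(1) if match else None)
```

The longest-first ordering only matters when one name is a prefix of another; for plain state names the escaping is the more general safeguard.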
States = [
    'Washington', 'Wisconsin', 'West Virginia', 'Florida', 'Wyoming',
    'New Hampshire', 'New Jersey', 'New Mexico', 'National', 'North Carolina',
    'North Dakota', 'Nebraska', 'New York', 'Rhode Island', 'Nevada', 'Guam',
    'Colorado', 'California', 'Georgia', 'Connecticut', 'Oklahoma', 'Ohio', 'Kansas',
    'South Carolina', 'Kentucky', 'Oregon', 'South Dakota', 'Delaware',
    'District of Columbia', 'Hawaii', 'Puerto Rico', 'Texas', 'Louisiana',
    'Tennessee', 'Pennsylvania', 'Virginia', 'Virgin Islands', 'Alaska', 'Alabama',
    'American Samoa', 'Arkansas', 'Vermont', 'Illinois', 'Indiana', 'Iowa',
    'Arizona', 'Idaho', 'Maine', 'Maryland', 'Massachusetts', 'Utah', 'Missouri',
    'Minnesota', 'Michigan', 'Montana', 'Northern Mariana Islands', 'Mississippi',
]
# The commas matter: without them Python joins adjacent string literals into one string.
state_str = '|'.join(States)
# The named group 'State' overwrites the existing 'State' column wherever a match is found.
df.update(df.College.str.extract(r'(?P<State>{})'.format(state_str), expand=True))
df
