My task is to remove any content in parentheses and any numbers that follow a country name, and to change the names of a couple of countries.
e.g.
'Bolivia (Plurinational State of)' should be 'Bolivia'
'Switzerland17' should be 'Switzerland'
My original code was in this order:
dict1 = {
"Republic of Korea": "South Korea",
"United States of America": "United States",
"United Kingdom of Great Britain and Northern Ireland": "United Kingdom",
"China, Hong Kong Special Administrative Region": "Hong Kong"}
energy['Country'] = energy['Country'].replace(dict1)
energy['Country'] = energy['Country'].str.replace(r' \(.*\)', '', regex=True)
energy['Country'] = energy['Country'].str.replace(r'\d+', '', regex=True)
energy.loc[energy['Country'] == 'United States']
The str.replace part works fine: the parentheses and the numbers are removed.
But when I use the last line to check whether the country name was changed, the original code turns out not to work. However, if I change the order of the code to:
energy['Country'] = energy['Country'].str.replace(r' \(.*\)', '', regex=True)
energy['Country'] = energy['Country'].str.replace(r'\d+', '', regex=True)
energy['Country'] = energy['Country'].replace(dict1)
Then it successfully changes the country name.
So there must be something wrong with my regex syntax. How can I solve this conflict, and why is this happening?
The problem is that Series.replace with a dictionary only replaces values that match a key exactly; to replace substrings you need regex=True:
energy = pd.DataFrame({'Country':['United States of America4',
'United States of America (aaa)','Slovakia']})
print (energy)
Country
0 United States of America4
1 United States of America (aaa)
2 Slovakia
dict1 = {
"Republic of Korea": "South Korea",
"United States of America": "United States",
"United Kingdom of Great Britain and Northern Ireland": "United Kingdom",
"China, Hong Kong Special Administrative Region": "Hong Kong"}
# no replacement because nothing matches exactly (the values still contain numbers and parentheses)
energy['Country'] = energy['Country'].replace(dict1)
print (energy)
Country
0 United States of America4
1 United States of America (aaa)
2 Slovakia
energy['Country'] = energy['Country'].str.replace(r' \(.*\)', '', regex=True)
energy['Country'] = energy['Country'].str.replace(r'\d+', '', regex=True)
print (energy)
Country
0 United States of America
1 United States of America
2 Slovakia
print (energy.loc[energy['Country'] == 'United States'])
Empty DataFrame
Columns: [Country]
Index: []
# with regex=True the keys are matched as substrings (starting again from the original data)
energy['Country'] = energy['Country'].replace(dict1, regex=True)
print (energy)
Country
0 United States4
1 United States (aaa)
2 Slovakia
energy['Country'] = energy['Country'].str.replace(r' \(.*\)', '', regex=True)
energy['Country'] = energy['Country'].str.replace(r'\d+', '', regex=True)
print (energy)
Country
0 United States
1 United States
2 Slovakia
print (energy.loc[energy['Country'] == 'United States'])
Country
0 United States
1 United States
# alternatively, clean the data first (starting again from the original data)
energy['Country'] = energy['Country'].str.replace(r' \(.*\)', '', regex=True)
energy['Country'] = energy['Country'].str.replace(r'\d+', '', regex=True)
print (energy)
Country
0 United States of America
1 United States of America
2 Slovakia
# now the exact-match replace works
energy['Country'] = energy['Country'].replace(dict1)
print (energy)
Country
0 United States
1 United States
2 Slovakia
print (energy.loc[energy['Country'] == 'United States'])
Country
0 United States
1 United States
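The clean-first, map-second order can also be wrapped in one small helper. This is only a sketch: the function name clean_country and the sample rows are made up for illustration, and regex=True is passed explicitly because the default of str.replace differs between pandas versions.

```python
import pandas as pd

# Mapping of long official names to short names (taken from the question).
dict1 = {
    "Republic of Korea": "South Korea",
    "United States of America": "United States",
    "United Kingdom of Great Britain and Northern Ireland": "United Kingdom",
    "China, Hong Kong Special Administrative Region": "Hong Kong",
}

def clean_country(names: pd.Series, mapping: dict) -> pd.Series:
    """Hypothetical helper: strip annotations first, then map exact names."""
    cleaned = (
        names.str.replace(r" \(.*\)", "", regex=True)  # drop " (...)" suffixes
             .str.replace(r"\d+", "", regex=True)      # drop footnote digits
             .str.strip()                               # drop stray whitespace
    )
    # The exact-match replacement can succeed now that the values are clean.
    return cleaned.replace(mapping)

# Made-up sample rows for illustration.
energy = pd.DataFrame({"Country": ["United States of America20",
                                   "Bolivia (Plurinational State of)",
                                   "Switzerland17",
                                   "Republic of Korea"]})
energy["Country"] = clean_country(energy["Country"], dict1)
print(energy["Country"].tolist())
```

Keeping both steps in one function makes it harder to run them in the wrong order by accident.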
I want to create a new column called Playercategory for the data below:
Name Country
Kobe United States
John Italy
Charly Japan
Braven Japan / United States
Rocky Germany / United States
Bran Lithuania
Nick United States / Ukraine
Jonas Nigeria
If the player's nationality is 'United States', or 'United States' together with any other non-European country, then Playercategory == "American".
If the player's nationality is a European country, or a European country together with any other country, then Playercategory == "Europe" (e.g. 'Italy', 'Italy / United States', 'Germany / United States', 'Lithuania / Australia', 'Belgium').
For all other players, Playercategory == "Non".
Expected Output:
Name Country Playercategory
Kobe United States American
John Italy Europe
Charly Japan Non
Braven Japan / United States American
Rocky Germany / United States Europe
Bran Lithuania Europe
Nick United States / Ukraine American
Jonas Nigeria Non
What I tried:
First I created a list with Europe countries:
euroCountries=['Austria', 'Belgium', 'Bulgaria', 'Croatia', 'Cyprus', 'Czechia', 'Denmark',
'Estonia', 'Finland', 'France', 'Germany', 'Greece', 'Hungary', 'Ireland',
'Italy', 'Latvia', 'Lithuania', 'Luxembourg', 'Malta', 'Netherlands',
'Poland', 'Portugal', 'Romania', 'Slovakia', 'Slovenia', 'Spain', 'Sweden']
I know how to check one condition, like this:
df["PlayerCatagory"] = np.where(df["Country"].isin(euroCountries), "Europe", "Non")
But I don't know how to combine the three conditions above and create PlayerCategory correctly.
I would really appreciate your support!
Use numpy.select. First test for a match with euroCountries using Series.str.contains, then test for a match with United States:
m1 = df['Country'].str.contains('|'.join(euroCountries))
m2 = df['Country'].str.contains('United States')
Or test the split values with Series.str.split and DataFrame.isin or DataFrame.eq, then check for at least one match per row with DataFrame.any:
df1 = df['Country'].str.split(' / ', expand=True)
m1 = df1.isin(euroCountries).any(axis=1)
m2 = df1.eq('United States').any(axis=1)
Because numpy.select uses the first condition that is True, the European mask (m1) must come first:
df["PlayerCatagory"] = np.select([m1, m2], ['Europe','American'], default='Non')
print (df)
Name Country PlayerCatagory
0 Kobe United States American
1 John Italy Europe
2 Charly Japan Non
3 Braven Japan / United States American
4 Rocky Germany / United States Europe
5 Bran Lithuania Europe
6 Nick United States / Ukraine American
7 Jonas Nigeria Non
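To see why the order of the conditions matters, here is a self-contained sketch on the sample data. The names parts, is_europe and is_us are illustrative, and the European list is shortened to the three countries that occur in the sample.

```python
import numpy as np
import pandas as pd

# Shortened list for the sketch; the question uses a longer one.
euroCountries = ['Germany', 'Italy', 'Lithuania']

df = pd.DataFrame({
    'Name': ['Kobe', 'John', 'Charly', 'Braven', 'Rocky', 'Bran', 'Nick', 'Jonas'],
    'Country': ['United States', 'Italy', 'Japan', 'Japan / United States',
                'Germany / United States', 'Lithuania',
                'United States / Ukraine', 'Nigeria'],
})

# One column per nationality; missing second nationalities become NaN/None.
parts = df['Country'].str.split(' / ', expand=True)
is_europe = parts.isin(euroCountries).any(axis=1)
is_us = parts.eq('United States').any(axis=1)

# np.select picks the first True condition per row, so a row such as
# 'Germany / United States' (True in both masks) is labelled 'Europe'.
df['PlayerCatagory'] = np.select([is_europe, is_us],
                                 ['Europe', 'American'], default='Non')
print(df)
```

Swapping the two masks in the condition list would label the 'Germany / United States' row as 'American' instead.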
I am trying to replace string values in a dataframe, but they are not changing. Below is my code:
import pandas as pd
import numpy as np
def answer_one():
    energy = pd.read_excel('Energy Indicators.xls', skipfooter=38, skiprows=17)
    energy = energy.drop(['Unnamed: 0', 'Unnamed: 1'], axis='columns')
    energy = energy.rename(columns={'Unnamed: 2': 'Country',
                                    'Petajoules': 'Energy Supply',
                                    'Gigajoules': 'Energy Supply per Capita',
                                    '%': '% Renewable'})
    energy['Energy Supply'] *= 1000000
    energy['Country'] = energy['Country'].replace({
        'China, Hong Kong Special Administrative Region': 'Hong Kong',
        'United Kingdom of Great Britain and Northern Ireland': 'United Kingdom',
        'Republic Of Korea': 'South Korea',
        'United States of America': 'United States',
        'Iran (Islamic Republic of)': 'Iran'})
    return energy
answer_one()
I am getting all the expected output except for the last step, where I am trying to replace the country names in the 'Country' column.
I don't see any issues with the .replace(). Maybe something else is going on in the dataset that we can't see from here. I created a sample df with the values you have there and replaced them as you did.
The df initially has the original values in the Country column.
df = pd.DataFrame(['China, Hong Kong Special Administrative Region',
'United Kingdom of Great Britain and Northern Ireland',
'Republic Of Korea',
'United States of America',
'Iran (Islamic Republic of)'],
columns=['Country'])
print(df)
Before replace:
Country
0 China, Hong Kong Special Administrative Region
1 United Kingdom of Great Britain and Northern I...
2 Republic Of Korea
3 United States of America
4 Iran (Islamic Republic of)
Then I used the same .replace() with the same pairs (note the indentation):
df['Country'] = df['Country'].replace({'China, Hong Kong Special Administrative Region': 'Hong Kong',
'United Kingdom of Great Britain and Northern Ireland': 'United Kingdom',
'Republic Of Korea': 'South Korea',
'United States of America': 'United States',
'Iran (Islamic Republic of)': 'Iran'})
print(df)
After replace:
Country
0 Hong Kong
1 United Kingdom
2 South Korea
3 United States
4 Iran
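If the same replacement silently does nothing on the real file, one way to investigate is to list the dictionary keys that never occur as an exact value in the column. The sketch below uses made-up sample values (a case difference, a footnote digit and a trailing space) purely as assumptions about what the hidden data might contain.

```python
import pandas as pd

# Made-up sample: each value differs slightly from the intended key.
df = pd.DataFrame({'Country': ['Republic of Korea',           # lower-case "of"
                               'United States of America20',  # footnote digits
                               'Iran (Islamic Republic of) ']})  # trailing space

mapping = {'Republic Of Korea': 'South Korea',
           'United States of America': 'United States',
           'Iran (Islamic Republic of)': 'Iran'}

# Keys that are not present as exact values will never be replaced.
present = set(df['Country'])
missing_keys = [key for key in mapping if key not in present]
print(missing_keys)

# repr() makes hidden whitespace and stray digits visible.
for value in df['Country']:
    print(repr(value))
```

Any key that shows up in missing_keys points at a value worth inspecting by hand.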
import numpy as np
import pandas as pd
d = {'country': ['US', 'US', 'United Kingdom', 'United Kingdom'],
'province/state': ['New York', np.nan, 'Gibraltar', np.nan]}
df = pd.DataFrame(data=d)
I guess there are three steps:
Step 1: fill the NA values of the province with the related country:
df['province/state'] = df['province/state'].fillna(df['country'])
Step 2: create a new column by concatenating the country and province with '-':
df['new_geo'] = df['country'] + '-' + df['province/state']
Step 3: remove the country if it is repeated. For example, remove United Kingdom-United Kingdom and keep only the values that do not overlap, such as United Kingdom-Gibraltar. But I am not sure what regex should be used.
Is there any convenient way to do this?
Try:
df['new_geo'] = np.where(df['province/state'].notna(), df['country'] + '-' + df['province/state'], df['country'])
df['province/state']=df['province/state'].fillna(df['country'])
Outputs:
country province/state new_geo
0 US New York US-New York
1 US US US
2 United Kingdom Gibraltar United Kingdom-Gibraltar
3 United Kingdom United Kingdom United Kingdom
Combine the strings using pandas str.cat, then fill the empty cells sideways using ffill with axis=1.
res = (df
       .assign(new_geo=lambda x: x.country.str.cat(x['province/state'], sep='-'))
       .ffill(axis=1)
      )
res
country province/state new_geo
0 US New York US-New York
1 US US US
2 United Kingdom Gibraltar United Kingdom-Gibraltar
3 United Kingdom United Kingdom United Kingdom
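The str.cat approach works because the concatenation is missing wherever the province is missing. A minimal sketch of that behaviour follows; it fills the gaps with fillna on the single column instead of the row-wise ffill used above, which is a swapped-in variant and not the answer's exact method, and the name joined is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'country': ['US', 'US', 'United Kingdom', 'United Kingdom'],
                   'province/state': ['New York', np.nan, 'Gibraltar', np.nan]})

# str.cat yields NaN wherever the province is missing ...
joined = df['country'].str.cat(df['province/state'], sep='-')
print(joined.isna().tolist())

# ... so those gaps can be filled with the bare country name.
df['new_geo'] = joined.fillna(df['country'])
print(df['new_geo'].tolist())
```

This leaves the province/state column untouched, so it still needs its own fillna if the filled version is wanted.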
I am a beginner in pandas and I ran into this problem recently.
Given the data below (first few lines):
Country Energy Supply Energy Supply per Capita % Renewable
0 Afghanistan 3.210000e+08 10.0 78.669280
1 Albania 1.020000e+08 35.0 100.000000
2 Algeria 1.959000e+09 51.0 0.551010
3 American Samoa NaN NaN 0.641026
4 Andorra 9.000000e+06 121.0 88.695650
I wish to read in the data from Excel, set the column names, replace the names of certain countries inside a function, and then return the DataFrame for further use.
def answer_one():
    energy = pd.read_excel('Energy Indicators.xls', skiprows=18, header=None, skipfooter=38, usecols='C:F')
    energy.columns = ['Country', 'Energy Supply', 'Energy Supply per Capita', '% Renewable']
    energy.replace('...', np.nan, inplace=True)
    energy['Energy Supply'] = energy['Energy Supply'] * 1000000
    energy.replace({"Republic of Korea": "South Korea",
                    "United States of America": "United States",
                    "United Kingdom of Great Britain and Northern Ireland": "United Kingdom",
                    "China, Hong Kong Special Administrative Region": "Hong Kong"}, inplace=True)
    energy.replace(regex={r'[0-9]': '', r'\(.*\)': ''}, inplace=True)
    return energy
answer_one()
The regex replacement works perfectly, but
energy['Country'].replace({"Republic of Korea": "South Korea",
"United States of America": "United States",
"United Kingdom of Great Britain and Northern Ireland": "United Kingdom",
"China, Hong Kong Special Administrative Region": "Hong Kong"},inplace=True)
does not seem to work. In the DataFrame returned by the function, I notice that:
216 United States of America 9.083800e+10 286.0 11.570980
So the value is not correctly replaced.
Why is this happening? I would much appreciate it if you could help me understand this.
I work with similar data, so I think I know the problem.
There are sometimes superscript numbers after the country names, so you need to swap the code to first remove the numbers and then replace the strings:
energy.replace(regex={r'[0-9]':'',r'\(.*\)':''},inplace=True)
energy.replace({"Republic of Korea": "South Korea",
"United States of America": "United States",
"United Kingdom of Great Britain and Northern Ireland": "United Kingdom",
"China, Hong Kong Special Administrative Region": "Hong Kong"},inplace=True)
I also think inplace is not good practice.
So it is better to assign the result back:
d = {"Republic of Korea": "South Korea",
"United States of America": "United States",
"United Kingdom of Great Britain and Northern Ireland": "United Kingdom",
"China, Hong Kong Special Administrative Region": "Hong Kong"}
energy = energy.replace(regex={r'[0-9]':'',r'\(.*\)':''})
energy['Country']= energy['Country'].replace(d)
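One detail worth checking with the pattern r'\(.*\)': it removes the parentheses but not the space in front of them, so a cleaned value can end in a blank and then fail an exact comparison. The sketch below uses made-up sample values, and the names loose and tight are illustrative; the second pattern, r'\s*\(.*\)', is a suggested variant and not part of the original answer.

```python
import pandas as pd

# Made-up sample values with a footnote number and parenthesised suffixes.
country = pd.Series(['Bolivia (Plurinational State of)',
                     'Switzerland17',
                     'Iran (Islamic Republic of)'])

# Pattern from the answer: the space before "(" survives.
loose = country.replace(regex={r'[0-9]': '', r'\(.*\)': ''})
print(loose.tolist())

# Variant that also consumes the whitespace before "(".
tight = country.replace(regex={r'[0-9]': '', r'\s*\(.*\)': ''})
print(tight.tolist())
```

Calling .str.strip() on the column after the regex step is another way to get rid of the trailing blank.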
Try assigning the result back without inplace=True (with inplace=True, replace returns None, so the assignment would overwrite the column with None):
energy['Country']= energy['Country'].replace({"Republic of Korea": "South Korea",
"United States of America": "United States",
"United Kingdom of Great Britain and Northern Ireland": "United Kingdom",
"China, Hong Kong Special Administrative Region": "Hong Kong"})