remove numbers after the string and () along with what is inside - python

I have a problem removing numbers and parentheses (along with their contents) in Python. It was suggested to use str.replace. However, the challenge is that the numbers are not fixed: I only know that whatever number appears must be removed, not what it will be. For the parentheses, I only know I need to remove the () together with whatever is inside, and that content also varies. For instance, if I have the following data set:
import pandas as pd
a = pd.Series({'Country':'China 1', 'Capital': 'Bei Jing'})
b = pd.Series({'Country': 'United States (of American)', 'Capital': 'Washington'})
c = pd.Series({'Country': 'United Kingdom (of Great Britain and Northern Ireland)', 'Capital': 'London'})
d = pd.Series({'Country': 'France 2', 'Capital': 'Paris'})
e = pd.DataFrame([a,b,c,d])
Now in the 'Country' column, the values are 'China 1', 'United States (of American)', 'United Kingdom (of...)' and 'France 2'. After the replacement/removal, I want to get rid of all numbers and all parentheses along with their contents, so that the values in the Country column become 'China', 'United States', 'United Kingdom', 'France'.

You can use str.replace here with a regex:
series1.str.replace(r"^([a-zA-Z]+(?:\s+[a-zA-Z]+)*).*", r"\1", regex=True)
See the demo; you can substitute your own series and adapt the pattern as needed.
https://regex101.com/r/lIScpi/2
You can also modify the frame directly:
a = pd.Series({'Country': 'China 1', 'Capital': 'Bei Jing'})
b = pd.Series({'Country': 'United States (of American)', 'Capital': 'Washington'})
c = pd.Series({'Country': 'United Kingdom (of Great Britain and Northern Ireland)', 'Capital': 'London'})
d = pd.Series({'Country': 'France 2', 'Capital': 'Paris'})
e = pd.DataFrame([a, b, c, d])
print(e)
e['Country'] = e['Country'].str.replace(r"^([a-zA-Z]+(?:\s+[a-zA-Z]+)*).*", r"\1", regex=True)
print(e)
Output before replace.
Capital Country
0 Bei Jing China 1
1 Washington United States (of American)
2 London United Kingdom (of Great Britain and Northern ...
3 Paris France 2
Output after replace
Capital Country
0 Bei Jing China
1 Washington United States
2 London United Kingdom
3 Paris France
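An alternative is to remove the unwanted pieces directly instead of capturing the part to keep. A minimal sketch of my own (not from the answer above) that drops any '(...)' group and any stray digits, then trims leftover whitespace:
e['Country'] = (e['Country']
                .str.replace(r'\s*\(.*?\)', '', regex=True)  # drop '( ... )' and the space before it
                .str.replace(r'\s*\d+', '', regex=True)      # drop stray numbers
                .str.strip())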


create a column value if the column value is in the list- python pandas

I want to create a new column called Playercategory, based on the values in the Country column:
Name Country
Kobe United States
John Italy
Charly Japan
Braven Japan / United States
Rocky Germany / United States
Bran Lithuania
Nick United States / Ukraine
Jonas Nigeria
If the player's nationality is 'United States', or United States together with any non-European country, then Playercategory == "American".
If the player's nationality is a European country, or a European country together with any other country, then Playercategory == "Europe" (e.g. 'Italy', 'Italy / United States', 'Germany / United States', 'Lithuania / Australia', 'Belgium').
For all other players, Playercategory == "Non".
Expected Output:
Name Country Playercategory
Kobe United States American
John Italy Europe
Charly Japan Non
Braven Japan / United States American
Rocky Germany / United States Europe
Bran Lithuania Europe
Nick United States / Ukraine American
Jonas Nigeria Non
What I tried:
First I created a list of European countries:
euroCountries=['Austria', 'Belgium', 'Bulgaria', 'Croatia', 'Cyprus', 'Czechia', 'Denmark',
'Estonia', 'Finland', 'France', 'Germany', 'Greece', 'Hungary', 'Ireland',
'Italy', 'Latvia', 'Lithuania', 'Luxembourg', 'Malta', 'Netherlands',
'Poland', 'Portugal', 'Romania', 'Slovakia', 'Slovenia', 'Spain', 'Sweden']
I know how to check a single condition, like below:
df["Playercategory"] = np.where(df["Country"].isin(euroCountries), "Europe", "Non")
But I don't know how to combine the three conditions above and create Playercategory correctly.
I really appreciate your support!
Use numpy.select: first test for matches against euroCountries with Series.str.contains, then test for a match against United States:
m1 = df['Country'].str.contains('|'.join(euroCountries))
m2 = df['Country'].str.contains('United States')
Or split the values with Series.str.split and test them with DataFrame.isin / DataFrame.eq, then check whether at least one value per row matches with DataFrame.any:
df1 = df['Country'].str.split(' / ', expand=True)
m1 = df1.isin(euroCountries).any(axis=1)
m2 = df1.eq('United States').any(axis=1)
df["Playercategory"] = np.select([m1, m2], ['Europe', 'American'], default='Non')
print(df)
Name Country Playercategory
0 Kobe United States American
1 John Italy Europe
2 Charly Japan Non
3 Braven Japan / United States American
4 Rocky Germany / United States Europe
5 Bran Lithuania Europe
6 Nick United States / Ukraine American
7 Jonas Nigeria Non
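For reference, here is a minimal end-to-end sketch of the second approach, using the sample data and the euroCountries list from the question (the DataFrame construction and the imports are my own additions):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': ['Kobe', 'John', 'Charly', 'Braven', 'Rocky', 'Bran', 'Nick', 'Jonas'],
                   'Country': ['United States', 'Italy', 'Japan', 'Japan / United States',
                               'Germany / United States', 'Lithuania', 'United States / Ukraine',
                               'Nigeria']})
euroCountries = ['Austria', 'Belgium', 'Bulgaria', 'Croatia', 'Cyprus', 'Czechia', 'Denmark',
                 'Estonia', 'Finland', 'France', 'Germany', 'Greece', 'Hungary', 'Ireland',
                 'Italy', 'Latvia', 'Lithuania', 'Luxembourg', 'Malta', 'Netherlands',
                 'Poland', 'Portugal', 'Romania', 'Slovakia', 'Slovenia', 'Spain', 'Sweden']

# split multi-country strings and build one boolean mask per rule;
# m1 comes first in np.select, so a European country takes priority
df1 = df['Country'].str.split(' / ', expand=True)
m1 = df1.isin(euroCountries).any(axis=1)
m2 = df1.eq('United States').any(axis=1)
df['Playercategory'] = np.select([m1, m2], ['Europe', 'American'], default='Non')
print(df)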

I am trying to get the output of a replaced string value in a DataFrame, but it's not changing

I am trying to replace string values in a DataFrame and they're not changing. Below is my code:
import pandas as pd
import numpy as np
def answer_one():
    energy = pd.read_excel('Energy Indicators.xls', skip_footer=38, skiprows=17).drop(
        ['Unnamed: 0', 'Unnamed: 1'], axis='columns')
    energy = energy.rename(columns={'Unnamed: 2': 'Country',
                                    'Petajoules': 'Energy Supply',
                                    'Gigajoules': 'Energy Supply per Capita',
                                    '%': '% Renewable'})
    energy['Energy Supply'] *= 1000000
    energy['Country'] = energy['Country'].replace({'China, Hong Kong Special Administrative Region': 'Hong Kong',
                                                   'United Kingdom of Great Britain and Northern Ireland': 'United Kingdom',
                                                   'Republic Of Korea': 'South Korea',
                                                   'United States of America': 'United States',
                                                   'Iran (Islamic Republic of)': 'Iran'})
    return energy

answer_one()
I am getting all the output except for that last step, where I am trying to replace the country names in the 'Country' column.
I don't see any issue with the .replace(). Maybe something else is going on in the hidden dataset that we are not aware of. I created a sample df with the values you have there and replaced them the same way you did.
The df below starts with the original values in the Country column.
df = pd.DataFrame(['China, Hong Kong Special Administrative Region',
'United Kingdom of Great Britain and Northern Ireland',
'Republic Of Korea',
'United States of America',
'Iran (Islamic Republic of)'],
columns=['Country'])
print(df)
Before replace:
Country
0 China, Hong Kong Special Administrative Region
1 United Kingdom of Great Britain and Northern I...
2 Republic Of Korea
3 United States of America
4 Iran (Islamic Republic of)
I used the same .replace() and assigned the mapping pairs (note the indentation):
df['Country'] = df['Country'].replace({'China, Hong Kong Special Administrative Region': 'Hong Kong',
'United Kingdom of Great Britain and Northern Ireland': 'United Kingdom',
'Republic Of Korea': 'South Korea',
'United States of America': 'United States',
'Iran (Islamic Republic of)': 'Iran'})
print(df)
After replace:
Country
0 Hong Kong
1 United Kingdom
2 South Korea
3 United States
4 Iran
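If the replacement still appears not to take effect on the real file, keep in mind that Series.replace with a dict only replaces exact, full-string matches. A quick diagnostic sketch of my own (the replacements name and the check are not part of the question or answer), run on the frame just before the .replace call:
replacements = {'China, Hong Kong Special Administrative Region': 'Hong Kong',
                'United Kingdom of Great Britain and Northern Ireland': 'United Kingdom',
                'Republic Of Korea': 'South Korea',
                'United States of America': 'United States',
                'Iran (Islamic Republic of)': 'Iran'}
# list every key that does not occur verbatim in the column,
# e.g. because of stray footnote digits or extra whitespace in the source file
missing = [k for k in replacements if not energy['Country'].eq(k).any()]
print(missing)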

String manipulation based on the pattern of two columns: any convenient way?

import numpy as np
import pandas as pd

d = {'country': ['US', 'US', 'United Kingdom', 'United Kingdom'],
     'province/state': ['New York', np.nan, 'Gibraltar', np.nan]}
df = pd.DataFrame(data=d)
I guess there are three steps:
Step 1: fill the NAs in the province with the corresponding country:
df['province/state'].fillna(df['country'], inplace=True)
Step 2: create a new column by concatenating the country and province with '-':
df['new_geo'] = df['country'] + '-' + df['province/state']
Step 3: collapse the value when the country is just repeated:
For example, 'United Kingdom-United Kingdom' should become just 'United Kingdom', while non-redundant values such as 'United Kingdom-Gibraltar' are kept. But I am not sure what regex should be used.
Is there any convenient way to do this?
Try:
df['new_geo'] = np.where(df['province/state'].notna(), df['country'] + '-' + df['province/state'], df['country'])
df['province/state']=df['province/state'].fillna(df['country'])
Outputs:
country province/state new_geo
0 US New York US-New York
1 US US US
2 United Kingdom Gibraltar United Kingdom-Gibraltar
3 United Kingdom United Kingdom United Kingdom
Combine the strings using pandas str.cat, then fill the empty cells sideways using ffill with axis=1.
res = (df
       .assign(new_geo=lambda x: x.country.str.cat(x['province/state'], sep='-'))
       .ffill(axis=1)
      )
res
country province/state new_geo
0 US New York US-New York
1 US US US
2 United Kingdom Gibraltar United Kingdom-Gibraltar
3 United Kingdom United Kingdom United Kingdom
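For the regex asked about in step 3, a backreference can collapse the duplicated 'Country-Country' values after the concatenation. A small sketch of my own, applied to the new_geo column produced by either answer above:
# '^(.+)-\1$' only matches values whose parts before and after '-' are identical
df['new_geo'] = df['new_geo'].str.replace(r'^(.+)-\1$', r'\1', regex=True)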

Lookup table with 'wildcards' in Pandas

I've been looking for an answer to this problem for a few days, but can't find anything similar in other threads.
I have a lookup table that defines the classification for some input data. The classification depends on continent, country and city. However, some classes may depend on only a subset of these variables, e.g. only continent and country (no city). An example of such a lookup table is below; I'm using one and two stars as wildcards:
- One star: I want all cities in France to be classified as France, and
- Two stars: all cities in the US, except New York and San Francisco, to be classified as USA - Other.
lookup_df = pd.DataFrame({'Continent': ['Europe', 'Europe', 'Asia', 'America', 'America', 'America', 'America', 'Africa'],
'Country': ['France', 'Italy', 'Japan', 'USA', 'USA', 'USA', 'Argentina', '*'],
'City': ['*', '*', '*', 'New York', 'San Francisco', '**', '*', '*'],
'Classification': ['France', 'Italy', 'Japan', 'USA - NY', 'USA - SF', 'USA - Other', 'Argentina', 'Africa']})
If my dataframe is
df = pd.DataFrame({'Continent': ['Europe', 'Europe', 'Asia', 'America ', 'America', 'America', 'Africa'],
'Country': ['France', 'Italy', 'Japan', 'USA', 'USA', 'USA', 'Egypt'],
'City': ['Paris', 'Rome', 'Tokyo', 'San Francisco', 'Houston', 'DC', 'Cairo']})
I am trying to get the following result:
Continent Country City Classification
0 Europe France Paris France
1 Europe Italy Rome Italy
2 Asia Japan Tokyo Japan
3 America USA San Francisco USA - SF
4 America USA Houston USA - Other
5 America USA DC USA - Other
6 Africa Egypt Cairo Africa
I need to start from a lookup table or similar because it's easier to maintain, easier to explain and it's also used by other processes. I can't create a full table, because I would have to consider all possible cities in the world.
Is there any pythonic way of doing this? I thought I could use pd.merge, but I haven't found any examples of this online.
One easy-to-maintain way to handle your task is to use maps:
df2 = df.copy()
# the merge below yields a Classification column, keeping the value where Continent, Country and City all match, and np.nan otherwise
df2 = df2.merge(lookup_df, how='left', on = ["Continent", "Country", "City"])
# create map1 from lookup_df when City is '*' but Country is not '*'
map1 = lookup_df.loc[lookup_df.City.str.match(r'^\*+$') & ~lookup_df.Country.str.match(r'^\*+$')].set_index(['Continent','Country']).Classification.to_dict()
map1
#{('Europe', 'France'): 'France',
# ('Europe', 'Italy'): 'Italy',
# ('Asia', 'Japan'): 'Japan',
# ('America', 'USA'): 'USA - Other',
# ('America', 'Argentina'): 'Argentina'}
# create map2 from lookup_df when both City and Country are '*'
map2 = lookup_df.loc[lookup_df.City.str.match(r'^\*+$') & lookup_df.Country.str.match(r'^\*+$')].set_index('Continent').Classification.to_dict()
map2
#{'Africa': 'Africa'}
# create a function to define your logic:
def set_classification(x):
    return x.Classification if x.Classification is not np.nan else \
           map1[(x.Continent, x.Country)] if (x.Continent, x.Country) in map1 else \
           map2[x.Continent] if x.Continent in map2 else \
           np.nan
# apply the above function to each row of the df2
df2["Classification"] = df2.apply(set_classification, axis = 1)
Note: your original df.Continent in the 4th row contains an extra trailing space ('America '), which will keep the df2 = df2.merge(...) line above from matching. You will need to fix this data issue first.
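For that data issue, one possible fix (my own suggestion, not part of the answer above) is to strip the whitespace before merging:
# remove stray leading/trailing whitespace such as 'America ' -> 'America'
df['Continent'] = df['Continent'].str.strip()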

Filling out empty cells with lists of values

I have a data frame that looks like below:
City State Country
Chicago IL United States
Boston
San Diego CA United States
Los Angeles CA United States
San Francisco
Sacramento
Vancouver BC Canada
Toronto
And I have 3 lists of values that are ready to fill in the None cells:
city = ['Boston', 'San Francisco', 'Sacramento', 'Toronto']
state = ['MA', 'CA', 'CA', 'ON']
country = ['United States', 'United States', 'United States', 'Canada']
The elements of these lists correspond to each other by position: the first items across all three lists belong together, and so on. How can I fill in the empty cells and produce a result like below?
City State Country
Chicago IL United States
Boston MA United States
San Diego CA United States
Los Angeles CA United States
San Francisco CA United States
Sacramento CA United States
Vancouver BC Canada
Toronto ON Canada
My code gives me an error and I'm stuck.
if df.loc[df['City'] == 'Boston']:
    'State' = 'MA'
Any solution is welcome. Thank you.
Create two mappings, one for <city : state>, and another for <city : country>.
city_map = dict(zip(city, state))
country_map = dict(zip(city, country))
Next, set City as the index -
df = df.set_index('City')
And finally, use fillna so that only the missing values are filled in; passing a Series built from each mapping aligns it with the City index -
df['State'] = df['State'].fillna(pd.Series(city_map))
df['Country'] = df['Country'].fillna(pd.Series(country_map))
As a final step, call df.reset_index() to turn City back into a regular column.
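Equivalently, the index does not need to be changed at all. A minimal sketch of the same idea (my own variant, assuming the blank cells are NaN): map looks each city up in the dictionaries, and fillna keeps the values that are already present.
df['State'] = df['State'].fillna(df['City'].map(city_map))
df['Country'] = df['Country'].fillna(df['City'].map(country_map))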
