String manipulation based on the pattern of two columns: is there a convenient way? - python

import numpy as np
import pandas as pd
d = {'country': ['US', 'US', 'United Kingdom', 'United Kingdom'],
'province/state': ['New York', np.nan, 'Gibraltar', np.nan]}
df = pd.DataFrame(data=d)
I guess there are three steps:
Step 1: fill the NA of the province with the related country
df['province/state'] = df['province/state'].fillna(df['country'])
Step 2: create a new col by concatenating the country and province with '-':
df['new_geo'] = df['country'] + '-' + df['province/state']
Step 3: remove the country if it is repeated:
For example, remove United Kingdom-United Kingdom and keep only the non-repeated ones, such as United Kingdom-Gibraltar. But I am not sure what regex should be used.
Is there any convenient way to do this?

Try:
df['new_geo'] = np.where(df['province/state'].notna(), df['country'] + '-' + df['province/state'], df['country'])
df['province/state']=df['province/state'].fillna(df['country'])
Outputs:
country province/state new_geo
0 US New York US-New York
1 US US US
2 United Kingdom Gibraltar United Kingdom-Gibraltar
3 United Kingdom United Kingdom United Kingdom
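For reference, the np.where approach above can be wrapped into a small self-contained sketch. The helper name add_geo_label and its parameters are hypothetical, introduced only for illustration; the logic is the same as in the answer.

```python
import numpy as np
import pandas as pd


def add_geo_label(frame, country_col='country', province_col='province/state'):
    """Hypothetical helper: return a copy with the province filled and a 'new_geo' label."""
    has_province = frame[province_col].notna()
    # Join country and province only where a province exists; otherwise keep the country alone.
    label = np.where(has_province,
                     frame[country_col] + '-' + frame[province_col],
                     frame[country_col])
    out = frame.copy()
    out['new_geo'] = label
    # Fill the missing provinces after the label is built, so repeated names never appear.
    out[province_col] = out[province_col].fillna(out[country_col])
    return out


df = pd.DataFrame({
    'country': ['US', 'US', 'United Kingdom', 'United Kingdom'],
    'province/state': ['New York', np.nan, 'Gibraltar', np.nan],
})
result = add_geo_label(df)
print(result)
```

Building the label before filling the missing provinces is what avoids the regex step: a value such as United Kingdom-United Kingdom is never created in the first place.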

Combine the strings using pandas str.cat, then fill the empty cells sideways using ffill with axis=1.
res = (df
.assign(new_geo = lambda x: x.country.str.cat(x['province/state'],sep='-'))
.ffill(axis=1)
)
res
country province/state new_geo
0 US New York US-New York
1 US US US
2 United Kingdom Gibraltar United Kingdom-Gibraltar
3 United Kingdom United Kingdom United Kingdom

Related

create a column value if the column value is in the list- python pandas

I want to create a new column called Playercategory for this data:
Name Country
Kobe United States
John Italy
Charly Japan
Braven Japan / United States
Rocky Germany / United States
Bran Lithuania
Nick United States / Ukraine
Jonas Nigeria
If the player's nationality is 'United States', or United States combined with any non-European country, then Playercategory == "American".
If the player's nationality is a European country, or a European country combined with any other country, then Playercategory == "Europe" (e.g. 'Italy', 'Italy / United States', 'Germany / United States', 'Lithuania / Australia', 'Belgium').
For all other players, Playercategory == "Non".
Expected Output:
Name Country Playercategory
Kobe United States American
John Italy Europe
Charly Japan Non
Braven Japan / United States American
Rocky Germany / United States Europe
Bran Lithuania Europe
Nick United States / Ukraine American
Jonas Nigeria Non
What I tried:
First I created a list with Europe countries:
euroCountries=['Austria', 'Belgium', 'Bulgaria', 'Croatia', 'Cyprus', 'Czechia', 'Denmark',
'Estonia', 'Finland', 'France', 'Germany', 'Greece', 'Hungary', 'Ireland',
'Italy', 'Latvia', 'Lithuania', 'Luxembourg', 'Malta', 'Netherlands',
'Poland', 'Portugal', 'Romania', 'Slovakia', 'Slovenia', 'Spain', 'Sweden']
I know how to check a single condition, like this:
df["PlayerCatagory"] = np.where(df["Country"].isin(euroCountries), "Europe", "Non")
But I don't know how to combine the three conditions above and create PlayerCategory correctly.
I'd really appreciate your help!
Use numpy.select, which picks the first condition that is True for each row: first test for a match against euroCountries with Series.str.contains, then test for a match against United States:
m1 = df['Country'].str.contains('|'.join(euroCountries))
m2 = df['Country'].str.contains('United States')
Or test the split values with Series.str.split plus DataFrame.isin or DataFrame.eq, then check for at least one match per row with DataFrame.any (keeping m1 as the European mask so the order matches the labels below):
df1 = df['Country'].str.split(' / ', expand=True)
m1 = df1.isin(euroCountries).any(axis=1)
m2 = df1.eq('United States').any(axis=1)
df["PlayerCatagory"] = np.select([m1, m2], ['Europe','American'], default='Non')
print (df)
Name Country PlayerCatagory
0 Kobe United States American
1 John Italy Europe
2 Charly Japan Non
3 Braven Japan / United States American
4 Rocky Germany / United States Europe
5 Bran Lithuania Europe
6 Nick United States / Ukraine American
7 Jonas Nigeria Non
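A runnable sketch of the split-based variant may make the ordering rule clearer. It assumes a shortened European list (just the three countries that occur in the sample) and uses illustrative variable names; np.select assigns the label of the first mask that is True, so the European mask is listed before the American one.

```python
import numpy as np
import pandas as pd

# Shortened list for illustration only; use the full euroCountries list in practice.
euro_countries = ['Germany', 'Italy', 'Lithuania']

df = pd.DataFrame({
    'Name': ['Kobe', 'John', 'Charly', 'Braven', 'Rocky', 'Bran', 'Nick', 'Jonas'],
    'Country': ['United States', 'Italy', 'Japan', 'Japan / United States',
                'Germany / United States', 'Lithuania',
                'United States / Ukraine', 'Nigeria'],
})

# One column per nationality; rows with a single nationality get None in the second column.
parts = df['Country'].str.split(' / ', expand=True)
is_europe = parts.isin(euro_countries).any(axis=1)
is_american = parts.eq('United States').any(axis=1)

# Order matters: a row matching both masks takes the first label ('Europe').
df['PlayerCatagory'] = np.select([is_europe, is_american],
                                 ['Europe', 'American'],
                                 default='Non')
print(df)
```

Comparing whole split values avoids the partial matches that str.contains could produce if one country name were a substring of another.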

How to replace specific entries in a DataFrame column when replace() does not update them? [duplicate]

My task is to remove any content in parentheses, remove any numbers following the country name, and change the names of a couple of countries.
e.g.
'Bolivia (Plurinational State of)' should be 'Bolivia'
'Switzerland17' should be 'Switzerland'.
My original code was in the order:
dict1 = {
"Republic of Korea": "South Korea",
"United States of America": "United States",
"United Kingdom of Great Britain and Northern Ireland": "United Kingdom",
"China, Hong Kong Special Administrative Region": "Hong Kong"}
energy['Country'] = energy['Country'].replace(dict1)
energy['Country'] = energy['Country'].str.replace(r' \(.*\)', '', regex=True)
energy['Country'] = energy['Country'].str.replace(r'\d+', '', regex=True)
energy.loc[energy['Country'] == 'United States']
The str.replace part works fine and the cleanup is done.
But when I use the last line to check whether the country name was changed, the original code returns nothing. However, if I change the order of the code to:
energy['Country'] = energy['Country'].str.replace(r' \(.*\)', '', regex=True)
energy['Country'] = energy['Country'].str.replace(r'\d+', '', regex=True)
energy['Country'] = energy['Country'].replace(dict1)
Then it successfully changes the Country Name.
So there must be something wrong with my regex syntax. How can I solve this conflict, and why is this happening?
The problem is that Series.replace with a dictionary matches whole cell values by default; to replace substrings you need regex=True:
energy = pd.DataFrame({'Country':['United States of America4',
'United States of America (aaa)','Slovakia']})
print (energy)
Country
0 United States of America4
1 United States of America (aaa)
2 Slovakia
dict1 = {
"Republic of Korea": "South Korea",
"United States of America": "United States",
"United Kingdom of Great Britain and Northern Ireland": "United Kingdom",
"China, Hong Kong Special Administrative Region": "Hong Kong"}
# nothing is replaced because no whole value matches (trailing numbers and parentheses)
energy['Country'] = energy['Country'].replace(dict1)
print (energy)
Country
0 United States of America4
1 United States of America (aaa)
2 Slovakia
energy['Country'] = energy['Country'].str.replace(r' \(.*\)', '', regex=True)
energy['Country'] = energy['Country'].str.replace(r'\d+', '', regex=True)
print (energy)
Country
0 United States of America
1 United States of America
2 Slovakia
print (energy.loc[energy['Country'] == 'United States'])
Empty DataFrame
Columns: [Country]
Index: []
# starting again from the original data: regex=True replaces substrings
energy['Country'] = energy['Country'].replace(dict1, regex=True)
print (energy)
Country
0 United States4
1 United States (aaa)
2 Slovakia
energy['Country'] = energy['Country'].str.replace(r' \(.*\)', '', regex=True)
energy['Country'] = energy['Country'].str.replace(r'\d+', '', regex=True)
print (energy)
Country
0 United States
1 United States
2 Slovakia
print (energy.loc[energy['Country'] == 'United States'])
Country
0 United States
1 United States
# starting again from the original data: clean first
energy['Country'] = energy['Country'].str.replace(r' \(.*\)', '', regex=True)
energy['Country'] = energy['Country'].str.replace(r'\d+', '', regex=True)
print (energy)
Country
0 United States of America
1 United States of America
2 Slovakia
# now replace works, because the whole values match the dictionary keys
energy['Country'] = energy['Country'].replace(dict1)
print (energy)
Country
0 United States
1 United States
2 Slovakia
print (energy.loc[energy['Country'] == 'United States'])
Country
0 United States
1 United States
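The three behaviours can be compared side by side in a short sketch. It uses a reduced one-entry mapping and illustrative variable names, and passes regex=True explicitly to str.replace, since recent pandas versions treat the pattern as a literal string by default.

```python
import pandas as pd

names = pd.Series(['United States of America4',
                   'United States of America (aaa)',
                   'Slovakia'])
mapping = {'United States of America': 'United States'}

# 1) Dict replace compares whole cell values, so the uncleaned names are left unchanged.
exact = names.replace(mapping)

# 2) With regex=True the keys are treated as patterns and substrings are replaced.
partial = names.replace(mapping, regex=True)

# 3) Cleaning first makes the cells equal to the dict keys, so exact replacement works.
cleaned = (names
           .str.replace(r' \(.*\)', '', regex=True)   # drop parenthesised text
           .str.replace(r'\d+', '', regex=True))      # drop digits
renamed = cleaned.replace(mapping)

print(exact.tolist())
print(partial.tolist())
print(renamed.tolist())
```

Cleaning first and then using the exact dictionary replace is the stricter choice, because a regex replace may also rewrite longer names that merely contain one of the keys.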

Create stacked pandas series from series with list elements

I have a pandas series with elements as list:
import pandas as pd
s = pd.Series([ ['United States of America'],['China', 'Hong Kong'], []])
print(s)
0 [United States of America]
1 [China, Hong Kong]
2 []
How to get a series like the following:
0 United States of America
1 China
1 Hong Kong
I am not sure about what happens to 2.
The following options all return a Series. Build a new DataFrame from the lists and stack it:
pd.DataFrame(s.tolist()).stack()
0 0 United States of America
1 0 China
1 Hong Kong
dtype: object
To reset the index, use
pd.DataFrame(s.tolist()).stack().reset_index(drop=True)
0 United States of America
1 China
2 Hong Kong
dtype: object
To convert to DataFrame, call to_frame()
pd.DataFrame(s.tolist()).stack().reset_index(drop=True).to_frame('countries')
countries
0 United States of America
1 China
2 Hong Kong
If you're trying to code golf, use
sum(s, [])
# ['United States of America', 'China', 'Hong Kong']
pd.Series(sum(s, []))
0 United States of America
1 China
2 Hong Kong
dtype: object
Or even,
pd.Series(np.sum(s))
0 United States of America
1 China
2 Hong Kong
dtype: object
However, like most operations that sum lists, this performs badly (repeated list concatenation is inefficient).
Faster operations are possible using itertools.chain:
from itertools import chain
pd.Series(list(chain.from_iterable(s)))
0 United States of America
1 China
2 Hong Kong
dtype: object
pd.DataFrame(list(chain.from_iterable(s)), columns=['countries'])
countries
0 United States of America
1 China
2 Hong Kong
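Or, for this particular input only (it relies on the second column holding exactly one non-null value to fill the one empty row), use: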
df = pd.DataFrame(s.tolist())
print(df[0].fillna(df[1].dropna().item()))
Output:
0 United States of America
1 China
2 Hong Kong
Name: 0, dtype: object
Assuming the elements are lists:
pd.Series(s.sum())
Out[103]:
0 United States of America
1 China
2 Hong Kong
dtype: object
A simpler and probably much less computationally expensive way is the pandas explode method. In your case, the answer would be:
s.explode()
As simple as that! Note that the empty list at index 2 becomes NaN, so chain .dropna() if you do not want it. For a DataFrame with several columns, pass the name of the column to explode, for example df.explode('country').
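A minimal sketch of explode on the series from the question shows what happens to the empty list; the variable names are illustrative.

```python
import pandas as pd

s = pd.Series([['United States of America'], ['China', 'Hong Kong'], []])

# Each list element gets its own row and keeps the index of its source row.
exploded = s.explode()

# The empty list yields a single NaN row; drop it and renumber if a flat series is wanted.
flat = exploded.dropna().reset_index(drop=True)

print(exploded)
print(flat)
```

Keeping the original index in the exploded result is useful when the values need to be joined back to other columns later.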

Slicing key words to become a new category column in python

data = pd.Series(['ABC Company, UK', 'CDE Company, US', 'CN DEF Company'])
data
[out]
0 ABC Company, UK
1 CDE Company, US
2 CN DEF Company
dtype: object
How can I turn this into a DataFrame with another column named 'Region' that converts UK to United Kingdom, US to United States, and CN to China?
I guess a dictionary could be used for that?
If you split the code out of your column first, you can map using a dictionary:
>>> df = pd.DataFrame({'country_code':['UK','US','CN']})
>>> countries = {'UK':'United Kingdom',
'US':'United States',
'CN':'China'}
>>> df['country_name'] = df['country_code'].map(countries)
>>> df
country_code country_name
0 UK United Kingdom
1 US United States
2 CN China
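With your data, you can do the following (this handles only the rows where the code follows a comma; 'CN DEF Company' has no comma, so that row ends up as NaN):
data = data.str.split(pat=', ', expand=True)
countries = {'UK':'United Kingdom',
'US':'United States',
'CN':'China'}
data[1] = data[1].map(countries)
data = data[0].str.cat(data[1], sep=', ')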
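Because the code can sit at the start or the end of the string, one possible sketch extracts it with a regular expression before mapping. This swaps the split for Series.str.extract and assumes that the codes appear as whole words and that company names do not themselves contain one of them; the column name 'Company' is made up for illustration.

```python
import pandas as pd

data = pd.Series(['ABC Company, UK', 'CDE Company, US', 'CN DEF Company'])
countries = {'UK': 'United Kingdom',
             'US': 'United States',
             'CN': 'China'}

# Pattern matching any known code as a whole word, e.g. r'\b(UK|US|CN)\b'.
pattern = r'\b(' + '|'.join(countries) + r')\b'

# expand=False returns a Series holding the first match per row (NaN if none).
codes = data.str.extract(pattern, expand=False)

df = pd.DataFrame({'Company': data, 'Region': codes.map(countries)})
print(df)
```

Rows without any known code would get NaN in 'Region', which makes unmapped entries easy to find.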

