Pandas regex replace with multiple values and spaces in the values - python

I have the following Pandas code where I am trying to replace the names of countries with the string <country>.
df['title_type2'] = df['title_type']
countries = open(r'countries.txt').read().splitlines() # Reads all lines into a list and removes \n.
countries = [country.replace(' ', r'\s') for country in countries]
pattern = r'\b' + '|'.join(countries) + r'\b'
df['title_type2'].str.replace(pattern, '<country>')
However, I can't get countries with spaces (like South Korea) to work correctly; they do not get replaced. The problem seems to be that my \s is turning into \\s. How can I avoid this, or how else can I fix the issue?

There is no need to replace any space with \s.
Your pattern should rather include:
\b - "starting" word boundary,
(?:...|...|...) a non-capturing group with country names (alternatives),
\b - "ending" word boundary,
something like:
pattern = r'\b(?:China|South Korea|Taiwan)\b'
Then you can do the replacement:
df['title_type2'].str.replace(pattern, '<country>', regex=True)
I created test data as follows:
df = pd.DataFrame(['Abc Taiwan', 'Xyz China', 'Zxx South Korea', 'No country name'],
                  columns=['title_type'])
df['title_type2'] = df['title_type']
and got:
0 Abc <country>
1 Xyz <country>
2 Zxx <country>
3 No country name
Name: title_type2, dtype: object
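To build that pattern straight from the countries.txt list (spaces inside names like "South Korea" can stay as literal spaces; no \s substitution is needed), here is a minimal sketch reusing the question's setup: re.escape guards against any regex metacharacters in the names, longer names go first so they win over shorter ones, and recent pandas versions also need regex=True:

import re
import pandas as pd

countries = open(r'countries.txt').read().splitlines()
countries = sorted(countries, key=len, reverse=True)  # longer names first
pattern = r'\b(?:' + '|'.join(map(re.escape, countries)) + r')\b'
df['title_type2'] = df['title_type'].str.replace(pattern, '<country>', regex=True)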

Related

Extract a substring from a column and replace column data frame

I need some help extracting a substring from a column in my data frame and then replacing that column with the substring. I was wondering whether plain Python string manipulation or a regular expression substitution would give better performance for stripping the string down to the substring.
The string looks something like this in the column:
Person
------
<Person 1234567 Tom Brady>
<Person 456789012 Mary Ann Thomas>
<Person 92145 John Smith>
What I would like is this:
Person
------
Tom Brady
Mary Ann Thomas
John Smith
What I have so far as far as regular expressions go is this:
/^([^.]+[.]+[^.]+)[.]/g
And that just gets this part: '<Person 1234567 '. I'm not sure how to also match the '>' at the end.
Multiple ways, but you can use str.replace():
import pandas as pd
df = pd.DataFrame({'Person': ['<Person 1234567 Tom Brady>',
                              '<Person 456789012 Mary Ann Thomas>',
                              '<Person 92145 John Smith>']})
df['Person'] = df['Person'].str.replace(r'(?:<Person[\d\s]+|>)', '', regex=True)
print(df)
Prints:
Person
0 Tom Brady
1 Mary Ann Thomas
2 John Smith
Pattern used: (?:<Person[\d\s]+|>), broken down:
(?: - Open non-capture group for alternation;
<Person[\d\s]+ - Match literal '<Person' followed by 1+ whitespace characters or digits;
| - Or;
> - A literal '>'
) - Close group.
You can keep things simple by first identifying all the alphabetic runs with this code (note the character class leaves out the angle brackets and digits that surround the name in your data):
res = re.findall(r"[^<>0-9-]+", string)
res[1]
This should return a list of strings like ['Person ', ' Tom Brady'], so you can then access the name of the person with res[1].
Remark: since the matches keep their surrounding spaces, you can easily remove them with strip().
You can read more about re.findall() online or through the documentation.
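For reference, a quick check of that idea on one of the sample values, with the character class excluding the angle-bracket delimiters as above:

import re

string = '<Person 456789012 Mary Ann Thomas>'
res = re.findall(r"[^<>0-9-]+", string)
print(res)             # ['Person ', ' Mary Ann Thomas']
print(res[1].strip())  # Mary Ann Thomas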
Python's re module has a function called search that finds the first place where a pattern matches in a string. With the examples given, you can use it to extract the names with:
import re
s = "<Person 1234567 John Smith>"
re.search(r"[A-Z][a-z]+(\s[A-Z][a-z]+)+", s).group(0)
>>> 'John Smith'
The regular expression [A-Z][a-z]+(\s[A-Z][a-z]+)+ is just matching the names (Tom Brady, Mary Ann Thomas, etc.)
I like to use Pandas' apply function to apply an operation to each row, so the final result would look like this:
import re
import pandas as pd

def extract_name(row):
    row["Person"] = re.search(r"[A-Z][a-z]+(\s[A-Z][a-z]+)+", row["Person"]).group(0)
    return row

df = YOUR DATAFRAME
df2 = df.apply(extract_name, axis=1)
and df2 has the Person column with the extracted names.
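A vectorized alternative to the row-wise apply is str.extract; here is a minimal sketch on the same sample data, assuming every value follows the '<Person digits name>' shape:

import pandas as pd

df = pd.DataFrame({'Person': ['<Person 1234567 Tom Brady>',
                              '<Person 456789012 Mary Ann Thomas>',
                              '<Person 92145 John Smith>']})
# Capture everything between the digits and the closing '>'.
df['Person'] = df['Person'].str.extract(r'\d+\s+([^>]+)>', expand=False)
print(df['Person'].tolist())
# ['Tom Brady', 'Mary Ann Thomas', 'John Smith']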

Repeat pattern using python regex

Well, I'm cleaning a dataset, using Pandas.
I have a column called "Country", where different rows could contain numbers or other information in parentheses, and I have to remove them. For example:
Australia1,
Perú (country),
3Costa Rica, etc.
To do this, I take the column and apply a mapping over it.
pattern = "([a-zA-Z]+[\s]*[a-aZ-Z]+)(?:[(]*.*[)]*)"
df['Country'] = df['Country'].str.extract(pattern)
But I have a problem with this regex: I cannot match names such as "United States of America", because it only captures "United ". How can I repeat the pattern of the first group an unlimited number of times to match the whole name?
Thanks!
In this situation, I will clean the data step by step.
import io
import pandas as pd

df_str = '''
Country
Australia1
Perú (country)
3Costa Rica
United States of America
'''
df = pd.read_csv(io.StringIO(df_str.strip()), sep='\n')
# handle the data
(df['Country']
 .str.replace(r'\d+', '', regex=True)  # remove numbers
 .str.split(r'\(').str[0]              # keep the part before '('
 .str.strip()                          # strip surrounding spaces
)
Thanks for your answer, it worked!
I also found another solution: matching the things that I don't want in the df.
pattern = r"([\s]*[(][\w ]*[)][\s]*)|([\d]*)"  # select the info that I don't want
df['Country'] = df['Country'].replace(pattern, "", regex=True)  # replace that information with an empty string
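For reference, the whole cleanup can also be done in a single extract. A minimal sketch on the sample values from the question: the pattern grabs the first run of characters that are neither digits nor an opening parenthesis, and surrounding spaces are stripped afterwards:

import pandas as pd

df = pd.DataFrame({'Country': ['Australia1', 'Perú (country)',
                               '3Costa Rica', 'United States of America']})
df['Country'] = df['Country'].str.extract(r'([^\d(]+)', expand=False).str.strip()
print(df['Country'].tolist())
# ['Australia', 'Perú', 'Costa Rica', 'United States of America']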

split column regex dataframe python

I have a column in a dataframe where some rows have the state and the city, and others just the city. For example, some rows just have 'Los Angeles', but other rows may have 'CA Los Angeles'.
I want to split that column into two new ones: states and cities, and if the state is not specified, then it can be blank. Something like this:
COLUMN      STATE   CITY
FL Miami    FL      Miami
Houston     null    Houston
I was thinking of maybe splitting with a regex like '[A-Z][A-Z]\s' or something like that, but I cannot make it work. Any ideas?
You can use
^(?:([A-Z]{2})\s+)?(.*)
Details:
^ - start of string
(?:([A-Z]{2})\s+)? - an optional occurrence of
([A-Z]{2}) - Group 1: two uppercase ASCII letters
\s+ - one or more whitespaces
(.*) - Group 2: zero or more chars other than line break chars, as many as possible.
If you are using Pandas, use
df[['STATE','CITY']] = df['COLUMN'].str.extract(r'^(?:([A-Z]{2})\s+)?(.*)', expand=False)
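A quick check of that pattern on the sample rows (note that a missing state comes back as NaN rather than the literal null):

import pandas as pd

df = pd.DataFrame({'COLUMN': ['FL Miami', 'Houston']})
df[['STATE', 'CITY']] = df['COLUMN'].str.extract(r'^(?:([A-Z]{2})\s+)?(.*)', expand=False)
print(df.to_dict('records'))
# [{'COLUMN': 'FL Miami', 'STATE': 'FL', 'CITY': 'Miami'},
#  {'COLUMN': 'Houston', 'STATE': nan, 'CITY': 'Houston'}]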

Regex: match "words" that contain two continuous streaks of digits and letters (or vice versa) and split them

I have the following line of text:
text= 'Cms12345678 Gleandaleacademy Fee Collection 00001234Abcd Renewal 123Acgf456789'
I am trying to split only numbers followed by characters, or characters followed by numbers, to get this output:
output_text = 'Cms 12345678 Gleandaleacademy Fee Collection 00001234 Abcd Renewal 123Acgf456789'
I have tried the following approach:
import re

text = 'Cms12345678 Gleandaleacademy Fee Collection 00001234Abcd Renewal 123Acgf456789'
text = text.lower().strip()
text = text.split(' ')
output_text = []
for i in text:
    if bool(re.match(r'[a-z]+\d+|\d+\w+', i, re.IGNORECASE)) == True:
        out_split = re.split(r'(\d+)', i)
        for j in out_split:
            output_text.append(j)
    else:
        output_text.append(i)
output_text = ' '.join(output_text)
Which is giving output as:
output_text = 'cms 12345678 gleandaleacademy fee collection 00001234 abcd renewal 123 acgf 456789 '
This code is also splitting the last element of text, 123acgf456789, due to an incorrect regex in re.match.
Please help me out to get correct output.
You can use
re.sub(r'\b(?:([a-zA-Z]+)(\d+)|(\d+)([a-zA-Z]+))\b', r'\1\3 \2\4', text)
Details
\b - word boundary
(?: - start of a non-capturing group (necessary for the word boundaries to be applied to all the alternatives):
([a-zA-Z]+)(\d+) - Group 1: one or more letters and Group 2: one or more digits
| - or
(\d+)([a-zA-Z]+) - Group 3: one or more digits and Group 4: one or more letters
) - end of the group
\b - word boundary
During the replacement, only one pair of backreferences (\1 and \2, or \3 and \4) is populated while the other pair expands to an empty string, so concatenating them as \1\3 and \2\4 yields the right result.
A Python demo:
import re
text = "Cms1291682971 Gleandaleacademy Fee Collecti 0000548Andb Renewal 402Ecfev845410001"
print( re.sub(r'\b(?:([a-zA-Z]+)(\d+)|(\d+)([a-zA-Z]+))\b', r'\1\3 \2\4', text) )
# => Cms 1291682971 Gleandaleacademy Fee Collecti 0000548 Andb Renewal 402Ecfev845410001
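If the text lives in a DataFrame column rather than a plain string, the same substitution works through Series.str.replace, which hands the backreferences to re.sub when regex=True. A minimal sketch with a hypothetical column name 'narration':

import pandas as pd

df = pd.DataFrame({'narration': ['Cms12345678 Fee Collection',
                                 '00001234Abcd Renewal',
                                 '123Acgf456789']})
df['narration'] = df['narration'].str.replace(
    r'\b(?:([a-zA-Z]+)(\d+)|(\d+)([a-zA-Z]+))\b', r'\1\3 \2\4', regex=True)
print(df['narration'].tolist())
# ['Cms 12345678 Fee Collection', '00001234 Abcd Renewal', '123Acgf456789']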

Removing substrings from a list of strings

There are several countries with numbers and/or parentheses in my list. How do I remove these?
e.g.
'Bolivia (Plurinational State of)' should be 'Bolivia',
'Switzerland17' should be 'Switzerland'.
Run just:
df.Country.replace(r'\d+|\s*\([^)]*\)', '', regex=True, inplace=True)
Assuming that the initial content of your DataFrame is:
Country
0 Bolivia (Plurinational State of)
1 Switzerland17
2 United Kingdom
after the above replace you will have:
Country
0 Bolivia
1 Switzerland
2 United Kingdom
The above pattern contains:
the first alternative - a non-empty sequence of digits,
the second alternative:
an optional sequence of whitespace chars,
an opening parenthesis (escaped),
a sequence of chars other than ) (inside a character class no escaping is needed),
a closing parenthesis (also escaped).
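A quick sanity check of that pattern outside pandas, on the two example strings:

import re

pattern = r'\d+|\s*\([^)]*\)'
print(re.sub(pattern, '', 'Bolivia (Plurinational State of)'))  # Bolivia
print(re.sub(pattern, '', 'Switzerland17'))                     # Switzerland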
Use Series.str.replace with a regex for the replacement: \s* matches possible spaces before (, \(.*\) matches the parentheses and whatever is between them, | is the regex 'or', and \d+ matches numbers of 1 or more digits:
df = pd.DataFrame({'a':['Bolivia (Plurinational State of)','Switzerland17']})
df['a'] = df['a'].str.replace(r'(\s*\(.*\)|\d+)', '', regex=True)
print (df)
a
0 Bolivia
1 Switzerland
You can remove the substrings this way:
Remove numbers:
import re
a = 'Switzerland17'
pattern = '[0-9]'
res = re.sub(pattern, '', a)
print(res)
Output:
'Switzerland'
Remove parentheses:
b = 'Bolivia (Plurinational State of)'
pattern2 = r'(\s*\(.*\))'
res2 = re.sub(pattern2, '', b)
print(res2)
Output:
'Bolivia'
Using Regex and simple List Operation
Go through the list items, find the regex match in each item, and replace the values in place. The regex "[a-zA-Z]{2,}" only matches a run of letters with a minimum length of two, so it stops at digits or parentheses. The better approach with regex is to match strings based on your input domain (i.e. countries in your case): a country name cannot contain a number or parentheses. So you should use the following:
import re

list_of_country_strings = ["Switzerland17", "America290", "Korea(S)"]
for index in range(len(list_of_country_strings)):
    x = re.match("[a-zA-Z]{2,}", string=list_of_country_strings[index])
    if x:
        list_of_country_strings[index] = list_of_country_strings[index][x.start():x.end()]
print(list_of_country_strings)
Output
['Switzerland', 'America', 'Korea']
