Removing substrings from a list of strings - python

There are several countries with numbers and/or parentheses in my list. How do I remove these?
e.g.
'Bolivia (Plurinational State of)' should be 'Bolivia',
'Switzerland17' should be 'Switzerland'.

Run just:
df.Country.replace(r'\d+|\s*\([^)]*\)', '', regex=True, inplace=True)
Assuming that the initial content of your DataFrame is:
Country
0 Bolivia (Plurinational State of)
1 Switzerland17
2 United Kingdom
after the above replace you will have:
Country
0 Bolivia
1 Switzerland
2 United Kingdom
The above pattern contains two alternatives:
the first alternative - \d+ - a non-empty sequence of digits,
the second alternative - \s*\([^)]*\) - composed of:
an optional run of whitespace characters,
an opening parenthesis (escaped),
a sequence of characters other than ) (inside a character class no escaping is needed),
a closing parenthesis (also escaped).
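A self-contained sketch of the above, assuming the column is named Country as in the question (written with an explicit assignment instead of inplace=True):
import pandas as pd

df = pd.DataFrame({'Country': ['Bolivia (Plurinational State of)',
                               'Switzerland17',
                               'United Kingdom']})
# remove either a run of digits or an optional space plus a "(...)" group
df['Country'] = df['Country'].replace(r'\d+|\s*\([^)]*\)', '', regex=True)
print(df)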

Use Series.str.replace with a regex for the replacement: \s* matches possible spaces before (, \(.*\) matches the parentheses and anything between them, | is the regex "or", and \d+ matches a number with 1 or more digits:
df = pd.DataFrame({'a':['Bolivia (Plurinational State of)','Switzerland17']})
df['a'] = df['a'].str.replace(r'(\s*\(.*\)|\d+)', '', regex=True)
print(df)
a
0 Bolivia
1 Switzerland

You can remove the substrings this way:
Remove numbers:
import re
a = 'Switzerland17'
pattern = '[0-9]'
res = re.sub(pattern, '', a)
print(res)
Output:
Switzerland
Remove parentheses:
b = 'Bolivia (Plurinational State of)'
pattern2 = r'(\s*\(.*\))'
res2 = re.sub(pattern2, '', b)
print(res2)
Output:
Bolivia
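Both removals can also be combined into a single re.sub with an alternation, for example (a sketch):
import re

for s in ['Switzerland17', 'Bolivia (Plurinational State of)']:
    print(re.sub(r'\d+|\s*\([^)]*\)', '', s))
# Switzerland
# Bolivia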

Using Regex and simple List Operation
Go through the list items, find the regex match in each item, and replace the value in place. The regex "[a-zA-Z]{2,}" matches only runs of letters with a minimum length of two, so digits and parentheses are left out of the match. The better approach with regex is to match based on your input domain (i.e. countries, in your case): a country name cannot contain digits or parentheses, so you can use the following.
import re
list_of_country_strings = ["Switzerland17", "America290", "Korea(S)"]
for index in range(len(list_of_country_strings)):
    x = re.match("[a-zA-Z]{2,}", string=list_of_country_strings[index])
    if x:
        list_of_country_strings[index] = list_of_country_strings[index][x.start():x.end()]
print(list_of_country_strings)
Output
['Switzerland', 'America', 'Korea']
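The same idea can also be written as a list comprehension, if you prefer (a sketch; requires Python 3.8+ for the walrus operator, and items without a match are kept unchanged):
import re

list_of_country_strings = ["Switzerland17", "America290", "Korea(S)"]
cleaned = [m.group(0) if (m := re.match(r"[a-zA-Z]{2,}", s)) else s
           for s in list_of_country_strings]
print(cleaned)  # ['Switzerland', 'America', 'Korea']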

Related

How to remove numbers from a string column that starts with 4 zeros?

I have a column with names and information about products. I need to remove the codes from the names; every code starts with four or more zeros. Some names have four or more zeros in the weight, and some codes are joined to the name, as in the example below:
import pandas as pd

data = {
    'Name': ['ANOA 250g 00004689', 'ANOA 10000g 00000059884', '80%c asjw 150000001568 ', 'Shivangi000000478761'],
}
testdf = pd.DataFrame(data)
The correct output would be:
results = {
    'Name': ['ANOA 250g', 'ANOA 10000g', '80%c asjw 150000001568 ', 'Shivangi'],
}
results = pd.DataFrame(results)
You can split the strings at the start of the code, which is expressed by the regex (?<!\d)0{4,}. This pattern consumes four or more 0s that are not preceded by any digit. After splitting the string, take the first fragment; str.strip gets rid of a possible trailing space.
testdf.Name.str.split(r'(?<!\d)0{4,}', regex=True, expand=True)[0].str.strip()
# outputs:
0 ANOA 250g
1 ANOA 10000g
2 80%c asjw 150000001568
3 Shivangi
Note that this works for the case where the codes are always at the end of your string.
Use a regex with str.replace:
testdf['Name'] = testdf['Name'].str.replace(r'(?:(?<=\D)|\s*\b)0{4}\d*', '', regex=True)
Or, similar to @HaleemurAli, with a negative lookbehind:
testdf['Name'] = testdf['Name'].str.replace(r'\s*(?<!\d)0{4,}\d*', '', regex=True)
Output:
Name
0 ANOA 250g
1 ANOA 10000g
2 80%c asjw 150000001568
3 Shivangi
Try splitting each value at spaces and checking whether each token contains 0000, like:
answer = []
for i in testdf["Name"]:
    answer.append(" ".join([j for j in i.split() if "0000" not in j]))

Extract a substring from a column and replace column data frame

I need some help extracting a substring from a column in my data frame and then replacing that column with the substring. I was wondering whether plain Python string manipulation or a regular-expression substitution would give better performance for replacing the string with the substring.
The string looks something like this in the column:
Person
------
<Person 1234567 Tom Brady>
<Person 456789012 Mary Ann Thomas>
<Person 92145 John Smith>
What I would like is this:
Person
------
Tom Brady
Mary Ann Thomas
John Smith
What I have so far as far as regular expressions go is this:
/^([^.]+[.]+[^.]+)[.]/g
And that just gets this part, '<Person 1234567 '; I'm not sure how to get the '>' off the end as well.
Multiple ways, but you can use str.replace():
import pandas as pd
df = pd.DataFrame({'Person': ['<Person 1234567 Tom Brady>',
                              '<Person 456789012 Mary Ann Thomas>',
                              '<Person 92145 John Smith>']})
df['Person'] = df['Person'].str.replace(r'(?:<Person[\d\s]+|>)', '', regex=True)
print(df)
Prints:
Person
0 Tom Brady
1 Mary Ann Thomas
2 John Smith
Pattern used: (?:<Person[\d\s]+|>), broken down:
(?: - Open non-capture group for alternation;
<Person[\d\s]+ - Match literal '<Person' followed by 1+ whitespace characters or digits;
| - Or;
> - A literal '>'
) - Close group.
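The same pattern also works outside pandas with plain re.sub, e.g. (a quick sketch):
import re

print(re.sub(r'(?:<Person[\d\s]+|>)', '', '<Person 1234567 Tom Brady>'))
# Tom Brady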
To keep things simple, you can first find all the runs of characters that are not digits, parentheses, or hyphens with this code:
res = re.findall(r"[^()0-9-]+", string)
res[1]
This should return a list of strings such as ['Person', 'Tom Brady'], and you can then access the name of the person with res[1].
Remark: I have yet to try the code; in case it also returns spaces, you should be able to remove them easily with strip(), or the name may sit at a different index of res instead.
You can read more about re.findall() online or through the documentation.
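As a follow-up sketch (not the original answer's code), extending the character class to also exclude the angle brackets and stripping the leftover spaces:
import re

res = re.findall(r"[^<>()0-9-]+", "<Person 456789012 Mary Ann Thomas>")
print(res)             # ['Person ', ' Mary Ann Thomas']
print(res[1].strip())  # Mary Ann Thomas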
Python regex has a function called search that finds the matching pattern in a string. With the examples given, you can use regex to extract the names with:
import re
s = "<Person 1234567 John Smith>"
re.search("[A-Z][a-z]+(\s[A-Z][a-z]+)+", s).group(0)
>>> 'John Smith'
The regular expression [A-Z][a-z]+(\s[A-Z][a-z]+)+ is just matching the names (Tom Brady, Mary Ann Thomas, etc.)
I like to use Pandas' apply function to apply an operation to each row, so the final result would look like this:
import re
import pandas as pd
def extract_name(row):
    row["Person"] = re.search(r"[A-Z][a-z]+(\s[A-Z][a-z]+)+", row["Person"]).group(0)
    return row

df = YOUR DATAFRAME
df2 = df.apply(extract_name, axis=1)
and df2 has the Person column with the extracted names.
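For example, with the sample data from the question (a usage sketch, reusing extract_name from the code above):
import pandas as pd

df = pd.DataFrame({'Person': ['<Person 1234567 Tom Brady>',
                              '<Person 456789012 Mary Ann Thomas>',
                              '<Person 92145 John Smith>']})
df2 = df.apply(extract_name, axis=1)   # extract_name as defined above
print(df2)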

Remove leading words pandas

I have this data df where Names is a column name and below it are its data:
Names
------
23James
0Sania
4124Thomas
101Craig
8Rick
How can I return it to this:
Names
------
James
Sania
Thomas
Craig
Rick
I tried df.strip, but certain numbers are still left in the DataFrame.
You can also extract all characters after digits using a capture group:
df['Names'] = df['Names'].str.extract(r'^\d+(.*)')
print(df)
# Output
Names
0 James
1 Sania
2 Thomas
3 Craig
4 Rick
We can use str.replace here with the regex pattern ^\d+, which targets leading digits.
df["Names"] = df["Names"].str.replace(r'^\d+', '')
The answer by Tim certainly solves this, but I usually feel uncomfortable using regex since I'm not proficient with it, so I would approach it like this:
def removeStartingNums(s):
    count = 0
    for i in s:
        if i.isnumeric():
            count += 1
        else:
            break
    return s[count:]

df["Names"] = df["Names"].apply(removeStartingNums)
What the function essentially does is count the number of leading characters that are numeric and then return the string with those starting characters sliced off.
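Another regex-free option (not part of the answer above, just a sketch) is Series.str.lstrip, which strips any of the listed characters from the left of each value:
df["Names"] = df["Names"].str.lstrip("0123456789")
This works here because the names themselves do not start with digits.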

Pandas regex replace with multiple values and spaces in the values

I have the following Pandas code where I am trying to replace the names of countries with the string <country>.
df['title_type2'] = df['title_type']
countries = open(r'countries.txt').read().splitlines() # Reads all lines into a list and removes \n.
countries = [country.replace(' ', r'\s') for country in countries]
pattern = r'\b' + '|'.join(countries) + r'\b'
df['title_type2'].str.replace(pattern, '<country>')
However I can't get countries with spaces (like South Korea) to work correctly, since they do not get replaced. The problem seems to be that my \s is turning into \\s. How can I avoid this or how can I fix the issue?
There is no need to replace any space with \s.
Your pattern should rather include:
\b - "starting" word boundary,
(?:...|...|...) a non-capturing group with country names (alternatives),
\b - "ending" word boundary,
something like:
pattern = r'\b(?:China|South Korea|Taiwan)\b'
Then you can do the replacement:
df['title_type2'].str.replace(pattern, '<country>', regex=True)
I created test data as follows:
df = pd.DataFrame(['Abc Taiwan', 'Xyz China', 'Zxx South Korea', 'No country name'],
                  columns=['title_type'])
df['title_type2'] = df['title_type']
and got:
0 Abc <country>
1 Xyz <country>
2 Zxx <country>
3 No country name
Name: title_type2, dtype: object
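If the country list is long and comes from a file, a sketch of building the pattern safely with re.escape (the list is hardcoded here for illustration in place of reading countries.txt as in the question; no \s replacement is needed):
import re
import pandas as pd

df = pd.DataFrame(['Abc Taiwan', 'Xyz China', 'Zxx South Korea', 'No country name'],
                  columns=['title_type'])

countries = ['China', 'South Korea', 'Taiwan']   # e.g. the lines read from countries.txt
# re.escape protects any special characters in the names; spaces can stay as-is
pattern = r'\b(?:' + '|'.join(map(re.escape, countries)) + r')\b'
df['title_type2'] = df['title_type'].str.replace(pattern, '<country>', regex=True)
print(df['title_type2'])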

Match words only if preceded by specific pattern

I have a string from a NWS bulletin:
LTUS41 KCAR 141558 AAD TMLB Forecast for the National Parks
KHNX 141001 RECHNX Weather Service San Joaquin Valley
My aim is to extract a couple fields with regular expressions. In the first string I want "AAD" and from the second string I want "RECHNX". I have tried:
( )\w{3} #for the first string
and
\w{6} #for the 2nd string
But these find all 3- and 6-character strings leading up to the string I want.
Assuming the fields you want to extract are always in capital letters and preceded by 6 digits and a space, this regular expression would do the trick:
(?<=\d{6}\s)[A-Z]+
Demo: https://regex101.com/r/dsDHTs/1
Edit: if you want to match up to two alpha-numeric uppercase words preceded by 6 digits, you can use:
(?<=\d{6}\s)([A-Z0-9]+\b)\s(?:([A-Z0-9]+\b))*
Demo: https://regex101.com/r/dsDHTs/5
If you have a specific list of valid fields, you could also simply use:
(AAD|TMLB|RECHNX|RR4HNX)
https://regex101.com/r/dsDHTs/3
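For reference, a quick runnable check of the lookbehind approach from the start of this answer on the two bulletin lines (a sketch):
import re

bulletin = [
    "LTUS41 KCAR 141558 AAD TMLB Forecast for the National Parks",
    "KHNX 141001 RECHNX Weather Service San Joaquin Valley",
]
for line in bulletin:
    m = re.search(r"(?<=\d{6}\s)[A-Z]+", line)
    print(m.group(0) if m else None)
# AAD
# RECHNX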
Since the substring you want to extract is a word that follows a number, separated by a space, you can use re.search with the following regex (given your input stored in s):
re.search(r'\b\d+ (\w+)', s).group(1)
To read the first few groups of word characters from each line, you can use a pattern like
(\w+) (\w+) (\w+) (\w+).
Then, from the first line read group No. 4, and from the second line read group No. 3.
Look at the following program. It prints four groups from each source line:
import re
txt = """LTUS41 KCAR 141558 AAD TMLB Forecast for the National Parks
KHNX 141001 RECHNX Weather Service San Joaquin Valley"""
n = 0
pat = re.compile(r'(\w+) (\w+) (\w+) (\w+)')
for line in txt.splitlines():
    n += 1
    print(f'{n:2}: {line}')
    mtch = pat.search(line)
    if mtch:
        gr = [mtch.group(i) for i in range(1, 5)]
        print(f'    {gr}')
The result is:
 1: LTUS41 KCAR 141558 AAD TMLB Forecast for the National Parks
    ['LTUS41', 'KCAR', '141558', 'AAD']
 2: KHNX 141001 RECHNX Weather Service San Joaquin Valley
    ['KHNX', '141001', 'RECHNX', 'Weather']
