Removing whitespaces using regex python - python

I am trying to amend each line of a file to remove any parts beginning with the character '(' or containing a number/character in square brackets, e.g. '[2]':
f = open('/Users/name/Desktop/university_towns.txt',"r")
listed = []
import re
for i in f.readlines():
    if i.find(r'\(.*?\)\n'):
        here = re.sub(r'\(.*?\)\[.*?\]\n', "", i)
        listed.append(here)
    elif i.find(r' \(.*?\)\n'):
        here = re.sub(r' \(.*?\)\[.*?\]\n', "", i)
        listed.append(here)
    elif i.find(r' \[.*?\]\n'):
        here = re.sub(r' \[.*?\]\n', "", i)
        listed.append(here)
    else:
        here = re.sub(r'\[.*?\]\n', "", i)
        listed.append(here)
A sample of my input data:
Platteville (University of Wisconsin–Platteville)[2]
River Falls (University of Wisconsin–River Falls)[2]
Stevens Point (University of Wisconsin–Stevens Point)[2]
Waukesha (Carroll University)
Whitewater (University of Wisconsin–Whitewater)[2]
Wyoming[edit]
Laramie (University of Wyoming)[5]
A sample of my output data:
Platteville
River Falls
Stevens Point
Waukesha (Carroll University)
Whitewater
Wyoming[edit]
Laramie
However, I do not want the parts such as '(Carroll University)' or '[edit]'.
How can I amend my formula?
I would be so grateful if anyone could give me any advice!

You can do:
import re
with open(ur_file) as f_in:
    for line in f_in:
        if m := re.search(r'^([^([]+)', line):  # walrus operator, Python 3.8+
            print(m.group(1))
If your Python is older than 3.8 and lacks the walrus operator:
with open(ur_file) as f_in:
    for line in f_in:
        m = re.search(r'^([^([]+)', line)
        if m:
            print(m.group(1))
Prints:
Platteville
River Falls
Stevens Point
Waukesha
Whitewater
Wyoming
Laramie
The regex explained:
^([^([]+)
^          start of the line
( ... )    capture group
[ ... ]    character class
[^([]      class of characters OTHER THAN ( and [
+          means one or more
Here is the regex on Regex101
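As a rough sketch of the approach above, the same pattern can be wrapped in a small helper that also trims the space left in front of the bracket. The helper name clean_name and the inline sample list are assumptions made for illustration; they are not part of the original answer.

```python
import re

# Everything from the start of the line up to the first "(" or "[".
NAME_RE = re.compile(r'^([^(\[]+)')

def clean_name(line):
    """Return the text before the first '(' or '[', without surrounding whitespace."""
    m = NAME_RE.match(line)
    # A line that starts with a bracket has no name part; return "" for it.
    return m.group(1).strip() if m else ""

# Hypothetical in-memory sample standing in for the lines of the input file.
sample = [
    "Platteville (University of Wisconsin–Platteville)[2]\n",
    "Waukesha (Carroll University)\n",
    "Wyoming[edit]\n",
    "Laramie (University of Wyoming)[5]\n",
]
cleaned = [clean_name(line) for line in sample]
print(cleaned)
```

Calling strip() on the captured group is a design choice: the capture otherwise keeps the trailing space before "(" and, for lines without any bracket, the newline.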

Use this RegEx instead:
\(.*\)|\[.*\]
Like so:
re.sub(r'\(.*\)|\[.*\]', '', i)
This will replace anything in parentheses (\(.*\)) or (|) anything in square brackets (\[.*\]) with an empty string.
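The greedy .* is fine for the sample data, where each line has at most one group of each kind. A hedged variant, assuming you might also meet several bracketed groups on one line, uses negated character classes and trims leftover whitespace. The names BRACKETED and strip_bracketed are hypothetical.

```python
import re

# Negated character classes stop at the first closing bracket, so two
# separate groups on one line are removed independently of each other.
BRACKETED = re.compile(r'\s*(?:\([^)]*\)|\[[^\]]*\])')

def strip_bracketed(line):
    """Remove every (...) and [...] group plus the whitespace in front of it."""
    return BRACKETED.sub('', line).strip()

lines = [
    "River Falls (University of Wisconsin–River Falls)[2]",
    "Waukesha (Carroll University)",
    "Wyoming[edit]",
]
result = [strip_bracketed(l) for l in lines]
print(result)
```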

If you are after a vectorised solution, which is often faster and more readable than a loop, try this:
Data
import pandas as pd
df=pd.DataFrame({'text':['Platteville (University of Wisconsin–Platteville)[2]','River Falls (University of Wisconsin–River Falls)[2]','Stevens Point (University of Wisconsin–Stevens Point)[2]','Waukesha (Carroll University)','Whitewater (University of Wisconsin–Whitewater)[2]','Wyoming[edit]','Wyoming[edit]']})
Regex extract
df['name']=df.text.str.extract(r'([A-Za-z\s+]+(?=\(|\[))')
Regex Breakdown
[A-Za-z\s+]+ captures a run of uppercase letters, lowercase letters, whitespace, or literal + characters
(?=\(|\[) is a lookahead requiring that the run is immediately followed by the special character ( or the special character [

Related

How to extract all comma delimited numbers inside () brackets and ignore any text

I am trying to extract the comma-delimited numbers inside () brackets from a string. I can get the numbers if they are alone on a line, but I can't seem to find a solution for getting the numbers when other surrounding text is involved. Any help will be appreciated. Below is the code that I currently use in Python.
line = """
Abuta has a history of use in the preparation of curares, an arrow poison to cause asphyxiation in hunting
It has also been used in traditional South American and Indian Ayurvedic medicines (101065,101066,101067)
The genus name Cissampelos is derived from the Greek words for ivy and vine (101065)
"""
line = each.strip()
regex_criteria = r'"^([1-9][0-9]*|\([1-9][0-9]*\}|\(([1-9][0-9]*,?)+[1-9][0-9]*\))$"gm'
if (line.__contains__('(') and line.__contains__(')') and not re.search('[a-zA-Z]', refline)):
refline = line[line.find('(')+1:line.find(')')]
if not re.search('[a-zA-Z]', refline):
The ^ and $ anchors are what is preventing you from getting all the numbers, so remove them. The gm flags also won't work in Python's re module.
You can change your regex to: ([1-9][0-9]*|\([1-9][0-9]*\}|\(?:([1-9][0-9]*,?)+[1-9][0-9]*\)) if you want to get each number separately.
Or you can simplify your pattern to (?<=[(,])[1-9][0-9]+(?=[,)])
Test regex here: https://regex101.com/r/RlGwve/1
Python code:
import re
line = """
Abuta has a history of use in the preparation of curares, an arrow poison to cause asphyxiation in hunting
It has also been used in traditional South American and Indian Ayurvedic medicines (101065,101066,101067)
The genus name Cissampelos is derived from the Greek words for ivy and vine (101065)
"""
print(re.findall(r'(?<=[(,])[1-9][0-9]+(?=[,)])', line))
# ['101065', '101066', '101067', '101065']
(?<=[(,])[1-9][0-9]+(?=[,)])
The above pattern matches numbers that begin with a digit 1-9 followed by one or more further digits, but only when the number is preceded by ( or , and followed by , or ).
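One consequence worth seeing in a small sketch: because the pattern is [1-9][0-9]+, a single-digit reference such as (7) is skipped. The sample text below is made up for illustration, and the second pattern with * in place of + is a suggested variant, not part of the original answer.

```python
import re

# Hypothetical sample text containing a single-digit reference.
text = "medicines (101065,101066,101067) and vine (101065), see also (7)"

# Pattern from the answer: needs at least two digits per number.
two_plus = re.findall(r'(?<=[(,])[1-9][0-9]+(?=[,)])', text)

# Variant with * instead of +, which also accepts single-digit numbers.
one_plus = re.findall(r'(?<=[(,])[1-9][0-9]*(?=[,)])', text)

print(two_plus)
print(one_plus)
```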
Here's another option:
pattern = re.compile(r"(?<=\()[1-9]+\d*(?:,[1-9]\d*)*(?=\))")
results = [match[0].split(",") for match in pattern.finditer(line)]
(?<=\(): Lookbehind for (
[1-9]+\d*: A number that starts with a non-zero digit (\d+ would work too if leading zeros are acceptable)
(?:,[1-9]\d*)*: Zero or multiple numbers after a ,
(?=\)): Lookahead for )
Result for your line:
[['101065', '101066', '101067'], ['101065']]
If you only want the comma separated numbers:
pattern = re.compile(r"(?<=\()[1-9]+\d*(?:,[1-9]\d*)+(?=\))")
results = [match[0].split(",") for match in pattern.finditer(line)]
(?:,[1-9]\d*)+: One or more numbers after a ,
Result:
[['101065', '101066', '101067']]
Now, if your line could also look like
line = """
Abuta has a history of use in the preparation of curares, an arrow poison to cause asphyxiation in hunting
It has also been used in traditional South American and Indian Ayurvedic medicines ( 101065,101066, 101067 )
The genus name Cissampelos is derived from the Greek words for ivy and vine (101065)
"""
then you have to sprinkle the pattern with \s* and remove the whitespace afterwards (here with str.translate and str.maketrans):
pattern = re.compile(r"(?<=\()\s*[1-9]+\d*(?:\s*,\s*[1-9]\d*\s*)*(?=\))")
table = str.maketrans("", "", " ")
results = [match[0].translate(table).split(",") for match in pattern.finditer(line)]
Result:
[['101065', '101066', '101067'], ['101065']]
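An alternative sketch, assuming only the standard re module: match each parenthesised group with a capture group instead of lookarounds, then pull the digits out of the captured text, which makes the whitespace clean-up unnecessary. GROUP_RE and reference_groups are hypothetical names, and the third sample line is added here only to show that parentheses containing words are skipped.

```python
import re

# A parenthesised group that contains only digits, commas and whitespace.
GROUP_RE = re.compile(r'\(\s*(\d+(?:\s*,\s*\d+)*)\s*\)')

def reference_groups(text):
    """Return one list of number strings per parenthesised group."""
    return [re.findall(r'\d+', m.group(1)) for m in GROUP_RE.finditer(text)]

line = """
It has also been used in medicines ( 101065,101066, 101067 )
The genus name is derived from the Greek words for ivy and vine (101065)
A parenthesis with words (see 101065) is ignored
"""
groups = reference_groups(line)
print(groups)
```

Unlike the patterns above, this sketch uses \d+ and therefore accepts numbers with leading zeros.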
Using the pypi regex module you could also use capture groups:
\((?P<num>\d+)(?:,(?P<num>\d+))*\)
The pattern matches:
\( Match (
(?P<num>\d+) Capture group, match 1+ digits
(?:,(?P<num>\d+))* Optionally repeat matching , and 1+ digits in a capture group
\) Match )
Example code
import regex
pattern = r"\((?P<num>\d+)(?:,(?P<num>\d+))*\)"
line = """
Abuta has a history of use in the preparation of curares, an arrow poison to cause asphyxiation in hunting
It has also been used in traditional South American and Indian Ayurvedic medicines (101065,101066,101067)
The genus name Cissampelos is derived from the Greek words for ivy and vine (101065)
"""
matches = regex.finditer(pattern, line)
for m in matches:
    print(m.capturesdict())
Output
{'num': ['101065', '101066', '101067']}
{'num': ['101065']}

remove any occurrence of a "(...)" from a long string, where "..." could be anything

I am currently using the following code to try and remove the characters from the string but nothing is being removed. I think it has to do with the fact that the sequence of characters I am trying to remove is between parentheses. So for example, in the following string, "-McQuay International, 13600 Industrial Park Blvd, (p) ", I would want to remove "(p)"
import re
regexp = " \(*\) "
text = re.sub(regexp, "", "-McQuay International, 13600 Industrial Park Blvd, (p)")
I would use the following regex replacement:
inp = "-McQuay International, 13600 Industrial Park Blvd, (p)"
output = re.sub(r'\s*\(.*?\)\s*', ' ', inp).strip()
print(output) # -McQuay International, 13600 Industrial Park Blvd,
First, you should use a lazy dot (.*?) when matching parentheses. This avoids matching across multiple sets of parentheses, should your text contain them. Second, I use a replacement that also removes unwanted whitespace. The call to strip() removes any leading/trailing whitespace that might be left over.
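The difference between the greedy and the lazy dot only shows up when a string has more than one set of parentheses. A minimal sketch, using a made-up input string:

```python
import re

# Hypothetical input with two separate parenthesised markers.
text = "Acme Corp (p) 555-0100, Example Ltd (f) 555-0199"

# Greedy: .* runs to the LAST ")" and swallows everything in between.
greedy = re.sub(r'\s*\(.*\)\s*', ' ', text).strip()

# Lazy: .*? stops at the FIRST ")" after each "(".
lazy = re.sub(r'\s*\(.*?\)\s*', ' ', text).strip()

print(greedy)
print(lazy)
```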
Try this (note that it removes only the parenthesis characters themselves, not the text between them):
regexp = r"\(|\)"

Wildcard matching with looping re

I'm trying to improve the matching expression of this code so that it matches spaces before or after the string and also ignores the case. The goal is to output the shortened state abbreviation.
import re
s = "new South Wales "
for r in (("New South Wales", "NSW"), ("Victoria", "VIC"), ("Queensland", "QLD"), ("South Australia", "SA"), ("Western Australia", "WA"), ("Northern Territory", "NT"), ("Tasmania", "TAS"), ("Australian Capital Territory", "ACT")):
    s = s.replace(*r)
output = {'state': s}
print (output)
I've figured out the regex to do this (see here):
(?i)(?<!\S)New South Wales(?!\S)
which will match with or without spaces on either side of string and also ignores case. Can anyone help me update my original code to include the new regex?
If I were you, I would just strip() the string before passing it in and use re.sub(), which can be told to ignore case with flags=re.IGNORECASE, as below.
import re
s = " new South Wales ".strip()
for r in (("New South Wales", "NSW"), ("Victoria", "VIC"), ("Queensland", "QLD"), ("South Australia", "SA"), ("Western Australia", "WA"), ("Northern Territory", "NT"), ("Tasmania", "TAS"), ("Australian Capital Territory", "ACT")):
    _regex = '{0}|{1}'.format(r[0], r[1])
    if re.match(_regex, s, flags=re.IGNORECASE):
        subbed_string = re.sub(r[0], r[1], s, flags=re.IGNORECASE)
        print({'state': subbed_string.upper()})
Additionally, I have added a check for a match before trying to substitute the value. Otherwise you could output the wrong result. For example, for the pair ('Tasmania', 'TAS'):
{'state': 'new South Wales'}
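For comparison, here is a sketch that plugs the lookaround regex from the question straight into re.sub, assuming the state names are kept in a dict. The names STATES and abbreviate are hypothetical.

```python
import re

STATES = {
    "New South Wales": "NSW",
    "Victoria": "VIC",
    "Queensland": "QLD",
    "South Australia": "SA",
    "Western Australia": "WA",
    "Northern Territory": "NT",
    "Tasmania": "TAS",
    "Australian Capital Territory": "ACT",
}

def abbreviate(text):
    """Replace any full state name in text with its abbreviation."""
    for name, abbr in STATES.items():
        # (?<!\S) and (?!\S) require whitespace or a string boundary on each side.
        pattern = r'(?<!\S)' + re.escape(name) + r'(?!\S)'
        text = re.sub(pattern, abbr, text, flags=re.IGNORECASE)
    return text.strip()

print({'state': abbreviate(" new South Wales ")})
```

Because re.sub leaves the string unchanged when nothing matches, no separate match check is needed in this form.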

Regex to extract name from list

I am working with a text file (620KB) that has a list of ID#s followed by full names separated by a comma.
The working regex I've used for this is
^([A-Z]{3}\d+)\s+([^,\s]+)
I want to also capture the first name and middle initial (space delimiter between first and MI).
I tried this by doing:
^([A-Z]{3}\d+)\s+([^,\s]+([\D])+)
Which works, but I want to remove the line break that ends up in the output file. I will be importing the two output files into a database (possibly Access) and I don't want to capture the line breaks. Also, is there a better way of writing the regex?
Full code:
import re
source = open('source.txt')
ticket_list = open('ticket_list.txt', 'w')
id_list = open('id_list.txt', 'w')
for lines in source:
    m = re.search('^([A-Z]{3}\d+)\s+([^\s]+([\D+])+)', lines)
    if m:
        x = m.group()
        print('Ticket: ' + x)
        ticket_list.write(x + "\n")
ticket_list = open('First.txt', 'r')
for lines in ticket_list:
    y = re.search('^(\d+)\s+([^\s]+([\D+])+)', lines)
    if y:
        z = y.group()
        print ('ID: ' + z)
        id_list.write(z + "\n")
source.close()
ticket_list.close()
id_list.close()
Sample Data:
Source:
ABC1000033830 SMITH, Z
100000012 Davis, Franl R
200000655 Gest, Baalio
DEF4528942681 PACO, BETH
300000233 Theo, David Alex
400000012 Torres, Francisco B.
ABC1200045682 Mo, AHMED
DEF1000006753 LUGO, G TO
ABC1200123123 de la Rosa, Maria E.
Depending on what kind of linebreak you're dealing with, a simple positive lookahead may stop your pattern from capturing the linebreak in the result. This was generated by RegexBuddy 4.2.0 and worked with all your test data.
if re.search(r"^([A-Z]{3}\d+)\s+([^,\s]+([\D])+)(?=$)", subject, re.IGNORECASE | re.MULTILINE):
    pass  # Successful match
else:
    pass  # Match attempt failed
Basically, the positive lookahead asserts that the end of a line follows directly after the pattern. The lookahead is checked but not consumed, so the line break itself is left out of the match.
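If the file is read line by line as in the question, another option is a pattern that cannot consume the line break in the first place. The following is a minimal sketch under that assumption; TICKET_RE and the in-memory source_lines list (taken from the sample data) are illustrative names, not part of the answer.

```python
import re

# ID (three letters + digits), whitespace, then the rest of the line.
# "." never matches a newline by default, so the line break is not captured.
TICKET_RE = re.compile(r'^([A-Z]{3}\d+)\s+(\S.*?)\s*$')

source_lines = [
    "ABC1000033830 SMITH, Z\n",
    "100000012 Davis, Franl R\n",
    "ABC1200123123 de la Rosa, Maria E.\n",
]

tickets = []
for line in source_lines:
    m = TICKET_RE.match(line)
    if m:
        # group(1) is the ticket ID, group(2) the full name without line break.
        tickets.append((m.group(1), m.group(2)))
print(tickets)
```

Keeping the ID and the name in separate groups may also make the later database import simpler, since each field is already isolated.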

Python regex findall

I am trying to extract all occurrences of tagged words from a string using regex in Python 2.7.2. Or simply, I want to extract every piece of text inside the [p][/p] tags.
Here is my attempt:
regex = ur"[\u005B1P\u005D.+?\u005B\u002FP\u005D]+?"
line = "President [P] Barack Obama [/P] met Microsoft founder [P] Bill Gates [/P], yesterday."
person = re.findall(pattern, line)
Printing person produces ['President [P]', '[/P]', '[P] Bill Gates [/P]']
What is the correct regex to get: ['[P] Barack Obama [/P]', '[P] Bill Gates [/P]']
or ['Barack Obama', 'Bill Gates'].
import re
regex = ur"\[P\] (.+?) \[/P\]+?"
line = "President [P] Barack Obama [/P] met Microsoft founder [P] Bill Gates [/P], yesterday."
person = re.findall(regex, line)
print(person)
yields
['Barack Obama', 'Bill Gates']
The regex ur"[\u005B1P\u005D.+?\u005B\u002FP\u005D]+?" is exactly the same unicode as u'[[1P].+?[/P]]+?' except harder to read.
The first bracketed group [[1P] tells re that any of the characters in the list ['[', '1', 'P'] should match, and similarly with the second bracketed group [/P]]. That's not what you want at all. So,
Remove the outer enclosing square brackets. (Also remove the stray 1 in front of P.)
To protect the literal brackets in [P], escape the brackets with a backslash: \[P\].
To return only the words inside the tags, place grouping parentheses around .+?.
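The three fixes above can be sketched in Python 3 syntax as follows (the ur"" prefix used in the answer exists only in Python 2; a plain raw string is assumed here). The variable names are illustrative.

```python
import re

line = "President [P] Barack Obama [/P] met Microsoft founder [P] Bill Gates [/P], yesterday."

# Escaped brackets are literal; the group returns only the text between the tags.
names = re.findall(r'\[P\]\s*(.+?)\s*\[/P\]', line)

# Without a capture group, findall returns the whole match, tags included.
tagged = re.findall(r'\[P\].+?\[/P\]', line)

print(names)
print(tagged)
```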
Try this:
for match in re.finditer(r"\[P[^\]]*\](.*?)\[/P\]", subject):
    # match start: match.start()
    # match end (exclusive): match.end()
    # matched text: match.group()
    pass
Your question is not 100% clear, but I'm assuming you want to find every piece of text inside [P][/P] tags:
>>> import re
>>> line = "President [P] Barack Obama [/P] met Microsoft founder [P] Bill Gates [/P], yesterday."
>>> re.findall(r'\[P\]\s?(.+?)\s?\[\/P\]', line)
['Barack Obama', 'Bill Gates']
you can replace your pattern with
regex = ur"\[P\]([\w\s]+)\[\/P\]"
Use this pattern,
pattern = '\[P\].+?\[\/P\]'
