Regex similar to a Hearst Pattern in Python - python

I'm trying to come up with a regex similar to the ones listed here for Hearst Patterns that works on the following sentences:
NP_The_Eleventh_Air_Force is NP_a_Numbered_Air_Force of NP_the_United_States_Air_Force_Pacific_Air_Forces (NP_PACAF).
NP_The_Eleventh_Air_Force (NP_11_AF) is NP_a_Numbered_Air_Force of NP_the_United_States_Air_Force_Pacific_Air_Forces (NP_PACAF).
Running re.search(regex, sentence) on each of these sentences, I want to match these 2 groups: NP_The_Eleventh_Air_Force and NP_a_Numbered_Air_Force
This is my attempt but it doesn't get any matches:
(NP_\\w+ (, )?is (NP_\\w+ ?))

In both sentences the (, )? part is not present, but in the second sentence a part between parentheses precedes is, so you could make that part optional instead.
Also move the closing parenthesis from the final )) to directly after the first NP_\w+, giving (NP_\w+), so that it forms the first group.
The pattern including the optional comma and space could be:
(NP_\w+)(?: \([^()]+\))? (?:, )?is (NP_\w+ ?)
Regex demo
If you don't need the space at the end and the comma space is not present, your pattern could be:
(NP_\w+)(?: \([^()]+\))? is (NP_\w+)
(NP_\w+) Capture group 1 Match NP_ and 1+ word chars
(?: \([^()]+\))? Optionally match a space and a part with parenthesis
is Match literally
(NP_\w+) Capture group 2 Match NP_ and 1+ word chars
See a regex demo | Python demo
For example
import re
regex = r"(NP_\w+)(?: \([^()]+\))? is (NP_\w+)"
test_str = "NP_The_Eleventh_Air_Force is NP_a_Numbered_Air_Force of NP_the_United_States_Air_Force_Pacific_Air_Forces (NP_PACAF)."
matches = re.search(regex, test_str)
if matches:
    print(matches.group(1))
    print(matches.group(2))
Output
NP_The_Eleventh_Air_Force
NP_a_Numbered_Air_Force
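As a quick check, the same pattern can be run over both sentences from the question, including the one with the (NP_11_AF) abbreviation. This is only a sketch; the names HEARST_IS_A and extract_pair are made up for illustration and are not part of any library.

```python
import re

# Pattern from the answer above: group 1 is the NP before "is",
# group 2 is the NP right after "is". The optional non-capturing
# group skips a parenthesised abbreviation such as " (NP_11_AF)".
HEARST_IS_A = re.compile(r"(NP_\w+)(?: \([^()]+\))? is (NP_\w+)")

sentences = [
    "NP_The_Eleventh_Air_Force is NP_a_Numbered_Air_Force of "
    "NP_the_United_States_Air_Force_Pacific_Air_Forces (NP_PACAF).",
    "NP_The_Eleventh_Air_Force (NP_11_AF) is NP_a_Numbered_Air_Force of "
    "NP_the_United_States_Air_Force_Pacific_Air_Forces (NP_PACAF).",
]


def extract_pair(sentence):
    """Return (group 1, group 2) for the first match, or None."""
    m = HEARST_IS_A.search(sentence)
    return m.groups() if m else None


pairs = [extract_pair(s) for s in sentences]
for pair in pairs:
    print(pair)
```

Both sentences should give the same pair, because the parenthesised part is consumed by the optional group rather than captured.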

I got one, quite simple:
regex = r"NP.\w+ ?Forces?\b"
You can see how it works here; regex101 is an online tool for writing and testing regexes in multiple languages:
https://regex101.com/r/KKH3D3/1/

Related

How to split a string with parentheses and spaces into a list

I want to split strings like:
(so) what (are you trying to say)
what (do you mean)
Into lists like:
[(so), what, (are you trying to say)]
[what, (do you mean)]
The code that I tried is below. On the site regexr, the regular expression matches the parts that I want but gives a warning. I'm not an expert in regex, so I don't know what I'm doing wrong.
import re
string = "(so) what (are you trying to say)?"
rx = re.compile(r"((\([\w \w]*\)|[\w]*))")
print(re.split(rx, string ))
Using [\w \w]* is the same as [\w ]* and also matches an empty string.
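A small sketch to check that claim. The sample strings and variable names are arbitrary and not taken from the question.

```python
import re

verbose = re.compile(r"[\w \w]*")  # character class as written in the question
compact = re.compile(r"[\w ]*")    # equivalent class without the repeated \w

samples = ["are you trying to say", "so", "", "(so)"]

# Both classes accept or reject exactly the same strings.
all_same = all(
    bool(verbose.fullmatch(s)) == bool(compact.fullmatch(s)) for s in samples
)

# Because of the * quantifier, the empty string is a valid match too.
empty_ok = compact.fullmatch("") is not None

print(all_same, empty_ok)
```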
Instead of using split, you can use re.findall without any capture groups and write the pattern like:
\(\w+(?:[^\S\n]+\w+)*\)|\w+
\( Match (
\w+ Match 1+ word chars
(?:[^\S\n]+\w+)* Optionally repeat matching spaces and 1+ word chars
\) Match )
| Or
\w+ Match 1+ word chars
Regex demo
import re
string = "(so) what (are you trying to say)? what (do you mean)"
rx = re.compile(r"\(\w+(?:[^\S\n]+\w+)*\)|\w+")
print(re.findall(rx, string))
Output
['(so)', 'what', '(are you trying to say)', 'what', '(do you mean)']
For your two examples you can write:
re.split(r'(?<=\)) +| +(?=\()', str)
However, this does not work for the string defined in the OP's code, which ends with a question mark, unlike the two examples given in the statement of the question.
The regular expression can be broken down as follows.
(?<=\)) # positive lookbehind asserts that location in the
# string is preceded by ')'
[ ]+ # match one or more spaces
| # or
[ ]+ # match one or more spaces
(?=\() # positive lookahead asserts that location in the
# string is followed by '('
In the above I've put each of the two space characters in a character class merely to make it visible.
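To see the lookaround split in action, here is a minimal sketch run on the two examples from the question and on the string with the trailing question mark. The name SPLIT_RX is just an illustrative choice.

```python
import re

# Split on spaces that follow ')' or that precede '('.
SPLIT_RX = re.compile(r"(?<=\)) +| +(?=\()")

examples = [
    "(so) what (are you trying to say)",
    "what (do you mean)",
]
results = [SPLIT_RX.split(s) for s in examples]
for parts in results:
    print(parts)

# The question mark from the OP's code stays glued to the last item,
# because nothing in the pattern removes it.
with_mark = SPLIT_RX.split("(so) what (are you trying to say)?")
print(with_mark)
```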

Where is such a regex wrong?

I am using python.
The pattern is:
re.compile(r'^(.+?)-?.*?\(.+?\)')
The text looks like:
text1 = 'TVTP-S2(xxxx123123)'
text2 = 'TVTP(xxxx123123)'
I expect to get TVTP
Another option to match those formats is:
^([^-()]+)(?:-[^()]*)?\([^()]*\)
Explanation
^ Start of string
([^-()]+) Capture group 1, match 1+ times any character other than - ( and )
(?:-[^()]*)? As the - is excluded from the first part, optionally match - followed by any char other than ( and )
\([^()]*\) Match from ( till ) without matching any parenthesis between them
Regex demo | Python demo
Example
import re
regex = r"^([^-()]+)(?:-[^()]*)?\([^()]*\)"
s = ("TVTP-S2(xxxx123123)\n"
"TVTP(xxxx123123)\n")
print(re.findall(regex, s, re.MULTILINE))
Output
['TVTP', 'TVTP']
This regex works:
pattern = r'^([^-]+).*\(.+?\)'
>>> re.findall(pattern, 'TVTP-S2(xxxx123123)')
['TVTP']
>>> re.findall(pattern, 'TVTP(xxxx123123)')
['TVTP']
A quick answer would be:
^(\w+)(-.*?)?\((.*?)\)$
https://regex101.com/r/wL4jKe/2/
It is because the first plus is lazy, and the subsequent dash is optional, followed by a pattern that allows any character.
This allows the regex engine to choose the single letter T for the first group (because it is lazy), choose to interpret the dash as just not being there, which is allowed because it is followed by a question mark, and then have the next .* match "VTP-S2".
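The behaviour described above can be reproduced with a short sketch that uses the pattern and sample texts from the question (the variable names are illustrative only).

```python
import re

# Pattern from the question: lazy first group, optional dash, lazy filler.
original = re.compile(r'^(.+?)-?.*?\(.+?\)')

texts = ['TVTP-S2(xxxx123123)', 'TVTP(xxxx123123)']

# The lazy group settles for a single character; the rest of the prefix
# is absorbed by the .*? that follows the optional dash.
captured = [original.search(t).group(1) for t in texts]
whole = [original.search(t).group(0) for t in texts]

print(captured)
print(whole)
```

The overall match still covers the whole string; it is only the split between the first group and the filler that is not what was intended.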
You can just grab non-dashes to capture, followed by nonparentheses up to the parentheses.
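import re
# The capture group is greedy and excludes both '-' and '(' so it cannot match the empty string.
p = re.compile(r'^([^-(]+)[^(]*\(.+?\)')
p.search('TVTP-S2(xxxx123123) blah()').group(1)  # 'TVTP'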
The nonparentheses part prevents the second portion from matching 'S2(xxxx123123) blah(' in my modified example above.

Regex to exclude optional words and return as list

I am trying to extract the name and profession as a list of tuples from the below string using regex.
Input string
text = "Mr John,Carpenter,Mrs Liza,amazing painter"
As you can see, the first item is the name, followed by the profession, and this repeats in a comma-separated fashion. The problem is that I want to get rid of the adjectives that come along with the profession, e.g. "amazing" in the example above.
Expected output
[('Mr John', 'Carpenter'), ('Mrs Liza', 'painter')]
I stripped out the adjective from the text using "replace" and used the below code using "regex" to get the output. But I am looking for a single regex function to avoid running the string replace. I figured that this has something to do with look ahead in regex but couldn't make it work. Any help would be appreciated.
text = text.replace("amazing ", "")  # str.replace returns a new string
txt_new = re.findall(r"([\w\s]+),([\w\s]+)", text)
If you only want to use word and whitespace characters, this could be another option:
(\w+(?:\s+\w+)*)\s*,\s*(?:\w+\s+)*(\w+)
Explanation
( Capture group 1
\w+(?:\s+\w+)* Match 1+ word chars and optionally repeat 1+ whitespace chars and 1+ word chars
) Close group 1
\s*,\s* Match a comma between optional whitespace chars
(?:\w+\s+)* Optionally repeat 1+ word and 1+ whitespace chars
(\w+) Capture group 2, match 1+ word chars
Regex demo | Python demo
import re
regex = r"(\w+(?:\s+\w+)*)\s*,\s*(?:\w+\s+)*(\w+)"
s = ("Mr John,Carpenter,Mrs Liza,amazing painter")
print(re.findall(regex, s))
Output
[('Mr John', 'Carpenter'), ('Mrs Liza', 'painter')]
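As a further check, the same pattern can be tried on input with extra whitespace and more than one adjective. The input string below is invented for illustration and does not come from the question.

```python
import re

regex = re.compile(r"(\w+(?:\s+\w+)*)\s*,\s*(?:\w+\s+)*(\w+)")

# Hypothetical input: spaces around the comma and two adjectives.
s = "Dr Ann Lee , very skilled surgeon,Mr Bob,baker"

pairs = regex.findall(s)
print(pairs)
```

Only the last word of each profession part is captured, since every word followed by whitespace is consumed by the non-capturing group.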
Here is one regex approach using re.findall:
text = "Mr John,Carpenter,Mrs Liza,amazing painter"
matches = re.findall(r'\s*([^,]+?)\s*,\s*.*?(\S+)\s*(?![^,])', text)
print(matches)
This prints:
[('Mr John', 'Carpenter'), ('Mrs Liza', 'painter')]
Here is an explanation of the regex pattern:
\s* match optional whitespace
([^,]+?) match the name
\s* optional whitespace
, first comma
\s* optional whitespace
.*? consume all content up until
(\S+) the last profession word
\s* optional whitespace
(?![^,]) assert that what follows is either comma or the end of the input
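The final lookahead is the least obvious part, so here it is in isolation. This is a reduced sketch with an invented sample string, not the full pattern from the answer.

```python
import re

# (?![^,]) fails whenever the next character is anything other than a
# comma, so it only succeeds before a comma or at the end of the input.
boundary = re.compile(r"\w+(?![^,])")

words = boundary.findall("amazing painter,baker")
print(words)
```

The word "amazing" is skipped because it is followed by a space, while the words in front of the comma and at the end of the string are kept.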

Python Regex if there is a brace in string

the co[njuring](media_title)
I want a regex to detect whether a pattern like the one above exists.
Currently I have two regexes that turn
line = 'Can I please eat at[ warunk upnormal](restaurant_name)'
line = re.sub(r'\[\s*(.*?)\s*\]', r'[\1]', line)
line = re.sub(r'(\w)\[', r'\1 [', line)
into
Can I please eat at [warunk upnormal](restaurant_name)
Notice how the spaces inside the brackets are removed, which is good, and a space is inserted between a word character and the bracket, e.g. x[ becomes x [.
What I want is to change the above two regexes so that they do not perform the change if there is a sentence like this:
the co[njuring](media_title)
the co[njuring](media_title) and che[ese dog]s(food)
Notice how there is a brace in there. Basically, I want to know how can I improve these regexes to take this into account.
line = re.sub('\[\s*(.*?)\s*\]', r'[\1]', line)
line = re.sub(r'(\w)\[', r'\1 [', line)
For the 2 patterns that you use, you could also use a single pattern with 2 capturing groups.
(\w)\[\s*(.*?)\s*\]
Regex demo and a Python demo
In the replacement use the 2 capturing groups \1 [\2]
Example code
line = re.sub(r'(\w)\[\s*(.*?)\s*\]', r'\1 [\2]', line)
The difference that I see in the given format is that there is an underscore present between the parentheses in (restaurant_name) and (media_title), but not in (food).
If that is the case, you can use a third capturing group, matching the value in parentheses with at least a single underscore present, not at the start and not at the end.
(\w)\[\s*(.*?)\s*\](\([^_\s()]+(?:_[^_\s()]+)+\))
Explanation
(\w) Capture group 1, match a word char
\[\s* Match [ and 0+ whitespace chars
(.*?) Capture group 2, match any char except a newline non greedy
\s*\] Match 0+ whitespace chars and ]
( Capture group 3
\( Match (
[^_\s()]+ Match 1+ times any char except an underscore, whitespace char or parenthesis
(?:_[^_\s()]+)+ Repeat 1+ times the previous pattern with an underscore prepended
\) Match )
) Close group
In the replacement use the 3 capturing groups \1 [\2]\3
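The third group carries the underscore condition, so it may help to test that sub-pattern on its own. This is a sketch; the candidate list is made up apart from the three labels that appear in the question.

```python
import re

# Sub-pattern of group 3: parentheses containing at least one underscore
# that is neither the first nor the last character inside them.
LABEL = re.compile(r"\([^_\s()]+(?:_[^_\s()]+)+\)")

candidates = [
    "(restaurant_name)",
    "(media_title)",
    "(food)",
    "(_food)",
    "(food_)",
    "(a_b_c)",
]

accepted = [c for c in candidates if LABEL.fullmatch(c)]
print(accepted)
```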
Regex demo and a Python demo
Example code
import re
regex = r"(\w)\[\s*(.*?)\s*\](\([^_\s()]+(?:_[^_\s()]+)+\))"
test_str = ("Can I please eat at[ warunk upnormal](restaurant_name)\n"
"Can I please eat at[ warunk upnormal ](restaurant_name)\n"
"the co[njuring](media_title)\n"
"the co[njuring](media_title) and che[ese dog]s(food)")
result = re.sub(regex, r"\1 [\2]\3", test_str)
if result:
    print(result)
Output
Can I please eat at [warunk upnormal](restaurant_name)
Can I please eat at [warunk upnormal](restaurant_name)
the co [njuring](media_title)
the co [njuring](media_title) and che[ese dog]s(food)

Searching for a pattern in a sentence with regex in python

I want to capture the digits that follow a certain phrase and also the start and end index of the number of interest.
Here is an example:
text = The special code is 034567 in this particular case and not 98675
In this example, I am interested in capturing the number 034567, which comes after the phrase special code, and also the start and end index of the number 034567.
My code is:
p = re.compile('special code \s\w.\s (\d+)')
re.search(p, text)
But this does not match anything. Could you explain why and how I should correct it?
Your expression matches a literal space followed by any whitespace (\s), so it requires two whitespace characters after code. Then \w. matches any word char followed by any character other than a line break char, and then \s followed by a literal space again requires two whitespace characters.
You may simply match any 1+ whitespaces using \s+ between words, and to match any chunk of non-whitespaces, instead of \w., you may use \S+.
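To make the diagnosis concrete, the sketch below runs the original pattern against the text from the question and against a variant with doubled spaces. The doubled-space string is constructed only for this illustration.

```python
import re

# Pattern from the question, unchanged.
original = re.compile(r"special code \s\w.\s (\d+)")

text = "The special code is 034567 in this particular case and not 98675"

# Hypothetical variant with two spaces on each side of "is".
gap = " " * 2
doubled = "The special code" + gap + "is" + gap + "034567 in this particular case"

no_match = original.search(text)  # single spaces cannot satisfy " \s" and "\s "
m = original.search(doubled)      # doubled spaces satisfy both pairs

print(no_match)
print(m.group(1))
```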
Use
import re
text = 'The special code is 034567 in this particular case and not 98675'
p = re.compile(r'special code\s+\S+\s+(\d+)')
m = p.search(text)
if m:
    print(m.group(1)) # 034567
    print(m.span(1)) # (20, 26)
See the Python demo and the regex demo.
Use re.findall with a capture group:
text = "The special code is 034567 in this particular case and not 98675"
matches = re.findall(r'\bspecial code (?:\S+\s+)?(\d+)', text)
print(matches)
This prints:
['034567']
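re.findall only returns the captured strings, so if the start and end index are needed as well, re.finditer is one option because it yields match objects. A minimal sketch, with illustrative variable names:

```python
import re

text = "The special code is 034567 in this particular case and not 98675"
pattern = re.compile(r"\bspecial code\s+\S+\s+(\d+)")

# Each match object exposes the captured text and its position in the string.
hits = [(m.group(1), m.start(1), m.end(1)) for m in pattern.finditer(text)]
print(hits)

# The indices slice the original string back to the captured digits.
number, start, end = hits[0]
print(text[start:end] == number)
```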
