Regex search multiple suffixes - python

I have a big list of target words I am searching for:
words = ['Word1', 'Word2', 'Word3']
I've been told that a regular expression of this sort:
suffix = re.compile('(?:{words}) (\\w+)'.format(words='|'.join(words)))
is pretty efficient, since the match attempt fails as soon as a character that cannot match the expression is encountered.
However, the other way around is not efficient:
prefix = re.compile('(\\w+) (?:{words})'.format(words='|'.join(words)))
Is there an elegant way to instruct Python's regex engine to do the search in reverse?
Edit
I've been asked to add example usages:
# prefix search
title = re.compile('(?:Mr\\.|Mrs\\.|Ms\\.|Dr\\.|Lt\\.) (\\w+)')
# suffix search
company = re.compile('(\\w+) (?:Inc\\.|LLP\\.|ltd\\.|GMBH)')
# invoking the regex
all_people_names = title.findall(document)
all_company_names = company.findall(document)
Edit 2
A lot of people were skeptical about the significance of the timing differences.
I've implemented the two methods, endswith() and endswith_rev(), the latter reversing the string and the results as kabanus suggested.
These are the results (timing chart not reproduced here):
As you can see, it makes a huge difference, even with a small number of suffixes.

Well, the way you did it, you have to test all the possible prefixes up to the suffix. One way to beat this, worthwhile only if the string is long enough, is to reverse everything, so you get back to your first, efficient form:
prefix = re.compile('(?:{words}) (\\w+)'.format(words='|'.join(word[::-1] for word in words)))
prefix.match(mystring[::-1])
so you are searching from the end and get back the same pattern shape. Remember to reverse the matches, though. I wonder how long the list of words and the string need to be to make this worth it. Apparently this is a major optimization; see the OP's edit for timings.
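For instance, a complete round trip might look like this (a sketch, borrowing the suffix example from the question's edit; the names and sample text are assumptions):
import re

words = ['Inc.', 'LLP.', 'ltd.', 'GMBH']
mystring = 'He worked at Acme ltd. before joining Initech GMBH last year.'

# build the pattern from the reversed words, then scan the reversed string
prefix = re.compile('(?:{words}) (\\w+)'.format(
    words='|'.join(re.escape(word[::-1]) for word in words)))

# reverse each captured match back; matches are reported right-to-left
print([m.group(1)[::-1] for m in prefix.finditer(mystring[::-1])])
# ['Initech', 'Acme']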

Using a regular expression is OK in some cases, or required in others, e.g. when you configure a system that matches patterns and the input type is a regex pattern. But for this simple use case, a regex just wastes CPU cycles.
This use case is simple because you know the position at which you want to match the substrings: they are always at the end of the input, so each suffix either matches the given inputString or not:
inputString[len(inputString) - len(suffix):] == suffix
But of course, you already have the Python method endswith(suffix), so you can test with:
inputString.endswith(suffix)
The suffix argument can be a tuple, though, so you can do the following:
suffixes = ("Inc.", "inc.", "Gmbh", "ltd.", "LTD", "LLP")
inputString.endswith(suffixes)
Or for a case-insensitive search:
suffixes = ("inc.", "gmbh", "ltd.", "llp")
inputString.lower().endswith(suffixes)
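For example (sample names assumed):
suffixes = ("inc.", "gmbh", "ltd.", "llp")
names = ["Acme Inc.", "Chocolate Factory GMBH", "Initech"]
print([n for n in names if n.lower().endswith(suffixes)])
# ['Acme Inc.', 'Chocolate Factory GMBH']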
Anyway, if performance is really important then perhaps Python is not the best language.

Try:
.*\.(?:jpg|gif|png)
It will match:
1.jpg
b.png
c.gif
Test it at https://regex101.com/
The non-capturing group (?:jpg|gif|png) tries each alternative in turn: jpg, gif, and png each match their characters literally (case sensitive).

Related

Regex Statement to only match parts of a string for comparison - Python

What I am trying to do is match values from one file to another, but I only need to match the first portion of the string and the last portion.
I am reading each file into a list, and manipulating these based on different Regex patterns I have created. Everything works, except when it comes to these type of values:
V-1\ZDS\R\EMBO-20-1:24
V-1\ZDS\R\EMBO-20-6:24
In this example, I only want to match 'V-1\ZDS\R\EMBO-20' and then compare the '24' value at the end of the string. The number x in '20-x:' can vary and doesn't matter for the comparison, as long as the first and last parts of the string match.
This is the Regex I am using:
re.compile(r"(?:.*V-1\\ZDS\\R\\EMBO-20-\d.*)(:\d*\w.*)")
Once I filter down the list, I use the following function to return the difference between the two sets:
funcDiff = lambda x, y: list(set(x) - set(y)) + list(set(y) - set(x))
Is there a way to take the list of differences and filter out the ones that have matching values after the ':' as mentioned above?
I apologize if this is an obvious answer; I'm new to Python and regex!
The output I get is the differences between the entire strings, so even if the first and last part of the string match, if the number following the 'EMBO-20-x' doesn't also match, it returns it as being different.
Before discussing your question, regex101 is an incredibly useful tool for this type of thing.
Your problem stems from two issues:
1.) The way you used .*
2.) Greedy vs. Nongreedy matches
.* kinda sucks
.* is a regex expression that is very rarely what you actually want.
As a quick aside, a useful regex expression is [^c]* or [^c]+. These expressions match any character except the letter c, with the first expression matching 0 or more, and the second matching 1 or more.
.* will match all characters as many times as it can. Instead, try to start your regex patterns with more concrete starting points. Two good ways to do this are lookbehind expressions and anchors.
Another quick aside: it's likely that you are mixing up re.match and re.search. match will only return a match that begins at the start of the string, while search will return matches anywhere in the input string. This could be the reason you included the .* in the first place: to allow a .match call to reach a match deeper in the string.
Lookbehind Expressions
There are more complete explanations online, but in short, regex patterns like:
(?<=test)foo
will match the text foo, but only if test is right in front of it. To be more clear, the following strings will not match that regex:
foo
test-foo
test foo
but the following string will match:
testfoo
This will only match the text foo, though.
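A quick runnable check of that claim:
import re

for s in ["foo", "test-foo", "test foo", "testfoo"]:
    print(repr(s), bool(re.search(r"(?<=test)foo", s)))
# only 'testfoo' prints True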
Anchors
Another option is anchors. ^ and $ are special characters, matching the start and end of a line of text. If you know your regex pattern will match exactly one line of text, start it with ^ and end it with $.
Patterns that lead with .* and end with .* are likely the source of your issue. Although you did not include full examples of your input or your code, you likely used match as opposed to search.
In regex, . matches any character, and * means 0 or more times. This means that for any input, your pattern will match the entire string.
Greedy vs. Non-Greedy qualifiers
The second issue is related to greediness. When your regex patterns have a * in them, they can match 0 or more characters. This can hide problems, as entire * expressions can be skipped. Your regex likely matched several lines of text as one match, hiding multiple records in a single .*.
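Here is a small illustration with a made-up two-record string:
import re

s = "V-1:24 V-2:93"                        # two records on one line
print(re.match(r".*:(\d+)", s).group(1))   # '93': greedy .* skips past record one
print(re.match(r".*?:(\d+)", s).group(1))  # '24': non-greedy stops at the first ':'
print(re.findall(r":(\d+)", s))            # ['24', '93']: no leading .* at all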
The Actual Answer
Taking all of this in to consideration, let's assume that your input data looks like this:
V-1\ZDS\R\EMBO-20-1:24
V-1\ZDS\R\EMBO-20-6:24
V-1\ZDS\R\EMBO-20-3:93
V-1\ZDS\R\EMBO-20-6:22309
V-1\ZDS\R\EMBO-20-8:2238
V-1\ZDS\R\EMBO-20-3:28
A better regular expression would be:
^V-1\\ZDS\\R\\EMBO-20-\d:(\d+)$
To visualize this regex in action, paste it into regex101 along with the sample data.
There are several differences I would like to highlight:
Starting the expression with ^ and ending with $. This forces the regex to match exactly one line. Even though the pattern works without these characters, it's good practice when working with regex to be as explicit as possible.
No useless non-capturing group. Your example had a (?:) group at the start. This denotes a group that does not capture its match. It's useful if you want to match a subpattern multiple times ((?:ab){5} matches ababababab without capturing anything). However, in your example, it did nothing :)
Only capturing the number. This makes it easier to extract the value of the capture groups.
No use of *, one use of +. + works like *, but it matches 1 or more. This is often more correct, as it prevents 'skipping' entire characters.
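Putting it all together, here is a sketch (the file contents are assumed) of how the capture group could drive the comparison the question asks about, instead of diffing whole strings:
import re

pattern = re.compile(r"^V-1\\ZDS\\R\\EMBO-20-\d:(\d+)$")

file_a = ["V-1\\ZDS\\R\\EMBO-20-1:24", "V-1\\ZDS\\R\\EMBO-20-3:93"]
file_b = ["V-1\\ZDS\\R\\EMBO-20-6:24", "V-1\\ZDS\\R\\EMBO-20-3:28"]

def trailing_numbers(lines):
    # keep only the number captured after ':'; the varying '20-x' part is ignored
    nums = set()
    for line in lines:
        m = pattern.match(line)
        if m:
            nums.add(m.group(1))
    return nums

print(sorted(trailing_numbers(file_a) ^ trailing_numbers(file_b)))
# ['28', '93']: trailing values that appear in only one file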

How do I build a tokenizing regex based iterator in python

I'm basing this question on an answer I gave to this other SO question, which was my specific attempt at a tokenizing regex based iterator using more_itertools's pairwise iterator recipe.
Following is my code taken from that answer:
from more_itertools import pairwise
import re
string = "dasdha hasud hasuid hsuia dhsuai dhasiu dhaui d"
# split according to the given delimiter including segments beginning at the beginning and ending at the end
for prev, curr in pairwise(re.finditer(r"^|[ ]+|$", string)):
    print(string[prev.end(): curr.start()])  # originally I yield here
I then noticed that if the string starts or ends with delimiters (i.e. string = " dasdha hasud hasuid hsuia dhsuai dhasiu dhaui d ") the tokenizer will print empty strings at the beginning and end of its list of token outputs (these are actually extra matches of the string start and string end). To remedy this, I tried the following (quite ugly) attempts at other regexes:
"(?:^|[ ]|$)+" - this seems quite simple and like it should work but it doesn't (and also seems to behave wildly different on other regex engines) for some reason it wouldn't build a single match from the string's start and the delimiters following it, the string start somehow also consumes the character following it! (this is also where I see divergence from other engines, is this a BUG? or does it have something to do with special non corporeal characters and the or (|) operator in python that I'm not aware of?), this solution also did nothing for the double match containing the string's end, once it matched the delimiters and then gave another match for the string end ($) character itself.
"(?:[ ]|$|^)+" - Putting the delimiters first actually solves one of the problems, the split at the beginning doesn't contain string start (but I don't care too much about that anyway since I'm interested in the tokens themselves), it also matches string start when there are no delimiters at the beginning of the string but the string ending is still a problem.
"(^[ ]*)|([ ]*$)|([ ]+)" - This final attempt got the string start to be part of the first match (which wasn't really that much of a problem in the first place) but try as I might I couldn't get rid of the delimiter + end and then delimiter match problem (which yields an additional empty string), still, I'm showing you this example (with grouping) since it shows that the ending special character $ is matched twice, once with the preceding delimiters and once by itself (2 group 2 matches).
My questions are:
Why do I get such strange behavior in attempt #1?
How do I solve the end-of-string issue?
Am I being a tank, i.e. is there a simple way to solve this that I'm blindly missing?
Remember that the solution can't change the string, and must produce an iterable generator which iterates on the spaces between the tokens and not the tokens themselves. (This last part might seem to complicate the answer unnecessarily, since otherwise I have a simple answer, but if you must know: it's part of a bigger framework I'm building, where this yielding method is inherited by a pipeline which then constructs yielded sentences out of it in various patterns, which are used to extract fields from semi-structured, classifier-driven messages.)
The problems you're having are due to the trickiness and undocumented edge cases of zero-width matches. You can resolve them by using negative lookarounds to explicitly tell Python not to produce a match for ^ or $ if the string has delimiters at the start or end:
import re

delimiter_re = r'[\n\- ]'  # newline, hyphen, or space
search_regex = r'''^(?!{0})   # string start with no delimiter
                 |            # or
                 {0}+         # sequence of delimiters (at least one)
                 |            # or
                 (?<!{0})$    # string end with no delimiter
               '''.format(delimiter_re)
search_pattern = re.compile(search_regex, re.VERBOSE)
Note that this will produce one match in an empty string, not zero, and not separate beginning and ending matches.
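Plugged into the question's pairwise loop, the pattern above could be used like this (a sketch, reusing search_pattern and the question's sample string):
from more_itertools import pairwise

string = " dasdha hasud hasuid hsuia dhsuai dhasiu dhaui d "
for prev, curr in pairwise(search_pattern.finditer(string)):
    print(string[prev.end():curr.start()])
# prints the eight tokens, with no empty strings at either end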
It may be simpler to iterate over non-delimiter sequences and use the resulting matches to locate the string components you want:
token = re.compile(r'[^\n\- ]+')
previous_end = 0
for match in token.finditer(string):
    do_something_with(string[previous_end:match.start()])
    previous_end = match.end()
do_something_with(string[previous_end:])
The extra matches you were getting at the end of the string were because after matching the sequence of delimiters at the end, the regex engine looks for matches at the end again, and finds a zero-width match for $.
The behavior you were getting at the beginning of the string for the ^|... pattern is trickier: the regex engine sees a zero-width match for ^ at the start of the string and emits it, without trying the other | alternatives. After the zero-width match, the engine needs to avoid producing that match again to avoid an infinite loop; this particular engine appears to do that by skipping a character, but the details are undocumented and the source is hard to navigate. (Here's part of the source, if you want to read it.)
The behavior you were getting at the start of the string for the (?:^|...)+ pattern is even trickier. Executing this straightforwardly, the engine would look for a match for (?:^|...) at the start of the string, find ^, then look for another match, find ^ again, then look for another match ad infinitum. There's some undocumented handling that stops it from going on forever, and this handling appears to produce a zero-width match, but I don't know what that handling is.
It sounds like you're just trying to return a list of all the "words" separated by any number of delimiting chars. You could instead just use regex groups and a negated character class ([^...]) to achieve this:
import re

# match any number of consecutive non-delimiter chars
string = " dasdha hasud hasuid hsuia dhsuai dhasiu dhaui d "
delimiters = r'\n\- '
regex = r'([^{0}]+)'.format(delimiters)
for match in re.finditer(regex, string):
    print(match.group(0))
output:
dasdha
hasud
hasuid
hsuia
dhsuai
dhasiu
dhaui
d

efficient way to get words before and after substring in text (python)

I'm using regex to find occurrences of string patterns in a body of text. Once I find that the string pattern occurs, I want to get x words before and after the string as well (x could be as small as 4, but preferably ~10 if still as efficient).
I am currently using regex to find all instances, but occasionally it will hang. Is there a more efficient way to solve this problem?
This is the solution I currently have:
sub = r'(\w*)\W*(\w*)\W*(\w*)\W*(\w*)\W*(%s)\W*(\w*)\W*(\w*)\W*(\w*)\W*(\w*)' % result_string  # re-find string and get surrounding ±4 words
surrounding_text = re.findall(sub, text)
for found_text in surrounding_text:
    result_found.append(" ".join(map(str, found_text)))
I'm not sure if this is what you're looking for:
>>> text = "Hello, world. Regular expressions are not always the answer."
>>> words = text.partition("Regular expressions")
>>> words
('Hello, world. ', 'Regular expressions', ' are not always the answer.')
>>> words_before = words[0]
>>> words_before
'Hello, world. '
>>> separator = words[1]
>>> separator
'Regular expressions'
>>> words_after = words[2]
>>> words_after
' are not always the answer.'
Basically, str.partition() splits the string into a 3-element tuple. In this example, the first element is all of the words before the specific "separator", the second element is the separator, and the third element is all of the words after the separator.
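If you need only x words on each side rather than everything, you can post-process the partition result (a sketch, with x = 4):
text = "Hello, world. Regular expressions are not always the answer."
before, sep, after = text.partition("Regular expressions")

x = 4
words_before = before.split()[-x:]  # at most x words before the separator
words_after = after.split()[:x]     # at most x words after it
print(words_before, words_after)
# ['Hello,', 'world.'] ['are', 'not', 'always', 'the']
Keep in mind that str.partition() only splits on the first occurrence; for repeated occurrences you would need to loop, e.g. with str.find().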
The main problem with your pattern is that it begins with optional subpatterns, which causes a lot of attempts at each position in the string until a match is found. The number of attempts increases with the text size and with the value of n (the number of words before and after). This is why a few lines of text suffice to make your code hang.
One way is to begin the pattern with the target word and to use lookarounds to capture the text (or the words) before and after it:
keyword (?= words after ) (?<= words before - keyword)
Starting the pattern with the searched word (a literal string) makes it very fast, and the words around it are then quickly found from that position in the string. Unfortunately the re module has some limitations and doesn't allow variable-length lookbehinds (unlike many other regex flavors).
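You can see that limitation directly:
import re

try:
    re.compile(r"(?<=\w+ )harpooneers")  # variable-length lookbehind
except re.error as exc:
    print(exc)  # look-behind requires fixed-width pattern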
The new regex module supports variable length lookbehinds and other useful features like the ability to store the matches of a repeated capture group (handy to get the separated words in one shot).
import regex
text = '''In strange contrast to the hardly tolerable constraint and nameless
invisible domineerings of the captain's table, was the entire care-free
license and ease, the almost frantic democracy of those inferior fellows
the harpooneers. While their masters, the mates, seemed afraid of the
sound of the hinges of their own jaws, the harpooneers chewed their food
with such a relish that there was a report to it.'''
word = 'harpooneers'
n = 4
pattern = r'''
\m (?<target> %s ) \M # target word
(?<= # content before
(?<before> (?: (?<wdb>\w+) \W+ ){0,%d} )
%s
)
(?= # content after
(?<after> (?: \W+ (?<wda>\w+) ){0,%d} )
)
''' % (word, n, word, n)
rgx = regex.compile(pattern, regex.VERBOSE | regex.IGNORECASE)
class Result(object):
    def __init__(self, m):
        self.target_span = m.span()
        self.excerpt_span = (m.starts('before')[0], m.ends('after')[0])
        self.excerpt = m.expandf('{before}{target}{after}')
        self.words_before = m.captures('wdb')[::-1]
        self.words_after = m.captures('wda')
results = [Result(m) for m in rgx.finditer(text)]
print(results[0].excerpt)
print(results[0].excerpt_span)
print(results[0].words_before)
print(results[0].words_after)
print(results[1].excerpt)
Making a regex (well, anything, for that matter) with "as many repetitions as you will ever possibly need" is an extremely bad idea. That's because you:
do an excessive amount of needless work every time
cannot really know for sure how much you will ever possibly need, thus introducing an arbitrary limitation
The bottom line for the below solutions: the 1st solution is the most effective one for large data; the 2nd one is the closest to your current, but scales much worse.
strip your entities to exactly what you are interested in at each moment:
find the substring (e.g. str.index; for whole words only, re.search with e.g. r'\b%s\b' % re.escape(word) is more suitable)
go N words back.
Since you mentioned a "text", your strings are likely to be very large, so you want to avoid copying potentially unlimited chunks of them.
E.g. re.finditer over a substring-reverse-iterator-in-place (see "slices to immutable strings by reference and not copy" and "Best way to loop over a python string backwards"). This will only become better than slicing when the latter is expensive in terms of CPU and/or memory; test on some realistic examples to find out. Update: this doesn't work, because re works directly with the memory buffer, so it's impossible to reverse a string for it without copying the data.
There's no function to find a character from a class in Python, nor an "xsplit". So the fastest way appears to be (i for i, c in enumerate(reversed(buffer(text, 0, substring_index))) if c.isspace()) (timeit gives ~100 ms on a P3 933 MHz for a full pass through a 100k string).
Alternatively:
Fix your regex to not be subject to catastrophic backtracking and eliminate code duplication (DRY principle).
The 2nd measure will eliminate the 2nd issue: we'll make the number of repetitions explicit (Python Zen, koan 2) and thus highly visible and manageable.
As for the 1st issue, if you really only need "up to known, same N" items in each case, you won't actually be doing "excessive work" by finding them together with your string.
The "fix" part here is \w*\W* -> \w+\W+. This eliminates major ambiguity (see the above link) from the fact that each x* can be a blank match.
Matching up to N words before the string effectively is harder:
with (\w+\W+){,10} or equivalent, the matcher will be finding every 10 words before discovering that your string doesn't follow them, then trying 9, 8, etc. To ease things up on the matcher somewhat, a \b before the pattern will make it perform all this work only at the beginning of each word
a lookbehind is not allowed here: as the linked article explains, the regex engine must know how many characters to step back before trying the contained regex. And even if it were allowed, a lookbehind is tried before every character, i.e. it's even more of a CPU hog
As you can see, regexes aren't quite cut out to match things backwards
To eliminate code duplication, either
use the aforementioned {,10}. This will not save the individual words, but should be noticeably faster for large texts (see above for how the matching works here). We can always parse the retrieved chunk of text in more detail (with the regex in the next item) once we have it. Or
autogenerate the repetitive part (see the sketch below)
note that (\w+\W+)? repeated mindlessly is subject to the same ambiguity as above. To be unambiguous, the expression must be like this (with w = (\w+\W+) for brevity): (w(w...(ww?)?...)?)? - and all the groups need to be non-capturing.
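One way to autogenerate that nested shape (a sketch; upto() is a hypothetical helper name):
import re

def upto(n, w=r'(?:\w+\W+)'):
    # nest optional non-capturing groups from the inside out:
    # upto(3) -> (?:w(?:w(?:w)?)?)? with w = (?:\w+\W+)
    pattern = ''
    for _ in range(n):
        pattern = '(?:{}{})?'.format(w, pattern)
    return pattern

print(re.findall(r'\b' + upto(4) + re.escape('harpooneers'),
                 'the mates, seemed afraid of the harpooneers'))
# ['seemed afraid of the harpooneers']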
I personally think that using text.partition() is the best option, as it eliminates the messy regular expressions, and automatically leaves output in an easy-to-access tuple.

How to add tags to negated words in strings that follow "not", "no" and "never"

How do I add the tag NEG_ to all words that follow not, no and never, up until the next punctuation mark, in a string (used for sentiment analysis)? I assume that regular expressions could be used, but I'm not sure how.
Input: It was never going to work, he thought. He did not play so well, so he had to practice some more.
Desired output: It was never NEG_going NEG_to NEG_work, he thought. He did not NEG_play NEG_so NEG_well, so he had to practice some more.
Any idea how to solve this?
To make up for Python's re regex engine's lack of some Perl abilities, you can use a lambda expression in a re.sub function to create a dynamic replacement:
import re
string = "It was never going to work, he thought. He did not play so well, so he had to practice some more. Not foobar !"
transformed = re.sub(r'\b(?:not|never|no)\b[\w\s]+[^\w\s]',
                     lambda match: re.sub(r'(\s+)(\w+)', r'\1NEG_\2', match.group(0)),
                     string,
                     flags=re.IGNORECASE)
This will print:
It was never NEG_going NEG_to NEG_work, he thought. He did not NEG_play NEG_so NEG_well, so he had to practice some more. Not NEG_foobar !
Explanation
The first step is to select the parts of your string you're interested in. This is done with
\b(?:not|never|no)\b[\w\s]+[^\w\s]
Your negative keyword (\b is a word boundary, (?:...) a non-capturing group), followed by alphanumerics and spaces (\w is [0-9a-zA-Z_], \s is all kinds of whitespace), up until something that's neither an alphanumeric nor a space (acting as punctuation).
Note that the punctuation is mandatory here, but you could safely remove [^\w\s] to match end of string as well.
Now you're dealing with strings like never going to work,. Just select the words preceded by spaces with
(\s+)(\w+)
And replace them with what you want
\1NEG_\2
I would not do this with a regexp. Rather I would:
1. Split the input on punctuation characters.
2. For each fragment:
   a. Set a negation counter to 0.
   b. Split the fragment into words.
   c. For each word:
      - Prepend the negation counter's number of NEG_ prefixes to the word (or counter mod 2, or 1 if the counter is greater than 0).
      - If the original word is in {no, never, not}, increase the negation counter by one.
A sketch of this follows below.
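A sketch of that approach (using the "1 if greater than 0" variant, and splitting with a capture group so the punctuation marks are kept):
import re

def tag_negations(text):
    out = []
    # the capture group keeps each punctuation mark as its own fragment,
    # so the negation counter resets after every punctuation mark
    for fragment in re.split(r'([.,:;!?])', text):
        negations = 0
        words = []
        for original in fragment.split(' '):
            word = 'NEG_' + original if negations and original else original
            if original.lower() in ('no', 'never', 'not'):
                negations += 1
            words.append(word)
        out.append(' '.join(words))
    return ''.join(out)

text = ("It was never going to work, he thought. "
        "He did not play so well, so he had to practice some more.")
print(tag_negations(text))
# It was never NEG_going NEG_to NEG_work, he thought. He did not NEG_play NEG_so NEG_well, so he had to practice some more.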
You will need to do this in several steps (at least in Python - .NET languages can use a regex engine that has more capabilities):
First, match a part of a string starting with not, no or never. The regex \b(?:not?|never)\b([^.,:;!?]+) would be a good starting point. You might need to add more punctuation characters to that list if they occur in your texts.
Then, use the match result's group 1 as the target of your second step: Find all words (for example by splitting on whitespace and/or punctuation) and prepend NEG_ to them.
Join the pieces together again and insert the result into your original string in place of the first regex's match.
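A sketch of those three steps, with the keyword group made capturing and re.sub's replacement function doing the reinsertion:
import re

text = ("It was never going to work, he thought. "
        "He did not play so well, so he had to practice some more.")

pattern = re.compile(r'\b(not?|never)\b([^.,:;!?]+)', re.IGNORECASE)

def repl(m):
    # step 2: split group 2 into words and prepend NEG_ to each
    tagged = ''.join(' NEG_' + w for w in m.group(2).split())
    # step 3: splice the pieces back in place of the match
    return m.group(1) + tagged

print(pattern.sub(repl, text))
# It was never NEG_going NEG_to NEG_work, he thought. He did not NEG_play NEG_so NEG_well, so he had to practice some more.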

re.match() multiple times in the same string with Python

I have a regular expression to find the :ABC:`hello` pattern. This is the code:
format = r".*\:(.*)\:\`(.*)\`"
patt = re.compile(format, re.I|re.U)
m = patt.match(l.rstrip())
if m:
    ...
It works well when the pattern occurs once in a line, but with an example like ":tagbox:`Verilog` :tagbox:`Multiply` :tagbox:`VHDL`" it finds only the last one.
How can I find all the three patterns?
EDIT
Based on Paul Z's answer, I could get it working with this code
format = r"\:([^:]*)\:\`([^`]*)\`"
patt = re.compile(format, re.I|re.U)
for m in patt.finditer(l.rstrip()):
    tag, value = m.groups()
    print(tag, ":::", value)
Result
tagbox ::: Verilog
tagbox ::: Multiply
tagbox ::: VHDL
Yeah, dcrosta suggested looking at the re module docs, which is probably a good idea, but I'm betting you actually wanted the finditer function. Try this:
format = r"\:(.*)\:\`(.*)\`"
patt = re.compile(format, re.I|re.U)
for m in patt.finditer(l.rstrip()):
    tag, value = m.groups()
    ....
Your current solution always finds the last one because the initial .* eats as much as it can while still leaving a valid match (the last one). Incidentally, this is also probably making your program far slower than it needs to be, because .* first tries to eat the entire string, then backs up character by character as the remaining expression tells it "that was too much, go back". Using finditer should be much more performant.
A good place to start is the re module docs. In addition to re.match (which searches starting explicitly at the beginning of the string), there are re.findall (finds all non-overlapping occurrences of the pattern) and the match and search methods of compiled pattern objects, both of which accept start and end positions to limit the portion of the string being considered. See also split, which returns a list of substrings split by the pattern. Depending on how you want your output, one of these may help.
re.findall or even better regex.findall can do that for you in a single line:
import regex as re #or just import re
s = ":tagbox:`Verilog` :tagbox:`Multiply` :tagbox:`VHDL`"
format = r"\:([^:]*)\:\`([^`]*)\`"
re.findall(format,s)
result is:
[('tagbox', 'Verilog'), ('tagbox', 'Multiply'), ('tagbox', 'VHDL')]
