How do I build a tokenizing regex-based iterator in Python

I'm basing this question on an answer I gave to this other SO question, which was my specific attempt at a tokenizing regex based iterator using more_itertools's pairwise iterator recipe.
Following is my code taken from that answer:
from more_itertools import pairwise
import re
string = "dasdha hasud hasuid hsuia dhsuai dhasiu dhaui d"
# split according to the given delimiter including segments beginning at the beginning and ending at the end
for prev, curr in pairwise(re.finditer(r"^|[ ]+|$", string)):
    print(string[prev.end(): curr.start()])  # originally I yield here
I then noticed that if the string starts or ends with delimiters (e.g. string = " dasdha hasud hasuid hsuia dhsuai dhasiu dhaui d ") then the tokenizer will print empty strings at the beginning and end of its list of token outputs (these are actually extra matches to string start and string end). To remedy this I tried the following (quite ugly) attempts at other regexes:
"(?:^|[ ]|$)+" - this seems quite simple and like it should work but it doesn't (and also seems to behave wildly different on other regex engines) for some reason it wouldn't build a single match from the string's start and the delimiters following it, the string start somehow also consumes the character following it! (this is also where I see divergence from other engines, is this a BUG? or does it have something to do with special non corporeal characters and the or (|) operator in python that I'm not aware of?), this solution also did nothing for the double match containing the string's end, once it matched the delimiters and then gave another match for the string end ($) character itself.
"(?:[ ]|$|^)+" - Putting the delimiters first actually solves one of the problems, the split at the beginning doesn't contain string start (but I don't care too much about that anyway since I'm interested in the tokens themselves), it also matches string start when there are no delimiters at the beginning of the string but the string ending is still a problem.
"(^[ ]*)|([ ]*$)|([ ]+)" - This final attempt got the string start to be part of the first match (which wasn't really that much of a problem in the first place) but try as I might I couldn't get rid of the delimiter + end and then delimiter match problem (which yields an additional empty string), still, I'm showing you this example (with grouping) since it shows that the ending special character $ is matched twice, once with the preceding delimiters and once by itself (2 group 2 matches).
My questions are:
Why do I get such strange behavior in attempt #1?
How do I solve the end of string issue?
Am I being a tank, i.e. is there a simple way to solve this that I'm blindly missing?
Remember that the solution can't change the string and must
produce an iterable generator which iterates on the spaces between the tokens and not the tokens themselves. (This last part might seem to complicate the answer unnecessarily, since otherwise I have a simple answer, but if you must know (and if you don't, read no further): it's part of a bigger framework I'm building, where this yielding method is inherited by a pipeline which then constructs yielded sentences out of it in various patterns that are used to extract fields from semi-structured, classifier-driven messages.)

The problems you're having are due to the trickiness and undocumented edge cases of zero-width matches. You can resolve them by using negative lookarounds to explicitly tell Python not to produce a match for ^ or $ if the string has delimiters at the start or end:
delimiter_re = r'[\n\- ]' # newline, hyphen, or space
search_regex = r'''^(?!{0}) # string start with no delimiter
| # or
{0}+ # sequence of delimiters (at least one)
| # or
(?<!{0})$ # string end with no delimiter
'''.format(delimiter_re)
search_pattern = re.compile(search_regex, re.VERBOSE)
Note that this will produce one match in an empty string, not zero, and not separate beginning and ending matches.
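For illustration, here is how the compiled pattern might plug into the pairwise loop from the question (a sketch; print stands in for the original yield, and search_pattern is the pattern compiled just above):
from more_itertools import pairwise

string = " dasdha hasud hasuid hsuia dhsuai dhasiu dhaui d "
for prev, curr in pairwise(search_pattern.finditer(string)):
    print(repr(string[prev.end():curr.start()]))  # no empty tokens at either end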
It may be simpler to iterate over non-delimiter sequences and use the resulting matches to locate the string components you want:
token = re.compile(r'[^\n\- ]+')
previous_end = 0
for match in token.finditer(string):
    do_something_with(string[previous_end:match.start()])
    previous_end = match.end()
do_something_with(string[previous_end:])
The extra matches you were getting at the end of the string were because after matching the sequence of delimiters at the end, the regex engine looks for matches at the end again, and finds a zero-width match for $.
The behavior you were getting at the beginning of the string for the ^|... pattern is trickier: the regex engine sees a zero-width match for ^ at the start of the string and emits it, without trying the other | alternatives. After the zero-width match, the engine needs to avoid producing that match again to avoid an infinite loop; this particular engine appears to do that by skipping a character, but the details are undocumented and the source is hard to navigate.
The behavior you were getting at the start of the string for the (?:^|...)+ pattern is even trickier. Executing this straightforwardly, the engine would look for a match for (?:^|...) at the start of the string, find ^, then look for another match, find ^ again, then look for another match ad infinitum. There's some undocumented handling that stops it from going on forever, and this handling appears to produce a zero-width match, but I don't know what that handling is.
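To see what the engine is actually doing, it can help to print the raw matches of the original pattern (a small illustration; the exact spans of the zero-width matches differ between Python versions, since empty-match handling changed in 3.7):
import re

s = " a b "
for m in re.finditer(r"^|[ ]+|$", s):
    print(m.span(), repr(m.group()))  # note the zero-width '' matches produced for ^ and $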

It sounds like you're just trying to return a list of all the "words" separated by any number of delimiting chars. You could instead just use regex groups and the negating ^ inside a character class to achieve this:
# match any number of consecutive non-delim chars
string = " dasdha hasud hasuid hsuia dhsuai dhasiu dhaui d "
delimiters = r'\n\- '
regex = r'([^{0}]+)'.format(delimiters)
for match in re.finditer(regex, string):
    print(match.group(0))
output:
dasdha
hasud
hasuid
hsuia
dhsuai
dhasiu
dhaui
d

Related

Regex Statement to only match parts of a string for comparison - Python

What I am trying to do is match values from one file to another, but I only need to match the first portion of the string and the last portion.
I am reading each file into a list, and manipulating these based on different regex patterns I have created. Everything works, except when it comes to these types of values:
V-1\ZDS\R\EMBO-20-1:24
V-1\ZDS\R\EMBO-20-6:24
In this example, I only want to match 'V-1\ZDS\R\EMBO-20' and then compare the '24' value at the end of the string. The number x in '20-x:', can vary and doesn't matter in terms of comparisons, as long as the first and last parts of this string match.
This is the Regex I am using:
re.compile(r"(?:.*V-1\\ZDS\\R\\EMBO-20-\d.*)(:\d*\w.*)")
Once I filter down the list, I use the following function to return the difference between the two sets:
funcDiff = lambda x, y: list((set(x)- set(y))) + list((set(y)- set(x)))
Is there a way to take the list of differences and filter out the ones that have matching values after the ':' as mentioned above?
I apologize if this is an obvious answer, I'm new to Python and regex!
The output I get is the differences between the entire strings, so even if the first and last part of the string match, if the number following the 'EMBO-20-x' doesn't also match, it returns it as being different.
Before discussing your question, regex101 is an incredibly useful tool for this type of thing.
Your issue stems from two issues:
1.) The way you used .*
2.) Greedy vs. Nongreedy matches
.* kinda sucks
.* is a regex expression that is very rarely what you actually want.
As a quick aside, a useful regex expression is [^c]* or [^c]+. These expressions match any character except the letter c, with the first expression matching 0 or more, and the second matched 1 or more.
.* will match all characters as many times as it can. Instead, try to start your regex patterns with more concrete starting points. Two good ways to do this are lookbehind expressions and anchors.
Another quick aside: it's likely that you are misusing re.match and re.search. match will only return a match that begins at the start of the string, while search will return matches anywhere in the input string. This could be the reason you included the .* in the first place, to allow a .match call to return a match deeper in the string.
Lookbehind Expressions
There are more complete explanations online, but in short, regex patterns like:
(?<=test)foo
will match the text foo, but only if test is right in front of it. To be more clear, the following strings will not match that regex:
foo
test-foo
test foo
but the following string will match:
testfoo
This will only match the text foo, though.
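For instance, in Python (a toy example):
import re

# only the 'foo' immediately preceded by 'test' is matched
print(re.findall(r'(?<=test)foo', 'testfoo test-foo foo'))  # ['foo']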
Anchors
Another option is anchors. ^ and $ are special characters, matching the start and end of a line of text. If you know your regex pattern will match exactly one line of text, start it with ^ and end it with $.
Leading your patterns with .* and ending them with .* is likely the source of your issue. Although you did not include full examples of your input or your code, you likely used match as opposed to search.
In regex, . matches any character, and * means 0 or more times. This means that for any input, your pattern will match the entire string.
Greedy vs. Non-Greedy qualifiers
The second issue is related to greediness. When your regex patterns have a * in them, they can match 0 or more characters. This can hide problems, as entire * expressions can be skipped. Your regex likely matched several lines of text as one match, hiding multiple records in a single .*.
The Actual Answer
Taking all of this into consideration, let's assume that your input data looks like this:
V-1\ZDS\R\EMBO-20-1:24
V-1\ZDS\R\EMBO-20-6:24
V-1\ZDS\R\EMBO-20-3:93
V-1\ZDS\R\EMBO-20-6:22309
V-1\ZDS\R\EMBO-20-8:2238
V-1\ZDS\R\EMBO-20-3:28
A better regular expression would be:
^V-1\\ZDS\\R\\EMBO-20-\d:(\d+)$
You can paste this regex into regex101 to visualize it in action.
There are several differences I would like to highlight:
Starting the expression with ^ and ending with $. This forces the regex to match exactly one line. Even though the pattern works without these characters, it's good practice when working with regex to be as explicit as possible.
No useless non-capturing group. Your example had a (?:) group at the start. This denotes a group that does not capture its match. It's useful if you want to match a subpattern multiple times ((?:ab){5} matches ababababab without capturing anything). However, in your example, it did nothing :)
Only capturing the number. This makes it easier to extract the value of the capture groups.
No use of *, one use of +. + works like *, but it matches 1 or more. This is often more correct, as it prevents 'skipping' entire characters.
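To tie this back to the comparison problem, here is a rough sketch. The helper name comparable_key, the extra capture group around the fixed prefix, and allowing more than one digit in the 20-x part are my own additions, not part of the answer above:
import re

pattern = re.compile(r'^(V-1\\ZDS\\R\\EMBO-20)-\d+:(\d+)$')

def comparable_key(line):
    # reduce a line to (fixed prefix, value after ':'), ignoring the varying 20-x part
    m = pattern.match(line)
    return (m.group(1), m.group(2)) if m else None

lines_a = [r'V-1\ZDS\R\EMBO-20-1:24', r'V-1\ZDS\R\EMBO-20-3:93']
lines_b = [r'V-1\ZDS\R\EMBO-20-6:24', r'V-1\ZDS\R\EMBO-20-3:28']

keys_a = {comparable_key(l) for l in lines_a}
keys_b = {comparable_key(l) for l in lines_b}
print(keys_a ^ keys_b)  # entries whose prefix/value combination differs between the files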

Filter words in Regular expression

So, quite recently I have been introduced to regular expressions in Python, and I've come across some code online to filter words from a string list that are contained in other substrings.
def Filter(string, substr):
    return [str for str in string
            if re.match(r'[^\d]+|^', str).group(0) in substr]
It seems pretty straightforward and it works pretty well for the specific problem I'm facing, but I really can't wrap my head around the meaning of it and how it is working. It just seems very confusing. Can anyone explain it to me as if I was a baby or something? My coding skills are not that great, and I'm still a rookie.
Just to be clear, the code works, and I'm happy to move on, I just don't understand this bit.
[^\d] matches any character that isn't a numeric digit; this can also be written as \D.
+ after a pattern means to match any sequence of characters that match the pattern, so [^\d]+ matches a sequence of non-digits.
| separates alternative patterns to match.
The second alternative ^ matches the beginning of the string. Every string will match this. I think they use this just to avoid the match failing, so that you can always call .group(0) on the result. They could accomplish the same thing by changing + to * in the first alternative, since this means that the matched sequence can be 0 repetitions.
re.match() looks for a match of the regexp at the beginning of the argument string. And .group(0) returns what was matched by the entire regexp. So this whole thing returns the initial sequence of non-digits in str.
Finally, the list comprehension returns any of the items in strings whose initial sequence of non-digits is in substr.
With the simplifications I mentioned above, this can be rewritten:
def Filter(string, substr):
    return [item for item in string
            if re.match(r'\D*', item).group(0) in substr]
Note that if any of the items begin with a digit, the result of the regexp will be an empty string, and an empty string is a substring of every string. So these items will be included in the filter result. I suspect this is not the intended result.
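For example, with the simplified version above and some made-up inputs:
print(Filter(['apple1', 'banana2', '3cherry'], 'apple pie'))
# ['apple1', '3cherry'] - '3cherry' slips through because its non-digit prefix is ''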
I will try to explain this for you.
So basically we are creating a function named Filter and passing two arguments, i.e. string (the list to be searched in) and substr (the text to be searched for). Then we use re.match inside the returned list comprehension, together with an if condition; the comprehension's for loop walks through the items of string one by one.
As for r'[^\d]+|^': this is a regular expression pattern where \d is the regex pattern for a digit, [^\d] negates it (any non-digit character), + means one or more, and | offers an alternative (^, the start of the string).
re.match:
re.match is a function that searches only from the beginning of the string and returns the match object (if found). However, if the pattern only matches somewhere in the middle of the string then it will simply return None.
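A quick illustration of that difference (re.search, by contrast, scans the whole string):
import re

print(re.match(r'\d+', 'abc123'))   # None - the string does not start with a digit
print(re.search(r'\d+', 'abc123'))  # a match object for '123', found in the middle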

regex: replace hyphens with en-dashes with re.sub

I am using a small function to loop over files so that any hyphens - get replaced by en-dashes – (alt + 0150).
The function I use adds some regex flavor to a solution in a related problem (how to replace a character INSIDE the text content of many files automatically?)
def mychanger(fileName):
    with open(fileName, 'r') as file:
        str = file.read()
        str = str.decode("utf-8")
        str = re.sub(r"[^{]{1,4}(-)", "–", str).encode("utf-8")
    with open(fileName, 'wb') as file:
        file.write(str)
I used the regular expression [^{]{1,4}(-) because the search is actually performed on latex regression tables and I only want to replace the hyphens that occur around numbers.
To be clear: I want to replace all hyphens EXCEPT in cases where we have genuine latex code such as \cmidrule(lr){2-4}.
In this case there is a { close to the hyphen (within 3-4 characters at most) and to the left of it. Of course, this hyphen should not be changed into an en-dash, otherwise the LaTeX code will break.
I think the left-hand part of the condition is important for writing the correct exception in the regex. Indeed, in a regression table you can have things like -0.062\sym{***} (that is, a { immediately to the right of the hyphen), and in that case I do want to replace the hyphen.
A typical line in my table is
variable & -2.061\sym{***}& 4.032\sym{**} & 1.236 \\
& (-2.32) & (-2.02) & (-0.14)
However, my regex does not appear to be correct. For instance, a (-1.2) will be replaced as –1.2, dropping the parenthesis.
What is the problem here?
Thanks!
I can offer the following two step replacement:
str = "-1 Hello \cmidrule(lr){2-4} range 1-5 other stuff a-5"
str = re.sub(r"((?:^|[^{])\d+)-(\d+[^}])","\\1$\\2", str).encode("utf-8")
str = re.sub(r"(^|[^0-9])-(\d+)","\\1$\\2", str).encode("utf-8")
print(str)
The first replacement targets all ranges which are not of the LaTex form {1-9} i.e. are not contained within curly braces. The second replacement targets all numbers prepended with a non number or the start of the string.
re.sub replaces the entire match. In this case that includes the non-{ character preceding your -. You can wrap that bit in parentheses to create a \1 group and include that in your substitution (you also don't need the parentheses around the - in your pattern):
re.sub(r"([^{]{1,4})-",r"\1–", str)

Is this regex syntax working?

I wanted to search a string for a substring beginning with ">"
Does this syntax say what I want it to say: this character followed by anything.
regex_firstline = re.compile("[>]{1}.*")
As a Pythonic way for such tasks you can use the str.startswith() method, and you don't need to use regex.
But about your regex "[>]{1}.*": you don't need {1} after your character class, and you can mark the start of your regex with the anchor ^. So it can be "^>.*".
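For example (with a made-up line of text):
line = "> some quoted text"
if line.startswith(">"):
    print("found a line beginning with '>'")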
Using http://regex101.com:
[>]{1} matches the single character > literally exactly one time (though the {1} quantifier is redundant), and
.* then matches any character as many times as possible.
If several characters were listed inside the square brackets (as opposed to a single one), the regex would attempt to match exactly one character from that set. http://regex101.com has a good listing of tokens and what they mean.
An ideal regex expression would be ^[>].*, meaning at the beginning of a string find exactly one > character followed by anything else (and with only one character in the square brackets, you can remove those to simplify it even further: ^>.*).

Regular expression syntax for "match nothing"?

I have a python template engine that heavily uses regexp. It uses concatenation like:
re.compile( regexp1 + "|" + regexp2 + "*|" + regexp3 + "+" )
I can modify the individual substrings (regexp1, regexp2 etc).
Is there any small and light expression that matches nothing, which I can use inside a template where I don't want any matches? Unfortunately, sometimes '+' or '*' is appended to the regexp atom so I can't use an empty string - that will raise a "nothing to repeat" error.
This shouldn't match anything:
re.compile('$^')
So if you replace regexp1, regexp2 and regexp3 with '$^' it will be impossible to find a match, unless you are using multiline mode.
After some tests I found a better solution:
re.compile('a^')
It is impossible to match and will fail earlier than the previous solution. You can replace a with any other character and it will always be impossible to match.
(?!) should always fail to match. It is the zero-width negative look-ahead. If what is in the parentheses matches then the whole match fails. Given that it has nothing in it, it will fail the match for anything (including nothing).
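A quick sanity check (illustrative only):
import re

never = re.compile(r'(?!)')
print(never.search("anything at all"))  # None - the empty negative lookahead fails at every position
print(never.search(""))                 # None - it does not even match the empty string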
To match an empty string - even in multiline mode - you can use \A\Z, so:
re.compile(r'\A\Z|\A\Z*|\A\Z+')
The difference is that \A and \Z are the start and end of the string, whilst ^ and $ can match the start/end of lines, so $^|$^*|$^+ could potentially match a string containing newlines (if the MULTILINE flag is enabled).
And to fail to match anything (even an empty string), simply attempt to find content before the start of the string, e.g:
re.compile(r'.\A|.\A*|.\A+')
Since no characters can come before \A (by definition), this will always fail to match.
Maybe '.{0}'?
You could use
\Z..
This is the absolute end of the string, followed by two of anything.
If + or * is tacked on the end, this still works, refusing to match anything.
Or, use some list comprehension to remove the useless regexp entries and join to put them all together. Something like:
re.compile('|'.join([x for x in [regexp1, regexp2, ...] if x is not None]))
Be sure to add some comments next to that line of code though :-)
