Does Python regular expression library re have a maximum match [duplicate] - python

This question already has answers here:
Python extract pattern matches
(10 answers)
How do I return a string from a regex match in python? [duplicate]
(4 answers)
Closed 2 years ago.
I have scoured the web (and perhaps I am searching the wrong thing), but I have a very long regex pattern that I would like to match:
Ex:
import re
re_pattern_str = r"I want to match this \(this is an example\) regular expression to a giant string"
sample_paragraph = "I want to match this (this is an example) regular expression to a giant string. This is a huge paragraph with a bunch of stuff in it"
print(re.match(re_pattern_str, sample_paragraph))
The output of the above program is as follows:
<re.Match object; span=(0, 78), match='I want to match this (this is an example) regular>
As you can see, it gets cut off and doesn't capture the whole string.
Also, I noticed that using verbose mode with a lot of comments ((?x) in Python) captures less. Does this mean there is a limit to how much can be captured? I also noticed using different Python versions and different machines caused different amounts of a long regex string to be captured. I still can't pinpoint if this is an issue in the re library in Python, a Python 3 specific thing (I haven't compared this to Python 2), a machine issue, memory issue, or something else.
I have used Python 3.8.1 for the above example, and have used Python 3.7.2 for another example using verbose regexes and other examples (I can't share these examples since those are proprietary).
Any help understanding the mechanics of Python's regex engine and why this happens would be very helpful. If there is a maximum length that can be captured via regex, why does it exist?

You are assuming that the repr of the match object is the matched text. It isn't. The repr shortens long matches so that it does not dump pages of text. To see the complete matched text, index into the match object to get it as a string:
print(re.match(re_pattern_str, sample_paragraph)[0])  # [0] gets the matched text itself
You can see from the repr it's a much longer match (it spans index 0 to 78).
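A minimal sketch of the difference, reusing the pattern and paragraph from the question: the repr is only a shortened summary for display, while the match object still holds the whole matched text.

```python
import re

re_pattern_str = r"I want to match this \(this is an example\) regular expression to a giant string"
sample_paragraph = (
    "I want to match this (this is an example) regular expression to a giant string. "
    "This is a huge paragraph with a bunch of stuff in it"
)

m = re.match(re_pattern_str, sample_paragraph)

# The repr is a short summary meant for debugging output.
print(repr(m))

# The match object itself still holds the full matched text.
full_text = m.group(0)  # same as m[0]
print(m.span())         # start and end index of the match
print(len(full_text))   # equals end - start of the span
print(full_text)
```

Comparing `len(full_text)` with the span is a quick way to confirm that nothing was cut off.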

Related

How can I re-write my Regex Expression to begin the search at the occurrence of a separate pattern? [duplicate]

This question already has answers here:
Python extract pattern matches
(10 answers)
Closed 2 years ago.
Apologies if this is a duplicate - I wasn't exactly sure what to search for and everything I found came up short.
I'm using Python, and if anybody's interested, I drafted a quick example here:
Regex101 Example I created
I'm trying to use regex to grab the first part of a string that might be formatted like so:
**This is a Location** 8:20
or it could be formatted like...
Irrelevant information - **Relevant Information** 6:90
I wrote the following expression which does the job almost perfectly, pulling the relevant part of the string (words) out but it also pulls in the second part of the string (numbers). This is annoying as I then need to do a second regex/python expression to split that out.
r'(\w* ){1,5}\d+:\d+'
I'm using Python so I know I can quite easily separate the info manually with a slice etc but I feel like there must be a more elegant solution to my Regex that would negate the need for this step. Essentially I think the solution would be to match '\d+:\d+' and look back from there.
Ok - perhaps this isn't the most elegant solution but I've just realised I think I can use capturing groups like so:
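import re

# Pattern with groups
pattern = r'((\w* ){1,5})(\d+:\d+)'
string = "useless something else - useful 2:2"
r = re.search(pattern, string)
if r:
    useful_info = r.group(1)  # the words before the time (may include surrounding spaces)
    boundary = r.group(3)     # the time/number value, here "2:2"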
Theoretically, I'm always going to have the same number of groups with group 1 containing the relevant string I'm trying to grab and group 3 the time/number value. I'll test this now and update/close this thread.
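The idea mentioned earlier, anchoring on '\d+:\d+' without pulling it into the result, can also be sketched with a lookahead. This is one possible approach, not necessarily the best one for the real data: the helper name `words_before_time` is hypothetical, and the sketch assumes the asterisks in the examples above are emphasis markup rather than part of the actual strings.

```python
import re

# Up to five words, each followed by a space, but only where a time follows.
# (?=...) is a lookahead: it checks for the time without consuming it.
WORDS_BEFORE_TIME = re.compile(r'(?:\w+ ){1,5}(?=\d+:\d+)')

def words_before_time(text):
    """Return the words directly before a d:d time, or None (hypothetical helper)."""
    m = WORDS_BEFORE_TIME.search(text)
    return m.group(0).strip() if m else None

print(words_before_time("useless something else - useful 2:2"))
print(words_before_time("This is a Location 8:20"))
```

Because the lookahead consumes nothing, the whole match is already the relevant text, so no group bookkeeping or second pass is needed.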

python regex escaping meta characters among delimiters [duplicate]

This question already has answers here:
Why can't Python parse this JSON data? [closed]
(3 answers)
Closed 4 years ago.
Python 2.4.4 (yeah, long story)
I want to parse this fragment (with re)
"comment":"#2 Surely, (this) can't be any [more] complicated a reg-ex?",
That is, the comment can contain letters (upper or lower case), numbers, hashes, parentheses, square brackets, single quotes, and commas, and the fragment specifically ends with a double quote and a comma.
I've gotten this far with the expression:
r'\"comment\":\"(?P<COMMENT>[a-zA-Z0-9\s]+)\",'
But, of course, it only matches when none of the meta characters are in the comment. The final \", works as the termination criterion. I've tried all kinds of escapes, double escapes ...
Could a kind 're geek' please enlighten me?
I want to access the "entire" comment as match.group("COMMENT").
Corrected the pattern to what I was actually using when I asked; my bad cut-and-paste.
Until this was marked with all the "DUPLICATES", I couldn't even spell JSON. But I DID specify that I had to do this with re.
Even with all the JSON responses and code fragments, the json module wasn't introduced until 2.6, and I did specify that I'm still using 2.4.4.
Thanks to those responding with the regex-based solutions. Now working for me :)
Use a non-greedy .*? to match anything before ",, assuming this as the end of comment:
import re
s = '''"comment":"#2 Surely, (this) can't be any [more] complicated a reg-ex?",'''
match = re.search(r'"comment":"(?P<comment>.*?)",', s)
print(match.group('comment'))
# #2 Surely, (this) can't be any [more] complicated a reg-ex?
You can name your matched string using (?P<group_name>…).
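To see why the non-greedy form matters, here is a small sketch comparing .* with .*? on a made-up input that has a second field after the comment (the "other" field is hypothetical and not part of the original data).

```python
import re

# Hypothetical input: a second field follows the comment.
s = '"comment":"first, (a) [b]","other":"x",'

greedy = re.search(r'"comment":"(?P<comment>.*)",', s)
lazy = re.search(r'"comment":"(?P<comment>.*?)",', s)

print(greedy.group('comment'))  # runs on to the last ", in the string
print(lazy.group('comment'))    # stops at the first ",
```

The same assumption cuts both ways: if a comment itself contains the two characters ", the non-greedy match ends there, so this only works when that sequence reliably marks the end of the comment.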

Python re not matching when there is a dot in the string? [duplicate]

This question already has an answer here:
Python regular expression re.match, why this code does not work? [duplicate]
(1 answer)
Closed 5 years ago.
It appears that python regex is not matching when the target string has a dot. Is this a feature? Am I missing something? Thanks!
>>> import re
>>> re.compile('txt').match('txt')
<_sre.SRE_Match object at 0x7f2c424cd648>
>>> re.compile('txt').match('.txt')
>>>
The docs are quite explicit about the difference between search and match:
Python offers two different primitive operations based on regular expressions: re.match() checks for a match only at the beginning of the string, while re.search() checks for a match anywhere in the string (this is what Perl does by default).
You should therefore replace match with search:
In [9]: re.compile('txt').search('.txt')
Out[9]: <_sre.SRE_Match at 0x7f888afc9cc8>
I don't know the history behind this quirk (which, let's face it, must trip a lot of people up), but I'd be interested to see (in the comments) why re offers a specific function for matching at the start of a string.
In general, I'm broadly against anything whose name is broad enough that you have to go to the docs to work out what it does.
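A short sketch of the difference on the strings from the question. `fullmatch` (available since Python 3.4) is included for comparison; it requires the whole string to match.

```python
import re

pattern = re.compile('txt')

at_start = pattern.match('.txt')       # None: 'txt' does not begin at index 0
anywhere = pattern.search('.txt')      # match object: found at indices 1 to 4
whole_ok = pattern.fullmatch('txt')    # match object: the entire string matches
whole_bad = pattern.fullmatch('.txt')  # None: the leading dot is not matched

print(at_start, anywhere, whole_ok, whole_bad)
```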

Square brackets breaking regex? [duplicate]

This question already has answers here:
Reference - What does this regex mean?
(1 answer)
Decyphering a simple regex
(3 answers)
Closed 5 years ago.
I'm new to learning regex, and I came across a problem that I solved, although I'm not sure why it was a problem and would just like to learn a bit more!
I'm using Python for my regex statement. The relevant portion of text to be captured is (I've changed the exact numbers, but this is what it looks like)
Evaluation Type: InterimContract Percent Complete: 30%Period of Performance Being Assessed: 05/27/2013 -
I'm looking to capture Interim and 05/27/2013. The regex that I was using that did NOT work was
match = re.search(
    r"Evaluation Type:[\s\n]*(.*?)[\s\n]*Contract Percent[.]*"
    r"Period of Performance Being Assessed:[\s\n]*(.*?)[\s\n]*-",
    page_content)
The code that does work is
match = re.search(
    r"Evaluation Type:[\s\n]*(.*?)[\s\n]*Contract Percent.*"
    r"Period of Performance Being Assessed:[\s\n]*(.*?)[\s\n]*-",
    page_content)
(As you may notice, the difference is that I removed the square brackets around the . at the end of line 2.)
I understand that the brackets weren't actually needed (just helped me visualize it as I'm creating the regex) but I'm not sure why they broke it. I was getting no match with the first set of code, while a perfect match with the second. I'm sure it's some simple little thing, but I couldn't find what would be breaking from my searches online (although it could be that I don't understand enough in depth to know what I'm looking for)
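[.]* means zero or more literal dot characters.
.* means zero or more of any character except a newline.
A dot inside a character class loses its special meaning. In your text, "Contract Percent" is followed by " Complete: 30%", which contains no dots, so [.]* matches nothing and the pattern can never reach "Period of Performance".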
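A runnable sketch of both patterns against the sample line from the question. The text is assumed to be a single line exactly as quoted; if the real page content contains newlines between the two fields, `.*` would also need `re.DOTALL` to cross them.

```python
import re

page_content = ("Evaluation Type: InterimContract Percent Complete: 30%"
                "Period of Performance Being Assessed: 05/27/2013 -")

# [.]* can only skip over literal dots.
broken = (r"Evaluation Type:[\s\n]*(.*?)[\s\n]*Contract Percent[.]*"
          r"Period of Performance Being Assessed:[\s\n]*(.*?)[\s\n]*-")

# .* can skip over " Complete: 30%".
working = (r"Evaluation Type:[\s\n]*(.*?)[\s\n]*Contract Percent.*"
           r"Period of Performance Being Assessed:[\s\n]*(.*?)[\s\n]*-")

broken_match = re.search(broken, page_content)
working_match = re.search(working, page_content)

print(broken_match)
print(working_match.groups())
```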

what does this python regex mean "([\w\/%]*)" [duplicate]

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 8 years ago.
I am reading the Shinken source code in shinken/misc/perfdata.py, and I found a regex that I cannot understand:
metric_pattern = re.compile('^([^=]+)=([\d\.\-\+eE]+)([\w\/%]*);?([\d\.\-\+eE:~#]+)?;?([\d\.\-\+eE:~#]+)?;?([\d\.\-\+eE]+)?;?([\d\.\-\+eE]+)?;?\s*')
What confuses me is: what does \/ mean in ([\w\/%]*)?
You're rightfully confused, because that regex must have been written by someone who doesn't know Python regexes well.
In some languages (e.g. JavaScript), regexes are delimited by slashes. That means that if you need an actual slash in your regex, you have to escape it. Since Python doesn't use slashes, there's no need to escape the slash (but it doesn't cause an error, either).
Much more worrisome is that the author failed to use a raw string. In many cases, that won't matter (because Python will treat "\d" as "\\d", which then correctly translates to the regex \d), but in other cases, it will cause problems. One example is "\b", which means "a backspace character" and not "a word boundary anchor" like the regex \b would.
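A small sketch of the "\b" pitfall. The sample text is made up for illustration.

```python
import re

# In a normal string, "\b" is one character (backspace);
# in a raw string, r"\b" is two characters (backslash and b).
print(len("\b"), len(r"\b"))  # 1 2

text = "a cat sat"

# The regex engine receives literal backspace characters here, so nothing matches.
no_match = re.search("\bcat\b", text)

# The regex engine receives \b and treats it as a word boundary.
word_match = re.search(r"\bcat\b", text)

print(no_match, word_match)
```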
Also, the author has escaped a lot of characters that didn't need escaping at all. The entire regex could be rewritten as
metric_pattern = re.compile(r'^([^=]+)=([\d.+eE-]+)([\w/%]*);?([\d.+eE:~#-]+)?;?([\d.+eE:~#-]+)?;?([\d.+eE-]+)?;?([\d.+eE-]+)?;?\s*')
and even then, I'm surprised that it works at all. It looks very chaotic to me and is definitely not foolproof. For example, there appears to be a big potential for catastrophic backtracking, meaning that users could freeze your server with malicious input.
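One way to gain some confidence that the simplified pattern behaves like the original is to compare their captured groups on a few inputs. The sample perfdata strings below are made up for illustration, and agreement on a handful of samples is a sanity check, not a proof of equivalence.

```python
import re

# Original pattern from the question, written as a raw string.
original = re.compile(
    r'^([^=]+)=([\d\.\-\+eE]+)([\w\/%]*);?([\d\.\-\+eE:~#]+)?;?'
    r'([\d\.\-\+eE:~#]+)?;?([\d\.\-\+eE]+)?;?([\d\.\-\+eE]+)?;?\s*')

# Simplified pattern without the unnecessary escapes.
simplified = re.compile(
    r'^([^=]+)=([\d.+eE-]+)([\w/%]*);?([\d.+eE:~#-]+)?;?'
    r'([\d.+eE:~#-]+)?;?([\d.+eE-]+)?;?([\d.+eE-]+)?;?\s*')

# Hypothetical sample inputs.
samples = ["time=0.5s;1;2;0;10", "load=1.2e-3", "usage=85%;90;95"]

for s in samples:
    a = original.match(s).groups()
    b = simplified.match(s).groups()
    print(s, a == b, b)
```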
