This question already has answers here:
Reference - What does this regex mean?
(1 answer)
Decyphering a simple regex
(3 answers)
Closed 5 years ago.
I'm new to learning regex, and I came across a problem that I solved, although I'm not sure why it was a problem and would just like to learn a bit more!
I'm using Python for my regex statement. The relevant portion of text to be captured is (I've changed the exact numbers, but this is what it looks like)
Evaluation Type: InterimContract Percent Complete: 30%Period of Performance Being Assessed: 05/27/2013 -
I'm looking to capture Interim and 05/27/2013. The regex that I was using that did NOT work was
match = re.search(
"Evaluation Type:[\s\n]*(.*?)[\s\n]*Contract Percent[.]*"
"Period of Performance Being Assessed:[\s\n]*(.*?)[\s\n]*-"
, page_content)
The code that does work is
match = re.search(
"Evaluation Type:[\s\n]*(.*?)[\s\n]*Contract Percent.*"
"Period of Performance Being Assessed:[\s\n]*(.*?)[\s\n]*-"
, page_content)
(As you may notice, the difference is that I removed the square brackets around the . at the end of line 2.)
I understand that the brackets weren't actually needed (they just helped me visualize the regex as I was creating it), but I'm not sure why they broke it. I got no match with the first version and a perfect match with the second. I'm sure it's some simple little thing, but I couldn't find what was breaking from my searches online (although it could be that I don't understand enough to know what to look for).
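[.]* means zero or more literal dots.
.* means zero or more of any character except a newline.
A dot inside a character class loses its special meaning, so [.]* cannot skip over the " Complete: 30%" that sits between "Contract Percent" and "Period of Performance", and the whole match fails.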
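A minimal sketch of the difference, using the sample text from the question. The names `prefix`, `suffix`, `broken`, and `working` are introduced here for illustration only:

```python
import re

# Sample text as given in the question (the exact numbers were changed by the asker).
text = ("Evaluation Type: InterimContract Percent Complete: 30%"
        "Period of Performance Being Assessed: 05/27/2013 -")

prefix = r"Evaluation Type:[\s\n]*(.*?)[\s\n]*Contract Percent"
suffix = r"Period of Performance Being Assessed:[\s\n]*(.*?)[\s\n]*-"

# [.]* only matches literal dots, so it cannot bridge " Complete: 30%".
broken = re.search(prefix + r"[.]*" + suffix, text)

# .* matches any characters (except newline), so it can bridge the gap.
working = re.search(prefix + r".*" + suffix, text)

print(broken)            # None
print(working.groups())  # ('Interim', '05/27/2013')
```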
This question already has answers here:
Python extract pattern matches
(10 answers)
Closed 2 years ago.
Apologies if this is a duplicate - I wasn't exactly sure what to search for and everything I found came up short.
I'm using Python, and if anybody's interested, I drafted a quick example here:
Regex101 Example I created
I'm trying to use regex to grab the first part of a string that might be formatted like so:
**This is a Location** 8:20
or it could be formatted like...
Irrelevant information - **Relevant Information** 6:90
I wrote the following expression which does the job almost perfectly, pulling the relevant part of the string (words) out but it also pulls in the second part of the string (numbers). This is annoying as I then need to do a second regex/python expression to split that out.
r'(\w* ){1,5}\d+:\d+'
I'm using Python so I know I can quite easily separate the info manually with a slice etc but I feel like there must be a more elegant solution to my Regex that would negate the need for this step. Essentially I think the solution would be to match '\d+:\d+' and look back from there.
Ok - perhaps this isn't the most elegant solution but I've just realised I think I can use capturing groups like so:
import re

# Pattern with groups
pattern = r'((\w* ){1,5})(\d+:\d+)'
string = "useless something else - useful 2:2"
r = re.search(pattern, string)
if r:
    useful_info = r.group(1)  # the words before the time
    boundary = r.group(3)     # the time/number value
Theoretically, I'm always going to have the same number of groups with group 1 containing the relevant string I'm trying to grab and group 3 the time/number value. I'll test this now and update/close this thread.
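One possible alternative to the capturing-group approach, sketched here as an assumption rather than the asker's final solution: a lookahead `(?=...)` checks that the time follows without including it in the match, so no second step is needed to split it off. The helper name `words_before_time` is hypothetical, `\w+` is used instead of `\w*` so that a lone space is not matched, and the sample strings assume the asterisks in the question were formatting rather than part of the data.

```python
import re

# (?:...) groups without capturing; (?=\d+:\d+) requires the time to follow
# but leaves it out of the matched text.
WORDS_BEFORE_TIME = re.compile(r'(?:\w+ ){1,5}(?=\d+:\d+)')

def words_before_time(line):
    """Return the words immediately before a d:d value, or None if absent."""
    m = WORDS_BEFORE_TIME.search(line)
    return m.group(0).strip() if m else None

print(words_before_time("This is a Location 8:20"))
# This is a Location
print(words_before_time("Irrelevant information - Relevant Information 6:90"))
# Relevant Information
```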
This question already has answers here:
Python extract pattern matches
(10 answers)
How do I return a string from a regex match in python? [duplicate]
(4 answers)
Closed 2 years ago.
I have scoured the web (and perhaps I am searching the wrong thing), but I have a very long regex pattern that I would like to match:
Ex:
import re
re_pattern_str = r"I want to match this \(this is an example\) regular expression to a giant string"
sample_paragraph = "I want to match this (this is an example) regular expression to a giant string. This is a huge paragraph with a bunch of stuff in it"
print(re.match(re_pattern_str, sample_paragraph))
The output of the above program is as follows:
<re.Match object; span=(0, 78), match='I want to match this (this is an example) regular>
As you can see, it gets cut off and doesn't capture the whole string.
Also, I noticed that using verbose mode with a lot of comments ((?x) in Python) captures less. Does this mean there is a limit to how much can be captured? I also noticed using different Python versions and different machines caused different amounts of a long regex string to be captured. I still can't pinpoint if this is an issue in the re library in Python, a Python 3 specific thing (I haven't compared this to Python 2), a machine issue, memory issue, or something else.
I have used Python 3.8.1 for the above example, and have used Python 3.7.2 for another example using verbose regexes and other examples (I can't share these examples since those are proprietary).
Any help on the mechanics of the Python regex engine and why this happens (and, if there is a maximum length that can be captured via regex, why) would be very helpful.
You think the repr of the match is the matched text. It isn't. The repr tries not to dump pages of text for large matches. If you want to see the complete matched text, index in to get it as a string:
print(re.match(re_pattern_str, sample_paragraph)[0])
#^^^ gets the matched text itself
You can see from the repr it's a much longer match (it spans index 0 to 78).
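A short sketch contrasting the abbreviated repr with the actual matched text, reusing the pattern and paragraph from the question (the variable names `pattern`, `paragraph`, and `m` are chosen here for brevity):

```python
import re

pattern = r"I want to match this \(this is an example\) regular expression to a giant string"
paragraph = ("I want to match this (this is an example) regular expression "
             "to a giant string. This is a huge paragraph with a bunch of stuff in it")

m = re.match(pattern, paragraph)

print(repr(m))     # abbreviated display of the match object
print(m.span())    # (0, 78): the match really covers all 78 characters
print(m.group(0))  # the complete matched text, same as m[0]
```

`m.group(0)` is equivalent to `m[0]`; the `span` in the repr is the reliable indicator of how much was matched.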
This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 4 years ago.
I am currently going through pythonchallenge.com, and I'm trying to write code that searches for a lowercase letter with exactly three uppercase letters on each side of it. I got stuck trying to write a regular expression for it. This is what I have tried:
import re
#text is in https://pastebin.com/pAFrenWN since it is too long
p = re.compile("[^A-Z]+[A-Z]{3}[a-z][A-Z]{3}[^A-Z]+")
print("".join(p.findall(text)))
This is what I got with it:
dqIQNlQSLidbzeOEKiVEYjxwaZADnMCZqewaebZUTkLYNgouCNDeHSBjgsgnkOIXdKBFhdXJVlGZVme
gZAGiLQZxjvCJAsACFlgfe
qKWGtIDCjn
I later searched for the solution, which had this regular expression:
p = re.compile("[^A-Z]+[A-Z]{3}([a-z])[A-Z]{3}[^A-Z]+")
So there is a bracket around [a-z], and I couldn't figure out what difference it makes. I would like some explanation on this.
Use Parentheses for Grouping and Capturing
By placing part of a regular expression inside round brackets or parentheses, you can group that part of the regular expression together. This allows you to apply a quantifier to the entire group or to restrict alternation to part of the regex.
https://www.regular-expressions.info/brackets.html
Basically, the regex engine finds all the strings matching the whole search pattern and, because the pattern contains a capturing group, findall returns only the part inside the () for each match.
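To illustrate how the parentheses change what `findall` returns, here is a small sketch. The string `text` is a made-up stand-in for the real puzzle input, which lives at the pastebin link in the question:

```python
import re

# Hypothetical short input; the real puzzle text is much longer.
text = "aaXYZqRSTbbQccABCdEFGhh"

without_group = re.compile("[^A-Z]+[A-Z]{3}[a-z][A-Z]{3}[^A-Z]+")
with_group = re.compile("[^A-Z]+[A-Z]{3}([a-z])[A-Z]{3}[^A-Z]+")

# No capturing group: findall returns each whole match.
print(without_group.findall(text))  # ['aaXYZqRSTbb', 'ccABCdEFGhh']

# One capturing group: findall returns only the captured part of each match.
print(with_group.findall(text))     # ['q', 'd']
print("".join(with_group.findall(text)))  # qd
```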
This question already has answers here:
Why can't Python parse this JSON data? [closed]
(3 answers)
Closed 4 years ago.
Python 2.4.4 (yeah, long story)
I want to parse this fragment (with re)
"comment":"#2 Surely, (this) can't be any [more] complicated a reg-ex?",
i.e., it (the comment) can contain characters (upper or lower), numbers, hash, parentheses, square brackets, single quotes, and commas, and it (this fragment) specifically ends with a dquote and a comma
i've gotten this far with the expression,
r'\"comment\":\"(?P<COMMENT>[a-zA-Z0-9\s]+)\",'
but, of course, it only matches when none of the meta characters are in the comment. the final \", works as the termination criterion. I've tried all kinds of escape, double escape ...
could a kind 're geek' please enlighten ?
i want to access the "entire" comment as match.group("COMMENT")
corrected the pattern to what I was actually using when asked. my bad cut-n-paste.
until marked with all the "DUPLICATES", I couldn't spell JSON. But, I DID specify I had to do this with re.
even with all the JSON responses and code frags, it wasn't introduced until 2.6, and I did specify I'm still using 2.4.4.
Thanks to those responding with the regex-based solutions. Now working for me :)
Use a non-greedy .*? to match everything up to the first ", (a double quote followed by a comma), assuming that sequence marks the end of the comment:
import re
s = '''"comment":"#2 Surely, (this) can't be any [more] complicated a reg-ex?",'''
match = re.search(r'"comment":"(?P<comment>.*?)",', s)
print(match.group('comment'))
# #2 Surely, (this) can't be any [more] complicated a reg-ex?
You can name your matched string using (?P<group_name>…).
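A possible variant, sketched under the assumption that the comment never contains a double quote: a negated character class `[^"]*` stops at the first closing quote without relying on non-greedy matching. The group is named `COMMENT` here to match the name used in the question.

```python
import re

s = '''"comment":"#2 Surely, (this) can't be any [more] complicated a reg-ex?",'''

# [^"]* matches any run of characters that are not a double quote,
# so the group ends exactly at the closing quote.
match = re.search(r'"comment":"(?P<COMMENT>[^"]*)",', s)
comment = match.group("COMMENT")
print(comment)
# #2 Surely, (this) can't be any [more] complicated a reg-ex?
```

If comments can contain escaped quotes, neither this variant nor the non-greedy one is sufficient on its own.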
This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 8 years ago.
I am reading the Shinken source code in shinken/misc/perfdata.py, and I found a regex that I cannot understand:
metric_pattern = re.compile('^([^=]+)=([\d\.\-\+eE]+)([\w\/%]*);?([\d\.\-\+eE:~#]+)?;?([\d\.\-\+eE:~#]+)?;?([\d\.\-\+eE]+)?;?([\d\.\-\+eE]+)?;?\s*')
What confuses me is: what does \/ mean in ([\w\/%]*)?
You're rightfully confused, because that regex must have been written by someone who doesn't know Python regexes well.
In some languages (e.g. JavaScript), regexes are delimited by slashes. That means that if you need an actual slash in your regex, you have to escape it. Since Python doesn't use slashes, there's no need to escape the slash (but it doesn't cause an error, either).
Much more worrisome is that the author failed to use a raw string. In many cases, that won't matter (because Python will treat "\d" as "\\d", which then correctly translates to the regex \d), but in other cases, it will cause problems. One example is "\b", which means "a backspace character" and not "a word boundary anchor" like the regex \b would.
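A brief sketch of the "\b" pitfall; the test string "a cat sat" and the variable names are made up for illustration:

```python
import re

plain = "\bcat\b"   # Python turns \b into a backspace character (0x08)
raw = r"\bcat\b"    # the backslashes survive, so the regex engine sees \b

print(len("\b"), len(r"\b"))  # 1 2

# The non-raw pattern looks for literal backspace characters around "cat".
print(re.search(plain, "a cat sat"))         # None

# The raw pattern uses \b as a word-boundary anchor.
print(re.search(raw, "a cat sat").group(0))  # cat
```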
Also, the author has escaped a lot of characters that didn't need escaping at all. The entire regex could be rewritten as
metric_pattern = re.compile(r'^([^=]+)=([\d.+eE-]+)([\w/%]*);?([\d.+eE:~#-]+)?;?([\d.+eE:~#-]+)?;?([\d.+eE-]+)?;?([\d.+eE-]+)?;?\s*')
and even then, I'm surprised that it works at all. It looks very chaotic to me and is definitely not foolproof. For example, there appears to be significant potential for catastrophic backtracking, meaning that users could freeze your server with malicious input.
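As a rough sanity check that the simplified pattern behaves like the original, the sketch below compares both on a few sample strings. The samples are hypothetical perfdata-style inputs invented for this example, not taken from Shinken, and the original pattern is written as a raw string here to avoid escape-sequence warnings:

```python
import re

original = re.compile(
    r'^([^=]+)=([\d\.\-\+eE]+)([\w\/%]*);?([\d\.\-\+eE:~#]+)?;?'
    r'([\d\.\-\+eE:~#]+)?;?([\d\.\-\+eE]+)?;?([\d\.\-\+eE]+)?;?\s*')
simplified = re.compile(
    r'^([^=]+)=([\d.+eE-]+)([\w/%]*);?([\d.+eE:~#-]+)?;?'
    r'([\d.+eE:~#-]+)?;?([\d.+eE-]+)?;?([\d.+eE-]+)?;?\s*')

# Hypothetical inputs in the label=value[unit];warn;crit;min;max shape.
samples = ["time=0.5s;1;2;0;10", "load1=0.12;;;0;", "/=2643MB;5948;5958;0;5968"]

for sample in samples:
    a = original.match(sample).groups()
    b = simplified.match(sample).groups()
    print(sample, "->", b, "same" if a == b else "DIFFERENT")
```

Agreement on a handful of inputs is only a spot check; it does not prove the two patterns are equivalent in general.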