What does the whitespace in Python RegEx '^(.+?(\d*)) *$' mean? [duplicate] - python

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 8 years ago.
What does the whitespace in Python RegEx ^(.+?(\d*)) *$ mean?
pat = re.compile('^(.+?(\d*)) *$',re.M)
Does * mean \s*?
Can the whitespace be ignored? i.e. is ^(.+?(\d*)) *$ same as ^(.+?(\d*))*$?
I ran some examples, and it seems that the answers to the above two questions are no.
Thanks!

* means 0 or more occurrences of the preceding element, which here is a literal space, and $ anchors the match to the end of the line. So the " *" part allows optional trailing spaces. It does not match tabs, unless the character in the pattern is actually a tab.
No, the space cannot be ignored. If you remove it, the * applies to the preceding group instead of to a space, so trailing spaces are no longer matched separately and are consumed by .+? instead.
As it stands, it matches a line consisting of one or more characters of any kind (matched lazily), followed by optional digits and optional spaces.
If I were debugging this, I'd have to look up what happens on a line like "12345 " with the non-greedy matching, as I'd tend to write something like "^(\D+(\d+))\s*$" or "^(\D*.(\d+))\s*$" myself, depending on intention. In the old days you had to code against greedy matching yourself, so out of habit I generally avoid constructs like .+(\d*). Capturing 0 digits is generally a bug, as is having the first digit consumed by .+

You can test this out for yourself on an online regex tool such as http://www.regex101.com
It's just a space character.
For your info, \s is actually 'whitespace', so it matches tabs, form feeds and other characters as well as spaces.
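As a quick check of the points above, here is a minimal sketch that runs the pattern from the question against three inputs. The sample strings are made up for illustration; only the pattern comes from the question.

```python
import re

# The pattern from the question, written as a raw string.
pat = re.compile(r'^(.+?(\d*)) *$', re.M)

# Trailing spaces are consumed by ' *' and stay out of both groups.
spaces = pat.match('abc123   ').groups()

# Leading digits: the lazy '.+?' must take at least one character,
# so the first digit never reaches the (\d*) group.
digits = pat.match('12345 ').groups()

# A trailing tab is not matched by ' *'; '.+?' has to swallow it,
# which leaves (\d*) empty.
tab = pat.match('abc123\t').groups()

print(spaces)  # ('abc123', '123')
print(digits)  # ('12345', '2345')
print(tab)     # ('abc123\t', '')
```

If tabs should be treated like trailing spaces, replacing the space with \s (or [ \t]) in the pattern is one option.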

Related

Matching a space between occurrences in Regex

I need assistance with matching spaces and subsequent matches in regex.
the example is as follows:
I want to match all of the following scenarios:
60 ml ( 1)
60ML (2 )
60ml(2) (a)
the regex I have used is:
(60\s?(?:ml)\s?(?:\w|\(.{0,3}\)){0,5})
the regex matches the first 2 examples, but not the instances where there is a space between (2) and (a).
any guidance would be appreciated.
Your regex doesn't allow for spaces between the parenthesised groups (2) and (a) in your last example. You can add <space>* to it to allow it to do so. Note you cannot use \s* unless you are only matching a single value at a time, otherwise the fact that \s will match newline can cause the first match to go too far.
(60\s?ml\s?(?:\w|\(.{0,3}\) *){0,5})
Note that without anchors counting repetitions doesn't really make sense. For example, this regex will match both 60ML (2 )(a)(a)(a)(a) and 60ML (2 )(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a), returning 60ML (2 )(a)(a)(a)(a) in both cases. If that is not what you want, you will need to add an anchor to the end of the regex ($ perhaps) to prevent it matching the longer string.
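The behaviour described above can be checked with a short sketch. The re.I flag is an assumption here (the question's examples mix "ml" and "ML", so some case-insensitive setting was presumably in use), and re.fullmatch stands in for adding anchors.

```python
import re

# Pattern from the answer; re.I is an assumption so that "ML" matches "ml".
fixed = re.compile(r'(60\s?ml\s?(?:\w|\(.{0,3}\) *){0,5})', re.I)
# Pattern from the question, with the same assumed flag.
original = re.compile(r'(60\s?(?:ml)\s?(?:\w|\(.{0,3}\)){0,5})', re.I)

samples = ['60 ml ( 1)', '60ML (2 )', '60ml(2) (a)']

# fullmatch requires the whole string to match, like wrapping in ^...$.
fixed_results = [fixed.fullmatch(s) is not None for s in samples]
original_results = [original.fullmatch(s) is not None for s in samples]
print(fixed_results)     # [True, True, True]
print(original_results)  # [True, True, False]

# Without anchors, the {0,5} limit only truncates the match.
long_input = '60ML (2 )' + '(a)' * 12
truncated = fixed.search(long_input).group(0)
print(truncated)                     # 60ML (2 )(a)(a)(a)(a)
print(fixed.fullmatch(long_input))   # None
```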

Python Regex symbolize unlimited amount of characters [duplicate]

This question already has answers here:
Using regex to match any character except =
(4 answers)
Closed 4 years ago.
I'm trying to figure out how to represent the following regex in python:
Find the first occurrence of
{any character that isn't a letter}'{unlimited amount of any character including '}'{any character that isn't a letter}
For example:
She said 'Hello There!'.
`he Looked. 'I've been sick' and then...`
My question is how do I implement the middle part? How do I represent an unlimited amount of characters until the pattern in the end is found (`_)?
There are a few different ways you can represent an indefinite number of characters:
*: zero or more of the preceding character (greedy)
+: one or more of the preceding character (greedy)
*?: zero or more of the preceding character (non-greedy)
+?: one or more of the preceding character (non-greedy)
"Greedy" means that as many characters as possible will be matched. "Non-greedy" means that as few characters as possible will be matched. (For more explanation on greedy and non-greedy, see this answer.)
In your case, it sounds like you want to match one or more characters, and for the match to be non-greedy, so you need +?.
In Python code:
import re
my_regex = re.compile(r"\W'[^']+?'\W")
my_regex.search("She said 'Hello There!'.")
This regex won't match your second example, 'I've been sick' and then..., as there is no non-word character before the first '.
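To make the greedy versus non-greedy distinction concrete, here is a small sketch. The first sample string is made up for illustration; the last pattern is the one from the answer.

```python
import re

text = "say 'a' and 'b'"

# Greedy: '.+' runs to the last quote it can reach.
greedy = re.search(r"'.+'", text).group(0)
# Non-greedy: '.+?' stops at the first closing quote.
lazy = re.search(r"'.+?'", text).group(0)
print(greedy)  # 'a' and 'b'
print(lazy)    # 'a'

# The answer's pattern on the first example from the question.
my_regex = re.compile(r"\W'[^']+?'\W")
found = my_regex.search("She said 'Hello There!'.").group(0)
print(repr(found))  # " 'Hello There!'."
```

Note that the match includes the non-letter characters on both sides of the quotes. Because [^'] already excludes the quote character, the ? makes no difference to the result in that last pattern.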

Why do you have to mention the start and end characters in regex? [duplicate]

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 5 years ago.
The question was a regex that does not contain the letters PART. Here was an answer:
/^((?!PART).)*$/
Link to question: How to match a line not containing a word
My question is, why does it not work unless you indicate start(^) and end($)?
Why doesn't only ((?!PART).)* work?
The pattern
/^((?!PART).)*$/
matches the general pattern ^.*$, namely anything occurring in between the start and end anchors. However, it also contains a negative assertion (?!PART). As written, your pattern asserts that, at each position in the string, the string PART does not appear when looking forward for four characters.
If you just use this pattern:
/((?!PART).)*/
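then it only asserts that PART does not start at any position the match actually consumes. Because the pattern is unanchored and * allows zero repetitions, it can match an empty string or any stretch of the input that avoids PART. A string such as PART hello world would therefore still match this alternative pattern, which may not be what you want.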
Because without anchors the pattern is free to match only part of the input. If you only match part of a line, your matches can be any substring that avoids "PART", for example the text before "PART" or an empty string (depending on flags).
^ = start of the string (start of each line with re.M)
$ = end of the string (end of each line with re.M)
re.S = DOTALL, a dot also matches newline characters
re.M = MULTILINE, ^ and $ also match at the start and end of each line
Without making the anchors explicit you are only doing substring matching.
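The difference between the anchored and unanchored patterns can be seen in a minimal sketch. The sample strings are illustrative, and the surrounding slashes from the question are dropped because Python's re does not use them.

```python
import re

anchored = re.compile(r'^((?!PART).)*$')
unanchored = re.compile(r'((?!PART).)*')

for s in ['hello world', 'PART hello world', 'hello PART']:
    a = anchored.search(s)
    u = unanchored.search(s)
    # The unanchored pattern always finds something, possibly an empty match.
    print(repr(s), a is not None, repr(u.group(0)))

# 'hello world' True 'hello world'
# 'PART hello world' False ''
# 'hello PART' False 'hello '
```

Only the anchored version is forced to account for every character in the string, which is what turns the lookahead into a "does not contain PART" test.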

Regex - replace word having plus or brackets [duplicate]

This question already has answers here:
Escaping regex string
(4 answers)
Closed 6 years ago.
In Python, I am trying to do
text = re.sub(r'\b%s\b' % word, "replace_text", text)
to replace a word with some text. Using re rather than just doing text.replace to replace only if the whole word matches using \b. Problem comes when there are characters like +, (, [ etc in word. For example +91xxxxxxxx.
Regex treats this + as the quantifier for one or more and fails with the error sre_constants.error: nothing to repeat. The same happens with (.
Could not find a fix for this after searching around a bit. Is there a way?
Just use re.escape(string):
word = re.escape(word)
text = re.sub(r'\b{}\b'.format(word), "replace_text", text)
It replaces all critical characters with a special meaning in regex patterns with their escape forms (e.g. \+ instead of +).
Just a side note: formatting with the percent (%) character is the older style. It is not deprecated, but the .format() method of strings (or an f-string) is generally preferred in new code.
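One caveat worth checking: \b only matches between a word character and a non-word character, so for a word that starts with + (such as the +91... example) the escaped pattern compiles but may not match at all. The sketch below shows this and, as a swapped-in alternative that is not part of the original answer, uses the lookarounds (?<!\w) and (?!\w) instead of \b. The helper name replace_word and the sample phone number are hypothetical.

```python
import re

def replace_word(text, word, replacement):
    """Replace whole-word occurrences of a literal word (hypothetical helper)."""
    # (?<!\w) and (?!\w) mean "no word character next to the match",
    # which also works when the word starts or ends with punctuation.
    pattern = r'(?<!\w){}(?!\w)'.format(re.escape(word))
    # A function replacement avoids backslash processing in the replacement.
    return re.sub(pattern, lambda m: replacement, text)

text = 'call +9112345 now'
word = '+9112345'

# The \b version no longer raises, but finds nothing to replace here.
with_b = re.sub(r'\b{}\b'.format(re.escape(word)), 'X', text)
print(with_b)                                   # call +9112345 now

print(replace_word(text, word, 'X'))            # call X now
print(replace_word('cat concat cat', 'cat', 'dog'))  # dog concat dog
```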

Python regular expression to match a pattern when preceded by either start of line or whitespace [duplicate]

This question already has answers here:
Python Regex Engine - "look-behind requires fixed-width pattern" Error
(3 answers)
Closed 4 years ago.
I would like to write a regex that matches the word hello but only when it either starts a line or is preceded by whitespace. I don't want to match the whitespace if it's there; I just need to know it (or the start of line) is there.
So I've tried:
r = re.compile('hello(?<=\s|^)')
but this throws:
error: look-behind requires fixed-width pattern
For the sake of an example, if my string to be searched is:
s = 'hello world hello thello'
then I would like my regex to match two times...at the locations in uppercase below:
'HELLO world HELLO thello'
where the first would match because it is preceded by the start of the line, while the second match would be because it is preceded by a space. The last 5 characters would not match because they are preceded by a t.
(?:(?<=\s)|^)hello would be what you want. The lookbehind needs to come before hello in the regular expression, because it checks what precedes the current position. It must also be of fixed width: \s is 1 character wide, whereas ^ is 0 characters wide, so you cannot combine them with | inside a single lookbehind. In this case we do not need to; we just alternate (?<=\s) and ^ outside it.
Notice that this pattern would still match the hello in hellooo; if this is not acceptable, you have to add \b at the end.
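A minimal sketch of the pattern on the string from the question follows. The 'hellooo hello' input is made up to show the effect of \b, and the (?<!\S) variant is an alternative not mentioned in the answer: it means "not preceded by a non-whitespace character", which covers both the start of the string and a preceding whitespace character with a single fixed-width lookbehind.

```python
import re

s = 'hello world hello thello'

pat = re.compile(r'(?:(?<=\s)|^)hello')
starts = [m.start() for m in pat.finditer(s)]
print(starts)  # [0, 12]

# Adding \b rejects longer words such as 'hellooo'.
strict = re.compile(r'(?:(?<=\s)|^)hello\b')
strict_starts = [m.start() for m in strict.finditer('hellooo hello')]
print(strict_starts)  # [8]

# Alternative (an assumption, not from the answer): negative lookbehind.
alt = re.compile(r'(?<!\S)hello')
alt_starts = [m.start() for m in alt.finditer(s)]
print(alt_starts)  # [0, 12]
```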
