Python Regex symbolize unlimited amount of characters [duplicate] - python

This question already has answers here:
Using regex to match any character except =
(4 answers)
Closed 4 years ago.
I'm trying to figure out how to represent the following regex in python:
Find the first occurrence of
{any character that isn't a letter}'{unlimited amount of any character including '}'{any character that isn't a letter}
For example:
She said 'Hello There!'.
`he Looked. 'I've been sick' and then...`
My question is: how do I implement the middle part? How do I represent an unlimited number of characters until the pattern at the end (a ' followed by a character that isn't a letter) is found?

There are a few different ways you can represent an indefinite number of characters:
*: zero or more of the preceding character (greedy)
+: one or more of the preceding character (greedy)
*?: zero or more of the preceding character (non-greedy)
+?: one or more of the preceding character (non-greedy)
"Greedy" means that as many characters as possible will be matched. "Non-greedy" means that as few characters as possible will be matched. (For more explanation on greedy and non-greedy, see this answer.)
In your case, it sounds like you want to match one or more characters, and for the match to be non-greedy, so you need +?.
In Python code:
import re
my_regex = re.compile(r"\W'[^']+?'\W")
my_regex.search("She said 'Hello There!'.")
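This regex won't match your second example, he Looked. 'I've been sick' and then..., because [^'] excludes apostrophes, so the match cannot extend past the ' in I've. If apostrophes are allowed inside the quoted text, use .+? in place of [^']+?.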
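As a rough illustration of the difference, the sketch below runs the answer's pattern next to a looser variant on both example strings. The looser pattern and the names strict, loose, text_1 and text_2 are assumptions made for this example and are not part of the original answer; it uses [^A-Za-z] for "not a letter" because \W also treats digits and underscores as word characters.

```python
import re

text_1 = "She said 'Hello There!'."
text_2 = "he Looked. 'I've been sick' and then..."

# Pattern from the answer: [^'] cannot cross an apostrophe inside the quotes.
strict = re.compile(r"\W'[^']+?'\W")

# Assumed variant: any character may appear inside the quotes (lazily), and
# "not a letter" is written as [^A-Za-z] instead of \W.
loose = re.compile(r"[^A-Za-z]'.+?'[^A-Za-z]")

strict_1 = strict.search(text_1)
strict_2 = strict.search(text_2)
loose_2 = loose.search(text_2)

print(repr(strict_1.group()))  # " 'Hello There!'."
print(strict_2)                # None
print(repr(loose_2.group()))   # " 'I've been sick' "
```

Because .+? is lazy, the looser pattern stops at the first ' that is followed by a non-letter, which is why it skips the apostrophe in I've.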

Related

Question about ".*" in match regex in Python [duplicate]

This question already has answers here:
Regular Expressions- Match Anything
(17 answers)
What do 'lazy' and 'greedy' mean in the context of regular expressions?
(13 answers)
Closed 2 years ago.
Following is a simple piece of code about regex match:
import re
pattern = ".*"
s = "ab"
print(re.search(pattern, s))
output:
<_sre.SRE_Match object; span=(0, 2), match='ab'>
My confusion is that "." matches any single character, so here it can match "a" or "b"; with a "*" behind it, this combination should be able to match "", "a", "aa", "aaa...", "b", "bb", "bbb...", or any other single character repeated several times.
But how come it (".*") matches "ab" all at once?
The comments more or less covered it, but to provide an answer: the pattern .* means to match any character . zero or more times *. And by default, a regex is greedy so when presented with 'abc', even though '' would satisfy that rule, or 'a' would, etc., it will match the entire string, since matching all of it still meets the requirement.
It does not mean to match the same character zero or more times. Every character it matches can be a different character or the same as a previously matched one.
If instead you want to match any character, but match as many of that same character as possible, zero or more times, you can use:
(.)?\1*
See here https://regex101.com/r/FgvuX2/1 and here https://regex101.com/r/FgvuX2/2
What this effectively does, is match a single character optionally, creating a back reference which can be used in the second part of the expression. Thus it matches any single character (if there is one) to group 1 and matches that group 1 zero or more times, being greedy.
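A short sketch of the two behaviours side by side may help; the sample strings "aab" and "ab" and the variable names are chosen only for illustration.

```python
import re

# .* : every repetition of . is free to match a different character.
any_run = re.search(r".*", "ab").group()
print(any_run)     # ab

# (.)?\1* : capture one character, then repeat only that same character.
same_run = re.search(r"(.)?\1*", "aab").group()
print(same_run)    # aa

single = re.search(r"(.)?\1*", "ab").group()
print(single)      # a
```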

Python/Regex: Get all strings between any two characters [duplicate]

This question already has answers here:
Match text between two strings with regular expression
(3 answers)
Closed 5 years ago.
I have a use case that requires the identification of many different pieces of text between any two characters.
For example,
String between a single space and (: def test() would return
test
String between a word and space (paste), and a special character (/): #paste "game_01/01" would return "game_01
String between a single space and ( with multiple target strings: } def test2() { Hello(x, 1) would return test2 and Hello
To do this, I'm attempting to write something generic that will identify the shortest string between any two characters.
My current approach is (from chrisz):
pattern = '{0}(.*?){1}'.format(re.escape(separator_1), re.escape(separator_2))
And for the first use case, separator_1 = \s and separator_2 = (. This isn't working so evidently I am missing something but am not sure what.
tl;dr How can I write a generic regex to parse the shortest string between any two characters?
Note: I know there are many examples of this but they seem quite specific and I'm looking for a general solution if possible.
Let me know if this is what you are looking for:
import re
def smallest_between_two(a, b, text):
    # Collect every lazy match between the two escaped separators, then keep the shortest.
    return min(re.findall(re.escape(a)+"(.*?)"+re.escape(b),text), key=len)
print(smallest_between_two(' ', '(', 'def test()'))
print(smallest_between_two('[', ']', '[this one][not this one]'))
print(smallest_between_two('paste ', '/', '#paste "game_01/01"'))
Output:
test
this one
"game_01
To add an explanation to what this does:
re.findall():
Return all non-overlapping matches of pattern in string, as a list of strings
re.escape()
Escape all the characters in pattern except ASCII letters and numbers. This is useful if you want to match an arbitrary literal string that may have regular expression metacharacters in it
(.*?)
. matches any character (except for line terminators)
*? Quantifier: matches between zero and unlimited times, as few times as possible, expanding as needed (lazy)
So our regular expression matches any character (not including line terminators) between two arbitrary escaped strings, and then returns the shortest length string from the list that re.findall() returns.
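One caveat with a plain lazy (.*?): a match starts at the first occurrence of the left separator, so with several target strings in one text (the third use case in the question) the capture can include an earlier separator. The sketch below is one possible way around that, using a negative lookahead so the captured text never contains the left separator. The function name between is hypothetical, and this is an alternative technique, not the answer's method. Note also that passing \s as a separator cannot work with re.escape, since the backslash is escaped and the pattern then looks for a literal backslash followed by s; pass a literal space instead.

```python
import re

def between(left, right, text):
    """Return every run of text enclosed by the literal strings left and right."""
    # (?:(?!left).)*? consumes one character at a time, but only where the
    # left separator does not start, so a capture never spans an earlier
    # occurrence of the left separator.
    pattern = "{0}((?:(?!{0}).)*?){1}".format(re.escape(left), re.escape(right))
    return re.findall(pattern, text)

multi = between(" ", "(", "} def test2() { Hello(x, 1)")
plain = re.findall(re.escape(" ") + "(.*?)" + re.escape("("),
                   "} def test2() { Hello(x, 1)")

print(multi)   # ['test2', 'Hello']
print(plain)   # ['def test2', '{ Hello']
```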

regex lookahead assertion [duplicate]

This question already has answers here:
Regex plus vs star difference? [duplicate]
(9 answers)
Closed 5 years ago.
I'm new to python regex and am learning the lookahead assertion.
I found the following strange. Could someone tell me how it works?
import regex as re
re.search('(\d*)(?<=a)(\.)','1a.')
<regex.Match object; span=(2, 3), match='.'>
re.search('(\d+)(?<=a)(\.)','1a.')
outputs nothing
Why doesn't the second one match anything?
The first pattern:
re.search('(\d*)(?<=a)(\.)', '1a.')
says to find zero or more digits, followed by a dot. Right before the dot, it has a positive lookbehind, which asserts the previous character was an a. In this case, Python will match zero digits, followed by a single dot. The lookbehind fires true, because the preceding character was in fact an a.
However, the second pattern:
re.search('(\d+)(?<=a)(\.)','1a.')
matches one or more digits, followed by the lookbehind and the matching dot. In this case, Python is compelled to match the number 1. But then the lookbehind must fail: if the last character matched was a digit, it cannot be the letter a. So there is no match possible in the second case. Even if we were to remove (?<=a) from the second pattern, it would still fail, because we are not accounting for the letter a.
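The behaviour described above can be checked with a small sketch. It uses the standard-library re module rather than the third-party regex module from the question (for these patterns the two are assumed to behave the same), and the third pattern, which consumes the a explicitly, is added here only for comparison.

```python
import re

text = "1a."

# \d* may match zero digits, so the match can start right before the dot,
# where the lookbehind sees the preceding "a".
m_star = re.search(r"(\d*)(?<=a)(\.)", text)
print(m_star.span(), repr(m_star.group(1)))   # (2, 3) ''

# \d+ must consume "1", after which the preceding character is a digit, not "a".
m_plus = re.search(r"(\d+)(?<=a)(\.)", text)
print(m_plus)                                 # None

# Assumed fix for comparison: consume the "a" explicitly.
m_fixed = re.search(r"(\d+)a(\.)", text)
print(m_fixed.group())                        # 1a.
```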

Python regular expression to match a pattern when preceded by either start of line or whitespace [duplicate]

This question already has answers here:
Python Regex Engine - "look-behind requires fixed-width pattern" Error
(3 answers)
Closed 4 years ago.
I would like to write a regex that matches the word hello, but only when it either starts a line or is preceded by whitespace. I don't want to match the whitespace if it's there; I just need to know that it (or the start of the line) is there.
So I've tried:
r = re.compile('hello(?<=\s|^)')
but this throws:
error: look-behind requires fixed-width pattern
For the sake of an example, if my string to be searched is:
s = 'hello world hello thello'
then I would like my regex to match two times...at the locations in uppercase below:
'HELLO world HELLO thello'
where the first would match because it is preceded by the start of the line, while the second match would be because it is preceded by a space. The last 5 characters would not match because they are preceded by a t.
(?:(?<=\s)|^)hello is what you want. The lookbehind needs to come before hello in the regular expression, and it must indeed be of fixed width: \s is 1 character wide, whereas ^ is 0 characters wide, so you cannot combine them with | inside a single lookbehind. In this case we do not need to; we just alternate (?<=\s) and ^.
Note that this pattern would still match the hello in hellooo; if that is not acceptable, add \b at the end.
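A minimal sketch of this on the string from the question follows. The second pattern, (?<!\S)hello\b, is an assumed alternative that is not part of the answer: a negative lookbehind for a non-whitespace character is fixed-width and also succeeds at the start of the string.

```python
import re

s = "hello world hello thello"

# Pattern from the answer: lookbehind for whitespace, or start of string.
pattern = re.compile(r"(?:(?<=\s)|^)hello")
spans = [m.span() for m in pattern.finditer(s)]
print(spans)       # [(0, 5), (12, 17)]

# Assumed alternative: "not preceded by a non-whitespace character",
# plus \b so that "hellooo" is rejected.
alt = re.compile(r"(?<!\S)hello\b")
alt_spans = [m.span() for m in alt.finditer(s)]
print(alt_spans)   # [(0, 5), (12, 17)]

print(pattern.search("hellooo") is not None)   # True
print(alt.search("hellooo"))                   # None
```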

What does the whitespace in Python RegEx '^(.+?(\d*)) *$' mean? [duplicate]

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 8 years ago.
What does the whitespace in Python RegEx ^(.+?(\d*)) *$ mean?
pat = re.compile('^(.+?(\d*)) *$',re.M)
Does * mean \s*?
Can the whitespace be ignored? i.e. is ^(.+?(\d*)) *$ same as ^(.+?(\d*))*$?
I ran some examples, and it seems that the answers to the above two questions are no.
Thanks!
* means 0 or more occurrences of whatever precedes it, and $ anchors the match to the end of the line, so the space followed by *$ allows trailing spaces. It is a literal space, not \s, so it does not cover tabs.
No. If you remove that space, the * repeats the preceding group instead, which is a different pattern: trailing spaces are then matched by the group itself rather than separately.
As it stands, it matches a line of one or more characters (as few as possible), followed by optional digits and optional trailing spaces.
Actually if debugging I'd have to look up what happens on a line like "12345 " with the non-greedy matching as I'd tend to write myself something like "^(\D+(\d+))\s*$" or "^(\D*.(\d+))\s*$" depending on intention. In old days you had to code against the greedy matching yourself, which means I generally avoid stuff like .+(\d*) through habit. Capturing 0 digits generally is a bug, as is having first digit consumed by .+
You can test this out for yourself on an online regex tool such as http://www.regex101.com
It's just a space character.
For your info, \s is actually 'whitespace', so it matches tabs, form feeds and other characters as well as spaces.
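To see what the literal space does in practice, here is a small sketch. The test strings and the \s variant pat_ws are assumptions made for illustration; only pat comes from the question.

```python
import re

pat = re.compile(r"^(.+?(\d*)) *$", re.M)        # pattern from the question
pat_ws = re.compile(r"^(.+?(\d*))\s*$", re.M)    # assumed variant using \s

# Trailing spaces are matched by " *" and stay out of group 1.
m = pat.search("abc123  ")
print(repr(m.group(1)), repr(m.group(2)))   # 'abc123' '123'

# The lazy .+? must take at least one character, so it takes the first digit.
d = pat.search("12345 ")
print(d.groups())                           # ('12345', '2345')

# A trailing tab is not a space, so it ends up inside group 1 ...
tab_space = pat.search("abc\t").group(1)
print(repr(tab_space))                      # 'abc\t'

# ... whereas \s* absorbs it.
tab_ws = pat_ws.search("abc\t").group(1)
print(repr(tab_ws))                         # 'abc'
```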
