myreg = r"\babc\b"
mystr = "sdf ddabc"
mystr1 = "sdf abc"
print(re.findall(myreg, mystr))   # []
print(re.findall(myreg, mystr1))  # ['abc']
So far everything works as expected, but if I change my regex and my string to:
myreg = r"\b\+abcb\"
mystr = "sdf +abc"
print(re.findall(myreg, mystr))  # [], but I would like to get ['+abc']
I have noticed that using the following works as expected.
myreg = "^\\+abc$"
mystr = "+abc"
mystr1 = "-+abc"
My question: Is it possible to achieve the same results as above without splitting the string?
Best regards,
Gabriel
Your problem is the following:
\b is defined as the boundary between a \w and a \W character
(or vice versa).
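\w contains the character set [a-zA-Z0-9_] (in Python 3, for str patterns, it also includes Unicode letters and digits unless re.ASCII is set)
\W contains the character set [^a-zA-Z0-9_], which means all characters except [a-zA-Z0-9_]
'+' is not contained in \w, and neither is the whitespace before it, so there is no boundary between the whitespace and the '+' for \b to match.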
To get what you want, you should remove the first \b from your pattern:
import re
string = "sdf +abc"
pattern = r"\+abc\b"
matches = re.findall(pattern, string)
print(matches)
['+abc']
There are two problems
Before your + in +abc, there is no word boundary, so \b cannot match.
Your regex \b\+abcb\ tries to match a literal b character after abc (typo).
Word Boundaries
The word boundary \b matches at a position between a word character (letters, digits and underscore) and a non-word character (or a line beginning or ending). For instance, in +abc there is a word boundary between the + and the a.
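One way to see where \b can match is to list the zero-width match positions for the sample string. This is only a small sketch; the variable names are illustrative and not from the question:

```python
import re

text = "sdf +abc"

# Each \b match is zero-width, so start() gives the boundary position.
positions = [m.start() for m in re.finditer(r"\b", text)]
print(positions)  # [0, 3, 5, 8]
```

There is no boundary at position 4, between the space and the +, because both characters are non-word characters.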
Solution: Make your Own boundary
If you want to match +abc but only when it is not preceded by a word character (for instance, you don't want it inside def+abc), then you can make your own boundary with a lookbehind:
(?<!\w)\+abc
This says "match +abc if it is not preceded by a word character (letter, digit, underscore)".
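A minimal sketch of the lookbehind in Python follows. The trailing \b is an assumption carried over from the first answer, and the sample strings are only illustrative:

```python
import re

# Custom left boundary: "+abc" must not be preceded by a word character.
pattern = re.compile(r"(?<!\w)\+abc\b")

standalone = pattern.findall("sdf +abc")  # preceded by a space
embedded = pattern.findall("def+abc")     # preceded by the word character "f"
at_start = pattern.findall("+abc")        # nothing before it at all
print(standalone, embedded, at_start)     # ['+abc'] [] ['+abc']
```

The negative lookbehind also succeeds at the very start of the string, which a lookbehind that requires a non-word character would not.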
Related
I would like to match a word when it is after a char m or b
So for example, when the word is men, I would like to return en (only the word that is following m), if the word is beetles then return eetles
Initially I tried (m|b)\w+ but it matches the entire men not en
How do I write regex expression in this case?
Thank you!
You can match only the part after the prefix by using a positive lookbehind. It asserts that what is directly to the left is either m or b (the character class [mb]), preceded by a word boundary \b:
(?<=\b[mb])\w+
(?<= Positive lookbehind, assert what is directly to the left is
\b[mb] Word boundary, match either m or b
) Close lookbehind
\w+ Match 1+ word chars
If the word characters must be followed by whitespace or the end of the string, you can assert a whitespace boundary on the right using (?!\S):
(?<=\b[mb])\w+(?!\S)
Example code
import re
test_str = ("beetles men")
regex = r"(?<=\b[mb])\w+"
print(re.findall(regex, test_str))
Output
['eetles', 'en']
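To compare the two patterns from this answer, here is a short sketch on a made-up input where one word is followed by punctuation:

```python
import re

text = "men met. mend"  # illustrative input, not from the question

without_boundary = re.findall(r"(?<=\b[mb])\w+", text)
with_boundary = re.findall(r"(?<=\b[mb])\w+(?!\S)", text)
print(without_boundary)  # ['en', 'et', 'end']
print(with_boundary)     # ['en', 'end']
```

With (?!\S), the "et" in "met." is rejected because the word is followed by a dot rather than whitespace or the end of the string.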
You may use
\b[mb](\w+)
NOTE: When your known prefixes include multicharacter sequences, say, you want to find words starting with m or be, you will have to use a non-capturing group rather than a character class: \b(?:m|be)(\w+). The current solution can thus be written as \b(?:m|b)(\w+) (however, a character class here looks more natural, unless you have to build the regex dynamically).
Details
\b - a word boundary
[mb] - m or b
(\w+) - Capturing group 1: any one or more word chars, letters, digits or underscores. To match only letters, use ([^\W\d_]+) instead.
Python demo:
import re
rx = re.compile(r'\b[mb](\w+)')
text = "The words are men and beetles."
# First occurrence:
m = rx.search(text)
if m:
    print(m.group(1)) # => en
# All occurrences
print(rx.findall(text)) # => ['en', 'eetles']
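If the prefixes have to be built dynamically, as the note above mentions, something along these lines could work. The prefix list is hypothetical, and sorting by length is an extra precaution so that a longer prefix is tried before a shorter one it starts with:

```python
import re

prefixes = ["m", "be"]  # hypothetical prefix list

# Escape each prefix and put longer prefixes first in the alternation.
ordered = sorted(prefixes, key=len, reverse=True)
alternation = "|".join(re.escape(p) for p in ordered)
rx = re.compile(rf"\b(?:{alternation})(\w+)")

text = "The words are men and beetles."
found = rx.findall(text)
print(found)  # ['en', 'etles']
```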
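(?<=[mb])\w+
You can use the regex above. It matches the word characters that follow an m or b. Note that without a \b inside the lookbehind, it also matches after an m or b in the middle of a word.
(?<=[mb]): positive lookbehind asserting that the preceding character is m or b
\w+: matches one or more word characters (equal to [a-zA-Z0-9_]+ for ASCII text)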
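To see what the word boundary changes, this sketch runs the lookbehind with and without \b on a made-up input that has an m in the middle of a word:

```python
import re

text = "number men"  # illustrative input

no_boundary = re.findall(r"(?<=[mb])\w+", text)
with_boundary = re.findall(r"(?<=\b[mb])\w+", text)
print(no_boundary)    # ['ber', 'en']
print(with_boundary)  # ['en']
```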
I have a decent familiarity with regex but this is tricky. I need to find instances like this from a SQL case statement:
when col_name = 'this can be a word or sentence'
I can match the above when it's just one word, but when it's more than one word it's not working.
s = """when col_name = 'a sentence of words'"""
x = re.search("when\s(\w+)\s*=\s*\'(\w+)", s)
if x:
    print(x.group(1)) # this returns "col_name"
    print(x.group(2)) # this returns "a"
I want group(2) to return "a sentence of words" but I'm just getting the first word. That part could either be one word or several. How to do it?
When I add in the second \', then I get no match:
x = re.search("when\s(\w+)\s*=\s*\'(\w+)\'", s)
You may match all characters other than single quotation mark rather than matching letters, digits and connector punctuation ("word" chars) with the Group 2 pattern:
import re
s = """when col_name = 'a sentence of words'"""
x = re.search(r"when\s+(\w+)\s*=\s*'([^']+)", s)
if x:
    print(x.group(1)) # this returns "col_name"
    print(x.group(2)) # this returns "a sentence of words"
The [^'] is a negated character class that matches any char but a single quotation mark.
In case the string can contain escaped single quotes, you may consider replacing ([^']+) with one of the following:
If the escape char is ': ([^']*(?:''[^']*)*)
If the escape char is \: ([^\\']*(?:\\.[^'\\]*)*).
Note the use of the raw string literal to define the regex pattern (all backslashes are treated as literal backslashes inside it).
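The two escaped-quote variants could be tried out as follows. This is a sketch under assumptions: the sample strings are invented, and a closing ' has been appended to each pattern so that the match has to end at the real closing quote:

```python
import re

# Quote escaped by doubling it (SQL style): ''
doubled = re.compile(r"when\s+(\w+)\s*=\s*'([^']*(?:''[^']*)*)'")
# Quote escaped with a backslash: \'
backslashed = re.compile(r"when\s+(\w+)\s*=\s*'([^\\']*(?:\\.[^'\\]*)*)'")

m1 = doubled.search("when col_name = 'it''s a sentence'")
m2 = backslashed.search(r"when col_name = 'it\'s a sentence'")
print(m1.group(2))  # it''s a sentence
print(m2.group(2))  # it\'s a sentence
```

The captured text still contains the escape sequences; unescaping them is a separate step.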
I am new to regex and I would like some help. I have the string below, and I want my regex to match the first character of the acronym literally, followed by any characters in [a-z] any number of times, but only for the first character. The rest of the characters should be matched as they are. Any help on what to change in my regex line to achieve this would be highly appreciated.
import re
s = 'nUSA stands for northern USA'
x = (f'({"nUSA"}).+?({" ".join( t[0] + "[a-z]+" + t[1:] for t in "nUSA")})(?: )')
print(x)
out: (nUSA).+?(n[a-z]+ U[a-z]+ S[a-z]+ A[a-z]+)(?: )
What I want to achieve with my regex line is something like the pattern below, so that it can match "northern USA":
(nUSA).+?(n[a-z]+ U + S + A)(?: )
instead of the one I get:
(nUSA).+?(n[a-z]+ U[a-z]+ S[a-z]+ A[a-z]+)(?: )
I would like it to work for any arbitrary text, not only for this specific one. I am not sure if I have expressed my problem properly.
You may use
import re
s = 'nUSA stands for northern USA'
key='nUSA'
x = rf'\b({key})\b.+?\b({key[0]}[a-z]*\s*{key[1:]})(?!\S)'
# => print(x) => \b(nUSA)\b.+?\b(n[a-z]*\s*USA)(?!\S)
# Or, if the key can contain special chars at the end:
# x = rf'\b({re.escape(key)})(?!\w).+?(?<!\w)({re.escape(key[0])}[a-z]*\s*{re.escape(key[1:])})(?!\S)'
print(re.findall(x, s))
# => [('nUSA', 'northern USA')]
The resulting regex will look like \b(nUSA)\b.+?\b(n[a-z]*\s*USA)(?!\S). Details:
\b - word boundary
(nUSA) - Group 1 capturing the key word
\b / (?!\w) - word boundary (right-hand word boundary)
.+? - any 1+ chars other than linebreak chars as few as possible
\b - word boundary
(n[a-z]*\s*USA) - Group 2: n (first char), then any 0+ lowercase ASCII letters, 0+ whitespaces and the rest of the key string.
(?!\S) - a right-hand whitespace boundary (you may consider using (?!\w) again here).
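Since the question asks for something that works on arbitrary text, the escaped variant from the commented-out line above could be wrapped in a small helper. The function name find_expansions is hypothetical, and the sketch assumes the key starts with a word character (because of the leading \b):

```python
import re

def find_expansions(key, text):
    """Return (key, expansion) pairs found in text; hypothetical helper."""
    first, rest = re.escape(key[0]), re.escape(key[1:])
    pattern = (
        rf"\b({re.escape(key)})(?!\w)"            # the acronym itself
        rf".+?(?<!\w)({first}[a-z]*\s*{rest})"    # expanded first letter + rest
        r"(?!\S)"                                 # whitespace or end on the right
    )
    return re.findall(pattern, text)

pairs = find_expansions("nUSA", "nUSA stands for northern USA")
print(pairs)  # [('nUSA', 'northern USA')]
```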
I want to compile a regular expression in Python to search for these values: values_list = ['4.7"', '5.1"', '5.5"'].
This code can't find 5.1" at all:
regex = re.compile(r'\b(' + '5.1"' + r')\b')
regex.findall('5.1"')
The last word boundary \b is what makes the match fail. Modify the pattern to:
regex = re.compile(r'\b(' + '5.1"' + r')')
and it'll work.
From the docs:
The metacharacter \b is an anchor like the caret and the dollar sign. It matches at a position that is called a "word boundary". This match is zero-length.
There are three different positions that qualify as word boundaries:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
Pay close attention to the second bullet point.
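If a right-hand boundary is still wanted for all the values in values_list, one possible sketch is to replace \b with lookarounds and escape the values so the dot is literal. The sample text is invented for illustration:

```python
import re

values_list = ['4.7"', '5.1"', '5.5"']

# Original pattern: the \b after the closing quote needs a word character
# right after it, so it fails at the end of the string.
original = re.findall(r'\b(5.1")\b', '5.1"')

# Escape each value and require "no word character" on both sides instead.
alternation = "|".join(re.escape(v) for v in values_list)
rx = re.compile(rf'(?<!\w)(?:{alternation})(?!\w)')
found = rx.findall('Sizes: 4.7", 5.1" and 15.5"')
print(original)  # []
print(found)     # ['4.7"', '5.1"']
```

The 5.5" inside 15.5" is skipped because it is preceded by the word character 1.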
I want to accept words that include these characters: '.:?!'
I tried this pattern: r'*(.|:|?|!)*', but it does not work.
How can I implement this using Python's re module?
The code I'd like to run looks like this:
import re
pattern = r'*(.|:|?|!)*'
word = '.Flask'
match = re.match(pattern, word)
if match:
    print('yes')
For example, I want to accept these words: '.flask', 'flask.', '!flask', 'flask!', ...
and even non-ASCII characters, so I want to include words such as .日本語 and 日本語. as well.
That's why I wanted to use the * symbol.
If you want to match words starting or ending with one of these characters, this regex should fit your needs:
pattern = r'[.:?!][^ ]*|[^ ]*[.:?!]'
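Or with word boundaries (note that \b needs a word character on one side, so with re.match this variant does not match words such as '.flask' or 'flask.', where the punctuation sits at the very start or end of the string):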
pattern = r'\b[.:?!][^ ]*\b|\b[^ ]*[.:?!]\b'
Explanation:
[.:?!][^ ]* matches words beginning with one of [.:?!] followed by any characters other than a space.
[^ ]*[.:?!] matches words beginning with any characters other than a space and ending with one of [.:?!].
\b matches the word boundaries.
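Both patterns can be checked against the words from the question with re.match. This is a sketch; the plain word 'flask' is added as a negative case and is not from the question:

```python
import re

plain = re.compile(r'[.:?!][^ ]*|[^ ]*[.:?!]')
bounded = re.compile(r'\b[.:?!][^ ]*\b|\b[^ ]*[.:?!]\b')

words = ['.flask', 'flask.', '!flask', 'flask!', '.日本語', '日本語.', 'flask']

plain_hits = [w for w in words if plain.match(w)]
bounded_hits = [w for w in words if bounded.match(w)]
print(plain_hits)    # every word except the plain 'flask'
print(bounded_hits)  # []
```

The \b variant matches none of these standalone words, because punctuation at the very start or end of a string has no word character next to it on the outer side, so there is no word boundary there.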