RegEx match word in string containing + and - using re.findall() Python - python

myreg = r"\babcb\"
mystr = "sdf ddabc"
mystr1 = "sdf abc"
print(re.findall(myreg, mystr))   # []
print(re.findall(myreg, mystr1))  # ['abc']
Until now everything works as expected, but if I change my regex and my string to:
myreg = r"\b\+abcb\"
mystr = "sdf +abc"
print(re.findall(myreg, mystr))  # [], but I would like to get ['+abc']
I have noticed that using the following works as expected.
myreg = "^\\+abc$"
mystr = "+abc"
mystr1 = "-+abc"
My question: Is it possible to achieve the same results as above without splitting the string?
Best regards,
Gabriel

Your problem is the following:
\b is defined as the boundary between a \w and a \W character
(or vice versa).
\w contains the character set [a-zA-Z0-9_]
\W contains the character set [^a-zA-Z0-9_], which means all characters except [a-zA-Z0-9_]
'+' is not contained in \w, and neither is the whitespace before it, so there is no boundary between the whitespace and the '+' for \b to match.
To get what you want, you should remove the first \b from your pattern:
import re
string = "sdf +abc"
pattern = r"\+abc\b"
matches = re.findall(pattern, string)
print(matches)
['+abc']
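The boundary behaviour described above can be checked directly. This is a minimal sketch; the variable names and the last sample string ("sdf+abc") are illustrative assumptions and not part of the question:

```python
import re

# \b only matches where a word character meets a non-word character
# (or the start/end of the string).
plain = re.findall(r"\babc\b", "sdf abc")        # boundary between " " and "a"
with_b = re.findall(r"\b\+abc\b", "sdf +abc")    # " " and "+" are both non-word: no boundary
without_b = re.findall(r"\+abc\b", "sdf +abc")   # leading \b dropped
glued = re.findall(r"\b\+abc\b", "sdf+abc")      # boundary between "f" and "+"

print(plain)      # ['abc']
print(with_b)     # []
print(without_b)  # ['+abc']
print(glued)      # ['+abc']
```

The last case shows that \b before \+ does match when a word character comes right before the plus sign, which is usually the opposite of what is wanted here.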

There are two problems
Before your + in +abc, there is no word boundary, so \b cannot match.
Your regex \b\+abcb\ tries to match a literal b character after abc (typo).
Word Boundaries
The word boundary \b matches at a position between a word character (letters, digits and underscore) and a non-word character (or a line beginning or ending). For instance, in +abc there is a word boundary between the + and the a, but none between a preceding space and the +.
Solution: Make Your Own Boundary
If you want to match +abc but only when it is not preceded by a word character (for instance, you don't want it inside def+abc), then you can make your own boundary with a lookbehind:
(?<!\w)\+abc
This says "match +abc if it is not preceded by a word character (letter, digit, underscore)".
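A short sketch of the lookbehind in use. The trailing \b and the sample text are assumptions added for illustration; the answer's pattern itself is only (?<!\w)\+abc:

```python
import re

# (?<!\w) acts as a hand-made left boundary: no word character may precede "+".
# The trailing \b (an addition for this sketch) keeps "+abcd" from matching.
pattern = re.compile(r"(?<!\w)\+abc\b")

text = "sdf +abc def+abc +abcd"
matches = pattern.findall(text)
print(matches)  # ['+abc']
```

Only the first occurrence is returned: the one in def+abc is preceded by a word character, and +abcd has no boundary after the c.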

Related

regex match a word after a certain character

I would like to match a word when it is after a char m or b
So for example, when the word is men, I would like to return en (only the word that is following m), if the word is beetles then return eetles
Initially I tried (m|b)\w+ but it matches the entire men not en
How do I write regex expression in this case?
Thank you!
You could get the match only using a positive lookbehind asserting what is on the left is either m or b using character class [mb] preceded by a word boundary \b
(?<=\b[mb])\w+
(?<= Positive lookbehind, assert what is directly to the left is
\b[mb] Word boundary, match either m or b
) Close lookbehind
\w+ Match 1+ word chars
If nothing other than whitespace may follow the word characters, you can assert a whitespace boundary on the right using (?!\S)
(?<=\b[mb])\w+(?!\S)
Example code
import re
test_str = ("beetles men")
regex = r"(?<=\b[mb])\w+"
print(re.findall(regex, test_str))
Output
['eetles', 'en']
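To compare the two variants side by side, here is a small sketch. The sample text with trailing punctuation is an assumption chosen to show where (?!\S) makes a difference:

```python
import re

text = "men beetles. mat, bat"

# Without a right-hand check, trailing punctuation is simply left out of the match.
loose = re.findall(r"(?<=\b[mb])\w+", text)

# With (?!\S), the word characters must be followed by whitespace or the end of the string.
strict = re.findall(r"(?<=\b[mb])\w+(?!\S)", text)

print(loose)   # ['en', 'eetles', 'at', 'at']
print(strict)  # ['en', 'at']
```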
You may use
\b[mb](\w+)
NOTE: When your known prefixes include multicharacter sequences, say, you want to find words starting with m or be, you will have to use a non-capturing group rather than a character class: \b(?:m|be)(\w+). The current solution can thus be written as \b(?:m|b)(\w+) (however, a character class here looks more natural, unless you have to build the regex dynamically).
Details
\b - a word boundary
[mb] - m or b
(\w+) - Capturing group 1: any one or more word chars, letters, digits or underscores. To match only letters, use ([^\W\d_]+) instead.
Python demo:
import re
rx = re.compile(r'\b[mb](\w+)')
text = "The words are men and beetles."
# First occurrence:
m = rx.search(text)
if m:
    print(m.group(1)) # => en
# All occurrences
print( rx.findall(text) ) # => ['en', 'eetles']
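(?<=[mb])\w+
You can use the regex above. It matches the word characters that follow an m or b. Note that without a leading \b the m or b may sit anywhere in a word, not only at its start.
(?<=[mb]): positive lookbehind asserting that the preceding character is m or b
\w+: matches one or more word characters (equal to [a-zA-Z0-9_]+ for ASCII text)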
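The difference between this lookbehind and the \b-anchored one from the earlier answer shows up on words that contain m or b in the middle. A minimal sketch, with a sample text that is an assumption:

```python
import re

text = "men number beetles"

# No \b: any m or b can serve as the lookbehind, even inside "number".
anywhere = re.findall(r"(?<=[mb])\w+", text)

# With \b: the m or b has to be the first letter of a word.
word_start = re.findall(r"(?<=\b[mb])\w+", text)

print(anywhere)    # ['en', 'ber', 'eetles']
print(word_start)  # ['en', 'eetles']
```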

how to match either word or sentence in this Python regex?

I have a decent familiarity with regex but this is tricky. I need to find instances like this from a SQL case statement:
when col_name = 'this can be a word or sentence'
I can match the above when it's just one word, but when it's more than one word it's not working.
s = """when col_name = 'a sentence of words'"""
x = re.search("when\s(\w+)\s*=\s*\'(\w+)", s)
if x:
    print(x.group(1)) # this returns "col_name"
    print(x.group(2)) # this returns "a"
I want group(2) to return "a sentence of words" but I'm just getting the first word. That part could either be one word or several. How to do it?
When I add in the second \', then I get no match:
x = re.search("when\s(\w+)\s*=\s*\'(\w+)\'", s)
You may match all characters other than single quotation mark rather than matching letters, digits and connector punctuation ("word" chars) with the Group 2 pattern:
import re
s = """when col_name = 'a sentence of words'"""
x = re.search(r"when\s+(\w+)\s*=\s*'([^']+)", s)
if x:
    print(x.group(1)) # this returns "col_name"
    print(x.group(2)) # this returns "a sentence of words"
The [^'] is a negated character class that matches any char but a single quotation mark.
In case the string can contain escaped single quotes, you may consider replacing [^'] with
If the escape char is ': ([^']*(?:''[^']*)*)
If the escape char is \: ([^\\']*(?:\\.[^'\\]*)*).
Note the use of the raw string literal to define the regex pattern (all backslashes are treated as literal backslashes inside it).
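The two escaped-quote variants can be tried as follows. This is a sketch under assumptions: the sample strings are made up, and a closing quote has been appended to each pattern so the whole literal is consumed:

```python
import re

# Variant 1: a quote inside the literal is escaped by doubling it ('').
doubled = re.compile(r"when\s+(\w+)\s*=\s*'([^']*(?:''[^']*)*)'")
m1 = doubled.search("when col_name = 'it''s a sentence'")
print(m1.group(2))  # it''s a sentence

# Variant 2: a quote inside the literal is escaped with a backslash (\').
backslashed = re.compile(r"when\s+(\w+)\s*=\s*'([^\\']*(?:\\.[^'\\]*)*)'")
m2 = backslashed.search(r"when col_name = 'it\'s a sentence'")
print(m2.group(2))  # it\'s a sentence
```

In both cases the captured text still contains the escape sequence; turning it back into a plain quote would be a separate step.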

regex to match literally the characters of an abbreviation

I am new to regex and I would like some help. I have the string below, and I want my regex to match the first character of the acronym literally, followed by any number of [a-z] characters, but only for the first character. The rest of the characters should be matched as they are. Any help on what to change in my regex line to achieve this would be highly appreciated.
import re
s = 'nUSA stands for northern USA'
x = (f'({"nUSA"}).+?({" ".join( t[0] + "[a-z]+" + t[1:] for t in "nUSA")})(?: )')
print(x)
out: (nUSA).+?(n[a-z]+ U[a-z]+ S[a-z]+ A[a-z]+)(?: )
What I want to achieve with my regex line is something like the pattern below, so that it can match northern USA.
(nUSA).+?(n[a-z]+ U + S + A)(?: )
instead of the one I get
(nUSA).+?(n[a-z]+ U[a-z]+ S[a-z]+ A[a-z]+)(?: )
I would like it to work for any arbitrary text, not only for this specific one. I am not sure if I have expressed my problem properly.
You may use
import re
s = 'nUSA stands for northern USA'
key='nUSA'
x = rf'\b({key})\b.+?\b({key[0]}[a-z]*\s*{key[1:]})(?!\S)'
# => print(x) => \b(nUSA)\b.+?\b(n[a-z]*\s*USA)(?!\S)
# Or, if the key can contain special chars at the end:
# x = rf'\b({re.escape(key)})(?!\w).+?(?<!\w)({re.escape(key[0])}[a-z]*\s*{re.escape(key[1:])})(?!\S)'
print(re.findall(x, s))
# => [('nUSA', 'northern USA')]
The resulting regex will look like \b(nUSA)\b.+?\b(n[a-z]*\s*USA)(?!\S). Details:
\b - word boundary
(nUSA) - Group 1 capturing the key word
\b / (?!\w) - word boundary (right-hand word boundary)
.+? - any 1+ chars other than linebreak chars as few as possible
\b - word boundary
(n[a-z]*\s*USA) - Group 2: n (first char), then any 0+ lowercase ASCII letters, 0+ whitespaces and the rest of the key string.
(?!\S) - a right-hand whitespace boundary (you may consider using (?!\w) again here).
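To reuse this for arbitrary keys, the pattern construction can be wrapped in a helper. This is only a sketch: the function name build_abbreviation_pattern and the second sample key are hypothetical, and it follows the escaped variant shown in the comment above:

```python
import re

def build_abbreviation_pattern(key):
    """Build a regex matching `key` and, later in the text, its expansion.

    Hypothetical helper: only the first character of the key is expanded
    with lowercase letters; the rest of the key is matched literally.
    """
    first, rest = re.escape(key[0]), re.escape(key[1:])
    return (rf'(?<!\w)({re.escape(key)})(?!\w)'
            rf'.+?(?<!\w)({first}[a-z]*\s*{rest})(?!\S)')

found_usa = re.findall(build_abbreviation_pattern('nUSA'),
                       'nUSA stands for northern USA')
found_eu = re.findall(build_abbreviation_pattern('sEU'),
                      'sEU means southern EU today')

print(found_usa)  # [('nUSA', 'northern USA')]
print(found_eu)   # [('sEU', 'southern EU')]
```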

Use regular expressions to search for a number + " in Python

I want to compile a Regular Expression in Python to search for these values values_list = ['4.7"', '5.1"', '5.5"'].
This code can't find 5.1" at all:
regex = re.compile(r'\b(' + '5.1"' + r')\b')
regex.findall('5.1"')
The last word boundary \b is what makes the match fail: the " is not a word character and nothing follows it, so there is no word boundary at that position. Modify the pattern to:
regex = re.compile(r'\b(' + '5.1"' + r')')
and it'll work.
From the docs:
The metacharacter \b is an anchor like the caret and the dollar sign. It matches at a position that is called a "word boundary". This match is zero-length.
There are three different positions that qualify as word boundaries:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
Pay close attention to the second bullet point.
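For the whole values_list, one possible alternative is to escape each value and replace \b with lookarounds. This is a sketch and not the answer's own pattern: the (?<!\w) and (?!\w) guards and the sample text are assumptions, and re.escape is used so that the dot matches a literal dot only:

```python
import re

values_list = ['4.7"', '5.1"', '5.5"']

# Escape every value so "." is literal, then join them into one alternation.
alternatives = '|'.join(re.escape(value) for value in values_list)

# Lookarounds instead of \b: they also work next to the non-word character ".
regex = re.compile(rf'(?<!\w)(?:{alternatives})(?!\w)')

text = 'Screens: 4.7" and 5.1", also 15.1" or 5x1"'
found = regex.findall(text)
print(found)  # ['4.7"', '5.1"']
```

The value 15.1" is skipped because a digit precedes the 5, and 5x1" is skipped because the escaped dot no longer matches an arbitrary character.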

regular expression python `r'*(.|:|?|!)*'`

I want to accept words that include these characters: '.:?!'
I tried this as a pattern: r'*(.|:|?|!)*', but it does not work.
How do I implement this using Python's re module?
the code I'd like to run is like this:
import re
pattern = r'*(.|:|?|!)*'
word = '.Flask'
match = re.match(pattern, word)
if match:
    print('yes')
For example, I want to accept these words: '.flask', 'flask.', '!flask', 'flask!', ...
and even non-ASCII characters, so I want to include these words as well: .日本語, 日本語.
That's why I wanted to use the * symbol.
If you want to match words starting or ending with one of these characters, this regex should fit your needs:
pattern = r'[.:?!][^ ]*|[^ ]*[.:?!]'
Adding word boundaries does not help here, though. \b needs a word character on one side, so a pattern such as
pattern = r'\b[.:?!][^ ]*\b|\b[^ ]*[.:?!]\b'
fails on '.flask' and 'flask.', where the punctuation sits at the very start or end of the string.
Explanation:
[.:?!][^ ]* matches words beginning with one of [.:?!], followed by any characters except a space.
[^ ]*[.:?!] matches words made of any characters except a space and ending with one of [.:?!].
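A quick check of the first pattern against the sample words from the question. This is a sketch: re.fullmatch is used here instead of re.match as an assumption, so that the whole word has to fit the pattern, and the rejected samples are made up:

```python
import re

pattern = re.compile(r'[.:?!][^ ]*|[^ ]*[.:?!]')

words = ['.Flask', 'flask.', '!flask', 'flask!', '.日本語', '日本語.',
         'flask', 'fla.sk']

# Keep only the words that start or end with one of . : ? !
accepted = [word for word in words if pattern.fullmatch(word)]
print(accepted)
# ['.Flask', 'flask.', '!flask', 'flask!', '.日本語', '日本語.']
```

Non-ASCII words pass because [^ ] accepts every character except a space.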
