Use regular expressions to search for a number + " in Python - python

I want to compile a regular expression in Python to search for these values: values_list = ['4.7"', '5.1"', '5.5"'].
This code can't find 5.1" at all:
import re
regex = re.compile(r'\b(' + '5.1"' + r')\b')
regex.findall('5.1"')  # returns []

The trailing word boundary \b is what makes the match fail. Change the pattern to:
regex = re.compile(r'\b(' + '5.1"' + r')')
and it'll work.
From the docs:
The metacharacter \b is an anchor like the caret and the dollar sign. It matches at a position that is called a "word boundary". This match is zero-length.
There are three different positions that qualify as word boundaries:
- Before the first character in the string, if the first character is a word character.
- After the last character in the string, if the last character is a word character.
- Between two characters in the string, where one is a word character and the other is not a word character.
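Pay close attention to the second bullet point: in '5.1"' the last character is ", which is not a word character, so there is no word boundary after it.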
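The effect of the trailing \b can be checked directly. The following is a minimal sketch: the sample text and the use of re.escape to build a single alternation from values_list are illustrative assumptions, not part of the original answer.

```python
import re

values_list = ['4.7"', '5.1"', '5.5"']

# With a trailing \b there is no match: '"' is not a word character and
# nothing follows it, so no word boundary exists after it.
with_trailing_boundary = re.compile(r'\b(' + re.escape('5.1"') + r')\b')
no_match = with_trailing_boundary.findall('5.1"')

# Without the trailing \b the values are found. re.escape also makes the
# '.' literal, so '5.1"' cannot accidentally match something like '5x1"'.
alternation = "|".join(re.escape(value) for value in values_list)
regex = re.compile(r'\b(' + alternation + r')')
found = regex.findall('Sizes: 4.7", 5.1" and 5.5"')

print(no_match)
print(found)
```

Escaping the values matters as soon as they come from data rather than from a hand-written pattern.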

Related

Python - remove punctuation marks at the end and at the beginning of one or more words

I want to know how to remove punctuation marks at the beginning and at the end of one or more words.
Punctuation marks inside a word should not be removed.
For example:
input:
word = "!.test-one,-"
output:
word = "test-one"
Use strip:
>>> import string
>>> word = "!.test-one,-"
>>> word.strip(string.punctuation)
'test-one'
The best solution is to use the .strip(chars) method of Python's built-in str class.
Another approach is to use a regular expression with the re module.
To understand what strip() and the regular expression do, look at the two functions below, which duplicate the behavior of strip(). The first uses recursion, the second uses while loops:
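chars = r'''!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~'''
def cstm_strip_1(word, chars):
    # Approach using recursion:
    if not word:
        # Nothing left to strip (also ends the recursion for all-punctuation input).
        return word
    # Drop the first and/or last character if it is in chars.
    w = word[1 if word[0] in chars else 0: -1 if word[-1] in chars else None]
    if w == word:
        return w
    else:
        return cstm_strip_1(w, chars)
def cstm_strip_2(word, chars):
    # Approach using while loops:
    # i is the index of the first kept character, j is one past the last kept one.
    i, j = 0, len(word)
    while i < j and word[i] in chars:
        i += 1
    while j > i and word[j - 1] in chars:
        j -= 1
    return word[i:j]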
import re, string
chars = string.punctuation
word = "~!.test-one^&test-one--two???"
wsc = word.strip(chars)
assert wsc == cstm_strip_1(word, chars)
assert wsc == cstm_strip_2(word, chars)
assert wsc == re.sub(r"(^[^\w]+)|([^\w]+$)", "", word)
word = "__~!.test-one^&test-one--two??__"
wsc = word.strip(chars)
assert wsc == cstm_strip_1(word, chars)
assert wsc == cstm_strip_2(word, chars)
# assert wsc == re.sub(r"(^[^\w]+)|([^\w]+$)", "", word)
assert re.sub(r"(^[^\w]+)|([^\w]+$)", "", word) == word
print(re.sub(r"(^[^\w]+)|([^\w]+$)", "", word), '!=', wsc )
print('"',re.sub(r"(^[^\w]+)|([^\w]+$)", "", "\tword\t"), '" != "', "\tword\t".strip(chars), '"', sep='' )
Notice that the result of the regular expression can differ from the result of .strip(string.punctuation), because the set of characters matched by [^\w] differs from the set of characters in string.punctuation. For example, [^\w] also matches whitespace, while the underscore is in string.punctuation but counts as a word character.
SUPPLEMENT
What does the regular expression pattern:
(^[^\w]+)|([^\w]+$)
mean?
A detailed explanation follows:
The '|' character means 'or'. It provides two alternatives for the
  sub-string (called a match) that is to be found in the given string.
'(^[^\w]+)' is the first of the two alternatives for a match.
  '(' and ')' enclose what is called a "capturing group": (^[^\w]+)
  The first of the two '^' asserts the position at the start of the
    string (or of each line if re.MULTILINE is set).
  '\w' (a 'w' escaped with '\') means "word character"
    (for ASCII text: letters a-z, A-Z, digits 0-9 and the underscore '_';
    for str patterns Python 3 also counts non-ASCII letters and digits
    such as 'ö', unless re.ASCII is set).
  The second of the two '^' means logical "not"
    (here: not a "word character"),
    i.e. all characters except word characters
    (for example '~', '-' or a space).
  Notice that the meaning of '^' depends on context:
    '^' outside of [ ] means start of line/string
    '^' inside of [ ] as the first character means logical not,
        and anywhere else inside [ ] it means itself
  '[' and ']' enclose the specification of a set of characters
    and match exactly one occurrence of one of them
  '+' means one or more occurrences of the preceding token
'([^\w]+$)' is the second alternative for a match.
  It differs from the first by requiring that the match
  be found at the end of the string.
  '$' means "end of the line" (or "end of the string")
The regular expression pattern tells the regular expression engine to work roughly as follows:
The engine looks at the start of the string for a non-word character.
If one is found, it is remembered as part of a match, and the next
character is checked and added to the match if it is also a non-word
character. In this way the start of the string is checked for a run of
non-word characters, which are then removed when the pattern is used in
re.sub(r"(^[^\w]+)|([^\w]+$)", "", word): re.sub replaces the matched
characters with an empty string (in other words, it deletes them from
the string).
Once the engine hits the first word character, the first alternative
can no longer match, because it is anchored to the start of the string.
Only the second alternative remains, and it can match only a run of
non-word characters that extends to the end of the string.
This way, non-word characters in the middle of the string are never
part of a match.
At the end of the string the engine therefore finds the trailing run of
non-word characters, and they are removed in the same way as the
leading ones.
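To see which alternative is responsible for which part of the result, the matches can be listed with re.finditer. This is only an illustrative sketch; the variable names are assumptions, and the input is the word from the question.

```python
import re

pattern = re.compile(r"(^[^\w]+)|([^\w]+$)")
word = "!.test-one,-"

# Each match fills exactly one of the two capturing groups:
# group 1 for the leading run, group 2 for the trailing run.
matches = [(m.group(1), m.group(2)) for m in pattern.finditer(word)]

# The '-' in the middle is never matched: it is neither at the start
# of the string nor part of a run that reaches the end.
stripped = pattern.sub("", word)

print(matches)
print(stripped)
```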
Using re.sub
import re
word = "!.test-one,-"
out = re.sub(r"(^[^\w]+)|([^\w]+$)", "", word)
print(out)
Output:
test-one
Check this example using slicing (note that it removes at most one character from each end):
import string
sentence = "_blogs that are consistently updated by people that know about the trends, and market, and care about giving quality content to their readers."
if sentence[0] in string.punctuation:
    sentence = sentence[1:]
if sentence[-1] in string.punctuation:
    sentence = sentence[:-1]
print(sentence)
Output:
blogs that are consistently updated by people that know about the trends, and market, and care about giving quality content to their readers

how to match either word or sentence in this Python regex?

I have a decent familiarity with regex but this is tricky. I need to find instances like this from a SQL case statement:
when col_name = 'this can be a word or sentence'
I can match the above when it's just one word, but when it's more than one word it's not working.
s = """when col_name = 'a sentence of words'"""
x = re.search("when\s(\w+)\s*=\s*\'(\w+)", s)
if x:
    print(x.group(1)) # this returns "col_name"
    print(x.group(2)) # this returns "a"
I want group(2) to return "a sentence of words" but I'm just getting the first word. That part could either be one word or several. How to do it?
When I add in the second \', then I get no match:
x = re.search("when\s(\w+)\s*=\s*\'(\w+)\'", s)
You may match all characters other than single quotation mark rather than matching letters, digits and connector punctuation ("word" chars) with the Group 2 pattern:
import re
s = """when col_name = 'a sentence of words'"""
x = re.search(r"when\s+(\w+)\s*=\s*'([^']+)", s)
if x:
    print(x.group(1)) # this returns "col_name"
    print(x.group(2)) # this returns "a sentence of words"
The [^'] is a negated character class that matches any character except a single quotation mark.
In case the string can contain escaped single quotes, you may consider replacing the ([^']+) group with one of the following:
If the escape char is ': ([^']*(?:''[^']*)*)
If the escape char is \: ([^\\']*(?:\\.[^'\\]*)*)
Note the use of the raw string literal to define the regex pattern (all backslashes are treated as literal backslashes inside it).
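The two escaped-quote variants can be tried out as follows. This is a hedged sketch: the sample strings, the closing quote added after the group, and the final replace() call that undoes the doubling are assumptions made for illustration.

```python
import re

# Variant 1: quotes escaped by doubling them (SQL style).
doubled = re.compile(r"when\s+(\w+)\s*=\s*'([^']*(?:''[^']*)*)'")
m1 = doubled.search("when col_name = 'it''s a sentence'")
raw_doubled = m1.group(2)
value_doubled = raw_doubled.replace("''", "'")  # undo the doubling

# Variant 2: quotes escaped with a backslash.
backslashed = re.compile(r"when\s+(\w+)\s*=\s*'([^\\']*(?:\\.[^'\\]*)*)'")
m2 = backslashed.search(r"when col_name = 'it\'s a sentence'")
raw_backslashed = m2.group(2)

print(m1.group(1), "->", value_doubled)
print(m2.group(1), "->", raw_backslashed)
```

In both cases the group stops at the first quote that is not part of an escape sequence, so the closing quote can be matched explicitly.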

RegEx match word in string containing + and - using re.findall() Python

myreg = r"\babcb\"
mystr = "sdf ddabc"
mystr1 = "sdf abc"
print(re.findall(myreg,mystr))=[]
print(re.findall(myreg,mystr1))=[abc]
Up to this point everything works as expected, but not if I change the regex and the string to:
myreg = r"\b\+abcb\"
mystr = "sdf +abc"
print(re.findall(myreg,mystr)) = [] but i would like to get [+abc]
I have noticed that using the following works as expected.
myreg = "^\\+abc$"
mystr = "+abc"
mystr1 = "-+abc"
My question: Is it possible to achieve the same results as above without splitting the string?
Best regards,
Gabriel
Your problem is the following:
\b is defined as the boundary between a \w and a \W character
(or vice versa).
\w contains the character set [a-zA-Z0-9_]
\W contains the character set [^a-zA-Z0-9_], which means all characters except [a-zA-Z0-9_]
'+' is not contained in \w so you won't match the boundary between the whitespace and the '+'.
To get what you want, you should remove the first \b from your pattern:
import re
string = "sdf +abc"
pattern = r"\+abc\b"
matches = re.findall(pattern, string)
print(matches)
['+abc']
There are two problems:
Before your + in +abc, there is no word boundary, so \b cannot match.
Your regex \b\+abcb\ tries to match a literal b character after abc (typo).
Word Boundaries
The word boundary \b matches at a position between a word character (letters, digits and underscore) and a non-word character (or a line beginning or ending). For instance, in +abc there is a word boundary between the + and the a, but none before the +.
Solution: Make your Own boundary
If you want to match +abc but only when it is not preceded by a word character (for instance, you don't want it inside def+abc), then you can make your own boundary with a lookbehind:
(?<!\w)\+abc
This says "match +abc if it is not preceded by a word character (letter, digit, underscore)".
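A short check of the lookbehind approach, as a sketch only. The trailing \b is taken from the other answer above and the test strings are made up for illustration.

```python
import re

# (?<!\w) : not preceded by a word character (also true at the string start)
# \b      : the match must end at a word boundary
pattern = re.compile(r"(?<!\w)\+abc\b")

standalone = pattern.findall("sdf +abc")   # '+' follows a space
at_start = pattern.findall("+abc")         # '+' is the first character
glued = pattern.findall("def+abc")         # '+' follows the word character 'f'
longer = pattern.findall("sdf +abcd")      # 'abc' is followed by another letter

print(standalone, at_start, glued, longer)
```

A negative lookbehind is used rather than a positive one so that the pattern also matches at the very start of the string, where nothing precedes the +.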

regular expression python `r'*(.|:|?|!)*'`

I want to accept words that include these characters: '.:?!'
I tried this as pattern: r'*(.|:|?|!)*' but does not work.
How can I implement this using Python's re module?
the code I'd like to run is like this:
import re
pattern = r'*(.|:|?|!)*'
word = '.Flask'
match = re.match(pattern, word)
if match:
    print('yes')
For example, I want to accept these words: '.flask', 'flask.', '!flask', 'flask!', ...
and even non-ASCII characters, so I want to include words like .日本語 and 日本語. as well.
That's why I wanted to use the * symbol.
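If you want to match words starting or ending with one of these characters, this regex should fit your needs:
pattern = r'[.:?!][^ ]*|[^ ]*[.:?!]'
To make sure the match starts and ends at the edge of a word, do not wrap the pattern in \b: there is no word boundary between a space (or the edge of the string) and a punctuation character, so a pattern wrapped in \b would reject '.flask' and 'flask.'. Use whitespace lookarounds instead:
pattern = r'(?<!\S)[.:?!]\S*|(?<!\S)\S*[.:?!](?!\S)'
Explanation:
[.:?!][^ ]* matches words beginning with one of [.:?!], followed by any characters except a space.
[^ ]*[.:?!] matches words beginning with any characters except a space and ending with one of [.:?!].
(?<!\S) and (?!\S) assert that the match is not preceded or followed by a non-whitespace character; \S is like [^ ] but excludes all whitespace.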
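The pattern from the question, r'*(.|:|?|!)*', cannot work because a quantifier such as * at the very start has nothing to repeat, so compiling it raises re.error. The sketch below shows this and then checks whole words with re.fullmatch; the helper name accepts, the word list, and the use of \S in place of [^ ] are assumptions for illustration.

```python
import re

# The pattern from the question does not even compile.
try:
    re.compile(r'*(.|:|?|!)*')
    original_compiles = True
except re.error:
    original_compiles = False

# Start or end with one of . : ? ! and contain no whitespace.
pattern = re.compile(r'[.:?!]\S*|\S*[.:?!]')

def accepts(word):
    # fullmatch requires the whole word to fit the pattern,
    # whereas re.match would also accept a matching prefix.
    return pattern.fullmatch(word) is not None

words = ['.flask', 'flask.', '!flask', 'flask!', '.日本語', '日本語.', 'flask', 'fl.ask']
results = {word: accepts(word) for word in words}

print(original_compiles)
print(results)
```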

Python regex match literal asterisk

Given the following string:
s = 'abcdefg*'
How can I match it or any other string only made of lowercase letters and optionally ending with an asterisk? I thought the following would work, but it does not:
re.match(r"^[a-z]\*+$", s)
It gives None and not a match object.
How can I match it or any other string only made of lowercase letters and optionally ending with an asterisk?
The following will do it:
re.match(r"^[a-z]+[*]?$", s)
The ^ matches the start of the string.
The [a-z]+ matches one or more lowercase letters.
The [*]? matches zero or one asterisks.
The $ matches the end of the string.
Your original regex matches exactly one lowercase character followed by one or more asterisks.
\*? means 0-or-1 asterisk:
re.match(r"^[a-z]+\*?$", s)
re.match(r"^[a-z]+\*?$", s)
The [a-z]+ matches the sequence of lowercase letters, and \*? matches an optional literal * character.
Try
re.match(r"^[a-z]*\*?$", s)
This means "a string consisting of zero or more lowercase characters (hence the first asterisk), followed by zero or one asterisk (the question mark after the escaped asterisk)".
Your regex means "exactly one lowercase character followed by one or more asterisks".
You forgot the + after the [a-z] match to indicate you want 1 or more of them as well (right now it's matching just one).
re.match(r"^[a-z]+\*+$", s)
Note that \*+ still requires at least one asterisk; use \*? instead if the asterisk is optional.
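The patterns from the question and the answers can be compared side by side. This is a small sketch; the dictionary keys and the sample strings are made up for illustration.

```python
import re

patterns = {
    "original": r"^[a-z]\*+$",              # one letter, then one or more '*'
    "one_or_more_letters": r"^[a-z]+\*?$",  # letters, then an optional '*'
    "zero_or_more_letters": r"^[a-z]*\*?$", # also accepts '' and '*'
    "required_asterisk": r"^[a-z]+\*+$",    # letters, then at least one '*'
}
samples = ["abcdefg*", "abcdefg", "a*", "*", "abc**"]

# For each pattern, collect the samples it matches.
results = {
    name: [s for s in samples if re.match(regex, s)]
    for name, regex in patterns.items()
}

for name, matched in results.items():
    print(name, matched)
```

Only the variant with [a-z]+ and \*? accepts exactly the strings described in the question: one or more lowercase letters with an optional single trailing asterisk.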
