Why does my regex pattern allow digits?

I have tried creating a regex pattern that matches only letters and allows a whitespace:
import re
user_input = raw_input('Input: ')
if re.match('[A-Za-z ]', user_input):
    print user_input
However, when I input o888 or something similar, a match still seems to occur.

That happens because your regex allows partial matches.
Use
if re.match('[A-Za-z ]*$', user_input):
                      ^^
to anchor the pattern at the end and match 0+ chars. As re.match anchors the pattern at the start of the string, the ^ anchor is not necessary, but $ - end of string - is required to enforce the full string match.
If you do not want to allow an empty string, use + quantifier - one or more occurrences - rather than * (zero or more occurrences).
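As a minimal sketch of the difference (written for Python 3; the helper name is_letters_only is made up for illustration), the unanchored pattern accepts o888 because only the first character has to match, while the anchored version rejects it. re.fullmatch, available since Python 3.4, is an alternative that needs no explicit anchor:

```python
import re

def is_letters_only(text):
    # Hypothetical helper: True when text holds only ASCII letters and spaces.
    # '+' rejects the empty string; use '*' to allow it.
    return re.match(r'[A-Za-z ]+$', text) is not None

# Unanchored: only the leading "o" has to match, so this succeeds.
print(bool(re.match(r'[A-Za-z ]', 'o888')))        # True
# Anchored at the end: the digits make the match fail.
print(is_letters_only('o888'))                      # False
print(is_letters_only('hello world'))               # True
# re.fullmatch requires the whole string to match.
print(bool(re.fullmatch(r'[A-Za-z ]+', 'o888')))    # False
```

One caveat: $ also matches just before a trailing newline, so 'abc\n' passes the anchored check, whereas re.fullmatch rejects it.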

Related

Restrict negative lookahead to be between substrings regex

In my regex pattern, I would like to make sure a certain substring only occurs once in between two other substrings.
So, let's take for example these strings:
string_a = "this and that"
string_b = "this and and that"
I want to return a match for string_a but not for string_b, because 'and' occurs twice there between this/that.
I would do that with a negative lookahead-tempered dot:
my_pattern = "this(?:(?!and.*and).)*that"
This matches string_a and not string_b, so far so good.
However, the following string is also not matched (like string_b):
string_c = "this and that and"
Evidently, the negative lookahead occurs for the whole string, rather than between "this" and "that" as I had anticipated and hoped.
How can I do this instead?

You can use another tempered greedy token to temper the .* inside the lookahead:
this(?:(?!this|that|and(?:(?!that).)*?and).)*?that
See the regex demo.
Details:
this - a fixed string
(?:(?!this|that|and(?:(?!that).)*?and).)*? - any char other than line break chars, zero or more occurrences but as few as possible, that does not start a this or that char sequence, nor a pattern that matches and, then any chars other than line break chars (zero or more, as few as possible) that do not start a that char sequence, and then another and
that - a fixed string.
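To check the behaviour, here is a small sketch that runs the pattern above against the three strings from the question (the variable names are illustrative only):

```python
import re

# Pattern from the answer above.
pattern = re.compile(r'this(?:(?!this|that|and(?:(?!that).)*?and).)*?that')

samples = {
    'string_a': 'this and that',
    'string_b': 'this and and that',
    'string_c': 'this and that and',
}

results = {}
for name, text in samples.items():
    m = pattern.search(text)
    # Store the matched text, or None when there is no match.
    results[name] = m.group(0) if m else None
    print(name, '->', results[name])

# string_a -> this and that
# string_b -> None
# string_c -> this and that
```

The second and after that in string_c no longer blocks the match, because the inner tempered token stops looking as soon as it reaches that.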

Regex to pull the first and last letter of a string

I am using \d{2}-\d{2}-\d{4} to validate my string. It pulls the sequence of numbers out of the string, resulting in 02-01-1716. However, I also need to pull the letters the string begins and ends with. For example, from Q:\Region01s\FY 02\02-01-1716A.pdf I need the Q as well as the A, so in the end I would have Q: 02-01-1716A.

You can use
import re
regex = r"^([a-zA-Z]:)\\(?:.*\\)?(\d{2}-\d{2}-\d{4}[a-zA-Z]?)"
text = r"Q:\Region01s\FY 02\02-01-1716A.pdf"
match = re.search(regex, text)
if match:
    print(f"{match.group(1)} {match.group(2)}")
# => Q: 02-01-1716A
See the Python demo. Also, see the regex demo. Details:
^ - start of string
([a-zA-Z]:) - Group 1: a letter and :
\\ - a backslash
(?:.*\\)? - an optional sequence of any chars other than line break chars as many as possible, followed with a backslash
(\d{2}-\d{2}-\d{4}[a-zA-Z]?) - Group 2: two digits, -, two digits, -, four digits, an optional letter.
The output - if there is a match - is a concatenation of Group 1, space and Group 2 values.

You can try:
(.).*(.)\.[^\.]+$
Or with the validation:
(.).*\d{2}-\d{2}-\d{4}(.)\.[^\.]+$
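A short sketch of how these two patterns behave with re.search on the sample path from the question (the variable names are made up for illustration):

```python
import re

text = r"Q:\Region01s\FY 02\02-01-1716A.pdf"

# First character of the string and the character right before the extension.
simple = re.search(r'(.).*(.)\.[^\.]+$', text)
print(simple.groups())      # ('Q', 'A')

# Same idea, but the date-like number must come right before the final letter.
validated = re.search(r'(.).*\d{2}-\d{2}-\d{4}(.)\.[^\.]+$', text)
print(validated.groups())   # ('Q', 'A')
```

Neither pattern captures the number itself, so building Q: 02-01-1716A this way would need an extra capturing group around the \d{2}-\d{2}-\d{4} part.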

strange output regular expression r'[-.\:alnum:](.*)'

I expect to fetch all alphanumeric characters after "-"
For an example:
>>> str1 = "12 - mystr"
>>> re.findall(r'[-.\:alnum:](.*)', str1)
[' mystr']
First, it's strange that white space is considered alphanumeric, while I expected to get ['mystr'].
Second, I cannot understand why this can be fetched, if there is no "-":
>>> str2 = "qwertyuio"
>>> re.findall(r'[-.\:alnum:](.*)', str2)
['io']

First of all, Python re does not support POSIX character classes.
The white space is not considered alphanumeric. Your first pattern matches - with [-.\:alnum:], and then (.*) captures into Group 1 any 0 or more chars other than a newline, including the leading space. The [-.\:alnum:] pattern matches one char that is either -, ., :, a, l, n, u or m. Thus, when run against qwertyuio, u is matched and io is captured into Group 1.
Alphanumeric chars can be matched with the [^\W_] pattern. So, to capture all alphanumeric chars after - that is followed with 0+ whitespaces you may use
re.findall(r'-\s*([^\W_]+)', s)
See the regex demo
Details
- - a hyphen
\s* - 0+ whitespaces
([^\W_]+) - Capturing group 1: one or more (+) chars that are letters or digits.
Python demo:
print(re.findall(r'-\s*([^\W_]+)', '12 - mystr')) # => ['mystr']
print(re.findall(r'-\s*([^\W_]+)', 'qwertyuio')) # => []

Your regex says: "Find any one of the characters -.:alnum, then capture any amount of any characters into the first capture group".
In the first test, it found - for the first character, then captured " mystr" (with its leading space) in the first capture group. If there are any groups in the regex, findall returns a list of the captured groups, not the full matches, so the matched - is not included.
Your second test found u as one of the -.:alnum characters (as none of qwerty matched any), then captured and returned the rest after it, io.
As @revo notes in comments, [....] is a character class - matching any one character in it. In order to include a POSIX character class (like [:alnum:]) inside it, you need two sets of brackets. Also, there is no order in a character class; the fact that you included - inside it just means it would be one of the matched characters, not that alphanumeric characters would be matched without it. Finally, if you want to match any number of alphanumerics, you have your quantifier * on the wrong thing.
Thus, "match -, then any number of alphanumeric characters" would be -([[:alnum:]]*), except... Python does not support POSIX character classes. So you have to write your own: -([A-Za-z0-9]*).
However, that will not match your string because the intervening space is, as you note, not an alphanumeric character. To account for that, use -\s*([A-Za-z0-9]*).
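The points above can be checked with a brief sketch; the strings are the ones from the question, and the name fixed is only illustrative:

```python
import re

str1 = "12 - mystr"
str2 = "qwertyuio"

# With a capturing group, findall returns the group contents, not the full match.
print(re.findall(r'[-.\:alnum:](.*)', str1))   # [' mystr']
print(re.findall(r'[-.\:alnum:](.*)', str2))   # ['io']

# Hand-written alphanumeric class, with optional whitespace after the hyphen.
fixed = r'-\s*([A-Za-z0-9]*)'
print(re.findall(fixed, str1))                 # ['mystr']
print(re.findall(fixed, str2))                 # []
```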

Not quite sure what you want to match. I'll assume you don't want to include '-' in any matches.
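If you want to get the alphanumeric chars after the first '-' and skip the whitespace around them, you can do something like this (Python's re rejects variable-width lookbehinds such as (?<=\s+), so the whitespace is consumed with \s* instead):
re.match(r'.*?-\s*((?:[a-zA-Z\d]+\s*)+)', inputString)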
If you want to find each string of alphanumerics directly after each '-', then you can do this.
re.findall(r'(?<=-)[a-zA-Z\d]+', inputString)

python regular expression : how to remove all punctuation characters from a string but keep those between numbers?

I am working on a Chinese NLP project. I need to remove all punctuation characters except those between numbers, keeping only Chinese characters (\u4e00-\u9fff) and alphanumeric characters (0-9a-zA-Z). For example, the hyphen in 12-34 should be kept, while the equals sign after 123 should be removed.
Here is my python script.
import re
s = "中国，中，。》％国foo中¥国bar#中123=国％中国12-34中国"
res = re.sub(u'(?<=[^0-9])[^\u4e00-\u9fff0-9a-zA-Z]+(?=[^0-9])','',s)
print(res)
the expected output should be
中国中国foo中国bar中123国中国12-34中国
but the result is
中国中国foo中国bar中123=国中国12-34中国
I can't figure out why there is an extra equal sign in the output?

The "=" does match [^\u4e00-\u9fff0-9a-zA-Z]+, but the lookbehind and the lookahead must both succeed as well for the character to be removed, and each of them requires a non-digit neighbour. I.e., if either one fails, the character is kept. This means your code actually keeps any non-alphanumeric, non-Chinese characters that have a digit on either side.
You can try the following regex:
u'([\u4e00-\u9fff0-9a-zA-Z]|(?<=[0-9])[^\u4e00-\u9fff0-9a-zA-Z]+(?=[0-9]))'
You can use it as such:
import re
s = "中国，中，。》％国foo中¥国bar#中123=国％中国12-34中国"
res = re.findall(u'([\u4e00-\u9fff0-9a-zA-Z]|(?<=[0-9])[^\u4e00-\u9fff0-9a-zA-Z]+(?=[0-9]))',s)
print(''.join(res))

I suggest matching and capturing these characters in between digits (to restore them later in the output), and just match them in other contexts.
In Python 2, it will look like
import re
s = u"中国，中，。》％国foo中¥国bar#中123=国％中国12-34中国"
pat_block = u'[^\u4e00-\u9fff0-9a-zA-Z]+'
pattern = u'([0-9]+{0}[0-9]+)|{0}'.format(pat_block)
res = re.sub(pattern, lambda x: x.group(1) if x.group(1) else u"", s)
print(res.encode("utf8")) # => 中国中国foo中国bar中123国中国12-34中国
See the Python demo
If you need to preserve those symbols inside any Unicode digits, you need to replace [0-9] with \d and pass the re.UNICODE flag to the regex.
The regex will look like
([0-9]+[^\u4e00-\u9fff0-9a-zA-Z]+[0-9]+)|[^\u4e00-\u9fff0-9a-zA-Z]+
It works like this:
([0-9]+[^\u4e00-\u9fff0-9a-zA-Z]+[0-9]+) - Group 1 capturing
    [0-9]+ - 1+ digits
    [^\u4e00-\u9fff0-9a-zA-Z]+ - 1+ chars other than those defined in the specified ranges
    [0-9]+ - 1+ digits
| - or
[^\u4e00-\u9fff0-9a-zA-Z]+ - 1+ chars other than those defined in the specified ranges
In Python 2.x, when a group is not matched in re.sub, the backreference to it is None. That is why a lambda expression is required to check first whether Group 1 matched.
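For reference, a Python 3 sketch of the same approach, where strings are Unicode by default and no encoding step is needed for printing. The pattern is the one from this answer; the compiled-pattern variable and the `or ''` shortcut are my own choices:

```python
import re

s = "中国，中，。》％国foo中¥国bar#中123=国％中国12-34中国"

# One or more chars that are neither CJK ideographs nor ASCII alphanumerics.
pat_block = '[^\u4e00-\u9fff0-9a-zA-Z]+'
# Group 1 keeps such a run when it sits between digits; the bare alternative is dropped.
pattern = re.compile('([0-9]+{0}[0-9]+)|{0}'.format(pat_block))

# m.group(1) is None when the second alternative matched, so fall back to ''.
res = pattern.sub(lambda m: m.group(1) or '', s)
print(res)  # 中国中国foo中国bar中123国中国12-34中国
```

Since Python 3.5, an unmatched group in a replacement template is replaced with an empty string, so pattern.sub(r'\1', s) should give the same result without the lambda.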

Regex for a third-person verb

I'm trying to create a regex that matches a third person form of a verb created using the following rule:
If the verb ends in e not preceded by i,o,s,x,z,ch,sh, add s.
So I'm looking for a regex matching a word consisting of some letters, then not i,o,s,x,z,ch,sh, and then "es". I tried this:
\b\w*[^iosxz(sh)(ch)]es\b
According to regex101 it matches "likes", "hates" etc. However, it does not match "bathes", why doesn't it?

You may use
\b(?=\w*(?<![iosxz])(?<![cs]h)es\b)\w*
See the regex demo
Since Python re does not support variable length alternatives in a lookbehind, you need to split the conditions into two lookbehinds here.
Pattern details:
\b - a leading word boundary
(?=\w*(?<![iosxz])(?<![cs]h)es\b) - a positive lookahead requiring a sequence of:
    \w* - 0+ word chars
    (?<![iosxz]) - there must not be i, o, s, x, z chars right before the current location and...
    (?<![cs]h) - no ch or sh right before the current location...
    es - followed with es...
    \b - at the end of the word
\w* - zero or more (maybe + is better here to match 1 or more) word chars.
See Python demo:
import re
r = re.compile(r'\b(?=\w*(?<![iosxz])(?<![cs]h)es\b)\w*')
s = 'it matches "likes", "hates" etc. However, it does not match "bathes", why doesn\'t it?'
print(re.findall(r, s))

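If you want to match an e that is not preceded by i, o, s, x, z, ch or sh, you could use the following in a regex flavor that accepts lookbehind alternatives of different lengths (Python's re does not, and raises an error for this pattern):
(?<!i|o|s|x|z|ch|sh)e
Your regex [^iosxz(sh)(ch)] is a single character class: the ^ negates it, and every other character, including the parentheses, is taken literally, so it is equivalent to:
[^iosxzch()]
which means: "match any one character that is not one of iosxzch()". Since h is in that set, the h before es in "bathes" is rejected, and the word does not match.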
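A small sketch comparing the patterns in Python; the word list and variable names are made up for illustration, and the third pattern is the split-lookbehind one from the first answer:

```python
import re

words = ['likes', 'hates', 'bathes', 'matches', 'does']

# Parentheses are literal inside [...], so this class excludes i o s x z ( ) c h.
original = re.compile(r'\b\w*[^iosxz(sh)(ch)]es\b')
# The same set written without the duplicates.
equivalent = re.compile(r'\b\w*[^iosxzch()]es\b')
# Two fixed-width lookbehinds: one for [iosxz], one for ch/sh.
lookbehinds = re.compile(r'\b(?=\w*(?<![iosxz])(?<![cs]h)es\b)\w*')

# For each word: (original matches, equivalent matches, lookbehinds matches).
results = {
    w: (bool(original.fullmatch(w)),
        bool(equivalent.fullmatch(w)),
        bool(lookbehinds.fullmatch(w)))
    for w in words
}
for w, flags in results.items():
    print(w, flags)
# bathes -> (False, False, True): only the lookbehind version accepts it.

# Alternatives of different widths in a lookbehind are rejected by Python's re.
try:
    re.compile(r'(?<!i|o|s|x|z|ch|sh)e')
    lookbehind_error = None
except re.error as exc:
    lookbehind_error = exc
    print('re.error:', exc)
```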
