I am writing a Python regex that matches only strings consisting of letters, digits, and one or more question marks.
For example, regex1: ^[A-Za-z0-9?]+$ returns strings with or without ?
I want a regex2 that matches expressions such as ABC123?A, 1AB?CA?, ?2ABCD, ???, 123? but not ABC123, ABC.?1D1, ABC(a)?1d
In MySQL, I did this and it works:
select *
from (
select * from norm_prod.skill_patterns
where pattern REGEXP '^[A-Za-z0-9?]+$') AS XXX
where XXX.pattern not REGEXP '^[A-Za-z0-9]+$'
How about something like this:
^(?=.*\?)[a-zA-Z0-9\?]+$
As you can see here at regex101.com
Explanation
The (?=.*\?) is a positive lookahead that tells the regex that the start of the match should be followed by 0 or more characters and then a ? - i.e., there should be a ? somewhere in the match.
The [a-zA-Z0-9\?]+ matches one-or-more occurrences of the characters given in the character class i.e. a-z, A-Z and digits from 0-9, and the question mark ?.
Altogether, the regex first checks if there is a question mark somewhere in the string to be matched. If yes, then it matches the characters mentioned above. If either the ? is not present, or there is some foreign character, then the string is not matched.
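As a quick check of the lookahead approach, here is a minimal sketch that runs the pattern against the examples from the question. The names LOOKAHEAD_RE, should_match and should_not_match are made up for illustration.

```python
import re

# The lookahead (?=.*\?) requires at least one '?' somewhere in the string;
# the character class then restricts every character to letters, digits, or '?'.
LOOKAHEAD_RE = re.compile(r'^(?=.*\?)[a-zA-Z0-9?]+$')

should_match = ['ABC123?A', '1AB?CA?', '?2ABCD', '???', '123?']
should_not_match = ['ABC123', 'ABC.?1D1', 'ABC(a)?1d']

# Keep only the candidates the pattern accepts.
matched = [s for s in should_match + should_not_match if LOOKAHEAD_RE.search(s)]
print(matched)
```

Inside a character class the ? does not need escaping, so [a-zA-Z0-9?] behaves the same as [a-zA-Z0-9\?].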
You can validate an alphanumeric string with one or more question marks using
where pattern REGEXP '^[A-Za-z0-9]*([?][A-Za-z0-9]*)+$'
In Python:
re.search(r'^[A-Za-z0-9]*(?:\?[A-Za-z0-9]*)+$', text)
See the regex demo.
Details:
^ - start of string
[A-Za-z0-9]* - zero or more letters or digits
([?][A-Za-z0-9]*)+ - one or more repetitions of a ? char and then zero or more letters or digits
$ - end of string.
If you plan to apply this to any Unicode string, consider using POSIX character classes:
where pattern REGEXP '^[[:alnum:]]*([?][[:alnum:]]*)+$'
where [[:alnum:]] matches any letters and digits. In Python:
re.search(r'^[^\W_]*(?:\?[^\W_]*)+$', text)
In Python, all shorthand character classes are Unicode aware by default, and the [^\W_] pattern is a \w (that matches letters, digits, connector punctuation) with _ subtracted from it.
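To illustrate the difference between the ASCII-only and the Unicode-aware Python patterns above, here is a small sketch. The sample strings and the names ASCII_RE, UNICODE_RE and results are assumptions chosen for the demonstration.

```python
import re

ASCII_RE = re.compile(r'^[A-Za-z0-9]*(?:\?[A-Za-z0-9]*)+$')
UNICODE_RE = re.compile(r'^[^\W_]*(?:\?[^\W_]*)+$')

# 'caf\u00e9?' is "café?" written with an escape to keep this file ASCII-only.
samples = ['ABC123?A', '???', 'ABC123', 'caf\u00e9?', 'a_b?']

# Map each sample to (matches ASCII pattern, matches Unicode pattern).
results = {
    s: (bool(ASCII_RE.search(s)), bool(UNICODE_RE.search(s)))
    for s in samples
}
for sample, flags in results.items():
    print(repr(sample), flags)
```

Only the accented sample should differ between the two patterns; the underscore is rejected by both because [^\W_] removes _ from \w.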
If at least one question mark must be present, in MySQL or Python:
^[A-Za-z0-9]*\?[A-Za-z0-9?]*$
Explanation
^ Start of string
[A-Za-z0-9]* Match optional chars A-Z a-z 0-9
\? Match a question mark
[A-Za-z0-9?]* Match optional chars A-Z a-z 0-9 or ?
$ End of string
See a regex demo.
In MySQL double escape the backslash like:
REGEXP '^[A-Za-z0-9]*\\?[A-Za-z0-9?]*$'
I don't understand how the escape operator \ works in Python regex together with the r' prefix of raw strings.
Some help is appreciated.
code:
import re
text=' esto .es 10 . er - 12 .23 with [ and.Other ] here is more ; puntuation'
print('text0=',text)
text1 = re.sub(r'(\s+)([;:\.\-])', r'\2', text)
text2 = re.sub(r'\s+\.', '\.', text)
text3 = re.sub(r'\s+\.', r'\.', text)
print('text1=',text1)
print('text2=',text2)
print('text3=',text3)
The theory says:
backslash character ('\') to indicate special forms or to allow special characters to be used without invoking their special meaning.
And as far as the link provided at the end of this question explains, r' represents a raw string, i.e. there is no special meaning for symbols, it is as it stays.
So in the above regex I would expect text2 and text3 to be different, since the substitution text is '\.' in text2, i.e. a period, whereas (in principle) the substitution text in text3 is r'\.', which is a raw string, i.e. the string should appear as it is, backslash and period. But they give the same result:
The result is:
text0= esto .es 10 . er - 12 .23 with [ and.Other ] here is more ; puntuation
text1= esto.es 10. er- 12.23 with [ and.Other ] here is more; puntuation
text2= esto\.es 10\. er - 12\.23 with [ and.Other ] here is more ; puntuation
text3= esto\.es 10\. er - 12\.23 with [ and.Other ] here is more ; puntuation
#text2=text3 but substitutions are not the same r'\.' vs '\.'
It looks to me that the r' does not work the same way in substitution part, nor the backslash. On the other hand my intuition tells me I am missing something here.
EDIT 1:
Following @Wiktor Stribiżew's comment.
He pointed out that (following his link):
import re
print(re.sub(r'(.)(.)(.)(.)(.)(.)', 'a\6b', '123456'))
print(re.sub(r'(.)(.)(.)(.)(.)(.)', r'a\6b', '123456'))
# in my example the substitutions were not the same and the result were equal
# here indeed r' changes the results
which gives:
ab
a6b
that puzzles me even more.
Note:
I read this stack overflow question about raw strings which is super complete. Nevertheless it does not speak about substitutions
First and foremost,
replacement patterns ≠ regular expression patterns
We use a regex pattern to search for matches, we use replacement patterns to replace matches found with regex.
NOTE: The only special character in a substitution pattern is a backslash, \. Only the backslash must be doubled.
Replacement pattern syntax in Python
The re.sub docs are confusing as they mention both string escape sequences that can be used in replacement patterns (like \n, \r) and regex escape sequences (\6) and those that can be used as both regex and string escape sequences (\&).
I am using the term regex escape sequence to denote an escape sequence consisting of a literal backslash + a character, that is, '\\X' or r'\X', and a string escape sequence to denote a sequence of \ and a char or some sequence that together form a valid string escape sequence. String escape sequences are only recognized in regular string literals. In raw string literals, you can only escape " (and that is the reason why you can't end a raw string literal with \", but the backslash is still part of the string then).
So, in a replacement pattern, you may use backreferences:
re.sub(r'\D(\d)\D', r'\1', 'a1b') # => 1
re.sub(r'\D(\d)\D', '\\1', 'a1b') # => 1
re.sub(r'\D(\d)\D', '\g<1>', 'a1b') # => 1
re.sub(r'\D(\d)\D', r'\g<1>', 'a1b') # => 1
You may see that r'\1' and '\\1' are the same replacement pattern, \1. If you use '\1', it will be parsed as a string escape sequence, a character with octal value 001. If you forget to use the r prefix with the unambiguous backreference, there is no problem because \g is not a valid string escape sequence, and there, the \ escape character remains in the string. Read on in the docs I linked to:
Unlike Standard C, all unrecognized escape sequences are left in the string unchanged, i.e., the backslash is left in the result.
So, when you pass '\.' as a replacement string, you actually send \. two-char combination as the replacement string, and that is why you get \. in the result.
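The two layers of escaping described above (string literal first, replacement pattern second) can be checked directly. This is a minimal sketch; the variable names are made up for illustration.

```python
import re

# r'\1' and '\\1' are the same two-character string: a backslash and '1'.
same_string = (r'\1' == '\\1')
# Without the r prefix, '\1' is ONE character: the octal escape for code point 1.
octal_char = '\1'

pattern = r'\D(\d)\D'
with_backref = re.sub(pattern, r'\1', 'a1b')  # backreference to group 1
with_octal = re.sub(pattern, '\1', 'a1b')     # literal control character U+0001

print(same_string, len(octal_char), repr(with_backref), repr(with_octal))
```

This is also why 'a\6b' in the question's EDIT 1 printed what looks like ab: the string literal had already turned \6 into an invisible control character before re.sub ever saw it.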
\ is a special character in Python replacement pattern
If you use re.sub(r'\s+\.', r'\\.', text), you will get the same result as in text2 and text3 cases, see this demo.
That happens because \\, two literal backslashes, denote a single backslash in the replacement pattern. If you have no Group 2 in your regex pattern, but pass r'\2' in the replacement to actually replace with \ and 2 char combination, you would get an error.
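A short sketch of both claims, using a shortened sample string rather than the asker's full text; the names text, single_backslash and error_message are assumptions for the example.

```python
import re

text = 'esto .es 10 . er'

# r'\\.' is backslash, backslash, dot: the replacement pattern collapses the
# doubled backslash into one literal backslash, followed by a dot.
single_backslash = re.sub(r'\s+\.', r'\\.', text)
print(single_backslash)

# The regex has no capturing groups, so r'\2' is an invalid backreference.
try:
    re.sub(r'\s+\.', r'\2', text)
    error_message = None
except re.error as exc:
    error_message = str(exc)
print(error_message)
```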
Thus, when you have dynamic, user-defined replacement patterns you need to double all backslashes in the replacement patterns that are meant to be passed as literal strings:
re.sub(some_regex, some_replacement.replace('\\', '\\\\'), input_string)
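For instance, a Windows-style path used as a replacement shows why the doubling matters. This is only a sketch: sub_literal is a hypothetical helper name, not part of the re module.

```python
import re

def sub_literal(pattern, replacement, string):
    """Hypothetical helper: substitute `replacement` as literal text by
    doubling every backslash before handing it to re.sub."""
    return re.sub(pattern, replacement.replace('\\', '\\\\'), string)

user_replacement = r'C:\new'  # backslash followed by 'n', as typed by a user

# Passed as-is, the replacement pattern turns \n into a real newline.
unescaped = re.sub('X', user_replacement, 'path=X')
# With doubled backslashes, the text is inserted verbatim.
escaped = sub_literal('X', user_replacement, 'path=X')

print(repr(unescaped))
print(repr(escaped))
```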
A simple way to work around all these string escaping issues is to use a function/lambda as the repl argument, instead of a string. For example:
output = re.sub(
pattern=find_pattern,
repl=lambda _: replacement,
string=input,
)
The replacement string won't be parsed at all, just substituted in place of the match.
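Here is a runnable version of the same idea, with made-up values for the pattern, replacement and input. The replacement deliberately contains sequences that would be interpreted (or rejected) if it were passed as a plain string.

```python
import re

# As a string repl, \1 would be an invalid group reference here (the pattern
# has no groups) and \n would become a newline. Returned from a function,
# the text is used exactly as written.
replacement = r'\1 and C:\new'

output = re.sub(r'X', lambda _match: replacement, 'value=X')
print(output)
```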
From the doc (my emphasis):
re.sub(pattern, repl, string, count=0, flags=0)
Return the string
obtained by replacing the leftmost non-overlapping occurrences of
pattern in string by the replacement repl. If the pattern isn’t found,
string is returned unchanged. repl can be a string or a function; if
it is a string, any backslash escapes in it are processed. That is, \n
is converted to a single newline character, \r is converted to a
carriage return, and so forth. Unknown escapes of ASCII letters are
reserved for future use and treated as errors. Other unknown escapes
such as \& are left alone. Backreferences, such as \6, are replaced
with the substring matched by group 6 in the pattern.
The repl argument is not just plain text. It can also be the name of a function or refer to a position in a group (e.g. \g<quote>, \g<1>, \1).
Also, from here:
Unlike Standard C, all unrecognized escape sequences are left in the
string unchanged, i.e., the backslash is left in the result.
Since \. is not a recognized string escape sequence, '\.' is the same as r'\.'.
I need to match strings which have a-z, \? or \*, for example:
abcd
abc\?d # must have a \ in front of a ?
abc\*d
ab\?c\*d
and exclude strings which don't have \ in front of other punctuations, such as
abc?d
abc*d
ab?c*d
I tried [a-z(?:\\\?)(?:\\\*)]+ (https://regex101.com/r/5yYBDl/1), but it doesn't work, because [] only supports single characters, I guess.
Any help would be appreciated.
You may use this regex with an alternation and anchors:
^(?:[a-z]|\\[*?])+$
Updated RegEx Demo
RegEx Details:
^: Start
(?:[a-z]|\\[*?])+: Non capturing group to match either [a-z] or \? or \*. Match 1 or more of this non capturing group.
$: End
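In Python, the pattern can be tried against the examples from the question like this. It is a sketch only; ESCAPED_RE, valid and invalid are illustrative names, and the samples are written as raw strings so each backslash stays in the data.

```python
import re

# Each repetition is either one lowercase letter, or a backslash followed by * or ?.
ESCAPED_RE = re.compile(r'^(?:[a-z]|\\[*?])+$')

valid = ['abcd', r'abc\?d', r'abc\*d', r'ab\?c\*d']
invalid = ['abc?d', 'abc*d', 'ab?c*d']

accepted = [s for s in valid + invalid if ESCAPED_RE.search(s)]
print(accepted)
```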
Would matching by Unicode escapes work? Depending on your application, the regex engine may support them:
^(?:[a-z]|\u005c[\u003f\u002a])+$
https://www.regular-expressions.info/unicode.html
snippet from this site
"Perl, PCRE, Boost, and std::regex do not support the \uFFFF syntax. They use \x{FFFF} instead. You can omit leading zeros in the hexadecimal number between the curly braces. Since \x by itself is not a valid regex token, \x{1234} can never be confused to match \x 1234 times. It always matches the Unicode code point U+1234. \x{1234}{5678} will try to match code point U+1234 exactly 5678 times"
I have been trying to create a regex that will match a string of strings with the following format: "static.string static.mod static.bin". I basically want to enforce the string.string format. My current RE is ^(\s*)([A-Za-z]+)(\.+)([A-Za-z]+), but it only matches the first string, static.string. How do I make it iterate and match every string that fits that format in a string of strings?
You may use
re.findall(r'(?<!\S)[A-Za-z]+\.[A-Za-z]+(?!\S)', text)
See the regex demo.
The regex matches:
(?<!\S) - a location immediately preceded with a whitespace or start of string
[A-Za-z]+ - 1+ ASCII letters
\. - a dot
[A-Za-z]+ - 1+ ASCII letters
(?!\S) - a location immediately followed with a whitespace or end of string.
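A small sketch of how this behaves on the input from the question, plus one made-up input containing a malformed token (bad..x) to show that it is skipped rather than partially matched:

```python
import re

# Whitespace boundaries on both sides ensure only whole tokens are returned.
TOKEN_RE = re.compile(r'(?<!\S)[A-Za-z]+\.[A-Za-z]+(?!\S)')

tokens = TOKEN_RE.findall('static.string static.mod static.bin')
partial = TOKEN_RE.findall('static.string bad..x static.bin')

print(tokens)
print(partial)
```

If the goal is to reject the whole input when any token is malformed, one option is to compare len(tokens) with len(text.split()).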
Why does one need to add the DOTALL flag for a Python regular expression to match characters including the newline character in a raw string? I ask because a raw string is supposed to ignore the escaping of special characters such as the newline character. From the docs:
The solution is to use Python’s raw string notation for regular expression patterns; backslashes are not handled in any special way in a string literal prefixed with 'r'. So r"\n" is a two-character string containing '\' and 'n', while "\n" is a one-character string containing a newline.
This is my situation:
string = '\nSubject sentence is: Appropriate support for families of children diagnosed with hearing impairment\nCausal Verb is : may have\npredicate sentence is: a direct impact on the success of early hearing detection and intervention programs in reducing the negative effects of permanent hearing loss'
re.search(r"Subject sentence is:(.*)Causal Verb is :(.*)predicate sentence is:(.*)", string ,re.DOTALL)
This results in a match. However, when I remove the DOTALL flag, I get no match.
In regex, . matches any character except a newline (\n).
So if you have newlines in your string, .* will not match past a newline.
But in Python, if you use the re.DOTALL flag (also known as re.S), the dot also matches \n.
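A minimal sketch of the difference, using a short made-up string instead of the one from the question:

```python
import re

text = 'first line\nsecond line'

# Without DOTALL, '.' stops at the newline, so the pattern cannot reach 'second'.
without_flag = re.search(r'first(.*)second', text)
# With DOTALL, '.' also matches '\n', so the group spans the line break.
with_flag = re.search(r'first(.*)second', text, re.DOTALL)

print(without_flag)
print(repr(with_flag.group(1)))
```

The r prefix on the pattern only affects how the pattern literal is written; it has no bearing on what . matches in the subject string.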
Your source string is not raw; only your pattern string is.
Maybe try:
string = r'\n...\n'
re.search("Subject sentence is:(.*)Causal Verb is :(.*)predicate sentence is:(.*)", string)
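It may help to see what the raw prefix changes in the source string. In this sketch (short made-up strings, illustrative variable names), the raw version contains a backslash and the letter n instead of a newline, so the match succeeds only because the data itself is different, not because . started matching newlines.

```python
import re

real_newline = 'a\nb'   # three characters: 'a', newline, 'b'
raw_text = r'a\nb'      # four characters: 'a', backslash, 'n', 'b'

no_match = re.search(r'a(.*)b', real_newline)
raw_match = re.search(r'a(.*)b', raw_text)

print(len(real_newline), len(raw_text))
print(no_match, repr(raw_match.group(1)))
```

If the text really contains newlines (for example, when it is read from a file), re.DOTALL is still needed.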