Question about matching RE in a complicated form - python

How can I match a word using RE in the following format:
Letter number Alphanumeric dot(.) Alphanumeric{0-4}
Examples:
A24.L
A2F.L9
A2F.LG4
This is what I've come up with so far:
answer=re.findall(r'[A-Za-z]\d\w\.\w{0-4})

As you are using re.findall, I assume you are looking for partial matches inside longer text. Bearing that in mind, you need to fix the following:
\w matches not only alphanumeric, but also a _ char
{0-4} is not a valid limiting ("range", or "interval") quantifier; the syntax is {min,max}. (The min value should not be omitted: some regex engines allow that and default it to 0, but others either do not support the omission or handle it incorrectly.)
In Python 3, \d matches any Unicode digit (like ٠١٢٣٤٥٦٧٨٩۰۱۲۳۴۵۶۷۸۹߀߁߂߃߄߅߆߇߈߉०१२३४५६७८९০১২৩৪৫৬৭৮৯੦੧੨੩੪੫੬੭੮੯૦૧૨૩૪૫૬૭૮૯୦୧୨୩୪୫୬୭୮୯௦௧௨௩௪௫௬௭௮௯౦౧౨౩౪౫౬౭౮౯೦೧೨೩೪೫೬೭೮೯൦൧൨൩൪൫൬൭൮൯๐๑๒๓๔๕๖๗๘๙໐໑໒໓໔໕໖໗໘໙༠༡༢༣༤༥༦༧༨༩၀၁၂၃၄၅၆၇၈၉႐႑႒႓႔႕႖႗႘႙០១២៣៤៥៦៧៨៩᠐᠑᠒᠓᠔᠕᠖᠗᠘᠙᥆᥇᥈᥉᥊᥋᥌᥍᥎᥏᧐᧑᧒᧓᧔᧕᧖᧗᧘᧙᭐᭑᭒᭓᭔᭕᭖᭗᭘᭙᮰᮱᮲᮳᮴᮵᮶᮷᮸᮹᱀᱁᱂᱃᱄᱅᱆᱇᱈᱉᱐᱑᱒᱓᱔᱕᱖᱗᱘᱙꘠꘡꘢꘣꘤꘥꘦꘧꘨꘩꣐꣑꣒꣓꣔꣕꣖꣗꣘꣙꤀꤁꤂꤃꤄꤅꤆꤇꤈꤉꩐꩑꩒꩓꩔꩕꩖꩗꩘꩙0123456789), so you probably want to use (?a) inline modifier (to only match ASCII digits) or an explicit [0-9].
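The three points above can be checked quickly in Python. This is only a minimal sketch; the sample character and the variable names are my own choices, not part of the question:

```python
import re

arabic_three = "\u0663"  # ARABIC-INDIC DIGIT THREE, a non-ASCII decimal digit

# \d is Unicode-aware for str patterns, so it accepts the non-ASCII digit.
unicode_digit = bool(re.fullmatch(r"\d", arabic_three))
# The (?a) inline flag restricts \d (and \w) to ASCII.
ascii_flag_digit = bool(re.fullmatch(r"(?a)\d", arabic_three))
# An explicit range never matches non-ASCII digits.
explicit_digit = bool(re.fullmatch(r"[0-9]", arabic_three))
# \w also accepts the underscore, which is not alphanumeric.
underscore_is_word = bool(re.fullmatch(r"\w", "_"))
# Python's re treats the malformed {0-4} as literal text, not as a quantifier.
literal_braces = bool(re.fullmatch(r"x{0-4}", "x{0-4}"))

print(unicode_digit, ascii_flag_digit, explicit_digit,
      underscore_is_word, literal_braces)
# True False False True True
```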
So, you can use
answer=re.findall(r'\b[A-Za-z][0-9][A-Za-z0-9]\.[A-Za-z0-9]{1,4}\b', text)
if the alphanumeric after . is obligatory, and the following if the match can end in a dot:
answer=re.findall(r'\b[A-Za-z][0-9][A-Za-z0-9]\.[A-Za-z0-9]{0,4}(?<!\w\B)', text)
Details:
\b - word boundary
[A-Za-z] - a letter
[0-9] - an ASCII digit
[A-Za-z0-9] - an ASCII alphanumeric
\. - a . char
[A-Za-z0-9]{1,4}\b - one to four alphanumeric chars at the word boundary.
The second regex does not contain a word boundary at the end since the match is supposed to be able to end in a . (that is not a word char). The (?<!\w\B) is a right-hand dynamic word boundary that only requires a non-word char or end position if the preceding char is a word char.
See the regex demo.
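To see how the two patterns differ, here is a small sketch run against a made-up sample string (the text and the names strict and lenient are assumptions for illustration only):

```python
import re

# Hypothetical sample text; the last code ends in a bare dot.
text = "Codes: A24.L, A2F.L9, A2F.LG4, and A24. at the end"

# At least one alphanumeric is required after the dot.
strict = re.compile(r"\b[A-Za-z][0-9][A-Za-z0-9]\.[A-Za-z0-9]{1,4}\b")
# The part after the dot is optional, so the match may end in the dot.
lenient = re.compile(r"\b[A-Za-z][0-9][A-Za-z0-9]\.[A-Za-z0-9]{0,4}(?<!\w\B)")

print(strict.findall(text))   # ['A24.L', 'A2F.L9', 'A2F.LG4']
print(lenient.findall(text))  # ['A24.L', 'A2F.L9', 'A2F.LG4', 'A24.']
```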

Related

Getting a correct regex for word starting and ending with different letters

I am quite new to regex and right now have a problem formulating a regex to match a string where the first and last letter are different. I looked on the internet and found a regex that does just the opposite, i.e. it matches words that have the same starting and ending letter. Can anyone please help me understand whether I can negate this regex in some way, or create a new regex to match my requirements? The regex that needs to be modified or changed is:
^\s|^[a-z]$|^([a-z]).*\1$
This matches these Strings :
aba,
a,
b,
c,
d,
" ",
cccbbbbbbac,
aaaaba
But I want it to match strings like:
aaabbcz,
zba,
ccb,
cbbbba
Can anyone please help me in this regard? Thank you.
Note: I will be using this with Python regex, so the regex should be compatible with Python.
You don't need a regex for this, just use
s[0] != s[-1]
where s is your string. If you must use a regex, you can use this:
^(.).*(?!\1).$
This looks for
^ : beginning of string
(.) : a character (captured in group 1)
.* : some number of characters
(?!\1). : a character which is not the character captured in group 1
$ : end of string
Regex demo on regex101
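Both approaches can be compared on the strings from the question. This is a sketch only; the helper names are made up, and the plain comparison is guarded against the empty string, which s[0] alone would not handle:

```python
import re

pattern = re.compile(r"^(.).*(?!\1).$")

def differs_plain(s):
    # Direct comparison; the bool(s) guard avoids an IndexError on "".
    return bool(s) and s[0] != s[-1]

def differs_regex(s):
    # True when the last character differs from the captured first one.
    return pattern.search(s) is not None

wanted = ["aaabbcz", "zba", "ccb", "cbbbba"]
unwanted = ["aba", "a", "cccbbbbbbac", "aaaaba"]

print([differs_regex(s) for s in wanted])    # [True, True, True, True]
print([differs_regex(s) for s in unwanted])  # [False, False, False, False]
```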
This part of your pattern ^([a-z]).*\1$ only accounts for chars a-z, but you also want to exclude " "
You can rewrite that pattern by putting the part after the capture group inside a negative lookahead.
^(.)(?!.*\1$).+
^ Start of string
(.) Capture a single char (including spaces) in group 1
(?!.*\1$) Negative lookahead, assert that the string does not end with the same character
.+ Match 1+ characters so that the string has a minimum of 2 characters
See a regex demo.
If the string should start and end with a non-whitespace character to prevent leading/trailing spaces, you can start the match with a non-whitespace character \S and also end the match with a non-whitespace character.
^(\S)(?!.*\1$).*\S$
See another regex demo.
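A short sketch of how the two lookahead variants behave in Python. The strings with a leading or trailing space are extra test cases I added, not ones from the question:

```python
import re

# First and last character differ; at least two characters overall.
first_last_differ = re.compile(r"^(.)(?!.*\1$).+")
# Same idea, but the string must also start and end with non-whitespace.
no_edge_spaces = re.compile(r"^(\S)(?!.*\1$).*\S$")

wanted = ["aaabbcz", "zba", "ccb", "cbbbba"]
unwanted = ["aba", "a", " ", "cccbbbbbbac", "aaaaba"]

for s in wanted + unwanted + ["ab ", " ab"]:
    print(repr(s),
          bool(first_last_differ.search(s)),
          bool(no_edge_spaces.search(s)))
```

With these inputs, "ab " and " ab" are accepted by the first pattern but rejected by the second.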

Detect strings containing only digits, letters and one or more question marks

I am writing a Python regex that matches only strings that consist of letters, digits and one or more question marks.
For example, regex1: ^[A-Za-z0-9?]+$ returns strings with or without ?
I want a regex2 that matches expressions such as ABC123?A, 1AB?CA?, ?2ABCD, ???, 123? but not ABC123, ABC.?1D1, ABC(a)?1d
In MySQL, I did this and it works:
select *
from (
select * from norm_prod.skill_patterns
where pattern REGEXP '^[A-Za-z0-9?]+$') AS XXX
where XXX.pattern not REGEXP '^[A-Za-z0-9]+$'
How about something like this:
^(?=.*\?)[a-zA-Z0-9\?]+$
As you can see here at regex101.com
Explanation
The (?=.*\?) is a positive lookahead that tells the regex that the start of the match should be followed by 0 or more characters and then a ? - i.e., there should be a ? somewhere in the match.
The [a-zA-Z0-9\?]+ matches one-or-more occurrences of the characters given in the character class i.e. a-z, A-Z and digits from 0-9, and the question mark ?.
Altogether, the regex first checks if there is a question mark somewhere in the string to be matched. If yes, then it matches the characters mentioned above. If either the ? is not present, or there is some foreign character, then the string is not matched.
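As a rough check, the lookahead pattern can be run against the examples from the question (a sketch; the list names are my own):

```python
import re

pattern = re.compile(r"^(?=.*\?)[a-zA-Z0-9\?]+$")

should_match = ["ABC123?A", "1AB?CA?", "?2ABCD", "???", "123?"]
should_fail = ["ABC123", "ABC.?1D1", "ABC(a)?1d"]

# Keep only the strings the pattern accepts.
matched = [s for s in should_match + should_fail if pattern.search(s)]
print(matched)  # ['ABC123?A', '1AB?CA?', '?2ABCD', '???', '123?']
```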
You can validate an alphanumeric string with one or more question marks using
where pattern REGEXP '^[A-Za-z0-9]*([?][A-Za-z0-9]*)+$'
In Python:
re.search(r'^[A-Za-z0-9]*(?:\?[A-Za-z0-9]*)+$', text)
See the regex demo.
Details:
^ - start of string
[A-Za-z0-9]* - zero or more letters or digits
([?][A-Za-z0-9]*)+ - one or more repetitions of a ? char and then zero or more letters or digits
$ - end of string.
If you plan to apply this to any Unicode string, consider using POSIX character classes:
where pattern REGEXP '^[[:alnum:]]*([?][[:alnum:]]*)+$'
where [[:alnum:]] matches any letters and digits. In Python:
re.search(r'^[^\W_]*(?:\?[^\W_]*)+$', text)
In Python 3, shorthand character classes are Unicode-aware by default for str patterns, and the [^\W_] pattern is a \w (which matches letters, digits, and the underscore) with _ subtracted from it.
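The difference between the ASCII-only and the Unicode-aware Python patterns shows up as soon as a non-ASCII letter is involved. A minimal sketch, with sample strings that are my own:

```python
import re

ascii_only = re.compile(r"^[A-Za-z0-9]*(?:\?[A-Za-z0-9]*)+$")
unicode_aware = re.compile(r"^[^\W_]*(?:\?[^\W_]*)+$")

# "\u00e9" is the letter e with an acute accent.
samples = ["ABC123?A", "\u00e9?1", "a_b?", "ABC123"]

results = {s: (bool(ascii_only.search(s)), bool(unicode_aware.search(s)))
           for s in samples}
for s, flags in results.items():
    print(repr(s), flags)
```

Only the accented sample is treated differently: the ASCII pattern rejects it and the Unicode-aware one accepts it. The underscore and the string without a question mark are rejected by both.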
If there should be at least a single question mark present using MySQL or Python:
^[A-Za-z0-9]*\?[A-Za-z0-9?]*$
Explanation
^ Start of string
[A-Za-z0-9]* Match optional chars A-Z a-z 0-9
\? Match a question mark
[A-Za-z0-9?]* Match optional chars A-Z a-z 0-9 or ?
$ End of string
See a regex demo.
In MySQL double escape the backslash like:
REGEXP '^[A-Za-z0-9]*\\?[A-Za-z0-9?]*$'

Regex for input validation enforcement

I have been trying to create a regex that will match a string of strings with the following format: "static.string static.mod static.bin". I basically want to enforce the string.string format. My current implementation only gets the first string, static.string. This is my RE: ^(\s*)([A-Za-z]+)(\.+)([A-Za-z]+). It only matches the first string, so how do I make it iterate and match every string that fits that format in a string of strings?
You may use
re.findall(r'(?<!\S)[A-Za-z]+\.[A-Za-z]+(?!\S)', text)
See the regex demo.
The regex matches:
(?<!\S) - a location immediately preceded with a whitespace or start of string
[A-Za-z]+ - 1+ ASCII letters
\. - a dot
[A-Za-z]+ - 1+ ASCII letters
(?!\S) - a location immediately followed with a whitespace or end of string.
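Here is a small sketch of that findall call in use. The all_tokens_valid helper is hypothetical: it is one possible way to "enforce" the format by checking that every whitespace-separated item matched, and it is not part of the answer above:

```python
import re

token = re.compile(r"(?<!\S)[A-Za-z]+\.[A-Za-z]+(?!\S)")

def valid_tokens(text):
    # All whitespace-delimited items of the form letters.letters.
    return token.findall(text)

def all_tokens_valid(text):
    # Hypothetical check: every whitespace-separated item must match.
    parts = text.split()
    return bool(parts) and len(valid_tokens(text)) == len(parts)

print(valid_tokens("static.string static.mod static.bin"))
# ['static.string', 'static.mod', 'static.bin']
print(valid_tokens("static.string bad..name x.y1 ok.fine"))
# ['static.string', 'ok.fine']
```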

Python regex to match word boundary when part of the word contains special character [duplicate]

I've spent some time on this, but still have no solution. I need a regular expression that is able to match words with signs in them (like c++) in a string.
I've used /\bword\b/ for "usual" words, and it works OK. But as soon as I try /\bC\+\+\b/ it just does not work. It somehow goes wrong with the plus signs in it.
I need a regex to detect if input string contains c++ word in it. Input like,
"c++ developer"
"using c++ language"
etc.
ps. Using C#, .Net Regex.Match function.
Thanks for help!
+ is a special character so you need to escape it
\bC\+\+(?!\w)
Note that we can't use \b because + is not a word-character.
The problem isn't with the plus character, that you've escaped correctly, but the \b sequence. It indicates a word boundary, which is a point between a word character (alphanumeric) and something else. Plus isn't a word character, so for \b to match, there would need to be a word character directly after the last plus sign.
\bC\+\+\b matches "Test C++Test" but not "Test C++ Test" for example. Try something like \bC\+\+\s if you expect there to be a whitespace after the last plus sign.
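The same behaviour can be reproduced with Python's re module, which treats \b and \B the same way for these inputs. A sketch with lowercase c and sample strings of my own; the dictionary keys are just labels:

```python
import re

samples = ["using c++ language", "Test c++Test", "c++"]

patterns = {
    "word_boundary": r"\bc\+\+\b",            # needs a word char after ++
    "not_followed_by_word": r"\bc\+\+(?!\w)",  # forbids a word char after ++
    "non_boundary": r"\bc\+\+\B",              # needs a non-boundary after ++
}

results = {name: [bool(re.search(p, s)) for s in samples]
           for name, p in patterns.items()}
for name, flags in results.items():
    print(name, flags)
# word_boundary [False, True, False]
# not_followed_by_word [True, False, True]
# non_boundary [True, False, True]
```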
The plus sign has a special meaning, so you have to escape it with \. The same rule applies to these characters: \, *, +, ?, |, {, [, (, ), ^, $, ., #, and white space.
UPDATE: the problem was with \b sequence
If you want to match a c++ between non-word chars (chars other than letters, digits and underscores) you may use
\bc\+\+\B
See the regex demo where \b is a word boundary and \B matches all positions that are not word boundary positions.
C# syntax:
var pattern = @"\bc\+\+\B";
You must remember that \b / \B are context dependent: \b matches between the start/end of string and the adjoining word char or between a word and a non-word chars, while \B matches between the start/end of string and the adjoining non-word char or between two word or two non-word chars.
If you build the pattern dynamically, it is hard to rely on word boundary \b pattern.
Use adaptive dynamic word boundaries, the (?!\B\w) and (?<!\w\B) lookarounds, instead: they will always match a word not immediately preceded/followed with a word char if the word starts/ends with a word char:
var pattern = $@"(?!\B\w){Regex.Escape(word)}(?<!\w\B)";
If the word boundaries you want to match are whitespace boundaries (i.e. the match is expected only between whitespaces), use
var pattern = $@"(?<!\S){Regex.Escape(word)}(?!\S)";
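For readers working in Python rather than C#, the same two constructions could look roughly like this. It is a sketch assuming re.escape as the counterpart of Regex.Escape; the function names are hypothetical:

```python
import re

def adaptive_word_pattern(word):
    # A boundary is only required on a side where the word itself
    # starts or ends with a word character.
    return re.compile(r"(?!\B\w)" + re.escape(word) + r"(?<!\w\B)")

def whitespace_bounded_pattern(word):
    # The match must be delimited by whitespace or the string edges.
    return re.compile(r"(?<!\S)" + re.escape(word) + r"(?!\S)")

print(adaptive_word_pattern("c++").findall("using c++ language"))   # ['c++']
print(adaptive_word_pattern("c++").findall("abc++ x"))              # []
print(adaptive_word_pattern("cat").findall("concat cat cats"))      # ['cat']
print(whitespace_bounded_pattern("c++").findall("c++, c++ (c++)"))  # ['c++']
```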
As the others said, your problem isn't the + sign, which you've escaped correctly, but the \b, a zero-length assertion that matches a word boundary, i.e. a position between a word (\w) and a non-word (\W) char.
There is also another mistake in your regex: you want to match C (uppercase) against c++ (lowercase). To do so, change your regex to /\bc\+\+/ or use the i modifier to match case-insensitively: /\bc\+\+/i

Python regex not matching at word boundary as required

I want to match a set of patterns at a "word boundary", but the patterns may have a prefix [@#] which should get matched if present.
I'm using the following regex pattern in Python.
r"\b[@#]?(abc|ef|ghij)\b"
Sample text is: @abc is a pattern which should match. also abc should match. And finally #ef
In this text only abc, abc and ef are matched, without the prefix, and not @abc and #ef as I want.
You need to put the word boundary after [@#], which you made optional. In the @abc part, the position before @ lies between the start of the line and a non-word character, so it is a non-word boundary \B, not a word boundary \b. Note that \b matches between a word character and a non-word character, or vice versa. \B matches between two word characters or two non-word characters.
r"[@#]?\b(abc|ef|ghij)\b"
If you put \b before [@#], it would match strings like foo@abc or bar#abc, because there a word boundary actually exists before @ and #.
DEMO
Example:
>>> s = "@abc is a pattern which should match. also abc should match. And finally #ef"
>>> re.findall(r'[@#]?\b(?:abc|ef|ghij)\b', s)
['@abc', 'abc', '#ef']
@abc
^ ^
\B \b
The group (@#)? is saying that the word may begin with "@#". What you are looking for is [@#]? which is saying the first character is @ or #, but it is not required. If you need the match to be part of a group you could use (@|#)?.
I will also throw in my version of the fixed regex without a capturing group (since you do not seem to be using them):
r'[@#]?\b(?:abc|ef|ghij)\b'
See my demo.
EXPLANATION: [@#] are non-word characters and are optional due to ?. \b is not optional, so in the original pattern the regex engine has to satisfy it first: it matches right after @ or #, but those characters are not part of the match since \b is always zero-width.
Here are more details on \b from Regular-Expressions.info:
The metacharacter \b is an anchor like the caret and the dollar sign.
It matches at a position that is called a "word boundary". This match
is zero-length.
There are three different positions that qualify as word boundaries:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
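The three kinds of positions listed in the quote can be made visible by collecting every offset where \b matches. A minimal sketch; the helper name and sample strings are made up:

```python
import re

def boundary_positions(text):
    # Offsets of every zero-width \b match in text.
    return [m.start() for m in re.finditer(r"\b", text)]

print(boundary_positions("ab, cd"))  # [0, 2, 4, 6]
print(boundary_positions("+a+"))     # [1, 2]
print(boundary_positions("++"))      # []
```

In "+a+" the string edges are not boundaries because the first and last characters are not word characters, which is the same reason a \b placed before a non-word prefix character at the start of a string does not match.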
