Regular expression, X out of Y critereon to match [duplicate]

Regular expression, X out of Y critereon to match [duplicate] - python

My client has requested that passwords on their system must following a specific set of validation rules, and I'm having great difficulty coming up with a "nice" regular expression.
The rules I have been given are...
Minimum of 8 character
Allow any character
Must have at least one instance from three of the four following character types...
Upper case character
Lower case character
Numeric digit
"Special Character"
When I pressed more, "Special Characters" are literally everything else (including spaces).
I can easily check for at least one instance for all four, using the following...
^(?=.*?[A-Z])(?=.*?[a-z])(?=.*?\d)(?=.*?[^a-zA-Z0-9]).{8,}$
The following works, but it's horrible and messy...
^((?=.*?[A-Z])(?=.*?[a-z])(?=.*?\d)|(?=.*?[A-Z])(?=.*?[a-z])(?=.*?[^a-zA-Z0-9])|(?=.*?[A-Z])(?=.*?\d)(?=.*?[^a-zA-Z0-9])|(?=.*?[a-z])(?=.*?\d)(?=.*?[^a-zA-Z0-9])).{8,}$
So you don't have to work it out yourself, the above is checking for (1,2,3|1,2,4|1,3,4|2,3,4) which are the 4 possible combinations of the 4 groups (where the number relates to the "types" in the set of rules).
Is there a "nicer", cleaner or easier way of doing this?
(Please note, this is going to be used in an <asp:RegularExpressionValidator> control in an ASP.NET website, so therefore needs to be a valid regex for both .NET and javascript.)

It's not much of a better solution, but you can reduce [^a-zA-Z0-9] to [\W_], since a word character is all letters, digits and the underscore character. I don't think you can avoid the alternation when trying to do this in a single regex. I think you have pretty much have the best solution.
One slight optimization is that \d*[a-z]\w_*|\d*[A-Z]\w_* ~> \d*[a-zA-Z]\w_*, so I could remove one of the alternation sets. If you only allowed 3 out of 4 this wouldn't work, but since \d*[A-Z][a-z]\w_* was implicitly allowed it works.
(?=.{8,})((?=.*\d)(?=.*[a-z])(?=.*[A-Z])|(?=.*\d)(?=.*[a-zA-Z])(?=.*[\W_])|(?=.*[a-z])(?=.*[A-Z])(?=.*[\W_])).*
Extended version:
(?=.{8,})(
(?=.*\d)(?=.*[a-z])(?=.*[A-Z])|
(?=.*\d)(?=.*[a-zA-Z])(?=.*[\W_])|
(?=.*[a-z])(?=.*[A-Z])(?=.*[\W_])
).*
Because of the fourth condition specified by the OP, this regular expression will match even unprintable characters such as new lines. If this is unacceptable then modify the set that contains \W to allow for more specific set of special characters.

I'd like to improve the accepted solution with this one
^(?=.{8,})(
(?=.*[^a-zA-Z\s])(?=.*[a-z])(?=.*[A-Z])|
(?=.*[^a-zA-Z0-9\s])(?=.*\d)(?=.*[a-zA-Z])
).*$

The above Regex worked well for most scenarios except for strings such as "AAAAAA1$", "$$$$$$1a"
This could be an issue only in iOS ( Objective C and Swift) that the regex "\d" has issues
The following fix worked in iOS, i.e changing to [0-9] for digits
^((?=.*?[A-Z])(?=.*?[a-z])(?=.*?[0-9])|(?=.*?[A-Z])(?=.*?[a-z])(?=.*?[^a-zA-Z0-9])|(?=.*?[A-Z])(?=.*?[0-9])(?=.*?[^a-zA-Z0-9])|(?=.*?[a-z])(?=.*?[0-9])(?=.*?[^a-zA-Z0-9])).{8,}$

Password must meet at least 3 out of the following 4 complexity rules,
[at least 1 uppercase character (A-Z) at least 1 lowercase character (a-z) at least 1 digit (0-9) at least 1 special character — do not forget to treat space as special characters too]
at least 10 characters
at most 128 characters
not more than 2 identical characters in a row (e.g., 111 not allowed)
'^(?!.(.)\1{2}) ((?=.[a-z])(?=.[A-Z])(?=.[0-9])|(?=.[a-z])(?=.[A-Z])(?=.[^a-zA-Z0-9])|(?=.[A-Z])(?=.[0-9])(?=.[^a-zA-Z0-9])|(?=.[a-z])(?=.[0-9])(?=.*[^a-zA-Z0-9])).{10,127}$'
(?!.*(.)\1{2})
(?=.[a-z])(?=.[A-Z])(?=.*[0-9])
(?=.[a-z])(?=.[A-Z])(?=.*[^a-zA-Z0-9])
(?=.[A-Z])(?=.[0-9])(?=.*[^a-zA-Z0-9])
(?=.[a-z])(?=.[0-9])(?=.*[^a-zA-Z0-9])
.{10,127}

Related

Regular expression that accepts tokens of three or more alphabetical characters

I'm trying to build a TFIDVectorizer that only accepts tokens of 3 or more alphabetical characters using TFIdfVectorizer(token_pattern="(?u)\\b\\D\\D\\D+\\b")
But it doesn't behave correctly, I know token_pattern="(?u)\\b\\w\\w\\w+\\b" accepts tokens of 3 or more alphanumerical characters, so I just don't understand why the former is not working.
What am I missing?

The problem lies in using the \D metacharacter, as it's actually for matching any non-digit character, rather than any alphabetical character. From Python docs:
You can go instead with:
token_pattern="(?i)[a-z]{3,}"
Explanation:
(?i) — inline flag to make matching case-insensitive,
[a-z] — matches any Latin letter,
{3,} — makes the previous token match three or more times (greedily, i.e., as many times as possible).
I hope this answers your question. :)

Do character classes count as groups in regular expressions?

A small project I got assigned is supposed to extract website URLs from given text. Here's how the most relevant portion of it looks like :
webURLregex = re.compile(r'''(
(https://|http://)
[a-zA-Z0-9.%+-\\/_]+
)''',re.VERBOSE)
This does do its job properly, but I noticed that it also includes the ','s and '.' in URL strings it prints. So my first question is, how do I make it exclude any punctuation symbols in the end of the string it detects ?
My second question is referring to the title itself ( finally ), but doesn't really seem to affect this particular program I'm working on : Do character classes ( in this case [a-zA-Z0-9.%+-\/_]+ ) count as groups ( group[3] in this case ) ?
Thanks in advance.

To exclude some symbols at the end of string you can use negative lookbehind. For example, to disallow . ,:
.*(?<![.,])$

answering in reverse:
No, character classes are just shorthand for bracketed text. They don't provide groups in the same way that surrounding with parenthesis would. They only allow the regular expression engine to select the specified characters -- nothing more, nothing less.
With regards to finding comma and dot: Actually, I see the problem here, though the below may still be valuable, so I'll leave it. Essentially, you have this: [a-zA-Z0-9.%+-\\/_]+ the - character has special meaning: everything between these two characters -- by ascii code. so [A-a] is a valid range. It include A-Z, but also a bunch of other characters that aren't A-Z. If you want to include - in the range, then it needs to be the last character: [a-zA-Z0-9.%+\\/_-]+ should work
For comma, I actually don't see it represented in your regex, so I can't comment specifically on that. It shouldn't be allowed anywhere in the url. In general though, you'll just want to add more groups/more conditions.
First, break apart the url into the specifc groups you'll want:
(scheme)://(domain)(endpoint)
Each section gets a different set of requirements: e.g. maybe domain needs to end with a slash:
[a-zA-Z0-9]+\.com/ should match any domain that uses an alphanumeric character, and ends -- specifically -- with .com (note the \., otherwise it'll capture any single character followed by com/
For the endpoint section, you'll probably still want to allow special characters, but if you're confident you don't want the url to end with, say, a dot, then you could do something [A-Za-z0-9] -- note the lack of a dot here, plus, it's length -- only a single character. This will change the rest of your regex, so you need to think about that.
A couple of random thoughts:
If you're confident you want to match the whole line, add a $ to the end of the regex, to signify the end of the line. One possibility here is that your regex does match some portion of the text, but ignores the junk at the end, since you didn't say to read the whole line.
Regexes get complicated really fast -- they're kind of write-only code. Add some comments to help. E.g.
web_url_regex = re.compile(
r'(http://|https://)' # Capture the scheme name
r'([a-zA-Z0-9.%+-\\/_])' # Everything else, apparently
)
Do not try to be exhaustive in your validation -- as noted, urls are hard to validate because you can't know for sure that one is valid. But the form is pretty consistent, as laid out above: scheme, domain, endpoint (and query string)

To answer the second question first, no a character class is not a group (unless you explicitly make it into one by putting it in parentheses).
Regarding the first question of how to make it exclude the punctuation symbols at the end, the code below should answer that.
Firstly though, your regex had an issue separate from the fact that it was matching the final punctuation, namely that the last - does not appear to be intended as defining a range of characters (see footnote below re why I believe this to be the case), but was doing so. I've moved it to the end of the character class to avoid this problem.
Now a character class to match the final character is added at the end of the regexp, which is the same as the previous character class except that it does not include . (other punctuation is now already not included). So the matched pattern cannot end in .. The + (one or more) on the previous character class is now reduced to * (zero or more).
If for any reason the exact set of characters matched needs tweaking, then the same principle can still be employed: match a single character at the end from a reduced set of possibilities, preceded by any number of characters from a wider set which includes characters that are permitted to be included but not at the end.
import re
webURLregex = re.compile(r'''(
(https://|http://)
[a-zA-Z0-9.%+\\/_-]*
[a-zA-Z0-9%+\\/_-]
)''',re.VERBOSE)
str = "... at http://www.google.com/. It says"
m = re.search(webURLregex, str)
if m:
print(m.group())
Outputs:
http://www.google.com/
[*] The observation that the second - does not appear to be intended to define a character range is based on the fact that, if it was, such a range would be from 056-134 (octal) which would include also the alphabetical characters, making the a-zA-Z redundant.

Regex backreference to match opposite case

Before I begin — it may be worth stating, that: this technically does not have to be solved using a Regex, it's just that I immediately thought of a Regex when I started solving this problem, and I'm interested in knowing whether it's possible to solve using a Regex.
I've spent the last couple hours trying to create a Regex that does the following.
The regex must match a string that is ten characters long, iff the first five characters and last five characters are identical but each individual character is opposite in case.
In other words, if you take the first five characters, invert the case of each individual character, that should match the last five characters of the string.
For example, the regex should match abCDeABcdE, since the first five characters and the last five characters are the same, but each matching character is opposite in case. In other words, flip_case("abCDe") == "ABcdE"
Here are a few more strings that should match:
abcdeABCDE, abcdEABCDe, zYxWvZyXwV.
And here are a few that shouldn't match:
abcdeABCDZ, although the case is opposite, the strings themselves do not match.
abcdeABCDe, is a very close match, but should not match since the e's are not opposite in case.
Here is the first regex I tried, which is obviously wrong since it doesn't account for the case-swap process.
/([a-zA-Z]{5})\1/g
My next though was whether the following is possible in a regex, but I've been reading several Regex tutorials and I can't seem to find it anywhere.
/([A-Z])[\1+32]/g
This new regex (that obviously doesn't work) is supposed to match a single uppercase letter, immediately followed by itself-plus-32-ascii, so, in other words, it should match an uppercase letter followed immediately by its' lowercase counterpart. But, as far as I'm concerned, you cannot "add an ascii value" to backreference in a regex.
And, bonus points to whoever can answer this — in this specific case, the string in question is known to be 10 characters long. Would it be possible to create a regex that matches strings of an arbitrary length?

You want to use the following pattern with the Python regex module:
^(?=(\p{L})(\p{L})(\p{L})(\p{L})(\p{L}))(?=.*(?!\1)(?i:\1)(?!\2)(?i:\2)(?!\3)(?i:\3)(?!\4)(?i:\4)(?!\5)(?i:\5)$)
See the regex demo
Details
^ - start of string
(?=(\p{L})(\p{L})(\p{L})(\p{L})(\p{L})) - a positive lookahead with a sequence of five capturing groups that capture the first five letters individually
(?=.*(?!\1)(?i:\1)(?!\2)(?i:\2)(?!\3)(?i:\3)(?!\4)(?i:\4)(?!\5)(?i:\5)$) - a ppositive lookahead that make sure that, at the end of the string, there are 5 letters that are the same as the ones captured at the start but are of different case.
In brief, the first (\p{L}) in the first lookahead captures the first a in abcdeABCDE and then, inside the second lookahead, (?!\1)(?i:\1) makes sure the fifth char from the end is the same (with the case insensitive mode on), and (?!\1) negative lookahead make sure this letter is not identical to the one captured.
The re module does not support inline modifier groups, so this expression won't work with that moduue.
Python regex based module demo:
import regex
strs = ['abcdeABCDE', 'abcdEABCDe', 'zYxWvZyXwV', 'abcdeABCDZ', 'abcdeABCDe']
rx = r'^(?=(\p{L})(\p{L})(\p{L})(\p{L})(\p{L}))(?=.*(?!\1)(?i:\1)(?!\2)(?i:\2)(?!\3)(?i:\3)(?!\4)(?i:\4)(?!\5)(?i:\5)$)'
for s in strs:
print("Testing {}...".format(s))
if regex.search(rx, s):
print("Matched")
Output:
Testing abcdeABCDE...
Matched
Testing abcdEABCDe...
Matched
Testing zYxWvZyXwV...
Matched
Testing abcdeABCDZ...
Testing abcdeABCDe...

Regex matching Unicode variable names

In Python 2, a Python variable name contains only ASCII letters, numbers and underscores, and it must not start with a number. Thus,
re.search(r'[_a-zA-Z][_a-zA-Z0-9]*', s)
will find a matching Python name in the str s.
In Python 3, the letters are no longer restricted to ASCII. I am in search for a new regex which will match any and all legal Python 3 variable names.
According to the docs, \w in a regex will match any Unicode word literal, including numbers and the underscore. I am however unsure whether this character set contains exactly those characters which might be used in variable names.
Even if the character set \w contains exactly the characters from which Python 3 variable names may legally be constructed, how do I use it to create my regex? Using just \w+ will also match "words" which start with a number, which is no good. I have the following solution in mind,
re.search(r'(\w&[^0-9])\w*', s)
where & is the "and" operator (just like | is the "or" operator). The parentheses will thus match any word literal which at the same time is not a number. The problem with this is that the & operator does not exist, and so I'm stuck with no solution.
Edit
Though the "double negative" trick (as explained in the answer by Patrick Artner below) can also be found in this question, note that this only partly answers my question. Using [^\W0-9]\w* only works if I am guaranteed that \w exactly matches the legal Unicode characters, plus the numbers 0-9. I would like a source of this knowledge, or some other regex which gets the job done.

You can use a double negative - \W is anything that \w is not - just disallow it to allow any \w:
[^\W0-9]\w*
essentially using any not - non-wordcharacter except 0-9 followed by any word character any number of times.
Doku: regular-expression-syntax

You could try using
^(?![0-9])\w+$
Which will not partial match invalid variable names
Alternatively, if you don't need to use regex. str.isidentifier() will probably do what you want.

Extracting ICCID from a string using regex

I'm trying to return and print the ICCID of a SIM card in a device; the SIM cards are from various suppliers and therefore of differing lengths (either 19 or 20 digits). As a result, I'm looking for a regular expression that will extract the ICCID (in a way that's agnostic to non-word characters immediately surrounding it).
Given that an ICCID is specified as a 19-20 digit string starting with "89", I've simply gone for:
(89\d{17,18})
This was the most successful pattern that I'd tested (along with some patterns rejected for reasons below).
In the string that I'm extracting it from, the ICCID is immediately followed by a carriage return and then a line feed, but some testing against terminating it with \r, \n, or even \b failed to work (the program that I'm using is an in-house one built on python, so I suspect that's what it's using for regex). Also, simply using (\d{19,20}) ended up extracting the last 19 digits of a 20-digit ICCID (as the third and last valid match). Along the same lines, I ruled out (\d{19,20})? in principle, as I expect that to finish when it finds the first 19 digits.
So my question is: Should I use the pattern I've chosen, or is there a better expression (not using non-word characters to frame the string) that will return the longest substring of a variable-length string of digits?

If the engine behind the scenes is really Python, and there can be any non-digits chars around the value you need to extract, use lookarounds to restrict the context around the values:
(?<!\d)89\d{17,18}(?!\d)
^^^^^^^ ^^^^^^
The (?<!\d) loobehind will require the absense of a digit before the match and (?!\d) negative lookahead will require the absence of a digit after that value.
See this regex demo

I'd go for
89\d{17,18}[^\d]
This should prefer 18 digits, but 17 would also suffice. After that, no more other numeric characters would be allowed.
Only limitation: there must be at least one more character after the ICCID (which should be okay from what you described).
Be aware that any longer number sequence carrying "89" followed by 17 or 18 numerical characters would also match.

(\d+)\D+
seems like it would do the trick readily. (\d+ ) would capture 20 numbers. \D+ would match anything else afterwards.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.