Extracting ICCID from a string using regex

I'm trying to return and print the ICCID of a SIM card in a device; the SIM cards are from various suppliers and therefore of differing lengths (either 19 or 20 digits). As a result, I'm looking for a regular expression that will extract the ICCID (in a way that's agnostic to non-word characters immediately surrounding it).
Given that an ICCID is specified as a 19-20 digit string starting with "89", I've simply gone for:
(89\d{17,18})
This was the most successful pattern that I'd tested (along with some patterns rejected for reasons below).
In the string that I'm extracting it from, the ICCID is immediately followed by a carriage return and then a line feed, but some testing against terminating it with \r, \n, or even \b failed to work (the program that I'm using is an in-house one built on python, so I suspect that's what it's using for regex). Also, simply using (\d{19,20}) ended up extracting the last 19 digits of a 20-digit ICCID (as the third and last valid match). Along the same lines, I ruled out (\d{19,20})? in principle, as I expect that to finish when it finds the first 19 digits.
So my question is: Should I use the pattern I've chosen, or is there a better expression (not using non-word characters to frame the string) that will return the longest substring of a variable-length string of digits?

If the engine behind the scenes is really Python, and there can be any non-digit chars around the value you need to extract, use lookarounds to restrict the context around the values:
(?<!\d)89\d{17,18}(?!\d)
^^^^^^^           ^^^^^^
The (?<!\d) negative lookbehind requires the absence of a digit before the match, and the (?!\d) negative lookahead requires the absence of a digit after it.
See this regex demo
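
A minimal sketch of how this pattern could be used with Python's re module. The function name and the sample modem response are made up for illustration; only the pattern itself comes from the answer.

```python
import re

# Pattern from the answer: "89" plus 17 or 18 more digits,
# not directly preceded or followed by another digit.
ICCID_RE = re.compile(r"(?<!\d)89\d{17,18}(?!\d)")

def extract_iccid(text):
    """Return the first ICCID-like digit run in text, or None."""
    match = ICCID_RE.search(text)
    return match.group(0) if match else None

# Hypothetical device response: a 20-digit ICCID followed by CR LF.
print(extract_iccid("+CCID: 89441000301234567890\r\nOK"))
```

Because the lookarounds reject any digit on either side, a 21-digit run yields no match at all rather than a truncated 19- or 20-digit one.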

I'd go for
89\d{17,18}[^\d]
This should prefer 18 digits, but 17 would also suffice. After that, no further numeric characters would be allowed.
Only limitation: there must be at least one more character after the ICCID (which should be okay from what you described).
Be aware that any longer number sequence carrying "89" followed by 17 or 18 numerical characters would also match.

(\d+)\D+
seems like it would do the trick readily. (\d+) captures the entire run of digits (all 20 for a 20-digit ICCID), and \D+ matches the non-digit characters that follow.

Related

Regex shorthand for matching special characters

I'm building a password validator for the back-end of a web app and I'm using the pretty standard uppercase, lowercase, digit, min length and special character requirement, and I'm looking to refactor a bit the regex so it's more compact. Is there a way to search for a match of any special character without having a pretty long regex with every special character written?
r'^(?=.*[\d])(?=.*[A-Z])(?=.*[a-z])(?=.*[special chars regex])[\w\d special chars regex]{8,255}$'
So far my attempts have been around the idea of having a negation set like \S, which works to match them but since they also match digits, then it's still allowing for passwords with no special characters.
EDIT:
What I mean with special characters can be summarized as, I want to catch characters with ASCII index between 33 and 126, excluding letters and digits (indexes 48 ~ 57 for digits, 65 ~ 90 for upper case letters and 97 ~ 122 for lower case letters) as those are indeed part of pre-existing regex short hands such as \w and \d.
Here's an ASCII chart for reference.

[^\w\s\d] will match a single character that's not a word character (lowercase or uppercase), a number, or a whitespace character. Note that \w also covers digits and the underscore, so _ does not count as special here.
That would mean you would in theory use [\w\d[^\w\d\s]] to indicate all the characters that a password can be composed of, but since that union doesn't seem to want to parse correctly, you can specify the Unicode range explicitly (though in hex, not decimal):
(?=.*[\d])(?=.*[A-Z])(?=.*[a-z])(?=.*[^\w\d\s])[\u0021-\u007E]{8,255}
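
A sketch of one way to avoid spelling out the special characters by hand in Python, assuming "special" means exactly the ASCII 33-126 characters that are not letters or digits, as in the question's edit. It builds the class from string.punctuation rather than from [^\w\d\s], so the underscore counts as special; the names SPECIAL, PASSWORD_RE and is_valid are illustrative only.

```python
import re
import string

# string.punctuation holds the ASCII 33-126 characters that are neither
# letters nor digits; re.escape makes it safe inside a character class.
SPECIAL = "[" + re.escape(string.punctuation) + "]"

PASSWORD_RE = re.compile(
    r"(?=.*[0-9])(?=.*[A-Z])(?=.*[a-z])(?=.*" + SPECIAL + r")[!-~]{8,255}"
)

def is_valid(password):
    """True if the whole password satisfies every requirement."""
    return PASSWORD_RE.fullmatch(password) is not None

print(is_valid("Passw0rd!"), is_valid("Password1"))
```

The final [!-~] class is the same range as [\u0021-\u007E], written with literal characters.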

Regular expression, X out of Y criteria to match [duplicate]

My client has requested that passwords on their system must follow a specific set of validation rules, and I'm having great difficulty coming up with a "nice" regular expression.
The rules I have been given are...
Minimum of 8 characters
Allow any character
Must have at least one instance from three of the four following character types...
Upper case character
Lower case character
Numeric digit
"Special Character"
When I pressed more, "Special Characters" are literally everything else (including spaces).
I can easily check for at least one instance for all four, using the following...
^(?=.*?[A-Z])(?=.*?[a-z])(?=.*?\d)(?=.*?[^a-zA-Z0-9]).{8,}$
The following works, but it's horrible and messy...
^((?=.*?[A-Z])(?=.*?[a-z])(?=.*?\d)|(?=.*?[A-Z])(?=.*?[a-z])(?=.*?[^a-zA-Z0-9])|(?=.*?[A-Z])(?=.*?\d)(?=.*?[^a-zA-Z0-9])|(?=.*?[a-z])(?=.*?\d)(?=.*?[^a-zA-Z0-9])).{8,}$
So you don't have to work it out yourself, the above is checking for (1,2,3|1,2,4|1,3,4|2,3,4) which are the 4 possible combinations of the 4 groups (where the number relates to the "types" in the set of rules).
Is there a "nicer", cleaner or easier way of doing this?
(Please note, this is going to be used in an <asp:RegularExpressionValidator> control in an ASP.NET website, so therefore needs to be a valid regex for both .NET and javascript.)

It's not much of a better solution, but you can reduce [^a-zA-Z0-9] to [\W_], since a word character is any letter, digit, or the underscore. I don't think you can avoid the alternation when trying to do this in a single regex. I think you pretty much have the best solution.
One slight optimization is that \d*[a-z]\w_*|\d*[A-Z]\w_* ~> \d*[a-zA-Z]\w_*, so I could remove one of the alternation sets. If you only allowed 3 out of 4 this wouldn't work, but since \d*[A-Z][a-z]\w_* was implicitly allowed it works.
(?=.{8,})((?=.*\d)(?=.*[a-z])(?=.*[A-Z])|(?=.*\d)(?=.*[a-zA-Z])(?=.*[\W_])|(?=.*[a-z])(?=.*[A-Z])(?=.*[\W_])).*
Extended version:
(?=.{8,})(
(?=.*\d)(?=.*[a-z])(?=.*[A-Z])|
(?=.*\d)(?=.*[a-zA-Z])(?=.*[\W_])|
(?=.*[a-z])(?=.*[A-Z])(?=.*[\W_])
).*
Because of the fourth condition specified by the OP, this regular expression will match even unprintable characters such as new lines. If this is unacceptable then modify the set that contains \W to allow for more specific set of special characters.
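
Where the check can run in code rather than in a single-regex validator control, the "3 out of 4" rule can be expressed by counting the character types that are present. This is a sketch in Python under that assumption, not a replacement for the .NET/JavaScript regex the question asks for; the function and parameter names are made up for illustration.

```python
import re

# One pattern per character type from the question's rules.
CHARACTER_TYPES = [
    re.compile(r"[A-Z]"),         # upper case
    re.compile(r"[a-z]"),         # lower case
    re.compile(r"[0-9]"),         # numeric digit
    re.compile(r"[^a-zA-Z0-9]"),  # "special": everything else, spaces included
]

def is_valid_password(password, min_length=8, required_types=3):
    """True if the password is long enough and uses enough character types."""
    if len(password) < min_length:
        return False
    types_present = sum(1 for rx in CHARACTER_TYPES if rx.search(password))
    return types_present >= required_types

print(is_valid_password("Abcdefg1"), is_valid_password("abcdefg1"))
```

Changing required_types covers any other "X out of Y" requirement without enumerating the combinations.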

I'd like to improve the accepted solution with this one
^(?=.{8,})(
(?=.*[^a-zA-Z\s])(?=.*[a-z])(?=.*[A-Z])|
(?=.*[^a-zA-Z0-9\s])(?=.*\d)(?=.*[a-zA-Z])
).*$

The regex above worked well for most scenarios, except for strings such as "AAAAAA1$" and "$$$$$$1a".
This may be an issue specific to iOS (Objective-C and Swift), where \d in the regex appears to cause problems.
The following fix worked on iOS: use [0-9] instead of \d for digits.
^((?=.*?[A-Z])(?=.*?[a-z])(?=.*?[0-9])|(?=.*?[A-Z])(?=.*?[a-z])(?=.*?[^a-zA-Z0-9])|(?=.*?[A-Z])(?=.*?[0-9])(?=.*?[^a-zA-Z0-9])|(?=.*?[a-z])(?=.*?[0-9])(?=.*?[^a-zA-Z0-9])).{8,}$

Password must meet at least 3 out of the following 4 complexity rules,
[at least 1 uppercase character (A-Z) at least 1 lowercase character (a-z) at least 1 digit (0-9) at least 1 special character — do not forget to treat space as special characters too]
at least 10 characters
at most 128 characters
not more than 2 identical characters in a row (e.g., 111 not allowed)
'^(?!.*(.)\1{2})((?=.*[a-z])(?=.*[A-Z])(?=.*[0-9])|(?=.*[a-z])(?=.*[A-Z])(?=.*[^a-zA-Z0-9])|(?=.*[A-Z])(?=.*[0-9])(?=.*[^a-zA-Z0-9])|(?=.*[a-z])(?=.*[0-9])(?=.*[^a-zA-Z0-9])).{10,128}$'
(?!.*(.)\1{2})
(?=.*[a-z])(?=.*[A-Z])(?=.*[0-9])
(?=.*[a-z])(?=.*[A-Z])(?=.*[^a-zA-Z0-9])
(?=.*[A-Z])(?=.*[0-9])(?=.*[^a-zA-Z0-9])
(?=.*[a-z])(?=.*[0-9])(?=.*[^a-zA-Z0-9])
.{10,128}
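
A sketch of these rules as a Python check. It assumes every lookahead is meant to scan with .* and sets the upper bound to 128 to follow the stated maximum; both are readings of the intended pattern rather than something the answer confirms. The name is_valid is illustrative.

```python
import re

PASSWORD_RE = re.compile(
    r"(?!.*(.)\1{2})"                              # no character three times in a row
    r"(?:(?=.*[a-z])(?=.*[A-Z])(?=.*[0-9])"        # lower + upper + digit
    r"|(?=.*[a-z])(?=.*[A-Z])(?=.*[^a-zA-Z0-9])"   # lower + upper + special
    r"|(?=.*[A-Z])(?=.*[0-9])(?=.*[^a-zA-Z0-9])"   # upper + digit + special
    r"|(?=.*[a-z])(?=.*[0-9])(?=.*[^a-zA-Z0-9]))"  # lower + digit + special
    r".{10,128}"                                   # 10 to 128 characters
)

def is_valid(password):
    """True if the whole password satisfies the rules listed above."""
    return PASSWORD_RE.fullmatch(password) is not None

print(is_valid("Passw0rd!!"), is_valid("Passsw0rd!"))
```

Using fullmatch stands in for the ^ and $ anchors of the one-line version.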

Regex backreference to match opposite case

Before I begin — it may be worth stating, that: this technically does not have to be solved using a Regex, it's just that I immediately thought of a Regex when I started solving this problem, and I'm interested in knowing whether it's possible to solve using a Regex.
I've spent the last couple hours trying to create a Regex that does the following.
The regex must match a string that is ten characters long, iff the first five characters and last five characters are identical but each individual character is opposite in case.
In other words, if you take the first five characters, invert the case of each individual character, that should match the last five characters of the string.
For example, the regex should match abCDeABcdE, since the first five characters and the last five characters are the same, but each matching character is opposite in case. In other words, flip_case("abCDe") == "ABcdE"
Here are a few more strings that should match:
abcdeABCDE, abcdEABCDe, zYxWvZyXwV.
And here are a few that shouldn't match:
abcdeABCDZ, although the case is opposite, the strings themselves do not match.
abcdeABCDe, is a very close match, but should not match since the e's are not opposite in case.
Here is the first regex I tried, which is obviously wrong since it doesn't account for the case-swap process.
/([a-zA-Z]{5})\1/g
My next thought was whether the following is possible in a regex, but I've been reading several Regex tutorials and I can't seem to find it anywhere.
/([A-Z])[\1+32]/g
This new regex (that obviously doesn't work) is supposed to match a single uppercase letter, immediately followed by itself-plus-32-ascii, so, in other words, it should match an uppercase letter followed immediately by its lowercase counterpart. But, as far as I can tell, you cannot "add an ascii value" to a backreference in a regex.
And, bonus points to whoever can answer this — in this specific case, the string in question is known to be 10 characters long. Would it be possible to create a regex that matches strings of an arbitrary length?

You want to use the following pattern with the Python regex module:
^(?=(\p{L})(\p{L})(\p{L})(\p{L})(\p{L}))(?=.*(?!\1)(?i:\1)(?!\2)(?i:\2)(?!\3)(?i:\3)(?!\4)(?i:\4)(?!\5)(?i:\5)$)
See the regex demo
Details
^ - start of string
(?=(\p{L})(\p{L})(\p{L})(\p{L})(\p{L})) - a positive lookahead with a sequence of five capturing groups that capture the first five letters individually
(?=.*(?!\1)(?i:\1)(?!\2)(?i:\2)(?!\3)(?i:\3)(?!\4)(?i:\4)(?!\5)(?i:\5)$) - a positive lookahead that makes sure that, at the end of the string, there are 5 letters that are the same as the ones captured at the start but are of different case.
In brief, the first (\p{L}) in the first lookahead captures the first a in abcdeABCDE and then, inside the second lookahead, (?i:\1) makes sure the fifth char from the end is the same letter (with case-insensitive mode on), while the (?!\1) negative lookahead makes sure this letter is not identical to the one captured.
The re module does not support \p{L} (and only gained inline modifier groups such as (?i:...) in Python 3.6), so this expression won't work with that module.
Python regex based module demo:
import regex  # third-party module: pip install regex
strs = ['abcdeABCDE', 'abcdEABCDe', 'zYxWvZyXwV', 'abcdeABCDZ', 'abcdeABCDe']
rx = r'^(?=(\p{L})(\p{L})(\p{L})(\p{L})(\p{L}))(?=.*(?!\1)(?i:\1)(?!\2)(?i:\2)(?!\3)(?i:\3)(?!\4)(?i:\4)(?!\5)(?i:\5)$)'
for s in strs:
    print("Testing {}...".format(s))
    if regex.search(rx, s):
        print("Matched")
Output:
Testing abcdeABCDE...
Matched
Testing abcdEABCDe...
Matched
Testing zYxWvZyXwV...
Matched
Testing abcdeABCDZ...
Testing abcdeABCDe...
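
Since the question notes that a regex is not strictly required, here is a plain-Python sketch of the same check that also handles strings of arbitrary even length. It relies on str.swapcase instead of the regex technique above, and the function name is made up for illustration.

```python
def is_case_mirrored(s):
    """True if the second half of s is the first half with each letter's case flipped."""
    half, remainder = divmod(len(s), 2)
    # Reject odd lengths, the empty string, and anything that is not all letters.
    if remainder or half == 0 or not s.isalpha():
        return False
    return s[:half].swapcase() == s[half:]

for s in ['abcdeABCDE', 'abcdEABCDe', 'zYxWvZyXwV', 'abcdeABCDZ', 'abcdeABCDe']:
    print(s, is_case_mirrored(s))
```

To restrict it to the ten-character case from the question, add a len(s) == 10 check.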

Regex can not find all the required expressions

I'm learning Python and also English. I'm probably taking a very amateur approach to my problem.
I'm trying to find a sequence of 17 numbers in .txt files. I have thousands of files and I've been creating regular expressions for the most common types of occurrences I've noticed, for example:
01582.2005.012.02.00-\r\n 3\r\n
nº 01387.2009.466.02.001\r\n
nº 01462. 2008. 030. 02. 000\r\n
nº0033620084610200-0\r\n
n. 02414.2008.023.02.001 (201...
nº 00030.2007.084.02.00-3 (2
nº 00627.2009.006.02.004\r\n
nº 0001491-6020125020
numero: 00028.2009.031.02.00-0\r\n
n 00012.2010.391.02.00-0 - 7ª tu
nº 0000695720135020402
nº 00037.2007.048.02.00-1\r\n
01113.2009.074.02.00.4.\r
proc: - 00396-25.2011.5.02-0020
n.º 0163100-53-2010.5.02.0341
nº 01230.2007.065.02.0.0-5 - 7ª tu
nº 64587.2009.\r\n 549.02.001\r\n
The regular expressions that I created were able to find the sequences in about 70% of the files, but I got to a point where, for each new regex I write, the number of sequences found is so insignificant in relation to what is missing that I feel like I'm counting sand in the desert. Some of the regexes I used were these:
search = re.search(r'((\d{5})\.?\s*(\d{4})\.?\s*(\d{3})\.?\s*(\d{2})\.?\s*(\d{2})\-?\s*(\d))', content.read())
search = re.search(r'((\d{5})\.(\d{4})\.(\d{3})\.(\d{2})\.(\d)\-(\d{2}))', content.read())
search = re.search(r'((\d{5})\.(\d{4})\.(\d{3})\.(\d{2})\.(\d{2})\.(\d))', content.read())
They could find some of the examples I gave, but not most of them. What I would like to know is how I can take a broader approach to my regex than I am doing. Thanks.
Edit: The ones I have the most trouble finding are those that have line breaks or spaces between the "-" or "\".

Not being sure if there's some underlying pattern to all those sequences I'm not seeing, I tested this maybe-too-generic one in regex101 with your input and got one full-match per each of those lines, matching the first sequence of 17 in each of them (where the numbers can have some special characters like punctuation between them).
(\d([\s,.\-\xAD_]|(\\r)|(\\n))*){17}
(or: (?:\d(?:[\s,.\-\xAD_]|(?:\\r)|(?:\\n))*){17} if you want to avoid capturing those groups. You can also wrap the whole thing in parentheses to capture the full match if you want)
Also included "\r" and "\n" literally written out (\s takes care of spaces and actual line breaks, if there were any, between numbers), since you've written them out explicitly in that input so I had to do that to consider them part of the sequence.
Basically it says: Match, 17 times in a row: A digit, followed by any amount of any of these characters or literally '\r' or '\n'. And you can add anything you'd like to consider as part of the number sequence (like, maybe slashes should also be ignored) to this part of the regex ([\s,.\-\xAD_]|(\\r)|(\\n)). Anything that isn't in that part of the regex is gonna break or separate potential sequences (so letters or punctuation I might've missed)
But not being super regex knowledgeable, I'm not all that happy with that and I really think Jay's suggestion is the best place to start: Figure out what characters are only an annoyance to you, remove those from the whole text, and hopefully looking for the sequences is easier after that.
Btw, here:
0163100-53-2010.5.02.0341
There are 20 numbers, so the 341 is not part of the full match since it comes after the first 17 (same with some other lines with more than 17 numbers in a row), and I'm not sure if that's what you want.
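
A sketch of how the non-capturing pattern could be applied in Python, with the separators stripped afterwards so that every hit comes back as 17 plain digits. The function name is illustrative, and the sample strings are taken from the question.

```python
import re

# Non-capturing pattern from the answer: 17 digits, each optionally followed
# by separators (whitespace, punctuation, or a literally written "\r" / "\n").
SEQUENCE_RE = re.compile(r"(?:\d(?:[\s,.\-\xAD_]|\\r|\\n)*){17}")

def find_sequence(text):
    """Return the first 17-digit sequence in text with separators removed, or None."""
    match = SEQUENCE_RE.search(text)
    if match is None:
        return None
    return re.sub(r"\D", "", match.group(0))

print(find_sequence("nº 00030.2007.084.02.00-3 (2"))
print(find_sequence("nº 64587.2009.\r\n 549.02.001\r\n"))
```

As noted above, a line with more than 17 digits in a row only yields its first 17.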

Understanding regex pattern used to find string between strings in html

I have the following html file:
<!-- <div class="_5ay5"><table class="uiGrid _51mz" cellspacing="0" cellpadding="0"><tbody><tr class="_51mx"><td class="_51m-"><div class="_u3y"><div class="_5asl"><a class="_47hq _5asm" href="/Dev/videos/1610110089242029/" aria-label="Who said it?" ajaxify="/Dev/videos/1610110089242029/" rel="theater">
In order to pull the string of numbers between videos/ and /", I'm using the following method that I found:
import re
Source_file = open('source.html').read()
result = re.compile('videos/(.*?)/"').search(Source_file)
print(result)
I've tried Googling an explanation for exactly how the (.*?) works in this particular implementation, but I'm still unclear. Could someone explain this to me? Is this what's known as a "non-greedy" match? If yes, what does that mean?

The ? in this context is a special operator on the repetition operators (+, *, and ?). In engines where it is available this causes the repetition to be lazy or non-greedy or reluctant or other such terms. Typically repetition is greedy which means that it should match as much as possible. So you have three types of repetition in most modern perl-compatible engines:
.* # Match any character zero or more times
.*? # Match any character zero or more times until the next match (reluctant)
.*+ # Match any character zero or more times and don't stop matching! (possessive)
More information can be found here: http://www.regular-expressions.info/repeat.html#lazy for reluctant/lazy and here: http://www.regular-expressions.info/possessive.html for possessive (which I'll skip discussing in this answer).
Suppose we have the string aaaa. We can match all of the a's with /(a+)a/. Literally this is
match one or more a's followed by an a.
This will match aaaa. The regex is greedy and will match as many a's as possible. The first submatch is aaa.
If we use the regex /(a+?)a/ this is
reluctantly match one or more as followed by an a
or
match one or more as until we reach another a
That is, only match what we need. So in this case the match is aa and the first submatch is a. We only need to match one a to satisfy the repetition and then it is followed by an a.
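
The aaaa example can be checked quickly in Python; this is only a sketch using the re module, with variable names chosen for illustration.

```python
import re

greedy = re.search(r"(a+)a", "aaaa")
lazy = re.search(r"(a+?)a", "aaaa")

# group(0) is the whole match, group(1) the first submatch.
print(greedy.group(0), greedy.group(1))
print(lazy.group(0), lazy.group(1))
```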
This comes up a lot when using regex to match within html tags, quotes and the suchlike -- usually reserved for quick and dirty operations. That is to say using regex to extract from very large and complex html strings or quoted strings with escape sequence can cause a lot of problems but it's perfectly fine for specific use cases. So in your case we have:
/Dev/videos/1610110089242029/
The expression needs to match videos/ followed by zero or more characters followed by /". If there is only one videos URL there that's just fine without being reluctant.
However we have
/videos/1610110089242029/" ... ajaxify="/Dev/videos/1610110089242029/"
Without reluctance, the regex will match:
1610110089242029/" ... ajaxify="/Dev/videos/1610110089242029
It tries to match as much as possible and / and " satisfy . just fine. With reluctance, the matching stops at the first /" (actually it backtracks but you can read about that separately). Thus you only get the part of the url you need.
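
A short Python sketch comparing the two forms on the anchor tag from the question. The html variable holds just that tag, and the variable names are illustrative.

```python
import re

html = ('<a class="_47hq _5asm" href="/Dev/videos/1610110089242029/" '
        'aria-label="Who said it?" ajaxify="/Dev/videos/1610110089242029/" rel="theater">')

# Reluctant: stops at the first /" after videos/
lazy = re.search(r'videos/(.*?)/"', html).group(1)
# Greedy: runs on to the last /" in the string
greedy = re.search(r'videos/(.*)/"', html).group(1)

print(lazy)
print(greedy)
```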

It can be explained in a simple way:
.: match anything (any character),
*: any number of times (at least zero times),
?: as few times as possible (hence non-greedy).
videos/(.*?)/"
as a regular expression matches (for example)
videos/1610110089242029/"
and the first capturing group returns 1610110089242029, because any of the digits is part of “any character” and there are at least zero characters in it.
The ? causes something like this:
videos/1610110089242029/" something else … "videos/2387423470237509/"
to properly match as 1610110089242029 and 2387423470237509 instead of as 1610110089242029/" something else … "videos/2387423470237509, hence “as few times as possible”, hence “non-greedy”.

The . means any character. The * means any number of times, including zero. The ? does indeed mean non-greedy; that means that it will try to capture as few characters as possible, i.e., if the regex encounters a /, it could match it with the ., but it would rather not because the .* is non-greedy, and since the next character in the regex is happy to match /, the . doesn't have to. If you didn't have the ?, the .* would eat up the whole rest of the line, because it tries to match as many characters as possible, and would then back up only as far as the last /" it can find.
