Regex - how to exclude 4 digit number from wider numeric pattern - python

I tried to browse and search whether this specific question has been posted previously, so I hope I'm not asking an obvious one here.
My problem: I have a regex with a few different criteria for pattern matching, separated by pipes. I'm OK with all of them except one, where I basically want to:
Find any number between 4 and 6 digits long (regardless of its position in the string)
Exclude from this pattern any number that looks like a year in this century (so starting with 20 and followed by two digits)
So for example, I would like to match 4149, 20259 and 202046, but exclude 2019, as it refers to a year and not the code I'm searching for.
Currently, I tried applying this one (only the last part of the expression): |\d{4,6}?!20\d{2}) , but it's not working properly. I know that the expressions preceding the pipe are fine, and I noticed that \d{4,6} stops working once I add the "exclusion", so I assume I'm not using the ?! properly. Could I ask for advice on this one?
Edit: Solved! Thank you very much for the immediate answers (I was positively surprised by how quickly several alternative solutions appeared). Sorry I had to pick just one; all of the others would be adjustable and viable for my needs, I just found this one the most appealing and best tailored to them.

I'm not sure whether word boundaries are your best bet to delimit the numbers (maybe \D is better?), but you could try:
\b(?!20\d\d\b)\d{4,6}\b
See the Online Demo
\b - Word boundary.
(?!20\d\d\b) - Negative lookahead: No literal 20 followed by two digits and a word boundary.
\d{4,6} - Four to six digits.
\b - Word boundary.
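A minimal sketch of how this pattern behaves with Python's re module. The sample sentence is made up for illustration and only reuses the numbers from the question:

```python
import re

# Pattern from the answer above: 4 to 6 digits, but not a bare "20xx" year.
CODE_RE = re.compile(r"\b(?!20\d\d\b)\d{4,6}\b")

# Hypothetical sample text built from the numbers in the question.
sample = "codes 4149, 20259 and 202046 were issued in 2019"

codes = CODE_RE.findall(sample)
print(codes)  # ['4149', '20259', '202046']
```

Because \b treats letters and underscores as word characters, a code glued to a letter (such as x4149) would not match. If that matters, digit lookarounds such as (?<!\d) and (?!\d) are one possible substitute for \b.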

You could use the following regular expression.
r'\b(?:20\d{3,4}|2[1-9]\d{2,4}|[13-9]\d{3,5})\b'
Demo
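A quick check of the alternation-based pattern, as a sketch only. The leading character class is written here as [13-9], since a comma inside a character class would match a literal comma. The sample string just reuses the numbers from the question:

```python
import re

# Alternation-based pattern; [13-9] means "1, or 3 through 9".
ALT_RE = re.compile(r"\b(?:20\d{3,4}|2[1-9]\d{2,4}|[13-9]\d{3,5})\b")

sample = "4149 20259 202046 2019"  # numbers taken from the question
matches = ALT_RE.findall(sample)
print(matches)  # ['4149', '20259', '202046']
```

As written, none of the alternatives starts with 0, so a code with a leading zero (for example 0123) is never matched.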

This should work:
\b(?:[013-9]\d{3}|2[1-9]\d{2}|\d{5,6})\b
Match all 4-digit sequences except the ones starting with 20, and all 5- or 6-digit sequences.

Related

Regex - Group everything until last occurence

When working on this string:
see.Ya23.v2.0023.jpg
I already found out I could get the last occurence of a number by using:
(?P<Frame>\d+(?!.*\d))
It gives me the group containing "0023".
But how do I group everything until that happens?
If I do this:
(?P<Sequence>.*)(?P<Frame>\d+(?!.*\d))
My two groups contain "see.Ya23.v2.002" and "3", when I would like them to contain "see.Ya23.v2." and "0023".
Hope you can help me. Thanks in advance.
You almost had it.
In the first group, add the lazy modifier ? after the .* quantifier. That makes the group stop at the first position where the rest of the pattern can match.
(?P<Sequence>.*?)(?P<Frame>\d+(?!.*\d))
This will give you
see.Ya23.v2. and 0023
And if you also want to avoid selecting the dot:
(?P<Sequence>.*?)\.(?P<Frame>\d+(?!.*\d))
The result is see.Ya23.v2 and 0023
The simplest and quickest way is to put a negative assertion for a digit
before your digit expression at the start of the Frame group.
This makes sure the Frame is the last complete run of digits and
still allows a greedy Sequence match, which gives a performance boost.
(?P<Sequence>.*)(?P<Frame>(?<!\d)\d+(?!.*\d))
https://regex101.com/r/LCUoCR/1
The problem is explained in my Youtube video related to how backtracking works in regex.
In short: the .* part matches the whole string first, and then the regex engine starts stepping back through the string to accommodate a part for the subsequent patterns, i.e. for \d+(?!.*\d). Once the 3 is found in see.Ya23.v2.0023.jpg, this pattern matches, and the regex engine returns a match.
All you need is to make sure the char before the \d+ is a non-digit char, so you can use
(?P<Sequence>(?:.*\D)?)(?P<Frame>\d+)(?!.*\d)
See the regex demo.
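The patterns from this thread can be compared side by side. This is only a sketch: the dictionary labels are made up for illustration, and re.match is used because every pattern is meant to start at the beginning of the file name:

```python
import re

name = "see.Ya23.v2.0023.jpg"  # file name from the question

# Labels are hypothetical; the patterns are the ones quoted in the thread.
patterns = {
    "greedy": r"(?P<Sequence>.*)(?P<Frame>\d+(?!.*\d))",
    "lazy": r"(?P<Sequence>.*?)(?P<Frame>\d+(?!.*\d))",
    "lookbehind": r"(?P<Sequence>.*)(?P<Frame>(?<!\d)\d+(?!.*\d))",
    "non_digit": r"(?P<Sequence>(?:.*\D)?)(?P<Frame>\d+)(?!.*\d)",
}

results = {}
for label, pattern in patterns.items():
    m = re.match(pattern, name)
    results[label] = (m.group("Sequence"), m.group("Frame"))
    print(label, results[label])
```

Only the original greedy pattern splits the frame number; the other three all return "see.Ya23.v2." and "0023".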

Re module and positive look behind of variable width

I am new to programming and Python, so I apologize if this is an obvious question. I tried looking at similar questions on this website, but the solutions seem to be outside of my reach.
Problem: Consider the following text:
12/19 Paul 1/20
1/20 Jacob 10/2
Using the module re, extract the names from the above. In other words, your output should be:
['Paul', 'Jacob']
First, I tried using positive look arounds. I tried:
import re
name_regex=re.compile(r'''(
(?<=\d{1,2}/\d{1,2}\s) #looks for one or two digits followed by a forward slash followed by one or two digits, followed by a space
.*? #looks for anything besides the newline in a non-greedy manner (is the non-greedy part necessary? I am not sure...)
(?=\s\d{1,2}/\d{1,2}) #looks for a space followed by one or two digits followed by a forward slash followed by one or two digits
)''', re.VERBOSE)
text=str("12/19 Paul 1/20\n1/20 Jacob 10/2")
print(name_regex.findall(text))
However, the above yields the error:
re.error: look-behind requires fixed-width pattern
From reading similar questions, I believe that this means that look arounds cannot have variable length (i.e., they cannot look for "1 or 2 digits").
However, how can I fix this?
Any help would be greatly appreciated. Especially the help suited for nearly a complete beginner like me!
PS. Ultimately, the list of names surrounded by dates can be very long. The dates can have one or two digits that are separated by a slash. I just wanted to give a minimal working example.
Thank you!
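If you want to match at least a single non-whitespace char between the digit patterns, you might use the pattern below. Note that it still contains a variable-width lookbehind, which the built-in re module rejects, so it needs an engine that supports one, such as the third-party regex module from PyPI.
(?<=\d{1,2}/\d{1,2}\s)\S.*?(?=\s\d{1,2}/\d{1,2})
The part \S.*? matches a non-whitespace char followed by any char except a newline, non-greedily, so it matches up to the first position where (?=\s\d{1,2}/\d{1,2}) succeeds.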
Python demo
Note that if you would use .*? then match would also return an empty entry ['Paul', '', 'Jacob'] , see this example.
You could also use a capturing group instead of lookarounds, which works with the built-in re module:
\d{1,2}/\d{1,2}\s(\S.*?)\s\d{1,2}/\d{1,2}
Regex demo
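A minimal sketch of the capturing-group variant with the built-in re module, using the text from the question. When a pattern has exactly one group, re.findall returns that group's contents:

```python
import re

text = "12/19 Paul 1/20\n1/20 Jacob 10/2"  # sample from the question

# date, whitespace, (name), whitespace, date
NAME_RE = re.compile(r"\d{1,2}/\d{1,2}\s(\S.*?)\s\d{1,2}/\d{1,2}")

names = NAME_RE.findall(text)
print(names)  # ['Paul', 'Jacob']
```

One design consequence: the dates are consumed by each match, so this relies on every name having its own pair of dates, as in the sample. If a single date had to close one entry and open the next, the second name would be skipped.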

functional difference between lookarounds and non-capture group?

I'm trying to come up with an example where positive look-around works but
non-capture groups won't work, to further understand their usages. The examples I'm coming up with all work with non-capture groups as well, so I feel like I'm not fully grasping the usage of positive look-around.
Here is a string, (taken from a SO example) that uses positive look ahead in the answer. The user wanted to grab the second column value, only if the value of the
first column started with ABC, and the last column had the value 'active'.
string ='''ABC1 1.1.1.1 20151118 active
ABC2 2.2.2.2 20151118 inactive
xxx x.x.x.x xxxxxxxx active'''
The solution given used 'positive look ahead' but I noticed that I could use non-capture groups to arrive at the same answer.
So, I'm having trouble coming up with an example where positive look-around works, non-capturing group doesn't work.
pattern = re.compile(r'ABC\w\s+(\S+)\s+(?=\S+\s+active)')  # solution
pattern = re.compile(r'ABC\w\s+(\S+)\s+(?:\S+\s+active)')  # solution w/out lookaround
If anyone would be kind enough to provide an example, I would be grateful.
Thanks.
The fundamental difference is that non-capturing groups still consume the part of the string they match, thus moving the cursor forward.
One example where this makes a fundamental difference is when you try to match strings that are surrounded by certain boundaries, and these boundaries can overlap. Sample task:
Match every a in a given string that is surrounded by b characters. The given string is bababaca. There should be two matches, at positions 2 and 4 (counting from 1).
Using lookarounds this is rather easy: you can use b(a)(?=b) or (?<=b)a(?=b) and match them. But (?:b)a(?:b) won't work: the first match also consumes the b at position 3, which is needed as the boundary for the second match. (Note: the non-capturing group isn't actually needed here.)
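The overlap problem can be reproduced in a few lines. This is a small sketch with Python's re module; note that m.start() is 0-based, so positions 2 and 4 in the text show up as indexes 1 and 3:

```python
import re

s = "bababaca"

# Lookarounds check the surrounding b's without consuming them.
lookaround = [m.start() for m in re.finditer(r"(?<=b)a(?=b)", s)]

# Non-capturing groups consume both b's, so the shared b is used up.
consuming = [m.start() for m in re.finditer(r"(?:b)a(?:b)", s)]

print(lookaround)  # [1, 3]
print(consuming)   # [0]  (only the first "bab")
```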
Another prominent example is password validation: checking that the password contains uppercase letters, lowercase letters, digits, and so on. You could use a bunch of alternations to match these, but lookaheads make it far easier:
(?=.*[a-z])(?=.*[A-Z])(?=.*[0-9])(?=.*[!?.])
vs
(?:.*[a-z].*[A-Z].*[0-9].*[!?.])|(?:.*[A-Z].*[a-z].*[0-9].*[!?.])|(?:.*[0-9].*[a-z].*[A-Z].*[!?.])|(?:.*[!?.].*[a-z].*[A-Z].*[0-9])|(?:.*[A-Z].*[a-z].*[!?.].*[0-9])|...
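A sketch of the lookahead version in Python. The function name is_strong is hypothetical, and the character classes are the ones from the pattern above; each lookahead starts from the same position, so the order of the characters in the password does not matter:

```python
import re

# Four independent checks, all anchored at the start of the string.
PASSWORD_RE = re.compile(r"^(?=.*[a-z])(?=.*[A-Z])(?=.*[0-9])(?=.*[!?.])")


def is_strong(password: str) -> bool:
    """Hypothetical helper: True if all four character classes occur."""
    return PASSWORD_RE.match(password) is not None


print(is_strong("aB3!"))     # True
print(is_strong("!3Ba"))     # True, order is irrelevant
print(is_strong("abc123!"))  # False, no uppercase letter
```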

Regular expression pattern questions?

I am having a hard time understanding regular expression patterns. Could someone help me with a regular expression pattern that matches all words ending in s, and one for words that start with a and end with a (like ana)?
How do I write ending?
Word boundaries are given by \b, so the following regex matches words ending with ing or s: "\b(\w+?(?:ing|s))\b", where \b is a word boundary, \w+ is one or more "word characters" and (?:ing|s) is a non-capturing group matching either ing or s.
As you asked "how to develop a regex":
First: Don't use regex for complex tasks. They are hard to read, write and maintain. For example, there is a regex that validates email addresses, but it's computer generated and nothing you should use in practice.
Start simple and add edge cases. At the beginning, plan what characters you need to use: You said you need words ending with s or ing. So you probably need something to represent a word, the endings of words and the literal characters s and ing. What is a word? This might change from case to case, but it includes at least every alphabetical character. Looking in the Python documentation on regexes you can find \w, which for ASCII text is [a-zA-Z0-9_] and fits my impression of a word character. There you can also find \b, which is a word boundary.
So the "first pseudo code try" is something like \b\w...\w\b, which matches a word. We still need to "formalize" the ..., which we want to mean "one or more characters", and that translates directly to \b\w+\b. We can now match a word! We still need the s or ing. | translates to or, so how about the following: \b\w+ing|s\b? If you test this, you'll see that it matches confusing things like the sing in singer, which should not match our regex. What is happening? As you probably already saw, the | can't know "which part it should or": it splits the whole pattern into \b\w+ing and s\b. So we need to introduce parentheses: \b\w+(ing|s)\b. Congratulations, you have now arrived at a working regex!
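The effect of the missing parentheses can be checked directly. In this sketch the sample text is made up, and finditer with group(0) is used so that the whole match is shown even though the second pattern contains a capturing group:

```python
import re

text = "singer cats"  # hypothetical sample

# Without parentheses the alternatives are "\b\w+ing" and "s\b".
ungrouped = [m.group(0) for m in re.finditer(r"\b\w+ing|s\b", text)]

# With parentheses the | only chooses between the two endings.
grouped = [m.group(0) for m in re.finditer(r"\b\w+(ing|s)\b", text)]

print(ungrouped)  # ['sing', 's']
print(grouped)    # ['cats']
```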
Why (and how) does this differ from the example I gave first? First, I wrote \w+? instead of \w+; the ? turns the + into a non-greedy version. If you know the difference between greedy and non-greedy, skip this paragraph. Consider the string AaAAbA, where we want to match the things enclosed by a capital A. A naive try is A\w+A, so one or more word characters enclosed by A. We would like this to match AaA, but it matches the whole of AaAAbA, because A is itself something that \w can match. Without further configuration, the quantifiers *, + and ? all try to match as much as possible. Sometimes, like in the A example, you don't want that; you can then put a ? after the quantifier to signal you want a non-greedy version, one that matches as little as possible.
But in our case this isn't needed: the words are well separated by whitespace, which is not part of \w. So in fact you can just let + be greedy and everything will be all right. If you use . (any character) you often need to be careful not to match too much.
The other difference is using (?:s|ing) instead of (s|ing). What does the ?: do here? It changes a capturing group to a non capturing group. Generally you don't want to get "everything" from the regex. Consider the following regex: I want to go to \w+. You are not interested in the whole sentence, but only in the \w+, so you can capture it in a group: I want to go to (\w+). This means that you are interested in this specific piece of information and want to retrieve it later. Sometimes (like when using |) you need to group expressions together, but are not interested in their content, you can then declare it as non capturing. Otherwise you will get the group (s or ing) but not the actual word!
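In Python this difference is easy to see with re.findall, which returns the group contents whenever the pattern contains a capturing group. A short sketch, using a made-up sample string:

```python
import re

text = "fishing words"  # hypothetical sample

# Capturing group: findall returns only what the group matched.
capturing = re.findall(r"\b\w+(ing|s)\b", text)

# Non-capturing group: findall returns the whole match.
non_capturing = re.findall(r"\b\w+(?:ing|s)\b", text)

print(capturing)      # ['ing', 's']
print(non_capturing)  # ['fishing', 'words']
```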
So to summarize:
* start small
* add one case after another
* always test with examples
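In fact I just tried re.findall("\b\w+(?:ing|s)\b", "fishing words") and it didn't work, while "\w+(?:ing|s)" works. The likely cause is that in a normal Python string literal \b is the backspace character rather than a word boundary, so the pattern should be written as a raw string: r"\b\w+(?:ing|s)\b". Regexes are an arcane thing; only use them for tasks that are easy and easy to test.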
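A small check of that call, assuming the pattern was passed as a normal (non-raw) string literal. In such a literal, Python turns \b into the backspace character before the regex engine ever sees it. The \w is written as \\w below only to avoid an invalid-escape warning; it reaches the engine as \w either way:

```python
import re

text = "fishing words"

# Non-raw literal: both "\b" sequences become backspace characters (0x08).
non_raw = "\b\\w+(?:ing|s)\b"

# Raw literal: the regex engine receives \b and treats it as a word boundary.
raw = r"\b\w+(?:ing|s)\b"

print(re.findall(non_raw, text))  # []
print(re.findall(raw, text))      # ['fishing', 'words']
```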
Generally speaking I'd use \b to match "word boundaries" with \w which matches word components (short cut for [A-Za-z0-9_]). Then you can do an or grouping to match "s" or "ing". Result is:
/\b\w+(s|ing)\b/

Is this regex correct for xsd:anyURI

I am implementing a function (in Python) that checks for conformance of the string to xsd:anyURI.
According to Schema Central, it only makes sense to check for repeated # characters (consecutive or not) and for % followed by something other than hex characters (0-9, A-F, a-f).
So far, I have something like this, and it seems to be working:
if re.search(r'(%[^0-9A-Fa-f]+)|(#.*#+)', uri):
The second expression for multiple '#' signs may be faulty.
If you are aiming for an exclusion regex according to the Schema Central parser requirement, you are almost there. The first half, excluding percent signs not followed by two hexadecimal digits is best solved using a negative look-ahead assertion; the second half is fine, though you can ditch the last repeat indicator without affecting your results:
(%(?![0-9A-F]{2})|#.*#)
Compile your regex case-insensitively (the i flag, re.IGNORECASE in Python) and you are good to go.
Recommended reading: the Python Standard Library’s chapter on Regular Expression Operation Syntax.
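A sketch of how the exclusion regex could be wrapped in a check. The function name looks_like_any_uri and the sample URIs are made up for illustration; the rule itself is only the two conditions discussed above, not a full xsd:anyURI validator:

```python
import re

# Exclusion pattern from the answer: a % not followed by two hex digits,
# or more than one # anywhere in the string.
INVALID_URI_RE = re.compile(r"%(?![0-9A-F]{2})|#.*#", re.IGNORECASE)


def looks_like_any_uri(value: str) -> bool:
    """Hypothetical helper: True when no disallowed construct is found."""
    return INVALID_URI_RE.search(value) is None


print(looks_like_any_uri("http://example.com/a%20b#frag"))  # True
print(looks_like_any_uri("http://example.com/100%"))        # False
print(looks_like_any_uri("http://example.com/a#b#c"))       # False
```

Unlike a character-class test such as %[^0-9A-Fa-f], the negative lookahead also rejects a % at the very end of the string, where there is no following character to inspect.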
I recently had to do this without a negative lookahead, and the following seems to work:
(%.?[^0-9A-Fa-f]|#.*#)
