Re module and positive look behind of variable width - python

I am new to programming and Python, so I apologize if this is an obvious question. I tried looking at similar questions on this website, but the solutions seem to be outside of my reach.
Problem: Consider the following text:
12/19 Paul 1/20
1/20 Jacob 10/2
Using the module re, extract the names from the above. In other words, your output should be:
['Paul', 'Jacob']
First, I tried using positive look arounds. I tried:
import re
name_regex=re.compile(r'''(
(?<=\d{1,2}/\d{1,2}\s) #looks for one or two digits followed by a forward slash followed by one or two digits, followed by a space
.*? #looks for anything besides the newline in a non-greedy manner (is the non-greedy part necessary? I am not sure...)
(?=\s\d{1,2}/\d{1,2}) #looks for a space followed by one or two digits followed by a forward slash followed by one or two digits
)''', re.VERBOSE)
text=str("12/19 Paul 1/20\n1/20 Jacob 10/2")
print(name_regex.findall(text))
However, the above yields the error:
re.error: look-behind requires fixed-width pattern
From reading similar questions, I believe that this means that look arounds cannot have variable length (i.e., they cannot look for "1 or 2 digits").
However, how can I fix this?
Any help would be greatly appreciated. Especially the help suited for nearly a complete beginner like me!
PS. Ultimately, the list of names surrounded by dates can be very long. The dates can have one or two digits that are separated by a slash. I just wanted to give a minimal working example.
Thank you!

If you want to match at least a single non whitespace char between the digit patterns, you might use
(?<=\d{1,2}/\d{1,2}\s)\S.*?(?=\s\d{1,2}/\d{1,2})
This part \S.*? will match a non whitespace char followed by any char except a newline non greedy so it will match until asserting the first occurrence of (?=\s\d{1,2}/\d{1,2})
Python demo
Note that if you would use .*? then match would also return an empty entry ['Paul', '', 'Jacob'] , see this example.
You could also use a capturing group instead of lookarounds:
\d{1,2}/\d{1,2}\s(\S.*?)\s\d{1,2}/\d{1,2}
Regex demo

Related

Regex - Group everything until last occurence

When working on this string:
see.Ya23.v2.0023.jpg
I already found out I could get the last occurence of a number by using:
(?P<Frame>\d+(?!.*\d))
It gives me the group containing "0023".
But how do I group everything until that happens?
If I do this:
(?P<Sequence>.*)(?P<Frame>\d+(?!.*\d))
My two groups contain "see.Ya23.v2.002" and "3", when I would like to have to have them contain "see.Ya23.v2." and "0023".
Hope you can help me. Thanks in advance.
You almost got it completely.
just in the first group you can add the lazy indicator ? after any match. that causes to drop the selection at the first possible possition.
(?P<Sequence>.*?)(?P<Frame>\d+(?!.*\d))
this will give you
see.Ya23.v2. and 0023
and if you also want to avoid selecting the dot
(?P<Sequence>.*?)\.(?P<Frame>\d+(?!.*\d))
the result is see.Ya23.v2 and 0023
The simplest and quickest way is to put a negative assertion for a digit
before your digit expression at the start of the Frame group.
This will make sure the Frame is the last complete set of digits and
still allow a greedy Sequence match which give a performance boost.
(?P<Sequence>.*)(?P<Frame>(?<!\d)\d+(?!.*\d))
https://regex101.com/r/LCUoCR/1
The problem is explained in my Youtube video related to how backtracking works in regex.
In short: the .* part matches the whole string first, and then the regex engine starts stepping back through the string to accommodate a part for the subsequent patterns, i.e. for \d+(?!.*\d). Once the 3 is found in see.Ya23.v2.0023.jpg, this pattern matches, and the regex engine returns a match.
All you need is to make sure the char before the \d+ is a non-digit char and you need to use
(?P<Sequence>(?:.*\D)?)(?P<Frame>\d+)(?!.*\d)
See the regex demo.

Regex - how to exclude 4 digit number from wider numeric pattern

Really tried to browse and search if this specific question has been posted previously, so I hope I'm not asking an obvious one here.
My problem: I have a regex expression, with few different possible criteria for pattern matching separated by pipes. I'm OK with all of them except one, where I basically want to:
Find any expression which would be number between 4 and 6 digits (regardless of position in the string)
Exclude from this pattern expressions which would relate to years in this century (so starting with 20 and followed by two digits)
So for example, I would like to match: 4149, 20259, 202046, but would like to exclude 2019 as it will refer to a year and not the code I'm searching for.
Currently, I tried applying this one (only last part of the expression): |\d{4,6}?!20\d{2}) , but it's not working properly. I know that the expressions preceeding pipe are fine and was able to notice that \d{4,6} stops to work once I add the "exclusion" in this case, so I assume I'm not using the ?! properly. Could I ask you for an advice on this one?
Edit: Solved! Thank you very much for immediate answers (I was really positively surprised how fast there were few alternative solutions). Sorry I had to pick just one, all of others would be adjustable and viable for my needs, I just found this one most appealing and tailored for my needs.
Where I'm not sure if word-boundaries are your best bet to indicate boundaries (maybe \D is better?), you could try:
\b(?!20\d\d\b)\d{4,6}\b
See the Online Demo
\b - Word boundary.
(?!20\d\d\b) - Negative lookahead: No literal 20 followed by two digits and a word boundary.
\d{4,6} - Four to six digits.
\b - Word boundary.
You could use the following regular expression.
r'\b(?:20\d{3,4}|2[1-9]\d{2,4}|[1,3-9]\d{3,5})\b'
Demo
This should work:
[013-9][1-9]\d{2}|\d{5,6}
Match all 4 digit sequences, except the ones starting with 20, and all 5 or 6 digit sequences

regular expression ending

I have ONE string as plain text and want to extract phone numbers of any format from it.
Here is my regex:
r = re.compile(r"(\d{3}[-\.\s]??\d{3}[-\.\s]??\d{4}|\(\d{3}\)[-\s*]\d{3}[-\.\s]??\d{4})")
It extracts the following matches correctly:
617.933.6444
(880)-567-4565
(880) 567-4565
222-333-8888
555 666 4444
9999999999
But how can I avoid getting 7986815059 when I have 798681505951 in the text?
How to make an ending for my regex? (it should not contain letters and digits after and before, exact number count must be 10)
!!!!
Decision
If somebody needs to find US phone numbers in string, use link from the last Wiktor Stribiżew comment.
You need to use word boundaries, but placing them into your pattern is not obvious. It is due to the fact that the second alternative starts with a non-word char, \(. Thus, the first \b must be added at the beginning of the first alternative, and the trailing one at the very end of the pattern:
r'(\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}|\(\d{3}\)[-\s*]\d{3}[-.\s]?\d{4})\b'
^^ ^^
See the regex demo
You may also require a non-word char or start of string before (. Then add \B at the second alternative start:
r'(\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}|\B\(\d{3}\)[-\s*]\d{3}[-.\s]?\d{4})\b'
^^
See another demo
Also, note that there is no need escaping a . inside a character class, it is already parsed as a literal dot in [.]. And no need using a lazy ?? quantifier, it does not make sense here and a greedy version, ?, will work equally well and will look "cleaner".

Understanding regex pattern used to find string between strings in html

I have the following html file:
<!-- <div class="_5ay5"><table class="uiGrid _51mz" cellspacing="0" cellpadding="0"><tbody><tr class="_51mx"><td class="_51m-"><div class="_u3y"><div class="_5asl"><a class="_47hq _5asm" href="/Dev/videos/1610110089242029/" aria-label="Who said it?" ajaxify="/Dev/videos/1610110089242029/" rel="theater">
In order to pull the string of numbers between videos/ and /", I'm using the following method that I found:
import re
Source_file = open('source.html').read()
result = re.compile('videos/(.*?)/"').search(Source_file)
print result
I've tried Googling an explanation for exactly how the (.*?) works in this particular implementation, but I'm still unclear. Could someone explain this to me? Is this what's known as a "non-greedy" match? If yes, what does that mean?
The ? in this context is a special operator on the repetition operators (+, *, and ?). In engines where it is available this causes the repetition to be lazy or non-greedy or reluctant or other such terms. Typically repetition is greedy which means that it should match as much as possible. So you have three types of repetition in most modern perl-compatible engines:
.* # Match any character zero or more times
.*? # Match any character zero or more times until the next match (reluctant)
.*+ # Match any character zero or more times and don't stop matching! (possessive)
More information can be found here: http://www.regular-expressions.info/repeat.html#lazy for reluctant/lazy and here: http://www.regular-expressions.info/possessive.html for possessive (which I'll skip discussing in this answer).
Suppose we have the string aaaa. We can match all of the a's with /(a+)a/. Literally this is
match one or more a's followed by an a.
This will match aaaa. The regex is greedy and will match as many a's as possible. The first submatch is aaa.
If we use the regex /(a+?)a this is
reluctantly match one or more as followed by an a
or
match one or more as until we reach another a
That is, only match what we need. So in this case the match is aa and the first submatch is a. We only need to match one a to satisfy the repetition and then it is followed by an a.
This comes up a lot when using regex to match within html tags, quotes and the suchlike -- usually reserved for quick and dirty operations. That is to say using regex to extract from very large and complex html strings or quoted strings with escape sequence can cause a lot of problems but it's perfectly fine for specific use cases. So in your case we have:
/Dev/videos/1610110089242029/
The expression needs to match videos/ followed by zero or more characters followed by /". If there is only one videos URL there that's just fine without being reluctant.
However we have
/videos/1610110089242029/" ... ajaxify="/Dev/videos/1610110089242029/"
Without reluctance, the regex will match:
1610110089242029/" ... ajaxify="/Dev/videos/1610110089242029
It tries to match as much as possible and / and " satisfy . just fine. With reluctance, the matching stops at the first /" (actually it backtracks but you can read about that separately). Thus you only get the part of the url you need.
It can be explained in a simple way:
.: match anything (any character),
*: any number of times (at least zero times),
?: as few times as possible (hence non-greedy).
videos/(.*?)/"
as a regular expression matches (for example)
videos/1610110089242029/"
and the first capturing group returns 1610110089242029, because any of the digits is part of “any character” and there are at least zero characters in it.
The ? causes something like this:
videos/1610110089242029/" something else … "videos/2387423470237509/"
to properly match as 1610110089242029 and 2387423470237509 instead of as 1610110089242029/" something else … "videos/2387423470237509, hence “as few times as possible”, hence “non-greedy”.
The . means any character. The * means any number of times, including zero. The ? does indeed mean non-greedy; that means that it will try to capture as few characters as possible, i.e., if the regex encounters a /, it could match it with the ., but it would rather not because the . is non-greedy, and since the next character in the regex is happy to match /, the . doesn't have to. If you didn't have the ?, that . would eat up the whole rest of the file because it would be chomping at the bit to match as many things as possible, and since it matches everything, it would go on forever.

Regular expression capturing entire match consisting of repeated groups

I've looked thrould the forums but could not find exactly how exactly to solve my problem.
Let's say I have a string like the following:
UDK .636.32/38.082.4454.2(575.3)
and I would like to match the expression with a regex, capturing the actual number (in this case the '.636.32/38.082.4454.2(575.3)').
There could be some garbage characters between the 'UDK' and the actual number, and characters like '.', '/' or '-' are valid parts of the number. Essentially the number is a sequence of digits separated by some allowed characters.
What I've came up with is the following regex:
'UDK.*(\d{1,3}[\.\,\(\)\[\]\=\'\:\"\+/\-]{0,3})+'
but it does not group the '.636.32/38.082.4454.2(575.3)'! It leaves me with nothing more than a last digit of the last group (3 in this case).
Any help would be greatly appreciated.
First, you need a non-greedy .*?.
Second, you don't need to escape some chars in [ ].
Third, you might just consider it as a sequence of digits AND some allowed characters? Why there is a \d{1,3} but a 4454?
>>> re.match(r'UDK.*?([\d.,()\[\]=\':"+/-]+)', s).group(1)
'.636.32/38.082.4454.2(575.3)'
Not so much a direct answer to your problem, but a general regexp tip: use Kodos (http://kodos.sourceforge.net/). It is simply awesome for composing/testing out regexps. You can enter some sample text, and "try out" regular expressions against it, seeing what matches, groups, etc. It even generates Python code when you're done. Good stuff.
Edit: using Kodos I came up with:
UDK.*?(?P<number>[\d/.)(]+)
as a regexp which matches the given example. Code that Kodos produces is:
import re
rawstr = r"""UDK.*?(?P<number>[\d/.)(]+)"""
matchstr = """UDK .636.32/38.082.4454.2(575.3)"""
# method 1: using a compile object
compile_obj = re.compile(rawstr)
match_obj = compile_obj.search(matchstr)
# Retrieve group(s) by name
number = match_obj.group('number')

Categories

Resources