python regex pattern to extract value between two characters

I am trying to extract an id number from urls in the form of
http://www.domain.com/some-slug-here/person/237570
http://www.domain.com/person/237570
either one of these urls could also have params on them
http://www.domain.com/some-slug-here/person/237570?q=some+search+string
http://www.domain.com/person/237570?q=some+search+string
I have tried the following expressions to capture the id value of '237570' from the above urls. Each one kind of works, but none works across all four url scenarios.
(?<=person\/)(.*)(?=\?)
(?<=person\/)(.*)(?=\?|\z)
(?<=person\/)(.*)(?=\??*)
What I am seeing is that it captures the 237570 but also includes the ? and the characters that come after it in the url. How can I say: stop capturing when you hit a ?, a /, or the end of the string?

String:
http://www.domain.com/some-slug-here/person/1234?q=some+search+string
http://www.domain.com/person/3456?q=some+search+string
http://www.domain.com/some-slug-here/person/5678
http://www.domain.com/person/7890
Regexp:
person\/(\d{1,})
Output:
>>> regex.findall(string)
[u'1234', u'3456', u'5678', u'7890']

Don't use .* to match the ID. . will match any character (except for line breaks, unless you use the DOTALL option). Just match a bunch of digits: (.*) --> (\d+)
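As a minimal sketch of the digits-only approach, the four URLs from the question can be checked in one go. The helper name extract_person_id is hypothetical and only used for illustration.

```python
import re

urls = [
    "http://www.domain.com/some-slug-here/person/237570",
    "http://www.domain.com/person/237570",
    "http://www.domain.com/some-slug-here/person/237570?q=some+search+string",
    "http://www.domain.com/person/237570?q=some+search+string",
]

# \d+ matches digits only, so the capture stops at "?", "/" or the end of the string.
ID_PATTERN = re.compile(r"person/(\d+)")

def extract_person_id(url):
    """Return the id after 'person/' as a string, or None if there is none."""
    match = ID_PATTERN.search(url)
    return match.group(1) if match else None

ids = [extract_person_id(u) for u in urls]
```

Because the capture group cannot match anything but digits, no lookahead for ? or the end of the string is needed.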

Related

How to handle " in Regex Python

I am trying to grab fary_trigger_post in the code below using Regex. However, I don't understand why it always includes " in the end of the matched pattern, which I don't expect.
Any idea or suggestion?
re.match(
r'-instance[ "\']*(.+)[ "\']*$',
'-instance "fary_trigger_post" '.strip(),
flags=re.S).group(1)
'fary_trigger_post"'
Thank you.

The (.+) is greedy and grabs ANY character until the end of the input. If you modified your input to include characters after the final double quote (e.g. '-instance "fary_trigger_post" asdf') you would find the double quote and the remaining characters in the output (e.g. fary_trigger_post" asdf). Instead of .+ you should try [^"\']+ to capture all characters except the quotes. This should return what you expect.
re.match(r'-instance[ "\']*([^"\']+)[ "\'].*$', '-instance "fary_trigger_post" '.strip(), flags=re.S).group(1)
Also, note that I modified the end of the expression to use .* which will match any characters following the last quote.
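A short comparison of the two patterns, run against the input from the question and against the variant with trailing text mentioned above, might look like this (the variable names are illustrative only):

```python
import re

GREEDY = r'-instance[ "\']*(.+)[ "\']*$'
NEGATED = r'-instance[ "\']*([^"\']+)[ "\'].*$'

text = '-instance "fary_trigger_post" '.strip()
text_with_tail = '-instance "fary_trigger_post" asdf'

# (.+) runs to the end of the input, so the closing quote ends up in group 1.
greedy_result = re.match(GREEDY, text, flags=re.S).group(1)
greedy_tail = re.match(GREEDY, text_with_tail, flags=re.S).group(1)

# [^"']+ cannot cross a quote, so group 1 stops right before it.
negated_result = re.match(NEGATED, text, flags=re.S).group(1)
negated_tail = re.match(NEGATED, text_with_tail, flags=re.S).group(1)
```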

Here's what I'd use in your matching string, but it's hard to provide a better answer without knowing all your cases:
r'-instance\s+"(.+)"\s*$'

Group 1, i.e. (.+), can match . (any character) one or more times, and because it is greedy it takes as many characters as it can, so it runs to the end of the string. I would suggest using the following pattern:
'-instance[ "\']*(.+)["\']+ *$'
This requires the regex to match the closing quotes and any trailing spaces separately, so that they are not included in group 1.

Modifying existing regex, excluding some parts

I have an expression in regex, as follow:
r"(?:AND|OR|SUB|ADD)\([^()]*\)(?:\]\[|\[|)(|\s\[|\s)COIL\(Seq\[\d+]\.Bool\[\d+]\.\d+"
I usually use it to capture from a string like this:
AND(Seq[1].Mat)AND(Type_G012.WithData)[COIL(Seq[1].Bool[93].11),XIC(Seq[1].exp)RES(Seq[1].ita)]
So I want to extract the last "AND" or "OR" or "SUB"... together with Seq[1].Bool[93].11.
After that I do an additional extraction. It was working with no problems on almost everything. The problem is that I have some patterns like this:
AND(Seq[1].Mat)AND(Type_G014.WithData)AND(Type_G015.WithData)[SET(Seq[1].WaitStep)COIL(Seq[1].Seq[93].10),AND(Seq[1].exp)RES(Seq[1].ita)]
Then I am not capturing the last AND, OR, SUB, etc., because now there is a SET instruction between the AND and the COIL. So I want to skip anything other than AND|OR|SUB|ADD, because I would like to extract the following from the last string:
AND(Type_G015.WithData)[SET(Seq[1].Wait)COIL(Seq[1].Seq[93].10
That is, the last AND before the COIL. Any help would be appreciated; I have been testing several things and keep getting it wrong.
Thanks.

To match both parts, you might use
(?:AND|OR|SUB|ADD)\([^()]+\)(?:\[SET\([^()]+\))?\[?COIL\(Seq\[\d+\]\.(?:Seq|Bool)\[\d+\]\.\d+
In parts
(?:AND|OR|SUB|ADD) Match 1 of the alternatives
\([^()]+\) Match from an opening till closing parenthesis
(?:\[SET\([^()]+\))? Optionally match [SET and from opening till closing parenthesis
\[?COIL\(Seq\[\d+\]\. Match Optional [ and COIL(Seq[ 1+ digits and ].
(?:Seq|Bool) Match either Seq or Bool
\[\d+\]\.\d+ Match [ 1+ digits and ]. followed by 1+ digits
Regex demo
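To check the pattern against both sample lines from the question, a small sketch like the following can be used. The pattern is split across several string literals purely for readability; the names PATTERN, line_without_set and line_with_set are illustrative.

```python
import re

PATTERN = re.compile(
    r"(?:AND|OR|SUB|ADD)\([^()]+\)"   # AND/OR/SUB/ADD plus one parenthesised group
    r"(?:\[SET\([^()]+\))?"           # optional [SET(...)
    r"\[?COIL\(Seq\[\d+\]\."          # optional [ then COIL(Seq[n].
    r"(?:Seq|Bool)\[\d+\]\.\d+"       # Seq[n].d or Bool[n].d
)

line_without_set = ("AND(Seq[1].Mat)AND(Type_G012.WithData)"
                    "[COIL(Seq[1].Bool[93].11),XIC(Seq[1].exp)RES(Seq[1].ita)]")
line_with_set = ("AND(Seq[1].Mat)AND(Type_G014.WithData)AND(Type_G015.WithData)"
                 "[SET(Seq[1].WaitStep)COIL(Seq[1].Seq[93].10),"
                 "AND(Seq[1].exp)RES(Seq[1].ita)]")

match_without_set = PATTERN.search(line_without_set).group(0)
match_with_set = PATTERN.search(line_with_set).group(0)
```

Earlier AND(...) groups are skipped automatically because they are not directly followed by the optional [SET(...) part and COIL.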

Thanks.
Now I have something like this:
AND(Type_G015.WithData)[SET(Seq[1].Wait)COIL(Seq[1].Seq[93].10
How can I extract just AND(Type_G015.WithData)? My idea was as follows:
(?:AND|OR|SUB|ADD)\((.*)\)
But this extracts up to the last closing parenthesis, and I would like to extract only up to the first closing parenthesis, just:
AND(Type_G015.WithData)
Between the parentheses there could be anything except more parentheses.
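One possible approach, sketched here under the stated assumption that no nested parentheses occur inside the group, is to reuse the negated class [^()] from the earlier answer in place of .* so that the match cannot run past the first closing parenthesis:

```python
import re

text = "AND(Type_G015.WithData)[SET(Seq[1].Wait)COIL(Seq[1].Seq[93].10"

# Greedy .* backtracks only as far as the LAST ")" in the string.
greedy = re.search(r"(?:AND|OR|SUB|ADD)\((.*)\)", text).group(0)

# [^()]* cannot cross any parenthesis, so the match ends at the FIRST ")".
narrow = re.search(r"(?:AND|OR|SUB|ADD)\([^()]*\)", text).group(0)
```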

Why is this regex code not working

Working on a Django app
Here's my urls.py
urlpatterns = [url(r'^(?P<leagueId>[0-9]+)/(?P<year>[0-9]+)/(?P<team>[\S]+)/$', views.team_detail, name="team_detail"),]
An example url would be along the lines of:
http://localhost:8000/123456/2017/Johnny%20Rocket/
I tried playing around with Pythex, but I couldn't get the urls to match
Note: The variables passed are /{number}/{year}/{name}
The name can consist of alphanumeric characters and whitespaces.

The character class \S matches anything except whitespace, and %20 is decoded to a space before it is matched against the regex.
To match alphanumerical characters and whitespace, you can use [\w\s].
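The difference can be reproduced outside Django with the re module alone. This is only a sketch: it assumes the pattern is applied to the decoded path without its leading slash, and uses urllib.parse.unquote to stand in for the decoding step.

```python
import re
from urllib.parse import unquote

# Decoded path: "123456/2017/Johnny Rocket/"
path = unquote("123456/2017/Johnny%20Rocket/")

original = re.compile(r'^(?P<leagueId>[0-9]+)/(?P<year>[0-9]+)/(?P<team>[\S]+)/$')
fixed = re.compile(r'^(?P<leagueId>[0-9]+)/(?P<year>[0-9]+)/(?P<team>[\w\s]+)/$')

original_match = original.search(path)   # \S refuses the space in the name
fixed_match = fixed.search(path)         # [\w\s] accepts letters, digits, _ and whitespace
captured = fixed_match.groupdict() if fixed_match else None
```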

I would use the following regex:
r"(?P<leagueId>\d+)/(?P<year>\d+)/(?P<team>[^/]+)/$"
A few small changes:
took out the ^, which anchors the match at the beginning of the string
changed the [0-9] to \d
But the big one is the team section. I like to use negated character classes. You basically want to match everything up to the next /, and that is what [^/]+ will do for you.
You may want to limit the year to a range like:
\d{2,4}
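A quick check of the negated-class variant with the year limited to two to four digits, again using plain re.search on a decoded path as a stand-in for the URL resolver (an assumption, not Django's actual code path):

```python
import re

pattern = re.compile(r"(?P<leagueId>\d+)/(?P<year>\d{2,4})/(?P<team>[^/]+)/$")

# [^/]+ accepts anything except a slash, including spaces and punctuation.
match = pattern.search("123456/2017/Johnny Rocket/")
team = match.group("team") if match else None

# A team segment containing a slash is rejected.
with_slash = pattern.search("123456/2017/Johnny/Rocket/")
```

Note that [^/]+ is broader than [\w\s]+: it also lets through characters such as - or ., which may or may not be desirable for team names.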

Understanding regex pattern used to find string between strings in html

I have the following html file:
<!-- <div class="_5ay5"><table class="uiGrid _51mz" cellspacing="0" cellpadding="0"><tbody><tr class="_51mx"><td class="_51m-"><div class="_u3y"><div class="_5asl"><a class="_47hq _5asm" href="/Dev/videos/1610110089242029/" aria-label="Who said it?" ajaxify="/Dev/videos/1610110089242029/" rel="theater">
In order to pull the string of numbers between videos/ and /", I'm using the following method that I found:
import re
Source_file = open('source.html').read()
result = re.compile('videos/(.*?)/"').search(Source_file)
print result
I've tried Googling an explanation for exactly how the (.*?) works in this particular implementation, but I'm still unclear. Could someone explain this to me? Is this what's known as a "non-greedy" match? If yes, what does that mean?

The ? in this context is a special operator on the repetition operators (+, *, and ?). In engines where it is available this causes the repetition to be lazy or non-greedy or reluctant or other such terms. Typically repetition is greedy which means that it should match as much as possible. So you have three types of repetition in most modern perl-compatible engines:
.* # Match any character zero or more times
.*? # Match any character zero or more times until the next match (reluctant)
.*+ # Match any character zero or more times and don't stop matching! (possessive)
More information can be found here: http://www.regular-expressions.info/repeat.html#lazy for reluctant/lazy and here: http://www.regular-expressions.info/possessive.html for possessive (which I'll skip discussing in this answer).
Suppose we have the string aaaa. We can match all of the a's with /(a+)a/. Literally this is
match one or more a's followed by an a.
This will match aaaa. The regex is greedy and will match as many a's as possible. The first submatch is aaa.
If we use the regex /(a+?)a/ this is
reluctantly match one or more a's followed by an a
or
match one or more a's until we reach another a
That is, only match what we need. So in this case the match is aa and the first submatch is a. We only need to match one a to satisfy the repetition and then it is followed by an a.
This comes up a lot when using regex to match within html tags, quotes and the suchlike -- usually reserved for quick and dirty operations. That is to say using regex to extract from very large and complex html strings or quoted strings with escape sequence can cause a lot of problems but it's perfectly fine for specific use cases. So in your case we have:
/Dev/videos/1610110089242029/
The expression needs to match videos/ followed by zero or more characters followed by /". If there is only one videos URL there that's just fine without being reluctant.
However we have
/videos/1610110089242029/" ... ajaxify="/Dev/videos/1610110089242029/"
Without reluctance, the regex will match:
1610110089242029/" ... ajaxify="/Dev/videos/1610110089242029
It tries to match as much as possible and / and " satisfy . just fine. With reluctance, the matching stops at the first /" (actually it backtracks but you can read about that separately). Thus you only get the part of the url you need.
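Both the aaaa example and the HTML case can be verified directly in Python. The snippet below is a sketch that uses a shortened copy of the tag from the question.

```python
import re

# Greedy: (a+) takes three a's and leaves one for the trailing "a".
greedy = re.search(r"(a+)a", "aaaa")
# Reluctant: (a+?) takes a single a, which is enough once another "a" follows.
lazy = re.search(r"(a+?)a", "aaaa")

html = ('<a class="_47hq _5asm" href="/Dev/videos/1610110089242029/" '
        'aria-label="Who said it?" ajaxify="/Dev/videos/1610110089242029/" rel="theater">')

# Greedy runs to the last /" in the tag; reluctant stops at the first one.
greedy_id = re.search(r'videos/(.*)/"', html).group(1)
lazy_id = re.search(r'videos/(.*?)/"', html).group(1)
```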

It can be explained in a simple way:
.: match anything (any character),
*: any number of times (at least zero times),
?: as few times as possible (hence non-greedy).
videos/(.*?)/"
as a regular expression matches (for example)
videos/1610110089242029/"
and the first capturing group returns 1610110089242029, because any of the digits is part of “any character” and there are at least zero characters in it.
The ? causes something like this:
videos/1610110089242029/" something else … "videos/2387423470237509/"
to properly match as 1610110089242029 and 2387423470237509 instead of as 1610110089242029/" something else … "videos/2387423470237509, hence “as few times as possible”, hence “non-greedy”.
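With re.findall the effect on a string containing two video URLs is easy to see. The sample string below is made up along the lines of the example above.

```python
import re

text = 'videos/1610110089242029/" something else "videos/2387423470237509/"'

# Non-greedy: each match ends at the nearest /" so both ids are found.
lazy_ids = re.findall(r'videos/(.*?)/"', text)

# Greedy: a single match that swallows everything up to the last /".
greedy_ids = re.findall(r'videos/(.*)/"', text)
```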

The . means any character. The * means any number of times, including zero. The ? does indeed mean non-greedy; that means that it will try to capture as few characters as possible, i.e., if the regex encounters a /, it could match it with the ., but it would rather not because the .* is non-greedy, and since the next character in the regex is happy to match /, the . doesn't have to. If you didn't have the ?, the . would be chomping at the bit to match as many things as possible: it would eat up everything to the end of the line (by default . does not match a newline), and the regex would then back off only as far as the last /" on that line.

Finding big string sequence between two keywords within multiple lines

I have a file with the format of
sjaskdjajldlj_abc:
cdf_asjdl_dlsf1:
dfsflks %jdkeajd
sdjfls:
adkfld %dk_.(%sfj)sdaj, %kjdflajfs
afjdfj _ajhfkdjf
zjddjh -15afjkkd
xyz
and I want to find the text in between the string _abc: in the first line and xyz in the last line.
I have already tried
print re.findall(re.escape("*_abc:")+"(*)"+re.escape("xyz"),line)
But I got null.

If I understood the requirement correctly:
a1=re.search(r'_abc(.*)xyz',line,re.DOTALL)
print a1.group(1)
Use re.DOTALL which will enable . to match a newline character as well.
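The effect of the flag can be seen by running the same pattern with and without it on the sample file contents. In this sketch the text is inlined as a string rather than read from a file.

```python
import re

text = """sjaskdjajldlj_abc:
cdf_asjdl_dlsf1:
dfsflks %jdkeajd
sdjfls:
adkfld %dk_.(%sfj)sdaj, %kjdflajfs
afjdfj _ajhfkdjf
zjddjh -15afjkkd
xyz"""

# Without DOTALL, "." stops at the first newline, so "xyz" is never reached.
without_flag = re.search(r"_abc:(.*)xyz", text)

# With DOTALL, "." also matches newlines and the group spans all the lines.
with_flag = re.search(r"_abc:(.*)xyz", text, re.DOTALL)
between_lines = with_flag.group(1).strip().splitlines()
```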

You used re.escape on your pattern when it contains special characters, so there's no way it will work.
>>> re.escape("*_abc:")
'\\*_abc\\:'
This will match the actual phrase *_abc:, but that's not what you want.
Just take the re.escape calls out and it should work more or less correctly.

It sounds like you have a misunderstanding about what the * symbol means in a regular expression. It doesn't mean "match anything", but rather "repeat the previous thing zero or more times".
To match any string, you need to combine * with ., which matches any single character (almost, more on this later). The pattern .* matches any string of zero or more characters.
So, you could change your pattern to be .*abc(.*)xyz and you'd be most of the way there. However, if the prefix and suffix only exist once in the text the leading .* is unnecessary. You can omit it and just let the regular expression engine handle skipping over any unmatched characters before the abc prefix.
The one remaining issue is that you have multiple lines of text in your source text. I mentioned above that the . pattern matches any single character, but that's not entirely true. By default it won't match a newline. For single-line texts that doesn't matter, but it will cause problems for you here. To change that behavior you can pass the flag re.DOTALL (or its shorter spelling, re.S) as a third argument to re.findall or re.search. That flag tells the regular expression system to allow the . pattern to match any character including newlines.
So, here's how you could turn your current code into a working system:
import re

def find_between(prefix, suffix, text):
    pattern = r"{}.*{}".format(re.escape(prefix), re.escape(suffix))
    result = re.search(pattern, text, re.DOTALL)
    if result:
        return result.group()
    else:
        return None # or perhaps raise an exception instead
I've simplified the pattern a bit, since your comment suggested that you want to get the whole matched text, not just the parts in between the prefix and suffix.
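If only the text between the two markers is wanted, a variant with a capture group is one option. This is a sketch, not part of the answer above: the name find_inner is hypothetical, and the reluctant .*? is a design choice that stops at the first occurrence of the suffix instead of the last.

```python
import re

def find_inner(prefix, suffix, text):
    """Return the text between prefix and the first following suffix, or None."""
    # re.escape makes prefix and suffix literal; (.*?) captures what lies between.
    pattern = re.escape(prefix) + r"(.*?)" + re.escape(suffix)
    result = re.search(pattern, text, re.DOTALL)
    return result.group(1) if result else None

sample = "sjaskdjajldlj_abc:\nline one\nline two\nxyz"
inner = find_inner("_abc:", "xyz", sample)
missing = find_inner("_abc:", "nope", sample)
```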
