How should I make these regex capture groups more succinct? - python

I'm using python's re library to do this, but it's a basic regex question.
I am receiving a string of coordinate information in degrees-minutes-seconds format without spaces, and I'm parsing it out to discrete coordinate pairs for conversion.
The string is fed to me looking like this (fake coords for example):
102030N0102030E203040N0203040E304050N0304050E405060N0405060E
I am catching it like this:
coordstr = '102030N0102030E203040N0203040E304050N0304050E405060N0405060E'
coords = re.match(
re.compile(r"^(\d+[NS]{1}\d+[EW]{1})(\d+[NS]{1}\d+[EW]{1})(\d+[NS]{1}\d+[EW]{1})(\d+[NS]{1}\d+[EW]{1})"),
coordstr)
for x in coords.groups():
print(x)
which gives me
102030N0102030E
203040N0203040E
304050N0304050E
405060N0405060E
And allows me to address each coordinate pair as coords.group(1), coords.group(2) and so on.
So it works, but it feels like I'm being too verbose in the pattern. Is there a more succinct way to crawl the line with one of the capture groups, and add each matched group to .groups() as it's encountered? I know I could do it with brute force string slicing but that seems like more trouble than it's worth.
I've read this but it doesn't seem to address what I'm going after in this question.
Because this is for an enterprise and these strings describe raster bounds, I will be validating the string before introducing the regex search and falling back to a gdal object if the string is not found (or corrupted).

Since you will pre-validate the strings you will process with regex, you need not use re.search / re.match with several groups with identical pattern, you can use re.findall to get all \d+[NS]\d+[EW] pattern matches from your strings:
import re
coordstr = '102030N0102030E203040N0203040E304050N0304050E405060N0405060E'
coords = re.findall(r'\d+[NS]\d+[EW]', coordstr)
for x in coords:
print(x)
Output:
102030N0102030E
203040N0203040E
304050N0304050E
405060N0405060E
See the Python demo.
NOTE: the list of matches returned by re.findall will always be in the same order as they are in the source text, see this SO post.

Related

Find string in possibly multiple parentheses?

I am looking for a regular expression that discriminates between a string that contains a numerical value enclosed between parentheses, and a string that contains outside of them. The problem is, parentheses may be embedded into each other:
So, for example the expression should match the following strings:
hey(example1)
also(this(onetoo2(hard)))
but(here(is(a(harder)one)maybe23)Hehe)
But it should not match any of the following:
this(one)is22misleading
how(to(go)on)with(multiple)3parent(heses(around))
So far I've tried
\d[A-Za-z] \)
and easy things like this one. The problem with this one is it does not match the example 2, because it has a ( string after it.
How could I solve this one?
The problem is not one of pattern matching. That means regular expressions are not the right tool for this.
Instead, you need lexical analysis and parsing. There are many libraries available for that job.
You might try the parsing or pyparsing libraries.
These type of regexes are not always easy, but sometimes it's possible to come up with a way provided the input remains somewhat consistent. A pattern generally like this should work:
(.*(\([\d]+[^(].*\)|\(.*[^)][\d]+.*\)).*)
Code:
import re
p = re.compile(ur'(.*(\([\d]+[^(].*\)|\(.*[^)][\d]+.*\)).*)', re.MULTILINE)
result = re.findall(p, searchtext)
print(result)
Result:
https://regex101.com/r/aL8bB8/1

How to regex split, but keep the split string?

I have the following URL pattern:
http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/en
I would like to get everything up until and inclusive of /watch/\d+/.
So far I have:
>>> re.split(r'watch/\d+/', 'http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/en')
['http://www.hulu.jp/', 'supernatural-dub-hollywood-babylon/en']
But this does not include the split string (the string which appears between the domain and the path). The end answer I want to achieve is:
http://www.hulu.jp/watch/589851
You need to use capture group :
>>> re.split(r'(watch/\d+/)', 'http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/en')
['http://www.hulu.jp/', 'watch/589851/', 'supernatural-dub-hollywood-babylon/en']
As mentioned in the other answer, you need to use groups to capture the "glue" between the split strings.
I wonder though, is what you want here a split() or a search()? It looks (from the sample) that you're trying to extract from a URL everything from the first occurrence of /watch/XXX/ where XXX is 1 or more digits, to the end of the string. If that's the case, then a match/search might be more suitable, as with a split if the search regex can match multiple times you'll split into multiple groups. Ex:
re.split(r'(watch/\d+/)', 'http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/watch/2342/fdsaafsdf')
['http://www.hulu.jp/', 'watch/589851/', 'supernatural-dub-hollywood-babylon/', 'watch/2342/', 'fdsaafsdf']
Which doesn't look like what you want. Instead perhaps:
result = re.search(r'(watch/\d+/)(.*)', 'http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/watch/2342/fdsaafsdf')
result.groups() if result else []
which gives:
('watch/589851/', 'supernatural-dub-hollywood-babylon/watch/2342/fdsaafsdf')
You could also use this approach combined with named groups to get extra fancy:
result = re.search(r'(?P<watchId>watch/\d+/)(?P<path>.*)', 'http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/watch/2342/fdsaafsdf')
result.groupdict() if result else {}
giving:
{'path': 'supernatural-dub-hollywood-babylon/watch/2342/fdsaafsdf', 'watchId': 'watch/589851/'}
If you're set on the split() approach, you can also set the maxsplit parameter to ensure it's only split once:
re.split(r'(watch/\d+/)', 'http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/watch/2342/fdsaafsdf', maxsplit=1)
giving:
['http://www.hulu.jp/', 'watch/589851/', 'supernatural-dub-hollywood-babylon/watch/2342/fdsaafsdf']
Personally though, I find that when parsing URL's into constituent parts the search() with named groups approach works extremely well as it allows you to name the various parts in the regex itself, and via groupdict() get a nice dictionary you can use for working with those parts.
You've surely seen the Stack Overflow don't-parse-HTML-with-regex post, yes?
You can't parse [X]HTML with regex. Because HTML can't be parsed by regex. Regex is not a tool that can be used to correctly parse HTML. As I have answered in HTML-and-regex questions here so many times before, the use of regex will not allow you to consume HTML.
Well, regex can parse URLs, but trying to do so when there's a plethora of better tools is foolish.
This is what a regex for URLs looks like:
^(?:(?:https?|ftp):\/\/)(?:\S+(?::\S*)?#)?(?:(?!10(?:\.\d{1,3}){3})(?!127(?:\.\d{1,3}){3})(?!169\.254(?:\.\d{1,3}){2})(?!192\.168(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)(?:\.(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)*(?:\.(?:[a-z\x{00a1}-\x{ffff}]{2,})))(?::\d{2,5})?(?:\/[^\s]*)?$ (+ caseless flag)
It's just a mess of characters, right? Exactly!
Don't parse URLs with regex... almost.
There is one simple thing:
A path-relative URL must be zero or more path segments separated from each other by a "/".
Splitting the URL should be as simple as url.split("/").
from urllib.parse import urlparse, urlunparse
myurl = "http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/en"
# Run a parser over it
parts = urlparse(myurl)
# Crop the path to UP TO length 2
new_path = str("/".join(parts.path.split("/")[:3]))
# Unparse
urlunparse(parts._replace(path=new_path))
#>>> 'http://www.hulu.jp/watch/589851'
You can try following regex
.*\/watch\/\d+
Working Demo

Regex giving tuple and not full match

I'm trying to use regex to find proxy address on a website. Currently I'm using this piece of regex (\d{1,3}\.){3}\d{1,3}:(\d+). It works on regexr.com and in sublime text, but when I try to use it in Python it doesn't work as expected.
This is the piece of code I'm using:
p = re.compile("(\d{1,3}\.){3}\d{1,3}:(\d+)")
ipCandidates = p.findall(soupString)
It should return proxies like this 120.206.182.172:8123 but it returns tuples like this ('44.', '3128'). What can I do to fix this?
Thank you.
re.findall() only returns the contents of capturing groups instead of the whole match (if you have such groups in your regex).
Then, you're repeating a capturing group three times, which means that only the third repetition is preserved (the other two are overwritten).
Change your regex to
p = re.compile(r"(?:\d{1,3}\.){3}\d{1,3}:\d+")
and you'll get whole matches.
If you do want tuples of the separate submatches (without the dots and colon), you can do that, too, but you can't use repetition then:
p = re.compile(r"(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3}):(\d+)")
Also, always use raw strings for regexes, so regex escape sequences and string escape sequences can't be confused.

Regular Expressions Dependant on Previous Matchings

For example, how could we recognize a string of the following format with a single RE:
LenOfStr:Str
An example string in this format is:
5:5:str
The string we're looking for is "5:str".
In python, maybe something like the following (this isn't working):
r'(?P<len>\d+):(?P<str>.{int((?P=len))})'
In general, is there a way to change the previously matched groups before using them or I just asked yet another question not meant for RE.
Thanks.
Yep, what you're describing is outside the bounds of regular expressions. Regular expressions only deal with actual character data. This provides some limited ability to make matches dependent on context (e.g., (.)\1 to match the same character twice), but you can't apply arbitrary functions to pieces of an in-progress match and use the results later in the same match.
You could do something like search for text matching the regex (\d+):\w+, and then postprocess the results to check if the string length is equal to the int value of the first part of the match. But you can't do that as part of the matching process itself.
Well this can be done with a regex (if I understand the question):
>>> s='5:5:str and some more characters...'
>>> m=re.search(r'^(\d+):(.*)$',s)
>>> m.group(2)[0:int(m.group(1))]
'5:str'
It just cannot be done by dynamically changing the previous match group.
You can make it lool like a single regex like so:
>>> re.sub(r'^(\d+):(.*)$',lambda m: m.group(2)[0:int(m.group(1))],s)
'5:str'

Grouping in Python Regular Expressions

So I'm playing around with regular expressions in Python. Here's what I've gotten so far (debugged through RegExr):
##(VAR|MVAR):([a-zA-Z0-9]+)+(?::([a-zA-Z0-9]+))*##
So what I'm trying to match is stuff like this:
##VAR:param1##
##VAR:param2:param3##
##VAR:param4:param5:param6:0##
Essentially, you have either VAR or MVAR followed by a colon then some param name, then followed by the end chars (##) or another : and a param.
So, what I've gotten for the groups on the regex is the VAR, the first param, and then the last thing in the parameter list (for the last example, the 3rd group would be 0). I understand that groups are created by (...), but is there any way for the regex to match the multiple groups, so that param5, param6, and 0 are in their own group, rather than only having a maximum of three groups?
I'd like to avoid having to match this string then having to split on :, as I think this is capable of being done with regex. Perhaps I'm approaching this the wrong way.
Essentially, I'm attempting to see if I can find and split in the matching process rather than a postprocess.
If this format is fixed, you don't need regex, it just makes it harder. Just use split:
text.strip('#').split(':')
should do it.
The number of groups in a regular expression is fixed. You will need to postprocess somehow.

Categories

Resources