For example, how could we recognize a string of the following format with a single RE:
LenOfStr:Str
An example string in this format is:
5:5:str
The string we're looking for is "5:str".
In python, maybe something like the following (this isn't working):
r'(?P<len>\d+):(?P<str>.{int((?P=len))})'
In general, is there a way to change the previously matched groups before using them, or have I just asked yet another question not meant for RE?
Thanks.
Yep, what you're describing is outside the bounds of regular expressions. Regular expressions only deal with actual character data. This provides some limited ability to make matches dependent on context (e.g., (.)\1 to match the same character twice), but you can't apply arbitrary functions to pieces of an in-progress match and use the results later in the same match.
You could do something like search for text matching the regex (\d+):\w+, and then postprocess the results to check if the string length is equal to the int value of the first part of the match. But you can't do that as part of the matching process itself.
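For illustration, here's a minimal sketch of that two-step approach (the pattern here is mine and differs slightly from the (\d+):\w+ above so it can also match the colon in "5:str"):
import re

s = "5:5:str"
m = re.match(r"(\d+):(.*)", s)      # capture the length prefix and the rest
if m:
    n = int(m.group(1))
    if len(m.group(2)) == n:        # postprocess: compare against the declared length
        print(m.group(2))           # prints '5:str'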
Well this can be done with a regex (if I understand the question):
>>> s='5:5:str and some more characters...'
>>> m=re.search(r'^(\d+):(.*)$',s)
>>> m.group(2)[0:int(m.group(1))]
'5:str'
It just cannot be done by dynamically changing the previous match group.
You can make it look like a single regex like so:
>>> re.sub(r'^(\d+):(.*)$',lambda m: m.group(2)[0:int(m.group(1))],s)
'5:str'
Related
I'm using Python's re library to do this, but it's a basic regex question.
I am receiving a string of coordinate information in degrees-minutes-seconds format without spaces, and I'm parsing it out to discrete coordinate pairs for conversion.
The string is fed to me looking like this (fake coords for example):
102030N0102030E203040N0203040E304050N0304050E405060N0405060E
I am catching it like this:
coordstr = '102030N0102030E203040N0203040E304050N0304050E405060N0405060E'
coords = re.match(
    re.compile(r"^(\d+[NS]{1}\d+[EW]{1})(\d+[NS]{1}\d+[EW]{1})(\d+[NS]{1}\d+[EW]{1})(\d+[NS]{1}\d+[EW]{1})"),
    coordstr)
for x in coords.groups():
    print(x)
which gives me
102030N0102030E
203040N0203040E
304050N0304050E
405060N0405060E
And allows me to address each coordinate pair as coords.group(1), coords.group(2) and so on.
So it works, but it feels like I'm being too verbose in the pattern. Is there a more succinct way to crawl the line with one of the capture groups, and add each matched group to .groups() as it's encountered? I know I could do it with brute force string slicing but that seems like more trouble than it's worth.
I've read this but it doesn't seem to address what I'm going after in this question.
Because this is for an enterprise and these strings describe raster bounds, I will be validating the string before introducing the regex search and falling back to a gdal object if the string is not found (or corrupted).
Since you will pre-validate the strings before processing them with the regex, you need not use re.search / re.match with several identically patterned groups; you can use re.findall to get all \d+[NS]\d+[EW] matches from your strings:
import re
coordstr = '102030N0102030E203040N0203040E304050N0304050E405060N0405060E'
coords = re.findall(r'\d+[NS]\d+[EW]', coordstr)
for x in coords:
    print(x)
Output:
102030N0102030E
203040N0203040E
304050N0304050E
405060N0405060E
NOTE: the list of matches returned by re.findall will always be in the same order as they are in the source text, see this SO post.
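If you also want the latitude and longitude halves of each pair separately (a possible follow-up of mine, not something the question asks for), putting groups inside the findall pattern makes it return tuples:
import re

coordstr = '102030N0102030E203040N0203040E304050N0304050E405060N0405060E'
pairs = re.findall(r'(\d+[NS])(\d+[EW])', coordstr)
print(pairs[0])  # ('102030N', '0102030E')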
I am looking for a regular expression that discriminates between a string that contains a numerical value enclosed between parentheses and a string that contains one outside of them. The problem is that parentheses may be nested:
So, for example the expression should match the following strings:
hey(example1)
also(this(onetoo2(hard)))
but(here(is(a(harder)one)maybe23)Hehe)
But it should not match any of the following:
this(one)is22misleading
how(to(go)on)with(multiple)3parent(heses(around))
So far I've tried
\d[A-Za-z] \)
and easy things like this one. The problem with this one is that it does not match example 2, because there is a ( right after the digit.
How could I solve this one?
The problem is not one of pattern matching. That means regular expressions are not the right tool for this.
Instead, you need lexical analysis and parsing. There are many libraries available for that job.
You might try the parsing or pyparsing libraries.
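If you'd rather avoid a parsing library, a minimal hand-rolled scanner along these lines (my own sketch, assuming the parentheses in the input are balanced) tracks the nesting depth and requires every digit to sit inside at least one pair of parentheses:
def digits_only_inside_parens(s):
    depth = 0
    saw_digit = False
    for ch in s:
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
        elif ch.isdigit():
            saw_digit = True
            if depth == 0:   # a digit outside all parentheses disqualifies the string
                return False
    return saw_digit

print(digits_only_inside_parens('also(this(onetoo2(hard)))'))  # True
print(digits_only_inside_parens('this(one)is22misleading'))    # False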
These types of regexes are not always easy, but sometimes it's possible to come up with a way, provided the input remains somewhat consistent. A pattern generally like this should work:
(.*(\([\d]+[^(].*\)|\(.*[^)][\d]+.*\)).*)
Code:
import re

searchtext = 'also(this(onetoo2(hard)))'  # example input taken from the question
p = re.compile(r'(.*(\([\d]+[^(].*\)|\(.*[^)][\d]+.*\)).*)', re.MULTILINE)
result = re.findall(p, searchtext)
print(result)
Result:
https://regex101.com/r/aL8bB8/1
I am trying to print the shared characters between 2 sets of strings in Python, I am doing this with the hopes of actually finding how to do this using nothing but python regular expressions (I don't know regex so this might be a good time to learn it).
So if first_word = "peepa" and second_word = "poopa" I want the return value to be: "pa"
since in both variables the characters that are shared are p and a. So far I am following the documentation on how to use the re module, but I can't seem to grasp the basic concepts of this.
Any ideas as to how would I solve this problem?
This sounds like a problem where you want to find the intersection of characters between the two strings. The quickest way would be to do this:
>>> set(first_word).intersection(second_word)
set(['a', 'p'])
I don't think regular expressions are the right fit for this problem.
Use sets. Converting a string to a set gives you its unique letters. Then you can take the intersection of the two sets.
match = set(first_word.lower()) & set(second_word.lower())
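For example, with the words from the question:
first_word = "peepa"
second_word = "poopa"
match = set(first_word.lower()) & set(second_word.lower())
print(match)  # {'p', 'a'} (set order may vary)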
Using regular expressions
This problem is tailor-made for sets. But you ask for "how to do this using nothing but python regular expressions."
Here is a start:
>>> import re
>>> re.sub('[^peepa]', '', "poopa")
'ppa'
The above uses regular expressions to remove from "poopa" every letter that was not already in "peepa". (As you can see, it leaves duplicated letters, which sets would not.)
In more detail, re.sub does substitutions based on regular expressions. [peepa] is a regular expression that means any of the letters peepa. The regular expression [^peepa] means anything that is not in peepa. Anything matching this regular expression is replaced with the empty string "", that is, it is removed. What remains are only the common letters.
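If you do want unique letters (as the set-based answers give you), one possible follow-up is to deduplicate the regex result afterwards:
import re

common = set(re.sub('[^peepa]', '', 'poopa'))
print(common)  # {'p', 'a'} (set order may vary)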
So I'm playing around with regular expressions in Python. Here's what I've gotten so far (debugged through RegExr):
##(VAR|MVAR):([a-zA-Z0-9]+)+(?::([a-zA-Z0-9]+))*##
So what I'm trying to match is stuff like this:
##VAR:param1##
##VAR:param2:param3##
##VAR:param4:param5:param6:0##
Essentially, you have either VAR or MVAR followed by a colon, then some param name, then either the end chars (##) or another : and a param.
So, what I've gotten for the groups in the regex is the VAR, the first param, and then the last thing in the parameter list (for the last example, the third group would be 0). I understand that groups are created by (...), but is there any way for the regex to match repeated groups, so that param5, param6, and 0 each end up in their own group, rather than being limited to a maximum of three groups?
I'd like to avoid having to match the string and then split on :, as I think this should be doable with regex alone. Perhaps I'm approaching this the wrong way.
Essentially, I'm attempting to see if I can find and split in the matching process rather than in a postprocessing step.
If this format is fixed, you don't need regex, it just makes it harder. Just use split:
text.strip('#').split(':')
should do it.
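A quick check on the sample inputs from the question:
for text in ('##VAR:param1##', '##VAR:param4:param5:param6:0##'):
    print(text.strip('#').split(':'))
# ['VAR', 'param1']
# ['VAR', 'param4', 'param5', 'param6', '0']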
The number of groups in a regular expression is fixed. You will need to postprocess somehow.
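One common compromise (a sketch of my own, not the only option) is to capture the whole parameter list as a single group and split it afterwards:
import re

s = '##VAR:param4:param5:param6:0##'
m = re.match(r'##(VAR|MVAR)((?::[A-Za-z0-9]+)+)##', s)
if m:
    kind = m.group(1)
    params = m.group(2).lstrip(':').split(':')
    print(kind, params)  # VAR ['param4', 'param5', 'param6', '0']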
I'm using Python and I want to use regular expressions to check if something "is part of an include list" but "is not part of an exclude list".
My include list is represented by a regex, for example:
And.*
Everything which starts with And.
Also the exclude list is represented by a regex, for example:
(?!Andrea)
Everything, but not the string Andrea. The exclude list is obviously a negation.
Using the two examples above, for example, I want to match everything which starts with And except for Andrea.
In the general case I have an includeRegEx and an excludeRegEx. I want to match everything which matches includeRegEx but does not match excludeRegEx. Note that excludeRegEx is still in negative form (as you can see in the example above), so it would be more accurate to say: if something matches includeRegEx, I check whether it also matches excludeRegEx, and if it does, the match is satisfied. Is it possible to represent this in a single regular expression?
I think Conditional Regular Expressions could be the solution but I'm not really sure of that.
I'd like to see a working example in Python.
Thank you very much.
Why not put both in one regex?
And(?!rea$).*
Since the lookahead only "looks ahead" without consuming any characters, this works just fine (well, this is the whole point of lookaround, actually).
So, in Python:
if re.match(r"And(?!rea$).*", subject):
    # Successful match
    # Note that re.match always anchors the match
    # to the start of the string.
    pass
else:
    # Match attempt failed
    pass
From the wording of your question, I'm not sure if you're starting with two already finished lists of "match/don't match" pairs. In that case, you could simply combine them automatically by concatenating the regexes. This works just as well but is uglier:
(?!Andrea$)And.*
In general, then:
(?!excludeRegex$)includeRegex
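A small usage sketch of that general form (variable names are mine; here excludeRegex is the plain pattern of what to exclude, without the lookahead wrapper):
import re

include_regex = r'And.*'
exclude_regex = r'Andrea'
combined = r'(?!{0}$){1}'.format(exclude_regex, include_regex)

for s in ('Andrew', 'Andrea', 'Anderson', 'Bob'):
    print(s, bool(re.match(combined, s)))
# Andrew True, Andrea False, Anderson True, Bob False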