Regex pattern to require either of two specific strings in different positions - python

I have a string that can exist in either of the following two formats within a larger body of text:
OptionalSpecificString1 1234
1234 OptionalSpecificString2
The text here is all placeholders. I'm looking for a numerical string that's either preceded or followed by a specific optional string. One of the two optional specific strings will always be present and is needed to locate and capture the numerical string-of-interest. Is there a single regex pattern that exists that can capture this behavior?
Something like:
(?:OptionalSpecificString1)? (\d+) (?:OptionalSpecificString2)?
almost does it, but doesn't require that one of the two optional strings is present, and so it could end up matching any other numerical string in the body of the text. I know I could do something like:
(OptionalSpecificString1 (\d+)|(\d+) OptionalSpecificString2)
but I guess I'm just wondering if there's something a little more elegant. I'm doing this with the Python re module, so code can be a bit simpler too when I can express a single capture group for the same pattern.

The solution could be OptionalSpecificString1\s*(?P<numeric>\d+)|(?P<numeric>\d+)\s*OptionalSpecificString2, which simply makes the two syntaxes alternatives of one regex, if Python allowed the same group name to be defined twice.
As it doesn't, you can capture the numerical values into different groups, named or not, and pick the non-empty one back in Python code, like this:
import re
text = r'''
OptionalSpecificString1 1234
An irrelevant line
5678 OptionalSpecificString2
Another irrelevant line
'''
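# Each alternative captures the number into its own named group.
pattern = r'OptionalSpecificString1\s*(?P<numeric1>\d+)|(?P<numeric2>\d+)\s*OptionalSpecificString2'
numerics = []
for match in re.finditer(pattern, text):
    # Only one of the two groups takes part in any given match; the other is None.
    numerics.append(match.group('numeric1') or match.group('numeric2'))
print(numerics)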
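If a single pattern without any group juggling is preferred, lookarounds are one possible alternative. The following is a minimal sketch, not the approach above: it assumes the marker and the number are separated by exactly one space, because re only accepts fixed-width lookbehinds (so \s* cannot be used there). With no capturing groups in the pattern, re.findall returns the whole matches, which here are just the digits.

```python
import re

text = '''
OptionalSpecificString1 1234
An irrelevant line
5678 OptionalSpecificString2
Another irrelevant line
'''

# Assumption: exactly one space between marker and number.
# The lookbehind/lookahead require the marker without consuming it,
# so the overall match is only the run of digits.
pattern = r'(?<=OptionalSpecificString1 )\d+|\d+(?= OptionalSpecificString2)'

numerics = re.findall(pattern, text)
print(numerics)
```

The trade-off is flexibility: variable whitespace between marker and number needs the two-group version shown earlier.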

Related

Python regex fuzzy searching

I have a question about making a pattern using fuzzy regex with the python regex module.
I have several strings such as TCATGCACGTGGGGCTGAC
The first eight characters of this string are variable (multiple options): TCAGTGTG, TCATGCAC, TGGTGGCT. In addition, there is a constant part after the variable part: GTGGGGCTGAC.
I would like to design a regex that can detect this string in a longer string, while allowing for at most 2 substitutions.
For example, this would be acceptable as two characters have been substituted:
TCATGCACGTGGGGCTGAC
TCCTGCACGTGGAGCTGAC
However, more substitutions should not be accepted.
In my code, I tried to do the following:
import regex
variable_parts = ["TCAGTGTG", "TCATGCAC", "TGGTGGCT", "GATAAGTG", "ATTAGACG", "CACTTCCG", "GTCTGTAT", "TGTCAAAG"]
string_to_test = "TCATGCACGTGGGGCTGAC"
motif = "(%s)GTGGGGCTGAC" % "|".join(variable_parts)
pattern = regex.compile(r''+motif+'{s<=2}')
print(pattern.search(string_to_test))
I get a match when I run this code and when I change the last character of string_to_test. But when I manually add a substitution in the middle of string_to_test, I do not get any match (even while I want to allow up to 2 substitutions).
Now I know that my regex is probably total crap, but I would like to know what I exactly need to do to make this work and where in the code I need to add/remove/change stuff. Any suggestions/tips are welcome!
Right now, you only add the restriction to the last C in the pattern, which looks like (TCAGTGTG|TCATGCAC|TGGTGGCT|GATAAGTG|ATTAGACG|CACTTCCG|GTCTGTAT|TGTCAAAG)GTGGGGCTGAC{s<=2}.
To apply the {s<=2} quantifier to the whole expression you need to enclose the pattern within a non-capturing group:
pattern = regex.compile(fr'(?:{motif}){{s<=2}}')
The example above shows how to declare your pattern with the help of an f-string literal, where literal braces are defined with {{ and }} (doubled) braces. It yields the same result as pattern = regex.compile('(?:'+motif+'){s<=2}').
Also, note that r''+ is redundant and has no effect on the final pattern.
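The string building can be checked with plain Python, without compiling anything. This sketch uses a shortened variable_parts list for readability; the names unscoped, scoped_concat and scoped_fstring are only illustrative.

```python
variable_parts = ["TCAGTGTG", "TCATGCAC", "TGGTGGCT"]
motif = "(%s)GTGGGGCTGAC" % "|".join(variable_parts)

# Original attempt: the fuzzy quantifier is attached to the final "C" only.
unscoped = r'' + motif + '{s<=2}'

# Wrapping the motif in a non-capturing group scopes the quantifier
# to the whole expression.
scoped_concat = '(?:' + motif + '){s<=2}'

# Same string via an f-string; {{ and }} produce literal braces.
scoped_fstring = fr'(?:{motif}){{s<=2}}'

print(unscoped)
print(scoped_fstring)
print(scoped_concat == scoped_fstring)
```

The two scoped variants are the same string, so which one to pass to regex.compile is purely a matter of taste.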

Regex optional order of capturing group

I have a simple but tricky question about regex (used in Python) that I have not found an answer to anywhere here or on Google. Is there any "trick" to make two capture groups match in optional order? Let's say we have the following:
.*abc.*
What i want is to match also this:
.*acb.*
I know i could use
.*abc|acb.*
but the problem is that if we have something more complicated than abc, the code gets very long. Isn't there a workaround to say, e.g., "match the last two capturing groups (or symbols, etc.) in any order"?
I don't really get what this in-any-order thing is that would make the regex shorter. On the other hand, I can show you how to make this readable, even if you have tons of options.
import re
pattern = """
.* # match from starting the line
(?: # A non-capturing group starts so we can list lots of alternatives
abc| # alternative 1
acb # alternative 2
) # end of alternatives
.* # then match everything up to the end of the line
"""
re.search(pattern, 'qqabcqq', re.VERBOSE) # returns a match
re.search(pattern, 'qqacbqq', re.VERBOSE) # returns a match
re.search(pattern, 'qqaSDqq', re.VERBOSE) # does not return a match
So what did we just see here?
The """ ... """ construct is a convenient way to define multiline strings in python.
Then re.VERBOSE makes the regex engine skip whitespace and comments inside the pattern. As the manual says:
Whitespace within the pattern is ignored, except when in a character class or when preceded by an unescaped backslash. When a line contains a # that is not in a character class and is not preceded by an unescaped backslash, all characters from the leftmost such # through the end of the line are ignored.
These two things let you add structure and comments to your regex.
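One consequence of that rule is easy to trip over: a literal space in a verbose pattern has to be escaped or placed in a character class. A small sketch (the sample string "hello world" is just an illustration):

```python
import re

target = "hello world"

# The space in the pattern is ignored, so this is really r"helloworld".
ignored = re.search(r"hello world", target, re.VERBOSE)

# A backslash-escaped space is kept.
escaped = re.search(r"hello\ world", target, re.VERBOSE)

# A space inside a character class is kept as well; the trailing
# comment and the spaces before it are dropped.
in_class = re.search(r"hello[ ]world  # literal space via a class", target, re.VERBOSE)

print(ignored)           # no match
print(escaped.group())   # whole match
print(in_class.group())  # whole match
```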
With standard regular expressions you can define patterns without order. Example:
[cdgjow]
Of course this example refers to characters.
Alternative sequences must be specified using "|". Example:
abc|cba
Classic regular expression syntax has no element for expressing "these parts, in any order". That is a limit of the syntax itself, not of the automaton constructed from a regular expression, so you have to spell out the alternatives yourself.
That means you will have to construct the regular expression with all possible variants. There are two ways to do this:
Do it manually. Take your time, be careful, and build the correct regex by hand.
Do it programmatically. Write some code that generates the regex you require.
If you do it manually, consider @TamasRev's answer. (Thanks @TamasRev! Nice answer!) But if I were you I'd build the regex programmatically. (That is what programming was invented for anyway :-) )
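The programmatic route could look roughly like this. It is a sketch only: any_order is a hypothetical helper name, and it treats each part as literal text (escaped with re.escape). The number of alternatives grows factorially with the number of parts, so this is only practical for a handful of them.

```python
import re
from itertools import permutations

def any_order(parts):
    """Return a non-capturing group matching the literal parts in any order."""
    escaped = [re.escape(part) for part in parts]
    alternatives = ("".join(ordering) for ordering in permutations(escaped))
    return "(?:" + "|".join(alternatives) + ")"

# "a" followed by "b" and "c" in either order, anywhere in the line.
source = ".*a" + any_order(["b", "c"]) + ".*"
pattern = re.compile(source)

print(source)
print(bool(pattern.search("qqabcqq")))
print(bool(pattern.search("qqacbqq")))
print(bool(pattern.search("qqaSDqq")))
```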

Regex giving tuple and not full match

I'm trying to use regex to find proxy address on a website. Currently I'm using this piece of regex (\d{1,3}\.){3}\d{1,3}:(\d+). It works on regexr.com and in sublime text, but when I try to use it in Python it doesn't work as expected.
This is the piece of code I'm using:
p = re.compile("(\d{1,3}\.){3}\d{1,3}:(\d+)")
ipCandidates = p.findall(soupString)
It should return proxies like this 120.206.182.172:8123 but it returns tuples like this ('44.', '3128'). What can I do to fix this?
Thank you.
re.findall() only returns the contents of capturing groups instead of the whole match (if you have such groups in your regex).
Then, you're repeating a capturing group three times, which means that only the third repetition is preserved (the other two are overwritten).
Change your regex to
p = re.compile(r"(?:\d{1,3}\.){3}\d{1,3}:\d+")
and you'll get whole matches.
If you do want tuples of the separate submatches (without the dots and colon), you can do that, too, but you can't use repetition then:
p = re.compile(r"(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3}):(\d+)")
Also, always use raw strings for regexes, so regex escape sequences and string escape sequences can't be confused.
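The three behaviours can be compared side by side. In this sketch the sample string and both addresses in it are made up for illustration; they are not from the question's website.

```python
import re

# Hypothetical page text containing two proxy addresses.
soup_string = "proxies: 120.206.182.172:8123 and 10.0.0.44:3128"

# Capturing groups: findall returns tuples of group contents, and the
# repeated group only keeps its last repetition.
with_groups = re.findall(r"(\d{1,3}\.){3}\d{1,3}:(\d+)", soup_string)

# No capturing groups: findall returns the whole matches.
whole = re.findall(r"(?:\d{1,3}\.){3}\d{1,3}:\d+", soup_string)

# One group per piece, written out without repetition.
pieces = re.findall(r"(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3}):(\d+)", soup_string)

print(with_groups)
print(whole)
print(pieces)
```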

How can I find all Markdown links using regular expressions?

In Markdown there are two ways to place a link: one is to just type the raw link in, like http://example.com; the other is to use the ()[] syntax: (Stack Overflow)[http://example.com].
I'm trying to write a regular expression that can match both of these, and, if it's the second match to also capture the display string.
So far I have this:
(?P<href>http://(?:www\.)?\S+.com)|(?<=\((.*)\)\[)((?P=href))(?=\])
But this doesn't seem to match either of my two test cases in Debuggex:
http://example.com
(Example)[http://example.com]
Really not sure why the first one isn't matched at the very least. Is it something to do with my use of the named group? If possible I'd like to keep using it, because this is a simplified expression to match the link; in the real example it is too long for me to feel comfortable duplicating it in two different places in the same pattern.
What am I doing wrong? Or is this not doable at all?
EDIT: I'm doing this in Python so will be using their regex engine.
The reason your pattern doesn't work is here: (?<=\((.*)\)\[) since the re module of Python doesn't allow variable length lookbehind.
You can obtain what you want in a more handy way using the new regex module of Python (since the re module has few features in comparison).
Example: (?|(?<txt>(?<url>(?:ht|f)tps?://\S+(?<=\P{P})))|\(([^)]+)\)\[(\g<url>)\])
pattern details:
(?| # open a branch reset group
# first case there is only the url
(?<txt> # in this case, the text and the url
(?<url> # are the same
(?:ht|f)tps?://\S+(?<=\P{P})
)
)
| # OR
# the (text)[url] format
\( ([^)]+) \) # this group will be named "txt" too
\[ (\g<url>) \] # this one "url"
)
This pattern uses the branch reset feature (?|...|...|...), which preserves capturing group names (or numbers) across the members of an alternation. Since the ?<txt> group is opened first in the first member of the alternation, the first group in the second member automatically gets the same name. The same goes for the ?<url> group.
\g<url> is a reference to the named subpattern ?<url> (like an alias, so there is no need to rewrite it in the second member).
(?<=\P{P}) checks if the last character of the url is not a punctuation character (useful to avoid the closing square bracket for example). (I'm not sure of the syntax, it may be \P{Punct})
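For readers who want to stay with the standard re module, a different technique also avoids writing the URL sub-pattern twice: keep it in a Python string and interpolate it into both alternatives. This is a sketch, not the branch reset approach above. The names URL, LINK, find_links, url_a and url_b are hypothetical, the URL sub-pattern is deliberately simplified, and the (text)[url] order follows the question as written.

```python
import re

# Simplified URL sub-pattern (assumption): scheme, then anything up to
# whitespace or a closing square bracket.
URL = r"(?:ht|f)tps?://[^\s\]]+"

LINK = re.compile(
    rf"\((?P<txt>[^)]+)\)\[(?P<url_a>{URL})\]"  # (text)[url] form
    rf"|(?P<url_b>{URL})"                        # bare url
)

def find_links(text):
    """Return (display_text, url) pairs; a bare url is its own display text."""
    links = []
    for m in LINK.finditer(text):
        url = m.group("url_a") or m.group("url_b")
        links.append((m.group("txt") or url, url))
    return links

print(find_links("see http://example.com and (Example)[http://example.com] here"))
```

Group names still have to differ between the two alternatives in re, so the non-empty one is picked in Python code.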

Regular Expressions Dependant on Previous Matchings

For example, how could we recognize a string of the following format with a single RE:
LenOfStr:Str
An example string in this format is:
5:5:str
The string we're looking for is "5:str".
In python, maybe something like the following (this isn't working):
r'(?P<len>\d+):(?P<str>.{int((?P=len))})'
In general, is there a way to transform previously matched groups before using them, or have I just asked yet another question not meant for REs?
Thanks.
Yep, what you're describing is outside the bounds of regular expressions. Regular expressions only deal with actual character data. This provides some limited ability to make matches dependent on context (e.g., (.)\1 to match the same character twice), but you can't apply arbitrary functions to pieces of an in-progress match and use the results later in the same match.
You could do something like search for text matching the regex (\d+):\w+, and then postprocess the results to check if the string length is equal to the int value of the first part of the match. But you can't do that as part of the matching process itself.
Well this can be done with a regex (if I understand the question):
>>> import re
>>> s='5:5:str and some more characters...'
>>> m=re.search(r'^(\d+):(.*)$',s)
>>> m.group(2)[0:int(m.group(1))]
'5:str'
It just cannot be done by dynamically changing the previous match group.
You can make it look like a single regex like so:
>>> re.sub(r'^(\d+):(.*)$',lambda m: m.group(2)[0:int(m.group(1))],s)
'5:str'
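Note that slicing silently returns a shorter string when fewer characters are available than the prefix announces. A sketch of the post-processing idea that also checks the length (parse_length_prefixed is a hypothetical helper name, and returning None on failure is an arbitrary choice):

```python
import re

def parse_length_prefixed(s):
    """Parse 'LenOfStr:Str' at the start of s; return Str, or None if invalid."""
    m = re.match(r"(\d+):", s)
    if m is None:
        return None
    length = int(m.group(1))
    payload = s[m.end():m.end() + length]
    # Reject input that is shorter than the announced length.
    return payload if len(payload) == length else None

print(parse_length_prefixed("5:5:str and some more characters..."))
print(parse_length_prefixed("5:ab"))
```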
