Get literal prefix of regex pattern - python

Here is the problem:
There is a list of thousands of regular expressions. I need to find the regular expression that matches a given string. These regexes are expected to be mutually exclusive, so if several of them match at the same time, I'm fine with returning any of them.
I assume that most of the regular expressions start with a literal prefix, e.g.:
"some_literal_string(?:\?some_regular_part)?" → "some_literal_string"
I'd like to try the following data structure to make the search fast:
regexes = [index=prefix_length: {key=prefix: {*prefixes}}]
Now, to find the prefix, I need to iterate the index i from min(len(string), len(longest_prefix)) down to 0 and extract a subset of the regexes:
subset = regexes[i][string[0:i]]
Then I need to check each element of the subset for a match. If a matching pattern is found, return it; otherwise, continue with the next index.
The question is: how to get literal prefix of a regular expression in the common case?
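The lookup described above could be sketched roughly as follows. This is only an illustration under assumptions: the names build_index and find_pattern are hypothetical, the literal prefixes are supplied by hand rather than extracted, and re.match is used as the matching rule, which may not be what the real application needs.

```python
import re
from collections import defaultdict


def build_index(prefixed_patterns):
    """Bucket compiled patterns by prefix length, then by literal prefix.

    prefixed_patterns is an iterable of (literal_prefix, pattern) pairs;
    how the prefix is obtained is outside the scope of this sketch.
    """
    index = defaultdict(lambda: defaultdict(list))
    for prefix, pattern in prefixed_patterns:
        index[len(prefix)][prefix].append(re.compile(pattern))
    return index


def find_pattern(index, text):
    """Return the source of the first matching pattern, longest prefix first."""
    if not index:
        return None
    longest = max(index)  # length of the longest known prefix
    for i in range(min(len(text), longest), -1, -1):
        bucket = index.get(i)
        if not bucket:
            continue
        # Only patterns whose literal prefix equals text[:i] are tried.
        for regex in bucket.get(text[:i], ()):
            if regex.match(text):
                return regex.pattern
    return None


index = build_index([
    ("some_literal_string", r"some_literal_string(?:\?some_regular_part)?"),
    ("user/", r"user/\d+"),
    ("", r"\d{4}-\d{2}-\d{2}"),  # no literal prefix: lands in bucket 0
])
```

Patterns without a usable literal prefix end up in the length-0 bucket, so they are still tried, but only after every longer prefix has been ruled out.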

I came up with the following regular expression:
(?:[^.^$*+?{\\[|(]|(?:\\(?:[^\dAbBdDsSwWZ]|0|[0-7]{3})))*(?![*?|]|{\d+(?:,\d*)?})
After the search, every backslash+symbol pair in the matched string has to be replaced with the symbol itself:
\$ → $
Octal escapes have to be decoded as well:
\100 → @
https://regex101.com/r/fF8aB9/2
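A minimal sketch of how the expression above might be applied in Python, together with the unescaping step. The function name literal_prefix is hypothetical. As a deliberate assumption, the sketch cuts the prefix short at any backslash followed by a letter or digit other than an octal escape (for example \n or \x41) instead of trying to decode it, since simply dropping the backslash there would produce a wrong prefix.

```python
import re

# The expression from the post, copied verbatim (split over two lines).
PREFIX_RE = re.compile(
    r"(?:[^.^$*+?{\\[|(]|(?:\\(?:[^\dAbBdDsSwWZ]|0|[0-7]{3})))*"
    r"(?![*?|]|{\d+(?:,\d*)?})"
)

# Tokens of the matched text: octal escape, other escape, plain character.
TOKEN_RE = re.compile(r"\\([0-7]{3})|\\(.)|(.)", re.DOTALL)


def literal_prefix(pattern):
    """Return the decoded literal prefix of pattern (possibly empty)."""
    match = PREFIX_RE.match(pattern)
    raw = match.group(0) if match else ""
    decoded = []
    for token in TOKEN_RE.finditer(raw):
        octal, escaped, plain = token.groups()
        if octal is not None:
            decoded.append(chr(int(octal, 8)))  # e.g. \100 -> @
        elif escaped is not None:
            if escaped == "0":
                decoded.append("\0")
            elif escaped.isalnum():
                break  # \n, \t, \x.. are not decoded here: stop early
            else:
                decoded.append(escaped)  # e.g. \$ -> $
        else:
            decoded.append(plain)
    return "".join(decoded)
```

Stopping early only shortens the prefix, so a pattern can land in a less specific bucket but is never filed under a wrong one.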

Related

Python regex expression example

I have an input that is valid if it has these parts, in this order:
it starts with letters (upper and lower case), numbers and some of the following characters (!,#,#,$,?)
the next part begins with = and contains only numbers
the last part begins with "<<" and may contain anything
example: !!Hel##lo!#=7<<vbnfhfg
What is the right regular expression in Python to check whether the input is valid?
I am trying with
pattern= r"([a-zA-Z0-9|!|#|#|$|?]{2,})([=]{1})([0-9]{1})([<]{2})([a-zA-Z0-9]{1,})/+"
but apparently it is wrong.
For testing regexes I can really recommend regex101. It makes it much easier to understand what your regex is doing and which strings it matches.
Now, for your regex pattern and the example you provided, you need to remove the /+ at the end. Then it matches your example string. However, it splits the string into five capture groups and not into three, which is what I understand you want from your list. To split it into three capture groups you could use this:
"([a-zA-Z0-9!##$?]{2,})([=]{1}[0-9]+)(<<.*)"
This returns the capture groups:
!!Hel##lo!#
=7
<<vbnfhfg
Notice I simplified your last group a little bit, using a dot instead of the list of characters. A dot matches anything, so change that back to your approach in case you don't want to match special characters.
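A quick way to check the three groups from Python. Using fullmatch here is an assumption that the whole input has to be valid, not just a part of it:

```python
import re

pattern = re.compile(r"([a-zA-Z0-9!##$?]{2,})([=]{1}[0-9]+)(<<.*)")

# fullmatch requires the entire string to fit the pattern.
match = pattern.fullmatch("!!Hel##lo!#=7<<vbnfhfg")
print(match.groups())  # ('!!Hel##lo!#', '=7', '<<vbnfhfg')

# An input that lacks the "=" and "<<" parts is rejected.
print(pattern.fullmatch("Hello"))  # None
```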

search substring + integer from a string in python using regular expression

I have a string
str="TMOUT=1800; export TMOUT"
I want to extract only TMOUT=1800 from the above string, but 1800 is not constant; it can be any integer value, for example TMOUT=18 or TMOUT=201. I'm very new to regular expressions.
I tried the code below:
re.search("TMOUT=\d",str)
It is not working. Please help.
\d matches a single digit. You want to match one or more digits, so you have to add a + quantifier:
re.search(r"TMOUT=\d+", text)
If you then want to extract the number, you have to create a group using parentheses ():
match = re.search(r"TMOUT=(\d+)", text)
number = int(match.group(1))
Or you may want to use the named group syntax (?P<name>):
match = re.search(r"TMOUT=(?P<num>\d+)", text)
number = int(match.group("num"))
I suggest you use regex101 to test your regexes and get an explanation of what they do. Also read the Python re docs to learn about the methods of the various objects and the functions available.
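Put together as a small runnable sketch. The variable is called text rather than str so that the built-in str is not shadowed, and the None check is an added precaution for input that does not contain the setting:

```python
import re

text = "TMOUT=1800; export TMOUT"

match = re.search(r"TMOUT=(\d+)", text)
if match is not None:
    assignment = match.group(0)   # the whole match: 'TMOUT=1800'
    number = int(match.group(1))  # only the digits: 1800
    print(assignment, number)
```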

Python Regex for alpha(alpha|digit)*

I'm trying to produce a python regex to represent identifiers for a lexical analyzer. My approach is:
([a-zA-Z]([a-zA-Z]|\d)*)
When I use this in:
regex = re.compile("\s*([a-zA-Z]([a-zA-Z]|\d)*)")
regex.findall(line)
It doesn't produce a list of identifiers like it should. Have I built the expression incorrectly?
What's a good way to represent the form:
alpha(alpha|digit)*
With the python re module?
Like this:
regex = re.compile(r'[a-zA-Z][a-zA-Z\d]*')
Note the r before the quote to obtain a raw string, otherwise you need to escape all backslashes.
Since the leading \s* is optional, you can remove it, along with the capture groups.
If you want to ensure that the match isn't preceded by a digit or a letter, you can write it like this with a negative lookbehind (?<!...):
regex = re.compile(r'(?:^|(?<![\da-zA-Z]))[a-zA-Z][a-zA-Z\d]*')
Note that with re.compile you can use the case insensitive option:
regex = re.compile(r'(?:^|(?<![\da-z]))[a-z][a-z\d]*', re.I)
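The reason the original expression did not return a plain list is that findall returns tuples of groups when the pattern contains capture groups. A short comparison on a made-up input line (the sample strings are assumptions for illustration):

```python
import re

line = "x1 = foo + bar42"

# With capture groups, findall returns one tuple of groups per match.
grouped = re.compile(r"\s*([a-zA-Z]([a-zA-Z]|\d)*)")
print(grouped.findall(line))  # [('x1', '1'), ('foo', 'o'), ('bar42', '2')]

# Without groups, findall returns the matched text itself.
plain = re.compile(r"[a-zA-Z][a-zA-Z\d]*")
print(plain.findall(line))  # ['x1', 'foo', 'bar42']

# The lookbehind variant skips letters that directly follow a digit.
guarded = re.compile(r"(?:^|(?<![\da-zA-Z]))[a-zA-Z][a-zA-Z\d]*")
print(plain.findall("9abc def"))    # ['abc', 'def']
print(guarded.findall("9abc def"))  # ['def']
```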

Joining two regular expressions together

I have two regular expressions: one matches all characters [a-z], and the other excludes the following combination of characters, [^spuz(ih)] (the characters s, p, u, z and the sequence ih). How would I combine these two so that I allow all alphanumeric characters except those listed in the second RE?
re.match(r'^[a-z]*(?![spuz]|ih)[a-z]s$', insert_phrase)
You can't "combine" them as such, but you can write another regular expression which has the same effect. For this, you can use the negative lookahead construct (?!...). It matches 0 characters, and only if the regular expression inside it does not match the text that follows. So you can use:
'(?![spuz(ih)])[a-z]'
Or, since this wasn't what you wanted, change it to:
'(?![spuz]|ih)[a-z]'
In the changed question, you seem to want negative lookbehind instead. This turns the pattern into:
'^[a-z]*(?<![a-z][spuz]|ih)s$'
Note the extra [a-z] in the lookbehind part. It is required because lookbehind expressions must be fixed width, so both alternatives have to be two characters long. As a side effect, a string like 'ps' will match the pattern, which you don't want. So instead, it's better to use two separate lookbehinds (both of which have to be true for the string to match):
'^[a-z]*(?<![spuz])(?<!ih)s$'
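The difference between the two lookbehind variants can be checked with a few sample words (the words themselves are made up for illustration):

```python
import re

# One fixed-width lookbehind with two 2-character alternatives.
single = re.compile(r'^[a-z]*(?<![a-z][spuz]|ih)s$')
# Two independent lookbehinds of different widths.
double = re.compile(r'^[a-z]*(?<![spuz])(?<!ih)s$')

results = {
    word: (bool(single.match(word)), bool(double.match(word)))
    for word in ["cats", "ps", "caps", "fihs"]
}
for word, (one, two) in results.items():
    print(word, one, two)
# cats True True
# ps True False    <- only the single lookbehind lets 'ps' through
# caps False False
# fihs False False
```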

Difference in regex behavior between Perl and Python?

I have a couple email addresses, 'support#company.com' and '1234567#tickets.company.com'.
In perl, I could take the To: line of a raw email and find either of the above addresses with
/\w+#(tickets\.)?company\.com/i
In Python, I simply wrote the above regex as '\w+#(tickets\.)?company\.com', expecting the same result. However, support#company.com isn't found at all, and a findall on the second returns a list containing only 'tickets.'. So clearly the '(tickets\.)?' is the problem area, but what exactly is the difference in regular expression rules between Perl and Python that I'm missing?
The documentation for re.findall:
findall(pattern, string, flags=0)
Return a list of all non-overlapping matches in the string.
If one or more groups are present in the pattern, return a
list of groups; this will be a list of tuples if the pattern
has more than one group.
Empty matches are included in the result.
Since (tickets\.) is a group, findall returns that instead of the whole match. If you want the whole match, put a group around the whole pattern and/or use non-capturing groups, i.e.
r'(\w+#(tickets\.)?company\.com)'
r'\w+#(?:tickets\.)?company\.com'
Note that you'll have to pick out the first element of each tuple returned by findall in the first case.
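A short demonstration of the three variants on a made-up To: line (the sample text is an assumption; the addresses are the ones from the question):

```python
import re

text = "To: support#company.com, 1234567#tickets.company.com"

# One capturing group: findall returns only that group for each match,
# with '' where the optional group did not take part in the match.
only_group = re.findall(r'\w+#(tickets\.)?company\.com', text)
print(only_group)  # ['', 'tickets.']

# Non-capturing group: findall returns the whole match.
whole = re.findall(r'\w+#(?:tickets\.)?company\.com', text)
print(whole)  # ['support#company.com', '1234567#tickets.company.com']

# Outer group plus inner group: tuples, whole match is the first element.
tuples = re.findall(r'(\w+#(tickets\.)?company\.com)', text)
first_elements = [groups[0] for groups in tuples]
print(first_elements)  # ['support#company.com', '1234567#tickets.company.com']
```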
I think the problem is in your expectations of extracted values. Try using this in your current Python code:
'(\w+#(?:tickets\.)?company\.com)'
Two problems jump out at me:
You need to use a raw string to avoid having to escape "\"
You need to escape "."
So try:
r'\w+#(tickets\.)?company\.com'
EDIT
Sample output:
>>> import re
>>> exp = re.compile(r'\w+#(tickets\.)?company\.com')
>>> bool(exp.match("s#company.com"))
True
>>> bool(exp.match("1234567#tickets.company.com"))
True
There isn't a difference in the regexes, but there is a difference in what you are looking for. Your regex is capturing only "tickets." if it exists in both regexes. You probably want something like this
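#!/usr/bin/env python3
import re

# Capture the whole address; the optional "tickets." part is non-capturing.
regex = re.compile(r"(\w+#(?:tickets\.)?company\.com)")
addresses = [
    "foo#company.com",
    "foo#tickets.company.com",
    "foo#ticketsacompany.com",
    "foo#compant.org",
]
for address in addresses:
    # Prints a one-element list for a match and an empty list otherwise.
    print(regex.findall(address))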
