Referencing a RegEx Variable - python

I'm using Python to loop through a large list of self-reported locations and match them to their home states. The regular expression I'm using is:
/^"[^\s]+,\s*([a-zA-Z]{2})"$/
Basically, I'm trying to find a pattern that looks like XXXCITYXXX, [Statecode], where statecode is only two letters.
My issue is that I don't know how to reference the varying state code once I find a matching string. I know in Perl that I could use:
$state = uc($1)
However, I don't know the equivalent Python syntax. Anyone know?

You can do it with re.search, which returns a match object (or None if the regex does not match at all). Its groups() method returns a tuple containing the captured groups:
import re
match = re.search(r'^[^\s]+,\s*([a-zA-Z]{2})$', my_string)
if match:
    print(match.groups()[0])
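As a rough sketch of the Perl `$state = uc($1)` idiom: `match.group(1)` plays the role of `$1`, and `str.upper()` the role of `uc`. The helper name `extract_state` and the sample strings are made up for illustration.

```python
import re

# Compile once, since the same pattern is applied to every location in the list.
STATE_RE = re.compile(r'^[^\s]+,\s*([a-zA-Z]{2})$')

def extract_state(location):
    """Return the upper-cased two-letter state code, or None if there is no match."""
    match = STATE_RE.search(location)
    if match is None:
        return None
    # group(1) is the first parenthesized group, like $1 in Perl.
    return match.group(1).upper()

print(extract_state("Austin, tx"))
print(extract_state("somewhere"))
```

Note that `[^\s]+` does not allow whitespace, so a city such as "New York, NY" would not match this pattern as written.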

Related

Does this regex fail, or do I need to modify the regex to support "optional followed by"?

I am trying the following regex: https://regex101.com/r/5dlRZV/1/. I am aware that I am testing it with \author and not \maketitle.
In Python, I try the following:
import re
# A triple-quoted raw string is needed for a multi-line literal containing backslashes.
text = r'''
\author{
\small
}
\maketitle
'''
regex = [re.compile(r'[\\]author*|[{]((?:[^{}]*|[{][^{}]*[}])*)[}]', re.M | re.S),
         re.compile(r'[\\]maketitle*|[{]((?:[^{}]*|[{][^{}]*[}])*)[}]', re.M | re.S)]
for p in regex:
    for m in p.finditer(text):
        print(m.group())
Python freezes. I suspect this has something to do with my pattern, and that the SRE engine fails on it.
EDIT: Is there something wrong with my regex? Can it be improved so that it actually works? I still get the same results on my machine.
EDIT 2: Can the pattern be fixed so that it supports an optional "followed by" part, using ?: groups or ?= lookaheads, so that both can be captured?
After reading the section "Parentheses Create Numbered Capturing Groups" on https://www.regular-expressions.info/brackets.html, I found the answer:
Besides grouping part of a regular expression together, parentheses also create a
numbered capturing group. It stores the part of the string matched by the part of
the regular expression inside the parentheses.
The regex Set(Value)? matches Set or SetValue.
In the first case, the first (and only) capturing group remains empty.
In the second case, the first capturing group matches Value.
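A small check of the quoted `Set(Value)?` example in Python, as a sketch: in Python's `re` module, a group that "remains empty" because it did not take part in the match is reported as `None`.

```python
import re

pattern = re.compile(r'Set(Value)?')

for text in ("Set", "SetValue"):
    m = pattern.fullmatch(text)
    # group(0) is the whole match; group(1) is the optional capturing group.
    print(text, "->", m.group(0), m.group(1))
```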

search substring + integer from a string in python using regular expression

I have a string
str="TMOUT=1800; export TMOUT"
I want to extract only TMOUT=1800 from the string above, but 1800 is not constant; it can be any integer value, for example TMOUT=18 or TMOUT=201. I'm very new to regular expressions.
I tried the code below:
re.search("TMOUT=\d",str)
It is not working. Please help.
\d matches a single digit. You want to match one or more digits, so you have to add a + quantifier:
re.search(r"TMOUT=\d+", text)
If you then want to extract the number, you have to create a group using parentheses ():
match = re.search(r"TMOUT=(\d+)", text)
number = int(match.group(1))
Or you may want to use the named group syntax (?P<name>):
match = re.search(r"TMOUT=(?P<num>\d+)", text)
number = int(match.group("num"))
I suggest you use regex101 to test your regexes and get an explanation of what they do. Also read Python's re docs to learn about the methods of the various objects and functions available.
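One thing the snippets above leave out is that re.search returns None when nothing matches, so calling .group() directly can raise AttributeError. A minimal sketch that guards against this (the function name `extract_tmout` is hypothetical):

```python
import re

def extract_tmout(line):
    """Return the TMOUT value as an int, or None if the line has no TMOUT=<digits>."""
    match = re.search(r"TMOUT=(\d+)", line)
    return int(match.group(1)) if match else None

print(extract_tmout("TMOUT=1800; export TMOUT"))
print(extract_tmout("export PATH"))
```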

Regular Expressions Dependant on Previous Matchings

For example, how could we recognize a string of the following format with a single RE:
LenOfStr:Str
An example string in this format is:
5:5:str
The string we're looking for is "5:str".
In python, maybe something like the following (this isn't working):
r'(?P<len>\d+):(?P<str>.{int((?P=len))})'
In general, is there a way to transform previously matched groups before using them, or have I just asked yet another question that isn't meant for regular expressions?
Thanks.
Yep, what you're describing is outside the bounds of regular expressions. Regular expressions only deal with actual character data. This provides some limited ability to make matches dependent on context (e.g., (.)\1 to match the same character twice), but you can't apply arbitrary functions to pieces of an in-progress match and use the results later in the same match.
You could do something like search for text matching the regex (\d+):(.+), and then postprocess the results to check whether the length of the second group is equal to the int value of the first. (The second part cannot be \w+ here, because \w does not match the colon inside "5:str".) But you can't do that as part of the matching process itself.
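A minimal sketch of that match-then-check approach, assuming the whole input should be a single LenOfStr:Str record. The function name `parse_length_prefixed` is made up for illustration.

```python
import re

def parse_length_prefixed(s):
    """Return the payload if s looks like '<len>:<payload>' and the length agrees."""
    m = re.match(r'(\d+):(.*)', s, flags=re.S)
    if m is None:
        return None
    length = int(m.group(1))
    payload = m.group(2)
    # The regex only splits the string; the length check happens in Python.
    if len(payload) != length:
        return None
    return payload

print(parse_length_prefixed("5:5:str"))
print(parse_length_prefixed("3:abcd"))
```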
Well this can be done with a regex (if I understand the question):
>>> import re
>>> s='5:5:str and some more characters...'
>>> m=re.search(r'^(\d+):(.*)$',s)
>>> m.group(2)[0:int(m.group(1))]
'5:str'
It just cannot be done by dynamically changing the previous match group.
You can make it look like a single regex like so:
>>> re.sub(r'^(\d+):(.*)$',lambda m: m.group(2)[0:int(m.group(1))],s)
'5:str'

How can I match any substring except a particular one in python

I want to write a regular expression that will match the following string
a (any substring except 'ABC') ABC
An example for this would be a pqrs h js ABC
The tricky part is to match any substring except 'ABC'. The document I am searching can contain multiple lines with this pattern, and I want to find each of those lines separately, so I can't use the following expression:
a.*ABC
because this would just give me a single match running from the first a up to the last 'ABC' in the document.
There is this answer which says I can use a negative lookahead, but that is not working in Python, or maybe not in my case because there is a substring before it. I have not tested that expression on its own because it would not serve my purpose.
Use the non-greedy quantifier, i.e. put ? after the *:
^a.*?ABC
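A short sketch of the difference, using a made-up sample document. Because the pattern starts with ^, the re.M (MULTILINE) flag is assumed here so that ^ matches at the start of every line rather than only at the start of the string.

```python
import re

doc = "a pqrs h js ABC\nxyz\na foo ABC and a bar ABC\n"

# Greedy: .* runs to the last ABC it can reach on each line.
greedy = re.findall(r'a.*ABC', doc)

# Non-greedy: .*? stops at the first ABC after the leading a.
lazy = re.findall(r'^a.*?ABC', doc, flags=re.M)

print(greedy)
print(lazy)
```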

Difference in regex behavior between Perl and Python?

I have a couple email addresses, 'support#company.com' and '1234567#tickets.company.com'.
In perl, I could take the To: line of a raw email and find either of the above addresses with
/\w+#(tickets\.)?company\.com/i
In Python, I simply wrote the above regex as '\w+#(tickets\.)?company\.com', expecting the same result. However, support#company.com isn't found at all, and a findall on the second returns a list containing only 'tickets.'. So clearly the '(tickets\.)?' is the problem area, but what exactly is the difference in regular expression rules between Perl and Python that I'm missing?
The documentation for re.findall:
findall(pattern, string, flags=0)
Return a list of all non-overlapping matches in the string.
If one or more groups are present in the pattern, return a
list of groups; this will be a list of tuples if the pattern
has more than one group.
Empty matches are included in the result.
Since (tickets\.) is a group, findall returns that instead of the whole match. If you want the whole match, put a group around the whole pattern and/or use a non-capturing group, i.e.
r'(\w+#(tickets\.)?company\.com)'
r'\w+#(?:tickets\.)?company\.com'
Note that you'll have to pick out the first element of each tuple returned by findall in the first case.
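To make the findall behavior concrete, here is a small sketch comparing the capturing and non-capturing versions on a made-up To: line. Using finditer with group(0) is another way to get the whole match without changing the pattern.

```python
import re

text = "To: support#company.com, 1234567#tickets.company.com"

# With one capturing group, findall returns only that group for each match;
# a group that did not participate shows up as an empty string.
capturing = re.findall(r'\w+#(tickets\.)?company\.com', text)

# With a non-capturing group, findall returns the whole match.
non_capturing = re.findall(r'\w+#(?:tickets\.)?company\.com', text)

# finditer yields match objects, so group(0) is always the whole match.
whole = [m.group(0) for m in re.finditer(r'\w+#(tickets\.)?company\.com', text)]

print(capturing)
print(non_capturing)
print(whole)
```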
I think the problem is in your expectations of extracted values. Try using this in your current Python code:
'(\w+#(?:tickets\.)?company\.com)'
Two problems jump out at me:
You need to use a raw string to avoid having to escape "\"
You need to escape "."
So try:
r'\w+#(tickets\.)?company\.com'
EDIT
Sample output:
>>> import re
>>> exp = re.compile(r'\w+#(tickets\.)?company\.com')
>>> bool(exp.match("s#company.com"))
True
>>> bool(exp.match("1234567#tickets.company.com"))
True
There isn't a difference in the regexes, but there is a difference in what you are looking for. In both Perl and Python, your regex captures only "tickets." (when it is present). You probably want something like this:
#!/usr/bin/python
import re
# The outer group captures the whole address; (?:...) keeps "tickets." out of the results.
regex = re.compile(r"(\w+#(?:tickets\.)?company\.com)")
a = [
    "foo#company.com",
    "foo#tickets.company.com",
    "foo#ticketsacompany.com",
    "foo#compant.org",
]
for string in a:
    print(regex.findall(string))
