Difference in regex behavior between Perl and Python? - python

I have a couple email addresses, 'support#company.com' and '1234567#tickets.company.com'.
In perl, I could take the To: line of a raw email and find either of the above addresses with
/\w+#(tickets\.)?company\.com/i
In python, I simply wrote the above regex as'\w+#(tickets\.)?company\.com' expecting the same result. However, support#company.com isn't found at all and a findall on the second returns a list containing only 'tickets.'. So clearly the '(tickets\.)?' is the problem area, but what exactly is the difference in regular expression rules between Perl and Python that I'm missing?

The documentation for re.findall:
findall(pattern, string, flags=0)
Return a list of all non-overlapping matches in the string.
If one or more groups are present in the pattern, return a
list of groups; this will be a list of tuples if the pattern
has more than one group.
Empty matches are included in the result.
Since (tickets\.) is a group, findall returns that instead of the whole match. If you want the whole match, put a group around the whole pattern and/or use non-grouping matches, i.e.
r'(\w+#(tickets\.)?company\.com)'
r'\w+#(?:tickets\.)?company\.com'
Note that you'll have to pick out the first element of each tuple returned by findall in the first case.

I think the problem is in your expectations of extracted values. Try using this in your current Python code:
'(\w+#(?:tickets\.)?company\.com)'

Two problems jump out at me:
You need to use a raw string to avoid having to escape "\"
You need to escape "."
So try:
r'\w+#(tickets\.)?company\.com'
EDIT
Sample output:
>>> import re
>>> exp = re.compile(r'\w+#(tickets\.)?company\.com')
>>> bool(exp.match("s#company.com"))
True
>>> bool(exp.match("1234567#tickets.company.com"))
True

There isn't a difference in the regexes, but there is a difference in what you are looking for. Your regex is capturing only "tickets." if it exists in both regexes. You probably want something like this
#!/usr/bin/python
import re
regex = re.compile("(\w+#(?:tickets\.)?company\.com)");
a = [
"foo#company.com",
"foo#tickets.company.com",
"foo#ticketsacompany.com",
"foo#compant.org"
];
for string in a:
print regex.findall(string)

Related

python regex: match everything inside brackets including other brackets [duplicate]

In python, I can easily search for the first occurrence of a regex within a string like this:
import re
re.search("pattern", "target_text")
Now I need to find the last occurrence of the regex in a string, this doesn't seems to be supported by re module.
I can reverse the string to "search for the first occurrence", but I also need to reverse the regex, which is a much harder problem.
I can also iterate to find all occurrences from left to right, and just keep the last one, but that looks awkward.
Is there a smart way to find the rightmost occurrence?
One approach is to prefix the regex with (?s:.*) and force the engine to try matching at the furthest position and gradually backing off:
re.search("(?s:.*)pattern", "target_text")
Do note that the result of this method may differ from re.findall("pattern", "target_text")[-1], since the findall method searches for non-overlapping matches, and not all substrings which can be matched are included in the result.
For example, executing the regex a.a on abaca, findall would return aba as the only match and select it as the last match, while the code above will return aca as the match.
Yet another alternative is to use regex package, which supports REVERSE matching mode.
The result would be more or less the same as the method with (?s:.*) in re package as described above. However, since I haven't tried the package myself, it's not clear how backreference works in REVERSE mode - the pattern might require modification in such cases.
import re
re.search("pattern(?!.*pattern)", "target_text")
or
import re
re.findall("pattern", "target_text")[-1]
You can use these 2 approaches.
If you want positions use
x="abc abc abc"
print [(i.start(),i.end(),i.group()) for i in re.finditer(r"abc",x)][-1]
One approach is to use split. For example if you wanted to get the last group after ':' in this sample string:
mystr = 'dafdsaf:ewrewre:cvdsfad:ewrerae'
':'.join(mystr.split(':')[-1:])

Regex only finds results once

I'm trying to find any text between a '>' character and a new line, so I came up with this regex:
result = re.search(">(.*)\n", text).group(1)
It works perfectly with only one result, such as:
>test1
(something else here)
Where the result, as intended, is
test1
But whenever there's more than one result, it only shows the first one, like in:
>test1
(something else here)
>test2
(something else here)
Which should give something like
test1\ntest2
But instead just shows
test1
What am I missing? Thank you very much in advance.
re.search only returns the first match, as documented:
Scan through string looking for the first location where the regular
expression pattern produces a match, and return a corresponding
MatchObject instance.
To find all the matches, use findall.
Return all non-overlapping matches of pattern in string, as a list of
strings. The string is scanned left-to-right, and matches are returned
in the order found.
Here's an example from the shell:
>>> import re
>>> re.findall(">(.*)\n", ">test1\nxxx>test2\nxxx")
['test1', 'test2']
Edit: I just read your question again and realised that you want "test1\ntest2" as output. Well, just join the list with \n:
>>> "\n".join(re.findall(">(.*)\n", ">test1\nxxx>test2\nxxx"))
'test1\ntest2'
You could try:
y = re.findall(r'((?:(?:.+?)(?:(?=[\n\r][^\n\r])\n|))+)', text)
Which returns ['t1\nt2\nt3'] for 't1\nt2\nt3\n'. If you simply want the string, you can get it by:
s = y[0]
Although it seems much larger than your initial code, it will give you your desired string.
Explanation -
((?:(?:.+?)(?:(?=[\n\r][^\n\r])\n|))+) is the regex as well as the match.
(?:(?:.+?)(?:(?=[\n\r][^\n\r])\n|)) is the non-capturing group that matches any text followed by a newline, and is repeatedly found one-or-more times by the + after it.
(?:.+?) matches the actual words which are then followed by a newline.
(?:(?=[\n\r][^\n\r])\n|) is a non-capturing conditional group which tells the regex that if the matched text is followed by a newline, then it should match it, provided that the newline is not followed by another newline or carriage return
(?=[\n\r][^\n\r]) is a positive look-ahead which ascertains that the text found is followed by a newline or carriage return, and then some non-newline characters, which combined with the \n| after it, tells the regex to match a newline.
Granted, after typing this big mess out, the regex is pretty long and complicated, so you would be better off implementing the answers you understand, rather than this answer, which you may not. However, this seems to be the only one-line answer to get the exact output you desire.

Regex giving tuple and not full match

I'm trying to use regex to find proxy address on a website. Currently I'm using this piece of regex (\d{1,3}\.){3}\d{1,3}:(\d+). It works on regexr.com and in sublime text, but when I try to use it in Python it doesn't work as expected.
This is the piece of code I'm using:
p = re.compile("(\d{1,3}\.){3}\d{1,3}:(\d+)")
ipCandidates = p.findall(soupString)
It should return proxies like this 120.206.182.172:8123 but it returns tuples like this ('44.', '3128'). What can I do to fix this?
Thank you.
re.findall() only returns the contents of capturing groups instead of the whole match (if you have such groups in your regex).
Then, you're repeating a capturing group three times, which means that only the third repetition is preserved (the other two are overwritten).
Change your regex to
p = re.compile(r"(?:\d{1,3}\.){3}\d{1,3}:\d+")
and you'll get whole matches.
If you do want tuples of the separate submatches (without the dots and colon), you can do that, too, but you can't use repetition then:
p = re.compile(r"(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3}):(\d+)")
Also, always use raw strings for regexes, so regex escape sequences and string escape sequences can't be confused.

Is there a way to use regular expressions in the replacement string in re.sub() in Python?

In Python in the re module there is the following function:
re.sub(pattern, repl, string, count=0, flags=0) – Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn’t found, string is returned unchanged.
I've found it can work like this:
print re.sub('[a-z]*\d+','lion','zebra432') # prints 'lion'
I was wondering, is there an easy way to use regular expressions in the replacement string, so that the replacement string contains part of the original regular expression/original string? Specifically, can I do something like this (which doesn't work)?
print re.sub('[a-z]*\d+', 'lion\d+', 'zebra432')
I want that to print 'lion432'. Obviously, it does not. Rather, it prints 'lion\d+'. Is there an easy way to use parts of the matching regular expression in the replacement string?
By the way, this is NOT a special case. Please do NOT assume that the number will always come at the end, the words will always come in the beginning, etc. I want to know a solution to all regexes in general.
Thanks
Place \d+ in a capture group (...) and then use \1 to refer to it:
>>> import re
>>> re.sub('[a-z]*(\d+)', r'lion\1', 'zebra432')
'lion432'
>>>
>>> # You can also refer to more than one capture group
>>> re.sub('([a-z]*)(\d+)', r'\1lion\2', 'zebra432')
'zebralion432'
>>>
From the docs:
Backreferences, such as \6, are replaced with the substring matched
by group 6 in the pattern.
Note that you will also need to use a raw-string so that \1 is not treated as an escape sequence.

python regular expression replacing part of a matched string

i got an string that might look like this
"myFunc('element','node','elementVersion','ext',12,0,0)"
i'm currently checking for validity using, which works fine
myFunc\((.+?)\,(.+?)\,(.+?)\,(.+?)\,(.+?)\,(.+?)\,(.+?)\)
now i'd like to replace whatever string is at the 3rd parameter.
unfortunately i cant just use a stringreplace on whatever sub-string on the 3rd position since the same 'sub-string' could be anywhere else in that string.
with this and a re.findall,
myFunc\(.+?\,.+?\,(.+?)\,.+?\,.+?\,.+?\,.+?\)
i was able to get the contents of the substring on the 3rd position, but re.sub does not replace the string it just returns me the string i want to replace with :/
here's my code
myRe = re.compile(r"myFunc\(.+?\,.+?\,(.+?)\,.+?\,.+?\,.+?\,.+?\)")
val = "myFunc('element','node','elementVersion','ext',12,0,0)"
print myRe.findall(val)
print myRe.sub("noVersion",val)
any idea what i've missed ?
thanks!
Seb
In re.sub, you need to specify a substitution for the whole matching string. That means that you need to repeat the parts that you don't want to replace. This works:
myRe = re.compile(r"(myFunc\(.+?\,.+?\,)(.+?)(\,.+?\,.+?\,.+?\,.+?\))")
print myRe.sub(r'\1"noversion"\3', val)
If your only tool is a hammer, all problems look like nails. A regular expression is a powerfull hammer but is not the best tool for every task.
Some tasks are better handled by a parser. In this case the argument list in the string is just like a Python tuple, sou you can cheat: use the Python builtin parser:
>>> strdata = "myFunc('element','node','elementVersion','ext',12,0,0)"
>>> args = re.search(r'\(([^\)]+)\)', strdata).group(1)
>>> eval(args)
('element', 'node', 'elementVersion', 'ext', 12, 0, 0)
If you can't trust the input ast.literal_eval is safer than eval for this. Once you have the argument list in the string decontructed I think you can figure out how to manipulate and reassemble it again, if needed.
Read the documentation: re.sub returns a copy of the string where every occurrence of the entire pattern is replaced with the replacement. It cannot in any case modify the original string, because Python strings are immutable.
Try using look-ahead and look-behind assertions to construct a regex that only matches the element itself:
myRe = re.compile(r"(?<=myFunc\(.+?\,.+?\,)(.+?)(?=\,.+?\,.+?\,.+?\,.+?\))")
Have you tried using named groups? http://docs.python.org/howto/regex.html#search-and-replace
Hopefully that will let you just target the 3rd match.
If you want to do this without using regex:
>>> s = "myFunc('element','node','elementVersion','ext',12,0,0)"
>>> l = s.split(",")
>>> l[2]="'noVersion'"
>>> s = ",".join(l)
>>> s
"myFunc('element','node','noVersion','ext',12,0,0)"

Categories

Resources