python regex: match everything inside brackets including other brackets [duplicate] - python

In python, I can easily search for the first occurrence of a regex within a string like this:
import re
re.search("pattern", "target_text")
Now I need to find the last occurrence of the regex in a string, this doesn't seems to be supported by re module.
I can reverse the string to "search for the first occurrence", but I also need to reverse the regex, which is a much harder problem.
I can also iterate to find all occurrences from left to right, and just keep the last one, but that looks awkward.
Is there a smart way to find the rightmost occurrence?

One approach is to prefix the regex with (?s:.*) and force the engine to try matching at the furthest position and gradually backing off:
re.search("(?s:.*)pattern", "target_text")
Do note that the result of this method may differ from re.findall("pattern", "target_text")[-1], since the findall method searches for non-overlapping matches, and not all substrings which can be matched are included in the result.
For example, executing the regex a.a on abaca, findall would return aba as the only match and select it as the last match, while the code above will return aca as the match.
Yet another alternative is to use regex package, which supports REVERSE matching mode.
The result would be more or less the same as the method with (?s:.*) in re package as described above. However, since I haven't tried the package myself, it's not clear how backreference works in REVERSE mode - the pattern might require modification in such cases.

import re
re.search("pattern(?!.*pattern)", "target_text")
or
import re
re.findall("pattern", "target_text")[-1]
You can use these 2 approaches.
If you want positions use
x="abc abc abc"
print [(i.start(),i.end(),i.group()) for i in re.finditer(r"abc",x)][-1]

One approach is to use split. For example if you wanted to get the last group after ':' in this sample string:
mystr = 'dafdsaf:ewrewre:cvdsfad:ewrerae'
':'.join(mystr.split(':')[-1:])

Related

how to find occurrence of special character using regex

I have an url like this
http://foo.com/bar_by_baz.html
now I want to extract baz from that URL using a regex. But so far I have managed to write this much only
[_]+?\w[^.]+
This is giving me
_by_baz
as output. Now I want to know that how can I select any special character exactly one time or what would be the best approach to solve this using regex ?
I am trying it on python 3.x
Here's your regex: [_]+?([^_.]+) the group match will return baz.. The concept is to isolate underscore and dot from the target match
In another case, this works based on capturing only the alphanumerics [_]+?([A-Za-z0-9]+)
I am going to assume from your profile that you are seeking a javascript-friendly solution (you should update your question & tags).
For javascript, you could use this pattern: /[^_]+(?=\.[a-z]+$)/
Demo Link The pattern matches the substring containing no underscores that is followed by a dot then one or more alphabetical characters until the end of the string.
There will be several ways to accomplish your task. Finding the best/most efficient one can only be achieved if you provide more information about the coding environment/language and a few more sample strings.

Python Regex instantly replace groups

Is there any way to directly replace all groups using regex syntax?
The normal way:
re.match(r"(?:aaa)(_bbb)", string1).group(1)
But I want to achieve something like this:
re.match(r"(\d.*?)\s(\d.*?)", "(CALL_GROUP_1) (CALL_GROUP_2)")
I want to build the new string instantaneously from the groups the Regex just captured.
Have a look at re.sub:
result = re.sub(r"(\d.*?)\s(\d.*?)", r"\1 \2", string1)
This is Python's regex substitution (replace) function. The replacement string can be filled with so-called backreferences (backslash, group number) which are replaced with what was matched by the groups. Groups are counted the same as by the group(...) function, i.e. starting from 1, from left to right, by opening parentheses.
The accepted answer is perfect. I would add that group reference is probably better achieved by using this syntax:
r"\g<1> \g<2>"
for the replacement string. This way, you work around syntax limitations where a group may be followed by a digit. Again, this is all present in the doc, nothing new, just sometimes difficult to spot at first sight.

Regular expression code is not working (Python)

Assume I have a word AB1234XZY or even 1AB1234XYZ.
I want to extract ONLY 'AB1234' or 1AB1234 (ie. everything up until the letters at the end).
I have used the following code to extract that but it's not working:
base= re.match(r"^(\D+)(\d+)", word).group(0)
When I print base, it's not working for the second case. Any ideas why?
Your regex doesn't work for the second case because it starts with a number; the \D at the beginning of your pattern matches anything that ISN'T a number.
You should be able to use something quite simple for this--simpler, in fact, than anything else I see here.
'.*\d'
That's it! This should match everything up to and including the last number in your string, and ignore everything after that.
Here's the pattern working online, so you can see for yourself.
(.+?\d+)\w+ would give you what you want.
Or even something like this
^(.+?)[a-zA-Z]+$
re.match starts at the beginning of the string, and re.search simply looks for it in the string. both return the first match. .group(0) is everything included in the match, if you had capturing groups, then .group(1) is the first group...etc etc... as opposed to normal convention where 0 is the first index, in this case, 0 is a special use case meaning everything.
in your case, depending on what you really need to capture, maybe using re.search is better. and instead of using 2 groups, you can use (\D+\d+) keep in mind, it will capture the first (non-digits,digits) group. it might be sufficient for you, but you might want to be more specific.
after reading your comment "everything before the letters at the end"
this regex is what you need:
regex = re.compile(r'(.+)[A-Za-z]')

re.match() multiple times in the same string with Python

I have a regular expression to find :ABC:`hello` pattern. This is the code.
format =r".*\:(.*)\:\`(.*)\`"
patt = re.compile(format, re.I|re.U)
m = patt.match(l.rstrip())
if m:
...
It works well when the pattern happens once in a line, but with an example ":tagbox:`Verilog` :tagbox:`Multiply` :tagbox:`VHDL`". It finds only the last one.
How can I find all the three patterns?
EDIT
Based on Paul Z's answer, I could get it working with this code
format = r"\:([^:]*)\:\`([^`]*)\`"
patt = re.compile(format, re.I|re.U)
for m in patt.finditer(l.rstrip()):
tag, value = m.groups()
print tag, ":::", value
Result
tagbox ::: Verilog
tagbox ::: Multiply
tagbox ::: VHDL
Yeah, dcrosta suggested looking at the re module docs, which is probably a good idea, but I'm betting you actually wanted the finditer function. Try this:
format = r"\:(.*)\:\`(.*)\`"
patt = re.compile(format, re.I|re.U)
for m in patt.finditer(l.rstrip()):
tag, value = m.groups()
....
Your current solution always finds the last one because the initial .* eats as much as it can while still leaving a valid match (the last one). Incidentally this is also probably making your program incredibly slower than it needs to be, because .* first tries to eat the entire string, then backs up character by character as the remaining expression tells it "that was too much, go back". Using finditer should be much more performant.
A good place to start is there module docs. In addition to re.match (which searches starting explicitly at the beginning of the string), there is re.findall (finds all non-overlapping occurrences of the pattern), and the methods match and search of compiled RegexObjects, both of which accept start and end positions to limit the portion of the string being considered. See also split, which returns a list of substrings, split by the pattern. Depending on how you want your output, one of these may help.
re.findall or even better regex.findall can do that for you in a single line:
import regex as re #or just import re
s = ":tagbox:`Verilog` :tagbox:`Multiply` :tagbox:`VHDL`"
format = r"\:([^:]*)\:\`([^`]*)\`"
re.findall(format,s)
result is:
[('tagbox', 'Verilog'), ('tagbox', 'Multiply'), ('tagbox', 'VHDL')]

Difference in regex behavior between Perl and Python?

I have a couple email addresses, 'support#company.com' and '1234567#tickets.company.com'.
In perl, I could take the To: line of a raw email and find either of the above addresses with
/\w+#(tickets\.)?company\.com/i
In python, I simply wrote the above regex as'\w+#(tickets\.)?company\.com' expecting the same result. However, support#company.com isn't found at all and a findall on the second returns a list containing only 'tickets.'. So clearly the '(tickets\.)?' is the problem area, but what exactly is the difference in regular expression rules between Perl and Python that I'm missing?
The documentation for re.findall:
findall(pattern, string, flags=0)
Return a list of all non-overlapping matches in the string.
If one or more groups are present in the pattern, return a
list of groups; this will be a list of tuples if the pattern
has more than one group.
Empty matches are included in the result.
Since (tickets\.) is a group, findall returns that instead of the whole match. If you want the whole match, put a group around the whole pattern and/or use non-grouping matches, i.e.
r'(\w+#(tickets\.)?company\.com)'
r'\w+#(?:tickets\.)?company\.com'
Note that you'll have to pick out the first element of each tuple returned by findall in the first case.
I think the problem is in your expectations of extracted values. Try using this in your current Python code:
'(\w+#(?:tickets\.)?company\.com)'
Two problems jump out at me:
You need to use a raw string to avoid having to escape "\"
You need to escape "."
So try:
r'\w+#(tickets\.)?company\.com'
EDIT
Sample output:
>>> import re
>>> exp = re.compile(r'\w+#(tickets\.)?company\.com')
>>> bool(exp.match("s#company.com"))
True
>>> bool(exp.match("1234567#tickets.company.com"))
True
There isn't a difference in the regexes, but there is a difference in what you are looking for. Your regex is capturing only "tickets." if it exists in both regexes. You probably want something like this
#!/usr/bin/python
import re
regex = re.compile("(\w+#(?:tickets\.)?company\.com)");
a = [
"foo#company.com",
"foo#tickets.company.com",
"foo#ticketsacompany.com",
"foo#compant.org"
];
for string in a:
print regex.findall(string)

Categories

Resources