EDIT: I've gotten it to work. I had forgotten to put in a space as a separator for multiple edges.
I've got this Python regex, which handles most of the strings I have to parse.
edge_value_pattern = re.compile(r'(?P<edge>e[0-9]+) +(?P<label1>[^ ]*)[^"]+"(?P<word>[^"]+)"[^:]+:: (?P<label2>[^\n]+)')
Here is an example string that my regex is meant to parse:
'e0 BIKE-EVENT 1 "biking" 2'
It correctly stores e0 into the edge group, BIKE-EVENT into the label1 group, and "biking" into the word group. The last group, label2, is for a slightly different variation of the string, as shown below. Note that the label2 regex group behaves as expected when given a string like the one below.
'e29 e30 "of" :: of, OF'
However, the regex pattern fills in label1 with the value e30. In fact, this string does not have any label1 value; it should be None or at least the empty string. An ad-hoc solution would be to parse label1 with a regex to determine whether it's an actual label or just another edge. I want to know if there is a way to modify my original regex so that the edge group takes in all edges. E.g., the output for the above string would be:
edge = "e29 e30"
label1 = None
word = of
label2 = of, OF
I tried this solution below, which I thought would translate to simply looping over the first group, edge (this would be trivial if I had an actual FSA), but it doesn't change the behavior of the regex.
edge_value_pattern = re.compile(r'(?P<edge>(e[0-9]+)+) +(?P<label1>[^ ]*)[^"]+"(?P<word>[^"]+)"[^:]+:: (?P<label2>[^\n]+)')
If you want edge to match "e29 e30", you have to put the repetition inside the group, not outside.
You did that by putting a new group with a + repetition inside the edge group, which is fine, but you forgot to include the space inside the repeating group.
(You also left the external repeat in place, and used a capturing group where you probably wanted a non-capturing one, but those are less serious.)
Look at just that fragment:
(?P<edge>(e[0-9]+)+)
Here, the expression catches e29 as one match, then e30 as a subsequent match. So, if you add anything else to the expression, it's either going to miss e29, or just fail. But add the space:
(?P<edge>(e[0-9]+ )+)
And now it's matching e29 e30 plus the trailing space as a single match, which means you can tack on any additional stuff and it will work (as long as you get that additional stuff right—you still need to remove the extra +, and I think you may need to make a couple of other repetitions non-greedy…).
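As a minimal sketch of the difference, using the second sample string from the question (the variable names are only illustrative, and a non-capturing group is swapped in for the inner group):

```python
import re

line = 'e29 e30 "of" :: of, OF'

# Repetition without the separator: the group stops after the first edge.
without_space = re.compile(r'(?P<edge>(?:e[0-9]+)+)')
# Repetition with the space inside the repeated group: all edges are consumed.
with_space = re.compile(r'(?P<edge>(?:e[0-9]+ )+)')

first_try = without_space.match(line).group('edge')   # 'e29'
second_try = with_space.match(line).group('edge')     # 'e29 e30 '

# The trailing space can be dropped by splitting into individual edges.
edges = second_try.split()                            # ['e29', 'e30']
print(first_try, repr(second_try), edges)
```

The rest of the pattern (label1, word, label2) still has to be adjusted around this fragment, as noted above.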
What I am trying to do is match values from one file to another, but I only need to match the first portion of the string and the last portion.
I am reading each file into a list, and manipulating these based on different Regex patterns I have created. Everything works, except when it comes to these type of values:
V-1\ZDS\R\EMBO-20-1:24
V-1\ZDS\R\EMBO-20-6:24
In this example, I only want to match 'V-1\ZDS\R\EMBO-20' and then compare the '24' value at the end of the string. The number x in '20-x:', can vary and doesn't matter in terms of comparisons, as long as the first and last parts of this string match.
This is the Regex I am using:
re.compile(r"(?:.*V-1\\ZDS\\R\\EMBO-20-\d.*)(:\d*\w.*)")
Once I filter down the list, I use the following function to return the difference between the two sets:
funcDiff = lambda x, y: list((set(x)- set(y))) + list((set(y)- set(x)))
Is there a way to take the list of differences and filter out the ones that have matching values after the ':' as mentioned above?
I apologize if this is an obvious answer, I'm new to Python and Regex!
The output I get is the differences between the entire strings, so even if the first and last part of the string match, if the number following the 'EMBO-20-x' doesn't also match, it returns it as being different.
Before discussing your question, regex101 is an incredibly useful tool for this type of thing.
Your problem stems from two issues:
1.) The way you used .*
2.) Greedy vs. Nongreedy matches
.* kinda sucks
.* is a regex pattern that is very rarely what you actually want.
As a quick aside, a useful pattern is [^c]* or [^c]+. These expressions match any character except the letter c, with the first matching 0 or more characters and the second matching 1 or more.
.* will match as many characters as it can. Instead, try to start your regex patterns with more concrete starting points. Two good ways to do this are lookbehind expressions and anchors.
Another quick aside: it's likely that you are mixing up re.match and re.search. match will only return a match that begins at the start of the string, while search will return a match anywhere in the input string. This could be the reason you included the .* in the first place, to allow a .match call to return a match deeper in the string.
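A short sketch of the difference between matching at the start of the string and searching anywhere in it. The input text and the shortened pattern here are made up for illustration:

```python
import re

# Hypothetical input: the interesting part is not at the start of the string.
text = r"record: V-1\ZDS\R\EMBO-20-1:24"
pattern = re.compile(r"EMBO-20-\d:(\d+)")

at_start = pattern.match(text)    # None: the string does not begin with "EMBO"
anywhere = pattern.search(text)   # finds the pattern deeper in the string

value = anywhere.group(1)         # '24'
print(at_start, value)
```

With search, no leading .* is needed to reach text in the middle of the string.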
Lookbehind Expressions
There are more complete explanations online, but in short, regex patterns like:
(?<=test)foo
will match the text foo, but only if test is right in front of it. To be more clear, the following strings will not match that regex:
foo
test-foo
test foo
but the following string will match:
testfoo
This will only match the text foo, though.
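The four example strings above can be checked directly. This is just a sketch, and the names are illustrative:

```python
import re

pattern = re.compile(r"(?<=test)foo")

candidates = ["foo", "test-foo", "test foo", "testfoo"]
# True only where "foo" is immediately preceded by "test".
results = {s: bool(pattern.search(s)) for s in candidates}

# The lookbehind text is checked but not included in the match itself.
matched_text = pattern.search("testfoo").group(0)   # 'foo'
print(results, matched_text)
```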
Anchors
Another option is anchors. ^ and $ are special characters, matching the start and end of a line of text. If you know your regex pattern will match exactly one line of text, start it with ^ and end it with $.
Starting and ending patterns with .* is likely the source of your issue. Although you did not include full examples of your input or your code, you likely used match as opposed to search.
In regex, . matches any character, and * means 0 or more times. This means that for any input, your pattern will match the entire string.
Greedy vs. Non-Greedy qualifiers
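The second issue is related to greediness. By default, * and + are greedy: they consume as much text as they can and give characters back only when the rest of the pattern fails to match. Adding a ? (as in .*?) makes them non-greedy. When your regex patterns have a * in them, they can also match 0 characters, which can hide problems, as entire * expressions can be skipped. Your regex is likely matching several lines of text as one match, hiding multiple records in a single .*.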
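A small sketch of greedy versus non-greedy matching, on a made-up input string:

```python
import re

text = "<a><b>"

# Greedy: .* runs to the end, then backs off only as far as the last ">".
greedy = re.search(r"<.*>", text).group(0)    # '<a><b>'
# Non-greedy: .*? stops at the first ">" that lets the pattern succeed.
lazy = re.search(r"<.*?>", text).group(0)     # '<a>'
print(greedy, lazy)
```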
The Actual Answer
Taking all of this in to consideration, let's assume that your input data looks like this:
V-1\ZDS\R\EMBO-20-1:24
V-1\ZDS\R\EMBO-20-6:24
V-1\ZDS\R\EMBO-20-3:93
V-1\ZDS\R\EMBO-20-6:22309
V-1\ZDS\R\EMBO-20-8:2238
V-1\ZDS\R\EMBO-20-3:28
A better regular expression would be:
^V-1\\ZDS\\R\\EMBO-20-\d:(\d+)$
There are several differences I would like to highlight:
Starting the expression with ^ and ending with $. This forces the regex to match exactly one line. Even though the pattern works without these characters, it's good practice when working with regex to be as explicit as possible.
No useless non-capturing group. Your example had a (?:) group at the start. This denotes a group that does not capture its match. It's useful if you want to match a subpattern multiple times ((?:ab){5} matches ababababab without capturing anything). However, in your example, it did nothing :)
Only capturing the number. This makes it easier to extract the value of the capture groups.
No use of *, one use of +. + works like *, but it matches 1 or more. This is often more correct, as it prevents 'skipping' entire characters.
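One way this could be wired into the comparison from the question is sketched below. It is a variation on the pattern above, not the exact pattern: an extra capture group is added for the fixed prefix, and the helper names (comparison_key, diff_by_key) are hypothetical. Each line is reduced to a (prefix, trailing number) key before taking the symmetric difference, so the digit after EMBO-20- no longer affects the result.

```python
import re

# Assumption: the prefix is captured as well, so each record reduces to
# (prefix, trailing number). The digit after "EMBO-20-" is matched but
# deliberately not captured.
RECORD = re.compile(r"^(V-1\\ZDS\\R\\EMBO-20)-\d:(\d+)$")


def comparison_key(line):
    """Return (prefix, trailing number) for matching lines, else (line,)."""
    m = RECORD.match(line.strip())
    return (m.group(1), m.group(2)) if m else (line,)


def diff_by_key(xs, ys):
    """Symmetric difference of the two lists, compared by key."""
    return {comparison_key(x) for x in xs} ^ {comparison_key(y) for y in ys}


first = [r"V-1\ZDS\R\EMBO-20-1:24", r"V-1\ZDS\R\EMBO-20-3:93"]
second = [r"V-1\ZDS\R\EMBO-20-6:24", r"V-1\ZDS\R\EMBO-20-3:28"]

# The two ":24" records compare equal even though 20-1 and 20-6 differ.
differences = diff_by_key(first, second)
print(differences)
```

Lines that do not match the pattern fall back to being compared as whole strings, which mirrors the behavior of the original set difference.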
For sure there are other ways to solve this, but I'm interested if this can be solved exclusively via regex. I have lines of text like this:
9,A
11,B
22,>
33,B
72,A
91,<
112,A
162,B
When I try to apply this replacement to basically "join" or erase the part between arrows and replace them with "+++":
re.sub(r'\>(\n\d.+)+<','+++',string_above)
I get this, which is fine:
9,A
11,B
22,+++
112,A
162,B
But what if I want to keep that last number before the "<" sign and replace the "<" with, say, "X", to get something like this:
9,A
11,B
22,+++
91,X
112,A
162,B
How can I do that?
In this concrete case, you may replace with
r'+++\1X'
If X is a digit, replace with
r'+++\g<1>X'
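The \1 and \g<1> are called replacement backreferences; they refer to the value of capturing group #1. The \g<1> form is needed when a digit follows, because \1 followed by a digit would be read as a reference to a different group number.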
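A sketch applying both replacement strings to the sample text from the question (the digit 7 is just an arbitrary stand-in for a numeric X):

```python
import re

text = "9,A\n11,B\n22,>\n33,B\n72,A\n91,<\n112,A\n162,B"
pattern = r'>(\n\d.+)+<'

# Group 1 keeps only its last repetition, which is "\n91," here,
# so \1 re-inserts that line prefix before the literal X.
with_letter = re.sub(pattern, r'+++\1X', text)

# With a digit as the replacement character, \g<1> keeps the group
# reference unambiguous.
with_digit = re.sub(pattern, r'+++\g<1>7', text)

print(with_letter)
print(with_digit)
```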
I want to use a regex to find merge conflicts in a file.
I've found previous posts that show how to find a pattern that matches this structure
FIRST SUBSTRING
/* several
new
lines
*/
SECOND SUBSTRING
which works with the following regex: (^FIRST SUBSTRING)(.+)((?:\n.+)+)(SECOND SUBSTRING)
However, I need to match this pattern:
FIRST SUBSTRING
/* several
new
lines
*/
SECOND SUBSTRING
/* several
new
lines
*/
THIRD SUBSTRING
Where first, second and third substrings are <<<<<<<, =======, >>>>>>> respectively.
I gave (^<<<<<<<)(.+)((?:\n.+)+)(=======)(.+)((?:\n.+)+)(>>>>>>) a shot but it does not work, which you can see on this demo ((^<<<<<<<)(.+)((?:\n.+)+)(=======) does work but it is not exactly what I am looking for)
Your expression does work with a couple of slight changes. The marker lengths do not match exactly (your last group has six > characters instead of seven). And you are asking for at least one character after the SECOND SUBSTRING with (.+), when there are none in the text.
(<<<<<<<)(.+)((?:\n.+)+)(=======)(.*)((?:\n.+)+)(>>>>>>>)
From then onwards it makes groups as you expect (which the answer in the comments does not). You probably want to distinguish between your code and theirs.
Plus, if you have to choose among working expressions, I would choose yours over the proposed options, for readability. Regexes are not friendly things to read, and using repetitions (among other sophistications) makes the code harder to read. This also goes for the ?:, just query specific groups, there is no need to avoid group creation there.
Setting the s flag (single line, so that the dot also matches newlines) is needed to match text with this structure. You can then use .*? to match multi-line text, newlines included, up to the next pattern (the ? enables lazy mode).
With this setting, the regex below matches what you need.
(<{7})(.*)(={7})(.*?)(>{7})(.*?\n)
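In Python the s flag corresponds to re.DOTALL. Below is a sketch on a made-up, single-conflict input. Note that group 2 uses a greedy .*, so on a file with several conflicts it would presumably run to the last ======= marker; making it lazy as well may be preferable there.

```python
import re

# Hypothetical conflict block for illustration.
conflict = (
    "<<<<<<< HEAD\n"
    "ours\n"
    "=======\n"
    "theirs\n"
    ">>>>>>> branch\n"
)

pattern = re.compile(r"(<{7})(.*)(={7})(.*?)(>{7})(.*?\n)", re.DOTALL)
m = pattern.search(conflict)

ours = m.group(2)      # text between <<<<<<< and =======
theirs = m.group(4)    # text between ======= and >>>>>>>
label = m.group(6)     # rest of the >>>>>>> line, including its newline
print(repr(ours), repr(theirs), repr(label))
```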
I have found odd behavior in Python 3.7.0 when capturing groups with an or operator when one branch initially matches but the regex has to eventually backtrack and use a different branch. In this scenario, the capture groups stick with the first branch even though the regex uses the second branch.
Example code:
regexString = "^(a)|(ab)$"
captureString = "ab"
match = re.match(regexString, captureString)
print(match.groups())
Output:
('a', None)
The second group is the group that is used, but the first group is captured and the second group isn't.
Interestingly, I have found a workaround by adding non-capturing parentheses around both groups like so:
regexString = "^(?:(a)|(ab))$"
New Output:
(None, 'ab')
To me this behavior looks like a bug. If it is not, can someone point me to some documentation explaining why this is occurring? Thank you!
This is a common regex mistake. Here is your original pattern:
^(a)|(ab)$
This actually says to match ^a, i.e. a at the start of the input or ab$, i.e. ab at the end of the input. If you instead want to match a or ab as the entire input, then as you figured out you need:
^(?:(a)|(ab))$
To further convince yourself of this behavior, you may verify that the following pattern matches the same things as your original pattern:
(ab)$|^(a)
That is, each term in the alternation is separate, and the position does not even matter, at least with regard to which inputs would match or not match. By the way, you could have just used the following pattern:
^ab?$
This would match a or ab, and also you would not even need a capture group, as the entire match would correspond to what you want.
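A sketch that checks these claims; the extra inputs "axyz" and "xyzab" are made up to show how loosely the original pattern matches:

```python
import re

loose = r"^(a)|(ab)$"         # means: (^a) OR (ab$)
strict = r"^(?:(a)|(ab))$"    # means: the whole input is a or ab

loose_groups = re.match(loose, "ab").groups()     # ('a', None)
strict_groups = re.match(strict, "ab").groups()   # (None, 'ab')

# The loose pattern accepts anything starting with a or ending with ab.
loose_prefix = re.search(loose, "axyz").group(0)   # 'a'
loose_suffix = re.search(loose, "xyzab").group(0)  # 'ab'
strict_prefix = re.search(strict, "axyz")          # None

# The simpler pattern needs no capture group at all.
simple = re.match(r"^ab?$", "ab").group(0)         # 'ab'
print(loose_groups, strict_groups, loose_prefix, loose_suffix, strict_prefix, simple)
```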
Question Formation
background
As I am reading through the tutorial in the Python 2.7 re docs, it introduces the behavior of groups():
The groups() method returns a tuple containing the strings for all the subgroups, from 1 up to however many there are.
question
I clearly understand how this works for a single group, but I can't understand the following example:
>>> m = re.match("([abc])+","abc")
>>> m.groups()
('c',)
I mean, doesn't + simply mean one or more? If so, shouldn't the regex ([abc])+ be equivalent to ([abc])([abc])+ (not formal BNF)? Thus, the result should be:
('a','b','c')
Please shed some light about the mechanism behind, thanks.
P.S
I want to learn how the regex language interpreter works. How should I start: with books, or with a particular regex version? Thanks!
Well, I guess a picture is worth a 1000 words:
link to the demo
What's happening is that, as you can see in the visual representation of the automaton, your regexp matches the one-character group one or more times until it reaches the end of the match. Only that last character ends up in the group.
If you want to get the output you say, you need to do something like the following:
([abc])([abc])([abc])
which will match and group one character at each position.
As for documentation, I advise you to first read up on the theory of NFAs and regexps. The MIT material on the topic is pretty nice:
http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-045j-automata-computability-and-complexity-spring-2011/lecture-notes/
Basically, the groups that are referred to in regex terminology are the capture groups as defined in your regex.
So for example, in '([abc])+', there's only a single capture group, namely, ([abc]), whereas in something like '([abc])([xyz])+' there are 2 groups.
So in your example, calling .groups() will always return a tuple of length 1 because that is how many groups exist in your regex.
The reason why it isn't returning the results you'd expect is because you're using the repeat operator + outside of the group. This ends up causing the group to equal only the last match, and thus only the last match (c) is retained. If, on the other hand, you had used '([abc]+)' (notice the + is inside the capture group), the results would have been:
('abc',)
One pair of grouping parentheses forms one group, even if it's inside a quantifier. If a group matches multiple times due to a quantifier, only the last match for that group is saved. The group doesn't become as many groups as it had matches.
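The variants discussed in these answers can be compared side by side. The findall call is an extra option not mentioned above: it returns every match of a pattern, which is one way to collect each character separately.

```python
import re

# One group inside a quantifier: only the last repetition is kept.
last_only = re.match(r"([abc])+", "abc").groups()               # ('c',)

# Quantifier inside the group: the whole run is captured once.
whole_run = re.match(r"([abc]+)", "abc").groups()               # ('abc',)

# Three explicit groups: one capture per position.
three = re.match(r"([abc])([abc])([abc])", "abc").groups()      # ('a', 'b', 'c')

# Alternative: collect every single-character match.
each_char = re.findall(r"[abc]", "abc")                         # ['a', 'b', 'c']
print(last_only, whole_run, three, each_char)
```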