I am trying to match a rather trivial pattern but still cannot figure it out.
Imagine, we have a string with "words" split by underscores. Sometimes the sequence can end up with a specific keyword with a follow-up number ("kw#"). Example:
"foo_bar_omega"
"bar_omega_kw1"
"alpha_betta_foo_bar_kw15"
I need the regex that will place the sequence of the words in one group and the keyword with the number into another group.
I've tried:
regex = '(?P<grp1>[a-zA-Z]+(_[a-zA-Z]+)*)(_(?P<grp2>kw\d+))?'
but it places the bar_omega_pw in the grp1 and nothing into grp2.
I can also try to match only the keyword with the regex='(_(?P<grp2>kw\d+$))?' and after simply split the initial string to identify the grp1, but it looks like over-complication.
Is it possible to do the job in one regex go?
What if this pattern is surrounded by other characters, i.e:123_bar_omega_kw1_&&?
Edit
The question was closed with the "repeated question" flag and referencing this question. While one can get some idea/direction from the suggested old question, here, I believe, we have a little different situation: the pattern of the 1st group "catches" part of symbols that are supposed to be in the second group.
I think that the good answer here would contain an explanation of the importance of using anchors (^/$) for such types of problems.
Related
What I am trying to do is match values from one file to another, but I only need to match the first portion of the string and the last portion.
I am reading each file into a list, and manipulating these based on different Regex patterns I have created. Everything works, except when it comes to these type of values:
V-1\ZDS\R\EMBO-20-1:24
V-1\ZDS\R\EMBO-20-6:24
In this example, I only want to match 'V-1\ZDS\R\EMBO-20' and then compare the '24' value at the end of the string. The number x in '20-x:', can vary and doesn't matter in terms of comparisons, as long as the first and last parts of this string match.
This is the Regex I am using:
re.compile(r"(?:.*V-1\\ZDS\\R\\EMBO-20-\d.*)(:\d*\w.*)")
Once I filter down the list, I use the following function to return the difference between the two sets:
funcDiff = lambda x, y: list((set(x)- set(y))) + list((set(y)- set(x)))
Is there a way to take the list of differences and filter out the ones that have matching values after the
:
as mentioned above?
I apologize is this is an obvious answer, I'm new to Python and Regex!
The output I get is the differences between the entire strings, so even if the first and last part of the string match, if the number following the 'EMBO-20-x' doesn't also match, it returns it as being different.
Before discussing your question, regex101 is an incredibly useful tool for this type of thing.
Your issue stems from two issues:
1.) The way you used .*
2.) Greedy vs. Nongreedy matches
.* kinda sucks
.* is a regex expression that is very rarely what you actually want.
As a quick aside, a useful regex expression is [^c]* or [^c]+. These expressions match any character except the letter c, with the first expression matching 0 or more, and the second matched 1 or more.
.* will match all characters as many times as it can. Instead, try to start your regex patterns with more concrete starting points. Two good ways to do this are lookbehind expressions and anchors.
Another quick aside, it's likely that you are misusing regex.match and regex.find. match will only return a match that begins at the start of the string, while find will return matches anywhere in the input string. This could be the reason you included the .* in the first place, to allow a .match call to return a match deeper in the string.
Lookbehind Expressions
There are more complete explanations online, but in short, regex patterns like:
(?<=test)foo
will match the text foo, but only if test is right in front of it. To be more clear, the following strings will not match that regex:
foo
test-foo
test foo
but the following string will match:
testfoo
This will only match the text foo, though.
Anchors
Another option is anchors. ^ and $ are special characters, matching the start and end of a line of text. If you know your regex pattern will match exactly one line of text, start it with ^ and end it with $.
Leading patterns with .* and ending with .* are likely the source of your issue. Although you did not include full examples of your input or your code, you likely used match as opposed to find.
In regex, . matches any character, and * means 0 or more times. This means that for any input, your pattern will match the entire string.
Greedy vs. Non-Greedy qualifiers
The second issue is related to greediness. When your regex patterns have a * in them, they can match 0 or more characters. This can hide problems, as entire * expressions can be skipped. Your regex is likely matched several lines of text as one match, and hiding multiple records in a single .*.
The Actual Answer
Taking all of this in to consideration, let's assume that your input data looks like this:
V-1\ZDS\R\EMBO-20-1:24
V-1\ZDS\R\EMBO-20-6:24
V-1\ZDS\R\EMBO-20-3:93
V-1\ZDS\R\EMBO-20-6:22309
V-1\ZDS\R\EMBO-20-8:2238
V-1\ZDS\R\EMBO-20-3:28
A better regular expression would be:
^V-1\\ZDS\\R\\EMBO-20-\d:(\d+)$
To visualize this regex in action, follow this link.
There are several differences I would like to highlight:
Starting the expression with ^ and ending with $. This forces the regex to match exactly one line. Even though the pattern works without these characters, it's good practice when working with regex to be as explicit as possible.
No useless non-capturing group. Your example had a (?:) group at the start. This denotes a group that does not capture it's match. It's useful if you want to match a subpattern multiple times ((?:ab){5} matches ababababab without capturing anything). However, in your example, it did nothing :)
Only capturing the number. This makes it easier to extract the value of the capture groups.
No use of *, one use of +. + works like *, but it matches 1 or more. This is often more correct, as it prevents 'skipping' entire characters.
I have an input that is valid if it has this parts:
starts with letters(upper and lower), numbers and some of the following characters (!,#,#,$,?)
begins with = and contains only of numbers
begins with "<<" and may contain anything
example: !!Hel##lo!#=7<<vbnfhfg
what is the right regex expression in python to identify if the input is valid?
I am trying with
pattern= r"([a-zA-Z0-9|!|#|#|$|?]{2,})([=]{1})([0-9]{1})([<]{2})([a-zA-Z0-9]{1,})/+"
but apparently am wrong.
For testing regex I can really recommend regex101. Makes it much easier to understand what your regex is doing and what strings it matches.
Now, for your regex pattern and the example you provided you need to remove the /+ in the end. Then it matches your example string. However, it splits it into four capture groups and not into three as I understand you want to have from your list. To split it into four caputre groups you could use this:
"([a-zA-Z0-9!##$?]{2,})([=]{1}[0-9]+)(<<.*)"
This returns the capture groups:
!!Hel##lo!#
=7
<<vbnfhfg
Notice I simplified your last group a little bit, using a dot instead of the list of characters. A dot matches anything, so change that back to your approach in case you don't want to match special characters.
Here is a link to your regex in regex101: link.
I have an url like this
http://foo.com/bar_by_baz.html
now I want to extract baz from that URL using a regex. But so far I have managed to write this much only
[_]+?\w[^.]+
This is giving me
_by_baz
as output. Now I want to know that how can I select any special character exactly one time or what would be the best approach to solve this using regex ?
I am trying it on python 3.x
Here's your regex: [_]+?([^_.]+) the group match will return baz.. The concept is to isolate underscore and dot from the target match
In another case, this works based on capturing only the alphanumerics [_]+?([A-Za-z0-9]+)
I am going to assume from your profile that you are seeking a javascript-friendly solution (you should update your question & tags).
For javascript, you could use this pattern: /[^_]+(?=\.[a-z]+$)/
Demo Link The pattern matches the substring containing no underscores that is followed by a dot then one or more alphabetical characters until the end of the string.
There will be several ways to accomplish your task. Finding the best/most efficient one can only be achieved if you provide more information about the coding environment/language and a few more sample strings.
Question Formation
background
As I am reading through the tutorial at python2.7 redoc, it introduces the behavior of the groups:
The groups() method returns a tuple containing the strings for all the subgroups, from 1 up to however many there are.
question
I clearly understands how this works singly. but I can understand the following example:
>>> m = re.match("([abc])+","abc")
>>> m.groups()
('c',)
I mean, isn't + simply means one or more. If so, shouldn't the regex ([abc])+ = ([abc])([abc])+ (not formal BNF). Thus, the result should be:
('a','b','c')
Please shed some light about the mechanism behind, thanks.
P.S
I want to learn the regex language interpreter, how should I start with? books or regex version, thanks!
Well, I guess a picture is worth a 1000 words:
link to the demo
what's happening is that, as you can see on the visual representation of the automaton, your regexp is grouping over a one character one or more times until it reaches the end of the match. Then that last character gets into the group.
If you want to get the output you say, you need to do something like the following:
([abc])([abc])([abc])
which will match and group one character at each position.
About documentation, I advice you to read first theory of NFA, and regexps. The MIT documentation on the topic is pretty nice:
http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-045j-automata-computability-and-complexity-spring-2011/lecture-notes/
Basically, the groups that are referred to in regex terminology are the capture groups as defined in your regex.
So for example, in '([abc])+', there's only a single capture group, namely, ([abc]), whereas in something like '([abc])([xyz])+' there are 2 groups.
So in your example, calling .groups() will always return a tuple of length 1 because that is how many groups exist in your regex.
The reason why it isn't returning the results you'd expect is because you're using the repeat operator + outside of the group. This ends up causing the group to equal only the last match, and thus only the last match (c) is retained. If, on the other hand, you had used '([abc]+)' (notice the + is inside the capture group), the results would have been:
('abc',)
One pair of grouping parentheses forms one group, even if it's inside a quantifier. If a group matches multiple times due to a quantifier, only the last match for that group is saved. The group doesn't become as many groups as it had matches.
I'm struggling to get a regex to work where it matches a certain pattern, so long as isn't proceeded by another. For example,
Accessory for MyProduct01 <<< Should be classified as an accessory
MyProduct01 with accessory << Should be classified as a product
So I need to add something to my 'accessory' regex, something like 'match "accessory" so long as the word before isn't "with"'.
I have seen some examples where people are using negative lookaheads to find if a word is anywhere in the string, but I want to be a bit more specific regarding the position of the word to negate. Something like:
(?!with\s)accessory
Just use a negative look-behind in your regex:
(?<!with\s)accessory
Since Python doesn't support unbounded lookbehinds, I think you are going to have to use a lookahead similar to what you are currently using, but change the original pattern a bit.
^(?!\bwith\b.*\baccessory\b)(?=.*\b(accessory)\b)
Here, the negative lookahead is used to ensure that "accessory" doesn't come after the word "with". Then, the positive lookahead is used to ensure that the word "accessory" occurs within the string, captured with a group if you need to capture it for some reason.
Based on the way that I wrote the above, you'd want to use the search method and not the match method. In order to use match, which requires that the entire search string match the pattern, you'd need to add a bit more to the pattern:
^(?!\bwith\b.*\baccessory\b)(?=.*\b(accessory)\b).*$