Match specific pattern with regular expression

I have to write a regex that matches exactly this kind of pattern.
Here is an example:
JK+6.00,PP*2,ZZ,GROUPO
with one match for every group, like this:
Match 1
JK
+
6.00
Match 2
PP
*
2
Match 3
ZZ
Match 4
GROUPO
So, comma-separated blocks of
(2 to 12 capital letters) [optionally (+ or *) followed by a positive number of the form 0[.0[0]]]
This pattern successfully parses a single block:
(?P<block>(?P<subject>[A-Z]{2,12})(?:(?P<operation>\*|\+)(?P<value>\d+(?:.?\d{1,2})?))?)
Here we have the subject group:
(?P<subject>[A-Z]{2,12})
the value:
(?P<value>\d+(?:.?\d{1,2})?)
and the whole optional operation section (with the value inside):
(?:(?P<operation>\*|\+)(?P<value>\d+(?:.?\d{1,2})?))?
But the regex must fail if the string doesn't match the pattern EXACTLY,
and that's the problem.
I tried this, but it doesn't work:
^(?P<block>(?P<subject>[A-Z]{2,12})(?:(?P<operation>\*|\+)(?P<value>\d+(?:.?\d{1,2})?))?)(?:,(?P=block))*$
Any suggestions?
P.S. I use Python's re module.

I'd personally go for a two-step solution: first check that the whole string fits your pattern, then extract the groups you want.
For the overall check you might want to use ^(?:[A-Z]{2,12}(?:[*+]\d+(?:\.\d{1,2})?)?(?:,|$))*$ as a pattern. It is basically your pattern, plus (?:,|$) to match the delimiters, plus anchors.
I have also adjusted your pattern a bit, to (?P<block>(?P<subject>[A-Z]{2,12})(?:(?P<operation>[*+])(?P<value>\d+(?:\.\d{1,2})?))?). I have replaced \*|\+ with [*+] in your operation pattern and .? with \. in your value pattern, so that the decimal separator must be a literal dot.
A (very basic) Python implementation could look like this:
import re

text = 'JK+6.00,PP*2,ZZ,GROUPO'  # renamed from str to avoid shadowing the built-in
full_pattern = r'^(?:[A-Z]{2,12}(?:[*+]\d+(?:\.\d{1,2})?)?(?:,|$))*$'
extract_pattern = r'(?P<block>(?P<subject>[A-Z]{2,12})(?:(?P<operation>[*+])(?P<value>\d+(?:\.\d{1,2})?))?)'
# Step 1: validate the whole string. Step 2: extract every block.
if re.fullmatch(full_pattern, text):
    for match in re.finditer(extract_pattern, text):
        print(match.groups())
http://ideone.com/kMl9qu
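The overall-check pattern above also accepts an empty string and a trailing comma (for example JK,), because every block may end in either , or the end of the string. Below is a minimal sketch of a stricter variant, assuming that both of those cases should be rejected. The names BLOCK, FULL_RE, EXTRACT_RE and parse_blocks are made up for this example. As a side note, (?P=block) in the original attempt is a backreference: it matches the same text the group captured, not the same sub-pattern again, which is why that attempt only accepts repeated identical blocks.

```python
import re

# One block: 2-12 capitals, optionally followed by + or * and a number
# with at most two decimals.
BLOCK = r'[A-Z]{2,12}(?:[*+]\d+(?:\.\d{1,2})?)?'

# Whole string: one block, then zero or more ",block" repetitions.
FULL_RE = re.compile(BLOCK + r'(?:,' + BLOCK + r')*')

# Same block, with named groups for extraction.
EXTRACT_RE = re.compile(
    r'(?P<subject>[A-Z]{2,12})'
    r'(?:(?P<operation>[*+])(?P<value>\d+(?:\.\d{1,2})?))?'
)

def parse_blocks(text):
    """Return a list of (subject, operation, value) tuples, or None if invalid."""
    if not FULL_RE.fullmatch(text):
        return None
    return [m.group('subject', 'operation', 'value')
            for m in EXTRACT_RE.finditer(text)]

print(parse_blocks('JK+6.00,PP*2,ZZ,GROUPO'))
# [('JK', '+', '6.00'), ('PP', '*', '2'), ('ZZ', None, None), ('GROUPO', None, None)]
print(parse_blocks('JK,'))  # None (trailing comma)
```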

I'm guessing this is the pattern you were looking for:
(2 different letters)+(time stamp),(2 of the same letter)*(1 number),(2 of the same letter),(a string)
If that's the case, this regex would do the trick:
^(\w{2}\+\d{1,2}\.\d{2}),((\w)\3\*\d),((\w)\5),(\w+)$
Demo: https://regex101.com/r/8B3C6e/2

Related

Capturing entire repeated string based on a repeated pattern

The following regex matches both 59-59-59 and 59-59-59-59, but captures only 59-.
The intent is to match exactly four numbers separated by -, with the maximum number being 59. Numbers less than 10 are represented as 00-09.
print(re.match(r'(\b[0-5][0-9]-{1,4}\b)','59-59-59').groups())
--> output ('59-',)
I need a pattern that matches exactly 59-59-59-59
and does not match 59--59-59 or 59-59-59-59-59

Try using the following pattern, if using re.match:
[0-5][0-9](?:-[0-5][0-9]){3}$
This is phrased to match an initial number starting with 0 through 5, followed by any second digit. This is then followed by a dash and a number with the same rules, exactly three times. Note that re.match anchors at the beginning by default, so we only need an ending anchor $.
Code:
print(re.match(r'([0-5][0-9](?:-[0-5][0-9]){3})$', '59-59-59-59').groups())
('59-59-59-59',)
If you intend to actually match the same number four times in a row, then see the answer by @Thefourthbird.
If you want to find such a string in a larger text, then consider using re.search. In that case, use this pattern:
(?:^|(?<=\s))[0-5][0-9](?:-[0-5][0-9]){3}(?=\s|$)
Note that instead of using word boundaries \b I used lookarounds to enforce the end of the "word" here. This means that the above pattern will not match something like 59-59-59-59-59.
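As a rough illustration of how the lookaround version behaves with re.search, here is a small sketch. The helper name find_quad and the sample strings are assumptions made for this example only.

```python
import re

# Pattern from the answer above: four numbers 00-59 joined by single hyphens,
# delimited by whitespace or by the start/end of the string.
QUAD_RE = re.compile(r'(?:^|(?<=\s))[0-5][0-9](?:-[0-5][0-9]){3}(?=\s|$)')

def find_quad(text):
    """Return the first group of four numbers found in text, or None."""
    m = QUAD_RE.search(text)
    return m.group() if m else None

print(find_quad('start 59-59-59-59 end'))  # 59-59-59-59
print(find_quad('59-59-59-59-59'))         # None (five numbers)
print(find_quad('59--59-59'))              # None (double hyphen)
```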

In your pattern, the part -{1,4} matches a hyphen 1 to 4 times, so 59-- will match.
If all the numbers should be the same (59 every time), you could use a backreference to the first capturing group and repeat it 3 times with a prepended hyphen.
\b([0-5][0-9])(?:-\1){3}\b
Your code might look like:
import re

res = re.match(r'\b([0-5][0-9])(?:-\1){3}\b', '59-59-59-59')
if res:
    print(res.group())
If there should not be partial matches, you could use anchors to assert the start ^ and the end $ of the string:
^([0-5][0-9])(?:-\1){3}$
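To make the difference between the two answers concrete, the following sketch compares a pattern that accepts any four numbers with the backreference pattern that requires the same number four times. The names ANY_FOUR and SAME_FOUR and the sample inputs are chosen for this example; re.fullmatch is used here in place of explicit anchors.

```python
import re

ANY_FOUR = re.compile(r'[0-5][0-9](?:-[0-5][0-9]){3}')   # any four numbers 00-59
SAME_FOUR = re.compile(r'([0-5][0-9])(?:-\1){3}')        # the same number four times

for text in ('59-59-59-59', '12-34-56-07'):
    print(text, bool(ANY_FOUR.fullmatch(text)), bool(SAME_FOUR.fullmatch(text)))
# 59-59-59-59 True True
# 12-34-56-07 True False
```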

How to regex for a numerical suffix?

I have the following regex (example is in Python):
pattern = re.compile(r'^(([a-zA-Z0-9]*[a-zA-Z]+)([\d]+)|([\d]+))$')
This correctly parses any string that has a numerical suffix and an optional prefix that is alphanumerics:
a123
a2a123
123
All will correctly see 123 as a suffix. It will correctly reject bad inputs:
abc
123abc
()123 # Or other non-alphanumerics
The regex itself is fairly unwieldy, though, and several of the capture groups are often empty as a result, meaning I have to go through the additional step of filtering them out. I am curious whether there is a better way to think about this regex than "a number OR a number preceded by an alphanumeric string that ends in a letter".

You may use
^[A-Za-z0-9]*?([0-9]+)$
Details
^ - start of string
[A-Za-z0-9]*? - any letters/digits, zero or more times, as few as possible (because this quantifier is lazy, the next pattern, ([0-9]+), will match all the digits at the end of the string)
([0-9]+) - Group 1: one or more digits
$ - end of string.
In Python:
import re

s = 'a2a123'  # sample input
m = re.search(r'^[A-Za-z0-9]*?([0-9]+)$', s)  # Or, see below
# m = re.match(r'[A-Za-z0-9]*?([0-9]+)$', s)  # re.match only searches at the start of the string
# m = re.fullmatch(r'[A-Za-z0-9]*?([0-9]+)', s)  # Only in Python 3.4+
if m:
    print(m.group(1))
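A small sketch that runs this pattern against the inputs from the question may help. The helper name numeric_suffix is hypothetical, and re.fullmatch is used here in place of the explicit ^ and $ anchors.

```python
import re

SUFFIX_RE = re.compile(r'[A-Za-z0-9]*?([0-9]+)')

def numeric_suffix(text):
    """Return the trailing digits of an alphanumeric string, or None."""
    m = SUFFIX_RE.fullmatch(text)
    return m.group(1) if m else None

for sample in ('a123', 'a2a123', '123', 'abc', '123abc', '()123'):
    print(repr(sample), '->', numeric_suffix(sample))
# The first three samples give '123'; the last three give None.
```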

If you use non-capturing groups and a correct management of repetitions, the problem eases itself.
pattern = re.compile(r'^(?:[a-zA-Z0-9]*[a-zA-Z]+)?([0-9]+)$')
There's only one capturing group (group 1) for the suffix, and the alphanumerics before it are not captured.
Alternatively, using named groups is another option, and it often makes long, structured regexes easier to maintain:
pattern = re.compile(r'^(?P<a>[a-zA-Z0-9]*[a-zA-Z]+)?(?P<suffix>[0-9]+)$')
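With named groups, groupdict() returns the parts by name, and an optional group that did not take part in the match comes back as None. The sketch below uses the same pattern, except that the first group is renamed from a to prefix purely for readability (an assumption of this example).

```python
import re

pattern = re.compile(r'^(?P<prefix>[a-zA-Z0-9]*[a-zA-Z]+)?(?P<suffix>[0-9]+)$')

for sample in ('a2a123', '123', '123abc'):
    m = pattern.match(sample)
    print(sample, m.groupdict() if m else None)
# a2a123 {'prefix': 'a2a', 'suffix': '123'}
# 123 {'prefix': None, 'suffix': '123'}
# 123abc None
```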

The Behavior of Alternative Match "|" with .* in a Regex

I have seldom used | together with .* before, but today, when I used both of them together, I found some of the results really confusing. The expression I used is as follows (in Python):
>>> s = "abcdefg"
>>> re.findall(r"((a.*?c)|(.*g))",s)
[('abc', 'abc', ''), ('defg', '', 'defg')]
The result of the first capture is all right, but the second is not what I expected: I expected the second capture to be "abcdefg" (the whole string).
Then I reverse the two alternatives:
>>> re.findall(r"(.*?g)|(a.*?c)",s)
[('abcdefg', '')]
It seems that the regex engine reads the string only once: when the whole string is consumed by the first alternative, the engine stops and never checks the second alternative. In the first case, however, after dealing with the first alternative, the engine has only read from "a" to "c", and "d" to "g" are still left in the string, which matches ".*g" in the second alternative. Have I got it right? What's more, for an expression with alternatives, the regex engine checks the first alternative first, and if it matches, it never checks the second alternative. Is that correct?
Besides, if I want to get both "abc" and "abcdefg" or "abc" and "bcde" (the two results overlap) like in the first case, what expression should I use?
Thank you so much!

You cannot have two matches starting from the same location in the regex (the only regex flavor that does it is Perl6).
In re.findall(r"((a.*?c)|(.*g))",s), re.findall will grab all non-overlapping matches in the string, and since the first one starts at the beginning, ends with c, the next one can only be found after c, within defg.
The (.*?g)|(a.*?c) regex matches abcdefg because the regex engine parses the string from left to right, and .*? will get any 0+ chars as few as possible but up to the first g. And since g is the last char, it will match and capture the whole string into Group 1.
To get abc and abcdefg, you may use, say
(a.*?c)?.*g
Python demo:
import re

rx = r"(a.*?c)?.*g"
s = "abcdefg"
m = re.search(rx, s)
if m:
    print(m.group(0))  # => abcdefg
    print(m.group(1))  # => abc
It might not be exactly what you want, but it should give you a hint: you match the bigger part, and capture a subpart of the string.

Re-read the docs for the re.findall method.
findall "return[s] all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found."
Specifically, non-overlapping matches, and left-to-right. So if you have a string abcdefg and one pattern will match abc, then any other patterns must (1) not overlap; and (2) be further to the right.
It's perfectly valid to match abc and defg per the description. It would be a bug to match abc and abcdefg or even abc and cdefg because they would overlap.
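The non-overlapping behaviour can be checked directly. The sketch below also shows one possible way to obtain overlapping captures, which is a different technique from the ones in the answers above: each alternative is wrapped in its own lookahead, so both groups are captured from the same starting position without consuming any text.

```python
import re

s = "abcdefg"

# findall consumes "abc" first, so the second alternative can only start at "d".
print(re.findall(r"(a.*?c)|(.*g)", s))  # [('abc', ''), ('', 'defg')]

# Lookaheads do not consume characters, so both captures start at position 0.
m = re.search(r"(?=(a.*?c))(?=(.*g))", s)
print(m.groups())  # ('abc', 'abcdefg')
```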

Regular expression with two non-repeating symbols in any order

I need to create the regex that will match such string:
AA+1.01*2.01,BB*2.01+1.01,CC
The * and + may appear in either order.
I've created the following regex:
^(([A-Z][A-Z](([*+][0-9]+(\.[0-9])?[0-9]?){0,2}),)*[A-Z]{2}([*+][0-9]+(\.[0-9])?[0-9]?){0,2})$
The problem is that with this regex + or * can be used twice, but each of them may appear at most once, so the expected results are:
AA+1*2,CC - true
AA+1+2,CC - false (now is true with my regex)
AA*1+2,CC - true
AA*1*2,CC - false (now is true with my regex)

Capture whichever of [+*] comes first, then use a negative lookahead to make sure the second operator is the other one.
Regex: [A-Z]{2}([+*])(?:\d+(?:\.\d+)?)(?!\1)[+*](?:\d+(?:\.\d+)?),[A-Z]{2}
Explanation:
[A-Z]{2} Matches two upper case letters.
([+*]) captures either of + or *.
(?:\d+(?:\.\d+)?) matches number with optional decimal part.
(?!\1)[+*] asserts that the next symbol is not the one already captured, then matches the other one. So if + was captured previously, only * will be matched here.
(?:\d+(?:\.\d+)?) matches number with optional decimal part.
,[A-Z]{2} matches , followed by two upper case letters.
To match the first case AA+1.01*2.01,BB*2.01+1.01,CC which is just a little advancement over previous pattern, use following regex.
Regex: (?:[A-Z]{2}([+*])(?:\d+(?:\.\d+)?)(?!\1)[+*](?:\d+(?:\.\d+)?),)+[A-Z]{2}
Explanation: the whole pattern except the final ,CC is wrapped in a non-capturing group and repeated with + to match one or more such blocks.
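The patterns above require both operators in every block except the last. Here is a sketch of a variant that follows the question more closely, assuming each block may carry zero, one or two operators and that numbers have at most two decimals. It validates each comma-separated block on its own, so the \1 backreference always refers to the current block. NUMBER, BLOCK_RE and is_valid are hypothetical names.

```python
import re

NUMBER = r'[0-9]+(?:\.[0-9]{1,2})?'

# Two capitals, then optionally: an operator (captured), a number, and
# optionally the *other* operator (enforced by the negative lookahead) and a number.
BLOCK_RE = re.compile(
    r'[A-Z]{2}(?:([+*])' + NUMBER + r'(?:(?!\1)[+*]' + NUMBER + r')?)?'
)

def is_valid(text):
    """Check every comma-separated block against BLOCK_RE."""
    return all(BLOCK_RE.fullmatch(block) for block in text.split(','))

for sample in ('AA+1*2,CC', 'AA+1+2,CC', 'AA*1+2,CC', 'AA*1*2,CC'):
    print(sample, is_valid(sample))
# AA+1*2,CC True
# AA+1+2,CC False
# AA*1+2,CC True
# AA*1*2,CC False
```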

To get a regex to match your given example, extended to an arbitrary number of commas, you could use:
^(?:[A-Z]{2}([+*])?\d*\.?\d*(?!\1)[+*]?\d*\.?\d*,?)*$
Note that this example will also allow a trailing comma. I'm not sure if there is much you can do about that.
If the trailing comma is an issue:
^(?:[A-Z]{2}([+*])?\d*\.?\d*(?!\1)[+*]?\d*\.?\d*,?)*?(?:[A-Z]{2}([+*])?\d*\.?\d*(?!\2)[+*]?\d*\.?\d*?)$

How to check URLs against a predefined list of regex rules in order to get a match in Python?

I'm trying to match URL requests that have literal and variable components in their path against a list of predefined regex rules, similar to the routes Python library. I'm new to regex, so if you could explain the anchors and control characters used in your solution, I would really appreciate it.
Assuming I have the following list of rules, where components starting with : are variables and can match any string value:
(rule 1) /user/delete/:key
(rule 2) /user/update/:key
(rule 3) /list/report/:year/:month/:day
(rule 4) /show/:categoryid/something/:key/reports
Here are example test cases showing request URLs and the rules they should match:
/user/delete/222 -> matches rule 1
/user/update/222 -> matches rule 2
/user/update/222/bob -> does not match any rule defined
/user -> does not match any rule defined
/list/report/2004/11/2 -> matches rule 3
/show/44/something/222/reports -> matches rule 4
Can someone help me write the regexes for rules 1, 2, 3 and 4?
Thank you!!

I'm not sure why you need a regex to do something like that. You can split and count:
if len(url.split("/")) == 4:
    pass  # do something
You make sure the length is 4 because there's an additional element at the beginning, which is an empty string.
Or use something like:
if url.count("/") == 3:
    pass  # do something
If you really want to use a regex, then maybe you could use something like this:
if re.match(r'^(?:/[^/]*){3}$', url):
    pass  # do something
As per your edit:
You could use this for rule 1:
^/user/delete/[0-9]+$
For rule 2:
^/user/update/[0-9]+$
For rule 3:
^/list/report/[0-9]{4}/[0-9]{1,2}/[0-9]{1,2}$
For rule 4:
^/show/[0-9]+/something/[0-9]+/reports$
^ matches the beginning of the string. $ matches the end of the string. Together, they make sure that the string you are testing begins and ends with the regex; there's nothing before or after the 'template'.
[0-9] matches any 1 digit.
+ is a quantifier. It will allow for the repetition of the character or group just before it. [0-9]+ thus means 1 or more digits.
{4} is a fixed quantifier. It is a bit like +, but it repeats exactly 4 times. {1,2} is a variation of it; it means between 1 and 2 times.
All the other characters in the regex above are literal characters and will match themselves.
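The four hand-written patterns can also be generated from the rule strings. The following is only a sketch: it assumes a variable component may be any non-empty text without a slash ([^/]+), as the question describes, rather than digits only as in the patterns above. compile_rule, match_url and COMPILED are hypothetical names.

```python
import re

RULES = [
    "/user/delete/:key",
    "/user/update/:key",
    "/list/report/:year/:month/:day",
    "/show/:categoryid/something/:key/reports",
]

def compile_rule(rule):
    """Turn '/user/delete/:key' into a compiled regex with named groups."""
    parts = []
    for segment in rule.split("/"):
        if segment.startswith(":"):
            # Variable component: any non-empty text up to the next slash.
            parts.append("(?P<" + segment[1:] + ">[^/]+)")
        else:
            # Literal component: escape it so it matches itself.
            parts.append(re.escape(segment))
    return re.compile("/".join(parts))

COMPILED = [compile_rule(rule) for rule in RULES]

def match_url(url):
    """Return (rule number, variables) for the first matching rule, else None."""
    for number, pattern in enumerate(COMPILED, start=1):
        m = pattern.fullmatch(url)
        if m:
            return number, m.groupdict()
    return None

print(match_url("/user/delete/222"))      # (1, {'key': '222'})
print(match_url("/user/update/222/bob"))  # None
```

Using fullmatch plays the role of the ^ and $ anchors explained above.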

Well you can specify that you want only three matches as follows:
'^((\/(\w+)){3})$' with the g and m flags enabled
^ matches from start of string
(\/(\w+)){3} matches (a forward slash followed by alphanumeric characters) exactly 3 times
$ matches end of string
g flag to return more than just one match (this is a regex101 option; in Python, use re.findall or re.finditer instead)
m flag to make ^ and $ treat each line of text as a separate string rather than one huge multi-line string (re.MULTILINE in Python).
Demo:
http://regex101.com/r/xT3xW2
I will assume my_str is a string that the regex above accepts.
Then, to make a method call with that, you can do:
eval(my_str.split('/')[1]+my_str.split('/')[2].capitalize()+'('+my_str.split('/')[3]+')')
Here is what the string passed to eval looks like:
>>> print(my_str.split('/')[1]+my_str.split('/')[2].capitalize()+'('+my_str.split('/')[3]+')')
userUpdate(222)
OR
userDelete(222)
Then you simply do eval() on it to get the method call. That is the best I can do right now.
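Since eval executes whatever text arrives in the URL, a lookup table of handlers is a common alternative. The sketch below swaps eval for a dictionary; it assumes the handlers are ordinary functions taking one string argument. user_update, user_delete, HANDLERS and dispatch are all hypothetical names for this example.

```python
import re

def user_update(key):
    return "updated " + key

def user_delete(key):
    return "deleted " + key

# Map (first component, second component) to the function to call.
HANDLERS = {
    ("user", "update"): user_update,
    ("user", "delete"): user_delete,
}

PATH_RE = re.compile(r"/(\w+)/(\w+)/(\w+)")

def dispatch(url):
    """Call the handler registered for the URL, or return None if there is none."""
    m = PATH_RE.fullmatch(url)
    if not m:
        return None
    resource, action, argument = m.groups()
    handler = HANDLERS.get((resource, action))
    return handler(argument) if handler else None

print(dispatch("/user/update/222"))  # updated 222
print(dispatch("/user/drop/222"))    # None
```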
