Python regex - exclude a certain match


I am trying to capture the following only:
.1
,2
'3
The number after . , or ' can be any digit, and anything may come before or after it. For example, in .1 abc I only want to capture the 1, and in abc,2 I only want to capture the 2.
So if we have the following:
10,000
1.1
,1
.2
'3
'100.000
.200,000
'300'000
abc'100,000
abc.4
abc,5
abc'6
abc 7
,8 abc
.9 abc
'10 abc
.11abc
,12abc
I have the following python regex:
((?<![0-9])([.,':’])([0-9]{1,4}))
The problem is that it captures '100 in '100.000, .200 in .200,000 and '300 in '300'000. How can I stop it from capturing these? It shouldn't capture anything from '100.000, .200,000, '300'000, abc'100,000 and so on.
I use this to test my regex: https://pythex.org/
Why am I doing this? I am converting InDesign files to HTML, and in some of the conversions the footnotes are not working, so I am using RegReplace in Sublime Text to find the footnotes and replace them with specific HTML.
To make this clearer, since someone commented that it's not clear:
I want to capture a digit that has a . , ' before it, for example:
This is a long string with subscript footnote numbers like this.1 Sometimes they have a dot before the footnote number and sometimes they have a comma,2 Then there are times when it has an apostrophe'3
Now the problem with my regex was that it was also capturing the numbers after a dot, comma or apostrophe in values like 30,000 or 20.000 or '10,000. I don't want to capture anything like that, only cases like this'4 or like this.5 or like this ,6
So what I was trying to do with my regex is look before the dot, comma or apostrophe to see if there was a digit, and if there was, capture none of it, e.g. '10,000 or .20.000 or ,15'000
mypetlion got the closest, but his regex was not capturing the last 3 in the list. Let me see what I can do with his regex.

If I am not mistaken, you don't want to capture '100.000 or .200,000 or '300'000 or abc'100,000 but you do want to capture the rest which contains [.,'] followed by one or more digits.
You could match them and then use an alternation | and capture in a group what you do want to match:
[.,']\d+[.,']\d+|[.,'](\d+)
Details
[.,']\d+[.,']\d+ Match one of the characters in the character class, one or more digits, another character from the character class and again one or more digits (the pattern that you don't want to capture)
| Or
[.,'](\d+) Match one of the characters in the character class and capture in a group one or more digits.
Your values will be in captured group 1
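Checked against the sample list in the question, this alternation still captures 000 from 10,000 and the second 1 from 1.1, because neither value has a second separator for the first branch to consume. One possible variation, which is an assumption and not taken from any of the answers here, keeps the asker's negative lookbehind and adds a negative lookahead, so a number is rejected when another digit, or a separator plus a digit, follows it. The names FOOTNOTE and footnote_numbers are hypothetical.

```python
import re

# Hypothetical pattern (not from the answers above):
#   (?<![0-9])              no digit directly before the separator
#   [.,'’]                  the separator itself
#   ([0-9]{1,4})            the footnote number, captured as group 1
#   (?![0-9]|[.,'’][0-9])   no further digit, and no separator+digit, after it
FOOTNOTE = re.compile(r"(?<![0-9])[.,'’]([0-9]{1,4})(?![0-9]|[.,'’][0-9])")


def footnote_numbers(text):
    """Return the footnote numbers found in text as a list of strings."""
    return FOOTNOTE.findall(text)


samples = ["10,000", "1.1", "'100.000", ".200,000", "'300'000",
           "abc'100,000", "abc 7", ",1", "abc.4", "'10 abc", ",12abc"]

for sample in samples:
    print(repr(sample), footnote_numbers(sample))
```

The lookahead is checked again after every backtracking step, so shortening 200 to 20 or 2 in .200,000 does not help the engine find a match there.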

If I understand you correctly and you only want the next digit after ANY comma, period, or single quote then (([\.,'’])([0-9])) should do the trick.
If I misunderstand and you have the negative lookbehind for a reason then try this:
((?<![0-9])([\.,'’])([0-9]))

Related

Python Regex selecting first option

I have the following regex that looks for the string 191(x)(y) and (z) and combinations of this (for example 191(x), or 191(x) and (z)).
My regular expression is:
(191?(?:\w|\(.{0,3}\)(?:( (and)?|-)*)){0,5})
See the regex demo.
This expression works for the most part, but I need help with the following (which I can't figure out):
While I do get 5 matches, there are 3 groups, I need to limit the result to only the first group.
If I have the text: '191Transit', the regex should only match 191 and ignore the word 'Transit'. in this case it's 'Transit' in other examples this could be any word e.g: 191Bob, 191Smith
I am using Python 3.6.

You can use
191?(?:\([^()]{0,3}\)(?: (?:and)?|-)*){0,5}
See the regex demo
Details
Replace .{0,3} with [^()]{0,3} to stay within parentheses
Remove one group around ( (?:and)?|-)* as it's redundant
Change the groups to non-capturing, i.e. (...) to (?:...)
Remove the \w alternative: it matches any word char and thus matches up to five letters/digits/underscores after 191
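A short check of the revised pattern, as a sketch using the inputs mentioned in the question (the name PATTERN is only for illustration). Because the pattern no longer contains capturing groups, re.findall returns the whole matches as plain strings instead of tuples of groups.

```python
import re

# The revised pattern from this answer; every group is non-capturing.
PATTERN = re.compile(r"191?(?:\([^()]{0,3}\)(?: (?:and)?|-)*){0,5}")

for text in ["191(x)(y) and (z)", "191(x)", "191Transit"]:
    # findall yields full-match strings because there are no capturing groups
    print(text, "->", PATTERN.findall(text))
```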

Match specific pattern with regular expression

I've to make a regex to match exactly this kind of pattern
here an example
JK+6.00,PP*2,ZZ,GROUPO
having a match for every group like
Match 1
JK
+
6.00
Match 2
PP
*
2
Match 3
ZZ
Match 4
GROUPO
So, comma-separated blocks of
(2 to 12 capital letters) [optional: (+ or *) and a (positive number 0[.0[0]])]
This block successfully parses the pattern
(?P<block>(?P<subject>[A-Z]{2,12})(?:(?P<operation>\*|\+)(?P<value>\d+(?:.?\d{1,2})?))?)
we have the subject group
(?P<subject>[A-Z]{2,12})
The value
(?P<value>\d+(?:.?\d{1,2})?)
All the optional operation section (value within)
(?:(?P<operation>\*|\+)(?P<value>\d+(?:.?\d{1,2})?))?
But the regex must fail if the string doesn't match the pattern EXACTLY, and that's the problem.
I tried this, but it doesn't work
^(?P<block>(?P<subject>[A-Z]{2,12})(?:(?P<operation>\*|\+)(?P<value>\d+(?:.?\d{1,2})?))?)(?:,(?P=block))*$
Any suggestion?
PS. I use Python re

I'd personally go for a 2 step solution, first check that the whole string fits to your pattern, then extract the groups you want.
For the overall check you might want to use ^(?:[A-Z]{2,12}(?:[*+]\d+(?:\.\d{1,2})?)?(?:,|$))*$ as a pattern, which contains basically your pattern, the (?:,|$) to match the delimiters and anchors.
I have also adjusted your pattern a bit, to (?P<block>(?P<subject>[A-Z]{2,12})(?:(?P<operation>[*+])(?P<value>\d+(?:\.\d{1,2})?))?). I have replaced \*|\+ with [*+] in your operation pattern and .? with \. in your value pattern, so that only a literal dot is accepted as the decimal separator.
A (very basic) Python implementation could look like
    import re

    text = 'JK+6.00,PP*2,ZZ,GROUPO'
    # Step 1: the whole string must consist of valid comma-separated blocks
    full_pattern = r'^(?:[A-Z]{2,12}(?:[*+]\d+(?:\.\d{1,2})?)?(?:,|$))*$'
    # Step 2: extract subject, operation and value from each block
    extract_pattern = r'(?P<block>(?P<subject>[A-Z]{2,12})(?:(?P<operation>[*+])(?P<value>\d+(?:\.\d{1,2})?))?)'

    if re.fullmatch(full_pattern, text):
        for match in re.finditer(extract_pattern, text):
            print(match.groups())
http://ideone.com/kMl9qu
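One likely reason the single-regex attempt in the question fails is that (?P=block) is a backreference: it matches the exact text the block group captured earlier, and it does not re-apply the block sub-pattern. A minimal sketch of that behavior, using a simplified block pattern chosen only for illustration:

```python
import re

# (?P=block) must repeat the exact text captured by "block";
# it does not re-apply the [A-Z]{2,12} sub-pattern.
backref = re.compile(r"^(?P<block>[A-Z]{2,12})(?:,(?P=block))*$")

same_text = bool(backref.match("ZZ,ZZ"))       # second block repeats the first
different = bool(backref.match("ZZ,GROUPO"))   # second block is different text
print(same_text, different)
```

This is why the two-step approach writes the block pattern out again inside the repeated part instead of referring back to the group.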

I'm guessing this is the pattern you were looking for:
(2 different letters)+(time stamp),(2 of the same letter)*(1 number),(2 of the same letter),(a string)
If that's the case, this regex would do the trick:
^(\w{2}\+\d{1,2}\.\d{2}),((\w)\3\*\d),((\w)\5),(\w+)$
Demo: https://regex101.com/r/8B3C6e/2

Look Around and re.sub()

I want to know how re.sub() works.
The following example is in a book I am reading.
I want "1234567890" to be "1,234,567,890".
pattern = re.compile(r"\d{1,3}(?=(\d{3})+(?!\d))")
pattern.sub(r"\g<0>,", "1234567890")
"1,234,567,890"
Then, I changed "\g<0>" to "\g<1>" and it did not work.
The result was "890,890,890,890".
Why?
I want to know exactly how capturing and replacing in re.sub() and the lookahead mechanism work.

You get 890 repeated because it is Group 1 (= \g<1>): each match (1, 234 and 567) is replaced with the last value captured by Group 1, which is 890, followed by a comma. The final 890 is not matched at all and stays in place.
The reason is that (\d{3})+ inside the lookahead captures the groups of 3 digits one by one until the end (because of the (?!\d) condition), and only the last captured chunk is kept in Group 1. The \d{1,3} part is what actually gets matched and replaced.
See visualization at regex101.com.
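To see this without a visualizer, the matches and their Group 1 values can be listed directly. This is a small sketch built on the pattern from the question; the variable names are only for illustration.

```python
import re

pattern = re.compile(r"\d{1,3}(?=(\d{3})+(?!\d))")
text = "1234567890"

# group(0) is the text that gets replaced; group(1) is the last chunk
# captured by (\d{3})+ inside the lookahead, which is always the final 3 digits.
matches = [(m.group(0), m.group(1)) for m in pattern.finditer(text)]
print(matches)

with_g0 = pattern.sub(r"\g<0>,", text)  # keep the matched digits, append a comma
with_g1 = pattern.sub(r"\g<1>,", text)  # swap the matched digits for Group 1
print(with_g0)
print(with_g1)
```

The lookahead consumes nothing, so the digits it inspects remain available for the next match.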

Regex to find specific number using Python regex

I need a regex to find the maxtimeout value (40 in the following) in the RequestReadTimeout directive of an Apache config file, for example:
RequestReadTimeout header=XXX-40,MinRate=XXX body=XXX
RequestReadTimeout header=40 body=XXX
PS: XXX refers to a decimal number
I used this :
str="RequestReadTimeout header=10-40,MinRate=10 body=10"
re.search(r'header=\d+[-\d+]*', str).group()
'header=10-40'
But I need a regex that gets only the maxtimeout value (40 in this example) in one line (without using another function like split("-")[1], etc.).
Thanks.

You'd group the part you wanted to extract:
re.search(r'header=(?:\d*-)?(\d+)', inputstr).group(1)
The (...) marks a group, and positional groups like that are numbered starting at 1.
I altered your expression a little to only capture the number after an optional non-capturing group containing digits and a dash, to match both patterns you are looking for. The (?:...) is a non-capturing group; it doesn't store the matched text in a group, but does let you use the ? quantifier on the group to mark it optional.
Pythex demo.
Python session:
>>> import re
>>> for inputstr in ('RequestReadTimeout header=1234-40,MinRate=XXX body=XXX', 'RequestReadTimeout header=40 body=XXX'):
...     print(re.search(r'header=(?:\d*-)?(\d+)', inputstr).group(1))
...
40
40

You could do it with the following regex:
'RequestReadTimeout\sheader=(?:\d+)?-?(\d+).*'
The first captured group \1 is what you want
Demo: http://regex101.com/r/cD6hY0
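One caveat worth checking: when there is no dash, (?:\d+)? and (\d+) compete for the same digits, and backtracking leaves only the last digit for the capturing group. The sketch below compares this pattern with the grouped pattern from the earlier answer on both input forms; the variable names are only for illustration.

```python
import re

lines = [
    "RequestReadTimeout header=10-40,MinRate=10 body=10",
    "RequestReadTimeout header=40 body=10",
]

first = re.compile(r"header=(?:\d*-)?(\d+)")                         # earlier answer
second = re.compile(r"RequestReadTimeout\sheader=(?:\d+)?-?(\d+).*")  # this answer

# One (first, second) pair of captured values per input line
results = [(first.search(line).group(1), second.search(line).group(1))
           for line in lines]
print(results)
```

Making the dash part of the optional group, as the earlier answer does, avoids the problem because the optional prefix can only match when a dash is actually present.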

Python regex matching only if digit

Given the regex and the word below, I want to match the part after the - (which can also be a _ or a space) only if the part after the delimiter is a digit and nothing comes after it (I basically want it to be a number and a number only). I am using group statements but it just doesn't seem to work right. It keeps matching the 3 at the beginning (or the 1 at the end if I modify it a bit). How do I achieve this (by using grouping)?
Target word: BR0227-3G1
Regex: ([A-Z]*\s?[0-9]*)[\s_-]*([1-9][1-9]*)
It should not match 3G1, G1 , 1G
It should match only pure numbers like 3,10, 2 etc.
Here is also a helper web site for evaluating the regex: http://www.pythonregex.com/
More examples:
It should match:
BR0227-3
BR0227 3
BR0227_3
into groups (BR0227) (3)
It should only match (BR0227) for
BR0227-3G1
BR0227-CS
BR0227
BR0227-

I would use
re.findall(r'^([A-Z]*\s?[0-9]*)[\s_-]*([1-9][1-9]*$)?', str)
Each string starts with the first group and ends with the last group, so the ^ and $ anchors can assist in capture. The $ at the end requires the number to run to the end of the string, but the second group is optional, so the first group can still be captured.

Since you want the start and (possible) end of the word in groups, then do this:
r'\b([A-Z0-9]+)(?:[ _-](\d+))?\b'
This will put the first part of the word in the first group, and optionally the remainder in the second group. The second group will be None if it didn't match.
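As a quick check, here is this pattern applied with re.match to the examples from the question. It is a sketch; the names WORD and cases are only for illustration.

```python
import re

WORD = re.compile(r"\b([A-Z0-9]+)(?:[ _-](\d+))?\b")

cases = ["BR0227-3", "BR0227 3", "BR0227_3",
         "BR0227-3G1", "BR0227-CS", "BR0227", "BR0227-"]

for case in cases:
    # groups() is (prefix, number); number is None when the optional part is skipped
    print(case, WORD.match(case).groups())
```

For BR0227-3G1 the trailing \b cannot match between 3 and G, so the optional group is dropped and only the prefix is captured.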

This should match anything followed by '-', ' ', or '_' with only digits after it.
(.*)[- _](\d+)
