Look Around and re.sub()


I want to know how re.sub() works.
The following example is in a book I am reading.
I want "1234567890" to be "1,234,567,890".
pattern = re.compile(r"\d{1,3}(?=(\d{3})+(?!\d))")
pattern.sub(r"\g<0>,", "1234567890")
"1,234,567,890"
Then, I changed "\g<0>" to "\g<1>" and it did not work.
The result was "890,890,890,890".
Why?
I want to know exactly how capturing and replacing in re.sub() interact with the lookahead mechanism.

You get 890 repeated because \g<1> refers to Group 1, and every match (1, 234 and 567) is replaced with the last value captured in Group 1, which is 890.
One more thing: the quantified group (\d{3})+ captures 3-digit chunks one by one up to the end of the digit run (because of the (?!\d) condition), and only the last captured chunk is kept in Group 1. That value is what you are substituting for each match in the input string.
See visualization at regex101.com.
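The behaviour described above can be sketched as follows, using only the pattern and input from the question; the variable names are illustrative. It lists each match next to the value left in Group 1, then runs both replacements.

```python
import re

pattern = re.compile(r"\d{1,3}(?=(\d{3})+(?!\d))")
text = "1234567890"

# Each match is only the text consumed by \d{1,3}. The lookahead consumes
# nothing, but its repeated group (\d{3})+ keeps its last iteration in Group 1.
matches = [(m.group(0), m.group(1)) for m in pattern.finditer(text)]
print(matches)  # [('1', '890'), ('234', '890'), ('567', '890')]

# \g<0> re-inserts the whole match, so a comma is appended after each match.
with_g0 = pattern.sub(r"\g<0>,", text)
# \g<1> inserts Group 1 instead, which is "890" for every match.
with_g1 = pattern.sub(r"\g<1>,", text)

print(with_g0)  # 1,234,567,890
print(with_g1)  # 890,890,890,890
```

The three matches are 1, 234 and 567. The final 890 in both results is the tail of the input, which no match consumes and which is therefore copied through unchanged.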

Related

I need to write a regex that recognizes all numbers, comma-separated or not, excluding 4-digit numbers

I want to capture all numbers, comma-separated or not, excluding 4-digit numbers.
I want to match these numbers (in my case the digits are always grouped in threes):
978,763,835,536,363
123
123,456
123456
7456
3400
excluding years such as
1200 to 2020
I have written this:
regex_patterns = [
re.compile(r'[0-9]+,?[0-9]+,?[0-9]+,?[0-9]+')
]
It works well, but I do not know how to exclude years from these numbers. Many thanks.
Of course, I am working on sentences; the numbers are inside the sentences, not necessarily at the start of the line, like this:
-Thus 60 is to 41 as 100,000 is to 65,656½, the appropriate magnitude for βυ
This was found to be 36,075,5621 (with an eccentricity of 9165), corresponding to the entire oval path of Mars.
-It was 4657.
EDIT:
Since I ran into a lot of issues during my task, I have updated the question a few times.
First of all, the problem is mostly solved! Thank you all for the contributions.
Just one very small issue remains. Based on the other comments, I have integrated the solution as follows:
r'(?<!\S)(?<![\d,])(?:(?!(?:1[2-9]\d\d|20[01]\d|2020))\d{4,}[\u00BC-\u00BE\u2150-\u215E]?|\d{1,3}(?:,\d{3})+)(?![\d,])[\u00BC-\u00BE\u2150-\u215E]?(?!x)(?!/)
It captures most of the cases correctly:
https://regex101.com/r/o5gdDt/8
But there is some noise in my text, like this:
"
I take ψο as a figured unit [x]. It's square GEOM will also be a figured unit [x2]. Add the square GEOM on εο, 227,052, and the sum of the two will be the square GEOM of ψε or ψν. But the square GEOM of βν is 4,310,747,475 PARA
"
It cannot capture the number 227,052, which ends with ",".
When I changed it to the following, I ran into another problem:
(?<!\S)(?<![\d,])(?:(?!(?:1[2-9]\d\d|20[01]\d|2020))\d{4,}[\u00BC-\u00BE\u2150-\u215E]?|\d{1,3}(?:,\d{3})+)(?![\d])[\u00BC-\u00BE\u2150-\u215E]?(?!x)(?!/)
(basically removing the comma from (?![\d,]))
The regex then captured 4,310,747,475 in this:
4,310,747,475x2+978,763,835,536,363
as you can see here:
https://regex101.com/r/o5gdDt/9
Any ideas would be much appreciated.
The regex now works almost perfectly, but I need to improve it to handle these remaining cases.

If you are excluding all 4-digit numbers, it's this:
\b(?!\d{4}\b)[0-9]+(?:,(?!\d{4}\b)[0-9]+)*\b
https://regex101.com/r/T3L3X5/1
If you are excluding just the years between 1200 and 2020, it's this:
\b(?!(?:12\d{2}|1[3-9]\d{2}|20[01]\d|2020)\b)[0-9]+(?:,(?!(?:12\d{2}|1[3-9]\d{2}|20[01]\d|2020)\b)[0-9]+)*\b
https://regex101.com/r/ZuC6LR/1
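A quick way to try the second pattern on running text is sketched below. The sample sentence and the names YEAR, pattern and found are made up for illustration; the year alternation is factored into one string only to avoid typing it twice, and the resulting regex is the same as the one above.

```python
import re

# Year range 1200-2020, exactly as written in the answer's alternation.
YEAR = r"(?:12\d{2}|1[3-9]\d{2}|20[01]\d|2020)"

# Same regex as above: each comma-separated chunk must not be a standalone year.
pattern = re.compile(rf"\b(?!{YEAR}\b)[0-9]+(?:,(?!{YEAR}\b)[0-9]+)*\b")

# Hypothetical sample sentence, not taken from the question.
sample = "In 1999 it was 4657, then 123,456 and 2020."
found = pattern.findall(sample)
print(found)  # ['4657', '123,456']
```

Since the pattern has no capturing groups, re.findall returns the full matched strings.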

You can use the following regex to match one- to three-digit numbers and optionally any subsequent comma-separated groups that have no more than 3 digits.
\b\d{1,3}(?:,\d{1,3})*\b
https://regex101.com/r/T6sNUs/1/
The explanation goes like this:
\b - marks a word boundary to avoid matching part of a number longer than 3 digits
\d{1,3} - matches a one- to three-digit number
(?:,\d{1,3})* - non-capturing group that optionally matches comma-separated groups of one to three digits
\b - again marks a word boundary to avoid matching part of a number longer than 3 digits
Edit: For the requirement mentioned in the comments, numbers with three or more digits, optionally separated by commas, should match. But the match should be rejected if any number present in the line lies between 1200 and 2020.
This regex should give you what you need:
^(?!.*\b(?:1[2-9]\d\d|20[01]\d|2020)\b)\d{3,}(?:,\d{3,})*$
Demo
Please confirm whether this works for you, so I can add an explanation of the above regex.
And in case you want to restrict the range to 1200 through 1800, as you mentioned in your comments, you can use this regex:
^(?!.*\b(?:1[2-7]\d\d|1800)\b)\d{3,}(?:,\d{3,})*$
Demo
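A minimal sketch of how the whole-line version behaves, assuming each candidate is tested as its own line. The sample lines and the names pattern, lines and accepted are invented for illustration.

```python
import re

# The 1200-2020 variant from the answer: the leading lookahead scans the
# whole line and rejects it if any standalone year in that range appears.
pattern = re.compile(r"^(?!.*\b(?:1[2-9]\d\d|20[01]\d|2020)\b)\d{3,}(?:,\d{3,})*$")

# Hypothetical one-number-per-line inputs.
lines = ["123,456", "7456", "1500", "123,1999", "12"]
accepted = [s for s in lines if pattern.match(s)]
print(accepted)  # ['123,456', '7456']
```

"1500" and "123,1999" are rejected by the lookahead, and "12" fails because \d{3,} needs at least three digits. Because of the ^ and $ anchors, this form suits line-by-line validation rather than finding numbers inside a sentence.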

This is matching all your test cases:
(?<![\d,])(?:(?!(?:1[2-9]\d\d|20[01]\d|2020))\d{4,}|\d{1,3}(?:,\d{3})*)(?![\d,])
Explanation:
(?<![\d,]) # negative lookbehind, make sure there is no digit or comma before
(?: # non capture group
(?! # negative lookahead, make sure what follows is not:
(?: # non capture group
1[2-9]\d\d # range 1200 -> 1999
| # OR
20[01]\d # range 2000 -> 2019
| # OR
2020 # 2020
) # end group
) # end lookahead
\d{4,} # 4 or more digits
| # OR
\d{1,3} # 1 up to 3 digits
(?:,\d{3})* # non capture group, a comma and 3 digits, 0 or more times
) # end group
(?![\d,]) # negative lookahead, make sure there is no digit or comma after
Demo
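The pattern above can be tried out roughly like this. The sample sentence and the names pattern, sample and found are illustrative assumptions, not part of the answer.

```python
import re

# The answer's pattern, unchanged.
pattern = re.compile(
    r"(?<![\d,])(?:(?!(?:1[2-9]\d\d|20[01]\d|2020))\d{4,}|\d{1,3}(?:,\d{3})*)(?![\d,])"
)

# Hypothetical sentence mixing years with plain and comma-grouped numbers.
sample = "Between 1200 and 2020 it grew from 7456 to 123,456 and 978,763,835,536,363."
found = pattern.findall(sample)
print(found)  # ['7456', '123,456', '978,763,835,536,363']

# The year lookahead has no end anchor, so it also rejects longer
# unseparated numbers that merely start with a year-like prefix.
print(pattern.findall("123456"))  # []
```

One consequence worth checking against your own data: 123456 is skipped because its first four digits, 1234, fall in the excluded range. If such numbers should match, the lookahead would need something like a trailing (?!\d) after the year alternation, which is a suggestion of mine and not part of the answer.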

Here is the final answer that I arrived at, using the comments and adapting them to my context:
https://regex101.com/r/o5gdDt/8
As you can see, this code
(?<!\S)(?<![\d,])(?:(?!(?:1[2-9]\d\d|20[01]\d|2020))\d{4,}[\u00BC-\u00BE\u2150-\u215E]?|\d{1,3}(?:,\d{3})+)(?![\d,])[\u00BC-\u00BE\u2150-\u215E]?(?!x)(?!/)
can capture all numbers whose digits are grouped in threes, in text like
"here is 100,100"
"23,456"
"1,435"
all numbers of 4 or more digits without comma separators, like
2345
1234 " here is 123456"
also this kind of number
65,656½
65,656½,
23,123½
The only small issue is that if a comma (or dot) follows either of the first two types, it cannot capture them. For example, it cannot capture
"here is 100,100,"
"23,456,"
"1,435,"
Unfortunately there are a few numbers in the text that end with a comma. Can someone give me an idea of how to modify this to capture the above as well?
I have tried, and the modified version is:
(?<!\S)(?<![\d,])(?:(?!(?:1[2-9]\d\d|20[01]\d|2020))\d{4,}[\u00BC-\u00BE\u2150-\u215E]?|\d{1,3}(?:,\d{3})+)(?![\d])[\u00BC-\u00BE\u2150-\u215E]?(?!x)(?!/)
Basically I deleted the comma in (?![\d,]), but that causes another problem in my context:
it captures part of a number that belongs to an equation, like this:
4,310,747,475x2
57,349,565,416,398x.
see here:
https://regex101.com/r/o5gdDt/10
I know this is a rather specialized question; I would be happy to hear your ideas.

Python regex - exclude a certain match

I am trying to capture the following only:
.1
,2
'3
The number after .,' can be any digit and can have anything before or after it. So for example, .1 abc, I only want to capture the 1 or abc,2, I only want to capture the 2.
So if we have the following:
10,000
1.1
,1
.2
'3
'100.000
.200,000
'300'000
abc'100,000
abc.4
abc,5
abc'6
abc 7
,8 abc
.9 abc
'10 abc
.11abc
,12abc
I have the following python regex:
((?<![0-9])([.,':’])([0-9]{1,4}))
The problem is that it's capturing '100 in '100.000 and .200 in .200,000 and '300'000 - how can I stop it from capturing this. So it shouldn't capture '100.000 or .200,000 or '300'000 or abc'100,000 and so on.
I use this to test my regex: https://pythex.org/
Why am I doing this? I am converting InDesign files to HTML, and in some of the conversions the footnotes are not working, so I am using RegReplace in Sublime Text to find and replace the footnotes with specific HTML.
I just want to make this clearer, as someone has commented that it's not clear.
I want to capture a digit that has a . , ' before it, for example:
This is a long string with subscript footnote numbers like this.1 Sometimes they have a dot before the footnote number and sometimes they have a comma,2 Then there are times when it has an apostrophe'3
Now the problem with my regex was that it was capturing the numbers after a dot, comma or apostrophe in values like 30,000 or 20.000 or '10,000. I don't want to capture anything like that, only cases like this'4 or like this.5 or like this ,6
So what I was trying to do with my regex is to look before the dot, comma or apostrophe to see if there was a digit, and if there was, I didn't want to capture any of it, e.g. '10,000 or .20.000 or ,15'000
Now mypetlion got the closest, but his regex was not capturing the last 3 in the list. Let me see what I can do with his regex.

If I am not mistaken, you don't want to capture '100.000 or .200,000 or '300'000 or abc'100,000 but you do want to capture the rest which contains [.,'] followed by one or more digits.
You could match them and then use an alternation | and capture in a group what you do want to match:
[.,']\d+[.,']\d+|[.,'](\d+)
Details
[.,']\d+[.,']\d+ Match one of the characters in the character class, one or more digits and match one of the characters in the character class (the pattern that you don't want to capture)
| Or
[.,'](\d+) Match one of the characters in the character class and capture in a group one or more digits.
Your values will be in captured group 1
Demo
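A minimal sketch of the "match what you don't want, capture what you do" idea from this answer. The sample text and the names pattern, text and captured are invented for illustration. Matches of the unwanted shape leave Group 1 unset, so they are filtered out.

```python
import re

# Left alternative consumes separator-digits-separator-digits without capturing;
# right alternative captures the digits that follow a single separator.
pattern = re.compile(r"[.,']\d+[.,']\d+|[.,'](\d+)")

# Hypothetical footnote-style text with two number-like values to skip.
text = "like this.1 or a comma,2 or an apostrophe'3 but not '100.000 or .200,000"

# Keep only matches where Group 1 participated.
captured = [m.group(1) for m in pattern.finditer(text) if m.group(1) is not None]
print(captured)  # ['1', '2', '3']
```

This only covers the shapes discussed in the answer. Other inputs from the question's list may still need extra handling, so it is worth running the full list through it.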

If I understand you correctly and you only want the next digit after ANY comma, period, or single quote then (([\.,'’])([0-9])) should do the trick.
If I misunderstand and you have the negative lookbehind for a reason then try this:
((?<![0-9])([\.,'’])([0-9]))

Regular expression - number with spaces and decimal comma

I'd like to write a regular expression for the following type of strings in Python:
1 100
1 567 865
1 474 388 346
i.e. numbers with groups of thousands separated by spaces. Here's my regexp:
r"(\d{1,3}(?:\s*\d{3})*)"
and it works fine. However, I also want to parse
1 100,34848
1 100 300,8
19 328 383 334,23499
i.e. space-separated numbers with decimal digits. I wrote
rr=r"(\d{1,3}(?:\s*\d{3})*)(,\d+)?\s"
It doesn't work. For instance, if I run
sentence = "jsjs 2 222,11 dhd"
re.findall(rr, sentence)
[('2 222', ',11')]
Any help appreciated, thanks.

This works:
import re
rr=r"(\d{1,3}(?:\s*\d{3})*(?:,\d+)?)"
sentence = "jsjs 2 222,11 dhd"
print(re.findall(rr, sentence)) # prints ['2 222,11']

TL;DR: This regular expression will return ['2 222,11']
r"(?:\d{1,3}(?:\s*\d{3})*)(?:,\d+)?"
re.findall returns the contents of the parenthesized groups, except those starting with (?:, or the whole match if there are no capturing groups.
So with your first regex it matches your string and returns the whole match, since the single capturing group spans the entire expression (the only other parentheses start with (?:).
With the second it finds the string 2 222,11 and matches it, then looks at the capturing groups ((\d{1,3}(?:\s*\d{3})*) and (,\d+)) and returns a tuple containing those: namely the part before the decimal comma and the part after it.
So to fix your expression, you need to either add ?: to all the parentheses or remove them.
Also, the final \s is redundant, because quantifiers are greedy by default and match as many characters as possible, meaning \d+ will consume all the digits after the comma.
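To make the difference concrete, here is a small sketch comparing how re.findall reports results for two capturing groups, one capturing group and no capturing groups. The variable names are illustrative; the patterns and the sentence come from the question and answers.

```python
import re

sentence = "jsjs 2 222,11 dhd"

# Two capturing groups: findall returns one tuple per match.
two_groups = re.findall(r"(\d{1,3}(?:\s*\d{3})*)(,\d+)?\s", sentence)

# One capturing group around everything: findall returns that group's text.
one_group = re.findall(r"(\d{1,3}(?:\s*\d{3})*(?:,\d+)?)", sentence)

# No capturing groups at all: findall returns the whole match.
no_groups = re.findall(r"\d{1,3}(?:\s*\d{3})*(?:,\d+)?", sentence)

print(two_groups)  # [('2 222', ',11')]
print(one_group)   # ['2 222,11']
print(no_groups)   # ['2 222,11']
```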

The only problem with your result is that you're getting two match groups instead of one. The only reason that's happening is that you're creating two capture groups instead of one. You're putting separate parentheses around the first half and the second half, and that's what parentheses mean. Just don't do that, and you won't have that problem.
So, with this, you're half-way there:
(\d{1,3}(?:\s*\d{3})*,\d+)\s
Debuggex Demo
The only problem is that the ,\d+ part is now mandatory instead of optional. You obviously need somewhere to put the ?, as you were doing. But without a group, how do you do that? Simple: you can use a group, just make it a non-capturing group ((?:…) instead of (…)). And put it inside the main capturing group, not separate from it. Exactly as you're already doing for the repeated \s*\d{3} part.
(\d{1,3}(?:\s*\d{3})*(?:,\d+)?)\s
Debuggex Demo

Python regex matching only if digit

Given the regex and the word below, I want to match the part after the - (which can also be a _ or a space) only if the part after the delimiter consists of digits and nothing comes after it (I basically want it to be a number and a number only). I am using groups, but it just doesn't seem to work right. It keeps matching the 3 at the beginning (or the 1 at the end if I modify it a bit). How do I achieve this (by using grouping)?
Target word: BR0227-3G1
Regex: ([A-Z]*\s?[0-9]*)[\s_-]*([1-9][1-9]*)
It should not match 3G1, G1, 1G
It should match only pure numbers like 3, 10, 2 etc.
Here is also a helper web site for evaluating the regex: http://www.pythonregex.com/
More examples:
It should match:
BR0227-3
BR0227 3
BR0227_3
into groups (BR0227) (3)
It should only match (BR0227) for
BR0227-3G1
BR0227-CS
BR0227
BR0227-

I would use
re.findall(r'^([A-Z]*\s?[0-9]*)[\s_-]*([1-9][1-9]*$)?', str)
Each string starts with the first group and ends with the last group, so the ^ and $ anchors can assist in capture. The $ at the end requires the number to run to the end of the string, but that group is optional, so the first group can still be captured on its own.

Since you want the start and (possible) end of the word in groups, then do this:
r'\b([A-Z0-9]+)(?:[ _-](\d+))?\b'
This will put the first part of the word in the first group, and optionally the remainder in the second group. The second group will be None if it didn't match.
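As a rough check, the pattern above can be run with re.match against the examples listed in the question. The names pattern, words and results are illustrative.

```python
import re

pattern = re.compile(r'\b([A-Z0-9]+)(?:[ _-](\d+))?\b')

# The example words from the question.
words = ["BR0227-3", "BR0227 3", "BR0227_3",
         "BR0227-3G1", "BR0227-CS", "BR0227", "BR0227-"]

# re.match anchors at the start of each word; groups() gives (prefix, number),
# with None for the number when the optional part did not match.
results = {w: pattern.match(w).groups() for w in words}
for word, groups in results.items():
    print(word, groups)
```

The first three words yield ('BR0227', '3') and the remaining four yield ('BR0227', None). Note that this uses re.match on one word at a time. With re.findall over longer text, a leftover such as 3G1 could be picked up as a separate match.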

This should match anything followed by '-', ' ', or '_' with only digits after it.
(.*)[- _](\d+)

Python Regex to look for string

I have a text file with text that looks like below
Format={ Window_Type="Tabular", Tabular={ Num_row_labels=10
}
}
I need to look for Num_row_labels >=10 in my text file. How do I do that using Python 3.2 regex?
Thanks.

Assuming that the data is formatted as above and there are no leading 0s in the number:
Num_row_labels=\d{2,}
A more liberal regex that allows arbitrary spaces, still assuming no leading 0s:
Num_row_labels\s*=\s*\d{2,}
An even more liberal regex that allows arbitrary spaces and leading 0s:
Num_row_labels\s*=\s*0*[1-9]\d+
If you need to capture the number, just surround \d{2,} (in the 1st and 2nd regex) or [1-9]\d+ (in the 3rd regex) with parentheses () and refer to it as the 1st capture group.
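The differences between the three regexes show up on a few hand-made inputs. The sketch below is only illustrative: the labels strict, spaces and leading_zeros and the sample strings are made up here.

```python
import re

# The three regexes from the answer, under hypothetical labels.
patterns = {
    "strict": r"Num_row_labels=\d{2,}",
    "spaces": r"Num_row_labels\s*=\s*\d{2,}",
    "leading_zeros": r"Num_row_labels\s*=\s*0*[1-9]\d+",
}

# Made-up sample strings covering spacing and leading zeros.
samples = ["Num_row_labels=10", "Num_row_labels = 12", "Num_row_labels=9",
           "Num_row_labels=09", "Num_row_labels=010"]

# For each sample, record which regex finds a match.
table = {s: {name: bool(re.search(p, s)) for name, p in patterns.items()}
         for s in samples}
for s, row in table.items():
    print(s, row)
```

The "Num_row_labels=09" case is where the assumption about leading 0s matters: the first two regexes accept it because \d{2,} only counts digits, while the third rejects it.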

Use:
match = re.search(r"Num_row_labels=(\d+)", line)
The (\d+) matches at least one decimal digit (0-9) and captures all digits matched as a group (groups are stored in the object returned by re.search and re.match, which I'm assigning to match here). To access the group and compare it against 10, use:
if match and int(match.group(1)) >= 10:
    print("Num_row_labels is at least 10")
This will allow you to easily change the value of your threshold, unlike the answers that do everything in the regex. Additionally, I believe this is more readable in that it is very obvious that you are comparing a value against 10, rather than matching a nonzero digit in the regex followed by at least one other digit. What the code above does is ask for the 1st group that was matched (match.group(1) returns the string that was matched by \d+), and then, with the call to int(), converts the string to an integer. The integer returned by int() is then compared against 10.
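Put together, the approach might look like the sketch below. The helper name has_enough_row_labels and its threshold parameter are hypothetical, and allowing optional spaces around = is an assumption borrowed from another answer rather than part of this one.

```python
import re

def has_enough_row_labels(text, threshold=10):
    """Return True if text contains Num_row_labels=<n> with n >= threshold."""
    # Capture the digits; the comparison is done numerically, not in the regex.
    match = re.search(r"Num_row_labels\s*=\s*(\d+)", text)
    return match is not None and int(match.group(1)) >= threshold

# Sample text in the shape shown in the question.
sample = '''Format={ Window_Type="Tabular", Tabular={ Num_row_labels=10
}
}'''

print(has_enough_row_labels(sample))                # True
print(has_enough_row_labels("Num_row_labels=9"))    # False
print(has_enough_row_labels(sample, threshold=25))  # False
```

Checking match is not None first avoids an AttributeError when the text has no Num_row_labels entry at all.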

The regex is Num_row_labels=[1-9][0-9]{1}.*
Now you can use Python's re module to analyze your text and extract those lines.

The regex looks like:
Num_row_labels=[0-9]*[1-9][0-9]+
Example of usage:
if re.search(r'Num_row_labels=[0-9]*[1-9][0-9]+', line):
    print(line)
The regular expression [0-9]*[1-9][0-9]+ means that the string must contain at least
one digit from 1 to 9 ([1-9]; a character class [] in a regular expression matches any single character from the range given in the brackets);
and at least one digit from 0 to 9 after it, possibly more ([0-9]+; the + sign means that the preceding symbol or expression can be repeated 1 or more times).
Any other digits can come before these ([0-9]*, meaning any digit, 0 or more times). Once you have a nonzero digit followed by at least one more digit, the number is greater than or equal to 10 regardless of what precedes it.
