Regex unexpected outcome - python

import re
pattern =r"[1-9][0-9]{0,2}(?:,\d{3})?(?:,\d{3})?"
string = '42 1,234 6,368,745 12,34,567 1234'
a = re.findall(pattern,string)
print(a)
What should I change to get the expected result?
Expected output:
['42', '1,234', '6,368,745']
Actual output:
['42', '1,234', '6,368,745', '12', '34,567', '123', '4']
I was trying to solve this quiz in a book.
How would you write a regex that matches a number with commas for every three digits? It must match the following:
• '42'
• '1,234'
• '6,368,745'
but not like the following:
• '12,34,567' (which has only two digits between the commas)
• '1234' (which lacks commas)
Your help would be much appreciated!

You may use
import re
pattern =r"(?<!\d,)(?<!\d)[1-9][0-9]{0,2}(?:,\d{3})*(?!,?\d)"
string = '42 1,234 6,368,745 12,34,567 1234'
a = re.findall(pattern,string)
print(a) # => ['42', '1,234', '6,368,745']
See Python demo.
Regex details
(?<!\d,)(?<!\d) - no digit, and no digit followed by a comma, allowed immediately to the left of the current location
[1-9][0-9]{0,2} - a non-zero digit followed with any zero, one or two digits
(?:,\d{3})* - 0 or more occurrences of a comma and then any three digits
(?!,?\d) - no digit or , + digit allowed immediately to the right of the current location.
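If each candidate can be checked on its own (an assumption; the quiz does not say whether the numbers are embedded in longer text), the lookarounds are not needed: splitting on whitespace and using re.fullmatch anchors the pattern at both ends. A minimal sketch, where the helper name is_grouped_number is made up for illustration:

```python
import re

# Same core pattern as above, without the lookarounds; re.fullmatch
# requires the whole token to match, so no extra anchoring is needed.
GROUPED = re.compile(r"[1-9][0-9]{0,2}(?:,\d{3})*")

def is_grouped_number(token):
    """Return True if token is a number with a comma every three digits."""
    return GROUPED.fullmatch(token) is not None

string = '42 1,234 6,368,745 12,34,567 1234'
valid = [tok for tok in string.split() if is_grouped_number(tok)]
print(valid)  # ['42', '1,234', '6,368,745']
```

This only works when the input can be split into separate tokens first; for numbers inside running text the lookaround version is the one to use.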

You could use the following regular expression.
r'(?<![,\d])[1-9]\d{,2}(?:,\d{3})*(?![,\d])'
with re.findall.
Demo
Python's regex engine performs the following operations.
(?<! begin negative lookbehind
[,\d] match ',' or a digit
) end negative lookbehind
[1-9] match a digit other than '0'
\d{,2} match 0-2 digits (same as \d{0,2})
(?: begin non-capture group
,\d{3} match ',' then 3 digits
) end non-capture group
* execute non-capture group 0+ times
(?![,\d]) previous match is not followed by ',' or a digit
(negative lookahead)
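The two lookaround patterns given in the answers agree on the quiz string but are not identical. The sketch below compares them; the second sentence is a made-up example showing that the stricter lookahead (?![,\d]) also rejects a number that is merely followed by a punctuation comma.

```python
import re

p1 = re.compile(r"(?<!\d,)(?<!\d)[1-9][0-9]{0,2}(?:,\d{3})*(?!,?\d)")
p2 = re.compile(r"(?<![,\d])[1-9]\d{,2}(?:,\d{3})*(?![,\d])")

quiz = '42 1,234 6,368,745 12,34,567 1234'
print(p1.findall(quiz))  # ['42', '1,234', '6,368,745']
print(p2.findall(quiz))  # ['42', '1,234', '6,368,745']

# Hypothetical sentence where a number is followed by a punctuation comma.
sentence = 'It costs 1,234, not 42.'
print(p1.findall(sentence))  # ['1,234', '42']
print(p2.findall(sentence))  # ['42']
```

Which behavior is preferable depends on the input; for the quiz itself either pattern gives the expected list.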

Related

How to extract a specific type of number from a string using regex?

Consider this string:
text = '''
4 500,5
12%
1,63%
568768,74832 days in between
34 cars in a row'''
As you can see, there are simple numbers, numbers with spaces in between, numbers with commas, and numbers with both. Thus, 4 500,5 is considered a standalone, separate number. Extracting the numbers with commas and spaces is easy, and I found this pattern:
pattern = re.compile(r'(\d+ )?\d+,\d+')
However, I am struggling to extract just the simple numbers like 12 and 34. I tried using (?!...) and [^...] but these options do not allow me to exclude the unwanted parts of other numbers.
((?:\d+ )?\d+,\d+)|(\d+(?! \d))
I believe this will do what you want (Regexr link: https://regexr.com/695tc)
To capture "simple" numbers, it looks for [one or more digits], which are not followed by [a space and another digit].
I edited so that you can use capture groups appropriately, if desired.
If you only want to match 12 and 34:
(?<!\S)\d+\b(?![^\S\n]*[,\d])
(?<!\S) Assert a whitespace boundary to the left
\d+\b Match 1+ digits and a word boundary
(?! Negative lookahead, assert what is directly to the right is not
[^\S\n]*[,\d] Match optional spaces and either , or a digit
) Close lookahead
Regex demo
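As a quick check, the pattern for the simple numbers can be run against the text from the question. This is only a sketch; the variable name simple_rx is made up for illustration.

```python
import re

text = '''
4 500,5
12%
1,63%
568768,74832 days in between
34 cars in a row'''

# (?<!\S) rejects digits glued to a preceding non-space character;
# the lookahead rejects digits followed (after optional same-line
# spaces) by a comma or another digit.
simple_rx = re.compile(r'(?<!\S)\d+\b(?![^\S\n]*[,\d])')

print(simple_rx.findall(text))  # ['12', '34']
```

The class [^\S\n] matches whitespace other than a newline, so a number at the end of one line is not rejected because of a digit at the start of the next line.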
I'd suggest extracting all numbers first, then filter those with a comma to a list with floats, and those without a comma into a list of integers:
import re
text = '4 500,5\n\n12%\n\n1,63%\n\n568768,74832 days in between\n\n34 cars in a row'
number_rx = r'(?<!\d)(?:\d{1,3}(?:[ \xA0]\d{3})*|\d+)(?:,\d+)?(?!\d)'
number_list = re.findall(number_rx, text)
print('Float: ', [x for x in number_list if ',' in x])
# => Float: ['4 500,5', '1,63', '568768,74832']
print('Integers: ', [x for x in number_list if ',' not in x])
# => Integers: ['12', '34']
See the Python demo and the regex demo.
The regex matches:
(?<!\d) - a negative lookbehind that allows no digit immediately to the left of the current location
(?:\d{1,3}(?:[ \xA0]\d{3})*|\d+) - either of the two alternatives:
\d{1,3}(?:[ \xA0]\d{3})* - one, two or three digits, and then zero or more occurrences of a space / hard (non-breaking) space followed by three digits
| - or
\d+ - one or more digits
(?:,\d+)? - an optional sequence of , and then one or more digits
(?!\d) - a negative lookahead that allows no digit immediately to the right of the current location.
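The lists above still hold strings. If actual numeric values are needed, a conversion step could follow. The sketch below assumes that spaces (or non-breaking spaces) are thousands separators and that the comma is the decimal separator; the helper to_number is hypothetical.

```python
import re

text = '4 500,5\n\n12%\n\n1,63%\n\n568768,74832 days in between\n\n34 cars in a row'
number_rx = r'(?<!\d)(?:\d{1,3}(?:[ \xA0]\d{3})*|\d+)(?:,\d+)?(?!\d)'

def to_number(s):
    """Convert a matched string to int or float.

    Assumes spaces / non-breaking spaces are thousands separators and
    the comma is the decimal separator.
    """
    cleaned = s.replace(' ', '').replace('\xa0', '')
    if ',' in cleaned:
        return float(cleaned.replace(',', '.'))
    return int(cleaned)

numbers = [to_number(m) for m in re.findall(number_rx, text)]
print(numbers)  # [4500.5, 12, 1.63, 568768.74832, 34]
```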

How do I write a Regex in Python to remove leading zeros for a number in the middle of a string

I have a string composed of letters followed by a number, and I need to remove all the letters as well as the leading zeros in the number.
For example: in the test string U012034, I want to match the U and the 0 at the beginning of 012034.
So far I have [^0-9] to match any characters that aren't digits, but I can't figure out how to also remove the leading zeros in the number.
I know I could do this in multiple steps with something like int(re.sub("[^0-9]", "", test_string)) but I need this process to be done in one regex.
You can use
re.sub(r'^\D*0*', '', text)
See the regex demo. Details
^ - start of string
\D* - any zero or more non-digit chars
0* - zero or more zeros.
See Python demo:
import re
text = "U012034"
print( re.sub(r'^\D*0*', '', text) )
# => 12034
If there is more text after the first number, use
print( re.sub(r'^\D*0*(\d+).*', r'\1', text) )
See this regex demo. Details:
^ - start of string
\D* - zero or more non-digits
0* - zero or more zeros
(\d+) - Group 1: one or more digits (use (\d+(?:\.\d+)?) to match float or int values)
.* - the rest of the string.
The replacement is the Group 1 value.
You may use this re.sub in Python:
string = re.sub(r'^[a-zA-Z]*0*|[a-zA-Z]+', '', string)
RegEx Demo
Explanation:
^: Start
[a-zA-Z]*: Match 0 or more letters
0*: Match 0 or more zeroes
|: OR
[a-zA-Z]+: Match 1+ of letters
Does this do what you need?
re.sub("[^0-9]+0*", "", "U0123")
>>> '123'
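The three suggestions agree on U012034 but behave differently on other inputs. The sketch below runs them side by side; the second and third sample strings are made-up edge cases, and the dictionary keys are just labels chosen for illustration.

```python
import re

patterns = {
    'anchored':   r'^\D*0*',
    'letters':    r'^[a-zA-Z]*0*|[a-zA-Z]+',
    'non_digits': r'[^0-9]+0*',
}

def strip_all(sample):
    """Apply each pattern to sample and return the results by label."""
    return {name: re.sub(rx, '', sample) for name, rx in patterns.items()}

for sample in ['U012034', '0012', 'U0120A034']:
    print(sample, strip_all(sample))

# U012034   -> '12034' for all three
# 0012      -> '12', '12', '0012'  (the last pattern needs a non-digit first)
# U0120A034 -> '120A034', '120034', '12034'
```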

Regular expression from the first number block after an opening bracket. Repeat after closing the parenthesis

I've been trying to solve this problem for a while now, but I just can't get it right.
I would like the following string to be split up so that I get the first block of numbers after the opening bracket. If another bracket is opened before the previous one is closed, the following numerical block is invalid.
Test String:
[(16908,76,(2585,0,0),(),()),(18404,74,(),(),()),(16823,66,(),(),()),(0,0,(),(),()),(16905,76,(),(),()),(16910,76,(),(),()),(16909,76,(2585,0,0),(),()),(16906,76,(1887,0,0),(),()),(16911,76,(1886,0,0),(),()),(16907,76,(1887,0,0),(),()),(19384,83,(),(),()),(19898,68,(),(),()),(13965,63,(),(),()),(11815,58,(),(),()),(13340,63,(849,0,0),(),()),(19896,65,(1900,0,0),(),()),(19910,65,(1900,0,0),(),()),(17069,69,(),(),()),(0,0,(),(),())],[]
Valid number blocks:
16908, 18404, 16823, 16905, etc
Invalid number blocks:
2585, 2585, 1887, etc
The valid blocks should be displayed separated by commas.
In this example the numbers all have five digits, but the length can vary from 0 to 8 digits.
The use of such (\d{0,8}) does not look very adequate to me.
I am absolutely not a regex professional and would be happy about any kind of impulse or help that brings me to my goal.
Thanks in advance.
I found a way to do it with two regexes:
from re import findall, search
text = "[(16908,76,(2585,0,0),(),()),(18404,74,(),(),()),(16823,66,(),(),()),(0,0,(),(),()),(16905,76,(),(),()),(16910,76,(),(),()),(16909,76,(2585,0,0),(),()),(16906,76,(1887,0,0),(),()),(16911,76,(1886,0,0),(),()),(16907,76,(1887,0,0),(),()),(19384,83,(),(),()),(19898,68,(),(),()),(13965,63,(),(),()),(11815,58,(),(),()),(13340,63,(849,0,0),(),()),(19896,65,(1900,0,0),(),()),(19910,65,(1900,0,0),(),()),(17069,69,(),(),()),(0,0,(),(),())],[]"
matches = findall(r'\(\w+(?!\().+?\)', text)  # find valid blocks
blocks = []
for match in matches:
    blocks.append(search(r'\d+', match).group())  # get the first number in every match (block number)
print(blocks)
Output is:
['16908', '18404', '16823', '0', '16905', '16910', '16909', '16906', '16911', '16907', '19384', '19898', '13965', '11815', '13340', '19896', '19910', '17069', '0']
Is this the behavior you want?
Is this regex what you need: r'\((\d{1,8}),\d+(?:,\(\d*,?\d*,?\d*\)){3}\)'?
Demo:
https://regex101.com/r/oz0bdE/1
Python code:
import re
string = '[(16908,76,(2585,0,0),(),()),(18404,74,(),(),()),(16823,66,(),(),()),(0,0,(),(),()),(16905,76,(),(),()),(16910,76,(),(),()),(16909,76,(2585,0,0),(),()),(16906,76,(1887,0,0),(),()),(16911,76,(1886,0,0),(),()),(16907,76,(1887,0,0),(),()),(19384,83,(),(),()),(19898,68,(),(),()),(13965,63,(),(),()),(11815,58,(),(),()),(13340,63,(849,0,0),(),()),(19896,65,(1900,0,0),(),()),(19910,65,(1900,0,0),(),()),(17069,69,(),(),()),(0,0,(),(),())],[]'
matches = re.findall(r'\((\d{1,8}),\d+(?:,\(\d*,?\d*,?\d*\)){3}\)', string)
print(matches)
Output:
['16908', '18404', '16823', '0', '16905', '16910', '16909', '16906', '16911', '16907', '19384', '19898', '13965', '11815', '13340', '19896', '19910', '17069', '0']
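If you don't need to verify the structure of the string you can match a fairly simple regular expression. A left parenthesis alone does not identify the strings of digits of interest, because nested blocks such as (2585,0,0) also start with one; in the test string, however, only the wanted numbers are followed by a comma, a second number, a comma and another left parenthesis.
r'(?<=\()\d{1,8}(?=,\d+,\()'
(?<=\() is a positive lookbehind that asserts that the current position in the string is immediately preceded by a left parenthesis. (?=,\d+,\() is a positive lookahead that asserts that the match is followed by ',', 1+ digits and ',('.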
Matching regex
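A lookbehind for "(" on its own would also pick up the first number of nested blocks such as (2585,0,0). The sketch below therefore adds a lookahead, assuming every top-level block has the shape (number,number,(...), and joins the results with commas as the question asks. The test string is shortened and the name top_level is made up for illustration.

```python
import re

# Shortened version of the test string from the question.
text = "[(16908,76,(2585,0,0),(),()),(18404,74,(),(),()),(16823,66,(),(),())],[]"

# Preceded by "(" and followed by ",<digits>,(" : true for top-level
# blocks, false for nested ones like (2585,0,0).
top_level = re.compile(r'(?<=\()\d{1,8}(?=,\d+,\()')

print(','.join(top_level.findall(text)))  # 16908,18404,16823

# For comparison, the lookbehind alone also returns the nested number.
print(re.findall(r'(?<=\()\d{1,8}', text))
# ['16908', '2585', '18404', '16823']
```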
If you need to verify the structure of the string as well you could use the following regular expression. I've assumed the string ends "],[]". If that is not the case an adjustment is of course necessary.
r'^\[(?:(?:(?<!\[),|)\(\d{1,8},\d+(?:,\((?:\d+(?:,\d+)+)?\))*\))*\],\[\]$'
Verification regex
For verification Python's regex engine performs the following operations.
^ : match beginning of string
\[ : match '['
(?: : begin non-capture group
(?: : begin non-capture group
(?<!\[) : use negative lookbehind to assert current
position is not preceded by '['
, : match ','
| : or
: match an empty string
) : end non-capture group
\(\d{1,8},\d+ : match '(', 1-8 digits, ',', 1+ digits
(?: : begin non-capture group
,\( : match ',('
(?: : begin non-capture group
\d+ : match 1+ digits
(?:,\d+) : match ',', 1+ digits in a non-capture group
+ : execute non-capture group 1+ times
)? : end non-capture group and make it optional
\) : match ')'
)* : end non-capture group and execute it 0+ times
\) : match ')'
)* : end non-capture group and execute it 0+ times
\],\[\]$ : match '],[]' at end of string
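A short sketch of how the verification pattern might be used. The strings are shortened, and the malformed one is made up by deleting a single closing parenthesis.

```python
import re

verify_rx = re.compile(
    r'^\[(?:(?:(?<!\[),|)\(\d{1,8},\d+(?:,\((?:\d+(?:,\d+)+)?\))*\))*\],\[\]$'
)

good = "[(16908,76,(2585,0,0),(),()),(18404,74,(),(),())],[]"
# Made-up malformed input: the ')' closing the first block is missing.
bad = "[(16908,76,(2585,0,0),(),(),(18404,74,(),(),())],[]"

print(bool(verify_rx.match(good)))  # True
print(bool(verify_rx.match(bad)))   # False
```

Once the structure is known to be valid, the block numbers can be pulled out with a separate, simpler pattern.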

Extracting a string within a string and omitting the search string

I have a string:
string="soupnot$23.99dedarikjdf$44.65 notworryfence$98.44coyoteugle$33.94rock$2,300.00"
I want to extract the numbers 23.99, 44.65, 98.44, 33.94, 2,300.00. I have this regex:
\$(.*[^\s])
There are 2 issues with this.
It returns the '$' sign. I only want the number.
It only works when there is a space at the end of the number but sometimes there might be letters and it won't work in that case.
Thanks.
You can use regex as shown:
import re
string="soupnot$23.99dedarikjdf$44.65 notworryfence$98.44coyoteugle$33.94rock$2,300.00"
res = re.findall(pattern=r"[\d.,]+", string=string)
output:
['23.99', '44.65', '98.44', '33.94', '2,300.00']
Try this regex:
(?<=\$)\d+(?:,\d+)*(?:\.\d+)?
Click for Demo
Explanation
(?<=\$) - positive lookbehind to find the position just preceded by a $
\d+ - matches 1+ occurrences of a digit
(?:,\d+)* - matches 0+ occurrences of a , followed by 1 or more digits
(?:\.\d+)? - matches a . followed by 1+ digits. ? in the end makes this decimal part optional
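The lookbehind pattern can be tried as follows. The float conversion assumes commas are thousands separators, and the second string is a made-up example showing where this pattern and the simpler [\d.,]+ differ.

```python
import re

string = "soupnot$23.99dedarikjdf$44.65 notworryfence$98.44coyoteugle$33.94rock$2,300.00"
price_rx = re.compile(r'(?<=\$)\d+(?:,\d+)*(?:\.\d+)?')

prices = price_rx.findall(string)
print(prices)  # ['23.99', '44.65', '98.44', '33.94', '2,300.00']

# Assumption: commas are thousands separators, so they can be dropped
# before converting to float.
values = [float(p.replace(',', '')) for p in prices]
print(values)  # [23.99, 44.65, 98.44, 33.94, 2300.0]

# Made-up string with a number that is not a price and a trailing period.
other = "room 101 costs $45.50."
print(re.findall(r"[\d.,]+", other))  # ['101', '45.50.']
print(price_rx.findall(other))        # ['45.50']
```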

Regex for fraction mathematical expressions using python re module

I need a regex to parse a string that contains fractions and an operation [+, -, *, or /] and return a 5-element tuple containing the numerators, denominators, and operation, using the findall function in the re module.
Example: str = "15/9 + -9/5"
The output should be of the form [("15","9","+","-9","5")]
I was able to come up with this:
pattern = r'-?\d+|\s+\W\s+'
print(re.findall(pattern,str))
Which produces an output of ["15","9"," + ","-9","5"]. But after fiddling with this for some time, I cannot get this into a 5-element tuple, and I cannot match the operation without also matching the whitespace around it.
This pattern will work:
(-?\d+)\/(\d+)\s+([+\-*/])\s+(-?\d+)\/(\d+)
#lets walk through it
(-?\d+) #matches any group of digits that may or may not have a `-` sign to group 1
\/ #escape character to match `/`
(\d+) #matches any group of digits to group 2
\s+([+\-*/])\s+ #matches any '+,-,*,/' character and puts only that into group 3 (whitespace is not captured in group)
(-?\d+)\/(\d+) #exactly the same as group 1/2 for groups 4/5
demo for this:
>>> s = "15/9 + -9/5 6/12 * 2/3"
>>> re.findall(r'(-?\d+)\/(\d+)\s([+\-*/])\s(-?\d+)\/(\d+)',s)
[('15', '9', '+', '-9', '5'), ('6', '12', '*', '2', '3')]
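Once the five groups are captured, each tuple can be unpacked directly. As one possible use (not part of the question), the sketch below evaluates each expression with the standard-library fractions module; the names FRACTION_EXPR, OPS and evaluate are made up for illustration.

```python
import operator
import re
from fractions import Fraction

FRACTION_EXPR = re.compile(r'(-?\d+)/(\d+)\s+([+\-*/])\s+(-?\d+)/(\d+)')
OPS = {'+': operator.add, '-': operator.sub,
       '*': operator.mul, '/': operator.truediv}

def evaluate(expr):
    """Evaluate every 'a/b op c/d' expression found in expr."""
    results = []
    for n1, d1, op, n2, d2 in FRACTION_EXPR.findall(expr):
        left = Fraction(int(n1), int(d1))
        right = Fraction(int(n2), int(d2))
        results.append(OPS[op](left, right))
    return results

print(FRACTION_EXPR.findall("15/9 + -9/5"))  # [('15', '9', '+', '-9', '5')]
print(evaluate("15/9 + -9/5"))               # [Fraction(-2, 15)]
```

A zero denominator in the input would raise ZeroDivisionError here; the regex itself does not guard against it.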
A general way to tokenize a string based on a regexp is this:
import re
pattern = r"\s*(\d+|[/+*-])"
def tokens(x):
    return [m.group(1) for m in re.finditer(pattern, x)]
print(tokens("9 / 4 + 6 "))
Notes:
The regex begins with \s* to pass over any initial whitespace.
The part of the regex which matches the token is enclosed in parens to form a capture.
The different token patterns are in the capture separated by the alternation operation |.
Be careful about using \W since that will also match whitespace.
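Applied to the example from the question, the tokenizer behaves as sketched below. Two things are worth noting: a leading minus sign comes out as its own token, and characters that match no token pattern are skipped silently (the second input is a made-up example).

```python
import re

pattern = r"\s*(\d+|[/+*-])"

def tokens(x):
    return [m.group(1) for m in re.finditer(pattern, x)]

print(tokens("15/9 + -9/5"))
# ['15', '/', '9', '+', '-', '9', '/', '5']

# 'x' is not a recognized token, so finditer simply moves past it.
print(tokens("3 x 4"))
# ['3', '4']
```

Grouping the tokens into numerators, denominators and the operator would then be a separate parsing step.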
