Regex unexpected outcome - python
import re
pattern =r"[1-9][0-9]{0,2}(?:,\d{3})?(?:,\d{3})?"
string = '42 1,234 6,368,745 12,34,567 1234'
a = re.findall(pattern,string)
print(a)
What should I do to get the expected result?
Expected output:
['42', '1,234', '6,368,745']
Actual output:
['42', '1,234', '6,368,745', '12', '34,567', '123', '4']
I was trying to solve this quiz in a book.
How would you write a regex that matches a number with commas for every three digits? It must match the following:
• '42'
• '1,234'
• '6,368,745'
but not like the following:
• '12,34,567' (which has only two digits between the commas)
• '1234' (which lacks commas)
Your help would be much appreciated!
You may use
import re
pattern =r"(?<!\d,)(?<!\d)[1-9][0-9]{0,2}(?:,\d{3})*(?!,?\d)"
string = '42 1,234 6,368,745 12,34,567 1234'
a = re.findall(pattern,string)
print(a) # => ['42', '1,234', '6,368,745']
Regex details
(?<!\d,)(?<!\d) - no digit or digit + , allowed immediately to the left of the current location
[1-9][0-9]{0,2} - a non-zero digit followed with any zero, one or two digits
(?:,\d{3})* - 0 or more occurrences of a comma and then any three digits
(?!,?\d) - no digit or , + digit allowed immediately to the right of the current location.
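When the candidates are known to be separated by whitespace, an alternative is to validate each token as a whole instead of searching inside the string. The following is a minimal sketch under that assumption; the names NUMBER and valid are illustrative and not part of the original answer.

```python
import re

# Same core pattern as above, without lookarounds:
# fullmatch() already requires the whole token to match.
NUMBER = re.compile(r"[1-9][0-9]{0,2}(?:,\d{3})*")

string = '42 1,234 6,368,745 12,34,567 1234'

# Assumption: candidate numbers are separated by whitespace.
valid = [token for token in string.split() if NUMBER.fullmatch(token)]
print(valid)
```

This trades the lookarounds for a split step, so it only fits input where numbers never touch other text.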
You could use the following regular expression.
r'(?<![,\d])[1-9]\d{,2}(?:,\d{3})*(?![,\d])'
with re.findall.
Python's regex engine performs the following operations.
(?<! begin negative lookbehind
[,\d] match ',' or a digit
) end negative lookbehind
[1-9] match a digit other than '0'
\d{,2} match 0-2 digits (equivalent to \d{0,2})
(?: begin non-capture group
,\d{3} match ',' then 3 digits
) end non-capture group
* execute non-capture group 0+ times
(?![,\d]) previous match is not followed by ',' or a digit
(negative lookahead)
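The two patterns suggested above agree on the quiz string, but they appear to differ when a number is followed by a comma that is not part of it. A small sketch to compare them; the sentence input and the variable names are made up for illustration.

```python
import re

# Pattern from the first answer: rejects only "digit" or ", + digit" on the right.
first = r"(?<!\d,)(?<!\d)[1-9][0-9]{0,2}(?:,\d{3})*(?!,?\d)"
# Pattern from the second answer: rejects any ',' or digit on the right.
second = r"(?<![,\d])[1-9]\d{,2}(?:,\d{3})*(?![,\d])"

quiz = '42 1,234 6,368,745 12,34,567 1234'
sentence = 'Paid 1,234, then 42.'  # hypothetical input with a trailing comma

quiz_first = re.findall(first, quiz)
quiz_second = re.findall(second, quiz)
sentence_first = re.findall(first, sentence)
sentence_second = re.findall(second, sentence)

print(quiz_first, quiz_second)
print(sentence_first, sentence_second)
```

If numbers may be followed by punctuation commas in running text, the first pattern is the more permissive choice.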
Related
How to extract a specific type of number from a string using regex?
Consider this string:
text = ''' 4 500,5 12% 1,63% 568768,74832 days in between 34 cars in a row'''
As you can see, there are simple numbers, numbers with spaces in between, numbers with commas, and numbers with both. Thus, 4 500,5 is considered a standalone, separate number. Extracting the numbers with commas and spaces is easy, and I found this pattern:
pattern = re.compile(r'(\d+ )?\d+,\d+')
However, I am struggling to extract just the simple numbers like 12 and 34. I tried using (?!...) and [^...], but these options do not allow me to exclude the unwanted parts of other numbers.
((?:\d+ )?\d+,\d+)|(\d+(?! \d)) I believe this will do what you want (Regexr link: https://regexr.com/695tc) To capture "simple" numbers, it looks for [one or more digits], which are not followed by [a space and another digit]. I edited so that you can use capture groups appropriately, if desired.
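A minimal sketch of how the two capture groups could be used to separate the results. The sample text assumes the numbers sit on separate lines, and the list names decimals and simple are illustrative.

```python
import re

# Assumption: the original text had each item on its own line.
text = '4 500,5\n\n12%\n\n1,63%\n\n568768,74832 days in between\n\n34 cars in a row'

pattern = re.compile(r'((?:\d+ )?\d+,\d+)|(\d+(?! \d))')

decimals, simple = [], []
for m in pattern.finditer(text):
    # Exactly one of the two groups participates in each match;
    # the other one is None.
    if m.group(1) is not None:
        decimals.append(m.group(1))
    else:
        simple.append(m.group(2))

print(decimals)
print(simple)
```

Using finditer instead of findall avoids having to filter empty strings out of the result tuples.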
If you only want to match 12 and 34:
(?<!\S)\d+\b(?![^\S\n]*[,\d])
(?<!\S) - assert a whitespace boundary to the left
\d+\b - match 1+ digits and a word boundary
(?! - negative lookahead, assert that what is directly to the right is not
[^\S\n]*[,\d] - optional spaces (excluding newlines) followed by either , or a digit
) - close the lookahead
I'd suggest extracting all numbers first, then filtering those with a comma into a list of floats and those without a comma into a list of integers:
import re
text = '4 500,5\n\n12%\n\n1,63%\n\n568768,74832 days in between\n\n34 cars in a row'
number_rx = r'(?<!\d)(?:\d{1,3}(?:[ \xA0]\d{3})*|\d+)(?:,\d+)?(?!\d)'
number_list = re.findall(number_rx, text)
print('Float: ', [x for x in number_list if ',' in x])
# => Float:  ['4 500,5', '1,63', '568768,74832']
print('Integers: ', [x for x in number_list if ',' not in x])
# => Integers:  ['12', '34']
The regex matches:
(?<!\d) - a negative lookbehind that allows no digit immediately to the left of the current location
(?:\d{1,3}(?:[ \xA0]\d{3})*|\d+) - either of the two alternatives:
\d{1,3}(?:[ \xA0]\d{3})* - one, two or three digits, and then zero or more occurrences of a space / hard (non-breaking) space followed by three digits
| - or
\d+ - one or more digits
(?:,\d+)? - an optional sequence of , and then one or more digits
(?!\d) - a negative lookahead that allows no digit immediately to the right of the current location.
How do I write a Regex in Python to remove leading zeros for a number in the middle of a string
I have a string composed of letters followed by a number, and I need to remove all the letters, as well as the leading zeros in the number. For example: in the test string U012034, I want to match the U and the 0 at the beginning of 012034. So far I have [^0-9] to match any characters that aren't digits, but I can't figure out how to also remove the leading zeros in the number. I know I could do this in multiple steps with something like int(re.sub("[^0-9]", "", test_string)) but I need this process to be done in one regex.
You can use
re.sub(r'^\D*0*', '', text)
Details
^ - start of string
\D* - any zero or more non-digit chars
0* - zero or more zeros.
In Python:
import re
text = "U012034"
print( re.sub(r'^\D*0*', '', text) )
# => 12034
If there is more text after the first number, use
print( re.sub(r'^\D*0*(\d+).*', r'\1', text) )
Details:
^ - start of string
\D* - zero or more non-digits
0* - zero or more zeros
(\d+) - Group 1: one or more digits (use (\d+(?:\.\d+)?) to match float or int values)
.* - the rest of the string.
The replacement is the Group 1 value.
You may use this re.sub in Python:
string = re.sub(r'^[a-zA-Z]*0*|[a-zA-Z]+', '', string)
Explanation:
^: Start
[a-zA-Z]*: Match 0 or more letters
0*: Match 0 or more zeroes
|: OR
[a-zA-Z]+: Match 1+ letters
Does this do what you need?
>>> re.sub("[^0-9]+0*", "", "U0123")
'123'
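One edge case worth checking with the patterns above is a number made up only of zeros, where stripping every leading zero leaves an empty string. The sketch below contrasts the ^\D*0* pattern with a hypothetical variant that adds a lookahead so that one digit always remains; the function names are illustrative, and the variant is an assumption rather than something proposed in the answers.

```python
import re

def strip_prefix(text):
    # Pattern from the answer above: drop leading non-digits, then leading zeros.
    return re.sub(r'^\D*0*', '', text)

def strip_prefix_keep_zero(text):
    # Hypothetical variant: (?=\d) forces 0* to give one zero back
    # when nothing else would be left.
    return re.sub(r'^\D*0*(?=\d)', '', text)

print(strip_prefix("U012034"), strip_prefix_keep_zero("U012034"))
print(repr(strip_prefix("U000")), repr(strip_prefix_keep_zero("U000")))
```

Whether an all-zero input should become '' or '0' depends on what consumes the result afterwards.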
Regular expression from the first number block after an opening bracket. Repeat after closing the parenthesis
I've been trying to solve this problem for a while now, but I just can't get it right. I would like the following string to be split up so that I get the first block of numbers after the opening bracket. If another bracket is opened before the previous one is closed, the following numerical block is invalid. Test String: [(16908,76,(2585,0,0),(),()),(18404,74,(),(),()),(16823,66,(),(),()),(0,0,(),(),()),(16905,76,(),(),()),(16910,76,(),(),()),(16909,76,(2585,0,0),(),()),(16906,76,(1887,0,0),(),()),(16911,76,(1886,0,0),(),()),(16907,76,(1887,0,0),(),()),(19384,83,(),(),()),(19898,68,(),(),()),(13965,63,(),(),()),(11815,58,(),(),()),(13340,63,(849,0,0),(),()),(19896,65,(1900,0,0),(),()),(19910,65,(1900,0,0),(),()),(17069,69,(),(),()),(0,0,(),(),())],[] Valid number blocks: 16908, 18404, 16823, 16905, etc Invalid number blocks: 2585, 2585, 1887, etc The valid blocks should be displayed separated by commas. In this example the numbers have all five digits, but this can vary from 0 - 8 digits. The use of such (\d{0,8}) does not look very adequate to me. I am absolutely not a regex professional and would be happy about any kind of impulse or help that brings me to my goal. Thanks in advance.
I found a way to do it with two regexes:
from re import findall, search
text = "[(16908,76,(2585,0,0),(),()),(18404,74,(),(),()),(16823,66,(),(),()),(0,0,(),(),()),(16905,76,(),(),()),(16910,76,(),(),()),(16909,76,(2585,0,0),(),()),(16906,76,(1887,0,0),(),()),(16911,76,(1886,0,0),(),()),(16907,76,(1887,0,0),(),()),(19384,83,(),(),()),(19898,68,(),(),()),(13965,63,(),(),()),(11815,58,(),(),()),(13340,63,(849,0,0),(),()),(19896,65,(1900,0,0),(),()),(19910,65,(1900,0,0),(),()),(17069,69,(),(),()),(0,0,(),(),())],[]"
matches = findall(r'\(\w+(?!\().+?\)', text)  # find valid blocks
blocks = []
for match in matches:
    blocks.append(search(r'\d+', match).group())  # get the first number in every match (the block number)
print(blocks)
Output is:
['16908', '18404', '16823', '0', '16905', '16910', '16909', '16906', '16911', '16907', '19384', '19898', '13965', '11815', '13340', '19896', '19910', '17069', '0']
Is this the behavior you want?
Is this regex what you need?
r'\((\d{1,8}),\d+(?:,\(\d*,?\d*,?\d*\)){3}\)'
Demo: https://regex101.com/r/oz0bdE/1
Python code:
import re
string = '[(16908,76,(2585,0,0),(),()),(18404,74,(),(),()),(16823,66,(),(),()),(0,0,(),(),()),(16905,76,(),(),()),(16910,76,(),(),()),(16909,76,(2585,0,0),(),()),(16906,76,(1887,0,0),(),()),(16911,76,(1886,0,0),(),()),(16907,76,(1887,0,0),(),()),(19384,83,(),(),()),(19898,68,(),(),()),(13965,63,(),(),()),(11815,58,(),(),()),(13340,63,(849,0,0),(),()),(19896,65,(1900,0,0),(),()),(19910,65,(1900,0,0),(),()),(17069,69,(),(),()),(0,0,(),(),())],[]'
matches = re.findall(r'\((\d{1,8}),\d+(?:,\(\d*,?\d*,?\d*\)){3}\)', string)
print(matches)
Output:
['16908', '18404', '16823', '0', '16905', '16910', '16909', '16906', '16911', '16907', '19384', '19898', '13965', '11815', '13340', '19896', '19910', '17069', '0']
If you don't need to verify the structure of the string, you can match a very simple regular expression that reflects the observation that the strings of digits of interest are immediately preceded by a left parenthesis.
r'(?<=\()\d{1,8}'
(?<=\() is a positive lookbehind that asserts that the current position in the string is immediately preceded by a left parenthesis.
If you need to verify the structure of the string as well, you could use the following regular expression. I've assumed the string ends with "],[]". If that is not the case, an adjustment is of course necessary.
r'^\[(?:(?:(?<!\[),|)\(\d{1,8},\d+(?:,\((?:\d+(?:,\d+)+)?\))*\))*\],\[\]$'
For verification, Python's regex engine performs the following operations.
^ : match beginning of string
\[ : match '['
(?: : begin non-capture group
(?: : begin non-capture group
(?<!\[) : use negative lookbehind to assert current position is not preceded by '['
, : match ','
| : or
(nothing) : match an empty string
) : end non-capture group
\(\d{1,8},\d+ : match '(', 1-8 digits, ',', 1+ digits
(?: : begin non-capture group
,\( : match ',('
(?: : begin non-capture group
\d+ : match 1+ digits
(?:,\d+) : match ',', 1+ digits in a non-capture group
+ : execute non-capture group 1+ times
)? : end non-capture group and make it optional
\) : match ')'
)* : end non-capture group and execute it 0+ times
\) : match ')'
)* : end non-capture group and execute it 0+ times
\],\[\]$ : match '],[]' at end of string
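A quick check on a shortened version of the test string suggests that the simple lookbehind also picks up the first number of nested groups such as (2585,0,0), because those start right after a left parenthesis too. One possible adjustment is sketched below. It is an assumption, not part of the original answer, and it relies on nested groups always following a digit and a comma, as they do in the test data.

```python
import re

# Shortened version of the test string from the question.
text = "[(16908,76,(2585,0,0),(),()),(18404,74,(),(),()),(0,0,(),(),())],[]"

# Lookbehind from the answer above.
simple = re.findall(r'(?<=\()\d{1,8}', text)

# Hypothetical variant: additionally require that the '(' is not
# preceded by "digit + comma", which is where nested groups start here.
filtered = re.findall(r'(?<=\()(?<!\d,\()\d{1,8}', text)

print(simple)
print(filtered)
```

If the second or third nested group could be non-empty while the one before it is empty, this variant would no longer be sufficient, and the structural regexes in the other answers are the safer route.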
Extracting a string within a string and omitting the search string
I have a string:
string="soupnot$23.99dedarikjdf$44.65 notworryfence$98.44coyoteugle$33.94rock$2,300.00"
I want to extract the numbers 23.99, 44.65, 98.44, 33.94, 2,300.00. I have this regex:
\$(.*[^\s])
There are 2 issues with this:
• It returns the '$' sign. I only want the number.
• It only works when there is a space at the end of the number, but sometimes there might be letters, and it won't work in that case.
Thanks.
You can use a regex as shown:
import re
string="soupnot$23.99dedarikjdf$44.65 notworryfence$98.44coyoteugle$33.94rock$2,300.00"
res = re.findall(pattern=r"[\d.,]+", string=string)
Output:
['23.99', '44.65', '98.44', '33.94', '2,300.00']
Try this regex:
(?<=\$)\d+(?:,\d+)*(?:\.\d+)?
Explanation
(?<=\$) - positive lookbehind to find the position just preceded by a $
\d+ - matches 1+ occurrences of a digit
(?:,\d+)* - matches 0+ occurrences of a , followed by 1 or more digits
(?:\.\d+)? - matches a . followed by 1+ digits. The ? at the end makes this decimal part optional
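A short sketch comparing the two suggested patterns. The second input, noisy, is a made-up string that adds punctuation and a number without a dollar sign, to show where the character-class approach may over-match.

```python
import re

string = "soupnot$23.99dedarikjdf$44.65 notworryfence$98.44coyoteugle$33.94rock$2,300.00"
noisy = "rock$2,300.00, room 7."  # hypothetical input

char_class = r"[\d.,]+"
after_dollar = r"(?<=\$)\d+(?:,\d+)*(?:\.\d+)?"

prices = re.findall(after_dollar, string)
noisy_char_class = re.findall(char_class, noisy)
noisy_after_dollar = re.findall(after_dollar, noisy)

print(prices)
print(noisy_char_class)
print(noisy_after_dollar)
```

On the original string both patterns return the same five values; the difference only shows once other digits or trailing punctuation appear.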
Regex for fraction mathematical expressions using python re module
I need a regex to parse a string that contains fractions and an operation [+, -, *, or /] and to return a 5-element tuple containing the numerators, denominators, and operation, using the findall function in the re module.
Example: str = "15/9 + -9/5"
The output should be of the form [("15","9","+","-9","5")]
I was able to come up with this:
pattern = r'-?\d+|\s+\W\s+'
print(re.findall(pattern,str))
which produces an output of ["15","9"," + ","-9","5"]. But after fiddling with this for some time, I cannot get this into a 5-element tuple, and I cannot match the operation without also matching the whitespace around it.
This pattern will work:
(-?\d+)\/(\d+)\s+([+\-*/])\s+(-?\d+)\/(\d+)
# let's walk through it
(-?\d+) # matches any group of digits that may or may not have a `-` sign to group 1
\/ # escape character to match `/`
(\d+) # matches any group of digits to group 2
\s+([+\-*/])\s+ # matches any '+,-,*,/' character and puts only that into group 3 (whitespace is not captured in the group)
(-?\d+)\/(\d+) # exactly the same as groups 1/2 for groups 4/5
Demo:
>>> s = "15/9 + -9/5 6/12 * 2/3"
>>> re.findall(r'(-?\d+)\/(\d+)\s([+\-*/])\s(-?\d+)\/(\d+)',s)
[('15', '9', '+', '-9', '5'), ('6', '12', '*', '2', '3')]
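The walkthrough above maps naturally onto re.VERBOSE, which lets the comments live inside the pattern itself. This is a sketch of the same five-group pattern written that way; the name FRACTION_EXPR and the variable expr are illustrative.

```python
import re

# Same five groups as above; whitespace and '#' comments inside
# the pattern are ignored because of re.VERBOSE.
FRACTION_EXPR = re.compile(r"""
    (-?\d+) / (\d+)      # groups 1-2: first numerator and denominator
    \s+ ([+\-*/]) \s+    # group 3: the operator, surrounding whitespace not captured
    (-?\d+) / (\d+)      # groups 4-5: second numerator and denominator
""", re.VERBOSE)

expr = "15/9 + -9/5"
result = FRACTION_EXPR.findall(expr)
print(result)
```

Because the pattern has five capture groups, findall returns a list of 5-element tuples, which is the shape asked for in the question.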
A general way to tokenize a string based on a regexp is this:
import re
pattern = r"\s*(\d+|[/+*-])"
def tokens(x):
    return [ m.group(1) for m in re.finditer(pattern, x) ]
print(tokens("9 / 4 + 6 "))
Notes:
The regex begins with \s* to pass over any initial whitespace.
The part of the regex which matches the token is enclosed in parens to form a capture.
The different token patterns are in the capture, separated by the alternation operator |.
Be careful about using \W since that will also match whitespace.