Regular expression - number with spaces and decimal comma - python

I'd like to write a regular expression for the following type of strings in Python:
1 100
1 567 865
1 474 388 346
i.e. numbers with the thousands separated by spaces. Here's my regexp:
r"(\d{1,3}(?:\s*\d{3})*)"
and it works fine. However, I also wanna parse
1 100,34848
1 100 300,8
19 328 383 334,23499
i.e. separated numbers with decimal digits. I wrote
rr=r"(\d{1,3}(?:\s*\d{3})*)(,\d+)?\s"
It doesn't work. For instance, if I make
sentence = "jsjs 2 222,11 dhd"
re.findall(rr, sentence)
[('2 222', ',11')]
Any help appreciated, thanks.

This works:
import re
rr=r"(\d{1,3}(?:\s*\d{3})*(?:,\d+)?)"
sentence = "jsjs 2 222,11 dhd"
print(re.findall(rr, sentence)) # prints ['2 222,11']

TL;DR: With this regular expression, findall returns ['2 222,11']
r"(?:\d{1,3}(?:\s*\d{3})*)(?:,\d+)?"
findall returns the contents of the capturing groups (parentheses that do not start with (?:), or the whole match if there are no capturing groups.
So your first regex matches the string and returns the whole match, since its only capturing group wraps the entire pattern (the other parentheses start with (?:).
The second one finds the string 2 222,11 and matches it, then looks at the capturing groups, (\d{1,3}(?:\s*\d{3})*) and (,\d+), and returns a tuple containing them: the part before the decimal comma and the part after it.
So to fix your expression, either add ?: to all the parentheses or remove the ones you don't need for grouping.
Also, the final \s is redundant: quantifiers are greedy and match as many characters as possible, so ,\d+ already consumes all the digits after the comma.
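To make the group behaviour concrete, here is a minimal sketch that runs the same sentence through three variants of the pattern, with zero, one, and two capturing groups. The variable names are only illustrative.

```python
import re

sentence = "jsjs 2 222,11 dhd"

# No capturing groups: findall returns the whole match as a string.
no_groups = re.findall(r"\d{1,3}(?:\s*\d{3})*(?:,\d+)?", sentence)

# One capturing group: findall returns only that group's text.
one_group = re.findall(r"(\d{1,3}(?:\s*\d{3})*)(?:,\d+)?", sentence)

# Two capturing groups: findall returns one tuple per match.
two_groups = re.findall(r"(\d{1,3}(?:\s*\d{3})*)(,\d+)?", sentence)

print(no_groups)   # ['2 222,11']
print(one_group)   # ['2 222']
print(two_groups)  # [('2 222', ',11')]
```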

The only problem with your result is that you're getting two match groups instead of one. The only reason that's happening is that you're creating two capture groups instead of one. You're putting separate parentheses around the first half and the second half, and that's what parentheses mean. Just don't do that, and you won't have that problem.
So, with this, you're half-way there:
(\d{1,3}(?:\s*\d{3})*,\d+)\s
The only problem is that the ,\d+ part is now mandatory instead of optional. You obviously need somewhere to put the ?, as you were doing. But without a group, how do you do that? Simple: you can use a group, just make it a non-capturing group ((?:…) instead of (…)). And put it inside the main capturing group, not separate from it. Exactly as you're already doing for the repeated \s*\d{3} part.
(\d{1,3}(?:\s*\d{3})*(?:,\d+)?)\s
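As a rough check, the sketch below runs that pattern over the sample numbers from the question. It drops the trailing \s (an assumption on my part, so that a number at the very end of a string still matches), and to_float is a hypothetical helper, not something from the question.

```python
import re

# Same pattern as above, minus the trailing \s.
NUMBER = re.compile(r"\d{1,3}(?:\s*\d{3})*(?:,\d+)?")

samples = ["1 100", "1 567 865", "1 100,34848", "19 328 383 334,23499"]

# Embed each sample in some surrounding text and extract it again.
found = [NUMBER.findall("total: " + s + " units") for s in samples]
print(found)

def to_float(text):
    """Hypothetical helper: strip the spaces and use a dot as the decimal mark."""
    return float(re.sub(r"\s+", "", text).replace(",", "."))

print(to_float("1 100,34848"))  # 1100.34848
```

Each sample should come back unchanged as the single match for its sentence.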

Related

Avoid special values or space between values using python re

For any phone number which allows () in the area code and any space between area code and the 4th number, I want to create a tuple of the 3 sets of numbers.
For example: (301) 556-9018 or (301)556-9018 would return ('301','556','9018').
I will raise a ValueError exception if the input is anything other than the original format.
How do I avoid () characters and include either \s or none between the area code and the next values?
This is my foundation so far:
phonenum=re.compile('''([\d)]+)\s([\d]+) - ([\d]+)$''',re.VERBOSE).match('(123) 324244-123').groups()
print(phonenum)
Do I need to write an if-then statement to ignore the () for the first tuple element, or is there a regular expression that does that more efficiently?
In addition the \s in between the first 2 tuples doesn't work if it's (301)556-9018.
Any hints on how to approach this?
When specifying a regular expression, you should use raw-string mode:
`r'abc'` instead of `'abc'`
That said, right now you are capturing three sets of numbers in groups. To allow parens, you will need to match parens. (The parens you currently have are for the capturing groups.)
You can match parens by escaping them: \( and \)
You can find various solutions to "what is a regex for XXX" by searching one of the many "regex library" web sites. I was able to find this one via DuckDuckGo: http://www.regexlib.com/Search.aspx?k=phone
To make a part of your pattern optional, you can make the individual pieces optional, or you can provide alternatives with the piece present or absent.
Since the parens have to be present or absent together - that is, you don't want to allow an opening paren but no closing paren - you probably want to provide alternatives:
# number, no parens: 800 555-1212
noparens = r'\d{3}\s+\d{3}-\d{4}'
# number with parens: (800) 555-1212
yesparens = r'\(\d{3}\)\s*\d{3}-\d{4}'
You can match the three pieces by inserting "grouping parens":
noparens_grouped = r'(\d{3})\s+(\d{3})-(\d{4})'
yesparens_grouped = r'\((\d{3})\)\s*(\d{3})-(\d{4})'
Note that the quoted parens go outside of the grouping parens, so that the parens do not become part of the captured group.
You can join the alternatives together with the | operator:
yes_or_no_parens_groups = noparens_grouped + '|' + yesparens_grouped
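One thing to watch for with the joined pattern: it now has six capturing groups, and the three belonging to the alternative that did not match come back as None. A minimal sketch of one way to handle that follows; parse_phone is a hypothetical helper name, and raising ValueError follows the requirement stated in the question.

```python
import re

noparens_grouped = r'(\d{3})\s+(\d{3})-(\d{4})'
yesparens_grouped = r'\((\d{3})\)\s*(\d{3})-(\d{4})'
PHONE = re.compile(noparens_grouped + '|' + yesparens_grouped)

def parse_phone(text):
    """Return (area, prefix, line) or raise ValueError (hypothetical helper)."""
    m = PHONE.fullmatch(text)
    if m is None:
        raise ValueError('Invalid phone number: %r' % text)
    # Six groups exist; the three from the unused alternative are None.
    return tuple(g for g in m.groups() if g is not None)

print(parse_phone('(301) 556-9018'))  # ('301', '556', '9018')
print(parse_phone('(301)556-9018'))   # ('301', '556', '9018')
```

Using fullmatch rather than match rejects trailing junk as well as a missing closing paren.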
In regular expressions you can use special characters to specify some behavior of some part of the expression.
From python re documentation:
'*' =
Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible. ab* will match ‘a’, ‘ab’, or ‘a’ followed by any number of ‘b’s.
'+' =
Causes the resulting RE to match 1 or more repetitions of the preceding RE. ab+ will match ‘a’ followed by any non-zero number of ‘b’s; it will not match just ‘a’.
'?' =
Causes the resulting RE to match 0 or 1 repetitions of the preceding RE. ab? will match either ‘a’ or ‘ab’.
So to solve the blank space problem you can use either '?' if you know the occurrence will be no more than 1, or '+' if you can have more than 1.
To group the pieces together and return them as a tuple, put each piece of your expression inside parentheses and then call the groups() method of the match object.
The result would be:
import re
results = re.search(r'\((\d{3})\)\s?(\d{3})-(\d{4})', '(301) 556-9018')
if results:
    print(results.groups())
else:
    print('Invalid phone number')

Look Around and re.sub()

I want to know how re.sub() works.
The following example is in a book I am reading.
I want "1234567890" to be "1,234,567,890".
pattern = re.compile(r"\d{1,3}(?=(\d{3})+(?!\d))")
pattern.sub(r"\g<0>,", "1234567890")
"1,234,567,890"
Then, I changed "\g<0>" to "\g<1>" and it did not work.
The result was "890,890,890,890".
Why?
I want to know exactly how the capturing and replacing of re.sub() and the lookahead mechanism work.
You have 890 repeated because it is Group 1 (= \g<1>), and you replace every 3 digits with the last captured Group 1 (which is 890).
One more thing here is (\d{3})+ that also captures groups of 3 digits one by one until the end (because of the (?!\d) condition), and places only the last captured group of characters into Group 1. And you are using it to replace each 3-digit chunks in the input string.
See visualization at regex101.com.
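The same thing can be seen without a visualizer by listing what each match consumes and what ends up in Group 1. This is only a small sketch around the pattern from the question:

```python
import re

pattern = re.compile(r"\d{1,3}(?=(\d{3})+(?!\d))")

# Each match consumes only the digits before the lookahead; group 1 holds
# whatever the last repetition of (\d{3}) captured inside the lookahead.
matches = [(m.group(0), m.group(1)) for m in pattern.finditer("1234567890")]
print(matches)  # [('1', '890'), ('234', '890'), ('567', '890')]

with_g0 = pattern.sub(r"\g<0>,", "1234567890")
with_g1 = pattern.sub(r"\g<1>,", "1234567890")
print(with_g0)  # 1,234,567,890
print(with_g1)  # 890,890,890,890
```

The final 890 is never matched (nothing follows it for the lookahead to find), so it is copied through unchanged in both cases.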

Regex Price Matching

I have a webscraper that scrapes prices, for that I need it to find following prices in strings:
762,50
1.843,75
In my first naive implementation, I didn't take the . into consideration and matched the first number with this regex perfectly:
re.findall("\d+,\d+", string)[0]
Now I need to match both cases and my initial idea was this:
re.findall("(\d+.\d+,\d+|\d+,\d+)", string)[0]
The idea was that the or operator would find either the first or the second form, but it doesn't work. Any suggestions?
No need to use an or; just add the first part as an optional parameter:
(?:\d+\.)?\d+,\d+
The ? after (?:\d+\.) makes it an optional parameter.
The '?:' indicates that this group should not be captured, just matched.
>>> re.findall(r'(?:\d+\.)?\d+,\d+', '1.843,75 762,50')
['1.843,75', '762,50']
Also note that you have to escape the . (dot) that would match any character except a newline (see http://docs.python.org/2/library/re.html#regular-expression-syntax)
In regular expressions, a dot (.) matches any character (except a newline, unless the DOTALL flag is set). Escape it to match . literally:
\d+\.\d+,\d+|\d+,\d+
   ^^
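A quick way to see the difference is to feed both versions a string in which something other than a dot sits where the separator should be. The test string here is made up for illustration:

```python
import re

text = '1x843,75 1.843,75'

# Unescaped dot: any character is accepted as the "separator".
unescaped = re.findall(r'\d+.\d+,\d+', text)

# Escaped dot: only a literal '.' is accepted.
escaped = re.findall(r'\d+\.\d+,\d+', text)

print(unescaped)  # ['1x843,75', '1.843,75']
print(escaped)    # ['1.843,75']
```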
To match multiple leading digits, the regular expression should be:
>>> re.findall(r'(?:\d+\.)*\d+,\d+', '1,23 1.843,75 123.456.762,50')
['1,23', '1.843,75', '123.456.762,50']
NOTE: I used a non-capturing group because re.findall returns a list of groups if one or more groups are present in the pattern.
UPDATE
>>> re.findall(r'(?<![\d.])\d{1,3}(?:\.\d{3})*,\d+',
... '1,23 1.843,75 123.456.762,50 1.2.3.4.5.6.789,123')
['1,23', '1.843,75', '123.456.762,50']
How about:
(\d+[,.]\d+(?:[.,]\d+)?)
Matches:
- some digits followed by , or . and some digits
OR
- some digits followed by , or . and some digits followed by , or . and some digits
It matches: 762,50 and 1.843,75 and 1,75
It will also match 1.843.75. Are you OK with that?
I'd use this:
\d{1,3}(?:\.\d{3})*,\d\d
This will match numbers that have a dot as the thousands separator:
\d*\.?\d{3},\d{2}
This might be slower than regex, but given that the strings you are parsing are probably short, it should not matter.
Since the solution below does not use regex, it is simpler, and you can be more sure you are finding valid floats. Moreover, it parses the digit-strings into Python floats which is probably the next step you intend to perform anyway.
import locale
locale.setlocale(locale.LC_ALL, 'en_DK.UTF-8')
def float_filter(iterable):
    # Keep only the items that locale.atof can parse as a number.
    result = []
    for item in iterable:
        try:
            result.append(locale.atof(item))
        except ValueError:
            pass
    return result
text = 'The price is 762,50 kroner'
print(float_filter(text.split()))
yields
[762.5]
The basic idea: by setting a Danish locale, locale.atof parses commas as the decimal marker and dots as the grouping separator.
In [107]: import locale
In [108]: locale.setlocale(locale.LC_ALL, 'en_DK.UTF-8')
Out[108]: 'en_DK.UTF-8'
In [109]: locale.atof('762,50')
Out[109]: 762.5
In [110]: locale.atof('1.843,75')
Out[110]: 1843.75
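If the en_DK locale is not installed on the machine (setlocale raises locale.Error in that case), a locale-free variant is possible. The sketch below reuses the stricter pattern from the UPDATE above and converts with plain string replacement; extract_prices is a hypothetical helper name, and the approach assumes dots are only ever grouping separators.

```python
import re

PRICE = re.compile(r'(?<![\d.])\d{1,3}(?:\.\d{3})*,\d+')

def extract_prices(text):
    """Hypothetical helper: find prices and convert them to floats."""
    # Drop the grouping dots, then turn the decimal comma into a dot.
    return [float(p.replace('.', '').replace(',', '.'))
            for p in PRICE.findall(text)]

prices = extract_prices('The price is 762,50 kroner, was 1.843,75')
print(prices)  # [762.5, 1843.75]
```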
In general, you have zero or more groups of the form XXX. followed by one group of the form XXX, (each up to 3 digits), always followed by two digits. Do you also want to support numbers like 1,375 (without 'cents')? You also need to avoid some false detection cases.
That looks like this:
matcher=r'((?:(?:(?:\d{1,3}\.)?(?:\d{3}\.)*\d{3}\,)|(?:(?<![.0-9])\d{1,3},))\d\d)'
re.findall(matcher, '1.843,75 762,50')
This detects a lot of boundary cases, but may not catch everything....

Regex search to extract float from string. Python

import re
sequence = 'i have -0.03 dollars in my hand'
m = re.search('(have )(-\w[.]+)( dollars\w+)',sequence)
print m.group(0)
print m.group(1)
print m.group(2)
Looking for a way to extract text between two occurrences. In this case, the format is 'i have ' followed by - floats and then followed by ' dollars\w+'
How do i use re.search to extract this float ?
Why don't the groups work this way ? I know there's something I can tweak to get it to work with these groups. any help would be greatly appreciated
I thought I could use groups with parentheses, but I got an error.
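-\w[.]+ does not match -0.03: \w matches a single word character, and [.]+ matches only literal dots (a . inside [...] is not special), so the digits after the dot are never matched.
The \w+ after dollars also prevents the pattern from matching the sequence: there is no word character directly after dollars.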
Use (-?\d+\.\d+) as pattern:
import re
sequence = 'i have -0.03 dollars in my hand'
m = re.search(r'(have )(-?\d+\.\d+)( dollars)', sequence)
print(m.group(1)) # captured groups start from `1`.
print(m.group(2))
print(m.group(3))
BTW, captured group numbers start from 1. (group(0) returns entire matched string)
Your regex doesn't match for several reasons:
it always requires a - (OK in this case, questionable in general)
it requires exactly one digit before the . (and it even allows non-digits like A).
it allows any number of dots, but no more digits after the dots.
it requires one or more alphanumerics immediately after dollars.
So it would match "I have -X.... dollarsFOO in my hand" but not "I have 0.10 dollars in my hand".
Also, there is no use in putting fixed texts into capturing parentheses.
m = re.search(r'\bhave (-?\d+\.\d+) dollars\b', sequence)
would make much more sense.
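Wrapped up as a small function, that could look like the sketch below. dollars_in is a hypothetical name, and returning None when nothing matches is just one possible choice:

```python
import re

def dollars_in(sentence):
    """Return the amount as a float, or None if the sentence does not match."""
    m = re.search(r'\bhave (-?\d+\.\d+) dollars\b', sentence)
    # group(1) is the only capturing group: the signed decimal number.
    return float(m.group(1)) if m else None

print(dollars_in('i have -0.03 dollars in my hand'))  # -0.03
print(dollars_in('i have no dollars'))                # None
```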
This question has already been asked in many formulations before. You're looking for a regular expression that will find a number. Since number formats may include decimals, commas, exponents, plus/minus signs, and leading zeros, you'll need a robust regular expression. Fortunately, this regular expression has already been written for you.
See How to extract a floating number from a string and Regular expression to match numbers with or without commas and decimals in text

Python regex matching only if digit

Given the regex and the word below, I want to match the part after the - (which can also be a _ or a space) only if the part after the delimiter is a digit and nothing comes after it (I basically want it to be a number and a number only). I am using group statements but it just doesn't seem to work right. It keeps matching the 3 at the beginning (or the 1 at the end if I modify it a bit). How do I achieve this (by using grouping)?
Target word: BR0227-3G1
Regex: ([A-Z]*\s?[0-9]*)[\s_-]*([1-9][1-9]*)
It should not match 3G1, G1 , 1G
It should match only pure numbers like 3,10, 2 etc.
Here is also a helper web site for evaluating the regex: http://www.pythonregex.com/
More examples:
It should match:
BR0227-3
BR0227 3
BR0227_3
into groups (BR0227) (3)
It should only match (BR0227) for
BR0227-3G1
BR0227-CS
BR0227
BR0227-
I would use
re.findall(r'^([A-Z]*\s?[0-9]*)[\s_-]*([1-9][1-9]*$)?', word)
Each string starts with the first group and ends with the last group, so the ^ and $ anchors can assist in the capture. The $ at the end requires the digits to run to the end of the string, but the group is optional, so the first group can still be captured.
Since you want the start and (possible) end of the word in groups, then do this:
r'\b([A-Z0-9]+)(?:[ _-](\d+))?\b'
This will put the first part of the word in the first group, and optionally the remainder in the second group. The second group will be None if it didn't match.
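To check this against the examples in the question, a short sketch like the following can help. The list of words is taken from the question, plus one extra case (BR0227-10) added here to cover a multi-digit number.

```python
import re

WORD = re.compile(r'\b([A-Z0-9]+)(?:[ _-](\d+))?\b')

words = ['BR0227-3', 'BR0227 3', 'BR0227_3', 'BR0227-10',
         'BR0227-3G1', 'BR0227-CS', 'BR0227', 'BR0227-']

# Map each word to its (first part, number-or-None) pair.
results = {w: WORD.match(w).groups() for w in words}
for w in words:
    print(w, results[w])
```

The first four words give ('BR0227', '3') or ('BR0227', '10'); the last four give ('BR0227', None).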
This should match anything followed by '-', ' ', or '_' with only digits after it.
(.*)[- _](\d+)
