Regex: thousands separator is either , or . (Python)

I'm using a Python regex on numbers that use either , or . as the thousands separator. If , is the thousands separator, then . is the decimal separator, and vice versa. The one good thing is that there are always exactly two decimal digits.
I need to parse these numbers, and I don't care about the decimal part, so I would like to extract the following. Can someone smarter than me help? This is giving me a headache.
111.112.123,55 -> 111112123
123.44 -> 123
123,353,123.55 -> 123353123
21,23 -> 21
152.00 -> 152

You may use the following pattern:
[,.]\d+$|[.,]
[,.] Character set for either , or ..
\d+$ Digits at end of string.
| Alternation (OR).
[.,] Character set for either , or ..
Regex demo here.
Python demo:
import re
mynumbers=['111.112.123,55','123.44','123,353,123.55','21,23','152.00']
for number in mynumbers:
    print(re.sub(r'[,.]\d+$|[.,]','',number))
Prints:
111112123
123
123353123
21
152
You may alternatively use a more restrictive pattern if you are working with text:
[.,]\d+$|(?<=\d{3})[.,]
Regex demo here.
Python:
mytext = '''
111.112.123,55
123.44
123,353,123.55
21,23
152.00
Text, and punctuation.
'''
for line in mytext.splitlines():
    print(re.sub(r'[.,]\d+$|(?<=\d{3})[.,]','',line))
Prints:
111112123
123
123353123
21
152
Text, and punctuation.
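One caveat with the more restrictive pattern, as far as I can tell: the lookbehind (?<=\d{3}) needs three digits before the separator, so a number whose leading group is shorter, such as 1.234,55 (my own example, not from the question), keeps its thousands separator. A possible variation, offered as a sketch and not as the answerer's pattern, looks ahead for three digits instead. The names strip_with, lookbehind_pattern and lookahead_pattern are hypothetical.

```python
import re

# Pattern from the answer above: a separator is removed only when three digits precede it.
lookbehind_pattern = r'[.,]\d+$|(?<=\d{3})[.,]'
# Hypothetical variation: a separator is removed when three digits follow it.
lookahead_pattern = r'[.,]\d+$|[.,](?=\d{3})'

def strip_with(pattern, text):
    """Remove every match of pattern from text."""
    return re.sub(pattern, '', text)

for sample in ['1.234,55', '111.112.123,55', '21,23', 'Text, and punctuation.']:
    print(sample, '->', strip_with(lookbehind_pattern, sample),
          '|', strip_with(lookahead_pattern, sample))
```

Both patterns agree on the samples from the question; they differ only on 1.234,55, where the lookbehind version returns 1.234 and the lookahead version returns 1234.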

Assuming you are dealing with strings that only contain one number, you can use this pattern:
re.sub(r'[.,](?:\d\d$)?', '', s)
(a , or a . optionally followed by 2 digits and the end of the string.)
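A minimal sketch of that one-liner applied to the numbers from the question, assuming each string holds exactly one number; the helper name integer_part is made up for illustration:

```python
import re

def integer_part(s):
    """Drop every separator, plus the two decimal digits at the end of the string."""
    return re.sub(r'[.,](?:\d\d$)?', '', s)

for s in ['111.112.123,55', '123.44', '123,353,123.55', '21,23', '152.00']:
    print(s, '->', integer_part(s))
```

If a number is needed instead of a string, the result can be passed to int().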

You could capture one or more digits in a capturing group (\d+), followed by a character class [.,] that matches either a dot or a comma.
To match the digits at the end, you could use an optional non-capturing group (?:\d+$)? that matches one or more digits followed by the end of the line. You might start the pattern with a word boundary to prevent the match from starting in the middle of a longer word.
In the replacement, use the first capturing group \1:
\b(\d+)[.,](?:\d+$)?
Regex demo
Python demo
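Since the linked demos are not reproduced here, the following is a small sketch of how that pattern could be used with re.sub; the function name integer_part is an assumption of mine:

```python
import re

def integer_part(number):
    # \1 keeps the captured digits; the separator and any trailing decimals are dropped.
    return re.sub(r'\b(\d+)[.,](?:\d+$)?', r'\1', number)

for number in ['111.112.123,55', '123.44', '123,353,123.55', '21,23', '152.00']:
    print(number, '->', integer_part(number))
```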

Related

How can I write a regex that finds everything but 4 digit numbers like 2000 or 1990 or 1234?

I have a text like this:
Film_relase_date:1970_films_by_20th_Century_Fox
I would like to create a regex that matches all text except 1970, resulting in:
Film_relase_date:_films_by_20th_Century_Fox
I tried with the regex:
[^\d{4}]
But this regex returns:
Film_relase_date:_films_by_th_Century_Fox
And therefore also excludes the 20 which instead I would like to be matched.
How can I improve the regex?
EDIT:
I want to use this regex to do something like:
x = 'Film_relase_date: 1970_films_by_20th_Century_Fox'
REPLACE (x, "Anything that is not a 4-digit number", "Non-Space") = 1970
Remember that {4} is supposed to be added after the character class, not inside.
Anyway, if you want to match "all text except 1970", you can use the following regex:
([^\d]|(?<!\d)\d(?!\d{3}(?!\d))\d*)?
see demo.
This regex matches:
a non-digit character or
a digit char that is not preceded by another digit and is not followed by exactly 3 more digits, together with any digits that follow it
If you want to match all except 4 digits, I would suggest an unrolled version matching either 1-3 or 5 digits asserting not followed by a digit to prevent consecutive matching digits.
If you don't want to cross newlines, you could use [^\d\r\n] instead of \D
\D+(?:(?:\d{1,3}|\d{5,})(?!\d)\D*)*
Explanation
\D+ Match 1+ non digits
(?: Non capture group
(?:\d{1,3}|\d{5,}) Match either 1-3 or 5 or more digits
(?!\d)\D* Negative lookahead, assert not a digit directly to the right followed by matching optional non digits
)* Close the non capture group and repeat 0+ times
Regex demo
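As a rough illustration of the unrolled pattern on the text from the question: re.findall returns the pieces that are not 4-digit numbers, and joining them gives the desired string. The variable names are mine.

```python
import re

x = 'Film_relase_date:1970_films_by_20th_Century_Fox'
pattern = re.compile(r'\D+(?:(?:\d{1,3}|\d{5,})(?!\d)\D*)*')

# Each match is a stretch of text that contains no 4-digit number.
pieces = pattern.findall(x)
print(pieces)           # ['Film_relase_date:', '_films_by_20th_Century_Fox']
print(''.join(pieces))  # Film_relase_date:_films_by_20th_Century_Fox
```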
Note that if you want to match 4 digits only, you could perhaps extract the 4 digits using (?<!\d)\d{4}(?!\d) instead of replacing with an empty string.
See another regex demo
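The simpler route mentioned in that note can be sketched as follows: the same lookaround pattern either extracts the 4-digit number or removes it. Variable names are assumptions for illustration.

```python
import re

x = 'Film_relase_date:1970_films_by_20th_Century_Fox'
# Exactly four digits, with no digit directly before or after them.
four_digits = re.compile(r'(?<!\d)\d{4}(?!\d)')

print(four_digits.findall(x))  # ['1970']
print(four_digits.sub('', x))  # Film_relase_date:_films_by_20th_Century_Fox
```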

How to use a regular expression for this example?

I am trying to write regular expression for this line:
- 5.0 - 4.0 - 3.0 ... + 12.0
so that it groups the floats with their sign in a single group (-5.0,-4.0...).
I have tried:
\s*([+](?:\s)*\d*[.])
But apparently it does not ignore the non-capturing group inside the capturing group.
Any suggestion how this could be solved?
According to your requirement:
That It could group floats with sign in a single group (-5.0,-4.0...)
The solution using re.findall() function:
import re
s = '- 5.0 - 4.0 - 3.0 ... + 12.0'
signed_floats = [re.sub(r'\s+', r'', f) for f in re.findall(r'-\s*\d+\.\d+\b', s)]
print(signed_floats)
The output:
['-5.0', '-4.0', '-3.0']
Your capture group has the following elements:
[+] matches a literal +
(?:\s)* matches any number of whitespace characters
\d* matches any number of digits
[.] matches a literal .
So right now, that matches a plus sign followed by space followed by digits followed by a decimal point. But it sounds like you want to match several sign-space-digits-decimalpoint-digits sequences in a row, as long as they have the same sign. I'd do that like this:
Start with the expression to match a single such sequence:
[+-]\s*\d+[.]\d+
This matches plus or minus, then space, then digits, decimal point, digits.
You'll want to save the sign to make sure that the rest of the pattern only matches sequences with the same sign. So make a capturing group.
([+-])\s*\d+[.]\d+
Now let's repeat the pattern (with some intervening space) to match another group, except that we want to make sure the sign is the same so we use a backreference.
([+-])\s*\d+[.]\d+\s*\1\s*\d+[.]\d+
The \1 matches whatever was matched by capturing group number 1. In this case, that's the sign, + or -. This pattern will match two consecutive sequences that have the same sign.
Now change the second part of the pattern to match zero or more additional sequences.
([+-])\s*\d+[.]\d+(?:\s*\1\s*\d+[.]\d+)*
Finally, you can allow for spaces before and after the match. This can be solved with judicious use of the search function, or findall, rather than match. You can then use match_object.group() with no arguments to get the sequence that was matched, which is what you want.
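A short sketch of the final pattern in use, on the line from the question. One thing to watch: because the pattern contains a capturing group, re.findall returns only the captured signs, so search or finditer with group() is the way to get the whole run.

```python
import re

s = '- 5.0 - 4.0 - 3.0 ... + 12.0'
pattern = re.compile(r'([+-])\s*\d+[.]\d+(?:\s*\1\s*\d+[.]\d+)*')

first = pattern.search(s)
print(first.group())   # - 5.0 - 4.0 - 3.0
print(first.group(1))  # -

# Every run of same-signed floats in the string.
runs = [m.group() for m in pattern.finditer(s)]
print(runs)            # ['- 5.0 - 4.0 - 3.0', '+ 12.0']

# findall only reports the capturing group, i.e. the sign of each run.
print(pattern.findall(s))  # ['-', '+']
```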
Here's something you can try:
(\+|-)\s*(\d+\.\d+)\s*
Although, you will always have a trailing comma, so you'd have to remove it.
Here is a demo
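The linked demo is not reproduced here, so this is only a guess at how the two groups might be combined in Python: re.findall returns (sign, number) tuples, which can be joined without producing any trailing separator.

```python
import re

s = '- 5.0 - 4.0 - 3.0 ... + 12.0'
# With two capturing groups, findall yields one (sign, digits) tuple per match.
pairs = re.findall(r'(\+|-)\s*(\d+\.\d+)\s*', s)
signed_floats = [sign + digits for sign, digits in pairs]
print(signed_floats)  # ['-5.0', '-4.0', '-3.0', '+12.0']
```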

RegEx/Python: optional whitespace not found

I've got a really weird problem. My (Python) regex looks like this:
u'^.*(?:Grundfl|gfl|wfl|wohnfl|whg|wohnung).*(\s\d{1,3}[.,]?\d{1,2}?)\s*(?:m\u00B2|qm)'
In a re.findall() call, this should produce two matches for the following text: "...from 71m² to 83m²"
However, only 83 is matched. The problem has something to do with the optional whitespace between the number (\s\d{1,3}[.,]?\d{1,2}?) and the square meters (?:m\u00B2|qm), because when I delete the \s*, only 71 is matched as expected. I have no idea what is wrong with my regex.
Thanks for your help!
Why don't you try using a positive lookahead? This will match 1+ digits (with an optional comma inside), as long as there is m² or qm after it. There is an optional space between the numbers and the unit:
>>> import re
>>> re.findall(r"[\d,]+(?=\s{0,4}(?:m\u00B2|qm))", "from 71m² to 83m²")
['71', '83']
>>> re.findall(r"[\d,]+(?=\s{0,4}(?:m\u00B2|qm))", "from 71,56 m² to 837,78 qm")
['71,56', '837,78']
>>>
It does not take into account the words you have specified, but you can easily add that part back in. Note, however, that re.findall() returns non-overlapping matches, so if you anchor the pattern at the start of the string with ^, it can only ever return one match: that single match consumes everything up to the number it finds, leaving nothing for a second match.
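To make the effect of the anchor concrete, here is a sketch with two deliberately simplified patterns of my own (not the asker's full regex), assuming Python 3, where the re module understands \u escapes in str patterns:

```python
import re

text = "from 71m\u00b2 to 83m\u00b2"

# Anchored: the match must start at position 0, so there is at most one match,
# and the greedy .* pushes the capture to the last number.
anchored = re.findall(r'^.*\s(\d+)\s*m\u00b2', text)
print(anchored)    # ['83']

# Unanchored: findall can restart after each match and finds both numbers.
unanchored = re.findall(r'(\d+)\s*m\u00b2', text)
print(unanchored)  # ['71', '83']
```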
You can use the following regex with re.findall:
(\d*[.,]?\d+)\s*(?:m\u00B2|qm)
See the regex demo. re.findall will only return a list of Group 1 captured values.
Pattern details:
(\d*[.,]?\d+) - Group 1 containing the integer or float number: 0+ digits, followed with 1 or 0 . or , followed with 1+ digits
\s* - 0+ whitespaces
(?:m\u00B2|qm) - either m² or qm.
See Python demo:
# -*- coding: utf-8 -*-
import re
p = re.compile(u'(\d*[.,]?\d+)\s*(?:m\u00B2|qm)')
s = u"wohnung from 71,556m² to 183.4456m²"
print(p.findall(s)) # => [u'71,556', u'183.4456']

python regex look ahead positive + negative

This regex returns 456. My question is why the match CANNOT be 234, as in 1-234-56. Does 56 satisfy the (?!\d) pattern, since it is NOT a single digit? Where does (?!\d) start looking?
import re
pattern = re.compile(r'\d{1,3}(?=(\d{3})+(?!\d))')
a = pattern.findall("The number is: 123456") ; print(a)
This is the first step towards adding a comma separator, as in 123,456.
a = pattern.findall("The number is: 123456") ; print(a)
results = pattern.finditer('123456')
for result in results:
    print(result.start(), result.end(), result)
My question is why it CANNOT be 234 from 1-234-56?
It is not possible because (?=(\d{3})+(?!\d)) requires that only complete 3-digit groups follow the 1-3-digit sequence. 56 (the last digit group in your imagined scenario) is a 2-digit group. Since a quantifier can be either lazy or greedy, you cannot match one-, two- and three-digit groups at the same time with \d{1,3}. To get 234 from 123456, you'd need a specifically tailored regex for it: \B\d{3}, or (?<=1)\d{3} or even \d{3}(?=\d{2}(?!\d)).
Does 56 match the (?!\d)) pattern? Where is the beginning point that (?!\d)) will look for?
No, this is a negative lookahead: it does not match anything, it only checks that there is no digit right after the current position in the input string. If there is a digit, the match fails (no result is found or returned).
More clarification on the look-ahead: it is located after (\d{3})+ subpattern, thus the regex engine starts searching for a digit right after the last 3-digit group, and fails a match if the digit is found (as it is a negative lookahead). In plain words, the (?!\d) is a number closing/trailing boundary in this regex.
A more detailed breakdown:
\d{1,3} - 1 to 3 digit sequence, as many as possible (greedy quantifier is used)
(?=(\d{3})+(?!\d)) - a positive look-ahead ((?=...)) that checks if the 1-3 digit sequence matched before is followed by
(\d{3})+ - 1 or more (+) sequences of exactly 3 digits...
(?!\d) - not followed by a digit.
Lookaheads do not match and do not consume characters, but you can still capture inside them. When a lookahead is executed, the regex index stays at the same character as before. With your regex and input, you match 123 with \d{1,3}, because a 3-digit sequence (456) follows it. But 456 is captured within the lookahead, and re.findall returns only the captured texts if capturing groups are set.
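The difference between what is matched and what is captured can be seen by comparing findall with finditer on the pattern from the question (a small sketch; the variable names are mine):

```python
import re

pattern = re.compile(r'\d{1,3}(?=(\d{3})+(?!\d))')
text = "The number is: 123456"

# findall reports group 1, which was captured inside the lookahead.
print(pattern.findall(text))  # ['456']

# finditer exposes the actual match (group 0) next to the captured group.
matches = [(m.group(0), m.group(1)) for m in pattern.finditer(text)]
print(matches)                # [('123', '456')]
```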
To just add comma as digit grouping symbol, use
rx = r'\d(?=(?:\d{3})+(?!\d))'
See IDEONE demo
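One way this pattern might be used with re.sub, as a sketch: each matched digit is put back via \g<0> and followed by a comma. The helper name add_commas is hypothetical.

```python
import re

rx = r'\d(?=(?:\d{3})+(?!\d))'

def add_commas(text):
    # \g<0> is the matched digit itself; a comma is appended after it.
    return re.sub(rx, r'\g<0>,', text)

print(add_commas('123456'))                 # 123,456
print(add_commas('1234567'))                # 1,234,567
print(add_commas('The number is: 123456'))  # The number is: 123,456
```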

Python Regular Expression Match All 5 Digit Numbers but None Larger

I'm attempting to string match 5-digit coupon codes spread throughout an HTML web page. For example, 53232, 21032, 40021 etc... I can handle the simpler case of any string of 5 digits with [0-9]{5}, though this also matches 6, 7, 8... n digit numbers. Can someone please suggest how I would modify this regular expression to match only 5 digit numbers?
>>> import re
>>> s="four digits 1234 five digits 56789 six digits 012345"
>>> re.findall(r"\D(\d{5})\D", s)
['56789']
If they can occur at the very beginning or the very end, it's easier to pad the string than to mess with special cases:
>>> re.findall(r"\D(\d{5})\D", " "+s+" ")
Without padding the string to handle the special cases at the start and end of the string, as in John La Rooy's answer, one can use negative lookahead and lookbehind to handle both cases with a single regular expression:
>>> import re
>>> s = "88888 999999 3333 aaa 12345 hfsjkq 98765"
>>> re.findall(r"(?<!\d)\d{5}(?!\d)", s)
['88888', '12345', '98765']
full string: ^[0-9]{5}$
within a string: [^0-9][0-9]{5}[^0-9]
Note: There is a problem with using \D: it matches (and consumes) any character that is not a digit. Use \b instead.
\b is important here because it matches the empty string at a word boundary, that is, only at the beginning or end of a word.
import re
s = "four digits 1234 five digits 56789 six digits 01234,56789,01234"
re.findall(r"\b\d{5}\b", s)
result : ['56789', '01234', '56789', '01234']
but if one uses
re.findall(r"\D(\d{5})\D", s)
output : ['56789', '01234']
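\D cannot handle numbers separated by a single character such as a comma: it consumes that separator, so the separator is no longer available as the leading \D of the next number. The final 01234 is missed as well, because no character follows it for the trailing \D to match.
\b is the important part here: it matches the empty string, but only at the beginning or end of a word.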
More documentation: https://docs.python.org/2/library/re.html
More clarification on the usage of \D vs \b:
This example uses \D, but it doesn't capture all the five-digit numbers.
This example uses \b and captures all the five-digit numbers.
Cheers
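For comparison, here is a sketch that runs the three approaches from this thread on one made-up input. It also shows a case the answers above do not mention: \b treats letters and underscores as word characters, so digits glued to letters (code12345) are not matched by \b\d{5}\b, while the lookaround version still finds them.

```python
import re

# Hypothetical sample: comma-separated numbers, digits glued to letters, and a 6-digit run.
s = "id 01234,56789 code12345 x123456"

with_nondigit = re.findall(r"\D(\d{5})\D", s)
with_boundary = re.findall(r"\b\d{5}\b", s)
with_lookaround = re.findall(r"(?<!\d)\d{5}(?!\d)", s)

print(with_nondigit)    # ['01234', '12345']
print(with_boundary)    # ['01234', '56789']
print(with_lookaround)  # ['01234', '56789', '12345']
```

Which behaviour is right depends on whether codes attached to letters should count.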
A very simple way would be to match all groups of digits, like with r'\d+', and then skip every match that isn't five characters long when you process the results.
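That filtering approach might look like this (a minimal sketch using the sample string from an earlier answer; the name five_digit_codes is mine):

```python
import re

s = "four digits 1234 five digits 56789 six digits 012345"
# Grab every run of digits, then keep only the runs that are exactly five long.
five_digit_codes = [run for run in re.findall(r'\d+', s) if len(run) == 5]
print(five_digit_codes)  # ['56789']
```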
You probably want to match a non-digit before and after your string of 5 digits, like [^0-9]([0-9]{5})[^0-9]. Then you can capture the inner group (the actual string you want).
You could try
\D\d{5}\D
or maybe
\b\d{5}\b
I'm not sure how python treats line-endings and whitespace there though.
I believe ^\d{5}$ would not work for you, as you likely want to get numbers that are somewhere within other text.
I use a simpler regex:
re.findall(r"\d{5}", mystring)
It will search for 5 consecutive digits. But you have to be sure there are no longer runs of digits in the string, because it also matches the first 5 digits of those.
