RegEx/Python: optional whitespace not found - python

I've got a really weird problem. My (Python) regex looks like this:
u'^.*(?:Grundfl|gfl|wfl|wohnfl|whg|wohnung).*(\s\d{1,3}[.,]?\d{1,2}?)\s*(?:m\u00B2|qm)'
In a re.findall() call, this should return two matches for the following text: "...from 71m² to 83m²"
However, only 83 is matched. The problem has something to do with the optional whitespace between the number (\s\d{1,3}[.,]?\d{1,2}?) and the square meters (?:m\u00B2|qm): when I delete the \s*, only 71 is matched as expected. I have no idea what is wrong with my regex.
Thanks for your help!

Why don't you try using a positive lookahead? This will match 1+ digits (with an optional comma inside), as long as there is m² or qm after it. There is an optional space between the numbers and the unit:
>>> import re
>>> # (?:m\u00B2|qm) is a group; [m\u00B2|qm] would be a character class matching one of m, ², | or q
>>> re.findall(r"[\d,]+(?=\s{0,4}(?:m\u00B2|qm))", "from 71m² to 83m²")
['71', '83']
>>> re.findall(r"[\d,]+(?=\s{0,4}(?:m\u00B2|qm))", "from 71,56 m² to 837,78 qm")
['71,56', '837,78']
It does not take into account the words you have specified, but you can easily add that part back in. Note, however, that re.findall() returns non-overlapping matches, and a pattern anchored with ^ can only match once per string (unless re.MULTILINE is set). With the leading ^.* in your pattern, the greedy .* consumes everything up to the last number that still satisfies the rest of the pattern, so you only ever get one value back and never the others.
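The effect of the anchor can be seen in a small sketch. The sample text and both simplified patterns below are illustrative assumptions, not the asker's full pattern:

```python
import re

# Hypothetical sample text containing one keyword and two areas.
text = "wohnung from 71m² to 83m²"

# Anchored at the start of the string: ^ can only match once, and the
# greedy .* runs ahead to the last number that still lets the rest match.
anchored = re.compile(r"^.*\s(\d+)\s*m\u00B2")

# Not anchored: findall keeps scanning after each match.
unanchored = re.compile(r"(\d+)\s*m\u00B2")

anchored_result = anchored.findall(text)
unanchored_result = unanchored.findall(text)

print(anchored_result)    # ['83']
print(unanchored_result)  # ['71', '83']
```

If the keyword check is still needed, one option is to test for the keyword separately and then run the unanchored pattern over the text.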

You can use the following regex with re.findall:
(\d*[.,]?\d+)\s*(?:m\u00B2|qm)
See the regex demo. re.findall will only return a list of Group 1 captured values.
Pattern details:
(\d*[.,]?\d+) - Group 1 containing the integer or float number: 0+ digits, followed with 1 or 0 . or , followed with 1+ digits
\s* - 0+ whitespaces
(?:m\u00B2|qm) - either m² or qm.
See Python demo:
# -*- coding: utf-8 -*-
import re
p = re.compile(u'(\d*[.,]?\d+)\s*(?:m\u00B2|qm)')
s = u"wohnung from 71,556m² to 183.4456m²"
print(p.findall(s)) # => [u'71,556', u'183.4456']

Related

Python -- Regex match pattern OR end of string

import re
re.findall("(\+?1?[ -.]?\(?\d{3}\)?[ -.]?\d{3}[ -.]?\d{4})(?:[ <$])", "+1.222.222.2222<")
The above code works fine if my string ends with a "<" or space. But if it's the end of the string, it doesn't work. How do I get +1.222.222.2222 to return in this condition:
import re
re.findall("(\+?1?[ -.]?\(?\d{3}\)?[ -.]?\d{3}[ -.]?\d{4})(?:[ <$])", "+1.222.222.2222")
*I removed the "<" and just terminated the string. It returns an empty list in this case, but I'd like it to return the full string: +1.222.222.2222
POSSIBLE ANSWER:
import re
re.findall("(\+?1?[ -.]?\(?\d{3}\)?[ -.]?\d{3}[ -.]?\d{4})(?:[ <]|$)", "+1.222.222.2222")
I think you've solved the end-of-string issue, but there are a couple of other potential issues with the pattern in your question:
the - in [ -.] is interpreted as a range operator (here, every character from the space to the period), so it either needs to be escaped as \- or placed in the first or last position within square brackets, e.g. [-. ] or [ .-]; if you search for [] in the docs here you'll find the relevant info:
Ranges of characters can be indicated by giving two characters and separating them
by a '-', for example [a-z] will match any lowercase ASCII letter, [0-5][0-9] will match
all the two-digits numbers from 00 to 59, and [0-9A-Fa-f] will match any hexadecimal
digit. If - is escaped (e.g. [a\-z]) or if it’s placed as the first or last character
(e.g. [-a] or [a-]), it will match a literal '-'.
you may want to require that either matching parentheses or none are present around the first 3 of 10 digits using (?:\(\d{3}\) ?|\d{3}[-. ]?)
Here's a possible tweak incorporating the above:
import re
pat = r"^((?:\+1[-. ]?|1[-. ]?)?(?:\(\d{3}\) ?|\d{3}[-. ]?)\d{3}[-. ]?\d{4})(?:[ <]|$)"
print( re.findall(pat, "+1.222.222.2222") )
print( re.findall(pat, "+1(222)222.2222") )
print( re.findall(pat, "+1(222.222.2222") )
Output:
['+1.222.222.2222']
['+1(222)222.2222']
[]
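The range pitfall mentioned above can be checked directly. This is a minimal sketch; the list of sample characters is an arbitrary choice for illustration:

```python
import re

range_class = re.compile(r"[ -.]")    # range from ' ' (0x20) to '.' (0x2E)
literal_class = re.compile(r"[-. ]")  # literal '-', '.' and ' '

# Arbitrary sample characters; '/' (0x2F) lies just outside the range.
samples = [" ", "-", ".", "+", "(", "#", "/"]

range_hits = [c for c in samples if range_class.fullmatch(c)]
literal_hits = [c for c in samples if literal_class.fullmatch(c)]

print(range_hits)    # [' ', '-', '.', '+', '(', '#']
print(literal_hits)  # [' ', '-', '.']
```

The range version happens to include - and ., which is why the original pattern appears to work, but it also accepts characters such as + and ( as separators.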
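Maybe try:
import re
re.findall(r"(\+?1?[ -.]?\(?\d{3}\)?[ -.]?\d{3}[ -.]?\d{4})(?:| |<|$)", "+1.222.222.2222")
The alternatives in the final group are:
(empty) matches at any position: +1.222.222.2222
a space matches a space character: +1.222.222.2222 followed by a space
< matches a less-than sign: +1.222.222.2222<
$ matches at the end of the line: +1.222.222.2222
Note that because the empty alternative always succeeds, the remaining alternatives are never actually needed.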
You can also use regex101 for easier debugging.
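The root cause in the question is that $ inside a character class is a literal dollar sign, not an anchor. A minimal sketch of the difference, using a simplified version of the phone-number pattern (the variable names are made up for this example):

```python
import re

# Simplified phone-number core, with the character class written as [-. ].
number = r"\+?1?[-. ]?\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}"

in_class = re.compile("(" + number + ")[ <$]")          # $ is a literal here
alternation = re.compile("(" + number + ")(?:[ <]|$)")  # $ is an anchor here

in_class_plain = in_class.findall("+1.222.222.2222")
in_class_dollar = in_class.findall("+1.222.222.2222$")
alt_plain = alternation.findall("+1.222.222.2222")
alt_angle = alternation.findall("+1.222.222.2222<")

print(in_class_plain)   # []
print(in_class_dollar)  # ['+1.222.222.2222']
print(alt_plain)        # ['+1.222.222.2222']
print(alt_angle)        # ['+1.222.222.2222']
```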

Regex thousand operator either , or

I'm using a Python regex and I'm receiving numbers with either a , or a . as the thousands separator. If the , is the thousands separator then the . is the decimal separator, and vice versa. The only good news is that there are always two decimal digits.
I need to parse these numbers with a regex, and I don't care about the decimal part, so I would like to extract the following. Can someone smarter than me help? This is giving me a headache.
111.112.123,55 -> 111112123
123.44 -> 123
123,353,123.55 -> 123353123
21,23 -> 21
152.00 -> 152
You may use the following pattern:
[,.]\d+$|[.,]
[,.] Character set for either , or ..
\d+$ Digits at end of string.
| Alternation (OR).
[.,] Character set for either , or ..
Regex demo here.
Python demo:
import re
mynumbers=['111.112.123,55','123.44','123,353,123.55','21,23','152.00']
for number in mynumbers:
    print(re.sub(r'[,.]\d+$|[.,]', '', number))
Prints:
111112123
123
123353123
21
152
You may alternatively use a more restrictive pattern if you are working with text:
[.,]\d+$|(?<=\d{3})[.,]
Regex demo here.
Python:
mytext = '''
111.112.123,55
123.44
123,353,123.55
21,23
152.00
Text, and punctuation.
'''
for line in mytext.splitlines():
    print(re.sub(r'[.,]\d+$|(?<=\d{3})[.,]', '', line))
Prints:
111112123
123
123353123
21
152
Text, and punctuation.
Assuming you are dealing with strings that only contain one number, you can use this pattern:
re.sub(r'[.,](?:\d\d$)?', '', s)
(a , or a . optionally followed by 2 digits and the end of the string.)
You could capture one or more digits in a capturing group (\d+), followed by a character class [.,] that matches either a dot or a comma.
To match the digits at the end, you could use an optional non-capturing group (?:\d+$)? that matches one or more digits followed by an assertion for the end of the line. You might start the match with a word boundary to prevent it from being part of a longer match.
In the replacement, use the first capturing group \1:
\b(\d+)[.,](?:\d+$)?
Regex demo
Python demo
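As a rough sketch of how that last pattern could be applied with re.sub, assuming each string holds exactly one number with a decimal part (the helper name integer_part is made up for this example):

```python
import re

pattern = re.compile(r"\b(\d+)[.,](?:\d+$)?")

def integer_part(number):
    """Drop the separators and the trailing decimal part (hypothetical helper)."""
    # Each match is "digits + separator", plus the decimals on the last one;
    # replacing it with group 1 keeps only the digits.
    return pattern.sub(r"\1", number)

samples = ["111.112.123,55", "123.44", "123,353,123.55", "21,23", "152.00"]
results = [integer_part(n) for n in samples]

print(results)  # ['111112123', '123', '123353123', '21', '152']
```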

python regular expression : how to remove all punctuation characters from a string but keep those between numbers?

I am working on a Chinese NLP project. I need to remove all punctuation characters except those between numbers, keeping only Chinese characters (\u4e00-\u9fff) and alphanumeric characters (0-9a-zA-Z). For example, the hyphen in 12-34 should be kept, while the equals sign after 123 should be removed.
Here is my Python script:
import re
s = "中国,中,。》%国foo中¥国bar#中123=国%中国12-34中国"
res = re.sub(u'(?<=[^0-9])[^\u4e00-\u9fff0-9a-zA-Z]+(?=[^0-9])','',s)
print(res)
the expected output should be
中国中国foo中国bar中123国中国12-34中国
but the result is
中国中国foo中国bar中123=国中国12-34中国
I can't figure out why there is an extra equal sign in the output?
Your regex first matches "=" against [^\u4e00-\u9fff0-9a-zA-Z]+, which succeeds. It then checks the lookbehind and the lookahead, and both must succeed for the character to be removed. The lookbehind (?<=[^0-9]) fails here because "=" is preceded by the digit 3, so the "=" is kept. In other words, your code keeps any non-alphanumeric, non-Chinese characters that have a digit on either side.
You can try the following regex:
u'([\u4e00-\u9fff0-9a-zA-Z]|(?<=[0-9])[^\u4e00-\u9fff0-9a-zA-Z]+(?=[0-9]))'
You can use it as such:
import re
s = "中国,中,。》%国foo中¥国bar#中123=国%中国12-34中国"
res = re.findall(u'([\u4e00-\u9fff0-9a-zA-Z]|(?<=[0-9])[^\u4e00-\u9fff0-9a-zA-Z]+(?=[0-9]))',s)
print(''.join(res))
I suggest matching and capturing these characters in between digits (to restore them later in the output), and just match them in other contexts.
In Python 2, it will look like
import re
s = u"中国,中,。》%国foo中¥国bar#中123=国%中国12-34中国"
pat_block = u'[^\u4e00-\u9fff0-9a-zA-Z]+'
pattern = u'([0-9]+{0}[0-9]+)|{0}'.format(pat_block)
res = re.sub(pattern, lambda x: x.group(1) if x.group(1) else u"", s)
print(res.encode("utf8")) # => 中国中国foo中国bar中123国中国12-34中国
See the Python demo
If you need to preserve those symbols inside any Unicode digits, you need to replace [0-9] with \d and pass the re.UNICODE flag to the regex.
The regex will look like
([0-9]+[^\u4e00-\u9fff0-9a-zA-Z]+[0-9]+)|[^\u4e00-\u9fff0-9a-zA-Z]+
It works like this:
([0-9]+[^\u4e00-\u9fff0-9a-zA-Z]+[0-9]+) - Group 1 capturing
[0-9]+ - 1+ digits
[^\u4e00-\u9fff0-9a-zA-Z]+ - 1+ chars other than those defined in the specified ranges
[0-9]+ - 1+ digits
| - or
[^\u4e00-\u9fff0-9a-zA-Z]+ - 1+ chars other than those defined in the specified ranges
In Python 2.x, when a group does not participate in the match, x.group(1) returns None and a \1 backreference in the replacement string raises an error, which is why a lambda expression is required to check whether Group 1 matched first.
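For reference, here is a sketch of the same capture-and-restore approach in Python 3, where strings are already Unicode. It assumes Python 3.5 or later for the second call, since from that version on an unmatched group in a replacement template is replaced with an empty string:

```python
import re

s = "中国,中,。》%国foo中¥国bar#中123=国%中国12-34中国"

# One or more characters that are neither Chinese nor ASCII alphanumeric.
other = "[^\u4e00-\u9fff0-9a-zA-Z]+"

# Group 1: such characters sandwiched between digits (kept).
# Second alternative: the same characters anywhere else (dropped).
pattern = re.compile("([0-9]+{0}[0-9]+)|{0}".format(other))

# Callable replacement: works regardless of how unmatched groups are handled.
res = pattern.sub(lambda m: m.group(1) or "", s)

# Template replacement: relies on the Python 3.5+ behavior described above.
res_template = pattern.sub(r"\1", s)

print(res)  # 中国中国foo中国bar中123国中国12-34中国
```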

Regular expression with two non-repeating symbols in any order

I need to create a regex that will match strings such as:
AA+1.01*2.01,BB*2.01+1.01,CC
The * and + may appear in either order.
I've created the following regex:
^(([A-Z][A-Z](([*+][0-9]+(\.[0-9])?[0-9]?){0,2}),)*[A-Z]{2}([*+][0-9]+(\.[0-9])?[0-9]?){0,2})$
But the problem is that with this regex + or * can be used twice, whereas I need each of them to appear at most once, so the results for the following strings should be:
AA+1*2,CC - true
AA+1+2,CC - false (now is true with my regex)
AA*1+2,CC - true
AA*1*2,CC - false (now is true with my regex)
Capture whichever of [+*] comes first, then use a negative lookahead to make sure the second operator is the other one.
Regex: [A-Z]{2}([+*])(?:\d+(?:\.\d+)?)(?!\1)[+*](?:\d+(?:\.\d+)?),[A-Z]{2}
Explanation:
[A-Z]{2} Matches two upper case letters.
([+*]) captures either of + or *.
(?:\d+(?:\.\d+)?) matches number with optional decimal part.
(?!\1)[+*] asserts that the next character is not the symbol already captured, then matches the other one. So if + was captured previously, only * will be matched.
(?:\d+(?:\.\d+)?) matches number with optional decimal part.
,[A-Z]{2} matches , followed by two upper case letters.
Regex101 Demo
To match the first case, AA+1.01*2.01,BB*2.01+1.01,CC, which is just a small extension of the previous pattern, use the following regex.
Regex: (?:[A-Z]{2}([+*])(?:\d+(?:\.\d+)?)(?!\1)[+*](?:\d+(?:\.\d+)?),)+[A-Z]{2}
Explanation: Wrapped the whole pattern except the final two letters in a non-capturing group and repeated it with + to match one or more such segments.
Regex101 Demo
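The two patterns above can be tried against the strings from the question. This is a quick sketch using re.fullmatch, with the redundant non-capturing groups around the numbers dropped for readability:

```python
import re

# Single pair: the second operator must differ from the captured first one.
pair = re.compile(
    r"[A-Z]{2}([+*])\d+(?:\.\d+)?(?!\1)[+*]\d+(?:\.\d+)?,[A-Z]{2}"
)

# Repeated version: one or more "XX<op>n<other op>n," segments, then two letters.
chain = re.compile(
    r"(?:[A-Z]{2}([+*])\d+(?:\.\d+)?(?!\1)[+*]\d+(?:\.\d+)?,)+[A-Z]{2}"
)

cases = ["AA+1*2,CC", "AA+1+2,CC", "AA*1+2,CC", "AA*1*2,CC"]
results = {c: bool(pair.fullmatch(c)) for c in cases}
chain_ok = bool(chain.fullmatch("AA+1.01*2.01,BB*2.01+1.01,CC"))

print(results)
# {'AA+1*2,CC': True, 'AA+1+2,CC': False, 'AA*1+2,CC': True, 'AA*1*2,CC': False}
print(chain_ok)  # True
```

Note that both patterns require each segment to carry exactly two operators; segments with zero or one operator would need the operator parts to be made optional.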
To get a regex to match your given example, extended to an arbitrary number of commas, you could use:
^(?:[A-Z]{2}([+*])?\d*\.?\d*(?!\1)[+*]?\d*\.?\d*,?)*$
Note that this example will also allow a trailing comma. I'm not sure if there is much you can do about that.
Regex 101 Example
If the trailing comma is an issue:
^(?:[A-Z]{2}([+*])?\d*\.?\d*(?!\1)[+*]?\d*\.?\d*,?)*?(?:[A-Z]{2}([+*])?\d*\.?\d*(?!\2)[+*]?\d*\.?\d*?)$
Regex 101 Example

Python Regular Expression Match All 5 Digit Numbers but None Larger

I'm attempting to string match 5-digit coupon codes spread throughout an HTML web page. For example, 53232, 21032, 40021 etc... I can handle the simpler case of any string of 5 digits with [0-9]{5}, though this also matches 6, 7, 8... n digit numbers. Can someone please suggest how I would modify this regular expression to match only 5 digit numbers?
>>> import re
>>> s="four digits 1234 five digits 56789 six digits 012345"
>>> re.findall(r"\D(\d{5})\D", s)
['56789']
If they can occur at the very beginning or the very end, it's easier to pad the string than to mess with special cases:
>>> re.findall(r"\D(\d{5})\D", " "+s+" ")
Instead of padding the string to handle the start and end of the string, as in John La Rooy's answer, one can use a negative lookbehind and a negative lookahead to handle both cases with a single regular expression:
>>> import re
>>> s = "88888 999999 3333 aaa 12345 hfsjkq 98765"
>>> re.findall(r"(?<!\d)\d{5}(?!\d)", s)
['88888', '12345', '98765']
full string: ^[0-9]{5}$
within a string: [^0-9][0-9]{5}[^0-9]
Note: There is a problem with using \D: it has to match (and consume) an actual non-digit character, so use \b instead.
\b is important here because it matches the empty string, but only at the beginning or end of a word.
import re
text = "four digits 1234 five digits 56789 six digits 01234,56789,01234"
re.findall(r"\b\d{5}\b", text)
result : ['56789', '01234', '56789', '01234']
but if one uses
re.findall(r"\D(\d{5})\D", text)
output : ['56789', '01234']
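\D is unable to handle the comma-separated numbers: it consumes the separator, so the next number has no non-digit left in front of it, and the last number has no character after it.
\b avoids this because it matches the empty string at a word boundary and consumes nothing.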
More documentation: https://docs.python.org/2/library/re.html
More Clarification on usage of \D vs \b:
This example uses \D but it doesn't capture all the five digits number.
This example uses \b while capturing all five digits number.
Cheers
A very simple way would be to match all groups of digits, like with r'\d+', and then skip every match that isn't five characters long when you process the results.
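A minimal sketch of that filter-afterwards idea, reusing the sample string from the first answer:

```python
import re

text = "four digits 1234 five digits 56789 six digits 012345"

# Grab every run of digits, then keep only the runs that are exactly 5 long.
all_runs = re.findall(r"\d+", text)
codes = [run for run in all_runs if len(run) == 5]

print(all_runs)  # ['1234', '56789', '012345']
print(codes)     # ['56789']
```

This trades a slightly more complex regex for a plain length check, and it handles matches at the very start or end of the string without special cases.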
You probably want to match a non-digit before and after your string of 5 digits, like [^0-9]([0-9]{5})[^0-9]. Then you can capture the inner group (the actual string you want).
You could try
\D\d{5}\D
or maybe
\b\d{5}\b
I'm not sure how python treats line-endings and whitespace there though.
I believe ^\d{5}$ would not work for you, as you likely want to get numbers that are somewhere within other text.
I use a simpler regex:
re.findall(r"\d{5}", mystring)
It searches for 5 consecutive digits. But you have to be sure the string contains no longer runs of digits, because any 5 digits inside a longer number will match too.
