My Code:
import re
#Phone Number regex
phoneRegex = re.compile(r'''(
(\d{3}|\(\d{3}\))?
(\s|-|\.)? # separator
(\d{3}) # first 3 digits
(\s|-|\.) # separator
(\d{4}) # last 4 digits
(\s*(ext|x|ext.)\s*(\d{2,5}))? # extension
)''', re.VERBOSE)
phoneRegex.findall('Phone: 800.420.7240 or +1 415.863.9900 (9 a.m. to 5 p.m., M-F, PST)')
Output:
[('800.420.7240', '800', '.', '420', '.', '7240', '', '', ''), ('415.863.9900', '415', '.', '863', '.', '9900', '', '', '')]
Questions:
Why are empty strings included in the match?
The empty strings are matched from what positions of the string?
What are the conditions for empty strings to be matched?
P.S.
The empty strings are not included in the match when I use the same regex on https://regex101.com/
Also, I just started learning regex a few days ago, so I'm sorry if my questions aren't good enough.
The ? operator means that it will return zero or one matches. In this case, you've made some capturing groups optional with ?, and python is returning a zero-length match for each of the three optional capturing groups you created.
If you remove the first two ? you'll eliminate some zero-length matches. To deal with the last one, you need to change the extension pattern. It accounts for two, again because you use a zero or one operator (*).
If you don't care about the internal capturing groups and just want the full match, you could filter out the zero-length matches with something like
>>> [match.group(0) for match in phoneRegex.finditer('Phone: 800.420.7240 or +1 415.863.9900 (9 a.m. to 5 p.m., M-F, PST)')]
['800.420.7240', '415.863.9900']
You could make the extension capture group match conditional on there being a match for a preceding phone number. Also, I think you may need to escape the . in the third alternative ext.. As written it matches any character, but I think what you meant was ext\..
for reference:
zero-length regex matches
python re
Why are empty strings included in the match?
Because you have used various groups in your regex. The engine captures the part of the match that you have put into a group.
The empty strings are matched from what positions of the string?
From this regex: (\s*(ext|x|ext.)\s*(\d{2,5}))?
It has three groups (you can count the opening parentheses). The engine does not find a match with extensions and the 3 groups that try to capture information return empty strings.
What are the conditions for empty strings to be matched?
If you group your regex in a way that the engine catches an empty substring in the string matched, it will return empty strings :-)
I think you are following the exercise from "automate the boring stuff with python". In the regex in VERBOSE mode on page 178, try to look for the opening parentheses. Where is the closing parentheses? The number of groups is equivalent with the number of opening parentheses. The whole regex is group number zero.
Groups are useful if you want to extract certain parts of a matched string. If you only want to extract full phone numbers, leave the groups away.
You may try just this:
phoneRegex = re.compile(r'\d{3}[\.|-|\/]\d{3}[\.|-|\/]\d{4}')
Is this what you want to achieve?
If you want to stick with your regex in VERBOSE mode, you could also use non-capturing groups. This only captures a full match:
phoneRegex = re.compile(r'''(
(?:\d{3}|\(?:\d{3}\))?
(?:\s|-|\.)? # separator
(?:\d{3}) # first 3 digits
(?:\s|-|\.) # separator
(?:\d{4}) # last 4 digits
(?:\s*(?:ext|x|ext.)\s*(?:\d{2,5}))? # extension
)''', re.VERBOSE)
Related
I do have got the below string and I am looking for a way to split it in order to consistently end up with the following output
'1GB 02060250396L7.067,702BE 129517720L6.633,403NL 134187650L3.824,234DE 165893440L3.111,005PL 65775644897L1.010,006DE 811506926L3.547,407AT U16235008L-830,008SE U57469158L3.001,30'
['1GB 02060250396L1.060,70',
'2BE 129517720L2.639,40',
'3NL 134187650L4.024,23',
'4DE 165893440L8.111,00',
'5PL 65775644897L3.010,00',
'6DE 811506926L3.547,40',
'7AT U16235008L-830,00',
'8SE U57469158L8.0221,30']
My current approach
re.split("([0-9][0-9][0-9][A-Z][A-Z])", input) however is also splitting my delimiter which gives and there is no other split possible than the one I am currently using in order to remain consistent. Is it possible to split my delimiter as well and assign a part of it "70" to the string in front and a part "2BE" to the following string?
Use re.findall() instead of re.split().
You want to match
a number \d, followed by
two letters [A-Z]{2}, followed by
a space \s, followed by
a bunch of characters until you encounter a comma [^,]+, followed by
two digits \d{2}
Try it at regex101
So do:
input_str = '1GB 02060250396L7.067,702BE 129517720L6.633,403NL 134187650L3.824,234DE 165893440L3.111,005PL 65775644897L1.010,006DE 811506926L3.547,407AT U16235008L-830,008SE U57469158L3.001,30'
re.findall(r"\d[A-Z]{2}\s[^,]+,\d{2}", input_str)
Which gives
['1GB 02060250396L7.067,70',
'2BE 129517720L6.633,40',
'3NL 134187650L3.824,23',
'4DE 165893440L3.111,00',
'5PL 65775644897L1.010,00',
'6DE 811506926L3.547,40',
'7AT U16235008L-830,00',
'8SE U57469158L3.001,30']
Alternatively, if you don't want to be so specific with your pattern, you could simply use the regex
[^,]+,\d{2} Try it at regex101
This will match as many of any character except a comma, then a single comma, then two digits.
re.findall(r"[^,]+,\d{2}", input_str)
# Output:
['1GB 02060250396L7.067,70',
'2BE 129517720L6.633,40',
'3NL 134187650L3.824,23',
'4DE 165893440L3.111,00',
'5PL 65775644897L1.010,00',
'6DE 811506926L3.547,40',
'7AT U16235008L-830,00',
'8SE U57469158L3.001,30']
Is it possible to split my delimiter as well and assign a part of it "70" to the string in front and a part "2BE" to the following string?
If you must use re.split AT ANY PRICE then you might exploit zero-length assertion for this task following way
import re
text = '1GB 02060250396L7.067,702BE 129517720L6.633,403NL 134187650L3.824,234DE 165893440L3.111,005PL 65775644897L1.010,006DE 811506926L3.547,407AT U16235008L-830,008SE U57469158L3.001,30'
parts = re.split(r'(?<=,[0-9][0-9])', text)
print(parts)
output
['1GB 02060250396L7.067,70', '2BE 129517720L6.633,40', '3NL 134187650L3.824,23', '4DE 165893440L3.111,00', '5PL 65775644897L1.010,00', '6DE 811506926L3.547,40', '7AT U16235008L-830,00', '8SE U57469158L3.001,30', '']
Explanation: This particular one is positive lookbehind, it does find zero-length substring preceded by , digit digit. Note that parts has superfluous empty str at end.
I expect to fetch all alphanumeric characters after "-"
For an example:
>>> str1 = "12 - mystr"
>>> re.findall(r'[-.\:alnum:](.*)', str1)
[' mystr']
First, it's strange that white space is considered alphanumeric, while I expected to get ['mystr'].
Second, I cannot understand why this can be fetched, if there is no "-":
>>> str2 = "qwertyuio"
>>> re.findall(r'[-.\:alnum:](.*)', str2)
['io']
First of all, Python re does not support POSIX character classes.
The white space is not considered alphanumeric, your first pattern matches - with [-.\:alnum:] and then (.*) captures into Group 1 all 0 or more chars other than a newline. The [-.\:alnum:] pattern matches one char that is either -, ., :, a, l, n, u or m. Thus, when run against the qwertyuio, u is matched and io is captured into Group 1.
Alphanumeric chars can be matched with the [^\W_] pattern. So, to capture all alphanumeric chars after - that is followed with 0+ whitespaces you may use
re.findall(r'-\s*([^\W_]+)', s)
See the regex demo
Details
- - a hyphen
\s* - 0+ whitespaces
([^\W_]+) - Capturing group 1: one or more (+) chars that are letters or digits.
Python demo:
print(re.findall(r'-\s*([^\W_]+)', '12 - mystr')) # => ['mystr']
print(re.findall(r'-\s*([^\W_]+)', 'qwertyuio')) # => []
Your regex says: "Find any one of the characters -.:alnum, then capture any amount of any characters into the first capture group".
In the first test, it found - for the first character, then captured mystr in the first capture group. If any groups are in the regex, findall returns list of found groups, not the matches, so the matched - is not included.
Your second test found u as one of the -.:alnum characters (as none of qwerty matched any), then captured and returned the rest after it, io.
As #revo notes in comments, [....] is a character class - matching any one character in it. In order to include a POSIX character class (like [:alnum:]) inside it, you need two sets of brackets. Also, there is no order in a character class; the fact that you included - inside it just means it would be one of the matched characters, not that alphanumeric characters would be matched without it. Finally, if you want to match any number of alphanumerics, you have your quantifier * on the wrong thing.
Thus, "match -, then any number of alphanumeric characters" would be -([[:alnum:]]*), except... Python does not support POSIX character classes. So you have to write your own: -([A-Za-z0-9]*).
However, that will not match your string because the intervening space is, as you note, not an alphanumeric character. In order to account for that, -\s*([A-Za-z0-9]*).
Not quite sure what you want to match. I'll assume you don't want to include '-' in any matches.
If you want to get all alphanumeric chars after the first '-' and skip all other characters you can do something like this.
re.match('.*?(?<=-)(((?<=\s+)?[a-zA-Z\d]+(?=\s+)?)+)', inputString)
If you want to find each string of alphanumerics after a each '-' then you can do this.
re.findall('(?<=-)[a-zA-Z\d]+')
I'm using a simple regex (.*?)(\d+[.]\d+)|(.*?)(\d+) to match int/float/double value in a string. When doing findall the regex shows empty strings in the output. The empty strings gets removed when I remove the | operator and do an individual match. I had also tried this on regex101 it doesn't show any empty string. How can I remove this empty strings ? Here's my code:
>>>import re
>>>match_float = re.compile('(.*?)(\d+[.]\d+)|(.*?)(\d+)')
>>>match_float.findall("CA$1.90")
>>>match_float.findall("RM1")
Output:
>>>[('CA$', '1.90', '', '')]
>>>[('', '', 'RM', '1')]
Since you defined 4 capturing groups in the pattern, they will always be part of the re.findall output unless you remove them (say, by using filter(None, ...).
However, in the current situation, you may "shrink" your pattern to
r'(.*?)(\d+(?:\.\d+)?)'
See the regex demo
Now, it will only have 2 capturing groups, and thus, findall will only output 2 items per tuple in the resulting list.
Details:
(.*?) - Capturing group 1 matching any zero or more chars other than line break chars, as few as possible up to the first occurrence of ...
(\d+(?:\.\d+)?) - Capturing group 2:
\d+ - one of more digits
(?:\.\d+)? - an optional *non-*capturing group that matches 1 or 0 occurrences of a . and 1+ digits.
See the Python demo:
import re
rx = r"(.*?)(\d+(?:[.]\d+)?)"
ss = ["CA$1.90", "RM1"]
for s in ss:
print(re.findall(rx, s))
# => [('CA$', '1.90')] [('RM', '1')]
I need to create the regex that will match such string:
AA+1.01*2.01,BB*2.01+1.01,CC
Order of * and + should be any
I've created the following regex:
^(([A-Z][A-Z](([*+][0-9]+(\.[0-9])?[0-9]?){0,2}),)*[A-Z]{2}([*+][0-9]+(\.[0-9])?[0-9]?){0,2})$
But the problem is that with this regex + or * could be used twice but I only need any of them once so the following strings matches should be:
AA+1*2,CC - true
AA+1+2,CC - false (now is true with my regex)
AA*1+2,CC - true
AA*1*2,CC - false (now is true with my regex)
Either of the [+*] should be captured first and then use negative lookahead to match the other one.
Regex: [A-Z]{2}([+*])(?:\d+(?:\.\d+)?)(?!\1)[+*](?:\d+(?:\.\d+)?),[A-Z]{2}
Explanation:
[A-Z]{2} Matches two upper case letters.
([+*]) captures either of + or *.
(?:\d+(?:\.\d+)?) matches number with optional decimal part.
(?!\1)[+*] looks ahead for symbol captured and matched the other one. So if + is captured previously then * will be matched.
(?:\d+(?:\.\d+)?) matches number with optional decimal part.
,[A-Z]{2} matches , followed by two upper case letters.
Regex101 Demo
To match the first case AA+1.01*2.01,BB*2.01+1.01,CC which is just a little advancement over previous pattern, use following regex.
Regex: (?:[A-Z]{2}([+*])(?:\d+(?:\.\d+)?)(?!\1)[+*](?:\d+(?:\.\d+)?),)+[A-Z]{2}
Explanation: Added whole pattern except ,CC in first group and made it greedy by using + to match one or more such patterns.
Regex101 Demo
To get a regex to match your given example, extended to an arbitrary number of commas, you could use:
^(?:[A-Z]{2}([+*])?\d*\.?\d*(?!\1)[+*]?\d*\.?\d*,?)*$
Note that this example will also allow a trailing comma. I'm not sure if there is much you can do about that.
Regex 101 Example
If the trailing comma is an issue:
^(?:[A-Z]{2}([+*])?\d*\.?\d*(?!\1)[+*]?\d*\.?\d*,?)*?(?:[A-Z]{2}([+*])?\d*\.?\d*(?!\2)[+*]?\d*\.?\d*?)$
Regex 101 Example
I have a bunch of lines in a file with either one or two occurences of same pattern (id=):
Linetype1 : ...id=1234...id=4321...value=5678... # "..." means whatever
Linetype2 : ...id=7890...value=8765
I thought I could write such a regex to grep all my ids and associated values:
>>> l="...id=1234...id=4321...value=5678...\n...id=7890...value=8765\n"
>>> ret = re.findall('(id=[0-9]+).*?(id=[0-9]+)*.*?(value=[0-9]+)',l)
[('id=1234', '', 'value=5678'), ('id=7890', '', 'value=8765')]
I can't get the second "id=4321" part.
This looks very strange to me since I use the non-greedy .*? between first id=[0-9]+ and second.
The middle of your regex has
(id=[0-9]+)*
The empty string matches this, since it is under the Kleene star *. So the regex engine proceeds through the string as follows:
find the first id=[0-9]+ group
expand .*? to the empty string, since it matches
expand (id=[0-9]+)* to the empty string, since it matches
expand .*? to the rest of the string
If you replace the middle group's quantifier with +, or just remove it entirely, then it works.