When using re.split, I'd expect maxsplit to equal the length of the returned list minus one.
The examples in the docs suggest so.
But when there is a capture group (and maybe some other cases) then I don't understand how the maxsplit argument works.
>>> re.split(r"(\W+)", "Words, words, words.", maxsplit=1)
['Words', ', ', 'words, words.']
>>> re.split("(:)", ":a:b::c", maxsplit=2)
['', ':', 'a', ':', 'b::c']
>>> re.split("((:))", ":a:b::c", maxsplit=2)
['', ':', ':', 'a', ':', ':', 'b::c']
What am I missing?
It's not about maxsplit, it's about you using parentheses in the regular expression:
If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list.
DOCS: https://docs.python.org/3/library/re.html#re.split
So what I'm guessing is that maxsplit determines the number of splits, and the parentheses return additional groups.
Example
":a:b::c" with maxsplit=2 splits your string in three parts:
"", "a", "b::c"
But because the pattern "(:)" also contains a captured group, it's returned in between the parts:
"", ":", "a", ":", "b::c"
If the pattern is "((:))", each colon is captured by both groups, so it is returned twice in between the parts.
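A small check of that reading, as a sketch only: if every split contributes one text piece plus one item per capturing group, the list length should be maxsplit * (1 + groups) + 1 whenever at least maxsplit matches exist. The helper name expected_length is made up for illustration.

```python
import re

def expected_length(pattern, splits):
    """Length of the re.split result when exactly `splits` splits happen."""
    groups = re.compile(pattern).groups  # number of capturing groups in the pattern
    return splits * (1 + groups) + 1

text = ":a:b::c"
results = {}
for pattern in (":", "(:)", "((:))"):
    results[pattern] = re.split(pattern, text, maxsplit=2)
    # Show the result next to the length predicted by the formula
    print(pattern, results[pattern], expected_length(pattern, 2))
```

So maxsplit still counts splits; the capturing groups only add extra items per split.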
Related
I have some questions on the split() description/examples from the Python RE documents
If there are capturing groups in the separator and it matches at the start of the string, the result will start with an empty string. The same holds for the end of the string:
re.split(r'(\W+)', '...words, words...')
['', '...', 'words', ', ', 'words', '...', '']
In this example there is a capturing group, it matches at the start and end of the string, thus, the result starts and ends with an empty string. Outside of understanding that this happens, I would like to better understand the reasoning. The explanation for this is:
That way, separator components are always found at the same relative
indices within the result list.
Could someone expand on this? Relative to what?
My other query is related to this example:
re.split(r'(\W*)', '...words...')
['', '...', '', '', 'w', '', 'o', '', 'r', '', 'd', '', 's', '...', '', '', '']
\w will match any character that can be used in any word in any language (Flag:unicode), or will be the equivalent of [a-zA-Z0-9_] (Flag:ASCII), \W is the inverse of this. Can someone talk about each of the matches in the example above, explain each (if possible) in terms of what is matched (\B, \U, ...).
Added 29/01/2019:
Part of what I am after wasn't stated very clearly (my bad). For the second example, I am curious about the steps taken to reach the result (how the Python re module processed the example). After reading this post on Zero-Length Regex Matches things are clearer, but I would still be interested if anyone can break down the logic up to ['', '...', '', '', 'w', in the results.
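What it's trying to say is that when the delimiter contains a capturing group and matches at the beginning of the string, the resulting list always starts with an empty string followed by the delimiter. Similarly, if it matches at the end of the string, the list always ends with the delimiter followed by an empty string. With a single group, the captured delimiters therefore always sit at the odd indices (1, 3, 5, ...).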
For consistency, this is true even when the delimiter matches an empty string. The input string is considered to have an empty string before the first character and after the last character, and the delimiter will match these. And then they'll be the first and last elements of the resulting list.
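One way to trace the steps behind the (\W*) example is to list the matches with finditer and rebuild the list by hand. This is a sketch that assumes Python 3.7 or later (where re.split also splits on empty matches); the variable names are illustrative.

```python
import re

text = '...words...'
pattern = re.compile(r'(\W*)')

pieces = []
last_end = 0
for m in pattern.finditer(text):
    # Text between the previous match and this one (may be empty)
    pieces.append(text[last_end:m.start()])
    # Text captured by the group (empty when \W* matched nothing)
    pieces.append(m.group(1))
    print(m.span(), repr(m.group(1)))
    last_end = m.end()
# Whatever is left after the final match
pieces.append(text[last_end:])

print(pieces)
```

Each match contributes two items: the text before it and the captured separator. The first match is '...' at the start (hence the leading ''), and the next is an empty match right before 'w', which accounts for the two '' items that precede 'w'.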
Check this:
>>> re.split('(a)', 'awords')
['', 'a', 'words']
>>> re.split('(w)', 'awords')
['a', 'w', 'ords']
>>> re.split('(o)', 'awords')
['aw', 'o', 'rds']
>>> re.split('(s)', 'awords')
['aword', 's', '']
The delimiter is always in second place (index 1).
On the other hand:
>>> re.split('a', 'awords')
['', 'words']
>>> re.split('w', 'awords')
['a', 'ords']
>>> re.split('s', 'awords')
['aword', '']
The splits are the same; without the capturing group, the delimiter itself is simply left out of the result.
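The "same relative indices" remark can be illustrated with slices. This is a minimal sketch for a pattern with exactly one capturing group; the variable names are just for illustration.

```python
import re

parts = re.split(r'(\W+)', '...words, words...')

# With one capturing group, text pieces sit at even indices
# and separators at odd indices, even at the string edges.
texts = parts[0::2]
separators = parts[1::2]

print(parts)
print(texts)
print(separators)
```

Without the leading and trailing empty strings, a separator at the start of the string would land at index 0 and the even/odd pattern would break.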
I'm a Python beginner. I have a question about the output of re.split().
text='alpha, beta,,,gamma dela'
In [9]: re.split('(,)+',text)
Out[9]: ['alpha', ',', ' beta', ',', 'gamma dela']
In [11]: re.split('(,+)',text)
Out[11]: ['alpha', ',', ' beta', ',,,', 'gamma dela']
In [7]: re.split('[,]+',text)
Out[7]: ['alpha', ' beta', 'gamma dela']
Why are these outputs different?
Please help me, thank you very much!
As is specified in the documentation of re.split:
re.split(pattern, string, maxsplit=0, flags=0)
Split string by the occurrences of pattern. If capturing parentheses
are used in pattern, then the text of all groups in the pattern
are also returned as part of the resulting list. If maxsplit is
nonzero, at most maxsplit splits occur, and the remainder of the
string is returned as the final element of the list.
A capture group is written with parentheses ((..)) that do not start with ?: or a lookahead/lookbehind marker. So the first two regexes have capture groups:
(,)+
# ^ ^
(,+)
# ^ ^
In the first case, the capture group holds a single comma and the + repeats the group. Each repetition overwrites the previous capture, so only the last one (a single comma) is returned. In the second case, (,+), the repetition is inside the group, so it can capture several commas; since + is greedy, it captures all of them.
In the last case there is no capture group, so the string is split and the text matched by the pattern is discarded.
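The difference between the two groups can be checked directly with re.search. This is a small sketch; the sample string 'beta,,,gamma' is just a fragment of the question's text, and (?:,)+ is shown as a non-capturing alternative.

```python
import re

text = 'alpha, beta,,,gamma dela'

# Group repeated by +: the whole match is ',,,' but the group keeps only the last comma
repeated = re.search(r'(,)+', 'beta,,,gamma')
print(repr(repeated.group(0)), repr(repeated.group(1)))

# Repetition inside the group: the group holds every comma
inner = re.search(r'(,+)', 'beta,,,gamma')
print(repr(inner.group(0)), repr(inner.group(1)))

# Non-capturing group: nothing from the delimiter is returned by split
no_capture = re.split(r'(?:,)+', text)
print(no_capture)
```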
We know that anchors, word boundaries, and lookaround match at a position, rather than matching a character.
Is it possible to split a string by one of the preceding ways with regex (specifically in python)?
For example consider the following string:
"ThisisAtestForchEck,Match IngwithPosition."
So I want the following result (the substrings that start with an uppercase letter that is not preceded by a space):
['Thisis', 'Atest', 'Forch', 'Eck,', 'Match Ingwith', 'Position.']
If I split with grouping I get:
>>> re.split(r'([A-Z])',s)
['', 'T', 'hisis', 'A', 'test', 'F', 'orch', 'E', 'ck,', 'M', 'atch ', 'I', 'ngwith', 'P', 'osition.']
And this is the result with look-around (the string is not split):
>>> re.split(r'(?<=[A-Z])',s)
['ThisisAtestForchEck,Match IngwithPosition.']
>>> re.split(r'((?<=[A-Z]))',s)
['ThisisAtestForchEck,Match IngwithPosition.']
>>> re.split(r'((?<=[A-Z])?)',s)
['ThisisAtestForchEck,Match IngwithPosition.']
Note that if I also want to split at uppercase letters that are preceded by a space, e.g.:
['Thisis', 'Atest', 'Forch', 'Eck,', 'Match ', 'Ingwith', 'Position.']
I can use re.findall, viz.:
>>> re.findall(r'([A-Z][^A-Z]*)',s)
['Thisis', 'Atest', 'Forch', 'Eck,', 'Match ', 'Ingwith', 'Position.']
But what about the first example: is it possible to solve it with re.findall?
A way with re.findall:
re.findall(r'(?:[A-Z]|^[^A-Z\s])[^A-Z\s]*(?:\s+[A-Z][^A-Z]*)*',s)
When you decide to change your approach from split to findall, the first job is to reformulate your requirements: "I want to split the string on each uppercase letter not preceded by a space" => "I want to find runs of one or more space-separated substrings that each begin with an uppercase letter, except possibly at the start of the string (if the string doesn't start with an uppercase letter)".
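Running that findall pattern on the question's string, as a quick check (s is the string from the question):

```python
import re

s = "ThisisAtestForchEck,Match IngwithPosition."

# (?:[A-Z]|^[^A-Z\s])   an uppercase letter, or a non-uppercase first character
# [^A-Z\s]*             the rest of the word
# (?:\s+[A-Z][^A-Z]*)*  further words that follow whitespace and start uppercase
found = re.findall(r'(?:[A-Z]|^[^A-Z\s])[^A-Z\s]*(?:\s+[A-Z][^A-Z]*)*', s)
print(found)
```

Because the pattern uses only non-capturing groups, findall returns the whole matches rather than tuples.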
(?<!\s)(?=[A-Z])
You can use this to split with the third-party regex module, since re does not support splitting on zero-width assertions before Python 3.7.
import regex
x = "ThisisAtestForchEck,Match IngwithPosition."
print(regex.split(r"(?<![\s])(?=[A-Z])", x, flags=regex.VERSION1))
or, to drop the empty leading string:
print([i for i in regex.split(r"(?<![\s])(?=[A-Z])", x, flags=regex.VERSION1) if i])
See demo.
https://regex101.com/r/sJ9gM7/65
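On Python 3.7 and later, the standard re module also splits on zero-width matches, so the same assertion should work without the extra dependency. A sketch under that version assumption:

```python
import re

s = "ThisisAtestForchEck,Match IngwithPosition."

# Split at every position that is followed by an uppercase letter
# and not preceded by whitespace (requires Python 3.7+).
raw = re.split(r'(?<!\s)(?=[A-Z])', s)

# The assertion also matches at position 0, which yields a leading '';
# filter out empty pieces.
parts = [p for p in raw if p]
print(parts)
```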
I know this might be less convenient because of the tuple nature of the result. But I think that this findall finds what you need:
re.findall(r'((?<!\s)[A-Z]([^A-Z]|(?<=\s)[A-Z])*)', s)
## returns [('Thisis', 's'), ('Atest', 't'), ('Forch', 'h'), ('Eck,', ','), ('Match Ingwith', 'h'), ('Position.', '.')]
This can be used in the following list comprehension to give the desired output:
[val[0] for val in re.findall(r'((?<!\s)[A-Z]([^A-Z]|(?<=\s)[A-Z])*)', s)]
## returns ['Thisis', 'Atest', 'Forch', 'Eck,', 'Match Ingwith', 'Position.']
And here is a hack that uses split:
re.split(r'((?<!\s)[A-Z]([^A-Z]|(?<=\s)[A-Z])*)', s)[1::3]
## returns ['Thisis', 'Atest', 'Forch', 'Eck,', 'Match Ingwith', 'Position.']
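Why [1::3] works, as a sketch: the pattern has two capturing groups, so each match adds three items to the split result (the text before the match, group 1, group 2). The stride is computed from pattern.groups here instead of being hard-coded; the variable names are illustrative.

```python
import re

s = "ThisisAtestForchEck,Match IngwithPosition."
pattern = re.compile(r'((?<!\s)[A-Z]([^A-Z]|(?<=\s)[A-Z])*)')

parts = pattern.split(s)
stride = 1 + pattern.groups      # one text piece plus one item per group

between = parts[0::stride]       # text between matches (all empty here)
outer = parts[1::stride]         # group 1: the substrings we want
inner = parts[2::stride]         # group 2: last character of each substring

print(outer)

# Alternative that avoids the slicing: take the whole match from finditer
whole = [m.group(0) for m in pattern.finditer(s)]
print(whole)
```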
Try capturing with this pattern:
([A-Z][a-z]*(?: [A-Z][a-z]*)*)
I am trying to split a string such as 'if (x==5) {' to be:
['if', '(', 'x', '==', '5', ')', '{']
I have a list of keywords that I created as my delimiters. Another problem I faced was the order of the delimiters. I would like to split on '==' before I split on '='
I would like to split on multiple delimiters, yet keep the delimiters as separate elements.
Use re.split.
>>> x = 'if (x==5) {'
>>> [i for i in re.split(r'(==)|(\d+)|([(){]|[a-z]+)|\s+', x) if i]
['if', '(', 'x', '==', '5', ')', '{']
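The capturing groups keep the delimiters in the result; the filter drops the empty strings and the None entries produced by groups that did not take part in a match.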
Assuming you have a list of delimiters like
seps= ('(',')','{','}','==','=')
You can try this:
import re
pattern = r'\s*(%s)\s*' % '|'.join(re.escape(sep) for sep in seps)
print([token for token in re.split(pattern, 'if (x==5) {') if token])
Putting the delimiters inside a capture group (i.e. (==|=|...)) causes re.split not to discard them.
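The ordering concern ('==' before '=') can be handled by sorting the delimiters by length before building the alternation. The following is a sketch, not a full lexer: the function name tokenize is made up, and the extra |\s+ alternative is an assumption added so that tokens separated only by whitespace are split too.

```python
import re

def tokenize(text, delimiters):
    """Split text on delimiters, keeping them as separate tokens."""
    # Longest first, so '==' is tried before '='
    ordered = sorted(delimiters, key=len, reverse=True)
    alternation = '|'.join(re.escape(d) for d in ordered)
    # Captured delimiters are kept; bare whitespace is matched but not captured
    pattern = r'\s*(%s)\s*|\s+' % alternation
    # Drop empty strings and the None left by the whitespace-only alternative
    return [tok for tok in re.split(pattern, text) if tok]

seps = ('=', '==', '(', ')', '{', '}')
tokens = tokenize('if (x==5) {', seps)
print(tokens)
```

With the delimiters sorted, the order of seps no longer matters.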
As @dylrei mentioned in the comments, this is lexing. The lexing tool PLY (http://www.dabeaz.com/ply/) solved my problem.
Thanks!
I'm using re.split() to separate a string into tokens. Currently the pattern I'm using as the argument is [^\dA-Za-z], which retrieves alphanumeric tokens from the string.
However, I also need to split tokens that contain both numbers and letters into tokens with only one or the other, e.g.
re.split(pattern, "my t0kens")
would return ["my", "t", "0", "kens"].
I'm guessing I might need to use lookahead/lookbehind, but I'm not sure if that's actually necessary or if there's a better way to do it.
Try the findall method instead.
>>> print(re.findall(r'[^\d ]+', "my t0kens"))
['my', 't', 'kens']
>>> print(re.findall(r'\d+', "my t0kens"))
['0']
Edit: Better way from Bart's comment below.
>>> print(re.findall(r'[a-zA-Z]+|\d+', "my t0kens"))
['my', 't', '0', 'kens']
>>> [x for x in re.split(r'\s+|(\d+)',"my t0kens") if x]
['my', 't', '0', 'kens']
When capturing parentheses are used in the pattern, the captured delimiters are also returned. Since you only want to keep the digits and not the spaces, I've left the \s+ outside the parentheses; for a whitespace match the group yields None, which is then filtered out by the list comprehension.
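To make the None entries visible, here is the same split before and after filtering. This is a sketch; split_tokens is a hypothetical helper name, and it only handles whitespace and digit runs, as in the question's example.

```python
import re

def split_tokens(text):
    """Split on whitespace (discarded) and digit runs (kept)."""
    raw = re.split(r'\s+|(\d+)', text)
    # None appears where the whitespace alternative matched and the group did not
    return [tok for tok in raw if tok]

raw = re.split(r'\s+|(\d+)', "my t0kens")
tokens = split_tokens("my t0kens")
print(raw)
print(tokens)
```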
Should be one line of code
re.findall(r'[a-z]+|\d+', 'my t0kens')
Not perfect, but removing space from the list below is easy :-)
re.split(r'([\d ])', 'my t0kens')
['my', ' ', 't', '0', 'kens']
docs: "Split string by the occurrences of pattern. If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list."