I have some questions about the split() description and examples in the Python re documentation:
If there are capturing groups in the separator and it matches at the start of the string, the result will start with an empty string. The same holds for the end of the string:
re.split(r'(\W+)', '...words, words...')
['', '...', 'words', ', ', 'words', '...', '']
In this example there is a capturing group, and it matches at both the start and the end of the string, so the result starts and ends with an empty string. Beyond knowing that this happens, I would like to understand the reasoning. The explanation given is:
That way, separator components are always found at the same relative
indices within the result list.
Could someone expand on this? Relative to what?
My other query is related to this example:
re.split(r'(\W*)', '...words...')
['', '...', '', '', 'w', '', 'o', '', 'r', '', 'd', '', 's', '...', '', '', '']
\w matches any character that can be used in a word in any language (Flag: unicode), or is equivalent to [a-zA-Z0-9_] (Flag: ASCII); \W is the inverse of this. Can someone walk through each of the matches in the example above and explain each one (if possible) in terms of what is matched (\B, \U, ...)?
Added 29/01/2019:
Part of what I am after wasn't stated very clearly (my bad). For the second example, I am curious about the steps taken to reach the result (how the Python re module processed the example). After reading this post on Zero-Length Regex Matches things are clearer, but I would still be interested if anyone can break down the logic up to ['', '...', '', '', 'w', in the results.
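One way to see the steps is to list the individual matches and rebuild the split result from them. This is only a rough sketch, assuming Python 3.7 or later (where split accepts zero-length matches); split_via_finditer is a hypothetical helper written for illustration, not something the re module provides.

```python
import re

def split_via_finditer(pattern, text):
    """Rebuild the list re.split returns from the individual matches."""
    result, last = [], 0
    for m in re.finditer(pattern, text):
        result.append(text[last:m.start()])  # text between the previous match and this one
        result.extend(m.groups())            # text captured by the group(s)
        last = m.end()
    result.append(text[last:])               # whatever follows the final match
    return result

text = '...words...'
spans = [m.span() for m in re.finditer(r'(\W*)', text)]
print(spans)
rebuilt = split_via_finditer(r'(\W*)', text)
print(rebuilt)
```

Reading the spans in order: the match (0, 3) contributes '' (nothing before it) and '...'; the zero-length match (3, 3) contributes '' and ''; the zero-length match (4, 4) contributes 'w' and ''. That accounts for ['', '...', '', '', 'w', ''.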
What it's trying to say is that when you have a capturing group in the delimiter, and it matches at the beginning of the string, the resulting list will always start with an empty string followed by the delimiter. Similarly, if it matches at the end of the string, the list will always end with the delimiter followed by an empty string.
For consistency, this is true even when the delimiter matches an empty string. The input string is considered to have an empty string before the first character and after the last character, and the delimiter will match these. And then they'll be the first and last elements of the resulting list.
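To make "same relative indices" concrete, here is a small sketch assuming a single capturing group: the text pieces land on even indices and the captured separators on odd indices, whether or not the string starts with a separator.

```python
import re

parts = re.split(r'(\W+)', '...words, words...')
text_pieces = parts[0::2]  # even indices: text between separators
separators = parts[1::2]   # odd indices: what the capturing group matched
print(text_pieces)  # ['', 'words', 'words', '']
print(separators)   # ['...', ', ', '...']

# Same layout when the string does not start or end with a separator
parts2 = re.split(r'(\W+)', 'words, words')
print(parts2[0::2])  # ['words', 'words']
print(parts2[1::2])  # [', ']
```

Without the leading empty string, the first list would begin with a separator and the even/odd rule would flip depending on the input.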
Check this:
>>> re.split('(a)', 'awords')
['', 'a', 'words']
>>> re.split('(w)', 'awords')
['a', 'w', 'ords']
>>> re.split('(o)', 'awords')
['aw', 'o', 'rds']
>>> re.split('(s)', 'awords')
['aword', 's', '']
The captured separator is always in second place (index 1).
On the other hand:
>>> re.split('a', 'awords')
['', 'words']
>>> re.split('w', 'awords')
['a', 'ords']
>>> re.split('s', 'awords')
['aword', '']
Almost the same, only without the capturing group, so the separator itself is not included in the result.
import re
txt = "The rain in Spain '69'"
x = re.split(r'\W*', txt)
print(x)
['', 'T', 'h', 'e', '', 'r', 'a', 'i', 'n', '', 'i', 'n', '', 'S', 'p', 'a', 'i', 'n', '', '6', '9', '', '']
txt = "The rain in Spain '69'"
x = re.split(r'\W+', txt)
print(x)
['The', 'rain', 'in', 'Spain', '69', '']
The documentation (python.org):
Another repeating metacharacter is +, which matches one or more times. Pay careful attention to the difference between * and +; * matches zero or more times, so whatever’s being repeated may not be present at all, while + requires at least one occurrence. To use a similar example, ca+t will match 'cat' (1 'a'), 'caaat' (3 'a's), but won’t match 'ct'.
Please explain this difference.
The split function goes through the string looking for the pattern; each time it is found, the text before the match becomes a new element in the result list.
asterisk split
At the start of the string it sees nothing, and that matches the pattern (zero or more), so the list is now [''].
Between the first and second characters there is again a zero-length match, so the list becomes ['', 'T']. This continues at every position, so each character gets its own element.
plus regex split
With this pattern, the empty position before the first character does not match. The first non-word character is only found at the end of the word (the pattern needs at least one). For "a b", splitting on one or more non-word characters ("\W+") gives ['a', 'b'].
So it finds a space at the end of every word and splits there.
If no characters fit the pattern, it does not split; i.e., if the pattern is one or more 'a' and the input is 'ct', the result is just ['ct'].
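The claims above are easy to check. A short sketch, assuming Python 3.7 or later for the \W* case (older versions handle zero-length matches in split differently); the sample string 'ab c' is made up for illustration.

```python
import re

txt = "The rain in Spain '69'"
plus_result = re.split(r'\W+', txt)
print(plus_result)               # ['The', 'rain', 'in', 'Spain', '69', '']

print(re.split(r'\W+', 'a b'))   # ['a', 'b']
print(re.split(r'a+', 'ct'))     # ['ct'] (no match, so no split)

# With *, every position is a zero-length match, so each character is cut off
star_result = re.split(r'\W*', 'ab c')
print(star_result)               # ['', 'a', 'b', '', 'c', '']
```

The trailing '' in the \W+ result comes from the final quote character: it is a separator with nothing after it.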
definition of 'word' in regex
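Word and non-word: a word character is [a-zA-Z0-9_] (in ASCII mode; by default, Python 3 str patterns also accept Unicode letters and digits), so what you would expect in a word. A non-word character is everything else, typically whatever sits between words: [^a-zA-Z0-9_]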
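The difference between the Unicode and ASCII definitions of a word character only shows up with non-ASCII text. A minimal sketch for Python 3, using a made-up sample string:

```python
import re

text = 'caf\u00e9 au lait'  # 'café au lait', written with an escape for portability

unicode_words = re.findall(r'\w+', text)           # default for str patterns
ascii_words = re.findall(r'\w+', text, re.ASCII)   # \w limited to [a-zA-Z0-9_]

print(unicode_words)  # ['café', 'au', 'lait']
print(ascii_words)    # ['caf', 'au', 'lait']
```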
I define a split function as lambda x: re.split('[(|)|.]', x), and when I apply this function to my original strings, it always generates some empty strings. For example:
When applied to string:
(Type).(Terrorist organization)AND(Involved in attacks).(nine-eleven)
The result is:
['', 'Type', '', '', 'Terrorist organization', 'AND', 'Involved in attacks', '', '', 'nine-eleven', '']
I know I can simply remove those empty strings manually, but is there any smart way to get rid of them?
grab as many separators as you can with + instead of only one:
re.split('[().]+', s)
unfortunately, this doesn't suffice, as re.split still yields empty strings when the string starts or ends with a separator:
['', 'Type', 'Terrorist organization', 'AND', 'Involved in attacks', 'nine-eleven', '']
but you can filter them out by using post processing:
[x for x in re.split('[().]+', s) if x]
On the other hand, you could invert the regex and use re.findall to match as many non-separators as possible:
re.findall('[^().]+', s)
this directly yields:
['Type', 'Terrorist organization', 'AND', 'Involved in attacks', 'nine-eleven']
The regexp matches ), ., and ( individually. Since these are next to each other in the input, there's an empty string between them, so the result contains those empty strings.
If you want to treat a sequence of delimiters as a single delimiter, add a + quantifier to the regexp so it matches them as a sequence.
re.split('[|().]+', x)
The empty string at the beginning is because of the empty string before the first (. Similarly, the empty string at the end is from the empty string in the input after the last ). I don't think there's a simple way to prevent these, just remove them from the result.
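One possible way to avoid the leading and trailing empty strings up front is to strip the delimiter characters from both ends before splitting. This is just a sketch of that idea, using str.strip with the same three characters as the pattern:

```python
import re

s = '(Type).(Terrorist organization)AND(Involved in attacks).(nine-eleven)'

# Remove delimiter characters at both ends, then split on runs of delimiters
tokens = re.split('[().]+', s.strip('().'))
print(tokens)
# ['Type', 'Terrorist organization', 'AND', 'Involved in attacks', 'nine-eleven']

# Caveat: a string made only of delimiters still yields one empty string
only_delims = re.split('[().]+', '().'.strip('().'))
print(only_delims)  # ['']
```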
You can filter:
filter(lambda x: x, re.split('[().]+', s))
Test:
import re
s = '(Type).(Terrorist organization)AND(Involved in attacks).(nine-eleven)'
print(list(filter(None, re.split('[().]+', s))))
Result:
['Type', 'Terrorist organization', 'AND', 'Involved in attacks', 'nine-eleven']
print(re.split(r"([0-9]{4})", "Spring2014"))
results in
['Spring', '2014', '']
Where is that extra '' coming from at the end? My desired List is the above, without that extra blank item at the end. It's easy enough to just discard the extra item, but I'd just like to understand why re.split is including it.
You asked re.split() to split the text on a run of 4 digits; the part of the string before '2014' is 'Spring', and the part after it is the empty string ''.
This is documented behaviour:
If there are capturing groups in the separator and it matches at the start of the string, the result will start with an empty string. The same holds for the end of the string:
>>> re.split('(\W+)', '...words, words...')
['', '...', 'words', ', ', 'words', '...', '']
That way, separator components are always found at the same relative indices within the result list (e.g., if there’s one capturing group in the separator, the 0th, the 2nd and so forth).
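A practical consequence, sketched here under the assumption that the capturing group wraps the whole separator pattern: joining the pieces gives back the original string, and the trailing '' can be dropped afterwards if it is not wanted.

```python
import re

original = 'Spring2014'
parts = re.split(r'([0-9]{4})', original)
print(parts)                       # ['Spring', '2014', '']

# Text pieces and captured separators together reconstruct the input
roundtrip = ''.join(parts)
print(roundtrip == original)       # True

# Drop the empty trailing piece if the string ended with a separator
if parts and parts[-1] == '':
    parts.pop()
print(parts)                       # ['Spring', '2014']
```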
I'm confused with the following three patterns, would someone explain it in more detail?
## IPython with Python 2.7.3
In [62]: re.findall(r'[a-z]*',"f233op")
Out[62]: ['f', '', '', '', 'op', ''] ## why does the last '' come out?
In [63]: re.findall(r'([a-z])*',"f233op")
Out[63]: ['f', '', '', '', 'p', ''] ## why does the character 'o' get lost?
In [64]: re.findall(r'([a-z]*)',"f233op")
Out[64]: ['f', '', '', '', 'op', ''] ## what's the different than line 63 above?
Example 1
re.findall(r'[a-z]*',"f233op")
This pattern matches zero or more lower-case alphabet characters. The ZERO-or-more part is key here, since a match of nothing, starting from any index position in the string, is just as valid as a match of f or op. The last empty string returned is the match starting at the end of the string (the position between p and $, the end of the string).
Example 2
re.findall(r'([a-z])*',"f233op")
Now you are matching character groups, each consisting of a single lower-case alphabet character. The o is no longer returned because the search is greedy and a repeated group keeps only its last valid match. So if you changed the string to f233op12fre, the final e would be returned, but not the preceding f or r. Likewise, if you take the p out of your string, you will see that o is returned as a valid match.
Conversely, if you tried to make this regex non-greedy by adding a ? (e.g. ([a-z])*?), the returned set of matches would all be empty strings, since a valid match of nothing has a higher precedence than a valid match of something.
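To see what the group keeps versus what the whole pattern matched, you can pair group 0 with group 1 for every match. A small sketch; whole_and_group is a hypothetical helper name used only for this illustration.

```python
import re

def whole_and_group(pattern, text):
    # Pair each match's full text (group 0) with what capture group 1 retained
    return [(m.group(0), m.group(1)) for m in re.finditer(pattern, text)]

pairs = whole_and_group(r'([a-z])*', 'f233op')
print(pairs)
# [('f', 'f'), ('', None), ('', None), ('', None), ('op', 'p'), ('', None)]
```

The whole match is still 'op'; only the group's last iteration, 'p', survives. re.findall returns the group rather than the whole match, and reports a group that did not participate as ''.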
Example 3
re.findall(r'([a-z]*)',"f233op")
Nothing is different in the matched characters, but now you are returning character groups instead of raw matches. The output of this regex query will be the same as your first example, but you'll notice that if you add an additional matching group, you will suddenly see the results of each match attempt grouped into tuples:
IN : re.findall(r'([a-z]*)([0-9]*)',"f233op")
OUT: [('f', '233'), ('op', ''), ('', '')]
Contrast this with the same pattern, minus the parentheses (groups), and you'll see why they are important:
IN : re.findall(r'[a-z]*[0-9]*',"f233op")
OUT: ['f233', 'op', '']
Also...
It can be useful to plug regex patterns like these into regex diagram generators like Regexplained to see how the pattern matching logic works. For example, as an explanation as to why your regex is always returning empty character string matches, take a look at the difference between the patterns [a-z]* and [a-z]+.
Don't forget to check the Python docs for the re library if you get stuck; they actually give a pretty stellar explanation of the standard regex syntax.
You get the final '' because [a-z]* is matching the empty string at the end.
The character 'o' is missing because you have told re.findall to match groups, and each group has a single character. Put another way, you’re doing the equivalent of
m = re.match(r'([a-z])*', 'op')
m.group(1)
which will return 'p', because that’s the last thing captured by the parens (capture group 1).
Again, you’re matching groups, but this time multi-character groups.
Your surprising results are related to the Regular Expression Quantifier *.
Consider:
[a-z]*
Vs:
[a-z]+
Consider another example that I think better illustrates what you are seeing:
>>> re.findall(r'[a-z]*', '123456789')
['', '', '', '', '', '', '', '', '', '']
There are no characters in the set [a-z] in the string 123456789. Yet, since * means 'zero or more', all character positions 'match' by not matching any characters at that position.
For example, assume you just wanted to test if there were any letters in a string, and you use a regex like so:
>>> re.search(r'[a-z]*', '1234')
<_sre.SRE_Match object at 0x1069b6988> # a 'match' is returned, but this is
# probably not what was intended
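If the goal really is to test for the presence of letters, a + quantifier avoids the always-successful empty match. A minimal sketch; has_lowercase is a hypothetical helper name.

```python
import re

def has_lowercase(text):
    # + requires at least one character, so no letters means no match (None)
    return re.search(r'[a-z]+', text) is not None

star = re.search(r'[a-z]*', '1234')
print(star.span(), repr(star.group()))  # (0, 0) '' : a zero-length match at the start

print(has_lowercase('1234'))  # False
print(has_lowercase('12a4'))  # True
```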
Now consider:
>>> re.findall(r'[a-z]*', '123abc789')
['', '', '', 'abc', '', '', '', '']
Vs:
>>> re.findall(r'([a-z])*', '123abc789')
['', '', '', 'c', '', '', '', '']
The first pattern is [a-z]*. The part [a-z] is a character class matching a single character in the set a-z unless modified. Adding the * quantifier makes it greedily match as many characters as possible when there is at least one, hence the match of 'abc'. It also allows zero characters to count as a match, so a position holding a character outside the set still matches, with an empty result.
The addition of a grouping in ([a-z])* effectively reduces the match in the quantified set back to a single character and the last character matched in the set is returned.
If you want the effect of grouping (say in a more complex pattern) use a non capturing group:
>>> re.findall(r'(?:[a-z])*', '123abc789')
['', '', '', 'abc', '', '', '', '']
In line 63 you're finding all instances of a group (indicated with the parens) of characters of length 1. The * isn't doing much for you here (just causing you to match zero length groups).
In the other examples having the * next to the [a-z], you match adjacent characters of any length.
EDIT
Playing around with this tool may help.
I'm using re.split() to separate a string into tokens. Currently the pattern I'm using as the argument is [^\dA-Za-z], which retrieves alphanumeric tokens from the string.
However, what I need is to also split tokens that have both numbers and letters into tokens with only one or the other, eg.
re.split(pattern, "my t0kens")
would return ["my", "t", "0", "kens"].
I'm guessing I might need to use lookahead/lookbehind, but I'm not sure if that's actually necessary or if there's a better way to do it.
Try the findall method instead.
>>> print(re.findall(r'[^\d ]+', "my t0kens"))
['my', 't', 'kens']
>>> print(re.findall(r'\d+', "my t0kens"))
['0']
Edit: Better way from Bart's comment below.
>>> print(re.findall(r'[a-zA-Z]+|\d+', "my t0kens"))
['my', 't', '0', 'kens']
>>> [x for x in re.split(r'\s+|(\d+)',"my t0kens") if x]
['my', 't', '0', 'kens']
By using capturing parentheses within the pattern, the matched separators are also returned. Since you only want to keep the digits and not the spaces, I've left the \s outside the parentheses, so None is returned for a whitespace match; it can then be filtered out with a simple list comprehension.
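Since the question asks about lookahead/lookbehind: it can be done that way too, though it is not required. The sketch below assumes Python 3.7 or later, where re.split accepts zero-length matches; the pattern is my own illustration and splits on non-alphanumeric runs or on the boundary between a letter and a digit.

```python
import re

# Split on separator runs, or on the empty position between a digit and a
# letter (in either order), found with lookbehind/lookahead assertions
boundary = r'[^\dA-Za-z]+|(?<=\d)(?=[A-Za-z])|(?<=[A-Za-z])(?=\d)'

tokens = re.split(boundary, 'my t0kens')
print(tokens)  # ['my', 't', '0', 'kens']
```

Because nothing is captured, there are no None entries to filter, but leading or trailing separators in the input would still produce empty strings at the ends.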
Should be one line of code
re.findall(r'[a-z]+|\d+', 'my t0kens')
Not perfect, but removing space from the list below is easy :-)
re.split('([\d ])', 'my t0kens')
['my', ' ', 't', '0', 'kens']
docs: "Split string by the occurrences of pattern. If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list."