I want to break a Python string into its characters.
sequenceOfAlphabets = list( string.uppercase )
works.
However, why doesn't
sequenceOfAlphabets = re.split( '.', string.uppercase )
work?
All I get is a list of empty strings, albeit with the expected number of elements.
The '.' matches every character (except a newline, by default) and re.split returns only the text between the matches, which is why you're getting a list of empty strings.
Using list is usually the way to handle something like this but if you want to use regular expressions just use re.findall
sequenceOfAlphabets = re.findall( '.', string.uppercase )
That should give you ['A', 'B', 'C', .... ,'Z']
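For a side-by-side look, here is a minimal sketch of the three approaches. It assumes Python 3, where string.uppercase is spelled string.ascii_uppercase; the variable names are only illustrative.

```python
import re
import string

# Assumption: Python 3, where string.uppercase became string.ascii_uppercase.
letters = string.ascii_uppercase

# Every character is consumed as a delimiter, so only the empty gaps remain.
split_result = re.split('.', letters)

# findall returns the matches themselves, which is what list() gives too.
findall_result = re.findall('.', letters)

print(split_result[:3])                 # ['', '', '']
print(findall_result == list(letters))  # True
```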
Because the delimiter character used by split does not appear in the resulting list. This allows it to be used like:
re.split(',', "foo,bar,baz")
['foo', 'bar', 'baz']
Also, you will find the resulting list from your split code actually contains one extra element, since split returns one more than the number of delimiters found. The above has two commas, so it returns a three-element list.
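The "one more than the number of delimiters" rule is easy to check. This is just a small sketch; the three-letter string stands in for the full alphabet.

```python
import re

text = "foo,bar,baz"
parts = re.split(',', text)

# Two delimiters produce three pieces.
print(parts)                              # ['foo', 'bar', 'baz']
print(len(parts) == text.count(',') + 1)  # True

# The same rule explains the all-delimiter case: 3 matches, 4 empty pieces.
all_delimiters = re.split('.', 'ABC')
print(all_delimiters)                     # ['', '', '', '']
```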
If you can do something with both a built-in function and with regexes, then usually the built-in approach will be faster and more legible.
The regex world is a maze of twisty little passages, populated by purveyors of almost-truths like """The '.' matches every character""" ... which it does, but only when you use the re.DOTALL flag. This information is not cunningly concealed in the fine print of the documentation; it's right there as the FIRST entry of "special characters":
'.'
(Dot.) In the default mode, this matches any character except a newline. If the DOTALL flag has been specified, this matches any character including a newline.
>>> import re
>>> re.findall(".", "fu\nbar")
['f', 'u', 'b', 'a', 'r']
>>>
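For comparison, a short sketch of the same call with and without re.DOTALL (passed here through the flags keyword):

```python
import re

text = "fu\nbar"

# Default mode: '.' skips the newline.
default_matches = re.findall(".", text)

# With DOTALL the newline is matched like any other character.
dotall_matches = re.findall(".", text, flags=re.DOTALL)

print(default_matches)  # ['f', 'u', 'b', 'a', 'r']
print(dotall_matches)   # ['f', 'u', '\n', 'b', 'a', 'r']
```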
Just an FYI, this also works:
sequenceOfAlphabets = [a for a in string.uppercase]
...but that does exactly what list() would do so I don't think it would be any faster (I could be wrong).
You can also create an empty set and use the update method, like so:
destroy_string = set()
destroy_string.update('Stack Overflow')
destroy_string
{'k', ' ', 'S', 'c', 'v', 'o', 'r', 't', 'w', 'e', 'f', 'O', 'l', 'a'}
Note that the result is unordered and duplicates are lost in the set. However, this is still a valid way to decompose a string into a set of its individual members.
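To make the trade-off concrete, here is a small sketch on a string that does contain repeated characters. The dict.fromkeys line is not part of the answer above; it is one possible way to drop duplicates while keeping first-seen order, assuming Python 3.7+ where dicts preserve insertion order.

```python
text = 'hello world'

# A set keeps each character once, in no particular order.
unique_chars = set(text)

# Assumption: Python 3.7+, where dict keys keep insertion order.
ordered_unique = list(dict.fromkeys(text))

print(sorted(unique_chars))  # [' ', 'd', 'e', 'h', 'l', 'o', 'r', 'w']
print(ordered_unique)        # ['h', 'e', 'l', 'o', ' ', 'w', 'r', 'd']
```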
From the documentation:
If capturing parentheses are used in
pattern, then the text of all groups
in the pattern are also returned as
part of the resulting list.
Also note:
If there are capturing groups in the
separator and it matches at the start
of the string, the result will start
with an empty string. The same holds
for the end of the string.
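So, use re.split( '(.)', string.uppercase)[1::2] instead. The captured characters sit at the odd indices of the result, with an empty string between each pair, so a plain [1:-1] slice would still contain those empty strings.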
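It can help to look at the raw result before slicing. In this sketch (a three-letter string stands in for string.uppercase) the captured characters land at the odd indices, so a step of 2 keeps only them.

```python
import re

letters = "ABC"  # short stand-in for string.uppercase

raw = re.split('(.)', letters)
print(raw)        # ['', 'A', '', 'B', '', 'C', '']

# Trimming only the two ends leaves the empty strings in the middle.
print(raw[1:-1])  # ['A', '', 'B', '', 'C']

# Taking every second element, starting at index 1, keeps just the characters.
print(raw[1::2])  # ['A', 'B', 'C']
```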
I want to split this string 'AB4F2D' in ['A', 'B4', 'F2', 'D'].
Essentially, if character is a letter, return the letter, if character is a number return previous character plus present character (luckily there is no number >9 so there is never a X12).
I have tried several combinations but I am not able to find the correct one:
import re

def get_elements(input_string):
    # Candidate patterns tried so far; none gives ['A', 'B4', 'F2', 'D'].
    patterns = [
        r'[A-Z][A-Z0-9]',
        r'[A-Z][A-Z0-9]|[A-Z]',
        r'\D|\D\d',
        r'[A-Z]|[A-Z][0-9]',
        r'[A-Z]{1}|[A-Z0-9]{1,2}'
    ]
    for p in patterns:
        elements = re.findall(p, input_string)
        print(elements)

get_elements('AB4F2D')
results:
['AB', 'F2']
['AB', 'F2', 'D']
['A', 'B', 'F', 'D']
['A', 'B', 'F', 'D']
['A', 'B', '4F', '2D']
Can anyone help? Thanks
\D\d?
One problem with yours is that you put the shorter alternative first, so the longer one never gets a chance. For example, the correct version of your \D|\D\d is \D\d|\D. But just use \D\d?.
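A quick sketch of the ordering problem, using the string from the question. Alternation tries the left branch first, so the shorter branch wins whenever it can match.

```python
import re

text = 'AB4F2D'

# Shorter alternative first: the \D\d branch is never reached.
short_first = re.findall(r'\D|\D\d', text)

# Longer alternative first, and the equivalent optional-digit form.
long_first = re.findall(r'\D\d|\D', text)
optional_digit = re.findall(r'\D\d?', text)

print(short_first)     # ['A', 'B', 'F', 'D']
print(long_first)      # ['A', 'B4', 'F2', 'D']
print(optional_digit)  # ['A', 'B4', 'F2', 'D']
```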
Use Extended Groups
There is special syntax in Python regexes that lets you look ahead without consuming the characters (and much more).
Here is a pattern I would come up with using that:
[A-Z](?![0-9])|[A-Z][0-9]
This matches everything in just one pattern. There might be simpler ways to match it, but I find this to be the most flexible if you want to adjust it later. Read it like this: greedily match a letter if the next character is not a digit. If that is not the case, match a letter followed by a digit.
More info is in the docs. If you want to experiment, I recommend using an online regex tester; make sure to select Python syntax.
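Here is a minimal sketch that runs the lookahead pattern over a few inputs. Only 'AB4F2D' comes from the question; the other samples and the results dict are made up for illustration.

```python
import re

# A letter not followed by a digit, or a letter followed by exactly one digit.
pattern = re.compile(r'[A-Z](?![0-9])|[A-Z][0-9]')

# Hypothetical extra samples alongside the string from the question.
results = {sample: pattern.findall(sample) for sample in ('AB4F2D', 'X9', 'ABC')}

print(results['AB4F2D'])  # ['A', 'B4', 'F2', 'D']
print(results['X9'])      # ['X9']
print(results['ABC'])     # ['A', 'B', 'C']
```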
I have some questions on the split() description/examples from the Python RE documents
If there are capturing groups in the separator and it matches at the start of the string, the result will start with an empty string. The same holds for the end of the string:
re.split(r'(\W+)', '...words, words...')
['', '...', 'words', ', ', 'words', '...', '']
In this example there is a capturing group, it matches at the start and end of the string, thus, the result starts and ends with an empty string. Outside of understanding that this happens, I would like to better understand the reasoning. The explanation for this is:
That way, separator components are always found at the same relative
indices within the result list.
Could someone expand on this? Relative to what?
My other query is related to this example:
re.split(r'(\W*)', '...words...')
['', '...', '', '', 'w', '', 'o', '', 'r', '', 'd', '', 's', '...', '', '', '']
\w will match any character that can be used in any word in any language (Flag:unicode), or will be the equivalent of [a-zA-Z0-9_] (Flag:ASCII), \W is the inverse of this. Can someone talk about each of the matches in the example above, explain each (if possible) in terms of what is matched (\B, \U, ...).
Added 29/01/2019:
Part of what I am after wasn't stated very clearly (my bad). In terms of the second example, I am curious about the steps taken to come to the result (how the Python re module processed the example). After reading this post on Zero-Length Regex Matches things are clearer, but I would still be interested if anyone can break down the logic up to ['', '...', '', '', 'w', in the results.
What it's trying to say is that when you have a capturing group in the delimiter, and it matches the beginning of the string, the resulting list will always start with the delimiter. Similarly, if it matches at the end of the string, the list will always end with the delimiter.
For consistency, this is true even when the delimiter matches an empty string. The input string is considered to have an empty string before the first character and after the last character, and the delimiter will match these. And then they'll be the first and last elements of the resulting list.
Check this:
>>> re.split('(a)', 'awords')
['', 'a', 'words']
>>> re.split('(w)', 'awords')
['a', 'w', 'ords']
>>> re.split('(o)', 'awords')
['aw', 'o', 'rds']
>>> re.split('(s)', 'awords')
['aword', 's', '']
Always at the second place (index of 1).
On the other hand:
>>> re.split('a', 'awords')
['', 'words']
>>> re.split('w', 'awords')
['a', 'ords']
>>> re.split('s', 'awords')
['aword', '']
Almost the same, only without the capturing group, so the delimiter itself is not included.
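One way to read "same relative indices" is that, with a single capturing group, the text pieces always sit at even indices and the separators at odd indices. A small sketch using the example from the documentation:

```python
import re

text = '...words, words...'
parts = re.split(r'(\W+)', text)
print(parts)  # ['', '...', 'words', ', ', 'words', '...', '']

fields = parts[0::2]      # even indices: text between separators
separators = parts[1::2]  # odd indices: the captured separators
print(fields)      # ['', 'words', 'words', '']
print(separators)  # ['...', ', ', '...']

# Nothing is lost, so joining the parts rebuilds the input.
print(''.join(parts) == text)  # True
```

Without the leading and trailing empty strings, the even/odd pattern would shift depending on whether the string happened to start with a separator.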
What causes the '' in ['h', 'e', 'l', 'l', 'o', ''] when you do re.findall('[\w]?', 'hello')? I thought the result would be ['h', 'e', 'l', 'l', 'o'], without the last empty string.
The question mark in your regex ('[\w]?') is responsible for the empty string being one of the returned results.
A question mark is a quantifier meaning "zero or one matches." You are asking for all occurrences of zero or one "word characters". Each letter satisfies the "one word character" case. The empty string at the end satisfies the "zero word characters" case.
Change your regex to '\w' (remove the question mark and superfluous character class brackets) and the output will be as you expect.
Regexes search through strings one character at a time. If a match is found at a character position the regex advances to the next part of the pattern. If a match is not found, the regex tries alternation (different variations) if available. If all alternatives fail, it backtracks and tries alternating the previous part and so on until either an entire match is found or all alternatives fail. This is why some seemingly simple regexes will match a string quickly, but fail to match in exponential time. In your example you only have one part to your pattern.
You are searching for [\w]?. The ? means "one or zero of prior part" and is equivalent to {0,1}. Each of 'h', 'e', 'l', 'l' & 'o' matches [\w]{1}, so the pattern advances and completes for each letter, restarting the regex at the beginning because you asked for all the matches, not just the first. At the end of the string the regex is still trying to find a match. [\w]{1} no longer matches but the alternative [\w]{0} does, so it matches ''. Modern regex engines have a rule to stop zero-length matches from repeating at the same position. The regex tries again, but this time fails because it can't find a match for [\w]{1} and it has already found a match for [\w]{0}. It can't advance through the string because it is at the end, so it exits. It has run the pattern 7 times and found 6 matches, the last one of which was empty.
As pointed out in a comment, if your regex was \w?? (I've removed [ and ] because they aren't necessary in your original regex), it means find zero or one (note the order has changed from before). It will return '', 'h', '', 'e', '', 'l', '', 'l', '', 'o' & ''. This is because it now prefers to find zero but it can't find two zero-length matches in a row without advancing.
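To see where each match happens, re.finditer can report the span of every match. In this sketch the last span is (5, 5): a zero-length match at the very end of the string, which is where the trailing '' comes from.

```python
import re

text = 'hello'

# Each span is (start, end); a zero-length match has start == end.
spans = [m.span() for m in re.finditer(r'\w?', text)]
print(spans)  # [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 5)]

with_optional = re.findall(r'\w?', text)
without_optional = re.findall(r'\w', text)
print(with_optional)     # ['h', 'e', 'l', 'l', 'o', '']
print(without_optional)  # ['h', 'e', 'l', 'l', 'o']
```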
We know that anchors, word boundaries, and lookaround match at a position, rather than matching a character.
Is it possible to split a string by one of the preceding ways with regex (specifically in python)?
For example consider the following string:
"ThisisAtestForchEck,Match IngwithPosition."
So I want the following result (the sub-strings that start with an uppercase letter that is not preceded by a space):
['Thisis', 'Atest', 'Forch' ,'Eck,' ,'Match Ingwith' ,'Position.']
If i split with grouping i get:
>>> re.split(r'([A-Z])',s)
['', 'T', 'hisis', 'A', 'test', 'F', 'orch', 'E', 'ck,', 'M', 'atch ', 'I', 'ngwith', 'P', 'osition.']
And this is the result with look-around :
>>> re.split(r'(?<=[A-Z])',s)
['ThisisAtestForchEck,MatchingwithPosition.']
>>> re.split(r'((?<=[A-Z]))',s)
['ThisisAtestForchEck,MatchingwithPosition.']
>>> re.split(r'((?<=[A-Z])?)',s)
['ThisisAtestForchEck,MatchingwithPosition.']
Note that if I want to split into sub-strings that start with an uppercase letter, whether or not it is preceded by a space, e.g.:
['Thisis', 'Atest', 'Forch', 'Eck,', 'Match ', 'Ingwith', 'Position.']
I can use re.findall, viz.:
>>> re.findall(r'([A-Z][^A-Z]*)',s)
['Thisis', 'Atest', 'Forch', 'Eck,', 'Match ', 'Ingwith', 'Position.']
But what about the first example: is it possible to solve it with re.findall?
A way with re.findall:
re.findall(r'(?:[A-Z]|^[^A-Z\s])[^A-Z\s]*(?:\s+[A-Z][^A-Z]*)*',s)
When you decide to change your approach from split to findall, the first job is to reformulate your requirements: "I want to split the string on each uppercase letter not preceded by a space" => "I want to find substrings made of one or more space-separated parts that each begin with an uppercase letter, except possibly at the start of the string (if the string doesn't start with an uppercase letter)"
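A short sketch of the findall pattern above in action. The second input is a made-up string that starts with a lowercase letter, to exercise the ^[^A-Z\s] branch.

```python
import re

s = "ThisisAtestForchEck,Match IngwithPosition."
pattern = r'(?:[A-Z]|^[^A-Z\s])[^A-Z\s]*(?:\s+[A-Z][^A-Z]*)*'

chunks = re.findall(pattern, s)
print(chunks)
# ['Thisis', 'Atest', 'Forch', 'Eck,', 'Match Ingwith', 'Position.']

# Hypothetical input: a lowercase start still keeps its first chunk.
lower_chunks = re.findall(pattern, "lowerStart Here")
print(lower_chunks)  # ['lower', 'Start Here']
```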
(?<!\s)(?=[A-Z])
You can use this to split with the regex module, as re (before Python 3.7) does not support splitting on zero-width assertions.
import regex
x="ThisisAtestForchEck,Match IngwithPosition."
print(regex.split(r"(?<![\s])(?=[A-Z])",x,flags=regex.VERSION1))
or
print([i for i in regex.split(r"(?<![\s])(?=[A-Z])",x,flags=regex.VERSION1) if i])
See demo.
https://regex101.com/r/sJ9gM7/65
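For what it's worth, on newer interpreters the standard library can do the same thing. This sketch assumes Python 3.7 or later, where re.split accepts patterns that can match an empty string; it is not the regex-module code above, just the same pattern passed to re.

```python
import re

s = "ThisisAtestForchEck,Match IngwithPosition."

# Assumption: Python 3.7+, where re.split may split on empty matches.
pieces = re.split(r'(?<!\s)(?=[A-Z])', s)

# The zero-width match at position 0 yields a leading '', so drop empties.
result = [p for p in pieces if p]
print(result)
# ['Thisis', 'Atest', 'Forch', 'Eck,', 'Match Ingwith', 'Position.']
```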
I know this might be less convenient because of the tuple nature of the result. But I think that this findall finds what you need:
re.findall(r'((?<!\s)[A-Z]([^A-Z]|(?<=\s)[A-Z])*)', s)
## returns [('Thisis', 's'), ('Atest', 't'), ('Forch', 'h'), ('Eck,', ','), ('Match Ingwith', 'h'), ('Position.', '.')]
This can be used in the following list comprehension to give the desired output:
[val[0] for val in re.findall(r'((?<!\s)[A-Z]([^A-Z]|(?<=\s)[A-Z])*)', s)]
## returns ['Thisis', 'Atest', 'Forch', 'Eck,', 'Match Ingwith', 'Position.']
And here is a hack that uses split:
re.split(r'((?<!\s)[A-Z]([^A-Z]|(?<=\s)[A-Z])*)', s)[1::3]
## returns ['Thisis', 'Atest', 'Forch', 'Eck,', 'Match Ingwith', 'Position.']
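If the tuples are the only inconvenience, one possible variation is to make the groups non-capturing. With no capturing groups in the pattern, re.findall returns the whole match for each hit, so neither the list comprehension nor the [1::3] slice is needed. A minimal sketch:

```python
import re

s = "ThisisAtestForchEck,Match IngwithPosition."

# (?:...) groups without capturing, so findall returns whole matches.
pattern = r'(?<!\s)[A-Z](?:[^A-Z]|(?<=\s)[A-Z])*'

matches = re.findall(pattern, s)
print(matches)
# ['Thisis', 'Atest', 'Forch', 'Eck,', 'Match Ingwith', 'Position.']
```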
Try capturing with this pattern:
([A-Z][a-z]*(?: [A-Z][a-z]*)*)
I'm using re.split() to separate a string into tokens. Currently the pattern I'm using as the argument is [^\dA-Za-z], which retrieves alphanumeric tokens from the string.
However, what I need is to also split tokens that have both numbers and letters into tokens with only one or the other, e.g.
re.split(pattern, "my t0kens")
would return ["my", "t", "0", "kens"].
I'm guessing I might need to use lookahead/lookbehind, but I'm not sure if that's actually necessary or if there's a better way to do it.
Try the findall method instead.
>>> print(re.findall(r'[^\d ]+', "my t0kens"))
['my', 't', 'kens']
>>> print(re.findall(r'\d+', "my t0kens"))
['0']
Edit: Better way from Bart's comment below.
>>> print(re.findall(r'[a-zA-Z]+|\d+', "my t0kens"))
['my', 't', '0', 'kens']
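Wrapped up as a function, the alternation approach might look like the sketch below. The name tokenize and the second input are made up for illustration; the pattern only covers ASCII letters and digits.

```python
import re

def tokenize(text):
    """Return runs of ASCII letters and runs of digits as separate tokens."""
    # Hypothetical helper: anything that is neither a letter nor a digit
    # is simply skipped, so it acts as a separator.
    return re.findall(r'[A-Za-z]+|\d+', text)

print(tokenize("my t0kens"))       # ['my', 't', '0', 'kens']
print(tokenize("abc123def, 45!"))  # ['abc', '123', 'def', '45']
```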
>>> [x for x in re.split(r'\s+|(\d+)',"my t0kens") if x]
['my', 't', '0', 'kens']
By using capturing parentheses within the pattern, the matched delimiters are also returned. Since you only want to keep the digits and not the spaces, I've left the \s+ outside the parentheses, so None is returned for a whitespace match, which can then be filtered out with a simple list comprehension.
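Looking at the unfiltered result makes the None visible. A small sketch of the two steps:

```python
import re

# Whitespace matches leave the group unset (None); digit matches fill it.
raw = re.split(r'\s+|(\d+)', "my t0kens")
print(raw)     # ['my', None, 't', '0', 'kens']

# Filtering on truthiness drops None (and any empty strings).
tokens = [x for x in raw if x]
print(tokens)  # ['my', 't', '0', 'kens']
```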
Should be one line of code
re.findall('[a-z]+|[\d]+', 'my t0kens')
Not perfect, but removing space from the list below is easy :-)
re.split('([\d ])', 'my t0kens')
['my', ' ', 't', '0', 'kens']
docs: "Split string by the occurrences of pattern. If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list."