python/regex: match letter only or letter followed by number - python

I want to split this string 'AB4F2D' in ['A', 'B4', 'F2', 'D'].
Essentially, if a character is a letter, return the letter; if it is a digit, return the previous character plus the digit (luckily no number is greater than 9, so there is never an X12).
I have tried several combinations but I am not able to find the correct one:
import re

def get_elements(input_string):
    patterns = [
        r'[A-Z][A-Z0-9]',
        r'[A-Z][A-Z0-9]|[A-Z]',
        r'\D|\D\d',
        r'[A-Z]|[A-Z][0-9]',
        r'[A-Z]{1}|[A-Z0-9]{1,2}'
    ]
    # Try each candidate pattern and print what it finds
    for p in patterns:
        elements = re.findall(p, input_string)
        print(elements)

get_elements('AB4F2D')
Results:
['AB', 'F2']
['AB', 'F2', 'D']
['A', 'B', 'F', 'D']
['A', 'B', 'F', 'D']
['A', 'B', '4F', '2D']
Can anyone help? Thanks

\D\d?
One problem with yours is that you put the shorter alternative first, so the longer one never gets a chance. For example, the correct version of your \D|\D\d is \D\d|\D. But just use \D\d?.
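A quick check of the point about alternation order, using the sample string from the question. The variable names are only for illustration.

```python
import re

s = 'AB4F2D'

# Shorter alternative first: the engine accepts \D and never tries \D\d.
short_first = re.findall(r'\D|\D\d', s)

# Longer alternative first: a letter+digit pair is tried before a lone letter.
long_first = re.findall(r'\D\d|\D', s)

# Equivalent and simpler: one non-digit, optionally followed by one digit.
optional_digit = re.findall(r'\D\d?', s)

print(short_first)     # ['A', 'B', 'F', 'D']
print(long_first)      # ['A', 'B4', 'F2', 'D']
print(optional_digit)  # ['A', 'B4', 'F2', 'D']
```

Note that \D matches any non-digit, not only letters; if the input may contain other characters, [A-Z]\d? is the stricter choice.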

Use Extended Groups
There is special syntax for Python regexes that lets you look ahead without consuming the characters (and much more).
Here is a pattern I would come up with using that:
[A-Z](?![0-9])|[A-Z][0-9]
This matches everything in just one pattern. There might be simpler ways to match it, but I find this to be the most flexible if you want to adjust it later. Read it like this: greedily match a letter if the next character is not a digit. If that is not the case, match a letter followed by a digit.
More info is in the docs for the re module. If you want to experiment, I recommend using an online regex tester; make sure to select Python syntax.
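Here is a small sketch of the lookahead pattern applied to the string from the question. The second pattern is a hypothetical adjustment (not part of the question) that shows how the same shape could be extended to multi-digit numbers.

```python
import re

# A letter not followed by a digit, or a letter followed by exactly one digit.
pattern = re.compile(r'[A-Z](?![0-9])|[A-Z][0-9]')
elements = pattern.findall('AB4F2D')
print(elements)  # ['A', 'B4', 'F2', 'D']

# Hypothetical variation: allow one or more digits after the letter.
multi_digit = re.compile(r'[A-Z](?![0-9])|[A-Z][0-9]+')
extended = multi_digit.findall('AX12B')
print(extended)  # ['A', 'X12', 'B']
```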

Related

How to extract strings between two fixed marks repeatedly

Suppose we have a string:
text="xaxbx"
We want to get everything between the "x" characters. In this case the answer should be "a" and "b".
But when I try
result=re.findall('x(.*?)x',text)
I only get "a", but not "b".
Is there a solution for more general situations such as
text="xaxbxcxdxexfx"?
Thanks!
That is because you "consume" the x's by matching them directly. Look up lookahead and lookbehind. Using these features you get the correct solution:
(?<=x).*?(?=x)
Try it on regex101; you can test example strings there, and the site explains each part of the regex.
In re.findall('x(.*?)x',text) the characters "x" are consumed in the process of making a match. You can use lookahead and lookbehind instead:
import re
text="xaxbx"
re.findall(r"(?<=x)[^x]+(?=x)", text)
It gives:
['a', 'b']
Another approach is to use regular expression grouping:
import re
text = "xaxbxcxdxexfx"
re.findall("x([^x]+)", text)
Output:
['a', 'b', 'c', 'd', 'e', 'f']
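To compare the approaches from the answers above side by side, here is a short sketch on the longer example string. The variable names are illustrative only.

```python
import re

text = "xaxbxcxdxexfx"

# Consuming both delimiters: the closing x of one match cannot also
# serve as the opening x of the next, so every other item is skipped.
consuming = re.findall(r'x(.*?)x', text)

# Lookbehind/lookahead check for the x's without consuming them.
lookaround = re.findall(r'(?<=x)[^x]+(?=x)', text)

# Consuming only the opening x leaves the next x available.
grouped = re.findall(r'x([^x]+)', text)

print(consuming)   # ['a', 'c', 'e']
print(lookaround)  # ['a', 'b', 'c', 'd', 'e', 'f']
print(grouped)     # ['a', 'b', 'c', 'd', 'e', 'f']
```

One difference to keep in mind: the grouped pattern does not require a closing "x", so on "xaxb" it returns ['a', 'b'], while the lookaround pattern returns only ['a'].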

What causes the '' in ['h', 'e', 'l', 'l', 'o', ''] when you do re.findall('[\w]?', 'hello')

What causes the '' in ['h', 'e', 'l', 'l', 'o', ''] when you do re.findall('[\w]?', 'hello'). I thought the result would be ['h', 'e', 'l', 'l', 'o'], without the last empty string.
The question mark in your regex ('[\w]?') is responsible for the empty string being one of the returned results.
A question mark is a quantifier meaning "zero or one match." You are asking for all occurrences of zero or one "word characters". The letters satisfy the "one word character" condition. The empty string satisfies the "zero word characters" condition.
Change your regex to '\w' (remove the question mark and superfluous character class brackets) and the output will be as you expect.
Regexes search through strings one character at a time. If a match is found at a character position, the regex advances to the next part of the pattern. If a match is not found, the regex tries alternatives (different variations) if available. If all alternatives fail, it backtracks and tries the alternatives for the previous part, and so on, until either an entire match is found or all alternatives fail. This is why some seemingly simple regexes match a string quickly but can take exponential time to fail. In your example you only have one part to your pattern.
You are searching for [\w]?. The ? means "one or zero of prior part" and is equivalent to {0,1}. Each of 'h', 'e', 'l', 'l' & 'o' matches [\w]{1}, so the pattern advances and completes for each letter, restarting the regex at the beginning because you asked for all the matches, not just the first. At the end of the string the regex is still trying to find a match. [\w]{1} no longer matches but the alternative [\w]{0} does, so it matches ''. Modern regex engines have a rule to stop zero-length matches from repeating at the same position. The regex tries again, but this time fails because it can't find a match for [\w]{1} and it has already found a match for [\w]{0}. It can't advance through the string because it is at the end, so it exits. It has run the pattern 7 times and found 6 matches, the last one of which was empty.
As pointed out in a comment, if your regex was \w?? (I've removed [ and ] because they aren't necessary in your original regex), it means find zero or one (note the order has changed from before). It will return '', 'h', '', 'e', '', 'l', '', 'l', '', 'o' & ''. This is because it now prefers to find zero but it can't find two zero-length matches in a row without advancing.
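The three variants discussed above can be checked directly. This is a minimal sketch; the result shown for the lazy quantifier assumes Python 3.7 or later, since older versions handled empty matches differently.

```python
import re

# Greedy optional: one character where possible, then '' at the end.
greedy_optional = re.findall(r'\w?', 'hello')

# No quantifier: a word character is required, so no empty match.
required = re.findall(r'\w', 'hello')

# Lazy optional: prefers the empty match, but cannot repeat an empty
# match at the same position, so it alternates '' and a letter.
lazy_optional = re.findall(r'\w??', 'hello')

print(greedy_optional)  # ['h', 'e', 'l', 'l', 'o', '']
print(required)         # ['h', 'e', 'l', 'l', 'o']
print(lazy_optional)    # ['', 'h', '', 'e', '', 'l', '', 'l', '', 'o', '']
```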

Split string by position not character

We know that anchors, word boundaries, and lookaround match at a position, rather than matching a character.
Is it possible to split a string by one of the preceding ways with regex (specifically in python)?
For example consider the following string:
"ThisisAtestForchEck,Match IngwithPosition."
So I want the following result (the substrings that start with an uppercase letter that is not preceded by a space):
['Thisis', 'Atest', 'Forch', 'Eck,', 'Match Ingwith', 'Position.']
If I split with grouping I get:
>>> re.split(r'([A-Z])',s)
['', 'T', 'hisis', 'A', 'test', 'F', 'orch', 'E', 'ck,', 'M', 'atchingwith', 'P', 'osition.']
And this is the result with look-around :
>>> re.split(r'(?<=[A-Z])',s)
['ThisisAtestForchEck,MatchingwithPosition.']
>>> re.split(r'((?<=[A-Z]))',s)
['ThisisAtestForchEck,MatchingwithPosition.']
>>> re.split(r'((?<=[A-Z])?)',s)
['ThisisAtestForchEck,MatchingwithPosition.']
Note that if I want to split into substrings that start with any uppercase letter, including one preceded by a space, e.g.:
['Thisis', 'Atest', 'Forch', 'Eck,', 'Match ', 'Ingwith', 'Position.']
I can use re.findall, viz.:
>>> re.findall(r'([A-Z][^A-Z]*)',s)
['Thisis', 'Atest', 'Forch', 'Eck,', 'Match ', 'Ingwith', 'Position.']
But what about the first example: is it possible to solve it with re.findall?
A way with re.findall:
re.findall(r'(?:[A-Z]|^[^A-Z\s])[^A-Z\s]*(?:\s+[A-Z][^A-Z]*)*',s)
When you change your approach from split to findall, the first job is to reformulate your requirements: "I want to split the string on each uppercase letter not preceded by a space" => "I want to find one or more space-separated substrings that begin with an uppercase letter, except at the start of the string (if the string doesn't start with an uppercase letter)"
(?<!\s)(?=[A-Z])
You can use this to split with the third-party regex module, because re did not support splitting on zero-width matches before Python 3.7.
import regex
x = "ThisisAtestForchEck,Match IngwithPosition."
print(regex.split(r"(?<![\s])(?=[A-Z])", x, flags=regex.VERSION1))
or, to drop any empty strings from the result:
print([i for i in regex.split(r"(?<![\s])(?=[A-Z])", x, flags=regex.VERSION1) if i])
See demo.
https://regex101.com/r/sJ9gM7/65
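Since Python 3.7, the standard re module can also split on a pattern that matches an empty string, so the same lookaround pattern works without the third-party module. A minimal sketch, assuming Python 3.7 or later:

```python
import re

s = "ThisisAtestForchEck,Match IngwithPosition."

# Split at every position that is followed by an uppercase letter
# and is not preceded by whitespace.
parts = re.split(r'(?<!\s)(?=[A-Z])', s)

# The pattern also matches at position 0, which produces a leading
# empty string, so empty pieces are filtered out.
result = [p for p in parts if p]
print(result)
# ['Thisis', 'Atest', 'Forch', 'Eck,', 'Match Ingwith', 'Position.']
```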
I know this might be less convenient because of the tuple nature of the result. But I think that this findall finds what you need:
re.findall(r'((?<!\s)[A-Z]([^A-Z]|(?<=\s)[A-Z])*)', s)
## returns [('Thisis', 's'), ('Atest', 't'), ('Forch', 'h'), ('Eck,', ','), ('Match Ingwith', 'h'), ('Position.', '.')]
This can be used in the following list comprehension to give the desired output:
[val[0] for val in re.findall(r'((?<!\s)[A-Z]([^A-Z]|(?<=\s)[A-Z])*)', s)]
## returns ['Thisis', 'Atest', 'Forch', 'Eck,', 'Match Ingwith', 'Position.']
And here is a hack that uses split:
re.split(r'((?<!\s)[A-Z]([^A-Z]|(?<=\s)[A-Z])*)', s)[1::3]
## returns ['Thisis', 'Atest', 'Forch', 'Eck,', 'Match Ingwith', 'Position.']
Try capturing with this pattern:
([A-Z][a-z]*(?: [A-Z][a-z]*)*)

Python Regular Expressions: Capture lookahead value (capturing text without consuming it)

I wish to use regular expressions to split words into groups of (vowels, not_vowels, more_vowels), using a marker to ensure every word begins and ends with a vowel.
import re
MARKER = "~"
VOWELS = {"a", "e", "i", "o", "u", MARKER}
word = "dog"
if word[0] not in VOWELS:
    word = MARKER+word
if word[-1] not in VOWELS:
    word += MARKER
re.findall("([%]+)([^%]+)([%]+)".replace("%", "".join(VOWELS)), word)
In this example we get:
[('~', 'd', 'o')]
The issue is that I wish the matches to overlap - the last set of vowels should become the first set of the next match. This appears possible with lookaheads, if we replace the regex as follows:
re.findall("([%]+)([^%]+)(?=[%]+)".replace("%", "".join(VOWELS)), word)
We get:
[('~', 'd'), ('o', 'g')]
Which means we are matching what I want. However, it now doesn't return the last set of vowels. The output I want is:
[('~', 'd', 'o'), ('o', 'g', '~')]
I feel this should be possible (if the regex can check for the second set of vowels, I see no reason it can't return them), but I can't find any way of doing it beyond the brute force method, looping through the results after I have them and appending the first character of the next match to the last match, and the last character of the string to the last match. Is there a better way in which I can do this?
The two things that would work would be capturing the lookahead value, or not consuming the text on a match, while capturing the value - I can't find any way of doing either.
I found it just after posting:
re.findall("([%]+)([^%]+)(?=([%]+))".replace("%", "".join(VOWELS)), word)
Adding an extra pair of brackets inside the lookahead means that it becomes a capture itself.
I found this pretty obscure and hard to find - I'm not sure if it's just everyone else found this obvious, but hopefully anyone else in my position will find this more easily in future.
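For reference, here is a self-contained sketch of the capturing-lookahead approach. It uses a fixed vowel string instead of a set so the character class is built in a predictable order, and it assumes the word has already been wrapped in markers as described in the question.

```python
import re

VOWELS = "aeiou~"  # "~" is the marker from the question

# Groups 1 and 2 are consumed; group 3 is captured inside the
# lookahead, so those characters stay available for the next match.
pattern = re.compile("([{v}]+)([^{v}]+)(?=([{v}]+))".format(v=VOWELS))

matches = pattern.findall("~dog~")
print(matches)  # [('~', 'd', 'o'), ('o', 'g', '~')]
```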
I would not try to make the regex engine do this; I would split the string into consonant and vowel chunks, and then generate the overlapping results. This way, you also don't need to hack in markers, assuming you're okay with '' as the "vowel" part when the word doesn't actually begin or end with a vowel.
import re

def overlapping_matches(word):
    pieces = re.split('([^aeiou]+)', word)
    # There are other ways to do this; I'm kinda showing off
    # list() is needed in Python 3, where zip returns an iterator
    return list(zip(pieces[:-2], pieces[1:-1], pieces[2:]))[::2]

overlapping_matches('dog')  # [('', 'd', 'o'), ('o', 'g', '')]
(This still fails if word contains only vowels, but that is trivially corrected if necessary.)

Decompose a Python string into its characters

I want to break a Python string into its characters.
sequenceOfAlphabets = list(string.ascii_uppercase)
works.
However, why doesn't
sequenceOfAlphabets = re.split('.', string.ascii_uppercase)
work?
All I get are empty strings, albeit the expected number of elements.
The '.' matches every character, and re.split returns only what lies between the matches, which is why you're getting a list of empty strings.
Using list is usually the way to handle something like this, but if you want to use regular expressions, just use re.findall:
sequenceOfAlphabets = re.findall('.', string.ascii_uppercase)
That should give you ['A', 'B', 'C', ..., 'Z']
Because the delimiter character used by split does not appear in the resulting list. This allows it to be used like:
re.split(',', "foo,bar,baz")
['foo', 'bar', 'baz']
Also, you will find the resulting list from your split code actually contains one extra element, since split returns one more than the number of delimiters found. The above has two commas, so it returns a three-element list.
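The behaviour described above is easy to confirm. This sketch uses string.ascii_uppercase, which is the Python 3 name for the constant; the variable names are illustrative.

```python
import re
import string

letters = string.ascii_uppercase

# Every character is a delimiter, so only the empty gaps remain:
# 26 delimiters produce 27 empty strings.
split_result = re.split('.', letters)

# findall returns the matches themselves rather than the gaps.
findall_result = re.findall('.', letters)

print(len(split_result))                # 27
print(set(split_result))                # {''}
print(findall_result == list(letters))  # True
```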
If you can do something with both a built-in function and with regexes, then usually the built-in approach will be faster and more legible.
The regex world is a maze of twisty little passages, populated by purveyors of almost-truths like """The '.' matches every character""" ... which it does, but only when you use the re.DOTALL flag. This information is not cunningly concealed in the fine print of the documentation; it's right there as the FIRST entry of "special characters":
'.'
(Dot.) In the default mode, this matches any character except a newline. If the DOTALL flag has been specified, this matches any character including a newline.
>>> import re
>>> re.findall(".", "fu\nbar")
['f', 'u', 'b', 'a', 'r']
>>>
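For comparison, a short sketch of the same call with and without re.DOTALL:

```python
import re

# Default mode: '.' skips the newline.
default_mode = re.findall('.', 'fu\nbar')

# With DOTALL: '.' matches the newline as well.
dotall_mode = re.findall('.', 'fu\nbar', flags=re.DOTALL)

print(default_mode)  # ['f', 'u', 'b', 'a', 'r']
print(dotall_mode)   # ['f', 'u', '\n', 'b', 'a', 'r']
```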
Just an FYI, this also works:
sequenceOfAlphabets = [a for a in string.ascii_uppercase]
...but that does exactly what list() would do so I don't think it would be any faster (I could be wrong).
You can also create an empty set and use the update method, like so:
destroy_string = set()
destroy_string.update('Stack Overflow')
destroy_string
{'k', ' ', 'S', 'c', 'v', 'o', 'r', 't', 'w', 'e', 'f', 'O', 'l', 'a'}
The set is unordered and duplicates are lost, but this is still a valid way to decompose a string into its distinct characters.
From the documentation:
If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list.
Also note:
If there are capturing groups in the separator and it matches at the start of the string, the result will start with an empty string. The same holds for the end of the string.
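So, use re.split('(.)', string.ascii_uppercase)[1::2] instead. The captured characters sit at the odd indices, with an empty string between each pair, so slicing with [1:-1] alone would leave those empty strings in the list.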
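To see exactly what a capturing split returns, here is a small check on a short string (a sketch, with 'ABC' standing in for the full alphabet):

```python
import re

# With a capturing group, the matched characters are kept in the
# result, interleaved with the (empty) text between the matches.
pieces = re.split('(.)', 'ABC')

print(pieces)        # ['', 'A', '', 'B', '', 'C', '']
print(pieces[1:-1])  # ['A', '', 'B', '', 'C']
print(pieces[1::2])  # ['A', 'B', 'C']
```

Trimming only the two ends still leaves the empty strings in the middle; taking every second element starting at index 1 keeps just the captured characters.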
