Python Regular Expressions: Capture lookahead value (capturing text without consuming it)

I wish to use regular expressions to split words into groups of (vowels, not_vowels, more_vowels), using a marker to ensure every word begins and ends with a vowel.
import re
MARKER = "~"
VOWELS = {"a", "e", "i", "o", "u", MARKER}
word = "dog"
if word[0] not in VOWELS:
    word = MARKER + word
if word[-1] not in VOWELS:
    word += MARKER
re.findall("([%]+)([^%]+)([%]+)".replace("%", "".join(VOWELS)), word)
In this example we get:
[('~', 'd', 'o')]
The issue is that I wish the matches to overlap - the last set of vowels should become the first set of the next match. This appears possible with lookaheads, if we replace the regex as follows:
re.findall("([%]+)([^%]+)(?=[%]+)".replace("%", "".join(VOWELS)), word)
We get:
[('~', 'd'), ('o', 'g')]
Which means we are matching what I want. However, it now doesn't return the last set of vowels. The output I want is:
[('~', 'd', 'o'), ('o', 'g', '~')]
I feel this should be possible (if the regex can check for the second set of vowels, I see no reason it can't return them), but I can't find any way of doing it beyond the brute force method, looping through the results after I have them and appending the first character of the next match to the last match, and the last character of the string to the last match. Is there a better way in which I can do this?
The two things that would work would be capturing the lookahead value, or not consuming the text on a match, while capturing the value - I can't find any way of doing either.

I found it just after posting:
re.findall("([%]+)([^%]+)(?=([%]+))".replace("%", "".join(VOWELS)), word)
Adding an extra pair of brackets inside the lookahead means that it becomes a capture itself.
I found this pretty obscure and hard to find. Maybe everyone else finds it obvious, but hopefully anyone else in my position will find it more easily in future.
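As a minimal sketch of the capture-inside-lookahead trick, the snippet below wraps the approach in a helper. The name vowel_groups is hypothetical, the vowels are kept in a string rather than a set so the character class has a fixed order, and the word is assumed to be non-empty.

```python
import re

MARKER = "~"
VOWELS = "aeiou" + MARKER  # a string, so the character class order is stable


def vowel_groups(word):
    """Return overlapping (vowels, non_vowels, vowels) tuples for a non-empty word."""
    # Pad with the marker so the word starts and ends with a "vowel".
    if word[0] not in VOWELS:
        word = MARKER + word
    if word[-1] not in VOWELS:
        word += MARKER
    # The third group sits inside the lookahead: it is captured but not
    # consumed, so the next match can start on the same vowels.
    pattern = "([{v}]+)([^{v}]+)(?=([{v}]+))".format(v=VOWELS)
    return re.findall(pattern, word)


print(vowel_groups("dog"))  # [('~', 'd', 'o'), ('o', 'g', '~')]
```

Because the lookahead is zero-width, the scan resumes right after the consonants, which is what lets the trailing vowels of one tuple reappear as the leading vowels of the next.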

I would not try to make the regex engine do this; I would split the string into consonant and vowel chunks, and then generate the overlapping results. This way, you also don't actually need to hack in markers, assuming you're okay with '' as the "vowel" part when the word doesn't actually begin or end with a vowel.
import re
def overlapping_matches(word):
    pieces = re.split('([^aeiou]+)', word)
    # There are other ways to do this; I'm kinda showing off
    # (on Python 3, zip() returns an iterator, so wrap it in list() before slicing)
    return list(zip(pieces[:-2], pieces[1:-1], pieces[2:]))[::2]
overlapping_matches('dog') # [('', 'd', 'o'), ('o', 'g', '')]
(This still fails if word contains only vowels, but that is trivially corrected if necessary.)

Related

python/regex: match letter only or letter followed by number

I want to split this string 'AB4F2D' in ['A', 'B4', 'F2', 'D'].
Essentially, if character is a letter, return the letter, if character is a number return previous character plus present character (luckily there is no number >9 so there is never a X12).
I have tried several combinations but I am not able to find the correct one:
import re
def get_elements(input_string):
    patterns = [
        r'[A-Z][A-Z0-9]',
        r'[A-Z][A-Z0-9]|[A-Z]',
        r'\D|\D\d',
        r'[A-Z]|[A-Z][0-9]',
        r'[A-Z]{1}|[A-Z0-9]{1,2}'
    ]
    for p in patterns:
        elements = re.findall(p, input_string)
        print(elements)
results:
['AB', 'F2']
['AB', 'F2', 'D']
['A', 'B', 'F', 'D']
['A', 'B', 'F', 'D']
['A', 'B', '4F', '2D']
Can anyone help? Thanks

\D\d?
One problem with yours is that you put the shorter alternative first, so the longer one never gets a chance. For example, the correct version of your \D|\D\d is \D\d|\D. But just use \D\d?.
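A short sketch of the ordering point, using the string from the question. The variable names are only for illustration.

```python
import re

text = 'AB4F2D'

# A non-digit, optionally followed by one digit.
optional_digit = re.findall(r'\D\d?', text)

# Alternation tries branches left to right, so the shorter branch wins here...
short_first = re.findall(r'\D|\D\d', text)

# ...and putting the longer branch first fixes it.
long_first = re.findall(r'\D\d|\D', text)

print(optional_digit)  # ['A', 'B4', 'F2', 'D']
print(short_first)     # ['A', 'B', 'F', 'D']
print(long_first)      # ['A', 'B4', 'F2', 'D']
```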

Use Extended Groups
There is special syntax for python regexes allowing you to match ahead without consuming the characters (and much more).
Here is a pattern I would come up with using that:
[A-Z](?![0-9])|[A-Z][0-9]
This matches everything in just one pattern. There might be simpler ways to match it, but I find this to be the most flexible if you want to adjust it later. Read it like this: greedily match a letter if the next character is not a digit. If that is not the case, match a letter followed by a digit.
More info in the docs. If you want to test around I recommend using a regex tester like this and make sure to select python syntax.
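For reference, here is a small sketch of the lookahead pattern applied to the string from the question. The constant name ELEMENT is just an illustrative choice.

```python
import re

# A letter not followed by a digit, or a letter followed by exactly one digit.
ELEMENT = re.compile(r'[A-Z](?![0-9])|[A-Z][0-9]')

elements = ELEMENT.findall('AB4F2D')
print(elements)  # ['A', 'B4', 'F2', 'D']
```

The negative lookahead (?![0-9]) only inspects the next character; it never becomes part of the match, so the digit is still available to the second branch.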

How can I define a regex that matches multiple patterns anchored at the same position in my search text?

I'm trying to use Python's findall to find all the hyphenated and non-hyphenated identifiers in a string (this is to plug into existing code, so using any constructs beyond findall won't work). If you imagine code like this:
regex = ...
body = "foo-bar foo-bar-stuff stuff foo-word-stuff"
ids = re.compile(regex).findall(body)
I would like the ids value to be ['foo', 'bar', 'word', 'foo-bar', 'foo-bar-stuff', and 'stuff'] (although not bar-stuff, because it's hyphenated, but does not appear as a standalone space-separated identifier). Order of the array/set is not important.
A simple regex which matches the non-hyphenated identifiers is \w+ and one which matches the hyphenated ones is [\w-]+. However, I cannot figure out one which does both simultaneously (I don't have total control over the code, so cannot concatenate the lists together - I would like to do this in one regex if possible).
I have tried \w|[\w-]+ but since the expression is greedy, this misses out bar for example, only matching -bar since foo has already been matched and it won't retry the pattern from the same starting position. I would like to find matches for (for example) both foo and foo-bar which begin (are anchored) at the same string position (which I think findall simply doesn't consider).
I've been trying some tricks such as lookaheads/lookbehinds such as mentioned, but I can't find any way to make them applicable to my scenario.
Any help would be appreciated.

You may use
import re
s = "foo-bar foo-bar-stuff stuff" #=> {'foo-bar', 'foo', 'bar', 'foo-bar-stuff', 'stuff'}
# s = "A-B-C D" # => {'C', 'D', 'A', 'A-B-C', 'B'}
l = re.findall(r'(?<!\S)\w+(?:-\w+)*(?!\S)', s)
res = []
for e in l:
    res.append(e)
    res.extend(e.split('-'))
print(set(res))
Pattern details
(?<!\S) - no non-whitespace right before
\w+ - 1+ word chars
(?:-\w+)* - zero or more repetitions of
    - - a hyphen
    \w+ - 1+ word chars
(?!\S) - no non-whitespace right after.
See the pattern demo online.
Note that to get all items, I split the matches on - and add the pieces to the resulting list. Then, with set, I remove any duplicates.

If you don't have to use a regex, just use split (example below):
result = []
for x in body.split():
    if x not in result:
        result.append(x)
    for y in x.split('-'):
        if y not in result:
            result.append(y)

This is not possible with findall alone, since it finds all non-overlapping matches, as the documentation says.
All you can do is find all longest matches with \w[-\w]* or something like that, and then generate all valid spans out of them (most probably starting from their split on '-').
Please note that \w[-\w]* will also match 123, 1-a, and a--, so something like (?=\D)\w[-\w]* or (?=\D)\w+(?:-\w+)* could be preferable (but you would still have to filter out the 1 from a-1).
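A rough sketch of that two-step idea: find the longest matches, then split them on '-' and filter. The helper name identifiers and the rule "drop pieces that start with a digit" are assumptions made for illustration, not something the question specifies.

```python
import re


def identifiers(body):
    """Collect whole identifiers and their hyphen-separated parts, in order."""
    found = []
    # Longest matches first: a run of word characters joined by single hyphens,
    # starting at a non-digit.
    for match in re.findall(r'(?=\D)\w+(?:-\w+)*', body):
        for candidate in [match] + match.split('-'):
            # Filter out purely positional pieces such as the 1 from a-1.
            if not candidate[0].isdigit() and candidate not in found:
                found.append(candidate)
    return found


print(identifiers("foo-bar 123 a-1 stuff"))
# ['foo-bar', 'foo', 'bar', 'a-1', 'a', 'stuff']
```

This takes more than a single findall call, which is the point of the answer: the overlapping results have to be generated after matching.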

Find every two (non-overlapping) vowels inbetween consonants

Task
You are given a string. It consists of alphanumeric characters, spaces and symbols (+, -).
Your task is to find all the substrings of the original string that contain two or more vowels.
Also, these substrings must lie in between consonants and should contain vowels only.
Input Format: a single line of input containing the string.
Output Format: print the matched substrings in their order of occurrence on separate lines. If no match is found, print -1.
Sample Input:
rabcdeefgyYhFjkIoomnpOeorteeeeet
Sample Output:
ee
Ioo
Oeo
eeeee
The challenge above was taken from https://www.hackerrank.com/challenges/re-findall-re-finditer
The following code passes all the test cases:
import re
sol = re.findall(r"[^aiueo]([aiueoAIUEO]{2,})(?=[^aiueo])", input())
if sol:
    for s in sol:
        print(s)
else:
    print(-1)
The following doesn't.
import re
sol = re.findall(r"[^aiueo]([aiueoAIUEO]{2,})[^aiueo]", input())
if sol:
    for s in sol:
        print(s)
else:
    print(-1)
The only difference between them is the final bit of the regex. I can't understand why the second code fails. I would argue that ?= is useless because by grouping [aiueoAIUEO]{2,} I'm already excluding the trailing character from the capture, but obviously I'm wrong and I can't tell why.
Any help?

The lookahead approach allows the consonant that ends one sequence of vowels to start the next sequence, whereas the non-lookahead approach requires at least two consonants between those sequences (one to end a sequence, another to start the next, as both are matched).
See
import re
print(re.findall(r'[^aiueo]([aiueoAIUEO]{2,})(?=[^aiueo])', 'moomoom'))
print(re.findall(r'[^aiueo]([aiueoAIUEO]{2,})[^aiueo]', 'moomoom'))
Which will output
['oo', 'oo']
['oo']
https://ideone.com/2Wn1TS
To be a bit picky, neither attempt is correct with respect to your problem description: both allow uppercase vowels, spaces and symbols to act as separators. You might want to use [b-df-hj-np-tv-z] instead of [^aeiou] and pass flags=re.I.
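A sketch of the stricter variant suggested above, next to the looser pattern from the question. The names CONSONANT, STRICT and LOOSE are illustrative, and the input 'x1ee2y' is a made-up test string.

```python
import re

CONSONANT = r'[b-df-hj-np-tv-z]'

# Explicit consonant class plus re.I, so digits, spaces and symbols
# no longer count as separators.
STRICT = re.compile(CONSONANT + r'([aeiou]{2,})(?=' + CONSONANT + r')', flags=re.I)

# The pattern from the question: anything that is not a lowercase vowel separates.
LOOSE = re.compile(r'[^aiueo]([aiueoAIUEO]{2,})(?=[^aiueo])')

sample = 'rabcdeefgyYhFjkIoomnpOeorteeeeet'
print(STRICT.findall(sample))     # ['ee', 'Ioo', 'Oeo', 'eeeee']
print(STRICT.findall('moomoom'))  # ['oo', 'oo']

# Digits around the vowels: accepted by the loose pattern, rejected by the strict one.
print(LOOSE.findall('x1ee2y'))    # ['ee']
print(STRICT.findall('x1ee2y'))   # []
```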

Here's an alternative solution that doesn't require using the special () characters for grouping, relying instead on a "positive lookbehind assertion" with (?<=...) RE syntax:
import re
sol=re.findall(r"(?<=[QWRTYPSDFGHJKLZXCVBNMqwrtypsdfghjklzxcvbnm])[AEIOUaeiou]{2,}(?=[QWRTYPSDFGHJKLZXCVBNMqwrtypsdfghjklzxcvbnm])", input())
if sol:
    print(*sol, sep="\n")
else:
    print(-1)

You can use the re.IGNORECASE (or re.I) flag to make the match case-insensitive. You can also exclude digits, whitespace and the + and - characters mentioned in the problem, along with the vowels, from the characters allowed to surround a match.
import re
vowels = re.findall(r"[^aeiou\d\s+-]([aeiou]{2,})(?=[^aeiou\d\s+-])", input(), re.I)
if len(vowels):
    for vowel in vowels:
        print(vowel)
else:
    print("-1")

Split string by position not character

We know that anchors, word boundaries, and lookaround match at a position, rather than matching a character.
Is it possible to split a string by one of the preceding ways with regex (specifically in python)?
For example consider the following string:
"ThisisAtestForchEck,Match IngwithPosition."
So I want the following result (the substrings that start with an uppercase letter that is not preceded by a space):
['Thisis', 'Atest', 'Forch' ,'Eck,' ,'Match Ingwith' ,'Position.']
If i split with grouping i get:
>>> re.split(r'([A-Z])',s)
['', 'T', 'hisis', 'A', 'test', 'F', 'orch', 'E', 'ck,', 'M', 'atchingwith', 'P', 'osition.']
And this is the result with look-around :
>>> re.split(r'(?<=[A-Z])',s)
['ThisisAtestForchEck,MatchingwithPosition.']
>>> re.split(r'((?<=[A-Z]))',s)
['ThisisAtestForchEck,MatchingwithPosition.']
>>> re.split(r'((?<=[A-Z])?)',s)
['ThisisAtestForchEck,MatchingwithPosition.']
Note that if I want to split at every uppercase letter, including ones preceded by a space, e.g.:
['Thisis', 'Atest', 'Forch', 'Eck,', 'Match ', 'Ingwith', 'Position.']
I can use re.findall, viz.:
>>> re.findall(r'([A-Z][^A-Z]*)',s)
['Thisis', 'Atest', 'Forch', 'Eck,', 'Match ', 'Ingwith', 'Position.']
But what about the first example: is it possible to solve it with re.findall?

A way with re.findall:
re.findall(r'(?:[A-Z]|^[^A-Z\s])[^A-Z\s]*(?:\s+[A-Z][^A-Z]*)*',s)
When you decide to change your approach from split to findall, the first job is to reformulate your requirements: "I want to split the string on each uppercase letter not preceded by a space" => "I want to find one or more space-separated substrings that each begin with an uppercase letter, except at the start of the string (if the string doesn't start with an uppercase letter)".

(?<!\s)(?=[A-Z])
You can use this to split with the regex module, as re did not support splitting on zero-width assertions before Python 3.7.
import regex
x="ThisisAtestForchEck,Match IngwithPosition."
print(regex.split(r"(?<![\s])(?=[A-Z])",x,flags=regex.VERSION1))
or
print([i for i in regex.split(r"(?<![\s])(?=[A-Z])",x,flags=regex.VERSION1) if i])
See demo.
https://regex101.com/r/sJ9gM7/65
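On Python 3.7 or later, the standard re module also splits on empty matches, so the same zero-width pattern should work without the third-party regex package. A minimal sketch, assuming that Python version:

```python
import re

s = "ThisisAtestForchEck,Match IngwithPosition."

# Zero-width split point: just before an uppercase letter that does not
# follow whitespace. The match at position 0 yields a leading '', so empty
# pieces are filtered out.
pieces = [p for p in re.split(r'(?<!\s)(?=[A-Z])', s) if p]

print(pieces)
# ['Thisis', 'Atest', 'Forch', 'Eck,', 'Match Ingwith', 'Position.']
```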

I know this might be less convenient because of the tuple nature of the result. But I think that this findall finds what you need:
re.findall(r'((?<!\s)[A-Z]([^A-Z]|(?<=\s)[A-Z])*)', s)
## returns [('Thisis', 's'), ('Atest', 't'), ('Forch', 'h'), ('Eck,', ','), ('Match Ingwith', 'h'), ('Position.', '.')]
This can be used in the following list comprehension to give the desired output:
[val[0] for val in re.findall(r'((?<!\s)[A-Z]([^A-Z]|(?<=\s)[A-Z])*)', s)]
## returns ['Thisis', 'Atest', 'Forch', 'Eck,', 'Match Ingwith', 'Position.']
And here is a hack that uses split:
re.split(r'((?<!\s)[A-Z]([^A-Z]|(?<=\s)[A-Z])*)', s)[1::3]
## returns ['Thisis', 'Atest', 'Forch', 'Eck,', 'Match Ingwith', 'Position.']

Try capturing with this pattern:
([A-Z][a-z]*(?: [A-Z][a-z]*)*)
Demo

Decompose a Python string into its characters

I want to break a Python string into its characters.
sequenceOfAlphabets = list( string.ascii_uppercase )
works.
However, why doesn't
sequenceOfAlphabets = re.split( '.', string.ascii_uppercase )
work?
All I get are empty strings, albeit the expected number of elements.

The '.' matches every character and re.split returns only what lies between the matches, which is why you're getting a list of empty strings.
Using list is usually the way to handle something like this, but if you want to use regular expressions just use re.findall:
sequenceOfAlphabets = re.findall( '.', string.ascii_uppercase )
That should give you ['A', 'B', 'C', .... ,'Z']
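To make the difference concrete, here is a small sketch comparing the two calls. It uses string.ascii_uppercase, which is available on both Python 2 and Python 3.

```python
import re
import string

letters = string.ascii_uppercase

empties = re.split('.', letters)    # every character acts as a delimiter
matched = re.findall('.', letters)  # every character is returned as a match

print(len(empties), set(empties))   # 27 {''}
print(matched == list(letters))     # True
```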

Because the delimiter character used by split does not appear in the resulting list. This allows it to be used like:
re.split(',', "foo,bar,baz")
['foo', 'bar', 'baz']
Also, you will find the resulting list from your split code actually contains one extra element, since split returns one more than the number of delimiters found. The above has two commas, so it returns a three-element list.

If you can do something with both a built-in function and with regexes, then usually the built-in approach will be faster and more legible.
The regex world is a maze of twisty little passages, populated by purveyors of almost-truths like """The '.' matches every character""" ... which it does, but only when you use the re.DOTALL flag. This information is not cunningly concealed in the fine print of the documentation; it's right there as the FIRST entry of "special characters":
'.'
(Dot.) In the default mode, this matches any character except a newline. If the DOTALL flag has been specified, this matches any character including a newline.
>>> import re
>>> re.findall(".", "fu\nbar")
['f', 'u', 'b', 'a', 'r']
>>>
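For contrast, a short sketch of the same call with and without re.DOTALL:

```python
import re

text = "fu\nbar"

default_mode = re.findall(".", text)                 # newline is skipped
dotall_mode = re.findall(".", text, flags=re.DOTALL)  # newline is matched too

print(default_mode)  # ['f', 'u', 'b', 'a', 'r']
print(dotall_mode)   # ['f', 'u', '\n', 'b', 'a', 'r']
```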

Just an FYI, this also works:
sequenceOfAlphabets = [a for a in string.ascii_uppercase]
...but that does exactly what list() would do so I don't think it would be any faster (I could be wrong).

You can also create an empty set and use the update method, like so:
destroy_string = set()
destroy_string.update('Stack Overflow')
destroy_string
{'k', ' ', 'S', 'c', 'v', 'o', 'r', 't', 'w', 'e', 'f', 'O', 'l', 'a'}
Admittedly, a set is unordered and drops duplicates, but this is still a valid way to decompose a string into its distinct characters.

From the documentation:
If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list.
Also note:
If there are capturing groups in the separator and it matches at the start of the string, the result will start with an empty string. The same holds for the end of the string.
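So, use re.split( '(.)', string.ascii_uppercase)[1::2] instead: the captured characters sit at the odd indices, with an empty string before, between and after them.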
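A quick sketch of what the capturing split returns. Trimming only the first and last element would still leave the empty strings between the letters, so the slice here takes every second element instead.

```python
import re
import string

parts = re.split('(.)', string.ascii_uppercase)
# parts alternates between '' (the text between matches) and a captured letter:
# ['', 'A', '', 'B', '', ..., 'Z', '']
letters = parts[1::2]

print(parts[:5])                                  # ['', 'A', '', 'B', '']
print(letters == list(string.ascii_uppercase))    # True
```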
