Task
You are given a string . It consists of alphanumeric characters, spaces and symbols(+,-).
Your task is to find all the substrings of the origina string that contain two or more vowels.
Also, these substrings must lie in between consonants and should contain vowels only.
Input Format: a single line of input containing string .
Output Format: print the matched substrings in their order of occurrence on separate
lines. If no match is found, print -1.
Sample Input:
rabcdeefgyYhFjkIoomnpOeorteeeeet
Sample Output:
ee
Ioo
Oeo
eeeee
The challenge above was taken from https://www.hackerrank.com/challenges/re-findall-re-finditer
The following code passes all the test cases:
import re
sol = re.findall(r"[^aiueo]([aiueoAIUEO]{2,})(?=[^aiueo])", input())
if sol:
for s in sol:
print(s)
else:
print(-1)
The following doesn't.
import re
sol = re.findall(r"[^aiueo]([aiueoAIUEO]{2,})[^aiueo]", input())
if sol:
for s in sol:
print(s)
else:
print(-1)
The only difference beteen them is the final bit of the regex. I can't understand why the second code fails. I would argue that ?= is useless because by grouping [aiueoAIUEO]{2,} I'm already excluding it from capture, but obviously I'm wrong and I can't tell why.
Any help?
The lookahead approach allows the consonant that ends one sequence of vowels to start the next sequence, whereas the non-lookahead approach requires at least two consonants between those sequences (one to end a sequence, another to start the next, as both are matched).
See
import re
print(re.findall(r'[^aiueo]([aiueoAIUEO]{2,})(?=[^aiueo])', 'moomoom'))
print(re.findall(r'[^aiueo]([aiueoAIUEO]{2,})[^aiueo]', 'moomoom'))
Which will output
['oo', 'oo']
['oo']
https://ideone.com/2Wn1TS
To be a bit picky, both attempts aren't correct regarding your problem description. They allow for uppercase vowels, spaces and symbols to be separators. You might want to use [b-df-hj-np-tv-z] instead of [^aeiou] and use flags=re.I
Here's an alternative solution that doesn't require using the special () characters for grouping, relying instead on a "positive lookbehind assertion" with (?<=...) RE syntax:
import re
sol=re.findall(r"(?<=[QWRTYPSDFGHJKLZXCVBNMqwrtypsdfghjklzxcvbnm])[AEIOUaeiou]{2,}(?=[QWRTYPSDFGHJKLZXCVBNMqwrtypsdfghjklzxcvbnm])", input())
if sol:
print(*sol, sep="\n")
else:
print(-1)
You can use re.IGNORECASE our re.I flag to ignore case sensitivity. Also, you can avoid vowels, digits from alphanumeric characters and space, + and - characters mentioned in the problem.
import re
vowels = re.findall(r"[^aeiou\d\s+-]([aeiou]{2,})(?=[^aeiou\d\s+-])", input(), re.I)
if len(vowels):
for vowel in vowels:
print(vowel)
else:
print("-1")
Related
For example:
The string BINGO!
should be B I N G O!
I have already tried:
s = "BINGO!"
print(" ".join(s[::1]))
I'd use regex: re.sub(r'(?<=[a-zA-Z])(?=[a-zA-Z])', ' ', 'BINGO!')
This basically says, "for every empty string in 'BINGO!' that both follows a letter and precedes a letter, substitute a space."
I like what esramish had said, however that is a bit overcomplicated for your goal. It is as easy as this:
string = "BINGO"
print(" ".join(string))
This returns your desired output.
You can use a single lookahead to assert a char a-z to the right and make the pattern case insensitive.
In the replacement use the full match and a space using \g<0>
(?i)[A-Z](?=[A-Z])
(?i) Inline modifier for a case insensitive match
[A-Z] Match a single char A-Z
(?=[A-Z]) Positive lookahead, assert a char A-Z to the right of the current position
See a regex demo and a Python demo.
Example
import re
print(re.sub(r'(?i)[A-Z](?=[A-Z])', '\g<0> ', 'BINGO!'))
Output
B I N G O!
I want to split strings only by suffixes. For example, I would like to be able to split dord word to [dor,wor].
I though that \wd would search for words that end with d. However this does not produce the expected results
import re
re.split(r'\wd',"dord word")
['do', ' wo', '']
How can I split by suffixes?
x='dord word'
import re
print re.split(r"d\b",x)
or
print [i for i in re.split(r"d\b",x) if i] #if you dont want null strings.
Try this.
As a better way you can use re.findall and use r'\b(\w+)d\b' as your regex to find the rest of word before d:
>>> re.findall(r'\b(\w+)d\b',s)
['dor', 'wor']
Since \w also captures digits and underscore, I would define a word consisting of just letters with a [a-zA-Z] character class:
print [x.group(1) for x in re.finditer(r"\b([a-zA-Z]+)d\b","dord word")]
See demo
If you're wondering why your original approach didn't work,
re.split(r'\wd',"dord word")
It finds all instances of a letter/number/underscore before a "d" and splits on what it finds. So it did this:
do[rd] wo[rd]
and split on the strings in brackets, removing them.
Also note that this could split in the middle of words, so:
re.split(r'\wd', "said tendentious")
would split the second word in two.
I want to find all possible substrings inside a string with the following requirement: The substring starts with N, the next letter is anything but P, and the next letter is S or T
With the test string "NNSTL", I would like to get as results "NNS" and "NST"
Is this possible with Regex?
Try the following regex:
N[^P\W\d_][ST]
The first character is N, the next character is none of (^) P, a non-letter (\W), a digit (\d) or underscore (_). The last letter is either S or T. I'm assuming the second character must be a letter.
EDIT
The above regex will only match the first instance in the string "NNSTL" because it will then start the next potential match at position 3: "TL". If you truly want both results at the same time use the following:
(?=(N[^P\W\d_][ST])).
The substring will be in group 1 instead of the whole pattern match which will only be the first character.
You can do this with the re module:
import re
Here's a possible search string:
my_txt = 'NfT foo NxS bar baz NPT'
So we use the regular expression that first looks for an N, any character other than a P, and a character that is either an S or a T.
regex = 'N[^P][ST]'
and using re.findall:
found = re.findall(regex, my_txt)
and found returns:
['NfT', 'NxS']
Yes. The regex snippet is: "N[^P][ST]"
Plug it in to any regex module methods from here: http://docs.python.org/2/library/re.html
Explanation:
N matches a literal "N".
[^P] is a set, where the caret ("^") denotes inverse (so, it matches anything not in the set.
[ST] is another set, where it matches either an "S" or a "T".
I'm looking for logic that searches a capital word in a line in python, like I have a *.txt:
aaa
adadad
DDD_AAA
Dasdf Daa
I would like to search only for the lines which have 2 or more capital words after each other (in the above case DDD_AAA).
Regex are the way to go:
import re
pattern = "([A-Z]+_[A-Z]+)" # matches CAPITALS_CAPITALS only
match = re.search(pattern, text)
if match: print match.group(0)
You have to figure out what exactly you are looking for though.
Presuming your definition of a "capital word" is a string of two or more uppercase alphabet (non-numeric) characters, i.e. [A-Z], and assuming that what separates one "capital word" from another is not quite the complementary set ([^A-Z]) but rather the complementary set to the alphanumeric characters, i.e. [^a-zA-Z0-9], you're looking for a regex like
\b[A-Z]{2,}\b.*\b[A-Z]{2,}\b
I say like because the above is not exactly correct: \b counts the underscore _ as a word character. Replace the \bs with [^a-zA-Z0-9]s wrapped in lookaround assertions (to make them zero-width, like \b), and you have the correct regex:
(?<=[^a-zA-Z0-9]|^)[A-Z]{2,}(?=[^a-zA-Z0-9]).*(?<=[^a-zA-Z0-9])[A-Z]{2,}(?=[^a-zA-Z0-9]|$)
Here's a Rubular demo.
Finally, if you consider a one-character word, a "word", then simply do away with the {2,} quantifiers:
(?<=[^a-zA-Z0-9]|^)[A-Z]+(?=[^a-zA-Z0-9]).*(?<=[^a-zA-Z0-9])[A-Z]+(?=[^a-zA-Z0-9]|$)
print re.findall("[A-Z][a-zA-Z]*\s[A-Z][a-zA-Z]",search_text)
should work to match 2 words that both start with a capital letter
for your specific example
lines = []
for line in file:
if re.findall("[A-Z][a-zA-Z]*\s[A-Z][a-zA-Z]",line): lines.append(line)
print lines
basically look into regexes!
Here you go:
import re
lines = open("r1.txt").readlines()
for line in lines:
if re.match(r'[^\w]*[A-Z]+[ _][A-Z]+[^\w]*', line) is not None:
print line.strip("\n")
Output:
DDD_AAA
Let's say I want to remove all duplicate chars (of a particular char) in a string using regular expressions. This is simple -
import re
re.sub("a*", "a", "aaaa") # gives 'a'
What if I want to replace all duplicate chars (i.e. a,z) with that respective char? How do I do this?
import re
re.sub('[a-z]*', <what_to_put_here>, 'aabb') # should give 'ab'
re.sub('[a-z]*', <what_to_put_here>, 'abbccddeeffgg') # should give 'abcdefg'
NOTE: I know this remove duplicate approach can be better tackled with a hashtable or some O(n^2) algo, but I want to explore this using regexes
>>> import re
>>> re.sub(r'([a-z])\1+', r'\1', 'ffffffbbbbbbbqqq')
'fbq'
The () around the [a-z] specify a capture group, and then the \1 (a backreference) in both the pattern and the replacement refer to the contents of the first capture group.
Thus, the regex reads "find a letter, followed by one or more occurrences of that same letter" and then entire found portion is replaced with a single occurrence of the found letter.
On side note...
Your example code for just a is actually buggy:
>>> re.sub('a*', 'a', 'aaabbbccc')
'abababacacaca'
You really would want to use 'a+' for your regex instead of 'a*', since the * operator matches "0 or more" occurrences, and thus will match empty strings in between two non-a characters, whereas the + operator matches "1 or more".
In case you are also interested in removing duplicates of non-contiguous occurrences you have to wrap things in a loop, e.g. like this
s="ababacbdefefbcdefde"
while re.search(r'([a-z])(.*)\1', s):
s= re.sub(r'([a-z])(.*)\1', r'\1\2', s)
print s # prints 'abcdef'
A solution including all category:
re.sub(r'(.)\1+', r'\1', 'aaaaabbbbbb[[[[[')
gives:
'ab['