How to find words with two vowels in the middle in a string using python regular expression
this is my code :
s= "reading a book is great"
print(re.findall(r'\b(\w+[aeiyou]+w)\b',s))
expected output: [book]
my output: [book],[grea]
Replace + in your regex with {2}: + repeats the previous token one or more times, whereas {2} repeats it exactly two times.
print(re.findall(r'\b\w[aeiou]{2}\w\b',s))
For both upper and lowercase vowels.
print(re.findall(r'\b\w[aeiouAEIOU]{2}\w\b',s))
You could use [A-Za-z] instead of \w if you don't want a digit or _ to appear before or after the vowels, because \w also matches _ and digits.
print(re.findall(r'\b[A-Za-z][aeiouAEIOU]{2}[A-Za-z]\b',s))
Add case-insensitive modifier (?i) or re.IGNORECASE to do a case-insensitive match.
print(re.findall(r'(?i)\b[a-z][aeiou]{2}[a-z]\b',s))
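To see the difference between + and {2} concretely, here is a small sketch run against the question's sample sentence. The variable names and the second sample string are illustrative assumptions, not part of the original post.

```python
import re

s = "reading a book is great"

# {2}: one word character, exactly two vowels, one word character.
exact = re.findall(r'\b\w[aeiou]{2}\w\b', s)
print(exact)  # ['book']

# +: any run of word characters followed by one or more vowels,
# so longer words such as "great" slip through as well.
loose = re.findall(r'\b\w+[aeiou]+\w\b', s)
print(loose)  # ['book', 'great']

# Case-insensitive variant on a made-up mixed-case string.
mixed_case = re.findall(r'(?i)\b[a-z][aeiou]{2}[a-z]\b', "BOOK Read tool")
print(mixed_case)  # ['BOOK', 'Read', 'tool']
```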
For example:
The string BINGO!
should be B I N G O!
I have already tried:
s = "BINGO!"
print(" ".join(s[::1]))
I'd use regex: re.sub(r'(?<=[a-zA-Z])(?=[a-zA-Z])', ' ', 'BINGO!')
This basically says, "for every empty string in 'BINGO!' that both follows a letter and precedes a letter, substitute a space."
I like what esramish had said, however that is a bit overcomplicated for your goal. It is as easy as this:
string = "BINGO"
print(" ".join(string))
This returns your desired output.
You can use a single lookahead to assert a char a-z to the right and make the pattern case insensitive.
In the replacement use the full match and a space using \g<0>
(?i)[A-Z](?=[A-Z])
(?i) Inline modifier for a case insensitive match
[A-Z] Match a single char A-Z
(?=[A-Z]) Positive lookahead, assert a char A-Z to the right of the current position
Example
import re
print(re.sub(r'(?i)[A-Z](?=[A-Z])', r'\g<0> ', 'BINGO!'))
Output
B I N G O!
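The three suggestions in this thread behave differently once punctuation is involved. The following sketch runs them side by side on the question's input; the variable names are assumptions made for illustration.

```python
import re

text = "BINGO!"

# Plain join puts a space between every character, punctuation included.
joined = " ".join(text)
print(joined)      # B I N G O !

# Lookbehind + lookahead: insert a space only at positions between two letters.
lookaround = re.sub(r'(?<=[a-zA-Z])(?=[a-zA-Z])', ' ', text)
print(lookaround)  # B I N G O!

# Single lookahead: re-emit each letter that is followed by a letter, plus a space.
lookahead = re.sub(r'(?i)[A-Z](?=[A-Z])', r'\g<0> ', text)
print(lookahead)   # B I N G O!
```

If the input never contains punctuation, the plain join is enough; the regex versions matter only when non-letters should stay attached to the preceding letter.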
I am trying to match a string of length 8 containing both digits and letters (it cannot have just digits or just letters) using re.findall. The string can start with either a letter or a digit, followed by any combination.
e.g.-
Input String: The reference number is 896av6uf and not 87987647 or ahduhsjs or hn0.
Output: ['896av6uf','a96bv6u0']
I came up with this regex r'([a-z]+[\d]+[\w]*|[\d]+[a-z]+[\w]*)' however it is giving me strings with less than 8 characters as well.
I need to modify the regex to return strings with exactly 8 chars that contain both letters and digits.
You can use
\b(?=[a-zA-Z]*[0-9])(?=[0-9]*[a-zA-Z])[a-zA-Z0-9]{8}\b
\b(?=[^\W\d_]*\d)(?=\d*[^\W\d_])[^\W_]{8}\b
The first one only supports ASCII letters, while the second one supports all Unicode letters and digits since [^\W\d_] matches any Unicode letter and \d matches any Unicode digit (as the re.UNICODE option is used by default in Python 3.x).
Details:
\b - a word boundary
(?=[a-zA-Z]*[0-9]) - after any 0+ ASCII letters, there must be a digit
(?=[0-9]*[a-zA-Z]) - after any 0+ digits, there must be an ASCII letter
[a-zA-Z0-9]{8} - eight ASCII alphanumeric chars
\b - a word boundary
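As a quick check, both patterns can be run against the sentence from the question. This is a minimal sketch; the variable names and the accented sample string are assumptions added for illustration.

```python
import re

text = "The reference number is 896av6uf and not 87987647 or ahduhsjs or hn0."

ascii_pattern = re.compile(r'\b(?=[a-zA-Z]*[0-9])(?=[0-9]*[a-zA-Z])[a-zA-Z0-9]{8}\b')
unicode_pattern = re.compile(r'\b(?=[^\W\d_]*\d)(?=\d*[^\W\d_])[^\W_]{8}\b')

ascii_hits = ascii_pattern.findall(text)
unicode_hits = unicode_pattern.findall(text)
print(ascii_hits)    # ['896av6uf']
print(unicode_hits)  # ['896av6uf']

# Only the second pattern accepts non-ASCII letters (made-up sample token).
accented = "çé12àb34"
ascii_accented = ascii_pattern.findall(accented)
unicode_accented = unicode_pattern.findall(accented)
print(ascii_accented)    # []
print(unicode_accented)  # ['çé12àb34']
```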
First, let's write an expression that finds words made of lowercase letters and digits that are 8 characters long:
\b[a-z\d]{8}\b
The next condition is that the word must contain both a letter and a digit:
[a-z] and \d
Now for the challenging part: combining these into one statement. The easiest way might be to just split them up, but we can use some look-aheads to get this to work. Each look-ahead scans only [a-z\d]*, so it cannot run past the end of the current word:
\b(?=[a-z\d]*[a-z])(?=[a-z\d]*\d)[a-z\d]{8}\b
I'm sure there's a tidier way of doing this, but this will work.
You can use \b\w{8}\b
It does not guarantee that you will have both digits AND letters, but does guarantee that you will have exactly eight characters, surrounded by word boundaries (e.g. whitespace, start/end of line).
You can try it in one of the online playgrounds such as this one: https://regex101.com/
The meat of the matching is done by \w{8}, which means 8 word characters (letters of either case, digits, or underscore). \b means "word boundary".
If you want only digits and lowercase letters, replace this by \b[a-z0-9]{8}\b
You can then further check for existence of both digits AND letter, e.g. by using filter:
list(filter(lambda s: re.search(r'[0-9]', s) and re.search(r'[a-z]', s), result))
result is what you get from re.findall() .
So bottom line, I would use:
list(filter(lambda s: re.search(r'[0-9]', s) and re.search(r'[a-z]', s), re.findall(r'\b[a-z0-9]{8}\b', str)))
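The two-step approach (match the shape first, then filter) can be sketched as follows on the question's sample sentence. The names text, candidates and mixed are assumptions for illustration; a list comprehension stands in for filter here, which is a stylistic choice rather than a change in logic.

```python
import re

text = "The reference number is 896av6uf and not 87987647 or ahduhsjs or hn0."

# Step 1: every standalone token made of exactly 8 lowercase letters or digits.
candidates = re.findall(r'\b[a-z0-9]{8}\b', text)
print(candidates)  # ['896av6uf', '87987647', 'ahduhsjs']

# Step 2: keep only tokens that contain at least one digit and at least one letter.
mixed = [w for w in candidates
         if re.search(r'[0-9]', w) and re.search(r'[a-z]', w)]
print(mixed)  # ['896av6uf']
```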
A more compact solution than others have suggested is this:
((?![A-Za-z]{8}|[0-9]{8})[0-9A-Za-z]{8})
This guarantees that the found matches are 8 characters in length and that they cannot be only digits or only letters.
Breakdown:
(?![A-Za-z]{8}|[0-9]{8}) = This is a negative lookahead that means the match can't be a string of 8 digits or 8 letters.
[0-9A-Za-z]{8} = Simple regex saying the match needs to be alphanumeric and 8 characters in length.
Test Case:
Input: 12345678 abcdefgh i8D0jT5Yu6Ms1GNmrmaUjicc1s9D93aQBj3WWWjww54gkiKqOd7Ytkl0MliJy9xadAgcev8b2UKdfGRDOpxRPm30dw9GeEz3WPRO 1234567890987654321 qwertyuiopasdfghjklzxcvbnm
import re
pattern = re.compile(r'((?![A-Za-z]{8}|\d{8})[A-Za-z\d]{8})')
test = input()
match = pattern.findall(test)
print(match)
Output: ['i8D0jT5Y', 'u6Ms1GNm', 'maUjicc1', 's9D93aQB', 'j3WWWjww', '54gkiKqO', 'd7Ytkl0M', 'liJy9xad', 'Agcev8b2', 'DOpxRPm3', '0dw9GeEz']
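Note that the pattern above has no word boundaries, so it also returns 8-character slices taken from longer tokens, as the output shows. If only standalone 8-character tokens are wanted, one possible variation (an assumption, not part of the original answer) is to wrap it in \b. The sample string below is made up for the comparison.

```python
import re

text = "id 896av6uf, token abc12345xyz"

unanchored = re.compile(r'(?![A-Za-z]{8}|[0-9]{8})[0-9A-Za-z]{8}')
anchored = re.compile(r'\b(?![A-Za-z]{8}|[0-9]{8})[0-9A-Za-z]{8}\b')

unanchored_hits = unanchored.findall(text)
anchored_hits = anchored.findall(text)

# Without \b, the first 8 characters of the 11-character token are returned too.
print(unanchored_hits)  # ['896av6uf', 'abc12345']
# With \b, only the token that is exactly 8 characters long survives.
print(anchored_hits)    # ['896av6uf']
```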
Entering a string
I used findall to find words that consist only of letters and numbers (the number of words to be found is not specified).
I wrote:
words = re.findall(r"\w*\s", x)  # x is the input string
If I enter "asdf1234 cdef11dfe a = 1 b = 2",
the string is separated into asdf1234, cdef11dfe, a =, 1, b =, 2.
I would like to pick out only asdf1234 and cdef11dfe.
How should I write the regular expression?
Try /[a-zA-Z0-9]{2,}/.
This looks for any alphanumeric character ([a-zA-Z0-9]) at least 2 times in a row ({2,}). That would be the only way to filter out the one letter words of the string.
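Here is a short sketch of that suggestion applied to the question's input. It also illustrates why the case of the range endpoints matters: a range written as A-z spans the ASCII characters between Z and a, which include the underscore. The variable names and the 'ab_cd' sample are assumptions for illustration.

```python
import re

x = "asdf1234 cdef11dfe a = 1 b = 2"

# Runs of at least two ASCII letters or digits.
words = re.findall(r'[a-zA-Z0-9]{2,}', x)
print(words)  # ['asdf1234', 'cdef11dfe']

# [A-z] silently includes [ \ ] ^ _ and the backtick, so the underscore is not a separator.
wide = re.findall(r'[A-z]+', 'ab_cd')
narrow = re.findall(r'[A-Za-z]+', 'ab_cd')
print(wide)    # ['ab_cd']
print(narrow)  # ['ab', 'cd']
```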
The problem with \w is that it includes underscores.
This one should work : (?<![\"=\w])(?:[^\W_]+)(?![\"=\w])
Explanation
(?:[^\W_]+) Anything but a non-word character or an underscore, at least one time (non-capturing group)
(?<![\"=\w]) not preceded by ", = or a word character
(?![\"=\w]) not followed by ", = or a word character
Sample code
import re
regex = r"(?<![\"=\w])(?:[^\W_]+)(?![\"=\w])"
test_str = "a01a b02 c03 e dfdfd abcdef=2 b=3 e=4 c=\"a b\" aaa=2f f=\"asdf 12af\""
matches = re.finditer(regex, test_str)
for matchNum, match in enumerate(matches):
    print(match.group())
I have a regular expression given by a word and a range of words following.
For example:
pattern = 'word \\w+ \\w+ \\w+'
result = [text[match.start():match.end()] for match in re.finditer(pattern, text)]
How could I modify the regular expression so that it also matches when fewer words follow than the pattern asks for? For example, if the word is near the end of the string, I would like that shorter span to be returned too.
It should always return the longest possible match.
Your 'word \\w+ \\w+ \\w+' regex matches word and then 3 more "words" (space separated). You want to match 0 to 3 of these words. Use
re.findall(r'word(?:\s+\w+){0,3}', s)
Or, to allow any non-word chars in between the "words", replace \s with \W:
re.findall(r'word(?:\W+\w+){0,3}', s)
Details:
word - word string
(?:\s+\w+){0,3} - 0 to 3 sequences (the {0,3} is a greedy version of the limiting quantifier, it will match as many occurrences as possible) of:
\s+ - 1+ whitespaces
\w+ - 1 or more word chars.
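A minimal sketch of the {0,3} quantifier in action; the sample string s is a made-up assumption chosen so that one occurrence of word has more than three followers and another has only one.

```python
import re

s = "a word one two three four and word end"

# Greedy {0,3}: take up to three following words, fewer if the string runs out.
spans = re.findall(r'word(?:\s+\w+){0,3}', s)
print(spans)  # ['word one two three', 'word end']
```

Because the group is non-capturing, findall returns the whole match rather than the contents of a group.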
I have a list of words such as:
l = """abca
bcab
aaba
cccc
cbac
babb
"""
I want to find the words that have the same first and last character, and that the two middle characters are different from the first/last character.
The desired final result:
['abca', 'bcab', 'cbac']
I tried this:
re.findall('^(.)..\\1$', l, re.MULTILINE)
But it returns all of the unwanted words as well.
I thought of using [^...] somehow, but I couldn't figure it out.
There's a way of doing this with sets (to filter the results from the search above), but I'm looking for a regex.
Is it possible?
Edit: fixed to use negative lookahead assertions instead of negative lookbehind assertions. Read the comments from @AlanMoore and @bukzor for explanations.
>>> [s for s in l.splitlines() if re.search(r'^(.)(?!\1).(?!\1).\1$', s)]
['abca', 'bcab', 'cbac']
The solution uses negative lookahead assertions, which mean 'match at the current position only if it isn't followed by a match for something else.' Here the lookahead assertion is (?!\1): the next character is matched only if it isn't the same as the first character, which was captured in group 1.
There are lots of ways to do this. Here's probably the simplest:
re.findall(r'''
\b #The beginning of a word (a word boundary)
([a-z]) #One letter
(?!\w*\1\B) #The rest of this word may not contain the starting letter except at the end of the word
[a-z]* #Any number of other letters
\1 #The starting letter we captured in step 2
\b #The end of the word (another word boundary)
''', l, re.IGNORECASE | re.VERBOSE)
If you want, you can loosen the requirements a bit by replacing [a-z] with \w. That will allow numbers and underscores as well as letters. You can also restrict it to 4-character words by changing the last * in the pattern to {2}.
Note also that I'm not very familiar with Python, so I'm assuming your usage of findall is correct.
Are you required to use regexes? This is a much more pythonic way to do the same thing:
l = """abca
bcab
aaba
cccc
cbac
babb
"""
for word in l.split():
    # same first and last character, and that character absent from the middle
    if word[-1] == word[0] and word[0] not in word[1:-1]:
        print(word)
Here's how I would do it:
result = re.findall(r"\b([a-z])(?:(?!\1)[a-z]){2}\1\b", subject)
This is similar to Justin's answer, except where that one does a one-time lookahead, this one checks each letter as it's consumed.
\b
([a-z]) # Capture the first letter.
(?:
(?!\1) # Unless it's the same as the first letter...
[a-z] # ...consume another letter.
){2}
\1
\b
I don't know what your real data looks like, so I chose [a-z] arbitrarily because it works with your sample data. I limited the length to four characters for the same reason. As with Justin's answer, you may want to change the {2} to *, + or some other quantifier.
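One detail worth checking when running this in Python: when a pattern contains a capturing group, re.findall returns the group's contents rather than the whole match. A small sketch using the question's word list shows the difference and one way around it (re.finditer with group(0)); the variable names are assumptions for illustration.

```python
import re

l = """abca
bcab
aaba
cccc
cbac
babb
"""

pattern = re.compile(r"\b([a-z])(?:(?!\1)[a-z]){2}\1\b")

# findall returns only what group 1 captured: the first letter of each match.
first_letters = pattern.findall(l)
print(first_letters)  # ['a', 'b', 'c']

# finditer gives match objects, so group(0) recovers the full word.
whole_words = [m.group(0) for m in pattern.finditer(l)]
print(whole_words)  # ['abca', 'bcab', 'cbac']
```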
To heck with regexes.
[
    word
    for word in words.splitlines()
    if word[0] == word[-1]
    and word[0] not in word[1:-1]
]
You can do this with negative lookahead or lookbehind assertions; see http://docs.python.org/library/re.html for details.
Not a Python guru, but maybe this
re.findall(r'^(.)(?:(?!\1).)*\1$', l, re.MULTILINE)
expanded (use multi-line modifier):
^ # begin of line
(.) # capture grp 1, any char except newline
(?: # grouping
(?!\1) # Lookahead assertion, not what was in capture group 1 (backref to 1)
. # this is ok, grab any char except newline
)* # end grouping, do 0 or more times (could force length with {2} instead of *)
\1 # backref to group 1, this character must be the same
$ # end of line
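In Python, a pattern with backreferences like this one is safest written as a raw string: in an ordinary string literal, \1 is an octal escape for the character with code 1, so the regex engine never sees a backreference. The sketch below, using the question's word list and illustrative variable names, shows the effect.

```python
import re

l = """abca
bcab
aaba
cccc
cbac
babb
"""

# In a normal string literal, '\1' is the single control character U+0001.
print('\1' == '\x01')  # True

# Non-raw pattern: looks for a literal U+0001 character, so nothing matches.
non_raw = re.findall('^(.)(?:(?!\1).)*\1$', l, re.MULTILINE)
print(non_raw)  # []

# Raw pattern: \1 reaches the regex engine as a backreference to group 1.
raw = [m.group(0) for m in re.finditer(r'^(.)(?:(?!\1).)*\1$', l, re.MULTILINE)]
print(raw)  # ['abca', 'bcab', 'cbac']
```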