I want to accept words that include these characters: '.:?!'
I tried this pattern: r'*(.|:|?|!)*' but it does not work.
How can I implement this using Python's re module?
The code I'd like to run looks like this:
import re
pattern = r'*(.|:|?|!)*'
word = '.Flask'
match = re.match(pattern, word)
if match:
    print('yes')
For example, I want to accept these words: '.flask', 'flask.', '!flask', 'flask!', ...
and also words with non-ASCII characters, so I want to include words such as .日本語 and 日本語. as well.
That's why I wanted to use the * symbol.
If you want to match words starting or ending with one of these characters, this regex should fit your needs:
pattern = r'[.:?!][^ ]*|[^ ]*[.:?!]'
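A variant with word boundaries is also possible, but note that \b only matches next to a word character, so it will not match when the punctuation sits at the very start or end of the string (as in '.flask' or 'flask.'):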
pattern = r'\b[.:?!][^ ]*\b|\b[^ ]*[.:?!]\b'
Explanation:
[.:?!][^ ]* matches words beginning with one of [.:?!], followed by any characters other than a space.
[^ ]*[.:?!] matches words made of any characters other than a space and ending with one of [.:?!].
\b matches the word boundaries.
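A minimal sketch of how the first pattern could be checked against the example words from the question. It assumes whole-string matching with re.fullmatch (the question used re.match, which only anchors at the start), and the names PATTERN and accepts are hypothetical helpers, not part of the original answer.

```python
import re

# First pattern from the answer: punctuation at the start OR at the end.
PATTERN = re.compile(r'[.:?!][^ ]*|[^ ]*[.:?!]')

def accepts(word):
    """Return True if the whole word starts or ends with one of . : ? !"""
    return PATTERN.fullmatch(word) is not None

# Words from the question, plus one that should be rejected.
for word in ['.flask', 'flask.', '!flask', 'flask!', '.日本語', '日本語.', 'flask']:
    print(word, accepts(word))
```

Because [^ ] excludes only the space character, non-ASCII words such as 日本語 are accepted without any extra flags.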
Related
I have strings of different lengths which have to be checked for substrings matching the patterns "tion", "ex", "ph", "ost", "ast", "ist", ignoring case and position (prefix, suffix, or middle of a word). The matching words have to be returned in a new list, rather than just the matching substrings. With the code below I get a new list of the matching substrings, but not the full matching words.
import re
def latin_ish_words(text):
    pattern = re.compile(r"tion|ex|ph|ost|ast|ist")
    matches = pattern.findall(text)
    return matches
latin_ish_words("This functions as expected")
With the results as follows: ['tion', 'ex']
I was wondering how I could return the whole word, rather than the matching substring, in the new list?
You can use one of the following patterns:
pattern=re.compile(r"\w*?(?:tion|ex|ph|ost|ast|ist)\w*")
pattern=re.compile(r"[a-zA-Z]*?(?:tion|ex|ph|ost|ast|ist)[a-zA-Z]*")
pattern=re.compile(r"[^\W\d_]*?(?:tion|ex|ph|ost|ast|ist)[^\W\d_]*")
The first regex matches:
\w*? - zero or more but as few as possible word chars
(?:tion|ex|ph|ost|ast|ist) - one of the strings
\w* - zero or more but as many as possible word chars
The [a-zA-Z] part will match only ASCII letters, and [^\W\d_] will match any Unicode letters.
Mind the use of the non-capturing group with re.findall, as otherwise, the captured substrings will also get their way into the output list.
If you need to only match letter words, and you need to match them as whole words, add word boundaries, r"\b[a-zA-Z]*?(?:tion|ex|ph|ost|ast|ist)[a-zA-Z]*\b".
See the Python demo:
import re
def latin_ish_words(text):
    pattern = re.compile(r"\w*?(?:tion|ex|ph|ost|ast|ist)\w*")
    return pattern.findall(text)
print(latin_ish_words("This functions as expected"))
# => ['functions', 'expected']
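The sketch below illustrates two points from this answer: why the group should be non-capturing when used with re.findall, and how the three character-class variants differ. The sample string with an accented letter, an underscore and a digit is a made-up input for illustration, and the variable names are hypothetical.

```python
import re

text = "This functions as expected"

# Capturing group: findall returns only what the group captured.
capturing = re.findall(r"\w*?(tion|ex|ph|ost|ast|ist)\w*", text)
# Non-capturing group: findall returns the whole match.
whole_words = re.findall(r"\w*?(?:tion|ex|ph|ost|ast|ist)\w*", text)
print(capturing)    # ['tion', 'ex']
print(whole_words)  # ['functions', 'expected']

# Hypothetical sample: accented letters (written as escapes), underscore, digit.
sample = "t\u00e9l\u00e9phone test_post9"
alts = r"(?:tion|ex|ph|ost|ast|ist)"

# \w: letters (including Unicode), digits and underscore.
word_chars = re.findall(r"\w*?" + alts + r"\w*", sample)
# [a-zA-Z]: ASCII letters only, so the accented word is cut short.
ascii_letters = re.findall(r"[a-zA-Z]*?" + alts + r"[a-zA-Z]*", sample)
# [^\W\d_]: any Unicode letter, but no digits or underscores.
unicode_letters = re.findall(r"[^\W\d_]*?" + alts + r"[^\W\d_]*", sample)

print(word_chars)
print(ascii_letters)
print(unicode_letters)
```

Which variant is appropriate depends on whether digits, underscores and non-ASCII letters should count as part of a word in your data.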
You asked for matching "ignoring the case", but
pattern=re.compile(r"tion|ex|ph|ost|ast|ist")
matches=pattern.findall(text)
does not do that. Consider the following example:
import re
pattern=re.compile(r"tion|ex|ph|ost|ast|ist")
text = "SCREAMING TEXT"
print(pattern.findall(text))
output
[]
even though EX should have been found. Add the re.IGNORECASE flag like so:
import re
pattern=re.compile(r"tion|ex|ph|ost|ast|ist", re.IGNORECASE)
text = "SCREAMING TEXT"
print(pattern.findall(text))
output
['EX']
For a case-insensitive match with whitespace boundaries you could use:
(?i)(?<!\S)\w*(?:tion|ex|ph|[oia]st)\w*(?!\S)
The pattern matches:
(?i) Inline modifier for a case insensitive match (Or use re.I)
(?<!\S) Assert a whitespace boundary to the left
\w* Match optional word characters
(?: Non capture group
tion|ex|ph|[oia]st Match either tion, ex, ph, or one of ost, ist, ast using a character class
) Close non capture group
\w* Match optional word characters
(?!\S) Assert a whitespace boundary to the right
import re
def latin_ish_words(text):
    pattern = r"(?i)(?<!\S)\w*(?:tion|ex|ph|[oia]st)\w*(?!\S)"
    return re.findall(pattern, text)
print(latin_ish_words("This functions as expected"))
Output
['functions', 'expected']
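To see what the whitespace boundaries change compared with \b, here is a small comparison sketch. The input sentence is a made-up example with punctuation attached to some words, and the variable names are hypothetical.

```python
import re

# Hypothetical input with punctuation touching some of the words.
text = "Functions, as EXPECTED (ex-post)"
core = r"\w*(?:tion|ex|ph|[oia]st)\w*"

# Whitespace boundaries: the match must be surrounded by whitespace
# or by the start/end of the string.
whitespace_bounded = re.findall(r"(?i)(?<!\S)" + core + r"(?!\S)", text)

# Word boundaries: punctuation next to the word is allowed.
word_bounded = re.findall(r"(?i)\b" + core + r"\b", text)

print(whitespace_bounded)  # ['EXPECTED']
print(word_bounded)        # ['Functions', 'EXPECTED', 'ex', 'post']
```

With whitespace boundaries, "Functions," is skipped because of the trailing comma, so \b may be the better fit for ordinary punctuated prose.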
I am cleaning a text and I would like to remove all hyphens and special characters, except for the hyphens between two words, as in tic-tacs or popcorn-flavoured.
I wrote the regex below, but it removes every hyphen.
import re
text = 'popcorn-flavoured---'
new_text = re.sub(r'[^a-zA-Z0-9]+', '', text)
new_text
I would like the output to be:
popcorn-flavoured
You can replace matches of the regular expression
-(?!\w)|(?<!\w)-
with empty strings.
The regex will match hyphens that are not both preceded and followed by a word character.
Python's regex engine performs the following operations.
- match '-'
(?!\w) the following character is not a word character
|
(?<!\w) the previous character is not a word character
- match '-'
(?!\w) is a negative lookahead; (?<!\w) is a negative lookbehind.
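A minimal sketch of this substitution in Python, assuming only hyphens need to be handled (other special characters from the question are not touched here). The names stray_hyphen and strip_stray_hyphens are hypothetical.

```python
import re

# A hyphen not followed by a word character, or not preceded by one.
stray_hyphen = re.compile(r"-(?!\w)|(?<!\w)-")

def strip_stray_hyphens(text):
    """Remove every hyphen that is not between two word characters."""
    return stray_hyphen.sub("", text)

print(strip_stray_hyphens("popcorn-flavoured---"))  # popcorn-flavoured
print(strip_stray_hyphens("--tic-tacs -"))
```

The lookarounds inspect the original string, so a run of hyphens is removed one hyphen at a time while the hyphen inside tic-tacs is left alone.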
As an alternative, you could capture a hyphen between word characters and keep that group in the replacement. Using an alternation, you could match the hyphens that you want to remove.
(\w+-\w+)|-+
Explanation
(\w+-\w+) Capture group 1, match 1+ word chars, hyphen and 1+ word chars
| Or
-+ Match 1+ times a hyphen
Example code
import re
regex = r"(\w+-\w+)|-+"
test_str = ("popcorn-flavoured---\n"
            "tic-tacs")
result = re.sub(regex, r"\1", test_str)
print(result)
Output
popcorn-flavoured
tic-tacs
You can use findall() to get that part that matches your criteria.
new_text = re.findall(r'[\w]+[-]?[\w]+', text)[0]
Play around with it with other inputs.
You can use
p = re.compile(r"(\b[-]\b)|[-]")
result = p.sub(lambda m: (m.group(1) if m.group(1) else ""), text)
Test
With:
text='popcorn-flavoured---'
Output (result):
popcorn-flavoured
Explanation
This pattern detects hyphens between two words:
(\b[-]\b)
This pattern detects all hyphens
[-]
Regex substitution
p.sub(lambda m: (m.group(1) if m.group(1) else ""), text)
When a hyphen is detected between two words, m.group(1) exists, so we keep the hyphen as it is.
else ""
This branch applies when the match was triggered by [-]; the hyphen is then replaced with "", which removes it.
I want to capture the digits that follow a certain phrase and also the start and end index of the number of interest.
Here is an example:
text = "The special code is 034567 in this particular case and not 98675"
In this example, I am interested in capturing the number 034567, which comes after the phrase special code, and also the start and end index of the number 034567.
My code is:
p = re.compile('special code \s\w.\s (\d+)')
re.search(p, text)
But this does not match anything. Could you explain why and how I should correct it?
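In your expression, the literal space followed by \s requires two whitespace characters after code (a space, then any whitespace). Then \w. matches one word character followed by any character other than a line break. Finally, \s followed by a literal space again requires two whitespace characters. The sample text has only a single space at each of those positions, so the pattern cannot match.
You may simply match one or more whitespace characters between words using \s+, and to match any chunk of non-whitespace characters you may use \S+ instead of \w. (that is, \w followed by a dot).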
Use
import re
text = 'The special code is 034567 in this particular case and not 98675'
p = re.compile(r'special code\s+\S+\s+(\d+)')
m = p.search(text)
if m:
    print(m.group(1))  # 034567
    print(m.span(1))   # (20, 26)
See the Python demo and the regex demo.
Use re.findall with a capture group:
import re
text = "The special code is 034567 in this particular case and not 98675"
matches = re.findall(r'\bspecial code (?:\S+\s+)?(\d+)', text)
print(matches)
This prints:
['034567']
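re.findall returns only the captured strings, not their positions. If the start and end index are needed as well, as the question asks, a sketch using re.finditer with the same pattern could look like this; the function name find_codes is hypothetical.

```python
import re

text = "The special code is 034567 in this particular case and not 98675"
pattern = re.compile(r'\bspecial code (?:\S+\s+)?(\d+)')

def find_codes(s):
    """Return (number, start, end) for each number following 'special code'."""
    return [(m.group(1), m.start(1), m.end(1)) for m in pattern.finditer(s)]

for number, start, end in find_codes(text):
    # Slicing with the reported indices gives back the captured number.
    print(number, start, end, text[start:end] == number)
```

m.start(1) and m.end(1) refer to capture group 1, so the indices cover only the digits and not the phrase in front of them.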
For example, if I have a string:
"I really..like something like....that"
I want to get only:
"I something"
Any suggestion?
If you want to do it with a regex, you can use the regex below to remove them:
r"[^\.\s]+\.{2,}[^\.\s]+"g
Regex explanation:
[^\.\s]+ one or more characters other than '.' and whitespace
\.{2,} two or more '.'
[^\.\s]+ one or more characters other than '.' and whitespace
or this regex:
r"\s+[^\.\s]+\.{2,}[^\.\s]+"g
  ^^^ to also include the whitespace before each such combination
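In Python the trailing g flag is not needed, since re.sub replaces every occurrence by default. A minimal sketch applying the second regex to the string from the question (the variable names are hypothetical):

```python
import re

text = "I really..like something like....that"

# Whitespace, then a word, two or more dots, and another word.
dotted = re.compile(r"\s+[^\.\s]+\.{2,}[^\.\s]+")

result = dotted.sub("", text)
print(result)  # I something
```

Because this variant requires leading whitespace, a dotted combination at the very start of the string would not be removed; the first regex without \s+ covers that case but leaves the surrounding spaces in place.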
If you want to use a regex explicitly you could use the following.
import re
string = "I really..like something like....that"
with_dots = re.findall(r'\w+[.]+\w+', string)
split = string.split()
without_dots = [word for word in split if word not in with_dots]
The solution provided by rawing also works in this case.
' '.join(word for word in text.split() if '..' not in word)
You may very well use boundaries in combination with lookarounds:
\b(?<!\.)(\w+)\b(?!\.)
See a demo on regex101.com.
Broken apart, this says:
\b # a word boundary
(?<!\.) # followed by a negative lookbehind making sure there's no '.' behind
\w+ # 1+ word characters
\b # another word boundary
(?!\.) # a negative lookahead making sure there's no '.' ahead
As a whole Python snippet:
import re
string = "I really..like something like....that"
rx = re.compile(r'\b(?<!\.)(\w+)\b(?!\.)')
print(rx.findall(string))
# ['I', 'something']
myreg = r"\babcb\"
mystr = "sdf ddabc"
mystr1 = "sdf abc"
print(re.findall(myreg, mystr))   # []
print(re.findall(myreg, mystr1))  # ['abc']
Up to this point everything works as expected, but if I change my regex and my string to:
myreg = r"\b\+abcb\"
mystr = "sdf +abc"
print(re.findall(myreg, mystr))  # [] but I would like to get ['+abc']
I have noticed that using the following works as expected.
myreg = "^\\+abc$"
mystr = "+abc"
mystr1 = "-+abc"
My question: Is it possible to achieve the same results as above without splitting the string?
Best regards,
Gabriel
Your problem is the following:
\b is defined as the boundary between a \w and a \W character (or vice versa).
\w matches the character set [a-zA-Z0-9_] (plus, for str patterns in Python 3, other Unicode letters and digits unless re.ASCII is set)
\W matches the complement, [^a-zA-Z0-9_], i.e. every character that \w does not match
'+' is not matched by \w, and neither is whitespace, so there is no word boundary between the whitespace and the '+', and \b fails there.
To get what you want, you should remove the first \b from your pattern:
import re
string = "sdf +abc"
pattern = r"\+abc\b"
matches = re.findall(pattern, string)
print(matches)
['+abc']
There are two problems
Before your + in +abc, there is no word boundary, so \b cannot match.
Your regex \b\+abcb\ tries to match a literal b character after abc (typo).
Word Boundaries
The word boundary \b matches at a position between a word character (letter, digit, or underscore) and a non-word character (or the beginning or end of the string). For instance, in sdf +abc there is a word boundary between the + and the a, but none between the space and the +.
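To make the boundary positions concrete, this short sketch lists every index where \b matches in the string from the question. The variable names are hypothetical.

```python
import re

text = "sdf +abc"

# \b is a zero-width match, so collect the positions where it succeeds.
boundaries = [m.start() for m in re.finditer(r"\b", text)]
print(boundaries)  # [0, 3, 5, 8]

# Show each boundary as a '|' inserted into the string.
for i in boundaries:
    print(repr(text[:i] + "|" + text[i:]))
```

Index 4, the position just before the +, is missing from the list, which is why a pattern starting with \b\+ cannot match here.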
Solution: Make your Own boundary
If you want to match +abc but only when it is not preceded by a word character (for instance, you don't want it inside def+abc), then you can make your own boundary with a lookbehind:
(?<!\w)\+abc
This says "match +abc if it is not preceded by a word character (letter, digit, underscore)".
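A minimal sketch of the lookbehind boundary in Python. Adding \b after abc is an assumption carried over from the asker's original intent of matching abc as a whole word; the test strings other than "sdf +abc" are made up for illustration.

```python
import re

# Custom left boundary via lookbehind; \b on the right (an assumption here).
pattern = re.compile(r"(?<!\w)\+abc\b")

samples = ["sdf +abc", "+abc", "def+abc", "sdf +abcd"]
results = {s: pattern.findall(s) for s in samples}

for s, found in results.items():
    print(repr(s), found)
```

If +abc should only match when surrounded by whitespace or the string ends (so that "-+abc" is rejected too), the lookarounds (?<!\S) and (?!\S) could be used instead.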