Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 8 years ago.
def symbolsReplaceDashes(text):
I want to replace all spaces and symbols with hyphens, because I want to use the result in a URL.
import re
text = "this isn't alphanumeric"
result = re.sub(r'\W','-',text) # result will be "this-isn-t-alphanumeric"
The \W class is the inverse of the \w class, which consists of alphanumeric characters and underscores ([a-zA-Z0-9_] for ASCII; in Python 3, \w also matches Unicode letters and digits unless re.ASCII is set). Thus, replacing any character that matches \W with a dash will leave you with a string that consists of only alphanumerics, underscores, and dashes, suitable for a URL.
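One side effect of replacing characters one at a time is that consecutive symbols produce consecutive dashes. A minimal sketch of a variant that collapses each run into a single dash and trims the ends follows; the function name slugify is only an illustrative choice, not something from the answer above.

```python
import re

def slugify(text):
    # Replace each run of non-word characters with one hyphen,
    # then trim any hyphen left at either end.
    return re.sub(r'\W+', '-', text).strip('-')

print(slugify("this isn't  alphanumeric!"))  # this-isn-t-alphanumeric
```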
Instead of a regex, if you want to escape a string to be used in a URL, use urllib.quote() or urllib.quote_plus(). For more complex queries, you might want to build the URL using urllib.urlencode(). You can reverse the quoting with urllib.unquote() and urllib.unquote_plus(). In Python 3, these functions live in urllib.parse (for example, urllib.parse.quote()).
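As a rough illustration of the quoting functions, here is how they behave under Python 3, where they are imported from urllib.parse. The query parameter names q and page are made up for the example.

```python
from urllib.parse import quote, quote_plus, unquote_plus, urlencode

text = "this isn't alphanumeric"

quoted = quote(text)            # spaces become %20
plus_quoted = quote_plus(text)  # spaces become +
query = urlencode({"q": text, "page": 2})  # hypothetical parameter names

print(quoted)       # this%20isn%27t%20alphanumeric
print(plus_quoted)  # this+isn%27t+alphanumeric
print(query)        # q=this+isn%27t+alphanumeric&page=2
print(unquote_plus(plus_quoted))  # this isn't alphanumeric
```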
This response doesn't use regular expressions, but should also work, with greater control over the types of symbols to filter. It uses the unicodedata module to remove all symbols by checking the categories of the characters.
import unicodedata
# See http://www.dpawson.co.uk/xsl/rev2/UnicodeCategories.html for character categories
replace = ('Sc', 'Sk', 'Sm', 'So', 'Zs')
def symbolsReplaceDashes(text):
    L = []
    for char in text:
        # In Python 3 every str is already Unicode, so no unicode() call is needed
        if unicodedata.category(char) in replace:
            L.append('-')
        else:
            L.append(char)
    return ''.join(L)
You may need to use something like urllib.quote(output.encode('utf-8')) (or urllib.parse.quote(output) in Python 3) to percent-encode the result if it contains characters beyond basic ASCII alphanumerics.
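To see which characters the category tuple above actually covers, it can help to print the category of a few sample characters. This is only a quick check, assuming Python 3; the sample characters are chosen for illustration. Note that the apostrophe, hyphen, and underscore fall under punctuation categories (Po, Pd, Pc), so they are not in the tuple and would be left untouched.

```python
import unicodedata

samples = [' ', '$', '+', '^', '\u00a9', "'", '-', '_']
categories = {ch: unicodedata.category(ch) for ch in samples}

for ch, cat in categories.items():
    print(repr(ch), cat)
```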
I have a text containing a URL that needs to be reworked.
text='dfs:/?url=https://myserver/c12&ofg={"tes":{"id":1812}}'
I need to replace programmatically the id value (in this example 1812, which is unknown before the execution) with a fixed substring (e.g. 189). So the end result must be
'dfs:/?url=https://myserver/c12&ofg={"tes":{"id":189}}'
As I'm programming in Python, I guess that I should use the regular expression (module re) to automatically replace that value between "id": and }} but I couldn't find one that works for this use case.
I assume you are always generating the same URL with that pattern, and the value to 'change' is always in {"id":X}. One way to solve this particular problem is with a positive lookbehind + re.sub replacement.
import re
pattern = re.compile(r"(?<=\"id\":)\d+")
string = "dfs:/?url=https://myserver/c12&ofg={\"tes\":{\"id\":1812}}"
print(pattern.sub("desired_value", string))
The generated output will contain desired_value in place of 1812. regex101 gives a good step-by-step explanation, but here is a quick summary of what the pattern does:
It matches one or more digits ONLY if they are immediately preceded by "id":, and the lookbehind does not consume those preceding characters, so they stay in the result.
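Applied to the exact values from the question, the lookbehind approach could be wrapped up as below. The helper name replace_id and the count=1 limit are assumptions added for the sketch; the pattern is the same one as in the answer, written with single quotes to avoid the backslashes.

```python
import re

# Digits that come directly after the literal text "id":
ID_VALUE = re.compile(r'(?<="id":)\d+')

def replace_id(url_text, new_id):
    # count=1 limits the substitution to the first "id" value found.
    return ID_VALUE.sub(str(new_id), url_text, count=1)

text = 'dfs:/?url=https://myserver/c12&ofg={"tes":{"id":1812}}'
print(replace_id(text, 189))
# dfs:/?url=https://myserver/c12&ofg={"tes":{"id":189}}
```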
What about simply splitting the string twice? For example:
my_string = 'dfs:/?url=https://myserver/c12&ofg={"tes":{"id":1812}}'
substring = my_string.split('"id":',1)[1]
substring = substring.split('}}')[0]
print(my_string.replace(substring, "189"))
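One thing to watch with the split-and-replace version is that str.replace substitutes every occurrence of the extracted value, so a URL that happens to contain the same digits elsewhere would be changed in two places. A possible sketch that rebuilds the string around the delimiters instead is shown here; replace_between is a hypothetical helper name, and the sample URL is altered so the digits appear twice.

```python
def replace_between(text, start, end, new_value):
    # Split once at the first start marker, then once at the first end marker.
    head, found_start, rest = text.partition(start)
    if not found_start:
        return text  # start marker missing: leave the text unchanged
    _old, found_end, tail = rest.partition(end)
    if not found_end:
        return text  # end marker missing: leave the text unchanged
    return head + start + new_value + end + tail

url = 'dfs:/?url=https://myserver/c1812&ofg={"tes":{"id":1812}}'
print(replace_between(url, '"id":', '}}', '189'))
# dfs:/?url=https://myserver/c1812&ofg={"tes":{"id":189}}
```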
Example:
Input: dustbin
Output: bin
if 'dust' in string:
    new = string.split('dust')
    listToStr = ''.join(map(str, new))
    print(listToStr)
The above code works fine for that input.
But suppose the input changes like this:
Input: dustduuuustdustbin
Sample Output: bin
The above code doesn't work. Is there a solution to this?
Use a regular expression.
import re
result = re.sub(r'[dust]', '', string)
The regexp [dust] matches any single one of those characters, so every d, u, s, or t in the string is replaced with an empty string, including ones that are not part of the word dust.
If you want to remove only the whole word dust, with possible repetitions of the letters, the regexp would be r'd+u+s+t+'.
If only u can be repeated, use r'du+st'.
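The difference between the three patterns shows up once the input contains those letters outside the word itself. A small comparison, using a second made-up input alongside the one from the question:

```python
import re

patterns = [r'[dust]', r'd+u+s+t+', r'du+st']

for sample in ["dustduuuustdustbin", "dusty student"]:
    for pattern in patterns:
        print(pattern, repr(sample), '->', repr(re.sub(pattern, '', sample)))

# All three patterns turn "dustduuuustdustbin" into "bin".
# For "dusty student", [dust] strips every d, u, s and t and leaves "y en",
# while the other two remove only the word and leave "y student".
```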
Use the function definition below to remove every character listed in pattern from any string.
re.sub replaces each match with an empty string, so only the non-matching part of the string is returned.
import re
def remover(text, pattern):
    # Wrap the characters in [...] so that each one is matched individually
    temp = '[' + pattern + ']'
    return re.sub(temp, '', text)
print(remover("dustttttttttbinggg", 'dust'))  # binggg
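Building a character class by string concatenation can break if the characters to remove include ones that are special inside brackets, such as ], ^, - or a backslash. A hedged variant that escapes them first might look like this; remove_chars is an illustrative name, not part of the answer above.

```python
import re

def remove_chars(text, chars):
    if not chars:
        return text  # an empty class "[]" would not be a valid pattern
    # re.escape keeps characters such as ']', '^' and '-' literal inside the class.
    return re.sub('[' + re.escape(chars) + ']', '', text)

print(remove_chars("dustttttttttbinggg", "dust"))  # binggg
print(remove_chars("a-b^c]d", "-^]"))              # abcd
```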
I'm making a chat filtering bot, and people are bypassing the bot with symbols such as underscores, periods, and a bunch of other symbols. Does anyone know a way to block all of these that are in swear words?
The best way would be to use a regular expression, using re, combined with the characters in the string module.
Here's an example:
import re
import string
# re.escape keeps characters such as ] and \ literal inside the character class
symbols = re.escape(string.punctuation + string.digits + string.whitespace)
letters = string.ascii_letters
with open("path/to/blacklisted/words.txt") as file:
    blacklist = file.read().split('\n')
for word in blacklist:
    regex_match_true = re.compile(fr"[{symbols}]*".join(list(word)), re.IGNORECASE)
    regex_match_none = re.compile(fr"([{letters}]+{word})|({word}[{letters}]+)", re.IGNORECASE)
    # message is assumed to be the incoming chat message object from the bot library
    if regex_match_true.search(message.content) and regex_match_none.search(message.content) is None:
        pass  # Do something here
In this regular expression, an optional run of symbols ([...]*) is inserted between the letters of the word variable. This is a basic layout: it likely will not catch all blacklisted words, and it may also catch too many. You will likely have to do a lot of testing and experimentation to create a regular expression that fits your needs.
Edit: The second regular expression checks if the bad word being searched is found with letters preceding or succeeding the bad word itself (no special characters between letters).
The problem that now arises is that if there is a word with a space in between, but with letters on the end(s), the regular expressions will match that pattern. For example, if the word being searched was "word" and the message contained the phrase "two rd.", the message would be flagged. The results are improved, but there are still issues.
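The first regular expression can be tried out without a bot or a word file. The sketch below is self-contained; the helper names build_pattern and is_flagged and the one-word blacklist are assumptions for the example, and only the "symbols between letters" check is implemented, not the second regular expression. It also reproduces the "two rd." false positive mentioned above.

```python
import re
import string

# Characters allowed to sit between the letters of a blacklisted word.
SEPARATORS = re.escape(string.punctuation + string.digits + string.whitespace)

def build_pattern(word):
    # Allow any run of separator characters between consecutive letters.
    gap = f"[{SEPARATORS}]*"
    return re.compile(gap.join(re.escape(ch) for ch in word), re.IGNORECASE)

def is_flagged(message, blacklist):
    return any(build_pattern(word).search(message) for word in blacklist)

blacklist = ["word"]  # placeholder entry
print(is_flagged("w.o_r d", blacklist))   # True
print(is_flagged("hello", blacklist))     # False
print(is_flagged("two rd.", blacklist))   # True (the false positive)
```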
In my chat app TalkTalkTalk, for usernames, I allowed alphanumeric characters only (A-Z, a-z, 0-9):
username = re.sub(r'\W+', '', username) # regex to keep alphanumeric only
This is a bit too restrictive because UTF8 characters are useful in many cases (people who have a name with another alphabet than latin, etc.). Now I would like to allow these useful UTF8 characters from other alphabets, and even things like ❤ ☀ ☆ ☂ ☻ ♞ ☯ ☭ ☢. (Why not?)
But I don't want:
all kinds of whitespace and all kinds of newlines
malicious characters that render as an empty, zero-width character: http://unicode-table.com/fr/200D/
etc., and more generally any character that could make userA<malicious_char> look like the real userA.
Which are the printable UTF8 characters? (to be used in a username)
How to filter them with a regex, for example in Python?
Note: This question is about finding a regex to filter them, so it's not a duplicate of some linked questions.
You can use the re.UNICODE flag and put Unicode escapes directly in the regex. Note that \u200b is not technically defined as whitespace, so \s alone does not match it.
This works in Python 2.7 and 3:
import re
username = u'My \u200bNick \u2602 \u263b \u200c '
white_chars = [r'\s', u'\u200b', u'\u200c']  # etc.
regex_str = '[' + ''.join(white_chars) + ']'
regex = re.compile(regex_str, flags=re.UNICODE)
regex.sub("", username )
print ( regex.sub("", username ) )
you get
u'MyNick\u2602\u263b'
MyNick☂☻
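Listing every invisible character by hand does not scale well. An alternative worth considering, sketched here as an assumption rather than a complete answer, is to filter on Unicode general categories with the unicodedata module: control and format characters (Cc, Cf), surrogates, private-use and unassigned code points (Cs, Co, Cn), and all separators (Zs, Zl, Zp). The set BLOCKED and the function clean_username are illustrative names. This does not address look-alike letters from different alphabets.

```python
import unicodedata

# General categories to strip: controls, format characters (including
# zero-width ones), surrogates, private use, unassigned, and separators.
BLOCKED = {'Cc', 'Cf', 'Cs', 'Co', 'Cn', 'Zs', 'Zl', 'Zp'}

def clean_username(name):
    return ''.join(ch for ch in name if unicodedata.category(ch) not in BLOCKED)

username = 'My \u200bNick \u2602 \u263b \u200c '
print(clean_username(username))  # MyNick☂☻
```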
I need to write regex matching code that returns true if there is exactly one '+' between two words and nothing else. I have written the code to check if there is only one '+' in the string, but how do I check that it is between two words?
The code is below:
import re
inputStr = "ali+ahmedafaw+"
inputStr2 = "hello+world+again"
plus = re.findall(r'[+]', inputStr)
print(plus)
l_plus = len(plus)
print("The length is", l_plus)
if l_plus <= 1:
    print("True")
else:
    print("False")
Actually, it depends on what you mean by a word. If you mean a run of one or more letters, you can simply use [a-zA-Z]+ on either side of the + character. You could also use other patterns that match different characters, like \w to match word characters.
re.search(r'[a-zA-Z]+\+[a-zA-Z]+', input_str)
But if you just want to make sure the + doesn't appear at the very start or end of your text, you can use negative look-arounds:
re.search(r'(?<!^)\+(?!$)', input_str)
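Because re.search succeeds as soon as any part of the string matches, the first pattern also accepts strings with extra + signs elsewhere. If "one + between two words and nothing else" should apply to the whole string, re.fullmatch is one way to express that. A minimal sketch, assuming a word means ASCII letters only; the function name is made up for the example:

```python
import re

ONE_PLUS = re.compile(r'[a-zA-Z]+\+[a-zA-Z]+')

def has_single_plus_between_words(s):
    # fullmatch requires the entire string to fit the pattern.
    return ONE_PLUS.fullmatch(s) is not None

for s in ["ali+ahmed", "ali+ahmedafaw+", "hello+world+again", "+ali"]:
    print(s, has_single_plus_between_words(s))
# ali+ahmed True
# ali+ahmedafaw+ False
# hello+world+again False
# +ali False
```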