I'm working on code in Python to extract the mentions from a tweet's text.
The parameter is a tweet text. The function should return a list containing all of the mentions in the tweet, in the order they appear. Each mention in the returned list should have the initial mention symbol removed, and the list should contain every mention encountered, including repeats if a user is mentioned more than once within a tweet. Here are two examples:
>>> extract_mentions('@AndreaTantaros- You are a true journalistic professional. I so agree with what you say. Keep up the great work!@RepJohnLewis ')
['AndreaTantaros', 'RepJohnLewis']
>>> extract_mentions("@CPAC For all the closet #libertarians attending #CPAC2016 , I'll be there Thurs/Fri -- speaking Thurs. a.m. on the main stage. Look me up! @CPAC")
['CPAC', 'CPAC']
A mention begins with the '@' symbol and consists of alphanumeric characters up to (but not including) a space, a punctuation character, or the end of the tweet.
How can I extract the mentions from the string? Sorry, I haven't learned about regex yet; is there any other way?
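Since you say you haven't learned regex yet, here is a minimal non-regex sketch of the same idea (the function name extract_mentions comes from your examples; everything else is just one possible approach). It walks the text and, whenever it sees '@', collects the alphanumeric characters that follow:
def extract_mentions(tweet):
    """Return the usernames mentioned in tweet, in order, without the '@'."""
    mentions = []
    i = 0
    while i < len(tweet):
        if tweet[i] == '@':
            j = i + 1
            # A mention is the run of alphanumeric characters after '@'.
            while j < len(tweet) and tweet[j].isalnum():
                j += 1
            if j > i + 1:  # ignore a lone '@'
                mentions.append(tweet[i + 1:j])
            i = j
        else:
            i += 1
    return mentions

print(extract_mentions('@AndreaTantaros- Keep up the great work!@RepJohnLewis '))
# ['AndreaTantaros', 'RepJohnLewis']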
You can use the following regular expression; it also disregards email addresses.
(^|[^@\w])@(\w{1,15})
Example Code
import re
text = "@RayFranco is answering to @jjconti, this is a real '@username83' but this is an@email.com, and this is a @probablyfaketwitterusername"
result = re.findall(r"(^|[^@\w])@(\w{1,15})", text)
print(result)
This returns:
[('', 'RayFranco'), (' ', 'jjconti'), ("'", 'username83'), (' ', 'probablyfaketwi')]
Note that Twitter allows at most 15 characters for usernames. From the Twitter specs:
Your username cannot be longer than 15 characters. Your real name can
be longer (20 characters), but usernames are kept shorter for the sake
of ease. A username can only contain alphanumeric characters (letters
A-Z, numbers 0-9) with the exception of underscores, as noted above.
Check to make sure your desired username doesn't contain any symbols,
dashes, or spaces.
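Because the pattern has two capture groups, re.findall returns tuples; if you only want the usernames, keep the second element of each tuple. A small follow-up to the example above:
result = re.findall(r"(^|[^@\w])@(\w{1,15})", text)
mentions = [name for _, name in result]
print(mentions)  # ['RayFranco', 'jjconti', 'username83', 'probablyfaketwi']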
Use a regex:
import re
input_string = '@AndreaTantaros- You are a true journalistic professional. I so agree with what you say. Keep up the great work!@RepJohnLewis '
result = re.findall(r"@([a-zA-Z0-9]{1,15})", input_string)
Output: ['AndreaTantaros', 'RepJohnLewis']
If you want to remove email addresses first, simply do:
re.sub(r"\w+@\w+\.com", "", input_string)
I have a text containing a URL that needs to be reworked.
text='dfs:/?url=https://myserver/c12&ofg={"tes":{"id":1812}}'
I need to programmatically replace the id value (in this example 1812, which is unknown before execution) with a fixed substring (e.g. 189). So the end result must be
'dfs:/?url=https://myserver/c12&ofg={"tes":{"id":189}}'
As I'm programming in Python, I guess I should use a regular expression (the re module) to automatically replace the value between "id": and }}, but I couldn't find one that works for this use case.
I assume you are always generating the same URL with that pattern, and the value to 'change' is always in {"id":X}. One way to solve this particular problem is with a positive lookbehind plus a re.sub replacement.
import re
pattern = re.compile(r"(?<=\"id\":)\d+")
string = "dfs:/?url=https://myserver/c12&ofg={\"tes\":{\"id\":1812}}"
print(pattern.sub("desired_value", string))
The generated output will contain desired_value in place of 1812. A good explanation of what is happening is available on regex101, but in short: \d+ matches one or more digits ONLY if they are immediately preceded by "id": (the lookbehind), without consuming those preceding characters.
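If the new value changes at run time, the same pattern can be wrapped in a small helper (the name replace_id is mine, not from the question):
import re

def replace_id(text, new_id):
    # Replace the digits that directly follow "id": with new_id.
    return re.sub(r'(?<="id":)\d+', str(new_id), text)

text = 'dfs:/?url=https://myserver/c12&ofg={"tes":{"id":1812}}'
print(replace_id(text, 189))
# dfs:/?url=https://myserver/c12&ofg={"tes":{"id":189}}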
What about simply splitting the string twice? For example:
my_string = 'dfs:/?url=https://myserver/c12&ofg={"tes":{"id":1812}}'
substring = my_string.split('"id":',1)[1]
substring = substring.split('}}')[0]
print(my_string.replace(substring, "189"))
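One caveat: str.replace() substitutes every occurrence of the extracted value, so if the same digits happened to appear elsewhere in the URL they would be rewritten too. A variant that rebuilds the string from the split pieces avoids that (same assumptions as above):
my_string = 'dfs:/?url=https://myserver/c12&ofg={"tes":{"id":1812}}'
head, rest = my_string.split('"id":', 1)
old_id, tail = rest.split('}}', 1)
print(head + '"id":189}}' + tail)
# dfs:/?url=https://myserver/c12&ofg={"tes":{"id":189}}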
I'm making a chat-filtering bot, and people are bypassing the filter with symbols such as underscores, periods, and various other characters. Does anyone know a way to block swear words even when they contain these symbols?
The best way would be to use a regular expression, using re, combined with the characters in the string module.
Here's an example:
import re
import string

symbols = string.punctuation + string.digits + string.whitespace
letters = string.ascii_letters

with open("path/to/blacklisted/words.txt") as file:
    blacklist = file.read().split('\n')

for word in blacklist:
    # Allow any run of symbols/digits/whitespace between the letters of the blacklisted word.
    regex_match_true = re.compile(fr"[{symbols}]*".join(list(word)), re.IGNORECASE)
    # Reject matches that have letters directly before or after the word itself.
    regex_match_none = re.compile(fr"([{letters}]+{word})|({word}[{letters}]+)", re.IGNORECASE)
    # 'message' comes from your bot's event handler.
    if regex_match_true.search(message.content) and regex_match_none.search(message.content) is None:
        pass  # Do something here
In this regular expression, an optional group of symbols is inserted between the letters of the word variable. This is a basic layout and likely will not catch all blacklisted words, or it may catch too many. You will likely have to do a lot of testing and experimentation to create a regular expression that fits your needs.
Edit: the second regular expression checks whether the bad word being searched for appears with letters directly preceding or following it (with no special characters between the letters).
The problem that now arises is that if the blacklisted word appears split by a space but attached to other letters, the regular expressions will still match. For example, if the word being searched for were "word" and the message contained the phrase "two rd.", the message would be flagged. The results are improved, but there are still issues.
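To experiment with this outside the bot, here is a minimal self-contained sketch of the same two-regex check (the blacklist is inlined instead of read from a file, and is_blocked and the sample messages are names/strings I made up):
import re
import string

symbols = string.punctuation + string.digits + string.whitespace
letters = string.ascii_letters
blacklist = ["word"]  # stand-in for the contents of words.txt

def is_blocked(content):
    for word in blacklist:
        match_true = re.compile(fr"[{symbols}]*".join(list(word)), re.IGNORECASE)
        match_none = re.compile(fr"([{letters}]+{word})|({word}[{letters}]+)", re.IGNORECASE)
        if match_true.search(content) and match_none.search(content) is None:
            return True
    return False

print(is_blocked("w_o.r-d"))   # True: symbols between the letters are ignored
print(is_blocked("password"))  # False: "word" is embedded in a longer word
print(is_blocked("two rd."))   # True: the false positive described above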
I would like to extract some strings from between quotes using a regular expression. The text is shown below:
CCKeyUpDomReady('test.asmx/asdasd', 'QMlPJZTOH09XOPCcbB2jcg==', '0OO6h+G2Tzhr5XWj1Upg0A==', '0OO6h+G2Tzhr5XWj1Upg0A==', '/qqwweq2.asmx/qqq')
Expected result must be:
test.asmx/asdasd
/qqwweq2.asmx/qqq
How can I do it? Here is the platform for testing:
https://regexr.com/3n142
The criterion: a string between quotes must contain the word "asmx". The actual text is much longer than shown above; think of it as searching for asmx URLs in a website's source code.
See regex in use here
'((?:[^'\\]|\\.)*asmx(?:[^'\\]|\\.)*)'
' Match this literally
((?:[^'\\]|\\.)*asmx(?:[^'\\]|\\.)*) Capture the following into capture group 1
(?:[^'\\]|\\.)* This is a beautiful trick gathered from PhiLho's answer to "Regex for quoted string with escaping quotes". It matches either an escaped character (a backslash followed by anything) or any character other than a quote or backslash.
asmx The OP's search string/criterion
(?:[^'\\]|\\.)* This again
' Match this literally
The result is in capture group 1:
test.asmx/asdasd
/qqwweq2.asmx/qqq
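If you want to run this in Python (assuming that is the target language here), re.findall with the same pattern returns the captured strings directly:
import re

text = ("CCKeyUpDomReady('test.asmx/asdasd', 'QMlPJZTOH09XOPCcbB2jcg==', "
        "'0OO6h+G2Tzhr5XWj1Upg0A==', '0OO6h+G2Tzhr5XWj1Upg0A==', '/qqwweq2.asmx/qqq')")
pattern = r"'((?:[^'\\]|\\.)*asmx(?:[^'\\]|\\.)*)'"
print(re.findall(pattern, text))
# ['test.asmx/asdasd', '/qqwweq2.asmx/qqq']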
In my chat app TalkTalkTalk, for usernames, I allowed alphanumeric characters only (A-Z, a-z, 0-9):
username = re.sub(r'\W+', '', username) # regex to keep alphanumeric only
This is a bit too restrictive, because UTF-8 characters are useful in many cases (people whose names use an alphabet other than Latin, etc.). Now I would like to allow these useful UTF-8 characters from other alphabets, and even things like ❤ ☀ ☆ ☂ ☻ ♞ ☯ ☭ ☢. (Why not?)
But I don't want:
- any kind of whitespace or newline ('\n', '\r', etc.)
- malicious characters that look like an empty zero-width char: http://unicode-table.com/fr/200D/
- etc., and more generally any character that could make userA<malicious_char> look like the real userA.
Which UTF-8 characters are printable (i.e. safe to use in a username)?
How can I filter the others out with a regex, for example in Python?
Note: This question is about finding a regex to filter them, so it's not a duplicate of some linked questions.
You can use the re.UNICODE flag and Unicode escapes in the regex; note that \u200b is not technically defined as whitespace, so \s alone will not catch it.
Python 2.7 and 3:
import re
username = u'My \u200bNick \u2602 \u263b \u200c '
white_chars = [r'\s', u'\u200b', u'\u200c']  # etc.
regex_str = '[' + ''.join(white_chars) + ']'
regex = re.compile(regex_str, flags=re.UNICODE)
regex.sub("", username )
print ( regex.sub("", username ) )
you get
u'MyNick\u2602\u263b'
MyNick☂☻
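The question also mentions U+200D (zero-width joiner); the white_chars list above only covers U+200B and U+200C, so extend it with whatever other invisible code points you want to strip. A rough sketch (the extra code points listed are common zero-width/format characters, not an exhaustive set):
import re

# \s plus a hand-picked set of zero-width / invisible code points.
invisible = [r'\s', '\u200b', '\u200c', '\u200d', '\u2060', '\ufeff']
regex = re.compile('[' + ''.join(invisible) + ']', flags=re.UNICODE)

username = 'My \u200bNick \u200d\u2602'
print(regex.sub('', username))  # MyNick☂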
def symbolsReplaceDashes(text):
I want to replace all spaces and symbols with hyphens, because I want to use the result in a URL.
import re
text = "this isn't alphanumeric"
result = re.sub(r'\W','-',text) # result will be "this-isn-t-alphanumeric"
The \W class is the inverse of the \w class, which consists of alphanumeric characters and underscores ([a-zA-Z0-9_]). Thus, replacing every character that matches \W (i.e. every non-word character) with a dash leaves you with a string consisting only of alphanumerics, underscores, and dashes, suitable for a URL.
Instead of regex, if you want to escape a string for use in a URL, use urllib.quote() or urllib.quote_plus() (urllib.parse.quote() / quote_plus() in Python 3). For more complex queries, you might want to build the URL using urllib.urlencode(). You can reverse the quoting with urllib.unquote() and urllib.unquote_plus().
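A short sketch of that urllib approach (shown with the Python 3 module path urllib.parse; in Python 2 the same functions live directly in urllib):
from urllib.parse import quote, quote_plus, unquote

text = "this isn't alphanumeric"
print(quote(text))           # this%20isn%27t%20alphanumeric
print(quote_plus(text))      # this+isn%27t+alphanumeric
print(unquote(quote(text)))  # this isn't alphanumeric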
This response doesn't use regular expressions but should also work, with greater control over the types of symbols to filter. It uses the unicodedata module to replace all symbol characters by checking the Unicode category of each character.
import unicodedata

# See http://www.dpawson.co.uk/xsl/rev2/UnicodeCategories.html for character categories
replace = ('Sc', 'Sk', 'Sm', 'So', 'Zs')

def symbolsReplaceDashes(text):
    L = []
    for char in text:
        # In Python 2, use unicodedata.category(unicode(char)) instead.
        if unicodedata.category(char) in replace:
            L.append('-')
        else:
            L.append(char)
    return ''.join(L)
You may need to use something like urllib.quote(output.encode('utf-8')) (or urllib.parse.quote(output) in Python 3) to encode characters that are beyond basic ASCII alphanumerics.
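For example (urllib.parse.quote is the Python 3 spelling, and the sample string is made up):
from urllib.parse import quote

slug = symbolsReplaceDashes('café ☀ menu')
print(slug)         # café---menu
print(quote(slug))  # caf%C3%A9---menu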