Regular expressions in python to match Twitter handles

I'm trying to use regular expressions to capture all Twitter handles within a tweet body. The challenge is that I'm trying to get handles that
Contain a specific string
Are of unknown length
May be followed by either
punctuation
whitespace
or the end of string.
For example, for each of these strings, I've marked in brackets what I'd like to return.
"#handle what is your problem?" [RETURN '#handle']
"what is your problem #handle?" [RETURN '#handle']
"#123handle what is your problem #handle123?" [RETURN '#123handle', '#handle123']
This is what I have so far:
>>> import re
>>> re.findall(r'(#.*handle.*?)\W','hi #123handle, hello #handle123')
['#123handle']
# This misses the handles that are followed by end-of-string
I tried modifying it to include an alternation that allows the end of the string. Instead, it just returns the whole string.
>>> re.findall(r'(#.*handle.*?)(?=\W|$)','hi #123handle, hello #handle123')
['#123handle, hello #handle123']
# This looks like it is too greedy and ends up returning too much
How can I write an expression that will satisfy both conditions?
I've looked at a couple other places, but am still stuck.

It seems you are trying to match strings starting with #, then having 0+ word chars, then handle, and then again 0+ word chars.
Use
r'#\w*handle\w*'
or - to avoid matching #+word chars in emails:
r'\B#\w*handle\w*'
The \B non-word boundary requires a non-word char or the start of the string to be right before the #.
Note that .* is a greedy dot pattern that matches any characters other than a newline, as many as possible. \w* also matches 0+ characters, as many as possible, but only word characters: the [a-zA-Z0-9_] set in Python 2 when the re.UNICODE flag is not used (and it is not used in your code). In Python 3, \w in a str pattern also matches Unicode letters and digits unless re.ASCII is set.
Python demo:
import re
p = re.compile(r'#\w*handle\w*')
test_str = "#handle what is your problem?\nwhat is your problem #handle?\n#123handle what is your problem #handle123?\n"
print(p.findall(test_str))
# => ['#handle', '#handle', '#123handle', '#handle123']
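As a quick check of the difference between the two patterns, the sketch below runs both against a made-up string in which one # is glued to a preceding word. The sample text and the variable names are illustrative assumptions, not taken from the question.

```python
import re

plain = re.compile(r'#\w*handle\w*')
guarded = re.compile(r'\B#\w*handle\w*')

# Hypothetical sample: the first '#' directly follows a word character.
text = "mail joe#handle1 or ping #handle2."

plain_hits = plain.findall(text)      # a '#' after a word char is still matched
guarded_hits = guarded.findall(text)  # \B rejects a '#' that follows a word char

print(plain_hits)    # ['#handle1', '#handle2']
print(guarded_hits)  # ['#handle2']
```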

This matches only handles made of characters in the range [a-zA-Z0-9_]. Note that \w already covers digits and the underscore, so [\w\d_] can be written as \w.
import re
s = "#123handle what is your problem #handle123?"
print(re.findall(r'\B(#\w+)', s))
# ['#123handle', '#handle123']
s = '#The quick brown fox#jumped over the LAAZY #_dog.'
print(re.findall(r'\B(#\w+)', s))
# ['#The', '#_dog']

Related

how to write regex to accept the string which end with string

I want to write a regex which accepts this:
Accept:
done
done1
done1,done2,done3
Do not accept:
done1,
done1,done2,
I tried to write this regex
([a-zA-Z]+)?(/d)?(,)([a-zA-Z]+)
but it is not working.
What's wrong? How can I fix it?

I would phrase the regex pattern as:
(?<!\S)\w+(?:,\w+)*(?!\S)
Sample script:
inp = "done done1 done1,done2,done3 done1, done1,done2,"
matches = re.findall(r'(?<!\S)\w+(?:,\w+)*(?!\S)', inp)
print(matches) # ['done', 'done1', 'done1,done2,done3']
Here is an explanation of the regex pattern:
(?<!\S) assert that what precedes is either whitespace or the start of the input
\w+ match a word
(?:,\w+)* followed by a comma and another word, zero or more times
(?!\S) assert that what follows the final word is either whitespace or the end of the input

It also depends on how you apply the regex. The regex alone (e.g. when used with re.search()) tells you whether the input contains any substring which matches your regex. In the trivial case, if you are examining one line at a time, add start and end of line anchors around your regex to force it to match the entire line.
Also, of course, notice that the regex to match a single digit is \d, not /d.
Your regex looks like you want both the alphabetics and the numbers to be optional, but the group of alphabetics and numbers to be non-empty; is that correct? One way to do that is to add a lookahead (?=[a-zA-Z\d]) before the phrase which matches both optionally.
import re
tests = """\
done
done1
done1,done2,done3
done1,
done1,done2,
"""
regex = re.compile(r'^(?=[a-zA-Z\d])[a-zA-Z]*\d?(?:,(?=[a-zA-Z\d])[a-zA-Z]*\d?)*$')
for line in tests.splitlines():
    match = regex.search(line)
    if match:
        print(line)
The individual phrases here should be easy to understand. [a-zA-Z]* matches zero or more alphabetics, and \d? matches zero or one digits. We require one of those, followed by zero or more repetitions of a comma followed by a repeat of the first expression.
Perhaps also note that [a-zA-Z\d] is almost the same as \w (the latter also matches an underscore). If you don't care about this inexactness, the expression could be simplified. It would certainly be useful in the lookahead, where the regex after it will not match an underscore anyhow. But I've left in the more complex expression just to make the code easier to follow in relation to the original example.
Demo: https://ideone.com/4mVGDh
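If the looser \w is acceptable, the simplification mentioned above might look like the following sketch. It assumes Python 3.4+ (for fullmatch) and that underscores and multi-digit suffixes are fine to accept; the names simple, lines and accepted are made up for the example.

```python
import re

# Simplification assumed acceptable: \w also admits underscores and
# allows several digits per item, unlike the stricter pattern above.
simple = re.compile(r'\w+(?:,\w+)*')

lines = ["done", "done1", "done1,done2,done3", "done1,", "done1,done2,"]

# fullmatch() succeeds only if the whole line matches, so no ^...$ is needed.
accepted = [line for line in lines if simple.fullmatch(line)]
print(accepted)  # ['done', 'done1', 'done1,done2,done3']
```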

Regex that match any string except specific string [duplicate]

I need a regular expression able to match everything but a string starting with a specific pattern (specifically index.php and what follows, like index.php?id=2342343).

Regex: match everything but:
a string starting with a specific pattern (e.g. any - empty, too - string not starting with foo):
Lookahead-based solution for NFAs:
^(?!foo).*$
^(?!foo)
Negated character class based solution for regex engines not supporting lookarounds:
^(([^f].{2}|.[^o].|.{2}[^o]).*|.{0,2})$
^([^f].{2}|.[^o].|.{2}[^o])|^.{0,2}$
a string ending with a specific pattern (say, no world. at the end):
Lookbehind-based solution:
(?<!world\.)$
^.*(?<!world\.)$
Lookahead solution:
^(?!.*world\.$).*
^(?!.*world\.$)
POSIX workaround:
^(.*([^w].{5}|.[^o].{4}|.{2}[^r].{3}|.{3}[^l].{2}|.{4}[^d].|.{5}[^.])|.{0,5})$
([^w].{5}|.[^o].{4}|.{2}[^r].{3}|.{3}[^l].{2}|.{4}[^d].|.{5}[^.]$|^.{0,5})$
a string containing specific text (say, not match a string having foo):
Lookaround-based solution:
^(?!.*foo)
^(?!.*foo).*$
POSIX workaround:
Use the online regex generator at www.formauri.es/personal/pgimeno/misc/non-match-regex
a string containing specific character (say, avoid matching a string having a | symbol):
^[^|]*$
a string equal to some string (say, not equal to foo):
Lookaround-based:
^(?!foo$)
^(?!foo$).*$
POSIX:
^(.{0,2}|.{4,}|[^f]..|.[^o].|..[^o])$
a sequence of characters:
PCRE (match any text but cat): /cat(*SKIP)(*FAIL)|[^c]*(?:c(?!at)[^c]*)*/i or /cat(*SKIP)(*FAIL)|(?:(?!cat).)+/is
Other engines allowing lookarounds: (cat)|[^c]*(?:c(?!at)[^c]*)* (or (?s)(cat)|(?:(?!cat).)*, or (cat)|[^c]+(?:c(?!at)[^c]*)*|(?:c(?!at)[^c]*)+[^c]*) and then check with language means: if Group 1 matched, it is not what we need, else, grab the match value if not empty
a certain single character or a set of characters:
Use a negated character class: [^a-z]+ (any char other than a lowercase ASCII letter)
Matching any char(s) but |: [^|]+
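A few of the lookaround-based patterns from the list can be tried in Python as follows. This is only a sketch on single-line inputs; the helper accepts() and the sample strings are assumptions made for illustration.

```python
import re

# Each pattern matches strings that do NOT have the named property.
not_starting_with_foo = re.compile(r'^(?!foo).*$')
not_ending_with_world = re.compile(r'^.*(?<!world\.)$')
not_containing_foo = re.compile(r'^(?!.*foo).*$')
not_equal_to_foo = re.compile(r'^(?!foo$).*$')

def accepts(pattern, s):
    """Return True if the pattern matches the (single-line) string s."""
    return pattern.search(s) is not None

print(accepts(not_starting_with_foo, "foobar"))        # False
print(accepts(not_starting_with_foo, "barfoo"))        # True
print(accepts(not_ending_with_world, "hello world."))  # False
print(accepts(not_ending_with_world, "hello world"))   # True
print(accepts(not_containing_foo, "a foo b"))          # False
print(accepts(not_equal_to_foo, "foo"))                # False
print(accepts(not_equal_to_foo, "food"))               # True
```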
Demo note: the newline \n is used inside negated character classes in demos to avoid match overflow to the neighboring line(s). They are not necessary when testing individual strings.
Anchor note: In many languages, use \A to define the unambiguous start of string, and \z (in Python, it is \Z, in JavaScript, $ is OK) to define the very end of the string.
Dot note: In many flavors (but not POSIX, TRE, TCL), . matches any char but a newline char. Make sure you use a corresponding DOTALL modifier (/s in PCRE/Boost/.NET/Python/Java and /m in Ruby) for the . to match any char including a newline.
Backslash note: In languages where you have to declare patterns with C strings allowing escape sequences (like \n for a newline), you need to double the backslashes escaping special characters so that the engine can treat them as literal characters (e.g. in Java, world\. will be declared as "world\\.", or use a character class: "world[.]"). Alternatively, use raw string literals (Python r'\bworld\b'), C# verbatim string literals @"world\.", or slashy strings/regex literal notations like /world\./.
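To illustrate the anchor note in Python specifically, here is a small sketch comparing $ with \Z on a string that ends in a newline. The sample string and variable names are assumptions for the example.

```python
import re

s = "foo\n"  # a string with a trailing newline

dollar_match = re.search(r'foo$', s)   # $ also matches just before a final newline
strict_match = re.search(r'foo\Z', s)  # \Z matches only at the very end of the string

print(dollar_match is not None)  # True
print(strict_match is not None)  # False
```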

You could use a negative lookahead from the start, e.g., ^(?!foo).*$ shouldn't match anything starting with foo.

You can put a ^ in the beginning of a character set to match anything but those characters.
[^=]*
will match everything but =
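A minimal Python sketch of this idea, using + instead of * so that empty matches are not returned (the sample string is made up):

```python
import re

pair = "key=value"
parts = re.findall(r'[^=]+', pair)  # runs of one or more non-'=' characters
print(parts)  # ['key', 'value']
```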

Just match /^index\.php/, and then reject whatever matches it.
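In Python, that "match, then reject" approach could be sketched like this. The list of inputs is hypothetical and only there to show the filtering.

```python
import re

index_php = re.compile(r'^index\.php')

# Hypothetical inputs for illustration.
urls = ["index.php?id=2342343", "index.html", "about.php", "index.php"]

# Keep only the strings the pattern does NOT match.
kept = [u for u in urls if not index_php.search(u)]
print(kept)  # ['index.html', 'about.php']
```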

In Python:
>>> import re
>>> p = r'^(?!index\.php\?[0-9]+).*$'
>>> s1='index.php?12345'
>>> re.match(p,s1)
>>> s2='index.html?12345'
>>> re.match(p,s2)
<_sre.SRE_Match object at 0xb7d65fa8>

I came across this thread after a long search. I had this problem when searching for and replacing multiple occurrences: the pattern I used matched all the way to the last occurrence. Example below:
import re
text = "start![image]xxx(xx.png) yyy xx![image]xxx(xxx.png) end"
replaced_text = re.sub(r'!\[image\](.*)\(.*\.png\)', '*', text)
print(replaced_text)
gave
start* end
Basically, the regex was matching from the first ![image] to the last .png, swallowing the yyy in the middle.
I used the method posted above by Firish (https://stackoverflow.com/a/17761124/429476) to stop the match from running across occurrences. Here [^ ] excludes the space, since the occurrences are separated by spaces.
replaced_text = re.sub(r'!\[image\]([^ ]*)\([^ ]*\.png\)', '*', text)
and got what I wanted
start* yyy xx* end
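Another option, not used in the answer above, is to make the quantifiers lazy so that each occurrence is matched separately even if it contains spaces. A sketch with the same input:

```python
import re

text = "start![image]xxx(xx.png) yyy xx![image]xxx(xxx.png) end"

# .*? expands only as far as needed, so each ![image]...(....png) is matched on its own.
lazy_result = re.sub(r'!\[image\].*?\(.*?\.png\)', '*', text)
print(lazy_result)  # start* yyy xx* end
```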

regular expression match issue in Python

For an input string, I want to match text that starts with {(P) and ends with (P)}, and capture only the part in between. Can this be done with a single regular expression?
For example, for the input string below, I want to retrieve the hello world part. I'm using Python 2.7.
python {(P)hello world(P)} java

You can try {\(P\)(.*)\(P\)}, and use parentheses in the pattern to capture everything between {(P) and (P)}:
import re
re.findall(r'{\(P\)(.*)\(P\)}', "python {(P)hello world(P)} java")
# ['hello world']
.* also matches unicode characters, for example:
import re
str1 = "python {(P)£1,073,142.68(P)} java"
str2 = re.findall(r'{\(P\)(.*)\(P\)}', str1)[0]
str2
# '\xc2\xa31,073,142.68'
print str2
# £1,073,142.68

You can use positive look-arounds to ensure that it only matches if the text is preceded and followed by the start and end tags. For instance, you could use this pattern:
(?<={\(P\)).*?(?=\(P\)})
(?<={\(P\)) - Look-behind expression stating that a match must be preceded by {(P).
.*? - Matches all text between the start and end tags. The ? makes the star lazy (i.e. non-greedy). That means it will match as little as possible.
(?=\(P\)}) - Look-ahead expression stating that a match must be followed by (P)}.
For what it's worth, lazy patterns can be less efficient, so if you know that there will be no ( characters in the match, it would be better to use a negated character class:
(?<={\(P\))[^(]*(?=\(P\)})
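The difference between the greedy capture and the two look-around patterns shows up once the input contains more than one tagged segment. The sketch below uses a made-up string with two segments; the variable names are illustrative.

```python
import re

# Hypothetical input with two tagged segments.
s = "a {(P)one(P)} b {(P)two(P)} c"

greedy = re.findall(r'{\(P\)(.*)\(P\)}', s)
lazy = re.findall(r'(?<={\(P\)).*?(?=\(P\)})', s)
negated = re.findall(r'(?<={\(P\))[^(]*(?=\(P\)})', s)

print(greedy)   # ['one(P)} b {(P)two']
print(lazy)     # ['one', 'two']
print(negated)  # ['one', 'two']
```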

You can also do this without regular expressions:
s = 'python {(P)hello world(P)} java'
r = s.split('(P)')[1]
print(r)
# hello world

Python regular expressions match end of word

For example, how do I match the second _ab in the sentence _ab_ab is a test? I tried \> to match the end of a word, but it does not work in Python 2.7. Note: I am not matching the end of a string, but the end of a single word.
There are implicit answers in other posts, but I believe a simple and direct answer to this question is worth having, so I am asking after trying the following posts without finding a direct and concise solution.
Python Regex to find whitespace, end of string, and/or word boundary
Does Python re module support word boundaries (\b)?

You may use the word boundary \b at the end. Note that adding \b before _ab won't work because a b (a word char) comes before the underscore. \b matches between a word character and a non-word character (or vice versa).
r'_ab\b'
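Applied to the sentence from the question, this can be checked with finditer, which also reports where the match sits (a small sketch; the variable names are made up):

```python
import re

sentence = "_ab_ab is a test"

# \b after _ab requires the next character to be a non-word character (or the end).
spans = [m.span() for m in re.finditer(r'_ab\b', sentence)]
print(spans)  # [(3, 6)]
```

Only the second _ab is reported, because the first one is followed by an underscore, which counts as a word character.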

_ab(?!\w) #if you want `_` as word character
or
_ab(?![a-zA-Z0-9])
You can simply use a negative lookahead to indicate the end of a word.
import re
p = re.compile(r'_ab(?!\w)')  # treats the underscore as a word character
# or: p = re.compile(r'_ab(?![a-zA-Z0-9])')
test_str = "_ab_ab"
print(p.findall(test_str))
# ['_ab']

Python's re module does not support the \> end-of-word anchor found in some other tools; in a pattern, \> simply matches a literal > character, whether or not the pattern is a raw string. Use the word boundary \b, or a lookahead, instead.
It is still good practice with the re module to write patterns as raw strings with the r prefix, so that backslashes reach the regex engine unchanged (see https://stackoverflow.com/a/3995242/2728388).

import re
string = '''ab_ab _ab_ab ab__ab abab_ ab_ababab_ '''
patt = re.compile(r'_ab\b')
# find every _ab that sits at the end of a word
allmatches = patt.findall(string)
print(allmatches)
This matches every _ab that ends a word in the string.
