How to look only at entire words with regex? - python

I am writing a personal information filter. When it encounters a VALID phone number or email address, it replaces it with "[PRIVATE]".
A valid phone number is, for example, '0123 45678', while '00123 45678' is invalid, but I get 0[PRIVATE] for the second one after filtering. How do I match only entire words using regex? \bword\b is not working at all.

I'm betting that you forgot to use raw strings:
re.search("\bword\b", text)
looks for a backspace character, then word, then another backspace character (text here is whatever string you are searching).
re.search(r"\bword\b", text)
finds word only when it appears as an entire word.
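To see the difference in practice, here is a minimal sketch. The phone pattern (four digits, a space, five digits) and the helper name filter_private are assumptions based on the single example in the question, not a general phone-number format.

```python
import re

# "\b" in a normal string is the backspace character (\x08);
# r"\b" keeps the backslash, so the regex engine sees a word boundary.
assert "\b" == "\x08"
assert len(r"\b") == 2

# Assumed format from the question: 4 digits, a space, 5 digits.
PHONE = re.compile(r"\b\d{4} \d{5}\b")

def filter_private(text):
    """Replace every whole-word phone number with [PRIVATE]."""
    return PHONE.sub("[PRIVATE]", text)

print(filter_private("call 0123 45678 now"))   # call [PRIVATE] now
print(filter_private("call 00123 45678 now"))  # unchanged: no boundary inside 00123
```

Because there is no word boundary between the two leading zeros of 00123, the pattern cannot start matching in the middle of that number, so the invalid input is left alone.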

This one will work:
re.search(r"([\d]+([\s]+)?[\d]+)", text)

Related

Why do multi-line strings lead to different pattern matches from single line strings when using python regex?

I am trying to create a Discord Bot that reads users messages and detects when an Amazon link(s) is/are present in their message.
If I use a multi-line string I capture different results from when the message is used on a single line.
Here is the code I am using:
import re
AMAZON_REGEX = re.compile("(http[s]?://[a-zA-Z0-9.-]*(?:amazon|amzn).["
                          "a-zA-Z]+(?:.+?(?:ref=[^?]+)|.+(?= )|[^?]+))")
def extract_url(message):
    foo = AMAZON_REGEX.findall(message)
    return foo
user_message = """https://www.amazon.co.uk/dp/B07RLWTXKG blah blah
hello
https://www.amazon.co.uk/dp/B07RLWToop foobar"""
print(extract_url(user_message))
The result of the above code is: ['https://www.amazon.co.uk/dp/B07RLWTXKG blah blah', 'https://www.amazon.co.uk/dp/B07RLWToop']
However, if I change user_message from a multiline string to a single line one then I get the following result: ['https://www.amazon.co.uk/dp/B07RLWTXKG blah blah hello https://www.amazon.co.uk/dp/B07RLWToop']
Why is this the case? Also, how do I capture just the URL without the rest of the users' messages?
It seems like you're having an issue with the exact regex you're using.
Why does the newline change the output?
After parsing the link, your regex goes on to capture the words that follow it, separated by spaces, but the newline character stops it from continuing. By default, . matches any character except a newline (\n), so the newline between "blah" and "hello" in the multi-line case is what keeps "hello" from being captured.
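A small sketch of that behavior, using a made-up sample string rather than the real regex from the question:

```python
import re

text = "link blah blah\nhello"

# By default "." stops at the newline.
default_match = re.search(r"link.+", text).group()
# With re.DOTALL, "." also matches "\n".
dotall_match = re.search(r"link.+", text, re.DOTALL).group()

print(repr(default_match))  # 'link blah blah'
print(repr(dotall_match))   # 'link blah blah\nhello'
```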
Only capturing the link
I'm not quite sure what format the amazon link would come in, so it's difficult to say how it should look. However, you know that the link will not contain a space, so stopping the matching when you see a space character would be optimal.
Your original pattern:
(http[s]?:\/\/[a-zA-Z0-9.-]*(?:amazon|amzn).[a-zA-Z]+(?:.+?(?:ref=[^?]+)|.+(?= )|[^?]+))
Modified pattern:
(http[s]?:\/\/[a-zA-Z0-9.-]*(?:amazon|amzn).[a-zA-Z]+(?:.+?(?:ref=[^?]+)|[^ ]+(?= )|[^?]+))
In the modified pattern, I turned one of your last . (basically "match any character") into [^ ] (basically "match anything except a space"). This means you won't start matching the words that follow the space after the link.
Good luck with the Discord bot!
So the reason you're getting a different result between your two different input sources is because you're not doing any checks for the presence of new lines in your regex. This answer goes into a little more detail about how your regex might need to be modified to detect a newline string.
But - if what you really want is just to get a list of links without the rest of the text, you're better off using a different regex string designed to capture just the URL. This post has several different regex strategies for matching just a single URL.
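As one possible illustration of a pattern that captures just the URL, here is a minimal sketch. The pattern AMAZON_URL and the function extract_urls are hypothetical simplifications, not the regex from the question; the sketch assumes a link ends at the first whitespace character.

```python
import re

# Hypothetical simplified pattern: scheme, optional subdomains, an
# amazon/amzn host, a dot-separated suffix, then any run of
# non-whitespace characters.
AMAZON_URL = re.compile(r"https?://[a-zA-Z0-9.-]*(?:amazon|amzn)\.[a-zA-Z.]+\S*")

def extract_urls(message):
    """Return every Amazon-looking URL found in the message."""
    return AMAZON_URL.findall(message)

multi_line = """https://www.amazon.co.uk/dp/B07RLWTXKG blah blah
hello
https://www.amazon.co.uk/dp/B07RLWToop foobar"""
single_line = multi_line.replace("\n", " ")

print(extract_urls(multi_line))
print(extract_urls(single_line))
```

Since \S never matches a space or a newline, both inputs give the same two links. Punctuation that directly follows a link (for example a trailing comma) would still be captured, so real messages may need extra cleanup.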

Is it possible to add "any letter" to a string?

I am parsing a database and extracting entries to a new database. For this I use keywords which should and keywords which should not be included. For a keyword I want excluded, it should be "-anyletter-fv", I wonder if -anyletter- is possible to program. If there is no letter, a space, a comma, or anything but a letter, I don't want to exclude it, only if there is specifically a letter in front of it.
If I understand you correctly, you are trying to exclude those cases in which your keyword starts with some letter.
Use the re library for it (https://docs.python.org/3/library/re.html)
print(re.match(r"^\w.*", " keyword"))
re.match returns a match object if the pattern is found at the start of the string, otherwise None (as in this example, because of the leading space).
You can use it in if-expressions.
The "^" marks the beginning of the sequence, "\w" matches any word character (letters, digits, and underscore; for str patterns in Python 3 this includes non-ASCII letters too), while ".*" matches the rest of the string up to a newline.
Therefore you get a match for keywords that start with a word character, and None for keywords that start with a space, a comma, or other punctuation.
I hope this helps you.
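Applied to the "-anyletter-fv" case from the question, a minimal sketch might look like this. It assumes "fv" is the literal keyword and that "letter" means an ASCII letter; the name should_exclude is made up for illustration.

```python
import re

# "fv" stands in for the keyword from the question; the lookbehind
# requires an ASCII letter directly in front of it.
LETTER_BEFORE_FV = re.compile(r"(?<=[A-Za-z])fv")

def should_exclude(entry):
    """True only if some occurrence of 'fv' has a letter right before it."""
    return LETTER_BEFORE_FV.search(entry) is not None

for entry in ["xfv", "fv", " fv", ",fv", "1fv"]:
    print(repr(entry), should_exclude(entry))
```

Only "xfv" is excluded here; a space, a comma, a digit, or nothing at all in front of "fv" leaves the entry in.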

Regular expression to search for specific twitter username

I have a project where I'm trying to analyze a database of tweets. I need to write a python regex expression that pulls tweets mentioning specific twitter users. Here is an example tweet I'd like to capture.
"That #A_Person is a real jerk."
The regex that I've been trying is
([^.?!]*)(\b([#]A_Person)\b)([^.?!]*)
But it's not working and I've tried lots of variations. Any advice would be appreciated!
\b matches a word boundary, but # is not a word character, so if it occurs after a space, the match will fail. Try removing the word boundary there, and removing the extra groups, and add a character set at the end for [.?!] to include the final punctuation, and you get:
[^.?!]*#A_Person\b.*?[^.?!]*[.?!]
You also might consider including a check for the start of the string or the end of the last sentence, otherwise the engine will go through a lot of steps while going through areas without any matches. Perhaps use
(?:^|(?<=[.?!])\s*)
which will match the start of the string, or will lookbehind for [.?!] possibly followed by spaces. Put those together and you get
(?:^|(?<=[.?!])\s*)([^.?!]*#A_Person\b.*?[^.?!]*[.?!])
where the string you want is in the first group (no leading spaces).
https://regex101.com/r/447KsF/3
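For reference, a minimal sketch of using the combined pattern from Python. The function name mention_sentences and the sample tweets are assumptions for illustration; #A_Person is the placeholder from the question.

```python
import re

# Combined pattern from the answer: anchor at the start of the string or
# just after sentence-ending punctuation, then capture the sentence that
# contains the mention.
SENTENCE_WITH_MENTION = re.compile(
    r"(?:^|(?<=[.?!])\s*)([^.?!]*#A_Person\b.*?[^.?!]*[.?!])"
)

def mention_sentences(tweet):
    """Return the sentences in a tweet that mention #A_Person."""
    return [m.group(1) for m in SENTENCE_WITH_MENTION.finditer(tweet)]

print(mention_sentences("That #A_Person is a real jerk."))
print(mention_sentences("Hello there. That #A_Person is a real jerk. Bye."))
print(mention_sentences("I like #A_Personality tests."))
```

The trailing \b is what keeps #A_Personality from counting as a mention of #A_Person.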

How do I replace all characters in a Python string?

I found a solution on stackoverflow but it doesn't seem to work. I have made a string scanner that checks for character frequency and then replaces all characters with the "real" characters. I've made sure that the character recognition works but when I try replacing all characters in a string they no longer match up with the expected/calculated characters (when I try replacing for example only 2 characters it works fine and matches up perfectly). Here is my replacement code:
print(text.replace(re,'e').replace(rt,'t').replace(ra,'a').replace(ro,'o').replace(ri,'i').replace(rn,'n').replace(rs,'s').replace(rr,'r').replace(rh,'h').replace(rl,'l').replace(ru,'u').replace(rc,'c').replace(rm,'m').replace(rf,'f').replace(ry,'y').replace(rw,'w').replace(rg,'g').replace(rp,'p').replace(rb,'b').replace(rv,'v').replace(rk,'k').replace(rx,'x').replace(rq,'q').replace(rj,'j').replace(rz,'z').replace(rd,'d'))
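You might want to take a look at translate. Chained replace calls run one after another, so a character produced by an earlier replacement can be replaced again by a later one; translate maps every character in a single pass. Your code would probably look something like
text = text.translate(str.maketrans(''.join([ra, rb, rc, rd...]), 'abcd...'))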
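A tiny sketch of why the two approaches can differ, using a made-up two-letter swap instead of the full frequency table from the question:

```python
# Swap "a" and "b" in a sample string.
sample = "abba"

# Chained replace: the second call also rewrites the "b"s that the
# first call just produced.
chained = sample.replace("a", "b").replace("b", "a")

# translate: every character is looked up once in the table.
swap_table = str.maketrans("ab", "ba")
translated = sample.translate(swap_table)

print(chained)     # aaaa
print(translated)  # baab
```

With only two replacements that do not overlap, chained replace can look correct, which would be consistent with the behavior described in the question.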

Python regex ignore punctuation when using re.sub

Let's say I want to convert the word center to centre, theater to theatre, etc. In order to do so, I have written a regex like the one below:
import re
s = "center ce..nnnnnnnnteeeerrrr mmmmeeeeet.eeerrr liiiiIIiter l1t3r"
regex = re.compile(r'(?:((?:(?:[l1]+\W*[i!1]+|m+\W*[e3]+|c+\W*[e3]+\W*n+)\W*t+|t+\W*h+\W*[e3]+\W*a+\W*t+|m+\W*a+\W*n+\W*[e3]+\W*u+\W*v+)\W*)([e3]+)(\W*)(r+))', re.I)
print(regex.sub(r'\1\4\3\2', s))
#prints "centre ce..nnnnnnnntrrrreeee mmmmeeeeet.rrreee liiiiIIitre l1tr3"
In order to account for loopholes like c.e.nn.ttteee,/rr (basically repeated characters and added punctuation), I have been forced to add \W* between each character.
However, people are still able to use strings like c.c.e.e.n.n.t.t.e.e.r.r, which don't match as there is punctuation between each letter, not just different letters.
I was wondering whether there is a smarter method of doing this, where I can use re.sub without removing whitespace/punctuation but nonetheless have it match.
