Using Python regex to sanitize input string - python

I have the string:
text = 'href = "www.google.com" onmouseover = blahblah >'
I want 'href = "www.google.com">'
Currently, my function looks like this:
text = re.sub(r'href = \".*\".*>', 'href = \".*\">', text)
which ends up removing the website link and replacing it with the string '.*' . I think I'm supposed to use ?Pname somehow?, but do not know ho to write it properly so that I get the correct output.

You don't want to substitute in .*, you want to substitute in whatever the first .* matched.
To do that, you need a backreference, like \1.
And this means you need something for the backreference to refer back to—a capture group, like (.*) instead of .*.
More generally, the replacement string is not a regular expression, it's a different kind of thing—basically, it's a template that's all literal characters except for backreferences.* So, you don't want to try to escape the quotes, unless you want literal backslashes in the results.
So:
>>> re.sub(r'href = \"(.*)\".*>', r'href = "\1">', text)
'href = "www.google.com">'
This is explained in more detail in Search and Replace in the Regular Expression HOWTO.
* Or it can be a function which takes each match object and returns a string.

An alternative way to accomplish your goal is to take a substring. No regular expression is needed. The idea is to find the second double-quote character using the string method index().
For a string called input, this expression gives you the position of the second double-quote character:
input.index('"', input.index('"')+1)
If that value is k, write input[:k+1] to extract everything up to and including the second double-quote character.
Try out the following in your Python interpreter.
input = 'href = "www.google.com" onmouseover=hax0rFunction()>'
k = input.index('"', input.index('"')+1)
input[0:k+1]

Related

RegEx for a delimited string

I have a string like this '432342:username:full_name:1'. I need to write regular expression to check if string matches it.
I tried to .split(':') and then by accesing dict[i] checking if value in regular expression. But I need to match whole string.
only numbers:english letters and numbers:english, russian letters:1,2,3
Also tried like this but I don't understand how to add ':' separator to separate the string. Like in example above
pattern = r'[/b:]|[\d]|[a-zA-Z]|[а-яА-Я]|[1,2,3]'
As per your instructions, try this:
s = '432342:username:full_name:1'
re.findall(r'[0-9]+:[a-zA-Z]+:[а-яА-Я_]+:[123]',s)
#['432342:username:full_name:1']

Replace text between parentheses in python

My string will contain () in it. What I need to do is to change the text between the brackets.
Example string: "B.TECH(CS,IT)".
In my string I need to change the content present inside the brackets to something like this.. B.TECH(ECE,EEE)
What I tried to resolve this problem is as follows..
reg = r'(()([\s\S]*?)())'
a = 'B.TECH(CS,IT)'
re.sub(reg,"(ECE,EEE)",a)
But I got output like this..
'(ECE,EEE)B(ECE,EEE).(ECE,EEE)T(ECE,EEE)E(ECE,EEE)C(ECE,EEE)H(ECE,EEE)((ECE,EEE)C(ECE,EEE)S(ECE,EEE),(ECE,EEE)I(ECE,EEE)T(ECE,EEE))(ECE,EEE)'
Valid output should be like this..
B.TECH(CS,IT)
Where I am missing and how to correctly replace the text.
The problem is that you're using parentheses, which have another meaning in RegEx. They're used as grouping characters, to catch output.
You need to escape the () where you want them as literal tokens. You can escape characters using the backslash character: \(.
Here is an example:
reg = r'\([\s\S]*\)'
a = 'B.TECH(CS,IT)'
re.sub(reg, '(ECE,EEE)', a)
# == 'B.TECH(ECE,EEE)'
The reason your regex does not work is because you are trying to match parentheses, which are considered meta characters in regex. () actually captures a null string, and will attempt to replace it. That's why you get the output that you see.
To fix this, you'll need to escape those parens – something along the lines of
\(...\)
For your particular use case, might I suggest a simpler pattern?
In [268]: re.sub(r'\(.*?\)', '(ECE,EEE)', 'B.TECH(CS,IT)')
Out[268]: 'B.TECH(ECE,EEE)'

re.sub repl function returning \1 does not replace the group

I am trying to write a generic replace function for a regex sub operation in Python (trying in both 2 and 3) Where the user can provide a regex pattern and a replacement for the match. This could be just a simple string replacement to replacing using the groups from the match.
In the end, I get from the user a dictionary in this form:
regex_dict = {pattern:replacement}
When I try to replace all the occurrences of a pattern via this command, the replacement works for replacements for a group number, (such as \1) and I call the following operation:
re.sub(pattern, regex_dict[pattern], text)
This works as expected, but I need to do additional stuff when a match is found. Basically, what I try to achieve is as follows:
replace_function(matchobj):
result = regex_dict[matchobj.re]
##
## Do some other things
##
return result
re.sub(pattern, replace_function, text)
I see that this works for normal replacements, but the re.sub does not use the group information to get the match when the function is used.
I also tried to convert the \1 pattern to \g<1>, hoping that the re.sub would understand it, but to no avail.
Am I missing something vital?
Thanks in advance!
Additional notes: I compile the pattern using strings as in bytes, and the replacements are also in bytes. I have non-Latin characters in my pattern, but I read everything in bytes, including the text where the regex substitution will operate on.
EDIT
Just to clarify, I do not know in advance what kind of replacement the user will provide. It could be some combination of normal strings and groups, or just a string replacement.
SOLUTION
replace_function(matchobj):
repl = regex_dict[matchobj.re]
##
## Do some other things
##
return matchobj.expand(repl)
re.sub(pattern, replace_function, text)
I suspect you're after .expand, if you've got a compiled regex object (for instance), you can provide a string to be taken into consideration for the replacements, eg:
import re
text = 'abc'
# This would be your key in the dict
rx = re.compile('a(\w)c')
# This would be the value for the key (the replacement string, eg: `\1\1\1`)
res = rx.match(text).expand(r'\1\1\1')
# bbb

Python split by regular expression

In Python, I am extracting emails from a string like so:
split = re.split(" ", string)
emails = []
pattern = re.compile("^[a-zA-Z0-9_\.-]+#[a-zA-Z0-9-]+.[a-zA-Z0-9-\.]+$");
for bit in split:
result = pattern.match(bit)
if(result != None):
emails.append(bit)
And this works, as long as there is a space in between the emails. But this might not always be the case. For example:
Hello, foo#foo.com
would return:
foo#foo.com
but, take the following string:
I know my best friend mailto:foo#foo.com!
This would return null. So the question is: how can I make it so that a regex is the delimiter to split? I would want to get
foo#foo.com
in all cases, regardless of punctuation next to it. Is this possible in Python?
By "splitting by regex" I mean that if the program encounters the pattern in a string, it will extract that part and put it into a list.
I'd say you're looking for re.findall:
>>> email_reg = re.compile(r'[a-zA-Z0-9_.-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+')
>>> email_reg.findall('I know my best friend mailto:foo#foo.com!')
['foo#foo.com']
Notice that findall can handle more than one email address:
>>> email_reg.findall('Text text foo#foo.com, text text, baz#baz.com!')
['foo#foo.com', 'baz#baz.com']
Use re.search or re.findall.
You also need to escape your expression properly (. needs to be escaped outside of character classes, not inside) and remove/replace the anchors ^ and $ (for example with \b), eg:
r"\b[a-zA-Z0-9_.+-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+\b"
The problem I see in your regex is your use of ^ which matches the start of a string and $ which matches the end of your string. If you remove it and then run it with your sample test case it will work
>>> re.findall("[A-Za-z0-9\._-]+#[A-Za-z0-9-]+.[A-Za-z0-9-\.]+","I know my best friend mailto:foo#foo.com!")
['foo#foo.com']
>>> re.findall("[A-Za-z0-9\._-]+#[A-Za-z0-9-]+.[A-Za-z0-9-\.]+","Hello, foo#foo.com")
['foo#foo.com']
>>>

Need regular expression expert: round bracket within stringliteral

I'm searching for strings within strings using Regex. The pattern is a string literal that ends in (, e.g.
# pattern
" before the bracket ("
# string
this text is before the bracket (and this text is inside) and this text is after the bracket
I know the pattern will work if I escape the character with a backslash, i.e.:
# pattern
" before the bracket \\("
But the pattern strings are coming from another search and I can not control what characters will be or where. Is there a way of escaping an entire string literal so that anything between markers is treated as a string? For example:
# pattern
\" before the ("
The only other option I have is to do a substitute adding escapes for every protected character.
re.escape is exactly what I need. I'm using regexp in Access VBA which doens't have that method. I only have replace, execute or test methods.
Is there a way to escape everything within a string in VBA?
Thanks
You didn't specify the language, but it looks like Python, so if you have a string in Python whose special regex characters you need to escape, use re.escape():
>>> import re
>>> re.escape("Wow. This (really) is *cool*")
'Wow\\.\\ This\\ \\(really\\)\\ is\\ \\*cool\\*'
Note that spaces are escaped, too (probably to ensure that they still work in a re.VERBOSE regex).
Maybe write your own VBA escape function:
Function EscapeRegEx(text As String) As String
Dim regEx As RegExp
Set regEx = New RegExp
regEx.Global = True
regEx.Pattern = "(\[|\\|\^|\$|\.|\||\?|\*|\+|\(|\)|\{|\})"
EscapeRegEx = regEx.Replace(text, "\$1")
End Function
I'm pretty sure that with the limitations of the RegExp abilities in VBA/VBScript, you are going to have to replace the special characters in your pattern before using it. There doesn't seem to be anything built into it like there is in Python.
The following regex will capture everything from the beginning of the string to the first (. The first captured group $1 will contain the portion before (.
^([^(]+)\(
Depending on your language, you might have to escape it as:
"^([^(]+)\\("

Categories

Resources