I have the following Python code:
string = '[subscript=hello] this is some text [subscript=hi again]'
superscripts = re.findall(r'\[subscript=(.+?)\]', string)
print superscripts
It returns ['hello', 'hi again'], which is the text I want. However, instead of extracting that text, I would like to replace it in the string, so that the result is <sub>hello</sub> this is some text <sub>hi again</sub>. I know I should use re.sub(), but I'm unsure how to use it to get this result.
How would I do this? Thanks.
Use the backreference \1 in the replacement string; it refers to the text matched by the first capture group in your pattern:
your_new_string = re.sub(r'\[subscript=(.+?)\]', r'<sub>\1</sub>', your_old_string)
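As a minimal, self-contained sketch of the answer above, using the sample string from the question (the variable names old_string and new_string are only illustrative):

```python
import re

old_string = '[subscript=hello] this is some text [subscript=hi again]'

# (.+?) lazily captures the text between "[subscript=" and the next "]".
# In the replacement, \1 is replaced by whatever that group captured.
new_string = re.sub(r'\[subscript=(.+?)\]', r'<sub>\1</sub>', old_string)
print(new_string)
```

Both the pattern and the replacement are raw strings, so the backslashes reach the re module unchanged.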
I have a list containing strings such as the ones shown below:
strpool = ['fruit,apple:3', '', '[1,abcd, ['fruit,apple'], ['1,kdlld', apple,taste]]']
I want to search for exactly the word 'apple' and replace it with 'apple:3'.
I tried the code below:
print str(strpool).replace("apple","apple:3")
print (re.sub(r"\bapple\b","apple:3",str(strpool)))
But it also replaces the apple inside apple:3, producing apple:3:3, instead of replacing only the standalone apple.
I thought apple:3 would be treated as a separate string and left unchanged.
Update:
How can I match a name exactly, without any other components attached to it, and replace all occurrences inside a list?
I tried re.sub, but that requires converting the list to a string. Is there another way?
Use a negative lookahead to match a word unless it's followed by :.
print(re.sub(r'\bapple\b(?!:)', 'apple:3', str(strpool)))
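The question also asks how to avoid converting the list to a string. One possible approach, sketched here under the assumption that the list holds plain strings, is to apply the same pattern to each element. The list pool below is a hypothetical, well-formed stand-in for the one in the question:

```python
import re

# Hypothetical, well-formed stand-in for the list in the question.
pool = ['fruit,apple:3', '', 'fruit,apple', 'apple,taste', 'pineapple']

# \bapple\b matches the whole word only; (?!:) rejects a match followed by ":".
pattern = re.compile(r'\bapple\b(?!:)')

# Apply the substitution element by element, so the list stays a list.
updated = [pattern.sub('apple:3', item) for item in pool]
print(updated)
```

Nested lists would need a recursive walk; this sketch only handles a flat list of strings.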
I have the string:
text = 'href = "www.google.com" onmouseover = blahblah >'
I want 'href = "www.google.com">'
Currently, my function looks like this:
text = re.sub(r'href = \".*\".*>', 'href = \".*\">', text)
which ends up removing the website link and replacing it with the literal string '.*'. I think I'm supposed to use ?P<name> somehow, but I do not know how to write it properly to get the correct output.
You don't want to substitute in .*, you want to substitute in whatever the first .* matched.
To do that, you need a backreference, like \1.
And this means you need something for the backreference to refer back to—a capture group, like (.*) instead of .*.
More generally, the replacement string is not a regular expression, it's a different kind of thing—basically, it's a template that's all literal characters except for backreferences.* So, you don't want to try to escape the quotes, unless you want literal backslashes in the results.
So:
>>> re.sub(r'href = \"(.*)\".*>', r'href = "\1">', text)
'href = "www.google.com">'
This is explained in more detail in Search and Replace in the Regular Expression HOWTO.
* Or it can be a function which takes each match object and returns a string.
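To illustrate the footnote, here is a small sketch of a function used as the replacement. The function name rebuild and the mixed-case sample text are assumptions made for this example; lowercasing the link is just something a plain template string could not do:

```python
import re

# Hypothetical variant of the question's string, with a mixed-case link.
text = 'href = "WWW.Google.COM" onmouseover = blahblah >'

def rebuild(match):
    # match.group(1) is the text captured between the double quotes.
    return 'href = "' + match.group(1).lower() + '">'

# re.sub calls rebuild once per match and inserts its return value.
result = re.sub(r'href = "(.*)".*>', rebuild, text)
print(result)
```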
An alternative way to accomplish your goal is to take a substring. No regular expression is needed. The idea is to find the second double-quote character using the string method index().
For a string called input, this expression gives you the position of the second double-quote character:
input.index('"', input.index('"')+1)
If that value is k, write input[:k+1] to extract everything up to and including the second double-quote character.
Try out the following in your Python interpreter.
input = 'href = "www.google.com" onmouseover=hax0rFunction()>'
k = input.index('"', input.index('"')+1)
input[0:k+1]
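The slice above stops at the second double quote, so it does not include the closing > that the question asked for. A sketch that appends it, with keep_first_quoted_attribute as a hypothetical helper name:

```python
def keep_first_quoted_attribute(tag):
    # Hypothetical helper: keep everything up to and including the second
    # double quote, then close the tag again.
    first = tag.index('"')
    second = tag.index('"', first + 1)
    return tag[:second + 1] + '>'

tag = 'href = "www.google.com" onmouseover=hax0rFunction()>'
trimmed = keep_first_quoted_attribute(tag)
print(trimmed)
```

Note that str.index raises ValueError when the string contains fewer than two double quotes, so real input may need a guard.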
In Python, I am extracting emails from a string like so:
split = re.split(" ", string)
emails = []
pattern = re.compile(r"^[a-zA-Z0-9_\.-]+#[a-zA-Z0-9-]+.[a-zA-Z0-9-\.]+$")
for bit in split:
    result = pattern.match(bit)
    if result is not None:
        emails.append(bit)
And this works, as long as there is a space in between the emails. But this might not always be the case. For example:
Hello, foo#foo.com
would return:
foo#foo.com
but, take the following string:
I know my best friend mailto:foo#foo.com!
This would return nothing. So the question is: how can I use a regex as the delimiter to split on? I want to get
foo#foo.com
in all cases, regardless of punctuation next to it. Is this possible in Python?
By "splitting by regex" I mean that if the program encounters the pattern in a string, it will extract that part and put it into a list.
I'd say you're looking for re.findall:
>>> email_reg = re.compile(r'[a-zA-Z0-9_.-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+')
>>> email_reg.findall('I know my best friend mailto:foo#foo.com!')
['foo#foo.com']
Notice that findall can handle more than one email address:
>>> email_reg.findall('Text text foo#foo.com, text text, baz#baz.com!')
['foo#foo.com', 'baz#baz.com']
Use re.search or re.findall.
You also need to escape your expression properly (. needs to be escaped outside of character classes, not inside) and remove or replace the anchors ^ and $ (for example with \b), e.g.:
r"\b[a-zA-Z0-9_.+-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+\b"
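A quick sketch of how the \b version behaves on text with trailing punctuation. The sample sentence is made up for illustration, and # is kept as the separator only to mirror the addresses used in this thread:

```python
import re

# Same pattern as above; \b replaces the ^ and $ anchors.
email_reg = re.compile(r"\b[a-zA-Z0-9_.+-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+\b")

# Hypothetical sample text with punctuation right after each address.
found = email_reg.findall('Mail foo#foo.com. Or try mailto:baz#baz.com!')
print(found)
```

The trailing \b forces the match to end on a word character, so the full stop after the first address is not swallowed even though . is allowed inside the last character class.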
The problem I see in your regex is the use of ^, which matches the start of the string, and $, which matches the end of the string. If you remove them and run it against your sample test cases, it works:
>>> re.findall("[A-Za-z0-9\._-]+#[A-Za-z0-9-]+.[A-Za-z0-9-\.]+","I know my best friend mailto:foo#foo.com!")
['foo#foo.com']
>>> re.findall("[A-Za-z0-9\._-]+#[A-Za-z0-9-]+.[A-Za-z0-9-\.]+","Hello, foo#foo.com")
['foo#foo.com']
>>>
I am having difficulty tracking down a bug in my Python (2.7) script. re.sub and re.findall behave differently when recognizing special characters.
Here is the code:
>>> re.sub(ur"[^-' ().,\w]+", '' , u'Castañeda', re.UNICODE)
u'Castaeda'
>>> re.findall(ur"[^-' ().,\w]+", u'Castañeda', re.UNICODE)
[]
When I use findall, it correctly sees ñ as an alphabetic character, but when I use sub it replaces this--treating it as a non-alphabetic character.
I've been able to get the correct functionality using findall with string.replace, but this seems like a bad solution. Also, I want to use re.split, and I'm having the same problems as with re.sub.
Thanks in advance for the help.
The call signature of re.sub is:
re.sub(pattern, repl, string, count=0, flags=0)
So
re.sub(ur"[^-' ().,\w]+", '' , u'Castañeda', re.UNICODE)
is setting count to re.UNICODE, which has value 32.
Try instead:
In [57]: re.sub(ur"(?u)[^-' ().,\w]+", '', u'Castañeda')
Out[57]: u'Casta\xf1eda'
Placing (?u) at the beginning of the regex is an alternate way to specify the re.UNICODE flag in the regex itself. You can also set the other flags (?iLmsux) this way. (For more information, search the re module documentation for "(?iLmsux)".)
Similarly, the call signature of re.split is:
re.split(pattern, string, maxsplit=0, flags=0)
The solution is the same.
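Another way to avoid the mix-up is to pass the flag by keyword. The sketch below uses Python 3 and re.IGNORECASE (whose integer value is 2) instead of re.UNICODE, purely because the effect is easier to see; the sample strings are made up:

```python
import re

text = 'AaAa'

# What the positional call effectively does: the flag's integer value (2)
# is used as count, and matching stays case-sensitive.
as_count = re.sub(r'a', 'X', text, count=int(re.IGNORECASE))

# Passing the flag by keyword actually enables it.
as_flag = re.sub(r'a', 'X', text, flags=re.IGNORECASE)

# re.split accepts flags by keyword in the same way.
parts = re.split(r'a', 'bAbab', flags=re.IGNORECASE)
print(as_count, as_flag, parts)
```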
I have a string that might look like this:
"myFunc('element','node','elementVersion','ext',12,0,0)"
I'm currently checking it for validity using the following pattern, which works fine:
myFunc\((.+?)\,(.+?)\,(.+?)\,(.+?)\,(.+?)\,(.+?)\,(.+?)\)
Now I'd like to replace whatever string is in the 3rd parameter.
Unfortunately, I can't just use a plain string replace on the substring in the 3rd position, since the same substring could appear anywhere else in that string.
With this pattern and re.findall,
myFunc\(.+?\,.+?\,(.+?)\,.+?\,.+?\,.+?\,.+?\)
I was able to get the contents of the substring in the 3rd position, but re.sub does not replace just that substring; it returns only the string I want to replace it with :/
Here's my code:
myRe = re.compile(r"myFunc\(.+?\,.+?\,(.+?)\,.+?\,.+?\,.+?\,.+?\)")
val = "myFunc('element','node','elementVersion','ext',12,0,0)"
print myRe.findall(val)
print myRe.sub("noVersion",val)
Any idea what I've missed?
thanks!
Seb
In re.sub, you need to specify a substitution for the whole matching string. That means that you need to repeat the parts that you don't want to replace. This works:
myRe = re.compile(r"(myFunc\(.+?\,.+?\,)(.+?)(\,.+?\,.+?\,.+?\,.+?\))")
print myRe.sub(r'\1"noversion"\3', val)
If your only tool is a hammer, all problems look like nails. A regular expression is a powerful hammer, but it is not the best tool for every task.
Some tasks are better handled by a parser. In this case the argument list in the string looks just like a Python tuple, so you can cheat and use Python's built-in parser:
>>> strdata = "myFunc('element','node','elementVersion','ext',12,0,0)"
>>> args = re.search(r'\(([^\)]+)\)', strdata).group(1)
>>> eval(args)
('element', 'node', 'elementVersion', 'ext', 12, 0, 0)
If you can't trust the input, ast.literal_eval is safer than eval for this. Once you have the argument list deconstructed, I think you can figure out how to manipulate it and reassemble it again, if needed.
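One way the deconstruct-and-reassemble step could look, as a sketch that assumes every argument is a Python literal (strings and numbers, as in the sample). The names inner, args and rebuilt are only illustrative:

```python
import ast
import re

strdata = "myFunc('element','node','elementVersion','ext',12,0,0)"

# Grab the text between the parentheses, then parse it as a tuple literal.
inner = re.search(r'\(([^)]+)\)', strdata).group(1)
args = list(ast.literal_eval(inner))

# Replace the 3rd argument (index 2) and rebuild the call string.
args[2] = 'noVersion'
rebuilt = 'myFunc(' + ','.join(repr(a) for a in args) + ')'
print(rebuilt)
```

repr() picks the quote style for string arguments, so input that used double quotes would come back with single quotes.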
Read the documentation: re.sub returns a copy of the string where every occurrence of the entire pattern is replaced with the replacement. It cannot in any case modify the original string, because Python strings are immutable.
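Look-ahead and look-behind assertions can restrict the match to the element itself. Note that Python's re module only accepts fixed-width look-behind patterns, so the variable-length prefix cannot go in (?<=...); capture it and re-emit it with \1 instead, and keep the suffix in a look-ahead:
myRe = re.compile(r"(myFunc\(.+?\,.+?\,).+?(?=\,.+?\,.+?\,.+?\,.+?\))")
print myRe.sub(r"\1'noVersion'", val)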
Have you tried using named groups? http://docs.python.org/howto/regex.html#search-and-replace
Hopefully that will let you just target the 3rd match.
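A sketch of what the named-group suggestion could look like for this string. The group names head, third and tail are invented for this example; \g<name> in the replacement refers back to a named group:

```python
import re

val = "myFunc('element','node','elementVersion','ext',12,0,0)"

# Name the parts to keep (head, tail) and the part to replace (third).
my_re = re.compile(
    r"(?P<head>myFunc\(.+?,.+?,)"
    r"(?P<third>.+?)"
    r"(?P<tail>,.+?,.+?,.+?,.+?\))"
)

third = my_re.search(val).group('third')
result = my_re.sub(r"\g<head>'noVersion'\g<tail>", val)
print(third)
print(result)
```

As with numbered groups, the whole match is replaced, so head and tail have to be written back explicitly.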
If you want to do this without using regex:
>>> s = "myFunc('element','node','elementVersion','ext',12,0,0)"
>>> l = s.split(",")
>>> l[2]="'noVersion'"
>>> s = ",".join(l)
>>> s
"myFunc('element','node','noVersion','ext',12,0,0)"