RegEx For Multiple Search & Replace - python

I'm trying to do a search and replace (for multiple chars) in the following string:
VAR=%2FlkdMu9zkpE8w7UKDOtkkHhJlYZ6CaEaxqmsA%2B7G3e8%3D&
One or more of these sequences: %3D, %2F, %2B, %23 can appear anywhere in the string (beginning, middle, or end). Ideally, I'd like to search for all of them at once (using one regex), replace them with =, /, + or # respectively, and return the final string.
Example 1:
VAR=%2FlkdMu9zkpE8w7UKDOtkkHhJlYZ6CaEaxqmsA%2B7G3e8%3D&
Should return
VAR=/lkdMu9zkpE8w7UKDOtkkHhJlYZ6CaEaxqmsA+7G3e8=&
Example 2:
VAR=s2P0n6I%2Flonpj6uCKvYn8PCjp%2F4PUE2TPsltCdmA%3DRQPY%3D&
Should return
VAR=s2P0n6I/lonpj6uCKvYn8PCjp/4PUE2TPsltCdmA=RQPY=&

I'm not convinced you need regex for this, but it's fairly easy to do with Python:
import re

x = 'VAR=%2FlkdMu9zkpE8w7UKDOtkkHhJlYZ6CaEaxqmsA%2B7G3e8%3D&'

MAPPING = {
    '%3D': '=',
    '%2F': '/',
    '%2B': '+',
    '%23': '#',
}

def replace(match):
    return MAPPING[match.group(0)]

print x
print re.sub('%[A-Z0-9]{2}', replace, x)
Output:
VAR=%2FlkdMu9zkpE8w7UKDOtkkHhJlYZ6CaEaxqmsA%2B7G3e8%3D&
VAR=/lkdMu9zkpE8w7UKDOtkkHhJlYZ6CaEaxqmsA+7G3e8=&
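One caveat worth adding here (my note, not part of the original answer): the generic %[A-Z0-9]{2} pattern raises a KeyError for any escape code that isn't in MAPPING. A defensive variant can fall back to the matched text itself:
print re.sub('%[A-Z0-9]{2}', lambda m: MAPPING.get(m.group(0), m.group(0)), x)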

There is no need for a regex to do that in your example. A simple replace method will do:
def rep(s):
    for pat, txt in [['%2F', '/'], ['%2B', '+'], ['%3D', '='], ['%23', '#']]:
        s = s.replace(pat, txt)
    return s
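For example, applied to the first sample string from the question (a quick check of the rep function above):
>>> rep('VAR=%2FlkdMu9zkpE8w7UKDOtkkHhJlYZ6CaEaxqmsA%2B7G3e8%3D&')
'VAR=/lkdMu9zkpE8w7UKDOtkkHhJlYZ6CaEaxqmsA+7G3e8=&'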

I'm also not convinced you need regex, but there is a better way to do URL-decoding with regex: every %XX sequence in the string should be converted into the character it represents. This can be done with re.sub() like so:
>>> VAR="%2FlkdMu9zkpE8w7UKDOtkkHhJlYZ6CaEaxqmsA%2B7G3e8%3D&"
>>> re.sub(r'%..', lambda x: chr(int(x.group()[1:], 16)), VAR)
'/lkdMu9zkpE8w7UKDOtkkHhJlYZ6CaEaxqmsA+7G3e8=&'
Enjoy.

var = "VAR=s2P0n6I%2Flonpj6uCKvYn8PCjp%2F4PUE2TPsltCdmA%3DRQPY%3D&"
var = var.replace("%2F", "/")
var = var.replace("%2B", "+")
var = var.replace("%3D", "=")
But you get the same result with urllib2.unquote:
import urllib2
var = "VAR=s2P0n6I%2Flonpj6uCKvYn8PCjp%2F4PUE2TPsltCdmA%3DRQPY%3D&"
var = urllib2.unquote(var)
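Note that on Python 3 the same helper lives in urllib.parse rather than urllib2 (a minimal equivalent sketch):
from urllib.parse import unquote

var = "VAR=s2P0n6I%2Flonpj6uCKvYn8PCjp%2F4PUE2TPsltCdmA%3DRQPY%3D&"
var = unquote(var)  # 'VAR=s2P0n6I/lonpj6uCKvYn8PCjp/4PUE2TPsltCdmA=RQPY=&'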

This can't be done with a single plain regex substitution, because there's no way to write a conditional inside the pattern itself. A regular expression on its own can only answer the question "Does this string match this pattern?"; to perform "if it matches this, replace part of it with this; if it matches that, replace it with that", you need several substitutions or a replacement callback.


How to remove characters from a str in python?

I have the following str values from which I want to delete characters.
For example:
from str1 = "A.B.1912/2013(H-0)02322"
to 1912/2013
from str2 = "I.M.1591/2017(I-299)17529"
to 1591/2017
from str3 = "I.M.C.15/2017(I-112)17529"
to 15/2017
I'm trying it this way, but I still need to remove the rest from the ( to the right:
newStr = str1.strip('A.B.')
'1912/2013(H-0)02322'
For the moment I'm doing it with slice notation
str1 = "A.B.1912/2013(H-0)02322"
str1 = str1[4:13]
'1912/2013'
But not all have the same length.
Any ideas or suggestions?
With some (modest) assumptions about the format of the strings, here's a solution without using regex:
First split the string on the ( character, keeping the substring on the left:
left = str1.split( '(' )[0] # "A.B.1912/2013"
Then, split the result on the last . (i.e. split from the right just once), keeping the second component:
cut = left.rsplit('.', 1)[1] # "1912/2013"
or combining the two steps into a function:
def extract(s):
    return s.split('(')[0].rsplit('.', 1)[1]
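Applied to the question's samples, this gives (a quick check of the combined function above):
>>> extract("A.B.1912/2013(H-0)02322")
'1912/2013'
>>> extract("I.M.C.15/2017(I-112)17529")
'15/2017'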
Use a regex instead:
import re
regex = re.compile(r'\d+/\d+')
print(regex.search(str1).group())
print(regex.search(str2).group())
print(regex.search(str3).group())
Output:
1912/2013
1591/2017
15/2017
We can try using re.sub here with a capture group:
str1 = "A.B.1912/2013(H-0)02322"
output = re.sub(r'.*\b(\d+/\d+)\b.*', '\\1', str1)
print(output)
1912/2013
You can use a regular expression to solve this problem:
import re
pattern = r'\d+/\d+'
str1 = "A.B.1912/2013(H-0)02322"
srt2 = "I.M.1591/2017(I-299)17529"
str3 = "I.M.C.15/2017(I-112)17529"
print(*re.findall(pattern, str1))
print(*re.findall(pattern, str2))
print(*re.findall(pattern, str3))
Output:
1912/2013
1591/2017
15/2017

Regular Expression (find matching characters in order)

Let us say that I have the following string variables:
welcome = "StackExchange 2016"
string_to_find = "Sx2016"
Here, I want to find the string string_to_find inside welcome using regular expressions. I want to see if each character in string_to_find comes in the same order as in welcome.
For instance, this check would evaluate to True since the 'S' comes before the 'x' in both strings, the 'x' before the '2', the '2' before the '0', and so forth.
Is there a simple way to do this using regex?
The answer is rather simple: the .* combination matches 0 or more characters, so for your purpose you would put it between all of the characters, as in S.*x.*2.*0.*1.*6. If this pattern is matched, then the string obeys your condition.
For a general string you would insert the .* pattern between characters, also taking care of escaping special characters like literal dots, stars etc. that may otherwise be interpreted by regex.
This function might fit your need
import re
def check_string(text, pattern):
    return re.match('.*'.join(pattern), text)
'.*'.join(pattern) creates a pattern with all your characters separated by '.*'. For instance:
>> ".*".join("Sx2016")
'S.*x.*2.*0.*1.*6'
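A quick sanity check (wrapping the result in bool is my own addition, to make the truthiness explicit):
>>> bool(check_string("StackExchange 2016", "Sx2016"))
True
>>> bool(check_string("StackExchange 2015", "Sx2016"))
False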
Use wildcard matches with ., repeating with *:
expression = 'S.*x.*2.*0.*1.*6'
You can also assemble this expression with join():
expression = '.*'.join('Sx2016')
Or just find it without a regular expression: check whether the location of each of string_to_find's characters within welcome proceeds in ascending order, handling the case where a character in string_to_find is not present in welcome by catching the ValueError:
>>> welcome = "StackExchange 2016"
>>> string_to_find = "Sx2016"
>>> try:
... result = [welcome.index(c) for c in string_to_find]
... except ValueError:
... result = None
...
>>> print(result and result == sorted(result))
True
Actually, for a sequence of chars like Sx2016 the pattern that best serves your purpose is a more specific one:
S[^x]*x[^2]*2[^0]*0[^1]*1[^6]*6
You can obtain this kind of check defining a function like this:
import re
def contains_sequence(text, seq):
    pattern = seq[0] + ''.join(map(lambda c: '[^' + c + ']*' + c, list(seq[1:])))
    return re.search(pattern, text)
This approach adds a layer of complexity but brings a couple of advantages as well:
It's the fastest one, because the regex engine walks down the string only once, while the dot-star approach goes to the end of the string and back each time a .* is used. Compare on the same string (~1k chars):
Negated class -> 12 steps
Dot star -> 4426 steps
It also works on multiline input strings.
Example code
>>> sequence = 'Sx2016'
>>> inputs = ['StackExchange2015','StackExchange2016','Stack\nExchange\n2015','Stach\nExchange\n2016']
>>> map(lambda x: x + ': yes' if contains_sequence(x,sequence) else x + ': no', inputs)
['StackExchange2015: no', 'StackExchange2016: yes', 'Stack\nExchange\n2015: no', 'Stach\nExchange\n2016: yes']

Delete substring not matching regex in Python

I have a string like:
'class="a", class="b", class="ab", class="body", class="etc"'
I want to delete everything except class="a" and class="b".
How can I do it? I think the problem is easy but I'm stuck.
Here is one of my attempts, but it didn't solve my problem:
re.sub(r'class="also"|class="etc"', '', a)
My string is very long HTML code with a lot of classes, and I want to keep only two of them and drop all the others.
Sometimes it's good to take a break. I found a solution for my case with bleach:
import bleach

def filter_class(name, value):
    if name == 'class' and value == 'aaa':
        return True

attrs = {
    'div': filter_class,
}

# html holds the original markup string
bleach.clean(html, tags=['div'], attributes=attrs, strip_comments=True)
You tried to explicitly enumerate those substrings you wanted to delete. Rather than writing such long patterns, you can just use negative lookaheads that provide a means to add exclusions to some more generic pattern.
Here is a regex you can use to remove those substrings in a clean way and disregarding order:
,? ?\bclass="(?![ab]")[^"]+"
Here, with (?![ab]")[^"]+, we match 1 or more characters other than " ([^"]+), but not those equal to a or b ((?![ab]")).
Here is a sample code:
import re
p = re.compile(r',? ?\bclass="(?![ab]")[^"]+"')
test_str = "class=\"a\", class=\"b\", class=\"ab\", class=\"body\", class=\"etc\"\nclass=\"b\", class=\"ab\", class=\"body\", class=\"etc\", class=\"a\"\nclass=\"b\", class=\"ab\", class=\"body\", class=\"a\", class=\"etc\""
result = re.sub(p, '', test_str)
print(result)
NOTE: If instead of a and b you have longer sequences, use a non-capturing group (?:...|...) inside the lookahead instead of a character class:
,? ?\bclass="(?!(?:arbuz|baklazhan)")[^"]+"
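For instance, applying that alternation-based pattern to a made-up string (the class names are just the placeholders used above):
>>> p2 = re.compile(r',? ?\bclass="(?!(?:arbuz|baklazhan)")[^"]+"')
>>> re.sub(p2, '', 'class="arbuz", class="tomat", class="baklazhan"')
'class="arbuz", class="baklazhan"'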
Another pretty simple solution, good luck:
st = 'class="a", class="b", class="ab", class="body", class="etc"'
import re
res = re.findall(r'class="[a-b]"', st)
print res
'['class="a"', 'class="b"']'
You can also use re.sub very easily:
res = re.sub(r'class="[a-zA-Z][a-zA-Z].*"', "", st)
print res
class="a", class="b"
If you only wanted to keep the first two entries, one approach would be to use the split() function. This will split your string into a list at given separator points. In your case, this could be a comma. The first two list elements can then be joined back together with commas.
text = 'class="a", class="b", class="ab", class="body", class="etc"'
print ",".join(text.split(",")[:2])
Would give class="a", class="b"
If the entries can be anywhere, and for an arbitrary list of wanted classes:
def keep(text, keep_list):
    keep_set = set(re.findall(r"class\w*=\w*[\"'](.*?)[\"']", text)).intersection(set(keep_list))
    output_list = ['class="%s"' % a_class for a_class in keep_set]
    return ', '.join(output_list)
print keep('class="a", class="b", class="ab", class="body", class="etc"', ["a", "b"])
print keep('class="a", class="b", class="ab", class="body", class="etc"', ["body", "header"])
This would print:
class="a", class="b"
class="body"

Using parentheses as delimiter in re or str.split() python

I am trying to split a string such as: add(ten)sub(one) into add(ten) sub(one).
I can't figure out how to match the closing parenthesis. I have used re.sub(r'\\)', '\\) ') and every variation of escaping the parentheses I can think of. It is hard to tell in this font, but I am trying to add a space between these commands so I can split them into a list later.
There's no need to escape ) in the replacement string; ) has a special meaning only in the regex pattern, so it needs to be escaped there in order to match it in the string, but in a normal string it can be used as is.
>>> strs = "add(ten)sub(one)"
>>> re.sub(r'\)(?=\S)',r') ', strs)
'add(ten) sub(one)'
As #StevenRumbalski pointed out in comments the above operation can be simply done using str.replace and str.rstrip:
>>> strs.replace(')',') ').rstrip()
'add(ten) sub(one)'
d = ')'
my_str = 'add(ten)sub(one)'
result = [t+d for t in my_str.split(d) if len(t) > 0]
# result == ['add(ten)', 'sub(one)']
Create a list of all substrings
import re
a = 'add(ten)sub(one)'
print [ b for b in re.findall('(.+?\(.+?\))', a) ]
Output:
['add(ten)', 'sub(one)']

How can I search and replace using python regex

I want to make a function which searches for the strings in the array and then replaces them with the corresponding element from the dictionary. So far I have tried this, but I am not able to figure out a few things:
How can I escape special characters?
How can I replace with the match that was found? I tried \1 but it didn't work.
def myfunc(h):
    myarray = {
        "#": "\\#",
        "$": "\\$",
        "%": "\\%",
        "&": "\\&",
        "~": "\\~{}",
        "_": "\\_",
        "^": "\\^{}",
        "\\": "\\textbackslash{}",
        "{": "\\{",
        "}": "\\}"
    }
    pattern = "[#\$\%\&\~\_\^\\\\\{\}]"
    pattern_obj = re.compile(pattern, re.MULTILINE)
    new = re.sub(pattern_obj, myarray[\1], h)
    return new
You're looking for re.sub callbacks:
import re

def myfunc(h):
    rules = {
        "#": r"\#",
        "$": r"\$",
        "%": r"\%",
        "&": r"\&",
        "~": r"\~{}",
        "_": r"\_",
        "^": r"\^{}",
        "\\": r"\textbackslash{}",
        "{": r"\{",
        "}": r"\}"
    }
    pattern = '[%s]' % re.escape(''.join(rules.keys()))
    new = re.sub(pattern, lambda m: rules[m.group()], h)
    return new
This way you avoid 1) loops, 2) replacing already processed content.
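A quick check of the callback version (the input string below is just my own example, not from the question):
>>> print(myfunc("50% of #tags & value_1 ^ 2"))
50\% of \#tags \& value\_1 \^{} 2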
You can try to use re.sub inside a loop that iterates over myarray.items(). However, you'll have to handle the backslash first, since otherwise it might replace things incorrectly. You also need to make sure that "{" and "}" come before the other replacements, so that you don't mix up the matching. Since dictionaries are unordered, I suggest you use a list of tuples instead:
import re

def myfunc(h):
    myarray = [
        ("\\", "\\textbackslash"),
        ("{", "\\{"),
        ("}", "\\}"),
        ("#", "\\#"),
        ("$", "\\$"),
        ("%", "\\%"),
        ("&", "\\&"),
        ("~", "\\~{}"),
        ("_", "\\_"),
        ("^", "\\^{}")]
    for (val, replacement) in myarray:
        # re.escape makes characters like $, ^ and \ match literally,
        # and the callback inserts the replacement text verbatim
        h = re.sub(re.escape(val), lambda m: replacement, h)
    h = re.sub(re.escape("\\textbackslash"), lambda m: "\\textbackslash{}", h)
    return h
I'd suggest you use raw string literal syntax (r"") for better readability of the code.
For the case of your array, you may want to just use the str.replace function instead of re.sub.
def myfunc(h):
    myarray = [
        ("\\", r"\textbackslash"),
        ("{", r"\{"),
        ("}", r"\}"),
        ("#", r"\#"),
        ("$", r"\$"),
        ("%", r"\%"),
        ("&", r"\&"),
        ("~", r"\~{}"),
        ("_", r"\_"),
        ("^", r"\^{}")]
    for (val, replacement) in myarray:
        h = h.replace(val, replacement)
    h = h.replace(r"\textbackslash", r"\textbackslash{}")
    return h
The code is a modification of #tigger's answer.
To escape metacharacters, use a raw string and backslashes:
r"regexp with a \* in it"
