Using "r" with variables in re.sub - python

I need to do this:
text = re.sub(r'\]\n', r']', text)
But with find and replace as variables:
find = '\]\n'
replace = ']'
text = re.sub(find, replace, text)
Where should I put r (raw)? It is not a string.

The r'' is part of the string literal syntax:
find = r'\]\n'
replace = r']'
text = re.sub(find, replace, text)
The syntax is in no way specific to the re module. However, specifying regular expressions is one of the main use cases for raw strings.

Short answer: you should keep the r together with the string.
The r prefix is part of the string syntax. With r, Python doesn't interpret backslash sequences such as \n, \t etc inside the quotes. Without r, you'd have to type each backslash twice in order to pass it to re.sub.
r'\]\n'
and
'\\]\\n'
are two ways to write the same string.

Keep r'...'
find = r'\]\n'
replace = r']'
text = re.sub(find, replace, text)
or go with
find = '\\]\\n'
replace = ']'
text = re.sub(find, replace, text)
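A quick way to check this yourself is to compare the two spellings and then pass the pattern to re.sub through a variable. This is only a minimal sketch, and the sample text is made up for illustration:

```python
import re

# Both literals produce the same 4-character string: \ ] \ n
raw_pattern = r'\]\n'
escaped_pattern = '\\]\\n'
same = raw_pattern == escaped_pattern

# Hypothetical sample text: closing brackets followed by newlines
text = 'a]\nb]\nc'
find = raw_pattern
replace = ']'
result = re.sub(find, replace, text)

print(same)          # True
print(repr(result))  # 'a]b]c'
```

The r only affects how the literal is read in the source code. Once the string is stored in find, re.sub cannot tell which spelling was used.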

Related

Python .replace() function, removing backslash in certain way

I have a huge string which contains emoticon escapes like "\u201d", as well as text like "\advance\".
All I need is to remove the backslashes so that:
- \u201d = \u201d
- \united\ = united
(the stray backslashes break the upload to a BigQuery database)
I know it should be something like this:
string.replace('\\', '') but I'm not sure how to keep the \u201d escapes.
ADDITIONAL:
Examples of Unicode emoticons:
\ud83d\udc9e
\u201c
\u2744\ufe0f\u2744\ufe0f\u2744\ufe0f
You can remove every '\' and then use a regex to put the leading '\' back in front of each emoticon escape:
s = '\\advance\\\\united\\ud83d\\udc9e\\u201c\\u2744\\ufe0f\\u2744\\ufe0f\\u2744\\ufe0f'
import re
print(re.sub('(u[a-f0-9]{4})',lambda m: '\\'+m.group(0),''.join(s.split('\\'))))
As your emoticons are 'u' followed by four hex digits, 'u[a-f0-9]{4}' will match them all, and you just have to add the leading backslash back.
First, you delete every '\' in the string with either ''.join(s.split('\\')) or s.replace('\\', '').
Then you match every emoticon with the regex u[a-f0-9]{4} (a u followed by four hex digits).
Finally, with re.sub, you prefix every match with a backslash.
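A different technique that avoids the delete-then-restore round trip is to remove only the backslashes that do not start a \uXXXX escape, using a negative lookahead. This is a sketch rather than a drop-in solution: the names STRAY_BACKSLASH and strip_stray_backslashes and the sample string are made up for illustration, and ordinary text that happens to look like an escape (a backslash followed by u and four hex characters) would be kept as well.

```python
import re

# Match a backslash unless it is followed by u and four hex digits
STRAY_BACKSLASH = re.compile(r'\\(?!u[0-9a-fA-F]{4})')

def strip_stray_backslashes(s):
    """Remove every backslash that does not begin a \\uXXXX escape."""
    return STRAY_BACKSLASH.sub('', s)

# Hypothetical sample mixing stray backslashes and escapes
sample = r'\advance\ \united\ \ud83d\udc9e \u201c'
cleaned = strip_stray_backslashes(sample)
print(cleaned)  # advance united \ud83d\udc9e \u201c
```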
You could simply add the backslash back in front of the string after the replacement if the string contains \u followed by at least one digit.
import re
def clean(s):
    re1 = '(\\\\)'  # a literal backslash
    re2 = '(u)'     # the letter "u"
    re3 = '.*?'     # non-greedy match on filler
    re4 = '(\\d)'   # any single digit
    rg = re.compile(re1 + re2 + re3 + re4, re.IGNORECASE | re.DOTALL)
    m = rg.search(s)
    if m:
        # Looks like an escape: strip all backslashes, then restore the leading one
        r = '\\' + s.replace('\\', '')
    else:
        r = s.replace('\\', '')
    return r
a = '\\u123'
b = '\\united\\'
c = '\\ud83d'
>>> print(a, b, c)
\u123 \united\ \ud83d
>>> print(clean(a), clean(b), clean(c))
\u123 united \ud83d
Of course, you have to split your string if multiple entries are on the same line:
string = '\\u123 \\united\\ \\ud83d'
clean_string = ' '.join([clean(word) for word in string.split()])
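You can use this simple method to replace the last occurrence of the backslash:
def replace_character(s, old, new):
    # Reverse the string, replace the first match, then reverse it back
    return (s[::-1].replace(old[::-1], new[::-1], 1))[::-1]
print(replace_character('\\advance\\', '\\', ''))
print(replace_character('\\u201d', '\\', ''))
Output:
\advance
u201d
Note that this removes only the last backslash, so the leading backslash of \advance survives and the one in \u201d is removed.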
You can do it as simply as this:
text = text.replace(text[-1],'')
Note that this removes every occurrence of the last character, not just the final one, so '\united\' becomes 'united'.

Using variables in re.findall() regex function

I have a list of regex patterns like k[a-z]p[a-z]+a
and a list of words that can fit into these patterns. Now, the problem is that,
when I use:
re.findall(r'k[a-z]p[a-z]+a', list)
Everything works properly, but when I replace the raw expression with a variable like:
pattern = "r'" + pattern + "'"
and then try:
re.findall(pattern, list)
or
re.findall(str(pattern), list)
It no longer works. How could I fix it?
Thanks!
Spike
You are overthinking it. The r prefix is not part of the pattern string itself, it merely indicates that the following string should not use escape codes for certain characters.
This will work without adjusting your pattern:
re.findall(pattern, list)
Your pattern contains no characters that need escaping, so it needs no prefix at all. If a pattern does contain backslashes, add the prefix r where the pattern is defined. Suppose you want to search for a different regex, then use
pattern = r'k\wp\wa'
re.findall(pattern, list)
and you don't need to escape it. Since pattern in itself is a perfectly ordinary string, you can concatenate it with other strings:
start = 'a'
middle = 'b'
end = 'c'
pattern = start + r'\w' + middle + r'\w' + end
re.findall(pattern, list)
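To make the point concrete, here is a small sketch. The word list is invented for illustration, and since the re functions take a string rather than a list, each word is tested separately with re.fullmatch:

```python
import re

pattern = 'k[a-z]p[a-z]+a'   # no r needed: the pattern has no backslashes
words = ['kappa', 'kopra', 'kipa', 'delta']  # hypothetical word list

# Keep the words that match the whole pattern
matching = [w for w in words if re.fullmatch(pattern, w)]
print(matching)  # ['kappa', 'kopra']

# Gluing "r'" and "'" onto the pattern adds literal characters to the regex,
# so it only matches text that really contains r'...'
wrapped = "r'" + pattern + "'"
print(re.findall(wrapped, 'kappa'))  # []
```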

Python pattern match a string

I am trying to pattern match a string, so that if it ends in the characters 'std' I split the last 6 characters and append a different prefix.
I am assuming I can do this with regular expressions and re.split, but I am unsure of the correct notation to append a new prefix and take last 6 chars based on the presence of the last 3 chars.
regex = r"([a-zA-Z])"
if re.search(regex, "std"):
    match = re.search(regex, "std")
    #re.sub(r'\Z', '', varname)
You're confused about how to use regular expressions here. Your code is saying "search the string 'std' for any ASCII letter".
But there is no need to use regexes here anyway. Just use string slicing and .endswith:
if my_string.endswith('std'):
    new_string = new_prefix + my_string[-6:]
No need for a regex. Just use standard string methods:
if s.endswith('std'):
    s = s[:-6] + new_suffix
But if you had to use a regex, you would substitute the new suffix in:
regex = re.compile(r".{3}std$")
s = regex.sub(new_suffix, s)
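The answers above read the question in two ways, so the sketch below shows both side by side. The helper names replace_tail and prefix_tail and the sample values are made up for illustration:

```python
import re

def replace_tail(s, new_suffix):
    """Swap the last six characters for new_suffix when s ends in 'std'."""
    if s.endswith('std'):
        return s[:-6] + new_suffix
    return s

def prefix_tail(s, new_prefix):
    """Keep only the last six characters, with new_prefix in front."""
    if s.endswith('std'):
        return new_prefix + s[-6:]
    return s

print(replace_tail('report123std', '_final'))  # report_final
print(prefix_tail('report123std', 'new_'))     # new_123std
print(replace_tail('report123', '_final'))     # report123

# The regex version gives the same result as replace_tail
print(re.sub(r'.{3}std$', '_final', 'report123std'))  # report_final
```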

Allowing escape sequences in my regular expression

I'm trying to create a regular expression which finds occurrences of $VAR or ${VAR}. If something like \$VAR or \${VAR} is given, it should not match. If it is given something like \\$VAR or \\${VAR}, or any even number of backslashes, it should match.
i.e.
$BLOB matches
\$BLOB doesn't match
\\$BLOB matches
\\\$BLOB doesn't match
\\\\$BLOB matches
... etc
I'm currently using the following regex:
line = re.sub("[^\\][\\\\]*\$(\w[^-]+)|"
"[^\\][\\\\]*\$\{(\w[^-]+)\}",replace,line)
However, this doesn't work properly. When I give it \$BLOB, it still matches for some reason. Why is this?
The second grouping of backslashes is written as a redundant character class [\\\\]*, which matches zero or more single backslashes, but it should be a repeating group ((?:\\\\)*) that matches zero or more pairs of backslashes:
re.sub(r'(?<!\\)((?:\\\\)*)\$(\w[^-]+|\{(\w[^-]+)\})',r'\1' + replace, line)
To write a regular expression that finds $ unless it is escaped with E, where E itself can be escaped as EE:
import re
values = dict(BLOB='some value')
def repl(m):
    return m.group('before') + values[m.group('name').strip('{}')]
regex = r"(?<!E)(?P<before>(?:EE)*)\$(?P<name>N|\{N\})"
regex = regex.replace('E', re.escape('\\'))
regex = regex.replace('N', r'\w+') # name
line = re.sub(regex, repl, line)
Using E instead of '\\\\' lets you see the structure of your embedded language without having to think about backslashes in Python string literals and regular expression patterns.
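The cases listed in the question can be checked with a short, self-contained sketch. It assumes variable names are plain \w+ words (simpler than the \w[^-]+ used in the question), and the names VAR, expand and values are illustrative only:

```python
import re

# An optional run of backslash pairs (group 1), not preceded by another
# backslash, then $NAME (group 2) or ${NAME} (group 3)
VAR = re.compile(r'(?<!\\)((?:\\\\)*)\$(?:(\w+)|\{(\w+)\})')

def expand(line, values):
    def repl(m):
        name = m.group(2) or m.group(3)
        if name not in values:
            return m.group(0)  # leave unknown variables untouched
        return m.group(1) + values[name]
    return VAR.sub(repl, line)

values = {'BLOB': 'X'}  # hypothetical variable table
for sample in ['$BLOB', r'\$BLOB', r'\\$BLOB', r'\\\$BLOB', '${BLOB}']:
    print(sample, '->', expand(sample, values))
```

With an odd number of backslashes the line comes back unchanged. With zero or an even number, the variable is replaced and the backslashes in front of it are kept as they were.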

How is it possible to encode """ (triple quotes) into a raw string?

How do I encode """ in a raw python string?
The following does not seem to work:
string = r"""\"\"\""""
since when trying to match """ with a regular expression, I have to double-escape the character ":
Returns an empty list:
string = r"""\"\"\""""
regEx = re.compile(r"""
(\"\"\")
""", re.S|re.X)
result = re.findall(regEx, string)
in this case result is an empty list.
This same regular expression returns ['"""'] when I load a string with """ from file content.
Returns double-escaped quotations:
string = r"""\"\"\""""
regEx = re.compile(r"""
(\\"\\"\\")
""", re.S|re.X)
result = re.findall(regEx, string)
now result is equal to ['\\"\\"\\"'].
I want it to be equal to ['"""'].
In general, there are three options:
1. Don't use the r prefix. It's just a convenience to avoid excessive double backslashes in regexes. It isn't required.
2. Use r'…', inside which the " character isn't special.
3. Mix and match r"…" and '…', e.g. pattern = '"""' + r"\s*\d\d-'\d\d'-\d\d\s*" + '"""'
In this instance, you can do both 1 and 2: single quotes and no r prefix.
The simplest way is to just do '"""'.
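A short sketch of why this works, and why the literal in the question does not. The sample text is made up for illustration:

```python
import re

text = 'before """quoted""" after'  # hypothetical sample text

# Without backslashes in the literal, the r prefix changes nothing
print(r'"""' == '"""')              # True
print(re.findall('"""', text))      # ['"""', '"""']

# The raw triple-quoted literal from the question keeps its backslashes,
# so it holds six characters: \"\"\"
question_string = r"""\"\"\""""
print(len(question_string))         # 6
```

Because the raw literal contains backslashes between the quotes, it never contains three consecutive quote characters, which is why the first regex in the question finds nothing.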
