How to escape all unescaped slashes in Python - python

I have a partially escaped path in Python like so:
path = "C:\\Temp\\\\TestEmpty" # Actual value = C:\Temp\\TestEmpty
I would like to have all slashes double like so:
escapedpath = "C:\\\\Temp\\\\TestEmpty" # Actual value = C:\\Temp\\TestEmpty
I started with some regex
escapedpath = re.sub("[a-zA-Z0-9 _:-](\\)[a-zA-Z0-9 _:-]", "\\\\", path)
...but of course this removes the character before and after the \\s
How could this be done?

result = re.sub(r"""(?x)
(?<!\\) # Make sure that there is no backslash before the current position
\\ # Match a backslash
(?= # only if...
(?:\\\\)* # an even number of backslashes follows (including zero)
(?!\\) # and no further backslashes follow after that
) # (End of lookahead assertion)""",
r"\\\\", subject)
only replaces backslashes if the number of consecutive backslashes at this point is odd.

You don't need a regex for this, if you just do a simple string replace to double all the backslashes followed by another string replace to undouble the ones that were already doubled then you get where you wanted to go:
>>> path = "C:\\Temp\\\\TestEmpty"
>>> path.replace('\\','\\\\').replace(r'\\\\', r'\\')
'C:\\\\Temp\\\\TestEmpty'
Alternatively undouble the doubled ones first then double all backslashes:
>>> path.replace('\\\\', '\\').replace('\\', '\\\\')
'C:\\\\Temp\\\\TestEmpty'

I use the following combination to sidestep a bit:
escaped_path = re.sub( r'(\\)+', '/', path).replace('/', '\\')
(It has the added benefit of basic sanitation of paths someone (me) may have entered incorrectly by being greedy in replacing backslashes.)

Related

Python -- Regex match pattern OR end of string

import re
re.findall("(\+?1?[ -.]?\(?\d{3}\)?[ -.]?\d{3}[ -.]?\d{4})(?:[ <$])", "+1.222.222.2222<")
The above code works fine if my string ends with a "<" or space. But if it's the end of the string, it doesn't work. How do I get +1.222.222.2222 to return in this condition:
import re
re.findall("(\+?1?[ -.]?\(?\d{3}\)?[ -.]?\d{3}[ -.]?\d{4})(?:[ <$])", "+1.222.222.2222")
*I removed the "<" and just terminated the string. It returns none in this case. But I'd like it to return the full string -- +1.222.222.2222
POSSIBLE ANSWER:
import re
re.findall("(\+?1?[ -.]?\(?\d{3}\)?[ -.]?\d{3}[ -.]?\d{4})(?:[ <]|$)", "+1.222.222.2222")
I think you've solved the end-of-string issue, but there are a couple of other potential issues with the pattern in your question:
the - in [ -.] either needs to be escaped as \- or placed in the first or last position within square brackets, e.g. [-. ] or [ .-]; if you search for [] in the docs here you'll find the relevant info:
Ranges of characters can be indicated by giving two characters and separating them
by a '-', for example [a-z] will match any lowercase ASCII letter, [0-5][0-9] will match
all the two-digits numbers from 00 to 59, and [0-9A-Fa-f] will match any hexadecimal
digit. If - is escaped (e.g. [a\-z]) or if it’s placed as the first or last character
(e.g. [-a] or [a-]), it will match a literal '-'.
you may want to require that either matching parentheses or none are present around the first 3 of 10 digits using (?:\(\d{3}\) ?|\d{3}[-. ]?)
Here's a possible tweak incorporating the above
import re
pat = "^((?:\+1[-. ]?|1[-. ]?)?(?:\(\d{3}\) ?|\d{3}[-. ]?)\d{3}[-. ]?\d{4})(?:[ <]|$)"
print( re.findall(pat, "+1.222.222.2222") )
print( re.findall(pat, "+1(222)222.2222") )
print( re.findall(pat, "+1(222.222.2222") )
Output:
['+1.222.222.2222']
['+1(222)222.2222']
[]
Maybe try:
import re
re.findall("(\+?1?[ -.]?\(?\d{3}\)?[ -.]?\d{3}[ -.]?\d{4})(?:| |<|$)", "+1.222.222.2222")
null matches any position, +1.222.222.2222
matches space character, +1.222.222.2222
< matches less-than sign character, +1.222.222.2222<
$ end of line, +1.222.222.2222
You can also use regex101 for easier debugging.

How can I find all paths in javascript file with regex in Python?

Sample Javascript (content):
t.appendChild(u),t}},{10:10}],16:[function(e,t,r){e(10);t.exports=function(e){var t=document.createDocumentFragment(),r=document.createElement("img");r.setAttribute("alt",e.empty),r.id="trk_recaptcha",r.setAttribute("src","/cdn-cgi/images/trace/captcha/js/re/transparent.gif?ray="+e.ray),t.appendChild(r);var n=document.createTextNode(" ");t.appendChild(n);var a=document.createElement("input");a.id="id",a.setAttribute("name","id"),a.setAttribute("type","hidden"),a.setAttribute("value",e.ray),t.appendChild(a);var i=document.createTextNode(" ");t.appendChild(i);
t.appendChild(u),t}},{10:10}],16:[function(e,t,r){e(10);t.exports=function(e){var t=document.createDocumentFragment(),r=document.createElement("img");r.setAttribute("alt",e.empty),r.id="trk_recaptcha",r.setAttribute("sdfdsfsfds",'/test/path'),t.appendChild(r);var n=document.createTextNode(" ");t.appendChild(n);var a=document.createElement("input");a.id="id",a.setAttribute("name","id"),a.setAttribute("type","hidden"),a.setAttribute("value",e.ray),t.appendChild(a);var i=document.createTextNode(" ");t.appendChild(i);
regex = ""
endpoints = re.findall(regex, content)
Output I want:
> /cdn-cgi/images/trace/captcha/js/re/transparent.gif?ray=
> /test/path
I want to find all fields starting with "/ and '/ with regex. I've tried many url regexes but it didn't work for me.
This should do it:
regex = r"""["']\/[^"']*"""
Note that you will need to trim the first character from the match. This also assumes that there are no quotation marks in the path.
Consider:
import re
txt = ... #your code
pat = r"(\"|\')(\/.*?)\1"
for el in re.findall(pat, txt):
print(el[1])
each el will be match of pattern starting with single, or double quote. Then minimal number of characters, then the same character as at the beginning (same type of quote).
.* stands for whatever number of any characters, following ? makes it non-greedy i.e. provides minimal characters match. Then \1 refers to first group, so it will match whatever type of quote was matched at the beginning. Then by specifying el[1] we return second group matched i.e. whatever was matched within quotes.

Python: re.sub return illegal characters when the source containing Chinese character [duplicate]

I want to take the string 0.71331, 52.25378 and return 0.71331,52.25378 - i.e. just look for a digit, a comma, a space and a digit, and strip out the space.
This is my current code:
coords = '0.71331, 52.25378'
coord_re = re.sub("(\d), (\d)", "\1,\2", coords)
print coord_re
But this gives me 0.7133,2.25378. What am I doing wrong?
You should be using raw strings for regex, try the following:
coord_re = re.sub(r"(\d), (\d)", r"\1,\2", coords)
With your current code, the backslashes in your replacement string are escaping the digits, so you are replacing all matches the equivalent of chr(1) + "," + chr(2):
>>> '\1,\2'
'\x01,\x02'
>>> print '\1,\2'
,
>>> print r'\1,\2' # this is what you actually want
\1,\2
Any time you want to leave the backslash in the string, use the r prefix, or escape each backslash (\\1,\\2).
Python interprets the \1 as a character with ASCII value 1, and passes that to sub.
Use raw strings, in which Python doesn't interpret the \.
coord_re = re.sub(r"(\d), (\d)", r"\1,\2", coords)
This is covered right in the beginning of the re documentation, should you need more info.

Regular Expression issue in Python? [duplicate]

I want to take the string 0.71331, 52.25378 and return 0.71331,52.25378 - i.e. just look for a digit, a comma, a space and a digit, and strip out the space.
This is my current code:
coords = '0.71331, 52.25378'
coord_re = re.sub("(\d), (\d)", "\1,\2", coords)
print coord_re
But this gives me 0.7133,2.25378. What am I doing wrong?
You should be using raw strings for regex, try the following:
coord_re = re.sub(r"(\d), (\d)", r"\1,\2", coords)
With your current code, the backslashes in your replacement string are escaping the digits, so you are replacing all matches the equivalent of chr(1) + "," + chr(2):
>>> '\1,\2'
'\x01,\x02'
>>> print '\1,\2'
,
>>> print r'\1,\2' # this is what you actually want
\1,\2
Any time you want to leave the backslash in the string, use the r prefix, or escape each backslash (\\1,\\2).
Python interprets the \1 as a character with ASCII value 1, and passes that to sub.
Use raw strings, in which Python doesn't interpret the \.
coord_re = re.sub(r"(\d), (\d)", r"\1,\2", coords)
This is covered right in the beginning of the re documentation, should you need more info.

Python Regex Split Keeps Split Pattern Characters

Easiest way to explain this is an example:
I have this string: 'Docs/src/Scripts/temp'
Which I know how to split two different ways:
re.split('/', 'Docs/src/Scripts/temp') -> ['Docs', 'src', 'Scripts', 'temp']
re.split('(/)', 'Docs/src/Scripts/temp') -> ['Docs', '/', 'src', '/', 'Scripts', '/', 'temp']
Is there a way to split by the forward slash, but keep the slash part of the words?
For example, I want the above string to look like this:
['Docs/', '/src/', '/Scripts/', '/temp']
Any help would be appreciated!
Interesting question, I would suggest doing something like this:
>>> 'Docs/src/Scripts/temp'.replace('/', '/\x00/').split('\x00')
['Docs/', '/src/', '/Scripts/', '/temp']
The idea here is to first replace all / characters by two / characters separated by a special character that would not be a part of the original string. I used a null byte ('\x00'), but you could change this to something else, then finally split on that special character.
Regex isn't actually great here because you cannot split on zero-length matches, and re.findall() does not find overlapping matches, so you would potentially need to do several passes over the string.
Also, re.split('/', s) will do the same thing as s.split('/'), but the second is more efficient.
A solution without split() but with lookaheads:
>>> s = 'Docs/src/Scripts/temp'
>>> r = re.compile(r"(?=((?:^|/)[^/]*/?))")
>>> r.findall(s)
['Docs/', '/src/', '/Scripts/', '/temp']
Explanation:
(?= # Assert that it's possible to match...
( # and capture...
(?:^|/) # the start of the string or a slash
[^/]* # any number of non-slash characters
/? # and (optionally) an ending slash.
) # End of capturing group
) # End of lookahead
Since a lookahead assertion is tried at every position in the string and doesn't consume any characters, it doesn't have a problem with overlapping matches.
1) You do not need regular expressions to split on a single fixed character:
>>> 'Docs/src/Scripts/temp'.split('/')
['Docs', 'src', 'Scripts', 'temp']
2) Consider using this method:
import os.path
def components(path):
start = 0
for end, c in enumerate(path):
if c == os.path.sep:
yield path[start:end+1]
start = end
yield path[start:]
It doesn't rely on clever tricks like split-join-splitting, which makes it much more readable, in my opinion.
If you don't insist on having slashes on both sides, it's actually quite simple:
>>> re.findall(r"([^/]*/)", 'Docs/src/Scripts/temp')
['Docs/', 'src/', 'Scripts/']
Neither re nor split are really cut out for overlapping strings, so if that's what you really want, I'd just add a slash to the start of every result except the first.
Try about this:
re.split(r'(/)', 'Docs/src/Scripts/temp')
From python's documentation
re.split(pattern, string, maxsplit=0, flags=0)
Split string by the
occurrences of pattern. If capturing parentheses are used in pattern,
then the text of all groups in the pattern are also returned as part
of the resulting list. If maxsplit is nonzero, at most maxsplit splits
occur, and the remainder of the string is returned as the final
element of the list. (Incompatibility note: in the original Python 1.5
release, maxsplit was ignored. This has been fixed in later releases.)
I'm not sure there is an easy way to do this. This is the best I could come up with...
import re
lSplit = re.split('/', 'Docs/src/Scripts/temp')
print [lSplit[0]+'/'] + ['/'+x+'/' for x in lSplit][1:-1] + ['/'+lSplit[len(lSplit)-1]]
Kind of a mess, but it does do what you wanted.

Categories

Resources