Issues handling strings with .encode('string-escape') method - python

I am working with variables containing directory paths in Python on a Windows machine, and as such need to convert string literals to raw strings (removing escape sequences). All is fine when I use the os.getcwd() function and convert using the method .encode('string-escape'), but as soon as I try doing the same with a hard-coded string it won't work. This is especially confusing as both objects are of the same type (string), and as such should behave in exactly the same way.
My code is:
import os
dir1 = os.getcwd()
type1 = type(dir1)
print type1
print dir1.encode('string-escape')
print "\n\n"
dir2 = "C:\Users\StaM\Desktop\brba\test1"
type2 = type(dir2)
print type2
print dir2.encode('string-escape')
And my output is:
<type 'str'>
C:\\Users\\StaM\\Desktop\\brba\\test1
<type 'str'>
C:\\Users\\StaM\\Desktop\x08rba\test1
As you can see, both objects are the same type, yet the behaviour is different in handling escape sequences. Any ideas on why this is happening and how to get this to work properly? All explanations / suggestions / solutions would be highly appreciated; I really want to understand what is going on here. Thanks.
Please note: This question is about the .encode() method and not the 'r' flag... Using the 'r' flag for raw strings is not an option here, as I am passing the variables containing directory paths into my program to construct a larger string to represent DOS commands.

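The reason for this behavior is that os.getcwd() returns a string in which every "\" is a literal backslash character, including those that sit in front of letters such as b or t; its repr shows each one doubled. In the hard-coded literal, Python has already converted \b and \t into backspace and tab characters by the time .encode() runs, so .encode('string-escape') doubles only the backslashes that were not part of a recognized escape sequence.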
>>> import os
>>> dir = os.getcwd()
>>> print "%r" %dir
'C:\\Users\\StaM\\Desktop\\brba\\test1'
The solution here is to use a dictionary that defines all possible escape characters, then use a loop to locate these characters in the string in question and replace each one with a "\" followed by its escape letter, restoring the backslash that was consumed. This should be done prior to using the .encode() method.
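The dictionary approach described above could look roughly like the following. This is a minimal sketch in Python 3, not the answerer's actual code: the names ESCAPES and restore_backslashes are hypothetical, and it only covers the single-letter escapes. It cannot recover octal or hex escapes such as \0 or \x41, because several different spellings collapse to the same character.

```python
# Hypothetical helper: map the control characters that escape processing
# produced back to the backslash-letter pairs that were typed in the literal.
ESCAPES = {
    "\a": "\\a",
    "\b": "\\b",
    "\f": "\\f",
    "\n": "\\n",
    "\r": "\\r",
    "\t": "\\t",
    "\v": "\\v",
}

def restore_backslashes(text):
    """Replace each known control character with its two-character escape form."""
    return "".join(ESCAPES.get(ch, ch) for ch in text)

# The \b and \t below are converted to backspace and tab when the literal is parsed.
broken = "C:\\Users\\StaM\\Desktop\brba\test1"
fixed = restore_backslashes(broken)
print(fixed)  # C:\Users\StaM\Desktop\brba\test1
```

Note that this also rewrites control characters that were genuinely meant to be there, so it is only reasonable for values known to be paths.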
BOOM!

Related

Why do some functions in Python change \ to \\

When I pass a file to shutil.copy as
shutil.copy(r'i:\myfile.txt', r'UNC to where I want it to go')
I get an error
No such file or directory 'i:\\myfile.txt'
I've experienced this problem before with the os module when I have a UNC path. Usually I just get frustrated enough that I forget using the os module and just put the file path into with open() or whatever I'm using it for.
It is my understanding that placing an r before '' is supposed to cause python to ignore escape characters and treat them as string literals, but the behavior I'm seeing leads me to believe that this is not the case. For some reason it takes the \ and changes it to \\.
I've seen this when using os.path.join, where the \\ at the beginning of the UNC path gets turned into \\\\.
What is the best way to pass a string literal to ensure that all escape characters are ignored and the string is preserved?
Your string is not being modified by Python. It's the representation of your string that's coming out differently.
When the error is printed, Python calls repr() to print the value. This function will
Return a string containing a printable representation of an object. For many types, this function makes an attempt to return a string that would yield an object with the same value when passed to eval(), otherwise the representation is a string enclosed in angle brackets that contains the name of the type of the object together with additional information often including the name and address of the object. A class can control what this function returns for its instances by defining a __repr__() method.
This can be very nice when debugging: if I paste that string (quotes, escapes, and all) into the REPL I'll get the string in memory that you were working with. I can use this to interactively try your copy command, maybe tweaking the string a bit.
If you want to see your string in a printed form, you could do
import shutil
source_path = r'i:\myfile.txt'
target_path = r'UNC to where I want it to go'
print(f'Copying {source_path} to {target_path}...')
shutil.copy(source_path, target_path)
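The difference between a string and its representation can be checked directly. The snippet below is a small sketch (the variable name path is just for illustration) showing that the raw literal holds a single backslash, and that only repr() displays it doubled:

```python
path = r'i:\myfile.txt'

print(path)        # i:\myfile.txt
print(repr(path))  # 'i:\\myfile.txt'

# The string really contains one backslash; repr() only escapes it for display.
print(len(path))         # 13
print(path.count('\\'))  # 1
```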

Is it possible to suppress Python's escape sequence processing on a given string without using the raw specifier?

Conclusion: It's impossible to override or disable Python's built-in escape sequence processing so that you can skip using the raw prefix specifier. I dug into Python's internals to figure this out. So if anyone tries designing objects that work on complex strings (like regex) as part of some kind of framework, make sure to specify in the docstrings that string arguments to the object's __init__() MUST include the r prefix!
Original question: I am finding it a bit difficult to force Python to not "change" anything about a user-inputted string, which may contain among other things, regex or escaped hexadecimal sequences. I've already tried various combinations of raw strings, .encode('string-escape') (and its decode counterpart), but I can't find the right approach.
Given an escaped, hexadecimal representation of the Documentation IPv6 address 2001:0db8:85a3:0000:0000:8a2e:0370:7334, using .encode(), this small script (called x.py):
#!/usr/bin/env python
class foo(object):
    __slots__ = ("_bar",)
    def __init__(self, input):
        if input is not None:
            self._bar = input.encode('string-escape')
        else:
            self._bar = "qux?"
    def _get_bar(self): return self._bar
    bar = property(_get_bar)
#
x = foo("\x20\x01\x0d\xb8\x85\xa3\x00\x00\x00\x00\x8a\x2e\x03\x70\x73\x34")
print x.bar
Will yield the following output when executed:
$ ./x.py
\x01\r\xb8\x85\xa3\x00\x00\x00\x00\x8a.\x03ps4
Note the \x20 got converted to an ASCII space character, along with a few others. This is basically correct due to Python processing the escaped hex sequences and converting them to their printable ASCII values.
This can be solved if the initializer to foo() was treated as a raw string (and the .encode() call removed), like this:
x = foo(r"\x20\x01\x0d\xb8\x85\xa3\x00\x00\x00\x00\x8a\x2e\x03\x70\x73\x34")
However, my end goal is to create a kind of framework that can be used and I want to hide these kinds of "implementation details" from the end user. If they called foo() with the above IPv6 address in escaped hexadecimal form (without the raw specifier) and immediately print it back out, they should get back exactly what they put in w/o knowing or using the raw specifier. So I need to find a way to have foo's __init__() do whatever processing is necessary to enable that.
Edit: Per this SO question, it seems it's a defect of Python, in that it always performs some kind of escape sequence processing. There does not appear to be any kind of facility to completely turn off escape sequence processing, even temporarily. Sucks. I guess I am going to have to research subclassing str to create something like rawstr that intelligently determines what escape sequences Python processed in a string, and convert them back to their original format. This is not going to be fun...
Edit2: Another example, given the sample regex below:
"^.{0}\xcb\x00\x71[\x00-\xff]"
If I assign this to a var or pass it to a function without using the raw specifier, the \x71 gets converted to the letter q. Even if I add .encode('string-escape') or .replace('\\', '\\\\'), the escape sequences are still processed, resulting in this output:
"^.{0}\xcb\x00q[\x00-\xff]"
How can I stop this, again, without using the raw specifier? Is there some way to "turn off" the escape sequence processing or "revert" it after the fact so that the q turns back into \x71? Is there a way to process the string and escape the backslashes before the escape sequence processing happens?
I think you have an understandable confusion about the difference between Python string literals (source code representation), Python string objects in memory, and how those objects can be printed (in what format they can be represented in the output).
If you read some bytes from a file into a bytestring you can write them back as is.
r"" exists only in source code; there is no such thing at runtime, i.e., r"\x" and "\\x" are equal, and they may even be the exact same string object in memory.
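A quick way to convince yourself of this is to compare the objects that the different spellings produce. The following is a small Python 3 sketch; the variable names are only for illustration:

```python
raw = r"\x41"       # raw literal: backslash, x, 4, 1
escaped = "\\x41"   # ordinary literal with an escaped backslash
processed = "\x41"  # ordinary literal: the escape is turned into "A"

print(raw == escaped)   # True: same four characters at runtime
print(len(raw))         # 4
print(processed)        # A
print(len(processed))   # 1
```

Once the literal has been parsed, nothing on the resulting string object records which spelling was used in the source.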
To see that input is not corrupted, you could print each byte as an integer:
print " ".join(str(ord(c)) for c in raw_input("input something"))
Or just echo as is (there could be a difference but it is unrelated to your "string-escape" issue):
print raw_input("input something")
Identity function:
def identity(obj):
    return obj
If you do nothing to the string then your users will receive the exact same object back. You can provide examples in the docs of what you consider a concise, readable way to represent an input string as a Python literal. If you find it confusing to work with binary strings such as "\x20\x01" then you could accept an ASCII hex representation instead: "2001" (you could use binascii.hexlify/unhexlify to convert one to the other).
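The hex-representation idea mentioned above can be sketched with binascii as follows. This is a Python 3 illustration, so the binary data is a bytes literal; the variable names are assumptions for the example:

```python
import binascii

# The 16 raw bytes of the IPv6 address from the question.
raw = b"\x20\x01\x0d\xb8\x85\xa3\x00\x00\x00\x00\x8a\x2e\x03\x70\x73\x34"

# Bytes -> plain ASCII hex text that contains no backslashes at all.
hex_text = binascii.hexlify(raw).decode("ascii")
print(hex_text)  # 20010db885a3000000008a2e03707334

# Hex text -> the original bytes.
round_trip = binascii.unhexlify(hex_text)
print(round_trip == raw)  # True
```

Accepting the hex text from users sidesteps the literal-escaping question entirely, since such a string has nothing for Python's parser to interpret.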
The regex case is more complex because there are two languages:
Escape sequences are interpreted by Python according to its string literal syntax
Regex engine interprets the string object as a regex pattern that also has its own escape sequences
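The two layers can be seen side by side in a short Python 3 sketch. It uses the \x71 escape from the question; the variable names are made up for illustration:

```python
import re

# Layer 1: Python's literal syntax turns \x71 into "q" before re sees anything.
literal_processed = "\x71+"
print(literal_processed)  # q+

# Layer 2: with a raw literal, the backslash survives and the regex engine
# interprets \x71 itself, again as the character "q".
regex_processed = r"\x71+"
print(regex_processed)    # \x71+

m1 = re.match(literal_processed, "qqq").group()
m2 = re.match(regex_processed, "qqq").group()
print(m1, m2)  # qqq qqq
```

Both patterns match the same text, which is why the conversion of \x71 to q is usually harmless for matching even though the pattern string prints differently.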
I think you will have to go the join route.
Here's an example:
>>> m = {chr(c): '\\x{0}'.format(hex(c)[2:].zfill(2)) for c in xrange(0,256)}
>>>
>>> x = "\x20\x01\x0d\xb8\x85\xa3\x00\x00\x00\x00\x8a\x2e\x03\x70\x73\x34"
>>> print ''.join(map(m.get, x))
\x20\x01\x0d\xb8\x85\xa3\x00\x00\x00\x00\x8a\x2e\x03\x70\x73\x34
I'm not entirely sure why you need that though. If your code needs to interact with other pieces of code, I'd suggest that you agree on a defined format, and stick to it.
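For readers on Python 3, where binary data lives in bytes objects and iterating over them yields integers, the same join idea could be sketched without the lookup dictionary. This is an adaptation, not the answerer's code:

```python
data = b"\x20\x01\x0d\xb8\x85\xa3\x00\x00\x00\x00\x8a\x2e\x03\x70\x73\x34"

# Format every byte as \xNN, regardless of whether it is printable.
escaped = "".join("\\x{:02x}".format(byte) for byte in data)
print(escaped)
# \x20\x01\x0d\xb8\x85\xa3\x00\x00\x00\x00\x8a\x2e\x03\x70\x73\x34
```

This always produces the \xNN spelling, so it reproduces the input text only if the user wrote every byte in that form.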

How can I read backslashes from a file correctly?

The following code:
key = open("C:\Scripts\private.ppk",'rb').read()
reads the file and assigns its data to the var key.
For some reason, backslashes are multiplied in the process. How can I make sure they don't get multiplied?
You ... don't. Nothing is multiplied in the data itself. The doubled backslashes appear only when the string is displayed through repr(), for example when you inspect it at the interactive prompt; printing it, writing it out, or comparing it uses the single backslashes that are really there. If you're declaring strings and don't want to double up the backslashes you can use raw strings r'c:\myfile.txt', but that doesn't really apply to the contents of a file you're reading in.
>>> s = r'c:\boot.ini'
>>> s
'c:\\boot.ini'
>>> repr(s)
"'c:\\\\boot.ini'"
>>> print s
c:\boot.ini
>>>
As you can see, the extra slashes show up only in the repr() display; they are not stored in the string, so when you use the value in a print statement (write a file, test for values, etc.) you get the single backslashes.
You should read this great blog post on python and the backslash escape character.
And under some circumstances, if Python prints information to the console, you will see the two backslashes rather than one. For example, this is part of the difference between the repr() function and the str() function.
myFilename = "c:\\newproject\\typenames.txt"
print repr(myFilename), str(myFilename)
produces
'c:\\newproject\\typenames.txt' c:\newproject\typenames.txt
Backslashes are represented as escaped. You'll see two backslashes for each real one existing on the file, but that is normal behaviour.
The reason is that the backslash is used in order to create codes that represent characters that cannot be easily represented, such as new line '\n' or tab '\t'.
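To check that reading a file does not add any backslashes, one can write a known byte string to a temporary file and read it back. This is a Python 3 sketch; the file name and contents are placeholders, not the asker's actual key file:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "private.ppk")  # hypothetical file name
    with open(path, "wb") as fh:
        fh.write(b"key\\value")  # 9 bytes on disk, one of them a backslash
    with open(path, "rb") as fh:
        key = fh.read()

print(len(key))             # 9
print(key.count(b"\\"))     # 1
print(repr(key))            # b'key\\value'  (display only)
print(key.decode("ascii"))  # key\value
```

The length and count confirm that the data holds one backslash; the doubled form exists only in the repr() output.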
Are you trying to put single backslashes in a string? Strings with backslashes require an escape character, in this case "\\". It will print to the screen with a single slash.
In fact there is a solution: use eval, as long as the file content can be wrapped in quotes of some kind. The following worked for me (PATH contains some script that executes Matlab):
MATLAB_EXE = "C:\Program Files (x86)\MATLAB\R2012b\bin\matlab.exe"
content = open(PATH).read()
MATLAB_EXE in content # False
content = eval(f'r"""{content}"""')
MATLAB_EXE in content # True
This works by evaluating the content as a Python string literal, making double escapes transform into single ones. A raw string is used to prevent escapes from forming special characters.

Dealing with BACKSLASH character in non-string literals in Python

I have the following string read from an XML element, and it is assigned to a variable called filename. I don't know how to make this any clearer than saying filename = the following string, without leading someone to think that I have a string literal.
\\server\data\uploads\0224.1307.Varallo.mov
when I try and pass this to
os.path.basename(filename)
I get the following
\\server\\data\\uploads\x124.1307.Varallo.mov
I tried filename.replace('\\','\\\\') but that doesn't work either. os.path.basename(filename) then returns the following.
\\\\server\\data\\uploads\\0224.1307.Varallo.mov
Notice that the \0 is now not being converted to \x but now it doesn't process the string at all.
What can I do to my filename variable to get this string into a proper state so that os.path.basename() will actually give me back the basename? I am on OS X, so the UNC path handling is not available.
All attempts to replace the \ with \\ manually fail because of the \0 getting converted to \x in the beginning of the basename.
NOTE: this is NOT a string literal so r'' doesn't work.
We need more information. What exactly is in the variable filename? To answer, use print repr(filename) and add the results to your question above.
Wild guess
DISCLAIMER: This is a guess - try:
import ntpath
print ntpath.basename(filename)
All the downvoting in the world won't change the fact that you're doing it wrong. os.path is for native paths. \\foo\bar\baz is not an OS X path, it's a Windows UNC. posixpath is not equipped to handle UNCs; ntpath is.
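The difference between the two path modules can be shown on any platform, since both are importable everywhere. In this Python 3 sketch a raw literal stands in for the value read from the XML element:

```python
import ntpath
import posixpath

# Stand-in for the runtime value; the raw literal keeps every backslash as-is.
filename = r"\\server\data\uploads\0224.1307.Varallo.mov"

# ntpath understands backslash separators and UNC prefixes.
print(ntpath.basename(filename))     # 0224.1307.Varallo.mov

# posixpath only splits on "/", so it returns the whole string unchanged.
print(posixpath.basename(filename))  # \\server\data\uploads\0224.1307.Varallo.mov
```

On OS X, os.path is posixpath, which is why os.path.basename() appeared to do nothing useful with the UNC path.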

How to append '\\?\' to the front of a file path in Python

I'm trying to work with some long file paths (Windows) in Python and have come across some problems. After reading the question here, it looks as though I need to append '\\?\' to the front of my long file paths in order to use them with os.stat(filepath). The problem I'm having is that I can't create a string in Python that ends in a backslash. The question here points out that you can't even end strings in Python with a single '\' character.
Is there anything in any of the Python standard libraries or anywhere else that lets you simply append '\\?\' to the front of a file path you already have? Or is there any other work around for working with long file paths in Windows with Python? It seems like such a simple thing to do, but I can't figure it out for the life of me.
"\\\\?\\" should give you exactly the string you want.
Longer answer: of course you can end a string in Python with a backslash. You just can't do so when it's a "raw" string (one prefixed with an 'r'), which you usually use for strings that contain (lots of) backslashes (to avoid the infamous "leaning toothpick" syndrome ;-))
Even with a raw string, you can end in a backslash with:
>>> print r'\\?\D:\Blah' + '\\'
\\?\D:\Blah\
or even:
>>> print r'\\?\D:\Blah' '\\'
\\?\D:\Blah\
since Python concatenates two adjacent string literals into one.
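Putting the pieces together, prefixing an existing path might look like this in Python 3. The path D:\Blah\file.txt is only a placeholder:

```python
# Four characters: backslash, backslash, question mark, backslash.
prefix = "\\\\?\\"
print(prefix)       # \\?\
print(len(prefix))  # 4

# Same value built from a raw literal plus an ordinary one for the last backslash.
same_prefix = r"\\?" "\\"
print(prefix == same_prefix)  # True

filepath = r"D:\Blah\file.txt"  # placeholder path
long_path = prefix + filepath
print(long_path)    # \\?\D:\Blah\file.txt
```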
