Why is escaping of single quotes inconsistent on file read in Python?


Given two nearly identical text files (plain text, created in MacVim), I get different results when reading them into a variable in Python. I want to know why this is and how I can produce consistent behavior.
For example, f1.txt looks like this:
This isn't a great example, but it works.
And f2.txt looks like this:
This isn't a great example, but it wasn't meant to be.
"But doesn't it demonstrate the problem?," she said.
When I read these files in, using something like the following:
f = open("f1.txt","r")
x = f.read()
I get the following when I look at the variables in the console. f1.txt:
>>> x
"This isn't a great example, but it works.\n\n"
And f2.txt:
>>> y
'This isn\'t a great example, but it wasn\'t meant to be. \n"But doesn\'t it demonstrate the problem?," she said.\n\n'
In other words, f1 comes in with only escaped newlines, while f2 also has its single quotes escaped.
repr() shows what's going on. First, for f1:
>>> repr(x)
'"This isn\'t a great example, but it works.\\n\\n"'
And f2:
>>> repr(y)
'\'This isn\\\'t a great example, but it wasn\\\'t meant to be. \\n"But doesn\\\'t it demonstrate the problem?," she said.\\n\\n\''
This kind of behavior is driving me crazy. What's going on and how do I make it consistent? If it matters, I'm trying to read in plain text, manipulate it, and eventually write it out so that it shows the properly escaped characters (for pasting into Javascript code).

Python is giving you a string literal which, if you gave it back to Python, would result in the same string. This is known as the repr() (short for "representation") of the string. This may not (probably won't, in fact) match the string as it was originally specified, since there are so many ways to do that, and Python does not record anything about how it was originally specified.
It uses double quotes around your first example, which works fine because the string contains single quotes but no double quotes. The second string contains double quotes as well, so switching to double-quote delimiters would not avoid escaping. Instead it falls back to single quotes and uses backslashes to escape the single quotes in the string (the double quotes need no escaping this way). The goal is a short, readable literal, not a match for how the text was originally written.
There is no reason for this behavior to drive you crazy and no need to try to make it consistent. You only get the repr() of a string when you are peeking at values in Python's interactive mode. When you actually print or otherwise use the string, you get the string itself, not a reconstituted string literal.
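To make the difference between the echoed literal and the string itself concrete, here is a minimal sketch for Python 3. The two strings are hypothetical stand-ins for the contents of f1.txt and f2.txt, not the real files.

```python
import ast

# Hypothetical stand-ins for what f.read() returned for f1.txt and f2.txt.
x = "This isn't a great example, but it works.\n"
y = 'This isn\'t a great example.\n"But doesn\'t it demonstrate the problem?," she said.\n'

# The interactive console echoes repr(): a literal that rebuilds the string.
x_literal = repr(x)
y_literal = repr(y)

# Evaluating either literal gives back exactly the original string.
assert ast.literal_eval(x_literal) == x
assert ast.literal_eval(y_literal) == y

# The strings themselves contain no backslashes; those exist only in the literals.
assert "\\" not in x and "\\" not in y

print(x_literal)   # the delimiter choice differs between the two...
print(y_literal)
print(y, end="")   # ...but printing shows the text with plain quotes
```

Writing x or y to a file or sending it anywhere else behaves like the last print: the quotes go out unescaped.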
If you want to get a JavaScript string literal, the easiest way is to use the json module:
import json
print json.dumps('I said, "Hello, world!"')
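On Python 3 the same idea looks like the sketch below. The text variable is a made-up stand-in for what you read from the file, and the "var message" wrapper is only an illustration of where the literal might be pasted.

```python
import json

# Hypothetical text, standing in for the result of f.read().
text = 'This isn\'t a great example.\n"But doesn\'t it demonstrate the problem?," she said.\n'

# json.dumps always uses double-quote delimiters, escapes embedded double
# quotes and newlines, and leaves single quotes alone.
js_literal = json.dumps(text)
print("var message = " + js_literal + ";")

# Decoding the literal gives back the original text.
assert json.loads(js_literal) == text
```

The output is the same no matter which quotes repr() would have picked for the string.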

Both f1 and f2 contain perfectly normal, unescaped single quotes.
The fact that their repr looks different is meaningless.
There are a variety of different ways to represent the same string. For example, these are all equivalent literals:
"abc'def'ghi"
'abc\'def\'ghi'
'''abc'def'ghi'''
r"abc'def'ghi"
The repr function on a string always just generates some literal that is a valid representation of that string, but you shouldn't depend on exactly which one it generates. (In fact, you should rarely use it for anything but debugging purposes in the first place.)
Since the language doesn't define anywhere what algorithm it uses to generate a repr, it could be different for each version of each implementation.
Most of them will try to be clever, using single or double quotes to avoid as many escaped internal quotes as possible, but even that isn't guaranteed. If you really want to know the algorithm for a particular implementation and version, you pretty much have to look at the source. For example, in CPython 3.3, inside unicode_repr, it counts the number of quotes of each type; then if there are single quotes but no double quotes, it uses " instead of '.
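A quick way to see the quote selection at work is to feed repr a few strings that differ only in which quotes they contain. This is a sketch of what current CPython 3 does; as noted above, other implementations or versions are free to choose differently.

```python
import ast

# Four spellings of the same string; the parser produces equal objects.
variants = ["abc'def'ghi", 'abc\'def\'ghi', '''abc'def'ghi''', r"abc'def'ghi"]
assert len(set(variants)) == 1

only_single = "isn't"        # single quotes only
only_double = 'say "hi"'     # double quotes only
both = 'isn\'t "hi"'         # both kinds

for text in (only_single, only_double, both):
    literal = repr(text)
    # Whatever delimiters were picked, the literal rebuilds the same string.
    assert ast.literal_eval(literal) == text
    print(literal)
```

On CPython only the first string comes back in double quotes; the other two use single quotes, with backslashes added where needed.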
If you want "the" representation of a string, you're out of luck, because there is no such thing. But if you want some particular representation of a string, that's no problem. You just have to know what format you want; for most formats, someone's already written the code, and often it's in the standard library. You can make C literal strings, JSON-encoded strings, strings that can fit into ASCII RFC822 headers… But all of those formats have different rules from each other (and from Python literals), so you have to use the right function for the job.

Related

Writing and reading headers with struct

I have a file header which I am reading and planning on writing which contains information about the contents; version information, and other string values.
Writing to the file is not too difficult, it seems pretty straightforward:
outfile.write(struct.pack('<s', "myapp-0.0.1"))
However, when I try reading back the header from the file in another method:
header_version = struct.unpack('<s', infile.read(struct.calcsize('s')))
I have the following error thrown:
struct.error: unpack requires a string argument of length 2
How do I fix this error and what exactly is failing?

Writing to the file is not too difficult, it seems pretty straightforward:
Not quite as straightforward as you think. Try looking at what's in the file, or just printing out what you're writing:
>>> struct.pack('<s', 'myapp-0.0.1')
'm'
As the docs explain:
For the 's' format character, the count is interpreted as the size of the string, not a repeat count like for the other format characters; for example, '10s' means a single 10-byte string, while '10c' means 10 characters. If a count is not given, it defaults to 1.
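The effect of the count is easy to check. This sketch uses Python 3, where struct works on bytes rather than str; the 16-byte field width is an arbitrary choice for illustration.

```python
import struct

name = b"myapp-0.0.1"

# Without a count, 's' means a 1-byte string, so everything else is dropped.
assert struct.pack("<s", name) == b"m"
assert struct.calcsize("<s") == 1

# With an explicit count the whole value fits.
fmt = "<{}s".format(len(name))          # "<11s"
packed = struct.pack(fmt, name)
(unpacked,) = struct.unpack(fmt, packed)
assert unpacked == name

# A fixed-width field pads short values with NUL bytes.
padded = struct.pack("<16s", name)
assert len(padded) == 16
print(padded)
```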
So, how do you deal with this?
Don't use struct if it's not what you want. The main reason to use struct is to interact with C code that dumps C struct objects directly to/from a buffer/file/socket/whatever, or a binary format spec written in a similar style (e.g. IP headers). It's not meant for general serialization of Python data. As Jon Clements points out in a comment, if all you want to store is a string, just write the string as-is. If you want to store something more complex, consider the json module; if you want something even more flexible and powerful, use pickle.
Use fixed-length strings. If part of your file format spec is that the name must always be 255 characters or less, just write '<255s'. Shorter strings will be padded, longer strings will be truncated (you might want to throw in a check for that to raise an exception instead of silently truncating).
Use some in-band or out-of-band means of passing along the length. The most common is a length prefix. (You may be able to use the 'p' (Pascal string) format to help, but it really depends on the C layout/binary format you're trying to match; often you have to do something ugly like struct.pack('<h{}s'.format(len(name)), len(name), name).)
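The length-prefix option could be sketched like this for Python 3. The helper names (write_string, read_exact, read_string), the 2-byte unsigned little-endian prefix, and the UTF-8 encoding are all assumptions made for the example, not part of any existing format.

```python
import io
import struct

def write_string(stream, text):
    """Write a 2-byte little-endian length, then the UTF-8 bytes."""
    data = text.encode("utf-8")
    stream.write(struct.pack("<H", len(data)))   # struct.error if over 65535
    stream.write(data)

def read_exact(stream, size):
    """Read exactly size bytes or fail loudly; read() may return fewer."""
    data = stream.read(size)
    if len(data) != size:
        raise EOFError("expected %d bytes, got %d" % (size, len(data)))
    return data

def read_string(stream):
    """Read one length-prefixed string written by write_string."""
    (length,) = struct.unpack("<H", read_exact(stream, 2))
    return read_exact(stream, length).decode("utf-8")

buffer = io.BytesIO()
write_string(buffer, "myapp-0.0.1")
write_string(buffer, "second field")
buffer.seek(0)
first = read_string(buffer)
second = read_string(buffer)
print(first, second)
```

The read_exact check is there because of the short-read problem: a truncated file then raises an error instead of producing a confusing struct.error later.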
As for why your code is failing, there are multiple reasons. First, read(11) isn't guaranteed to read 11 characters. If there's only 1 character in the file, that's all you'll get. Second, you're not actually calling read(11), you're calling read(1), because struct.calcsize('s') returns 1 (for reasons which should be obvious from the above). Third, either your code isn't exactly what you've shown above, or infile's file pointer isn't at the right place, because that code as written will successfully read in the string 'm' and unpack it as 'm'. (I'm assuming Python 2.x here; 3.x will have more problems, but you wouldn't have even gotten that far.)
For your specific use case ("file header… which contains information about the contents; version information, and other string values"), I'd just write the strings with newline terminators. (If the strings can have embedded newlines, you could backslash-escape them into \n, use C-style or RFC822-style continuations, quote them, etc.)
This has a number of advantages. For one thing, it makes the format trivially human-readable (and human-editable/-debuggable). And, while sometimes that comes with a space tradeoff, a single-character terminator is at least as efficient, possibly more so, than a length-prefix format would be. And, last but certainly not least, it means the code is dead-simple for both generating and parsing headers.
In a later comment you clarify that you also want to write ints, but that doesn't change anything. An 'i' int value will take 4 bytes, but most apps write a lot of small numbers, which only take 1-2 bytes (+1 for a terminator/separator) if you write them as strings. And if you're not writing small numbers, a Python int can easily be too large to fit in a C int, in which case struct will, depending on the Python version, either raise struct.error or silently overflow and just write the low 32 bits.
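A newline-terminated header along these lines might look like the following Python 3 sketch. The function names, the blank line that marks the end of the header, and the rule that fields may not be empty are assumptions for the example; the answer itself only says "newline terminators".

```python
import io

def write_header(stream, fields):
    """Write each field on its own line, then a blank line to end the header."""
    for value in fields:
        text = str(value)                 # ints are simply written as text
        if text == "" or "\n" in text:
            raise ValueError("empty fields and embedded newlines need escaping")
        stream.write(text + "\n")
    stream.write("\n")

def read_header(stream):
    """Read lines up to the blank terminator and return them without newlines."""
    fields = []
    while True:
        line = stream.readline()
        if line in ("", "\n"):            # end of file or end of header
            break
        fields.append(line.rstrip("\n"))
    return fields

buffer = io.StringIO()
write_header(buffer, ["myapp", "0.0.1", 42])
buffer.write("payload starts here")
buffer.seek(0)

fields = read_header(buffer)
rest = buffer.read()
print(fields, int(fields[2]))
```

Everything comes back as a string, so the reader has to know which fields are numbers and convert them, as with int(fields[2]) here.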

Is it possible to suppress Python's escape sequence processing on a given string without using the raw specifier?

Conclusion: It's impossible to override or disable Python's built-in escape sequence processing so that you can skip the raw prefix specifier. I dug into Python's internals to figure this out. So if anyone tries designing objects that work on complex strings (like regex) as part of some kind of framework, make sure to specify in the docstrings that string arguments to the object's __init__() MUST include the r prefix!
Original question: I am finding it a bit difficult to force Python to not "change" anything about a user-inputted string, which may contain among other things, regex or escaped hexadecimal sequences. I've already tried various combinations of raw strings, .encode('string-escape') (and its decode counterpart), but I can't find the right approach.
Given an escaped, hexadecimal representation of the Documentation IPv6 address 2001:0db8:85a3:0000:0000:8a2e:0370:7334, using .encode(), this small script (called x.py):
#!/usr/bin/env python
class foo(object):
    __slots__ = ("_bar",)
    def __init__(self, input):
        if input is not None:
            self._bar = input.encode('string-escape')
        else:
            self._bar = "qux?"
    def _get_bar(self): return self._bar
    bar = property(_get_bar)
#
x = foo("\x20\x01\x0d\xb8\x85\xa3\x00\x00\x00\x00\x8a\x2e\x03\x70\x73\x34")
print x.bar
Will yield the following output when executed:
$ ./x.py
\x01\r\xb8\x85\xa3\x00\x00\x00\x00\x8a.\x03ps4
Note the \x20 got converted to an ASCII space character, along with a few others. This is basically correct due to Python processing the escaped hex sequences and converting them to their printable ASCII values.
This can be solved if the initializer to foo() was treated as a raw string (and the .encode() call removed), like this:
x = foo(r"\x20\x01\x0d\xb8\x85\xa3\x00\x00\x00\x00\x8a\x2e\x03\x70\x73\x34")
However, my end goal is to create a kind of framework that can be used and I want to hide these kinds of "implementation details" from the end user. If they called foo() with the above IPv6 address in escaped hexadecimal form (without the raw specifier) and immediately print it back out, they should get back exactly what they put in w/o knowing or using the raw specifier. So I need to find a way to have foo's __init__() do whatever processing is necessary to enable that.
Edit: Per this SO question, it seems it's a defect of Python, in that it always performs some kind of escape sequence processing. There does not appear to be any kind of facility to completely turn off escape sequence processing, even temporarily. Sucks. I guess I am going to have to research subclassing str to create something like rawstr that intelligently determines what escape sequences Python processed in a string, and convert them back to their original format. This is not going to be fun...
Edit2: Another example, given the sample regex below:
"^.{0}\xcb\x00\x71[\x00-\xff]"
If I assign this to a var or pass it to a function without using the raw specifier, the \x71 gets converted to the letter q. Even if I add .encode('string-escape') or .replace('\\', '\\\\'), the escape sequences are still processed, resulting in this output:
"^.{0}\xcb\x00q[\x00-\xff]"
How can I stop this, again, without using the raw specifier? Is there some way to "turn off" the escape sequence processing or "revert" it after the fact so that the q turns back into \x71? Is there a way to process the string and escape the backslashes before the escape sequence processing happens?

I think you have an understandable confusion about the difference between Python string literals (the source code representation), Python string objects in memory, and how those objects can be printed (in what format they can be represented in the output).
If you read some bytes from a file into a bytestring you can write them back as is.
r"" exists only in source code; there is no such thing at runtime, i.e., r"\x" and "\\x" are equal, and they may even be the exact same string object in memory.
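A small check of that claim, as a sketch that runs on Python 3. The variable names are only labels for the three spellings.

```python
raw = r"\x20\x01"        # raw literal: backslashes are kept as characters
escaped = "\\x20\\x01"   # normal literal with doubled backslashes
cooked = "\x20\x01"      # normal literal: the parser applies the escapes

# The first two literals produce the same 8-character string.
assert raw == escaped
assert len(raw) == 8

# The third is a 2-character string: a space followed by the byte value 1.
assert len(cooked) == 2
assert cooked == " \x01"

# By the time any function receives the object, nothing records how it was spelled.
print(repr(raw), repr(cooked))
```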
To see that input is not corrupted, you could print each byte as an integer:
print " ".join(str(ord(c)) for c in raw_input("input something"))
Or just echo as is (there could be a difference but it is unrelated to your "string-escape" issue):
print raw_input("input something")
Identity function:
def identity(obj):
    return obj
If you do nothing to the string then your users will receive the exact same object back. You can provide examples in the docs of what you consider a concise, readable way to represent an input string as a Python literal. If you find it confusing to work with binary strings such as "\x20\x01" then you could accept an ASCII hex representation instead: "2001" (you could use binascii.hexlify/unhexlify to convert one to the other).
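The hexlify/unhexlify round trip could look like this on Python 3, where binary data is a bytes object. The address is the one from the question.

```python
import binascii

# The 16 raw bytes of the documentation IPv6 address from the question.
packed = b"\x20\x01\x0d\xb8\x85\xa3\x00\x00\x00\x00\x8a\x2e\x03\x70\x73\x34"

# Binary to a plain ASCII hex string that is safe to print, log, or type in.
hex_text = binascii.hexlify(packed).decode("ascii")
print(hex_text)

# And back again, byte for byte.
restored = binascii.unhexlify(hex_text)
assert restored == packed
```

Users can then pass "20010db8..." as ordinary text, and no escape sequences are involved on either side.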
The regex case is more complex because there are two languages:
Escape sequences are interpreted by Python according to its string literal syntax
The regex engine interprets the resulting string object as a regex pattern, which also has its own escape sequences
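The two layers can be told apart with a pair of small experiments. This is a Python 3 sketch; the \x71 escape is taken from the question, while the \b example is added here because it is a case where the two languages disagree.

```python
import re

plain = "\x71"    # Python's parser turns this into the single character 'q'
raw = r"\x71"     # four characters: backslash, x, 7, 1
assert plain == "q" and len(raw) == 4

# The regex engine understands \x71 on its own, so both patterns match "q".
assert re.fullmatch(plain, "q")
assert re.fullmatch(raw, "q")

# \b means different things in the two languages:
# a backspace character in a string literal, a word boundary in a regex.
boundary_plain = "\bcat"
boundary_raw = r"\bcat"
assert re.search(boundary_raw, "a cat") is not None
assert re.search(boundary_plain, "a cat") is None
print("both layers behave as described")
```

So losing \x71 to the Python parser is harmless for a regex, but for escapes like \b the raw prefix (or a doubled backslash) in the source code is what matters.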

I think you will have to go the join route.
Here's an example:
>>> m = {chr(c): '\\x{0}'.format(hex(c)[2:].zfill(2)) for c in xrange(0,256)}
>>>
>>> x = "\x20\x01\x0d\xb8\x85\xa3\x00\x00\x00\x00\x8a\x2e\x03\x70\x73\x34"
>>> print ''.join(map(m.get, x))
\x20\x01\x0d\xb8\x85\xa3\x00\x00\x00\x00\x8a\x2e\x03\x70\x73\x34
I'm not entirely sure why you need that though. If your code needs to interact with other pieces of code, I'd suggest that you agree on a defined format, and stick to it.
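For Python 3, where the data would be a bytes object, the same join approach can be sketched without the lookup table. The function name to_hex_escapes is made up for this example.

```python
def to_hex_escapes(data):
    """Render every byte as a \\xNN escape, printable or not."""
    # Iterating over bytes yields ints in Python 3.
    return "".join("\\x{:02x}".format(byte) for byte in data)

packed = b"\x20\x01\x0d\xb8\x85\xa3\x00\x00\x00\x00\x8a\x2e\x03\x70\x73\x34"
text = to_hex_escapes(packed)
print(text)
```

Note that this does not recover what the user typed. It produces one fixed spelling, so input written as "ps4" also comes out as \x70\x73\x34.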

Python str() - specify which kind of quotes to add/use?

Is there a way to influence the kind of quotes that python uses when casting a tuple/list to string?
For some NLP software I get tuples somewhat like this:
("It", ("isn't", "true"))
I want to cast it to a string and simply remove all double quotes and commas:
(It (Isn't true))
However, python is having its way with the quotes, it seems to prefer single quotes:
>>> print str(("It", ("Isn't" ,"true")))
('It', ("Isn't", 'true'))
, making my life more difficult. Of course I could write my own function for printing it out part-by-part, but there is so much similarity between the representation and native python tuples.

You can't rely on the exact representation that repr uses. I'd just do as you thought and write your own function -- I don't see it being more than a handful of lines of code. This should get you going.
def s_exp(x):
    if isinstance(x, (tuple, list)):
        return '(%s)' % (' '.join(map(s_exp, x)))
    return str(x)
Writing your own function may be inevitable: if your strings contain brackets "(", ")" or spaces " " then you'll need some form of escaping to produce well-formed s-expressions.
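One possible shape for such escaping is sketched below. The quoting rule (wrap an atom in double quotes when it is empty or contains a space, parenthesis, double quote, or backslash) is an assumption chosen for illustration; the NLP tool that consumes the output may expect something else.

```python
def atom(text):
    """Return text unchanged if it is a safe atom, otherwise a quoted form."""
    if text == "" or any(ch in text for ch in ' ()"\\'):
        escaped = text.replace("\\", "\\\\").replace('"', '\\"')
        return '"' + escaped + '"'
    return text

def s_exp(x):
    """Render nested tuples/lists as an s-expression string."""
    if isinstance(x, (tuple, list)):
        return "(" + " ".join(s_exp(item) for item in x) + ")"
    return atom(str(x))

simple = s_exp(("It", ("isn't", "true")))
tricky = s_exp(("a b", "(c)"))
print(simple)
print(tricky)
```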

Perhaps you can use json instead
>>> import json
>>> print json.dumps(("It", ("isn't", "true")))
["It", ["isn't", "true"]]

Python objects have a __str__ method that converts them into a string representation. For a tuple or list, that conversion uses the repr() of each element, which is intelligent enough to use one kind of quote when the other is used in the string and also to do escaping if both are used.
In your example, the It got single quoted since that's what Python "prefers". The double quote was used for Isn't since it contains a '.
You should roll out your own converter really. Using a little recursion, it should be quite small.
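The reason the quotes show up at all can be checked directly. This sketch runs on Python 3 and reflects CPython's output.

```python
words = ("It", ("Isn't", "true"))

# str() of a plain string adds no quotes...
assert str("It") == "It"

# ...but str() of a tuple is built from the repr() of each element,
# so the quotes are part of the result.
assert str(words) == repr(words)
print(str(words))

# Each element picks its own delimiter, which is why the output is mixed.
print(repr("It"), repr("Isn't"))
```

There is no parameter on str() or repr() for choosing the delimiter, which is why a small custom converter is the practical route.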

python string good practise: ' vs " [duplicate]

This question already has answers here:
Closed 11 years ago.
Possible Duplicate:
Single quotes vs. double quotes in Python
I have seen that when I work with strings in Python, both of the following syntaxes are accepted:
mystring1 = "here is my string 1"
mystring2 = 'here is my string 2'
Is there any difference?
Is there any reason to prefer one over the other?
Cheers,

No, there isn't. When the string contains a single quote, it's easier to enclose it in double quotes, and vice versa. Other than this, my advice would be to pick a style and stick to it.
Another useful type of string literal is the triple-quoted string, which can span multiple lines:
s = """string literal...
...continues on second line...
...and ends here"""
Again, it's up to you whether to use single or double quotes for this.
Lastly, I'd like to mention "raw string literals". These are enclosed in r"..." or r'...' and prevent escape sequences (such as \n) from being parsed as such. Among other things, raw string literals are very handy for specifying regular expressions.
Read more about Python string literals here.
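A few quick checks of the points above, as a Python 3 sketch. The version-number pattern is just an example of a regex that benefits from the raw prefix.

```python
import re

# Single and double quotes produce the same string object type and value.
assert 'here is my string' == "here is my string"

# Picking the other delimiter avoids a backslash.
assert "isn't" == 'isn\'t'

# A raw literal keeps the backslash as a character instead of parsing \n.
assert len("\n") == 1
assert len(r"\n") == 2

# Raw literals keep regex patterns readable.
pattern = r"(\d+)\.(\d+)\.(\d+)"
match = re.fullmatch(pattern, "0.0.1")
version = tuple(int(part) for part in match.groups())
print(version)
```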

While it's true that there is no difference between one and the other, I have often seen the following convention in the open source community:
" for text that is supposed to be read (email, feedback, exceptions, etc)
' for data text (dict keys, function arguments, etc)
triple " for any docstring or text that includes " and '

No. A matter of style only. Just be consistent.

I tend to use " simply because that's what most other programming languages use.
So, habit, really.

There's no difference.
What's better is arguable. I use "..." for text strings and '...' for characters, because that's consistent with other languages and may save you some keypresses when porting to/from a different language. For regexps and SQL queries, I always use r'''...''', because they frequently end up containing backslashes and both types of quotes.

Python is all about the least amount of code to get the most effect. The shorter the better. And ' is, in a way, one dot shorter than " which is why I prefer it. :)

As everyone's pointed out, they're functionally identical. However, PEP 257 (Docstring Conventions) suggests always using """ around docstrings just for the purposes of consistency. No one's likely to yell at you or think poorly of you if you don't, but there it is.

Why does python use unconventional triple-quotation marks for comments?

Why didn't python just use the traditional style of comments like C/C++/Java uses:
/**
* Comment lines
* More comment lines
*/
// line comments
// line comments
//
Is there a specific reason for this or is it just arbitrary?

Python doesn't use triple quotation marks for comments. Comments use the hash (a.k.a. pound) character:
# this is a comment
The triple quote thing is a doc string, and, unlike a comment, is actually available as a real string to the program:
>>> def bla():
...     """Print the answer"""
...     print 42
...
>>> bla.__doc__
'Print the answer'
>>> help(bla)
Help on function bla in module __main__:
bla()
    Print the answer
It's not strictly required to use triple quotes, as long as it's a string. Using """ is just a convention (and has the advantage of being multiline).
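To illustrate that last point, here is a sketch for Python 3 using an ordinary double-quoted string as the docstring, next to a real comment for contrast.

```python
def bla():
    "Print the answer"
    # This is a comment: the parser drops it and it is stored nowhere.
    print(42)

# The plain one-line string was picked up as the docstring.
assert bla.__doc__ == "Print the answer"

# The comment text is not part of the docstring.
assert "comment" not in bla.__doc__

bla()
```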

A number of the answers got many of the points, but don't give the complete view of how things work. To summarize...
# comment is how Python does actual comments (similar to bash and some other languages). Python only has "to the end of the line" comments; it has no explicit multi-line comment wrapper (as opposed to JavaScript's /* .. */). Most Python IDEs let you select and comment out a block at a time, which is how many people handle that situation.
Then there are normal single-line python strings: They can use ' or " quotation marks (eg 'foo' "bar"). The main limitation with these is that they don't wrap across multiple lines. That's what multiline-strings are for: These are strings surrounded by triple single or double quotes (''' or """) and are terminated only when a matching unescaped terminator is found. They can go on for as many lines as needed, and include all intervening whitespace.
Either of these two string types define a completely normal string object. They can be assigned a variable name, have operators applied to them, etc. Once parsed, there are no differences between any of the formats. However, there are two special cases based on where the string is and how it's used...
First, if a string is just written down, with no additional operations applied, and not assigned to a variable, what happens to it? When the code executes, the bare string is basically discarded. So people have found it convenient to comment out large bits of python code using multi-line strings (providing you escape any internal multi-line strings). This isn't that common, or semantically correct, but it is allowed.
The second use is that any such bare strings which follow immediately after a def Foo(), class Foo(), or the start of a module, are treated as a string containing documentation for that object, and stored in the __doc__ attribute of the object. This is the most common case where strings can seem like they are a "comment". The difference is that they are performing an active role as part of the parsed code, being stored in __doc__... and unlike a comment, they can be read at runtime.
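Both special cases can be seen side by side in a short Python 3 sketch. The names documented and Foo are placeholders.

```python
def documented():
    """First statement: becomes the docstring."""
    value = 1
    """Bare string later in the body: evaluated and thrown away."""
    return value

class Foo:
    """Class docs."""

# Only the string in first position is stored in __doc__.
assert documented.__doc__ == "First statement: becomes the docstring."
assert Foo.__doc__ == "Class docs."

# The later bare string has no effect on what the function does.
assert documented() == 1
print(documented.__doc__)
```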

Triple-quotes aren't comments. They're string literals that span multiple lines and include those line breaks in the resulting string. This allows you to use
somestr = """This is a rather long string containing
several lines of text just as you would do in C.
Note that whitespace at the beginning of the line is\
significant."""
instead of
somestr = "This is a rather long string containing\n\
several lines of text just as you would do in C.\n\
Note that whitespace at the beginning of the line is\
significant."
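That the two spellings really produce the same string can be verified with a shorter pair. This is a sketch with made-up text; it also shows that a backslash at the end of a line inside triple quotes joins the lines instead of adding a line break.

```python
triple = """line one
line two\
 continued"""

escaped = "line one\nline two continued"

# The real line break became \n; the backslash-newline pair vanished.
assert triple == escaped
assert triple.count("\n") == 1
print(triple)
```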

Most scripting languages use # as a comment marker so that the shebang line (#!), which tells the program loader which interpreter to run (as in #!/bin/bash), is skipped automatically. Alternatively, the interpreter could be instructed to skip the first line, but it's far more convenient just to define # as the comment marker, so the shebang is skipped as a consequence.

Guido, the creator of Python, actually weighs in on the topic here:
https://twitter.com/gvanrossum/status/112670605505077248?lang=en
In summary - for multiline comments, just use triple quotes. For academic purposes - yes it technically is a string, but it gets ignored because it is never used or assigned to a variable.
