Thanks in advance for your help.
When I enter "example" at the command line, Python returns 'example'. I cannot find anything on the web to explain this. All the reference materials discuss strings in the context of the print command, and I understand all of the material about using double quotes, single quotes, triple quotes, escape sequences, etc.
I cannot, however, find anything explaining why entering text surrounded by double quotes at the command line always returns the same text surrounded by single quotes. What gives? Thanks.
In Python both 'string' and "string" are used to represent string literals. It's not like Java where single and double quotes represent different data types to the compiler.
The interpreter evaluates each line you enter and displays this value to you. In both cases the interpreter is evaluating what you enter, getting a string, and displaying this value. The default way of displaying strings is in single quotes so both times the string is displayed enclosed in single quotes.
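As a minimal sketch of the point above (assuming Python 3, with variable names chosen only for illustration), the text echoed at the prompt is the repr() of the value, while print() writes its str():

```python
typed = "example"          # the literal as typed, with double quotes

echoed = repr(typed)       # the text the interactive prompt displays
printed = str(typed)       # the text print() writes

print(echoed)
print(printed)
```

The quotes belong to the display, not to the string: the value itself is seven characters long either way.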
It does seem odd, in that it breaks Python's rule that "There should be one - and preferably only one - obvious way to do it", but I think disallowing one of the options would have been worse.
You can also enter a string literal using triple quotes:
>>> """characters
... and
... newlines"""
'characters\nand\nnewlines'
You can use the command line to confirm that these are the same thing:
>>> type("characters")
<type 'str'>
>>> type('characters')
<type 'str'>
>>> "characters" == 'characters'
True
The interpreter uses the __repr__ method of an object to get the text it displays to you. So on your own objects you can determine how they are displayed in the interpreter. We can't change the __repr__ method for built-in types, but we can customise the interpreter output using sys.displayhook:
>>> import sys
>>> def customoutput(value):
...     if isinstance(value, str):
...         print '"%s"' % value
...     else:
...         sys.__displayhook__(value)
...
>>> sys.displayhook = customoutput
>>> 'string'
"string"
In Python, single quotes and double quotes are semantically the same.
It struck me as strange at first, since in C++ and other statically typed languages single quotes denote a char and double quotes denote a string.
Just get used to the fact that Python has no separate character type, and so there's no special syntax for marking a string vs. a character. Don't let it cloud your perception of a great language.
Don't get confused.
In Python, single quotes and double quotes are the same. Both create a string object.
I'm trying to print a string that contains a double backslash (one to escape the other) such that only one of the backslashes is printed. I thought this would happen automatically, but I must be missing some detail.
I have this little snippet:
for path in self.tokenized:
    pdb.set_trace()
    print(self.tokenized[path])
When I debug with that pdb.set_trace() I can see that my strings have double backslashes, and then I enter continue to print the remainder and it prints that same thing.
> /home/kendall/Development/path-parser/tokenize_custom.py(82)print_tokens()
-> print(self.tokenized[path])
(Pdb) self.tokenized[path]
['c:', '\\home', '\\kendall', '\\Desktop', '\\home\\kendall\\Desktop']
(Pdb) c
['c:', '\\home', '\\kendall', '\\Desktop', '\\home\\kendall\\Desktop']
Note that I'm writing a parser that parses Windows file paths -- thus the backslashes.
This is what it looks like to run the program:
kendall@kendall-XPS-8500:~/Development/path-parser$ python main.py -f c:\\home\\kendall\\Desktop
The issue you are having is that you're printing a list, which only knows one way to stringify its contents: repr. repr is designed mainly for debugging. Idiomatically, when possible (classes are a notable exception), it outputs a syntactically valid Python expression that can be fed directly into the interpreter to reproduce the original object, hence the escaped backslashes.
Instead, you need to loop through each list, and print each string individually.
You can use str.join() to do this for you.
To get similar output, without the doubled backslashes (and without the quotes around each item), you'd need to do something like:
print("[{0}]".format(", ".join(self.tokenized[path])))
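To make the difference concrete, here is a small self-contained sketch (assuming Python 3; the tokens list stands in for self.tokenized[path] from the question) comparing the list's own string form with the joined form:

```python
# Stand-in for self.tokenized[path]; each "\\" in the source is ONE backslash.
tokens = ['c:', '\\home', '\\kendall', '\\Desktop', '\\home\\kendall\\Desktop']

# str() of a list uses repr() on every element, so backslashes show doubled.
as_list = str(tokens)

# Joining the elements uses the strings themselves, so each backslash shows once.
joined = "[{0}]".format(", ".join(tokens))

print(as_list)
print(joined)
```

Note that the strings never contained two backslashes: '\\home' is five characters long.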
Given two nearly identical text files (plain text, created in MacVim), I get different results when reading them into a variable in Python. I want to know why this is and how I can produce consistent behavior.
For example, f1.txt looks like this:
This isn't a great example, but it works.
And f2.txt looks like this:
This isn't a great example, but it wasn't meant to be.
"But doesn't it demonstrate the problem?," she said.
When I read these files in, using something like the following:
f = open("f1.txt","r")
x = f.read()
I get the following when I look at the variables in the console. f1.txt:
>>> x
"This isn't a great example, but it works.\n\n"
And f2.txt:
>>> y
'This isn\'t a great example, but it wasn\'t meant to be. \n"But doesn\'t it demonstrate the problem?," she said.\n\n'
In other words, f1 comes in with only escaped newlines, while f2 also has its single quotes escaped.
repr() shows what's going on. First for f1:
>>> repr(x)
'"This isn\'t a great example, but it works.\\n\\n"'
And f2:
>>> repr(y)
'\'This isn\\\'t a great example, but it wasn\\\'t meant to be. \\n"But doesn\\\'t it demonstrate the problem?," she said.\\n\\n\''
This kind of behavior is driving me crazy. What's going on and how do I make it consistent? If it matters, I'm trying to read in plain text, manipulate it, and eventually write it out so that it shows the properly escaped characters (for pasting into Javascript code).
Python is giving you a string literal which, if you gave it back to Python, would result in the same string. This is known as the repr() (short for "representation") of the string. This may not (probably won't, in fact) match the string as it was originally specified, since there are so many ways to do that, and Python does not record anything about how it was originally specified.
It uses double quotes around your first example, which works fine because it doesn't contain any double quotes. The second string contains double quotes, so Python doesn't use double quotes as the delimiter. Instead it uses single quotes and uses backslashes to escape the single quotes in the string (it doesn't have to escape the double quotes this way). In other words, double quotes are chosen only when the string contains single quotes and no double quotes; otherwise single quotes are used.
There is no reason for this behavior to drive you crazy and no need to try to make it consistent. You only get the repr() of a string when you are peeking at values in Python's interactive mode. When you actually print or otherwise use the string, you get the string itself, not a reconstituted string literal.
If you want to get a JavaScript string literal, the easiest way is to use the json module:
import json
print json.dumps('I said, "Hello, world!"')
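For Python 3, a minimal sketch of the same idea looks like this (the sample text is made up for illustration). json.dumps returns a double-quoted literal with newlines and inner double quotes escaped, and json.loads recovers the original string:

```python
import json

# Sample text with apostrophes, double quotes and newlines.
text = 'This isn\'t a great example.\n"But doesn\'t it demonstrate the problem?," she said.\n'

# A double-quoted, escaped literal suitable for pasting into JavaScript source.
js_literal = json.dumps(text)
print(js_literal)

# Decoding the literal gives back exactly the original text.
round_tripped = json.loads(js_literal)
```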
Both f1 and f2 contain perfectly normal, unescaped single quotes.
The fact that their repr looks different is meaningless.
There are a variety of different ways to represent the same string. For example, these are all equivalent literals:
"abc'def'ghi"
'abc\'def\'ghi'
'''abc'def'ghi'''
r"abc'def'ghi"
The repr function on a string always just generates some literal that is a valid representation of that string, but you shouldn't depend on exactly which one it generates. (In fact, you should rarely use it for anything but debugging purposes in the first place.)
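A short sketch (assuming Python 3) of both claims: the four literals denote one and the same string, and whatever literal repr() picks evaluates back to that string. ast.literal_eval is used here only as a safe way to evaluate the generated literal:

```python
import ast

# Four different spellings of the same string.
variants = [
    "abc'def'ghi",
    'abc\'def\'ghi',
    '''abc'def'ghi''',
    r"abc'def'ghi",
]
same_string = len(set(variants)) == 1

# repr() produces *some* valid literal; evaluating it recovers the string.
shown = repr(variants[0])
rebuilt = ast.literal_eval(shown)

print(same_string, shown, rebuilt == variants[0])
```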
Since the language doesn't define anywhere what algorithm it uses to generate a repr, it could be different for each version of each implementation.
Most of them will try to be clever, using single or double quotes to avoid as many escaped internal quotes as possible, but even that isn't guaranteed. If you really want to know the algorithm for a particular implementation and version, you pretty much have to look at the source. For example, in CPython 3.3, inside unicode_repr, it counts the number of quotes of each type; then if there are single quotes but no double quotes, it uses " instead of '.
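The quote-selection rule described above can be sketched as follows. This is an illustration of the described CPython behaviour, not the actual implementation, and pick_quote is a hypothetical helper name; other implementations or versions may choose differently:

```python
def pick_quote(s):
    """Return the delimiter CPython's repr is described as choosing for s."""
    # Double quotes only when there are single quotes but no double quotes.
    if "'" in s and '"' not in s:
        return '"'
    return "'"

samples = ["plain", "isn't", 'say "hi"', 'it\'s "both"']
for s in samples:
    # The first character of repr(s) is the delimiter that was chosen.
    print(repr(s), pick_quote(s) == repr(s)[0])
```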
If you want "the" representation of a string, you're out of luck, because there is no such thing. But if you want some particular representation of a string, that's no problem. You just have to know what format you want; for most formats, someone's already written the code, and often it's in the standard library. You can make C literal strings, JSON-encoded strings, strings that can fit into ASCII RFC822 headers… But all of those formats have different rules from each other (and from Python literals), so you have to use the right function for the job.
Conclusion: It's impossible to override or disable Python's built-in escape sequence processing so that you can skip the raw prefix specifier. I dug into Python's internals to figure this out. So if anyone tries designing objects that work on complex strings (like regex) as part of some kind of framework, make sure to specify in the docstrings that string arguments to the object's __init__() MUST include the r prefix!
Original question: I am finding it a bit difficult to force Python to not "change" anything about a user-inputted string, which may contain among other things, regex or escaped hexadecimal sequences. I've already tried various combinations of raw strings, .encode('string-escape') (and its decode counterpart), but I can't find the right approach.
Given an escaped, hexadecimal representation of the Documentation IPv6 address 2001:0db8:85a3:0000:0000:8a2e:0370:7334, using .encode(), this small script (called x.py):
#!/usr/bin/env python
class foo(object):
    __slots__ = ("_bar",)
    def __init__(self, input):
        if input is not None:
            self._bar = input.encode('string-escape')
        else:
            self._bar = "qux?"
    def _get_bar(self): return self._bar
    bar = property(_get_bar)
#
x = foo("\x20\x01\x0d\xb8\x85\xa3\x00\x00\x00\x00\x8a\x2e\x03\x70\x73\x34")
print x.bar
Will yield the following output when executed:
$ ./x.py
\x01\r\xb8\x85\xa3\x00\x00\x00\x00\x8a.\x03ps4
Note the \x20 got converted to an ASCII space character, along with a few others. This is basically correct due to Python processing the escaped hex sequences and converting them to their printable ASCII values.
This can be solved if the initializer to foo() was treated as a raw string (and the .encode() call removed), like this:
x = foo(r"\x20\x01\x0d\xb8\x85\xa3\x00\x00\x00\x00\x8a\x2e\x03\x70\x73\x34")
However, my end goal is to create a kind of framework that can be used and I want to hide these kinds of "implementation details" from the end user. If they called foo() with the above IPv6 address in escaped hexadecimal form (without the raw specifier) and immediately print it back out, they should get back exactly what they put in w/o knowing or using the raw specifier. So I need to find a way to have foo's __init__() do whatever processing is necessary to enable that.
Edit: Per this SO question, it seems it's a defect of Python, in that it always performs some kind of escape sequence processing. There does not appear to be any kind of facility to completely turn off escape sequence processing, even temporarily. Sucks. I guess I am going to have to research subclassing str to create something like rawstr that intelligently determines what escape sequences Python processed in a string, and convert them back to their original format. This is not going to be fun...
Edit2: Another example, given the sample regex below:
"^.{0}\xcb\x00\x71[\x00-\xff]"
If I assign this to a var or pass it to a function without using the raw specifier, the \x71 gets converted to the letter q. Even if I add .encode('string-escape') or .replace('\\', '\\\\'), the escape sequences are still processed, resulting in this output:
"^.{0}\xcb\x00q[\x00-\xff]"
How can I stop this, again, without using the raw specifier? Is there some way to "turn off" the escape sequence processing or "revert" it after the fact so that the q turns back into \x71? Is there a way to process the string and escape the backslashes before the escape sequence processing happens?
I think you have an understandable confusion about the difference between Python string literals (source code representation), Python string objects in memory, and how those objects can be printed (in what format they can be represented in the output).
If you read some bytes from a file into a bytestring you can write them back as is.
r"" exists only in source code; there is no such thing at runtime. That is, r"\x" and "\\x" are equal, and they may even be the exact same string object in memory.
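A small sketch (assuming Python 3) of what "only in source code" means. The raw and the escaped spelling produce equal strings, while the unprefixed literal has already been turned into two characters by the time any function receives it:

```python
raw = r"\x20\x01"        # raw literal: 8 characters, backslashes kept
escaped = "\\x20\\x01"   # same 8 characters, spelled with escaped backslashes
processed = "\x20\x01"   # 2 characters: a space and U+0001

print(raw == escaped)                        # True
print(len(raw), len(escaped), len(processed))
```

Nothing on the string object records which spelling was used, which is why a function cannot undo the processing after the fact.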
To see that input is not corrupted, you could print each byte as an integer:
print " ".join(str(ord(c)) for c in raw_input("input something"))
Or just echo as is (there could be a difference but it is unrelated to your "string-escape" issue):
print raw_input("input something")
Identity function:
def identity(obj):
    return obj
If you do nothing to the string then your users will receive the exact same object back. You can provide examples in the docs of what you consider a concise, readable way to represent the input string as Python literals. If you find it confusing to work with binary strings such as "\x20\x01" then you could accept an ASCII hex representation instead: "2001" (you could use binascii.hexlify/unhexlify to convert one to the other).
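As a sketch of the hex-representation suggestion (assuming Python 3, where binary data is a bytes object), binascii.hexlify and binascii.unhexlify convert between the packed bytes and plain hex text with no backslashes involved:

```python
import binascii

# The 16 bytes of the documentation IPv6 address from the question.
packed = b"\x20\x01\x0d\xb8\x85\xa3\x00\x00\x00\x00\x8a\x2e\x03\x70\x73\x34"

# bytes -> readable hex text (hexlify returns bytes, so decode to str)
hex_text = binascii.hexlify(packed).decode("ascii")
print(hex_text)

# hex text -> the original bytes
restored = binascii.unhexlify(hex_text)
```

Users can then pass "2001..." style text, which needs neither escapes nor the r prefix.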
The regex case is more complex because there are two languages:
Escape sequences are interpreted by Python according to its string literal syntax
The regex engine interprets the string object as a regex pattern that also has its own escape sequences
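The two layers can be seen separately in a short sketch (assuming Python 3; the sample text "port 8080" is made up for illustration):

```python
import re

# Layer 1: the Python literal. "\x71" becomes "q" before any code sees it.
literal = "\x71"

# Layer 2: the regex engine. Both spellings hand it the same three
# characters (backslash, "d", "+"), which it reads as "one or more digits".
escaped_pattern = "\\d+"
raw_pattern = r"\d+"

match = re.search(raw_pattern, "port 8080")
print(literal, escaped_pattern == raw_pattern, match.group(0))
```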
I think you will have to go the join route.
Here's an example:
>>> m = {chr(c): '\\x{0}'.format(hex(c)[2:].zfill(2)) for c in xrange(0,256)}
>>>
>>> x = "\x20\x01\x0d\xb8\x85\xa3\x00\x00\x00\x00\x8a\x2e\x03\x70\x73\x34"
>>> print ''.join(map(m.get, x))
\x20\x01\x0d\xb8\x85\xa3\x00\x00\x00\x00\x8a\x2e\x03\x70\x73\x34
I'm not entirely sure why you need that though. If your code needs to interact with other pieces of code, I'd suggest that you agree on a defined format, and stick to it.
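The session above is Python 2. A rough Python 3 equivalent of the join approach, assuming the data is held as bytes (iterating over bytes yields integers, so no lookup table is needed), might look like this:

```python
data = b"\x20\x01\x0d\xb8\x85\xa3\x00\x00\x00\x00\x8a\x2e\x03\x70\x73\x34"

# Format every byte as a backslash, "x", and two lowercase hex digits.
escaped = "".join("\\x{0:02x}".format(byte) for byte in data)
print(escaped)
```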
Is there a way to influence the kind of quotes that python uses when casting a tuple/list to string?
For some NLP software I get tuples somewhat like this:
("It", ("isn't", "true"))
I want to cast it to a string and simply remove all double quotes and commas:
(It (Isn't true))
However, python is having its way with the quotes, it seems to prefer single quotes:
>>> print str(("It", ("Isn't" ,"true")))
('It', ("Isn't", 'true'))
, making my life more difficult. Of course I could write my own function for printing it out part-by-part, but there is so much similarity between the representation and native python tuples.
You can't rely on the exact representation that repr uses. I'd just do as you thought and write your own function -- I don't see it being more than a handful of lines of code. This should get you going.
def s_exp(x):
    if isinstance(x, (tuple, list)):
        return '(%s)' % (' '.join(map(s_exp, x)))
    return str(x)
Writing your own function may be inevitable: if your strings contain brackets "(", ")" or spaces " " then you'll need some form of escaping to produce well-formed s-expressions.
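One possible way to add such escaping, sketched under the assumption that quoting an atom with json.dumps is acceptable to whatever reads the s-expressions (that quoting rule is an illustrative choice, not something the NLP software is known to require):

```python
import json

def s_exp(x):
    """Render nested tuples/lists as an s-expression string."""
    if isinstance(x, (tuple, list)):
        return "(%s)" % " ".join(map(s_exp, x))
    atom = str(x)
    # Quote atoms that would otherwise break the structure (assumed rule).
    needs_quoting = atom == "" or any(ch.isspace() or ch in '()"' for ch in atom)
    return json.dumps(atom) if needs_quoting else atom

print(s_exp(("It", ("isn't", "true"))))
print(s_exp(("a b", "(c)")))
```

Plain atoms such as isn't pass through untouched, so the original example still renders as (It (isn't true)).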
Perhaps you can use json instead
>>> import json
>>> print json.dumps(("It", ("isn't", "true")))
["It", ["isn't", "true"]]
Python objects have a __str__ method that converts them into a string representation. For a tuple or list, that method calls repr() on each element, and the repr() of a string is intelligent enough to use one kind of quote when the other is used in the string, and to escape if both are used.
In your example, It got single-quoted since that's what Python "prefers". Double quotes were used for Isn't since it contains a '.
You should really roll your own converter. Using a little recursion, it should be quite small.
The following code:
key = open("C:\Scripts\private.ppk",'rb').read()
reads the file and assigns its data to the var key.
For some reason, backslashes are multiplied in the process. How can I make sure they don't get multiplied?
You ... don't. They aren't actually doubled in the data: the interactive display escapes each backslash so that what you see is a valid string literal, and they come out as single backslashes when the value is written out or used. If you're declaring strings and don't want to double up the backslashes you can use raw strings, e.g. r'c:\myfile.txt', but that doesn't really apply to the contents of a file you're reading in.
>>> s = r'c:\boot.ini'
>>> s
'c:\\boot.ini'
>>> repr(s)
"'c:\\\\boot.ini'"
>>> print s
c:\boot.ini
>>>
As you can see, the extra backslashes appear only in the repr() display; the string itself holds a single backslash, which is what you get when you use the value in a print statement (write a file, test for values, etc.).
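The same check can be made on data read from a file. The following sketch (assuming Python 3; the temporary file and its name are made up for the demonstration) writes a path with two backslashes, reads it back in binary mode, and counts the backslashes in the data versus in its repr():

```python
import os
import tempfile

path_text = r"C:\Scripts\private.ppk"   # two backslashes in the actual text

with tempfile.TemporaryDirectory() as tmp:
    name = os.path.join(tmp, "sample.txt")   # hypothetical file name
    with open(name, "w") as f:
        f.write(path_text)
    with open(name, "rb") as f:
        data = f.read()

stored = data.count(b"\\")          # backslashes really present in the data
displayed = repr(data).count("\\")  # backslashes shown by the interactive echo

print(stored, displayed)
```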
You should read this great blog post on python and the backslash escape character.
And under some circumstances, if Python prints information to the console, you will see the two backslashes rather than one. For example, this is part of the difference between the repr() function and the str() function.
myFilename = "c:\\newproject\\typenames.txt"
print repr(myFilename), str(myFilename)
produces
'c:\\newproject\\typenames.txt' c:\newproject\typenames.txt
Backslashes are represented as escaped. You'll see two backslashes for each real one in the file, but that is normal behaviour.
The reason is that the backslash is used in order to create codes that represent characters that cannot be easily represented, such as new line '\n' or tab '\t'.
Are you trying to put single backslashes in a string? A backslash in a string literal requires an escape character, in this case another backslash: "\\". It will print to the screen as a single backslash.
In fact there is a solution: using eval, as long as the file content can be wrapped in quotes of some kind. The following worked for me (PATH is the path to a script that executes Matlab):
MATLAB_EXE = "C:\Program Files (x86)\MATLAB\R2012b\bin\matlab.exe"
content = open(PATH).read()
MATLAB_EXE in content # False
content = eval(f'r"""{content}"""')
MATLAB_EXE in content # True
This works by evaluating the content as a Python string literal, turning doubled escapes into single ones. A raw string is used to prevent escapes from forming special characters.