Possible to make custom string literal prefixes in Python?

Possible to make custom string literal prefixes in Python? - python

Let's say I have a custom class derived from str that implements/overrides some methods:
class mystr(str):
# just an example for a custom method:
def something(self):
return "anything"
Now currently I have to manually create instances of mystr by passing it a string in the constructor:
ms1 = mystr("my string")
s = "another string"
ms2 = mystr(s)
This is not too bad, but it lead to the idea that it would be cool to use a custom string prefix similar to b'bytes string' or r'raw string' or u'unicode string'.
Is it somehow possible in Python to create/register such a custom string literal prefix like m so that a literal m'my string' results in a new instance of mystr?
Or are those prefixes hard-coded into the Python interpreter?

Those prefixes are hardcoded in the interpreter, you can't register more prefixes.
What you could do however, is preprocess your Python files, by using a custom source codec. This is a rather neat hack, one that requires you to register a custom codec, and to understand and apply source code transformations.
Python allows you to specify the encoding of source code with a special comment at the top:
# coding: utf-8
would tell Python that the source code encoded with UTF-8, and will decode the file accordingly before parsing. Python looks up the codec for this in the codecs module registry. And you can register your own codecs.
The pyxl project uses this trick to parse out HTML syntax from Python files to replace them with actual Python syntax to build that HTML, all in a 'decoding' step. See the codec package in that project, where the register module registers a custom codec search function that transforms source code before Python actually parses and compiles it. A custom .pth file is installed into your site-packages directory to load this registration step at Python startup time. Another project that does the same thing to parse out Ruby-style string formatting is interpy.
All you have to do then, is build such a codec too that'll parse a Python source file (tokenizes it, perhaps with the tokenize module) and replaces string literals with your custom prefix with mystr(<string literal>) calls. Any file you want parsed you mark with # coding: yourcustomcodec.
I'll leave that part as an exercise for the reader. Good luck!
Note that the result of this transformation is then compiled into bytecode, which is cached; your transformation only has to run once per source code revision, all other imports of a module using your codec will load the cached bytecode.

One could use operator overloading to implicitly convert str into a custom class
class MyString(str):
def __or__( self, a ):
return MyString(self + a)
m = MyString('')
print( m, type(m) )
#('', <class 'MyString'>)
print m|'a', type(m|'a')
#('a', <class 'MyString'>)
This avoids the use of parenthesis effectively emulating a string prefix with one extra character ─ which I chose to be | but it could also be & or other binary comparison operator.

While the workarounds mentioned above are great, they could be dangerous. Hacking your python is really not a good idea. while you can't really make a prefix otherwise,
you could do the following:
class MyString(str):
def something(self):
return MyString("anything")
m = MyString
# The you can do:
m("hi")
# Rather than:
# m"hi"
That's probably the best safe solution you can find.
Two parentheses aren't really that much to type, and it can be less confusing to readers of your code.

Related

Why do some functions in Python change \ to \\

When I declare pass a file to shutil.copy as
shutil.copy(r'i:\myfile.txt', r'UNC to where I want it to go')
I get an error
No such file or directory 'i:\\myfile.txt'
I've experienced this problem before with the os module when I have a UNC path. Usually I just get frustrated enough that I forget using the os module and just put the file path into with open() or whatever I'm using it for.
It is my understanding that placing an r before '' is supposed to cause python to ignore escape characters and treat them as string literals, but the behavior I'm seeing leads me to believe that this is not the case. For some reason it takes the \ and changes it to \\.
I've seen this when using os.path.join where the \\ at the beginning of the the UNC Path gets turned into \\\\.
What is the best way to pass a string literal to ensure that all escape characters are ignored and the string is preserved?

Your string is not being modified by Python. It's the representation of your string that's coming out differently.
When the error is printed, Python calls repr() to print the value. This function will
Return a string containing a printable representation of an object. For many types, this function makes an attempt to return a string that would yield an object with the same value when passed to eval(), otherwise the representation is a string enclosed in angle brackets that contains the name of the type of the object together with additional information often including the name and address of the object. A class can control what this function returns for its instances by defining a repr() method.
This can be very nice when debugging: if I paste that string (quotes, escapes, and all) into the REPL I'll get the string in memory that you were working with. I can use this to interactively try your copy command, maybe tweaking the string a bit.
If you want to see your string in a printed form, you could do
source_path = r'i:\myfile.txt'
target_path = r'UNC to where I want it to go'
print(f'Copying {source_path} to {target_path}...')
shutil.copy(source_path, target_path)

Writing unicode symbols to files (as opposed to unicode code)

I'm new to python and unicode is starting to give me headaches.
Currently I write to file like this:
my_string = "马/馬"
f = codecs.open(local_filepath, encoding='utf-8', mode='w+')
f.write(my_string)
f.close()
And when I open file with i.e. Gedit, I can see something like this:
\u9a6c/\u99ac\tm\u01ce
While I'd like to see exactly what I've written:
马/馬
I've tried a few different variations, like writing my_string.decode() or my_string.encode('utf-8') instead of just my_string, I know those two methods are the opposites but I was not sure which one I needed. Neither worked anyway.
If I manually write these symbols to text file, then with python read the file, re-write what I've just read back to the same file and save, symbols get turned to the code \u9a6c. Not sure if this is importat, figured I'd just mention it to help identify the problem.
Edit: the strings came from SQL Alchemy objects repr method, which turned out to be where the problem lied. I didn't mention it because it just didn't occur to me it can be related to the problem somehow. Thanks again for your help!

From the comments it is now clear you are using either the repr() function or calling the object.__repr__() method directly.
Don't do that. You are writing debugging information to your file:
>>> my_string = u"马/馬"
>>> print repr(my_string)
u'\u9a6c/\u99ac'
The value produced is meant to be pastable back into a Python session so you can re-produce the exact same value, and as such it is ASCII-safe (so it can be used in Python 2 source code without encoding issues).
From the repr() documentation:
For many types, this function makes an attempt to return a string that would yield an object with the same value when passed to eval(), otherwise the representation is a string enclosed in angle brackets that contains the name of the type of the object together with additional information often including the name and address of the object.
Write the Unicode objects to your file directly instead, codecs.open() handles encoding to UTF-8 correctly if you do.

Python: base64.b64decode() vs .decode?

The Code Furies have turned their baleful glares upon me, and it's fallen to me to implement "Secure Transport" as defined by The Direct Project. Whether or not we internally use DNS rather than LDAP for sharing certificates, I'm obviously going to need to set up the former to test against, and that's what's got me stuck. Apparently, an X509 cert needs some massaging to be used in a CERT record, and I'm trying to work out how that's done.
The clearest thing I've found is a script on Videntity's blog, but not being versed in python, I'm hitting a stumbling block. Specifically, this line crashes:
decoded_clean_pk = clean_pk.decode('base64', strict)
since it doesn't seem to like (or rather, to know) whatever 'strict' is supposed to represent. I'm making the semi-educated guess that the line is supposed to decode the base64 data, but I learned from the Debian OpenSSL debacle some years back that blindly diddling with crypto-related code is a Bad Thing(TM).
So I turn the illustrious python wonks on SO to ask if that line might be replaced by this one (with the appropriate import added):
decoded_clean_pk = base64.b64decode(clean_pk)
The script runs after that change, and produces correct-looking output, but I've got enough instinct to know that I can't necessarily trust my instincts here. :)

This line should've work if you would've called like this:
decoded_clean_pk = clean_pk.decode('base64', 'strict')
Notice that strict has to be a string, otherwise python interpreter would try to search for a variable named strict and if it didn't find it or otherwise has other value than: strict, ignore, and replace, it'll probably would've complain about it.
Take a look at this code:
>>>b=base64.b64encode('hello world')
>>>b.decode('base64')
'hello world'
>>>base64.b64decode(b)
'hello world'
Both decode and b64decode works the same when .decode is passed the base64 argument string.
The difference is that str.decode will take a string of bytes as arguments and will return it's Unicode representation depending on the encoding argument you pass as first parameter. In this case, you're telling it to handle a bas64 string so it will do it ok.
To answer your question, both works the same, although b64decode/encode are meant to work only with base64 encodings and str.decode can handle as many encodings as the library is aware of.
For further information take a read at both of the doc sections: decode and b64decode.
UPDATE: Actually, and this is the most important example I guess :) take a look at the source code for encodings/base64_codec.py which is that decode() uses:
def base64_decode(input,errors='strict'):
""" Decodes the object input and returns a tuple (output
object, length consumed).
input must be an object which provides the bf_getreadbuf
buffer slot. Python strings, buffer objects and memory
mapped files are examples of objects providing this slot.
errors defines the error handling to apply. It defaults to
'strict' handling which is the only currently supported
error handling for this codec.
"""
assert errors == 'strict'
output = base64.decodestring(input)
return (output, len(input))
As you may see, it actually uses base64 module to do it :)
Hope this clarify in some way your question.

How to find undocumented methods in my code?

I am writing documentation for a project and I would like to make sure I did not miss any method. The code is written in Python and I am using PyCharm as an IDE.
Basically, I would need a REGEX to match something like:
def method_name(with, parameters):
someVar = something()
...
but it should NOT match:
def method_name(with, parameters):
""" The doc string """
...
I tried using PyCharm's search with REGEX feature with the pattern ):\s*[^"'] so it would match any line after : that doesn't start with " or ' after whitespace, but it doesn't work. Any idea why?

You mentioned you were using PyCharm: there is an inspection "Missing, empty, or incorrect docstring" that you can enable and will do that for you.
Note that you can then change the severity for it to show up more or less prominently.

There is a tool called pydocstyle which checks if all classes, functions, etc. have properly formatted docstrings.
Example from the README:
$ pydocstyle test.py
test.py:18 in private nested class `meta`:
D101: Docstring missing
test.py:27 in public function `get_user`:
D300: Use """triple double quotes""" (found '''-quotes)
test:75 in public function `init_database`:
D201: No blank lines allowed before function docstring (found 1)
I don't know about PyCharm, but pydocstyle can, for example, be integrated in Vim using the Syntastic plugin.

I don't know python, but I do know my regex.
And your regex has issues. First of all, as comments have mentioned, you may have to escape the closing parenthesis. Secondly, you don't match the new line following the function declaration. Finally, you look for single or double quotations at the START of a line, yet the start of a line contains whitespace.
I was able to match your sample file with \):\s*\n\s*["']. This is a multiline regex. Not all programs are able to match multiline regex. With grep, for example, you'd have to use this method.
A quick explanation of what this regex matches: it looks for a closing parenthesis followed by a semicolon. Any number of optional whitespace may follow that. Then there should be a new line followed by any number of whitespace (indentation, in this case). Finally, there must be a single or double quote. Note that this matches functions that do have comments. You'd want to invert this to find those without.

In case PyCharm is not available, there is a little tool called ckdoc written in Python 3.5.
Given one or more files, it finds modules, classes and functions without a docstring. It doesn't search in imported built-in or external libraries – it only considers objects defined in files residing in the same folder as the given file, or subfolders of that folder.
Example usage (after removing some docstrings)
> ckdoc/ckdoc.py "ckdoc/ckdoc.py"
ckdoc/ckdoc.py
module
ckdoc
function
Check.documentable
anykey_defaultdict.__getitem__
group_by
namegetter
type
Check
There are cases when it doesn't work. One such case is when using Anaconda with modules. A possible workaround in that case is to use ckdoc from Python shell. Import necessary modules and then call the check function.
> import ckdoc, main
> ckdoc.check(main)
/tmp/main.py
module
main
function
main
/tmp/custom_exception.py
type
CustomException
function
CustomException.__str__
False
The check function returns True if there are no missing docstrings.

Is it possible to suppress Python's escape sequence processing on a given string without using the raw specifier?

Conclusion: It's impossible to override or disable Python's built-in escape sequence processing, such that, you can skip using the raw prefix specifier. I dug into Python's internals to figure this out. So if anyone tries designing objects that work on complex strings (like regex) as part of some kind of framework, make sure to specify in the docstrings that string arguments to the object's __init__() MUST include the r prefix!
Original question: I am finding it a bit difficult to force Python to not "change" anything about a user-inputted string, which may contain among other things, regex or escaped hexadecimal sequences. I've already tried various combinations of raw strings, .encode('string-escape') (and its decode counterpart), but I can't find the right approach.
Given an escaped, hexadecimal representation of the Documentation IPv6 address 2001:0db8:85a3:0000:0000:8a2e:0370:7334, using .encode(), this small script (called x.py):
#!/usr/bin/env python
class foo(object):
__slots__ = ("_bar",)
def __init__(self, input):
if input is not None:
self._bar = input.encode('string-escape')
else:
self._bar = "qux?"
def _get_bar(self): return self._bar
bar = property(_get_bar)
#
x = foo("\x20\x01\x0d\xb8\x85\xa3\x00\x00\x00\x00\x8a\x2e\x03\x70\x73\x34")
print x.bar
Will yield the following output when executed:
$ ./x.py
\x01\r\xb8\x85\xa3\x00\x00\x00\x00\x8a.\x03ps4
Note the \x20 got converted to an ASCII space character, along with a few others. This is basically correct due to Python processing the escaped hex sequences and converting them to their printable ASCII values.
This can be solved if the initializer to foo() was treated as a raw string (and the .encode() call removed), like this:
x = foo(r"\x20\x01\x0d\xb8\x85\xa3\x00\x00\x00\x00\x8a\x2e\x03\x70\x73\x34")
However, my end goal is to create a kind of framework that can be used and I want to hide these kinds of "implementation details" from the end user. If they called foo() with the above IPv6 address in escaped hexadecimal form (without the raw specifier) and immediately print it back out, they should get back exactly what they put in w/o knowing or using the raw specifier. So I need to find a way to have foo's __init__() do whatever processing is necessary to enable that.
Edit: Per this SO question, it seems it's a defect of Python, in that it always performs some kind of escape sequence processing. There does not appear to be any kind of facility to completely turn off escape sequence processing, even temporarily. Sucks. I guess I am going to have to research subclassing str to create something like rawstr that intelligently determines what escape sequences Python processed in a string, and convert them back to their original format. This is not going to be fun...
Edit2: Another example, given the sample regex below:
"^.{0}\xcb\x00\x71[\x00-\xff]"
If I assign this to a var or pass it to a function without using the raw specifier, the \x71 gets converted to the letter q. Even if I add .encode('string-escape') or .replace('\\', '\\\\'), the escape sequences are still processed. thus resulting in this output:
"^.{0}\xcb\x00q[\x00-\xff]"
How can I stop this, again, without using the raw specifier? Is there some way to "turn off" the escape sequence processing or "revert" it after the fact thus that the q turns back into \x71? Is there a way to process the string and escape the backslashes before the escape sequence processing happens?

I think you have an understandable confusion about a difference between Python string literals (source code representation), Python string objects in memory, and how that objects can be printed (in what format they can be represented in the output).
If you read some bytes from a file into a bytestring you can write them back as is.
r"" exists only in source code there is no such thing at runtime i.e., r"\x" and "\\x" are equal, they may even be the exact same string object in memory.
To see that input is not corrupted, you could print each byte as an integer:
print " ".join(map(ord, raw_input("input something")))
Or just echo as is (there could be a difference but it is unrelated to your "string-escape" issue):
print raw_input("input something")
Identity function:
def identity(obj):
return obj
If you do nothing to the string then your users will receive the exact same object back. You can provide examples in the docs what you consider a concise readable way to represent input string as Python literals. If you find confusing to work with binary strings such as "\x20\x01" then you could accept ascii hex-representation instead: "2001" (you could use binascii.hexlify/unhexlify to convert one to another).
The regex case is more complex because there are two languages:
Escapes sequences are interpreted by Python according to its string literal syntax
Regex engine interprets the string object as a regex pattern that also has its own escape sequences

I think you will have to go the join route.
Here's an example:
>>> m = {chr(c): '\\x{0}'.format(hex(c)[2:].zfill(2)) for c in xrange(0,256)}
>>>
>>> x = "\x20\x01\x0d\xb8\x85\xa3\x00\x00\x00\x00\x8a\x2e\x03\x70\x73\x34"
>>> print ''.join(map(m.get, x))
\x20\x01\x0d\xb8\x85\xa3\x00\x00\x00\x00\x8a\x2e\x03\x70\x73\x34
I'm not entirely sure why you need that though. If your code needs to interact with other pieces of code, I'd suggest that you agree on a defined format, and stick to it.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Possible to make custom string literal prefixes in Python? - python

Related

Why do some functions in Python change \ to \\

Writing unicode symbols to files (as opposed to unicode code)

Python: base64.b64decode() vs .decode?

How to find undocumented methods in my code?

Is it possible to suppress Python's escape sequence processing on a given string without using the raw specifier?

Categories

Resources