refactor legacy python code: from u'...' to '...'

I have a legacy code project which uses a lot of unicode strings like this: u'...'
I want to update the code to use from __future__ import unicode_literals
Is there any automated help from PyCharm or another tool?
Update
A simple search+replace does not work, since the code could contain strings like 'fuu' and I don't want that to be replaced with 'fu'.

Yes, pycharm has automated find and replace with regex matching. You could also use a simple tool like sed.
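As an alternative to a regex, here is a minimal sketch that uses the standard library tokenize module, so that only real string-literal prefixes are touched and a string such as 'fuu' is left alone. The helper name strip_u_prefixes is hypothetical, and the sketch assumes it is run with Python 3 on source that Python 3 can tokenize.

```python
import io
import tokenize


def strip_u_prefixes(source):
    """Return source with the u/U prefix removed from string literals."""
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        # Only STRING tokens are rewritten, so the text inside a literal
        # such as 'fuu' is never modified.
        if tok.type == tokenize.STRING and tok.string[0] in "uU":
            tok = tok._replace(string=tok.string[1:])
        tokens.append(tok)
    return tokenize.untokenize(tokens)


legacy = "greeting = u'hello' + 'fuu'\n"
print(strip_u_prefixes(legacy))
```

Review the resulting diff before committing it, since spacing around rewritten literals may shift slightly.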
But be forewarned, it is not the case that you can blindly change all modules to include the import:
from __future__ import unicode_literals
This can cause unintended problems. The issue is not with strings which were u'unicode' being changed into 'unicode'; that part is of no consequence. The issue is with strings that actually should have been 'bytestrings' being changed into unicode.
Before you make this global change, you need to ensure that all places where bytestrings are used can really safely be changed to unicode. Those that can't need to be prefixed as b'bytestrings'.
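A small sketch of the distinction, with made-up names (label, magic, looks_like_png). It is written so that it also runs on Python 3, where unprefixed literals are already unicode; in a Python 2 module the commented-out import would be the first statement.

```python
# Python 2 module header (already the default behaviour on Python 3):
# from __future__ import unicode_literals

label = 'caf\xe9'       # human-readable text: fine as unicode
magic = b'\x89PNG'      # binary data: needs an explicit b prefix


def looks_like_png(data):
    """Check a bytes payload against the (illustrative) magic prefix."""
    return data.startswith(magic)


print(looks_like_png(b'\x89PNG\r\n'))
```

Without the b prefix, magic would silently become text once unicode literals are in effect, and comparisons against binary data would no longer behave as intended.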

Related

unicode_escape without deprecation warning

In Python 3.8, the following:
import codecs
codecs.decode("hello\\,world", "unicode_escape")
...produces a deprecation warning:
<ipython-input-7-28f185d30178>:1: DeprecationWarning: invalid escape sequence '\,'
codecs.decode("hello\\,", "unicode_escape", errors="strict")
Is this going to become an error in a future version of Python? Where is the reference for this?
If No, is there a way to not display this warning, if Yes, what can I use instead?
Edit: I am (obviously) not interested in fixing this for a string I would have written literally in my Python script. This is purely an example, in my use case strings come from external files, I cannot change them, and some contains invalid escape sequences such as this.

The problem is that you've got two layers of de-escaping occurring here. The first one is the regular string literal escaping, converting "hello\\,world" to a string with raw characters of hello\,world. Then codecs.decode tries to decode it with unicode_escape, which sees it as an attempt to escape , with \, which is an invalid escape.
The fix for your code as written is to use a raw string so the first level of escaping doesn't happen:
codecs.decode(r"hello\\,world", "unicode_escape")
# ^ Now it's a raw string, and both backslashes are in fact backslashes
If your data comes from elsewhere with invalid escapes, you can suppress the warning for now (see the warnings module for details), but it will eventually cause an exception in a future version of Python, so the long term solution is "Don't provide invalid data". Sorry that's not super-helpful.
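A minimal sketch of suppressing the warning locally, assuming a current Python version where the invalid escape is still only a warning. The helper name decode_escapes_quietly is hypothetical, and the filter ignores every warning raised inside the block rather than guessing the exact category.

```python
import codecs
import warnings


def decode_escapes_quietly(text):
    """Decode backslash escapes without showing warnings for invalid ones."""
    with warnings.catch_warnings():
        # The filter is restored when the with-block exits.
        warnings.simplefilter("ignore")
        return codecs.decode(text, "unicode_escape")


print(decode_escapes_quietly("hello\\,world"))
```

The invalid escape is passed through unchanged, backslash included, so this only hides the symptom.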
The reason you get this warning is that, historically, Python has been kind of lax about spurious escapes. So if you wrote, say, a Windows path of "C:\yes" it said "hey, \y doesn't mean anything, so we'll just assume they wanted a literal backslash", while an equivalent path of "C:\no" saw \n and thought "Yup, they want a newline there".
This taught bad habits (not using raw strings when you should, because it usually worked without them) and created confusion when those habits bit you (why isn't it working this time!?!). So in the future, escapes like \y will be treated as errors, so that you end up writing r'C:\yes' and are used to doing so so you don't get bit by r'C:\no'. The warning is reminding you that this is bad code, and will eventually stop working (for good reason; as your own comment notes, you're okay with it ending up with either no backslashes or two backslashes, starting with just one, which is an insane set of options to accept without knowing the single, correct, desired result).
Alternative solution
If your goal is to "fix" bad strings, the best solution is probably to just write your own simple stripping regex, e.g.:
import re
bad_escape_re = re.compile(r'\\(?=[^\n\\\'"abfnrtv0-7xNuU])')
and then use it to strip unrecognized escapes a la:
good_string = bad_escape_re.sub('', bad_string)
which when run like so:
bad_escape_re.sub('', r'\a\b\c\d\e\f\g\,\.\n\t\x12')
produces a string with the repr '\\a\\bcde\\fg,.\\n\\t\\x12'. Note that it's not perfect, and if you need it to validate the extended escapes to distinguish valid uses from invalid ones (\[0-7], \x, \N, \u and \U), it gets more complicated, but those are also cases where it's invariably heuristic and there is no good solution; without a human to interpret, \xab is legal and \xag is not, but it's entirely likely the former wasn't intended as an escape either.
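Putting the two steps together, a sketch of stripping first and decoding afterwards might look like this. The regex is the one from above; the helper name strip_bad_escapes is made up for illustration.

```python
import codecs
import re

bad_escape_re = re.compile(r'\\(?=[^\n\\\'"abfnrtv0-7xNuU])')


def strip_bad_escapes(text):
    """Drop backslashes that do not start a recognized escape."""
    return bad_escape_re.sub('', text)


cleaned = strip_bad_escapes(r'hello\,world\n')
# The remaining \n is a valid escape, so decoding no longer warns.
decoded = codecs.decode(cleaned, 'unicode_escape')
print(repr(decoded))
```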

Importing custom packages in python [duplicate]

Basically when I have a python file like:
python-code.py
and use:
import (python-code)
the interpreter gives me syntax error.
Any ideas on how to fix it? Are dashes illegal in python file names?

You should check out PEP 8, the Style Guide for Python Code:
Package and Module Names Modules should have short, all-lowercase names. Underscores can be used in the module name if it improves readability. Python packages should also have short, all-lowercase names, although the use of underscores is discouraged.
Since module names are mapped to file names, and some file systems are case insensitive and truncate long names, it is important that module names be chosen to be fairly short -- this won't be a problem on Unix, but it may be a problem when the code is transported to older Mac or Windows versions, or DOS.
In other words: rename your file :)

One other thing to note in your code is that import is not a function, so import (python-code) should be import python-code. As some have already mentioned, that is still interpreted as "import python minus code", which is not what you intended. If you really need to import a file with a dash in its name, you can do the following:
python_code = __import__('python-code')
But, as also mentioned above, this is not really recommended. You should change the filename if it's something you control.

TLDR
Dashes are not illegal but you should not use them for 3 reasons:
You need special syntax to import files with dashes
Nobody expects a module name with a dash
It's against the recommendations of the Python Style Guide
If you definitely need to import a file name with a dash the special syntax is this:
module_name = __import__('module-name')
Curious about why we need special syntax?
The reason for the special syntax is that when you write import somename you're creating a module object with identifier somename (so you can later use it with e.g. somename.funcname). Of course module-name is not a valid identifier and hence the special syntax that gives a valid one.
You don't get why module-name is not valid identifier?
Don't worry -- I didn't either. Here's a tip to help you: Look at this python line: x=var1-var2. Do you see a subtraction on the right side of the assignment or a variable name with a dash?
PS
Nothing original in my answer except including what I considered to be the most relevant bits of information from all other answers in one place

The problem is that python-code is not an identifier. The parser sees this as python minus code. Of course this won't do what you're asking. You will need to use a filename that is also a valid python identifier. Try replacing the - with an underscore.

On Python 3 use import_module:
from importlib import import_module
python_code = import_module('python-code')
More generally,
import_module('package.subpackage.module')
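A self-contained sketch of the same call: it writes a throwaway python-code.py into a temporary directory, puts that directory on sys.path, and imports it by name. The file contents and the GREETING attribute are invented for the demonstration.

```python
import sys
import tempfile
from importlib import import_module
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Create a module whose file name contains a dash.
    Path(tmp, "python-code.py").write_text("GREETING = 'hello'\n")
    sys.path.insert(0, tmp)
    try:
        python_code = import_module("python-code")
    finally:
        sys.path.remove(tmp)

print(python_code.GREETING)
```

The module is bound to a valid identifier (python_code) even though its own name contains a dash.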

You could probably import it through some __import__ hack, but if you don't already know how, you shouldn't. Python module names should be valid variable names ("identifiers") -- that means if you have a module foo_bar, you can use it from within Python (print foo_bar). You wouldn't be able to do so with a weird name (print foo-bar -> syntax error).

Although proper file naming is the best course, if python-code is not under our control, a hack using __import__ is better than copying, renaming, or otherwise messing around with other authors' code. However, I tried and it didn't work unless I renamed the file adding the .py extension. After looking at the doc to derive how to get a description for .py, I ended up with this:
import imp
python_code_file = open("python-code")
try:
    python_code = imp.load_module('python_code', python_code_file, './python-code', ('.py', 'U', 1))
finally:
    python_code_file.close()
It created a new file python-codec on the first run.
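On Python 3, where imp is deprecated and eventually removed, a comparable sketch can be built on importlib. This is an assumption about what a modern equivalent could look like, not part of the answer above; load_source_file and the temporary file are hypothetical.

```python
import importlib.util
import os
import tempfile
from importlib.machinery import SourceFileLoader


def load_source_file(module_name, path):
    """Load a Python source file that has no .py extension."""
    loader = SourceFileLoader(module_name, path)
    spec = importlib.util.spec_from_loader(module_name, loader)
    module = importlib.util.module_from_spec(spec)
    loader.exec_module(module)
    return module


with tempfile.TemporaryDirectory() as tmp:
    script = os.path.join(tmp, "python-code")
    with open(script, "w") as handle:
        handle.write("ANSWER = 42\n")
    python_code = load_source_file("python_code", script)

print(python_code.ANSWER)
```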

How to find undocumented methods in my code?

I am writing documentation for a project and I would like to make sure I did not miss any method. The code is written in Python and I am using PyCharm as an IDE.
Basically, I would need a REGEX to match something like:
def method_name(with, parameters):
    someVar = something()
    ...
but it should NOT match:
def method_name(with, parameters):
    """ The doc string """
    ...
I tried using PyCharm's search with REGEX feature with the pattern ):\s*[^"'] so it would match any line after : that doesn't start with " or ' after whitespace, but it doesn't work. Any idea why?

You mentioned you were using PyCharm: there is an inspection "Missing, empty, or incorrect docstring" that you can enable and will do that for you.
Note that you can then change the severity for it to show up more or less prominently.

There is a tool called pydocstyle which checks if all classes, functions, etc. have properly formatted docstrings.
Example from the README:
$ pydocstyle test.py
test.py:18 in private nested class `meta`:
    D101: Docstring missing
test.py:27 in public function `get_user`:
    D300: Use """triple double quotes""" (found '''-quotes)
test:75 in public function `init_database`:
    D201: No blank lines allowed before function docstring (found 1)
I don't know about PyCharm, but pydocstyle can, for example, be integrated in Vim using the Syntastic plugin.

I don't know python, but I do know my regex.
And your regex has issues. First of all, as comments have mentioned, you may have to escape the closing parenthesis. Secondly, you don't match the new line following the function declaration. Finally, you look for single or double quotations at the START of a line, yet the start of a line contains whitespace.
I was able to match your sample file with \):\s*\n\s*["']. This is a multiline regex. Not all programs are able to match multiline regex. With grep, for example, you'd have to use this method.
A quick explanation of what this regex matches: it looks for a closing parenthesis followed by a colon. Any amount of optional whitespace may follow that. Then there should be a new line followed by any amount of whitespace (indentation, in this case). Finally, there must be a single or double quote. Note that this matches functions that do have docstrings. You'd want to invert this to find those without.
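One way to invert it, sketched in Python's re syntax rather than PyCharm's search dialog: require that the first non-whitespace character on the line after the signature is not a quote. This is a heuristic and an assumption on my part; it would misreport docstrings with a prefix such as r""" and signatures that span several lines.

```python
import re

# Captures the function name when the next line does not start with a quote.
undocumented_def = re.compile(
    r"def\s+(\w+)\s*\([^)]*\):[ \t]*\n[ \t]*(?=[^\"'\s])"
)

sample = '''
def documented(a, b):
    """The doc string."""
    return a + b

def undocumented(a, b):
    some_var = a + b
    return some_var
'''

print(undocumented_def.findall(sample))  # ['undocumented']
```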

In case PyCharm is not available, there is a little tool called ckdoc written in Python 3.5.
Given one or more files, it finds modules, classes and functions without a docstring. It doesn't search in imported built-in or external libraries – it only considers objects defined in files residing in the same folder as the given file, or subfolders of that folder.
Example usage (after removing some docstrings)
> ckdoc/ckdoc.py "ckdoc/ckdoc.py"
ckdoc/ckdoc.py
module
ckdoc
function
Check.documentable
anykey_defaultdict.__getitem__
group_by
namegetter
type
Check
There are cases when it doesn't work. One such case is when using Anaconda with modules. A possible workaround in that case is to use ckdoc from the Python shell. Import the necessary modules and then call the check function.
> import ckdoc, main
> ckdoc.check(main)
/tmp/main.py
module
main
function
main
/tmp/custom_exception.py
type
CustomException
function
CustomException.__str__
False
The check function returns True if there are no missing docstrings.
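For comparison, the same kind of check can be sketched with the standard library ast module. This is not how ckdoc is implemented; it is only an illustration, and find_undocumented is a made-up name.

```python
import ast


def find_undocumented(source):
    """Return names of functions and classes in source lacking a docstring."""
    tree = ast.parse(source)
    missing = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            if ast.get_docstring(node) is None:
                missing.append((node.lineno, node.name))
    # Report in source order.
    return [name for _, name in sorted(missing)]


sample = '''
def documented(a, b):
    """The doc string."""
    return a + b

def undocumented(a, b):
    some_var = a + b
    return some_var
'''

print(find_undocumented(sample))  # ['undocumented']
```

Because it parses the code instead of pattern-matching text, it is not confused by indentation, multi-line signatures, or quote style.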

How to properly handle non ASCII strings in python

I'm building an application that in the database has data with latin symbols. Users are able to enter this data.
What I've been doing so far is encode('latin2') every user input and decode('latin2') at the very end when displaying data in the template.
This is a bit annoying and I'm wondering if there is any better way of handling this.

Python's unicode type is designed to be the "natural" representation for strings. Besides the unicode type, strings are expected to be in some unspecified encoding but there's no way to "tag" them with the encoding used, and python will very insistently assume that strings are in ASCII or UTF-8 encoding. As such, you're probably asking for headaches if you write your whole program to assume that str means latin2. Encoding problems have a way of creeping in at odd places in the code and percolating through layers, sometimes getting bad data in your database, and ultimately causing odd behavior or nasty errors somewhere completely unrelated and impossible to debug.
I would recommend you see about converting your db data to UTF-8.
If you can't do that, I would strongly recommend moving your encoding/decoding calls right up to the moment you transmit data to/from the database. If you have any sort of database abstraction layer, you can probably configure it to handle that for you more or less automatically. Then you should make sure any user input is converted to the unicode type right away.
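A minimal sketch of keeping the conversion at the database boundary, written for Python 3 where str is the unicode type. The names DB_ENCODING, from_db and to_db are hypothetical; a real database layer would usually offer a setting for this.

```python
DB_ENCODING = 'latin2'


def from_db(raw):
    """Decode bytes read from the database into text."""
    return raw.decode(DB_ENCODING)


def to_db(text):
    """Encode text just before it is written to the database."""
    return text.encode(DB_ENCODING)


# A sample name with Latin-2 characters, written with escapes for portability.
name = '\u0141\xf3d\u017a'
stored = to_db(name)
print(from_db(stored) == name)
```

Everything between these two calls works with text only, so the rest of the application never needs to know which encoding the database uses.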
Using unicode types and explicitly encoding/decoding this way also has the advantage that if you do have encoding problems, you will probably notice sooner and you can just throw unicode-nazi at them to track them down (see How can you make python 2.x warn when coercing strings to unicode?).
For your markup problem: Flask and Jinja2 will by default escape any unsafe characters in your strings before rendering them into your HTML. To override the autoescaping, just use the safe filter:
<h1>More than just text!</h1>
<div>{{ html_data|safe }}</div>
See Flask Templates: Controlling Autoescaping for details, and use this with extreme caution since you're effectively loading code from the database and executing it. In real life, you'll probably want to scrub the data (see Python HTML sanitizer / scrubber / filter or Jinja2 escape all HTML but img, b, etc).

Try adding this to the top of your program.
import sys
reload(sys)
sys.setdefaultencoding('latin2')
We have to reload sys because:
>>> import sys
>>> sys.setdefaultencoding
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute 'setdefaultencoding'
>>> reload(sys)
<module 'sys' (built-in)>
>>> sys.setdefaultencoding
<built-in function setdefaultencoding>

How to make Python use a path that contains colons in it?

I have a program that includes an embedded Python 2.6 interpreter. When I invoke the interpreter, I call PySys_SetPath() to set the interpreter's import-path to the subdirectories installed next to my executable that contain my Python script files... like this:
PySys_SetPath("/path/to/my/program/scripts/type1:/path/to/my/program/scripts/type2");
(except that the path strings are dynamically generated based on the current location of my program's executable, not hard-coded as in the example above)
This works fine... except when the clever user decides to install my program underneath a folder that has a colon in its name. In that case, my PySys_SetPath() command ends up looking like this (note the presence of a folder named "path:to"):
PySys_SetPath("/path:to/my/program/scripts/type1:/path:to/my/program/scripts/type2");
... and this breaks all my Python scripts, because now Python looks for script files in "/path", and "to/my/program/scripts/type1" instead of in "/path:to/my/program/scripts/type1", and so none of the import statements work.
My question is, is there any fix for this issue, other than telling the user to avoid colons in his folder names?
I looked at the makepathobject() function in Python/sysmodule.c, and it doesn't appear to support any kind of quoting or escaping to handle literal colons.... but maybe I am missing some nuance.

The problem you're running into is that the PySys_SetPath function parses the string you pass using a colon as the delimiter. That parser sees each : character as delimiting a path, and there isn't a way around this (the colon can't be escaped).
However, you can bypass this by creating a list of the individual paths (each of which may contain colons) and use PySys_SetObject to set the sys.path:
PyObject *path = PyList_New(0);
PyObject *entry = PyString_FromString("foo:bar");
PyList_Append(path, entry);      /* does not steal the reference to entry */
Py_DECREF(entry);
PySys_SetObject("path", path);   /* sys keeps its own reference */
Py_DECREF(path);
Now the interpreter will see "foo:bar" as a distinct component of the sys.path.
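Seen from the Python side, the difference is between splitting one delimited string and keeping a list of entries. The sketch below only illustrates that with the paths from the question; it does not call the C API.

```python
paths = [
    "/path:to/my/program/scripts/type1",
    "/path:to/my/program/scripts/type2",
]

# Joining and re-splitting on ":" is what breaks the folder named "path:to".
joined = ":".join(paths)
print(joined.split(":"))

# Kept as a list, each entry survives intact, colon included.
print(paths)
```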

Supporting colons in a file path opens up a huge can of worms on multiple operating systems; it is not a valid path character on Windows or Mac OS X, for example, and it doesn't seem like a particularly reasonable thing to support in the context of a scripting environment either for exactly this reason. I'm actually a bit surprised that Linux allows colon filenames too, especially since : is a very common path separator character.
You might try escaping the colon out, i.e. converting /path:to/ to /path\:to/ and see if that works. Other than that, just tell the user to avoid using colons in their file names. They will run into all sorts of problems in quite a few different environments and it's just a plain bad idea.
