Array of unicode literals - python

I'm writing code that's supposed to be Python 2.7 and Python 3.3+ compatible. When trying to run my code with Python 2.7, I get the following problem.
I'm importing unicode_literals from __future__ in each file and I'm having trouble getting the array function to work.
from __future__ import unicode_literals
from array import array
Trying to make a character array doesn't work
array("c", "test")
> TypeError: must be char, not unicode
Trying to make a unicode array also doesn't work
array("u", "test")
> TypeError: must be char, not unicode
Can I make an array that is compatible with unicode_literals?

This error is being thrown because of the first argument to array(), the typecode. In Python 2 this must be a non-unicode character (string of length 1), and in Python 3 this must be a unicode character. Since both of these are the meaning of str in the respective Python versions, this works in both:
array(str('u'), 'test')

This is a limitation of the array module that was fixed recently (https://bugs.python.org/issue20014). On Python 2.7.11 (or newer) the array constructor will accept both str and unicode as the first argument.
As a workaround you can use e.g. array(str("u"), "test"). See the other answer by Dan Getz for an explanation of why this works.
Note that your first example using the "c" typecode still won't work on either Python 2.7 or Python 3.x. On Python 2.7, you need to pass a bytestring as the second argument (e.g. b"test"). The "c" typecode was removed in Python 3.0, so you should use "b" or "B" instead.
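As a rough sketch of the advice in these answers (the helper name make_byte_array is made up for illustration), wrapping the typecode in str() and passing a bytestring as the initializer should behave the same on Python 2.7 and Python 3:

```python
from __future__ import unicode_literals

from array import array


def make_byte_array(data):
    """Build an unsigned-byte array from a bytestring (hypothetical helper)."""
    # str("B") is a byte string on Python 2 and a text string on Python 3,
    # which is the type array() expects for the typecode in each version.
    return array(str("B"), data)


byte_array = make_byte_array(b"test")
print(byte_array.tolist())  # [116, 101, 115, 116]
```

The "B" typecode is used here instead of "c" because "c" does not exist on Python 3.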

Related

Python string input gives error unless with quotes

I thought the input method would take any numeric or string input and print it out. But for a string it does not work unless the string is quoted. Why?
In Python 2.7, input() evaluates the input as code, so strings need to be quoted. Python 2.7 has a function called raw_input() that treats all input as strings (no quotes needed).
In Python 3.x, the Python 2.7 raw_input() function was renamed input(), and the Python 2.7 input() behavior can be reproduced with eval(input()).
So you can use raw_input() in Python 2.7 or switch to Python 3.x.
Make sure that you are running the code on Python 3 and not Python 2.
On Python 2 you need to wrap your input in quotes ("").
There seem to be multiple solutions to this, though; see:
input() error - NameError: name '...' is not defined
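If the same script has to run on both versions, one common pattern is to pick the non-evaluating reader once at import time. This is only a sketch; the names read_line and ask are invented for the example:

```python
try:
    read_line = raw_input  # Python 2: raw_input() returns the typed text as-is
except NameError:
    read_line = input  # Python 3: input() already behaves like raw_input()


def ask(prompt, reader=read_line):
    """Return the user's answer as a plain string, without evaluating it."""
    return reader(prompt)
```

Calling ask("Name: ") then accepts unquoted text on either version. The reader parameter exists only so that the function can be exercised without a terminal.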

Equivalent to unicode() function that works with both Python 2.7 and 3.x?

I'm trying to adapt some old code to make it usable with both Python 2 and 3. I'm using the six package for this task.
If I have u'abc' in 2.7, I can use the six.u() function and replace it with six.u('abc') to make it work in both 2.7 and 3.x.
How do I do something similar for:
unicode(value, errors='ignore', encoding='utf-8')
There is no unicode function in 3.x and I can't just replace it with str because that will change the meaning in 2.7.
if isinstance(value, basestring): # do something
There is no basestring in 3.x and again I can't just replace it with str without changing the meaning.
Of course, I can use the py2/3 checks with six.PY2 or six.PY3 to run one of two versions but is there a better way?
To answer the second part of the question, you can replace if isinstance(value, basestring): with six.string_types:
import six
if isinstance(value, six.string_types):
    pass
To answer the first part, I would first recommend putting this at the top of your code:
from __future__ import unicode_literals
This will make all your Python2 str literals become unicode which will be a big first step in compatibility.
Second, if you really need some sort of compatibility conversion function, try this:
def py23_str(value):
    try:  # Python 2
        return unicode(value, errors='ignore', encoding='utf-8')
    except NameError:  # Python 3
        try:
            return str(value, errors='ignore', encoding='utf-8')
        except TypeError:  # Wasn't a bytes object, no need to decode
            return str(value)
I will say that I have written a few Python2/3 compatible libraries, and I have never needed to do this. Adding from __future__ import unicode_literals at the top of the code and calling .decode on bytes (or str in Python2) objects when they are created (i.e. reading from file in 'rb' mode) is all that I have needed so far.
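For readers who would rather not depend on exceptions from unicode() itself, here is a standard-library-only sketch of the same idea. The names string_types, text_type and to_text are assumptions made for this example (they mirror, but are not, the six attributes):

```python
try:
    string_types = (basestring,)  # Python 2: covers both str and unicode
    text_type = unicode
except NameError:
    string_types = (str,)  # Python 3: str is the only string type
    text_type = str


def to_text(value, encoding="utf-8", errors="ignore"):
    """Return value as text, decoding byte strings (hypothetical helper)."""
    if isinstance(value, bytes):  # on Python 2, bytes is an alias of str
        return value.decode(encoding, errors)
    if isinstance(value, text_type):
        return value
    return text_type(value)


print(to_text(b"abc") == to_text("abc"))  # True
```

Checking the type up front avoids the nested try/except and also handles values that are already text on Python 2.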

What is tensorflow.compat.as_str()?

In the Google/Udemy Tensorflow tutorial there is the following code:
import tensorflow as tf
...
def read_data(filename):
    """Extract the first file enclosed in a zip file as a list of words"""
    with zipfile.ZipFile(filename) as f:
        data = tf.compat.as_str(f.read(f.namelist()[0])).split()
    return data
This executes fine but I cannot find compat.as_str in the Tensorflow documentation or anywhere else.
Q1: What does compat.as_str do?
Q2: Is this tensorflow compat library documented somewhere?
Q3: This is a call to the tensorflow library, so how and why does it work in normal python code, rather than inside a tensorflow graph? I.e. I thought tensorflow library calls had to be inside a tensorflow graph definition block:
graph = tf.Graph()
with graph.as_default():
    ... tensorflow function calls here ...
I am running python 2.7.
Basically, it comes from the fact that in Python 2, strings were dealt with primarily as bytes, not unicode.
In Python 3, all strings are natively unicode.
The purpose of the function is to ensure that whichever Python version you're using, you won't be bothered, hence the compat module name standing for compatibility.
Under the hood, tensorflow.compat.as_str converts both bytes and unicode strings to unicode strings.
Signature: tensorflow.compat.as_str(bytes_or_text, encoding='utf-8')
Docstring:
Returns the given argument as a unicode string.
Args:
  bytes_or_text: A `bytes`, `str`, or `unicode` object.
  encoding: A string indicating the charset for decoding unicode.
Returns:
  A `unicode` (Python 2) or `str` (Python 3) object.
Raises:
  TypeError: If `bytes_or_text` is not a binary or unicode string.
The library is documented here.
tf.compat.as_str converts its input into a string (the native str type of the running Python version).
I couldn't find any documentation, but you can look at the source code here
TensorFlow works as an ordinary Python module. The graph context is only used to define a graph (the mathematical computations) that will be used to train the model.
Typical usage involves the Graph.as_default() context manager, which overrides the current default graph for the lifetime of the context.
In a current version of TF, the whole tf.compat group is nicely documented.
Basically, some things behave differently in Python 2 and 3, most notably strings: a plain str is a byte string in Python 2 and a Unicode string in Python 3. The compat module tries to make things behave the same way (if you check the source code you will see that it does different things depending on whether you use 2 or 3).
tf.compat.as_str:
Converts either bytes or unicode to bytes, using utf-8 encoding for text.
This can be helpful if you save the data in tfrecords and want to make sure it will be saved in the same way no matter which python version is used.
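To make the behavior described in these answers concrete, here is a pure-Python sketch of a function that returns the native str type on either version. It is not TensorFlow's implementation; the name as_native_str and the module-level constants are assumptions for illustration only:

```python
import sys

PY2 = sys.version_info[0] == 2
text_type = unicode if PY2 else str  # "unicode" is only evaluated on Python 2


def as_native_str(value, encoding="utf-8"):
    """Return value as the running interpreter's str type (hypothetical helper)."""
    if isinstance(value, str):
        return value
    if isinstance(value, bytes):  # only reached on Python 3
        return value.decode(encoding)
    if isinstance(value, text_type):  # only reached on Python 2 (unicode)
        return value.encode(encoding)
    raise TypeError("Expected binary or unicode string, got %r" % (value,))


words = as_native_str(b"the quick brown fox").split()
print(words)  # ['the', 'quick', 'brown', 'fox']
```

This mirrors the read_data use case in the question: the bytes read from the zip file become a str, so split() yields str words on both versions.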

PyPDF2<=1.19 has issues with PDF encoding

I am trying to encrypt PDF files under Python 3.3.2 using PyPDF2.
The code is very simple:
password = 'password';
# password = password.encode('utf-8')
PDFout.encrypt(user_pwd=password,owner_pwd=password)
however I am getting the following errors, depending if the encoding is on or off:
on: TypeError: slice indices must be integers or None or have an __index__ method
off: TypeError: Can't convert 'bytes' object to str implicitly
Would you know by any chance how to resolve that problem?
Thanks and Regards
Peter
It appears to me that the current version of PyPDF2 (1.19 as of this writing) has some bugs concerning compatibility with Python 3, and that is what is causing both error messages. The change log on GitHub for PyPDF2 indicates that Python 3 support was added in version 1.16, which was released only 3 1/2 months ago, so it is possible this bug has not yet been reported or fixed. GitHub also shows that there is a branch of this project specifically for Python 3.3 support, which is not currently merged back into the main branch.
Both errors occur in the pdf.py file of the PyPDF2 module. Here is what is happening:
The PyPDF2 module creates some extra bytes as padding and concatenates it with your password. If the Python version is less than 3, the padding is created as a string literal. If the version is 3 or higher, the padding is encoded using the 'latin-1' encoding. In Python 3, this means the padding is a bytes object, and concatenating that with a string object (your password) produces the TypeError you saw. Under Python 2, the concatenation would work because both objects would be the same type.
When you encode your password using "utf-8", you resolve that problem since both the password and padding are bytes objects in that case. However, you end up running into a second bug later in the module. The pdf.py file creates and uses a variable "keylen" like this:
keylen = 128 / 8
... # later on in the code...
key = md5_hash[:keylen]
The division operator underwent a change in Python 2.2 which altered its default behavior starting in Python 3. In brief, when applied to two ints, "/" means floor division in Python 2 and returns an int, but it means true division in Python 3 and returns a float. Therefore, "keylen" would be 16 in Python 2, but instead 16.0 in Python 3. Floats, unlike ints, can't be used as slice indices, so Python 3 throws the TypeError you saw when md5_hash[:keylen] is evaluated. Python 2 would run this without error, since keylen would be an int.
You could resolve this second problem by altering the module's source code to use the "//" operator (which means floor division and returns an int in both Python 2 and 3):
keylen = 128 // 8
However, you would then run into a third bug later in the code, also related to Python 3 compatibility. I won't belabor the point by describing it. The short answer to your question then, as far as I see it, is to either use Python 2, or patch the various code compatibility problems, or use a different PDF library for Python which has better support for Python 3 (if one exists which meets your particular requirements).
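The first two failures described in this answer can be reproduced in isolation. The sketch below does not use PyPDF2 at all: md5_hash is stand-in bytes rather than a real digest, and the padding value is made up.

```python
from __future__ import division  # make "/" behave like Python 3 on Python 2 too

# Stand-in bytes playing the role of the hash; not a real MD5 digest.
md5_hash = bytes(bytearray(range(32)))

float_keylen = 128 / 8   # true division gives 16.0
int_keylen = 128 // 8    # floor division gives 16

try:
    md5_hash[:float_keylen]
    float_slice_ok = True
except TypeError:
    float_slice_ok = False  # floats cannot be used as slice indices

key = md5_hash[:int_keylen]  # works: the index is an int

# Mixing text and bytes fails on Python 3, so encode the password first.
padding = b"\x00\x01"  # made-up padding bytes, not PyPDF2's real values
password = "password"
padded = password.encode("latin-1") + padding
```

Using "//" keeps the key length an int on both versions, and encoding the password before concatenation keeps both operands as bytes.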
Try installing the most recent version of PyPDF2 - it now fully supports Python 3!
It seems that "some" support was added in 1.16, but it didn't cover all features. Now, Py 3 should be fully compatible with this library.

Is there any way I can import the python 3 version of 'bytes' into python 2?

I am writing python code and I want to support python 2 and 3. One of the most prominent data types I'm dealing with is immutable sequences of bytes, so I want to find an elegant way to deal with the disparity between python 2 'bytes' (aliased to 'str') and python 3 'bytes' (specifically, the different ways in which they slice and iterate are very annoying to me).
At first I tried using 'bytearray' because it seemed to have the same behavior in both python 2 and 3, but the fact that it is mutable is problematic, because I need my objects to be hashable.
If there is no way to access the python3 'bytes' behavior in python 2, the current solution I'm thinking of trying is this: convert all sequences (whether they be python 2 'bytes'/'str' or python 3 'bytes') to tuples of integers.
Is there anything else I should consider for a solution assuming I can't use the python 3 'bytes' type in python 2?
Use the six module and its b() function or binary_type class. This takes the burden of checking the Python version off you.
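For the slicing and iteration differences mentioned in the question, a standard-library-only sketch along the lines of the asker's tuple idea might look like this (byte_values and byte_at are hypothetical helper names):

```python
def byte_values(data):
    """Return the bytes in data as a hashable tuple of ints."""
    # bytearray iterates as ints on both Python 2 and Python 3.
    return tuple(bytearray(data))


def byte_at(data, index):
    """Return a length-1 byte string; slicing behaves the same in both versions."""
    return data[index:index + 1]


print(byte_values(b"abc"))  # (97, 98, 99)
```

Going through bytearray only for the conversion sidesteps its mutability, since the resulting tuple is immutable and can be used as a dictionary key.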
