PyPDF2<=1.19 has issues with PDF encoding

I am trying to encrypt PDF files under Python 3.3.2 using PyPDF2.
The code is very simple:
password = 'password'
# password = password.encode('utf-8')
PDFout.encrypt(user_pwd=password,owner_pwd=password)
However, I get one of the following errors, depending on whether the encoding line is enabled:
on: TypeError: slice indices must be integers or None or have an __index__ method
off: TypeError: Can't convert 'bytes' object to str implicitly
Would you know by any chance how to resolve that problem?
Thanks and Regards
Peter

It appears to me that the current version of PyPDF2 (1.19 as of this writing) has some bugs concerning compatibility with Python 3, and that is what is causing both error messages. The change log on GitHub for PyPDF2 indicates that Python 3 support was added in version 1.16, which was released only 3 1/2 months ago, so it is possible this bug either hasn't been reported or hasn't been fixed yet. GitHub also shows that there is a branch of this project specifically for Python 3.3 support, which is not currently merged back into the main branch.
Both errors occur in the pdf.py file of the PyPDF2 module. Here is what is happening:
The PyPDF2 module creates some extra bytes as padding and concatenates it with your password. If the Python version is less than 3, the padding is created as a string literal. If the version is 3 or higher, the padding is encoded using the 'latin-1' encoding. In Python 3, this means the padding is a bytes object, and concatenating that with a string object (your password) produces the TypeError you saw. Under Python 2, the concatenation would work because both objects would be the same type.
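To make the first error concrete, here is a minimal sketch for Python 3. The padding value is an illustrative stand-in, not the bytes PyPDF2 actually uses, and the variable names are made up for this example.

```python
# Illustrative stand-in for the library's padding (assumed value).
padding = "\x28\xbf\x4e\x5e".encode("latin-1")  # a bytes object on Python 3

password = "password"  # a str object on Python 3

error = None
try:
    # str + bytes is not allowed on Python 3
    padded = password + padding
except TypeError as exc:
    error = exc
    print("concatenation failed:", exc)

# Encoding the password first makes both operands bytes, so this works.
padded = password.encode("utf-8") + padding
print(padded)
```

The exact wording of the TypeError message differs between Python 3 releases, but the cause is the same mixed str/bytes concatenation.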
When you encode your password using "utf-8", you resolve that problem since both the password and padding are bytes objects in that case. However, you end up running into a second bug later in the module. The pdf.py file creates and uses a variable "keylen" like this:
keylen = 128 / 8
... # later on in the code...
key = md5_hash[:keylen]
The behavior of the division operator changed between Python 2 and Python 3 (the transition began in Python 2.2, which introduced "//"). In brief, "/" applied to two ints means floor division in Python 2 and returns an int, but it means true division in Python 3 and returns a float. Therefore, "keylen" would be 16 in Python 2, but 16.0 in Python 3. Floats, unlike ints, can't be used as slice indices, so Python 3 throws the TypeError you saw when md5_hash[:keylen] is evaluated. Python 2 would run this without error, since keylen would be an int.
You could resolve this second problem by altering the module's source code to use the "//" operator (which means floor division and returns an int in both Python 2 and 3):
keylen = 128 // 8
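The difference between the two operators can be checked in isolation. This is a small sketch for Python 3; the digest below is only a stand-in for the hash the library computes, and the variable names are chosen for this example.

```python
import hashlib

# Stand-in digest; the library hashes different input.
md5_hash = hashlib.md5(b"example").digest()

keylen_true = 128 / 8    # true division: a float on Python 3
keylen_floor = 128 // 8  # floor division: an int on Python 2 and 3

try:
    md5_hash[:keylen_true]
    float_slice_ok = True
except TypeError:
    # A float cannot be used as a slice index.
    float_slice_ok = False

key = md5_hash[:keylen_floor]
print(type(keylen_true).__name__, type(keylen_floor).__name__, float_slice_ok)
```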
However, you would then run into a third bug later in the code, also related to Python 3 compatibility. I won't belabor the point by describing it. The short answer to your question then, as far as I see it, is to either use Python 2, or patch the various code compatibility problems, or use a different PDF library for Python which has better support for Python 3 (if one exists which meets your particular requirements).

Try installing the most recent version of PyPDF2 - it now fully supports Python 3!
It seems that "some" support was added in 1.16, but it didn't cover all features. Now, the library should be fully compatible with Python 3.

Related

What is the equivalent of Base64.encodeToString(bytes,11) in python?

I am trying to recreate a java program in python, but I am stuck at this point
import android.util.Base64 -> this is the package
Base64.encodeToString(bytes, 11)
The Python module for encoding with base64 takes only one parameter, which is bytes.
Apparently, according to the Android docs, the second parameter is supposed to be a set of flags, but I can't find any information about the number 11.
What does this mean and how can I implement this in python?

11 is a combination of flags (or-ed together), specifically:
NO_PADDING, which has value 1
NO_WRAP which has value 2 and
URL_SAFE which has value 8.
I don't know the exact way to reproduce this in Python, but I believe base64.urlsafe_b64encode gets you halfway there by implementing the equivalent of URL_SAFE. For NO_PADDING you could simply trim any trailing padding (i.e. = characters) in the output.
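Putting those pieces together, a possible sketch looks like this. The function name is made up for this example. It relies on base64.urlsafe_b64encode not inserting line breaks, which should cover NO_WRAP as well.

```python
import base64

def encode_android_flag_11(data):
    """Approximate Base64.encodeToString(data, 11) for a bytes input."""
    # URL_SAFE: use '-' and '_' instead of '+' and '/'.
    encoded = base64.urlsafe_b64encode(data)
    # NO_PADDING: drop the trailing '=' characters.
    return encoded.rstrip(b"=").decode("ascii")

print(encode_android_flag_11(b"hi"))        # aGk
print(encode_android_flag_11(b"\xfb\xff"))  # -_8
```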

SHA512 crypt returns *0 when rounds=5000

For the past few days, the following Python program has returned *0:
import crypt
# broken:
>>> crypt.crypt('pw', '$6$rounds=5000$0123456789abcdef')
'*0'
# works:
>>> crypt.crypt("pw", '$6$0123456789abcdef')
'$6$0123456789abcdef$zAYvvEJcrKSqV2KUPTUM1K9eaGv20n9mUjWSDZW0QnwBRk0L...'
>>> crypt.crypt('pw', '$6$rounds=5001$0123456789abcdef')
'$6$rounds=5001$0123456789abcdef$mG98GkftS5iu1VOpowpXm1fgefTbWnRm4rbw...'
>>> crypt.crypt("pw", '$6$rounds=4999$0123456789abcdef')
'$6$rounds=4999$0123456789abcdef$ulXwrQtpwNd/t6NVUJo53AXMpp40IrpCHFyC...'
I did the same with a small C program using crypt_r and the output was the same. I read in some posts that *0 and *1 will be returned when there are errors.
According to the manpage crypt(3) specifying the rounds=xxx parameter is supported since glibc 2.7 and the default is 5000, when no rounds parameter is given (like in the second example). But why am I not allowed to set rounds to 5000?
I'm using Fedora 28 with glibc 2.27. The results are the same with different Python versions (even Python2 and Python3). Using crypt in PHP also works as expected. But the most interesting thing is that running the same command in a Docker container (fedora:28) works:
>>> crypt.crypt("pw", '$6$rounds=5000$0123456789abcdef')
'$6$rounds=5000$0123456789abcdef$zAYvvEJcrKSqV2KUPTUM1K9eaGv20n9mUjWS...'
Does anybody know the reason for this behavior?

The libxcrypt sources contain this:
/* Do not allow an explicit setting of zero rounds, nor of the
default number of rounds, nor leading zeroes on the rounds. */
This was introduced in a commit “Add more tests based on gaps in line coverage.” with this comment:
This change makes us pickier about non-default round parameters to $5$ and $6$ hashes; numbers outside the valid range are now rejected, as are numbers with leading zeroes and an explicit request for the
default number of rounds. This is in keeping with the observation, in
the Passlib documentation, that allowing more than one valid crypt
output string for any given (rounds, salt, phrase) triple is asking
for trouble.
I suggest opening an issue if this causes too many compatibility problems. Alternatively, you can remove the rounds=5000 specification, but based on a quick glance, the change looks to me as if it should be reverted. It's not part of the original libcrypt implementation in glibc.
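If the setting strings are generated by your own code, the workaround of dropping the explicit default could be sketched as below. The helper name and the regular expression are assumptions for illustration, not part of libxcrypt or the crypt module.

```python
import re

# Matches an explicit "rounds=5000" right after the $5$ or $6$ prefix.
_DEFAULT_ROUNDS = re.compile(r"^(\$[56]\$)rounds=5000\$")

def drop_default_rounds(setting):
    """Remove an explicit request for the default number of rounds."""
    return _DEFAULT_ROUNDS.sub(r"\1", setting)

print(drop_default_rounds("$6$rounds=5000$0123456789abcdef"))
print(drop_default_rounds("$6$rounds=5001$0123456789abcdef"))
```

Other round counts are left alone. Note that the resulting hash string no longer carries the rounds=5000 prefix, so it differs textually from strings that were stored with it.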

What is tensorflow.compat.as_str()?

In the Google/Udemy Tensorflow tutorial there is the following code:
import tensorflow as tf
...
def read_data(filename):
    """Extract the first file enclosed in a zip file as a list of words"""
    with zipfile.ZipFile(filename) as f:
        data = tf.compat.as_str(f.read(f.namelist()[0])).split()
    return data
This executes fine but I cannot find compat.as_str in the Tensorflow documentation or anywhere else.
Q1: What does compat.as_str do?
Q2: Is this tensorflow compat library documented somewhere?
Q3: This is a call to the tensorflow library, so how and why does it work in normal python code, rather than inside a tensorflow graph? I.e. I thought tensorflow library calls had to be inside a tensorflow graph definition block:
graph = tf.Graph()
with graph.as_default():
    ... tensorflow function calls here ...
I am running python 2.7.

Basically, it comes from the fact that in Python 2, strings were dealt with primarily as bytes, not unicode.
In Python 3, all strings are natively unicode.
The purpose of the function is to ensure that whichever Python version you're using, you won't be bothered, hence the compat module name standing for compatibility.
Under the hood, tensorflow.compat.as_str converts both bytes and unicode strings to unicode strings.
Signature: tensorflow.compat.as_str(bytes_or_text, encoding='utf-8')
Docstring:
Returns the given argument as a unicode string.
Args:
bytes_or_text: A `bytes`, `str`, or `unicode` object.
encoding: A string indicating the charset for decoding unicode.
Returns:
A `unicode` (Python 2) or `str` (Python 3) object.
Raises:
TypeError: If `bytes_or_text` is not a binary or unicode string.
The library is documented here.
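To illustrate the behavior described in the docstring, here is a rough pure-Python sketch for Python 3. It is not TensorFlow's actual source, only an approximation of what the docstring promises.

```python
def as_str(bytes_or_text, encoding="utf-8"):
    """Return the argument as a text string (Python 3 sketch)."""
    if isinstance(bytes_or_text, bytes):
        # Raw bytes, e.g. from ZipFile.read(), are decoded to text.
        return bytes_or_text.decode(encoding)
    if isinstance(bytes_or_text, str):
        # Text is returned unchanged.
        return bytes_or_text
    raise TypeError("Expected binary or unicode string, got %r" % (bytes_or_text,))

# Mirrors the tutorial's usage: decode, then split into words.
words = as_str(b"anarchism originated as a term").split()
print(words)
```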

tf.compat.as_str converts input into a string.
I couldn't find any documentation, but you can look at the source code here.
Tensorflow functions as a python module. The graph context is used to define a graph (mathematical computations) that will be used to train the model.
Typical usage involves the Graph.as_default() context manager, which overrides the current default graph for the lifetime of the context.

In a current version of TF, the whole tf.compat group is nicely documented.
Basically, some things behave differently in Python 2 and 3 (this might be slightly inaccurate; Python gurus can help me with this). For example, "/" on two integers floors in Python 2 but returns a float in Python 3, and there are also differences with respect to strings. The compat module tries to make things behave in the same way (if you check the source code, you will see that it does different things depending on whether you use 2 or 3).
tf.compat.as_str:
Converts either bytes or unicode to bytes, using utf-8 encoding for
text.
This can be helpful if you save the data in tfrecords and want to make sure it will be saved in the same way no matter which python version is used.

Array of unicode literals

I'm writing code that's supposed to be Python 2.7 and Python 3.3+ compatible. When trying to run my code with Python 2.7, I get the following problem.
I'm importing unicode_literals from __future__ in each file and I'm having trouble getting the array function to work.
from __future__ import unicode_literals
from array import array
Trying to make a character array doesn't work
array("c", "test")
> TypeError: must be char, not unicode
Trying to make a unicode array also doesn't work
array("u", "test")
> TypeError: must be char, not unicode
Can I make an array that is compatible with unicode_literals?

This error is being thrown because of the first argument to array(), the typecode. In Python 2 this must be a non-unicode character (string of length 1), and in Python 3 this must be a unicode character. Since both of these are the meaning of str in the respective Python versions, this works in both:
array(str('u'), 'test')

This is a limitation of the array module that was fixed recently (https://bugs.python.org/issue20014). On Python 2.7.11 (or newer) the array constructor will accept both str and unicode as the first argument.
As a workaround you can use e.g. array(str("u"), "test"). See the other answer by Dan Getz for an explanation of why this works.
Note that your first example using the "c" typecode still won't work on either Python 2.7 or Python 3.x. On Python 2.7, you need to pass a bytestring as the second argument (e.g. by passing b"test" as the second argument). The "c" typecode was removed in Python 3.0, so you should use "b" or "B" instead.
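As a small sketch of the "B" replacement for the "c" typecode, the following should run on Python 3, and on Python 2.7 even with unicode_literals in effect, because str("B") always yields the native string type:

```python
from array import array

# str(...) gives the native str type that array() expects as a typecode.
values = array(str("B"), b"test")

# Each item is an unsigned byte value.
print(list(values))  # [116, 101, 115, 116] on Python 3
```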

Is there any way I can import the python 3 version of 'bytes' into python 2?

I am writing python code and I want to support python 2 and 3. One of the most prominent data types I'm dealing with is immutable sequences of bytes, so I want to find an elegant way to deal with the disparity between python 2 'bytes' (aliased to 'str') and python 3 'bytes' (specifically, the different ways in which they slice and iterate are very annoying to me).
At first I tried using 'bytearray' because it seemed to have the same behavior in both python 2 and 3, but the fact that it is mutable is problematic, because I need my objects to be hashable.
If there is no way to access the python3 'bytes' behavior in python 2, the current solution I'm thinking of trying is this: convert all sequences (whether they be python 2 'bytes'/'str' or python 3 'bytes') to tuples of integers.
Is there anything else I should consider for a solution assuming I can't use the python 3 'bytes' type in python 2?

Use the six module and its b() literal or binary_type class. This will take the burden of checking the Python version off you.
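If you go with the tuple-of-integers fallback from the question instead, a standard-library-only sketch could look like this. The helper name is made up for this example. It goes through bytearray, which iterates as integers on both Python 2 and 3.

```python
def to_int_tuple(data):
    """Convert a byte string to a hashable tuple of integers."""
    # bytearray yields ints when iterated, regardless of Python version.
    return tuple(bytearray(data))

key = to_int_tuple(b"abc")
print(key)          # (97, 98, 99)
print(key[0])       # 97, indexing gives an int on both versions
lookup = {key: "hashable"}
```

Slicing the tuple returns another tuple, so indexing and slicing behave the same way on both versions.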
