In the Google/Udemy Tensorflow tutorial there is the following code:
import tensorflow as tf
...
def read_data(filename):
    """Extract the first file enclosed in a zip file as a list of words"""
    with zipfile.ZipFile(filename) as f:
        data = tf.compat.as_str(f.read(f.namelist()[0])).split()
    return data
This executes fine but I cannot find compat.as_str in the Tensorflow documentation or anywhere else.
Q1: What does compat.as_str do?
Q2: Is this tensorflow compat library documented somewhere?
Q3: This is a call to the tensorflow library, so how and why does it work in normal Python code rather than inside a tensorflow graph? I.e., I thought tensorflow library calls had to be inside a tensorflow graph definition block:
graph = tf.Graph()
with graph.as_default():
    ... tensorflow function calls here ...
I am running python 2.7.
Basically, it comes from the fact that in Python 2, strings were dealt with primarily as bytes, not unicode.
In Python 3, all strings are natively unicode.
The purpose of the function is to ensure that whichever Python version you're using, you won't be bothered; hence the compat module name, standing for compatibility.
Under the hood, tensorflow.compat.as_str converts both bytes and unicode strings to unicode strings.
Signature: tensorflow.compat.as_str(bytes_or_text, encoding='utf-8')
Docstring:
Returns the given argument as a unicode string.
Args:
bytes_or_text: A `bytes`, `str`, or `unicode` object.
encoding: A string indicating the charset for decoding unicode.
Returns:
A `unicode` (Python 2) or `str` (Python 3) object.
Raises:
TypeError: If `bytes_or_text` is not a binary or unicode string.
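For example, a quick sketch of the behavior on Python 3 (the byte string is just UTF-8-encoded text):
import tensorflow as tf

# Both a bytes object and a str come back as a native (unicode) str:
print(tf.compat.as_str(b'caf\xc3\xa9'))  # café
print(tf.compat.as_str('café'))          # café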
The library is documented here.
tf.compat.as_str converts input into a string
I couldn't find any documentation, but you can look at the source code here
Tensorflow functions as a Python module. The graph context is used to define a graph (mathematical computations) that will be used to train the model.
Typical usage involves the Graph.as_default() context manager, which overrides the current default graph for the lifetime of the context.
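To make the distinction concrete, here is a minimal sketch in the TF 1.x style the tutorial uses:
import tensorflow as tf

# tf.compat.as_str is a plain Python helper; no graph or session is involved:
words = tf.compat.as_str(b"hello world").split()

# A graph context is only needed for calls that add computation nodes:
graph = tf.Graph()
with graph.as_default():
    x = tf.constant(words)  # this op is added to `graph`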
In current versions of TF, the whole tf.compat module is nicely documented.
Basically, some things behave differently in Python 2 and 3; the best-known examples are integer division ("/" returns a float in Python 3) and strings (bytes by default in Python 2, unicode in Python 3). The compat module tries to make things behave the same way (if you check the source code you will see that it does different things depending on whether you use 2 or 3).
tf.compat.as_str:
Converts either bytes or unicode to bytes, using utf-8 encoding for text.
This can be helpful if you save the data in tfrecords and want to make sure it will be saved in the same way no matter which python version is used.
Let's say I have a custom class derived from str that implements/overrides some methods:
class mystr(str):
    # just an example for a custom method:
    def something(self):
        return "anything"
Now currently I have to manually create instances of mystr by passing it a string in the constructor:
ms1 = mystr("my string")
s = "another string"
ms2 = mystr(s)
This is not too bad, but it led to the idea that it would be cool to use a custom string prefix, similar to b'bytes string' or r'raw string' or u'unicode string'.
Is it somehow possible in Python to create/register such a custom string literal prefix like m so that a literal m'my string' results in a new instance of mystr?
Or are those prefixes hard-coded into the Python interpreter?
Those prefixes are hardcoded in the interpreter; you can't register more prefixes.
What you could do however, is preprocess your Python files, by using a custom source codec. This is a rather neat hack, one that requires you to register a custom codec, and to understand and apply source code transformations.
Python allows you to specify the encoding of source code with a special comment at the top:
# coding: utf-8
tells Python that the source code is encoded with UTF-8, and Python will decode the file accordingly before parsing. Python looks up the codec for this in the codecs module registry. And you can register your own codecs.
The pyxl project uses this trick to parse out HTML syntax from Python files to replace them with actual Python syntax to build that HTML, all in a 'decoding' step. See the codec package in that project, where the register module registers a custom codec search function that transforms source code before Python actually parses and compiles it. A custom .pth file is installed into your site-packages directory to load this registration step at Python startup time. Another project that does the same thing to parse out Ruby-style string formatting is interpy.
All you have to do then, is build such a codec too that'll parse a Python source file (tokenizes it, perhaps with the tokenize module) and replaces string literals with your custom prefix with mystr(<string literal>) calls. Any file you want parsed you mark with # coding: yourcustomcodec.
I'll leave that part as an exercise for the reader. Good luck!
Note that the result of this transformation is then compiled into bytecode, which is cached; your transformation only has to run once per source code revision, all other imports of a module using your codec will load the cached bytecode.
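To give a flavor of the exercise, here is a minimal sketch of just the transformation step (the codec registration is omitted); it assumes the prefix is the name m attached directly to a string literal, and that mystr is importable in the transformed module:
import io
import tokenize

def transform(source):
    """Rewrite m'literal' into mystr('literal') via the token stream."""
    toks = list(tokenize.generate_tokens(io.StringIO(source).readline))
    out = []
    i = 0
    while i < len(toks):
        tok = toks[i]
        nxt = toks[i + 1] if i + 1 < len(toks) else None
        # A NAME token 'm' that ends exactly where a STRING starts is
        # our custom-prefixed literal.
        if (tok.type == tokenize.NAME and tok.string == 'm'
                and nxt is not None and nxt.type == tokenize.STRING
                and tok.end == nxt.start):
            out += [(tokenize.NAME, 'mystr'), (tokenize.OP, '('),
                    (tokenize.STRING, nxt.string), (tokenize.OP, ')')]
            i += 2
        else:
            out.append((tok.type, tok.string))
            i += 1
    return tokenize.untokenize(out)

print(transform("ms = m'my string'\n"))
# e.g. "ms =mystr ('my string')" -- untokenize is loose about spacing,
# but the result compiles and calls mystr.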
One could use operator overloading to implicitly convert str into a custom class:
class MyString(str):
    def __or__(self, a):
        return MyString(self + a)

m = MyString('')
print(repr(m), type(m))
# '' <class '__main__.MyString'>
print(repr(m | 'a'), type(m | 'a'))
# 'a' <class '__main__.MyString'>
This avoids the use of parentheses, effectively emulating a string prefix with one extra character, which I chose to be | but could also be & or another overloadable binary operator.
While the workarounds mentioned above are great, they could be dangerous; hacking your Python source encoding is really not a good idea. While you can't really make a prefix otherwise, you could do the following:
class MyString(str):
    def something(self):
        return MyString("anything")

m = MyString
# Then you can do:
m("hi")
# Rather than:
# m"hi"
That's probably the best safe solution you can find.
Two parentheses aren't really that much to type, and it can be less confusing to readers of your code.
I am trying to encrypt PDF files under Python 3.3.2 using PyPDF2.
The code is very simple:
password = 'password'
# password = password.encode('utf-8')
PDFout.encrypt(user_pwd=password, owner_pwd=password)
however I am getting the following errors, depending if the encoding is on or off:
on: TypeError: slice indices must be integers or None or have an __index__ method
off: TypeError: Can't convert 'bytes' object to str implicitly
Would you know by any chance how to resolve that problem?
Thanks and Regards
Peter
It appears to me that the current version of PyPDF2 (1.19 as of this writing) has some bugs concerning compatibility with Python 3, and that is what is causing both error messages. The change log on GitHub for PyPDF2 indicates that Python 3 support was added in version 1.16, which was released only 3 1/2 months ago, so it is possible this bug hasn't been reported or fixed yet. GitHub also shows that there is a branch of this project specifically for Python 3.3 support, which is not currently merged back into the main branch.
Both errors occur in the pdf.py file of the PyPDF2 module. Here is what is happening:
The PyPDF2 module creates some extra bytes as padding and concatenates it with your password. If the Python version is less than 3, the padding is created as a string literal. If the version is 3 or higher, the padding is encoded using the 'latin-1' encoding. In Python 3, this means the padding is a bytes object, and concatenating that with a string object (your password) produces the TypeError you saw. Under Python 2, the concatenation would work because both objects would be the same type.
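A minimal reproduction of that first error on Python 3 (the padding bytes below are illustrative stand-ins, not the module's actual constant):
padding = b'\x28\xbf\x4e\x5e'  # bytes, like the module's encoded padding
password = 'password'          # a str
password + padding             # TypeError: Can't convert 'bytes' object to str implicitly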
When you encode your password using "utf-8", you resolve that problem since both the password and padding are bytes objects in that case. However, you end up running into a second bug later in the module. The pdf.py file creates and uses a variable "keylen" like this:
keylen = 128 / 8
... # later on in the code...
key = md5_hash[:keylen]
The division operator underwent a change in Python 2.2 which altered its default behavior starting in Python 3. In brief, "/" means floor division in Python 2 and returns an int, but it means true division in Python 3 and returns a float. Therefore, "keylen" would be 16 in Python 2, but 16.0 in Python 3. Floats, unlike ints, can't be used as slice indices, so Python 3 throws the TypeError you saw when md5_hash[:keylen] is evaluated. Python 2 would run this without error, since keylen would be an int.
You could resolve this second problem by altering the module's source code to use the "//" operator (which means floor division and returns an int in both Python 2 and 3):
keylen = 128 // 8
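A quick demonstration of why that one-character change matters on Python 3 (md5_hash here is a stand-in for a real digest):
md5_hash = b'\x00' * 16   # stand-in for a real 16-byte MD5 digest
keylen = 128 / 8          # 16.0 -- true division returns a float in Python 3
# md5_hash[:keylen]       # TypeError: slice indices must be integers ...
keylen = 128 // 8         # 16 -- floor division returns an int in 2 and 3
key = md5_hash[:keylen]   # works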
However, you would then run into a third bug later in the code, also related to Python 3 compatibility. I won't belabor the point by describing it. The short answer to your question then, as far as I see it, is to either use Python 2, or patch the various code compatibility problems, or use a different PDF library for Python which has better support for Python 3 (if one exists which meets your particular requirements).
Try installing the most recent version of PyPDF2 - it now fully supports Python 3!
It seems that "some" support was added in 1.16, but it didn't cover all features. Now the library should be fully compatible with Python 3.
Haskell and Python don't seem to agree on MurmurHash2 results. Python, Java, and PHP returned the same results, but Haskell doesn't. Am I doing something wrong regarding MurmurHash2 on Haskell?
Here is my code for Haskell Murmurhash2:
import Data.Digest.Murmur32
main = do
  print $ asWord32 $ hash32WithSeed 1 "woohoo"
And here is the code written in Python:
import murmur
if __name__ == "__main__":
    print murmur.string_hash("woohoo", 1)
Python returned 3650852671 while Haskell returned 3966683799
From a quick inspection of the sources, it looks like the algorithm operates on 32 bits at a time. The Python version gets these by simply grabbing 4 bytes at a time from the input string, while the Haskell version converts each character to a single 32-bit Unicode index.
It's therefore not surprising that they yield different results.
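A sketch of the difference in Python (this only shows the 32-bit blocks each implementation feeds into the mixing function, not the full algorithm):
import struct

s = "woohoo"
# The Python binding packs 4 raw bytes into each 32-bit block:
first_block = struct.unpack('<I', s.encode('ascii')[:4])[0]
# The Haskell instance hashes one 32-bit value per character
# (the character's Unicode code point):
per_char_blocks = [ord(c) for c in s]
print(hex(first_block), per_char_blocks)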
The murmur-hash package (I am its author) does not promise to compute the same hashes as other languages. If you rely on hashes being compatible with other software, I suggest you create newtype wrappers that compute hashes the way you want them. For text in particular, you need to at least specify the encoding. In your case you could convert the text to an ASCII string using Data.ByteString.Char8.pack, but that still doesn't give you the same hash, since the ByteString instance is more of a placeholder.
BTW, I'm not actively improving that package because MurmurHash2 has been superseded by MurmurHash3, but I keep accepting patches.
As a side project I would like to try to parse binary files (Mach-O files specifically). I know tools exist for this already (otool) so consider this a learning exercise.
The problem I'm hitting is that I don't understand how to convert the binary elements found into a Python representation. For example, the Mach-O file format starts with a header which is defined by a C struct. The first item is a uint32_t 'magic number' field. When I do
magic = f.read(4)
I get
b'\xcf\xfa\xed\xfe'
This is starting to make sense to me. It's literally a byte array of 4 bytes. However, I want to treat this like a 4-byte int that represents the original magic number. Another example is the numberOfSections field. I just want the number represented by the 4-byte field, not an array of literal bytes.
Perhaps I'm thinking about this all wrong. Has anybody worked on anything similar? Do I need to write functions that take these 4-byte arrays and shift and combine their bytes to produce the number I want? Is endianness going to screw me here? Any pointers would be most helpful.
Take a look at the struct module:
In [1]: import struct
In [2]: magic = b'\xcf\xfa\xed\xfe'
In [3]: decoded = struct.unpack('<I', magic)[0]
In [4]: hex(decoded)
Out[4]: '0xfeedfacf'
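Extending that, you can decode the whole fixed-size header in one call. A sketch assuming the 32-bit mach_header layout of seven 32-bit fields (magic, cputype, cpusubtype, filetype, ncmds, sizeofcmds, flags), little-endian as on x86; the path is hypothetical:
import struct

with open('/path/to/machofile', 'rb') as f:
    header = f.read(7 * 4)

(magic, cputype, cpusubtype, filetype,
 ncmds, sizeofcmds, flags) = struct.unpack('<7I', header)
print(hex(magic), ncmds)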
There's the Kaitai Struct project that solves exactly this problem. First, you describe a certain file format using a .ksy spec, then you compile it into a Python library (or, actually, a library in any other major programming language), import it, and, voila, parsing boils down to:
from mach_o import MachO
my_file = MachO.from_file("/path/to/your/file")
my_file.magic # => 0xfeedface
my_file.num_of_sections # => some other integer
my_file.sections # => list of objects that represent sections
They have a growing repository of file format specs. It doesn't have a Mach-O file format spec (yet?), but there are complex formats like Java .class or Microsoft's PE executable described there, so I guess it shouldn't be a major problem to write a spec for the Mach-O format as well.
It is actually better than Construct or Hachoir because it's compiled (as opposed to interpreted), and thus faster, and it includes tons of other helpful tools like a visualizer and a format diagram maker (the project's site shows a generated explanation diagram for the PE executable format, for example).
I would suggest the Construct module. It offers a very high level interface.
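A minimal sketch of what that could look like for the header fields discussed above (the seven-field 32-bit layout is the same assumption as before, not a complete Mach-O spec):
from construct import Struct, Int32ul

# Declarative field layout; Construct handles the byte decoding:
mach_header = Struct(
    "magic" / Int32ul,
    "cputype" / Int32ul,
    "cpusubtype" / Int32ul,
    "filetype" / Int32ul,
    "ncmds" / Int32ul,
    "sizeofcmds" / Int32ul,
    "flags" / Int32ul,
)

with open('/path/to/machofile', 'rb') as f:
    header = mach_header.parse(f.read(28))
print(hex(header.magic), header.ncmds)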
I'm trying to write a Python C extension that processes byte strings, and I have something basically working for Python 2.x and Python 3.x.
For the Python 2.x code, near the start of my function, I currently have a line:
if (!PyArg_ParseTuple(args, "s#:in_bytes", &src_ptr, &src_len))
...
I notice that the s# format specifier accepts both Unicode strings and byte strings. I really just want it to accept byte strings and reject Unicode. For Python 2.x, this might be "good enough"--the standard hashlib seems to do the same, accepting Unicode as well as byte strings. However, Python 3.x is meant to clean up the Unicode/byte string mess and not let the two be interchangeable.
So, I'm surprised to find that in Python 3.x, the s format specifiers for PyArg_ParseTuple() still seem to accept Unicode and provide a "default encoded string version" of the Unicode. This seems to go against the principles of Python 3.x, making the s format specifiers unusable in practice. Is my analysis correct, or am I missing something?
Looking at the implementation for hashlib for Python 3.x (e.g. see md5module.c, function MD5_update() and its use of GET_BUFFER_VIEW_OR_ERROUT() macro) I see that it avoids the s format specifiers, and just takes a generic object (O specifier) and then does various explicit type checks using the GET_BUFFER_VIEW_OR_ERROUT() macro. Is this what we have to do?
I agree with you -- it's one of several spots where the C API migration of Python 3 was clearly not designed as carefully and thoroughly as the Python-coder-visible parts. I do also agree that probably the best workaround for now is focusing on "buffer views", per that macro -- until and unless something better gets designed into a future Python C API (don't hold your breath waiting for that to happen, though;-).
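For what it's worth, the user-visible effect of that buffer-view approach is easy to check from Python 3 itself (the exact error message may vary by version):
import hashlib

hashlib.md5(b'bytes are fine')       # accepted: bytes support the buffer protocol
try:
    hashlib.md5('unicode is not')    # str doesn't expose a buffer in Python 3
except TypeError as e:
    print(e)  # e.g. "Unicode-objects must be encoded before hashing"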