Why do we need endianness here? - python

I am reading source code that downloads a zipped file and reads the data into a numpy array. The code is supposed to work on macOS and Linux, and here is the snippet that I see:
def _read32(bytestream):
    dt = numpy.dtype(numpy.uint32).newbyteorder('>')
    return numpy.frombuffer(bytestream.read(4), dtype=dt)
This function is used in the following context:
with gzip.open(filename) as bytestream:
    magic = _read32(bytestream)
It is not hard to see what happens here, but I am puzzled by the purpose of newbyteorder('>'). I have read the documentation and know what endianness means, but I cannot understand why exactly the developer added newbyteorder (in my opinion it is not really needed).

That's because the downloaded data is in big-endian format, as described on the source page: http://yann.lecun.com/exdb/mnist/
All the integers in the files are stored in the MSB first (high
endian) format used by most non-Intel processors. Users of Intel
processors and other low-endian machines must flip the bytes of the
header.

It is just a way of ensuring that the bytes in the buffer are interpreted in the correct order in the resulting array, regardless of the system's native byte order.
By default, the built-in NumPy integer dtypes use the byte order that is native to your system. For example, my system is little-endian, so simply using the dtype numpy.dtype(numpy.uint32) would mean that values read from a buffer whose bytes are in big-endian order will not be interpreted correctly.
If np.frombuffer is meant to receive bytes that are known to be in a particular byte order, the best practice is to modify the dtype using newbyteorder. This is mentioned in the docs for np.frombuffer:
Notes
If the buffer has data that is not in machine byte-order, this should be specified as part of the data-type, e.g.:
>>> dt = np.dtype(int)
>>> dt = dt.newbyteorder('>')
>>> np.frombuffer(buf, dtype=dt)
The data of the resulting array will not be byteswapped, but will be
interpreted correctly.
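To make this concrete, here is a small sketch (not from the original post) that interprets the same four bytes with a big-endian and a little-endian uint32 dtype. The example bytes b'\x00\x00\x08\x03' are, as far as I recall, the MNIST image-file magic number (2051) from the page linked above.

import numpy as np

buf = b'\x00\x00\x08\x03'   # example header bytes

# Interpreted MSB-first, as the MNIST header expects:
big = np.frombuffer(buf, dtype=np.dtype(np.uint32).newbyteorder('>'))
print(big)      # [2051]

# The same bytes interpreted with an explicit little-endian dtype:
little = np.frombuffer(buf, dtype=np.dtype(np.uint32).newbyteorder('<'))
print(little)   # [50855936] -- same bytes, wrong value

Without the '>' the result on a little-endian machine would be the second value, which is why the code in the question fixes the byte order explicitly.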

How to read complex data from TB size binary file, fast and keep the most accuracy?

Using Python 3.9.2, I read the beginning of a TB-size binary file (a piece of it) as below:
file = open(filename, 'rb')
bytes = file.read(8)
print(bytes)
# prints: b'\x14\x00\x80?\xb5\x0c\xf81'
I tried reading the file filename with np.fromfile and different dtypes:
import numpy as np
float_data1 = np.fromfile(filename, np.float32)
float_data2 = np.fromfile(filename, np.complex64)
As the binary file is always bigger than 500 GB, sometimes TB-size, how can I read the complex data from it fast while keeping the most accuracy?
This is related to your ham post, where you use
samples = np.fromfile(filename, np.complex128)
and claim:
Those codes equal to -1.9726906072368233e-31, +3.6405886029665884e-23.
No, they don't equal that. That's just your interpretation of the bytes as float64, and that interpretation is incorrect!
You assume these are 64-bit floating point numbers. They are not, and you really need to stop assuming that; we can't help you as long as you act as if they were 64-bit floats forming a 128-bit complex value.
Besides the documents, I compared the byte content in the answer; that is more than reading the docs.
As I already pointed out, that is wrong. Your computer will read anything as any type you tell it to, even if it's not the type the data was originally stored in. You stored complex64 but read complex128. That's why your values are so implausible.
It's 32-bit floats, forming a 64-bit complex value. The official block documentation for the file sink also points that out, and even explains the numpy dtype you need to use!
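As a sanity check, here is a minimal sketch (not part of the original answer) that reads the eight bytes printed in the question with the complex64 interpretation; struct.unpack is shown alongside just to make the two-float32-per-sample layout explicit.

import numpy as np
import struct

raw = b'\x14\x00\x80?\xb5\x0c\xf81'   # the 8 bytes shown in the question

# One complex64 sample: two little-endian float32 values (real, imag)
sample = np.frombuffer(raw, dtype=np.complex64)
print(sample)

# The same bytes unpacked as two 32-bit floats
re, im = struct.unpack('<2f', raw)
print(re, im)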
Anyway, you can use numpy's memmap functionality to map the file contents without reading them all into RAM. That works. Again, you need to use the right dtype, which is, to repeat it for the tenth time, not complex128.
It's really easy:
data = numpy.memmap(filename, dtype=numpy.complex64)
done.
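If you want to walk over the whole file without ever holding it in RAM, a sketch along these lines should work; the chunk size is an arbitrary choice, not something from the answer.

import numpy as np

data = np.memmap(filename, dtype=np.complex64, mode='r')

chunk = 1_000_000                        # samples per block, tune to taste
for start in range(0, data.shape[0], chunk):
    block = data[start:start + chunk]    # only this slice is paged into memory
    # ... process block here, e.g. np.abs(block).max() or an FFT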

How to serialize tensorflow tensor as raw byte string

I'm trying to get tensorflow to write a tensor to something that would be ingestible by ffmpeg (e.g. PCM 32-bit floating-point little-endian). I'm not seeing a simple way to do this. tf.io.serialize_tensor seems to transform it into a TensorProto.
I can't use tensorflow-io because this needs to work in tensorflow-serving. tf.audio.write_wav doesn't work because it only supports a bit-depth of int16 and I need float32.
I'd like something like:
serialized_foo = magic_encoding(foo) # foo is a tensor of type float32 and serialized_foo is a string
tf.io.write_file("foo.raw", serialized_foo)
It seems like tf.io.serialize_tensor gets part of the way there. From what I can tell quickly inspecting the file, the protobuf format appears to be the raw byte string plus some sort of short header? Would it be possible just to modify the output of tf.io.serialize_tensor slightly, dropping the header?
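For what it's worth, a minimal eager-mode sketch of the desired output (raw little-endian float32 bytes) could bypass serialize_tensor entirely by going through NumPy. Note that this does not address the tensorflow-serving / in-graph constraint from the question, and the tensor foo here is just a stand-in.

import tensorflow as tf

foo = tf.random.uniform([4096], dtype=tf.float32)   # stand-in tensor

# Eager mode only: pull the values out as little-endian float32 bytes
raw = foo.numpy().astype('<f4').tobytes()
tf.io.write_file("foo.raw", tf.constant(raw))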

Saving mat files in different numerical data formats in scipy.io.savemat

I know this is very simple and I can't believe I haven't found anything on this anywhere but here it goes:
I have a large, high dimensional matrix in python that I want to save in .mat format (for matlab).
I'm using the scipy.io.savemat method to save this matrix, but it's always saved as double. I would like to save it with lower precision, like single or 16-bit float.
I convert the array to a low-precision data type before saving but it's always saved as double. Is there really no way of saving mat files in a lower-precision float type?
.savemat does not seem to take a dtype argument.
import numpy as np
from scipy.io import savemat

savemat('test.mat', {'test': np.array([0.001, 1, 1.004], dtype=np.float16)})
Apparently it needs to be either single or double: scipy.io.savemat() does not support other float precisions, and it will silently fall back to double if it doesn't like your dtype, without any warning.
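A sketch of the closest workable option, assuming single precision is acceptable: convert to float32 before calling savemat, and the array should land in the .mat file as MATLAB single.

import numpy as np
from scipy.io import savemat, loadmat

arr = np.array([0.001, 1, 1.004], dtype=np.float32)   # single precision
savemat('test.mat', {'test': arr})

back = loadmat('test.mat')['test']
print(back.dtype)   # float32, i.e. stored as 'single'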

How to save double to file in python?

Let's say I need to save a matrix (each line corresponds to one row) that could later be loaded from Fortran. What method should I prefer? Is converting everything to strings the only approach?
You can save them in binary format as well. See the documentation for the struct standard module; it has a pack function for converting Python objects into binary data.
For example:
import struct
value = 3.141592654
data = struct.pack('d', value)
open('file.ext', 'wb').write(data)
You can convert each element of your matrix and write to a file. Fortran should be able to load that binary data. You can speed up the process by converting a row as a whole, like this:
row_data = struct.pack('d' * len(matrix_row), *matrix_row)
Please note, that 'd' * len(matrix_row) is a constant for your matrix size, so you need to calculate that format string only once.
I don't know Fortran, so it's hard to tell what is easy for you to parse on that side.
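Putting the row-packing idea together, a sketch of writing a whole matrix in binary might look like this (the matrix values and the output filename are made up for illustration):

import struct

matrix = [[1.0, 2.0, 3.0],
          [4.0, 5.0, 6.0]]            # example data

row_fmt = 'd' * len(matrix[0])        # computed once for the whole matrix
with open('matrix.bin', 'wb') as f:
    for row in matrix:
        f.write(struct.pack(row_fmt, *row))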
It sounds like your options are either saving the doubles in plaintext (meaning, 'converting' them to strings) or in binary (using struct and the like). Which one is better depends on your situation.
I would go with the plaintext solution, as it means the files will be easily readable and you won't have to mess with various details (endianness, default double sizes).
But there are cases where binary is better (for example, if you have a really big list of doubles and space matters, or if it is easier for you to parse and you need the optimization); this is likely not your case.
You can use JSON:
import json

matrix = [[2.3452452435, 3.34134], [4.5, 7.9]]
data = json.dumps(matrix)
open('file.ext', 'w').write(data)
File content will look like:
[[2.3452452435, 3.3413400000000002], [4.5, 7.9000000000000004]]
If legibility and ease of access are important (and the file size is reasonable), Fortran can easily parse a simple array of numbers, at least if it knows the size of the matrix beforehand (with something like READ(FILE_ID, '2(F)'), I think):
1.234 5.6789e4
3.1415 9.265358978
42 ...
Two nested for loops in your Python code can easily write your matrix in this form.
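A sketch of those two nested loops, writing one row per line in plain text (the matrix values are placeholders):

matrix = [[1.234, 5.6789e4],
          [3.1415, 9.265358978]]      # example data

with open('matrix.txt', 'w') as f:
    for row in matrix:
        for value in row:
            f.write('%r ' % value)    # repr keeps full double precision
        f.write('\n')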

Alternatives to using pack_into() when manipulating a list of bytes?

I'm reading a binary file into a list and parsing the binary data. I'm using unpack() to extract certain parts of the data as primitive data types, and I want to edit that data and insert it back into the original list of bytes. Using pack_into() would make this easy, except that I'm using Python 2.4, and pack_into() wasn't introduced until 2.5.
Does anyone know of a good way to go about serializing the data this way so that I can accomplish essentially the same functionality as pack_into()?
Have you looked at the bitstring module? It's designed to make the construction, parsing and modification of binary data easier than using the struct and array modules directly.
It's especially made for working at the bit level, but will work with bytes just as well. It will also work with Python 2.4.
from bitstring import BitString

s = BitString(filename='somefile')

# Replace a byte range with new values.
# The step of '8' signifies byte rather than bit indices.
s[10:15:8] = '0x001122'

# Search for a byte value and replace it with two bytes
s.replace('0xcc', '0xddee', bytealigned=True)

# Different interpretations of the data are available through properties
if s[5:7:8].int > 1000:
    s[5:7:8] = 1000

# Use the bytes property to get back to a Python string
open('newfile', 'wb').write(s.bytes)
The underlying data stored in the BitString is just an array object, but with a comprehensive set of functions and special methods to make it simple to modify and interpret.
Do you mean editing data in a buffer object? Documentation on manipulating those at all from Python directly is fairly scarce.
If you just want to edit bytes in a string, it's simple enough, though; struct.pack_into is new to 2.5, but struct.pack isn't:
import struct

s = open("file", "rb").read()
ofs = 1024
fmt = "Ih"
size = struct.calcsize(fmt)

# Split the string around the region to edit
before, data, after = s[0:ofs], s[ofs:ofs+size], s[ofs+size:]

# Unpack, modify, and repack the values
values = list(struct.unpack(fmt, data))
values[0] += 5
values[1] /= 2
data = struct.pack(fmt, *values)

# Reassemble the byte string
s = "".join([before, data, after])
