Endianness of integers in Python - python

I'm working on a program where I store some data in an integer and process it bitwise. For example, I might receive the number 48, which I will process bit-by-bit. In general the endianness of integers depends on the machine representation of integers, but does Python do anything to guarantee that the ints will always be little-endian? Or do I need to check endianness like I would in C and then write separate code for the two cases?
I ask because my code runs on a Sun machine and, although the one it's running on now uses Intel processors, I might have to switch to a machine with Sun processors in the future, which I know is big-endian.

Python's int has the same endianness as the processor it runs on. The struct module lets you convert byte blobs to ints (and viceversa, and some other data types too) in either native, little-endian, or big-endian ways, depending on the format string you choose: start the format with # or no endianness character to use native endianness (and native sizes -- everything else uses standard sizes), '~' for native, '<' for little-endian, '>' or '!' for big-endian.
This is byte-by-byte, not bit-by-bit; not sure exactly what you mean by bit-by-bit processing in this context, but I assume it can be accomodated similarly.
For fast "bulk" processing in simple cases, consider also the array module -- the fromstring and tostring methods can operate on large number of bytes speedily, and the byteswap method can get you the "other" endianness (native to non-native or vice versa), again rapidly and for a large number of items (the whole array).

If you need to process your data 'bitwise' then the bitstring module might be of help to you. It can also deal with endianness between platforms.
The struct module is the best standard method of dealing with endianness between platforms. For example this packs and unpack the integers 1, 2, 3 into two 'shorts' and one 'long' (2 and 4 bytes on most platforms) using native endianness:
>>> from struct import *
>>> pack('hhl', 1, 2, 3)
'\x00\x01\x00\x02\x00\x00\x00\x03'
>>> unpack('hhl', '\x00\x01\x00\x02\x00\x00\x00\x03')
(1, 2, 3)
To check the endianness of the platform programmatically you can use
>>> import sys
>>> sys.byteorder
which will either return "big" or "little".

The following snippet will tell you if your system default is little endian (otherwise it is big-endian)
import struct
little_endian = (struct.unpack('<I', struct.pack('=I', 1))[0] == 1)
Note, however, this will not affect the behavior of bitwise operators: 1<<1 is equal to 2 regardless of the default endianness of your system.

Check when?
When doing bitwise operations, the int in will have the same endianess as the ints you put in. You don't need to check that. You only need to care about this when converting to/from sequences of bytes, in both languages, afaik.
In Python you use the struct module for this, most commonly struct.pack() and struct.unpack().

Related

Why does a grouped struct.pack write wrong data?

I just spent ~30 minutes debugging and double checking Python & C# code, to find out that my struct.pack was writing the wrong data. When I separated this into separate calls, it works fine.
This is what I had before
file.write(struct.pack("fffHf", kf_time / frame_divisor, kf_in_tangent, kf_out_tangent, kf_interpolation_type, kf_value))
This is what I have now
file.write(struct.pack("f", kf_time / frame_divisor))
file.write(struct.pack("f", kf_in_tangent))
file.write(struct.pack("f", kf_out_tangent))
file.write(struct.pack("H", kf_interpolation_type))
file.write(struct.pack("f", kf_value))
Why does the first variation not write the data that I expected? What is so different than writing these separately?
(File is opened in binary mode, platform is 64 bit Windows, Python 3.5)
Presumably because, as the struct documentation clearly states:
Note By default, the result of packing a given C struct
includes pad bytes in order to maintain proper alignment
for the C types involved; similarly, alignment is taken
into account when unpacking. This behavior is chosen so
that the bytes of a packed struct correspond exactly to
the layout in memory of the corresponding C struct. To
handle platform-independent data formats or omit implicit
pad bytes, use standard size and alignment instead of
native size and alignment: see Byte Order, Size, and
Alignment for details.

How to determine whether the number is 32-bit or 64-bit integer in Python?

I want to differentiate between 32-bit and 64-bit integers in Python. In C it's very easy as we can declare variables using int_64 and int_32. But in Python how do we differentiate between 32-bit integers and 64-bit integers?
There's no need. The interpreter handles allocation behind the scenes, effectively promoting from one type to another as needed without you doing anything explicit.
The struct module mentioned in the other answers is the thing you need.
An example to make it clear.
import struct
struct.pack('qii', # Format string here.
100, # Your 64-bit integer
50, # Your first 32-bit integer
25) # Your second 32-bit integer
# This will return the following:
'd\x00\x00\x00\x00\x00\x00\x002\x00\x00\x00\x19\x00\x00\x00'
documentation for the formatting string.
Basically, you don't. There's no reason to. If you want to deal with types of known bit size, look at numpy datatypes.
If you want to put data into a specified format, look at the struct module.
The following snippet from an ipython interpreter session indicates one way of testing the type of an integer. Note, on my system, an int is a 64-bit data type, and a long is a multi-word type.
In [190]: isinstance(1,int)
Out[190]: True
In [191]: isinstance(1,long)
Out[191]: False
In [192]: isinstance(1L,long)
Out[192]: True
Also see an answer about sys.getsizeof. This function is not entirely relevant, since some additional overhead bytes are included. For example:
In [194]: import sys
In [195]: sys.getsizeof(1)
Out[195]: 24
In [196]: sys.getsizeof(1L)
Out[196]: 28

Building a list with binary numbers in binary representation in Python

I have a very big list of 0's and 1's that are represented as integers - by default - by python, I think: [randint(0, 1) for i in range(50*98)]
And I want to optimize the code so that it utilizes much less memory. The obvious way is to use just 1 bit to represent each of these numbers.
Is it possible to build a list of real binary numbers in python?
Regards,
Bruno
EDIT: Thank you all.
From the answers I found out Python doesn't do this by default, so I found this library (that is installed by Macports on OSX so it saves me some trouble) that does bit operations:
python-bitstring
This uses the bitstring module and constructs a BitArray object from your list:
from bitstring import BitArray
b = BitArray([randint(0, 1) for i in range(50*98)])
Internally this is now stored packed as bytes so will take considerably less memory. You can slice, index, check and set bits etc. with the usual notation and there additional methods such as set, all and any to modify the bits.
To get the data back as a binary string just use b.bin and to get out the byte packed data use b.tobytes() which will pad with zero bits up to a byte boundary.
As delnan already said in a comment, you will not be able to use real binary numbers if you mean bit-for-bit equivalent memory usage.
Integers (or longs) are of course real binary numbers in the meaning that you can address individual bits (using bit-wise operators, but that is easily hidden in a class). Also, long objects can become arbitrarily large, i.e. you can use them to simulate arbitrarily large bitsets. It is not going to be very fast if you do it in Python, but not very difficult either and a good start.
Using your binary generation scheme from above, you can do the following:
reduce(
lambda (a, p), b: (b << p | a, p + 1),
(random.randint(0, 1) for i in range(50*98)),
(0, 0)
)[0]
Of course, random supports arbitrarily large upper boundaries, so you can do just that:
r = random.randint(0, 2**(50*98))
This is not entirely the same, since the individual binary digits in the are not independent in the same way they are independent when you create each digit for itself. Then again, knowing you pRNGs work, they are not really independent in the other case, either. If this is a concern for you, you probably should not use the random module at all, but a hardware RNG.
It's called a bit vector, or a bitmap. Try e.g. BitVector. If you want to implement it yourself, you need to use a numeric object, not a list, and use bitwise operations to toggle bits, e.g.
bitmap = 0
bit = (1 << 24)
bitmap |= bit # enable bit
bitmap &= ~bit # disable bit
It seems you need some kind of bit-set. I'm not sure if this example entirely suits your needs, but it is worth a try.
Perhaps you could look into a lossless compression scheme for your data? Presumably there will be a lot of redundancy in such a list.

How do I write a long integer as binary in Python?

In Python, long integers have unlimited precision. I would like to write a 16 byte (128 bit) integer to a file. struct from the standard library supports only up to 8 byte integers. array has the same limitation. Is there a way to do this without masking and shifting each integer?
Some clarification here: I'm writing to a file that's going to be read in from non-Python programs, so pickle is out. All 128 bits are used.
I think for unsigned integers (and ignoring endianness) something like
import binascii
def binify(x):
h = hex(x)[2:].rstrip('L')
return binascii.unhexlify('0'*(32-len(h))+h)
>>> for i in 0, 1, 2**128-1:
... print i, repr(binify(i))
...
0 '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
1 '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01'
340282366920938463463374607431768211455 '\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff'
might technically satisfy the requirements of having non-Python-specific output, not using an explicit mask, and (I assume) not using any non-standard modules. Not particularly elegant, though.
Two possible solutions:
Just pickle your long integer. This will write the integer in a special format which allows it to be read again, if this is all you want.
Use the second code snippet in this answer to convert the long int to a big endian string (which can be easily changed to little endian if you prefer), and write this string to your file.
The problem is that the internal representation of bigints does not directly include the binary data you ask for.
The PyPi bitarray module in combination with the builtin bin() function seems like a good combination for a solution that is simple and flexible.
bytes = bitarray(bin(my_long)[2:]).tobytes()
The endianness can be controlled with a few more lines of code. You'll have to evaluate the efficiency.
Why not use struct with the unsigned long long type twice?
import struct
some_file.write(struct.pack("QQ", var/(2**64), var%(2**64)))
That's documented here (scroll down to get the table with Q): http://docs.python.org/library/struct.html
This may not avoid the "mask and shift each integer" requirement. I'm not sure that avoiding mask and shift means in the context of Python long values.
The bytes are these:
def bytes( long_int ):
bytes = []
while long_int != 0:
b = long_int%256
bytes.insert( 0, b )
long_int //= 256
return bytes
You can then pack this list of bytes using struct.pack( '16b', bytes )
With Python 3.2 and later, you can use int.to_bytes and int.from_bytes: https://docs.python.org/3/library/stdtypes.html#int.to_bytes
You could pickle the object to binary, use protocol buffers (I don't know if they allow you to serialize unlimited precision integers though) or BSON if you do not want to write code.
But writing a function that dumps 16 byte integers by shifting it should not be so hard to do if it's not time critical.
This may be a little late, but I don't see why you can't use struct:
bigint = 0xFEDCBA9876543210FEDCBA9876543210L
print bigint,hex(bigint).upper()
cbi = struct.pack("!QQ",bigint&0xFFFFFFFFFFFFFFFF,(bigint>>64)&0xFFFFFFFFFFFFFFFF)
print len(cbi)
The bigint by itself is rejected, but if you mask it with &0xFFFFFFFFFFFFFFFF you can reduce it to an 8 byte int instead of 16. Then the upper part is shifted and masked as well. You may have to play with byte ordering a bit. I used the ! mark to tell it to produce a network endian byte order. Also, the msb and lsb (upper and lower bytes) may need to be reversed. I will leave that as an exercise for the user to determine. I would say saving things as network endian would be safer so you always know what the endianess of your data is.
No, don't ask me if network endian is big or little endian...
Based on #DSM's answer, and to support negative integers and varying byte sizes, I've created the following improved snippet:
def to_bytes(num, size):
x = num if num >= 0 else 256**size + num
h = hex(x)[2:].rstrip("L")
return binascii.unhexlify("0"*((2*size)-len(h))+h)
This will properly handle negative integers and let the user set the number of bytes

Reading 32bit Packed Binary Data On 64bit System

I'm attempting to write a Python C extension that reads packed binary data (it is stored as structs of structs) and then parses it out into Python objects. Everything works as expected on a 32 bit machine (the binary files are always written on 32bit architecture), but not on a 64 bit box. Is there a "preferred" way of doing this?
It would be a lot of code to post but as an example:
struct
{
WORD version;
BOOL upgrade;
time_t time1;
time_t time2;
} apparms;
File *fp;
fp = fopen(filePath, "r+b");
fread(&apparms, sizeof(apparms), 1, fp);
return Py_BuildValue("{s:i,s:l,s:l}",
"sysVersion",apparms.version,
"powerFailTime", apparms.time1,
"normKitExpDate", apparms.time2
);
Now on a 32 bit system this works great, but on a 64 bit my time_t sizes are different (32bit vs 64 bit longs).
Damn, you people are fast.
Patrick, I originally started using the struct package but found it just way to slow for my needs. Plus I was looking for an excuse to write a Python Extension.
I know this is a stupid question but what types do I need to watch out for?
Thanks.
Explicitly specify that your data types (e.g. integers) are 32-bit. Otherwise if you have two integers next to each other when you read them they will be read as one 64-bit integer.
When you are dealing with cross-platform issues, the two main things to watch out for are:
Bitness. If your packed data is written with 32-bit ints, then all of your code must explicitly specify 32-bit ints when reading and writing.
Byte order. If you move your code from Intel chips to PPC or SPARC, your byte order will be wrong. You will have to import your data and then byte-flip it so that it matches up with the current architecture. Otherwise 12 (0x0000000C) will be read as 201326592 (0x0C000000).
Hopefully this helps.
The 'struct' module should be able to do this, although alignment of structs in the middle of the data is always an issue. It's not very hard to get it right, however: find out (once) what boundary the structs-in-structs align to, then pad (manually, with the 'x' specifier) to that boundary. You can doublecheck your padding by comparing struct.calcsize() with your actual data. It's certainly easier than writing a C extension for it.
In order to keep using Py_BuildValue() like that, you have two options. You can determine the size of time_t at compiletime (in terms of fundamental types, so 'an int' or 'a long' or 'an ssize_t') and then use the right format character to Py_BuildValue -- 'i' for an int, 'l' for a long, 'n' for an ssize_t. Or you can use PyInt_FromSsize_t() manually, in which case the compiler does the upcasting for you, and then use the 'O' format characters to pass the result to Py_BuildValue.
You need to make sure you're using architecture independent members for your struct. For instance an int may be 32 bits on one architecture and 64 bits on another. As others have suggested, use the int32_t style types instead. If your struct contains unaligned members, you may need to deal with padding added by the compiler too.
Another common problem with cross architecture data is endianness. Intel i386 architecture is little-endian, but if you're reading on a completely different machine (e.g. an Alpha or Sparc), you'll have to worry about this too.
The Python struct module deals with both these situations, using the prefix passed as part of the format string.
# - Use native size, endianness and alignment. i= sizeof(int), l= sizeof(long)
= - Use native endianness, but standard sizes and alignment (i=32 bits, l=64 bits)
< - Little-endian standard sizes/alignment
Big-endian standard sizes/alignment
In general, if the data passes off your machine, you should nail down the endianness and the size / padding format to something specific — ie. use "<" or ">" as your format. If you want to handle this in your C extension, you may need to add some code to handle it.
What's your code for reading the binary data? Make sure you're copying the data into properly-sized types like int32_t instead of just int.
Why aren't you using the struct package?

Categories

Resources