In Python, long integers have unlimited precision. I would like to write a 16 byte (128 bit) integer to a file. struct from the standard library supports only up to 8 byte integers. array has the same limitation. Is there a way to do this without masking and shifting each integer?
Some clarification here: I'm writing to a file that's going to be read in from non-Python programs, so pickle is out. All 128 bits are used.
I think for unsigned integers (and ignoring endianness) something like
import binascii
def binify(x):
    h = hex(x)[2:].rstrip('L')
    return binascii.unhexlify('0'*(32-len(h))+h)
>>> for i in 0, 1, 2**128-1:
...     print i, repr(binify(i))
...
0 '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
1 '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01'
340282366920938463463374607431768211455 '\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff'
might technically satisfy the requirements of having non-Python-specific output, not using an explicit mask, and (I assume) not using any non-standard modules. Not particularly elegant, though.
Two possible solutions:
Just pickle your long integer. This will write the integer in a special format which allows it to be read again, if this is all you want.
Use the second code snippet in this answer to convert the long int to a big endian string (which can be easily changed to little endian if you prefer), and write this string to your file.
The problem is that the internal representation of bigints does not directly include the binary data you ask for.
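The snippet referred to above isn't reproduced here, but a rough sketch of that kind of conversion (big-endian, fixed at 16 bytes, though it does use the masking and shifting the questioner hoped to avoid) might look like this:

def long_to_big_endian(n, length=16):
    # Fill a fixed-length buffer from the least significant byte backwards
    result = bytearray(length)
    for i in range(length - 1, -1, -1):
        result[i] = n & 0xFF
        n >>= 8
    return str(result)      # for little-endian output, reverse with [::-1]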
The PyPi bitarray module in combination with the builtin bin() function seems like a good combination for a solution that is simple and flexible.
from bitarray import bitarray

data = bitarray(bin(my_long)[2:]).tobytes()
The endianness can be controlled with a few more lines of code. You'll have to evaluate the efficiency.
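A rough sketch of those extra lines (padding to a fixed 128 bits so the result is always 16 bytes; my_long is the value being written):

# Pad the binary string to 128 bits so the output is always 16 bytes
data = bitarray(bin(my_long)[2:].zfill(128)).tobytes()
# Byte order can be flipped afterwards by reversing the string
little_endian_data = data[::-1]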
Why not use struct with the unsigned long long type twice?
import struct
some_file.write(struct.pack("QQ", var // (2**64), var % (2**64)))
That's documented here (scroll down to get the table with Q): http://docs.python.org/library/struct.html
This may not avoid the "mask and shift each integer" requirement, though I'm not sure what avoiding mask and shift means in the context of Python long values.
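For what it's worth, a sketch of the full round trip with an explicit byte order (big-endian, high quadword first, matching the pack call above):

import struct

def pack_u128(value):
    # High 64 bits first, then the low 64 bits, both big-endian
    return struct.pack('>QQ', value >> 64, value & 0xFFFFFFFFFFFFFFFF)

def unpack_u128(data):
    hi, lo = struct.unpack('>QQ', data)
    return (hi << 64) | lo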
The bytes are these:
def bytes( long_int ):
    bytes = []
    while long_int != 0:
        b = long_int % 256
        bytes.insert( 0, b )
        long_int //= 256
    return bytes
You can then pack this list of bytes using struct.pack('16B', *bytes) (note the *: pack wants each byte as a separate argument, 'B' because the values are unsigned, and the list needs to be left-padded to 16 entries first).
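A sketch of how that could look end to end (the bytes() helper is the function defined above):

import struct

def pack128(long_int):
    b = bytes(long_int)              # the helper above, most significant byte first
    b = [0] * (16 - len(b)) + b      # left-pad with zero bytes up to 16 entries
    return struct.pack('16B', *b)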
With Python 3.2 and later, you can use int.to_bytes and int.from_bytes: https://docs.python.org/3/library/stdtypes.html#int.to_bytes
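For example (Python 3):

>>> x = 2**128 - 1
>>> x.to_bytes(16, 'big')
b'\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff'
>>> int.from_bytes(x.to_bytes(16, 'big'), 'big') == x
True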
You could pickle the object to binary, use protocol buffers (I don't know if they allow you to serialize unlimited precision integers though) or BSON if you do not want to write code.
But writing a function that dumps 16-byte integers by shifting shouldn't be so hard to do if it's not time-critical.
This may be a little late, but I don't see why you can't use struct:
import struct

bigint = 0xFEDCBA9876543210FEDCBA9876543210L
print bigint,hex(bigint).upper()
cbi = struct.pack("!QQ",bigint&0xFFFFFFFFFFFFFFFF,(bigint>>64)&0xFFFFFFFFFFFFFFFF)
print len(cbi)
The bigint by itself is rejected, but if you mask it with &0xFFFFFFFFFFFFFFFF you can reduce it to an 8 byte int instead of 16. Then the upper part is shifted and masked as well. You may have to play with byte ordering a bit. I used the ! mark to tell it to produce a network endian byte order. Also, the msb and lsb (upper and lower bytes) may need to be reversed. I will leave that as an exercise for the user to determine. I would say saving things as network endian would be safer so you always know what the endianess of your data is.
No, don't ask me if network endian is big or little endian...
Based on @DSM's answer, and to support negative integers and varying byte sizes, I've created the following improved snippet:

import binascii

def to_bytes(num, size):
    x = num if num >= 0 else 256**size + num
    h = hex(x)[2:].rstrip("L")
    return binascii.unhexlify("0"*((2*size)-len(h))+h)
This will properly handle negative integers and lets the user set the number of bytes.
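For example:

>>> to_bytes(1, 16)
'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01'
>>> to_bytes(-1, 2)
'\xff\xff'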
Related
I have a very big list of 0's and 1's that are represented as integers - by default - by python, I think: [randint(0, 1) for i in range(50*98)]
And I want to optimize the code so that it utilizes much less memory. The obvious way is to use just 1 bit to represent each of these numbers.
Is it possible to build a list of real binary numbers in python?
Regards,
Bruno
EDIT: Thank you all.
From the answers I found out Python doesn't do this by default, so I found this library (that is installed by Macports on OSX so it saves me some trouble) that does bit operations:
python-bitstring
This uses the bitstring module and constructs a BitArray object from your list:
from random import randint
from bitstring import BitArray

b = BitArray([randint(0, 1) for i in range(50*98)])
Internally this is now stored packed as bytes so will take considerably less memory. You can slice, index, check and set bits etc. with the usual notation, and there are additional methods such as set, all and any to modify and query the bits.
To get the data back as a binary string just use b.bin and to get out the byte packed data use b.tobytes() which will pad with zero bits up to a byte boundary.
As delnan already said in a comment, you will not be able to use real binary numbers if you mean bit-for-bit equivalent memory usage.
Integers (or longs) are of course real binary numbers in the sense that you can address individual bits (using bit-wise operators, but that is easily hidden in a class). Also, long objects can become arbitrarily large, i.e. you can use them to simulate arbitrarily large bitsets. It is not going to be very fast if you do it in Python, but not very difficult either, and a good start.
Using your binary generation scheme from above, you can do the following:
import random

reduce(
    lambda (a, p), b: (b << p | a, p + 1),
    (random.randint(0, 1) for i in range(50*98)),
    (0, 0)
)[0]
Of course, random supports arbitrarily large upper boundaries, so you can do just that:
r = random.randint(0, 2**(50*98) - 1)
This is not entirely the same, since the individual binary digits in the result are not independent in the same way they are when you create each digit by itself. Then again, knowing how PRNGs work, they are not really independent in the other case either. If this is a concern for you, you probably should not use the random module at all, but a hardware RNG.
It's called a bit vector, or a bitmap. Try e.g. BitVector. If you want to implement it yourself, you need to use a numeric object, not a list, and use bitwise operations to toggle bits, e.g.
bitmap = 0
bit = (1 << 24)
bitmap |= bit # enable bit
bitmap &= ~bit # disable bit
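The same pattern covers testing and toggling a bit:

is_set = bool(bitmap & bit)   # test whether the bit is set
bitmap ^= bit                 # toggle the bit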
It seems you need some kind of bit-set. I'm not sure if this example entirely suits your needs, but it is worth a try.
Perhaps you could look into a lossless compression scheme for your data? Presumably there will be a lot of redundancy in such a list.
I would like to flip from big to little endian this string:
\x00\x40
to have it like this:
\x40\x00
I guess the proper function to use would be struct.pack, but I can't find a way to make it properly work.
Any help would be much appreciated!
Thanks
You're not showing the whole code, so the simplest solution would be:
data = data[1] + data[0]
If you insist on using struct:
>>> from struct import pack, unpack
>>> unpack('<H', '\x12\x13')
(4882,)
>>> pack('>H', *unpack('<H', '\x12\x13'))
'\x13\x12'
Which first unpacks the string as a little-endian unsigned short, and then packs it back as big-endian unsigned short. You can have it the other way around, of course. When converting between BE and LE it doesn't matter which way you're converting - the conversion function is bi-directional.
data[::-1] works for any number of bytes.
little_endian = big_endian[1] + big_endian[0]
I could imagine that what you really want to do here is convert the input data from big-endian (or network) byte order to your host byte order (whatever that may be).
>>> from struct import unpack
>>> result = unpack('>H', '\x00\x40')
This would be a more portable approach than just swapping, which is bound to fail when the code is moved to a big endian machine that would not need to swap at all.
I'm tasked with reading a poorly formatted binary file and taking in the variables. Although I need to do it in C++ (ROOT, specifically), I've decided to do it in Python because Python makes sense to me, but my plan is to get it working in Python and then tackle rewriting it in C++, so using easy-to-use Python modules won't get me too far down the road.
Basically, I do this:
In [5]: some_value
Out[5]: '\x00I'
In [6]: ''.join([str(ord(i)) for i in some_value])
Out[6]: '073'
In [7]: int(''.join([str(ord(i)) for i in some_value]))
Out[7]: 73
And I know there has to be a better way. What do you think?
EDIT:
A bit of info on the binary format.
(Screenshots of the file format specification: http://grab.by/3njm , http://grab.by/3njv , http://grab.by/3nkL)
This is the endian test I am using:
# Read a uint32 for endianness
endian_test = rq1_file.read(uint32)
if endian_test == '\x04\x03\x02\x01':
    print "Endian test: \\x04\\x03\\x02\\x01"
    swapbits = True
elif endian_test == '\x01\x02\x03\x04':
    print "Endian test: \\x01\\x02\\x03\\x04"
    swapbits = False
Your int(''.join([str(ord(i)) for i in some_value])) works ONLY when all bytes except the last byte are zero.
Examples:
'\x01I' should be 1 * 256 + 73 == 329; you get 173
'\x01\x02' should be 1 * 256 + 2 == 258; you get 12
'\x01\x00' should be 1 * 256 + 0 == 256; you get 10
It also relies on an assumption that integers are stored in bigendian fashion; have you verified this assumption? Are you sure that '\x00I' represents the integer 73, and not the integer 73 * 256 + 0 == 18688 (or something else)? Please let us help you verify this assumption by telling us what brand and model of computer and what operating system were used to create the data.
How are negative integers represented?
Do you need to deal with floating-point numbers?
Is the requirement to write it in C++ immutable? What does "(ROOT, specifically)" mean?
If the only dictate is common sense, the preferred order would be:
Write it in Python using the struct module.
Write it in C++ but use C++ library routines (especially if floating-point is involved). Don't re-invent the wheel.
Roll your own conversion routines in C++. You could snarf a copy of the C source for the Python struct module.
Update
Comments after the file format details were posted:
The endianness marker is evidently optional, except at the start of a file. This is dodgy; it relies on the fact that if it is not there, the 3rd and 4th bytes of the block are the 1st 2 bytes of the header string, and neither '\x03\x04' nor '\x02\x01' can validly start a header string. The smart thing to do would be to read SIX bytes -- if first 4 are the endian marker, the next two are the header length, and your next read is for the header string; otherwise seek backwards 4 bytes then read the header string.
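A hedged sketch of that read-six-bytes idea, assuming the header length is a uint16 in the file's byte order, that big-endian is the unswapped case, and that f is the open file object:

import struct

def read_block_header(f, byte_order='>'):
    start = f.tell()
    first6 = f.read(6)
    if first6[:4] == '\x01\x02\x03\x04':
        byte_order = '>'                 # marker present, data not byte-swapped (assumed big-endian)
        length_bytes = first6[4:6]
    elif first6[:4] == '\x04\x03\x02\x01':
        byte_order = '<'                 # marker present, data is byte-swapped
        length_bytes = first6[4:6]
    else:
        f.seek(start)                    # no marker: seek back and read only the length
        length_bytes = f.read(2)
    (header_len,) = struct.unpack(byte_order + 'H', length_bytes)
    return byte_order, f.read(header_len)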
The above is in the nuisance category. The negative sizes are a real worry, in that they specify a MAXIMUM length, and there is no mention of how the ACTUAL length is determined. It says "The actual size of the entry is then given line by line". How? There is no documentation of what a "line of data" looks like. The description mentions "lines" many times; are these lines terminated by carriage return and/or line feed? If so, how does one tell the difference between say a line feed byte and the first byte of say a uint16 that belongs to the current "line" of data? If no linefeed or whatever, how does one know when the current line of data is finished? Is there a uintNN size in front of every variable or slice thereof?
Then it says that (2) above (negative size) also applies to the header string. The mind boggles. Do you have any examples (in documentation of the file layout, or in actual files) of "negative size" of (a) header string (b) data "line"?
Is this "decided format" publically available e.g. documentation on the web? Does the format have a searchable name? Are you sure you are the first person in the world to want to read that format?
Reading that file format, even with a full specification, is no trivial exercise, even for a binary-format-experienced person who's also experienced with Python (which BTW doesn't have a float128). How many person-hours have you been allocated for the task? What are the penalties for (a) delay (b) failure?
Your original question involved fixing your interesting way of trying to parse a uint16 -- doing much more is way outside the scope/intention of what SO questions are all about.
You're basically computing a "number-in-base-256", which is a polynomial, so, by Horner's method:
>>> v = 0
>>> for c in someval: v = v * 256 + ord(c)
More typical would be to use equivalent bit-operations rather than arithmetic -- the following's equivalent:
>>> v = 0
>>> for c in someval: v = v << 8 | ord(c)
import struct
result, = struct.unpack('>H', some_value)
The equivalent to the Python struct module is a C struct and/or union, so being afraid to use it is silly.
I'm not exactly sure what the format of the data you want to extract is, but maybe you'd be better off just writing a couple of generic utility functions to extract the different data types you need:
def int1b(data, i):
    return ord(data[i])

def int2b(data, i):
    return (int1b(data, i) << 8) + int1b(data, i+1)

def int4b(data, i):
    return (int2b(data, i) << 16) + int2b(data, i+2)
With such functions you can easily extract values from the data and they also can be translated rather easily to C.
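For instance, with the '\x00I' value from the question:

>>> data = '\x00I\x01\x02\x03\x04'
>>> int2b(data, 0)     # the uint16 the question was originally parsing
73
>>> int4b(data, 2)     # 0x01020304
16909060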
I'm working on a program where I store some data in an integer and process it bitwise. For example, I might receive the number 48, which I will process bit-by-bit. In general the endianness of integers depends on the machine representation of integers, but does Python do anything to guarantee that the ints will always be little-endian? Or do I need to check endianness like I would in C and then write separate code for the two cases?
I ask because my code runs on a Sun machine and, although the one it's running on now uses Intel processors, I might have to switch to a machine with Sun processors in the future, which I know is big-endian.
Python's int has the same endianness as the processor it runs on. The struct module lets you convert byte blobs to ints (and vice versa, and some other data types too) in either native, little-endian, or big-endian ways, depending on the format string you choose: start the format with '@' or no endianness character to use native endianness (and native sizes -- everything else uses standard sizes), '=' for native byte order with standard sizes, '<' for little-endian, '>' or '!' for big-endian.
This is byte-by-byte, not bit-by-bit; I'm not sure exactly what you mean by bit-by-bit processing in this context, but I assume it can be accommodated similarly.
For fast "bulk" processing in simple cases, consider also the array module -- the fromstring and tostring methods can operate on large number of bytes speedily, and the byteswap method can get you the "other" endianness (native to non-native or vice versa), again rapidly and for a large number of items (the whole array).
If you need to process your data 'bitwise' then the bitstring module might be of help to you. It can also deal with endianness between platforms.
The struct module is the best standard method of dealing with endianness between platforms. For example, this packs and unpacks the integers 1, 2, 3 into two 'shorts' and one 'long' (2 and 4 bytes on most platforms) using native endianness (the byte strings shown below come from a big-endian machine; little-endian results will differ):
>>> from struct import *
>>> pack('hhl', 1, 2, 3)
'\x00\x01\x00\x02\x00\x00\x00\x03'
>>> unpack('hhl', '\x00\x01\x00\x02\x00\x00\x00\x03')
(1, 2, 3)
To check the endianness of the platform programmatically you can use
>>> import sys
>>> sys.byteorder
which will either return "big" or "little".
The following snippet will tell you if your system default is little endian (otherwise it is big-endian)
import struct
little_endian = (struct.unpack('<I', struct.pack('=I', 1))[0] == 1)
Note, however, this will not affect the behavior of bitwise operators: 1<<1 is equal to 2 regardless of the default endianness of your system.
Check when?
When doing bitwise operations, the int you get out will have the same endianness as the ints you put in. You don't need to check that. You only need to care about this when converting to/from sequences of bytes, in both languages, afaik.
In Python you use the struct module for this, most commonly struct.pack() and struct.unpack().
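For example, packing and unpacking with explicit byte orders:

>>> import struct
>>> struct.pack('>I', 1)    # big-endian, regardless of the host
'\x00\x00\x00\x01'
>>> struct.pack('<I', 1)    # little-endian, regardless of the host
'\x01\x00\x00\x00'
>>> struct.unpack('>I', '\x00\x00\x00\x01')[0]
1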
I'm reading some MPEG Transport Stream protocol over UDP and it has some funky bitfields in it (length 13 for example). I'm using the "struct" library to do the broad unpacking, but is there a simple way to say "Grab the next 13 bits" rather than have to hand-tweak the bit manipulation? I'd like something like the way C does bit fields (without having to revert to C).
Suggestions?
The bitstring module is designed to address just this problem. It will let you read, modify and construct data using bits as the basic building blocks. The latest versions are for Python 2.6 or later (including Python 3) but version 1.0 supported Python 2.4 and 2.5 as well.
A relevant example for you might be this, which strips out all the null packets from a transport stream (and quite possibly uses your 13 bit field?):
from bitstring import Bits, BitStream

# Opening from a file means that it won't be all read into memory
s = Bits(filename='test.ts')
outfile = open('test_nonull.ts', 'wb')
# Cut the stream into 188 byte packets
for packet in s.cut(188*8):
    # Take a 13 bit slice and interpret as an unsigned integer
    PID = packet[11:24].uint
    # Write out the packet if the PID doesn't indicate a 'null' packet
    if PID != 8191:
        # The 'bytes' property converts back to a string.
        outfile.write(packet.bytes)
Here's another example including reading from bitstreams:
# You can create from hex, binary, integers, strings, floats, files...
# This has a hex code followed by two 12 bit integers
s = BitStream('0x000001b3, uint:12=352, uint:12=288')
# Append some other bits
s += '0b11001, 0xff, int:5=-3'
# read back as 32 bits of hex, then two 12 bit unsigned integers
start_code, width, height = s.readlist('hex:32, 2*uint:12')
# Skip some bits then peek at next bit value
s.pos += 4
if s.peek(1):
    flags = s.read(9)
You can use standard slice notation to slice, delete, reverse, overwrite, etc. at the bit level, and there are bit level find, replace, split etc. functions. Different endiannesses are also supported.
# Replace every '1' bit by 3 bits
s.replace('0b1', '0b001')
# Find all occurrences of a bit sequence
bitposlist = list(s.findall('0b01000'))
# Reverse bits in place
s.reverse()
The full documentation is here.
It's an often-asked question. There's an ASPN Cookbook entry on it that has served me in the past.
And there is an extensive page of requirements one person would like to see from a module doing this.