In Python, read chunks of a file as decimal numbers - python

My input files could be arbitrary, and so I will use
f = open("in-file", 'rb')
The chunk size is about 4K Bytes, and so I will use
f.read(4096)
What I want to do is to read chunks by chunks from the file.
Moreover, as chunk is actually a $2^15$-bit (4KB) sequence, when reading a chunk, I need to transform it into a decimal value for further computation.
For example, if the first chunk is of form 0000...10, what I want is having another variable keeping the corresponding decimal value, eg., x=2.
From Convert string to list of bits and viceversa I know that its code can help me read chunks by chunks.
def tobits(s):
result = []
for c in s:
bits = bin(ord(c))[2:]
bits = '00000000'[len(bits):] + bits
result.extend([int(b) for b in bits])
return result
However, I don't know how to transform the output list into decimal value. Could someone give me some sample code? Thank you.

By referencing http://code.activestate.com/recipes/510399-byte-to-hex-and-hex-to-byte-string-conversion/ I found that the following code probably will run faster because it seems to be no arithmetic involved.
def ByteToHex( byteStr ):
return ''.join( [ "%02X " % ord( x ) for x in byteStr ] ).strip()
Therefore, the task of, for example, reading 2-byte chunks as decimal numbers can be accomplished by the following code:
in_file=open("in-file", "rb")
piece = in_file.read(2)
a=ByteToHex(piece)
a=int(a,16)

If I understand the question right, you want something like the following:
def bytes_to_long(bytes):
result = 0l
for c in bytes:
result *= 256
result += ord(c)
return result
That said, it's likely this is going to be somewhat slow, 4kB is a fairly big long and a lot of garbage ones are going to be created. You could probably improve this by using struct.unpack() and processing more than one byte per iteration, but then you have to deal with the right endianness and everything. On Python 3 you also probably don't need the ord() since it should return the bytes type from IO methods.

Related

Saving and loading bits/bytes in Python

I've been studying compression algorithms recently, and I'm trying to understand how I can store integers as bits in Python to save space.
So first I save '1' and '0' as strings in Python.
import os
import numpy as np
array= np.random.randint(0, 2, size = 200)
string = [str(i) for i in array]
with open('testing_int.txt', 'w') as f:
for i in string:
f.write(i)
print(os.path.getsize('testing_int.txt'))
I get back 200 bytes which makes sense, since each each char is represented by one byte in ascii (and utf-8 as well if characters are latin?).
Now if trying to save these ones and zeroes as bits, I should only take up around 25 bytes right?
200 bits/8 = 25 bytes.
However, when I try the following code below, I get 105 bytes.
Am I doing something wrong?
Using the same 'array variable' as above I tried this:
bytes_string = [bytes(i) for i in array]
with open('testing_bytes.txt', 'wb') as f:
for i in bytes_string:
f.write(i)
Then I tried this:
bin_string = [bin(i) for i in array]
with open('testing_bin.txt', 'wb') as f:
for i in bytes_string:
f.write(i)
This also takes up around 105 bytes.
So I tried looking at the text files, and I noticed that
both the 'bytes.txt' and 'bin.txt' are blank.
So I tried to read the 'bytes.txt' file via this code:
with open(r"C:\Users\Moondra\Desktop\testing_bytes\testing_bytes.txt", 'rb') as f:
x =f.read()
Now I get get back as this :
b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
So I tried these commands:
>>> int.from_bytes(x, byteorder='big')
0
>>> int.from_bytes(x, byteorder='little')
0
>>>
So apparently I'm doing multiple things incorrectly.
I can't figure out:
1) Why I am not getting a text file that is 25 bytes
2) Why can I read back the bytes file correctly.
Thank you.
bytes_string = [bytes(i) for i in array]
It looks like you expect bytes(x) to give you a one-byte bytes object with the value of x. Follow the documentation, and you'll see that bytes() is initialized like bytearray(), and bytearray() says this about its argument:
If it is an integer, the array will have that size and will be initialized with null bytes.
So bytes(0) gives you an empty bytes object, and bytes(1) gives you a single byte with the ordinal zero. That's why bytes_string is about half the size of array and is made up completely of zero bytes.
As for why the bin() example didn't work, it looks like a simple case of copy-pasting and forgetting to change bytes_string to bin_string in the for loop.
This all still doesn't accomplish your goal of treating 0 or 1 value integers as bits. Python doesn't really have that sort of functionality built in. There are third-party modules that allow you to work at the bit level, but I can't speak to any of them specifically. Personally I would probably just roll my own specific to the application.
It looks like you're trying to bit shift all the values into a single byte. For example, you expect the integer values [0,1,0,1,0,1,0,1] to be packed into a byte that looks like the following binary number: 0b01010101. To do this, you need to use the bitwise shift operator and bitwise or operator along with the struct module to pack the values into an unsigned Char which represents the sequence of int values you have.
The code below takes the array of random integers in range [0,1] and shifts them together to make a binary number that can be packed into a single byte. I used 256 ints for convenience. The expected number of bytes for the file to be is then 32 (256/8). You will see that when it is run this is indeed what you get.
import struct
import numpy as np
import os
a = np.random.randint(0, 2, size = 256)
bool_data = []
bin_vals = []
for i in range(0, len(a), 8):
bin_val = (a[i] << 0) | (a[i+1] << 1) | \
(a[i+2] << 2) | (a[i+3] << 3) | \
(a[i+4] << 4) | (a[i+5] << 5) | \
(a[i+6] << 6) | (a[i+7] << 7)
bin_vals.append(struct.pack('B', bin_val))
with open("output.txt", 'wb') as f:
for val in bin_vals:
f.write(val)
print(os.path.getsize('output.txt'))
Please note, however, that this will only work for values of integers in the range [0,1] since if they are bigger it will shift more non-zeros and wreck the structure of the generated byte. The binary number may also exceed 1 byte in size in this case.
It seems like you're just using python in attempt to generate an array of bits for demonstration purposes, and to that token I would say that python probably isn't best suited for this. I would recommend using a lower level language such as C/C++ which has more direct access to data type than python does.

Reading binary file with python without knowing structure

I have a binary file containing the position of 8000 particles.
I know that each particle value should look like "-24.6151..." (I don't know with which precision the values are given by my program. I guess it is double precision(?).
But when I try to read the file with this code:
In: with open('.//results0epsilon/energybinary/energy_00004.dat', 'br') as f:
buffer = f.read()
print ("Lenght of buffer is %d" % len(buffer))
for i in buffer:
print(int(i))
I get as output:
Lenght of buffer is 64000
10
168
179
43
...
I skip the whole list of values but as you can see those values are far away from what I expect. I think I have some kind of decoding error.
I would appreciate any kind of help :)
What you are printing now are the bytes composing your floating point data. So it doesn't make sense as numerical values.
Of course, there's no 100% sure answer since we didn't see your data, but I'll try to guess:
You have 8000 values to read and the file size is 64000. So you probably have double IEEE values (8 bytes each). If it's not IEEE, then you're toast.
In that case you could try the following:
import struct
with open('.//results0epsilon/energybinary/energy_00004.dat', 'br') as f:
buffer = f.read()
print ("Length of buffer is %d" % len(buffer))
data = struct.unpack("=8000d",buffer)
if the data is printed bogus, it's probably an endianness problem. So change the =8000 by <8000 or >8000.
for reference and packing/unpacking formats: https://docs.python.org/3/library/struct.html

STL binary file reader with Python

I'm trying to write my "personal" python version of STL binary file reader, according to WIKIPEDIA : A binary STL file contains :
an 80-character (byte) headern which is generally ignored.
a 4-byte unsigned integer indicating the number of triangular facets in the file.
Each triangle is described by twelve 32-bit floating-point numbers: three for the normal and then three for the X/Y/Z coordinate of each vertex – just as with the ASCII version of STL. After these follows a 2-byte ("short") unsigned integer that is the "attribute byte count" – in the standard format, this should be zero because most software does not understand anything else. --Floating-point numbers are represented as IEEE floating-point numbers and are assumed to be little-endian--
Here is my code :
#! /usr/bin/env python3
with open("stlbinaryfile.stl","rb") as fichier :
head=fichier.read(80)
nbtriangles=fichier.read(4)
print(nbtriangles)
The output is :
b'\x90\x08\x00\x00'
It represents an unsigned integer, I need to convert it without using any package (struct,stl...). Are there any (basic) rules to do it ?, I don't know what does \x mean ? How does \x90 represent one byte ?
most of the answers in google mention "C structs", but I don't know nothing about C.
Thank you for your time.
Since you're using Python 3, you can use int.from_bytes. I'm guessing the value is stored little-endian, so you'd just do:
nbtriangles = int.from_bytes(fichier.read(4), 'little')
Change the second argument to 'big' if it's supposed to be big-endian.
Mind you, the normal way to parse a fixed width type is the struct module, but apparently you've ruled that out.
For the confusion over the repr, bytes objects will display ASCII printable characters (e.g. a) or standard ASCII escapes (e.g. \t) if the byte value corresponds to one of them. If it doesn't, it uses \x##, where ## is the hexadecimal representation of the byte value, so \x90 represents the byte with value 0x90, or 144. You need to combine the byte values at offsets to reconstruct the int, but int.from_bytes does this for you faster than any hand-rolled solution could.
Update: Since apparent int.from_bytes isn't "basic" enough, a couple more complex, but only using top-level built-ins (not alternate constructors) solutions. For little-endian, you can do this:
def int_from_bytes(inbytes):
res = 0
for i, b in enumerate(inbytes):
res |= b << (i * 8) # Adjust each byte individually by 8 times position
return res
You can use the same solution for big-endian by adding reversed to the loop, making it enumerate(reversed(inbytes)), or you can use this alternative solution that handles the offset adjustment a different way:
def int_from_bytes(inbytes):
res = 0
for b in inbytes:
res <<= 8 # Adjust bytes seen so far to make room for new byte
res |= b # Mask in new byte
return res
Again, this big-endian solution can trivially work for little-endian by looping over reversed(inbytes) instead of inbytes. In both cases inbytes[::-1] is an alternative to reversed(inbytes) (the former makes a new bytes in reversed order and iterates that, the latter iterates the existing bytes object in reverse, but unless it's a huge bytes object, enough to strain RAM if you copy it, the difference is pretty minimal).
The typical way to interpret an integer is to use struct.unpack, like so:
import struct
with open("stlbinaryfile.stl","rb") as fichier :
head=fichier.read(80)
nbtriangles=fichier.read(4)
print(nbtriangles)
nbtriangles=struct.unpack("<I", nbtriangles)
print(nbtriangles)
If you are allergic to import struct, then you can also compute it by hand:
def unsigned_int(s):
result = 0
for ch in s[::-1]:
result *= 256
result += ch
return result
...
nbtriangles = unsigned_int(nbtriangles)
As to what you are seeing when you print b'\x90\x08\x00\x00'. You are printing a bytes object, which is an array of integers in the range [0-255]. The first integer has the value 144 (decimal) or 90 (hexadecimal). When printing a bytes object, that value is represented by the string \x90. The 2nd has the value eight, represented by \x08. The 3rd and final integers are both zero. They are presented by \x00.
If you would like to see a more familiar representation of the integers, try:
print(list(nbtriangles))
[144, 8, 0, 0]
To compute the 32-bit integers represented by these four 8-bit integers, you can use this formula:
total = byte0 + (byte1*256) + (byte2*256*256) + (byte3*256*256*256)
Or, in hex:
total = byte0 + (byte1*0x100) + (byte2*0x10000) + (byte3*0x1000000)
Which results in:
0x00000890
Perhaps you can see the similarities to decimal, where the string "1234" represents the number:
4 + 3*10 + 2*100 + 1*1000

Python writing binary

I use python 3
I tried to write binary to file I use r+b.
for bit in binary:
fileout.write(bit)
where binary is a list that contain numbers.
How do I write this to file in binary?
The end file have to look like
b' x07\x08\x07\
Thanks
When you open a file in binary mode, then you are essentially working with the bytes type. So when you write to the file, you need to pass a bytes object, and when you read from it, you get a bytes object. In contrast, when opening the file in text mode, you are working with str objects.
So, writing “binary” is really writing a bytes string:
with open(fileName, 'br+') as f:
f.write(b'\x07\x08\x07')
If you have actual integers you want to write as binary, you can use the bytes function to convert a sequence of integers into a bytes object:
>>> lst = [7, 8, 7]
>>> bytes(lst)
b'\x07\x08\x07'
Combining this, you can write a sequence of integers as a bytes object into a file opened in binary mode.
As Hyperboreus pointed out in the comments, bytes will only accept a sequence of numbers that actually fit in a byte, i.e. numbers between 0 and 255. If you want to store arbitrary (positive) integers in the way they are, without having to bother about knowing their exact size (which is required for struct), then you can easily write a helper function which splits those numbers up into separate bytes:
def splitNumber (num):
lst = []
while num > 0:
lst.append(num & 0xFF)
num >>= 8
return lst[::-1]
bytes(splitNumber(12345678901234567890))
# b'\xabT\xa9\x8c\xeb\x1f\n\xd2'
So if you have a list of numbers, you can easily iterate over them and write each into the file; if you want to extract the numbers individually later you probably want to add something that keeps track of which individual bytes belong to which numbers.
with open(fileName, 'br+') as f:
for number in numbers:
f.write(bytes(splitNumber(number)))
where binary is a list that contain numbers
A number can have one thousand and one different binary representations (endianess, width, 1-complement, 2-complement, floats of different precision, etc). So first you have to decide in which representation you want to store your numbers. Then you can use the struct module to do so.
For example the byte sequence 0x3480 can be interpreted as 32820 (little-endian unsigned short), or -32716 (little-endian signed short) or 13440 (big-endian short).
Small example:
#! /usr/bin/python3
import struct
binary = [1234, 5678, -9012, -3456]
with open('out.bin', 'wb') as f:
for b in binary:
f.write(struct.pack('h', b)) #or whatever format you need
with open('out.bin', 'rb') as f:
content = f.read()
for b in content:
print(b)
print(struct.unpack('hhhh', content)) #same format as above
prints
210
4
46
22
204
220
128
242
(1234, 5678, -9012, -3456)

Python Binary conversion

I'm trying to convert some binary output from a file to different types, and I keep seeing odd things.
For instance, I have:
value = '\x11'
If you do
bin(ord(value))
you get the output
'0b10001'
whereas I was hoping to get
'0b00010001'
I'm basically trying to read in a 32 byte header, turn it into 1's and 0's, so I can grab various bits that have different meanings.
Why not just use bitwise operators?
def is_bit_set(i, x):
"""Check if the i-th bit in x is set"""
return x & (1 << i) > 0
To get the desired output, try:
"0b{:08b}".format(ord(value))
If efficiency is your concern, it's recommended to use native binary representation instead of literal(string) binary representation, for bitwise operation is much more compact and efficient.
format(ord('\x11'), '08b') will get you 00010001, which should be close enough to what you want.

Categories

Resources