I'm converting a MATLAB script to NumPy, but I'm having some problems reading data from a binary file. Is there an equivalent to fseek when using fromfile to skip the beginning of the file? This is the type of extraction I need to do:
fid = fopen(fname);
fseek(fid, 8, 'bof');
second = fread(fid, 1, 'schar');
fseek(fid, 100, 'bof');
total_cycles = fread(fid, 1, 'uint32', 0, 'l');
start_cycle = fread(fid, 1, 'uint32', 0, 'l');
Thanks!
You can use seek with a file object in the normal way, and then use this file object in fromfile. Here's a full example:
import numpy as np
import os

data = np.arange(100, dtype=np.int32)
data.tofile("temp")                  # save the data
f = open("temp", "rb")               # reopen the file
f.seek(256, os.SEEK_SET)             # seek past the first 256 bytes (64 int32 values)
x = np.fromfile(f, dtype=np.int32)   # read the rest of the data into numpy
print(x)
# [64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88
#  89 90 91 92 93 94 95 96 97 98 99]
There probably is a better answer… but when I've been faced with this problem, I had a file that I already wanted to access different parts of separately, which gave me an easy solution.
For example, say chunkyfoo.bin is a file consisting of a 6-byte header, a 1024-byte numpy array, and another 1024-byte numpy array. You can't just open the file and seek 6 bytes (because the first thing numpy.fromfile does is lseek back to 0). But you can just mmap the file and use frombuffer instead:
import mmap
import numpy as np
from contextlib import closing

with open('chunkyfoo.bin', 'rb') as f:
    with closing(mmap.mmap(f.fileno(), length=0, access=mmap.ACCESS_READ)) as m:
        a1 = np.frombuffer(m[6:1030])
        a2 = np.frombuffer(m[1030:])
This sounds like exactly what you want to do. Except, of course, that in real life the offset and length of a1 and a2 probably depend on the header, rather than being fixed constants.
The header is just m[:6], and you can parse that by explicitly pulling it apart, using the struct module, or whatever else you'd do once you read the data. But, if you'd prefer, you can explicitly seek and read from f before constructing m, or after, or even make the same calls on m, and it will work, without affecting a1 and a2.
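For instance, a minimal sketch of pulling the header apart with struct, assuming (purely for illustration) that the 6 bytes are a little-endian uint16 followed by a uint32:
import struct

# Hypothetical header layout: a little-endian uint16 "version" followed by
# a uint32 "record count". Adjust the format string to the real header.
version, n_records = struct.unpack('<HI', m[:6])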
An alternative, which I've done for a different non-numpy-related project, is to create a wrapper file object, like this:
class SeekedFileWrapper(object):
    def __init__(self, fileobj):
        self.fileobj = fileobj
        self.offset = fileobj.tell()
    def seek(self, offset, whence=0):
        if whence == 0:
            offset += self.offset
        return self.fileobj.seek(offset, whence)
    # ... delegate everything else unchanged
I did the "delegate everything else unchanged" by generating a list of attributes at construction time and using that in __getattr__, but you probably want something less hacky. numpy only relies on a handful of methods of the file-like object, and I think they're properly documented, so just explicitly delegate those. But I think the mmap solution makes more sense here, unless you're trying to mechanically port over a bunch of explicit seek-based code. (You'd think mmap would also give you the option of leaving it as a numpy.memmap instead of a numpy.array, which lets numpy have more control over/feedback from the paging, etc. But it's actually pretty tricky to get a numpy.memmap and an mmap to work together.)
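That said, for simple fixed layouts numpy.memmap alone can get you much of the way, since it accepts a byte offset directly; here's a sketch, assuming (only for illustration) that the first 1024-byte array holds float32 values:
import numpy as np

# Map the first 1024-byte array, skipping the 6-byte header.
# float32 is assumed here, so 1024 bytes -> 256 values.
a1 = np.memmap('chunkyfoo.bin', dtype=np.float32, mode='r', offset=6, shape=(256,))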
This is what I do when I have to read arbitrary data from a heterogeneous binary file.
NumPy allows you to interpret a bit pattern in an arbitrary way by changing the dtype of the array.
The MATLAB code in the question reads a char and two uint32 values.
Read this paper (easy reading at the user level, not aimed at scientists) on what one can achieve by changing the dtype, strides, and dimensionality of an array.
import numpy as np

data = np.arange(10, dtype=np.int32)
data.tofile('f')

x = np.fromfile('f', dtype='u1')
print(x.size)
# 40

second = x[8]
print('second', second)
# second 2

total_cycles = x[8:12]
print('total_cycles', total_cycles)
total_cycles.dtype = np.dtype('u4')
print('total_cycles', total_cycles)
# total_cycles [2 0 0 0]  !endianness
# total_cycles [2]

start_cycle = x[12:16]
start_cycle.dtype = np.dtype('u4')
print('start_cycle', start_cycle)
# start_cycle [3]

x.dtype = np.dtype('u4')
print('x', x)
# x [0 1 2 3 4 5 6 7 8 9]

x[3] = 423
print('start_cycle', start_cycle)
# start_cycle [423]
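As an aside, the same reinterpretation can be done without mutating dtype in place: .view() returns a new array object that shares the same memory (this sketch assumes a little-endian machine, matching the comment above):
x = np.fromfile('f', dtype='u1')          # re-read the raw bytes
total_cycles = x[8:12].view(np.uint32)    # array([2], dtype=uint32)
start_cycle = x[12:16].view(np.uint32)    # array([3], dtype=uint32)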
There is a fairly new feature of numpy.fromfile(): the offset argument.
offset : int
    The offset (in bytes) from the file's current position. Defaults to 0. Only permitted for binary files.
    New in version 1.17.0.
import numpy as np

data = np.arange(100, dtype=np.int32)
data.tofile("temp")  # save the data
x = np.fromfile("temp", dtype=np.int32, offset=256)  # use the offset
print(x)
# [64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88
# 89 90 91 92 93 94 95 96 97 98 99]
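Applied to the extraction in the question, that would look something like this (a sketch; fname and the little-endian layout come from the MATLAB snippet):
import numpy as np

# 'schar' at byte 8 -> one int8; two little-endian uint32 values starting at byte 100
second = np.fromfile(fname, dtype=np.int8, count=1, offset=8)[0]
total_cycles, start_cycle = np.fromfile(fname, dtype='<u4', count=2, offset=100)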
Related
I was learning about threads, so I made a simple program in Sublime Text.
import time
from threading import Thread

def func():
    for a in range(1, 101):
        print(a)

Threads = []
for i in range(25):
    t = Thread(target=func)
    Threads.append(t)

for i in Threads:
    i.start()
for i in Threads:
    i.join()
But after a few minutes, I started to get annoyed by the poor quality of the autocompletion.
So I switched to PyCharm Edu, and something weird happened with the output. In cmd it looked like this
60
60
97
68
58
59
70
71
74
95
89
68
53
92
91
92
93
99
100
89
96
and in PyCharm the output was
6545
46
47
54
76
775981
66
6555
55
608264
67
48
I don't understand what's going on.
print actually performs two distinct writes to stdout: the text, and then a newline. So print(a) amounts to this:
sys.stdout.write(str(a))
sys.stdout.write('\n')
If multiple threads now write at the same time, the result is similar to this:
sys.stdout.write(str(a))
sys.stdout.write('\n')
sys.stdout.write(str(a))
sys.stdout.write('\n')
Or, sometimes:
sys.stdout.write(str(a))
sys.stdout.write(str(a))
sys.stdout.write('\n')
sys.stdout.write('\n')
So you get two numbers on one line and then two newlines.
The easiest fix is to join the strings & newline before using print:
def func():
    for a in range(1, 101):
        print(f'{a}\n', end='')
This produces the correct result.
(as to why this didn't happen in CMD: probably just luck)
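Another option (just a sketch of an alternative, not something the code needs beyond the fix above) is to serialize the writes with a lock, so only one thread touches stdout at a time:
from threading import Thread, Lock

print_lock = Lock()

def func():
    for a in range(1, 101):
        with print_lock:   # only one thread may write at a time
            print(a)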
I have a piece of code where I need to look for an index of a value in a numpy array.
For this task, I use numpy.where.
The problem is that numpy.where produces a wrong result, i.e. returns an empty array, in situations where I am certain that the searched value is in the array.
To make things worse, I tested with a for loop that the element really is in the array, and in case it is found, I also look for it with numpy.where.
Oddly enough, it then finds a result, while literally a line later, it doesn't.
Here is what the code looks like:
# progenitors, descendants and progenitor_outputnrs are 2D arrays that are
# filled by reading in files. outputnrs is a 1D array.
ozi = 0
for i in range(descendants[ozi].shape[0]):
    if descendants[ozi][i] > 0:
        if progenitors[ozi][i] < 0:
            oind = outputnrs[0] - progenitor_outputnrs[ozi][i] - 1
            print "looking for prog", progenitors[ozi][i], "with outputnr", progenitor_outputnrs[ozi][i], "in", outputnrs[oind]
            for p in progenitors[oind]:
                if p == -progenitors[ozi][i]:
                    # the following line works...
                    print "found", p, np.where(progenitors[oind] == -progenitors[ozi][i])[0][0]
                    # the following line doesn't!
                    iind = np.where(progenitors[oind] == -progenitors[ozi][i])[0][0]
I get the output:
looking for prog -76 with outputnr 65 in 66
found 76 79
looking for prog -2781 with outputnr 65 in 66
found 2781 161
looking for prog -3797 with outputnr 63 in 64
found 3797 163
looking for prog -3046 with outputnr 65 in 66
found 3046 163
looking for prog -6488 with outputnr 65 in 66
found 6488 306
Traceback (most recent call last):
File "script.py", line 1243, in <module>
main()
File "script.py", line 974, in main
iind = np.where(progenitors[oind]==-progenitors[out][i])[0][0]
IndexError: index 0 is out of bounds for axis 0 with size 0
I use python 2.7.12 and numpy 1.14.2.
Does anyone have an idea why this is happening?
I'm trying to import CAN data using a virtual CAN network and am getting strange results when I unpack my CAN packet of data. I'm using Python 3.3.7
Code:
import socket, sys, struct

sock = socket.socket(socket.PF_CAN, socket.SOCK_RAW, socket.CAN_RAW)
interface = "vcan0"

try:
    sock.bind((interface,))
except OSError:
    sys.stderr.write("Could not bind to interface '%s'\n" % interface)

fmt = "<IB3x8s"
while True:
    can_pkt = sock.recv(16)
    can_id, length, data = struct.unpack(fmt, can_pkt)
    can_id &= socket.CAN_EFF_MASK
    data = data[:length]
    print(data, can_id, can_pkt)
So when I have a CAN packet that looks like this:
candump vcan0: vcan0 0FF [8] 77 9C 3C 21 A2 9A B9 66
output in Python: b'\xff\x00\x00\x00\x08\x00\x00\x00w\x9c<!\xa2\x9a\xb9f'
Here vcan0 is the interface, [x] is the number of bytes in the payload, and the rest is the 8-byte hex payload.
Do I have the wrong format string? Has PF_CAN been updated for newer Python versions? Am I using CAN_RAW when I should be using CAN_BCM for my protocol family? Or am I just missing how to decode the unpacked data?
Any direction or answer would be much appreciated.
Also, here are some script outputs compared to can-utils values I've plucked. If I can't find anything, I'm probably just going to collect a ton of data and then decode it for the bytes that don't translate over properly. I feel that I'm overcomplicating things, and possibly missing one key aspect.
Python3 output == can-utils/socketCAN (hex)
M= == 4D 3D
~3 == 7E 33
p == 70
. == 2E
# == 40
r: == 0D 3A
c == 63
5g == 35 67
y == 79
a == 61
) == 29
E == 45
M == 4D
C == 43
P> == 50 3E
SGN == 53 47 4E
8 == 38
Rather than laboriously completing that table you started, just look at any ASCII code chart. When you simply print a bytes string, any bytes that happen to be printable ASCII text are shown as those characters: only the unprintable bytes get shown as hexadecimal escapes. If you want everything in hex, you need to request that explicitly:
import binascii
print(binascii.hexlify(data))
for example.
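On Python 3.5 and later, the bytes type also has a built-in .hex() method that does the same thing, e.g.:
>>> b'M='.hex()
'4d3d'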
I'm sure you've already run into the python-can library? If not, we have a native Python version of socketcan that correctly parses data out of CAN messages. Some of the source might help you out, or you might want to use it directly. CAN_RAW is probably what you want; if you plan on moving from virtual CAN to real hardware, you might also want to get the timestamp from the hardware.
Not all constants have been exposed in Python's socket module, so there is also a ctypes version which makes it easier to experiment with things like the socketcan broadcast manager. Docs for both are here.
I have data that I would like to decode; it's in Windows-1252. Basically I send code to a socket and it sends it back, and I have to decode the message and use IEEE-754 to get a certain value from it, but I can't seem to figure out all this encoding stuff. Here is my code.
import struct
import binascii

def byt1Hex(bytStr):
    return ' '.join(["%02X" % ord(x) for x in bytStr])

def printKinds():
    test = "x40\x39\x19\x99\x99\x99\x99\x9A"
    print(byt1Hex(test))
    test = byt1Hex(test).replace(' ', '')
    struct.unpack('<d', binascii.unhexlify(test))
    print(test)

printKinds()
So I use that, and then I have to get the value from it. But it's not working and I cannot figure out why.
The current output I am getting is
struct.unpack('<d', binascii.unhexlify(data))
struct.error: unpack requires a bytes object of length 8
That is the error; the expected output I am looking for is 25.1,
but when I encode it, it actually changes the string into the wrong values, so when I do this:
print (byt1Hex(data))
I expect to get this.
40 39 19 99 99 99 99 9A
But I actually get this instead
78 34 30 39 19 99 99 99 99 9A
>>> import struct
>>> struct.pack('!d', 25.1)
b'@9\x19\x99\x99\x99\x99\x9a'
>>> struct.unpack('!d', _) #NOTE: no need to call byt1hex, unhexlify
(25.1,)
You send and receive bytes over the network. There is no need to hexlify/unhexlify them unless the protocol requires it (in which case you should mention the protocol in the question).
You have:
test = "x40\x39\x19\x99\x99\x99\x99\x9A"
You need:
test = "\x40\x39\x19\x99\x99\x99\x99\x9A"
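With the backslash restored, and the data treated as raw bytes rather than a hex string, the unpack gives the expected value directly; a minimal sketch:
import struct

test = b"\x40\x39\x19\x99\x99\x99\x99\x9A"   # the 8 raw bytes of a big-endian double
value, = struct.unpack('!d', test)
print(value)   # 25.1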
This is the code I am using to test the memory allocation:
import pycurl
import io
url = "http://www.stackoverflow.com"
buf = io.BytesIO()
print(len(buf.getvalue())) # here I am getting 0 as the length
c = pycurl.Curl()
c.setopt(c.URL, url)
c.setopt(c.CONNECTTIMEOUT, 10)
c.setopt(c.TIMEOUT, 10)
c.setopt(c.ENCODING, 'gzip')
c.setopt(c.FOLLOWLOCATION, True)
c.setopt(c.IPRESOLVE, c.IPRESOLVE_V4)
c.setopt(c.USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:8.0) Gecko/20100101 Firefox/8.0')
c.setopt(c.WRITEFUNCTION, buf.write)
c.perform()
c.close()
print(len(buf.getvalue())) # here: the length of the downloaded file
print(buf.getvalue())
buf.close()
How do I get the length of the buffer/memory allocated by BytesIO?
What am I doing wrong here? Doesn't Python allocate a fixed buffer length?
I am not sure what you mean by allocated buffer/memory length, but if you want the length of the user data stored in the BytesIO object you can do
>>> bio = io.BytesIO()
>>> bio.getbuffer().nbytes
0
>>> bio.write(b'here is some data')
17
>>> bio.getbuffer().nbytes
17
But this seems equivalent to the len(buf.getvalue()) that you are currently using.
The actual size of the BytesIO object can be found using sys.getsizeof():
>>> import sys
>>> bio = io.BytesIO()
>>> sys.getsizeof(bio)
104
Or you could be nasty and call __sizeof__() directly (which is like sys.getsizeof() but without garbage collector overhead applicable to the object):
>>> bio = io.BytesIO()
>>> bio.__sizeof__()
72
Memory for BytesIO is allocated as required, and some buffering does take place:
>>> bio = io.BytesIO()
>>> for i in range(20):
... _=bio.write(b'a')
... print(bio.getbuffer().nbytes, sys.getsizeof(bio), bio.__sizeof__())
...
1 106 74
2 106 74
3 108 76
4 108 76
5 110 78
6 110 78
7 112 80
8 112 80
9 120 88
10 120 88
11 120 88
12 120 88
13 120 88
14 120 88
15 120 88
16 120 88
17 129 97
18 129 97
19 129 97
20 129 97
io.BytesIO() returns a standard file object which has the function tell(). It reports the current descriptor position and, unlike len(bio.getvalue()), does not copy the whole buffer out to compute the total size. It is a very fast and simple way to get the exact size of the used memory in the buffer, provided the descriptor is positioned at the end of the buffer.
However, if you preset your buffer, tell() will point at the beginning of the buffer and return 0, even though the buffer size is not zero. In this case, you can move the pointer to the end of the buffer with seek(0, 2), which reports the total buffer size without copying the whole buffer into another chunk of memory.
I posted, and recently updated, example code and a more detailed answer here.
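A minimal sketch of that tell()/seek() approach (the preset content is just an example):
import io

bio = io.BytesIO(b'preset data')   # buffer pre-populated with 11 bytes
print(bio.tell())                  # 0 -- the position is still at the start
size = bio.seek(0, io.SEEK_END)    # jump to the end; seek() returns the new position
print(size)                        # 11 -- total number of bytes in the buffer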
You can also use tracemalloc to get indirect information about the size of objects, by taking readings from tracemalloc.get_traced_memory() around the operation you want to measure.
Do note that active threads (if any) and side effects of your program will affect the output, but it may also be more representative of the real memory cost if many samples are taken, as shown below.
>>> import tracemalloc
>>> from io import BytesIO
>>> tracemalloc.start()
>>>
>>> memory_traces = []
>>>
>>> with BytesIO() as bytes_fh:
...     # returns (current memory usage, peak memory usage),
...     # but only since calling .start()
...     memory_traces.append(tracemalloc.get_traced_memory())
...     bytes_fh.write(b'a' * (1024**2))  # create 1MB of 'a'
...     memory_traces.append(tracemalloc.get_traced_memory())
...
1048576
>>> print("used_memory = {}b".format(memory_traces[1][0] - memory_traces[0][0]))
used_memory = 1048870b
>>> 1048870 - 1024**2 # show small overhead
294