Similar Question:
What's the correct way to use win32file.ReadFile to get the output from a pipe?
The issue I have, however, was not answered in that question. When I call
result, data = win32file.ReadFile(my_file, 4096, None)
result is always 0, which according to the documentation means success:
The result is a tuple of (hr, string/PyOVERLAPPEDReadBuffer), where hr may be 0,
ERROR_MORE_DATA or ERROR_IO_PENDING.
Even if I set the buffer to 10 and the file is much bigger, the result is 0 and data is a string containing the first 10 characters.
result, buf = win32file.ReadFile(self._handle, 10, None)
while result == winerror.ERROR_MORE_DATA:
    result, data = win32file.ReadFile(self._handle, 2048, None)
    buf += data
    print "Hi"
return result, buf
"Hi" is never printed even if the file clearly contains more data.
The problem I have is: how can I ensure that I'm reading the whole file without using a ridiculously large buffer?
As was already observed, if the win32file.ReadFile result value hr is 0, that means success. This is exactly the opposite of the Win32 API documentation, which says 0 means an error occurred.
To determine how many bytes were read, you need to check the length of the returned string. If it is the same size as the buffer, then there might be more data. If it is smaller, the whole file has been read:
def readLines(self):
    bufSize = 4096
    win32file.SetFilePointer(self._handle, 0, win32file.FILE_BEGIN)
    result, data = win32file.ReadFile(self._handle, bufSize, None)
    buf = data
    while len(data) == bufSize:
        result, data = win32file.ReadFile(self._handle, bufSize, None)
        buf += data
    return buf.split('\r\n')
You need to add error handling to this, e.g. check whether result actually is 0 and, if not, take appropriate measures.
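For illustration, a minimal sketch of such a check (self._handle and bufSize as in the snippet above; what you do on failure is up to your application):

import win32file, winerror

result, data = win32file.ReadFile(self._handle, bufSize, None)
if result not in (0, winerror.ERROR_MORE_DATA, winerror.ERROR_IO_PENDING):
    # treat anything other than the documented return codes as a failure
    raise RuntimeError('ReadFile returned error code %d' % result)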
Use PeekNamedPipe to see how much data is left on the pipe to read.
result, buf = win32file.ReadFile(self._handle, bufSize, None)
_, nAvail, nMessage = win32pipe.PeekNamedPipe(self._handle, 0)
while nAvail > 0:
    result, data = win32file.ReadFile(self._handle, bufSize, None)
    buf += data
    _, nAvail, nMessage = win32pipe.PeekNamedPipe(self._handle, 0)
This will tell you the total amount of data (nAvail) on the pipe and the amount of data left to read for the current message (nMessage). You can then either use a buffer of sufficient size to read the remaining data or read in chunks if you prefer.
If using message mode, you probably want to read a single message at a time. If you can't predict the maximum message size, first read 0 bytes to block until a message is available, then query the message size and read that many bytes:
# Block until data is available (assumes PIPE_WAIT)
result, _ = win32file.ReadFile(self._handle, 0, None)
# result should be 234 (winerror.ERROR_MORE_DATA)
_, nAvail, nMessage = win32pipe.PeekNamedPipe(self._handle, 0)
result, buf = win32file.ReadFile(self._handle, nMessage, None)
If using PIPE_NOWAIT, poll for messages using PeekNamedPipe and only read when nMessage > 0.
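A minimal polling sketch for the PIPE_NOWAIT case, assuming the same self._handle as in the snippets above:

import time
import win32file, win32pipe

while True:
    _, nAvail, nMessage = win32pipe.PeekNamedPipe(self._handle, 0)
    if nMessage > 0:
        # a full message is waiting, so this read will not block
        result, buf = win32file.ReadFile(self._handle, nMessage, None)
        break
    time.sleep(0.01)  # back off briefly instead of busy-waiting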
I tried the solution by beginner_ and got the error message "a bytes-like object is required, not str" when calling buf.split('\r\n').
I printed str(buf) and saw it's in the form "b'bufstring'".
As a hack I changed the line to return str(buf)[2:-1].split('\\r\\n') and it works perfectly now.
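A cleaner alternative on Python 3 is to keep working with bytes, or decode before splitting (a sketch; the encoding is an assumption about your data):

# split while the buffer is still bytes
lines = buf.split(b'\r\n')

# or decode first, then split as str
lines = buf.decode('utf-8', errors='replace').split('\r\n')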
I'm reading the book Hands-On Machine Learning for Algorithmic Trading and I came across a script that is supposed to parse a large .bin binary file and convert it to .h5. The file consists of something called ITCH data; you can find the technical documentation of the data here. The script is very inefficient: it reads a 12 GB (12,952,050,754 bytes) file 2 bytes at a time, which is ultra slow (it might take up to 4 hours on a decent 4-CPU GCP instance), and that is not very surprising. You can find the whole notebook here.
My problem is that I don't understand how this .bin file is being read. I don't see why it is necessary to read the file 2 bytes at a time; I think there is a way to read with a larger buffer size, but I'm not sure how to do it. I could even convert the script to C++ if it is still slow after optimizing, which I can do once I understand the inner workings of this I/O process. Does anyone have suggestions?
Here's a link to the file source of ITCH data; you can find small files (300 MB or less) covering shorter time periods if you need to experiment with the code.
The bottleneck:
with file_name.open('rb') as data:
    while True:
        # determine message size in bytes
        message_size = int.from_bytes(data.read(2), byteorder='big', signed=False)
        # get message type by reading first byte
        message_type = data.read(1).decode('ascii')
        message_type_counter.update([message_type])
        # read & store message
        record = data.read(message_size - 1)
        message = message_fields[message_type]._make(unpack(fstring[message_type], record))
        messages[message_type].append(message)

        # deal with system events
        if message_type == 'S':
            seconds = int.from_bytes(message.timestamp, byteorder='big') * 1e-9
            print('\n', event_codes.get(message.event_code.decode('ascii'), 'Error'))
            print(f'\t{format_time(seconds)}\t{message_count:12,.0f}')
            if message.event_code.decode('ascii') == 'C':
                store_messages(messages)
                break
        message_count += 1

        if message_count % 2.5e7 == 0:
            seconds = int.from_bytes(message.timestamp, byteorder='big') * 1e-9
            d = format_time(time() - start)
            print(f'\t{format_time(seconds)}\t{message_count:12,.0f}\t{d}')
            res = store_messages(messages)
            if res == 1:
                print(pd.Series(dict(message_type_counter)).sort_values())
                break
            messages.clear()
And here's the store_messages() function:
def store_messages(m):
    """Handle occasional storing of all messages"""
    with pd.HDFStore(itch_store) as store:
        for mtype, data in m.items():
            # convert to DataFrame
            data = pd.DataFrame(data)
            # parse timestamp info
            data.timestamp = data.timestamp.apply(int.from_bytes, byteorder='big')
            data.timestamp = pd.to_timedelta(data.timestamp)
            # apply alpha formatting
            if mtype in alpha_formats.keys():
                data = format_alpha(mtype, data)
            s = alpha_length.get(mtype)
            if s:
                s = {c: s.get(c) for c in data.columns}
            dc = ['stock_locate']
            if m == 'R':
                dc.append('stock')
            try:
                store.append(mtype,
                             data,
                             format='t',
                             min_itemsize=s,
                             data_columns=dc)
            except Exception as e:
                print(e)
                print(mtype)
                print(data.info())
                print(pd.Series(list(m.keys())).value_counts())
                data.to_csv('data.csv', index=False)
                return 1
    return 0
According to the code, the file format looks like it's 2 bytes of message size, one byte of message type, and then n bytes of actual message (defined by the previously read message size).
Low-hanging fruit for optimizing this is to read 3 bytes first into a buffer, convert the first two bytes to the message size and the third to the message type, and then read the message.
To further reduce the number of required reads, you could read a fixed amount of data from the file into a buffer and start extracting from it. While extracting, keep an index of already processed bytes, and once that index (or index plus the amount of data to be read) goes past the end of the buffer, refill it. This could lead to huge memory requirements if not done properly, though.
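As a rough sketch of that chunked approach (not the original notebook code; handle_message is a placeholder for whatever you do with each record):

import struct

def parse_itch(path, chunk_size=16 * 1024 * 1024):
    """Read the file in large chunks and slice messages out of memory."""
    buf = b''
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            buf += chunk
            pos = 0
            # consume every complete message currently in the buffer
            while pos + 2 <= len(buf):
                message_size = struct.unpack('>H', buf[pos:pos + 2])[0]
                if pos + 2 + message_size > len(buf):
                    break  # this message spans into the next chunk
                record = buf[pos + 2:pos + 2 + message_size]
                message_type = chr(record[0])             # first byte of the record
                handle_message(message_type, record[1:])  # placeholder
                pos += 2 + message_size
            buf = buf[pos:]  # keep the unparsed tail for the next iteration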
I have a problem with rewriting a .wav file (wave audio file). My project involves converting wave file data into bytes and then reassembling a new audio file that sounds the same.
For some reason, when I try to do this with struct.pack, the result is similar, but not the same - it seems like the original data but not exactly.
Note: for a normal string it works, but for the type of data in which wave files are written, it doesn't.
My function for converting the original data to bytes:
def original_data_to_bytes_data(original_data):
    """
    params: original data.
    returns: all the data in bytes form, list of strings.
    """
    original_data = str(''.join(format(ord(i), '08b') for i in original_data))
    bytes_data = list()
    for i in range(0, len(original_data), 8):
        bytes_data.append(original_data[i:i+8])
    return bytes_data
My function for converting the bytes to the original data:
def bytes_data_to_original_data(bytes_data):
    """
    params: bytes_data - data, list of strings.
    returns: original data.
    """
    original_data = ""
    for i in bytes_data:
        original_data += struct.pack('i', int(i, 2))
    return original_data
Thanks for the help!
On Python 3 I get an error message. On Python 2 it works without error, so I assume that you also use Python 2.
I checked this
data = 'A'
result = bytes_data_to_original_data(original_data_to_bytes_data(data))
print(result)
print(type(data), type(result))
and it displays the same text and the same type.
But when I check
print(data == result)
print(len(data), len(result))
print(repr(data), repr(result))
then it shows that data and result are different:
False
(1, 4)
("'A'", "'A\\x00\\x00\\x00'")
If I use "B" (byte) instead of "i" (integer) in code
struct.pack('B', int(i, 2))
then I get the same values, so the wave should sound the same too.
It also works if I use the bytes b"A" instead of the string "A", because Python 2 treats bytes as strings.
def bytes_data_to_original_data(bytes_data):
    """
    params: bytes_data - data, list of strings.
    returns: original data.
    """
    original_data = ""
    for i in bytes_data:
        original_data += struct.pack('B', int(i, 2))
    return original_data
EDIT: In struct.pack() I changed 'b' (which needs values -128..127) to 'B' (which works with values 0..255).
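A quick Python 2 round-trip check of the 'B' version (a sketch with sample bytes, not from the original question):

import struct

def to_bits(data):
    return [format(ord(c), '08b') for c in data]

def from_bits(bits):
    return ''.join(struct.pack('B', int(b, 2)) for b in bits)

data = '\x00\x7f\x80\xff'   # sample frame bytes as read from a wave file
assert from_bits(to_bits(data)) == data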
I am trying to do a project for college which consists of sending images using two Arduino Due boards and Python. I have two codes: one for the client (the one who sends the image) and one for the server (the one who receives the image). I know how to send the bytes and check if they are correct, however, I'm required to "split" the image into packages that have:
a header that has a size of 8 bytes and must be in this order:
the first byte must say the payload size;
the next three bytes must say how many packages will be sent in total;
the next three bytes must say which package I'm currently at;
the last byte must contain a code to an error message;
a payload containing data with a maximum size of 128 bytes;
an end of package (EOP) sequence (in this case, 3 bytes).
I managed to create the end of package sequence and append it correctly to a payload in order to send, however I'm facing issues on creating the header.
I'm currently trying to make the following loop:
with open(root.filename, 'rb') as f:
    picture = f.read()
    picture_size = len(picture)
    packages = ceil(picture_size/128)
    last_pack_size = (picture_size)
    EOPs = 0
    EOP_bytes = [b'\x15', b'\xff', b'\xd9']

    for p in range(1, packages):
        read_bytes = [None, int.to_bytes(picture[(p-1)*128], 1, 'big'),
                      int.to_bytes(picture[(p-1)*128 + 1], 1, 'big')]
        if p != packages:
            endrange = p*128 + 1
        else:
            endrange = picture_size
        for i in range((p-1)*128 + 2, endrange):
            read_bytes.append(int.to_bytes(picture[i], 1, 'big'))
            read_bytes.pop(0)
            if read_bytes == EOP_bytes:
                EOPs += 1

    print("read_bytes:", read_bytes)
    print("EOP_bytes:", EOP_bytes)
    print("EOPs", EOPs)
I expect that, at the end, the server receives the same number of packages that the client has sent, and then I need to join the packages to recreate the image. I can manage to do that; I just need some help with creating the header.
Here is a demo of how to construct your header. It's not a complete solution, but given that you only asked for help constructing the header, it may be what you are looking for.
headerArray = bytearray()

def Main():
    global headerArray
    # Sample Data
    payloadSize = 254    # 0 - 254
    totalPackages = 1
    currentPackage = 1
    errorCode = 101      # 0 - 254

    AddToByteArray(payloadSize, 1)     # the first byte must say the payload size;
    AddToByteArray(totalPackages, 3)   # the next three bytes must say how many packages will be sent in total;
    AddToByteArray(currentPackage, 3)  # the next three bytes must say which package I'm currently at;
    AddToByteArray(errorCode, 1)       # the last byte must contain a code to an error message;

def AddToByteArray(value, numberOfBytes):
    global headerArray
    allocate = value.to_bytes(numberOfBytes, 'little')
    headerArray += allocate

Main()

# Output
print(f"Byte Array: {headerArray}")
for i in range(0, len(headerArray)):
    print(f"Byte Position: {i} Value:{headerArray[i]}")
Obviously I have not included the logic to obtain the current package or total packages.
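For completeness, a hedged sketch of what the matching parse on the server side could look like, assuming the same field order and little-endian packing as AddToByteArray above:

def parse_header(header):
    """Split an 8-byte header back into its fields."""
    payload_size   = header[0]
    total_packages = int.from_bytes(header[1:4], 'little')
    current_pkg    = int.from_bytes(header[4:7], 'little')
    error_code     = header[7]
    return payload_size, total_packages, current_pkg, error_code

print(parse_header(headerArray))  # (254, 1, 1, 101) for the sample data above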
I'm running a process with mpirun and 2 cores, and it gets killed at the point when I'm mixing values between the two processes. Both processes use about 15% of the machine's memory, and even though memory use will increase when mixing, there should still be plenty of memory left. So I'm assuming that there is a limit on the amount of memory used for passing messages between the processes. How do I find out what this limit is, and how do I remove it?
The error message that I'm getting when mpirun dies is this:
File "Comm.pyx", line 864, in mpi4py.MPI.Comm.bcast (src/mpi4py.MPI.c:67787)
File "pickled.pxi", line 564, in mpi4py.MPI.PyMPI_bcast (src/mpi4py.MPI.c:31462)
File "pickled.pxi", line 93, in mpi4py.MPI._p_Pickle.alloc (src/mpi4py.MPI.c:26327)
SystemError: Negative size passed to PyBytes_FromStringAndSize
And this is the bit of the code that leads to the error:
sum_updates_j_k = numpy.zeros((self.col.J_total, self.K), dtype=numpy.float64)
comm.Reduce(self.updates_j_k, sum_updates_j_k, op=MPI.SUM)
sum_updates_j_k = comm.bcast(sum_updates_j_k, root=0)
The code usually works; it only runs into problems with larger amounts of data, which increase the size of the matrix that I'm exchanging between processes.
The culprit is probably the following lines found in the code of PyMPI_bcast():
cdef int count = 0
...
if dosend: smsg = pickle.dump(obj, &buf, &count) # <----- (1)
with nogil: CHKERR( MPI_Bcast(&count, 1, MPI_INT,      # <----- (2)
                              root, comm) )
cdef object rmsg = None
if dorecv and dosend: rmsg = smsg
elif dorecv: rmsg = pickle.alloc(&buf, count)
...
What happens here is that the object is first serialised at (1) using pickle.dump() and then the length of the pickled stream is broadcasted at (2).
There are two problems here and they both have to do with the fact that int is used for the length. The first problem is an integer cast inside pickle.dump and the other problem is that MPI_INT is used to transmit the length of the pickled stream. This limits the amount of data in your matrix to a certain size - namely the size that would result in a pickled object no bigger than 2 GiB (2^31 - 1 bytes). Any bigger object would result in an integer overflow and thus negative values in count.
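For scale, a quick back-of-envelope check (not from the original answer) of how many float64 values fit under that limit, ignoring the small pickle overhead for numpy arrays:

import numpy

limit_bytes = 2**31 - 1                                            # signed 32-bit max
max_elements = limit_bytes // numpy.dtype(numpy.float64).itemsize
print(max_elements)  # 268435455, i.e. roughly a 16384 x 16384 matrix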
This is clearly not an MPI issue but rather a bug in (or a feature of?) mpi4py.
I had the same problem with mpi4py recently. As pointed out by Hristo Iliev in his answer, it's a pickle problem.
This can be avoided by using the upper-case methods comm.Reduce(), comm.Bcast(), etc., which do not resort to pickle, as opposed to lower-case methods like comm.reduce(). As a bonus, upper case methods should be a bit faster as well.
Actually, you're already using comm.Reduce(), so I expect that switching to comm.Bcast() should solve your problem - it did for me.
NB: The syntax of upper-case methods is slightly different, but this tutorial can help you get started.
For example, instead of:
sum_updates_j_k = comm.bcast(sum_updates_j_k, root=0)
you would use:
comm.Bcast(sum_updates_j_k, root=0)
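Note that the upper-case methods operate on pre-allocated buffers, so every rank needs an array of the same shape and dtype already allocated; in your code that is already true, because sum_updates_j_k is created on each rank. A sketch putting the pieces together with your variable names:

sum_updates_j_k = numpy.zeros((self.col.J_total, self.K), dtype=numpy.float64)
comm.Reduce(self.updates_j_k, sum_updates_j_k, op=MPI.SUM, root=0)
comm.Bcast(sum_updates_j_k, root=0)   # overwrites the buffer in place on non-root ranks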
For such a case it is useful to have a function that can send numpy arrays in parts, e.g.:
from mpi4py import MPI
import math, numpy, sys  # sys is needed for the stderr messages below
comm = MPI.COMM_WORLD
rank = comm.Get_rank()

def bcast_array_obj(obj = None, dtype = numpy.float64, root = 0):
    """Function for broadcasting of a numpy array object"""
    reporter = 0 if root > 0 else 1
    if rank == root:
        for exp in range(11):
            parts = pow(2, exp)
            err = False
            part_len = math.ceil(len(obj) / parts)
            for part in range(parts):
                part_begin = part * part_len
                part_end = min((part + 1) * part_len, len(obj))
                try:
                    comm.bcast(obj[part_begin: part_end], root = root)
                except:
                    err = True
                err *= comm.recv(source = reporter, tag = 2)
                if err:
                    break
            if err:
                continue
            comm.bcast(None, root = root)
            print('The array was successfully sent in {} part{}'.\
                  format(parts, 's' if parts > 1 else ''))
            return
        sys.stderr.write('Failed to send the array even in 1024 parts')
        sys.stderr.flush()
    else:
        obj = numpy.zeros(0, dtype = dtype)
        while True:
            err = False
            try:
                part_obj = comm.bcast(root = root)
            except:
                err = True
                obj = numpy.zeros(0, dtype = dtype)
            if rank == reporter:
                comm.send(err, dest = root, tag = 2)
            if err:
                continue
            if type(part_obj) != type(None):
                frags = len(obj)
                obj.resize(frags + len(part_obj))
                obj[frags: ] = part_obj
            else:
                break
        return obj
This function automatically determines the optimal number of parts into which to break the input array.
For example,
if rank != 0:
    z = bcast_array_obj(root = 0)
else:
    z = numpy.zeros(1000000000, dtype = numpy.float64)
    bcast_array_obj(z, root = 0)
outputs
The array was successfully sent in 4 parts
Apparently this is an issue in MPI itself and not in mpi4py. The actual variable which holds the size of the data being communicated is a signed 32-bit integer, which will overflow to a negative value for around 2 GB of data.
Maximum amount of data that can be sent using MPI::Send
It's been raised as an issue with MPI4py previously as well here.
I've got an embedded system I'm writing a user app against. The user app needs to take a firmware image and split it into chunks suitable for sending to the embedded system for programming. I'm starting with S-record files and using Xmodem for file transfer (meaning each major 'file' transfer would need to be ended with an EOF), so the easiest thing for me to do would be to split the image file into a set of files of full S-records, each no greater than the size of the receive buffer of the (single-threaded) embedded system. My user app is written in Python, and I have a C program that will split the firmware image into properly sized files, but I thought there might be a more 'Pythonic' way of going about this, perhaps by using a custom stream handler.
Any thoughts?
Edit: to add to the discussion, I can feed my input file into a buffer. How could I use range to set a hard limit going into the buffer of either the file size or a full S-record line ('S'-delimited ASCII text)?
I thought this was an interesting question and the S-record format isn't too complicated, so I wrote an S-record encoder that appears to work from my limited testing.
import struct

def s_record_encode(fileobj, recordtype, address, buflen):
    """S-Record encode bytes from file.

    fileobj     file-like object to read data (if any)
    recordtype  'S0' to 'S9'
    address     integer address
    buflen      maximum output buffer size
    """
    # S-type to (address_len, has_data)
    record_address_bytes = {
        'S0': (2, True), 'S1': (2, True), 'S2': (3, True), 'S3': (4, True),
        'S5': (2, False), 'S7': (4, False), 'S8': (3, False), 'S9': (2, False)
    }
    # params for this record type
    address_len, has_data = record_address_bytes[recordtype]
    # big-endian address as string, trimmed to length
    address = struct.pack('>L', address)[-address_len:]
    # read data up to 255 bytes minus address and checksum len
    if has_data:
        data = fileobj.read(0xff - len(address) - 1)
        if not data:
            return '', 0
    else:
        data = ''
    # byte count is address + data + checksum
    count = len(address) + len(data) + 1
    count = struct.pack('B', count)
    # checksum count + address + data
    checksummed_record = count + address + data
    checksum = struct.pack('B', sum(ord(d) for d in checksummed_record) & 0xff ^ 0xff)
    # glue record type to hex encoded buffer
    record = recordtype + (checksummed_record + checksum).encode('hex').upper()
    # return buffer and how much data we read from the file
    return record, len(data)
def s_record_test():
    from cStringIO import StringIO

    # from an example, this should encode to given string
    fake_file = StringIO("\x0A\x0A\x0D\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00")
    encode_to = "S1137AF00A0A0D0000000000000000000000000061"
    fake_file.seek(0)
    record, buflen = s_record_encode(fake_file, 'S1', 0x7af0, 80)
    print 'record', record
    print 'encode_to', encode_to
    assert record == encode_to

    fake_file = StringIO()
    for i in xrange(1000):
        fake_file.write(struct.pack('>L', i))
    fake_file.seek(0)
    address = 0
    while True:
        buf, datalen = s_record_encode(fake_file, 'S2', address, 100)
        if not buf:
            break
        print address, datalen, buf
        address += datalen
If you already have a C program then you're in luck. Python is like a scripting language over C with most of the same functions. See Core tools for working with streams for all the familiar C I/O functions. Then you can make your program more Pythonic by rolling methods into classes and using things like the Python slice notation.
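For instance, a generator along those lines (a sketch, not tested against real hardware) could group complete S-record lines into chunks that never exceed the embedded system's receive buffer:

def chunk_srecords(lines, max_chunk_size):
    """Yield strings of whole S-record lines, each string <= max_chunk_size."""
    chunk, size = [], 0
    for line in lines:
        line_len = len(line) + 2            # + 2 for the '\r\n' terminator
        if chunk and size + line_len > max_chunk_size:
            yield '\r\n'.join(chunk) + '\r\n'
            chunk, size = [], 0
        chunk.append(line)
        size += line_len
    if chunk:
        yield '\r\n'.join(chunk) + '\r\n'

# Usage sketch: each chunk becomes one Xmodem 'file', terminated with EOF.
# for chunk in chunk_srecords(srec_lines, 1024):
#     send_via_xmodem(chunk)               # hypothetical transfer helper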