How to fix this IO bound python operation on 12GB .bin file? - python

I'm reading the book Hands-On Machine Learning for Algorithmic Trading and came across a script that parses a large .bin binary file and converts it to .h5. The file contains ITCH data; you can find the technical documentation of the data here. The script is very inefficient: it reads a 12 GB (12,952,050,754 bytes) file 2 bytes at a time, which is extremely slow (it can take up to 4 hours on a decent 4-CPU GCP instance). You can find the whole notebook here.
My problem is that I don't understand how this .bin file is being read. I don't see why the file has to be read 2 bytes at a time. I think there is a way to read with a larger buffer, but I'm not sure how to do it. If the script is still slow after optimizing it, I could port it to C++, but only once I understand the inner workings of this I/O process. Does anyone have suggestions?
Here's a link to the source of the ITCH data. It also has small files (300 MB or less) covering shorter time periods, if you need to experiment with the code.
The bottleneck:
with file_name.open('rb') as data:
    while True:
        # determine message size in bytes
        message_size = int.from_bytes(data.read(2), byteorder='big', signed=False)
        # get message type by reading first byte
        message_type = data.read(1).decode('ascii')
        message_type_counter.update([message_type])
        # read & store message
        record = data.read(message_size - 1)
        message = message_fields[message_type]._make(unpack(fstring[message_type], record))
        messages[message_type].append(message)
        # deal with system events
        if message_type == 'S':
            seconds = int.from_bytes(message.timestamp, byteorder='big') * 1e-9
            print('\n', event_codes.get(message.event_code.decode('ascii'), 'Error'))
            print(f'\t{format_time(seconds)}\t{message_count:12,.0f}')
            if message.event_code.decode('ascii') == 'C':
                store_messages(messages)
                break
        message_count += 1
        if message_count % 2.5e7 == 0:
            seconds = int.from_bytes(message.timestamp, byteorder='big') * 1e-9
            d = format_time(time() - start)
            print(f'\t{format_time(seconds)}\t{message_count:12,.0f}\t{d}')
            res = store_messages(messages)
            if res == 1:
                print(pd.Series(dict(message_type_counter)).sort_values())
                break
            messages.clear()
And here's the store_messages() function:
def store_messages(m):
    """Handle occasional storing of all messages"""
    with pd.HDFStore(itch_store) as store:
        for mtype, data in m.items():
            # convert to DataFrame
            data = pd.DataFrame(data)
            # parse timestamp info
            data.timestamp = data.timestamp.apply(int.from_bytes, byteorder='big')
            data.timestamp = pd.to_timedelta(data.timestamp)
            # apply alpha formatting
            if mtype in alpha_formats.keys():
                data = format_alpha(mtype, data)
            s = alpha_length.get(mtype)
            if s:
                s = {c: s.get(c) for c in data.columns}
            dc = ['stock_locate']
            # compare the message type, not the dict m (m == 'R' is always False)
            if mtype == 'R':
                dc.append('stock')
            try:
                store.append(mtype,
                             data,
                             format='t',
                             min_itemsize=s,
                             data_columns=dc)
            except Exception as e:
                print(e)
                print(mtype)
                print(data.info())
                print(pd.Series(list(m.keys())).value_counts())
                data.to_csv('data.csv', index=False)
                return 1
    return 0

According to the code, the file format is 2 bytes of message size, 1 byte of message type, and then the rest of the message. The size covers the type byte, so the script reads message_size - 1 further bytes.
The low-hanging fruit is to read 3 bytes at once, convert bytes [0:2] to the message size and byte [2] to the message type, and then read the message.
To cut the number of reads further, you could read a fixed amount of data from the file into a buffer and extract messages from it. While extracting, keep an index of the bytes already processed. Once that index, or the index plus the size of the next message, goes past the end of the buffer, refill the buffer. This could lead to huge memory requirements if not done properly, though.
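The buffering idea above can be sketched roughly as follows. This is only an illustration, not the book's code: iter_messages and chunk_size are made-up names, and the framing (2-byte big-endian size, then a 1-byte ASCII type, then the payload) is assumed from the loop in the question.

```python
import io
import struct


def iter_messages(stream, chunk_size=1 << 20):
    """Yield (message_type, payload) pairs from a length-prefixed stream.

    Assumed framing: 2-byte big-endian size, then `size` bytes made of a
    1-byte ASCII message type followed by the payload.
    """
    buf = b""
    pos = 0  # index of the first unprocessed byte in buf
    while True:
        need = 2  # at least the size prefix
        if len(buf) - pos >= 2:
            (size,) = struct.unpack_from(">H", buf, pos)
            need = 2 + size
        if len(buf) - pos < need:
            # Not enough data buffered: drop consumed bytes and refill.
            chunk = stream.read(chunk_size)
            if not chunk:
                return  # end of stream (any incomplete tail is ignored)
            buf = buf[pos:] + chunk
            pos = 0
            continue
        message_type = buf[pos + 2:pos + 3].decode("ascii")
        payload = buf[pos + 3:pos + need]
        yield message_type, payload
        pos += need


# Tiny in-memory demo with two fake messages.
demo = io.BytesIO(b"\x00\x03Sab" + b"\x00\x01X")
for message_type, payload in iter_messages(demo, chunk_size=4):
    print(message_type, payload)
```

Because consumed bytes are dropped on every refill, the buffer should stay at roughly one chunk plus one message, which avoids the memory growth mentioned above.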

Related

"Split" a image into packages of bytes

I am trying to do a college project that consists of sending images using two Arduino Due boards and Python. I have two scripts: one for the client (which sends the image) and one for the server (which receives the image). I know how to send the bytes and check that they are correct. However, I'm required to "split" the image into packages that have:
- a header of 8 bytes, laid out in this order:
  - the first byte must say the payload size;
  - the next three bytes must say how many packages will be sent in total;
  - the next three bytes must say which package I'm currently at;
  - the last byte must contain a code for an error message;
- a payload containing data with a maximum size of 128 bytes;
- an end of package (EOP) sequence (in this case, 3 bytes).
I managed to create the end of package sequence and append it correctly to a payload in order to send it, but I'm having trouble creating the header.
I'm currently trying to make the following loop:
with open(root.filename, 'rb') as f:
    picture = f.read()
picture_size = len(picture)
packages = ceil(picture_size/128)
last_pack_size = (picture_size)
EOPs = 0
EOP_bytes = [b'\x15', b'\xff', b'\xd9']
for p in range(1,packages):
    read_bytes = [None, int.to_bytes(picture[(p-1)*128], 1, 'big'),
                  int.to_bytes(picture[(p-1)*128 + 1], 1, 'big')]
    if p != packages:
        endrange = p*128+1
    else:
        endrange = picture_size
    for i in range((p-1)*128 + 2, endrange):
        read_bytes.append(int.to_bytes(picture[i], 1, 'big'))
        read_bytes.pop(0)
        if read_bytes == EOP_bytes:
            EOPs += 1
print("read_bytes:", read_bytes)
print("EOP_bytes:", EOP_bytes)
print("EOPs", EOPs)
I expect at the end that the server receives the same amount of packages that the client has sent, and in the end I need to join the packages to recreate the image. I can manage to do that, I just need some help with creating the header.
Here is a demo of how to construct your header. It's not a complete solution, but since you only asked for help constructing the header, it may be what you are looking for.
headerArray = bytearray()

def Main():
    global headerArray
    # Sample Data
    payloadSize = 254 # one byte: 0 - 255
    totalPackages = 1
    currentPackage = 1
    errorCode = 101 # one byte: 0 - 255
    AddToByteArray(payloadSize,1) # the first byte must say the payload size;
    AddToByteArray(totalPackages,3) # the next three bytes must say how many packages will be sent in total;
    AddToByteArray(currentPackage,3) # the next three bytes must say which package I'm currently at;
    AddToByteArray(errorCode,1) # the last byte must contain a code to an error message;

def AddToByteArray(value,numberOfBytes):
    global headerArray
    allocate = value.to_bytes(numberOfBytes, 'little')
    headerArray += allocate

Main()
# Output
print(f"Byte Array: {headerArray}")
for i in range(0,len(headerArray)):
    print(f"Byte Position: {i} Value:{headerArray[i]}")
Obviously I have not included the logic to obtain the current package or total packages.
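For the missing part, one possible way to compute the package numbers and assemble complete packages is sketched below. The names build_header and make_packages are hypothetical, the EOP bytes and the 128-byte limit are taken from the question, and big-endian byte order is an assumption (the demo above uses 'little'; use whichever the receiver expects).

```python
import math

PAYLOAD_MAX = 128        # maximum payload size from the question
EOP = b"\x15\xff\xd9"    # end-of-package bytes from the question's code


def build_header(payload_size, total_packages, current_package, error_code=0):
    """Return the 8-byte header: 1 + 3 + 3 + 1 bytes.

    Byte order "big" is an assumption; the question does not specify it.
    """
    return (
        payload_size.to_bytes(1, "big")
        + total_packages.to_bytes(3, "big")
        + current_package.to_bytes(3, "big")
        + error_code.to_bytes(1, "big")
    )


def make_packages(data):
    """Split data into header + payload + EOP packages, numbered from 1."""
    total = math.ceil(len(data) / PAYLOAD_MAX)
    packages = []
    for number in range(1, total + 1):
        start = (number - 1) * PAYLOAD_MAX
        payload = data[start:start + PAYLOAD_MAX]
        packages.append(build_header(len(payload), total, number) + payload + EOP)
    return packages
```

Slicing the bytes object directly avoids converting every byte with int.to_bytes, and the last package simply ends up with a shorter payload.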

python-lzw doesn't decompress larger blobs

I am new to Python and we have been trying to use the LZW code from GitHub in our program.
https://github.com/joeatwork/python-lzw/blob/master/lzw/init.py
This works well for a smaller blob, but when the blob size increases the blob is not fully decompressed. I have been reading the documentation, but I don't understand the part below, which might be the reason why the full blob is not getting decompressed.
I have also attached a snippet of the Python code I am using.
Our control codes are
- CLEAR_CODE (codepoint 256). When this code is encountered, we flush
  the codebook and start over.
- END_OF_INFO_CODE (codepoint 257). This code is reserved for
  encoder/decoders over the integer codepoint stream (like the
  mechanical bit that unpacks bits into codepoints)
When dealing with bytes, codes are emitted as variable
length bit strings packed into the stream of bytes.
codepoints are written with varying length
- initially 9 bits
- at 512 entries 10 bits
- at 1025 entries at 11 bits
- at 2048 entries 12 bits
- with max of 4095 entries in a table (including Clear and EOI)
code points are stored with their MSB in the most significant bit
available in the output character.
My code snippet:
def decompress_without_eoi(buf):
    # Decompress LZW into a bytes, ignoring End of Information code
    def gen():
        try:
            for byte in lzw.decompress(buf):
                yield byte
        except ValueError as exc:
            #print(repr(exc))
            if 'End of information code' in repr(exc):
                #print('Ignoring EOI error..\n')
                pass
            else:
                raise
        return
    try:
        #print('Trying a join..\n')
        deblob = b''.join(gen())
    except Exception as exc2:
        #print(repr(exc2))
        #print('Trying byte by byte..')
        deblob=[]
        try:
            for byte in gen():
                deblob.append(byte)
        except Exception as exc3:
            #print(repr(exc3))
            return b''.join(deblob)
    return deblob

#current function to deblob
def deblob3(row):
    if pd.notnull(row[0]):
        blob = row[0]
        h = html2text.HTML2Text()
        h.ignore_links=True
        h.ignore_images = True #zzzz
        if type(blob) != bytes:
            blobbytes = blob.read()[:-10]
        else:
            blobbytes = blob[:-10]
        if row[1]==361:
            # If compressed, return up to EOI-257 code, which is last non-null code before tag
            # print (row[0])
            return h.handle(striprtf(decompress_without_eoi(blobbytes)))
        elif row[1]==360:
            # If uncompressed, return up to tag
            return h.handle(striprtf(blobbytes))
The function is called as follows:
nf['IS_BLOB'] = nf[['IS_BLOB','COMPRESSION']].apply(deblob3,axis=1)

python - Read file using win32file.ReadFile

Similar Question:
What's the correct way to use win32file.ReadFile to get the output from a pipe?
The issue I have however was not answered in that question. When I call
result, data = win32file.ReadFile(my_file, 4096, None)
result is always 0 which according to documentation means success:
The result is a tuple of (hr, string/PyOVERLAPPEDReadBuffer), where hr may be 0,
ERROR_MORE_DATA or ERROR_IO_PENDING.
Even if I set the buffer to 10 and the file is much bigger the result is 0 and data is a string containing the first 10 characters.
result, buf = win32file.ReadFile(self._handle, 10, None)
while result == winerror.ERROR_MORE_DATA:
    result, data = win32file.ReadFile(self._handle, 2048, None)
    buf += data
    print "Hi"
return result, buf
"Hi" is never printed even if the file clearly contains more data.
My problem is: how can I ensure that I'm reading the whole file without using a ridiculously large buffer?
As already observed, a win32file.ReadFile result value hr of 0 means success. This is exactly the opposite of the Win32 API documentation, which says that 0 means an error occurred.
To determine how many bytes were read, check the length of the returned string. If it is the same size as the buffer, there might be more data. If it is smaller, the whole file has been read:
def readLines(self):
    bufSize = 4096
    win32file.SetFilePointer(self._handle, 0, win32file.FILE_BEGIN)
    result, data = win32file.ReadFile(self._handle, bufSize, None)
    buf = data
    while len(data) == bufSize:
        result, data = win32file.ReadFile(self._handle, bufSize, None)
        buf += data
    return buf.split('\r\n')
You need to add error handling to this, e.g. check whether result actually is 0 and, if not, take appropriate measures.
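A minimal sketch of that error handling, kept independent of pywin32 so it can be run anywhere. Here read_chunk is a hypothetical callable standing in for something like lambda n: win32file.ReadFile(handle, n, None), and treating every non-zero status as an error is an assumption that fits regular files (on message-mode pipes a status such as ERROR_MORE_DATA is not a failure).

```python
def read_all(read_chunk, buf_size=4096):
    """Read until a short read, checking the status of every call.

    `read_chunk(n)` is a hypothetical stand-in for a ReadFile wrapper; it
    must return an (hr, data) tuple where hr == 0 means success.
    """
    chunks = []
    while True:
        hr, data = read_chunk(buf_size)
        if hr != 0:
            raise OSError(f"read failed with status {hr}")
        chunks.append(data)
        if len(data) < buf_size:  # short read: nothing more to read
            break
    return b"".join(chunks)
```

If the file size is an exact multiple of buf_size, the loop performs one extra read that returns an empty result and then stops.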
Use PeekNamedPipe to see how much data is left on the pipe to read.
result, buf = win32file.ReadFile(self._handle, bufSize, None)
_, nAvail, nMessage = win32pipe.PeekNamedPipe(self._handle, 0)
while nAvail > 0:
    result, data = win32file.ReadFile(self._handle, bufSize, None)
    buf += data
    _, nAvail, nMessage = win32pipe.PeekNamedPipe(self._handle, 0)
This will tell you the total amount of data (nAvail) on the pipe and the amount of data left to read for the current message (nMessage). You can then either use a buffer of sufficient size to read the remaining data or read in chunks if you prefer.
If using message mode, you probably want to read a single message at a time. If you can't predict the maximum message size, first read 0 bytes to block until a message is available, then query the message size and read that many bytes:
# Block until data is available (assumes PIPE_WAIT)
result, _ = win32file.ReadFile(self._handle, 0, None)
# result should be 234 (ERROR_MORE_DATA)
_, nAvail, nMessage = win32pipe.PeekNamedPipe(self._handle, 0)
result, buf = win32file.ReadFile(self._handle, nMessage, None)
If using PIPE_NOWAIT, poll for messages using PeekNamedPipe and only read when nMessage > 0.
I tried the solution by beginner_ and got the error message "a bytes like object is required, not str" when calling buf.split('\r\n')
I printed str(buf) and saw it's in the form of "b'bufstring'"
As a hack I changed the line to return str(buf)[2:-1].split('\\r\\n') and it works perfectly now.
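A cleaner alternative to slicing the repr, sketched here with a hypothetical helper name: under Python 3 the buffer is a bytes object, so either split on a bytes separator or decode first. The UTF-8 encoding is an assumption; use whatever encoding the file really has.

```python
def split_lines(buf, encoding="utf-8"):
    """Decode a bytes buffer and split it on CRLF, returning a list of str."""
    return buf.decode(encoding).split("\r\n")


raw = b"first\r\nsecond\r\nthird"
as_bytes = raw.split(b"\r\n")   # stays bytes: separator must be bytes too
as_text = split_lines(raw)      # list of str
```

Unlike str(buf)[2:-1], this should also behave for content containing quotes, backslashes, or non-ASCII bytes, which the repr would escape.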

How can I moderate how much data I get from a file stream in python?

I've got an embedded system I'm writing a user app against. The user app needs to take a firmware image and split it into chunks suitable for sending to the embedded system for programming. I'm starting with S-record files and using Xmodem for file transfer (meaning each major 'file' transfer would need to be ended with an EOF), so the easiest thing for me to do would be to split the image file into a set of files of full S-records, each no greater than the size of the receive buffer of the (single-threaded) embedded system. My user app is written in Python, and I have a C program that will split the firmware image into properly sized files, but I thought there may be a more 'pythonic' way of going about this, perhaps by using a custom stream handler.
Any thoughts?
Edit: to add to the discussion, I can feed my input file into a buffer. How could I use range to set a hard limit going into the buffer of either the file size or a full S-record line ('S' delimited ASCII text)?
I thought this was an interesting question and the S-record format isn't too complicated, so I wrote an S-record encoder that appears to work from my limited testing.
import struct

def s_record_encode(fileobj, recordtype, address, buflen):
    """S-Record encode bytes from file.

    fileobj      file-like object to read data (if any)
    recordtype   'S0' to 'S9'
    address      integer address
    buflen       maximum output buffer size
    """
    # S-type to (address_len, has_data)
    record_address_bytes = {
        'S0':(2, True), 'S1':(2, True), 'S2':(3, True), 'S3':(4, True),
        'S5':(2, False), 'S7':(4, False), 'S8':(3, False), 'S9':(2, False)
    }
    # params for this record type
    address_len, has_data = record_address_bytes[recordtype]
    # big-endian address as string, trimmed to length
    address = struct.pack('>L', address)[-address_len:]
    # read data up to 255 bytes minus address and checksum len
    if has_data:
        data = fileobj.read(0xff - len(address) - 1)
        if not data:
            return '', 0
    else:
        data = ''
    # byte count is address + data + checksum
    count = len(address) + len(data) + 1
    count = struct.pack('B', count)
    # checksum count + address + data
    checksummed_record = count + address + data
    checksum = struct.pack('B', sum(ord(d) for d in checksummed_record) & 0xff ^ 0xff)
    # glue record type to hex encoded buffer
    record = recordtype + (checksummed_record + checksum).encode('hex').upper()
    # return buffer and how much data we read from the file
    return record, len(data)

def s_record_test():
    from cStringIO import StringIO
    # from an example, this should encode to given string
    fake_file = StringIO("\x0A\x0A\x0D\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00")
    encode_to = "S1137AF00A0A0D0000000000000000000000000061"
    fake_file.seek(0)
    record, buflen = s_record_encode(fake_file, 'S1', 0x7af0, 80)
    print 'record', record
    print 'encode_to', encode_to
    assert record == encode_to
    fake_file = StringIO()
    for i in xrange(1000):
        fake_file.write(struct.pack('>L', i))
    fake_file.seek(0)
    address = 0
    while True:
        buf, datalen = s_record_encode(fake_file, 'S2', address, 100)
        if not buf:
            break
        print address, datalen, buf
        address += datalen
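The encoder above is Python 2 code (print statements, cStringIO, str.encode('hex')). A rough Python 3 sketch of the same record layout follows. The names s_record and split_into_records and the 80-character line limit are illustrative assumptions, not part of the original answer.

```python
# Address length in bytes for each record type (same table as above).
ADDRESS_LEN = {"S0": 2, "S1": 2, "S2": 3, "S3": 4,
               "S5": 2, "S7": 4, "S8": 3, "S9": 2}


def s_record(record_type, address, data=b""):
    """Encode one S-record line (without a line terminator)."""
    addr = address.to_bytes(ADDRESS_LEN[record_type], "big")
    count = len(addr) + len(data) + 1          # address + data + checksum
    if count > 0xFF:
        raise ValueError("record too long")
    body = bytes([count]) + addr + data
    checksum = (sum(body) & 0xFF) ^ 0xFF       # one's complement of the low byte
    return record_type + (body + bytes([checksum])).hex().upper()


def split_into_records(image, start_address=0, max_line_len=80):
    """Yield S1 records of at most max_line_len characters each."""
    # Overhead per S1 line: type (2) + count (2) + address (4) + checksum (2)
    # characters; every data byte takes 2 hex characters.
    chunk = (max_line_len - 10) // 2
    for offset in range(0, len(image), chunk):
        yield s_record("S1", start_address + offset, image[offset:offset + chunk])
```

Capping the line length this way is one option for keeping each transfer within the receive buffer; max_line_len would be set to whatever the embedded side can hold.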
If you already have a C program then you're in luck. Python exposes most of the familiar C I/O functions, so the logic carries over almost directly. See Core tools for working with streams for the equivalents. Then you can make your program more Pythonic by rolling methods into classes and using things like the Python slice notation.

Interpreting WAV Data

I'm trying to write a program to display PCM data. I've been very frustrated trying to find a library with the right level of abstraction, but I've found the python wave library and have been using that. However, I'm not sure how to interpret the data.
The wave.getparams function returns (2 channels, 2 bytes, 44100 Hz, 96333 frames, No compression, No compression). This all seems cheery, but then I tried printing a single frame: '\xc0\xff\xd0\xff', which is 4 bytes. I suppose it's possible that a frame is 2 samples, but the ambiguities do not end there.
96333 frames * 2 samples/frame * (1/44.1k sec/sample) = 4.3688 seconds
However, iTunes reports the time as closer to 2 seconds and calculations based on file size and bitrate are in the ballpark of 2.7 seconds. What's going on here?
Additionally, how am I to know if the bytes are signed or unsigned?
Many thanks!
Thank you for your help! I got it working and I'll post the solution here for everyone to use in case some other poor soul needs it:
import wave
import struct

def pcm_channels(wave_file):
    """Given a file-like object or file path representing a wave file,
    decompose it into its constituent PCM data streams.

    Input: A file like object or file path
    Output: A list of lists of integers representing the PCM coded data stream channels
        and the sample rate of the channels (mixed rate channels not supported)
    """
    stream = wave.open(wave_file,"rb")
    num_channels = stream.getnchannels()
    sample_rate = stream.getframerate()
    sample_width = stream.getsampwidth()
    num_frames = stream.getnframes()
    raw_data = stream.readframes( num_frames ) # Returns byte data
    stream.close()
    total_samples = num_frames * num_channels
    if sample_width == 1:
        fmt = "%iB" % total_samples # read unsigned chars
    elif sample_width == 2:
        fmt = "%ih" % total_samples # read signed 2 byte shorts
    else:
        raise ValueError("Only supports 8 and 16 bit audio formats.")
    integer_data = struct.unpack(fmt, raw_data)
    del raw_data # Keep memory tidy (who knows how big it might be)
    channels = [ [] for time in range(num_channels) ]
    for index, value in enumerate(integer_data):
        bucket = index % num_channels
        channels[bucket].append(value)
    return channels, sample_rate
"Two channels" means stereo, so it makes no sense to sum each channel's duration -- so you're off by a factor of two (2.18 seconds, not 4.37). As for signedness, as explained for example here, and I quote:
8-bit samples are stored as unsigned bytes, ranging from 0 to 255. 16-bit samples are stored as 2's-complement signed integers, ranging from -32768 to 32767.
This is part of the specs of the WAV format (actually of its superset RIFF) and thus not dependent on what library you're using to deal with a WAV file.
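As a small illustration of that rule, here is the single frame from the question decoded with struct. The left/right naming follows the usual stereo channel order and is only an assumption about this particular file.

```python
import struct

frame = b"\xc0\xff\xd0\xff"   # the single frame printed in the question

# 2 channels x 2 bytes: two signed 16-bit little-endian samples.
left, right = struct.unpack("<2h", frame)

# The same bytes read as unsigned values, for comparison only.
left_u, right_u = struct.unpack("<2H", frame)

print(left, right)      # small negative values, i.e. near silence
print(left_u, right_u)  # large values if wrongly treated as unsigned
```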
I know that an answer has already been accepted, but I did some things with audio a while ago and you have to unpack the wave doing something like this.
pcmdata = wave.struct.unpack("%dh"%(wavedatalength),wavedata)
Also, one package that I used was called PyAudio, though I still had to use the wave package with it.
Each sample is 16 bits and there are 2 channels, so each frame takes 4 bytes.
The duration is simply the number of frames divided by the number of frames per second. From your data this is: 96333 / 44100 = 2.18 seconds.
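The same calculation as a short sketch using the standard wave module. The helper name wav_duration_seconds is made up for illustration.

```python
import wave


def wav_duration_seconds(wave_file):
    """Return the duration of a WAV file: frames / frames per second."""
    with wave.open(wave_file, "rb") as stream:
        return stream.getnframes() / stream.getframerate()


# The numbers from the question: 96333 frames at 44100 frames per second.
print(round(96333 / 44100, 2))  # 2.18
```

The number of channels does not enter the calculation, because one frame already holds one sample for every channel.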
Building upon this answer, you can get a good performance boost by using numpy.frombuffer (numpy.fromstring in older NumPy versions, where its binary mode is now deprecated) or numpy.fromfile. Also see this answer.
Here is what I did:
import numpy as np

def interpret_wav(raw_bytes, n_frames, n_channels, sample_width, interleaved = True):
    if sample_width == 1:
        dtype = np.uint8 # unsigned char
    elif sample_width == 2:
        dtype = np.int16 # signed 2-byte short
    else:
        raise ValueError("Only supports 8 and 16 bit audio formats.")
    # frombuffer replaces the deprecated binary mode of np.fromstring
    channels = np.frombuffer(raw_bytes, dtype=dtype)
    if interleaved:
        # channels are interleaved, i.e. sample N of channel M follows sample N of channel M-1 in raw data
        channels.shape = (n_frames, n_channels)
        channels = channels.T
    else:
        # channels are not interleaved. All samples from channel M-1 occur before all samples from channel M
        channels.shape = (n_channels, n_frames)
    return channels
Assigning a new value to shape will throw an error if it requires data to be copied in memory. This is a good thing, since you want to use the data in place (using less time and memory overall). The ndarray.T attribute also does not copy: it returns a view of the same data.
Reading directly from the file with np.fromfile will be even better, but you would have to skip the header using a custom dtype. I haven't tried this yet.
