I am using Python to read large amounts of data and split it across various files. I am looking for a way to speed up the code that I already have. The incoming numbers are little-endian 32-bit floats. I have run several tests.
First test, 8 minutes to complete:
f = open('filename', 'rb')
# file_out is a list of many files opened for writing ('wb')
chunk = True  # non-empty sentinel so the loop is entered; read() returns '' at EOF
while chunk:
    for i in range(self.num_files):
        chunk = f.read(4)
        file_out[i].write(chunk)
This was acceptably fast, but when I try to add some operations, things slow down dramatically to 56 minutes:
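file_old = [0,0,0,...,0]
f = open('filename', 'rb')
# file_out is a list of many files opened for writing ('wb')
chunk = True  # non-empty sentinel so the loop is entered; read() returns '' at EOF
while chunk:
    for i in range(self.num_files):
        chunk = f.read(4)
        num_chunk = numpy.fromstring(chunk, dtype=numpy.float32)
        # Write the difference from the previous value seen for this file
        file_out[i].write(num_chunk - file_old[i])
        file_old[i] = num_chunk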
I ran cProfile on the above code on a shortened sample. Here are the results:
write = 3.457
Numpy fromstring = 2.274
read = 1.370
How could I speed this up?
I was able to discover a much faster way of reading in the data using numpy.fromfile. I wrote a quick little test script shown below:
from os.path import join
import numpy
import struct
from time import time

def main():
    # Set the path name and filename
    folder = join("Tone_Tests", "1khz_10ns_0907153323")
    fn = join(folder, "Channel1.raw32")
    # Test 1
    start = time()
    f = open(fn, 'rb')
    array = read_fromstring(f)
    f.close()
    print "Test fromString = ", time() - start
    del array
    # Test 2
    start = time()
    f = open(fn, 'rb')
    array = read_struct(f)
    f.close()
    print "Test fromStruct = ", time() - start
    del array
    # Test 3
    start = time()
    f = open(fn, 'rb')
    array = read_fromfile(f)
    f.close()
    print "Test fromfile = ", time() - start
    del array

def read_fromstring(f):
    # Use numpy.fromstring: read each 4 bytes, convert, store in a list
    data = []
    chunk = f.read(4)
    while chunk:
        num_chunk = numpy.fromstring(chunk, dtype='float32')
        data.append(num_chunk)
        chunk = f.read(4)
    return numpy.array(data)

def read_struct(f):
    # Same as the numpy.fromstring version, but using struct
    data = []
    chunk = f.read(4)
    while chunk:
        num_chunk = struct.unpack('<f', chunk)
        data.append(num_chunk)
        chunk = f.read(4)
    return numpy.array(data)

def read_fromfile(f):
    return numpy.fromfile(f, dtype='float32', count=-1)
The timed outputs from the terminal were:
Test fromString = 4.43499994278
Test fromStruct = 2.42199993134
Test fromfile = 0.00399994850159
Using python -m cProfile -s time filename.py > profile.txt shows that the times were:
ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
     1    1.456    1.456    4.272    4.272  Read_Data_tester.py:42(read_fromstring)
     1    1.162    1.162    2.369    2.369  Read_Data_tester.py:56(read_struct)
     1    0.000    0.000    0.005    0.005  Read_Data_tester.py:70(read_fromfile)
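Building on the numpy.fromfile result, here is a minimal sketch of how the original per-value loop might be replaced by whole-array operations. It assumes the file holds the values interleaved in the same round-robin order the loop reads them, and that the data fits in memory; the names delta_encode_channels and write_channels are made up for illustration.

```python
import numpy as np

def delta_encode_channels(samples, num_files):
    """De-interleave a flat array into num_files columns and take
    row-to-row differences, starting from zero as file_old does."""
    # Drop a trailing partial frame, if any, so the reshape is valid
    usable = (samples.size // num_files) * num_files
    frames = samples[:usable].reshape(-1, num_files)
    # Prepending a row of zeros makes the first output equal the first value
    first = np.zeros((1, num_files), dtype=frames.dtype)
    return np.diff(frames, axis=0, prepend=first)

def write_channels(deltas, file_out):
    """Write column i of deltas to the already-open binary file file_out[i]."""
    for i, out in enumerate(file_out):
        out.write(deltas[:, i].tobytes())
```

A call might look like samples = np.fromfile('filename', dtype='<f4') followed by write_channels(delta_encode_channels(samples, num_files), file_out). For files too large for memory, the same idea could be applied block by block while carrying the last row over between blocks.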
I think you may be able to make use of threading (using the threading module).
This will allow you to run functions in parallel with your main code, so you could start one a quarter of the way through the file, another halfway, and another three quarters of the way through. Each then only has to process a quarter of the data, so it should take only a quarter of the time.
(I say "should" because there is overhead, so it won't be quite that fast.)
I'm running the following function for an ML model.
def get_images(filename):
    bin_file = open(filename, 'rb')
    buf = bin_file.read()  # read the whole file into memory
    bin_file.close()  # release the operating system's file handle
    index = 0
    magic, num_images, num_rows, num_colums = struct.unpack_from(big_endian + four_bytes, buf, index)
    index += struct.calcsize(big_endian + four_bytes)
    images = []  # temp images as tuple
    for x in range(num_images):
        im = struct.unpack_from(big_endian + picture_bytes, buf, index)
        index += struct.calcsize(big_endian + picture_bytes)
        im = list(im)
        # Clamp every pixel value above 1 down to 1
        for i in range(len(im)):
            if im[i] > 1:
                im[i] = 1
However, I am receiving an error at the line:
im = struct.unpack_from(big_endian + picture_bytes, buf, index)
With the error:
error: unpack_from requires a buffer of at least 784 bytes
I have noticed that this error only occurs at certain iterations. I cannot figure out why this might be the case. The dataset is a standard MNIST dataset which is freely available online.
I have also looked through similar questions on SO (e.g. error: unpack_from requires a buffer) but they don't seem to resolve the issue.
You didn't include the struct formats in your mre so it is hard to say why you are getting the error. Either you are using a partial/corrupted file or your struct formats are wrong.
This answer uses the test file 't10k-images-idx3-ubyte.gz' and file formats found at http://yann.lecun.com/exdb/mnist/
Open the file and read it into a bytes object (gzip is used because of the file's type).
import gzip, struct
with gzip.open(r'my\path\t10k-images-idx3-ubyte.gz', 'rb') as f:
    data = bytes(f.read())
print(len(data))
The file format spec says the header is 16 bytes (four 32 bit ints) - separate it from the pixels with a slice then unpack it
hdr,pixels = data[:16],data[16:]
magic, num_images, num_rows, num_cols = struct.unpack(">4L",hdr)
# print(len(hdr),len(pixels))
# print(magic, num_images, num_rows, num_cols)
There are a number of ways to iterate over the individual images.
img_size = num_rows * num_cols
imgfmt = "B"*img_size
for i in range(num_images):
    start = i * img_size
    end = start + img_size
    img = pixels[start:end]
    img = struct.unpack(imgfmt, img)
    # do work on the img
Or...
imgfmt = "B"*img_size
for img in struct.iter_unpack(imgfmt, pixels):
    img = [p if p == 0 else 1 for p in img]
The itertools grouper recipe would probably also work.
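As an alternative to unpacking one image at a time, the same header-plus-pixels layout can be handled with NumPy. This is only a sketch, assuming the 16-byte big-endian header described above and one unsigned byte per pixel; parse_idx_images is a hypothetical helper name. It also checks the buffer length up front, which makes a truncated or partial file fail with a clear message instead of an unpack error partway through.

```python
import struct
import numpy as np

def parse_idx_images(data):
    """Return binarized images (0 or 1) as an array of shape
    (num_images, num_rows, num_cols) from raw IDX image-file bytes."""
    magic, num_images, num_rows, num_cols = struct.unpack(">4L", data[:16])
    # View the pixel bytes after the header without copying them
    pixels = np.frombuffer(data, dtype=np.uint8, offset=16)
    expected = num_images * num_rows * num_cols
    if pixels.size < expected:
        raise ValueError(
            "expected %d pixel bytes, found %d" % (expected, pixels.size))
    images = pixels[:expected].reshape(num_images, num_rows, num_cols)
    # Any non-zero pixel becomes 1, matching the list comprehension above
    return (images > 0).astype(np.uint8)
```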
I'm trying to read noncontiguous fields from a binary file in Python using numpy fromfile function. It's based on this Matlab code using fread:
fseek(file, 0, 'bof');
q = fread(file, inf, 'float32', 8);
8 indicates the number of bytes I want to skip after reading each value. I was wondering if there was a similar option in fromfile, or if there is another way of reading specific values from a binary file in Python. Thanks for your help.
Henrik
Something like this should work, untested:
import struct
floats = []
with open(filename, 'rb') as f:
    while True:
        buff = f.read(4)  # 'f' is 4 bytes wide
        if len(buff) < 4:
            break
        x = struct.unpack('f', buff)[0]  # Convert buffer to float (get from returned tuple)
        floats.append(x)  # Add float to list (for example)
        f.seek(8, 1)  # The second arg 1 specifies relative offset
Using struct.unpack()
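Since the question asks about numpy.fromfile specifically, one possible approach is a structured dtype whose second field is 8 bytes of padding, so each record is one float followed by the skipped bytes. This is a sketch under the assumption that the floats are little-endian; the names RECORD, strided_floats_from_bytes and strided_floats_from_file are illustrative only.

```python
import numpy as np

# One record = a little-endian float32 followed by 8 bytes to skip
RECORD = np.dtype([("value", "<f4"), ("skip", "V8")])

def strided_floats_from_bytes(raw):
    """Extract the float from every complete 12-byte record in raw."""
    count = len(raw) // RECORD.itemsize
    records = np.frombuffer(raw, dtype=RECORD, count=count)
    # Copy so the result is a contiguous float32 array
    return records["value"].copy()

def strided_floats_from_file(path):
    """Same extraction, reading the records directly with numpy.fromfile."""
    return np.fromfile(path, dtype=RECORD)["value"].copy()
```

Note that only complete 12-byte records are read, so a final float that is not followed by 8 more bytes would be dropped; the struct loop above would still pick it up.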
Some code I am using (not in python) takes input files written in specific way. I usually prepare such input files with python scripts. One of them takes the following format:
100
0 1 2
3 4 5
6 7 8
where 100 is just an overall parameter and the rest is a matrix. In python 2, I used to do it in the following way:
# python 2.7
import numpy as np
Matrix = np.arange(9)
Matrix.shape = 3,3
f = open('input.inp', 'w')
print >> f, 100
np.savetxt(f, Matrix)
I just converted to python 3 recently. Running above script with 2to3 gets me something like:
# python 3.6
import numpy as np
Matrix = np.arange(9)
Matrix.shape = 3,3
f = open('input.inp', 'w')
print(100, file=f)
np.savetxt(f, Matrix)
The first error I got was TypeError: write() argument must be str, not bytes, because numpy.savetxt executes something like fh.write(asbytes(format % tuple(row) + newline)). I was able to fix this problem by opening the file in binary mode: f = open('input.inp', 'wb'). But this causes print() to fail. Is there a way to reconcile the two?
I ran into this same issue converting to python3. All strings in python3 are interpreted as unicode by default now, so you have to convert. I found the solution of writing to a string first and then writing the string to the file to be the most appealing. This is a working version of your snippet in python3 using this method:
# python 3.6
from io import BytesIO
import numpy as np
Matrix = np.arange(9)
Matrix.shape = 3,3
f = open('input.inp', 'w')
print(100, file=f)
fh = BytesIO()
np.savetxt(fh, Matrix, fmt='%d')
cstr = fh.getvalue()
fh.close()
print(cstr.decode('UTF-8'), file=f)
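Another option that avoids mixing print() and savetxt on one handle is to let savetxt write the leading parameter itself through its header argument, with comments='' so no '# ' prefix is added. The sketch below assumes integer formatting ('%d') is what the target program expects; render_input and write_input are hypothetical names.

```python
import io
import numpy as np

def render_input(parameter, matrix):
    """Return the file contents as text: the parameter line, then the matrix."""
    buf = io.BytesIO()
    # header is written first; comments='' suppresses the default '# ' prefix
    np.savetxt(buf, matrix, fmt="%d", header=str(parameter), comments="")
    return buf.getvalue().decode("ascii")

def write_input(path, parameter, matrix):
    """Write the same contents straight to a file path."""
    np.savetxt(path, matrix, fmt="%d", header=str(parameter), comments="")
```

For example, write_input('input.inp', 100, np.arange(9).reshape(3, 3)) should produce the four-line layout shown in the question.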
My current approach is this:
def get_hash(path=PATH, hash_type='md5'):
    func = getattr(hashlib, hash_type)()
    with open(path, 'rb') as f:
        # Read fixed-size blocks until read() returns b'' at end of file
        for block in iter(lambda: f.read(1024*func.block_size), b''):
            func.update(block)
    return func.hexdigest()
It takes about 3.5 seconds to calculate the md5sum of an 842 MB ISO file on an i5 @ 1.7 GHz. I have tried different methods of reading the file, but all of them yield slower results. Is there, perhaps, a faster solution?
EDIT: I replaced 2**16 (inside the f.read()) with 1024*func.block_size, since the default block_size for most hashing functions supported by hashlib is 64 (except for 'sha384' and 'sha512', for which the default block_size is 128). Therefore, the block size is still the same (65536 bytes).
EDIT(2): I did something wrong. It takes 8.4 seconds instead of 3.5. :(
EDIT(3): Apparently Windows was using the disk at +80% when I ran the function again. It really takes 3.5 seconds. Phew.
Another solution (about 0.5 seconds faster) is to use os.open():
def get_hash(path=PATH, hash_type='md5'):
    func = getattr(hashlib, hash_type)()
    # Read-only access is enough for hashing; os.O_BINARY exists on Windows only
    f = os.open(path, (os.O_RDONLY | os.O_BINARY))
    for block in iter(lambda: os.read(f, 2048*func.block_size), b''):
        func.update(block)
    os.close(f)
    return func.hexdigest()
Note that these results are not final.
Using an 874 MiB random data file which required 2 seconds with the md5 openssl tool I was able to improve speed as follows.
Using your first method required 21 seconds.
Reading the entire file into a buffer took 21 seconds; updating the hash from that buffer then required 2 seconds.
Using the following function with a buffer size of 8096 required 17 seconds.
Using the following function with a buffer size of 32767 required 11 seconds.
Using the following function with a buffer size of 65536 required 8 seconds.
Using the following function with a buffer size of 131072 required 8 seconds.
Using the following function with a buffer size of 1048576 required 12 seconds.
import hashlib
import time

def md5_speedcheck(path, size):
    pts = time.process_time()  # processor time at start
    ats = time.time()  # wall-clock time at start
    m = hashlib.md5()
    with open(path, 'rb') as f:
        b = f.read(size)
        while len(b) > 0:
            m.update(b)
            b = f.read(size)
    print("{0:.3f} s".format(time.process_time() - pts))
    print("{0:.3f} s".format(time.time() - ats))
The times I noted above are wall-clock times. Processor time is about the same for all of these; the difference is spent blocked on I/O.
The key determinant here is to have a buffer size that is big enough to mitigate disk latency, but small enough to avoid VM page swaps. For my particular machine it appears that 64 KiB is about optimal.
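To make the buffer size easy to experiment with, the chunked read can be wrapped in a function that takes it as a parameter. The following is a sketch rather than a tuned implementation: it reuses one preallocated buffer via readinto, defaults to the 64 KiB that worked best in the measurements above, and the names hash_stream and hash_file are made up here. Whether it is actually faster on a given machine would need measuring.

```python
import hashlib

def hash_stream(fileobj, hash_type="md5", buffer_size=64 * 1024):
    """Hash a binary file-like object in chunks, reusing a single buffer."""
    digest = hashlib.new(hash_type)
    buffer = bytearray(buffer_size)
    view = memoryview(buffer)
    while True:
        n = fileobj.readinto(buffer)  # number of bytes actually read
        if not n:
            break
        digest.update(view[:n])  # hash only the filled part of the buffer
    return digest.hexdigest()

def hash_file(path, hash_type="md5", buffer_size=64 * 1024):
    # buffering=0 returns a raw file object, so readinto fills our buffer directly
    with open(path, "rb", buffering=0) as f:
        return hash_stream(f, hash_type, buffer_size)
```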
I'm curious to understand why, in the first example, memory consumption behaves as I imagined:
s = StringIO()
s.write('abc'*10000000)
# Memory increases: OK
s.seek(0)
s.truncate()
# Memory decreases: OK
while in this second example, at the end, I use the same thing but the memory does not seem to decrease after the truncate method.
The following code is in a method of a class.
from StringIO import StringIO
import requests

self.BUFFER_SIZE = 5 * 1024 * 2 ** 10 # 5 MB
self.MAX_MEMORY = 3 * 1024 * 2 ** 10 # 3 MB
r = requests.get(self.target, stream=True) # stream=True to not download the data at once
chunks = r.iter_content(chunk_size=self.BUFFER_SIZE)
buff = StringIO()
# Get the first MAX_MEMORY of data
for chunk in chunks:
    buff.write(chunk)
    if buff.len > self.MAX_MEMORY:
        break
# Left the loop because there are no more chunks: the data stays in memory
if buff.len < self.MAX_MEMORY:
    self.data = buff.getvalue()
# Otherwise, prepare a temp file and process the remaining chunks
else:
    self.path = self._create_tmp_file_path()
    with open(self.path, 'w') as f:
        # Write the first downloaded data
        buff.seek(0)
        f.write(buff.read())
        # Free the buffer?
        buff.seek(0)
        buff.truncate()
        ###################
        # Memory does not decrease
        # Another 5 MB will be added to memory on the next line, which is normal because that is the size of a chunk
        # But if the buffer had been freed, memory would stay steady: - 5 MB + 5 MB
        # Write the remaining chunks directly into the file
        for chunk in chunks:
            f.write(chunk)
Any thoughts?
Thanks.
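For reference, the keep-in-memory-then-spill pattern in the second example resembles what the standard library's tempfile.SpooledTemporaryFile does. The sketch below does not explain the memory behaviour asked about; it only shows that pattern under the assumption that chunks is any iterable of byte strings, and spool_chunks is a hypothetical helper name.

```python
import tempfile

def spool_chunks(chunks, max_memory=3 * 1024 * 1024):
    """Collect chunks in memory, rolling over to a temporary file on disk
    once more than max_memory bytes have been written."""
    spooled = tempfile.SpooledTemporaryFile(max_size=max_memory, mode="w+b")
    for chunk in chunks:
        spooled.write(chunk)
    spooled.seek(0)  # rewind so the caller can read from the start
    return spooled
```

The caller reads the result like any other binary file and closes it when done, which also removes the temporary file if one was created.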