understanding memory consumption: free the memory of a StringIO? - python

I'm curious to understand why, in this first example, memory consumption behaves as I was imagining:
s = StringIO()
s.write('abc'*10000000)
# Memory increases: OK
s.seek(0)
s.truncate()
# Memory decreases: OK
while in this second example, at the end, I do the same thing but the memory does not seem to decrease after calling truncate.
The following code is in a method of a class.
from StringIO import StringIO
import requests

self.BUFFER_SIZE = 5 * 1024 * 2 ** 10  # 5 MB
self.MAX_MEMORY = 3 * 1024 * 2 ** 10   # 3 MB

r = requests.get(self.target, stream=True)  # stream=True to avoid downloading everything at once
chunks = r.iter_content(chunk_size=self.BUFFER_SIZE)
buff = StringIO()

# Read the first MAX_MEMORY bytes of data
for chunk in chunks:
    buff.write(chunk)
    if buff.len > self.MAX_MEMORY:
        break

# We left the loop because there are no more chunks: the data stays in memory
if buff.len < self.MAX_MEMORY:
    self.data = buff.getvalue()
# Otherwise, prepare a temp file and process the remaining chunks
else:
    self.path = self._create_tmp_file_path()
    with open(self.path, 'w') as f:
        # Write the data downloaded so far
        buff.seek(0)
        f.write(buff.read())
        # Free the buffer?
        buff.seek(0)
        buff.truncate()
        ###################
        # Memory does not decrease here.
        # Another 5 MB will be added on the next line, which is normal because it
        # is the size of a chunk. But if the buffer were freed, memory usage would
        # stay steady: -5 MB + 5 MB.
        # Write the remaining chunks directly into the file
        for chunk in chunks:
            f.write(chunk)
Any thoughts?
Thanks.
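One way to observe the behaviour described above is to print the process's resident memory around each step; psutil and the small helper below are my additions, not part of the question:
import psutil
from StringIO import StringIO

def rss_mb():
    # Resident set size of the current process, in MB
    return psutil.Process().memory_info().rss / float(1024 ** 2)

s = StringIO()
print 'before write: %.1f MB' % rss_mb()
s.write('abc' * 10000000)
print 'after write: %.1f MB' % rss_mb()
s.seek(0)
s.truncate()
print 'after truncate: %.1f MB' % rss_mb()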

Related

download percent in python progress bar exceeds 100

This is part of a script to download files. The problem is that downloaded_percent sometimes exceeds 100.
with open(outputname, 'wb') as DEST:
    chunksize = 8192
    downloaded = 0
    for chunk in SESSION.iter_content(chunk_size=chunksize):
        DEST.write(chunk)
        downloaded += chunksize
        downloaded_percent = (downloaded * 100) / file_size ## <<< FIXME exceeds 100
        progress_info = {file_size: f'%{downloaded_percent:.2f}'}
        print(progress_info, end='\r')
This is what I've tried so far:
Using an if statement before the write, but it causes the exact opposite, i.e. downloaded_percent never reaches 100:
if chunk:
    DEST.write(chunk)
    downloaded += chunksize
    downloaded_percent = (downloaded * 100) / file_size ## <<< FIXME never reaches 100
Using an if statement to set downloaded_percent to 100 if it goes above that:
DEST.write(chunk)
downloaded += chunksize
downloaded_percent = (downloaded * 100) / file_size ## <<< FIXME exceeds 100
if downloaded_percent > 100:
    downloaded_percent = 100
This does the trick, but it does not seem efficient to check the amount in every iteration.
I was wondering if there are better ways to do so.
Have you tried:
downloaded_percent = (downloaded / file_size) * 100
Assuming that file_size is always >= downloaded.
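As a side note, a sketch of mine (not part of the original answer): the loop adds chunksize on every iteration even though the final chunk is usually shorter, which is likely why the percentage overshoots. Counting the bytes actually received keeps downloaded bounded by file_size:
with open(outputname, 'wb') as DEST:
    downloaded = 0
    for chunk in SESSION.iter_content(chunk_size=8192):
        DEST.write(chunk)
        downloaded += len(chunk)  # count the bytes actually received
        downloaded_percent = (downloaded / file_size) * 100
        print(f'%{downloaded_percent:.2f}', end='\r')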

Python Read part of large binary file

I have a large binary file (size ~2.5 GB). It contains a header (336 bytes) and seismic signal data (x, y and z channels) of type int32. The number of discrete samples is 223,200,000.
I need to read part of the signal. For example, I want to get the part of the signal in the sample interval [216,000,000, 219,599,999].
I wrote the function:
def reading(path, start_moment, end_moment):
    file_data = open(path, 'rb')
    if start_moment is not None:
        bytes_value = start_moment * 4 * 3
        file_data.seek(336 + bytes_value)
    else:
        file_data.seek(336)
    if end_moment is None:
        try:
            signals = np.fromfile(file_data, dtype=np.int32)
        except MemoryError:
            return None
        finally:
            file_data.close()
    else:
        moment_count = end_moment - start_moment + 1
        try:
            signals = np.fromfile(file_data, dtype=np.int32,
                                  count=moment_count * 3)
        except MemoryError:
            return None
        finally:
            file_data.close()
    channel_count = 3
    signal_count = signals.shape[0] // channel_count
    signals = np.reshape(signals, newshape=(signal_count, channel_count))
    return signals
If I run script with the function in PyCharm IDE I get error:
Traceback (most recent call last): File
"D:/AppsBuilding/test/testReadBaikal8.py", line 41, in
signal_2 = reading(path=path, start_moment=216000000, end_moment=219599999) File
"D:/AppsBuilding/test/testReadBaikal8.py", line 27, in reading
count=moment_count * 3) OSError: obtaining file position failed
But if I run the script with the parameters start_moment=7200000, end_moment=10799999, everything works fine.
My PC runs Windows 7 32-bit with 1.95 GB of memory.
Please help me resolve this problem.
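For reference, a sketch of mine (not taken from the answers below): read only the requested window of bytes and convert it with numpy.frombuffer, assuming the 336-byte header and int32 x/y/z triplet layout described above:
import numpy as np

def read_slice(path, start_moment, end_moment, header_size=336, channels=3):
    itemsize = np.dtype(np.int32).itemsize            # 4 bytes per value
    start_byte = header_size + start_moment * channels * itemsize
    n_values = (end_moment - start_moment + 1) * channels
    with open(path, 'rb') as f:
        f.seek(start_byte)
        raw = f.read(n_values * itemsize)              # only the requested window
    signals = np.frombuffer(raw, dtype=np.int32)
    return signals.reshape(-1, channels)               # (samples, channels)
For the interval above this touches only about 43 MB of data, which should be comfortable even on a 32-bit machine.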
Divide the file into small segments, freeing memory after each small piece of content is processed:
def read_in_block(file_path):
    BLOCK_SIZE = 1024
    with open(file_path, "r") as f:
        while True:
            block = f.read(BLOCK_SIZE)
            if block:
                yield block
            else:
                return

for block in read_in_block(file_path):
    print block
I don't use Numpy but I don't see anything obviously wrong with your code. However, you say the file is approximately 2.5 GB in size. A triplet index of 219,599,999 requires a file at least 2.45 GB in size:
$ calc
; 219599999 * 4 * 3
2635199988
; 2635199988 / 1024^3
~2.45422123745083808899
Are you sure your file is really that large?
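One quick way to check (a sketch of mine, not part of the original answer; the path is a placeholder) is to compare the actual file size with the byte offset the requested slice needs:
import os

path = 'data.bin'                              # hypothetical path to the seismic file
header = 336
end_moment = 219599999
needed = header + (end_moment + 1) * 3 * 4     # header plus the last requested triplet
actual = os.path.getsize(path)
print('need at least %d bytes, file has %d' % (needed, actual))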
I also don't use MS Windows but the following toy programs work for me. The first creates a data file that mimics the structure of yours. The second shows that it can read the final data triplet. What happens if you run these on your system?
fh = open('x', 'wb')
fh.write(b'0123456789')
for i in range(0, 1000):
    s = bytes('{:03d}'.format(i), 'ascii')
    fh.write(b'a' + s + b'b' + s + b'c' + s)
Read the data from file x:
fh = open('x', 'rb')
triplet = 999
fh.seek(10 + triplet * 3 * 4)
data = fh.read(3 * 4)
print(data)

Writing a large number of files in python, noticeable slowdown at the end

I am trying to write a large number of files [2000-2500] to disk after processing. I noticed that the first 100 or so images are fast to write to disk, then there is a slowdown. Why is this happening and what can I do to speed up the process?
This is my code that writes the images:
for b in range(Data.shape[1]):
    t0 = time.clock()
    img = Data[:,b]
    img = np.reshape(img,(501,501))
    save = os.path.join(savedir,"%s_%s"%(item,b))
    plt.imshow(img).figure.savefig(save)
    print "Saved %s of %s in %s seconds"%(b,Data.shape[1],time.clock()-t0)
Edit:
Saved 0 of 1024 in 0.103277 seconds
Saved 1 of 1024 in 0.0774039999999 seconds
Saved 2 of 1024 in 0.0883339999998 seconds
Saved 3 of 1024 in 0.0922500000001 seconds
Saved 4 of 1024 in 0.0972509999999 seconds
And after a few iterations:
Saved 1018 of 1024 in 2.152941 seconds
Saved 1019 of 1024 in 2.163633 seconds
Saved 1020 of 1024 in 2.198959 seconds
Saved 1021 of 1024 in 2.172303 seconds
Saved 1022 of 1024 in 2.19014 seconds
Saved 1023 of 1024 in 2.203727 seconds
Each time you use plt.imshow, you create a new AxesImage, each of which takes up some memory. To speed things up, you can clear the figure with clf() after each save.
You can check this using len(plt.gca().images) to see how many images you have open. Without the clf() line, you will see that number growing by 1 each iteration.
for b in range(Data.shape[1]):
    img = Data[:,b]
    img = np.reshape(img,(501,501))
    print "Saving %s of %s"%(b,Data.shape[1])
    save = os.path.join(savedir,"%s_%s"%(item,b))
    plt.imshow(img).figure.savefig(save)
    print "There are %d image(s) open"%len(plt.gca().images)
    plt.gcf().clf() # clear the figure
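A further variant worth trying (my own sketch, not part of the original answer) is to create a single AxesImage up front and only swap its pixel data each iteration, so no new artists accumulate at all:
fig, ax = plt.subplots()
im = ax.imshow(np.reshape(Data[:, 0], (501, 501)))   # create one image artist
for b in range(Data.shape[1]):
    im.set_data(np.reshape(Data[:, b], (501, 501)))
    im.autoscale()   # rescale the colour limits to the current frame
    fig.savefig(os.path.join(savedir, "%s_%s" % (item, b)))
plt.close(fig)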

Is there a faster way (than this) to calculate the hash of a file (using hashlib) in Python?

My current approach is this:
def get_hash(path=PATH, hash_type='md5'):
    func = getattr(hashlib, hash_type)()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(1024*func.block_size), b''):
            func.update(block)
    return func.hexdigest()
It takes about 3.5 seconds to calculate the md5sum of an 842 MB ISO file on an i5 @ 1.7 GHz. I have tried different methods of reading the file, but all of them yield slower results. Is there, perhaps, a faster solution?
EDIT: I replaced 2**16 (inside the f.read()) with 1024*func.block_size, since the default block_size for most hashing functions supported by hashlib is 64 (except for 'sha384' and 'sha512' - for them, the default block_size is 128). Therefore, the block size is still the same (65536 bytes).
EDIT(2): I did something wrong. It takes 8.4 seconds instead of 3.5. :(
EDIT(3): Apparently Windows was using the disk at +80% when I ran the function again. It really takes 3.5 seconds. Phew.
Another solution (~-0.5 sec, slightly faster) is to use os.open():
def get_hash(path=PATH, hash_type='md5'):
    func = getattr(hashlib, hash_type)()
    f = os.open(path, (os.O_RDWR | os.O_BINARY))
    for block in iter(lambda: os.read(f, 2048*func.block_size), b''):
        func.update(block)
    os.close(f)
    return func.hexdigest()
Note that these results are not final.
Using an 874 MiB random-data file, which required 2 seconds with the openssl md5 tool, I was able to improve speed as follows.
Using your first method required 21 seconds.
Reading the entire file (21 seconds) to buffer and then updating required 2 seconds.
Using the following function with a buffer size of 8096 required 17 seconds.
Using the following function with a buffer size of 32767 required 11 seconds.
Using the following function with a buffer size of 65536 required 8 seconds.
Using the following function with a buffer size of 131072 required 8 seconds.
Using the following function with a buffer size of 1048576 required 12 seconds.
import hashlib
import time

def md5_speedcheck(path, size):
    pts = time.process_time()
    ats = time.time()
    m = hashlib.md5()
    with open(path, 'rb') as f:
        b = f.read(size)
        while len(b) > 0:
            m.update(b)
            b = f.read(size)
    print("{0:.3f} s".format(time.process_time() - pts))
    print("{0:.3f} s".format(time.time() - ats))
The times I noted above are wall-clock ("human") time; the processor time for all of these is about the same, with the difference being spent blocked on I/O.
The key determinant here is to have a buffer size that is big enough to mitigate disk latency, but small enough to avoid VM page swaps. For my particular machine it appears that 64 KiB is about optimal.
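For completeness, a small driver of mine (not from the answer; the path is a placeholder) that runs md5_speedcheck over the buffer sizes listed above:
TEST_FILE = '/tmp/random.bin'   # hypothetical test file
for size in (8096, 32767, 65536, 131072, 1048576):
    print("buffer size {0}:".format(size))
    md5_speedcheck(TEST_FILE, size)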

Python Binary File Manipulation Speed Up

I am using Python to read massive amounts of data and split it into various files. I am looking for a way to speed up the code that I already have. The numbers coming in are little-endian 32-bit floats. I have run several tests.
First test, 8 minutes to complete:
f = open('filename','rb')
# file_out is a list of many files already open for writing ('wb')
chunk = True  # primed so the first pass through the loop runs
while chunk:
    for i in range(self.num_files):
        chunk = f.read(4)
        file_out[i].write(chunk)
This was acceptably fast, but when I try to add some operations, things slow down dramatically to 56 minutes:
file_old = [0,0,0,...,0]
f = open('filename','rb')
# file_out is a list of many files already open for writing ('wb')
chunk = True  # primed so the first pass through the loop runs
while chunk:
    for i in range(self.num_files):
        chunk = f.read(4)
        num_chunk = numpy.fromstring(chunk, dtype = numpy.float32)
        file_out[i].write(num_chunk-file_old[i])
        file_old[i] = num_chunk
I ran cProfile on the above code on a shortened sample. Here are the results:
write = 3.457
Numpy fromstring = 2.274
read = 1.370
How could I speed this up?
I was able to discover a much faster way of reading in the data using numpy.fromfile. I wrote a quick little test script shown below:
from os.path import join
import numpy
import struct
from time import time

def main():
    # Set the path name and filename
    folder = join("Tone_Tests","1khz_10ns_0907153323")
    fn = join(folder,"Channel1.raw32")

    # Test 1
    start = time()
    f = open(fn,'rb')
    array = read_fromstring(f)
    f.close()
    print "Test fromString = ",time()-start
    del array

    # Test 2
    start = time()
    f = open(fn,'rb')
    array = read_struct(f)
    f.close()
    print "Test fromStruct = ",time()-start
    del array

    # Test 3
    start = time()
    f = open(fn,'rb')
    array = read_fromfile(f)
    f.close()
    print "Test fromfile = ",time()-start
    del array

def read_fromstring(f):
    # Use numpy.fromstring: read 4 bytes at a time, convert, store in a list
    data = []
    chunk = f.read(4)
    while chunk:
        num_chunk = numpy.fromstring(chunk, dtype = 'float32')
        data.append(num_chunk)
        chunk = f.read(4)
    return numpy.array(data)

def read_struct(f):
    # Same as the numpy.fromstring version, but using struct
    data = []
    chunk = f.read(4)
    while chunk:
        num_chunk = struct.unpack('<f',chunk)
        data.append(num_chunk)
        chunk = f.read(4)
    return numpy.array(data)

def read_fromfile(f):
    return numpy.fromfile(f, dtype = 'float32', count = -1)
The timed outputs from the terminal were:
Test fromString = 4.43499994278
Test fromStruct = 2.42199993134
Test fromfile = 0.00399994850159
Using python -m cProfile -s time filename.py > profile.txt shows that the times were:
ncalls tottime percall cumtime percall filename:lineno(function)
1 1.456 1.456 4.272 4.272 Read_Data_tester.py:42(read_fromstring)
1 1.162 1.162 2.369 2.369 Read_Data_tester.py:56(read_struct)
1 0.000 0.000 0.005 0.005 Read_Data_tester.py:70(read_fromfile)
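Building on that result, here is a rough sketch of mine (not from the original answer) of how the split-and-difference loop from the question might look once the whole interleaved stream is loaded with numpy.fromfile; the column layout and the helper name are assumptions:
import numpy

def split_and_diff(in_path, file_out):
    # Read the whole interleaved stream at once (the fastest method above)
    data = numpy.fromfile(in_path, dtype='float32')
    data = data.reshape(-1, len(file_out))   # one column per output file
    # Successive differences per column; the first value is differenced
    # against 0, mirroring file_old starting as a list of zeros
    diffs = numpy.empty_like(data)
    diffs[0] = data[0]
    diffs[1:] = data[1:] - data[:-1]
    for i, f in enumerate(file_out):
        diffs[:, i].copy().tofile(f)   # copy() makes the column contiguous before writing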
I think you may be able to make use of threading (using the threading module).
This would allow you to run functions in parallel with your main code, so you could start one thread a third of the way through the file, another halfway through, and another three-quarters of the way through. Each one would then only have to process a quarter of the data, so it should take only about a quarter of the time.
(I say "should" because there is overhead, so it won't be quite that fast.)
