This is part of a script to download files. The problem with it is that downloaded_percent sometimes exceeds 100.
with open(outputname, 'wb') as DEST:
    chunksize = 8192
    downloaded = 0
    for chunk in SESSION.iter_content(chunk_size=chunksize):
        DEST.write(chunk)
        downloaded += chunksize
        downloaded_percent = (downloaded * 100) / file_size  ## <<< FIXME exceeds 100
        progress_info = {file_size: f'%{downloaded_percent:.2f}'}
        print(progress_info, end='\r')
This is what I've tried so far:
Using an if statement before the write, but it causes the exact opposite problem, i.e. downloaded_percent never reaches 100:
if chunk:
    DEST.write(chunk)
    downloaded += chunksize
    downloaded_percent = (downloaded * 100) / file_size  ## <<< FIXME never reaches 100
Using an if statement to set downloaded_percent to 100 if it goes above that:
DEST.write(chunk)
downloaded += chunksize
downloaded_percent = (downloaded * 100) / file_size  ## <<< FIXME exceeds 100
if downloaded_percent > 100:
    downloaded_percent = 100
This does the trick, but checking the value on every iteration does not seem efficient.
I was wondering whether there is a better way to do this.
Have you tried:
downloaded_percent = (downloaded / file_size) * 100
Assuming that file_size is always >= downloaded.
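For what it's worth, here is a minimal sketch of the same loop that adds len(chunk) instead of chunksize; iter_content can yield a final chunk shorter than chunk_size, so counting the bytes actually received keeps downloaded from overshooting file_size (SESSION, outputname and file_size are assumed to come from the surrounding script):

with open(outputname, 'wb') as DEST:
    downloaded = 0
    for chunk in SESSION.iter_content(chunk_size=8192):
        DEST.write(chunk)
        downloaded += len(chunk)  # actual bytes received, not the nominal chunk size
        downloaded_percent = downloaded * 100 / file_size
        print(f'%{downloaded_percent:.2f}', end='\r')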
I have a large binary file (~2.5 GB). It contains a 336-byte header and seismic signal data (x, y and z channels) stored as int32. The number of discrete samples is 223,200,000.
I need to read part of the signal, for example the samples in the interval [216,000,000, 219,599,999].
I wrote the function:
def reading(path, start_moment, end_moment):
    file_data = open(path, 'rb')
    if start_moment is not None:
        bytes_value = start_moment * 4 * 3
        file_data.seek(336 + bytes_value)
    else:
        file_data.seek(336)
    if end_moment is None:
        try:
            signals = np.fromfile(file_data, dtype=np.int32)
        except MemoryError:
            return None
        finally:
            file_data.close()
    else:
        moment_count = end_moment - start_moment + 1
        try:
            signals = np.fromfile(file_data, dtype=np.int32,
                                  count=moment_count * 3)
        except MemoryError:
            return None
        finally:
            file_data.close()
    channel_count = 3
    signal_count = signals.shape[0] // channel_count
    signals = np.reshape(signals, newshape=(signal_count, channel_count))
    return signals
If I run the script with this function in the PyCharm IDE, I get an error:
Traceback (most recent call last):
  File "D:/AppsBuilding/test/testReadBaikal8.py", line 41, in <module>
    signal_2 = reading(path=path, start_moment=216000000, end_moment=219599999)
  File "D:/AppsBuilding/test/testReadBaikal8.py", line 27, in reading
    count=moment_count * 3)
OSError: obtaining file position failed
But if I run the script with start_moment=7200000, end_moment=10799999, everything works fine.
My PC runs 32-bit Windows 7 with 1.95 GB of memory.
Please help me resolve this problem.
Divide the file into small blocks and free the memory after each small piece of content is processed:
def read_in_block(file_path):
    BLOCK_SIZE = 1024
    with open(file_path, "rb") as f:  # "rb": the file in the question is binary
        while True:
            block = f.read(BLOCK_SIZE)
            if block:
                yield block
            else:
                return

for block in read_in_block(file_path):
    print(block)  # process each block here
I don't use NumPy, but I don't see anything obviously wrong with your code. However, you say the file is approximately 2.5 GB in size. A triplet index of 219,599,999 requires a file of at least 2.45 GiB:
$ calc
; 219599999 * 4 * 3
2635199988
; 2635199988 / 1024^3
~2.45422123745083808899
Are you sure your file is really that large?
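As a quick check, here is a small sketch along those lines (the path is a placeholder; 336 bytes is the question's header size and each triplet is 3 * 4 bytes):

import os

path = r'D:\data\signal.bin'  # hypothetical path; substitute the real data file
last_triplet = 219599999
required = 336 + (last_triplet + 1) * 3 * 4  # header plus triplets 0..219599999
actual = os.path.getsize(path)
print('required:', required, 'actual:', actual, 'big enough:', actual >= required)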
I also don't use MS Windows but the following toy programs work for me. The first creates a data file that mimics the structure of yours. The second shows that it can read the final data triplet. What happens if you run these on your system?
fh = open('x', 'wb')
fh.write(b'0123456789')
for i in range(0, 1000):
    s = bytes('{:03d}'.format(i), 'ascii')
    fh.write(b'a' + s + b'b' + s + b'c' + s)
Read the data from file x:
fh = open('x', 'rb')
triplet = 999
fh.seek(10 + triplet * 3 * 4)
data = fh.read(3 * 4)
print(data)
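If the writer above ran to completion, this should print b'a999b999c999', i.e. the final triplet of the toy file.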
I am trying to write a large number of files [2000-2500] to disk after processing. I noticed that the first 100 or so images are fast to write to disk, then there is a slowdown. Why is this happening and what can I do to speed up the process?
This is my code that writes the images:
for b in range(Data.shape[1]):
    t0 = time.clock()
    img = Data[:,b]
    img = np.reshape(img,(501,501))
    save = os.path.join(savedir,"%s_%s"%(item,b))
    plt.imshow(img).figure.savefig(save)
    print "Saved %s of %s in %s seconds"%(b,Data.shape[1],time.clock()-t0)
Edit:
Saved 0 of 1024 in 0.103277 seconds
Saved 1 of 1024 in 0.0774039999999 seconds
Saved 2 of 1024 in 0.0883339999998 seconds
Saved 3 of 1024 in 0.0922500000001 seconds
Saved 4 of 1024 in 0.0972509999999 seconds
And after a few iterations:
Saved 1018 of 1024 in 2.152941 seconds
Saved 1019 of 1024 in 2.163633 seconds
Saved 1020 of 1024 in 2.198959 seconds
Saved 1021 of 1024 in 2.172303 seconds
Saved 1022 of 1024 in 2.19014 seconds
Saved 1023 of 1024 in 2.203727 seconds
Each time you call plt.imshow, you create a new AxesImage, each of which takes up some memory. To speed things up, you can clear the figure with clf() after each save.
You can check this using len(plt.gca().images) to see how many images you have open. Without the clf() line, you will see that number growing by 1 each iteration.
for b in range(Data.shape[1]):
    img = Data[:,b]
    img = np.reshape(img,(501,501))
    print "Saving %s of %s"%(b,Data.shape[1])
    save = os.path.join(savedir,"%s_%s"%(item,b))
    plt.imshow(img).figure.savefig(save)
    print "There are %d image(s) open"%len(plt.gca().images)
    plt.gcf().clf() # clear the figure
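A different sketch of the same idea, not from the answer above: create the figure and the AxesImage once, then update the image data with set_data on each iteration, so no artists accumulate at all (Data, savedir and item are assumed to exist as in the question):

fig, ax = plt.subplots()
im = ax.imshow(np.reshape(Data[:, 0], (501, 501)))  # create the single AxesImage once
for b in range(Data.shape[1]):
    im.set_data(np.reshape(Data[:, b], (501, 501)))  # swap in the new frame
    im.autoscale()  # rescale the color limits to the new data
    fig.savefig(os.path.join(savedir, "%s_%s" % (item, b)))
plt.close(fig)  # release the figure when done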
My current approach is this:
def get_hash(path=PATH, hash_type='md5'):
    func = getattr(hashlib, hash_type)()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(1024*func.block_size), b''):
            func.update(block)
    return func.hexdigest()
It takes about 3.5 seconds to calculate the md5sum of an 842 MB iso file on an i5 @ 1.7 GHz. I have tried different methods of reading the file, but all of them yield slower results. Is there, perhaps, a faster solution?
EDIT: I replaced 2**16 (inside the f.read()) with 1024*func.block_size, since the default block_size for most hashing functions supported by hashlib is 64 (except for 'sha384' and 'sha512' - for them, the default block_size is 128). Therefore, the block size is still the same (65536 bytes).
EDIT(2): I did something wrong. It takes 8.4 seconds instead of 3.5. :(
EDIT(3): Apparently Windows was using the disk at +80% when I ran the function again. It really takes 3.5 seconds. Phew.
Another solution (about 0.5 s faster) is to use os.open():
def get_hash(path=PATH, hash_type='md5'):
    func = getattr(hashlib, hash_type)()
    # os.O_BINARY exists only on Windows; on other platforms drop it and use os.O_RDONLY
    f = os.open(path, (os.O_RDWR | os.O_BINARY))
    for block in iter(lambda: os.read(f, 2048*func.block_size), b''):
        func.update(block)
    os.close(f)
    return func.hexdigest()
Note that these results are not final.
Using an 874 MiB random-data file, which took 2 seconds with the openssl md5 tool, I was able to improve speed as follows.
Using your first method required 21 seconds.
Reading the entire file (21 seconds) to buffer and then updating required 2 seconds.
Using the following function with a buffer size of 8096 required 17 seconds.
Using the following function with a buffer size of 32767 required 11 seconds.
Using the following function with a buffer size of 65536 required 8 seconds.
Using the following function with a buffer size of 131072 required 8 seconds.
Using the following function with a buffer size of 1048576 required 12 seconds.
def md5_speedcheck(path, size):
    pts = time.process_time()
    ats = time.time()
    m = hashlib.md5()
    with open(path, 'rb') as f:
        b = f.read(size)
        while len(b) > 0:
            m.update(b)
            b = f.read(size)
    print("{0:.3f} s".format(time.process_time() - pts))
    print("{0:.3f} s".format(time.time() - ats))
The times noted above are wall-clock times; processor time is about the same for all of them, with the difference being spent blocked on I/O.
The key determinant here is to have a buffer size that is big enough to mitigate disk latency, but small enough to avoid VM page swaps. For my particular machine it appears that 64 KiB is about optimal.
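Folding that 64 KiB result back into the question's helper, a minimal sketch (hashlib.new replaces the getattr lookup; the behaviour should be the same):

import hashlib

def get_hash(path, hash_type='md5', buf_size=64 * 1024):
    func = hashlib.new(hash_type)
    with open(path, 'rb') as f:
        # read the file in 64 KiB blocks until b'' signals end of file
        for block in iter(lambda: f.read(buf_size), b''):
            func.update(block)
    return func.hexdigest()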
In Django I am trying to use a FileField on my model and set it from an existing file on the filesystem. I tried this, and I was only getting tens of KB in the media directory.
c = MyClass()
f = open('D:\\bin.jpg')
df = File(f)
c.file.save('newFile', df)
f.close()
c.save()
FileField.save calls File.chunks, and it looks like for binary files it is not getting the whole thing. Am I missing something here?
f_text = File(open('D:\\text.txt'))
print f_text.size / 1024. / 1024
>> 13.7466430664
print sum([len(c) for c in f_text.chunks()]) / 1024. / 1024
>> 13.7466430664
f_bin = File(open('D:\\bin.jpg'))
print f_bin.size / 1024. / 1024
>> 0.741801261902
print sum([len(c) for c in f_bin.chunks()]) / 1024. / 1024
>> 0.00253677368164
f = MyClass.objects.get(id=50).file
# is file as f_bin uploaded using Django admin tool
print f.size / 1024. / 1024
>> 0.741801261902
print sum([len(c) for c in f.chunks()]) / 1024. / 1024
>> 0.741801261902
System: Windows 7
Django: 1.5.1
Python: 2.7.5
You need to open the file in binary mode:
f_bin = File(open('D:\\bin.jpg', 'rb'))
See Reading and Writing Files in the Python documentation.
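Applied to the snippet in the question, the corrected flow would look roughly like this (MyClass and the path come from the question; the field name is assumed to be file):

from django.core.files import File

c = MyClass()
with open('D:\\bin.jpg', 'rb') as f:  # 'rb': binary mode, as the answer says
    c.file.save('newFile', File(f))
c.save()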
I'm curious to understand why, in the first example, memory consumption behaves as I was imagining:
s = StringIO()
s.write('abc'*10000000)
# Memory increases: OK
s.seek(0)
s.truncate()
# Memory decreases: OK
while in this second example, at the end, I do the same thing, but memory does not seem to decrease after the truncate call.
The following code is in a method of a class.
from StringIO import StringIO
import requests

self.BUFFER_SIZE = 5 * 1024 * 2 ** 10  # 5 MB
self.MAX_MEMORY = 3 * 1024 * 2 ** 10   # 3 MB

r = requests.get(self.target, stream=True)  # stream=True to not download the data at once
chunks = r.iter_content(chunk_size=self.BUFFER_SIZE)

buff = StringIO()
# Get the first MAX_MEMORY bytes of data
for chunk in chunks:
    buff.write(chunk)
    if buff.len > self.MAX_MEMORY:
        break

# Left the loop because there are no more chunks: it stays in memory
if buff.len < self.MAX_MEMORY:
    self.data = buff.getvalue()
# Otherwise, prepare a temp file and process the remaining chunks
else:
    self.path = self._create_tmp_file_path()
    with open(self.path, 'w') as f:
        # Write the first downloaded data
        buff.seek(0)
        f.write(buff.read())
        # Free the buffer?
        buff.seek(0)
        buff.truncate()
        ###################
        # Memory does not decrease.
        # Another 5 MB will be added to memory on the next line, which is normal
        # because it is the size of a chunk. But if the buffer were freed,
        # the memory would stay steady: -5 MB + 5 MB.
        # Write the remaining chunks directly into the file
        for chunk in chunks:
            f.write(chunk)
Any thoughts?
Thanks.