How to accurately calculate progress during a streaming download - Python

I am trying to use a streaming POST request to download files from my online database while providing an accurate indication of the relative download progress, to drive a progress bar in my Qt application.
I thought I could simply compare chunk_size multiplied by the number of chunks received against the file size to know how much data I have downloaded, but it doesn't seem to work that way.
To test my understanding of chunks I set the chunk_size to be the same as the file size (about 9.8MB). This is my test code:
with closing(requests.post(ipAddress,
                           headers={'host': hostURL},
                           data=dataDict,
                           timeout=timeout,
                           stream=True)) as responseObject:
    chunkNumber = 0
    for chunk in responseObject.iter_content(chunk_size=10276044):
        print chunkNumber
        chunkNumber += 1
        content += chunk
I expected to only see one or two chunks, but instead I see chunkNumber increase to anywhere between 1600 and over 4000 when I run the test multiple times.
I am obviously misinterpreting the use of chunk_size, so my question is:
How can I accurately determine the relative progress of the download during the iter_content() loop so that I can drive a progress bar from 0 to 100%?
Cheers,
frank

The solution I found for my own project was to read the content length of the response and divide it by 100.
This is Python 3 code, so the print calls would need some adjustment to run under Python 2.
import os
import sys

import requests

f = open(title, 'wb')
response = requests.get(url, params=query, headers=HDR, stream=True)
size = int(response.headers.get('content-length'))
CHUNK = size // 100               # chunk size equal to one percentage point
if CHUNK > 1000000:               # cap the chunk size for files over ~1 MB
    CHUNK = 100000                # 0.1 MB
print(size, 'bytes')
print("Writing to file in chunks of {} bytes...".format(CHUNK))
actual = 0                        # bytes written so far
try:
    for chunk in response.iter_content(chunk_size=CHUNK):
        if not chunk:
            break
        f.write(chunk)
        actual += len(chunk)      # advance the progress counter
        percent = int((actual / size) * 100)
        if 'idlelib' in sys.modules:          # running inside IDLE: print every 5%
            if not percent % 5:
                print('{}%'.format(percent), end=' ')
        else:
            # Windows only: show progress in the console title bar.
            # Drop this branch if you are not on Windows.
            os.system('title {}% {}/{}'.format(percent, actual, size))
except Exception as e:
    print(e)
finally:
    f.close()
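If the percentage needs to go somewhere other than the console (the original question mentions a Qt progress bar), the same idea can be wrapped in a helper that reports through a callback. This is only a sketch: the function name, URL and print callback are placeholders, and it assumes the server sends a Content-Length header.

import requests

def download_with_progress(url, dest, report=lambda pct: None, chunk_size=64 * 1024):
    # Stream the response and call report(0..100) as bytes arrive.
    # The callback could, for example, be connected to a progress bar's setValue slot.
    response = requests.get(url, stream=True)
    response.raise_for_status()
    total = int(response.headers.get('content-length', 0))
    done = 0
    with open(dest, 'wb') as f:
        for chunk in response.iter_content(chunk_size=chunk_size):
            f.write(chunk)
            done += len(chunk)
            if total:
                report(done * 100 // total)

download_with_progress('http://example.com/big.bin', 'big.bin',
                       report=lambda pct: print('{}%'.format(pct)))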

Related

Unable to find headers of jpg while reading a raw disk image (dd)

I am reading a raw disk image using Python 3. My task is to retrieve (carve) JPGs as individual files from the disk image. I know the JPG header patterns (\xd8\xff\xe0 or \xd8\xff\xe1) and want to find where they occur while reading the file.
def findheader(data):  # find a header in each 32-byte block of the raw disk image
    for i in range(0, len(data) - 3):
        if data[i] == b'\xff':
            if data[i+1:i+4] == b'\xd8\xff\xe0' or data[i+1:i+4] == b'\xd8\xff\xe1':
                return i
    return -1

fobj = open('carve.dd', 'rb')
data = fobj.read(32)
while data != '':
    head_loc = findheader(data)
    print(head_loc)
    data = fobj.read(32)
The same code works fine in Python 2; there I am able to get the headers from the image in just a few seconds. Can someone help me out with what the problem is in Python 3?
This code snippet is from https://github.com/darth-cheney/JPEG-Recover/blob/master/jpegrecover2.py
It runs fine in Python 2 but not in Python 3. Please ignore the inconsistent-tab error if you run the code from the link; I retyped it in VS Code.
Like the old saying goes, I've got some bad news and some good news. The bad news is that I can't figure out why your code doesn't work the same in versions 2 and 3 of Python.
The good news is that I was able to reproduce the problem using the sample data you provided and, more importantly, to devise something that not only works consistently in both versions but is also likely much faster, because it doesn't use a for loop to search each chunk of data for the .jpg header patterns.
from __future__ import print_function

LIMIT = 100000       # Number of chunks to scan (for testing).
CHUNKSIZE = 32       # Bytes.
HDRS = b'\xff\xd8\xff\xe0', b'\xff\xd8\xff\xe1'
IMG_PATH = r'C:\vols\Files\Temp\carve.dd.002'

print('Searching...')
with open(IMG_PATH, 'rb') as file:
    chunk_index = 0
    found = 0
    while True:
        data = file.read(CHUNKSIZE)
        if not data:
            break
        # Search for each of the headers in each chunk.
        for hdr in HDRS:
            offset = 0
            while offset < (CHUNKSIZE - len(hdr)):
                try:
                    head_loc = data[offset:].index(hdr)
                except ValueError:  # Not found.
                    break
                found += 1
                file_offset = chunk_index*CHUNKSIZE + offset + head_loc
                print('found: #{} at {:,}'.format(found, file_offset))
                offset += (head_loc + len(hdr))
        chunk_index += 1
        if LIMIT and (chunk_index == LIMIT):  # Stop after this many chunks.
            break

print('total found {}'.format(found))

Gzipping files with Python

Is it normal that the gzip algorithm can make a file larger after compression?
For example, I need to split a large 8.2 MB file into 101024 small chunks of 81 bytes and compress them with the gzip library. After this is done, the folder with the gzipped files has grown to 13 MB, larger than the total size of the uncompressed chunks. Here is the relevant piece of code:
def gzip_it(filenumber, chunk, path=FOLDER_PATH, prefix=FILE_NAME_PREFIX):
    with gzip.open(os.path.join(path, prefix + "{:07d}".format(filenumber) + ".gz"), mode="wb") as chunk_file:
        chunk_file.write(gzip.compress(chunk))

def split_and_write(file, thread_num):
    spare_to_distribute_inner = SPARE_TO_DISTRIBUTE
    initial_position = 0 if thread_num == 0 else BYTES_PER_THREAD * thread_num
    initial_file_num = 0 if thread_num == 0 else FILES_PER_THREAD * thread_num
    with open(file, mode="rb") as file:
        file.seek(initial_position)
        while initial_file_num < FILES_PER_THREAD * (thread_num + 1):
            if spare_to_distribute_inner:
                chunk = file.read(CHUNK_FILE_SIZE + 1)
                gzip_it(initial_file_num, chunk)
                initial_file_num += 1
                initial_position += (CHUNK_FILE_SIZE + 1)
                spare_to_distribute_inner -= 1
            else:
                if initial_file_num == FILES_TOTAL - 1:
                    chunk = file.read(CHUNK_FILE_SIZE + SPARE_TO_DISTRIBUTE_REMAINDER)
                    gzip_it(initial_file_num, chunk)
                    make_marker_file(str(SOURCE_FILE_SIZE).encode())
                    break
                else:
                    chunk = file.read(CHUNK_FILE_SIZE)
                    gzip_it(initial_file_num, chunk)
                    initial_file_num += 1
                    initial_position += CHUNK_FILE_SIZE

def main():
    for thread in range(VIRTUAL_THREADS):
        pool.submit(split_and_write, "cry_cmake.exe", thread)
Yes, it is completely normal for files to become larger after compression; this usually happens with files that are already compressed.
What you are doing is wrong: your chunks are far too small to be compressed meaningfully. Try making chunks of 1 MiB or more.
Basically, the compression algorithm looks for repeated sequences and shortens them, building up a dictionary of the original sequences and their shortened versions.
If the chunks are this small, it cannot find long repeated sequences, and it has to repeat that per-file overhead for every single chunk.
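As a rough sketch of that overhead (the sizes below are arbitrary, chosen to mirror the question): each gzip member carries its own header, trailer and block structure, which dwarfs an 81-byte payload but is negligible next to a megabyte.

import gzip
import os

small = os.urandom(81)            # one 81-byte chunk, as in the question
large = os.urandom(81) * 12800    # roughly 1 MiB built from a repeated block

print(len(gzip.compress(small)))  # larger than 81 bytes: the per-file overhead dominates
print(len(gzip.compress(large)))  # far smaller than 1 MiB: repeated sequences can be exploited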
Why do you want to split the original file first and compress each mini-chunk by itself? In most use cases people compress first and split afterwards.
An alternative for your case would be to split the original file into the mini-chunks without compressing each one separately, put them all in one directory, and then make a .tgz out of that directory:
tar -c -z -f result.tgz chunks_directory/
That way the compression takes place after tar has bundled all the files, yet after unpacking you still get all the mini-chunk files back.
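If you would rather stay in Python, the standard tarfile module can produce the same kind of archive; this is just a sketch reusing the names from the command above:

import tarfile

# Bundle the uncompressed chunk files and gzip the whole archive in one pass.
with tarfile.open('result.tgz', 'w:gz') as archive:
    archive.add('chunks_directory')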

Search a 2 GB WAV file for dropouts using the wave module

What is the best way to analyze a 2 GB WAV file (a 1 kHz tone) for audio dropouts using the wave module? I tried the script below:
import wave

file1 = wave.open("testdropout.wav", "r")
file2 = open("silence.log", "w")

for i in xrange(file1.getnframes()):
    frame = file1.readframes(i)
    zero = True
    for j in xrange(len(frame)):
        # check if amplitude is greater than 0
        # the ord() function converts the hex values to integers
        if ord(frame[j]) > 0:
            zero = False
            break
    if zero:
        print >> file2, 'dropout at second %s' % (file1.tell()/file1.getframerate())

file1.close()
file2.close()
I haven't used the wave module before, but file1.readframes(i) looks like it reads 1 frame when you're at the first frame, 2 frames at the second frame, 10 frames at the tenth, and a 2 GB CD-quality file might have a million frames, so by the time you're at frame 100,000 you're reading 100,000 frames per pass, and the loop gets slower every time through as well.
As noted in my comment, in Python 2 range() generates the full list in memory first and xrange() doesn't, but not using range at all helps even more.
You can also push the inner looping down into the lower layers with any() to make the code shorter, and possibly faster:
import wave

file1 = wave.open("testdropout.wav", "r")
file2 = open("silence.log", "w")

chunksize = file1.getframerate()
chunk = file1.readframes(chunksize)

while chunk:
    if not any(ord(sample) for sample in chunk):
        print >> file2, 'dropout at second %s' % (file1.tell()/chunksize)
    chunk = file1.readframes(chunksize)

file1.close()
file2.close()
This should read the file in 1-second chunks.
I think a simple solution to this would be to consider that the frame rate on audio files is pretty high. A sample file on my computer happens to have a framerate of 8,000. That means for every second of audio, I have 8,000 samples. If you have missing audio, I'm sure it will exist across multiple frames within a second, so you can essentially reduce your comparisons as drastically as your standards would allow. If I were you, I would try iterating over every 1,000th sample instead of every single sample in the audio file. That basically means it will examine every 1/8th of a second of audio to see if it's dead. Not as precise, but hopefully it will get the job done.
import wave

file1 = wave.open("testdropout.wav", "r")
file2 = open("silence.log", "w")

for i in range(file1.getnframes()):
    frame = file1.readframes(i)
    zero = True
    for j in range(0, len(frame), 1000):
        # check if amplitude is greater than 0
        # the ord() function converts the hex values to integers
        if ord(frame[j]) > 0:
            zero = False
            break
    if zero:
        print >> file2, 'dropout at second %s' % (file1.tell()/file1.getframerate())

file1.close()
file2.close()
At the moment, you're reading the entire file into memory, which is not ideal. If you look at the methods available for a "Wave_read" object, one of them is setpos(pos), which sets the position of the file pointer to pos. If you update this position, you should be able to only keep the frame you want in memory at any given time, preventing errors. Below is a rough outline:
import wave

file1 = wave.open("testdropout.wav", "r")
file2 = open("silence.log", "w")

def scan_frame(frame):
    for j in range(len(frame)):
        # check if amplitude is less than 0
        # It makes more sense here to check for the desired case (low amplitude)
        # rather than breaking at higher amplitudes
        if ord(frame[j]) <= 0:
            return True

for i in range(file1.getnframes()):
    frame = file1.readframes(1)  # only read the frame at the current file position
    zero = scan_frame(frame)
    if zero:
        print >> file2, 'dropout at second %s' % (file1.tell()/file1.getframerate())
    pos = file1.tell()              # current file position
    file1.setpos(pos + len(frame))  # or pos + 1, or whatever a single unit in a wave
                                    # file is, I'm not entirely sure

file1.close()
file2.close()
Hope this can help!

What is the optimal way to process a very large (over 30GB) text file and also show progress

[newbie question]
Hi,
I'm working on a huge text file which is well over 30GB.
I have to do some processing on each line and then write it to a database in JSON format. When I read the file and loop over it with "for", my computer crashes with a blue screen after about 10% of the data has been processed.
I'm currently using this:
f = open(file_path, 'r')
for one_line in f.readlines():
    do_some_processing(one_line)
f.close()
Also, how can I show the overall progress of how much data has been crunched so far?
Thank you all very much.
File handles are iterable, and you should probably use a context manager. Try this:
with open(file_path, 'r') as fh:
for line in fh:
process(line)
That might be enough.
I use a function like this for a similar problem. You can wrap any iterable with it.
Change this:
for one_line in f.readlines():
to this:
# don't use readlines(); it builds a big list of all the data in memory rather than
# iterating one line at a time
for one_line in progress_meter(f, 10000):
You might want to pick a smaller or larger value depending on how much time you want to waste printing status messages.
import time

def progress_meter(iterable, chunksize):
    """Prints progress through iterable at chunksize intervals."""
    scan_start = time.time()
    since_last = time.time()
    for idx, val in enumerate(iterable):
        if idx % chunksize == 0 and idx > 0:
            print idx
            print 'avg rate', idx / (time.time() - scan_start)
            print 'inst rate', chunksize / (time.time() - since_last)
            since_last = time.time()
            print
        yield val
Using readlines() forces the interpreter to read the whole file and find the end of every line up front. If the file is huge or some lines are very long, this can exhaust memory and crash your interpreter.
To show progress you can check the file size, for example using:
import os

f = open(file_path, 'r')
fsize = os.fstat(f.fileno()).st_size
The progress of your task can then be the number of bytes processed so far, divided by the file size and multiplied by 100 to get a percentage.
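Putting the two answers together, a minimal sketch (process_with_progress and process are made-up names) that iterates line by line and reports a percentage based on bytes consumed could look like this:

import os

def process_with_progress(file_path, process):
    fsize = os.stat(file_path).st_size
    done = 0
    last_percent = -1
    with open(file_path, 'rb') as fh:    # binary mode, so len(line) matches the bytes on disk
        for line in fh:
            process(line)
            done += len(line)
            percent = done * 100 // fsize
            if percent != last_percent:  # only report when the percentage actually changes
                print('{}%'.format(percent))
                last_percent = percent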

How to get the progress of a file move in Python?

I've got a little script for sorting out my downloaded files and it works great, but I'd like to print out the progress of a file move when it's handling the big ones. Right now I do something like:
print "moving..."
os.renames(pathTofile, newName)
print "done"
But I'd like to be able to see something like a progress bar ( [..... ] style) or a percentage printed to stdout.
I don't need/want a GUI of any sort, just the simplest / least-work way to see the operation's progress. :)
Thanks!
You won't be able to get that kind of information using os.renames. Your best bet is to replace it with a home-grown file copy operation, calling stat on the file beforehand to get its total size so you can track how far through you are.
Something like this:
import os

source_size = os.stat(SOURCE_FILENAME).st_size
copied = 0
source = open(SOURCE_FILENAME, 'rb')
target = open(TARGET_FILENAME, 'wb')
while True:
    chunk = source.read(32768)
    if not chunk:
        break
    target.write(chunk)
    copied += len(chunk)
    print '\r%02d%%' % (copied * 100 / source_size),
source.close()
target.close()
Note however that this will more than likely be markedly slower than using os.rename.
There isn't any way to get a progress bar because the "rename" call that moves the file is a single OS call.
It's worth noting that the "rename" call only takes time if the source and destination are on different physical volumes. If they're on the same volume, then the rename will take almost no time. If you know that you're copying data between volumes, you may wish to use functions from the shutil module such as copyfileobj. There is no callback for progress monitoring, however you can implement your own source or destination file-like object to track progress.
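To make that last idea concrete, here is a rough sketch (ProgressReader and move_with_progress are made-up names, not part of shutil): the source object passed to shutil.copyfileobj only needs a read() method, so it can count bytes as they pass through and report a percentage.

import os
import shutil

class ProgressReader(object):
    """Wrap a binary file and report progress as copyfileobj reads from it."""
    def __init__(self, fileobj, total_size, report):
        self.fileobj = fileobj
        self.total_size = max(total_size, 1)   # avoid division by zero for empty files
        self.read_so_far = 0
        self.report = report                   # callable taking a percentage 0..100

    def read(self, size=-1):
        chunk = self.fileobj.read(size)
        self.read_so_far += len(chunk)
        self.report(self.read_so_far * 100 // self.total_size)
        return chunk

def move_with_progress(src, dst, report=lambda pct: None):
    size = os.stat(src).st_size
    with open(src, 'rb') as fin, open(dst, 'wb') as fout:
        shutil.copyfileobj(ProgressReader(fin, size, report), fout)
    os.remove(src)   # complete the "move" once the copy has succeeded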
This example method expands on the answer by Benno by estimating the time remaining and removing the progress line when the copy is complete.
import os
import sys
import time

def copy_large_file(src, dst):
    '''
    Copy a large file showing progress.
    '''
    print('copying "{}" --> "{}"'.format(src, dst))

    # Start the timer and get the size.
    start = time.time()
    size = os.stat(src).st_size
    print('{} bytes'.format(size))

    # Adjust the chunk size to the input size.
    divisor = 10000  # 0.1% of the size
    chunk_size = size // divisor
    while chunk_size == 0 and divisor > 0:
        divisor //= 10
        chunk_size = size // divisor
    print('chunk size is {}'.format(chunk_size))

    # Copy.
    try:
        with open(src, 'rb') as ifp:
            with open(dst, 'wb') as ofp:
                copied = 0  # bytes copied so far
                chunk = ifp.read(chunk_size)
                while chunk:
                    # Write and calculate how much has been written so far.
                    ofp.write(chunk)
                    copied += len(chunk)
                    per = 100. * float(copied) / float(size)

                    # Calculate the estimated time remaining.
                    elapsed = time.time() - start  # elapsed so far
                    avg_time_per_byte = elapsed / float(copied)
                    remaining = size - copied
                    est = remaining * avg_time_per_byte
                    est1 = size * avg_time_per_byte
                    eststr = 'rem={:>.1f}s, tot={:>.1f}s'.format(est, est1)

                    # Write out the status.
                    sys.stdout.write('\r{:>6.1f}% {} {} --> {} '.format(per, eststr, src, dst))
                    sys.stdout.flush()

                    # Read in the next chunk.
                    chunk = ifp.read(chunk_size)
    except IOError as obj:
        print('\nERROR: {}'.format(obj))
        sys.exit(1)

    sys.stdout.write('\r\033[K')  # clear to EOL
    elapsed = time.time() - start
    print('copied "{}" --> "{}" in {:>.1f}s'.format(src, dst, elapsed))
You can see a fully functioning version in the gist entry here: https://gist.github.com/jlinoff/0f7b290dc4e1f58ad803.
