I've got a little script for sorting out my dowloaded files and it works great, but I'd like to print out the progress of a file move, for when it's doing the big ones, right now I do something like:
print "moving..."
os.renames(pathTofile, newName)
print "done"
But I'd like to be able to see something like a progress bar ( [..... ] style) or a percentage printed to stdout.
I don't need/want a gui of any sort, just the simplest / least-work ( :) ) way to get the operation progress).
Thanks!
You won't be able to get that kind of information using os.renames. Your best bet is to replace that with a home grown file copy operation but call stat on the file beforehand in order to get the complete size so you can track how far through you are.
Something like this:
source_size = os.stat(SOURCE_FILENAME).st_size
copied = 0
source = open(SOURCE_FILENAME, 'rb')
target = open(TARGET_FILENAME, 'wb')
while True:
chunk = source.read(32768)
if not chunk:
break
target.write(chunk)
copied += len(chunk)
print '\r%02d%%' % (copied * 100 / source_size),
source.close()
target.close()
Note however that this will more than likely be markedly slower than using os.rename.
There isn't any way to get a progress bar because the "rename" call that moves the file is a single OS call.
It's worth noting that the "rename" call only takes time if the source and destination are on different physical volumes. If they're on the same volume, then the rename will take almost no time. If you know that you're copying data between volumes, you may wish to use functions from the shutil module such as copyfileobj. There is no callback for progress monitoring, however you can implement your own source or destination file-like object to track progress.
This example method expands on the answer by Benno by estimating the time remaining and removing the progress line when the copy is complete.
def copy_large_file(src, dst):
'''
Copy a large file showing progress.
'''
print('copying "{}" --> "{}"'.format(src, dst))
# Start the timer and get the size.
start = time.time()
size = os.stat(src).st_size
print('{} bytes'.format(size))
# Adjust the chunk size to the input size.
divisor = 10000 # .1%
chunk_size = size / divisor
while chunk_size == 0 and divisor > 0:
divisor /= 10
chunk_size = size / divisor
print('chunk size is {}'.format(chunk_size))
# Copy.
try:
with open(src, 'rb') as ifp:
with open(dst, 'wb') as ofp:
copied = 0 # bytes
chunk = ifp.read(chunk_size)
while chunk:
# Write and calculate how much has been written so far.
ofp.write(chunk)
copied += len(chunk)
per = 100. * float(copied) / float(size)
# Calculate the estimated time remaining.
elapsed = time.time() - start # elapsed so far
avg_time_per_byte = elapsed / float(copied)
remaining = size - copied
est = remaining * avg_time_per_byte
est1 = size * avg_time_per_byte
eststr = 'rem={:>.1f}s, tot={:>.1f}s'.format(est, est1)
# Write out the status.
sys.stdout.write('\r{:>6.1f}% {} {} --> {} '.format(per, eststr, src, dst))
sys.stdout.flush()
# Read in the next chunk.
chunk = ifp.read(chunk_size)
except IOError as obj:
print('\nERROR: {}'.format(obj))
sys.exit(1)
sys.stdout.write('\r\033[K') # clear to EOL
elapsed = time.time() - start
print('copied "{}" --> "{}" in {:>.1f}s"'.format(src, dst, elapsed))
You can see a fully functioning version in the gist entry here: https://gist.github.com/jlinoff/0f7b290dc4e1f58ad803.
Related
Is it normal that gzip algorithm can make file size large after compression?
E.g. it's needed to split a large file of 8.2Mb into small 101024 chunks of 81 bytes and compress them using gzip library. After it's done I see that folder with gzipped files has become larger in size and it is 13Mb now in comparison with total chunks size without compression. And for example there is a piece of code here:
def gzip_it(filenumber, chunk, path=FOLDER_PATH, prefix=FILE_NAME_PREFIX):
with gzip.open(os.path.join(path, prefix + "{:07d}".format(filenumber) + ".gz"), mode="wb") as chunk_file:
chunk_file.write(gzip.compress(chunk))
def split_and_write(file, thread_num):
spare_to_distribute_inner = SPARE_TO_DISTRIBUTE
initial_position = 0 if thread_num == 0 else BYTES_PER_THREAD * thread_num
initial_file_num = 0 if thread_num == 0 else FILES_PER_THREAD * thread_num
with open(file, mode="rb") as file:
file.seek(initial_position)
while initial_file_num < FILES_PER_THREAD * (thread_num + 1):
if spare_to_distribute_inner:
chunk = file.read(CHUNK_FILE_SIZE + 1)
gzip_it(initial_file_num, chunk)
initial_file_num += 1
initial_position += (CHUNK_FILE_SIZE + 1)
spare_to_distribute_inner -= 1
else:
if initial_file_num == FILES_TOTAL - 1:
chunk = file.read(CHUNK_FILE_SIZE + SPARE_TO_DISTRIBUTE_REMAINDER)
gzip_it(initial_file_num, chunk)
make_marker_file(str(SOURCE_FILE_SIZE).encode())
break
else:
chunk = file.read(CHUNK_FILE_SIZE)
gzip_it(initial_file_num, chunk)
initial_file_num += 1
initial_position += CHUNK_FILE_SIZE
def main():
for thread in range(VIRTUAL_THREADS):
pool.submit(split_and_write, "cry_cmake.exe", thread)
Yes it is completely normal that files become larger after compression. This happens usually with files that are already compressed.
What you are doing is wrong. Your chunks are too small to be compressed meaningfully. Try making chunks of 1MiB or more.
Basically in a compression, the algorithm looks for repeated sequences and shortens them, creating an initial dictionary with the original sequence and the shortened version.
If the chunks are so small, it can't really find long repeated sequences and it needs to repeat this initial dictionary per every chunk.
How come you want to split the original file first and compress each minichunk by itself? In most use cases people compress first and split afterwards.
An alternative for your case can be to split the original file into the minichunks but do not compress each of it separately but instead put all of them in one directory and then make a .tgz out of the directory:
tar -c -z -f result.tgz chunks_directory/
Then the compression takes place after tar has bundled all the files again but after unpacking, you will receive all the minichunk files again.
I am trying to use a streaming post request to download files from my online database while providing an accurate indication of the relative download progress to drive a progress bar in my QT application.
I thought I could simply compare chunk_size * chunk amount to the file size to know how much relative data I have downloaded, but it doesn't seem to work that way.
To test my understanding of chunks I set the chunk_size to be the same as the file size (about 9.8MB). This is my test code:
with closing(requests.post(ipAddress,
headers={'host':hostURL},
data=dataDict,
timeout=timeout,
stream=True)) as responseObject:
chunkNumber = 0
for chunk in responseObject.iter_content(chunk_size=10276044):
print chunkNumber
chunkNumber += 1
content += chunk
I expected to only see one or two chunks, but instead I see chunkNumber increase to anywhere between 1600 and over 4000 when I run the test multiple times.
I am obviously misinterpreting the use of chunk_size, so my question is:
How can I accurately determine the relative progress of the download during the iter_content() loop so that I can drive a progress bar from 0 to 100%?
Cheers,
frank
The solution I found for my own project was to find the length of the response and divide by 100.
This is Python 3 code so just remove the parentheses from the print statements to be compatible with 2.
f = open(title, 'wb')
response = requests.get(url, params=query, headers=HDR, stream=True)
size = int(response.headers.get('content-length'))
CHUNK = size//100 # size for a percentage point
if CHUNK > 1000000: # place chunk cap on >1MB files
CHUNK = 100000 # 0.1MB
print(size, 'bytes')
print("Writing to file in chunks of {} bytes...".format(CHUNK))
actual = 0 # current progress
try:
for chunk in response.iter_content(chunk_size=CHUNK):
if not chunk: break
f.write(chunk)
actual += len(chunk) # move progress bar
percent = int((actual/size)*100)
if 'idlelib' in sys.modules: # you can take these conditions out if you don't have windows
#if you do then import sys, os at the start of the program
if not(percent % 5):
print('{}%'.format(percent), end=' ')
else:
os.system('title {}% {}/{}'.format(percent, actual, size))
except Exception as e:
print(e)
finally:
f.close()
`What is the best way to analyze a 2GB WAV file (1khz Tone) for audio dropouts using wave module? I tried the script below
import wave
file1 = wave.open("testdropout.wav", "r")
file2 = open("silence.log", "w")
for i in xrange(file1.getnframes()):
frame = file1.readframes(i)
zero = True
for j in xrange(len(frame)):
# check if amplitude is greater than 0
# the ord() function converts the hex values to integers
if ord(frame[j]) > 0:
zero = False
break
if zero:
print >> file2, 'dropout at second %s' % (file1.tell()/file1.getframerate())
file1.close()
file2.close()
I haven't used the wave module before, but file1.readframes(i) looks like it's reading 1 frame when you're at the first frame, 2 frames when you're at the second frame, 10 frames when you're in the tenth frame, and a 2Gb CD quality file might have a million frames - by the time you're at frame 100,000 reading 100,000 frames ... getting slower each time through the loop as well?
And from my comment, in Python 2 range() generates an in-memory array of the full size first, and xrange() doesn't, but not using range at all helps even more.
And push the looping down into the lower layers with any() to make the code shorter, and possibly faster:
import wave
file1 = wave.open("testdropout.wav", "r")
file2 = open("silence.log", "w")
chunksize = file1.getframerate()
chunk = file1.readframes(chunksize)
while chunk:
if not any(ord(sample) for sample in chunk):
print >> file2, 'dropout at second %s' % (file1.tell()/chunksize)
chunk = file1.readframes(chunksize)
file1.close()
file2.close()
This should read the file in 1-second chunks.
I think a simple solution to this would be to consider that the frame rate on audio files is pretty high. A sample file on my computer happens to have a framerate of 8,000. That means for every second of audio, I have 8,000 samples. If you have missing audio, I'm sure it will exist across multiple frames within a second, so you can essentially reduce your comparisons as drastically as your standards would allow. If I were you, I would try iterating over every 1,000th sample instead of every single sample in the audio file. That basically means it will examine every 1/8th of a second of audio to see if it's dead. Not as precise, but hopefully it will get the job done.
import wave
file1 = wave.open("testdropout.wav", "r")
file2 = open("silence.log", "w")
for i in range(file1.getnframes()):
frame = file1.readframes(i)
zero = True
for j in range(0, len(frame), 1000):
# check if amplitude is greater than 0
# the ord() function converts the hex values to integers
if ord(frame[j]) > 0:
zero = False
break
if zero:
print >> file2, 'dropout at second %s' % (file1.tell()/file1.getframerate())
file1.close()
file2.close()
At the moment, you're reading the entire file into memory, which is not ideal. If you look at the methods available for a "Wave_read" object, one of them is setpos(pos), which sets the position of the file pointer to pos. If you update this position, you should be able to only keep the frame you want in memory at any given time, preventing errors. Below is a rough outline:
import wave
file1 = wave.open("testdropout.wav", "r")
file2 = open("silence.log", "w")
def scan_frame(frame):
for j in range(len(frame)):
# check if amplitude is less than 0
# It makes more sense here to check for the desired case (low amplitude)
# rather than breaking at higher amplitudes
if ord(frame[j]) <= 0:
return True
for i in range(file1.getnframes()):
frame = file1.readframes(1) # only read the frame at the current file position
zero = scan_frame(frame)
if zero:
print >> file2, 'dropout at second %s' % (file1.tell()/file1.getframerate())
pos = file1.tell() # States current file position
file1.setpos(pos + len(frame)) # or pos + 1, or whatever a single unit in a wave
# file is, I'm not entirely sure
file1.close()
file2.close()
Hope this can help!
I have a huge number of report files (about 650 files) which takes about 320 M of hard disk and I want to process them. There are a lot of entries in each file; I should count and log them based on their content. Some of them are related to each other and I should find, log and count them too; matches may be in different files. I have wrote a simple script to do the job. I used python profiler and it just took about 0.3 seconds to run the script for one single file with 2000 lines that we need half of them for processing. But for the whole directory it took 1 hour and a half to be done. This is how my script looks like:
# imports
class Parser(object):
def __init__(self):
# load some configurations
# open some log files
# set some initial values for some variables
def parse_packet(self, tags):
# extract some values from line
def found_matched(self, packet):
# search in the related list to find matched line
def save_packet(self, packet):
# write the line in the appropriate files and increase or decrease some counters
def parse(self, file_addr):
lines = [l for index, l in enumerate(open(file_addr, 'r').readlines()) if index % 2 != 0]
for line in lines:
packet = parse_packet(line)
if found_matched(packet):
# count
self.save_packet(packet)
def process_files(self):
if not os.path.isdir(self.src_dir):
self.log('No such file or directory: ' + str(self.src_dir))
sys.exit(1)
input_dirs = os.walk(self.src_dir)
for dname in input_dirs:
file_list = dname[2]
for fname in file_list:
self.parse(os.path.join(dname[0], fname))
self.finalize_process()
def finalize_process(self):
# closing files
I want to decrease the time at least to the 10% percent of current execution time. Maybe multiprocessing can help me or just some enhancement in current script will do the task. Anyway could you please help me in this?
Edit 1:
I have changed my code according to #Reut Sharabani's answer:
def parse(self, file_addr):
lines = [l for index, l in enumerate(open(file_addr, 'r').readlines()) if index % 2 != 0]
for line in lines:
packet = parse_packet(line)
if found_matched(packet):
# count
self.save_packet(packet)
def process_files(self):
if not os.path.isdir(self.src_dir):
self.log('No such file or directory: ' + str(self.src_dir))
sys.exit(1)
input_dirs = os.walk(self.src_dir)
for dname in input_dirs:
process_pool = multiprocessing.Pool(10)
for fname in file_list:
file_list = [os.path.join(dname[0], fname) for fname in dname[2]]
process_pool.map(self.parse, file_list)
self.finalize_process()
I also added below lines before my class definition to avoid PicklingError: Can't pickle <type 'instancemethod'>: attribute lookup
__builtin__.instancemethod failed:
import copy_reg
import types
def _pickle_method(m):
if m.im_self is None:
return getattr, (m.im_class, m.im_func.func_name)
else:
return getattr, (m.im_self, m.im_func.func_name)
copy_reg.pickle(types.MethodType, _pickle_method)
Another thing that I have done into my code was not to keep open log files during file processing; I open and close them for writing each entry just to avoid ValueError: I/O operation on closed file.
Now the problem is that I have some files which are being processed multiple times. I also got wrong counts for my packets. What did I do wrong? Should I put process_pool = multiprocessing.Pool(10) before the for loop? Consider that I have just one directory right now and it doesn't seem to be the problem.
EDIT 2:
I also tried using ThreadPoolExecutor this way:
with ThreadPoolExecutor(max_workers=10) as executor:
for fname in file_list:
executor.submit(self.parse, fname)
Results were correct, but it took an hour and a half to be completed.
First of all, "about 650 files which takes about 320 M" is not a lot. Given that modern hard disks easily read and write 100 MB/s, the I/O performance of your system probably is not your bottleneck (also supported by "it just took about 0.3 seconds to run the script for one single file with 2000 lines", which clearly indicates CPU-limitation). However, the exact way you are reading files from within Python may not be efficient.
Furthermore, a simple multiprocessing-based architecture, run on a common multi core system, will allow you to perform your analysis much faster (no need to involve celery here, no need to cross machine boundaries).
multiprocessing architecture
Just have a look at multiprocessing, your architecture likely will involve one manager process (the parent), which defines a task Queue, and a Pool of worker processes. The manager (or feeder) puts tasks (e.g. file names) into the queue, and the workers consume these. After finishing with a task, a worker lets the manager know, and proceeds consuming the next one.
file processing method
This is quite inefficient:
lines = [l for index, l in enumerate(open(file_addr, 'r').readlines()) if index % 2 != 0]
for line in lines:
...
readlines() reads the entire file before the list comprehension is evaluated. Only after that you again iterate through all lines. Hence, you iterate three times through your data. Combine everything into a single loop, so that you iterate the lines only once.
You should be using threads here. If you're blocked by cpu later, you can use processes.
To explain I first created a ten thousand files (0.txt ... 9999.txt), with a line count that's equivalent to the name (+1), using this command:
for i in `seq 0 999`; do for j in `seq 0 $i`; do echo $i >> $i.txt; done ; done
Next, I've created a python script using a ThreadPool with 10 threads to count the lines of all files that have an even value:
#!/usr/bin/env python
from multiprocessing.pool import ThreadPool
import time
import sys
print "creating %s threads" % sys.argv[1]
thread_pool = ThreadPool(int(sys.argv[1]))
files = ["%d.txt" % i for i in range(1000)]
def count_even_value_lines(filename):
with open(filename, 'r') as f:
# do some processing
line_count = 0
for line in f.readlines():
if int(line.strip()) % 2 == 0:
line_count += 1
print "finished file %s" % filename
return line_count
start = time.time()
print sum(thread_pool.map(count_even_value_lines, files))
total = time.time() - start
print total
As you can see this takes no time, and the results are correct. 10 files are processed in parallel and the cpu is fast enough to handle the results. If you want even more you may consider using threads and processes to utilize all cpus as well as not letting IO block you.
Edit:
As comments suggest, I was wrong and this is not I/O blocked, so you can speed it up using multiprocessing (cpu blocked). Because I used a ThreadPool which has the same interface as Pool you can make minimal edits and have the same code running:
#!/usr/bin/env python
import multiprocessing
import time
import sys
files = ["%d.txt" % i for i in range(2000)]
# function has to be defined before pool is opened and workers are forked
def count_even_value_lines(filename):
with open(filename, 'r') as f:
# do some processing
line_count = 0
for line in f:
if int(line.strip()) % 2 == 0:
line_count += 1
return line_count
print "creating %s processes" % sys.argv[1]
process_pool = multiprocessing.Pool(int(sys.argv[1]))
start = time.time()
print sum(process_pool.map(count_even_value_lines, files))
total = time.time() - start
print total
Results:
me#EliteBook-8470p:~/Desktop/tp$ python tp.py 1
creating 1 processes
25000000
21.2642059326
me#EliteBook-8470p:~/Desktop/tp$ python tp.py 10
creating 10 processes
25000000
12.4360249043
Aside from using parallel processing, your parse method is rather inefficient as #Jan-PhilipGehrcke already pointed out. To expand on his recommendation: The classical variant:
def parse(self, file_addr):
with open(file_addr, 'r') as f:
line_no = 0
for line in f:
line_no += 1
if line_no % 2 != 0:
packet = parse_packet(line)
if found_matched(packet):
# count
self.save_packet(packet)
Or using your style (assuming you use python 3):
def parse(self, file_addr):
with open(file_addr, 'r') as f:
filtered = (l for index,l in enumerate(f) if index % 2 != 0)
for line in filtered:
# and so on
The thing to notice here, is the use of iterators, all operations to build the filtered list (which is not actually a list!!) operate on and return iterators, which means that at no point the entire file is loaded into a list.
[newbie question]
Hi,
I'm working on a huge text file which is well over 30GB.
I have to do some processing on each line and then write it to a db in JSON format. When I read the file and loop using "for" my computer crashes and displays blue screen after about 10% of processing data.
Im currently using this:
f = open(file_path,'r')
for one_line in f.readlines():
do_some_processing(one_line)
f.close()
Also how can I show overall progress of how much data has been crunched so far ?
Thank you all very much.
File handles are iterable, and you should probably use a context manager. Try this:
with open(file_path, 'r') as fh:
for line in fh:
process(line)
That might be enough.
I use a function like this for a similiar problem. You can wrap up any iterable with it.
Change this
for one_line in f.readlines():
You just need to change your code to
# don't use readlines, it creates a big list of all data in memory rather than
# iterating one line at a time.
for one_line in in progress_meter(f, 10000):
You might want to pick a smaller or larger value depending on how much time you want to waste printing status messages.
def progress_meter(iterable, chunksize):
""" Prints progress through iterable at chunksize intervals."""
scan_start = time.time()
since_last = time.time()
for idx, val in enumerate(iterable):
if idx % chunksize == 0 and idx > 0:
print idx
print 'avg rate', idx / (time.time() - scan_start)
print 'inst rate', chunksize / (time.time() - since_last)
since_last = time.time()
print
yield val
Using readline imposes to find the end of each line in your file. If some lines are very long, it might lead your interpreter to crash (not enough memory to buffer the full line).
In order to show progress you can check the file size for example using:
import os
f = open(file_path, 'r')
fsize = os.fstat(f).st_size
The progress of your task can then be the number of bytes processed divided by the file size times 100 to have a percentage.