I've written code that downloads a file from the internet and saves it to my computer.
To make it more efficient, I added multiprocessing so that I can download multiple files at the same time. It works; however, it keeps printing the progress bar I added again and again.
What I want is for the progress bars to be displayed once and keep updating, as they did before I added multiprocessing. My code to reproduce the problem is below.
from multiprocessing import Process
from alive_progress import alive_bar
import requests
import time
import os

def download(url):
    curr_dir = os.getcwd()
    x = requests.head(url)
    y = requests.head(x.headers['Location'])
    file_size = int(int(y.headers['content-length']) / 1024)
    chunk_size = 1024

    def compute():
        response = requests.get(url, stream=True)
        with open(curr_dir + '\\' + str(time.time()) + '.mp4', 'wb') as f:
            for chunk in response.iter_content(chunk_size=chunk_size):
                f.write(chunk)
                yield 1024

    with alive_bar(file_size, bar='classic2', spinner='classic') as bar:
        for i in compute():
            bar()
    print("Downloaded!")

if __name__ == '__main__':
    processess = []
    num_processess = 2
    # `links` is assumed to be a list of URLs defined elsewhere.
    for i in range(num_processess):
        process = Process(target=download, args=(links[i],))
        processess.append(process)
    for process in processess:
        process.start()
    for process in processess:
        process.join()
alive-progress doesn't support showing and updating multiple progress bars. You have to use another library, such as tqdm.
The following is an example of using tqdm for your scenario. The key points are to call tqdm.set_lock() so that the processes share a lock for synchronizing their output, and to control the position of each progress bar via the position argument of tqdm().
import multiprocessing
import tqdm

def download(url, id, tqdm_lock):
    ...
    tqdm.tqdm.set_lock(tqdm_lock)
    with tqdm.tqdm(total=file_size, position=id) as bar:
        for i in compute():
            bar.update(1)
        bar.clear()
    ...

if __name__ == '__main__':
    tqdm_lock = multiprocessing.RLock()
    processess = []
    num_processess = 2
    links = [...]
    for i in range(num_processess):
        process = multiprocessing.Process(target=download, args=(links[i], i, tqdm_lock))
        processess.append(process)
    for process in processess:
        process.start()
    for process in processess:
        process.join()
Update 2
If you want multiple progress bars, then I would use package tqdm.
This is how I would approach it:
1. First find out for each URL how many CHUNK_SIZE chunks there are. CHUNK_SIZE is set at 1024, but consider increasing this for large files. A potential issue is that the 'content-length' header key is not always present. In this case, the URL is considered to consist of a single chunk and the progress bar created will be updated only once, when the entire file has been downloaded.
2. Then each submitted task creates a progress bar whose size is the number of chunks retrieved in step 1 and which is assigned a specific position based on its task number. Then the chunks are retrieved and the progress bar is updated. The logic is predicated on the file being retrieved never varying in size when the content-length key is present in the fetched header. That is, the size of the file does not change between the head and get requests, so that the progress bar size set from the head request matches the actual number of chunks read when the download is done.
In the code below I have commented out specific code pertaining to the writing of downloaded files to disk and have gotten rid of the compute generator function, which now seems to be unnecessary. I have also added a delay between successive fetching of chunks so that the progress bar does not progress too fast:
import requests
from tqdm import tqdm

CHUNK_SIZE = 1024

def get_number_of_chunks(url):
    r = requests.head(url, allow_redirects=True)
    headers = r.headers
    if 'content-length' in headers:
        n_chunks, remainder = divmod(int(headers['content-length']), CHUNK_SIZE)
        if remainder:
            n_chunks += 1
    else:
        n_chunks = 1
    return n_chunks

def download(task_number, url):
    n_chunks = get_number_of_chunks(url)
    response = requests.get(url, stream=True)
    #with open(str(time.time()) + '.mp4', 'wb') as f:
    if True:
        with tqdm(total=n_chunks, position=task_number) as bar:
            for chunk in response.iter_content(chunk_size=CHUNK_SIZE):
                #f.write(chunk)
                if n_chunks != 1:
                    bar.update(1)
                # For demo purposes:
                import time
                time.sleep(.1)
            if n_chunks == 1:
                bar.update(1)

if __name__ == '__main__':
    from multiprocessing.pool import ThreadPool

    links = [
        'http://localhost/friends/images/nav.png',
        'http://localhost/friends/images/race.jpg',
    ]
    n_writers = len(links)
    pool = ThreadPool(n_writers)
    pool.starmap(download, enumerate(links))
    pool.close()
    pool.join()
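The chunk-count arithmetic from step 1 can be checked in isolation, without any network access. The following is a minimal sketch, not part of the original answer: count_chunks is a hypothetical helper that takes the raw header value (a string, or None when the header is missing) instead of a URL.

```python
def count_chunks(content_length, chunk_size=1024):
    """Return how many reads of chunk_size bytes cover content_length bytes.

    content_length is the raw 'content-length' header value (a string),
    or None when the server did not send the header. In that case we
    fall back to 1, matching the single-update behaviour described above.
    """
    if content_length is None:
        return 1
    n_chunks, remainder = divmod(int(content_length), chunk_size)
    # A non-zero remainder means one extra, partially filled chunk.
    return n_chunks + 1 if remainder else n_chunks


print(count_chunks("2048"))  # exact multiple of the chunk size
print(count_chunks("2049"))  # one byte over: needs an extra chunk
print(count_chunks(None))    # header missing
```

One caveat to keep in mind: if the server compresses the response, content-length describes the compressed size, so the number of decoded chunks yielded by iter_content may not match this count.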
Multiprocessing Version
If you must use multiprocessing, then thanks to relent95, who showed the way:
import requests
from tqdm import tqdm

CHUNK_SIZE = 1024

def init_pool_processes(lock):
    """
    Note: The lock only needs to be set once for each pool process.
    """
    tqdm.set_lock(lock)

def get_number_of_chunks(url):
    r = requests.head(url, allow_redirects=True)
    headers = r.headers
    if 'content-length' in headers:
        n_chunks, remainder = divmod(int(headers['content-length']), CHUNK_SIZE)
        if remainder:
            n_chunks += 1
    else:
        n_chunks = 1
    return n_chunks

def download(task_number, url):
    n_chunks = get_number_of_chunks(url)
    response = requests.get(url, stream=True)
    #with open(str(time.time()) + '.mp4', 'wb') as f:
    if True:
        with tqdm(total=n_chunks, position=task_number) as bar:
            for chunk in response.iter_content(chunk_size=CHUNK_SIZE):
                #f.write(chunk)
                if n_chunks != 1:
                    bar.update(1)
                # For demo purposes:
                import time
                time.sleep(.1)
            if n_chunks == 1:
                bar.update(1)

if __name__ == '__main__':
    from multiprocessing import Pool, Lock

    links = [
        'http://localhost/friends/images/nav.png',
        'http://localhost/friends/images/race.jpg',
    ]
    n_writers = len(links)
    pool = Pool(n_writers, initializer=init_pool_processes, initargs=(Lock(),))
    pool.starmap(download, enumerate(links))
    pool.close()
    pool.join()
Related
I have a function that streams a zip file into a byte buffer. From that buffer I create chunks of 5000 lines each, and I am now trying to write these chunks back to an S3 bucket as separate files. Since I am using AWS Lambda, I cannot let a single thread handle the whole workflow, because Lambda times out after 5 minutes. I come from a Java background, where threads are pretty simple to implement, but in Python I am confused about how to run a pool of threads to take care of the S3 upload part of my process. Here is my code:
import io
import zipfile
import boto3
import sys
import multiprocessing
# from multiprocessing.dummy import Pool as ThreadPool
import time

s3_client = boto3.client('s3')
s3 = boto3.resource('s3', 'us-east-1')

def stream_zip_file():
    # pool = ThreadPool(threads)
    start_time_main = time.time()
    start_time_stream = time.time()
    obj = s3.Object(
        bucket_name='monkey-business-dev-data',
        key='sample-files/daily/banana/large/banana.zip'
    )
    end_time_stream = time.time()
    # process_queue = multiprocessing.Queue()
    buffer = io.BytesIO(obj.get()["Body"].read())
    output = io.BytesIO()
    print (buffer)
    z = zipfile.ZipFile(buffer)
    foo2 = z.open(z.infolist()[0])
    print(sys.getsizeof(foo2))
    line_counter = 0
    file_clounter = 0
    for line in foo2:
        line_counter += 1
        output.write(line)
        if line_counter >= 5000:
            file_clounter += 1
            line_counter = 0
            # pool.map(upload_to_s3, (output, file_clounter))
            # upload_to_s3(output, file_clounter)
            # process_queue.put(output)
            output.close()
            output = io.BytesIO()
    if line_counter > 0:
        # process_queue.put(output)
        # upload_to_s3(output, file_clounter)
        # pool.map(upload_to_s3, args =(output, file_clounter))
        output.close()
    print('Total Files: {}'.format(file_clounter))
    print('Total Lines: {}'.format(line_counter))
    output.seek(0)
    start_time_upload = time.time()
    end_time_upload = time.time()
    output.close()
    z.close()
    end_time_main = time.time()
    print('''
    main: {}
    stream: {}
    upload: {}
    '''.format((end_time_main-start_time_main),(end_time_stream-start_time_stream),(end_time_upload-start_time_upload)))

def upload_to_s3(output, file_name):
    output.seek(0)
    s3_client.put_object(
        Bucket='monkey-business-dev-data', Key='sample-files/daily/banana/large/{}.txt'.format(file_name),
        ServerSideEncryption='AES256',
        Body=output,
        ACL='bucket-owner-full-control'
    )

# consumer_process = multiprocessing.Process(target=data_consumer, args=(process_queue))
# consumer_process.start()
#
#
# def data_consumer(queue):
# while queue.empty() is False:

if __name__ == '__main__':
    stream_zip_file()
I have tried several ways to do this. My specific requirement is a thread pool of 10 threads that always poll a queue: if a chunk is available on the queue, a thread picks it up and starts uploading it. Meanwhile the queue keeps being polled for new chunks, and when another chunk becomes available, a new thread (if thread 1 is still busy with its S3 upload) automatically starts and uploads that file to S3, and so on. I have checked many answers here and on Google, but nothing seems to work or make sense to my feeble mind.
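One way to get the "pool of 10 workers fed from a queue" behaviour described above is concurrent.futures.ThreadPoolExecutor, which keeps its own internal work queue. The following is a minimal sketch under stated assumptions, not a drop-in solution: upload_chunk is a hypothetical stand-in for upload_to_s3 (it only reports sizes, so the example runs without AWS credentials), and split_and_upload is an illustrative name.

```python
import io
from concurrent.futures import ThreadPoolExecutor

LINES_PER_CHUNK = 5000
MAX_WORKERS = 10


def upload_chunk(payload, file_number):
    """Hypothetical stand-in for upload_to_s3(); replace with put_object()."""
    return file_number, len(payload)


def split_and_upload(lines, upload=upload_chunk,
                     lines_per_chunk=LINES_PER_CHUNK, max_workers=MAX_WORKERS):
    """Group lines into chunks and hand each chunk to a worker thread."""
    futures = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        buffer = io.BytesIO()
        line_counter = 0
        file_counter = 0
        for line in lines:
            buffer.write(line)
            line_counter += 1
            if line_counter >= lines_per_chunk:
                file_counter += 1
                # Pass immutable bytes so a fresh buffer can be started safely.
                futures.append(pool.submit(upload, buffer.getvalue(), file_counter))
                buffer = io.BytesIO()
                line_counter = 0
        if line_counter > 0:
            # Upload the final, partially filled chunk.
            file_counter += 1
            futures.append(pool.submit(upload, buffer.getvalue(), file_counter))
    # Leaving the with-block waits for all uploads; result() re-raises errors.
    return [future.result() for future in futures]


if __name__ == '__main__':
    sample = [b'line %d\n' % i for i in range(12000)]
    results = split_and_upload(sample)
    print(len(results))
```

Threads suit this job because the uploads spend their time waiting on the network. Note that the executor's internal queue is unbounded, so if splitting is much faster than uploading, pending chunks accumulate in memory; that may matter within Lambda's memory limit.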
I am testing the fastest way to pass data between two processes. I have two processes: one writes data and one receives it. My script shows that writing to and reading from a file is faster than using a pipe. How can this happen? Isn't memory faster than disk?
Writing to and reading from a file:
#!/usr/bin/env python
# -*- coding:utf-8 -*-
from mutiprocesscomunicate import gen_data

data_size = 128 * 1024  # KB

def send_data_task(file_name):
    with open(file_name, 'wb+') as fd:
        for i in range(data_size):
            fd.write(gen_data(1))
            fd.write('\n'.encode('ascii'))
        # end EOF
        fd.write('EOF'.encode('ascii'))
    print('send done.')

def get_data_task(file_name):
    offset = 0
    fd = open(file_name, 'r+')
    i = 0
    while True:
        data = fd.read(1024)
        offset += len(data)
        if 'EOF' in data:
            fd.truncate()
            break
        if not data:
            fd.close()
            fd = None
            fd = open(file_name, 'r+')
            fd.seek(offset)
            continue
    print("recv done.")

if __name__ == '__main__':
    import multiprocessing
    pipe_out = pipe_in = 'throught_file'
    p = multiprocessing.Process(target=send_data_task, args=(pipe_out,), kwargs=())
    p1 = multiprocessing.Process(target=get_data_task, args=(pipe_in,), kwargs=())
    p.daemon = True
    p1.daemon = True
    import time
    start_time = time.time()
    p1.start()
    import time
    time.sleep(0.5)
    p.start()
    p.join()
    p1.join()
    import os
    os.sync()
    print('through file', data_size / (time.time() - start_time), 'KB/s')
    open(pipe_in, 'w+').truncate()
Using a pipe:
#!/usr/bin/env python
# -*- coding:utf-8 -*-
import multiprocessing
from mutiprocesscomunicate import gen_data

data_size = 128 * 1024  # KB

def send_data_task(pipe_out):
    for i in range(data_size):
        pipe_out.send(gen_data(1))
    # end EOF
    pipe_out.send("")
    print('send done.')

def get_data_task(pipe_in):
    while True:
        data = pipe_in.recv()
        if not data:
            break
    print("recv done.")

if __name__ == '__main__':
    pipe_out, pipe_in = multiprocessing.Pipe()
    p = multiprocessing.Process(target=send_data_task, args=(pipe_out,), kwargs=())
    p1 = multiprocessing.Process(target=get_data_task, args=(pipe_in,), kwargs=())
    p.daemon = True
    p1.daemon = True
    import time
    start_time = time.time()
    p1.start()
    p.start()
    p.join()
    p1.join()
    print('through pipe', data_size / (time.time() - start_time), 'KB/s')
The function that creates the data:
def gen_data(size):
    onekb = "a" * 1024
    return (onekb * size).encode('ascii')
result:
through file 110403.02025891568 KB/s
through pipe 75354.71358973449 KB/s
I use macOS with Python 3.
Update
If the data is just 1 KB, the pipe is 100 times faster than the file. But if the data is big, like 128 MB, the result is as shown above.
A pipe has a limited capacity, in order to match speeds of producer and consumer (via back pressure flow control) rather than consume an unlimited amount of memory. The particular limit on OS X, according to this Unix stack exchange answer, is 16KiB. As you're writing 128KiB, this means 8 times as many system calls (and context switches), at least. When working with files, the size is limited by your disk space or quota only, and without a fdatasync or similar, it won't need to make it to disk; it can be read again directly from cache. On the other hand, when your data is small, the time to find a place to put the file dominates leaving the pipe far faster.
When you do use fdatasync, or just exceed the available memory for disk caching, writing to disk also slows down to match actual disk transfer speeds.
Because quite often file data is first written into the page cache (which is in RAM) by the OS kernel.
I have a Python application which uses the threading and requests modules to process many pages. The basic function for downloading a page looks like this:
def get_page(url):
    error = None
    data = None
    max_page_size = 10 * 1024 * 1024
    try:
        s = requests.Session()
        s.max_redirects = 10
        s.keep_alive = False
        r = s.get('http://%s' % url if not url.startswith('http://') else url,
                  headers=headers, timeout=10.0, stream=True)
        raw_data = io.BytesIO()
        size = 0
        for chunk in r.iter_content(4096):
            size += len(chunk)
            raw_data.write(chunk)
            if size > max_page_size:
                r.close()
                raise SpyderError('too_large')
        fetch_result = 'ok'
    finally:
        del s
It works well in most cases, but sometimes the application freezes because of a very slow connection to some servers or other network problems. How can I set up a global, guaranteed timeout for the whole function? Should I use asyncio or coroutines?
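Because the function already streams the body in chunks, one option that needs neither asyncio nor coroutines is to check an overall deadline after every chunk. The timeout argument of requests limits the connection attempt and each individual wait for data, not the whole response. The following is a minimal sketch under stated assumptions: read_with_deadline and DownloadTimeout are hypothetical names, and the clock parameter exists only so the logic can be exercised without real waiting.

```python
import time


class DownloadTimeout(Exception):
    """Raised when the overall deadline passes (hypothetical name)."""


def read_with_deadline(chunks, total_timeout, max_size=10 * 1024 * 1024,
                       clock=time.monotonic):
    """Collect chunks until exhausted, the size limit, or the deadline."""
    deadline = clock() + total_timeout
    data = bytearray()
    for chunk in chunks:
        data.extend(chunk)
        if len(data) > max_size:
            raise ValueError('too_large')
        # Checked once per chunk, so the overshoot is bounded by how long
        # a single chunk read can block.
        if clock() > deadline:
            raise DownloadTimeout('download exceeded %.1f s' % total_timeout)
    return bytes(data)


if __name__ == '__main__':
    # Stand-in for r.iter_content(4096); no network is needed for the demo.
    fake_chunks = (b'x' * 4096 for _ in range(3))
    print(len(read_with_deadline(fake_chunks, total_timeout=30)))
```

In get_page this would be called as read_with_deadline(r.iter_content(4096), total_timeout=30) on a response opened with stream=True and a per-read timeout. It is not a hard guarantee: a stalled read still has to hit its own timeout before the deadline check runs.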
I am using requests to download files, but for large files I have to check the size of the file on disk every time, because I can't display the progress as a percentage. I would also like to know the download speed. How can I go about doing it? Here's my code:
import requests
import sys
import time
import os

def downloadFile(url, directory) :
    localFilename = url.split('/')[-1]
    r = requests.get(url, stream=True)
    start = time.clock()
    f = open(directory + '/' + localFilename, 'wb')
    for chunk in r.iter_content(chunk_size = 512 * 1024) :
        if chunk :
            f.write(chunk)
            f.flush()
            os.fsync(f.fileno())
    f.close()
    return (time.clock() - start)

def main() :
    if len(sys.argv) > 1 :
        url = sys.argv[1]
    else :
        url = raw_input("Enter the URL : ")
    directory = raw_input("Where would you want to save the file ?")
    time_elapsed = downloadFile(url, directory)
    print "Download complete..."
    # time_elapsed is a float, so convert it before concatenating.
    print "Time Elapsed: " + str(time_elapsed)

if __name__ == "__main__" :
    main()
I think one way to do it would be to read the file every time in the for loop and calculate the percentage of progress based on the header Content-Length. But that would be again an issue for large files(around 500MB). Is there any other way to do it?
See here: Python progress bar and downloads
I think the code would be something like this. It should show the average speed since the start, in bytes per second:
import requests
import sys
import time

def downloadFile(url, directory) :
    localFilename = url.split('/')[-1]
    with open(directory + '/' + localFilename, 'wb') as f:
        start = time.clock()
        r = requests.get(url, stream=True)
        total_length = r.headers.get('content-length')
        dl = 0
        if total_length is None: # no content length header
            f.write(r.content)
        else:
            # Header values are strings, so convert before doing arithmetic.
            total_length = int(total_length)
            for chunk in r.iter_content(1024):
                dl += len(chunk)
                f.write(chunk)
                done = int(50 * dl / total_length)
                sys.stdout.write("\r[%s%s] %s bps" % ('=' * done, ' ' * (50-done), dl//(time.clock() - start)))
            # Finish the progress line once the download is complete.
            print ''
    return (time.clock() - start)

def main() :
    if len(sys.argv) > 1 :
        url = sys.argv[1]
    else :
        url = raw_input("Enter the URL : ")
    directory = raw_input("Where would you want to save the file ?")
    time_elapsed = downloadFile(url, directory)
    print "Download complete..."
    # time_elapsed is a float, so convert it before concatenating.
    print "Time Elapsed: " + str(time_elapsed)

if __name__ == "__main__" :
    main()
An improved version of the accepted answer for Python 3, using io.BytesIO (writing to memory), with the result in Mbps and support for IPv4/IPv6, size and port arguments.
import sys, time, io, requests

def speed_test(size=5, ipv="ipv4", port=80):
    if size == 1024:
        size = "1GB"
    else:
        size = f"{size}MB"
    url = f"http://{ipv}.download.thinkbroadband.com:{port}/{size}.zip"
    with io.BytesIO() as f:
        start = time.perf_counter()
        r = requests.get(url, stream=True)
        total_length = r.headers.get('content-length')
        dl = 0
        if total_length is None: # no content length header
            f.write(r.content)
        else:
            for chunk in r.iter_content(1024):
                dl += len(chunk)
                f.write(chunk)
                done = int(30 * dl / int(total_length))
                sys.stdout.write("\r[%s%s] %s Mbps" % ('=' * done, ' ' * (30-done), dl//(time.perf_counter() - start) / 100000))
    print( f"\n{size} = {(time.perf_counter() - start):.2f} seconds")
Usage Examples:
speed_test()
speed_test(10)
speed_test(50, "ipv6")
speed_test(1024, port=8080)
Output Sample:
[==============================] 61.34037 Mbps
100MB = 17.10 seconds
Available Options:
size: 5, 10, 20, 50, 100, 200, 512, 1024
ipv: ipv4, ipv6
port: 80, 81, 8080
Updated on 20221011:
time.perf_counter() replaced time.clock(), which has been deprecated since Python 3.3 and was removed in Python 3.8 (kudos to shiro)
I had a problem downloading a big file from a specific slow server:
no Content-Length header,
big file (42GB),
no compression,
slow server (<1MB/s).
Because the file is this big, I also had a problem with memory usage during the request. Requests doesn't write the output to a file like urllib does; it looks like it keeps it in memory.
Without a Content-Length header, the accepted answer does no monitoring at all.
So I wrote this basic method to monitor the speed during the CSV download, following just the requests documentation.
It needs fname (the complete output path) and link (http or https), and you can specify custom headers.
import time
import requests

BLOCK=5*1024*1024
try:
    with open(fname, 'wb') as f:
        r = requests.get(link, headers=headers, stream=True)
        ## Done this way because the official documentation suggests it,
        ## saying it is more reliable than looping directly over iter_lines, so no data is lost
        lines = r.iter_lines()
        ## Init the base vars, for monitoring and block management
        ## obj is a byte object, because iter_lines returns bytes objects
        tsize = 0; obj = bytearray(); t0=time.time(); i=0;
        for line in lines:
            ## Calculate the line size, in bytes, and add it to the byte object
            tsize+=len(line)
            obj.extend(line)
            ## iter_lines() strips line terminators, so add one back
            ## (this assumes the file uses '\n' line endings)
            obj.extend(b'\n')
            ## When the block size is reached,
            if tsize > BLOCK:
                ## Increment the block number
                i+=1;
                ## Calculate the speed. This is in MB/s,
                ## but you can easily change it to KB/s, or blocks/s
                t1=time.time()
                t=t1-t0;
                speed=round(5/t, 2);
                ## Write the block to the file.
                f.write(obj)
                ## Write stats
                print('got', i*5, 'MB ', 'block' ,i, ' #', speed,'MB/s')
                ## Reinit all the base vars, for a new block
                obj=bytearray(); tsize=0; t0=time.time()
        ## Write the last partial block to the file.
        f.write(obj)
except Exception as e:
    print("Error: ", e, 0)
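The same per-block speed report can be built on raw chunks instead of lines, so that the bytes reach the file exactly as received. The following is a minimal sketch, not the author's method: copy_with_speed is a hypothetical helper, and its clock and report parameters are only there so it can be tried without a real download.

```python
import time


def copy_with_speed(chunks, out, block_size=5 * 1024 * 1024,
                    clock=time.monotonic, report=print):
    """Write chunks to out, reporting throughput once per block_size bytes."""
    total = 0      # bytes written so far
    in_block = 0   # bytes written since the last report
    t0 = clock()
    for chunk in chunks:
        # Write immediately, so memory use stays at one chunk.
        out.write(chunk)
        total += len(chunk)
        in_block += len(chunk)
        if in_block >= block_size:
            elapsed = clock() - t0
            mb = 1024 * 1024
            speed = in_block / elapsed / mb if elapsed > 0 else float('inf')
            report('got %.1f MB, %.2f MB/s' % (total / mb, speed))
            in_block = 0
            t0 = clock()
    return total


if __name__ == '__main__':
    import io
    # Stand-in for r.iter_content(64 * 1024) on a streamed response.
    fake_chunks = (b'x' * 1024 for _ in range(8))
    print(copy_with_speed(fake_chunks, io.BytesIO(), block_size=4096))
```

With a real response this would be called as copy_with_speed(r.iter_content(64 * 1024), f) inside the with open(fname, 'wb') block. No Content-Length header is needed, since only the bytes actually received are counted.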
When downloading a large file with python, I want to put a time limit not only for the connection process, but also for the download.
I am trying with the following python code:
import requests
r = requests.get('http://ipv4.download.thinkbroadband.com/1GB.zip', timeout = 0.5, prefetch = False)
print r.headers['content-length']
print len(r.raw.read())
This does not work (the download is not time limited), as correctly noted in the docs: https://requests.readthedocs.org/en/latest/user/quickstart/#timeouts
This would be great if it was possible:
r.raw.read(timeout = 10)
The question is, how to put a time limit to the download?
And the answer is: do not use requests, as it is blocking. Use non-blocking network I/O, for example eventlet:
import eventlet
from eventlet.green import urllib2
from eventlet.timeout import Timeout

url5 = 'http://ipv4.download.thinkbroadband.com/5MB.zip'
url10 = 'http://ipv4.download.thinkbroadband.com/10MB.zip'
urls = [url5, url5, url10, url10, url10, url5, url5]

def fetch(url):
    response = bytearray()
    with Timeout(60, False):
        response = urllib2.urlopen(url).read()
    return url, len(response)

pool = eventlet.GreenPool()
for url, length in pool.imap(fetch, urls):
    if (not length):
        print "%s: timeout!" % (url)
    else:
        print "%s: %s" % (url, length)
Produces expected results:
http://ipv4.download.thinkbroadband.com/5MB.zip: 5242880
http://ipv4.download.thinkbroadband.com/5MB.zip: 5242880
http://ipv4.download.thinkbroadband.com/10MB.zip: timeout!
http://ipv4.download.thinkbroadband.com/10MB.zip: timeout!
http://ipv4.download.thinkbroadband.com/10MB.zip: timeout!
http://ipv4.download.thinkbroadband.com/5MB.zip: 5242880
http://ipv4.download.thinkbroadband.com/5MB.zip: 5242880
When using Requests' prefetch=False parameter, you get to pull in arbitrary-sized chunks of the response at a time (rather than all at once).
What you'll need to do is tell Requests not to preload the entire request and keep your own time of how much you've spent reading so far, while fetching small chunks at a time. You can fetch a chunk using r.raw.read(CHUNK_SIZE). Overall, the code will look something like this:
import requests
import time

CHUNK_SIZE = 2**12  # Bytes
TIME_EXPIRE = time.time() + 5  # Seconds

r = requests.get('http://ipv4.download.thinkbroadband.com/1GB.zip', prefetch=False)
data = ''
buffer = r.raw.read(CHUNK_SIZE)
while buffer:
    data += buffer
    buffer = r.raw.read(CHUNK_SIZE)
    if TIME_EXPIRE < time.time():
        # Quit after 5 seconds.
        data += buffer
        break

r.raw.release_conn()
print "Read %s bytes out of %s expected." % (len(data), r.headers['content-length'])
Note that this might sometimes use a bit more than the 5 seconds allotted as the final r.raw.read(...) could lag an arbitrary amount of time. But at least it doesn't depend on multithreading or socket timeouts.
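Run the download in a thread and stop waiting for it if it has not finished on time. Note that join(TIMEOUT) only stops waiting; the thread itself keeps running in the background.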
import requests
import threading

URL='http://ipv4.download.thinkbroadband.com/1GB.zip'
TIMEOUT=0.5

def download(return_value):
    return_value.append(requests.get(URL))

return_value = []
download_thread = threading.Thread(target=download, args=(return_value,))
download_thread.start()
download_thread.join(TIMEOUT)
if download_thread.is_alive():
    print 'The download was not finished on time...'
else:
    print return_value[0].headers['content-length']