I am downloading files over HTTP and displaying the progress using urllib and the following code, which works fine:
import sys
from urllib import urlretrieve

def dlProgress(count, blockSize, totalSize):
    percent = int(count*blockSize*100/totalSize)
    sys.stdout.write("\r" + "progress" + "...%d%%" % percent)
    sys.stdout.flush()

urlretrieve('http://example.com/file.zip', '/tmp/localfile', reporthook=dlProgress)
Now I would also like to restart the download if it is going too slow (say less than 1MB in 15 seconds). How can I achieve this?
This should work.
It calculates the actual download rate and aborts if it is too low.
import sys
from urllib import urlretrieve
import time

url = "http://www.python.org/ftp/python/2.7.3/Python-2.7.3.tgz"  # 14,135,620 bytes
startTime = time.time()

class TooSlowException(Exception):
    pass

def convertBToMb(bytes):
    """converts Bytes to Megabytes"""
    bytes = float(bytes)
    megabytes = bytes / 1048576
    return megabytes

def dlProgress(count, blockSize, totalSize):
    global startTime
    alreadyLoaded = count*blockSize
    timePassed = time.time() - startTime
    transferRate = convertBToMb(alreadyLoaded) / timePassed  # MB per second
    transferRate *= 60  # MB per minute
    percent = int(alreadyLoaded*100/totalSize)
    sys.stdout.write("\r" + "progress" + "...%d%%" % percent)
    sys.stdout.flush()
    if transferRate < 4 and timePassed > 2:  # the download is slow at the beginning, hence wait 2 seconds
        print "\ndownload too slow! retrying..."
        time.sleep(1)  # let's not hammer the server
        raise TooSlowException

def main():
    try:
        urlretrieve(url, '/tmp/localfile', reporthook=dlProgress)
    except TooSlowException:
        global startTime
        startTime = time.time()
        main()

if __name__ == "__main__":
    main()
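A side note on this approach: every retry re-enters main() recursively, so a connection that stays slow for a long time will eventually hit Python's recursion limit. A minimal sketch of an iterative variant, assuming the same url, dlProgress, and TooSlowException as above (MAX_RETRIES is a hypothetical cap, not anything from the original):

MAX_RETRIES = 10  # hypothetical cap on restarts

def main():
    global startTime
    for attempt in range(MAX_RETRIES):
        startTime = time.time()
        try:
            urlretrieve(url, '/tmp/localfile', reporthook=dlProgress)
            return  # finished successfully
        except TooSlowException:
            print "\nretry %d of %d..." % (attempt + 1, MAX_RETRIES)
    print "\ngiving up after %d slow attempts" % MAX_RETRIES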
Something like this:
import signal
import time

class Timeout(Exception):
    pass

def try_one(func, t=3):
    def timeout_handler(signum, frame):
        raise Timeout()

    old_handler = signal.signal(signal.SIGALRM, timeout_handler)
    signal.alarm(t)  # trigger alarm in t seconds
    try:
        t1 = time.clock()
        func()
        t2 = time.clock()
    except Timeout:
        print('{} timed out after {} seconds'.format(func.__name__, t))
        return None
    finally:
        signal.signal(signal.SIGALRM, old_handler)
        signal.alarm(0)
    return t2 - t1
Then call try_one with the func you want to time out and the timeout in seconds:
try_one(downloader,15)
OR, you can do this:
import socket
socket.setdefaulttimeout(15)
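A minimal sketch of how that default timeout could combine with a retry loop (the URL, the retry cap, and the 15-second value are assumptions, not anything urllib prescribes; in Python 2.6+ the stalled read surfaces as socket.timeout, which the IOError clause below covers):

import socket
import urllib

socket.setdefaulttimeout(15)  # any socket read idle for 15s raises socket.timeout

for attempt in range(3):  # hypothetical retry cap
    try:
        urllib.urlretrieve('http://example.com/file.zip', '/tmp/localfile')
        break  # download finished
    except IOError:  # socket.timeout is an IOError subclass in Python 2.6+
        print "attempt %d stalled, retrying..." % (attempt + 1)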
HolyMackerel! Use the tools!
import urllib2, sys, socket, time, os

def url_tester(url="http://www.python.org/ftp/python/2.7.3/Python-2.7.3.tgz"):
    file_name = url.split('/')[-1]
    u = urllib2.urlopen(url, None, 1)  # Note the timeout to urllib2...
    file_size = int(u.info().getheaders("Content-Length")[0])
    print ("\nDownloading: {} Bytes: {:,}".format(file_name, file_size))
    with open(file_name, 'wb') as f:
        file_size_dl = 0
        block_sz = 1024*4
        time_outs = 0
        status = ''  # so the final line can be blanked even if nothing was received
        while True:
            try:
                buffer = u.read(block_sz)
            except socket.timeout:
                if time_outs > 3:  # file has not had activity in max seconds...
                    print "\n\n\nsorry -- try back later"
                    os.unlink(file_name)
                    raise
                else:  # start counting time outs...
                    print "\nHmmm... little issue... I'll wait a couple of seconds"
                    time.sleep(3)
                    time_outs += 1
                    continue
            if not buffer:  # end of the download
                sys.stdout.write('\rDone!' + ' '*len(status) + '\n\n')
                sys.stdout.flush()
                break
            file_size_dl += len(buffer)
            f.write(buffer)
            status = '{:20,} Bytes [{:.2%}] received'.format(file_size_dl,
                                                             file_size_dl * 1.0 / file_size)
            sys.stdout.write('\r' + status)
            sys.stdout.flush()
    return file_name
This prints a status as expected. If I unplug my ethernet cable, I get:
Downloading: Python-2.7.3.tgz Bytes: 14,135,620
827,392 Bytes [5.85%] received
sorry -- try back later
If I unplug the cable, then plug it back in within 12 seconds, I get:
Downloading: Python-2.7.3.tgz Bytes: 14,135,620
716,800 Bytes [5.07%] received
Hmmm... little issue... I'll wait a couple of seconds
Hmmm... little issue... I'll wait a couple of seconds
Done!
The file is successfully downloaded.
You can see that urllib2 gives you timeouts, and the read loop above layers simple reconnect handling on top of them. If you disconnect and stay disconnected for 3 * 4 seconds == 12 seconds, it will time out for good and raise a fatal exception. This could be dealt with as well.
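For instance, a minimal sketch of dealing with that fatal case by restarting the whole transfer from scratch (persistent_download is a hypothetical wrapper around url_tester above, and the restart cap is arbitrary):

def persistent_download(url, max_restarts=5):  # hypothetical wrapper
    for restart in range(max_restarts):
        try:
            return url_tester(url)
        except socket.timeout:
            print "\nrestarting download (%d/%d)..." % (restart + 1, max_restarts)
    raise RuntimeError("download kept stalling -- giving up")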
Related
I've written a Python script that will download files from a website. To speed it up, I've made the downloading of the files multithreaded. Obviously, this is faster than doing the downloads serially, but I've come across some effects that I cannot explain.
The first x files downloaded (x seems proportional to the number of threads created) are incredibly fast, with the output showing upwards of 40 files per second, but after that it slows down a lot.
Up to a point (near 200 threads), the maximum speed at which I can download files averages 10 files per second. If I increase the thread count to, say, 700, it still maxes out at 10 files per second. Increasing the thread count to a very large number (over 1,000) seems to limit the download speed based on CPU speed.
So, my questions are:
Why are the first files downloaded so much faster than the rest, and can I maintain the original speed?
Why does the thread count have such diminishing returns for download speeds?
Here is my script:
#!/usr/bin/python
import inspect
import math
from queue import Queue
from urllib.request import ProxyHandler, build_opener
from ast import literal_eval
from time import time, sleep
from datetime import timedelta
import random
from threading import Thread, activeCount
import os

proxies = Queue()
threads = Queue()
agents = []
total_files = 0
finished_files = 0
downloaded_files = 0
start_time = 0

class Config(object):
    DEBUG = False
    PROXIES_PATH = '/home/shane/bin/proxies.txt'
    AGENTS_PATH = '/home/shane/bin/user-agents.txt'
    DESTINATION_PATH = '/home/shane/images/%d.jpg'
    SOURCE_URL = 'https://example.org/%d.jpg'
    MAX_THREADS = 500
    TIMEOUT = 62
    RETRIES = 1
    RETRIES_TIME = 1
def get_files_per_second():
    return float(downloaded_files) / (time() - start_time)

def get_time_remaining():
    delta = timedelta(seconds=float(total_files - finished_files) / get_files_per_second())
    seconds = delta.total_seconds()
    days, remainder = divmod(seconds, 86400)
    hours, remainder = divmod(remainder, 3600)
    minutes, seconds = divmod(remainder, 60)
    days = str(int(days)).zfill(2)
    hours = str(int(hours)).zfill(2)
    minutes = str(int(minutes)).zfill(2)
    seconds = str(int(seconds)).zfill(2)
    return "%s:%s:%s:%s" % (days, hours, minutes, seconds)

def release_proxy(opener):
    if Config.DEBUG:
        print('Releasing proxy')
    for handler in opener.handlers:
        if type(handler) is ProxyHandler:
            proxies.put(handler)
            return
    raise Exception('No proxy found')

def get_new_proxy():
    if Config.DEBUG:
        print('Getting new proxy')
    if proxies.empty():
        raise Exception('No proxies')
    return proxies.get()

def get_new_agent():
    if len(agents) == 0:
        raise Exception('No user agents')
    return random.choice(agents)

def get_new_opener():
    opener = build_opener(get_new_proxy())
    opener.addheaders = [('User-Agent', get_new_agent())]
    return opener
def download(opener, source, destination, tries=0):
    global finished_files, downloaded_files
    if Config.DEBUG:
        print('Downloading %s to %s' % (source, destination))
    try:
        result = opener.open(source, timeout=Config.TIMEOUT).read()
        with open(destination, 'wb') as d:
            d.write(result)
        release_proxy(opener)
        finished_files += 1
        downloaded_files += 1
        to_print = '(%d/%d files) (%d proxies) (%f files/second, %s left) (%d threads) %s'
        print(to_print % (finished_files, total_files, proxies.qsize(), round(get_files_per_second(), 2), get_time_remaining(), activeCount(), source))
    except Exception as e:
        if Config.DEBUG:
            print(e)
        if tries < Config.RETRIES:
            sleep(Config.RETRIES_TIME)
            download(opener, source, destination, tries + 1)
        else:
            if proxies.qsize() < Config.MAX_THREADS * 2:
                release_proxy(opener)
            download(get_new_opener(), source, destination, 0)

class Downloader(Thread):
    def __init__(self, source, destination):
        Thread.__init__(self)
        self.source = source
        self.destination = destination

    def run(self):
        if Config.DEBUG:
            print('Running thread')
        download(get_new_opener(), self.source, self.destination)
        if threads.qsize() > 0:
            threads.get().start()
def populate_proxies():
    if Config.DEBUG:
        print('Populating proxies')
    with open(Config.PROXIES_PATH, 'r') as fh:
        for line in fh:
            line = line.replace('\n', '')
            if Config.DEBUG:
                print('Adding %s to proxies' % line)
            proxies.put(ProxyHandler(literal_eval(line)))

def populate_agents():
    if Config.DEBUG:
        print('Populating agents')
    with open(Config.AGENTS_PATH, 'r') as fh:
        for line in fh:
            line = line.replace('\n', '')
            if Config.DEBUG:
                print('Adding %s to agents' % line)
            agents.append(line)

def populate_threads():
    global total_files, finished_files
    if Config.DEBUG:
        print('Populating threads')
    for x in range(0, 100000):
        # the original assigned SOURCE_URL to `destination` and then used an
        # undefined `source`; these two assignments are the apparent intent
        source = Config.SOURCE_URL % x
        destination = Config.DESTINATION_PATH % x
        total_files += 1  # presumably why total_files is declared global here
        # queue threads
        print('Queueing %s' % destination)
        threads.put(Downloader(source, destination))

def start_work():
    global start_time
    if threads.qsize() == 0:
        raise Exception('No work to be done')
    start_time = time()
    for x in range(0, min(threads.qsize(), Config.MAX_THREADS)):
        if Config.DEBUG:
            print('Starting thread %d' % x)
        threads.get().start()

populate_proxies()
populate_agents()
populate_threads()
start_work()
The number of threads you are using is very high. Python does not actually run threads in parallel (the GIL lets only one thread execute Python bytecode at a time); it just switches between them frequently, which merely looks like parallelism.
If the task is CPU-intensive, use multiprocessing; if the task is I/O-bound, threads are useful.
Keep the thread count low (roughly 10-70 on a normal quad-core PC with 8 GB of RAM), otherwise the switching overhead will slow your code down.
Check these two links:
Stack Overflow question
Executive Summary on this page.
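If hand-rolling 500 Thread objects turns out to be part of the problem, a minimal sketch of the same idea with a capped pool via concurrent.futures (the URL list, timeout, and worker count are placeholders, not values from your script):

from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

def fetch(url):
    # one I/O-bound task; the GIL is released while waiting on the network
    with urlopen(url, timeout=10) as resp:
        return url, len(resp.read())

urls = ['https://example.org/%d.jpg' % i for i in range(100)]  # placeholder list
with ThreadPoolExecutor(max_workers=32) as pool:  # modest pool instead of 500+ threads
    for url, size in pool.map(fetch, urls):
        print('%s: %d bytes' % (url, size))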
I've written some code that is supposed to receive images and convert them to black & white. I'm measuring the response time for each task (response time = the time between when an image is received and when it has been turned black & white). Here is the code:
from __future__ import print_function
import signal
signal.signal(signal.SIGINT, signal.SIG_DFL)
from select import select
import socket
from struct import pack
from struct import unpack
#from collections import deque
import commands
from PIL import Image
import time

host = commands.getoutput("hostname -I")
port = 5005
backlog = 5
BUFSIZE = 4096

queueList = []
start = []
end = []
temp = []

def processP(q):
    i = 0
    while q:
        name = q.pop(0)
        col = Image.open(name)
        gray = col.convert('L')
        bw = gray.point(lambda x: 0 if x < 128 else 255, '1')
        bw.save("%d+30.jpg" % (i+1))
        end.append(time.time())
        #print(temp)
        i = i + 1
class Receiver:
    ''' Buffer binary data from socket conn '''
    def __init__(self, conn):
        self.conn = conn
        self.buff = bytearray()

    def get(self, size):
        ''' Get size bytes from the buffer, reading
            from conn when necessary
        '''
        while len(self.buff) < size:
            data = self.conn.recv(BUFSIZE)
            if not data:
                break
            self.buff.extend(data)
        # Extract the desired bytes
        result = self.buff[:size]
        # and remove them from the buffer
        del self.buff[:size]
        return bytes(result)

    def save(self, fname):
        ''' Save the remaining bytes to file fname '''
        with open(fname, 'wb') as f:
            if self.buff:
                f.write(bytes(self.buff))
            while True:
                data = self.conn.recv(BUFSIZE)
                if not data:
                    break
                f.write(data)
def read_tcp(s):
    conn, addr = s.accept()
    print('Connected with', *addr)
    # Create a buffer for this connection
    receiver = Receiver(conn)
    # Get the length of the file name
    name_size = unpack('B', receiver.get(1))[0]
    name = receiver.get(name_size).decode()
    # Save the file
    receiver.save(name)
    conn.close()
    print('saved\n')
    queueList.append(name)
    print('name', name)
    start.append(time.time())
    if name == "sample.jpg":
        print('------------ok-------------')
        processP(queueList)
        print("Start: ", start)
        print('--------------------------')
        print("End: ", end)
        while start:
            temp.append(end.pop(0) - start.pop(0))
        print('****************************')
        print("Start: ", start)
        print('--------------------------')
        print("End: ", end)
        print("Temp: ", temp)
        i = 0
        while i < len(temp)-1:
            if temp[i] < temp[i+1]:
                print('yes')
            else:
                print('No')
            i = i + 1
def read_udp(s):
    data, addr = s.recvfrom(1024)
    print("received message:", data)

def run():
    # create tcp socket
    tcp = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    tcp.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    try:
        tcp.bind((host, port))
    except socket.error as err:
        print('Bind failed', err)
        return
    tcp.listen(1)
    # create udp socket
    udp = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)  # UDP
    udp.bind((host, port))
    print('***Socket now listening at***:', host, port)
    input = [tcp, udp]
    try:
        while True:
            inputready, outputready, exceptready = select(input, [], [])
            for s in inputready:
                if s == tcp:
                    read_tcp(s)
                elif s == udp:
                    read_udp(s)
                else:
                    print("unknown socket:", s)
    # Hit Break / Ctrl-C to exit
    except KeyboardInterrupt:
        print('\nClosing')
        raise
    finally:
        # in the original these two close() calls sat after the bare raise
        # and could never run; a finally block makes them unconditional
        tcp.close()
        udp.close()

if __name__ == '__main__':
    run()
Now for some evaluation purposes, I send a single image many times. When I look at the response times I see that sometimes the response time of the 8th image, for example, is more than the response time of the 9th one.
So my question is: since the size of each image and the time needed to process it are the same (I'm sending a single image several times), why does the response time vary from image to image? Shouldn't the response time of the next image be longer than (or at least equal to) that of the previous one (for example, the response time for the 4th image > the response time for the 3rd image)?
Your list contains the actual elapsed time it took for each image processing call. This value will be influenced by many things, including the amount of load on the system at that time.
When your program is running, it does not have exclusive access to all of the resources (CPU, RAM, disk) of the system it's running on. There could be dozens, hundreds, or thousands of other processes managed by the OS vying for resources. Given this, it is highly unlikely that you would ever see even the same image processed in the exact same amount of time between two runs when you are measuring with sub-second accuracy. The amount of time it takes can (and will) go up and down with each successive call.
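If you want a stable per-image figure despite that noise, one common trick is to time the same call several times and take the minimum, since the fastest run is the one least disturbed by background load. A minimal sketch (measure and best_of are hypothetical helpers, not part of your script):

import time

def measure(func, *args, **kwargs):
    # time a single call
    t0 = time.time()
    func(*args, **kwargs)
    return time.time() - t0

def best_of(n, func, *args, **kwargs):
    # the minimum over n runs is the least affected by background load
    return min(measure(func, *args, **kwargs) for _ in range(n))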
I'm reading XML events with the requests library as shown in the code below. How do I raise a connection-lost error once the request has started? The server emulates HTTP push / long polling (http://en.wikipedia.org/wiki/Push_technology#Long_polling) and will not end the response by default.
If there is no new message after 10 minutes, the while loop should be exited.
import requests
from time import time

if __name__ == '__main__':
    #: Set a default content-length
    content_length = 512
    try:
        requests_stream = requests.get('http://agent.mtconnect.org:80/sample?interval=0', stream=True, timeout=2)
        while True:
            start_time = time()
            #: Read three lines to determine the content-length
            for line in requests_stream.iter_lines(3, decode_unicode=None):
                if line.startswith('Content-length'):
                    content_length = int(''.join(x for x in line if x.isdigit()))
                    #: pause the generator
                    break
            #: Continue the generator and read the exact amount of the body.
            for xml in requests_stream.iter_content(content_length):
                print "Received XML document with content length of %s in %s seconds" % (len(xml), time() - start_time)
                break
    except requests.exceptions.RequestException as e:
        print('error: ', e)
The server push could be tested with curl via command line:
curl http://agent.mtconnect.org:80/sample\?interval\=0
This might not be the best method, but you can use multiprocessing to run the requests in a separate process.
Something like this should work:
import multiprocessing
import requests
import time

class RequestClient(multiprocessing.Process):
    def run(self):
        # Write all your code to process the requests here
        content_length = 512
        try:
            requests_stream = requests.get('http://agent.mtconnect.org:80/sample?interval=0', stream=True, timeout=2)
            start_time = time.time()
            for line in requests_stream.iter_lines(3, decode_unicode=None):
                if line.startswith('Content-length'):
                    content_length = int(''.join(x for x in line if x.isdigit()))
                    break
            for xml in requests_stream.iter_content(content_length):
                print "Received XML document with content length of %s in %s seconds" % (len(xml), time.time() - start_time)
                break
        except requests.exceptions.RequestException as e:
            print('error: ', e)

while True:
    childProcess = RequestClient()
    childProcess.start()
    # Wait for 10mins
    start_time = time.time()
    while time.time() - start_time <= 600:
        # Check if the process is still active
        if not childProcess.is_alive():
            # Request completed
            break
        time.sleep(5)  # Give the system some breathing time
    # Check if the process is still active after 10mins.
    if childProcess.is_alive():
        # Shutdown the process
        childProcess.terminate()
        raise RuntimeError("Connection Timed-out")
Not the perfect code for your problem, but you get the idea.
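For what it's worth, newer versions of requests can express this more directly with a (connect, read) timeout tuple: the read timeout fires whenever the server goes quiet for that long between bytes, which is close to what a lost connection looks like on a long-polling stream. A minimal sketch, where the 600-second idle limit is taken from the question and everything else is an assumption:

import requests

try:
    # connect timeout of 2s; the read timeout fires after 600s of silence between bytes
    stream = requests.get('http://agent.mtconnect.org:80/sample?interval=0',
                          stream=True, timeout=(2, 600))
    for chunk in stream.iter_content(chunk_size=512):
        pass  # process the XML fragments here
except requests.exceptions.Timeout:
    raise RuntimeError("no data for 10 minutes - connection considered lost")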
I have implemented a multiprocessing downloader.
How can I print status bars (completion rate, download speed) that refresh automatically, each on its own part of the terminal?
Like this:
499712 [6.79%] 68k/s // keep refreshing
122712 [16.79%] 42k/s // different process/thread
99712 [56.32%] 10k/s
code:
def download(self):  # signature elided in the original post; uses attributes set elsewhere
    ...
    f = open(tmp_file_path, 'wb')
    print "Downloading: %s Bytes: %s" % (self.file_name, self.file_size)
    file_size_dl = 0
    block_sz = 8192
    start_time = time.time()
    while True:
        buffer = self.opening.read(block_sz)
        if not buffer:
            break
        file_size_dl += len(buffer)
        f.write(buffer)
        end_time = time.time()
        cost_time = end_time - start_time
        if cost_time == 0:
            cost_time = 1
        status = "\r%10d [%3.2f%%] %3dk/s" % (file_size_dl,
                                              file_size_dl * 100. / self.file_size,
                                              file_size_dl / 1024. / cost_time)  # bytes -> kB/s; the original formula here did not match the "k/s" label
        print status,
        sys.stdout.flush()
    f.close()
DownloadProcess inherits from the Process class and triggers the download method.
I use a queue to store the URLs. Here is how the processes are started:
...
for i in range(3):
    t = DownloadProcess(queue)
    t.start()
for url in urls:
    queue.put(url)
queue.join()
Below is a demo that implements both multiprocessing and multithreading. To try one or the other, just uncomment the import lines at the top of the code. The main thread/process spawns the children that do the actual work and communicate their progress back over a queue; I highly recommend queues for inter-process/thread communication. The main thread then displays the progress and waits for all children to end execution before exiting itself.

If you have a progress bar on a single line, you can use the technique you already have of printing '\r' to move the cursor back to the start of the line. But if you want multi-line progress bars, you have to get a little fancier: I just cleared the screen each time I wanted to print the bars. The article console output on Unix in Python helped me a great deal in producing the code below; it shows both techniques. You can also give the curses library, part of the Python standard library, a shot. The question Multiline progress bars asks a similar thing.
code
import time, random, sys, collections
from multiprocessing import Process as Task, Queue
#from threading import Thread as Task
#from Queue import Queue

def download(status, filename):
    count = random.randint(5, 30)
    for i in range(count):
        status.put([filename, (i+1.0)/count])
        time.sleep(0.1)

def print_progress(progress):
    sys.stdout.write('\033[2J\033[H')  # clear screen and move cursor home
    for filename, percent in progress.items():
        bar = ('=' * int(percent * 20)).ljust(20)
        percent = int(percent * 100)
        sys.stdout.write("%s [%s] %s%%\n" % (filename, bar, percent))
    sys.stdout.flush()

def main():
    status = Queue()
    progress = collections.OrderedDict()
    workers = []
    for filename in ['test1.txt', 'test2.txt', 'test3.txt']:
        child = Task(target=download, args=(status, filename))
        child.start()
        workers.append(child)
        progress[filename] = 0.0
    while any(i.is_alive() for i in workers):
        time.sleep(0.1)
        while not status.empty():
            filename, percent = status.get()
            progress[filename] = percent
        print_progress(progress)
    print 'all downloads complete'

main()
I am developing a script to download online live-streaming videos.
My script:
import time
import urllib2
import logging

logger = logging.getLogger(__name__)  # stand-in for the script's logger

print "Recording video..."
response = urllib2.urlopen("streaming online video url")
filename = time.strftime("%Y%m%d%H%M%S", time.localtime()) + ".avi"
f = open(filename, 'wb')
video_file_size_start = 0
video_file_size_end = 1048576 * 7  # end in 7 mb
block_size = 1024
while True:
    try:
        buffer = response.read(block_size)
        if not buffer:
            break
        video_file_size_start += len(buffer)
        if video_file_size_start > video_file_size_end:
            break
        f.write(buffer)
    except Exception, e:
        logger.exception(e)
f.close()
The above script works fine for downloading 7 MB of video from live streaming content and storing it in *.avi files.
However, I would like to download just 10 seconds of video, regardless of file size, and store that in an avi file.
I tried different possibilities, but to no success.
Could anyone please share your knowledge here to fix my issue?
Thanks in advance.
I don't think there is any way of doing that without constantly analysing the video, which would be way too costly. So you could take a guess at how many MB you need and, once done, check that the result is long enough; if it's too long, just cut it. Instead of guessing, you could also build up some statistics of how much you need to retrieve. You could also replace the while True with:
start_time_in_seconds = time.time()
time_limit = 10
while time.time() - start_time_in_seconds < time_limit:
    ...
This should give you at least 10 seconds of video, unless connecting takes too much time (you would get less than 10 seconds then) or the server sends extra data for buffering (unlikely for live streams).
You can use the 'Content-Length' header to retrieve the video file size, if it exists:
video_file_size_end = response.info().getheader('Content-Length')
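Two caveats with that, as a minimal sketch: the header value arrives as a string (the script above compares the size against an int), and the header may be absent entirely on live streams:

content_length = response.info().getheader('Content-Length')  # None if absent
if content_length is not None:
    video_file_size_end = int(content_length)  # compare as a number, not a string
# otherwise keep the hard-coded 7 MB cap from the script above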
response.read() does not work here; response.iter_content() seems to do the trick.
import time
import requests

print("Recording video...")
filename = time.strftime("/tmp/" + "%Y%m%d%H%M%S", time.localtime()) + ".avi"
file_handle = open(filename, 'wb')
chunk_size = 1024
start_time_in_seconds = time.time()
time_limit = 10  # time in seconds, for recording
time_elapsed = 0
url = "http://demo.codesamplez.com/html5/video/sample"
with requests.Session() as session:
    response = session.get(url, stream=True)
    for chunk in response.iter_content(chunk_size=chunk_size):
        if time_elapsed > time_limit:
            break
        # to print time elapsed
        if int(time.time() - start_time_in_seconds) - time_elapsed > 0:
            time_elapsed = int(time.time() - start_time_in_seconds)
            print(time_elapsed, end='\r', flush=True)
        if chunk:
            file_handle.write(chunk)
file_handle.close()