When downloading a large file with Python, I want to put a time limit not only on the connection process, but also on the download itself.
I am trying the following Python code:
import requests
r = requests.get('http://ipv4.download.thinkbroadband.com/1GB.zip', timeout = 0.5, prefetch = False)
print r.headers['content-length']
print len(r.raw.read())
This does not work (the download is not time limited), as correctly noted in the docs: https://requests.readthedocs.org/en/latest/user/quickstart/#timeouts
This would be great if it was possible:
r.raw.read(timeout = 10)
The question is, how to put a time limit to the download?
And the answer is: do not use requests, as it is blocking. Use non-blocking network I/O, for example eventlet:
import eventlet
from eventlet.green import urllib2
from eventlet.timeout import Timeout
url5 = 'http://ipv4.download.thinkbroadband.com/5MB.zip'
url10 = 'http://ipv4.download.thinkbroadband.com/10MB.zip'
urls = [url5, url5, url10, url10, url10, url5, url5]
def fetch(url):
    response = bytearray()
    with Timeout(60, False):
        response = urllib2.urlopen(url).read()
    return url, len(response)
pool = eventlet.GreenPool()
for url, length in pool.imap(fetch, urls):
    if (not length):
        print "%s: timeout!" % (url)
    else:
        print "%s: %s" % (url, length)
Produces expected results:
http://ipv4.download.thinkbroadband.com/5MB.zip: 5242880
http://ipv4.download.thinkbroadband.com/5MB.zip: 5242880
http://ipv4.download.thinkbroadband.com/10MB.zip: timeout!
http://ipv4.download.thinkbroadband.com/10MB.zip: timeout!
http://ipv4.download.thinkbroadband.com/10MB.zip: timeout!
http://ipv4.download.thinkbroadband.com/5MB.zip: 5242880
http://ipv4.download.thinkbroadband.com/5MB.zip: 5242880
With Requests' prefetch=False parameter, you can pull in arbitrary-sized chunks of the response at a time (rather than all at once).
What you'll need to do is tell Requests not to preload the entire response, fetch small chunks at a time, and keep your own count of how much time you've spent reading so far. You can fetch a chunk using r.raw.read(CHUNK_SIZE). Overall, the code will look something like this:
import requests
import time
CHUNK_SIZE = 2**12 # Bytes
TIME_EXPIRE = time.time() + 5 # Seconds
r = requests.get('http://ipv4.download.thinkbroadband.com/1GB.zip', prefetch=False)
data = ''
buffer = r.raw.read(CHUNK_SIZE)
while buffer:
    data += buffer
    buffer = r.raw.read(CHUNK_SIZE)
    if TIME_EXPIRE < time.time():
        # Quit after 5 seconds.
        data += buffer
        break
r.raw.release_conn()
print "Read %s bytes out of %s expected." % (len(data), r.headers['content-length'])
Note that this might sometimes use a bit more than the 5 seconds allotted as the final r.raw.read(...) could lag an arbitrary amount of time. But at least it doesn't depend on multithreading or socket timeouts.
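The same chunk-and-check idea can be written independently of how the chunks are produced. A minimal sketch, assuming Python 3: read_with_deadline and its clock parameter are hypothetical names (not part of Requests), and the function accepts any iterable of byte chunks, so it can be tried without a network. slow_chunks is only a stand-in for a slow download.

```python
import time


def read_with_deadline(chunks, time_limit, clock=time.monotonic):
    """Collect byte chunks until the iterable ends or time_limit seconds pass.

    Returns (data, finished); finished is False if the deadline cut the read short.
    """
    deadline = clock() + time_limit
    parts = []
    for chunk in chunks:
        parts.append(chunk)
        if clock() > deadline:
            # Deadline passed: keep what we have and stop reading.
            return b"".join(parts), False
    return b"".join(parts), True


def slow_chunks():
    # Stand-in for a slow download: three chunks, 0.05 s apart.
    for _ in range(3):
        time.sleep(0.05)
        yield b"x" * 4


data, finished = read_with_deadline(slow_chunks(), time_limit=0.08)
print(len(data), finished)
```

With a current version of Requests, the call might look like read_with_deadline(r.iter_content(CHUNK_SIZE), 5) on a response obtained with stream=True. The caveat above still applies: a single stalled read can overrun the limit, because the clock is only checked between chunks.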
Run the download in a thread and wait for it with a timeout. Note that join(TIMEOUT) only stops waiting; the thread itself keeps running in the background.
import requests
import threading
URL='http://ipv4.download.thinkbroadband.com/1GB.zip'
TIMEOUT=0.5
def download(return_value):
    return_value.append(requests.get(URL))
return_value = []
download_thread = threading.Thread(target=download, args=(return_value,))
download_thread.start()
download_thread.join(TIMEOUT)
if download_thread.is_alive():
    print 'The download was not finished on time...'
else:
    print return_value[0].headers['content-length']
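A variant of the same approach uses concurrent.futures, where future.result(timeout=...) raises a TimeoutError when the wait expires. This is only a sketch, assuming Python 3: run_with_timeout and slow_task are hypothetical names, and slow_task stands in for the requests.get call. The worker thread is not killed here; the caller only stops waiting for it.

```python
import concurrent.futures
import time


def run_with_timeout(func, timeout, *args):
    """Run func(*args) in a worker thread; return (finished, result)."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(func, *args)
    try:
        return True, future.result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        return False, None
    finally:
        # Do not block on the worker; it keeps running until func returns.
        executor.shutdown(wait=False)


def slow_task(seconds):
    # Stand-in for a blocking download.
    time.sleep(seconds)
    return "done"


print(run_with_timeout(slow_task, 0.05, 0.5))  # stops waiting after 0.05 s
print(run_with_timeout(slow_task, 1.0, 0.01))  # finishes within the limit
```

Because the abandoned worker still holds its connection until the call returns, this pattern bounds how long the caller waits, not how long the download itself runs.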
Related
I am working on a simple web scraper and am currently trying to add multithreading. While my code works as intended with some servers (vastly reducing execution time), my primary goal is to make it work with a few specific ones. When I try it with the ones in the sites list, performance is as if I were still running sequential code. Any guesses as to what could cause this?
import requests, time
from bs4 import BeautifulSoup
from threading import Thread
from random import choice
# Enable to get some logging info
#---------------------------------
# import logging
# import http.client
# http.client.HTTPConnection.debuglevel = 1
# logging.basicConfig()
# logging.getLogger().setLevel(logging.DEBUG)
# requests_log = logging.getLogger("requests.packages.urllib3")
# requests_log.setLevel(logging.DEBUG)
# requests_log.propagate = True
sites = [
"https://pikabu.ru/community/blackhumour",
"https://www.pikabu.ru/tag/%D0%9C%D0%B5%D0%BC%D1%8B/hot"
]
class Pikabu_Downloader(Thread):
    def __init__(self, url, name, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.url = url
        self.name = name
        self.begin = time.time()
    def run(self):
        print("Beginning with thread number",self.name, ",", round(time.time()-self.begin, 4), " seconds has passed")
        html_data = self._get_html()
        print("After requests.get with thread number", self.name, ",", round(time.time()-self.begin, 4), " seconds has passed")
        if html_data is None:
            return
        self.soup = BeautifulSoup(html_data, "html.parser")
        print("After making soup with thread number", self.name, ",", round(time.time() - self.begin, 4), " seconds has passed")
    def _get_html(self):
        try:
            user_agents = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64)', 'AppleWebKit/537.36 (KHTML, like Gecko)', 'Chrome/74.0.3729.169', 'Safari/537.36')
            print(f"Go {self.url}...")
            res = requests.get(self.url, headers={'User-Agent': choice(user_agents)}, stream = True)#, allow_redirects=False)
        except Exception as exc:
            print(exc)
        else:
            return res.text
test = "https://readingbooks.site/read/?name=1984&"
def download():
    pikabu_urls = []
    for url in sites:
        pikabu = [url + "?page=" + str(x) for x in range(1, 10)]
        pikabu_urls = pikabu_urls + pikabu
    pikabu_dls = [Pikabu_Downloader(url=page, name=str(i)) for i, page in enumerate(pikabu_urls)]
    # Comment the string above and enable 2 underlying strings to get result from test server
    # tests = [test + "page=" + str(x) for x in range(1, pages)]
    # pikabu_dls = [Pikabu_Downloader(url=page, name=str(i)) for i, page in enumerate(tests)]
    for pikabu_dl in pikabu_dls:
        pikabu_dl.start()
    for pikabu_dl in pikabu_dls:
        pikabu_dl.join()
download()
And the result is something like
...
After requests.get with thread number 1 , 1.6904 seconds has passed
After making soup with thread number 1 , 1.7554 seconds has passed
After requests.get with thread number 2 , 2.9805 seconds has passed
After making soup with thread number 2 , 3.0455 seconds has passed
After requests.get with thread number 3 , 4.3225 seconds has passed
After making soup with thread number 3 , 4.3895 seconds has passed
...
What can cause such latency between thread executions? I was hoping each thread would finish almost simultaneously and produce more asynchronous output, as with the test server. If I set a timeout of 5 seconds inside requests.get, most of the requests won't even complete.
After investigating your case, I would point out a few issues:
Do not print from parallel tasks; writing to the screen becomes a bottleneck.
A large number of tasks is not always good for performance; it depends on how much your memory can handle. Imagine you have 1000 links: do you have to create 1000 task objects? No, you only need 5-20 workers, which you get by using a ThreadPool.
The server is also a factor. Response size, low bandwidth, network conditions, distance, and so on all delay responses, and that is what you observe on your machine. Your pages are heavy: each request seems to take 1-3000 ms, so when you test with a small set (20 links), it feels as if it runs sequentially.
Your code does run in parallel, since you put each request on a separate thread, but that is not quite the right approach; a fully asynchronous library such as asyncio with aiohttp is a better fit. aiohttp handles many concurrent requests as coroutines, while asyncio provides the syntax and runs the event loop on your main thread.
I did a small experiment on Colab. Please note that I didn't use asyncio and aiohttp on Colab because they got stuck there, but I have used them in several projects before, and they were faster than the fastest method below.
The second function (running_threads) is your implementation.
import concurrent.futures
import urllib.request
from threading import Thread
import time, requests
from random import choice
user_agents = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64)', 'AppleWebKit/537.36 (KHTML, like Gecko)', 'Chrome/74.0.3729.169', 'Safari/537.36')
timeout = 5
sites = [
"https://pikabu.ru/community/blackhumour",
"https://www.pikabu.ru/tag/%D0%9C%D0%B5%D0%BC%D1%8B/hot"
]
URLS = []
for url in sites:
    pikabu = [url + "?page=" + str(x) for x in range(25)]
    URLS.extend(pikabu)
def convert_to_threads():
    return [Thread(target=load_url, args=(page, timeout)) for page in URLS]
def running_threads():
    threads = convert_to_threads()
    start = time.time()
    for i in threads:
        i.start()
    for i in threads:
        i.join()
    print(f'Finish with {len(URLS)} requests {time.time() - start}')
def load_url(url, timeout):
    res = requests.get(url, headers={'User-Agent': choice(user_agents)}, stream = True)#, allow_redirects=False)
    return res.text
def running_sequence():
    start = time.time()
    for url in URLS:
        load_url(url, timeout)
    print(f'Finish with {len(URLS)} requests {time.time() - start}')
def running_thread_pool():
    start = time.time()
    # We can use a with statement to ensure threads are cleaned up promptly
    with concurrent.futures.ThreadPoolExecutor(max_workers=15) as executor:
        # Start the load operations and mark each future with its URL
        future_to_url = {executor.submit(load_url, url, timeout): url for url in URLS}
        for future in concurrent.futures.as_completed(future_to_url):
            url = future_to_url[future]
            try:
                data = future.result()
            except Exception as exc:
                print('%r generated an exception: %s' % (url, exc))
            # else:
            #     print('%r page is %d length' % (url, len(data)))
    print(f'Finish with {len(URLS)} requests {time.time() - start}')
In short, I recommend using a ThreadPool (preferable on Colab), or asyncio with aiohttp (outside Colab), to gain speed.
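For the asyncio route, the usual shape is a set of coroutines gathered on one event loop, with a semaphore capping how many run at once. A minimal sketch, assuming Python 3.7+: fake_fetch is a hypothetical stand-in that sleeps instead of making an HTTP request (with aiohttp, its body would be an awaited session request instead), and the example.com URLs are placeholders.

```python
import asyncio
import time


async def fake_fetch(url, delay=0.05):
    # Stand-in for a network request; sleeps instead of doing I/O.
    await asyncio.sleep(delay)
    return url, len(url)


async def fetch_all(urls, max_concurrent=5):
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(url):
        # At most max_concurrent fetches are in flight at once.
        async with semaphore:
            return await fake_fetch(url)

    # gather returns results in the same order as the input URLs.
    return await asyncio.gather(*(bounded(u) for u in urls))


urls = ["https://example.com/?page=%d" % i for i in range(10)]
start = time.perf_counter()
results = asyncio.run(fetch_all(urls))
print("Fetched %d pages in %.2f s" % (len(results), time.perf_counter() - start))
```

The semaphore plays the same role as max_workers in the thread pool: it keeps the number of simultaneous requests bounded regardless of how many URLs are queued.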
I wrote this crawler in Python. It dumps several parameters to a JSON output file based on an input list of domains.
I have this question:
Do I need to close the HTTP connection in each thread? The input data is ca. 5 million items. At the beginning it processes ca. 50 iterations per second, but after some time it drops to 1-2 per second and/or hangs (no kernel messages and no errors on stdout). Could this be caused by the code, or is it network rate limiting? I suspect the software, since when I restart it, it starts again at a high rate (ca. 50 iterations per second).
Any tips on how to improve the code below are also welcome, especially regarding speed and crawling throughput.
Code in question:
import re
import sys
import urllib2
from threading import Thread
import pprint
from tqdm import tqdm
import lxml.html
from Queue import Queue
from geoip import geolite2
import pycountry
from tld import get_tld
resfile = open("out.txt",'a')
concurrent = 200
def doWork():
    while True:
        url = q.get()
        status = getStatus(url)
        doSomethingWithResult(status)
        q.task_done()
def getStatus(ourl):
    try:
        response = urllib2.urlopen("http://"+ourl)
        peer = response.fp._sock.fp._sock.getpeername()
        ip = peer[0]
        header = response.info()
        html = response.read()
        html_element = lxml.html.fromstring(html)
        generator = html_element.xpath("//meta[@name='generator']/@content")
        try:
            match = geolite2.lookup(ip)
            if match is not None:
                country= match.country
                try:
                    c=pycountry.countries.lookup(country)
                    country=c.name
                except:
                    country=""
        except:
            country=""
        try:
            res=get_tld("http://www"+ourl, as_object=True)
            tld=res.suffix
        except:
            tld=""
        try:
            match = re.search(r'[\w\.-]+@[\w\.-]+', html)
            email=match.group(0)
        except:
            email=""
        try:
            item= generator[0]
            val = "{ \"Domain\":\"http://"+ourl.rstrip()+"\",\"IP:\""+ip+"\"," + "\"Server\":"+ "\""+str(header.getheader("Server")).replace("None","")+"\",\"PoweredBy\":" + "\""+str(header.getheader("X-Powered-By")).replace("None","")+"\""+",\"MetaGenerator\":\""+item+"\",\"Email\":\""+email+"\",\"Suffix\":\""+tld+"\",\"CountryHosted\":\""+country+"\" }"
        except:
            val = "{ \"Domain\":\"http://"+ourl.rstrip()+"\",\"IP:\""+ip+"\"," + "\"Server\":"+ "\""+str(header.getheader("Server")).replace("None","")+"\",\"PoweredBy\":" + "\""+str(header.getheader("X-Powered-By")).replace("None","")+"\""+",\"MetaGenerator\":\"\",\"Email\":\""+email+"\",\"Suffix\":\""+tld+"\",\"CountryHosted\":\""+country+"\" }"
        return val
    except Exception as e:
        #print "error"+str(e)
        pass
def doSomethingWithResult(status):
    if status:
        resfile.write(str(status)+"\n")
q = Queue(concurrent * 2)
for i in range(concurrent):
    t = Thread(target=doWork)
    t.daemon = True
    t.start()
try:
    for url in tqdm(open('list.txt')):
        q.put(url.strip())
        status = open("status.txt",'w')
        status.write(str(url.strip()))
    q.join()
except KeyboardInterrupt:
    sys.exit(1)
Update 1:
Closing the socket and the file descriptor makes it work better; it no longer seems to hang after some time. Performance is ca. 50 requests/sec on a home laptop and ca. 100 requests/sec on a VPS.
from threading import Thread
import httplib, re, sys
import urllib2
import pprint
from tqdm import tqdm
import lxml.html
from Queue import Queue
from geoip import geolite2
import pycountry
from tld import get_tld
import json
resfile = open("out.txt",'a')
concurrent = 200
def doWork():
    while True:
        url = q.get()
        status = getStatus(url)
        doSomethingWithResult(status)
        q.task_done()
def getStatus(ourl):
    try:
        response = urllib2.urlopen("http://"+ourl)
        realsock = response.fp._sock.fp._sock
        peer = response.fp._sock.fp._sock.getpeername()
        ip = peer[0]
        header = response.info()
        html = response.read()
        realsock.close()
        response.close()
        html_element = lxml.html.fromstring(html)
        generator = html_element.xpath("//meta[@name='generator']/@content")
        try:
            match = geolite2.lookup(ip)
            if match is not None:
                country= match.country
                try:
                    c=pycountry.countries.lookup(country)
                    country=c.name
                except:
                    country=""
        except:
            country=""
        try:
            res=get_tld("http://www"+ourl, as_object=True)
            tld=res.suffix
        except:
            tld=""
        try:
            match = re.search(r'[\w\.-]+@[\w\.-]+', html)
            email=match.group(0)
        except:
            email=""
        try:
            item= generator[0]
            val = "{ \"Domain\":"+json.dumps("http://"+ourl.rstrip())+",\"IP\":\""+ip+"\",\"Server\":"+json.dumps(str(header.getheader("Server")).replace("None",""))+",\"PoweredBy\":" +json.dumps(str(header.getheader("X-Powered-By")).replace("None",""))+",\"MetaGenerator\":"+json.dumps(item)+",\"Email\":"+json.dumps(email)+",\"Suffix\":\""+tld+"\",\"CountryHosted\":\""+country+"\" }"
        except:
            val = "{ \"Domain\":"+json.dumps("http://"+ourl.rstrip())+",\"IP\":\""+ip+"\"," + "\"Server\":"+json.dumps(str(header.getheader("Server")).replace("None",""))+",\"PoweredBy\":" +json.dumps(str(header.getheader("X-Powered-By")).replace("None",""))+",\"MetaGenerator\":\"\",\"Email\":"+json.dumps(email)+",\"Suffix\":\""+tld+"\",\"CountryHosted\":\""+country+"\" }"
        return val
    except Exception as e:
        print "error"+str(e)
        pass
def doSomethingWithResult(status):
    if status:
        resfile.write(str(status)+"\n")
q = Queue(concurrent * 2)
for i in range(concurrent):
    t = Thread(target=doWork)
    t.daemon = True
    t.start()
try:
    for url in tqdm(open('list.txt')):
        q.put(url.strip())
        status = open("status.txt",'w')
        status.write(str(url.strip()))
    q.join()
except KeyboardInterrupt:
    sys.exit(1)
The handles will be garbage collected automatically, but you will be better off closing them yourself, especially as you are doing this in a tight loop.
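One way to make the closing explicit is contextlib.closing, which calls close() on the object when the with block exits, even if an exception is raised inside it. A minimal sketch, assuming Python 3: FakeResponse is a hypothetical stand-in for the object returned by urlopen, used so the example runs without a network, and fetch_body is an illustrative name.

```python
from contextlib import closing


class FakeResponse:
    """Stand-in for an HTTP response object with read() and close()."""

    def __init__(self, body):
        self.body = body
        self.closed = False

    def read(self):
        return self.body

    def close(self):
        self.closed = True


def fetch_body(response):
    # close() runs when the block exits, whether or not read() raises.
    with closing(response):
        return response.read()


resp = FakeResponse(b"<html></html>")
body = fetch_body(resp)
print(len(body), resp.closed)
```

The same with-block pattern applies to ordinary files opened inside a loop: each handle is released at the end of its iteration instead of waiting for the garbage collector.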
You also asked for suggestions for improvement. A big one would be to stop using urllib2 and start using requests instead.
There are many possible reasons why your crawling rate drops.
1.) Take care not to crawl too much data from the same domain. Some web servers are configured to allow only one parallel connection per IP address.
2.) Try to send randomized browser-like HTTP headers (User-Agent, Referer, ...) to get past web server scraping protection, if any is set.
3.) Use a mature HTTP library with parallel support, like pycurl (which has CurlMulti) or requests (with grequests). They are likely to perform better.
I have a Python application that uses the threading and requests modules to process many pages. The basic function for downloading a page looks like this:
def get_page(url):
    error = None
    data = None
    max_page_size = 10 * 1024 * 1024
    try:
        s = requests.Session()
        s.max_redirects = 10
        s.keep_alive = False
        r = s.get('http://%s' % url if not url.startswith('http://') else url,
                  headers=headers, timeout=10.0, stream=True)
        raw_data = io.BytesIO()
        size = 0
        for chunk in r.iter_content(4096):
            size += len(chunk)
            raw_data.write(chunk)
            if size > max_page_size:
                r.close()
                raise SpyderError('too_large')
        fetch_result = 'ok'
    finally:
        del s
It works well in most cases, but sometimes the application freezes because of a very slow connection to some servers or other network problems. How can I set up a global, guaranteed timeout for the whole function? Should I use asyncio or coroutines?
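One possible direction, sketched under assumptions rather than offered as a drop-in fix: with asyncio, asyncio.wait_for puts a single deadline around an entire coroutine, covering connecting, headers, and body together. Here download is a hypothetical stand-in that sleeps instead of fetching a page, and get_page_with_deadline is an illustrative name. A real version would need an asyncio-based HTTP client, because a blocking requests call inside a coroutine cannot be interrupted this way.

```python
import asyncio


async def download(url, seconds):
    # Stand-in for an asynchronous page download taking `seconds`.
    await asyncio.sleep(seconds)
    return "ok"


async def get_page_with_deadline(url, seconds, total_timeout):
    """Return (result, error); the whole download must fit in total_timeout."""
    try:
        result = await asyncio.wait_for(download(url, seconds), total_timeout)
        return result, None
    except asyncio.TimeoutError:
        # wait_for cancels the download coroutine before raising.
        return None, "timeout"


print(asyncio.run(get_page_with_deadline("example.com", 0.01, 1.0)))
print(asyncio.run(get_page_with_deadline("example.com", 1.0, 0.05)))
```

The per-request timeout=10.0 in the function above limits each socket wait, not the total; a server that keeps trickling data can stay under it indefinitely, which is the case a whole-function deadline is meant to cover.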
I'm reading XML events with the requests library, as shown in the code below. How do I raise a connection-lost error once the request has started? The server emulates HTTP push / long polling (http://en.wikipedia.org/wiki/Push_technology#Long_polling) and will not end the response by default.
If there is no new message after 10 minutes, the while loop should exit.
import requests
from time import time
if __name__ == '__main__':
    #: Set a default content-length
    content_length = 512
    try:
        requests_stream = requests.get('http://agent.mtconnect.org:80/sample?interval=0', stream=True, timeout=2)
        while True:
            start_time = time()
            #: Read three lines to determine the content-length
            for line in requests_stream.iter_lines(3, decode_unicode=None):
                if line.startswith('Content-length'):
                    content_length = int(''.join(x for x in line if x.isdigit()))
                    #: pause the generator
                    break
            #: Continue the generator and read the exact amount of the body.
            for xml in requests_stream.iter_content(content_length):
                print "Received XML document with content length of %s in %s seconds" % (len(xml), time() - start_time)
                break
    except requests.exceptions.RequestException as e:
        print('error: ', e)
The server push could be tested with curl via command line:
curl http://agent.mtconnect.org:80/sample\?interval\=0
This might not be the best method, but you can use multiprocessing to run the requests in a separate process.
Something like this should work:
import multiprocessing
import requests
import time
class RequestClient(multiprocessing.Process):
    def run(self):
        # Write all your code to process the requests here
        content_length = 512
        try:
            requests_stream = requests.get('http://agent.mtconnect.org:80/sample?interval=0', stream=True, timeout=2)
            start_time = time.time()
            for line in requests_stream.iter_lines(3, decode_unicode=None):
                if line.startswith('Content-length'):
                    content_length = int(''.join(x for x in line if x.isdigit()))
                    break
            for xml in requests_stream.iter_content(content_length):
                print "Received XML document with content length of %s in %s seconds" % (len(xml), time.time() - start_time)
                break
        except requests.exceptions.RequestException as e:
            print('error: ', e)
while True:
    childProcess = RequestClient()
    childProcess.start()
    # Wait for 10mins
    start_time = time.time()
    while time.time() - start_time <= 600:
        # Check if the process is still active
        if not childProcess.is_alive():
            # Request completed
            break
        time.sleep(5) # Give the system some breathing time
    # Check if the process is still active after 10mins.
    if childProcess.is_alive():
        # Shutdown the process
        childProcess.terminate()
        raise RuntimeError("Connection Timed-out")
Not the perfect code for your problem, but you get the idea.
I am developing a script to download online live-streaming videos.
My script:
print "Recording video..."
response = urllib2.urlopen("streaming online video url")
filename = time.strftime("%Y%m%d%H%M%S",time.localtime())+".avi"
f = open(filename, 'wb')
video_file_size_start = 0
video_file_size_end = 1048576 * 7 # end in 7 mb
block_size = 1024
while True:
    try:
        buffer = response.read(block_size)
        if not buffer:
            break
        video_file_size_start += len(buffer)
        if video_file_size_start > video_file_size_end:
            break
        f.write(buffer)
    except Exception, e:
        logger.exception(e)
f.close()
The above script works fine: it downloads 7 MB of video from a live stream and stores it in an *.avi file.
However, I would like to download just 10 seconds of video, regardless of the file size, and store it in an AVI file.
I tried different approaches, but without success.
Could anyone please share their knowledge to help me fix this?
Thanks in advance.
I don't think there is any way of doing that without constantly analysing the video, which would be way too costly. So you could guess how many MB you need and, once the download is done, check that the video is long enough. If it's too long, just cut it. Instead of guessing, you could also build up some statistics on how much you need to retrieve. You could also replace the while True with:
start_time_in_seconds = time.time()
time_limit = 10
while time.time() - start_time_in_seconds < time_limit:
    ...
This should give you about 10 seconds of video, unless connecting takes too long (in which case you get less than 10 seconds) or the server sends extra data for buffering (which is unlikely for live streams).
You can use the 'Content-Length' header to retrieve the video file size, if the header is present.
video_file_size_end = response.info().getheader('Content-Length')
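Note that header values arrive as strings, and the header may be missing altogether (for instance on a live stream of unknown length), so the value has to be converted before it is compared with a byte count. A small sketch, assuming Python 3: parse_content_length is a hypothetical helper, and the plain dict stands in for the response headers.

```python
def parse_content_length(headers):
    """Return Content-Length as an int, or None if missing or malformed."""
    value = headers.get("Content-Length")
    if value is None:
        return None
    try:
        size = int(value)
    except ValueError:
        return None
    # A negative length is not meaningful; treat it as missing.
    return size if size >= 0 else None


print(parse_content_length({"Content-Length": "7340032"}))
print(parse_content_length({}))
```

When the helper returns None, the download loop needs another stopping rule, such as the time limit or the fixed byte budget used earlier.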
response.read() does not work. response.iter_content() seems to do the trick.
import time
import requests
print("Recording video...")
filename = time.strftime("/tmp/" + "%Y%m%d%H%M%S",time.localtime())+".avi"
file_handle = open(filename, 'wb')
chunk_size = 1024
start_time_in_seconds = time.time()
time_limit = 10 # time in seconds, for recording
time_elapsed = 0
url = "http://demo.codesamplez.com/html5/video/sample"
with requests.Session() as session:
    response = session.get(url, stream=True)
    for chunk in response.iter_content(chunk_size=chunk_size):
        if time_elapsed > time_limit:
            break
        # to print time elapsed
        if int(time.time() - start_time_in_seconds)- time_elapsed > 0 :
            time_elapsed = int(time.time() - start_time_in_seconds)
            print(time_elapsed, end='\r', flush=True)
        if chunk:
            file_handle.write(chunk)
file_handle.close()