I wrote a script that "parses" all domains from the file. After the launch, everything works as it should. But when there are several domains left at the end, it gets stuck. Sometimes it takes a long time to parse the last couple of domains. I can't figure out what the problem is. Who has faced such a situation? Tell me how to cure it.
After the launch, everything works out very quickly (as it should) until the end. At the end, it stops when there are several domains left. There is no difference, 1000 domains or 10 000 domains.
Complete code:
import re
import sys
import json
import requests
from bs4 import BeautifulSoup
from multiprocessing.pool import ThreadPool
from requests.packages.urllib3.exceptions import InsecureRequestWarning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)
pool = 100
with open("Rules.json") as file:
REGEX = json.loads(file.read())
ua = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; rv:60.0) Gecko/20100101 Firefox/60.0'}
def Domain_checker(domain):
try:
r = requests.get("http://" + domain, verify=False, headers=ua)
r.encoding = "utf-8"
for company in REGEX.keys():
for type in REGEX[company]:
check_entry = 0
for ph_regex in REGEX[company][type]:
if bool(re.search(ph_regex, r.text)) is True:
check_entry += 1
if check_entry == len(REGEX[company][type]):
title = BeautifulSoup(r.text, "lxml")
Found_domain = "\nCompany: {0}\nRule: {1}\nURL: {2}\nTitle: {3}\n".format(company, type, r.url, title.title.text)
print(Found_domain)
with open("/tmp/__FOUND_DOMAINS__.txt", "a", encoding='utf-8', errors = 'ignore') as file:
file.write(Found_domain)
except requests.exceptions.ConnectionError:
pass
except requests.exceptions.TooManyRedirects:
pass
except requests.exceptions.InvalidSchema:
pass
except requests.exceptions.InvalidURL:
pass
except UnicodeError:
pass
except requests.exceptions.ChunkedEncodingError:
pass
except requests.exceptions.ContentDecodingError:
pass
except AttributeError:
pass
except ValueError:
pass
return domain
if __name__ == '__main__':
with open(sys.argv[1], "r", encoding='utf-8', errors = 'ignore') as file:
Domains = file.read().split()
pool = 100
print("Pool = ", pool)
results = ThreadPool(pool).imap_unordered(Domain_checker, Domains)
string_num = 0
for result in results:
print("{0} => {1}".format(string_num, result))
string_num += 1
with open("/tmp/__FOUND_DOMAINS__.txt", encoding='utf-8', errors = 'ignore') as found_domains:
found_domains = found_domains.read()
print("{0}\n{1}".format("#" * 40, found_domains))
requests.get("http://" + domain, headers=ua, verify=False, timeout=10)
The problem is resolved after installing timeout
Thank you to the user with the nickname "eri" (https://ru.stackoverflow.com/users/16574/eri) :)
i want to parse images from a "certain" manga and chapter. here's my code:
import requests, bs4, os, urllib.request
try:
url = "http://manganelo.com/chapter/read_one_punch_man_manga_online_free3/chapter_136"
res = requests.get(url)
print("[+] Asking a request to " + url)
# slice the url so it only contains the name and chapter
name = url[34:].replace("/", "_")
os.mkdir(name)
print("[+] Making '{}' directory".format(name))
os.chdir(os.path.join(os.getcwd(), name))
soup = bs4.BeautifulSoup(res.text, "html.parser")
for img in soup.findAll("img"):
manga_url = img.get("src")
manga_name = img.get("alt") + ".jpg"
urllib.request.urlretrieve(manga_url, manga_name)
print("[+] Downloading: " + manga_name)
except Exception as e:
print("[-] Error: " + str(e))
it works fine BUT only for a specific chapter, let's say i put chapter 130, when i try to run the code it returns blank file but if i put chapter 136 or others it works fine. How can this happen?
you can replace urllib.request.urlretrieve(manga_url, manga_name)
with :
r = requests.get(manga_url, stream=True)
if r.status_code == 200:
with open(manga_name, 'wb') as f:
r.raw.decode_content = True
shutil.copyfileobj(r.raw, f)
Actually Remote server is apparently checking the user agent header and rejecting requests from Python's urllib.
On the other hand you can use :
opener = urllib.request.URLopener()
opener.addheader('User-Agent', 'whatever')
opener.retrieve(manga_url, manga_name)
This works for me
Hope this helps
This is the code I am using for downloading the images from Google page. This code is taking time in Evaluating and downloading the images. Hence, I thought of using the Beautifulsoup Library for faster evaluation and download. Check the below original code:
import time
import sys
import os
import urllib2
search_keyword = ['Australia']
keywords = [' high resolution']
def download_page(url):
import urllib2
try:
headers = {}
headers['User-Agent'] = "Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.27 Safari/537.17"
req = urllib2.Request(url, headers = headers)
response = urllib2.urlopen(req)
page = response.read()
return page
except:
return"Page Not found"
def _images_get_next_item(s):
start_line = s.find('rg_di')
if start_line == -1:
end_quote = 0
link = "no_links"
return link, end_quote
else:
start_line = s.find('"class="rg_meta"')
start_content = s.find('"ou"',start_line+1)
end_content = s.find(',"ow"',start_content+1)
content_raw = str(s[start_content+6:end_content-1])
return content_raw, end_content
def _images_get_all_items(page):
items = []
while True:
item, end_content = _images_get_next_item(page)
if item == "no_links":
break
else:
items.append(item)
time.sleep(0.1)
page = page[end_content:]
return items
t0 = time.time()
i= 0
while i<len(search_keyword):
items = []
iteration = "Item no.: " + str(i+1) + " -->" + " Item name = " + str(search_keyword[i])
print (iteration)
print ("Evaluating...")
search_keywords = search_keyword[i]
search = search_keywords.replace(' ','%20')
try:
os.makedirs(search_keywords)
except OSError, e:
if e.errno != 17:
raise
pass
j = 0
while j<len(keywords):
pure_keyword = keywords[j].replace(' ','%20')
url = 'https://www.google.com/search?q=' + search + pure_keyword + '&espv=2&biw=1366&bih=667&site=webhp&source=lnms&tbm=isch&sa=X&ei=XosDVaCXD8TasATItgE&ved=0CAcQ_AUoAg'
raw_html = (download_page(url))
time.sleep(0.1)
items = items + (_images_get_all_items(raw_html))
j = j + 1
print ("Total Image Links = "+str(len(items)))
print ("\n")
info = open('output.txt', 'a')
info.write(str(i) + ': ' + str(search_keyword[i-1]) + ": " + str(items) + "\n\n\n")
info.close()
t1 = time.time()
total_time = t1-t0
print("Total time taken: "+str(total_time)+" Seconds")
print ("Starting Download...")
k=0
errorCount=0
while(k<len(items)):
from urllib2 import Request,urlopen
from urllib2 import URLError, HTTPError
try:
req = Request(items[k], headers={"User-Agent": "Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.27 Safari/537.17"})
response = urlopen(req,None,15)
output_file = open(search_keywords+"/"+str(k+1)+".jpg",'wb')
data = response.read()
output_file.write(data)
response.close();
print("completed ====> "+str(k+1))
k=k+1;
except IOError:
errorCount+=1
print("IOError on image "+str(k+1))
k=k+1;
except HTTPError as e:
errorCount+=1
print("HTTPError"+str(k))
k=k+1;
except URLError as e:
errorCount+=1
print("URLError "+str(k))
k=k+1;
i = i+1
print("\n")
print("Everything downloaded!")
print("\n"+str(errorCount)+" ----> total Errors")
I thought editing the below code, will help in making the code work with BeautifulSoup Library and my work will be completed faster:
def download_page(url):
import urllib2
try:
headers = {}
headers['User-Agent'] = "Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.27 Safari/537.17"
req = urllib2.Request(url, headers = headers)
#response = urllib2.urlopen(req)
#page = response.read()
return BeautifulSoup(urlopen(Request(req)), 'html.parser')
except:
return"Page Not found"
But the above code is returning blank. Kindly, let me know what I might do to make the code work excellently well with BeautifulSoup without any trouble.
You can't just pass Google headers like that. The search engine is ALOT more complex than simply substituting some keywords into a GET URL.
HTML is a markup language only useful for one way rendering of human readable information. For your application, you need machine readable markup rather than trying to decipher human readable text. Google already has a very comprehensive API https://developers.google.com/custom-search/ which is easy to use, and a much better way of achieving this than using BeautifulSoup
Hi guys i'm fairly new in python. what i'm trying to do is to move my old code into multiprocessing however i'm facing some errors that i hope anyone could help me out. My code is used to check a few thousand links given in a text form to check for certain tags. Once found it will output it to me. Due to the reason i have a few thousand links to check, speed is an issue and hence the need for me to move to multi processing.
Update: i'm having return errors of HTTP 503 errors. Am i sending too much request or am i missin gout something?
Multiprocessing code:
from mechanize import Browser
from bs4 import BeautifulSoup
import sys
import socket
from multiprocessing.dummy import Pool # This is a thread-based Pool
from multiprocessing import cpu_count
br = Browser()
br.set_handle_robots(False)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
no_stock = []
def main(lines):
done = False
tries = 1
while tries and not done:
try:
r = br.open(lines, timeout=15)
r = r.read()
soup = BeautifulSoup(r,'html.parser')
done = True # exit the loop
except socket.timeout:
print('Failed socket retrying')
tries -= 1 # to exit when tries == 0
except Exception as e:
print '%s: %s' % (e.__class__.__name__, e)
print sys.exc_info()[0]
tries -= 1 # to exit when tries == 0
if not done:
print('Failed for {}\n'.format(lines))
table = soup.find_all('div', {'class' : "empty_result"})
results = soup.find_all('strong', style = 'color: red;')
if table or results:
no_stock.append(lines)
if __name__ == "__main__":
r = br.open('http://www.randomweb.com/') #avoid redirection
fileName = "url.txt"
pool = Pool(processes=2)
with open(fileName, "r+") as f:
lines = pool.map(main, f)
with open('no_stock.txt','w') as f :
f.write('No. of out of stock items : '+str(len(no_stock))+'\n'+'\n')
for i in no_stock:
f.write(i + '\n')
Traceback:
Traceback (most recent call last):
File "test2.py", line 43, in <module>
lines = pool.map(main, f)
File "/usr/lib/python2.7/multiprocessing/pool.py", line 251, in map
return self.map_async(func, iterable, chunksize).get()
File "/usr/lib/python2.7/multiprocessing/pool.py", line 567, in get
raise self._value
UnboundLocalError: local variable 'soup' referenced before assignment
my txt file is something like this:-
http://www.randomweb.com/item.htm?uuid=44733096229
http://www.randomweb.com/item.htm?uuid=4473309622789
http://www.randomweb.com/item.htm?uuid=447330962291
....etc
from mechanize import Browser
from bs4 import BeautifulSoup
import sys
import socket
from multiprocessing.dummy import Pool # This is a thread-based Pool
from multiprocessing import cpu_count
br = Browser()
no_stock = []
def main(line):
done = False
tries = 3
while tries and not done:
try:
r = br.open(line, timeout=15)
r = r.read()
soup = BeautifulSoup(r,'html.parser')
done = True # exit the loop
except socket.timeout:
print('Failed socket retrying')
tries -= 1 # to exit when tries == 0
except:
print('Random fail retrying')
print sys.exc_info()[0]
tries -= 1 # to exit when tries == 0
if not done:
print('Failed for {}\n'.format(i))
table = soup.find_all('div', {'class' : "empty_result"})
results = soup.find_all('strong', style = 'color: red;')
if table or results:
no_stock.append(i)
if __name__ == "__main__":
fileName = "url.txt"
pool = Pool(cpu_count() * 2) # Creates a Pool with cpu_count * 2 threads.
with open(fileName, "rb") as f:
lines = pool.map(main, f)
with open('no_stock.txt','w') as f :
f.write('No. of out of stock items : '+str(len(no_stock))+'\n'+'\n')
for i in no_stock:
f.write(i + '\n')
pool.map takes two parameters, the fist is a function(in your code, is main), the other is an iterable, each item of iterable will be a parameter of the function(in your code, is each line of the file)
in the following code below I am trying to first check if the URL status code and then start the relevant thread and do the same for adding it to queue,
however if urls are too many then I get TimeOut error.
all code added below
but just discovered another bug if I am passing a mp3 file along with some jpeg images the mp3 file downloaded of its correct size is opening as one of the image in urls passed.
_fdUtils
def getParser():
parser = argparse.ArgumentParser(prog='FileDownloader',
description='Utility to download files from internet')
parser.add_argument('-v', '--verbose', default=logging.DEBUG,
help='by default its on, pass None or False to not spit in shell')
parser.add_argument('-st', '--saveTo', default=None, action=FullPaths,
help='location where you want files to download to')
parser.add_argument('-urls', nargs='*',
help='urls of files you want to download.')
parser.add_argument('-se', nargs='*', default=[1], help='Split each url passed to urls by the'\
" respective split order, if a url doesn't have a split default is taken 1 ")
return parser.parse_args()
def getResponse(url):
return requests.head(url, allow_redirects=True, timeout=10, headers={'Accept-Encoding': 'identity'})
def isWorkingURL(url):
response = getResponse(url)
return response.status_code in [302, 200, 100, 204, 300]
def getUrl(url):
""" gets the actual url to download file from.
"""
response = getResponse(url)
return response.headers.get('location', url)
error stack Trace:
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 810, in __bootstrap_inner
self.run()
File "python/file_download.py", line 181, in run
_grabAndWriteToDisk(self, split, url, self.__saveTo, 0, self.queue)
File "python/file_download.py", line 70, in _grabAndWriteToDisk
resp = requests.get(url, headers={'Range': 'bytes=%s' % irange}, stream=True)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests-2.1.0-py2.7.egg/requests/api.py", line 55, in get
return request('get', url, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests-2.1.0-py2.7.egg/requests/api.py", line 44, in request
return session.request(method=method, url=url, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests-2.1.0-py2.7.egg/requests/sessions.py", line 382, in request
resp = self.send(prep, **send_kwargs)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests-2.1.0-py2.7.egg/requests/sessions.py", line 505, in send
history = [resp for resp in gen] if allow_redirects else []
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests-2.1.0-py2.7.egg/requests/sessions.py", line 167, in resolve_redirects
allow_redirects=False,
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests-2.1.0-py2.7.egg/requests/sessions.py", line 485, in send
r = adapter.send(request, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests-2.1.0-py2.7.egg/requests/adapters.py", line 381, in send
raise Timeout(e)
Timeout: HTTPConnectionPool(host='ia600506.us.archive.org', port=80): Read timed out. (read timeout=<object object at 0x1002b40b0>)
there we go again:
import argparse
import logging
import Queue
import os
import requests
import signal
import socket
import sys
import time
import threading
import utils as _fdUtils
from collections import OrderedDict
from itertools import izip_longest
from socket import error as SocketError, timeout as SocketTimeout
# timeout in seconds
TIMEOUT = 10
socket.setdefaulttimeout(TIMEOUT)
DESKTOP_PATH = os.path.expanduser("~/Desktop")
appName = 'FileDownloader'
logFile = os.path.join(DESKTOP_PATH, '%s.log' % appName)
_log = _fdUtils.fdLogger(appName, logFile, logging.DEBUG, logging.DEBUG, console_level=logging.DEBUG)
queue = Queue.Queue()
STOP_REQUEST = threading.Event()
maxSplits = threading.BoundedSemaphore(3)
threadLimiter = threading.BoundedSemaphore(5)
lock = threading.Lock()
pulledSize = 0
dataDict = {}
def _grabAndWriteToDisk(threadName, url, saveTo, first=None, queue=None, mode='wb', irange=None):
""" Function to download file..
Args:
url(str): url of file to download
saveTo(str): path where to save file
first(int): starting byte of the range
queue(Queue.Queue): queue object to set status for file download
mode(str): mode of file to be downloaded
irange(str): range of byte to download
"""
fileName = _fdUtils.getFileName(url)
filePath = os.path.join(saveTo, fileName)
fileSize = _fdUtils.getUrlSizeInBytes(url)
downloadedFileSize = 0 if not first else first
block_sz = 8192
resp = requests.get(url, headers={'Range': 'bytes=%s' % irange}, stream=True)
for fileBuffer in resp.iter_content(block_sz):
if not fileBuffer:
break
with open(filePath, mode) as fd:
downloadedFileSize += len(fileBuffer)
fd.write(fileBuffer)
mode = 'a'
status = r"%10d [%3.2f%%]" % (downloadedFileSize, downloadedFileSize * 100. / fileSize)
status = status + chr(8)*(len(status)+1)
sys.stdout.write('%s\r' % status)
time.sleep(.01)
sys.stdout.flush()
if downloadedFileSize == fileSize:
STOP_REQUEST.set()
queue.task_done()
_log.debug("Downloaded %s %s%% using %s and saved to %s", fileName,
downloadedFileSize * 100. / fileSize, threadName.getName(), saveTo)
def _downloadChunk(url, idx, irange, fileName, sizeInBytes):
_log.debug("Downloading %s for first chunk %s of %s " % (irange, idx+1, fileName))
pulledSize = irange[-1]
try:
resp = requests.get(url, allow_redirects=False, timeout=TIMEOUT,
headers={'Range': 'bytes=%s-%s' % (str(irange[0]), str(irange[-1]))},
stream=True)
except (SocketTimeout, requests.exceptions), e:
_log.error(e)
return
chunk_size = str(irange[-1])
for chunk in resp.iter_content(chunk_size):
status = r"%10d [%3.2f%%]" % (pulledSize, pulledSize * 100. / int(chunk_size))
status = status + chr(8)*(len(status)+1)
sys.stdout.write('%s\r' % status)
sys.stdout.flush()
pulledSize += len(chunk)
dataDict[idx] = chunk
time.sleep(.03)
if pulledSize == sizeInBytes:
_log.info("%s downloaded %3.0f%%", fileName, pulledSize * 100. / sizeInBytes)
class ThreadedFetch(threading.Thread):
""" docstring for ThreadedFetch
"""
def __init__(self, saveTo, queue):
super(ThreadedFetch, self).__init__()
self.queue = queue
self.__saveTo = saveTo
def run(self):
threadLimiter.acquire()
try:
items = self.queue.get()
url = items[0]
split = items[-1]
fileName = _fdUtils.getFileName(url)
# grab split chunks in separate thread.
if split > 1:
maxSplits.acquire()
try:
sizeInBytes = _fdUtils.getUrlSizeInBytes(url)
byteRanges = _fdUtils.getRangeSegements(sizeInBytes, split)
filePath = os.path.join(self.__saveTo, fileName)
downloaders = [
threading.Thread(
target=_downloadChunk,
args=(url, idx, irange, fileName, sizeInBytes),
)
for idx, irange in enumerate(byteRanges)
]
# start threads, let run in parallel, wait for all to finish
for th in downloaders:
th.start()
# this makes the wait for all thread to finish
# which confirms the dataDict is up-to-date
for th in downloaders:
th.join()
downloadedSize = 0
with open(filePath, 'wb') as fh:
for _idx, chunk in sorted(dataDict.iteritems()):
downloadedSize += len(chunk)
status = r"%10d [%3.2f%%]" % (downloadedSize, downloadedSize * 100. / sizeInBytes)
status = status + chr(8)*(len(status)+1)
fh.write(chunk)
sys.stdout.write('%s\r' % status)
time.sleep(.04)
sys.stdout.flush()
if downloadedSize == sizeInBytes:
_log.info("%s, saved to %s", fileName, self.__saveTo)
self.queue.task_done()
finally:
maxSplits.release()
else:
while not STOP_REQUEST.isSet():
self.setName("primary_%s_thread" % fileName.split(".")[0])
# if downlaod whole file in single chunk no need
# to start a new thread, so directly download here.
_grabAndWriteToDisk(self, url, self.__saveTo, 0, self.queue)
finally:
threadLimiter.release()
def main(appName):
args = _fdUtils.getParser()
saveTo = args.saveTo if args.saveTo else DESKTOP_PATH
# spawn a pool of threads, and pass them queue instance
# each url will be downloaded concurrently
unOrdUrls = dict(izip_longest(args.urls, args.se, fillvalue=1))
ordUrls = OrderedDict([(k, unOrdUrls[k]) for k in sorted(unOrdUrls, key=unOrdUrls.get, reverse=False) if _fdUtils.isWorkingURL(k, _log) and _fdUtils.notOnDisk(k, saveTo)])
print "length: %s " % len(ordUrls)
for i in xrange(len(ordUrls)):
t = ThreadedFetch(saveTo, queue)
t.daemon = True
t.start()
try:
# populate queue with data
for url, split in ordUrls.iteritems():
url = _fdUtils.getUrl(url)
print url
queue.put((url, int(split)))
# wait on the queue until everything has been processed
queue.join()
_log.info('All tasks completed.')
except (KeyboardInterrupt, SystemExit):
_log.critical('! Received keyboard interrupt, quitting threads.')
if __name__ == "__main__":
# change the name of MainThread.
threading.currentThread().setName("FileDownloader")
myapp = threading.currentThread().getName()
main(myapp)
I see two problems in your code. Since it's incomplete, I'm not sure how it's supposed to work, so I can't promise either one is the particular one you're running into first, but I'm pretty sure you need to fix both.
First:
queue.put((_fdUtils.getUrl(url), int(split)))
That's going to call _fdUtils.getUrl(url) in the main thread, and put the result on the queue. Your comments clearly imply that you intended the downloading to happen on the background threads.
If you wanted to pass a function to be called, just pass the function and its argument as separate members of the tuple, or wrap it up in a closure or a partial:
queue.put((lambda: _fdUtils.getUrl(url), int(split)))
Second:
t = ThreadedFetch(saveTo, queue)
t.daemon = True
t.start()
This starts a thread for every URL. That's almost never a good idea. Generally, downloaders don't use more than 4-16 threads at a time, and no more than 2-4 to the same site. You could easily be timing out because you're spamming some sit too fast and its server or router is making you back off for a while. Or, with a huge number of requests, you could be flooding your own network and blocking ACKs or even rebooting the router (especially if you have either a cheap home WiFi router or ADSL with a crappy provider).
Also, a much simpler way to do this would be to use a smart pool, like a multiprocessing.dummy.Pool (multiprocessing.dummy means it acts like the multiprocessing module but uses threads) or, even better, a concurrent.futures.ThreadPoolExecutor. In fact, if you look at the docs, a parallel downloader is the first example for ThreadPoolExecutor.