I am writing a web-scraping program that goes through each URL in a list, opens the page at that URL, and extracts some information from the soup. Most of the time it works fine, but occasionally the program stops advancing through the list without terminating, raising warnings or exceptions, or otherwise showing any sign of error. My code, stripped down to the relevant parts, looks like this:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as bs
# some code...
for url in url_list:
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    page = urlopen(req)
    soup = bs(page, features="html.parser")
    # do some stuff with the soup...
When the program stalls, if I terminate it manually (using PyCharm), I get this traceback:
File "/Path/to/my/file.py", line 48, in <module>
soup = bs(page, features="html.parser")
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/bs4/__init__.py", line 266, in __init__
markup = markup.read()
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 454, in read
return self._readall_chunked()
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 564, in _readall_chunked
value.append(self._safe_read(chunk_left))
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 610, in _safe_read
chunk = self.fp.read(min(amt, MAXAMOUNT))
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/socket.py", line 589, in readinto
return self._sock.recv_into(b)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/ssl.py", line 1052, in recv_into
return self.read(nbytes, buffer)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/ssl.py", line 911, in read
return self._sslobj.read(len, buffer)
KeyboardInterrupt
Here's what I have tried and learned:
Added a check that the response status is 200 before making the soup; the non-200 branch never triggers.
Added a print statement after the soup is created. This print statement does not trigger after stalling.
The URLs are always valid. This is confirmed by the fact that the program does not stall on the same URL every time, and double confirmed by a similar program I have with nearly identical code that shows the same behavior on a different set of URLs.
I have tried running through this step-by-step with a debugger. The problem has not occurred in the 30 or so iterations I've checked manually, which may just be coincidence.
The page returns the correct headers when bs4 stalls. The problem seems to be isolated to the creation of the soup.
What could cause this behavior?
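The traceback shows the process blocked inside a low-level socket read, which suggests a response that hangs mid-transfer rather than a bad URL. A minimal sketch of one way to test that hypothesis (an assumption, not a confirmed diagnosis), reusing url_list from the code above: pass a timeout to urlopen so a hung read raises socket.timeout instead of blocking forever.
import socket
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as bs

for url in url_list:
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    try:
        # the timeout applies to the connect and to each blocking socket read,
        # so a server that stops sending mid-response raises instead of hanging
        page = urlopen(req, timeout=30)
        soup = bs(page, features="html.parser")
    except socket.timeout:
        print("Timed out while fetching", url)
        continue
    # do some stuff with the soup...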
Related
# Downloading All XKCD Comics
import os
import bs4
import requests

url = "http://xkcd.com"
os.makedirs("xkcd", exist_ok=True)
while not url.endswith("#"):
    print("Downloading page %s..." % url)
    res = requests.get(url)
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text)
    comicElem = soup.select("#comic img")
    if comicElem == []:
        print("Could not find comic image.")
    else:
        comicUrl = comicElem[0].get("src")
        # Download the image.
        print('Downloading image %s...' % (comicUrl))
        res = requests.get(comicUrl)
        res.raise_for_status()
        imageFile = open(os.path.join("xkcd", os.path.basename(comicUrl)), "wb")
        for chunk in res.iter_content(None):
            imageFile.write(chunk)
        imageFile.close()
    prevLink = soup.select("a[rel=prev]")[0]
    url = "http://xkcd.com" + prevLink.get("href")
print("Done.")
Full code is stated above. Full output is stated below.
Downloading page http://xkcd.com...
C:/Users/emosc/PycharmProjects/RequestsLearning/main.py:38: GuessedAtParserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.
The code that caused this warning is on line 38 of the file C:/Users/emosc/PycharmProjects/RequestsLearning/main.py. To get rid of this warning, pass the additional argument 'features="html.parser"' to the BeautifulSoup constructor.
soup = bs4.BeautifulSoup(res.text)
Traceback (most recent call last):
File "C:/Users/emosc/PycharmProjects/RequestsLearning/main.py", line 46, in <module>
res = requests.get(comicUrl)
File "C:\Users\emosc\PycharmProjects\RequestsLearning\venv\lib\site-packages\requests\api.py", line 75, in get
return request('get', url, params=params, **kwargs)
File "C:\Users\emosc\PycharmProjects\RequestsLearning\venv\lib\site-packages\requests\api.py", line 61, in request
return session.request(method=method, url=url, **kwargs)
File "C:\Users\emosc\PycharmProjects\RequestsLearning\venv\lib\site-packages\requests\sessions.py", line 528, in request
prep = self.prepare_request(req)
File "C:\Users\emosc\PycharmProjects\RequestsLearning\venv\lib\site-packages\requests\sessions.py", line 456, in prepare_request
p.prepare(
File "C:\Users\emosc\PycharmProjects\RequestsLearning\venv\lib\site-packages\requests\models.py", line 316, in prepare
self.prepare_url(url, params)
File "C:\Users\emosc\PycharmProjects\RequestsLearning\venv\lib\site-packages\requests\models.py", line 390, in prepare_url
raise MissingSchema(error)
requests.exceptions.MissingSchema: Invalid URL '//imgs.xkcd.com/comics/rapid_test_results.png': No schema supplied. Perhaps you meant http:////imgs.xkcd.com/comics/rapid_test_results.png?
Downloading image //imgs.xkcd.com/comics/rapid_test_results.png...
I have never seen an image link like http:////imgs.xkcd.com/comics/rapid_test_results.png (with four slashes rather than two), yet that is what the error message suggests, and I don't know how to resolve it. I followed Automate the Boring Stuff with Python; the code is essentially the same as the book's, but it throws this error when I try to scrape the site. Thanks for any help.
The http:// and https:// prefixes are both examples of schemes (the "schema" the exception refers to). Print your URLs before each request and check whether they actually start with one of those prefixes. A URL without http:// or https:// will raise exactly the error shown, so make sure the scheme is present before calling requests.get().
You can add a check like this to skip URLs that start with //:
if comicUrl.startswith("//"):
    continue
Alternatively, put the line below in a try/except block so a bad URL does not stop the whole loop:
res = requests.get(comicUrl)
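For instance, a sketch of that idea (requests.exceptions.RequestException is the base class that the MissingSchema error above derives from), reusing comicUrl, os, and requests from the question's code:
try:
    res = requests.get(comicUrl)
    res.raise_for_status()
except requests.exceptions.RequestException as err:
    # covers MissingSchema, HTTP errors from raise_for_status, timeouts, etc.
    print("Could not download %s: %s" % (comicUrl, err))
else:
    imageFile = open(os.path.join("xkcd", os.path.basename(comicUrl)), "wb")
    for chunk in res.iter_content(None):
        imageFile.write(chunk)
    imageFile.close()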
add "http:"
like this: comicUrl = 'http:'+comicElem[0].get('src')
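A slightly more general fix (a sketch, not from the original answers): resolve the scraped src against the page URL with urllib.parse.urljoin, which handles absolute, relative, and protocol-relative links such as //imgs.xkcd.com/... alike.
from urllib.parse import urljoin

# url is the page that was just fetched; the src may be "//imgs.xkcd.com/...",
# "/relative/path", or a full http(s) URL; urljoin resolves all three correctly
comicUrl = urljoin(url, comicElem[0].get("src"))
res = requests.get(comicUrl)
res.raise_for_status()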
Weird problem here. I have a Python 3 script that runs 24/7 and uses Selenium with Firefox to visit a web page and, every 5 minutes, download a file from a download link. (I can't just fetch it with urllib or similar: even though the link address stays constant, the data in the file changes every time the page is reloaded and also depends on the criteria specified.) The script runs fine almost all of the time, but I can't get rid of one error that pops up every once in a while and terminates the script. Here's the error:
Traceback (most recent call last):
File "/Users/Shared/ROTH_1/Folio/get_F_notes.py", line 248, in <module>
driver.get(search_url)
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/selenium/webdriver/remote/webdriver.py", line 187, in get
self.execute(Command.GET, {'url': url})
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/selenium/webdriver/remote/webdriver.py", line 173, in execute
response = self.command_executor.execute(driver_command, params)
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/selenium/webdriver/remote/remote_connection.py", line 349, in execute
return self._request(command_info[0], url, body=data)
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/selenium/webdriver/remote/remote_connection.py", line 379, in _request
self._conn.request(method, parsed_url.path, body, headers)
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/http/client.py", line 1090, in request
self._send_request(method, url, body, headers)
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/http/client.py", line 1118, in _send_request
self.putrequest(method, url, **skips)
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/http/client.py", line 966, in putrequest
raise CannotSendRequest(self.__state)
http.client.CannotSendRequest: Request-sent
And here is the part of my script where the problem occurs. The script hits the "except ConnectionRefusedError:" branch and, as intended, prints "WARNING 1 : ConnectionRefusedError: search page did not load; now trying again". But then, I think when the loop starts over and calls driver.get(search_url) a second time, the script chokes and raises the error above.
I have researched this quite a bit, and it seems possible that the script is trying to reuse the connection from the first attempt; the fix seems to be to create a new connection. But that is all I have been able to gather, and I have no idea how to create a new connection with Selenium. Do you? Or is there some other issue here?
search_url = 'https://www.example.com/download_page'

loop_get_search_page = 1
while loop_get_search_page < 7:
    if loop_get_search_page == 6:
        print('WARNING: tried to load search page 5 times; exiting script to try again later')
        ##### log out
        try:
            driver.find_element_by_link_text('Sign Out')
        except NoSuchElementException:
            print('WARNING: NoSuchElementException: Unable to find the link text for the "Sign Out" button')
        driver.quit()
        raise SystemExit
    try:
        driver.get(search_url)
    except TimeoutException:
        print('WARNING ', loop_get_search_page, ': TimeoutException: search page did not load; now trying again', sep='')
        loop_get_search_page += 1
        continue
    except ConnectionRefusedError:
        print('WARNING ', loop_get_search_page, ': ConnectionRefusedError: search page did not load; now trying again')
        loop_get_search_page += 1
        continue
    else:
        break
Just ran into this problem myself. In my case, I had another thread running on the side that was also making requests via WebDriver. Turns out WebDriver is not threadsafe.
Check out the discussion at Can Selenium use multi threading in one browser? and the links there for more context.
When I removed the other thread, everything started working as expected.
Is it possible that you're running the every-5-minutes job in a new thread each time?
The only way I know of to "create a new connection" is to launch a new instance of the WebDriver. That can get slow if you're doing a lot of requests, but since you're only doing something every 5 minutes it shouldn't really affect your throughput. As long as you always clean up your WebDriver instance when the download is done, this might be a good option for you.
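A minimal sketch of that approach, assuming Firefox and a 5-minute cycle as in the question; fetch_and_download is a hypothetical placeholder for the page-visit-and-download logic:
import time
from selenium import webdriver

search_url = 'https://www.example.com/download_page'

def fetch_and_download(driver, search_url):
    # hypothetical placeholder: visit the page and grab the file
    driver.get(search_url)
    # ... locate the download link and save the file ...

while True:
    driver = webdriver.Firefox()      # fresh WebDriver instance (and HTTP connection) each cycle
    try:
        fetch_and_download(driver, search_url)
    finally:
        driver.quit()                 # always clean up, even if the fetch raises
    time.sleep(300)                   # wait 5 minutes before the next cycle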
I want to keep clicking load-more until it disappears from the page.
I have tried the code below, but it only works some of the time and otherwise raises an error; it is not a reliable solution.
I have multiple URLs in a list and want to visit them one by one, clicking load-more on each page until it disappears.
Thanks in advance for helping.
Code
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

driver = webdriver.Firefox()
url = ["https://www.zomato.com/HauzKhasSocial","https://www.zomato.com/ncr/wendys-sector-29-gurgaon","https://www.zomato.com/vaultcafecp"]
for load in url:
    driver.get(load)
    xpath_content = '//div[@class = "load-more"]'
    temp_xpath = "true"
    while temp_xpath:
        try:
            #driver.implicitly.wait(15)
            #WebDriverWait(driver, 30).until(EC.visibility_of_element_located((By.XPATH, xpath_content)))
            WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.XPATH, xpath_content)))
            urls = driver.find_element_by_xpath(xpath_content)
            text = urls.text
            if text:
                temp_xpath = text
                print "XPATH=", temp_xpath
            driver.find_element_by_xpath(xpath_content).click()
            #driver.execute_script('$("div.load-more").click();')
        except TimeoutException:
            print driver.title, "no xpath of pagination"
            temp_xpath = ""
            continue
Most of the time I get the following error while running the program.
File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/webdriver.py", line 173, in execute
response = self.command_executor.execute(driver_command, params)
File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/remote_connection.py", line 349, in execute
return self._request(command_info[0], url, body=data)
File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/remote_connection.py", line 380, in _request
resp = self._conn.getresponse()
File "/usr/lib/python2.7/httplib.py", line 1045, in getresponse
response.begin()
File "/usr/lib/python2.7/httplib.py", line 409, in begin
version, status, reason = self._read_status()
File "/usr/lib/python2.7/httplib.py", line 373, in _read_status
raise BadStatusLine(line)
httplib.BadStatusLine: ''
You probably get the BadStatusLine error because of a bug that has been fixed in the latest versions of the Selenium webdrivers. I ran into a similar situation recently, and here is the discussion thread with the developers that helped me out:
https://bugs.chromium.org/p/chromedriver/issues/detail?id=1548
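Separately from the driver bug, the click-until-it-disappears loop can be written so it stops cleanly when the button is gone. A sketch of one such pattern (an alternative to the question's loop, not the answer's code), reusing driver and the url list from the question:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

xpath_content = '//div[@class = "load-more"]'
for load in url:
    driver.get(load)
    while True:
        # find_elements_* returns an empty list instead of raising,
        # so an empty list means the load-more button is gone
        if not driver.find_elements_by_xpath(xpath_content):
            break
        try:
            button = WebDriverWait(driver, 30).until(
                EC.element_to_be_clickable((By.XPATH, xpath_content)))
            button.click()
        except TimeoutException:
            break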
I am planning to open a bunch of links where the only thing that changes is the year at the end of the URL. I am using the code below, but it returns a bunch of errors. My aim is to open each link and filter some things on the page, but first I need to be able to open all the pages, hence this test code. Code below:
from xlwt import *
from urllib.request import urlopen
from bs4 import BeautifulSoup, SoupStrainer
from xlwt.Style import *
j=2014
for j in range(2015):
    conv=str(j)
    content = urlopen("http://en.wikipedia.org/wiki/List_of_Telugu_films_of_%s").read() %conv
    j+=1
    print(content)
Errors:
Traceback (most recent call last):
File "F:\urltest.py", line 11, in <module>
content = urlopen("http://en.wikipedia.org/wiki/List_of_Telugu_films_of_%s").read() %conv
File "C:\Python34\lib\urllib\request.py", line 161, in urlopen
return opener.open(url, data, timeout)
File "C:\Python34\lib\urllib\request.py", line 469, in open
response = meth(req, response)
File "C:\Python34\lib\urllib\request.py", line 579, in http_response
'http', request, response, code, msg, hdrs)
File "C:\Python34\lib\urllib\request.py", line 507, in error
return self._call_chain(*args)
File "C:\Python34\lib\urllib\request.py", line 441, in _call_chain
result = func(*args)
File "C:\Python34\lib\urllib\request.py", line 587, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 400: Bad Request
A little guidance is required. If there is another way to pass the variables (2014, 2015, etc.), that would also be great.
That may be because you are declaring j and then modifying it at the end of your loop; range() already does that for you, so you don't have to increment it. Also, your string interpolation is in the wrong place: the % conv is being applied to the result of .read() rather than to the URL string, so the literal %s is sent to Wikipedia. Apply the variable immediately after the string, e.g. print("Hi %s!" % name).
Try:
for j in range(2015):
    conv = str(j)
    content = urlopen("http://en.wikipedia.org/wiki/List_of_Telugu_films_of_%s" % conv).read()
Also, I am assuming you don't want to query from years 0 to 2015. You can call range(start_year, end_year) to iterate from [start_year, end_year).
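For example (a sketch assuming only the 2014 and 2015 pages are wanted, reusing urlopen from the question's imports):
url_template = "http://en.wikipedia.org/wiki/List_of_Telugu_films_of_%s"
for year in range(2014, 2016):   # 2016 is excluded, so this covers 2014 and 2015
    content = urlopen(url_template % year).read()
    print(content)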
As cesar pointed out in his answer, incrementing j is not needed since the for loop already does that for you. Also, the initial j=2014 assignment has no effect, because the for loop immediately rebinds j (starting from 0) anyway.
This will create a dictionary called contents in which each key maps to the page of the corresponding year:
import urllib2

url = "http://en.wikipedia.org/wiki/List_of_Telugu_films_of_%d"
contents = {year: urllib2.urlopen(url % year).read()
            for year in range(2014, 2015+1)}
However, if you have multiple pages to load, I think the best approach would be to save each page to your local disk first and then load it from there for further processing. That way you can go back to the parsing step as many times as you like while downloading each file only once. So consider doing something like:
# reading (only once)
for year in range(start_year, end_year+1):
    with open('year_%d.txt' % year, 'w') as f:
        f.write(urllib2.urlopen(url % year).read())

# processing
for year in range(start_year, end_year+1):
    with open('year_%d.txt' % year, 'r') as f:
        page = f.read()
    process(page)
I have a strange bug when trying to urlopen a certain page from Wikipedia. This is the page:
http://en.wikipedia.org/wiki/OpenCola_(drink)
This is the shell session:
>>> f = urllib2.urlopen('http://en.wikipedia.org/wiki/OpenCola_(drink)')
Traceback (most recent call last):
File "C:\Program Files\Wing IDE 4.0\src\debug\tserver\_sandbox.py", line 1, in <module>
# Used internally for debug sandbox under external interpreter
File "c:\Python26\Lib\urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
File "c:\Python26\Lib\urllib2.py", line 397, in open
response = meth(req, response)
File "c:\Python26\Lib\urllib2.py", line 510, in http_response
'http', request, response, code, msg, hdrs)
File "c:\Python26\Lib\urllib2.py", line 435, in error
return self._call_chain(*args)
File "c:\Python26\Lib\urllib2.py", line 369, in _call_chain
result = func(*args)
File "c:\Python26\Lib\urllib2.py", line 518, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 403: Forbidden
This happened to me on two different systems in different continents. Does anyone have an idea why this happens?
Wikipedia's stance is:
Data retrieval: Bots may not be used to retrieve bulk content for any use not directly related to an approved bot task. This includes dynamically loading pages from another website, which may result in the website being blacklisted and permanently denied access. If you would like to download bulk content or mirror a project, please do so by downloading or hosting your own copy of our database.
That is why requests using Python's default user agent are blocked; you're supposed to download the data dumps instead.
Anyway, you can read pages like this in Python 2:
req = urllib2.Request(url, headers={'User-Agent' : "Magic Browser"})
con = urllib2.urlopen( req )
print con.read()
Or in Python 3:
import urllib.request
req = urllib.request.Request(url, headers={'User-Agent': "Magic Browser"})
con = urllib.request.urlopen(req)
print(con.read())
To debug this, you'll need to trap that exception.
try:
    f = urllib2.urlopen('http://en.wikipedia.org/wiki/OpenCola_(drink)')
except urllib2.HTTPError, e:
    print e.fp.read()
When I print the resulting message, it includes the following
"English
Our servers are currently experiencing
a technical problem. This is probably
temporary and should be fixed soon.
Please try again in a few minutes. "
Websites will often filter access by checking whether they are being accessed by a recognised user agent. Wikipedia is just treating your script as a bot and rejecting it. Try spoofing as a browser. The following link takes you to an article that shows you how.
http://wolfprojects.altervista.org/changeua.php
Some websites block access from scripts, to avoid 'unnecessary' use of their servers, by reading the headers urllib sends. I don't know and can't imagine why Wikipedia does (or would do) this, but have you tried spoofing your headers?
As Jochen Ritzel mentioned, Wikipedia blocks bots.
However, bots will not get blocked if they use the MediaWiki web API (api.php).
To get the Wikipedia page titled "love":
http://en.wikipedia.org/w/api.php?format=json&action=query&titles=love&prop=revisions&rvprop=content
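A small sketch of fetching that endpoint and decoding the JSON (Python 3 here, sending a User-Agent as discussed above, and assuming the standard query/pages response layout):
import json
import urllib.request

api_url = ("http://en.wikipedia.org/w/api.php?format=json&action=query"
           "&titles=love&prop=revisions&rvprop=content")
req = urllib.request.Request(api_url, headers={"User-Agent": "Magic Browser"})
with urllib.request.urlopen(req) as con:
    data = json.loads(con.read().decode("utf-8"))

# pages are keyed by page id under query -> pages
for page in data["query"]["pages"].values():
    print(page["title"])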
I made a workaround for this using PHP, which is not blocked by the site I needed. It can be accessed like this:
path = 'http://phillippowers.com/redirects/get.php?file=http://website_you_need_to_load.com'
req = urllib2.Request(path)
response = urllib2.urlopen(req)
vdata = response.read()
This will return the HTML code to you.