urllib IncompleteRead() error: can I solve it by just re-requesting? - python

I am running a script that is scraping several hundred pages on a site, but recently I have been running into IncompleteRead() errors. My understanding from looking on Stack Overflow is that they can happen for any number of unknown reasons.
From searching around, I believe the error is raised randomly by the request:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

for ec in unq:
    print(ec)
    url = Request("https://www.brenda-enzymes.org/enzyme.php?ecno=" + ec,
                  headers={'User-Agent': 'Mozilla/5.0'})
    html = urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')
3.5.2.3
2.1.3.15
2.5.1.72
1.5.1.2
6.1.1.9
3.2.2.27
Traceback (most recent call last):
File "C:\Users\wmn262\Anaconda3\lib\http\client.py", line 554, in _get_chunk_left
chunk_left = self._read_next_chunk_size()
File "C:\Users\wmn262\Anaconda3\lib\http\client.py", line 521, in _read_next_chunk_size
return int(line, 16)
ValueError: invalid literal for int() with base 16: b''
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\wmn262\Anaconda3\lib\http\client.py", line 571, in _readall_chunked
chunk_left = self._get_chunk_left()
File "C:\Users\wmn262\Anaconda3\lib\http\client.py", line 556, in _get_chunk_left
raise IncompleteRead(b'')
IncompleteRead: IncompleteRead(0 bytes read)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<ipython-input-20-82f1876d3006>", line 5, in <module>
html = urlopen(url).read()
File "C:\Users\wmn262\Anaconda3\lib\http\client.py", line 464, in read
return self._readall_chunked()
File "C:\Users\wmn262\Anaconda3\lib\http\client.py", line 578, in _readall_chunked
raise IncompleteRead(b''.join(value))
IncompleteRead: IncompleteRead(1772944 bytes read)
The error happens randomly, in that it is not always the same URL that causes it; https://www.brenda-enzymes.org/enzyme.php?ecno=3.2.2.27 caused this specific one.
Some solutions seem to introduce a try clause, but within the except they store the partial data (I think). Why is that the case? Why not just resubmit the request?
If that is possible, how would I just re-run the request, as doing that manually seems to solve the issue? Beyond this I have no idea how to fix the problem.
As per Serge's answer, a try/except with a retry loop seems to be the way:
The stack trace suggests that you are reading a chunked transfer-encoded response and that, for some reason, you lost the connection between two chunks.
As you have said, this can happen for numerous reasons, and it occurs at random. So:
you cannot predict when or for which file it will happen
you cannot prevent it from happening
The best you can do is to catch the error and retry, after an optional delay.
For example:
import http.client
import time

for ec in unq:
    print(ec)
    url = Request("https://www.brenda-enzymes.org/enzyme.php?ecno=" + ec,
                  headers={'User-Agent': 'Mozilla/5.0'})
    sleep = 0
    for i in range(4):
        try:
            html = urlopen(url).read()
            break
        except http.client.IncompleteRead:
            if i == 3:
                raise              # give up after 4 attempts
            time.sleep(sleep)      # optionally wait before retrying
            sleep += 5
    soup = BeautifulSoup(html, 'html.parser')

The stack trace suggests that you are reading a chunked transfer-encoded response and that, for some reason, you lost the connection between two chunks.
As you have said, this can happen for numerous reasons, and it occurs at random. So:
you cannot predict when or for which file it will happen
you cannot prevent it from happening
The best you can do is to catch the error and retry, after an optional delay.
For example:
import http.client

for ec in unq:
    print(ec)
    url = Request("https://www.brenda-enzymes.org/enzyme.php?ecno=" + ec,
                  headers={'User-Agent': 'Mozilla/5.0'})
    for i in range(4):
        try:
            html = urlopen(url).read()
            break
        except http.client.IncompleteRead:
            if i == 3:
                raise  # give up after 4 attempts
            # optionally add a delay here
    soup = BeautifulSoup(html, 'html.parser')
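If several pages or scripts need this, the retry logic can be pulled out into a small helper. This is only a sketch based on the loop above; the function name and the max_attempts/delay parameters are illustrative, not part of the original code:

import http.client
import time
from urllib.request import Request, urlopen

def urlopen_with_retry(url, max_attempts=4, delay=5):
    """Read a URL, retrying on IncompleteRead with a growing pause."""
    request = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    for attempt in range(max_attempts):
        try:
            return urlopen(request).read()
        except http.client.IncompleteRead:
            if attempt == max_attempts - 1:
                raise  # give up after max_attempts tries
            time.sleep(delay * attempt)  # waits 0s, 5s, 10s, ...

# usage inside the scraping loop:
# html = urlopen_with_retry("https://www.brenda-enzymes.org/enzyme.php?ecno=" + ec)
# soup = BeautifulSoup(html, 'html.parser')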

I have faced the same issue and found this solution.
After some small changes, the code looks like this:
from http.client import IncompleteRead, HTTPResponse
from urllib.request import urlopen
from urllib.error import URLError, HTTPError
import json
...

def patch_http_response_read(func):
    def inner(*args):
        try:
            return func(*args)
        except IncompleteRead as e:
            return e.partial  # return whatever was received before the connection dropped
    return inner

HTTPResponse.read = patch_http_response_read(HTTPResponse.read)
try:
    response = urlopen(my_url)
    result = json.loads(response.read().decode('UTF-8'))
except HTTPError as e:
    # HTTPError is a subclass of URLError, so check it first
    print('HTTP Error code: ', e.code)
except URLError as e:
    print('URL Error Reason: ', e.reason)
I'm not sure it is the best way, but it works in my case. I'll be happy if this advice is useful to you or helps you find some other good solution. Happy coding!

Related

MissingSchema(error) thrown when following tutorial

I am following "The Complete Python Course: Beginner to Advanced!" In SkillShare, and there is a point where my code breaks while the code in the tutorial continues just fine.
The tutorial is about making a webscraper with BeautifulSoup, Pillow, and IO. I'm supposed to be able to do a search for anything in bing, then save the pictures on the images search results to a folder in my computer.
Here's the Code:
from bs4 import BeautifulSoup
import requests
from PIL import Image
from io import BytesIO

search = input("Search for:")
params = {"q": search}
r = requests.get("http://bing.com/images/search", params=params)
soup = BeautifulSoup(r.text, "html.parser")
links = soup.findAll("a", {"class": "iusc"})

for item in links:
    img_obj = requests.get(item.attrs["href"])
    print("getting", item.attrs["href"])
    title = item.attrs["href"].split("/")[-1]
    img = Image.open(BytesIO(img_obj.content))
    img.save("C:\\Users\\user\\PycharmProjects\\webscrapery\\scraped_images" + title, img.format)
Whenever I run it, at the end it gives me raise MissingSchema(error):
requests.exceptions.MissingSchema: Invalid URL
I tried adding
img_obj = requests.get("https://" + item.attrs["href"])
but it keeps giving me the same error.
I have gone and looked at the Bing page's code, and the only change I have made is changing the "thumb" class to "iusc". I tried using the "thumb" class as in the tutorial, but then the program just runs without saving anything and eventually finishes.
Thank you for your help
EDIT: Here is the whole error that is being thrown, as requested by baileythegreen:
Traceback (most recent call last):
File "C:\Users\user\PycharmProjects\webscrapery\images.py", line 14, in <module>
img_obj = requests.get(item.attrs["href"])
File "C:\Users\user\AppData\Local\Programs\Python\Python39\lib\site-packages\requests\api.py", line 75, in get
return request('get', url, params=params, **kwargs)
File "C:\Users\user\AppData\Local\Programs\Python\Python39\lib\site-packages\requests\api.py", line 61, in request
return session.request(method=method, url=url, **kwargs)
File "C:\Users\user\AppData\Local\Programs\Python\Python39\lib\site-packages\requests\sessions.py", line 515, in request
prep = self.prepare_request(req)
File "C:\Users\user\AppData\Local\Programs\Python\Python39\lib\site-packages\requests\sessions.py", line 443, in prepare_request
p.prepare(
File "C:\Users\user\AppData\Local\Programs\Python\Python39\lib\site-packages\requests\models.py", line 318, in prepare
self.prepare_url(url, params)
File "C:\Users\user\AppData\Local\Programs\Python\Python39\lib\site-packages\requests\models.py", line 392, in prepare_url
raise MissingSchema(error)
requests.exceptions.MissingSchema: Invalid URL '/images/search?view=detailV2&ccid=mhMFjL9x&id=AE886A498BB66C1DCDCC08B6B45163C71DBF18CB&thid=OIP.mhMFjL9xzdgqujACTRW4zAHaNL&mediaurl=https%3a%2f%2fimage.zmenu.com%2fmenupic%2f2349041%2fs_6565a805-53ac-4f35-a2cb-a3f79c3eab4b.jpg&cdnurl=https%3a%2f%2fth.bing.com%2fth%2fid%2fR.9a13058cbf71cdd82aba30024d15b8cc%3frik%3dyxi%252fHcdjUbS2CA%26pid%3dImgRaw%26r%3d0&exph=1000&expw=562&q=pizza&simid=607993487650659823&FORM=IRPRST&ck=B86DF0449AD7ABD39A1B1697EA9E6D16&selectedIndex=0': No scheme supplied. Perhaps you meant http:///images/search?view=detailV2&ccid=mhMFjL9x&id=AE886A498BB66C1DCDCC08B6B45163C71DBF18CB&thid=OIP.mhMFjL9xzdgqujACTRW4zAHaNL&mediaurl=https%3a%2f%2fimage.zmenu.com%2fmenupic%2f2349041%2fs_6565a805-53ac-4f35-a2cb-a3f79c3eab4b.jpg&cdnurl=https%3a%2f%2fth.bing.com%2fth%2fid%2fR.9a13058cbf71cdd82aba30024d15b8cc%3frik%3dyxi%252fHcdjUbS2CA%26pid%3dImgRaw%26r%3d0&exph=1000&expw=562&q=pizza&simid=607993487650659823&FORM=IRPRST&ck=B86DF0449AD7ABD39A1B1697EA9E6D16&selectedIndex=0?
Edit 2: I followed hawschiat's instructions, and I am getting a different error this time:
Traceback (most recent call last):
File "C:\Users\user\PycharmProjects\webscrapery\images.py", line 15, in <module>
print("getting", item.attrs["href"])
KeyError: 'href'
However, if I keep the "src" attribute in the print line, I get
getting http://tse2.mm.bing.net/th/id/OIP.mhMFjL9xzdgqujACTRW4zAHaNL?w=187&h=333&c=7&r=0&o=5&pid=1.7
Traceback (most recent call last):
File "C:\Users\user\PycharmProjects\webscrapery\images.py", line 18, in <module>
img.save(r'C://Users/user/PycharmProjects/webscrapery/scraped_images' + title, img.format)
File "C:\Users\user\AppData\Local\Programs\Python\Python39\lib\site-packages\PIL\Image.py", line 2209, in save
fp = builtins.open(filename, "w+b")
OSError: [Errno 22] Invalid argument: 'C://Users/user/PycharmProjects/webscrapery/scraped_imageshttp://tse2.mm.bing.net/th/id/OIP.mhMFjL9xzdgqujACTRW4zAHaNL?w=187&h=333&c=7&r=0&o=5&pid=1.7'
I tried putting an 'r' in front of the C: path, but it keeps giving me the same error. I also tried changing the forward slashes to backslashes, and putting two slashes in front of the C. I also made sure I have permission to write to the scraped_images folder, which I do, as well as to webscrapery.
The last line of your stack trace gives you a hint of the cause of the error. The URL scraped from the webpage is not a full URL, but rather the path to the resource.
To make it a full URL, you can simply prepend it with the scheme and authority. In your case, that would be https://bing.com.
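For instance, one quick way to do this (my own sketch, not part of the original answer) is with urllib.parse.urljoin:

from urllib.parse import urljoin

base = "https://bing.com"
full_url = urljoin(base, item.attrs["href"])  # e.g. https://bing.com/images/search?view=...
img_obj = requests.get(full_url)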
That being said, I don't think the URL you obtained is actually the URL of the image. Inspecting Bing Images' webpage with Chrome's developer tools shows that each result's anchor (a) element points to the preview page, while its child img element contains the actual path to the image resource.
With that in mind, we can rewrite your code to something like:
links = soup.findAll("img", {"class": "mimg"})

for item in links:
    img_obj = requests.get(item.attrs["src"])
    print("getting", item.attrs["src"])
    title = item.attrs["src"].split("/")[-1]
    img = Image.open(BytesIO(img_obj.content))
    img.save("C:\\Users\\user\\PycharmProjects\\webscrapery\\scraped_images\\" + title, img.format)
And this should achieve what you are trying to do.
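One caveat (my own addition, not part of the original answer): the last path segment of the image src can still contain characters such as ? or & that Windows does not allow in file names, which is what produced the OSError in Edit 2. A small sketch of one way to clean the name before saving, assuming replacing such characters with underscores is acceptable:

import re

def safe_filename(url):
    # keep only the last path segment, drop any query string,
    # and replace characters Windows rejects in file names
    name = url.split("/")[-1].split("?")[0]
    return re.sub(r'[<>:"/\\|?*]', "_", name) or "image"

title = safe_filename(item.attrs["src"])
img.save("C:\\Users\\user\\PycharmProjects\\webscrapery\\scraped_images\\" + title, img.format)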

Multiprocessing in python - UnboundLocalError: local variable 'data' referenced before assignment

I am trying to access an API which returns a set of products. Since the execution is slow, I was hoping I could use multiprocessing to make it faster. The API works perfectly when accessed using a simple for loop.
Here is my code:
from multiprocessing import Pool
from urllib2 import Request, urlopen, URLError
import json

def f(a):
    request = Request('API' + str(a))
    try:
        response = urlopen(request)
        data = response.read()
    except URLError, e:
        print 'URL ERROR:', e
    s = json.loads(data)
    #count += len(s['Results'])
    #print count
    products = []
    for i in range(len(s['Results'])):
        if (s['Results'][i]['IsSyndicated'] == False):
            try:
                products.append(int(s['Results'][i]['ProductId']))
            except ValueError as e:
                products.append(s['Results'][i]['ProductId'])
    return products

list = [0, 100, 200]

if __name__ == '__main__':
    p = Pool(4)
    result = p.map(f, list)
    print result
Here is the error message:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\z080302\Desktop\WinPython-32bit-2.7.6.3\python-2.7.6\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 540, in runfile
execfile(filename, namespace)
File "C:/Users/z080302/Desktop/Python_Projects/mp_test.py", line 36, in <module>
result=p.map(f, list)
File "C:\Users\z080302\Desktop\WinPython-32bit-2.7.6.3\python-2.7.6\lib\multiprocessing\pool.py", line 250, in map
return self.map_async(func, iterable, chunksize).get()
File "C:\Users\z080302\Desktop\WinPython-32bit-2.7.6.3\python-2.7.6\lib\multiprocessing\pool.py", line 554, in get
raise self._value
UnboundLocalError: local variable 'data' referenced before assignment
I was thinking that even with multiprocessing the function would still be executed sequentially, so why am I getting an UnboundLocalError?
In this code:
try:
    response = urlopen(request)
    data = response.read()
except URLError, e:
    print 'URL ERROR:', e
If urlopen throws a URLError exception, the following line (data = response.read()) is never executed. So when you come to:
s = json.loads(data)
the variable data has never been assigned. You probably want to abort processing in the event of a URLError, since that suggests you will not have any JSON data.
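For example, f could return early when the request fails (a sketch under the assumption that an empty result for a failed request is acceptable; the original answer does not prescribe a specific fix):

def f(a):
    request = Request('API' + str(a))
    try:
        response = urlopen(request)
        data = response.read()
    except URLError, e:
        print 'URL ERROR:', e
        return []  # abort this task: data would be undefined past this point
    s = json.loads(data)
    products = []
    for result in s['Results']:
        if not result['IsSyndicated']:
            try:
                products.append(int(result['ProductId']))
            except ValueError:
                products.append(result['ProductId'])
    return products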
The accepted answer addresses the actual problem, but I thought I'd add my experience for others who come here because of mysterious errors raised by multiprocessing's ApplyResult.get via raise self._value. If you are getting a TypeError, ValueError, or basically any other error that seemingly has nothing to do with multiprocessing, it is because that error is not really raised by multiprocessing: it is raised by the code you are running in the process you are trying to manage (or the thread, if you happen to be using multiprocessing.pool.ThreadPool, which I was).

Exception handling in Python and Praw

I am having trouble with the following code:
import praw
import argparse

# argument handling was here

def main():
    r = praw.Reddit(user_agent='Python Reddit Image Grabber v0.1')
    for i in range(len(args.subreddits)):
        try:
            r.get_subreddit(args.subreddits[i])  # test to see if the subreddit is valid
        except:
            print "Invalid subreddit"
        else:
            submissions = r.get_subreddit(args.subreddits[i]).get_hot(limit=100)
            print [str(x) for x in submissions]

if __name__ == '__main__':
    main()
Subreddit names are taken as arguments to the program.
When an invalid args.subreddits value is passed to get_subreddit, it should throw an exception that is caught in the above code.
When a valid args.subreddit name is given as an argument, the program runs fine.
But when an invalid args.subreddit name is given, the exception is not thrown there; instead the following uncaught exception is output:
Traceback (most recent call last):
File "./pyrig.py", line 33, in <module>
main()
File "./pyrig.py", line 30, in main
print [str(x) for x in submissions]
File "/usr/local/lib/python2.7/dist-packages/praw/__init__.py", line 434, in get_content
page_data = self.request_json(url, params=params)
File "/usr/local/lib/python2.7/dist-packages/praw/decorators.py", line 95, in wrapped
return_value = function(reddit_session, *args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/praw/__init__.py", line 469, in request_json
response = self._request(url, params, data)
File "/usr/local/lib/python2.7/dist-packages/praw/__init__.py", line 342, in _request
response = handle_redirect()
File "/usr/local/lib/python2.7/dist-packages/praw/__init__.py", line 316, in handle_redirect
url = _raise_redirect_exceptions(response)
File "/usr/local/lib/python2.7/dist-packages/praw/internal.py", line 165, in _raise_redirect_exceptions
.format(subreddit))
praw.errors.InvalidSubreddit: `soccersdsd` is not a valid subreddit
I can't tell what I am doing wrong. I have also tried rewriting the exception code as
except praw.errors.InvalidSubreddit:
which also does not work.
EDIT: exception info for Praw can be found here
File "./pyrig.py", line 30, in main
print [str(x) for x in submissions]
The problem, as your traceback indicates, is that the exception doesn't occur when you call get_subreddit. In fact, it also doesn't occur when you call get_hot. The first is a lazy invocation that just creates a dummy Subreddit object but doesn't do anything with it. The second is a generator that doesn't make any requests until you actually try to iterate over it.
Thus you need to move the exception handling code around your print statement (line 30), which is where the request that results in the exception is actually made.
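For example, main could be restructured along these lines (a sketch based on the question's code; the exception class is the one shown in the traceback above):

def main():
    r = praw.Reddit(user_agent='Python Reddit Image Grabber v0.1')
    for name in args.subreddits:
        submissions = r.get_subreddit(name).get_hot(limit=100)
        try:
            # the network request only happens when the generator is consumed
            print [str(x) for x in submissions]
        except praw.errors.InvalidSubreddit:
            print "Invalid subreddit"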

Why is urllib.urlopen() only working once? - Python

I'm writing a crawler to download static HTML pages using urllib.
The get_page function works for one cycle, but when I try to loop it, it doesn't open the content of the next URL I've fed in.
How do I make urllib.urlopen continuously download HTML pages?
If that is not possible, is there any other suggestion for downloading webpages within my Python code?
My code below only returns the HTML for the first website in the seed list:
import urllib

def get_page(url):
    return urllib.urlopen(url).read().decode('utf8')

seed = ['http://www.pmo.gov.sg/content/pmosite/home.html',
        'http://www.pmo.gov.sg/content/pmosite/aboutpmo.html']

for j in seed:
    print "here"
    print get_page(j)
The same crawl "once-only" problem also occurs with urllib2:
import urllib2

def get_page(url):
    req = urllib2.Request(url)
    response = urllib2.urlopen(req)
    return response.read().decode('utf8')

seed = ['http://www.pmo.gov.sg/content/pmosite/home.html',
        'http://www.pmo.gov.sg/content/pmosite/aboutpmo.html']

for j in seed:
    print "here"
    print get_page(j)
Without the exception handling, I'm getting an IOError with urllib:
Traceback (most recent call last):
File "/home/alvas/workspace/SingCorp/sgcrawl.py", line 91, in <module>
print get_page(j)
File "/home/alvas/workspace/SingCorp/sgcrawl.py", line 4, in get_page
return urllib.urlopen(url).read().decode('utf8')
File "/usr/lib/python2.7/urllib.py", line 86, in urlopen
return opener.open(url)
File "/usr/lib/python2.7/urllib.py", line 207, in open
return getattr(self, name)(url)
File "/usr/lib/python2.7/urllib.py", line 462, in open_file
return self.open_local_file(url)
File "/usr/lib/python2.7/urllib.py", line 476, in open_local_file
raise IOError(e.errno, e.strerror, e.filename)
IOError: [Errno 2] No such file or directory: 'http://www.pmo.gov.sg/content/pmosite/aboutpmo.html'
Without the exception handling, I'm getting a ValueError with urllib2:
Traceback (most recent call last):
File "/home/alvas/workspace/SingCorp/sgcrawl.py", line 95, in <module>
print get_page(j)
File "/home/alvas/workspace/SingCorp/sgcrawl.py", line 7, in get_page
response = urllib2.urlopen(req)
File "/usr/lib/python2.7/urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
File "/usr/lib/python2.7/urllib2.py", line 392, in open
protocol = req.get_type()
File "/usr/lib/python2.7/urllib2.py", line 254, in get_type
raise ValueError, "unknown url type: %s" % self.__original
ValueError: unknown url type: http://www.pmo.gov.sg/content/pmosite/aboutpmo.html
ANSWERED:
The IOError and ValueError occurred because there was some sort of Unicode byte order mark (BOM); a non-breaking space was found in the second URL. Thanks for all your help and suggestions in solving the problem!!
Your code is choking on .read().decode('utf8'), but you wouldn't see that since you are just swallowing exceptions. urllib works fine "more than once":
import urllib

def get_page(url):
    return urllib.urlopen(url).read()

seeds = ['http://www.pmo.gov.sg/content/pmosite/home.html',
         'http://www.pmo.gov.sg/content/pmosite/aboutpmo.html']

for seed in seeds:
    print 'here'
    print get_page(seed)
Both of your examples work fine for me. The only explanation I can think of for your exact errors is that the second URL string contains some sort of non-printable character (a Unicode BOM, perhaps) that got filtered out when pasting the code here. Try copying the code back from this site into your file, or retyping the entire second string from scratch.
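If retyping is not practical, another option (my own sketch, not from the original answer) is to strip a BOM or non-breaking space from each URL before opening it:

def clean_url(url):
    # removes a leading UTF-8 BOM, non-breaking spaces, and surrounding whitespace
    cleaned = url.decode('utf-8-sig').replace(u'\xa0', u'')
    return cleaned.strip().encode('ascii')

seed = [clean_url(j) for j in seed]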

Getting and trapping HTTP response using Mechanize in Python

I am trying to get the response codes from Mechanize in Python. While I am able to get a 200 status code, anything else isn't returned (a 404 throws an exception and 30x is ignored). Is there a way to get the original status code?
Thanks
Errors will throw an exception, so just use try:...except:... to handle them.
Your Mechanize browser object has a method set_handle_redirect() that you can use to turn 30x redirection on or off. Turn it off and you get an error for redirects that you handle just like you handle any other error:
>>> from mechanize import Browser
>>> browser = Browser()
>>> resp = browser.open('http://www.oxfam.com') # this generates a redirect
>>> resp.geturl()
'http://www.oxfam.org/'
>>> browser.set_handle_redirect(False)
>>> resp = browser.open('http://www.oxfam.com')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "build\bdist.win32\egg\mechanize\_mechanize.py", line 209, in open
File "build\bdist.win32\egg\mechanize\_mechanize.py", line 261, in _mech_open
mechanize._response.httperror_seek_wrapper: HTTP Error 301: Moved Permanently
>>>
>>> from urllib2 import HTTPError
>>> try:
...     resp = browser.open('http://www.oxfam.com')
... except HTTPError, e:
...     print "Got error code", e.code
...
Got error code 301
In twill, do get_browser().get_code()
twill is an outstanding automation and test layer built on top of mechanize, to make it easier to use. It is seriously handy.
