I'm trying to scrape tables using urllib and BeautifulSoup, and I get the error:
"urllib.error.HTTPError: HTTP Error 302: The HTTP server returned a redirect error that would lead to an infinite loop. The last 30x error message was: Found"
I've heard this is related to the site requiring cookies, but I still get the error on my second attempt:
import urllib.request
from bs4 import BeautifulSoup
import re
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
file = opener.open(testURL).read().decode()
soup = BeautifulSoup(file)
tables = soup.find_all('tr',{'style': re.compile("color:#4A3C8C")})
print(tables)
A few suggestions:
Use HTTPCookieProcessor if you must handle cookies.
You don't have to use a custom User-Agent, but if you want to simulate Mozilla you'll have to use its full user-agent string. This site won't accept 'Mozilla/5.0' and will keep redirecting.
You can catch these exceptions with urllib.error.HTTPError.
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor())
user_agent = 'Mozilla/5.0 (Windows NT 6.1; rv:54.0) Gecko/20100101 Firefox/54.0'
opener.addheaders = [('User-Agent', user_agent)]

try:
    response = opener.open(testURL)
except urllib.error.HTTPError as e:
    print(e)
except Exception as e:
    print(e)
else:
    file = response.read().decode()
    soup = BeautifulSoup(file, 'html.parser')
    ... etc ...
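The "... etc ..." part is just the row-matching code from your question, continuing inside the else branch:

    tables = soup.find_all('tr', {'style': re.compile("color:#4A3C8C")})
    print(tables)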
Related
I used urllib.request.Request to build a request for a memidex.com page, but the urllib.request.urlopen(url) line fails to open the URL.
url = urllib.request.Request("http://www.memidex.com/" + term)
my_request = urllib.request.urlopen(url)
info = BeautifulSoup(my_request, "html.parser")
I've tried using the same code for a different website and it worked for that one, so I have no idea why it's not working for memidex.com.
You need to add headers to your URL request in order to overcome the error. By the way, 'HTTP Error 403: Forbidden' was your error, right?
Hope the below code helps you.
import urllib.request

user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'
url = "http://www.memidex.com/"
headers = {'User-Agent': user_agent}

request = urllib.request.Request(url, None, headers)
response = urllib.request.urlopen(request)
data = response.read()
print(data)
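If you then want to parse the page with BeautifulSoup as in your original code, a minimal sketch:

from bs4 import BeautifulSoup

info = BeautifulSoup(data, "html.parser")  # parse the downloaded page
print(info.title)  # e.g. inspect the <title> tag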
My question is about the urllib module in Python 3. The following piece of code
import urllib.request
import urllib.parse
url = "https://google.com/search?q=stackoverflow"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'}
try:
    req = urllib.request.Request(url, headers=headers)
    resp = urllib.request.urlopen(req)
    file = open('googlesearch.txt', 'w')
    file.write(str(resp.read()))
    file.close()
except Exception as e:
    print(str(e))
works as I expect and writes the content of the Google search for 'stackoverflow' to a file. We need to set a valid User-Agent; otherwise Google does not allow the request and returns an HTTP 405 (Method Not Allowed) error.
I think the following piece of code
import urllib.request
import urllib.parse
url = "https://google.com/search"
values = {'q': 'stackoverflow'}
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'}
data = urllib.parse.urlencode(values)
data = data.encode('utf-8')
try:
    req = urllib.request.Request(url, data=data, headers=headers)
    resp = urllib.request.urlopen(req)
    file = open('googlesearch.txt', 'w')
    file.write(str(resp.read()))
    file.close()
except Exception as e:
    print(str(e))
should produce the same output as the first one, as it is the same google search with the same User-Agent. However, this piece of code throws an exception with message: 'HTTP Error 405: Method Not Allowed'.
My question is: what is wrong with the second piece of code? Why does it not produce the same output as the first one?
You get the 405 response because you are sending a POST request instead of a GET request. "Method Not Allowed" has nothing to do with your User-Agent header; it's about sending an HTTP request with a method (GET, POST, PUT, HEAD, OPTIONS, PATCH, DELETE) that the server does not allow for that URL.
urllib sends a POST because you include the data argument in the Request constructor, as documented here:
https://docs.python.org/3/library/urllib.request.html#urllib.request.Request
method should be a string that indicates the HTTP request method that will be used (e.g. 'HEAD'). If provided, its value is stored in the method attribute and is used by get_method(). The default is 'GET' if data is None or 'POST' otherwise.
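So to keep the query parameters and still send a GET, append the urlencoded values to the URL instead of passing them as data; a minimal sketch:

import urllib.parse
import urllib.request

# Put the query string in the URL; with data=None, urllib defaults to GET,
# so this behaves like the first snippet.
url = "https://google.com/search?" + urllib.parse.urlencode({'q': 'stackoverflow'})
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'}

req = urllib.request.Request(url, headers=headers)
resp = urllib.request.urlopen(req)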
It's highly recommended to use the requests library instead of urllib, because it has a much more sensible API.
import requests

response = requests.get('https://google.com/search', {'q': 'stackoverflow'})
response.raise_for_status()  # raise an exception if the status code is 4xx or 5xx
with open('googlesearch.txt', 'w') as fp:
    fp.write(response.text)
https://github.com/requests/requests
https://docs.python.org/3.4/howto/urllib2.html#data
If you do not pass the data argument, urllib uses a GET request. One way in which GET and POST requests differ is that POST requests often have "side-effects": they change the state of the system in some way (for example by placing an order with the website for a hundredweight of tinned spam to be delivered to your door).
I am trying to download a PDF; however, I get the following error: HTTP Error 403: Forbidden
I am aware that the server is blocking the request for whatever reason, but I can't seem to find a solution.
import urllib.request
import urllib.parse
import requests

def download_pdf(url):
    full_name = "Test.pdf"
    urllib.request.urlretrieve(url, full_name)

try:
    url = 'http://papers.xtremepapers.com/CIE/Cambridge%20IGCSE/Mathematics%20(0580)/0580_s03_qp_1.pdf'
    print('initialized')
    hdr = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36',
        'Content-Length': '136963',
    }
    print('HDR received')
    req = urllib.request.Request(url, headers=hdr)
    print('Header sent')
    resp = urllib.request.urlopen(req)
    print('Request sent')
    respData = resp.read()
    download_pdf(url)
    print('Complete')
except Exception as e:
    print(str(e))
You seem to have already realised this: the remote server is apparently checking the User-Agent header and rejecting requests from Python's urllib. urllib.request.urlretrieve() doesn't let you change the HTTP headers; however, you can use urllib.request.URLopener.retrieve():
import urllib.request
opener = urllib.request.URLopener()
opener.addheader('User-Agent', 'whatever')
filename, headers = opener.retrieve(url, 'Test.pdf')
N.B. You are using Python 3, and these functions are now considered part of the "Legacy interface"; URLopener has been deprecated, so you should not use them in new code.
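A non-deprecated way to do the same with urllib, as a sketch, is to build the Request with the header yourself and stream the body to a file:

import shutil
import urllib.request

req = urllib.request.Request(url, headers={'User-Agent': 'whatever'})
# Stream the response body straight to disk.
with urllib.request.urlopen(req) as resp, open('Test.pdf', 'wb') as outfile:
    shutil.copyfileobj(resp, outfile)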
The above aside, you are going to a lot of trouble simply to access a URL. Your code imports requests but doesn't use it; you should, though, because it is much easier than urllib. This works for me:
import requests

url = 'http://papers.xtremepapers.com/CIE/Cambridge%20IGCSE/Mathematics%20(0580)/0580_s03_qp_1.pdf'
r = requests.get(url)
with open('0580_s03_qp_1.pdf', 'wb') as outfile:
    outfile.write(r.content)
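If the file is large, a sketch of the streaming variant, which avoids holding the whole body in memory:

import requests

url = 'http://papers.xtremepapers.com/CIE/Cambridge%20IGCSE/Mathematics%20(0580)/0580_s03_qp_1.pdf'
with requests.get(url, stream=True) as r:
    r.raise_for_status()
    with open('0580_s03_qp_1.pdf', 'wb') as outfile:
        for chunk in r.iter_content(chunk_size=8192):  # write in 8 KiB pieces
            outfile.write(chunk)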
I am not able to access a website through a proxy. Below is the code I used; I tried it for the application as well as for a "Multiple Circuit Tor Solution". Hopefully I will get help soon.
import urllib2
proxy_support = urllib2.ProxyHandler({'http':'80.82.69.72:3128'})
opener = urllib2.build_opener(proxy_support, urllib2.HTTPHandler(debuglevel=1))
url_set_cookie = 'my_website_address'
req = urllib2.Request(url_set_cookie)
req.add_header('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.14) Gecko/20080404 Firefox/2.0.0.14')
opener.open(req)
I am getting this error:
URLError: <urlopen error [Errno 110] Connection timed out>
Try this:
import urllib2
proxy = urllib2.ProxyHandler({'http': '0.0.0.0:9090'})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)
response = urllib2.urlopen('http://www.google.com/')
datum = response.read().decode("UTF-8")
response.close()
print datum
Refer to this and see if that helps.
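That example is Python 2 (urllib2); on Python 3 the same approach, sketched with urllib.request, looks like this:

import urllib.request

proxy = urllib.request.ProxyHandler({'http': '0.0.0.0:9090'})
opener = urllib.request.build_opener(proxy)
urllib.request.install_opener(opener)  # route subsequent urlopen calls through the proxy

response = urllib.request.urlopen('http://www.google.com/')
datum = response.read().decode("UTF-8")
response.close()
print(datum)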
For some reason, this part where I fetch JSON data from the following URL only works sometimes. Sometimes it returns a 404 error and complains about a missing header attribute. It works 100% of the time if I paste the URL into a web browser, so I'm sure the link is not broken.
I get the following error in Python:
AttributeError: 'HTTPError' object has no attribute 'header'
What's the reason for this and can it be fixed?
By the way, I removed the API key since it is private.
try:
    url = "http://api.themoviedb.org/3/search/person?api_key=API-KEY&query=natalie+portman"
    header = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.16) Gecko/20110319 Firefox/3.6.16'}
    req = urllib2.Request(url, None, header)
    f = urllib2.urlopen(req)
except urllib2.HTTPError, e:
    print e.code
    print e.msg
    print e.header
    print e.fp.read()
As is documented here, you need to explicitly accept JSON; just add the second line after the first. (The AttributeError is a separate, minor bug in your except block: urllib2.HTTPError has no header attribute, so use e.hdrs or e.info() instead of e.header.)
req = urllib2.Request(url, None, header)
req.add_header('Accept', 'application/json')
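Once the request succeeds, you can parse the body with the standard json module; a minimal sketch (Python 2, to match the question):

import json

f = urllib2.urlopen(req)
data = json.load(f)   # parse the JSON response body
print data.keys()     # inspect the top-level fields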