For some reason, this part where I fetch JSON data from the following URL only works sometimes. Other times it returns a 404 error and complains about a missing header attribute. It works 100% of the time if I paste the URL into a web browser, so I'm sure the link isn't broken or anything.
I get the following error in Python:
AttributeError: 'HTTPError' object has no attribute 'header'
What's the reason for this and can it be fixed?
Btw, I removed the API key since it is private.
import urllib2

try:
    url = "http://api.themoviedb.org/3/search/person?api_key=API-KEY&query=natalie+portman"
    header = { 'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.16) Gecko/20110319 Firefox/3.6.16' }
    req = urllib2.Request(url, None, header)
    f = urllib2.urlopen(req)
except urllib2.HTTPError, e:
    print e.code
    print e.msg
    print e.header
    print e.fp.read()
As is documented here, you need to explicitly accept JSON. Just add the second line after the first.
req = urllib2.Request(url, None, header)
req.add_header('Accept', 'application/json')
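For reference, here is a minimal sketch of the full corrected flow (Python 2 urllib2, with API-KEY left as a placeholder). As a side note, the AttributeError happens because the exception object exposes the response headers as e.hdrs (also e.headers), not e.header:
import urllib2

url = "http://api.themoviedb.org/3/search/person?api_key=API-KEY&query=natalie+portman"
header = { 'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.16) Gecko/20110319 Firefox/3.6.16' }

try:
    req = urllib2.Request(url, None, header)
    req.add_header('Accept', 'application/json')  # ask the API for a JSON response
    f = urllib2.urlopen(req)
    print f.read()
except urllib2.HTTPError, e:
    print e.code
    print e.msg
    print e.hdrs      # the response headers; e.header does not exist
    print e.fp.read()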
I used urllib.request.Request for the URL of a memidex.com page, but the urllib.request.urlopen(url) line then fails to open the URL.
url = urllib.request.Request("http://www.memidex.com/" + term)
my_request = urllib.request.urlopen(url)
info = BeautifulSoup(my_request, "html.parser")
I've tried using the same code for a different website and it worked for that one, so I have no idea why it's not working for memidex.com.
You need to add headers to your URL request in order to overcome the error. BTW, 'HTTP Error 403: Forbidden' was your error, right?
Hope the code below helps you.
import urllib.request
user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'
url = "http://www.memidex.com/"
headers={'User-Agent':user_agent,}
request=urllib.request.Request(url,None,headers)
response = urllib.request.urlopen(request)
data = response.read()
print(data)
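A minimal sketch applying the same User-Agent to the original BeautifulSoup flow from the question (the term value below is just a placeholder for whatever page is being looked up):
import urllib.request
from bs4 import BeautifulSoup

term = "example"  # placeholder for the term used in the question
user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'
headers = {'User-Agent': user_agent}

my_request = urllib.request.Request("http://www.memidex.com/" + term, None, headers)
response = urllib.request.urlopen(my_request)
info = BeautifulSoup(response, "html.parser")  # BeautifulSoup accepts the file-like response directly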
My question is about the urllib module in Python 3. The following piece of code
import urllib.request
import urllib.parse
url = "https://google.com/search?q=stackoverflow"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'}
try:
    req = urllib.request.Request(url, headers=headers)
    resp = urllib.request.urlopen(req)
    file = open('googlesearch.txt.', 'w')
    file.write(str(resp.read()))
    file.close()
except Exception as e:
    print(str(e))
works as I expect and writes the content of the Google search 'stackoverflow' to a file. We need to set a valid User-Agent, otherwise Google does not allow the request and returns a 405 Invalid Method error.
I think the following piece of code
import urllib.request
import urllib.parse
url = "https://google.com/search"
values = {'q': 'stackoverflow'}
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'}
data = urllib.parse.urlencode(values)
data = data.encode('utf-8')
try:
    req = urllib.request.Request(url, data=data, headers=headers)
    resp = urllib.request.urlopen(req)
    file = open('googlesearch.txt.', 'w')
    file.write(str(resp.read()))
    file.close()
except Exception as e:
    print(str(e))
should produce the same output as the first one, as it is the same google search with the same User-Agent. However, this piece of code throws an exception with message: 'HTTP Error 405: Method Not Allowed'.
My question is: what is wrong with the second piece of code? Why does it not produce the same output as the first one?
You get the 405 response because you are sending a POST request instead of a GET request. Method Not Allowed should not have anything to do with your User-Agent header; it's about sending an HTTP request with an incorrect method (GET, POST, PUT, HEAD, OPTIONS, PATCH, DELETE).
urllib sends a POST because you include the data argument in the Request constructor, as documented here:
https://docs.python.org/3/library/urllib.request.html#urllib.request.Request
method should be a string that indicates the HTTP request method that will be used (e.g. 'HEAD'). If provided, its value is stored in the method attribute and is used by get_method(). The default is 'GET' if data is None or 'POST' otherwise.
It's highly recommended to use the requests library instead of urllib, because it has a much more sensible API.
import requests
response = requests.get('https://google.com/search', {'q': 'stackoverflow'})
response.raise_for_status() # raise exception if status code is 4xx or 5xx
with open('googlesearch.txt', 'w') as fp:
    fp.write(response.text)
https://github.com/requests/requests
https://docs.python.org/3.4/howto/urllib2.html#data
If you do not pass the data argument, urllib uses a GET request. One
way in which GET and POST requests differ is that POST requests often
have “side-effects”: they change the state of the system in some way
(for example by placing an order with the website for a hundredweight
of tinned spam to be delivered to your door).
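So, to keep the second snippet a GET request with urllib, one option is to append the encoded query string to the URL instead of passing it through the data argument. A minimal sketch under the same assumptions as the question:
import urllib.request
import urllib.parse

values = {'q': 'stackoverflow'}
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'}

# put the query string in the URL so urllib keeps using GET
url = "https://google.com/search?" + urllib.parse.urlencode(values)

req = urllib.request.Request(url, headers=headers)  # no data argument, so the method stays GET
resp = urllib.request.urlopen(req)
with open('googlesearch.txt', 'w') as fp:
    fp.write(str(resp.read()))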
I'm trying to scrape tables using urllib and BeautifulSoup, and I get the error:
"urllib.error.HTTPError: HTTP Error 302: The HTTP server returned a redirect error that would lead to an infinite loop. The last 30x error message was: Found"
I've heard that this is related to the site requiring cookies, but I still get this error after my 2nd attempt:
import urllib.request
from bs4 import BeautifulSoup
import re
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
file = opener.open(testURL).read().decode()
soup = BeautifulSoup(file)
tables = soup.find_all('tr',{'style': re.compile("color:#4A3C8C")})
print(tables)
A few suggestions:
Use HTTPCookieProcessor if you must handle cookies.
You don't have to use a custom User-Agent, but if you want to simulate Mozilla you'll have to use its full name. This site won't accept 'Mozilla/5.0' and will keep redirecting.
You can catch such exceptions with HTTPError.
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor())
user_agent = 'Mozilla/5.0 (Windows NT 6.1; rv:54.0) Gecko/20100101 Firefox/54.0'
opener.addheaders = [('user-agent', user_agent)]
try:
    response = opener.open(testURL)
except urllib.error.HTTPError as e:
    print(e)
except Exception as e:
    print(e)
else:
    file = response.read().decode()
    soup = BeautifulSoup(file, 'html.parser')
    ... etc ...
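As a side note, if the rest of the script keeps calling urllib.request.urlopen directly, the same opener can be installed globally; a small sketch, reusing the testURL from the question:
import urllib.request

opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor())
opener.addheaders = [('user-agent', 'Mozilla/5.0 (Windows NT 6.1; rv:54.0) Gecko/20100101 Firefox/54.0')]
urllib.request.install_opener(opener)  # plain urlopen() calls now go through this opener

response = urllib.request.urlopen(testURL)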
Checked the other answers for similar problems, but couldn't find anything that solved this particular one. I can't figure out why I'm getting the error, because I don't believe I'm missing any values. Also, I think it's odd that it says line 1 column 1 (char 0) - any of you wonderful people have any ideas?
import json
import urllib.request
user_agent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7"
url = "http://api.usatoday.com/open/articles/topnews?encoding=json&api_key=98jv5a93qs"
headers={"User-Agent":user_agent,}
request = urllib.request.Request(url, None, headers)
parsed_json = json.loads(str(request))
for i in range(6):
    title = parsed_json['stories'][i]['title']
    link = parsed_json['stories'][i]['link']
    print(title)
    print(link)
    print("-----------------------------------")
You are trying to parse the response JSON, but you never actually sent the request.
You should send your Request and then parse the response JSON:
import json
import urllib.request
user_agent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7"
url = "http://api.usatoday.com/open/articles/topnews?encoding=json&api_key=98jv5a93qs"
headers={"User-Agent":user_agent,}
request = urllib.request.Request(url, None, headers)
res = urllib.request.urlopen(request)
parsed_json = json.loads(res.read().decode('utf-8'))  # read the body and decode it before parsing
for i in range(6):
    title = parsed_json['stories'][i]['title']
    link = parsed_json['stories'][i]['link']
    print(title)
    print(link)
    print("-----------------------------------")
From what I've seen in both the docs (or v. 2) and at the URL above, the issue is that you are trying to parse JSON which is not JSON. I suggest wrapping your call to json.loads in a try...except block and handling bad JSON. This is generally good practice anyway.
For good measure I looked up the source code for the json module. It looks like all errors from Py2k point to value errors, though I could not find the specific error you mention.
Based on my read of the json module, you'll also be able to get more information if you use try...except and print the properties of the error as well.
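A minimal sketch of that suggestion; on Python 3.5+ the exception is json.JSONDecodeError (a ValueError subclass), which carries msg, pos, lineno and colno attributes, and that is where messages like 'line 1 column 1 (char 0)' come from:
import json

raw = ""  # an empty response body reproduces "Expecting value: line 1 column 1 (char 0)"

try:
    parsed_json = json.loads(raw)
except ValueError as e:  # also catches json.JSONDecodeError
    print("Bad JSON:", e)
    print("Failed at offset:", getattr(e, 'pos', 'n/a'))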
How can I get the content of an HTTP response as well as the response URL (not the requested URL) in a single request?
To get the response I used:
from urllib2 import Request,urlopen
try:
    headers = { 'User-Agent' : 'Mozilla/5.0 (X11; U; Linux i686; en-US;)' }
    request = Request(url, data, headers)
    print urlopen(request).read()
except Exception, e:
    raise Exception(e)
If I want only the headers (the headers will have the response URL), I used:
try:
    headers = { 'User-Agent' : 'Mozilla/5.0 (X11; U; Linux i686; en-US;)' }
    request = Request(url, data, headers)
    request.get_method = lambda : 'HEAD'
    print urlopen(request).geturl()
except Exception, e:
    raise Exception(e)
I am making two requests to get the content and the URL.
How can I get both in a single request? It would be better if my function returned the content and URL as a tuple.
If you assign urlopen(request) to a variable, you can use both attributes with a single request:
response = urlopen(request)
request_body = response.read()
request_url = response.geturl()
print 'URL: %s\nRequest_Body: %s' % ( request_url, request_body )
I would refactor your code into something like this. I don't know why you would want to catch the exception only to raise it again without doing anything to it.
from urllib2 import Request,urlopen
headers = { 'User-Agent' : 'Mozilla/5.0 (X11; U; Linux i686; en-US;)' }
request = Request(url, data, headers)
request.get_method = lambda : 'GET'
response = urlopen(request)
return response.read(), response.geturl()
If you do want to catch the exception, you should put it only around the urlopen call.
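For example, a minimal sketch along those lines, with a hypothetical fetch_page helper that returns the content and the final URL as a tuple:
from urllib2 import Request, urlopen, URLError

def fetch_page(url, data, headers):
    request = Request(url, data, headers)
    try:
        response = urlopen(request)
    except URLError, e:  # HTTPError is a subclass of URLError
        print 'Request failed: %s' % e
        raise
    return response.read(), response.geturl()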