So I'm trying to get the HTML of a page in Python 3.
If I do the following,
from urllib.request import urlopen
html = urlopen("http://google.com/")
html.read()
I get the HTML as desired.
However, if I choose a different URL, as in the following,
from urllib.request import urlopen
html = urlopen("http://www.stackoverflow.com/")
html.read()
I get the following error after the second line:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/urllib/request.py", line 153, in urlopen
return opener.open(url, data, timeout)
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/urllib/request.py", line 461, in open
response = meth(req, response)
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/urllib/request.py", line 574, in http_response
'http', request, response, code, msg, hdrs)
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/urllib/request.py", line 499, in error
return self._call_chain(*args)
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/urllib/request.py", line 433, in _call_chain
result = func(*args)
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/urllib/request.py", line 582, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
Any ideas why this would be happening and how to fix it?
If you look closer at the error message, you'll see that it is an HTTP error, and a specific one:
HTTP Error 403: Forbidden
So you reached the server and got a response back, but you don't know why you were denied.
You can often get a more detailed message in the HTML body returned by the server with something like this:
from urllib.request import urlopen
from urllib.error import HTTPError

try:
    html = urlopen("http://www.stackoverflow.com/")
except HTTPError as e:
    print(e.read().decode('utf-8'))
else:
    html.read()
For me it says:
<h2 data-translate="what_happened">What happened?</h2>
<p>The owner of this website (www.stackoverflow.com) has banned your access based on your browser's signature (213702c58d2116a6-ua48).</p>
You can treat HTTPError as a file object (https://docs.python.org/3/library/urllib.error.html#urllib.error.HTTPError):
Though being an exception (a subclass of URLError), an HTTPError can
also function as a non-exceptional file-like return value (the same
thing that urlopen() returns). This is useful when handling exotic
HTTP errors, such as requests for authentication.
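Since the ban message above mentions the browser's signature, a common first thing to try is sending a browser-like User-Agent header instead of the default Python-urllib one. A minimal sketch; whether the server accepts it depends entirely on its filtering rules:

from urllib.request import Request, urlopen

# Sketch: present a browser-like User-Agent; many servers reject the
# default "Python-urllib/x.y" agent outright.
req = Request("http://www.stackoverflow.com/",
              headers={'User-Agent': 'Mozilla/5.0'})
html = urlopen(req).read()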
Related
I'm trying to make a terminal app that crawls a website and returns the time for an entered city name. This is my code so far:
import re
import urllib.request
city = input('Enter city name: ')
url = 'https://time.is/'
rawData = urllib.request.urlopen(url).read()
decodedData = rawData.decode('utf-8')
print(decodedData)
After the last line I get this error:
Traceback (most recent call last):
File "<pyshell#13>", line 1, in <module>
rawData = urllib.request.urlopen(url).read()
File "~/Python\Python35-32\lib\urllib\request.py", line 163, in urlopen
return opener.open(url, data, timeout)
File "~/Python\Python35-32\lib\urllib\request.py", line 472, in open
response = meth(req, response)
File "~/Python\Python35-32\lib\urllib\request.py", line 582, in http_response
'http', request, response, code, msg, hdrs)
File "~/Python\Python35-32\lib\urllib\request.py", line 510, in error
return self._call_chain(*args)
File "~/Python\Python35-32\lib\urllib\request.py", line 444, in _call_chain
result = func(*args)
File "~/Python\Python35-32\lib\urllib\request.py", line 590, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
Why do I get this error? What's wrong?
[EDIT]
The reason is that time.is bans automated requests. Always remember to read the terms and conditions when doing web scraping. Free APIs can be found to do the same job, too.
When this happens, I usually open the debugger and try to find out what's being called when I access the website. It seems that time.is doesn't like having scripts call their website.
A quick search yielded this:
1532027279136 0 161_(UTC,_UTC+00:00) 1532027279104
Time.is is for humans. To use from scripts and apps, please ask about our API. Thank you!
Here are some APIs you could use to build your project. https://www.programmableweb.com/category/time/api
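For example, a minimal sketch against worldtimeapi.org, one free time API (the endpoint shape and the 'datetime' field are assumptions; check the service's documentation and terms before relying on it):

import json
import urllib.request

# Assumed worldtimeapi.org endpoint; verify before use.
url = 'http://worldtimeapi.org/api/timezone/Europe/London'
with urllib.request.urlopen(url) as resp:
    data = json.loads(resp.read().decode('utf-8'))
print(data['datetime'])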
I have a problem accessing a specific web site.
The site automatically redirects to a check page which displays "check your browser".
The check page returns HTTP 503 the first time.
Then the web browser (Chrome, IE, etc.) automatically re-requests it.
Finally I can get into the web site.
The problem is that I want to access the site from Python.
So I used both urllib and urllib2.
u = urllib.urlopen(url)
print u.read()
It's the same with urllib2, but there it doesn't work, raising a 503 error.
urllib also gets the HTTP 503 code, but it doesn't raise an error.
So I need to re-request without changing the cookie:
u = urllib.urlopen(url)
u = urllib.urlopen(url)  ## cookie is changed
print u.read()
I simply tried to call the open function twice, but the cookie changes and it doesn't work (I get the check page again).
So I used urllib2 with cookielib:
import os.path
import cookielib
import urllib2

cj = cookielib.LWPCookieJar()
if os.path.isfile('cookie.lpw'):
    cj.load('cookie.lpw')
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)

theurl = url
txdata = None
txheaders = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
req = urllib2.Request(theurl, txdata, txheaders)
handle = urllib2.urlopen(req)  ## error raised
Here is the error:
Traceback (most recent call last):
File "<pyshell#20>", line 1, in <module>
handle = urlopen(req)
File "C:\Python27\lib\urllib2.py", line 127, in urlopen
return _opener.open(url, data, timeout)
File "C:\Python27\lib\urllib2.py", line 410, in open
response = meth(req, response)
File "C:\Python27\lib\urllib2.py", line 523, in http_response
'http', request, response, code, msg, hdrs)
File "C:\Python27\lib\urllib2.py", line 448, in error
return self._call_chain(*args)
File "C:\Python27\lib\urllib2.py", line 382, in _call_chain
result = func(*args)
File "C:\Python27\lib\urllib2.py", line 531, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 503: Service Temporarily Unavailable
Simply put, I want to re-request the site when I get an HTTP 503 error, without changing cookies.
But I don't know how to do it.
Can somebody help me?
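One way to retry without touching the cookies is to keep reusing the same opener (and with it the same cookie jar) and catch the 503 yourself. A minimal sketch along the lines of the code above; note that if the check page relies on JavaScript, no amount of retrying from urllib2 will get past it:

import time
import cookielib
import urllib2

cj = cookielib.LWPCookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

def open_with_retry(url, retries=3, delay=5):
    # Reuse the same opener on every attempt so the cookie jar is untouched.
    last_err = None
    for attempt in range(retries):
        try:
            return opener.open(url)
        except urllib2.HTTPError as e:
            if e.code != 503:
                raise
            last_err = e
            time.sleep(delay)  # give the check page time to clear
    raise last_err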
I cannot consistently get JSON from a given URL. It works only about 60% of the time:
from urllib2 import urlopen
import json

jsonurl = urlopen('http://www.reddit.com/r/funny/hot.json?limit=16')
r_content = json.load(jsonurl)['data']['children']
The program sometimes crashes on the json.load line, because the data from the URL is not retrieved properly for some reason.
With some debugging, I found out that I was getting the following error from the first line:
<addinfourl at 4321460952 whose fp = <socket._fileobject object at 0x10185b050>>
This occurs about 40% of the time; the other 60% of the time the code works perfectly. What am I doing wrong? How do I make the URL opening more consistent?
This is usually not an issue on the client side. Your code is consistent in behavior, but the server response can vary.
I ran your code a few times and it does throw up certain issues:
>>> jsonurl = urlopen('http://www.reddit.com/r/funny/hot.json?limit=16')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 406, in open
response = meth(req, response)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 519, in http_response
'http', request, response, code, msg, hdrs)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 444, in error
return self._call_chain(*args)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 378, in _call_chain
result = func(*args)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 527, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 429: Unknown
You have to handle cases where the server response is anything but HTTP 200 (here, HTTP 429 means the server is rate-limiting your requests). You can wrap your code in a try/except block, and you should pass jsonurl to json.load() only when your request succeeds.
Also, urlopen returns a file-like object. Hence if you print jsonurl, it simply shows the jsonurl.__repr__() value. See below:
>>> jsonurl.__repr__()
'<addinfourl at 4393153672 whose fp = <socket._fileobject object at 0x105978450>>'
You have to look for the following:
>>> jsonurl.getcode()
200
>>>
and only if it is 200 should you process the data obtained from the request.
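Putting that together, a minimal sketch (since 429 is rate limiting, backing off between retries is the sensible reaction; the retry count and delays here are arbitrary):

import json
import time
from urllib2 import urlopen, HTTPError

def fetch_children(url, retries=3):
    for attempt in range(retries):
        try:
            jsonurl = urlopen(url)
            if jsonurl.getcode() == 200:
                return json.load(jsonurl)['data']['children']
        except HTTPError as e:
            if e.code != 429:
                raise  # only retry on rate limiting
        time.sleep(2 ** attempt)  # back off before the next attempt
    return None

r_content = fetch_children('http://www.reddit.com/r/funny/hot.json?limit=16')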
I've setup Kannel in Ubuntu using a USB Modem and I can send SMS via the browser using the URL as seen below
localhost:13013/cgi-bin/sendsms?username=kannel&password=kannel&to=+254781923855&text='Kid got swag'
In Python, I have the following script, which works only if the message to be sent does not have spaces.
import urllib.request

def send_sms(mobile_no, message):
    url = "http://%s:%d/cgi-bin/sendsms?username=%s&password=%s&to=%s&text=%s" \
          % ('localhost', 13013, 'kannel', 'kannel', str(mobile_no), message)
    f = urllib.request.urlopen(url)
    print("sms sent")
If I call the function with NO spaces in the message, it works and the message is sent.
sms.send_sms('+254781923855', 'kid_got_swag')
If I have spaces in the message, it fails with the error below.
sms.send_sms('+254781923855', 'kid got swag')
Traceback (most recent call last):
File "/home/lukik/workspace/projx/src/short_message.py", line 24, in <module>
sms.send_sms('+254781923855', 'kid got swag')
File "/home/lukik/workspace/projx/src/short_message.py", line 18, in send_sms
f = urllib.request.urlopen(url)
File "/usr/lib/python3.2/urllib/request.py", line 139, in urlopen
return opener.open(url, data, timeout)
File "/usr/lib/python3.2/urllib/request.py", line 376, in open
response = meth(req, response)
File "/usr/lib/python3.2/urllib/request.py", line 488, in http_response
'http', request, response, code, msg, hdrs)
File "/usr/lib/python3.2/urllib/request.py", line 414, in error
return self._call_chain(*args)
File "/usr/lib/python3.2/urllib/request.py", line 348, in _call_chain
result = func(*args)
File "/usr/lib/python3.2/urllib/request.py", line 496, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 400: Bad Request
I've tried other variants of calling urllib, but they all fail because of the spaces in the message.
In the request you send via the browser, the message is inside quotes:
&text='Kid got swag'
Try that in your request:
url = "http://%s:%d/cgi-bin/sendsms?username=%s&password=%s&to=%s&text='%s'" \
      % ('localhost', 13013, 'kannel', 'kannel', str(mobile_no), message)
Notice the single quotes at &text='%s'.
PS: I'd recommend using the requests library for requests like this. You can construct your URLs better that way, like this:
>>> payload = {'key1': 'value1', 'key2': 'value2'}
>>> r = requests.get("http://httpbin.org/get", params=payload)
URLs are not permitted to contain spaces. When you tried it in your browser, the browser took care of correctly encoding the URL before issuing the request. In your program you need to encode the URL yourself. Fortunately urllib has functions built in to take care of the details:
http://docs.python.org/3.3/library/urllib.parse.html#url-quoting
You need to URL-encode the values you pass as parameters; otherwise a broken URL gets constructed, which is why the request fails. I believe urllib.parse.urlencode does what you need.
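For example, a minimal sketch of the send_sms function from the question, rebuilt with urllib.parse.urlencode (host, port and credentials as in the question):

from urllib.parse import urlencode
from urllib.request import urlopen

def send_sms(mobile_no, message):
    # urlencode percent-escapes spaces and other unsafe characters.
    params = urlencode({
        'username': 'kannel',
        'password': 'kannel',
        'to': str(mobile_no),
        'text': message,
    })
    url = 'http://localhost:13013/cgi-bin/sendsms?' + params
    f = urlopen(url)
    print("sms sent")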
I have the following python script and it works beautifully.
import urllib2
url = 'http://abc.com' # write the url here
usock = urllib2.urlopen(url)
data = usock.read()
usock.close()
print data
However, some of the URLs I give it may redirect two or more times. How can I have Python wait for the redirects to complete before loading the data?
For instance, when using the above code with
http://www.google.com/search?hl=en&q=KEYWORD&btnI=1
which is the equivalent of hitting the I'm Feeling Lucky button on a Google search, I get:
>>> url = 'http://www.google.com/search?hl=en&q=KEYWORD&btnI=1'
>>> usock = urllib2.urlopen(url)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 400, in open
response = meth(req, response)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 513, in http_response
'http', request, response, code, msg, hdrs)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 438, in error
return self._call_chain(*args)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 372, in _call_chain
result = func(*args)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 521, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 403: Forbidden
>>>
I've tried the (url, data, timeout) arguments; however, I am unsure what to put there.
EDIT:
I actually found out that if I don't follow the redirect and just read the headers of the first response, I can grab the location of the next redirect and use that as my final link.
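For reference, a sketch of that approach with urllib2: install a redirect handler that refuses to follow redirects, then read the Location header off the resulting HTTPError (this assumes the server actually answers with a redirect rather than the 403 seen above):

import urllib2

class NoRedirectHandler(urllib2.HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # refuse to follow; urllib2 raises HTTPError instead

opener = urllib2.build_opener(NoRedirectHandler)
try:
    resp = opener.open(url)
except urllib2.HTTPError as e:
    if e.code in (301, 302, 303, 307):
        print e.info()['Location']  # the next hop of the redirect chain
    else:
        raise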
Use requests as the other answer states; here is an example. The final URL after the redirects will be in r.url. In the example below, http is redirected to https.
For HEAD:
In [1]: import requests
...: r = requests.head('http://github.com', allow_redirects=True)
...: r.url
Out[1]: 'https://github.com/'
For GET:
In [1]: import requests
...: r = requests.get('http://github.com')
...: r.url
Out[1]: 'https://github.com/'
Note that for HEAD you have to specify allow_redirects; if you don't, you can still get the target from the headers, but this is not advised:
In [1]: import requests
In [2]: r = requests.head('http://github.com')
In [3]: r.headers.get('location')
Out[3]: 'https://github.com/'
To download the page you will need GET; you can then access the page using r.content.
You might be better off with the Requests library, which has better APIs for controlling redirect handling:
https://requests.readthedocs.io/en/master/user/quickstart/#redirection-and-history
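With requests, the whole redirect chain is recorded in r.history and the final response is r itself; a small sketch:

import requests

r = requests.get('http://github.com')
for hop in r.history:           # each intermediate redirect response
    print(hop.status_code, hop.url)
print(r.status_code, r.url)     # the final response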
Requests:
https://pypi.org/project/requests/ (urllib replacement for humans)