Why can't I load this page using Python?

If I use urllib to load this URL (https://www.fundingcircle.com/my-account/sell-my-loans/) I get a 400 status error. For example, the following returns a 400 error:
>>> import urllib
>>> f = urllib.urlopen("https://www.fundingcircle.com/my-account/sell-my-loans/")
>>> print f.read()
However, if I copy and paste the url into my browser, I see a web page with the information that I want to see.
I have tried using a try/except and then reading the error, but the returned data just tells me that the page does not exist, e.g.:
import urllib
try:
    f = urllib.urlopen("https://www.fundingcircle.com/my-account/sell-my-loans/")
except Exception as e:
    eString = e.read()
    print eString
Why can't Python load the page?

If Python is given a 400 status, that's because the server refuses to give you the page.
Why that is can be difficult to know, because servers are black boxes. But your browser gives the server more than just the URL: it also gives it a set of HTTP headers. Most likely the server alters its behaviour based on the contents of one or more of those headers.
You need to look in your browser's development tools to see what your browser sends, then try to replicate some of those headers from Python. Obvious candidates are the User-Agent header, followed by the Accept and Cookie headers.
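For example, with urllib2 you can attach browser-like headers to the request. This is a minimal sketch; the header values are placeholders you'd copy from your own browser's developer tools:
import urllib2

url = "https://www.fundingcircle.com/my-account/sell-my-loans/"
# example values only; copy the real headers your browser sends
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; rv:25.0) Gecko/20100101 Firefox/25.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
}
req = urllib2.Request(url, headers=headers)
f = urllib2.urlopen(req)
print f.read()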
However, in this specific case, the server is responding with a 401 Unauthorized; you are given a login page. It does this both for the browser and Python:
>>> import urllib
>>> urllib.urlopen('https://www.fundingcircle.com/my-account/sell-my-loans/')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/urllib.py", line 87, in urlopen
return opener.open(url)
File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/urllib.py", line 208, in open
return getattr(self, name)(url)
File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/urllib.py", line 451, in open_https
return self.http_error(url, fp, errcode, errmsg, headers)
File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/urllib.py", line 372, in http_error
result = method(url, fp, errcode, errmsg, headers)
File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/urllib.py", line 683, in http_error_401
errcode, errmsg, headers)
File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/urllib.py", line 381, in http_error_default
raise IOError, ('http error', errcode, errmsg, headers)
IOError: ('http error', 401, 'Unauthorized', <httplib.HTTPMessage instance at 0x1066f9a28>)
but Python's urllib doesn't have a handler for the 401 status code and turns that into an exception.
The response body contains a login form; you'll have to write code to log in here, and presumably track cookies.
That task would be a lot easier with more specialised tools. You could use robobrowser to load the page, parse the form and give you the tools to fill it out, then post the form for you and track the cookies required to keep you logged in. It is built on top of the excellent requests and BeautifulSoup libraries.
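For illustration, a robobrowser session might look like the sketch below; the get_form() lookup and the 'email' / 'password' field names are guesses, since they depend on the actual markup of the login page:
from robobrowser import RoboBrowser

browser = RoboBrowser(history=True)
browser.open('https://www.fundingcircle.com/my-account/sell-my-loans/')

# assumes the login page exposes one form; inspect the page for real field names
form = browser.get_form()
form['email'].value = 'you@example.com'
form['password'].value = 'secret'
browser.submit_form(form)

# cookies are tracked by the underlying requests session,
# so this request should now be authenticated
browser.open('https://www.fundingcircle.com/my-account/sell-my-loans/')
print(browser.parsed)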

Related

Why does this code not download the file when a downloader can download it successfully?

The problem begins with this link:
https://i1.pixiv.net/img-zip-ugoira/img/2017/04/05/00/24/41/62259492_ugoira600x600.zip
The file downloaded with the downloader is complete, but when I try to use Python to download the file:
from urllib import request

request.urlretrieve('https://i1.pixiv.net/img-zip-ugoira/img/2017/04/05/00/24/41/62259492_ugoira600x600.zip', '123.zip')
Traceback (most recent call last):
File "C:/Users/ssshooter/PycharmProjects/first/111.py", line 3, in <module>
request.urlretrieve('https://i1.pixiv.net/img-zip-ugoira/img/2017/04/05/00/24/41/62259492_ugoira600x600.zip', '123.zip')
File "C:\Users\ssshooter\AppData\Local\Programs\Python\Python36\lib\urllib\request.py", line 248, in urlretrieve
with contextlib.closing(urlopen(url, data)) as fp:
File "C:\Users\ssshooter\AppData\Local\Programs\Python\Python36\lib\urllib\request.py", line 223, in urlopen
return opener.open(url, data, timeout)
File "C:\Users\ssshooter\AppData\Local\Programs\Python\Python36\lib\urllib\request.py", line 532, in open
response = meth(req, response)
File "C:\Users\ssshooter\AppData\Local\Programs\Python\Python36\lib\urllib\request.py", line 642, in http_response
'http', request, response, code, msg, hdrs)
File "C:\Users\ssshooter\AppData\Local\Programs\Python\Python36\lib\urllib\request.py", line 570, in error
return self._call_chain(*args)
File "C:\Users\ssshooter\AppData\Local\Programs\Python\Python36\lib\urllib\request.py", line 504, in _call_chain
result = func(*args)
File "C:\Users\ssshooter\AppData\Local\Programs\Python\Python36\lib\urllib\request.py", line 650, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
It doesn't work.
The differences are:
You're using different SSL information: your browser has a built-in set of certificate authorities, while Python uses a set that comes with the OS. They differ, and if the site you're accessing uses a certificate authority known to your browser but not to Python, Python will raise an exception.
You're accessing the site with different User-Agents. Your browser tells the server it's Chrome or IE or whatever; Python tells the server it's Python. For whatever reason, the server may decide it doesn't like that and return Forbidden.
The server may be working harder than you think: while it appears the request is for a simple file, you're really requesting a resource. It may be (though unlikely in this case) that the resource you're requesting results in multiple interactions between the server and your browser -- cookies, JavaScript, etc. -- which are executed successfully in your browser before the server delivers the file. Your Python request does none of that.
Your browser may have existing state which your Python code does not. You say you can access the file using your browser, but perhaps that only works because you've accessed other resources on the site, or logged in, or whatever. Your browser communicates that information (perhaps a session ID via a cookie?) and the server recognizes it. Your Python code starts with no previous state, so the server forbids the request.
Which is it in this case? You'll need to investigate. Can you get wget or curl to work? Debug your browser's access: what headers are being sent, and what do you receive in reply?
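As a starting point for that investigation, here is a sketch that retries the download with a browser-like User-Agent and a Referer header; whether these particular headers are what the server checks is an assumption you'd verify while debugging:
import urllib.request

url = 'https://i1.pixiv.net/img-zip-ugoira/img/2017/04/05/00/24/41/62259492_ugoira600x600.zip'
req = urllib.request.Request(url, headers={
    # pretend to be a browser; some servers reject Python's default agent
    'User-Agent': 'Mozilla/5.0',
    # some image hosts only serve files referenced from their own pages
    'Referer': 'https://www.pixiv.net/',
})
with urllib.request.urlopen(req) as resp, open('123.zip', 'wb') as out:
    out.write(resp.read())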

Running GET with SSL and authentication in Python

I can download things from my controlled server in one way: by passing the document ID into a link like so:
https://website/deployLink/442/document/download/$NUMBER
If I navigate to this in my browser, it downloads the file with ID $NUMBER.
The problem is, I have 9,000 files on the server, which is SSL encrypted and usually requires signing in with a username/password in a dialog box that pops up on the web page.
I posted a similar thread already, where I downloaded the files via wget. Now I would like to try Python, supplying the username/password and getting through the SSL encryption.
Here is my attempt to grab one file, which results in a 401 error. Full stacktrace below.
import urllib2
import ctypes
from HTMLParser import HTMLParser

# create a password manager
password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()

# Add the username and password.
top_level_url = "https://website.com/home.html"
password_mgr.add_password(None, top_level_url, "admin", "password")
handler = urllib2.HTTPBasicAuthHandler(password_mgr)

# create "opener" (OpenerDirector instance)
opener = urllib2.build_opener(handler)

# Install the opener.
# Now all calls to urllib2.urlopen use our opener.
urllib2.install_opener(opener)

# Grab website
response = urllib2.urlopen('https://website/deployLink/442/document/download/1')
html = response.read()

url = 'https://website/deployLink/442/document/download/1'

# Save the file
webpage = urllib2.urlopen(url)
with open('Test.doc', 'wb') as localFile:
    localFile.write(webpage.read())
What have I done incorrectly here? Is what I am attempting possible?
C:\Python27\python.exe C:/Users/ADMIN/PycharmProjects/GetFile.py
Traceback (most recent call last):
File "C:/Users/ADMIN/PycharmProjects/GetFile.py", line 22, in <module>
response = urllib2.urlopen('https://website/deployLink/442/document/download/1')
File "C:\Python27\lib\urllib2.py", line 154, in urlopen
return opener.open(url, data, timeout)
File "C:\Python27\lib\urllib2.py", line 437, in open
response = meth(req, response)
File "C:\Python27\lib\urllib2.py", line 550, in http_response
'http', request, response, code, msg, hdrs)
File "C:\Python27\lib\urllib2.py", line 475, in error
return self._call_chain(*args)
File "C:\Python27\lib\urllib2.py", line 409, in _call_chain
result = func(*args)
File "C:\Python27\lib\urllib2.py", line 558, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 401: Processed
Process finished with exit code 1
Here's my authentication page with some info removed for privacy (screenshot omitted). The authentication URL ends in :443.
Assuming your code above is accurate, I think your problem is related to the URIs in your add_password method. You have this when setting up the username/password:
# Add the username and password.
top_level_url = "https://website.com/home.html"
password_mgr.add_password(None, top_level_url, "admin", "password")
handler = urllib2.HTTPBasicAuthHandler(password_mgr)
And then your subsequent request goes to this URI:
# Grab website
response = urllib2.urlopen('https://website/deployLink/442/document/download/1')
(I'm assuming the URLs were scrubbed inconsistently and should be the same; note "website" vs. "website.com". Let's just move on.)
The second URI is not a child of the first URI, based on their respective path portions: /deployLink/442/document/download/1 is not a child of /home.html. From the library's perspective, you have no auth data for the second URI.
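If that's right, a minimal sketch of the fix is to register the credentials against a URI that is a parent of the download path, e.g. the site root (the host name here is assumed, given the scrubbing):
import urllib2

password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
# register against the site root so every path beneath it matches
password_mgr.add_password(None, "https://website.com/", "admin", "password")
opener = urllib2.build_opener(urllib2.HTTPBasicAuthHandler(password_mgr))
urllib2.install_opener(opener)

response = urllib2.urlopen('https://website.com/deployLink/442/document/download/1')
with open('Test.doc', 'wb') as localFile:
    localFile.write(response.read())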

Logging into a Coursera account Using Python

I have learned a lot from MOOCs, so I wanted to give something back. For this purpose I am thinking of designing a small app in Kivy, which requires a Python implementation. What I want to achieve is to log in to my Coursera account programmatically and collect information about the courses I am currently pursuing. The first step is to log in to Coursera (https://accounts.coursera.org/signin?post_redirect=https%3A%2F%2Fwww.coursera.org%2F). Searching the web, I came across this piece of code:
import urllib2, cookielib, urllib

username = "abcdef@abcdef.com"
password = "uvwxyz"
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
login_data = urllib.urlencode({'username': username, 'password': password})
info = opener.open("https://accounts.coursera.org/signin", login_data)
for line in info:
    print line
and some similar code as well, but nothing worked for me; every approach led to this type of error:
Traceback (most recent call last):
File "C:\Python27\Practice\web programming\coursera login.py", line 9, in <module>
info = opener.open("https://accounts.coursera.org/signin",login_data)
File "C:\Python27\lib\urllib2.py", line 410, in open
response = meth(req, response)
File "C:\Python27\lib\urllib2.py", line 523, in http_response
'http', request, response, code, msg, hdrs)
File "C:\Python27\lib\urllib2.py", line 448, in error
return self._call_chain(*args)
File "C:\Python27\lib\urllib2.py", line 382, in _call_chain
result = func(*args)
File "C:\Python27\lib\urllib2.py", line 531, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 404: Not Found
Is the error due to the HTTPS protocol, or is there something I am missing? I don't want to use any third-party libraries.
I'm using requests for this purpose and I think it is a great Python library. Here is some example code showing how it could work:
import requests
from requests.auth import HTTPBasicAuth

credentials = HTTPBasicAuth('username', 'password')
response = requests.get("https://accounts.coursera.org/signin", auth=credentials)
print response.status_code
# if everything was fine, this prints 200
Here is the link to requests:
http://docs.python-requests.org/en/latest/
I think you need to use the HTTPBasicAuthHandler module of urllib2; check the 'Basic Authentication' section of https://docs.python.org/2/howto/urllib2.html
And I strongly recommend the requests module; it will make your code better. http://docs.python-requests.org/en/latest/
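Note that a signin page like Coursera's typically expects a form POST rather than HTTP Basic auth, so a session-based sketch may be closer to what's needed. The field names and endpoint below are assumptions; inspect the real login form in your browser's developer tools first:
import requests

session = requests.Session()
# field names are guesses; check the actual form for the real ones
login_data = {'email': 'abcdef@abcdef.com', 'password': 'uvwxyz'}
response = session.post('https://accounts.coursera.org/signin', data=login_data)
print response.status_code

# the session keeps any cookies set at login,
# so later requests are made as the logged-in user
profile = session.get('https://www.coursera.org/')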

How do I get rid of spaces in the 'message' when sending SMS via Kannel

I've setup Kannel in Ubuntu using a USB Modem and I can send SMS via the browser using the URL as seen below
localhost:13013/cgi-bin/sendsms?username=kannel&password=kannel&to=+254781923855&text='Kid got swag'
In python, I have the following script which works only if the message to be sent does not have spaces.
import urllib.request

def send_sms(mobile_no, message):
    url = "http://%s:%d/cgi-bin/sendsms?username=%s&password=%s&to=%s&text=%s" \
        % ('localhost', 13013, 'kannel', 'kannel', str(mobile_no), message)
    f = urllib.request.urlopen(url)
    print("sms sent")
If I call the function with NO spaces in the message, it works and the message is sent.
sms.send_sms('+254781923855', 'kid_got_swag')
If there are spaces in the message, it fails with the error below:
sms.send_sms('+254781923855', 'kid got swag')
Traceback (most recent call last):
File "/home/lukik/workspace/projx/src/short_message.py", line 24, in <module>
sms.send_sms('+254781923855', 'kid got swag')
File "/home/lukik/workspace/projx/src/short_message.py", line 18, in send_sms
f = urllib.request.urlopen(url)
File "/usr/lib/python3.2/urllib/request.py", line 139, in urlopen
return opener.open(url, data, timeout)
File "/usr/lib/python3.2/urllib/request.py", line 376, in open
response = meth(req, response)
File "/usr/lib/python3.2/urllib/request.py", line 488, in http_response
'http', request, response, code, msg, hdrs)
File "/usr/lib/python3.2/urllib/request.py", line 414, in error
return self._call_chain(*args)
File "/usr/lib/python3.2/urllib/request.py", line 348, in _call_chain
result = func(*args)
File "/usr/lib/python3.2/urllib/request.py", line 496, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 400: Bad Request
I've tried other variants of calling urllib, but they all fail because of the spaces in the message.
In the request you send via the browser, the message is inside quotes:
&text='Kid got swag'
Try that in your request:
url = "http://%s:%d/cgi-bin/sendsms?username=%s&password=%s&to=%s&text='%s'" \
    % ('localhost', 13013, 'kannel', 'kannel', str(mobile_no), message)
Notice the single quotes at &text='%s'.
PS: I'd recommend using requests for requests like this. You could construct your urls better that way, like this -
>>> payload = {'key1': 'value1', 'key2': 'value2'}
>>> r = requests.get("http://httpbin.org/get", params=payload)
URLs are not permitted to contain spaces. When you tried it in your browser, the browser took care of correctly encoding the URL before issuing the request. In your program you need to encode the URL yourself. Fortunately urllib has built-in functions to take care of the details:
http://docs.python.org/3.3/library/urllib.parse.html#url-quoting
You need to URL-encode the values you pass as parameters; otherwise a broken URL gets constructed, which is why the request fails. I believe urllib.parse.urlencode does what you need.
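Putting that together, here is a sketch of send_sms with the query string built by urllib.parse.urlencode, so spaces and other reserved characters in the message are percent-encoded automatically (host, port and credentials taken from the question):
import urllib.parse
import urllib.request

def send_sms(mobile_no, message):
    # urlencode percent-encodes each value, so spaces in the message are safe
    params = urllib.parse.urlencode({
        'username': 'kannel',
        'password': 'kannel',
        'to': str(mobile_no),
        'text': message,
    })
    url = "http://localhost:13013/cgi-bin/sendsms?" + params
    f = urllib.request.urlopen(url)
    print("sms sent")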

Python's `urllib2`: Why do I get error 403 when I `urlopen` a Wikipedia page?

I have a strange bug when trying to urlopen a certain page from Wikipedia. This is the page:
http://en.wikipedia.org/wiki/OpenCola_(drink)
This is the shell session:
>>> f = urllib2.urlopen('http://en.wikipedia.org/wiki/OpenCola_(drink)')
Traceback (most recent call last):
File "C:\Program Files\Wing IDE 4.0\src\debug\tserver\_sandbox.py", line 1, in <module>
# Used internally for debug sandbox under external interpreter
File "c:\Python26\Lib\urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
File "c:\Python26\Lib\urllib2.py", line 397, in open
response = meth(req, response)
File "c:\Python26\Lib\urllib2.py", line 510, in http_response
'http', request, response, code, msg, hdrs)
File "c:\Python26\Lib\urllib2.py", line 435, in error
return self._call_chain(*args)
File "c:\Python26\Lib\urllib2.py", line 369, in _call_chain
result = func(*args)
File "c:\Python26\Lib\urllib2.py", line 518, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 403: Forbidden
This happened to me on two different systems in different continents. Does anyone have an idea why this happens?
Wikipedia's stance is:
"Data retrieval: Bots may not be used to retrieve bulk content for any use not directly related to an approved bot task. This includes dynamically loading pages from another website, which may result in the website being blacklisted and permanently denied access. If you would like to download bulk content or mirror a project, please do so by downloading or hosting your own copy of our database."
That is why Python is blocked. You're supposed to download data dumps.
Anyway, you can read pages like this in Python 2:
req = urllib2.Request(url, headers={'User-Agent': 'Magic Browser'})
con = urllib2.urlopen(req)
print con.read()
Or in Python 3:
import urllib.request

req = urllib.request.Request(url, headers={'User-Agent': 'Magic Browser'})
con = urllib.request.urlopen(req)
print(con.read())
To debug this, you'll need to trap that exception.
try:
    f = urllib2.urlopen('http://en.wikipedia.org/wiki/OpenCola_(drink)')
except urllib2.HTTPError, e:
    print e.fp.read()
When I print the resulting message, it includes the following:
"English
Our servers are currently experiencing a technical problem. This is probably temporary and should be fixed soon. Please try again in a few minutes."
Websites will often filter access by checking whether the request comes from a recognised user agent. Wikipedia is just treating your script as a bot and rejecting it. Try spoofing as a browser. The following link takes you to an article showing how:
http://wolfprojects.altervista.org/changeua.php
Some websites will block access from scripts, to avoid 'unnecessary' usage of their servers, by reading the headers urllib sends. I don't know and can't imagine why Wikipedia does or would do this, but have you tried spoofing your headers?
As Jochen Ritzel mentioned, Wikipedia blocks bots.
However, bots will not get blocked if they use the MediaWiki API (api.php).
To get the Wikipedia page titled "love":
http://en.wikipedia.org/w/api.php?format=json&action=query&titles=love&prop=revisions&rvprop=content
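For instance, from Python 2 this could be fetched like so (a sketch; the descriptive User-Agent string is a placeholder you should replace with your own script name and contact address):
import json
import urllib2

url = ('http://en.wikipedia.org/w/api.php?format=json&action=query'
       '&titles=love&prop=revisions&rvprop=content')
# a descriptive User-Agent identifies your script instead of hiding it
req = urllib2.Request(url, headers={'User-Agent': 'MyScript/0.1 (me@example.com)'})
data = json.load(urllib2.urlopen(req))
# the article wikitext sits under query -> pages in the JSON response
print data['query']['pages']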
I made a workaround for this using PHP, which is not blocked by the site I needed. It can be accessed like this:
path = 'http://phillippowers.com/redirects/get.php?file=http://website_you_need_to_load.com'
req = urllib2.Request(path)
response = urllib2.urlopen(req)
vdata = response.read()
This will return the HTML to you.
