I have learned a lot from MOOCs, so I wanted to give something back. To that end I was thinking of designing a small app in Kivy, which requires a Python implementation. What I want to achieve is to log in to my Coursera account from a program and collect information about the courses I am currently taking. First I have to log in to Coursera ( https://accounts.coursera.org/signin?post_redirect=https%3A%2F%2Fwww.coursera.org%2F ). Searching the web, I came across this piece of code:
import urllib2, cookielib, urllib
username = "abcdef#abcdef.com"
password = "uvwxyz"
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
login_data = urllib.urlencode({'username' : username, 'password' : password})
info = opener.open("https://accounts.coursera.org/signin",login_data)
for line in info:
    print line
I tried some similar snippets as well, but none worked for me. Every approach led to this type of error:
Traceback (most recent call last):
File "C:\Python27\Practice\web programming\coursera login.py", line 9, in <module>
info = opener.open("https://accounts.coursera.org/signin",login_data)
File "C:\Python27\lib\urllib2.py", line 410, in open
response = meth(req, response)
File "C:\Python27\lib\urllib2.py", line 523, in http_response
'http', request, response, code, msg, hdrs)
File "C:\Python27\lib\urllib2.py", line 448, in error
return self._call_chain(*args)
File "C:\Python27\lib\urllib2.py", line 382, in _call_chain
result = func(*args)
File "C:\Python27\lib\urllib2.py", line 531, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 404: Not Found
Is the error due to the HTTPS protocol, or is there something I am missing?
I don't want to use any 3rd party libraries.
I use requests for this purpose, and I think it is a great Python library. Here is some example code showing how it could work:
import requests
from requests.auth import HTTPBasicAuth
credentials = HTTPBasicAuth('username', 'password')
response = requests.get("https://accounts.coursera.org/signin", auth=credentials)
print response.status_code
# prints 200 if everything went fine
Here is the link to requests:
http://docs.python-requests.org/en/latest/
I think you need to use the HTTPBasicAuthHandler class of urllib2. Check the 'Basic Authentication' section of https://docs.python.org/2/howto/urllib2.html
I also strongly recommend the requests module. It will make your code cleaner: http://docs.python-requests.org/en/latest/
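For anyone who wants to stay with the standard library, as the question asks, here is a minimal Python 3 sketch of the same cookie-aware POST. The sign-in URL and the form field names come from the question's snippet and are assumptions: the real sign-in form may post to a different endpoint or expect extra fields, which could be why the plain POST came back as 404. The helper names make_session, build_login_request, and login are hypothetical.

```python
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar


def make_session():
    """Return an opener that stores and resends cookies, plus its jar."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def build_login_request(url, username, password):
    """Build a form-encoded POST; the field names are assumptions."""
    body = urllib.parse.urlencode({"username": username, "password": password})
    # Passing data makes urllib send a POST; it must be bytes in Python 3.
    return urllib.request.Request(url, data=body.encode("ascii"))


def login(url, username, password, timeout=10):
    """Send the POST and return (status, body, cookie jar)."""
    opener, jar = make_session()
    request = build_login_request(url, username, password)
    # A 4xx/5xx answer raises urllib.error.HTTPError here.
    with opener.open(request, timeout=timeout) as response:
        return response.status, response.read(), jar
```

Keeping the opener and its jar together means any later request made through the same opener sends back the cookies the sign-in response set.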
I'm trying to make a terminal app that crawls a website and returns the current time for an entered city name. This is my code so far:
import re
import urllib.request
city = input('Enter city name: ')
url = 'https://time.is/'
rawData = urllib.request.urlopen(url).read()
decodedData = rawData.decode('utf-8')
print(decodedData)
When I run it, I get this error:
Traceback (most recent call last):
File "<pyshell#13>", line 1, in <module>
rawData = urllib.request.urlopen(url).read()
File "~/Python\Python35-32\lib\urllib\request.py", line 163, in urlopen
return opener.open(url, data, timeout)
File "~/Python\Python35-32\lib\urllib\request.py", line 472, in open
response = meth(req, response)
File "~/Python\Python35-32\lib\urllib\request.py", line 582, in http_response
'http', request, response, code, msg, hdrs)
File "~/Python\Python35-32\lib\urllib\request.py", line 510, in error
return self._call_chain(*args)
File "~/Python\Python35-32\lib\urllib\request.py", line 444, in _call_chain
result = func(*args)
File "~/Python\Python35-32\lib\urllib\request.py", line 590, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
Why do I get this error? What's wrong?
[EDIT]
The reason is that time.is bans requests from scripts. Always remember to read the terms and conditions when web scraping. Free APIs that do the same job can be found, too.
When this happens, I usually open the debugger and try to find out what's being called when I access the website. It seems that time.is doesn't like scripts calling its website.
A quick search yielded this:
1532027279136 0 161_(UTC,_UTC+00:00) 1532027279104
Time.is is for humans. To use from scripts and apps, please ask about our API. Thank you!
Here are some APIs you could use to build your project. https://www.programmableweb.com/category/time/api
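The debugging step described above, looking at what the server actually sends back, can also be done from Python by catching the HTTPError and reading its body. This is only a minimal Python 3 sketch; describe_http_error and fetch_text are hypothetical helper names, not part of urllib.

```python
import urllib.error
import urllib.request


def describe_http_error(err, limit=200):
    """Summarise an HTTPError, including the start of the response body."""
    body = err.read().decode("utf-8", errors="replace").strip()
    return "{} {}: {}".format(err.code, err.reason, body[:limit])


def fetch_text(url, timeout=10):
    """Return the page text, or a description of the HTTP error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        # The error object doubles as a response, so its body can be read.
        return describe_http_error(err)
```

Calling print(fetch_text('https://time.is/')) would then show the site's own explanation instead of a bare traceback, assuming the site still answers scripts with an error page.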
I can download files from a server I control in one way: by passing the document ID into a link like so:
https://website/deployLink/442/document/download/$NUMBER
If I navigate to this in my browser, it downloads the file with ID $NUMBER.
The problem is, I have 9,000 files on my server, which is SSL encrypted and usually requires signing in with a username/password on a dialog box popup which appears on the web-page.
I already posted a similar thread in which I downloaded the files via wget. Now I would like to use Python, providing the username/password and getting through the SSL encryption.
Here is my attempt to grab one file, which results in a 401 error. Full stacktrace below.
import urllib2
import ctypes
from HTMLParser import HTMLParser
# create a password manager
password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
# Add the username and password.
top_level_url = "https://website.com/home.html"
password_mgr.add_password(None, top_level_url, "admin", "password")
handler = urllib2.HTTPBasicAuthHandler(password_mgr)
# create "opener" (OpenerDirector instance)
opener = urllib2.build_opener(handler)
# Install the opener.
# Now all calls to urllib2.urlopen use our opener.
urllib2.install_opener(opener)
# Grab website
response = urllib2.urlopen('https://website/deployLink/442/document/download/1')
html = response.read()
class MyHTMLParser(HTMLParser):
    url = 'https://website/deployLink/442/document/download/1'

# Save the file
webpage = urllib2.urlopen(MyHTMLParser.url)
with open('Test.doc', 'wb') as localFile:
    localFile.write(webpage.read())
What have I done incorrectly here? Is what I am attempting possible?
C:\Python27\python.exe C:/Users/ADMIN/PycharmProjects/GetFile.py
Traceback (most recent call last):
File "C:/Users/ADMIN/PycharmProjects/GetFile.py", line 22, in <module>
response = urllib2.urlopen('https://website/deployLink/442/document/download/1')
File "C:\Python27\lib\urllib2.py", line 154, in urlopen
return opener.open(url, data, timeout)
File "C:\Python27\lib\urllib2.py", line 437, in open
response = meth(req, response)
File "C:\Python27\lib\urllib2.py", line 550, in http_response
'http', request, response, code, msg, hdrs)
File "C:\Python27\lib\urllib2.py", line 475, in error
return self._call_chain(*args)
File "C:\Python27\lib\urllib2.py", line 409, in _call_chain
result = func(*args)
File "C:\Python27\lib\urllib2.py", line 558, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 401: Processed
Process finished with exit code 1
Here's my authentication page with some info removed for privacy:
The authentication URL ends in :443.
Assuming your code above is accurate, I think your problem is related to the URIs in your add_password call. You have this when setting up the username/password:
# Add the username and password.
top_level_url = "https://website.com/home.html"
password_mgr.add_password(None, top_level_url, "admin", "password")
handler = urllib2.HTTPBasicAuthHandler(password_mgr)
And then your subsequent request goes to this URI:
# Grab website
response = urllib2.urlopen('https://website/deployLink/442/document/download/1')
(I'm assuming the hosts were "scrubbed" inconsistently, "website" vs. "website.com", and should be the same, so I'll move on.)
The second URI is not a child of the first URI based on their respective path portions. The URI path /deployLink/442/document/download/1 is not a child of /home.html. From the perspective of the library, you'd have no auth data for the second URI.
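The prefix matching can be checked without touching the network by asking the password manager directly. Below is a small Python 3 sketch (urllib.request is the Python 3 home of the urllib2 classes). It assumes both URLs share the host website.com, which is a placeholder from the question, and credentials_for is a hypothetical helper.

```python
from urllib.request import HTTPPasswordMgrWithDefaultRealm

DOWNLOAD_URL = "https://website.com/deployLink/442/document/download/1"


def credentials_for(registered_uri, request_url):
    """Return the (user, password) the manager would find for request_url."""
    manager = HTTPPasswordMgrWithDefaultRealm()
    manager.add_password(None, registered_uri, "admin", "password")
    return manager.find_user_password(None, request_url)


# Registered under /home.html: the download path is not below it.
print(credentials_for("https://website.com/home.html", DOWNLOAD_URL))
# -> (None, None)

# Registered under the site root: every path on the host is covered.
print(credentials_for("https://website.com/", DOWNLOAD_URL))
# -> ('admin', 'password')
```

So registering the site root (or a common parent path such as /deployLink/) as top_level_url should let the handler find the credentials for the download URL.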
I have a problem accessing a specific website.
The website automatically redirects to a check page that displays "check your browser".
The check page returns an HTTP 503 error the first time.
Then the web browser (Chrome, IE, etc.) automatically accesses it again.
Finally I can get into the website.
The problem is that I want to access the site from Python.
So I tried both urllib and urllib2.
u = urllib.urlopen(url)
print u.read()
urllib2 does the same, but it fails by raising a 503 error.
urllib also gets the HTTP 503 code, but it doesn't raise an error.
So I need to access the page again without changing the cookie:
u = urllib.urlopen(url)
u = urllib.urlopen(url) ## cookie is changed
print u.read()
I simply tried calling the open function twice, but the cookie changes and it doesn't work
(the check page appears again).
So I used urllib2 with cookielib:
import os.path
import cookielib
import urllib2

cj = cookielib.LWPCookieJar()
if os.path.isfile('cookie.lpw'):
    cj.load('cookie.lpw')
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)
theurl = url
txdata = None
txheaders = {'User-Agent' : 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
req = urllib2.Request(theurl, txdata, txheaders)
handle = urllib2.urlopen(req) ## error raised
Here is the error:
Traceback (most recent call last):
File "<pyshell#20>", line 1, in <module>
handle = urlopen(req)
File "C:\Python27\lib\urllib2.py", line 127, in urlopen
return _opener.open(url, data, timeout)
File "C:\Python27\lib\urllib2.py", line 410, in open
response = meth(req, response)
File "C:\Python27\lib\urllib2.py", line 523, in http_response
'http', request, response, code, msg, hdrs)
File "C:\Python27\lib\urllib2.py", line 448, in error
return self._call_chain(*args)
File "C:\Python27\lib\urllib2.py", line 382, in _call_chain
result = func(*args)
File "C:\Python27\lib\urllib2.py", line 531, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 503: Service Temporarily Unavailable
I simply want to re-access the site after getting the HTTP 503 error, without the cookies changing.
But I don't know how to do it.
Can somebody please help me?
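One way to retry without losing cookies is to catch the 503 and call the same opener again, since the cookie jar belongs to the opener rather than to a single request. Here is a minimal Python 3 sketch of that idea; make_cookie_opener and open_with_retry are hypothetical names, and the User-Agent value is just a placeholder. As far as I know the cookie processor runs before the error is raised, so cookies set by the 503 response should already be in the jar. Note that many "check your browser" pages also require JavaScript, in which case a plain retry may still not get through.

```python
import time
import urllib.error
import urllib.request
from http.cookiejar import CookieJar


def make_cookie_opener():
    """Build one opener whose cookie jar is shared by every request."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    opener.addheaders = [("User-Agent", "Mozilla/5.0")]
    return opener


def open_with_retry(opener, url, attempts=3, delay=5.0):
    """Retry on HTTP 503 with the same opener, so cookies are kept."""
    for attempt in range(1, attempts + 1):
        try:
            return opener.open(url)
        except urllib.error.HTTPError as err:
            # Give up on other errors, or once the last attempt has failed.
            if err.code != 503 or attempt == attempts:
                raise
            time.sleep(delay)  # wait before asking again
```

Usage would look like handle = open_with_retry(make_cookie_opener(), url).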
I have the following python script and it works beautifully.
import urllib2
url = 'http://abc.com' # write the url here
usock = urllib2.urlopen(url)
data = usock.read()
usock.close()
print data
However, some of the URLs I give it may redirect two or more times. How can I have Python wait for the redirects to complete before loading the data?
For instance when using the above code with
http://www.google.com/search?hl=en&q=KEYWORD&btnI=1
which is the equivalent of hitting the "I'm Feeling Lucky" button on a Google search, I get:
>>> url = 'http://www.google.com/search?hl=en&q=KEYWORD&btnI=1'
>>> usick = urllib2.urlopen(url)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 400, in open
response = meth(req, response)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 513, in http_response
'http', request, response, code, msg, hdrs)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 438, in error
return self._call_chain(*args)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 372, in _call_chain
result = func(*args)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 521, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 403: Forbidden
>>>
I've tried the (url, data, timeout) arguments; however, I am unsure what to put there.
EDIT:
I actually found out that if I don't follow the redirect and just read the headers of the first response, I can grab the location of the next redirect and use that as my final link.
Use requests, as the other answer states; here is an example. The final URL after redirects will be in r.url. In the example below, http is redirected to https.
For HEAD:
In [1]: import requests
...: r = requests.head('http://github.com', allow_redirects=True)
...: r.url
Out[1]: 'https://github.com/'
For GET:
In [1]: import requests
...: r = requests.get('http://github.com')
...: r.url
Out[1]: 'https://github.com/'
Note that for HEAD you have to specify allow_redirects; if you don't, you can still get the target from the headers, but this is not advised.
In [1]: import requests
In [2]: r = requests.head('http://github.com')
In [3]: r.headers.get('location')
Out[3]: 'https://github.com/'
To download the page you will need GET; you can then access the page body using r.content.
You might be better off with the Requests library, which has a better API for controlling redirect handling:
https://requests.readthedocs.io/en/master/user/quickstart/#redirection-and-history
Requests:
https://pypi.org/project/requests/ (urllib replacement for humans)
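The approach from the question's EDIT, reading the Location header of the first response instead of following it, can also be sketched with the standard library alone. This is a Python 3 sketch under my own naming: NoRedirect, resolve_location, and first_redirect_target are hypothetical, not urllib names. It relies on redirect_request returning None, which makes urllib report the 3xx response as an HTTPError instead of following it.

```python
import urllib.error
import urllib.parse
import urllib.request


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Redirect handler that refuses to follow any redirect."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # the 3xx response then surfaces as an HTTPError


def resolve_location(request_url, location):
    """Turn a possibly relative Location header into an absolute URL."""
    return urllib.parse.urljoin(request_url, location)


def first_redirect_target(url, timeout=10):
    """Return where url redirects to, or url itself if it does not."""
    opener = urllib.request.build_opener(NoRedirect)
    try:
        with opener.open(url, timeout=timeout) as response:
            return response.geturl()
    except urllib.error.HTTPError as err:
        location = err.headers.get("Location")
        if err.code in (301, 302, 303, 307, 308) and location:
            return resolve_location(url, location)
        raise
```

Without the custom handler, urlopen already follows redirects on its own, and the final address is available from response.geturl().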
I have a strange bug when trying to urlopen a certain page from Wikipedia. This is the page:
http://en.wikipedia.org/wiki/OpenCola_(drink)
This is the shell session:
>>> f = urllib2.urlopen('http://en.wikipedia.org/wiki/OpenCola_(drink)')
Traceback (most recent call last):
File "C:\Program Files\Wing IDE 4.0\src\debug\tserver\_sandbox.py", line 1, in <module>
# Used internally for debug sandbox under external interpreter
File "c:\Python26\Lib\urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
File "c:\Python26\Lib\urllib2.py", line 397, in open
response = meth(req, response)
File "c:\Python26\Lib\urllib2.py", line 510, in http_response
'http', request, response, code, msg, hdrs)
File "c:\Python26\Lib\urllib2.py", line 435, in error
return self._call_chain(*args)
File "c:\Python26\Lib\urllib2.py", line 369, in _call_chain
result = func(*args)
File "c:\Python26\Lib\urllib2.py", line 518, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 403: Forbidden
This happened to me on two different systems in different continents. Does anyone have an idea why this happens?
Wikipedia's stance is:
Data retrieval: Bots may not be used
to retrieve bulk content for any use
not directly related to an approved
bot task. This includes dynamically
loading pages from another website,
which may result in the website being
blacklisted and permanently denied
access. If you would like to download
bulk content or mirror a project,
please do so by downloading or hosting
your own copy of our database.
That is why Python is blocked. You're supposed to download data dumps.
Anyway, you can read pages like this in Python 2:
req = urllib2.Request(url, headers={'User-Agent' : "Magic Browser"})
con = urllib2.urlopen( req )
print con.read()
Or in Python 3:
import urllib.request
req = urllib.request.Request(url, headers={'User-Agent' : "Magic Browser"})
con = urllib.request.urlopen( req )
print(con.read())
To debug this, you'll need to trap that exception.
try:
    f = urllib2.urlopen('http://en.wikipedia.org/wiki/OpenCola_(drink)')
except urllib2.HTTPError, e:
    print e.fp.read()
When I print the resulting message, it includes the following
"English
Our servers are currently experiencing
a technical problem. This is probably
temporary and should be fixed soon.
Please try again in a few minutes. "
Websites often filter access by checking whether they are being accessed by a recognised user agent. Wikipedia is just treating your script as a bot and rejecting it. Try spoofing a browser. The following link takes you to an article that shows you how.
http://wolfprojects.altervista.org/changeua.php
Some websites block access from scripts, to avoid 'unnecessary' usage of their servers, by reading the headers urllib sends. I don't know and can't imagine why Wikipedia would do this, but have you tried spoofing your headers?
As Jochen Ritzel mentioned, Wikipedia blocks bots.
However, bots will not get blocked if they use the api.php endpoint (the MediaWiki API).
To get the Wikipedia page titled "love":
http://en.wikipedia.org/w/api.php?format=json&action=query&titles=love&prop=revisions&rvprop=content
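The same API call can be built from Python. Here is a minimal Python 3 sketch that uses the query parameters from the URL above together with a descriptive User-Agent header, as the other answers suggest. The USER_AGENT string is a made-up placeholder, the helper names are hypothetical, and I have assumed the https form of the endpoint.

```python
import json
import urllib.parse
import urllib.request

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"
# Hypothetical identifier; replace it with your own tool name and contact.
USER_AGENT = "ExampleFetcher/0.1 (contact: you@example.com)"


def build_api_request(title):
    """Build a GET request for the latest revision content of one page."""
    params = {
        "format": "json",
        "action": "query",
        "titles": title,
        "prop": "revisions",
        "rvprop": "content",
    }
    # urlencode escapes spaces and parentheses in titles such as
    # "OpenCola (drink)".
    url = API_ENDPOINT + "?" + urllib.parse.urlencode(params)
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def fetch_page_data(title, timeout=10):
    """Send the request and decode the JSON answer into a dict."""
    request = build_api_request(title)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

For the page from the question, fetch_page_data('OpenCola (drink)') would be the call to try.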
I made a workaround for this using PHP, which is not blocked by the site I needed.
It can be accessed like this:
path = 'http://phillippowers.com/redirects/get.php?file=http://website_you_need_to_load.com'
req = urllib2.Request(path)
response = urllib2.urlopen(req)
vdata = response.read()
This will return the HTML code to you.