I'm attempting to connect to a website that requires you to have a specific cookie to access it. For the sake of this question, we'll call the cookie 'required_cookie' and the value 'required_value'.
This is my code:
import urllib.request
import http.cookiejar
cj = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
opener.addheaders = [('required_cookie', 'required_value'), ('User-Agent', 'Mozilla/5.0')]
urllib.request.install_opener(opener)
req = urllib.request.Request('https://www.thewebsite.com/')
webpage = urllib.request.urlopen(req).read()
print(webpage)
I'm new to urllib, so please answer me as a beginner.
To do this with urllib, you need to:
Construct a Cookie object. The constructor isn't documented in the docs, but if you run help(http.cookiejar.Cookie) in the interactive interpreter, you can see that it demands values for all 16 attributes. Notice that the docs say, "It is not expected that users of http.cookiejar construct their own Cookie instances."
Add it to the cookiejar with cj.set_cookie(cookie).
Tell the cookiejar to add the correct headers to the request with cj.add_cookie_header(req).
Assuming you've configured the policy correctly, you're set.
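The steps above can be sketched roughly as follows. This is only a minimal sketch, assuming a host-only cookie for www.thewebsite.com and the default cookie policy. make_cookie is a hypothetical helper, and the attribute values it fills in (version 0, path "/", no expiry) are illustrative assumptions, not something the site is known to require. Nothing is sent over the network here; the code only shows the Cookie header that ends up on the request.

```python
import http.cookiejar
import urllib.request


def make_cookie(name, value, domain, path="/"):
    """Hypothetical helper: build a simple session cookie (assumed attribute values)."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path=path, path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


cj = http.cookiejar.CookieJar()
cj.set_cookie(make_cookie("required_cookie", "required_value", "www.thewebsite.com"))

req = urllib.request.Request(
    "https://www.thewebsite.com/", headers={"User-Agent": "Mozilla/5.0"}
)
# Ask the jar to attach every cookie that matches this request's host and path.
cj.add_cookie_header(req)
print(req.get_header("Cookie"))
```

If the request is opened through an opener built with HTTPCookieProcessor(cj), the processor calls add_cookie_header for you, so the explicit call is only needed when you use the jar by hand.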
But this is a huge pain. As the docs for urllib.request say:
"See also: The Requests package is recommended for a higher-level HTTP client interface."
And, unless you have some good reason you can't install requests, you really should go that way. urllib is tolerable for really simple cases, and it can be handy when you need to get deep under the covers—but for everything else, requests is much better.
With requests, your whole program becomes a one-liner:
webpage = requests.get('https://www.thewebsite.com/', cookies={'required_cookie': 'required_value'}, headers={'User-Agent': 'Mozilla/5.0'}).text
… although it's probably more readable as a few lines:
cookies = {'required_cookie': 'required_value'}
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get('https://www.thewebsite.com/', cookies=cookies, headers=headers)
webpage = response.text
With the help of Kite documentation: https://www.kite.com/python/answers/how-to-add-a-cookie-to-an-http-request-using-urllib-in-python
You can add a cookie this way:
import urllib.request
a_request = urllib.request.Request("http://www.kite.com/")
a_request.add_header("Cookie", "cookiename=cookievalue")
or in a different way:
from urllib.request import Request
url = "https://www.kite.com/"
req = Request(url, headers={'User-Agent': 'Mozilla/5.0', 'Cookie':'myCookie=lovely'})
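If you need to send more than one cookie this way, they all go into a single Cookie header as name=value pairs separated by "; ". Here is a small sketch of that; cookie_header is a hypothetical helper and the second cookie name is made up for illustration. It does no escaping, so it assumes the names and values contain no semicolons or other special characters.

```python
from urllib.request import Request


def cookie_header(cookies):
    """Hypothetical helper: join a dict of cookies into one Cookie header value."""
    return "; ".join(f"{name}={value}" for name, value in cookies.items())


req = Request(
    "https://www.kite.com/",
    headers={
        "User-Agent": "Mozilla/5.0",
        "Cookie": cookie_header({"myCookie": "lovely", "otherCookie": "tasty"}),
    },
)
# Inspect the header that would be sent; no request is made here.
print(req.get_header("Cookie"))
```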
Why am I getting different responses when I use urllib.request.urlopen and requests.get?
import requests
r = requests.get('https://upload.wikimedia.org/wikipedia/commons/1/14/Sunset_Boulevard_%281950_poster%29.jpg')
r.status_code
response 403
from urllib.request import urlopen
r = urlopen('https://upload.wikimedia.org/wikipedia/commons/1/14/Sunset_Boulevard_%281950_poster%29.jpg')
r.getcode()
response 200
First, you could check print(r.content) to see what you get from the server.
It usually contains an explanation that helps identify the problem.
For your code, it points to a problem with the User-Agent header.
Wikipedia: User-Agent policy
Some servers check the User-Agent header to send different content to different systems, browsers, and devices. They also use it to detect scripts, bots, spammers, and hackers and to block them.
If I use the header from a real browser (or at least the short Mozilla/5.0), then it works.
import requests
headers = {'User-Agent': 'Mozilla/5.0'}
url = 'https://upload.wikimedia.org/wikipedia/commons/1/14/Sunset_Boulevard_(1950_poster).jpg'
#url = 'https://upload.wikimedia.org/wikipedia/commons/1/14/Sunset_Boulevard_%281950_poster%29.jpg'
r = requests.get(url, headers=headers)
print(r.status_code)
print(r.content[:100])
with open('image.jpg', 'wb') as fh:
    fh.write(r.content)
EDIT:
After running the code a few times, it started working for me even without User-Agent. Maybe they checked it for some other reason.
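One way to see why the two libraries can be treated differently is to look at the User-Agent each one sends by default. The sketch below only covers the urllib side and uses nothing but the standard library; requests sends its own default of the form python-requests/<version>, so a server that filters on this header can answer the two clients differently. The Mozilla/5.0 replacement is the same short value used above.

```python
import urllib.request

# A fresh opener carries urllib's default headers.
opener = urllib.request.build_opener()
default_headers = dict(opener.addheaders)
print(default_headers)  # the User-agent value has the form Python-urllib/<version>

# Replace the default for every request made through this opener.
opener.addheaders = [("User-Agent", "Mozilla/5.0")]
print(dict(opener.addheaders))
```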
Sometimes it's necessary to change the version attribute when retrieving a request using FancyURLopener, e.g.
from urllib.request import FancyURLopener
class NewOpener(FancyURLopener):
    version = 'Some fancy thing'
url = 'http://www.google.com'
opener = NewOpener().retrieve(url, 'google.html')
Is there an equivalent in the requests library when using requests.get()?
As @Sraw commented, the "version" is basically the User-Agent field in the header, so:
requests.get(url, headers={'User-Agent': 'Some fancy thing'})
I need to accomplish a login task in my own project. Luckily, I found that someone has done it already.
Here is the related code.
import re,urllib,urllib2,cookielib
class Login():
    cj = cookielib.LWPCookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    def __init__(self,name='',password='',domain=''):
        self.name=name
        self.password=password
        self.domain=domain
        urllib2.install_opener(self.opener)
    def login(self):
        params = {'domain':self.domain,'email':self.name,'password':self.password}
        req = urllib2.Request(
            website_url,
            urllib.urlencode(params)
        )
        self.openrate = self.opener.open(req)
        print self.openrate.geturl()
        info = self.openrate.read()
I've tested the code, and it works great (according to info).
Now I want to port it to Python 3 and use the requests library instead of urllib2.
My thoughts:
Since the original code uses an opener, I think its equivalent in requests is requests.Session, though I'm not sure.
Am I supposed to pass in a jar = cookiejar.CookieJar() when making request? Not sure either.
I've tried something like
import requests
from http import cookiejar
from urllib.parse import urlencode
jar = cookiejar.CookieJar()
s = requests.Session()
s.post(
    website_url,
    data = urlencode(params),
    allow_redirects = True,
    cookies = jar
)
Also, following the answer in Putting a `Cookie` in a `CookieJar`, I tried making the same request again, but none of these worked.
That's why I'm here for help.
Will someone show me the right way to do this job? Thank you~
An opener and a Session are not entirely analogous, but for your particular use-case they match perfectly.
You do not need to pass a CookieJar when using a Session: Requests will automatically create one, attach it to the Session, and then persist the cookies to the Session for you.
You don't need to urlencode the data: requests will do that for you.
allow_redirects is True by default, so you don't need to pass that parameter.
Putting all of that together, your code should look like this:
import requests
s = requests.Session()
s.post(website_url, data = params)
Any future requests made using the Session you just created will automatically have cookies applied to them if they are appropriate.
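To make the points above concrete without touching a real site, here is a small sketch that prepares requests through a Session but never sends them. The cookie name, value, domain, URLs, and form fields are all made up for illustration; in real use the cookie would be set by the server's login response rather than by hand.

```python
import requests
from urllib.parse import parse_qs

s = requests.Session()

# Pretend a login response stored this cookie (hypothetical name, value, domain).
s.cookies.set("sessionid", "abc123", domain="example.com", path="/")

# Prepare (but do not send) a later request made through the same Session.
later = s.prepare_request(requests.Request("GET", "https://example.com/account"))
print(later.headers.get("Cookie"))

# A dict passed as data is form-encoded by requests itself; no urlencode needed.
login = s.prepare_request(
    requests.Request(
        "POST",
        "https://example.com/login",
        data={"email": "me@example.com", "password": "secret"},
    )
)
print(parse_qs(login.body))
```

The Session's jar only attaches a cookie when its domain and path match the request, which is what "if they are appropriate" refers to.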
Using urlopen also for url queries seems obvious. What I tried is:
import urllib2
query='http://www.onvista.de/aktien/snapshot.html?ID_OSI=86627'
f = urllib2.urlopen(query)
s = f.read()
f.close()
However, for this specific URL query, it fails with HTTP Error 403: Forbidden.
When I enter this query in my browser, it works.
It also works when I use http://www.httpquery.com/ to submit the query.
Do you have any suggestions on how to use Python correctly to get the right response?
Looks like it requires cookies (which you can handle with urllib2), but an easier way to do this is to use requests:
import requests
session = requests.session()
r = session.get('http://www.onvista.de/aktien/snapshot.html?ID_OSI=86627')
This is generally a much easier and less-stressful method of retrieving URLs in Python.
requests will automatically store and re-use cookies for you. Creating a session is slightly overkill here, but it is useful when you need to submit data to login pages or re-use cookies across a site.
Using urllib2, it is something like:
import urllib2, cookielib
cookies = cookielib.CookieJar()
opener = urllib2.build_opener( urllib2.HTTPCookieProcessor(cookies) )
data = opener.open('url').read()
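On Python 3 the same idea still works, but urllib2 became urllib.request and cookielib became http.cookiejar. A rough sketch of the ported version follows; fetch is a hypothetical helper name, and the User-Agent value is an assumption added because some hosts reject urllib's default one.

```python
import http.cookiejar
import urllib.request

# The jar collects cookies from responses and replays them on later requests.
cookies = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookies))
opener.addheaders = [("User-Agent", "Mozilla/5.0")]  # assumed value


def fetch(url):
    """Open url through the cookie-aware opener and return the body as bytes."""
    with opener.open(url) as response:
        return response.read()
```

Calling fetch(...) twice against the same site sends back, on the second call, whatever cookies the first response set.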
It appears that the urllib2 default user agent is banned by the host. You can simply supply your own user agent string:
import urllib2
url = 'http://www.onvista.de/aktien/snapshot.html?ID_OSI=86627'
request = urllib2.Request(url, headers={"User-Agent" : "MyUserAgent"})
contents = urllib2.urlopen(request).read()
print contents
I am using urllib.urlretrieve in Python to download websites. However, some websites seem to refuse the download unless the request has a proper referrer from their own site. Does anybody know of a way I can set a referrer in one of Python's libraries, or in an external one?
import urllib2
req = urllib2.Request('http://www.example.com/')
req.add_header('Referer', 'http://www.python.org/')
r = urllib2.urlopen(req)
Adapted from http://docs.python.org/library/urllib2.html
urllib makes it hard to send arbitrary headers with the request; you could use urllib2, which lets you build and send a Request object with arbitrary headers (including of course the -- alas sadly spelled;-) -- Referer). It doesn't offer urlretrieve, but it's easy to just urlopen as you wish and copy the resulting file-like object to disk if you want (directly, or e.g. via shutil functions).
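As a sketch of the "urlopen, then copy to disk" approach, here is what it might look like on Python 3, where urllib2's Request and urlopen live in urllib.request. download is a hypothetical helper name, and its parameters are assumptions chosen for illustration.

```python
import shutil
import urllib.request


def download(url, dest_path, referer):
    """Fetch url with a Referer header and stream the body to dest_path."""
    req = urllib.request.Request(url, headers={"Referer": referer})
    # copyfileobj streams in chunks, so large files are not held in memory.
    with urllib.request.urlopen(req) as response, open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest_path
```

Usage would be something like download('http://www.example.com/', 'page.html', 'http://www.python.org/'), which plays the role of urlretrieve with the extra header.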
Also, using urllib2 with build_opener you can do this:
import urllib2
opener = urllib2.build_opener()
opener.addheaders = [('Referer', 'http://www.python.org/')]
opener.open('http://www.example.com/')