FancyURLopener.version equivalence in requests - python

Sometimes it's necessary to change the version attribute when retrieving a URL using FancyURLopener, e.g.
from urllib.request import FancyURLopener

class NewOpener(FancyURLopener):
    version = 'Some fancy thing'

url = 'http://www.google.com'
opener = NewOpener().retrieve(url, 'google.html')
Is there an equivalent in the requests library when using requests.get()?

As #Sraw commented, the "version" is basically the User-Agent field in the header, so:
requests.get(url, headers={'User-agent': 'Some fancy thing'})
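Note that FancyURLopener.retrieve also saves the body to a file, so a fuller equivalent would write the response out yourself. A minimal sketch, reusing the URL and filename from the question:
import requests

response = requests.get('http://www.google.com',
                        headers={'User-agent': 'Some fancy thing'})

# retrieve() downloads to a file; with requests you write the bytes yourself
with open('google.html', 'wb') as f:
    f.write(response.content)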

Related

How to send cookies with urllib

I'm attempting to connect to a website that requires you to have a specific cookie to access it. For the sake of this question, we'll call the cookie 'required_cookie' and the value 'required_value'.
This is my code:
import urllib.request
import http.cookiejar
from urllib.request import Request, urlopen

cj = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
opener.addheaders = [('required_cookie', 'required_value'), ('User-Agent', 'Mozilla/5.0')]
urllib.request.install_opener(opener)

req = Request('https://www.thewebsite.com/')
webpage = urlopen(req).read()
print(webpage)
I'm new to urllib, so please answer me as a beginner.
To do this with urllib, you need to:
1. Construct a Cookie object. The constructor isn't documented in the docs, but if you run help(http.cookiejar.Cookie) in the interactive interpreter, you can see that its constructor demands values for all 16 attributes. Notice that the docs say, "It is not expected that users of http.cookiejar construct their own Cookie instances."
2. Add it to the cookiejar with cj.set_cookie(cookie).
3. Tell the cookiejar to add the correct headers to the request with cj.add_cookie_header(req).
Assuming you've configured the policy correctly, you're set (see the sketch below).
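A minimal sketch of those three steps; every Cookie attribute value below is an assumption chosen to satisfy the default policy for a plain session cookie, so adjust them for the real site:
import http.cookiejar
import urllib.request

# All 16 constructor arguments must be supplied; most of these values
# are guesses for a simple session cookie on www.thewebsite.com.
cookie = http.cookiejar.Cookie(
    version=0, name='required_cookie', value='required_value',
    port=None, port_specified=False,
    domain='www.thewebsite.com', domain_specified=True, domain_initial_dot=False,
    path='/', path_specified=True,
    secure=False, expires=None, discard=True,
    comment=None, comment_url=None, rest={})

cj = http.cookiejar.CookieJar()
cj.set_cookie(cookie)

req = urllib.request.Request('https://www.thewebsite.com/')
cj.add_cookie_header(req)  # writes the Cookie: header onto req
webpage = urllib.request.urlopen(req).read()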
But this is a huge pain. As the docs for urllib.request say:
See also: "The Requests package is recommended for a higher-level HTTP client interface."
And, unless you have some good reason you can't install requests, you really should go that way. urllib is tolerable for really simple cases, and it can be handy when you need to get deep under the covers—but for everything else, requests is much better.
With requests, your whole program becomes a one-liner:
webpage = requests.get('https://www.thewebsite.com/', cookies={'required_cookie': required_value}, headers={'User-Agent': 'Mozilla/5.0'}).text
… although it's probably more readable as a few lines:
cookies = {'required_cookie': required_value}
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get('https://www.thewebsite.com/', cookies=cookies, headers=headers)
webpage = response.text
With the help of the Kite documentation: https://www.kite.com/python/answers/how-to-add-a-cookie-to-an-http-request-using-urllib-in-python
You can add a cookie this way:
import urllib.request

a_request = urllib.request.Request("http://www.kite.com/")
a_request.add_header("Cookie", "cookiename=cookievalue")
or in a different way:
from urllib.request import Request, urlopen

url = "https://www.kite.com/"
req = Request(url, headers={'User-Agent': 'Mozilla/5.0', 'Cookie': 'myCookie=lovely'})
webpage = urlopen(req).read()

Headers, User-Agents, Url Requests in Ubuntu

Currently scraping LINK for products, deploying my script on an Ubuntu server. This site requires you to specify a User-Agent and related URL headers. As I am using Ubuntu and connecting through a proxy server, what should my "hdr" variable be within this script:
import urllib2, sys
from bs4 import BeautifulSoup
import requests
from lxml import etree
import smtplib
url = "http://www.sneakersnstuff.com/en/product/22422/adidas-superstar-80s"
hdr = {'User-Agent': 'Mozilla/5.0'}
req = urllib2.Request(url,headers=hdr)
page = urllib2.urlopen(req)
This script works just fine running off my computer; however, I'm not sure what I would specify as the browser or user-agent for Ubuntu.
The code:
import urllib2, sys
from bs4 import BeautifulSoup
import requests
from lxml import etree
import smtplib
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
url = "http://www.sneakersnstuff.com/en/product/22422/adidas-superstar-80s"
hdr = {'User-Agent': 'Mozilla/5.0'}
req = urllib2.Request(url,headers=hdr)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page, "lxml")
result = soup.find_all("span", {"class":"availability"})
returns the error code: urllib2.HTTPError: HTTP Error 403: Forbidden
But this only occurs on Ubuntu, not on my local machine.
I am not sure about the whole urllib2 thing, but if you are just trying to get the string within the html, you are importing way too much stuff here. For the url you've provided, the following is sufficient to scrape the text:
from bs4 import BeautifulSoup
import requests
As for the user-agent, that depends on whether you want the site maintainer to know about your existence or not; mostly it is not related to the capacity for scraping itself. For some sites you might want to hide your user-agent; for some you might prefer it to be explicitly stated.
For the URL you've provided, the following code worked for me without errors:
from bs4 import BeautifulSoup
import requests
url = "http://www.sneakersnstuff.com/en/product/22422/adidas-superstar-80s"
hdr = {'User-Agent': 'Mozilla/5.0'}
req = requests.Session()
page_raw = req.get(url, headers=hdr)
page_raw.status_code # This was 200
page_raw.encoding = "utf-8" # Just to be sure
page_text = page_raw.text
page_soup = BeautifulSoup(page_text, "lxml")
page_availability = page_soup.find_all("span", class_="availability")
You don't have to specify different User-Agent strings based on the operating system where the script is running. You can leave it as is.
You can go further and start rotating the User-Agent value - for instance, picking it at random from fake-useragent's list of real-world user agents, as sketched below.
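A minimal sketch of that rotation, assuming the third-party fake-useragent package is installed (pip install fake-useragent):
import requests
from fake_useragent import UserAgent

ua = UserAgent()
url = "http://www.sneakersnstuff.com/en/product/22422/adidas-superstar-80s"

# ua.random returns a different real-world User-Agent string on each access
response = requests.get(url, headers={'User-Agent': ua.random})
print(response.status_code)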
You can specify any agent you want. The header is just a string that is part of the HTTP protocol; there is no verification that goes on in the server. Beware that the agent you specify can determine how the HTML served to your request will appear, i.e. older agents might not receive all the information you expect.

Adding authentication header in python 3

Using the urllib2 library and the add_header function, I am able to authenticate and retrieve data in Python 2.7. But since the urllib2 library is no longer present in Python 3, how do I add the Basic Authentication header with the urllib library?
Please check the add_header method of urllib.request's Request class.
import urllib.request
req = urllib.request.Request('http://www.example.com/')
req.add_header('Referer', 'http://www.python.org/')
r = urllib.request.urlopen(req)
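The same add_header call works for Basic auth; a minimal sketch that builds the Authorization header by hand ('user' and 'password' are placeholder credentials):
import base64
import urllib.request

# Basic auth is just 'Basic ' + base64('user:password') in the Authorization header
credentials = base64.b64encode(b'user:password').decode('ascii')
req = urllib.request.Request('http://www.example.com/')
req.add_header('Authorization', 'Basic ' + credentials)
r = urllib.request.urlopen(req)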
By the way, I recommend checking another way, using HTTPBasicAuthHandler:
import urllib.request

# Create an OpenerDirector with support for Basic HTTP Authentication...
auth_handler = urllib.request.HTTPBasicAuthHandler()
auth_handler.add_password(realm='PDQ Application',
                          uri='https://mahler:8092/site-updates.py',
                          user='klem',
                          passwd='kadidd!ehopper')
opener = urllib.request.build_opener(auth_handler)
# ...and install it globally so it can be used with urlopen.
urllib.request.install_opener(opener)
urllib.request.urlopen('http://www.example.com/login.html')
(taken from the same page)
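For comparison, requests handles Basic auth with a single keyword argument; a sketch using the same placeholder credentials as above:
import requests

# requests builds the 'Authorization: Basic ...' header from the auth tuple
r = requests.get('https://mahler:8092/site-updates.py',
                 auth=('klem', 'kadidd!ehopper'))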

Python requests lib, is requests.Session equivalent to urllib2's opener?

I need to accomplish a login task in my own project. Luckily, I found that someone has already done it.
Here is the related code.
import re, urllib, urllib2, cookielib

class Login():
    cj = cookielib.LWPCookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

    def __init__(self, name='', password='', domain=''):
        self.name = name
        self.password = password
        self.domain = domain
        urllib2.install_opener(self.opener)

    def login(self):
        params = {'domain': self.domain, 'email': self.name, 'password': self.password}
        req = urllib2.Request(
            website_url,
            urllib.urlencode(params)
        )
        self.openrate = self.opener.open(req)
        print self.openrate.geturl()
        info = self.openrate.read()
I've tested the code, it works great (according to info).
Now I want to port it to Python 3, as well as use the requests lib instead of urllib2.
My thoughts:
Since the original code uses an opener, I think (though I'm not sure) its equivalent in requests is requests.Session.
Am I supposed to pass in a jar = cookiejar.CookieJar() when making the request? I'm not sure either.
I've tried something like
import requests
from http import cookiejar
from urllib.parse import urlencode

jar = cookiejar.CookieJar()
s = requests.Session()
s.post(
    website_url,
    data=urlencode(params),
    allow_redirects=True,
    cookies=jar
)
Also, following the answer in Putting a `Cookie` in a `CookieJar`, I tried making the same request again, but none of these attempts worked.
That's why I'm here for help.
Will someone show me the right way to do this job? Thank you~
An opener and a Session are not entirely analogous, but for your particular use-case they match perfectly.
You do not need to pass a CookieJar when using a Session: Requests will automatically create one, attach it to the Session, and then persist the cookies to the Session for you.
You don't need to urlencode the data: requests will do that for you.
allow_redirects is True by default, you don't need to pass that parameter.
Putting all of that together, your code should look like this:
import requests
s = requests.Session()
s.post(website_url, data=params)
Any future requests made using the Session you just created will automatically have cookies applied to them if they are appropriate.
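To illustrate that last point, a sketch of a follow-up request on the same Session (website_url and params are the placeholders from the question, and the /profile path is made up):
import requests

s = requests.Session()
s.post(website_url, data=params)     # login; response cookies are stored on s
r = s.get(website_url + '/profile')  # Cookie header is added automatically
print(s.cookies.get_dict())          # inspect what the session is holding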

setting referral url in python urllib.urlretrieve

I am using urllib.urlretrieve in Python to download websites. However, some websites seem to not want me to download them unless the request has a proper referrer from their own site. Does anybody know of a way I can set a referrer, either in one of Python's libraries or in an external one?
import urllib2
req = urllib2.Request('http://www.example.com/')
req.add_header('Referer', 'http://www.python.org/')
r = urllib2.urlopen(req)
adapted from http://docs.python.org/library/urllib2.html
urllib makes it hard to send arbitrary headers with the request; you could use urllib2, which lets you build and send a Request object with arbitrary headers (including of course the, alas sadly spelled, Referer). It doesn't offer urlretrieve, but it's easy to just urlopen as you wish and copy the resulting file-like object to disk (directly, or e.g. via shutil functions), as sketched below.
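A minimal sketch of that urlretrieve replacement (Python 2, to match the answer's urllib2; the URLs and filename are placeholders):
import shutil
import urllib2

req = urllib2.Request('http://www.example.com/')
req.add_header('Referer', 'http://www.python.org/')
response = urllib2.urlopen(req)

# copy the file-like response object to disk, as urlretrieve would
with open('example.html', 'wb') as out:
    shutil.copyfileobj(response, out)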
Also, using urllib2 with build_opener you can do this:
import urllib2
opener = urllib2.build_opener()
opener.addheaders = [('Referer', 'http://www.python.org/')]
opener.open('http://www.example.com/')
