How to get cookies from urllib.request?

How can I get the cookies from a urllib.request response?
import urllib.request
import urllib.parse

data = urllib.parse.urlencode({
    'user': 'user',
    'pass': 'pass'
})
data = data.encode('utf-8')
request = urllib.request.urlopen('http://example.com', data)
print(request.info())
request.info() returns the cookies, but not in a very usable way.

response.info() is a dict-like object, so you can parse out whatever info you need. Here is a demo written in Python 3:
from urllib import request
from urllib.error import HTTPError

# declare url and header_params beforehand; this snippet lives inside a function
req = request.Request(url, data=None, headers=header_params, method='GET')
try:
    response = request.urlopen(req)
    cookie = response.info().get_all('Set-Cookie')
    content_type = response.info()['Content-Type']
except HTTPError as err:
    print("err status: {0}".format(err))
    return
You can now parse the cookie variable according to your application's requirements.
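For example, a minimal sketch (assuming cookie holds the list of Set-Cookie header strings from above) that parses them with the standard-library http.cookies module:
from http.cookies import SimpleCookie

# cookie is the list returned by response.info().get_all('Set-Cookie') above
jar = SimpleCookie()
for header in cookie or []:
    jar.load(header)  # parse one Set-Cookie header string (common attributes only)

for name, morsel in jar.items():
    print(name, '=', morsel.value)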

I just used the following code to get the cookie from Python Challenge #17; hope it helps (Python 3.8 being used):
import http.cookiejar
import urllib.request

cookiejar = http.cookiejar.CookieJar()
cookieproc = urllib.request.HTTPCookieProcessor(cookiejar)
opener = urllib.request.build_opener(cookieproc)

response = opener.open(url)  # url is the page whose cookies you want
for cookie in cookiejar:
    print(cookie.name, cookie.value)
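If you want every later urllib.request.urlopen() call to share the same cookie jar, one option (a minimal sketch building on the opener above) is to install it globally:
import urllib.request

# make the cookie-aware opener the default used by urlopen()
urllib.request.install_opener(opener)

# subsequent requests now send and store cookies automatically
response = urllib.request.urlopen('http://example.com')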

I think using the requests package is a much better choice these days. Try this sample code that shows Google setting cookies when you visit:
import requests

url = "http://www.google.com"
r = requests.get(url, timeout=5)

if r.status_code == 200:
    for cookie in r.cookies:
        print(cookie)  # Use "print cookie" if you use Python 2.
Gives:
Cookie NID=67=n0l3ME1Jl3-wwlH7oE5pvxJ_CfU12hT5Kh65wh21bvE3hrKFAo1sJVj_UcuLCr76Ubi3yxENROaYNEitdgW4IttL43YZGlf8xAPl1IbzoLG31KP5U2tiP2y4DzVOJ2fA for .google.se/
Cookie PREF=ID=ce66d1288fc0d977:FF=0:TM=1407525509:LM=1407525509:S=LxQv7q8fju-iHJPZ for .google.se/
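If you just want plain name/value pairs, the requests cookie jar also exposes a get_dict() helper; a small sketch:
import requests

r = requests.get("http://www.google.com", timeout=5)
print(r.cookies.get_dict())  # e.g. {'NID': '...', 'PREF': '...'}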


How to set params in Python requests library

I have the following code using urllib in Python 2.7 and it's working. I'm trying to do the same request using the requests library, but I can't get it to work.
import urllib
import urllib2
import json
req = urllib2.Request(url='https://testone.limequery.com/index.php/admin/remotecontrol',\
data='{\"method\":\"get_session_key\",\"params\":[\"username\",\"password\"],\"id\":1}')
req.add_header('content-type', 'application/json')
req.add_header('connection', 'Keep-Alive')
f = urllib2.urlopen(req)
myretun = f.read()
j=json.loads(myretun)
print(j['result'])
Using the requests library (doesn't work):
import requests
import json
d= {"method":"get_session_key","params":["username","password"],"id":"1"}
headers = {'content-type' :'application/json','connection': 'Keep-Alive'}
req2 = requests.get(url='https://testone.limequery.com/index.php/admin/remotecontrol',data=d,headers=headers)
json_data = json.loads(req2.text)
print(json_data['result'])
I'm getting the error JSONDecodeError: Expecting value: line 1 column 1 (char 0). How can I make the code work with the requests library?
First, you're sending the wrong type of request. You're sending a GET request, but you need to send a POST, with requests.post.
Second, passing a dict as data will form-encode the data rather than JSON-encoding it. If you want to use JSON in your request body, use the json argument, not data:
requests.post(url=..., json=d)
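Putting both fixes together, a minimal sketch of the corrected call (URL and credentials are the placeholders from the question):
import requests

d = {"method": "get_session_key", "params": ["username", "password"], "id": 1}
headers = {'connection': 'Keep-Alive'}  # Content-Type is set automatically when json= is used

r = requests.post(url='https://testone.limequery.com/index.php/admin/remotecontrol',
                  json=d, headers=headers)
print(r.json()['result'])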
Reference Link: http://docs.python-requests.org/en/master/api/
You can use the requests module of Python like so:
import requests

Req = requests.request(
    method="GET",  # or "POST", "PUT", "DELETE", "PATCH", etc.
    url="http(s)://*",
    params={"key": "value"},  # if GET request (optional)
    data={"key": "value"},  # if POST request (optional)
    headers={"header_name": "header_value"}  # (optional)
)
print(Req.content)
You can wrap the call in a try/except block like below to catch any exception thrown by the requests module:
try:
    Req = requests.request(method="GET", url="http(s)://*")  # arguments as above
except requests.exceptions.RequestException as e:
    print(e)
For full argument list, please check reference link.

What's wrong with my requests.Session for python crawler?

I'm coding a crawler for www.researchgate.net, but it seems that I'll be stuck in the login page forever.
Here's my code:
import requests
from bs4 import BeautifulSoup
session = requests.Session()
params = {'login': 'my_email', 'password': 'my_password'}
session.post("https://www.researchgate.net/application.Login.html", data = params)
s = session.get("https://www.researchgate.net/search.Search.html?type=researcher&query=zhang")
print BeautifulSoup(s.text).title
Can anybody find anything wrong with my code? Why does s redirect to login page every time?
There are hidden fields in the login form that probably need to be supplied (I can't test - I don't have a login there).
One is request_token which is set to a long base64 encoded string. Others are invalidPasswordCount and loginCookie which might also be required.
Further to that there is a session cookie that you might need to send with the login credentials.
To make this work will require an initial GET to get the request_token, which you need to extract somehow - e.g. with BeautifulSoup. If you use your requests session then the cookie will be presented in the following POST, so you shouldn't need to worry about that.
import requests
from bs4 import BeautifulSoup

session = requests.Session()

# initial GET to retrieve token and set cookies
r = session.get('https://www.researchgate.net/application.Login.html')
soup = BeautifulSoup(r.text)
request_token = soup.find('input', attrs={'name': 'request_token'})['value']

params = {'login': 'my_email', 'password': 'my_password',
          'request_token': request_token, 'invalidPasswordCount': 0,
          'loginCookie': 'yes'}
session.post("https://www.researchgate.net/application.Login.html", data=params)

s = session.get("https://www.researchgate.net/search.Search.html?type=researcher&query=zhang")
print BeautifulSoup(s.text).title
Thanks to mhawke; I modified my original code as he suggested and I finally logged in successfully.
Here's my new code:
import requests
from bs4 import BeautifulSoup
session = requests.Session()
loginpage = session.get("https://www.researchgate.net/application.Login.html")
request_token = BeautifulSoup(loginpage.text).form.find("input",{"name":"request_token"}).attrs["value"]
print request_token
params = {"request_token":request_token,
"invalidPasswordCount":"0",
'login': 'my_email',
'password': 'my_password',
"setLoginCookie":"yes"
}
session.post("https://www.researchgate.net/application.Login.html", data = params)
#print s.cookies.get_dict()
s = session.get("https://www.researchgate.net/search.Search.html?type=researcher&query=zhang")
print BeautifulSoup(s.text).title

HTTPCookieProcessor not serving cookies

I am trying to access a website that requires cookies. Using urllib2 and cookielib I am able to get a response from the site. The HTML printout informs me that I am not getting access with the line:
<h2>Cookies Disabled</h2>
<p> class="share-prompt"><strong>Cookies must be enabled.</strong></p>
I cannot understand where I am going wrong. Code below:
import urllib2, cookielib
cookieJar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.ProxyHandler({'http':"http://216.208.156.69:3128"}),urllib2.HTTPCookieProcessor(cookieJar))
request = urllib2.Request("[website]")
response = opener.open(request)
print response.read()
Can anyone see where I have gone wrong?
Cheers,
The code looks good. For example, the output from this
import urllib, urllib2, cookielib
cookieJar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookieJar))
params = urllib.urlencode({'cookie_name': 'cookie_value'})
request = urllib2.Request('http://httpbin.org/cookies/set?' + params)
opener.open(request)
request = urllib2.Request('http://httpbin.org/cookies')
response = opener.open(request)
print response.read()
is
{
  "cookies": {
    "cookie_name": "cookie_value"
  }
}
Without seeing the actual URL you're using, not much more can be said.
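As a further sanity check on the original snippet, you can print what actually landed in the cookie jar after the request; a minimal sketch (the "[website]" placeholder is from the question):
import urllib2, cookielib

cookieJar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookieJar))
response = opener.open(urllib2.Request("[website]"))

# if this prints nothing, the server never sent a usable Set-Cookie header
for cookie in cookieJar:
    print cookie.name, cookie.value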

How can I unshorten a URL?

I want to be able to take a shortened or non-shortened URL and return its un-shortened form. How can I make a python program to do this?
Additional Clarification:
Case 1: shortened --> unshortened
Case 2: unshortened --> unshortened
e.g. bit.ly/silly in the input array should be google.com in the output array
e.g. google.com in the input array should be google.com in the output array
Send an HTTP HEAD request to the URL and look at the response code. If the code is 30x, look at the Location header to get the unshortened URL. Otherwise, if the code is 20x, then the URL is not redirected; you probably also want to handle error codes (4xx and 5xx) in some fashion. For example:
# This is for Py2k. For Py3k, use http.client and urllib.parse instead, and
# use // instead of / for the division
import httplib
import urlparse

def unshorten_url(url):
    parsed = urlparse.urlparse(url)
    h = httplib.HTTPConnection(parsed.netloc)
    h.request('HEAD', parsed.path)
    response = h.getresponse()
    if response.status/100 == 3 and response.getheader('Location'):
        return response.getheader('Location')
    else:
        return url
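For completeness, here is a rough Python 3 translation of the same idea (a sketch; only the module names and the integer division change):
import http.client
import urllib.parse

def unshorten_url(url):
    parsed = urllib.parse.urlparse(url)
    h = http.client.HTTPConnection(parsed.netloc)
    h.request('HEAD', parsed.path or '/')
    response = h.getresponse()
    if response.status // 100 == 3 and response.getheader('Location'):
        return response.getheader('Location')
    return url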
Using requests:
import requests
session = requests.Session() # so connections are recycled
resp = session.head(url, allow_redirects=True)
print(resp.url)
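If you also want to see the intermediate hops, the response keeps them in resp.history; a small sketch:
import requests

resp = requests.head("http://bit.ly/silly", allow_redirects=True)
for hop in resp.history:  # each redirect response, in order
    print(hop.status_code, hop.url)
print("final:", resp.url)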
Unshorten.me has an api that lets you send a JSON or XML request and get the full URL returned.
If you are using Python 3.5+ you can use the Unshortenit module that makes this very easy:
from unshortenit import UnshortenIt
unshortener = UnshortenIt()
uri = unshortener.unshorten('https://href.li/?https://example.com')
Open the url and see what it resolves to:
>>> import urllib2
>>> a = urllib2.urlopen('http://bit.ly/cXEInp')
>>> print a.url
http://www.flickr.com/photos/26432908@N00/346615997/sizes/l/
>>> a = urllib2.urlopen('http://google.com')
>>> print a.url
http://www.google.com/
To unshorten a URL, you can use requests. This is a simple solution that works for me.
import requests
url = "http://foo.com"
site = requests.get(url)
print(site.url)
http://github.com/stef/urlclean
sudo pip install urlclean
urlclean.unshorten(url)
Here is some source code that takes into account almost all of the useful corner cases:
set a custom timeout.
set a custom User-Agent.
check whether we have to use an http or https connection.
resolve the input url recursively and prevent ending up in a loop.
The source code is on GitHub at https://github.com/amirkrifa/UnShortenUrl
Comments are welcome...
import logging
logging.basicConfig(level=logging.DEBUG)

TIMEOUT = 10

class UnShortenUrl:
    def process(self, url, previous_url=None):
        logging.info('Init url: %s' % url)
        import urlparse
        import httplib
        try:
            parsed = urlparse.urlparse(url)
            if parsed.scheme == 'https':
                h = httplib.HTTPSConnection(parsed.netloc, timeout=TIMEOUT)
            else:
                h = httplib.HTTPConnection(parsed.netloc, timeout=TIMEOUT)
            resource = parsed.path
            if parsed.query != "":
                resource += "?" + parsed.query
            try:
                h.request('HEAD',
                          resource,
                          headers={'User-Agent': 'curl/7.38.0'})
                response = h.getresponse()
            except:
                import traceback
                traceback.print_exc()
                return url

            logging.info('Response status: %d' % response.status)
            if response.status/100 == 3 and response.getheader('Location'):
                red_url = response.getheader('Location')
                logging.info('Red, previous: %s, %s' % (red_url, previous_url))
                if red_url == previous_url:
                    return red_url
                return self.process(red_url, previous_url=url)
            else:
                return url
        except:
            import traceback
            traceback.print_exc()
            return None
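Usage would then look something like this (using the example short URL from the question; any shortened link works the same way):
print UnShortenUrl().process('http://bit.ly/silly')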
You can use geturl():
from urllib.request import urlopen

url = "http://bit.ly/silly"  # urlopen needs the scheme
unshortened_url = urlopen(url).geturl()
print(unshortened_url)
# e.g. https://www.google.com/
This is a very easy task; you just need four lines of code:
import requests
url = input('Enter url : ')
site = requests.get(url)
print(site.url)
Just run this code and it will print the unshortened URL.

Is there any way to do HTTP PUT in python

I need to upload some data to a server using HTTP PUT in python. From my brief reading of the urllib2 docs, it only does HTTP POST. Is there any way to do an HTTP PUT in python?
I've used a variety of python HTTP libs in the past, and I've settled on requests as my favourite. Existing libs had pretty useable interfaces, but code can end up being a few lines too long for simple operations. A basic PUT in requests looks like:
>>> payload = {'username': 'bob', 'email': 'bob@bob.com'}
>>> r = requests.put("http://somedomain.org/endpoint", data=payload)
You can then check the response status code with:
r.status_code
or the response with:
r.content
Requests has a lot of syntactic sugar and shortcuts that'll make your life easier.
import urllib2

opener = urllib2.build_opener(urllib2.HTTPHandler)
request = urllib2.Request('http://example.org', data='your_put_data')
request.add_header('Content-Type', 'your/contenttype')
request.get_method = lambda: 'PUT'  # override the method urllib2 would otherwise choose
url = opener.open(request)
Httplib seems like a cleaner choice.
import httplib
connection = httplib.HTTPConnection('1.2.3.4:1234')
body_content = 'BODY CONTENT GOES HERE'
connection.request('PUT', '/url/path/to/put/to', body_content)
result = connection.getresponse()
# Now result.status and result.reason contains interesting stuff
You can use the requests library; it simplifies things a lot compared to the urllib2 approach. First install it from pip:
pip install requests
More on installing requests.
Then setup the put request:
import requests
import json
url = 'https://api.github.com/some/endpoint'
payload = {'some': 'data'}
# Create your header as required
headers = {"content-type": "application/json", "Authorization": "<auth-key>" }
r = requests.put(url, data=json.dumps(payload), headers=headers)
See the quickstart for requests library. I think this is a lot simpler than urllib2 but does require this additional package to be installed and imported.
This was made better in Python 3 and is documented in the stdlib documentation.
The urllib.request.Request class gained a method=... parameter in Python 3.
Some sample usage:
import urllib.request

req = urllib.request.Request('https://example.com/', data=b'DATA!', method='PUT')
urllib.request.urlopen(req)
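If you also want to inspect the result, a minimal sketch (same request object, reading the status and body):
import urllib.request

req = urllib.request.Request('https://example.com/', data=b'DATA!', method='PUT')
with urllib.request.urlopen(req) as resp:
    print(resp.status, resp.read())  # status code and response body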
You should have a look at the httplib module. It should let you make whatever sort of HTTP request you want.
I needed to solve this problem too a while back so that I could act as a client for a RESTful API. I settled on httplib2 because it allowed me to send PUT and DELETE in addition to GET and POST. Httplib2 is not part of the standard library but you can easily get it from the cheese shop.
I also recommend httplib2 by Joe Gregorio. I use this regularly instead of httplib in the standard lib.
Have you taken a look at put.py? I've used it in the past. You can also just hack up your own request with urllib.
You can of course roll your own with the existing standard libraries at any level from sockets up to tweaking urllib.
http://pycurl.sourceforge.net/
"PyCurl is a Python interface to libcurl."
"libcurl is a free and easy-to-use client-side URL transfer library, ... supports ... HTTP PUT"
"The main drawback with PycURL is that it is a relative thin layer over libcurl without any of those nice Pythonic class hierarchies. This means it has a somewhat steep learning curve unless you are already familiar with libcurl's C API. "
If you want to stay within the standard library, you can subclass urllib2.Request:
import urllib2

class RequestWithMethod(urllib2.Request):
    def __init__(self, *args, **kwargs):
        self._method = kwargs.pop('method', None)
        urllib2.Request.__init__(self, *args, **kwargs)

    def get_method(self):
        # urllib2.Request is an old-style class, so call the base method directly
        return self._method if self._method else urllib2.Request.get_method(self)

def put_request(url, data):
    opener = urllib2.build_opener(urllib2.HTTPHandler)
    request = RequestWithMethod(url, method='PUT', data=data)
    return opener.open(request)
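Usage might then look like this (the URL and body are placeholders):
response = put_request('http://example.org/resource', data='your_put_data')
print response.getcode()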
You can use requests.request
import requests
url = "https://www.example/com/some/url/"
payload="{\"param1\": 1, \"param1\": 2}"
headers = {
'Authorization': '....',
'Content-Type': 'application/json'
}
response = requests.request("PUT", url, headers=headers, data=payload)
print(response.text)
A more proper way of doing this with requests would be:
import requests

payload = {'username': 'bob', 'email': 'bob@bob.com'}
try:
    response = requests.put(url="http://somedomain.org/endpoint", data=payload)
    response.raise_for_status()
except requests.exceptions.RequestException as e:
    print(e)
    raise
This raises an exception if there is an error in the HTTP PUT request.
Using urllib3
To do that, you will need to manually encode query parameters in the URL.
>>> import urllib3
>>> http = urllib3.PoolManager()
>>> from urllib.parse import urlencode
>>> encoded_args = urlencode({"name":"Zion","salary":"1123","age":"23"})
>>> url = 'http://dummy.restapiexample.com/api/v1/update/15410' + '?' + encoded_args
>>> r = http.request('PUT', url)
>>> import json
>>> json.loads(r.data.decode('utf-8'))
{'status': 'success', 'data': [], 'message': 'Successfully! Record has been updated.'}
Using requests
>>> import requests
>>> r = requests.put('https://httpbin.org/put', data = {'key':'value'})
>>> r.status_code
200
