python requests with cookielib equivalent to wget with --load-cookies [duplicate] - python

I'm trying to access an authenticated site using a cookies.txt file (generated with a Chrome extension) with Python Requests:
import requests, cookielib
cj = cookielib.MozillaCookieJar('cookies.txt')
cj.load()
r = requests.get(url, cookies=cj)
It doesn't throw any error or exception, but yields the login screen, incorrectly. However, I know that my cookie file is valid, because I can successfully retrieve my content using it with wget. Any idea what I'm doing wrong?
Edit:
I'm tracing cookielib.MozillaCookieJar._really_load and can verify that the cookies are correctly parsed (i.e. they have the correct values for the domain, path, secure, etc. tokens). But as the transaction is still resulting in the login form, it seems that wget must be doing something additional (as the exact same cookies.txt file works for it).

MozillaCookieJar inherits from FileCookieJar which has the following docstring in its constructor:
Cookies are NOT loaded from the named file until either the .load() or
.revert() method is called.
You need to call .load() method then.
Also, like Jermaine Xu noted the first line of the file needs to contain either # Netscape HTTP Cookie File or # HTTP Cookie File string. Files generated by the plugin you use do not contain such a string so you have to insert it yourself. I raised appropriate bug at http://code.google.com/p/cookie-txt-export/issues/detail?id=5
EDIT
Session cookies are saved with 0 in the 5th column. If you don't pass ignore_expires=True to load() method all such cookies are discarded when loading from a file.
File session_cookie.txt:
# Netscape HTTP Cookie File
.domain.com TRUE / FALSE 0 name value
Python script:
import cookielib
cj = cookielib.MozillaCookieJar('session_cookie.txt')
cj.load()
print len(cj)
Output:
0
EDIT 2
Although we managed to get cookies into the jar above they are subsequently discarded by cookielib because they still have 0 value in the expires attribute. To prevent this we have to set the expire time to some future time like so:
for cookie in cj:
# set cookie expire date to 14 days from now
cookie.expires = time.time() + 14 * 24 * 3600
EDIT 3
I checked both wget and curl and both use 0 expiry time to denote session cookies which means it's the de facto standard. However Python's implementation uses empty string for the same purpose hence the problem raised in the question. I think Python's behavior in this regard should be in line with what wget and curl do and that's why I raised the bug at http://bugs.python.org/issue17164
I'll note that replacing 0s with empty strings in the 5th column of the input file and passing ignore_discard=True to load() is the alternate way of solving the problem (no need to change expiry time in this case).

I tried taking into account everything that Piotr Dobrogost had valiantly figured out about MozillaCookieJar but to no avail. I got fed up and just parsed the damn cookies.txt myself and now all is well:
import re
import requests
def parseCookieFile(cookiefile):
"""Parse a cookies.txt file and return a dictionary of key value pairs
compatible with requests."""
cookies = {}
with open (cookiefile, 'r') as fp:
for line in fp:
if not re.match(r'^\#', line):
lineFields = line.strip().split('\t')
cookies[lineFields[5]] = lineFields[6]
return cookies
cookies = parseCookieFile('cookies.txt')
import pprint
pprint.pprint(cookies)
r = requests.get('https://example.com', cookies=cookies)

This worked for me:
from http.cookiejar import MozillaCookieJar
from pathlib import Path
import requests
cookies = Path('/Users/name/cookies.txt')
jar = MozillaCookieJar(cookies)
jar.load()
requests.get('https://path.to.site.com', cookies=jar)
<Response [200]>

I tried editing Tristan answer to add some info to it but it seems SO edit q is full therefore, I am writing this answer, since, I have struggled real bad on using existing cookies with python request.
First, get the cookies from the Chrome. Easiest way would be to use an extension called 'cookies.txt'
https://chrome.google.com/webstore/detail/get-cookiestxt/bgaddhkoddajcdgocldbbfleckgcbcid/related
After downloading those cookies, use the below code to make sure that you are able to parse the file without any issues.
import re, requests, pprint
def parseCookieFile(cookiefile):
"""Parse a cookies.txt file and return a dictionary of key value pairs
compatible with requests."""
cookies = {}
with open (cookiefile, 'r') as fp:
for line in fp:
if not re.match(r'^\#', line):
lineFields = re.findall(r'[^\s]+', line) #capturing anything but empty space
try:
cookies[lineFields[5]] = lineFields[6]
except Exception as e:
print (e)
return cookies
cookies = parseCookieFile('cookies.txt') #replace the filename
pprint.pprint(cookies)
Next, use those cookies with python request
x = requests.get('your__url', verify=False, cookies=cookies)
print (x.content)
This should save your day from going on different SO posts and trying those cookielib and other methods which never worked for me.

I finally found a way to make it work (I got the idea by looking at curl's verbose ouput): instead of loading my cookies from a file, I simply created a dict with the required value/name pairs:
cd = {'v1': 'n1', 'v2': 'n2'}
r = requests.get(url, cookies=cd)
and it worked (although it doesn't explain why the previous method didn't). Thanks for all the help, it's really appreciated.

Related

How to use existing cookie file in Python request? [duplicate]

I'm trying to access an authenticated site using a cookies.txt file (generated with a Chrome extension) with Python Requests:
import requests, cookielib
cj = cookielib.MozillaCookieJar('cookies.txt')
cj.load()
r = requests.get(url, cookies=cj)
It doesn't throw any error or exception, but yields the login screen, incorrectly. However, I know that my cookie file is valid, because I can successfully retrieve my content using it with wget. Any idea what I'm doing wrong?
Edit:
I'm tracing cookielib.MozillaCookieJar._really_load and can verify that the cookies are correctly parsed (i.e. they have the correct values for the domain, path, secure, etc. tokens). But as the transaction is still resulting in the login form, it seems that wget must be doing something additional (as the exact same cookies.txt file works for it).
MozillaCookieJar inherits from FileCookieJar which has the following docstring in its constructor:
Cookies are NOT loaded from the named file until either the .load() or
.revert() method is called.
You need to call .load() method then.
Also, like Jermaine Xu noted the first line of the file needs to contain either # Netscape HTTP Cookie File or # HTTP Cookie File string. Files generated by the plugin you use do not contain such a string so you have to insert it yourself. I raised appropriate bug at http://code.google.com/p/cookie-txt-export/issues/detail?id=5
EDIT
Session cookies are saved with 0 in the 5th column. If you don't pass ignore_expires=True to load() method all such cookies are discarded when loading from a file.
File session_cookie.txt:
# Netscape HTTP Cookie File
.domain.com TRUE / FALSE 0 name value
Python script:
import cookielib
cj = cookielib.MozillaCookieJar('session_cookie.txt')
cj.load()
print len(cj)
Output:
0
EDIT 2
Although we managed to get cookies into the jar above they are subsequently discarded by cookielib because they still have 0 value in the expires attribute. To prevent this we have to set the expire time to some future time like so:
for cookie in cj:
# set cookie expire date to 14 days from now
cookie.expires = time.time() + 14 * 24 * 3600
EDIT 3
I checked both wget and curl and both use 0 expiry time to denote session cookies which means it's the de facto standard. However Python's implementation uses empty string for the same purpose hence the problem raised in the question. I think Python's behavior in this regard should be in line with what wget and curl do and that's why I raised the bug at http://bugs.python.org/issue17164
I'll note that replacing 0s with empty strings in the 5th column of the input file and passing ignore_discard=True to load() is the alternate way of solving the problem (no need to change expiry time in this case).
I tried taking into account everything that Piotr Dobrogost had valiantly figured out about MozillaCookieJar but to no avail. I got fed up and just parsed the damn cookies.txt myself and now all is well:
import re
import requests
def parseCookieFile(cookiefile):
"""Parse a cookies.txt file and return a dictionary of key value pairs
compatible with requests."""
cookies = {}
with open (cookiefile, 'r') as fp:
for line in fp:
if not re.match(r'^\#', line):
lineFields = line.strip().split('\t')
cookies[lineFields[5]] = lineFields[6]
return cookies
cookies = parseCookieFile('cookies.txt')
import pprint
pprint.pprint(cookies)
r = requests.get('https://example.com', cookies=cookies)
This worked for me:
from http.cookiejar import MozillaCookieJar
from pathlib import Path
import requests
cookies = Path('/Users/name/cookies.txt')
jar = MozillaCookieJar(cookies)
jar.load()
requests.get('https://path.to.site.com', cookies=jar)
<Response [200]>
I tried editing Tristan answer to add some info to it but it seems SO edit q is full therefore, I am writing this answer, since, I have struggled real bad on using existing cookies with python request.
First, get the cookies from the Chrome. Easiest way would be to use an extension called 'cookies.txt'
https://chrome.google.com/webstore/detail/get-cookiestxt/bgaddhkoddajcdgocldbbfleckgcbcid/related
After downloading those cookies, use the below code to make sure that you are able to parse the file without any issues.
import re, requests, pprint
def parseCookieFile(cookiefile):
"""Parse a cookies.txt file and return a dictionary of key value pairs
compatible with requests."""
cookies = {}
with open (cookiefile, 'r') as fp:
for line in fp:
if not re.match(r'^\#', line):
lineFields = re.findall(r'[^\s]+', line) #capturing anything but empty space
try:
cookies[lineFields[5]] = lineFields[6]
except Exception as e:
print (e)
return cookies
cookies = parseCookieFile('cookies.txt') #replace the filename
pprint.pprint(cookies)
Next, use those cookies with python request
x = requests.get('your__url', verify=False, cookies=cookies)
print (x.content)
This should save your day from going on different SO posts and trying those cookielib and other methods which never worked for me.
I finally found a way to make it work (I got the idea by looking at curl's verbose ouput): instead of loading my cookies from a file, I simply created a dict with the required value/name pairs:
cd = {'v1': 'n1', 'v2': 'n2'}
r = requests.get(url, cookies=cd)
and it worked (although it doesn't explain why the previous method didn't). Thanks for all the help, it's really appreciated.

Writing to NamedTemporaryFile fails silently; converting Curl cookie jar to Requests cookies

I'm trying to take the Netscape HTTP Cookie File that Curl spits out and convert it to a Cookiejar that the Requests library can work with. I have netscapeCookieString in my Python script as a variable, which looks like:
# Netscape HTTP Cookie File
# https://curl.haxx.se/docs/http-cookies.html
# This file was generated by libcurl! Edit at your own risk.
.miami.edu TRUE / TRUE 0 PS_LASTSITE https://canelink.miami.edu/psc/PUMI2J/
Since I don't want to parse the cookie file myself, I'd like to use cookielib. Sadly, this means I have to write to disk since cookielib.MozillaCookieJar() won't take a string as input: it has to take a file.
So I'm using NamedTemporaryFile (couldn't get SpooledTemporaryFile to work; again would like to do all of this in memory if possible).
tempCookieFile = tempfile.NamedTemporaryFile()
# now take the contents of the cookie string and put it into this in memory file
# that cookielib will read from. There are a couple quirks though.
for line in netscapeCookieString.splitlines():
# cookielib doesn't know how to handle httpOnly cookies correctly
# so we have to do some pre-processing to make sure they make it into
# the cookielib. Basically just removing the httpOnly prefix which is honestly
# an abuse of the RFC in the first place. note: httpOnly actually refers to
# cookies that javascript can't access, as in only http protocol can
# access them, it has nothing to do with http vs https. it's purely
# to protect against XSS a bit better. These cookies may actually end up
# being the most critical of all cookies in a given set.
# https://stackoverflow.com/a/53384267/2611730
if line.startswith("#HttpOnly_"):
# this is actually how the curl library removes the httpOnly, by doing length
line = line[len("#HttpOnly_"):]
tempCookieFile.write(line)
tempCookieFile.flush()
# another thing that cookielib doesn't handle very well is
# session cookies, which have 0 in the expires param
# so we have to make sure they don't get expired when they're
# read in by cookielib
# https://stackoverflow.com/a/14759698/2611730
print tempCookieFile.read()
cookieJar = cookielib.MozillaCookieJar(tempCookieFile.name)
cookieJar.load(ignore_expires=True)
pprint.pprint(cookieJar)
But here's the kicker, this doesn't work!
print tempCookieFile.read() prints an empty line.
Thus, pprint.pprint(cookieJar) prints an empty cookie jar.
I was easily able to reproduce this on my Mac:
>>> import tempfile
>>> tempCookieFile = tempfile.NamedTemporaryFile()
>>> tempCookieFile.write("hey")
>>> tempCookieFile.flush()
>>> print tempCookieFile.read()
>>>
How can I actually write to a NamedTemporaryFile?
After you write to the file, the pointer to that file is to the location after that written data (in your case end of file) so when you read it returns an empty string (no more data after end of file) just seek to 0 before reading
>>> import tempfile
>>> tempCookieFile = tempfile.NamedTemporaryFile()
>>> tempCookieFile.write("hey")
>>> tempCookieFile.seek(0)
>>> print(tempCookieFile.read())

urllib3 HTTPResponse.read() returns empty bytes

I'm trying to read a website's content but I get an empty bytes object, b''.
import urllib3
from urllib3 import PoolManager
urllib3.disable_warnings()
https = PoolManager()
r = https.request('GET', 'https://minemen.club/leaderboards/practice/')
print(r.status)
print(r.read())
When I open the URL in a web browser I see the website, and r.status is 200 (success).
Why does r.read() not return the content?
What makes you think it is wrong? Try the following, you'll have much more output:
print(r.data)
Check HTTPResponse to see how to use the r object you got.
This is how urllib3.response.HTTPResponse.read is supposed to work.
It is explained for example here by one of the contributors to urllib3:
This is about documentation. You cannot use read() by default, because
by default all the content is consumed into data. If you want read()
to work, you need to set preload_content=True on the call to urlopen.
Want to give that a try?
So you can simply use r.data.

reading cookies file created by curl

I have the following cookie saved by curl (in test.txt, tab-separated, this editor doesn't preserve tabs):
# Netscape HTTP Cookie File
# http://curlm.haxx.se/rfc/cookie_spec.html
# This file was generated by libcurl! Edit at your own risk.
#HttpOnly_my-example.com FALSE / FALSE 0 _rails-root_session test
I'm trying to read it with the following code:
import sys
if sys.version_info < (3,):
from cookielib import Cookie, MozillaCookieJar
else:
from http.cookiejar import Cookie, MozillaCookieJar
def load_cookies_from_mozilla(filename):
ns_cookiejar = MozillaCookieJar()
ns_cookiejar.load(filename, ignore_discard=True)
return ns_cookiejar
cookies = load_cookies_from_mozilla("test.txt")
print (len(cookies))
It outputs 0 (unable to read the cookie).
If I manually modify my cookie to the following line (remove HttpOnly flag and changing 0 to the empty string for expiration time, and again, tab-separated):
my-example.com FALSE / FALSE _rails-root_session test
then it outputs 1 (successfully read the cookie).
What needs to be done to my python code to read the original cookie line? And preferably to be able to save it in the same format (with HttpOnly flag and with 0 instead of empty string for never-expiring cookie)?
Thanks.
This appears to be an open bug: https://bugs.python.org/issue2190.
This bug report contains a link to a workaround: https://gerrit.googlesource.com/git-repo/+/master/subcmds/sync.py#995
In that linked code, the developer creates a temporary cookies file, removes the "#HttpOnly_" prefixes, and then creates a cookiejar with that temporary file.
tmpcookiefile = tempfile.NamedTemporaryFile()
tmpcookiefile.write("# HTTP Cookie File")
try:
with open(cookiefile) as f:
for line in f:
if line.startswith("#HttpOnly_"):
line = line[len("#HttpOnly_"):]
tmpcookiefile.write(line)
tmpcookiefile.flush()
cookiejar = cookielib.MozillaCookieJar(tmpcookiefile.name)
try:
cookiejar.load()
except cookielib.LoadError:
cookiejar = cookielib.CookieJar()
finally:
tmpcookiefile.close()
I tested your code and modified it, it works.
First in the cookie file you have to put off the '#' before your cookie, I think it will comment the data after it.
Second the 0 in the cookie means the expire time, 0 means expire now, so you can change the 0 to empty string or latter time, but i suggest you use the argument ignore_expire=True, the official means:
ignore_discard: save even cookies set to be discarded.
ignore_expires: save even cookies that have expiredThe file is overwritten if it already exists
and the result code is :
import sys
if sys.version_info < (3,):
from cookielib import Cookie, MozillaCookieJar
else:
from http.cookiejar import Cookie, MozillaCookieJar
def load_cookies_from_mozilla(filename):
ns_cookiejar = MozillaCookieJar()
ns_cookiejar.load(filename, ignore_discard=True, ignore_expires=True)
return ns_cookiejar
cookies = load_cookies_from_mozilla("test.txt")
print (len(cookies))
and you can see the link to find more detail:
Using cookies.txt file with Python Requests

Script to connect to a web page

Looking for a python script that would simply connect to a web page (maybe some querystring parameters).
I am going to run this script as a batch job in unix.
urllib2 will do what you want and it's pretty simple to use.
import urllib
import urllib2
params = {'param1': 'value1'}
req = urllib2.Request("http://someurl", urllib.urlencode(params))
res = urllib2.urlopen(req)
data = res.read()
It's also nice because it's easy to modify the above code to do all sorts of other things like POST requests, Basic Authentication, etc.
Try this:
aResp = urllib2.urlopen("http://google.com/");
print aResp.read();
If you need your script to actually function as a user of the site (clicking links, etc.) then you're probably looking for the python mechanize library.
Python Mechanize
A simple wget called from a shell script might suffice.
in python 2.7:
import urllib2
params = "key=val&key2=val2" #make sure that it's in GET request format
url = "http://www.example.com"
html = urllib2.urlopen(url+"?"+params).read()
print html
more info at https://docs.python.org/2.7/library/urllib2.html
in python 3.6:
from urllib.request import urlopen
params = "key=val&key2=val2" #make sure that it's in GET request format
url = "http://www.example.com"
html = urlopen(url+"?"+params).read()
print(html)
more info at https://docs.python.org/3.6/library/urllib.request.html
to encode params into GET format:
def myEncode(dictionary):
result = ""
for k in dictionary: #k is the key
result += k+"="+dictionary[k]+"&"
return result[:-1] #all but that last `&`
I'm pretty sure this should work in either python2 or python3...
What are you trying to do? If you're just trying to fetch a web page, cURL is a pre-existing (and very common) tool that does exactly that.
Basic usage is very simple:
curl www.example.com
You might want to simply use httplib from the standard library.
myConnection = httplib.HTTPConnection('http://www.example.com')
you can find the official reference here: http://docs.python.org/library/httplib.html

Categories

Resources