I used this piece of code to log in:
import cookielib, urllib2
cj = cookielib.LWPCookieJar()
cookie_support = urllib2.HTTPCookieProcessor(cj)
opener = urllib2.build_opener(cookie_support, urllib2.HTTPHandler)
urllib2.install_opener(opener)
# ... log in with username and password ...
# ... then call urllib2.urlopen() to get the stuff I need.
Now, how do I preserve the cookie and set its expiration date to forever, so that next time I don't have to log in with a username and password again and can call urllib2.urlopen() directly?
By "next time" I mean after the program ends: when I start the program again, I want to reload the cookie from disk and use it.
Thanks a lot
I would highly recommend using the Requests HTTP library. It will handle all this for you.
http://docs.python-requests.org/en/latest/
import requests
sess = requests.session()
sess.post("http://somesite.com/someform.php", data={"username": "me", "password": "pass"})
#Everything after that POST will retain the login session
print sess.get("http://somesite.com/otherpage.php").text
edit: To save the session to disk, there are a lot of ways. You could do the following:
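from requests.utils import dict_from_cookiejar
# sess.cookies is already a CookieJar in current versions of Requests, so convert it to a plain dict
cookies = dict_from_cookiejar(sess.cookies)
Then read the following documentation. You could load the cookies into a FileCookieJar subclass (such as LWPCookieJar), save them to a text file, and load them again at the start of the program.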
http://docs.python.org/2/library/cookielib.html#cookiejar-and-filecookiejar-objects
Alternatively you could pickle the dict and save that data to a file, and load it with pickle.load(file).
http://docs.python.org/2/library/pickle.html
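For the FileCookieJar route, here is a minimal sketch in Python 3, where cookielib is named http.cookiejar. The make_cookie helper, the cookie name and value, and the file location are all made up for illustration; in a real program the jar would be filled by the login request instead. Whether the server still accepts a reloaded cookie is up to the server, so treat this as a way to keep the cookie, not a guarantee of staying logged in.

```python
import http.cookiejar
import os
import tempfile


def make_cookie(name, value, domain, expires=None):
    """Build a cookie by hand (hypothetical stand-in for one set by a server)."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=expires, discard=expires is None,
        comment=None, comment_url=None, rest={},
    )


# Assumed location; any writable path would do.
cookie_file = os.path.join(tempfile.mkdtemp(), "cookies.lwp")

# First run: fill the jar (normally by logging in) and write it to disk.
jar = http.cookiejar.LWPCookieJar(cookie_file)
jar.set_cookie(make_cookie("sessionid", "abc123", "example.com"))
# Session cookies are marked "discard"; ask for them to be saved anyway.
jar.save(ignore_discard=True, ignore_expires=True)

# Later run: load the cookies back before making any request.
restored = http.cookiejar.LWPCookieJar(cookie_file)
restored.load(ignore_discard=True, ignore_expires=True)
restored_names = [c.name for c in restored]
```

The restored jar can be handed to HTTPCookieProcessor in the same way as a fresh one.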
edit 2: To handle expiration, you can iterate over the CookieJar as follows. cj is assumed to be a CookieJar obtained in some fashion.
for cookie in cj:
    if cookie.is_expired():
        # re-attain session
        pass
To check if any of the cookies are expired, it may be more convenient to do if any(c.is_expired() for c in cj).
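As a rough illustration of that check, here is a Python 3 sketch (http.cookiejar is the Python 3 name for cookielib). The make_cookie helper and the two cookies "fresh" and "stale" are invented for the example; a real jar would come from your login request.

```python
import http.cookiejar
import time


def make_cookie(name, expires):
    """Hypothetical helper: a cookie for example.com with the given expiry."""
    return http.cookiejar.Cookie(
        version=0, name=name, value="x",
        port=None, port_specified=False,
        domain="example.com", domain_specified=True, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=expires, discard=False,
        comment=None, comment_url=None, rest={},
    )


now = time.time()
cj = http.cookiejar.CookieJar()
cj.set_cookie(make_cookie("fresh", now + 3600))  # expires in an hour
cj.set_cookie(make_cookie("stale", now - 3600))  # expired an hour ago

# Which cookies are past their expiry time?
expired_names = sorted(c.name for c in cj if c.is_expired())

# A single flag is enough to decide whether to log in again.
needs_login = any(c.is_expired() for c in cj)

# The jar can also drop expired cookies itself.
cj.clear_expired_cookies()
remaining_names = [c.name for c in cj]
```

Note that a cookie with no expiry time never counts as expired here, even though the server may have ended the session on its side.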
I have been googling this problem for a week now.
What I want to achieve is the following:
Send a POST request to the URL, including the correct credentials.
Save the session (not a cookie, since my website is not using cookies at the moment).
With the saved session, open a session-protected URL and grab the contents.
I have seen a lot of topics on this with cookies but not with sessions. I tried sessions with requests, but it seems to fail every time.
You want to use a URL opener. Here's a sample of how I've managed to do it. If you just want a default opener, use opener = urllib.request.build_opener(); otherwise use the custom opener below. This worked when I had to log into a website and keep a session. Substitute your own values for URL, user, and password.
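import http.cookiejar
import urllib.parse
import urllib.request
# Opener that keeps cookies between requests
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(http.cookiejar.CookieJar()))
# Passing data makes this a POST; in Python 3 it must be bytes
pData = urllib.parse.urlencode({"identity": user, "password": password})
req = urllib.request.Request(URL, pData.encode('utf-8'))
opener.open(req)
# Same opener, so stored cookies are sent; url is the page you want after logging in
req = urllib.request.Request(url)
response = opener.open(req)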
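To see what that opener actually sends, here is a small sketch that builds the same kind of login request without touching the network. The helper names, the example.com URL, and the credentials are placeholders I made up; the "identity" and "password" field names are taken from the snippet above and will differ from site to site.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def build_session_opener():
    """Return an opener plus the jar it stores cookies in."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def build_login_request(login_url, user, password):
    """Form-encode the credentials; a Request with data is sent as a POST."""
    form = urllib.parse.urlencode({"identity": user, "password": password})
    return urllib.request.Request(login_url, data=form.encode("utf-8"))


opener, jar = build_session_opener()
login_request = build_login_request("http://example.com/login", "me", "secret")

# The network calls are left commented out so the sketch runs offline:
# opener.open(login_request)
# page = opener.open("http://example.com/protected").read()
```

Keeping a reference to the jar is handy: after a real login, list(jar) shows whether the site set any cookie at all, which is a quick way to tell a cookie-based session from something else.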
I am planning to write a website crawler in Python using Requests and PyQuery.
However, the site I am targeting requires me to be signed into my account. Using Requests, is it possible for me to establish a session with the server (using my credentials for the site), and use this session to crawl sites that I have access to only when logged in?
I hope this question is clear, thank you.
Yes it is possible.
I don't know about PyQuery but I've made crawlers that log in to sites using urllib2.
All you need is to use cookiejar to handle cookies and send the login form using a request.
If you ask something more specific I will try to be more explicit too.
LE:
urllib2 is not a mess. It's the best library for such things in my opinion.
Here's a code snippet that will log in to a site (after that you can just parse the site normally):
import urllib
import urllib2
import cookielib
"""Adding cookie support"""
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)
"""Next we will log in to the site. The actual url will be different and also the data.
You should check the log in form to see what parameters it takes and what values.
"""
data = {'username': 'foo',
        'password': 'bar'}
data = urllib.urlencode(data)
urllib2.urlopen('http://www.siteyouwanttoparse.com/login', data) #this should log us in
"""Now you can parse the site"""
html = urllib2.urlopen('http://www.siteyouwanttoparse.com').read()
print html
Using urlopen also for url queries seems obvious. What I tried is:
import urllib2
query='http://www.onvista.de/aktien/snapshot.html?ID_OSI=86627'
f = urllib2.urlopen(query)
s = f.read()
f.close()
However, for this specific URL the query fails with HTTP error 403 (Forbidden).
When I enter this query in my browser, it works.
It also works when I use http://www.httpquery.com/ to submit the query.
Do you have any suggestions on how to get the correct response with Python?
Looks like it requires cookies (which you can handle with urllib2), but an easier way to do this is to use requests:
import requests
session = requests.session()
r = session.get('http://www.onvista.de/aktien/snapshot.html?ID_OSI=86627')
This is generally a much easier and less-stressful method of retrieving URLs in Python.
requests will automatically store and re-use cookies for you. Creating a session is slightly overkill here, but it is useful when you need to submit data to login pages or re-use cookies across a site.
With urllib2, it looks something like this:
import urllib2, cookielib
cookies = cookielib.CookieJar()
opener = urllib2.build_opener( urllib2.HTTPCookieProcessor(cookies) )
data = opener.open('url').read()
It appears that the urllib2 default user agent is banned by the host. You can simply supply your own user agent string:
import urllib2
url = 'http://www.onvista.de/aktien/snapshot.html?ID_OSI=86627'
request = urllib2.Request(url, headers={"User-Agent" : "MyUserAgent"})
contents = urllib2.urlopen(request).read()
print contents
I'm writing a script in Python 3.1.2 that logs into a site and then begins to make requests. I can log in without any great difficulty, but after doing that the requests return an error stating I haven't logged in. My code looks like this:
import urllib.request
from http import cookiejar
from urllib.parse import urlencode
jar = cookiejar.CookieJar()
credentials = {'accountName': 'username', 'password': 'unenc_pw'}
credenc = urlencode(credentials)
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
urllib.request.install_opener(opener)
req = opener.open('http://www.wowarmory.com/?app=armory?login&cr=true', credenc)
test = opener.open('http://www.wowarmory.com/auctionhouse/search.json')
print(req.read())
print(test.read())
The response to the first request is the page I expect to get when logging in.
The response to the second is:
b'{"error":{"code":10005,"error":true,"message":"You must log in."},"command":{"sort":"RARITY","reverse":false,"pageSize":20,"end":20,"start":0,"minLvl":0,"maxLvl":0,"id":0,"qual":0,"classId":-1,"filterId":"-1"}}'
Is there something I'm missing to use any cookie information I have from successful authentication for future requests?
I had this issue once. I couldn't get the automatic cookie management working. It frustrated me for days, and I ended up handling the cookie manually: I read the content of 'Set-Cookie' from the response header and save it somewhere safe. For any subsequent request to that server, I set 'Cookie' in the request header to the value I got earlier.
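A minimal sketch of that manual approach, in Python 3 to match the question. The Set-Cookie values, the cookie names, and the example.com URL are invented; with a real response you would get the header values from response.headers.get_all("Set-Cookie"). The helper name is mine, not part of any library.

```python
import urllib.request
from http.cookies import SimpleCookie


def cookie_header_from_set_cookie(set_cookie_values):
    """Turn raw Set-Cookie header values into one Cookie request header value."""
    parsed = SimpleCookie()
    for raw in set_cookie_values:
        parsed.load(raw)
    # Only name=value pairs go back to the server; attributes such as Path
    # or HttpOnly are instructions to the client and are not echoed.
    return "; ".join(f"{name}={morsel.value}" for name, morsel in parsed.items())


# Made-up example of what a login response might carry.
set_cookie_values = [
    "JSESSIONID=abc123; Path=/; HttpOnly",
    "theme=dark; Path=/",
]
cookie_header = cookie_header_from_set_cookie(set_cookie_values)

# Attach the saved value to every later request to the same server.
request = urllib.request.Request(
    "http://example.com/auctionhouse/search.json",
    headers={"Cookie": cookie_header},
)
```

This ignores expiry, domain, and path matching, so it only makes sense when all requests go to the one server that issued the cookies.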
I've been googling around for quite some time now and can't seem to get this to work. A lot of my searches have pointed me to finding similar problems but they all seem to be related to cookie grabbing/storing. I think I've set that up properly, but when I try to open the 'hidden' page, it keeps bringing me back to the login page saying my session has expired.
import urllib, urllib2, cookielib, webbrowser
username = 'userhere'
password = 'passwordhere'
url = 'http://example.com'
webbrowser.open(url, new=1, autoraise=1)
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
login_data = urllib.urlencode({'username' : username, 'j_password' : password})
opener.open('http://example.com', login_data)
resp = opener.open('http://example.com/afterlogin')
print resp
webbrowser.open(url, new=1, autoraise=1)
First off, when doing cookie-based authentication, you need a CookieJar to store your cookies in, much in the same way that your browser stores its cookies in a place where it can find them again.
After opening a login page through Python and saving the cookie from a successful login, you could in principle use MozillaCookieJar to write the Python-created cookies in a format a Firefox browser can parse. However, Firefox 3.x no longer uses the cookie format that MozillaCookieJar produces, and I have not been able to find viable alternatives.
If all you need to do is retrieve specific data in a format known in advance, then I suggest you keep all your HTTP interactions within Python. It is much easier, and you don't have to rely on specific browsers being available. If it is absolutely necessary to show stuff in a browser, you could fetch the so-called 'hidden' page through urllib2 (which incidentally integrates very nicely with cookielib), save the HTML to a temporary file, and pass this to webbrowser.open, which will then render that specific page. Further redirects are not possible.
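The temporary-file idea could look roughly like this in Python 3. The two helper names and the sample HTML are placeholders; in practice the bytes would come from opener.open(...).read() on the logged-in opener.

```python
import pathlib
import tempfile
import webbrowser


def save_html_to_temp(html_bytes):
    """Write the fetched page to a temporary .html file and return its path."""
    with tempfile.NamedTemporaryFile(suffix=".html", delete=False) as handle:
        handle.write(html_bytes)
    return pathlib.Path(handle.name)


def show_in_browser(path):
    """Ask the default browser to render the saved file."""
    return webbrowser.open(path.as_uri())


# Stand-in for the bytes fetched with the logged-in opener.
page = b"<html><body><h1>Hidden page</h1></body></html>"
saved_path = save_html_to_temp(page)

# Left commented out so the sketch does not launch a browser when run:
# show_in_browser(saved_path)
```

The browser only sees a static copy, so links on the page will be followed without the Python session's cookies.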
I've generally used the mechanize library to handle stuff like this. That doesn't answer your question about why your existing code isn't working, but it's something else to play with.
The provided code calls:
opener.open('http://example.com', login_data)
but throws away the response. I would look at this response to see if it says "Bad password" or "I only accept IE" or similar.
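One way to look at that response is to pull out the status, the final URL (which reveals a redirect back to the login page), and the start of the body. This is only a sketch in Python 3: summarize_response is a name I made up, and the fake response object stands in for what opener.open(...) would return, so it runs without a network.

```python
import email.message
import io
import urllib.response


def summarize_response(resp, max_chars=200):
    """Collect the parts of a urllib response worth inspecting after a login."""
    body = resp.read().decode("utf-8", errors="replace")
    return {
        "status": resp.getcode(),
        "final_url": resp.geturl(),
        "body_start": body[:max_chars],
    }


# Fake response for illustration only; normally this is
# resp = opener.open('http://example.com', login_data)
fake_resp = urllib.response.addinfourl(
    io.BytesIO(b"<p>Bad password</p>"),
    email.message.Message(),
    "http://example.com/login?error=1",
    200,
)

summary = summarize_response(fake_resp)
looks_rejected = "bad password" in summary["body_start"].lower()
```

The exact error text depends on the site, so the "bad password" check is just an example of the kind of marker to search for.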