Establishing a session with a web app to crawl - Python

I am planning to write a website crawler in Python using Requests and PyQuery.
However, the site I am targeting requires me to be signed into my account. Using Requests, is it possible for me to establish a session with the server (using my credentials for the site), and use this session to crawl sites that I have access to only when logged in?
I hope this question is clear, thank you.

Yes, it is possible.
I don't know about PyQuery, but I've made crawlers that log in to sites using urllib2.
All you need is a cookie jar to handle cookies, and to send the login form in a request.
If you ask something more specific, I will try to be more explicit too.
Later edit:
urllib2 is not a mess. It's the best library for such things, in my opinion.
Here's a code snippet that will log in to a site (after that you can just parse the site normally):
import urllib
import urllib2
import cookielib
"""Adding cookie support"""
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)
"""Next we will log in to the site. The actual url will be different and also the data.
You should check the log in form to see what parameters it takes and what values.
"""
data = {'username' : 'foo',
        'password' : 'bar'}
data = urllib.urlencode(data)
urllib2.urlopen('http://www.siteyouwanttoparse.com/login', data) #this should log us in
"""Now you can parse the site"""
html = urllib2.urlopen('http://www.siteyouwanttoparse.com').read()
print html
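Since the question asked specifically about Requests: there, the cookie handling above is automatic. A `requests.Session` stores cookies from the login response and replays them on every later request. A minimal sketch, demonstrated against a throwaway local server standing in for the real site (the actual login URL and form field names must be taken from the target site's login form):

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests

# throwaway local server standing in for the real site: POST /login sets a
# session cookie, and GET /secret only succeeds when that cookie comes back
class FakeSite(BaseHTTPRequestHandler):
    def do_POST(self):
        self.rfile.read(int(self.headers.get('Content-Length', 0)))
        self.send_response(200)
        self.send_header('Set-Cookie', 'sid=abc123')
        self.send_header('Content-Length', '0')
        self.end_headers()

    def do_GET(self):
        logged_in = 'sid=abc123' in (self.headers.get('Cookie') or '')
        body = b'secret page' if logged_in else b'please log in'
        self.send_response(200 if logged_in else 403)
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo quiet
        pass

server = HTTPServer(('127.0.0.1', 0), FakeSite)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = 'http://127.0.0.1:%d' % server.server_address[1]

with requests.Session() as s:
    # field names here are placeholders; check the real login form
    s.post(base + '/login', data={'username': 'foo', 'password': 'bar'})
    page = s.get(base + '/secret').text  # the sid cookie is replayed automatically
print(page)
server.shutdown()
```

The session object does the CookieJar bookkeeping for you, which is exactly what the urllib2 code above sets up by hand.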

Related

Log into a website using Requests module in Python

I am new to Python and web scraping, but I keep learning. I have managed to get some exciting results using the BeautifulSoup and Requests libraries, and my next goal is to log into a website that allows remote access to my heating system, to do some web scraping and maybe extend its capabilities further.
Unfortunately, I got stuck. I have used Mozilla's Web Dev Tools to see the url that the form posts to, and the name attributes of the username and password fields. The webpage url is https://emodul.pl/login and the Request payload looks as follows:
{"username":"my_username","password":"my_password","rememberMe":false,"languageId":"en","remote":false}
I am using a requests.Session() instance to make a POST request to the login url with the above-mentioned payload:
import requests
url = 'https://emodul.pl/login'
payload = {'username':'my_username','password':'my_password','rememberMe':False,'languageId':'en','remote':False}
with requests.Session() as s:
    p = s.post(url, data=payload)
    print(p.text)
Apparently I'm doing something wrong, because I'm getting the {"error":"Sorry, something went wrong. Try again."} response.
Any advice will be much appreciated.
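One likely culprit: the browser sends that payload as JSON (note the `{"username": ...}` shape and the bare `false` literals in the dev-tools capture), while `data=payload` form-encodes it. With Requests, passing the dict as `json=` serializes it and sets the `Content-Type: application/json` header. A sketch that inspects the prepared request without hitting the network (whether emodul.pl actually requires this is an assumption to verify against the dev-tools capture):

```python
import json

import requests

url = 'https://emodul.pl/login'
payload = {'username': 'my_username', 'password': 'my_password',
           'rememberMe': False, 'languageId': 'en', 'remote': False}

# json= serializes the dict to a JSON body and sets the Content-Type header,
# matching what the browser's dev tools showed; data= would form-encode it
prepared = requests.Request('POST', url, json=payload).prepare()
print(prepared.headers['Content-Type'])
print(prepared.body)
```

Swapping `data=payload` for `json=payload` in the `s.post(...)` call is then a one-word change.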

using python urlopen for a url query

Using urlopen also for url queries seems obvious. What I tried is:
import urllib2
query='http://www.onvista.de/aktien/snapshot.html?ID_OSI=86627'
f = urllib2.urlopen(query)
s = f.read()
f.close()
However, for this specific url query it fails with HTTP error 403 forbidden
When entering this query in my browser, it works.
Also when using http://www.httpquery.com/ to submit the query, it works.
Do you have suggestions how to use Python right to grab the correct response?
Looks like it requires cookies (which you can handle with urllib2), but an easier way, if you're doing this, is to use requests:
import requests
session = requests.session()
r = session.get('http://www.onvista.de/aktien/snapshot.html?ID_OSI=86627')
This is generally a much easier and less-stressful method of retrieving URLs in Python.
requests will automatically store and re-use cookies for you. Creating a session is slightly overkill here, but is useful for when you need to submit data to login pages etc..., or re-use cookies across a site... etc...
Using urllib2, it's something like:
import urllib2, cookielib
cookies = cookielib.CookieJar()
opener = urllib2.build_opener( urllib2.HTTPCookieProcessor(cookies) )
data = opener.open('url').read()
It appears that the urllib2 default user agent is banned by the host. You can simply supply your own user agent string:
import urllib2
url = 'http://www.onvista.de/aktien/snapshot.html?ID_OSI=86627'
request = urllib2.Request(url, headers={"User-Agent" : "MyUserAgent"})
contents = urllib2.urlopen(request).read()
print contents
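For reference, the same fix in Python 3, where urllib2 became urllib.request (the user-agent string here is an arbitrary example, not anything the host is known to require):

```python
import urllib.request

url = 'http://www.onvista.de/aktien/snapshot.html?ID_OSI=86627'
# supply a user-agent, since the urllib default ("Python-urllib/3.x")
# is what the host appears to reject
request = urllib.request.Request(
    url, headers={'User-Agent': 'Mozilla/5.0 (compatible; MyCrawler/1.0)'})
print(request.get_header('User-agent'))
# contents = urllib.request.urlopen(request).read()  # performs the actual fetch
```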

how to open facebook/gmail/authentication sites using python urllib2.urlopen?

I am trying to write a small web-based proxy using Python. I can fetch and show normal websites, but I cannot log in to facebook/gmail/... anything with a login.
I have seen some examples of authentication here:
http://docs.python.org/release/2.5.2/lib/urllib2-examples.html but I don't know how I can make a general solution for all web sites with a login. Any idea?
my code is :
def showurl():
    url = request.vars.url
    response = urllib2.urlopen(url)
    html = response.read()
    return html
Your proxy server needs to store cookies; search Stack Overflow for cookielib.
Many web sites authenticate clients in different ways, so your job is to fake the client as much as possible with your proxy server. Some web sites authenticate by browser type, some by creating cookies and storing a sessionId in them, or by other JavaScript-driven hidden content that performs extra authentication steps.
In my limited experience, all the important stuff ends up in cookies.
This is just a flat example of how to use cookielib.
import urllib, urllib2, cookielib, getpass
username = ''
button = 'submit'
www_login = 'http://website.com'
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
opener.addheaders.append(('User-agent', 'Mozilla/4.0'))
opener.addheaders.append( ('Referer', '/dev/null') )
login_data = urllib.urlencode({'username' : username, 'password': getpass.getpass("Password:"), 'login' : button})
resp = opener.open(www_login, login_data)
print resp.read()
EDITED:
Don't confuse "Basic HTTP Authentication" with the authentication done by facebook/gmail, because they are different things. "Basic HTTP Authentication" or "Digest HTTP Authentication" is performed by the web server, not by the web site you want to log in to.
http://www.voidspace.org.uk/python/articles/authentication.shtml#id24
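To illustrate the distinction: server-level Basic auth is handled by a dedicated opener handler that answers the server's 401 challenge with an Authorization header, rather than by posting a form. A minimal sketch in Python 3 names (the URL and credentials are placeholders):

```python
import urllib.request

# Basic HTTP authentication lives at the web-server level: the credentials
# go into an Authorization header, not into a login form
password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None, 'http://example.com/', 'user', 'secret')
opener = urllib.request.build_opener(
    urllib.request.HTTPBasicAuthHandler(password_mgr))
# opener.open('http://example.com/protected')  # would answer a 401 challenge
print(password_mgr.find_user_password(None, 'http://example.com/'))
```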

Form-based authentication with Python

I'm trying to use a code read in Kent's Korner for Form-based authentication. At least I'm told the web site I'm trying to read is form-based authenticated.
But I don't seem to be able to get past the login page. The code I'm using is
import urllib, urllib2, cookielib, string
# configure an opener that will handle cookies
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())
urllib2.install_opener(opener)
# use the opener to POST to the login form and the protected page
params = urllib.urlencode(dict(username='user', password='stuff'))
f = opener.open('http://www.hammernutrition.com/forums/memberlist.php?mode=viewprofile&u=1323', params)
data = f.read()
f.close()
f = opener.open('http://www.hammernutrition.com/forums/memberlist.php?mode=viewprofile&u=1323')
data = f.read()
f.close()
You can simulate a web browser in Python, without using too many resources, with mechanize
(the Debian/Ubuntu package is called python-mechanize). It handles both cookies and submitting forms, just the way a web browser would; one good example is the Python Dropbox Uploader script, which you can adapt to your needs.
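Part of what mechanize automates is reading the form itself: a common reason form-based logins fail is a hidden field (such as a CSRF token) that must be echoed back with the credentials. A minimal stdlib sketch of extracting every input field from a login form (the HTML here is a made-up example):

```python
from html.parser import HTMLParser

# collect the name/value pairs of all <input> elements, the way a browser
# (or mechanize) would when submitting the form
class FormFieldParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag == 'input':
            attrs = dict(attrs)
            if 'name' in attrs:
                self.fields[attrs['name']] = attrs.get('value', '')

html = '''<form action="/login" method="post">
  <input type="text" name="username">
  <input type="password" name="password">
  <input type="hidden" name="csrf_token" value="abc123">
</form>'''
p = FormFieldParser()
p.feed(html)
print(p.fields)
```

Filling in `username` and `password` while carrying the hidden `csrf_token` along unchanged is what makes the POST look like a real browser submission.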

Python: How do you login to a page and view the resulting page in a browser?

I've been googling around for quite some time now and can't seem to get this to work. A lot of my searches have pointed me to finding similar problems but they all seem to be related to cookie grabbing/storing. I think I've set that up properly, but when I try to open the 'hidden' page, it keeps bringing me back to the login page saying my session has expired.
import urllib, urllib2, cookielib, webbrowser
username = 'userhere'
password = 'passwordhere'
url = 'http://example.com'
webbrowser.open(url, new=1, autoraise=1)
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
login_data = urllib.urlencode({'username' : username, 'j_password' : password})
opener.open('http://example.com', login_data)
resp = opener.open('http://example.com/afterlogin')
print resp
webbrowser.open(url, new=1, autoraise=1)
First off, when doing cookie-based authentication, you need to have a CookieJar to store your cookies in, much in the same way that your browser stores its cookies in a place where it can find them again.
After opening a login page through Python and saving the cookie from a successful login, you would use MozillaCookieJar to write the python-created cookies in a format a Firefox browser can parse. However, Firefox 3.x no longer uses the cookie format that MozillaCookieJar produces, and I have not been able to find viable alternatives.
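For the record, MozillaCookieJar reads and writes the Netscape cookies.txt format, which is how you persist a python-created session to disk and load it back later (Python 3 module names; the cookie values below are made up):

```python
import http.cookiejar
import os
import tempfile
import time

jar = http.cookiejar.MozillaCookieJar()
jar.set_cookie(http.cookiejar.Cookie(
    version=0, name='session', value='abc123',
    port=None, port_specified=False,
    domain='example.com', domain_specified=True, domain_initial_dot=False,
    path='/', path_specified=True, secure=False,
    expires=int(time.time()) + 3600, discard=False,
    comment=None, comment_url=None, rest={}))

path = os.path.join(tempfile.mkdtemp(), 'cookies.txt')
jar.save(path)            # writes the Netscape cookies.txt format

reloaded = http.cookiejar.MozillaCookieJar()
reloaded.load(path)
print([(c.name, c.value) for c in reloaded])
```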
If all you need to do is retrieve specific data in a format known in advance, then I suggest you keep all your HTTP interactions within Python. It is much easier, and you don't have to rely on specific browsers being available. If it is absolutely necessary to show stuff in a browser, you could fetch the so-called 'hidden' page through urllib2 (which incidentally integrates very nicely with cookielib), save the HTML to a temporary file, and pass this to webbrowser.open, which will then render that specific page. Further redirects are not possible.
I've generally used the mechanize library to handle stuff like this. That doesn't answer your question about why your existing code isn't working, but it's something else to play with.
The provided code calls:
opener.open('http://example.com', login_data)
but throws away the response. I would look at this response to see if it says "Bad password" or "I only accept IE" or similar.
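To make that concrete, here is a sketch against a throwaway local server whose login endpoint rejects bad credentials; the point is that the POST's response (here a 401 with an error body) tells you exactly why the login failed. The endpoint and field names are invented for the demo:

```python
import json
import threading
import urllib.error
import urllib.parse
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# stand-in login endpoint that explains failures in its response body
class LoginHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get('Content-Length', 0))
        fields = dict(urllib.parse.parse_qsl(self.rfile.read(length).decode()))
        ok = fields.get('password') == 'stuff'
        body = json.dumps({'ok': True} if ok else {'error': 'Bad password'}).encode()
        self.send_response(200 if ok else 401)
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo quiet
        pass

server = HTTPServer(('127.0.0.1', 0), LoginHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = 'http://127.0.0.1:%d/login' % server.server_address[1]

data = urllib.parse.urlencode({'username': 'user', 'password': 'wrong'}).encode()
try:
    resp = urllib.request.urlopen(url, data)
    status, body = resp.getcode(), resp.read().decode()
except urllib.error.HTTPError as e:
    status, body = e.code, e.read().decode()  # the part the original code discards
print(status, body)
server.shutdown()
```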
