I'm trying to use code I read in Kent's Korner for form-based authentication. At least I'm told the web site I'm trying to read uses form-based authentication.
But I don't seem to be able to get past the login page. The code I'm using is:
import urllib, urllib2, cookielib, string
# configure an opener that will handle cookies
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())
urllib2.install_opener(opener)
# use the opener to POST to the login form and the protected page
params = urllib.urlencode(dict(username='user', password='stuff'))
f = opener.open('http://www.hammernutrition.com/forums/memberlist.php?mode=viewprofile&u=1323', params)
data = f.read()
f.close()
f = opener.open('http://www.hammernutrition.com/forums/memberlist.php?mode=viewprofile&u=1323')
data = f.read()
f.close()
You can simulate a web browser in Python, without using too many resources, with mechanize
(the Debian/Ubuntu package is called python-mechanize). It handles both cookies and form submission, just the way a web browser would. One good example is the Python Dropbox Uploader script, which you can adapt to your needs.
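For illustration, a minimal sketch with mechanize under a few assumptions: the login URL, the form index, and the field names below are guesses, so check the actual login form before using them.
import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)   # the forum's robots.txt might otherwise block the request
br.open('http://www.hammernutrition.com/forums/ucp.php?mode=login')  # assumed login URL
br.select_form(nr=0)          # assumes the login form is the first form on the page
br['username'] = 'user'       # assumed field names; check the form's HTML
br['password'] = 'stuff'
br.submit()                   # cookies are kept inside the Browser instance
page = br.open('http://www.hammernutrition.com/forums/memberlist.php?mode=viewprofile&u=1323')
print page.read()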
Related
I have a script for Python 2 that logs in to a webpage and then moves around inside it to reach a couple of files pointed to on the same site, but on different pages. Python 2 let me open the site with my credentials and then keep the same opener around, calling opener.open() to navigate to the other pages.
Here's the code that worked in Python 2:
import urllib, urllib2, cookielib

# Your admin login and password
LOGIN = "*******"
PASSWORD = "********"
ROOT = "https:*********"
# The client has to take care of the cookies.
jar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))
# POST the login query to '/login_handler' (POST data: 'login' and 'password').
req = urllib2.Request(ROOT + "/login_handler",
                      urllib.urlencode({'login': LOGIN,
                                        'password': PASSWORD}))
opener.open(req)
# Set the right accountcode
for accountcode, queues in QUEUES.items():
    req = urllib2.Request(ROOT + "/switch_to" + accountcode)
    opener.open(req)
I need to do the same thing in Python 3. I have tried the requests module and urllib, and although I can perform the initial login, I don't know how to keep the opener around to navigate to the other pages. I found OpenerDirector, but I don't seem to know how to use it, because I haven't reached my goal.
I have also used some Python 3 code to try to get the desired result, but unfortunately I can't get the CSV file to print.
Question: how do I keep the opener so that I can navigate the site?
Python 3.6 » Documentation: urllib.request.build_opener
Use of Basic HTTP Authentication:
import urllib.request
# Create an OpenerDirector with support for Basic HTTP Authentication...
auth_handler = urllib.request.HTTPBasicAuthHandler()
auth_handler.add_password(realm='PDQ Application',
                          uri='https://mahler:8092/site-updates.py',
                          user='klem',
                          passwd='kadidd!ehopper')
opener = urllib.request.build_opener(auth_handler)
# ...and install it globally so it can be used with urlopen.
urllib.request.install_opener(opener)
f = urllib.request.urlopen('http://www.example.com/login.html')
csv_content = f.read()
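Note that Basic HTTP Authentication is a different thing from a form-based login. If the site uses a login form and cookies, the Python 2 approach above maps almost directly onto urllib.request and http.cookiejar in Python 3. A minimal sketch, reusing the LOGIN, PASSWORD, ROOT and QUEUES placeholders from the original snippet:
import urllib.parse
import urllib.request
import http.cookiejar

jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

# POST the credentials; in Python 3 the data must be bytes
data = urllib.parse.urlencode({'login': LOGIN, 'password': PASSWORD}).encode('utf-8')
opener.open(ROOT + "/login_handler", data)

# The same opener keeps the session cookie for the following requests
for accountcode, queues in QUEUES.items():
    response = opener.open(ROOT + "/switch_to" + accountcode)
    print(response.read())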
Use the Python requests library for Python 3 and its Session object.
http://docs.python-requests.org/en/master/user/advanced/#session-objects
Once you log in, your session will be managed automatically. You don't need to create your own cookie jar. Following is the sample code.
import requests

s = requests.Session()
auth = {"login": LOGIN, "password": PASSWORD}
url = ROOT + "/login_handler"
r = s.post(url, data=auth)
print(r.status_code)

for accountcode, queues in QUEUES.items():
    req = s.get(ROOT + "/switch_to" + accountcode)
    print(req.text)  # response text
Hi, I have researched this but I cannot find any answers to this question. I need to download a subdirectory of a web page into a string for a search. I know how to do this, but the only problem is that the site is encrypted and requires a login to access the directory. I know I need to send the cookies to request the download, but I am unsure how to do this. I am coding in Python. Feel free to ask for more info.
import urllib
import urllib2
import cookielib
import time
# All your cookie related things are done by this.
cookie_jar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie_jar))
urllib2.install_opener(opener)
#POST Parameters for login page.
request_body_params = {'your_parameter_name': 'its_value', 'another_parameter_name': 'its_value'}
data_encoding = urllib.urlencode(request_body_params)
url_main = 'https://your_site.com/login'
main_request = urllib2.Request(url_main, data_encoding)
#Any headers required goes here.
main_request.add_header('Accept-encoding', 'gzip')
# This is the response of login. You don't want to read this.
main_response = urllib2.urlopen(main_request)
# You want data from this link.
url_results = 'https://your_site.com/sub_directory'
results_response = urllib2.urlopen(url_results)
print results_response.read()
To check the POST parameters, go to the site in a browser, open the developer tools and go to the 'Network' tab. Then, as you log in through the browser, network requests will be logged; click on the login request and check its POST parameters and headers.
I'm using Python to scrape my school's webpage, but in order to do that I needed to simulate a user login first. Here is my code:
import requests, lxml.html
s = requests.session()
url = "https://my.emich.edu"
login = s.get(url)
login_html = lxml.html.fromstring(login.text)
hidden_inputs = login_html.xpath(r'//form//input[@type="hidden"]')
form = {x.attrib["name"]:x.attrib["value"] for x in hidden_inputs}
form["username"] = "myusernamge"
form["password"] = "mypassword"
form["submit"] = "LOGIN"
response = s.post("https://netid.emich.edu/cas/loginservice=https%3A%2F%2Fmy.emich.edu%2Fc%2Fportal%2Flogin",form)
response = s.get("http://my.emich.edu")
f = open("result.html","w")
f.write(response.text)
print response.text
I am expecting that response.text will give me my own student account page; instead it gives me a login page. Can anyone help me with this issue?
BTW this is not homework.
There are a few options here, and I think your requests approach can be made much easier by logging in manually and copying over the headers.
Use a python scripting package like http://wwwsearch.sourceforge.net/mechanize/ to scrape the site.
Use a browser emulator such as http://casperjs.org/. Using this you can basically do anything you'd be able to do in a browser.
My suggestion here would be to go to the website, log in, and then open the developer console and copy those headers/cookies into your requests headers/cookies. This way you can just hardcode the 'already-authenticated request' and it will work fine. Note that this method is the least reliable for doing robust, everyday scraping, but if you're looking for something that will be the quickest to implement and will work until the authentication runs out, use this method.
Also, you need to request the logged-in homepage (again) after you successfully do the POST.
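As an illustration, a minimal sketch of the 'already-authenticated request' approach with requests; the cookie name and value below are placeholders you would copy from the developer console after logging in manually:
import requests

s = requests.Session()
s.headers.update({"User-Agent": "Mozilla/5.0"})
# Placeholder cookie: copy the real name/value pairs from your browser's dev console
s.cookies.set("JSESSIONID", "paste-your-session-cookie-value-here", domain="my.emich.edu")

response = s.get("https://my.emich.edu")   # request the logged-in homepage
print(response.text)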
I am planning to write a website crawler in Python using Requests and PyQuery.
However, the site I am targeting requires me to be signed into my account. Using Requests, is it possible for me to establish a session with the server (using my credentials for the site), and use this session to crawl sites that I have access to only when logged in?
I hope this question is clear, thank you.
Yes it is possible.
I don't know about PyQuery, but I've made crawlers that log in to sites using urllib2.
All you need is a CookieJar to handle cookies, and to send the login form with a request.
If you ask something more specific, I will try to be more explicit too.
Later edit:
urllib2 is not a mess. It's the best library for such things in my opinion.
Here's a code snippet that will log in to a site (after that you can just parse the site normally):
import urllib
import urllib2
import cookielib
"""Adding cookie support"""
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)
"""Next we will log in to the site. The actual url will be different and also the data.
You should check the log in form to see what parameters it takes and what values.
"""
data = {'username': 'foo',
        'password': 'bar'}
data = urllib.urlencode(data)
urllib2.urlopen('http://www.siteyouwanttoparse.com/login', data) #this should log us in
"""Now you can parse the site"""
html = urllib2.urlopen('http://www.siteyouwanttoparse.com').read()
print html
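Since the question mentions Requests and PyQuery, here is roughly the same flow with those libraries; the login URL and the form field names are placeholders you have to take from the real login form:
import requests
from pyquery import PyQuery as pq

s = requests.Session()   # the Session object keeps the cookies between requests

# POST the login form; the field names depend on the actual form
s.post('http://www.siteyouwanttoparse.com/login',
       data={'username': 'foo', 'password': 'bar'})

# Any further request made through the same Session carries the login cookie
page = s.get('http://www.siteyouwanttoparse.com')
doc = pq(page.text)
print(doc('title').text())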
I am trying to write a small web-based proxy using Python. I can fetch and show normal websites, but I cannot log in to Facebook/Gmail/anything that requires a login.
I have seen some examples of authentication here:
http://docs.python.org/release/2.5.2/lib/urllib2-examples.html but I don't know how I can make a general solution for all web sites with a login. Any idea?
My code is:
def showurl():
    url = request.vars.url
    response = urllib2.urlopen(url)
    html = response.read()
    return html
Your proxy server needs to store cookies; search Stack Overflow for cookielib.
Many web sites authenticate clients in different ways, so your job is to imitate the client as closely as possible with your proxy server. Some web sites authenticate by browser type, some by creating cookies and storing a session id in them, or by other JavaScript-driven hidden content that performs some of the authentication steps.
In my limited experience, all the important stuff ends up in cookies.
This is just a flat example of how to use cookielib.
import urllib, urllib2, cookielib, getpass
username = ''
button = 'submit'
www_login = 'http://website.com'
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
opener.addheaders.append(('User-agent', 'Mozilla/4.0'))
opener.addheaders.append( ('Referer', '/dev/null') )
login_data = urllib.urlencode({'username' : username, 'password': getpass.getpass("Password:"), 'login' : button})
resp = opener.open(www_login, login_data)
print resp.read()
EDITED:
Don't confuse "Basic HTTP Authentication" with the authentication used by Facebook/Gmail, because they are different things. "Basic HTTP Authentication" or "Digest HTTP Authentication" is done by the web server, not by the web site you want to log in to.
http://www.voidspace.org.uk/python/articles/authentication.shtml#id24