I've recently written this with help from SO. Now could someone please tell me how to make it actually log on to the board? It brings everything up, just in a non-logged-in format.
import urllib
import urllib2
logindata = urllib.urlencode({'username': 'x', 'password': 'y'})
page = urllib2.urlopen("http://www.woarl.com/board/index.php", logindata)
pagesource = page.read()
print pagesource
Someone recently asked the same question you're asking. If you read through the answers to that question you'll see code examples showing you how to stay logged in while browsing a site in a Python script using only stuff in the standard library.
The accepted answer might not be as useful to you as this other answer, since the accepted answer deals with a specific problem involving redirection. However, I recommend reading through all of the answers regardless.
You probably want to look into preserving cookies from the server.
PycURL or mechanize will make this much easier for you.
If you actually look at the page, you'll see that the login link takes you to http://www.woarl.com/board/ucp.php?mode=login
That page has the login form and submits to http://www.woarl.com/board/ucp.php?mode=login again with POST.
You'll then have to extract the cookies that are probably set, and put those in a CookieJar or similar.
You probably want to create an opener with these handlers and install it into urllib2.
With these applied, your cookies are handled and you'll be redirected if the server decides it wants you somewhere else.
# Create handlers
cookieHandler = urllib2.HTTPCookieProcessor()  # needed for cookie handling
redirectionHandler = urllib2.HTTPRedirectHandler()  # needed for redirection (not needed for JavaScript redirects?)

# Create an opener with these handlers
opener = urllib2.build_opener(cookieHandler, redirectionHandler)

# Install the opener so all urllib2.urlopen calls use it
urllib2.install_opener(opener)
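With the opener installed, logging in is just a matter of POSTing the form. A rough sketch - the field names below are what phpBB boards commonly use, but check the actual form source (hidden fields such as 'sid' may also be required):

import urllib

# Hypothetical field names - verify them against the real login form
logindata = urllib.urlencode({'username': 'x',
                              'password': 'y',
                              'login': 'Login'})
urllib2.urlopen("http://www.woarl.com/board/ucp.php?mode=login", logindata)

# The session cookie now lives in the cookie handler, so subsequent
# requests through urllib2.urlopen should come back logged in.
page = urllib2.urlopen("http://www.woarl.com/board/index.php")
print page.read()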
I am trying to log in to a website using python and the requests module.
I am doing:
import requests
payload = {
    'login_Email': 'xxxxx#gmail.com',
    'login_Password': 'xxxxx'
}

with requests.Session() as s:
    p = s.post('https://www.auction4cars.com/', data=payload)
    print p.text
The problem is that the output of this just seems to be the login page and not the page AFTER login, i.e. the page says 'welcome guest' and 'please enter your username and password', etc.
I was expecting it to return the page saying something like 'thanks for logging in xxxxx' etc.
Can anyone suggest what I'm doing wrong?
EDIT:
I don't think my question is a duplicate of How to "log in" to a website using Python's Requests module? because I am using the script from the most popular answer (regarded on the thread as the answer that should be the accepted one).
I have also tried the accepted one, but my problem remains.
EDIT:
I'm confused about whether I need to do something with cookies. The URLs that I am trying to visit after logging in don't seem to contain cookie values.
EDIT:
This seems to be a similar problem to mine:
get restricted page after login using requests,urllib2 python
However, I don't see any other inputs that I need to fill out. Except: do I need to do anything with the submit button?
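(It can matter: if the real form sends hidden inputs, or the submit button has a name/value of its own, they may need to go into the payload too. A sketch with made-up field names - inspect the actual form, or a Fiddler capture, to find the real ones:)

import requests

with requests.Session() as s:
    # GET the login page first so any pre-login cookies get set
    s.get('https://www.auction4cars.com/')

    payload = {
        'login_Email': 'xxxxx#gmail.com',
        'login_Password': 'xxxxx',
        'login_Submit': 'Login',  # hypothetical submit button name/value
        'hidden_token': '...'     # hypothetical hidden field scraped from the form
    }
    p = s.post('https://www.auction4cars.com/', data=payload)
    print p.text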
I was helped with the problem here:
Replicate browser actions with a python script using Fiddler
Thank you for any input.
I'm doing a small project to help my work go by faster.
I currently have a program written in Python 3.2 that does almost all of the manual labour for me, with one exception.
I need to log on to the company website (username and password) then choose a month and year and click download.
I would like to write a little program to do that for me, so that the whole process is completely done by the program.
I have looked into it and I can only find tools for 2.x.
I have looked into urllib and I know that some of the 2.x modules are now in urllib.request.
I have even found some code to start it off, however I'm confused as to how to put it into practice.
Here is what I have found:
import urllib2
theurl = 'http://www.someserver.com/toplevelurl/somepage.htm'
username = 'johnny'
password = 'XXXXXX'
# a great password
passman = urllib2.HTTPPasswordMgrWithDefaultRealm()
# this creates a password manager
passman.add_password(None, theurl, username, password)
# because we have put None at the start it will always
# use this username/password combination for urls
# for which `theurl` is a super-url
authhandler = urllib2.HTTPBasicAuthHandler(passman)
# create the AuthHandler
opener = urllib2.build_opener(authhandler)
urllib2.install_opener(opener)
# All calls to urllib2.urlopen will now use our handler
# Make sure not to include the protocol in with the URL, or
# HTTPPasswordMgrWithDefaultRealm will be very confused.
# You must (of course) use it when fetching the page though.
pagehandle = urllib2.urlopen(theurl)
# authentication is now handled automatically for us
All Credit to Michael Foord and his page: Basic Authentication
So I changed the code around a bit and replaced all the 'urllib2' references with 'urllib.request'.
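Here is roughly what my ported version looks like (a sketch; same placeholder URL and credentials as above):

import urllib.request

theurl = 'http://www.someserver.com/toplevelurl/somepage.htm'
username = 'johnny'
password = 'XXXXXX'

# Same steps as the urllib2 version, just under urllib.request
passman = urllib.request.HTTPPasswordMgrWithDefaultRealm()
passman.add_password(None, theurl, username, password)
authhandler = urllib.request.HTTPBasicAuthHandler(passman)
opener = urllib.request.build_opener(authhandler)
urllib.request.install_opener(opener)

# Authentication is now handled automatically for us
pagehandle = urllib.request.urlopen(theurl)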
Then I learned how to open a webpage, figuring the program should open the webpage, use the login and password data to open the page, then I'll learn how to download the files from it.
import webbrowser

ie = webbrowser.get('c:\\program files\\internet explorer\\iexplore.exe')
ie.open(theurl)
(I know Explorer is garbage; I'm just using it to test, then I'll be using Chrome ;) )
But that doesn't open the page with the login data entered; it simply opens the page as though you had typed in the URL.
How do I get it to open the page with the password handle?
I sort of understand how Michael made them, but I'm not sure which to use to actually open the website.
Also, an afterthought: might I need to look into cookies?
Thanks for your time
You've got things confused here.
webbrowser is a wrapper around your actual web browser, and urllib is a library for HTTP- and URL-related stuff.
They don't know each other, and serve very different purposes.
In former IE versions, you could encode an HTTP Basic Auth username and password in the URL like so:
http(s)://Username:Password@Server/Resource.ext - I believe Firefox and Chrome still support that; IE killed it: http://support.microsoft.com/kb/834489/EN-US
If you want to emulate a browser, rather than just open a real one, take a look at mechanize: http://wwwsearch.sourceforge.net/mechanize/
Your browser doesn't know anything about the authentication you've done in Python (and that has nothing to do with whether your browser is garbage or not). The webbrowser module simply offers convenience methods for launching a browser and pointing it at a URL. You can't 'transfer' your credentials to the browser.
As for migrating from Python 2 to Python 3: the 2to3 tool can convert simple scripts like yours automatically (run it as "2to3 -w script.py" to rewrite the file in place).
They are not running in the same environment.
You need to figure out what really happens when you click the download button. Use your browser's developer tools to capture the POST the website sends, then build the same request in Python to fetch the file.
Requests is a nice library that makes this kind of thing much easier.
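A minimal sketch of that approach - every URL and field name below is a hypothetical placeholder; the real ones come from what the developer tools show:

import requests

with requests.Session() as s:
    # Log in first so the session cookie is stored on the session
    s.post('http://www.someserver.com/login',
           data={'username': 'johnny', 'password': 'XXXXXX'})

    # Replay the POST that the download button sends
    r = s.post('http://www.someserver.com/download',
               data={'month': '01', 'year': '2013'})

    # Save the response body as the downloaded file
    with open('report.dat', 'wb') as f:
        f.write(r.content)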
I would use Selenium. This is some code from a little script I have, hacked about a bit to give you an idea:
from selenium import webdriver

def get_name():
    user = 'johnny'
    passwd = 'XXXXXX'
    try:
        driver = webdriver.Remote(desired_capabilities=webdriver.DesiredCapabilities.HTMLUNIT)
        driver.get('http://www.someserver.com/toplevelurl/somepage.htm')
        assert 'Page Title' in driver.title

        # Fill in and submit the login form
        username = driver.find_element_by_name('name_of_userid_box')
        username.send_keys(user)
        password = driver.find_element_by_name('name_of_password_box')
        password.send_keys(passwd)
        submit = driver.find_element_by_name('name_of_login_button')
        submit.click()

        # Go to the page with the download button and click it
        driver.get('http://www.someserver.com/toplevelurl/page_with_download_button.htm')
        assert 'page_with_download_button title' in driver.title
        download = driver.find_element_by_name('download_button')
        download.click()
    except:
        print('process failed')
I'm new to Python, so that may not be the best code ever written, but it should give you the general idea.
Hope it helps
I am trying to access and parse a website at work using Python. The site's authorization is done via SiteMinder, so the usual urllib/urllib2 username/password approach does not work.
Does anyone have an idea how to do that?
Thanks
NoamM
Just did this - I know it's an oldie - but if anyone else is looking to do this, use the requests library. I had done this in C# before and it took mammoth amounts of code, but this is all it takes to log in to my corporate SiteMinder system - nice. The requests.session() object will persist redirects, headers and cookies, so all you need to worry about is posting the login form. I'm sure the variables will be different in your environment, but the process will be the same.
output.text will be the body of the target page you wanted to parse, which you can then run XPath (or whatever) over.
import requests

r = requests.session()

postUrl = "https://loginUrl"
params = {'USER': 'user',
          'PASSWORD': 'pass',
          'SMENC': 'ISO-8859-1',
          'SMLOCALE': 'US-EN',
          'target': '/redir.shtml?GOTO=redirecturl}',
          'smauthreason': '0'}
r.post(postUrl, data=params)

getUrl = "http://urlFromBehindLogInYouWantDataFrom"
output = r.get(getUrl)
print(output.text)
First of all, you should find out what's happening when you authenticate through SiteMinder. Perhaps there's documentation for it, but if not, it's not so hard to find out: the Network tab in Chrome's or Safari's developer tools has all the information you need - HTTP headers and cookies for every network request. Firebug can give you that as well.
Once you have a clear idea of what's happening at each step of the authentication process, it's only a matter of replicating the same behavior in your script. urllib2 has support for cookies and headers. If you need something urllib2 doesn't provide, PycURL will probably do.
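For example, a bare-bones urllib2 skeleton for that kind of replication might look like this (the URL, field names and header are placeholders for whatever you observe in the developer tools):

import urllib
import urllib2
import cookielib

jar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))

data = urllib.urlencode({'USER': 'user', 'PASSWORD': 'pass'})
req = urllib2.Request('https://loginUrl', data)
req.add_header('User-Agent', 'Mozilla/5.0')  # mimic the browser if the server cares

opener.open(req)  # cookies set by the response end up in the jar

# Later requests through the same opener send those cookies back
page = opener.open('http://pageBehindLogin')
print page.read()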
Agree with Martin - you need to just replicate what the browser does. SiteMinder will pass you a token once you've successfully authenticated. I have to do this as well; will post once I find a good way.
I am trying to access Facebook from Python :D
I want to fetch some data which requires I be logged in in order to view. I know I will require cookies and such to view said data with Python, but I am entirely clueless when it comes to cookies.
How can I use Python to login to Facebook, navigate to multiple pages and retrieve some data?
Okay. Potentially this is a very large question. Instead of using the standard API to retrieve information, you wish to screen scrape?
It's possible - although not recommended, as screen scraping is reliant upon the HTML format not changing. However, it's not an impossible task.
To get started, you want to look at opening a url:
http://docs.python.org/library/urllib2.html
It's super easy - the example on the page will show you something like this:
>>> import urllib2
>>> f = urllib2.urlopen('http://facebook.com/')
>>> print f.read()
And you see you have HTML.
Now Facebook will be smarter than your average site at circumventing this type of login (I hope), so you may want to look at handling the session manually:
import urllib2
req = urllib2.Request('http://www.facebook.com/')
req.add_header('Referer', 'http://www.lastpage.com/')
r = urllib2.urlopen(req)
All snipped from the Python docs.
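To actually stay logged in across several pages you'd add cookie handling on top of that. A rough sketch - the login URL and field names below are guesses, not Facebook's real ones:

import urllib
import urllib2
import cookielib

# Keep cookies between requests so the login "sticks"
jar = cookielib.CookieJar()
urllib2.install_opener(urllib2.build_opener(urllib2.HTTPCookieProcessor(jar)))

# Hypothetical form fields - inspect the real login form to find them
logindata = urllib.urlencode({'email': 'you@example.com', 'pass': 'secret'})
urllib2.urlopen('http://www.facebook.com/login.php', logindata)

# Subsequent urlopen calls reuse the session cookies automatically
for url in ('http://www.facebook.com/profile.php',
            'http://www.facebook.com/friends.php'):
    print urllib2.urlopen(url).read()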
I found that you can't read from some sites using Python's urllib2 (or urllib). An example...
urllib2.urlopen("http://www.dafont.com/").read()
# Returns ''
These sites work fine when you visit them with a browser, and I can even scrape them using PHP (I didn't try other languages). I have seen other sites with the same issue but can't remember the URLs at the moment.
My questions are...
What is the cause of this issue?
Any workarounds?
I believe the request gets blocked based on the User-Agent header. You can change the User-Agent using the following sample code:
USERAGENT = 'something'
HEADERS = {'User-Agent': USERAGENT}
req = urllib2.Request(URL_HERE, headers=HEADERS)
f = urllib2.urlopen(req)
s = f.read()
f.close()
Try setting a different user agent. Check the answers in this link.
I'm the guy who posted the question. I have some suspicions, but I'm not sure about them - that's why I posted the question here.
What is the cause of this issue?
I think it's due to the host blocking the urllib library using robots.txt or htaccess, but I'm not sure about it. Not even sure if it's possible.
Any workaround for this issue?
If you are on Unix, this will work...

import commands
contents = commands.getoutput("curl -s '" + url + "'")
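(On Python 3 the commands module is gone; a subprocess equivalent, as a sketch, would be:)

import subprocess

contents = subprocess.check_output(['curl', '-s', url])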