I'm trying to fill in the forms of an ajax box (just my term for that group of forms) using the mechanize module, but it doesn't seem to work. I'm not a web programmer, but as far as I know the ajax box updates itself via an 'onchange' event that is handled by the browser.
Mechanize doesn't seem to handle that: in the links list (from the iterator Browser.links) I find a URL 'javascript:AjaxRetry();' whose link text is an error message telling me that something has gone wrong.
Here is my code:
import mechanize as m
br = m.Browser()
br.open(url)
br.select_form(nr=0)
# fill in one field (in a real browser, the other forms refresh and are no longer disabled)
br.set_value(code, br.form.controls[10].name)
# how to make it refresh now?
# br.submit() doesn't work (br.click() doesn't work either; there is no clickable control at all)
Is mechanize the right module for filling in the forms of that ajax box?
I can't paste the link to the page where that ajax box is, because you have to be logged in in order to see the box.
Mechanize doesn't handle JavaScript. See this answer for more details and an alternative solution: How to properly use mechanize to scrape AJAX sites.
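In practice that alternative usually means finding the URL the JavaScript calls (watch the network tab of your browser's developer tools while the box refreshes) and opening that URL directly. A minimal sketch, assuming a hypothetical /ajax/refresh endpoint that returns JSON; the URL and query string are placeholders you would replace with whatever your own dev-tools trace shows:
import json
import mechanize as m
br = m.Browser()
# Log in / set up cookies first, exactly as in the code above.
br.open("https://example.com/login")
# Hypothetical endpoint and parameter -- copy the real URL and query string
# from the request the browser sends when the ajax box refreshes.
response = br.open("https://example.com/ajax/refresh?code=XYZ")
data = json.loads(response.read())
print(data)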
Related
From this question, the last responder seems to think that it is possible to use Python to open a webpage, let me sign in manually, go through a bunch of menus, and then let Python parse the page once I get where I want. The website has a weird sign-in procedure, so using requests and passing a username and password will not be sufficient.
However, it seems from this question that it's not a possibility.
So the question is: is it possible? If so, do you know of some example code out there?
The way to approach this problem is to log in normally with your browser's developer tools open and look at what the login request sends.
When logging in to bandcamp, for example, you can see the XHR login request that is sent, and its response, in the network tab.
From that response you can see that an identity cookie is being sent. That's probably how they identify that you are logged in. So once you have that cookie set, you are authorized to view logged-in pages.
So in your program you could log in using requests, save the cookie, and then attach that cookie to further requests.
Of course, login procedures and authorization mechanisms differ from site to site, but that's the general gist of it.
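A minimal sketch of that idea with requests; the login URL and field names are assumptions standing in for whatever the XHR request in your developer tools actually shows:
import requests
# A Session keeps the cookies the server sets (e.g. the identity cookie)
# and sends them automatically with every later request.
session = requests.Session()
# Assumed login endpoint and field names -- copy the real ones from the
# XHR request you observed in the network tab.
login = session.post(
    "https://bandcamp.com/login",
    data={"username": "me", "password": "secret"},
)
login.raise_for_status()
# The session now carries the identity cookie, so logged-in pages work.
page = session.get("https://bandcamp.com/some_private_page")
print(page.status_code)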
So when do you actually need Selenium? You need it if a lot of the page is rendered by JavaScript. requests can only fetch the HTML the server returns, so if the menus and such are rendered by JavaScript you will never see that information using requests.
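If the content really is built by JavaScript, a browser driven by Selenium can render it for you. A rough sketch; the driver and URL are only examples:
from selenium import webdriver
driver = webdriver.Firefox()   # any WebDriver works
driver.get("https://example.com/page_built_by_javascript")
# page_source is the DOM *after* the JavaScript has run, which is
# something plain requests can never give you.
html = driver.page_source
print(len(html))
driver.quit()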
I have been using mechanize to fill in a form on a website, but the site has now changed and some of the required fields seem to be hidden: they no longer show up when I print all available forms, so I cannot access them with mechanize any more.
I assume it has been modified to use more current methods (application/x-www-form-urlencoded), but I have not found a way to update my script so it can keep using this form programmatically.
From what I have read, I should be able to send a dict (key/value pairs) to the submit target directly rather than filling in the form in the first place; please correct me if I am wrong.
But I have not been able to find a way to work out which keys are required...
I would massively appreciate it if someone could point me in the right direction or put me straight in case this is no longer possible.
You cannot, in all circumstances, extract all fields a server expects.
The post target, the code handling the POST, is a black box. You cannot look inside the code that the server runs. The best information you have about what it expects is what the original form tells your browser to post. That original form consists not only of the HTML, but also of the headers that were sent with it (cookies, for example) and any JavaScript code that is run by the browser.
In many cases, parsing the HTML sent for the form is enough; that's what Mechanize (or a more modern framework like RoboBrowser) does, plus a little cookie handling and making sure typical headers such as the referrer are included. But if any JavaScript code manipulates the HTML or intercepts the form submission to add or remove data, then Mechanize and other Python form parsers cannot replicate that step.
Your options then are to:
Reverse engineer what the JavaScript code does and replicate that in Python code. The development tools of your browser can help here; observe what is being posted on the network tab, for example, or use the debugger to step through the JavaScript code to see what it does (a small sketch of this approach follows after this list).
Use an actual browser, controlled from Python. Selenium could do this for you; it can drive a desktop browser (Chrome, Firefox, etc.) or it can be used to drive a headless browser implementation such as PhantomJS. This is heavier on the resources, but will actually run the JavaScript code and let you post a form just as your browser would, in each and every way.
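For the first option, once the network tab has shown you exactly what the browser posts (including any fields the JavaScript added), you can reproduce that request yourself. A minimal sketch with requests; every URL and field name here is an assumption standing in for what your own trace shows:
import requests
session = requests.Session()
# Fetch the form page first so any cookies the form relies on are set.
session.get("https://example.com/form_page")
# Post the same fields the browser posted, including the ones that were
# hidden or added by JavaScript (copied from the network tab).
response = session.post(
    "https://example.com/form_handler",
    data={
        "visible_field": "some value",
        "hidden_token": "value copied from dev tools",
    },
    headers={"Referer": "https://example.com/form_page"},
)
print(response.status_code)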
Say I browse to a website (possibly on an intranet) that requires a login to access the contents. I fill in the required fields in the browser itself, e.g. username, password and any captcha that is needed to log in.
Once I have logged in to the site, there are lots of goodies that can be scraped from the various links and tabs on the first page after login.
Now, from this point forward (that is, after logging in from the browser), I want to control the page and the downloads from urllib2: going through it page by page, downloading PDFs and images on each page, and so on.
I understand that we could do everything from urllib2 (or mechanize) directly (that is, log in to the page and do the whole thing from code).
But for some sites it is really a pain to work out the login mechanism, the required hidden parameters, referrers, captchas, cookies and pop-ups.
Please advise. Hope my question makes sense.
In summary, I want to do the initial login manually in the web browser, and then take over with urllib2 to automate the scraping.
Did you consider Selenium? It's about browser automation instead of raw HTTP requests (urllib2), and you can manipulate the browser in between steps.
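A rough sketch of that idea, in Python 2 style to match the rest of the question (use input() instead of raw_input() on Python 3); the URLs are placeholders:
from selenium import webdriver
driver = webdriver.Firefox()                 # any WebDriver works
driver.get("https://example.com/login")
# Log in manually in the window that just opened (username, password,
# captcha, ...), then come back here and press Enter.
raw_input("Press Enter once you are logged in... ")
# From here on the same authenticated browser session is under Python's
# control: navigate, read pages, trigger downloads, and so on.
driver.get("https://example.com/members/page1")
html = driver.page_source
# If you would rather hand the session over to urllib2, the cookies the
# site set during your manual login are available here:
for c in driver.get_cookies():
    print(c['name'] + '=' + c['value'])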
You want to use the cookielib module.
http://docs.python.org/library/cookielib.html
You can log in using your browser, then export the cookies to a Netscape-style cookie.txt file. From Python you can then load this file and fetch the resources you require. The cookie will be good until the website expires your session (often around 30 days).
import cookielib, urllib2
# Load the cookies exported from the browser (Netscape/Mozilla cookie.txt format).
# For session cookies you may also need ignore_discard=True / ignore_expires=True.
cj = cookielib.MozillaCookieJar()
cj.load('cookie.txt')
# Build an opener that sends those cookies with every request.
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
r = opener.open("http://example.com/resource")
There are add-ons for Chrome and Firefox that will export the cookies in this format. For example:
https://chrome.google.com/webstore/detail/lopabhfecdfhgogdbojmaicoicjekelh
https://addons.mozilla.org/en-US/firefox/addon/export-cookies/
I am trying to log in to a forum using Python/urllib2, but I can't seem to succeed. I think it might be because there are several form objects on the login page and I submit the wrong one (the same code worked for a different forum that had a single form).
Is there a way to specify which form to submit with urllib2?
Thanks.
Here are the steps to achieve your goal:
Read the page using urllib2.
Parse the page into a DOM object (see xml.dom.minidom.parseString or an equivalent).
Check whether the page has the login form by searching for the form id, etc.
If the form is there, build the form submission in code (the HTTP headers and the request body) and post the information using urllib2 (GET or POST, or for Ajax with the extra header, as documented at W3Schools).
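A minimal Python 2 sketch of those steps; the login URL, form action and field names are assumptions that you would read off the form you located in step 3 (or off your browser's developer tools):
import cookielib
import urllib
import urllib2
# Keep cookies across requests, as a browser would.
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
# Step 1: read the login page (this may also set a session cookie).
login_page = opener.open("http://forum.example.com/login").read()
# Steps 2-3: parse login_page and locate the login form and its fields.
# (Real-world HTML is rarely valid XML, so an HTML-tolerant parser tends
# to be more forgiving than xml.dom.minidom here.)
# Step 4: post exactly the fields that form defines to its action URL.
data = urllib.urlencode({
    "username": "me",
    "password": "secret",
})
response = opener.open("http://forum.example.com/login.php", data)
print(response.getcode())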
9000 said: "I'd try to sniff/track a real exchange between browser and the site; both Chrome and FF have tools for that. I'd also consider using mechanize instead of raw urllib2"
This is the answer: mechanize is really easy to use and supports multiple forms.
Thanks!
So I have an authenticated site that I want to access via the mechanize module. I'm able to log in and then go to the page I want. However, because the page recognizes that mechanize doesn't have JavaScript enabled, it wants me to click a submit button to get redirected to a non-JavaScript part of the site. How can I simply click that button and then read the contents of the page that follows?
Or is there a way to trick it into thinking that JavaScript is enabled?
Thanks!
If that submit button really is a submit input element of the form, the redirection works as a normal form-submit action, and it's the only form on the page, then (assuming your mechanize browser instance is br) the following should work:
br.select_form(nr=0) # select the first form
br.submit()
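If you also want to read the page you end up on, or check which form is which before selecting it, something like this should work with the same br instance:
# List the forms on the page to confirm that nr=0 really is the one
# with the submit button you want.
for i, form in enumerate(br.forms()):
    print("%d %s %s" % (i, form.name, form.action))
br.select_form(nr=0)
response = br.submit()       # "clicks" the submit button
html = response.read()       # the page you were redirected to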
AFAIK there is no simple way to emulate JavaScript in mechanize; possible workarounds depend on what exactly the JavaScript is doing.