I have been using mechanize to fill in a form from a website but this now has changed and some of the required fields seem to be hidden and cannot be accessed using mechanize any longer - when printing all available forms.
I assume it has been modified to use more current methods (application/x-www-form-urlencoded) but I have not found a way to update my script to continue using this form programmatically.
From what I have read, I should be able to send a dict (key/value pair) to the submit button directly rather than filling the form in the first place - please correct me if I am wrong.
BUT I have not been able to find a way to obtain what keys are required...
I would massively appreciate it if someone could point me in the right direction or put me straight in case this is no longer possible.
You cannot, in all circumstances, extract all fields a server expects.
The post target, the code handling the POST, is a black box. You cannot look inside the code that the server runs. The best information you have about what it expects is what the original form tells your browser to post. That original form consists not only of the HTML, but also of the headers that were sent with it (cookies for example) and any JavaScript code that is run by the browser.
In many cases, parsing the HTML sent for the form is enough; that's what Mechanize (or a more modern framework such as robobrowser) does, plus a little cookie handling and making sure typical headers such as the referrer are included. But if any JavaScript code manipulates the HTML or intercepts the form submission to add or remove data, then Mechanize and other Python form parsers cannot replicate that step.
Your options then are to:
Reverse engineer what the JavaScript code does and replicate that in Python code. The development tools of your browser can help here; observe what is being posted on the network tab, for example, or use the debugger to step through the JavaScript code to see what it does.
Use an actual browser, controlled from Python. Selenium could do this for you; it can drive a desktop browser (Chrome, Firefox, etc.) or it can be used to drive a headless browser implementation such as PhantomJS. This is heavier on the resources, but will actually run the JavaScript code and let you post a form just as your browser would, in each and every way.
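For the simple case where parsing the HTML is enough, the idea of "sending a dict" can be sketched with the standard library alone. This is a minimal sketch, not what Mechanize does internally: the markup in SAMPLE_HTML, the field names and the FormFieldParser class are all made up for illustration, and the parser only looks at named input elements (it ignores select, textarea, unchecked checkboxes and anything JavaScript adds later).

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class FormFieldParser(HTMLParser):
    """Collect name/value pairs from every <input> that has a name."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attrs = dict(attrs)
        name = attrs.get("name")
        if name:
            # Inputs without a value attribute are posted as empty strings.
            self.fields[name] = attrs.get("value") or ""


# Hypothetical form markup, standing in for the page you downloaded.
SAMPLE_HTML = """
<form action="/submit" method="post">
  <input type="hidden" name="csrf_token" value="abc123">
  <input type="text" name="username">
  <input type="submit" value="Send">
</form>
"""

parser = FormFieldParser()
parser.feed(SAMPLE_HTML)

# Start from the defaults the server sent (including hidden fields),
# then fill in your own values.
payload = dict(parser.fields)
payload["username"] = "alice"

# This is the application/x-www-form-urlencoded body a browser would post.
body = urlencode(payload)
print(body)
```

The hidden fields are the keys you would otherwise be missing; if they do not appear in the downloaded HTML at all, they are probably added by JavaScript and you are back to the two options above.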
Related
I have been playing with the requests module in Python for a while as part of studying HTTP requests and responses, and I think I have grasped most of the fundamentals of the topic. With a naive analogy, it basically works on a ping-pong principle: you send a request in a packet to the server and it sends another packet back to you. For instance, logging in to a site is simply sending a POST request to the server, and I managed to do that. However, I am having trouble clicking on buttons through HTTP POST requests. I searched for it here and there, but I could not find a valid answer other than using the selenium module, which I would rather avoid if there is a way to do it with the requests module too. I am also aware that the selenium module was created for a reason.
QUESTIONS:
1) What kind of parameters do I have to take into account for being able to click on buttons or links from the account I accessed through HTTP requests? For instance, when I watch network activity for request header and response header with my browser's built-in inspect tool, I get so many parameters sent back by server, e.g. sec-fetch-dest, sec-fetch-mode, etc.
2) Is it too complicated for a beginner or is there too much advanced stuff going on behind the scene to do that so selenium was created for that reason?
Theoretically, you could write a program to do this with requests, but you would be duplicating much of the functionality that is already built and optimized in other tools and APIs. The general process would be:
Load the HTML that is normally rendered in your browser using a get request.
Process the HTML to find the button in question.
Then, if it's a simple form:
Determine the request method the button will carry out (e.g. by checking the button's formmethod attribute).
Perform the specified request with the required information in your request packet.
If it's a complex page (i.e. it uses JavaScript):
Find the button's unique identifier.
Process the JavaScript code to determine what action is performed when the button is clicked.
If possible, perform the JavaScript action using requests (e.g. following a link or something like that). I say if possible because JavaScript can do many things that, to my knowledge, simple HTTP request cannot, like changing rendered CSS in order to change the background color of a <div> when a button is clicked.
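For the simple-form branch of the steps above, finding out which request a button would trigger can be sketched roughly as follows. The page URL, the markup and the SubmitTargetParser class are hypothetical, and the sketch only handles buttons nested inside their form; it does not cover the form attribute or anything wired up by JavaScript.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class SubmitTargetParser(HTMLParser):
    """Record, for each submit button, the method and URL it would use."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.form = None    # attributes of the enclosing <form>, if any
        self.targets = []   # (button name, METHOD, absolute URL)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.form = attrs
        elif tag in ("button", "input") and self.form is not None:
            # A <button> in a form submits by default; an <input> must say so.
            default_type = "submit" if tag == "button" else "text"
            if (attrs.get("type") or default_type).lower() != "submit":
                return
            # formmethod/formaction on the button override the form's values.
            method = attrs.get("formmethod") or self.form.get("method") or "get"
            action = attrs.get("formaction") or self.form.get("action") or ""
            self.targets.append(
                (attrs.get("name"), method.upper(), urljoin(self.page_url, action))
            )

    def handle_endtag(self, tag):
        if tag == "form":
            self.form = None


# Hypothetical page, standing in for the HTML fetched with a GET request.
PAGE_URL = "https://example.com/account/settings"
SAMPLE_HTML = """
<form action="/account/update" method="post">
  <input type="text" name="email">
  <button name="save">Save</button>
  <button name="preview" formmethod="get" formaction="preview">Preview</button>
</form>
"""

parser = SubmitTargetParser(PAGE_URL)
parser.feed(SAMPLE_HTML)
for name, method, url in parser.targets:
    print(name, method, url)
```

With the method and URL known, "clicking" the button amounts to sending that request with the form's field values, plus the button's own name and value if it has them.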
You are much better off using a tool like Selenium or Beautiful Soup, as they provide APIs that do a lot of the above for you. If you've used the requests library (a third-party package, not part of the standard library) to learn about the basic HTTP request types and how they work, awesome; now move on to the plethora of excellent tools that build on plain HTTP requests to offer a more functional and robust API.
Not sure if I'll make sense or not, but here goes. In Google Chrome, if you right-click a page and go to resources, then refresh the page, you can see all the GET/POST requests pop up as they happen. I want to know if there is a way, in Python, to input a URL and have it generate a list of each GET call (not sure if this is possible).
Would love some direction on it!
Thanks
I believe I can clarify parts of your original question.
On the one hand, using the browser's built-in debugging tools to investigate how a certain website behaves when loaded by a browser is a good technique, and not easily replaceable by custom code.
On the other hand, it looks like you are looking for an HTML parser, such as BeautifulSoup.
Also, you seem to confuse the meaning of a URL and an HTML document. A URL can point to an HTML document, but in many cases it points to other things, such as a JSON API endpoint.
Assuming you actually wanted to ask how "to input a URL to an HTML document and have it generate a list of each remote resource call a browser would perform":
Before rendering a website, a web browser fires off the initial HTTP GET request and retrieves the main HTML document. It parses this document and, among other things, searches for further resources to be retrieved. Such resources may be CSS files, JavaScript files, images, iframes, ... (long list). If it finds such resources, the browser automatically fires off one HTTP GET request for each of them. As you can see, quite some work happens behind the scenes before all these requests are performed by your browser.
In Python, you cannot trivially simulate the behavior of your browser. You can easily retrieve a single HTML document via the urllib or requests module. That is, you can manually fire off a single HTTP GET request to retrieve an HTML document. Replicating the behavior of a browser would then require
to parse the HTML document in the same way the browser does,
to search the document for remote sources such as images, CSS files, ....,
to decide which remote resources to query in which order, and
then to fire off even more HTTP GET requests, and possibly recursively repeat the entire process (as would be required for iframes)
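A very rough approximation of the first two steps, for a single static document, might look like the sketch below. The base URL, the markup and the ResourceLister class are invented for illustration, and the tag-to-attribute table is deliberately short: it treats every link href as a fetch (a browser would look at rel first) and it sees nothing that JavaScript or CSS would request later.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Tag -> attribute naming a resource the browser would typically fetch.
RESOURCE_ATTRS = {"img": "src", "script": "src", "iframe": "src", "link": "href"}


class ResourceLister(HTMLParser):
    """Collect (tag, absolute URL) pairs for resources referenced in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attr = RESOURCE_ATTRS.get(tag)
        if attr is None:
            return
        value = dict(attrs).get(attr)
        if value:
            # Resolve relative references against the document's own URL.
            self.resources.append((tag, urljoin(self.base_url, value)))


# Hypothetical document, standing in for the HTML retrieved with one GET.
BASE_URL = "https://example.com/blog/post.html"
SAMPLE_HTML = """
<html><head>
<link rel="stylesheet" href="/static/site.css">
<script src="app.js"></script>
</head><body>
<img src="images/logo.png">
<iframe src="https://ads.example.net/frame.html"></iframe>
<script>console.log("inline, nothing to fetch")</script>
</body></html>
"""

lister = ResourceLister(BASE_URL)
lister.feed(SAMPLE_HTML)
for tag, url in lister.resources:
    print(tag, url)
```

Each URL in the list would need its own GET request, and the iframe would need the whole process repeated on its document, which is where the effort starts to approach that of writing a browser.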
Exact replication of browser behavior is too complex. Building a proper web browser is an inherently difficult task.
That is, if you want to understand the behavior of a website within a browser, use the browser's debugging tools.
I'm trying to interact with a HTML 4.0 website which uses heavily obfuscated javascript to hide the regular HTML elements. What I want to do is to fill out a form and read the returned results, and this is proving harder to do than expected.
When I read the page using Firebug, it gave me the source code deobfuscated, and I can then use this to do what I want to accomplish. The Firebug output showed all the regular elements of a website, such as -tags and the like, which were hidden in the original source.
I've written the rest of my application in Python, using mechanize to interact with other web services, so I'd rather use an existing Python module to do this if that's possible. The problem is not only how to read the source code in a way mechanize can understand, but also how to generate the response which the web server can interpret. Could I use regular mechanize controls even though the html code is obfuscated?
In the beginning of my project I used pywebkitgtk instead of mechanize, but ditched it because it wasn't really implemented that well in python. Most functions are missing. Would that be a sensible method perhaps, to start up a webkit-browser which I read the HTML from, and use that with mechanize?
Any help would be greatly appreciated, I'm really in a bind here. Thanks!
Edit: I tried dumping the HTML fetched from mechanize and opening that with pywebkitgtk, using load_html_string, and then evaluating the HTML that way. Unfortunately, since the document I'm trying to parse loads more resources dynamically, the script just stops, waiting for resources to be loaded. Note that I can't use webkit to load the document itself since I use mechanize's CookieJar function to allow me to log in first.
I also tried dumping the HTML from webkit, which for some reason dumped the obfuscated javascript only, while displaying the website perfectly fine. If webkit could dump the deobfuscated javascript the way Firebug does, I could work with that and form a request according to the clean code..
Rather than trying to process the page, how about just use Firebug to figure out the names of the form fields, and then use httplib or whatever to send a request with the necessary fields and settings?
If it's sent using ajax, you should be able to determine the values being sent to the server in Firebug as well.
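Once the field names are known, building the request by hand is short. The sketch below uses urllib from the Python 3 standard library rather than httplib (which is named http.client in Python 3); the URL, the field names and the build_form_post helper are placeholders for whatever Firebug shows for the real form, and nothing is sent over the network.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical values, standing in for what the browser tools reveal.
FORM_URL = "https://example.com/search"
fields = {"query": "python", "page": "1"}


def build_form_post(url, fields, referer=None):
    """Build (but do not send) a form-style POST request."""
    data = urlencode(fields).encode("ascii")
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    if referer:
        # Some servers check that the request appears to come from the form page.
        headers["Referer"] = referer
    return Request(url, data=data, headers=headers, method="POST")


request = build_form_post(FORM_URL, fields, referer=FORM_URL)
print(request.get_method(), request.full_url)
print(request.data)

# To actually send it, pass the object to urllib.request.urlopen(request).
```

If the site requires a login, the cookies from that session have to be sent along as well, for example by opening the request through an opener that shares the same cookie jar.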
As far as I know, for a new request coming from a webapp, you need to reload the page to process and respond to that request.
For example, if you want to show a comment on a post, you need to reload the page, process the comment, and then show it. What I want, however, is to be able to add comments without having to reload the web page (something like Facebook, where the comment gets added and shown without reloading the whole page). Is it possible to do this with only Django and Python, with no JavaScript/AJAX knowledge?
I have heard it's possible with AJAX (I don't know how), but I was wondering if it was possible to do with Django.
Thanks,
You want to do that without any client-side code (JavaScript and AJAX are just examples) and without reloading your page (or at least part of it)?
If that is your question, then the answer unfortunately is you can't. You need to either have client side code or reload your page.
Think about it: once the client gets the page, it will not change unless
the client requests the same page from the server and the server returns an updated one, or
the page has some client-side code (e.g. JavaScript) that updates the page.
You definitely want to use AJAX. Which means the client will need to run some javascript code.
If you don't want to learn JavaScript you can always try something like pyjamas. You can check out an example of its HttpRequest.
But I always feel that using straight javascript via a library (like jQuery) is easier to understand than trying to force one language into another one.
To do it right, AJAX would be the way to go, BUT in a limited sense you can achieve the same thing by using an iframe. An iframe is like another page embedded inside the main page, so instead of refreshing the whole page you can refresh just the inner iframe page, which may give the same effect.
More about iframe patterns you can read at
http://ajaxpatterns.org/IFrame_Call
Maybe a few iFrames and some Comet/long-polling? Have the comment submission in an iFrame (so the whole page doesn't reload), and then show the result in the long-polled iFrame...
Having said that, it's a pretty bad design idea, and you probably don't want to be doing this. AJAX/JavaScript is pretty much the way to go for things like this.
I have heard it's possible with AJAX...but I was wondering if it was possible to do with Django.
There's no reason you can't use both: specifically, AJAX within a Django web application. Django provides the organization and framework you need (and a page that will respond to AJAX requests), and you then use some JavaScript on the client side to make AJAX calls to that Django-backed page, which responds accordingly.
I suggest you go find a basic jQuery tutorial which should explain enough basic JavaScript to get this working.