Not sure if this will make sense, but here goes. In Google Chrome, if you right-click a page and go to resources, then refresh the page, you can see all the GET/POST requests pop up as they happen. I want to know if there is a way, in Python, to input a URL and have it generate a list of each GET call as it happens (not sure if this is possible).
Would love some direction on it!
Thanks
I believe I can clarify parts of your original question.
On the one hand, using the browser's built-in debugging tools to investigate how a certain website behaves when loaded by a browser is a good technique, and not easily replaced by custom code.
On the other hand, it looks like you are looking for an HTML parser, such as BeautifulSoup.
Also, you seem to confuse the meaning of a URL and an HTML document. A URL can point to an HTML document, but in many cases it points to other things, such as a JSON API endpoint.
Assuming you actually wanted to ask how "to input a URL to an HTML document and have it generate a list of each remote resource call a browser would perform":
Before rendering a website, a web browser fires off the initial HTTP GET request and retrieves the main HTML document. It parses this document and, among other things, searches for further resources to be retrieved. Such resources may be CSS files, JavaScript files, images, iframes, ... (a long list). If it finds such resources, the browser automatically fires off one HTTP GET request for each of them. As you can see, there is quite some work happening behind the scenes before all these requests are performed by your browser.
In Python, you cannot trivially simulate the behavior of your browser. You can easily retrieve a single HTML document via the urllib or requests module. That is, you can manually fire off a single HTTP GET request to retrieve an HTML document. Replicating the behavior of a browser would then require
to parse the HTML document in the same way the browser does,
to search the document for remote sources such as images, CSS files, ....,
to decide which remote resources to query in which order, and
then to fire off even more HTTP GET requests, and possibly recursively repeat the entire process (as would be required for iframes)
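Under the assumption that you only need the statically declared resources (not ones a script adds at runtime), the first two steps above can be sketched with requests and BeautifulSoup; the tag/attribute list is illustrative, not exhaustive:

```python
# Sketch: list the resource URLs a browser would request for a page,
# based only on the static HTML (JavaScript-added resources are missed).
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def extract_resources(html, base_url):
    """Collect URLs of images, scripts, stylesheets, and iframes."""
    soup = BeautifulSoup(html, "html.parser")
    resources = []
    for tag, attr in (("img", "src"), ("script", "src"),
                      ("link", "href"), ("iframe", "src")):
        for element in soup.find_all(tag):
            ref = element.get(attr)
            if ref:
                # Resolve relative references against the document's URL.
                resources.append(urljoin(base_url, ref))
    return resources

def list_resources(url):
    """Fire the initial GET request, then parse the retrieved document."""
    return extract_resources(requests.get(url, timeout=10).text, url)
```

Note that this covers only the first level of the process; iframes would require recursing into each fetched document, and anything requested by JavaScript is invisible to this approach.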
Exact replication of browser behavior is too complex. Building a proper web browser is an inherently difficult task.
That is, if you want to understand the behavior of a website within a browser, use the browser's debugging tools.
I have been playing with the requests module in Python for a while as part of studying HTTP requests/responses, and I think I have grasped most of the fundamentals of the topic. With a naive analogy, it basically works on a ping-pong principle: you send a request in a packet to the server, and it sends another packet back to you. For instance, logging in to a site is simply sending a POST request to the server, which I managed to do. However, what I am having trouble with is clicking buttons through an HTTP POST request. I searched for it here and there, but I could not find a valid answer to my inquiry other than using the selenium module, which is what I do not want to do if there is another way with the requests module. I am also aware that the selenium module was created for exactly this kind of thing.
QUESTIONS:
1) What kind of parameters do I have to take into account to be able to click buttons or links from the account I accessed through HTTP requests? For instance, when I watch the network activity for request and response headers with my browser's built-in inspect tool, I see so many parameters sent back by the server, e.g. sec-fetch-dest, sec-fetch-mode, etc.
2) Is it too complicated for a beginner, or is there so much advanced stuff going on behind the scenes that selenium was created for exactly that reason?
Theoretically, you could write a program to do this with requests, but you would be duplicating much of the functionality that is already built and optimized in other tools and APIs. The general process would be:
Load the HTML that is normally rendered in your browser using a get request.
Process the HTML to find the button in question.
Then, if it's a simple form:
Determine the request method the button will carry out (e.g. using the formmethod argument, see here).
Perform the specified request with the required information in your request packet.
If it's a complex page (i.e. it uses JavaScript):
Find the button's unique identifier.
Process the JavaScript code to determine what action is performed when the button is clicked.
If possible, perform the JavaScript action using requests (e.g. following a link or something like that). I say if possible because JavaScript can do many things that, to my knowledge, simple HTTP requests cannot, like changing rendered CSS in order to change the background color of a <div> when a button is clicked.
You are much better off using a tool like selenium or Beautiful Soup, as they provide APIs that do a lot of the above for you. If you've used the built-in requests library to learn about the basic HTTP request types and how they work, awesome -- now move on to the plethora of excellent tools that wrap requests up in a more functional and robust API.
I have been using mechanize to fill in a form on a website, but the site has now changed, and some of the required fields seem to be hidden and can no longer be accessed using mechanize when printing all available forms.
I assume it has been modified to use more current methods (application/x-www-form-urlencoded), but I have not found a way to update my script so it can keep using this form programmatically.
From what I have read, I should be able to send a dict (key/value pair) to the submit button directly rather than filling the form in the first place - please correct me if I am wrong.
BUT I have not been able to find a way to obtain what keys are required...
I would massively appreciate it if someone could point me in the right direction or put me straight in case this is no longer possible.
You cannot, in all circumstances, extract all fields a server expects.
The post target, the code handling the POST, is a black box. You cannot look inside the code that the server runs. The best information you have about what it expects is what the original form tells your browser to post. That original form consists not only of the HTML, but also of the headers that were sent with it (cookies for example) and any JavaScript code that is run by the browser.
In many cases, parsing the HTML sent for the form is enough; that's what Mechanize (or a more modern framework like robobrowser) does, plus a little cookie handling and making sure typical headers such as the referrer are included. But if any JavaScript code manipulates the HTML or intercepts the form submission to add or remove data, then Mechanize or other Python form parsers cannot replicate that step.
Your options then are to:
Reverse engineer what the JavaScript code does and replicate that in Python code. The development tools of your browser can help here; observe what is being posted on the network tab, for example, or use the debugger to step through the JavaScript code to see what it does.
Use an actual browser, controlled from Python. Selenium could do this for you; it can drive a desktop browser (Chrome, Firefox, etc.) or it can be used to drive a headless browser implementation such as PhantomJS. This is heavier on the resources, but will actually run the JavaScript code and let you post a form just as your browser would, in each and every way.
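As a starting point for the first option, you can at least dump every field the static HTML form itself declares, hidden inputs included; comparing that against what the browser actually posts (network tab) shows which keys JavaScript added. A sketch using BeautifulSoup (the parser choice is an assumption; any HTML parser works):

```python
# Sketch: list every field (hidden ones included) that the static HTML
# form would post. Fields added by JavaScript will NOT appear here.
from bs4 import BeautifulSoup

def form_fields(html):
    """Map field names to their default values for the first <form>."""
    form = BeautifulSoup(html, "html.parser").find("form")
    fields = {}
    for element in form.find_all(["input", "textarea", "select"]):
        name = element.get("name")
        if name:
            fields[name] = element.get("value", "")
    return fields
```

Any key the server expects but that is missing from this dict is being supplied by JavaScript, cookies, or headers instead.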
The basic idea is that the web application fetches an external website and overlays it with some JavaScript for additional functionality.
However, the links on the webpage I fetched shouldn't navigate to the external website, but stay on my website. I figured that converting the links with regular expressions (or a similar method) would be inefficient, as it would not cover dynamically generated links, such as AJAX requests or other JavaScript functionality. So basically, what I can't seem to find is a method to change/intercept/redirect all links of the scraped website.
So, what is a (good) way to change/intercept the dynamically generated links of a scraped website? Preferably a python-method.
Unless you're changing the URLs on the scraped web page (including dynamic ones), you can't do what you ask.
If a client is served a web page with a URL pointing to an external site, your website will have no opportunity to intercept this or change it, since their browser will navigate away without even going to your site (not strictly true though - read on). Theoretically, you could attach event handlers to all links (before serving up the scraped page), and even intercept dynamically created ones (by parsing their JavaScript), but this might prove pretty difficult. You also have to stop other methods of changing the URL as well (like header redirects).
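The static half of that idea (rewriting the links already present in the scraped HTML before serving it) can at least be sketched; the /proxy?url= prefix below is a made-up example of how your site might route the traffic back through itself:

```python
# Sketch: rewrite every static <a href> in scraped HTML so it points back
# at your own site via a hypothetical /proxy?url= endpoint. Links created
# later by JavaScript are untouched -- that is the hard part.
from urllib.parse import quote, urljoin

from bs4 import BeautifulSoup

def rewrite_links(html, base_url, proxy_prefix="/proxy?url="):
    soup = BeautifulSoup(html, "html.parser")
    for anchor in soup.find_all("a", href=True):
        absolute = urljoin(base_url, anchor["href"])  # resolve relative links
        anchor["href"] = proxy_prefix + quote(absolute, safe="")
    return str(soup)
```

This handles only anchors in the delivered markup; dynamically generated navigation would still escape it, which is the answer's point.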
Clients themselves can use proxies in their browsers (that affect all outgoing URLs), but this is the client deciding that all traffic should be routed through a proxy server. You can't do this on their behalf (without actually changing the URLs).
EDIT: Since OP removed suggestion of using a web proxy, the answer details change slightly, but the end result is the same. For all practical purposes, it's nearly impossible to do this.
You could try parsing the JavaScript on the page and be successful for some pages (or, with a sophisticated enough script, possibly for many typical pages); but throw in one little eval on the page, and you'll need your own JavaScript engine written in JavaScript to try to figure out every possible external request on a page. ...and even then you couldn't do it.
Basically, give me a script which someone says can parse any webpage (including javascript) to intercept any external calls, and I'll give you a webpage that this script won't work for. Disclaimer: I'm speaking about intercepting the links, but letting the site function normally after...not just parsing the page to remove all javascript entirely.
Someone else may be able to provide you an answer that works sometimes on some web pages - maybe that would be good enough for your purposes.
Also, have you considered that most JavaScript on a page isn't embedded, but rather loaded via <script> tags, or possibly even loaded dynamically, from the original server? I assume you'd want to distinguish "stuff loaded from the original server needed to make the page function and look correct" from "stuff loaded from the original server for other things". How does your program "know" this?
You could try parsing the page and removing all JavaScript... but even this would be very difficult, since there are still tricky ways of getting around it.
I'm trying to interact with an HTML 4.0 website which uses heavily obfuscated JavaScript to hide the regular HTML elements. What I want to do is fill out a form and read the returned results, and this is proving harder than expected.
When I read the page using Firebug, it gave me the source code deobfuscated, and I can then use this to do what I want to accomplish. The Firebug output showed all the regular elements of a website, such as -tags and the like, which were hidden in the original source.
I've written the rest of my application in Python, using mechanize to interact with other web services, so I'd rather use an existing Python module to do this if that's possible. The problem is not only how to read the source code in a way mechanize can understand, but also how to generate the response which the web server can interpret. Could I use regular mechanize controls even though the html code is obfuscated?
At the beginning of my project I used pywebkitgtk instead of mechanize, but ditched it because it wasn't really implemented that well in Python; most functions are missing. Would it perhaps be a sensible method to start up a webkit browser which I read the HTML from, and use that with mechanize?
Any help would be greatly appreciated, I'm really in a bind here. Thanks!
Edit: I tried dumping the HTML fetched with mechanize and opening that with pywebkitgtk, using load_html_string, and then evaluating the HTML that way. Unfortunately, since the document I'm trying to parse loads more resources dynamically, the script just stops, waiting for resources to be loaded. Note that I can't use webkit to load the document itself, since I use mechanize's CookieJar function to let me log in first.
I also tried dumping the HTML from webkit, which for some reason dumped only the obfuscated JavaScript, while displaying the website perfectly fine. If webkit could dump the deobfuscated JavaScript the way Firebug does, I could work with that and form a request according to the clean code.
Rather than trying to process the page, how about just using Firebug to figure out the names of the form fields, and then using httplib or whatever to send a request with the necessary fields and settings?
If it's sent using ajax, you should be able to determine the values being sent to the server in Firebug as well.
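To illustrate that approach with requests (the modern successor to httplib): the URL and field names below are placeholders for whatever Firebug's network tab actually shows. Building a prepared request first lets you inspect exactly what would go over the wire before sending it:

```python
# Sketch: replay a form/AJAX POST directly, with field names copied from
# Firebug's network tab. URL and fields here are made-up placeholders.
import requests

req = requests.Request(
    "POST",
    "https://example.com/search",           # the form's action URL
    data={"query": "test", "page": "1"},    # fields observed in Firebug
).prepare()

print(req.method, req.url)   # what would be sent
print(req.body)              # the urlencoded payload

# To actually send it:
# response = requests.Session().send(req)
```

If the site checks for AJAX-style requests, adding the headers seen in Firebug (e.g. X-Requested-With) to the Request is usually enough.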