The basic idea is that the web application fetches an external website and overlays some JavaScript on it for additional functionality.
However, the links on the fetched webpage shouldn't navigate to the external website, but stay on my website. I figured that converting the links with regular expressions (or a similar method) would be insufficient, as it would not cover dynamically generated links, such as those created by AJAX requests or other JavaScript functionality. So basically, what I can't seem to find is a method to change/intercept/redirect all links of the scraped website.
So, what is a (good) way to change/intercept the dynamically generated links of a scraped website? Preferably a Python-based method.
Unless you're changing the URLs on the scraped web page (including dynamic ones), you can't do what you ask.
If a client is served a web page with a URL pointing to an external site, your website will have no opportunity to intercept or change it, since their browser will navigate away without ever returning to your site (not strictly true, though - read on). Theoretically, you could attach event handlers to all links (before serving up the scraped page), and even intercept dynamically created ones (by parsing their JavaScript), but this might prove pretty difficult. You also have to stop other methods of changing the URL (like header redirection).
Clients themselves can use proxies in their browsers (that affect all outgoing URLs), but this is the client deciding that all traffic should be routed through a proxy server. You can't do this on their behalf (without actually changing the URLs).
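Rewriting the static links server-side, at least, is straightforward. A minimal sketch using only the standard library's html.parser (the /proxy?url= prefix is a hypothetical route on your own site, and this deliberately handles only static hrefs - the JavaScript-generated links are exactly the hard part discussed above):

```python
from html.parser import HTMLParser
from urllib.parse import quote

# Hypothetical route on our own site that external links get routed through.
PROXY_PREFIX = "/proxy?url="

class LinkRewriter(HTMLParser):
    """Rewrites static <a href> attributes; JS-generated links are untouched."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        parts = []
        for name, value in attrs:
            # Route absolute external links back through our own site.
            if tag == "a" and name == "href" and value and value.startswith("http"):
                value = PROXY_PREFIX + quote(value, safe="")
            parts.append(f'{name}="{value}"')
        self.out.append("<" + " ".join([tag] + parts) + ">")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

def rewrite_links(html: str) -> str:
    parser = LinkRewriter()
    parser.feed(html)
    return "".join(parser.out)
```

This catches only what appears in the served HTML; any URL assembled by JavaScript at runtime passes through untouched.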
EDIT: Since OP removed suggestion of using a web proxy, the answer details change slightly, but the end result is the same. For all practical purposes, it's nearly impossible to do this.
You could try parsing the JavaScript on the page and succeed for some pages (or, with a sophisticated enough script, for many typical pages); but throw in one little eval on the page, and you'll need your own JavaScript engine to try to figure out every possible external request on a page. ...and even then you couldn't do it.
Basically, give me a script which someone says can parse any webpage (including javascript) to intercept any external calls, and I'll give you a webpage that this script won't work for. Disclaimer: I'm speaking about intercepting the links, but letting the site function normally after...not just parsing the page to remove all javascript entirely.
Someone else may be able to provide you an answer that works sometimes on some web pages - maybe that would be good enough for your purposes.
Also, have you considered that most JavaScript on a page isn't embedded, but rather loaded via <script> tags, or possibly even loaded dynamically, from the original server? I assume you'd want to distinguish "stuff loaded from the original server needed to make the page function and look correct" from "stuff loaded from the original server for other things". How does your program "know" the difference?
You could try parsing the page and removing all JavaScript... but even this would be very difficult, since there are still tricky ways of getting around it.
Not sure if I'll make sense or not, but here goes. In Google Chrome, if you right-click a page, open the developer tools, and go to the Network panel, then refresh the page, you can see all the GET/POST requests pop up as they happen. I want to know if there is a way, in Python, to input a URL and have it generate a list of each GET call made (not sure if that's possible).
Would love some direction on it!
Thanks
I believe I can clarify parts of your original question.
On the one hand, using the browser's built-in debugging tools to investigate how a certain website behaves when loaded is a good technique, and not easily replaceable by custom code.
On the other hand, it looks like you are looking for an HTML parser, such as BeautifulSoup.
Also, you seem to conflate a URL with an HTML document. A URL can point to an HTML document, but in many cases it points to other things, such as a JSON API endpoint.
Assuming you actually wanted to ask how "to input a URL to an HTML document and have it generate a list of each remote resource call a browser would perform":
Before rendering a website, a web browser fires off an initial HTTP GET request and retrieves the main HTML document. It parses this document and, among other things, searches for further resources to be retrieved. Such resources may be CSS files, JavaScript files, images, iframes, ... (it's a long list). If it finds such resources, the browser automatically fires off one HTTP GET request for each of them. As you can see, quite some work happens behind the scenes before all these requests are performed by your browser.
In Python, you cannot trivially simulate the behavior of your browser. You can easily retrieve a single HTML document via the urllib or requests module. That is, you can manually fire off a single HTTP GET request to retrieve an HTML document. Replicating the behavior of a browser would then require
to parse the HTML document in the same way the browser does,
to search the document for remote resources such as images and CSS files,
to decide which remote resources to query and in which order, and
to then fire off even more HTTP GET requests, and possibly repeat the entire process recursively (as would be required for iframes).
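The parsing and searching steps above can be sketched with the standard library alone. Note the limitations: the tag/attribute map below is a simplification, and any resource loaded dynamically by JavaScript is invisible to this approach, which is exactly why it falls short of real browser behavior:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Tags and attributes that commonly trigger follow-up GET requests
# (a simplification; real browsers handle many more cases).
RESOURCE_ATTRS = {"script": "src", "img": "src", "iframe": "src", "link": "href"}

class ResourceCollector(HTMLParser):
    """Collects URLs of statically declared sub-resources on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.resources = []

    def handle_starttag(self, tag, attrs):
        wanted = RESOURCE_ATTRS.get(tag)
        if wanted:
            for name, value in attrs:
                if name == wanted and value:
                    # Resolve relative URLs against the document's URL.
                    self.resources.append(urljoin(self.base_url, value))

def list_resources(html, base_url):
    collector = ResourceCollector(base_url)
    collector.feed(html)
    return collector.resources
```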
Exact replication of browser behavior is too complex. Building a proper web browser is an inherently difficult task.
That is, if you want to understand the behavior of a website within a browser, use the browser's debugging tools.
I have been using mechanize to fill in a form on a website, but the site has since changed and some of the required fields now seem to be hidden and can no longer be accessed using mechanize - they don't show up when printing all available forms.
I assume it has been modified to use more current methods (application/x-www-form-urlencoded) but I have not found a way to update my script to continue using this form programmatically.
From what I have read, I should be able to send a dict (key/value pair) to the submit button directly rather than filling the form in the first place - please correct me if I am wrong.
BUT I have not been able to find a way to obtain what keys are required...
I would massively appreciate it if someone could point me in the right direction or put me straight in case this is no longer possible.
You cannot, in all circumstances, extract all fields a server expects.
The post target, the code handling the POST, is a black box. You cannot look inside the code that the server runs. The best information you have about what it expects is what the original form tells your browser to post. That original form consists not only of the HTML, but also of the headers that were sent with it (cookies for example) and any JavaScript code that is run by the browser.
In many cases, parsing the HTML sent for the form is enough; that's what Mechanize (or a more recent framework like RoboBrowser) does, plus a little cookie handling and making sure typical headers such as the referrer are included. But if any JavaScript code manipulates the HTML or intercepts the form submission to add or remove data, then Mechanize and other Python form parsers cannot replicate that step.
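When parsing the HTML is enough, the fields (including hidden ones) can be collected directly. A minimal sketch using only html.parser from the standard library; a real script would then post the resulting dict to the form's action URL, e.g. with requests.post(action_url, data=fields):

```python
from html.parser import HTMLParser

class FormFieldCollector(HTMLParser):
    """Collects name/value pairs of <input> fields, including hidden ones."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            attr = dict(attrs)
            # Only named inputs are submitted; missing value means empty string.
            if attr.get("name"):
                self.fields[attr["name"]] = attr.get("value") or ""

def form_fields(html):
    collector = FormFieldCollector()
    collector.feed(html)
    return collector.fields
```

This recovers only what the server sent; any field that JavaScript injects or rewrites before submission will be missing or stale in the result.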
Your options then are to:
Reverse engineer what the JavaScript code does and replicate that in Python code. Your browser's development tools can help here; observe what is being posted on the Network tab, for example, or use the debugger to step through the JavaScript code to see what it does.
Use an actual browser, controlled from Python. Selenium could do this for you; it can drive a desktop browser (Chrome, Firefox, etc.) or it can be used to drive a headless browser implementation such as PhantomJS. This is heavier on the resources, but will actually run the JavaScript code and let you post a form just as your browser would, in each and every way.
I'm trying to interact with a HTML 4.0 website which uses heavily obfuscated javascript to hide the regular HTML elements. What I want to do is to fill out a form and read the returned results, and this is proving harder to do than expected.
When I read the page using Firebug, it gave me the deobfuscated source code, and I can use that to accomplish what I want. The Firebug output showed all the regular elements of a website, including tags that were hidden in the original source.
I've written the rest of my application in Python, using mechanize to interact with other web services, so I'd rather use an existing Python module to do this if that's possible. The problem is not only how to read the source code in a way mechanize can understand, but also how to generate the response which the web server can interpret. Could I use regular mechanize controls even though the html code is obfuscated?
In the beginning of my project I used pywebkitgtk instead of mechanize, but ditched it because it wasn't really implemented that well in python. Most functions are missing. Would that be a sensible method perhaps, to start up a webkit-browser which I read the HTML from, and use that with mechanize?
Any help would be greatly appreciated, I'm really in a bind here. Thanks!
Edit: I tried dumping the HTML fetched by mechanize and opening it with pywebkitgtk, using load_html_string, and then evaluating the HTML that way. Unfortunately, since the document I'm trying to parse loads more resources dynamically, the script just stops, waiting for those resources to load. Note that I can't use WebKit to load the document itself, since I use mechanize's CookieJar to log in first.
I also tried dumping the HTML from WebKit, which for some reason dumped only the obfuscated JavaScript, while displaying the website perfectly fine. If WebKit could dump the deobfuscated JavaScript the way Firebug does, I could work with that and form a request according to the clean code.
Rather than trying to process the page, how about just using Firebug to figure out the names of the form fields, and then using httplib or whatever to send a request with the necessary fields and settings?
If it's sent using ajax, you should be able to determine the values being sent to the server in Firebug as well.
As far as I know, when a new request comes from a web app, you need to reload the page to process and respond to that request.
For example, if you want to show a comment on a post, you need to reload the page, process the comment, and then show it. What I want, however, is to be able to add comments (something like Facebook, where the comment gets added and shown without the whole page reloading) without having to reload the web page. Is it possible to do this with only Django and Python, with no JavaScript/AJAX knowledge?
I have heard it's possible with AJAX (I don't know how), but I was wondering if it was possible to do with Django.
Thanks,
You want to do that without any client-side code (JavaScript and AJAX are just examples) and without reloading your page (or at least part of it)?
If that is your question, then the answer, unfortunately, is that you can't. You need either client-side code or a page reload.
Think about it: once the client gets the page, it will not change unless
the client requests the same page from the server and the server returns an updated one, or
the page has some client-side code (e.g. JavaScript) that updates the page.
You definitely want to use AJAX. Which means the client will need to run some javascript code.
If you don't want to learn JavaScript, you can always try something like Pyjamas. You can check out an example of its HttpRequest here.
But I always feel that using straight JavaScript via a library (like jQuery) is easier to understand than trying to force one language into another.
To do it right, AJAX would be the way to go. BUT, in a limited sense, you can achieve the same thing with an iframe. An iframe is like another page embedded inside the main page, so instead of refreshing the whole page you can refresh just the inner iframe, which may give the same effect.
You can read more about iframe patterns at
http://ajaxpatterns.org/IFrame_Call
Maybe a few iframes and some Comet/long-polling? Put the comment submission in an iframe (so the whole page doesn't reload), and then show the result in the long-polled iframe...
Having said that, it's a pretty bad design idea, and you probably don't want to be doing this. AJAX/JavaScript is pretty much the way to go for things like this.
I have heard it's possible with AJAX...but I was wondering if it was possible to do with Django.
There's no reason you can't use both - specifically, AJAX within a Django web application. Django provides your organization and framework needs (and a page that will respond to AJAX requests); you then use some JavaScript on the client side to make AJAX calls to your Django-backed page, which will respond correctly.
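Server-side, the Django view's job reduces to returning JSON that the client-side JavaScript can render into the page. A framework-free sketch of that logic (in a real Django app, add_comment would be a view returning JsonResponse and COMMENTS would be a database model; both names here are hypothetical):

```python
import json

# Hypothetical in-memory comment store; a real Django app would use a model.
COMMENTS = []

def add_comment(post_body: bytes) -> str:
    """Handle the body of an AJAX POST and return the JSON the client renders.

    The browser-side JavaScript submits the comment with an XMLHttpRequest
    (or jQuery's $.post) and inserts the returned comment into the DOM,
    so the page never reloads.
    """
    data = json.loads(post_body)
    comment = {"id": len(COMMENTS) + 1, "text": data["text"]}
    COMMENTS.append(comment)
    return json.dumps({"ok": True, "comment": comment})
```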
I suggest you go find a basic jQuery tutorial which should explain enough basic JavaScript to get this working.