I'm trying to scrape search results from a number of websites. The problem is that not all of these sites return their search results as plain HTML; much of it is generated dynamically with JS, AJAX, etc. However, I can see exactly what I need by looking at the page with the Firefox inspector, since the scripts have all run and modified the HTML.
My question is: is there a way to download a webpage AFTER allowing the scripts to run, or at least to get them to run locally? That way, I'd get the final HTML.
For reference, I'm using Python.
Possible duplicate, although that question is about PHP and JS.
Sure. You have to provide an environment for the scripts (JS) to run in, and often return a test value to the target server. That is not easy to do from server-side languages, so today we mostly rely on the browser-driving or browser-imitating tools mentioned there.
I've found a Python analog to the v8js PHP plugin: PyV8.
PyV8 is a Python wrapper for Google's V8 engine. It acts as a bridge between Python and JavaScript objects and supports hosting the V8 engine in a Python script.
If properly configured, your scraper:
Gets the site's JS
Evaluates this JS through the given plugin
Gets access to the target HTML for further parsing.
Related
I want to make a script in python that interacts with a webpage that has quite a lot of javascript in it (it's a webpage that computes a bunch of physics stuff).
I don't want my code to break if the page formatting changes, and I want it to run offline, so I would prefer my script to run on a local HTML copy of the page (all the JS code is accessible in the HTML source; there is no call to an external server). I wanted to use the requests library to do it, but it only works with URLs. Is there any library to do this? Note that I want to interact with the HTML (input values, look at the outputs, etc.). I know that I can parse the file, but that's not what I'm asking. I'm also totally new to web bots or anything related.
Right now I can open my .html version of the page offline with Chrome and interact with it, so there has to be a way to automate this somehow. I'm also not against using something other than Python if there is a better library for this in another language.
Interesting question. The best way I can think of is to serve the page with a web framework and then scrape the data using requests. I am familiar with Flask and it's simple to use, but I'm sure there are other options as well.
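Since the saved page already opens in Chrome offline, another option is to drive a browser against a file:// URL. Below is a minimal sketch of that idea, assuming Selenium and a Chrome driver are installed. The helper names and the element IDs passed in are hypothetical; you would read the real IDs from your saved HTML.

```python
from pathlib import Path


def local_page_url(path):
    """Turn a path to a saved .html file into a file:// URL a browser can open."""
    return Path(path).resolve().as_uri()


def compute_offline(html_path, input_id, value, button_id, output_id):
    """Type `value` into one field, click a button, and return the output text.

    The element IDs are hypothetical placeholders, not taken from any real page.
    """
    # Imported lazily so local_page_url() works even without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    try:
        driver.get(local_page_url(html_path))  # the page's own JS runs in the browser
        field = driver.find_element(By.ID, input_id)
        field.clear()
        field.send_keys(str(value))
        driver.find_element(By.ID, button_id).click()
        return driver.find_element(By.ID, output_id).text
    finally:
        driver.quit()  # always close the browser, even on errors
```

No web server is involved here; the browser reads the file directly, so the script keeps working offline.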
I am trying to write a web scraper in Python, but I have an issue: the contents of the site are not coded into the HTML. They seem to come from a different source. Is there a Python library that can fetch the contents for me? If such a tool exists in another language, I'm willing to learn it.
See: Is this possible to load the page after the javascript execute using python?
You'll have to execute the JS and whatever else it is that generates the HTML you want. You can do this in a lot of ways, but the answer I linked above suggests using Selenium Web Driver.
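As a rough sketch of the Selenium approach, assuming Selenium and Firefox with its driver are installed: the browser loads the page, runs its scripts, and hands back the resulting DOM as HTML. The function names and the driver_factory parameter are my own, added so the browser can be swapped out.

```python
def _headless_firefox():
    """Start Firefox without a visible window (requires Selenium and geckodriver)."""
    # Imported lazily so this module loads even where Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    options.add_argument("-headless")
    return webdriver.Firefox(options=options)


def fetch_rendered_html(url, driver_factory=_headless_firefox):
    """Load `url` in a browser and return the HTML after its scripts have run."""
    driver = driver_factory()
    try:
        driver.get(url)            # with default settings, waits for the page load event
        return driver.page_source  # serialized current DOM, not the raw response body
    finally:
        driver.quit()              # always shut the browser down
```

Content that arrives through AJAX after the load event may not be in page_source yet. Selenium's WebDriverWait can be used to wait for a specific element before reading it.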
Hi everyone, I would like to create a small bot to help me with binary options. I am not an expert in Python, but I can currently read a web page and retrieve a precise value from a tag. However, the information I need is in a web application and not in the source code of the web page. I am not an expert in web applications, and I want to know whether I can retrieve a value displayed in the application with Python.
Here is a link to a picture of the application:
"http://comparatif-options-binaires.fr/wp-content/uploads/2014/05/optionweb-analyse-technique-ow-school.jpg"
I think the problem you face here is that the value you need is being loaded via JavaScript of some sort (though without access to the web application, and with no code shown in your question, I can't be sure).
Expanding on @sabhirams' answer (and agreeing that requests and BeautifulSoup are excellent libraries for static text), I would recommend having a look at the following:
Selenium - automates web browser usage from Python (so it will run the full JavaScript).
WebKit - another headless browser option for Python, with some excellent SO questions on the matter.
Ghost.py - attempts to make the WebKit experience a little smoother.
PyV8 - something a bit more bare-bones: a Python wrapper for the Google V8 JavaScript engine that can be used to run the JavaScript on the page and, hopefully, extract the element you need.
And if you're not set on Python, why not look at a JavaScript headless browser such as PhantomJS to run the JavaScript?
As mentioned before: respect others when scraping, and be aware that there may be consequences if you are caught.
I think you mean you want to build a script which can scrape a given webpage, and extract a certain value out of a given target DOM element.
I don't currently have the time to write this code for you, but it should be rather simple to put together. Here are some modules which might help you:
Requests - Use this to fetch a given webpage into your Python script
BeautifulSoup - Feed the above "DOM text" to beautiful soup, and you will be able to more easily manipulate the HTML page (fetch your var of interest etc...).
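Here is a dependency-free sketch of that fetch-then-parse flow. Requests and BeautifulSoup would make it shorter; to keep it runnable without installing anything, this version swaps in the standard library (urllib.request for the fetch, html.parser for the parse). The class and function names are hypothetical, and like the modules above it only sees static HTML, not values filled in by JavaScript.

```python
from html.parser import HTMLParser
from urllib.request import urlopen

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TextById(HTMLParser):
    """Collect the text inside the first element whose id equals `target_id`."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0       # greater than 0 while inside the target element
        self.found = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif not self.found and dict(attrs).get("id") == self.target_id:
            self.found = True
            self.depth = 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_text(html, target_id):
    """Return the text of the element with the given id, or None if absent."""
    parser = TextById(target_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip() if parser.found else None


def fetch_html(url):
    """Download a page and decode it using the charset the server declares."""
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Usage would look like extract_text(fetch_html(url), "some-id"), where "some-id" is whatever id the target element has on the page in question.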
EDIT:
As pointed out in the comments above, please consider the Terms and Conditions of the web-service you are trying to scrape info from.
I know how to grab a source's HTML but not its PHP. Is that possible with the built-in functions?
By "grab a source's HTML" I assume you mean opening and reading a web page like this:
import urllib2
urllib2.urlopen("http://google.com").read()
Since PHP is executed on the server side, and the client (you and your Python script) has no access to it, there is no way to get at it the way you would extract HTML from a webpage.
PHP scripts are run server-side and produce an HTML document (among other things). You will never see the PHP source of an HTML document when requesting a website, hence there is no way for Python to grab it either. This isn't even Python-related.
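To illustrate the point, here is a small self-contained sketch in Python 3. A local HTTP handler stands in for a PHP page (an assumption made for the demo; no PHP is involved): it runs some logic on the server and sends back only the resulting HTML, which is all a client can ever fetch. The handler and function names are hypothetical.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import ProxyHandler, build_opener


class RenderingHandler(BaseHTTPRequestHandler):
    """Stand-in for a PHP page: runs server-side logic, sends only its output."""

    def do_GET(self):
        total = sum(range(10))  # server-side code the client never sees
        body = "<html><body><p>Total: {}</p></body></html>".format(total).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def fetch_from_demo_server():
    """Start the stand-in server, request a page from it, and return the body."""
    server = HTTPServer(("127.0.0.1", 0), RenderingHandler)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        url = "http://127.0.0.1:{}/index.php".format(server.server_address[1])
        opener = build_opener(ProxyHandler({}))  # ignore any proxy settings for localhost
        with opener.open(url) as response:
            return response.read().decode("utf-8")
    finally:
        server.shutdown()
        server.server_close()
```

The string returned by fetch_from_demo_server() contains the computed total but none of the code that produced it, which is the same situation a client is in with a real PHP site.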
I use Selenium RC to open a URL. How do I then save this web page, the way urllib.urlretrieve does? urllib can't run the JavaScript in the page. One more question: will it save the whole page as I see it when Selenium RC opens it?
It sounds like you are confusing two very different libraries.
urllib:
This module provides a high-level interface for fetching data across the World Wide Web. In particular, the urlopen() function is similar to the built-in function open(), but accepts Universal Resource Locators (URLs) instead of filenames.
You can use Python's urllib library to retrieve the raw markup from a valid URL. The library doesn't invoke any embedded JavaScript on the page, because it never attempts to parse or render anything.
Selenium RC:
Selenium Remote Control (RC) is a test tool that allows you to write automated web application UI tests in any programming language against any HTTP website using any mainstream JavaScript-enabled browser.
Selenium RC is used to automate testing. Execution of your tests occurs in a web browser via javascript, but this is a testing suite — you receive information about the status of your tests. Selenium RC does not provide any functionality to save an image of the rendered page.
Unless I've misinterpreted your question, you seem to be looking for a library that will allow you to retrieve an image of a rendered HTML page (including javascript DOM manipulation). If this is indeed the case, I would suggest looking into PyWebShot, which seems to provide exactly that functionality. You can view screenshots of it in action here (along with some additional info about it).
If it doesn't necessarily need to be a python library, there are a number of web services around that provide screenshots:
IE Web Renderer
Browsershots
BrowsrCamp
BrowserCam
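For completeness, here is a hedged sketch using the later Selenium WebDriver API rather than the Selenium RC discussed above (an assumption on my part; it is a different interface). It saves the HTML as it stands after the scripts have run and, optionally, a PNG screenshot. The function name is hypothetical.

```python
from pathlib import Path


def save_rendered_page(driver, url, html_path, screenshot_path=None):
    """Open `url` with an existing WebDriver and save what the browser ended up with.

    Only the HTML document is written; images, CSS, and scripts it links to
    are not downloaded.
    """
    driver.get(url)
    # page_source is the DOM after JavaScript ran, unlike urllib's raw markup.
    Path(html_path).write_text(driver.page_source, encoding="utf-8")
    if screenshot_path is not None:
        driver.save_screenshot(str(screenshot_path))  # PNG of the current viewport
```

With Selenium installed, this would be called with something like driver = webdriver.Firefox(), followed by driver.quit() when done.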