I know how to grab a sources HTML but not PHP is it possible with the built in functions?
By "grab a sources HTML" I assume you mean opening and reading a web page like this:
impor urllib2
urllib2.urlopen("http://google.com").read()
Since PHP is rendered on the server side, and the client (you and your python script) have no access to it, there is no way to get at it, in a manner similar to how you would extract HTML from a webpage.
PHP scripts are run server-side and produce a HTML document (among other things). You will never see the PHP source of a HTML document when requesting a website, hence there is no way for Python to grab it either. This isn't even Python-related.
Related
I want to make a script in python that interacts with a webpage that has quite a lot of javascript in it (it's a webpage that computes a bunch of physics stuff).
I don't want my code to break if the page formatting changes and I want it to run offline so I would prefer my script to run on a local html copy of the page I got (all the JS code is accessible in the HTML source, there is no call to an external server). I wanted to use the requests library to do it, but it only works with URLs. Is there any library to do this? Note that I want to interact with the HTML (input values and look at the outputs etc..), I know that I can parse the file but that's not what I'm asking. I'm also totally new to web bots or anything related.
Right now I can open my .html version of the page offline with chrome and interact with it, so there has to be a way to automate this somehow. I'm also not against using something else than python if there is a better library for this in another language.
interesting question, best way I can think to do that is use a web framework and then just scrape the data using requests. I am familiar with flask and its simple to use but im sure there are other options as well
I am trying to write a web scraper in python but I have an issue, the contents of the site are not coded into the html, it seems like they are coming from a different source and I want to know if there's any python library that can fetch the contents for me or if there is such tool in any other language I'm willing to learn.
See: Is this possible to load the page after the javascript execute using python?
You'll have to execute the JS and whatever else it is that generates the HTML you want. You can do this in a lot of ways, but the answer I linked above suggests using Selenium Web Driver.
I'm trying to scrape search results from a number of websites. The problem is that not all of these sites return their search results as plain html text, a lot of it is dynamically generated with with JS, AJAX, etc. However, I can see exactly what I need by looking at the page with the Firefox inspector, since the scripts have all run and modified the html.
My question is: is there a way for me to download a webpage AFTER allowing the scripts to run, or at least get them to run locally. That way, I'd get the final html.
For reference, I'm using python.
Possible duplicate. In that case the question is with php and JS.
Sure, you have to provide some enviroment for scripts (js) to run and often to return a test value to target server. It's not that easy for the server side languages. So today for this we mostly leverage browser driving or imitating tools mentioned there.
I've found for you the python analog to v8js php plugin: PyV8.
PyV8 is a python wrapper for Google V8 engine, it act as a bridge between the Python and JavaScript objects, and support to hosting Google's v8 engine in a python script.
If properly configured, your scraper:
Gets site's js
Evaluates this js thru the given plugin
Gets access to target html for further parse.
hi everyone I would like to create a small bot to help me on binary option.
i am not an expert on python but actualy I can read a web page and
retrieve a precise value in a tag,
but the information what I need is on a web application
and not in the source code of the web page. I am not an expert of eb application and I want to know if I retrieve a value displayed on the application with python.
here is a link to the picture of the application:
"http://comparatif-options-binaires.fr/wp-content/uploads/2014/05/optionweb-analyse-technique-ow-school.jpg"
I think the problem you face here is the value you need is being loaded via Javascript of some sort (though without access to the web application and no visible effort from your code I can't be sure).
Expanding on #sabhirams answer (and agreeing that requests and BeautifulSoup are excellent libraries for static text) I would recommend having a look at the following:
Selenium - automates web browser usage in python (so will run the full javascript).
Webkit - Again another headless browser for python that has some excellent SO questions on the matter.
Ghost.py - attempts to make the Webkit experience a little smoother.
pyv8 - something a bit more barebones, pyv8 is a python wrapper for the Google V8 Javascript engine and can be used to run the javascript on the page and, hopefully, extract the element you need.
And if you're really not settled with python why not look at using a Javascript headless browser to run the javascript like PhantomJS.
As mentioned before; Respect others when scraping and be aware there may be consequences if you are caught.
I think you mean you want to build a script which can scrape a given webpage, and extract a certain value out of a given target DOM element.
I dont currently have the time to write this code for you, but it should be rather simple to put together. Here are some modules which might help you:
Request - Use this to fetch a given webpage into your py script
BeautifulSoup - Feed the above "DOM text" to beautiful soup, and you will be able to more easily manipulate the HTML page (fetch your var of interest etc...).
EDIT:
As pointed out in the comments above, please consider the Terms and Conditions of the web-service you are trying to scrape info from.
I'm trying to interact with a HTML 4.0 website which uses heavily obfuscated javascript to hide the regular HTML elements. What I want to do is to fill out a form and read the returned results, and this is proving harder to do than expected.
When I read the page using Firebug, it gave me the source code deobfuscated, and I can then use this to do what I want to accomplish. The Firebug output showed all the regular elements of a website, such as -tags and the like, which were hidden in the original source.
I've written the rest of my application in Python, using mechanize to interact with other web services, so I'd rather use an existing Python module to do this if that's possible. The problem is not only how to read the source code in a way mechanize can understand, but also how to generate the response which the web server can interpret. Could I use regular mechanize controls even though the html code is obfuscated?
In the beginning of my project I used pywebkitgtk instead of mechanize, but ditched it because it wasn't really implemented that well in python. Most functions are missing. Would that be a sensible method perhaps, to start up a webkit-browser which I read the HTML from, and use that with mechanize?
Any help would be greatly appreciated, I'm really in a bind here. Thanks!
Edit: I tried dumping the HTML fetched from mechanize and opening that with pywebkitgtk, using load_html_string, and then evaluating the html that way. Unfortunately, since the document I'm trying to parse loads more resources dynamically, that scripts just stops waiting for resources to be loaded. Note that I can't use webkit to load the document itself since I use mechanize's CookieJar function to allow me to log in first.
I also tried dumping the HTML from webkit, which for some reason dumped the obfuscated javascript only, while displaying the website perfectly fine. If webkit could dump the deobfuscated javascript the way Firebug does, I could work with that and form a request according to the clean code..
Rather than trying to process the page, how about just use Firebug to figure out the names of the form fields, and then use httplib or whatever to send a request with the necessary fields and settings?
If it's sent using ajax, you should be able to determine the values being sent to the server in Firebug as well.