I want to make a script in python that interacts with a webpage that has quite a lot of javascript in it (it's a webpage that computes a bunch of physics stuff).
I don't want my code to break if the page formatting changes and I want it to run offline so I would prefer my script to run on a local html copy of the page I got (all the JS code is accessible in the HTML source, there is no call to an external server). I wanted to use the requests library to do it, but it only works with URLs. Is there any library to do this? Note that I want to interact with the HTML (input values and look at the outputs etc..), I know that I can parse the file but that's not what I'm asking. I'm also totally new to web bots or anything related.
Right now I can open my .html version of the page offline with chrome and interact with it, so there has to be a way to automate this somehow. I'm also not against using something else than python if there is a better library for this in another language.
interesting question, best way I can think to do that is use a web framework and then just scrape the data using requests. I am familiar with flask and its simple to use but im sure there are other options as well
Related
I am trying to write a web scraper in python but I have an issue, the contents of the site are not coded into the html, it seems like they are coming from a different source and I want to know if there's any python library that can fetch the contents for me or if there is such tool in any other language I'm willing to learn.
See: Is this possible to load the page after the javascript execute using python?
You'll have to execute the JS and whatever else it is that generates the HTML you want. You can do this in a lot of ways, but the answer I linked above suggests using Selenium Web Driver.
hi everyone I would like to create a small bot to help me on binary option.
i am not an expert on python but actualy I can read a web page and
retrieve a precise value in a tag,
but the information what I need is on a web application
and not in the source code of the web page. I am not an expert of eb application and I want to know if I retrieve a value displayed on the application with python.
here is a link to the picture of the application:
"http://comparatif-options-binaires.fr/wp-content/uploads/2014/05/optionweb-analyse-technique-ow-school.jpg"
I think the problem you face here is the value you need is being loaded via Javascript of some sort (though without access to the web application and no visible effort from your code I can't be sure).
Expanding on #sabhirams answer (and agreeing that requests and BeautifulSoup are excellent libraries for static text) I would recommend having a look at the following:
Selenium - automates web browser usage in python (so will run the full javascript).
Webkit - Again another headless browser for python that has some excellent SO questions on the matter.
Ghost.py - attempts to make the Webkit experience a little smoother.
pyv8 - something a bit more barebones, pyv8 is a python wrapper for the Google V8 Javascript engine and can be used to run the javascript on the page and, hopefully, extract the element you need.
And if you're really not settled with python why not look at using a Javascript headless browser to run the javascript like PhantomJS.
As mentioned before; Respect others when scraping and be aware there may be consequences if you are caught.
I think you mean you want to build a script which can scrape a given webpage, and extract a certain value out of a given target DOM element.
I dont currently have the time to write this code for you, but it should be rather simple to put together. Here are some modules which might help you:
Request - Use this to fetch a given webpage into your py script
BeautifulSoup - Feed the above "DOM text" to beautiful soup, and you will be able to more easily manipulate the HTML page (fetch your var of interest etc...).
EDIT:
As pointed out in the comments above, please consider the Terms and Conditions of the web-service you are trying to scrape info from.
I'm trying to interact with a HTML 4.0 website which uses heavily obfuscated javascript to hide the regular HTML elements. What I want to do is to fill out a form and read the returned results, and this is proving harder to do than expected.
When I read the page using Firebug, it gave me the source code deobfuscated, and I can then use this to do what I want to accomplish. The Firebug output showed all the regular elements of a website, such as -tags and the like, which were hidden in the original source.
I've written the rest of my application in Python, using mechanize to interact with other web services, so I'd rather use an existing Python module to do this if that's possible. The problem is not only how to read the source code in a way mechanize can understand, but also how to generate the response which the web server can interpret. Could I use regular mechanize controls even though the html code is obfuscated?
In the beginning of my project I used pywebkitgtk instead of mechanize, but ditched it because it wasn't really implemented that well in python. Most functions are missing. Would that be a sensible method perhaps, to start up a webkit-browser which I read the HTML from, and use that with mechanize?
Any help would be greatly appreciated, I'm really in a bind here. Thanks!
Edit: I tried dumping the HTML fetched from mechanize and opening that with pywebkitgtk, using load_html_string, and then evaluating the html that way. Unfortunately, since the document I'm trying to parse loads more resources dynamically, that scripts just stops waiting for resources to be loaded. Note that I can't use webkit to load the document itself since I use mechanize's CookieJar function to allow me to log in first.
I also tried dumping the HTML from webkit, which for some reason dumped the obfuscated javascript only, while displaying the website perfectly fine. If webkit could dump the deobfuscated javascript the way Firebug does, I could work with that and form a request according to the clean code..
Rather than trying to process the page, how about just use Firebug to figure out the names of the form fields, and then use httplib or whatever to send a request with the necessary fields and settings?
If it's sent using ajax, you should be able to determine the values being sent to the server in Firebug as well.
I have a program written for text simplification in python language, I need this program to be run on a browser as a plugin... If you click the plugin it should take the webpage's text as input and pass this input to my text simplification program and the output of the program should be again displayed in another web page...
Text simplification program takes input text and produces a simplified version of the text, so now I'm planning to create a plugin which uses this program and produces simplified version of text on the webpage...
It will be of great help if anyone help me out through this...
You would need to use NPAPI plugins in Chrome Extension:
http://code.google.com/chrome/extensions/npapi.html
Then you use Content Scripts to get the webpage text, you pass it to the Background Page via Messaging. Then your NPAPI plugin will call python (do it however you like since its all in C++), and from the Background Page, you send the text within the plugin.
Concerning your NPAPI plugin, you can take a look how it is done in pyplugin or gather ideas from here to create it.
Now the serious question, why can't you do this all in JavaScript?
If you want an easier way than trying to figure out plugins, make it run as a webservice somewhere (Google App Engine is good for Python, and free), then use a bookmarklet to send pages from the browser. As an added bonus, it works with any browser, not just Chrome.
More explanation:
Rather than running on your own computer, you make your program run on a computer at Google (or somewhere else), and access it over the web. See Google's introduction to App Engine. Then, if you want it in your browser, you make a "bookmarklet" - a little bit of javascript that grabs the web page you're currently on (either the code or the URL, depends on what you're trying to do), and sends it to your program over the web. You can add this to your browser's bookmark bar as a button you can click. There's some more info on this site.
I've had a look at many tutorials regarding cookiejar, but my problem is that the webpage that i want to scape creates the cookie using javascript and I can't seem to retrieve the cookie. Does anybody have a solution to this problem?
If all pages have the same JavaScript then maybe you could parse the HTML to find that piece of code, and from that get the value the cookie would be set to?
That would make your scraping quite vulnerable to changes in the third party website, but that's most often the case while scraping. (Please bear in mind that the third-party website owner may not like that you're getting the content this way.)
I responded to your other question as well: take a look at mechanize. It's probably the most fully featured scraping module I know: if the cookie is sent, then I'm sure you can get to it with this module.
Maybe you can execute the JavaScript code in a JavaScript engine with Python bindings (like python-spidermonkey or pyv8) and then retrieve the cookie. Or, as the javascript code is executed client side anyway, you may be able to convert the cookie-generating code to Python.
You could access the page using a real browser, via PAMIE, win32com or similar, then the JavaScript will be running in its native environment.