I often want to automate HTTP queries. I currently use Java (with Commons HttpClient), but would probably prefer a scripting-based approach: something really quick and simple, where I can set a header and go to a page without setting up an entire OO lifecycle, setting each header, and calling up an HTML parser... I am looking for a solution in ANY language, preferably a scripting one.
Have a look at Selenium. It generates code for C#, Java, Perl, PHP, Python, and Ruby if you need to customize the script.
Watir sounds close to what you want, although it (like Selenium, linked to in another answer) actually opens up a browser to do its work. You can see some examples here. Another browser-based record-and-playback system is Sahi.
If your application uses WSGI, then Paste is a nice option.
Mechanize, linked to in another answer, is a "browser in a library", and there are clones in Perl, Ruby, and Python. The Perl one is the original, and this seems to be the way to go if you don't want a real browser. The drawback of this approach is that none of the front-end code (which might rely on JavaScript) will be exercised.
Mechanize for Python seems easy to use: http://wwwsearch.sourceforge.net/mechanize/
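For a rough idea of what that looks like in practice, here is a minimal, hypothetical sketch (the URL and form field names are made up):

import mechanize

br = mechanize.Browser()
br.addheaders = [("User-agent", "my-script/1.0")]   # set a custom header

response = br.open("http://www.example.com/login")  # fetch a page

br.select_form(nr=0)        # pick the first form on the page
br["username"] = "me"       # hypothetical field names
br["password"] = "secret"
result = br.submit()
print(result.read())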
My turn: wget, or Perl with LWP. You'll find examples on the linked page.
If you have simple needs (fetch a page and then parse it), it is hard to beat LWP::Simple and HTML::TreeBuilder.
use strict;
use warnings;
use LWP::Simple;
use HTML::TreeBuilder;
my $url = 'http://www.example.com';
my $content = get( $url) or die "Couldn't get $url";
my $t = HTML::TreeBuilder->new_from_content( $content );
$t->eof;
$t->elementify;
# Get first match:
my $thing = $t->look_down( _tag => 'p', id => qr/match_this_regex/ );
print $thing ? $thing->as_text : "No match found\n";
# Get all matches:
my @things = $t->look_down( _tag => 'p', id => qr/match_this_regex/ );
print $_ ? $_->as_text : "No match found" for @things;
I'm testing ReST APIs at the moment and found the ReST Client very nice. It's a GUI program, but you can nonetheless save and restore queries as XML files (or have them generated), embed them, write test scripts, and so on. And it's Java-based (not an advantage in itself, but you did mention Java).
It loses points when it comes to recording sessions, though; the ReST Client is best for stateless "one-shots".
If it doesn't suit your needs, I'd go for the already-mentioned Mechanize (or WWW-Mechanize, as it is called on CPAN).
Depending on exactly what you're doing, the easiest solution looks to be bash + curl.
The man page for the latter is available here:
http://curl.haxx.se/docs/manpage.html
You can do POSTs as well as GETs, HTTPS, show headers, work with cookies, use basic and digest HTTP authentication, and tunnel through all sorts of proxies, including NTLM on *nix, among other things.
curl is also available as a shared library, with C and PHP bindings.
HTH
C.
Python urllib may be what you're looking for.
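For the "set a header and fetch a page" case from the question, a rough urllib2 sketch (Python 2; urllib.request in Python 3) might look like this, with a placeholder URL and header:

import urllib2

req = urllib2.Request("http://www.example.com/",
                      headers={"User-Agent": "my-script/1.0"})
html = urllib2.urlopen(req).read()
print(html[:200])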
Alternatively, PowerShell exposes the full .NET HTTP library in a scripting environment.
Twill is pretty good and made for testing. It can be used as a script, in an interactive session, or from within a Python program.
Perl and WWW::Mechanize can make web scraping and the like simple and easy, including easy handling of forms (say you want to go to a login page, fill in a username and password, and submit the form, with cookies and hidden session identifiers handled just as a browser would).
Similarly, finding or extracting links from the fetched page is trivial.
If you need to parse stuff out of the resulting pages that WWW::Mechanize can't easily help with, then feed the result to HTML::TreeBuilder to make parsing easy.
What about using PHP+Curl, or just bash?
Some Ruby libraries:
httparty: really interesting; its philosophy alone makes it worth a look.
mechanize: a classic, good-quality web automation library.
scrubYt: puzzling at first glance, but fun to use.
I want to make a Python script available as a service on the net. The script, which is my first 'proper' Python program, takes a txt file as an argument and writes an image into the working directory. So:
How difficult is it for somebody who is new to Python and web development?
How much work is it?
Do I need a framework (Django, cherryPy, web2py)?
Are there good tutorials?
How do I prevent the server from being compromised?
What are my next steps?
==> What is the easiest way?
In the end it's enough if it's a white page with some text and a button which, when clicked, opens a file dialog. After the txt file is processed, the server should just return the image that was written to disk. Through a friend, I already have access to a server with Ubuntu installed.
[update]
Thanks for all your answers. After reading them I want to stress again that I want to keep it as minimal as possible. Srikar's suggestion sounds like the easiest one:
Put it in executable directory of your OS (commonly known as CGI
path). Provide a simple HTML form & upon form submission hit this
script which executes & returns back the image you want to display.
Any objections or comments? Do you know any tutorials for that?
[update 2]
I found this SO answer: File Sharing Site in Python. Is this a sensible approach?
It's not too difficult. Actually, it sounds like a good first project.
That's too subjective to answer; anywhere from an hour to days.
No, you don't need one, but I'd use one if I were you. They abstract away some of the stuff you really don't care about, and you'll learn a tool you can use again in the future.
Plenty. If you want a real rundown of how Python works for the web, read the HOWTO from Python.org. If you just want to learn how to do this one project, pick a framework and do their tutorial.
This question is so broad and complex that I'm not going to try to answer it. Search this site, or Google, for questions like that.
Your next step should be to pick a framework; I've used Django successfully. Just download it, follow the installation instructions, and work your way through their tutorial; it should tell you everything you need to know to do what you want. If you still have questions once you've learned how to do the basics, come back and ask again!
Edit: The answer to that other question will certainly work for you. There, they just receive a GET request and respond with data from a Python file. You need to receive a GET request and respond with an HTML page (easy enough), then handle a POST request that includes an uploaded file (slightly more complicated), run your Python routine on the uploaded file, and respond with the created image (or a link to it).
Take a look at this page which includes a simple Python script to do file uploads. You should easily be able to modify it to do what you want.
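To give a rough, Python 2-style idea of what such a CGI upload handler could look like (the "textfile" field name and the process() routine are hypothetical placeholders for your own form and script):

#!/usr/bin/env python
import cgi
import sys

form = cgi.FieldStorage()
uploaded = form["textfile"].file.read()   # the uploaded txt file ("textfile" is a placeholder name)

image_path = process(uploaded)            # placeholder: your routine that writes an image and returns its path

sys.stdout.write("Content-Type: image/png\r\n\r\n")
sys.stdout.write(open(image_path, "rb").read())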
How difficult is it for somebody who is new to Python and web development?
Depends on your level of knowledge.
How much work is it?
Depends on which method you choose to solve the problem.
Do I need a framework (Django, cherryPy, web2py)?
Not necessarily: you could get started using the cgi module (http://docs.python.org/library/cgi.html).
Are there good tutorials?
Yes, there are plenty. The Python docs are an excellent place to start.
How do I prevent the server from being compromised?
Again, depends on the method you choose to solve the problem, although there are commonalities.
What are my next steps?
Dare I say it again, choose a method, read the docs, have a play!
If it's just as simple as you have described, then you might not even need Django; you could simply use CGI scripting. All of these design decisions depend on questions like:
Do you need (or foresee needing) SQL storage?
Do you need a content management system?
Will you need multiple-user support?
Do you need tight security?
Do you need different privileges for different users?
Do you need an admin interface to manage your site?
If at least 60% of your answers to the above are yes, then you might consider Django. Otherwise, just write a Python script, put it in the executable directory of your web server (commonly known as the CGI path), and provide a simple HTML form; upon form submission, hit this script, which executes and returns the image you want to display. So, it all depends on the features you need...
In the end, I created what I needed with Flask.
They have a well-documented pattern/tutorial on Uploading Files. The tutorial is understandable even for people with little Python and web experience.
Getting a first working version took me two hours, and the resulting code was only 50 lines. That includes starting the web server, an HTML file/form with file upload, and serving a file back to the user.
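For readers who want the general shape of it, a minimal sketch along those lines might look like the following; the form field name and process_txt() are placeholders, not my actual code:

from flask import Flask, request, send_file

app = Flask(__name__)

UPLOAD_FORM = """
<form method="post" enctype="multipart/form-data">
  <input type="file" name="textfile">
  <input type="submit" value="Upload">
</form>
"""

@app.route("/", methods=["GET", "POST"])
def upload():
    if request.method == "POST":
        text = request.files["textfile"].read()
        image_path = process_txt(text)   # placeholder for the routine that writes the image
        return send_file(image_path, mimetype="image/png")
    return UPLOAD_FORM

if __name__ == "__main__":
    app.run()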
I'm exploring many technologies, but I would like your input on which web framework would make this easiest/most feasible. I'm currently looking at JSP/JSF/PrimeFaces, but I'm not sure whether that stack is capable of supporting this app.
Here's a basic description of the app:
Users log in with their username and password (maybe I can somehow incorporate OpenID?).
With a really nice UI, they will be presented a large list of questions specific to a certain category, for example, "Cooking". (I will manually compile this list and make it available.)
When they click on any of these questions, a little input box opens up below it to allow the user to put in a link/URL.
If the link they enter has the same question on that webpage the URL points to, they will be awarded one point. This question then disappears and gets added to a different page that has a list of all correctly linked questions.
On the right side of the screen, there will be a leaderboard with the usernames of the people with the top ten points.
The idea is relatively simple - to be able to compile links to external websites for specific questions by allowing many people to contribute.
I know I can build the UI easily with PrimeFaces. What I'm not sure about is whether JSP/JSF gives me the ability to parse the HTML at a given URL to see if it contains certain words. I can do this easily in Python using urllib, but I can't use Python for building the web GUI (it is very difficult). What is the best approach?
Any help would be appreciated!!! Thanks!
The best approach is whatever is best for you. If Python isn't your strength but Java is, then use Java. If you're a Python expert and know little Java, use Python.
There are so many resources on the Internet supporting so many platforms that the decision really comes down to what works best for you.
For starters, forget about JSP/JSF. This is an old combination that had many problems. Please consider Facelets/JSF. Facelets is the default templating language in the current version of JSF, while JSP is there only for backwards compatibility.
What I'm not sure is if JSP/JSF gives the ability to parse HTML at a certain URL to see if it contains words.
Yes it does, although the actual fetching of data and parsing of its content will be done by plain Java code. This itself has nothing to do with the JSF APIs.
With JSF you create a Facelet containing your UI (input fields, buttons, etc.). Then, still using JSF, you bind this to a so-called backing bean, which is primarily a normal Java class with only one or two JSF-specific annotations applied to it (e.g. @ManagedBean).
When the user enters the URL and presses some button, JSF takes care of calling some action method in your Java class (backing bean). In this action method you now have access to the URL the user entered, and from here on plain Java coding starts and JSF specifics end. You can put the code that fetches the URL and does the parsing you require in a separate helper class (separation of concerns), or at your discretion directly in the backing bean. The choice is yours.
Incidentally we had a very junior programmer at our office use JSF for something not unlike what you are requesting here and he succeeded in doing it in a short time. It thus really isn't that hard ;)
No web technology does what you want. Parsing documents found at certain urls is out of the scope of building web interfaces.
However, each of Java's web technologies will give you, without limits, access to a rich and varied (if not too rich and much too varied) set of libraries and frameworks running on JVM. You could safely say that if there is a library for doing something, there will be a Java version available. Downloading and parsing a document will not require more than what is available in the standard library (unless you insist on injecting your dependencies or crosscutting your concerns), so no problems with doing your project with JSP, or JSF/Primefaces, or whatever.
Since you claim to already know Python, and since you will have to add some HTML/CSS anyway, I suggest you try Django. It's dead simple, has a set of OpenID plugins to choose from, will give you admin interface for free (so you can prime the pump with the first set of links).
I am learning python. I have created some scripts that I use to parse various websites that I run daily (as their stats are updated), and look at the output in the Python interpreter. I would like to create a website to display the results. What I want to do is run my script when I go to the site, and display a sortable table of the results.
I have looked at Django and am part way through the tutorial, but it seems like an awful lot of overhead for what should be a simple problem. I know that I could just write a Python script to output simple HTML, but is that really the best way? I would like to be able to sort the table by various columns.
I have years of programming experience (C, Java, etc.), but have very little web development experience.
Thanks in advance.
Have you considered Flask? Like Tornado, it is both a "micro-framework" and a simple web server, so it has everything you need right out of the box. http://flask.pocoo.org/
This example (right off the homepage) pretty much sums up how simple the code can be:
from flask import Flask
app = Flask(__name__)

@app.route("/")
def hello():
    return "Hello World!"

if __name__ == "__main__":
    app.run()
If you are creating non-interactive pages, you can easily set up any modern web server to execute your Python script as a CGI. Instead of loading a static file, your web server will return the output of your Python script.
This isn't very sophisticated, but if you are simply returning the output without needing browser-submitted data, this is the easiest way (scaling under load is a different story).
You don't even need the "cgi" module from Python if you aren't receiving any data from the browser. Anything more complicated than this and you should use a web framework.
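As a rough illustration, a non-interactive CGI script can be as small as this (the HTML is just a stand-in for your script's real output):

#!/usr/bin/env python
# trivial CGI sketch: the web server runs this and sends whatever it prints to the browser
print("Content-Type: text/html")
print("")
print("<html><body><h1>Daily stats</h1><p>your generated table goes here</p></body></html>")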
Examples and other methods
Simple Example: hardest part is webserver configuration
mod_python: Cut down on CGI overhead (otherwise, apache execs the python interpreter for each hit)
python module cgi: sending data to your python script from the browser.
Sorting
Javascript side sorting: I've used this javascript library to add sortable tables. This is the easiest way to add sorting without requiring additional work or another HTTP GET.
Instructions:
Download this file
Add to your HTML
Add class="sortable" to any table you'd like to make sortable
Click on the headers to sort
You might consider Tornado if Django is too much overhead. I've used both and agree that, if you have something simple/small to do and don't already know Django, it's going to exponentially increase your time to production. On the other hand, you can 'get' Tornado in a couple of hours and get something relatively simple done in a day or two with no prior experience with it. At least, that's been my experience with it.
Note that Tornado is still a tradeoff: you get a lot of simplicity in exchange for giving up the huge cornucopia of features and shortcuts Django provides.
PS: in addition to being a 'micro-framework', Tornado is also its own web server, so there's no mucking about with WSGI/mod_cgi/FastCGI... just write your request handlers and run it. Be sure to see the demos included in the distribution.
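As a rough sketch of that handler-and-run pattern (assuming a Tornado version of that era; the HTML is a stand-in for your script's results):

import tornado.ioloop
import tornado.web

class MainHandler(tornado.web.RequestHandler):
    def get(self):
        # stand-in for running your parsing script and rendering its results
        self.write("<html><body><table><tr><td>your results here</td></tr></table></body></html>")

application = tornado.web.Application([
    (r"/", MainHandler),
])

if __name__ == "__main__":
    application.listen(8888)
    tornado.ioloop.IOLoop.instance().start()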
Have you seen the Bottle framework? It is a micro-framework and very simple.
If I understood your requirements correctly, you might find Wooey very interesting.
Wooey is a Django app that creates automatic web UIs for Python scripts:
http://wooey.readthedocs.org
Here you can check a demo:
https://wooey.herokuapp.com/
Django is a big web framework, meant to include loads of things because you often need them, even though sometimes you don't.
Look at Pyramid, earlier known as BFG. It's much smaller.
http://pypi.python.org/pypi/pyramid/1.0a1
Other microframeworks to check out are here: http://wiki.python.org/moin/WebFrameworks
On the other hand, in this case even that is probably overkill. It sounds like you could run the script once every ten minutes, write a static HTML file, and just serve it with Apache.
If you are not willing to write your own tool, there is a pretty advanced tool for executing your scripts: http://rundeck.org/
It's pretty simple to start and can be configured for complex scenarios as well.
For the requirement of a custom view (with sortable results), I believe you can implement a simple plugin that translates script output into HTML elements.
Also, for simple setups I can recommend my own tool: https://github.com/bugy/script-server. It doesn't have tons of features, but it is very easy for end users and supports interactive execution.
If you don't need any input from the browser, this sounds like an almost-static webpage that just happens to change once a day. You'll only need some way to get HTML out of your script and into a place where your web server can access it.
So you'd use some form of templating; if you need structure beyond the single page, there are static site/blog generators that you can feed your output (in, say, Markdown format) and then call their "make html" or the like.
You can use DicksonUI (https://dicksonui.gitbook.io) or the Remi GUI library (search for it on Google). I prefer DicksonUI, though I should disclose that I am its author.
I was recently asked by a client to build a website for their insurance business. As part of this, they want to do some screen scraping of the quote site for one of their providers. They asked if there was an API to do this and were told there wasn't one, but that if they could get the data from the provider's engine, they could use it as they wanted.
My question: is it even possible to perform screen scraping on the response to a form submission to another site? If so, what are the gotchas I should look out for? (Obvious legal/ethical issues aside, since they have already asked for permission to do what we're planning.)
As an aside, I would prefer to do any processing in python.
Thanks
A really nice library for screen scraping is mechanize, which I believe is a clone of an original library written in Perl. Combine that with the ClientForm module and some additional help from BeautifulSoup, and you should be away.
I've written loads of screen-scraping code in Python, and these modules turned out to be the most useful. Most of the stuff that mechanize does could in theory be done by simply using the urllib2 or httplib modules from the standard library, but mechanize makes this stuff a breeze: essentially it gives you a programmatic browser (note that it does not require a browser to work, but merely provides you with an API that behaves like a completely customisable browser).
For post-processing, I've had a lot of success with BeautifulSoup, but lxml.html is a good choice too.
Basically, you will be able to do this in Python for sure, and your results should be really good with the range of tools out there.
You can pass a data parameter to urllib.urlopen to send POST data with the request, just as if you had filled out the form. You'll obviously have to take a look at exactly what data the form contains.
Also, if the form has method="GET", the request data is just part of the url given to urlopen.
Pretty much the standard for parsing the returned HTML data is BeautifulSoup.
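Putting those two together, a rough sketch might look like this (the URL, form fields, and element class are placeholders; the import line is for the old BeautifulSoup 3, use "from bs4 import BeautifulSoup" with newer versions):

import urllib
from BeautifulSoup import BeautifulSoup

params = urllib.urlencode({"zipcode": "90210", "coverage": "full"})   # placeholder form data
html = urllib.urlopen("http://quotes.example.com/quote", params).read()

soup = BeautifulSoup(html)
quote = soup.find("span", {"class": "quote-amount"})   # hypothetical element
print(quote.string if quote else "not found")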
I see the other two answers already mention all the major libraries of choice for the purpose... as long as the site being scraped does not make extensive use of JavaScript, that is. If it IS a JavaScript-heavy site, dependent on JS for the data it fetches and displays (e.g. via AJAX), your problem is an order of magnitude harder; in that case, I might suggest starting with crowbar, some customization of diggstripper, or Selenium, etc.
You'll have to do substantial work in Javascript and probably dedicated work to deal with the specifics of the (hypothetically JS-heavy) site in question, depending on the JS frameworks it uses, etc; that's why the job is so much harder if that is the case. But in any case you might end up with (at least in part) local HTML copies of the site's pages as displayed, and end by scraping those copies with the other tools already recommended. Good luck: may the sites you scrape always be Javascript-light!-)
Others have recommended BeautifulSoup, but it's much better to use lxml. Despite its name, it is also for parsing and scraping HTML. It's much, much faster than BeautifulSoup, and it even handles "broken" HTML better than BeautifulSoup (their claim to fame). It has a compatibility API for BeautifulSoup too if you don't want to learn the lxml API.
Ian Bicking agrees.
There's no reason to use BeautifulSoup anymore, unless you're on Google App Engine or something where anything not purely Python isn't allowed.
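A quick sketch of the lxml.html style (the URL and selectors are placeholders; cssselect needs the separate cssselect package with recent lxml releases):

import lxml.html

doc = lxml.html.parse("http://www.example.com/").getroot()   # fromstring() works on an HTML string too

for href in doc.xpath("//a/@href"):       # grab all link targets
    print(href)

for p in doc.cssselect("p.quote"):        # or use CSS selectors
    print(p.text_content())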
Suppose I need to perform a set of procedures on a particular website:
say, fill in some forms, click the submit button, send the data back to the server, receive the response, do something based on the response, and send the data back to the server of the website again.
I know there is a webbrowser module in Python, but I want to do this without invoking any web browser. It has to be a pure script.
Is there a module available in Python that can help me do that?
thanks
Selenium will do exactly what you want, and it handles JavaScript.
You can also take a look at mechanize. It's meant to handle "stateful programmatic web browsing" (as per their site).
I think the best solution is a mix of requests and BeautifulSoup; I just wanted to add this so the question stays up to date.
All the other answers are old. I recommend, and am a big fan of, requests.
From its homepage:
Python’s standard urllib2 module provides most of the HTTP
capabilities you need, but the API is thoroughly broken. It was built
for a different time — and a different web. It requires an enormous
amount of work (even method overrides) to perform the simplest of
tasks.
Things shouldn't be this way. Not in Python.
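A rough sketch of the requests + BeautifulSoup combination (the URL and form fields are placeholders):

import requests
from bs4 import BeautifulSoup

session = requests.Session()     # keeps cookies across requests

resp = session.post("http://www.example.com/login",
                    data={"username": "me", "password": "secret"})   # placeholder form fields

soup = BeautifulSoup(resp.text, "html.parser")
print(soup.title.string if soup.title else resp.status_code)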
Selenium (http://www.seleniumhq.org/) is the best solution for me. You can code it with Python, Java, or any programming language you like, with ease, and it offers easy simulation (recording) that converts into a program.
There are plenty of built-in Python modules that will help with this, for example urllib and htmllib.
The problem will be simpler if you change the way you're approaching it. You say you want to "fill some forms, click the submit button, send the data back to the server, receive the response", which sounds like a four-stage process.
In fact, what you need to do is post some data to a webserver and get a response.
This is as simple as:
>>> import urllib
>>> params = urllib.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})
>>> f = urllib.urlopen("http://www.musi-cal.com/cgi-bin/query", params)
>>> print f.read()
(example taken from the urllib docs).
What you do with the response depends on how complex the HTML is and what you want to do with it. You might get away with parsing it using a regular expression or two, or you can use the htmllib.HTMLParser class, or maybe a higher-level, more flexible parser like Beautiful Soup.
Selenium 2 includes WebDriver, which has Python bindings and allows one to use the headless HtmlUnit driver, or switch to Firefox or Chrome for graphical debugging.
Do not forget zope.testbrowser, which is a wrapper around mechanize.
zope.testbrowser provides an easy-to-use programmable web browser with special focus on testing.
The best solution that I have found (and am currently implementing) is:
- scripts in Python using the Selenium WebDriver
- the PhantomJS headless browser (if Firefox is used, you will have a GUI and it will be slower)
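Roughly, that looks like the sketch below (element names are placeholders; PhantomJS has to be installed separately, and recent Selenium releases have dropped its driver in favour of headless Firefox/Chrome):

from selenium import webdriver

driver = webdriver.PhantomJS()    # or webdriver.Firefox() for a visible GUI
driver.get("http://www.example.com/login")

driver.find_element_by_name("username").send_keys("me")     # placeholder element names
driver.find_element_by_name("password").send_keys("secret")
driver.find_element_by_id("submit").click()

print(driver.page_source[:200])
driver.quit()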
I have found the iMacros Firefox plugin (which is free) to work very well.
It can be automated with Python using Windows COM object interfaces. Here's some example code from http://wiki.imacros.net/Python. It requires Python Windows Extensions:
import win32com.client

def Hello():
    w = win32com.client.Dispatch("imacros")
    w.iimInit("", 1)
    w.iimPlay("Demo\\FillForm")

if __name__ == '__main__':
    Hello()
You likely want urllib2. It can handle things like HTTPS, cookies, and authentication. You will probably also want BeautifulSoup to help parse the HTML pages.
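For the cookie part specifically, a small sketch (Python 2; use urllib.request and http.cookiejar in Python 3):

import urllib2
import cookielib

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))   # an opener that remembers cookies

html = opener.open("http://www.example.com/").read()
print(len(cj), "cookies collected")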
You may have a look at these slides from the last Italian PyCon (PDF):
The author lists most of the libraries for doing scraping and automated browsing in Python, so it is worth a look.
I very much like twill (which has already been suggested); it was developed by one of the authors of nose and is specifically aimed at testing web sites.
Internet Explorer specific, but rather good:
http://pamie.sourceforge.net/
The advantage compared to urllib/BeautifulSoup is that it executes JavaScript as well, since it uses IE.
httplib2 + BeautifulSoup.
Use Firefox + Firebug + HttpReplay to see what the JavaScript passes to and from the browser and the website. Using httplib2 you can essentially do the same via POST and GET.
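Roughly, a POST with httplib2 looks like this (the URL and form data are placeholders):

import urllib
import httplib2

h = httplib2.Http()
body = urllib.urlencode({"q": "example"})   # placeholder form data
headers = {"Content-Type": "application/x-www-form-urlencoded"}

response, content = h.request("http://www.example.com/search", "POST",
                              body=body, headers=headers)
print(response.status)
print(content[:200])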
For automation you definitely might want to check out
webbot
It is based on Selenium and offers a lot more features with very little code, like automatically finding elements to perform actions such as click and type based on your parameters.
It even works for sites with dynamically changing class names and IDs.
Here are the docs: https://webbot.readthedocs.io/