How to automate browsing using Python? [closed]

Suppose I need to perform a set of procedures on a particular website:
say, fill in some forms, click the submit button, send the data back to the server, receive the response, then do something based on the response and send data back to the site's server again.
I know there is a webbrowser module in Python, but I want to do this without invoking any web browser. It has to be a pure script.
Is there a module available in Python that can help me do that?
Thanks.

Selenium will do exactly what you want, and it handles JavaScript.
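For example, a minimal Selenium WebDriver sketch; the URL and element names are hypothetical placeholders, and the lookups use the generic find_element/By locator API:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()                              # or webdriver.Chrome()
driver.get("https://example.com/form")                    # open the page
driver.find_element(By.NAME, "q").send_keys("my query")   # fill a form field
driver.find_element(By.NAME, "submit").click()            # click the submit button
print(driver.page_source[:200])                           # inspect the response HTML
driver.quit()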

You can also take a look at mechanize. It's meant to handle "stateful programmatic web browsing" (as per its site).
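A rough sketch of how form handling looks with mechanize; the URL and form field names here are made up for illustration:

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)            # some sites refuse robots.txt-obeying clients
br.open("http://example.com/login")
br.select_form(nr=0)                   # select the first form on the page
br["username"] = "alice"               # hypothetical field names
br["password"] = "secret"
response = br.submit()
print(response.read()[:200])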

I think the best solution is a mix of requests and BeautifulSoup; I just wanted to add this so the question stays up to date.
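For instance, a small fetch-and-parse sketch, assuming the page you want is plain HTML with anchor links:

import requests
from bs4 import BeautifulSoup

resp = requests.get("https://example.com")
resp.raise_for_status()                           # fail loudly on HTTP errors
soup = BeautifulSoup(resp.text, "html.parser")
for link in soup.find_all("a"):                   # list every link on the page
    print(link.get("href"))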

All the other answers are old; I recommend, and am a big fan of, requests.
From its homepage:
Python’s standard urllib2 module provides most of the HTTP
capabilities you need, but the API is thoroughly broken. It was built
for a different time — and a different web. It requires an enormous
amount of work (even method overrides) to perform the simplest of
tasks.
Things shouldn't be this way. Not in Python.
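A minimal sketch of the question's form-submission workflow with requests (URLs and field names are hypothetical); a Session keeps cookies between calls, which covers the "send data, read response, send more data" loop:

import requests

with requests.Session() as s:                     # cookies persist across requests
    s.get("https://example.com/form")             # pick up any session cookie first
    payload = {"username": "alice", "password": "secret"}
    resp = s.post("https://example.com/login", data=payload)
    resp.raise_for_status()
    page = s.get("https://example.com/account")   # follow-up request in the same session
    print(page.text[:200])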

Selenium (http://www.seleniumhq.org/) is the best solution for me. You can script it with Python, Java, or any programming language you like, with ease, and it is easy to record an interaction and convert it into a program.

There are plenty of built-in Python modules that would help with this, for example urllib and htmllib.
The problem becomes simpler if you change the way you're approaching it. You say you want to "fill some forms, click the submit button, send the data back to the server, receive the response", which sounds like a four-stage process.
In fact, what you need to do is post some data to a webserver and get a response.
This is as simple as:
>>> import urllib
>>> params = urllib.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})
>>> f = urllib.urlopen("http://www.musi-cal.com/cgi-bin/query", params)
>>> print f.read()
(example taken from the urllib docs).
What you do with the response depends on how complex the HTML is and what you want to do with it. You might get away with parsing it using a regular expression or two, or you can use the htmllib.HTMLParser class, or maybe a higher level more flexible parser like Beautiful Soup.
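Note that the snippet above is Python 2; a rough Python 3 equivalent uses urllib.parse and urllib.request:

from urllib.parse import urlencode
from urllib.request import urlopen

params = urlencode({'spam': 1, 'eggs': 2, 'bacon': 0}).encode('ascii')
with urlopen("http://www.musi-cal.com/cgi-bin/query", params) as f:
    print(f.read().decode())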

Selenium 2 includes WebDriver, which has Python bindings and allows one to use the headless HtmlUnit driver, or switch to Firefox or Chrome for graphical debugging.

Do not forget zope.testbrowser, which is a wrapper around mechanize.
zope.testbrowser provides an easy-to-use programmable web browser with a special focus on testing.

The best solution that I have found (and am currently implementing) is:
- scripts in Python using Selenium WebDriver
- the PhantomJS headless browser (if Firefox is used instead, you will have a GUI and it will be slower)

I have found the iMacros Firefox plugin (which is free) to work very well.
It can be automated with Python using Windows COM object interfaces. Here's some example code from http://wiki.imacros.net/Python. It requires Python Windows Extensions:
import win32com.client

def Hello():
    w = win32com.client.Dispatch("imacros")
    w.iimInit("", 1)
    w.iimPlay("Demo\\FillForm")

if __name__ == '__main__':
    Hello()

You likely want urllib2. It can handle things like HTTPS, cookies, and authentication. You will probably also want BeautifulSoup to help parse the HTML pages.

You may have a look at these slides from the last Italian PyCon (PDF):
the author lists most of the libraries for doing scraping and automated browsing in Python, so you may have a look at it.
I like twill very much (it has already been suggested); it was developed by one of the authors of nose and is specifically aimed at testing web sites.

Internet Explorer specific, but rather good:
http://pamie.sourceforge.net/
The advantage compared to urllib/BeautifulSoup is that it executes JavaScript as well, since it uses IE.

httplib2 + BeautifulSoup.
Use Firefox + Firebug + HttpReplay to see what JavaScript passes to and from the browser and the website. With httplib2 you can essentially do the same via POST and GET.
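A small httplib2 POST sketch; the URL and form fields are invented for the example:

import httplib2
from urllib.parse import urlencode

h = httplib2.Http(".cache")                       # optional on-disk cache directory
headers = {"Content-Type": "application/x-www-form-urlencoded"}
body = urlencode({"spam": 1, "eggs": 2})
response, content = h.request("http://example.com/submit", "POST",
                              body=body, headers=headers)
print(response.status)
print(content[:200])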

For automation you definitely might want to check out webbot.
It is based on Selenium and offers a lot more features with very little code, like automatically finding elements to perform actions such as click and type based on your parameters.
It even works for sites with dynamically changing class names and IDs.
Here are the docs: https://webbot.readthedocs.io/
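Roughly, a webbot script looks like the following; the site, field label and button text are placeholders, so check the docs linked above for the exact API:

from webbot import Browser

web = Browser()
web.go_to("https://example.com/form")
web.type("my search text", into="Search")   # locate the field by its label/placeholder
web.click("Submit")                         # locate the button by its text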

Related

Python search script in an HTML webpage

I have an HTML webpage. It has a search textbox. I want to allow the user to search within a dataset. The dataset is represented by a bunch of files on my server. I wrote a python script which can make that search.
Unfortunately, I'm not familiar with how I can combine the HTML page and a Python script.
The task is to hook the Python script up to the HTML page so that:
Python code will be run on the server side
Python code can somehow take the values from the HTML page as input
Python code can somehow put the search results into the HTML webpage as output
Question 1: How can I do this?
Question 2: How should the Python code be stored on the website?
Question 3: How should it take HTML values as input?
Question 4: How can it output the results to the webpage? Do I need to install/use any additional frameworks?
Thanks!
There are too many things to get wrong if you try to implement that by yourself with only what the standard library provides.
I would recommend using a web framework, like flask or django. I linked to the quickstart sections of the comprehensive documentation of both. Basically, you write code and URL specifications that are mapped to the code, e.g. an HTTP GET on /search is mapped to a method returning the HTML page.
You can then use a form submit button to GET /search?query=<param>, with <param> being the user's input. Based on that input you search the dataset and return a new HTML page with the results.
Both frameworks have template languages that help you put the search results into HTML.
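As a sketch of the Flask variant (the search_dataset function stands in for your existing search script, and the inline template is just for illustration):

from flask import Flask, request, render_template_string

app = Flask(__name__)

TEMPLATE = """
<form action="/search" method="get">
  <input type="text" name="query">
  <input type="submit" value="Search">
</form>
<ul>{% for hit in results %}<li>{{ hit }}</li>{% endfor %}</ul>
"""

def search_dataset(query):
    # placeholder: call your existing search script here
    with open("data.txt") as f:
        return [line.strip() for line in f if query in line]

@app.route("/search")
def search():
    query = request.args.get("query", "")
    results = search_dataset(query) if query else []
    return render_template_string(TEMPLATE, results=results)

if __name__ == "__main__":
    app.run(debug=True)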
For testing purposes, web frameworks usually come with a simple webserver you can use. For production purposes, there are better solutions like uwsgi and gunicorn
Also, you should consider putting the data into a database, parsing files for each query can be quite inefficient.
I'm sure you will have more questions on the way, but that's what stackoverflow is for, and if you can ask more specific questions, it is easier to provide more focused answers.
I would look at the cgi library in Python.
You should check out Django; it's a very flexible and easy Python web framework.

How do I create a web interface to a simple python script?

I am learning python. I have created some scripts that I use to parse various websites that I run daily (as their stats are updated), and look at the output in the Python interpreter. I would like to create a website to display the results. What I want to do is run my script when I go to the site, and display a sortable table of the results.
I have looked at Django and am part way through the tutorial, but it seems like an awful lot of overhead for what should be a simple problem. I know that I could just write a Python script to output simple HTML, but is that really the best way? I would like to be able to sort the table by various columns.
I have years of programming experience (C, Java, etc.), but have very little web development experience.
Thanks in advance.
Have you considered Flask? Like Tornado, it is both a "micro-framework" and a simple web server, so it has everything you need right out of the box. http://flask.pocoo.org/
This example (right off the homepage) pretty much sums up how simple the code can be:
from flask import Flask

app = Flask(__name__)

@app.route("/")
def hello():
    return "Hello World!"

if __name__ == "__main__":
    app.run()
If you are creating non-interactive pages, you can easily set up any modern web server to execute your Python script as a CGI. Instead of loading a static file, your web server will return the output of your Python script.
This isn't very sophisticated, but if you are simply returning the output without needing browser-submitted data, this is the easiest way (scaling under load is a different story).
You don't even need the cgi module from Python if you aren't receiving any data from the browser. Anything more complicated than this and you should use a web framework.
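A minimal CGI sketch, assuming an Apache-style cgi-bin directory (the filename and report content are placeholders):

#!/usr/bin/env python
# save as cgi-bin/report.py and make it executable;
# the web server runs it and serves whatever it prints
print("Content-Type: text/html")
print()                                   # a blank line ends the HTTP headers
print("<html><body>")
print("<h1>Daily report</h1>")
# ... run your existing parsing code here and print its results as HTML ...
print("</body></html>")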
Examples and other methods
Simple example: the hardest part is web server configuration
mod_python: cuts down on CGI overhead (otherwise, Apache execs the Python interpreter for each hit)
Python's cgi module: for handling data sent to your Python script from the browser
Sorting
JavaScript-side sorting: I've used this JavaScript library to add sortable tables. This is the easiest way to add sorting without requiring additional work or another HTTP GET.
Instructions:
Download this file
Add to your HTML
Add class="sortable" to any table you'd like to make sortable
Click on the headers to sort
You might consider Tornado if Django is too much overhead. I've used both and agree that, if you have something simple/small to do and don't already know Django, it's going to exponentially increase your time to production. On the other hand, you can 'get' Tornado in a couple of hours and get something relatively simple done in a day or two with no prior experience with it. At least, that's been my experience with it.
Note that Tornado is still a tradeoff: you get a lot of simplicity in exchange for the huge cornucopia of features and shortcuts you get w/ Django.
PS - in addition to being a 'micro-framework', Tornado is also its own web server, so there's no mucking with wsgi/mod-cgi/fcgi.... just write your request handlers and run it. Be sure to see the demos included in the distribution.
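A minimal Tornado handler sketch (the port and HTML output are arbitrary; your script's results would go where the write call is):

import tornado.ioloop
import tornado.web

class ReportHandler(tornado.web.RequestHandler):
    def get(self):
        # run your existing script here and render its results
        self.write("<html><body><h1>Results</h1></body></html>")

def make_app():
    return tornado.web.Application([(r"/", ReportHandler)])

if __name__ == "__main__":
    make_app().listen(8888)
    tornado.ioloop.IOLoop.current().start()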
Have you seen bottle framework? It is a micro framework and very simple.
If I understood your requirements correctly, you might find Wooey very interesting.
Wooey is a Django app that creates automatic web UIs for Python scripts:
http://wooey.readthedocs.org
Here you can check a demo:
https://wooey.herokuapp.com/
Django is a big web framework, meant to include loads of things because you often need them, even though sometimes you don't.
Look at Pyramid, earlier known as BFG. It's much smaller.
http://pypi.python.org/pypi/pyramid/1.0a1
Other microframeworks to check out are listed here: http://wiki.python.org/moin/WebFrameworks
On the other hand, in this case that's probably also overkill. It sounds like you can run the script once every ten minutes, write out a static HTML file, and just serve it with Apache.
If you are not willing to write your own tool, there is a pretty advanced tool for executing your scripts: http://rundeck.org/
It's pretty simple to start and can be configured for complex scenarios as well.
For the requirement of a custom view (with sortable results), I believe you can implement a simple plugin that translates script output into HTML elements.
Also, for simple setups I can recommend my own tool: https://github.com/bugy/script-server. It doesn't have tons of features, but it is very easy for end users and supports interactive execution.
If you don't need any input from the browser, this sounds like an almost-static webpage that just happens to change once a day. You'll only need some way to get HTML out of your script into a place where your web server can access it.
So you'd use some form of templating; if you need some structure beyond a single page, there are static site / blog generators that you can feed your output into, say in Markdown format, and then call their make html command or the like.
You can use DicksonUI (https://dicksonui.gitbook.io) or Remi GUI (search for it on Google).
In my opinion DicksonUI is better, though note that I am the author of DicksonUI.

Scripting HTTP more efficiently

Often I want to automate HTTP queries. I currently use Java (and Commons HttpClient), but would probably prefer a scripting-based approach. Something really quick and simple, where I can set a header and go to a page without worrying about setting up the entire OO lifecycle, setting each header, and calling up an HTML parser. I am looking for a solution in any language, preferably a scripting one.
Have a look at Selenium. It generates code for C#, Java, Perl, PHP, Python, and Ruby if you need to customize the script.
Watir sounds close to what you want, although it (like Selenium, linked to in another answer) actually opens up a browser to do its work. You can see some examples here. Another browser-based record-and-playback system is Sahi.
If your application uses WSGI, then paste is a nice option.
Mechanize, linked to in another answer, is a "browser in a library", and there are clones in Perl, Ruby, and Python. The Perl one is the original and seems to be the way to go if you don't want a browser. The problem with this approach is that the front-end code (which might rely on JavaScript) won't be exercised.
Mechanize for Python seems easy to use: http://wwwsearch.sourceforge.net/mechanize/
My turn: wget, or Perl with LWP. You'll find examples on the linked page.
If you have simple needs (fetch a page and then parse it), it is hard to beat LWP::Simple and HTML::TreeBuilder.
use strict;
use warnings;
use LWP::Simple;
use HTML::TreeBuilder;
my $url = 'http://www.example.com';
my $content = get( $url) or die "Couldn't get $url";
my $t = HTML::TreeBuilder->new_from_content( $content );
$t->eof;
$t->elementify;
# Get first match:
my $thing = $t->look_down( _tag => 'p', id => qr/match_this_regex/ );
print $thing ? $thing->as_text : "No match found\n";
# Get all matches:
my @things = $t->look_down( _tag => 'p', id => qr/match_this_regex/ );
print $_ ? $_->as_text : "No match found" for @things;
I'm testing ReST APIs at the moment and found the ReST Client very nice. It's a GUI program, but nonetheless you can save and restore queries as XML files (or have them generated), embed them, write test scripts, and so on. And it's Java based (which is not an advantage in itself, but you mentioned Java).
It loses points on recording sessions, though; the ReST Client is good for stateless "one-shots".
If it doesn't suit your needs, I'd go for the already-mentioned Mechanize (or WWW::Mechanize, as it is called on CPAN).
Depending on exactly what you're doing, the easiest solution looks to be bash + curl.
The man page for the latter is available here:
http://curl.haxx.se/docs/manpage.html
You can do POSTs as well as GETs, HTTPS, show headers, work with cookies, use basic and digest HTTP authentication, and tunnel through all sorts of proxies, including NTLM on *nix, among other things.
curl is also available as a shared library with C and PHP support.
Python's urllib may be what you're looking for.
Alternatively, PowerShell exposes the full .NET HTTP library in a scripting environment.
Twill is pretty good and made for testing. It can be used as a script, in an interactive session, or within a Python program.
Perl and WWW::Mechanize can make web scraping etc simple and easy, including easy handling of forms (let's say you want to go to a login page, fill in a username and password and submit the form, handling cookies / hidden session identifiers just as a browser would...)
Similarly, finding or extracting links from the fetched page is trivial.
If you need to parse stuff out of the resulting pages that WWW::Mechanize can't easily help with, then feed the result to HTML::TreeBuilder to make parsing easy.
What about using PHP+Curl, or just bash?
Some Ruby libraries:
httparty: really interesting; the philosophy behind it is interesting too.
mechanize: a classic, good-quality web automation library.
scrubYt: puzzling at first glance but fun to use.

Python based web reporting tool? [closed]

I have a question for those of you doing web work with Python. Is anyone familiar with a Python-based reporting tool? I am about to start on a pretty big web app and will need the ability to do some end-user reporting (invoices, revenue reports, etc.). It can be an existing Django app or anything Python-based that I can hook into.
ReportLab
Welcome to the ReportLab open source site. ReportLab is a library for programmatically creating PDF documents. It's a fast, flexible, cross-platform solution written in Python.
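A tiny ReportLab sketch that writes a one-page PDF (the filename, coordinates and text are made up):

from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas

c = canvas.Canvas("invoice.pdf", pagesize=letter)
c.drawString(72, 720, "Invoice #1234")    # coordinates are points from the bottom-left corner
c.drawString(72, 700, "Total: $99.00")
c.showPage()
c.save()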
Or go a little higher level than ReportLab: xhtml2pdf, now WeasyPrint (built on top of ReportLab).
From the website:
Translates HTML and CSS input into PDF files
Is written in pure Python and is therefore platform independent
Supports document specifics like columns, headers, footers, page numbers, custom PostScript and TrueType fonts, etc.
Best support for frameworks like Django, TurboGears, CherryPy, Pylons, WSGI
Simple integration into Python programs
Also available as a standalone command-line tool for Windows, macOS and Linux
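Usage is roughly as follows, assuming xhtml2pdf's pisa API (the HTML content and filename are placeholders):

from xhtml2pdf import pisa

html = "<h1>Revenue report</h1><p>Total: $99.00</p>"
with open("report.pdf", "wb") as output:
    pisa.CreatePDF(html, dest=output)     # converts the HTML string into a PDF file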
Most reporting tools are stuck in the '80s: a time when you 'painted' a report intended to be printed that completely lacked integration with other reports.
Sometimes we still need that. If you need to print an invoice, you're pretty much stuck with that kind of functionality. But in general, most reporting these days consists of multiple queries/charts/graphs/tables per page with drill-down built directly into it.
If you've got enough of a need go with an OLAP tool - then you don't even code the reports, your users (theoretically) can. If not, I've seldom seen a scenario in which a "reporting tool" was better than using something like Chart Director with a language like php, perl, python, ruby, etc.
Take a look at Cubes, a lightweight OLAP framework for Python. It is only a partial solution to your problem, but I think it might help.
Sources at GitHub
Documentation
Blog with tutorials
You can either use Python to do OLAP/aggregated browsing, or you can run an OLAP HTTP server (called Slicer). Here is an example using the HTTP server: Open Public Procurements reporting. The front end is PHP, which accesses the Slicer server through HTTP. An example server can be found here, and documentation for the server can be found here.
Currently the framework provides a SQL backend using SQLAlchemy, so you can use any DB that SQLAlchemy has an engine for.
Reports in the form of charts, tables and the like, including a JS front-end framework, are planned; I just wanted to help with at least the lower OLAP layer.
Let me know if you have any questions; I am the author.
Also have a look at myDBR a tool that allows you to define your reports in the database (using stored procedures) and then takes care of the layout and formatting of the data.
Even though myDBR is a PHP application, it does not require any PHP coding; just install the application and embed it as an iframe in your own app.
I've been working on a recent addition in this space, Datapane. It allows you to create HTML reports from Python which you can share as standalone HTML files. This means you can have interactive components, such as table viewers and interactive plots (which have become a lot more popular since this question was originally posed).
Currently it supports pandas DataFrames, Bokeh, Plotly, Altair, JSON, and Markdown components.
For instance:
import altair as alt
import pandas as pd
import datapane as dp
df = pd.read_csv('https://query1.finance.yahoo.com/v7/finance/download/GOOG?period1=1553600505&period2=1585222905&interval=1d&events=history')
chart = alt.Chart(df).encode(x='Date', y='High', y2='Low').mark_area(opacity=0.5).interactive()
dp.Report(dp.Table(df['High']), dp.Plot(chart)).save(path='stock_analysis.html')
It's still early, but check it out: https://docs.datapane.com
While doing some research I found out about awe. It runs a small web server which can do real-time updates on a page and allows for more complicated report layouts.
It has documentation and examples and can get you something up and running quickly.

Screen Scrape Form Results

I was recently asked by a client to build a website for their insurance business. As part of this, they want to do some screen scraping of the quote site for one of their providers. They asked if there was an API to do this and were told there wasn't one, but that if they could get the data from the provider's engine they could use it as they wanted.
My question: is it even possible to perform screen scraping on the response to a form submission to another site? If so, what are the gotchas I should look out for? Obvious legal/ethical issues aside, since they have already asked for permission to do what we're planning.
As an aside, I would prefer to do any processing in Python.
Thanks
A really nice library for screen scraping is mechanize, which I believe is a clone of an original library written in Perl. In combination with the ClientForm module, and some additional help from BeautifulSoup, you should be away.
I've written loads of screen-scraping code in Python and these modules turned out to be the most useful. Most of the stuff that mechanize does could in theory be done by simply using the urllib2 or httplib modules from the standard library, but mechanize makes this stuff a breeze: essentially it gives you a programmatic browser (note, it does not require a browser to work, but merely provides you with an API that behaves like a completely customisable browser).
For post-processing, I've had a lot of success with BeautifulSoup, but lxml.html is a good choice too.
Basically, you will be able to do this in Python for sure, and your results should be really good given the range of tools out there.
You can pass a data parameter to urllib.urlopen to send POST data with the request, just as if you had filled out the form. You'll obviously have to take a look at exactly what data the form contains.
Also, if the form has method="GET", the request data is just part of the URL given to urlopen.
The pretty-much-standard tool for parsing the returned HTML data is BeautifulSoup.
I see the other two answers already mention all the major libraries of choice for the purpose... as long as the site being scraped does not make extensive use of JavaScript, that is. If it IS a JavaScript-heavy site that depends on JS for the data it fetches and displays (e.g. via AJAX), your problem is an order of magnitude harder; in that case, I might suggest starting with crowbar, some customization of diggstripper, or Selenium, etc.
You'll have to do substantial work in JavaScript, and probably dedicated work to deal with the specifics of the (hypothetically JS-heavy) site in question, depending on the JS frameworks it uses, etc.; that's why the job is so much harder in that case. But in any case you might end up with (at least in part) local HTML copies of the site's pages as displayed, and finish by scraping those copies with the other tools already recommended. Good luck: may the sites you scrape always be JavaScript-light!
Others have recommended BeautifulSoup, but it's much better to use lxml. Despite its name, it is also for parsing and scraping HTML. It's much, much faster than BeautifulSoup, and it even handles "broken" HTML better than BeautifulSoup (their claim to fame). It has a compatibility API for BeautifulSoup too if you don't want to learn the lxml API.
Ian Bicking agrees.
There's no reason to use BeautifulSoup anymore, unless you're on Google App Engine or somewhere that anything not purely Python isn't allowed.
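For instance, a small lxml.html sketch; the HTML snippet is inlined for illustration, but in practice it would be the fetched response body:

import lxml.html

html_text = "<html><body><a href='/a'>First</a> <a href='/b'>Second</a></body></html>"
doc = lxml.html.fromstring(html_text)
for anchor in doc.xpath("//a[@href]"):    # XPath queries work out of the box
    print(anchor.get("href"), anchor.text_content())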
