web browser created by spynner not responding - python

I am trying to use spynner for web scraping. Below I use www.google.com as an example: I want to automatically search for "Barack Obama" using spynner. However, the browser window created by spynner stops responding, and the search string ("Barack Obama") is never filled into the search box (you will see this if you run the code below yourself).
import spynner
browser = spynner.Browser()
browser.show()
browser.load("https://www.google.com")
browser.wait_page_load()
browser.fill("input[name=q]", "Barack Obama")
browser.click("input[name=btnK]")
The input fields are identified correctly in my code; you can check for yourself. So why is this not working?

Try this code snippet. I used Qt.
import spynner
from PyQt4.QtCore import Qt
b = spynner.Browser()
b.show()
b.load("http://www.google.com")
b.wk_fill('input[name=q]', 'soup')
b.sendKeys("input[name=q]",[Qt.Key_Enter])
b.browse()

Related

Twill doesn't show forms

I'm trying to log in to https://accounts.coursera.org/ using twill for Python.
I tried this piece of code:
import twill
b = twill.get_browser()
b.go("https://accounts.coursera.org/")
b.showforms()
twill doesn't detect the form in the page, and the showforms method doesn't show anything!
Is that an internal issue in the twill package, or am I missing something?
import twill
import webbrowser
b = twill.get_browser()
b.go("https://accounts.coursera.org/")
page = b.result.get_page()
tmp_page = "tmp.html"
with open(tmp_page, "w") as f:
    f.write(page)
webbrowser.open(tmp_page)
# b.showforms()
I get a page that says:
Please use a modern browser with JavaScript enabled to use Coursera.
So I suspect the problem is that twill doesn't include a JavaScript interpreter.
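One way to check that suspicion is to look at whether the raw HTML returned by the server contains any `<form>` element at all. If it doesn't, the forms are presumably built by JavaScript after the page loads, which twill never runs. Below is a minimal sketch using only the standard library; `FormCounter`, `count_forms` and the two sample pages are hypothetical names and data made up for illustration.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2


class FormCounter(HTMLParser):
    """Counts <form> start tags in a chunk of HTML."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.forms = 0

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.forms += 1


def count_forms(html):
    """Return the number of <form> elements present in the static markup."""
    parser = FormCounter()
    parser.feed(html)
    parser.close()
    return parser.forms


# Hypothetical sample pages: one with a server-rendered form, one that
# only ships a script and a <noscript> notice.
static_page = '<html><body><form action="/login"><input name="u"></form></body></html>'
js_page = ('<html><body><noscript>Please enable JavaScript.</noscript>'
           '<script src="app.js"></script></body></html>')

print(count_forms(static_page))
print(count_forms(js_page))
```

With twill, the same check could be run on the string returned by `b.result.get_page()` from the snippet above.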

spynner doesn't load XHR data

I'm building a script to monitor a reporting service. Depending on how long the report takes to process, it either appears in the HTML or arrives via XMLHttpRequest (XHR).
As a tool to check the page I want to use spynner, which works perfectly for HTML, but I can't get it to work when the data comes via XHR.
The code for the test is the following:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
__docformat__ = 'restructuredtext en'
from time import sleep
from spynner import browser
import pyquery
from PyQt4.QtCore import QUrl
from PyQt4.QtNetwork import QNetworkRequest, QNetworkAccessManager
from PyQt4.QtCore import QByteArray
def load_page(br):
    ret = br.load_jquery(True)
    print ret
    return 'Japan' in br.html
br = browser.Browser(
    debug_level=4
)
br.load('https://foobar.eu/newton/cgi-bin/cognos.cgi')
br.create_webview()
br.show()
#br.load("https://foobar.eu/newton/cgi-bin/cognos.cgi?b_action=xts.run&m=portal/cc.xts&m_folder=iA37B5BBC0615469DA37767D2B6F1DCF1")
#br.browse()
res = br.load("https://foobar.eu:443/newton/cgi-bin/cognos.cgi?b_action=cognosViewer&ui.action=run&ui.object=/content/folder[#name='DMA Admin Zone']/folder[#name='02. Performance Benchmark Module']/folder[#name='1. Reports']/report[#name='CQM_Test_3_HTML_Heavy_Local_Processing_Final']&ui.name=CQM_Test_3_HTML_Heavy_Local_Processing_Final&run.outputFormat=&run.prompt=true", 1, wait_callback=load_page)
d = str(pyquery.PyQuery(br.html))
if d.find("Japan") > -1:
    print 'We discovered Japan!'
else:
    print 'Japan is nowhere to be seen!'
sleep(10)
The URL in the comments is a page which contains a link to the report. When I click the report link by hand, the report loads (via XHR). However, I can't get it to work via scripting.
br.load_jquery always returns None.
To help, I have added part of the spynner debug trace from when I click the link by hand: http://fpaste.org/97583/13987135/
In Firebug I can clearly see the XHR response containing the string 'Japan'.
What am I missing?
Apparently, replacing the load_page function with the following code makes it work:
def load_page(br):
    br.wait(5)
    return 'Japan' in br.html
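The fixed five-second wait presumably works because it gives the XHR time to finish before the HTML is inspected. If the report sometimes takes longer (or much less), polling for the condition with a timeout may be more robust than a single fixed delay. Here is a minimal sketch of that idea; `wait_until` and its parameters are hypothetical and not part of spynner.

```python
import time


def wait_until(predicate, timeout=10.0, interval=0.5, pump=time.sleep):
    """Call predicate() repeatedly until it is true or `timeout` seconds pass.

    pump(interval) is called between checks. It defaults to time.sleep;
    the assumption is that with spynner one would pass something like
    br.wait instead, so the browser keeps processing events while waiting.
    Returns True if the predicate became true, False on timeout.
    """
    deadline = time.time() + timeout
    while True:
        if predicate():
            return True
        if time.time() >= deadline:
            return False
        pump(interval)


# Stand-in condition for the demo: becomes true on the third check.
attempts = []


def ready():
    attempts.append(1)
    return len(attempts) >= 3


print(wait_until(ready, timeout=2.0, interval=0.01))
```

In the script above, the condition would be something like `lambda: 'Japan' in br.html`, with `pump=br.wait`; that wiring is an assumption and untested here.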

Spynner crash python

I'm building a Django app and I'm using Spynner for web crawling. I have this problem and I hope someone can help me.
I have this function in the module "crawler.py":
import spynner
def crawling_js(url):
    br = spynner.Browser()
    br.load(url)
    text_page = br.html
    br.close  # (*)
    return text_page
(*) I also tried br.close().
In another module (e.g. "import.py") I call the function like this:
from crawler import crawling_js
l_url = ["https://www.google.com/", "https://www.tripadvisor.com/", ...]
for url in l_url:
    mytextpage = crawling_js(url)
    # ... parse mytextpage ...
When I pass the first URL to the function everything works, but when I pass the second URL Python crashes. The crash happens on this line: br.load(url). Can someone help me? Thanks a lot.
I have:
Django 1.3
Python 2.7
Spynner 1.1.0
PyQt4 4.9.1
Why do you need to instantiate br = spynner.Browser() and close it every time you call crawling_js()? In a loop this uses a lot of resources, which I think is why it crashes. Think of it this way: br is a browser instance, so you can make it browse any number of websites without closing and reopening it. Adjust your code like this:
import spynner
br = spynner.Browser() #you open it only once.
def crawling_js(url):
    br.load(url)
    text_page = br._get_html() #_get_html() to make sure you get the updated html
    return text_page
Then, if you still want to close br afterwards, you simply do:
from crawler import crawling_js , br
l_url = ["https://www.google.com/", "https://www.tripadvisor.com/", ...]
for url in l_url:
    mytextpage = crawling_js(url)
    # ... parse mytextpage ...
br.close()
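The "open once, reuse, close at the end" pattern can also be wrapped in a context manager so the browser is closed even if parsing raises. The sketch below is an illustration only: `FakeBrowser` is a hypothetical stand-in so the code runs without spynner or Qt, and `shared_browser` and `crawl_all` are made-up names. The assumption is that `spynner.Browser` could be passed as the factory instead.

```python
from contextlib import contextmanager


class FakeBrowser(object):
    """Hypothetical stand-in for spynner.Browser so the sketch runs anywhere."""
    instances = 0

    def __init__(self):
        FakeBrowser.instances += 1
        self.html = ""
        self.closed = False

    def load(self, url):
        # A real browser would fetch and render the page here.
        self.html = "<html>%s</html>" % url

    def close(self):
        self.closed = True


@contextmanager
def shared_browser(factory):
    """Create one browser, hand it out, and always close it afterwards."""
    browser = factory()
    try:
        yield browser
    finally:
        browser.close()


def crawl_all(urls, factory=FakeBrowser):
    """Load every URL with a single browser instance and collect the HTML."""
    pages = {}
    with shared_browser(factory) as br:
        for url in urls:
            br.load(url)
            pages[url] = br.html
    return pages


pages = crawl_all(["https://www.google.com/", "https://www.tripadvisor.com/"])
print(len(pages))
print(FakeBrowser.instances)
```

Only one browser object is created no matter how many URLs are crawled, which is the point of the answer above.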

python webkit with proxy support

I am writing a python script for scraping a webpage. I have created a webkit webview object and used the open method for loading the url. But I want to load the url through a proxy.
How can I do this? How do I integrate WebKit with a proxy? Which WebKit class supports proxies?
Try the code snippet below (reference from url).
import gtk, webkit
import ctypes
libgobject = ctypes.CDLL('/usr/lib/libgobject-2.0.so.0')
libsoup = ctypes.CDLL('/usr/lib/libsoup-2.4.so.1')
libwebkit = ctypes.CDLL('/usr/lib/libwebkit-1.0.so')
proxy_uri = libsoup.soup_uri_new('http://127.0.0.1:8000') # set your proxy url
session = libwebkit.webkit_get_default_session()
libgobject.g_object_set(session, "proxy-uri", proxy_uri, None)
w = gtk.Window()
s = gtk.ScrolledWindow()
v = webkit.WebView()
s.add(v)
w.add(s)
w.show_all()
v.open('http://www.google.com')
Hope it helps.
You can use QNetworkProxy.setApplicationProxy if you're on PyQt, or this snippet if you're using PyGI:
from gi.repository import WebKit
from gi.repository import Soup
proxy_uri = Soup.URI.new("http://127.0.0.1:8080")
session = WebKit.get_default_session()
session.set_property("proxy-uri",proxy_uri)
References:
PyGI
PyQt
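For the PyQt route, Qt's application-wide proxy is configured through `QNetworkProxy`, which takes the host and port as separate arguments rather than a single URL. A minimal sketch of splitting a proxy URL into those parts follows; `split_proxy` is a hypothetical helper, and the PyQt4 calls are left as comments on the assumption that PyQt4 is installed.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2


def split_proxy(proxy_url):
    """Split 'http://host:port' into (host, port); reject incomplete URLs."""
    parts = urlparse(proxy_url)
    if not parts.hostname or parts.port is None:
        raise ValueError("proxy URL needs a host and a port: %r" % proxy_url)
    return parts.hostname, parts.port


host, port = split_proxy("http://127.0.0.1:8080")
print(host)
print(port)

# Assumption: with PyQt4 available, the values could then be applied like this.
# from PyQt4.QtNetwork import QNetworkProxy
# QNetworkProxy.setApplicationProxy(
#     QNetworkProxy(QNetworkProxy.HttpProxy, host, port))
```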
How about a solution that's already made?
PyPhantomJS is a minimalistic, headless, WebKit-based, JavaScript-driven tool. It is written in PyQt4 and Python. It runs on Linux, Windows, and Mac OS X.
It gives you access to a full headless WebKit browser, controllable via scripts written in JavaScript, with the ability to do various things, amongst which is screen scraping and proxy support. It uses the command line.
You can see the API here.
* When I say screen scraping, I mean you can either scrape page content, or even save page renders to a file. There's even a screen scraping JS library already written here.

Launching browser within CherryPy

I have a html page displayed using...
cherrypy.quickstart(ShowHTML(htmlfile), config=configfile)
Once the page is being served (e.g. started via the command 'python mypage.py'), I would like to automatically launch the browser to display the page (e.g. at http://localhost:8000). Is there any way I can achieve this (e.g. via a hook within CherryPy), or do I have to open the browser manually (e.g. by double-clicking an icon)?
TIA
Alan
You can either hook your webbrowser into the engine start/stop lifecycle:
import webbrowser
import cherrypy
def browse():
    webbrowser.open("http://127.0.0.1:8080")
cherrypy.engine.subscribe('start', browse, priority=90)
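The priority=90 argument matters here because, as far as I know, CherryPy's bus calls the listeners on a channel in ascending priority order (default 50) and the HTTP server's own start listener uses 75, so a listener at 90 runs once the server is already up. The toy bus below is a simplified model of that ordering, not CherryPy's actual implementation; `MiniBus` and the listener labels are hypothetical.

```python
class MiniBus(object):
    """Toy publish/subscribe bus that calls listeners in priority order."""

    def __init__(self):
        self._listeners = {}

    def subscribe(self, channel, callback, priority=50):
        # Lower priority numbers are called first.
        self._listeners.setdefault(channel, []).append((priority, callback))

    def publish(self, channel):
        ordered = sorted(self._listeners.get(channel, []),
                         key=lambda item: item[0])
        return [callback() for _, callback in ordered]


calls = []
bus = MiniBus()
# Subscribed first, but runs last because of its higher priority number.
bus.subscribe('start', lambda: calls.append('open browser'), priority=90)
bus.subscribe('start', lambda: calls.append('start server'), priority=75)
bus.publish('start')
print(calls)
```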
Or, unpack quickstart:
import webbrowser
from cherrypy import config, engine, tree
config.update(configfile)
tree.mount(ShowHTML(htmlfile), '/', configfile)
if hasattr(engine, "signal_handler"):
    engine.signal_handler.subscribe()
if hasattr(engine, "console_control_handler"):
    engine.console_control_handler.subscribe()
engine.start()
webbrowser.open("http://127.0.0.1:8080")
engine.block()
