I am writing a Python script for scraping a webpage. I have created a WebKit WebView object and used the open method to load the URL, but I want to load the URL through a proxy.
How can I do this? How do I integrate WebKit with a proxy? Which WebKit class supports a proxy?
Try the code snippet below (adapted from a reference):
import gtk, webkit
import ctypes

# Load the GObject, libsoup, and WebKit shared libraries directly,
# since the old Python bindings do not expose the session proxy API.
libgobject = ctypes.CDLL('/usr/lib/libgobject-2.0.so.0')
libsoup = ctypes.CDLL('/usr/lib/libsoup-2.4.so.1')
libwebkit = ctypes.CDLL('/usr/lib/libwebkit-1.0.so')

# Create a SoupURI for the proxy and set it on WebKit's default session.
proxy_uri = libsoup.soup_uri_new('http://127.0.0.1:8000')  # set your proxy URL here
session = libwebkit.webkit_get_default_session()
libgobject.g_object_set(session, "proxy-uri", proxy_uri, None)

# Build a minimal window around a WebView and load a page through the proxy.
w = gtk.Window()
s = gtk.ScrolledWindow()
v = webkit.WebView()
s.add(v)
w.add(s)
w.show_all()
v.open('http://www.google.com')
gtk.main()
Hope this helps.
You can use QNetworkProxy.setApplicationProxy if you're on PyQt, or this snippet if you're using PyGI:
from gi.repository import WebKit
from gi.repository import Soup
proxy_uri = Soup.URI.new("http://127.0.0.1:8080")
session = WebKit.get_default_session()
session.set_property("proxy-uri", proxy_uri)
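For the PyQt route, here is a minimal sketch (assuming PyQt4 and a local HTTP proxy on 127.0.0.1:8080, both of which are assumptions) that sets an application-wide proxy before creating the web view:
from PyQt4.QtGui import QApplication
from PyQt4.QtCore import QUrl
from PyQt4.QtNetwork import QNetworkProxy
from PyQt4.QtWebKit import QWebView

app = QApplication([])

# Route all Qt network traffic through the proxy (host and port are assumptions).
proxy = QNetworkProxy(QNetworkProxy.HttpProxy, "127.0.0.1", 8080)
QNetworkProxy.setApplicationProxy(proxy)

view = QWebView()
view.load(QUrl("http://www.google.com"))
view.show()
app.exec_()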
References:
- PyGI
- PyQt
How about a solution that's already made?
PyPhantomJS is a minimalistic, headless, WebKit-based, JavaScript-driven tool. It is written in PyQt4 and Python. It runs on Linux, Windows, and Mac OS X.
It gives you access to a full headless WebKit browser, controllable via scripts written in JavaScript, with capabilities that include screen scraping* and proxy support. It is driven from the command line.
You can see the API here.
* When I say screen scraping, I mean you can either scrape page content or save page renders to a file. There's even a screen scraping JS library already written here.
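Invocation looks roughly like this (the --proxy option is an assumption, mirroring the PhantomJS flag of the same name, and the script filename is a placeholder):
# Hypothetical invocation; check the PyPhantomJS docs for the exact entry point and flags.
python pyphantomjs.py --proxy=127.0.0.1:8000 myscript.js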
Related
I'm trying to automate the process of creating an account for something, let's call it X, but I can't figure out what to do.
I saw this code somewhere,
import urllib
import urllib2
import webbrowser

# Build a GET query string and fetch the results page (Python 2).
data = urllib.urlencode({'q': 'Python'})
url = 'http://duckduckgo.com/html/'
full_url = url + '?' + data
response = urllib2.urlopen(full_url)

# Save the HTML locally and open it in the default browser.
with open("results.html", "w") as f:
    f.write(response.read())
webbrowser.open("results.html")
But I can't figure out how to modify it for my use.
I would highly recommend using Selenium + WebDriver for this, since your question appears to be UI- and browser-based. You can install Selenium via 'pip install selenium' in most cases. Here are a couple of good references to get started:
- http://selenium-python.readthedocs.io/
- https://pypi.python.org/pypi/selenium
Also, if this process needs to drive the browser headlessly, look into including PhantomJS (via GhostDriver), which can be downloaded from the phantomjs.org website.
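As a rough sketch of what the Selenium side looks like (the URL and the form field names below are hypothetical placeholders for whatever site X actually uses):
from selenium import webdriver

driver = webdriver.Firefox()  # or webdriver.PhantomJS() for headless runs

# Placeholder URL; replace with the real signup page.
driver.get("http://example.com/signup")

# The element names here are hypothetical; inspect the real form to find them.
driver.find_element_by_name("username").send_keys("myuser")
driver.find_element_by_name("password").send_keys("s3cret")
driver.find_element_by_name("signup").click()

driver.quit()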
I am trying to use spynner for web scraping; below I use www.google.com as an example. I want to automatically search for "Barack Obama" using spynner. However, the browser window created by spynner keeps not responding, and the search string ("Barack Obama") is not filled in the search box (you will see it when you run the code below yourself).
import spynner
browser = spynner.Browser()
browser.show()
browser.load("https://www.google.com")
browser.wait_page_load()
browser.fill("input[name=q]", "Barack Obama")
browser.click("input[name=btnK]")
The input fields are identified correctly in my code; you can check for yourself. So why is this not working?
Try this code snippet. I used Qt key events:
import spynner
from PyQt4.QtCore import Qt

b = spynner.Browser()
b.show()
b.load("http://www.google.com")
b.wk_fill('input[name=q]', 'soup')            # fill the search box via WebKit
b.sendKeys("input[name=q]", [Qt.Key_Enter])   # press Enter to submit the form
b.browse()                                    # hand control to the event loop
I'm trying to log in to https://accounts.coursera.org/ using twill for Python.
I tried this piece of code:
import twill
b = twill.get_browser()
b.go("https://accounts.coursera.org/")
b.showforms()
twill doesn't detect the form on the page, and the showforms method doesn't show anything!
Is that an internal issue in the twill package, or am I missing something?
import twill
import webbrowser

b = twill.get_browser()
b.go("https://accounts.coursera.org/")

# Dump the page twill actually received and inspect it in a real browser.
page = b.result.get_page()
tmp_page = "tmp.html"
with open(tmp_page, "w") as f:
    f.write(page)
webbrowser.open(tmp_page)
# b.showforms()
I get a page that says:
Please use a modern browser with JavaScript enabled to use Coursera.
So I suspect that twill doesn't include a JavaScript interpreter.
I'm building a Django app and I'm using Spynner for web crawling. I have this problem and I hope someone can help me.
I have this function in the module "crawler.py":
import spynner

def crawling_js(url):
    br = spynner.Browser()
    br.load(url)
    text_page = br.html
    br.close  # (*)
    return text_page
(*) I tried with br.close() too
In another module (e.g. "import.py") I call the function in this way:
from crawler import crawling_js

l_url = ["https://www.google.com/", "https://www.tripadvisor.com/", ...]
for url in l_url:
    mytextpage = crawling_js(url)
    # ... parse mytextpage ...
When I pass the first URL to the function, everything is correct; when I pass the second URL, Python crashes on this line: br.load(url). Can someone help me? Thanks a lot.
I have:
Django 1.3
Python 2.7
Spynner 1.1.0
PyQt4 4.9.1
Why do you need to instantiate br = spynner.Browser() and close it every time you call crawling_js()? In a loop this uses a lot of resources, which I think is the reason why it crashes. Let's think of it like this: br is a browser instance, so you can make it browse any number of websites without needing to close it and open it again. Adjust your code this way:
import spynner

br = spynner.Browser()  # you open it only once

def crawling_js(url):
    br.load(url)
    text_page = br._get_html()  # _get_html() makes sure you get the updated HTML
    return text_page
Then, if you want to close br later, you simply do:
from crawler import crawling_js, br

l_url = ["https://www.google.com/", "https://www.tripadvisor.com/", ...]
for url in l_url:
    mytextpage = crawling_js(url)
    # ... parse mytextpage ...
br.close()
I have an HTML page displayed using:
cherrypy.quickstart(ShowHTML(htmlfile), config=configfile)
Once the page is loaded (e.g. initiated via the command 'python mypage.py'), I would like to automatically launch the browser to display the page (e.g. at http://localhost:8000). Is there any way I can achieve this (e.g. via a hook within CherryPy), or do I have to call up the browser manually (e.g. by double-clicking an icon)?
TIA
Alan
You can either hook your webbrowser call into the engine start/stop lifecycle:
import webbrowser
import cherrypy

def browse():
    webbrowser.open("http://127.0.0.1:8080")

cherrypy.engine.subscribe('start', browse, priority=90)
Or, unpack quickstart:
import webbrowser
from cherrypy import config, engine, tree

config.update(configfile)
tree.mount(ShowHTML(htmlfile), '/', configfile)

# Register OS signal handlers where available (platform-dependent).
if hasattr(engine, "signal_handler"):
    engine.signal_handler.subscribe()
if hasattr(engine, "console_control_handler"):
    engine.console_control_handler.subscribe()

engine.start()
webbrowser.open("http://127.0.0.1:8080")
engine.block()