I'm building a Django app and I'm using Spynner for web crawling. I have this problem and I hope someone can help me.
I have this function in the module "crawler.py":
import spynner
def crawling_js(url)
br = spynner.Browser()
br.load(url)
text_page = br.html
br.close (*)
return text_page
(*) I tried with br.close() too
in another module (eg: "import.py") I call the function in this way:
from crawler import crawling_js
l_url = ["https://www.google.com/", "https://www.tripadvisor.com/", ...]
for url in l_url:
mytextpage = crawling_js(url)
.. parse mytextpage....
when I pass the first url in to the function all is correct when I pass the second "url" python crash. Python crash in this line:br.load(url). Someone can help me? Thanks a lot
I have:
Django 1.3
Python 2.7
Spynner 1.1.0
PyQt4 4.9.1
Why you need to instantiate br = spynner.Browser() and close it every time you call crawling_js(). In a loop this will utilize a lot of resources which I think is the reason why it crashes. let's think of it like this, br is a browser instance. Therefore, you can make it browse any number of websites without the need to close it and open it again. Adjust your code this way:
import spynner
br = spynner.Browser() #you open it only once.
def crawling_js(url):
br.load(url)
text_page = br._get_html() #_get_html() to make sure you get the updated html
return text_page
then if you insist to close br later you simply do:
from crawler import crawling_js , br
l_url = ["https://www.google.com/", "https://www.tripadvisor.com/", ...]
for url in l_url:
mytextpage = crawling_js(url)
.. parse mytextpage....
br.close()
Related
I'm trying to automate the process of creating an account for something, lets call it X, but I cant figure out what to do.
I saw this code somewhere,
import urllib
import urllib2
import webbrowser
data = urllib.urlencode({'q': 'Python'})
url = 'http://duckduckgo.com/html/'
full_url = url + '?' + data
response = urllib2.urlopen(full_url)
with open("results.html", "w") as f:
f.write(response.read())
webbrowser.open("results.html")
But I cant figure out how to modify it for my use.
I would highly recommend utilizing Selenium+Webdriver for this, since your question appears UI and browser-based. You can install Selenium via 'pip install selenium' in most cases. Here are a couple of good references to get started.
- http://selenium-python.readthedocs.io/
- https://pypi.python.org/pypi/selenium
Also, if this process needs to drive the browser headlessly, look into including PhantomJS (via GhostDriver), which can be downloaded from the phantomjs.org website.
I am trying to crawl wordreference, but I am not succeding.
The first problem I have encountered is, that a big part is loaded via JavaScript, but that shouldn't be much problem because I can see what I need in the source code.
So, for example, I want to extract for a given word, the first two meanings, so in this url: http://www.wordreference.com/es/translation.asp?tranword=crane I need to extract grulla and grĂșa.
This is my code:
import lxml.html as lh
import urllib2
url = 'http://www.wordreference.com/es/translation.asp?tranword=crane'
doc = lh.parse((urllib2.urlopen(url)))
trans = doc.xpath('//td[#class="ToWrd"]/text()')
for i in trans:
print i
The result is that I get an empty list.
I have tried to crawl it with scrapy too, no success. I am not sure what is going on, the only way I have been able to crawl it is using curl, but that is sloopy, I want to do it in an elegant way, with Python.
Thank you very much
It looks like you need a User-Agent header to be sent, see Changing user agent on urllib2.urlopen.
Also, just switching to requests would do the trick (it automatically sends the python-requests/version User Agent by default):
import lxml.html as lh
import requests
url = 'http://www.wordreference.com/es/translation.asp?tranword=crane'
response = requests.get("http://www.wordreference.com/es/translation.asp?tranword=crane")
doc = lh.fromstring(response.content)
trans = doc.xpath('//td[#class="ToWrd"]/text()')
for i in trans:
print(i)
Prints:
grulla
grĂșa
plataforma
...
grulla blanca
grulla trompetera
I am trying to use spynner for web scraping ... below I used www.google.com as an example .... I want to automatically search for "Barack Obama" using spynner ... However, the web browser created by spynner keeps not responding ... and the search string ("Barack Obama") is not filled in the search box (You will see it when you run the code below yourself).
import spynner
browser = spynner.Browser()
browser.show()
browser.load("https://www.google.com")
browser.wait_page_load()
browser.fill("input[name=q]", "Barack Obama")
browser.click("input[name=btnK]")
The input fields are identfied correctly in my code ... you can check for yourself. ... So why is this not working?
Trie this code snippet.. I used qt
import spynner
from PyQt4.QtCore import Qt
b = spynner.Browser()
b.show()
b.load("http://www.google.com")
b.wk_fill('input[name=q]', 'soup')
b.sendKeys("input[name=q]",[Qt.Key_Enter])
b.browse()
I'm building a script to monitor a reporting service. Depending on how it takes to process the report the report appears in HTML or comes via XmlHttpRequest.
As a tool to check the page I want to use spynner, which works perfect for HTML, but it seems that I can't get it to work when the data comes via XHR.
The code for the test is the following:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
__docformat__ = 'restructuredtext en'
from time import sleep
from spynner import browser
import pyquery
from PyQt4.QtCore import QUrl
from PyQt4.QtNetwork import QNetworkRequest, QNetworkAccessManager
from PyQt4.QtCore import QByteArray
def load_page(br):
ret = br.load_jquery(True)
print ret
return 'Japan' in br.html
br = browser.Browser(
debug_level=4
)
br.load('https://foobar.eu/newton/cgi-bin/cognos.cgi')
br.create_webview()
br.show()
#br.load("https://foobar.eu/newton/cgi-bin/cognos.cgi?b_action=xts.run&m=portal/cc.xts&m_folder=iA37B5BBC0615469DA37767D2B6F1DCF1")
#br.browse()
res = br.load("https://foobar.eu:443/newton/cgi-bin/cognos.cgi?b_action=cognosViewer&ui.action=run&ui.object=/content/folder[#name='DMA Admin Zone']/folder[#name='02. Performance Benchmark Module']/folder[#name='1. Reports']/report[#name='CQM_Test_3_HTML_Heavy_Local_Processing_Final']&ui.name=CQM_Test_3_HTML_Heavy_Local_Processing_Final&run.outputFormat=&run.prompt=true", 1, wait_callback=load_page)
d = str(pyquery.PyQuery(br.html))
if d.find("Japan") > -1:
print 'We discovered Japan!'
else:
print 'Japan is nowhere to be seen!'
sleep(10)
The URL in the comments is a page which contains a link to the report. When I click the report by hand the report works (via XHP). However, I can't seem to get it to work via scripting.
The br.load_jquery always returns None.
As a help I have added part of the spynner debug trace when I click the link by hand: http://fpaste.org/97583/13987135/
In firebug I can clearly see the XHP reponse with the string 'Japan' in.
What am I missing?
apparantly replacing the load page function with the following code makes it work:
def load_page(br):
br.wait(5)
return 'Japan' in br.html
I'm making auto-login script by use mechanize python.
Before I was used mechanize with no problem, but www.gmarket.co.kr in this site I couldn't make it .
whenever i try to login always login page was returned even with correct gmarket id , pass, i can't login and I saw some suspicious message
"<script language=javascript>top.location.reload();</script>"
I think this related with my problem, but don't know exactly how to handle .
Here is sample id and pass for login test
id: tgi177 pass: tk1047
if anyone can help me much appreciate thanks in advance
CODE:
# -*- coding: cp949 -*-
from lxml.html import parse, fromstring
import sys,os
import mechanize, urllib
import cookielib
import re
from BeautifulSoup import BeautifulSoup,BeautifulStoneSoup,Tag
try:
params = urllib.urlencode({'command':'login',
'url':'http%3A%2F%2Fwww.gmarket.co.kr%2F',
'member_type':'mem',
'member_yn':'Y',
'login_id':'tgi177',
'image1.x':'31',
'image1.y':'26',
'passwd':'tk1047',
'buyer_nm':'',
'buyer_tel_no1':'',
'buyer_tel_no2':'',
'buyer_tel_no3':''
})
rq = mechanize.Request("http://www.gmarket.co.kr/challenge/login.asp")
rs = mechanize.urlopen(rq)
data = rs.read()
logged_in = r'input_login_check_value' in data
if logged_in:
print ' login success !'
rq = mechanize.Request("http://www.gmarket.co.kr")
rs = mechanize.urlopen(rq)
data = rs.read()
print data
else:
print 'login failed!'
pass
quit()
except:
pass
mechanize doesn't have the ability to interact with JavaScript. Probably spidermonkey module will help you (I have no experience with it, but description is quite promising). Also you could handle such reload (e.g.Browser.reload() for this particular case) manually if it's the only site you have this problem.
Update:
Quick look through your page shows that you have submit to other URL (with https: scheme). Look through checkValid() JavaScript function. Posting to it gives other result. Note, that this looks like homework you should do yourself before asking.