Scrapy - switch from selenium/browser back to default mechanism in single spider - python

I've encountered a page with Ajax-hidden elements which I need to crawl. I've found this neat tutorial which shows how to do this with Selenium when there are no additional calls to the server (which is the case for me as well).
http://www.6020peaks.com/2014/12/how-to-scrape-hidden-web-data-with-python/
However, this and other sources mention a performance cost of using Selenium for this purpose. In this example the driver is initialised in the constructor, so I'm assuming all requests for the spider will then go via Firefox?
Only a small portion of my calls involve Ajax; the rest is standard Scrapy crawling. Is it feasible, within a single spider, to switch back from Selenium/the browser to the default Scrapy mechanism once part of the tasks has been completed? If so, how should I go about it?
def __init__(self):
    self.driver = webdriver.Firefox()

def parse(self, response):
    items = []
    self.driver.get(response.url)
Edit
What I'm after is scraping the Ajax-based menu of a single site, just the URLs. I then want to pass this list as start_urls to the main spider.

Your code does not break standard Scrapy behaviour; try switching back to the standard way like this:
def __init__(self):
    self.driver = webdriver.Firefox()

def parse(self, response):
    items = []
    self.driver.get(response.url)
    # get hidden menu urls
    yield scrapy.Request(hidden_menu_url, callback=self.parse_original_scrapy)

def parse_original_scrapy(self, response):
    pass
You can also try my framework, Pomp, instead of Scrapy.
Start with the phantomjs example and implement your own Downloader that dispatches a request either to the webdriver or to a plain HTTP fetch. It is not easy to do, but it is much better than using the webdriver inside the parse method of a Scrapy spider.
Sorry for my poor English

There is no direct way to do this, as Scrapy fetches exactly the request you send, as plain text without JavaScript rendering, much like curl would.
Moving from Selenium to pure Scrapy is possible by working through every single request (or just the necessary ones): use the Chrome dev tools or Firebug to check which requests the browser makes for each call, then work out which information you want and which of those requests are needed to get it.
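For example, if the network tab shows the hidden menu being filled from a JSON endpoint, you can hit that endpoint with a plain scrapy.Request and drop Selenium for that step. A rough sketch, where the endpoint URL and the JSON layout are hypothetical stand-ins for whatever the dev tools actually reveal:
import json
import scrapy


class MenuSpider(scrapy.Spider):
    name = "menu"
    # Hypothetical Ajax endpoint spotted in the browser's network tab
    start_urls = ["http://example.com/ajax/menu"]

    def parse(self, response):
        # Assumes the endpoint returns JSON holding the menu entries
        data = json.loads(response.text)
        for entry in data.get("menu", []):
            # From here on it is plain Scrapy crawling again
            yield scrapy.Request(entry["url"], callback=self.parse_page)

    def parse_page(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}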

Related

Is there a way to "refresh" a request?

I'm trying to download files from a website using Python's requests module and beautifulsoup4, but the problem is that you have to wait 5 seconds before the download button appears.
I tried using requests.get('URL') to get the page and then parsing it with beautifulsoup4 to get the download link, but you have to wait 5 seconds (if you open it in an actual browser) for the button to appear, so when I pass the URL to requests.get() the initial response object doesn't have the button element. I searched a lot on Google but couldn't find any results that helped me.
Is there a way to "refresh" the response object, or to "wait", that is, to update its contents after five seconds as if it had been opened in a browser?
I don't think this is possible with the requests module. What should I do?
I'm running Windows 10 x64.
I'm new so sorry if the formatting is bad. :(
HTTP is stateless: every new request is a different request from the earlier one. State is typically implemented with cookies, browser storage and so on. Being a plain HTTP client, requests has no way to "refresh" a request; the next request will be a completely new one.
What you're looking for is a client that understands JavaScript and can handle the page update automatically. I suggest you look at Selenium, which can do browser automation.
Try something like this:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get("http://somedomain/url_that_delays_loading")
try:
    # wait up to 10 seconds for the element to appear
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "myDynamicElement"))
    )
finally:
    driver.quit()
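As a follow-up, once the wait succeeds you could, still before driver.quit() runs, read the link off the button and hand it to requests for the actual download. Treat this as a sketch: the element ID comes from the snippet above, and the assumption that the URL sits in an href attribute has to be checked against the real page.
import requests

# Inside the try block, after the wait has succeeded:
download_url = element.get_attribute("href")  # assumes the button carries the link
if download_url:
    response = requests.get(download_url)
    with open("downloaded_file.bin", "wb") as f:
        f.write(response.content)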

scrapy: spider quits without error messages before all requests yielded

If there are many requests in the scheduler, will the scheduler reject additional requests?
I've run into a very tricky problem. I am trying to scrape a forum with all its posts and comments. The problem is that Scrapy never seems to finish its job and quits without any error messages. I am wondering whether I yielded too many requests, so that Scrapy stopped yielding new ones and just quit.
But I could not find any documentation saying that Scrapy will quit if there are too many requests in the scheduler. Here is my code:
https://github.com/spacegoing/sentiment_mqd/blob/a46b59866e8f0a888b43aba6df0481a03136cf21/guba_spiders/guba_spiders/spiders/guba_spider.py#L217
The strange thing is that Scrapy only seems able to scrape 22 pages. If I start from page 1, it stops at page 21. If I start from page 21, it stops at page 41... No exception is raised and the scraped results are the desired outputs.
1.
The code on GitHub you shared at a46b598 is probably not the exact version you ran locally for the sample jobs; e.g. I haven't observed any log lines like <timestamp> [guba] INFO: <url>.
But, well, I assume the difference is not too significant.
2.
It's suggested to configure the log level to DEBUG whenever you encounter an issue.
3.
If you've got the log level configured to DEBUG, you'd probably see something like this:
2018-10-26 15:25:09 [scrapy.downloadermiddlewares.redirect] DEBUG: Discarding <GET http://guba.eastmoney.com/topic,600000_22.html>: max redirections reached
Some more lines: https://gist.github.com/starrify/b2483f0ed822a02d238cdf9d32dfa60e
That happens because you're passing the full response.meta dict to the following requests (related code), and Scrapy's RedirectMiddleware relies on some meta values (e.g. "redirect_times" and "redirect_ttl") to perform the check.
And the solution is simple: pass only the values you need into next_request.meta.
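A minimal sketch of that fix, inside the spider's callback; the "page" meta key and the CSS selector are made-up examples of values you might genuinely want to carry over:
def parse(self, response):
    # ... extract items here ...
    # Instead of meta=response.meta (which drags redirect_times / redirect_ttl along
    # and eventually trips RedirectMiddleware's max-redirect check), forward only
    # the values you actually need:
    next_url = response.css("a.next::attr(href)").get()  # hypothetical selector
    if next_url:
        yield scrapy.Request(
            response.urljoin(next_url),
            callback=self.parse,
            meta={"page": response.meta.get("page", 1) + 1},  # hypothetical key
        )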
4.
It's also observed that you're trying to rotate the user-agent strings, possibly to avoid crawl bans. But no other action is taken, so your requests would still look fishy, because:
Scrapy's cookie management is enabled by default, which would use the same cookie jar for all your requests.
All your requests come from the same source IP address.
Thus I'm unsure whether this is good enough for you to scrape the whole site properly, especially when you're not throttling the requests.
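If throttling is the concern, a handful of stock Scrapy settings already help; a settings.py sketch, with the numbers as placeholders to tune rather than recommendations:
# settings.py -- illustrative values only
COOKIES_ENABLED = False          # stop sharing one cookie jar across all requests
DOWNLOAD_DELAY = 1.0             # base delay between requests to the same site
AUTOTHROTTLE_ENABLED = True      # let Scrapy adapt the delay to server responses
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 10.0
CONCURRENT_REQUESTS_PER_DOMAIN = 4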

Scraping delayed ajax with Selenium / Splinter

I am trying to write a script to identify if potential new homes have Verizon FiOS service available.
Unfortunately, the site's extensive use of JavaScript has prevented everything from working. I'm using Selenium (wrapped in the splinter module) to let the JavaScript execute, but I can't get past the second page.
Here is a simplified version of the script:
from splinter import Browser
browser = Browser()
browser.visit('https://www.verizon.com/FORYOURHOME/ORDERING/OrderNew/OrderAddressInfo.aspx')
nameAddress1 = "ctl00$body_content$txtAddress"
nameZip = "ctl00$body_content$txtZip"
formFill = {nameAddress1: '46 Demarest Ave',
            nameZip: '10956'}
browser.fill_form(formFill)
browser.find_by_id('btnContinue').first.click()
if browser.is_element_present_by_id('rdoAddressOption0', wait_time=10):
    browser.find_by_id('rdoAddressOption0').first.click()
    browser.find_by_id('body_content_btnContinue').first.click()
In this example, it chooses the first option when it asks for confirmation of address.
It errors out with an ElementNotVisibleException. If I remove the is_element_present check, it errors out because it cannot find the element. The element is visible and clickable in the live browser that selenium is controlling, but it seems like selenium is not seeing an updated version of the page HTML.
As an alternative, I thought I might be able to do the POST request and process the response with requests or mechanize, but there is some kind of funny redirect that I can't wrap my head around.
How do I either get selenium to behave, or bypass the javascript/ajax and do it with GET and POST requests?
The problem is that the input you are clicking on is actually hidden via the display: none style.
To work around it, execute JavaScript code to click the input and set its checked attribute:
browser.execute_script("""var element = document.getElementById('rdoAddressOption0');
element.click();
element.checked=true;""")
browser.find_by_id('body_content_btnContinue').first.click()

How Do You Pass New URLs to a Scrapy Crawler

I would like to keep a Scrapy crawler constantly running inside a Celery task worker, probably using something like this, or as suggested in the docs.
The idea would be to use the crawler to query an external API that returns XML responses. I would like to pass the URL (or the query parameters, and let the crawler build the URL) I want to query to the crawler, and the crawler would make the call and give me back the extracted items. How can I pass a new URL I want to fetch to the crawler once it has started running? I do not want to restart the crawler every time I want to give it a new URL; instead I want the crawler to sit idle, waiting for URLs to crawl.
The two methods I've spotted for running Scrapy inside another Python process use a new Process to run the crawler. I would like not to have to fork and tear down a new process every time I want to crawl a URL, since that is pretty expensive and unnecessary.
Just have a spider that polls a database (or a file?) and, when presented with a new URL, creates and yields a new Request() object for it.
You can build it by hand easily enough. There is probably a better way to do it than that, but that's basically what I did for an open-proxy scraper: the spider gets a list of all the 'potential' proxies from the database and generates a Request() object for each one; when they're returned, they're dispatched down the chain and verified by downstream middleware, and their records are updated by the item pipeline.
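A minimal sketch of that idea, assuming a plain text file (urls.txt, one URL per line) stands in for the database; the spider name, the file name and the idle-signal plumbing are illustrative, and the engine.crawl call may need adapting to your Scrapy version:
import scrapy
from scrapy import signals
from scrapy.exceptions import DontCloseSpider


class PollingSpider(scrapy.Spider):
    """Keeps running and picks up new URLs from a file each time it goes idle."""
    name = "polling"
    url_file = "urls.txt"  # hypothetical source; a database query works the same way

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        # Re-check the URL source every time the spider would otherwise close.
        crawler.signals.connect(spider.on_idle, signal=signals.spider_idle)
        return spider

    def pending_requests(self):
        try:
            with open(self.url_file) as f:
                urls = [line.strip() for line in f if line.strip()]
        except FileNotFoundError:
            urls = []
        for url in urls:
            # Already-crawled URLs are dropped by Scrapy's duplicate filter.
            yield scrapy.Request(url, callback=self.parse)

    def start_requests(self):
        yield from self.pending_requests()

    def on_idle(self):
        for request in self.pending_requests():
            # Older Scrapy wants (request, spider); newer versions take just (request).
            self.crawler.engine.crawl(request, self)
        raise DontCloseSpider  # stay alive and poll again on the next idle check

    def parse(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}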
You could use a message queue (like IronMQ--full disclosure, I work for the company that makes IronMQ as a developer evangelist) to pass in the URLs.
Then in your crawler, poll for the URLs from the queue, and crawl based on the messages you retrieve.
The example you linked to could be updated (this is untested and pseudocode, but you should get the basic idea):
import time

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy import log
from testspiders.spiders.followall import FollowAllSpider
from iron_mq import IronMQ  # note: the module name uses an underscore

mq = IronMQ()
q = mq.queue("scrape_queue")
crawler = Crawler(Settings())
crawler.configure()
while True:  # poll forever
    msg = q.get(timeout=120)  # get messages from the queue
    # timeout is the number of seconds the message will be reserved for, making sure
    # no other crawlers get that message. Set it to a safe value (the max amount of
    # time it will take you to crawl a page).
    if len(msg["messages"]) < 1:  # if there are no messages waiting to be crawled
        time.sleep(1)  # wait one second
        continue  # try again
    spider = FollowAllSpider(domain=msg["messages"][0]["body"])  # crawl the domain in the message
    crawler.crawl(spider)
    crawler.start()
    log.start()
    reactor.run()  # the script will block here
    q.delete(msg["messages"][0]["id"])  # when you're done with the message, delete it

Controlling a browser using Python, on a Mac

I'm looking for a way to programmatically control a browser on a Mac (i.e. Firefox, Safari, Chrome/-ium or Opera, but not IE) using Python.
The actions I need include following links, checking if elements exist in a page, and submitting forms.
Which solution would you recommend?
I like Selenium, it's scriptable through Python. The Selenium IDE only runs in Firefox, but Selenium RC supports multiple browsers.
Check out python-browsercontrol.
Also, you could read this forum page (I know, it's old, but it seems extremely relevant to your question):
http://bytes.com/topic/python/answers/45528-python-client-side-browser-script-language
Also: http://docs.python.org/library/webbrowser.html
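For the simplest cases, the standard-library webbrowser module linked above can at least open pages in the user's default browser, though it can't inspect them, follow links programmatically or fill forms:
import webbrowser

# Opens the URL in the default browser; returns True if a browser was launched.
webbrowser.open("https://example.com")
webbrowser.open_new_tab("https://example.com/docs")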
A python-browsercontrol example:
from browser import *

my_browser = Firefox(99, '/usr/lib/firefox/firefox-bin')
my_browser.open_url('cnn.com')
open_url returns when the cnn.com home page document is loaded in the browser frame.
Might be a bit restrictive, but py-appscript may be the easiest way of controlling an AppleScript-able browser from Python.
For more complex things, you can use PyObjC to achieve pretty much anything; for example, webkit2png is a Python script which uses WebKit to load a page and save an image of it. You need a decent understanding of Objective-C and Cocoa etc. to use it (it just exposes ObjC objects to Python).
Screen-scraping may achieve what you want with much less complexity.
Check out the spynner Python module.
Spynner is a stateful programmatic web browser module for Python. It is based upon PyQt and WebKit. It supports JavaScript, AJAX, and every other technology WebKit is able to handle (Flash, SVG, ...). Spynner takes advantage of jQuery, a powerful JavaScript library that makes interaction with pages and event simulation really easy.
Using Spynner you would be able to simulate a web browser with no GUI (though a browsing window can be opened for debugging purposes), so it may be used to implement crawlers or acceptance-testing tools.
See some examples on the GitHub page.
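A very rough sketch of what using it looks like; the method names here are from memory and should be checked against the project's own examples:
import spynner

browser = spynner.Browser()
browser.load("https://example.com")   # load the page and run its JavaScript
print(browser.html)                   # rendered HTML after scripts have executed
browser.close()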
Try mechanize, if you don't actually need a browser.
Example:
import re
import mechanize
br = mechanize.Browser()
br.open("http://www.example.com/")
# follow second link with element text matching regular expression
response1 = br.follow_link(text_regex=r"cheese\s*shop", nr=1)
assert br.viewing_html()
print(br.title())
print(response1.geturl())
print(response1.info())  # headers
print(response1.read())  # body
br.select_form(name="order")
# Browser passes through unknown attributes (including methods)
# to the selected HTMLForm.
br["cheeses"] = ["mozzarella", "caerphilly"] # (the method here is __setitem__)
# Submit current form. Browser calls .close() on the current response on
# navigation, so this closes response1
response2 = br.submit()
Several Mac applications can be controlled via OSAScript (a.k.a. AppleScript), which can be sent via the osascript command. O'Reilly has an article on invoking osascript from Python. I can't vouch for it doing exactly what you want, but it's a starting point.
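As a rough illustration of that route (the AppleScript line assumes Safari, but any scriptable browser can be addressed the same way):
import subprocess

# Ask Safari to open a URL via AppleScript; osascript ships with macOS.
script = 'tell application "Safari" to open location "https://example.com"'
subprocess.run(["osascript", "-e", script], check=True)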
Maybe overpowered, but check out Marionette to control Firefox. There is a tutorial at readthedocs:
You first start a Marionette-enabled firefox instance:
firefox -marionette
Then you create a client:
client = Marionette('localhost', port=2828)
client.start_session()
Navigation, for example, is done via:
url = 'http://mozilla.org'
client.navigate(url)
client.go_back()
client.go_forward()
assert client.get_url() == url
Check out Mozmill: https://github.com/mikeal/mozmill
Mozmill is a UI Automation framework for Mozilla apps like Firefox and Thunderbird. It's both an addon and a Python command-line tool. The addon provides an IDE for writing and running the JavaScript tests and the Python package provides a mechanism for running the tests from the command line as well as providing a way to test restarting the application.
Take a look at PyShell (an extension to PyXPCOM).
Example:
from xpcom import components

promptSvc = components.classes["@mozilla.org/embedcomp/prompt-service;1"] \
    .getService(components.interfaces.nsIPromptService)
promptSvc.alert(None, 'Greeting...', "Hello from Python")
You can use the selenium library for Python; here is a simple example (in the form of a unittest):
#!/usr/bin/env python3
import unittest
from selenium import webdriver


class FooTest(unittest.TestCase):
    def setUp(self):
        self.driver = webdriver.Firefox()
        self.base_url = "http://example.com"

    def tearDown(self):
        # close the browser once the test is done
        self.driver.quit()

    def is_text_present(self, text):
        return str(text) in self.driver.page_source

    def test_example(self):
        self.driver.get(self.base_url + "/")
        self.assertTrue(self.is_text_present("Example"))


if __name__ == '__main__':
    suite = unittest.TestLoader().loadTestsFromTestCase(FooTest)
    result = unittest.TextTestRunner(verbosity=2).run(suite)
