I need to write a unit test for a Scrapy spider. The problem is that the only way I know of to call a Scrapy spider programmatically is through scrapy.crawler.CrawlerProcess, which creates a subprocess, a Twisted reactor, and so on. For a simple unit test it's massive overkill.
What I want to do is simply create a request, load project settings somehow, send it and process the response.
Is there a way to do it properly?
EDIT.
I checked Scrapy Unit Testing, but the whole point of the test is to check how some XPaths stored in the database map onto the current state of the website. I'm OK with online testing; in fact, I need it.
(Then it's becoming more like an integration test, but whatever.)
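For what it's worth, a rough sketch of that approach (not an official Scrapy testing API): fetch the live page with requests, wrap it in an HtmlResponse, and call the spider's callback directly, so no CrawlerProcess or Twisted reactor is involved. The spider class, its import path and the URL are placeholders for your own project.

import requests
from scrapy.crawler import Crawler
from scrapy.http import HtmlResponse
from scrapy.utils.project import get_project_settings

from myproject.spiders.my_spider import MySpider  # hypothetical spider


def test_xpaths_against_live_site():
    # "load project settings somehow" - requires scrapy.cfg to be discoverable
    # (or SCRAPY_SETTINGS_MODULE to be set)
    settings = get_project_settings()
    spider = MySpider.from_crawler(Crawler(MySpider, settings))

    # "create a request, send it and process the response" - done here with
    # plain requests instead of the Scrapy download machinery
    url = "https://example.com/some/page"  # the page the XPaths are checked against
    body = requests.get(url).content
    response = HtmlResponse(url=url, body=body, encoding="utf-8")

    items = list(spider.parse(response))  # call the callback directly
    assert items, "expected the XPaths to extract at least one item"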
I am using Python 3.6.1 with Flask, with the goal being to load a webpage, display live loading information via websockets, and then, once the loading is complete, redirect or otherwise allow the user to do something.
I have the first two steps working: I can load the page fine, and the websocket interface (running on a separate thread using SocketIO) updates properly with the needed data. But how can I make something happen once the functions that need to load are finished? To my understanding, once I return a webpage in Flask it is simply static, and there's no easy way to change it.
Specific code examples aren't necessary, I'm just looking for ideas or resources. Thanks.
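For what it's worth, a rough sketch of one way to do it, assuming Flask-SocketIO: emit one final event when the background work finishes, and let the client-side handler perform the redirect. The route, the 'loading_complete' event name and do_long_load are made up for illustration.

from flask import Flask
from flask_socketio import SocketIO

app = Flask(__name__)
socketio = SocketIO(app)

def do_long_load():
    # ... the slow work, emitting progress events along the way ...
    socketio.emit('progress', {'percent': 50})
    # when everything has finished, tell the client it can move on; the
    # client-side handler would then do e.g. window.location = '/results'
    socketio.emit('loading_complete', {'redirect': '/results'})

@app.route('/')
def index():
    # run the work in a background task so the page itself renders immediately
    socketio.start_background_task(do_long_load)
    return 'loading page with the websocket client goes here'

if __name__ == '__main__':
    socketio.run(app)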
I have been automating web application test cases using Selenium WebDriver. Most of the time these UI tests are flaky and brittle, but when I use the Python requests module I am able to automate reliable test cases with just the GET, POST and DELETE HTTP methods (with the help of regular expressions to catch some tokens and IDs).
My question is: why does nobody seem to use HTTP libraries like the Python requests module to automate web application test cases instead of flaky UI test cases?
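A small illustration of the approach described above, in case it helps frame the question: requests does the HTTP work and a regular expression pulls a CSRF token out of the page before posting a form. The URLs and field names are invented.

import re
import requests

session = requests.Session()
login_page = session.get("https://example.com/login").text

# grab the hidden CSRF token the server embedded in the login form
token = re.search(r'name="csrf_token" value="([^"]+)"', login_page).group(1)

resp = session.post(
    "https://example.com/login",
    data={"username": "user", "password": "secret", "csrf_token": token},
)
assert resp.status_code == 200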
You are comparing apples to oranges. You can't use requests to test the same things that Selenium can test.
Selenium lets you test the result of rendering HTML and Javascript. requests is only an HTTP library; it can make HTTP requests and no more. The tools are different and test different things.
Use requests if all you need to do is test whether your server is producing the right responses to HTTP requests, as with a REST API. Use Selenium if you need to test how the HTML and Javascript are executed in a browser.
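For example, a minimal requests-based check of a (hypothetical) REST endpoint might look like this; it asserts on the raw HTTP response and never touches a browser:

import requests

def test_user_endpoint_returns_json():
    resp = requests.get("http://localhost:8000/api/users/1")  # assumed endpoint
    assert resp.status_code == 200
    assert resp.headers["Content-Type"].startswith("application/json")
    assert resp.json()["id"] == 1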
You are talking about using the GET and POST methods directly against the server. Yes, they are helpful. Here is my guess as to why people don't use this approach: it is tiresome to do so.
If your application has exposed APIs, it's easy to use any API client and contact the server. In your case, I assume there are no exposed APIs and you are intercepting the HTTP traffic to the server to play around with the requests/responses and then use them in your Python script.
The advantage: it's fast and gets the thing done. The disadvantage: there is no UI interaction, you won't get screenshots etc. (like you do when you use Selenium), and you need to spend a lot of time hacking around, intercepting the HTTP traffic.
It's just a matter of your requirements and the time you have in hand.
My experience with Scrapy is limited, and each time I use it, it's always through the terminal's commands. How can I get my form data (a URL to be scraped) from my Django template to communicate with Scrapy and start scraping? So far, all I've thought of is to get the form's returned data in Django's views and then try to reach into spider.py in Scrapy's directory to add the form data's URL to the spider's start_urls. From there, I don't really know how to trigger the actual crawling, since I'm used to doing it strictly through my terminal with commands like "scrapy crawl dmoz". Thanks.
tiny edit: Just discovered scrapyd... I think I may be headed in the right direction with this.
You've actually answered it with your edit. The best option would be to set up a scrapyd service and make an API call to schedule.json to trigger a scraping job to run.
To make that HTTP API call, you can either use urllib2/requests, or use a wrapper around the scrapyd API, python-scrapyd-api:
from scrapyd_api import ScrapydAPI

# scrapyd's JSON API listens on port 6800 by default
scrapyd = ScrapydAPI('http://localhost:6800')
# queue the spider for a run (this hits schedule.json under the hood)
scrapyd.schedule('project_name', 'spider_name')
If we put scrapyd aside and try to run the spider from the view, it will block the request until the Twisted reactor stops - so it is not really an option.
You can, though, start using Celery (in tandem with django_celery): define a task that runs your Scrapy spider and call that task from your Django view. That way you put the task on the queue and don't leave a user waiting for the crawl to finish.
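A rough sketch of that Celery route, assuming a Celery app is already configured for the Django project; the task just shells out to scrapy crawl, so the view can return immediately. The spider name, its url argument and the project path are placeholders.

import subprocess
from celery import shared_task

@shared_task
def run_spider(url):
    # '-a url=...' passes the submitted URL to the spider as a spider argument
    subprocess.check_call(
        ["scrapy", "crawl", "myspider", "-a", f"url={url}"],
        cwd="/path/to/scrapy/project",  # assumed location of scrapy.cfg
    )

# in the Django view, queue it instead of running it inline:
# run_spider.delay(form.cleaned_data["url"])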
Also, take a look at the django-dynamic-scraper package:
Django Dynamic Scraper (DDS) is an app for Django built on top of the scraping framework Scrapy. While preserving many of the features of Scrapy, it lets you dynamically create and manage spiders via the Django admin interface.
I have some code (a celery task) which makes a call via urllib to a Django view. The code for the task and the view are both part of the same Django project.
I'm testing the task, and need it to be able to contact the view and get data back from it during the test, so I'm using a LiveServerTestCase. In theory, I set up the database in the setUp function of my test case (I add a list of product instances) and then call the task; it does some stuff and then calls the Django view through urllib (hitting the dev server set up by the LiveServerTestCase), getting a JSON list of product instances back.
In practice, though, it looks like the products I add in setUp aren't visible to the view when it's called. It looks like the test case code is using one database (test_<my_database_name>) and the view running on the dev server is accessing another (the urllib call successfully contacts the view but can't find the product I've asked for).
Any ideas why this may be the case?
Might be relevant - we're testing on a MySQL db instead of sqlite.
Heading off two questions (but interested in comments if you think we're doing this wrong):
I know it seems weird that the task accesses the view using urllib. We do this because the task usually calls one of a series of third-party APIs to get info about a product, and if it cannot access these, it accesses our own Django database of products instead. The code that makes the urllib call is generic and agnostic of which case we're dealing with.
These are integration tests, so we'd prefer to actually make the urllib call rather than mock it out.
The Celery workers are still feeding off the dev database even though the test server brings up a separate test database, because that is what they were told to do in the settings file.
One fix would be to make a separate settings_test.py file that specifies the test database name, and to bring up Celery workers from the setUp method using subprocess.check_output so that they consume from a special queue reserved for testing. Those workers would then read from the test database rather than the dev database.
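A rough sketch of that fix (module names, queue name and settings path are placeholders; Popen rather than check_output is used here so the worker keeps running in the background while the test executes). The worker is started with a settings module that points at the test database, so it sees the same data as the test case.

import os
import subprocess

from django.test import LiveServerTestCase


class ProductTaskTest(LiveServerTestCase):
    def setUp(self):
        super().setUp()
        # start a worker that consumes only from a dedicated test queue;
        # settings_test.py would also route tasks to that queue and point
        # DATABASES at the test database
        self.worker = subprocess.Popen(
            ["celery", "-A", "myproject", "worker", "-Q", "test_queue", "-l", "info"],
            env={**os.environ, "DJANGO_SETTINGS_MODULE": "myproject.settings_test"},
        )

    def tearDown(self):
        self.worker.terminate()
        self.worker.wait()
        super().tearDown()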
I'm VERY new to Python and I'm attempting to integrate Scrapy with Django.
Here is what I'm trying to make happen:
User submits URL to be scraped
URL is scraped
Scraped data is returned in screen to user
User assigns attributes (if necessary), then saves it to the database.
What is the best way to accomplish this? I've played with Django Dynamic Scraper, but I think I'm better off maintaining control over Scrapy for this.
Holding the Django request while scraping another website may not be the best idea. This flow is better done asynchronously: release the Django request and have another process handle the scraping. I guess it's not an easy thing to achieve for newcomers, but try to bear with me.
The flow should look like this:
the user submits a request to scrape some data from another website
the spider crawl starts in a different process than Django, and the user's request is released
the spider pipelines items into some data store (a database)
the user keeps asking for that data, and Django updates the user based on the data inserted into the data store
Launching a Scrapy spider can be done straight from Python code using a tool like Celery (also see django and celery), by launching it in a new process using Python's subprocess, or, even better, by using scrapyd to manage those spiders.
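A bare-bones sketch of the subprocess option in that flow: one view kicks off scrapy crawl in a separate process and returns at once, another lets the user poll for the results the pipeline has written. The spider name, project path and ScrapedItem model are placeholders.

import subprocess

from django.http import JsonResponse

from myapp.models import ScrapedItem  # hypothetical model the item pipeline writes to


def start_scrape(request):
    url = request.POST["url"]
    # non-blocking: the crawl runs in its own process while this view returns
    subprocess.Popen(
        ["scrapy", "crawl", "myspider", "-a", f"url={url}"],
        cwd="/path/to/scrapy/project",
    )
    return JsonResponse({"status": "started"})


def scrape_results(request):
    # the client polls this endpoint until items show up in the data store
    items = list(ScrapedItem.objects.filter(source_url=request.GET["url"]).values())
    return JsonResponse({"done": bool(items), "items": items})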