Example: if the spider throws an exception on page 15, it should be able to restart at page 15.
Going through the Scrapy documentation, under the topic Jobs: pausing and resuming crawls, I ran the spider with the command given there, i.e. scrapy crawl spidername -s JOBDIR=directory-path.
When I go into that directory-path, I can see that three files have been created: requests.queue, requests.seen and spider.state (as in the image: https://i.stack.imgur.com/gE7zU.png). Only spider.state has a size of 1KB; the other two are 0KB. While the spider is running, a file named p0 is created under the requests.queue folder, but once the spider is stopped and run again, the p0 file under requests.queue is deleted.
Looking at the documentation again, it states: "Requests must be serializable by the pickle module, in order for persistence to work, so you should make sure that your requests are serializable." After setting SCHEDULER_DEBUG = True in settings.py, I can see in the console:
[scrapy.core.scheduler] WARNING: Unable to serialize request:
Is this the reason why I cannot resume the spider from where it stopped, because the requests are not serialized? If so, how can I make the requests serializable so that the spider resumes from where it left off? Or is there another approach to achieve this? An answer with sample code would be helpful.
Also, can anyone explain what those three files are for, as there is no explanation in the Scrapy documentation?
I guess that in order to stop and resume spiders reliably, we should use a database to store the spider's state. There may be other ways too, but this seemed the most effective to me.
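For what it's worth, here is a minimal sketch (hypothetical spider, assuming the warning comes from a non-picklable callback, which is the usual cause) of what breaks JOBDIR persistence and how to keep requests serializable:

import scrapy

class PagesSpider(scrapy.Spider):
    name = "pages"
    start_urls = ["https://example.com/page/1"]

    def parse(self, response):
        for href in response.css("a.item::attr(href)").getall():
            # BAD: a lambda (or any callable that is not a method of this spider)
            # cannot be pickled, so the request is skipped when the queue is
            # written to JOBDIR and the crawl cannot resume from it:
            # yield response.follow(href, callback=lambda r: self.parse_page(r, response.url))

            # GOOD: a bound spider method plus plain-data cb_kwargs pickle fine:
            yield response.follow(href, callback=self.parse_page,
                                  cb_kwargs={"listing_page": response.url})

    def parse_page(self, response, listing_page):
        yield {"listing_page": listing_page, "url": response.url}

As for the three files: requests.queue holds the pending requests persisted to disk (the p0 file), requests.seen holds the duplicate-filter fingerprints of requests already seen, and spider.state is the pickled spider.state dict, which is why it survives across runs.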
I need to scrape data from a list of domains given in an Excel file.
The problem is that I need to scrape data from the original website (let's take for example https://www.lepetitballon.com) and data from SimilarTech (https://www.similartech.com/websites/lepetitballon.com).
I want to scrape both at the same time so I can receive and format the results together at the end; after that I'll just move on to the next domain.
Theoretically, I should just use two spiders asynchronously with Scrapy?
Ideally you would want to keep spiders that scrape differently structured sites separate; that way your code will be a lot easier to maintain in the long run.
Theoretically, if for some reason you MUST parse them in the same spider, you could collect the URLs you want to scrape and, based on the base path, invoke different parser callback methods, as sketched below. That said, I personally cannot think of a reason why you would have to do that; even if the sites had the same structure, you could just reuse your scrapy.Item classes.
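A rough sketch of that idea (hypothetical spider and field names; the two parse methods would hold the site-specific extraction):

import scrapy

class CombinedSpider(scrapy.Spider):
    name = "combined"

    def start_requests(self):
        for domain in ["lepetitballon.com"]:  # domains read from the Excel list
            yield scrapy.Request(f"https://www.{domain}",
                                 callback=self.parse_site)
            yield scrapy.Request(f"https://www.similartech.com/websites/{domain}",
                                 callback=self.parse_similartech)

    def parse_site(self, response):
        # extraction logic for the original website goes here
        yield {"source": "site", "url": response.url}

    def parse_similartech(self, response):
        # extraction logic for the SimilarTech page goes here
        yield {"source": "similartech", "url": response.url}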
Scrapy uses the Twisted networking library for its internal networking tasks, so requests are already handled asynchronously, and the level of concurrency can be tuned through settings.
Explained here: https://docs.scrapy.org/en/latest/topics/settings.html#concurrent-requests
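For example, a settings.py sketch using those settings (the values here are only illustrative):

# settings.py: how many requests Scrapy keeps in flight at once
CONCURRENT_REQUESTS = 32            # total concurrent requests
CONCURRENT_REQUESTS_PER_DOMAIN = 8  # cap per target domain
DOWNLOAD_DELAY = 0.25               # optional politeness delay, in seconds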
Or you could use multiple spiders that are independent of each other, which is already explained in the Scrapy docs; this might be what you are looking for.
By default, Scrapy runs a single spider per process when you run scrapy crawl. However, Scrapy supports running multiple spiders per process using the internal API.
https://docs.scrapy.org/en/latest/topics/practices.html#running-multiple-spiders-in-the-same-process
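A minimal sketch of that internal API (the spider class names and the module they are imported from are assumptions):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# OriginalSiteSpider and SimilarTechSpider are hypothetical names for the two spiders
from myproject.spiders import OriginalSiteSpider, SimilarTechSpider

process = CrawlerProcess(get_project_settings())
process.crawl(OriginalSiteSpider)
process.crawl(SimilarTechSpider)
process.start()  # runs both spiders concurrently and blocks until they finish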
As for efficiency, you could choose either option A or B; it really depends on your resources and requirements. Option A (a single spider with tuned concurrency) is good for lower resources with decent speed, while option B (multiple spiders in the same process) can give better speed at the cost of higher resource consumption.
I know there is already a question regarding this here (How to avoid re-downloading media to S3 in Scrapy?), but I do not have my answer yet.
I have designed a spider with a FilesPipeline to get PDF files from several websites.
I understand that the FilesPipeline class uses GCSFilesStore and the media_to_download method to compare the blob's last_modified date to the current time, with respect to an expiration duration in days (EXPIRES initially equals 90).
The point is that I want to be able to launch my spider from time to time and only download new documents.
However, when I run my spider a second time, it re-downloads all the files.
I have tried increasing the EXPIRES parameter, but it did not seem to help.
Any help appreciated, thanks!
UPDATE:
I think this is a bug in Scrapy. I filed a bug report on GitHub where I explain how to reproduce it.
It seems that this may be due to the permissions configuration of the bucket. Here is GitHub user michalp123's answer:
I cannot reproduce this bug. #lblanche, are you sure you set up permissions for the bucket correctly? The very first time I've tried reproducing it I got a setup where the service account I used had write permissions, but for some reason calling get_blob on the bucket raised a 403, which caused stat_file method in GCSFilesStore to fail, and that caused the file to be downloaded every time. After fixing the permissions everything worked as it should. If that's the case here, I think it would be a good idea to check permissions in GCSFilesStore's init and display a warning if it's impossible to get file's metadata from the bucket.
The same user merged a fix that adds a warning at GCSFilesStore initialization if access to the metadata is not allowed.
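For reference, a settings.py sketch of the (real) Scrapy settings that drive the GCS-backed FilesPipeline, with placeholder values; FILES_EXPIRES is the setting behind the EXPIRES attribute, and, as explained above, the credentials used must also be allowed to read blob metadata or stat_file fails and everything gets re-downloaded:

# settings.py -- bucket, project id and the expiry value are placeholders
ITEM_PIPELINES = {
    "scrapy.pipelines.files.FilesPipeline": 1,
}
FILES_STORE = "gs://my-pdf-bucket/documents"  # a gs:// URI selects GCSFilesStore
GCS_PROJECT_ID = "my-gcp-project"
FILES_EXPIRES = 365  # days before an already-stored file is downloaded again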
I have a working crawler in Scrapy+Splash. It launches a spider on many pages. Each page contains a list of links. For each page, the spider downloads the page itself and then some of the pages linked from it (not recursively). All the pages are saved on the file system. The system works flawlessly. At the moment I'm refactoring it to add some DB interaction.
I'm not using items, nor Item Pipelines.
What are the benefits of using them?
Adding some info:
The purpose of my crawler is to download entire pages (as HTML, PNG, or converted to TXT using a library). As soon as the spider has a response to save, it passes it to a library that encapsulates all the I/O operations (file system and DB). This way, it seems simpler than using items (with the boilerplate for conversion) and pipelines.
So where is my doubt?
I don't know well enough how Scrapy works internally. The way the crawler is implemented, the I/O operations are executed in the spider's thread, so each spider takes longer to execute. By contrast, if I move the I/O operations into a pipeline, maybe Scrapy can schedule its jobs better, executing them separately from the crawling work. Will there be any real performance difference?
In my opinion, using pipelines is just following the separation of concerns principle. Your spider can do many things, but its core function is to extract information from web pages. The rest can (and possibly should) be refactored into a pipeline or an extension.
It might not be such an issue if you have one spider for one web site. But imagine you have a Scrapy project with hundreds of spiders for semantically similar web sites and you want to apply the same logic to each item -- take a page snapshot, check for duplicates, store it in a database, etc. And now imagine the maintenance hell if you had all that logic in each of the spiders and had to change it.
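As an illustration, here is a minimal sketch of moving the I/O out of the spiders into one shared pipeline (the item fields and the save_page helper standing in for the existing file-system/DB library are hypothetical):

# pipelines.py
from myproject.storage import save_page  # hypothetical wrapper around the FS/DB library

class PageStoragePipeline:
    def process_item(self, item, spider):
        # every spider only yields items; the storage concern lives in one place
        save_page(url=item["url"], body=item["body"], fmt=item["format"])
        return item

# settings.py
# ITEM_PIPELINES = {"myproject.pipelines.PageStoragePipeline": 300}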
Does anyone know how I could run the same Scrapy scraper over 200 times on different websites, each with their respective output files? Usually in Scrapy, you indicate the output file when you run it from the command line by typing -o filename.json.
There are multiple ways to do it:
Create a pipeline that stores the items, with configurable parameters, e.g. by running scrapy crawl myspider -a output_filename=output_file.txt. output_filename is added as an argument to the spider, and you can then access it from a pipeline like this:
class MyPipeline(object):
    def process_item(self, item, spider):
        # output_filename was passed with -a, so it is available as a spider attribute
        filename = spider.output_filename
        with open(filename, "a") as f:
            f.write(f"{dict(item)}\n")  # now do your magic with filename
        return item
You can also run Scrapy from within a Python script and then do whatever you need with the output items, as in the sketch below.
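A rough sketch of that approach (the spider name, the site argument and the output path are assumptions); the %(site)s placeholder in the feed URI is filled from the spider attribute of the same name, so each website gets its own file:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

websites = ["example.com", "example.org"]  # your ~200 domains

settings = get_project_settings()
# FEEDS needs Scrapy >= 2.1; %(site)s is replaced by the spider attribute "site"
settings.set("FEEDS", {"output/%(site)s.json": {"format": "json"}})

process = CrawlerProcess(settings)
for site in websites:
    # the keyword argument becomes a spider attribute, usable both by the
    # spider and by the feed URI above; "myspider" is a placeholder spider name
    process.crawl("myspider", site=site)
process.start()  # runs the crawls and blocks until all of them finish

With ~200 sites you may want to launch them in batches rather than all at once.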
I'm doing a similar thing. Here is what I have done:
Write the crawler as you normally would, but make sure to implement feed exports. I have the feed export push the results directly to an S3 bucket. Also, I recommend that you accept the website as a command line parameter to the script. (Example here)
Set up scrapyd to run your spider
Package and deploy your spider to scrapyd using scrapyd-client
Now, with your list of websites, simply issue a single request per URL (e.g. a curl command against schedule.json) to your scrapyd process, as sketched below.
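A small Python sketch of that last step (project, spider and argument names are assumptions; scrapyd is assumed to listen on its default port 6800):

from urllib.parse import urlencode
from urllib.request import urlopen

websites = ["https://www.example.com", "https://www.example.org"]  # your list of sites

for site in websites:
    data = urlencode({
        "project": "myproject",  # the project you deployed with scrapyd-client
        "spider": "myspider",    # the spider packaged above
        "website": site,         # reaches the spider just like -a website=...
    }).encode()
    # scrapyd replies with a small JSON document containing the job id
    with urlopen("http://localhost:6800/schedule.json", data=data) as resp:
        print(resp.read().decode())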
I've used the above strategy to shallow scrape two million domains, and I did it in less than 5 days.
I want to get data from the web in real time and have used Scrapy to extract the information for a Python utility I'm building. The problem is that the scraped data is static, while the information will change over time.
I wanted to know if it is viable to call my Scrapy spider when the utility is invoked, so that on the first call the data at that moment is stored as JSON on the user's side, and it is refreshed the next time the user runs the utility.
Please let me know if there is an alternative to it.
Thanks in advance.
Edit-1: To make it clear, the data that I have extracted will change over time. Here is a link to my previous question about building the spider: How to scrape contents from multiple tables in a webpage. The problem is that as the league progresses, the fixtures' status will change (completed or not yet completed). I want the users to get real-time scraped data.
Edit-2: What I previously did was call my spider separately and use the generated JSON for the purposes of my utility. For the users to have real-time data when they use it in the terminal, should I push the Scrapy code into the main repository that will be uploaded to PyPI and call the spider in the main function of the .py file? Is this possible? What are the alternatives, if any?
You could start your Scrapy spider from code whenever you (or your user) need it:
from scrapy import cmdline

SCRAPY_SPIDER_NAME = 'spider_name'  # name of the spider to start scraping
# cmdline.execute expects the command split into a list of arguments
cmdline.execute("scrapy crawl {}".format(SCRAPY_SPIDER_NAME).split())