Is it possible to scrape links by the date associated with them? I'm trying to implement a daily run spider that saves article information to a database, but I don't want to re-scrape articles that I have already scraped before-- i.e yesterday's articles. I ran across this SO post asking the same thing and the scrapy-deltafetch plugin was suggested.
However, this relies on checking new requests against previously saved request fingerprints stored in a database. I'm assuming that if the daily scraping went on for a while, there would be a need for significant memory overhead on the database to store request fingerprints that have already been scraped.
So given a list of articles on a site like cnn.com, I want to scrape all the articles that have been published today 6/14/17, but once the scraper hits later articles with a date listed as 6/13/17, I want to close the spider and stop scraping. Is this kind of approach possible with scrapy? Given a page of articles, will a CrawlSpider start at the top of the page and scrape articles in order?
Just new to Scrapy, so not sure what to try. Any help would be greatly appreciated, thank you!
You can use a custom delta-fetch_key which checks the date and the title as the fingerprint.
from w3lib.url import url_query_parameter
...
def parse(self, response):
...
for product_url in response.css('a.product_listing'):
yield Request(
product_url,
meta={'deltafetch_key': url_query_parameter(product_url, 'id')},
callback=self.parse_product_page
)
...
I compose a date using datetime.strptime(Item['dateinfo'], "%b-%d-%Y") from cobbled together information on the item of interest.
After that I just check it against a configured age in my settings, which can be overridden per invocation. You can issue a closespider exception when you find an age that is too old or you can set a finished flag and act on that in any of your other code.
No need for remembering stuff. I use this on a spider that I run daily and I simply set a 24hr age limit.
Related
I started to look at Scrapy and want to have one spider to get some prices of MTG Cards.
First I don't know if I'm 100% correct to use the link that select all the cards available in the beginning of the function:
name = 'bazarmtgbot'
allowed_domains = ['www.bazardebagda.com.br']
start_urls = ['https://bazardebagda.com.br/?view=ecom/itens&tcg=1&txt_estoque=1&txt_limit=160&txt_order=1&txt_extras=all&page=1']
1 - Should I use this kind of start_urls?
2 - Then, if you access the site, I could not find how to get the unit and price of the card, they are blank DIV's...
I got the name using:
titles = response.css(".itemNameP.ellipsis::text").extract()
3 - I couldn't find how can I do the pagination of this site to get the next set of items unit/prices. Do I need to copy the start_urls N times?
(and 3) That would be fine to start on a given page. When scraping you can queue additional URLs to scrape by looking for something like the "next page" button, scraping that link, and yield'ing a scrapy.Request that you want to follow-up on. See this part of the Scrapy tutorial
That site may be using a bunch of techniques to thwart price scraping: the blank price divs are loading an image like the below and chopping parts of it up with gibberish CSS class names to form the number. You may need to do some OCR or find an alternative method. Bear in mind that because they're going to that degree, there might be other anti-scraping countermeasures.
I'm using scrapy/spyder to build my crawler, using BeautifulSoup as well.. I have been working on a crawler and believe we are at a point that it works as expected with the few individual pages we have scraped, so my next challenge is to scrape the same site, but ONLY pages that are specific to a high level category.
Only thing i have tried is using allowed_domain and start_urls, but when i did that, it was literally hitting every page it was finding and we want to control what pages we scrape so we have a clean list of information.
I understand that on each page there are links that take you outside of the page you are and can end up elsewhere on the site.. but what im trying to do is only focus on a few pages within each category
# allowed_domain = ['dickssportinggoods.com']
# start_urls = ['https://www.dickssportinggoods.com/c/mens-top-trends-gear']
You can either base your spider on Spider class and code the navigation yourself, or base it on the CrawlSpider class and use the rules to control which pages get visited. From the information you provided it seems that the later approach is more appropriate for your requirement. Check out the example to see how the rules work.
I have used Beautiful Soup with great success when crawling single pages of a site, but I have a new project in which I have to check a large list of sites to see if they contain a mention or a link to my site. Therefore, I need to check the entire site of each site.
With BS I just don't know yet how to tell my scraper that it is done with a site, so I'm hitting recursion limits. Is that something Scrapy handles out of the box?
Scrapy uses a link follower to traverse through a site, until the list of available links is gone. Once a page is visited, it's removed from the list and Scrapy makes sure that link is not visited again.
Assuming all the websites pages have links on other pages, Scrapy would be able to visit every page of a website.
I've used Scrapy to traverse thousands of websites, mainly small businesses, and have had no problems. It's able to walk through the whole site.
I am afraid, no one knows when it crawled entire site. Can you say when you crawled entire Facebook, for example? That is because dynamically generated and cross-linked pages.
To set recursion limit is the only way to plan border after which you will stop your movement. But you can minimise the number of duplicate pages. You can use page link or page text's CRC as identifier and check if it is unique.
You can do something like this in your parse method:
if some_id not in set_of_all_page_ids:
set_of_all_page_ids.add(some_id)
yield scrapy.Request(response.urljoin(next_page_url))
The spider is to crawl info on a certain B2B website, and I want it to be a webserver, where user sumbit a url then the spider starts crawl.
The url seems like: apple.b2bxxx.com, which is a minisite on a B2B website, where all the products are listed. The "apple" might be different because different companies use different names for there minisite, and duplication is not allowed.
On the backend, it's MongoDB to store the data scraped.
What I have done, is that, I can collect info on the given url, but all data are stored in the same db.collection.
I know I can get parameters using "-a" for running scrapy, but how should I use it?
Should I change the pipelines.py or the spider python file?
Any suggestions?
I've got an answer.
for example:
using -s collection_name=abc for scrapy crawl command, then get the parameter in pipelines.py using param = settings.get('collection_name').
This is also found in stackoverflow, but can't remember which ticket.
Hope this would help some one facing same problem.
Can anyone help me to figure out how to scrawl file hosting website like filefactory.com? I don't want to download all the file hosted but just to index all available files with scrapy.
I have read the tutorial and docs with respect to spider class for scrapy. If I only give the website main page as the begining url I wouldn't not scrawl the whole site, because the scrawling depends on links but the begining page seems not point to any file pages. That's the problem I am thinking and any help would be appreciated!
I have two pieces of advise. The first is to ensure that you are using Scrapy correctly, and the second pertains to the best way to collect a larger sample of the URLs.
First:
Make sure you are using the CrawlSpider to crawl the website. This is what most people use when they want to take all the links on a crawled page and turn them into new requests for Scrapy to crawl. See http://doc.scrapy.org/en/latest/topics/spiders.html for more information on the crawl spider.
If you build the crawl spider correctly, it should be able to find, and then crawl, the majority all the links that each page has.
However, if the pages that contain the download links are not themselves linked to by pages that Scrapy is encountering, then there is no way that Scrapy can know about them.
One way to counter this would be to use multiple entry points on the website, in the areas you know that Scrapy is having difficulty finding. You can do this by putting multiple initial urls in the start_urls variable.
Secondly
Since it is likely that this is already what you were doing, here is my next bit of advice.
If you go onto Google, and type site:www.filefactory.com , you will see a link to every page that Google has indexed for www.filefactory.com. Make sure you also check site:filefactory.com because there are some canonicalization issues. Now, when I did this, I saw that there were around 600,000 pages indexed. What you should do is crawl Google, and collect all of these indexed urls first, and store them in a database. Then, use all of these to seed further searches on the FileFactory.com website.
Also
If you have a membership to Filefactory.com, you can also program scrapy to submit forms or sign in. Doing this might allow you even further access.