I am crawling online stores for price comparison. Most of the stores use dynamic URLs heavily, which causes my crawler to spend a lot of time on every store. Even though most of them have only 5-6k unique products, they have 300k or more unique URLs. Any idea how to get around this?
Thanks in advance!
If you are parsing product pages, the URLs usually contain some kind of product id.
Find a pattern to extract the product id from each URL, and use it to filter out URLs you have already visited.
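A minimal sketch of that idea, assuming the product id appears as a numeric id query parameter (adjust the regex to whatever pattern your target stores actually use):

import re

# Hypothetical pattern: the product id is carried in an "id" query parameter.
PRODUCT_ID_RE = re.compile(r"[?&]id=(\d+)")

seen_products = set()

def should_crawl(url):
    """Return True only the first time a given product id is seen."""
    match = PRODUCT_ID_RE.search(url)
    if match is None:
        return True  # not a product page, crawl normally
    product_id = match.group(1)
    if product_id in seen_products:
        return False
    seen_products.add(product_id)
    return True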
Related
I'm trying to scrape all pages with different ids from a site whose URLs are formatted like url.com/page?id=1, but there are millions of ids, so even at 1 request per second it will take weeks to get them all.
I am a total noob at this, so I was wondering whether there is a better way than going one by one, such as some kind of bulk request, or whether I should just increase the request rate to whatever I can get away with.
I am using requests and beautifulsoup in python to scrape the pages currently.
The grequests library is one possible approach you could take. The results are returned in the order they are obtained (which is not necessarily the same order as event_list).
import grequests

# Build the (unsent) requests up front, then let grequests send them concurrently.
event_list = [grequests.get(f"http://url.com/page?id={req_id}") for req_id in range(1, 100)]

# imap yields responses as they complete; size caps the number of concurrent requests.
for r in grequests.imap(event_list, size=5):
    print(r.request.url)
    print(r.text[:100])
    print()
Note: you are likely to be blocked if you attempt this on most sites. A better approach would be to check whether the website has a suitable API that returns the same information; you can often find one with your browser's network tools while using the site.
How would you go about crawling a website so that you can index every page when there is really only a search bar for navigation, like the following sites?
https://plejehjemsoversigten.dk/
https://findadentist.ada.org/
Do people just brute force the search queries, or is there a method that's usually implemented to index these kinds of websites?
There are several ways to approach this (although if the owner of a resource does not want it to be crawled, that can be really challenging):
Check the site's robots.txt. It might give you a clue about the site structure.
Check the site's sitemap.xml. It might list the URLs the site owner wants to be public (see the sketch after this list).
Use alternative indexers (like Google) with advanced syntax that narrows the search to a particular site (e.g. site:your.domain).
Exploit gaps in the site design. For example, the first site on your list does not enforce a minimum search-string length, so you can search for, say, 'a' and get 800 results containing 'a', then repeat with the remaining letters.
Once you have search results, also crawl all the links on the result item pages, since related pages are often listed there.
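A minimal sketch of the robots.txt / sitemap.xml checks above, assuming the site exposes a standard sitemap at /sitemap.xml (many sites do not, or only reference it from a Sitemap: line in robots.txt):

import requests
import xml.etree.ElementTree as ET

BASE = "https://plejehjemsoversigten.dk"  # example site from the question

# robots.txt often reveals disallowed paths and a Sitemap: line.
print(requests.get(f"{BASE}/robots.txt", timeout=10).text)

# If a sitemap exists, it lists the URLs the owner wants indexed.
resp = requests.get(f"{BASE}/sitemap.xml", timeout=10)
if resp.ok:
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    root = ET.fromstring(resp.content)
    for loc in root.findall(".//sm:loc", ns):
        print(loc.text)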
I have made a Scrapy web crawler which can scrape Amazon. It can scrape by searching for items using a list of keywords and scrape the data from the resulting pages.
However, I would like to scrape Amazon for large portion of its product data. I don't have a preferred list of keywords with which to query for items. Rather, I'd like to scrape the website evenly and collect X number of items which is representative of all products listed on Amazon.
Does anyone know how to scrape a website in this fashion? Thanks.
I'm putting my comment as an answer so that others looking for a similar solution can find it more easily.
One way to achieve this is to go through each category (furniture, clothes, technology, automotive, etc.) and collect a set number of items from each. Amazon has side/top bars with navigation links to the different categories, so you can let the crawler run through those.
The process would be as follows:
Follow the category URLs from the initial Amazon.com parse.
Use a different parse function for the callback, one that scrapes however many items you want from that category.
Ensure that the data is written to a file (it will probably be a lot of data).
However, such an approach would not be representative of the proportions of each category within Amazon's total product catalogue. Try looking for an "X number of results" label on each category page to compensate for that. Good luck with your project!
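A minimal Scrapy sketch of the process above, with hypothetical CSS selectors for the category navigation and product listings (Amazon's real markup changes often, so treat the selectors and the per-category cap as placeholders):

import scrapy

class AmazonCategorySpider(scrapy.Spider):
    name = "amazon_categories"
    start_urls = ["https://www.amazon.com/"]
    items_per_category = 100  # hypothetical per-category cap

    def parse(self, response):
        # Step 1: follow category URLs from the initial parse.
        for href in response.css("a.category-link::attr(href)").getall():
            yield response.follow(href, callback=self.parse_category)

    def parse_category(self, response):
        # Step 2: a different callback scrapes a capped number of items per category.
        for asin in response.css("div.product::attr(data-asin)").getall()[:self.items_per_category]:
            yield {"asin": asin, "category_url": response.url}

# Step 3: write the output to a file, e.g. `scrapy crawl amazon_categories -o items.json`.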
Is it possible to scrape links by the date associated with them? I'm trying to implement a daily-run spider that saves article information to a database, but I don't want to re-scrape articles that I have already scraped before, i.e. yesterday's articles. I ran across an SO post asking the same thing, and the scrapy-deltafetch plugin was suggested.
However, this relies on checking new requests against previously saved request fingerprints stored in a database. I'm assuming that if the daily scraping went on for a while, the database would need significant storage to hold all the request fingerprints that have already been scraped.
So given a list of articles on a site like cnn.com, I want to scrape all the articles that have been published today 6/14/17, but once the scraper hits later articles with a date listed as 6/13/17, I want to close the spider and stop scraping. Is this kind of approach possible with scrapy? Given a page of articles, will a CrawlSpider start at the top of the page and scrape articles in order?
Just new to Scrapy, so not sure what to try. Any help would be greatly appreciated, thank you!
You can use a custom deltafetch_key that checks the date and the title as the fingerprint.
from scrapy import Request
from w3lib.url import url_query_parameter

...

def parse(self, response):
    ...
    # Key each request on the product id in its URL; scrapy-deltafetch skips
    # requests whose deltafetch_key has already been seen on a previous run.
    for product_url in response.css('a.product_listing::attr(href)').getall():
        yield Request(
            response.urljoin(product_url),
            meta={'deltafetch_key': url_query_parameter(product_url, 'id')},
            callback=self.parse_product_page,
        )
    ...
I compose a date using datetime.strptime(Item['dateinfo'], "%b-%d-%Y") from information cobbled together on the item of interest.
After that I just check it against a configured age limit in my settings, which can be overridden per invocation. You can raise a CloseSpider exception when you find an article that is too old, or you can set a finished flag and act on that in any of your other code.
No need to remember anything. I use this on a spider that I run daily, and I simply set a 24-hour age limit.
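A minimal sketch of that approach, assuming a hypothetical AGE_LIMIT_HOURS setting, a hypothetical extract_item helper, and an item field dateinfo in the "%b-%d-%Y" format described above:

from datetime import datetime, timedelta

import scrapy
from scrapy.exceptions import CloseSpider

class NewsSpider(scrapy.Spider):
    name = "news"  # hypothetical spider name

    def parse_article(self, response):
        item = self.extract_item(response)  # hypothetical helper that fills item['dateinfo']
        published = datetime.strptime(item['dateinfo'], "%b-%d-%Y")

        # Close the whole crawl once articles are older than the configured limit.
        age_limit = timedelta(hours=self.settings.getint('AGE_LIMIT_HOURS', 24))
        if datetime.now() - published > age_limit:
            raise CloseSpider('reached articles older than the age limit')

        yield item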
I have a list of over 1500 URLs of news media sites in India. I am interested in compiling some statistics about them as part of my college project.
Long story short, I want to know which of these websites link to their Facebook accounts on their main web page. Doing this by hand would be a tedious task (I have done 25% of them so far), so I have been researching whether these websites can be scraped with a program. I have seen scrapers on ScraperWiki as well as the ImportXML function in Google Docs, but so far I have not had much success with either.
I have tried the following function in Google Docs for a given site:
=ImportXML(A1, "//a[contains(@href, 'www.facebook.com')]")
Overall, I would like to ask whether it is even possible (and how) to scan a given website (or a list of them) for just one specific kind of href link when the structure of each website differs significantly.
Thanks in advance for any help regarding this matter.
Mark
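A minimal Python sketch of one way to do this with requests and BeautifulSoup (both already used elsewhere in this thread), assuming the 1500 URLs live in a plain-text file called sites.txt, one per line:

import requests
from bs4 import BeautifulSoup

with open("sites.txt") as f:
    sites = [line.strip() for line in f if line.strip()]

for site in sites:
    try:
        resp = requests.get(site, timeout=10)
    except requests.RequestException:
        print(site, "ERROR")
        continue
    soup = BeautifulSoup(resp.text, "html.parser")
    # Any anchor whose href mentions facebook.com counts as a Facebook link.
    has_facebook = any(
        "facebook.com" in (a.get("href") or "") for a in soup.find_all("a")
    )
    print(site, "FACEBOOK" if has_facebook else "no facebook link found")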