The spider is supposed to crawl information from a certain B2B website, and I want it to run as a web server: a user submits a URL, and then the spider starts crawling.
The URL looks like apple.b2bxxx.com, a minisite on the B2B website where all of a company's products are listed. The "apple" part varies, because different companies use different names for their minisites, and duplication is not allowed.
On the backend, MongoDB stores the scraped data.
What I have done so far: I can collect info from a given URL, but all the data is stored in the same db.collection.
I know I can pass parameters using "-a" when running Scrapy, but how should I use it?
Should I change pipelines.py or the spider's Python file?
Any suggestions?
I've got an answer.
For example:
Pass -s collection_name=abc to the scrapy crawl command, then read the parameter in pipelines.py with param = settings.get('collection_name').
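A minimal sketch of this in practice (the spider name, database name, and MongoDB URI below are placeholders, not from the original setup):

scrapy crawl myspider -s collection_name=apple

# pipelines.py -- a sketch; 'items_db' and the connection URI are assumptions
import pymongo

class MongoPipeline(object):
    def open_spider(self, spider):
        # Read the collection name passed on the command line via -s
        name = spider.settings.get('collection_name', 'default')
        self.client = pymongo.MongoClient('mongodb://localhost:27017')
        self.collection = self.client['items_db'][name]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Each minisite's products land in their own collection
        self.collection.insert_one(dict(item))
        return item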
I also found this on Stack Overflow, but I can't remember which question. Hope this helps someone facing the same problem.
I am trying to scrape a website that has some dropdowns, so I planned to use the Scrapy framework with Scrapy-Selenium (more here) to click through the dropdowns (a nested for loop), then capture the URL with the code below and pass it to the parse() function to look for the needed data and scrape it into a MySQL database.
now_url = self.driver.current_url
print('Current URL is: ' + now_url)
yield Request(now_url, callback=self.parse)

def parse(self, response):
    # This function loops through each page and captures the data sets
    # available on each page of medicine.
    # Create items to be stored via the items.py file with this crawler:
    items = GrxItem()
    # Loop over the items on each medicine page (from a-z), add them to items,
    # and send them through the pipelines to the SQL DB.
But the logic does not seem to work as expected. Any insight into how to deal with this is appreciated. The full code is here.
EDIT:
I tried using SeleniumRequest() as well, but that does not seem to work either.
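For reference, a minimal sketch of one thing worth checking (this is an assumption about the cause, not a confirmed fix): after clicking a dropdown, the browser URL often stays the same, and Scrapy's dupefilter silently drops a second request for an already-seen URL unless it is told not to.

# A sketch, not the asker's code; dont_filter=True is the assumption being tested
from scrapy import Request

now_url = self.driver.current_url
print('Current URL is: ' + now_url)
yield Request(now_url, callback=self.parse, dont_filter=True)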
I have a search engine in production serving around 700,000 URLs. The crawling is done with Scrapy, and all spiders are scheduled with DeltaFetch in order to fetch only new links daily.
The difficulty I'm facing is handling broken links.
I have a hard time finding a good way to periodically scan for, and remove, broken links. I was thinking about a few solutions:
Developing a Python script that uses requests.get to check every single URL, and deletes anything that returns a 404 status.
Using a third-party tool like https://github.com/linkchecker/linkchecker, but I'm not sure it's the best solution since I only need to check a list of URLs, not a website.
Using a Scrapy spider to scrape this URL list and return any URLs that are erroring out. I'm not really confident in that one, since I know Scrapy tends to time out when scanning a lot of URLs on different domains; this is why I rely so much on DeltaFetch.
Do you have any recommendations / best practices to solve this problem?
Thanks a lot.
Edit: I forgot to mention one detail: I'm looking to "validate" those 700k URLs, not to crawl them. Actually, those 700k URLs are the crawling result of around 2,500k domains.
You could write a small script that just checks the returned HTTP status, like so:

import urllib.error
import urllib.request

for url in urls:
    try:
        urllib.request.urlopen(url)
    except urllib.error.HTTPError as e:
        # Do something when the request fails
        print(e.code)
This would be the same as your first point. You could also run the checks concurrently to optimize the time it takes to get through your 700k links.
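A sketch of the concurrent variant using a thread pool (Python 3; the worker count and timeout are arbitrary choices):

import urllib.error
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def check(url):
    # Return the URL together with its failure status (None means it opened fine)
    try:
        urllib.request.urlopen(url, timeout=10)
        return url, None
    except urllib.error.HTTPError as e:
        return url, e.code
    except urllib.error.URLError:
        return url, 'unreachable'

with ThreadPoolExecutor(max_workers=50) as pool:
    for url, status in pool.map(check, urls):
        if status == 404:
            print('broken:', url)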
I would suggest using Scrapy, since you're already looking up each URL with this tool and thus know which URLs error out. This means you don't have to check the URLs a second time.
I'd go about it like this:
Save every URL that errors out in a separate list/map with a counter (which is stored between runs).
Every time a URL errors out, increment its counter. If it doesn't, decrement the counter.
After running the Scrapy script, check this list/map for URLs with a high enough counter, say more than 10 faults, and remove them, or store them in a separate list of links to check at a later time (as a safeguard in case you accidentally removed a working URL because a server was down too long). A sketch of this idea follows below.
Since your third bullet is concerned about Scrapy being shaky with URL results, the same can be said for websites in general: if a site errors out on one try, it might not mean a broken link.
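A sketch of the counter idea (the file name and the threshold of 10 are arbitrary; errback and the closed() hook are standard Scrapy):

import json
import scrapy

class LinkCheckSpider(scrapy.Spider):
    name = 'linkcheck'

    def start_requests(self):
        try:
            self.faults = json.load(open('fault_counts.json'))  # persisted between runs
        except FileNotFoundError:
            self.faults = {}
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse, errback=self.on_error)

    def parse(self, response):
        # Success: decrement the counter, but never below zero
        self.faults[response.url] = max(0, self.faults.get(response.url, 0) - 1)

    def on_error(self, failure):
        url = failure.request.url
        self.faults[url] = self.faults.get(url, 0) + 1

    def closed(self, reason):
        json.dump(self.faults, open('fault_counts.json', 'w'))
        suspects = [u for u, n in self.faults.items() if n > 10]
        self.logger.info('candidates for removal: %s', suspects)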
If you go for creating a script of your own, check this solution.
In addition, an optimization I suggest is to build a hierarchy in your URL repository: if you get a 404 from a parent URL, you can avoid checking all its child URLs.
My first thought is to request the URLs with HEAD instead of any other method, so the server does not have to send a response body.
Spawn multiple spiders at once, assigning each a batch like LIMIT 0,10000 and LIMIT 10000,10000.
In your data pipeline, instead of running a MySQL DELETE query each time the scraper finds a 404 status, run a bulk DELETE FROM table WHERE link IN (link1, link2, ...) query.
I am sure you have an INDEX on the link column; if not, add it.
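A sketch tying those points together (the table and column names are placeholders, and the database call is left as a comment):

import scrapy

class HeadCheckSpider(scrapy.Spider):
    name = 'headcheck'
    handle_httpstatus_list = [404]  # let 404 responses reach the callback

    def start_requests(self):
        self.broken = []
        for url in self.url_batch:  # e.g. one LIMIT 0,10000 slice per spider
            yield scrapy.Request(url, method='HEAD', callback=self.parse)

    def parse(self, response):
        if response.status == 404:
            self.broken.append(response.url)

    def closed(self, reason):
        if self.broken:
            # One bulk DELETE instead of one query per broken link
            placeholders = ','.join(['%s'] * len(self.broken))
            query = 'DELETE FROM links WHERE link IN (%s)' % placeholders
            # cursor.execute(query, self.broken)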
Is it possible to scrape links by the date associated with them? I'm trying to implement a daily-run spider that saves article information to a database, but I don't want to re-scrape articles that I have already scraped before, i.e. yesterday's articles. I ran across this SO post asking the same thing, and the scrapy-deltafetch plugin was suggested.
However, this relies on checking new requests against previously saved request fingerprints stored in a database. I'm assuming that if the daily scraping went on for a while, the database would need significant memory overhead to store the request fingerprints that have already been scraped.
So given a list of articles on a site like cnn.com, I want to scrape all the articles published today, 6/14/17, but once the scraper hits later articles with a date listed as 6/13/17, I want to close the spider and stop scraping. Is this kind of approach possible with Scrapy? Given a page of articles, will a CrawlSpider start at the top of the page and scrape articles in order?
I'm just new to Scrapy, so I'm not sure what to try. Any help would be greatly appreciated, thank you!
You can use a custom deltafetch_key that checks the date and the title as the fingerprint.
from w3lib.url import url_query_parameter

...

def parse(self, response):
    ...
    for product_url in response.css('a.product_listing::attr(href)').getall():
        yield Request(
            product_url,
            # use the product id from the URL as the deltafetch fingerprint
            meta={'deltafetch_key': url_query_parameter(product_url, 'id')},
            callback=self.parse_product_page,
        )
    ...
I compose a date using datetime.strptime(Item['dateinfo'], "%b-%d-%Y") from information cobbled together on the item of interest.
After that I just check it against a configured age in my settings, which can be overridden per invocation. You can raise a CloseSpider exception when you find an age that is too old, or you can set a finished flag and act on that in any of your other code.
No need to remember anything. I use this on a spider that I run daily, and I simply set a 24-hour age limit.
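A sketch of that approach (MAX_AGE_HOURS, the 'dateinfo' field, and the build_item helper are placeholders; CloseSpider is standard Scrapy):

from datetime import datetime, timedelta
from scrapy.exceptions import CloseSpider

def parse_article(self, response):
    item = self.build_item(response)  # hypothetical helper filling item['dateinfo']
    published = datetime.strptime(item['dateinfo'], '%b-%d-%Y')
    max_age = timedelta(hours=self.settings.getint('MAX_AGE_HOURS', 24))
    if datetime.now() - published > max_age:
        # Articles are ordered newest-first, so anything past this is too old
        raise CloseSpider('reached articles older than the age limit')
    yield item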
Can anyone help me figure out how to crawl a file-hosting website like filefactory.com? I don't want to download all the files hosted, just to index all the available files with Scrapy.
I have read the tutorial and the docs with respect to Scrapy's spider class. If I only give the website's main page as the beginning URL, it won't crawl the whole site, because the crawling depends on links, and the beginning page doesn't seem to point to any file pages. That's the problem I'm thinking about, and any help would be appreciated!
I have two pieces of advice. The first is to ensure that you are using Scrapy correctly, and the second pertains to the best way to collect a larger sample of the URLs.
First:
Make sure you are using the CrawlSpider to crawl the website. This is what most people use when they want to take all the links on a crawled page and turn them into new requests for Scrapy to crawl. See http://doc.scrapy.org/en/latest/topics/spiders.html for more information on the crawl spider.
If you build the crawl spider correctly, it should be able to find, and then crawl, the majority of the links each page has.
However, if the pages that contain the download links are not themselves linked to from pages that Scrapy encounters, then there is no way for Scrapy to know about them.
One way to counter this is to use multiple entry points on the website, in the areas you know Scrapy has difficulty finding. You can do this by putting multiple initial URLs in the start_urls variable, as in the sketch below.
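A minimal sketch of such a crawl spider (the extra start URLs are placeholders for whatever entry points you choose):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class FileIndexSpider(CrawlSpider):
    name = 'fileindex'
    allowed_domains = ['filefactory.com']
    start_urls = [
        'http://www.filefactory.com/',
        # extra entry points in areas the crawler has trouble reaching:
        'http://www.filefactory.com/some/section/',
    ]

    rules = (
        # Follow every internal link and run parse_item on each page reached
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        yield {'url': response.url, 'title': response.css('title::text').get()}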
Second:
Since it is likely that this is already what you were doing, here is my next bit of advice.
If you go onto Google and type site:www.filefactory.com, you will see a link to every page Google has indexed for www.filefactory.com. Make sure you also check site:filefactory.com, because there are some canonicalization issues. When I did this, I saw that around 600,000 pages were indexed. What you should do is crawl Google and collect all of these indexed URLs first, storing them in a database. Then use all of these to seed further searches on the FileFactory.com website.
Also
If you have a membership to FileFactory.com, you can also program Scrapy to submit forms or sign in. Doing this might allow you even further access.
I am using the urllib library to fetch pages. Typically I have the top-level domain name, and I wish to extract some information from EVERY page within that domain. Thus, if I have xyz.com, I'd like my code to fetch the data from xyz.com/about, etc. Here's what I am using:
import urllib, re

htmlFile = urllib.urlopen("http://www.xyz.com/" + r"(.*)")
html = htmlFile.read()
...
This does not do the trick for me, though. Any ideas are appreciated.
Thanks.
-T
I don't know why you would expect domain.com/(.*) to work. You need a list of all the pages (dynamic or static) within that domain; your Python program cannot know that automatically. You must obtain this knowledge elsewhere, either by following links or by looking at the website's sitemap.
As a footnote, scraping is a slightly shady business. Always make sure, no matter what method you employ, that you are not violating any terms and conditions.
You are trying to use a regular expression on the web server. It turns out web servers don't actually support this kind of format, so it's failing.
To do what you're trying to do, you need to implement a spider: a program that downloads a page, finds all the links within it, and decides which of them to follow. It then downloads each of those pages and repeats.
Some things to watch out for: loops, multiple links that end up pointing at the same page, links going outside of the domain, and getting banned from the web server for spamming it with thousands of requests.
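A minimal sketch of such a spider, using only the standard library (the page limit is an arbitrary safety cap):

import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkParser(HTMLParser):
    # Collect the href of every <a> tag on a page
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)

def crawl(start_url, limit=100):
    domain = urlparse(start_url).netloc
    seen, queue = set(), [start_url]
    while queue and len(seen) < limit:
        url = queue.pop(0)
        if url in seen:
            continue  # avoid loops and duplicate pages
        seen.add(url)
        try:
            page = urllib.request.urlopen(url, timeout=10)
            html = page.read().decode('utf-8', 'replace')
        except Exception:
            continue  # broken link or server error: skip it
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if urlparse(absolute).netloc == domain:  # stay inside the domain
                queue.append(absolute)
    return seen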
In addition to @zigdon's answer, I recommend you take a look at the Scrapy framework.
CrawlSpider will help you implement crawling quite easily.
Scrapy has this functionality built in; there is no need to fetch links recursively yourself. It asynchronously handles all the heavy lifting for you. You just specify your domain and search terms, and how deep you want it to search, e.g. the whole site.
http://doc.scrapy.org/en/latest/index.html