I am scraping 23770 webpages with a pretty simple web scraper using Scrapy. I am quite new to Scrapy and even Python, but I managed to write a spider that does the job. It is, however, really slow (it takes approx. 28 hours to crawl the 23770 pages).
I have looked at the Scrapy website, the mailing lists, and Stack Overflow, but I can't seem to find generic recommendations for writing fast crawlers that are understandable for beginners. Maybe my problem is not the spider itself but the way I run it. All suggestions welcome!
I have listed my code below, if it's needed.
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field
import re

class Sale(Item):
    Adresse = Field()
    Pris = Field()
    Salgsdato = Field()
    SalgsType = Field()
    KvmPris = Field()
    Rum = Field()
    Postnummer = Field()
    Boligtype = Field()
    Kvm = Field()
    Bygget = Field()

class HouseSpider(BaseSpider):
    name = 'House'
    allowed_domains = ["http://boliga.dk/"]
    start_urls = ['http://www.boliga.dk/salg/resultater?so=1&type=Villa&type=Ejerlejlighed&type=R%%C3%%A6kkehus&kom=&amt=&fraPostnr=&tilPostnr=&iPostnr=&gade=&min=&max=&byggetMin=&byggetMax=&minRooms=&maxRooms=&minSize=&maxSize=&minsaledate=1992&maxsaledate=today&kode=&p=%d' %n for n in xrange(1, 23770, 1)]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select("id('searchresult')/tr")
        items = []
        for site in sites:
            item = Sale()
            item['Adresse'] = site.select("td[1]/a[1]/text()").extract()
            item['Pris'] = site.select("td[2]/text()").extract()
            item['Salgsdato'] = site.select("td[3]/text()").extract()
            Temp = site.select("td[4]/text()").extract()
            Temp = Temp[0]
            m = re.search('\r\n\t\t\t\t\t(.+?)\r\n\t\t\t\t', Temp)
            if m:
                found = m.group(1)
                item['SalgsType'] = found
            else:
                item['SalgsType'] = Temp
            item['KvmPris'] = site.select("td[5]/text()").extract()
            item['Rum'] = site.select("td[6]/text()").extract()
            item['Postnummer'] = site.select("td[7]/text()").extract()
            item['Boligtype'] = site.select("td[8]/text()").extract()
            item['Kvm'] = site.select("td[9]/text()").extract()
            item['Bygget'] = site.select("td[10]/text()").extract()
            items.append(item)
        return items
Thanks!
Here's a collection of things to try:
use latest scrapy version (if not using already)
check if non-standard middlewares are used
try to increase CONCURRENT_REQUESTS_PER_DOMAIN, CONCURRENT_REQUESTS settings (docs)
turn off logging LOG_ENABLED = False (docs)
try yielding an item in a loop instead of collecting items into the items list and returning them
use local cache DNS (see this thread)
check whether the site throttles downloads and limits your download speed (see this thread)
log CPU and memory usage during the spider run and see if there are any problems there
try running the same spider under the scrapyd service
see if grequests + lxml will perform better (ask if you need any help with implementing this solution)
try running Scrapy on pypy, see Running Scrapy on PyPy
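As a rough illustration of the concurrency and logging tips in the list above, a project's settings.py might contain something like the following. The setting names are real Scrapy settings, but the values are only illustrative assumptions and should be tuned to what the target site tolerates.

```python
# settings.py (illustrative values, not recommendations for any specific site)

# Maximum number of requests Scrapy performs concurrently overall.
CONCURRENT_REQUESTS = 32

# Maximum number of concurrent requests to any single domain.
CONCURRENT_REQUESTS_PER_DOMAIN = 16

# Turn logging off entirely; this removes logging overhead
# but also hides errors, so use it with care.
LOG_ENABLED = False
```

Raising concurrency only helps if the site itself is not the bottleneck, so it is worth measuring before and after the change.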
Hope that helps.
Looking at your code, I'd say most of that time is spent on network requests rather than on processing the responses. All of the tips @alecxe provides in his answer apply, but I'd also suggest the HTTPCACHE_ENABLED setting: it caches responses so the same request is not downloaded a second time. That helps on subsequent crawls and even with offline development. See more info in the docs: http://doc.scrapy.org/en/latest/topics/downloader-middleware.html#module-scrapy.contrib.downloadermiddleware.httpcache
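A minimal sketch of enabling the HTTP cache in settings.py could look like this. The setting names come from Scrapy's HTTP cache middleware; the chosen values are assumptions for a development setup.

```python
# settings.py (sketch of an HTTP cache configuration for development)

# Store every downloaded response on disk and reuse it on later runs.
HTTPCACHE_ENABLED = True

# 0 means cached responses never expire.
HTTPCACHE_EXPIRATION_SECS = 0

# Directory (inside the project's .scrapy data dir) where responses are stored.
HTTPCACHE_DIR = 'httpcache'
```

Note that the cache does not speed up the first crawl; it only avoids repeating downloads afterwards.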
One workaround to speed up your Scrapy crawl is to configure your start_urls appropriately.
For example, if the target data is at http://apps.webofknowledge.com/doc=1, where the doc number ranges from 1 to 1000, you can configure your start_urls as follows:
start_urls = [
    "http://apps.webofknowledge.com/doc=250",
    "http://apps.webofknowledge.com/doc=750",
]
This way, the crawl fans out from 250 (towards 251 and 249) and from 750 (towards 751 and 749) simultaneously, so four chains of requests run in parallel and you get about 4 times the speed of start_urls = ["http://apps.webofknowledge.com/doc=1"].
I also work on web scraping, using optimized C#, and it ends up CPU bound, so I am switching to C.
Parsing HTML blows the CPU data cache, and I'm pretty sure your CPU is not using SSE 4.2 at all, as you can only access this feature using C/C++.
If you do the math, you quickly become compute bound rather than memory bound.
Related
I am working on a small crawler using Scrapy and Python on this website: https://www.theverge.com/reviews. From there I am trying to extract the reviews based on the rules I have set, which should match links that meet this criterion:
example: https://www.theverge.com/22274747/tern-hsd-p9-ebike-review-electric-cargo-bike-price-specs
I want to extract the URL of the review page, the title of the page, the name of the reviewer, and the link to their profile. However, I assume there is something wrong either with my code or with the way my files are organized, because I get this error when I try to run it:
runspider: error: Unable to load 'spiders/vergespider.py': No module named 'oblig3.oblig3'
My folders look like this.
My intended results should look something like this, visiting up to 20 pages. I don't quite understand how to set that limit through the Scrapy settings, but that is another problem.
authorlink,authorname,title,url
"https://www.theverge.com/authors/cameron-faulkner,https://www.twitter.com/camfaulkner",Cameron
Faulkner,"Gigabyte’s Aorus 15G is great at gaming, but not much
else",https://www.theverge.com/22299226/gigabyte-aorus-15g-review-gaming-laptop-price-specs-features
So my question is: what could be causing the error I am getting, and why am I not getting any CSV output from this code? I am fairly new to Python and Scrapy, so any tips or improvements to the code are appreciated. I would like to keep the solutions within Scrapy and Python, as those are the things I am trying to learn at the moment.
Edit:
This is the command I use to run the code: scrapy runspider spiders/vergespider.py -o vergetest.csv -t csv. And this is what I have coded so far.
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from oblig3.items import VergeReview

class VergeSpider(CrawlSpider):
    name = 'verge'
    allowed_domains = ['theverge.com']
    start_urls = ['https://www.theverge.com/reviews']
    rules = [
        Rule(LinkExtractor(allow=r'^(https://www.theverge.com/)(/d+)/([%5E/]+$)%27'),
             callback='parse_items', follow=True),
        Rule(LinkExtractor(allow=r'.*'),
             callback='parse_items', cb_kwargs={'is_verge': False})
    ]

    def parse(self, response, is_verge):
        if is_verge:
            verge = VergeReview()
            verge['url'] = response.url
            verge['title'] = response.xpath("//h1/text()").extract_first()
            verge['authorname'] = response.xpath("//span[@class='c-byline__author-name']/text()").extract()
            verge['authorlink'] = response.xpath("//*/span[@class = 'c-byline__item'][1]/a/@href").extract()
            yield verge
        else:
            # Do something else
            pass
My items file
import scrapy

class VergeReview(scrapy.Item):
    url = scrapy.Field()
    title = scrapy.Field()
    authorname = scrapy.Field()
    authorlink = scrapy.Field()
My settings file is unchanged, though I should implement CLOSESPIDER_PAGECOUNT = 20, but I don't know how.
The error you have is:
runspider: error ..... No module named 'oblig3.oblig3'
What I can see from your screenprint is that oblig3 is the name of your project.
This is a common error when you try to run your spider using:
scrapy runspider spider_file.py
If you are running your spider this way, you need to change the way you are running the spider:
First, make sure that you are in the directory where scrapy.cfg is located
then run
scrapy list
This should give you a list of all the spiders it found.
After that, you should use this command to run your spider.
scrapy crawl <spidername>
If this does not solve your problem, you need to share the code and share the details about how you are running your spider.
I tried to get all the products from this website but somehow I don't think I chose the best method because some of them are missing and I can't figure out why. It's not the first time when I get stuck when it comes to this.
The way I'm doing it now is like this:
go to the index page of the website
get all the categories from there (A-Z 0-9)
access each of the above category and recursively go through all the subcategories from there until I reach the products page
when I reach the products page, check if the product has more SKUs. If it has, get the links. Otherwise, that's the only SKU.
Now, the below code works but it just doesn't get all the products and I don't see any reasons for why it'd skip some. Maybe the way I approached everything is wrong.
from lxml import html
from random import randint
from string import ascii_uppercase
from time import sleep
from requests import Session

INDEX_PAGE = 'https://www.richelieu.com/us/en/index'
session_ = Session()

def retry(link):
    wait = randint(0, 10)
    try:
        return session_.get(link).text
    except Exception as e:
        print('Retrying product page in {} seconds because: {}'.format(wait, e))
        sleep(wait)
        return retry(link)

def get_category_sections():
    au = list(ascii_uppercase)
    au.remove('Q')
    au.remove('Y')
    au.append('0-9')
    return au

def get_categories():
    html_ = retry(INDEX_PAGE)
    page = html.fromstring(html_)
    sections = get_category_sections()
    for section in sections:
        for link in page.xpath("//div[@id='index-{}']//li/a/@href".format(section)):
            yield '{}?imgMode=m&sort=&nbPerPage=200'.format(link)

def dig_up_products(url):
    html_ = retry(url)
    page = html.fromstring(html_)
    for link in page.xpath(
        '//h2[contains(., "CATEGORIES")]/following-sibling::*[@id="carouselSegment2b"]//li//a/@href'
    ):
        yield from dig_up_products(link)
    for link in page.xpath('//ul[@id="prodResult"]/li//div[@class="imgWrapper"]/a/@href'):
        yield link
    for link in page.xpath('//*[@id="ts_resultList"]/div/nav/ul/li[last()]/a/@href'):
        if link != '#':
            yield from dig_up_products(link)

def check_if_more_products(tree):
    more_prods = [
        all_prod
        for all_prod in tree.xpath("//div[@id='pm2_prodTableForm']//tbody/tr/td[1]//a/@href")
    ]
    if not more_prods:
        return False
    return more_prods

def main():
    for category_link in get_categories():
        for product_link in dig_up_products(category_link):
            product_page = retry(product_link)
            product_tree = html.fromstring(product_page)
            more_products = check_if_more_products(product_tree)
            if not more_products:
                print(product_link)
            else:
                for sku_product_link in more_products:
                    print(sku_product_link)

if __name__ == '__main__':
    main()
Now, the question might be too generic but I wonder if there's a rule of thumb to follow when someone wants to get all the data (products, in this case) from a website. Could someone please walk me through the whole process of discovering what's the best way to approach a scenario like this?
If your ultimate goal is to scrape the entire product listing for each category, it may make sense to target the full product listings for each category on the index page. This program uses BeautifulSoup to find each category on the index page and then iterates over each product page under each category. The final output is a list of namedtuples, each storing the category name, the current page link, and the full product titles for that link:
# Note: this is Python 2 code (urllib.urlopen and the print statement).
url = "https://www.richelieu.com/us/en/index"
import urllib
import re
from bs4 import BeautifulSoup as soup
from collections import namedtuple
import itertools

s = soup(str(urllib.urlopen(url).read()), 'lxml')
blocks = s.find_all('div', {'id': re.compile('index\-[A-Z]')})
results_data = {[c.text for c in i.find_all('h2', {'class':'h1'})][0]:[b['href'] for b in i.find_all('a', href=True)] for i in blocks}
final_data = []
category = namedtuple('category', 'abbr, link, products')
for category1, links in results_data.items():
    for link in links:
        page_data = str(urllib.urlopen(link).read())
        print "link: ", link
        page_links = re.findall(';page\=(.*?)#results">(.*?)</a>', page_data)
        if not page_links:
            # Single-page category: collect the titles directly.
            final_page_data = soup(page_data, 'lxml')
            final_titles = [i.text for i in final_page_data.find_all('h3', {'class':'itemHeading'})]
            new_category = category(category1, link, final_titles)
            final_data.append(new_category)
        else:
            # Paginated category: visit every page number found in the pager.
            page_numbers = set(itertools.chain(*list(map(list, page_links))))
            full_page_links = ["{}?imgMode=m&sort=&nbPerPage=48&page={}#results".format(link, num) for num in page_numbers]
            for page_result in full_page_links:
                new_page_data = soup(str(urllib.urlopen(page_result).read()), 'lxml')
                final_titles = [i.text for i in new_page_data.find_all('h3', {'class':'itemHeading'})]
                new_category = category(category1, link, final_titles)
                final_data.append(new_category)
print final_data
The output will garner results in the format:
[category(abbr=u'A', link='https://www.richelieu.com/us/en/category/tools-and-shop-supplies/workshop-accessories/tool-accessories/sander-accessories/1058847', products=[u'Replacement Plate for MKT9924DB Belt Sander', u'Non-Grip Vacuum Pads', u'Sandpaper Belt 2\xbd " x 14" for Compact Belt Sander PC371 or PC371K', u'Stick-on Non-Vacuum Pads', u'5" Non-Vacuum Disc Pad Hook-Face', u'Sanding Filter Bag', u'Grip-on Vacuum Pads', u'Plates for Non-Vacuum (Grip-On) Dynabug II Disc Pads - 7.62 cm x 10.79 cm (3" x 4-1/4")', u'4" Abrasive for Finishing Tool', u'Sander Backing Pad for RO 150 Sander', u'StickFix Sander Pad for ETS 125 Sander', u'Sub-Base Pad for Stocked Sanders', u'(5") Non-Vacuum Disc Pad Vinyl-Face', u'Replacement Sub-Base Pads for Stocked Sanders', u"5'' Multi-Hole Non-Vaccum Pad", u'Sander Backing Pad for RO 90 DX Sander', u'Converting Sanding Pad', u'Stick-On Vacuum Pads', u'Replacement "Stik It" Sub Base', u'Drum Sander/Planer Sandpaper'])....
To access each attribute, call like so:
categories = [i.abbr for i in final_data]
links = [i.link for i in final_data]
products = [i.products for i in final_data]
I believe the benefit of using BeautifulSoup in this instance is that it provides a higher level of control over the scraping and is easily modified. For instance, should the OP change his mind about which facets of the product/index he would like to scrape, only simple changes to the find_all parameters should be needed, as the general structure of the code above centers on each product category from the index page.
First of all, there is no definite answer to your generic question of how one would know whether the data already scraped is all the available data. This is at least website-specific and is rarely actually revealed. Plus, the data itself might be highly dynamic. On this website, though, you can more or less use the product counters to verify the number of results found.
Your best bet here would be to debug: use the logging module to print out information while scraping, then analyze the logs and look for why a product was missing and what caused that.
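One way to apply the logging advice is to record, for every listing page, how many product links were extracted and compare that with the counter the page displays. The sketch below uses only the standard library; record_page and the expected_count argument are hypothetical names, and how the counter is read from the page is left out.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("scraper")

def record_page(url, product_links, expected_count=None):
    """Log how many product links a page yielded.

    Returns False (and logs a warning) when the count differs from the
    counter shown on the page, True otherwise.
    """
    found = len(product_links)
    if expected_count is not None and found != expected_count:
        logger.warning("%s: found %d products, expected %d",
                       url, found, expected_count)
        return False
    logger.info("%s: found %d products", url, found)
    return True
```

Searching the resulting log for warnings then points directly at the pages where products went missing.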
Some of the ideas I currently have:
could it be that the retry() is the problematic part - could it be that session_.get(link).text does not raise an error but does not contain the actual data in the response as well?
I think the way you extract category links is correct and I don't see you missing categories on the index page
the dig_up_products() is questionable: when you extract links to the subcategories, you have this carouselSegment2b id used in the XPath expression, but I see that on at least some of the pages (like this one) the id value is carouselSegment1b. In any case, I would probably do just //h2[contains(., "CATEGORIES")]/following-sibling::div//li//a/@href here
I also don't like that imgWrapper class used to find a product link (could be that products missing images are missed?). Why not just: //ul[@id="prodResult"]/li//a/@href - this would though bring in some duplicates which you can address separately. But, you can also look for the link in the "info" section of the product container: //ul[@id="prodResult"]/li//div[contains(@class, "infoBox")]//a/@href.
There can also be an anti-bot, anti-web-scraping strategy deployed that may temporarily ban your IP or/and User-Agent or even obfuscate the response. Check for that too.
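To address the concern about retry() in the ideas above (a response that raises no error but lacks the real data), a bounded retry that also validates the body could look roughly like this. It is a sketch, not a drop-in replacement: fetch and is_valid are hypothetical callables the caller supplies, for example a wrapper around session_.get(link).text and a check that the expected markup is present.

```python
import time

def fetch_with_retry(fetch, url, is_valid, max_attempts=5, wait_seconds=1.0):
    """Call fetch(url) until is_valid(body) passes or attempts run out."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            body = fetch(url)
            if is_valid(body):
                return body
            # The request succeeded but the page is not what we expected.
            last_error = ValueError("response failed validation")
        except Exception as exc:  # network errors, timeouts, ...
            last_error = exc
        if attempt < max_attempts:
            time.sleep(wait_seconds)
    raise RuntimeError(
        "giving up on {} after {} attempts".format(url, max_attempts)
    ) from last_error
```

Unlike the recursive version in the question, this gives up after a fixed number of attempts, so a permanently broken page surfaces as an error instead of being retried forever or silently skipped.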
As pointed out by @mzjn and @alecxe, some websites employ anti-scraping measures. To avoid being blocked, scrapers try to mimic a human visitor.
One particular way for websites to detect a scraper is to measure the time between subsequent page requests, which is why scrapers typically keep a (random) delay between requests.
Besides, hammering a web server that is not yours without giving it some slack is not considered good netiquette.
From Scrapy's documentation:
RANDOMIZE_DOWNLOAD_DELAY
Default: True
If enabled, Scrapy will wait a random amount of time (between 0.5 * DOWNLOAD_DELAY and 1.5 * DOWNLOAD_DELAY) while fetching requests from the same website.
This randomization decreases the chance of the crawler being detected (and subsequently blocked) by sites which analyze requests looking for statistically significant similarities in the time between their requests.
The randomization policy is the same used by wget --random-wait option.
If DOWNLOAD_DELAY is zero (default) this option has no effect.
Oh, and make sure the User-Agent string in your HTTP request resembles that of an ordinary web browser.
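Outside Scrapy, the same randomization rule quoted above (a wait between 0.5 and 1.5 times the base delay) can be sketched with the standard library. The function name randomized_delay is made up for this example.

```python
import random

def randomized_delay(base_delay, rng=random):
    """Return a wait time between 0.5 * base_delay and 1.5 * base_delay.

    A base_delay of zero (or less) disables the delay, mirroring the
    behaviour described for DOWNLOAD_DELAY = 0.
    """
    if base_delay <= 0:
        return 0.0
    return rng.uniform(0.5 * base_delay, 1.5 * base_delay)
```

A requests-based scraper could then call time.sleep(randomized_delay(2.0)) before each page request.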
Further reading:
https://exposingtheinvisible.org/guides/scraping/
Sites not accepting wget user agent header
I have nearly 2500 unique links, on which I want to run BeautifulSoup to gather some text captured in paragraphs on each of the 2500 pages. I could create variables for each link, but having 2500 is obviously not the most efficient course of action. The links are contained in a list like the following:
linkslist = ["http://www.website.com/category/item1","http://www.website.com/category/item2","http://www.website.com/category/item3", ...]
Should I just write a for loop like the following?
for link in linkslist:
    opened_url = urllib2.urlopen(link).read()
    soup = BeautifulSoup(opened_url)
    ...
I'm looking for any constructive criticism. Thanks!
This is a good use case for Scrapy - a popular web-scraping framework based on Twisted:
Scrapy is written with Twisted, a popular event-driven networking
framework for Python. Thus, it’s implemented using a non-blocking (aka
asynchronous) code for concurrency.
Set the start_urls property of your spider and parse the page inside the parse() callback:
import scrapy

class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = ["http://www.website.com/category/item1","http://www.website.com/category/item2","http://www.website.com/category/item3", ...]
    allowed_domains = ["website.com"]

    def parse(self, response):
        # Called once per downloaded page; extract what you need here.
        print(response.xpath("//title/text()").extract())
How about writing a function that would treat each URL separately?
def processURL(url):
    # Your code here
    pass

# list() forces evaluation on Python 3, where map() is lazy
results = list(map(processURL, linkslist))
This will run your function on each url in your list. If you want to speed things up, this is easy to run in parallel:
from multiprocessing import Pool
list(Pool(processes = 10).map(processURL, linkslist))
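Since fetching pages is mostly waiting on the network, a thread pool is a common alternative to the process pool above. This sketch swaps in concurrent.futures.ThreadPoolExecutor from the standard library; process_all is a hypothetical helper, and the assumption is that process_url is I/O-bound and safe to call from several threads.

```python
from concurrent.futures import ThreadPoolExecutor

def process_all(urls, process_url, max_workers=10):
    """Apply process_url to every URL using a pool of worker threads.

    Results are returned in the same order as the input URLs.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_url, urls))
```

Threads avoid the cost of pickling arguments and results between processes, which matters little for 2500 URLs but keeps the function usable with lambdas and closures.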
I've written a basic Scrapy spider to crawl a website which seems to run fine other than the fact it doesn't want to stop, i.e. it keeps revisiting the same urls and returning the same content - I always end up having to stop it. I suspect it's going over the same urls over and over again. Is there a rule that will stop this? Or is there something else I have to do? Maybe middleware?
The Spider is as below:
class LsbuSpider(CrawlSpider):
    name = "lsbu6"
    allowed_domains = ["lsbu.ac.uk"]
    start_urls = [
        "http://www.lsbu.ac.uk"
    ]
    rules = [
        Rule(SgmlLinkExtractor(allow=['lsbu.ac.uk/business-and-partners/.+']), callback='parse_item', follow=True),
    ]

    def parse_item(self, response):
        join = Join()
        sel = Selector(response)
        bits = sel.xpath('//*')
        scraped_bits = []
        for bit in bits:
            scraped_bit = LsbuItem()
            scraped_bit['title'] = scraped_bit.xpath('//title/text()').extract()
            scraped_bit['desc'] = join(bit.xpath('//*[@id="main_content_main_column"]//text()').extract()).strip()
            scraped_bits.append(scraped_bit)
        return scraped_bits
My settings.py file looks like this
BOT_NAME = 'lsbu6'
DUPEFILTER_CLASS = 'scrapy.dupefilter.RFPDupeFilter'
DUPEFILTER_DEBUG = True
SPIDER_MODULES = ['lsbu.spiders']
NEWSPIDER_MODULE = 'lsbu.spiders'
Any help/ guidance/ instruction on stopping it running continuously would be greatly appreciated...
As I'm a newbie to this; any comments on tidying the code up would also be helpful (or links to good instruction).
Thanks...
The DupeFilter is enabled by default: http://doc.scrapy.org/en/latest/topics/settings.html#dupefilter-class and it's based on the request url.
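To see why filtering on already-seen URLs makes a crawl terminate even when pages link back to each other, consider this toy model. It is not Scrapy's implementation (Scrapy fingerprints requests rather than storing raw URLs); crawl and the links dictionary are made up for the illustration.

```python
from collections import deque

def crawl(start, links):
    """Visit every page reachable from start exactly once.

    links maps a URL to the list of URLs found on that page.
    """
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        for nxt in links.get(url, []):
            if nxt not in seen:  # the duplicate filter
                seen.add(nxt)
                queue.append(nxt)
    return order

# Pages "a" and "b" link to each other, yet the crawl still finishes.
pages = {"a": ["b"], "b": ["a", "c"], "c": ["a"]}
print(crawl("a", pages))
```

If a crawl never stops despite such a filter, a usual suspect is that the site generates endlessly many distinct URLs (for example via query parameters), so each one looks new.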
I tried a simplified version of your spider on a new vanilla Scrapy project without any custom configuration. The dupefilter worked and the crawl stopped after a few requests. I'd say there is something wrong in your settings or with your Scrapy version. I'd suggest upgrading to Scrapy 1.0, just to be sure :)
$ pip install scrapy --pre
The simplified spider I tested:
from scrapy.spiders import CrawlSpider
from scrapy.linkextractors import LinkExtractor
from scrapy import Item, Field
from scrapy.spiders import Rule

class LsbuItem(Item):
    title = Field()
    url = Field()

class LsbuSpider(CrawlSpider):
    name = "lsbu6"
    allowed_domains = ["lsbu.ac.uk"]
    start_urls = [
        "http://www.lsbu.ac.uk"
    ]
    rules = [
        Rule(LinkExtractor(allow=['lsbu.ac.uk/business-and-partners/.+']), callback='parse_item', follow=True),
    ]

    def parse_item(self, response):
        scraped_bit = LsbuItem()
        scraped_bit['url'] = response.url
        yield scraped_bit
Your design makes the crawl go in circles. For example, the page http://www.lsbu.ac.uk/business-and-partners/business contains a link to http://www.lsbu.ac.uk/business-and-partners/partners, and that one in turn links back to the first one. Thus, you go in circles indefinitely.
In order to overcome this, you need to create better rules that eliminate the circular references.
Also, you have two identical rules defined, which is not needed. If you want to follow links, you can set that on the same rule; you don't need a new one.
Here is my problem statement:
I'm trying to retrieve all the well-specific information for a state from http://www.aogc2.state.ar.us/AOGConline/. After a bit of research, I figured out that individual well information is stored at a path structured like this:
http://www.aogc2.state.ar.us/AOGConline/ED.aspx?KeyName=API_WELLNO&KeyValue=03143100280000&KeyType=STRING&DetailXML=WellDetails.xml
where each KeyValue is unique for every well. I was trying to derive a generic pattern in the KeyValue. For the above URL, in 03143100280000, 03 represents the state (Arkansas) and 143 represents the county, but the remaining number, 100280000, does not necessarily follow a serial pattern, which makes life difficult.
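The structure described above can be written down as a small helper. This is only a sketch of the breakdown stated in the question; split_api_number is a made-up name, and the 14-digit length is an assumption taken from the example URLs.

```python
def split_api_number(key_value):
    """Split a well KeyValue into state, county and remaining digits.

    Assumes a 14-digit string: 2 digits for the state, 3 for the
    county, and 9 remaining digits with no known pattern.
    """
    if len(key_value) != 14 or not key_value.isdigit():
        raise ValueError("expected a 14-digit string, got %r" % (key_value,))
    return {
        "state": key_value[:2],
        "county": key_value[2:5],
        "rest": key_value[5:],
    }

print(split_api_number("03143100280000"))
```

The value has to stay a string throughout, since converting it to an integer would drop the leading zero of the state code.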
Is there a way to obtain all the KeyValues for the 43K+ wells (which I presume come from a database)? I tried looking through all the JS source files loaded from http://www.aogc2.state.ar.us/AOGConline/, but none points towards a source listing all KeyValues/well API numbers.
Using Python and Scrapy, I've written the following spider, which crawls a few specific well XML URLs. I need to make this generic so as to obtain the information for all 43K+ wells, but I can't figure out a way to get all the KeyValues.
from scrapy.spider import Spider
from scrapy.selector import Selector
import codecs

class AogcSpider(Spider):
    name = "aogc"
    allowed_domains = ["http://www.aogc2.state.ar.us/"]
    start_urls = [
        "http://www.aogc2.state.ar.us/AOGConline/ED.aspx?KeyName=API_WELLNO&KeyValue=03143100280000&KeyType=STRING&DetailXML=WellDetails.xml",
        "http://www.aogc2.state.ar.us/AOGConline/ED.aspx?KeyName=API_WELLNO&KeyValue=03143100290000&KeyType=STRING&DetailXML=WellDetails.xml",
        "http://www.aogc2.state.ar.us/AOGConline/ED.aspx?KeyName=API_WELLNO&KeyValue=03143100300000&KeyType=STRING&DetailXML=WellDetails.xml",
        "http://www.aogc2.state.ar.us/AOGConline/ED.aspx?KeyName=API_WELLNO&KeyValue=03143100310000&KeyType=STRING&DetailXML=WellDetails.xml",
        "http://www.aogc2.state.ar.us/AOGConline/ED.aspx?KeyName=API_WELLNO&KeyValue=03143100320000&KeyType=STRING&DetailXML=WellDetails.xml",
        "http://www.aogc2.state.ar.us/AOGConline/ED.aspx?KeyName=API_WELLNO&KeyValue=03143100330000&KeyType=STRING&DetailXML=WellDetails.xml"
    ]

    def parse(self,response):
        hxs = Selector(response)
        trnodes = hxs.xpath("//td[@class='ColumnValue']")
        filename = codecs.open("aogc_wells","a","utf-8-sig")
        filename.write("\n")
        for nodes in trnodes:
            ftext = nodes.xpath("text()").extract()
            for txt in ftext:
                filename.write(txt)
                filename.write("|")