I am running Scrapy (64-bit, Python 2.7) on Windows Vista 64-bit. I have some Scrapy code that tries to parse data contained in a table at the URL used in the following code:
from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.utils.markup import remove_tags
from scrapy.cmdline import execute

import re

class MySpider(Spider):
    name = "wiki"
    allowed_domains = ["whoscored.com"]
    start_urls = ["http://www.whoscored.com/Players/3859/Fixtures/Wayne-Rooney"]

def parse(self, response):
    for row in response.selector.xpath('//table[@id="player-fixture"]//tr[td[@class="tournament"]]'):
        # Does this row contain goal symbols?
        list_of_goals = row.xpath('//span[@title="Goal"')
        if list_of_goals:
            print remove_tags(list_of_goals).encode('utf-8')

execute(['scrapy','crawl','wiki'])
However, it is throwing up the following error:
Traceback (most recent call last):
File "c:\Python27\lib\site-packages\twisted\internet\base.py", line 1201, in mainLoop
self.runUntilCurrent()
File "c:\Python27\lib\site-packages\twisted\internet\base.py", line 824, in runUntilCurrent
call.func(*call.args, **call.kw)
File "c:\Python27\lib\site-packages\twisted\internet\defer.py", line 383, in callback
self._startRunCallbacks(result)
File "c:\Python27\lib\site-packages\twisted\internet\defer.py", line 491, in _startRunCallbacks
self._runCallbacks()
--- <exception caught here> ---
File "c:\Python27\lib\site-packages\twisted\internet\defer.py", line 578, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "c:\Python27\lib\site-packages\scrapy\spider.py", line 56, in parse
raise NotImplementedError
exceptions.NotImplementedError:
Can anyone tell me what the issue is here? I am trying to print all items in the table to the screen, including the data in the goals and assists columns.
Thanks
Your indentation is wrong:
class MySpider(Spider):
    name = "wiki"
    allowed_domains = ["whoscored.com"]
    start_urls = ["http://www.whoscored.com/Players/3859/Fixtures/Wayne-Rooney"]

    def parse(self, response):
        for row in response.selector.xpath('//table[@id="player-fixture"]//tr[td[@class="tournament"]]'):
            # Does this row contain goal symbols?
            list_of_goals = row.xpath('//span[@title="Goal"')
            if list_of_goals:
                print remove_tags(list_of_goals).encode('utf-8')
Implementing a parse method is required when you use the Spider class. This is what the method looks like in the source code:
def parse(self, response):
    raise NotImplementedError
Your indentation was wrong so parse was not part of the class and therefore you had not implemented the required method.
The raise NotImplementedError is there to ensure you write the required parse method when inheriting from the Spider base class.
You now just have to find the correct xpath ;)
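Purely as a hedged guess at what that XPath fix might look like (the page markup may well differ), the per-row query probably wants a relative path starting with .// and a closing bracket, something like:
for row in response.xpath('//table[@id="player-fixture"]//tr[td[@class="tournament"]]'):
    # ".//" keeps the search scoped to the current row instead of the whole page
    list_of_goals = row.xpath('.//span[@title="Goal"]')
    if list_of_goals:
        print remove_tags(list_of_goals.extract()[0]).encode('utf-8')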
I'm trying to run this code; the webdriver opens the page, but soon after it stops working and I receive an error: AttributeError: 'dict' object has no attribute 'dont_filter'.
This is my code:
import scrapy
from scrapy import Spider
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from scrapy.selector import Selector
from scrapy.http import Request

class RentalMarketSpider(Spider):
    name = 'rental_market'
    allowed_domains = ['home.co.uk']

    def start_requests(self):
        s = Service('/Users/chrisb/Desktop/Scrape/Home/chromedriver')
        self.driver = webdriver.Chrome(service=s)
        self.driver.get('https://www.home.co.uk/for_rent/ampthill/current_rents?location=ampthill')
        sel = Selector(text=self.driver.page_source)

        tot_prop_rent = sel.xpath('.//div[1]/table/tbody/tr[1]/td[2]/text()').extract_first()
        last_14_days = sel.xpath('.//div[1]/table/tbody/tr[2]/td[2]/text()').extract_first()
        average = sel.xpath('.//div[1]/table/tbody/tr[3]/td[2]/text()').extract_first()
        median = sel.xpath('.//div[1]/table/tbody/tr[4]/td[2]/text()').extract_first()
        one_b_num_prop = sel.xpath('.//div[3]/table/tbody/tr[2]/td[2]/text()').extract_first()
        one_b_average = sel.xpath('.//div[3]/table/tbody/tr[2]/td[3]/text()').extract_first()

        yield {
            'tot_prop_rent': tot_prop_rent,
            'last_14_days': last_14_days,
            'average': average,
            'median': median,
            'one_b_num_prop': one_b_num_prop,
            'one_b_average': one_b_average
        }
Below is the full error I receive. I looked everywhere but couldn't find a clear answer on how to get rid of it:
2021-12-23 17:43:26 [twisted] CRITICAL: Unhandled Error
Traceback (most recent call last):
File "/Users/chrisb/opt/anaconda3/lib/python3.8/site-packages/scrapy/commands/crawl.py", line 27, in run
self.crawler_process.start()
File "/Users/chrisb/opt/anaconda3/lib/python3.8/site-packages/scrapy/crawler.py", line 327, in start
reactor.run(installSignalHandlers=False) # blocking call
File "/Users/chrisb/opt/anaconda3/lib/python3.8/site-packages/twisted/internet/base.py", line 1318, in run
self.mainLoop()
File "/Users/chrisb/opt/anaconda3/lib/python3.8/site-packages/twisted/internet/base.py", line 1328, in mainLoop
reactorBaseSelf.runUntilCurrent()
--- <exception caught here> ---
File "/Users/chrisb/opt/anaconda3/lib/python3.8/site-packages/twisted/internet/base.py", line 994, in runUntilCurrent
call.func(*call.args, **call.kw)
File "/Users/chrisb/opt/anaconda3/lib/python3.8/site-packages/scrapy/utils/reactor.py", line 50, in __call__
return self._func(*self._a, **self._kw)
File "/Users/chrisb/opt/anaconda3/lib/python3.8/site-packages/scrapy/core/engine.py", line 137, in _next_request
self.crawl(request, spider)
File "/Users/chrisb/opt/anaconda3/lib/python3.8/site-packages/scrapy/core/engine.py", line 218, in crawl
self.schedule(request, spider)
File "/Users/chrisb/opt/anaconda3/lib/python3.8/site-packages/scrapy/core/engine.py", line 223, in schedule
if not self.slot.scheduler.enqueue_request(request):
File "/Users/chrisb/opt/anaconda3/lib/python3.8/site-packages/scrapy/core/scheduler.py", line 78, in enqueue_request
if not request.dont_filter and self.df.request_seen(request):
builtins.AttributeError: 'dict' object has no attribute 'dont_filter'
2021-12-23 17:43:26 [scrapy.core.engine] INFO: Closing spider (finished)
Any advice would be appreciated. Thanks for your time.
I don't see anything wrong within your code as such. Possibly you are using an old version of ChromeDriver that returns an object of the wrong shape.
Solution
Ensure that:
ChromeDriver is updated to the current ChromeDriver v96.0 level.
Chrome is updated to the current Chrome version 96.0.4664.45 (as per the chrome=96.0.4664.45 release notes). The short sketch below shows one way to confirm which versions Selenium is actually using.
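As a quick sanity check (a sketch only, reusing the chromedriver path from the question), you can ask Selenium which browser and driver versions it is actually running:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

s = Service('/Users/chrisb/Desktop/Scrape/Home/chromedriver')
driver = webdriver.Chrome(service=s)
print(driver.capabilities['browserVersion'])                 # Chrome version in use
print(driver.capabilities['chrome']['chromedriverVersion'])  # ChromeDriver version in use
driver.quit()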
tl;dr
The FIND_ELEMENT command returns a dict object value.
start_requests is supposed to yield individual Request objects, not a dict.
I was facing the same issue. If we do this work in another function (a parse callback) instead of start_requests, it works. This solution is not the best, but it is better than nothing:
def start_requests(self):
    yield scrapy.Request(url='https://scrapy.org/', callback=self.parse)

def parse(self, response):
    s = Service('/Users/chrisb/Desktop/Scrape/Home/chromedriver')
    self.driver = webdriver.Chrome(service=s)
    self.driver.get('https://www.home.co.uk/for_rent/ampthill/current_rents?location=ampthill')
    sel = Selector(text=self.driver.page_source)

    tot_prop_rent = sel.xpath('.//div[1]/table/tbody/tr[1]/td[2]/text()').extract_first()
    last_14_days = sel.xpath('.//div[1]/table/tbody/tr[2]/td[2]/text()').extract_first()
    average = sel.xpath('.//div[1]/table/tbody/tr[3]/td[2]/text()').extract_first()
    median = sel.xpath('.//div[1]/table/tbody/tr[4]/td[2]/text()').extract_first()
    one_b_num_prop = sel.xpath('.//div[3]/table/tbody/tr[2]/td[2]/text()').extract_first()
    one_b_average = sel.xpath('.//div[3]/table/tbody/tr[2]/td[3]/text()').extract_first()

    yield {
        'tot_prop_rent': tot_prop_rent,
        'last_14_days': last_14_days,
        'average': average,
        'median': median,
        'one_b_num_prop': one_b_num_prop,
        'one_b_average': one_b_average
    }
I have just tested it and it worked.
I'm writing some scraping code and am running into the error shown below.
My code is as follows:
# -*- coding: utf-8 -*-
import scrapy
from myproject.items import Headline

class NewsSpider(scrapy.Spider):
    name = 'IC'
    allowed_domains = ['kosoku.jp']
    start_urls = ['http://kosoku.jp/ic.php']

    def parse(self, response):
        """
        extract target urls and combine them with the main domain
        """
        for url in response.css('table a::attr("href")'):
            yield(scrapy.Request(response.urljoin(url), self.parse_topics))

    def parse_topics(self, response):
        """
        pick up necessary information
        """
        item = Headline()
        item["name"] = response.css("h2#page-name ::text").re(r'.*(インターチェンジ)')
        item["road"] = response.css("div.ic-basic-info-left div:last-of-type ::text").re(r'.*道$')
        yield item
I get the correct response when I try each selector individually in the Scrapy shell, but once it runs as part of the program, it fails with the error below.
2017-11-27 18:26:17 [scrapy.core.scraper] ERROR: Spider error processing <GET http://kosoku.jp/ic.php> (referer: None)
Traceback (most recent call last):
File "/Users/sonogi/envs/scrapy/lib/python3.5/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
yield next(it)
File "/Users/sonogi/envs/scrapy/lib/python3.5/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
for x in result:
File "/Users/sonogi/envs/scrapy/lib/python3.5/site-packages/scrapy/spidermiddlewares/referer.py", line 339, in <genexpr>
return (_set_referer(r) for r in result or ())
File "/Users/sonogi/envs/scrapy/lib/python3.5/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
return (r for r in result or () if _filter(r))
File "/Users/sonogi/envs/scrapy/lib/python3.5/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
return (r for r in result or () if _filter(r))
File "/Users/sonogi/scraping/myproject/myproject/spiders/IC.py", line 16, in parse
yield(scrapy.Request(response.urljoin(url), self.parse_topics))
File "/Users/sonogi/envs/scrapy/lib/python3.5/site-packages/scrapy/http/response/text.py", line 82, in urljoin
return urljoin(get_base_url(self), url)
File "/opt/local/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/parse.py", line 424, in urljoin
base, url, _coerce_result = _coerce_args(base, url)
File "/opt/local/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/parse.py", line 120, in _coerce_args
raise TypeError("Cannot mix str and non-str arguments")
TypeError: Cannot mix str and non-str arguments
2017-11-27 18:26:17 [scrapy.core.engine] INFO: Closing spider (finished)
I'm so confused and appreciate anyone's help upfront!
According to the Scrapy documentation, the .css(selector) method that you're using returns a SelectorList instance. If you want the actual (unicode) string version of the url, call the extract() method:
def parse(self, response):
    for url in response.css('table a::attr("href")').extract():
        yield(scrapy.Request(response.urljoin(url), self.parse_topics))
You're getting this error because of the code at line 15.
response.css('table a::attr("href")') returns a list-like SelectorList, not a string, so you first have to convert each url from that list into a str before you can pass it on to another function.
Further, the attr syntax might also lead you to an error: the correct attr expression doesn't use quotes, so instead of a::attr("href") it should be a::attr(href).
After fixing the above two issues, the code will look something like this:
def parse(self, response):
    """
    extract target urls and combine them with the main domain
    """
    urls = response.css('table a::attr(href)').extract()  # converts the SelectorList to a list of strings
    for url_str in urls:
        yield response.follow(url_str, self.parse_topics)
I have a Scrapy spider that looks for static HTML files on disk, using file:/// URLs as the start URLs. I'm unable to load the gzip files and loop through my directory of 150,000 files, which all have the .html.gz suffix. I've tried several different approaches (commented out below), but nothing works so far. My code currently looks like this:
from scrapy.spiders import CrawlSpider
from Scrapy_new.items import Scrapy_newTestItem
import gzip
import glob
import os.path

class Scrapy_newSpider(CrawlSpider):
    name = "info_extract"
    source_dir = '/path/to/file/'
    allowed_domains = []
    start_urls = ['file://///path/to/files/.*html.gz']

    def parse_item(self, response):
        item = Scrapy_newTestItem()
        item['user'] = response.xpath('//*[@id="page-user"]/div[1]/div/div/div[2]/div/div[2]/div[1]/h1/span[2]/text()').extract()
        item['list_of_links'] = response.xpath('//*[@id="page-user"]/div[1]/div/div/div[2]/div/div[3]/div[3]/a/@href').extract()
        item['list_of_text'] = response.xpath('//*[@id="page-user"]/div[1]/div/div/div/div/div/div/a/text()').extract()
Running this gives the following error:
Traceback (most recent call last):
File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 150, in maybeDeferred
result = f(*args, **kw)
File "/usr/local/lib/python2.7/site-packages/scrapy/core/downloader/handlers/file.py", line 13, in download_request
with open(filepath, 'rb') as fo:
IOError: [Errno 2] No such file or directory: 'path/to/files/*.html'
Changing my code so that the files are first unzipped and then passed through, as follows:
source_dir = 'path/to/files/'
for src_name in glob.glob(os.path.join(source_dir, '*.gz')):
    base = os.path.basename(src_name)
    with gzip.open(src_name, 'rb') as infile:
        #start_urls = ['/path/to/files*.html']#
        file_cont = infile.read()
        start_urls = file_cont #['file:////file_cont']
Gives the following error:
Traceback (most recent call last):
File "/usr/local/lib/python2.7/site-packages/scrapy/core/engine.py", line 127, in _next_request
request = next(slot.start_requests)
File "/usr/local/lib/python2.7/site-packages/scrapy/spiders/__init__.py", line 70, in start_requests
yield self.make_requests_from_url(url)
File "/usr/local/lib/python2.7/site-packages/scrapy/spiders/__init__.py", line 73, in make_requests_from_url
return Request(url, dont_filter=True)
File "/usr/local/lib/python2.7/site-packages/scrapy/http/request/__init__.py", line 25, in __init__
self._set_url(url)
File "/usr/local/lib/python2.7/site-packages/scrapy/http/request/__init__.py", line 57, in _set_url
raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: %3C
You don't always have to use start_urls in a Scrapy spider. Also, CrawlSpider is commonly used in conjunction with rules for specifying routes to follow and what to extract when crawling big sites; you may want to use scrapy.Spider directly instead of CrawlSpider.
Now, the solution relies on using the start_requests method that a scrapy spider offers, which handles the first requests of the spider. If this method is implemented in your spider, start_urls won't be used:
from scrapy import Spider
import gzip
import glob
import os

class ExampleSpider(Spider):
    name = 'info_extract'

    def start_requests(self):
        os.chdir("/path/to/files")
        for file_name in glob.glob("*.html.gz"):
            f = gzip.open(file_name, 'rb')
            file_content = f.read()
            print file_content # now you are reading the file content of your local files
Now, remember that start_requests must return an iterable of requests, which isn't the case here, because you are only reading files (I assume you are going to create requests later with the content of those files), so my code will fail with something like:
CRITICAL:
Traceback (most recent call last):
...
/.../scrapy/crawler.py", line 73, in crawl
start_requests = iter(self.spider.start_requests())
TypeError: 'NoneType' object is not iterable
This points out that I am not returning anything from my start_requests method (None), which isn't iterable.
Scrapy will not be able to deal with the compressed HTML files; you have to extract them first. This can be done on the fly in Python, or you can just extract them at the operating-system level.
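For instance, here is a minimal sketch of the on-the-fly option (the spider name and /path/to/files are taken from the question; everything else is an assumption): decompress each archive, write a plain .html copy next to it, and yield a file:// request so a normal parse callback can handle the markup.
import glob
import gzip
import os

from scrapy import Request, Spider

class LocalHtmlSpider(Spider):
    name = "info_extract"

    def start_requests(self):
        for gz_path in glob.glob("/path/to/files/*.html.gz"):
            html_path = gz_path[:-3]  # strip the ".gz" suffix
            # decompress the archive to a plain .html file alongside it
            with gzip.open(gz_path, 'rb') as infile, open(html_path, 'wb') as outfile:
                outfile.write(infile.read())
            yield Request("file://" + os.path.abspath(html_path), callback=self.parse)

    def parse(self, response):
        # extract whatever fields you need from the decompressed page here
        yield {'url': response.url}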
Related: Python Scrapy on offline (local) data
I'm not sure how I have 3 arguments here, and if so, how do I call DmozItem from items.py? This seems like a simple inheritance issue that I'm missing. This code is copied directly from the Scrapy tutorial website.
-- Shell error --
SyntaxError: invalid syntax
PS C:\Users\Steve\tutorial> scrapy crawl dmoz
Traceback (most recent call last):
File "c:\python27\scripts\scrapy-script.py", line 9, in <module>
load_entry_point('scrapy==1.0.3', 'console_scripts', 'scrapy')()
File "C:\Python27\lib\site-packages\scrapy-1.0.3-py2.7.egg\scrapy\cmdline.py", line 142, in execute
cmd.crawler_process = CrawlerProcess(settings)
File "C:\Python27\lib\site-packages\scrapy-1.0.3-py2.7.egg\scrapy\crawler.py", line 209, in __init__
super(CrawlerProcess, self).__init__(settings)
File "C:\Python27\lib\site-packages\scrapy-1.0.3-py2.7.egg\scrapy\crawler.py", line 115, in __init__
self.spider_loader = _get_spider_loader(settings)
File "C:\Python27\lib\site-packages\scrapy-1.0.3-py2.7.egg\scrapy\crawler.py", line 296, in _get_spider_loader
return loader_cls.from_settings(settings.frozencopy())
File "C:\Python27\lib\site-packages\scrapy-1.0.3-py2.7.egg\scrapy\spiderloader.py", line 30, in from_settings
return cls(settings)
File "C:\Python27\lib\site-packages\scrapy-1.0.3-py2.7.egg\scrapy\spiderloader.py", line 21, in __init__
for module in walk_modules(name):
File "C:\Python27\lib\site-packages\scrapy-1.0.3-py2.7.egg\scrapy\utils\misc.py", line 71, in walk_modules
submod = import_module(fullpath)
File "C:\Python27\lib\importlib\__init__.py", line 37, in import_module
__import__(name)
File "C:\Users\Steve\tutorial\tutorial\spiders\dmoz_spider.py", line 3, in <module>
from tutorial.items import DmozItem
File "C:\Users\Steve\tutorial\tutorial\items.py", line 11, in <module>
class DmozItem(scrapy.item):
TypeError: Error when calling the metaclass bases
module.__init__() takes at most 2 arguments (3 given)
-- items.py -- my items list for parsing
import scrapy

class DmozItem(scrapy.item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()
-- dmoz_spider.py -- this is the spider
import scrapy
from tutorial.items import DmozItem

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "https://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "https://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item
You have mistyped the scrapy.Item class name.
In items.py, change:
scrapy.item
to
scrapy.Item
It should look like this:
import scrapy

class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()
I am relatively new to Python, so any help/advice is appreciated.
I am trying to build a script which will run a Scrapy spider.
So far I have the code below,
from scrapy.contrib.loader import XPathItemLoader
from scrapy.item import Item, Field
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider
from scrapy.crawler import CrawlerProcess

class QuestionItem(Item):
    """Our SO Question Item"""
    title = Field()
    summary = Field()
    tags = Field()

    user = Field()
    posted = Field()

    votes = Field()
    answers = Field()
    views = Field()

class MySpider(BaseSpider):
    """Our ad-hoc spider"""
    name = "myspider"
    start_urls = ["http://stackoverflow.com/"]

    question_list_xpath = '//div[@id="content"]//div[contains(@class, "question-summary")]'

    def parse(self, response):
        hxs = HtmlXPathSelector(response)

        for qxs in hxs.select(self.question_list_xpath):
            loader = XPathItemLoader(QuestionItem(), selector=qxs)
            loader.add_xpath('title', './/h3/a/text()')
            loader.add_xpath('summary', './/h3/a/@title')
            loader.add_xpath('tags', './/a[@rel="tag"]/text()')
            loader.add_xpath('user', './/div[@class="started"]/a[2]/text()')
            loader.add_xpath('posted', './/div[@class="started"]/a[1]/span/@title')
            loader.add_xpath('votes', './/div[@class="votes"]/div[1]/text()')
            loader.add_xpath('answers', './/div[contains(@class, "answered")]/div[1]/text()')
            loader.add_xpath('views', './/div[@class="views"]/div[1]/text()')

            yield loader.load_item()

class CrawlerWorker(Process):
    def __init__(self, spider, results):
        Process.__init__(self)
        self.results = results

        self.crawler = CrawlerProcess(settings)
        if not hasattr(project, 'crawler'):
            self.crawler.install()
        self.crawler.configure()

        self.items = []
        self.spider = spider
        dispatcher.connect(self._item_passed, signals.item_passed)

    def _item_passed(self, item):
        self.items.append(item)

    def run(self):
        self.crawler.crawl(self.spider)
        self.crawler.start()
        self.crawler.stop()
        self.results.put(self.items)

def main():
    results = Queue()
    crawler = CrawlerWorker(MySpider(BaseSpider), results)
    crawler.start()

    for item in results.get():
        pass # Do something with item
I get the error below:
ERROR: An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line statement', (157, 0))
...
C:\Python27\lib\site-packages\twisted\internet\win32eventreactor.py:64: UserWarn
ing: Reliable disconnection notification requires pywin32 215 or later
category=UserWarning)
ERROR: An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line statement', (157, 0))
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "C:\Python27\lib\multiprocessing\forking.py", line 374, in main
self = load(from_parent)
File "C:\Python27\lib\pickle.py", line 1378, in load
ERROR: An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line statement', (157, 0))
return Unpickler(file).load()
File "C:\Python27\lib\pickle.py", line 858, in load
dispatch[key](self)
File "C:\Python27\lib\pickle.py", line 1090, in load_global
klass = self.find_class(module, name)
File "C:\Python27\lib\pickle.py", line 1124, in find_class
__import__(module)
File "Webscrap.py", line 53, in <module>
class CrawlerWorker(Process):
NameError: name 'Process' is not defined
ERROR: An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line statement', (157, 0))
...
"PicklingError: <function remove at 0x07871CB0>: Can't pickle <function remove at 0x077F6BF0>: it's not found as weakref.remove".
I realise I am doing something logically wrong. Being new to this, I can't spot it. Could anyone give me some help to get this code running?
Ultimately I just want a script which will run, scrape the required data, and store it in a database, but first I would like to get just the scraping working. I thought this would run it, but no luck so far.
I assume you want a standalone spider/crawler... That is actually quite simple, though I'm not using a custom Process.
class StandAloneSpider(CyledgeSpider):
    # a regular spider
    pass

settings.overrides['LOG_ENABLED'] = True
# more settings can be changed...

crawler = CrawlerProcess(settings)
crawler.install()
crawler.configure()

spider = StandAloneSpider()

crawler.crawl(spider)
crawler.start()
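For what it's worth, on recent Scrapy versions the install()/configure() calls are gone; a minimal sketch of the same idea with the current CrawlerProcess API (reusing the MySpider class defined in the question) would be:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())  # or pass a plain settings dict
process.crawl(MySpider)
process.start()  # the script blocks here until the crawl finishes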