Scrapy: Save response.body as html file?

Scrapy: Save response.body as html file? - python

My spider works, but I can't download the body of the website I crawl in a .html file. If I write self.html_fil.write('test') then it works fine. I don't know how to convert the tulpe to string.
I use Python 3.6
Spider:
class ExampleSpider(scrapy.Spider):
name = "example"
allowed_domains = ['google.com']
start_urls = ['http://google.com/']
def __init__(self):
self.path_to_html = html_path + 'index.html'
self.path_to_header = header_path + 'index.html'
self.html_file = open(self.path_to_html, 'w')
def parse(self, response):
url = response.url
self.html_file.write(response.body)
self.html_file.close()
yield {
'url': url
}
Tracktrace:
Traceback (most recent call last):
File "c:\python\python36-32\lib\site-packages\twisted\internet\defer.py", line
653, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "c:\Users\kv\AtomProjects\example_project\example_bot\example_bot\spiders
\example.py", line 35, in parse
self.html_file.write(response.body)
TypeError: write() argument must be str, not bytes

Actual problem is you are getting byte code. You need to convert it to string format. there are many ways for converting byte to string format.
You can use
self.html_file.write(response.body.decode("utf-8"))
instead of
self.html_file.write(response.body)
also you can use
self.html_file.write(response.text)

The correct way is to use response.text, and not response.body.decode("utf-8"). To quote documentation:
Keep in mind that Response.body is always a bytes object. If you want the unicode version use TextResponse.text (only available in TextResponse and subclasses).
and
text: Response body, as unicode.
The same as response.body.decode(response.encoding), but the result is cached after the first call, so you can access response.text multiple times without extra overhead.
Note: unicode(response.body) is not a correct way to convert response body to unicode: you would be using the system default encoding (typically ascii) instead of the response encoding.

Taking in consideration responses above, and making it as much pythonic as possible adding the use of the with statement, the example should be rewritten like:
class ExampleSpider(scrapy.Spider):
name = "example"
allowed_domains = ['google.com']
start_urls = ['http://google.com/']
def __init__(self):
self.path_to_html = html_path + 'index.html'
self.path_to_header = header_path + 'index.html'
def parse(self, response):
with open(self.path_to_html, 'w') as html_file:
html_file.write(response.text)
yield {
'url': response.url
}
But the html_file will only accessible from the parse method.

Related

Scrapy not identifying key from json

I am trying to scrape the information pertaining to the biblical commentaries off of a website. Below is the code I have made to do so. start_urls is the link to the json file I am trying to scrape. I chose ['0']['father']['_id'] to get the name of the commenter, however, the following error occurs. What should I do?
Error: TypeError: list indices must be integers or slices, not str
Code:
import scrapy
import json
class catenaspider(scrapy.Spider): #spider to crawl the url
name = 'commentary' #name to be called in command terminal
start_urls = ['https://api.catenabible.com:8080/anc_com/c/mt/1/1?tags=[%22ALL%22]&sort=def']
def parse(self,response):
data = json.loads(response.body)
yield from data['0']['father']['_id']```

Read the documentation again.
import scrapy
class catenaspider(scrapy.Spider): # spider to crawl the url
name = 'commentary' # name to be called in command terminal
start_urls = ['https://api.catenabible.com:8080/anc_com/c/mt/1/1?tags=[%22ALL%22]&sort=def']
def parse(self, response):
data = response.json()
yield {'id_father': data[0]['father']['_id']}
# if you want to get all the id's
# for d in data:
# yield {'id_father': d['father']['_id']}

Request url must be str or unicode, got list error?

I am getting following error : 'Request url must be str or unicode, got list'
starting url is 'https://www.zomato.com/istanbul/restaurants?page=1'
import scrapy
def parse(self, response):
all_css = response.css('.search_left_featured')
all_product = all_css.css('a::attr(href)').extract()
yield scrapy.Request(all_product, callback=self.parse_dir_contents)
max_page_number = 10
for i in range(1, max_page_number):
url_next = 'https://www.zomato.com/istanbul/restaurants?page=' + str(i)+''
yield scrapy.Request(url_next, callback=self.parse)
def parse_dir_contents(self, response):
items = ZomatodataItem()
name = response.css('.iNaazl::text').extract()
genre =response.css('.PhzdX::text').extract()
location =response.css('.gqeQEx::text').extract()
tags = response.css('.cunMUz::text').extract()
address = response.css('.clKRrC::text').extract()
phone = response.css('.kKemRh::text').extract()
items['name']= name
items['genre']= genre
items['location']= location
items['tags']= tags
items['address']= address
items['phone_number']= phone
yield items

What is your issue? The error seems clear: .css returns a SelectorList, whose extract method returnds a list, which you're then passing to Request which wants a URL, which a list is not.
Either iterate on your result, or use the (more modern and less confusing) .get() and .getall() methods from scrapy, extract() is deprecated (as in scrapy has stopped using it in its documentation) because it behaves differently depending on being invoked on a Selector (returns a string) or SelectorList (returns a list).
Hell, do both.

Scraping Japanese website using Scrapy but no data in output file

I am new to Scrapy. I wanted some data scraped from a Japanese website but when I run the following spider, it won't show any data on the exported file. Can someone help me please.
Exporting to csv format doesn't show any results in the shell either, just [].
Here is my code.
import scrapy
class suumotest(scrapy.Spider):
name = "testsecond"
start_urls = [
'https://suumo.jp/jj/chintai/ichiran/FR301FC005/?tc=0401303&tc=0401304&ar=010&bs=040'
]
def parse(self, response):
# for following property link
for href in response.css('.property_inner-title+a::attr(href)').extract():
yield scrapy.Request(response.urljoin(href), callback=self.parse_info)
# defining parser to extract data
def parse_info(self, response):
def extract_with_css(query):
return response.css(query).extract_first().strip()
yield {
'Title': extract_with_css('h1.section_title::text'),
'Fee': extract_with_css('td.detailinfo-col--01 span.detailvalue-item-accent::text'),
'Fee Descrition': extract_with_css('td.detailinfo-col--01 span.detailvalue-item-text::text'),
'Prop Description': extract_with_css('td.detailinfo-col--03::text'),
'Prop Address': extract_with_css('td.detailinfo-col--04::text'),
}

Your first css selector in parse method is faulty here:
response.css('.property_inner-title+a::attr(href)').extract()
+ is the fault here. Just replace it with a space, like:
response.css('.property_inner-title a::attr(href)').extract()
Another issue is in your defined extract_with_css() function:
def parse_info(self, response):
def extract_with_css(query):
return response.css(query).extract_first().strip()
The problem here is that extract_first() will return None by default if no values are found and .strip() is a function of string base class, since you're not getting a string this will throw an error.
To fix that you can set default value to extract_first to be an empty string instead:
def parse_info(self, response):
def extract_with_css(query):
return response.css(query).extract_first('').strip()

scrapy, how to send multiple requests to a form

I have working code here. I am sending 1 request to a form, and I am getting back all the data that I need. Code:
def start_requests(self):
nubmers="12345"
submitForm = FormRequest("https://example.com/url",
formdata={'address':numbers,'submit':'Search'},
callback=self.after_submit)
return [submitForm]
Now, I need to send multiple requests through the same form, and collect the data for each request. I need to collect the data for x numbers. I stored all numbers into a file:
12345
54644
32145
12345
code:
def start_requests(self):
with open('C:\spiders\usps\zips.csv') as fp:
for line in fp:
submitForm = FormRequest("https://example.com/url",
formdata={'address':line,
'submit':'Search'},callback=self.after_submit,dont_filter=True)
return [submitForm]
This code works, but it also collects data for the last entry in the file only. I need to collect the data for each row/number in the file. If I try yield instead, it returns scrapy, stops, and throws this error:
if not request.dont_filter and self.df.request_seen(request):
exceptions.AttributeError: 'list' object has no attribute 'dont_filter'

First of all, you definitely need yield to "fire" up multiple requests:
def start_requests(self):
with open('C:\spiders\usps\zips.csv') as fp:
for line in fp:
yield FormRequest("https://domain.com/url",
formdata={'address':line, 'submit':'Search'},
callback=self.after_submit,
dont_filter=True)
Also, you shouldn't enclose the FormRequest into a list, just yield the request.

Creating a generic scrapy spider

My question is really how to do the same thing as a previous question, but in Scrapy 0.14.
Using one Scrapy spider for several websites
Basically, I have GUI that takes parameters like domain, keywords, tag names, etc. and I want to create a generic spider to crawl those domains for those keywords in those tags. I've read conflicting things, using older versions of scrapy, by either overriding the spider manager class or by dynamically creating a spider. Which method is preferred and how do I implement and invoke the proper solution? Thanks in advance.
Here is the code that I want to make generic. It also uses BeautifulSoup. I paired it down so hopefully didn't remove anything crucial to understand it.
class MySpider(CrawlSpider):
name = 'MySpider'
allowed_domains = ['somedomain.com', 'sub.somedomain.com']
start_urls = ['http://www.somedomain.com']
rules = (
Rule(SgmlLinkExtractor(allow=('/pages/', ), deny=('', ))),
Rule(SgmlLinkExtractor(allow=('/2012/03/')), callback='parse_item'),
)
def parse_item(self, response):
contentTags = []
soup = BeautifulSoup(response.body)
contentTags = soup.findAll('p', itemprop="myProp")
for contentTag in contentTags:
matchedResult = re.search('Keyword1|Keyword2', contentTag.text)
if matchedResult:
print('URL Found: ' + response.url)
pass

You could create a run-time spider which is evaluated by the interpreter. This code piece could be evaluated at runtime like so:
a = open("test.py")
from compiler import compile
d = compile(a.read(), 'spider.py', 'exec')
eval(d)
MySpider
<class '__main__.MySpider'>
print MySpider.start_urls
['http://www.somedomain.com']

I use the Scrapy Extensions approach to extend the Spider class to a class named Masterspider that includes a generic parser.
Below is the very "short" version of my generic extended parser. Note that you'll need to implement a renderer with a Javascript engine (such as Selenium or BeautifulSoup) a as soon as you start working on pages using AJAX. And a lot of additional code to manage differences between sites (scrap based on column title, handle relative vs long URL, manage different kind of data containers, etc...).
What is interresting with the Scrapy Extension approach is that you can still override the generic parser method if something does not fit but I never had to. The Masterspider class checks if some methods have been created (eg. parser_start, next_url_parser...) under the site specific spider class to allow the management of specificies: send a form, construct the next_url request from elements in the page, etc.
As I'm scraping very different sites, there's always specificities to manage. That's why I prefer to keep a class for each scraped site so that I can write some specific methods to handle it (pre-/post-processing except PipeLines, Request generators...).
masterspider/sitespider/settings.py
EXTENSIONS = {
'masterspider.masterspider.MasterSpider': 500
}
masterspider/masterspdier/masterspider.py
# -*- coding: utf8 -*-
from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.http import Request
from sitespider.items import genspiderItem
class MasterSpider(Spider):
def start_requests(self):
if hasattr(self,'parse_start'): # First page requiring a specific parser
fcallback = self.parse_start
else:
fcallback = self.parse
return [ Request(self.spd['start_url'],
callback=fcallback,
meta={'itemfields': {}}) ]
def parse(self, response):
sel = Selector(response)
lines = sel.xpath(self.spd['xlines'])
# ...
for line in lines:
item = genspiderItem(response.meta['itemfields'])
# ...
# Get request_url of detailed page and scrap basic item info
# ...
yield Request(request_url,
callback=self.parse_item,
meta={'item':item, 'itemfields':response.meta['itemfields']})
for next_url in sel.xpath(self.spd['xnext_url']).extract():
if hasattr(self,'next_url_parser'): # Need to process the next page URL before?
yield self.next_url_parser(next_url, response)
else:
yield Request(
request_url,
callback=self.parse,
meta=response.meta)
def parse_item(self, response):
sel = Selector(response)
item = response.meta['item']
for itemname, xitemname in self.spd['x_ondetailpage'].iteritems():
item[itemname] = "\n".join(sel.xpath(xitemname).extract())
return item
masterspider/sitespider/spiders/somesite_spider.py
# -*- coding: utf8 -*-
from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.http import Request
from sitespider.items import genspiderItem
from masterspider.masterspider import MasterSpider
class targetsiteSpider(MasterSpider):
name = "targetsite"
allowed_domains = ["www.targetsite.com"]
spd = {
'start_url' : "http://www.targetsite.com/startpage", # Start page
'xlines' : "//td[something...]",
'xnext_url' : "//a[contains(#href,'something?page=')]/#href", # Next pages
'x_ondetailpage' : {
"itemprop123" : u"id('someid')//text()"
}
}
# def next_url_parser(self, next_url, response): # OPTIONAL next_url regexp pre-processor
# ...

Instead of having the variables name,allowed_domains, start_urls and rules attached to the class, you should write a MySpider.__init__, call CrawlSpider.__init__ from that passing the necessary arguments, and setting name, allowed_domains etc. per object.
MyProp and keywords also should be set within your __init__. So in the end you should have something like below. You don't have to add name to the arguments, as name is set by BaseSpider itself from kwargs:
class MySpider(CrawlSpider):
def __init__(self, allowed_domains=[], start_urls=[],
rules=[], findtag='', finditemprop='', keywords='', **kwargs):
CrawlSpider.__init__(self, **kwargs)
self.allowed_domains = allowed_domains
self.start_urls = start_urls
self.rules = rules
self.findtag = findtag
self.finditemprop = finditemprop
self.keywords = keywords
def parse_item(self, response):
contentTags = []
soup = BeautifulSoup(response.body)
contentTags = soup.findAll(self.findtag, itemprop=self.finditemprop)
for contentTag in contentTags:
matchedResult = re.search(self.keywords, contentTag.text)
if matchedResult:
print('URL Found: ' + response.url)

I am not sure which way is preferred, but I will tell you what I have done in the past. I am in no way sure that this is the best (or correct) way of doing this and I would be interested to learn what other people think.
I usually just override the parent class (CrawlSpider) and either pass in arguments and then initialize the parent class via super(MySpider, self).__init__() from within my own init-function or I pull in that data from a database where I have saved a list of links to be appended to start_urls earlier.

As far as crawling specific domains passed as arguments goes, I just override Spider.__init__:
class MySpider(scrapy.Spider):
"""
This spider will try to crawl whatever is passed in `start_urls` which
should be a comma-separated string of fully qualified URIs.
Example: start_urls=http://localhost,http://example.com
"""
def __init__(self, name=None, **kwargs):
if 'start_urls' in kwargs:
self.start_urls = kwargs.pop('start_urls').split(',')
super(Spider, self).__init__(name, **kwargs)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Scrapy: Save response.body as html file? - python

Related

Scrapy not identifying key from json

Request url must be str or unicode, got list error?

Scraping Japanese website using Scrapy but no data in output file

scrapy, how to send multiple requests to a form

Creating a generic scrapy spider

Categories

Resources