Below is my source code. I am getting a KeyError: 'No input element with the name None' error.
import re
import json

import scrapy
from loginform import fill_login_form


class wikihowSpider(scrapy.Spider):
    name = "wikihow"
    start_urls = ['http://www.wikihow.com/Category:Arts-and-Entertainment']
    login_url = 'https://www.wikihow.com/Main-Page#wh-dialog-login'

    def start_requests(self):
        yield scrapy.Request(self.login_url, self.parse_login)

    def parse_login(self, response):
        print('Here')
        data, url, method = fill_login_form(response.url, response.body,
                                            'username', 'password')
        return scrapy.FormRequest(url, formdata=dict(data),
                                  method=method, callback=self.parse_main)

    def parse_main(self, response):
        # crawl
        pass
My use case is to log in and then crawl, given a list of starting URLs.
I also tried versions of the examples mentioned here, but kept getting errors like Ignoring response <404 https://www.wikihow.com/wikiHowTo?search=&wpName=username&wpPassword=password&wpRemember=1&wploginattempt=Log+in>: HTTP status code is not handled or not allowed. Any help would be appreciated!
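For reference, a minimal sketch of the intended log-in-then-crawl flow using FormRequest.from_response instead of loginform; the wpName/wpPassword field names are taken from the failing request URL above, so they are only an assumption about wikiHow's actual login form:

import scrapy


class WikihowLoginSketch(scrapy.Spider):
    name = "wikihow_login_sketch"
    start_urls = ['http://www.wikihow.com/Category:Arts-and-Entertainment']
    login_url = 'https://www.wikihow.com/Main-Page#wh-dialog-login'

    def start_requests(self):
        yield scrapy.Request(self.login_url, callback=self.parse_login)

    def parse_login(self, response):
        # Let Scrapy locate the login form and copy its hidden fields;
        # wpName/wpPassword are assumed field names (seen in the failing URL).
        yield scrapy.FormRequest.from_response(
            response,
            formdata={'wpName': 'username', 'wpPassword': 'password'},
            callback=self.after_login,
        )

    def after_login(self, response):
        # Once logged in, schedule the real start URLs for crawling.
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse_main)

    def parse_main(self, response):
        # crawl the category page here
        pass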
I am trying to get a JSON field with the key "longName" with Scrapy, but I am receiving the error: "Spider must return request, item, or None, got 'str'".
The JSON I'm trying to scrape looks something like this:
{
    "id": 5355,
    "code": 9594,
    "longName": "..."
}
This is my code:
import scrapy
import json


class NotesSpider(scrapy.Spider):
    name = 'notes'
    allowed_domains = ['blahblahblah.com']
    start_urls = ['https://blahblahblah.com/api/123']

    def parse(self, response):
        data = json.loads(response.body)
        yield from data['longName']
I get the above error when I run scrapy crawl notes at the prompt. Can anyone point me in the right direction?
If you only want longName, modifying your parse method like this should do the trick:
def parse(self, response):
    data = json.loads(response.body)
    yield {"longName": data["longName"]}
Currently I'm trying to crawl a page that requires a session in order to give me some information. For that purpose I have created a simple Python program using the Scrapy library, but I'm not completely sure it is developed correctly, since I don't know how to debug it (or if that is even possible); right now I'm not getting any results.
At the moment my code looks like this:
import scrapy
from scrapy import FormRequest
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from strava_crawler.items import StravaCrawlerItem
from scrapy.exceptions import CloseSpider
from scrapy.utils.response import open_in_browser


class stravaSpider(CrawlSpider):
    name = 'stravaSpider'
    item_count = 0
    allowed_domain = ['https://www.strava.com']
    start_urls = ['https://www.strava.com/login']

    def parse(self, response):
        token = response.xpath("//meta[@name='csrf-token']/@content").extract_first()
        print(token)
        yield FormRequest.from_response(response, formdata={
            'username': 'xxxx@xxx.com',
            'password': 'xxxxx',
            'authenticity_token': token
        }, callback=self.start_scraping)

    def start_scraping(self, response):
        sc_item = StravaCrawlerItem()
        sc_item['titulo'] = response.xpath('//div[@class="athlete-name"]/text()').extract()
        self.item_count += 1
        if self.item_count == 1:
            raise CloseSpider('item_exceeded')
        yield sc_item
The code seems pretty simple, as you can see, but my problem comes at two points. The first one is:
token = response.xpath("//meta[@name='csrf-token']/@content").extract_first()
I'm not really sure about this one; I have looked at the HTML of the page, and this is what I found:
My second problem comes with the network headers of the page itself, which look like this:
This makes me doubt my FormRequest.from_response. I have seen solutions passing the csrf-token in from_response itself, but I have tried that and haven't gotten any response either. This is the output I get in the terminal, which I suppose could be relevant to the question.
Do you see something wrong with the code or with the concept of the program?
EDIT: After a few changes I get new output that looks like a redirect loop to me, in which I get redirected from the dashboard back to the login. I printed the response.body and it's the HTML of the login page.
EDIT 2: I ran the code on Ubuntu (I was on Windows) and it worked perfectly. So feel free to use it as an example of logging in with Scrapy.
You don't need to get the token yourself; FormRequest.from_response fills it in for you. You can test this in scrapy shell like this:
>>> from scrapy import FormRequest
>>> req = FormRequest.from_response(response, formdata={
... 'username': 'xxxx@xxx.com',
... 'password': 'xxxxx',
... })
>>> req.body
b'utf8=%E2%9C%93&authenticity_token=L4mSH2wLcNAiLcR7yqCb%2BEdaNyPyJqU%2BbbT1ct9wQGWPnqstXVM5bWX1tmIPq62qpp4FpHdsjazlruVe%2Ba0xpg%3D%3D&plan=&email=&username=xxxx%40xxx.com&password=xxxxx'
You use 'username', but if you check the request made when you log in, they use 'email'.
I don't think it causes problems here, but it's usually good to specify the form and the submit button (to avoid filling in the wrong form, or clicking the wrong button, like 'reset password').
I would try to test it like this:
def parse(self, response):
    yield FormRequest.from_response(
        response,
        formid="login_form",
        clickdata={'type': 'submit'},
        formdata={
            'email': 'xxxx@xxx.com',
            'password': 'xxxxx',
        },
        callback=self.start_scraping)
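If the login still seems to fail, the open_in_browser helper that is already imported in the question's code is a quick way to debug the callback; a minimal sketch:

from scrapy.utils.response import open_in_browser

def start_scraping(self, response):
    # Opens the page Scrapy actually received in a local browser, so you can
    # see whether you landed on the dashboard or were bounced back to the login form.
    open_in_browser(response)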
I want to scrape product pages from a sitemap. The product pages are similar, but not all of them are the same.
For example:
Product A
https://www.vitalsource.com/products/environment-the-science-behind-the-stories-jay-h-withgott-matthew-v9780134446400
Product B
https://www.vitalsource.com/products/abnormal-psychology-susan-nolen-hoeksema-v9781259765667
We can see that product A has a subtitle but the other one doesn't.
So I get errors when I try to scrape all the product pages.
My question is: is there a way to let the spider skip the error when no data is returned?
There is a simple way to bypass it, which is to not use strip().
But I am wondering if there is a better way to do the job.
import scrapy
import re
from VitalSource.items import VitalsourceItem
from scrapy.selector import Selector
from scrapy.spiders import SitemapSpider


class VsSpider(SitemapSpider):
    name = 'VS'
    allowed_domains = ['vitalsource.com']
    sitemap_urls = ['https://storage.googleapis.com/vst-stargate-production/sitemap/sitemap1.xml.gz']
    sitemap_rules = [
        ('/products/', 'parse_product'),
    ]

    def parse_product(self, response):
        selector = Selector(response=response)
        item = VitalsourceItem()
        item['Ebook_Title'] = response.css('.product-overview__title-header::text').extract()[1].strip
        item['Ebook_SubTitle'] = response.css("div.subtitle.subtitle-pdp::text").extract().strip
        print(item)
        return item
Error message:
item['Ebook_SubTitle'] = response.css("div.subtitle.subtitle-pdp::text").extract().strip
AttributeError: 'list' object has no attribute 'strip'
Since you need only one subtitle, you can use get() with the default value set to an empty string. This will save you from errors about applying the strip() function to a missing element.
item['Ebook_SubTitle'] = response.css("div.subtitle.subtitle-pdp::text").get('').strip()
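The difference is easiest to see side by side; a small sketch of the two selector calls, assuming the subtitle div is missing from the page:

# extract() returns a list of all matches ([] when nothing matches),
# so accessing .strip on it raises AttributeError:
response.css("div.subtitle.subtitle-pdp::text").extract()        # -> []
# get('') returns the first match, or the given default string:
response.css("div.subtitle.subtitle-pdp::text").get('').strip()  # -> ''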
In general Scrapy will not stop crawling if callbacks raise an exception, e.g.:
from scrapy import Request


def start_requests(self):
    for i in range(10):
        yield Request(
            f'http://example.org/page/{i}',
            callback=self.parse,
            errback=self.errback,
        )

def parse(self, response):
    # first page
    if 'page/1' in response.request.url:
        raise ValueError()
    yield {'url': response.url}

def errback(self, failure):
    print(f"oh no, failed to parse {failure.request}")
In this example 10 requests will be made and 9 items will be scraped, but 1 will fail and go to the errback.
In your case you have nothing to fear: any request that does not raise an exception will be scraped as it should; for the ones that do, you'll just see an exception traceback in your terminal/logs.
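If you also want to distinguish HTTP errors from other failures, the Failure object passed to the errback can be inspected; a sketch following the errback pattern from the Scrapy docs:

from scrapy.spidermiddlewares.httperror import HttpError

def errback(self, failure):
    if failure.check(HttpError):
        # Non-2xx responses end up here when their status is not allowed/handled.
        response = failure.value.response
        self.logger.warning("got status %s for %s", response.status, response.url)
    else:
        # Other failures, e.g. DNS lookup errors or timeouts.
        self.logger.error("request failed: %r", failure.request)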
You could check if a value is returned before extracting:
if response.css("div.subtitle.subtitle-pdp::text"):
    item['Ebook_SubTitle'] = response.css("div.subtitle.subtitle-pdp::text").get().strip()
That way the subtitle line would only run if a value was returned...
I am trying to run the code using Python (Scrapy) but there is no output.
I am also trying to log in to a webpage; let me know if there are any errors.
The code I am using is this:
class MySpider(Spider):

    def __init__(self, login, password):
        link = "http://freeerisa.benefitspro.com"
        self.login = login
        self.password = password
        self.cj = cookielib.CookieJar()
        self.opener = urllib2.build_opener(
            urllib2.HTTPRedirectHandler(),
            urllib2.HTTPHandler(debuglevel=0),
            urllib2.HTTPSHandler(debuglevel=0),
            urllib2.HTTPCookieProcessor(self.cj)
        )
        self.loginToFreeErissa()
        self.loginToFreeErissa()

    def loginToFreeErissa(self):
        login_data = urllib.urlencode({
            'MainContent_mainContent_txtEmail': self.login,
            'MainContent_mainContent_txtPassword': self.password,
        })
        response = self.opener.open(link + "/login.aspx", login_data)
        return ''.join(response.readlines())

    def after_login(self, response):
        if "Error while logging in" in response.body:
            self.logger.error("Login failed!")
        else:
            url = [link + "/5500/plandetails.aspx?Ein=042088633",
                   link + "/5500/plandetails.aspx?Ein=046394579"]
            for u in url:
                g_data = soup.find_all("span")
                for item in g_data:
                    return item.text
I tried calling the function and this is the error I received:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\ProgramData\Anaconda3\lib\site-packages\scrapy\spiders\__init__.py", line 30, in __init__
    raise ValueError("%s must have a name" % type(self).__name__)
ValueError: MySpider must have a name
There is no output because you don't call anything.
In other words, you defined MySpider but you didn't use it.
Here's a link that could help you
Change your code to
class MySpider(Spider):
    name = 'myspider'

    def __init__(self, login, password):
        link = "http://freeerisa.benefitspro.com"
and run your spider by
scrapy crawl myspider
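Since the spider's __init__ takes login and password, those would normally be supplied as spider arguments on the command line, for example (the credential values here are just placeholders):

scrapy crawl myspider -a login=myuser -a password=mypass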
for more information
The error message could not be plainer: the spider must have a name. There is no name in the code you have posted. This is basic to creating a spider in Scrapy. Also, your Python spacing is terrible; you need an editor with Pylint or something similar that will tell you about PEP 8.
I'm trying to create an input processor to convert scraped relative URLs to absolute URLs, based on this Stack Overflow post. I'm struggling with the loader_context concept and I'm probably mixing things up here. Could anyone point me in the right direction?
I have the following in items.py:
import scrapy
from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose
from urlparse import urljoin


def convert_to_baseurl(url, loader_context):
    response = loader_context.get('response')
    return urljoin(url, response)


class Item(scrapy.Item):
    url = scrapy.Field(
        input_processor=MapCompose(convert_to_baseurl)
    )
And the following in my spider
class webscraper(scrapy.Spider):
    name = "spider"

    def start_requests(self):
        yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for entry in response.css('li.aanbodEntry'):
            loader = ItemLoader(item=Huis(), selector=entry)
            loader.add_css('url', 'a')
            yield loader.load_item()
The _urljoin() in the answer you referenced is a function written by the OP, and it has a different signature than the one in the stdlib.
The correct way to use the stdlib urljoin() would be:
return urljoin(response.url, url)
There is no need to use that, however, since you can use Response.urljoin():
def absolute_url(url, loader_context):
    return loader_context['response'].urljoin(url)
For the response to be accessible through the context attribute, you need to pass it as an argument when creating the item loader, or use a different method mentioned in item loader docs:
loader = ItemLoader(item=Huis(), selector=entry, response=response)
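Putting the pieces together, items.py might then look roughly like this; a sketch that renames the item class to Huis to match the spider (an assumption about the intended class name):

import scrapy
from scrapy.loader.processors import MapCompose


def absolute_url(url, loader_context):
    # loader_context['response'] is available because the response is
    # passed to ItemLoader when the loader is created in the spider.
    return loader_context['response'].urljoin(url)


class Huis(scrapy.Item):
    url = scrapy.Field(
        input_processor=MapCompose(absolute_url)
    )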