I have been watching videos and reading articles about how Scrapy works with Python and how to insert the scraped data into MongoDB.
Two questions came up that I either am not googling with the right keywords for or simply couldn't find an answer to.
Let me take this tutorial site, https://blog.scrapinghub.com, as an example for scraping blog posts.
I know we can get things like the title, author, and date. But what if I also want the content, which requires clicking "read more" to go to another URL? How can that be done?
I would then like the content to end up in the same dict as the title, author, and date; or alternatively, the title, author, and date could live in one collection and the content in another, as long as both documents stay related to the same post.
I am a bit lost here, so any suggestions or advice on how to approach this would be appreciated.
Thanks in advance for any help and suggestions.
In the situation you described, you scrape what you can from the main page, then yield a new Request to the "read more" page and send the data you have already scraped along with that Request. When the new request's callback is invoked, all the data scraped on the previous page will be available there.
The recommended way to send the data along with the request is cb_kwargs. You will often see people and tutorials using the meta param instead, since cb_kwargs only became available in Scrapy 1.7+ (a meta-based variant is sketched after the example).
Here is an example to illustrate:
from scrapy import Spider, Request

class MySpider(Spider):

    def parse(self, response):
        title = response.xpath('//div[@id="title"]/text()').get()
        author = response.xpath('//div[@id="author"]/text()').get()
        scraped_data = {'title': title, 'author': author}
        read_more_url = response.xpath('//div[@id="read-more"]/@href').get()
        yield Request(
            url=read_more_url,
            callback=self.parse_read_more,
            cb_kwargs={'main_page_data': scraped_data}
        )

    def parse_read_more(self, response, main_page_data):
        # The data from the main page is received as a parameter of this method.
        content = response.xpath('//article[@id="content"]/text()').get()
        yield {
            'title': main_page_data['title'],
            'author': main_page_data['author'],
            'content': content
        }
Notice that the key in cb_kwargs must match the parameter name in the callback method.
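If you are on a Scrapy version older than 1.7 where cb_kwargs is not available, the same hand-off works with meta. A minimal sketch, reusing the same hypothetical selectors as above and using extract_first() since older versions do not have get():
from scrapy import Spider, Request

class MyLegacySpider(Spider):
    # Same flow as above, but for Scrapy < 1.7 the data rides in request.meta.
    name = "my_legacy_spider"

    def parse(self, response):
        scraped_data = {
            'title': response.xpath('//div[@id="title"]/text()').extract_first(),
            'author': response.xpath('//div[@id="author"]/text()').extract_first(),
        }
        yield Request(
            url=response.xpath('//div[@id="read-more"]/@href').extract_first(),
            callback=self.parse_read_more,
            meta={'main_page_data': scraped_data},
        )

    def parse_read_more(self, response):
        # With meta, the data is read back from response.meta instead of a callback kwarg.
        main_page_data = response.meta['main_page_data']
        yield {
            'title': main_page_data['title'],
            'author': main_page_data['author'],
            'content': response.xpath('//article[@id="content"]/text()').extract_first(),
        }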
Related
System: Windows 10, Python 2.7.15, Scrapy 1.5.1
Goal: Retrieve text from within html markup for each of the link items on the target website, including those revealed (6 at a time) via the '+ SEE MORE ARCHIVES' button.
Target Website: https://magic.wizards.com/en/content/deck-lists-magic-online-products-game-info
Initial Progress: Python and Scrapy successfully installed. The following code...
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    custom_settings = {
        # specifies exported fields and order
        'FEED_EXPORT_FIELDS': ["href", "eventtype", "eventmonth", "eventdate", "eventyear"],
    }

    def start_requests(self):
        urls = [
            'https://magic.wizards.com/en/content/deck-lists-magic-online-products-game-info',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for event in response.css('div.article-item-extended'):
            yield {
                'href': event.css('a::attr(href)').extract(),
                'eventtype': event.css('h3::text').extract(),
                'eventmonth': event.css('span.month::text').extract(),
                'eventdate': event.css('span.day::text').extract(),
                'eventyear': event.css('span.year::text').extract(),
            }
...successfully produces the following results (when exported with -o to .csv)...
href,eventtype,eventmonth,eventdate,eventyear
/en/articles/archive/mtgo-standings/competitive-standard-constructed-league-2018-08-02,Competitive Standard Constructed League, August ,2, 2018
/en/articles/archive/mtgo-standings/pauper-constructed-league-2018-08-01,Pauper Constructed League, August ,1, 2018
/en/articles/archive/mtgo-standings/competitive-modern-constructed-league-2018-07-31,Competitive Modern Constructed League, July ,31, 2018
/en/articles/archive/mtgo-standings/pauper-challenge-2018-07-30,Pauper Challenge, July ,30, 2018
/en/articles/archive/mtgo-standings/legacy-challenge-2018-07-30,Legacy Challenge, July ,30, 2018
/en/articles/archive/mtgo-standings/competitive-standard-constructed-league-2018-07-30,Competitive Standard Constructed League, July ,30, 2018
However, the spider will not touch any of the info buried behind the Ajax button. I've done a fair amount of Googling and digesting of documentation, example articles, and 'help me' posts. I am under the impression that to get the spider to actually see the Ajax-loaded info, I need to simulate some sort of request: variously, the correct type might be an XHR request, a Scrapy FormRequest, or something else. I am simply too new to web architecture in general to be able to work out the answer.
I hacked together a version of the initial code that sends a FormRequest. It still seems to reach the initial page just fine, but incrementing the only parameter that appears to change (when inspecting the XHR calls sent out while physically clicking the button on the page) has no effect. That code is here...
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    custom_settings = {
        # specifies exported fields and order
        'FEED_EXPORT_FIELDS': ["href", "eventtype", "eventmonth", "eventdate", "eventyear"],
    }

    def start_requests(self):
        for i in range(1, 10):
            yield scrapy.FormRequest(
                url='https://magic.wizards.com/en/content/deck-lists-magic-online-products-game-info',
                formdata={'l': 'en', 'f': '9041', 'search-result-theme': '',
                          'limit': '6', 'fromDate': '', 'toDate': '',
                          'event_format': '0', 'sort': 'DESC', 'word': '',
                          'offset': str(i * 6)},
                callback=self.parse)

    def parse(self, response):
        for event in response.css('div.article-item-extended'):
            yield {
                'href': event.css('a::attr(href)').extract(),
                'eventtype': event.css('h3::text').extract(),
                'eventmonth': event.css('span.month::text').extract(),
                'eventdate': event.css('span.day::text').extract(),
                'eventyear': event.css('span.year::text').extract(),
            }
...and the results are the same as before, except the 6 output lines are repeated, as a block, 9 extra times.
Can anyone help point me to what I am missing? Thank you in advance.
Postscript: I always seem to get heckled out of my chair whenever I seek help for coding problems. If I am doing something wrong, please have mercy on me, I will do whatever I can to correct it.
Scrapy doesn't render dynamic content on its own; you need something else to deal with the JavaScript. Try these:
scrapy + selenium
scrapy + splash
This blog post about scrapy + splash has a good introduction to the topic.
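If you go the scrapy + splash route, the integration is done through the scrapy-splash package. Below is a minimal sketch of the wiring only, assuming Splash is running locally and the scrapy-splash downloader middlewares and SPLASH_URL are configured in settings.py as described in the scrapy-splash README; the spider name and wait time are placeholders, and actually clicking '+ SEE MORE ARCHIVES' would additionally need a Splash Lua script or endpoint that performs the click:
import scrapy
from scrapy_splash import SplashRequest

class DeckListsSpider(scrapy.Spider):
    name = "decklists"

    def start_requests(self):
        url = 'https://magic.wizards.com/en/content/deck-lists-magic-online-products-game-info'
        # Splash renders the page before it reaches the parse callback;
        # 'wait' gives the page's JavaScript some time to run.
        yield SplashRequest(url, self.parse, args={'wait': 2})

    def parse(self, response):
        for event in response.css('div.article-item-extended'):
            yield {'href': event.css('a::attr(href)').extract()}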
I have a list of URLs I want to scrape, all going through the same pipelines. How do I begin? I'm not actually sure where to even start.
The main idea is that my crawl works through a site and its pages, then yields items to parse each page and update a database. What I am now trying to achieve is to also parse the pages of all the existing URLs in the database that were not crawled that day.
I have tried doing this in a pipeline using the close_spider method, but I can't get these URLs to be requested/parsed. As soon as I add a yield, the whole close_spider method is no longer called.
def close_spider(self, spider):
    for item in models.Items.select().where(models.Items.last_scraped_ts < '2016-02-06 10:00:00'):
        print item.url
        yield Request(item.url, callback=spider.parse_product, dont_filter=True)
(Re-reading your thread, I am not sure I am answering your question at all...)
I have done something similar without Scrapy, using the lxml and requests modules instead.
Either list the URLs explicitly:
listeOfUrl = ['url1', 'url2']
or, if the URLs follow a pattern, generate them:
for i in range(0, 10):
    url = urlpattern + str(i)
Then I made a loop that parses each URL sharing the same pattern:
import json

import requests
from lxml import html

listeOfUrl = ['url1', 'url2']
data = {}

for url in listeOfUrl:
    page = requests.get(url)
    tree = html.fromstring(page.content)
    dataYouWantToKeep = tree.xpath('//*[@id="content"]/div/h2/text()[2]')
    data[url] = dataYouWantToKeep

# and at the end you save all the data to JSON
filejson = 'data.json'
with open(filejson, 'w') as outfile:
    json.dump(data, outfile)
You could simply copy and paste the URLs into start_urls; if you haven't overridden start_requests, parse will be the default callback. If it is a long list and you don't want ugly code, you can instead override start_requests, open your file or make a DB call there, and for each item yield a request for that URL with parse as the callback (see the sketch below). This lets you reuse your parse function and your pipelines, and have Scrapy handle concurrency. If you just have a list without that extra infrastructure already existing and the list isn't too long, Sulot's answer is easier.
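A minimal sketch of that second option, assuming a urls.txt file with one URL per line; the file name, spider name, and the item fields are placeholders, and a database query would slot into start_requests the same way:
import scrapy

class UrlListSpider(scrapy.Spider):
    name = "url_list"

    def start_requests(self):
        # Read the URLs from a file (or replace this with a DB query) and
        # feed every one of them through the normal parse/pipeline flow.
        with open('urls.txt') as f:
            for line in f:
                url = line.strip()
                if url:
                    yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # Same parse logic and item pipelines you already use.
        yield {'url': response.url, 'title': response.css('title::text').extract_first()}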
Here is my code:
def parse(self, response):
    selector = Selector(response)
    sites = selector.xpath("//h3[@class='r']/a/@href")
    for index, site in enumerate(sites):
        url = site.extract()
        print url
        yield Request(url=url, callback=self.parsedetail)

def parsedetail(self, response):
    print response.url
    ...
    obj = Store.objects.filter(id=store_obj.id, add__isnull=True)
    if obj:
        obj.update(add=add)
In def parse, Scrapy gets the URLs from Google.
The URL output looks like:
www.test.com
www.hahaha.com
www.apple.com
www.rest.com
But when they are yielded to def parsedetail, the URLs are no longer in order; it may become:
www.rest.com
www.test.com
www.hahaha.com
www.apple.com
Is there any way I can make the yielded URLs reach def parsedetail in order?
I need to crawl www.test.com first, because the data provided by the top URL in the Google search is the most accurate.
If there is no data in it, I will go on to the next URL, and so on, until the empty field is updated (www.hahaha.com, www.apple.com, www.rest.com).
Please guide me, thank you!
By default, the order in which Scrapy requests are scheduled and sent is not defined. But you can control it using the priority keyword argument:
priority (int) – the priority of this request (defaults to 0). The
priority is used by the scheduler to define the order used to process
requests. Requests with a higher priority value will execute earlier.
Negative values are allowed in order to indicate relatively
low-priority.
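For instance, you could give earlier search results a higher priority so they are fetched first. A minimal sketch of the idea, reusing the selector from the question; the spider name and start URL are placeholders, and note that priority only affects scheduling order, it does not make the crawl strictly sequential:
from scrapy import Request, Spider

class GoogleSpider(Spider):
    name = "google_ordered"
    start_urls = ['https://www.google.com/search?q=test']  # placeholder

    def parse(self, response):
        sites = response.xpath("//h3[@class='r']/a/@href")
        for index, site in enumerate(sites):
            # The first result gets the highest priority, so it is scheduled
            # ahead of the later ones.
            yield Request(
                url=site.extract(),
                callback=self.parsedetail,
                priority=len(sites) - index,
            )

    def parsedetail(self, response):
        self.logger.info(response.url)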
You can also make the crawling synchronous by passing the remaining call stack around inside the meta dictionary and only yielding the next request from the callback; see this answer for an example, and the sketch below.
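A minimal sketch of that synchronous pattern, assuming a parsedetail callback like the one in the question; the stop condition, selectors, and field names are placeholders:
from scrapy import Request, Spider

class SequentialSpider(Spider):
    name = "sequential"
    start_urls = ['https://www.google.com/search?q=test']  # placeholder

    def parse(self, response):
        urls = response.xpath("//h3[@class='r']/a/@href").extract()
        if urls:
            # Only request the first URL now; carry the rest in meta so the
            # callback can decide whether to move on to the next one.
            yield Request(urls[0], callback=self.parsedetail,
                          meta={'remaining': urls[1:]})

    def parsedetail(self, response):
        data = response.xpath('//div[@id="content"]/text()').extract_first()
        if data:
            yield {'url': response.url, 'data': data}
        else:
            # No usable data here: fall through to the next URL in the list.
            remaining = response.meta.get('remaining', [])
            if remaining:
                yield Request(remaining[0], callback=self.parsedetail,
                              meta={'remaining': remaining[1:]})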
I'm using scrapy to crawl a multilingual site. For each object, versions in three different languages exist. I'm using the search as a starting point. Unfortunately the search contains URLs in various languages, which causes problems when parsing.
Therefore I'd like to preprocess the URLs before they get sent out. If they contain a specific string, I want to replace that part of the URL.
My spider extends CrawlSpider. I looked at the docs and found the make_requests_from_url(url) method, which led to this attempt:
def make_requests_from_url(self, url):
    """
    Override the original method to make sure only German URLs are
    being used. If French or Italian URLs are detected, they're
    rewritten.
    """
    if '/f/suche' in url:
        self.log('French URL was rewritten: %s' % url)
        url = url.replace('/f/suche/pages/', '/d/suche/seiten/')
    elif '/i/suche' in url:
        self.log('Italian URL was rewritten: %s' % url)
        url = url.replace('/i/suche/pagine/', '/d/suche/seiten/')
    return super(MyMultilingualSpider, self).make_requests_from_url(url)
But that does not work for some reason. What would be the best way to rewrite URLs before requesting them? Maybe via a rule callback?
Probably worth noting an example, since it took me about 30 minutes to figure it out:
rules = [
    Rule(SgmlLinkExtractor(allow=(all_subdomains,)), callback='parse_item', process_links='process_links')
]

def process_links(self, links):
    for link in links:
        link.url = "something_to_prepend%ssomething_to_append" % link.url
    return links
As you already extend CrawlSpider, you can use process_links() to process the URLs extracted by your link extractors (or process_request() if you prefer working at the Request level), as detailed here.
I am trying to scrape this website: http://saintbarnabas.hodesiq.com/joblist.asp?user_id=
I want to get all the RNs in it... I can scrape the data but cannot continue to the next page
because of its JavaScript. I tried reading other questions but I don't get it. This is my code:
class MySpider(CrawlSpider):
    name = "commu"
    allowed_domains = ["saintbarnabas.hodesiq.com"]
    start_urls = ["http://saintbarnabas.hodesiq.com/joblist.asp?user_id="]

    rules = (
        Rule(SgmlLinkExtractor(allow=('\d+'), restrict_xpaths=('*')),
             callback="parse_items", follow=True),
    )
The next button shows as:
Next
This pagination is killing me...
In short, you need to figure out what Move('next') does and reproduce that in your code.
A quick inspection of the site shows that the function code is this:
function Move(strIndicator)
{
    document.frm.move_indicator.value = strIndicator;
    document.frm.submit();
}
And the document.frm is the form with name "frm":
<form name="frm" action="joblist.asp" method="post">
So, basically you need to build a request to perform the POST for that form with the move_indicator value as 'next'. This is easily done by using the FormRequest class (see the docs) like:
return FormRequest.from_response(response, formname="frm",
                                 formdata={'move_indicator': 'next'})
This technique works in most cases. The difficult part is figuring out what the JavaScript code does; sometimes it is obfuscated and performs overly complex steps just to avoid being scraped.
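Put together in a spider callback, the pagination loop might look roughly like the sketch below. This is only an outline under assumptions: the item extraction lives in a hypothetical extract_jobs() helper with placeholder selectors, and a real spider would add a stop condition once the last page is reached:
from scrapy import Spider
from scrapy.http import FormRequest

class JoblistSpider(Spider):
    name = "commu"
    start_urls = ["http://saintbarnabas.hodesiq.com/joblist.asp?user_id="]

    def parse(self, response):
        # Yield the items found on the current page (hypothetical helper below).
        for item in self.extract_jobs(response):
            yield item

        # Reproduce Move('next'): POST the "frm" form with move_indicator=next.
        # A real spider would add a stop condition for the last page.
        yield FormRequest.from_response(
            response,
            formname="frm",
            formdata={'move_indicator': 'next'},
            callback=self.parse,
        )

    def extract_jobs(self, response):
        # Placeholder: replace with the real row selectors for the job list.
        for row in response.css('table tr'):
            yield {'text': row.css('::text').extract()}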