Rewrite scrapy URLs before sending the request - python

I'm using scrapy to crawl a multilingual site. For each object, versions in three different languages exist. I'm using the search as a starting point. Unfortunately the search contains URLs in various languages, which causes problems when parsing.
Therefore I'd like to preprocess the URLs before they get sent out. If they contain a specific string, I want to replace that part of the URL.
My spider extends the CrawlSpider. I looked at the docs and found the make_requests_from_url(url) method, which led to this attempt:
def make_requests_from_url(self, url):
    """
    Override the original function to make sure only German URLs are
    being used. If French or Italian URLs are detected, they're
    rewritten.
    """
    if '/f/suche' in url:
        self.log('French URL was rewritten: %s' % url)
        url = url.replace('/f/suche/pages/', '/d/suche/seiten/')
    elif '/i/suche' in url:
        self.log('Italian URL was rewritten: %s' % url)
        url = url.replace('/i/suche/pagine/', '/d/suche/seiten/')
    return super(MyMultilingualSpider, self).make_requests_from_url(url)
But that does not work for some reason. What would be the best way to rewrite URLs before requesting them? Maybe via a rule callback?

Probably worth noting an example, since it took me about 30 minutes to figure it out:
rules = [
    Rule(SgmlLinkExtractor(allow=(all_subdomains,)), callback='parse_item', process_links='process_links')
]

def process_links(self, links):
    for link in links:
        link.url = "something_to_prepend%ssomething_to_append" % link.url
    return links

As you already extend CrawlSpider, you can use process_links() to process the URLs extracted by your link extractors (or process_request() if you prefer working at the Request level), as detailed here.
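For the concrete case in the question, here is a minimal sketch of what that could look like. It uses the modern scrapy.linkextractors.LinkExtractor (on the older Scrapy versions from the question, SgmlLinkExtractor plays the same role); the spider name, rule settings, and parse_item body are placeholders.

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MyMultilingualSpider(CrawlSpider):
    name = 'multilingual'

    rules = [
        Rule(LinkExtractor(), callback='parse_item', process_links='rewrite_language'),
    ]

    def rewrite_language(self, links):
        # Rewrite French and Italian search URLs to their German equivalents
        # before the requests are created.
        for link in links:
            if '/f/suche' in link.url:
                link.url = link.url.replace('/f/suche/pages/', '/d/suche/seiten/')
            elif '/i/suche' in link.url:
                link.url = link.url.replace('/i/suche/pagine/', '/d/suche/seiten/')
        return links

    def parse_item(self, response):
        pass  # normal item parsing would go here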

Related

scrapy internal links + pipeline and mongodb collection relationships

I am watching videos and reading articles about how Scrapy works with Python and inserting into MongoDB.
Two questions popped up for which I am either not googling with the correct keywords or simply couldn't find the answer.
Anyway, let me take the tutorial site https://blog.scrapinghub.com as an example for scraping blog posts.
I know we can get things like the title, author, and date. But what if I want the content too? I need to click "read more" to go to another URL and get the content from there. How can this be done?
I would then like the content to either go in the same dict as the title, author, and date, or maybe the title, author, and date can be in one collection and the content in another collection, with the two documents for the same post related to each other.
I am kind of lost thinking about this; can someone give me suggestions or advice for this kind of setup?
Thanks in advance for any help and suggestions.
In the situation you described, you scrape the data from the main page, yield a new Request to the "read more" page, and send the data you already scraped along with the Request. When the new request's callback is invoked, all the data scraped on the previous page will be available.
The recommended way to send the data with the request is to use cb_kwargs. Quite often you may find people/tutorials using the meta param, as cb_kwargs only became available in Scrapy v1.7+.
Here is an example to illustrate:
from scrapy import Request, Spider

class MySpider(Spider):
    def parse(self, response):
        title = response.xpath('//div[@id="title"]/text()').get()
        author = response.xpath('//div[@id="author"]/text()').get()
        scraped_data = {'title': title, 'author': author}
        read_more_url = response.xpath('//div[@id="read-more"]/@href').get()
        yield Request(
            url=read_more_url,
            callback=self.parse_read_more,
            cb_kwargs={'main_page_data': scraped_data}
        )

    def parse_read_more(self, response, main_page_data):
        # The data from the main page is received as a parameter of this method.
        content = response.xpath('//article[@id="content"]/text()').get()
        yield {
            'title': main_page_data['title'],
            'author': main_page_data['author'],
            'content': content
        }
Notice that the key in the cb_kwargs must be the same as the param name in the callback function.
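For completeness, here is a rough sketch of the older meta-based equivalent for Scrapy versions before 1.7, using the same hypothetical selectors as the example above and extract_first(), the older spelling of get().

from scrapy import Request, Spider

class MyOldSpider(Spider):
    name = 'my_old_spider'

    def parse(self, response):
        scraped_data = {
            'title': response.xpath('//div[@id="title"]/text()').extract_first(),
            'author': response.xpath('//div[@id="author"]/text()').extract_first(),
        }
        read_more_url = response.xpath('//div[@id="read-more"]/@href').extract_first()
        # On Scrapy < 1.7 the data travels in request.meta instead of cb_kwargs.
        yield Request(
            url=read_more_url,
            callback=self.parse_read_more,
            meta={'main_page_data': scraped_data},
        )

    def parse_read_more(self, response):
        main_page_data = response.meta['main_page_data']
        yield {
            'title': main_page_data['title'],
            'author': main_page_data['author'],
            'content': response.xpath('//article[@id="content"]/text()').extract_first(),
        }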

Scrapy - parse a url without crawling

I have a list of urls I want to scrape and follow all the same pipelines. How do I begin this? I'm not actually sure where to even start.
The main idea is that my crawl works through a site and its pages, then yields items to parse each page and update a database. What I am now trying to achieve is to parse the pages of all the existing URLs in the database which were not crawled that day.
I have tried doing this in a pipeline using the close_spider method, but can't get these URLs through Request/parse. As soon as I yield, the whole close_spider method is no longer called.
def close_spider(self, spider):
    for item in models.Items.select().where(models.Items.last_scraped_ts < '2016-02-06 10:00:00'):
        print item.url
        yield Request(item.url, callback=spider.parse_product, dont_filter=True)
(Re-reading your thread, I am not sure I am answering your question at all...)
I have done something similar without Scrapy, using the lxml and requests modules.
The URLs:
listeOfurl = ['url1', 'url2']
Or, if the URLs have a pattern, generate them:
for i in range(0, 10):
    url = urlpattern + str(i)
Then I made a loop that parses each URL, which all share the same pattern:
import json
from lxml import html
import requests

listeOfurl = ['url1', 'url2']
data = {}
for url in listeOfurl:
    page = requests.get(url)
    tree = html.fromstring(page.content)
    DataYouWantToKeep = tree.xpath('//*[@id="content"]/div/h2/text()[2]')
    data[url] = DataYouWantToKeep

# and at the end you save all the data as JSON
filejson = 'data.json'  # output file name (placeholder)
with open(filejson, 'w') as outfile:
    json.dump(data, outfile)
You could simply copy and paste the URLs into start_urls; if you haven't overridden start_requests, parse will be the default callback. If it is a long list and you don't want ugly code, you can override start_requests, open your file or do a DB call, and for each item yield a request for that URL with a callback to parse. This lets you use your parse function and your pipelines, as well as handle concurrency through Scrapy. If you just have a list without that extra infrastructure already in place and the list isn't too long, Sulot's answer is easier.
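Here is a rough sketch of that start_requests approach, reusing the models module and parse_product callback from the question; the cutoff timestamp is just the placeholder value from the question, and the spider name is made up.

import scrapy
import models  # the peewee-style models already used in the question's pipeline

class RescrapeSpider(scrapy.Spider):
    name = 'rescrape'

    def start_requests(self):
        # Select every item that was not scraped recently and re-parse its page.
        cutoff = '2016-02-06 10:00:00'  # placeholder, same as in the question
        for item in models.Items.select().where(models.Items.last_scraped_ts < cutoff):
            yield scrapy.Request(item.url, callback=self.parse_product, dont_filter=True)

    def parse_product(self, response):
        # Existing parsing logic and pipelines run as usual for each response.
        pass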

Checking ALL links within links from a source HTML, Python

My code searches a link passed at the command prompt, gets the HTML code for the webpage at that link, searches the HTML code for links on the webpage, and then repeats these steps for the links found. I hope that is clear.
It should print out any links that cause errors.
Some more needed info:
The max visits it can do is 100.
If a website has an error, a None value is returned.
Python3 is what I am using
eg:
s = readwebpage(url)... # This line of code gets the HTML code for the link(url) passed in its argument.... if the link has an error, s = None.
The HTML code for that website has links that end in p2.html, p3.html, p4.html, and p5.html on its webpage. My code reads all of these, but it does not visit these links individually to search for more links. If it did this, it should search through these links and find a link that ends in p10.html, and then it should report that the link ending with p10.html has errors. Obviously it doesn't do that at the moment, and it's giving me a hard time.
My code:
url = args.url[0]
url_list = [url]
checkedURLs = []
AmountVisited = 0

while (url_list and AmountVisited < maxhits):
    url = url_list.pop()
    s = readwebpage(url)
    print("testing url: http", url)  # Print the url being tested, this code is here only for testing..
    AmountVisited = AmountVisited + 1
    if s == None:
        print("* bad reference to http", url)
    else:
        urls_list = re.findall(r'href="http([\s:]?[^\'" >]+)', s)  # Creates a list of all links in HTML code starting with...
        while urls_list:  # ... http or https
            insert = urls_list.pop()
            while(insert in checkedURLs and urls_list):
                insert = urls_list.pop()
            url_list.append(insert)
            checkedURLs = insert
Please help :)
Here is the code you wanted. However, please, stop using regexes for parsing HTML. BeautifulSoup is the way to go for that.
import re
from urllib import urlopen

def readwebpage(url):
    print "testing ", url
    return urlopen(url).read()

url = 'http://xrisk.esy.es'  # put starting url here
yet_to_visit = [url]
visited_urls = []
AmountVisited = 0
maxhits = 10

while (yet_to_visit and AmountVisited < maxhits):
    print yet_to_visit
    current = yet_to_visit.pop()
    AmountVisited = AmountVisited + 1
    html = readwebpage(current)
    if html == None:
        print "* bad reference to http", current
    else:
        r = re.compile('(?<=href=").*?(?=")')
        links = re.findall(r, html)  # Creates a list of everything inside href="..." quotes
        for u in links:
            if u in visited_urls:
                continue
            elif u.find('http') != -1:
                yet_to_visit.append(u)
        print links
    visited_urls.append(current)
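For reference, here is a rough Python 3 sketch of the same loop using BeautifulSoup instead of a regex, as recommended above. It assumes beautifulsoup4 is installed, and the starting URL is the same placeholder as in the code above.

from urllib.request import urlopen
from urllib.error import URLError
from bs4 import BeautifulSoup

def readwebpage(url):
    try:
        return urlopen(url).read()
    except URLError:
        return None  # mirror the question's "None on error" behaviour

url = 'http://xrisk.esy.es'  # put starting url here
yet_to_visit = [url]
visited_urls = []
maxhits = 10

while yet_to_visit and len(visited_urls) < maxhits:
    current = yet_to_visit.pop()
    visited_urls.append(current)
    html = readwebpage(current)
    if html is None:
        print("* bad reference to", current)
        continue
    soup = BeautifulSoup(html, 'html.parser')
    for a in soup.find_all('a', href=True):
        link = a['href']
        if link.startswith('http') and link not in visited_urls:
            yet_to_visit.append(link)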
Not Python, but since you mentioned you aren't strictly tied to regex, I think you might find some use in wget for this.
wget --spider -o C:\wget.log -e robots=off -w 1 -r -l 10 http://www.stackoverflow.com
Broken down:
--spider: When invoked with this option, Wget will behave as a Web spider, which means that it will not download the pages, just check that they are there.
-o C:\wget.log: Log all messages to C:\wget.log.
-e robots=off: Ignore robots.txt
-w 1: set a wait time of 1 second
-r: set recursive search on
-l 10: sets the recursive depth to 10, meaning wget will only go as deep as 10 levels in, this may need to change depending on your max requests
http://www.stackoverflow.com: the URL you want to start with
Once complete, you can review the wget.log entries to determine which links had errors by searching for something like HTTP status codes 404, etc.
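If you want to automate that review, a small Python sketch along these lines could pull out the non-2xx responses. The exact log wording varies between wget versions, so treat the pattern as an assumption.

import re

# Scan the wget log produced by "-o C:\wget.log" for non-2xx HTTP responses.
with open(r'C:\wget.log') as log:
    for line in log:
        match = re.search(r'awaiting response\.\.\. (\d{3})', line)
        if match and not match.group(1).startswith('2'):
            print(line.strip())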
I suspect your regex is part of your problem. Right now, you have http outside your capture group, and [\s:] matches "some sort of whitespace (ie \s) or :"
I'd change the regex to: urls_list = re.findall(r'href="(.*)"',s). Also known as "match anything in quotes, after href=". If you absolutely need to ensure the http[s]://, use r'href="(https?://.*)"' (s? => one or zero s)
EDIT: And with an actually working regex, using a non-greedy match: r'href=(?P<q>[\'"])(https?://.*?)(?P=q)'
(Also, uh, while it's not technically necessary in your case because re caches, I think it's good practice to get into the habit of using re.compile.)
I think it's awfully nice that all of your URLs are full URLs. Do you have to deal with relative URLs at all?
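If relative URLs do show up, here is a small sketch of how they could be resolved against the page URL before being queued, using the corrected non-greedy pattern from above; it assumes the HTML has already been decoded to a str.

import re
from urllib.parse import urljoin

href_re = re.compile(r'href=(?P<q>[\'"])(.*?)(?P=q)')

def extract_links(page_url, html):
    links = []
    for match in href_re.finditer(html):
        # urljoin turns relative hrefs into absolute URLs and leaves
        # absolute ones untouched.
        links.append(urljoin(page_url, match.group(2)))
    return links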

How to yield URLs in order for Scrapy to crawl

Here is my code:
def parse(self, response):
    selector = Selector(response)
    sites = selector.xpath("//h3[@class='r']/a/@href")
    for index, site in enumerate(sites):
        url = result.group(1)
        print url
        yield Request(url=site.extract(), callback=self.parsedetail)

def parsedetail(self, response):
    print response.url
    ...
    obj = Store.objects.filter(id=store_obj.id, add__isnull=True)
    if obj:
        obj.update(add=add)
In def parse, Scrapy gets the URLs from Google.
The URLs are output in order, like:
www.test.com
www.hahaha.com
www.apple.com
www.rest.com
But when they are yielded to def parsedetail, the URLs are no longer in order; it may become:
www.rest.com
www.test.com
www.hahaha.com
www.apple.com
Is there any way I can yield the URLs in order to def parsedetail?
I need to crawl www.test.com first, because the data the top URL in the Google search provides is more accurate.
If there is no data in it, I go on to the next URL, and so on, until the empty field is updated (www.hahaha.com, www.apple.com, www.rest.com).
Please guide me, thank you!
By default, the order in which Scrapy requests are scheduled and sent is not defined. But you can control it using the priority keyword argument:
priority (int) – the priority of this request (defaults to 0). The priority is used by the scheduler to define the order used to process requests. Requests with a higher priority value will execute earlier. Negative values are allowed in order to indicate relatively low priority.
You can also make the crawling synchronous by passing the call stack around inside the meta dictionary; see this answer for an example.
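Here is a minimal sketch of how priority could be applied to the parse method from the question (higher value = scheduled earlier); the surrounding spider class and the rest of the question's code are assumed.

from scrapy import Request, Selector

def parse(self, response):
    selector = Selector(response)
    sites = selector.xpath("//h3[@class='r']/a/@href")
    for index, site in enumerate(sites):
        # Earlier search results get a larger priority value,
        # so the scheduler sends them first.
        yield Request(
            url=site.extract(),
            callback=self.parsedetail,
            priority=len(sites) - index,
        )

Note that with concurrent requests the responses can still finish out of order, so for strictly sequential crawling the call-stack approach mentioned above is the safer option.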

How do I know when I'm done crawling a domain?

I've written a function in Python that gets all the links on a page.
Then, I run that function for all of the links that first function returned.
My question is, if I were to keep on doing this using CNN as my starting point, how would I know when I had crawled all (or most) of CNN's webpages?
Here's the code for the crawler.
from mechanize import Browser  # Browser() is assumed to come from mechanize

base_url = "http://www.cnn.com"
title = "cnn"
my_file = open(title + ".txt", "w")

def crawl(site):
    seed_url = site
    br = Browser()
    br.set_handle_robots(False)
    br.set_handle_equiv(False)
    br.open(seed_url)
    link_bank = []
    for link in br.links():
        if link.url[0:4] == "http":
            link_bank.append(link.url)
        if link.url[0] == "/":
            url = link.url
            if url.find(".com") == -1:
                if url.find(".org") == -1:
                    link_bank.append(base_url + link.url)
                else:
                    link_bank.append(link.url)
            else:
                link_bank.append(link.url)
        if link.url[0] == "#":
            link_bank.append(base_url + link.url)
    link_bank = list(set(link_bank))
    for link in link_bank:
        my_file.write(link + "\n")
    return link_bank

my_file.close()
I did not look into your code specifically, but you should look up how to implement a breadth-first search, and additionally store already visited URLs in a set. If you find a new URL on the currently visited page, append it to the list of URLs to visit if it isn't in the set already.
You might need to ignore the query string (everything after the question mark in a URL).
The first thing that comes to mind is to keep a set of visited links. Each time you request a link, add it to the set. Before requesting a link, check that it is not already in the set.
Another point is that you are actually reinventing the wheel here; the Scrapy web-scraping framework has a link-extraction mechanism built in, and it's worth using.
Hope that helps.
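Here is a rough sketch of that breadth-first approach with a visited set and query-string stripping. get_links() stands in for the link-gathering function the question already has, and the domain check is an assumption needed for the crawl to terminate.

from collections import deque
from urllib.parse import urlsplit, urlunsplit

def strip_query(url):
    # Drop the query string and fragment, as suggested above.
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, '', ''))

def crawl_domain(seed, domain='cnn.com'):
    visited = set()
    queue = deque([strip_query(seed)])
    while queue:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        for link in get_links(url):  # the poster's existing link function
            link = strip_query(link)
            # Stay on the target domain, otherwise the crawl never terminates.
            if domain in urlsplit(link).netloc and link not in visited:
                queue.append(link)
    return visited  # an empty queue means every reachable page was visited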
