I am trying to scrape this website: http://saintbarnabas.hodesiq.com/joblist.asp?user_id=
and I want to get all the RNs in it. I can scrape the data but cannot continue to the next page
because of its JavaScript. I tried reading other questions but I don't get it. This is my code:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class MySpider(CrawlSpider):
    name = "commu"
    allowed_domains = ["saintbarnabas.hodesiq.com"]
    start_urls = ["http://saintbarnabas.hodesiq.com/joblist.asp?user_id="]

    rules = (
        Rule(SgmlLinkExtractor(allow=('\d+',), restrict_xpaths=('*',)),
             callback="parse_items", follow=True),
    )
The next button shows as
<a href="javascript:Move('next');">Next</a>
This pagination is killing me...
In short, you need to figure out what Move('next') does and reproduce that in your code.
A quick inspection of the site shows that the function code is this:
function Move(strIndicator)
{
    document.frm.move_indicator.value = strIndicator;
    document.frm.submit();
}
And document.frm is the form with the name "frm":
<form name="frm" action="joblist.asp" method="post">
So, basically, you need to build a request that performs the POST for that form with the move_indicator value set to 'next'. This is easily done with the FormRequest class (see the docs), like:
return FormRequest.from_response(response, formname="frm",
                                 formdata={'move_indicator': 'next'})
This technique works in most cases. The difficult part is figuring out what the JavaScript code does; sometimes it is obfuscated and performs overly complex operations just to avoid being scraped.
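To make that concrete, here is a minimal sketch of how the form submission could drive pagination from a spider callback. The row XPath and the stop condition are placeholders (assumptions, not taken from the actual page):

import scrapy


class CommuSpider(scrapy.Spider):
    name = "commu"
    allowed_domains = ["saintbarnabas.hodesiq.com"]
    start_urls = ["http://saintbarnabas.hodesiq.com/joblist.asp?user_id="]

    def parse(self, response):
        # Scrape the job rows on the current page (placeholder XPath;
        # adjust it to the real job-list markup).
        for title in response.xpath('//tr/td/a/text()').extract():
            yield {'job_title': title}

        # Reproduce Move('next'): POST the "frm" form with move_indicator
        # set to 'next' and parse the next page the same way. In practice
        # you would stop when there is no next page, e.g. when a page
        # yields no job rows.
        yield scrapy.FormRequest.from_response(
            response,
            formname="frm",
            formdata={'move_indicator': 'next'},
            callback=self.parse,
        )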
I am watching videos and reading articles about how Scrapy works with Python and inserts into MongoDB.
Two questions popped up; either I am not googling with the correct keywords or I just couldn't find the answer.
Anyway, let me use this tutorial site, https://blog.scrapinghub.com, as an example for scraping blog posts.
I know we can get things like the title, author, and date. But what if I want the content too, which requires clicking "read more" to go to another URL before I can get it? How can this be done?
Then I either want the content in the same dict as title, author, and date, or maybe title, author, and date can be in one collection and the content in another collection, as long as the same post is related across both.
I am kinda lost when I think about this; can someone give me suggestions/advice for this kind of idea?
Thanks in advance for any help and suggestions.
In the situation you described, you will scrape the content from the main page, yield a new Request to the "read more" page, and send the data you already scraped along with the Request. When the new request's callback method is invoked, all the data scraped on the previous page will be available.
The recommended way to send data with the request is to use cb_kwargs. Quite often you may find people/tutorials using the meta param, as cb_kwargs only became available in Scrapy v1.7+.
Here is an example to illustrate:
from scrapy import Spider, Request


class MySpider(Spider):
    name = "myspider"

    def parse(self, response):
        title = response.xpath('//div[@id="title"]/text()').get()
        author = response.xpath('//div[@id="author"]/text()').get()
        scraped_data = {'title': title, 'author': author}
        read_more_url = response.xpath('//div[@id="read-more"]/@href').get()
        yield Request(
            url=read_more_url,
            callback=self.parse_read_more,
            cb_kwargs={'main_page_data': scraped_data}
        )

    def parse_read_more(self, response, main_page_data):
        # The data from the main page is received as a param in this method.
        content = response.xpath('//article[@id="content"]/text()').get()
        yield {
            'title': main_page_data['title'],
            'author': main_page_data['author'],
            'content': content,
        }
Notice that the key in cb_kwargs must be the same as the parameter name in the callback function.
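For completeness, on Scrapy versions older than 1.7 the same flow can be written with the meta param mentioned above. Here is a sketch under that assumption, using extract_first(), the older spelling of get():

from scrapy import Spider, Request


class MyLegacySpider(Spider):
    name = "myspider_meta"

    def parse(self, response):
        scraped_data = {
            'title': response.xpath('//div[@id="title"]/text()').extract_first(),
            'author': response.xpath('//div[@id="author"]/text()').extract_first(),
        }
        read_more_url = response.xpath('//div[@id="read-more"]/@href').extract_first()
        yield Request(
            url=read_more_url,
            callback=self.parse_read_more,
            meta={'main_page_data': scraped_data},
        )

    def parse_read_more(self, response):
        # With meta, the carried data is read back from response.meta
        # instead of arriving as a keyword argument.
        main_page_data = response.meta['main_page_data']
        yield {
            'title': main_page_data['title'],
            'author': main_page_data['author'],
            'content': response.xpath('//article[@id="content"]/text()').extract_first(),
        }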
I am learning the website module of Odoo 9 and want to know the format of route expressions. I am aware of regex but could not understand this completely. Take a look at this:
class WebsiteBlog(http.Controller):
    _blog_post_per_page = 20
    _post_comment_per_page = 10

    # ...

    @http.route([
        '/blog/<model("blog.blog"):blog>',
        '/blog/<model("blog.blog"):blog>/page/<int:page>',
        '/blog/<model("blog.blog"):blog>/tag/<string:tag>',
        '/blog/<model("blog.blog"):blog>/tag/<string:tag>/page/<int:page>',
    ], type='http', auth="public", website=True)
    def blog(self, blog=None, tag=None, page=1, **opt):
        print 123
        # etc
You can find this code on Git: Website Blog Module
I want to understand these expressions. I can understand that this function will be executed if any one of these four URLs is requested by the browser, and that blog, tag, and page are variables, but what is the meaning of model("blog.blog") here?
It means that the value you pass in the URL identifies a record of the model blog.blog.
For example, if your URL is
localhost:8069/blog/3
then in the controller you will get the record of the model blog.blog that has id = 3.
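To illustrate, here is a hedged sketch of a minimal Odoo 9 controller using the same converter; the class name, template name, and rendered values are assumptions for the example:

from openerp import http
from openerp.http import request


class BlogDemo(http.Controller):

    @http.route(['/blog/<model("blog.blog"):blog>'],
                type='http', auth="public", website=True)
    def blog(self, blog=None, **opt):
        # The model(...) converter has already turned the /blog/3 path
        # segment into the blog.blog record with id == 3, so `blog` is
        # a browse record here, not a bare integer.
        return request.website.render('my_module.blog_page', {
            'blog_name': blog.name,
        })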
I have a set of start urls, like below:
start_urls = ["http://www.example.com", "http://www.example.com/ca", "http://www.example.com/ap"]
Now I have written code for extracting all the URLs occurring inside each of the start_urls, like below:
rules = (
    Rule(
        LinkExtractor(
            allow_domains=('example.com',),
            attrs=('href',),
            tags=('a',),
            deny=(),
            deny_extensions=(),
            unique=True,
        ),
        callback='parseHtml',
        follow=True,
    ),
)
In the parseHtml function, I am parsing the content of the links.
Now, across the above sites, some links occur in common. For those common links I need some way to identify which start_url they came from.
How can I achieve this using Scrapy?
One option is to not use CrawlSpider, and instead pass the start_url information yourself from start_requests through all your callbacks.
Alternatively, you could create a Spider Middleware that handles start_requests to do the same without putting the logic directly in the spider; you can find a similar behaviour here.
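As a sketch of the first option (the names and XPaths are illustrative assumptions), a plain Spider can tag each request with its originating start URL and carry that tag through meta:

import scrapy


class OriginSpider(scrapy.Spider):
    name = "origin"
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com",
                  "http://www.example.com/ca",
                  "http://www.example.com/ap"]

    def start_requests(self):
        for url in self.start_urls:
            # Remember which start URL this crawl branch began from.
            yield scrapy.Request(url, callback=self.parse,
                                 meta={'origin': url})

    def parse(self, response):
        origin = response.meta['origin']
        # Pass the origin along to every followed link, so a link that is
        # common to several sections can still be told apart by the start
        # URL it was reached from.
        for href in response.xpath('//a/@href').extract():
            yield scrapy.Request(response.urljoin(href),
                                 callback=self.parse_html,
                                 meta={'origin': origin})

    def parse_html(self, response):
        yield {'url': response.url, 'origin': response.meta['origin']}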
I have nearly 2500 unique links, and I want to run BeautifulSoup over each of the 2500 pages to gather the text captured in their paragraphs. I could create a variable for each link, but having 2500 is obviously not the most efficient course of action. The links are contained in a list like the following:
linkslist = ["http://www.website.com/category/item1","http://www.website.com/category/item2","http://www.website.com/category/item3", ...]
Should I just write a for loop like the following?
for link in linkslist:
    opened_url = urllib2.urlopen(link).read()
    soup = BeautifulSoup(opened_url)
    ...
I'm looking for any constructive criticism. Thanks!
This is a good use case for Scrapy - a popular web-scraping framework based on Twisted:
Scrapy is written with Twisted, a popular event-driven networking
framework for Python. Thus, it’s implemented using a non-blocking (aka
asynchronous) code for concurrency.
Set the start_urls property of your spider and parse the page inside the parse() callback:
import scrapy


class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = ["http://www.website.com/category/item1",
                  "http://www.website.com/category/item2",
                  "http://www.website.com/category/item3",
                  ...]
    allowed_domains = ["website.com"]

    def parse(self, response):
        print response.xpath("//title/text()").extract()
How about writing a function that would treat each URL separately?
def processURL(url):
    # Your code here
    pass

map(processURL, linkslist)
This will run your function on each URL in the list. If you want to speed things up, this is easy to run in parallel:
from multiprocessing import Pool

list(Pool(processes=10).map(processURL, linkslist))
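To make this concrete, here is a sketch of what processURL might look like for the paragraph-gathering task above; it sticks to the urllib2/BeautifulSoup stack from the question, and the URLs are placeholders:

import urllib2
from bs4 import BeautifulSoup
from multiprocessing import Pool

linkslist = ["http://www.website.com/category/item1",
             "http://www.website.com/category/item2"]

def processURL(url):
    # Fetch the page and return the text of every paragraph on it.
    opened_url = urllib2.urlopen(url).read()
    soup = BeautifulSoup(opened_url)
    return [p.get_text() for p in soup.find_all("p")]

if __name__ == "__main__":
    # Ten worker processes fetch and parse pages concurrently.
    results = Pool(processes=10).map(processURL, linkslist)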
I'm using scrapy to crawl a multilingual site. For each object, versions in three different languages exist. I'm using the search as a starting point. Unfortunately the search contains URLs in various languages, which causes problems when parsing.
Therefore I'd like to preprocess the URLs before they get sent out. If they contain a specific string, I want to replace that part of the URL.
My spider extends the CrawlSpider. I looked at the docs and found the make_requests_from_url(url) method, which led to this attempt:
def make_requests_from_url(self, url):
    """
    Override the original function to make sure only German URLs are
    being used. If French or Italian URLs are detected, they're
    rewritten.
    """
    if '/f/suche' in url:
        self.log('French URL was rewritten: %s' % url)
        url = url.replace('/f/suche/pages/', '/d/suche/seiten/')
    elif '/i/suche' in url:
        self.log('Italian URL was rewritten: %s' % url)
        url = url.replace('/i/suche/pagine/', '/d/suche/seiten/')
    return super(MyMultilingualSpider, self).make_requests_from_url(url)
But that does not work for some reason. What would be the best way to rewrite URLs before requesting them? Maybe via a rule callback?
Probably worth noting an example, since it took me about 30 minutes to figure it out:
rules = [
    Rule(SgmlLinkExtractor(allow=(all_subdomains,)),
         callback='parse_item', process_links='process_links')
]

def process_links(self, links):
    for link in links:
        link.url = "something_to_prepend%ssomething_to_append" % link.url
    return links
As you already extend CrawlSpider, you can use process_links() to process the URLs extracted by your link extractors (or process_request() if you prefer working at the Request level), as detailed here.
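Applied to the language-rewriting problem from the question, a hedged sketch using process_links could look like this (the rule, allow pattern, and method names are illustrative assumptions):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class MyMultilingualSpider(CrawlSpider):
    name = "multilingual"

    rules = [
        Rule(SgmlLinkExtractor(allow=('/suche/',)),
             callback='parse_item', follow=True,
             process_links='rewrite_to_german'),
    ]

    def rewrite_to_german(self, links):
        # Rewrite French and Italian search URLs to their German
        # equivalents before the requests are scheduled.
        for link in links:
            link.url = link.url.replace('/f/suche/pages/', '/d/suche/seiten/')
            link.url = link.url.replace('/i/suche/pagine/', '/d/suche/seiten/')
        return links

    def parse_item(self, response):
        pass  # parse the German-language page here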