Scraper collecting the content of the first page only - Python

I've written a scraper in Python to scrape movie names from yify torrents. The listing spans around 12 pages. If I run my crawler with a print statement, it gives me all the results from all the pages. However, when I use return instead, it gives me the content from the first page only and does not go on to the next pages to process the rest. As I'm having a hard time understanding the behavior of the return statement, I would appreciate it if somebody could point out where I'm going wrong and suggest a workaround. Thanks in advance.
This is what I'm trying with (the full code):
import requests
from urllib.request import urljoin
from lxml.html import fromstring

main_link = "https://www.yify-torrent.org/search/western/"
# film_storage = []  # I tried like this as well (keeping the list storage outside the function)

def get_links(link):
    root = fromstring(requests.get(link).text)
    film_storage = []
    for item in root.cssselect(".mv"):
        name = item.cssselect("h3 a")[0].text
        film_storage.append(name)
    return film_storage
    next_page = root.cssselect(".pager a:contains('Next')")[0].attrib['href'] if root.cssselect(".pager a:contains('Next')") else ""
    if next_page:
        full_link = urljoin(link, next_page)
        get_links(full_link)

if __name__ == '__main__':
    items = get_links(main_link)
    for item in items:
        print(item)
But when I do it like below, I get all the results (only the relevant portion is pasted):
def get_links(link):
    root = fromstring(requests.get(link).text)
    for item in root.cssselect(".mv"):
        name = item.cssselect("h3 a")[0].text
        print(name)  # using print I get all the results from all the pages
    next_page = root.cssselect(".pager a:contains('Next')")[0].attrib['href'] if root.cssselect(".pager a:contains('Next')") else ""
    if next_page:
        full_link = urljoin(link, next_page)
        get_links(full_link)

Your return statement prematurely terminates your get_links() function, meaning that this part
next_page = root.cssselect(".pager a:contains('Next')")[0].attrib['href'] if root.cssselect(".pager a:contains('Next')") else ""
if next_page:
    full_link = urljoin(link, next_page)
    get_links(full_link)
is never executed.
A quick fix would be to put the return statement at the end of your function, but then you have to make film_storage global (defined outside the get_links() function).
Edit:
Just realized, since you will be making your film_storage global, there is no need for the return statement.
Your code in main would just look like this:
get_links(main_link)
for item in film_storage:
    print(item)
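For reference, a minimal sketch of that quick fix, with film_storage moved to module scope so the return statement (and the early exit it caused) can simply be dropped; selectors and imports are the same as in the question:

film_storage = []  # module-level list collecting names across all pages

def get_links(link):
    root = fromstring(requests.get(link).text)
    for item in root.cssselect(".mv"):
        film_storage.append(item.cssselect("h3 a")[0].text)
    next_page = root.cssselect(".pager a:contains('Next')")
    if next_page:
        # follow the Next link recursively; results accumulate in the global list
        get_links(urljoin(link, next_page[0].attrib['href']))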

Your film_storage results list is local to the function get_links(), which is called recursively for the next page. Even after the recursive call (covering all of the following pages), the initial (entry) call returns results only for the first page.
You'll have to either (1) unwrap the tail recursion into a loop, (2) make the results list global, (3) use a callback (which is effectively what you do with print), or, the best option, (4) turn the get_links function into a generator that yields results from all pages.
Generator version:
def get_links(link):
    root = fromstring(requests.get(link).text)
    for item in root.cssselect(".mv"):
        name = item.cssselect("h3 a")[0].text
        yield name
    next_page = root.cssselect(".pager a:contains('Next')")[0].attrib['href'] if root.cssselect(".pager a:contains('Next')") else ""
    if next_page:
        full_link = urljoin(link, next_page)
        for name in get_links(full_link):
            yield name
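Your main block can then consume the generator directly:

if __name__ == '__main__':
    for item in get_links(main_link):
        print(item)

And if you prefer option (1), here is a minimal sketch of the same function with the tail recursion unwrapped into a loop (same selectors and imports as above); it also avoids Python's recursion limit on very long listings:

def get_links(link):
    while link:
        root = fromstring(requests.get(link).text)
        for item in root.cssselect(".mv"):
            yield item.cssselect("h3 a")[0].text
        next_page = root.cssselect(".pager a:contains('Next')")
        link = urljoin(link, next_page[0].attrib['href']) if next_page else None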

Related

Recursively calling function on certain condition

I have a function that extracts the content from a random website each time using the BeautifulSoup library, so I get random content on every call. I'm successfully able to extract the content, but let's say the output text is 'abc': I want to re-call the function again and again until I get a different output. I added an if condition to make that happen, but somehow it's not working as I thought:
class MyClass:
    def get_comment(self):
        source = requests.get('https://www.example.com/random').text
        soup = BeautifulSoup(source, 'lxml')
        comment = soup.find('div', class_='commentMessage').span.text
        if comment == "abc":
            logging.warning('Executing again....')
            self.get_comment()  # Problem here.... Not executing again
        return comment

mine = MyClass()
mine.get_comment()  # I get 'abc' output
When you call your function recursively you aren't doing anything with the output:
class MyClass:
    def get_comment(self):
        source = requests.get('https://www.example.com/random').text
        soup = BeautifulSoup(source, 'lxml')
        comment = soup.find('div', class_='commentMessage').span.text
        if comment == "abc":
            logging.warning('Executing again....')
            return self.get_comment()  # call the method again AND return the result from that call
        else:
            return comment  # return unchanged

mine = MyClass()
mine.get_comment()
I think this should be more like what you're after.
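One caveat worth noting: if the site keeps returning 'abc' many times in a row, the recursion will eventually hit Python's recursion limit. A loop-based sketch of the same retry logic (same assumed URL and markup as above) avoids that:

class MyClass:
    def get_comment(self):
        while True:
            source = requests.get('https://www.example.com/random').text
            soup = BeautifulSoup(source, 'lxml')
            comment = soup.find('div', class_='commentMessage').span.text
            if comment != "abc":
                return comment
            logging.warning('Executing again....')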

My Scrapy function is never being called. What am I missing?

I have this general save function in my Scrapy spider:
def save_results(self, menu, url):
    inspect_response(response, self)
    res, method = self.crawl_result(url)
    self.item['crawl_result'] = res
    self.item['raw_menu_urls'] = url
    self.item['conversion_method'] = method
    self.item['menu_text'] = menu
    print self.item
    yield self.item
And I call it like this from another function:
def yelp_menu(self, response):
    id = response.meta['id']
    menu = response.xpath('//div[@class="container biz-menu"]//text()').extract()
    menu = self.clean_text(menu)
    self.save_results(response.url, menu)
But it never gets called.
Where am I going wrong?
P.S. I know this is not how Scrapy is supposed to work with items, pipelines and the rest.
The problem is that self.save_results returns a generator. What you need is the following:
for item in self.save_results(response.url, menu):
    yield item
Or, if you're using Python 3.3+, you can use yield from magic:
yield from self.save_results(response.url, menu)
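As a standalone illustration of what both variants do (independent of Scrapy): yield from delegates to a sub-generator and re-yields everything it produces, exactly like the explicit for loop above. A minimal sketch:

def inner():
    yield 1
    yield 2

def outer_loop():
    for x in inner():   # manual delegation
        yield x

def outer_delegate():
    yield from inner()  # equivalent, Python 3.3+

print(list(outer_loop()))      # [1, 2]
print(list(outer_delegate()))  # [1, 2]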

Scrapy Spider only generates one item per loop

Since I added another request at the end of the for loop to test a link, the spider only generates items for the first index of the loop.
def parse_product_page(self, response):
    products = response.xpath('//div[@class="content"]//div[@class="tov-rows"]//div[@class="t-row"]')
    for x, product in enumerate(products):  # ERROR: just gives an item for the first product
        product_loader = VerbraucherweltProdukt()
        product_loader['name'] = product.xpath(
            '//div[@class="t-center"]//div[@class="t-name"]/text()').extract_first()
        request = scrapy.Request(non_ref_link, callback=self.test_link, errback=self.test_link)
        request.meta['item'] = product_loader
        yield request
It all worked before when I just yielded the product item, but since the item is now returned in the callback, I don't know where my problem lies.
The callback is just:
def test_link(self, response):
    item = response.meta['item']
    item['link_fehlerhaft'] = response.status
    yield item
Here is the full code as well, in case the problem is somewhere else:
http://pastebin.com/tgL38zpD
Here's your culprit:
link = product.xpath('//div[@class="t-right"]//a/@href').extract_first()
You're not anchoring your XPath to the product node you already have. To fix it, simply prepend . to your XPath to make the current node the root:
link = product.xpath('.//div[@class="t-right"]//a/@href').extract_first()
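A small self-contained sketch of the difference, using scrapy.Selector with made-up HTML (the class names here are only for illustration):

from scrapy import Selector

sel = Selector(text='''
    <div class="t-row"><div class="t-name">first</div></div>
    <div class="t-row"><div class="t-name">second</div></div>
''')

for row in sel.xpath('//div[@class="t-row"]'):
    # '//' starts from the document root again, so both rows print "first"
    print(row.xpath('//div[@class="t-name"]/text()').extract_first())
    # './/' is relative to the current row node, so each row prints its own name
    print(row.xpath('.//div[@class="t-name"]/text()').extract_first())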

Scrapy: How do I prevent a yield request conditionally on an item value?

I'm parsing a list of URLs, and I want to avoid saving the item produced for a URL depending on the value of one of its fields. My code is something like this:
start_urls = [www.rootpage.com]

def parse(self, response):
    item = CreatedItem()
    url_list = response.xpath('somepath').extract()
    for url in url_list:
        request = scrapy.Request(item['url'], callback=self.parse_article)
        request.meta['item'] = item
        yield request

def parse_article(self, response):
    item = response.meta['item']
    item['parameterA'] = response.xpath('somepath').extract()
    yield item
Now I want that, in case item['parameterA'] meets a certain condition, the item is not yielded (so that no saving for this URL occurs). I tried adding a conditional like:
if item['parameterA'] == 0:
    continue
else:
    yield item
but as expected it does not work, because Scrapy continues the loop even before the request is performed.
From what I understand, you should make the decision inside the parse_article method:
def parse_article(self, response):
    item = response.meta['item']
    item['parameterA'] = response.xpath('somepath').extract_first()
    if item['parameterA'] != "0":
        yield item
Note the use of extract_first() and the quotes around 0.
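If you would rather keep the filtering logic out of the spider, an alternative (not what the answer above does, just a common Scrapy pattern) is to drop such items in an item pipeline; the pipeline name and module path below are hypothetical:

from scrapy.exceptions import DropItem

class FilterZeroParameterPipeline:
    # hypothetical pipeline that discards items whose parameterA is "0"
    def process_item(self, item, spider):
        if item.get('parameterA') == "0":
            raise DropItem("parameterA is 0, skipping item")
        return item

# enable it in settings.py (path and priority are examples):
# ITEM_PIPELINES = {"myproject.pipelines.FilterZeroParameterPipeline": 300}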

Iterate through a URL params template in Scrapy

I have the following URL to begin with: http://somedomain.mytestsite.com/?offset=0. I'd like to loop through this URL by incrementing the offset parameter, let's say by 100 each time. Each time I receive a response I need to check some condition to decide whether I should run the next iteration. For example:
class SomeSpider(BaseSpider):
    name = 'somespider'
    offset = 0
    items = list()

    def start_requests(self):
        return [scrapy.Request("http://somedomain.mytestsite.com/?offset=" + str(self.offset),
                               callback=self.request_iterator)]

    def request_iterator(self, response):
        body = response.body
        # let's say we get json as response data
        data = json.loads(body)
        # check if the page still has data to process
        if data["matches"]:
            self.items.extend(data["matches"])
            self.offset += 100
            return self.start_requests()
        else:
            # process collected data in items list
            return self.do_something_with_items()
This works, but I can't help feeling that something is wrong with this code. Maybe I should use some of Scrapy's rules?
The following things could be improved:
1) Don't keep items as a spider attribute; with bigger inputs you will consume an extremely large amount of memory. Use Python generators instead: with generators you can yield items and requests from one spider callback without any trouble.
2) start_requests is used at spider startup, and there seems to be little need to override it in your code. If you rename your method to parse (the default method name used as the callback for the start requests), the code will be more readable:
# we should process at least one page first, otherwise data["matches"] will be empty
start_urls = ["http://somedomain.mytestsite.com/?offset=1"]

def parse(self, response):
    body = response.body
    # let's say we get json as response data
    data = json.loads(body)
    # check if the page still has data to process
    if data["matches"]:
        for x in data["matches"]:
            yield self.process_your_item(x)
        self.offset += 100
        yield self.next_request()
    else:
        # process collected data
        for x in self.do_something_with_items():
            yield x

def next_request(self):
    return scrapy.Request("http://somedomain.mytestsite.com/?offset=" + str(self.offset))
Probably an even better version of your callback would be:
def parse(self, response):
    body = response.body
    # let's say we get json as response data
    data = json.loads(body)
    # check if the page still has data to process
    if not data["matches"]:
        self.logger.info("processing done")
        return
    for x in data["matches"]:
        yield self.process_your_item(x)
    self.offset += 100
    yield self.next_request()
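Put together, a minimal sketch of the refactored spider under the answer's assumptions (scrapy.Spider instead of the long-deprecated BaseSpider, and process_your_item as a hypothetical method whose real body depends on the JSON structure):

import json
import scrapy

class SomeSpider(scrapy.Spider):
    name = 'somespider'
    offset = 0
    start_urls = ["http://somedomain.mytestsite.com/?offset=0"]

    def parse(self, response):
        data = json.loads(response.body)
        if not data["matches"]:
            self.logger.info("processing done")
            return
        for x in data["matches"]:
            yield self.process_your_item(x)  # hypothetical: turn one match into an item
        self.offset += 100
        yield self.next_request()

    def next_request(self):
        return scrapy.Request("http://somedomain.mytestsite.com/?offset=" + str(self.offset))

    def process_your_item(self, match):
        # hypothetical mapping; the real fields depend on the JSON structure
        return {"match": match}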
