Ignoring requests while scraping two pages - python

I am scraping this website on a daily basis, and I am using DeltaFetch to ignore pages that have already been visited (a lot of them).
The issue I am facing is that for this website I need to first scrape page A, and then scrape page B to retrieve additional information about the item. DeltaFetch works well in ignoring requests to page B, but that also means that every time the scraper runs, it still sends requests to page A regardless of whether that page has been visited before.
This is how my code is structured right now:
# Gathering links from a page, creating an item, and passing it to parse_A
def parse(self, response):
    for href in response.xpath(u'//a[text()="詳細を見る"]/@href').extract():
        item = ItemLoader(item=ItemClass(), response=response)
        yield scrapy.Request(response.urljoin(href),
                             callback=self.parse_A,
                             meta={'item': item.load_item()})

# Parsing elements in page A, and passing the item to parse_B
def parse_A(self, response):
    item = ItemLoader(item=response.meta['item'], response=response)
    item.replace_xpath('age', u"//td[contains(@class,\"age\")]/text()")
    page_B = response.xpath(u'//a/img[@alt="周辺環境"]/../@href').extract_first()
    yield scrapy.Request(response.urljoin(page_B),
                         callback=self.parse_B,
                         meta={'item': item.load_item()})

# Parsing elements in page B, and yielding the item
def parse_B(self, response):
    item = ItemLoader(item=response.meta['item'])
    item.add_value('url_B', response.url)
    yield item.load_item()
Any help would be appreciated on how to make DeltaFetch also skip the request to page A when that page has already been visited.

DeltaFetch only keeps a record of the requests whose responses yield items in its database, which means only those requests will be skipped by default.
However, you can customize the key used to store a record by using the deltafetch_key meta key. If you make this key the same for the requests that call parse_A() as for those created inside parse_A(), you should be able to achieve the effect you want.
Something like this should work (untested):
from scrapy.utils.request import request_fingerprint

# (...)

def parse_A(self, response):
    # (...)
    yield scrapy.Request(
        response.urljoin(page_B),
        callback=self.parse_B,
        meta={
            'item': item.load_item(),
            'deltafetch_key': request_fingerprint(response.request)
        }
    )
Note: the example above effectively replaces the filtering of requests to parse_B() urls with the filtering of requests to parse_A() urls. You might need to use a different key depending on your needs.
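For reference, DeltaFetch itself is enabled through the spider middleware settings. A minimal sketch, assuming the scrapy-deltafetch package is installed:
# settings.py -- minimal scrapy-deltafetch setup
SPIDER_MIDDLEWARES = {
    'scrapy_deltafetch.DeltaFetch': 100,
}
DELTAFETCH_ENABLED = True
# Optional: set DELTAFETCH_RESET = True (or pass -a deltafetch_reset=1 to the
# crawl command) to wipe the database of seen requests before a run.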

Related

How do I obtain results from 'yield' in python?

Perhaps yield in Python is remedial for some, but not for me... at least not yet.
I understand yield creates a 'generator'.
I stumbled upon yield when I decided to learn scrapy.
I wrote some code for a Spider which works as follows:
Goes to the start hyperlink and extracts all hyperlinks - these are not full hyperlinks, just sub-directories concatenated onto the starting hyperlink
Examines the hyperlinks and appends those meeting specific criteria to the base hyperlink
Uses Request to navigate to the new hyperlink and parses it to find the unique id in an element with 'onclick'
import scrapy
from scrapy import Request

class newSpider(scrapy.Spider):
    name = 'new'
    allowed_domains = ['www.alloweddomain.com']
    start_urls = ['https://www.alloweddomain.com']

    def parse(self, response):
        links = response.xpath('//a/@href').extract()
        for link in links:
            if link == 'SpecificCriteria':
                next_link = response.urljoin(link)
                yield Request(next_link, callback=self.parse_new)
EDIT 1:
for uid_dict in self.parse_new(response):
    print(uid_dict['uid'])
    break
End EDIT 1
Running the code here evaluates response as the HTTP response to start_urls and not to next_link.
def parse_new(self, response):
    trs = response.xpath("//*[@class='unit-directory-row']").getall()
    for tr in trs:
        if 'SpecificText' in tr:
            elements = tr.split()
            for element in elements:
                if 'onclick' in element:
                    subelement = element.split('(')[1]
                    uid = subelement.split(')')[0]
                    print(uid)
                    yield {
                        'uid': uid
                    }
                    break
It works: scrapy crawls the first page, creates the new hyperlink and navigates to the next page. parse_new parses the HTML for the uid and 'yields' it. scrapy's engine shows that the correct uid is 'yielded'.
What I don't understand is how I can 'use' that uid obtained by parse_new to create and navigate to a new hyperlink, as I would with a variable; I cannot seem to return a variable with Request.
I'd check out What does the "yield" keyword do? for a good explanation of how exactly yield works.
In the meantime, spider.parse_new(response) is an iterable object. That is, you can acquire its yielded results via a for loop. E.g.,
for uid_dict in spider.parse_new(response):
    print(uid_dict['uid'])
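Inside a spider, though, you usually don't call parse_new() yourself; you let the engine drive it and yield a follow-up Request from within the callback, using the uid to build the next URL. A minimal sketch of that pattern, where the '/details/' URL format and the parse_detail callback are hypothetical placeholders:
# These methods go inside the spider class (scrapy is imported at the top).
def parse_new(self, response):
    trs = response.xpath("//*[@class='unit-directory-row']").getall()
    for tr in trs:
        if 'SpecificText' in tr:
            for element in tr.split():
                if 'onclick' in element:
                    uid = element.split('(')[1].split(')')[0]
                    # Hypothetical URL pattern built from the uid.
                    next_url = response.urljoin('/details/{}'.format(uid))
                    # Let the engine follow the link; carry the uid in meta.
                    yield scrapy.Request(next_url,
                                         callback=self.parse_detail,
                                         meta={'uid': uid})
                    break

def parse_detail(self, response):
    yield {'uid': response.meta['uid'], 'url': response.url}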
After much reading and learning I discovered the reason scrapy does not perform the callback in the first parse, and it has nothing to do with yield! It comes down to two issues:
1) robots.txt blocking the request. This can be 'resolved' with ROBOTSTXT_OBEY = False in settings.py.
2) The logger showing 'Filtered offsite request to ...'. Passing dont_filter=True to the Request may resolve this. Both fixes are sketched below.
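A minimal sketch of both tweaks (disable these filters deliberately, since they exist for a reason):
# settings.py -- stop obeying robots.txt
ROBOTSTXT_OBEY = False

# In the spider: skip the offsite/duplicate filtering for this request
yield Request(next_link, callback=self.parse_new, dont_filter=True)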

Scrapy infinite scrolling - no pagination indication

I am new to web scraping and I encountered some issues when I was trying to scrape a website with infinite scroll. I looked at some other questions but I could not find the answer, so I hope someone could help me out here.
I am working on the website http://www.aastocks.com/tc/stocks/analysis/stock-aafn/00001/0/all/. I have the following (very basic) piece of code so far, with which I can get every article on the first page (20 entries).
def parse(self, response):
    # collect all article links
    news = response.xpath("//div[starts-with(@class,'newshead4')]//a//text()").extract()
    # visit each news link and gather news info
    for n in news:
        url = urljoin(response.url, n)
        yield scrapy.Request(url, callback=self.parse_news)
However, I could not figure out how to go to the next page. I read some tutorials online, such as going to Inspect -> Network and observing the Request URL after scrolling. It returned http://www.aastocks.com/tc/resources/datafeed/getmorenews.ashx?cat=all&newstime=905169272&newsid=NOW.895783&period=0&key=&symbol=00001, where I could not find an indication of pagination or any other pattern to help me go to the next page. When I copy this link into a new tab, I see a JSON document with the news of the next page, but without a URL attached to it. In this case, how could I fix it? Many thanks!
The link
http://www.aastocks.com/tc/resources/datafeed/getmorenews.ashx?cat=all&newstime=905169272&newsid=NOW.895783&period=0&key=&symbol=00001
returns JSON data with values like NOW.XXXXXX which you can use to generate links to the news articles:
"http://www.aastocks.com/tc/stocks/analysis/stock-aafn-con/00001/" + "NOW.XXXXXX" + "/all"
If you scroll down a few times you will see that the next pages are fetched with similar links, but with different newstime and newsid parameters.
If you check the JSON data you will see that the last item has values 'dtd' and 'id' which are the same as the newstime and newsid parameters in the link used to download the JSON data for the next page.
So you can generate the link that gets the JSON data for the next page(s):
"http://www.aastocks.com/tc/resources/datafeed/getmorenews.ashx?cat=all&newstime=" + DTD + "&newsid=" + ID + "&period=0&key=&symbol=00001"
Working example with requests
import requests

newstime = '934735827'
newsid = 'HKEX-EPS-20190815-003587368'

url = 'http://www.aastocks.com/tc/resources/datafeed/getmorenews.ashx?cat=all&newstime={}&newsid={}&period=0&key=&symbol=00001'
url_article = "http://www.aastocks.com/tc/stocks/analysis/stock-aafn-con/00001/{}/all"

for x in range(5):
    print('---', x, '----')
    print('data:', url.format(newstime, newsid))

    # get JSON data
    r = requests.get(url.format(newstime, newsid))
    data = r.json()

    #for item in data[:3]: # test only a few links
    for item in data[:-1]: # skip last item, which is used to get the next page
        # test links to articles
        r = requests.get(url_article.format(item['id']))
        print('news:', r.status_code, url_article.format(item['id']))

    # get data for next page
    newstime = data[-1]['dtd']
    newsid = data[-1]['id']
    print('next page:', newstime, newsid)
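Since the question is about Scrapy, roughly the same pagination logic can live in a spider callback. A minimal sketch, assuming the JSON keeps the same 'id' and 'dtd' fields as above (untested):
import json
import scrapy

class AastocksNewsSpider(scrapy.Spider):
    name = 'aastocks_news'
    feed_url = ('http://www.aastocks.com/tc/resources/datafeed/getmorenews.ashx'
                '?cat=all&newstime={}&newsid={}&period=0&key=&symbol=00001')
    article_url = 'http://www.aastocks.com/tc/stocks/analysis/stock-aafn-con/00001/{}/all'
    start_urls = [feed_url.format('934735827', 'HKEX-EPS-20190815-003587368')]

    def parse(self, response):
        data = json.loads(response.text)
        # all but the last item are articles for this "page"
        for item in data[:-1]:
            yield scrapy.Request(self.article_url.format(item['id']),
                                 callback=self.parse_news)
        # the last item carries the parameters for the next page
        if data:
            yield scrapy.Request(
                self.feed_url.format(data[-1]['dtd'], data[-1]['id']),
                callback=self.parse)

    def parse_news(self, response):
        # placeholder: extract whatever article fields are needed here
        yield {'url': response.url}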

How to follow 302 redirects while still getting page information when scraping using Scrapy?

Been wrestling with trying to get around this 302 redirection. First of all, the point of this particular part of my scraper is to get the next page index so I can flip through pages. The direct URLs aren't available for this site, so I can't just move on to the next one or anything; in order to continue scraping the actual data using a parse_details function, I have to go through each page and simulate requests.
This is all pretty new to me, so I made sure to try anything I could find first. I have tried various settings ("REDIRECT_ENABLED":False, altering handle_httpstatus_list, etc.) but none are getting me through this. Currently I'm trying to follow the location of the redirection, but this isn't working either.
Here is an example of one of the potential solutions I've tried following.
try:
    print('Current page index: ', page_index)
except:  # Will be thrown if page_index wasn't found due to redirection.
    if response.status in (302,) and 'Location' in response.headers:
        location = to_native_str(response.headers['location'].decode('latin1'))
        yield scrapy.Request(response.urljoin(location), method='POST', callback=self.parse)
The code, without the details parsing and such, is as follows:
def parse(self, response):
    table = response.css('td > a::attr(href)').extract()
    additional_page = response.css('span.page_list::text').extract()
    for string_item in additional_page:
        # The text has some non-breaking spaces (&nbsp;) to ignore. We want
        # the text representing the current page index only.
        char_list = list(string_item)
        for char in char_list:
            if char.isdigit():
                page_index = char
                # Now that we have the current page index, we can back out
                # of this loop.
                break

    # Below is where the code breaks; it cannot find page_index since it is
    # not getting to the site for scraping after redirection.
    try:
        print('Current page index: ', page_index)
        # To get to the next page, we submit a form request since it is all
        # set up with javascript instead of simply giving a URL to follow.
        # The event target has 'dgTournament' information where the first
        # piece is always '_ctl1' and the second is '_ctl' followed by
        # the page index number we want to go to minus one (so if we want
        # to go to the 8th page, it's '_ctl7').
        # Thus we can just plug in the current page index, which is equal to
        # the next one we want to hit minus one.
        # Here is how I am making the requests; they work until the (302)
        # redirection...
        form_data = {"__EVENTTARGET": "dgTournaments:_ctl1:_ctl" + page_index,
                     "__EVENTARGUMENT": {";;AjaxControlToolkit, Version=3.5.50731.0, Culture=neutral, PublicKeyToken=28f01b0e84b6d53e:en-US:ec0bb675-3ec6-4135-8b02-a5c5783f45f5:de1feab2:f9cec9bc:35576c48"}}
        yield FormRequest(current_LEVEL, formdata=form_data, method="POST", callback=self.parse, priority=2)
Alternatively, a solution may be to follow pagination in a different way, instead of making all of these requests?
The original link is
https://m.tennislink.usta.com/TournamentSearch/searchresults.aspx?typeofsubmit=&action=2&keywords=&tournamentid=&sectiondistrict=&city=&state=&zip=&month=0&startdate=&enddate=&day=&year=2019&division=G16&category=28&surface=&onlineentry=&drawssheets=&usertime=&sanctioned=-1&agegroup=Y&searchradius=-1
if anyone is able to help.
You don't have to follow the 302 redirects; instead you can do a POST request and receive the details of the page. The following code prints the data from the first 5 pages:
import requests
from bs4 import BeautifulSoup

url = 'https://m.tennislink.usta.com/TournamentSearch/searchresults.aspx'
pages = 5

for i in range(pages):
    params = {'year': '2019', 'division': 'G16', 'month': '0', 'searchradius': '-1'}
    payload = {'__EVENTTARGET': 'dgTournaments:_ctl1:_ctl' + str(i)}
    res = requests.post(url, params=params, data=payload)
    soup = BeautifulSoup(res.content, 'lxml')
    table = soup.find('table', id='ctl00_mainContent_dgTournaments')
    # pretty print the table contents
    for row in table.find_all('tr'):
        for column in row.find_all('td'):
            text = ', '.join(x.strip() for x in column.text.split('\n') if x.strip()).strip()
            print(text)
        print('-' * 10)
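If you would rather stay inside Scrapy, the same POST can be sent with FormRequest. A minimal sketch under the same assumptions about the form fields (untested):
import scrapy
from scrapy import FormRequest

class TournamentsSpider(scrapy.Spider):
    name = 'tournaments'
    base_url = ('https://m.tennislink.usta.com/TournamentSearch/searchresults.aspx'
                '?year=2019&division=G16&month=0&searchradius=-1')
    pages = 5

    def start_requests(self):
        for i in range(self.pages):
            yield FormRequest(self.base_url,
                              formdata={'__EVENTTARGET': 'dgTournaments:_ctl1:_ctl' + str(i)},
                              callback=self.parse,
                              dont_filter=True)  # defensive: every request targets the same URL

    def parse(self, response):
        # yield one item per non-empty table row
        for row in response.css('table#ctl00_mainContent_dgTournaments tr'):
            cells = [c.strip() for c in row.css('td ::text').getall() if c.strip()]
            if cells:
                yield {'row': cells}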

Selecting dependent dropdown with scrapy-splash

I am trying to scrape the following website: https://www.climatempo.com.br/climatologia/558/saopaulo-sp. It has two drop-down menus, with the second depending on the first, so I chose to use Scrapy and Splash via scrapy-splash.
I need to automate the change of location by selecting first the state, then the city. I tried SplashFormRequest but I am not able to change the cities list. My spider is as follows (prints are for debugging):
import scrapy
from scrapy_splash import SplashRequest, SplashFormRequest

class ExampleSpider(scrapy.Spider):
    name = 'climatologia'

    def start_requests(self):
        urls = ['https://www.climatempo.com.br/climatologia/558/saopaulo-sp']
        for url in urls:
            yield SplashRequest(url=url, callback=self.parse,
                                endpoint='render.html',
                                args={'wait': 0.5},)

    def parse(self, response):
        print(response.url)
        state = response.css("select.slt-geo")[0].css("option::attr(value)").extract()
        print(state)
        return SplashFormRequest(response.url, method='POST',
                                 formdata={'sel-state-geo': 'SP'},
                                 callback=self.state_selected,
                                 args={'wait': 0.5})

    def state_selected(self, response):
        print('\t:+)\t:+)\t:+)\t:+)\t:+)\t:+)')
        print(response.css("select.slt-geo")[0].css("option::text").extract())
        print(response.css("select.slt-geo")[1].css("option::text").extract())
This is a job I would suggest Selenium for, if you absolutely must use the site's menus. The only way to script Splash is through Lua scripts: you would have to send your requests to the execute endpoint and create a Lua script. I found the options you were trying to select, but not where the form gets submitted or how it functions on the site (I did have to translate it to English).
My suggestion is to look in the browser inspector for endpoints; this is one of several which look particularly interesting:
https://www.climatempo.com.br/json/busca-estados
This endpoint returns JSON like the following:
{"success":true,"message":"Resultados encontrados","time":"2017-11-30 16:05:20","totalRows":null,"totalPages":null,"page":null,"data":[{"idlocale":338,"idstate":31,"uf":"AC","state":"Acre","region":"N","latitude":null,"longitude":null},{"idlocale":339,"idstate":49,"uf":"AL","state":"Alagoas","region":"NE","latitude":null,"longitude":null},{"idlocale":340,"idstate":41,"uf":"AM","state":"Amazonas","region":"N","latitude":null,"longitude":null},{"idlocale":341,"idstate":30,"uf":"AP","state":"Amap\u00e1","region":"N","latitude":null,"longitude":null},{"idlocale":342,"idstate":56,"uf":"BA","state":"Bahia","region":"NE","latitude":null,"longitude":null},{"idlocale":343,"idstate":44,"uf":"CE","state":"Cear\u00e1","region":"NE","latitude":null,"longitude":null},{"idlocale":344,"idstate":47,"uf":"DF","state":"Distrito Federal","region":"CO","latitude":null,"longitude":null},{"idlocale":345,"idstate":45,"uf":"ES","state":"Esp\u00edrito Santo","region":"SE","latitude":null,"longitude":null},{"idlocale":346,"idstate":54,"uf":"GO","state":"Goi\u00e1s","region":"CO","latitude":null,"longitude":null},{"idlocale":347,"idstate":52,"uf":"MA","state":"Maranh\u00e3o","region":"NE","latitude":null,"longitude":null},{"idlocale":348,"idstate":53,"uf":"MG","state":"Minas Gerais","region":"SE","latitude":null,"longitude":null},{"idlocale":349,"idstate":39,"uf":"MS","state":"Mato Grosso do Sul","region":"CO","latitude":null,"longitude":null},{"idlocale":350,"idstate":40,"uf":"MT","state":"Mato Grosso","region":"CO","latitude":null,"longitude":null},{"idlocale":351,"idstate":50,"uf":"ND","state":"N\u00e3o Aplic\u00e1vel","region":"ND","latitude":null,"longitude":null},{"idlocale":352,"idstate":55,"uf":"PA","state":"Par\u00e1","region":"N","latitude":null,"longitude":null},{"idlocale":353,"idstate":37,"uf":"PB","state":"Para\u00edba","region":"NE","latitude":null,"longitude":null},{"idlocale":354,"idstate":29,"uf":"PE","state":"Pernambuco","region":"NE","latitude":null,"longitude":null},{"idlocale":355,"idstate":33,"uf":"PI","state":"Piau\u00ed","region":"NE","latitude":null,"longitude":null},{"idlocale":356,"idstate":32,"uf":"PR","state":"Paran\u00e1","region":"S","latitude":null,"longitude":null},{"idlocale":357,"idstate":46,"uf":"RJ","state":"Rio de Janeiro","region":"SE","latitude":null,"longitude":null},{"idlocale":358,"idstate":35,"uf":"RN","state":"Rio Grande do Norte","region":"NE","latitude":null,"longitude":null},{"idlocale":359,"idstate":38,"uf":"RO","state":"Rond\u00f4nia","region":"N","latitude":null,"longitude":null},{"idlocale":360,"idstate":43,"uf":"RR","state":"Roraima","region":"N","latitude":null,"longitude":null},{"idlocale":361,"idstate":48,"uf":"RS","state":"Rio Grande do Sul","region":"S","latitude":null,"longitude":null},{"idlocale":362,"idstate":36,"uf":"SC","state":"Santa Catarina","region":"S","latitude":null,"longitude":null},{"idlocale":363,"idstate":51,"uf":"SE","state":"Sergipe","region":"NE","latitude":null,"longitude":null},{"idlocale":364,"idstate":34,"uf":"SP","state":"S\u00e3o Paulo","region":"SE","latitude":null,"longitude":null},{"idlocale":365,"idstate":42,"uf":"TO","state":"Tocantins","region":"N","latitude":null,"longitude":null}]}
Hopefully this is another way to get the data you are looking for?
Then you can use normal requests to get the data. You would just have to form the request the same way the browser does; usually adding Accept, User-Agent, and X-Requested-With headers is enough to pass.
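A minimal sketch of that approach with plain Scrapy requests (the headers are a guess at what the site expects, and the endpoint for the dependent city list would still need to be found in the network tab):
import json
import scrapy

class ClimatologiaStatesSpider(scrapy.Spider):
    name = 'climatologia_states'
    start_urls = ['https://www.climatempo.com.br/json/busca-estados']

    # Headers that typically make an XHR-style request pass; adjust as needed.
    custom_headers = {
        'Accept': 'application/json, text/javascript, */*; q=0.01',
        'X-Requested-With': 'XMLHttpRequest',
    }

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, headers=self.custom_headers,
                                 callback=self.parse)

    def parse(self, response):
        payload = json.loads(response.text)
        for state in payload.get('data', []):
            # Each entry has fields like 'uf', 'state' and 'idstate' (see the
            # sample JSON above); the matching city endpoint is not shown here
            # and would have to be discovered in the browser inspector.
            yield {'uf': state['uf'], 'state': state['state'],
                   'idstate': state['idstate']}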

Can't crawl more than a few items per page

I'm new to scrapy and tried to crawl from a couple of sites, but wasn't able to get more than a few images from there.
For example, for http://shop.nordstrom.com/c/womens-dresses-new with the following code -
def parse(self, response):
    for dress in response.css('article.npr-product-module'):
        yield {
            'src': dress.css('img.product-photo').xpath('@src').extract_first(),
            'url': dress.css('a.product-photo-href').xpath('@href').extract_first()
        }
I got 6 products. I expect 66.
For URL https://www.renttherunway.com/products/dress with the following code -
def parse(self, response):
    for dress in response.css('div.cycle-image-0'):
        yield {
            'image-url': dress.xpath('.//img/@src').extract_first(),
        }
I got 12. I expect roughly 100.
Even when I changed it to crawl every 'next' page, I got the same number per page but it went through all pages successfully.
I have tried a different USER_AGENT, disabled cookies, and used a DOWNLOAD_DELAY of 5.
I imagine I will run into the same problem on any site, so folks must have seen this before, but I can't find a reference to it.
What am I missing?
It's one of those weird websites where they store the product data as JSON in the HTML source and unpack it with JavaScript on page load.
To figure this out, what you usually want to do is:
disable JavaScript and do scrapy view <url>
investigate the results
find the id in the product URL and search for that id in the page source to check whether it exists and, if so, where it is hidden. If it doesn't exist, that means it's being populated by some AJAX request -> re-enable JavaScript, go to the page and dig through the browser inspector's network tab to find it.
If you do a regex-based search:
re.findall(r"ProductResults, (\{.+\})\)", response.body_as_unicode())
you'll get a huge JSON blob that contains all products and their information.
import json
import re

data = re.findall(r"ProductResults, (\{.+\})\)", response.body_as_unicode())
data = json.loads(data[0])['data']
print(len(data['ProductResult']['Products']))
# -> 66
That gets the correct number of products!
So in your parse you can do this:
def parse(self, response):
    # extract the embedded JSON and decode it
    data = re.findall(r"ProductResults, (\{.+\})\)", response.body_as_unicode())
    data = json.loads(data[0])['data']
    for product in data['ProductResult']['Products']:
        # find main image
        image_url = [m['Url'] for m in product['Media'] if m['Type'] == 'MainImage']
        yield {'image_url': image_url}
