Scrapy bypass data usage consent wall - python

I am scraping Yahoo Finance news using the code below.
import scrapy

class YfinNewsSpider(scrapy.Spider):
    name = 'yfin_news_spider'
    custom_settings = {'DOWNLOAD_DELAY': '0.5', 'COOKIES_ENABLED': True, 'COOKIES_DEBUG': True}

    def __init__(self, month, year, **kwargs):
        self.start_urls = ['https://finance.yahoo.com/sitemap/2020_03_all']
        self.allowed_domains = ['finance.yahoo.com']
        super().__init__(**kwargs)

    def parse(self, response):
        all_news_urls = response.xpath('//ul/li[@class="List(n) Py(3px) Lh(1.2)"]')
        for news in all_news_urls:
            news_url = news.xpath('.//a[@class="Td(n) Td(u):h C($c-fuji-grey-k)"]/@href').extract_first()
            yield scrapy.Request(news_url, callback=self.parse_news, dont_filter=True)

    def parse_news(self, response):
        news_url = str(response.url)
        title = response.xpath('//title/text()').extract_first()
        paragraphs = response.xpath('//div[@class="caas-body"]/p/text()').extract()
        date_time = response.xpath('//div[@class="caas-attr-time-style"]/time/@datetime').extract_first()
        yield {'title': title, 'url': news_url, 'body_text': paragraphs, 'timestamp': date_time}
However, when I run my spider, it gives me the results below.
2020-11-28 20:42:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://consent.yahoo.com/v2/collectConsent?sessionId=3_cc-session_05cc09ea-0bc0-439d-8b4c-2d6f20f52d6e> (referer: https://finance.yahoo.com/sitemap/2020_03_all)
2020-11-28 20:42:40 [scrapy.downloadermiddlewares.cookies] DEBUG: Sending cookies to: <GET https://finance.yahoo.com/news/onegold-becomes-first-company-offer-110000241.html>
Cookie: B=cnmvgrdfs5a0r&b=3&s=o1; GUCS=ASXMbR9p
2020-11-28 20:42:40 [scrapy.core.scraper] DEBUG: Scraped from <200 https://consent.yahoo.com/v2/collectConsent?sessionId=3_cc-session_05cc09ea-0bc0-439d-8b4c-2d6f20f52d6e>
{'title': 'Yahoo er nu en del af Verizon Media', 'url': 'https://consent.yahoo.com/v2/collectConsent?sessionId=3_cc-session_05cc09ea-0bc0-439d-8b4c-2d6f20f52d6e', 'body_text': [], 'timestamp': None}
2020-11-28 20:42:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://consent.yahoo.com/v2/collectConsent?sessionId=3_cc-session_d6731ce6-78bc-4222-914f-24cf98f874b8> (referer: https://finance.yahoo.com/sitemap/2020_03_all)
This seems to indicate that when my spider goes to https://finance.yahoo.com/news/onegold-becomes-first-company-offer-110000241.html (found in https://finance.yahoo.com/sitemap/2020_03_all), it sends cookies to that URL but is redirected to the consent wall at https://consent.yahoo.com/v2/collectConsent?sessionId=3_cc-session_05cc09ea-0bc0-439d-8b4c-2d6f20f52d6e.
I opened this consent wall URL in a browser and found a data-consent screen. When I clicked accept, it brought me to the page I actually want to scrape. The scraped result above is exactly the content of that consent screen (the title 'Yahoo er nu en del af Verizon Media' is Danish for 'Yahoo is now part of Verizon Media').
I have tried setting COOKIES_ENABLED to True, but it did not work. So, is there any way to bypass this consent screen in Scrapy?
Thank you.

You can try one thing:
Open the consent page with your browser's network tab active, then click the consent button. There you can identify the request that is sent when you give consent, and you can try replicating the same request with Scrapy. This may solve your issue.
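For example, if the consent page turns out to contain a regular HTML form, a rough sketch like the one below could detect the wall and submit that form before parsing (the form field names such as 'agree' are placeholders; copy the real ones from the request you see in the network tab):

    def parse_news(self, response):
        # Sketch: if we got redirected to the consent wall, submit its form and
        # come back to this callback with the real article response.
        if 'consent.yahoo.com' in response.url:
            yield scrapy.FormRequest.from_response(
                response,
                formdata={'agree': 'agree'},  # placeholder field/value
                callback=self.parse_news,
                dont_filter=True,
            )
            return
        # ... existing parsing of title, body_text and timestamp goes here ...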
Another option would be to use scrapy-selenium to click that button, and then Scrapy can take over from there.
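A rough sketch of that second approach, assuming scrapy-selenium is installed and its SELENIUM_* options are configured in settings.py (the button selector below is a guess and must be checked against the actual consent page):

from scrapy_selenium import SeleniumRequest

class YfinNewsSpider(scrapy.Spider):
    # ... name, custom_settings and __init__ as in the original spider ...

    def start_requests(self):
        for url in self.start_urls:
            yield SeleniumRequest(
                url=url,
                callback=self.parse,
                wait_time=5,
                # Placeholder selector -- inspect the consent page for the real button.
                script="var b = document.querySelector('button[name=agree]'); if (b) b.click();",
            )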

Related

How can I implement custom proxy on Scrapy?

I'm trying to implement ScraperAPI, but as a beginner I think I'm doing something wrong, even though I followed their documentation to set everything up. Here is the documentation link.
from scrapy import Spider
from scrapy.http import Request
from .config import API
from scraper_api import ScraperAPIClient

client = ScraperAPIClient(API)

class GlassSpider(Spider):
    name = 'glass'
    allowed_domains = ['glassdoor.co.uk']
    start_urls = [client.scrapyGet(url='https://www.glassdoor.co.uk/Job/russian-jobs-SRCH_KE0,7.htm?fromAge=1')]

    def parse(self, response):
        jobs = response.xpath('//*[contains(@class, "react-job-listing")]')
        for job in jobs:
            job_url = job.xpath('.//*[contains(@class, "jobInfoItem jobTitle")]/@href').extract_first()
            absolute_job_url = response.urljoin(job_url)
            yield Request(client.scrapyGet(url=absolute_job_url),
                          callback=self.parse_jobpage,
                          meta={"Job URL": absolute_job_url})

    def parse_jobpage(self, response):
        absolute_job_url = response.meta.get('Job URL')
        job_description = "".join(line for line in response.xpath('//*[contains(@class, "desc")]//text()').extract())
        yield {
            "Job URL": absolute_job_url,
            "Job Description": job_description,
        }
That's the output I'm receiving. What's wrong with my code? Please fix it for me so I can follow along and get the point. Thank you.
2020-10-01 23:01:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://api.scraperapi.com/?url=https%3A%2F%2Fwww.glassdoor.co.uk%2FJob%2Frussian-jobs-SRCH_KE0%2C7.htm%3FfromAge%3D1&api_key=bec9dd9f2be095dfc6158a7e609&scraper_sdk=python> (referer: None)
2020-10-01 23:01:45 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'api.scraperapi.com': <GET https://api.scraperapi.com/?url=https%3A%2F%2Fapi.scraperapi.com%2Fpartner%2FjobListing.htm%3Fpos%3D101%26ao%3D1044074%26s%3D149%26guid%3D00000174e51ccd8988e2e5420e67cf0d%26src%3DGD_JOB_AD%26t%3DSRFJ%26vt%3Dw%26cs%3D1_94f59ee8%26cb%3D1601571704401%26jobListingId%3D3696480795&api_key=bec9d9f82b0955c615c8a7e639scraper_sdk=python>
I'm not familiar with this particular lib, but from your execution logs the issue is that your request is being filtered, since it's considered offsite:
[scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'api.scraperapi.com': <GET https://api.scraperapi.com/?url=https%3A%2F%2Fapi.scraperapi.com%2Fpartner%2FjobListing.htm%3Fpos%3D101%26ao%3D1044074%26s%3D149%26guid%3D00000174e51ccd8988e2e5420e67cf0d%26src%3DGD_JOB_AD%26t%3DSRFJ%26vt%3Dw%26cs%3D1_94f59ee8%26cb%3D1601571704401%26jobListingId%3D3696480795&api_key=bec9d9f82b0955c615c8a7e639scraper_sdk=python>
Since ScraperAPI routes your request through their domain, which is outside what you defined in allowed_domains, it's filtered as an offsite request. To avoid this issue you can remove this line entirely:
allowed_domains = ['glassdoor.co.uk']
or include 'api.scraperapi.com' in it.
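For example:

allowed_domains = ['glassdoor.co.uk', 'api.scraperapi.com']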

Issue with scrapy spider

I am trying to get volume-weighted average prices for stocks from the moneycontrol.com website. The parse function is running without any issues, but the parse_links function is not getting called. Am I missing something here?
# -*- coding: utf-8 -*-
import scrapy

class MoneycontrolSpider(scrapy.Spider):
    name = "moneycontrol"
    allowed_domains = ["https://www.moneycontrol.com"]
    start_urls = ["https://www.moneycontrol.com/india/stockpricequote"]

    def parse(self, response):
        for link in response.css('td.last > a::attr(href)').extract():
            if link:
                yield scrapy.Request(link, callback=self.parse_links, method='GET')

    def parse_links(self, response):
        VWAP = response.xpath('//*[@id="n_vwap_val"]/text()').extract_first()
        print(VWAP)
        with open('quotes.txt', 'a+') as f:
            f.write('VWAP: {}'.format(VWAP) + '\n')
If you read the log output, the error becomes obvious.
2018-09-08 19:52:38 [py.warnings] WARNING: c:\program files\python37\lib\site-packages\scrapy\spidermiddlewares\offsite.py:59: URLWarning: allowed_domains accepts only domains, not URLs. Ignoring URL entry https://www.moneycontrol.com in allowed_domains.
warnings.warn("allowed_domains accepts only domains, not URLs. Ignoring URL entry %s in allowed_domains." % domain, URLWarning)
2018-09-08 19:52:38 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-09-08 19:52:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.moneycontrol.com/india/stockpricequote> (referer: None)
2018-09-08 19:52:40 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.moneycontrol.com': <GET http://www.moneycontrol.com/india/stockpricequote/chemicals/aartiindustries/AI45>
So just fix your allowed_domains, and you should be fine:
allowed_domains = ["moneycontrol.com"]

python scrapy images pipeline not downloading (301 error)

I am trying to download images from pages like this one: http://39.moscowfilmfestival.ru/miff39/eng/films/?id=39016, but I receive a 301 error and the images are not downloaded. I can download all my other data points without a problem, including image_urls. (I am reusing Scrapy code that has worked on other similar sites.) If I input the downloaded image URL into the browser, it returns a page with the image. However, the URL of that page is slightly different; a forward slash (/) is interpolated:
submit: http://39.moscowfilmfestival.ru/upimg/cache/photo/640/6521.jpg
receive: http://moscowfilmfestival.ru/upimg//cache/photo/640/6521.jpg
The output log for the above page reads:
2018-01-02 11:19:40 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:62638/session/949ab9c1-6a0a-6a42-a19a-ef72c55acc33/url {"sessionId": "949ab9c1-6a0a-6a42-a19a-ef72c55acc33", "url": "http://39.moscowfilmfestival.ru//miff39/eng/films/?id=39016"}
2018-01-02 14:46:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://39.moscowfilmfestival.ru//miff39/eng/films/?id=39016> (referer: None)
2018-01-02 14:46:59 [scrapy.core.engine] DEBUG: Crawled (301) <GET http://39.moscowfilmfestival.ru/upimg/cache/photo/640/6521.jpg> (referer: None)
2018-01-02 14:46:59 [scrapy.pipelines.files] WARNING: File (code: 301): Error downloading file from <GET http://39.moscowfilmfestival.ru/upimg/cache/photo/640/6521.jpg> referred in <None>
2018-01-02 14:46:59 [scrapy.core.scraper] DEBUG: Scraped from <200 http://39.moscowfilmfestival.ru//miff39/eng/films/?id=39016>
{'camera': ['HUANG LIAN'],
'cast': ['GAO ZIFENG, MENG HALYAN, JHAO ZIFENG, HE MIAO, WAN PEILU'],
'country': ['CHINA'],
'design': ['YANG ZHIWEN'],
'director': ['Liang Qiao'],
'festival_edition': ['39th'],
'festival_year': ['2017'],
'image_urls': ['http://39.moscowfilmfestival.ru/upimg/cache/photo/640/6521.jpg'],
'images': [],
'length': ['107'],
'music': [''],
'producer': ['DUAN PENG'],
'production': ['SUNNYWAY FILM'],
'program': ['Main Competition'],
'script': ['LI YONG'],
'sound': ['HU MAI, HAO CONG'],
'synopsis': ['The story begins with Vince Kang, a reporter in Beijing, having '
'to go back to his hometown to report a crested ibis, one of the '
'national treasures found unexpectedly. During the process of '
'pursuit and hide of the crested ibis, everyone’s interest is '
'revealed and the scars, both mental and physical were rip up. '
'In addition, the environment pollution, an aftermath from '
'China`s development pattern, is brought into daylight. The '
'story, from the perspective of a returnee, reveals the living '
'condition of rural China and exposes the dilemma of humanity. '
'In the end, Vince, the renegade, had no alternative but make a '
'compromise with his birthland.'],
'title': ['CRESTED IBIS'],
'year': ['2017']}
To resolve the issue:
I have tried to mimic the browser URL by interpolating the additional /. No effect.
I have tried to add a 301 exception handler to the spider class (handle_httpstatus_all = True) and also to the settings.py file. No effect.
Interestingly, an earlier version of the spider I wrote mistakenly completed a partial URL with an extra / (between the .ru and miff parts of the URL), and the GET and POST requests worked fine. They work just the same with the correct original page URL in the current version of the spider.
Any help is sincerely appreciated.
I suggest using the urllib library to download the image.
import urllib.request

url = 'http://39.moscowfilmfestival.ru/upimg/cache/photo/640/6521.jpg'
file_path = r'C:/Users/admin/Desktop/test/6521.jpg'

getPath, headers = urllib.request.urlretrieve(url, file_path)
print(getPath)  # This is the local image path
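If you want to keep the download inside the Scrapy project instead of a one-off script, a minimal sketch of an item pipeline built on the same urlretrieve call could look like this (the class name and output directory are illustrative; enable it through ITEM_PIPELINES in settings.py):

import os
import urllib.request

class UrllibImagesPipeline:
    """Illustrative pipeline: downloads every URL in item['image_urls'] with urllib."""

    def process_item(self, item, spider):
        os.makedirs('images', exist_ok=True)
        for url in item.get('image_urls', []):
            file_path = os.path.join('images', url.split('/')[-1])
            # urlretrieve follows the 301 redirect, so the download succeeds here
            urllib.request.urlretrieve(url, file_path)
        return item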

Scrapy is not listening to deny rules [closed]

For some reason Scrapy is parsing data from URLs that match my deny rules:
I'm getting parsed data from URLs containing /browse/, /search/, and /ip/.
I'm not sure where this is going wrong.
Please advise, thanks! Please find my code below:
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from wallspider.items import Website

class mydomainSpider(CrawlSpider):
    name = "tp"
    allowed_domains = ["www.mydomain.com"]
    start_urls = ["http://www.mydomain.com",]

    """/tp/ page type to crawl"""
    rules = (
        Rule(SgmlLinkExtractor(
            allow=('/tp/', ),
            deny=(
                'browse/',
                'browse-ng.do?',
                'search-ng.do?',
                'facet=',
                'ip/',
                'page/'
                'search/',
                '/[1-9]$',
                '(bti=)[1-9]+(?:\.[1-9]*)?',
                '(sort_by=)[a-zA-Z]',
                '(sort_by=)[1-9]+(?:\.[1-9]*)?',
                '(ic=32_)[1-9]+(?:\.[1-9]*)?',
                '(ic=60_)[0-9]+(?:\.[0-9]*)?',
                '(search_sort=)[1-9]+(?:\.[1-9]*)?',
            ),
        ), callback="parse_items", follow=True),
    )

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//html')
        items = []
        for site in sites:
            item = Website()
            item['referer'] = response.request.headers.get('Referer')
            item['url'] = response.url
            item['title'] = site.xpath('/html/head/title/text()').extract()
            item['description'] = site.select('//meta[@name="Description"]/@content').extract()
            items.append(item)
        return items
Here is a part of my console log; it's grabbing /ip/ pages:
2013-12-11 11:21:43-0800 [tp] DEBUG: Crawled (200) <GET http://www.mydomain.com/ip/1104329> (referer: http://www.mydomain.com/tp/john-duigan)
2013-12-11 11:21:43-0800 [tp] DEBUG: Scraped from <200 http://www.mydomain.com/ip/1104329>
{'description': [u'Shop Low Prices on: Molly (Widescreen) : Movies'],
'referer': 'http://www.mydomain.com/tp/john-duigan',
'title': [u'Molly (Widescreen): Movies : mydomain.com '],
'url': 'http://www.mydomain.com/ip/1104329'}
2013-12-11 11:21:43-0800 [tp] DEBUG: Redirecting (302) to <GET http://www.mydomain.com/ip/17371019> from <GET http://www.mydomain.com/tp/jon-furmanski>
2013-12-11 11:21:43-0800 [tp] DEBUG: Redirecting (302) to <GET http://www.mydomain.com/ip/17371019> from <GET http://www.mydomain.com/tp/taylor-byrd>
2013-12-11 11:21:43-0800 [tp] DEBUG: Redirecting (302) to <GET http://www.mydomain.com/ip/17371019> from <GET http://www.mydomain.com/tp/greg-byers>
2013-12-11 11:21:43-0800 [tp] DEBUG: Redirecting (302) to <GET http://www.mydomain.com/ip/17371019> from <GET http://www.mydomain.com/tp/tom-bowker>
2013-12-11 11:21:43-0800 [tp] DEBUG: Crawled (200) <GET http://www.mydomain.com/ip/21152221> (referer: http://www.mydomain.com/tp/peter-levin)
2013-12-11 11:21:43-0800 [tp] DEBUG: Scraped from <200 http://www.mydomain.com/ip/21152221>
{'description': [u'Shop Low Prices on: Marva Collins Story (1981) : Video on Demand by VUDU'],
'referer': 'http://www.mydomain.com/tp/peter-levin',
'title': [u'Marva Collins Story (1981): Video on Demand by VUDU : mydomain.com '],
'url': 'http://www.mydomain.com/ip/21152221'}
The rules of your SgmlLinkExtractor apply when extracting links from pages. And in your case, some of your .../tp/... requests are being redirected to .../ip/... pages.
Redirecting (302) to <GET http://www.mydomain.com/ip/17371019> from <GET http://www.mydomain.com/tp/tom-bowker>
allow and deny patterns do not apply to URLs after redirections.
You could disable following redirections altogether by setting REDIRECT_ENABLED to False (see RedirectMiddleware).
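For example, in the project's settings.py (or per spider via custom_settings in newer Scrapy versions):

REDIRECT_ENABLED = False  # 302 responses are no longer followed, so the redirected /ip/ pages are never fetched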
I found out what was wrong: the pages were redirecting to a page type that was in my deny rules. Thank you for all your help! I appreciate it!

python method is not called

I have the following class methods in a Scrapy spider. parse_category yields a Request object with a callback to parse_product. Sometimes a category page redirects to a product page, so I detect whether the category page is actually a product page; if it is, I just call the parse_product method. But for some reason the method is not called.
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    anchors = hxs.select('//div[@id="panelMfr"]/div/ul/li[position() != last()]/a')
    for anchor in anchors[2:3]:
        url = anchor.select('@href').extract().pop()
        cat = anchor.select('text()').extract().pop().strip()
        yield Request(urljoin(get_base_url(response), url), callback=self.parse_category, meta={"category": cat})

def parse_category(self, response):
    hxs = HtmlXPathSelector(response)
    base_url = get_base_url(response)
    # check if it's a redirected product page
    if hxs.select(self.product_name_xpath):
        self.log("Category-To-Product Redirection")
        self.parse_product(response)  # <<---- This line is not called.
        self.log("Product Parsed")
        return
    products_xpath = '//div[@class="productName"]/a/@href'
    products = hxs.select(products_xpath).extract()
    for url in products:
        yield Request(urljoin(base_url, url), callback=self.parse_product, meta={"category": response.meta['category']})
    next_page = hxs.select('//table[@class="nav-back"]/tr/td/span/a[contains(text(), "Next")]/text()').extract()
    if next_page:
        url = next_page[0]
        yield Request(urljoin(base_url, url), callback=self.parse_category, meta={"category": response.meta['category']})

def parse_product(self, response):
    hxs = HtmlXPathSelector(response)
    base_url = get_base_url(response)
    self.log("Inside parse_product")
In the log I see that Category-To-Product Redirection and Product Parsed are printed, but Inside parse_product is missing. What did I do wrong here?
2013-07-12 21:31:34+0100 [example.com] DEBUG: Crawled (200) <GET http://www.example.com/category.aspx> (referer: None)
2013-07-12 21:31:34+0100 [example.com] DEBUG: Redirecting (302) to <GET http://www.example.com/productinfo.aspx?catref=AM6901> from <GET http://www.example.com/products/Inks-Toners/Apple>
2013-07-12 21:31:35+0100 [example.com] DEBUG: Crawled (200) <GET http://www.example.com/productinfo.aspx?catref=AM6901> (referer: http://www.example.com/category.aspx)
2013-07-12 21:31:35+0100 [example.com] DEBUG: Category-To-Product Redirection
2013-07-12 21:31:35+0100 [example.com] DEBUG: Product Parsed
2013-07-12 21:31:35+0100 [example.com] INFO: Closing spider (finished)
2013-07-12 21:31:35+0100 [-] ERROR: ERROR:root:SPIDER CLOSED: No. of products: 0
