I just started and I've been on this for a week or two, using only the internet to help, but now I've reached a point I can't get past and my problem can't be found anywhere else. To explain my program: I want to scrape data, then click a button, then keep scraping until I hit data I have already collected, then move on to the next page in the list.
I've got as far as scraping the first 8 items, but I can't find a way to click on the "see more!" button. I know I should use Selenium and the button's XPath. Anyway, here is my code:
import scrapy
from selenium import webdriver


class KickstarterSpider(scrapy.Spider):
    name = 'kickstarter'
    allowed_domains = ['kickstarter.com']
    start_urls = ["https://www.kickstarter.com/projects/zwim/zwim-smart-swimming-goggles/community",
                  "https://www.kickstarter.com/projects/zunik/oriboard-the-amazing-origami-multifunctional-cutti/community"]

    def __init__(self):
        self.driver = webdriver.Chrome(chromedriver)  # chromedriver = path to the ChromeDriver executable

    def parse(self, response):
        self.driver.get('https://www.kickstarter.com/projects/zwim/zwim-smart-swimming-goggles/community')
        backers = response.css('.founding-backer.community-block-content')
        b = backers[0]
        while True:
            try:
                seemore = self.driver.find_element_by_xpath('//*[@id="content-wrap"]').click()
            except:
                break
        self.driver.close()

    def parse2(self, response):
        print('you are here!')
        for b in backers:
            name = b.css('.name.js-founding-backer-name::text').extract_first()
            backed = b.css('.backing-count.js-founding-backer-backings::text').extract_first()
            print(name, backed)
Be sure the web driver used with Scrapy loads and interprets JS (I don't know... it could be a solution).
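For illustration, here is a minimal sketch of that idea (a sketch under assumptions, not your code: the "see more!" button XPath is a placeholder to replace after inspecting the page, and the expanded page is re-read from Selenium's page_source once the button stops appearing):

import time

from scrapy.selector import Selector

def parse(self, response):
    self.driver.get(response.url)
    # keep clicking until the button can no longer be found/clicked
    while True:
        try:
            # placeholder XPath -- inspect the page for the real "see more!" button
            self.driver.find_element_by_xpath('//button[contains(., "see more")]').click()
            time.sleep(2)  # give the extra backers time to load
        except Exception:
            break
    # hand the fully expanded page back to Scrapy selectors
    sel = Selector(text=self.driver.page_source)
    for b in sel.css('.founding-backer.community-block-content'):
        name = b.css('.name.js-founding-backer-name::text').extract_first()
        backed = b.css('.backing-count.js-founding-backer-backings::text').extract_first()
        yield {'name': name, 'backed': backed}
    self.driver.close()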
I would like to scrape an arbitrary offer from AliExpress. I'm trying to use Scrapy and Selenium. The issue I face is that when I use Chrome and do right click > inspect on an element I see the real HTML, but when I do right click > view source I see something different: a mess of HTML, CSS and JS all around.
As far as I understand, the content is pulled asynchronously? I guess this is the reason why I can't find the element I am looking for on the page.
I was trying to use Selenium to load the page first and then get the content I want, but that failed. I'm trying to scroll down to get to the reviews section and grab its content.
Is this some advanced anti-bot solution that they have, or is my approach wrong?
The code that I currently have:
import scrapy
from selenium import webdriver
import logging
import time

logging.getLogger('scrapy').setLevel(logging.WARNING)


class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://pl.aliexpress.com/item/32998115046.html']

    def __init__(self):
        self.driver = webdriver.Chrome()

    def parse(self, response):
        self.driver.get(response.url)
        scroll_retries = 20
        data = ''
        while scroll_retries > 0:
            try:
                data = self.driver.find_element_by_class_name('feedback-list-wrap')
                scroll_retries = 0
            except:
                self.scroll_down(500)
                scroll_retries -= 1
        print("----------")
        print(data)
        print("----------")
        self.driver.close()

    def scroll_down(self, pixels):
        self.driver.execute_script("window.scrollTo(0, {});".format(pixels))
        time.sleep(2)
By watching the requests in the Network tab of the browser's inspect tool, you will find that the comments come from a separate request, so you can crawl that page instead.
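A rough sketch of that approach (the endpoint URL and the CSS class are placeholders, not the real AliExpress values; you would copy the actual request URL from the Network tab):

import scrapy


class FeedbackSpider(scrapy.Spider):
    name = 'aliexpress_feedback'
    # placeholder: replace with the request URL you see in the Network tab
    start_urls = ['https://example.com/feedback-endpoint-for-the-product']

    def parse(self, response):
        # the XHR response is ordinary HTML, so plain Scrapy selectors work on it,
        # no Selenium needed
        for row in response.css('.feedback-item'):  # placeholder selector
            yield {'text': ' '.join(row.css('::text').extract()).strip()}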
I've been trying to scrape Bioshock games from the Steam store and save their name, price and link in a CSV file. I know how to do it just by using Scrapy, but I really want to know if there's a way to do it combining both Scrapy and Selenium. I want to use Selenium just to get rid of the age check gate that pops up on certain game store pages.
Example of an age gate
Example of another age gate
So I've managed to scrape games that don't have the age gate by using Scrapy, and I've managed to bypass the age gates using Selenium.
The problem I'm having is passing the game store page that Selenium opened after bypassing the age gate back to Scrapy so it can crawl it. Since everything works fine on its own, I've concluded that the problem is simply that I don't know how to connect them.
def parse_product(self, response):
    product = ScrapesteamItem()
    sel = self.driver
    # Passing first age gate
    if '/agecheck/app/' in response.url:
        sel.get(response.url)
        select = Select(sel.find_element_by_xpath("""//*[@id="ageYear"]"""))
        select.select_by_visible_text("1900")
        sel.find_element_by_xpath("""//*[@id="agecheck_form"]/a""").click()
        # Pass Selenium's newly opened site to Scrapy
    # Passing second age gate
    elif '/agecheck' in response.url:
        sel.get(response.url)
        sel.find_element_by_xpath("""//*[@id="app_agegate"]/div[3]/a[1]""").click()
        # Pass Selenium's newly opened site to Scrapy
    # Scraping the data with scrapy
    else:
        name = response.css('.apphub_AppName ::text').extract()
        price = response.css('div.game_purchase_price ::text, div.discount_final_price ::text').extract()
        link = response.css('head > link:nth-child(40) ::attr(href)').extract()
        for product in zip(name, price, link):
            scrapedInfo = {
                'NAME': product[0],
                'PRICE': product[1].rstrip().lstrip(),
                'LINK': product[2]
            }
            yield scrapedInfo
I hope someone will know how to do it (if it's even possible).
P.S. I know there are much better ways to scrape the Steam store, and there's probably an API, but before I go and learn that I'd like to know if there's a way to do it like this, even if it's sub-optimal.
The straightforward answer is: apply the same scraping code that you used for "scraping the data with scrapy", i.e. something like this:
from scrapy.spiders import SitemapSpider
from scrapy.http import HtmlResponse
from selenium.webdriver.support.ui import Select


class MySpider(SitemapSpider):
    sitemap_urls = ['http://www.xxxxxxxSpider.com']
    sitemap_rules = [
        ('/product/', 'parse_product'),
    ]

    def my_custom_parse_product(self, response):
        name = response.css('.apphub_AppName ::text').extract()
        price = response.css('div.game_purchase_price ::text, div.discount_final_price ::text').extract()
        link = response.css('head > link:nth-child(40) ::attr(href)').extract()
        for product in zip(name, price, link):
            scrapedInfo = {
                'NAME': product[0],
                'PRICE': product[1].rstrip().lstrip(),
                'LINK': product[2]
            }
            yield scrapedInfo

    def parse_product(self, response):
        product = ScrapesteamItem()
        sel = self.driver  # assumes the driver is set up in __init__, as in your spider
        # Passing first age gate
        if '/agecheck/app/' in response.url:
            sel.get(response.url)
            select = Select(sel.find_element_by_xpath("""//*[@id="ageYear"]"""))
            select.select_by_visible_text("1900")
            sel.find_element_by_xpath("""//*[@id="agecheck_form"]/a""").click()
            # Pass the page Selenium ended up on back to Scrapy
            response = HtmlResponse(url=response.url, body=sel.page_source, encoding='utf-8')
            yield from self.my_custom_parse_product(response)
        # Passing second age gate
        elif '/agecheck' in response.url:
            sel.get(response.url)
            sel.find_element_by_xpath("""//*[@id="app_agegate"]/div[3]/a[1]""").click()
            # Pass the page Selenium ended up on back to Scrapy
            response = HtmlResponse(url=response.url, body=sel.page_source, encoding='utf-8')
            yield from self.my_custom_parse_product(response)
        # Scraping the data with scrapy
        else:
            yield from self.my_custom_parse_product(response)  # will actually scrape the data
But it may turn out that age-protected pages contain the same data in different elements (not in response.css('.apphub_AppName ::text'), for instance); in that case you will need to implement separate scraping code for each page type.
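The connecting piece in the code above is wrapping Selenium's rendered HTML in an HtmlResponse so Scrapy selectors can run on it; as a minimal standalone illustration (assuming driver is an already-initialised webdriver that has clicked past the age gate):

from scrapy.http import HtmlResponse

# wrap the page Selenium ended up on so Scrapy selectors can be used on it
rendered = HtmlResponse(url=driver.current_url,
                        body=driver.page_source,
                        encoding='utf-8')
names = rendered.css('.apphub_AppName ::text').extract()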
I am absolutely stuck on this one. I am scraping restaurant URLs from a webpage and there is a button at the bottom to reveal more restaurants. The website's button code is below (I believe):
<div id="restsPages">
  <a class="next" data-url="https://hungryhouse.co.uk/takeaways/aberdeen-bridge-of-dee-ab10">Show more</a>
  <a class="back">Back to top</a>
</div>
It is the "Show more" button I am trying to activate. The URL within the "data-url" attribute does not reveal more of the page.
I'm at a bit of a loss as to what to do to activate the button from within the Python spider.
The code I am trying to use to make this work is:
import scrapy
from hungryhouse.items import HungryhouseItem
from selenium import webdriver


class HungryhouseSpider(scrapy.Spider):
    name = "hungryhouse"
    allowed_domains = ["hungryhouse.co.uk"]
    start_urls = ["https://hungryhouse.co.uk/takeaways/westhill-ab10",
                  ]

    def __init__(self):
        self.driver = webdriver.Chrome()

    def parse(self, response):
        self.driver.get(response.url)
        while True:
            next = self.driver.find_element_by_xpath('//*[@id="restsPages"]/a[@class="next"]')
            try:
                next.click()
            except:
                break
        self.driver.close()
.... rest of the code follows
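For reference, after that click loop the extra restaurants exist only in Selenium's copy of the page, not in the original Scrapy response, so the rest of the code would need to re-read the rendered HTML before the driver is closed. A minimal sketch (the link selector and the item field are placeholders, not the real markup):

from scrapy.selector import Selector

# after the while loop, before self.driver.close()
sel = Selector(text=self.driver.page_source)
for href in sel.css('a.restaurant-link::attr(href)').extract():  # placeholder selector
    item = HungryhouseItem()
    item['url'] = href  # assumes the item defines a 'url' field
    yield item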
The error I get is: 'chromedriver' executable needs to be in PATH.
This was resolved at "Pressing a button within python code", with reference to the answer at "Error message: 'chromedriver' executable needs to be available in the path".
But specifically
self.driver = webdriver.Chrome()
needed to change to
self.driver = webdriver.Chrome("C:/Users/andrew/Downloads/chromedriver_win32/chromedriver.exe")
in my case.
i.e. I needed to add the path to chromedriver.exe.
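As a side note, on newer Selenium releases (4.x) the driver path is passed through a Service object instead of as the first positional argument; a minimal sketch:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# the path is machine-specific; point it at wherever chromedriver.exe lives
service = Service("C:/Users/andrew/Downloads/chromedriver_win32/chromedriver.exe")
driver = webdriver.Chrome(service=service)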
I'm scraping this site using Selenium. First, I click on the clear button beside "Attraction Type". Then I click on the "more" link at the bottom of the category list. Now, for each category, I find the element by id and click on the link. The problem is that as soon as I click on the first category, "Outdoor Activities", the website goes back to its initial state and I get the following error when I try to click the next link:
StaleElementReferenceException: Message: Element is no longer attached to the DOM
My code is:
import time

from scrapy.spiders import CrawlSpider
from selenium import webdriver


class TripSpider(CrawlSpider):
    name = "tspider"
    allowed_domains = ["tripadvisor.ca"]
    start_urls = ['http://www.tripadvisor.ca/Attractions-g147288-Activities-c42-Dominican_Republic.html']

    def __init__(self):
        self.driver = webdriver.Firefox()
        self.driver.maximize_window()

    def parse(self, response):
        self.driver.get(response.url)
        self.driver.find_element_by_class_name('filter_clear').click()
        time.sleep(3)
        self.driver.find_element_by_class_name('show').click()
        time.sleep(3)
        # to handle popups
        self.driver.switch_to.window(self.driver.window_handles[-1])
        # Close the new window
        self.driver.close()
        # Switch back to original browser (first window)
        self.driver.switch_to.window(self.driver.window_handles[0])
        divs = self.driver.find_elements_by_xpath('//div[contains(@id,"ATTR_CATEGORY")]')
        for d in divs:
            d.find_element_by_tag_name('a').click()
            time.sleep(3)
The problem with this website in particular is that each time you click on an element the DOM changes, so you can't loop through elements which have gone stale.
I had the same problem a short time ago, and I solved it by using a different window for each link.
You could change this part of the code:
divs = self.driver.find_elements_by_xpath('//div[contains(@id,"ATTR_CATEGORY")]')
for d in divs:
    d.find_element_by_tag_name('a').click()
    time.sleep(3)
to this:
from selenium.webdriver.common.keys import Keys

mainWindow = self.driver.current_window_handle
divs = self.driver.find_elements_by_xpath('//div[contains(@id,"ATTR_CATEGORY")]')
for d in divs:
    # Open the element in a new window
    d.find_element_by_tag_name('a').send_keys(Keys.SHIFT + Keys.ENTER)
    self.driver.switch_to.window(self.driver.window_handles[1])
    # Here you do whatever you want in the new window
    # Close the window and continue
    self.driver.find_element_by_tag_name('body').send_keys(Keys.CONTROL + 'w')
    self.driver.switch_to.window(mainWindow)
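If you would rather stay in a single window, another option is to re-locate the category elements on every pass instead of holding on to references that the DOM change invalidates; a rough sketch (it assumes the list order is stable and that going back restores the category page):

import time

# re-find the list on every pass so no stale reference is reused
count = len(self.driver.find_elements_by_xpath('//div[contains(@id,"ATTR_CATEGORY")]'))
for i in range(count):
    divs = self.driver.find_elements_by_xpath('//div[contains(@id,"ATTR_CATEGORY")]')
    divs[i].find_element_by_tag_name('a').click()
    time.sleep(3)
    # assumes navigating back restores the original category list
    self.driver.back()
    time.sleep(3)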
I want to get the next page of results in https://www.google.com.tw/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=test,
but my code does not work.
Please guide me. Thank you so much.
scrapy shell "https://www.google.com.tw/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=test"
response.xpath("//a[@id='pnnext']/@href")
Here is the working code:
scrapy shell "https://www.google.com.tw/search?q=test"
response.xpath("//a[@id='pnnext']/@href")
The issue was in the way you were making the request to Google: in the webhp URL the query sits after the # fragment, which is never sent to the server, so the results (and the pnnext link) only exist once the page's JavaScript runs; requesting the /search endpoint returns server-rendered results that Scrapy can parse directly.
In any case, be aware of the policies around scraping Google search.
Google's Custom Search Terms of Service (TOS) can be found at http://www.google.com/cse/docs/tos.html.
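For completeness, a minimal sketch of following that link from a spider, using the /search endpoint (pure Scrapy, under the assumption that the server-rendered results page keeps the pnnext anchor):

import scrapy


class GoogleNextSpider(scrapy.Spider):
    name = 'google_next'
    start_urls = ['https://www.google.com.tw/search?q=test']

    def parse(self, response):
        # ... extract whatever you need from the current results page here ...
        next_href = response.xpath("//a[@id='pnnext']/@href").extract_first()
        if next_href:
            # the href is relative, so resolve it against the current URL
            yield scrapy.Request(response.urljoin(next_href), callback=self.parse)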
UPDATE:
I wrote a spider to test this issue more in depth.
It's not Pythonic at all (improvements are welcome), but I was interested in the mechanics of dealing with Google results.
As previous comments suggested, a test for the internationalization of the interface is still needed.
import time

from scrapy.spiders import CrawlSpider
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException


class googleSpider(CrawlSpider):
    name = "googlish"
    allowed_domains = ["google.com"]
    start_urls = ["http://www.google.com"]

    def __init__(self):
        self.driver = webdriver.Firefox()

    def parse(self, response):
        self.driver.get(response.url)
        login_form = self.driver.find_element_by_name('q')
        login_form.send_keys("scrapy\n")
        time.sleep(4)
        found = False
        while not found:
            try:
                for element in self.driver.find_elements_by_xpath("//div[@class='rc']"):
                    print(element.text + "\n")
                for i in self.driver.find_elements_by_id('pnnext'):
                    i.click()
                    time.sleep(5)
            except NoSuchElementException:
                found = True
                pass
        self.driver.close()
Can you try using the XPath below and let me know what the result is? It looks like the XPath used is not pointing to the exact location of the web element in the DOM.
//a[@id='pnnext']//span[2]