I've been trying to scrape the Bioshock games from the Steam store and save their name, price and link in a CSV file. I know how to do it just by using Scrapy, but I really want to know if there's a way to do it by combining Scrapy and Selenium. I want to use Selenium only to get past the age-check gate that pops up on certain game store pages.
So I've managed to scrape games that don't have the age gate using Scrapy, and I've managed to bypass the age gates using Selenium.
The problem I'm having is passing the page that Selenium opened after bypassing the age gate back to Scrapy so it can crawl it. Since everything works fine on its own, I've concluded that the problem is that I don't know how to connect them.
def parse_product(self, response):
    product = ScrapesteamItem()
    sel = self.driver

    # Passing the first age gate
    if '/agecheck/app/' in response.url:
        sel.get(response.url)
        select = Select(sel.find_element_by_xpath('//*[@id="ageYear"]'))
        select.select_by_visible_text("1900")
        sel.find_element_by_xpath('//*[@id="agecheck_form"]/a').click()
        # Pass the site Selenium just opened back to Scrapy

    # Passing the second age gate
    elif '/agecheck' in response.url:
        sel.get(response.url)
        sel.find_element_by_xpath('//*[@id="app_agegate"]/div[3]/a[1]').click()
        # Pass the site Selenium just opened back to Scrapy

    # Scraping the data with Scrapy
    else:
        name = response.css('.apphub_AppName ::text').extract()
        price = response.css('div.game_purchase_price ::text, div.discount_final_price ::text').extract()
        link = response.css('head > link:nth-child(40) ::attr(href)').extract()

        for product in zip(name, price, link):
            scrapedInfo = {
                'NAME': product[0],
                'PRICE': product[1].strip(),
                'LINK': product[2]
            }
            yield scrapedInfo
I hope someone will know how to do it (if it's even possible).
P.S. I know there are much better ways to scrape the Steam store (there's probably an API), but before I go and learn that, I'd like to know whether there's a way to do it like this, even if it's sub-optimal.
The straightforward answer is: apply the same scraping code you used in the "scraping the data with Scrapy" branch, i.e. something like this:
from scrapy.http import HtmlResponse
from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):
    sitemap_urls = ['http://www.xxxxxxxSpider.com']
    sitemap_rules = [
        ('/product/', 'parse_product'),
    ]

    def my_custom_parse_product(self, response):
        name = response.css('.apphub_AppName ::text').extract()
        price = response.css('div.game_purchase_price ::text, div.discount_final_price ::text').extract()
        link = response.css('head > link:nth-child(40) ::attr(href)').extract()

        for product in zip(name, price, link):
            scrapedInfo = {
                'NAME': product[0],
                'PRICE': product[1].strip(),
                'LINK': product[2]
            }
            yield scrapedInfo

    def parse_product(self, response):
        product = ScrapesteamItem()
        sel = self.driver

        # Passing the first age gate
        if '/agecheck/app/' in response.url:
            sel.get(response.url)
            select = Select(sel.find_element_by_xpath('//*[@id="ageYear"]'))
            select.select_by_visible_text("1900")
            sel.find_element_by_xpath('//*[@id="agecheck_form"]/a').click()
            # Wrap the page Selenium ended up on in a Scrapy response and scrape it
            response = HtmlResponse(url=response.url, body=sel.page_source, encoding='utf-8')
            yield from self.my_custom_parse_product(response)

        # Passing the second age gate
        elif '/agecheck' in response.url:
            sel.get(response.url)
            sel.find_element_by_xpath('//*[@id="app_agegate"]/div[3]/a[1]').click()
            # Wrap the page Selenium ended up on in a Scrapy response and scrape it
            response = HtmlResponse(url=response.url, body=sel.page_source, encoding='utf-8')
            yield from self.my_custom_parse_product(response)

        # Scraping the data with Scrapy
        else:
            yield from self.my_custom_parse_product(response)  # will actually scrape the data
But it may turn out that the age-protected pages contain the same data in different elements (not in response.css('.apphub_AppName ::text'), for instance); in that case you will need to implement your own scraping code for each page type.
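If that happens, a rough sketch of a second scrape helper could look like the following; the selectors here are assumptions, not the verified post-age-gate markup:

def parse_gated_product(self, response):
    # Hypothetical selectors for the post-age-gate layout; inspect the real
    # page and adjust them before relying on this.
    name = response.css('.apphub_AppName ::text').extract()
    price = response.css('div.game_purchase_price ::text').extract()
    for n, p in zip(name, price):
        yield {'NAME': n, 'PRICE': p.strip(), 'LINK': response.url}

parse_product would then call this helper in the age-gate branches instead of my_custom_parse_product.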
Related
Please note: I'm very inexperienced and this is my first 'real' project.
I'm going to try to explain my problem as best as I can, apologies if some of the terms are incorrect.
I'm trying to scrape the following webpage - https://www.eaab.org.za/agent_agency_search?type=Agents&search_agent=+&submit_agent_search=GO
I can scrape the 'Name' and 'Status', but I also need to get some of the information in the 'Full Details' popup window.
I have noticed that when clicking on the 'Full Details' button the URL stays the same.
Below is what my code looks like:
import scrapy
from FirstScrape.items import FirstscrapeItem

class FirstSpider(scrapy.Spider):
    name = "spiderman"
    start_urls = [
        "https://www.eaab.org.za/agent_agency_search?type=Agents&search_agent=+&submit_agent_search=GO"
    ]

    def parse(self, response):
        item = FirstscrapeItem()
        item['name'] = response.xpath("//tr[@class='even']/td[1]/text()").get()
        item['status'] = response.xpath("//tr[@class='even']/td[2]/text()").get()
        # 'first' refers to the first name in the popup window
        item['first'] = response.xpath("//div[@class='result-list default']/tbody/tr[2]/td[2]/text()").get()
        return item
I launch my code from the terminal and export it to a .csv file.
For reference, the popup is a fancybox-style window.
Do I need to use Selenium to click on the button or am I just missing something? Any help will be appreciated.
I'm very eager to learn more about Python and scraping.
Thank you.
The 'Full Details' button has an href attribute; you need to get that URL and make a request to it.
Maybe this helps you:
import scrapy
from scrapy.crawler import CrawlerProcess

class FirstSpider(scrapy.Spider):
    name = "spiderman"
    start_urls = [
        "https://www.eaab.org.za/agent_agency_search?type=Agents&search_agent=+&submit_agent_search=GO"
    ]

    def parse(self, response):
        all_urls = [i.attrib["href"] for i in response.css(".agent-detail")]
        for url in all_urls:
            yield scrapy.Request(url=f"https://www.eaab.org.za{url}", callback=self.parse_data)

    def parse_data(self, response):
        print(response.css("td::text").extract())
        print("-----------------------------------")
The URL you need to extract from your starting page is the href behind each 'Full Detail' link. To get the content of the pop-up window, open this extracted URL in another request.
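A rough sketch of how parse_data could feed the existing FirstscrapeItem instead of just printing; the cell positions below are assumptions, so inspect the detail page's HTML to find the right ones:

def parse_data(self, response):
    item = FirstscrapeItem()
    # Placeholder positions: check which table cells actually hold these values.
    cells = [c.strip() for c in response.css("td::text").getall()]
    item['first'] = cells[1] if len(cells) > 1 else None
    item['status'] = cells[2] if len(cells) > 2 else None
    yield item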
I tried many ways to scrape the IKEA page and figured out that on the last page IKEA actually shows all the items. But when I try to scrape IKEA's last product page, it only returns the first 24 items (which correspond to the items displayed on the first page).
this is the URL of the page:
https://www.ikea.com/fr/fr/cat/lits-bm003/?page=12
and this is the spider :
import scrapy
import pprint

class SpiderSpider(scrapy.Spider):
    name = 'Ikea'
    pages = 9
    start_urls = ['https://www.ikea.com/fr/fr/cat/canapes-fu003/?page=12']

    def parse(self, response):
        data = {}
        products = response.css('div.plp-product-list')
        for product in products:
            for p in product.css('div.range-revamp-product-compact'):
                yield {
                    'Title': p.css('div.range-revamp-header-section__title--small::text').getall()[0],
                    'Price': p.css('span.range-revamp-price__integer::text').getall()[0],
                    'Desc': p.css('span.range-revamp-header-section__description-text::text').getall()[0],
                    'Img': p.css('img.range-revamp-aspect-ratio-image__image::attr(src)').getall()[0]
                }
Scrapy's spider doesn't run JavaScript (that's a browser's job); it only loads the same response content that cURL would.
To do exactly what you suggest, you need a browser-based solution like Selenium (Python) or Cypress (JavaScript); either that, or go through each page separately. Try using a 'headless browser'.
There are probably better ways of doing this, but to address your exact question, this is the intended answer.
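For example, a minimal sketch of the browser-based route with headless Chrome. It reuses the CSS classes from the question; whether the page needs extra scrolling or waiting before everything is rendered is not verified here:

from scrapy import Selector
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)

driver.get('https://www.ikea.com/fr/fr/cat/lits-bm003/?page=12')
# The browser executes the page's JavaScript, so page_source can hold every
# rendered product card rather than only the server-rendered first 24.
sel = Selector(text=driver.page_source)
for p in sel.css('div.range-revamp-product-compact'):
    print(p.css('div.range-revamp-header-section__title--small::text').get())
driver.quit()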
I just started and I've been at this for a week or two, using only the internet for help, but now I've reached a point I can't figure out, and my problem can't be found anywhere else. In case my program isn't clear: I want to scrape data, then click a button, then keep scraping until I hit data I've already collected, then go to the next page in the list.
I've reached the point where I can scrape the first 8 entries, but I can't find a way to click on the "see more!" button. I know I should use Selenium and the button's XPath. Anyway, here is my code:
import scrapy
from selenium import webdriver

class KickstarterSpider(scrapy.Spider):
    name = 'kickstarter'
    allowed_domains = ['kickstarter.com']
    start_urls = ["https://www.kickstarter.com/projects/zwim/zwim-smart-swimming-goggles/community", "https://www.kickstarter.com/projects/zunik/oriboard-the-amazing-origami-multifunctional-cutti/community"]

    def __init__(self, driver):
        self.driver = webdriver.Chrome(chromedriver)

    def parse(self, response):
        self.driver.get('https://www.kickstarter.com/projects/zwim/zwim-smart-swimming-goggles/community')
        backers = response.css('.founding-backer.community-block-content')
        b = backers[0]

        while True:
            try:
                seemore = self.driver.find_element_by_xpath('//*[@id="content-wrap"]').click()
            except:
                break
        self.driver.close()

    def parse2(self, response):
        print('you are here!')
        for b in backers:
            name = b.css('.name.js-founding-backer-name::text').extract_first()
            backed = b.css('.backing-count.js-founding-backer-backings::text').extract_first()
            print(name, backed)
Be sure the web driver used with Scrapy loads and interprets the JavaScript (that may be the solution).
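For example, inside parse you could keep clicking the button until it disappears and only then hand the expanded HTML back to Scrapy selectors. A rough sketch; the button locator is an assumption, so point it at the real "see more!" element:

from scrapy import Selector
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

wait = WebDriverWait(self.driver, 10)
while True:
    try:
        # Placeholder locator: replace with the real "see more!" button's selector.
        button = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, 'button.see-more')))
        button.click()
    except TimeoutException:
        break  # no more button, so everything is loaded

# Hand the fully expanded page back to Scrapy selectors
sel = Selector(text=self.driver.page_source)
backers = sel.css('.founding-backer.community-block-content')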
I'm new to Scrapy and tried to crawl a couple of sites, but I wasn't able to get more than a few images from them.
For example, for http://shop.nordstrom.com/c/womens-dresses-new with the following code -
def parse(self, response):
    for dress in response.css('article.npr-product-module'):
        yield {
            'src': dress.css('img.product-photo').xpath('@src').extract_first(),
            'url': dress.css('a.product-photo-href').xpath('@href').extract_first()
        }
I got 6 products. I expect 66.
For URL https://www.renttherunway.com/products/dress with the following code -
def parse(self, response):
    for dress in response.css('div.cycle-image-0'):
        yield {
            'image-url': dress.xpath('.//img/@src').extract_first(),
        }
I got 12. I expect roughly 100.
Even when I changed it to crawl every 'next' page, I got the same number per page but it went through all pages successfully.
I have tried a different USER_AGENT, disabled COOKIES, and used a DOWNLOAD_DELAY of 5.
I imagine I would run into the same problem on any site, so people must have seen this before, but I can't find a reference to it.
What am I missing?
It's one of those weird websites where they store the product data as JSON in the HTML source and unpack it with JavaScript on page load.
To figure this out usually what you want to do is
disable javascript and do scrapy view <url>
investigate the results
find the product id in the product URL and search for that id in the page source to check whether it exists and, if so, where it is hidden. If it doesn't exist, it's being populated by some AJAX request: re-enable JavaScript, go to the page, and dig through the browser inspector's network tab to find it.
If you do a regex-based search:
re.findall(r"ProductResults, (\{.+\})\)", response.body_as_unicode())
you'll get a huge JSON string that contains all of the products and their information.
import json
import re

data = re.findall(r"ProductResults, (\{.+\})\)", response.body_as_unicode())
data = json.loads(data[0])['data']
print(len(data['ProductResult']['Products']))
>> 66
That gets the correct number of products!
So in your parse you can do this:
def parse(self, response):
    data = re.findall(r"ProductResults, (\{.+\})\)", response.body_as_unicode())
    data = json.loads(data[0])['data']
    for product in data['ProductResult']['Products']:
        # find the main image
        image_url = [m['Url'] for m in product['Media'] if m['Type'] == 'MainImage']
        yield {'image_url': image_url}
I am trying to extract data from a search box; you can see a good example on Wikipedia.
This is my code:
driver = webdriver.Firefox()
driver.get(response.url)

city = driver.find_element_by_id('searchInput')
city.click()
city.clear()
city.send_keys('a')
time.sleep(1.5)  # waiting for AJAX to load

selen_html = driver.page_source
# print selen_html.encode('utf-8')
hxs = HtmlXPathSelector(text=selen_html)
ajaxWikiList = hxs.select('//div[@class="suggestions"]')

items = []
for city in ajaxWikiList:
    item = TestItem()
    item['ajax'] = city.select('/div[@class="suggestions-results"]/a/@title').extract()
    items.append(item)
print items
The XPath expression is OK, I checked it on a static page. If I uncomment the line that prints the scraped HTML, the code for the box shows up at the end of the file. But for some reason I can't extract data from it with the above code. I must be missing something, since I tried two different sources; the Wikipedia page is just another source where I can't get this data extracted.
Any advice here? Thanks!
Instead of passing .page_source, which in your case contains an empty suggestions div, get the innerHTML of the element and pass that to the selector:
selen_html = driver.find_element_by_class_name('suggestions').get_attribute('innerHTML')
hxs = HtmlXPathSelector(text=selen_html)

suggestions = hxs.select('//div[@class="suggestions-results"]/a/@title').extract()
for suggestion in suggestions:
    print suggestion
Outputs:
Animal
Association football
Arthropod
Australia
AllMusic
African American (U.S. Census)
Album
Angiosperms
Actor
American football
Note that it would be better to use Selenium's wait feature to wait for the element to be accessible/visible; see:
How can I get Selenium Web Driver to wait for an element to be accessible, not just present?
Selenium waitForElement
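A minimal sketch of such an explicit wait, using the suggestions class from the answer above:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the suggestions dropdown to become visible
# before reading its innerHTML.
wait = WebDriverWait(driver, 10)
suggestions_el = wait.until(EC.visibility_of_element_located((By.CLASS_NAME, 'suggestions')))
selen_html = suggestions_el.get_attribute('innerHTML')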
Also, note that HtmlXPathSelector is deprecated, use Selector instead.
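For reference, the same extraction with the non-deprecated Selector would look roughly like this:

from scrapy import Selector

sel = Selector(text=selen_html)
for title in sel.xpath('//div[@class="suggestions-results"]/a/@title').getall():
    print(title)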