I would like to scrape an arbitrary offer from AliExpress. I'm trying to use Scrapy and Selenium. The issue I face is that when I use Chrome and do right click > Inspect on an element I see the real HTML, but when I do right click > View source I see something different: a mess of HTML, CSS and JS all around.
As far as I understand, the content is pulled in asynchronously? I guess this is the reason why I can't find the element I am looking for on the page.
I tried to use Selenium to load the page first and then get the content I want, but failed. I'm trying to scroll down to the reviews section and get its content.
Is this some advanced anti-bot solution that they have, or is my approach wrong?
The code that I currently have:
import scrapy
from selenium import webdriver
import logging
import time

logging.getLogger('scrapy').setLevel(logging.WARNING)


class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://pl.aliexpress.com/item/32998115046.html']

    def __init__(self):
        self.driver = webdriver.Chrome()

    def parse(self, response):
        self.driver.get(response.url)
        scroll_retries = 20
        data = ''
        while scroll_retries > 0:
            try:
                data = self.driver.find_element_by_class_name('feedback-list-wrap')
                scroll_retries = 0
            except:
                self.scroll_down(500)
                scroll_retries -= 1
        print("----------")
        print(data)
        print("----------")
        self.driver.close()

    def scroll_down(self, pixels):
        self.driver.execute_script("window.scrollTo(0, {});".format(pixels))
        time.sleep(2)
By watching the requests in the Network tab of the browser's developer tools you will find the request that the comments are coming from, so you can crawl that URL directly instead.
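As an illustration, here is a minimal sketch of that approach with Scrapy alone; the feedback URL and the CSS class below are placeholders for whatever request and markup you actually see in the Network tab, not a confirmed AliExpress endpoint:

import scrapy


class FeedbackSpider(scrapy.Spider):
    # Hypothetical spider: replace feedback_url with the request that shows up
    # in the Network tab when the reviews are loaded.
    name = 'feedback'
    feedback_url = 'https://example.com/feedback?productId=32998115046&page={page}'

    def start_requests(self):
        yield scrapy.Request(self.feedback_url.format(page=1), callback=self.parse)

    def parse(self, response):
        # Such endpoints usually return plain HTML or JSON that can be parsed
        # directly, without Selenium; adjust the selector to the real markup.
        for review in response.css('.feedback-item'):
            yield {'text': ' '.join(review.css('::text').extract()).strip()}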
I'm new to Python and I'm trying to crawl a whole website recursively with Selenium.
I would like to do this with Selenium because I want to get all the cookies the website uses. I know that other tools can crawl a website more easily and faster, but other tools can't give me all the cookies (first and third party).
Here is my code:
from selenium import webdriver
import os, shutil

url = "http://example.com/"
links = set()


def crawl(start_link):
    driver.get(start_link)
    elements = driver.find_elements_by_tag_name("a")
    urls_to_visit = set()
    for el in elements:
        urls_to_visit.add(el.get_attribute('href'))
    for el in urls_to_visit:
        if url in el:
            if el not in links:
                links.add(el)
                crawl(el)
            else:
                return


dir_name = "userdir"
if os.path.isdir(dir_name):
    shutil.rmtree(dir_name)

co = webdriver.ChromeOptions()
co.add_argument("--user-data-dir=userdir")
driver = webdriver.Chrome(options=co)

crawl(url)
print(links)
driver.close()
My problem is that the crawl function apparently does not open all pages of the website. On some websites I can navigate by hand to pages that the function never reached. Why?
One thing I have noticed while using webdriver is that it needs time to load the page, just like a regular browser does; the elements are not instantly available.
You may want to add some delays, or a loop that checks for some element such as a footer to indicate that the page has loaded and you can start crawling.
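For example, here is a minimal sketch of an explicit wait with Selenium; the footer tag used as the "page is ready" signal is just an assumption, so swap in whatever element reliably appears last on the pages you crawl:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("http://example.com/")

# Wait up to 10 seconds for a footer element to appear before collecting links.
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.TAG_NAME, "footer"))
)
elements = driver.find_elements_by_tag_name("a")
print(len(elements))
driver.close()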
I just started and I've been on this for a week or two, just using the internet to help, but now I've reached the point where I can't understand it or my problem cannot be found anywhere else. In case you didn't understand my program: I want to scrape data, then click on a button, then scrape data again until I hit data I've already collected, then go to the next page in the list.
I've reached the point where I can scrape the first 8 entries, but I can't find a way to click on the "see more!" button. I know I should use Selenium and the button's XPath. Anyway, here is my code:
import scrapy
from selenium import webdriver


class KickstarterSpider(scrapy.Spider):
    name = 'kickstarter'
    allowed_domains = ['kickstarter.com']
    start_urls = ["https://www.kickstarter.com/projects/zwim/zwim-smart-swimming-goggles/community", "https://www.kickstarter.com/projects/zunik/oriboard-the-amazing-origami-multifunctional-cutti/community"]

    def __init__(self, driver):
        self.driver = webdriver.Chrome(chromedriver)

    def parse(self, response):
        self.driver.get('https://www.kickstarter.com/projects/zwim/zwim-smart-swimming-goggles/community')
        backers = response.css('.founding-backer.community-block-content')
        b = backers[0]
        while True:
            try:
                seemore = self.driver.find_element_by_xpath('//*[@id="content-wrap"]').click()
            except:
                break
        self.driver.close()

    def parse2(self, response):
        print('you are here!')
        for b in backers:
            name = b.css('.name.js-founding-backer-name::text').extract_first()
            backed = b.css('.backing-count.js-founding-backer-backings::text').extract_first()
            print(name, backed)
Be sure the web driver used in Scrapy loads and interprets JS (I don't know for sure... it could be a solution).
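If it does, a rough sketch of clicking the button until it disappears and then parsing the rendered source could look like this; the XPath for the "see more!" button is an assumption you would need to replace with the real one from the page:

import time

from scrapy.selector import Selector
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

driver = webdriver.Chrome()
driver.get('https://www.kickstarter.com/projects/zwim/zwim-smart-swimming-goggles/community')

while True:
    try:
        # Placeholder XPath: point this at the actual "see more!" button.
        driver.find_element_by_xpath('//button[contains(., "See more")]').click()
        time.sleep(2)  # give the newly loaded backers time to appear
    except NoSuchElementException:
        break  # button is gone, everything is loaded

# Parse the fully rendered page with Scrapy selectors instead of the original response.
sel = Selector(text=driver.page_source)
for b in sel.css('.founding-backer.community-block-content'):
    name = b.css('.name.js-founding-backer-name::text').extract_first()
    backed = b.css('.backing-count.js-founding-backer-backings::text').extract_first()
    print(name, backed)

driver.close()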
Hi, I'm new to Python and crawling. I've been researching and going over Stack Overflow and came up with Python + Selenium: open a webdriver, open the URL, get the page source and turn it into the data I need. However, I know there's a better approach (for example, scraping without Selenium, not having to scrape the page source, posting data to the ASP service, etc.), and I hope I can get some help here for educational purposes.
Here's what I'd like to achieve.
Start:
http://www.asos.com/Women/New-In-Clothing/Cat/pgecategory.aspx?cid=2623
Obtain: product title, price, img, and its link
Next: go to the next page if there is one; if not, output the results.
Before you go into my code, here is some background. ASOS is a site that uses pagination, so this is about scraping across multiple pages. Also, I tried without Selenium by posting to http://www.asos.com/services/srvWebCategory.asmx/GetWebCategories
with this data:
{'cid':'2623', 'strQuery':"", 'strValues':'undefined', 'currentPage':'0',
'pageSize':'204','pageSort':'-1','countryId':'10085','maxResultCount':''}
but I get nothing back.
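For reference, this is roughly what that POST attempt looked like as a standalone spider; the JSON body and the Content-Type header are only my guesses at what the ASMX service expects:

import json

import scrapy


class AsosApiSpider(scrapy.Spider):
    # Hypothetical spider name; the header and payload handling below are
    # assumptions, not a confirmed ASOS API.
    name = 'asos_api'

    def start_requests(self):
        payload = {'cid': '2623', 'strQuery': '', 'strValues': 'undefined',
                   'currentPage': '0', 'pageSize': '204', 'pageSort': '-1',
                   'countryId': '10085', 'maxResultCount': ''}
        yield scrapy.Request(
            url='http://www.asos.com/services/srvWebCategory.asmx/GetWebCategories',
            method='POST',
            body=json.dumps(payload),
            headers={'Content-Type': 'application/json; charset=UTF-8'},
            callback=self.parse_api,
        )

    def parse_api(self, response):
        # Log the raw response first to see whether the service returns anything at all.
        self.logger.info(response.text)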
I know my approach is not good, I'd much appreciate any help/recommendation/approach/idea! Thanks!
import scrapy
import time
import logging
from random import randint
from selenium import webdriver
from asos.items import ASOSItem
from scrapy.selector import Selector
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


class ASOSSpider(scrapy.Spider):
    name = "asos"
    allowed_domains = ["asos.com"]
    start_urls = [
        "http://www.asos.com/Women/New-In-Clothing/Cat/pgecategory.aspx?cid=2623#/parentID=-1&pge=0&pgeSize=204&sort="
    ]

    def __init__(self):
        self.driver = webdriver.Firefox()

    def parse(self, response):
        self.driver.get(response.url)
        view_204 = self.driver.find_element_by_xpath("//div[@class='product-count-bottom']/a[@class='view-max-paged']")
        view_204.click()  # click and show 204 pictures
        time.sleep(5)  # wait till the 204 images are loaded; I've also tried an explicit wait, but it timed out
        # element = WebDriverWait(self.driver, 8).until(EC.presence_of_element_located((By.XPATH, "category-controls bottom")))
        logging.debug("wait time has reached! go CRAWL!")
        next = self.driver.find_element_by_xpath("//li[@class='page-skip']/a")
        pageSource = Selector(text=self.driver.page_source)  # load the page source instead; can't seem to crawl the page by just passing the regular request
        for sel in pageSource.xpath("//ul[@id='items']/li"):
            item = ASOSItem()
            item["product_title"] = sel.xpath("a[@class='desc']/text()").extract()
            item["product_link"] = sel.xpath("a[@class='desc']/@href").extract()
            item["product_price"] = sel.xpath("div/span[@class='price']/text()").extract()
            item["product_img"] = sel.xpath("div/a[@class='productImageLink']/img/@src").extract()
            yield item
        next.click()
        self.driver.close()
I'm trying to use Scrapy to crawl some advertising information from this website.
That website has some div tags with class="product-card new_ outofstock installments_ ".
When I use:
items = response.xpath("//div[contains(@class, 'product-')]")
I get some nodes with the class attribute "product-description", but not "product-card".
When I use:
items = response.xpath("//div[contains(@class, 'product-card')]")
I still get nothing as a result.
Why is that?
As pointed out in the previous answer, the content you are trying to scrape is generated dynamically using JavaScript. If performance is not a big deal for you, then you can use Selenium to emulate a real user and interact with the site, while letting Scrapy extract the data for you.
If you want a similar example of how to do this, consider this tutorial: http://www.6020peaks.com/2014/12/how-to-scrape-hidden-web-data-with-python/
The data you want is being populated by JavaScript.
You would have to use a Selenium webdriver to extract the data.
If you want to check beforehand whether the data is being populated using JavaScript, open a Scrapy shell and try extracting the data as below.
scrapy shell 'http://www.lazada.vn/dien-thoai-may-tinh-bang/?ref=MT'
>>> response.xpath('//div[contains(@class,"product-card")]')
Output:
[]
Now, if you use the same XPath in the browser and do get a result, then the data is populated using scripts and Selenium has to be used to get it.
Here is an example of extracting the data using Selenium:
import scrapy
from selenium import webdriver
from scrapy.http import TextResponse


class ProductSpider(scrapy.Spider):
    name = "product_spider"
    allowed_domains = ['lazada.vn']
    start_urls = ['http://www.lazada.vn/dien-thoai-may-tinh-bang/?ref=MT']

    def __init__(self):
        self.driver = webdriver.Firefox()

    def parse(self, response):
        self.driver.get(response.url)
        page = TextResponse(response.url, body=self.driver.page_source, encoding='utf-8')
        required_data = page.xpath('//div[contains(@class,"product-card")]').extract()
        self.driver.close()
Here are some examples of "selenium spiders":
Executing Javascript Submit form functions using scrapy in python
Snipplr
Scrapy with selenium
Extract data from dynamic webpages
I want to extract data from Amazon.
This is my source code:
from scrapy.contrib.spiders import CrawlSpider
from scrapy import Selector
from selenium import webdriver
from selenium.webdriver.support.select import Select
from time import sleep
import selenium.webdriver.support.ui as ui
from scrapy.xlib.pydispatch import dispatcher
from scrapy.http import HtmlResponse, TextResponse
from extraction.items import ProduitItem


class RunnerSpider(CrawlSpider):
    name = 'products'
    allowed_domains = ['amazon.com']
    start_urls = ['http://www.amazon.com']

    def __init__(self):
        self.driver = webdriver.Firefox()

    def parse(self, response):
        items = []
        sel = Selector(response)
        self.driver.get(response.url)
        recherche = self.driver.find_element_by_xpath('//*[@id="twotabsearchtextbox"]')
        recherche.send_keys("A")
        recherche.submit()
        resultat = self.driver.find_element_by_xpath('//ul[@id="s-results-list-atf"]')
        resultas = resultat.find_elements_by_xpath('//li')
        for result in resultas:
            item = ProduitItem()
            lien = result.find_element_by_xpath('//div[@class="s-item-container"]/div/div/div[2]/div[1]/a')
            lien.click()
            # lien.implicitly_wait(2)
            res = self.driver.find_element_by_xpath('//h1[@id="aiv-content-title"]')
            item['TITRE'] = res.text
            item['IMAGE'] = lien.find_element_by_xpath('//div[@id="dv-dp-left-content"]/div[1]/div/div/img').get_attribute('src')
            items.append(item)
        self.driver.close()
        yield items
When I run my code I get this error:
Element not found in the cache - perhaps the page has changed since it was looked up Stacktrace:
If you tell Selenium to click on a link, you are moved from the original page to the page behind the link.
In your case you have a results page with some URLs to products on Amazon; then you click one of the links in this result list and are moved to the detail page. At that point the page changes, and the rest of the elements you want to iterate over in your for loop are no longer there -- that's why you get the exception.
Why don't you use the search results page to extract the title and the image? Both are there; you would only need to change the XPath expressions to get the right fields from your lien.
Update
To get the title from the search results page, extract the text in the h2 element of the a element you want to click.
To get the image you need to take the other div in the li element: where your XPath selects div[2], you need to select div[1] to get the image.
If you open the search results page in the browser and look at the source with the developer tools, you can see which XPath expressions to use for the elements.
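A rough sketch of that idea, replacing the body of the for loop in your parse() method so it stays on the results page instead of clicking through; the exact XPath expressions are assumptions that you would need to verify in the developer tools:

for result in resultas:
    item = ProduitItem()
    # Note the leading dot: the search is relative to each result, not the whole page.
    # Title: the text of the h2 inside the link you were going to click.
    item['TITRE'] = result.find_element_by_xpath(
        './/div[@class="s-item-container"]//a/h2').text
    # Image: the other column of the result (div[1] instead of div[2]).
    item['IMAGE'] = result.find_element_by_xpath(
        './/div[@class="s-item-container"]/div/div/div[1]//img').get_attribute('src')
    items.append(item)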