Using Scrapy to crawl nodes - Python

I'm trying to use Scrapy to crawl some advertising information from this website.
The website has some div tags with class="product-card new_ outofstock installments_ ".
When I use:
items = response.xpath("//div[contains(@class, 'product-')]")
I get some nodes with a class attribute of "product-description", but none with "product-card".
When I use:
items = response.xpath("//div[contains(@class, 'product-card')]")
I still get nothing in the result.
Why is that?

As pointed out in the previous answer, the content you are trying to scrape is generated dynamically with JavaScript. If performance is not a big deal for you, you can use Selenium to emulate a real user and interact with the site, and at the same time let Scrapy get the data for you.
If you want a similar example of how to do this, consider this tutorial: http://www.6020peaks.com/2014/12/how-to-scrape-hidden-web-data-with-python/

The data you want is being populated by JavaScript, so you will have to use a Selenium WebDriver to extract it.
If you want to check beforehand whether the data is populated by JavaScript, open a Scrapy shell and try extracting it as below:
scrapy shell 'http://www.lazada.vn/dien-thoai-may-tinh-bang/?ref=MT'
>>> response.xpath('//div[contains(@class, "product-card")]')
Output:
[]
Now, if the same XPath returns results in the browser's developer tools, then the data is populated by scripts and Selenium has to be used to get it.
Here is an example that extracts the data using Selenium:
import scrapy
from selenium import webdriver
from scrapy.http import TextResponse


class ProductSpider(scrapy.Spider):
    name = "product_spider"
    allowed_domains = ['lazada.vn']
    start_urls = ['http://www.lazada.vn/dien-thoai-may-tinh-bang/?ref=MT']

    def __init__(self):
        self.driver = webdriver.Firefox()

    def parse(self, response):
        # Let the real browser render the JavaScript first
        self.driver.get(response.url)
        # Wrap the rendered source in a TextResponse so Scrapy selectors work on it
        page = TextResponse(response.url, body=self.driver.page_source, encoding='utf-8')
        required_data = page.xpath('//div[contains(@class, "product-card")]').extract()
        self.driver.close()
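Note that this parse() only collects the matched nodes into required_data; in a real spider you would build and yield items from that list before closing the driver, otherwise Scrapy has nothing to export.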
Here are some examples of "selenium spiders":
Executing Javascript Submit form functions using scrapy in python
Snipplr
Scrapy with selenium
Extract data from dynamic webpages

Related

HTML acquired in Python code is not the same as displayed webpage

I have recently started learning web scraping with Scrapy and, as practice, I decided to scrape a weather data table from this url.
By inspecting the table element of the page, I copied its XPath into my code, but I only get an empty list when running it. I tried to check which tables are present in the HTML with this code:
from scrapy import Selector
import requests

url = 'https://www.wunderground.com/history/monthly/OIII/date/2000-5'
html = requests.get(url).text  # .text returns a str, which Selector expects (.content is bytes)
sel = Selector(text=html)
table = sel.xpath('//table')
It only returns one table, and it is not the one I wanted.
After some research, I found out that it might have something to do with JavaScript rendering in the page source and that Python requests can't handle JavaScript.
After going through a number of SO Q&As, I came upon the requests-html library, which can apparently handle JS execution, so I tried acquiring the table with this code snippet:
from requests_html import HTMLSession
from scrapy import Selector
session = HTMLSession()
resp = session.get('https://www.wunderground.com/history/monthly/OIII/date/2000-5')
resp.html.render()
html = resp.html.html
sel = Selector(text=html)
tables = sel.xpath('//table')
print(tables)
But the result doesn't change. How can I acquire that table?
Problem
Multiple problems may be at play here: not only JavaScript execution, but also HTML5 APIs, cookies, user agent, etc.
Solution
Consider using Selenium with a headless Chrome or Firefox WebDriver. Using Selenium with a WebDriver ensures that the page is loaded as intended. Headless mode means you can run your code without spawning a GUI browser; you can, of course, disable headless mode to see what is being done to the page in real time, and even add a breakpoint so that you can debug in the browser's console beyond what pdb offers.
Example Code:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--headless")

driver = webdriver.Chrome(options=chrome_options)
driver.get("https://www.wunderground.com/history/monthly/OIII/date/2000-5")
tables = driver.find_elements_by_xpath('//table')  # several element-locating APIs are available; see the references
print(tables)
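Since the goal here is a data table, a possible follow-up (a sketch continuing the snippet above; it assumes pandas is installed and that the rendered page actually contains the target table) is to hand the rendered source straight to pandas:
import pandas as pd

# read_html parses every <table> in the rendered HTML into a DataFrame
dfs = pd.read_html(driver.page_source)
print(dfs[0].head())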
References
Selenium Github: https://github.com/SeleniumHQ/selenium
Selenium (Python) Documentation: https://selenium-python.readthedocs.io/getting-started.html
Locating Elements: https://selenium-python.readthedocs.io/locating-elements.html
You can also use the scrapy-splash plugin to make Scrapy work with Splash (Scrapinghub's JavaScript rendering browser).
Using Splash you can render JavaScript and also execute user events like mouse clicks.
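A rough sketch of what that looks like (the settings follow the scrapy-splash README and assume a Splash instance running locally on port 8050; the spider name and XPath are placeholders):
# settings.py (per the scrapy-splash README)
# SPLASH_URL = 'http://localhost:8050'
# DOWNLOADER_MIDDLEWARES = {
#     'scrapy_splash.SplashCookiesMiddleware': 723,
#     'scrapy_splash.SplashMiddleware': 725,
#     'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
# }
# SPIDER_MIDDLEWARES = {'scrapy_splash.SplashDeduplicateArgsMiddleware': 100}
# DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

import scrapy
from scrapy_splash import SplashRequest


class WeatherSpider(scrapy.Spider):
    name = 'weather'  # placeholder name

    def start_requests(self):
        # 'wait' gives the page time to finish rendering its JavaScript
        yield SplashRequest(
            'https://www.wunderground.com/history/monthly/OIII/date/2000-5',
            self.parse,
            args={'wait': 5},
        )

    def parse(self, response):
        # response now contains the rendered HTML
        for table in response.xpath('//table'):
            yield {'table_html': table.get()}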

Cannot scrape AliExpress HTML element

I would like to scrape an arbitrary offer from AliExpress. I'm trying to use Scrapy and Selenium. The issue I face is that when I use Chrome and do right-click > Inspect on an element, I see the real HTML, but when I do right-click > View source, I see something different: a mess of HTML, CSS and JS all around.
As far as I understand, the content is pulled asynchronously? I guess this is the reason why I can't find the element I am looking for on the page.
I was trying to use Selenium to load the page first and then get the content I want, but failed. I'm trying to scroll down to get to the reviews section and read its content.
Is this some advanced anti-bot solution that they have, or is my approach wrong?
The code that I currently have:
import scrapy
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
import logging
import time

logging.getLogger('scrapy').setLevel(logging.WARNING)


class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://pl.aliexpress.com/item/32998115046.html']

    def __init__(self):
        self.driver = webdriver.Chrome()

    def parse(self, response):
        self.driver.get(response.url)
        scroll_retries = 20
        data = ''
        # Keep scrolling until the reviews container appears in the DOM
        while scroll_retries > 0:
            try:
                data = self.driver.find_element_by_class_name('feedback-list-wrap')
                scroll_retries = 0
            except NoSuchElementException:
                self.scroll_down(500)
                scroll_retries -= 1
        print("----------")
        print(data)
        print("----------")
        self.driver.close()

    def scroll_down(self, pixels):
        # scrollBy scrolls down relative to the current position
        self.driver.execute_script("window.scrollBy(0, {});".format(pixels))
        time.sleep(2)
By watching the requests in the network tab of the browser's inspect tool, you will find that the comments come from a separate request, so you can crawl that endpoint instead.
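A sketch of that approach (the endpoint URL and the CSS selectors below are placeholders; copy the real request URL from the network tab for the product in question):
import scrapy


class FeedbackSpider(scrapy.Spider):
    name = 'feedback'
    # Placeholder URL: paste the actual feedback request seen in the network tab
    start_urls = ['https://feedback.aliexpress.com/display/productEvaluation.htm?productId=32998115046&page=1']

    def parse(self, response):
        # The feedback endpoint returns server-rendered HTML, so plain Scrapy
        # selectors work here without a browser.
        for review in response.css('.feedback-item'):
            yield {'review': review.css('::text').getall()}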

Crawl a whole website recursively with Selenium in Python

I'm new to Python and I am trying to crawl a whole website recursively with Selenium.
I would like to do this with Selenium because I want to collect all the cookies the website sets. I know that other tools can crawl a website more easily and faster, but they can't give me all the cookies (first and third party).
Here is my code:
from selenium import webdriver
import os, shutil

url = "http://example.com/"
links = set()


def crawl(start_link):
    driver.get(start_link)
    elements = driver.find_elements_by_tag_name("a")
    urls_to_visit = set()
    for el in elements:
        urls_to_visit.add(el.get_attribute('href'))
    for el in urls_to_visit:
        if url in el:
            if el not in links:
                links.add(el)
                crawl(el)
            else:
                return


dir_name = "userdir"
if os.path.isdir(dir_name):
    shutil.rmtree(dir_name)

co = webdriver.ChromeOptions()
co.add_argument("--user-data-dir=userdir")
driver = webdriver.Chrome(options=co)

crawl(url)
print(links)
driver.close()
My problem is that the crawl function apparently does not open all the pages of the website. On some websites I can navigate by hand to pages that the function never reaches. Why?
One thing I have noticed while using WebDriver is that it needs time to load the page; the elements are not instantly available, unlike in a regular browser.
You may want to add some delays, or a loop that checks for some element such as the footer, to make sure the page is loaded before you start crawling.
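A minimal sketch of that idea using Selenium's explicit waits (the footer tag is just an example of a "page is ready" marker; pick whatever element suits the site):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("http://example.com/")

# Block for up to 10 seconds until a <footer> element is present in the DOM;
# only then start collecting the links
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.TAG_NAME, "footer"))
)
elements = driver.find_elements(By.TAG_NAME, "a")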

Scrapy for dynamic data website using Angular or VueJs

How do I scrape data with the Scrapy framework from websites that load their data via JavaScript frameworks? Scrapy downloads the HTML from each page request, but some websites use JS frameworks like Angular or VueJS which load the data separately.
I have tried using a combination of Scrapy, Selenium and ChromeDriver to retrieve the rendered HTML with its content. But with this method I am not able to retain the session cookies set for selecting country and currency, because each request is handled by a separate instance of Selenium or Chrome.
Please suggest whether there are any alternative options to scrape the dynamic content while retaining the session.
Here is the code I used to set the country and currency:
import scrapy
from selenium import webdriver


class SettingSpider(scrapy.Spider):
    name = 'setting'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    def __init__(self):
        self.driver = webdriver.Chrome()

    def start_requests(self):
        url = 'http://www.example.com/intl/settings'
        self.driver.get(url)  # was response.url, but there is no response object here yet
        yield scrapy.Request(url, self.parse)

    def parse(self, response):
        csrf = response.xpath('//input[@name="CSRFToken"]/@value').extract_first().strip()
        print('------------->' + csrf)
        url = 'http://www.example.com/intl/settings'
        form_data = {'shippingCountry': 'ARE', 'language': 'en', 'billingCurrency': 'USD',
                     'indicativeCurrency': '', 'CSRFToken': csrf}
        yield scrapy.FormRequest(url, formdata=form_data, callback=self.after_post)
What you said,
"as each request is handled by a separate instance of selenium or chrome"
is not correct. You can continue to use Selenium, and I suggest you use PhantomJS instead of Chrome. I can't help more because you didn't post your code.
Here is one example with PhantomJS:
from selenium import webdriver
driver = webdriver.PhantomJS()
driver.set_window_size(1120, 800)
driver.get("https://example.com/")
driver.close()
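To actually retain the session across requests, one option (a sketch, not part of the original answer) is to keep a single driver instance alive for the whole crawl and copy its cookies into your Scrapy requests:
import scrapy
from selenium import webdriver


class SessionSpider(scrapy.Spider):
    # Hypothetical example spider, not from the question
    name = 'session'
    start_urls = ['http://www.example.com/intl/settings']

    def __init__(self):
        # One shared browser instance, so its cookie jar persists across pages
        self.driver = webdriver.Chrome()

    def parse(self, response):
        self.driver.get(response.url)
        # ... interact with the page here to set country and currency ...
        # Copy the browser's session cookies into the next Scrapy request
        cookies = {c['name']: c['value'] for c in self.driver.get_cookies()}
        yield scrapy.Request('http://www.example.com/intl/products',
                             cookies=cookies, callback=self.parse_products)

    def parse_products(self, response):
        pass  # the session set in the browser is retained here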
And if you don't want to use Selenium, you can use Splash:
Splash is a javascript rendering service with an HTTP API. It's a lightweight browser with an HTTP API, implemented in Python 3 using Twisted and QT5
as @Granitosaurus said in this question:
Bonus points for it being developed by the same guys who are developing Scrapy.

Python/Scrapy scraping from Techcrunch

I am trying to build a spider to scrape some data from the website Techcrunch - Heartbleed search.
My thought was to pass a tag when executing the spider from the command line (example: Heartbleed). The spider should then search through all the associated search results, open each link, and get the data contained within.
import scrapy


class TechcrunchSpider(scrapy.Spider):
    name = "tech_search"

    def start_requests(self):
        url = 'https://techcrunch.com/'
        tag = getattr(self, 'tag', None)
        if tag is not None:
            url = url + '?s=' + tag
        yield scrapy.Request(url, self.parse)

    def parse(self, response):
        pass
This code can be executed with: scrapy crawl tech_search -s DOWNLOAD_DELAY=1.5 -o tech_search.jl -a tag=EXAMPLEINPUT
Getting the data from the individual pages is not the problem, but actually getting the URLs to them is (from the search page linked above).
The thing is, when looking at the HTML source (Ctrl + U) of the search site (link above), I can't find anything about the searched elements (example: "What Is Heartbleed? The Video"). Any suggestions on how to obtain these elements?
I suggest that you define your Scrapy spider class along the lines shown in this answer, but using the PhantomJS Selenium headless browser. The essential problem is that when Scrapy downloads those pages, JavaScript code builds the HTML (DOM) that you see in the browser, so it cannot be accessed via the route you have chosen.
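A rough sketch of that combination (it assumes PhantomJS is installed along with an older Selenium release that still bundles the PhantomJS driver; the result-link XPath is a guess and needs checking against the rendered page):
import scrapy
from selenium import webdriver
from scrapy.http import TextResponse


class TechcrunchSpider(scrapy.Spider):
    name = "tech_search"

    def __init__(self):
        self.driver = webdriver.PhantomJS()

    def start_requests(self):
        url = 'https://techcrunch.com/'
        tag = getattr(self, 'tag', None)
        if tag is not None:
            url = url + '?s=' + tag
        yield scrapy.Request(url, self.parse)

    def parse(self, response):
        # Render the search page in the headless browser, then wrap the
        # rendered DOM so Scrapy selectors can be applied to it
        self.driver.get(response.url)
        page = TextResponse(response.url, body=self.driver.page_source, encoding='utf-8')
        # Hypothetical XPath for the result links; verify it in the browser first
        for href in page.xpath('//a[contains(@class, "post-title")]/@href').extract():
            yield scrapy.Request(href, callback=self.parse_article)

    def parse_article(self, response):
        pass  # extract the per-article data here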
