I am trying to scrape a website where the targeted items are populated using the document.write method. How can I get the full, browser-rendered HTML version of the website in Scrapy?
You can't do this, as scrapy will not execute the JavaScript code.
What you can do:
Rely on a headless browser driven by Selenium, which will execute the JavaScript; afterwards, use XPath (or plain DOM access) as before to query the page once its scripts have run (see the sketch after this list).
Understand where the contents come from, and load and parse that source directly instead. Chrome DevTools / Firebug can help you with that; have a look at the "Network" panel, which shows the fetched data.
Look especially for JSON, sometimes also XML.
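A minimal sketch of the first option, assuming Chrome and chromedriver are installed (the URL and selector are placeholders); the rendered HTML is handed to Scrapy's Selector so the usual CSS/XPath queries still work:

# Option 1 sketch: render the page in a headless browser, then query the
# rendered HTML with Scrapy's Selector as usual.
# Assumes Chrome + chromedriver are installed; URL and selector are placeholders.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from scrapy.selector import Selector

options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)
driver.get("https://example.com/page-built-with-document-write")

rendered_html = driver.page_source  # HTML after the JavaScript (incl. document.write) has run
driver.quit()

sel = Selector(text=rendered_html)
print(sel.css("a::attr(href)").getall())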
Related
I am trying to scrape data using Scrapy from https://www.ta.com/portfolio/business-services, however the response is NULL. I am looking to scrape the href attributes in div.tiles.js-portfolio-tiles using the code response.css("div.tiles.js-portfolio-tiles a::attr(href)").extract()
I think this has something to do with the ::before that appears just before this element, but maybe not. How do I go about extracting this?
The elements that you are interested in retrieving are loaded by your browser using JavaScript. By default, Scrapy cannot load elements that require JavaScript because it is not a browser; it simply retrieves the raw HTML.
Scrapy shell is an invaluable tool for inspecting what is available in the response that scrapy receives.
This set of commands will open the response in your default web browser:
$ scrapy shell
>>> fetch("https://www.ta.com/portfolio/business-services")
>>> view(response)
As you can see, the js-portfolio tiles are not visible because they have not been loaded.
I have had a look at the AJAX requests in the Network panel of the developer tools, and it appears that the information you require may be available from an XHR request. If it is not, you will need additional software to execute the JavaScript, namely Scrapy Splash or Selenium. I would advise exploring the AJAX (XHR) request first, though, as this will be much faster and easier.
See this question for additional details on using your browsers dev tools to inspect AJAX requests.
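If the XHR does return the data, you can usually replay it directly from a spider. A rough sketch follows; the endpoint, parameters and JSON keys below are placeholders rather than the site's real API, so copy the actual request from the Network panel:

# Sketch: replay the XHR request directly instead of rendering JavaScript.
# The endpoint and JSON keys are hypothetical; use the ones shown in the
# browser's Network (XHR) panel for the real page.
import json
import scrapy

class PortfolioSpider(scrapy.Spider):
    name = "portfolio"
    start_urls = ["https://www.ta.com/some-xhr-endpoint"]  # hypothetical endpoint

    def parse(self, response):
        data = json.loads(response.text)      # many such endpoints return JSON
        for item in data.get("results", []):  # key name is an assumption
            yield {"href": item.get("url")}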
I'm trying to scrape a JS intensive website and I wanted to do this by loading the page, rendering the JS and then doing the scraping with BeautifulSoup.
I want to do this, if possible, on a Raspberry Pi.
I've tried using Requests-HTML, which worked fine for a while, but I couldn't get Python 3.7 to run it on the Raspberry Pi due to memory limitations.
Then I tried Selenium, with both geckodriver, which isn't available for ARMv6 and which I don't know how to compile for the Raspberry Pi, and PhantomJS, which I couldn't get to work properly.
You have two options.
Use a tool that can mimic a browser and render the JS parts of the page, like Selenium
Examine the page and see which requests to the backend are obtaining the data you need
I would go with the first approach if I needed a general-purpose tool that can scrape data from all sorts of pages.
And I would go with the second if I needed to scrape pages from a handful of sites and be done with it (see the sketch below). If you provide a link I can try to help you with this.
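As a sketch of the second approach (usually the lightest option on a Raspberry Pi): the endpoint, parameters and JSON keys below are hypothetical and have to be replaced with whatever the Network tab shows for your page.

# Sketch: call the backend endpoint directly and skip JS rendering entirely.
# URL, params and keys are placeholders; find the real ones in the browser's
# developer tools (Network tab).
import requests

resp = requests.get("https://example.com/api/items", params={"page": 1})
resp.raise_for_status()
for item in resp.json().get("items", []):
    print(item.get("title"))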
I need to scrape a site with Python. I obtain the HTML source with the urllib module, but I also need to scrape some HTML that is generated by a JavaScript function (which is included in the HTML source). What this function does on the site is output some HTML when you press a button. How can I "press" this button with Python code? Can Scrapy help me? I captured the POST request with Firebug, but when I try to pass it on the URL I get a 403 error. Any suggestions?
In Python, I think Selenium 1.0 is the way to go. It’s a library that allows you to control a real web browser from your language of choice.
You need to have the web browser in question installed on the machine your script runs on, but it looks like the most reliable way to programmatically interrogate websites that use a lot of JavaScript.
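A minimal sketch of that idea with present-day Selenium; the URL and the button locator are placeholders, and the explicit wait simply gives the page's JavaScript time to attach its handlers:

# Sketch: drive a real browser, click the button that generates the HTML,
# then read the updated DOM. URL and locator are placeholders.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get("https://example.com/page-with-button")

button = WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.ID, "generate"))  # placeholder locator
)
button.click()

html_after_click = driver.page_source  # DOM after the click handler ran
print(len(html_after_click))
driver.quit()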
Since there is no comprehensive answer here, I'll go ahead and write one.
To scrape JS-rendered pages, we will need a browser that has a JavaScript engine (i.e., supports JavaScript rendering).
Options like Mechanize and urllib2 will not work, since they do NOT support JavaScript.
So here's what you do:
Set up PhantomJS to run with Selenium. After installing the dependencies for both of them (refer to this), you can use the following code as an example to fetch the fully rendered website.
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.PhantomJS()
driver.get('http://jokes.cc.com/')
soupFromJokesCC = BeautifulSoup(driver.page_source, 'html.parser')  # page_source fetches the page after rendering is complete
driver.save_screenshot('screen.png')  # save a screenshot to disk
driver.quit()
I have had to do this before (in .NET) and you are basically going to have to host a browser, get it to click the button, and then interrogate the DOM (document object model) of the browser to get at the generated HTML.
This is definitely one of the downsides to web apps moving towards an Ajax/Javascript approach to generating HTML client-side.
I use webkit, which is the browser renderer behind Chrome and Safari. There are Python bindings to webkit through Qt. And here is a full example to execute JavaScript and extract the final HTML.
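A rough sketch of this approach, using PyQt5's QtWebEngine (the modern replacement for the old Qt WebKit bindings); the URL is a placeholder and the PyQtWebEngine package must be installed:

# Sketch: load a page inside Qt, let it execute JavaScript, then grab the
# final HTML. Uses QtWebEngine rather than the older QtWebKit bindings;
# requires the PyQtWebEngine package. The URL is a placeholder.
import sys
from PyQt5.QtCore import QUrl
from PyQt5.QtWidgets import QApplication
from PyQt5.QtWebEngineWidgets import QWebEnginePage

class Render(QWebEnginePage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        super().__init__()
        self.html = None
        self.loadFinished.connect(self._loaded)
        self.load(QUrl(url))
        self.app.exec_()

    def _loaded(self, ok):
        self.toHtml(self._store)  # toHtml is asynchronous and delivers the HTML to a callback

    def _store(self, html):
        self.html = html
        self.app.quit()

page = Render("https://example.com")
print(page.html[:200])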
For Scrapy (a great Python scraping framework) there is scrapyjs: an additional downloader/middleware handler able to scrape JavaScript-generated content.
It's based on the WebKit engine via pygtk, python-webkit, and python-jswebkit, and it's quite simple.
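A currently maintained alternative in the same spirit is scrapy-splash. A minimal sketch, assuming a Splash instance is running at http://localhost:8050 and the scrapy-splash middlewares are enabled in settings.py as described in its README:

# Sketch with scrapy-splash (named as an alternative here, not the scrapyjs
# handler described above). Assumes Splash is running at localhost:8050 and
# the scrapy-splash middlewares are configured in settings.py.
import scrapy
from scrapy_splash import SplashRequest

class JsSpider(scrapy.Spider):
    name = "js_spider"

    def start_requests(self):
        # 'wait' gives the page time to finish running its JavaScript
        yield SplashRequest("https://example.com", self.parse, args={"wait": 2})

    def parse(self, response):
        # response.text now contains the JS-rendered HTML
        yield {"title": response.css("title::text").get()}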
I am scraping a webpage. The webpage consists of 50 entries; after 50 entries it shows a
"Load more results" button. I need to select it automatically. How can I do that? For scraping I am using Python and lxml.
Even JavaScript uses HTTP requests to get the data, so one method would be to investigate which requests provide the data when the user asks to "Load more results", and to emulate those requests.
This is not traditional scraping, which is based on plain or rendered HTML content and on detecting further links, but it can be a working solution.
Next actions:
visit the page in Google Chrome or Firefox
press F12 to open the Developer Tools or Firebug
switch to the "Network" tab
click "Load more results"
check which HTTP requests served the data for loading more results and what data they return
try to emulate these requests from Python
Note that the data does not necessarily come in HTML or XML form; it could be JSON. But Python provides enough tools to process that format too (a sketch follows below).
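For illustration, a sketch of emulating such a request; the endpoint, paging parameters and response layout are placeholders for whatever the Network tab actually shows:

# Sketch: emulate the request behind "Load more results".
# Endpoint, parameters and response keys are hypothetical.
import requests
from lxml import html

resp = requests.get(
    "https://example.com/entries",       # hypothetical endpoint
    params={"offset": 50, "limit": 50},  # hypothetical paging parameters
)
resp.raise_for_status()

data = resp.json()
if "html" in data:                       # some endpoints return an HTML fragment
    tree = html.fromstring(data["html"])
    print(tree.xpath("//a/@href"))
else:                                    # others return plain JSON records
    print(data)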
You can't do that. The functionality is provided by javascript, which lxml will not execute.
I am scraping a website, but it only shows a portion of the page; at the bottom it has a "view more" button. Is there any way to view everything on the webpage via Python?
BeautifulSoup just parses the returned HTML. It doesn't execute JavaScript, which is often used to load new content or to modify the existing webpage after it has loaded.
You'll need to execute the JavaScript, which requires more than just an HTML parser; you basically need to use a browser. There are a few Python packages that can do this (a Selenium sketch follows the list):
Selenium
Ghost.py
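A minimal Selenium sketch of the "click view more, then parse" idea; the URL and CSS selectors are placeholders:

# Sketch: keep clicking the "view more" button until it is gone, then parse
# the fully expanded page with BeautifulSoup. URL and selectors are placeholders.
import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com/listing")

while True:
    buttons = driver.find_elements(By.CSS_SELECTOR, "button.view-more")  # placeholder selector
    if not buttons:
        break
    buttons[0].click()
    time.sleep(1)  # crude wait for the new content to load

soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()
print(len(soup.select("div.item")))  # placeholder selector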