I am trying to scrape data with Scrapy from https://www.ta.com/portfolio/business-services, but the response comes back empty. I want to extract the href values inside div.tiles.js-portfolio-tiles using the code response.css("div.tiles.js-portfolio-tiles a::attr(href)").extract()
I think this has something to do with the ::before that appears just before this element, but maybe not. How do I go about extracting these links?
The elements that you are interested in are loaded by your browser using JavaScript. By default, Scrapy cannot load such elements because it is not a browser; it simply retrieves the raw HTML.
Scrapy shell is an invaluable tool for inspecting what is available in the response that scrapy receives.
This set of commands will open the response in your default web browser:
$ scrapy shell
>>> fetch("https://www.ta.com/portfolio/business-services")
>>> view(response)
As you can see, the js-portfolio-tiles elements are not visible because they have not been loaded.
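To make the difference concrete, here is a minimal sketch using only the standard library. The two HTML strings are made-up stand-ins (not the real page source): one for the raw HTML that Scrapy receives and one for the DOM after the browser has run the JavaScript. The class name TileLinkParser and the helper tile_links are hypothetical names for illustration.

```python
from html.parser import HTMLParser


class TileLinkParser(HTMLParser):
    """Collect href values of <a> tags nested inside div.js-portfolio-tiles."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # > 0 while inside the target div (counts nested divs)
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            classes = (attrs.get("class") or "").split()
            if self.depth or "js-portfolio-tiles" in classes:
                self.depth += 1
        elif tag == "a" and self.depth and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1


def tile_links(html):
    parser = TileLinkParser()
    parser.feed(html)
    return parser.links


# Made-up samples: the container exists in the raw HTML, but it is empty
# until JavaScript fills it in.
RAW_HTML = '<div class="tiles js-portfolio-tiles"></div>'
RENDERED_HTML = (
    '<div class="tiles js-portfolio-tiles">'
    '<a href="/portfolio/example-co">Example Co</a></div>'
)

print(tile_links(RAW_HTML))
print(tile_links(RENDERED_HTML))
```

The selector in the question is fine; it returns nothing because the links are not in the raw HTML it runs against.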
I have had a look at the AJAX requests in the network panel of the developer tools, and it appears that the information you require may be available in an XHR request. If it is not, you will need additional software to execute the JavaScript, namely scrapy-splash or Selenium. I would advise exploring the AJAX (XHR) request first, though, as this will be much faster and easier.
See this question for additional details on using your browser's dev tools to inspect AJAX requests.
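If the XHR request does return JSON, pulling the links out could look roughly like the sketch below. The payload shape ({"items": [{"url": ...}]}) and the sample body are assumptions for illustration only; inspect the real response in the network panel and adjust the keys. extract_links is a hypothetical helper name.

```python
import json
from urllib.parse import urljoin

BASE_URL = "https://www.ta.com/portfolio/business-services"


def extract_links(body, base_url=BASE_URL):
    """Return absolute URLs from a JSON body shaped like {"items": [{"url": ...}]}.

    The "items"/"url" keys are assumed, not taken from the real site.
    """
    data = json.loads(body)
    return [
        urljoin(base_url, item["url"])
        for item in data.get("items", [])
        if item.get("url")
    ]


# Made-up sample body standing in for the XHR response text.
SAMPLE_BODY = (
    '{"items": [{"name": "Example Co", "url": "/portfolio/example-co"},'
    ' {"name": "No Link"}]}'
)

print(extract_links(SAMPLE_BODY))
```

In a Scrapy callback you would request the XHR URL instead of the page URL and pass response.text to a helper like this.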
Related
Hello, is there a way to use two different website URLs and switch between them?
I mean I have two different websites, like:
import requests
session = requests.Session()
firstPage = session.get("https://stackoverflow.com")
print("Hey! im in first page now!")
secondPage = session.get("https://youtube.com")
print("Hey! im in second page now!")
I know a way to do it in Selenium, like this: driver.switch_to.window(driver.window_handles[1])
but I want to do it in Requests, so is there a way to do it?
Selenium and Requests are two fundamentally different tools. Selenium automates a real browser (optionally headless), which lets it fully simulate a user. Requests is a Python library which simply sends HTTP requests.
Because of this, Requests is particularly good for scraping static data and data that does not involve javascript rendering (through jQuery or similar), such as RESTful APIs, which often return JSON formatted data (with no HTML styling, or page rendering at all). With Requests, after the initial HTTP request is made, the data is saved in an object, and the connection is closed.
Selenium allows you to traverse through complex, javascript-rendered menus and the like, since each page is actually built (under the hood) as if it were being displayed to a user. Selenium encapsulates everything that your browser does except displaying the HTML (including the HTTP requests that Requests is built to perform). After connecting to a page with Selenium, the connection remains open. This allows you to navigate through a complex site where you would need the full URL of the final page to use Requests.
Because of this distinction, it makes sense that Selenium would have a switch_to.window method, but Requests would not. The way your code is written, you can access the responses to the HTTP GET calls you've made directly through your variables (firstPage contains the response from stackoverflow, secondPage contains the response from youtube). While using Requests, you are never "in" a page in the sense that you can be with Selenium, since it is an HTTP library and not a full headless browser.
Depending on what you're looking to scrape, it might be better to use either Requests or Selenium.
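As a rough sketch of what "no switching needed" means with Requests: every response is just an object you keep around, so you can go back to any of them at any time. fetch_pages and demo are hypothetical helper names; demo is not called here because it needs the third-party requests package and network access.

```python
def fetch_pages(session, urls, timeout=10):
    """Fetch each URL with the same session and keep every response, keyed by URL."""
    return {url: session.get(url, timeout=timeout) for url in urls}


def demo():
    """Example usage; not called here (requires requests and network access)."""
    import requests  # third-party: pip install requests

    urls = ["https://stackoverflow.com", "https://youtube.com"]
    with requests.Session() as session:
        pages = fetch_pages(session, urls)

    # Both responses stay available; there is nothing to "switch" between.
    for url, response in pages.items():
        print(url, response.status_code, len(response.text))
```

Using one Session for both calls only shares cookies and connection pooling; it does not put you "in" either page.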
Let's say I am making a Python request
import requests
url = "https://www.google.com"
r = requests.get(url)
Is there any method for getting all the network requests needed to load such a website, for example, those listed in the inspect element tool in Chrome? I believe that I could achieve the same effect using Selenium, but is there any library or method that I could use to simply get all the network requests/network responses when requesting a URL?
Selenium Wire may be worth a try. I haven't been able to find much else in this space either.
https://github.com/wkeeling/selenium-wire
Selenium Wire extends Selenium's Python bindings to give you access to the underlying requests made by the browser. You author your code in the same way as you do with Selenium, but you get extra APIs for inspecting requests and responses and making changes to them on the fly.
This article describes more HTTP Request packages that may have similar capabilities or related extensions.
https://www.twilio.com/blog/5-ways-http-requests-python
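A minimal sketch of how Selenium Wire might be used for this, assuming selenium-wire and Chrome are installed. summarize_requests and capture are hypothetical helper names; capture is not called here because it needs a real browser and network access, so the helper is exercised with stand-in objects that carry the same attribute names instead.

```python
from types import SimpleNamespace


def summarize_requests(captured):
    """Return (method, url, status) for every captured request that got a response."""
    return [
        (req.method, req.url, req.response.status_code)
        for req in captured
        if req.response is not None
    ]


def capture(url):
    """Not called here: needs selenium-wire, Chrome, and network access."""
    from seleniumwire import webdriver  # third-party: pip install selenium-wire

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # driver.requests holds the requests the browser made while loading the page
        return summarize_requests(driver.requests)
    finally:
        driver.quit()


# Stand-in objects so the helper can be tried offline.
fake = [
    SimpleNamespace(method="GET", url="https://example.com/",
                    response=SimpleNamespace(status_code=200)),
    SimpleNamespace(method="GET", url="https://example.com/pending.js",
                    response=None),
]
print(summarize_requests(fake))
```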
I have a small project to scrape prices from some stores using Requests and Beautiful Soup.
One of these stores now checks for a JavaScript-enabled browser. I have found that it uses Magento 1.
I have tried the requests-html library, but it did not work.
I know that I can use Selenium and headless Chrome. I have tested it and it works fine.
But since I run the project in the cloud, it would be much easier and less expensive to use Requests.
On Stack Overflow there is a post where one solution was suggested: send the request with the cookie that the website sets when it checks whether JavaScript is enabled.
https://stackoverflow.com/a/66917621
Has anyone tried this solution on Magento 1?
I suspect that the data I need to scrape is not generated by JavaScript on the page, but the check for a JavaScript-enabled browser does not allow the page to load.
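For reference, the cookie idea from that post might look roughly like the sketch below, written with the standard library so it runs without extra packages. The URL, the cookie name js_check, and its value are hypothetical; the real ones have to be copied from the browser's dev tools after the store's check has run, and nothing here is confirmed to work on Magento 1. build_request is a hypothetical helper name.

```python
import urllib.request


def build_request(url, cookies, user_agent="Mozilla/5.0"):
    """Build a GET request that carries the given cookies and a browser-like User-Agent."""
    cookie_header = "; ".join(f"{name}={value}" for name, value in cookies.items())
    return urllib.request.Request(
        url,
        headers={"Cookie": cookie_header, "User-Agent": user_agent},
    )


# Hypothetical URL and cookie: replace with the values your browser actually sends.
request = build_request("https://shop.example.com/product/123", {"js_check": "1"})
print(request.get_header("Cookie"))

# To actually send it (not done here, needs network access):
# html = urllib.request.urlopen(request, timeout=10).read()
```

With Requests the same thing is requests.get(url, cookies={...}, headers={...}).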
I need to scrape a site with Python. I obtain the HTML source with the urllib module, but I also need to scrape some HTML that is generated by a JavaScript function (which is included in the HTML source). What this function does "in" the site is that when you press a button it outputs some HTML code. How can I "press" this button with Python code? Can Scrapy help me? I captured the POST request with Firebug, but when I try to pass it in the URL I get a 403 error. Any suggestions?
In Python, I think Selenium 1.0 is the way to go. It’s a library that allows you to control a real web browser from your language of choice.
You need to have the web browser in question installed on the machine your script runs on, but it looks like the most reliable way to programmatically interrogate websites that use a lot of JavaScript.
Since there is no comprehensive answer here, I'll go ahead and write one.
To scrape JS-rendered pages, we need a browser that has a JavaScript engine (i.e., one that supports JavaScript rendering).
Options like Mechanize and urllib2 will not work since they DO NOT support JavaScript.
So here's what you do:
Set up PhantomJS to run with Selenium. After installing the dependencies for both of them (refer to this), you can use the following code as an example to fetch the fully rendered website.
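from bs4 import BeautifulSoup
from selenium import webdriver

# Note: webdriver.PhantomJS was removed in Selenium 4, so this needs an older Selenium release.
driver = webdriver.PhantomJS()
driver.get('http://jokes.cc.com/')
# page_source returns the HTML of the page as rendered by the browser
soupFromJokesCC = BeautifulSoup(driver.page_source, 'html.parser')
driver.save_screenshot('screen.png')  # save a screenshot to disk
driver.quit()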
I have had to do this before (in .NET) and you are basically going to have to host a browser, get it to click the button, and then interrogate the DOM (document object model) of the browser to get at the generated HTML.
This is definitely one of the downsides to web apps moving towards an Ajax/Javascript approach to generating HTML client-side.
I use webkit, which is the browser renderer behind Chrome and Safari. There are Python bindings to webkit through Qt. And here is a full example to execute JavaScript and extract the final HTML.
For Scrapy (a great Python scraping framework) there is scrapyjs: an additional downloader handler / middleware able to scrape JavaScript-generated content.
It's based on the WebKit engine via pygtk, python-webkit, and python-jswebkit, and it's quite simple.
I am trying to scrape a website where the targeted items are populated using the document.write method. How can I get the full browser-rendered HTML of the website in Scrapy?
You can't do this, as Scrapy will not execute the JavaScript code.
What you can do:
Rely on a browser automation tool like Selenium (with a headless browser), which will execute the JavaScript. Afterwards, use XPath (or simple DOM access) as before to query the page once the JavaScript has run.
Understand where the contents come from, and load and parse the source directly instead. Chrome Dev Tools / Firebug might help you with that, have a look at the "Network" panel that shows fetched data.
Especially look for JSON, sometimes also XML.
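For the document.write case specifically, the markup is sometimes sitting in the raw page source as a string literal, so the second option can be as simple as pulling those strings out. This is a rough sketch under that assumption: it only handles a single plain string literal per call (no concatenation, no escaped quotes, no variables), and the sample HTML is made up. written_fragments is a hypothetical helper name.

```python
import re

# Matches document.write("...") or document.write('...') with one plain string literal.
WRITE_CALL = re.compile(r"""document\.write\(\s*(["'])(.*?)\1\s*\)""", re.DOTALL)


def written_fragments(html):
    """Return the string literals passed to document.write in the given source."""
    return [match.group(2) for match in WRITE_CALL.finditer(html)]


# Made-up sample standing in for the raw page source.
SAMPLE = """<script>document.write('<a href="/item/1">Item 1</a>');</script>"""

print(written_fragments(SAMPLE))
```

Each fragment can then be parsed like any other HTML. If the argument is built up from variables, fall back to a browser or to the request the data comes from.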