Waiting for page full load using Scrapy framework - python

I am trying to scrape some data from this website.
For data scraping I'm using Scrapy framework.
I inspected the webpage and found out that the data I want to extract has the following XPath:
//*[@id="weather-widget"]/div[2]/div[1]/div[1]/div[1]/h2
When I scraped the webpage and started looking through its content I found out that the page does not contain an element with the XPath above.
Is it possible to somehow wait for the page to load and extract the values that I need?

You are not finding the data you are looking for because it is loaded by a separate request. You can either use something like Selenium or Puppeteer to load the full page, or you can send the request directly to the API to get the data.
For the site you provided, the request that returns the data looks something like this:
https://openweathermap.org/data/2.5/weather?id=625665&appid=439d4b804bc8187953eb36d2a8c26a02
You can confirm this by opening DevTools, switching to the Network tab, and refreshing the page to see the request.
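A direct call to that endpoint could look roughly like the sketch below. It uses only the standard library. The helper names (`build_weather_url`, `summarize_weather`, `fetch_weather`) are made up for illustration, and the `name` and `main.temp` fields are an assumption about the JSON layout that should be checked against the response shown in the Network tab.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_ENDPOINT = "https://openweathermap.org/data/2.5/weather"


def build_weather_url(city_id, app_id):
    """Build the request URL seen in the Network tab for one city."""
    query = urlencode({"id": city_id, "appid": app_id})
    return f"{API_ENDPOINT}?{query}"


def summarize_weather(body):
    """Pull a couple of fields out of the JSON text.

    Assumption: the payload has a top-level "name" and a "main" object
    with a "temp" key. Missing keys come back as None.
    """
    payload = json.loads(body)
    return {
        "city": payload.get("name"),
        "temperature": payload.get("main", {}).get("temp"),
    }


def fetch_weather(city_id, app_id, timeout=10):
    """Download and summarize the data (performs a real HTTP request)."""
    with urlopen(build_weather_url(city_id, app_id), timeout=timeout) as resp:
        return summarize_weather(resp.read().decode("utf-8"))
```

In a Scrapy spider, the same URL could be passed to `scrapy.Request`, and the callback could decode `response.text` with `json.loads` instead of using XPath.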

Related

How to retrieve translation and pronunciation from Papago website using Python? [duplicate]

I am trying to create a script to download an ebook as a PDF. When I use BeautifulSoup to print the contents of a single page, I get a message in the console stating "Oh no! It looks like JavaScript is disabled in your browser. Please re-enable to access the reader."
I have already enabled JavaScript in Chrome, and this same piece of code works for a page like a Stack Overflow answer page. What could be blocking JavaScript on this page, and how can I bypass it?
My code for reference:
import bs4
import requests

# Fetch the reader page and print the text of its first paragraph.
response = requests.get("https://platform.virdocs.com/r/s/0/doc/350551/sp/14552484/mi/47443495/?cfi=%2F4%2F2%5BP7001013978000000000000000003FF2%5D%2F2%2F2%5BP7001013978000000000000000010019%5D%2F2%2C%2F1%3A0%2C%2F1%3A0")
response.raise_for_status()
soup = bs4.BeautifulSoup(response.text, "html.parser")
elems = soup.select("p")
print(elems[0].getText())
The problem is that the page actually contains no content. To load the content, it needs to run some JS code. The requests.get method does not run JS; it only downloads the basic HTML.
What you need to do is to emulate a browser, i.e. 'open' the page, run JS, and then scrape content. One way to do it is to use a browser driver as described here - https://stackoverflow.com/a/57912823/9805867
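A minimal sketch of the browser-driver approach with Selenium follows. It assumes the `selenium` package and a Chrome installation are available. `render_html` and its `wait_css` parameter are hypothetical names, and waiting for a `p` element is only a guess at what signals that the reader has finished loading.

```python
def render_html(url, wait_css="p", timeout=10):
    """Open url in headless Chrome, wait for wait_css, return the rendered HTML."""
    # Imported here so that defining the function does not require Selenium.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Block until at least one matching element is present in the DOM;
        # a TimeoutException is raised if none appears within `timeout` seconds.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, wait_css))
        )
        return driver.page_source
    finally:
        # Always close the browser, even if the wait fails.
        driver.quit()
```

The returned string can then be handed to `bs4.BeautifulSoup(html, "html.parser")` in place of the text obtained from requests.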

Python Scrape Financial Data From iFrame

I'm trying to scrape data within the iFrame.
I have tried webdriver in Chrome as well as PhantomJS with no success. There are source links contained within the iframe where I assume its data is being pulled from, however, when using these links an error is generated saying "You can't render widget content without a correct InstanceId parameter."
Is it possible to access this data using python (PhantomJS)?
Open the network tools in your browser, investigate which requests are sent to the server, and then reproduce them with simple requests.

Unable to scrape dynamic web page

I am trying to scrape the table found at https://ark.intel.com/content/www/us/en/ark/search/featurefilter.html?productType=873&1_Filter-Family=595&2_StatusCodeText=4
I tried using BeautifulSoup, but it is unable to parse the info located inside the "body" tag. I get a null output when I try to parse the table.
How can I work around this?
This page uses JavaScript to add the data, but BeautifulSoup/lxml can't run JavaScript. If you turn off JavaScript in your browser and load the page, you will see what BeautifulSoup/lxml can get.
You may need Selenium to control a web browser, which can run JavaScript.
Or you can use DevTools in Chrome/Firefox (Network tab) to find the URL that JavaScript (AJAX/XHR) uses to download the data, and then try that URL with requests and BeautifulSoup.
I found that it uses this URL:
https://ark.intel.com/libs/apps/intel/support/ark/advancedFilterSearch?productType=873&1_Filter-Family=595&2_StatusCodeText=4&forwardPath=/content/www/us/en/ark/search/featurefilter.html&pageNo=1
I didn't check whether requests needs special settings (e.g. cookies, headers) to get it.
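The URL above ends in `pageNo=1`, which suggests the results are paginated. The sketch below rewrites that parameter to build URLs for other pages. It is an untested assumption that the endpoint honors higher `pageNo` values, and `page_url` is a hypothetical helper name.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

FIRST_PAGE = (
    "https://ark.intel.com/libs/apps/intel/support/ark/advancedFilterSearch"
    "?productType=873&1_Filter-Family=595&2_StatusCodeText=4"
    "&forwardPath=/content/www/us/en/ark/search/featurefilter.html&pageNo=1"
)


def page_url(url, page_no):
    """Return url with its pageNo query parameter set to page_no."""
    parts = urlsplit(url)
    # Keep every existing parameter except the old page number.
    params = [(k, v) for k, v in parse_qsl(parts.query) if k != "pageNo"]
    params.append(("pageNo", str(page_no)))
    # safe="/" keeps the slashes in forwardPath readable instead of %2F.
    return urlunsplit(parts._replace(query=urlencode(params, safe="/")))
```

Each generated URL could then be fetched with requests and parsed with BeautifulSoup, subject to the cookie and header caveat mentioned above.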
You can use Puppeteer to 'control' the dynamic web page, and then scrape the rendered HTML with BeautifulSoup.
See here: https://github.com/puppeteer/puppeteer/tree/master/examples

Difficulty in web-scraping data using scrapy

I am trying to scrape data using scrapy from https://www.ta.com/portfolio/business-services, however the response is NULL. I am looking to scrape href in div.tiles js-portfolio-tiles using the code response.css("div.tiles.js-portfolio-tiles a::attr(href)").extract()
I think this has something to do with ::before that appears just before this, but maybe not. How do I go about extracting this?
The elements you are interested in are loaded by your browser using JavaScript. By default, Scrapy cannot load such elements because it is not a browser; it simply retrieves the raw HTML.
Scrapy shell is an invaluable tool for inspecting what is available in the response that scrapy receives.
This set of commands will open the response in your default web browser:
$ scrapy shell
>>> fetch("https://www.ta.com/portfolio/business-services")
>>> view(response)
As you can see, the js-portfolio-tiles elements are not visible because they have not been loaded.
I had a look at the AJAX requests in the network panel of the developer tools, and it appears that the information you require may be available in an XHR request. If it is not, you will need additional software to run the JavaScript, such as Scrapy Splash or Selenium. I would advise exploring the AJAX (XHR) request first, though, as this will be much faster and easier.
See this question for additional details on using your browser's dev tools to inspect AJAX requests.
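The same check can be scripted. The sketch below uses only the standard library to list the links inside the tiles container of a given HTML string, so the raw response can be compared with a rendered page. `TileLinkParser` and `tile_links` are hypothetical names, and the sample markup in the usage note is invented.

```python
from html.parser import HTMLParser


class TileLinkParser(HTMLParser):
    """Collect href values of <a> tags nested inside a <div> with a given class."""

    def __init__(self, container_class):
        super().__init__()
        self.container_class = container_class
        self.depth = 0  # greater than 0 while inside the container div
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            classes = (attrs.get("class") or "").split()
            # Enter the container, or go one level deeper inside it.
            if self.depth or self.container_class in classes:
                self.depth += 1
        elif tag == "a" and self.depth and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1


def tile_links(html, container_class="js-portfolio-tiles"):
    """Return the hrefs found inside the container div of html."""
    parser = TileLinkParser(container_class)
    parser.feed(html)
    parser.close()
    return parser.links
```

Calling `tile_links('<div class="tiles js-portfolio-tiles"></div>')` returns an empty list, which is what an unrendered container looks like to Scrapy.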

Scrapy: scraping website where targeted items are populated using document.write

I am trying to scrape a website where the targeted items are populated using the document.write method. How can I get the full browser-rendered HTML of the website in Scrapy?
You can't do this, as scrapy will not execute the JavaScript code.
What you can do:
Rely on a headless browser driven by a tool like Selenium, which will execute the JavaScript. Afterwards, use XPath (or simple DOM access) as before to query the rendered page.
Understand where the contents come from, and load and parse the source directly instead. Chrome Dev Tools / Firebug might help you with that, have a look at the "Network" panel that shows fetched data.
Especially look for JSON, sometimes also XML.
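Once such a request has been found in the Network panel, its body can be decoded according to the content type. The sketch below is one way to do that with the standard library; `parse_payload` is a hypothetical helper, and the sample bodies in the usage note are invented.

```python
import json
import xml.etree.ElementTree as ET


def parse_payload(body, content_type):
    """Decode a captured response body according to its Content-Type header."""
    # Drop parameters such as "; charset=utf-8" before comparing.
    kind = content_type.split(";")[0].strip().lower()
    if kind.endswith("json"):
        return json.loads(body)
    if kind.endswith("xml"):
        return ET.fromstring(body)
    raise ValueError(f"unsupported content type: {content_type!r}")
```

For example, `parse_payload('{"items": [1, 2]}', "application/json")` yields a dictionary, while an XML body yields an `Element` whose children can be iterated.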
