Scraping href with BeautifulSoup

Scraping href with BeautifulSoup - python

I'm trying to scrape the CO2 per resource trend table data from this url: pcaiso.com/todaysoutlook/pages/emissions.html
The href attribute of the contains the dataset for the chart (as a very long string) I was attempting to return this attribute, but my code is returning a zero set for the following request, no matter how hard I try and google other suggestions.
url = 'http://www.caiso.com/todaysoutlook/pages/emissions.html'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
html = soup.find("a", {"class": "dropdown-item"})
print(html)
Any thoughts are appreciated! Thanks.

It seems like what you really are asking about is why the element does not have an href attribute when you inspect it in your code. The reason is, when you request the HTML page from the server it actually returns a static page without any chart data. When you view this page in a web browser, it runs some JavaScript code that queries the backend and populates the chart data dynamically. So you'll need to modify your approach to get that data.
One option is to manually inspect the page in your browser, reverse-engineer how it fetches the data, and do the same thing in your code. Most web browser have built-in developer tools that can help with this.
Another option is to use a browser automation tool like Selenium to load the page in a web browser environment and scrape the data from there. This is less efficient, but might be easier for someone inexperienced in web programming because you can treat the JavaScript functionality as a "black box" and interact with the page more like how a real user would.

Related

Beautiful Soup not picking up some data form the website

I have been trying to scrape some data using beautiful soup from https://www.eia.gov/coal/markets/. However when I parse the contents some of the data does not show up at all. Those data fields are visible in chrome inspector but not in the soup. The thing is they do not seem to be text elements. I think they are fed using an external database. I have attached the screenshots below. Is there any other way to scrape that data?
Thanks in advance.
Google inspector:
Beautiful soup parsed content:

#DMart is correct. The data you are looking for is being populated by Javascript, have a look at line 1629 in the page source. Beautiful soup doesn't act as a client browser so there is nowhere for the JS to execute. So it looks like selenium is your best bet.
See This thread for more information.

Not enough detail in your question but this information is probably dynamically loaded and you're not fetching the entire page source.
Without your code it's tough to see if you're using selenium to do it (you tagged this questions as such) which may indicate you're using page_source which does not guarantee you the entire completed source of the page you're looking at.
If you're using requests its even more unlikely you're capturing the entire page's completed source code.

The data is loaded via ajax, so it is not available in the initial document. If you go to the networking tab in chrome dev tools you will see that the site reaches out to https://www.eia.gov/coal/markets/coal_markets_json.php. I searched for some of the numbers in the response and it looks like the data you are looking for is there.
This is a direct json response from the backend. Its better than selenium if you can get it to work.

Thanks you all!
Opening the page using selenium using a webdriver and then parsing the page source using beautiful soup worked.
webdriver.get('https://www.eia.gov/coal/markets/')
html=webdriver.page_source
soup=BS(html)
table=soup.find("table",{'id':'snl_dpst'})
rows=table.find_all("tr")

Url request does not parse every information in HTML using Python

I am trying to extract information from an exchange website (chiliz.net) using Python (requests module) and the following code:
data = requests.get(url,time.sleep(15)).text
I used time.sleep since the website is not directly connecting to the exchange main page, but I am not sure it is necessary.
The things is that, I cannot find anything written under <body style> in the HTML text (which is the data variable in this case). How can I reach the full HTML code and then start to extract the price information from this website?
I know Python, but not familiar with websites/HTML that much. So I would appreciate if you explain the website related info like you are talking to a beginner. Thanks!

There could be a few reasons for this.
The website runs behind a proxy server from what I can tell, so this does interfere with your request loading time. This is why it's not directly connecting to the main page.
It might also be the case that the elements are rendered using javascript AFTER the page has loaded. So, you only get the page and not the javascript rendered parts. You can try to increase your sleep() time but I don't think that will help.
You can also use a library called Selenium. It simply automates browsers and you can use the page_source property to obtain the HTML source code.
Code (taken from here)
from selenium import webdriver
browser = webdriver.Firefox()
browser.get("http://example.com")
html_source = browser.page_source
With selenium, you can also set the XPATH to obtain the data of -' extract the price information from this website'; you can see a tutorial on that here. Alternatively,
once you extract the HTML code, you can also use a parser such as bs4 to extract the required data.

How to scrape a javascript website in Python?

I am trying to scrape a website. I have tried using two methods but both do not provide me with the full website source code that I am looking for. I am trying to scrape the news titles from the website URL provided below.
URL: "https://www.todayonline.com/"
These are the two methods I have tried but failed.
Method 1: Beautiful Soup
tdy_url = "https://www.todayonline.com/"
page = requests.get(tdy_url).text
soup = BeautifulSoup(page)
soup # Returns me a HTML with javascript text
soup.find_all('h3')
### Returns me empty list []
Method 2: Selenium + BeautifulSoup
tdy_url = "https://www.todayonline.com/"
options = Options()
options.headless = True
driver = webdriver.Chrome("chromedriver",options=options)
driver.get(tdy_url)
time.sleep(10)
html = driver.page_source
soup = BeautifulSoup(html)
soup.find_all('h3')
### Returns me only less than 1/4 of the 'h3' tags found in the original page source
Please help. I have tried scraping other news websites and it is so much easier. Thank you.

The news data on the website you are trying to scrape is fetched from the server using JavaScript (this is called XHR -- XMLHttpRequest). It is happening dynamically, while the page is loading or being scrolled. so this data is not returned inside the page returned by the server.
In the first example, you are getting only the page returned by the server -- without the news, but with JS that is supposed to get them. Neither requests nor BeautifulSoup can execute JS.
However, you can try to reproduce requests that are getting news titles from the server with Python requests. Do the following steps:
Open DevTools of your browser (usually you have to press F12 or the combination of Ctrl+Shift+I for that), and take a look at requests that are getting news titles from the server. Sometimes, it is even easier than web scraping with BeautifulSoup. Here is a screenshot (Firefox):
Copy the request link (right-click -> Copy -> Copy link), and pass it to requests.get(...).
Get .json() of the request. It will return a dict that is easy to work with. To better understand the structure of the dict, I would recommend to use pprint instead of simple print. Note you have to do from pprint import pprint before using it.
Here is an example of the code that gets the titles from the main news on the page:
import requests
nodes = requests.get("https://www.todayonline.com/api/v3/news_feed/7")\
.json()["nodes"]
for node in nodes:
print(node["node"]["title"])
If you want to scrape a group of news under caption, you need to change the number after news_feed/ in the request URL (to get it, you just need to filter the requests by "news_feed" in the DevTools and scroll the news page down).
Sometimes web sites have protection against bots (although the website you are trying to scrape doesn't). In such cases, you might need to do these steps as well.

You can access data via API (check out the Network tab):
For example,
import requests
url = "https://www.todayonline.com/api/v3/news_feed/7"
data = requests.get(url).json()

I will suggest you the fairly simple approach,
import requests
from bs4 import BeautifulSoup as bs
page = requests.get('https://www.todayonline.com/googlenews.xml').content
soup = bs(page)
news = [i.text for i in soup.find_all('news:title')]
print(news)
output
['DBS named world’s best bank by New York-based financial publication',
'Russia has very serious questions to answer on Navalny - UK',
"Exclusive: 90% of China's Sinovac employees, families took coronavirus vaccine - CEO",
'Three militants killed after fatal attack on policeman in Tunisia',
.....]
Also, you can check the XML page for more information if required.
P.S. Always check for the compliance before scraping any website :)

There are different ways of gathering the content of a webpage that contains Javascript.
Using selenium with Firefox web driver
Using a headless browser with phantomJS
Making an API call using a REST client or python requests library
You have to do your research first

How to scrape dynamic content from a website?

So I'm using scrapy to scrape a data from Amazon books section. But somehow I got to know that it has some dynamic data. I want to know how dynamic data can be extracted from the website. Here's something I've tried so far:
import scrapy
from ..items import AmazonsItem
class AmazonSpiderSpider(scrapy.Spider):
name = 'amazon_spider'
start_urls = ['https://www.amazon.in/s?k=agatha+christie+books&crid=3MWRDVZPSKVG0&sprefix=agatha%2Caps%2C269&ref=nb_sb_ss_i_1_6']
def parse(self, response):
items = AmazonsItem()
products_name = response.css('.s-access-title::attr("data-attribute")').extract()
for product_name in products_name:
print(product_name)
next_page = response.css('li.a-last a::attr(href)').get()
if next_page is not None:
next_page = response.urljoin(next_page)
yield scrapy.Request(next_page, callback=self.parse)
Now I was using SelectorGadget to select a class which I have to scrape but in case of a dynamic website, it doesn't work.
So how do I scrape a website which has dynamic content?
what exactly is the difference between dynamic and static content?
How do I extract other information like price and image from the website? and how to get particular classes for example like a price?
how would I know that data is dynamically created?

So how do I scrape a website which has dynamic content?
there are a few options:
Use Selenium, which allows you to simulate opening a browser, letting the page render, then pull the html source code
Sometimes you can look at the XHR and see if you can fetch the data directly (like from an API)
Sometimes the data is within the <script> tags of the html source. You could search through those and use json.loads() once you manipulate the text into a json format
what exactly is the difference between dynamic and static content?
Dynamic means the data is generated from a request after the initial page request. Static means all the data is there at the original call to the site
How do I extract other information like price and image from the website? and how to get particular classes for example like a price?
Refer to your first question
how would I know that data is dynamically created?
You'll know it's dynamically created if you see it in the dev tools page source, but not in the html page source you first request. You can also see if the data is generated by additional requests in the dev tool and looking at Network -> XHR
Lastly
Amazon does offer an API to access the data. Try looking into that as well

If you want to load dynamic content, you will need to simulate a web browser. When you make an HTTP request, you will only get the text returned by that request, and nothing more. To simulate a web browser, and interact with data on the browser, use the selenium package for Python:
https://selenium-python.readthedocs.io/

So how do I scrape a website which has dynamic content?
Websites that have dynamic content have their own APIs from where they are pulling data. That data is not even fixed it will be different if you will check it after some time. But, it does not mean that you can't scrape a dynamic website. You can use automated testing frameworks like Selenium or Puppeteer.
what exactly is the difference between dynamic and static content?
As I have explained this in your first question, the static data is fixed and will remain the same forever but the dynamic data will be periodically updated or changes asynchronously.
How do I extract other information like price and image from the website? and how to get particular classes for example like a price?
for that, you can use libraries like BeautifulSoup in python and cheerio in Nodejs. Their docs are quite easy to understand and I will highly recommend you to read them thoroughly.
You can also follow this tutorial
how would I know that data is dynamically created?
While reloading the page open the network tab in chrome dev tools. You will see a lot of APIs are working behind to provide the relevant data according to the page you are trying to access. In that case, the website is dynamic.

So how do I scrape a website which has dynamic content?
To scrape the dynamic content from websites, we are required to let the web page load completely, so that the data can be injected into the page.
What exactly is the difference between dynamic and static content?
Content in static websites is fixed content that is not processed on the server and is directly returned by using prebuild source code files.
Dynamic websites load the contents by processing them on the server side in runtime. These sites can have different data every time you load the page, or when the data is updated.
How would I know that data is dynamically created?
You can open the Dev Tools and open the Networks tab. Over there once you refresh the page, you can look out for the XHR requests or requests to the APIs. If some requests like those exist, then the site is dynamic, else it is static.
How do I extract other information like price and image from the website? and how to get particular classes for example like a price?
To extract the dynamic content from the websites we can use Selenium (python - one of the best options) :
Selenium - an automated browser simulation framework
You can load the page, and use the CSS selector to match the data on the page. Following is an example of how you can use it.
import time
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("https://www.amazon.in/s?k=agatha+christie+books&crid=3MWRDVZPSKVG0&sprefix=agatha%2Caps%2C269&ref=nb_sb_ss_i_1_6")
time.sleep(4)
titles = driver.find_elements_by_css_selector(
".a-size-medium.a-color-base.a-text-normal")
print(titles[0].text)
In case you don't want to use Python, there are other open-source options like Puppeteer and Playwright, as well as complete scraping platforms such as Bright Data that have built-in capabilities to extract dynamic content automatically.

Scraping data from complex website (hidden content)

I am just starting with web scraping and unfortunately, I am facing a showstopper: I would like pull some financial data but it seems that the website is quite complex (dynamic content etc.).
Data I would like pull
Website:
https://www.de.vanguard/web/cf/professionell/de/produktart/detailansicht/etf/9527/EQUITY/performance
So far, I've used Beautiful Soup to get this done. However, I cannot even find the table. Any ideas?

Look into using selenium to launch an automated web browser. This loads the web page and it's associated dynamic content, as well as allow you the option to 'click' on certain web elements to load content that may be generated on_click. You can use this in tandem with BeautifulSoup by passing driver.page_source to BeautifulSoup and parsing through it that way.
This SO answer provides a basic example that would serve as a good starting point: Python WebDriver how to print whole page source (html)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Scraping href with BeautifulSoup - python

Related

Beautiful Soup not picking up some data form the website

Url request does not parse every information in HTML using Python

How to scrape a javascript website in Python?

How to scrape dynamic content from a website?

Scraping data from complex website (hidden content)

Categories

Resources