Web scraping cnbc.com - python

I am trying to scrape this page with bs4, and I was wondering how I can scrape the EUR/USD rate, the price change, and the price change %.
I am pretty new to this, so this is all I have so far:
import requests
from bs4 import BeautifulSoup
url = 'http://www.cnbc.com/pre-markets/'
source_code = requests.get(url).text
soup = BeautifulSoup(source_code, 'lxml')
for r in soup.find_all('td', {'class': 'first text'}):
    print(r)

The data you're looking for is probably loaded with JavaScript, so you can't see it with bs4 alone. But you can get it by rendering the page first, using a browser-automation tool like Selenium or a headless browser like PhantomJS or Splash. See also this answer: scraping dynamic updates of temperature sensor data from a website
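For example, here is a minimal Selenium sketch (it assumes Chrome and a matching chromedriver are installed; the 'first text' cell class is taken from your snippet and may need adjusting against the markup the browser actually renders):
from bs4 import BeautifulSoup
from selenium import webdriver
import time

driver = webdriver.Chrome()
driver.get('http://www.cnbc.com/pre-markets/')
time.sleep(5)  # crude pause for the JavaScript to run; an explicit wait is more robust

# Parse the rendered page rather than the bare initial HTML
soup = BeautifulSoup(driver.page_source, 'lxml')
driver.quit()

for cell in soup.find_all('td', {'class': 'first text'}):
    print(cell.get_text(strip=True))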

Related

Beautiful Soup Link Scraping

I am testing the requests module to get the content of a web page, but when I look at what it returns, I see that it is not the full content of the page.
Here is my code:
import requests
from bs4 import BeautifulSoup
url = "https://shop.nordstrom.com/c/womens-dresses-shop?origin=topnav&cm_sp=Top%20Navigation-_-Women-_-Dresses&offset=11&page=3&top=72"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())
Also, if I look at the page source in the Chrome web browser, I do not see the full content.
Is there a way to get the full content of the example page that I have provided?
The page is rendered with JavaScript, which makes additional requests to fetch more data. You can fetch the complete page with Selenium.
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome()
url = "https://shop.nordstrom.com/c/womens-dresses-shop?origin=topnav&cm_sp=Top%20Navigation-_-Women-_-Dresses&offset=11&page=3&top=72"
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()
print(soup.prettify())
For other solutions see my answer to Scraping Google Finance (BeautifulSoup)
A request is different from getting the page source or the visual elements of the web page; also, viewing the source of a web page doesn't give you access to everything behind it, including database requests and other back-end processing. Either your question is not clear enough, or you've misinterpreted how web browsing works.

Extracting Scraped Web Content from iframe

I'm attempting to scrape the table at https://coronavirus.health.ny.gov/zip-code-vaccination-data
I've looked at Python BeautifulSoup - Scrape Web Content Inside Iframes and have gotten this far, but I don't know how to extract the information from soup.
Any help is greatly appreciated.
import requests
from bs4 import BeautifulSoup

s = requests.Session()
r = s.get("https://coronavirus.health.ny.gov/zip-code-vaccination-data")
soup = BeautifulSoup(r.content, "html.parser")
# Note: this is the global-footer iframe, not the one containing the table
iframe_src = '//static-assets.ny.gov/load_global_footer/ajax?iframe=true'
r = s.get(f"https:{iframe_src}")
soup = BeautifulSoup(r.content, "html.parser")
The website you're trying to scrape generates the <iframe> dynamically with JavaScript, so you either need something that automates browser actions, like Selenium or Puppeteer, or you can assign the <iframe> URL to a variable, since it seems unlikely to change in the near future. Here is the URL of your <iframe>:
https://public.tableau.com/views/Vaccination_Rate_Public/NYSVaccinationRate?:embed=y&:showVizHome=n&:tabs=n&:toolbar=n&:device=desktop&showShareOptions=false&:apiID=host0#navType=1&navSrc=Parse
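For instance, a minimal sketch that hard-codes that <iframe> URL; note that the Tableau view is itself rendered with JavaScript, so a plain requests.get() of this URL will still return little, and loading it in Selenium is the safer route:
from bs4 import BeautifulSoup
from selenium import webdriver
import time

# Hard-coded URL of the dynamically generated <iframe> (see above)
iframe_url = ("https://public.tableau.com/views/Vaccination_Rate_Public/NYSVaccinationRate"
              "?:embed=y&:showVizHome=n&:tabs=n&:toolbar=n&:device=desktop"
              "&showShareOptions=false&:apiID=host0#navType=1&navSrc=Parse")

driver = webdriver.Chrome()
driver.get(iframe_url)
time.sleep(10)  # crude pause while the Tableau visualization renders

soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()
print(soup.prettify())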

Not getting all the information when scraping bet365.com

I am having a problem when trying to scrape https://www.bet365.com/ using urllib.request and BeautifulSoup.
The problem is that the code below doesn't get all the information on the page; for example, players' names don't appear. Do I maybe need another framework or configuration to extract the information?
My code is:
from bs4 import BeautifulSoup
import urllib.request
import urllib.error

url = "https://www.bet365.com/"
try:
    page = urllib.request.urlopen(url)
except urllib.error.URLError:
    print("An error occurred.")
    raise
soup = BeautifulSoup(page, 'html.parser')
soup = str(soup)
Looking at the source code for the page in question, it looks like essentially all of the data is populated by JavaScript. BeautifulSoup isn't a headless client; it's just an HTML parser, so it can't see anything that is populated by JavaScript after the initial page load. You'd need a browser-automation tool like Selenium to scrape something like that.
You need to use Selenium instead of urllib.request, along with BeautifulSoup.
from bs4 import BeautifulSoup
from selenium import webdriver

url = "https://www.bet365.com"
driver = webdriver.Chrome(executable_path=r"the_path_of_driver")
driver.get(url)
driver.maximize_window()    # optional: maximize the browser window
driver.implicitly_wait(60)  # optional: wait for elements while the page loads
soup = BeautifulSoup(driver.page_source, 'html.parser')  # get the soup
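If parts of the page (such as the players' names) still haven't loaded, an explicit wait on a specific element is more reliable than a blanket implicit wait. A sketch, where the CSS selector is a hypothetical placeholder you'd replace with one matching the elements you actually need:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 60s for at least one matching element to appear;
# '.ovm-Fixture' is a hypothetical selector -- substitute the real one.
WebDriverWait(driver, 60).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, '.ovm-Fixture'))
)
soup = BeautifulSoup(driver.page_source, 'html.parser')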

How to get sub-content from a Wikipedia page using BeautifulSoup

I am trying to scrape sub-content from Wikipedia pages based on an internal link using Python. The problem is that my code scrapes all the content of the page; how can I scrape just the paragraphs under the internal link? Thanks in advance.
import requests
from bs4 import BeautifulSoup as bs

base_link = 'https://ar.wikipedia.org/wiki/%D8%A7%D9%84%D8%AA%D9%87%D8%A7%D8%A8_%D8%A7%D9%84%D9%82%D8%B5%D8%A8%D8%A7%D8%AA'
sub_link = "#الأسباب"
total = base_link + sub_link
r = requests.get(total)
soup = bs(r.text, 'html.parser')
results = soup.find('p')
print(results)
That's because what you're trying to scrape isn't a sub-link; it's an anchor within the same page.
Try requesting the entire page and then finding the element with the given id.
Something like this:
from bs4 import BeautifulSoup as soup
import requests

base_link = 'https://ar.wikipedia.org/wiki/%D8%A7%D9%84%D8%AA%D9%87%D8%A7%D8%A8_%D8%A7%D9%84%D9%82%D8%B5%D8%A8%D8%A7%D8%AA'
anchor_id = "الأسباب"

r = requests.get(base_link)
page = soup(r.text, 'html.parser')
# The section heading contains a <span> whose id matches the anchor
span = page.find('span', {'id': anchor_id})
# The section's paragraphs follow the heading element
results = span.parent.find_next_siblings('p')
print(results[0].text)
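One caveat: find_next_siblings('p') returns every later <p> in the article, not only those in the الأسباب section. If you want just that section's paragraphs, here is a sketch that walks the siblings of the heading and stops at the next heading (it assumes the usual Wikipedia layout, where headings are <h2>/<h3> siblings of the paragraphs):
heading = span.parent  # the <h2>/<h3> element containing the anchor span
section_paragraphs = []
for sibling in heading.find_next_siblings():
    if sibling.name in ('h2', 'h3'):  # the next section starts here
        break
    if sibling.name == 'p':
        section_paragraphs.append(sibling.get_text())
print('\n'.join(section_paragraphs))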

What is the difference between the soup from Selenium and from requests?

I was crawling some information from the web, but I got different results when using Selenium and when using requests.
Selenium
driver.get('https://www.jobplanet.co.kr/companies/322493/benefits/%EC%A7%80%EC%97%90%EC%9D%B4%EC%B9%98%EC%94%A8%EC%A7%80')
soup = BeautifulSoup(driver.page_source, 'html.parser')
sample = soup.find_all('div', class_='accord_hd')
requests
response = requests.get('https://www.jobplanet.co.kr/companies/322493/benefits/%EC%A7%80%EC%97%90%EC%9D%B4%EC%B9%98%EC%94%A8%EC%A7%80')
soup = BeautifulSoup(response.content, 'html.parser')
sample = soup.find_all('div', class_='accord_hd')
While using Selenium, it returned an empty list, but with requests, I got a list with some strings in it.
I experienced something similar to this before, so I wonder what's going on here.
requests obtains and returns the initial HTML source code.
Selenium simulates/automates a browser to open the web page, after which you can pull the HTML source that was used to render the page.
The difference between the two is that requests does not execute the JavaScript that renders a dynamically created site, while Selenium actually opens the page in a browser and lets it render its contents before you grab the HTML source.
That's the reason why you may get two different responses when using requests versus Selenium.
However, with the particular code you have given above, I got exactly the same output using Selenium and using requests.
Code:
from bs4 import BeautifulSoup
from selenium import webdriver
import requests

url = 'https://www.jobplanet.co.kr/companies/322493/benefits/%EC%A7%80%EC%97%90%EC%9D%B4%EC%B9%98%EC%94%A8%EC%A7%80'

# Selenium: render the page in a real browser, then parse the source
driver = webdriver.Chrome('C:/chromedriver_win32/chromedriver.exe')
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'html.parser')
sample_selenium = soup.find_all('div', class_='accord_hd')
driver.close()

# requests: parse the initial HTML source directly
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
sample_requests = soup.find_all('div', class_='accord_hd')

print('Selenium: %s items\nRequests: %s items' % (len(sample_selenium), len(sample_requests)))
Output:
Selenium: 11 items
Requests: 11 items
