Extracting Scraped Web Content from iframe

I'm attempting to scrape the table at https://coronavirus.health.ny.gov/zip-code-vaccination-data
I've looked at "Python BeautifulSoup - Scrape Web Content Inside Iframes" and have gotten this far, but I don't know how to extract the information from soup.
Any help is greatly appreciated.
import requests
from bs4 import BeautifulSoup
s = requests.Session()
r = s.get("https://coronavirus.health.ny.gov/zip-code-vaccination-data")
soup = BeautifulSoup(r.content, "html.parser")
iframe_src = '//static-assets.ny.gov/load_global_footer/ajax?iframe=true'
r = s.get(f"https:{iframe_src}")
soup = BeautifulSoup(r.content, "html.parser")

The website you're trying to scrape generates the <iframe> dynamically with JavaScript, so you need either a tool that automates browser actions, such as Selenium or Puppeteer, or you can assign the <iframe> URL to a variable, since it seems unlikely to change in the near future. Here is the URL of your <iframe>:
https://public.tableau.com/views/Vaccination_Rate_Public/NYSVaccinationRate?:embed=y&:showVizHome=n&:tabs=n&:toolbar=n&:device=desktop&showShareOptions=false&:apiID=host0#navType=1&navSrc=Parse
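To confirm which iframes are actually present in the static HTML that requests receives, something like the sketch below may help. The helper name iframe_urls and the sample HTML are made up for illustration; only the footer iframe src and the page URL are taken from the question. It also shows that urljoin resolves a protocol-relative src (one starting with //) against the page URL.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def iframe_urls(html, page_url):
    """Return the absolute URL of every <iframe src=...> found in static HTML."""
    soup = BeautifulSoup(html, "html.parser")
    # src=True skips iframes that have no src attribute at all
    return [urljoin(page_url, tag["src"]) for tag in soup.find_all("iframe", src=True)]


# Hypothetical stand-in for the response body; a real page would come from requests.get(...).text
sample_html = """
<html><body>
<div id="placeholder-filled-in-later-by-javascript"></div>
<iframe src="//static-assets.ny.gov/load_global_footer/ajax?iframe=true"></iframe>
</body></html>
"""
page_url = "https://coronavirus.health.ny.gov/zip-code-vaccination-data"

print(iframe_urls(sample_html, page_url))
```

If the iframe you care about is missing from this list, it is most likely injected by JavaScript after the page loads.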

Related

Beautiful Soup Link Scraping [duplicate]

I am testing using the requests module to get the content of a webpage. But when I look at the content I see that it does not get the full content of the page.
Here is my code:
import requests
from bs4 import BeautifulSoup
url = "https://shop.nordstrom.com/c/womens-dresses-shop?origin=topnav&cm_sp=Top%20Navigation-_-Women-_-Dresses&offset=11&page=3&top=72"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())
Also, if I look at the page source in the Chrome browser, I do not see the full content.
Is there a way to get the full content of the example page that I have provided?

The page is rendered with JavaScript, which makes additional requests to fetch more data. You can fetch the fully rendered page with Selenium.
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome()
url = "https://shop.nordstrom.com/c/womens-dresses-shop?origin=topnav&cm_sp=Top%20Navigation-_-Women-_-Dresses&offset=11&page=3&top=72"
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()
print(soup.prettify())
For other solutions, see my answer to "Scraping Google Finance (BeautifulSoup)".

A response fetched with requests is not the same as the page source or the visual elements you see in the browser, and viewing the page source does not give you everything on the page either: data loaded by later requests and other back-end work never appears there. Either your question is not clear enough, or you've misunderstood how web browsing works.

Trying to Scrape a Span

I've been trying to scrape two values from a website using Beautiful Soup in Python, and it's been giving me trouble. Here is the URL of the page I'm scraping:
https://www.stjosephpartners.com/Home/Index
Here are the values I'm trying to scrape:
HTML of Website to be Scraped
I tried:
from bs4 import BeautifulSoup
import requests
source = requests.get('https://www.stjosephpartners.com/Home/Index').text
soup = BeautifulSoup(source, 'lxml')
gold_spot_shell = soup.find('div', class_ = 'col-lg-10').children
print(gold_spot_shell)
The output I got was: <list_iterator object at 0x039FD0A0>
When I tried using: gold_spot_shell = soup.find('div', class_ = 'col-lg-10').children
The output was: ['\n']
When I tried using: gold_spot_shell = soup.find('div', class_ = 'col-lg-10').span
The output was: None
The HTML definitely has at least one span child. I'm not sure how to scrape the values I'm after. Thanks.

BeautifulSoup + requests is not a good way to scrape a dynamic website like this. That span is generated by JavaScript, so it simply does not exist in the HTML that requests returns.
You can try Selenium instead.
You can check whether a website uses JavaScript to render an element by disabling JavaScript on the page and looking for the element again, or simply by using "view page source".
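The same check can be done from Python by looking for the span in the HTML you already downloaded. This is only a sketch: the helper name span_text_in_div and both HTML snippets are hypothetical, standing in for what requests receives and what the browser shows after scripts run.

```python
from bs4 import BeautifulSoup


def span_text_in_div(html, div_class):
    """Return the text of the first <span> inside the first matching <div>, or None."""
    soup = BeautifulSoup(html, "html.parser")
    div = soup.find("div", class_=div_class)
    if div is None:
        return None
    span = div.find("span")
    if span is None:
        return None
    return span.get_text(strip=True)


# Hypothetical snippets (not copied from the real site)
static_html = '<div class="col-lg-10">\n</div>'                          # as served
rendered_html = '<div class="col-lg-10"><span>PLACEHOLDER</span></div>'  # after JavaScript

print(span_text_in_div(static_html, "col-lg-10"))    # None
print(span_text_in_div(rendered_html, "col-lg-10"))  # PLACEHOLDER
```

If the helper returns None for the text from requests.get(...).text but the span is visible in the browser's inspector, the element is being added by JavaScript.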

How to get sub-content from wikipedia page using BeautifulSoup

I am trying to scrape sub-content from Wikipedia pages based on an internal link, using Python. The problem is that it scrapes all the content from the page. How can I scrape just the paragraph that the internal link points to? Thanks in advance.
import requests
from bs4 import BeautifulSoup as bs
base_link='https://ar.wikipedia.org/wiki/%D8%A7%D9%84%D8%AA%D9%87%D8%A7%D8%A8_%D8%A7%D9%84%D9%82%D8%B5%D8%A8%D8%A7%D8%AA'
sub_link="#الأسباب"
total=base_link+sub_link
r=requests.get(total)
soup = bs(r.text, 'html.parser')
results=soup.find('p')
print(results)

That is because what you are trying to scrape is not a sub-link. It's an anchor.
Request the entire page and then find the element with the given id.
Something like this:
from bs4 import BeautifulSoup as soup
import requests
base_link='https://ar.wikipedia.org/wiki/%D8%A7%D9%84%D8%AA%D9%87%D8%A7%D8%A8_%D8%A7%D9%84%D9%82%D8%B5%D8%A8%D8%A7%D8%AA'
anchor_id="الأسباب"
r=requests.get(base_link)
page = soup(r.text, 'html.parser')
span = page.find('span', {'id': anchor_id})
results = span.parent.find_next_siblings('p')
print(results[0].text)
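To see why appending the anchor to the URL changes nothing, it may help to split the URL with the standard library. The part after # is a fragment, which the browser handles itself and which is not part of the HTTP request, so the server returns the same full page either way. The variable names url_without_fragment and fragment are just illustrative.

```python
from urllib.parse import urldefrag

base_link = "https://ar.wikipedia.org/wiki/%D8%A7%D9%84%D8%AA%D9%87%D8%A7%D8%A8_%D8%A7%D9%84%D9%82%D8%B5%D8%A8%D8%A7%D8%AA"
total = base_link + "#الأسباب"

# urldefrag separates the fragment (the anchor) from the rest of the URL
url_without_fragment, fragment = urldefrag(total)

print(url_without_fragment == base_link)  # True: the page URL is unchanged
print(fragment)                           # the id to look for in the parsed HTML
```

The fragment is then useful only as the id to search for once the whole page has been parsed.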

webscraping: extracting url from xpath in html using python: airbnb listings

I am trying to extract URLs for listings from a city page on Airbnb, using Python 3 libraries. I am familiar with how to scrape simpler websites with the BeautifulSoup and requests libraries.
url: 'https://www.airbnb.com/s/Denver--CO--United-States/homes'
element in the html
If I inspect the element of a link on the page (in Chrome), I get:
xpath: "//*[@id="listing-9770909"]/div[2]/a"
selector: "#listing-9770909 > div._v72lrv > a"
My attempts:
import requests
from bs4 import BeautifulSoup
url = 'https://www.airbnb.com/s/Denver--CO--United-States/homes'
html = requests.get(url)
soup = BeautifulSoup(html.text, 'html.parser')
divs = soup.find_all('div', attrs={'id': 'listing'})
attempt 2:
import requests
from lxml import html
page = requests.get(url)
root = html.fromstring(page.content)
tree = root.getroottree()
result = root.xpath('//div[@id="listing-9770909"]/div[2]/a')
for r in result:
    print(r)
Neither of these returns anything. What I need to be able to extract is the url for the page link. Any ideas?

To extract the links, first make sure that the URLs actually exist in the page source. To check, search the page source for any of the listing IDs (Ctrl+U opens the source in Google Chrome and Mozilla Firefox). If the URLs are in the page source, you can scrape them directly with XPath from the response text of the listing page. Here, the Airbnb listing page does not have the links in its page source, so the page is probably sending requests to other endpoints (usually JSON requests). You can find those requests, send them yourself, and get the required data from the responses.
Please comment if you have any questions about this.
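Once you have found such a JSON request in the browser's network tab, turning its response into links is mostly plain dictionary handling. The sketch below is not Airbnb's real API: the payload shape, the "listings" key, and the ROOM_URL_PREFIX pattern are all assumptions for illustration, and only the listing id 9770909 comes from the question.

```python
import json

# Assumed URL pattern for a listing page (check it against a real link on the site)
ROOM_URL_PREFIX = "https://www.airbnb.com/rooms/"


def listing_urls(payload_text):
    """Build listing URLs from a JSON payload with the hypothetical shape shown below."""
    data = json.loads(payload_text)
    return [f"{ROOM_URL_PREFIX}{item['id']}" for item in data.get("listings", [])]


# Hypothetical payload; the real response will be structured differently
sample_payload = '{"listings": [{"id": 9770909}]}'

print(listing_urls(sample_payload))
```

With the real endpoint you would pass in the response text and adjust the keys to match whatever structure the response actually has.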

Web scraping cnbc.com

I am trying to scrape this page with bs4, and I was wondering how I can scrape the EUR/USD, price change, and price % values?
I am pretty new to this, so this is all I have so far:
import requests
from bs4 import BeautifulSoup
url = 'http://www.cnbc.com/pre-markets/'
source_code = requests.get(url).text
soup = BeautifulSoup(source_code, 'lxml')
for r in soup.find_all('td', {'class': 'first text'}):
    print(r)

The data you're looking for is probably loaded with JavaScript, so you can't see it with bs4. You can get it with a headless browser or browser-automation tool such as PhantomJS, Selenium, or Splash. See also this response: scraping dynamic updates of temperature sensor data from a website
