Cannot webscrape elements with Playwright and BeautifulSoup - python

I am trying to update my web-scraping scripts because the site (https://covid19.gov.vn/) has been redesigned, but I can't for the life of me figure out how to parse these elements. When I inspect the elements they appear to be there as usual, but I cannot parse them with BeautifulSoup. My initial attempts included using Playwright, and I tried again, but I still couldn't scrape them correctly. Viewing the page source, it's almost as if the elements aren't there at all. Can anyone with more knowledge of HTML and web scraping explain to me how this works? I'm pretty much stuck here.
This is basically my last attempt before I gave up looking at the page source:
from bs4 import BeautifulSoup as bs
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://covid19.gov.vn/")
    page_content = page.content()
    soup = bs(page_content, features="lxml")
    test = soup.findAll('div', class_="content-tab show", id="vi")
    print(test)
    browser.close()
My idea was to scrape the page and just iterate through all the content inside. But well, it doesn't work. Much appreciated if anyone can help me with this! Thanks!
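For what it's worth, the selector itself is fine; the problem is that the element is injected by JavaScript after the page loads, so it never appears in the static source. A minimal sketch, using a hand-written snippet that stands in for the rendered markup (the live site's HTML may differ):

```python
from bs4 import BeautifulSoup

# Stand-in for the *rendered* markup; the selector matches once the
# element actually exists in the HTML being parsed.
html = '<div class="content-tab show" id="vi"><p>data</p></div>'
soup = BeautifulSoup(html, "html.parser")
matches = soup.find_all("div", class_="content-tab show", id="vi")
print(len(matches))  # 1
```

With Playwright, waiting for the element (e.g. `page.wait_for_selector("#vi")`) before calling `page.content()` is usually what's needed when content is injected late.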

Try the code below - it is based on an HTTP GET call that fetches the data you are looking for.
import requests

r = requests.get('https://static.pipezero.com/covid/data.json')
if r.status_code == 200:
    data = r.json()
    print(data['total']['internal'])
Output:
{'death': 17545, 'treating': 27876, 'cases': 707436, 'recovered': 475343}

Related

Scraping Webpage With Beautiful Soup

I am new to web scraping and I am trying to scrape wind data from a website. Here is the website: https://wx.ikitesurf.com/spot/507.
I understand that I can do this using Selenium to find elements, but I think I may have found a better way. Please correct me if I am wrong. In developer tools I can find this request by going to Network -> JS -> getGraph?
https://api.weatherflow.com/wxengine/rest/graph/getGraph?callback=jQuery17200020271765600428093_1619158293267&units_wind=mph&units_temp=f&units_distance=mi&fields=wind&format=json&null_ob_min_from_now=60&show_virtual_obs=true&spot_id=507&time_start_offset_hours=-36&time_end_offset_hours=0&type=dataonly&model_ids=-101&wf_token=3a648ec44797cbf12aca8ebc6c538868&_=1619158293881
This page contains all the data I need and it is constantly updating. Here is my code:
import time

import requests
from bs4 import BeautifulSoup

url = 'https://api.weatherflow.com/wxengine/rest/graph/getGraph?callback=jQuery17200020271765600428093_1619158293267&units_wind=mph&units_temp=f&units_distance=mi&fields=wind&format=json&null_ob_min_from_now=60&show_virtual_obs=true&spot_id=507&time_start_offset_hours=-36&time_end_offset_hours=0&type=dataonly&model_ids=-101&wf_token=3a648ec44797cbf12aca8ebc6c538868&_=1619158293881'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
time.sleep(3)
wind = soup.find("last_ob_wind_desc")
print(wind)
I tried using Beautiful Soup to scrape it, but I always receive the answer "None". Does anyone know how I can scrape this page? I would like to know what I am doing wrong. Thanks for any help!
Removing `callback=jQuery17200020271765600428093_1619158293267&` from the API URL will make it return proper JSON:
import requests
url = 'https://api.weatherflow.com/wxengine/rest/graph/getGraph?units_wind=mph&units_temp=f&units_distance=mi&fields=wind&format=json&null_ob_min_from_now=60&show_virtual_obs=true&spot_id=507&time_start_offset_hours=-36&time_end_offset_hours=0&type=dataonly&model_ids=-101&wf_token=3a648ec44797cbf12aca8ebc6c538868&_=1619158293881'
response = requests.get(url).json()
`response` is now a dictionary with the data; `last_ob_wind_desc` can be retrieved with `response['last_ob_wind_desc']`.
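Alternatively, if an endpoint only ever returns JSONP (the JSON wrapped in a callback invocation), the payload can be unwrapped before decoding. A sketch, using an invented wrapper string rather than a live response:

```python
import json
import re

# Example JSONP response: JSON wrapped in a callback call (string invented
# for illustration; the live API's fields may differ).
jsonp = 'jQuery17200020271765600428093_1619158293267({"last_ob_wind_desc": "NNW 12 mph"})'

# Strip everything up to the first "(" and the trailing ")".
payload = re.search(r'\((.*)\)\s*$', jsonp, re.S).group(1)
data = json.loads(payload)
print(data['last_ob_wind_desc'])  # NNW 12 mph
```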
You can also save the data to csv or other file formats with pandas:
import pandas as pd
df = pd.json_normalize(response)
df.to_csv('filename.csv')

Python Webscraping not return the right text, and sometimes no text at all

I'm trying to retrieve the price of the item from this Amazon page, URL:
https://www.amazon.com/FANMATS-University-Longhorns-Chrome-Emblem/dp/B00EPDLL6U/
Source Code
from bs4 import BeautifulSoup
import requests

text = "https://www.amazon.com/FANMATS-University-Longhorns-Chrome-Emblem/dp/B00EPDLL6U/"
page = requests.get(text)
data = page.text
soup = BeautifulSoup(data, 'lxml')
web_text = soup.find_all('div')
print(web_text)
Every time I run the program, I get HTML output that looks nothing like the webpage, saying things like:
" Sorry! Something went wrong on our end. Please go back and try again..."
I'm not sure what I'm doing wrong, any help would be much appreciated. I'm new to python and webscraping so I'm sorry if my issue is super obvious. Thanks! :)
The website serves its content dynamically, which requests cannot handle; use Selenium instead:
import time

from selenium import webdriver
from bs4 import BeautifulSoup

url = 'https://www.amazon.com/FANMATS-University-Longhorns-Chrome-Emblem/dp/B00EPDLL6U/'
driver = webdriver.Chrome(r'C:\Program Files\ChromeDriver\chromedriver.exe')
driver.get(url)
time.sleep(3)  # give the page time to finish loading its dynamic content
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
print(soup.select_one('span#priceblock_ourprice').get_text())
driver.close()

Issues with requests and BeautifulSoup

I'm trying to read a news webpage to get the titles of their stories. I'm attempting to put them in a list, but I keep getting an empty list. Can someone please point me in the right direction here? What am I missing? Please see the code below. Thanks.
import requests
from bs4 import BeautifulSoup

url = 'https://nypost.com/'
ttl_lst = []
soup = BeautifulSoup(requests.get(url).text, "lxml")
title = soup.findAll('h2', {'class': 'story-heading'})
for row in title:
    ttl_lst.append(row.text)
print(ttl_lst)
The requests module only returns the first HTML file sent to it. Sites like nypost use AJAX requests to load their articles. You will have to use something like Selenium for this, which lets the AJAX requests run after the page loads.
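Once the fully rendered HTML is in hand (e.g. from Selenium's `driver.page_source`), the original parsing logic works unchanged. A sketch against a hand-written snippet, since nypost's real markup may differ:

```python
from bs4 import BeautifulSoup

# Stand-in for the fully rendered page source; real class names on the
# live site may have changed.
rendered = """
<h2 class="story-heading"><a href="#">First headline</a></h2>
<h2 class="story-heading"><a href="#">Second headline</a></h2>
"""
soup = BeautifulSoup(rendered, "html.parser")
ttl_lst = [h2.get_text(strip=True)
           for h2 in soup.findAll('h2', {'class': 'story-heading'})]
print(ttl_lst)  # ['First headline', 'Second headline']
```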

BeautifulSoup doesn't return the expected tag as Chrome

I'm trying to parse a page to learn BeautifulSoup. Here is the code:
import requests as req
from bs4 import BeautifulSoup
page = 'https://www.pathofexile.com/trade/search/Delirium/w0brcb'
resp = req.get(page)
soup = BeautifulSoup(resp.content, 'html.parser')
res = soup.find_all('results')
print(len(res))
Result: 0
The goal is to get the first price.
I tried to look for the tag in Chrome and it's there, but the browser probably makes another request to get the results.
Can someone explain what I am missing here?
[Screenshot: website's source code]
Problems with the code
Your code is looking for a "results" element. What you really have to look for (based on your screenshot) is a div element with the class "results".
So try this:
soup.find_all("div", attrs={"class":"results"})
But if you want the price you have to dig deeper for the element which contains the price:
price = soup.find("span", attrs={"data-field":"price"}).text
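As a quick check that the two lookups behave as described, here they are against a minimal hand-written snippet (the real trade-site markup is more deeply nested):

```python
from bs4 import BeautifulSoup

# Minimal stand-in for the rendered results markup.
html = '<div class="results"><span data-field="price">5 chaos</span></div>'
soup = BeautifulSoup(html, "html.parser")
results = soup.find_all("div", attrs={"class": "results"})
price = soup.find("span", attrs={"data-field": "price"}).text
print(len(results), price)  # 1 5 chaos
```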
Problems with the site
It seems the site loads its data via AJAX. With requests you get the page before/without the AJAX data call.
In this case you should switch from requests to the Selenium module. It will drive a real browser, and you can wait until the data has finished loading before you start scraping.
Documentation: Selenium

Parsing a webpage using Beautiful Soup in python, doesnt work with specific page

New to python, thought I'd try to make a web crawler as a first project. Found Beautiful Soup as the solution. All is well except that the ONE page I want to crawl yields no results :(
Here is the code:
from bs4 import BeautifulSoup
from mechanize import Browser

def crawl_list(max_pages):
    mech = Browser()
    place = 1
    while place <= max_pages:
        url = "http://www.crummy.com/software/BeautifulSoup/bs4/doc/"
        page = mech.open(url)
        html = page.read()
        soup = BeautifulSoup(html)
        for link in soup.findAll('a'):
            href = link.get('href')
            print(href)
        place += 1

crawl_list(1)
This code works wonders. I get the whole list of links. BUT, as soon as I put http://diseasesdatabase.com/disease_index_a.asp as the value of 'url', no dice.
Perhaps it has to do with the .asp? Can someone please solve this mystery?
I'm getting this as an error message:
mechanize._response.httperror_seek_wrapper: HTTP Error 410: Gone
Thanks in advance.
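No answer was posted here, but an HTTP 410 like this is often the server rejecting the default client User-Agent rather than the page truly being gone (an assumption, not verified against this site). A common workaround is to send a browser-like User-Agent; a stdlib sketch of building such a request:

```python
from urllib.request import Request

# Hypothetical browser-like User-Agent; the site may still block clients
# for other reasons (robots rules, cookies, JavaScript checks).
req = Request(
    "http://diseasesdatabase.com/disease_index_a.asp",
    headers={"User-Agent": "Mozilla/5.0"},
)
print(req.get_header("User-agent"))  # Mozilla/5.0
```

With mechanize specifically, the equivalent is `mech.addheaders = [("User-Agent", "Mozilla/5.0")]`, optionally combined with `mech.set_handle_robots(False)` if the site's robots.txt is what blocks the open.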
