Scraping Webpage With Beautiful Soup - python

I am new to web scraping and I am trying to scrape wind data from a website. Here is the website: https://wx.ikitesurf.com/spot/507.
I understand that I can do this using Selenium to find elements, but I think I may have found a better way; please correct me if I am wrong. In developer tools I can find this request by going to Network -> JS -> getGraph:
https://api.weatherflow.com/wxengine/rest/graph/getGraph?callback=jQuery17200020271765600428093_1619158293267&units_wind=mph&units_temp=f&units_distance=mi&fields=wind&format=json&null_ob_min_from_now=60&show_virtual_obs=true&spot_id=507&time_start_offset_hours=-36&time_end_offset_hours=0&type=dataonly&model_ids=-101&wf_token=3a648ec44797cbf12aca8ebc6c538868&_=1619158293881
This page contains all the data I need and it is constantly updating. Here is my code:
import time
import requests
from bs4 import BeautifulSoup

url = 'https://api.weatherflow.com/wxengine/rest/graph/getGraph?callback=jQuery17200020271765600428093_1619158293267&units_wind=mph&units_temp=f&units_distance=mi&fields=wind&format=json&null_ob_min_from_now=60&show_virtual_obs=true&spot_id=507&time_start_offset_hours=-36&time_end_offset_hours=0&type=dataonly&model_ids=-101&wf_token=3a648ec44797cbf12aca8ebc6c538868&_=1619158293881'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
time.sleep(3)
wind = soup.find("last_ob_wind_desc")
print(wind)
I tried using Beautiful Soup to scrape the page, but I always receive None. Does anyone know how I can scrape this page? I would like to know what I am doing wrong. Thanks for any help!

Removing callback=jQuery17200020271765600428093_1619158293267& from the API URL will make it return proper JSON:
import requests
url = 'https://api.weatherflow.com/wxengine/rest/graph/getGraph?units_wind=mph&units_temp=f&units_distance=mi&fields=wind&format=json&null_ob_min_from_now=60&show_virtual_obs=true&spot_id=507&time_start_offset_hours=-36&time_end_offset_hours=0&type=dataonly&model_ids=-101&wf_token=3a648ec44797cbf12aca8ebc6c538868&_=1619158293881'
response = requests.get(url).json()
response is now a dictionary with the data. last_ob_wind_desc can be retrieved with response['last_ob_wind_desc'].
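For example (a minimal check; the actual string depends on current conditions):
print(response['last_ob_wind_desc'])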
You can also save the data to CSV or other file formats with pandas:
import pandas as pd
df = pd.json_normalize(response)
df.to_csv('filename.csv')

Related

Cannot webscrape elements with Playwright and BeautifulSoup

I am trying to update my web scraping scripts because the site (https://covid19.gov.vn/) has been updated, but I can't for the life of me figure out how to parse these elements. Inspecting the elements, they seem to be there as usual, but I cannot parse them with BeautifulSoup. My initial attempts included using Playwright, and I tried again but still couldn't scrape it correctly. Viewing the source, it's almost like the elements are not there at all. Can anyone with more knowledge about HTML and web scraping explain to me how this works? I'm pretty much stuck here.
This is basically my last attempt before I gave up looking at the page source:
from bs4 import BeautifulSoup as bs
import requests
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://covid19.gov.vn/")
    page_content = page.content()
    soup = bs(page_content, features="lxml")
    test = soup.find_all('div', class_="content-tab show", id="vi")
    print(test)
    browser.close()
My idea was to scrape it and just iterate through all the content inside, but it doesn't work. I'd much appreciate it if anyone can help me with this! Thanks!
Try the code below - it is based on an HTTP GET call that fetches the data you are looking for. The page renders its content with JavaScript (which is why the elements don't show up in the HTML source); the underlying data comes from a JSON endpoint.
import requests

r = requests.get('https://static.pipezero.com/covid/data.json')
if r.status_code == 200:
    data = r.json()
    print(data['total']['internal'])
Output:
{'death': 17545, 'treating': 27876, 'cases': 707436, 'recovered': 475343}
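Since the response is just nested dictionaries, individual fields can be pulled out directly; a small sketch based on the keys shown in the output above:
totals = data['total']['internal']
print(totals['cases'])      # 707436 in the example output above
print(totals['recovered'])  # 475343
print(totals['death'])      # 17545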

Python Beautiful Soup not pulling all the data

I'm currently looking to pull specific issuer data, identified by a specific class and ID, from the Luxembourg Stock Exchange's pages using Beautiful Soup.
The example link I'm using is here: https://www.bourse.lu/security/XS1338503920/234821
And the data I'm trying to pull is the name under 'Issuer' stored as text; in this case it's 'BNP Paribas Issuance BV'.
I've tried using the class vignette-description-content-text, but it can't seem to find any data; when looking through the soup, not all of the HTML is being pulled.
I've found that my current code only pulls some of the html, and I don't know how to expand the data it's pulling.
import requests
from bs4 import BeautifulSoup
URL = "https://www.bourse.lu/security/XS1338503920/234821"
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find(id='ResultsContainer', class_="vignette-description-content-text")
I have found similar problems and followed guides shown in link 1, link 2 and link 3, but the example html used seems very different to the webpage I'm looking to scrape.
Is there something I'm missing to pull and scrape the data?
Based on your code, I suspect you are trying to get the element which has class="vignette-description-content-text" and id="ResultsContainer".
class_ is the correct keyword to use for the class, but not together with the id like that.
Try this:
import requests
from bs4 import BeautifulSoup

URL = "https://www.bourse.lu/security/XS1338503920/234821"
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')

def applyFilter(element):
    if element.has_attr('id') and element.has_attr('class'):
        if "vignette-description-content-text" in element['class'] and element['id'] == "ResultsContainer":
            return True

results = soup.find_all(applyFilter)
for result in results:
    print(result)  # each result is a matching element
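As an alternative to the filter function, the same match can be written as a CSS selector with select (a sketch; it assumes the element really carries both that id and that class):
# id and class combined in one CSS selector
for result in soup.select('#ResultsContainer.vignette-description-content-text'):
    print(result.get_text(strip=True))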

Unable to scrape tabular data in NSE

I am trying to scrape the Advances/Declines from the NSE website - https://www1.nseindia.com/live_market/dynaContent/live_market.htm
Advances/Declines is in tabular format in the HTML, but I am not able to retrieve the actual numerical value that is displayed on the site.
from bs4 import BeautifulSoup
import pandas as pd
import requests

url = "https://www1.nseindia.com/live_market/dynaContent/live_market.htm"
webpage = requests.get(url)
soup = BeautifulSoup(webpage.content, "html.parser")

for tr in soup.find_all('tr'):
    advance = tr.find_all('td')
    print(advance)
I am only able to get an empty value or None, and I am not sure what I am doing wrong. When I inspect the element on the website I see the numerical values 978 and 904, but in Spyder the values in these elements are displayed as a hyphen. Can someone please help?
This page uses JavaScript to load this information, but requests/BeautifulSoup can't run JavaScript.
Using DevTools in Chrome/Firefox (tab Network, filter XHR) I found the URL that the JavaScript uses to load it as JSON data, so I don't even have to use BeautifulSoup to get it.
import requests
url = 'https://www1.nseindia.com/live_market/dynaContent/live_analysis/changePercentage.json'
r = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
data = r.json()
print(data['rows'][0]['advances'])
print(data['rows'][0]['declines'])
print(data['rows'][0]['unchanged'])
print(data['rows'][0]['total'])
BTW: the server doesn't send the data without a User-Agent header.
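If you want the whole table rather than individual fields, data['rows'] appears to be a list of dictionaries, so it can be loaded straight into pandas (a sketch under that assumption):
import pandas as pd

df = pd.DataFrame(data['rows'])  # one column per key: advances, declines, unchanged, total, ...
print(df)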

using beautiful soup for simulating a page-click to access all HTML on a page?

I'm trying to scrape the following website:
https://www.bandsintown.com/?came_from=257&sort_by_filter=Number+of+RSVPs
I'm able to successfully scrape the events listed on the page with BeautifulSoup, using the following code:
from bs4 import BeautifulSoup
import requests

url = 'https://www.bandsintown.com/?came_from=257&sort_by_filter=Number+of+RSVPs'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
dates = soup.find_all('div', {'class': 'event-b58f7990'})
month = []
day = []
for i in dates:
    md = i.find_all('div')
    month.append(md[0].text)
    day.append(md[1].text)
However, the issue I'm having is that I'm only able to scrape the first 18 events - the rest of the page is only available if the 'view all' button is clicked at the bottom. Is there a way in beautifulsoup, or otherwise, to simulate this button being clicked, so that I can scrape ALL of the data? I'd prefer to keep this in python as I'm doing most scraping with beautifulsoup. Thanks so much!
If you can work out the end point, or set an end point for the range in the following (with error handling for going too far), you can get a JSON response and parse out the info you require as follows. Depending on how many requests you are making, you may choose to re-use the connection with a Session, as sketched after the snippet.
import requests
import pandas as pd

url = 'https://www.bandsintown.com/upcomingEvents?came_from=257&sort_by_filter=Number+of+RSVPs&page={}&latitude=51.5167&longitude=0.0667'

results = []
for page in range(1, 20):
    data = requests.get(url.format(page)).json()
    for item in data['events']:
        results.append([item['artistName'], item['eventDate']['day'], item['eventDate']['month']])

df = pd.DataFrame(results)
print(df)
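If you do go with a Session, the underlying connection is re-used across all the page requests; a minimal sketch of the same loop (same URL template as above):
import requests

url = 'https://www.bandsintown.com/upcomingEvents?came_from=257&sort_by_filter=Number+of+RSVPs&page={}&latitude=51.5167&longitude=0.0667'

with requests.Session() as s:  # one connection re-used for every request
    for page in range(1, 20):
        data = s.get(url.format(page)).json()
        print(page, len(data['events']))  # parse data['events'] as in the snippet above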

Scraping data from yahoo finance using BeautifulSoup in python

I'm using BeautifulSoup to scrape the last 5 days of data from Yahoo Finance. Here is the link, but I'm not getting any data; the result gives everything except the generated data.
This is what I tried:
url = "https://in.finance.yahoo.com/quote/20MICRONS.NS/history?period1=1199125800&period2=1490207400&interval=1d&filter=history&frequency=1d"
request = urllib.request.Request(url,None,headers)
response = urllib.request.urlopen(request).read()
soup = BeautifulSoup(response, 'html.parser')
The finance data is not embedded in the web page; it is loaded by JavaScript. As you scroll down the page you will see the website loading new data into the page. The best way to solve this problem is to use Selenium or a PhantomJS-like solution.
You can use them with Python.
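A minimal Selenium sketch (it assumes a chromedriver matching your Chrome install is on PATH; the browser renders the page, then BeautifulSoup parses the result):
from selenium import webdriver
from bs4 import BeautifulSoup

url = "https://in.finance.yahoo.com/quote/20MICRONS.NS/history?period1=1199125800&period2=1490207400&interval=1d&filter=history&frequency=1d"

driver = webdriver.Chrome()  # assumes chromedriver is installed and on PATH
driver.get(url)              # the browser executes the page's JavaScript
html = driver.page_source    # HTML after rendering, including the generated data
driver.quit()

soup = BeautifulSoup(html, 'html.parser')
print(soup.find('table') is not None)  # the history table should now be present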
