Scraping data from Yahoo Finance using BeautifulSoup in Python

I'm using BeautifulSoup to scrape the last 5 days of data from Yahoo Finance. Here is the link, but I'm not getting any data. The result gives everything except the dynamically generated data.
This is what I tried:
url = "https://in.finance.yahoo.com/quote/20MICRONS.NS/history?period1=1199125800&period2=1490207400&interval=1d&filter=history&frequency=1d"
request = urllib.request.Request(url,None,headers)
response = urllib.request.urlopen(request).read()
soup = BeautifulSoup(response, 'html.parser')

Finance data is not embedded in the web page; it is loaded by JavaScript. As you scroll down the page, you will see the site loading new data into the page. The best way to solve this problem is to use a solution like Selenium or PhantomJS.
You can use them with Python.
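A minimal Selenium sketch of that approach (the CSS selector for the history table is an assumption; Yahoo changes its markup over time, so confirm it in your browser's dev tools):
from selenium import webdriver
from selenium.webdriver.common.by import By

url = "https://in.finance.yahoo.com/quote/20MICRONS.NS/history?period1=1199125800&period2=1490207400&interval=1d&filter=history&frequency=1d"

driver = webdriver.Chrome()  # or webdriver.Firefox()
driver.get(url)              # the page's JavaScript runs here and renders the table

# once rendered, the history is a plain <table>; grab its rows (selector assumed)
rows = driver.find_elements(By.CSS_SELECTOR, "table tbody tr")
for row in rows[:5]:         # the most recent trading days appear first
    print(row.text)

driver.quit()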

Related

Why is there an empty result while scraping an exchange rate table?

I want to scrape the Korean exchange rate from this website: http://www.smbs.biz/ExRate/StdExRate.jsp.
The daily exchange rate is provided in a table, so I tried to scrape it using BeautifulSoup, but the response is empty.
This is what I tried:
url = "http://www.smbs.biz/ExRate/StdExRate.jsp"
html = requests.get(url, verify=False).text
soup = BeautifulSoup(html, 'html.parser')
title = soup.select_one('#frm_SearchDate > div:nth-child(17) > table')
title.text
Result :
'\n일별 매매기준율\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n'
First of all, always take a look at your soup to see whether all the expected ingredients are there.
The data is loaded via an XHR request and the table is rendered dynamically by JavaScript. That is why you won't get the table with BeautifulSoup: it is simply not in the response to your request.
There are options to get it anyway:
check your browser's dev tools on the XHR tab to locate the API and pull part of the info from there;
use Selenium to get driver.page_source, which contains the whole table from the 'browser-like' rendered version of the website (see the sketch after the example below).
Example
import requests
from bs4 import BeautifulSoup

url = 'http://www.smbs.biz/ExRate/StdExRate_xml.jsp?arr_value=USD_2023-01-12_2023-02-03'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')
# each <set> element carries a date label and a rate value
{s.get('label'): s.get('value') for s in soup.select('set')}
Output
{'23.01.12': '1245.3',
'23.01.13': '1244.6',
'23.01.16': '1240.6',
'23.01.17': '1234',
'23.01.18': '1238.5',
'23.01.19': '1239.8',
'23.01.20': '1236',
'23.01.25': '1234.4',
'23.01.26': '1233.4',
'23.01.27': '1231.4',
'23.01.30': '1230.2',
'23.01.31': '1228.7',
'23.02.01': '1230.8',
'23.02.02': '1231.4',
'23.02.03': '1219.3'}
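For the Selenium option mentioned above, a minimal sketch could look like this (it reuses the CSS selector from the question and assumes the table has finished rendering before page_source is read; a WebDriverWait would be more robust than the fixed sleep):
import time
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("http://www.smbs.biz/ExRate/StdExRate.jsp")
time.sleep(3)  # crude wait for the XHR-driven table to render

# page_source now contains the table that JavaScript built in the browser
soup = BeautifulSoup(driver.page_source, 'html.parser')
table = soup.select_one('#frm_SearchDate > div:nth-child(17) > table')
print(table.text if table else "table not found")

driver.quit()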

Scraping Webpage With Beautiful Soup

I am new to web scraping and I am trying to scrape wind data from a website. Here is the website: https://wx.ikitesurf.com/spot/507.
I understand that I can do this using Selenium to find elements, but I think I may have found a better way. Please correct me if I am wrong. In developer tools I can find this request by going to Network -> JS -> getGraph?
https://api.weatherflow.com/wxengine/rest/graph/getGraph?callback=jQuery17200020271765600428093_1619158293267&units_wind=mph&units_temp=f&units_distance=mi&fields=wind&format=json&null_ob_min_from_now=60&show_virtual_obs=true&spot_id=507&time_start_offset_hours=-36&time_end_offset_hours=0&type=dataonly&model_ids=-101&wf_token=3a648ec44797cbf12aca8ebc6c538868&_=1619158293881
This page contains all the data I need and it is constantly updating. Here is my code:
import time
import requests
from bs4 import BeautifulSoup

url = 'https://api.weatherflow.com/wxengine/rest/graph/getGraph?callback=jQuery17200020271765600428093_1619158293267&units_wind=mph&units_temp=f&units_distance=mi&fields=wind&format=json&null_ob_min_from_now=60&show_virtual_obs=true&spot_id=507&time_start_offset_hours=-36&time_end_offset_hours=0&type=dataonly&model_ids=-101&wf_token=3a648ec44797cbf12aca8ebc6c538868&_=1619158293881'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
time.sleep(3)
wind = soup.find("last_ob_wind_desc")
print(wind)
I tried using Beautiful Soup to scrape it, but I always receive the answer "None". Does anyone know how I can scrape this page? I would like to know what I am doing wrong. Thanks for any help!
Removing callback=jQuery17200020271765600428093_1619158293267& from the API URL will make it return proper JSON:
import requests
url = 'https://api.weatherflow.com/wxengine/rest/graph/getGraph?units_wind=mph&units_temp=f&units_distance=mi&fields=wind&format=json&null_ob_min_from_now=60&show_virtual_obs=true&spot_id=507&time_start_offset_hours=-36&time_end_offset_hours=0&type=dataonly&model_ids=-101&wf_token=3a648ec44797cbf12aca8ebc6c538868&_=1619158293881'
response = requests.get(url).json()
response is now a dictionary with the data. last_ob_wind_desc can be retrieved with response['last_ob_wind_desc'].
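For example (a quick check, assuming the same response dictionary from the request above; the full set of keys depends on what the API returns):
# inspect what the API actually returned before relying on any particular key
print(sorted(response.keys()))

# field confirmed above: the latest human-readable wind description
print(response['last_ob_wind_desc'])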
You can also save the data to csv or other file formats with pandas:
import pandas as pd
df = pd.json_normalize(response)
df.to_csv('filename.csv')

Unable to navigate Amazon pagination with Python and BS4

I've been trying to create a simple web scraper program to scrape the book titles of a 100 bestseller list on Amazon. I've used this code before on another site with no problems. But for some reason, it scrapes the first page fine but then returns the same results for the following iterations.
I'm not sure whether it's something to do with how Amazon creates its URLs or not. When I manually enter the "#2" (and beyond) at the end of the URL in the browser, it navigates fine.
(Once the scrape is working I plan on dumping the data in csv files. But for now, print to the terminal will do.)
import requests
from bs4 import BeautifulSoup

for i in range(5):
    url = "https://smile.amazon.com/Best-Sellers-Kindle-Store-Dystopian-Science-Fiction/zgbs/digital-text/6361470011/ref=zg_bs_nav_kstore_4_158591011#{}".format(i)
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "lxml")

    for book in soup.find_all('div', class_='zg_itemWrapper'):
        title = book.find('div', class_='p13n-sc-truncate')
        name = book.find('a', class_='a-link-child')
        price = book.find('span', class_='p13n-sc-price')
        print(title)
        print(name)
        print(price)

print("END")
This is a common problem you have to face: some sites load data asynchronously (with AJAX). Those are the XMLHttpRequests you can see in the Network tab of your DOM inspector. Usually such sites load the data from a different endpoint, often with a POST method; to handle that you can use the urllib or requests library.
In this case the request is made with a GET method, and you can scrape it from this URL with no need to extend your code: https://www.amazon.com/Best-Sellers-Kindle-Store-Dystopian-Science-Fiction/zgbs/digital-text/6361470011/ref=zg_bs_pg_3?_encoding=UTF8&pg=3&ajax=1 where you only change the pg parameter.
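A minimal sketch of that approach (the pg and ajax query parameters come from the URL above; the zg_itemWrapper and p13n-sc-truncate classes are the ones from the question and may have changed since):
import requests
from bs4 import BeautifulSoup

base = "https://www.amazon.com/Best-Sellers-Kindle-Store-Dystopian-Science-Fiction/zgbs/digital-text/6361470011/ref=zg_bs_pg_{page}"

for page in range(1, 3):  # the 100-item list is split across a few pages
    r = requests.get(base.format(page=page), params={"_encoding": "UTF8", "pg": page, "ajax": 1})
    soup = BeautifulSoup(r.content, "lxml")
    for book in soup.find_all("div", class_="zg_itemWrapper"):
        title = book.find("div", class_="p13n-sc-truncate")
        print(title.get_text(strip=True) if title else None)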

Web scraping cnbc.com

I am trying to scrape this page with bs4, and I was wondering how I can scrape the EUR/USD rate, the price change, and the price %.
I am pretty new to this, so this is all I have so far:
import requests
from bs4 import BeautifulSoup

url = 'http://www.cnbc.com/pre-markets/'
source_code = requests.get(url).text
soup = BeautifulSoup(source_code, 'lxml')

for r in soup.find_all('td', {'class': 'first text'}):
    print(r)
The data you're looking for is probably loaded with JavaScript, and therefore you can't see it with bs4. But you can get it using a headless browser like PhantomJS, Selenium, or Splash. See also this response: scraping dynamic updates of temperature sensor data from a website
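For instance, a short headless Selenium sketch (the td.first.text selector is the one from the question's own code; whether CNBC still uses it is an assumption to verify):
from selenium import webdriver
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
options.add_argument("--headless")               # run without opening a browser window
driver = webdriver.Chrome(options=options)
driver.get("http://www.cnbc.com/pre-markets/")   # JavaScript fills in the quote tables here

# same cells the original code targeted, but read from the rendered DOM
for cell in driver.find_elements(By.CSS_SELECTOR, "td.first.text"):
    print(cell.text)

driver.quit()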

Requests Only Returns Partial Results

I'm trying to scrape the links for the top 10 articles on Medium each day. By the looks of it, all the article links are in the class "postArticle-content", but when I run this code, I only get the top 3. Is there a way to get all 10?
from bs4 import BeautifulSoup
import requests

r = requests.get("https://medium.com/browse/726a53df8c8b")
data = r.text
soup = BeautifulSoup(data, 'html.parser')

data = soup.findAll('div', attrs={'class' : 'postArticle-content'})
for div in data:
    links = div.findAll('a')
    for link in links:
        print(link.get('href'))
requests gave you the entire result.
That page contains only the first three. The website's design is to use JavaScript code, running in the browser, to load additional content and add it to the page.
You need an entire web browser, with a JavaScript engine, to do what you are trying to do. The requests and beautiful-soup libraries are not a web browser. They are merely an implementation of the HTTP protocol and an HTML parser, respectively.
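A hedged sketch of that idea with Selenium (the postArticle-content class comes from the question; how many scrolls are needed before all ten articles have loaded is an assumption):
import time
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://medium.com/browse/726a53df8c8b")

# scroll a few times so the page's JavaScript loads the remaining articles
for _ in range(3):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)

for link in driver.find_elements(By.CSS_SELECTOR, "div.postArticle-content a"):
    href = link.get_attribute("href")
    if href:
        print(href)

driver.quit()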
