I am trying to scrape the Advances/Declines from NSE website - https://www1.nseindia.com/live_market/dynaContent/live_market.htm
Advances/Declines appears in an HTML table, but I am not able to retrieve the actual numerical values that are displayed on the site.
from bs4 import BeautifulSoup
import pandas as pd
import requests

url = "https://www1.nseindia.com/live_market/dynaContent/live_market.htm"
webpage = requests.get(url)
soup = BeautifulSoup(webpage.content, "html.parser")
for tr in soup.find_all('tr'):
    advance = tr.find_all('td')
    print(advance)
I am only able to get an empty value or None. I am not sure what I am doing wrong. When I inspect the element on the website, I see the numerical values 978 and 904, but in Spyder the values in these elements are displayed as a hyphen. Can someone please help?
This page uses JavaScript to load this information, but requests/BeautifulSoup can't run JavaScript.
Using DevTools in Chrome/Firefox (Network tab, filter XHR) I found the URL that the JavaScript uses to load it as JSON data, so I don't even have to use BeautifulSoup to get it.
import requests

url = 'https://www1.nseindia.com/live_market/dynaContent/live_analysis/changePercentage.json'
# The endpoint returns nothing without a User-Agent header
r = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
data = r.json()

print(data['rows'][0]['advances'])
print(data['rows'][0]['declines'])
print(data['rows'][0]['unchanged'])
print(data['rows'][0]['total'])
BTW: the server doesn't send the data without a User-Agent header.
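Since the question already imports pandas, here is a sketch that loads every row of the same JSON into a DataFrame; it assumes the other rows carry the same keys as the first row printed above:

import pandas as pd
import requests

url = 'https://www1.nseindia.com/live_market/dynaContent/live_analysis/changePercentage.json'
r = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
data = r.json()

# One DataFrame row per entry in the JSON 'rows' list
df = pd.DataFrame(data['rows'])
print(df[['advances', 'declines', 'unchanged', 'total']])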
Related
I want to scrape the Korean exchange rate from this website: http://www.smbs.biz/ExRate/StdExRate.jsp
The daily exchange rate is provided in a table, so I tried to scrape it using BeautifulSoup, but the responses are empty.
url = "http://www.smbs.biz/ExRate/StdExRate.jsp"
html = requests.get(url, verify=False).text
soup = BeautifulSoup(html, 'html.parser')
title = soup.select_one('#frm_SearchDate > div:nth-child(17) > table')
title.text
Result:
'\n일별 매매기준율\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n'
First of all, always take a look at your soup to see if all the expected ingredients are there.
The data is loaded via an XHR request and the table is rendered dynamically by JavaScript. That is why you won't get the table with BeautifulSoup: it cannot find it in the response to your request.
There are options to get it anyway:
Check your browser's dev tools on the XHR tab to locate the API and pull part of the info from there (see the example below).
Use selenium to get driver.page_source with the whole table from the 'browser-like' rendered version of the website (see the sketch after the example's output).
Example
import requests
from bs4 import BeautifulSoup

url = 'http://www.smbs.biz/ExRate/StdExRate_xml.jsp?arr_value=USD_2023-01-12_2023-02-03'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')
# Each <set> element carries one day's rate in its label/value attributes
rates = {s.get('label'): s.get('value') for s in soup.select('set')}
print(rates)
Output
{'23.01.12': '1245.3',
'23.01.13': '1244.6',
'23.01.16': '1240.6',
'23.01.17': '1234',
'23.01.18': '1238.5',
'23.01.19': '1239.8',
'23.01.20': '1236',
'23.01.25': '1234.4',
'23.01.26': '1233.4',
'23.01.27': '1231.4',
'23.01.30': '1230.2',
'23.01.31': '1228.7',
'23.02.01': '1230.8',
'23.02.02': '1231.4',
'23.02.03': '1219.3'}
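And for the second option, a minimal selenium sketch, assuming Chrome and a matching chromedriver are available; the selector is copied from the question and may need adjusting:

import time

from bs4 import BeautifulSoup
from selenium import webdriver

url = 'http://www.smbs.biz/ExRate/StdExRate.jsp'
driver = webdriver.Chrome()  # assumes Chrome + chromedriver are installed
driver.get(url)
time.sleep(3)  # crude wait for the JavaScript render; WebDriverWait is more robust
soup = BeautifulSoup(driver.page_source, 'html.parser')
# Selector taken from the question; verify it against the rendered page
table = soup.select_one('#frm_SearchDate > div:nth-child(17) > table')
print(table.text if table else 'table not found')
driver.quit()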
I am new to web scraping and I am trying to scrape wind data from a website. Here is the website: https://wx.ikitesurf.com/spot/507.
I understand that I can do this using selenium to find elements, but I think I may have found a better way; please correct me if I am wrong. In developer tools I can find this request by going to Network -> JS -> getGraph:
https://api.weatherflow.com/wxengine/rest/graph/getGraph?callback=jQuery17200020271765600428093_1619158293267&units_wind=mph&units_temp=f&units_distance=mi&fields=wind&format=json&null_ob_min_from_now=60&show_virtual_obs=true&spot_id=507&time_start_offset_hours=-36&time_end_offset_hours=0&type=dataonly&model_ids=-101&wf_token=3a648ec44797cbf12aca8ebc6c538868&_=1619158293881
This page contains all the data I need and it is constantly updating. Here is my code:
import time

import requests
from bs4 import BeautifulSoup

url = 'https://api.weatherflow.com/wxengine/rest/graph/getGraph?callback=jQuery17200020271765600428093_1619158293267&units_wind=mph&units_temp=f&units_distance=mi&fields=wind&format=json&null_ob_min_from_now=60&show_virtual_obs=true&spot_id=507&time_start_offset_hours=-36&time_end_offset_hours=0&type=dataonly&model_ids=-101&wf_token=3a648ec44797cbf12aca8ebc6c538868&_=1619158293881'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
time.sleep(3)
wind = soup.find("last_ob_wind_desc")
print(wind)
I tried using BeautifulSoup to scrape, but I always receive the answer None. Does anyone know how I can scrape this page, and what I am doing wrong? Thanks for any help!
Removing callback=jQuery17200020271765600428093_1619158293267& from the API URL will make it return proper JSON:
import requests

# Same endpoint as before, minus the JSONP callback parameter
url = 'https://api.weatherflow.com/wxengine/rest/graph/getGraph?units_wind=mph&units_temp=f&units_distance=mi&fields=wind&format=json&null_ob_min_from_now=60&show_virtual_obs=true&spot_id=507&time_start_offset_hours=-36&time_end_offset_hours=0&type=dataonly&model_ids=-101&wf_token=3a648ec44797cbf12aca8ebc6c538868&_=1619158293881'
response = requests.get(url).json()
response is now a dictionary with the data. last_ob_wind_desc can be retrieved with response['last_ob_wind_desc'].
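For example, a quick sketch to confirm what the payload contains before digging further (only last_ob_wind_desc is taken from the question; the other keys depend on the API):

# List every top-level key the API returned, then pull the wind description
print(sorted(response.keys()))
print(response['last_ob_wind_desc'])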
You can also save the data to csv or other file formats with pandas:
import pandas as pd
df = pd.json_normalize(response)
df.to_csv('filename.csv')
I want to get the ticker values from this webpage: https://www.oslobors.no/markedsaktivitet/#/list/shares/quotelist/ob/all/all/false
However, when using BeautifulSoup I don't seem to get all the content, and I don't quite understand how to change my code in order to achieve my goal.
import urllib3
from bs4 import BeautifulSoup

def oslobors():
    http = urllib3.PoolManager()
    url = 'https://www.oslobors.no/markedsaktivitet/#/list/shares/quotelist/ob/all/all/false'
    response = http.request('GET', url)
    soup = BeautifulSoup(response.data, "html.parser")
    print(soup)
    return

print(oslobors())
The content you want to parse is generated dynamically. You can either use a browser simulator like selenium, or you can try the URL below, which contains the JSON response. The following is the easy way to go.
import requests

url = 'https://www.oslobors.no/ob/servlets/components?type=table&generators%5B0%5D%5Bsource%5D=feed.ob.quotes.EQUITIES%2BPCC&generators%5B1%5D%5Bsource%5D=feed.merk.quotes.EQUITIES%2BPCC&filter=&view=DELAYED&columns=PERIOD%2C+INSTRUMENT_TYPE%2C+TRADE_TIME%2C+ITEM_SECTOR%2C+ITEM%2C+LONG_NAME%2C+BID%2C+ASK%2C+LASTNZ_DIV%2C+CLOSE_LAST_TRADED%2C+CHANGE_PCT_SLACK%2C+TURNOVER_TOTAL%2C+TRADES_COUNT_TOTAL%2C+MARKET_CAP%2C+HAS_LIQUIDITY_PROVIDER%2C+PERIOD%2C+MIC%2C+GICS_CODE_LEVEL_1%2C+TIME%2C+VOLUME_TOTAL&channel=a66b1ba745886f611af56cec74115a51'
res = requests.get(url)
for ticker in res.json()['rows']:
    ticker_name = ticker['values']['ITEM']
    print(ticker_name)
The results you get may look like this (partial):
APP
HEX
APCL
ODFB
SAS NOK
WWI
ASC
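If you need more than the symbol, the other columns named in the request URL (LONG_NAME, CLOSE_LAST_TRADED, and so on) should live in the same 'values' dict; a sketch, assuming the response mirrors those column names:

for ticker in res.json()['rows']:
    values = ticker['values']
    # LONG_NAME and CLOSE_LAST_TRADED are assumed from the 'columns' parameter in the URL
    print(values['ITEM'], values.get('LONG_NAME'), values.get('CLOSE_LAST_TRADED'))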
I am writing a website scraper that saves all the cryptocurrency names from a table on a website. I wrote a script to get the response of the webpage and then parse it into an HTML object using the BeautifulSoup library. The issue is that the response does not return the complete content of the webpage: it displays data from a certain position in the table and skips the data above it.
When I debug the code, the response object has all the data from the webpage, but when I print the data it only shows data from a certain point in the page.
Here is the code:
import requests
from bs4 import BeautifulSoup
response = requests.get("https://coinmarketcap.com/all/views/all", headers={'User-Agent': 'Mozilla/5.0'})
print(response.text)
soup = BeautifulSoup(response.text, 'html.parser')
results = soup.find_all('table', attrs={'id': 'currencies-all'})
It would be really helpful if someone could tell me what I am doing wrong because I am unable to find out the issue.
Is it possible that you are hitting the buffer limit of your IDE's console?
On Spyder, the default is 500 lines, and you will only see 500 lines of the source code as a result. Try increasing this limit to see if that solves your issue.
On Spyder (Windows), it's Tools > Preferences > IPython Console > Buffer (at the bottom).
I increased my buffer to 4000 and it still wasn't enough to fit the entire page but it did reveal more lines.
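Alternatively, sidestep the console buffer entirely by writing the response to a file you can open in an editor:

# Dump the full page source to disk instead of printing it to the console
with open('page_source.html', 'w', encoding='utf-8') as f:
    f.write(response.text)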
You are missing one thing here. The table rows are nested within the table tag, so you need to first extract the table body and then the table rows.
I use the 'lxml' parser.
import requests
from bs4 import BeautifulSoup

response = requests.get("https://coinmarketcap.com/all/views/all", headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(response.text, 'lxml')
results = soup.find('tbody')
# Each symbol sits in a <td class="text-left col-symbol"> cell
curr_symbols = [x.text for x in results.find_all('td', attrs={'class': 'text-left col-symbol'})]
print(curr_symbols)
print(len(curr_symbols))  # 1878
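As an alternative, since this table really is in the static HTML, pandas can parse it in one call; a sketch, targeting the table id from the question (read_html needs lxml or html5lib installed):

import pandas as pd
import requests

response = requests.get("https://coinmarketcap.com/all/views/all",
                        headers={'User-Agent': 'Mozilla/5.0'})
# read_html returns one DataFrame per matching <table> element
tables = pd.read_html(response.text, attrs={'id': 'currencies-all'})
print(tables[0].head())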
I'm trying to pull information from the 'Key Statistics' page for a ticker in Yahoo (since this isn't supported in the Pandas library).
Example for AAPL:
from bs4 import BeautifulSoup
import requests
url = 'http://finance.yahoo.com/quote/AAPL/key-statistics?p=AAPL'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'lxml')
enterpriseValue = soup.findAll('$ENTERPRISE_VALUE', attrs={'class': 'yfnc_tablehead1'}) #HTML tag for where enterprise value is located
print(enterpriseValue)
Edit: thanks Andy!
Question: This is printing an empty array. How do I change my findAll to return 598.56B?
Well, the reason the list that find_all returns is empty is that the data is loaded by a separate call that isn't triggered just by sending a GET request to that URL. If you look through the Network tab in Chrome/Firefox, filter by XHR, and examine the requests and responses of each network action, you can find the URL you ought to send the GET request to.
In this case, it's https://query2.finance.yahoo.com/v10/finance/quoteSummary/AAPL?formatted=true&crumb=8ldhetOu7RJ&lang=en-US&region=US&modules=defaultKeyStatistics%2CfinancialData%2CcalendarEvents&corsDomain=finance.yahoo.com.
So, how do we recreate this? Simple:
import requests

r = requests.get('https://query2.finance.yahoo.com/v10/finance/quoteSummary/AAPL?formatted=true&crumb=8ldhetOu7RJ&lang=en-US&region=US&modules=defaultKeyStatistics%2CfinancialData%2CcalendarEvents&corsDomain=finance.yahoo.com')
data = r.json()
This will return the JSON response as a dict. From there, navigate through the dict until you find the data you're after:
financial_data = data['quoteSummary']['result'][0]['defaultKeyStatistics']
enterprise_value_dict = financial_data['enterpriseValue']
print(enterprise_value_dict)
>>> {'fmt': '598.56B', 'raw': 598563094528, 'longFmt': '598,563,094,528'}
print(enterprise_value_dict['fmt'])
>>> '598.56B'
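As the dict above shows, the same entry also carries a machine-friendly number alongside the formatted string:

# 'raw' is the unformatted integer, 'longFmt' the comma-grouped string
print(enterprise_value_dict['raw'])      # 598563094528
print(enterprise_value_dict['longFmt'])  # '598,563,094,528'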