I am trying to webscrape a government site that uses frameset.
Here is the URL - https://lakecounty.in.gov/departments/voters/election-results-c/2022GeneralElectionResults/index.htm
I've tried using splinter/selenium
url = "https://lakecounty.in.gov/departments/voters/election-results-c/2022GeneralElectionResults/index.htm"
browser.visit(url)
time.sleep(10)
full_xpath_frame = '/html/frameset/frameset/frame[2]'
tree = browser.find_by_xpath(full_xpath_frame)
for i in tree:
print(i.text)
It just returns an empty string.
I've tried using the requests library.
import requests
from lxml import HTML
url = "https://lakecounty.in.gov/departments/voters/election-results-c/2022GeneralElectionResults/index.htm"
# get response object
response = requests.get(url)
# get byte string
data = response.content
print(data)
And it returns this
b"<html>\r\n<head>\r\n<meta http-equiv='Content-Type'\r\ncontent='text/html; charset=iso-
8859-1'>\r\n<title>Lake_ County Election Results</title>\r\n</head>\r\n<FRAMESET rows='20%,
*'>\r\n<FRAME src='titlebar.htm' scrolling='no'>\r\n<FRAMESET cols='20%, *'>\r\n<FRAME
src='menu.htm'>\r\n<FRAME src='Lake_ElecSumm_all.htm' name='reports'>\r\n</FRAMESET>
\r\n</FRAMESET>\r\n<body>\r\n</body>\r\n</html>\r\n"
I've also tried using beautiful soup and it gave me the same thing. Is there another python library I can use in order to get the data that's inside the second table?
Thank you for any feedback.
As mentioned you could go for the frames and its src:
BeautifulSoup(r.text).select('frame')[1].get('src')
or directly to the menu.htm:
import requests
from bs4 import BeautifulSoup
r = requests.get('https://lakecounty.in.gov/departments/voters/election-results-c/2022GeneralElectionResults/menu.htm')
link_list = ['https://lakecounty.in.gov/departments/voters/election-results-c/2022GeneralElectionResults'+a.get('href') for a in BeautifulSoup(r.text).select('a')]
for link in link_list[:1]:
r = requests.get(link)
soup = BeautifulSoup(r.text)
###...scrape what is needed
I am trying to scrape the Advances/Declines from NSE website - https://www1.nseindia.com/live_market/dynaContent/live_market.htm
Advances/Declines is in tabular format in the HTML. But I am not able to retrieve the actual numerical value that is displayed in the site.
from bs4 import BeautifulSoup
import pandas as pd
import requests
url = "https://www1.nseindia.com/live_market/dynaContent/live_market.htm"
webpage = requests.get(url);
soup = BeautifulSoup(webpage.content, "html.parser");
for tr in soup.find_all('tr'):
advance = tr.find_all('td')
print(advance)
I am only able to get an empty value or NONE. I am not sure what I am doing wrong. When I inspect the element in the website, I see the numerical values 978, 904 but in Spyder, the values in these elements are displayed with a hyphen. Can someone please help?
This page uses JavaScript to load these information but requests/BeautifulSoup can't run JavaScript.
Using DevTools in Chrome/Firefox (tab Network, filter xhr) I found url used by JavaScript to load it as JSON data so I don't have to even use BeautifulSoup to get it.
import requests
url = 'https://www1.nseindia.com/live_market/dynaContent/live_analysis/changePercentage.json'
r = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
data = r.json()
print(data['rows'][0]['advances'])
print(data['rows'][0]['declines'])
print(data['rows'][0]['unchanged'])
print(data['rows'][0]['total'])
BTW: It doesn't send data without User-Agent
I'm trying to obtain the latest quarter's operating income/loss from a quarterly filling.
Desired output highlighted in green: financial statement
Here's the URL of the document that I'm trying to scrape: https://www.sec.gov/ix?doc=/Archives/edgar/data/320193/000032019319000076/a10-qq320196292019.htm
If you'd like to see the data point visually, it is in PART I, Item 1. Financial Statements, Operating income.
The HTML code for the figure that I'm trying to get:
<ix:nonfraction id="fact-identifier-125" name="us-gaap:OperatingIncomeLoss" contextref="FD2019Q3QTD" unitref="usd" decimals="-6" scale="6" format="ixt:numdotdecimal" data-original-id="d305292495e1903-wk-Fact-6250FB76089207E7F73CB52756E0D8D0" continued-taxonomy="false" enabled-taxonomy="true" highlight-taxonomy="false" selected-taxonomy="false" hover-taxonomy="false" onclick="Taxonomies.clickEvent(event, this)" onkeyup="Taxonomies.clickEvent(event, this)" onmouseenter="Taxonomies.enterElement(event, this);" onmouseleave="Taxonomies.leaveElement(event, this);" tabindex="18" isadditionalitemsonly="false">11,544</ix:nonfraction>
The code that I used to obtain this data point (11,544).:
from bs4 import BeautifulSoup
import requests
url = 'https://www.sec.gov/ix?doc=/Archives/edgar/data/320193/000032019319000076/a10-qq320196292019.htm'
response = requests.get(url)
content = BeautifulSoup(response.content, 'html.parser')
operatingincomeloss = content.find('ix:nonfraction', attrs={"name": "us-gaap:OperatingIncomeLoss", "contextref":"FD2019Q3QTD"})
print (operatingincomeloss)
I also tried with
operatingincomeloss = content.find('ix:nonfraction', attrs={"name": "us-gaap:OperatingIncomeLoss"}
Eventually, I want to loop through all the relevant fillings to pull this data point. Currently, I'm just getting None. When I CTRl+F through content, I can't find the ix:nonfraction tag as well.
Page is loaded via JavaScript, I've attached the XHR request made and extracted the data required.
import requests
from bs4 import BeautifulSoup
r = requests.get(
"https://www.sec.gov/Archives/edgar/data/320193/000032019319000076/a10-qq320196292019.htm")
soup = BeautifulSoup(r.text, 'html.parser')
for item in soup.select("#d305292495e1903-wk-Fact-6250FB76089207E7F73CB52756E0D8D0"):
print(item.text)
Output:
11,544
Updated:
import requests
from bs4 import BeautifulSoup
r = requests.get(
"https://www.sec.gov/Archives/edgar/data/320193/000032019319000076/a10-qq320196292019.htm")
soup = BeautifulSoup(r.text, 'html.parser')
for item in soup.findAll("ix:nonfraction", {'contextref': 'FD2019Q3QTD', 'name': 'us-gaap:OperatingIncomeLoss'}):
print(item.text)
As #αԋɱҽԃ αмєяιcαη said, the page is loaded via JavaScript.
I have used the xhr request for this code.
Considering the attributes you have used, I have taken name attribute only, as contextref changes for each element.
You could also change the name attribute if you want to loop through other elements.
As you said you want to loop through this tag, I have printed all the output returning in the code below.
Code:
import requests
from bs4 import BeautifulSoup
res = requests.get('https://www.sec.gov/Archives/edgar/data/320193/000032019319000076/a10-qq320196292019.htm')
soup = BeautifulSoup(res.text, 'html.parser')
for data in soup.find_all('ix:nonfraction', {'name': 'us-gaap:OperatingIncomeLoss'}):
print(data.text)
Output:
11,544
12,612
48,305
54,780
7,442
7,496
26,329
26,580
3,687
3,892
14,371
15,044
3,221
3,414
12,142
15,285
1,795
1,765
7,199
7,193
1,155
1,127
4,811
4,980
17,300
17,694
64,852
69,082
11,544
12,612
48,305
54,780
Trying get a table from the website SGX.
The page is saved to local drive and I am using BeautifulSoup to parse it:
soup = BeautifulSoup(open(pages), "lxml")
soup.prettify()
list_0 = soup.find_all('table')[0]
print list_0
What it returned, is not the first row on the page:
[<tr><td>Zhongmin Baihui</td><td>5SR</td><td class="nowrap">09:44 AM</td><td class="nowrap">09:49 AM</td><td>0.615</td><td>0.675</td><td>0.555</td></tr>]
What's the right way to retrieve this table?
Thank you.
Data are being fetched after page loads using AJAX request, by inspecting the page you can find the API URL (the Url below), and then you can use something like that:
import pandas as pd
import requests
import json
response = requests.get('https://api.sgx.com/securities/v1.1?excludetypes=bonds¶ms=nc%2Cadjusted-vwap%2Cb%2Cbv%2Cp%2Cc%2Cchange_vs_pc%2Cchange_vs_pc_percentage%2Ccx%2Ccn%2Cdp%2Cdpc%2Cdu%2Ced%2Cfn%2Ch%2Ciiv%2Ciopv%2Clt%2Cl%2Co%2Cp_%2Cpv%2Cptd%2Cs%2Csv%2Ctrading_time%2Cv_%2Cv%2Cvl%2Cvwap%2Cvwap-currency')
data = json.loads(response.content)["data"]["prices"]
df = pd.DataFrame(data)
print(df)
If your requirement are complex and your crawling done in regular basis I recommend using scrapy.
I want to get the ticker values from this webpage https://www.oslobors.no/markedsaktivitet/#/list/shares/quotelist/ob/all/all/false
However when using Beautifulsoup I don't seem to get all the content, and I don't quite understand how to change my code in order to achieve my goal
import urllib3
from bs4 import BeautifulSoup
def oslobors():
http=urllib3.PoolManager()
url = 'https://www.oslobors.no/markedsaktivitet/#/list/shares/quotelist/ob/all/all/false'
response = http.request('GET', url)
soup=BeautifulSoup(response.data, "html.parser")
print(soup)
return
print(oslobors())
The content you wanna parse generates dynamically. You can either use any browser simulator like selenium or you can try the below url containing json response. The following is the easy way to go.
import requests
url = 'https://www.oslobors.no/ob/servlets/components?type=table&generators%5B0%5D%5Bsource%5D=feed.ob.quotes.EQUITIES%2BPCC&generators%5B1%5D%5Bsource%5D=feed.merk.quotes.EQUITIES%2BPCC&filter=&view=DELAYED&columns=PERIOD%2C+INSTRUMENT_TYPE%2C+TRADE_TIME%2C+ITEM_SECTOR%2C+ITEM%2C+LONG_NAME%2C+BID%2C+ASK%2C+LASTNZ_DIV%2C+CLOSE_LAST_TRADED%2C+CHANGE_PCT_SLACK%2C+TURNOVER_TOTAL%2C+TRADES_COUNT_TOTAL%2C+MARKET_CAP%2C+HAS_LIQUIDITY_PROVIDER%2C+PERIOD%2C+MIC%2C+GICS_CODE_LEVEL_1%2C+TIME%2C+VOLUME_TOTAL&channel=a66b1ba745886f611af56cec74115a51'
res = requests.get(url)
for ticker in res.json()['rows']:
ticker_name = ticker['values']['ITEM']
print(ticker_name)
Results you may get like (partial):
APP
HEX
APCL
ODFB
SAS NOK
WWI
ASC