I am trying to scrape MSFT's income statement using code I found here: How to Web scraping SEC Edgar 10-K Dynamic data
They use a 'span' tag to narrow the search. I do not see a span in this filing, so I am trying to match the <p> element instead, with no luck.
Here is my code; it is largely unchanged from the answer given. I changed the base_url and switched soup.find to look for 'p'. Is there a way to find that <p> element or, even better, a way to find the income statement table itself?
Here is the URL to the statement: https://www.sec.gov/Archives/edgar/data/789019/000156459018019062/msft-10k_20180630.htm
from bs4 import BeautifulSoup
import requests
headers = {"User-agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"}
# Obtain HTML for search page
base_url = "https://www.sec.gov/Archives/edgar/data/789019/000156459018019062/msft-10k_20180630.htm"
edgar_resp = requests.get(base_url, headers=headers)
edgar_str = edgar_resp.text
soup = BeautifulSoup(edgar_str, 'html.parser')
s = soup.find('p', recursive=True, string='INCOME STATEMENTS ')
t = s.find_next('table')
trs = t.find_all('tr')
for tr in trs:
    if tr.text:
        print(list(tr.stripped_strings))
Here is the code from the example:
from bs4 import BeautifulSoup
import requests
headers = {"User-agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"}
# Obtain HTML for search page
base_url = "https://www.sec.gov/Archives/edgar/data/200406/000020040621000057/jnj-20210704.htm"
edgar_resp = requests.get(base_url, headers=headers)
edgar_str = edgar_resp.text
soup = BeautifulSoup(edgar_str, 'html.parser')
s = soup.find('span', recursive=True, string='SALES BY SEGMENT OF BUSINESS ')
t = s.find_next('table')
trs = t.find_all('tr')
for tr in trs:
    if tr.text:
        print(list(tr.stripped_strings))
Thank you!
I'm not sure why that's not working, but you can try this:
s = soup.find('a', attrs={'name':'INCOME_STATEMENTS'})
This should match the <a name="INCOME_STATEMENTS"></a> element inside that paragraph.
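Putting that together with the rest of your script, here is a minimal sketch (the anchor name comes from the filing's HTML, and the first table after it is assumed to be the income statement):
from bs4 import BeautifulSoup
import requests

headers = {"User-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"}
base_url = "https://www.sec.gov/Archives/edgar/data/789019/000156459018019062/msft-10k_20180630.htm"
soup = BeautifulSoup(requests.get(base_url, headers=headers).text, 'html.parser')

# Find the named anchor instead of matching heading text, then take
# the first <table> that follows it in the document.
s = soup.find('a', attrs={'name': 'INCOME_STATEMENTS'})
if s is not None:
    t = s.find_next('table')
    for tr in t.find_all('tr'):
        if tr.text:
            print(list(tr.stripped_strings))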
Related
BeautifulSoup doesn't find any tag on this page. Does anyone know what the problem could be?
I can find elements on the page with selenium, but since I have a list of pages, I don’t want to use selenium.
import requests
from bs4 import BeautifulSoup
url = 'https://dzen.ru/news/story/VMoskovskoj_oblasti_zapushhen_chat-bot_ochastichnoj_mobilizacii--b093f9a22a32ed6731e4a4ca50545831?lang=ru&from=reg_portal&fan=1&stid=fOB6O7PV5zeCUlGyzvOO&t=1664886434&persistent_id=233765704&story=90139eae-79df-5de1-9124-0d830e4d59a5&issue_tld=ru'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'lxml')
soup.find_all('h1')
You can get the info on that page by adding headers to your request, mimicking what you can see in the Dev tools Network tab for the main request to that URL. Here is one way to get all links from that page:
import requests
from bs4 import BeautifulSoup as bs
headers = {
    'Cookie': 'sso_checked=1',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}
url = 'https://dzen.ru/news/story/VMoskovskoj_oblasti_zapushhen_chat-bot_ochastichnoj_mobilizacii--b093f9a22a32ed6731e4a4ca50545831?lang=ru&from=reg_portal&fan=1&stid=fOB6O7PV5zeCUlGyzvOO&t=1664886434&persistent_id=233765704&story=90139eae-79df-5de1-9124-0d830e4d59a5&issue_tld=ru'
r = requests.get(url, headers=headers)
soup = bs(r.text, 'html.parser')
links = [a.get('href') for a in soup.select('a')]
print(links)
Result printed in terminal:
['/news', 'https://dzen.ru/news', 'https://dzen.ru/news/region/moscow', 'https://dzen.ru/news/rubric/mobilizatsiya', 'https://dzen.ru/news/rubric/personal_feed', 'https://dzen.ru/news/rubric/politics', 'https://dzen.ru/news/rubric/society', 'https://dzen.ru/news/rubric/business', 'https://dzen.ru/news/rubric/world', 'https://dzen.ru/news/rubric/sport', 'https://dzen.ru/news/rubric/incident', 'https://dzen.ru/news/rubric/culture', 'https://dzen.ru/news/rubric/computers', 'https://dzen.ru/news/rubric/science', 'https://dzen.ru/news/rubric/auto', 'https://www.mosobl.kp.ru/online/news/4948743/?utm_source=yxnews&utm_medium=desktop', 'https://www.mosobl.kp.ru/online/news/4948743/?utm_source=yxnews&utm_medium=desktop', 'https://www.mosobl.kp.ru/online/news/4948743/?utm_source=yxnews&utm_medium=desktop', 'https://mosregtoday.ru/soc/v-podmoskove-zapustili-chat-bot-po-voprosam-chastichnoj-mobilizacii/?utm_source=yxnews&utm_medium=desktop', ...]
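With those same headers in place, your original h1 lookup should work too. A quick sketch, assuming the page keeps its current markup:
# Reuse the headers and url from above; without the headers the server
# returns a stub page with no content to parse.
r = requests.get(url, headers=headers)
soup = bs(r.text, 'html.parser')
for h1 in soup.find_all('h1'):
    print(h1.get_text(strip=True))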
This is the URL where I'm trying to extract the shipping price:
url = "https://www.amazon.com/AmazonBasics-Ultra-Soft-Micromink-Sherpa-Blanket/dp/B0843ZJGNP/ref=sr_1_1_sspa?dchild=1&keywords=amazonbasics&pd_rd_r=5cb1aaf8-d692-4abf-9131-ebd533ad5763&pd_rd_w=8Uw69&pd_rd_wg=kTKEB&pf_rd_p=9349ffb9-3aaa-476f-8532-6a4a5c3da3e7&pf_rd_r=PYFBYA98FS6B8BR7TGJD&qid=1623412994&sr=8-1-spons&psc=1&spLa=ZW5jcnlwdGVkUXVhbGlmaWVyPUEzM0xaSFIzVzFTUUpMJmVuY3J5cHRlZElkPUEwNzk3MjgzM1NQRlFQQkc4VFJGWSZlbmNyeXB0ZWRBZElkPUEwNzU1NzM0M0VMQ1hTNDJFTzYxQyZ3aWRnZXROYW1lPXNwX2F0ZiZhY3Rpb249Y2xpY2tSZWRpcmVjdCZkb05vdExvZ0NsaWNrPXRydWU="
My code is:
r = requests.get(url,headers=HEADERS,proxies=proxyDict)
soup = BeautifulSoup(r.content,'html.parser')
needle="$93.63"
#I also tried complete sentences
#"$93.63 Shipping & Import Fees Deposit to India"
#"$93.63 Shipping & Import Fees Deposit to India"
print(soup.find_all(text=needle))
#I also tried print(soup.find_all(text=re.compile(needle)))
But this always returns an empty list.
I can see the required text in inspect element, as well as in the downloaded soup that I printed to the console.
However, when I do the same thing with the actual product price ($27.99), soup.find_all() works as expected.
So far I haven't been able to figure out the problem here. Sorry for any silly mistakes.
Search the field, not the values.
import requests
from bs4 import BeautifulSoup
url = "https://www.amazon.com/AmazonBasics-Ultra-Soft-Micromink-Sherpa-Blanket/dp/B0843ZJGNP/ref=sr_1_1_sspa?dchild=1&keywords=amazonbasics&pd_rd_r=5cb1aaf8-d692-4abf-9131-ebd533ad5763&pd_rd_w=8Uw69&pd_rd_wg=kTKEB&pf_rd_p=9349ffb9-3aaa-476f-8532-6a4a5c3da3e7&pf_rd_r=PYFBYA98FS6B8BR7TGJD&qid=1623412994&sr=8-1-spons&psc=1&spLa=ZW5jcnlwdGVkUXVhbGlmaWVyPUEzM0xaSFIzVzFTUUpMJmVuY3J5cHRlZElkPUEwNzk3MjgzM1NQRlFQQkc4VFJGWSZlbmNyeXB0ZWRBZElkPUEwNzU1NzM0M0VMQ1hTNDJFTzYxQyZ3aWRnZXROYW1lPXNwX2F0ZiZhY3Rpb249Y2xpY2tSZWRpcmVjdCZkb05vdExvZ0NsaWNrPXRydWU="
HEADERS = ({'User-Agent':
            'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
            'Accept-Language': 'en-US, en;q=0.5'})
r = requests.get(url, headers=HEADERS)
soup = BeautifulSoup(r.content,'html.parser')
value = soup.find("span", {"id" : "priceblock_ourprice"}).contents
print(value)
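The same approach should work for the shipping line: inspect it in dev tools and search by its id or class instead of its text. A sketch only; the id below is hypothetical, so substitute whatever the inspector actually shows:
# 'priceblock_shippingmessage' is a placeholder id, not a confirmed one;
# check the real attribute in dev tools before relying on it.
shipping = soup.find("span", {"id": "priceblock_shippingmessage"})
if shipping is not None:
    print(shipping.get_text(strip=True))
else:
    print("Shipping element not found; the id may differ or be set by JS")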
from bs4 import BeautifulSoup as bs
import requests
url = "https://www.amazon.com/AmazonBasics-Ultra-Soft-Micromink-Sherpa-Blanket/dp/B0843ZJGNP/ref=sr_1_1_sspa?dchild=1&keywords=amazonbasics&pd_rd_r=5cb1aaf8-d692-4abf-9131-ebd533ad5763&pd_rd_w=8Uw69&pd_rd_wg=kTKEB&pf_rd_p=9349ffb9-3aaa-476f-8532-6a4a5c3da3e7&pf_rd_r=PYFBYA98FS6B8BR7TGJD&qid=1623412994&sr=8-1-spons&psc=1&spLa=ZW5jcnlwdGVkUXVhbGlmaWVyPUEzM0xaSFIzVzFTUUpMJmVuY3J5cHRlZElkPUEwNzk3MjgzM1NQRlFQQkc4VFJGWSZlbmNyeXB0ZWRBZElkPUEwNzU1NzM0M0VMQ1hTNDJFTzYxQyZ3aWRnZXROYW1lPXNwX2F0ZiZhY3Rpb249Y2xpY2tSZWRpcmVjdCZkb05vdExvZ0NsaWNrPXRydWU="
soup = bs(requests.get(url).content, 'lxml').prettify()
print(soup)
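Printing the prettified soup lets you check whether the text you want is in the server response at all. Since prettify() returns a plain string here, a substring test is enough; if it fails, the value is injected by JavaScript and requests alone will never see it:
# soup is a string after .prettify(), so a simple membership test works
print("$93.63" in soup)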
I've recently started looking into purchasing some land, and I'm writing a little app to help me organize details in Jira/Confluence to help me keep track of who I've talked to and what I talked to them about in regards to each parcel of land individually.
So, I wrote this little scraper for landwatch(dot)com:
[url is just a listing on the website]
from bs4 import BeautifulSoup
import requests
def get_property_data(url):
    headers = ({'User-Agent':
                'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'})
    response = requests.get(url, headers=headers)  # Maybe request url with read more already gone
    soup = BeautifulSoup(response.text, 'html5lib')
    title = soup.find_all(class_='b442a')[0].text
    details = soup.find_all('p', class_='d19de')
    price = soup.find_all('div', class_='_260f0')[0].text
    deets = []
    for i in range(len(details)):
        if details[i].text != '':
            deets.append(details[i].text)
    detail = ''
    for i in deets:
        detail += '<p>' + i + '</p>'
    return [title, detail, price]
Everything works great except that the class d19de has a ton of values hidden behind the Read More button.
While Googling away at this, I discovered How to Scrape reviews with read more from Webpages using BeautifulSoup, however I either don't understand what they're doing well enough to implement it, or this just doesn't work anymore:
import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get("http://www.mouthshut.com/product-reviews/Lakeside-Chalet-Mumbai-reviews-925017044").text, "html.parser")
for title in soup.select("a[id^=ctl00_ctl00_ContentPlaceHolderFooter_ContentPlaceHolderBody_rptreviews_]"):
    items = title.get('href')
    if items:
        broth = BeautifulSoup(requests.get(items).text, "html.parser")
        for item in broth.select("div.user-review p.lnhgt"):
            print(item.text)
Any thoughts on how to bypass that Read More button? I'm really hoping to do this in BeautifulSoup, and not selenium.
Here's an example URL for testing: https://www.landwatch.com/huerfano-county-colorado-recreational-property-for-sale/pid/410454403
That data is present within a script tag. Here is an example of extracting that content, parsing with json, and outputting land description info as a list:
from bs4 import BeautifulSoup
import requests, json
url = 'https://www.landwatch.com/huerfano-county-colorado-recreational-property-for-sale/pid/410454403'
headers = ({'User-Agent':
            'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'})
response = requests.get(url, headers=headers) # Maybe request Url with read more already gone
soup = BeautifulSoup(response.text, 'html5lib')
all_data = json.loads(soup.select_one('[type="application/ld+json"]').string)
details = all_data['description'].split('\r\r')
You may wish to examine what else is in that script tag:
from pprint import pprint
pprint(all_data)
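Other listing details sit under the usual schema.org keys in the same JSON. A small sketch; the exact keys are an assumption about this page, so .get() guards against missing ones:
# These keys follow the common schema.org listing layout; they are
# assumptions, hence .get() rather than [] lookups.
print(all_data.get('name'))                    # listing title
print(all_data.get('offers', {}).get('price')) # asking price, if present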
I am trying to parse the table from the link. I tried:
from bs4 import BeautifulSoup
import requests
url = 'http://www.stats.gov.cn/tjsj/zxfb/201810/t20181015_1627579.html'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
for table in soup.find_all(class_='MsoNormalTable'):
    print(table)
But I can't get the table. Can you guide me on how to parse this table using Python?
Could you just do this? I can't read the language but it might be right.
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
url = 'http://www.stats.gov.cn/tjsj/zxfb/201810/t20181015_1627579.html'
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.text, 'lxml')
middleTable = soup.find('table', class_='MsoNormalTable')
rows = middleTable.findAll('tr')
for eachRow in rows:
    print(eachRow.text)
You can try:
soup.find_all("table", {"class": "MsoNormalTable"})
You should specify the tag, and to filter on an attribute you should pass it in a dictionary.
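If you want the table contents rather than the tags, pandas can parse the HTML for you. A sketch, assuming pandas is installed and reusing the page response fetched above:
import pandas as pd

# read_html returns a list of DataFrames, one per matching <table>;
# it raises ValueError if nothing matches the attrs filter.
tables = pd.read_html(page.text, attrs={'class': 'MsoNormalTable'})
print(tables[0])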
Hi, I have been trying to get the time data (hours, minutes, seconds) from this website: https://clockofeidolon.com. I tried to use BeautifulSoup to print the contents of the 'span class="big' tags, since the time information is kept there, and I have come up with this:
from bs4 import BeautifulSoup
from requests import Session
session = Session()
session.headers['user-agent'] = (
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
    'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/'
    '66.0.3359.181 Safari/537.36'
)
url = 'https://clockofeidolon.com'
response = session.get(url=url)
data = response.text
soup = BeautifulSoup(data, "html.parser")
spans = soup.find_all('<span class="big')
print([span.text for span in spans])
But the output only shows "[]" and nothing else. How would I go about printing the number in each of the 3 tags?
As mentioned, this can be achieved with selenium. Once you have the correct geckodriver installed, the following should get you on the right track:
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Firefox()
driver.get('https://clockofeidolon.com')
html = driver.page_source
soup = BeautifulSoup(html,'lxml')
spans = soup.find_all(class_='big-hour')
for span in spans:
    print(span.text)
driver.quit()
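Because the digits are filled in by JavaScript, it can also help to wait for them explicitly before reading the page source. A sketch using selenium's built-in waits:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get('https://clockofeidolon.com')
# Wait up to 10 seconds for a clock digit element to appear in the DOM
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, 'big-hour'))
)
soup = BeautifulSoup(driver.page_source, 'lxml')
print([span.text for span in soup.find_all(class_='big-hour')])
driver.quit()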