Hi, I am using BS4 to scrape SIC codes and descriptions from SEC EDGAR. I currently have the following code, which gets the SIC code, but I don't know how to scrape the description that appears in the snippet below (the same markup shows up in both the inspect-element view and the page source).
To be clear, the bit I want is "STATE COMMERCIAL BANKS" and "LABORATORY ANALYTICAL INSTRUMENTS".
https://www.sec.gov/cgi-bin/browse-edgar?CIK=866054&owner=exclude&action=getcompany&Find=Search
<div class="companyInfo">
<span class="companyName">COMMERCIAL NATIONAL FINANCIAL CORP /PA <acronym title="Central Index Key">CIK</acronym>#: 0000866054 (see all company filings)</span>
<p class="identInfo"><acronym title="Standard Industrial Code">SIC</acronym>: 6022 - STATE COMMERCIAL BANKS<br />State location: PA | State of Inc.: <strong>PA</strong> | Fiscal Year End: 1231<br />(Office of Finance)<br />Get <b>insider transactions</b> for this <b>issuer</b>.
import requests
from bs4 import BeautifulSoup

for cik_num in cik_num_list:  # cik_num_list is defined elsewhere in the poster's script
    try:
        url = r"https://www.sec.gov/cgi-bin/browse-edgar?CIK={}&owner=exclude&action=getcompany".format(cik_num)
        response = requests.get(url)
        soup = BeautifulSoup(response.content, 'html.parser')
        try:
            comp_name = soup.find_all('div', {'class':'companyInfo'})[0].find('span').text
            sic_code = soup.find_all('p', {'class':'identInfo'})[0].find('a').text
        except IndexError:  # except clauses truncated in the original post
            continue
    except requests.RequestException:
        continue
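One way is to grab the text node that follows the SIC link inside the identInfo paragraph and split off the leading code: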
import requests
from bs4 import BeautifulSoup
url = 'https://www.sec.gov/cgi-bin/browse-edgar?CIK=866054&owner=exclude&action=getcompany&Find=Search'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
sic_code_desc = soup.select_one('.identInfo').a.find_next_sibling(text=True).split(maxsplit=1)[-1]
print(sic_code_desc)
Prints:
STATE COMMERCIAL BANKS
For url = 'https://www.sec.gov/cgi-bin/browse-edgar?CIK=1090872&owner=exclude&action=getcompany&Find=Search' it prints:
LABORATORY ANALYTICAL INSTRUMENTS
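If you also want the numeric SIC code along with the description, a minimal sketch along the same lines (note: SEC EDGAR may reject automated requests that lack a descriptive User-Agent header; the contact string below is a hypothetical placeholder):

import requests
from bs4 import BeautifulSoup

url = 'https://www.sec.gov/cgi-bin/browse-edgar?CIK=866054&owner=exclude&action=getcompany&Find=Search'
# hypothetical placeholder contact -- EDGAR asks automated clients to identify themselves
headers = {'User-Agent': 'your-name your-email@example.com'}
soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')

info = soup.select_one('.identInfo')
sic_code = info.a.text  # the SIC number is the link text, e.g. '6022'
sic_desc = info.a.find_next_sibling(text=True).split(maxsplit=1)[-1]  # text node after the link
print(sic_code, sic_desc)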
I am trying to scrape data from https://eresearch.fidelity.com/eresearch/goto/markets_sectors/landing.jhtml. The goal is to get the latest 11 sectors' performance data from the US markets. But I cannot see the performance until I click on each sector. In other words, there is a link embedded behind each sector. I want a list of tuples, and each tuple should correspond to a sector and should contain the following data: the sector name, the amount the sector has moved, the market capitalization of the sector, the market weight of the sector, and a link to the fidelity page for that sector.
Below is the code I have so far. I got stuck on the part where I get the content of each sector: my code returns nothing at all. Please help! Thank you in advance.
import requests
from bs4 import BeautifulSoup

url = "https://eresearch.fidelity.com/eresearch/goto/markets_sectors/landing.jhtml"
req = requests.get(url)
soup = BeautifulSoup(req.content, "html.parser")

links_list = list()
next_page_link = soup.find_all("a", class_="heading1")
for link in next_page_link:
    next_page = "https://eresearch.fidelity.com" + link.get("href")
    links_list.append(next_page)

for item in links_list:
    soup2 = BeautifulSoup(requests.get(item).content, 'html.parser')
    print(soup2)
Try:
import requests
from bs4 import BeautifulSoup

url = "https://eresearch.fidelity.com/eresearch/goto/markets_sectors/landing.jhtml"
sector_url = "https://eresearch.fidelity.com/eresearch/markets_sectors/sectors/sectors_in_market.jhtml?tab=learn&sector={sector_id}"

soup = BeautifulSoup(requests.get(url).content, "html.parser")

print(
    "{:<30} {:<8} {:<8} {:<8} {}".format(
        "Sector name",
        "Moving",
        "MktCap",
        "MktWght",
        "Link",
    )
)

for a in soup.select("a.heading1"):
    sector_id = a["href"].split("=")[-1]
    u = sector_url.format(sector_id=sector_id)
    s = BeautifulSoup(requests.get(u).content, "html.parser")
    data = s.select("td:has(.timestamp) span:nth-of-type(1)")
    print(
        "{:<30} {:<8} {:<8} {:<8} {}".format(
            s.h1.text, *[d.text for d in data][:3], u
        )
    )
Prints:
Sector name                    Moving   MktCap   MktWght  Link
Communication Services         +1.78%   $6.70T   11.31%   https://eresearch.fidelity.com/eresearch/markets_sectors/sectors/sectors_in_market.jhtml?tab=learn&sector=50
Consumer Discretionary         +0.62%   $8.82T   12.32%   https://eresearch.fidelity.com/eresearch/markets_sectors/sectors/sectors_in_market.jhtml?tab=learn&sector=25
Consumer Staples               +0.26%   $4.41T   5.75%    https://eresearch.fidelity.com/eresearch/markets_sectors/sectors/sectors_in_market.jhtml?tab=learn&sector=30
Energy                         +3.30%   $2.83T   2.60%    https://eresearch.fidelity.com/eresearch/markets_sectors/sectors/sectors_in_market.jhtml?tab=learn&sector=10
Financials                     +1.59%   $8.79T   11.22%   https://eresearch.fidelity.com/eresearch/markets_sectors/sectors/sectors_in_market.jhtml?tab=learn&sector=40
Health Care                    +0.07%   $8.08T   13.29%   https://eresearch.fidelity.com/eresearch/markets_sectors/sectors/sectors_in_market.jhtml?tab=learn&sector=35
Industrials                    +1.41%   $5.72T   8.02%    https://eresearch.fidelity.com/eresearch/markets_sectors/sectors/sectors_in_market.jhtml?tab=learn&sector=20
Information Technology         +1.44%   $15.52T  28.04%   https://eresearch.fidelity.com/eresearch/markets_sectors/sectors/sectors_in_market.jhtml?tab=learn&sector=45
Materials                      +1.60%   $2.51T   2.46%    https://eresearch.fidelity.com/eresearch/markets_sectors/sectors/sectors_in_market.jhtml?tab=learn&sector=15
Real Estate                    +1.04%   $1.67T   2.58%    https://eresearch.fidelity.com/eresearch/markets_sectors/sectors/sectors_in_market.jhtml?tab=learn&sector=60
Utilities                      -0.04%   $1.56T   2.42%    https://eresearch.fidelity.com/eresearch/markets_sectors/sectors/sectors_in_market.jhtml?tab=learn&sector=55
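Since the question asked for a list of tuples, the same loop can collect them instead of printing (a sketch reusing sector_url and soup from the snippet above):

# build (name, move, market cap, market weight, link) tuples
sectors = []
for a in soup.select("a.heading1"):
    sector_id = a["href"].split("=")[-1]
    u = sector_url.format(sector_id=sector_id)
    s = BeautifulSoup(requests.get(u).content, "html.parser")
    data = [d.text for d in s.select("td:has(.timestamp) span:nth-of-type(1)")][:3]
    sectors.append((s.h1.text, *data, u))

print(sectors[0])  # e.g. ('Communication Services', '+1.78%', '$6.70T', '11.31%', 'https://...')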
I want to scrape this website using BeautifulSoup: first extract every link, then open them one by one. Once they are opened, I want to scrape the company name, its ticker, its stock exchange, and the multiple PDF links whenever they are available, and then write everything out to a CSV file.
To make that happen, I first tried this:
import requests
from bs4 import BeautifulSoup
import re
import time

source_code = requests.get('https://www.responsibilityreports.co.uk/Companies?a=#')
soup = BeautifulSoup(source_code.content, 'lxml')

data = []
links = []
base = 'https://www.responsibilityreports.co.uk'

for link in soup.find_all('a', href=True):
    data.append(str(link.get('href')))
    print(link)

try:
    for link in links:
        url = base + link
        req = requests.get(url)
        soup = BeautifulSoup(req.content, 'html.parser')
        for j in soup.find_all('a', href=True):
            print(j)
except:
    pass
As far as I know, this website doesn't forbid scrapers. But while the first loop actually gives me every link, I'm unable to open them, which keeps my scraper from moving on to the following tasks.
Thanks in advance!
You can use this example to iterate over all company links:
import requests
from bs4 import BeautifulSoup

url = "https://www.responsibilityreports.co.uk/Companies?a=#"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

links = [
    "https://www.responsibilityreports.co.uk" + a["href"]
    for a in soup.select('a[href^="/Company"]')
]

for link in links:
    soup = BeautifulSoup(requests.get(link).content, "html.parser")
    name = soup.select_one("h1").get_text(strip=True)
    ticker = soup.select_one(".ticker_name")
    if ticker:
        ticker = ticker.get_text(strip=True)
    else:
        ticker = "N/A"

    # extract other info...

    print(name)
    print(ticker)
    print(link)
    print("-" * 80)
Prints:
3i Group plc
III
https://www.responsibilityreports.co.uk/Company/3i-group-plc
--------------------------------------------------------------------------------
3M Corporation
MMM
https://www.responsibilityreports.co.uk/Company/3m-corporation
--------------------------------------------------------------------------------
AAON Inc.
AAON
https://www.responsibilityreports.co.uk/Company/aaon-inc
--------------------------------------------------------------------------------
ABB Ltd
ABB
https://www.responsibilityreports.co.uk/Company/abb-ltd
--------------------------------------------------------------------------------
Abbott Laboratories
ABT
https://www.responsibilityreports.co.uk/Company/abbott-laboratories
--------------------------------------------------------------------------------
Abbvie Inc
ABBV
https://www.responsibilityreports.co.uk/Company/abbvie-inc
--------------------------------------------------------------------------------
Abercrombie & Fitch
ANF
https://www.responsibilityreports.co.uk/Company/abercrombie-fitch
--------------------------------------------------------------------------------
ABM Industries, Inc.
ABM
https://www.responsibilityreports.co.uk/Company/abm-industries-inc
--------------------------------------------------------------------------------
Acadia Realty Trust
AKR
https://www.responsibilityreports.co.uk/Company/acadia-realty-trust
--------------------------------------------------------------------------------
Acciona
N/A
https://www.responsibilityreports.co.uk/Company/acciona
--------------------------------------------------------------------------------
ACCO Brands
ACCO
https://www.responsibilityreports.co.uk/Company/acco-brands
--------------------------------------------------------------------------------
...and so on.
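To cover the rest of the question (PDF links and a CSV file), the per-company loop can be extended along these lines; the href$=".pdf" selector is an assumption about how report links appear on each company page, not a verified one:

import csv
import requests
from bs4 import BeautifulSoup

# sketch: assumes `links` from the snippet above and that report links end in ".pdf" (unverified)
with open("companies.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "ticker", "link", "pdfs"])
    for link in links:
        soup = BeautifulSoup(requests.get(link).content, "html.parser")
        name = soup.select_one("h1").get_text(strip=True)
        ticker = soup.select_one(".ticker_name")
        ticker = ticker.get_text(strip=True) if ticker else "N/A"
        pdfs = [
            "https://www.responsibilityreports.co.uk" + a["href"]
            for a in soup.select('a[href$=".pdf"]')
        ]
        writer.writerow([name, ticker, link, " ".join(pdfs)])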
I want to scrape data from Yahoo News and Bing News pages. The data I want to scrape are the headlines and/or the text below the headlines (whatever can be scraped) and the dates (times) when they were posted.
I have written some code, but it does not return anything. The problem is with my URL, since I'm getting a 404 response.
Can you please help me with it?
This is the code for Bing:
from bs4 import BeautifulSoup
import requests
term = 'usa'
url = 'http://www.bing.com/news/q?s={}'.format(term)
response = requests.get(url)
print(response)
soup = BeautifulSoup(response.text, 'html.parser')
print(soup)
And this is for Yahoo:
term = 'usa'
url = 'http://news.search.yahoo.com/q?s={}'.format(term)
response = requests.get(url)
print(response)
soup = BeautifulSoup(response.text, 'html.parser')
print(soup)
Please help me generate these URLs and understand the logic behind them. I'm still a noob :)
Basically, your URLs are just wrong. The URLs you have to use are the same ones you see in the address bar when searching from a regular browser. Most search engines and aggregators use the q parameter for the search term. Most of the other parameters are usually not required (sometimes they are, e.g. for specifying the result page number).
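Rather than formatting the query into the URL by hand, you can also let requests build it with the params argument (a minimal sketch; it URL-encodes the term for you):

import requests

# requests assembles and URL-encodes the query string
response = requests.get('https://www.bing.com/news/search', params={'q': 'usa'})
print(response.url)          # https://www.bing.com/news/search?q=usa
print(response.status_code)  # expect 200 rather than 404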
Bing
from bs4 import BeautifulSoup
import requests
import re

term = 'usa'
url = 'https://www.bing.com/news/search?q={}'.format(term)
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

for news_card in soup.find_all('div', class_="news-card-body"):
    title = news_card.find('a', class_="title").text
    time = news_card.find(
        'span',
        attrs={'aria-label': re.compile(".*ago$")}
    ).text
    print("{} ({})".format(title, time))
Output
Jason Mohammed blitzkrieg sinks USA (17h)
USA Swimming held not liable by California jury in sexual abuse case (1d)
United States 4-1 Canada: USA secure payback in Nations League (1d)
USA always plays the Dalai Lama card in dealing with China, says Chinese Professor (1d)
...
Yahoo
from bs4 import BeautifulSoup
import requests

term = 'usa'
url = 'https://news.search.yahoo.com/search?q={}'.format(term)
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

for news_item in soup.find_all('div', class_='NewsArticle'):
    title = news_item.find('h4').text
    time = news_item.find('span', class_='fc-2nd').text
    # Clean time text
    time = time.replace('·', '').strip()
    print("{} ({})".format(title, time))
Output
USA Baseball will return to Arizona for second Olympic qualifying chance (52 minutes ago)
Prized White Sox prospect Andrew Vaughn wraps up stint with USA Baseball (28 minutes ago)
Mexico defeats USA in extras for Olympic berth (13 hours ago)
...
I want to grab the text from both <p> tags; how do I get that?
My code works for the first <p>, but I wasn't able to get the second <p>.
<p>
<a href="https://www.japantimes.co.jp/news/2019/03/19/world/crime-legal-world/emerging-online-threats-changing-homeland-securitys-role-merely-fighting-terrorism/">
Emerging online threats changing Homeland Security's role from merely fighting terrorism
</a>
</p>
</hgroup>
</header>
<p>
Homeland Security Secretary Kirstjen Nielsen said Monday that her department may have been founded to combat terrorism, but its mission is shifting to also confront emerging online threats.
China, Iran and other countries are mimicking the approach that Russia used to interfere in the U.S. ...
<a class="more_link" href="https://www.japantimes.co.jp/news/2019/03/19/world/crime-legal-world/emerging-online-threats-changing-homeland-securitys-role-merely-fighting-terrorism/">
<span class="icon-arrow-2">
</span>
</a>
</p>
My code is:
import ssl
import urllib.request
from bs4 import BeautifulSoup

ssl._create_default_https_context = ssl._create_unverified_context

article = "https://www.japantimes.co.jp/tag/cybersecurity/page/1/"
page = urllib.request.urlopen(article)
soup = BeautifulSoup(page, 'html.parser')

article = soup.find('div', class_="content_col")
date = article.h3.find('span', class_="right date")
date = date.text
headline = article.p.find('a')
headline = headline.text
content = article.p.text
print(date, headline, content)
Use the parent id with a p selector and index into the returned list for the required number of paragraphs. You can use the time tag for when it was posted:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://www.japantimes.co.jp/news/2019/03/19/world/crime-legal-world/emerging-online-threats-changing-homeland-securitys-role-merely-fighting-terrorism/#.XJIQNDj7TX4')
soup = bs(r.content, 'lxml')
posted = soup.select_one('time').text
print(posted)
paras = [item.text.strip() for item in soup.select('#jtarticle p')]
print(paras[:2])
You could use .find_next(). However, it doesn't get the full article:
from bs4 import BeautifulSoup
import requests
article = "https://www.japantimes.co.jp/tag/cybersecurity/page/1/"
page = requests.get(article)
soup = BeautifulSoup(page.text, 'html.parser')
article = soup.find('div', class_="content_col")
date = article.h3.find('span', class_= "right date")
date_text = date.text
headline = article.p.find('a')
headline_text = headline.text
content_text = article.p.find_next('p').text
print(date_text, headline_text ,content_text)
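If you want the headline and teaser for every article on the tag page rather than just the first one, a sketch along the same lines (it assumes every listing follows the same headline-<p> / teaser-<p> pattern shown in the question, which is unverified):

from bs4 import BeautifulSoup
import requests

page = requests.get("https://www.japantimes.co.jp/tag/cybersecurity/page/1/")
soup = BeautifulSoup(page.text, 'html.parser')
content = soup.find('div', class_="content_col")

# each listing's headline <p> wraps a link with text; the teaser is the next <p>
# (assumed pattern -- the icon-only more_link anchors are skipped by the text check)
for headline_p in content.find_all('p'):
    link = headline_p.find('a')
    if link is None or not link.text.strip():
        continue
    print(link.text.strip())
    print(headline_p.find_next('p').text.strip())
    print('-' * 40)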
I'm scraping a news article using BeautifulSoup trying to only return the text body of the article itself, not all the additional "noise". Is there any easy way to do this?
import bs4
import requests
url = 'https://www.cnn.com/2018/01/22/us/puerto-rico-privatizing-state-power-authority/index.html'
res = requests.get(url)
soup = bs4.BeautifulSoup(res.text,'html.parser')
element = soup.select_one('div.pg-rail-tall__body #body-text').text
print(element)
Trying to exclude some of the information returned such as
{CNN.VideoPlayer.handleUnmutePlayer = function
handleUnmutePlayer(containerId, dataObj) {'use strict';var
playerInstance,playerPropertyObj,rememberTime,unmuteCTA,unmuteIdSelector
= 'unmute_' +
The noise, as you call it, is the text in the <script>...</script> tags (JavaScript code). You can remove it using .extract() like:
for s in soup.find_all('script'):
    s.extract()  # removes the tag and its contents from the tree
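Note: .decompose() does the same removal but destroys the tag instead of returning it, which is fine here since the removed script content isn't needed.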
You can use this:
import requests
from bs4 import BeautifulSoup

r = requests.get('https://edition.cnn.com/2018/01/22/us/puerto-rico-privatizing-state-power-authority/index.html')
soup = BeautifulSoup(r.text, 'html.parser')
[x.extract() for x in soup.find_all('script')]  # does the same thing as the for-loop above

element = soup.find('div', class_='pg-rail-tall__body')
print(element.text)
Partial Output:
(CNN)Puerto Rico Gov. Ricardo Rosselló announced Monday that the
commonwealth will begin privatizing the Puerto Rico Electric Power
Authority, or PREPA. In comments published on Twitter, the governor
said the assets sale would transform the island's power generation
system into a "modern" and "efficient" one that would be less
expensive for citizens.He said the system operates "deficiently" and
that the improved infrastructure would respond more "agilely" to
natural disasters. The privatization process will begin "in the next
few days" and occur in three phases over the next 18 months, the
governor said.JUST WATCHEDSchool cheers as power returns after 112
daysReplayMore Videos ...MUST WATCHSchool cheers as power returns
after 112 days 00:48San Juan Mayor Carmen Yulin Cruz, known for her
criticisms of the Trump administration's response to Puerto Rico after
Hurricane Maria, spoke out against the move.Cruz, writing on her
official Twitter account, said PREPA's privatization would put the
commonwealth's economic development into "private hands" and that the
power authority will begin to "serve other interests.
Try this:
import bs4
import requests

url = 'https://www.cnn.com/2018/01/22/us/puerto-rico-privatizing-state-power-authority/index.html'
res = requests.get(url)
soup = bs4.BeautifulSoup(res.text, 'html.parser')

elementd = soup.find_all('div', {'class': 'zn-body__paragraph'})
elementp = soup.find_all('p', {'class': 'zn-body__paragraph'})

for i in elementp:
    print(i.text)
for i in elementd:
    print(i.text)
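A small refinement: a single CSS selector list matches both tag types and keeps the paragraphs in document order (same class assumption as above):

# one selector for both <div> and <p> paragraph blocks, in document order
for block in soup.select('div.zn-body__paragraph, p.zn-body__paragraph'):
    print(block.text)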