BeautifulSoup to scrape multiple links - python

I want to scrape this website using BeautifulSoup, by first extracting every link, then opening them one by one. Once they are opened, I want to scrape the company name, its ticker, the stock exchange, and the multiple PDF links whenever they are available, then write them all out to a CSV file afterwards.
To make it happen, I first tried it this way:
import requests
from bs4 import BeautifulSoup
import re
import time

source_code = requests.get('https://www.responsibilityreports.co.uk/Companies?a=#')
soup = BeautifulSoup(source_code.content, 'lxml')

data = []
links = []
base = 'https://www.responsibilityreports.co.uk'

for link in soup.find_all('a', href=True):
    data.append(str(link.get('href')))
    print(link)

try:
    for link in links:
        url = base + link
        req = requests.get(url)
        soup = BeautifulSoup(req.content, 'html.parser')
        for j in soup.find_all('a', href=True):
            print(j)
except:
    pass
As far as I know, this website doesn't forbid scrapers. But while it actually gives me every link, I'm unable to open them, which doesn't allow me to keep my scraper going for the following tasks.
Thanks in advance!

Note that your second loop never runs because the hrefs are appended to data while links stays empty. You can use this example to iterate over all the company links:
import requests
from bs4 import BeautifulSoup

url = "https://www.responsibilityreports.co.uk/Companies?a=#"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

links = [
    "https://www.responsibilityreports.co.uk" + a["href"]
    for a in soup.select('a[href^="/Company"]')
]

for link in links:
    soup = BeautifulSoup(requests.get(link).content, "html.parser")

    name = soup.select_one("h1").get_text(strip=True)

    ticker = soup.select_one(".ticker_name")
    if ticker:
        ticker = ticker.get_text(strip=True)
    else:
        ticker = "N/A"

    # extract other info...

    print(name)
    print(ticker)
    print(link)
    print("-" * 80)
Prints:
3i Group plc
III
https://www.responsibilityreports.co.uk/Company/3i-group-plc
--------------------------------------------------------------------------------
3M Corporation
MMM
https://www.responsibilityreports.co.uk/Company/3m-corporation
--------------------------------------------------------------------------------
AAON Inc.
AAON
https://www.responsibilityreports.co.uk/Company/aaon-inc
--------------------------------------------------------------------------------
ABB Ltd
ABB
https://www.responsibilityreports.co.uk/Company/abb-ltd
--------------------------------------------------------------------------------
Abbott Laboratories
ABT
https://www.responsibilityreports.co.uk/Company/abbott-laboratories
--------------------------------------------------------------------------------
Abbvie Inc
ABBV
https://www.responsibilityreports.co.uk/Company/abbvie-inc
--------------------------------------------------------------------------------
Abercrombie & Fitch
ANF
https://www.responsibilityreports.co.uk/Company/abercrombie-fitch
--------------------------------------------------------------------------------
ABM Industries, Inc.
ABM
https://www.responsibilityreports.co.uk/Company/abm-industries-inc
--------------------------------------------------------------------------------
Acadia Realty Trust
AKR
https://www.responsibilityreports.co.uk/Company/acadia-realty-trust
--------------------------------------------------------------------------------
Acciona
N/A
https://www.responsibilityreports.co.uk/Company/acciona
--------------------------------------------------------------------------------
ACCO Brands
ACCO
https://www.responsibilityreports.co.uk/Company/acco-brands
--------------------------------------------------------------------------------
...and so on.
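If you also need the PDF report links and the CSV output from your original plan, here is a minimal sketch of that extra step. It reuses the links list built above; the a[href$=".pdf"] selector and the column layout are my own assumptions, since the report pages may link their PDFs differently:
import csv
import requests
from bs4 import BeautifulSoup

base = "https://www.responsibilityreports.co.uk"

# `links` is the list of company page URLs collected above
with open("companies.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "ticker", "pdf_links", "url"])

    for link in links:
        soup = BeautifulSoup(requests.get(link).content, "html.parser")

        name = soup.select_one("h1").get_text(strip=True)
        ticker = soup.select_one(".ticker_name")
        ticker = ticker.get_text(strip=True) if ticker else "N/A"

        # assumption: the report PDFs are plain <a> tags whose href ends in .pdf
        pdf_links = [
            a["href"] if a["href"].startswith("http") else base + a["href"]
            for a in soup.select('a[href$=".pdf"]')
        ]

        writer.writerow([name, ticker, "; ".join(pdf_links), link])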

Related

How can we use Mozilla to Screen Scrape raw data from real estate listings?

I'm looking at this URL.
https://www.century21.com/real-estate/long-island-city-ny/LCNYLONGISLANDCITY/
I'm trying to get this text, in a structured format.
FOR SALE
$1,248,000
3 beds
2 baths
45-09 Skillman Avenue
Sunnyside NY 11104
Listed By CENTURY 21 Sunny Gardens Realty, Inc.
##########################################
FOR SALE
$1,390,000
5 beds
3 baths
2,200 sq. ft
47-35 39th Place
Sunnyside NY 11104
Courtesy Of Keller Williams Realty of Greater Nassau
Here's the sample code that I tried to hack together.
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd
from time import sleep

url = 'https://www.century21.com/real-estate/long-island-city-ny/LCNYLONGISLANDCITY/'
driver = webdriver.Chrome('C:\\Utility\\chromedriver.exe')
driver.get(url)
sleep(3)
content = driver.page_source
soup = BeautifulSoup(content, features='html.parser')

for element in soup.findAll('div', attrs={'class': 'infinite-item property-card clearfix property-card-C2183089596 initialized visited'}):
    #print(element)
    address = element.find('div', attrs={'class': 'property-card-primary-info'})
    print(address)
    price = element.find('a', attrs={'class': 'listing-price'})
    print(price)
When I run this, I get no addresses and no prices. Not sure why.
Web scraping is more of an art than a science. It's helpful to pull up the page source in Chrome or the browser of your choice so you can think about the DOM hierarchy and figure out how to get down into the elements that you need to scrape. Some websites are built very cleanly and this isn't too much work; others are thrown-together nonsense that is a nightmare to dig data out of.
This one, thankfully, is very clean.
The main problem with your loop is the selector: the class string you match on includes the listing-specific class property-card-C2183089596 (plus state classes like initialized and visited), so at best it matches a single card. Matching on the shared property-card class is enough.
This isn't perfect, but I think it will get you in the ballpark:
import requests
from bs4 import BeautifulSoup

url = 'https://www.century21.com/real-estate/long-island-city-ny/LCNYLONGISLANDCITY/'
page = requests.get(url)
soup = BeautifulSoup(page.content, features='html.parser')

for element in soup.findAll('div', attrs={'class': 'property-card'}):
    address = element.find('div', attrs={'class': 'property-card-primary-info'}).find('div', attrs={'class': 'property-address-info'})
    for address_item in address.children:
        print(address_item.get_text().strip())
    price = element.find('div', attrs={'class': 'property-card-primary-info'}).find('a', attrs={'class': 'listing-price'})
    print(price.get_text().strip())
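Since you already import pandas, here is a minimal sketch of collecting the same fields into a DataFrame instead of printing them. It uses the same class names as above via CSS selectors; the column names are my own:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://www.century21.com/real-estate/long-island-city-ny/LCNYLONGISLANDCITY/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

rows = []
for card in soup.select('div.property-card'):
    price = card.select_one('a.listing-price')
    address = card.select_one('.property-address-info')
    rows.append({
        # guard against cards that are missing either field
        'price': price.get_text(strip=True) if price else None,
        'address': ' '.join(address.get_text(separator=' ').split()) if address else None,
    })

df = pd.DataFrame(rows)
print(df)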

Web scraping using BeautifulSoup - link embedded behind the marked up text

I am trying to scrape data from https://eresearch.fidelity.com/eresearch/goto/markets_sectors/landing.jhtml. The goal is to get the latest performance data for the 11 sectors of the US markets. But I cannot see the performance until I click on each sector; in other words, there is a link embedded behind each sector. I want a list of tuples, and each tuple should correspond to a sector and should contain the following data: the sector name, the amount the sector has moved, the market capitalization of the sector, the market weight of the sector, and a link to the Fidelity page for that sector.
Below is the code I have so far. I got stuck on the part where I want to get the content of each sector. My code returns nothing at all. Please help! Thank you in advance.
import requests
from bs4 import BeautifulSoup

url = "https://eresearch.fidelity.com/eresearch/goto/markets_sectors/landing.jhtml"
req = requests.get(url)
soup = BeautifulSoup(req.content, "html.parser")

links_list = list()

next_page_link = soup.find_all("a", class_="heading1")
for link in next_page_link:
    next_page = "https://eresearch.fidelity.com" + link.get("href")
    links_list.append(next_page)

for item in links_list:
    soup2 = BeautifulSoup(requests.get(item).content, 'html.parser')
    print(soup2)
Try:
import requests
from bs4 import BeautifulSoup

url = "https://eresearch.fidelity.com/eresearch/goto/markets_sectors/landing.jhtml"
sector_url = "https://eresearch.fidelity.com/eresearch/markets_sectors/sectors/sectors_in_market.jhtml?tab=learn&sector={sector_id}"

soup = BeautifulSoup(requests.get(url).content, "html.parser")

print(
    "{:<30} {:<8} {:<8} {:<8} {}".format(
        "Sector name",
        "Moving",
        "MktCap",
        "MktWght",
        "Link",
    )
)

for a in soup.select("a.heading1"):
    sector_id = a["href"].split("=")[-1]
    u = sector_url.format(sector_id=sector_id)
    s = BeautifulSoup(requests.get(u).content, "html.parser")
    data = s.select("td:has(.timestamp) span:nth-of-type(1)")
    print(
        "{:<30} {:<8} {:<8} {:<8} {}".format(
            s.h1.text, *[d.text for d in data][:3], u
        )
    )
Prints:
Sector name Moving MktCap MktWght Link
Communication Services +1.78% $6.70T 11.31% https://eresearch.fidelity.com/eresearch/markets_sectors/sectors/sectors_in_market.jhtml?tab=learn&sector=50
Consumer Discretionary +0.62% $8.82T 12.32% https://eresearch.fidelity.com/eresearch/markets_sectors/sectors/sectors_in_market.jhtml?tab=learn&sector=25
Consumer Staples +0.26% $4.41T 5.75% https://eresearch.fidelity.com/eresearch/markets_sectors/sectors/sectors_in_market.jhtml?tab=learn&sector=30
Energy +3.30% $2.83T 2.60% https://eresearch.fidelity.com/eresearch/markets_sectors/sectors/sectors_in_market.jhtml?tab=learn&sector=10
Financials +1.59% $8.79T 11.22% https://eresearch.fidelity.com/eresearch/markets_sectors/sectors/sectors_in_market.jhtml?tab=learn&sector=40
Health Care +0.07% $8.08T 13.29% https://eresearch.fidelity.com/eresearch/markets_sectors/sectors/sectors_in_market.jhtml?tab=learn&sector=35
Industrials +1.41% $5.72T 8.02% https://eresearch.fidelity.com/eresearch/markets_sectors/sectors/sectors_in_market.jhtml?tab=learn&sector=20
Information Technology +1.44% $15.52T 28.04% https://eresearch.fidelity.com/eresearch/markets_sectors/sectors/sectors_in_market.jhtml?tab=learn&sector=45
Materials +1.60% $2.51T 2.46% https://eresearch.fidelity.com/eresearch/markets_sectors/sectors/sectors_in_market.jhtml?tab=learn&sector=15
Real Estate +1.04% $1.67T 2.58% https://eresearch.fidelity.com/eresearch/markets_sectors/sectors/sectors_in_market.jhtml?tab=learn&sector=60
Utilities -0.04% $1.56T 2.42% https://eresearch.fidelity.com/eresearch/markets_sectors/sectors/sectors_in_market.jhtml?tab=learn&sector=55
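And since you asked for a list of tuples rather than printed rows, this is a minimal sketch that collects the same values with the same selectors as above:
import requests
from bs4 import BeautifulSoup

url = "https://eresearch.fidelity.com/eresearch/goto/markets_sectors/landing.jhtml"
sector_url = "https://eresearch.fidelity.com/eresearch/markets_sectors/sectors/sectors_in_market.jhtml?tab=learn&sector={sector_id}"

soup = BeautifulSoup(requests.get(url).content, "html.parser")

sectors = []
for a in soup.select("a.heading1"):
    sector_id = a["href"].split("=")[-1]
    u = sector_url.format(sector_id=sector_id)
    s = BeautifulSoup(requests.get(u).content, "html.parser")

    # first three values: move, market cap, market weight (same selector as above)
    values = [d.text for d in s.select("td:has(.timestamp) span:nth-of-type(1)")][:3]
    sectors.append((s.h1.text, *values, u))

print(sectors)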

How would I scrape the sic code description?

Hi, I am using BS4 to scrape the SIC codes and descriptions. I currently have the following code, which does exactly what I want, but I don't know how to scrape the description shown below, both in the inspect-element view and in the view source.
To be clear, the bit I want is "State commercial banks" and "LABORATORY ANALYTICAL INSTRUMENTS".
https://www.sec.gov/cgi-bin/browse-edgar?CIK=866054&owner=exclude&action=getcompany&Find=Search
<div class="companyInfo">
<span class="companyName">COMMERCIAL NATIONAL FINANCIAL CORP /PA <acronym title="Central Index Key">CIK</acronym>#: 0000866054 (see all company filings)</span>
<p class="identInfo"><acronym title="Standard Industrial Code">SIC</acronym>: 6022 - STATE COMMERCIAL BANKS<br />State location: PA | State of Inc.: <strong>PA</strong> | Fiscal Year End: 1231<br />(Office of Finance)<br />Get <b>insider transactions</b> for this <b>issuer</b>.
for cik_num in cik_num_list:
    try:
        url = r"https://www.sec.gov/cgi-bin/browse-edgar?CIK={}&owner=exclude&action=getcompany".format(cik_num)
        response = requests.get(url)
        soup = BeautifulSoup(response.content, 'html.parser')
        try:
            comp_name = soup.find_all('div', {'class':'companyInfo'})[0].find('span').text
            sic_code = soup.find_all('p', {'class':'identInfo'})[0].find('a').text
import requests
from bs4 import BeautifulSoup
url = 'https://www.sec.gov/cgi-bin/browse-edgar?CIK=866054&owner=exclude&action=getcompany&Find=Search'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
sic_code_desc = soup.select_one('.identInfo').a.find_next_sibling(text=True).split(maxsplit=1)[-1]
print(sic_code_desc)
Prints:
STATE COMMERCIAL BANKS
For url = 'https://www.sec.gov/cgi-bin/browse-edgar?CIK=1090872&owner=exclude&action=getcompany&Find=Search' it prints:
LABORATORY ANALYTICAL INSTRUMENTS
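To plug that into your own loop over cik_num_list, a minimal sketch could look like this. It reuses your comp_name/sic_code lines; cik_num_list comes from your code, and the User-Agent header is a precaution because in my experience EDGAR can reject requests that don't declare one:
import requests
from bs4 import BeautifulSoup

# precaution: EDGAR may block requests without a declared User-Agent
headers = {'User-Agent': 'your-name your-email@example.com'}

results = []
for cik_num in cik_num_list:  # cik_num_list comes from your own code
    url = "https://www.sec.gov/cgi-bin/browse-edgar?CIK={}&owner=exclude&action=getcompany".format(cik_num)
    soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')

    comp_name = soup.find_all('div', {'class': 'companyInfo'})[0].find('span').text
    sic_code = soup.find_all('p', {'class': 'identInfo'})[0].find('a').text
    sic_desc = soup.select_one('.identInfo').a.find_next_sibling(text=True).split(maxsplit=1)[-1]

    results.append((cik_num, comp_name, sic_code, sic_desc))

print(results)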

I want to scrape urls of all titles using python

I wrote code to get all the title URLs, but I have an issue: it displays None values. Could you please help me out?
Here is my code:
import requests
from bs4 import BeautifulSoup
import csv

def get_page(url):
    response = requests.get(url)
    if not response.ok:
        print('server responded:', response.status_code)
    else:
        soup = BeautifulSoup(response.text, 'html.parser')  # 1. html, 2. parser
        return soup

def get_index_data(soup):
    try:
        titles_link = soup.find_all('div', class_="marginTopTextAdjuster")
    except:
        titles_link = []
    urls = [item.get('href') for item in titles_link]
    print(urls)

def main():
    #url = "http://cgsc.cdmhost.com/cdm/singleitem/collection/p4013coll8/id/2653/rec/1"
    mainurl = "http://cgsc.cdmhost.com/cdm/search/collection/p4013coll8/searchterm/1/field/all/mode/all/conn/and/order/nosort/page/1"
    #get_page(url)
    get_index_data(get_page(mainurl))
    #write_csv(data,url)

if __name__ == '__main__':
    main()
You are trying to get the href attribute of the div tag. Instead try selecting all the a tags. They seem to have a common class attribute body_link_11.
Use titles_link = soup.find_all('a',class_="body_link_11") instead of titles_link = soup.find_all('div',class_="marginTopTextAdjuster")
url = "http://cgsc.cdmhost.com/cdm/search/collection/p4013coll8/searchterm/1/field/all/mode/all/conn/and/order/nosort/page/1"
req = requests.get(url)
soup = BeautifulSoup(req.text, "lxml")
titles_link = []
titles_div = soup.find_all('div', attrs={'class': 'marginTopTextAdjuster'})
for link in titles_div:
tag = link.find_all('a', href=True)
try:
if tag[0].attrs.get('item_id', None):
titles_link.append({tag[0].text: tag[0].attrs.get('href', None)})
except IndexError:
continue
print(titles_link)
output:
[{'Civil Affairs Handbook, Japan, section 1a: population statistics.': '/cdm/singleitem/collection/p4013coll8/id/2653/rec/1'}, {'Army Air Forces Program 1943.': '/cdm/singleitem/collection/p4013coll8/id/2385/rec/2'}, {'Casualty report number II.': '/cdm/singleitem/collection/p4013coll8/id/3309/rec/3'}, {'Light armored division, proposed March 1943.': '/cdm/singleitem/collection/p4013coll8/id/2425/rec/4'}, {'Tentative troop list by type units for Blacklist operations.': '/cdm/singleitem/collection/p4013coll8/id/150/rec/5'}, {'Chemical Warfare Service: history of training, part 2, schooling of commissioned officers.': '/cdm/compoundobject/collection/p4013coll8/id/2501/rec/6'}, {'Horses in the German Army (1941-1945).': '/cdm/compoundobject/collection/p4013coll8/id/2495/rec/7'}, {'Unit history: 38 (MECZ) cavalry rcn. sq.': '/cdm/singleitem/collection/p4013coll8/id/3672/rec/8'}, {'Operations in France: December 1944, 714th Tank Battalion.': '/cdm/singleitem/collection/p4013coll8/id/3407/rec/9'}, {'G-3 Reports : Third Infantry Division. (22 Jan- 30 Mar 44)': '/cdm/singleitem/collection/p4013coll8/id/4393/rec/10'}, {'Summary of operations, 1 July thru 31 July 1944.': '/cdm/singleitem/collection/p4013coll8/id/3445/rec/11'}, {'After action report 36th Armored Infantry Regiment, 3rd Armored Division, Nov 1944 thru April 1945.': '/cdm/singleitem/collection/p4013coll8/id/3668/rec/12'}, {'Unit history, 38th Mechanized Cavalry Reconnaissance Squadron, 9604 thru 9665.': '/cdm/singleitem/collection/p4013coll8/id/3703/rec/13'}, {'Redeployment: occupation forces in Europe series, 1945-1946.': '/cdm/singleitem/collection/p4013coll8/id/2952/rec/14'}, {'Twelfth US Army group directives. Annex no. 1.': '/cdm/singleitem/collection/p4013coll8/id/2898/rec/15'}, {'After action report, 749th Tank Battalion: Jan, Feb, Apr - 8 May 45.': '/cdm/singleitem/collection/p4013coll8/id/3502/rec/16'}, {'743rd Tank Battalion, S3 journal history.': '/cdm/singleitem/collection/p4013coll8/id/3553/rec/17'}, {'History of military training, WAAC / WAC training.': '/cdm/singleitem/collection/p4013coll8/id/4052/rec/18'}, {'After action report, 756th Tank Battalion.': '/cdm/singleitem/collection/p4013coll8/id/3440/rec/19'}, {'After action report 92nd Cavalry Recon Squadron Mechanized 12th Armored Division, Jan thru May 45.': '/cdm/singleitem/collection/p4013coll8/id/3583/rec/20'}]
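If you also want to dump those links to a CSV file (your main() has a commented-out write_csv call), a minimal sketch, assuming titles_link is the list of single-entry dicts printed above, could be:
import csv

# titles_link is the list of {title: relative_url} dicts built above
with open("titles.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["title", "url"])
    for item in titles_link:
        for title, href in item.items():
            # prepend the site root so the stored link is absolute
            writer.writerow([title, "http://cgsc.cdmhost.com" + href])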
An easy way to do it with requests and BeautifulSoup:
import requests
from bs4 import BeautifulSoup
req = requests.get(url) # url stands for the page's url you want to find
soup = BeautifulSoup(req.text, "html.parser") # req.text is the complete html of the page
print(soup.title.string) # soup.title will give you the title of the page but with the <title> tags so .string removes them
Try this.
from simplified_scrapy import SimplifiedDoc,req,utils
url = 'http://cgsc.cdmhost.com/cdm/search/collection/p4013coll8/searchterm/1/field/all/mode/all/conn/and/order/nosort/page/1'
html = req.get(url)
doc = SimplifiedDoc(html)
lst = doc.selects('div.marginTopTextAdjuster').select('a')
titles_link = [(utils.absoluteUrl(url,a.href),a.text) for a in lst if a]
print (titles_link)
Result:
[('http://cgsc.cdmhost.com/cdm/singleitem/collection/p4013coll8/id/2653/rec/1', 'Civil Affairs Handbook, Japan, section 1a: population statistics.'), ('http://cgsc.cdmhost.com/cdm/landingpage/collection/p4013coll8', 'World War II Operational Documents'), ('http://cgsc.cdmhost.com/cdm/singleitem/collection/p4013coll8/id/2385/rec/2', 'Army Air Forces Program 1943.'),...

Generating URL for Yahoo news and Bing news with Python and BeautifulSoup

I want to scrape data from Yahoo News and Bing News pages. The data that I want to scrape are the headlines and/or the text below the headlines (whatever can be scraped) and the dates (times) when they were posted.
I have written code, but it does not return anything. The problem is with my URLs, since I'm getting response 404.
Can you please help me with it?
This is the code for 'Bing'
from bs4 import BeautifulSoup
import requests
term = 'usa'
url = 'http://www.bing.com/news/q?s={}'.format(term)
response = requests.get(url)
print(response)
soup = BeautifulSoup(response.text, 'html.parser')
print(soup)
And this is for Yahoo:
term = 'usa'
url = 'http://news.search.yahoo.com/q?s={}'.format(term)
response = requests.get(url)
print(response)
soup = BeautifulSoup(response.text, 'html.parser')
print(soup)
Please help me generate these URLs and explain the logic behind them. I'm still a noob :)
Basically your URLs are just wrong. The URLs that you have to use are the same ones that you see in the address bar while using a regular browser. Most search engines and aggregators use the q parameter for the search term. Most of the other parameters are usually not required (sometimes they are, e.g. for specifying the result page number).
Bing
from bs4 import BeautifulSoup
import requests
import re

term = 'usa'
url = 'https://www.bing.com/news/search?q={}'.format(term)
response = requests.get(url)

soup = BeautifulSoup(response.text, 'html.parser')

for news_card in soup.find_all('div', class_="news-card-body"):
    title = news_card.find('a', class_="title").text
    time = news_card.find(
        'span',
        attrs={'aria-label': re.compile(".*ago$")}
    ).text
    print("{} ({})".format(title, time))
Output
Jason Mohammed blitzkrieg sinks USA (17h)
USA Swimming held not liable by California jury in sexual abuse case (1d)
United States 4-1 Canada: USA secure payback in Nations League (1d)
USA always plays the Dalai Lama card in dealing with China, says Chinese Professor (1d)
...
Yahoo
from bs4 import BeautifulSoup
import requests

term = 'usa'
url = 'https://news.search.yahoo.com/search?q={}'.format(term)
response = requests.get(url)

soup = BeautifulSoup(response.text, 'html.parser')

for news_item in soup.find_all('div', class_='NewsArticle'):
    title = news_item.find('h4').text
    time = news_item.find('span', class_='fc-2nd').text
    # Clean time text
    time = time.replace('·', '').strip()
    print("{} ({})".format(title, time))
Output
USA Baseball will return to Arizona for second Olympic qualifying chance (52 minutes ago)
Prized White Sox prospect Andrew Vaughn wraps up stint with USA Baseball (28 minutes ago)
Mexico defeats USA in extras for Olympic berth (13 hours ago)
...
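One more detail: formatting the term straight into the URL breaks as soon as it contains spaces or special characters. A minimal sketch of letting requests handle the encoding through its params argument (same Bing endpoint and class names as above):
import requests
from bs4 import BeautifulSoup

term = 'usa elections'  # a multi-word term works because requests URL-encodes it
response = requests.get('https://www.bing.com/news/search', params={'q': term})
soup = BeautifulSoup(response.text, 'html.parser')

for news_card in soup.find_all('div', class_='news-card-body'):
    print(news_card.find('a', class_='title').text)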
