I am trying to scrape this page, http://bifr.nic.in/asp/list.asp, with BeautifulSoup and get the table from it.
Following is my code:
from bs4 import BeautifulSoup
import urllib.request

base_url = "http://bifr.nic.in/asp/list.asp"
page = urllib.request.urlopen(base_url)
soup = BeautifulSoup(page, "html.parser")
table = soup.find("table", {"class": "forumline"})
tr = table.find_all("tr")
for rows in tr:
    print(rows.get_text())
It shows no error, but when I execute it I am only able to get the first row of content from the table, repeated:
List of Companies
Case
No
Company
Name
359 2000 A & F OVERSEAS LTD.
359 2000 A & F OVERSEAS LTD.
359 2000 A & F OVERSEAS LTD.
This is the result I am getting. I can't understand what's going on.
Try this to get all the data from that table:
from urllib.request import urlopen
from bs4 import BeautifulSoup

page = urlopen("http://bifr.nic.in/asp/list.asp")
soup = BeautifulSoup(page, "html5lib")
table = soup.select_one("table.forumline")
for items in table.select("tr")[4:]:
    data = ' '.join([item.get_text(" ", strip=True) for item in items.select("td")])
    print(data)
Partial Output:
359 2000 A & F OVERSEAS LTD.
99 1988 A B C PRODUCTS LTD.
103 1989 A INFRASTRUCTURE LTD.
3 2006 A V ALLOYS LTD.
13 1988 A V J WIRES LTD.
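The answer above skips the header rows by hard-coding the slice [4:]. A more robust alternative is to skip any row that yields no td cells, which also handles th-only header rows. A minimal self-contained sketch using inline stand-in HTML (the markup and company data here are made up, not fetched from the live site):

```python
from bs4 import BeautifulSoup

# Inline HTML standing in for the real page (hypothetical rows).
html = """
<table class="forumline">
  <tr><th>Case No</th><th>Company Name</th></tr>
  <tr><td>359 2000</td><td>A &amp; F OVERSEAS LTD.</td></tr>
  <tr><td>99 1988</td><td>A B C PRODUCTS LTD.</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []
for tr in soup.select("table.forumline tr"):
    # header rows contain <th>, not <td>, so they produce an empty list here
    cells = [td.get_text(" ", strip=True) for td in tr.select("td")]
    if cells:
        rows.append(" ".join(cells))
print(rows)
```

This avoids having to count header rows by hand when the page layout changes.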
The page probably contains some errors in its HTML markup, so try using html5lib instead of html.parser, but first you need to install it:
pip install html5lib
soup = BeautifulSoup(page, "html5lib")
Related
I want to scrape this website using BeautifulSoup: first extract every link, then open them one by one. Once they are opened, I want to scrape the company name, its ticker, and its stock exchange, and extract the multiple PDF links whenever they are available, then write them out to a CSV file afterwards.
To make it happen, I first tried this:
import requests
from bs4 import BeautifulSoup
import re
import time

source_code = requests.get('https://www.responsibilityreports.co.uk/Companies?a=#')
soup = BeautifulSoup(source_code.content, 'lxml')

data = []
links = []
base = 'https://www.responsibilityreports.co.uk'

for link in soup.find_all('a', href=True):
    data.append(str(link.get('href')))
    print(link)

try:
    for link in links:
        url = base + link
        req = requests.get(url)
        soup = BeautifulSoup(req.content, 'html.parser')
        for j in soup.find_all('a', href=True):
            print(j)
except:
    pass
As far as I know, this website doesn't forbid scrapers. But while the code actually gives me every link, I'm unable to open them, which keeps my scraper from moving on to the following tasks.
Thanks in advance!
You can use this example as a guide for iterating over all the company links:
import requests
from bs4 import BeautifulSoup

url = "https://www.responsibilityreports.co.uk/Companies?a=#"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

links = [
    "https://www.responsibilityreports.co.uk" + a["href"]
    for a in soup.select('a[href^="/Company"]')
]

for link in links:
    soup = BeautifulSoup(requests.get(link).content, "html.parser")
    name = soup.select_one("h1").get_text(strip=True)
    ticker = soup.select_one(".ticker_name")
    if ticker:
        ticker = ticker.get_text(strip=True)
    else:
        ticker = "N/A"
    # extract other info...
    print(name)
    print(ticker)
    print(link)
    print("-" * 80)
Prints:
3i Group plc
III
https://www.responsibilityreports.co.uk/Company/3i-group-plc
--------------------------------------------------------------------------------
3M Corporation
MMM
https://www.responsibilityreports.co.uk/Company/3m-corporation
--------------------------------------------------------------------------------
AAON Inc.
AAON
https://www.responsibilityreports.co.uk/Company/aaon-inc
--------------------------------------------------------------------------------
ABB Ltd
ABB
https://www.responsibilityreports.co.uk/Company/abb-ltd
--------------------------------------------------------------------------------
Abbott Laboratories
ABT
https://www.responsibilityreports.co.uk/Company/abbott-laboratories
--------------------------------------------------------------------------------
Abbvie Inc
ABBV
https://www.responsibilityreports.co.uk/Company/abbvie-inc
--------------------------------------------------------------------------------
Abercrombie & Fitch
ANF
https://www.responsibilityreports.co.uk/Company/abercrombie-fitch
--------------------------------------------------------------------------------
ABM Industries, Inc.
ABM
https://www.responsibilityreports.co.uk/Company/abm-industries-inc
--------------------------------------------------------------------------------
Acadia Realty Trust
AKR
https://www.responsibilityreports.co.uk/Company/acadia-realty-trust
--------------------------------------------------------------------------------
Acciona
N/A
https://www.responsibilityreports.co.uk/Company/acciona
--------------------------------------------------------------------------------
ACCO Brands
ACCO
https://www.responsibilityreports.co.uk/Company/acco-brands
--------------------------------------------------------------------------------
...and so on.
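The key difference from the question's code is the attribute-prefix selector a[href^="/Company"], which keeps only the company-page links instead of every anchor on the page. A self-contained sketch with inline stand-in HTML (the anchors here are hypothetical, not from the live index):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment mimicking the companies index page.
html = """
<a href="/Company/3i-group-plc">3i Group plc</a>
<a href="/Companies?a=B">B</a>
<a href="/Company/abb-ltd">ABB Ltd</a>
<a href="/static/about">About</a>
"""

base = "https://www.responsibilityreports.co.uk"
soup = BeautifulSoup(html, "html.parser")
# href^= matches attribute values that start with the given prefix,
# so navigation and static links are excluded.
links = [base + a["href"] for a in soup.select('a[href^="/Company"]')]
print(links)
```

Because only matching links are collected, the follow-up requests loop never wastes time on navigation links.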
I have one of those nightmare tables with no class given for the tr and td tags.
A sample page is here: https://system.gotsport.com/org_event/events/1271/schedules?age=19&gender=m
(You'll see in the code below that I'm getting multiple pages, but that's not the problem.)
I want the team name (nothing else) from each bracket. The output should be:
OCYS
FL Rush
Jacksonville FC
Atlanta United
SSA
Miami Rush Kendall SC
IMG
Tampa Bay United
etc.
I've been able to get every td in the specified tables. But every attempt to use [0] to get the first td of every row gives me an "index out of range" error.
The code is:
import requests
import csv
from bs4 import BeautifulSoup

batch_size = 2
urls = ['https://system.gotsport.com/org_event/events/1271/schedules?age=19&gender=m',
        'https://system.gotsport.com/org_event/events/1271/schedules?age=17&gender=m']

# iterate through urls
for url in urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")

    # iterate through leagues and teams
    leagues = soup.find_all('table', class_='table table-bordered table-hover table-condensed')
    for league in leagues:
        rows = league.find_all('tr')
        for row in rows:
            team = row.find_all('td')
            teamName = team[0].text.strip()
            print(teamName)
After a couple of hours of work, I feel like I'm just one syntax change away from getting this right. Yes?
You can use the CSS selector nth-of-type(n). It works for both links:
import requests
from bs4 import BeautifulSoup

url = "https://system.gotsport.com/org_event/events/1271/schedules?age=19&gender=m"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

for tag in soup.select(".small-margin-bottom td:nth-of-type(1)"):
    print(tag.text.strip())
Output:
OCYS
FL Rush
Jacksonville FC
Atlanta United
SSA
...
...
Real Salt Lake U19
Real Colorado
Empire United Soccer Academy
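The td:nth-of-type(1) behavior can be checked offline. A minimal sketch with inline stand-in HTML (the wrapper class matches the answer above, but the table markup and team names are hypothetical):

```python
from bs4 import BeautifulSoup

# Inline stand-in for one schedule table (hypothetical data).
html = """
<div class="small-margin-bottom">
  <table>
    <tr><td>OCYS</td><td>3</td></tr>
    <tr><td>FL Rush</td><td>1</td></tr>
  </table>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
# td:nth-of-type(1) selects only the first <td> within each row,
# so the score cells are never touched.
first_cells = [td.get_text(strip=True)
               for td in soup.select(".small-margin-bottom td:nth-of-type(1)")]
print(first_cells)
```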
Each bracket corresponds to one "panel", and each panel's body contains a table whose rows hold one team each, with the team name in the first cell:
def main():
    import requests
    from bs4 import BeautifulSoup

    url = "https://system.gotsport.com/org_event/events/1271/schedules?age=19&gender=m"
    response = requests.get(url)
    response.raise_for_status()
    soup = BeautifulSoup(response.content, "html.parser")

    for panel in soup.find_all("div", {"class": "panel-body"}):
        for row in panel.find("tbody").find_all("tr"):
            print(row.find("td").text.strip())

    return 0


if __name__ == "__main__":
    import sys
    sys.exit(main())
Output:
OCYS
FL Rush
Jacksonville FC
Atlanta United
SSA
Miami Rush Kendall SC
IMG
Tampa Bay United
Weston FC
Chargers SC
South Florida FA
Solar SC
RISE SC
...
I think the problem is the header of the table, which contains th elements instead of td elements. That leads to the index out of range error when you try to retrieve the first element of an empty list. Try adding a check on the length of team:
for row in rows:
    team = row.find_all('td')
    if len(team) > 0:
        teamName = team[0].text.strip()
        print(teamName)
It should print you the team names.
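The th-versus-td behavior is easy to verify offline. A self-contained sketch with inline HTML (hypothetical data, mirroring the structure described above):

```python
from bs4 import BeautifulSoup

# A header row built from <th> cells, followed by data rows with <td> cells.
html = """
<table>
  <tr><th>Team</th><th>Points</th></tr>
  <tr><td>OCYS</td><td>3</td></tr>
  <tr><td>FL Rush</td><td>1</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
names = []
for row in soup.find_all("tr"):
    team = row.find_all("td")  # empty list for the <th>-only header row
    if len(team) > 0:
        names.append(team[0].text.strip())
print(names)
```

Without the length check, the header row's empty list would raise the exact "index out of range" error the question describes.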
I am trying to web scrape the second table from this website:
https://fbref.com/en/comps/9/stats/Premier-League-Stats
However, I have only ever managed to extract the information from the first table when trying to access it by finding the table tag. Could anyone explain why I cannot access the second table, or show me how to do it?
import requests
from bs4 import BeautifulSoup

url = "https://fbref.com/en/comps/9/stats/Premier-League-Stats"
res = requests.get(url)
soup = BeautifulSoup(res.text, 'lxml')

tables = soup.find_all("table")
player_table = tables[0]
Something along these lines should do it:

tables = soup.find_all("table")  # returns a list of tables
second_table = tables[1]
The table is inside HTML comments <!-- ... -->.
To get the table from comments, you can use this example:
import requests
from bs4 import BeautifulSoup, Comment

url = 'https://fbref.com/en/comps/9/stats/Premier-League-Stats'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

table = BeautifulSoup(soup.select_one('#all_stats_standard').find_next(text=lambda x: isinstance(x, Comment)), 'html.parser')

# print some information from the table to screen:
for tr in table.select('tr:has(td)'):
    tds = [td.get_text(strip=True) for td in tr.select('td')]
    print('{:<30}{:<20}{:<10}'.format(tds[0], tds[3], tds[5]))
Prints:
Patrick van Aanholt Crystal Palace 1990
Max Aarons Norwich City 2000
Tammy Abraham Chelsea 1997
Che Adams Southampton 1996
Adrián Liverpool 1987
Sergio Agüero Manchester City 1988
Albian Ajeti West Ham 1997
Nathan Aké Bournemouth 1995
Marc Albrighton Leicester City 1989
Toby Alderweireld Tottenham 1989
...and so on.
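The comment-extraction trick can be demonstrated offline. A minimal sketch with a table hidden inside an HTML comment (the wrapper id matches the answer above, but the row data is made up):

```python
from bs4 import BeautifulSoup, Comment

# Inline HTML with a table inside a comment, mimicking fbref's markup.
html = """
<div id="all_stats_standard">
<!--
<table><tr><td>Patrick van Aanholt</td><td>Crystal Palace</td></tr></table>
-->
</div>
"""

soup = BeautifulSoup(html, "html.parser")
# Locate the Comment node inside the wrapper div, then re-parse its
# text content as HTML so the hidden table becomes a real parse tree.
comment = soup.find("div", id="all_stats_standard").find(
    string=lambda s: isinstance(s, Comment))
table = BeautifulSoup(comment, "html.parser")
cells = [td.get_text(strip=True) for td in table.select("td")]
print(cells)
```

The point is that a commented-out table is just text to the first parse, so find_all("table") cannot see it until the comment's contents are parsed again.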
Hi, I am using BS4 to scrape SIC codes and descriptions. I currently have the following code, which does exactly what I want except that I don't know how to scrape the description, which appears both in the inspect-element view and in the page source.
To be clear, the bits I want are "State commercial banks" and "LABORATORY ANALYTICAL INSTRUMENTS".
https://www.sec.gov/cgi-bin/browse-edgar?CIK=866054&owner=exclude&action=getcompany&Find=Search
<div class="companyInfo">
<span class="companyName">COMMERCIAL NATIONAL FINANCIAL CORP /PA <acronym title="Central Index Key">CIK</acronym>#: 0000866054 (see all company filings)</span>
<p class="identInfo"><acronym title="Standard Industrial Code">SIC</acronym>: 6022 - STATE COMMERCIAL BANKS<br />State location: PA | State of Inc.: <strong>PA</strong> | Fiscal Year End: 1231<br />(Office of Finance)<br />Get <b>insider transactions</b> for this <b>issuer</b>.
for cik_num in cik_num_list:
    try:
        url = r"https://www.sec.gov/cgi-bin/browse-edgar?CIK={}&owner=exclude&action=getcompany".format(cik_num)
        response = requests.get(url)
        soup = BeautifulSoup(response.content, 'html.parser')
        try:
            comp_name = soup.find_all('div', {'class':'companyInfo'})[0].find('span').text
            sic_code = soup.find_all('p', {'class':'identInfo'})[0].find('a').text
import requests
from bs4 import BeautifulSoup

url = 'https://www.sec.gov/cgi-bin/browse-edgar?CIK=866054&owner=exclude&action=getcompany&Find=Search'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

sic_code_desc = soup.select_one('.identInfo').a.find_next_sibling(text=True).split(maxsplit=1)[-1]
print(sic_code_desc)
Prints:
STATE COMMERCIAL BANKS
For url = 'https://www.sec.gov/cgi-bin/browse-edgar?CIK=1090872&owner=exclude&action=getcompany&Find=Search' it prints:
LABORATORY ANALYTICAL INSTRUMENTS
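The find_next_sibling chain can be checked against a trimmed version of the question's markup. Note that in this sketch I've wrapped the SIC code in an a tag, as on the live page, since the answer navigates through .a; the rest of the fragment is taken from the question:

```python
from bs4 import BeautifulSoup

# Trimmed fragment of the question's markup; the <a> around the code
# is assumed to match the live page, since the answer chains through .a.
html = ('<p class="identInfo"><acronym title="Standard Industrial Code">SIC'
        '</acronym>: <a href="#">6022</a> - STATE COMMERCIAL BANKS<br/></p>')

soup = BeautifulSoup(html, "html.parser")
# The description is the text node right after the <a> holding the code;
# split(maxsplit=1)[-1] drops the leading "-" before the description.
desc = soup.select_one(".identInfo").a.find_next_sibling(string=True)
desc = desc.split(maxsplit=1)[-1]
print(desc)
```

string=True here matches any text node, so the first text sibling after the link is returned regardless of its content.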
Total Python 3 beginner here. I can't seem to get just the names of the colleges to print out.
The class is nowhere near the college names, and I can't seem to narrow the find_all down to what I need and print to a new CSV file. Any ideas?
import requests
from bs4 import BeautifulSoup
import csv

res = requests.get("https://en.wikipedia.org/wiki/Ivy_League")
soup = BeautifulSoup(res.text, "html.parser")

colleges = soup.find_all("table", class_="wikitable sortable")
for college in colleges:
    first_level = college.find_all("tr")
    print(first_level)
You can use soup.select() to utilize CSS selectors and be more precise:
import requests
from bs4 import BeautifulSoup

res = requests.get("https://en.wikipedia.org/wiki/Ivy_League")
soup = BeautifulSoup(res.text, "html.parser")

l = soup.select(".mw-parser-output > table:nth-of-type(2) > tbody > tr > td:nth-of-type(1) a")
for each in l:
    print(each.text)
Printed result:
Brown University
Columbia University
Cornell University
Dartmouth College
Harvard University
University of Pennsylvania
Princeton University
Yale University
To put a single column into csv:
import pandas as pd
pd.DataFrame([e.text for e in l]).to_csv("your_csv.csv") # This will include index
With:

colleges = soup.find_all("table", class_="wikitable sortable")

you are getting all the tables with this class (there are five), not all the colleges in the one table you want. So you can do something like this:
import requests
from bs4 import BeautifulSoup

res = requests.get("https://en.wikipedia.org/wiki/Ivy_League")
soup = BeautifulSoup(res.text, "html.parser")

college_table = soup.find("table", class_="wikitable sortable")
colleges = college_table.find_all("tr")
for college in colleges:
    college_link = college.find('a')
    if college_link is not None:
        college_name = college_link.text
        print(college_name)
EDIT: I added an if to discard the first row, which holds the table header.