bs4 - How to extract table data from website?

Here is the link:
https://www.vit.org/WebReports/vesselschedule.aspx
I'm using BeautifulSoup, and my goal is to extract the table from that page.
This is the code I wrote:
from bs4 import BeautifulSoup
import requests
import pandas as pd
url="https://www.vit.org/WebReports/vesselschedule.aspx"
html_content = requests.get(url).text
soup = BeautifulSoup(html_content, "lxml")
gdp_table = soup.find("table", attrs={"id": "ctl00_ContentPlaceHolder1_VesselScheduleControl1_Grid1"})
The final line returns None instead of the table.
I'm new to web scraping. Can you help me find a way to get the table?

Why not pd.read_html(url)?
It extracts tables automatically and returns a list of DataFrames, one per table it finds.
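As a rough illustration of that suggestion, the sketch below runs read_html on a small inline page instead of the live site. The HTML and its values are invented for the example, and it assumes pandas is installed together with an HTML parser such as lxml. The match argument keeps only tables whose text matches the given string.

```python
from io import StringIO

import pandas as pd

# Invented sample page: a layout table followed by the data table.
SAMPLE_HTML = """
<table><tr><td>Navigation</td></tr></table>
<table>
  <tr><th>Vessel</th><th>Shipline</th></tr>
  <tr><td>Example One</td><td>AAA</td></tr>
  <tr><td>Example Two</td><td>BBB</td></tr>
</table>
"""

# read_html returns a list of DataFrames; match filters tables by their text.
tables = pd.read_html(StringIO(SAMPLE_HTML), match="Shipline")
df = tables[0]
print(df)
```

On a real page the same call can take the URL directly, but it only sees what the server sends, so tables built later by JavaScript will still be missing.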

The problem is that the id you are searching for is added to the element dynamically by JavaScript. The requests library only downloads the document at the URL and never runs scripts, so nothing is appended and the table in the response has no id.
If you hit a similar problem in the future (the element exists in the browser but bs4 can't find it), save the response text to an HTML file and inspect it in your browser.
For your particular case, this code could be used:
import requests
from bs4 import BeautifulSoup
resp = requests.get("https://www.vit.org/WebReports/vesselschedule.aspx")
# Save the raw response so it can be inspected in a browser
with open("tmp.html", "w", encoding="utf-8") as f:
    f.write(resp.text)
bs = BeautifulSoup(resp.text, "html.parser")
table = bs.find_all("table")[6] # not the best way to select elements
rows = table.find_all("tr")
Warning: Avoid positional selection like this where you can. Web pages change often, and such code may produce errors in the future.
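One way to avoid the positional index, sketched here on an invented inline page rather than the live site, is to pick the table whose first row contains a known column name. The helper find_table_by_header and the sample markup are hypothetical and not part of bs4.

```python
from bs4 import BeautifulSoup

# Invented sample page: a layout table followed by the data table.
SAMPLE_HTML = """
<table><tr><td>Menu</td></tr></table>
<table>
  <tr><td>Vessel</td><td>Shipline</td></tr>
  <tr><td>Example One</td><td>AAA</td></tr>
</table>
"""


def find_table_by_header(soup, header_text):
    """Return the first table whose first row has a cell equal to header_text."""
    for table in soup.find_all("table"):
        first_row = table.find("tr")
        if first_row is None:
            continue
        cells = [c.get_text(strip=True) for c in first_row.find_all(["td", "th"])]
        if header_text in cells:
            return table
    return None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = find_table_by_header(soup, "Shipline")
rows = [[td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in table.find_all("tr")]
print(rows)
```

The lookup keeps working if tables are added or removed elsewhere on the page, as long as the column name itself stays the same.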

I parsed the table, collected the cells of each row into a list, and appended each of those lists to a data list.
I also posted the complete list on [Hashbin].
from bs4 import BeautifulSoup
import requests
url = "https://www.vit.org/WebReports/vesselschedule.aspx"
soup = BeautifulSoup(requests.get(url).text, "html.parser")
table = soup.find_all('table')[6] # not the best way to select it, as darkKnight noted
rows = table.find_all('tr')
data = []
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append(cols)
print(data)

from bs4 import BeautifulSoup
import requests
res = requests.get("https://www.vit.org/WebReports/vesselschedule.aspx")
soup = BeautifulSoup(res.text, "html.parser")
Find the column names with the code below:
table = soup.find_all("table")[6]
columns = [col.get_text(strip=True) for col in table.find("tr", class_="HeadingRow").find_all("td")[1:-1]]
Collect the row data with the code below:
main_lst = []
for row in table.find_all("tr", class_="Row"):
    lst = [i.get_text(strip=True) for i in row.find_all("td")[1:-1]]
    main_lst.append(lst)
Create the table using pandas:
import pandas as pd
df = pd.DataFrame(columns=columns, data=main_lst)
df

You need a pattern that uniquely identifies the target table within the nested table structure. The following CSS selector grabs that table based on a string it contains ("Shipline"), the absence of a style attribute, and the table's relationship to other elements within the DOM.
You can then pass that specific table to read_html and do some cleaning on the returned DataFrame.
import requests
from bs4 import BeautifulSoup as bs
from pandas import read_html as rh
r = requests.get('https://www.vit.org/WebReports/vesselschedule.aspx').text
soup = bs(r, 'lxml')
df = rh(str(soup.select_one('table table:not([style]):-soup-contains("Shipline")')))[0]  # earlier soupsieve versions use :contains
df.dropna(how='all', axis = 1, inplace = True)
df.columns = df.iloc[0, :]
df = df.iloc[1:, :]

Related

Python/BeautifulSoup script returning no results in CSV

I'm new to python (and coding in general) and am trying to write a script that will scrape all of the <p> tags from a given URL and then create a CSV file with them. It seems to run through okay, but the CSV file it creates doesn't have any data in it. Below is my code:
import requests
r = requests.get('https://seekingalpha.com/amp/article/4420423-chipotle-mexican-grill-inc-s-cmg-ceo-brian-niccol-on-q1-2021-results-earnings-call-transcript')
from bs4 import BeautifulSoup
soup = BeautifulSoup(r.text, 'html.parser')
results = soup.find_all('<p>')
records = []
for result in results:
    Comment = result.find('<p>').text
    records.append((Comment))
import pandas as pd
df = pd.DataFrame(records, columns=['Comment'])
df.to_csv('CMG_test.csv', index=False, encoding='utf-8')
print('finished')
Any help greatly appreciated!

First, find_all expects a tag name, not markup. '<p>' is not a tag name; 'p' is. So, to find all p tags, call the find_all method on the soup like so:
results = soup.find_all('p')
CSS selectors are a separate feature, available through the select and select_one methods.
Second, when iterating over results, you don't need to find the tag all over again. You can extract the text directly with result.text.
So, if you rewrite your code like the following:
import requests
r = requests.get('https://seekingalpha.com/amp/article/4420423-chipotle-mexican-grill-inc-s-cmg-ceo-brian-niccol-on-q1-2021-results-earnings-call-transcript')
from bs4 import BeautifulSoup
soup = BeautifulSoup(r.text, 'html.parser')
results = soup.find_all('p')
records = []
for result in results:
    Comment = result.text
    records.append(Comment)
import pandas as pd
df = pd.DataFrame(records, columns=['Comment'])
df.to_csv('CMG_test.csv', index=False, encoding='utf-8')
print('finished')
You'll find your csv well-populated with the content you're looking for.
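To see the difference without hitting the live site, here is a minimal sketch on an invented HTML snippet. It writes to an in-memory buffer with the standard csv module instead of pandas, which is a substitution made only to keep the example self-contained.

```python
import csv
import io

from bs4 import BeautifulSoup

# Invented sample markup standing in for the article page.
SAMPLE_HTML = "<div><p>First paragraph.</p><p>Second paragraph.</p></div>"
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# '<p>' is treated as a literal tag name, which no element has.
wrong = soup.find_all("<p>")
# 'p' is the actual tag name.
right = soup.find_all("p")

records = [p.get_text(strip=True) for p in right]

# Write to an in-memory buffer here; a real script would open a file instead.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Comment"])
writer.writerows([text] for text in records)

print(len(wrong), len(right))
print(buffer.getvalue())
```

Because the first lookup returns an empty list, the loop in the original script never runs, which is why the CSV ends up with a header and no rows.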

Finding tables returns [] with bs4

I am trying to scrape a table from this url: https://cryptoli.st/lists/fixed-supply
I gather that the table I want is in the div class "dataTables_scroll". I use the following code and it only returns an empty list:
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
url = requests.get("https://cryptoli.st/lists/fixed-supply")
soup = bs(url.content, 'lxml')
table = soup.find_all("div", {"class": "dataTables_scroll"})
print(table)
Any help would be most appreciated.
Thanks!

The reason is that the response you get from requests.get() does not contain the table data.
It is probably loaded on the client side (by JavaScript).
What can you do about this? Using a Selenium webdriver is one possible solution. You can wait until the table is loaded and becomes interactive, then get the page source from Selenium and pass it to bs4 for scraping.
You can check the response by writing it to a file:
with open("demofile.html", "w", encoding="utf-8") as f:
    f.write(soup.prettify())
You will then see "...Loading..." where the table is expected.

I believe the data is loaded from a script tag. I don't have time right now to work out how to rebuild a DataFrame from the "|"-delimited data, but the following may serve as a starting point for others: it extracts the relevant entries for the table body from the script tag.
import requests, re
import ast
r = requests.get('https://cryptoli.st/lists/fixed-supply').text
s = re.search(r'cl\.coinmainlist\.dataraw = (\[.*?\]);', r, flags = re.S).group(1)
data = ast.literal_eval(s)
data = [i.split('|') for i in data]
print(data)
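To carry that starting point one step further, here is a hedged sketch that runs the same regex and literal_eval steps on an invented script snippet and then pairs each split row with column names. The variable name cl.coinmainlist.dataraw comes from the answer above; the sample values and the column names are made up, since the real field order is not given here.

```python
import ast
import re

# Invented stand-in for the page source; the real field layout may differ.
SAMPLE_PAGE = """
<script>
cl.coinmainlist.dataraw = ['1|Alpha|ALP|100', '2|Beta|BET|250'];
</script>
"""

# Hypothetical column names chosen for this example only.
COLUMNS = ["rank", "name", "symbol", "supply"]

match = re.search(r"cl\.coinmainlist\.dataraw = (\[.*?\]);", SAMPLE_PAGE, flags=re.S)
raw_rows = ast.literal_eval(match.group(1))

# Split each row on "|" and pair the pieces with the column names.
records = [dict(zip(COLUMNS, row.split("|"))) for row in raw_rows]
print(records)
```

A list of dictionaries like records can be passed straight to pandas.DataFrame once the actual column order has been confirmed against the page.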

BeautifulSoup not working after the first page

I'm trying to use Python's BeautifulSoup to scrape data from the following website. The data on the website is split over four different pages. Each page has a unique link (i.e. http://insider.espn.com/nbadraft/results/top100/_/year/2019/set/0 for the first page, http://insider.espn.com/nbadraft/results/top100/_/year/2019/set/1 for the second page, etc.). I am able to successfully scrape the data on the first page, but when I try to scrape data for the second page onward it comes up empty. Here is the code I'm using:
# Import libraries
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as soup
import pandas as pd
# Define url and request webpage
season = 2019
page = 1
url = "http://insider.espn.com/nbadraft/results/top100/_/year/{}/set/{}".format(season, page)
req = Request(url , headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
page_soup = soup(webpage, "html.parser")
# Scrape all of the data in the table
rows = page_soup.findAll('tr')[1:]
player_stats = [[td.getText() for td in rows[i].findAll('td')]
                for i in range(len(rows))]
# Get the column headers
headers = player_stats[0]
# Remove the first row
player_stats.pop(0)
# Convert to pandas dataframe
df = pd.DataFrame(player_stats, columns = headers)
# Remove all rows where Name = None
df = df[~df['NAME'].isnull()]
# Remove PLAYER column because it's empty
df = df.drop(columns='PLAYER')
df
Any advice would be much appreciated! I'm a bit new to using BeautifulSoup, so I apologize in advance if the code isn't particularly nice or efficient.
Update: The links only work if opened in Chrome, which is likely what is causing the problem. Is there any way around it?

Get data from list using beautifulsoup in python

I have this webpage: https://dmesupplyusa.com/mobility/bariatric-rollator-with-8-wheels.html
Under Details there is a list of specifications that I want to extract as a table, i.e. the specification category as the header and the specification value as the next row. How can I do this in Python using BeautifulSoup?

import requests
import pandas as pd
from bs4 import BeautifulSoup as bs
page = requests.get("https://dmesupplyusa.com/mobility/bariatric-rollator-with-8-wheels.html").content # Read the page source
page = bs(page, "html.parser") # Create the BeautifulSoup object
# Extract the lines of the list that follows the "Product Specifications" label
data = page.find_all('strong', string="Product Specifications")[0].find_next('ul').text.strip().split('\n')
# Split each "category: value" line once on the colon and skip lines without one
data = dict([part.strip() for part in i.split(":", 1)] for i in data if ":" in i)
df = pd.DataFrame([data]) # Categories become the header, values the first row
I hope this is what you are looking for.
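The same idea can be tried offline. The sketch below uses an invented specification list (the labels and values are placeholders, not taken from the product page) and walks the li elements directly instead of splitting the text on newlines, which is a small variation on the approach above.

```python
from bs4 import BeautifulSoup

# Invented markup that mimics a "label followed by a list" layout.
SAMPLE_HTML = """
<strong>Product Specifications</strong>
<ul>
  <li>Weight Capacity: 500 lbs</li>
  <li>Seat Width: 18 in</li>
  <li>Ships assembled</li>
</ul>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
label = soup.find("strong", string="Product Specifications")
items = label.find_next("ul").find_all("li")

specs = {}
for li in items:
    # partition splits on the first colon only, so values may contain colons.
    key, sep, value = li.get_text(strip=True).partition(":")
    if sep:  # skip entries that have no colon
        specs[key.strip()] = value.strip()

header = list(specs)
row = list(specs.values())
print(header)
print(row)
```

Passing [specs] to pandas.DataFrame would then give one header row of categories and one row of values.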

BeautifulSoup find_all by table and id class returning no results?

I am trying to scrape box-score data from ProFootball reference. After running into issues with javascript, I turned to selenium to get the initial soup object. I'm trying to find a specific table on a website and subsequently iterate through its rows.
The code works if I simply use find_all('table')[#], but the # changes depending on which box score I am looking at, so it isn't reliable. I therefore want to use the id='player_offense' attribute to identify the same table across games, but when I use it, it returns nothing. What am I missing here?
from selenium import webdriver
import os
from bs4 import BeautifulSoup
#path to chromedriver
chrome_path=os.path.expanduser('~/Documents/chromedriver.exe')
driver = webdriver.Chrome(chrome_path)
driver.get('https://www.pro-football-reference.com/boxscores/201709070nwe.htm')
soup = BeautifulSoup(driver.page_source,'lxml')
driver.quit()
#doesn't work
soup.find('table',id='player_offense')
#works
table = soup.find_all('table')[3]

The table is inside an HTML comment in the page source, so it is not parsed as a tag. Find the comment that contains it, then extract the table from the comment text.
import requests
from bs4 import BeautifulSoup as bs
from bs4 import Comment
import pandas as pd
r = requests.get('https://www.pro-football-reference.com/boxscores/201709070nwe.htm#')
soup = bs(r.content, "lxml")
# Collect every comment node in the document
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
for comment in comments:
    if 'id="player_offense"' in comment:
        print(pd.read_html(comment)[0])
        break
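The comment trick can be checked on a small inline page. In this sketch the markup, the wrapper id all_player_offense, and the values are invented for illustration; the comment body is parsed with a second BeautifulSoup call rather than pandas to keep the example dependency-light.

```python
from bs4 import BeautifulSoup, Comment

# Invented markup: the table exists only inside an HTML comment.
SAMPLE_HTML = """
<div id="all_player_offense">
  <!--
  <table id="player_offense">
    <tr><th>Player</th><th>Yds</th></tr>
    <tr><td>Example Player</td><td>123</td></tr>
  </table>
  -->
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# The table is inside a comment, so a normal lookup finds nothing.
direct = soup.find("table", id="player_offense")

table = None
for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
    if 'id="player_offense"' in comment:
        # Parse the comment body as its own document.
        inner = BeautifulSoup(comment, "html.parser")
        table = inner.find("table", id="player_offense")
        break

rows = [[c.get_text(strip=True) for c in tr.find_all(["th", "td"])]
        for tr in table.find_all("tr")]
print(direct)
print(rows)
```

Once the comment has been re-parsed, the id lookup behaves the way the question expected it to.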

This also works.
from requests_html import HTMLSession, HTML
import pandas as pd
with HTMLSession() as s:
    r = s.get('https://www.pro-football-reference.com/boxscores/201709070nwe.htm')
r = HTML(html=r.text)
r.render()
table = r.find('table#player_offense', first=True)
df = pd.read_html(table.html)
print(df)
