I'm trying to scrape data from this table
and here's the code I'm using
## Scraping data for schools
from urllib.request import urlopen
from bs4 import BeautifulSoup
#List of schools
page=urlopen('https://mcss.knack.com/school-districts#all-school-contacts/')
soup = BeautifulSoup(page,'html.parser')
School=[]
Address=[]
Phone=[]
Principal=[]
Email=[]
District=[]
# Indexing rows and then identifying cells
for rows in soup.findAll('tr'):
    cells = rows.findAll('td')
    if len(cells)==7:
        # search within the current row, not the whole document
        School.append(rows.find("span", {'class':'col-0'}).text)
        Address.append(rows.find("span", {'class':'col-1'}).text)
        Phone.append(rows.find("span", {'class':'col-2'}).text)
        Principal.append(rows.find("span", {'class':'col-3'}).text)
        Email.append(rows.find("span", {'class':'col-4'}).text)
        District.append(rows.find("span", {'class':'col-5'}).text)
import pandas as pd
school_frame = pd.DataFrame({'School': School,
                             'Address': Address,
                             'Phone': Phone,
                             'Principal': Principal,
                             'Email': Email,
                             'District': District
                             })
school_frame.head()
#school_frame.to_csv('school_address.csv')
In return, I'm getting only the column headers of the data frame.
What am I doing wrong?
If you inspect the actual value of page, you will see that it contains no table, only an empty div that JavaScript fills in later. urllib.request does not run JavaScript, so it returns just the initial page, with no table in it. You could use Selenium to drive a real browser (which does run JavaScript) and then fetch the resulting HTML of the page (see this Stack Overflow answer for an example).
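A quick way to confirm this diagnosis before reaching for a browser is to count the <tr> tags in the raw HTML. The sketch below uses only the standard library. The count_tags helper and both sample strings are made up for illustration; they are not the actual markup of the Knack page.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts how many times each start tag appears in a document."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1


def count_tags(html, tag):
    """Return the number of <tag> start tags found in the html string."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts.get(tag, 0)


# Hypothetical stand-in for what urlopen() returns from a JavaScript-driven page:
# only an empty container, with no table rows at all.
static_html = '<html><body><div id="content"></div><script src="app.js"></script></body></html>'

# Hypothetical stand-in for the DOM after a browser has run the JavaScript.
rendered_html = '<html><body><table><tr><td>School A</td></tr><tr><td>School B</td></tr></table></body></html>'

static_rows = count_tags(static_html, "tr")
rendered_rows = count_tags(rendered_html, "tr")
print(static_rows, rendered_rows)
```

If the count for the downloaded page is zero while the browser clearly shows rows, the table is being built on the client side and a plain HTTP fetch will never see it.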
Related
I am trying to scrape a table from this url: https://cryptoli.st/lists/fixed-supply
I gather that the table I want is in the div class "dataTables_scroll". I use the following code and it only returns an empty list:
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
url = requests.get("https://cryptoli.st/lists/fixed-supply")
soup = bs(url.content, 'lxml')
table = soup.find_all("div", {"class": "dataTables_scroll"})
print(table)
Any help would be most appreciated.
Thanks!
The reason is that the response you get from requests.get() does not contain the table data.
The data is probably loaded on the client side (by JavaScript).
What can you do about this? A Selenium webdriver is one possible solution. You can wait until the table has loaded and become interactive, then get the page content from Selenium and pass it to bs4 for scraping.
You can check the response by writing it to a file:
with open("demofile.html", "w", encoding='utf-8') as f:
    f.write(soup.prettify())
and you will be able to see "...Loading..." where the table is expected.
I believe the data is loaded from a script tag. I don't have time right now to work out how to properly rebuild a dataframe from the "|"-delimited data, but the following may serve as a starting point for others: it extracts the relevant entries for the table body from the script tag.
import requests, re
import ast
r = requests.get('https://cryptoli.st/lists/fixed-supply').text
s = re.search(r'cl\.coinmainlist\.dataraw = (\[.*?\]);', r, flags = re.S).group(1)
data = ast.literal_eval(s)
data = [i.split('|') for i in data]
print(data)
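To try the extraction step without hitting the site, the sketch below runs the same regex and ast.literal_eval on a made-up snippet and then turns each "|"-delimited entry into a record. The sample values and the column names are hypothetical; the real field order would have to be checked against the page's script.

```python
import ast
import re

# Made-up page source mimicking the structure the regex above expects.
sample_page = """
<script>
cl.coinmainlist.dataraw = ["Bitcoin|BTC|21000000",
                           "Litecoin|LTC|84000000"];
cl.other = 1;
</script>
"""

# Non-greedy match up to the first "];" so only the array literal is captured.
match = re.search(r'cl\.coinmainlist\.dataraw = (\[.*?\]);', sample_page, flags=re.S)
if match is None:
    raise ValueError("dataraw array not found; the page layout may have changed")

# The array holds plain string literals, so literal_eval can parse it safely.
entries = ast.literal_eval(match.group(1))

# Hypothetical column names, for illustration only.
columns = ["name", "symbol", "max_supply"]
records = [dict(zip(columns, entry.split("|"))) for entry in entries]
print(records)
```

A list of dictionaries like records can be passed straight to pd.DataFrame(records) once the real column order is known.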
I am scraping the ranking data from Vivino (the URL is in the code below), but I get an empty list. Can anyone help me?
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = "https://www.vivino.com/explore?e=eJzLLbI11jNVy83MszVXy02ssDU2UEuutPXzUUu2dQ0NUiuwNVRLT7MtSyzKTC1JzFHLT7ItSizJzEsvjk8sSy1KTE9Vy7dNSS1OVisviY4FKgZTRhDKGMozgdDmEMoEAJ7xJhY%3D" #the url of the website we're scraping
page = requests.get(url)
html = BeautifulSoup(page.content, "html.parser")
Name_html = html.find_all(class_="item_description")
Rating_html = html.find_all(class_="vivinoRating__averageValue--3Navj")
Name_list = []
Rating_list = []
#loop over HTML elements, get text and add to list
for i in Name_html:
    Name_list.append(i.get_text(strip=True))
for i in Rating_html:
    Rating_list.append(i.get_text(strip=True))
#make a dictionary with column names for data to put in DataFrame
data = {"description": Name_list,
"airdate": Rating_list}
df = pd.DataFrame(data) #make a dataframe
The wine labels are loaded through JavaScript, so you can't scrape them with bs4 as if they were part of static HTML.
You need to either send the POST requests manually to get the response content, or scrape the page dynamically with Selenium through any of the available browser drivers (e.g. chromedriver).
I'm trying to use Python's BeautifulSoup to scrape data from the following website. The data on the website is split over four different pages. Each page has a unique link (i.e. http://insider.espn.com/nbadraft/results/top100/_/year/2019/set/0 for the first page, http://insider.espn.com/nbadraft/results/top100/_/year/2019/set/1 for the second page, etc.). I am able to successfully scrape the data on the first page, but when I try to scrape data for the second page onward it comes up empty. Here is the code I'm using:
# Import libraries
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as soup
import pandas as pd
# Define url and request webpage
season = 2019
page = 1
url = "http://insider.espn.com/nbadraft/results/top100/_/year/{}/set/{}".format(season, page)
req = Request(url , headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
page_soup = soup(webpage, "html.parser")
# Scrape all of the data in the table
rows = page_soup.findAll('tr')[1:]
player_stats = [[td.getText() for td in row.findAll('td')]
                for row in rows]
# Get the column headers
headers = player_stats[0]
# Remove the first row
player_stats.pop(0)
# Convert to pandas dataframe
df = pd.DataFrame(player_stats, columns = headers)
# Remove all rows where Name = None
df = df[~df['NAME'].isnull()]
# Remove PLAYER column because it's empty
df = df.drop(columns='PLAYER')
df
Any advice would be much appreciated! I'm a bit new to using BeautifulSoup, so I apologize in advance if the code isn't particularly nice or efficient.
Update: The links only work if opened in Chrome, which is likely what is causing the problem. Is there any way around it?
I'm trying to get a table from the SGX website.
The page is saved to local drive and I am using BeautifulSoup to parse it:
soup = BeautifulSoup(open(pages), "lxml")
soup.prettify()
list_0 = soup.find_all('table')[0]
print(list_0)
What it returned is not the first row on the page:
[<tr><td>Zhongmin Baihui</td><td>5SR</td><td class="nowrap">09:44 AM</td><td class="nowrap">09:49 AM</td><td>0.615</td><td>0.675</td><td>0.555</td></tr>]
What's the right way to retrieve this table?
Thank you.
The data is fetched by an AJAX request after the page loads. By inspecting the page you can find the API URL (the one used below), and then you can use something like this:
import pandas as pd
import requests
import json
response = requests.get('https://api.sgx.com/securities/v1.1?excludetypes=bonds&params=nc%2Cadjusted-vwap%2Cb%2Cbv%2Cp%2Cc%2Cchange_vs_pc%2Cchange_vs_pc_percentage%2Ccx%2Ccn%2Cdp%2Cdpc%2Cdu%2Ced%2Cfn%2Ch%2Ciiv%2Ciopv%2Clt%2Cl%2Co%2Cp_%2Cpv%2Cptd%2Cs%2Csv%2Ctrading_time%2Cv_%2Cv%2Cvl%2Cvwap%2Cvwap-currency')
data = json.loads(response.content)["data"]["prices"]
df = pd.DataFrame(data)
print(df)
If your requirements are complex and you crawl on a regular basis, I recommend using Scrapy.
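The JSON handling in the snippet above can be tried offline. The sketch below parses a made-up payload with the same "data" -> "prices" nesting and falls back to an empty list when those keys are absent. The extract_prices helper and the field names inside each entry are hypothetical, not the real SGX schema.

```python
import json

# Made-up payload with the same "data" -> "prices" nesting as in the answer above.
# The field names inside each price entry are hypothetical.
sample_body = b'{"data": {"prices": [{"n": "Stock A", "lt": 0.615}, {"n": "Stock B", "lt": 1.2}]}}'


def extract_prices(body):
    """Return the list under data.prices, or an empty list if those keys are missing."""
    payload = json.loads(body)
    return payload.get("data", {}).get("prices", [])


prices = extract_prices(sample_body)
names = [row["n"] for row in prices]
print(names)

# A response without the expected keys yields an empty list instead of a KeyError.
print(extract_prices(b'{"error": "bad request"}'))
```

Guarding the lookup this way makes it easier to notice when the API changes shape or rejects the request, rather than failing inside the DataFrame construction.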
I am using Pandas to parse the data from the following page: http://kenpom.com/index.php?y=2014
To get the data, I am writing:
dfs = pd.read_html(url)
The data looks great and is perfectly parsed, except that it only takes data from the first 40 rows. It seems to be a problem with how the tables are separated, which keeps pandas from getting all the information.
How do you get pandas to get all the data from all the tables on that webpage?
The HTML of the page you posted has multiple <thead> and <tbody> tags, which confuses pandas.read_html.
Following this SO thread you can manually unwrap those tags:
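import urllib.request
from io import StringIO
import pandas as pd
from bs4 import BeautifulSoup
url = "http://kenpom.com/index.php?y=2014"
html_table = urllib.request.urlopen(url).read()
# fix HTML
soup = BeautifulSoup(html_table, "html.parser")
# warning: the id "ratings-table" is specific to this page
for table in soup.findChildren(attrs={'id': 'ratings-table'}):
    # copy the children into a list first, because unwrap() modifies the tree
    for c in list(table.children):
        if c.name in ['tbody', 'thead']:
            c.unwrap()
# wrap the HTML in StringIO so read_html treats it as file-like input
df = pd.read_html(StringIO(str(soup)), flavor="bs4")
len(df[0])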
which returns 369.
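To see why the wrappers do not matter once they are ignored, here is a standard-library sketch that gathers every <tr> regardless of how many <thead> and <tbody> sections surround it. It is not how pandas parses tables internally; the RowCollector class and the sample table are made up for illustration.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collects cell text grouped by <tr>, ignoring <thead>/<tbody> wrappers."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text fragments of the cell being read, or None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


# Made-up table that repeats its header section, as described above.
sample_html = """
<table id="ratings-table">
  <thead><tr><th>Rank</th><th>Team</th></tr></thead>
  <tbody><tr><td>1</td><td>Team A</td></tr></tbody>
  <thead><tr><th>Rank</th><th>Team</th></tr></thead>
  <tbody><tr><td>2</td><td>Team B</td></tr></tbody>
</table>
"""

collector = RowCollector()
collector.feed(sample_html)
collector.close()

header = collector.rows[0]
# Drop the repeated header rows and keep only the data rows.
body = [row for row in collector.rows if row != header]
print(len(collector.rows), body)
```

The same cleanup step (dropping repeated header rows) may also be needed on the DataFrame returned by read_html after unwrapping, since the repeated headers then appear as ordinary rows.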