I am using Pandas to parse the data from the following page: http://kenpom.com/index.php?y=2014
To get the data, I am writing:
dfs = pd.read_html(url)
The data looks great and is parsed perfectly, except that it only contains the first 40 rows. The problem seems to be the way the table is split up, which keeps pandas from getting all the information.
How do you get pandas to get all the data from all the tables on that webpage?
The HTML of the page you posted has multiple <thead> and <tbody> tags, which confuses pandas.read_html.
Following this SO thread you can manually unwrap those tags:
import urllib.request
import pandas as pd
from bs4 import BeautifulSoup
html_table = urllib.request.urlopen(url).read()
# fix HTML
soup = BeautifulSoup(html_table, "html.parser")
# Warning: the id "ratings-table" is specific to this page
for table in soup.findChildren(attrs={'id': 'ratings-table'}):
    for c in table.children:
        # drop the <thead>/<tbody> wrappers but keep the rows inside them
        if c.name in ['tbody', 'thead']:
            c.unwrap()
df = pd.read_html(str(soup), flavor="bs4")
len(df[0])
which returns 369.
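To see what the unwrap step does in isolation, here is a minimal sketch on a made-up table. The sample HTML and the helper name `flatten_table_sections` are my own assumptions for illustration and are not taken from the kenpom page. It only flattens the markup; the result could then be handed to pd.read_html as in the answer above.

```python
from bs4 import BeautifulSoup

# Made-up sample: one table whose rows are split across several <thead>/<tbody> pairs.
SAMPLE_HTML = """
<table id="ratings-table">
  <thead><tr><th>Rank</th><th>Team</th></tr></thead>
  <tbody><tr><td>1</td><td>A</td></tr><tr><td>2</td><td>B</td></tr></tbody>
  <thead><tr><th>Rank</th><th>Team</th></tr></thead>
  <tbody><tr><td>3</td><td>C</td></tr></tbody>
</table>
"""


def flatten_table_sections(html, table_id):
    """Return a soup in which the table's <thead>/<tbody> wrappers are removed."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id=table_id)
    # find_all returns a list, so the tree can be modified safely while looping.
    for section in table.find_all(["thead", "tbody"], recursive=False):
        section.unwrap()
    return soup


soup = flatten_table_sections(SAMPLE_HTML, "ratings-table")
table = soup.find("table", id="ratings-table")

# After unwrapping, every <tr> is a direct child of <table>.
direct_rows = table.find_all("tr", recursive=False)
print(len(direct_rows))
```

Repeated header rows survive the unwrap, so you may still want to drop them after parsing.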
Related
I am trying to scrape data from a table on a government website. I have tried watching some tutorials, but so far to no avail (coding dummy over here!!!)
I would like to get a .csv file out of their tables containing the date, the type of event, and the adopted measures, for a project. Here is the website if any of you wants to crack it!
https://www.inspq.qc.ca/covid-19/donnees/ligne-du-temps
!pip install beautifulsoup4
!pip install requests
from bs4 import BeautifulSoup
import requests
url= "https://www.inspq.qc.ca/covid-19/donnees/ligne-du-temps"
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
FIRST_table = soup.find('table', class_ = 'tableau-timeline')
print(FIRST_table)
for timeline in FIRST_table.find_all('tbody'):
    rows= timeline.find_all('tr')
    for row in rows:
        pl_timeline = row.find('td', class_ = 'date').text
        print(pl_timeline)
I was expecting to get the dates in order, and then to use the same for loop to get the other two columns by tweaking it for "Type d'événement" and "Mesures adoptées".
What am I doing wrong? How can I tweak it? (I am using colab if it makes any difference) Thanks in advance
Make your life easier and just use pandas to parse the tables and dump the data to a .csv file.
For example, this gets you all the tables, merges them, and spits out a .csv file:
import pandas as pd
import requests
url = "https://www.inspq.qc.ca/covid-19/donnees/ligne-du-temps"
tables = pd.read_html(requests.get(url).text, flavor="lxml")
# to_csv() returns None when given a path, so keep the merged frame in its own variable
df = pd.concat(tables)
df.to_csv("data.csv", index=False)
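If you want to try the merge-and-write step without hitting the website, here is a small sketch on made-up frames. The column names and values are invented stand-ins for whatever pd.read_html returns. It writes to an in-memory buffer so nothing lands on disk.

```python
import io

import pandas as pd

# Made-up stand-ins for the list of tables that pd.read_html would return.
tables = [
    pd.DataFrame({"Date": ["2020-03-13"], "Type": ["Measure"]}),
    pd.DataFrame({"Date": ["2020-03-14"], "Type": ["Event"]}),
]

# Keep the merged frame in its own variable so it can be reused after writing.
df = pd.concat(tables, ignore_index=True)

# Write to an in-memory buffer here; pass a path such as "data.csv" to write a file.
buffer = io.StringIO()
result = df.to_csv(buffer, index=False)  # returns None because a target was given
csv_text = buffer.getvalue()
print(csv_text)
```

Using ignore_index=True gives the merged frame one continuous index instead of repeating each table's own index.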
I am trying to scrape a table from this url: https://cryptoli.st/lists/fixed-supply
I gather that the table I want is in the div class "dataTables_scroll". I use the following code and it only returns an empty list:
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
url = requests.get("https://cryptoli.st/lists/fixed-supply")
soup = bs(url.content, 'lxml')
table = soup.find_all("div", {"class": "dataTables_scroll"})
print(table)
Any help would be most appreciated.
Thanks!
The reason is that the response you get from requests.get() does not contain the table data.
It might be loaded on the client side (by JavaScript).
What can you do about this? Using a selenium webdriver is a possible solution. You can "wait" until the table is loaded and becomes interactive, then get the page content with selenium and pass that content to bs4 to do the scraping.
You can check the response by writing it to a file:
with open("demofile.html", "w", encoding='utf-8') as f:
    f.write(soup.prettify())
and you will be able to see "...Loading..." where the table is expected.
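A quicker check than reading the dumped file is to count the <tr> tags in the raw response. Below is a rough sketch using only the standard library; the class and function names, and both sample pages, are made up for illustration. Zero rows suggests the table is filled in later by JavaScript.

```python
from html.parser import HTMLParser


class RowCounter(HTMLParser):
    """Count <tr> start tags in a chunk of HTML."""

    def __init__(self):
        super().__init__()
        self.rows = 0

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows += 1


def count_table_rows(html):
    """Return how many <tr> tags the given HTML text contains."""
    counter = RowCounter()
    counter.feed(html)
    counter.close()
    return counter.rows


# Made-up responses: a placeholder that a script fills in later, and a static table.
placeholder_page = '<div class="dataTables_scroll">...Loading...</div>'
static_page = "<table><tr><td>BTC</td></tr><tr><td>LTC</td></tr></table>"

print(count_table_rows(placeholder_page))
print(count_table_rows(static_page))
```

With a real page you would pass the text of the response instead of the sample strings.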
I believe the data is loaded from a script tag. I have to go to work, so I can't spend more time right now working out how to properly recreate a dataframe from the "|" delimited data, but the following may serve as a starting point for others, as it extracts the relevant entries for the table body from the script tag.
import requests, re
import ast
r = requests.get('https://cryptoli.st/lists/fixed-supply').text
s = re.search(r'cl\.coinmainlist\.dataraw = (\[.*?\]);', r, flags = re.S).group(1)
data = ast.literal_eval(s)
data = [i.split('|') for i in data]
print(data)
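As a possible next step, the split entries can be turned into named records. This is only a sketch: the sample script text is made up, and the column names in `COLUMNS` are hypothetical, since the real field order on the page is not known here.

```python
import ast
import csv
import io
import re

# Made-up stand-in for the page source; the real field order is not known here.
SAMPLE_PAGE = """
<script>
cl.coinmainlist.dataraw = ["Bitcoin|BTC|21000000", "Litecoin|LTC|84000000"];
</script>
"""

# Hypothetical column names, chosen only for this illustration.
COLUMNS = ["name", "symbol", "max_supply"]

# Same pattern as in the answer above, applied to the sample text.
match = re.search(r"cl\.coinmainlist\.dataraw = (\[.*?\]);", SAMPLE_PAGE, flags=re.S)
entries = ast.literal_eval(match.group(1))

# Pair each "|"-separated value with a column name.
rows = [dict(zip(COLUMNS, entry.split("|"))) for entry in entries]

# Write the records as CSV text to an in-memory buffer.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

The same `rows` list could also be passed to pd.DataFrame(rows) if you prefer a dataframe.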
Hello, I am new to web scraping and I have a problem. I want to scrape data from this HTML code:
I want the data that sits inside the <tr> .. </tr> tags.
My code is shown below:
from bs4 import BeautifulSoup
import requests
html_text = requests.get('https://www.basketball-reference.com/leagues/').text
soup = BeautifulSoup(html_text, 'lxml')
rows = soup.select('tr[data-row]')
print(rows)
I was inspired by this thread, but it's returning an empty array. Can anyone help me with this?
Like I said in the comment, it looks as if the attribute data-row is being added at the client side - I couldn't find it in the HTML.
A quick and easy way to fix this would be to change your CSS selector. I came up with something like this:
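rows = soup.select('tr')
for row in rows:
    th = row.th
    # skip rows without a header cell; .get() avoids a KeyError if data-stat is missing
    if th is not None and th.get('data-stat') == 'season' and 'scope' in th.attrs:
        print(row)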
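The same filter can also be written as a single CSS selector on the header cell. Here is a sketch on made-up markup; the sample table is not copied from the real page, and matching on scope="row" specifically is my assumption about which rows you want.

```python
from bs4 import BeautifulSoup

# Made-up markup loosely modelled on a season table; not copied from the real page.
SAMPLE_HTML = """
<table>
  <tr><th data-stat="season" scope="col">Season</th><th data-stat="lg_id" scope="col">Lg</th></tr>
  <tr><th data-stat="season" scope="row">2020-21</th><td data-stat="lg_id">NBA</td></tr>
  <tr><th data-stat="season" scope="row">2019-20</th><td data-stat="lg_id">NBA</td></tr>
  <tr><td>row without a header cell</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Match the header cells of data rows, then step up to the enclosing <tr>.
season_cells = soup.select('th[data-stat="season"][scope="row"]')
rows = [cell.parent for cell in season_cells]
seasons = [cell.get_text(strip=True) for cell in season_cells]
print(seasons)
```

Selecting on the cell and walking up to its parent avoids touching rows that have no th at all.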
How about using pandas to make your web-scraping life (a bit) easier?
Here's how:
import pandas as pd
import requests
df = pd.read_html(requests.get('https://www.basketball-reference.com/leagues/').text, flavor="bs4")
df = pd.concat(df)
df.to_csv("basketball_table.csv", index=False)
I'm trying to scrape data from this table
and here's the code I'm using
## Scraping data for schools
from urllib.request import urlopen
from bs4 import BeautifulSoup
#List of schools
page=urlopen('https://mcss.knack.com/school-districts#all-school-contacts/')
soup = BeautifulSoup(page,'html.parser')
School=[]
Address=[]
Phone=[]
Principal=[]
Email=[]
District=[]
# Indexing rows and then identifying cells
for rows in soup.findAll('tr'):
    cells = rows.findAll('td')
    if len(cells)==7:
        School.append(soup.find("span", {'class':'col-0'}).text)
        Address.append(soup.find("span", {'class':'col-1'}).text)
        Phone.append(soup.find("span", {'class':'col-2'}).text)
        Principal.append(soup.find("span", {'class':'col-3'}).text)
        Email.append(soup.find("span", {'class':'col-4'}).text)
        District.append(soup.find("span", {'class':'col-5'}).text)
import pandas as pd
school_frame = pd.DataFrame({'School' : School,
                             'Address' : Address,
                             'Phone':Phone,
                             'Principal':Principal,
                             'Email':Email,
                             'District':District
                             })
school_frame.head()
#school_frame.to_csv('school_address.csv')
And in return I'm getting only the column header names of the data frame.
What am I doing wrong?
When you check the actual value of page, you will see that it does not contain any table but an empty div which will later be filled by javascript dynamically. urllib.request does not run the javascript and just returns an empty page with no table to you. You could use selenium to emulate a browser (which runs javascript) and then fetch the resulting html of that website (see this stackoverflow answer for an example).
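Once you do have the rendered HTML, one more thing to watch in the question's loop: soup.find(...) searches the whole document, so it returns the first match on every pass. Looking the cells up on the current row avoids that. A minimal sketch on made-up markup follows; the class names mirror the ones in the question, everything else is invented.

```python
from bs4 import BeautifulSoup

# Made-up rendered markup; the class names mirror the ones used in the question.
RENDERED_HTML = """
<table>
  <tr><th>School</th><th>Phone</th></tr>
  <tr><td><span class="col-0">North School</span></td><td><span class="col-2">555-0100</span></td></tr>
  <tr><td><span class="col-0">South School</span></td><td><span class="col-2">555-0101</span></td></tr>
</table>
"""

soup = BeautifulSoup(RENDERED_HTML, "html.parser")

records = []
for row in soup.find_all("tr"):
    # Search inside the current row, not the whole document.
    school = row.find("span", {"class": "col-0"})
    phone = row.find("span", {"class": "col-2"})
    # Skip rows (for example the header row) that lack the expected cells.
    if school is None or phone is None:
        continue
    records.append({"School": school.text, "Phone": phone.text})

print(records)
```

A list of dicts like `records` can be passed straight to pd.DataFrame.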
I am trying to get a table from the SGX website.
The page is saved to a local drive and I am using BeautifulSoup to parse it:
soup = BeautifulSoup(open(pages), "lxml")
soup.prettify()
list_0 = soup.find_all('table')[0]
print(list_0)
What it returned is not the first row on the page:
[<tr><td>Zhongmin Baihui</td><td>5SR</td><td class="nowrap">09:44 AM</td><td class="nowrap">09:49 AM</td><td>0.615</td><td>0.675</td><td>0.555</td></tr>]
What's the right way to retrieve this table?
Thank you.
The data is fetched with an AJAX request after the page loads. By inspecting the page you can find the API URL (the URL below), and then you can use something like this:
import pandas as pd
import requests
import json
response = requests.get('https://api.sgx.com/securities/v1.1?excludetypes=bonds&params=nc%2Cadjusted-vwap%2Cb%2Cbv%2Cp%2Cc%2Cchange_vs_pc%2Cchange_vs_pc_percentage%2Ccx%2Ccn%2Cdp%2Cdpc%2Cdu%2Ced%2Cfn%2Ch%2Ciiv%2Ciopv%2Clt%2Cl%2Co%2Cp_%2Cpv%2Cptd%2Cs%2Csv%2Ctrading_time%2Cv_%2Cv%2Cvl%2Cvwap%2Cvwap-currency')
data = json.loads(response.content)["data"]["prices"]
df = pd.DataFrame(data)
print(df)
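Since the layout of an API response can change, it may help to pull the nested list out defensively before building the dataframe. This is a sketch on a made-up payload: it assumes the same data -> prices nesting as above, the helper name `extract_prices` is hypothetical, and the field names inside each record are placeholders.

```python
import json

# Made-up payload with the same nesting as in the answer: data -> prices -> records.
# The field names inside each record are placeholders.
SAMPLE_BODY = '{"data": {"prices": [{"nc": "5SR", "lt": 0.615}, {"nc": "D05", "lt": 30.1}]}}'


def extract_prices(body):
    """Return the list under data -> prices, or an empty list if it is missing."""
    payload = json.loads(body)
    data = payload.get("data") or {}
    return data.get("prices") or []


prices = extract_prices(SAMPLE_BODY)
print(len(prices))
```

An empty list then simply gives an empty dataframe instead of a KeyError.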
If your requirements are complex and you crawl on a regular basis, I recommend using scrapy.