Scraping data - attributes from a web page - python

Hello, I am new to web scraping and I have a problem. I want to scrape data from this HTML code:
I want the data inside the
<tr> .. </tr>
tags.
My code is shown below:
from bs4 import BeautifulSoup
import requests
html_text = requests.get('https://www.basketball-reference.com/leagues/').text
soup = BeautifulSoup(html_text, 'lxml')
rows = soup.select('tr[data-row]')
print(rows)
I was inspired by this thread, but my code returns an empty list. Can anyone help me with this?

Like I said in the comment, it looks as if the attribute data-row is being added at the client side - I couldn't find it in the HTML.
A quick and easy way to fix this would be to change your CSS selector. I came up with something like this:
rows = soup.select('tr')
for row in rows:
    th = row.th
    # Skip rows without a <th>; .get() avoids a KeyError when data-stat is missing
    if th is not None and th.get('data-stat') == 'season' and 'scope' in th.attrs:
        print(row)
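To see why the original selector comes back empty while a selector built on server-side attributes works, here is a small offline sketch. The HTML snippet is invented for illustration (it only mimics the data-stat and scope attributes mentioned above), and html.parser is used so that no extra parser has to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: it mimics what the server might send, with no
# data-row attribute because that one would be added later by JavaScript.
SAMPLE_HTML = """
<table>
  <tr><th data-stat="season">Season</th><th data-stat="lg_id">Lg</th></tr>
  <tr><th scope="row" data-stat="season">2020-21</th><td data-stat="lg_id">NBA</td></tr>
  <tr><th scope="row" data-stat="season">2019-20</th><td data-stat="lg_id">NBA</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# The attribute is absent from the raw HTML, so this matches nothing.
missing = soup.select("tr[data-row]")

# Select rows whose direct <th> child carries both attributes instead.
season_rows = soup.select('tr:has(> th[data-stat="season"][scope])')
seasons = [row.th.get_text(strip=True) for row in season_rows]

print(len(missing))
print(seasons)
```

The :has() selector does the same filtering as the loop above in one step; whether the real page uses exactly these attributes is something to confirm against the downloaded HTML.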

How about using pandas to make your web-scraping life (a bit) easier?
Here's how:
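import pandas as pd
import requests
from io import StringIO
html = requests.get('https://www.basketball-reference.com/leagues/').text
# read_html returns a list of DataFrames, one per <table> it finds;
# wrapping the text in StringIO avoids the literal-HTML deprecation warning in newer pandas
tables = pd.read_html(StringIO(html), flavor="bs4")
df = pd.concat(tables)
df.to_csv("basketball_table.csv", index=False)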

Related

Scraping a table from html using python and beautifulsoup

I am trying to scrape data from a table on a government website. I have tried watching some tutorials, but so far to no avail (coding dummy over here!!!).
I would like to get a .csv file out of the tables they have, containing the date, the type of event, and the adopted measures for a project. I'll leave the website here in case any of you wants to crack it!
https://www.inspq.qc.ca/covid-19/donnees/ligne-du-temps
!pip install beautifulsoup4
!pip install requests
from bs4 import BeautifulSoup
import requests
url= "https://www.inspq.qc.ca/covid-19/donnees/ligne-du-temps"
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
FIRST_table = soup.find('table', class_ = 'tableau-timeline')
print(FIRST_table)
for timeline in FIRST_table.find_all('tbody'):
    rows= timeline.find_all('tr')
    for row in rows:
        pl_timeline = row.find('td', class_ = 'date').text
        print(pl_timeline)
I was expecting to get the dates in order, and to use the same for loop to also get the other two columns by tweaking it for "Type d'événement" and "Mesures adoptées".
What am I doing wrong? How can I tweak it? (I am using Colab, if it makes any difference.) Thanks in advance.
Make your life easier and just use pandas to parse the tables and dump the data to a .csv file.
For example, this gets you all the tables, merges them, and spits out a .csv file:
import pandas as pd
import requests
url = "https://www.inspq.qc.ca/covid-19/donnees/ligne-du-temps"
tables = pd.read_html(requests.get(url).text, flavor="lxml")
# to_csv returns None when given a path, so there is nothing useful to assign
pd.concat(tables).to_csv("data.csv", index=False)
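If you would rather stay with BeautifulSoup, as the question tried to, the loop can collect every cell of a row at once instead of looking up one class at a time. This is only a sketch on made-up HTML: the tableau-timeline and date class names come from the question, while the three-cell row layout and the placeholder values are assumptions about the page.

```python
import csv
import io

from bs4 import BeautifulSoup

# Invented sample standing in for the real page.
SAMPLE_HTML = """
<table class="tableau-timeline">
  <tbody>
    <tr><td class="date">2020-01-01</td><td>Event A</td><td>Measure A</td></tr>
    <tr><td class="date">2020-01-02</td><td>Event B</td><td>Measure B</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

records = []
for table in soup.find_all("table", class_="tableau-timeline"):
    for row in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if len(cells) == 3:  # skip header rows or rows with an unexpected shape
            records.append(cells)

# Write to an in-memory buffer here; a real run would use
# open("data.csv", "w", newline="", encoding="utf-8") instead.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Date", "Type d'événement", "Mesures adoptées"])
writer.writerows(records)
print(buffer.getvalue())
```

Looping over find_all("table", ...) rather than a single find() also picks up every timeline table on the page, which is roughly what the pandas version does with pd.concat.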

bs4 - How to extract table data from website?

Here is the link,
https://www.vit.org/WebReports/vesselschedule.aspx
I'm using BeautifulSoup and my goal is to extract the table from it.
I wrote this code:
from bs4 import BeautifulSoup
import requests
import pandas as pd
url="https://www.vit.org/WebReports/vesselschedule.aspx"
html_content = requests.get(url).text
soup = BeautifulSoup(html_content, "lxml")
gdp_table = soup.find("table", attrs={"id": "ctl00_ContentPlaceHolder1_VesselScheduleControl1_Grid1"})
The final line of code gives me None instead of the table.
I'm new to web scraping. Can you help me find a way to get the table?
Why not pd.read_html(url)?
It will extract tables automatically
The problem is that the id you are using to look up this table is added to the element dynamically via JavaScript. The requests library only downloads the document at the URL and does not run scripts, so nothing is added dynamically, and as a result your table has no id.
If you encounter a similar error in the future (the element exists but bs4 can't find it), try saving the response text to an HTML file and inspecting it in your browser.
For your particular case, this code could be used:
import requests
from bs4 import BeautifulSoup
resp = requests.get("https://www.vit.org/WebReports/vesselschedule.aspx")
# Save the raw response so it can be inspected in a browser
with open("tmp.html", "w", encoding="utf-8") as f:
    f.write(resp.text)
bs = BeautifulSoup(resp.text, "html.parser")
table = bs.find_all("table")[6] # not the best way to select elements
rows = table.find_all("tr")
Warning: Try to avoid this style of positional selection. Web pages are updated constantly, and such code may produce errors in the future.
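One way to avoid a positional index like [6] is to pick the table by something it contains. Here is a rough sketch on invented nested tables; the helper name find_table_with_text and the sample markup are hypothetical, and "Shipline" is assumed to be a header cell of the table you want.

```python
from bs4 import BeautifulSoup

# Invented layout: an outer layout table wrapping a menu and the data table.
SAMPLE_HTML = """
<table><tr><td>
  <table><tr><td>Menu</td></tr></table>
  <table>
    <tr class="HeadingRow"><td>Vessel</td><td>Shipline</td></tr>
    <tr class="Row"><td>Example One</td><td>ABC</td></tr>
  </table>
</td></tr></table>
"""


def find_table_with_text(soup, text):
    """Return the innermost <table> whose text contains `text`, or None."""
    for table in soup.find_all("table"):
        has_nested_table = table.find("table") is not None
        if not has_nested_table and text in table.get_text():
            return table
    return None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = find_table_with_text(soup, "Shipline")
rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in table.find_all("tr")
]
print(rows)
```

Skipping tables that contain another table matters here: the outer layout table also contains the word "Shipline", so without that check the first match would be the wrapper.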
I parsed the table, put the cells of each row in a list, and appended that list to the data list.
And here you go!
I also put the complete list in [Hashbin].
from bs4 import BeautifulSoup
import requests
url="https://www.vit.org/WebReports/vesselschedule.aspx"
soup = BeautifulSoup(requests.get(url).text, "html.parser")
table = soup.find_all('table')[6] # not the best way, as darkKnight pointed out
rows = table.find_all('tr')
data = []
for row in rows:
    # collect the stripped text of every cell in this row
    cols = [ele.text.strip() for ele in row.find_all('td')]
    data.append(cols)
print(data)
from bs4 import BeautifulSoup
import requests
res=requests.get("https://www.vit.org/WebReports/vesselschedule.aspx")
soup=BeautifulSoup(res.text,"html.parser")
Find the columns with the code below:
table=soup.find_all("table")[6]
columns=[col.get_text(strip=True) for col in table.find("tr",class_="HeadingRow").find_all("td")[1:-1]]
Find the row data with the code below:
main_lst=[]
for row in table.find_all("tr",class_="Row"):
    lst=[i.get_text(strip=True) for i in row.find_all("td")[1:-1]]
    main_lst.append(lst)
Create the table using pandas:
import pandas as pd
df=pd.DataFrame(columns=columns,data=main_lst)
df
You need a way to specify a pattern that uniquely identifies the target table, given the nested tabular structure. The following CSS pattern grabs that table based on a string it contains ("Shipline"), the absence of an attribute (style), and the table's relationship to other elements within the DOM.
You can then pass that specific table to read_html and do some cleaning on the returned DataFrame.
import requests
from bs4 import BeautifulSoup as bs
from pandas import read_html as rh
r = requests.get('https://www.vit.org/WebReports/vesselschedule.aspx').text
soup = bs(r, 'lxml')
df = rh(str(soup.select_one('table table:not([style]):-soup-contains("Shipline")')))[0] #earlier soupsieve version use :contains
df.dropna(how='all', axis = 1, inplace = True)
df.columns = df.iloc[0, :]
df = df.iloc[1:, :]
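The three clean-up lines are easier to follow on a tiny hand-built frame. The values below are invented and only imitate what read_html might return for such a table: header text sitting in row 0, plus an empty column left over from a spacer cell.

```python
import pandas as pd

# Stand-in for a freshly parsed table (all values are placeholders).
raw = pd.DataFrame(
    [
        [None, "Vessel", "Shipline"],
        [None, "Example One", "ABC"],
        [None, "Example Two", "XYZ"],
    ]
)

df = raw.dropna(how="all", axis=1)          # drop columns that are entirely empty
df.columns = df.iloc[0, :]                  # promote the first row to the header
df = df.iloc[1:, :].reset_index(drop=True)  # keep only the data rows
df.columns.name = None                      # drop the axis label left by the promoted row
print(df)
```

The reset_index and columns.name lines are optional tidying added for this sketch; they are not part of the answer above.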

Finding tables returns [] with bs4

I am trying to scrape a table from this url: https://cryptoli.st/lists/fixed-supply
I gather that the table I want is in the div class "dataTables_scroll". I use the following code and it only returns an empty list:
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
url = requests.get("https://cryptoli.st/lists/fixed-supply")
soup = bs(url.content, 'lxml')
table = soup.find_all("div", {"class": "dataTables_scroll"})
print(table)
Any help would be most appreciated.
Thanks!
The reason is that the response you get from requests.get() does not contain the table data.
It is probably loaded on the client side (by JavaScript).
What can you do about this? Using a Selenium WebDriver is one possible solution. You can wait until the table is loaded and becomes interactive, then get the page source with Selenium and pass it to bs4 to do the scraping.
You can check the response by writing it to a file:
with open("demofile.html", "w", encoding='utf-8') as f:
    f.write(soup.prettify())
and you will be able to see "...Loading..." where the table is expected.
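A quicker check than opening the file is to search the downloaded text for a marker you expect, such as the class name from the question. A minimal sketch follows; the helper name is_in_raw_html and the sample response body are made up for illustration.

```python
def is_in_raw_html(html, marker):
    """Return True if `marker` occurs in the HTML the server actually sent."""
    return marker in html


# Invented response body: a placeholder where the table would later be injected.
SAMPLE_RESPONSE = '<div id="app"><p>...Loading...</p></div>'

checks = {
    marker: is_in_raw_html(SAMPLE_RESPONSE, marker)
    for marker in ("dataTables_scroll", "Loading")
}
print(checks)
```

With a real page you would pass the text of the requests response instead of SAMPLE_RESPONSE. If the class name is missing but a loading placeholder is present, the element is most likely created by JavaScript.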
I believe the data is loaded from a script tag. I have to go to work, so I can't spend more time right now working out how to properly recreate a DataFrame from the "|"-delimited data, but the following may serve as a starting point for others: it extracts the relevant entries for the table body from the script tag.
import requests, re
import ast
r = requests.get('https://cryptoli.st/lists/fixed-supply').text
s = re.search(r'cl\.coinmainlist\.dataraw = (\[.*?\]);', r, flags = re.S).group(1)
data = ast.literal_eval(s)
data = [i.split('|') for i in data]
print(data)
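To go from the split strings to labelled records, each row can be zipped with a list of column names. The sketch below runs on an invented page snippet, and the column names are hypothetical: the real field order in the "|"-delimited strings has to be checked by hand against the site.

```python
import ast
import re

# Invented stand-in for the page source; the real script tag is assumed to
# assign a list of "|"-delimited strings in the same way.
SAMPLE_PAGE = """
<script>
cl.coinmainlist.dataraw = ['Coin A|AAA|1000', 'Coin B|BBB|2500'];
</script>
"""

# Hypothetical column names, chosen only for this example.
COLUMNS = ["name", "symbol", "supply"]

match = re.search(r"cl\.coinmainlist\.dataraw = (\[.*?\]);", SAMPLE_PAGE, flags=re.S)
if match is None:
    raise ValueError("data array not found in page source")

raw_rows = ast.literal_eval(match.group(1))
records = [dict(zip(COLUMNS, row.split("|"))) for row in raw_rows]
print(records)
```

A list of dicts like this can be passed straight to pd.DataFrame(records). Note that every value is still a string at this point, so numeric columns would need converting.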

Parsing a table on webpage using BeautifulSoup

I am trying to get a table from the SGX website.
The page is saved to a local drive and I am using BeautifulSoup to parse it:
from bs4 import BeautifulSoup
# pages holds the path to the locally saved HTML file
soup = BeautifulSoup(open(pages), "lxml")
soup.prettify()
list_0 = soup.find_all('table')[0]
print(list_0)
What it returned is not the first row on the page:
[<tr><td>Zhongmin Baihui</td><td>5SR</td><td class="nowrap">09:44 AM</td><td class="nowrap">09:49 AM</td><td>0.615</td><td>0.675</td><td>0.555</td></tr>]
What's the right way to retrieve this table?
Thank you.
The data is fetched by an AJAX request after the page loads. By inspecting the page you can find the API URL (the URL below), and then you can use something like this:
import pandas as pd
import requests
import json
response = requests.get('https://api.sgx.com/securities/v1.1?excludetypes=bonds&params=nc%2Cadjusted-vwap%2Cb%2Cbv%2Cp%2Cc%2Cchange_vs_pc%2Cchange_vs_pc_percentage%2Ccx%2Ccn%2Cdp%2Cdpc%2Cdu%2Ced%2Cfn%2Ch%2Ciiv%2Ciopv%2Clt%2Cl%2Co%2Cp_%2Cpv%2Cptd%2Cs%2Csv%2Ctrading_time%2Cv_%2Cv%2Cvl%2Cvwap%2Cvwap-currency')
data = json.loads(response.content)["data"]["prices"]
df = pd.DataFrame(data)
print(df)
If your requirements are complex and you crawl on a regular basis, I recommend using Scrapy.
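If you only need a few of the fields the API returns, they can be picked out before building the DataFrame. This sketch works on an invented payload that copies only the {"data": {"prices": [...]}} nesting used above; the helper name extract_prices and the sample values are placeholders, and the field names are borrowed from the params in the URL without any claim about what they mean.

```python
import json

# Invented payload mirroring only the nesting of the real response.
SAMPLE_BODY = """
{"data": {"prices": [
    {"nc": "ABC", "lt": 1.25, "vl": 1000},
    {"nc": "XYZ", "lt": 0.62}
]}}
"""


def extract_prices(body, fields):
    """Return one dict per price entry, keeping only `fields` (missing -> None)."""
    payload = json.loads(body)
    prices = payload.get("data", {}).get("prices", [])
    return [{field: entry.get(field) for field in fields} for entry in prices]


rows = extract_prices(SAMPLE_BODY, ["nc", "lt", "vl"])
print(rows)
```

Using .get() at each level means an unexpected or empty response gives an empty list rather than a KeyError, which is handy if the API changes shape.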

Use Pandas to Get Multiple Tables From Webpage

I am using Pandas to parse the data from the following page: http://kenpom.com/index.php?y=2014
To get the data, I am writing:
dfs = pd.read_html(url)
The data looks great and is perfectly parsed, except that it only takes data from the first 40 rows. It seems to be a problem with how the tables are separated, so that pandas does not get all the information.
How do you get pandas to get all the data from all the tables on that webpage?
The HTML of the page you posted has multiple <thead> and <tbody> tags, which confuses pandas.read_html.
Following this SO thread you can manually unwrap those tags:
import urllib.request
import pandas as pd
from bs4 import BeautifulSoup
url = "http://kenpom.com/index.php?y=2014"
html_table = urllib.request.urlopen(url).read()
# fix HTML
soup = BeautifulSoup(html_table, "html.parser")
# warning: the id ratings-table is specific to this page
for table in soup.find_all(attrs={'id': 'ratings-table'}):
    # iterate over a copy, because unwrap() modifies the tree in place
    for c in list(table.children):
        if c.name in ['tbody', 'thead']:
            c.unwrap()
df = pd.read_html(str(soup), flavor="bs4")
len(df[0])
which returns 369.
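The effect of unwrapping can be checked offline on a small table. The markup below is invented and only imitates the repeated <thead>/<tbody> structure described above; the id is reused from the answer.

```python
from bs4 import BeautifulSoup

# Invented table that repeats <thead> and <tbody>.
SAMPLE_HTML = """
<table id="ratings-table">
<thead><tr><th>Rank</th><th>Team</th></tr></thead>
<tbody><tr><td>1</td><td>Team A</td></tr></tbody>
<thead><tr><th>Rank</th><th>Team</th></tr></thead>
<tbody><tr><td>2</td><td>Team B</td></tr></tbody>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.find("table", id="ratings-table")

# find_all returns a list, so unwrapping inside the loop is safe.
for section in table.find_all(["thead", "tbody"], recursive=False):
    section.unwrap()

# Every <tr> is now a direct child of <table>.
direct_rows = table.find_all("tr", recursive=False)
data_rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in direct_rows
    if tr.find("td") is not None  # skip the repeated header rows
]
print(len(direct_rows), data_rows)
```

After unwrapping, the repeated header rows are still in the table, so a DataFrame built from it may need them filtered out, as data_rows does here.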
