Get data from list using BeautifulSoup in Python

I have the webpage- https://dmesupplyusa.com/mobility/bariatric-rollator-with-8-wheels.html
There is a list of specifications under Details that I want to extract as a table, i.e. the specification category as the header and the specification value as the next row. How can I do this in Python using BeautifulSoup?

import requests
import pandas as pd
from bs4 import BeautifulSoup as bs
page = requests.get("https://dmesupplyusa.com/mobility/bariatric-rollator-with-8-wheels.html").content #Read Page source
page = bs(page, "html.parser") # Create BeautifulSoup object
data = page.find_all('strong', string="Product Specifications")[0].find_next('ul').text.strip().split('\n') # Extract required information
data = dict((k.strip(), v.strip()) for k, v in (i.split(":", 1) for i in data)) # Split each "category: value" line once
df = pd.DataFrame([data]) # One-row table: categories as headers, values as the row
I hope this is what you are looking for.
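As a sanity check, the split-and-frame step can be tried offline on a few sample lines (the strings below are made up; the real page's categories may differ):

```python
import pandas as pd

# Hypothetical lines as produced by splitting the <ul> text on "\n"
lines = ["Weight Capacity: 500 lbs", "Width: 26 in", "Wheels: 8 x 8 in"]

# Split each "category: value" line once and strip the whitespace
spec = dict(
    (k.strip(), v.strip()) for k, v in (line.split(":", 1) for line in lines)
)

# One-row table: specification categories as headers, values as the row
df = pd.DataFrame([spec])
print(df)
```

Splitting with `split(":", 1)` keeps values like "8 x 8 in" intact even if they ever contain a colon themselves.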

Related

get side panel info - wikipedia

I am looking for a way to get an information from a side panel of wikipedia website, example:
https://en.wikipedia.org/wiki/Netflix
(the panel on the right)
I was trying using this:
import wikipedia
xx = wikipedia.page("Netflix")
xx.title
print(xx.content)
It does return the whole page content, but not the side panel.
I was trying to avoid the BeautifulSoup package, but I'm not sure that's possible. Any ideas?
This is how to approach the problem with BeautifulSoup:
import requests
from bs4 import BeautifulSoup
import pandas as pd
url_wiki = 'https://en.wikipedia.org/wiki/Netflix'
response = requests.get(url_wiki)
# parse data from the html into a beautifulsoup object
soup = BeautifulSoup(response.text, 'html.parser')
side_panel = soup.find('table', {'class': 'infobox'})
# read_html returns a list of DataFrames; take the first one
df = pd.read_html(str(side_panel))[0]
print(df)

How to download a table from Wikipedia as a Pandas DataFrame in Python?

I would like to download the table from this link as a Pandas DataFrame in Jupyter Lab: https://pl.wikisource.org/wiki/Polskie_powiaty_wed%C5%82ug_kodu_TERYT
There is only one table and it is not complicated. How can I do that in Python?
Type 1:
Just use the pandas method pd.read_html and extract whichever DataFrame you want from the returned list.
import pandas as pd
res=pd.read_html("https://pl.wikisource.org/wiki/Polskie_powiaty_wed%C5%82ug_kodu_TERYT")
df=res[3]
Type 2:
You can use both the requests and bs4 modules to find the table and pass its data to the pandas method:
import requests
import pandas as pd
from bs4 import BeautifulSoup
res=requests.get("https://pl.wikisource.org/wiki/Polskie_powiaty_wed%C5%82ug_kodu_TERYT")
soup=BeautifulSoup(res.text,"html.parser")
data=soup.find_all("table")[3]
df=pd.read_html(str(data))
df[0]
Output:
         Nazwa powiatu  TERYT
0       aleksandrowski  04 01
1          augustowski  20 01
..                 ...    ...
You need to fetch the HTML using the requests library, then search the tags with a parsing library (I use BeautifulSoup).
The code is similar to this:
import requests
from bs4 import BeautifulSoup
URL = "https://pl.wikisource.org/wiki/Polskie_powiaty_wed%C5%82ug_kodu_TERYT"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
results = soup.find("div", {"id":"mw-content-text"}).find("table",{"border":1}).find_all("td")
namelist = [results[i].text for i in range(0,len(results),2)]
numberlist = [results[i].text.strip('\n') for i in range(1,len(results),2)]
Each cell's text is returned as a string, collected here into two lists; it's very simple to convert them to pandas afterwards.
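As a sketch of that last conversion, assuming the two lists line up pairwise (the sample values below mirror the expected output rather than being fetched live):

```python
import pandas as pd

# Sample parallel lists standing in for the scraped <td> texts
namelist = ["aleksandrowski", "augustowski"]
numberlist = ["04 01", "20 01"]

# Zip the parallel lists into a two-column DataFrame
df = pd.DataFrame({"Nazwa powiatu": namelist, "TERYT": numberlist})
print(df)
```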

bs4 - How to extract table data from website?

Here is the link,
https://www.vit.org/WebReports/vesselschedule.aspx
I'm using BeautifulSoup and my goal was to extract the table from it.
I wrote the code..
from bs4 import BeautifulSoup
import requests
import pandas as pd
url="https://www.vit.org/WebReports/vesselschedule.aspx"
html_content = requests.get(url).text
soup = BeautifulSoup(html_content, "lxml")
gdp_table = soup.find("table", attrs={"id": "ctl00_ContentPlaceHolder1_VesselScheduleControl1_Grid1"})
The final line of code returned None instead of the table.
I'm new to web scraping; can you help me find a solution to get the table?
Why not pd.read_html(url)?
It will extract tables automatically
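A minimal offline sketch of what read_html does (the HTML below is a made-up stand-in; on the real page you would pick the schedule out of the returned list by index):

```python
import pandas as pd
from io import StringIO

# A stand-in table; pd.read_html parses every <table> it finds
html = """
<table>
  <tr><td>Shipline</td><td>Vessel</td></tr>
  <tr><td>MSC</td><td>EXAMPLE SHIP</td></tr>
</table>
"""

tables = pd.read_html(StringIO(html), header=0)  # returns a list of DataFrames
df = tables[0]
print(df)
```

Note that read_html only sees tables present in the downloaded HTML, so it does not solve the dynamic-id issue by itself, but it spares you hand-rolling the row loop.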
The problem is that the id you are using to look up this table is appended to the element dynamically via JavaScript. Since the requests library only downloads what is served at the URL, nothing is appended dynamically, and as a result your table has no id.
If you encounter a similar error in the future (the element exists in the browser but bs4 can't find it), try saving the response text to an HTML file and inspecting it in your browser.
For your particular case this code could be used:
import requests
from bs4 import BeautifulSoup
resp = requests.get("https://www.vit.org/WebReports/vesselschedule.aspx")
with open("tmp.html", "w") as f:
    f.write(resp.text)
bs = BeautifulSoup(resp.text, "html.parser")
table = bs.find_all("table")[6]  # not the best way to select elements
rows = table.find_all("tr")
Warning: try to avoid this style of positional selection; web pages are updated constantly, and such code may produce errors in the future.
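As a sketch of the alternative, here is matching on content the table is known to contain instead of its position (the HTML is a made-up miniature of the page):

```python
from bs4 import BeautifulSoup

# Miniature stand-in: tables[6]-style indexing breaks when the page layout
# changes, but matching on a stable marker such as the "HeadingRow" class holds
html = """
<table><tr><td>navigation</td></tr></table>
<table>
  <tr class="HeadingRow"><td>Shipline</td><td>Vessel</td></tr>
  <tr class="Row"><td>MSC</td><td>EXAMPLE SHIP</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Select the table that contains a HeadingRow, wherever it sits in the page
table = next(t for t in soup.find_all("table") if t.find("tr", class_="HeadingRow"))
first_row = table.find("tr", class_="Row").get_text(" ", strip=True)
print(first_row)
```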
I parsed the table, added each row to a list, and appended that to a data list.
from bs4 import BeautifulSoup
import requests
url="https://www.vit.org/WebReports/vesselschedule.aspx"
soup = BeautifulSoup(requests.get(url).text, "html.parser")
table = soup.find_all('table')[6] # not the best way, as darkKnight noted
rows = table.find_all('tr')
data = []
for row in rows:
    cols = row.find_all('td')
    data.append([ele.text.strip() for ele in cols])
print(data)
from bs4 import BeautifulSoup
import requests
res=requests.get("https://www.vit.org/WebReports/vesselschedule.aspx")
soup=BeautifulSoup(res.text,"html.parser")
find columns by below code:
table=soup.find_all("table")[6]
columns=[col.get_text(strip=True) for col in table.find("tr",class_="HeadingRow").find_all("td")[1:-1]]
find row data by below code:
main_lst = []
for row in table.find_all("tr", class_="Row"):
    lst = [i.get_text(strip=True) for i in row.find_all("td")[1:-1]]
    main_lst.append(lst)
create table using pandas
import pandas as pd
df=pd.DataFrame(columns=columns,data=main_lst)
df
You need a way to specify a pattern that uniquely identifies the target table within the nested tabular structure. The following CSS pattern grabs that table based on a string it contains ("Shipline"), an attribute that is not present, and the table's relationship to other elements in the DOM.
You can then pass that specific table to read_html and do some cleaning on the returned DataFrame.
import requests
from bs4 import BeautifulSoup as bs
from pandas import read_html as rh
r = requests.get('https://www.vit.org/WebReports/vesselschedule.aspx').text
soup = bs(r, 'lxml')
df = rh(str(soup.select_one('table table:not([style]):-soup-contains("Shipline")')))[0] # on earlier soupsieve versions use :contains
df.dropna(how='all', axis = 1, inplace = True)
df.columns = df.iloc[0, :]
df = df.iloc[1:, :]
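The header-promotion at the end can be sketched on a toy frame (values invented):

```python
import pandas as pd

# Toy stand-in for what read_html returns: headers arrive as the first data row
df = pd.DataFrame([["Shipline", "Vessel"], ["MSC", "EXAMPLE SHIP"]])

df.columns = df.iloc[0, :]  # promote the first row to column labels
df = df.iloc[1:, :]         # drop that row from the data
print(df)
```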

Finding tables returns [] with bs4

I am trying to scrape a table from this url: https://cryptoli.st/lists/fixed-supply
I gather that the table I want is in the div class "dataTables_scroll". I use the following code and it only returns an empty list:
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
url = requests.get("https://cryptoli.st/lists/fixed-supply")
soup = bs(url.content, 'lxml')
table = soup.find_all("div", {"class": "dataTables_scroll"})
print(table)
Any help would be most appreciated.
Thanks!
The reason is that the response you get from requests.get() does not contain the table data; it is loaded client-side by JavaScript.
What can you do about this? Using a Selenium webdriver is a possible solution: you can wait until the table is loaded and interactive, get the page source with Selenium, and pass it to bs4 for scraping.
You can check the response by writing it to a file:
with open("demofile.html", "w", encoding="utf-8") as f:
    f.write(soup.prettify())
and you will be able to see "...Loading..." where the table is expected.
I believe the data is loaded from a script tag. I have to go to work, so I can't spend more time working out how to properly rebuild a DataFrame from the "|"-delimited data at present, but the following may serve as a starting point for others, as it extracts the relevant entries from the script tag for the table body.
import requests, re
import ast
r = requests.get('https://cryptoli.st/lists/fixed-supply').text
s = re.search(r'cl\.coinmainlist\.dataraw = (\[.*?\]);', r, flags = re.S).group(1)
data = ast.literal_eval(s)
data = [i.split('|') for i in data]
print(data)
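Continuing that idea with hypothetical entries (the real field layout would have to be worked out from the page's JavaScript), the split rows can be framed directly:

```python
import pandas as pd

# Hypothetical "|"-delimited entries in the shape the script tag provides
raw = ["1|Bitcoin|BTC|21000000", "2|Litecoin|LTC|84000000"]
rows = [entry.split("|") for entry in raw]

# Illustrative column names; the actual fields are an assumption
df = pd.DataFrame(rows, columns=["rank", "name", "symbol", "max_supply"])
print(df)
```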

Trying to scrape data stored in table using BeautifulSoup Python

I'm trying to scrape data from this table
and here's the code I'm using
## Scraping data for schools
from urllib.request import urlopen
from bs4 import BeautifulSoup
#List of schools
page=urlopen('https://mcss.knack.com/school-districts#all-school-contacts/')
soup = BeautifulSoup(page,'html.parser')
School=[]
Address=[]
Phone=[]
Principal=[]
Email=[]
District=[]
# Indexing rows and then identifying cells
for rows in soup.findAll('tr'):
    cells = rows.findAll('td')
    if len(cells) == 7:
        School.append(soup.find("span", {'class': 'col-0'}).text)
        Address.append(soup.find("span", {'class': 'col-1'}).text)
        Phone.append(soup.find("span", {'class': 'col-2'}).text)
        Principal.append(soup.find("span", {'class': 'col-3'}).text)
        Email.append(soup.find("span", {'class': 'col-4'}).text)
        District.append(soup.find("span", {'class': 'col-5'}).text)
import pandas as pd
school_frame = pd.DataFrame({'School': School,
                             'Address': Address,
                             'Phone': Phone,
                             'Principal': Principal,
                             'Email': Email,
                             'District': District})
school_frame.head()
#school_frame.to_csv('school_address.csv')
In return I'm getting only the data frame's column header names.
What am I doing wrong?
When you check the actual value of page, you will see that it does not contain a table, only an empty div that is later filled dynamically by JavaScript. urllib.request does not run the JavaScript and just returns the empty page with no table. You could use Selenium to emulate a browser (which runs JavaScript) and then fetch the resulting HTML of the website (see this Stack Overflow answer for an example).
