I'm using the BeautifulSoup module to parse an HTML file from which I want to extract certain information, specifically game scores and team names.
However, when I use the findAll function, it keeps returning an empty result for a string that is definitely in the HTML. If someone could explain what I am doing wrong, it would be greatly appreciated. See the code below.
import urllib
import bs4
import re
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = 'http://www.foxsports.com/mlb/scores?season=2017&date=2017-05-09'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
# html parser
page_soup = soup(page_html, "html.parser")
container = page_soup.findAll("div",{"class":"wisbb_teams"})
print(len(container))
I think the syntax you're using is from the old version of BeautifulSoup. Try something like find_all in snake_case instead (see the docs):
from bs4 import BeautifulSoup
# ...
page_html = uClient.read()
page_soup = BeautifulSoup(page_html, "html.parser")
list_of_divs = page_soup.find_all("div", class_="wisbb_name")
print(len(list_of_divs))
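The older API used camelCase (findAll), while bs4 prefers snake_case (find_all). Note that findAll still works in bs4 as a legacy alias, so the rename by itself will not change the result.
Also, notice that find_all can take a class_ parameter to find elements by class.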
See this answer, https://stackoverflow.com/a/38471317/4443226, for some more info
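To make the class_ point concrete, here is a minimal sketch on an inline HTML string. The markup and the team names are made up for illustration and only reuse the class names mentioned in this thread; the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the scores page (an assumption, not the real page).
sample_html = """
<div class="wisbb_teams">
  <div class="wisbb_name">Team A</div>
  <div class="wisbb_name">Team B</div>
</div>
"""

page_soup = BeautifulSoup(sample_html, "html.parser")

# Keyword form: class_ has a trailing underscore because "class" is reserved in Python.
by_keyword = page_soup.find_all("div", class_="wisbb_name")

# Dictionary form: matches the same elements, and also works for attributes like data-*.
by_dict = page_soup.find_all("div", {"class": "wisbb_name"})

team_names = [div.get_text(strip=True) for div in by_keyword]
print(team_names)
print(len(by_keyword) == len(by_dict))
```

Both forms select the same two inner divs here, so if either returns an empty list on the real page, the class name (or the downloaded HTML) is the more likely culprit than the method name.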
Also, make sure you're looking for the correct classname! I don't see the class you're looking for, but rather these:
I am trying to scrape some data from a website.
But when I print it, I just get the tags back, without the information inside them.
This is the code:
#Imports
import bs4
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
#URL
my_url = 'https://website.com'
#Opening connection grabbing the page
uClient = uReq(my_url)
page_html = uClient.read()
#closing the page
uClient.close()
#Parse html
page_soup = soup(page_html,"html.parser")
price = page_soup.findAll("div",{"id":"lastTrade"})
print(price)
This is what I get back:
[<div id="lastTrade"> </div>]
Can anyone tell me what I have to change or add so that I receive the actual information from inside this tag?
Maybe loop through your list like this:
for res in price:
    print(res.text)
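One caveat with the loop above: it can only print text that is actually present in the downloaded HTML. The sketch below reuses the output shown in the question as a stand-in for the server response and checks whether the tag has any text at all. The guess that the value is filled in later by JavaScript is an assumption, not something confirmed for this site.

```python
from bs4 import BeautifulSoup

# Stand-in for what the server returned: the div exists but holds only whitespace.
served_html = '<div id="lastTrade"> </div>'

page_soup = BeautifulSoup(served_html, "html.parser")
price = page_soup.find_all("div", {"id": "lastTrade"})

for res in price:
    text = res.get_text(strip=True)
    if text:
        print(text)
    else:
        # Nothing to extract; the value is possibly inserted later by JavaScript.
        print("tag found, but it has no text in the downloaded HTML")
```

If the tag turns out to be empty like this, no amount of looping will help, and the page has to be rendered or its data source queried directly.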
The image shows the area that I want to access.
containers = pagebs('div',{'class':"search-content"})
When I print containers, it just displays:
[<div class="search-content">
</div>]
There is nothing inside it. I tried searching for tags inside it, but that didn't work.
Is there a workaround, or can I just not access it no matter what I do?
This is what I've written so far:
from bs4 import BeautifulSoup as BS
from urllib.request import urlopen as uReq
url = 'https://bahrain.sharafdg.com/?q=asus%20laptops&post_type=product'
uclient = uReq(url)
pagehtml = uclient.read()
uclient.close()
pagebs = BS(pagehtml , 'html.parser')
containers = pagebs('div',{'class':"search-content"})
I'm very new to Python. The following code only lets me display individual p entries from the extracted website (the first entry, index 0, being the current example).
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = "https://en.wikipedia.org/wiki/Young_Thug"
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")
paragraphs = page_soup.findAll("p")
paragraph = paragraphs[0].text.strip()
print(paragraph)
For some reason, I can't work out the particular for loop I would need to display all of the p elements on the site in a single block of text.
The eventual goal of the above code snippet is a reading grade level app, hence the stripped down text. Any help would be appreciated, thank you!
I’m not near my laptop to include the output, but generally it would be:
paragraphs = page_soup.findAll("p")
for para in paragraphs:
    print(para.text.strip())
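Since the question asks for a single block of text, one option is to join the paragraph texts instead of printing them one by one. This is a minimal sketch on a made-up HTML snippet (the real Wikipedia page is not fetched here), and skipping empty paragraphs is my own choice rather than something the question requires.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration; the real page would be fetched as in the question.
sample_html = """
<p> First paragraph. </p>
<p></p>
<p>Second <b>paragraph</b>.</p>
"""

page_soup = BeautifulSoup(sample_html, "html.parser")

# Collect the stripped text of every <p>, including text nested in child tags.
paragraph_texts = [p.get_text().strip() for p in page_soup.find_all("p")]

# Join the non-empty paragraphs into one string, one paragraph per line.
full_text = "\n".join(text for text in paragraph_texts if text)
print(full_text)
```

Having everything in one string (full_text) should be more convenient for a reading-level calculation than separate print calls.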
findAll doesn't find the class I need. However, I was able to find the class above it, but the data there is not well organized.
Do you know how we can get the data, or organize the output from the parent class, which has all the data together?
Please see the HTML below and the images.
import bs4
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = 'https://www.vivino.com/explore?e=eJzLLbI11jNVy83MszU0UMtNrLA1MVBLrrQtLVYrsDVUK7ZNTlQrS7YtKSpNVSsviY4FioEpIwhlDKFMIJQ5VM4EAJCfGxQ='
#Opening a connection
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
#html parse
page_soup = soup(page_html, "html.parser")
container = page_soup.findAll("div", {"class":"wine-explorer__results__item"})
len(container)
Thanks, everyone. As you all suggested, a module that can run JavaScript was needed to select that class. I've used Selenium in this case; however, PyQt5 might be a better option.
import bs4
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
from selenium import webdriver
my_url = 'https://www.vivino.com/explore?e=eJzLLbI11jNVy83MszU0UMtNrLA1MVBLrrQtLVYrsDVUK7ZNTlQrS7YtKSpNVSsviY4FioEpIwhlDKFMIJQ5VM4EAJCfGxQ='
# Let a real browser load the page so that its JavaScript runs
driver = webdriver.Firefox()
driver.get(my_url)
# Grab the rendered DOM rather than the raw server response
html = driver.execute_script("return document.documentElement.outerHTML")
driver.quit()
# Parse the rendered HTML
html_page_soup = soup(html, "html.parser")
container = html_page_soup.findAll("div", {"class": "wine-explorer__results__item"})
print(len(container))
You can use the Dryscrape module with bs4, because the wine-explorer elements are created by JavaScript. Dryscrape adds JavaScript support.
Try using the following instead:
container = page_soup.findAll("div", {"class": "wine-explorer__results"})
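Before reaching for a browser-driving tool, it may help to check which class names exist in the HTML the server actually sends. The sketch below uses the standard library's html.parser instead of BeautifulSoup, so it has no extra dependencies. The static_html string is a hypothetical stand-in for the downloaded page, and ClassCollector and classes_in_static_html are names invented for this example.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collects every CSS class name that appears in the parsed markup."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; a class attribute may hold several names.
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())


def classes_in_static_html(html_text):
    """Return the set of class names present in the given HTML text."""
    collector = ClassCollector()
    collector.feed(html_text)
    return collector.classes


# Hypothetical server response: the outer container exists, the items do not.
static_html = '<div class="wine-explorer__results"></div>'
found = classes_in_static_html(static_html)

print("wine-explorer__results" in found)
print("wine-explorer__results__item" in found)
```

If the class you want is missing from this set but visible in the browser's inspector, the elements are most likely created by JavaScript, which is consistent with the answers above.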
I'm using Beautiful Soup for the first time, and the text inside a span is not being extracted. I'm not familiar with HTML, so I'm unsure why this happens, and it'd be great to understand.
I've used the code below:
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = 'https://www.anz.com.au/personal/home-loans/your-loan/interest-rates/#varhome'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")
content = page_soup.findAll("span",attrs={"data-item":"rate"})
With this code, index 0 returns the following:
<span class="productdata" data-baserate-code="VRI" data-cc="AU" data-item="rate" data-section="PHL" data-subsection="VR"></span>
However, I'd expect something like the following, which is what I see when I inspect the page in Chrome and which includes text such as the interest rate:
<span class="productdata" data-cc="AU" data-section="PHL" data-subsection="VR" data-baserate-code="VRI" data-item="rate">5.20% p.a.</span>
The data you are trying to extract does not exist in the downloaded HTML. It is loaded by JavaScript after the page loads: the website uses a JSON API to fill in the information, so Beautiful Soup cannot find it. The data can be viewed at the following link, which hits the site's JSON API and returns JSON data:
https://www.anz.com/productdata/productdata.asp?output=json&country=AU&section=PHL
You can parse the JSON and get the data. Also, for HTTP requests I would recommend the requests package.
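As a rough illustration of the "parse the JSON" step, here is a sketch using the standard json module. The structure of the real API response is not shown in this thread, so the payload below, including the keys "products", "code" and "rate" and the example values, is entirely hypothetical; adjust the keys after inspecting the actual response.

```python
import json

# Hypothetical payload: the real keys and nesting are unknown and must be checked.
raw_json = """
{"products": [
  {"code": "EXAMPLE_A", "rate": "1.00% p.a."},
  {"code": "EXAMPLE_B", "rate": "2.00% p.a."}
]}
"""


def extract_rates(payload):
    """Map each product code to its rate string (assumed keys, see note above)."""
    data = json.loads(payload)
    return {item["code"]: item["rate"] for item in data.get("products", [])}


rates = extract_rates(raw_json)
print(rates)
```

With the requests package, calling .json() on the response object would give you the already-parsed dictionary, so the json.loads step could be skipped.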
As others said, the content is generated by JavaScript. You can use Selenium together with ChromeDriver to find the data you want, with something like:
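from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()
driver.get("https://www.anz.com.au/personal/home-loans/your-loan/interest-rates/#varhome")
# find_elements_by_css_selector was removed in recent Selenium 4 releases; use find_elements with By
items = driver.find_elements(By.CSS_SELECTOR, "span[data-item='rate']")
itemsText = [item.get_attribute("textContent") for item in items]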
>>> itemsText
['5.20% p.a.', '5.30% p.a.', '5.75% p.a.', '5.52% p.a.', ....]
As seen above, BeautifulSoup wasn't necessary at all, but you can use it instead to parse the page source and get the same results:
from bs4 import BeautifulSoup
soup = BeautifulSoup(driver.page_source, 'html.parser')
items = soup.findAll("span",{"data-item":"rate"})
itemsText = [item.text for item in items]