I want to access the title of this website:
https://zenodo.org/search?page=1&size=20&q=broma
Currently I use BeautifulSoup, but with this code the results come back empty ([]):
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

def generateSoup(my_url):
    uClient = uReq(my_url)
    page_html = uClient.read()
    uClient.close()
    return soup(page_html, "lxml")

page_soup = generateSoup('https://zenodo.org/search?page=1&size=20&q=broma')
containers = page_soup.findAll('a', {'class': 'ng-binding'})
print(containers)
If you could correct my code or suggest another library I could work with, I would be very grateful.
Thanks to all.
Edit: The problem is that the page's HTML does not contain the element at all.
This website uses AJAX to display the results; you can find the AJAX request and fetch the JSON result directly:
from urllib.request import urlopen as uReq
import json

def generateJson(my_url):
    uClient = uReq(my_url)
    page_html = uClient.read()
    uClient.close()
    return json.loads(page_html.decode("utf-8"))

page_json = generateJson('https://zenodo.org/api/records/?page=1&size=20&q=broma')
print(page_json["hits"]["hits"][0]["metadata"]["title"])
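If you need every title on the page rather than just the first, you can walk the whole hits list. A minimal sketch, using a hand-made sample that mimics the shape of the Zenodo response (hypothetical values; the real API returns many more fields per hit):

```python
import json

# Sample payload shaped like Zenodo's /api/records/ response
# (made-up titles; only the fields used below are included).
sample = json.loads("""
{
  "hits": {
    "hits": [
      {"metadata": {"title": "First record"}},
      {"metadata": {"title": "Second record"}}
    ],
    "total": 2
  }
}
""")

# Collect the title of every hit, not just hits[0]
titles = [hit["metadata"]["title"] for hit in sample["hits"]["hits"]]
print(titles)  # → ['First record', 'Second record']
```

The same comprehension works unchanged on the live response returned by generateJson above.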
I am trying to scrape some data from a website, but when I print it I just get the tags back, without the information inside them.
This is the code:
#Imports
import bs4
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
#URL
my_url = 'https://website.com'
#Opening connection grabbing the page
uClient = uReq(my_url)
page_html = uClient.read()
#closing the page
uClient.close()
#Parse html
page_soup = soup(page_html,"html.parser")
price = page_soup.findAll("div",{"id":"lastTrade"})
print(price)
This is what I get back:
[<div id="lastTrade"> </div>]
So can anyone tell me what I have to change or add so that I receive the actual information from inside this tag?
Maybe loop through your list like this:
for res in price:
    print(res.text)
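To illustrate on a self-contained snippet (hypothetical HTML, not the real page): `.text` returns whatever character data the tag contains. Note that if the div is filled in later by JavaScript, the raw HTML really only holds whitespace, which is why the printed result looks empty:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML: one populated div, one empty (JS-rendered) div
html = '<div id="lastTrade">12.34</div><div id="empty"> </div>'
page_soup = BeautifulSoup(html, "html.parser")

# .text extracts the character data of each tag
texts = [res.text for res in page_soup.find_all("div")]
print(texts)  # → ['12.34', ' ']
```

If the second form is what you see for every tag, the value is being injected client-side and you need either the site's underlying API or a JavaScript-capable tool.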
The image shows the area that I want to access.
containers = pagebs('div',{'class':"search-content"})
When I print containers it just displays:
[<div class="search-content">
</div>]
There is nothing inside it. I tried searching for tags inside it, but that didn't work.
Is there a workaround, or can I just not access it no matter what I do?
This is what I've written so far:
from bs4 import BeautifulSoup as BS
from urllib.request import urlopen as uReq
url = 'https://bahrain.sharafdg.com/?q=asus%20laptops&post_type=product'
uclient = uReq(url)
pagehtml = uclient.read()
uclient.close()
pagebs = BS(pagehtml, 'html.parser')
containers = pagebs('div',{'class':"search-content"})
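As in the Zenodo question above, the likely cause is that the search results are injected by JavaScript, so the HTML the server sends really does contain an empty div. A quick way to confirm that, shown here on a made-up snippet standing in for the downloaded page:

```python
from bs4 import BeautifulSoup as BS

# Stand-in for pagehtml: what the server returns before any JS runs
pagehtml = '<body><div class="search-content">\n</div></body>'
pagebs = BS(pagehtml, 'html.parser')

containers = pagebs('div', {'class': 'search-content'})
# The div exists, but it holds no product markup, only whitespace:
print(len(containers))                                 # → 1
print(containers[0].get_text(strip=True) == '')        # → True
```

If the container is empty in the raw response like this, no BeautifulSoup selector will find the products; you need either a browser-driving tool (Selenium) or the XHR/API request the page makes to load its results.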
I am following some instructions from a video and I seem to have hit a brick wall.
When running my script against the following website, I am trying to access the container that hosts the odds for each game, in order to import them into a separate CSV file.
import bs4
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
#my_url = 'https://www.google.com/search?q=premier+league&rlz=1C1GCEU_en-GBIN877FR877&oq=pre&aqs=chrome.0.69i59j69i57j35i39j69i65l2j69i60l3.628j0j7&sourceid=chrome&ie=UTF-8#sie=lg;/g/11fj6snmjm;2;/m/02_tc;mt;fp;1;;'
my_url = 'https://sports.coral.co.uk/sport/football/matches/tomorrow'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")
container = page_soup.find_all("div",{"class":"oddsicard desktop-sport-card"})
print(len(container))
However, I run into the issue that I cannot select the container, as the output characters are unreadable. I have tried this on other pages and it seems to work, so I figure something is wrong with the decoding or with the webpage itself.
When printing, this is the output:
vn#���b��
��
�u��W��!�JE�O���;�����
��7�_�,p ��AGh��}���oP�.ܱy;o/��-�{A��rrsh|?[Z����
I�N��]����l�b՜��f6�='��.���R�NWex����&���Q�����m0��~�c�N���zA#/
If anyone could help that would be much appreciated.
This URL sends data compressed with Brotli, and it refused to send it uncompressed when I tried requesting other compressions via the 'Accept-Encoding' header.
You have to install the brotlipy module and use it to decompress the content:
import brotli
page_html = brotli.decompress(page_html)
from urllib.request import urlopen as uReq
import brotli

my_url = 'https://sports.coral.co.uk/sport/football/matches/tomorrow'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
print(uClient.headers['Content-Encoding'])  # `br` means `brotli`

page_html = brotli.decompress(page_html)
print(page_html)
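A more defensive pattern is to branch on the Content-Encoding header rather than assuming one scheme. A sketch: the gzip and deflate branches use only the standard library, while the brotli branch assumes the third-party brotlipy package from the answer above is installed; the round-trip demo at the bottom uses made-up HTML:

```python
import gzip
import zlib

def decompress_body(body, encoding):
    """Decompress an HTTP body according to its Content-Encoding header."""
    if encoding == 'gzip':
        return gzip.decompress(body)
    if encoding == 'deflate':
        return zlib.decompress(body)
    if encoding == 'br':
        import brotli  # third-party: pip install brotlipy
        return brotli.decompress(body)
    return body  # identity or unknown encoding: pass through unchanged

# Round-trip demo on the gzip path with a made-up page body:
raw = b'<html><body>odds card</body></html>'
assert decompress_body(gzip.compress(raw), 'gzip') == raw
assert decompress_body(raw, None) == raw
```

With urllib you would call it as decompress_body(page_html, uClient.headers['Content-Encoding']).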
So this is my second question regarding Beautiful Soup (sorry, I'm a beginner).
I was trying to fetch data from this website:
https://www.ccna8.com/ccna4-v6-0-final-exam-full-100-2017/
My Code:
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
url = 'https://www.ccna8.com/ccna4-v6-0-final-exam-full-100-2017/'
uClient = uReq(url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "lxml")
print(page_soup)
But for some reason it returns an empty result.
I've been searching similar threads, and apparently it usually has something to do with the website using external APIs, but this website doesn't.
It seems that the Content-Encoding of the response is gzip, so you need to handle that before you can process the HTML response:
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import gzip
url = 'https://www.ccna8.com/ccna4-v6-0-final-exam-full-100-2017/'
uClient = uReq(url)
page_html = gzip.decompress(uClient.read())
uClient.close()
page_soup = soup(page_html, "lxml")
print(page_soup)
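If you are not sure whether a given body is actually gzip-compressed, the gzip magic number (bytes 0x1f 0x8b) at the start is a cheap check before calling gzip.decompress. A small self-contained demo with a made-up page body:

```python
import gzip

def maybe_gunzip(body):
    """Decompress body only if it starts with the gzip magic bytes."""
    if body[:2] == b'\x1f\x8b':
        return gzip.decompress(body)
    return body

html = b'<html><title>CCNA</title></html>'
assert maybe_gunzip(gzip.compress(html)) == html   # compressed input is unpacked
assert maybe_gunzip(html) == html                  # plain input passes through
```

This makes the script work both against servers that honor Accept-Encoding and those that compress unconditionally.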
Try using the requests module, which decompresses gzip responses automatically.
Ex:
import requests
from bs4 import BeautifulSoup as soup
url = 'https://www.ccna8.com/ccna4-v6-0-final-exam-full-100-2017/'
uClient = requests.get(url)
page_soup = soup(uClient.text, "lxml")
print(page_soup)
findAll doesn't find the class I need. However, I was able to find the class above that one, but its data structure is not well organized.
Do you know what we can do to get the data, or to organize the output from the class above, which has all the data lumped together?
Please see the HTML below and the images.
import bs4
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = 'https://www.vivino.com/explore?e=eJzLLbI11jNVy83MszU0UMtNrLA1MVBLrrQtLVYrsDVUK7ZNTlQrS7YtKSpNVSsviY4FioEpIwhlDKFMIJQ5VM4EAJCfGxQ='
#Opening a connection
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
#html parse
page_soup = soup(page_html, "html.parser")
container = page_soup.findAll("div", {"class":"wine-explorer__results__item"})
len(container)
Thanks everyone; as you all suggested, a module that can execute JavaScript was needed to render that class. I've used Selenium in this case, though PyQt5 might be a better option.
from bs4 import BeautifulSoup as soup
from selenium import webdriver

my_url = 'https://www.vivino.com/explore?e=eJzLLbI11jNVy83MszU0UMtNrLA1MVBLrrQtLVYrsDVUK7ZNTlQrS7YtKSpNVSsviY4FioEpIwhlDKFMIJQ5VM4EAJCfGxQ='

# Let the browser render the JavaScript, then grab the final DOM
driver = webdriver.Firefox()
driver.get(my_url)
html = driver.execute_script("return document.documentElement.outerHTML")
#print(html)

html_page_soup = soup(html, "html.parser")
container = html_page_soup.findAll("div", {"class": "wine-explorer__results__item"})
len(container)
You can use the Dryscrape module with bs4, because the wine-explorer selector is created by JavaScript; Dryscrape gives you JavaScript support.
Try using the following instead:
container = page_soup.findAll("div", {"class": "wine-explorer__results"})