Beautiful Soup not finding any elements under a class - Python

I am trying to webscrape the prices of a website using BeautifulSoup:
The container class is shown below:
An example of the objects I want to retrieve from that class are shown below:
But no objects are found under the container class c1_t2i; print(len(containers)) always prints 0.
The code is shown below:
import bs4
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
myUrl = "https://www.lazada.com.ph/catalog/?q=lighters&_keyori=ss&from=input&spm=a2o4l.home.search.go.239e6ef0RMwbfH"
uClient = uReq(myUrl)
pageHtml = uClient.read()
uClient.close()
pageSoup = soup(pageHtml, "html.parser")
containers = pageSoup.findAll("div", {"class": "c1_t2i"})
print(len(containers))

If you open the page and view the page source, you won't be able to find the class "c1_t2i". The class you are looking for seems to be "c3e8SH".
I am, however, not sure why this is happening. I am using Chrome; can you check with Chrome as well? You can also print out the parsed HTML and search for the text "c1_t2i" or "c3e8SH", whichever is available there.
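To make the suggested check concrete, here is a minimal offline sketch: raw_html stands in for the HTML that uReq(myUrl).read() would return (the markup is invented for illustration, not taken from the live page). The point is to test which class names actually appear in the raw response rather than in the browser's DevTools.

```python
from bs4 import BeautifulSoup

# Invented stand-in for the raw HTML returned by the server.
raw_html = '<div class="c3e8SH"><span class="c3frMy">P25.00</span></div>'

# The class seen in DevTools may be absent from the raw response...
print("c1_t2i" in raw_html)   # False
# ...while a different class is actually present.
print("c3e8SH" in raw_html)   # True

# Searching for the class that really exists in the response succeeds.
soup = BeautifulSoup(raw_html, "html.parser")
print(len(soup.find_all("div", {"class": "c3e8SH"})))  # 1
```

The same membership test (`"c1_t2i" in pageHtml.decode()`) works on the real response bytes and quickly tells you whether a selector can possibly match.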
EDIT 1:
I think I understand the problem. The HTML you see when you do inspect element is generated using JavaScript. However, the same classes are not available in the raw HTML that you get with the script. You need to use something like PhantomJS (or, nowadays, a headless browser driven by Selenium) to execute the JS and get the resulting HTML. Check out this thread.
EDIT 2:
You can also try disabling JS, look at the page that comes up, and then see if you can select a class name from the basic HTML.

Related

Web scraping 'window' object

I am trying to get the body text of news articles like this one:
https://elpais.com/espana/2022-07-01/yolanda-diaz-lanza-su-proyecto-politico-sumar-y-convoca-el-primer-acto.html
In the source code, it can be found after "articleBody".
I've tried using bs4 BeautifulSoup, but it looks like it cannot access the 'window' object where the article body information is. I am able to get the text using string functions:
text = re.search('"articleBody":"(.*)","keywords"', source_code)
where source_code is a string that contains the source code of the URL. However, this method looks pretty inefficient compared to using the bs4 methods when the page allows it. Any advice, please?
You're right about BeautifulSoup not being able to handle window objects. In fact, you need to use Selenium for that kind of thing. Here's an example of how to do so with Python 3 (you'll have to adapt it slightly if you want it to work in Python 2):
from selenium import webdriver
from selenium.webdriver.common.by import By
import time
# Create a new instance of Chrome and go to the website we want to scrape
browser = webdriver.Chrome()
browser.get("http://www.elpais.com/")
time.sleep(5)  # Let the browser load
# Find the div element containing the article content
# (find_element_by_class_name was removed in Selenium 4; use find_element)
div = browser.find_element(By.CLASS_NAME, 'articleContent')
# Print out all the text inside the div
print(div.text)
Hope this helps!
Try:
import json
import requests
from bs4 import BeautifulSoup
url = "https://elpais.com/espana/2022-07-01/yolanda-diaz-lanza-su-proyecto-politico-sumar-y-convoca-el-primer-acto.html"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
for ld_json in soup.select('[type="application/ld+json"]'):
    data = json.loads(ld_json.text)
    if "@type" in data and "NewsArticle" in data["@type"]:
        break
print(data["articleBody"])
Prints:
A una semana de que arranque Sumar ...
Or:
text = soup.select_one('[data-dtm-region="articulo_cuerpo"]').get_text(strip=True)
print(text)

Why is my parsed image link coming out in base64 format

I was trying to parse an image link from a website.
When I inspect the link on the website, it is this one: https://static.nike.com/a/images/c_limit,w_592,f_auto/t_product_v1/df7c2668-f714-4ced-9f8f-1f0024f945a9/chaussure-de-basketball-zoom-freak-3-MZpJZF.png, but when I parse it with my code the output is data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7.
from bs4 import BeautifulSoup
import requests
source = requests.get('https://www.nike.com/fr/w/hommes-chaussures-nik1zy7ok').text
soup = BeautifulSoup(source, 'lxml')
pair = soup.find('div', class_='product-card__body')
image_scr = pair.find('img', class_='css-1fxh5tw product-card__hero-image')['src']
print(image_scr)
I think the code isn't the issue, but I don't know what's causing the link to come out in base64 format. So how could I get the code to return the link as a .png?
What happens?
First of all, take a look into your soup - there is the truth. The website does not provide all its information statically; a lot of it is provided dynamically and rendered by the browser -> so requests won't get this info this way.
Workaround
Take a look at the <noscript> next to your selection; it holds a smaller version of the image and provides the src.
Example
from bs4 import BeautifulSoup
import requests
source = requests.get('https://www.nike.com/fr/w/hommes-chaussures-nik1zy7ok').content
soup = BeautifulSoup(source, 'lxml')
pair = soup.find('div', class_='product-card__body')
image_scr = pair.select_one('noscript img.css-1fxh5tw.product-card__hero-image')['src']
print(image_scr)
Output
https://static.nike.com/a/images/c_limit,w_318,f_auto/t_product_v1/df7c2668-f714-4ced-9f8f-1f0024f945a9/chaussure-de-basketball-zoom-freak-3-MZpJZF.png
If you would like a "big picture", just replace the parameter w_318 with w_1000...
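Since the image size lives in the URL's transformation parameters, the swap can be done with a plain string replacement. A small sketch, using the <noscript> URL printed above:

```python
# URL taken from the <noscript> fallback; w_318 is the width parameter.
small = ("https://static.nike.com/a/images/c_limit,w_318,f_auto/t_product_v1/"
         "df7c2668-f714-4ced-9f8f-1f0024f945a9/"
         "chaussure-de-basketball-zoom-freak-3-MZpJZF.png")

# Request a larger rendition by swapping the width parameter.
big = small.replace("w_318", "w_1000")
print(big)
```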
Edit
Concerning your comment - there are a lot more solutions, but it still depends on what you want to do with the information and what you are going to work with.
The following approach uses Selenium, which, unlike requests, renders the website and gives you the "right" page source back, but also needs more resources than requests:
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome(r'C:\Program Files\ChromeDriver\chromedriver.exe')  # raw string so the backslashes are not treated as escapes
driver.get('https://www.nike.com/fr/w/hommes-chaussures-nik1zy7ok')
soup=BeautifulSoup(driver.page_source, 'html.parser')
pair = soup.find('div', class_='product-card__body')
image_scr = pair.select_one('img.css-1fxh5tw.product-card__hero-image')['src']
print(image_scr)
Output
https://static.nike.com/a/images/c_limit,w_592,f_auto/t_product_v1/df7c2668-f714-4ced-9f8f-1f0024f945a9/chaussure-de-basketball-zoom-freak-3-MZpJZF.png
As you want to grab the src, meaning image data, when downloading the data from the server using requests you need to use the .content form, as follows:
source = requests.get('https://www.nike.com/fr/w/hommes-chaussures-nik1zy7ok').content
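The difference between the two accessors is simply bytes versus decoded text. A small offline illustration (a Response object is built by hand here via its private _content attribute, for demonstration only, instead of hitting the network):

```python
import requests

# Hand-built Response standing in for requests.get(...); demo only.
resp = requests.Response()
resp._content = "<img src='photo.png'>".encode("utf-8")  # private attr, demo only
resp.encoding = "utf-8"

# .content is the raw bytes from the wire; .text decodes them to str.
print(type(resp.content))  # <class 'bytes'>
print(type(resp.text))     # <class 'str'>
```

For binary payloads such as images, .content is what you would write to a file; for HTML that you hand to BeautifulSoup, either works, since bs4 accepts both bytes and str.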

Why is Beautiful Soup not extracting all of the "a" tags from a website

I am learning BeautifulSoup and I tried to extract all the "a" tags from a website. I am getting a lot of "a" tags, but a few of them are ignored, and I am confused about why that is happening. Any help will be highly appreciated.
Link i used is : https://www.w3schools.com/python/
img : https://ibb.co/mmEKTK
The red box in the image is a section that has been totally ignored by bs4. It does contain "a" tags.
Code:
import requests
import bs4
import re
import html5lib
res = requests.get('https://www.w3schools.com/python/')
soup = bs4.BeautifulSoup(res.text,'html5lib')
try:
    links_with_text = []
    for a in soup.find_all('a', href=True):
        print(a['href'])
except:
    print('none')
Sorry for the code indentation; I am new here.
The links which are being ignored by bs4 are dynamically rendered, i.e. advertisements etc. were not present in the HTML code but have been inserted by scripts based on your browser habits. The requests package will only fetch static HTML content; you need to simulate a browser to get the dynamic content.
Selenium can be used with any browser like Chrome, Firefox, etc. If you want to achieve the same results on a server (without a UI), use a headless browser such as PhantomJS.
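The static-versus-dynamic point can be shown offline: BeautifulSoup only ever sees anchors that are already in the HTML string it is given. In the invented snippet below, which stands in for res.text, anchors that JavaScript would inject later are simply absent, so find_all cannot return them; note also that `href=True` skips anchors without an href attribute.

```python
from bs4 import BeautifulSoup

# Invented static HTML standing in for res.text; any ad links injected later
# by JavaScript never appear in this string, so bs4 cannot find them.
static_html = """
<a href="/python/">Python Tutorial</a>
<a name="top">a bookmark anchor without an href</a>
<a href="/python/python_intro.asp">Python Intro</a>
"""
soup = BeautifulSoup(static_html, "html.parser")

# href=True keeps only anchors that actually carry an href attribute.
print([a["href"] for a in soup.find_all("a", href=True)])
# ['/python/', '/python/python_intro.asp']
```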

Getting output 0 even though there are 25 elements with the same class

Image of the HTML
Link to the page
I am trying to see how many elements of a class there are on this page, but the output is 0. I have been using BeautifulSoup for a while but have never seen such an error.
from bs4 import BeautifulSoup
import requests
result = requests.get("https://www.holonis.com/motivationquotes")
c = result.content
soup = BeautifulSoup(c, "html.parser")
samples = soup.findAll("div", {"class": "ng-scope"})
print(len(samples))
Output
0
and I want the correct output, which should be at least 25.
This is a "dynamic" Angular-based page which needs a JavaScript engine or a browser to be fully loaded. To put it differently: the HTML source code you see in the browser developer tools is not the same as what you would get in result.content - the latter is the non-rendered initial HTML of the page, which does not contain the desired data.
You can use things like selenium to have the page rendered and loaded and then HTML-parse it, but why not make a direct request to the site's API instead:
import requests
result = requests.get("https://www.holonis.com/api/v2/activities/motivationquotes/all?limit=15&page=0")
data = result.json()
for post in data["items"]:
    print(post["body"]["description"])
The post descriptions are retrieved and printed for example purposes only - the post dictionaries contain all the other relevant post data that is displayed on the web page.
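The same API call extends naturally to more pages by incrementing the page query parameter. A sketch, with the URL template copied from the request above; page_urls and extract_descriptions are hypothetical helper names, and the helper is exercised here on a hand-made payload so the example runs without network access:

```python
# Template taken from the API request above; {page} is the only moving part.
API = "https://www.holonis.com/api/v2/activities/motivationquotes/all?limit=15&page={page}"

def page_urls(n_pages):
    """Build the URLs for the first n_pages (the API numbers pages from 0)."""
    return [API.format(page=p) for p in range(n_pages)]

def extract_descriptions(data):
    """Pull every post description out of one page of API JSON."""
    return [post["body"]["description"] for post in data.get("items", [])]

print(page_urls(2))
# Hand-made payload mimicking the API's JSON shape:
print(extract_descriptions({"items": [{"body": {"description": "Stay hungry"}}]}))
```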
Basically, result.content does not contain any divs with the ng-scope class. As stated in one of the comments, the HTML you are trying to get is added there by the JavaScript running in the browser.
I recommend the requests-html package, created by the author of the very popular requests.
You can try to use the code below to build on that.
from requests_html import HTMLSession
session = HTMLSession()
r = session.get('https://www.holonis.com/motivationquotes')
r.html.render()
To see how many ng-scope classes are there just do this:
>>> len(r.html.find('.ng-scope'))
302
I assume you want to scrape all the hrefs from the a tags that are children of the divs you gave the image of. You can obtain them this way:
from itertools import chain

divs = r.html.find('[ng-if="!isVideo"]')
link_sets = (div.absolute_links for div in divs)
>>> list(set(chain.from_iterable(link_sets)))
['https://www.holonis.com/motivationquotes/o/byiqe-ydm',
'https://www.holonis.com/motivationquotes/o/rkhv0uq9f',
'https://www.holonis.com/motivationquotes/o/ry7ra2ycg',
...
'https://www.holonis.com/motivationquotes/o/sydzfwgcz',
'https://www.holonis.com/motivationquotes/o/s1eidcdqf']
There's nothing wrong with BeautifulSoup; in fact, the result of your GET request does not contain any ng-scope text.
You can see the output here:
>>> from bs4 import BeautifulSoup
>>> import requests
>>>
>>> result = requests.get("https://www.holonis.com/motivationquotes")
>>> c = result.content
>>>
>>> print(c)
**Verify the output yourself**
You only have the ng-cloak class, as you can see from the regex result:
import re
regex = re.compile('ng.*')
samples = soup.findAll("div", {"class": regex})
samples
#[<div class="ng-cloak slideRoute" data-ui-view="main" fixed="400" main-content="" ng-class="'{{$state.$current.animateClass}}'"></div>]
To get the content of that webpage, it is wise to either use their API or a browser simulator like Selenium. That webpage loads its content using lazy loading: when you scroll down, you will see more content. The webpage expands its content through pagination, like https://www.holonis.com/excellenceaddiction/1. However, you can give this a go. I've created this script to parse the content displayed within the first three pages. You can always change that page range to satisfy your requirement.
from selenium import webdriver
from selenium.webdriver.common.by import By

URL = "https://www.holonis.com/excellenceaddiction/{}"
driver = webdriver.Chrome()  # If necessary, define the path to chromedriver
for link in [URL.format(page) for page in range(1, 4)]:
    driver.get(link)
    # find_elements_by_css_selector was removed in Selenium 4
    for items in driver.find_elements(By.CSS_SELECTOR, '.tile-content .tile-content-text'):
        print(items.text)
driver.quit()
Btw, the above script parses the description of each post.

How to scrape text from a hidden div and class using Python?

I am working on a script to scrape video titles from this webpage:
"https://www.google.com.eg/trends/hotvideos"
The problem is that the titles are hidden in the HTML source of the page, but I can see them if I use the inspector to look for them.
My code works well with ("class":"wrap"),
but when I use it with the hidden one, "class":"hotvideos-single-trend-title-container", it doesn't give me anything in the output.
#import urllib2
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen('https://www.google.com.eg/trends/hotvideos').read()
soup = BeautifulSoup(html, 'html.parser')
print (soup.findAll('div',{"class":"hotvideos-single-trend-title-container"}))
#wrap
The page is generated/populated using JavaScript.
BeautifulSoup won't help you here; you need a library which supports JavaScript-generated HTML pages (see here for a list) or have a look at Selenium.
