Python BeautifulSoup - problem reading <span>

I am trying to extract "brand-logo", "product-name", "price" and "best-price" from the following HTML:
<div class="container">
<div class="catalog-wrapper">
<div class="slideout-filters"></div>
<section class="catalog-top-banner"></section>
<section class="search-results">
<section class="catalog">
<div class="row">
<div class="col-xs-12 col-md-4 col-lg-3">
<div class="col-xs-12 col-md-8 col-lg-9">
<div class="catalog-container">
<a class="catalog-product catalog-item ">
<div class="product-image "></div>
<div class="product-description">
<div>
<div class="brand-logo">
<span>PACO RABANNE</span>
</div>
<span class="product-name">
PACO RABANNE PERFUME MUJER 30 ML
</span>
<span class="price">Normal: S/ 219</span>
<span class="best-price ">Internet: S/ 209</span>
I can read "brand-logo" and "product-name", but I cannot read "price" and "best-price".
I tried it this way:
box_3 = soup.find('div','col-xs-12 col-md-8 col-lg-9')
for div in box_3.find_all('div','product-description'):
    d={}
    d["Marca"] = div.find_all("div",{"class","brand-logo"})[0].getText()
    d["Producto"] = div.find_all("span",{"class","product-name"})[0].getText()
    d["Precio"] = div.find_all('span',class_='price')
    d["Oferta"] = div.find_all('span',class_='best-price ')
    l.append(d)
l
out:
{'Marca': 'PACO RABANNE',
'Oferta': [],
'Precio': [<span class="price">Normal: S/ 219</span>],
'Producto': 'PACO RABANNE PERFUME MUJER 30 ML'}
can anyone help me?

You can find the "product-description" div and then iterate over the desired class names:
from bs4 import BeautifulSoup
# content is the HTML from the question
to_find = ['brand-logo', 'product-name', 'price', 'best-price']
s = BeautifulSoup(content, 'html.parser').find('div', {'class': 'product-description'})
# brand-logo is a <div> and the other three are <span>s, so match on the class alone
final_results = [s.find(class_=name).get_text() for name in to_find]
# strip the surrounding newlines and spaces
filtered = [text.strip() for text in final_results]
Output:
['PACO RABANNE', 'PACO RABANNE PERFUME MUJER 30 ML', 'Normal: S/ 219', 'Internet: S/ 209']

Unfortunately, without the actual website I'm unable to check the solution :(.
Try extracting the two fields that fail the same way as the two that work, and drop the trailing space from 'best-price ': Beautiful Soup splits the class attribute on whitespace, so a search string that ends in a space does not match. I can't test this against the real page, so treat it as a guess.
d["Precio"] = div.find_all('span', {"class": "price"})[0].getText()
d["Oferta"] = div.find_all('span', {"class": "best-price"})[0].getText()
It might be a good idea to write a function that fetches the chosen class and handles potential errors.
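The helper suggested above could look roughly like this. It is only a sketch: the name text_of and the default argument are made up for illustration, and the HTML is a trimmed copy of the markup in the question.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the HTML from the question (note the trailing space in "best-price ").
HTML = """
<div class="product-description">
  <div class="brand-logo"><span>PACO RABANNE</span></div>
  <span class="product-name">
    PACO RABANNE PERFUME MUJER 30 ML
  </span>
  <span class="price">Normal: S/ 219</span>
  <span class="best-price ">Internet: S/ 209</span>
</div>
"""


def text_of(parent, css_class, default=None):
    """Return the stripped text of the first element with css_class, or default."""
    # Strip the name first: a search string with a trailing space does not match.
    tag = parent.find(class_=css_class.strip())
    return tag.get_text(strip=True) if tag is not None else default


soup = BeautifulSoup(HTML, "html.parser")
rows = []
for div in soup.find_all("div", class_="product-description"):
    rows.append({
        "Marca": text_of(div, "brand-logo"),
        "Producto": text_of(div, "product-name"),
        "Precio": text_of(div, "price"),
        "Oferta": text_of(div, "best-price "),
    })
print(rows)
```

Returning a default instead of indexing [0] means a product without an offer price yields None rather than an IndexError.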

Related

BeautifulSoup Returns empty list which leads to an IndexError in my Python code

I am trying to do web scraping using BeautifulSoup. The code I have written is below:
import requests
from bs4 import BeautifulSoup
response = requests.get("https://stackoverflow.com/questions")
soup = BeautifulSoup(response.text, "html.parser")
questions = soup.select(".question-summary")
print(type(questions[0]))
When I run the code, I get the error message below:
print(type(questions[10]))
IndexError: list index out of range
Then I tried to print the list like this:
import requests
from bs4 import BeautifulSoup
response = requests.get("https://stackoverflow.com/questions")
soup = BeautifulSoup(response.text, "html.parser")
questions = soup.select(".question-summary")
print(questions)
And then I get an empty list: []
What am I doing wrong?
Thanks for your answers.
.question-summary is the wrong locator: question-summary is not a class but the beginning of each element's id value (for example id="question-summary-71715531"). Selecting on the id prefix works:
import requests
from bs4 import BeautifulSoup
response = requests.get("https://stackoverflow.com/questions")
soup = BeautifulSoup(response.text, "html.parser")
questions = soup.select('[id^="question-summary"]')
print(questions)
Output:
1" data-post-type-id="1" id="question-summary-71715531">
<div class="s-post-summary--stats js-post-summary-stats">
<div class="s-post-summary--stats-item s-post-summary--stats-item__emphasized" title="Score of 0">
<span class="s-post-summary--stats-item-number">0</span>
<span class="s-post-summary--stats-item-unit">votes</span>
</div>
<div class="s-post-summary--stats-item" title="0 answers">
<span class="s-post-summary--stats-item-number">0</span>
<span class="s-post-summary--stats-item-unit">answers</span>
</div>
<div class="s-post-summary--stats-item" title="5 views">
<span class="s-post-summary--stats-item-number">5</span>
<span class="s-post-summary--stats-item-unit">views</span>
</div>
</div>
<div class="s-post-summary--content">
<h3 class="s-post-summary--content-title">
<a class="s-link" href="/questions/71715531/is-it-possible-to-draw-a-logistic-regression-graph-with-multiple-x-variable">Is it possible to draw a
logistic regression graph with multiple x variable?</a>
</h3>
<div class="s-post-summary--content-excerpt">
Currently, this is my X and V value. May I know is it possible to draw a logistic regression curve with X that has multiple column? Or I am required to draw multiple graphs to do so?
X = df1.drop(['...
</div>
<div class="s-post-summary--meta">
<div class="s-post-summary--meta-tags tags js-tags t-python-3ûx t-machine-learning">
<a class="post-tag flex--item mt0 js-tagname-python-3ûx" href="/questions/tagged/python-3.x" rel="tag" title="show questions tagged 'python-3.x'">python-3.x</a> <a class="post-tag flex--item mt0 js-tagname-machine-learning" href="/questions/tagged/machine-learning" rel="tag" title="show questions tagged 'machine-learning'">machine-learning</a>
</div>
<div class="s-user-card s-user-card__minimal">
<a class="s-avatar s-avatar__16 s-user-card--avatar" href="/users/14128881/christopher-chua"> <div class="gravatar-wrapper-16" data-user-id="14128881">
<img ,="" alt="user avatar" class="s-avatar--image" height="16" src="https://lh6.googleusercontent.com/-Sn3B_E5hiJc/AAAAAAAAAAI/AAAAAAAAAAA/AMZuucl1oyfdhJiXhrx73JLYqzKAK9icag/photo.jpg?sz=32" width="16"/>
</div>
</a>
<div class="s-user-card--info">
<div class="s-user-card--link d-flex gs4">
<a class="flex--item" href="/users/14128881/christopher-chua">Christopher Chua</a>
</div>
<ul class="s-user-card--awards">
<li class="s-user-card--rep"><span class="todo-no-class-here" dir="ltr" title="reputation score ">7</span></li>
</ul>
</div>
<time class="s-user-card--time">asked <span class="relativetime" title="2022-04-02 07:03:06Z">13 mins ago</span></time>
.. so on
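To see the difference between the two selectors without hitting the network, here is a small sketch on hypothetical, heavily trimmed markup modelled on the output above (the ids and text are invented). It also guards the index access so that an empty result does not raise IndexError.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the ids follow the "question-summary-<number>" pattern seen above.
HTML = """
<div id="questions">
  <div class="s-post-summary" id="question-summary-71715531">first</div>
  <div class="s-post-summary" id="question-summary-71715532">second</div>
  <div class="s-post-summary" id="sidebar">not a question</div>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Class selector: no element carries the class "question-summary", so this is empty.
by_class = soup.select(".question-summary")

# Attribute selector: ^= matches ids that start with the given string.
by_id_prefix = soup.select('[id^="question-summary"]')

# Check for a match before indexing.
first_id = by_id_prefix[0]["id"] if by_id_prefix else None
print(len(by_class), len(by_id_prefix), first_id)
```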

Parsing HTML with BeautifulSoup class == AND title CONTAINS

I am trying to parse the following HTML:
<div class="product-details">
<h4 class="title" >Blue - Standard</h4>
<a class="learn-more" data-test-selector="linkViewMoreDetails" href="https://productwebpage.com">Learn More</a>
<div class="tocart" <a class="" href="/store/addtocartplp?productId=3593" id="AddToCartSimple-3593">Add To Cart</a></div>
</div>
,
<div class="product-details">
<h4 class="title" >Blue - Wide</h4>
<a class="learn-more" data-test-selector="linkViewMoreDetails" href="https://productwebpage.com">Learn More</a>
<div class="tocart" <a class="disAddtoCardBtn" href="javascript:void(0)" id="AddToCartSimple-3576" >SOLD</a></div>
</div>
I am trying to get the 'id' where the title contains 'Blue' AND the item is not sold.
I have tried:
soup.find_all("a",href=re.compile("Blue"),class_="")
links = soup.find_all("a", href=re.compile("Blue", "Add To Cart"))
ids = [tag["id"] for tag in soup.find_all("a", href=re.compile("Blue"))]
But it is not returning the info I'm looking for.
I would like it to return:
AddToCartSimple-3593
I think your HTML is corrupted. You can do the entire filtering with CSS selectors using :has, :not, and :contains (spelled :-soup-contains in recent soupsieve releases), along with attribute = value selectors. ^= is the starts-with operator: it matches when the attribute value starts with the string after the =. The ~ is the general sibling combinator and the > is the child combinator. So the selector looks for a sibling with class tocart (the leading . means class), and then for a child whose id starts with AddToCartSimple- and whose text does not contain SOLD. This is less specific than testing for text equal to "SOLD", because it also excludes partial matches. Which one you want depends on the variation you see in the actual data.
from bs4 import BeautifulSoup as bs
html ='''
<div class="product-details">
<h4 class="title">Blue - Standard</h4> <a class="learn-more" data-test-selector="linkViewMoreDetails" href="https://productwebpage.com">Learn More</a>
<div class="tocart"> <a class="" href="/store/addtocartplp?productId=3593" id="AddToCartSimple-3593">Add To Cart</a>
</div>
<div class="product-details">
<h4 class="title">Blue - Wide</h4> <a class="learn-more" data-test-selector="linkViewMoreDetails" href="https://productwebpage.com">Learn More</a>
<div class="tocart"> <a class="disAddtoCardBtn" href="javascript:void(0)" id="AddToCartSimple-3576">SOLD</a>
</div>
'''
soup = bs(html, 'html.parser')
print(soup.select_one('.title:has([title^="Blue -"]) ~ .tocart > [id^=AddToCartSimple-]:not(:contains("SOLD"))')['id'])
You should check there was a match before accessing with ['id'] of course. You could also go for all matches as follows:
[i['id'] for i in soup.select('.title:has([title^="Blue -"]) ~ .tocart > [id^=AddToCartSimple-]:not(:contains("SOLD"))')]
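In the HTML as it appears in the question, the h4 carries the text "Blue - ..." but no title attribute, so here is a variant that matches on the heading text instead. This is a sketch under two assumptions: the markup is the repaired version below (with the tocart tags closed), and the installed soupsieve is recent enough to know :-soup-contains.

```python
from bs4 import BeautifulSoup

# Repaired version of the HTML from the question.
HTML = """
<div class="product-details">
  <h4 class="title">Blue - Standard</h4>
  <a class="learn-more" href="https://productwebpage.com">Learn More</a>
  <div class="tocart"><a class="" href="/store/addtocartplp?productId=3593" id="AddToCartSimple-3593">Add To Cart</a></div>
</div>
<div class="product-details">
  <h4 class="title">Blue - Wide</h4>
  <a class="learn-more" href="https://productwebpage.com">Learn More</a>
  <div class="tocart"><a class="disAddtoCardBtn" href="javascript:void(0)" id="AddToCartSimple-3576">SOLD</a></div>
</div>
"""

# Heading text contains "Blue" -> later sibling .tocart -> child whose id
# starts with "AddToCartSimple-" and whose text does not contain "SOLD".
SELECTOR = (
    '.title:-soup-contains("Blue") ~ .tocart > '
    '[id^="AddToCartSimple-"]:not(:-soup-contains("SOLD"))'
)

soup = BeautifulSoup(HTML, "html.parser")
available_ids = [tag["id"] for tag in soup.select(SELECTOR)]

# select_one returns None when nothing matches, so check before using ['id'].
first = soup.select_one(SELECTOR)
first_id = first["id"] if first is not None else None
print(available_ids, first_id)
```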
To get the data where the "title" contains "Blue" and the item is not "SOLD":
Use the CSS selector .product-details > h4 a[title*='Blue'], which selects every a whose title attribute contains Blue, inside an h4 that is a direct child of an element with class product-details.
Find the next div using the find_next() method, and check that its text is not "SOLD".
Print that div's id.
from bs4 import BeautifulSoup
html = """<div class="product-details">
<h4 class="title" >Blue - Standard</h4>
<a class="learn-more" data-test-selector="linkViewMoreDetails" href="https://productwebpage.com">Learn More</a>
<div class="tocart" <a class="" href="/store/addtocartplp?productId=3593" id="AddToCartSimple-3593">Add To Cart</a></div>
</div>
<div class="product-details">
<h4 class="title" >Blue - Wide</h4>
<a class="learn-more" data-test-selector="linkViewMoreDetails" href="https://productwebpage.com">Learn More</a>
<div class="tocart" <a class="disAddtoCardBtn" href="javascript:void(0)" id="AddToCartSimple-3576" >SOLD</a></div>
</div>"""
soup = BeautifulSoup(html, "html.parser")
for tag in soup.select(".product-details > h4 a[title*='Blue']"):
    if tag.find_next("div").text != "SOLD":
        print(tag.find_next("div")["id"])
Output:
AddToCartSimple-3593

Beautiful Soup - finding all classes which contain a known string

I was trying to extract the string '£150,000' from this HTML by looking for the string 'Purchase Price' inside the element, since the same class is used more than once:
<div class="row mb-sm-1 property-header-row">
<div class="prop-capital-fields property-header-col col-6"><h3>£150,000</h3>
<p class="label-paragraph">
Purchase Price
</p>
</div>
<div class="prop-capital-fields property-header-col col-6"><h3>£180,000</h3>
<p class="label-paragraph">
Market Value
</p>
</div>
<div class="prop-capital-fields property-header-col col-6"><h3>£1,185</h3>
<p class="label-paragraph">
Potential Cashflow PCM
</p>
</div>
So I wrote the following code
property_ = soup.find(class_="properties-content-body col-xs-12 col-sm-12 col-md-7")
for a in property_.find_all('div', attrs={'class': 'prop-capital-fields property-header-col col-6'}, text="Purchase Price"):
    purchase_price_list.append(a)
print(purchase_price_list)
but all I get is a blank list
I've tried many other things but I'm pretty sure I just don't know the correct way to do it.
Any help is appreciated.
I've found the answer:
for a in property_.find_all('div', attrs={'class': 'prop-capital-fields property-header-col col-6'}):
    b = a.find('p').text.replace("\n", "").strip()
    c = a.find('h3').text.strip()
    if b == 'Purchase Price':
        purchase_price_list.append(c)
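The original attempt most likely returned nothing because a text= filter on find_all('div', ...) compares against the div's single string, and these divs contain both an h3 and a p, so there is no single string to compare. An alternative sketch that starts from the label and walks to the neighbouring h3 follows; the function name value_for is made up, and the HTML is a trimmed copy of the markup in the question.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the HTML from the question.
HTML = """
<div class="row mb-sm-1 property-header-row">
  <div class="prop-capital-fields property-header-col col-6"><h3>£150,000</h3>
    <p class="label-paragraph">
      Purchase Price
    </p>
  </div>
  <div class="prop-capital-fields property-header-col col-6"><h3>£180,000</h3>
    <p class="label-paragraph">
      Market Value
    </p>
  </div>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")


def value_for(label):
    """Return the <h3> text that sits before the <p> labelled `label`, or None."""
    for p in soup.find_all("p", class_="label-paragraph"):
        if p.get_text(strip=True) == label:
            h3 = p.find_previous_sibling("h3")
            return h3.get_text(strip=True) if h3 is not None else None
    return None


print(value_for("Purchase Price"))
```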

Can't get the xml element value using lxml xpath

I am trying to scrape a spotify playlist webpage to pull out artist and song name data. Here is my python code:
#! /usr/bin/python
from lxml import html
import requests
playlistPage = requests.get('https://open.spotify.com/playlist/0csaTlUWTfiyXscv4qKDGE')
print("\n\nprinting variable playListPage: " + str(playlistPage))
tree = html.fromstring(playlistPage.content)
print("printing variable tree: " + str(tree))
artistList = tree.xpath("//span/a[@class='tracklist-row__artist-name-link']/text()")
print("printing variable artistList: " + str(artistList) + "\n\n")
Right now the final print statement is printing out an empty list.
Here is some example HTML from the page I'm trying to scrape. Ideally my code would pull out the string "M83"...not sure how much html is relevant so pasting what I believe necessary:
<div class="react-contextmenu-wrapper">
<div draggable="true">
<li class="tracklist-row" role="button" tabindex="0" data-testid="tracklist-row">
<div class="tracklist-col position-outer">
<div class="tracklist-play-pause tracklist-top-align">
<svg class="icon-play" viewBox="0 0 85 100">
<path fill="currentColor" d="M81 44.6c5 3 5 7.8 0 10.8L9 98.7c-5 3-9 .7-9-5V6.3c0-5.7 4-8 9-5l72 43.3z">
<title>
PLAY</title>
</path>
</svg>
</div>
<div class="position tracklist-top-align">
<span class="spoticon-track-16">
</span>
</div>
</div>
<div class="tracklist-col name">
<div class="track-name-wrapper tracklist-top-align">
<div class="tracklist-name ellipsis-one-line" dir="auto">
Intro</div>
<div class="second-line">
<span class="TrackListRow__artists ellipsis-one-line" dir="auto">
<span class="react-contextmenu-wrapper">
<span draggable="true">
<a tabindex="-1" class="tracklist-row__artist-name-link" href="/artist/63MQldklfxkjYDoUE4Tppz">
M83</a>
</span>
</span>
</span>
<span class="second-line-separator" aria-label="in album">
•</span>
<span class="TrackListRow__album ellipsis-one-line" dir="auto">
<span class="react-contextmenu-wrapper">
<span draggable="true">
<a tabindex="-1" class="tracklist-row__album-name-link" href="/album/6R0ynY7RF20ofs9GJR5TXR">
Hurry Up, We're Dreaming</a>
</span>
</span>
</span>
</div>
</div>
</div>
<div class="tracklist-col more">
<div class="tracklist-top-align">
<div class="react-contextmenu-wrapper">
<button class="_2221af4e93029bedeab751d04fab4b8b-scss c74a35c3aba27d72ee478f390f5d8c16-scss" type="button">
<div class="spoticon-ellipsis-16">
</div>
</button>
</div>
</div>
</div>
<div class="tracklist-col tracklist-col-duration">
<div class="tracklist-duration tracklist-top-align">
<span>
5:22</span>
</div>
</div>
</li>
</div>
</div>
A solution using Beautiful Soup:
import requests
from bs4 import BeautifulSoup as bs
page = requests.get('https://open.spotify.com/playlist/0csaTlUWTfiyXscv4qKDGE')
soup = bs(page.content, 'lxml')
tracklist_container = soup.find("div", {"class": "tracklist-container"})
track_artists_container = tracklist_container.findAll("span", {"class": "artists-albums"})
artists = []
for ta in track_artists_container:
    artists.append(ta.find("span").text)
print(artists[0])
prints
M83
This solution gets all the artists on the page so you could print out the list artists and get:
['M83',
'Charles Bradley',
'Bon Iver',
...
'Death Cab for Cutie',
'Destroyer']
And you can extend this to track names and albums quite easily by changing the classname in the findAll(...) function call.
Nice answer provided by @eNc. An lxml solution:
from lxml import html
import requests
playlistPage = requests.get('https://open.spotify.com/playlist/0csaTlUWTfiyXscv4qKDGE')
tree = html.fromstring(playlistPage.content)
artistList = tree.xpath("//span[@class='artists-albums']/a[1]/span/text()")
print(artistList)
Output :
['M83', 'Charles Bradley', 'Bon Iver', 'The Middle East', 'The Antlers', 'Handsome Furs', 'Frank Turner', 'Frank Turner', 'Amy Winehouse', 'Black Lips', 'M83', 'Florence + The Machine', 'Childish Gambino', 'DJ Khaled', 'Kendrick Lamar', 'Future Islands', 'Future Islands', 'JAY-Z', 'Blood Orange', 'Cut Copy', 'Rihanna', 'Tedeschi Trucks Band', 'Bill Callahan', 'St. Vincent', 'Adele', 'Beirut', 'Childish Gambino', 'David Guetta', 'Death Cab for Cutie', 'Destroyer']
Since you can't get all the results in one shot, maybe you should switch to Selenium.
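One way to separate an XPath problem from a page-content problem is to run the expression against a saved snippet first. The sketch below uses a trimmed copy of the markup from the question. If the expression works here but returns an empty list on the live page, the HTML that requests receives probably differs from what the browser shows.

```python
from lxml import html

# Trimmed copy of the markup from the question.
SNIPPET = """
<li class="tracklist-row">
  <span class="TrackListRow__artists ellipsis-one-line">
    <a tabindex="-1" class="tracklist-row__artist-name-link" href="/artist/63MQldklfxkjYDoUE4Tppz">
      M83</a>
  </span>
  <span class="TrackListRow__album ellipsis-one-line">
    <a tabindex="-1" class="tracklist-row__album-name-link" href="/album/6R0ynY7RF20ofs9GJR5TXR">
      Hurry Up, We're Dreaming</a>
  </span>
</li>
"""

tree = html.fromstring(SNIPPET)

# Attribute tests in XPath are written with @, e.g. [@class='...'].
raw = tree.xpath("//a[@class='tracklist-row__artist-name-link']/text()")

# The text nodes keep their surrounding newlines and spaces, so strip them.
artists = [text.strip() for text in raw]
print(artists)
```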

python HTML parsing issue

Given an HTML page, I would like to get only a list of pairs like (id1, value1), (id2, value2), ...; the file looks like this:
<div class="col m3 s12 col_title"><div class="font-small grey-text truncate content" title="value1">value1</div></div>
<div class="col m7 s12 col_id"><div class="content wrap">id1</div></div>
Every value is followed by a "content wrap" id.
I was thinking of something like:
match = re.compile('title="(.+?)".+?wrap"(.+?)"').findall(source)
This is an example:
<li class="collection-item Ids ">
<div class="row">
<div class="col m3 s12 col_title"><div class="font-small grey-text truncate content" title="filename1">filename1</div></div>
<div class="col m7 s12 col_id"><div class="content wrap">6000bc3211af43d7</div></div>
<div></div>
<div class="col m2 s12 col_time">
<div class="content">
<a href="http://test.com/test.php" target="_blank" class="secondary-content pull-right">
<span class="font-small grey-text" title="filex">test</span>
<i class="fa fa-external-link" aria-hidden="true" title="filey"></i>
</a>
</div>
</div>
</div>
Can you show the example for id1 and value1?
I have an idea :D
\w{1,}\d{1,}<
Then take each match from position 1 to len(match)-1.
It may not be right.
You can try to use Beautiful Soup; it should have everything you need for parsing HTML.
For example, you could use:
from urllib.request import urlopen
from bs4 import BeautifulSoup
# open the html from the website or from a file, check the doc
soup = BeautifulSoup(urlopen(yoururl), "lxml")
# find_all returns a list of tags, so call get_text() on each one
result = [tag.get_text() for tag in soup.find_all(class_="content wrap")]
Here, result would be a list containing the text of every element that has the "content wrap" class.
Building on TheWildHealer's answer, you can use the following:
from bs4 import BeautifulSoup
import requests
soup = BeautifulSoup(requests.get("http://websitehere.com").text, "lxml")
results = []
for row in soup.find_all(class_ = "row"):
    titleText = row.find(class_ = "col_title").get_text()
    idText = row.find(class_ = "col_id").get_text()
    results.append((idText, titleText))
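Here is the same idea run against the example from the question, as a sketch. It uses html.parser so no extra parser needs to be installed, and it skips rows that lack either column instead of failing on None. The HTML is a trimmed copy of the example, with the closing </li> added.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the example from the question.
HTML = """
<li class="collection-item Ids ">
<div class="row">
<div class="col m3 s12 col_title"><div class="font-small grey-text truncate content" title="filename1">filename1</div></div>
<div class="col m7 s12 col_id"><div class="content wrap">6000bc3211af43d7</div></div>
<div class="col m2 s12 col_time"><div class="content"><span title="filex">test</span></div></div>
</div>
</li>
"""

soup = BeautifulSoup(HTML, "html.parser")
pairs = []
for row in soup.select("div.row"):
    title = row.select_one(".col_title .content")
    ident = row.select_one(".col_id .content.wrap")
    if title is None or ident is None:
        continue  # skip rows that lack either column
    pairs.append((ident.get_text(strip=True), title.get_text(strip=True)))
print(pairs)
```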
