Building Resilient Spiders Against Inconsistent HTML markup

I want to get player and referee content from this site and store it in a db. At first, when I looked through it, all the players and the referees were in response.css("div.prelims p.indent::text"), and I could use regex to parse the ones with players from the ones with referees. No problem.
Then I took a harder look at the rest of the site, only to see that they DO NOT follow this structure consistently. Here is an example:
<div class="prelims">
<p class="indent">Text about players.</p>
</div>
<div class="num" id="p1">
<span class="num">1</span>
<p class="indent">Text about players.</p>
</div>
<div class="num" id="p2">
<span class="num">2</span>
<p class="indent">Text about players.</p>
</div>
<div class="num" id="p3">
<span class="num">3</span>
<p class="indent">Text about players.</p>
</div>
<div class="num" id="p4">
<span class="num">4</span>
<p class="indent">Text about players.</p>
</div>
<div class="num" id="p5">
<span class="num">5</span>
<p class="indent">Text about referee.</p>
</div>
<div class="num" id="p6">
Not only does this page have this 'num' and 'span' that the other page didn't, but my regex, which worked fine on the test page, breaks on the first p class=indent here.
What are some general principles of spider design that can make my spider more resilient against all this variability, and still be able to get the results into the right tables in my db? I am using DjangoItem, and was looking forward to a smooth pipeline into my db, but now I may have to wrangle this data to even get it into the right shape to insert. Your wisdom, insight, and experience greatly appreciated.

I think you can ignore the div tags if all the p tags that you want to capture have the indent class:
import re
text = r'''
<div class="prelims">
<p class="indent">Text about players.</p>
</div>
<div class="num" id="p1">
<span class="num">1</span>
<p class="indent">Text about players.</p>
</div>
<div class="num" id="p2">
<span class="num">2</span>
<p class="indent">Text about players.</p>
</div>
<div class="num" id="p3">
<span class="num">3</span>
<p class="indent">Text about players.</p>
</div>
<div class="num" id="p4">
<span class="num">4</span>
<p class="indent">Text about players.</p>
</div>
<div class="num" id="p5">
<span class="num">5</span>
<p class="indent">Text about referee.</p>
</div>
<div class="num" id="p6">
'''
# match each <p class="indent"> on a single line and capture its text
pattern = re.compile(r"<p.*class=[\"\']indent[\"\'].*>(.+)<\/p>", re.MULTILINE)
for m in re.findall(pattern, text):
    print(m)
Output:
Text about players.
Text about players.
Text about players.
Text about players.
Text about players.
Text about referee.
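A regex over raw HTML tends to break when attributes are reordered or a paragraph spans several lines. Here is a minimal sketch of a more tolerant approach, using only the standard library's html.parser (in a Scrapy spider, response.css("p.indent") would play the same role). It collects every p with the indent class wherever it sits, then decides the destination table from the text. The classify helper and its keyword rule are hypothetical and would need tuning against the real site.

```python
from html.parser import HTMLParser


class IndentParagraphs(HTMLParser):
    """Collect the text of every <p class="indent">, whatever its container."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = None  # None means "not inside a p.indent"

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "p" and "indent" in classes:
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            # collapse internal whitespace so multi-line paragraphs compare cleanly
            self.paragraphs.append(" ".join("".join(self._buffer).split()))
            self._buffer = None


def classify(text):
    # Hypothetical rule: route on a keyword; adapt to the real wording.
    return "referee" if "referee" in text.lower() else "player"


sample = """
<div class="prelims"><p class="indent">Text about players.</p></div>
<div class="num" id="p1"><span class="num">1</span>
<p class="indent">Text about players.</p></div>
<div class="num" id="p5"><span class="num">5</span>
<p class="indent">Text about referee.</p></div>
"""

parser = IndentParagraphs()
parser.feed(sample)
rows = [(classify(text), text) for text in parser.paragraphs]
print(rows)
```

Keeping selection (which nodes) separate from classification (which table) means a layout change on one page only touches the selector, while the routing rule stays the same.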

Related

Python: scrape Chrome Web Store comments

I am trying to scrape reviews from the Chrome Web Store and am having trouble distinguishing between a comment and the replies to that comment.
Below is an example of such HTML, where the user "John Smith" has a comment and a reply.
I am currently using pyppeteer to scrape the content.
I tried querySelectorAll with .ba-bc-Xb-K and .ba-bc-Xb, and several other approaches, but could not reliably tell the two apart.
<div class="ba-fb">
<div>
<div class="ba-bc-Xb">
<div class="ba-ji-A ba-ua-zl-Xb"><img src="//lh3.googleusercontent.com/a/default-user=s40-c-k" class="Lg-ee-A-O-xb" alt=""></div>
<div class="ba-bc-Xb-K">
<div class="ba-pa">
<span class="comment-thread-displayname" dir="auto">Lucy</span><span class="ba-Eb-Nf">Jun 26, 2022</span>
<div class="ba-Eb-N">
<div class="rsw-stars" aria-label="1 star">
<div class="rsw-starred" aria-hidden="true"></div>
<div class="rsw-unstarred" aria-hidden="true"></div>
<div class="rsw-unstarred" aria-hidden="true"></div>
<div class="rsw-unstarred" aria-hidden="true"></div>
<div class="rsw-unstarred" aria-hidden="true"></div>
</div>
</div>
<br>
<div class="ba-Eb-ba" dir="auto">We use this because its easy to hover and text over phone numbers of clients. IT VERY GLITCHY AND CRASHES OFTEN. 90% our of business runs on SMS texting, so I really wish I didn't use this for my company. If any has a better option please let me know!!!! I've been using this for a year and its getting better!!!!!!! AVOID IF YOU CAN!!!! Customer Service is ALSO TERRIBLE!</div>
</div>
<div class="ba-bc-Xb-cd">
<div class="bd-Ob Aa">
<div class="bd-Ob-L dd">Was this review helpful?</div>
<label class="voting-editor-button XzMRXd-sn"><input class="XzMRXd-sn-lc XzMRXd-lc" type="radio" name="vote_AIe9_BGDdh8EqnrWQb-fggox4SOmWi01kdMh4CdLCQD9oHM2uKG-GiDamTukgoJw7LwDNaVtssNY9zUfkPqZTbmL6bYR7YM8Tfa86zq-joAbx8qi5xjbhVjHguGAQoDUMi0YYV_pkFaVKt6ISOsZBGJKlLvhS3uCBg8VrwTO04skZFgbPvYGgPjeQgCKwOz4LyvBPf6dlvKz">Yes</label><label class="voting-editor-button XzMRXd-eb"><input class="XzMRXd-eb-lc XzMRXd-lc" type="radio" name="vote_AIe9_BGDdh8EqnrWQb-fggox4SOmWi01kdMh4CdLCQD9oHM2uKG-GiDamTukgoJw7LwDNaVtssNY9zUfkPqZTbmL6bYR7YM8Tfa86zq-joAbx8qi5xjbhVjHguGAQoDUMi0YYV_pkFaVKt6ISOsZBGJKlLvhS3uCBg8VrwTO04skZFgbPvYGgPjeQgCKwOz4LyvBPf6dlvKz">No</label>
</div>
<div class="ba-bc-zb-Pe">
<a tabindex="0" class="ba-bc-zb-y z-b-ob-y" role="button">Reply</a><a class="ba-bc-zb-y ba-Eb-xe-ba Pa" role="button" tabindex="0">Delete</a>
<div class="ba-bc-zb-y Da-ub"><a tabindex="0" class="Aa Da-ub-y" role="button">Mark as spam or abuse</a></div>
</div>
</div>
<div class="yb-ba-Eb-k">
<div class="Fg-b-ob-k Pa">
<textarea class="Fg-b-ob-Gc" rows="5" maxlength="4096" aria-label="Write your reply" placeholder="Write your reply"></textarea>
<div class="Od"></div>
<div class="Fg-b-ob-Jb-k"><input type="button" value="Cancel" class="g-c g-c-aSvl1d Aa Fg-b-ob-Fb-c"> <input type="button" value="Post" class="g-c g-c-wb Aa Fg-b-ob-qd-c"></div>
</div>
</div>
<div class="Od"></div>
<div class="Fg-b-ob-fb"></div>
<div class="Fg-b-mb-Fk Pa"><a role="button" tabindex="0" class="mb-Fk-c">Load more replies</a></div>
</div>
</div>
</div>
<div>
<div class="ba-bc-Xb">
<div class="ba-ji-A ba-ua-zl-Xb"><img src="//lh3.googleusercontent.com/a-/AFdZucpu4S27XT0-ymC2sQo4ML3v0EkQWHfQeW-YO5jyPg=s40-c-k" class="Lg-ee-A-O-xb" alt=""></div>
<div class="ba-bc-Xb-K">
<div class="ba-pa">
<span class="comment-thread-displayname" dir="auto">John Smith</span><span class="ba-Eb-Nf">May 24, 2022</span>
<div class="ba-Eb-N">
<div class="rsw-stars" aria-label="1 star">
<div class="rsw-starred" aria-hidden="true"></div>
<div class="rsw-unstarred" aria-hidden="true"></div>
<div class="rsw-unstarred" aria-hidden="true"></div>
<div class="rsw-unstarred" aria-hidden="true"></div>
<div class="rsw-unstarred" aria-hidden="true"></div>
</div>
</div>
<br>
<div class="ba-Eb-ba" dir="auto">Desktop app is interesting and the chrome browser buddy is even better. I wish I was not forced by my company to use the company.</div>
</div>
<div class="ba-bc-Xb-cd">
<div class="bd-Ob Aa">
<div class="bd-Ob-L dd">Was this review helpful?</div>
<label class="voting-editor-button XzMRXd-sn"><input class="XzMRXd-sn-lc XzMRXd-lc" type="radio" name="vote_AIe9_BE0070MjUM89cQCwjN0anL45obXJS3lggtKPsNh1lW8nApB3slGfCkLIRHtWYvTrteJ5Hsx15_Lq2GFBMLLbrKFghCR9XqAfnbN5yIZquqVmHLhEkzLpjGKotj-iX8wKux-rJoLU_8vz3wUKa76z0Ttw8QF2EKBKeT-vhT2WYDm8qPVpdpmgnYnObbYr_aDQlz4P5FD">Yes</label><label class="voting-editor-button XzMRXd-eb"><input class="XzMRXd-eb-lc XzMRXd-lc" type="radio" name="vote_AIe9_BE0070MjUM89cQCwjN0anL45obXJS3lggtKPsNh1lW8nApB3slGfCkLIRHtWYvTrteJ5Hsx15_Lq2GFBMLLbrKFghCR9XqAfnbN5yIZquqVmHLhEkzLpjGKotj-iX8wKux-rJoLU_8vz3wUKa76z0Ttw8QF2EKBKeT-vhT2WYDm8qPVpdpmgnYnObbYr_aDQlz4P5FD">No</label>
</div>
<div class="ba-bc-zb-Pe">
<a tabindex="0" class="ba-bc-zb-y z-b-ob-y" role="button">Reply</a><a class="ba-bc-zb-y ba-Eb-xe-ba Pa" role="button" tabindex="0">Delete</a>
<div class="ba-bc-zb-y Da-ub"><a tabindex="0" class="Aa Da-ub-y" role="button">Mark as spam or abuse</a></div>
</div>
</div>
<div class="yb-ba-Eb-k">
<div class="Fg-b-ob-k Pa">
<textarea class="Fg-b-ob-Gc" rows="5" maxlength="4096" aria-label="Write your reply" placeholder="Write your reply"></textarea>
<div class="Od"></div>
<div class="Fg-b-ob-Jb-k"><input type="button" value="Cancel" class="g-c g-c-aSvl1d Aa Fg-b-ob-Fb-c"> <input type="button" value="Post" class="g-c g-c-wb Aa Fg-b-ob-qd-c"></div>
</div>
</div>
<div class="Od"></div>
<div class="Fg-b-ob-fb">
<div>
<div class="ba-bc-Xb">
<div class="ba-ji-A ba-ua-zl-Xb"><img src="//lh3.googleusercontent.com/a-/AFdZucpu4S27XT0-ymC2sQo4ML3v0EkQWHfQeW-YO5jyPg=s40-c-k" class="Lg-ee-A-O-xb" alt=""></div>
<div class="ba-bc-Xb-K">
<div class="ba-pa">
<span class="comment-thread-displayname" dir="auto">John Smith</span><span class="ba-Eb-Nf">May 24, 2022</span><br>
<div class="ba-Eb-ba" dir="auto">I'm happy to chat with the engineering and UX team to tell you exactly how to fix it.</div>
</div>
<div class="ba-bc-Xb-cd">
<div class="bd-Ob Aa">
<div class="bd-Ob-L dd">Was this review helpful?</div>
<label class="voting-editor-button XzMRXd-sn"><input class="XzMRXd-sn-lc XzMRXd-lc" type="radio" name="vote_AIe9_BFmRFRFwJGRfUDOW8jG3rXnLzUlJu5dFPOnRhcZ3Qpf7k7js81NA_AsDgEfcDAZt0H9yZfs73z4D-hSlo1bxU2QLKaAXG2SMo-85mMfMl_-V6KnhrLHz2FEyGejziQP8UkVi-SsuqBw_lc0GmW9TtC5naBzAp94w9FygzBqeDyguPYXJMc">Yes</label><label class="voting-editor-button XzMRXd-eb"><input class="XzMRXd-eb-lc XzMRXd-lc" type="radio" name="vote_AIe9_BFmRFRFwJGRfUDOW8jG3rXnLzUlJu5dFPOnRhcZ3Qpf7k7js81NA_AsDgEfcDAZt0H9yZfs73z4D-hSlo1bxU2QLKaAXG2SMo-85mMfMl_-V6KnhrLHz2FEyGejziQP8UkVi-SsuqBw_lc0GmW9TtC5naBzAp94w9FygzBqeDyguPYXJMc">No</label>
</div>
<div class="ba-bc-zb-Pe">
<a class="ba-bc-zb-y ba-Eb-xe-ba Pa" role="button" tabindex="0">Delete</a>
<div class="ba-bc-zb-y Da-ub"><a tabindex="0" class="Aa Da-ub-y" role="button">Mark as spam or abuse</a></div>
</div>
</div>
</div>
</div>
</div>
</div>
<div class="Fg-b-mb-Fk Pa"><a role="button" tabindex="0" class="mb-Fk-c">Load more replies</a></div>
</div>
</div>
</div>
</div>

Generally, I'd avoid relying on these class names because they change too often.
As far as I can see, comments on this page can only be one level deep. A parent comment always has a <textarea> and replies do not, so you can tell a parent from a reply like this:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, "html.parser")  # <-- html_doc is your snippet from the question

out = {}
for div in soup.select("div:has(>.comment-thread-displayname)"):
    # no <textarea> in the enclosing block: this is a reply, handled via its parent
    if not div.parent.find("textarea"):
        continue

    # collect replies nested in the siblings that follow the parent comment
    replies = []
    for s in div.find_next_siblings():
        if (reply := s.find(class_="comment-thread-displayname")):  # requires Python 3.8+
            name, date, text = reply.parent.get_text(
                strip=True, separator="\n"
            ).split("\n", maxsplit=2)
            replies.append((name, date, text))

    name, date, text = div.get_text(strip=True, separator="\n").split(
        "\n", maxsplit=2
    )
    out[(name, date, text)] = replies

print(out)
This prints a dictionary whose keys are 3-item tuples (name, date, text) for each parent comment and whose values are lists of 3-item tuples for its replies:
{
    (
        "Lucy",
        "Jun 26, 2022",
        "We use this because its easy to hover and text over phone numbers of clients. IT VERY GLITCHY AND CRASHES OFTEN. 90% our of business runs on SMS texting, so I really wish I didn't use this for my company. If any has a better option please let me know!!!! I've been using this for a year and its getting better!!!!!!! AVOID IF YOU CAN!!!! Customer Service is ALSO TERRIBLE!",
    ): [],
    (
        "John Smith",
        "May 24, 2022",
        "Desktop app is interesting and the chrome browser buddy is even better. I wish I was not forced by my company to use the company.",
    ): [
        (
            "John Smith",
            "May 24, 2022",
            "I'm happy to chat with the engineering and UX team to tell you exactly how to fix it.",
        )
    ],
}
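If the goal is to store the result, a dictionary keyed by tuples can be awkward to work with. A rough sketch of flattening that shape into one row per comment, with replies pointing back at their parent, might look like this. The flatten_threads name, the row fields, and the shortened sample texts are illustrative assumptions, not part of the answer above.

```python
def flatten_threads(out):
    """Turn {(name, date, text): [reply, ...]} into flat rows with a parent link."""
    rows = []
    for parent, replies in out.items():
        name, date, text = parent
        rows.append({"name": name, "date": date, "text": text, "parent": None})
        for r_name, r_date, r_text in replies:
            rows.append(
                {"name": r_name, "date": r_date, "text": r_text, "parent": parent}
            )
    return rows


# Placeholder data in the same shape as the printed dictionary.
out = {
    ("Lucy", "Jun 26, 2022", "First review"): [],
    ("John Smith", "May 24, 2022", "Second review"): [
        ("John Smith", "May 24, 2022", "A reply"),
    ],
}

rows = flatten_threads(out)
for row in rows:
    kind = "reply" if row["parent"] else "comment"
    print(kind, row["name"], row["date"])
```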

Python, Selenium: I need to collect URLs but there are no a tags in the element

Good day. I have a task to collect the name and email of each person from this site:
https://www.espeakers.com/s/nsas/search?available_on=&awards&budget=0%2C10&bureau_id=304&distance=1000&fee=false&items_per_page=3701&language=en&location=&norecord=false&nt=0&page=0&presenter_type=&q=%5B%5D&require&review=false&sort=speakername&video=false&virtual=false
I use Selenium and Python to scrape it, but I have a problem getting the URL for each person. A sample person card looks like this:
<div class="col-xs-12 col-sm-6 col-md-4 col-lg-3">
<div class="speaker-tile" id="sid12026">
<div class="speaker-thumb" style='background-image: url("https://streamer.espeakers.com/assets/6/12026/159445.jpg"); background-size: contain;'>
<div class="row">
<div class="col-xs-8 text-left">
</div>
<div class="col-xs-4 text-right speaker-top-actions">
<i class="fa fa-ellipsis-h fa-fw">
</i>
</div>
</div>
</div>
<div class="speaker-details">
<div class="speaker-name">
Alex Aanderud
</div>
<div class="row" style="margin-top: 15px;">
<div class="col-xs-12 col-sm-12">
<div class="speaker-location">
<i class="fa fa-map-marker mp-tertiary-background">
</i>
AZ
<span>
,
</span>
US
</div>
</div>
<div class="col-sm-6 col-xs-12">
<div class="speaker-awards">
</div>
</div>
</div>
<div class="speaker-oneline text-left">
<p>
</p>
<div>
Certified Trainer of Advanced Integrative Psychology and Certified John Maxwell Speaker, Trainer, Coach, will transform your organization and improve your results.
</div>
</div>
<div class="speaker-assets">
<div class="row">
</div>
</div>
<div class="speaker-actions">
<div class="row">
<div class="text-center col-xs-12">
<div class="btn btn-flat mp-primary btn-block">
<span class="hidden-xs hidden-sm">
View Profile
</span>
<span class="visible-xs visible-sm">
Profile
</span>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
When you click on
<span class="hidden-xs hidden-sm">
View Profile
</span>
it takes you to a page with the person's info, where I can access it. How can I use Selenium to do this, or are there other solutions that could help?
Thanks!

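If you look closely, all the profile URLs have the form
https://www.espeakers.com/s/nsas/profile/id
where id is a five-digit number such as 27397. The same number appears in the id attribute of each .speaker-tile element after the "sid" prefix (for example sid12026), so you just need to extract it and append it to the base URL to obtain the profile URL.
from selenium.webdriver.common.by import By

# driver is assumed to be a WebDriver that has already loaded the search page
url = 'https://www.espeakers.com/s/nsas/profile/'
# each tile id looks like 'sid12026'; [3:] drops the 'sid' prefix
profile_urls = [url + el.get_attribute('id')[3:] for el in driver.find_elements(By.CSS_SELECTOR, '.speaker-tile')]
names = [el.text for el in driver.find_elements(By.CSS_SELECTOR, '.speaker-name')]
names is a list containing all the names, and profile_urls is a list containing the corresponding profile URLs.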
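Slicing with [3:] silently produces a wrong URL if some tile has an id in a different shape. A small sketch of a stricter version of the same idea follows. The profile_url helper is hypothetical, and the "sid" plus digits format is inferred only from the sample card in the question.

```python
import re

BASE_URL = "https://www.espeakers.com/s/nsas/profile/"
TILE_ID = re.compile(r"sid(\d+)")  # assumed id format, e.g. 'sid12026'


def profile_url(tile_id):
    """Return the profile URL for a tile id such as 'sid12026', or None."""
    match = TILE_ID.fullmatch(tile_id or "")
    return BASE_URL + match.group(1) if match else None


print(profile_url("sid12026"))
print(profile_url("banner"))
```

With Selenium, each value passed in would come from el.get_attribute('id'), and any None results can be logged and skipped instead of being requested.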

Beautiful soup multiple Span Extract Table

I am currently working on my class assignment. I have to extract the data from the SPECS table from this webpage.
https://www.consumerreports.org/products/drip-coffee-maker/behmor-connected-alexa-enabled-temperature-control-396982/overview/
The data I need is stored as
<h2 class="crux-product-title">Specs</h2>
</div>
</div>
<div class="row">
<div class="col-xs-12">
<div class="product-model-features-specs-item">
<div class="row">
<div class='col-lg-6 col-md-6 col-sm-6 col-xs-12 product-model-features-specs-item-key'>
<span class="crux-body-copy crux-body-copy--small--bold">
Programmable
<span class="product-model-tooltip">
<span class="crux-icons crux-icons-help-information" aria-hidden="true"></span>
<span class="product-model-tooltip-window">
<span class="crux-icons crux-icons-close" aria-hidden="true"></span>
<span class="crux-body-copy crux-body-copy--small--bold">Programmable</span>
<span class="crux-body-copy crux-body-copy--small">Programmable models have a clock and can be set to brew at a specified time.</span>
</span>
</span>
</span>
</div>
<div class="col-lg-6 col-md-6 col-sm-6 col-xs-12 product-model-features-specs-item-value">
<span class='crux-body-copy crux-body-copy--small'>Yes</span>
</div>
</div>
</div>
</div>
</div>
<div class="row">
<div class="col-xs-12">
<div class="product-model-features-specs-item">
<div class="row">
<div class='col-lg-6 col-md-6 col-sm-6 col-xs-12 product-model-features-specs-item-key'>
<span class="crux-body-copy crux-body-copy--small--bold">
Thermal carafe/mug
<span class="product-model-tooltip">
<span class="crux-icons crux-icons-help-information" aria-hidden="true"></span>
<span class="product-model-tooltip-window">
<span class="crux-icons crux-icons-close" aria-hidden="true"></span>
<span class="crux-body-copy crux-body-copy--small--bold">Thermal carafe/mug</span>
<span class="crux-body-copy crux-body-copy--small">Keeps coffee warm for about four hours; thermal mugs don't hold heat as well.</span>
</span>
</span>
</span>
I need to create lists for the three span classes:
crux-body-copy crux-body-copy--small--bold
crux-body-copy crux-body-copy--small
crux-body-copy crux-body-copy--small
Extracting the table is difficult because of the multiple nested spans it uses.
I used Beautiful Soup with find_all and find, selecting the spans by class name, but I always got only the first value.
How do I do this?

I don't know if this will work for you.
from simplified_scrapy import SimplifiedDoc,req,utils
html = ''' ''' # Your html
doc = SimplifiedDoc(html)
spans = doc.selects('span.crux-body-copy crux-body-copy--small--bold')
for span in spans:
    # print (span.firstText())
    print (span.select('span.crux-body-copy crux-body-copy--small--bold').text)
    print (span.select('span.crux-body-copy crux-body-copy--small').unescape())
Result:
Programmable
Programmable models have a clock and can be set to brew at a specified time.
Thermal carafe/mug
Keeps coffee warm for about four hours; thermal mugs don't hold heat as well.
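Since the question asks about Beautiful Soup, here is a hedged sketch of one way to pair each spec name with its value using bs4 itself. It relies on the product-model-features-specs-item, -key and -value classes visible in the question's HTML. The sample markup below is a trimmed stand-in (tooltip text shortened), not the full page.

```python
from bs4 import BeautifulSoup

html = """
<div class="product-model-features-specs-item">
  <div class="row">
    <div class="product-model-features-specs-item-key">
      <span class="crux-body-copy crux-body-copy--small--bold">
        Programmable
        <span class="product-model-tooltip">
          <span class="crux-body-copy crux-body-copy--small">Tooltip text.</span>
        </span>
      </span>
    </div>
    <div class="product-model-features-specs-item-value">
      <span class="crux-body-copy crux-body-copy--small">Yes</span>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
specs = {}
for item in soup.select(".product-model-features-specs-item"):
    key_span = item.select_one(".product-model-features-specs-item-key > span")
    value_span = item.select_one(".product-model-features-specs-item-value span")
    if key_span is None or value_span is None:
        continue  # skip rows that do not follow the expected layout
    # Take only the span's own leading text, not the nested tooltip text.
    label = key_span.find(string=True, recursive=False)
    if label is None:
        continue
    specs[label.strip()] = value_span.get_text(strip=True)

print(specs)
```

Looping over each item container and reading key and value inside it avoids the "always got the first value" problem, because find is called once per row rather than once per page.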

Python Find text at end of html element

I need to pull the movie title and year out of the HTML below using the BeautifulSoup find() method.
The code below returns the name of the movie, but I'm unable to return only the year:
find('p').find('a').text
<div class="col-sm-6 col-lg-3">
<div class="poster-container">
<a class="poster-link" href="/title/80244680/">
<img alt="A Tale of Two Kitchens (2019)" class="poster" src="https://occ-0-37-33.1.nflxso.net/dnm/api/v6/0DW6CdE4gYtYx8iy3aj8gs9WtXE/AAAABfTGUtIG2HYlEhUbvzPHmiAyPSkDcBIhQx_Ey06KfkgaUEwELBtJsJYP71-Vsx06NTKFKWZQupZGNVE8DCo8dC0j-zpcaNCPGFiyNJKN7tonZ3gMSAM.jpg?r=397"/>
<div class="overlay d-none d-lg-block text-center">
<span class="d-block font-weight-bold small mt-3">Documentaries</span>
<span class="d-block font-weight-bold small">International Movies</span>
</div>
</a>
</div>
<p><strong><a>A Tale of Two Kitchens</a></strong><br/>2019</p>
</div>

my_element.contents[-1]
This will give you the last element contained inside my_element: in this case, if my_element is the <p>, this will give the text "2019" as a NavigableString. (The first child is the <strong> tag, which contains <a> and all the rest.)

Use the following code: find the <a> tag and then use next_element.
from bs4 import BeautifulSoup
html='''<div class="col-sm-6 col-lg-3">
<div class="poster-container">
<a class="poster-link" href="/title/80244680/">
<img alt="A Tale of Two Kitchens (2019)" class="poster" src="https://occ-0-37-33.1.nflxso.net/dnm/api/v6/0DW6CdE4gYtYx8iy3aj8gs9WtXE/AAAABfTGUtIG2HYlEhUbvzPHmiAyPSkDcBIhQx_Ey06KfkgaUEwELBtJsJYP71-Vsx06NTKFKWZQupZGNVE8DCo8dC0j-zpcaNCPGFiyNJKN7tonZ3gMSAM.jpg?r=397"/>
<div class="overlay d-none d-lg-block text-center">
<span class="d-block font-weight-bold small mt-3">Documentaries</span>
<span class="d-block font-weight-bold small">International Movies</span>
</div>
</a>
</div>
<p><strong><a>A Tale of Two Kitchens</a></strong><br/>2019</p>
</div>'''
soup=BeautifulSoup(html,'html.parser')
item=soup.select_one('.col-sm-6.col-lg-3').find_next('p')
print(item.text)
Output:
A Tale of Two Kitchens2019
item=soup.select_one('.col-sm-6.col-lg-3').find_next('p').find('a').text
print(item)
output:
A Tale of Two Kitchens
item=soup.select_one('.col-sm-6.col-lg-3').find_next('p').find('a').next_element.next_element.next_element
print(item)
output:
2019
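Chaining next_element depends on the exact number of nodes between the link and the year. A possibly sturdier sketch is to read the paragraph's text fragments with stripped_strings and take the first and last. The one-line sample below is a reduced version of the question's markup, and treating the last fragment as the year is an assumption about this layout.

```python
from bs4 import BeautifulSoup

html = '<div><p><strong>A Tale of Two Kitchens</strong><br/>2019</p></div>'
soup = BeautifulSoup(html, "html.parser")

# stripped_strings yields each non-empty text fragment inside the <p>, in order
parts = list(soup.find("p").stripped_strings)
title, year = parts[0], int(parts[-1])
print(title, year)
```

This works whether or not the title is wrapped in an extra tag, since only the text fragments are inspected.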

How to retrieve a price from inside an html block circumventing an <hr> tag

I'm using Python with BeautifulSoup.
I have a page with several of these html blocks:
<div class="col-sm-6 col-md-3"> <div class="thumbnail box-hover thumb-article-product"> <div class="ProductPicWrapper"> <div class="test"> <img width="120" height="120" src="https://cdn.smartoys.be/catalog/images/thumbs/120_120/products/x5055856419716.JPG.pagespeed.ic.uX1UW7-Gxw.webp" title="The Elder Scrolls Online : Summerset" alt="The Elder Scrolls Online : Summerset" class="img-responsive" data-pagespeed-url-hash="3637984103" onload="pagespeed.CriticalImages.checkImageForCriticality(this);"/> </div> </div> <div class="caption"> <p class="text-center nameart">The Elder Scrolls Online : Summerset</p> <p class="group inner list-group-item-heading nameart-ean text-center">5055856419716</br>Playstation 4</p> <hr> <p class="text-center article-price article-price-used ">Dès <span itemprop="price">10<span class="product-price-sm">.00€</span></span></p> <p class="text-center"> <span class="label"></span> </p> <div class="text-center"> <div class="btn-group"> Voir le produit </div> </div> </div> </div></div>
I would like to retrieve the price. I managed to do it from a page I saved locally from Chrome, but the html code is very different when getting it directly online.
From the downloaded page I just did the following to get the price (took out the loops for simplicity):
productblocks = soup.find_all("div",{"class": "col-sm-6 col-md-3"})
gameprice = productblocks[i].find("p", {"class": "text-center article-price article-price-used "}).text.encode('utf-8').strip()[:-3].replace('Dès ','')
However, when doing this with the online page, the following code does not include the price section:
productblocks = soup.find_all("div",{"class": "col-sm-6 col-md-3"})
I manage to get the name, code, etc. However, it seems that the price section is missing.
print productblocks[0]
returns:
<div class="col-sm-6 col-md-3"> <div class="thumbnail box-hover thumb-article-product"> <div class="ProductPicWrapper"> <div class="test"> <img alt="The Elder Scrolls Online : Summerset" class="img-responsive" height="120" src="https://cdn.smartoys.be/catalog/images/thumbs/120_120/products/x5055856419716.JPG.pagespeed.ic.CdYmLZol8V.jpg" title="The Elder Scrolls Online : Summerset" width="120"/> </div> </div> <div class="caption"> <p class="text-center nameart">The Elder Scrolls Online : Summerset</p><p class="group inner list-group-item-heading nameart-ean text-center">5055856419716</p></div></div></div>
which is obviously missing the price section. What am I doing wrong?
Thanks for the help.

Beautiful Soup does not seem to parse past the <hr> tag in your HTML. You can try this to get the price value.
Demo:
from bs4 import BeautifulSoup
s = """<div class="col-sm-6 col-md-3"> <div class="thumbnail box-hover thumb-article-product"> <div class="ProductPicWrapper"> <div class="test"> <img width="120" height="120" src="https://cdn.smartoys.be/catalog/images/thumbs/120_120/products/x5055856419716.JPG.pagespeed.ic.uX1UW7-Gxw.webp" title="The Elder Scrolls Online : Summerset" alt="The Elder Scrolls Online : Summerset" class="img-responsive" data-pagespeed-url-hash="3637984103" onload="pagespeed.CriticalImages.checkImageForCriticality(this);"/> </div> </div> <div class="caption"> <p class="text-center nameart">The Elder Scrolls Online : Summerset</p> <p class="group inner list-group-item-heading nameart-ean text-center">5055856419716</br>Playstation 4</p> <hr> <p class="text-center article-price article-price-used ">Dès <span itemprop="price">10<span class="product-price-sm">.00€</span></span></p> <p class="text-center"> <span class="label"></span> </p> <div class="text-center"> <div class="btn-group"> Voir le produit </div> </div> </div> </div></div>"""
soup = BeautifulSoup(s, "html.parser")
productblocks = soup.find_all("div",{"class": "col-sm-6 col-md-3"})
# locate the <p> just before the <hr>, then step to the next <p>, which holds the price
price_tag = productblocks[0].find("p", class_="group inner list-group-item-heading nameart-ean text-center").find_next("p")
# Python 3: work on the str directly instead of encoding to bytes first
print( price_tag.text.strip().replace('Dès ','').rstrip('€') )
Output:
10.00
Find the p tag before the hr and then use find_next("p") to get the price tag.

A regular expression is a simpler alternative (a contains your HTML); note that this pattern captures only the integer part of the price:
import re
re.findall( r'span itemprop="price">(\d+)<span', a )
['10']
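The pattern above drops the ".00" that lives in the nested product-price-sm span. If the full amount is needed, a sketch along these lines captures both parts and joins them into a Decimal. The extract_prices name is hypothetical, and the pattern assumes the two-span layout shown in the question.

```python
import re
from decimal import Decimal

# integer part, then the nested span holding the fractional part
PRICE = re.compile(r'itemprop="price">\s*(\d+)\s*<span[^>]*>\s*([.,]\d+)')


def extract_prices(html):
    """Return each price as a Decimal, joining the integer and fractional spans."""
    return [
        Decimal(whole + frac.replace(",", "."))
        for whole, frac in PRICE.findall(html)
    ]


a = 'Dès <span itemprop="price">10<span class="product-price-sm">.00€</span></span>'
print(extract_prices(a))
```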
