Python BeautifulSoup: finding an element after a specific string

I have the following html code:
<div class="xyOfqd">
<div class="aAAD">
<div class="Bgbcca">Updated</div>
<span class="hthtb">
<div>
<span class="hthtb">September 30, 2018</span>
</div>
</span>
</div>
<div class="aAAD">
<div class="Bgbcca">Text1</div>
<span class="hthtb">
<div><span class="hthtb">Text2</span></div>
</span>
</div>
<div
class="aAAD">
<div class="Bgbcca">MyText</div>
<span class="hthtb">
<div>
<span class="hthtb">Text3</span>
</div>
</span>
</div>
<div class="aAAD">
<div class="Bgbcca">Text4</div>
<span class="hthtb">
<div><span
class="hthtb">Text5</span></div>
</span>
</div>
<div class="aAAD">
<div
class="Bgbcca">Text6</div>
<span class="hthtb">
<div><span
class="hthtb">Text7</span></div>
</span>
</div>
<div class="aAAD">
<div
class="Bgbcca">
Text8</div>
<span class="hthtb">
<div>
<span class="hthtb">
<div>Text9</div>
<div>Text10</div>
</span>
</div>
</span>
</div>
<div class="aAAD">
<div
class="Bgbcca">Text11</div>
<span class="hthtb">
<div><span class="hthtb">Text12</span></div>
</span>
</div>
How can I find Text3 which is located right after the div element with the string of MyText?

You can use an lxml.html solution with an XPath expression:
from lxml import html
source = """
<div class="xyOfqd">
<div class="aAAD">
<div class="Bgbcca">Updated</div>
...
<span class="hthtb">
<div><span class="hthtb">Text12</span></div>
</span>
</div>"""
tree = html.fromstring(source)
print(tree.xpath('//div[.="MyText"]/following-sibling::span/div/span/text()'))
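The same sibling navigation can also be sketched with BeautifulSoup alone. This is a minimal sketch, assuming the label div contains nothing but the string MyText and that the value sits in the first span nested inside the following sibling span; the HTML below is a trimmed copy of one block from the question.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the relevant block from the question
html = """
<div class="aAAD">
<div class="Bgbcca">MyText</div>
<span class="hthtb">
<div>
<span class="hthtb">Text3</span>
</div>
</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Find the <div> whose only content is the string "MyText"
label = soup.find("div", string="MyText")

# Step to the next sibling <span>, skipping whitespace text nodes
outer_span = label.find_next_sibling("span")

# The value lives in the span nested inside that sibling
value = outer_span.find("span").get_text(strip=True)
print(value)
```

If the label might be missing, check that label and outer_span are not None before chaining further calls.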

If your structure always looks like the one shown, you can get the right value like this:
from bs4 import BeautifulSoup as bfs
html = """<div class="xyOfqd">
<div class="aAAD">
<div class="Bgbcca">Updated</div>
<span class="hthtb">
<div>
<span class="hthtb">September 30, 2018</span>
</div>
</span>
</div>
<div class="aAAD">
<div class="Bgbcca">Text1</div>
<span class="hthtb">
<div><span class="hthtb">Text2</span></div>
</span>
</div>
<div
class="aAAD">
<div class="Bgbcca">MyText</div>
<span class="hthtb">
<div>
<span class="hthtb">Text3</span>
</div>
</span>
</div>
<div class="aAAD">
<div class="Bgbcca">Text4</div>
<span class="hthtb">
<div><span
class="hthtb">Text5</span></div>
</span>
</div>
<div class="aAAD">
<div
class="Bgbcca">Text6</div>
<span class="hthtb">
<div><span
class="hthtb">Text7</span></div>
</span>
</div>
<div class="aAAD">
<div
class="Bgbcca">
Text8</div>
<span class="hthtb">
<div>
<span class="hthtb">
<div>Text9</div>
<div>Text10</div>
</span>
</div>
</span>
</div>
<div class="aAAD">
<div
class="Bgbcca">Text11</div>
<span class="hthtb">
<div><span class="hthtb">Text12</span></div>
</span>
</div>"""
soup = bfs(html, 'html.parser')
result = ''
for div0 in soup.find_all('div', {'class': 'aAAD'}):
    for div1 in div0.find_all('div', {'class': 'Bgbcca'}):
        if div1.get_text() == 'MyText':
            span = div0.find('span', {'class': 'hthtb'})
            if span:
                span_to_return = span.find('span', {'class': 'hthtb'})
                if span_to_return:
                    result = span_to_return.get_text()
print(result)

You can build a custom query function to pass into find():
def has_my_text(tag):
    found = tag.select_one('.Bgbcca')
    # important to assign the result to avoid calling
    # .get_text() on a NoneType, resulting in an error.
    if found:
        return found.get_text() == "MyText"
soup = bs4.... # assign your soup object
found = soup.find(has_my_text)
# <div class="Bgbcca">MyText</div>
# <span class="hthtb">
# <div>
# <span class="hthtb">Text3</span>
# </div>
# </span>
# </div>
# Note: your span class is nested, so we go two levels in
result = found.select_one('.hthtb').select_one('.hthtb').get_text()
# 'Text3'
# The line below also works if the other text inside the outer span is always whitespace
result = found.select_one('.hthtb').get_text().strip()
Note that find() and select_one() return only the first match found. If you need to handle multiple matches, use find_all() and select() instead and change your code accordingly.
If you want to handle variable texts, you can define your function like this:
def has_my_text(tag, text):
    found = tag.select_one('.Bgbcca')
    if found:
        return found.get_text() == text
And wrap the function in your find() like this:
txt = "MyText"
soup.find(lambda tag: has_my_text(tag, txt))
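For the multiple-match case mentioned above, here is a rough sketch rather than a definitive version. The helper name values_after and the trimmed HTML (with MyText repeated and a made-up value Text6) are assumptions for illustration only; the class names come from the question.

```python
from bs4 import BeautifulSoup

# Trimmed, hypothetical sample: the label "MyText" appears twice
html = """
<div class="xyOfqd">
<div class="aAAD">
<div class="Bgbcca">MyText</div>
<span class="hthtb"><div><span class="hthtb">Text3</span></div></span>
</div>
<div class="aAAD">
<div class="Bgbcca">Text4</div>
<span class="hthtb"><div><span class="hthtb">Text5</span></div></span>
</div>
<div class="aAAD">
<div class="Bgbcca">MyText</div>
<span class="hthtb"><div><span class="hthtb">Text6</span></div></span>
</div>
</div>
"""


def values_after(soup, label):
    """Return the value text of every div.aAAD block whose label matches."""
    results = []
    for block in soup.select("div.aAAD"):
        label_tag = block.select_one(".Bgbcca")
        # The inner span is the one nested inside the outer span.hthtb
        value_tag = block.select_one("span.hthtb span.hthtb")
        if label_tag is None or value_tag is None:
            continue
        if label_tag.get_text(strip=True) == label:
            results.append(value_tag.get_text(strip=True))
    return results


soup = BeautifulSoup(html, "html.parser")
matches = values_after(soup, "MyText")
print(matches)
```

Restricting the loop to div.aAAD keeps the outer wrapper div from matching just because its first .Bgbcca descendant happens to carry the label.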

Related

Webscraping HTML <time class=""></time>

I would like to get the date, but I only get None.
<div class="article-cover">
<div class="article-cover-img">
<img src="https://api.hvg.hu/Img/da658e97-86c0-40f3-acd3-b0a850f32c30/e6f183bb-25a9-468e-ae30-6d98952ffc00.jpg" alt="Will Smith lemondott amerikai filmakadémiai tagságáról" width="800" height="370">
</div>
<div class="article-cover-text">
<div class="article-info byline">
<div class="info">
<time class="article-datetime" datetime="2022-04-02T08:13:00.0000000+02:00">2022. április. 02. 08:13</time>
<time class="lastdate" datetime="2022-04-02T08:16:17.0000000+02:00">2022. április. 02. 08:16</time>
Kult
</div>
</div>
<div class="article-title article-title">
<h1>Will Smith lemondott amerikai filmakadémiai tagságáról</h1>
</div>
</div>
<button class="articlesavebutton bookmark large" data-id="994e8ff1-f28e-4153-9f6c-87283f187af7" data-event-category="Myhvg_article_save" data-event-action="ClickOnLink" data-event-label="Article_save_MTI"></button>
</div>
I would like to get back: 2022. április. 02. 08:13
My code is:
article_soup = BeautifulSoup(article.content, "html.parser")
d=article_soup.find('time', class\_='article-datetime')
The main issue is the typo class\_='article-datetime' (the keyword is class_, without the backslash). To get the text, simply use the get_text() method:
article_soup.find('time', class_='article-datetime').get_text()
or
article_soup.find('time', class_='article-datetime').text
Example
from bs4 import BeautifulSoup
html = '''
<div class="article-cover">
<div class="article-cover-img">
<img src="https://api.hvg.hu/Img/da658e97-86c0-40f3-acd3-b0a850f32c30/e6f183bb-25a9-468e-ae30-6d98952ffc00.jpg" alt="Will Smith lemondott amerikai filmakadémiai tagságáról" width="800" height="370">
</div>
<div class="article-cover-text">
<div class="article-info byline">
<div class="info">
<time class="article-datetime" datetime="2022-04-02T08:13:00.0000000+02:00">2022. április. 02. 08:13</time>
<time class="lastdate" datetime="2022-04-02T08:16:17.0000000+02:00">2022. április. 02. 08:16</time>
Kult
</div>
</div>
<div class="article-title article-title">
<h1>Will Smith lemondott amerikai filmakadémiai tagságáról</h1>
</div>
</div>
<button class="articlesavebutton bookmark large" data-id="994e8ff1-f28e-4153-9f6c-87283f187af7" data-event-category="Myhvg_article_save" data-event-action="ClickOnLink" data-event-label="Article_save_MTI"></button>
</div>'''
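# name the parser explicitly so the result does not depend on which parsers are installed
article_soup = BeautifulSoup(html, 'html.parser')
print(article_soup.find('time', class_='article-datetime').get_text())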
Output
2022. április. 02. 08:13
soup = BeautifulSoup(html, "html.parser")
adate = soup.findAll("time", {"class": "article-datetime"})
print(adate[0].get_text())
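Since find() returns None when nothing matches, a slightly more defensive sketch checks the result before reading from it. It also reads the machine-readable datetime attribute next to the visible text. The HTML is a trimmed copy of the question's markup; the variable names display_text and machine_value are just illustrative.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the <time> elements from the question
html = """
<div class="info">
<time class="article-datetime" datetime="2022-04-02T08:13:00.0000000+02:00">2022. április. 02. 08:13</time>
<time class="lastdate" datetime="2022-04-02T08:16:17.0000000+02:00">2022. április. 02. 08:16</time>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
time_tag = soup.find("time", class_="article-datetime")

if time_tag is None:
    # find() returns None when no element matches
    display_text = machine_value = None
else:
    display_text = time_tag.get_text(strip=True)  # visible text
    machine_value = time_tag.get("datetime")      # attribute value, or None

print(display_text)
print(machine_value)
```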

Scrape everything between two unnested tags

Is it possible to scrape everything between two unnested tags?
For instance:
<h3>Title 1<h3>
<div class="div">
<span class="span">span1</span>
<label class="label">label1</label>
</div>
<div class="div">
<span class="span">span2</span>
</div>
<h3>Title 2<h3>
<div class="div">
<span class="span">span3</span>
<label class="label">label2</label>
</div>
<div id="div">
<span id="span">span4</span>
</div>
So I would like to scrape only what is located between Title 1 and Title 2. Is this possible using bs4?
Right now I have something like this (the problem is that it scrapes everything, since the classes are all the same):
for i in soup.findAll("div",{"class":"div"}):
    print(i.span.text)
Now I get:
span1
span2
span3
span4
I'd like to get:
span1
span2
I don't know if this is the best solution to this problem, but you can split your text and scrape only the part that you need.
text = """
<h3>Title 1</h3>
<div class="div">
<span class="span">span1</span>
<label class="label">label1</label>
</div>
<div class="div">
<span class="span">span2</span>
</div>
<h3>Title 2</h3>
<div class="div">
<span class="span">span3</span>
<label class="label">label2</label>
</div>
<div id="div">
<span id="span">span4</span>
</div>
"""
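from bs4 import BeautifulSoup
soup = BeautifulSoup(text, 'lxml')
# split the raw markup at the "Title 2" heading text and keep only the first part
sub_text = text.split(soup.find('h3', string="Title 2").string)[0]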
This will give:
'"\n<h3>Title 1</h3>\n<div class="div">\n <span class="span">span1</span>\n <label class="label">label1</label>\n</div>\n<div class="div">\n <span class="span">span2</span>\n</div>\n<h3>'
After converting that string into a bs4 object, you can scrape all you need:
scrape_me = BeautifulSoup(sub_text, 'lxml')
for i in scrape_me.findAll("div", class_="div"):
    print(i.span.text)
# -> span1 span2
One approach is:
1. Find the second class="span", then navigate backwards with find_all_previous() to collect the preceding div tags.
2. The tags come back in reverse order, so use the reversed() function.
3. Find the <span> tag inside each div.
from bs4 import BeautifulSoup
html = """
<h3>Title 1<h3>
<div class="div">
<span class="span">span1</span>
<label class="label">label1</label>
</div>
<div class="div">
<span class="span">span2</span>
</div>
<h3>Title 2<h3>
<div class="div">
<span class="span">span3</span>
<label class="label">label2</label>
</div>
<div id="div">
<span id="span">span4</span>
</div>
"""
soup = BeautifulSoup(html, "lxml")
for tag in reversed(
    soup.select_one("div:nth-of-type(2) span.span").find_all_previous("div")
):
    print(tag.find("span").text)
Output:
span1
span2
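Another way to sketch this is to start at the first heading and walk its following siblings until the next heading appears. This assumes the headings are properly closed with </h3> (unlike the sample in the question) and that the headings and divs share the same parent; the HTML below is a condensed copy of the example.

```python
from bs4 import BeautifulSoup

# Condensed copy of the example, with the <h3> tags properly closed
html = """
<h3>Title 1</h3>
<div class="div"><span class="span">span1</span><label class="label">label1</label></div>
<div class="div"><span class="span">span2</span></div>
<h3>Title 2</h3>
<div class="div"><span class="span">span3</span><label class="label">label2</label></div>
<div id="div"><span id="span">span4</span></div>
"""

soup = BeautifulSoup(html, "html.parser")
start = soup.find("h3", string="Title 1")

spans = []
# find_next_siblings() yields the following sibling tags in document order
for sibling in start.find_next_siblings():
    if sibling.name == "h3":
        break  # reached the next heading, stop collecting
    span = sibling.find("span")
    if span is not None:
        spans.append(span.get_text())

print(spans)
```

This avoids splitting the raw markup and does not depend on how many divs sit under each heading.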

How to get all values of a nested div

I'd like to take all the values of a nested div.
<div class="upcoming-events__event js-event-filter" data-eventtype="Mission Day" data-region="AMER" data-eventdate="2019-07-13" data-latlon="40.167207,-105.101928" data-distance="8693671.480264762" style="display: block;">
<a href="http://events.ingress.com/MissionDay/Longmont2019" class="upcoming-events__event-link">
<img src="/assets/img/events/md-2019-7-longmontcousa.jpg" class="upcoming-events__image">
<div class="upcoming-events__content">
<img src="/assets/img/missionday.png" class="event-icon">
<div class="upcoming-events__date">Jul 13, 2019</div>
<div class="upcoming-events__location">Longmont, CO, USA</div>
</div>
</a>
</div>
For each element with the class upcoming-events__event, I'd like to print:
upcoming-events__location, upcoming-events__date.
For more information: upcoming-events__event-link
Using bs4, you can get the text.
from bs4 import BeautifulSoup
html='''<div class="upcoming-events__event js-event-filter" data-eventtype="Mission Day" data-region="AMER" data-eventdate="2019-07-13" data-latlon="40.167207,-105.101928" data-distance="8693671.480264762" style="display: block;">
<a href="http://events.ingress.com/MissionDay/Longmont2019" class="upcoming-events__event-link">
<img src="/assets/img/events/md-2019-7-longmontcousa.jpg" class="upcoming-events__image">
<div class="upcoming-events__content">
<img src="/assets/img/missionday.png" class="event-icon">
<div class="upcoming-events__date">Jul 13, 2019</div>
<div class="upcoming-events__location">Longmont, CO, USA</div>
</div>
</a>
</div>
'''
soup=BeautifulSoup(html,'lxml')
for i in soup.select('.upcoming-events__event'):
    location = i.select('.upcoming-events__location')[0].string
    date = i.select('.upcoming-events__date')[0].string
    link = i.select('.upcoming-events__event-link')[0].get('href')
    print(f'{location}, {date}\nFor more info :{link}')
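Indexing with [0] raises an IndexError as soon as one event lacks a location or link. A hedged variant, assuming some events may be incomplete, uses select_one() and checks for None. The second event in the sample is made up to show that case, and text_or_none is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# First event is trimmed from the question; the second is a made-up incomplete one
html = """
<div class="upcoming-events__event js-event-filter" data-eventdate="2019-07-13">
<a href="http://events.ingress.com/MissionDay/Longmont2019" class="upcoming-events__event-link">
<div class="upcoming-events__content">
<div class="upcoming-events__date">Jul 13, 2019</div>
<div class="upcoming-events__location">Longmont, CO, USA</div>
</div>
</a>
</div>
<div class="upcoming-events__event js-event-filter">
<div class="upcoming-events__content">
<div class="upcoming-events__date">Date to be announced</div>
</div>
</div>
"""


def text_or_none(parent, selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    tag = parent.select_one(selector)
    return tag.get_text(strip=True) if tag is not None else None


soup = BeautifulSoup(html, "html.parser")
events = []
for event in soup.select(".upcoming-events__event"):
    link = event.select_one("a.upcoming-events__event-link")
    events.append({
        "location": text_or_none(event, ".upcoming-events__location"),
        "date": text_or_none(event, ".upcoming-events__date"),
        "link": link.get("href") if link is not None else None,
    })

for item in events:
    print(item)
```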

How to extract divs and classes

I am new to Python and need to get the title, ISBN, price and publication date for my very first crawler.
<div class="col-md-7 col-sm-7">
<h4>Pocket Anatomy and Physiology, 3rd Edition</h4>
<div>Shirley A. Jones</div>
<div>ISBN-13: 978-0-8036-5658-1</div>
<p class="price"> $39.95 (US)</p>
<div class="prd_lst">
<ul class="book_list">
</ul>
<div class="mobile_add_tocart">
<button type="button" class="addtocart" onclick="window.location.href='https://shoppingcart.fadavis.com/ShoppingCart/AddToCart?guid=74779e63-ccfb-454e-a6b9-b4e9f9a50793&productid=10959&applicationid=5'"> <span class="cart_icon sprite pull-left"></span>Add to Cart</button>
</div>
<div class="popover bottom Available_tooltip"><div class="arrow"></div>
<div class="popover-content">
<ul class="book_list">
</ul>
<div class="clearfix"></div>
</div>
</div>
<div class="clearfix"></div>
</div>
<p>Publication Date: 10/12/2016</p>
<div class="available active">
<div class="available_icon sprite pull-left"></div>
Available</div>
</div>
import bs4
html = """
<div class="col-md-7 col-sm-7">
<h4>Pocket Anatomy and Physiology, 3rd Edition</h4>
<div>Shirley A. Jones</div>
<div>ISBN-13: 978-0-8036-5658-1</div>
<p class="price"> $39.95 (US)</p>
<div class="prd_lst">
<ul class="book_list">
</ul>
<div class="mobile_add_tocart">
<button type="button" class="addtocart" onclick="window.location.href='https://shoppingcart.fadavis.com/ShoppingCart/AddToCart?guid=74779e63-ccfb-454e-a6b9-b4e9f9a50793&productid=10959&applicationid=5'"> <span class="cart_icon sprite pull-left"></span>Add to Cart</button>
</div>
<div class="popover bottom Available_tooltip"><div class="arrow"></div>
<div class="popover-content">
<ul class="book_list">
</ul>
<div class="clearfix"></div>
</div>
</div>
<div class="clearfix"></div>
</div>
<p>Publication Date: 10/12/2016</p>
<div class="available active">
<div class="available_icon sprite pull-left"></div>
Available</div>
</div>
</div>
"""
soup=bs4.BeautifulSoup(html,'lxml')
div = soup.find('div', {'class': 'col-md-7'})
divs = div.findAll('div')
price = div.find('p', {'class': 'price'})
date = div.findAll('p')
print(divs[0].text)
print(divs[1].text)
print(price.text)
print(date[-1].text)
Output
Shirley A. Jones
ISBN-13: 978-0-8036-5658-1
$39.95 (US)
Publication Date: 10/12/2016
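The positional lookups above (divs[1], date[-1]) depend on the exact element order. As a sketch under the assumption that the labels "ISBN-13:" and "Publication Date:" always appear in the text, the fields can be located by label instead, and the title read from the h4. The HTML is a trimmed copy of the question's markup, and text_after is a hypothetical helper.

```python
import re
from bs4 import BeautifulSoup

# Trimmed copy of the product block from the question
html = """
<div class="col-md-7 col-sm-7">
<h4>Pocket Anatomy and Physiology, 3rd Edition</h4>
<div>Shirley A. Jones</div>
<div>ISBN-13: 978-0-8036-5658-1</div>
<p class="price"> $39.95 (US)</p>
<p>Publication Date: 10/12/2016</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
block = soup.find("div", class_="col-md-7")


def text_after(label):
    """Return the text following 'label' in the first string that contains it."""
    node = block.find(string=re.compile(re.escape(label)))
    if node is None:
        return None
    # Keep only what comes after the first colon
    return node.split(":", 1)[1].strip()


book = {
    "title": block.find("h4").get_text(strip=True),
    "isbn": text_after("ISBN-13:"),
    "price": block.find("p", class_="price").get_text(strip=True),
    "published": text_after("Publication Date:"),
}
print(book)
```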

How to extract data from two HTML pages?

I want to extract data from two HTML pages. When I go from one page to the other, some elements change: the data sits in a list, and the list's contents differ between pages.
My code for this problem:
details_containers = soup_page.findAll("div",{"id":"RESTAURANT_DETAILS"})
details_container = details_containers[0].findAll("div",{"class":"content"})
cuisine = details_container[0].text.strip()
print(cuisine)
meals = details_container[1].text.strip()
print(meals)
hotel_features = details_container[2].text.strip()
print(hotel_features)
From the first HTML page I want the cuisine, meals and restaurant_features content values. But it also contains extra content values for hours and average prices.
<div id="RESTAURANT_DETAILS" class="content_block details_block scroll_tabs" data-tab="TABS_DETAILS">
<div class="header_with_improve wrap">
<a href="/UpdateListing-g297595-d6384395-Ocellus-Raipur_Raipur_District_Chhattisgarh.html" onclick="ta.setEvtCookie('UpdateListing', 'entry-detail-moreinfo', null, 0, '/UpdateListingRedesign')">
<div class="improve_listing_btn ui_button primary">Improve this listing</div>
</a>
<h3 class="tabs_header">Restaurant Details</h3> </div>
<div class="details_tab">
<div class="table_section">
<div class="row">
<div class="ratingSummary wrap">
<div class="histogramCommon bubbleHistogram wrap">
<div class="colTitle">
Rating summary
</div>
<ul class="barChart">
<li>
<div class="ratingRow wrap">
<div class="label part ">
<span class="text">Food</span>
</div>
<div class="wrap row part ">
<span class="ui_bubble_rating bubble_45" alt="4.5 of 5 bubbles"></span>
</div>
</div>
<div class="ratingRow wrap">
<div class="label part ">
<span class="text">Service</span>
</div>
<div class="wrap row part ">
<span class="ui_bubble_rating bubble_35" alt="3.5 of 5 bubbles"></span>
</div>
</div>
</li>
<li>
<div class="ratingRow wrap">
<div class="label part ">
<span class="text">Value</span>
</div>
<div class="wrap row part ">
<span class="ui_bubble_rating bubble_35" alt="3.5 of 5 bubbles"></span>
</div>
</div>
</li>
</ul>
</div>
</div>
</div>
<div class="row">
<div class="title">
Average prices
</div>
<div class="content">
<span>₹ 448 -
₹ 768</span>
</div>
</div>
<div class="row">
<div class="title">
Cuisine
</div>
<div class="content">
Indian, Asian, Italian, French, Chinese, International, Vegetarian Friendly
</div>
</div>
<div class="row">
<div class="title">
Meals
</div>
<div class="content">
Breakfast, Lunch, Dinner, Brunch
</div>
</div>
<div class="row">
<div class="title">
Restaurant features
</div>
<div class="content">
Reservations, Seating, Takeout, Private Dining, Waitstaff
</div>
</div>
<div class="row">
<div class="title">
Good for
</div>
<div class="content">
Groups, Business meetings, Child-friendly
</div>
</div>
<div class="row">
<div class="hours title">
Open Hours
</div>
<div class="hours content">
<div class="detail">
<span class="day">Sunday</span>
<span class="hours"><div class="hoursRange">07:00 - 23:00</div></span>
</div>
<div class="detail">
<span class="day">Monday</span>
<span class="hours"><div class="hoursRange">07:00 - 23:00</div></span>
</div>
<div class="detail">
<span class="day">Tuesday</span>
<span class="hours"><div class="hoursRange">07:00 - 23:00</div></span>
</div>
<div class="detail">
<span class="day">Wednesday</span>
<span class="hours"><div class="hoursRange">07:00 - 23:00</div></span>
</div>
<div class="detail">
<span class="day">Thursday</span>
<span class="hours"><div class="hoursRange">07:00 - 23:00</div></span>
</div>
<div class="detail">
<span class="day">Friday</span>
<span class="hours"><div class="hoursRange">07:00 - 23:00</div></span>
</div>
<div class="detail">
<span class="day">Saturday</span>
<span class="hours"><div class="hoursRange">07:00 - 23:00</div></span>
</div>
</div>
</div>
</div>
<div class="additional_info">
<div class="title">
Location and Contact Information </div>
<div class="content">
<ul class="detailsContent">
<li>
<div class="detail">Address:
<span> <span class="format_address"><span class="street-address">G.E. Road</span> | <span class="extended-address">Mayura Hotel</span>, <span class="locality">Raipur 492001, </span><span class="country-name">India</span> </span>
</span>
</div>
</li>
<li>
<div class="detail">Location:
<span> Asia</span>
<span> > India</span>
<span> > Chhattisgarh</span>
<span> > Raipur District</span>
<span> > Raipur</span>
</div>
</li>
<li>
<div class="detail">Phone Number:
<span>+91 77142 00500</span>
</div>
</li>
<li>
<span class="ui_icon email"></span>
<a target="_blank"" href="mailto:banquet#themayurahotels.com" onclick="ta.trackEventOnPage('Eatery_Listing','Email','6384395')">
E-mail </a>
</li>
<!--trkP:waypoint_for_poi_2-->
<!-- PLACEMENT waypoint_for_poi -->
<div id="taplc_waypoint_for_poi_1" class="ppr_rup ppr_priv_waypoint_for_poi" data-placement-name="waypoint_for_poi">
</div>
<!--etk-->
</ul>
</div>
</div>
<!--[if lte IE 9]>
<style>
.details_block .threeColumnList{
height: 350px;
overflow: auto;
}
</style>
<![endif]-->
</div>
</div>
From the second HTML page I want the cuisine, meals and restaurant_features content values, as above. But here the extra content values for hours and average prices are not present.
<div id="RESTAURANT_DETAILS" class="content_block details_block scroll_tabs" data-tab="TABS_DETAILS">
<div class="header_with_improve wrap">
<a href="/UpdateListing-g297595-d8595502-Barbeque_Nation-Raipur_Raipur_District_Chhattisgarh.html" onclick="ta.setEvtCookie('UpdateListing', 'entry-detail-moreinfo', null, 0, '/UpdateListingRedesign')">
<div class="improve_listing_btn ui_button primary">Improve this listing</div>
</a>
<h3 class="tabs_header">Restaurant Details</h3> </div>
<div class="details_tab">
<div class="table_section">
<div class="row">
<div class="ratingSummary wrap">
<div class="histogramCommon bubbleHistogram wrap">
<div class="colTitle">
Rating summary
</div>
<ul class="barChart">
<li>
<div class="ratingRow wrap">
<div class="label part ">
<span class="text">Food</span>
</div>
<div class="wrap row part ">
<span class="ui_bubble_rating bubble_45" alt="4.5 of 5 bubbles"></span>
</div>
</div>
<div class="ratingRow wrap">
<div class="label part ">
<span class="text">Service</span>
</div>
<div class="wrap row part ">
<span class="ui_bubble_rating bubble_45" alt="4.5 of 5 bubbles"></span>
</div>
</div>
</li>
<li>
<div class="ratingRow wrap">
<div class="label part ">
<span class="text">Value</span>
</div>
<div class="wrap row part ">
<span class="ui_bubble_rating bubble_40" alt="4.0 of 5 bubbles"></span>
</div>
</div>
</li>
</ul>
</div>
</div>
</div>
<div class="row">
<div class="title">
Cuisine
</div>
<div class="content">
Indian, Barbecue, Asian, Vegetarian Friendly, Vegan Options, Gluten Free Options
</div>
</div>
<div class="row">
<div class="title">
Meals
</div>
<div class="content">
Lunch, Dinner
</div>
</div>
<div class="row">
<div class="title">
Restaurant features
</div>
<div class="content">
Reservations, Seating, Waitstaff, Wheelchair Accessible, Validated Parking
</div>
</div>
<div class="row">
<div class="title">
Good for
</div>
<div class="content">
Groups, Special Occasion Dining, Kids, Child-friendly
</div>
</div>
</div>
<div class="additional_info">
<div class="title">
Location and Contact Information </div>
<div class="content">
<ul class="detailsContent">
<li>
<div class="detail">Address:
<span> <span class="format_address"> | <span class="extended-address">Magneto The Mall, 2nd Floor</span>, <span class="locality">Raipur 429010, </span><span class="country-name">India</span> </span>
</span>
</div>
</li>
<li>
<div class="detail">Location:
<span> Asia</span>
<span> > India</span>
<span> > Chhattisgarh</span>
<span> > Raipur District</span>
<span> > Raipur</span>
</div>
</li>
<li>
<div class="detail">Phone Number:
<span>+91 77160 60008</span>
</div>
</li>
<li>
<span class="ui_icon email"></span>
<a target="_blank"" href="mailto:feedback#barbeque-nation.com" onclick="ta.trackEventOnPage('Eatery_Listing','Email','8595502')">
E-mail </a>
</li>
<!--trkP:waypoint_for_poi_2-->
<!-- PLACEMENT waypoint_for_poi -->
<div id="taplc_waypoint_for_poi_1" class="ppr_rup ppr_priv_waypoint_for_poi" data-placement-name="waypoint_for_poi">
</div>
<!--etk-->
</ul>
</div>
</div>
<!--[if lte IE 9]>
<style>
.details_block .threeColumnList{
height: 350px;
overflow: auto;
}
</style>
<![endif]-->
</div>
</div>
Instead of obtaining a list of all <div class="content"> blocks and selecting several by index (the indices change from the first page to the second), you can find all <div class="row"> elements, each of which contains a title and its respective content.
# details_container is assumed here to be the <div id="RESTAURANT_DETAILS"> Tag,
# e.g. details_containers[0] from the question, not the list of content divs
rows = details_container.findAll('div', {'class': 'row'})
# used to store data extracted from HTML <div class="row"> elements
data = {}
for row in rows:
    title = row.find('div', {'class': 'title'})
    content = row.find('div', {'class': 'content'})
    if title and content:
        # here I am just formatting the dict key to be more python-ish. totally optional
        title = title.text.strip().lower().replace(' ', '-')
        data[title] = content
# tested with the HTML from the first page
print(list(data.keys()))
#=> ['average-prices', 'cuisine', 'meals', 'restaurant-features', 'good-for', 'open-hours']
print(type(data['cuisine']))
#=> <class 'bs4.element.Tag'>
Now you can extract the content items from the HTML webpage without caring what order they appear in. This code should work on any HTML that has the same general structure as the two pages you provided. I hope this helps!
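To make the two-page case concrete, here is a sketch that wraps the same idea in a function and runs it on two heavily trimmed, hypothetical stand-ins for the pages. The function name extract_details and the sample strings are assumptions; only the id and class names come from the question.

```python
from bs4 import BeautifulSoup


def extract_details(html):
    """Map each row title (lowercased, dashed) to its content text."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", id="RESTAURANT_DETAILS")
    data = {}
    if container is None:
        return data
    for row in container.find_all("div", class_="row"):
        title = row.find("div", class_="title")
        content = row.find("div", class_="content")
        if title is not None and content is not None:
            key = title.get_text(strip=True).lower().replace(" ", "-")
            data[key] = content.get_text(" ", strip=True)
    return data


# Hypothetical, trimmed stand-ins for the two pages
page_one = """
<div id="RESTAURANT_DETAILS">
<div class="row"><div class="title">Average prices</div><div class="content"><span>448 - 768</span></div></div>
<div class="row"><div class="title">Cuisine</div><div class="content">Indian, Asian</div></div>
<div class="row"><div class="title">Meals</div><div class="content">Breakfast, Lunch</div></div>
</div>
"""
page_two = """
<div id="RESTAURANT_DETAILS">
<div class="row"><div class="title">Cuisine</div><div class="content">Indian, Barbecue</div></div>
<div class="row"><div class="title">Meals</div><div class="content">Lunch, Dinner</div></div>
</div>
"""

for page in (page_one, page_two):
    details = extract_details(page)
    # .get() returns None for rows that a page does not have
    print(details.get("cuisine"), "|", details.get("meals"), "|", details.get("average-prices"))
```

Looking values up by key means the extra rows on the first page no longer shift the positions of the ones you care about.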
