Web scraping html <time class=""></time> - python

I would like to get the date, but I only get None.
<div class="article-cover">
<div class="article-cover-img">
<img src="https://api.hvg.hu/Img/da658e97-86c0-40f3-acd3-b0a850f32c30/e6f183bb-25a9-468e-ae30-6d98952ffc00.jpg" alt="Will Smith lemondott amerikai filmakadémiai tagságáról" width="800" height="370">
</div>
<div class="article-cover-text">
<div class="article-info byline">
<div class="info">
<time class="article-datetime" datetime="2022-04-02T08:13:00.0000000+02:00">2022. április. 02. 08:13</time>
<time class="lastdate" datetime="2022-04-02T08:16:17.0000000+02:00">2022. április. 02. 08:16</time>
Kult
</div>
</div>
<div class="article-title article-title">
<h1>Will Smith lemondott amerikai filmakadémiai tagságáról</h1>
</div>
</div>
<button class="articlesavebutton bookmark large" data-id="994e8ff1-f28e-4153-9f6c-87283f187af7" data-event-category="Myhvg_article_save" data-event-action="ClickOnLink" data-event-label="Article_save_MTI"></button>
</div>
I would like to get back: 2022. április. 02. 08:13
My code is:
article_soup = BeautifulSoup(article.content, "html.parser")
d=article_soup.find('time', class\_='article-datetime')

The main issue is a typo: class\_='article-datetime' should be class_='article-datetime', without the backslash. To get the text, simply use the get_text() method:
article_soup.find('time', class_='article-datetime').get_text()
or
article_soup.find('time', class_='article-datetime').text
Example
from bs4 import BeautifulSoup
html = '''
<div class="article-cover">
<div class="article-cover-img">
<img src="https://api.hvg.hu/Img/da658e97-86c0-40f3-acd3-b0a850f32c30/e6f183bb-25a9-468e-ae30-6d98952ffc00.jpg" alt="Will Smith lemondott amerikai filmakadémiai tagságáról" width="800" height="370">
</div>
<div class="article-cover-text">
<div class="article-info byline">
<div class="info">
<time class="article-datetime" datetime="2022-04-02T08:13:00.0000000+02:00">2022. április. 02. 08:13</time>
<time class="lastdate" datetime="2022-04-02T08:16:17.0000000+02:00">2022. április. 02. 08:16</time>
Kult
</div>
</div>
<div class="article-title article-title">
<h1>Will Smith lemondott amerikai filmakadémiai tagságáról</h1>
</div>
</div>
<button class="articlesavebutton bookmark large" data-id="994e8ff1-f28e-4153-9f6c-87283f187af7" data-event-category="Myhvg_article_save" data-event-action="ClickOnLink" data-event-label="Article_save_MTI"></button>
</div>'''
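# Name the parser explicitly so the result does not depend on which parsers are installed
article_soup = BeautifulSoup(html, "html.parser")
print(article_soup.find('time', class_='article-datetime').get_text())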
Output
2022. április. 02. 08:13
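As a small extension of the answer above, here is a sketch that guards against find() returning None and also reads the machine-readable datetime attribute of the same tag. The shortened HTML snippet and the variable names display_text and iso_value are assumptions made for illustration only.

```python
from bs4 import BeautifulSoup

# Shortened copy of the question's markup (assumption: only the <time> tags matter here)
html = '''
<div class="info">
<time class="article-datetime" datetime="2022-04-02T08:13:00.0000000+02:00">2022. április. 02. 08:13</time>
<time class="lastdate" datetime="2022-04-02T08:16:17.0000000+02:00">2022. április. 02. 08:16</time>
</div>
'''

soup = BeautifulSoup(html, "html.parser")
tag = soup.find("time", class_="article-datetime")

# find() returns None when nothing matches, so check before using the tag
if tag is not None:
    display_text = tag.get_text(strip=True)  # the human-readable date
    iso_value = tag.get("datetime")          # the ISO-style attribute value
    print(display_text)
    print(iso_value)
else:
    print("No <time class='article-datetime'> element found")
```

If the real page still yields None, the tag may be missing from the HTML that was actually downloaded, which is worth checking by printing part of article.content.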

Alternatively, find_all() returns a list of all matches, so take the first element:
soup = BeautifulSoup(html, "html.parser")
adate = soup.find_all("time", {"class": "article-datetime"})
print(adate[0].get_text())

Related

Scrape everything between two unnested tags

Is it possible to scrape everything between two unnested tags?
For instance:
<h3>Title 1</h3>
<div class="div">
<span class="span">span1</span>
<label class="label">label1</label>
</div>
<div class="div">
<span class="span">span2</span>
</div>
<h3>Title 2</h3>
<div class="div">
<span class="span">span3</span>
<label class="label">label2</label>
</div>
<div id="div">
<span id="span">span4</span>
</div>
So I would like to scrape only what is located between Title 1 and Title 2. Is this possible using bs4?
Right now I have something like this (the problem is that it scrapes everything, since the classes are all the same):
for i in soup.findAll("div",{"class":"div"}):
    print(i.span.text)
Now I get:
span1
span2
span3
span4
I'd like to get:
span1
span2
I don't know if this is the best solution to this problem, but you can split your text and scrape only the part that you need.
text = """
<h3>Title 1</h3>
<div class="div">
<span class="span">span1</span>
<label class="label">label1</label>
</div>
<div class="div">
<span class="span">span2</span>
</div>
<h3>Title 2</h3>
<div class="div">
<span class="span">span3</span>
<label class="label">label2</label>
</div>
<div id="div">
<span id="span">span4</span>
</div>
"""
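from bs4 import BeautifulSoup
soup = BeautifulSoup(text, 'lxml')
# Cut the raw HTML at the "Title 2" heading text and keep only the part before it
sub_text = text.split(soup.find('h3', string="Title 2").string)[0]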
This will give:
'\n<h3>Title 1</h3>\n<div class="div">\n <span class="span">span1</span>\n <label class="label">label1</label>\n</div>\n<div class="div">\n <span class="span">span2</span>\n</div>\n<h3>'
After converting that string into a bs4 object, you can scrape all you need:
scrape_me = BeautifulSoup(sub_text, 'lxml')
for i in scrape_me.find_all("div", class_="div"):
    print(i.span.text)
# prints span1 and span2, one per line
One approach is:
Find the second class="span" element, then navigate backwards with find_all_previous() to collect the preceding div tags.
The tags are returned in reverse document order, so use the reversed() function.
Then find the <span> tag inside each div.
from bs4 import BeautifulSoup
html = """
<h3>Title 1</h3>
<div class="div">
<span class="span">span1</span>
<label class="label">label1</label>
</div>
<div class="div">
<span class="span">span2</span>
</div>
<h3>Title 2</h3>
<div class="div">
<span class="span">span3</span>
<label class="label">label2</label>
</div>
<div id="div">
<span id="span">span4</span>
</div>
"""
soup = BeautifulSoup(html, "lxml")
for tag in reversed(
    soup.select_one("div:nth-of-type(2) span.span").find_all_previous("div")
):
    print(tag.find("span").text)
Output:
span1
span2
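Another way to express "everything between Title 1 and Title 2" is to walk the siblings that follow the first heading and stop at the next heading. The sketch below assumes well-formed headings (closed with </h3>) and that the headings and divs are siblings of one another; the results list is a name introduced here for illustration.

```python
from bs4 import BeautifulSoup

html = """
<h3>Title 1</h3>
<div class="div">
<span class="span">span1</span>
<label class="label">label1</label>
</div>
<div class="div">
<span class="span">span2</span>
</div>
<h3>Title 2</h3>
<div class="div">
<span class="span">span3</span>
<label class="label">label2</label>
</div>
<div id="div">
<span id="span">span4</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
start = soup.find("h3", string="Title 1")

results = []
# find_next_siblings() yields the tags after the heading, in document order
for sibling in start.find_next_siblings():
    if sibling.name == "h3":
        break  # reached the next heading, so the section is over
    span = sibling.find("span")
    if span is not None:
        results.append(span.get_text())

print(results)
```

This avoids re-parsing a substring of the page and does not depend on the position of a particular div.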

BeautifulSoup find_all method returns the same elements

Hi here is my soup object:
<td class="kategorie">
<div data-navi-cat="c5ff5b1d0dc93c">
Herren
</div>
<div data-navi-cat="c5ff5b1d0dc95f">
Frauen
</div>
<div data-navi-cat="c5ff5b1d0dc978">
A-Jugend (U19)
</div>
<div data-navi-cat="c5ff5b1d0dc98c">
B-Jugend (U17)
</div>
<div data-navi-cat="c5ff5b1d0dc9a2">
C-Jugend (U15)
</div>
<div data-navi-cat="c5ff5b1d0dc9b1">
U17-Juniorinnen
</div>
<div data-navi-cat="c5ff5b1d0dc9b6">
Futsal
</div>
<div data-navi-cat="c5ff5b1d0dc9bd">
eSport
</div>
</td>
How can I get all the c-codes and their corresponding text from the object?
For example: c-code: "c5ff5b1d0dc93c" and its corresponding text: "Herren" for the first row...
My code looks like this (categories is the soup object):
for category in categories.find_all('div'):
    category = categories.find('div')
    print(category)
I only receive the information of the first row....
<div data-navi-cat="c5ff5b1d0dc93c">Herren</div>
<div data-navi-cat="c5ff5b1d0dc93c">Herren</div>
<div data-navi-cat="c5ff5b1d0dc93c">Herren</div>
<div data-navi-cat="c5ff5b1d0dc93c">Herren</div>
<div data-navi-cat="c5ff5b1d0dc93c">Herren</div>
<div data-navi-cat="c5ff5b1d0dc93c">Herren</div>
<div data-navi-cat="c5ff5b1d0dc93c">Herren</div>
<div data-navi-cat="c5ff5b1d0dc93c">Herren</div>
What happens?
categories holds your HTML.
Inside your loop you call category = categories.find('div'). find('div') always returns the first occurrence, so category is always <div data-navi-cat="c5ff5b1d0dc93c">Herren</div>, regardless of the loop variable.
Instead, work with the element that find_all() yields on each iteration: call get_text() on it to get the text and get('data-navi-cat') to get the code.
Example
from bs4 import BeautifulSoup
html = '''<td class="kategorie">
<div data-navi-cat="c5ff5b1d0dc93c">
Herren
</div>
<div data-navi-cat="c5ff5b1d0dc95f">
Frauen
</div>
<div data-navi-cat="c5ff5b1d0dc978">
A-Jugend (U19)
</div>
<div data-navi-cat="c5ff5b1d0dc98c">
B-Jugend (U17)
</div>
<div data-navi-cat="c5ff5b1d0dc9a2">
C-Jugend (U15)
</div>
<div data-navi-cat="c5ff5b1d0dc9b1">
U17-Juniorinnen
</div>
<div data-navi-cat="c5ff5b1d0dc9b6">
Futsal
</div>
<div data-navi-cat="c5ff5b1d0dc9bd">
eSport
</div>
</td>'''
soup = BeautifulSoup(html, "lxml")
for element in soup.find_all('div'):
    category = element.get_text()
    code = element.get('data-navi-cat')
    print(category, code)
Output
Herren
c5ff5b1d0dc93c
Frauen
c5ff5b1d0dc95f
A-Jugend (U19)
c5ff5b1d0dc978
B-Jugend (U17)
c5ff5b1d0dc98c
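If the goal is a lookup from code to label, the same loop can be written as a dictionary comprehension. This is only a sketch on a shortened copy of the markup; the name codes and the use of strip=True to drop surrounding whitespace are choices made here, not part of the original answer.

```python
from bs4 import BeautifulSoup

# Shortened copy of the question's markup (three of the eight rows)
html = '''<td class="kategorie">
<div data-navi-cat="c5ff5b1d0dc93c">
Herren
</div>
<div data-navi-cat="c5ff5b1d0dc95f">
Frauen
</div>
<div data-navi-cat="c5ff5b1d0dc978">
A-Jugend (U19)
</div>
</td>'''

soup = BeautifulSoup(html, "html.parser")

# attrs={"data-navi-cat": True} keeps only divs that actually carry the attribute
codes = {
    div["data-navi-cat"]: div.get_text(strip=True)
    for div in soup.find_all("div", attrs={"data-navi-cat": True})
}

for code, label in codes.items():
    print(code, label)
```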

How to get all values of a nested div

I'd like to take all the values of a nested div.
<div class="upcoming-events__event js-event-filter" data-eventtype="Mission Day" data-region="AMER" data-eventdate="2019-07-13" data-latlon="40.167207,-105.101928" data-distance="8693671.480264762" style="display: block;">
<a href="http://events.ingress.com/MissionDay/Longmont2019" class="upcoming-events__event-link">
<img src="/assets/img/events/md-2019-7-longmontcousa.jpg" class="upcoming-events__image">
<div class="upcoming-events__content">
<img src="/assets/img/missionday.png" class="event-icon">
<div class="upcoming-events__date">Jul 13, 2019</div>
<div class="upcoming-events__location">Longmont, CO, USA</div>
</div>
</a>
</div>
For each element with the class upcoming-events__event, I'd like to print
upcoming-events__location, upcoming-events__date.
For more information: upcoming-events__event-link
Using bs4, you can get the text.
from bs4 import BeautifulSoup
html='''<div class="upcoming-events__event js-event-filter" data-eventtype="Mission Day" data-region="AMER" data-eventdate="2019-07-13" data-latlon="40.167207,-105.101928" data-distance="8693671.480264762" style="display: block;">
<a href="http://events.ingress.com/MissionDay/Longmont2019" class="upcoming-events__event-link">
<img src="/assets/img/events/md-2019-7-longmontcousa.jpg" class="upcoming-events__image">
<div class="upcoming-events__content">
<img src="/assets/img/missionday.png" class="event-icon">
<div class="upcoming-events__date">Jul 13, 2019</div>
<div class="upcoming-events__location">Longmont, CO, USA</div>
</div>
</a>
</div>
'''
soup = BeautifulSoup(html, 'lxml')
for i in soup.select('.upcoming-events__event'):
    location = i.select('.upcoming-events__location')[0].string
    date = i.select('.upcoming-events__date')[0].string
    link = i.select('.upcoming-events__event-link')[0].get('href')
    print(f'{location}, {date}\nFor more info :{link}')
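A slightly more defensive variant, sketched under the assumption that some events might lack one of the inner elements: select_one() returns None instead of raising IndexError, and the values are collected into a list of dictionaries. The names events and event_type are introduced here for illustration; the markup is a shortened copy of the question's.

```python
from bs4 import BeautifulSoup

html = '''<div class="upcoming-events__event js-event-filter" data-eventtype="Mission Day" data-region="AMER" data-eventdate="2019-07-13">
<a href="http://events.ingress.com/MissionDay/Longmont2019" class="upcoming-events__event-link">
<div class="upcoming-events__content">
<div class="upcoming-events__date">Jul 13, 2019</div>
<div class="upcoming-events__location">Longmont, CO, USA</div>
</div>
</a>
</div>
'''

soup = BeautifulSoup(html, "html.parser")

events = []
for event in soup.select(".upcoming-events__event"):
    location = event.select_one(".upcoming-events__location")
    date = event.select_one(".upcoming-events__date")
    link = event.select_one("a.upcoming-events__event-link")
    events.append({
        # Fall back to None when an inner element is missing
        "location": location.get_text(strip=True) if location else None,
        "date": date.get_text(strip=True) if date else None,
        "link": link.get("href") if link else None,
        # data-* attributes on the outer div are read with .get()
        "event_type": event.get("data-eventtype"),
    })

for item in events:
    print(f"{item['location']}, {item['date']}\nFor more info: {item['link']}")
```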

Python Beautifulsoup: finding an element after a specific string

I have the following html code:
<div class="xyOfqd">
<div class="aAAD">
<div class="Bgbcca">Updated</div>
<span class="hthtb">
<div>
<span class="hthtb">September 30, 2018</span>
</div>
</span>
</div>
<div class="aAAD">
<div class="Bgbcca">Text1</div>
<span class="hthtb">
<div><span class="hthtb">Text2</span></div>
</span>
</div>
<div
class="aAAD">
<div class="Bgbcca">MyText</div>
<span class="hthtb">
<div>
<span class="hthtb">Text3</span>
</div>
</span>
</div>
<div class="aAAD">
<div class="Bgbcca">Text4</div>
<span class="hthtb">
<div><span
class="hthtb">Text5</span></div>
</span>
</div>
<div class="aAAD">
<div
class="Bgbcca">Text6</div>
<span class="hthtb">
<div><span
class="hthtb">Text7</span></div>
</span>
</div>
<div class="aAAD">
<div
class="Bgbcca">
Text8</div>
<span class="hthtb">
<div>
<span class="hthtb">
<div>Text9</div>
<div>Text10</div>
</span>
</div>
</span>
</div>
<div class="aAAD">
<div
class="Bgbcca">Text11</div>
<span class="hthtb">
<div><span class="hthtb">Text12</span></div>
</span>
</div>
How can I find Text3 which is located right after the div element with the string of MyText?
You can use an lxml.html solution:
from lxml import html
source = """
<div class="xyOfqd">
<div class="aAAD">
<div class="Bgbcca">Updated</div>
...
<span class="hthtb">
<div><span class="hthtb">Text12</span></div>
</span>
</div>"""
tree = html.fromstring(source)
print(tree.xpath('//div[.="MyText"]/following-sibling::span/div/span/text()'))
If your structure is always like the one shown, you can get the right value like this:
from bs4 import BeautifulSoup as bfs
html = """<div class="xyOfqd">
<div class="aAAD">
<div class="Bgbcca">Updated</div>
<span class="hthtb">
<div>
<span class="hthtb">September 30, 2018</span>
</div>
</span>
</div>
<div class="aAAD">
<div class="Bgbcca">Text1</div>
<span class="hthtb">
<div><span class="hthtb">Text2</span></div>
</span>
</div>
<div
class="aAAD">
<div class="Bgbcca">MyText</div>
<span class="hthtb">
<div>
<span class="hthtb">Text3</span>
</div>
</span>
</div>
<div class="aAAD">
<div class="Bgbcca">Text4</div>
<span class="hthtb">
<div><span
class="hthtb">Text5</span></div>
</span>
</div>
<div class="aAAD">
<div
class="Bgbcca">Text6</div>
<span class="hthtb">
<div><span
class="hthtb">Text7</span></div>
</span>
</div>
<div class="aAAD">
<div
class="Bgbcca">
Text8</div>
<span class="hthtb">
<div>
<span class="hthtb">
<div>Text9</div>
<div>Text10</div>
</span>
</div>
</span>
</div>
<div class="aAAD">
<div
class="Bgbcca">Text11</div>
<span class="hthtb">
<div><span class="hthtb">Text12</span></div>
</span>
</div>"""
soup = bfs(html, 'html.parser')
result = ''
for div0 in soup.find_all('div',{'class':'aAAD'}):
    for div1 in div0.find_all('div', {'class':'Bgbcca'}):
        if div1.get_text() == 'MyText':
            span = div0.find('span',{'class':'hthtb'})
            if span:
                span_to_return = span.find('span',{'class':'hthtb'})
                if span_to_return:
                    result = span_to_return.get_text()
print(result)
You can build a custom query function to pass into find():
def has_my_text(tag):
    found = tag.select_one('.Bgbcca')
    # important to assign the result to avoid calling
    # .get_text() on a NoneType, resulting in an error.
    if found:
        return found.get_text() == "MyText"
soup = bs4.... # assign your soup object
found = soup.find(has_my_text)
# <div class="aAAD">
# <div class="Bgbcca">MyText</div>
# <span class="hthtb">
# <div>
# <span class="hthtb">Text3</span>
# </div>
# </span>
# </div>
# Note that your span class is nested, so we go two levels in
result = found.select_one('.hthtb').select_one('.hthtb').get_text()
# 'Text3'
# The line below also works if the outer span contains no other text
result = found.select_one('.hthtb').get_text().strip()
Note that find() and select_one() return only the first match found. If you need to handle multiple matches, you'll need to use find_all() and select() and change your code accordingly.
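To illustrate the multiple-match case, here is a sketch that uses find_all() with a custom filter function. The sample markup (two blocks labelled MyText) is invented for the example, and the filter is restricted to the aAAD blocks on the assumption that wrapper elements should not be matched as well.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two matching blocks, modelled on the question's structure
html = """
<div class="xyOfqd">
<div class="aAAD"><div class="Bgbcca">MyText</div>
<span class="hthtb"><div><span class="hthtb">Text3</span></div></span></div>
<div class="aAAD"><div class="Bgbcca">Other</div>
<span class="hthtb"><div><span class="hthtb">TextX</span></div></span></div>
<div class="aAAD"><div class="Bgbcca">MyText</div>
<span class="hthtb"><div><span class="hthtb">Text9</span></div></span></div>
</div>
"""

def has_my_text(tag):
    # Only consider the aAAD blocks, otherwise an outer wrapper could match too
    if "aAAD" not in tag.get("class", []):
        return False
    label = tag.select_one(".Bgbcca")
    return label is not None and label.get_text(strip=True) == "MyText"

soup = BeautifulSoup(html, "html.parser")

# ".hthtb .hthtb" selects the inner span nested inside the outer one
values = [
    block.select_one(".hthtb .hthtb").get_text(strip=True)
    for block in soup.find_all(has_my_text)
]
print(values)
```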
If you want to handle variable texts, you can define your function like this:
def has_my_text(tag, text):
    found = tag.select_one('.Bgbcca')
    if found:
        return found.get_text() == text
And wrap the function in your find() like this:
txt = "MyText"
soup.find(lambda tag: has_my_text(tag, txt))
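Another option, sketched here under the assumption that the label div contains only the text MyText and that the value span is its next sibling, is to locate the label directly and step to the following sibling. The names label, outer and inner are introduced for illustration, and the markup is a shortened copy of the question's.

```python
from bs4 import BeautifulSoup

html = """
<div class="xyOfqd">
<div class="aAAD">
<div class="Bgbcca">Text1</div>
<span class="hthtb">
<div><span class="hthtb">Text2</span></div>
</span>
</div>
<div class="aAAD">
<div class="Bgbcca">MyText</div>
<span class="hthtb">
<div>
<span class="hthtb">Text3</span>
</div>
</span>
</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Match the label div by class and by its exact text
label = soup.find("div", class_="Bgbcca", string="MyText")

result = None
if label is not None:
    # The outer span follows the label as a sibling; the value sits in the inner span
    outer = label.find_next_sibling("span", class_="hthtb")
    inner = outer.find("span", class_="hthtb") if outer else None
    if inner is not None:
        result = inner.get_text(strip=True)

print(result)
```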

How to extract divs and classes

I am new to Python and need to get the title, ISBN, price and publication date for my very first crawler.
<div class="col-md-7 col-sm-7">
<h4>Pocket Anatomy and Physiology, 3rd Edition</h4>
<div>Shirley A. Jones</div>
<div>ISBN-13: 978-0-8036-5658-1</div>
<p class="price"> $39.95 (US)</p>
<div class="prd_lst">
<ul class="book_list">
</ul>
<div class="mobile_add_tocart">
<button type="button" class="addtocart" onclick="window.location.href='https://shoppingcart.fadavis.com/ShoppingCart/AddToCart?guid=74779e63-ccfb-454e-a6b9-b4e9f9a50793&productid=10959&applicationid=5'"> <span class="cart_icon sprite pull-left"></span>Add to Cart</button>
</div>
<div class="popover bottom Available_tooltip"><div class="arrow"></div>
<div class="popover-content">
<ul class="book_list">
</ul>
<div class="clearfix"></div>
</div>
</div>
<div class="clearfix"></div>
</div>
<p>Publication Date: 10/12/2016</p>
<div class="available active">
<div class="available_icon sprite pull-left"></div>
Available</div>
</div>
import bs4
html = """
<div class="col-md-7 col-sm-7">
<h4>Pocket Anatomy and Physiology, 3rd Edition</h4>
<div>Shirley A. Jones</div>
<div>ISBN-13: 978-0-8036-5658-1</div>
<p class="price"> $39.95 (US)</p>
<div class="prd_lst">
<ul class="book_list">
</ul>
<div class="mobile_add_tocart">
<button type="button" class="addtocart" onclick="window.location.href='https://shoppingcart.fadavis.com/ShoppingCart/AddToCart?guid=74779e63-ccfb-454e-a6b9-b4e9f9a50793&productid=10959&applicationid=5'"> <span class="cart_icon sprite pull-left"></span>Add to Cart</button>
</div>
<div class="popover bottom Available_tooltip"><div class="arrow"></div>
<div class="popover-content">
<ul class="book_list">
</ul>
<div class="clearfix"></div>
</div>
</div>
<div class="clearfix"></div>
</div>
<p>Publication Date: 10/12/2016</p>
<div class="available active">
<div class="available_icon sprite pull-left"></div>
Available</div>
</div>
</div>
"""
soup = bs4.BeautifulSoup(html, 'lxml')
# The outer container that holds all the book details
div = soup.find('div', {'class': 'col-md-7'})
divs = div.find_all('div')
price = div.find('p', {'class': 'price'})
date = div.find_all('p')
print(divs[0].text)   # author
print(divs[1].text)   # ISBN
print(price.text)     # price
print(date[-1].text)  # publication date (the last <p>)
Output
Shirley A. Jones
ISBN-13: 978-0-8036-5658-1
$39.95 (US)
Publication Date: 10/12/2016
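The question also asks for the title, which sits in the h4 tag. The sketch below extracts the four requested fields by tag name, class, or text prefix rather than by position. It assumes the labels "ISBN" and "Publication Date" always prefix their values; the shortened markup and the variable names are for illustration only.

```python
from bs4 import BeautifulSoup

# Shortened copy of the question's markup
html = """
<div class="col-md-7 col-sm-7">
<h4>Pocket Anatomy and Physiology, 3rd Edition</h4>
<div>Shirley A. Jones</div>
<div>ISBN-13: 978-0-8036-5658-1</div>
<p class="price"> $39.95 (US)</p>
<p>Publication Date: 10/12/2016</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
book = soup.find("div", class_="col-md-7")

title = book.find("h4").get_text(strip=True)
price = book.find("p", class_="price").get_text(strip=True)

# Match tags by the start of their text; s is None for tags without a single string
isbn_tag = book.find("div", string=lambda s: s and s.startswith("ISBN"))
date_tag = book.find("p", string=lambda s: s and s.startswith("Publication Date"))

# Keep only the part after the first colon
isbn = isbn_tag.get_text(strip=True).split(":", 1)[1].strip()
pub_date = date_tag.get_text(strip=True).split(":", 1)[1].strip()

print(title)
print(isbn)
print(price)
print(pub_date)
```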
