How to get all values of a nested div - python

I'd like to take all the values of a nested div.
<div class="upcoming-events__event js-event-filter" data-eventtype="Mission Day" data-region="AMER" data-eventdate="2019-07-13" data-latlon="40.167207,-105.101928" data-distance="8693671.480264762" style="display: block;">
<a href="http://events.ingress.com/MissionDay/Longmont2019" class="upcoming-events__event-link">
<img src="/assets/img/events/md-2019-7-longmontcousa.jpg" class="upcoming-events__image">
<div class="upcoming-events__content">
<img src="/assets/img/missionday.png" class="event-icon">
<div class="upcoming-events__date">Jul 13, 2019</div>
<div class="upcoming-events__location">Longmont, CO, USA</div>
</div>
</a>
</div>
For each element with the class upcoming-events__event, I'd like to print:
upcoming-events__location, upcoming-events__date.
For more information: upcoming-events__event-link

Using bs4, you can select each event, then read the text of its nested elements and the href of its link.
from bs4 import BeautifulSoup
html='''<div class="upcoming-events__event js-event-filter" data-eventtype="Mission Day" data-region="AMER" data-eventdate="2019-07-13" data-latlon="40.167207,-105.101928" data-distance="8693671.480264762" style="display: block;">
<a href="http://events.ingress.com/MissionDay/Longmont2019" class="upcoming-events__event-link">
<img src="/assets/img/events/md-2019-7-longmontcousa.jpg" class="upcoming-events__image">
<div class="upcoming-events__content">
<img src="/assets/img/missionday.png" class="event-icon">
<div class="upcoming-events__date">Jul 13, 2019</div>
<div class="upcoming-events__location">Longmont, CO, USA</div>
</div>
</a>
</div>
'''
soup = BeautifulSoup(html, 'lxml')
for i in soup.select('.upcoming-events__event'):
    location = i.select('.upcoming-events__location')[0].string
    date = i.select('.upcoming-events__date')[0].string
    link = i.select('.upcoming-events__event-link')[0].get('href')
    print(f'{location}, {date}\nFor more info: {link}')
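If some events may lack a date, location, or link, indexing with [0] raises an IndexError. Below is a minimal sketch of a more defensive variant, assuming the same class names as in the question. The helper names text_or_none and event_rows, the shortened sample HTML, and the example.org link are illustrative and not part of the original page.

```python
from bs4 import BeautifulSoup

# Shortened sample; the second event deliberately has no location (assumption).
html = '''
<div class="upcoming-events__event">
<a href="http://events.ingress.com/MissionDay/Longmont2019" class="upcoming-events__event-link">
<div class="upcoming-events__content">
<div class="upcoming-events__date">Jul 13, 2019</div>
<div class="upcoming-events__location">Longmont, CO, USA</div>
</div>
</a>
</div>
<div class="upcoming-events__event">
<a href="http://example.org/no-location" class="upcoming-events__event-link">
<div class="upcoming-events__content">
<div class="upcoming-events__date">Aug 1, 2019</div>
</div>
</a>
</div>
'''


def text_or_none(parent, selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    tag = parent.select_one(selector)
    return tag.get_text(strip=True) if tag else None


def event_rows(soup):
    """Collect one dict per event block, tolerating missing children."""
    rows = []
    for event in soup.select('.upcoming-events__event'):
        link = event.select_one('.upcoming-events__event-link')
        rows.append({
            'location': text_or_none(event, '.upcoming-events__location'),
            'date': text_or_none(event, '.upcoming-events__date'),
            'link': link.get('href') if link else None,
        })
    return rows


rows = event_rows(BeautifulSoup(html, 'html.parser'))
for row in rows:
    print(row)
```

select_one() returns None instead of raising when nothing matches, which is why each lookup is checked before its text or attribute is read.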

Related

Webscraping html <time class=""></time>

I would like to get the date, but I only get None.
<div class="article-cover">
<div class="article-cover-img">
<img src="https://api.hvg.hu/Img/da658e97-86c0-40f3-acd3-b0a850f32c30/e6f183bb-25a9-468e-ae30-6d98952ffc00.jpg" alt="Will Smith lemondott amerikai filmakadémiai tagságáról" width="800" height="370">
</div>
<div class="article-cover-text">
<div class="article-info byline">
<div class="info">
<time class="article-datetime" datetime="2022-04-02T08:13:00.0000000+02:00">2022. április. 02. 08:13</time>
<time class="lastdate" datetime="2022-04-02T08:16:17.0000000+02:00">2022. április. 02. 08:16</time>
Kult
</div>
</div>
<div class="article-title article-title">
<h1>Will Smith lemondott amerikai filmakadémiai tagságáról</h1>
</div>
</div>
<button class="articlesavebutton bookmark large" data-id="994e8ff1-f28e-4153-9f6c-87283f187af7" data-event-category="Myhvg_article_save" data-event-action="ClickOnLink" data-event-label="Article_save_MTI"></button>
</div>
I would like to get back: 2022. április. 02. 08:13
My code is:
article_soup = BeautifulSoup(article.content, "html.parser")
d=article_soup.find('time', class\_='article-datetime')
The main issue is the typo class\_='article-datetime': the keyword argument should be class_. To get the text, simply use the get_text() method:
article_soup.find('time', class_='article-datetime').get_text()
or
article_soup.find('time', class_='article-datetime').text
Example
html = '''
<div class="article-cover">
<div class="article-cover-img">
<img src="https://api.hvg.hu/Img/da658e97-86c0-40f3-acd3-b0a850f32c30/e6f183bb-25a9-468e-ae30-6d98952ffc00.jpg" alt="Will Smith lemondott amerikai filmakadémiai tagságáról" width="800" height="370">
</div>
<div class="article-cover-text">
<div class="article-info byline">
<div class="info">
<time class="article-datetime" datetime="2022-04-02T08:13:00.0000000+02:00">2022. április. 02. 08:13</time>
<time class="lastdate" datetime="2022-04-02T08:16:17.0000000+02:00">2022. április. 02. 08:16</time>
Kult
</div>
</div>
<div class="article-title article-title">
<h1>Will Smith lemondott amerikai filmakadémiai tagságáról</h1>
</div>
</div>
<button class="articlesavebutton bookmark large" data-id="994e8ff1-f28e-4153-9f6c-87283f187af7" data-event-category="Myhvg_article_save" data-event-action="ClickOnLink" data-event-label="Article_save_MTI"></button>
</div>'''
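from bs4 import BeautifulSoup
# Name the parser explicitly to avoid the "no parser specified" warning
article_soup = BeautifulSoup(html, 'html.parser')
print(article_soup.find('time', class_='article-datetime').get_text())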
Output
2022. április. 02. 08:13
Alternatively, use find_all(), which returns a list of all matches:
soup = BeautifulSoup(html, "html.parser")
adate = soup.find_all("time", {"class": "article-datetime"})
print(adate[0].get_text())
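Besides the visible text, the <time> tag also carries a machine-readable datetime attribute. The sketch below reads both and converts the attribute into a datetime object. It assumes the attribute always has the shape shown in the question (seven fractional digits followed by a UTC offset); the regular expression that trims the fraction to six digits is an illustrative workaround so that datetime.fromisoformat() also accepts the value on Python versions before 3.11.

```python
import re
from datetime import datetime

from bs4 import BeautifulSoup

html = ('<time class="article-datetime" '
        'datetime="2022-04-02T08:13:00.0000000+02:00">'
        '2022. április. 02. 08:13</time>')

soup = BeautifulSoup(html, 'html.parser')
tag = soup.find('time', class_='article-datetime')

# find() returns None when nothing matches, so guard before using the tag.
display_text = tag.get_text(strip=True) if tag else None
raw = tag.get('datetime') if tag else None

parsed = None
if raw:
    # Keep at most six fractional digits (microseconds) before parsing.
    trimmed = re.sub(r'(\.\d{6})\d+', r'\1', raw)
    parsed = datetime.fromisoformat(trimmed)

print(display_text)
print(parsed)
```

Working from the attribute avoids having to parse the localized month name in the visible text.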

Scrape everything between two unnested tags

Is it possible to scrape everything between two unnested tags?
For instance:
<h3>Title 1</h3>
<div class="div">
<span class="span">span1</span>
<label class="label">label1</label>
</div>
<div class="div">
<span class="span">span2</span>
</div>
<h3>Title 2</h3>
<div class="div">
<span class="span">span3</span>
<label class="label">label2</label>
</div>
<div id="div">
<span id="span">span4</span>
</div>
So I would like to scrape only what is located between Title 1 and Title 2. Is this possible using bs4?
Right now I have something like this (the problem is that it scrapes everything, since the classes are all the same):
for i in soup.findAll("div",{"class":"div"}):
    print(i.span.text)
Now I get:
span1
span2
span3
span4
I'd like to get:
span1
span2
I don't know if this is the best solution to this problem, but you can split the text at the second heading and scrape only the part that you need.
text = """
<h3>Title 1</h3>
<div class="div">
<span class="span">span1</span>
<label class="label">label1</label>
</div>
<div class="div">
<span class="span">span2</span>
</div>
<h3>Title 2</h3>
<div class="div">
<span class="span">span3</span>
<label class="label">label2</label>
</div>
<div id="div">
<span id="span">span4</span>
</div>
"""
from bs4 import BeautifulSoup
# Parse the full text first so the second heading can be located
soup = BeautifulSoup(text, 'lxml')
sub_text = text.split(soup.find('h3', string="Title 2").string)[0]
This will give:
'\n<h3>Title 1</h3>\n<div class="div">\n <span class="span">span1</span>\n <label class="label">label1</label>\n</div>\n<div class="div">\n <span class="span">span2</span>\n</div>\n<h3>'
After converting that string into a bs4 object, you can scrape all you need:
scrape_me = BeautifulSoup(sub_text, 'lxml')
for i in scrape_me.find_all("div", class_="div"):
    print(i.span.text)
# prints span1, then span2
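Another approach:
1. Find the <span> with class="span" inside the second <div>.
2. From there, navigate backwards with find_all_previous() to collect the <div> tags that come before it.
3. The tags are returned in reverse document order, so wrap the result in reversed().
4. Find the <span> tag inside each <div> and print its text.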
from bs4 import BeautifulSoup
html = """
<h3>Title 1</h3>
<div class="div">
<span class="span">span1</span>
<label class="label">label1</label>
</div>
<div class="div">
<span class="span">span2</span>
</div>
<h3>Title 2</h3>
<div class="div">
<span class="span">span3</span>
<label class="label">label2</label>
</div>
<div id="div">
<span id="span">span4</span>
</div>
"""
soup = BeautifulSoup(html, "lxml")
for tag in reversed(
    soup.select_one("div:nth-of-type(2) span.span").find_all_previous("div")
):
    print(tag.find("span").text)
Output:
span1
span2
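Neither answer above depends on where the next heading actually sits; one relies on string splitting, the other on the position of the second <div>. A sketch of a third option is to start at the first heading and walk its following siblings until the next <h3> appears. It assumes the headings and the <div> blocks are siblings, as in the question, and that the <h3> tags are properly closed.

```python
from bs4 import BeautifulSoup

html = """
<h3>Title 1</h3>
<div class="div">
<span class="span">span1</span>
<label class="label">label1</label>
</div>
<div class="div">
<span class="span">span2</span>
</div>
<h3>Title 2</h3>
<div class="div">
<span class="span">span3</span>
<label class="label">label2</label>
</div>
<div id="div">
<span id="span">span4</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
start = soup.find("h3", string="Title 1")

between = []
# find_next_siblings() with no arguments yields only tags, not text nodes.
for sibling in start.find_next_siblings():
    if sibling.name == "h3":
        break  # reached the next heading, stop collecting
    if sibling.span is not None:
        between.append(sibling.span.get_text())

print(between)
```

Because the loop stops at the first following <h3>, the same code works for any section, not just the first one.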

BeautifulSoup find_all method returns the same elements

Here is my soup object:
<td class="kategorie">
<div data-navi-cat="c5ff5b1d0dc93c">
Herren
</div>
<div data-navi-cat="c5ff5b1d0dc95f">
Frauen
</div>
<div data-navi-cat="c5ff5b1d0dc978">
A-Jugend (U19)
</div>
<div data-navi-cat="c5ff5b1d0dc98c">
B-Jugend (U17)
</div>
<div data-navi-cat="c5ff5b1d0dc9a2">
C-Jugend (U15)
</div>
<div data-navi-cat="c5ff5b1d0dc9b1">
U17-Juniorinnen
</div>
<div data-navi-cat="c5ff5b1d0dc9b6">
Futsal
</div>
<div data-navi-cat="c5ff5b1d0dc9bd">
eSport
</div>
</td>
How can I get all the c-codes and their corresponding text from the object?
For example, the c-code "c5ff5b1d0dc93c" and its corresponding text "Herren" for the first row.
My code looks like this (categories is the soup object):
for category in categories.find_all('div'):
    category = categories.find('div')
    print(category)
I only receive the information for the first row:
<div data-navi-cat="c5ff5b1d0dc93c">Herren</div>
<div data-navi-cat="c5ff5b1d0dc93c">Herren</div>
<div data-navi-cat="c5ff5b1d0dc93c">Herren</div>
<div data-navi-cat="c5ff5b1d0dc93c">Herren</div>
<div data-navi-cat="c5ff5b1d0dc93c">Herren</div>
<div data-navi-cat="c5ff5b1d0dc93c">Herren</div>
<div data-navi-cat="c5ff5b1d0dc93c">Herren</div>
<div data-navi-cat="c5ff5b1d0dc93c">Herren</div>
What happens?
categories holds your HTML.
In your loop you call category = categories.find('div'). find('div') always returns the first occurrence, so category is always <div data-navi-cat="c5ff5b1d0dc93c">Herren</div>.
Instead, work with the loop variable: category = element.get_text() gets the text and code = element.get('data-navi-cat') gets the code.
Example
from bs4 import BeautifulSoup
html = '''<td class="kategorie">
<div data-navi-cat="c5ff5b1d0dc93c">
Herren
</div>
<div data-navi-cat="c5ff5b1d0dc95f">
Frauen
</div>
<div data-navi-cat="c5ff5b1d0dc978">
A-Jugend (U19)
</div>
<div data-navi-cat="c5ff5b1d0dc98c">
B-Jugend (U17)
</div>
<div data-navi-cat="c5ff5b1d0dc9a2">
C-Jugend (U15)
</div>
<div data-navi-cat="c5ff5b1d0dc9b1">
U17-Juniorinnen
</div>
<div data-navi-cat="c5ff5b1d0dc9b6">
Futsal
</div>
<div data-navi-cat="c5ff5b1d0dc9bd">
eSport
</div>
</td>'''
soup = BeautifulSoup(html, "lxml")
for element in soup.find_all('div'):
    category = element.get_text()
    code = element.get('data-navi-cat')
    print(category, code)
Output
Herren
c5ff5b1d0dc93c
Frauen
c5ff5b1d0dc95f
A-Jugend (U19)
c5ff5b1d0dc978
B-Jugend (U17)
c5ff5b1d0dc98c
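If the goal is a lookup table rather than printed lines, the same loop can be written as a dictionary comprehension. This is a minimal sketch on a shortened copy of the HTML; the variable name codes is illustrative. get_text(strip=True) removes the newlines that surround each label in the markup.

```python
from bs4 import BeautifulSoup

html = '''<td class="kategorie">
<div data-navi-cat="c5ff5b1d0dc93c">
Herren
</div>
<div data-navi-cat="c5ff5b1d0dc95f">
Frauen
</div>
<div data-navi-cat="c5ff5b1d0dc978">
A-Jugend (U19)
</div>
</td>'''

soup = BeautifulSoup(html, 'html.parser')

# Select only <div> tags that actually carry the data-navi-cat attribute.
codes = {
    div['data-navi-cat']: div.get_text(strip=True)
    for div in soup.select('div[data-navi-cat]')
}

for code, category in codes.items():
    print(code, category)
```

The attribute selector div[data-navi-cat] skips any <div> without a code, so the subscript access cannot raise a KeyError.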

Python Beautifulsoup: finding an element after a specific string

I have the following html code:
<div class="xyOfqd">
<div class="aAAD">
<div class="Bgbcca">Updated</div>
<span class="hthtb">
<div>
<span class="hthtb">September 30, 2018</span>
</div>
</span>
</div>
<div class="aAAD">
<div class="Bgbcca">Text1</div>
<span class="hthtb">
<div><span class="hthtb">Text2</span></div>
</span>
</div>
<div
class="aAAD">
<div class="Bgbcca">MyText</div>
<span class="hthtb">
<div>
<span class="hthtb">Text3</span>
</div>
</span>
</div>
<div class="aAAD">
<div class="Bgbcca">Text4</div>
<span class="hthtb">
<div><span
class="hthtb">Text5</span></div>
</span>
</div>
<div class="aAAD">
<div
class="Bgbcca">Text6</div>
<span class="hthtb">
<div><span
class="hthtb">Text7</span></div>
</span>
</div>
<div class="aAAD">
<div
class="Bgbcca">
Text8</div>
<span class="hthtb">
<div>
<span class="hthtb">
<div>Text9</div>
<div>Text10</div>
</span>
</div>
</span>
</div>
<div class="aAAD">
<div
class="Bgbcca">Text11</div>
<span class="hthtb">
<div><span class="hthtb">Text12</span></div>
</span>
</div>
How can I find Text3 which is located right after the div element with the string of MyText?
You can use lxml.html with an XPath query:
from lxml import html
source = """
<div class="xyOfqd">
<div class="aAAD">
<div class="Bgbcca">Updated</div>
...
<span class="hthtb">
<div><span class="hthtb">Text12</span></div>
</span>
</div>"""
tree = html.fromstring(source)
print(tree.xpath('//div[.="MyText"]/following-sibling::span/div/span/text()'))
If your structure always looks like this, you can get the right value as follows:
from bs4 import BeautifulSoup as bfs
html = """<div class="xyOfqd">
<div class="aAAD">
<div class="Bgbcca">Updated</div>
<span class="hthtb">
<div>
<span class="hthtb">September 30, 2018</span>
</div>
</span>
</div>
<div class="aAAD">
<div class="Bgbcca">Text1</div>
<span class="hthtb">
<div><span class="hthtb">Text2</span></div>
</span>
</div>
<div
class="aAAD">
<div class="Bgbcca">MyText</div>
<span class="hthtb">
<div>
<span class="hthtb">Text3</span>
</div>
</span>
</div>
<div class="aAAD">
<div class="Bgbcca">Text4</div>
<span class="hthtb">
<div><span
class="hthtb">Text5</span></div>
</span>
</div>
<div class="aAAD">
<div
class="Bgbcca">Text6</div>
<span class="hthtb">
<div><span
class="hthtb">Text7</span></div>
</span>
</div>
<div class="aAAD">
<div
class="Bgbcca">
Text8</div>
<span class="hthtb">
<div>
<span class="hthtb">
<div>Text9</div>
<div>Text10</div>
</span>
</div>
</span>
</div>
<div class="aAAD">
<div
class="Bgbcca">Text11</div>
<span class="hthtb">
<div><span class="hthtb">Text12</span></div>
</span>
</div>"""
soup = bfs(html, 'html.parser')
result = ''
for div0 in soup.find_all('div', {'class': 'aAAD'}):
    for div1 in div0.find_all('div', {'class': 'Bgbcca'}):
        if div1.get_text() == 'MyText':
            span = div0.find('span', {'class': 'hthtb'})
            if span:
                span_to_return = span.find('span', {'class': 'hthtb'})
                if span_to_return:
                    result = span_to_return.get_text()
print(result)
You can build a custom query function to pass into find():
def has_my_text(tag):
    found = tag.select_one('.Bgbcca')
    # Important: check the result first. select_one() returns None when
    # nothing matches, and calling .get_text() on None raises an error.
    if found:
        return found.get_text() == "MyText"
soup = bs4.... # assign your soup object
found = soup.find(has_my_text)
# <div class="Bgbcca">MyText</div>
# <span class="hthtb">
# <div>
# <span class="hthtb">Text3</span>
# </div>
# </span>
# </div>
# Note that the span class is nested, so we go two levels in
result = found.select_one('.hthtb').select_one('.hthtb').get_text()
# 'Text3'
# The following also works if the outer span contains no other text
result = found.select_one('.hthtb').get_text().strip()
Note that find() and select_one() return only the first match. If you need to handle multiple matches, use find_all() and select() instead and adjust your code accordingly.
If you want to handle variable texts, you can define your function like this:
def has_my_text(tag, text):
    found = tag.select_one('.Bgbcca')
    if found:
        return found.get_text() == text
And wrap the function in your find() like this:
txt = "MyText"
soup.find(lambda tag: has_my_text(tag, txt))
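Since the value sits in the <span> that directly follows the label <div>, another option is to locate the label by its text and then step to its next sibling. The sketch below uses a shortened copy of the HTML and assumes the label text matches exactly, with no surrounding whitespace.

```python
from bs4 import BeautifulSoup

html = """
<div class="xyOfqd">
<div class="aAAD">
<div class="Bgbcca">Text1</div>
<span class="hthtb">
<div><span class="hthtb">Text2</span></div>
</span>
</div>
<div class="aAAD">
<div class="Bgbcca">MyText</div>
<span class="hthtb">
<div>
<span class="hthtb">Text3</span>
</div>
</span>
</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# string= compares against the tag's own text, so it must match exactly.
label = soup.find("div", class_="Bgbcca", string="MyText")

value = None
if label is not None:
    # The value lives in the <span> that follows the label at the same level.
    sibling = label.find_next_sibling("span", class_="hthtb")
    if sibling is not None:
        value = sibling.get_text(strip=True)

print(value)
```

If the label text may be padded with whitespace, as in the Text8 block of the question, a function filter that strips the text before comparing (like has_my_text above) is the safer choice.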

Find and retrieve content from html text using BeautifulSoup

I have the following html code (or at least I think it's html) that I am working on with BeautifulSoup on Python.
I have parsed the html using Beautiful soup correctly. What I would like to do next is to retrieve the content associated with the 'div' containing a certain data-label (for example, in the bottom part of the code, data-label="Relation"). In particular I would like to obtain a dictionary that has as key the text of the data-label, i.e. in my example "Relation", and as value the content of the same 'div', i.e. in my example the href "http://documenti.camera.it/apps/commonServices/getDocumento.ashx?sezione=bollettini=comunicato=17=2016=06=14=03=data.20160614.com03.bollettino.sede00020.tit00010.int00010=data.20160614.com03.bollettino.sede00020.tit00010.int00010#data.20160614.com03.bollettino.sede00020.tit00010.int00010"
I have tried several approaches but data-label, as far as I know, does not appear to be a valid attribute, so I am not sure how to handle this.
(Note that this is just an example, but I will have to do the same for thousands, if not millions, of these webpages, with this similar structure).
Any help is appreciated. Thank you!
<div id="directs">
<label class="c1"><a data-comment="A human-readable name for the subject." data-label="label" href="http://www.w3.org/2000/01/rdf-schema#label">
rdfs:<span>label</span>
</a></label>
<div class="c2 value ">
<div class="toMultiLine ">
<div class="fixed">
<span class="dType">xsd:string</span>
intervento di Fabrizio CICCHITTO
</div>
</div>
</div>
<label class="c1"><a data-comment="A name given to the resource." data-label="Title" href="http://purl.org/dc/elements/1.1/title">
dc:<span>title</span>
</a></label>
<div class="c2 value ">
<div class="toMultiLine ">
<div class="fixed">
intervento di Fabrizio CICCHITTO
</div>
</div>
</div>
<label class="c1"><a data-comment="" data-label="" href="http://lod.xdams.org/ontologies/ods/modified">
ods:<span>modified</span>
</a></label>
<div class="c2 value ">
<div class="toMultiLine ">
<div class="fixed">
<span class="dType">xsd:dateTime</span>
2016-07-05T12:26:02Z
</div>
</div>
</div>
<label class="c1"><a data-comment="The subject is an instance of a class." data-label="type" href="http://www.w3.org/1999/02/22-rdf-syntax-ns#type">
rdf:<span>type</span>
</a></label>
<div class="c2 value">
<div class="toOneLine">
<a class=" isLocal" href="http://dati.camera.it/ocd/intervento" title="<http://dati.camera.it/ocd/intervento>">
ocd:intervento
</a>
</div>
</div>
<label class="c1"><a data-comment="propriet generica utilizzata per puntare alla risorsa deputato in vari punti dell'ontologia" data-label="rierimento a deputato" href="http://dati.camera.it/ocd/rif_deputato">
ocd:<span>rif_deputato</span>
</a></label>
<div class="c2 value">
<div class="toOneLine">
<a class=" isLocal" href="http://dati.camera.it/ocd/deputato.rdf/d15080_17" title="<http://dati.camera.it/ocd/deputato.rdf/d15080_17>">
http://dati.camera.it/ocd/deputato.rdf/d15080_17
</a>
</div>
</div>
<label class="c1"><a data-comment="A related resource." data-label="Relation" href="http://purl.org/dc/elements/1.1/relation">
dc:<span>relation</span>
</a></label>
<div class="c2 value">
<div class="toOneLine">
<a class=" " href="http://documenti.camera.it/apps/commonServices/getDocumento.ashx?sezione=bollettini=comunicato=17=2016=06=14=03=data.20160614.com03.bollettino.sede00020.tit00010.int00010=data.20160614.com03.bollettino.sede00020.tit00010.int00010#data.20160614.com03.bollettino.sede00020.tit00010.int00010"
target="_blank" title="<http://documenti.camera.it/apps/commonServices/getDocumento.ashx?sezione=bollettini=comunicato=17=2016=06=14=03=data.20160614.com03.bollettino.sede00020.tit00010.int00010=data.20160614.com03.bollettino.sede00020.tit00010.int00010#data.20160614.com03.bollettino.sede00020.tit00010.int00010>">
http://documenti.camera.it/apps/commonServices/getDocumento.ashx?sezione=bollettini=comunicato=17=2016=06=14=03=data.20160614.com03.bollettino.sede00020.tit00010.int00010=data.20160614.com03.bollettino.sede00020.tit00010.int00010#data.20160614.com03.bollettino.sede00020.tit00010.int00010
</a>
</div>
</div>
</div>
You can find the data-labels in one pass and the div content in another. Then, the results can be zipped together to create the dictionary:
from bs4 import BeautifulSoup as soup
import re
d = soup(content, 'html.parser').find('div', {'id':'directs'})
_labels = [i.a['data-label'] for i in d.find_all('label')]
_content = [i.text for i in d.find_all('div', {'class':re.compile(r'c2 value\s*')})]
result = dict(zip(_labels, _content))
Output:
{'label': '\n\n\nxsd:string \n intervento di Fabrizio CICCHITTO\n \n\n',
'Title': '\n\n\n intervento di Fabrizio CICCHITTO\n \n\n',
'': '\n\n\nxsd:dateTime \n 2016-07-05T12:26:02Z\n \n\n',
'type': '\n\n\n ocd:intervento\n \n\n',
'rierimento a deputato': '\n\n\n http://dati.camera.it/ocd/deputato.rdf/d15080_17\n \n\n',
'Relation': '\n\n\n http://documenti.camera.it/apps/commonServices/getDocumento.ashx?sezione=bollettini=comunicato=17=2016=06=14=03=data.20160614.com03.bollettino.sede00020.tit00010.int00010=data.20160614.com03.bollettino.sede00020.tit00010.int00010#data.20160614.com03.bollettino.sede00020.tit00010.int00010\n \n\n'}
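The zipped lists above depend on every label having exactly one value block, and the values keep their surrounding whitespace. A sketch of a variant closer to what the question asks for pairs each label with the value <div> that follows it and prefers the link's href when there is one. It runs on a shortened copy of the HTML, and the example.org URL is a stand-in for the long document URL. Because data-label is not a valid Python keyword argument, it is passed through the attrs dictionary.

```python
from bs4 import BeautifulSoup

# Shortened sample; http://example.org/doc is a placeholder URL (assumption).
content = """<div id="directs">
<label class="c1"><a data-comment="A name given to the resource." data-label="Title" href="http://purl.org/dc/elements/1.1/title">
dc:<span>title</span>
</a></label>
<div class="c2 value ">
<div class="toMultiLine ">
<div class="fixed">
intervento di Fabrizio CICCHITTO
</div>
</div>
</div>
<label class="c1"><a data-comment="A related resource." data-label="Relation" href="http://purl.org/dc/elements/1.1/relation">
dc:<span>relation</span>
</a></label>
<div class="c2 value">
<div class="toOneLine">
<a class=" " href="http://example.org/doc" target="_blank">
http://example.org/doc
</a>
</div>
</div>
</div>"""

root = BeautifulSoup(content, "html.parser").find("div", id="directs")

result = {}
for label in root.find_all("label", class_="c1"):
    # attrs={"data-label": True} matches any <a> that has the attribute.
    anchor = label.find("a", attrs={"data-label": True})
    value_div = label.find_next_sibling("div", class_="value")
    if anchor is None or value_div is None:
        continue
    link = value_div.find("a", href=True)
    if link is not None:
        result[anchor["data-label"]] = link["href"]
    else:
        result[anchor["data-label"]] = value_div.get_text(" ", strip=True)

print(result)
```

Labels with an empty data-label, like the ods:modified entry in the question, would all share the key ''; if those matter, the visible label text may be a better key.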
