Find text between specific id beautifulsoup - python

I have HTML like the following example:
<a class="anchor" id="category-1"></a>
<h2 class="text-muted">First Category</h2>
<div class="row">
<a class="anchor-entry" id="cat1-first-id"></a>
<div class="col-lg-10">
<h3>First H3 Title</h3>
</div>
<a class="anchor-entry" id="cat1-second-id"></a>
<div class="col-lg-10">
<h3>Second H3 Title</h3>
</div>
<div class="row">
<a class="anchor-entry" id="cat1-third-id"></a>
<div class="col-lg-10">
<h3>Third H3 Title</h3>
</div>
</div>
</div>
<a class="anchor" id="category-2"></a>
<h2 class="text-muted">Second Category</h2>
<div class="row">
<a class="anchor-entry" id="cat2-first-id"></a>
<div class="col-lg-10">
<h3>First H3 Title</h3>
</div>
<div class="row">
<a class="anchor-entry" id="cat2-second-id"></a>
<div class="col-lg-10">
<h3>Second H3 Title</h3>
</div>
</div>
</div>
<a class="anchor" id="category-3"></a>
<h2 class="text-muted">Third Category</h2>
<div class="row">
<a class="anchor-entry" id="cat3-first-id"></a>
<div class="col-lg-10">
<h3>Cat-3 First H3 Title</h3>
</div>
<div class="row">
<a class="anchor-entry" id="cat3-second-id"></a>
<div class="col-lg-10">
<h3>Cat-3 Second H3 Title</h3>
</div>
<div class="row">
<a class="anchor-entry" id="cat3-third-id"></a>
<div class="col-lg-10">
<h3>Cat-3 Third H3 Title</h3>
</div>
</div>
</div>
</div>
So some blocks are not wrapped in a div of their own; each one simply sits between two a tags with specific ids.
I have the list of every id I need (category-1, category-2), and I would like to collect all the h3 text for each category in a Python object (dict, DataFrame, whatever):
d = {
    'category-1': ['Cat-1 First H3 Title', 'Cat-1 Second H3 Title', 'Cat-1 Third H3 Title'],
    'category-2': ['Cat-2 First H3 Title', 'Cat-2 Second H3 Title']
}
The problem is that I couldn't find any method to get the information in between:
import requests
from bs4 import BeautifulSoup
url = 'myUrl'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
category_list = ['category-1', 'category-2']
for i in category_list:
    # list like: [<a class="anchor" id="category-1"></a>]
    catid = soup.find_all(id=i)
    # long list like: [<a class="anchor-entry" id="cat1-first-id"></a>, ...]
    cata = soup.find_all('a', {'class': 'anchor-entry'})
But catid and cata aren't linked, and this is where I got stuck.

Your code only selects the a tags with class anchor-entry, so nothing ties them to a category.
category_list = ['category-1', 'category-2', 'category-3']
category_tags = soup.find_all("a", {"class": "anchor"})
d = {}
for i in category_list:
    tag = soup.find("a", {"id": i}).find_next()
    # walk forward in document order until the next category anchor or the end of the document
    while tag is not None and tag not in category_tags:
        if tag.name == "h3":
            d.setdefault(i, []).append(tag.text)
        tag = tag.find_next()
My approach is to traverse the HTML tree in document order, collecting h3 headers into d until the next category anchor is found.
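A self-contained variant of the same idea, as a minimal sketch: it assumes a trimmed copy of the question's markup, and the helper name h3_by_category is made up for illustration. It walks forward from each category anchor with find_all_next and stops at the next a tag whose class is exactly anchor.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's markup (an assumption, for illustration only).
html = '''
<a class="anchor" id="category-1"></a>
<h2 class="text-muted">First Category</h2>
<div class="row">
<a class="anchor-entry" id="cat1-first-id"></a>
<div class="col-lg-10"><h3>First H3 Title</h3></div>
<a class="anchor-entry" id="cat1-second-id"></a>
<div class="col-lg-10"><h3>Second H3 Title</h3></div>
</div>
<a class="anchor" id="category-2"></a>
<h2 class="text-muted">Second Category</h2>
<div class="row">
<a class="anchor-entry" id="cat2-first-id"></a>
<div class="col-lg-10"><h3>First H3 Title</h3></div>
</div>
'''


def h3_by_category(markup, category_ids):
    """Map each category id to the h3 texts found before the next category anchor."""
    soup = BeautifulSoup(markup, "html.parser")
    result = {}
    for cat_id in category_ids:
        start = soup.find("a", id=cat_id)
        if start is None:
            continue  # id not present in the page
        titles = []
        # find_all_next yields later elements in document order
        for tag in start.find_all_next(["a", "h3"]):
            if tag.name == "a" and "anchor" in tag.get("class", []):
                break  # reached the next category anchor
            if tag.name == "h3":
                titles.append(tag.get_text(strip=True))
        result[cat_id] = titles
    return result


result = h3_by_category(html, ["category-1", "category-2"])
print(result)
```

Because class is a multi-valued attribute, tag.get("class") returns a list, so the check matches anchor but not anchor-entry.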

Related

Web scraping html <time class=""></time>

I would like to get the date, but I only get None.
<div class="article-cover">
<div class="article-cover-img">
<img src="https://api.hvg.hu/Img/da658e97-86c0-40f3-acd3-b0a850f32c30/e6f183bb-25a9-468e-ae30-6d98952ffc00.jpg" alt="Will Smith lemondott amerikai filmakadémiai tagságáról" width="800" height="370">
</div>
<div class="article-cover-text">
<div class="article-info byline">
<div class="info">
<time class="article-datetime" datetime="2022-04-02T08:13:00.0000000+02:00">2022. április. 02. 08:13</time>
<time class="lastdate" datetime="2022-04-02T08:16:17.0000000+02:00">2022. április. 02. 08:16</time>
Kult
</div>
</div>
<div class="article-title article-title">
<h1>Will Smith lemondott amerikai filmakadémiai tagságáról</h1>
</div>
</div>
<button class="articlesavebutton bookmark large" data-id="994e8ff1-f28e-4153-9f6c-87283f187af7" data-event-category="Myhvg_article_save" data-event-action="ClickOnLink" data-event-label="Article_save_MTI"></button>
</div>
I would like to get back: 2022. április. 02. 08:13
My code is:
article_soup = BeautifulSoup(article.content, "html.parser")
d=article_soup.find('time', class\_='article-datetime')
The main issue is the typo class\_='article-datetime' (it should be class_='article-datetime'). To get the text, simply use the get_text() method:
article_soup.find('time', class_='article-datetime').get_text()
or
article_soup.find('time', class_='article-datetime').text
Example
from bs4 import BeautifulSoup
html = '''
<div class="article-cover">
<div class="article-cover-img">
<img src="https://api.hvg.hu/Img/da658e97-86c0-40f3-acd3-b0a850f32c30/e6f183bb-25a9-468e-ae30-6d98952ffc00.jpg" alt="Will Smith lemondott amerikai filmakadémiai tagságáról" width="800" height="370">
</div>
<div class="article-cover-text">
<div class="article-info byline">
<div class="info">
<time class="article-datetime" datetime="2022-04-02T08:13:00.0000000+02:00">2022. április. 02. 08:13</time>
<time class="lastdate" datetime="2022-04-02T08:16:17.0000000+02:00">2022. április. 02. 08:16</time>
Kult
</div>
</div>
<div class="article-title article-title">
<h1>Will Smith lemondott amerikai filmakadémiai tagságáról</h1>
</div>
</div>
<button class="articlesavebutton bookmark large" data-id="994e8ff1-f28e-4153-9f6c-87283f187af7" data-event-category="Myhvg_article_save" data-event-action="ClickOnLink" data-event-label="Article_save_MTI"></button>
</div>'''
article_soup = BeautifulSoup(html, "html.parser")
print(article_soup.find('time', class_='article-datetime').get_text())
Output
2022. április. 02. 08:13
Another option is find_all with an attribute dict, taking the first match:
soup = BeautifulSoup(html, "html.parser")
adate = soup.find_all("time", {"class": "article-datetime"})
print(adate[0].get_text())
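Since find returns None when nothing matches, it can help to guard the lookup before reading from it. A minimal sketch, assuming a shortened copy of the markup above; the variable names are illustrative. It also reads the datetime attribute, which carries the machine-readable timestamp alongside the displayed text.

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup above (an assumption, for illustration only).
html = '''
<div class="info">
<time class="article-datetime" datetime="2022-04-02T08:13:00.0000000+02:00">2022. április. 02. 08:13</time>
<time class="lastdate" datetime="2022-04-02T08:16:17.0000000+02:00">2022. április. 02. 08:16</time>
</div>
'''

soup = BeautifulSoup(html, "html.parser")
tag = soup.find("time", class_="article-datetime")

display_text = None
machine_value = None
if tag is not None:  # find() returns None when there is no match
    display_text = tag.get_text(strip=True)  # the human-readable date
    machine_value = tag.get("datetime")      # the raw attribute value

print(display_text)
print(machine_value)
```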

Parsing HTML with BeautifulSoup class == AND title CONTAINS

I am trying to parse the following HTML:
<div class="product-details">
<h4 class="title" >Blue - Standard</h4>
<a class="learn-more" data-test-selector="linkViewMoreDetails" href="https://productwebpage.com">Learn More</a>
<div class="tocart" <a class="" href="/store/addtocartplp?productId=3593" id="AddToCartSimple-3593">Add To Cart</a></div>
</div>
,
<div class="product-details">
<h4 class="title" >Blue - Wide</h4>
<a class="learn-more" data-test-selector="linkViewMoreDetails" href="https://productwebpage.com">Learn More</a>
<div class="tocart" <a class="disAddtoCardBtn" href="javascript:void(0)" id="AddToCartSimple-3576" >SOLD</a></div>
</div>
I am trying to get the 'id' where the title contains 'Blue' AND the item is not sold.
I have tried:
soup.find_all("a",href=re.compile("Blue"),class_="")
links = soup.find_all("a", href=re.compile("Blue", "Add To Cart"))
ids = [tag["id"] for tag in soup.find_all("a", href=re.compile("Blue"))]
But it is not returning the info I'm looking for.
I would like it to return:
AddToCartSimple-3593
I think your HTML is corrupted. You can do the entire filtering with CSS selectors, using :has, :not, and :contains (:-soup-contains in recent versions of soupsieve) along with attribute = value selectors. The ^ is a starts-with operator: the attribute value must start with the string after the =. The ~ is the general sibling combinator and the > is the child combinator. Together, the selector looks for a sibling with class (.) tocart, then a child whose id starts with AddToCartSimple-, but whose text does not contain SOLD. This is less specific than !="SOLD", since it excludes on a partial string match; which one fits depends on the variation observed in the actual data.
from bs4 import BeautifulSoup as bs
html ='''
<div class="product-details">
<h4 class="title">Blue - Standard</h4> <a class="learn-more" data-test-selector="linkViewMoreDetails" href="https://productwebpage.com">Learn More</a>
<div class="tocart"> <a class="" href="/store/addtocartplp?productId=3593" id="AddToCartSimple-3593">Add To Cart</a>
</div>
<div class="product-details">
<h4 class="title">Blue - Wide</h4> <a class="learn-more" data-test-selector="linkViewMoreDetails" href="https://productwebpage.com">Learn More</a>
<div class="tocart"> <a class="disAddtoCardBtn" href="javascript:void(0)" id="AddToCartSimple-3576">SOLD</a>
</div>
'''
soup = bs(html, 'html.parser')
print(soup.select_one('.title:has([title^="Blue -"]) ~ .tocart > [id^=AddToCartSimple-]:not(:contains("SOLD"))')['id'])
You should of course check that there was a match before accessing ['id']. You could also collect all matches as follows:
[i['id'] for i in soup.select('.title:has([title^="Blue -"]) ~ .tocart > [id^=AddToCartSimple-]:not(:contains("SOLD"))')]
To get the data where the "title" contains "Blue" and the item is not "SOLD":
Use the CSS selector .product-details > h4 a[title*='Blue'], which selects every a whose title contains Blue under an h4 that is a child of an element with class product-details.
Find the next div using the find_next() method, and check that its text is not "SOLD".
Print the next div's id.
from bs4 import BeautifulSoup
html = """<div class="product-details">
<h4 class="title" >Blue - Standard</h4>
<a class="learn-more" data-test-selector="linkViewMoreDetails" href="https://productwebpage.com">Learn More</a>
<div class="tocart" <a class="" href="/store/addtocartplp?productId=3593" id="AddToCartSimple-3593">Add To Cart</a></div>
</div>
<div class="product-details">
<h4 class="title" >Blue - Wide</h4>
<a class="learn-more" data-test-selector="linkViewMoreDetails" href="https://productwebpage.com">Learn More</a>
<div class="tocart" <a class="disAddtoCardBtn" href="javascript:void(0)" id="AddToCartSimple-3576" >SOLD</a></div>
</div>"""
soup = BeautifulSoup(html, "html.parser")
for tag in soup.select(".product-details > h4 a[title*='Blue']"):
    if tag.find_next("div").text != "SOLD":
        print(tag.find_next("div")["id"])
Output:
AddToCartSimple-3593
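The selectors above look for an element with a title attribute, while the markup as posted shows the product name only as h4 text. A minimal sketch for that text-only layout, assuming the tocart divs are well formed (the broken opening tags repaired); the helper name available_ids is made up for illustration.

```python
from bs4 import BeautifulSoup

# Repaired copy of the question's markup (an assumption, for illustration only).
html = '''
<div class="product-details">
<h4 class="title">Blue - Standard</h4>
<a class="learn-more" href="https://productwebpage.com">Learn More</a>
<div class="tocart"><a class="" href="/store/addtocartplp?productId=3593" id="AddToCartSimple-3593">Add To Cart</a></div>
</div>
<div class="product-details">
<h4 class="title">Blue - Wide</h4>
<a class="learn-more" href="https://productwebpage.com">Learn More</a>
<div class="tocart"><a class="disAddtoCardBtn" href="javascript:void(0)" id="AddToCartSimple-3576">SOLD</a></div>
</div>
'''


def available_ids(markup, keyword):
    """Return the ids of cart links whose product title contains keyword and is not sold."""
    soup = BeautifulSoup(markup, "html.parser")
    ids = []
    for product in soup.find_all("div", class_="product-details"):
        title = product.find("h4", class_="title")
        link = product.find("a", id=True)  # first a tag that has an id attribute
        if title is None or link is None:
            continue
        if keyword in title.get_text() and link.get_text(strip=True) != "SOLD":
            ids.append(link["id"])
    return ids


ids = available_ids(html, "Blue")
print(ids)
```

Looping over each product-details block keeps the title and the cart link paired, and the None checks avoid an error when a block lacks either element.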

BeautifulSoup find_all method returns the same elements

Hi, here is my soup object:
<td class="kategorie">
<div data-navi-cat="c5ff5b1d0dc93c">
Herren
</div>
<div data-navi-cat="c5ff5b1d0dc95f">
Frauen
</div>
<div data-navi-cat="c5ff5b1d0dc978">
A-Jugend (U19)
</div>
<div data-navi-cat="c5ff5b1d0dc98c">
B-Jugend (U17)
</div>
<div data-navi-cat="c5ff5b1d0dc9a2">
C-Jugend (U15)
</div>
<div data-navi-cat="c5ff5b1d0dc9b1">
U17-Juniorinnen
</div>
<div data-navi-cat="c5ff5b1d0dc9b6">
Futsal
</div>
<div data-navi-cat="c5ff5b1d0dc9bd">
eSport
</div>
</td>
How can I get all the c-codes and their corresponding text from the object?
For example: c-code: "c5ff5b1d0dc93c" and its corresponding text: "Herren" for the first row...
My code looks like this (categories is the soup object):
for category in categories.find_all('div'):
    category = categories.find('div')
    print(category)
I only receive the information of the first row:
<div data-navi-cat="c5ff5b1d0dc93c">Herren</div>
<div data-navi-cat="c5ff5b1d0dc93c">Herren</div>
<div data-navi-cat="c5ff5b1d0dc93c">Herren</div>
<div data-navi-cat="c5ff5b1d0dc93c">Herren</div>
<div data-navi-cat="c5ff5b1d0dc93c">Herren</div>
<div data-navi-cat="c5ff5b1d0dc93c">Herren</div>
<div data-navi-cat="c5ff5b1d0dc93c">Herren</div>
<div data-navi-cat="c5ff5b1d0dc93c">Herren</div>
What happens?
categories holds your HTML.
In your loop you call category = categories.find('div'). find('div') always returns the first occurrence, so category will always be <div data-navi-cat="c5ff5b1d0dc93c">Herren</div>.
Instead, work with the loop variable: category = element.get_text() gets the text and code = element.get('data-navi-cat') gets the code.
Example
from bs4 import BeautifulSoup
html = '''<td class="kategorie">
<div data-navi-cat="c5ff5b1d0dc93c">
Herren
</div>
<div data-navi-cat="c5ff5b1d0dc95f">
Frauen
</div>
<div data-navi-cat="c5ff5b1d0dc978">
A-Jugend (U19)
</div>
<div data-navi-cat="c5ff5b1d0dc98c">
B-Jugend (U17)
</div>
<div data-navi-cat="c5ff5b1d0dc9a2">
C-Jugend (U15)
</div>
<div data-navi-cat="c5ff5b1d0dc9b1">
U17-Juniorinnen
</div>
<div data-navi-cat="c5ff5b1d0dc9b6">
Futsal
</div>
<div data-navi-cat="c5ff5b1d0dc9bd">
eSport
</div>
</td>'''
soup = BeautifulSoup(html, "lxml")
for element in soup.find_all('div'):
    category = element.get_text()
    code = element.get('data-navi-cat')
    print(category, code)
Output
Herren
c5ff5b1d0dc93c
Frauen
c5ff5b1d0dc95f
A-Jugend (U19)
c5ff5b1d0dc978
B-Jugend (U17)
c5ff5b1d0dc98c
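If the goal is a lookup from code to label, a dict comprehension is one option. A minimal sketch, assuming a shortened copy of the markup and the built-in html.parser; the name code_to_label is illustrative.

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup above (an assumption, for illustration only).
html = '''<td class="kategorie">
<div data-navi-cat="c5ff5b1d0dc93c">
Herren
</div>
<div data-navi-cat="c5ff5b1d0dc95f">
Frauen
</div>
<div data-navi-cat="c5ff5b1d0dc978">
A-Jugend (U19)
</div>
</td>'''

soup = BeautifulSoup(html, "html.parser")

# attrs={"data-navi-cat": True} keeps only divs that carry the attribute;
# get_text(strip=True) drops the newlines around each label.
code_to_label = {
    div["data-navi-cat"]: div.get_text(strip=True)
    for div in soup.find_all("div", attrs={"data-navi-cat": True})
}

print(code_to_label)
```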

Return text surrounded by double tag with BeautifulSoup

I am looping through a list of URLs. Each page has between 1 and n descriptions, each wrapped in a double p tag.
BeautifulSoup.find(class_='view-content')
# url 1
<div class="view-content">
<div class="row">
<div class="description">
<p><p>One animal</p>
</p>
</div>
</div>
</div>
# url 2
<div class="view-content">
<div class="row">
<div class="description">
<p><p>One person</p>
</p>
</div>
</div>
<div class="row">
<div class="description">
<p><p>Two people </p>
</p>
</div>
</div>
</div>
When I use
for d in soup.find(class_='view-content').find_all('p'):
    dd = d.contents[0]
    print(dd)
I get
<p>One animal</p>One animal
<p>One person</p>One person
<p>Two people</p>Two people
Instead of expected
One animal
One person
Two people
Any way to retrieve content surrounded by double p tags?
Edit: The following returns the same, but at least without the p tags.
for d in soup.find_all("div", class_="view-content"):
    print(' '.join(i.text for i in d.find_all('p')[1:]))
Another solution.
from simplified_scrapy import SimplifiedDoc
html = '''
# url 1
<div class="view-content">
<div class="row">
<div class="description">
<p><p>One animal</p>
</p>
</div>
</div>
</div>
# url 2
<div class="view-content">
<div class="row">
<div class="description">
<p><p>One person</p>
</p>
</div>
</div>
<div class="row">
<div class="description">
<p><p>Two people </p>
</p>
</div>
</div>
</div>
'''
doc = SimplifiedDoc(html)
divs = doc.selects('div.view-content')
datas = []
for div in divs:
    datas.extend([p.text for p in div.ps])
print(datas)
Result:
['One animal', 'One person', 'Two people']
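With BeautifulSoup itself, one possible approach is to keep only the innermost p tags, that is, those with no p nested inside them. A minimal sketch, assuming the url 2 markup from the question and the built-in html.parser, which leaves the nested p tags as written; the name texts is illustrative.

```python
from bs4 import BeautifulSoup

# The "url 2" markup from the question.
html = '''
<div class="view-content">
<div class="row">
<div class="description">
<p><p>One person</p>
</p>
</div>
</div>
<div class="row">
<div class="description">
<p><p>Two people </p>
</p>
</div>
</div>
</div>
'''

soup = BeautifulSoup(html, "html.parser")

texts = []
for p in soup.select(".view-content p"):
    if p.find("p") is not None:
        continue  # outer wrapper: it contains another p
    text = p.get_text(strip=True)
    if text:  # skip empty p tags that other parsers may create
        texts.append(text)

print(texts)
```

The empty-text check is there because other parsers, such as lxml, may repair the invalid nesting by splitting it into separate p tags, some of them empty.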

Python v3, BeautifulSoup - multiple div tags with same name

soup = BeautifulSoup(html, "html.parser")  # or BeautifulSoup(markup, "lxml")
items = soup.find_all("div", "_3u1 _gli _uvb", recursive=True)
for item in items:
    abouts = item.find_all("div", {"class": "_glo"}, recursive=True)[0].text
    print(abouts)
HTML page:
<div class="_glo">
<div>
<div class="_ajw">
<div class="_52eh">
"text"
</div>
</div>
<div class="_ajw">
<div class="_52eh">
"text"
</div>
</div>
<div class="_ajw">
<div class="_52eh">
"text"
</div>
</div>
</div>
</div>
Afternoon, I am trying to scrape a webpage using BeautifulSoup in Python. I need each of the "text" strings in a separate variable. When I print abouts I get "text text text", but I want the strings separated.
Kind regards
Try this:
items = soup.find_all('div', attrs={'class': '_ajw'})
texts = {}  # named texts rather than dict, to avoid shadowing the built-in
for i, item in enumerate(items):
    texts['text' + str(i + 1)] = item.find('div', attrs={'class': '_52eh'}).text
print(texts)
This will give you something like this:
{'text1': text, 'text2': text, 'text3': text}
I'd use soup.select to apply a class selector to the HTML. It is a fast way to get a list of the matching elements by class:
from bs4 import BeautifulSoup as bs
html = '''
<div class="_glo">
<div>
<div class="_ajw">
<div class="_52eh">
"text"
</div>
</div>
<div class="_ajw">
<div class="_52eh">
"text"
</div>
</div>
<div class="_ajw">
<div class="_52eh">
"text"
</div>
</div>
</div>
</div>
'''
soup = bs(html, 'lxml')
items = [item.text.strip() for item in soup.select('._52eh')]
print(items)
