How to properly get an element with BeautifulSoup? - python

I'm new to Python and trying to parse some simple HTML. However, one thing stops me. For example, I have this HTML:
<div class = "quote">
<div class = "whatever">
some unnecessary text here
</div>
<div class = "text">
Here's the desired text!
</div>
</div>
I need to extract the text from the second div (the one with class "text"). This is how I get it:
print repr(link.find('div').findNextSibling())
However, this returns the whole div (with "div" word): <div class="text">Here's the desired text!</div>
And I don't know how to get text only.
Adding .text results in \u043a\u0430\u043a \u0440\u0430\u0437\u0440\u0430\u0431 strings\
Adding .strings returns "None"
Adding .string returns both "None" and \u042f\u0445\u0438\u043a\u043e - \u0435\u0441\u043b\u0438\
Maybe there's something wrong with repr
P.S. I need to save tags inside div too.

Why don't you simply search for the <div> element based on its class attribute? Something like the following seems to work for me:
from bs4 import BeautifulSoup
html = '''<div class = "quote">
<div class = "whatever">
some unnecessary text here
</div>
<div class = "text">
Here's the desired text!
</div>
</div>'''
link = BeautifulSoup(html, 'html.parser')
print(link.find('div', class_="text").text.strip())
It yields:
Here's the desired text!
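Since the question also asks to keep any tags inside the div, here is a minimal sketch of both options. The nested <b> tag is an assumption added purely for illustration; it is not in the original markup. `get_text()` drops inner tags, while `decode_contents()` returns the inner markup as a string.

```python
from bs4 import BeautifulSoup

# The <b> tag below is hypothetical, added only to show the difference.
html = '''<div class="quote">
<div class="whatever">some unnecessary text here</div>
<div class="text">Here's the <b>desired</b> text!</div>
</div>'''

soup = BeautifulSoup(html, "html.parser")
target = soup.find("div", class_="text")

# Text only: nested tags are removed.
plain = target.get_text().strip()

# Inner markup: nested tags are preserved, the outer <div> is not included.
inner = target.decode_contents()

print(plain)
print(inner)
```

Printing the value directly instead of its `repr` also avoids the escaped `\uXXXX` output seen in the question.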

Related

Delete block in HTML based on text

I have the HTML snippet below and I need to delete a block based on its text, for example Name: John. I know I can do this with decompose() from BeautifulSoup using the class name sample, but I cannot apply decompose that way because the blocks have different attributes and tag names; only the text inside follows the same pattern. Is there anything in bs4 that can solve this?
<div id="container">
<div class="sample">
Name:<br>
<b>John</b>
</div>
<div>
result:
<div id="container"><div>
To find tags based on inner text see How to select div by text content using Beautiful Soup?
Once you have the required div, you can simply call decompose():
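import re
from bs4 import BeautifulSoup

html = '''<div id="container">
<div class="sample">
Name:<br>
<b>John</b>
</div>
</div>'''
soup = BeautifulSoup(html, 'html.parser')
# Find the text node matching the pattern, then remove its enclosing tag
sample = soup.find(string=re.compile('Name'))
sample.parent.decompose()
print(soup.prettify())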
Side note: notice that I fixed the closing tag for your "container" div!
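If several blocks with different tag names and attributes share the same text pattern, one possible approach is to test each child of the container by its combined text. This is only a sketch: the <section> and <p> elements and the exact pattern are assumptions made for illustration.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: blocks with different tags but the same text pattern.
html = '''<div id="container">
<div class="sample">Name:<br><b>John</b></div>
<section id="other">Name:<br><b>Mary</b></section>
<p>Keep me</p>
</div>'''

soup = BeautifulSoup(html, "html.parser")
container = soup.find(id="container")

# Look only at direct children, whatever their tag name is.
for block in container.find_all(True, recursive=False):
    # Join the block's text fragments with single spaces.
    text = block.get_text(" ", strip=True)
    if re.fullmatch(r"Name:\s*John", text):
        block.decompose()

print(soup.prettify())
```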

Combining "find" with a function in BeautifulSoup

I am trying to get the list of tags (CTC_3D_Printer, ctc_prusa_i3_pro_b, CTC_Upgrades) from the following HTML source code:
html = """
<div class="content_stack">
<h2 class="section-header justify">
Tags
</h2>
<div class="thing-detail-tags-container">
<div class="taglist">
CTC_3D_Printer
ctc_prusa_i3_pro_b
CTC_Upgrades
</div>
</div>
</div>
<div class="content_stack">
<h2 class="section-header">
Design Tools
</h2>
<div class="taglist">
<span>Tinkercad</span>
</div>
</div>
"""
Normally I would use:
tags = soup.find("h2", string = "Tags").findNextSibling()
to get the tags. However, as there is extra whitespace surrounding the Tags text, I cannot use it. The Tags section is also not always the first element that comes right after the <div class="content_stack">. How could I solve my problem by combining "find" with some pre-defined function?
As explained in Kinds of filters in the docs, you just write a function (that takes a BS tag object and returns true if it's a match), and pass it to find.
Their example is a function that finds only tags with a class but without an id:
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')
For your case, you just want to do an in check on the text (guarding against tags whose string is None):
h2 = soup.find('h2', string=lambda s: s is not None and 'Tags' in s)
… or maybe:
h2 = soup.find(lambda tag: tag.name == 'h2' and 'Tags' in tag.get_text())
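Put together with the markup from the question, a self-contained sketch might look like this. Splitting the text on whitespace is an assumption that holds only because the tag names here contain no spaces.

```python
from bs4 import BeautifulSoup

html = """
<div class="content_stack">
<h2 class="section-header justify">
Tags
</h2>
<div class="thing-detail-tags-container">
<div class="taglist">
CTC_3D_Printer
ctc_prusa_i3_pro_b
CTC_Upgrades
</div>
</div>
</div>
<div class="content_stack">
<h2 class="section-header">
Design Tools
</h2>
<div class="taglist">
<span>Tinkercad</span>
</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Function filter: an <h2> whose text contains "Tags", whitespace or not.
h2 = soup.find(lambda tag: tag.name == "h2" and "Tags" in tag.get_text())

# The tag container is the next <div> sibling of that heading.
container = h2.find_next_sibling("div")

# Assumes tag names contain no whitespace.
tags = container.get_text().split()
print(tags)
```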

How to wrap string by tag in Beautifulsoup?

I want to wrap the content of a lot of div-elements/blocks with p tags:
<div class='value'>
some content
</div>
It should become:
<div class='value'>
<p>
some content
</p>
</div>
My idea was to get the content (using bs4) by filtering strings with find_all and then wrap it with the new tag. I don't know if that works; I can't filter content from tags with specific attributes/values.
I could do this with regex instead of bs4, but I'd like to do all transformations (there are some more besides this one) in bs4.
Believe it or not, you can use wrap. :-)
Because you might, or might not, want to wrap inner div elements I decided to alter your HTML code a little bit, so that I could give you code that shows how to alter an inner div without changing the one 'outside' it. You will see how to alter all divs, I'm sure.
Here's how.
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(open('pjoern.htm').read(), 'lxml')
>>> inner_div = soup.findAll('div')[1]
>>> inner_div
<div>
some content
</div>
>>> inner_div.contents[0].wrap(soup.new_tag('p'))
<p>
some content
</p>
>>> print(soup.prettify())
<html>
<body>
<div class="value">
<div>
<p>
some content
</p>
</div>
</div>
</body>
</html>
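To apply the same idea to every matching div, a sketch along these lines may help. It uses an inline HTML string instead of a file, and rather than calling wrap on a single child it moves all children into a new <p>, which gives the same result for a single string and also covers divs with several children. The sample markup is an assumption.

```python
from bs4 import BeautifulSoup

# Hypothetical input with two blocks to transform.
html = "<div class='value'>some content</div><div class='value'>more content</div>"
soup = BeautifulSoup(html, "html.parser")

for div in soup.find_all("div", class_="value"):
    p = soup.new_tag("p")
    # Copy the child list first, because extracting mutates div.contents.
    for child in list(div.contents):
        p.append(child.extract())
    div.append(p)

print(soup)
```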

BS4 Searching by Class_ Returning Empty

I currently am successfully scraping the data I need by chaining bs4 .contents together following a find_all('div'), but that seems inherently fragile. I'd like to go directly to the tag I need by class, but my "class_=" search is returning None.
I ran the following code on the html below, which returns None:
soup = BeautifulSoup(text) # this works fine
tag = soup.find(class_ = "loan-section-content") # this returns None
Also tried soup.find('div', class_ = "loan-section-content") - also returns None.
My html is:
<div class="loan-section">
<div class="loan-section-title">
<span class="text-light"> Some Text </span>
</div>
<div class="loan-section-content">
<div class="row">
<div class="col-sm-6">
<strong>More text</strong>
<br/>
<strong>
Dakar, Senegal
</strong>
Try this:
soup.find(attrs={'class': 'loan-section-content'})
or
soup.find('div', 'loan-section-content')
The attrs argument matches on tag attributes, and a string passed as the second positional argument is treated as a CSS class.
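A minimal sketch comparing the lookups on a trimmed-down version of the markup from the question (the nesting is shortened here for brevity). In a reasonably recent Beautiful Soup 4, all three forms should return the same tag; the class_ keyword was added in version 4.1.2, so an older install is one possible, unconfirmed explanation for the None result.

```python
from bs4 import BeautifulSoup

# Shortened version of the question's markup.
html = """
<div class="loan-section">
  <div class="loan-section-title">
    <span class="text-light"> Some Text </span>
  </div>
  <div class="loan-section-content">
    <strong>More text</strong>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

by_attrs = soup.find(attrs={"class": "loan-section-content"})
by_positional = soup.find("div", "loan-section-content")
by_keyword = soup.find("div", class_="loan-section-content")

# Each lookup resolves to the same Tag object in the tree.
print(by_attrs is by_positional is by_keyword)
print(by_attrs.strong.get_text())
```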

find tags in defined scope using beautifulsoup

I am using BeautifulSoup to extract data.
I have an HTML file like this:
<div class=a>
<a href='google.com'>a</a>
</div>
<div class=b>
<a href='google.com'>c</a>
<a href='google.com'>d</a>
</div>
I want to extract the data 'c' and 'd' in <div class=b>; I don't need the data 'a' in <div class=a>.
So I do:
google_list = soup.findAll('a', href='google.com')
for item in google_list:
    print(item.string)
It will print a, c, d.
So my problem is how to print just 'c' and 'd' in <div class=b>, without the 'a' in <div class=a>.
You could just select based upon the div whose class is b and then after that use your original query on that tag so that you look for its children:
div = soup.find_all('div', {"class":"b"})[0]
items = div.find_all('a', href="google.com")
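As a self-contained sketch of that scoped search, using the markup from the question and find in place of find_all(...)[0]:

```python
from bs4 import BeautifulSoup

html = """<div class=a>
<a href='google.com'>a</a>
</div>
<div class=b>
<a href='google.com'>c</a>
<a href='google.com'>d</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Narrow the scope to the div with class "b" first.
div_b = soup.find("div", class_="b")

# Then search only inside that div.
texts = [a.get_text() for a in div_b.find_all("a", href="google.com")]
print(texts)
```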
I stopped using Beautiful Soup a few years back and prefer the lxml library, whose HTML parser is flexible and also allows XPath queries.
import lxml.html

html = """<div class=a>
<a href='google.com'>a</a>
</div>
<div class=b>
<a href='google.com'>c</a>
<a href='google.com'>d</a>
</div>
"""
root = lxml.html.fromstring(html).getroottree()
root.xpath("//div[@class='b']/a[@href='google.com']/text()")
# ['c', 'd']
This finds all the text from all the anchors which refer to 'google.com' that are inside any div with a class 'b'.
