I am trying to get the list of tags (CTC_3D_Printer, ctc_prusa_i3_pro_b, CTC_Upgrades) from the following HTML source:
html = """
<div class="content_stack">
<h2 class="section-header justify">
Tags
</h2>
<div class="thing-detail-tags-container">
<div class="taglist">
CTC_3D_Printer
ctc_prusa_i3_pro_b
CTC_Upgrades
</div>
</div>
</div>
<div class="content_stack">
<h2 class="section-header">
Design Tools
</h2>
<div class="taglist">
<span>Tinkercad</span>
</div>
</div>
"""
Normally I would use:
tags = soup.find("h2", string = "Tags").findNextSibling()
to get the tags. However, because there is extra whitespace around the text Tags, I cannot use it. The Tags header is also not always the first element that comes right after <div class="content_stack">. How could I solve my problem, perhaps by combining find with a predefined function?
As explained in Kinds of filters in the docs, you just write a function (that takes a BS tag object and returns true if it's a match), and pass it to find.
Their example is a function that finds only tags with a class but without an id:
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')
For your case, you just want to do an in check on the text:
h2 = soup.find('h2', string=lambda s: 'Tags' in s)
… or maybe:
h2 = soup.find(lambda tag: tag.name=='h2' and 'Tags' in tag.string)
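Putting it together, here is a minimal sketch that goes from the header to the tag list. The helper name is_tags_header is made up for illustration, and it assumes the tag names contain no whitespace, so that splitting on whitespace is safe. It uses get_text() rather than .string, since .string is None for a tag with more than one child.

```python
from bs4 import BeautifulSoup

html = """
<div class="content_stack">
<h2 class="section-header justify">
Tags
</h2>
<div class="thing-detail-tags-container">
<div class="taglist">
CTC_3D_Printer
ctc_prusa_i3_pro_b
CTC_Upgrades
</div>
</div>
</div>
<div class="content_stack">
<h2 class="section-header">
Design Tools
</h2>
<div class="taglist">
<span>Tinkercad</span>
</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")


def is_tags_header(tag):
    # Hypothetical helper: match an <h2> whose text is "Tags"
    # once the surrounding whitespace is stripped.
    return tag.name == "h2" and tag.get_text(strip=True) == "Tags"


h2 = soup.find(is_tags_header)
# Skip the whitespace text node and go to the next <div> sibling.
container = h2.find_next_sibling("div")
# Assumes tag names contain no spaces, so split() separates them.
tags = container.get_text().split()
print(tags)
```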
I am writing a Python script using pdfminer.six to convert a large batch of PDFs to HTML so that I can upload them to an e-store afterwards. So far the main text blocks have been parsed quite well, but in the process I had to replace all spans with divs (and strip the spans of their attributes) for obvious reasons, so now a document's structure is as follows:
<div> #first main block
<div>
Product desc heading
</div>
<div>
Product desc text
</div>
#etc etc
</div>
<div> #second main block
<div>
Product specs heading
</div>
<div>
Product specs text
</div>
#etc etc
</div>
The problem is the navigation in identical divs. If I try to find the very first div and add some attributes to it, like the docs suggest:
firstdiv = soup.find('div')
firstdiv['class'] = 'main_productinfo'
The result is quite predictable - IDLE prints out the following error:
File "C:\Users\blabla\AppData\Local\Programs\Python\Python37\lib\site-packages\bs4\element.py", line 1036, in __setitem__
self.attrs[key] = value
TypeError: 'NoneType' object does not support item assignment
I assume this is because the find() method doesn't return a guaranteed result (it may or may not find a match).
I want to strain the first block in each file and then parse the tables (found in the specs block below) to html and join these two in each upload file.
How can I add attributes to the first tag without converting the soup to string again and again (and thus making it really, really ugly, since it converts the newly refined soup without any whitespaces) and replacing parts of the string in str(soup)? I'm quite new to Python and nothing readily comes to mind.
UPD:
I'm using Python 3.7.2 on Win 7 64.
I'm not getting that error:
import bs4
html = '''<div> #first main block
<div>
Product desc heading
</div>
<div>
Product desc text
</div>
#etc etc
</div>
<div> #second main block
<div>
Product specs heading
</div>
<div>
Product specs text
</div>
#etc etc
</div>'''
soup = bs4.BeautifulSoup(html, 'html.parser')
firstdiv = soup.find('div')
print(firstdiv)
Output:
<div> #first main block
<div>
Product desc heading
</div>
<div>
Product desc text
</div>
#etc etc
</div>
Then:
firstdiv['class'] = 'main_productinfo'
print (firstdiv)
<div class="main_productinfo"> #first main block
<div>
Product desc heading
</div>
<div>
Product desc text
</div>
#etc etc
</div>
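One possible explanation, offered as a guess rather than a diagnosis: the traceback ends inside __setitem__ at self.attrs[key] = value, which suggests that find() did return a tag but that the tag's attrs was None. That could happen if the attributes were stripped earlier by assigning None instead of an empty dict. The sketch below reproduces the message under that assumption. The attrs = None line is hypothetical and is not taken from the question.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><div>Product desc heading</div></div>", "html.parser")
firstdiv = soup.find("div")

# Assumption: attributes were "stripped" by assigning None.
firstdiv.attrs = None

try:
    firstdiv["class"] = "main_productinfo"
    error = None
except TypeError as exc:
    # Item assignment on None fails inside Tag.__setitem__.
    error = str(exc)

# Stripping with an empty dict keeps the tag usable.
firstdiv.attrs = {}
firstdiv["class"] = "main_productinfo"
result = str(firstdiv)
print(error)
print(result)
```

If that is what happened, clearing attributes with tag.attrs = {} in the earlier stripping step should avoid the error.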
<div id="main-content" class="content">
<div class="metaline">
<span class="article-meta author">jorden</span>
</div>
"
1.name:jorden>
2.age:28
--
"
<span class="D2"> from 111.111.111.111 </span>
</div>
I only need
1.name:jorden
2.age:28
xxx.select('#main-content') returns everything, but I only need part of it.
Because those lines are not inside any tag, I don't know how to extract them.
You want to find the tag before the text in question (in your case, <div class="metaline">) and then look at the next sibling in the HTML parse tree:
text = soup.find("div", class_='metaline').next_sibling
print(text)
# "
# 1.name:jorden>
# 2.age:28
#
# --
# "
#
Once you get the raw text, strip it, etc.
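As a rough sketch of that clean-up step: split the raw text into lines, strip each one, and keep only the lines you care about. The rule used here (keep lines that start with a digit) is an assumption based on the sample, so adjust it to your real data.

```python
from bs4 import BeautifulSoup

html = '''<div id="main-content" class="content">
<div class="metaline">
<span class="article-meta author">jorden</span>
</div>
"
1.name:jorden>
2.age:28
--
"
<span class="D2"> from 111.111.111.111 </span>
</div>'''

soup = BeautifulSoup(html, "html.parser")

# The loose text is the node right after <div class="metaline">.
raw = soup.find("div", class_="metaline").next_sibling

# Strip every line and drop the quote marks, the "--" separator
# and blank lines by keeping only lines that start with a digit
# (an assumption about the data, not a general rule).
lines = [line.strip() for line in raw.splitlines()]
wanted = [line for line in lines if line and line[0].isdigit()]
print(wanted)
```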
I'm new to Python and trying to parse a simple HTML. However, one thing stops me: for example, I have this html:
<div class = "quote">
<div class = "whatever">
some unnecessary text here
</div>
<div class = "text">
Here's the desired text!
</div>
</div>
I need to extract the text from the second div (the one with class text). This is how I get it:
print repr(link.find('div').findNextSibling())
However, this returns the whole div (with "div" word): <div class="text">Here's the desired text!</div>
And I don't know how to get text only.
Adding .text results in \u043a\u0430\u043a \u0440\u0430\u0437\u0440\u0430\u0431 strings\
Adding .strings returns "None"
Adding .string returns both "None" and \u042f\u0445\u0438\u043a\u043e - \u0435\u0441\u043b\u0438\
Maybe there's something wrong with repr
P.S. I need to save tags inside div too.
Why don't you simply search for the <div> element based on its class attribute? Something like the following seems to work for me:
from bs4 import BeautifulSoup
html = '''<div class = "quote">
<div class = "whatever">
some unnecessary text here
</div>
<div class = "text">
Here's the desired text!
</div>
</div>'''
link = BeautifulSoup(html, 'html.parser')
print(link.find('div', class_="text").text.strip())
It yields:
Here's the desired text!
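Regarding the P.S. about keeping the tags inside the div: .text drops them, so a sketch like the following may be closer to what you need. It uses decode_contents(), which returns the inner markup of the element as a string. The <b> tag in the sample is added here purely for illustration and is not in the original HTML.

```python
from bs4 import BeautifulSoup

# The <b> element is a made-up addition to show nested tags surviving.
html = '''<div class="quote">
<div class="whatever">
some unnecessary text here
</div>
<div class="text">
Here's the <b>desired</b> text!
</div>
</div>'''

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="text")

# Inner markup without the enclosing <div>, nested tags kept.
inner_html = div.decode_contents().strip()
# Plain text only, nested tags removed.
plain = div.get_text().strip()
print(inner_html)
print(plain)
```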
I want to parse HTML and turn it into string templates. In the example below, I sought out elements marked with x-inner, and they became template placeholders in the final string. x-attrsite also became a template placeholder (with a different command, of course).
Input:
<div class="x,y,z" x-attrsite>
<div x-inner></div>
<div>
<div x-inner></div>
</div>
</div>
Desired output:
<div class="x,y,z" {attrsite}>{inner}<div>{inner}</div></div>
I know there is HTMLParser and BeautifulSoup, but I am at a loss on how to extract the strings before and after the x-* markers and to escape those strings for templating.
Existing curly braces are handled sanely, like this sample:
<div x-maybe-highlighted> The template string "there are {n} message{suffix}" can be used.</div>
BeautifulSoup can handle this case:
Find all div elements with the x-attrsite attribute, remove the attribute, and add an {attrsite} attribute with the value None (which produces an attribute with no value).
Find all div elements with the x-inner attribute and use replace_with() to replace each element with the text {inner}.
Implementation:
from bs4 import BeautifulSoup
data = """
<div class="x,y,z" x-attrsite>
<div x-inner></div>
<div>
<div x-inner></div>
</div>
</div>
"""
soup = BeautifulSoup(data, 'html.parser')
for div in soup.find_all('div', {'x-attrsite': True}):
    del div['x-attrsite']
    div['{attrsite}'] = None
for div in soup.find_all('div', {'x-inner': True}):
    div.replace_with('{inner}')
print(soup.prettify())
Prints:
<div class="x,y,z" {attrsite}>
{inner}
<div>
{inner}
</div>
</div>
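The code above does not touch curly braces that already exist in the text. One way to handle them, sketched here under the assumption that the result is meant for str.format, is to double the braces in every text node before inserting the placeholders. The sample input combines the two snippets from the question and is only illustrative.

```python
from bs4 import BeautifulSoup

# Illustrative input: existing braces plus one x-inner marker.
data = ('<div x-maybe-highlighted>The template string '
        '"there are {n} message{suffix}" can be used.'
        '<div x-inner></div></div>')

soup = BeautifulSoup(data, "html.parser")

# Step 1: escape literal braces in all existing text nodes
# (assumes str.format-style templates, where {{ means a literal {).
for text in soup.find_all(string=True):
    text.replace_with(text.replace("{", "{{").replace("}", "}}"))

# Step 2: only now insert the real placeholders.
for div in soup.find_all("div", {"x-inner": True}):
    div.replace_with("{inner}")

# str() avoids the extra whitespace that prettify() adds.
template = str(soup)
rendered = template.format(inner="X")
print(template)
print(rendered)
```

Using str(soup) instead of prettify() also keeps the output free of added line breaks, which is closer to the desired output in the question.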
I currently am successfully scraping the data I need by chaining bs4 .contents together following a find_all('div'), but that seems inherently fragile. I'd like to go directly to the tag I need by class, but my "class_=" search is returning None.
I ran the following code on the html below, which returns None:
soup = BeautifulSoup(text) # this works fine
tag = soup.find(class_ = "loan-section-content") # this returns None
Also tried soup.find('div', class_ = "loan-section-content") - also returns None.
My html is:
<div class="loan-section">
<div class="loan-section-title">
<span class="text-light"> Some Text </span>
</div>
<div class="loan-section-content">
<div class="row">
<div class="col-sm-6">
<strong>More text</strong>
<br/>
<strong>
Dakar, Senegal
</strong>
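Try this:
soup.find(attrs={'class':'loan-section-content'})
or
soup.find('div','loan-section-content')
The attrs argument searches on tag attributes; a plain string passed as the second positional argument is treated as a class name.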
Demo: