Given the following code:
<html>
<body>
<div class="category1" id="foo">
<div class="category2" id="bar">
<div class="category3">
</div>
<div class="category4">
<div class="category5"> test
</div>
</div>
</div>
</div>
</body>
</html>
How do I extract the word test from <div class="category5"> test using BeautifulSoup, i.e. how do I deal with nested divs? I tried to look this up on the Internet but didn't find any case that treats an easy-to-grasp example, so I set up this one. Thanks.
XPath would be the straightforward answer; however, it is not supported in BeautifulSoup.
Update: with a BeautifulSoup solution
To do so, given that you know the class and the element (div in this case), you can use a for loop with attrs to get what you want:
from bs4 import BeautifulSoup
html = '''
<html>
<body>
<div class="category1" id="foo">
<div class="category2" id="bar">
<div class="category3">
</div>
<div class="category4">
<div class="category5"> test
</div>
</div>
</div>
</div>
</body>
</html>'''
content = BeautifulSoup(html, 'html.parser')
for div in content.find_all('div', attrs={'class': 'category5'}):
    print(div.text)
Output:
test
I have no problem extracting the text from your HTML sample. As @MartijnPieters suggested, you will need to find out why your div element is missing.
Another update
You're missing lxml as a parser for BeautifulSoup; that's why None was returned, as nothing was parsed to begin with. Installing lxml should solve your issue.
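For reference, once lxml is installed you can pass it to BeautifulSoup explicitly as the parser; a minimal sketch, reusing the html string above:
from bs4 import BeautifulSoup
# explicitly request the lxml parser (requires: pip install lxml)
soup = BeautifulSoup(html, 'lxml')
print(soup.find('div', class_='category5').text)
# test (plus surrounding whitespace)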
You may also consider using lxml or similar, which supports XPath; dead easy if you ask me.
from lxml import etree
tree = etree.fromstring(html) # or etree.parse from source
tree.xpath('.//div[@class="category5"]/text()')
[' test\n ']
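If you want the bare word without the surrounding whitespace, strip each matched text node:
# strip leading/trailing whitespace from each extracted text node
print([t.strip() for t in tree.xpath('.//div[@class="category5"]/text()')])
# ['test']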
I want to wrap the content of a lot of div-elements/blocks with p tags:
<div class='value'>
some content
</div>
It should become:
<div class='value'>
<p>
some content
</p>
</div>
My idea was to get the content (using bs4) by filtering strings with find_all and then wrapping it with the new tag. I don't know if that works, though; I can't filter content from tags with specific attributes/values.
I could do this with regex instead of bs4, but I'd like to do all transformations (there are some more besides this one) in bs4.
Believe it or not, you can use wrap. :-)
Because you might, or might not, want to wrap inner div elements, I decided to alter your HTML code a little bit, so that I could give you code that shows how to alter an inner div without changing the one 'outside' it. You will see how to alter all divs, I'm sure.
Here's how.
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(open('pjoern.htm').read(), 'lxml')
>>> inner_div = soup.findAll('div')[1]
>>> inner_div
<div>
some content
</div>
>>> inner_div.contents[0].wrap(soup.new_tag('p'))
<p>
some content
</p>
>>> print(soup.prettify())
<html>
<body>
<div class="value">
<div>
<p>
some content
</p>
</div>
</div>
</body>
</html>
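If you do want to wrap every such div rather than one picked by index, a small sketch along the same lines, assuming each div holds a single text node as in your sample:
from bs4 import BeautifulSoup
html = "<div class='value'> some content </div><div class='value'> more content </div>"
soup = BeautifulSoup(html, 'html.parser')
# wrap the text node of each div with class 'value' in a new <p> tag
for div in soup.find_all('div', class_='value'):
    div.contents[0].wrap(soup.new_tag('p'))
print(soup.prettify())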
I'm struggling to find a simple way to solve this problem and hope you might be able to help.
I've been using BeautifulSoup's find_all and trying some regex to find all the items except the 'emptyItem' one in the HTML below:
<div class="product_item0 ">...</div>
<div class="product_item1 ">...</div>
<div class="product_item2 ">...</div>
<div class="product_item0 ">...</div>
<div class="product_item1 ">...</div>
<div class="product_item2 ">...</div>
<div class="product_item0 ">...</div>
<div class="product_item1 last">...</div>
<div class="product_item2 emptyItem">...</div>
Is there a simple way to find all the items except the one with the 'emptyItem' class?
Just skip elements containing the emptyItem class. Working sample:
from bs4 import BeautifulSoup
data = """
<div>
<div class="product_item0">test0</div>
<div class="product_item1">test1</div>
<div class="product_item2">test2</div>
<div class="product_item2 emptyItem">empty</div>
</div>
"""
soup = BeautifulSoup(data, "html.parser")
for elm in soup.select("div[class^=product_item]"):
    if "emptyItem" in elm["class"]:  # skip elements having the emptyItem class
        continue
    print(elm.get_text())
Prints:
test0
test1
test2
Note that the div[class^=product_item] is a CSS selector that would match all div elements with a class starting with product_item.
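If your BeautifulSoup uses a recent soupsieve backend (bs4 4.7+), you can also push the exclusion into the selector itself with :not(); a sketch, assuming the same soup as above:
# select product_item divs that do not also carry the emptyItem class
for elm in soup.select("div[class^=product_item]:not(.emptyItem)"):
    print(elm.get_text())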
I am trying to parse an awful HTML page with BeautifulSoup to retrieve a few pieces of information. The code below:
import bs4
with open("smartradio.html") as f:
    html = f.read()
soup = bs4.BeautifulSoup(html, "html.parser")
x = soup.find_all("div", class_="ue-alarm-status", playerid="43733")
print(x)
extracts the fragments I would like to analyze further:
[<div alarmid="f319e1fb" class="ue-alarm-status" playerid="43733">
<div>
<div class="ue-alarm-edit ue-link">Réveil 1: </div>
<div>allumé</div>
<div>7:00</div>
</div>
<div>
<div class="ue-alarm-dow">Lu, Ma, Me, Je, Ve </div>
<div class="ue-alarm-delete ue-link">Supprimer</div>
</div>
</div>, <div alarmid="ea510709" class="ue-alarm-status" playerid="43733">
<div>
<div class="ue-alarm-edit ue-link">Réveil 2: </div>
<div>allumé</div>
<div>7:30</div>
</div>
<div>
<div class="ue-alarm-dow">Sa </div>
<div class="ue-alarm-delete ue-link">Supprimer</div>
</div>
</div>]
I am interested in retrieving:
the hour (lines 5 and 14)
the string (days in French) under <div class="ue-alarm-dow">
I believe that for the days it is enough to repeat a find() or find_all(). I mention this because, while it grabs the right information, I am not sure this is the right way to parse the file with BeautifulSoup (but at least it works):
for y in x:
    z = y.find("div", class_="ue-alarm-dow")
    print(z.text)
# output:
# Lu, Ma, Me, Je, Ve
# Sa
I do not know how to get to the hour, though. Is there a way to navigate the tree by path (in the sense that I know that the hour is under the second <div>, three <div> deep)? Or should I do it differently?
You can also rely on the allumé text and get the next sibling div element:
y.find('div', text=u'allumé').find_next_sibling('div').text
or, in a similar manner, relying on the class of the previous div:
y.find('div', class_='ue-alarm-edit').find_next_siblings('div')[1].text
or, using regular expressions (requires import re):
y.find('div', text=re.compile(r'\d+:\d+')).text
or, just get the div by index:
y.find_all('div')[3].text
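Putting it together, a minimal runnable sketch using the regex variant, assuming x holds the ue-alarm-status divs extracted above:
import re
for y in x:
    # the hour is the only nested div whose text looks like H:MM
    hour = y.find('div', text=re.compile(r'\d+:\d+')).text
    days = y.find('div', class_='ue-alarm-dow').text
    print(hour, days)
# 7:00 Lu, Ma, Me, Je, Ve
# 7:30 Sa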
I have the following HTML structure:
<li class="g">
<div class="vsc">
<div class="alpha"></div>
<div class="beta"></div>
<h3 class="r">
<a href="http://www.stackoverflow.com"></a>
</h3>
</div>
</li>
The above HTML structure keeps repeating. What would be the easiest way to parse all the links (to stackoverflow.com) from it using BeautifulSoup and Python?
BeautifulSoup 4 offers a convenient way of accomplishing this, using CSS selectors:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
print([a["href"] for a in soup.select('h3.r a')])
This also has the advantage of constraining the selection by context: it selects only those anchor nodes that are descendants of an h3 node with class r.
Omitting the constraint, or choosing one more suitable for your needs, is easy by just tweaking the selector; see the CSS selector docs for that.
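For example, a looser selector against the same soup that grabs any anchor inside the repeating li blocks:
# looser: any anchor anywhere inside an <li class="g"> block
print([a["href"] for a in soup.select('li.g a')])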
Using CSS selectors as proposed by Petri is probably the best way to do it with BS. However, I can't hold myself back from recommending lxml.html and XPath, which are pretty much perfect for the job.
Test HTML:
html="""
<html>
<li class="g">
<div class="vsc"></div>
<div class="alpha"></div>
<div class="beta"></div>
<h3 class="r">
</h3>
</li>
<li class="g">
<div class="vsc"></div>
<div class="alpha"></div>
<div class="beta"></div>
<h3 class="r">
</h3>
</li>
<li class="g">
<div class="vsc"></div>
<div class="gamma"></div>
<div class="beta"></div>
<h3 class="r">
</h3>
</li>
</html>"""
and it's basically a one-liner:
import lxml.html as lh
doc = lh.fromstring(html)
doc.xpath('.//li[@class="g"][div/@class = "vsc"][div/@class = "alpha"][div/@class = "beta"][h3/@class = "r"]/h3/a/@href')
Out[264]:
['http://www.correct.com', 'http://www.correct.com']
Consider the following:
<div id=hotlinklist>
<a href="/foo1">Foo1</a>
<div id=hotlink>
<a href="/">Home</a>
</div>
<div id=hotlink>
<a href="/extract">Extract</a>
</div>
<div id=hotlink>
<a href="/sitemap">Sitemap</a>
</div>
</div>
How would you go about taking out the sitemap line with regex in Python?
<a href="/sitemap">Sitemap</a>
The following can be used to pull out the anchor tags:
'/<a(.*?)a>/i'
However, there are multiple anchor tags. Also, there are multiple hotlink divs, so we can't really use those either?
Don't use a regex. Use BeautifulSoup, an HTML parser.
from bs4 import BeautifulSoup
html = """
<div id=hotlinklist>
<a href="/foo1">Foo1</a>
<div id=hotlink>
<a href="/">Home</a>
</div>
<div id=hotlink>
<a href="/extract">Extract</a>
</div>
<div id=hotlink>
<a href="/sitemap">Sitemap</a>
</div>
</div>"""
soup = BeautifulSoup(html, "html.parser")
soup.find_all("div", id="hotlink")[2].a
# <a href="/sitemap">Sitemap</a>
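From there it's plain attribute and text access; a short follow-up sketch, assuming the markup above:
tag = soup.find_all("div", id="hotlink")[2].a
# the anchor's text and its href attribute
print(tag.text, tag["href"])
# Sitemap /sitemap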
Parsing HTML with regular expressions is a bad idea!
Think about the following piece of HTML:
<a></a > <!-- legal html, but won't pass your regex -->
<a href="/sitemap">Sitemap</a><!-- proof that a>b iff ab>1 -->
There are many more such examples. Regular expressions are good for many things, but not for parsing HTML.
You should consider using Beautiful Soup, a Python HTML parser.
Anyhow, an ad-hoc solution using regex is:
import re
data = """
<div id=hotlinklist>
<a href="/foo1">Foo1</a>
<div id=hotlink>
<a href="/">Home</a>
</div>
<div id=hotlink>
<a href="/extract">Extract</a>
</div>
<div id=hotlink>
<a href="/sitemap">Sitemap</a>
</div>
</div>
"""
e = re.compile('<a *[^>]*>.*</a *>')
print(e.findall(data))
Output:
>>> e.findall(data)
['<a href="/foo1">Foo1</a>', '<a href="/">Home</a>', '<a href="/extract">Extract</a>', '<a href="/sitemap">Sitemap</a>']
In order to extract the contents of the tag
<a href="/sitemap">Sitemap</a>
... I would use:
>>> import re
>>> s = '''
<div id=hotlinklist>
<a href="/foo1">Foo1</a>
<div id=hotlink>
<a href="/">Home</a>
</div>
<div id=hotlink>
<a href="/extract">Extract</a>
</div>
<div id=hotlink>
<a href="/sitemap">Sitemap</a>
</div>
</div>'''
>>> m = re.compile(r'<a href="/sitemap">(.*?)</a>').search(s)
>>> m.group(1)
'Sitemap'
Use BeautifulSoup or lxml if you need to parse HTML.
Also, what is it that you really need to do? Find the last link? Find the third link? Find the link that points to /sitemap? It's unclear from your question. What do you need to do with the data?
If you really have to use regular expressions, have a look at re.findall.
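If it has to be regex, a small findall sketch against the s string above, assuming the markup stays this regular:
import re
# findall with two capture groups returns (href, text) tuples for every link
print(re.findall(r'<a href="([^"]*)">([^<]*)</a>', s))
# [('/foo1', 'Foo1'), ('/', 'Home'), ('/extract', 'Extract'), ('/sitemap', 'Sitemap')]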