If the HTML code looks like this:
<div class="div1">
<p>hello</p>
<p>hi</p>
<div class="nesteddiv">
<p>one</p>
<p>two</p>
<p>three</p>
</div>
</div>
How do I extract just
<div class="div1">
<p>hello</p>
<p>hi</p>
</div>
I already tried parser.find('div', 'div1') but I'm getting the whole div including the nested one.
You actually want to extract() the nested div from the document and then get the first div. Here is an example (where html is the HTML you provided in the question):
>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup(html)
>>> soup.div.div.extract()
<div class="nesteddiv">
<p>one</p>
<p>two</p>
<p>three</p>
</div>
>>> soup.div
<div class="div1">
<p>hello</p>
<p>hi</p>
</div>
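(The session above uses the old BeautifulSoup 3 import; with the current bs4 package you would write from bs4 import BeautifulSoup and usually pass a parser explicitly, e.g. BeautifulSoup(html, 'html.parser'), but extract() works the same way.)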
Why not just find() the nested div and then remove it from the tree using extract()?
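For instance, a rough sketch of that approach with bs4 (assuming html holds the markup from the question):
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
outer = soup.find('div', class_='div1')
outer.find('div', class_='nesteddiv').extract()  # detach the nested div from the tree
print(outer)  # now contains only the two <p> tags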
I'm currently scraping elements from a webpage. Let's say I'm iterating over an HTML response and a part of that response looks like this:
<div class="col-sm-12 col-md-5">
<div class="material">
<div class="material-parts">
<span class="material-part" title="SLT-4 2435">
<img src="/images/train-material/mat_slt4.png"/> </span>
<span class="material-part" title="SLT-6 2631">
<img src="/images/train-material/mat_slt6.png"/> </span>
</div>
</div>
</div>
I know I can access the title of the first span like so:
row[-1].find('span')['title']
"SLT-4 2435"
But I would like to select the second title under the span class (if it exists) as a string too, like so: "SLT-4 2435, SLT-6 2631"
Any ideas?
You can use the find_all() function to find all the span elements with the class material-part:
titles = []
for material_part in row[-1].find_all('span', class_='material-part'):
    titles.append(material_part['title'])
result = ', '.join(titles)
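With the sample markup above, result should come out as 'SLT-4 2435, SLT-6 2631', which matches the format you asked for.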
As an alternative to find() / find_all(), you could use CSS selectors:
soup.select('span.material-part[title]')
then iterate the ResultSet with a list comprehension and join() the titles into a single string:
','.join([t.get('title') for t in soup.select('span.material-part[title]')])
Example
from bs4 import BeautifulSoup
html = '''<div class="col-sm-12 col-md-5">
<div class="material">
<div class="material-parts">
<span class="material-part" title="SLT-4 2435">
<img src="/images/train-material/mat_slt4.png"/> </span>
<span class="material-part" title="SLT-6 2631">
<img src="/images/train-material/mat_slt6.png"/> </span>
</div>
</div>
</div>'''
soup = BeautifulSoup(html, 'html.parser')
print(','.join([t.get('title') for t in soup.select('span.material-part[title]')]))
Output
SLT-4 2435,SLT-6 2631
Let's assume I have HTML like the following:
<div class="question-div"></div>
<div class="answer-div"></div>
<div class="question-div"></div>
<div class="answer-div"></div>
<div class="question-div"></div>
<div class="answer-div"></div>
I want to move all divs with the class answer-div into the previous question-div. Can I handle it with BeautifulSoup?
You can also use insert():
from bs4 import BeautifulSoup
html="""
<div class="question-div"></div>
<div class="answer-div"></div>
<div class="question-div"></div>
<div class="answer-div"></div>
<div class="question-div"></div>
<div class="answer-div"></div>
"""
soup = BeautifulSoup(html, 'html.parser')
for div in soup.find_all('div', class_='answer-div'):
    div.find_previous_sibling('div').insert(0, div)
print(soup)
Output
<div class="question-div"><div class="answer-div"></div></div>
<div class="question-div"><div class="answer-div"></div></div>
<div class="question-div"><div class="answer-div"></div></div>
No hands-on experience with BeautifulSoup, but I'll give this one a shot!
The way I look at it, you find all the question divs and answer divs separately:
div_ques_Blocks = soup.find_all('div', class_="question-div")
div_ans_Blocks = soup.find_all('div', class_="answer-div")
and then loop through the answer-divs to find the preceding question-div you want to insert/append into:
for divtag in div_ans_Blocks:
    print(divtag.find_previous_sibling('div'))
If that print statement gives you the matching question-div for each answer-div, you can then try appending instead of printing, maybe something like the sketch below:
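A rough, untested sketch of that idea (reusing the soup object from above; the class_ filter on find_previous_sibling is my addition):
for divtag in soup.find_all('div', class_='answer-div'):
    question = divtag.find_previous_sibling('div', class_='question-div')
    if question is not None:
        question.append(divtag)  # append() moves the answer div into its question div
print(soup)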
I'm struggling to find a simple way to solve this problem and hope you might be able to help.
I've been using BeautifulSoup's find_all and trying some regex to find all the items except the one with the 'emptyItem' class in the HTML below:
<div class="product_item0 ">...</div>
<div class="product_item1 ">...</div>
<div class="product_item2 ">...</div>
<div class="product_item0 ">...</div>
<div class="product_item1 ">...</div>
<div class="product_item2 ">...</div>
<div class="product_item0 ">...</div>
<div class="product_item1 last">...</div>
<div class="product_item2 emptyItem">...</div>
Is there a simple way to find all the items except the one with the 'emptyItem' class?
Just skip elements containing the emptyItem class. Working sample:
from bs4 import BeautifulSoup
data = """
<div>
<div class="product_item0">test0</div>
<div class="product_item1">test1</div>
<div class="product_item2">test2</div>
<div class="product_item2 emptyItem">empty</div>
</div>
"""
soup = BeautifulSoup(data, "html.parser")
for elm in soup.select("div[class^=product_item]"):
    if "emptyItem" in elm["class"]:  # skip elements having the emptyItem class
        continue
    print(elm.get_text())
Prints:
test0
test1
test2
Note that div[class^=product_item] is a CSS selector that matches every div element whose class attribute starts with product_item.
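As a side note, with a recent BeautifulSoup (where select() is backed by the Soup Sieve library) the filtering could presumably be folded into the selector itself with :not():
for elm in soup.select("div[class^=product_item]:not(.emptyItem)"):
    print(elm.get_text())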
Given the following code:
<html>
<body>
    <div class="category1" id="foo">
        <div class="category2" id="bar">
            <div class="category3">
            </div>
            <div class="category4">
                <div class="category5"> test
                </div>
            </div>
        </div>
    </div>
</body>
</html>
How do I extract the word test from <div class="category5"> test using BeautifulSoup, i.e. how do I deal with nested divs? I tried looking it up on the Internet but didn't find any case that treats an easy-to-grasp example, so I set up this one. Thanks.
XPath would be the straightforward answer; however, it is not supported in BeautifulSoup.
Updated: with a BeautifulSoup solution
To do so, given that you know the class and the element (div in this case), you can use a for loop with attrs to get what you want:
from bs4 import BeautifulSoup
html = '''
<html>
<body>
<div class="category1" id="foo">
<div class="category2" id="bar">
<div class="category3">
</div>
<div class="category4">
<div class="category5"> test
</div>
</div>
</div>
</div>
</body>
</html>'''
content = BeautifulSoup(html, 'html.parser')
for div in content.find_all('div', attrs={'class': 'category5'}):
    print(div.text)
test
I have no problem extracting the text from your HTML sample; as @MartijnPieters suggested, you will need to find out why your div element is missing.
Another update
You're missing lxml as a parser for BeautifulSoup; that's why None was returned, since nothing was parsed to begin with. Installing lxml should solve your issue.
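For example (assuming lxml is installed, e.g. via pip install lxml), you can name the parser explicitly when building the soup:
from bs4 import BeautifulSoup

content = BeautifulSoup(html, 'lxml')  # explicitly pick the lxml parser
print(content.find('div', attrs={'class': 'category5'}).text)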
You may consider using lxml or similar, which supports XPath; dead easy if you ask me.
from lxml import etree
tree = etree.fromstring(html) # or etree.parse from source
tree.xpath('.//div[@class="category5"]/text()')
[' test\n ']
Consider the following:
<div id=hotlinklist>
    <a href="/foo1">Foo1</a>
    <div id=hotlink>
        <a href="/">Home</a>
    </div>
    <div id=hotlink>
        <a href="/extract">Extract</a>
    </div>
    <div id=hotlink>
        <a href="/sitemap">Sitemap</a>
    </div>
</div>
How would you go about taking out the sitemap line with regex in python?
<a href="/sitemap">Sitemap</a>
The following can be used to pull out the anchor tags.
'/<a(.*?)a>/i'
However, there are multiple anchor tags. Also, there are multiple hotlink divs, so we can't really use those either.
Don't use a regex. Use BeautifulSoup, an HTML parser.
from BeautifulSoup import BeautifulSoup
html = \
"""
<div id=hotlinklist>
    <a href="/foo1">Foo1</a>
    <div id=hotlink>
        <a href="/">Home</a>
    </div>
    <div id=hotlink>
        <a href="/extract">Extract</a>
    </div>
    <div id=hotlink>
        <a href="/sitemap">Sitemap</a>
    </div>
</div>"""
soup = BeautifulSoup(html)
soup.findAll("div",id="hotlink")[2].a
# Sitemap
Parsing HTML with regular expressions is a bad idea!
Think about the following pieces of HTML:
<a></a > <!-- legal html, but won't pass your regex -->
<a href="/sitemap">Sitemap</a><!-- proof that a>b iff ab>1 -->
There are many more such examples. Regular expressions are good for many things, but not for parsing HTML.
You should consider using Beautiful Soup, a Python HTML parser.
Anyhow, an ad-hoc solution using regex is:
import re
data = """
<div id=hotlinklist>
    <a href="/foo1">Foo1</a>
    <div id=hotlink>
        <a href="/">Home</a>
    </div>
    <div id=hotlink>
        <a href="/extract">Extract</a>
    </div>
    <div id=hotlink>
        <a href="/sitemap">Sitemap</a>
    </div>
</div>
"""
e = re.compile('<a *[^>]*>.*</a *>')
print(e.findall(data))
Output:
>>> e.findall(data)
['<a href="/foo1">Foo1</a>', '<a href="/">Home</a>', '<a href="/extract">Extract</a>', '<a href="/sitemap">Sitemap</a>']
In order to extract the contents of the tag
<a href="/sitemap">Sitemap</a>
... I would use:
>>> import re
>>> s = '''
<div id=hotlinklist>
    <a href="/foo1">Foo1</a>
    <div id=hotlink>
        <a href="/">Home</a>
    </div>
    <div id=hotlink>
        <a href="/extract">Extract</a>
    </div>
    <div id=hotlink>
        <a href="/sitemap">Sitemap</a>
    </div>
</div>'''
>>> m = re.compile(r'<a href="/sitemap">(.*?)</a>').search(s)
>>> m.group(1)
'Sitemap'
Use BeautifulSoup or lxml if you need to parse HTML.
Also, what is it that you really need to do? Find the last link? Find the third link? Find the link that points to /sitemap? It's unclear from your question. What do you need to do with the data?
If you really have to use regular expressions, have a look at findall.
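For instance, a rough sketch (assuming the same s string as above) that pulls out just the link texts with a capture group:
>>> import re
>>> re.findall(r'<a [^>]*>(.*?)</a>', s)
['Foo1', 'Home', 'Extract', 'Sitemap']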