I'm struggling to find a simple way to solve this problem and hope you might be able to help.
I've been using BeautifulSoup's find_all and trying some regex to find all the items except the 'emptyItem' line in the HTML below:
<div class="product_item0 ">...</div>
<div class="product_item1 ">...</div>
<div class="product_item2 ">...</div>
<div class="product_item0 ">...</div>
<div class="product_item1 ">...</div>
<div class="product_item2 ">...</div>
<div class="product_item0 ">...</div>
<div class="product_item1 last">...</div>
<div class="product_item2 emptyItem">...</div>
Is there a simple way to find all the items except the one that includes 'emptyItem'?
Just skip elements containing the emptyItem class. Working sample:
from bs4 import BeautifulSoup
data = """
<div>
<div class="product_item0">test0</div>
<div class="product_item1">test1</div>
<div class="product_item2">test2</div>
<div class="product_item2 emptyItem">empty</div>
</div>
"""
soup = BeautifulSoup(data, "html.parser")
for elm in soup.select("div[class^=product_item]"):
    if "emptyItem" in elm["class"]:  # skip elements having emptyItem class
        continue
    print(elm.get_text())
Prints:
test0
test1
test2
Note that div[class^=product_item] is a CSS selector that matches every div element whose class attribute value starts with product_item.
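The filtering can also be pushed into the selector itself. This is only a minimal sketch, assuming a Beautiful Soup version whose select() understands the :not() pseudo-class (4.7 or later, where selectors are handled by soupsieve); the sample markup mirrors the one above:

```python
from bs4 import BeautifulSoup

data = """
<div>
  <div class="product_item0">test0</div>
  <div class="product_item1">test1</div>
  <div class="product_item2">test2</div>
  <div class="product_item2 emptyItem">empty</div>
</div>
"""

soup = BeautifulSoup(data, "html.parser")

# :not(.emptyItem) drops any div that also carries the emptyItem class
texts = [
    elm.get_text()
    for elm in soup.select("div[class^=product_item]:not(.emptyItem)")
]
print(texts)
```

Keep in mind that class^= compares against the start of the whole attribute string, so a div written as class="last product_item1" would not match either variant.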
Let's assume I have HTML like the following:
<div class="question-div"></div>
<div class="answer-div"></div>
<div class="question-div"></div>
<div class="answer-div"></div>
<div class="question-div"></div>
<div class="answer-div"></div>
I want to move all divs with the class answer-div into the previous question-div. Can I handle it with beautifulsoup?
You can also use insert
from bs4 import BeautifulSoup
html="""
<div class="question-div"></div>
<div class="answer-div"></div>
<div class="question-div"></div>
<div class="answer-div"></div>
<div class="question-div"></div>
<div class="answer-div"></div>
"""
soup = BeautifulSoup(html, 'html.parser')
for div in soup.find_all('div', {"class": "answer-div"}):
    # move each answer-div to the front of the div that precedes it
    div.find_previous_sibling('div').insert(0, div)
print(soup)
Output
<div class="question-div"><div class="answer-div"></div></div>
<div class="question-div"><div class="answer-div"></div></div>
<div class="question-div"><div class="answer-div"></div></div>
No hands-on experience with beautifulsoup but I will give this one a shot!
The way I look at it, you find all the question and answer divs separately.
div_ques_Blocks = soup.find_all('div', class_="question-div")
div_ans_Blocks = soup.find_all('div', class_="answer-div")
and then loop through the answer-div blocks, looking up the question-div that precedes each one:
for divtag in div_ans_Blocks:
    print(divtag.find_previous_sibling('div'))
If the print statement above shows the preceding question-div for each answer-div, you can then try appending each answer to it instead of printing.
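The append step that this answer stops short of could be sketched roughly as follows. It is untested against the asker's real page and assumes every answer-div has a question-div somewhere before it at the same nesting level; the variable names are made up for the example:

```python
from bs4 import BeautifulSoup

html = """
<div class="question-div"></div>
<div class="answer-div"></div>
<div class="question-div"></div>
<div class="answer-div"></div>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all returns a list up front, so moving tags while looping is safe
for answer in soup.find_all("div", class_="answer-div"):
    question = answer.find_previous_sibling("div", class_="question-div")
    if question is not None:
        # append() detaches the tag from its old position and re-attaches it here
        question.append(answer)

questions = soup.find_all("div", class_="question-div")
print(soup)
```

Filtering the sibling lookup by class_ means a stray div between a question and its answer would not be picked up by mistake.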
I'm getting my toes wet with BeautifulSoup and am hung up on scraping some particular info. The HTML looks like the following, for example:
<div class="row">
::before
<div class="course-short clearfix">
::before
<div class="course-meta col-sm-12">
<dl>
<dt>Language:</dt>
<dd>English</dd>
<dt>Author:</dt>
<dd>John Doe</dd>
<dt>Institution:</dt>
<dd>American University</dd>
</dl>
</div>
...
<div class="row">
::before
<div class="course-short clearfix">
::before
<div class="course-meta col-sm-12">
<dl>
<dt>Language:</dt>
<dd>English</dd>
<dt>Author:</dt>
<dd>John Doe, Jr.</dd>
<dt>Institution:</dt>
<dd>Another University</dd>
</dl>
</div>
...
Each page has about 10 <div class="row"> tags, each with the same <dt> and <dd> pattern (e.g., Language, Author, Institution).
I am trying to scrape the <dd>American University</dd> info, ultimately to create a loop so that I can get that info specific to each <div class="row"> tag.
I've managed the following so far:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://www.oeconsortium.org/courses/search/?search=statistics")
bsObj = BeautifulSoup(html.read(), "html.parser")
institutions = [x.text.strip() for x in bsObj.find_all('div', 'course-meta col-sm-12', 'dd')]
But that only gives me the following mess for each respective <div class="row"> :
Language:\n\n\t\t\t\t\t\t\t\t\t\t\tEnglish\t\t\t\t\t\t\t\t\t\nAuthor:\nJohn Doe\nInstitution:\nAmerican University\n
Language:\n\n\t\t\t\t\t\t\t\t\t\t\tEnglish\t\t\t\t\t\t\t\t\t\nAuthor:\nJohn Doe, Jr.\nInstitution:\nAnother University\n
...
(I know how to .strip(); that's not the issue.)
I can't figure out how to target that third <dd></dd> for each respective <div class="row">. I feel like it may be possible by also targeting the <dt>Institution:</dt> tag (which is "Institution" in every respective case), but I can't figure it out.
Any help is appreciated. I am trying to make a LOOP so that I can loop over, say, ten <div class="row"> instances and just pull out the info specific to the "Institution" <dd> tag.
Thank you!
I can't figure out how to target that third <dd></dd> for each respective <div class="row">
find_all will return a list of all occurrences, so you could just take the third element of the result, although you may want to wrap the lookup in try/except to guard against an IndexError.
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all
I'd make a function out of this code to reuse it for different pages:
soup = BeautifulSoup(html.read(), "html.parser")
result = []
for meta in soup.find_all('div', {'class': 'course-meta'}):
    try:
        institution = meta.find_all('dd')[2].text.strip()
        result.append(institution)  # or whatever you like to store it
    except IndexError:
        pass
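Since the question mentions targeting the <dt>Institution:</dt> label instead of a fixed position, here is a hedged sketch of that variant. It assumes the label text is exactly "Institution:" and that the matching <dd> follows it inside the same <dl>; the inline HTML is a trimmed copy of the question's sample, not the live page:

```python
from bs4 import BeautifulSoup

html = """
<div class="course-meta col-sm-12">
  <dl>
    <dt>Language:</dt><dd>English</dd>
    <dt>Author:</dt><dd>John Doe</dd>
    <dt>Institution:</dt><dd>American University</dd>
  </dl>
</div>
<div class="course-meta col-sm-12">
  <dl>
    <dt>Language:</dt><dd>English</dd>
    <dt>Author:</dt><dd>John Doe, Jr.</dd>
    <dt>Institution:</dt><dd>Another University</dd>
  </dl>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
result = []
for meta in soup.find_all("div", class_="course-meta"):
    # locate the label by its text, then step to the <dd> right after it
    label = meta.find("dt", string="Institution:")
    if label is None:
        continue  # this block has no Institution entry
    value = label.find_next_sibling("dd")
    if value is not None:
        result.append(value.get_text(strip=True))

print(result)
```

This keeps working if a block omits or reorders the other entries, which the index-based lookup would not survive.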
Given the following code:
<html>
<body>
<div class="category1" id="foo">
<div class="category2" id="bar">
<div class="category3">
</div>
<div class="category4">
<div class="category5"> test
</div>
</div>
</div>
</div>
</body>
</html>
How do I extract the word test from <div class="category5"> test using BeautifulSoup, i.e. how do I deal with nested divs? I tried looking it up on the Internet but didn't find an easy-to-grasp example, so I set up this one. Thanks.
XPath would be the straightforward answer; however, it is not supported in BeautifulSoup.
Updated: with a BeautifulSoup solution
To do so, given that you know the class and element (div in this case), you can use a for loop with attrs to get what you want:
from bs4 import BeautifulSoup
html = '''
<html>
<body>
<div class="category1" id="foo">
<div class="category2" id="bar">
<div class="category3">
</div>
<div class="category4">
<div class="category5"> test
</div>
</div>
</div>
</div>
</body>
</html>'''
content = BeautifulSoup(html, "html.parser")  # name the parser explicitly
for div in content.find_all('div', attrs={'class': 'category5'}):
    print(div.text)
test
I have no problem extracting the text from your HTML sample. As @MartijnPieters suggested, you will need to find out why your div element is missing.
Another update
You're missing lxml as a parser for BeautifulSoup, which is why None was returned: nothing was parsed to start with. Installing lxml should solve your issue.
You may consider using lxml or similar which supports xpath, dead easy if you ask me.
from lxml import etree
tree = etree.fromstring(html) # or etree.parse from source
tree.xpath('.//div[@class="category5"]/text()')
[' test\n ']
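If you would rather stay within BeautifulSoup, a CSS selector can walk the nesting in much the same way as the XPath expression. A small sketch, assuming bs4's select_one() is available; the ids and classes come from the question's sample and the variable names are made up:

```python
from bs4 import BeautifulSoup

html = """
<div class="category1" id="foo">
  <div class="category2" id="bar">
    <div class="category3"></div>
    <div class="category4">
      <div class="category5"> test
      </div>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# A space means "any descendant", > means "direct child"
inner = soup.select_one("div#foo div#bar div.category4 > div.category5")

# select_one returns None when nothing matches, so guard before reading text
text = inner.get_text(strip=True) if inner is not None else None
print(text)
```

The shorter selector "div.category5" would find the same element here; spelling out the ancestors only matters when the class is reused elsewhere in the page.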
I have the following given html structure
<li class="g">
<div class="vsc">
<div class="alpha"></div>
<div class="beta"></div>
<h3 class="r">
</h3>
</div>
</li>
The above HTML structure keeps repeating. What is the easiest way to parse all the links (stackoverflow.com) from it using BeautifulSoup and Python?
BeautifulSoup 4 offers a convenient way of accomplishing this, using CSS selectors:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
print([a["href"] for a in soup.select('h3.r a')])
This also has the advantage of constraining the selection by context: it selects only those anchor nodes that are descendants of an h3 node with class r.
Omitting the constraint or choosing one most suitable for the need is easy by just tweaking the selector; see the CSS selector docs for that.
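To make the effect of tweaking the selector concrete, here is a rough sketch. The markup is hypothetical: the anchors and the example.com URLs are placeholders invented for the example, not taken from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the anchors and example.com URLs are placeholders
html = """
<li class="g">
  <div class="vsc">
    <h3 class="r"><a href="https://example.com/result">Result</a></h3>
    <p><a href="https://example.com/cached">Cached</a></p>
  </div>
</li>
"""

soup = BeautifulSoup(html, "html.parser")

# anchors anywhere inside an h3 with class r
in_heading = [a["href"] for a in soup.select("h3.r a")]

# anchors that are direct children of such an h3
direct_children = [a["href"] for a in soup.select("h3.r > a")]

# no context constraint at all: every anchor that has an href
every_link = [a["href"] for a in soup.select("a[href]")]

print(in_heading, direct_children, every_link)
```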
Using CSS selectors as proposed by Petri is probably the best way to do it using BS. However, I can't resist recommending lxml.html and XPath, which are pretty much perfect for the job.
Test html:
html="""
<html>
<li class="g">
<div class="vsc"></div>
<div class="alpha"></div>
<div class="beta"></div>
<h3 class="r">
</h3>
</li>
<li class="g">
<div class="vsc"></div>
<div class="alpha"></div>
<div class="beta"></div>
<h3 class="r">
</h3>
</li>
<li class="g">
<div class="vsc"></div>
<div class="gamma"></div>
<div class="beta"></div>
<h3 class="r">
</h3>
</li>
</html>"""
and it's basically a one-liner:
import lxml.html as lh
doc=lh.fromstring(html)
doc.xpath('.//li[@class="g"][div/@class = "vsc"][div/@class = "alpha"][div/@class = "beta"][h3/@class = "r"]/h3/a/@href')
Out[264]:
['http://www.correct.com', 'http://www.correct.com']
If the HTML code looks like this:
<div class="div1">
<p>hello</p>
<p>hi</p>
<div class="nesteddiv">
<p>one</p>
<p>two</p>
<p>three</p>
</div>
</div>
How do I extract just
<div class="div1">
<p>hello</p>
<p>hi</p>
</div>
I already tried parser.find('div', 'div1') but I'm getting the whole div including the nested one.
You actually want to extract() the nested div from the document and then get the first div. Here is an example (where html is the HTML you provided in the question):
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html, "html.parser")
>>> soup.div.div.extract()
<div class="nesteddiv">
<p>one</p>
<p>two</p>
<p>three</p>
</div>
>>> soup.div
<div class="div1">
<p>hello</p>
<p>hi</p>
</div>
Why not just find() the nested div and then remove it from the tree using extract()?
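That find()-then-extract() idea might look like the sketch below. It assumes bs4 with the built-in html.parser, and the names outer, nested and remaining are just for illustration:

```python
from bs4 import BeautifulSoup

html = """
<div class="div1">
<p>hello</p>
<p>hi</p>
<div class="nesteddiv">
<p>one</p>
<p>two</p>
<p>three</p>
</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
outer = soup.find("div", class_="div1")

# look the nested div up by class rather than by position
nested = outer.find("div", class_="nesteddiv")
if nested is not None:
    nested.extract()  # detaches the tag from the tree and returns it

remaining = [p.get_text() for p in outer.find_all("p")]
print(remaining)
```

If the removed markup is not needed afterwards, decompose() can be used in place of extract(); it destroys the tag instead of returning it.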