I am trying to parse an awful HTML page with BeautifulSoup to retrieve some information. The code below:
import bs4

with open("smartradio.html") as f:
    html = f.read()

soup = bs4.BeautifulSoup(html, "html.parser")
x = soup.find_all("div", class_="ue-alarm-status", playerid="43733")
print(x)
extracts the fragments I would like to analyze further:
[<div alarmid="f319e1fb" class="ue-alarm-status" playerid="43733">
<div>
<div class="ue-alarm-edit ue-link">Réveil 1: </div>
<div>allumé</div>
<div>7:00</div>
</div>
<div>
<div class="ue-alarm-dow">Lu, Ma, Me, Je, Ve </div>
<div class="ue-alarm-delete ue-link">Supprimer</div>
</div>
</div>, <div alarmid="ea510709" class="ue-alarm-status" playerid="43733">
<div>
<div class="ue-alarm-edit ue-link">Réveil 2: </div>
<div>allumé</div>
<div>7:30</div>
</div>
<div>
<div class="ue-alarm-dow">Sa </div>
<div class="ue-alarm-delete ue-link">Supprimer</div>
</div>
</div>]
I am interested in retrieving:
the hour (lines 5 and 14 of the fragment)
the string (days in French) under <div class="ue-alarm-dow">
I believe that for the days it is enough to repeat a find() or find_all(). I mention this because, while it grabs the right information, I am not sure this is the right way to parse the file with BeautifulSoup (but at least it works):
for y in x:
    z = y.find("div", class_="ue-alarm-dow")
    print(z.text)

# output:
# Lu, Ma, Me, Je, Ve
# Sa
I do not know how to get to the hour, though. Is there a way to navigate the tree by path (in the sense that I know that the hour is under the second <div>, three <div> deep)? Or should I do it differently?
You can also rely on the allumé text and get the next sibling div element:
y.find('div', text=u'allumé').find_next_sibling('div').text
or, in a similar manner, relying on the class of the previous div:
y.find('div', class_='ue-alarm-edit').find_next_siblings('div')[1].text
or, using regular expressions (this requires import re):
y.find('div', text=re.compile(r'\d+:\d+')).text
or, just get the div by index:
y.find_all('div')[4].text
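Putting the pieces together, here is a minimal runnable sketch of the regex-based and class-based lookups combined (the fragment is trimmed to a single alarm, and the html.parser choice is my assumption; note that recent bs4 versions prefer string= over text=):

```python
import re
import bs4

# Trimmed-down version of the fragment above (one alarm is enough to demonstrate)
html = """
<div alarmid="f319e1fb" class="ue-alarm-status" playerid="43733">
  <div>
    <div class="ue-alarm-edit ue-link">Réveil 1: </div>
    <div>allumé</div>
    <div>7:00</div>
  </div>
  <div>
    <div class="ue-alarm-dow">Lu, Ma, Me, Je, Ve </div>
    <div class="ue-alarm-delete ue-link">Supprimer</div>
  </div>
</div>
"""

soup = bs4.BeautifulSoup(html, "html.parser")
alarms = []
for alarm in soup.find_all("div", class_="ue-alarm-status", playerid="43733"):
    # The hour is the only div whose text looks like H:MM
    hour = alarm.find("div", string=re.compile(r"\d+:\d+")).text.strip()
    days = alarm.find("div", class_="ue-alarm-dow").text.strip()
    alarms.append((hour, days))

print(alarms)  # [('7:00', 'Lu, Ma, Me, Je, Ve')]
```

This avoids relying on positional indexing, so it keeps working even if the markup gains or loses wrapper divs.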
Consider the following html snippet
<html>
.
.
.
<div>
<p> Hello </p>
<div>
<b>
Text1
</b>
<p>
This is a huge paragraph text
</p>
.
.
.
</div>
</div>
.
.
.
<div>
<i>
Text2
</i>
</div>
Let us say that I need to extract everything from Text1 to Text2, including the tags.
Using a few methods, I have been able to extract the tags of those two, i.e. their unique ID.
Essentially I have two lxml.etree elements, corresponding to the two tags I require.
How do I extract everything in between the two tags?
(One possible solution I can think of is to find the two tags' common ancestor, do an iterwalk(), start extracting at Element1, and stop at Element2. However, I'm not exactly sure how this would be done.)
Any solution would be appreciated.
Please note that I have already found the two tags that I require, and I'm not looking for solutions to find those tags (e.g. using XPath).
Edit: My desired output is
<b>
Text1
</b>
<p>
This is a huge paragraph text
</p>
.
.
.
</div>
</div>
.
.
.
<div>
<i>
Text2
</i>
Please note that I wouldn't mind the initial two <div> tags, but I do not want the Hello.
The same goes for the closing tags at the end. I'm mostly interested in the in-between contents.
Edit 2: I have extracted the Etree elements using complex xpath conditions, which was not feasible with other alternatives such as bs4, so any solution using the lxml elements would be appreciated :)
After review and questioning:
from essentials.tokening import CreateToken # This was imported just to generate a random string - pip install mknxgn_essentials
import bs4
HTML = """<html>
<div>
<div>
<div id="start">
Hello, My name is mark
</div>
</div>
</div>
<div>
This is in the middle
</div>
<div>
<div id="end">
This is the end
</div>
</div>
<div>
Do not include this.
</div>
</html>"""
RandomString = CreateToken(30, HTML)  # Generate a random string that could never occur on its own in the file; if it did occur, use something else
soup = bs4.BeautifulSoup(HTML, features="lxml") # Convert the text into soup
start_div = soup.find("div", attrs={"id": "start"}) #assuming you can find this element
start_div.insert_before(RandomString) # insert the random string before this element
end_div = soup.find("div", attrs={"id": "end"}) #again, i was assuming you can also find this element
end_div.insert_after(RandomString) # insert the random string after this element
print(str(soup).split(RandomString)[1]) # Get between both random strings
The output of this returns:
>>> <div id="start">
>>> Hello, My name is mark
>>> </div>
>>> </div>
>>> </div>
>>> <div>
>>> This is in the middle
>>> </div>
>>> <div>
>>> <div id="end">
>>> This is the end
>>> </div>
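The third-party token helper isn't essential: the standard library's uuid module can produce the marker string just as well. A sketch of the same trick (the HTML mirrors the structure above, compacted):

```python
import uuid
import bs4

HTML = """<html>
<div><div><div id="start">Hello, My name is mark</div></div></div>
<div>This is in the middle</div>
<div><div id="end">This is the end</div></div>
<div>Do not include this.</div>
</html>"""

# A random hex string is vanishingly unlikely to occur in the page
marker = uuid.uuid4().hex

soup = bs4.BeautifulSoup(HTML, "html.parser")
soup.find("div", attrs={"id": "start"}).insert_before(marker)
soup.find("div", attrs={"id": "end"}).insert_after(marker)

# Everything between the two markers, tags included
between = str(soup).split(marker)[1]
print(between)
```

As with the original answer, the result is a string slice of the serialized document, not a parse tree, so it can contain unbalanced tags by design.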
I'm getting my feet wet with BeautifulSoup and am hung up on scraping some particular info. The HTML looks like the following, for example:
<div class="row">
<div class="course-short clearfix">
<div class="course-meta col-sm-12">
<dl>
<dt>Language:</dt>
<dd>English</dd>
<dt>Author:</dt>
<dd>John Doe</dd>
<dt>Institution:</dt>
<dd>American University</dd>
</dl>
</div>
...
<div class="row">
<div class="course-short clearfix">
<div class="course-meta col-sm-12">
<dl>
<dt>Language:</dt>
<dd>English</dd>
<dt>Author:</dt>
<dd>John Doe, Jr.</dd>
<dt>Institution:</dt>
<dd>Another University</dd>
</dl>
</div>
...
Each page has about 10 <div class="row"> tags, each with the same <dt> and <dd> pattern (e.g., Language, Author, Institution).
I am trying to scrape the <dd>American University</dd> info, ultimately to create a loop so that I can get that info specific to each <div class="row"> tag.
I've managed the following so far:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://www.oeconsortium.org/courses/search/?search=statistics")
bsObj = BeautifulSoup(html.read(), "html.parser")
institutions = [x.text.strip() for x in bsObj.find_all('div', 'course-meta col-sm-12', 'dd')]
But that only gives me the following mess for each respective <div class="row"> :
Language:\n\n\t\t\t\t\t\t\t\t\t\t\tEnglish\t\t\t\t\t\t\t\t\t\nAuthor:\nJohn Doe\nInstitution:\nAmerican University\n
Language:\n\n\t\t\t\t\t\t\t\t\t\t\tEnglish\t\t\t\t\t\t\t\t\t\nAuthor:\nJohn Doe, Jr.\nInstitution:\nAnother University\n
...
(I know how to .strip(); that's not the issue.)
I can't figure out how to target that third <dd></dd> for each respective <div class="row">. I feel like it may be possible by also targeting the <dt>Institution:</dt> tag (which is "Institution" in every respective case), but I can't figure it out.
Any help is appreciated. I am trying to make a LOOP so that I can loop over, say, ten <div class="row"> instances and just pull out the info specific to the "Institution" <dd> tag.
Thank you!
I can't figure out how to target that third <dd></dd> for each respective <div class="row">
find_all will return a list of all occurrences, so you can just take the third element of the result. You may want to wrap it in try/except to handle a possible IndexError.
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all
I'd make a function out of this code to reuse it for different pages:
soup = BeautifulSoup(html.read(), "html.parser")
result = []
for meta in soup.find_all('div', {'class': 'course-meta'}):
    try:
        institution = meta.find_all('dd')[2].text.strip()
        result.append(institution)  # or whatever you like to store it
    except IndexError:
        pass
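Hard-coding index 2 breaks if the order of the dt/dd pairs ever changes. An alternative sketch (the HTML is a cut-down version of the snippet above) that anchors on the "Institution:" label instead:

```python
import bs4

html = """
<div class="course-meta col-sm-12">
  <dl>
    <dt>Language:</dt><dd>English</dd>
    <dt>Author:</dt><dd>John Doe</dd>
    <dt>Institution:</dt><dd>American University</dd>
  </dl>
</div>
"""

soup = bs4.BeautifulSoup(html, "html.parser")
institutions = []
for meta in soup.find_all("div", {"class": "course-meta"}):
    # Find the label, then take the <dd> that immediately follows it
    label = meta.find("dt", string="Institution:")
    if label:
        institutions.append(label.find_next_sibling("dd").text.strip())

print(institutions)  # ['American University']
```

find_next_sibling pairs each label with its value regardless of where the pair sits in the list.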
I have a problem. My aim is to parse the data up to a certain point, and then stop parsing.
<span itemprop="address">
Some address
</span>
<i class="fa fa-signal">
</i>
...
</p>
</div>
</div>
<div class="search_pagination" id="pagination">
<ul class="pagination">
</ul>
</div>
</div>
</div>
</div>
<div class="col-sm-3">
<div class="panel" itemscope="" itemtype="http://schema.org/WPSideBar">
<h2 class="heading_a" itemprop="name">
Top-10 today
</h2> <!-- a lot of tags after this point -->
I want to get all the values from <span itemprop="address"> (there are a lot of them earlier in the page), up to the "Top-10 today" heading.
You can actually let BeautifulSoup parse only the tags you are interested in via SoupStrainer:
from bs4 import BeautifulSoup, SoupStrainer
only_addresses = SoupStrainer("span", itemprop="address")
soup = BeautifulSoup(html_doc, "html.parser", parse_only=only_addresses)
If, though, you have some "addresses" before "Top-10 today" and some after, and you are only interested in those coming before it, you can write a custom search function:
def search_addresses(tag):
    return tag.name == "span" and tag.get("itemprop") == "address" and \
           tag.find_next("h2", text=lambda text: text and "Top-10 today" in text)

addresses = soup.find_all(search_addresses)
It does not look trivial, but the idea is simple: for every "address" we use find_next() to check whether the "Top-10 today" heading exists after it.
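A runnable sketch of that search function against a made-up miniature document with the same shape (note that recent bs4 versions use string= where older ones used text=):

```python
from bs4 import BeautifulSoup

html_doc = """
<span itemprop="address">First address</span>
<span itemprop="address">Second address</span>
<h2>Top-10 today</h2>
<span itemprop="address">This one comes after the heading</span>
"""

soup = BeautifulSoup(html_doc, "html.parser")

def search_addresses(tag):
    # Keep only address spans that still have the heading somewhere after them
    return (tag.name == "span"
            and tag.get("itemprop") == "address"
            and tag.find_next("h2", string=lambda t: t and "Top-10 today" in t) is not None)

addresses = [tag.get_text(strip=True) for tag in soup.find_all(search_addresses)]
print(addresses)  # ['First address', 'Second address']
```

The span after the heading is dropped because no matching h2 follows it, which is exactly the "stop at this point" behavior the question asks for.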
Given the following code:
<html>
<body>
<div class="category1" id="foo">
<div class="category2" id="bar">
<div class="category3">
</div>
<div class="category4">
<div class="category5"> test
</div>
</div>
</div>
</div>
</body>
</html>
How do I extract the word test from <div class="category5"> test using BeautifulSoup, i.e. how do I deal with nested divs? I tried to look this up on the Internet, but I didn't find any case that treats an easy-to-grasp example, so I set up this one. Thanks.
XPath would be the straightforward answer; however, it is not supported in BeautifulSoup.
Updated: with a BeautifulSoup solution
To do so, given that you know the class and the element (div in this case), you can use a for loop with attrs to get what you want:
from bs4 import BeautifulSoup
html = '''
<html>
<body>
<div class="category1" id="foo">
<div class="category2" id="bar">
<div class="category3">
</div>
<div class="category4">
<div class="category5"> test
</div>
</div>
</div>
</div>
</body>
</html>'''
content = BeautifulSoup(html, "html.parser")
for div in content.find_all('div', attrs={'class': 'category5'}):
    print(div.text)
test
I have no problem extracting the text from your HTML sample; as @MartijnPieters suggested, you will need to find out why your div element is missing.
Another update
You're missing lxml as a parser for BeautifulSoup; that's why None was returned: nothing was parsed to begin with. Installing lxml should solve your issue.
You may consider using lxml or similar, which supports XPath; dead easy if you ask me.
from lxml import etree
tree = etree.fromstring(html)  # or etree.parse from a source
tree.xpath('.//div[@class="category5"]/text()')
[' test\n ']
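For completeness, CSS selectors are a middle ground that bs4 does support (via select_one/select); a sketch against the same snippet:

```python
from bs4 import BeautifulSoup

html = """
<html>
<body>
<div class="category1" id="foo">
  <div class="category2" id="bar">
    <div class="category3"></div>
    <div class="category4">
      <div class="category5"> test</div>
    </div>
  </div>
</div>
</body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")
# select_one takes a CSS selector; nesting depth doesn't matter
word = soup.select_one("div.category5").get_text(strip=True)
print(word)  # test
```

This reads almost like the XPath version without requiring lxml as the parser.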
Consider the following:
<div id=hotlinklist>
  <a href="/foo1">Foo1</a>
  <div id=hotlink>
    <a href="/">Home</a>
  </div>
  <div id=hotlink>
    <a href="/extract">Extract</a>
  </div>
  <div id=hotlink>
    <a href="/sitemap">Sitemap</a>
  </div>
</div>
How would you go about taking out the Sitemap line with regex in Python?
<a href="/sitemap">Sitemap</a>
The following can be used to pull out the anchor tags:
'/<a(.*?)a>/i'
However, there are multiple anchor tags. Also, there are multiple hotlink divs, so we can't really use them either.
Don't use a regex. Use BeautifulSoup, an HTML parser.
from bs4 import BeautifulSoup

html = \
"""
<div id=hotlinklist>
  <a href="/foo1">Foo1</a>
  <div id=hotlink>
    <a href="/">Home</a>
  </div>
  <div id=hotlink>
    <a href="/extract">Extract</a>
  </div>
  <div id=hotlink>
    <a href="/sitemap">Sitemap</a>
  </div>
</div>"""
soup = BeautifulSoup(html, "html.parser")
soup.find_all("div", id="hotlink")[2].a
# <a href="/sitemap">Sitemap</a>
Parsing HTML with regular expressions is a bad idea!
Think about the following piece of HTML:
<a></a > <!-- legal HTML, but won't pass your regex -->
<a href="/sitemap">Sitemap</a><!-- proof that a>b iff ab>1 -->
There are many more such examples. Regular expressions are good for many things, but not for parsing HTML.
You should consider using Beautiful Soup, a Python HTML parser.
Anyhow, an ad-hoc solution using regex is:
import re

data = """
<div id=hotlinklist>
  <a href="/foo1">Foo1</a>
  <div id=hotlink>
    <a href="/">Home</a>
  </div>
  <div id=hotlink>
    <a href="/extract">Extract</a>
  </div>
  <div id=hotlink>
    <a href="/sitemap">Sitemap</a>
  </div>
</div>
"""
e = re.compile(r'<a *[^>]*>.*</a *>')
print(e.findall(data))
Output:
>>> e.findall(data)
['<a href="/foo1">Foo1</a>', '<a href="/">Home</a>', '<a href="/extract">Extract</a>', '<a href="/sitemap">Sitemap</a>']
In order to extract the contents of the tag line:
<a href="/sitemap">Sitemap</a>
... I would use:
>>> import re
>>> s = '''
<div id=hotlinklist>
  <a href="/foo1">Foo1</a>
  <div id=hotlink>
    <a href="/">Home</a>
  </div>
  <div id=hotlink>
    <a href="/extract">Extract</a>
  </div>
  <div id=hotlink>
    <a href="/sitemap">Sitemap</a>
  </div>
</div>'''
>>> m = re.compile(r'<a href="/sitemap">(.*?)</a>').search(s)
>>> m.group(1)
'Sitemap'
Use BeautifulSoup or lxml if you need to parse HTML.
Also, what is it that you really need to do? Find the last link? Find the third link? Find the link that points to /sitemap? It's unclear from your question. What do you need to do with the data?
If you really have to use regular expressions, have a look at findall.
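For instance, if the task really is "find the link that points to /sitemap", findall with capture groups keeps it to one line (the pattern and data below are illustrative; this only works for markup as regular as this snippet):

```python
import re

data = '''
<div id=hotlink>
  <a href="/sitemap">Sitemap</a>
</div>
'''

# With capture groups, findall returns one (href, text) tuple per anchor
links = re.findall(r'<a href="([^"]+)">([^<]+)</a>', data)
print(links)  # [('/sitemap', 'Sitemap')]
```

From there you can filter on the href, e.g. keep only the tuples whose first element is "/sitemap".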