Consider the following:
<div id=hotlinklist>
Foo1
<div id=hotlink>
Home
</div>
<div id=hotlink>
Extract
</div>
<div id=hotlink>
Sitemap
</div>
</div>
How would you go about extracting the Sitemap line with a regex in Python?
Sitemap
The following can be used to pull out the anchor tags.
'/<a(.*?)a>/i'
However, there are multiple anchor tags. Also there are multiple hotlink(s) so we can't really use them either?
Don't use a regex. Use BeautifulSoup, an HTML parser.
from BeautifulSoup import BeautifulSoup
html = \
"""
<div id=hotlinklist>
Foo1
<div id=hotlink>
Home
</div>
<div id=hotlink>
Extract
</div>
<div id=hotlink>
Sitemap
</div>
</div>"""
soup = BeautifulSoup(html)
soup.findAll("div",id="hotlink")[2].a
# Sitemap
Parsing HTML with regular expressions is a bad idea!
Think about the following piece of html
<a></a > <!-- legal html, but won't pass your regex -->
Sitemap<!-- proof that a>b iff ab>1 -->
There are many more such examples. Regular expressions are good for many things, but not for parsing HTML.
You should consider using Beautiful Soup, an HTML parser for Python.
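To illustrate why a parser is more forgiving than a pattern, here is a minimal sketch using the standard library's html.parser instead of Beautiful Soup. The sample markup and its href values are assumptions made up for illustration (the original anchors are not shown), and AnchorTextCollector is a hypothetical helper name. Note the closing tag written as `</a >`, which a naive regex would miss.

```python
from html.parser import HTMLParser


class AnchorTextCollector(HTMLParser):
    """Collect the text found inside every <a> element."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # greater than 0 while inside an <a> element
        self.texts = []   # one entry per anchor, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.depth += 1
            self.texts.append("")

    def handle_endtag(self, tag):
        if tag == "a" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.texts[-1] += data


# Hypothetical markup: the href values are assumptions for illustration.
sample = """
<div id=hotlinklist>
  <a href="/foo1">Foo1</a>
  <div id=hotlink><a href="/">Home</a></div>
  <div id=hotlink><a href="/extract">Extract</a></div>
  <div id=hotlink><a href="/sitemap">Sitemap</a ></div>
</div>
"""

collector = AnchorTextCollector()
collector.feed(sample)
anchor_texts = [text.strip() for text in collector.texts]
print(anchor_texts[-1])  # text of the last anchor
```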
Anyhow, an ad-hoc solution using regex is
import re
data = """
<div id=hotlinklist>
Foo1
<div id=hotlink>
Home
</div>
<div id=hotlink>
Extract
</div>
<div id=hotlink>
Sitemap
</div>
</div>
"""
e = re.compile('<a *[^>]*>.*</a *>')
print(e.findall(data))
Output:
>>> e.findall(data)
['Foo1', 'Home', 'Extract', 'Sitemap']
In order to extract the contents of the tagline:
Sitemap
... I would use:
>>> import re
>>> s = '''
<div id=hotlinklist>
Foo1
<div id=hotlink>
Home
</div>
<div id=hotlink>
Extract
</div>
<div id=hotlink>
Sitemap
</div>
</div>'''
>>> m = re.compile(r'(.*?)').search(s)
>>> m.group(1)
'Sitemap'
Use BeautifulSoup or lxml if you need to parse HTML.
Also, what is it that you really need to do? Find the last link? Find the third link? Find the link that points to /sitemap? It's unclear from your question. What do you need to do with the data?
If you really have to use regular expressions, have a look at findall.
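A minimal sketch of the findall approach, assuming the anchors look like the hypothetical markup below (the href values are invented for illustration). It returns the text of every link in document order, so picking "the last link" or "the third link" becomes plain list indexing. It will still break on nested tags, comments, or attribute values containing `>`.

```python
import re

# Hypothetical markup: the href values are assumptions for illustration.
sample = """
<div id=hotlinklist>
  <a href="/foo1">Foo1</a>
  <div id=hotlink><a href="/">Home</a></div>
  <div id=hotlink><a href="/extract">Extract</a></div>
  <div id=hotlink><a href="/sitemap">Sitemap</a></div>
</div>
"""

# Capture whatever sits between an opening <a ...> and the next closing </a>.
link_pattern = re.compile(r"<a\b[^>]*>(.*?)</a\s*>", re.IGNORECASE | re.DOTALL)

link_texts = link_pattern.findall(sample)
print(link_texts)      # every anchor text, in document order
print(link_texts[-1])  # the last link
```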
Here is the HTML code:
<div class="rendering rendering_person rendering_short rendering_person_short">
<h3 class="title">
<a rel="Person" href="https://moh-it.pure.elsevier.com/en/persons/massimo-eraldo-abate" class="link person"><span>Massimo Eraldo Abate</span></a>
</h3>
<ul class="relations email">
<li class="email"><span>massimo.abate@ior.it</span></li>
</ul>
<p class="type"><span class="family">Person: </span>Academic</p>
</div>
How can I extract "Massimo Eraldo Abate" from the code above?
Please help me.
You can extract the name using
response.xpath('//h3[@class="title"]/a/span/text()').extract_first()
Also, see this Scrapinghub blog post for an introduction to XPath.
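If you want to try the same path outside a Scrapy spider, here is a rough sketch using the standard library's xml.etree.ElementTree, which supports only a small XPath subset. It assumes the snippet is well-formed markup (true for the fragment in the question, not for arbitrary HTML), and it is a stand-in for experimenting, not the Scrapy API.

```python
import xml.etree.ElementTree as ET

# The relevant part of the fragment from the question, assumed well-formed.
snippet = """
<div class="rendering rendering_person rendering_short rendering_person_short">
  <h3 class="title">
    <a rel="Person" href="https://moh-it.pure.elsevier.com/en/persons/massimo-eraldo-abate" class="link person"><span>Massimo Eraldo Abate</span></a>
  </h3>
</div>
"""

root = ET.fromstring(snippet)

# Same idea as the XPath above: h3 with class "title", then a, then span.
name = root.findtext(".//h3[@class='title']/a/span")
print(name)
```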
Please take a look at the Scrapy docs. There are lots of ways of extracting text:
>>> from scrapy.selector import Selector
>>> from scrapy.http import HtmlResponse
>>> body = '<html><body><span>good</span></body></html>'
>>> Selector(text=body).xpath('//span/text()').extract()
>>> response = HtmlResponse(url='http://example.com', body=body, encoding='utf-8')
>>> Selector(response=response).xpath('//span/text()').extract()
I am trying to parse an awful HTML page with BeautifulSoup to retrieve a few pieces of information. The code below:
import bs4
with open("smartradio.html") as f:
    html = f.read()
soup = bs4.BeautifulSoup(html)
x = soup.find_all("div", class_="ue-alarm-status", playerid="43733")
print(x)
extracts the fragments I would like to analyze further:
[<div alarmid="f319e1fb" class="ue-alarm-status" playerid="43733">
<div>
<div class="ue-alarm-edit ue-link">Réveil 1: </div>
<div>allumé</div>
<div>7:00</div>
</div>
<div>
<div class="ue-alarm-dow">Lu, Ma, Me, Je, Ve </div>
<div class="ue-alarm-delete ue-link">Supprimer</div>
</div>
</div>, <div alarmid="ea510709" class="ue-alarm-status" playerid="43733">
<div>
<div class="ue-alarm-edit ue-link">Réveil 2: </div>
<div>allumé</div>
<div>7:30</div>
</div>
<div>
<div class="ue-alarm-dow">Sa </div>
<div class="ue-alarm-delete ue-link">Supprimer</div>
</div>
</div>]
I am interested in retrieving:
the hour (lines 5 and 15)
the string (days in French) under <div class="ue-alarm-dow">
I believe that for the days it is enough to repeat a find() or find_all(). I am mentioning that because while it grabs the right information, I am not sure that this is the right way to parse the file with BeautifulSoup (but at least it works):
for y in x:
    z = y.find("div", class_="ue-alarm-dow")
    print(z.text)
# output:
# Lu, Ma, Me, Je, Ve
# Sa
I do not know how to get to the hour, though. Is there a way to navigate the tree by path (in the sense that I know that the hour is under the second <div>, three <div> deep)? Or should I do it differently?
You can also rely on the allumé text and get the next sibling div element:
y.find('div', text=u'allumé').find_next_sibling('div').text
or, in a similar manner, relying on the class of the previous div:
y.find('div', class_='ue-alarm-edit').find_next_siblings('div')[1].text
or, using regular expressions:
y.find('div', text=re.compile(r'\d+:\d+')).text
or, just get the div by index:
y.find_all('div')[4].text
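On the "navigate by path" part of the question: BeautifulSoup has no path syntax of its own, but the idea can be sketched with the standard library's xml.etree.ElementTree, which accepts positional steps. This assumes the fragment is well-formed (as the printed output above is), which may not hold for the whole page, so treat it as an illustration of path navigation and not as a drop-in replacement.

```python
import xml.etree.ElementTree as ET

# First alarm fragment from the question, assumed to be well-formed markup.
fragment = """
<div alarmid="f319e1fb" class="ue-alarm-status" playerid="43733">
  <div>
    <div class="ue-alarm-edit ue-link">Réveil 1: </div>
    <div>allumé</div>
    <div>7:00</div>
  </div>
  <div>
    <div class="ue-alarm-dow">Lu, Ma, Me, Je, Ve </div>
    <div class="ue-alarm-delete ue-link">Supprimer</div>
  </div>
</div>
"""

alarm = ET.fromstring(fragment)

# Path navigation: first child div, then its third child div (1-based positions).
hour = alarm.findtext("div[1]/div[3]")

# Attribute lookup for the days, mirroring find("div", class_="ue-alarm-dow").
days = alarm.findtext(".//div[@class='ue-alarm-dow']").strip()

print(hour, days)
```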
Given the following code:
<html>
<body>
<div class="category1" id="foo">
<div class="category2" id="bar">
<div class="category3">
</div>
<div class="category4">
<div class="category5"> test
</div>
</div>
</div>
</div>
</body>
</html>
How can I extract the word test from <div class="category5"> test using BeautifulSoup, i.e. how do I deal with nested divs? I searched the Internet but did not find an easy-to-grasp example, so I set up this one. Thanks.
XPath would be the straightforward answer; however, it is not supported in BeautifulSoup.
Updated: with a BeautifulSoup solution
Given that you know the class and the element (div in this case), you can use a for loop with attrs to get what you want:
from bs4 import BeautifulSoup
html = '''
<html>
<body>
<div class="category1" id="foo">
<div class="category2" id="bar">
<div class="category3">
</div>
<div class="category4">
<div class="category5"> test
</div>
</div>
</div>
</div>
</body>
</html>'''
content = BeautifulSoup(html)
for div in content.findAll('div', attrs={'class':'category5'}):
    print(div.text)
test
I have no problem extracting the text from your HTML sample. As @MartijnPieters suggested, you will need to find out why your div element is missing.
Another update
You are missing lxml as a parser for BeautifulSoup, which is why None was returned: nothing was parsed to start with. Installing lxml should solve your issue.
You may consider using lxml or similar which supports xpath, dead easy if you ask me.
from lxml import etree
tree = etree.fromstring(html) # or etree.parse from source
tree.xpath('.//div[@class="category5"]/text()')
[' test\n ']
I have the following given html structure
<li class="g">
<div class="vsc">
<div class="alpha"></div>
<div class="beta"></div>
<h3 class="r">
</h3>
</div>
</li>
The above html structure keeps repeating, what can be the easiest way to parse all the links(stackoverflow.com) from the above html structure using BeautifulSoup and Python?
BeautifulSoup 4 offers a convenient way of accomplishing this, using CSS selectors:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html)
print([a["href"] for a in soup.select('h3.r a')])
This also has the advantage of constraining the selection by context: it selects only those anchor nodes that are descendants of an h3 node with class r.
Omitting the constraint or choosing one most suitable for the need is easy by just tweaking the selector; see the CSS selector docs for that.
Using CSS selectors as proposed by Petri is probably the best way to do it with BS. However, I can't resist recommending lxml.html and XPath, which are pretty much perfect for the job.
Test html:
html="""
<html>
<li class="g">
<div class="vsc"></div>
<div class="alpha"></div>
<div class="beta"></div>
<h3 class="r">
</h3>
</li>
<li class="g">
<div class="vsc"></div>
<div class="alpha"></div>
<div class="beta"></div>
<h3 class="r">
</h3>
</li>
<li class="g">
<div class="vsc"></div>
<div class="gamma"></div>
<div class="beta"></div>
<h3 class="r">
</h3>
</li>
</html>"""
and it's basically a oneliner:
import lxml.html as lh
doc=lh.fromstring(html)
doc.xpath('.//li[@class="g"][div/@class = "vsc"][div/@class = "alpha"][div/@class = "beta"][h3/@class = "r"]/h3/a/@href')
Out[264]:
['http://www.correct.com', 'http://www.correct.com']
If the HTML code looks like this:
<div class="div1">
<p>hello</p>
<p>hi</p>
<div class="nesteddiv">
<p>one</p>
<p>two</p>
<p>three</p>
</div>
</div>
How do I extract just
<div class="div1">
<p>hello</p>
<p>hi</p>
</div>
I already tried parser.find('div', 'div1') but I'm getting the whole div including the nested one.
You actually want to extract() the nested div from the document and then get the first div. Here is an example (where html is the HTML you provided in the question):
>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup(html)
>>> soup.div.div.extract()
<div class="nesteddiv">
<p>one</p>
<p>two</p>
<p>three</p>
</div>
>>> soup.div
<div class="div1">
<p>hello</p>
<p>hi</p>
</div>
Why not just find() the nested div and then remove it from the tree using extract()?
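The same find-then-remove idea can be sketched without BeautifulSoup, using the standard library's xml.etree.ElementTree. This assumes the markup is well-formed, as it is in the question. Element.remove() plays the role that extract() plays in BeautifulSoup here, except that it does not return the removed node.

```python
import xml.etree.ElementTree as ET

# Markup from the question, assumed to be well-formed.
html = """<div class="div1">
<p>hello</p>
<p>hi</p>
<div class="nesteddiv">
<p>one</p>
<p>two</p>
<p>three</p>
</div>
</div>"""

outer = ET.fromstring(html)

# Look only among direct children for the nested div.
nested = outer.find("div[@class='nesteddiv']")
if nested is not None:
    # Removing the element also drops the whitespace that followed it.
    outer.remove(nested)

remaining = ET.tostring(outer, encoding="unicode")
print(remaining)
```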