In Python 3, how would you go about taking the string between header tags, for example printing Hello, world! out of <h1>Hello, world!</h1>:
import urllib
from urllib.request import urlopen
# example URL that includes an <h1> tag: http://www.hobo-web.co.uk/headers/
userAddress = input("Enter a website URL: ")
webPage = urllib.request.urlopen(userAddress)
list = []
while webPage != "":
    webPage.read()
    list.append()
You need an HTML Parser. For example, BeautifulSoup:
from bs4 import BeautifulSoup
soup = BeautifulSoup(webPage, "html.parser")
print(soup.find("h1").get_text(strip=True))
Demo:
>>> from urllib.request import urlopen
>>> from bs4 import BeautifulSoup
>>>
>>> url = "http://www.hobo-web.co.uk/headers/"
>>> webPage = urlopen(url)
>>>
>>> soup = BeautifulSoup(webPage, "html.parser")
>>> print(soup.find("h1").get_text(strip=True))
How To Use H1-H6 HTML Elements Properly
I'm not allowed to use any additional libraries, aside from what comes with python. Does python come with the ability to parse HTML, albeit in a less efficient way?
If you are, for some reason, not allowed to use third-party packages, you can use the built-in html.parser module. Some people also use regular expressions to parse HTML. That is not always a bad idea, but you have to be very careful with it; see:
RegEx match open tags except XHTML self-contained tags
HTMLParser is definitely your best friend for this kind of task.
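For example, here is a minimal sketch that uses only the standard library's html.parser to pull the text out of <h1> tags (the H1Extractor class name is just illustrative):
from html.parser import HTMLParser

class H1Extractor(HTMLParser):
    """Collect the text found inside <h1> tags."""
    def __init__(self):
        super().__init__()
        self.in_h1 = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_h1 = False

    def handle_data(self, data):
        if self.in_h1 and data.strip():
            self.headings.append(data.strip())

parser = H1Extractor()
parser.feed("<h1>Hello, world!</h1>")
print(parser.headings)  # ['Hello, world!']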
There are related questions which already exist and cover your needs.
I am using the lxml and requests modules, and just trying to parse the article from a website. I tried using find_all from BeautifulSoup but still came up empty
from lxml import html
import requests
page = requests.get('https://www.thehindu.com/news/national/karnataka/kumaraswamy-congress-leaders-meet-to-discuss-cabinet-reshuffle/article27283040.ece')
tree = html.fromstring(page.content)
article = tree.xpath('//div[@class="article"]/text()')
Once I print article, I get a list of ['\n','\n','\n','\n','\n'], rather than the body of the article. Where exactly am I going wrong?
I would use bs4 and the class name in a CSS selector with select_one:
import requests
from bs4 import BeautifulSoup as bs
page = requests.get('https://www.thehindu.com/news/national/karnataka/kumaraswamy-congress-leaders-meet-to-discuss-cabinet-reshuffle/article27283040.ece')
soup = bs(page.content, 'lxml')
print(soup.select_one('.article').text)
If you use
article = tree.xpath('//div[@class="article"]//text()')
you still get a list containing all the \n entries, but also the text, which you can handle with re.sub or conditional logic.
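For instance, here is a small sketch of that clean-up; the texts list stands in for a hypothetical result of the //text() query:
import re

# Hypothetical result of tree.xpath('//div[@class="article"]//text()')
texts = ['\n', 'First paragraph.', '\n', 'Second paragraph.', '\n']

# Drop whitespace-only items and normalise the internal whitespace of the rest.
cleaned = [re.sub(r'\s+', ' ', t).strip() for t in texts if t.strip()]
print(cleaned)  # ['First paragraph.', 'Second paragraph.']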
I'm parsing some HTML with Beautiful Soup 3, but it contains HTML entities which Beautiful Soup 3 doesn't automatically decode for me:
>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup("<p>&pound;682m</p>")
>>> text = soup.find("p").string
>>> print text
&pound;682m
How can I decode the HTML entities in text to get "£682m" instead of "&pound;682m"?
Python 3.4+
Use html.unescape():
import html
print(html.unescape('&pound;682m'))
FYI, html.parser.HTMLParser.unescape is deprecated; it was supposed to be removed in 3.5, but was left in by mistake. It was finally removed in Python 3.9.
Python 2.6-3.3
You can use HTMLParser.unescape() from the standard library:
For Python 2.6-2.7 it's in HTMLParser
For Python 3 it's in html.parser
>>> try:
...     # Python 2.6-2.7
...     from HTMLParser import HTMLParser
... except ImportError:
...     # Python 3
...     from html.parser import HTMLParser
...
>>> h = HTMLParser()
>>> print(h.unescape('&pound;682m'))
£682m
You can also use the six compatibility library to simplify the import:
>>> from six.moves.html_parser import HTMLParser
>>> h = HTMLParser()
>>> print(h.unescape('&pound;682m'))
£682m
Beautiful Soup handles entity conversion. In Beautiful Soup 3, you'll need to specify the convertEntities argument to the BeautifulSoup constructor (see the 'Entity Conversion' section of the archived docs). In Beautiful Soup 4, entities get decoded automatically.
Beautiful Soup 3
>>> from BeautifulSoup import BeautifulSoup
>>> BeautifulSoup("<p>&pound;682m</p>",
...               convertEntities=BeautifulSoup.HTML_ENTITIES)
<p>£682m</p>
Beautiful Soup 4
>>> from bs4 import BeautifulSoup
>>> BeautifulSoup("<p>&pound;682m</p>")
<html><body><p>£682m</p></body></html>
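As a quick check (a sketch based on the same snippet), the decoded text can then be read straight off the tag:
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>&pound;682m</p>", "html.parser")
print(soup.find("p").get_text())  # £682m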
You can use replace_entities from w3lib.html library
In [202]: from w3lib.html import replace_entities
In [203]: replace_entities("&pound;682m")
Out[203]: u'\xa3682m'
In [204]: print replace_entities("&pound;682m")
£682m
Beautiful Soup 4 allows you to set a formatter for your output:
If you pass in formatter=None, Beautiful Soup will not modify strings
at all on output. This is the fastest option, but it may lead to
Beautiful Soup generating invalid HTML/XML, as in these examples:
french = "<p>Il a dit &lt;&lt;Sacré bleu!&gt;&gt;</p>"
soup = BeautifulSoup(french)
print(soup.prettify(formatter=None))
# <html>
#  <body>
#   <p>
#    Il a dit <<Sacré bleu!>>
#   </p>
#  </body>
# </html>
link_soup = BeautifulSoup('<a href="http://example.com/?foo=val1&bar=val2">A link</a>')
print(link_soup.a.encode(formatter=None))
# <a href="http://example.com/?foo=val1&bar=val2">A link</a>
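For contrast, a small sketch (assuming the link_soup from the example above): with the default "minimal" formatter, the ampersand is re-escaped on output:
print(link_soup.a.encode(formatter="minimal"))  # encode() returns bytes
# <a href="http://example.com/?foo=val1&amp;bar=val2">A link</a>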
I had a similar encoding issue. I used the normalize() method. I was getting a Unicode error using the pandas .to_html() method when exporting my data frame to an .html file in another directory. I ended up doing this and it worked...
import unicodedata
The dataframe object can be whatever you like, let's call it table...
table = pd.DataFrame(data,columns=['Name','Team','OVR / POT'])
table.index+= 1
Encode the table data so that we can export it to our .html file in the templates folder (this can be whatever location you wish):
# this is where the magic happens
html_data = unicodedata.normalize('NFKD', table.to_html()).encode('ascii', 'ignore')
Export the normalized string to the html file:
file = open("templates/home.html", "wb")  # "wb" because html_data is bytes after .encode()
file.write(html_data)
file.close()
Reference: unicodedata documentation
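If stripping non-ASCII characters is not desirable, a simpler sketch (assuming Python 3 and the same table as above) is to write the file as UTF-8 instead:
# Write the rendered HTML as UTF-8 so characters such as £ survive intact.
with open("templates/home.html", "w", encoding="utf-8") as f:
    f.write(table.to_html())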
This probably isn't relevant here, but to eliminate these HTML entities from an entire document, you can do something like this (assume document = page, and please forgive the sloppy code; if you have ideas as to how to make it better, I'm all ears, I'm new to this):
import re
import HTMLParser
regexp = "&.+?;"
list_of_html = re.findall(regexp, page)  # finds all HTML entities in the page
for e in list_of_html:
    h = HTMLParser.HTMLParser()
    unescaped = h.unescape(e)  # finds the unescaped value of the html entity
    page = page.replace(e, unescaped)  # replaces html entity with unescaped value
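On Python 3.4+, the same whole-document clean-up can be done in a single call (a sketch, assuming page is a str):
import html

page = html.unescape(page)  # decodes every HTML entity in the document at once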
I'm trying to parse this site and for reasons I can't understand, nothing is happening.
url = 'http://www.zap.com.br/imoveis/rio-de-janeiro+rio-de-janeiro/apartamento-padrao/venda/'
response = urllib2.urlopen(url).read()
doc = BeautifulSoup(response)
divs = doc.findAll('div')
print len(divs) # prints 0.
This site has real-estate ads in Rio de Janeiro, Brazil. I can't find anything in the HTML source that would prevent BeautifulSoup from working. Could it be the size?
I'm using Enthought Canopy Python 2.7.6, IPython Notebook 2.0, Beautifulsoup 4.3.2.
This is because you are letting BeautifulSoup choose the most suitable parser for you, and that really depends on what modules are installed in your Python environment.
According to the documentation:
The first argument to the BeautifulSoup constructor is a string or an
open filehandle–the markup you want parsed. The second argument is how
you’d like the markup parsed.
If you don’t specify anything, you’ll get the best HTML parser that’s
installed. Beautiful Soup ranks lxml’s parser as being the best, then
html5lib’s, then Python’s built-in parser.
So, different parsers - different results:
>>> from bs4 import BeautifulSoup
>>> url = 'http://www.zap.com.br/imoveis/rio-de-janeiro+rio-de-janeiro/apartamento-padrao/venda/'
>>> import urllib2
>>> response = urllib2.urlopen(url).read()
>>> len(BeautifulSoup(response, 'lxml').find_all('div'))
558
>>> len(BeautifulSoup(response, 'html.parser').find_all('div'))
558
>>> len(BeautifulSoup(response, 'html5lib').find_all('div'))
0
The solution for you would be to explicitly specify a parser that can handle this particular page; as the output above shows, lxml and the built-in html.parser both work here, so you may need to install lxml or simply pass 'html.parser'.
Also see: Differences between parsers.
Something is wrong with your environment. Here is the output I get:
>>> url = 'http://www.zap.com.br/imoveis/rio-de-janeiro+rio-de-janeiro/apartamento-padrao/venda/'
>>> from bs4 import BeautifulSoup
>>> import urllib2
>>> response = urllib2.urlopen(url).read()
>>> doc = BeautifulSoup(response)
>>> divs = doc.findAll('div')
>>> print len(divs)
558
I am using BeautifulSoup to scrape a URL, and I have the following code to find the td tag whose class is 'empformbody':
import urllib
import urllib2
from BeautifulSoup import BeautifulSoup
url = "http://www.example.com/servlet/av/ResultTemplate=AVResult.html"
req = urllib2.Request(url)
response = urllib2.urlopen(req)
the_page = response.read()
soup = BeautifulSoup(the_page)
soup.findAll('td',attrs={'class':'empformbody'})
Now in the above code we can use findAll to get tags and information related to them, but I want to use XPath. Is it possible to use XPath with BeautifulSoup? If possible, please provide me example code.
Nope, BeautifulSoup, by itself, does not support XPath expressions.
An alternative library, lxml, does support XPath 1.0. It has a BeautifulSoup compatible mode where it'll try and parse broken HTML the way Soup does. However, the default lxml HTML parser does just as good a job of parsing broken HTML, and I believe is faster.
Once you've parsed your document into an lxml tree, you can use the .xpath() method to search for elements.
try:
    # Python 2
    from urllib2 import urlopen
except ImportError:
    from urllib.request import urlopen
from lxml import etree
url = "http://www.example.com/servlet/av/ResultTemplate=AVResult.html"
response = urlopen(url)
htmlparser = etree.HTMLParser()
tree = etree.parse(response, htmlparser)
tree.xpath(xpathselector)
There is also a dedicated lxml.html module with additional functionality.
Note that in the above example I passed the response object directly to lxml, as having the parser read directly from the stream is more efficient than reading the response into a large string first. To do the same with the requests library, you want to set stream=True and pass in the response.raw object after enabling transparent transport decompression:
import lxml.html
import requests
url = "http://www.example.com/servlet/av/ResultTemplate=AVResult.html"
response = requests.get(url, stream=True)
response.raw.decode_content = True
tree = lxml.html.parse(response.raw)
Of possible interest to you is the CSS Selector support; the CSSSelector class translates CSS statements into XPath expressions, making your search for td.empformbody that much easier:
from lxml.cssselect import CSSSelector
td_empformbody = CSSSelector('td.empformbody')
for elem in td_empformbody(tree):
    # Do something with these table cells.
Coming full circle: BeautifulSoup itself does have very complete CSS selector support:
for cell in soup.select('table#foobar td.empformbody'):
    # Do something with these table cells.
I can confirm that there is no XPath support within Beautiful Soup.
As others have said, BeautifulSoup doesn't have xpath support. There are probably a number of ways to get something from an xpath, including using Selenium. However, here's a solution that works in either Python 2 or 3:
from lxml import html
import requests
page = requests.get('http://econpy.pythonanywhere.com/ex/001.html')
tree = html.fromstring(page.content)
#This will create a list of buyers:
buyers = tree.xpath('//div[@title="buyer-name"]/text()')
#This will create a list of prices
prices = tree.xpath('//span[@class="item-price"]/text()')
print('Buyers: ', buyers)
print('Prices: ', prices)
I used this as a reference.
BeautifulSoup has a function named findNext, which searches forward from the current element, so:
father.findNext('div',{'class':'class_value'}).findNext('div',{'id':'id_value'}).findAll('a')
The above code can imitate the following XPath:
div[class=class_value]/div[id=id_value]
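As a self-contained sketch of that chain (the HTML snippet and values are made up for illustration):
from bs4 import BeautifulSoup

html_doc = '''
<div class="class_value">
  <div id="id_value">
    <a href="/first">first</a>
    <a href="/second">second</a>
  </div>
</div>
'''
soup = BeautifulSoup(html_doc, "html.parser")
father = soup.find('div', {'class': 'class_value'})
links = father.findNext('div', {'id': 'id_value'}).findAll('a')
print([a['href'] for a in links])  # ['/first', '/second']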
from lxml import etree
from bs4 import BeautifulSoup
soup = BeautifulSoup(open('path of your localfile.html'),'html.parser')
dom = etree.HTML(str(soup))
print(dom.xpath('//*[@id="BGINP01_S1"]/section/div/font/text()'))
The above combines the Soup object with lxml, and one can then extract values using XPath.
When you use lxml, it is all simple:
tree = lxml.html.fromstring(html)
i_need_element = tree.xpath('//a[@class="shared-components"]/@href')
But when you use BeautifulSoup (bs4), it is all simple too:
first, remove "//" and "@"
second, add a star before "="
try this magic:
soup = BeautifulSoup(html, "lxml")
i_need_element = soup.select('a[class*="shared-components"]')
As you see, this does not support sub-tags, so I removed the "/@href" part.
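If you still need the href values after selecting, here is a small self-contained sketch (the HTML snippet below is made up for illustration):
from bs4 import BeautifulSoup

html = '<a class="shared-components primary" href="/docs">Docs</a>'
soup = BeautifulSoup(html, "html.parser")
# Pull the href attribute off every tag matched by the CSS attribute selector.
hrefs = [a.get('href') for a in soup.select('a[class*="shared-components"]')]
print(hrefs)  # ['/docs']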
I've searched through their docs and it seems there is no XPath option.
Also, as you can see here on a similar question on SO, the OP is asking for a translation from XPath to BeautifulSoup, so my conclusion would be: no, there is no XPath parsing available.
Maybe you can try the following without XPath
from simplified_scrapy.simplified_doc import SimplifiedDoc
html = '''
<html>
<body>
<div>
<h1>Example Domain</h1>
<p>This domain is for use in illustrative examples in documents. You may use this
domain in literature without prior coordination or asking for permission.</p>
<p>More information...</p>
</div>
</body>
</html>
'''
# What XPath can do, so can it
doc = SimplifiedDoc(html)
# The result is the same as doc.getElementByTag('body').getElementByTag('div').getElementByTag('h1').text
print (doc.body.div.h1.text)
print (doc.div.h1.text)
print (doc.h1.text) # Shorter paths will be faster
print (doc.div.getChildren())
print (doc.div.getChildren('p'))
This is a pretty old thread, but there is a work-around solution now, which may not have been in BeautifulSoup at the time.
Here is an example of what I did. I use the "requests" module to read an RSS feed and get its text content in a variable called "rss_text". With that, I run it through BeautifulSoup, search for the XPath /rss/channel/title, and retrieve its contents. It's not exactly XPath in all its glory (wildcards, multiple paths, etc.), but if you just have a basic path you want to locate, this works.
from bs4 import BeautifulSoup
rss_obj = BeautifulSoup(rss_text, 'xml')
cls.title = rss_obj.rss.channel.title.get_text()
use soup.find(class_='myclass')
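A minimal sketch of that call (the HTML snippet is just an illustration):
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div class="myclass">Hello, world!</div>', 'html.parser')
print(soup.find(class_='myclass').get_text())  # Hello, world!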
Is there any Python library that allows me to parse an HTML document similar to what jQuery does?
i.e. I'd like to be able to use CSS selectors syntax to grab an arbitrary set of nodes from the document, read their content/attributes, etc.
The only Python HTML parsing lib I've used before was BeautifulSoup, and even though it's fine I keep thinking it would be faster to do my parsing if I had jQuery syntax available. :D
If you are fluent with BeautifulSoup, you could just add soupselect to your libs.
Soupselect is a CSS selector extension for BeautifulSoup.
Usage:
from bs4 import BeautifulSoup as Soup
from soupselect import select
import urllib
soup = Soup(urllib.urlopen('http://slashdot.org/'))
select(soup, 'div.title h3')
[<h3><span><a href='//science.slashdot.org/'>Science</a>:</span></h3>,
<h3><a href='//slashdot.org/articles/07/02/28/0120220.shtml'>Star Trek</h3>,
..]
Consider PyQuery:
http://packages.python.org/pyquery/
>>> from pyquery import PyQuery as pq
>>> from lxml import etree
>>> import urllib
>>> d = pq("<html></html>")
>>> d = pq(etree.fromstring("<html></html>"))
>>> d = pq(url='http://google.com/')
>>> d = pq(url='http://google.com/', opener=lambda url: urllib.urlopen(url).read())
>>> d = pq(filename=path_to_html_file)
>>> d("#hello")
[<p#hello.hello>]
>>> p = d("#hello")
>>> p.html()
'Hello world !'
>>> p.html("you know <a href='http://python.org/'>Python</a> rocks")
[<p#hello.hello>]
>>> p.html()
u'you know Python rocks'
>>> p.text()
'you know Python rocks'
The lxml library supports CSS selectors.
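For example, here is a small sketch using lxml's cssselect support; note that it relies on the separate cssselect package being installed, and the snippet is illustrative:
from lxml import html

doc = html.fromstring('<div class="title"><h3><a href="#">Science</a></h3></div>')
for h3 in doc.cssselect('div.title h3'):  # CSS selector, translated to XPath internally
    print(h3.text_content())  # Science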
BeautifulSoup now has support for CSS selectors:
import requests
from bs4 import BeautifulSoup as Soup
html = requests.get('https://stackoverflow.com/questions/3051295').content
soup = Soup(html)
Title of this question
soup.select('h1.grid--cell :first-child')[0].text
Number of question upvotes
# first item
soup.select_one('[itemprop="upvoteCount"]').text
This uses Python Requests to get the HTML page.