I want to extract some text from a webpage. I searched Stack Overflow (as well as other sites) for a proper method. I tried HTML2TEXT, BeautifulSoup, NLTK and some manual methods, and each of them failed. For example:
HTML2TEXT only works offline (on saved pages), and I need to do it online.
BS4 won't handle Unicode properly (my page is UTF-8-encoded Persian) and won't extract the text; it also returns HTML tags/code, and I only need the rendered text.
NLTK won't work on my Persian text.
Even opening my page with urllib.request.urlopen raises errors.
So, as you can see, I'm stuck after trying several methods.
Here's my target URL: http://vynylyn.yolasite.com/page2.php
I want to extract only the Persian paragraphs, without tags or code.
(Note: I use Eclipse Kepler with Python 3.4. After extracting the text, I want to do POS tagging, word/sentence tokenizing, etc. on it.)
What are my options to get this working?
I'd go for your second option first. BeautifulSoup 4 definitely supports Unicode (and note that UTF-8 is a universal character encoding, so there's nothing specifically Persian about it).
And yes, you will get tags, because it's an HTML page. Try searching for a unique ID, or look at the HTML structure of the page(s). For your example, look for the main element and the content elements below it, or use div#I1_sys_txt on that specific page. Once you have your element, just call get_text().
Try this (now in Python 3):
#!/usr/bin/env python3
import requests
from bs4 import BeautifulSoup

# Download the page and parse the returned HTML
content = requests.get('http://vynylyn.yolasite.com/page2.php')
soup = BeautifulSoup(content.text, 'html.parser')

# The Persian paragraphs on that page sit inside div#I1_sys_txt;
# get_text() strips the markup and returns only the rendered text.
tag = soup.find('div', id='I1_sys_txt')
print(tag.get_text() if tag else "<none found>")
Is there a difference between the capabilities of the lxml and html5lib parsers in the context of BeautifulSoup? I am trying to learn to use BS4 with the following code:
ret = requests.get('http://www.olivegarden.com')
soup = BeautifulSoup(ret.text, 'html5lib')
for item in soup.find_all('a'):
    print item['href']
I started out using lxml as the parser, but noticed that for some websites the for loop is never entered even though there are valid links in the page. The same page works with the html5lib parser. Are there any specific types of pages that might not work with lxml?
I am on Ubuntu using python-lxml 2.3.2-1 with libxml2 2.7.8.dfsg-5.1ubunt and html5lib-1.0b3.
EDIT: I updated to lxml 3.1.2 and still see the same issue. On a Mac running 3.0.x, though, the same page is parsed properly. The website in question is www.olivegarden.com.
html5lib uses the HTML parsing algorithm defined in the HTML spec and implemented in all major browsers. lxml uses libxml2's HTML parser, which is ultimately based on its XML parser and does not follow the error handling for invalid HTML that is used anywhere else.
Most web developers only test with web browsers, standards be damned, so if you want to get what the page's author intended, you'll likely need to use something like html5lib that matches current browsers.
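As a rough illustration (a sketch of my own, not the answerer's code, and it assumes both the lxml and html5lib parsers are installed), you can feed the same piece of broken markup to BeautifulSoup with each backend and compare the results:
from bs4 import BeautifulSoup

# Deliberately invalid HTML: unclosed <a> and <div> tags.
broken = "<div><a href='/one'>one<a href='/two'>two"

for parser in ('lxml', 'html5lib'):
    soup = BeautifulSoup(broken, parser)
    print(parser, [a['href'] for a in soup.find_all('a')])
On mildly broken input like this both backends usually recover the links; the differences described above tend to show up on more seriously malformed pages, where html5lib follows the recovery rules browsers use and libxml2 may drop parts of the tree.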
I want to parse HTML code in Python and have already tried Beautiful Soup and pyquery. The problem is that those parsers modify the original code, e.g. they insert tags. Is there any parser out there that does not change the code?
I tried HTMLParser, but with no success. :(
It doesn't modify the code and just tells me where tags are placed, but it fails to parse web pages like mail.live.com.
Any idea how to parse a web page just like a browser?
You can use BeautifulSoup to extract just the text without modifying the tags. It's in their documentation.
Same question here:
How to extract text from beautiful soup
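As a minimal illustration (my own example, not taken from the linked answer), get_text() gives you only the rendered text, and the string you parsed is left untouched:
from bs4 import BeautifulSoup

html = "<p>Hello <b>world</b>!</p>"
soup = BeautifulSoup(html, "html.parser")

print(soup.get_text())   # Hello world!
print(html)              # the original string is not modified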
No, at the moment there is no such HTML parser; every parser has its own limitations.
Have you tried the webkit engine with Python bindings?
See this: https://github.com/niwibe/phantompy
You can traverse the real DOM of the parsed web page and do what you need to do.
How do I extract the links for JavaScript, CSS, and img tags from an HTML page? Do I need to use regular expressions, or is there already some lightweight library for HTML parsing?
HTML5Lib in combination with lxml is what I like to use to extract data from HTML documents. It recovers from errors in a similar way to modern browsers, which makes broken HTML easier to work with.
If you actually want to run JS code in web pages (say a link is calculated by a function), you should consider looking at the webkit and jswebkit packages, which let you run JavaScript in a headless WebKit window so your Python parser can examine dynamically generated content.
It's really not hard at all to run JS in Python via WebKit, though expect memory usage on par with running a WebKit browser.
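Here is a small sketch of that HTML5Lib plus lxml combination (my own code, assuming current html5lib option names such as treebuilder and namespaceHTMLElements, which the answer does not spell out):
import html5lib

broken = "<html><body><a href='app.js'>script<img src='logo.png'>"

# html5lib repairs the markup the way a browser would and hands back an lxml tree.
tree = html5lib.parse(broken, treebuilder="lxml", namespaceHTMLElements=False)

print(tree.xpath("//a/@href"))    # ['app.js']
print(tree.xpath("//img/@src"))   # ['logo.png']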
BeautifulSoup will do the trick.
import urllib
from BeautifulSoup import BeautifulSoup

# Fetch the page and build a soup from the raw HTML
sock = urllib.urlopen("http://stackoverflow.com")
soup = BeautifulSoup(sock.read())
sock.close()

# Collect images, external JavaScript, and stylesheet links
img = soup.findAll("img")
script = soup.findAll("script", {"type" : "text/javascript"})
css = soup.findAll("link", {"rel" : "stylesheet"})
HTML is not a language that can be parsed with regular expressions, so don't even try; it will break.
What I typically use is Beautiful Soup, which is a parser library built especially for gathering information from potentially invalid markup, exactly like the stuff you will find out there.
From what I can make out, the two main HTML parsing libraries in Python are lxml and BeautifulSoup. I've chosen BeautifulSoup for a project I'm working on, but I chose it for no particular reason other than finding the syntax a bit easier to learn and understand. But I see a lot of people seem to favour lxml and I've heard that lxml is faster.
So I'm wondering what are the advantages of one over the other? When would I want to use lxml and when would I be better off using BeautifulSoup? Are there any other libraries worth considering?
Pyquery provides the jQuery selector interface to Python (using lxml under the hood).
http://pypi.python.org/pypi/pyquery
It's really awesome, I don't use anything else anymore.
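For example (a quick sketch of my own, not the answerer's code), you get jQuery-style selectors on top of an lxml tree:
from pyquery import PyQuery as pq

d = pq("<div><p class='intro'>Hello</p><p>World</p></div>")

print(d("p.intro").text())          # Hello
print([p.text for p in d("p")])     # ['Hello', 'World']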
For starters, BeautifulSoup is no longer actively maintained, and the author even recommends alternatives such as lxml.
Quoting from the linked page:
Version 3.1.0 of Beautiful Soup does significantly worse on real-world HTML than version 3.0.8 does. The most common problems are handling tags incorrectly, "malformed start tag" errors, and "bad end tag" errors. This page explains what happened, how the problem will be addressed, and what you can do right now.
This page was originally written in March 2009. Since then, the 3.2 series has been released, replacing the 3.1 series, and development of the 4.x series has gotten underway. This page will remain up for historical purposes.
tl;dr
Use 3.2.0 instead.
In summary, lxml is positioned as a lightning-fast, production-quality HTML and XML parser that, by the way, also includes a soupparser module to fall back on BeautifulSoup's functionality. BeautifulSoup is a one-person project, designed to save you time by quickly extracting data out of poorly-formed HTML or XML.
lxml documentation says that both parsers have advantages and disadvantages. For this reason, lxml provides a soupparser so you can switch back and forth. Quoting,
BeautifulSoup uses a different parsing approach. It is not a real HTML parser but uses regular expressions to dive through tag soup. It is therefore more forgiving in some cases and less good in others. It is not uncommon that lxml/libxml2 parses and fixes broken HTML better, but BeautifulSoup has superiour support for encoding detection. It very much depends on the input which parser works better.
In the end they are saying,
The downside of using this parser is that it is much slower than the HTML parser of lxml. So if performance matters, you might want to consider using soupparser only as a fallback for certain cases.
If I understand them correctly, it means that the soup parser is more robust: it can deal with a "soup" of malformed tags by using regular expressions, whereas lxml is more straightforward and just parses things and builds a tree as you would expect. I assume this also applies to BeautifulSoup itself, not just to the soupparser for lxml.
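Here is a minimal sketch of that fallback pattern (my own code, following the approach the lxml documentation describes; the function name is made up):
import lxml.html
from lxml.html import soupparser

def parse_html(text):
    """Parse with lxml's fast HTML parser, falling back to the slower
    BeautifulSoup-based soupparser for markup lxml cannot handle."""
    try:
        return lxml.html.fromstring(text)
    except Exception:
        return soupparser.fromstring(text)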
They also show how to benefit from BeautifulSoup's encoding detection, while still parsing quickly with lxml:
>>> import lxml.html
>>> from BeautifulSoup import UnicodeDammit
>>> def decode_html(html_string):
... converted = UnicodeDammit(html_string, isHTML=True)
... if not converted.unicode:
... raise UnicodeDecodeError(
... "Failed to detect encoding, tried [%s]",
... ', '.join(converted.triedEncodings))
... # print converted.originalEncoding
... return converted.unicode
>>> root = lxml.html.fromstring(decode_html(tag_soup))
(Same source: http://lxml.de/elementsoup.html).
In the words of BeautifulSoup's creator:
That's it! Have fun! I wrote Beautiful Soup to save everybody time. Once you get used to it, you should be able to wrangle data out of poorly-designed websites in just a few minutes. Send me email if you have any comments, run into problems, or want me to know about your project that uses Beautiful Soup.
--Leonard
Quoted from the Beautiful Soup documentation.
I hope this is now clear. Beautiful Soup is a brilliant one-person project designed to save you time extracting data out of poorly-designed websites. The goal is to save you time right now and get the job done; it is not necessarily to save you time in the long term, and definitely not to optimize the performance of your software.
Also, from the lxml website,
lxml has been downloaded from the Python Package Index more than two million times and is also available directly in many package distributions, e.g. for Linux or MacOS-X.
And, from Why lxml?,
The C libraries libxml2 and libxslt have huge benefits:... Standards-compliant... Full-featured... fast. fast! FAST! ... lxml is a new Python binding for libxml2 and libxslt...
Don't use BeautifulSoup directly; use lxml.soupparser. That way you're sitting on top of the power of lxml while still getting the good bit of BeautifulSoup, which is dealing with really broken and messy HTML.
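A short usage sketch of that suggestion (my own; the sample markup and xpath are arbitrary): soupparser hands the broken markup to BeautifulSoup and gives you back an ordinary lxml element you can query as usual:
from lxml.html import soupparser

root = soupparser.fromstring("<p>Oh <b>dear</p>")
print(root.xpath("//b/text()"))   # should print something like ['dear']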
I've used lxml with great success for parsing HTML. It seems to do a good job of handling "soupy" HTML, too. I'd highly recommend it.
Here's a quick test I had lying around to try handling of some ugly HTML:
import unittest
from StringIO import StringIO
from lxml import etree

class TestLxmlStuff(unittest.TestCase):
    bad_html = """
        <html>
            <head><title>Test!</title></head>
            <body>
                <h1>Here's a heading
                <p>Here's some text
                <p>And some more text
                <b>Bold!</b></i>
                <table>
                    <tr>row
                    <tr><td>test1
                    <td>test2
                    </tr>
                    <tr>
                    <td colspan=2>spanning two
                </table>
            </body>
        </html>"""

    def test_soup(self):
        """Test lxml's parsing of really bad HTML"""
        parser = etree.HTMLParser()
        tree = etree.parse(StringIO(self.bad_html), parser)
        self.assertEqual(len(tree.xpath('//tr')), 3)
        self.assertEqual(len(tree.xpath('//td')), 3)
        self.assertEqual(len(tree.xpath('//i')), 0)
        #print(etree.tostring(tree.getroot(), pretty_print=False, method="html"))

if __name__ == '__main__':
    unittest.main()
I would definitely use EHP. It is faster than lxml, and much more elegant and simpler to use.
Check it out: https://github.com/iogf/ehp
from ehp import *

data = '''<html> <body> <em> Hello world. </em> </body> </html>'''

html = Html()
dom = html.feed(data)

for ind in dom.find('em'):
    print ind.text()
Output:
Hello world.
A somewhat outdated speed comparison can be found here, which clearly recommends lxml, as the speed differences seem drastic.