The solution of simply removing all HTML tags is not appropriate for my app.
From what I have seen so far, I have found two solutions for cleaning HTML in Python:
bleach (uses html5lib). It works perfectly well on the dev server, but I could not get it to work in production: there is an 'ImportError: No module named html5lib' when I try to import html5lib, just as if the folder were not there. Maybe a problem with GAE's Python path.
lxml. It is more complicated to get working on the dev server: I had to install two third-party binaries (libxslt and libxml2) for my local Python and then pip install lxml. In production, once I declared the lxml library in my app.yaml, it worked just fine.
Are there better solutions than lxml?
Thanks in advance
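Since stripping every tag isn't acceptable, one pure-stdlib alternative (no third-party binaries, so it deploys without the import problems above) is a whitelist-based cleaner built on Python's own HTMLParser. This is a minimal sketch, not a hardened sanitizer like bleach; the tag whitelist is a made-up example, and the import shown is Python 3 (on 2.7 it is `from HTMLParser import HTMLParser`):

```python
from html.parser import HTMLParser  # Python 2.7: from HTMLParser import HTMLParser

ALLOWED_TAGS = {"p", "b", "i", "em", "strong", "a"}  # hypothetical whitelist

class WhitelistCleaner(HTMLParser):
    """Re-emits input HTML, keeping only whitelisted tags; attributes
    and the contents of script/style elements are dropped entirely."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.out = []
        self.skip = False
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip = True
        elif tag in ALLOWED_TAGS:
            self.out.append("<%s>" % tag)
    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skip = False
        elif tag in ALLOWED_TAGS:
            self.out.append("</%s>" % tag)
    def handle_data(self, data):
        if not self.skip:
            self.out.append(data)

def clean(html):
    cleaner = WhitelistCleaner()
    cleaner.feed(html)
    cleaner.close()
    return "".join(cleaner.out)

print(clean('<p onclick="evil()">Hi <b>there</b><script>bad()</script></p>'))
# → <p>Hi <b>there</b></p>
```

For anything security-sensitive a maintained sanitizer is still the safer choice; this only shows that a whitelist approach needs nothing outside the standard library.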
...
soup = BeautifulSoup(html, "lxml")
File "/Library/Python/2.7/site-packages/bs4/__init__.py", line 152, in __init__
% ",".join(features))
bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?
The above outputs on my Terminal. I am on Mac OS 10.7.x. I have Python 2.7.1, and followed this tutorial to get Beautiful Soup and lxml, which both installed successfully and work with a separate test file located here. In the Python script that causes this error, I have included this line:
from pageCrawler import comparePages
And in the pageCrawler file I have included the following two lines:
from bs4 import BeautifulSoup
from urllib2 import urlopen
Any help in figuring out what the problem is and how it can be solved would be much appreciated.
I suspect this is related to the parser that BS will use to read the HTML. The documentation is here, but if you're like me (on OS X) you might be stuck with something that requires a bit of work:
You'll notice that the BS4 documentation page above points out that by default BS4 will use the Python built-in HTML parser. Assuming you are on OS X, the Apple-bundled version of Python is 2.7.2, whose parser is not lenient with character formatting. I hit this same problem, so I upgraded my version of Python to work around it. Doing this in a virtualenv will minimize disruption to other projects.
If doing that sounds like a pain, you can switch over to the LXML parser:
pip install lxml
And then try:
soup = BeautifulSoup(html, "lxml")
Depending on your scenario, that might be good enough. I found this annoying enough to warrant upgrading my version of Python. Using virtualenv, you can migrate your packages fairly easily.
I'd prefer the built-in Python HTML parser: no install, no dependencies.
soup = BeautifulSoup(s, "html.parser")
For basic out-of-the-box Python with bs4 installed, you can process your HTML with:
soup = BeautifulSoup(html, "html5lib")
If, however, you want to use formatter='xml', then you need to:
pip3 install lxml
soup = BeautifulSoup(html, features="xml")
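If the document really is XML and you would rather avoid installing lxml at all, the standard library's xml.etree.ElementTree can often do the job. A sketch with made-up sample data, not a drop-in replacement for BeautifulSoup's API:

```python
import xml.etree.ElementTree as ET

xml_data = "<feed><entry id='1'>first</entry><entry id='2'>second</entry></feed>"
root = ET.fromstring(xml_data)

# iterate matching elements, much like soup.find_all("entry")
pairs = [(entry.get("id"), entry.text) for entry in root.iter("entry")]
print(pairs)  # → [('1', 'first'), ('2', 'second')]
```

ElementTree is strict (it rejects malformed XML), which is exactly why lxml or an HTML parser is the better fit for messy HTML input.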
Run these three commands to make sure that you have all the relevant packages installed:
pip install bs4
pip install html5lib
pip install lxml
Then restart your Python IDE, if needed.
That should take care of anything related to this issue.
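Before restarting anything, you can confirm from inside the interpreter which parser libraries are actually importable in the environment you are running (importlib.util.find_spec is Python 3; on Python 2 a plain import in a try/except does the same job):

```python
import importlib.util

# html.parser ships with Python; lxml and html5lib must be pip-installed
for name in ("html.parser", "lxml", "html5lib"):
    found = importlib.util.find_spec(name) is not None
    print("%-12s %s" % (name, "available" if found else "MISSING"))
```

If a package shows as MISSING here but pip claims it is installed, pip and your interpreter are almost certainly pointing at different Python installations.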
Actually, three of the options mentioned by others work.
# 1.
soup_object= BeautifulSoup(markup,"html.parser") #Python HTML parser
# 2.
pip install lxml
soup_object= BeautifulSoup(markup,'lxml') # C dependent parser
# 3.
pip install html5lib
soup_object= BeautifulSoup(markup,'html5lib') # pure-Python parser
I am using Python 3.6 and I had the same error as in the original post. After I ran the command:
python3 -m pip install lxml
the problem was resolved.
Install the lxml parser in your Python environment:
pip install lxml
Your problem will be resolved. You can also use the built-in Python parser for the same purpose:
soup = BeautifulSoup(s, "html.parser")
Note: the "HTMLParser" module was renamed to "html.parser" in Python 3.
Instead of using lxml, use html.parser; you can use this piece of code:
soup = BeautifulSoup(html, 'html.parser')
BeautifulSoup supports the HTML parser by default, but if you want to use any other third-party Python parser, you need to install that external parser (such as lxml).
soup_object= BeautifulSoup(markup, "html.parser") #Python HTML parser
If you don't specify any parser as a parameter, you will get a warning that no parser was specified.
soup_object= BeautifulSoup(markup) # warning
To use any other external parser, you need to install it and then specify it, like:
pip install lxml
soup_object= BeautifulSoup(markup, 'lxml') # C dependent parser
External parsers have C and Python dependencies, which brings both advantages and disadvantages.
In my case I had an outdated version of the lxml package. So I just updated it and this fixed the issue.
sudo python3 -m pip install lxml --upgrade
I encountered the same issue. I found the reason was that I had a slightly outdated six package.
>>> import html5lib
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/site-packages/html5lib/__init__.py", line 16, in <module>
from .html5parser import HTMLParser, parse, parseFragment
File "/usr/local/lib/python2.7/site-packages/html5lib/html5parser.py", line 2, in <module>
from six import with_metaclass, viewkeys, PY3
ImportError: cannot import name viewkeys
Upgrading your six package will solve the issue:
sudo pip install six==1.10.0
Some references use the first spelling; it is invalid, so use the second instead:
soup_object= BeautifulSoup(markup,'html-parser')  # wrong
soup_object= BeautifulSoup(markup,'html.parser')  # correct
The error occurs because of the parser you are using. In general, if you have an HTML file or HTML code, you should use html5lib (documentation can be found here), and if you have an XML file or XML data, you should use lxml (documentation can be found here). You can use lxml for HTML as well, but it sometimes gives an error like the one above, so it is better to choose the package based on the type of data you have. You can also use html.parser, which is a built-in module, though this also sometimes does not work.
For more details about when to use which package, see here.
Leaving the parser parameter blank will result in a warning naming the best available parser.
soup = BeautifulSoup(html)
UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html5lib"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.
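If you want such an unpinned-parser warning to fail loudly during testing rather than slip by, the stdlib warnings module can escalate it to an exception. A generic sketch (the might_warn function is a stand-in for any call that emits a UserWarning, not part of bs4):

```python
import warnings

def might_warn():
    # stand-in for code that warns, e.g. BeautifulSoup with no parser argument
    warnings.warn("No parser was explicitly specified", UserWarning)

with warnings.catch_warnings():
    warnings.simplefilter("error", UserWarning)  # escalate warnings to exceptions
    try:
        might_warn()
        caught = False
    except UserWarning:
        caught = True

print(caught)  # → True
```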
python --version Python 3.7.7
PyCharm 19.3.4 CE
My solution was to remove lxml from conda and reinstalling it with pip.
I am using Python 3.8 in PyCharm. I assume that you had not installed lxml before you started working. This is what I did:
Go to File -> Settings
On the left menu bar of Settings, select "Python Interpreter."
Click the "+" icon over the list of packages.
Search for "lxml."
Click "Install Package" on the bottom left of the "Available Packages" window.
pip install lxml, then keeping "xml" in soup = BeautifulSoup(URL, "xml"), did the job on my Mac.
This method worked for me. I should mention that I was trying this in a virtual environment. First:
pip install --upgrade bs4
Secondly, I used:
html.parser
instead of
html5lib
I fixed it with the changes below.
Before the change:
soup = BeautifulSoup(r.content, 'html5lib')
print(soup.prettify())
After the change:
soup = BeautifulSoup(r.content, features='html')
print(soup.prettify())
Now my code works properly.
BS4 by default expects an HTML document. Therefore, it parses an XML document as an HTML one. Pass features="xml" as an argument in the constructor. It resolved my issue.
You may want to double check that you're using the right interpreter if you have multiple versions of Python installed.
Once I chose the correct version of Python, lxml was found.
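A quick sanity check from inside the interpreter (or from your IDE's run configuration) shows exactly which Python is running and where it looks for packages:

```python
import sys

print(sys.executable)           # the interpreter binary actually running this code
print(sys.version.split()[0])   # its version number
print(sys.prefix)               # the installation its site-packages belong to
```

If sys.executable differs from the Python that pip installed lxml into, that mismatch is the whole problem.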
I am trying to get this program (https://gist.github.com/eknowles/9939273) to work, but when I put the code into PyCharm, it underlines the json, requests and BeautifulSoup imports and says "no module named beautifulsoup...".
Then I tried to install them with "easy_install requests" or "easy_install json", but it spits out this:
PS C:\Users\Ruzgar> easy_install json
Searching for json
Reading https://pypi.python.org/simple/json/
Couldn't find index page for 'json' (maybe misspelled?)
Scanning index of all packages (this may take a while)
Reading https://pypi.python.org/simple/
No local packages or download links found for json
error: Could not find suitable distribution for Requirement.parse('json')
How can I make this code work?
I understand that I have to fix this import problem first. (I am using Python 2.5.4, by the way.)
You need to update your version of Python. Requests and BeautifulSoup require a Python version greater than 2.6, and json has been included in Python since version 2.6.
I recommend you install the latest stable version of Python 2.7. You can find it on the download page.
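Since json has shipped with the standard library from 2.6 onward, there is nothing to easy_install; once you are on a recent enough Python, you just import and use it:

```python
import json

record = {"name": "crawler", "pages": 3}
text = json.dumps(record)    # dict -> JSON string
back = json.loads(text)      # JSON string -> dict
assert back == record        # round-trips cleanly
print(text)
```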
I changed the project interpreter to Python 3.4. Now it tells me that different modules are not installed.
Use Python 2.7, as not everything is compatible with 3.4. Once you have downloaded and installed it, restart PyCharm and load your project again, then follow these steps:
Click on File, then Settings
On the left, under 'Project Settings', click on Python Interpreter
On the right hand side, click on 'Configure Interpreters'
On the right, you will see a list of interpreters available on your system. The interpreter for your project will be highlighted. In the bottom half you will see the packages installed for the interpreter. Click on "Install".
You'll see a new window popup. This is a browser for the Python Package Index (PyPI). In the search box, type requests; when you see the results filtered, click on requests, and then click Install Package. Repeat this process for BeautifulSoup. Remember, you don't need to install json since its already included.
Click Apply, then OK to dismiss the window. Give PyCharm a few seconds to rebuild its cache, and everything should work.
Download the Beautiful Soup package (I think you should download Beautiful Soup 3, because your Python version is < 2.6) and install it; if you want to know more about this package, see here.
As the Beautiful Soup documentation says:
If all else fails, the license for Beautiful Soup allows you to package the entire library with your application. You can download the tarball, copy its bs4 directory into your application’s codebase, and use Beautiful Soup without installing it at all.
This is exactly what I want, and what I've done... up to the point of using it in my code. I don't know how to import Beautiful Soup 4. Unlike v3, there's no standalone BeautifulSoup.py, just that bs4 directory with a bunch of python scripts. Does anyone have an example of how to use Beautiful Soup 4 when you have the source code in your project?
That 'bunch of Python scripts' is called a Python package; there should be an __init__.py file in there somewhere. Together they form a coherent whole, a namespaced set of modules.
You can just import the BeautifulSoup class from the bs4 package:
from bs4 import BeautifulSoup
See the documentation for more info.
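The mechanism behind "copy bs4 into your codebase" is just Python's package lookup: any directory containing an __init__.py becomes importable once its parent directory is on sys.path. A self-contained sketch using a throwaway dummy package (mypkg) in place of bs4, so it runs without Beautiful Soup installed:

```python
import os
import sys
import tempfile

# Build a throwaway package to stand in for the copied bs4 directory.
vendor_dir = tempfile.mkdtemp()
pkg_dir = os.path.join(vendor_dir, "mypkg")   # stand-in for the "bs4" directory
os.makedirs(pkg_dir)
with open(os.path.join(pkg_dir, "__init__.py"), "w") as f:
    f.write("GREETING = 'hello'\n")

sys.path.insert(0, vendor_dir)                # make the parent dir importable
import mypkg                                  # analogous to: from bs4 import BeautifulSoup

print(mypkg.GREETING)  # → hello
```

In practice you rarely need the sys.path step: if the bs4 directory sits next to the script doing the import, `from bs4 import BeautifulSoup` resolves on its own.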
I've unpacked BeautifulSoup into c:\python2.6\lib\site-packages, which is in sys.path, but when I enter import BeautifulSoup I get an import error saying no such module exists. Obviously I'm doing something stupid... what is it?
You might have more than one Python version installed; check the version you are running.
Also, I found that easy_install worked well for installing BeautifulSoup.