Strange behaviour with BeautifulSoup and converting HTML entities

Strange behaviour with BeautifulSoup and converting HTML entities - python

I have a strange problem with converting special characters from HTML. I have a Django project where text is stored HTML-encoded in a MySQL database. This is necessary, because I don't want to lose any formatting of the text.
In a preliminary step I must do operational things on the text like calculating positions, so I need to convert it first and clear it from all HTML-Tags. This is done by BeautifulSoup:
convertedText = str(BeautifulSoup(text.text, convertEntities=BeautifulSoup.HTML_ENTITIES))
convertedText = ''.join(BeautifulSoup(convertedText).findAll(text=True))
By working on my Django-default test-server everything works fine, but when I run it on my production server there are strange behaviors when converting special characters.
An example:
Test server
MySQL-Query gives me: <p>bassverstärker</p>
is correctly converted to: bassverstärker
Production server
MySQL-Query gives me: <p>bassverstärker</p>
This is is wrongly converted to: bassverst\ucc44rker
Somehow the ä is converted into \ucc44 and this results in a wrong character.
My configuration:
Test server:
Django build-in solution (python manage.py runserver)
BeautifulSoup 3.2.1
Python 2.6.5
Ubuntu 2.6.32-43-generic
Production server:
Cherokee 1.2.101
BeautifulSoup 3.2.1
python 2.7.3
Ubuntu 3.2.0-32-generic
Because I don't know at which level the error occurs, I would like to ask if anybody can help me with this. Many thanks in advance.

I found a way to fix this. I didn't know that BeautifulSoup has the builtin method getText(). When converting HTML through:
convertedText = BeautifulSoup(text.text, convertEntities=BeautifulSoup.HTML_ENTITIES).getText()
eveything works fine on both servers. Although this works, it would be interesting to know why both servers are behaving differently when working with the example in the question.
However, thanks to all.

Related

Importing from Oracle using the correct encoding with Python

I apologize for making a character encoding question since I know you folk get many everyday, but I couldn't figure out my problem so I asked anyway.
Here is what we are doing:
Take Data from an Oracle DB using Python and cx_Oracle.
Write the data to a file using Python.
Ingest the file into Postgres using Python and psycopg2.
Here are the important Oracle settings:
SQL> select * from NLS_DATABASE_PARAMETERS;
PARAMETER VALUE
------------------------------ ----------------------------------------
NLS_LANGUAGE AMERICAN
NLS_TERRITORY AMERICA
NLS_CURRENCY $
NLS_ISO_CURRENCY AMERICA
NLS_NUMERIC_CHARACTERS .,
NLS_CHARACTERSET US7ASCII
According to this NLS_LANG faq, you are meant to set the NLS_LANG according to what your client OS is using.
Running locale gives us: LANG=en_US.UTF-8 (all of the other fields were also en_US.UTF-8).
So, in our Python script, we set it like this:
os.environ["NLS_LANG"] = "AMERICAN_AMERICA.AL32UTF8"
Then we import the data and write it to a file.
row = cur.fetchall()
fil.write(row[0][0]) #For this test, I am only writing one row and one field.
We ingest that file into our UTF-8 Postgres DB.
Unfortunately, for some reason, we get this symbol: � in our file and the subsequent PG table as well. If my understanding is correct, this is the Replace Character. I believe that character is meant to show up if Unicode does not recognize a symbol.
(In some text editors, the symbol shows up as ï¿½).
What I don't understand is why is this happening? I thought UTF-8 was backwards compatible with 7-bit ASCII?
And even if we are using regional pages, shouldn't it still work, since the client is using US and the Oracle server is using AMERICAN?
How can I check if the data is imported correctly and if it isn't correct, how can I fix it so future imports are?
Note: The Oracle field is a CHAR field and not a NCHAR field.
Note2: We are using Python 2.4, so we don't have the native Unicode stuff in Python 3.X. So, it is possible that Python is messing up somewhere though I thought cx_Oracle took care of it all.
Thanks for your time, I hope you have a good day.

Unfortunately, for some reason, we get this symbol: � in our file and the subsequent PG table as well. If my understanding is correct, this is the Replace Character. I believe that character is meant to show up if Unicode does not recognize a symbol.
Mostly right but not quite. PostgreSQL will refuse to insert non-UTF8 text characters when using that encoding (do a search on StackOverflow for "Invalid UTF8 postgresql"). Most likely the character you are seeing is a valid UTF8 character that is not recognized by your font and therefore is showing the replacement character. If the symbol is in your Oracle db and is actually the replacement symbol there, then what do you want to replace it with? If that is the case, the information is already missing.
What I don't understand is why is this happening? I thought UTF-8 was backwards compatible with 7-bit ASCII?
It is.
How can I check if the data is imported correctly and if it isn't correct, how can I fix it so future imports are?
Most likely your problem is upstream of the Oracle db. I would find out what is actually inserting problem data into the Oracle db and fix it there. If you can check the data in Pg against the data in Oracle, you should be able to determine if the data is character for character the same (and flag any differences). That's how to check your current import.
Note2: We are using Python 2.4, so we don't have the native Unicode stuff in Python 3.X. So, it is possible that Python is messing up somewhere though I thought cx_Oracle took care of it all.
That's another possibility. Personally for file transformations I prefer Perl because of integrated regular expressions and absolutely top rate PostgreSQL support. However I recognize your import routine may not be readily convertable at this point. I am a little more familiar with troubleshooting UTF8 conversion issues in Perl than in Python. I do wonder however if you can check the data that is coming out in binary format for such symbols.

Python 2.7, Appengine Data Store & Unicode

So I've been reading quite a bit about Unicoding tonight because I was thinking of switching to Jinja2, which requires Unicode to be used everywhere within the app. I think I have a good idea of how to deal with it, but I wanted to hear if this is reasonable before I started to code my app:
Dealing with External Text-Inputs (via html forms)
a) Make sure all html pages are utf-8 encoded.
b) Once users press submit, make sure the data is converted into Unicode as soon as the python backend receives it...decode(self.request.get('stuff'),utf-8)
c) Stay in unicode, transfer the outputs to Jinja2 which will always it using the default encoding of utf-8.
Information from the appengine datastore
Because google stores everything as Unicode, all data coming in from the datastore is already unicode and I don't have to worry about anything (yay!)
Strings within the app
Make sure all "" start with a u (i.e. u"hello world"), this will force everything to be in unicode.
Well the above is my strategy to keep everything consistent. Is there anything else I need to account for?
thanks!

You should not need to .decode(self.request.get('stuff'),utf-8 if you using webapp or webapp2. The framework respects the input type of the data as specified.
Everything else looks right.
Also I believe that
from __future__ import unicode_strings
should be
from __future__ import unicode_literals
and is only available in 2.6 and 2.7 So in App Engine it would only be available if you are using 2.7

Getting a URL with Python

I'm trying to do something similar to placekitten.com, wherein a user can input two strings after the base URL and have those strings alter the output. I'm doing this in Python, and I cannot for the life of me figure out how to grab the URL. In PHP I can do it with query string and $_REQUEST. I can't find a similar method in Python that doesn't rely on CGI.
(I know I could do this with Django, but that's serious overkill for this project.)

This is just by looking at the docs but have you tried it?
cherrypy.request.path_info
The docs say:
The ‘relative path’ portion of the Request-URI. This is relative to the script_name (‘mount point’) of the application which is handling this request.
http://docs.cherrypy.org/stable/refman/_cprequest.html#cherrypy._cprequest.Request.path_info

Get the GET query in a Python CGI script (Apache, mod_cgi)

I am trying to get the query string in a CGI python script because for some reason the .FieldStorage and .getvalue functions are returning "None". The query is being generated correctly. I checked it in Wireshark and it is correctly produced.
Any ideas on how I can just get the string itself correctly?
I was trying to use os.environ('QUERY_STRING') but that didn't work either.
Josh

I think I figured it out. I need to use the environ variable passed to the application function by mod_wsgi.
Thanks!

Decoding problems in Django and lxml

I have a strange problem with lxml when using the deployed version of my Django application. I use lxml to parse another HTML page which I fetch from my server. This works perfectly well on my development server on my own computer, but for some reason it gives me UnicodeDecodeError on the server.
('utf8', "\x85why hello there!", 0, 1, 'unexpected code byte')
I have made sure that Apache (with mod_python) runs with LANG='en_US.UTF-8'.
I've tried googling for this problem and tried different approaches to decoding the string correctly, but I can't figure it out.
In your answer, you may assume that my string is called hello or something.

"\x85why hello there!" is not a utf-8 encoded string. You should try decoding the webpage before passing it to lxml. Check what encoding it uses by looking at the http headers when you fetch the page maybe you find the problem there.

Doesn't syntax such as u"\x85why hello there!" help?
You may find the following resources from the official Python documentation helpful:
Python introduction, Unicode Strings
Sequence Types — str, unicode, list, tuple, buffer, xrange

Since modifying site.py is not an ideal solution try this at the start of your program:
import sys
reload(sys)
sys.setdefaultencoding("utf-8")

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.