converting UTF-16 special characters to UTF-8 - python

I'm working in Django and Python and I'm having issues with saving utf-16 characters in PostgreSQL. Is there any method to convert utf-16 to utf-8 before saving?
I'm using Python 2.6. Here are my code snippets:
sample_data="This is the time of year when Travel & Leisure, TripAdvisor and other travel media trot out their “Best†lists, so I thought I might share my own list of outstanding hotels I’ve had the good fortune to visit over the years."
The data above contains some Latin special characters, but it is not displaying correctly. I just want to show those special characters in the appropriate format.

There is no such thing as "utf-16 characters". You should show your data by using print repr(data), and tell us which pieces of your data you are having trouble with. Show us the essence of your data, e.g. the repr() of "Leisure “Best†lists I’ve had"
What you actually have is a string of bytes containing text encoded in UTF-8. Here is its repr():
'Leisure \xe2\x80\x9cBest\xe2\x80\x9d lists I\xe2\x80\x99ve had'
You'll notice 3 clumps of guff in what you showed. These correspond to the 3 clumps of \xhh in the repr.
Clump 1 (\xe2\x80\x9c) decodes to U+201C LEFT DOUBLE QUOTATION MARK.
Clump 2 is \xe2\x80\x9d, which decodes to U+201D RIGHT DOUBLE QUOTATION MARK. Note that only the first 2 "latin special characters" aka "guff" showed up in your display. That is because your terminal's encoding is cp1252, which has no mapping for \x9d; it just ignored it.
Clump 3 (\xe2\x80\x99) decodes to U+2019 RIGHT SINGLE QUOTATION MARK (being used as an apostrophe).
As you have UTF-8-encoded bytes, you should be having no trouble with PostgreSQL. If you are getting errors, show your code, the full error message and the full traceback.
If you really need to display the guff on your Windows terminal, use print guff.decode('utf8').encode('cp1252') ... just be prepared for Unicode characters that are not supported by cp1252.
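For example, here is a minimal Python 2 sketch of that display step, using the repr shown above; the 'replace' error handler is an extra safety net (not in the one-liner above) that turns anything cp1252 cannot represent into '?':

guff = 'Leisure \xe2\x80\x9cBest\xe2\x80\x9d lists I\xe2\x80\x99ve had'  # UTF-8-encoded byte string

text = guff.decode('utf8')              # unicode: u'Leisure \u201cBest\u201d lists I\u2019ve had'
print text.encode('cp1252', 'replace')  # for a cp1252 Windows console; unmappable characters become '?'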
Update in response to comment: "I don't have any issue with saving data; the problem is that it shows weird characters when displayed, so what I am thinking is to convert the data before saving. Am I right?"
Make up your mind. (1) In your question you say "I'm having issues with saving utf-16 characters in PostgreSQL". (2) Now you say "I don't have any issue with saving data; the problem is that it shows weird characters when displayed".
Summary: Your sample data is encoded in UTF-8. If UTF-8 is not acceptable to PostgreSQL, decode it to Unicode. If you are having display problems, first try displaying the corresponding Unicode; if that doesn't work, try an encoding that your terminal will support (presumably one of the cp125x family).

This works for me to convert strings: sample_data.decode('mbcs').encode('utf-8')

Related

Encode a raw string so it can be decoded as json

I am throwing in the towel here. I'm trying to convert a string scraped with Scrapy from a website's source code (injected JavaScript) to JSON so I can easily access the data. The problem comes down to a decode error. I have tried all kinds of encoding, decoding, escaping, codecs, regular expressions and string manipulations, and nothing works. Oh, and I'm using Python 3.
I narrowed the culprit down to this string (or at least part of it):
scraped = '{"propertyNotes": [{"title": "Local Description", "text": "\u003Cp\u003EAPPS\u003C/p\u003E\n\n\u003Cp\u003EBig Island Revealed (comes as app or as a printed book)\u003C/p\u003E\n\n\u003Cp\u003EAloha Big Island\u003C/p\u003E\n\n\u003Cp\u003EBig Island\u003C/p\u003E\n\n\u003Cp\u003EBig Island Smart Maps (I like this one a lot)\u003C/p\u003E\n\n\u003Cp\u003EBig Island Adventures (includes videos)\u003C/p\u003E\n\n\u003Cp\u003EThe descriptions of beaches are helpful. Suitability for swimming, ease of access, etc. is included. Some beaches are great for picnics and scenic views, while others are suitable for swimming and snorkeling. Check before you go.\u003C/p\u003E"}]}'
scraped_raw = r'{"propertyNotes": [{"title": "Local Description", "text": "\u003Cp\u003EAPPS\u003C/p\u003E\n\n\u003Cp\u003EBig Island Revealed (comes as app or as a printed book)\u003C/p\u003E\n\n\u003Cp\u003EAloha Big Island\u003C/p\u003E\n\n\u003Cp\u003EBig Island\u003C/p\u003E\n\n\u003Cp\u003EBig Island Smart Maps (I like this one a lot)\u003C/p\u003E\n\n\u003Cp\u003EBig Island Adventures (includes videos)\u003C/p\u003E\n\n\u003Cp\u003EThe descriptions of beaches are helpful. Suitability for swimming, ease of access, etc. is included. Some beaches are great for picnics and scenic views, while others are suitable for swimming and snorkeling. Check before you go.\u003C/p\u003E"}]}'
import json

data = json.loads(scraped_raw)    # works: the raw literal keeps \u003C and \n as escapes for the JSON parser
print(data["propertyNotes"])
failed = json.loads(scraped)      # fails: Python already turned \n into real newlines, which JSON forbids in strings
print(failed["propertyNotes"])
Unfortunately, I cannot find a way for scrapy/splash to return the string as raw. So, somehow I need to have Python interpret the string as raw while loading the JSON. Please help.
Update:
What worked for that string was json.loads(str(data.encode('unicode_escape'), 'utf-8')). However, it didn't work with the larger string; on the larger JSON string that approach gives JSONDecodeError: Invalid \escape.
The problem exists because the string you're getting contains escape sequences which, when interpreted by Python, become actual control characters (that is not necessarily bad in itself, but JSON does not allow raw control characters inside string values). Similar to Turn's answer, you need to interpret the string without interpreting the escaped values, which is done using
json.loads(scraped.encode('unicode_escape'))
This works because unicode_escape turns the characters Python already interpreted back into backslash escape sequences: a real newline becomes the two characters \n again, while ordinary text such as the < that came from \u003C is left alone, so the JSON parser sees valid escapes instead of raw control characters.
If my understanding is correct, however, you may not want this, because you then lose the escaped control characters, so the data might not be the same as the original.
You can see this in action by noticing that the raw control characters are gone (replaced by their escape sequences) after converting the encoded string back to a normal Python string:
scraped.encode('unicode_escape').decode('utf-8')
If you want to keep the control characters you're going to have to attempt to escape the strings before loading them.
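For what it's worth, here is a minimal Python 3 sketch of that approach, using a shortened version of the question's string (json.loads accepts bytes on Python 3.6 and later):

import json

# shortened version of the scraped string; Python has already turned \u003C into '<' and \n into real newlines
scraped = '{"propertyNotes": [{"title": "Local Description", "text": "\u003Cp\u003EAPPS\u003C/p\u003E\n\n\u003Cp\u003EAloha Big Island\u003C/p\u003E"}]}'

# re-escape the interpreted characters so the JSON parser sees \n escapes instead of raw newlines
data = json.loads(scraped.encode('unicode_escape'))
print(data["propertyNotes"][0]["text"])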
If you are using Python 3.6 or later I think you can get this to work with
json.loads(scraped.encode('unicode_escape'))
As per the docs, this will give you an
Encoding suitable as the contents of a Unicode literal in
ASCII-encoded Python source code, except that quotes are not escaped.
Decodes from Latin-1 source code. Beware that Python source code
actually uses UTF-8 by default.
Which seems like exactly what you need.
OK, so since I am on Windows, I had to set the console to handle special characters. I did this by typing chcp 65001 into the terminal. I also used a regular expression and chained the string-manipulation functions, which is the Pythonic way anyway.
usable_json = json.loads(
    re.search(
        'start_sub_string(.*)end_sub_string',
        hxs.xpath("//script[contains(., 'some_string')]//text()").extract_first()
    ).group(1)
)
Then everything went smoothly. I'll sort out the encoding and escaping when writing to the database down the line.

Processing delimiters with python

I'm currently trying to parse an Apache log in a format I can't handle normally. (I tried using GoAccess.)
In Sublime the delimiters show up as ENQ, SOH, and ETX, which to my understanding are "|", space, and superscript L. I'm trying to use re.split to separate the individual components of the log, but I'm not sure how to deal with the superscript L.
On sublime it shows up as 3286d68255beaf010000543a000012f1/Madonna_Home_1.jpgENQx628a135bENQZ1e5ENQAB50632SOHA50.134.214.130SOHC98.138.19.91SOHD42857ENQwwww.newprophecy.net...
with ENQs showing as '|' and SOH as ' ' when I open the file in a plain text editor (like Notepad).
I just need to parse out the IP addresses so the rest of the line is mostly irrelevant.
Currently I have
pkts = re.split(r"\s|\|", line)  # 'line' here stands for one log line
But I don't know what to do for the L.
Those 3-letter codes are ASCII control codes - ASCII characters that come before 32 (the space character) in the ASCII character set. You can find a full list online.
These characters do not correspond to anything printable, so you're incorrect in assuming they correspond to those characters. You can refer to them as literals in several languages using \x00 notation - for example, control code ETX corresponds to \x03 (see the reference I linked to above). You can use these to split strings or anything else.
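To make that concrete, here is a minimal sketch assuming the delimiters really are the control codes SOH (\x01), ETX (\x03) and ENQ (\x05); the sample line is adapted from the question:

import re

line = ("3286d68255beaf010000543a000012f1/Madonna_Home_1.jpg\x05x628a135b\x05Z1e5"
        "\x05AB50632\x01A50.134.214.130\x01C98.138.19.91\x01D42857\x05wwww.newprophecy.net")

fields = re.split("[\x01\x03\x05]", line)           # split on any of the three control codes
ips = re.findall(r"\d{1,3}(?:\.\d{1,3}){3}", line)  # or just pull out the IP addresses directly
print(ips)                                          # ['50.134.214.130', '98.138.19.91']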
This is the literal answer to your question, but all this aside I find it quite unlikely that you actually need to split your Apache log file by control codes. At a guess, what's actually happened is that some Unicode characters have crept into your log file somehow, perhaps with UTF-8 encoding. An encoding is a way of representing characters that extend beyond the 255 limit of a single byte by encoding extended characters with multiple bytes.
There are several types of encoding, but UTF-8 is one of the most popular. If you use UTF-8 it has the property that standard ASCII characters will appear as normal (so you might never even realise that UTF-8 was being used), but if you view the file in an editor which isn't UTF-8 aware (or which incorrectly identifies the file as plain ASCII) then you'll see these odd control codes. These are places where really the code and the character(s) before or after it should be interpreted together as a single unit.
I'm not sure that this is the reason, it's just an educated guess, but if you haven't already considered it then it's important to figure out the encoding of your file since it'll affect how you interpret the entire content of it. I suggest loading the file into an editor that understands encodings (I'm sure something as popular as Sublime does with proper configuration) and force the encoding to UTF-8 and see if that makes the content seem more sensible.

ASCII as default encoding in python instead of utf-8

I only code in English but I have to deal with python unicode all the time.
Sometimes it's hard to remove Unicode characters from a dict.
How can I change Python's default character encoding to ASCII?
That would be the wrong thing to do. As in very wrong. To start with, it would only give you a UnicodeDecodeError instead of removing the characters. Learn proper encoding and decoding to/from Unicode so that you can filter out the values using rules like errors="ignore".
You can't just ignore the characters that are part of your data just because you "dislike" them. It is text, and in an interconnected world, text is not composed of only 26 glyphs.
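As a minimal Python 2 sketch of the difference between dropping characters and encoding them properly (the sample text is made up):

text = u'na\xefve caf\xe9 \u8f6e\u6e21'   # unicode text containing non-ASCII characters

# lossy: silently drops everything ASCII can't represent (usually the wrong thing to do)
print text.encode('ascii', 'ignore')      # 'nave caf '

# better: keep the data and encode it properly when it leaves your program
print text.encode('utf-8')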
I'd suggest you get started by reading this document: http://www.joelonsoftware.com/articles/Unicode.html

Do UTF-8 characters cover all encodings of ISO8859-xx and windows-12xx?

I am trying to write a generic document indexer for a bunch of documents with different encodings in Python. I would like to know if it is possible to read all of my documents (encoded with UTF-8, ISO8859-xx and Windows-12xx) as UTF-8 without character loss.
The reading part is as follows:
import codecs

fin = codecs.open(doc_name, "r", "utf-8")
doc_content = fin.read()
I'm going to rephrase your question slightly. I believe you are asking, "can I open a document and read it as if it were UTF-8, provided that it is actually intended to be ISO8859-xx or Windows-12xx, without loss?". This is what the Python code you've posted attempts to do.
The answer to that question is no. The Python code you posted will mangle the documents if they contain any characters above ordinal 127. This is because the "codepages" use the numbers from 128 to 255 to represent one character each, where UTF-8 uses that number range to proxy multibyte characters. So, each character in your document which is not in ASCII will be either interpreted as an invalid string or will be combined with the succeeding byte(s) to form a single UTF-8 codepoint, if you incorrectly parse the file as UTF-8.
As a concrete example, say your document is in Windows-1252. It contains the byte sequence 0xC3 0xAE, or "Ã®" (A with tilde, registered trademark sign). In UTF-8, that same byte sequence represents one character, "î" (small 'i' with circumflex). In Windows-874, that same sequence would be "รฎ". These are rather different strings - a moral insult could become an invitation to play chess, or vice versa. Meaning is lost.
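You can see this for yourself by decoding those two bytes with each codec (a quick Python 3 sketch; cp874 is Python's name for the Windows-874 codec):

raw = b'\xc3\xae'

print(raw.decode('windows-1252'))   # two characters
print(raw.decode('utf-8'))          # one character, U+00EE
print(raw.decode('cp874'))          # two Thai characters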
Now, for a slightly different question - "can I losslessly convert my files from their current encoding to UTF-8?" or, "can I represent all the data from the current files as a UTF-8 bytestream?". The answer to these questions is (modulo a few fuzzy bits) yes. Unicode is designed to have a codepoint for every character in any previously existing codepage, and by and large has succeeded in this goal. There are a few rough edges, but you will likely be well-served by using Unicode as your common interchange format (and UTF-8 is a good choice for a representation thereof).
However, to effect the conversion, you must already know and state the format in which the files exist as they are being read. Otherwise Python will incorrectly deal with non-ASCII characters and you will badly damage your text (irreparably, in fact, if you discard either the invalid-in-UTF8 sequences or the origin of a particular wrongly-converted byte range).
In the event that the text is all, 100% ASCII, you can open it as UTF-8 without a problem, as the first 128 codepoints are shared between the two representations.
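As a minimal sketch of that read-with-the-real-encoding, re-encode-as-UTF-8 approach (the file names and encodings below are made up):

import codecs

# each file's encoding must be known up front; it cannot be guessed reliably
sources = [("report_fr.txt", "windows-1252"), ("notes_el.txt", "iso8859-7")]

for name, encoding in sources:
    fin = codecs.open(name, "r", encoding)   # decode using the file's real encoding
    text = fin.read()                        # text is now a unicode string
    fin.close()
    fout = codecs.open(name + ".utf8", "w", "utf-8")
    fout.write(text)                         # written back out, losslessly, as UTF-8
    fout.close()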
UTF-8 covers everything in Unicode. I don't know for sure whether ISO-8859-xx and Windows-12xx are entirely covered by Unicode, but I strongly suspect they are.
I believe there are some encodings which include characters which aren't in Unicode, but I would be fairly surprised if you came across those characters. Covering the whole of Unicode is "good enough" for almost everything - that's the purpose of Unicode, after all. It's meant to cover everything we could possibly need (which is why it's grown :)
EDIT: As noted, you have to know the encoding of the file yourself, and state it - you can't just expect files to magically be read correctly. But once you do know the encoding, you could convert everything to UTF-8.
You'll need to have some way of determining which character set the document uses. You can't just open each one as "utf-8" and expect it to get magically converted. Open it with the proper character set, then convert.
The best way to be sure would be to convert a large set of documents, then convert them back and do a comparison.
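A minimal sketch of that round-trip check (the file name and the assumed encoding here are made up):

with open("doc.txt", "rb") as f:
    original_bytes = f.read()

# a UnicodeDecodeError here is a sure sign the assumed encoding was wrong
# (though a successful decode does not prove it was right)
text = original_bytes.decode("windows-1252")
utf8_bytes = text.encode("utf-8")

# converting back should reproduce the original bytes exactly
assert utf8_bytes.decode("utf-8").encode("windows-1252") == original_bytes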

"Broken" unicode strings encoded in UTF-8?

I have been studying unicode and its Python implementation now for two days, and I think I'm getting a glimpse of what it is about. Just to get confident, I'm asking if my assumptions for my current problems are correct.
In Django, forms give me unicode strings which I suspect to be "broken". Unicode strings in Python should be encoded in UTF-8, is that right? After entering the string "fähre" into a text field, the browser sends the string "f%c3%a4hre" in the POST request (checked via Wireshark). When I retrieve the value via form.cleaned_data, I'm getting the string u'f\xe4hre' (note it is a unicode string), though. As far as I understand it, that is an ISO-8859-1-encoded unicode string, which is incorrect. The correct string should be u'f\xc3\xa4hre', which would be a UTF-8-encoded unicode string. Is that a Django bug or is there something wrong with my understanding of it?
To fix the issue, I wrote a function to apply it to any text input from Django forms:
def fix_broken_unicode(s):
    return unicode(s.encode(u'utf-8'), u'iso-8859-1')
which does
>>> fix_broken_unicode(u'f\xe4hre')
u'f\xc3\xa4hre'
That doesn't seem very elegant to me, but setting Django's settings.DEFAULT_CHARSET to 'utf-8' didn't help, nor did anything else. I am trying to work with unicode throughout the whole application so I won't get any weird errors later on, but it obviously does not suffice to mark all strings with u'...'.
Edit: Considering the answers from Dirk and sth, I will now save the strings to the database as they are. The real problem was that I was trying to urlencode these kinds of strings to use them as input for the Twitter API etc. In GET or POST requests, though, UTF-8 encoding is obviously expected which the standard urllib.urlencode() function does not process correctly (throws exceptions). Take a look at my solution in the pastebin and feel free to comment on it also.
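For reference, one common workaround (a minimal Python 2 sketch; the parameter name here is made up) is to encode each value to a UTF-8 byte string before handing the dict to urllib.urlencode:

import urllib

params = {'status': u'f\xe4hre'}                                   # hypothetical API parameter
encoded = dict((k, v.encode('utf-8')) for k, v in params.items())
query = urllib.urlencode(encoded)                                  # 'status=f%C3%A4hre'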
u'f\xe4hre' is a unicode string, not encoded as anything. The unicode codepoint 0xe4 is the character ä. It is not really important that ä would also be encoded as byte 0xe4 in ISO-8859-1.
The unicode string can contain any unicode characters without encoding them in some way. For example 轮渡 would be represented as u'\u8f6e\u6e21', which are simply two unicode codepoints. The UTF-8 encoding would be the much longer '\xe8\xbd\xae\xe6\xb8\xa1'.
So there is no need to fix the encoding, you are just seeing the internal representation of the unicode string.
Not exactly: after having been decoded, the unicode string is unicode, which means it may contain characters with codes beyond 255. How the interpreter represents these depends on the platform, but usually nowadays it uses character elements with a width of at least 16 bits. ISO-8859-1 is a proper subset of unicode. Thus, the string u'f\xe4hre' is actually proper -- the \xe4 escape is just a rendering artifact, since Python doesn't know if (and when) it is safe to include characters with codes beyond a certain range on the console.
UTF-8 is a transport encoding, that is, a special way to write unicode data such that it can be stored in "channels" with an element width of 8 bits per character/byte. In order to compute the proper "external" (or transport) encoding of a unicode string, you'd use the encode method, passing the desired representation. It returns a properly encoded byte string (as opposed to a unicode character string).
The reverse transformation is decode which takes a byte string and an encoding name and yields a unicode character string.
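To make that concrete, a minimal Python 2 sketch of the encode/decode round trip described above:

u = u'f\xe4hre'                  # the unicode character string from the question

b = u.encode('utf-8')            # transport form: the byte string 'f\xc3\xa4hre'
roundtrip = b.decode('utf-8')    # back to the unicode character string

assert roundtrip == u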
