Encoding and decoding German text in Python

I am dealing with German text that I'd like to encode and decode to get rid of some characters. For instance, say I have
text = 'fÃ¼hrt - mÃƒÂ¶glich'
I would like to obtain:
corrected_text = 'führt - möglich'
If I encode text once using cp1252 and decode the result with utf8, I get:
text.encode('cp1252').decode('utf8')
# 'führt - mÃ¶glich'
The first word is OK, but there remain some characters to replace in the second word. I can encode/decode a second time to get
text.encode('cp1252').decode('utf8').encode('cp1252').decode('utf8', 'ignore')
# 'fhrt - möglich'
This is now OK for the second word, but the first has a missing ü.
I could hand-code replacements using this debugging table, together with str.replace(), to solve the above issue. However, I would like to know: given text, is there a way of using encode and decode to get corrected_text?
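One way to get there with nothing but encode and decode (a sketch, assuming every word is plain cp1252/UTF-8 mojibake nested a whole number of times; fix_mojibake is a made-up helper) is to repair each word independently, undoing one layer at a time until the round trip stops succeeding:
def fix_mojibake(word):
    # Undo one layer of cp1252/UTF-8 mojibake at a time; stop when the
    # round trip fails or stops changing anything.
    while True:
        try:
            fixed = word.encode('cp1252').decode('utf8')
        except (UnicodeEncodeError, UnicodeDecodeError):
            return word
        if fixed == word:
            return word
        word = fixed

text = 'fÃ¼hrt - mÃƒÂ¶glich'
corrected_text = ' '.join(fix_mojibake(w) for w in text.split(' '))
print(corrected_text)  # 'führt - möglich'
The per-word split matters because the two words are mojibake'd to different depths; run on the whole string at once, the second pass would choke on the already-repaired ü. The ftfy package (ftfy.fix_text) automates exactly this kind of repair.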

Related

'ascii' codec can't encode character u'\u2602' in position 438: ordinal not in range(128)

I am running into a problem where, when I try to decode a string, I get one error, and when I try to encode it, I get another; the errors are below. Is there a permanent solution for this?
P.S. Please note that you may not be able to reproduce the encoding error with the string I provided, as I couldn't copy/paste some of the offending characters.
text = "sometext"
string = '\n'.join(list(set(text)))
try:
    print "decode"
    text = string.decode('UTF-8')
except Exception as e:
    print e
    text = string.encode('UTF-8')
Errors:
Error while using string.decode('UTF-8'):
'ascii' codec can't encode character u'\u2602' in position 438: ordinal not in range(128)
Error while using string.encode('UTF-8'):
Exception All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters
The First Error
The code you have provided will work, as the text is a bytestring (you are using Python 2). But what you're trying to do is, in effect, convert from a UTF-8 string to an ASCII one, which is possible only if the string contains nothing but characters that have an ASCII equivalent (you can see the list of ASCII characters here). In your case, it's encountering a unicode character (specifically ☂) which has no ASCII equivalent. You can get around this behaviour by using:
string.decode('UTF-8', 'ignore')
This will just ignore (i.e. replace with nothing) the characters that cannot be represented in ASCII.
The Second Error
This error is more interesting. It appears the text you are trying to encode into UTF-8 contains either NULL bytes or specific control characters, which are not XML compatible (the error message comes from an XML-compatibility check rather than from the UTF-8 codec itself, which can encode those characters without complaint). Again, the code you have provided works on its own, but something in the text you are trying to encode violates that restriction. You can try the same trick as above:
string.encode('UTF-8', 'ignore')
This will simply remove the offending characters; alternatively, you can look into what it is in your specific text input that is causing the problem.
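A concrete Python 2 sketch of the 'ignore' handler in both directions (the sample string here is invented for illustration):
u = u'rain \u2602 stops'            # unicode string with a non-ASCII character
print u.encode('ascii', 'ignore')   # 'rain  stops' - the umbrella is dropped
b = u.encode('UTF-8')               # a UTF-8 byte string
print b.decode('ascii', 'ignore')   # u'rain  stops' - non-ASCII bytes dropped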

How can I encode/decode \xbe in python?

I have an Excel file I am reading in Python using the xlrd module. I am extracting the values from each row, adding some additional data, and writing it all out to a new text file. However, I am running into an issue with cells that contain text with the fraction 3/4. Python reads the value as \xbe, and each time I encounter it, I get this error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xbe' in position 317: ordinal not in range(128)
I am converting my list of values from each row into a string, and have tried the following without success:
row_vals_str = [unicode(str(val), 'utf-8') for val in row_vals]
row_vals_str = [str(val).encode('utf-8') for val in row_vals]
row_vals_str = [str(val).decode() for val in row_vals]
Each time I hit the first occurrence of the 3/4 fraction I get the same error.
How can I convert this to something that can be written to text?
I came across this thread but didn't find an answer: How to convert \xXY encoded characters to UTF-8 in Python?
The byte is in the Latin-1 group. You can use latin1 to decode the character, or replace it with a different one if you do not need it.
http://www.codetable.net/hex/be
>>> '\xbe'.decode('latin1')
u'\xbe'
>>> '\xbe'.decode('cp1252')
u'\xbe'
>>> '\xbe this is a test'.replace('\xbe','3/4')
'3/4 this is a test'
What actually ended up working was to decode the string, replace the character while it is still unicode, then encode (the original replace used r'\xbe', a four-character literal that never matched anything):
row_vals_str = [str(val).decode('latin1').replace(u'\xbe', u'3/4').encode('utf8') for val in row_vals]
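A more general alternative (a sketch, not from the thread) is to let Unicode normalization expand vulgar-fraction characters instead of hand-replacing each one; NFKC decomposes u'\xbe' into '3', U+2044 FRACTION SLASH, '4':
import unicodedata

u = u'\xbe cup of sugar'                 # u'¾ cup of sugar'
text = unicodedata.normalize('NFKC', u)  # u'3\u20444 cup of sugar'
text = text.replace(u'\u2044', u'/')     # swap the fraction slash for '/'
print text.encode('ascii', 'ignore')     # '3/4 cup of sugar'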

Removing unusual characters, after decoding and encoding

So I've been looking into this for quite a while, and so far I am taking a string and doing the following:
title = title.decode('windows-1252')
title = title.encode('utf-8','replace')
My string is as follows, although there can be other characters that are not removed.
Bus • Lorry • IT & Construction
Punctuation is removed with:
title = title.translate(None, string.punctuation)
This seems to become (after punctuation removal):
Bus • Lorry • IT Construction
Now I get an issue when I split the string and try to join it back together. I split it into:
['Bus', '\xc3\xa2\xe2\x82\xac\xc2\xa2', 'Lorry', '\xc3\xa2\xe2\x82\xac\xc2\xa2', 'IT', 'Construction']
By:
words = text.split(' ')
Then I try to rejoin once I've done some stemming per word:
text = ' '.join([stemmer.stem(word) for word in words])
And then, at this point I get an issue:
'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
But I'm confused, because from reading various sites I understood that I need to encode and decode, which I think I've already done correctly...
You need to decode the data right after input, use it as unicode, and encode it only for output. A UnicodeDecodeError is raised when something tries to turn an encoded byte string into a unicode object without knowing the original encoding.
In your case, I'd do the splitting and run the stemmer before encoding to UTF-8; encoding should only be needed for output or (possibly) storage.
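A sketch of that shape, assuming the raw input is UTF-8 bytes (the question's input may really be windows-1252, in which case swap the codec) and using NLTK's PorterStemmer as a stand-in for whatever stemmer is in play:
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
raw = 'Bus \xe2\x80\xa2 Lorry \xe2\x80\xa2 IT Construction'  # bytes from input
title = raw.decode('utf-8')            # 1. decode once, at the input boundary
words = title.split(u' ')              # 2. everything in between is unicode
stemmed = u' '.join(stemmer.stem(w) for w in words)
output = stemmed.encode('utf-8')       # 3. encode once, for output/storage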

coding (unicode) hex to octal

I read a text file which has some characters like '\260' (meaning '°'), and then I add the data to a DB (sqlite3).
After that, I try to get the information back from the DB, but the SQL query gets built with '\xb0' (which also means '°'), because I get that value from an XML file.
I try to replace the hex characters with octal characters: text = text.replace(r'\xb0', '\260'), but it doesn't work. Why? I cannot build a correct SQL query.
Maybe there are some solutions for this problem e.g. encode, decode etc.
\260 is the same thing as \xb0:
>>> '\xb0'
'\xb0'
>>> '\260'
'\xb0'
You probably want to decode your input to unicode and insert that instead. If your data is encoded as Latin-1, decode it:
>>> print '\xb0'.decode('latin1')
°
sqlite3 can handle unicode just fine, and by decoding you make sure you are handling text values, not byte values, which can differ from codec to codec.
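A minimal sketch of that with sqlite3 (Python 2; the table and values are invented): decode the byte string once, then pass the unicode value as a query parameter on both insert and select:
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE readings (label TEXT)')
label = '\xb0C'.decode('latin1')             # '\xb0C' -> u'\xb0C', i.e. u'°C'
conn.execute('INSERT INTO readings VALUES (?)', (label,))
row = conn.execute('SELECT label FROM readings WHERE label = ?',
                   (u'\xb0C',)).fetchone()
print repr(row[0])                           # u'\xb0C'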

Python, Encoding output to UTF-8

I have a function that builds a string composed of UTF-8 encoded characters. The output files are opened using the 'w+' and "utf-8" arguments.
However, when I try x.write(string) I get:
UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in position 1: ordinal not in range(128)
I assume this is because normally you would write, for example, print(u'something'), but I need to use a variable, and the quotation marks in the u'...' notation preclude that...
Any suggestions?
EDIT: Actual code here:
source = codecs.open("actionbreak/" + target + '.csv','r', "utf-8")
outTarget = codecs.open("actionbreak/" + newTarget, 'w+', "utf-8")
x = str(actionT(splitList[0], splitList[1]))
outTarget.write(x)
Essentially, all this is supposed to do is build a large number of strings that look similar to this:
[日木曜 Deliverables]= CASE WHEN things = 11
THEN C ELSE 0 END
Are you using codecs.open()? Python 2.7's built-in open() does not support a specific encoding, meaning you have to manually encode non-ascii strings (as others have noted), but codecs.open() does support that and would probably be easier to drop in than manually encoding all the strings.
As you are actually using codecs.open() (going by your added code), and after a bit of looking things up myself, I suggest opening the input and/or output file with the encoding "utf-8-sig", which automatically handles the BOM for UTF-8 (see http://docs.python.org/2/library/codecs.html#encodings-and-unicode, near the bottom of the section). I would expect that to matter only for the input file. But if none of those combinations (utf-8-sig/utf-8, utf-8/utf-8-sig, utf-8-sig/utf-8-sig) work, then the most likely situation is that your input file is encoded in a different Unicode format with a BOM, as Python's default UTF-8 codec interprets BOMs as regular characters, so the input would not raise an issue but the output could.
Just noticed this, but... when you use codecs.open(), it expects a Unicode string, not an encoded one; try x = unicode(actionT(splitList[0], splitList[1])).
Your error can also occur when attempting to decode a unicode string (see http://wiki.python.org/moin/UnicodeEncodeError), but I don't think that should be happening unless actionT() or your list-splitting does something to the Unicode strings that causes them to be treated as non-Unicode strings.
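Putting those pieces together, a sketch (the filenames are invented) of letting codecs.open() do all the encoding work, with "utf-8-sig" on the input side to absorb any BOM:
import codecs

source = codecs.open('actionbreak/input.csv', 'r', 'utf-8-sig')   # strips a BOM if present
outTarget = codecs.open('actionbreak/output.csv', 'w+', 'utf-8')
for line in source:          # each line arrives as a unicode object
    outTarget.write(line)    # codecs encodes it to UTF-8 on the way out
source.close()
outTarget.close()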
In Python 2.x there are two types of string: byte strings and unicode strings. The first contains bytes, the second unicode code points. It is easy to determine which type a string is, because the repr of a unicode string starts with u:
# byte string
>>> 'abc'
'abc'
# unicode string:
>>> u'abc абв'
u'abc \u0430\u0431\u0432'
The 'abc' characters are the same because they are in the ASCII range. \u0430 is a unicode code point; it is outside the ASCII range. A "code point" is Python's internal representation of a unicode character; code points can't be saved to a file, so they need to be encoded to bytes first. Here is what an encoded unicode string looks like (once encoded, it becomes a byte string):
>>> s = u'abc абв'
>>> s.encode('utf8')
'abc \xd0\xb0\xd0\xb1\xd0\xb2'
This encoded string can now be written to a file:
>>> s = u'abc абв'
>>> with open('text.txt', 'w+') as f:
...     f.write(s.encode('utf8'))
Now it is important to remember which encoding we used when writing to the file, because to read the data back we need to decode it. Here is what the data looks like without decoding:
>>> with open('text.txt', 'r') as f:
...     content = f.read()
>>> content
'abc \xd0\xb0\xd0\xb1\xd0\xb2'
You see, we've got encoded bytes, exactly the same as from s.encode('utf8'). To decode, we need to provide the codec name:
>>> content.decode('utf8')
u'abc \u0430\u0431\u0432'
After decoding, we've got our unicode string with unicode code points back.
>>> print content.decode('utf8')
abc абв
xgord is right, but for further edification it's worth noting exactly what \ufeff means. It's known as a BOM, or byte order mark, and it's a holdover from the early days of unicode, when people couldn't agree on a byte order. A unicode document may be prefaced with U+FEFF; if it is read with the bytes in the wrong order, it shows up as U+FFFE, which is how a reader can tell the byte order is swapped.
If you hit an error on that character at the very start of a file, you can be sure the issue is that you are not decoding it with a BOM-aware codec such as utf-8-sig, and the file itself is probably fine.
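A quick sketch of the difference in practice: the same UTF-8 bytes decoded with and without a BOM-aware codec:
data = '\xef\xbb\xbfhello'              # UTF-8 bytes with a leading BOM

print repr(data.decode('utf-8'))        # u'\ufeffhello' - BOM kept as a character
print repr(data.decode('utf-8-sig'))    # u'hello' - BOM stripped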
