I am trying to print Chinese text to a file. When I print it to the terminal, it looks correct to me. But when I use print >> filename ... I get this error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 24: ordinal not in range(128)
I don't know what else I need to do; I already encoded all textual data to UTF-8 and used string formatting.
This is my code:
# -*- coding: utf-8 -*-
import string
import codecs

exclude = string.punctuation

def get_documents_cleaned(topics, filename):
    c = codecs.open(filename, "w", "utf-8")
    for top in topics:
        print >> c, "num", ",", "text", ",", "id", ",", top
        document_results = proj.get('docs', text=top)['results']
        for doc in document_results:
            print "{0}, {1}, {2}, {3}".format(doc[1], (doc[0]['text']).encode('utf-8').translate(None, exclude), doc[0]['document']['id'], top.encode('utf-8'))
            print >> c, "{0}, {1}, {2}, {3}".format(doc[1], (doc[0]['text']).encode('utf-8').translate(None, exclude), doc[0]['document']['id'], top.encode('utf-8'))

get_documents_cleaned(my_topics, "export_c.csv")
print doc[0]['text'] looks like this:
u' \u3001 \u6010...'
Since your first print statement works, it's clear that it's not the format call raising the UnicodeDecodeError.
Instead, it's a problem with the file writer: c expects a unicode object but only gets a UTF-8 encoded str object (let's name it s). So c tries to call s.decode() implicitly, which results in the UnicodeDecodeError.
You could fix your problem by simply calling s.decode('utf-8') before printing, or by using the plain built-in open(filename, "w") instead of codecs.open.
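The same fix can be sketched in Python 3 terms, where the str/bytes split makes the failure explicit; here io.StringIO merely stands in for the codecs.open writer (the names and sample string are illustrative only):

```python
import io

s = u"maçã".encode("utf-8")  # a UTF-8 encoded byte string, like the .format() result
buf = io.StringIO()          # a text-only sink, playing the role of the codecs writer

# The text stream accepts only unicode text, so decode the bytes first;
# writing s directly is what triggers the implicit-conversion error.
buf.write(s.decode("utf-8"))
print(buf.getvalue())
```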
I tried to find a solution, but I am still stuck.
I want to use PyVisa to control a function generator.
I have a waveform which is a list of values between 0 and 16382
Then I have to prepare it so that each waveform point occupies 2 bytes.
A value is represented in big-endian, MSB-first format, and is straight binary. So I do binwaveform = pack('>'+'h'*len(waveform), *waveform)
And then when I try to write it to the instrument with AFG.write('trace ememory, '+ header + binwaveform) I get an error:
File ".\afg3000.py", line 97, in <module>
AFG.write('trace ememory, '+ header + binwaveform)
TypeError: Can't convert 'bytes' object to str implicitly
I tried to solve it with AFG.write('trace ememory, ' + header + binwaveform.decode()), but .decode() defaults to UTF-8, which is not valid for some of the byte values: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 52787: invalid start byte
Could you please help with it?
binwaveform is a packed byte string of integers. E.g.:
struct.pack('<h', 4545)
b'\xc1\x11'
You can't print it as text, because it makes no sense to your terminal. In the above example, 0xC1 is valid in neither ASCII nor UTF-8.
When you append a byte string to a regular str ('trace ememory, ' + header + binwaveform), Python wants to convert it to readable text but doesn't know how.
Decoding it implies that it's text; it's not.
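On Python 3 (which the TypeError indicates) the mixing is rejected outright; a minimal sketch, with a made-up header value:

```python
header = "#14"             # hypothetical binary-block header, just for illustration
binwaveform = b'\xc1\x11'  # two packed bytes, as in the example above

try:
    cmd = 'trace ememory, ' + header + binwaveform  # str + bytes is not allowed
except TypeError as e:
    error_message = str(e)
    print(error_message)   # Python 3 refuses to mix str and bytes
```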
The best thing to do is print the hex representation of it:
import codecs

binwaveform_hex = codecs.encode(binwaveform, 'hex')
# decode the ASCII hex digits to get a plain str (str() would keep the b'' prefix)
binwaveform_hex_str = binwaveform_hex.decode('ascii')
AFG.write('trace ememory, ' + header + binwaveform_hex_str)
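Putting it together, a runnable sketch with invented waveform values (the values and lengths are made up for illustration):

```python
import codecs
import struct

waveform = [4545, 16000, 123]  # sample points, invented for illustration
# '>h' = big-endian signed 16-bit, matching the pack() call in the question
binwaveform = struct.pack('>' + 'h' * len(waveform), *waveform)
# hex-encode the bytes, then decode the ASCII hex digits into a plain str
binwaveform_hex_str = codecs.encode(binwaveform, 'hex').decode('ascii')
print(binwaveform_hex_str)  # 11c13e80007b
```

If your PyVisa version supports it, write_binary_values may also be worth a look, since it is designed to send binary blocks without assembling the string by hand.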
For example, if I have an encoded string as:
url='locality=Norwood&address=138+The+Parade&region=SA&country=AU&name=Pav%C3%A9+cafe&postalCode=5067'
The name parameter has the characters %C3%A9 which actually implies the character é.
Hence, I would like the output to be:
new_url='locality=Norwood&address=138+The+Parade&region=SA&country=AU&name=Pavé+cafe&postalCode=5067'
I tried the following steps on a Python terminal:
>>> import urllib2
>>> url='locality=Norwood&address=138+The+Parade&region=SA&country=AU&name=Pav%C3%A9+cafe&postalCode=5067'
>>> new_url=urllib2.unquote(url).decode('utf8')
>>> print new_url
locality=Norwood&address=138+The+Parade&region=SA&country=AU&name=Pavé+cafe&postalCode=5067
>>>
However, when I tried the same thing within a Python script (run as myscript.py), I got the following stack trace:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 88: ordinal not in range(128)
I am using Python 2.6.6 and cannot switch to other versions due to work reasons.
How can I overcome this error?
Any help is greatly appreciated. Thanks in advance!
EDIT
I realized that I am getting the expected output above.
However, I would like to convert the parameters in new_url into a dictionary, as follows. While doing so, I am not able to retain the special character 'é' in my name parameter.
print new_url
params_list = new_url.split("&")
print(params_list)
params_dict = {}
for p in params_list:
    temp = p.split("=")
    params_dict[temp[0]] = temp[1]
print(params_dict)
Outputs:
new_url
locality=Norwood&address=138+The+Parade&region=SA&country=AU&name=Pavé+cafe&postalCode=5067
params_list
[u'locality=Norwood', u'address=138+The+Parade', u'region=SA', u'country=AU', u'name=Pav\xe9+cafe', u'postalCode=5067']
params_dict
{u'name': u'Pav\xe9+cafe', u'locality': u'Norwood', u'country': u'AU', u'region': u'SA', u'address': u'138+The+Parade', u'postalCode': u'5067'}
Basically ... the name is now 'Pav\xe9+cafe' as opposed to the required 'Pavé'.
How can I still retain the same special character in my params_dict?
This is actually due to the difference between __repr__ and __str__. When printing a unicode string, __str__ is used and results in the é you see when printing new_url. However, when a list or dict is printed, __repr__ is used, which uses __repr__ for each object within lists and dicts. If you print the items separately, they print as you desire.
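The distinction is easy to see in Python 3, where print shows the character directly and the built-in ascii() reproduces the escaped, repr-style output Python 2 used inside containers (a small sketch):

```python
s = 'Pav\xe9+cafe'   # what Python 2 wrote as u'Pav\xe9+cafe'
print(s)             # Pavé+cafe  (str-style, what print uses)
print(ascii(s))      # 'Pav\xe9+cafe'  (repr-style escaping, as seen in lists/dicts)
```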
# -*- coding: utf-8 -*-
new_url = u'name=Pavé+cafe&postalCode=5067'
print(new_url)  # name=Pavé+cafe&postalCode=5067

params_list = [s for s in new_url.split("&")]
print(params_list)     # [u'name=Pav\xe9+cafe', u'postalCode=5067']
print(params_list[0])  # name=Pavé+cafe
print(params_list[1])  # postalCode=5067

params_dict = {}
for p in params_list:
    temp = p.split("=")
    params_dict[temp[0]] = temp[1]
print(params_dict)             # {u'postalCode': u'5067', u'name': u'Pav\xe9+cafe'}
print(params_dict.values()[0]) # 5067
print(params_dict.values()[1]) # Pavé+cafe
One way to print the list and dict is to get their string representation, then decode it with unicode-escape:
print(str(params_list).decode('unicode-escape')) # [u'name=Pavé+cafe', u'postalCode=5067']
print(str(params_dict).decode('unicode-escape')) # {u'postalCode': u'5067', u'name': u'Pavé+cafe'}
Note: This is only an issue in Python 2. Python 3 prints the characters as you would expect. Also, you may want to look into urlparse for parsing your URL instead of doing it manually.
import urlparse
new_url = u'name=Pavé+cafe&postalCode=5067'
print dict(urlparse.parse_qsl(new_url)) # {u'postalCode': u'5067', u'name': u'Pav\xe9 cafe'}
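For reference, the same parse on Python 3 (where urlparse moved to urllib.parse) both splits the query string and percent-decodes it, with '+' becoming a space; a short sketch using part of the question's URL:

```python
from urllib.parse import parse_qsl

new_url = 'locality=Norwood&name=Pav%C3%A9+cafe&postalCode=5067'
params = dict(parse_qsl(new_url))  # splits on & and =, and unquotes the values
print(params['name'])  # Pavé cafe
```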
I have a file which is mostly UTF-8, but some Windows-1252 characters have also found their way in.
I created a table to map from the Windows-1252 (cp1252) characters to their Unicode counterparts, and would like to use it to fix the mis-encoded characters, e.g.
cp1252_to_unicode = {
    "\x85": u'\u2026',  # …
    "\x91": u'\u2018',  # ‘
    "\x92": u'\u2019',  # ’
    "\x93": u'\u201c',  # “
    "\x94": u'\u201d',  # ”
    "\x97": u'\u2014',  # —
}

for l in open('file.txt'):
    for c, u in cp1252_to_unicode.items():
        l = l.replace(c, u)
But attempting to do the replace this way results in a UnicodeDecodeError being raised, e.g.:
"\x85".replace("\x85", u'\u2026')
UnicodeDecodeError: 'ascii' codec can't decode byte 0x85 in position 0: ordinal not in range(128)
Any ideas for how to deal with this?
If you try to decode this string as UTF-8, as you already know, you will get a UnicodeDecodeError, because these spurious cp1252 characters are invalid UTF-8.
However, Python codecs allow you to register a callback to handle encoding/decoding errors with the codecs.register_error function. It gets the UnicodeDecodeError as a parameter, so you can write a handler that attempts to decode the data as cp1252 and then continues decoding the rest of the string as UTF-8.
In my utf-8 terminal, I can build a mixed incorrect string like this:
>>> a = u"maçã ".encode("utf-8") + u"maçã ".encode("cp1252")
>>> print a
maçã ma��
>>> a.decode("utf-8")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.6/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 9-11: invalid data
I wrote the said callback function and found a catch: even if you increment the position from which to decode the string by 1, so that it starts on the next character, the error is raised again at the first out-of-range(128) character if that character is also not valid UTF-8. That means the decoding "walks back" when consecutive non-ASCII, non-UTF-8 chars are found.
The workaround is to keep a state variable in the error handler which detects this "walking back" and resumes decoding from the last call to it. In this short example, I implemented it as a global variable (it has to be manually reset to -1 before each call to the decoder):
import codecs

last_position = -1

def mixed_decoder(unicode_error):
    global last_position
    string = unicode_error[1]
    position = unicode_error.start
    if position <= last_position:
        position = last_position + 1
    last_position = position
    new_char = string[position].decode("cp1252")
    # new_char = u"_"
    return new_char, position + 1

codecs.register_error("mixed", mixed_decoder)
And on the console:
>>> a = u"maçã ".encode("utf-8") + u"maçã ".encode("cp1252")
>>> last_position = -1
>>> print a.decode("utf-8", "mixed")
maçã maçã
With thanks to jsbueno, plus a whack of other Google searches and other pounding, I solved it this way.
#The following works very well but it does not allow for any attempts to FIX the data.
xmlText = unicode(xmlText, errors='replace').replace(u"\uFFFD", "?")
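For comparison, a Python 3 analogue of that errors='replace' one-liner, using made-up sample bytes:

```python
data = b'ma\xe7\xe3'  # cp1252 bytes where UTF-8 was expected (invented sample)
# decode with replacement, then swap the U+FFFD marker for a plain '?'
text = data.decode('utf-8', errors='replace').replace(u'\uFFFD', '?')
print(text)  # ma??
```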
This version allows for a limited opportunity to repair invalid characters. Unknown characters are replaced with a safe value.
import codecs

replacement = {
    '85': '...',  # u'\u2026' ellipsis
    '96': '-',    # u'\u2013' en-dash
    '97': '-',    # u'\u2014' em-dash
    '91': "'",    # u'\u2018' left single quote
    '92': "'",    # u'\u2019' right single quote
    '93': '"',    # u'\u201C' left double quote
    '94': '"',    # u'\u201D' right double quote
    '95': "*",    # u'\u2022' bullet
}

# This is more complex but allows for the data to be fixed.
def mixed_decoder(unicodeError):
    errStr = unicodeError[1]
    errLen = unicodeError.end - unicodeError.start
    nextPosition = unicodeError.start + errLen
    errHex = errStr[unicodeError.start:unicodeError.end].encode('hex')
    if errHex in replacement:
        return u'%s' % replacement[errHex], nextPosition
    return u'%s' % errHex, nextPosition  # Comment this line out to get a question mark
    return u'?', nextPosition

codecs.register_error("mixed", mixed_decoder)
xmlText = xmlText.decode("utf-8", "mixed")
Basically I attempt to decode it as UTF-8. For any characters that fail, I just convert them to hex so I can display them or look them up in a table of my own.
This is not pretty, but it does allow me to make sense of messed-up data.
jsbueno's solution is a good one, but there is no need for the global variable last_position; see:
import codecs

def mixed_decoder(error: UnicodeError) -> (str, int):
    bs: bytes = error.object[error.start: error.end]
    return bs.decode("cp1252"), error.start + 1

codecs.register_error("mixed", mixed_decoder)

a = "maçã".encode("utf-8") + "maçã".encode("cp1252")
# a = b"ma\xc3\xa7\xc3\xa3ma\xe7\xe3"
s = a.decode("utf-8", "mixed")
# s = "maçãmaçã"
This is usually called Mojibake.
There's a nice Python library that might solve these issues for you called ftfy.
Example:
>>> from ftfy import fix_text
>>> fix_text("Ð¨ÐµÐ¿Ð¾Ñ (напоминалки)")
'Шепот (напоминалки)'
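What ftfy automates here is, roughly, detecting and undoing a wrong decode; the round trip can be sketched with the stdlib alone, assuming the text was UTF-8 bytes mistakenly read as latin-1:

```python
good = 'Шепот'
mangled = good.encode('utf-8').decode('latin-1')   # produces Ð¨-style mojibake
fixed = mangled.encode('latin-1').decode('utf-8')  # undo the wrong decode
print(fixed)  # Шепот
```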
I just ran into this today, so here is my problem and my own solution:
original_string = 'Notifica\xe7\xe3o de Emiss\xe3o de Nota Fiscal Eletr\xf4nica.'

def mixed_decoding(s):
    output = ''
    ii = 0
    for c in s:
        if ii <= len(s) - 1:
            if s[ii] == '\\' and s[ii+1] == 'x':
                b = s[ii:ii+4].encode('ascii').decode('unicode-escape')
                output = output + b
                ii += 3
            else:
                output = output + s[ii]
        ii += 1
    print(output)
    return output

decoded_string = mixed_decoding(original_string)
Now it prints:
>>> Notificação de Emissão de Nota Fiscal Eletrônica.
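If the escapes really are literal backslash text, as this answer assumes, the unicode-escape codec does the whole conversion in one call (a sketch; note that unicode-escape maps \xNN escapes to Latin-1 code points, which is what these Portuguese characters need):

```python
raw = 'Notifica\\xe7\\xe3o de Emiss\\xe3o de Nota Fiscal Eletr\\xf4nica.'
repaired = raw.encode('ascii').decode('unicode-escape')
print(repaired)  # Notificação de Emissão de Nota Fiscal Eletrônica.
```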
I have this text, which is entered in a text box:
‘f’fdsfs’`124539763~!##$%^’’;’””::’
I am converting it to JSON, and then it comes out as:
"\\u2018f\\u2019fdsfs\\u2019`124539763~!##$%^\\u2019\\u2019;\\u2019\\u201d\\u201d::\\u2019e"
Now when I write the CSV file I get this error:
'ascii' codec can't encode character u'\u2018' in position 0: ordinal not in range(128)
csv.writer(data)
I tried data.encode('utf-8') and data.decode('unicode-escape'), but neither worked.
The csv module does not support Unicode; use https://github.com/jdunck/python-unicodecsv instead,
although I'm not sure \u2018 is part of the utf-8 charset.
x = "\\u2018f\\u2019fdsfs..."
j = json.loads('"' + x + '"')
print j.encode('cp1252')
which prints:
‘f’fdsfs...
Note that it is being encoded as cp1252.
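As an aside, \u2018 is representable in UTF-8 (it just isn't ASCII); a quick check of how it encodes in both charsets:

```python
ch = '\u2018'                       # the left single quote from the question
utf8_bytes = ch.encode('utf-8')     # three bytes in UTF-8
cp1252_bytes = ch.encode('cp1252')  # a single byte in cp1252
print(utf8_bytes)    # b'\xe2\x80\x98'
print(cp1252_bytes)  # b'\x91'
```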
>>> import unicodecsv as csv #https://github.com/jdunck/python-unicodecsv
>>> x = "\\u2018f\\u2019fdsfs..."; j = json.loads('"' + x + '"');
>>> with open("some_file.csv","wb") as f:
... w = csv.writer(f,encoding="cp1252")
... w.writerow([j,"normal"])
...
>>>
here is the csv file : https://www.dropbox.com/s/m4gta1o9vg8tfap/some_file.csv
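On Python 3 the stdlib csv module writes Unicode natively, so the unicodecsv backport is only needed on Python 2; a sketch of the same write using an in-memory buffer instead of a file:

```python
import csv
import io
import json

x = "\\u2018f\\u2019fdsfs"
j = json.loads('"' + x + '"')  # back to the real characters: ‘f’fdsfs

buf = io.StringIO()
w = csv.writer(buf)
w.writerow([j, "normal"])
print(buf.getvalue())  # ‘f’fdsfs,normal
```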
I need to replace special characters in filenames. I'm trying to do this at the moment with translate, but it's not working well, and I hope you have an idea how to do this. It's to make a clean playlist; I've got a bad MP3 player in my car which can't handle umlauts or special characters.
My code so far:
# -*- coding: utf-8 -*-
import os
import sys
import id3reader

pfad = os.path.dirname(sys.argv[1]) + "/"
ordner = ""
table = {
    0xe9: u'e',
    0xe4: u'ae',
    ord(u'ö'): u'oe',
    ord(u'ü'): u'ue',
    ord(u'ß'): u'ss',
    0xe1: u'ss',
    0xfc: u'ue',
}

def replace(s):
    return ''.join(c for c in s if (c.isalpha() or c == " " or c == "-"))

fobj_in = open(sys.argv[1])
fobj_out = open(sys.argv[1] + ".new", "w")
for line in fobj_in:
    if (line.rstrip()[0:1] == "#" or line.rstrip()[0:1] == " "):
        print line.rstrip()[0:1]
    else:
        datei = pfad + line.rstrip()
        #print datei
        id3info = id3reader.Reader(datei)
        dateiname = str(id3info.getValue('performer')) + " - " + str(id3info.getValue('title'))
        #print dateiname
        arrPfad = line.split('/')
        dateiname = replace(dateiname[0:60])
        print dateiname
        # dateiname = dateiname.translate(table) + ".mp3"
        ordner = arrPfad[0] + "/" + dateiname
        # os.rename(datei, pfad + ordner)
        fobj_out.write(ordner + "\r\n")
fobj_in.close()
I get this error: UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 37: ordinal not in range(128)
If I try to use translate on the ID3 title, I get TypeError: expected a character buffer object
If I need to get rid of non-ASCII characters, I often use:
>>> unicodedata.normalize("NFKD", u"spëcïälchärs").encode('ascii', 'ignore')
'specialchars'
which tries to convert characters to the ASCII part of their normalized Unicode decomposition.
The bad thing is, it throws away everything it does not know, and it is not smart enough to transliterate umlauts (to ue, ae, etc.).
But it might help you to at least play those MP3s.
Of course, you are free to do your own str.translate first and wrap the result in this, to eliminate every non-ASCII character still left. In fact, if your replace is correct, this will solve your problem. I'd suggest you take a look at str.translate and str.maketrans, though.
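Combining both ideas, a Python 3 sketch: transliterate the umlauts with translate first (using a dict table like the question's, with illustrative entries), then let NFKD plus ASCII-ignore strip whatever remains:

```python
import unicodedata

# transliteration table in str.translate's dict form (illustrative entries)
table = {ord(u'ä'): u'ae', ord(u'ö'): u'oe', ord(u'ü'): u'ue', ord(u'ß'): u'ss'}

def asciify(s):
    s = s.translate(table)  # umlauts first, so they are transliterated, not stripped
    # decompose accented characters, then drop anything still non-ASCII
    return unicodedata.normalize('NFKD', s).encode('ascii', 'ignore').decode('ascii')

print(asciify(u'Motörhead - Früh und spät'))  # Motoerhead - Frueh und spaet
```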