Unicode error in python when printing a list - python

Edit: http://pastebin.com/W4iG3tjS - the file
I have a text file encoded in UTF-8 with some Cyrillic text in it. To load it, I use the following code:
import codecs
fopen = codecs.open('thefile', 'r', encoding='utf8')
fread = fopen.read()
Typing fread at the prompt dumps the file on the screen all unicodish (escape sequences). print fread displays it in readable form (ASCII, I guess).
I then try to split it and write it to an empty file with no encoding:
a = fread.split()
for l in a:
    print >>dasFile, l
But I get the following error message: UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-13: ordinal not in range(128)
Is there a way to dump fread.split() into a file? How can I get rid of this error?

Since you've opened and read the file via codecs.open(), it's been decoded to Unicode. So to output it you need to encode it again, presumably back to UTF-8.
for l in a:
    # split() removed the whitespace, so add the newline back explicitly
    dasFile.write(l.encode('utf-8') + '\n')

print is going to use the default encoding, which is normally "ascii". So you see that error with print. But you can open a file and write directly to it.
a = fopen.readlines() # returns a list of lines already, with line endings intact
# do something with a
dasFile.writelines(a) # doesn't add line endings, expects them to be present already.
This assumes the lines in a are already encoded byte strings.
PS. You should also investigate the io module.
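As a rough sketch of what the io suggestion could look like (written for Python 3, though io.open() also exists in Python 2.6+): io.open() takes an encoding for both reading and writing, so the decode and encode steps happen inside the file objects. The helper name copy_words, the temporary paths, and the sample text are made up for illustration; they are not from the question.

```python
import io
import os
import tempfile


def copy_words(src_path, dst_path, encoding='utf-8'):
    """Read src_path as text, split on whitespace, write one word per line."""
    with io.open(src_path, 'r', encoding=encoding) as src:
        words = src.read().split()
    with io.open(dst_path, 'w', encoding=encoding) as dst:
        for word in words:
            # The file object encodes to UTF-8 on write; no manual encode().
            dst.write(word + u'\n')
    return words


# Demo with a throwaway file standing in for the Cyrillic input.
tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, 'thefile')
dst_path = os.path.join(tmp_dir, 'dasfile')
with io.open(src_path, 'w', encoding='utf-8') as f:
    f.write(u'привет мир\nещё строка\n')

words = copy_words(src_path, dst_path)
with io.open(dst_path, 'r', encoding='utf-8') as f:
    written = f.read()
```

Because both files are opened with an explicit encoding, the default ASCII codec never gets involved.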

Related

Python search through file and un-escape unicode characters [duplicate]

I'm working on an application that uses UTF-8 encoding. For debugging purposes I need to print the text. If I call print() directly on the variable containing my Unicode string, e.g. print(pred_str), I get this error:
UnicodeEncodeError: 'charmap' codec can't encode character '\ufeff' in position 0: character maps to <undefined>
So I tried print(pred_str.encode('utf-8')) and my output looks like this:
b'\xef\xbb\xbfpudgala-dharma-nair\xc4\x81tmyayo\xe1\xb8\xa5 apratipanna-vipratipann\xc4\x81n\xc4\x81m'
b'avipar\xc4\xabta-pudgala-dharma-nair\xc4\x81tmya-pratip\xc4\x81dana-artham'
b'tri\xe1\xb9\x83\xc5\x9bik\xc4\x81-vij\xc3\xb1apti-prakara\xe1\xb9\x87a-\xc4\x81rambha\xe1\xb8\xa5'
b'pudgala-dharma-nair\xc4\x81tmya-pratip\xc4\x81danam punar kle\xc5\x9ba-j\xc3\xb1eya-\xc4\x81vara\xe1\xb9\x87a-prah\xc4\x81\xe1\xb9\x87a-artham'
But, I want my output to look like this:
pudgala-dharma-nairātmyayoḥ apratipanna-vipratipannānām
aviparīta-pudgala-dharma-nairātmya-pratipādana-artham
triṃśikā-vijñapti-prakaraṇa-ārambhaḥ
pudgala-dharma-nairātmya-pratipādanam punar kleśa-jñeya-āvaraṇa-prahāṇa-artham
If I save my string to a file using:
with codecs.open('out.txt', 'w', 'UTF-8') as f:
    f.write(pred_str)
it saves the string as expected.
Your data is encoded with the "UTF-8-SIG" codec, which is sometimes used in Microsoft environments.
This variant of UTF-8 prefixes encoded text with a byte order mark '\xef\xbb\xbf', to make it easier for applications to detect UTF-8 encoded text vs other encodings.
You can decode such bytestrings like this:
>>> bs = b'\xef\xbb\xbfpudgala-dharma-nair\xc4\x81tmyayo\xe1\xb8\xa5 apratipanna-vipratipann\xc4\x81n\xc4\x81m'
>>> text = bs.decode('utf-8-sig')
>>> print(text)
pudgala-dharma-nairātmyayoḥ apratipanna-vipratipannānām
To read such data from a file:
with open('myfile.txt', 'r', encoding='utf-8-sig') as f:
    text = f.read()
Note that even after decoding from UTF-8-SIG, you may still be unable to print your data because your console's default code page may not be able to encode other non-ascii characters in the data. In that case you will need to adjust your console settings to support UTF-8.
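If the console settings cannot be changed, one possible workaround for debugging output is to escape whatever the console encoding cannot represent instead of letting print() raise. This is only a sketch: console_safe is a hypothetical helper name, and the cp1252 and ascii encodings below are stand-ins for whatever code page the console actually uses.

```python
import sys


def console_safe(text, encoding=None):
    """Return text with characters the target encoding lacks shown as escapes."""
    target = encoding or getattr(sys.stdout, 'encoding', None) or 'ascii'
    # backslashreplace turns unencodable characters into \uXXXX sequences.
    return text.encode(target, errors='backslashreplace').decode(target)


safe_bom = console_safe('\ufeffpudgala', encoding='cp1252')
safe_macron = console_safe('nair\u0101tmya', encoding='ascii')
print(safe_bom)
print(safe_macron)
```

The output is lossy (escapes instead of the real characters), so it suits debugging rather than display.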
Try this code:
if pred_str.startswith('\ufeff'):
    pred_str = pred_str.split('\ufeff')[1]
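A slightly more defensive variant, sketched under the assumption that only a single leading U+FEFF should go: slicing off the prefix keeps everything after it, whereas split('\ufeff')[1] would drop text following any later U+FEFF in the string. The name strip_bom and the sample strings are illustrative only.

```python
BOM = '\ufeff'


def strip_bom(text):
    """Remove one leading U+FEFF, leaving the rest of the string untouched."""
    return text[len(BOM):] if text.startswith(BOM) else text


clean = strip_bom('\ufeffpudgala-dharma')
unchanged = strip_bom('pudgala-dharma')
inner_kept = strip_bom('\ufeffa\ufeffb')  # the second U+FEFF survives
```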

Replacing non-UTF-8 from a string

Here is the code:
s = 'Waitematā'
w = open('test.txt','w')
w.write(s)
w.close()
I get the following error.
UnicodeEncodeError: 'charmap' codec can't encode character '\u0101' in position 8: character maps to <undefined>
The string will print with the macron a, ā. However, I am not able to write this to a .txt or .csv file.
Am I able to swap out the macron a, ā, for a plain a? Thanks in advance for the help.
Note that if you open a file with open('text.txt', 'w') and write a string to it, you are not writing the string itself to the file but its encoded bytes. Which encoding is used depends on your LANG environment variable or other factors.
To force UTF-8, as the title suggests, you can try this:
w = open('text.txt', 'wb') # note the 'b' for binary mode
w.write(s.encode('utf-8')) # convert str to bytes explicitly
w.close()
As documented in open:
In text mode, if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding.
Not all encodings support all Unicode characters. Since the encoding is platform dependent when not specified, it is better and more portable to be explicit and call out the encoding when reading or writing a text file. UTF-8 supports all Unicode code points:
s = 'Waitematā'
with open('text.txt','w',encoding='utf8') as w:
    w.write(s)
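For the other half of the question (swapping ā for a plain a), one common approach is to decompose each character into its base letter plus combining marks and then drop the marks. The sketch below uses the standard unicodedata module; strip_accents is a hypothetical helper name, and the result is lossy, so writing UTF-8 as shown above is usually preferable when the original spelling matters.

```python
import unicodedata


def strip_accents(text):
    """Drop combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))


plain = strip_accents('Waitematā')
print(plain)
```

Characters that have no decomposition (for example ø or ß) pass through unchanged, so this does not guarantee pure ASCII for arbitrary input.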

what encoding should I use to open a file with a big letter N with tilde character?

I'm trying to open a file containing a capital N with tilde (http://graphemica.com/%C3%91) but I can't seem to figure it out. When I open the file in Notepad++ it shows the character as xD1; when I open the file in gedit it shows \D1. When I open the file in Excel, it shows the character correctly.
When I try to open the file in Python, it halts when it encounters the character. I'm aware that I can specify the encoding so the file can be opened properly, but I'm not sure which encoding I should use. My error is
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd1 in position 0: invalid continuation byte
This is my code:
import codecs
with codecs.open('tsv.txt', 'r', 'utf8') as my_file:
    for line in my_file:
        print(line)
If it is not UTF-8, then what should I use? The site above does not show which encoding 0xd1 is associated with.
You can find in tables how 'Ñ' gets encoded in different encodings.
You can also try it directly with Python:
>>> 'Ñ'.encode('utf8')
b'\xc3\x91'
>>> 'Ñ'.encode('latin1')
b'\xd1'
It seems that your file is encoded in latin-1.
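The same check can be run in the other direction, starting from the byte in the error message and trying a few candidate encodings. This is only a sketch: the candidate list is an assumption, not an exhaustive set, and note that latin-1 decodes every possible byte without error, so a successful decode alone does not prove the guess is right.

```python
raw = b'\xd1'  # the byte reported in the UnicodeDecodeError
candidates = ('utf-8', 'latin-1', 'cp1252')

decoded = {}
for name in candidates:
    try:
        decoded[name] = raw.decode(name)
    except UnicodeDecodeError:
        decoded[name] = None  # this encoding cannot explain the byte

for name in candidates:
    print(name, repr(decoded[name]))
```

Both latin-1 and cp1252 map 0xD1 to 'Ñ', so other non-ASCII characters in the file would be needed to tell those two apart.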

Python 2.7 JSON dump UnicodeEncodeError

I have a file where each line is a json object like so:
{"name": "John", ...}
{...}
I am trying to create a new file with the same objects, but with certain properties removed from all of them.
When I do this, I get a UnicodeEncodeError. Strangely, if I instead loop over range(n) (for some number n) and use infile.next(), it works just as I want it to.
Why so? How do I get this to work by iterating over infile? I tried using dumps() instead of dump(), but that just makes a bunch of empty lines in the outfile.
with open(filename, 'r') as infile:
    with open('_{}'.format(filename), 'w') as outfile:
        for comment in infile:
            decodedComment = json.loads(comment)
            for prop in propsToRemove:
                # use pop to avoid exception handling
                decodedComment.pop(prop, None)
            json.dump(decodedComment, outfile, ensure_ascii = False)
            outfile.write('\n')
Here is the error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\U0001f47d' in position 1: ordinal not in range(128)
Thanks for the help!
The problem you are facing is that the write() method of a file opened with the built-in open() in Python 2 (which json.dump() calls) implicitly encodes unicode strings with the ASCII codec. The error message shows that your data contains the Unicode character U+1F47D (which turns out to be EXTRATERRESTRIAL ALIEN, who knew?), and possibly other non-ASCII characters. To handle these characters, you can either have json escape them as ASCII (they'll show up in your output file as \uXXXX escape sequences), or use a file writer that can handle unicode.
To do the first option, drop ensure_ascii = False so that json.dump() falls back to its default of escaping non-ASCII characters:
json.dump(decodedComment, outfile)
The second option is likely more what you want, and an easy option is to use the codecs module. Import it, and change your second line to:
with codecs.open('_{}'.format(filename), 'w', encoding="utf-8") as outfile:
Then, you'll be able to save the special characters in their original form.
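For comparison, here is a sketch of the same loop for Python 3, where text-mode files take an encoding directly. It is not the asker's code: strip_props, the temporary file names, and the sample record are invented for illustration, and it uses json.dumps() plus an explicit write rather than json.dump().

```python
import io
import json
import os
import tempfile


def strip_props(in_path, out_path, props_to_remove):
    """Copy a JSON-lines file, dropping the given keys from every object."""
    with io.open(in_path, 'r', encoding='utf-8') as infile, \
            io.open(out_path, 'w', encoding='utf-8') as outfile:
        for line in infile:
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            for prop in props_to_remove:
                record.pop(prop, None)  # pop avoids a KeyError
            # ensure_ascii=False keeps non-ASCII characters unescaped.
            outfile.write(json.dumps(record, ensure_ascii=False) + '\n')


# Demo: one record whose body holds U+1F47D as a JSON surrogate-pair escape.
tmp_dir = tempfile.mkdtemp()
in_path = os.path.join(tmp_dir, 'comments.json')
out_path = os.path.join(tmp_dir, '_comments.json')
with io.open(in_path, 'w', encoding='utf-8') as f:
    f.write('{"name": "John", "body": "\\ud83d\\udc7d hi", "score": 3}\n')

strip_props(in_path, out_path, ['score'])
with io.open(out_path, 'r', encoding='utf-8') as f:
    out_text = f.read()
result = [json.loads(line) for line in out_text.splitlines()]
```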

Change string in file in python

I have a problem changing a file in python. I need to change a string.
The file is not a text file but can be edited with a text editor.
Here is my code:
with open(esp,"r") as f:
    content=f.readlines()
with open(esp_carsat,"w") as f:
    for line in content:
        f.write(line.replace("201","202"))
The problem is that content is in bytes, I think.
'\xff\xfe<\x00I\x00n\x00s\x00t\x00a\x00n\x00c\x00e\x00N\x00a\x00m\x00e\x00s\x00>\x00\r\x00\n'
So my replace is not working. I tried to play with encoding but the file is not readable afterwards. Furthermore, I have accents in the file (é,è...)
Is there a way to do what I want?
You have UTF-16 encoded data. Decode to Unicode text, replace, and then encode back to UTF-16 again:
>>> data = '\xff\xfe<\x00I\x00n\x00s\x00t\x00a\x00n\x00c\x00e\x00N\x00a\x00m\x00e\x00s\x00>\x00\r\x00\n\x00'
>>> data.decode('utf16')
u'<InstanceNames>\r\n'
I had to append an extra \x00 to decode that; because the file was read without decoding, Python split the line on the \n and left the trailing \x00 for the next line.
Unicode data can handle accents just fine, no further work required there.
This is easiest done with io.open() to open file objects that do the decoding and encoding for you:
import io
with io.open(esp, "r", encoding='utf16') as f:
    content = f.readlines()
with io.open(esp_carsat, "w", encoding='utf16') as f:
    for line in content:
        f.write(line.replace("201", "202"))
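The decode, replace, encode round trip can also be tried on an in-memory byte string, sketched here for Python 3. The sample line containing 201 is made up for the demo. It also shows one detail worth knowing: the generic 'utf-16' codec consumes the byte order mark, while 'utf-16-le' leaves it in the text as a leading U+FEFF.

```python
sample = '<InstanceNames>201</InstanceNames>\r\n'
# Little-endian UTF-16 with an explicit BOM, like the data in the question.
raw = b'\xff\xfe' + sample.encode('utf-16-le')

with_generic = raw.decode('utf-16')      # BOM read and removed
with_explicit = raw.decode('utf-16-le')  # BOM kept as '\ufeff'

updated = with_generic.replace('201', '202')
reencoded = updated.encode('utf-16')     # writes a BOM in native byte order
```

Decoding before calling replace() is what makes the substitution work: in the raw bytes the digits are interleaved with \x00, so a byte-level search for "201" finds nothing.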
It's UTF-16-LE data:
>>> b
'\xff\xfe<\x00I\x00n\x00s\x00t\x00a\x00n\x00c\x00e\x00N\x00a\x00m\x00e\x00s\x00>\x00\r\x00\n'
>>> print(b[:-1].decode('utf-16-le'))
<InstanceNames>
