I'm working on an application that uses UTF-8 encoding. For debugging purposes I need to print the text. If I use print() directly with the variable containing my Unicode string, e.g. print(pred_str), I get this error:
UnicodeEncodeError: 'charmap' codec can't encode character '\ufeff' in position 0: character maps to <undefined>
So I tried print(pred_str.encode('utf-8')) and my output looks like this:
b'\xef\xbb\xbfpudgala-dharma-nair\xc4\x81tmyayo\xe1\xb8\xa5 apratipanna-vipratipann\xc4\x81n\xc4\x81m'
b'avipar\xc4\xabta-pudgala-dharma-nair\xc4\x81tmya-pratip\xc4\x81dana-artham'
b'tri\xe1\xb9\x83\xc5\x9bik\xc4\x81-vij\xc3\xb1apti-prakara\xe1\xb9\x87a-\xc4\x81rambha\xe1\xb8\xa5'
b'pudgala-dharma-nair\xc4\x81tmya-pratip\xc4\x81danam punar kle\xc5\x9ba-j\xc3\xb1eya-\xc4\x81vara\xe1\xb9\x87a-prah\xc4\x81\xe1\xb9\x87a-artham'
But, I want my output to look like this:
pudgala-dharma-nairātmyayoḥ apratipanna-vipratipannānām
aviparīta-pudgala-dharma-nairātmya-pratipādana-artham
triṃśikā-vijñapti-prakaraṇa-ārambhaḥ
pudgala-dharma-nairātmya-pratipādanam punar kleśa-jñeya-āvaraṇa-prahāṇa-artham
If I save my string to a file using:
with codecs.open('out.txt', 'w', 'UTF-8') as f:
    f.write(pred_str)
it saves the string as expected.
Your data is encoded with the "UTF-8-SIG" codec, which is sometimes used in Microsoft environments.
This variant of UTF-8 prefixes encoded text with a byte order mark '\xef\xbb\xbf', to make it easier for applications to detect UTF-8 encoded text vs other encodings.
You can decode such bytestrings like this:
>>> bs = b'\xef\xbb\xbfpudgala-dharma-nair\xc4\x81tmyayo\xe1\xb8\xa5 apratipanna-vipratipann\xc4\x81n\xc4\x81m'
>>> text = bs.decode('utf-8-sig')
>>> print(text)
pudgala-dharma-nairātmyayoḥ apratipanna-vipratipannānām
To read such data from a file:
with open('myfile.txt', 'r', encoding='utf-8-sig') as f:
    text = f.read()
Note that even after decoding from UTF-8-SIG, you may still be unable to print your data because your console's default code page may not be able to encode other non-ASCII characters in the data. In that case you will need to adjust your console settings to support UTF-8.
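On Python 3.7+ one way to do that from inside the script is to reconfigure sys.stdout to use UTF-8 (a minimal sketch; it assumes pred_str is the decoded text, and your terminal font still has to be able to display the characters):
import sys

# Ask Python to encode stdout as UTF-8 instead of the console code page.
# On older versions, setting the PYTHONIOENCODING=utf-8 environment variable
# before starting the interpreter has a similar effect.
sys.stdout.reconfigure(encoding='utf-8', errors='replace')
print(pred_str)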
Try this code to strip the leading BOM if it is present:
if pred_str.startswith('\ufeff'):
    pred_str = pred_str[1:]  # drop the BOM character
Here is the code:
s = 'Waitematā'
w = open('test.txt','w')
w.write(s)
w.close()
I get the following error.
UnicodeEncodeError: 'charmap' codec can't encode character '\u0101' in position 8: character maps to <undefined>
The string will print with the macron a, ā. However, I am not able to write this to a .txt or .csv file.
Am I able to swap out the macron a, ā, for a plain a with no macron? Thanks for the help in advance.
Note that if you open a file with open('text.txt', 'w') and write a string to it, you are not writing the string itself to the file, but an encoded version of it. Which encoding is used depends on your LANG environment variable and other platform factors.
To force UTF-8, as you suggested in the title, you can try this:
w = open('text.txt', 'wb')  # note 'b' for binary mode
w.write(s.encode('utf-8'))  # convert str to bytes explicitly
w.close()
As documented for open():
In text mode, if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding.
Not all encodings support all Unicode characters. Since the encoding is platform dependent when not specified, it is better and more portable to be explicit and call out the encoding when reading or writing a text file. UTF-8 supports all Unicode code points:
s = 'Waitematā'
with open('text.txt','w',encoding='utf8') as w:
    w.write(s)
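If you do want to swap out the macron for a plain a rather than write the character as-is, one option is to decompose the string and drop the combining marks (a sketch using the standard unicodedata module):
import unicodedata

s = 'Waitematā'
# NFKD decomposition turns 'ā' into 'a' plus a combining macron;
# encoding to ASCII with errors='ignore' then drops the macron.
ascii_only = unicodedata.normalize('NFKD', s).encode('ascii', 'ignore').decode('ascii')
print(ascii_only)  # Waitemata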
I'm pulling data out of a Google doc, processing it, and writing it to a file (that eventually I will paste into a Wordpress page).
It has some non-ASCII symbols. How can I convert these safely to symbols that can be used in HTML source?
Currently I'm converting everything to Unicode on the way in, joining it all together in a Python string, then doing:
import codecs
f = codecs.open('out.txt', mode="w", encoding="iso-8859-1")
f.write(all_html.encode("iso-8859-1", "replace"))
There is an encoding error on the last line:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 12286: ordinal not in range(128)
Partial solution:
This Python runs without an error:
row = [unicode(x.strip()) if x is not None else u'' for x in row]
all_html = row[0] + "<br/>" + row[1]
f = open('out.txt', 'w')
f.write(all_html.encode("utf-8"))
But then if I open the actual text file, I see lots of symbols like:
Qur’an
Maybe I need to write to something other than a text file?
Deal exclusively with unicode objects as much as possible by decoding things to unicode objects when you first get them and encoding them as necessary on the way out.
If your string is actually a unicode object, you'll need to encode it to a bytestring (here, UTF-8) before writing it to a file:
foo = u'Δ, Й, ק, م, ๗, あ, 叶, 葉, and 말.'
f = open('test', 'w')
f.write(foo.encode('utf8'))
f.close()
When you read that file again, you'll get a UTF-8-encoded bytestring that you can decode back to a unicode object:
f = open('test', 'r')
print f.read().decode('utf8')
In Python 2.6+, you can use io.open(), which is the default (builtin) open() on Python 3:
import io
with io.open(filename, 'w', encoding=character_encoding) as file:
    file.write(unicode_text)
It may be more convenient if you need to write the text incrementally (you don't need to call unicode_text.encode(character_encoding) multiple times). Unlike the codecs module, the io module also has proper universal newline support.
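For instance, a small sketch of incremental writing (the file name and the lines list are just placeholders):
import io

lines = [u'first line', u'second line', u'third line']  # hypothetical data
with io.open('out.txt', 'w', encoding='utf-8') as f:
    for line in lines:
        # each write encodes for us; no manual .encode('utf-8') calls needed
        f.write(line + u'\n')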
Unicode string handling is already standardized in Python 3.
Characters are already stored as Unicode code points in memory.
You only need to open the file in UTF-8.
(The conversion from in-memory code points to variable-byte-length UTF-8 output is performed automatically when writing to the file.)
out1 = "(嘉南大圳 ㄐㄧㄚ ㄋㄢˊ ㄉㄚˋ ㄗㄨㄣˋ )"
fobj = open("t1.txt", "w", encoding="utf-8")
fobj.write(out1)
fobj.close()
Preface: will your viewer work?
Make sure your viewer/editor/terminal (however you are interacting with your UTF-8 encoded file) can read the file. This is frequently an issue on Windows with, for example, Notepad.
Writing Unicode text to a text file?
In Python 2, use open from the io module (this is the same as the builtin open in Python 3):
import io
Best practice, in general: use UTF-8 for writing to files (with UTF-8 we don't even have to worry about byte order).
encoding = 'utf-8'
utf-8 is the most modern and universally usable encoding - it works in all web browsers, most text-editors (see your settings if you have issues) and most terminals/shells.
On Windows, you might try utf-16le if you're limited to viewing output in Notepad (or another limited viewer).
encoding = 'utf-16le' # sorry, Windows users... :(
And just open it with the context manager and write your unicode characters out:
with io.open(filename, 'w', encoding=encoding) as f:
    f.write(unicode_object)
Example using many Unicode characters
Here's an example that attempts to map every possible character up to three bytes wide (four is the max, but that would be going a bit far) from its integer code point to an encoded printable output, along with its name, if possible (put this into a file called uni.py):
from __future__ import print_function
import io
from unicodedata import name, category
from curses.ascii import controlnames
from collections import Counter

try: # use these if Python 2
    unicode_chr, range = unichr, xrange
except NameError: # Python 3
    unicode_chr = chr

exclude_categories = set(('Co', 'Cn'))
counts = Counter()
control_names = dict(enumerate(controlnames))

with io.open('unidata', 'w', encoding='utf-8') as f:
    for x in range((2**8)**3):
        try:
            char = unicode_chr(x)
        except ValueError:
            continue # can't map to unicode, try next x
        cat = category(char)
        counts.update((cat,))
        if cat in exclude_categories:
            continue # get rid of noise & greatly shorten result file
        try:
            uname = name(char)
        except ValueError: # probably control character, don't use actual
            uname = control_names.get(x, '')
            f.write(u'{0:>6x} {1} {2}\n'.format(x, cat, uname))
        else:
            f.write(u'{0:>6x} {1} {2} {3}\n'.format(x, cat, char, uname))

# may as well describe the types we logged.
for cat, count in counts.items():
    print('{0} chars of category, {1}'.format(count, cat))
This should run on the order of a minute, and you can then view the data file; if your file viewer can display Unicode, you'll see it. Information about the categories can be found in the Unicode character database documentation. Based on the counts, we can probably improve our results by excluding the Cn and Co categories, which have no symbols associated with them.
$ python uni.py
It will display the hexadecimal code point, the category, the symbol itself (unless the name can't be retrieved, in which case it's probably a control character), and the name of the symbol.
I recommend less on Unix or Cygwin (don't print/cat the entire file to your output):
$ less unidata
It will display lines similar to the following, which I sampled from the file using Python 2 (Unicode 5.2):
0 Cc NUL
20 Zs SPACE
21 Po ! EXCLAMATION MARK
b6 So ¶ PILCROW SIGN
d0 Lu Ð LATIN CAPITAL LETTER ETH
e59 Nd ๙ THAI DIGIT NINE
2887 So ⢇ BRAILLE PATTERN DOTS-1238
bc13 Lo 밓 HANGUL SYLLABLE MIH
ffeb Sm → HALFWIDTH RIGHTWARDS ARROW
My Python 3.5 from Anaconda has Unicode 8.0; I would presume most Python 3 builds would.
The file opened by codecs.open is a file object that takes unicode data, encodes it in ISO-8859-1, and writes it to the file. However, what you are trying to write isn't unicode; you take unicode and encode it in ISO-8859-1 yourself. That's what the unicode.encode method does, and the result of encoding a unicode string is a bytestring (a str type).
You should either use normal open() and encode the unicode yourself, or (usually a better idea) use codecs.open() and not encode the data yourself.
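A rough sketch of the two options in Python 2, assuming all_html is already a unicode object:
import codecs

# Option 1: plain open() plus encoding the unicode to bytes yourself
f = open('out.txt', 'w')
f.write(all_html.encode('utf-8'))
f.close()

# Option 2 (usually better): let codecs.open() do the encoding
# and pass the unicode object straight through
f = codecs.open('out.txt', mode='w', encoding='utf-8')
f.write(all_html)
f.close()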
How to print Unicode characters to a file:
Save this to a file, foo.py:
#!/usr/bin/python -tt
# -*- coding: utf-8 -*-
import codecs
import sys
UTF8Writer = codecs.getwriter('utf8')
sys.stdout = UTF8Writer(sys.stdout)
print(u'e with acute accent: é')
Run it and pipe output to file:
python foo.py > tmp.txt
Open tmp.txt and look inside, you see this:
el#apollo:~$ cat tmp.txt
e with acute accent: é
Thus you have saved a Unicode e with an acute accent on it to a file.
That error arises when you try to encode a non-unicode string: Python first tries to decode it, assuming it's plain ASCII. There are two possibilities:
You're encoding it to a bytestring, but because you've used codecs.open, the write method expects a unicode object. So you encode it, and the file object tries to decode it again. Try f.write(all_html) instead.
all_html is not, in fact, a unicode object. When you do .encode(...), it first tries to decode it.
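A quick illustration of that implicit decode in a Python 2 session (the bytestring here is just an example):
>>> s = 'Qur\xe2\x80\x99an'        # a UTF-8 bytestring (str), not a unicode object
>>> s.encode('utf-8')              # Python 2 first tries to decode it with the ascii codec...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 3: ordinal not in range(128)
>>> s.decode('utf-8').encode('utf-8')   # decode explicitly first, then encoding works
'Qur\xe2\x80\x99an'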
When writing in Python 3:
>>> a = u'bats\u00E0'
>>> print(a)
batsà
>>> f = open("/tmp/test", "w")
>>> f.write(a)
>>> f.close()
>>> data = open("/tmp/test").read()
>>> data
'batsà'
When writing in Python 2:
>>> a = u'bats\u00E0'
>>> f = open("/tmp/test", "w")
>>> f.write(a)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 4: ordinal not in range(128)
To avoid this error you would have to encode it to bytes with the "utf-8" codec, like this:
>>> f.write(a.encode("utf-8"))
>>> f.close()
and decode the data with the "utf-8" codec when reading it back:
>>> data = open("/tmp/test").read()
>>> data.decode("utf-8")
u'bats\xe0'
Also, if you print the unicode string itself, it is encoded automatically using your terminal's encoding (UTF-8 here):
>>> print a
batsà
I'm running up against what I assume is some strange encoding error, but it's really baffling me. Basically I'm trying to write a unicode string to a file as an image, and the string representation prints fine:
ìԉcïԁiԁúлt cúɭpâ ρáncéttá, ëɑ ëɭìt haϻ offícìà còлѕêɋûät. Sunt ԁësërúлt
but any way I try to write the string out to any relevant place I get the standard ascii encoding error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
I've tried setting the encoding of my source files, and ensuring that my system variable isn't set to ascii, and I've tried directly outputting to a file via:
python script.py > output.jpg
and none of it seems to have any effect. I feel a little silly for not being able to solve a simple encoding issue, but I've really got no clue as to where the ascii codec is even coming from at this point.
Relevant code:
import StringIO  # imports implied by the rest of the snippet

import numpy
from PIL import Image

def random_image(**kwargs):
    image_array = numpy.random.rand(kwargs["dims"][0], kwargs["dims"][1], 3) * 255
    image = Image.fromarray(image_array.astype('uint8')).convert('RGBA')
    format = kwargs.get("format", "JPEG")
    output = StringIO.StringIO()
    image.save(output, format=format)
    content = output.getvalue()
    output.close()
    content = [str(ord(char)) for char in content]
    return content
The first question is: why do you store the contents of your image as a Unicode string? Images typically contain arbitrary octets and should be represented with str (bytes in Python 3), not with the unicode type.
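For example, here is a sketch of keeping the image data as bytes end to end and writing it in binary mode (the output file name is just a placeholder):
content = output.getvalue()           # keep this as a bytestring; don't map it to text
with open('random.jpg', 'wb') as f:   # 'wb': binary mode, no text encoding involved
    f.write(content)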
When you print a Unicode string to the screen, the encoding is chosen based on the environment settings. When you print it to a file, you need to specify an encoding, otherwise ASCII is assumed. To have your program default to something more sane for files, start it with:
import codecs
import sys

encoding = sys.stdout.encoding or 'utf-8'
sys.stdout = codecs.getwriter(encoding)(sys.stdout, errors='replace')
Edit: http://pastebin.com/W4iG3tjS - the file
I have a text file encoded in UTF-8 with some Cyrillic text in it. To load it, I use the following code:
import codecs
fopen = codecs.open('thefile', 'r', encoding='utf8')
fread = fopen.read()
fread dumps the file on the screen all unicodish (escape sequences). print fread displays it in readable form (ASCII I guess).
I then try to split it and write it to an empty file with no encoding:
a = fread.split()
for l in a:
    print>>dasFile, l
But I get the following error message: UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-13: ordinal not in range(128)
Is there a way to dump fread.split() into a file? How can I get rid of this error?
Since you've opened and read the file via codecs.open(), it's been decoded to Unicode. So to output it you need to encode it again, presumably back to UTF-8.
for l in a:
    dasFile.write(l.encode('utf-8'))
print is going to use the default encoding, which is normally "ascii". So you see that error with print. But you can open a file and write directly to it.
a = fopen.readlines() # returns a list of lines already, with line endings intact
# do something with a
dasFile.writelines(a) # doesn't add line endings, expects them to be present already.
assuming the lines in a are encoded already.
PS. You should also investigate the io module.
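For instance, a sketch using io.open so the encoding is handled for you on both the reading and the writing side (the file names are just placeholders):
import io

with io.open('thefile', 'r', encoding='utf8') as fopen:
    a = fopen.read().split()

with io.open('dasfile.txt', 'w', encoding='utf8') as das_file:
    for l in a:
        das_file.write(l + u'\n')   # each write is encoded to UTF-8 automatically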