The problem in my code looks something like this:
#!/usr/bin/python
# -*- coding: UTF-8 -*-
deg = u'°'
print deg
print '40%s N, 100%s W' % (deg, deg)
codelim = raw_input('40%s N, 100%s W)? ' % (deg, deg))
I'm trying to generate a raw_input prompt for delimiter characters inside a latitude/longitude string, and the prompt should include an example of such a string. print deg and print '40%s N, 100%s W' % (deg, deg) both work fine -- they return "°" and "40° N, 100° W" respectively -- but the raw_input step fails every time. The error I get is as follows:
Traceback (most recent call last):
File "C:\Users\[rest of the path]\scratch.py", line 5, in <module>
x = raw_input(' %s W")? ' % (deg))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xb0' in position 1:
ordinal not in range(128)
I thought I'd have solved the problem by adding the encoding header, as instructed here (and indeed that did make it possible to print the degree sign at all), but I'm still getting Unicode errors as soon as I add an otherwise-safe string to raw_input. What's going on here?
Try encoding the prompt string to sys.stdout's encoding before passing it to raw_input:
#!/usr/bin/python
# -*- coding: UTF-8 -*-
import sys
deg = u'°'
prompt = u'40%s N, 100%s W)? ' % (deg, deg)
codelim = raw_input(prompt.encode(sys.stdout.encoding))
40° N, 100° W)?
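Worth noting for later readers: on Python 3 this workaround isn't needed at all, since str is Unicode and raw_input was renamed to input. A minimal sketch:

```python
# Python 3 sketch: str is already Unicode, so the prompt needs no
# manual encoding before being passed to input().
deg = '°'
prompt = '40%s N, 100%s W)? ' % (deg, deg)
print(prompt)             # 40° N, 100° W)?
# codelim = input(prompt)  # works directly on Python 3
```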
I need to read this PDF.
I am using the following code:
from PyPDF2 import PdfFileReader
f = open('myfile.pdf', 'rb')
reader = PdfFileReader(f)
content = reader.getPage(0).extractText()
f.close()
content = ' '.join(content.replace('\xa0', ' ').strip().split())
print(content)
However, the encoding is incorrect, it prints:
Resultado da Prova de Sele“‰o do...
But I expected
Resultado da Prova de Seleção do...
How to solve it?
I'm using Python 3
The PyPDF2 extractText method returns Unicode, so you may need to just explicitly encode it. For example, explicitly encoding the Unicode into UTF-8:
# -*- coding: utf-8 -*-
correct = u'Resultado da Prova de Seleção do...'
print(correct.encode(encoding='utf-8'))
You're on Python 3, so you have Unicode under the hood, and Python 3 defaults to UTF-8. But I wonder if you need to specify a different encoding based on your locale.
# Show installed locales
import locale
from pprint import pprint
pprint(locale.locale_alias)
If that's not the quick fix, since you're getting Unicode back from PyPDF, you could take a look at the code points for those two characters. It's possible that PyPDF wasn't able to determine the correct encoding and gave you the wrong characters.
For example, a quick and dirty comparison of the good and bad strings you posted:
# -*- coding: utf-8 -*-
# Python 3.4
incorrect = 'Resultado da Prova de Sele“‰o do'
correct = 'Resultado da Prova de Seleção do...'
print("Incorrect String")
print("CHAR{}UNI".format(' ' * 20))
print("-" * 50)
for char in incorrect:
    print(
        '{}{}{}'.format(
            char.encode(encoding='utf-8'),
            ' ' * 20, # Hack; bytes objects don't have __format__
            ord(char)
        )
    )
print("\n" * 2)
print("Correct String")
print("CHAR{}UNI".format(' ' * 20))
print("-" * 50)
for char in correct:
    print(
        '{}{}{}'.format(
            char.encode(encoding='utf-8'),
            ' ' * 20, # Hack; bytes objects don't have __format__
            ord(char)
        )
    )
Relevant Output:
b'\xe2\x80\x9c' 8220
b'\xe2\x80\xb0' 8240
b'\xc3\xa7' 231
b'\xc3\xa3' 227
If you're not getting code point 231 (hex(231) returns '0xe7'), then you're getting bad data back from PyPDF.
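The same check can be done in two lines. If the extracted text holds U+201C and U+2030 where U+00E7 (ç) and U+00E3 (ã) belong, the extraction itself returned wrong characters; it is not merely a display problem:

```python
# Inspect only the non-ASCII code points of the good and bad strings.
bad = 'Sele“‰o'
good = 'Seleção'
print([hex(ord(c)) for c in bad if ord(c) > 127])   # ['0x201c', '0x2030']
print([hex(ord(c)) for c in good if ord(c) > 127])  # ['0xe7', '0xe3']
```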
What I have tried is to replace the specific " ' " Unicode character with "’", which solves this issue. Please let me know if you still fail to generate the PDF with this approach.
text = text.replace("'", "’")
Below are some examples; when run inside the Python 3.4.3 Shell included in IDLE3, they output special characters (icons). When I run the same code inside a terminal, the characters do not appear at all.
""" Some print functions with backslashes.
In IDLE3 they will output 'special characters' or icons.
In a terminal, they will not output anything. """
#Sometimes a visual effect.
print ("a, \a") #telephone
print ("\a")
print ("b, \b") #checkmark
print ("c, \c") # just a '\c' output.
# other letters like '\c' kept out the rest of this list.
print ("f, \f") #quarter note (musical)
print ("n, \n") #newline
print ("r, \r") #halve note (musical)
print ("t, \tTabbed in")
#print ("u, \u") #syntaxerror
print ("\u0000") #empty
print ("\u0001") #left arrow
print ("\u0002") #left arrow underline
print ("\u0003") #right arrow (play)
print ("v, \v") #eighth note (musical)
print ("\x01") # == '\u0001' __________(x == 00 ?)
print ("\1") # == '\u0001' == '\x01'
#some more fooling around
print ("\1") #left arrow
print ("\2") #underlined left arrow
print ("\3") #right arrow
print ("\4") #underlined right arrow
print ("\5") #trinity
print ("\6") #Q-parking
print ("\7") #telephone
print ("\8")
print ("\9")
print ("\10") #checkmark
print ("\11 hi") #tab
print ("\12 hi") #newline
print ("\13") #8th note
print ("\14") #4th note
print ("\15") #halve note
print ("\16") #whole note
print ("\17") #double 8th note
print ("\18")
print ("\19")
print ("\20") #left arrow (black)
print ("\21") #right arrow (black)
print ("\22") #harry potter
print ("\23") #X-chrom-carrying cell
print ("\24") #Y-chrom-carrying cell
print ("\25") #diameter for lefties
print ("\26") #pentoid
print ("\27") #gamma?
print ("\28") #I finally realised this will have to do with triple
# binary per character? 111 = 7, stop = 8
print ("\30") #
print ("\31") # female
print ("\32") # male
print ("\33") #
print ("\34") # clock
print ("\35") # alfa / ichtus
print ("\36") # arc
print ("\37") # diameter
print ("\40hi") # spaces? I don't know.
# This does not work by the way:
##import string
###No visual effect.
##alfa = string.ascii_lowercase
##for x in alfa:
## print ("\%s" % x)
Some of my Python 3.4.3 Shell output in IDLE3:
Are these 'special' characters c.q. icons used in any way? Is there some documentation I could have read that would have prevented me from asking this question?
I checked other questions about this issue on Stack Overflow, but all I found was people trying to pass in 'foreign' characters (like symbols from Word or whatever) and make Python print them.
If you are in IDLE and click "Options" and "Configure IDLE..." you see the font you are using. Fonts convert character numbers to what you see; a different font can produce different characters.
Example:
>>> print(u'\u2620')
☠
Which I looked for by searching "unicode skull" and which can be found here.
Not all fonts support all characters.
Unicode characters are organized in blocks of a certain topic. I like the block "Miscellaneous Symbols" where the skull is from.
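If you want to find out where a character lives without searching the web, the standard library can help; unicodedata.name() returns a character's official name, which usually hints at its block:

```python
import unicodedata

# Look up the official Unicode names of two characters.
print(unicodedata.name('\u2620'))  # SKULL AND CROSSBONES
print(unicodedata.name('\u2669'))  # QUARTER NOTE
```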
Encoding
Also important is which encoding you use. The encoding determines how characters are mapped to bytes and back. A character has to go from print(u'\u0001') to sys.stdout, to the console reading it, and to the window manager. Each step only understands bytes, and a single byte can only distinguish 256 possible characters.
So there are various encodings, such as latin-1, which take those 256 possible byte values and map them onto Unicode blocks; latin-1 covers the first two blocks. There are also encodings such as UTF-8, which uses 1 byte and more per character, UTF-16, which uses 2 bytes and more, and UTF-32, which uses 4 bytes and more. These allow many more characters to be transferred from the print call through the different steps.
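You can see the byte counts directly. Taking é (U+00E9, inside latin-1's 256-character range) as an example:

```python
# Byte counts for the same character under different encodings.
s = 'é'  # U+00E9
print(len(s.encode('latin-1')))  # 1
print(len(s.encode('utf-8')))    # 2
print(len(s.encode('utf-16')))   # 4 (2 bytes + 2-byte byte-order mark)
print(len(s.encode('utf-32')))   # 8 (4 bytes + 4-byte byte-order mark)
```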
If I want to encode the skull and crossbones in latin-1 I would get this error:
>>> u'\u2620'.encode('latin-1')
Traceback (most recent call last):
File "<pyshell#32>", line 1, in <module>
u'\u2620'.encode('latin-1')
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2620' in position 0: ordinal not in range(256)
Another example where I encode the russian letter zhe in the cyrillic code page and the latin one:
>>> print u'\u0436', repr(u'\u0436'.encode('cp1251')) # cyrillic works
ж '\xe6'
>>> print u'\u0436', repr(u'\u0436'.encode('cp1252')) # latin-1 fails
ж
Traceback (most recent call last):
File "<pyshell#41>", line 1, in <module>
print u'\u0436', repr(u'\u0436'.encode('cp1252')) # latin-1
File "C:\Python27\lib\encodings\cp1252.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u0436' in position 0: character maps to <undefined>
To escape this encoding jungle, use UTF-8 which can encode everything.
>>> print u'\u0436\u2620', repr(u'\u0436\u2620'.encode('utf-8'))
ж☠ '\xd0\xb6\xe2\x98\xa0'
Encoding and decoding with different encodings changes the character. If you want to use funny characters, use unicode and UTF-8.
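To make that last point concrete, here is what a wrong decode does to a character: reading the UTF-8 bytes of zhe as latin-1 turns one character into two mojibake characters, and only reversing the exact same steps recovers it (Python 3 syntax):

```python
# Decoding UTF-8 bytes with the wrong codec produces mojibake.
zhe = '\u0436'
garbled = zhe.encode('utf-8').decode('latin-1')
print(garbled)                                    # Ð¶
print(garbled.encode('latin-1').decode('utf-8'))  # ж
```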
I am trying to print Chinese text to a file. When I print it on the terminal, it looks correct to me. When I type print >> filename... I get this error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 24: ordinal not in range(128)
I don't know what else I need to do; I already encoded all textual data to UTF-8 and used string formatting.
This is my code:
# -*- coding: utf-8 -*-
import codecs
import string

exclude = string.punctuation

def get_documents_cleaned(topics, filename):
    c = codecs.open(filename, "w", "utf-8")
    for top in topics:
        print >> c, "num", ",", "text", ",", "id", ",", top
        # proj is defined elsewhere in my code
        document_results = proj.get('docs', text=top)['results']
        for doc in document_results:
            print "{0}, {1}, {2}, {3}".format(doc[1], doc[0]['text'].encode('utf-8').translate(None, exclude), doc[0]['document']['id'], top.encode('utf-8'))
            print >> c, "{0}, {1}, {2}, {3}".format(doc[1], doc[0]['text'].encode('utf-8').translate(None, exclude), doc[0]['document']['id'], top.encode('utf-8'))
get_documents_cleaned(my_topics,"export_c.csv")
print doc[0]['text'] looks like this:
u' \u3001 \u6010...'
Since your first print statement works, it's clear that it's not the format function raising the UnicodeDecodeError.
Instead it's a problem with the file writer: c expects a unicode object but only gets a UTF-8 encoded str object (let's name it s). So c tries to call s.decode() implicitly, which results in the UnicodeDecodeError.
You could fix your problem by simply calling s.decode('utf-8') before printing, or by using the Python default open(filename, "w") function instead.
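On Python 3 the cleanest version of this is to keep everything as text and let the file object do the encoding. A minimal sketch using Python 3 names (str.maketrans replaces the Python 2 translate(None, ...) idiom; 'demo.csv' is just an example path):

```python
import io
import string

# Strip punctuation from text, then write the text to a UTF-8 file
# without ever handling raw bytes ourselves.
exclude = str.maketrans('', '', string.punctuation)

with io.open('demo.csv', 'w', encoding='utf-8') as c:
    text = '\u3001 hello, world!'
    print(text.translate(exclude), file=c)

with io.open('demo.csv', encoding='utf-8') as c:
    print(c.read())  # 、 hello world
```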
I have a file which is mostly UTF-8, but some Windows-1252 characters have also found their way in.
I created a table to map from the Windows-1252 (cp1252) characters to their Unicode counterparts, and would like to use it to fix the mis-encoded characters, e.g.
cp1252_to_unicode = {
    "\x85": u'\u2026', # …
    "\x91": u'\u2018', # ‘
    "\x92": u'\u2019', # ’
    "\x93": u'\u201c', # “
    "\x94": u'\u201d', # ”
    "\x97": u'\u2014'  # —
}

for l in open('file.txt'):
    for c, u in cp1252_to_unicode.items():
        l = l.replace(c, u)
But attempting to do the replace this way results in a UnicodeDecodeError being raised, e.g.:
"\x85".replace("\x85", u'\u2026')
UnicodeDecodeError: 'ascii' codec can't decode byte 0x85 in position 0: ordinal not in range(128)
Any ideas for how to deal with this?
If you try to decode this string as UTF-8, as you already know, you will get a UnicodeDecodeError, since these spurious cp1252 characters are invalid UTF-8.
However, Python codecs allow you to register a callback to handle encoding/decoding errors, with the codecs.register_error function; it gets the UnicodeDecodeError as a parameter. You can write such a handler that attempts to decode the data as cp1252 and continues decoding the rest of the string as UTF-8.
In my utf-8 terminal, I can build a mixed incorrect string like this:
>>> a = u"maçã ".encode("utf-8") + u"maçã ".encode("cp1252")
>>> print a
maçã ma��
>>> a.decode("utf-8")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.6/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 9-11: invalid data
I wrote the said callback function here and found a catch: even if you increment the position from which to decode the string by 1, so that it would start on the next character, if the next character is also not UTF-8 and out of range(128), the error is raised again at the first out-of-range(128) character; that means the decoding "walks back" if consecutive non-ASCII, non-UTF-8 chars are found.
The workaround is to have a state variable in the error handler which detects this "walking back" and resumes decoding from the last call to it. In this short example, I implemented it as a global variable (it will have to be manually reset to -1 before each call to the decoder):
import codecs

last_position = -1

def mixed_decoder(unicode_error):
    global last_position
    string = unicode_error[1]
    position = unicode_error.start
    if position <= last_position:
        position = last_position + 1
    last_position = position
    new_char = string[position].decode("cp1252")
    #new_char = u"_"
    return new_char, position + 1
codecs.register_error("mixed", mixed_decoder)
And on the console:
>>> a = u"maçã ".encode("utf-8") + u"maçã ".encode("cp1252")
>>> last_position = -1
>>> print a.decode("utf-8", "mixed")
maçã maçã
With thanks to jsbueno and a whack of other Google searches and other pounding I solved it this way.
#The following works very well but it does not allow for any attempts to FIX the data.
xmlText = unicode(xmlText, errors='replace').replace(u"\uFFFD", "?")
This version allows for a limited opportunity to repair invalid characters. Unknown characters are replaced with a safe value.
import codecs

replacement = {
    '85' : '...', # u'\u2026' ... character.
    '96' : '-',   # u'\u2013' en-dash
    '97' : '-',   # u'\u2014' em-dash
    '91' : "'",   # u'\u2018' left single quote
    '92' : "'",   # u'\u2019' right single quote
    '93' : '"',   # u'\u201C' left double quote
    '94' : '"',   # u'\u201D' right double quote
    '95' : "*"    # u'\u2022' bullet
}

# This is more complex but allows for the data to be fixed.
def mixed_decoder(unicodeError):
    errStr = unicodeError[1]
    errLen = unicodeError.end - unicodeError.start
    nextPosition = unicodeError.start + errLen
    errHex = errStr[unicodeError.start:unicodeError.end].encode('hex')
    if errHex in replacement:
        return u'%s' % replacement[errHex], nextPosition
    return u'%s' % errHex, nextPosition  # Comment this line out to get a question mark
    return u'?', nextPosition

codecs.register_error("mixed", mixed_decoder)

xmlText = xmlText.decode("utf-8", "mixed")
Basically I attempt to turn it into utf8. For any characters that fail I just convert it to HEX so I can display or look it up in a table of my own.
This is not pretty but it does allow me to make sense of messed up data
Good solution, that of @jsbueno, but there is no need for the global variable last_position; see:

import codecs

def mixed_decoder(error: UnicodeError) -> (str, int):
    bs: bytes = error.object[error.start: error.end]
    return bs.decode("cp1252"), error.start + 1

codecs.register_error("mixed", mixed_decoder)

a = "maçã".encode("utf-8") + "maçã".encode("cp1252")
# a = b"ma\xc3\xa7\xc3\xa3ma\xe7\xe3"

s = a.decode("utf-8", "mixed")
# s = "maçãmaçã"
This is usually called Mojibake.
There's a nice Python library that might solve these issues for you called ftfy.
Example:
>>> from ftfy import fix_text
>>> fix_text("Ð¨ÐµÐ¿Ð¾Ñ (напоминалки)")
'Шепот (напоминалки)'
Just ran into this today, so here is my problem and my own solution:
# A raw string, so it really contains the four-character sequences \xe7 etc.
original_string = r'Notifica\xe7\xe3o de Emiss\xe3o de Nota Fiscal Eletr\xf4nica.'

def mixed_decoding(s):
    output = ''
    ii = 0
    for c in s:
        if ii <= len(s) - 1:
            if s[ii] == '\\' and s[ii+1] == 'x':
                # Decode the literal \xNN sequence (4 characters long).
                b = s[ii:ii+4].encode('ascii').decode('unicode-escape')
                output = output + b
                ii += 4
            else:
                output = output + s[ii]
                ii += 1
    print(output)
    return output

decoded_string = mixed_decoding(original_string)
Now it prints:
>>> Notificação de Emissão de Nota Fiscal Eletrônica.
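A shorter take on the same idea, as a sketch under the same assumption (the text really contains literal four-character \xNN sequences): use a regular expression and decode each match as a single latin-1/cp1252-range code point.

```python
import re

# Replace each literal \xNN escape with the character chr(0xNN).
def decode_escapes(s):
    return re.sub(r'\\x([0-9a-fA-F]{2})',
                  lambda m: chr(int(m.group(1), 16)), s)

print(decode_escapes(r'Notifica\xe7\xe3o de Emiss\xe3o'))
# Notificação de Emissão
```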
I need to replace special chars in the filename. I'm trying this at the moment with translate, but it's not really working well, and I hope you have an idea how to do this. It's to make a clean playlist; I've got a bad MP3 player in my car which can't handle umlauts or special chars.
My code so far
# -*- coding: utf-8 -*-
import os
import sys
import id3reader

pfad = os.path.dirname(sys.argv[1]) + "/"
ordner = ""

table = {
    0xe9: u'e',
    0xe4: u'ae',
    ord(u'ö'): u'oe',
    ord(u'ü'): u'ue',
    ord(u'ß'): u'ss',
    0xe1: u'ss',
    0xfc: u'ue',
}

def replace(s):
    return ''.join(c for c in s if (c.isalpha() or c == " " or c == "-"))

fobj_in = open(sys.argv[1])
fobj_out = open(sys.argv[1] + ".new", "w")

for line in fobj_in:
    if (line.rstrip()[0:1] == "#" or line.rstrip()[0:1] == " "):
        print line.rstrip()[0:1]
    else:
        datei = pfad + line.rstrip()
        #print datei
        id3info = id3reader.Reader(datei)
        dateiname = str(id3info.getValue('performer')) + " - " + str(id3info.getValue('title'))
        #print dateiname
        arrPfad = line.split('/')
        dateiname = replace(dateiname[0:60])
        print dateiname
        # dateiname = dateiname.translate(table) + ".mp3"
        ordner = arrPfad[0] + "/" + dateiname
        # os.rename(datei, pfad + ordner)
        fobj_out.write(ordner + "\r\n")

fobj_in.close()
I get this error: UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 37: ordinal not in range(128)
If I try to use translate on the ID3 title, I get: TypeError: expected a character buffer object
If I need to get rid of non-ASCII characters, I often use:
>>> unicodedata.normalize("NFKD", u"spëcïälchärs").encode('ascii', 'ignore')
'specialchars'
which tries to convert characters to the ASCII part of their normalized Unicode decomposition.
The bad thing is that it throws away everything it does not know, and it is not smart enough to transliterate umlauts (to ue, ae, etc.).
But it might help you to at least play those mp3s.
Of course, you are free to do your own str.translate first and wrap the result in this, to eliminate every non-ASCII character still left. In fact, if your replace is correct, this will solve your problem. I'd suggest you take a look at str.translate and str.maketrans, though.
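A sketch of that combination in Python 3 syntax (the mapping table is just an example; extend it to taste): transliterate the umlauts by hand first, then let NFKD plus ASCII-ignore strip whatever accents remain.

```python
import unicodedata

# Hand-written transliterations applied before the NFKD fallback.
table = {ord('ä'): 'ae', ord('ö'): 'oe', ord('ü'): 'ue', ord('ß'): 'ss'}

def asciify(s):
    s = s.translate(table)
    return (unicodedata.normalize('NFKD', s)
            .encode('ascii', 'ignore')
            .decode('ascii'))

print(asciify('Müller - Café - Straße'))  # Mueller - Cafe - Strasse
```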