I need to replace special characters in filenames. I'm currently trying this with translate, but it isn't working well, and I hope you have an idea how to do it. The goal is a clean playlist: the MP3 player in my car can't handle umlauts or other special characters.
My code so far:
# -*- coding: utf-8 -*-
import os
import sys
import id3reader

pfad = os.path.dirname(sys.argv[1]) + "/"
ordner = ""
table = {
    0xe9: u'e',
    0xe4: u'ae',
    ord(u'ö'): u'oe',
    ord(u'ü'): u'ue',
    ord(u'ß'): u'ss',
    0xe1: u'ss',
    0xfc: u'ue',
}

def replace(s):
    return ''.join(c for c in s if (c.isalpha() or c == " " or c == "-"))

fobj_in = open(sys.argv[1])
fobj_out = open(sys.argv[1] + ".new", "w")
for line in fobj_in:
    if line.rstrip()[0:1] == "#" or line.rstrip()[0:1] == " ":
        print line.rstrip()[0:1]
    else:
        datei = pfad + line.rstrip()
        #print datei
        id3info = id3reader.Reader(datei)
        dateiname = str(id3info.getValue('performer')) + " - " + str(id3info.getValue('title'))
        #print dateiname
        arrPfad = line.split('/')
        dateiname = replace(dateiname[0:60])
        print dateiname
        # dateiname = dateiname.translate(table) + ".mp3"
        ordner = arrPfad[0] + "/" + dateiname
        # os.rename(datei, pfad + ordner)
        fobj_out.write(ordner + "\r\n")
fobj_in.close()
I get this error: UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 37: ordinal not in range(128)
If I try to use translate on the ID3 title, I get: TypeError: expected a character buffer object
If I need to get rid of non-ASCII characters, I often use:
>>> unicodedata.normalize("NFKD", u"spëcïälchärs").encode('ascii', 'ignore')
'specialchars'
which tries to convert each character to the ASCII part of its normalized Unicode decomposition.
The bad thing is that it throws away everything it does not know, and it is not smart enough to transliterate umlauts (to ue, ae, etc.).
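For example, ß has no decomposition and simply disappears:
>>> unicodedata.normalize("NFKD", u"Straße").encode('ascii', 'ignore')
'Strae'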
But it might help you to at least play those mp3s.
Of course, you are free to do your own translate step first and wrap the result in this, to eliminate every non-ASCII character still left. In fact, if your replacement table is correct, this will solve your problem. I'd suggest you take a look at unicode.translate, though: called on a unicode string, it accepts exactly the kind of ordinal-to-replacement dict you already built in table. (str.translate on byte strings is a different beast that expects a 256-character mapping string, which is where your "expected a character buffer object" TypeError comes from.)
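A minimal sketch combining the two steps (assuming Python 2, as in your code, and that the tag values have been decoded to unicode; to_ascii and umlauts are made-up names):

# -*- coding: utf-8 -*-
import unicodedata

# transliteration table for unicode.translate: ordinal -> replacement
umlauts = {
    ord(u'ä'): u'ae', ord(u'ö'): u'oe', ord(u'ü'): u'ue',
    ord(u'Ä'): u'Ae', ord(u'Ö'): u'Oe', ord(u'Ü'): u'Ue',
    ord(u'ß'): u'ss',
}

def to_ascii(s):
    s = s.translate(umlauts)          # transliterate the umlauts first
    # then strip remaining accents (e.g. é -> e) and drop anything else
    return unicodedata.normalize('NFKD', s).encode('ascii', 'ignore')

print to_ascii(u'Motörhead - Überdosis Glück')  # Motoerhead - Ueberdosis Glueck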
I am running into an issue reading a file that contains both UTF-8 multi-byte characters and ASCII. The problem is that I use seek to read only part of the data, and I have no way of knowing whether I am reading from the "middle" of a UTF-8 character.
macOS
Python 3.6.6
To simplify it, my issue can be demoed with the following code:
# write some UTF-8 to a file
open('/tmp/test.txt', 'w').write(chr(12345)+chr(23456)+chr(34567)+'\n')

data = open('/tmp/test.txt')
data.read()  # works fine; just to demo I can read the whole file
data.seek(1)
data.read(1) # UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

# but I can read if I seek in steps of 3
data.seek(3)
data.read(1) # this works fine
I know I can open the file in binary mode and then seek to any position without issue; however, I need to process the text, so I end up with the same error when I decode the bytes into a string:
data = open('/tmp/test.txt', 'rb')
data.seek(1)
z = data.read(3)
z.decode() # hits the same error
Without using seek I can read correctly, even when calling read(1):
data = open('/tmp/test.txt')
data.tell() # 0
data.read(1)
data.tell() # shows 3 even calling read(1)
One thing I can think of: after seeking to a location, try to read; on UnicodeDecodeError, do position = position - 1 and seek(position) again, until the read succeeds.
Is there a better (right) way to handle this?
As the documentation explains, when you seek on text files:
offset must either be a number returned by TextIOBase.tell(), or zero. Any other offset value produces undefined behaviour.
In practice, what seek(1) actually does is seek 1 byte into the file—which puts it in the middle of a character. So, what ends up happening is similar to this:
>>> s = chr(12345)+chr(23456)+chr(34567)+'\n'
>>> b = s.encode()
>>> b
b'\xe3\x80\xb9\xe5\xae\xa0\xe8\x9c\x87\n'
>>> b[1:]
b'x80\xb9\xe5\xae\xa0\xe8\x9c\x87\n'
>>> b[1:].decode()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb9 in position 3: invalid start byte
So, seek(3) happens to work, even though it's not legal, because you happen to be seeking to the start of a character. It's equivalent to this:
>>> b[3:].decode()
'宠蜇\n'
If you want to rely on that undocumented behavior to try to seek randomly into the middle of a UTF-8 text file, you can usually get away with it by doing what you suggested. For example:
def readchar(f, pos):
    # f is a text-mode file; this relies on the undocumented fact that
    # seek() counts bytes. A UTF-8 character is at most 4 bytes long, so
    # one of the next few positions must be a valid start byte.
    for i in range(pos, pos + 5):
        try:
            f.seek(i)
            return f.read(1)
        except UnicodeDecodeError:
            pass
    raise ValueError('unable to find a UTF-8 start byte')
Or you could use knowledge of the UTF-8 encoding to manually scan for a valid start byte in a binary file:
def readchar(f, pos):
    # f is opened in binary mode; UTF-8 continuation bytes are 0x80-0xBF,
    # so anything below 0x80 or from 0xC0 up is a start byte
    f.seek(pos)
    for _ in range(5):
        byte = f.read(1)
        if byte and (byte[0] < 0x80 or byte[0] >= 0xC0):
            return byte
    raise ValueError('unable to find a UTF-8 start byte')
However, if you're actually just looking for the next complete line before or after some arbitrary point, that's a whole lot easier.
In UTF-8, the newline character is encoded as a single byte, and it is the same byte as in ASCII: '\n' encodes to b'\n'. (If you have Windows-style line endings, the same is true for carriage return, so '\r\n' also encodes to b'\r\n'.) This is by design, to make it easier to handle exactly this kind of problem.
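A quick check of that property:
>>> 'héllo\n'.encode('utf-8')
b'h\xc3\xa9llo\n'
Every byte of a multi-byte UTF-8 sequence has its high bit set, so the byte 0x0A can only ever mean a real newline.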
So, if you open the file in binary mode, you can seek forward or backward until you find a newline byte. And then, you can just use the (binary-file) readline method to read from there until the next newline.
The exact details depend on exactly what rule you want to use here. Also, I'm going to show a stupid, completely unoptimized version that seeks and reads one byte at a time; in real life you probably want to back up, read, and scan (e.g., with rfind) 80 bytes or so at a time, but this is hopefully simpler to understand:
def getline(f, pos, maxpos):
    # scan backward from pos until we find a newline (or hit the start
    # of the file), then read the whole line that contains pos
    for start in range(pos - 1, -1, -1):
        f.seek(start)
        if f.read(1) == b'\n':
            break
    else:
        f.seek(0)
    return f.readline().decode()
Here it is in action:
>>> import io
>>> s = ''.join(f'{i}:\u3039\u5ba0\u8707\n' for i in range(5))
>>> b = s.encode()
>>> f = io.BytesIO(b)
>>> maxlen = len(b)
>>> print(getline(f, 0, maxlen))
0:〹宠蜇
>>> print(getline(f, 1, maxlen))
0:〹宠蜇
>>> print(getline(f, 10, maxlen))
0:〹宠蜇
>>> print(getline(f, 11, maxlen))
0:〹宠蜇
>>> print(getline(f, 12, maxlen))
1:〹宠蜇
>>> print(getline(f, 59, maxlen))
4:〹宠蜇
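And here is a sketch of the chunked variant the answer mentions (my own illustration, not part of the original answer): back up and scan roughly 80 bytes at a time with rfind instead of seeking once per byte.

def getline_chunked(f, pos, chunksize=80):
    # scan backward in chunks, looking for the last newline before pos
    end = pos
    while end > 0:
        start = max(0, end - chunksize)
        f.seek(start)
        chunk = f.read(end - start)
        nl = chunk.rfind(b'\n')
        if nl != -1:
            f.seek(start + nl + 1)  # position just past that newline
            return f.readline().decode()
        end = start
    f.seek(0)  # no newline before pos: the line starts at byte 0
    return f.readline().decode()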
Below are some examples. When run inside the Python 3.4.3 shell included in IDLE3, they output 'special characters' (icons). When I run the same code inside a terminal, the characters do not appear at all.
""" Some print functions with backslashes.
In IDLE3 they will output 'special characters' or icons.
In a terminal, they will not output anything. """
#Sometimes a visual effect.
print ("a, \a") #telephone
print ("\a")
print ("b, \b") #checkmark
print ("c, \c") # just a '\c' output.
# other letters like '\c' kept out the rest of this list.
print ("f, \f") #quarter note (musical)
print ("n, \n") #newline
print ("r, \r") #halve note (musical)
print ("t, \tTabbed in")
#print ("u, \u") #syntaxerror
print ("\u0000") #empty
print ("\u0001") #left arrow
print ("\u0002") #left arrow underline
print ("\u0003") #right arrow (play)
print ("v, \v") #eighth note (musical)
print ("\x01") # == '\u0001' __________(x == 00 ?)
print ("\1") # == '\u0001' == '\x01'
#some more fooling around
print ("\1") #left arrow
print ("\2") #underlined left arrow
print ("\3") #right arrow
print ("\4") #underlined right arrow
print ("\5") #trinity
print ("\6") #Q-parking
print ("\7") #telephone
print ("\8")
print ("\9")
print ("\10") #checkmark
print ("\11 hi") #tab
print ("\12 hi") #newline
print ("\13") #8th note
print ("\14") #4th note
print ("\15") #halve note
print ("\16") #whole note
print ("\17") #double 8th note
print ("\18")
print ("\19")
print ("\20") #left arrow (black)
print ("\21") #right arrow (black)
print ("\22") #harry potter
print ("\23") #X-chrom-carrying cell
print ("\24") #Y-chrom-carrying cell
print ("\25") #diameter for lefties
print ("\26") #pentoid
print ("\27") #gamma?
print ("\28") #I finally realised this will have to do with triple
# binary per character? 111 = 7, stop = 8
print ("\30") #
print ("\31") # female
print ("\32") # male
print ("\33") #
print ("\34") # clock
print ("\35") # alfa / ichtus
print ("\36") # arc
print ("\37") # diameter
print ("\40hi") # spaces? I don't know.
# This does not work by the way:
##import string
###No visual effect.
##alfa = string.ascii_lowercase
##for x in alfa:
## print ("\%s" % x)
Some of my Python shell 3.4.3 output in IDLE3 (screenshot not reproduced here):
Are these 'special' characters or icons used in any way? Is there documentation I could have read that would have prevented me from asking this question?
I checked other questions about this issue on Stack Overflow, but all I found was people trying to pass in 'foreign' characters (like symbols from Word) and get Python to print them.
If you are in IDLE and click "Options" and "Configure IDLE...", you see the font you are using. Fonts map character numbers to the glyphs you see, so a different font can produce different characters.
Example:
>>> print(u'\u2620')
☠
I found it by searching for "unicode skull".
Not all fonts support all characters.
Unicode characters are organized in blocks of a certain topic. I like the block "Miscellaneous Symbols" where the skull is from.
Encoding
Also important is which encoding you use. The encoding determines how bytes are mapped to unicode characters. A character has to go from print(u'\u0001') through sys.stdout to the console reading it and on to the window manager, and each of these steps only understands bytes: 256 possible values.
So there are various encodings, such as latin-1, which map those 256 possible values onto the unicode blocks; latin-1 covers the first two blocks, I think. And there are encodings such as UTF-8, which uses 1 byte per character and up, or UTF-16, which uses 2 bytes and up, or UTF-32, which uses 4 bytes, that allow far more characters to be transferred through those steps.
If I want to encode the skull and crossbones in latin-1 I would get this error:
>>> u'\u2620'.encode('latin-1')
Traceback (most recent call last):
File "<pyshell#32>", line 1, in <module>
u'\u2620'.encode('latin-1')
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2620' in position 0: ordinal not in range(256)
Another example where I encode the russian letter zhe in the cyrillic code page and the latin one:
>>> print u'\u0436', repr(u'\u0436'.encode('cp1251')) # cyrillic works
ж '\xe6'
>>> print u'\u0436', repr(u'\u0436'.encode('cp1252')) # latin-1 fails
ж
Traceback (most recent call last):
File "<pyshell#41>", line 1, in <module>
print u'\u0436', repr(u'\u0436'.encode('cp1252')) # latin-1
File "C:\Python27\lib\encodings\cp1252.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u0436' in position 0: character maps to <undefined>
To escape this encoding jungle, use UTF-8 which can encode everything.
>>> print u'\u0436\u2620', repr(u'\u0436\u2620'.encode('utf-8'))
ж☠ '\xd0\xb6\xe2\x98\xa0'
Encoding with one codec and decoding with another changes the characters. If you want to use funny characters, use unicode and UTF-8.
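A quick illustration (my addition) in the same Python 2 shell: a UTF-8 round-trip is lossless, while decoding the same bytes with the wrong codec produces mojibake.
>>> s = u'\u0436\u2620'
>>> s.encode('utf-8').decode('utf-8') == s
True
>>> print s.encode('utf-8').decode('cp1252')
Ð¶â˜ 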
I am trying to print Chinese text to a file. When I print it on the terminal, it looks correct to me. When I use print >> file, I get this error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 24: ordinal not in range(128)
I don't know what else I need to do; I already encoded all textual data to UTF-8 and used string formatting.
This is my code:
# -*- coding: utf-8 -*-
import string
import codecs

exclude = string.punctuation

def get_documents_cleaned(topics, filename):
    c = codecs.open(filename, "w", "utf-8")
    for top in topics:
        print >> c, "num", ",", "text", ",", "id", ",", top
        document_results = proj.get('docs', text=top)['results']
        for doc in document_results:
            print "{0}, {1}, {2}, {3}".format(doc[1], doc[0]['text'].encode('utf-8').translate(None, exclude), doc[0]['document']['id'], top.encode('utf-8'))
            print >> c, "{0}, {1}, {2}, {3}".format(doc[1], doc[0]['text'].encode('utf-8').translate(None, exclude), doc[0]['document']['id'], top.encode('utf-8'))

get_documents_cleaned(my_topics, "export_c.csv")
print doc[0]['text'] looks like this:
u' \u3001 \u6010...'
Since your first print statement works, it's clear that it's not the format call raising the UnicodeDecodeError.
Instead, it's a problem with the file writer. c expects a unicode object but gets a UTF-8 encoded str object (let's call it s), so c implicitly calls s.decode() with the default ASCII codec, which raises the UnicodeDecodeError.
You could fix your problem by simply calling s.decode('utf-8') before printing, or by using the Python default open(filename, "w") function instead.
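A minimal sketch of the codecs-writer approach (made-up sample values; the point is that everything handed to the writer stays unicode, so nothing ever needs the implicit ASCII decode):

# -*- coding: utf-8 -*-
import codecs

c = codecs.open("export_c.csv", "w", "utf-8")
row = u"{0}, {1}, {2}".format(1, u"\u3001\u6010", u"some-id")  # all unicode
print >> c, row   # the codecs writer encodes to UTF-8 on the way out
c.close()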
I'm working on a project that compares text.
Here is the relevant piece of code:
def post(self):
    A = unicode(flask.request.form['A'])
    B = unicode(flask.request.form['B'])
I posted large pieces of text from Project Gutenberg and I get errors like this:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 6: ordinal not in range(128)
Based on this page I have tried errors='ignore' and errors='replace' and get the error:
TypeError: decoding Unicode is not supported
If possible I want to accept the most robust set of characters possible. I was hoping there was a Python library that would allow this.
Here is more of the code. I think the problem may occur when I try to turn my input into a string.
C = A.split()
D = B.split()
Both = []
for x in C:
    if x in D:
        Both.append(x)
for x in range(len(Both)):
    Both[x] = str(Both[x])
Final = []
for x in set(Both):
    Final.append(x)
MissingA = []
for x in C:
    if x not in Final and x not in MissingA:
        MissingA.append(x)
for x in range(len(MissingA)):
    MissingA[x] = str(MissingA[x])
MissingB = []
Here is more of the code. I think the problem may occur when I try to
turn my input into a string.
I think that's right - try eliminating the str() calls.
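A sketch of the same comparison with the str() coercion removed (my own rewrite of the snippet above; everything stays unicode, so accented words survive):

C = A.split()
D = B.split()
Both = [x for x in C if x in D]   # keep the words as unicode
Final = list(set(Both))
MissingA = []
for x in C:
    if x not in Final and x not in MissingA:
        MissingA.append(x)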
I have a file which is mostly UTF-8, but some Windows-1252 characters have also found their way in.
I created a table to map from the Windows-1252 (cp1252) characters to their Unicode counterparts, and would like to use it to fix the mis-encoded characters, e.g.
cp1252_to_unicode = {
    "\x85": u'\u2026', # …
    "\x91": u'\u2018', # ‘
    "\x92": u'\u2019', # ’
    "\x93": u'\u201c', # “
    "\x94": u'\u201d', # ”
    "\x97": u'\u2014'  # —
}

for l in open('file.txt'):
    for c, u in cp1252_to_unicode.items():
        l = l.replace(c, u)
But attempting to do the replace this way results in a UnicodeDecodeError being raised, e.g.:
"\x85".replace("\x85", u'\u2026')
UnicodeDecodeError: 'ascii' codec can't decode byte 0x85 in position 0: ordinal not in range(128)
Any ideas for how to deal with this?
If you try to decode this string as UTF-8, as you already know, you will get a UnicodeDecodeError, because these spurious cp1252 characters are invalid UTF-8.
However, Python codecs allow you to register a callback to handle encoding/decoding errors, with the codecs.register_error function; it gets the UnicodeDecodeError as a parameter. You can write such a handler that attempts to decode the data as "cp1252", and continues decoding the rest of the string as UTF-8.
In my utf-8 terminal, I can build a mixed incorrect string like this:
>>> a = u"maçã ".encode("utf-8") + u"maçã ".encode("cp1252")
>>> print a
maçã ma��
>>> a.decode("utf-8")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.6/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 9-11: invalid data
I wrote the said callback function here, and found a catch: even if you increment the position from which to decode the string by 1, so that it starts on the next character, if that next character is also not UTF-8 and out of range(128), the error is raised again at the first out-of-range character; that is, the decoding "walks back" if consecutive non-ASCII, non-UTF-8 chars are found.
The workaround is to keep a state variable in the error handler which detects this "walking back" and resumes decoding from the last call to it. In this short example I implemented it as a global variable (it has to be manually reset to -1 before each call to the decoder):
import codecs

last_position = -1

def mixed_decoder(unicode_error):
    global last_position
    string = unicode_error[1]
    position = unicode_error.start
    if position <= last_position:
        position = last_position + 1
    last_position = position
    new_char = string[position].decode("cp1252")
    #new_char = u"_"
    return new_char, position + 1

codecs.register_error("mixed", mixed_decoder)
And on the console:
>>> a = u"maçã ".encode("utf-8") + u"maçã ".encode("cp1252")
>>> last_position = -1
>>> print a.decode("utf-8", "mixed")
maçã maçã
With thanks to jsbueno, a whack of other Google searches, and other pounding, I solved it this way.
#The following works very well but it does not allow for any attempts to FIX the data.
xmlText = unicode(xmlText, errors='replace').replace(u"\uFFFD", "?")
This version allows for a limited opportunity to repair invalid characters. Unknown characters are replaced with a safe value.
import codecs

replacement = {
    '85' : '...', # u'\u2026' ellipsis character
    '96' : '-',   # u'\u2013' en-dash
    '97' : '-',   # u'\u2014' em-dash
    '91' : "'",   # u'\u2018' left single quote
    '92' : "'",   # u'\u2019' right single quote
    '93' : '"',   # u'\u201C' left double quote
    '94' : '"',   # u'\u201D' right double quote
    '95' : "*"    # u'\u2022' bullet
}

# This is more complex but allows for the data to be fixed.
def mixed_decoder(unicodeError):
    errStr = unicodeError[1]
    errLen = unicodeError.end - unicodeError.start
    nextPosition = unicodeError.start + errLen
    errHex = errStr[unicodeError.start:unicodeError.end].encode('hex')
    if errHex in replacement:
        return u'%s' % replacement[errHex], nextPosition
    return u'%s' % errHex, nextPosition  # Comment this line out to get a question mark
    return u'?', nextPosition

codecs.register_error("mixed", mixed_decoder)

xmlText = xmlText.decode("utf-8", "mixed")
Basically I attempt to decode it as UTF-8. For any character that fails, I just convert it to hex so I can display it or look it up in a table of my own.
This is not pretty, but it does allow me to make sense of messed-up data.
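A quick usage sketch (my own example, assuming the mixed_decoder above has been registered):
>>> 'before \x85 after'.decode('utf-8', 'mixed')
u'before ... after'
>>> 'before \x99 after'.decode('utf-8', 'mixed')   # 0x99 is not in the table
u'before 99 after'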
jsbueno's solution above is good, but there is no need for the global last_position variable; see:
import codecs

def mixed_decoder(error: UnicodeError) -> (str, int):
    bs: bytes = error.object[error.start: error.end]
    return bs.decode("cp1252"), error.start + 1

codecs.register_error("mixed", mixed_decoder)

a = "maçã".encode("utf-8") + "maçã".encode("cp1252")
# a = b"ma\xc3\xa7\xc3\xa3ma\xe7\xe3"

s = a.decode("utf-8", "mixed")
# s = "maçãmaçã"
This is usually called Mojibake.
There's a nice Python library that might solve these issues for you called ftfy.
Example:
>>> from ftfy import fix_text
>>> fix_text("Ð¨ÐµÐ¿Ð¾Ñ‚ (напоминалки)")
'Шепот (напоминалки)'
Just ran into this today, so here is my problem and my own solution:
# NOTE: a raw string, so it really contains literal backslash-x sequences
# (as when such text is read in from a log file or similar)
original_string = r'Notifica\xe7\xe3o de Emiss\xe3o de Nota Fiscal Eletr\xf4nica.'

def mixed_decoding(s):
    output = ''
    ii = 0
    for c in s:
        if ii <= len(s) - 1:
            if s[ii] == '\\' and s[ii+1] == 'x':
                # take the 4-character escape, e.g. '\xe7', and decode it
                b = s[ii:ii+4].encode('ascii').decode('unicode-escape')
                output = output + b
                ii += 3
            else:
                output = output + s[ii]
        ii += 1
    print(output)
    return output

decoded_string = mixed_decoding(original_string)
Now it prints:
>>> Notificação de Emissão de Nota Fiscal Eletrônica.