Using the python unicode function

I'm working on a project that compares text.
Here is the relevant piece of code:
def post(self):
    A = unicode(flask.request.form['A'])
    B = unicode(flask.request.form['B'])
I posted large pieces of text from Project Gutenberg and I get errors like this:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 6: ordinal not in range(128)
Based on this page I have tried errors='ignore' and errors='replace' and get the error:
TypeError: decoding Unicode is not supported
If possible I want to be able to accept the most robust set of characters possible. I was hoping there was a Python library that would allow this.
Here is more of the code. I think the problem may occur when I try to turn my input into a string.
C = A.split()
D = B.split()
Both = []
for x in C:
    if x in D:
        Both.append(x)
for x in range(len(Both)):
    Both[x] = str(Both[x])
Final = []
for x in set(Both):
    Final.append(x)
MissingA = []
for x in C:
    if x not in Final and x not in MissingA:
        MissingA.append(x)
for x in range(len(MissingA)):
    MissingA[x] = str(MissingA[x])
MissingB = []

"Here is more of the code. I think the problem may occur when I try to turn my input into a string."

I think that's right: try eliminating the str() calls. Calling str() on a unicode object implicitly encodes it with the ASCII codec, which fails on any character like é that is outside the ASCII range.
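For reference, in Python 2 the errors argument to unicode() only applies when decoding a byte string; calling it on text that is already unicode raises exactly that TypeError. A minimal sketch of the distinction (the byte string here is a hypothetical example):

# -*- coding: utf-8 -*-
raw = 'caf\xc3\xa9'             # a UTF-8 byte string (str in Python 2)
text = raw.decode('utf-8')      # OK: bytes -> unicode, u'caf\xe9'

# unicode(text, errors='ignore')  # TypeError: decoding Unicode is not supported
# str(text)                       # UnicodeEncodeError: 'ascii' codec can't encode...

# Flask form values are typically already unicode, so they can be used directly:
# A = flask.request.form['A']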

Index strings by letter including diacritics

I'm not sure how to formulate this question, but I'm looking for a magic function that makes this code
for x in magicfunc("H̶e̕l̛l͠o͟ ̨w̡o̷r̀l҉ḑ!͜"):
    print(x)
behave like this:
H̶
e̕
l̛
l͠
o͟
̨
w̡
o̷
r̀
l҉
ḑ
!͜
Basically, is there a built-in Unicode function or method that takes a string and outputs an array of glyphs, each with its respective combining characters and diacritical marks, the same way that a text editor moves the cursor over to the next letter instead of iterating over all of the combining characters?
If not, I'll write the function myself, no help needed. Just wondering if it already exists.
You can use unicodedata.combining to find out if a character is combining:
import unicodedata
from typing import Iterable

def combine(s: str) -> Iterable[str]:
    buf = None
    for x in s:
        if unicodedata.combining(x) != 0:
            # combining character: attach it to the current base character
            buf = x if buf is None else buf + x
        else:
            if buf is not None:
                yield buf
            buf = x
    if buf is not None:
        yield buf
Result:
>>> for x in combine("H̶e̕l̛l͠o͟ ̨w̡o̷r̀l҉ḑ!͜"):
...     print(x)
...
H̶
e̕
l̛
l͠
o͟
̨
w̡
o̷
r̀
l
ḑ
!͜
The issue is that COMBINING CYRILLIC MILLIONS SIGN (U+0489) is not recognized as combining: unicodedata.combining returns the canonical combining class, which is 0 for enclosing marks like this one. You could instead test whether COMBINING appears in unicodedata.name(x) for the character; that should solve it.
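Alternatively, a sketch of a category-based test: every Unicode mark category (Mn, Mc, Me) starts with "M", which also covers enclosing marks whose combining class is 0:

import unicodedata

def is_combining(ch: str) -> bool:
    # Mn (nonspacing), Mc (spacing combining) and Me (enclosing) are all
    # mark categories; U+0489 is Me, so it is caught here even though
    # unicodedata.combining() returns 0 for it.
    return unicodedata.category(ch).startswith('M')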
The third-party regex module can search by glyph:
>>> import regex
>>> s="H̶e̕l̛l͠o͟ ̨w̡o̷r̀l҉ḑ!͜"
>>> for x in regex.findall(r'\X', s):
...     print(x)
...
H̶
e̕
l̛
l͠
o͟
̨
w̡
o̷
r̀
l҉
ḑ
!͜

python read file utf-8 decode issue

I am running into an issue with reading a file that has both UTF-8 and ASCII characters. The problem is I am using seek to read only some part of the data, but I have no idea whether I am reading from the "middle" of a UTF-8 sequence.
osx
python 3.6.6
To simplify it, my issue can be demoed with the following code.
# write some UTF-8 to a file
open('/tmp/test.txt', 'w').write(chr(12345)+chr(23456)+chr(34567)+'\n')
data = open('/tmp/test.txt')
data.read() # works fine; just to demo that I can read the file as a whole
data.seek(1)
data.read(1) # UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
# I can read by seeking 3 bytes at a time
data.seek(3)
data.read(1) # this works fine.
I know I can open the file in binary mode and then read it without issue by seeking to any position; however, I need to process the text, so I will end up with the same issue when decoding the bytes into a string.
data = open('/tmp/test.txt', 'rb')
data.seek(1)
z = data.read(3)
z.decode() # will hit the same error
Without using seek, I can read it correctly even when just calling read(1).
data = open('/tmp/test.txt')
data.tell() # 0
data.read(1)
data.tell() # shows 3 even after calling read(1)
One thing I can think of: after seeking to a location, try to read; on UnicodeDecodeError, set position = position - 1 and seek(position) again, until I can read correctly.
Is there a better (right) way to handle it?
As the documentation explains, when you seek on text files:
offset must either be a number returned by TextIOBase.tell(), or zero. Any other offset value produces undefined behaviour.
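That is, the only offsets that are guaranteed to be safe are the opaque values tell() returns; for example:

f = open('/tmp/test.txt')
f.read(1)            # reads one character, however many bytes it takes
pos = f.tell()       # an opaque but reusable offset
f.seek(pos)          # legal: pos came from tell()
f.read(1)            # decodes cleanly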
In practice, what seek(1) actually does is seek 1 byte into the file—which puts it in the middle of a character. So, what ends up happening is similar to this:
>>> s = chr(12345)+chr(23456)+chr(34567)+'\n'
>>> b = s.encode()
>>> b
b'\xe3\x80\xb9\xe5\xae\xa0\xe8\x9c\x87\n'
>>> b[1:]
b'\x80\xb9\xe5\xae\xa0\xe8\x9c\x87\n'
>>> b[1:].decode()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
So, seek(3) happens to work, even though it's not legal, because you happen to be seeking to the start of a character. It's equivalent to this:
>>> b[3:].decode()
'宠蜇\n'
If you want to rely on that undocumented behavior to try to seek randomly into the middle of a UTF-8 text file, you can usually get away with it by doing what you suggested. For example:
def readchar(f, pos):
    # Try successive offsets until a text-mode read succeeds.
    for i in range(pos, pos + 5):
        try:
            f.seek(i)
            return f.read(1)
        except UnicodeDecodeError:
            pass
    # (UnicodeDecodeError takes five arguments, so raise ValueError here)
    raise ValueError('Unable to find a UTF-8 start byte')
Or you could use knowledge of the UTF-8 encoding to manually scan for a valid start byte in a binary file:
def readchar(f, pos):
    # Binary-mode variant: scan for a UTF-8 start byte, i.e. anything
    # outside the continuation range 0x80-0xBF.
    f.seek(pos)
    for _ in range(5):
        byte = f.read(1)
        if byte and (byte[0] < 0x80 or byte[0] >= 0xC0):
            return byte
    raise ValueError('Unable to find a UTF-8 start byte')
However, if you're actually just looking for the next complete line before or after some arbitrary point, that's a whole lot easier.
In UTF-8, the newline character is encoded as a single byte, and the same byte as in ASCII—that is, '\n' encodes to b'\n'. (If you have Windows-style endings, the same is true for return, so '\r\n' also encodes to b'\r\n'.) This is by design, to make it easier to handle this kind of problem.
So, if you open the file in binary mode, you can seek forward or backward until you find a newline byte. And then, you can just use the (binary-file) readline method to read from there until the next newline.
The exact details depend on exactly what rule you want to use here. Also, I'm going to show a stupid, completely unoptimized version that reads a character at a time; in real life you probably want to back up, read, and scan (e.g., with rfind), say, 80 characters at a time, but this is hopefully simpler to understand:
def getline(f, pos, maxpos):
    for start in range(pos - 1, -1, -1):
        f.seek(start)
        if f.read(1) == b'\n':
            break
    else:
        f.seek(0)
    return f.readline().decode()
Here it is in action:
>>> s = ''.join(f'{i}:\u3039\u5ba0\u8707\n' for i in range(5))
>>> b = s.encode()
>>> f = io.BytesIO(b)
>>> maxlen = len(b)
>>> print(getline(f, 0, maxlen))
0:〹宠蜇
>>> print(getline(f, 1, maxlen))
0:〹宠蜇
>>> print(getline(f, 10, maxlen))
0:〹宠蜇
>>> print(getline(f, 11, maxlen))
0:〹宠蜇
>>> print(getline(f, 12, maxlen))
1:〹宠蜇
>>> print(getline(f, 59, maxlen))
4:〹宠蜇

Python RSA Decrypt for class using vs2013

For my class assignment we need to decrypt a message that was encrypted with RSA. We were given code that should help us with the decryption, but it's not helping.
def block_decode(x):
    # Interpret x as a base-95 number and map each digit back to a
    # printable ASCII character (offset 32).
    output = ""
    i = BLOCK_SIZE + 1
    while i > 0:
        b1 = int(pow(95, i - 1))
        y = int(x / b1)
        i = i - 1
        x = x - y * b1
        output = output + chr(y + 32)
    return output
I'm not great with Python yet, but it looks like it is doing something one character at a time. What really has me stuck is the data we were given. I can't figure out where or how to store it, or whether it really is RSA-encrypted data. Below are just 3 of the 38 lines; some lines contain ' or " characters, or even several.
FWfk ?0oQ!#|eO Wgny 1>a^ 80*^!(l{4! 3lL qj'b!.9#'!/s2_
!BH+V YFKq _#:X &?A8 j_p< 7\[0 la.[ a%}b E`3# d3N? ;%FW
KyYM!"4Tz yuok J;b^!,V4) \JkT .E[i i-y* O~$? o*1u d3N?
How do I get this into a string list?
You are looking for the function ord which is a built-in function that
Returns the integer ordinal of a one-character string.
So for instance, you can do:
my_file = open("file_containing_encrypted_message")
data = my_file.read()
to read in the encrypted contents.
Then, you can iterate over each character:
for each_character in data:
    char_val = ord(each_character)
    decoded = block_decode(char_val)
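Since block_decode expects a whole base-95 block value rather than a single character code, the groups in the file probably need to be recombined into integers first. Here is a sketch of that inverse packing, under the assumption that each whitespace-separated group encodes one block; the filename is reused from the snippet above:

def chars_to_block(chars):
    # Inverse of block_decode: treat each character as a base-95 digit
    # (ord(c) - 32), most significant first, and accumulate an integer.
    x = 0
    for c in chars:
        x = x * 95 + (ord(c) - 32)
    return x

with open("file_containing_encrypted_message") as f:
    blocks = [chars_to_block(group) for group in f.read().split()]
# Each integer in blocks could then go through the RSA step and block_decode.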

Python issue with different versions on local machine/server for json-csv conversion [duplicate]

I have a file which is mostly UTF-8, but some Windows-1252 characters have also found their way in.
I created a table to map from the Windows-1252 (cp1252) characters to their Unicode counterparts, and would like to use it to fix the mis-encoded characters, e.g.
cp1252_to_unicode = {
"\x85": u'\u2026', # …
"\x91": u'\u2018', # ‘
"\x92": u'\u2019', # ’
"\x93": u'\u201c', # “
"\x94": u'\u201d', # ”
"\x97": u'\u2014' # —
}
for l in open('file.txt'):
for c, u in cp1252_to_unicode.items():
l = l.replace(c, u)
But attempting to do the replace this way results in a UnicodeDecodeError being raised, e.g.:
"\x85".replace("\x85", u'\u2026')
UnicodeDecodeError: 'ascii' codec can't decode byte 0x85 in position 0: ordinal not in range(128)
Any ideas for how to deal with this?
If you try to decode this string as UTF-8, as you already know, you will get a UnicodeDecodeError, since these spurious cp1252 characters are invalid UTF-8.
However, Python codecs allow you to register a callback to handle encoding/decoding errors, with the codecs.register_error function. It gets the UnicodeDecodeError as a parameter, and you can write such a handler so that it attempts to decode the data as "cp1252" and continues decoding the rest of the string as UTF-8.
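For reference, the handler contract is small: a decode-error handler receives the UnicodeDecodeError and returns a tuple of (replacement text, position at which decoding resumes). A minimal sketch (the handler name "marker" is arbitrary, and it simply replaces rather than repairs):

# -*- coding: utf-8 -*-
import codecs

def marker_decoder(err):
    # err.start/err.end delimit the undecodable byte range; return the
    # replacement text and the position at which decoding resumes.
    return u"?", err.end

codecs.register_error("marker", marker_decoder)
fixed = u"maçã ".encode("cp1252").decode("utf-8", "marker")  # u'ma?? ' with a modern decoder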
In my utf-8 terminal, I can build a mixed incorrect string like this:
>>> a = u"maçã ".encode("utf-8") + u"maçã ".encode("cp1252")
>>> print a
maçã ma��
>>> a.decode("utf-8")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.6/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 9-11: invalid data
I wrote the said callback function here, and found a catch: even if you increment the position from which to decode the string by 1, so that it would start on the next character, if the next character is also not UTF-8 and out of range(128), the error is raised again at the first out-of-range character. That means the decoding "walks back" when consecutive non-ASCII, non-UTF-8 characters are found.
The workaround for this is to have a state variable in the error handler which detects this "walking back" and resumes decoding from the last call to it. In this short example, I implemented it as a global variable (it has to be manually reset to -1 before each call to the decoder):
import codecs

last_position = -1

def mixed_decoder(unicode_error):
    global last_position
    string = unicode_error[1]
    position = unicode_error.start
    if position <= last_position:
        position = last_position + 1
    last_position = position
    new_char = string[position].decode("cp1252")
    # new_char = u"_"
    return new_char, position + 1

codecs.register_error("mixed", mixed_decoder)
And on the console:
>>> a = u"maçã ".encode("utf-8") + u"maçã ".encode("cp1252")
>>> last_position = -1
>>> print a.decode("utf-8", "mixed")
maçã maçã
With thanks to jsbueno and a whack of other Google searches and other pounding I solved it this way.
#The following works very well but it does not allow for any attempts to FIX the data.
xmlText = unicode(xmlText, errors='replace').replace(u"\uFFFD", "?")
This version allows for a limited opportunity to repair invalid characters. Unknown characters are replaced with a safe value.
import codecs

replacement = {
    '85': '...', # u'\u2026' ellipsis
    '96': '-',   # u'\u2013' en-dash
    '97': '-',   # u'\u2014' em-dash
    '91': "'",   # u'\u2018' left single quote
    '92': "'",   # u'\u2019' right single quote
    '93': '"',   # u'\u201C' left double quote
    '94': '"',   # u'\u201D' right double quote
    '95': "*"    # u'\u2022' bullet
}

# This is more complex but allows for the data to be fixed.
def mixed_decoder(unicodeError):
    errStr = unicodeError[1]
    errLen = unicodeError.end - unicodeError.start
    nextPosition = unicodeError.start + errLen
    errHex = errStr[unicodeError.start:unicodeError.end].encode('hex')
    if errHex in replacement:
        return u'%s' % replacement[errHex], nextPosition
    return u'%s' % errHex, nextPosition  # Comment this line out to get a question mark
    return u'?', nextPosition

codecs.register_error("mixed", mixed_decoder)
xmlText = xmlText.decode("utf-8", "mixed")
Basically, I attempt to decode it as UTF-8. For any characters that fail, I just convert them to hex so I can display them or look them up in a table of my own.
This is not pretty, but it does allow me to make sense of messed-up data.
Good solution, that of @jsbueno, but there is no need for the global variable last_position; see:
import codecs

def mixed_decoder(error: UnicodeError) -> (str, int):
    bs: bytes = error.object[error.start: error.end]
    return bs.decode("cp1252"), error.start + 1

codecs.register_error("mixed", mixed_decoder)

a = "maçã".encode("utf-8") + "maçã".encode("cp1252")
# a = b"ma\xc3\xa7\xc3\xa3ma\xe7\xe3"

s = a.decode("utf-8", "mixed")
# s = "maçãmaçã"
This is usually called Mojibake.
There's a nice Python library that might solve these issues for you called ftfy.
Example:
>>> from ftfy import fix_text
>>> fix_text("Ð¨ÐµÐ¿Ð¾Ñ (напоминалки)")
'Шепот (напоминалки)'
Just ran into this today, so here is my problem and my own solution:
# Raw string: the data contains literal backslash escape sequences
# such as \xe7, not already-decoded characters.
original_string = r'Notifica\xe7\xe3o de Emiss\xe3o de Nota Fiscal Eletr\xf4nica.'

def mixed_decoding(s):
    output = ''
    ii = 0
    while ii < len(s):
        if s[ii] == '\\' and ii + 1 < len(s) and s[ii+1] == 'x':
            # Interpret the 4-character escape (e.g. \xe7) as one character.
            output += s[ii:ii+4].encode('ascii').decode('unicode-escape')
            ii += 4
        else:
            output += s[ii]
            ii += 1
    print(output)
    return output

decoded_string = mixed_decoding(original_string)
Now it prints:
Notificação de Emissão de Nota Fiscal Eletrônica.

Replace Specialchars in Python

I need to replace special characters in filenames. I'm trying to do this at the moment with translate, but it's not working well, and I hope you have an idea how to do this. The goal is to produce a clean playlist: the MP3 player in my car is bad and can't handle umlauts or other special characters.
My code so far
# -*- coding: utf-8 -*-
import os
import sys
import id3reader

pfad = os.path.dirname(sys.argv[1]) + "/"
ordner = ""
table = {
    0xe9: u'e',
    0xe4: u'ae',
    ord(u'ö'): u'oe',
    ord(u'ü'): u'ue',
    ord(u'ß'): u'ss',
    0xe1: u'ss',
    0xfc: u'ue',
}

def replace(s):
    return ''.join(c for c in s if (c.isalpha() or c == " " or c == "-"))

fobj_in = open(sys.argv[1])
fobj_out = open(sys.argv[1] + ".new", "w")
for line in fobj_in:
    if (line.rstrip()[0:1] == "#" or line.rstrip()[0:1] == " "):
        print line.rstrip()[0:1]
    else:
        datei = pfad + line.rstrip()
        #print datei
        id3info = id3reader.Reader(datei)
        dateiname = str(id3info.getValue('performer')) + " - " + str(id3info.getValue('title'))
        #print dateiname
        arrPfad = line.split('/')
        dateiname = replace(dateiname[0:60])
        print dateiname
        # dateiname = dateiname.translate(table) + ".mp3"
        ordner = arrPfad[0] + "/" + dateiname
        # os.rename(datei, pfad + ordner)
        fobj_out.write(ordner + "\r\n")
fobj_in.close()
I get this error: UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 37: ordinal not in range(128)
If I try to use translate on the ID3 title, I get: TypeError: expected a character buffer object
If I need to get rid of non-ASCII characters, I often use:
>>> unicodedata.normalize("NFKD", u"spëcïälchärs").encode('ascii', 'ignore')
'specialchars'
which tries to convert characters to their ascii part of their normalized unicode decomposition.
Bad thing is, it throws away everything it does not know, and is not smart enough to transliterate umlauts (to ue, ae, etc).
But it might help you to at least play those mp3s.
Of course, you are free to do your own str.translate first, and wrap the result in this, to eliminate every non-ASCII character still left. In fact, if your replace table is correct, this will solve your problem. I'd suggest you take a look at str.translate and str.maketrans, though.
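For instance, here is a sketch of that combination, in the Python 2 style of the question; the table entries are illustrative, not a complete transliteration:

# -*- coding: utf-8 -*-
import unicodedata

# Illustrative umlaut table; unicode.translate maps ordinals to strings.
table = {
    ord(u'ä'): u'ae', ord(u'ö'): u'oe', ord(u'ü'): u'ue', ord(u'ß'): u'ss',
}

def to_ascii(s):
    s = s.translate(table)  # transliterate the known umlauts first
    # then strip anything non-ASCII that remains via NFKD decomposition
    return unicodedata.normalize('NFKD', s).encode('ascii', 'ignore')

print to_ascii(u'Blümchen - Herzfrequenz')  # Bluemchen - Herzfrequenz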
