python Incorrect formatting Cyrillic

def inp(text):
    tmp = str()
    arr = ['.' for x in range(1, 40 - len(text))]
    tmp += text + ''.join(arr)
    print tmp

s = ['tester', 'om', 'sup', 'jope']
sr = ['тестер', 'ом', 'суп', 'жопа']
for i in s:
    inp(i)
for i in sr:
    inp(i)
Output:
tester.................................
om.....................................
sup....................................
jope...................................
тестер...........................
ом...................................
суп.................................
жопа...............................
Why is Python not handling Cyrillic properly? The right edge of the output is ragged instead of straight. Using string formatting gives the same result. How can this be corrected? Thanks.

Read this:
http://docs.python.org/2/howto/unicode.html
Basically, the text parameter of the inp function receives a str. In Python 2.7, str objects are byte strings by default. Cyrillic characters do not map one-to-one to bytes when encoded in, e.g., UTF-8; each one needs more than one byte (usually 2 in UTF-8), so len(text) returns the number of bytes, not the number of characters.
In order to get the number of characters, you need to know your encoding. Assuming it's UTF-8, you can decode text from that encoding and measure the length of the result, and the output will line up:
#!/usr/bin/python
# coding=utf-8
def inp(text):
    tmp = str()
    utext = text.decode('utf-8')
    l = len(utext)
    arr = ['.' for x in range(1, 40 - l)]
    tmp += text + ''.join(arr)
    print tmp

s = ['tester', 'om', 'sup', 'jope']
sr = ['тестер', 'ом', 'суп', 'жопа']
for i in s:
    inp(i)
for i in sr:
    inp(i)
The important lines are these two:
utext = text.decode('utf-8')
l = len(utext)
where you first decode the text, which results in a unicode string. After that, you can use the built-in len to get the length in characters, which is what you want.
Hope this helps.
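On Python 3 the same idea gets simpler, because str is already a Unicode string and len() counts characters rather than bytes. The following is a minimal sketch, not part of the original answer; the helper name pad_with_dots and the width of 39 (which matches range(1, 40 - l) above) are assumptions made for illustration.

```python
# Python 3 sketch: str holds Unicode characters, bytes holds encoded data.

def pad_with_dots(text, width=39):
    """Return text padded on the right with dots to `width` characters.

    `pad_with_dots` and the default width are illustrative assumptions.
    """
    return text.ljust(width, '.')


word = 'тестер'
print(len(word))                   # length in characters
print(len(word.encode('utf-8')))   # length in bytes once encoded as UTF-8

for item in ['tester', 'om', 'тестер', 'ом']:
    print(pad_with_dots(item))
```

Every padded line then has the same number of characters, whether the text is Latin or Cyrillic.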

Related

Encoding a file with ord function

I'm trying to encode a file and write the encoded output to a new file, but I got this error:
TypeError: ord() expected string of length 1, but int found
My code:
from sys import argv, exit

def encode(data):
    encoded = ''
    while data:
        current = data[0]
        count = 1
        for i in data[1:]:
            if i == current:
                count += 1
            else:
                break
            if count == 255:
                break
        encoded += '{}{}'.format(chr(ord(current) & 255), chr(count & 255)) #error occurs here.
        data = data[count:]
    return encoded

if __name__ == '__main__':
    if len(argv) < 2:
        print('Please specify input file!')
        exit(0)
    with open(argv[1], 'rb') as f:
        data = f.read()
    with open(argv[1] + '.out', 'wb') as f:
        f.write(encode(data))
Additional question: How do I decode the encoded file?

You are reading bytes (open(..., 'rb')), so when you take one element of the byte string in Python 3, you get a byte, i.e. a number. This number already is the character code, so just leave out the ord. Alternatively, you could open the file without the b modifier (open(..., 'r')), which will return a string; I would advise keeping it as a byte string though (otherwise you could run into encoding issues if you are parsing something non-ASCII).
You will run into a similar problem when saving your file: you cannot write a string to a file opened with the b modifier. Since you have characters outside the ASCII range (>= 128), writing as a string is not a good idea, because Python will try to encode your characters (e.g. in UTF-8), and you will end up with completely different bytes. Therefore, the best solution is probably not to concatenate your data into a string in your loop (the part where you do '{}{}'.format(...)), but to collect the values in a list (encoded = [], then encoded.append(current) and encoded.append(count)) and convert that to a byte string using bytes(encoded) after your loop. You can then pass that to write without a problem.
As for how to decode your file, you can open it the same way you do for encoding, read two bytes b1 and b2 at a time, append [b1]*b2 to your output (again, a list), and convert that to a byte string with bytes().
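The approach described above can be sketched in Python 3 as follows. This is a minimal illustration rather than the answerer's own code: the function names rle_encode and rle_decode are hypothetical, and the output format (a value byte followed by a count byte, with runs capped at 255) is inferred from the question's code.

```python
def rle_encode(data: bytes) -> bytes:
    """Run-length encode `data` as (value, count) byte pairs."""
    encoded = []
    i = 0
    while i < len(data):
        current = data[i]          # indexing bytes yields an int in Python 3
        count = 1
        # Extend the run while the next byte matches; cap the run at 255.
        while (i + count < len(data)
               and data[i + count] == current
               and count < 255):
            count += 1
        encoded.append(current)
        encoded.append(count)
        i += count
    return bytes(encoded)          # list of ints in 0..255 -> byte string


def rle_decode(encoded: bytes) -> bytes:
    """Invert rle_encode: expand each (value, count) pair."""
    decoded = []
    for pos in range(0, len(encoded) - 1, 2):
        value, count = encoded[pos], encoded[pos + 1]
        decoded.extend([value] * count)
    return bytes(decoded)


sample = b'aaab' + b'\x00' * 300
packed = rle_encode(sample)
print(packed)
print(rle_decode(packed) == sample)
```

Working on lists of integers and converting once with bytes() avoids any text encoding step, so values of 128 and above are written unchanged.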

Ignore newline character in binary file with Python?

I open my file like so :
f = open("filename.ext", "rb") # ensure binary reading with b
My first line of data looks like this (when using f.readline()):
'\x04\x00\x00\x00\x12\x00\x00\x00\x04\x00\x00\x00\xb4\x00\x00\x00\x01\x00\x00\x00\x08\x00\x00\x00\x00\x00\x00\x00\x18\x00\x00\x00\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00\x04\x00\x00\x00\x05\x00\x00\x00\x06\x00\x00\x00:\x00\x00\x00;\x00\x00\x00<\x00\x00\x007\x00\x00\x008\x00\x00\x009\x00\x00\x00\x07\x00\x00\x00\x08\x00\x00\x00\t\x00\x00\x00\n'
Thing is, I want to read this data four bytes at a time (f.read(4)). While debugging, I realized that when it gets to the end of the first line, it still takes in the newline character \n and uses it as the first byte of the next int I read. I don't want to simply use .splitlines() because some data could have a \n inside and I don't want to corrupt it. I'm using Python 2.7.10, by the way. I also read that opening a binary file with the b parameter "takes care" of the newline/end-of-line characters; why is that not the case for me?
This is what happens in the console as the file's position is right before the newline character:
>>> d = f.read(4)
>>> d
'\n\x00\x00\x00'
>>> s = struct.unpack("i", d)
>>> s
(10,)

(Followed from discussion with OP in chat)
It seems the file is in a binary format and the "newlines" are just data values being misinterpreted. This can happen, for example, when the integer 10 is written to the file, since its byte value 0x0A is the same as \n.
This doesn't mean that a newline was intended, and it probably was not. You can ignore the fact that it is printed as \n and just use it as data.
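To illustrate the point, this small Python 3 sketch packs a few integers and reads them back four bytes at a time. The helper read_int32s and the little-endian '<i' format are assumptions for the example (the question uses the native "i" format); the byte that prints as \n is simply the low byte of the integer 10.

```python
import io
import struct


def read_int32s(stream):
    """Read consecutive 4-byte little-endian ints until the stream ends.

    `read_int32s` is a hypothetical helper for this example.
    """
    values = []
    while True:
        chunk = stream.read(4)
        if len(chunk) < 4:         # end of data (or a truncated record)
            break
        values.append(struct.unpack('<i', chunk)[0])
    return values


raw = struct.pack('<3i', 9, 10, 11)
print(raw)                          # b'\t\x00\x00\x00\n\x00\x00\x00\x0b\x00\x00\x00'
print(read_int32s(io.BytesIO(raw)))
```

Reading fixed-size records with read(4) never treats 0x0A specially, so no newline handling is needed for this kind of data.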

You should just be able to replace the bytes that indicate it is a newline.
>>> d = f.read(4).replace(b'\x0d\x0a', b'') #\r\n should be bytes b'\x0d\x0a'
>>> diff = 4 - len(d)
>>> while diff > 0: # You can probably make this more sophisticated
...     d += f.read(diff).replace(b'\x0d\x0a', b'') #\r\n should be bytes b'\x0d\x0a'
...     diff = 4 - len(d)
...
>>> s = struct.unpack("i", d)
This should give you an idea of how it will work. This approach could mess with your data's byte alignment.
If you really are seeing "\n" in your print of d then try .replace(b"\n", b"")

Python3 print in hex representation

I can find lots of threads that tell me how to convert values to and from hex. I do not want to convert anything. Rather, I want to print the bytes I already have in hex representation, e.g.
byteval = '\x60'.encode('ASCII')
print(byteval) # b'\x60'
Instead when I do this I get:
byteval = '\x60'.encode('ASCII')
print(byteval) # b'`'
Because ` is the ASCII character that my byte corresponds to.
To clarify: type(byteval) is bytes, not string.

>>> print("b'" + ''.join('\\x{:02x}'.format(x) for x in byteval) + "'")
b'\x60'
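The one-liner above can be wrapped in a small helper. This is a sketch for Python 3; the name escaped_hex is made up for the example, while bytes.hex() is a built-in method (Python 3.5 and later) that returns the bare hex digits without the \x prefixes.

```python
def escaped_hex(data: bytes) -> str:
    """Format every byte as \\xNN, mimicking a bytes literal.

    `escaped_hex` is a hypothetical helper name.
    """
    return "b'" + ''.join('\\x{:02x}'.format(b) for b in data) + "'"


byteval = '\x60'.encode('ASCII')
print(escaped_hex(byteval))   # b'\x60'
print(byteval.hex())          # 60
```

Iterating over a bytes object yields integers, which is why no ord() call is needed here.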

See this:
hexify = lambda s: [hex(ord(i)) for i in list(str(s))]
And
print(hexify("abcde"))
# ['0x61', '0x62', '0x63', '0x64', '0x65']
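Another example (note that in Python 3, str() of a bytes object returns its repr, so the output below covers the characters b, ', ` and ' rather than only the byte 0x60):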
byteval='\x60'.encode('ASCII')
hexify = lambda s: [hex(ord(i)) for i in list(str(s))]
print(hexify(byteval))
# ['0x62', '0x27', '0x60', '0x27']
Taken from https://helloacm.com/one-line-python-lambda-function-to-hexify-a-string-data-converting-ascii-code-to-hexadecimal/

Writing to UTF-16-LE text file with BOM

I've read a few postings regarding Python writing to text files but I could not find a solution to my problem. Here it is in a nutshell.
The requirement: to write values delimited by thorn characters (U+00FE, surrounding the text values) and the pilcrow character (U+00B6, after each column) to a UTF-16LE text file with BOM (FF FE).
The issue: the written text file has whitespace between each column that I did not script for. Also, it's not showing up right in UltraEdit: only the first value ("mom") shows. I welcome any insight or advice.
The script (simplified to ease troubleshooting; the actual script uses a third-party API to obtain the list of values):
import os
import codecs
import shutil
import sys

first = u''
textdel = u'\u00FE'.encode('utf_16_le') #thorn
fielddel = u'\u00B6'.encode('utf_16_le') #pilcrow
list1 = ['mom', 'dad', 'son']
num = len(list1) #pretend this is from the metadata profile
f = codecs.open('c:/myFile.txt', 'w', 'utf_16_le')
f.write(u'\uFEFF')
for item in list1:
    mytext2 = u''
    i = 0
    i = i + 1
    mytext2 = mytext2 + item + textdel
    if i < (num - 1):
        mytext2 = mytext2 + fielddel
    f.write(mytext2 + u'\n')
f.close()

You're double-encoding your strings. You've already opened your file as UTF-16-LE, so leave your textdel and fielddel strings unencoded. They will get encoded at write time along with every line written to the file.
Or put another way, textdel = u'\u00FE' sets textdel to the "thorn" character, while textdel = u'\u00FE'.encode('utf-16-le') sets textdel to a particular serialized form of that character, a sequence of bytes according to that codec; it is no longer a sequence of characters:
textdel = u'\u00FE'
len(textdel) # -> 1
type(textdel) # -> unicode
len(textdel.encode('utf-16-le')) # -> 2
type(textdel.encode('utf-16-le')) # -> str
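As a rough Python 3 illustration of the advice above, the delimiters stay as ordinary strings and the file object does all the encoding. The function name write_delimited, the temporary path, and the exact record layout (each value wrapped in thorns, pilcrows between columns, one record per line) are assumptions made for this sketch, not the asker's confirmed format.

```python
import os
import tempfile

THORN = '\u00fe'     # text delimiter, left unencoded
PILCROW = '\u00b6'   # field delimiter, left unencoded


def write_delimited(path, values):
    """Write one record to a UTF-16-LE file that starts with a BOM.

    `write_delimited` and the record layout are illustrative assumptions.
    """
    # newline='' keeps '\n' from being translated on Windows.
    with open(path, 'w', encoding='utf-16-le', newline='') as f:
        f.write('\ufeff')                       # encoded as FF FE
        fields = [THORN + value + THORN for value in values]
        f.write(PILCROW.join(fields) + '\n')


path = os.path.join(tempfile.mkdtemp(), 'myFile.txt')
write_delimited(path, ['mom', 'dad', 'son'])

with open(path, 'rb') as f:
    raw = f.read()
print(raw[:2])                 # the BOM bytes
print(raw.decode('utf-16'))    # the 'utf-16' codec consumes the BOM
```

Because every string is encoded exactly once at write time, no stray bytes appear between the columns.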

Python: How do I compare unicode to ascii text?

I'm trying to convert characters in one list into characters in another list at the same index in Japanese (zenkaku to hangaku moji, for those interested), and I can't get the comparison to work. I am decoding into utf-8 before I compare (decoding into ascii broke the program), but the comparison doesn't ever return true. Does anyone know what I'm doing wrong? Here's the code (indents are a little wacky due to SO's editor):
#!C:\Python27\python.exe
# coding=utf-8
import os
import shutil
import sys
zk = [
'。',
'、',
'「',
'」',
'（',
'）',
'！',
'？',
'・',
'／',
'ア','イ','ウ','エ','オ',
'カ','キ','ク','ケ','コ',
'サ','シ','ス','セ','ソ',
'ザ','ジ','ズ','ゼ','ゾ',
'タ','チ','ツ','テ','ト',
'ダ','ヂ','ヅ','デ','ド',
'ラ','リ','ル','レ','ロ',
'マ','ミ','ム','メ','モ',
'ナ','ニ','ヌ','ネ','ノ',
'ハ','ヒ','フ','ヘ','ホ',
'バ','ビ','ブ','ベ','ボ',
'パ','ピ','プ','ペ','ポ',
'ヤ','ユ','ヨ','ヲ','ン','ッ'
]
hk = [
'｡',
'､',
'｢',
'｣',
'(',
')',
'!',
'?',
'･',
'/',
'ｱ','ｲ','ｳ','ｴ','ｵ',
'ｶ','ｷ','ｸ','ｹ','ｺ',
'ｻ','ｼ','ｽ','ｾ','ｿ',
'ｻﾞ','ｼﾞ','ｽﾞ','ｾﾞ','ｿﾞ',
'ﾀ','ﾁ','ﾂ','ﾃ','ﾄ',
'ﾀﾞ','ﾁﾞ','ﾂﾞ','ﾃﾞ','ﾄﾞ',
'ﾗ','ﾘ','ﾙ','ﾚ','ﾛ',
'ﾏ','ﾐ','ﾑ','ﾒ','ﾓ',
'ﾅ','ﾆ','ﾇ','ﾈ','ﾉ',
'ﾊ','ﾋ','ﾌ','ﾍ','ﾎ',
'ﾊﾞ','ﾋﾞ','ﾌﾞ','ﾍﾞ','ﾎﾞ',
'ﾊﾟ','ﾋﾟ','ﾌﾟ','ﾍﾟ','ﾎﾟ',
'ﾔ','ﾕ','ﾖ','ｦ','ﾝ','ｯ'
]
def main():
    if len(sys.argv) > 1:
        filename = sys.argv[1]
    else:
        print("Please specify a file to check.")
        return
    try:
        f = open(filename, 'r')
    except IOError as e:
        print("Sorry! The file doesn't exist.")
        return
    filecontent = f.read()
    f.close()
    #y = zk[29]
    #print y.decode('utf-8')
    for f in filecontent:
        for z in zk:
            if f == z.decode('utf-8'):
                print f
    print filename

if __name__ == "__main__":
    main()
Am I missing a step?

Several.
zk = [
u'。',
u'、',
u'「',
...
...
f = codecs.open(filename, 'r', encoding='utf-8')
...
I'll let you work out the rest now that the hard work's been done.

Make sure that zk and hk lists contain Unicode strings. Either use Unicode literals e.g., u'a' or decode them at runtime:
fromutf8 = lambda s: s.decode('utf-8') if not isinstance(s, unicode) else s
zk = map(fromutf8, zk)
hk = map(fromutf8, hk)
You could use unicode.translate() to convert characters in one list into characters in another list at the same index:
import codecs
translation_table = dict(zip(map(ord,zk), hk))
with codecs.open(sys.argv[1], encoding='utf-8') as f:
    for line in f:
        print line.translate(translation_table),
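The same translation-table idea carries over to Python 3, where string literals are already Unicode and str.translate() accepts a dict keyed by code point. The sketch below uses only a four-entry excerpt of the question's lists, and the function name to_hankaku is an assumption for illustration.

```python
# Small excerpt of the question's lists; the full tables work the same way.
zk = ['。', '、', 'ア', 'ザ']
hk = ['｡', '､', 'ｱ', 'ｻﾞ']

# Keys must be code points (ints); values may be strings of any length,
# which is what lets one full-width kana map to a kana plus a voicing mark.
translation_table = {ord(z): h for z, h in zip(zk, hk)}


def to_hankaku(text):
    """Replace every character found in zk with its hk counterpart.

    `to_hankaku` is a hypothetical helper name.
    """
    return text.translate(translation_table)


print(to_hankaku('アザ。'))
```

Characters that are not in the table pass through unchanged, so mixed text is safe to translate.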

You need to convert everything to the same form, and that form is Unicode strings. Unicode strings have no encoding in the sense of .encode() or .decode(). A non-Unicode string is actually a sequence of bytes that expresses the value in some encoding. To convert it to Unicode, you have to .decode() it. To store a Unicode string as a sequence of bytes, you have to .encode() the abstraction into concrete bytes.
So, when loading Unicode strings from a UTF-8 encoded file, either you read it into old-style strings (non-Unicode, sequences of bytes) and then call .decode('utf-8'), or you use codecs.open(..., encoding='utf-8'), in which case you get Unicode strings automatically.
The form # coding=utf-8 is not the usual one, but it is OK, provided that the editor (I mean the tool that you use to write the text) also saves the file as UTF-8. Then the old-style strings are displayed correctly by the editor, and they should be .decode('utf-8')d to get Unicode. Old-style strings with only ASCII characters in the same source can also be converted to Unicode using .decode('utf-8').
To summarize: you decode from bytes to Unicode, and you encode Unicode strings into a sequence of bytes. It seems from the question that you are doing the opposite.
The following is completely wrong:
for f in filecontent:
    for z in zk:
        if f == z.decode('utf-8'):
            print f
because the filecontent is the result of f.read(). This way it is a sequence of bytes. The f in the loop is one byte. The z.decode('utf-8') returns one Unicode character. They cannot be compared. (By the way, the f is a kind of misleading name for a byte value.)
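The difference between looping over bytes and looping over characters can be seen in a short Python 3 sketch. The sample text and the names targets, byte_hits and char_hits are invented for the example; in Python 3, iterating over bytes yields integers, so the mismatch is even more visible than in Python 2.

```python
targets = {'ア', '。'}

# Stands in for file content that was read as raw bytes.
raw = 'アb。'.encode('utf-8')

# Iterating over bytes gives ints, which never equal a one-character str.
byte_hits = [b for b in raw if b in targets]

# Decode once, then iterate over real characters.
text = raw.decode('utf-8')
char_hits = [ch for ch in text if ch in targets]

print(len(raw), len(text))
print(byte_hits)
print(char_hits)
```

Decoding the whole content before the loop is what makes the per-character comparison meaningful.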
