How do I get this to encode properly? - python

I have an XML file with Russian text:
<p>все чашки имеют стандартный посадочный диаметр - 22,2 мм</p>
I use xml.etree.ElementTree to manipulate it in various ways (without ever touching the text content). Then I use ElementTree.tostring:
info["table"] = ET.tostring(table, encoding="utf8") #table is an Element
Then I do some other stuff with this string, and finally write it to a file
f = open(newname, "w")
output = page_template.format(**info)
f.write(output)
f.close()
I wind up with this in my file:
<p>\xd0\xb2\xd1\x81\xd0\xb5 \xd1\x87\xd0\xb0\xd1\x88\xd0\xba\xd0\xb8 \xd0\xb8\xd0\xbc\xd0\xb5\xd1\x8e\xd1\x82 \xd1\x81\xd1\x82\xd0\xb0\xd0\xbd\xd0\xb4\xd0\xb0\xd1\x80\xd1\x82\xd0\xbd\xd1\x8b\xd0\xb9 \xd0\xbf\xd0\xbe\xd1\x81\xd0\xb0\xd0\xb4\xd0\xbe\xd1\x87\xd0\xbd\xd1\x8b\xd0\xb9 \xd0\xb4\xd0\xb8\xd0\xb0\xd0\xbc\xd0\xb5\xd1\x82\xd1\x80 - 22,2 \xd0\xbc\xd0\xbc</p>
How do I get it encoded properly?

You use
info["table"] = ET.tostring(table, encoding="utf8")
which returns bytes. Later you pass that value to a format string, which is a str (unicode), so the output ends up containing the representation of the bytes object rather than the text.
etree can return a unicode string instead if you use:
info["table"] = ET.tostring(table, encoding="unicode")

The problem is that ElementTree.tostring returns a bytes object and not an actual string. One fix is to decode it:
info["table"] = ET.tostring(table, encoding="utf8").decode("utf8")
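To make the difference concrete, here is a minimal sketch assuming Python 3. The shortened sample element and the variable names (as_bytes, as_text, broken, fixed) are made up for illustration; only ET.tostring and str.format come from the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the question's `table` element.
table = ET.fromstring("<p>все чашки - 22,2 мм</p>")

as_bytes = ET.tostring(table, encoding="utf-8")    # bytes object
as_text = ET.tostring(table, encoding="unicode")   # str object

# Formatting bytes into a str template inserts the repr of the bytes,
# which is where the literal \xd0\xb2... sequences come from.
broken = "{table}".format(table=as_bytes)

# Formatting a str keeps the actual characters.
fixed = "{table}".format(table=as_text)
```

When writing the result, passing an explicit encoding, as in open(newname, "w", encoding="utf-8"), avoids depending on the platform's default encoding.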

Try this, with output being just the Russian text as a unicode string (not UTF-8 encoded bytes):
import codecs
#output=u'все чашки имеют стандартный посадочный диаметр'
with codecs.open(newname, "w", "utf-16") as stream: #or utf-8
    stream.write(output + u"\n")

Related

Hex String to Image File from varbinary(max)

I have a table in a database which stores image files in a varbinary(max) column. I would like to extract, convert and save the image file. To extract it, I cast the column to varchar(max):
cast([IMG_FILE] as varchar(max))
The result of this cast looks like a hex string (I've removed part of string to protect the privacy of the person):
\
I tried to use this hex string in an online tool (https://codepen.io/abdhass/full/jdRNdj), and the image is displayed correctly (remember that I've cut part of the string to preserve the person's privacy):
Then I tried to convert this hex string to an image file using python3. I've tried a lot of things (most of them found here), but so far I couldn't save a correct file.
Saving directly doesn't generate the image.
with open(photo_path + 'file.jpg', 'wb') as new_jpg:
    new_jpg.write(hexString)
Using binascii.unhexlify returns "Non-hexadecimal digit found"
binascii.unhexlify(hexString)
Converting to int/bin returns invalid literal for int() with base 16:
bin(int(hexString, 16))[2:]
How can I solve this problem? That is, I would like to take this hex string and save an image file on my computer.
If I have a string without \x, I can convert every two characters to an integer value, create a bytearray and save it:
text = ''
integers = []
while text:
    value = int(text[:2], 16)
    integers.append(value)
    text = text[2:]
data = bytearray(integers)
with open('output.jpg', 'wb') as fh:
    print(fh.write(data))
If I have a string with \x, the first \xff is treated as a single character (an escape sequence), so I have to use ord() to convert it to an integer.
text = '\xffd8ffe000104a46494600010100000100010000fffe003b43524541544f523a2067642d6a7065672076312e3020287573696e6720494a47204a50454720763632292c207175616c697479203d2037350affdb004300080606070605080707070909080a0c140d0c0b0b0c1912130f141d1a1f1e1d1a1c1c20242e2720222c231c1c2837292c30313434341f27393d38323c2e333432ffdb0043010909090c0b0c180d0d1832211c213232323232323232323232323232323232323232323232323232323232323232323232323232323232323232323232323232ffc000110801e0018003012200021101031101ffc4001f0000010501010101010100000000000000000102030405060708090a0bffc400b5100002010303020403050504040000017d01020300041105122131410613516107227114328191a1082342b1c11552d1f02433627282090a161718191a25262728292a3435363738393a434445464748494a535455565758595a636465666768696a737475767778797a838485868788898a92939495969798999aa2a3a4a5a6a7a8a9aab2b3b4b5b6b7b8b9bac2c3c4c5c6c7c8c9cad2d3d4d5d6d7d8d9dae1e2e3e4e5e6e7e8e9eaf1f2f3f4f5f6f7f8f9faffc4001f0100030101010101010101010000000000000102030405060708090a0bffc400b51100020102040403040705040400010277000102031104052131061241510761711322328108144291a1b1c109233352f0156272d10a162434e125f11718191a262728292a35363738393a434445464748494a535455565758595a636465666768696a737475767778797a82838485868788898a92939495969798999aa2a3a4a5a6a7a8a9aab2b3b4b5b6b7b8b9bac2c3c4c5c6c7c8c9cad2d3d4d5d6d7d8d9dae2e3e4e5e6e7e8e9eaf2f3f4f5f6f7f8f9faffda000c03010002110311003f00f7676c606d18c0ea334ddf8ecbff007cd39f185e7f845338f5cd4bb8c5de40fbaa3fe034d2e78e17fef914eebd3a5340cd1d01313cc3d36aff00df229371ecabff007c8a5231d2914366a6ed0d88243fdd5ffbe452ef20636a7fdf228c61b19a1ba8a1b121379c1ca271fec8a6f9a48e113fef914ee71d6902e471473581ec024207089ff7c8a42e7b2aff00df229083d05041f5a24f99d90262190a8fba9ff7c8a6f9849190bff7c8a561d39cd23039cd2d7a0085f031b53fef914d121e8427fdf229db78dd9a611cd2d52d07a5c56705c602ff00df2290ccd9c6d4ff00be4518fce9a179343040643d36affdf229a642171b53fef914bef4d61934936210c8c31f2a7fdf229a643fdd5ffbe45291c719a0a9229df5b8fd06994f4da9ff007c8a634ac38dabff007c8a52bd293193d687a8ae233f1d17
1fee8a42e71d17fef914e6038c1cd308c8a86f41a1bbdb00617fef914d676c636aff00df229d8f7a61269a62b06ef970557fef9151ef6033b57fef914f24e0d37195c114ef71dc6bb96c7cabff007c8a432118185ffbe4538819f6a6e016a96f501a58e7eeaffdf229a5db7745ff00be4539b03a1a615ee0d1cce3a0c6ef2011b57fef914d0c7d17fef9141fad2018f5a96f415c52e76e36affdf229be6606085ffbe452b76a8ce0e0f38a2d7455f5144879e17fef9151b16c7dd5ff00be4548704719151382070c69abf526fa15dc9cfdd5ff00be4522363b2ffdf2295f205469cd26c49dc9fcc6c636afb7ca283237a2ff00df22902e7192683c526afb957d86b139e8bff7c8a371e0e17fef9a70e4649a61f41569f98842fcf45ffbe453d5b7c4f9000c0edef4c200e94e8c651f24f4feb431791e8cc01c67fba29360fc2a43dbe94801c62ba04479c0c0a403d0d49b71de82075c54bd1e8047b493d38a0549819c83499ed498c888e71da9b8e7bd4d8079a4e33d68695f4023dbc74a4c10b8c91529eb8ce69a49c63934ec85a918000e4d379ed521507bd2600e452d2d6434bb8c2a7bd21008e49a93934d2bef495c08f0318e69a4735285c0e4d371f5a56761dc6601ee69303d4d3c820f34d2bde86b5b8911ed1d3b5260f41526063bd23151f74934f60647d38a6924f7a70cd0400696c088b83de902d48401ce6a3660bd5b152ec03700714de808aad2dec68df7c71eb5564d6add7ab8cfb5351e657173a5a17cae7b9a36918f4ace4d66d9cffac03eb436b100e932fe747237b20724b7341867a66a307b66a9c7abc0cdb44c326a65ba46e323eb4493ec11922561900534a91eb8a72c8a69c5b90012452bad8a202a0fd28381c0cd48cbe9d29a411d6a795010b2827ad0063d85388f9860d29c639a1bf20b11300718a6e0edc53c8cf20d3704f4a4af14090cdb918cd46e3039353e2a365c8a399ee162948d938a8d783562551ce0557546dd424ac277e8591961405e2841814fc6053e55d4688f0aa29ac722a4007f11e69a403d2a7975125d488e3a629ea0847c7a0fe74a149e7a53c708ff004feb54a36d4773d0db181f4a074eb4a7b7d290574589131ef46294d203c53b2189ed40e99a70028153cb740348a6ed1ef4fe28346880600077a4c11df8a7e28a16c0c8f68c505401c53f6f7cf148dd695900cda09e4d211da9fb79e4d21e33536ea806119a691934fcf3487079a76ee172365c91cd211818e952718a630cf7a997604308c0c03c5340515260014d2bc668f31a23200a85e403a9a49ee1501f9b8ae6b58f11dad846c4c9f30a23173d8994d47566bdcde2c6092c063d6b95d5fc5f05aa952ca71ef5c0f883e20e4be1be5ed835e6ba9789aeaf9d88f941f7aeba54a0a3
766337397c27a36b9e3d241f29f1f8d7232f8c6e8b191676c7a66b93dcf28def2135195ddc838acf9aced1d098d1b3bc99d4ff00c26b7f21ff005ac07d6abcde30bfdf949dc7a8cd736a769c1a693939c50a524ef73a9f2f2d91d647e33d45006331c7d6ba2d1fe225ce4099d881d6bcc2aca4e513e5f94d5aa925e6651a315767d13a5f8cadaee252b2e4f7aeaec7515b98c152bcfbd7cad61ab5c5a4c1a3948f519eb5dbe87f1027b670928e3d41a9f62aa2e65a14ef17eeea7d04a463ef0a4233d49ae0b44f18c37e5433ed26bb3b6bd8e75051b7715cd38d8719296c4c579143026a45f9f9348eac0641e2a514c8f6fa7029a40031ba9f83c734854679a4eedea043823a1a69e739a94e33c1a615fad16b0fc8a9228a8d5483c1a9a5003707351e08a495d581d895464629714a99f5a083db9a6d7612b11b73d79a4cf1c715260e6908e714d30b0c1d39a545051c7b0fe74ec63922950e56407d3fad55fa01e80c3a7d293da949e9f4a4e86ba0421f4a31ef4b4829083a714734b4668630c51d28a0000d0f4013149c8a773499a5600c66931834bda8a9d980d2bcd210318079a791e949b78e4d0047b78eb49803a53f6e3bd348ed52d8c65260639a7851de98dd7da9598ba8c6031ed54e7ba11a9cb60517b7ab164ee000af35f1978a1adcb2452ede3b1ada9d273d889cf976347c4be26b7b457559b0476af14f13f8a26ba91d6397afa1aced6b5fb8bbb8649a46e7eed614a8001963b8f352deb62e8c1cfdf90af70f71018db9e725aaa84cf19a452c0100d27dd6cf5a6958d5b5bd8937793f2900d46f206ed8a473939a6d34ba99b7d828141a314c91c8477141618c629a0e0d14157d0703ed4e572ad9076fd2a3a2815d9b1a6eb773672ae18919af5ef09f8ba2b88d10cbf37a5785e48e86af699a94da75d2ca8c719e4538422fdd9688c9c1a7cd13eb0b4bd49e31b5b922ade094af29f09f8ba3bb31aac849e339af5082713a2b29e08ac2709537666b1929ec4b8ed9a4239c669e17d4d042f1827359b5a94884a1ce3148460e2a43b9bbe714c6f9693f20b6a539940622a2079c5589706a118f4fc695da0f21e9c7069eb919c522e0e3e6cd4b803a1a7b088f1914c0073cf22a5c0cf34dc00738a1ad6e0b61873b7ad2c60857ebd07f3a5dbd48a55dc11ce7b0fe7556ea80ef5bb7d29bdb14e6edf4a4ae9243b6281c503ad0681876c51db1477a3bd0213be28f6a5a08a000fa5276a5a3bd2b0076a4c714a7eb476a76b0c4f5148697bd045660371410734b9e682734b96c0467e5fa551bbb95894f356e66db19e7a5717e25d51a0b497071c75a50d7414df2ab9cd78d7c5f1d8a3471b027d735e23acf88a4d46566790e074029fe2dd5e4bbbd740f900fad73182a37706bae724a
3c9131a51937cecb25bcf21dcfe350ce36b0c396146c326020fcaa55b296460a4115cd751ea77c94a5d0a59a39abeda5caa326a06b4901c053f88a6aa45eccc7d9c8af454be4b02734d319155742e576b8ca4a52a451834c9128a28a005a01a28c7340077a706229761a530b0ed4ae8a519762ee93aa4fa65da4b1390b9f987ad7bcf84bc4d1dfc119593823a1af9e30548ae87c37aecda7dea0f30aa67a55b9732b313f755cfa962963950609a918818c571fe1bd716ea2521f7715d746448809ae59e9a3634efaa0c0e7b5464362a52bef4846062b392b2d0ab3dd95258cf7aae0f241e055bb824f7aa6793823f1a6924ee4dee4a80638a93b535146060d498c1da47e349eaf41b637b734cf518a93be0d34a91c814db698ad713660511802393e83f9d29fad3907cae4fa0fe745f41eccee58703e9494a7b7d290f15d64074a3a504d1de801314bda8a0f4a0618e28a3b518a00314514502000628ed8a5c521a560128a5e31450d0c4c519e0d29e941e10e6a5ad00ccd4a4f2e1246315e39f10b5ef22d6684100b29e6bd33c4774d0c2c41e315f3af8e6fcdddcb82c0807079a984b91f31369549282382690ccecce4e7b1a96d2d1e7942004e7d7a540ff7f6818adfd1d0aa82cd914abd4718b91d14e9ddf2a5b1a5a7e86a8a3705e7a915a83478970579ab369e4ac39cfd6a413a76af0aa55ab2773b13e5d119d269e8010462b367b34f73f856ccf3ef6212a8cae429c8c0aba72982bee667d8e0f2cef8c7b5567b18d81250afd2b48b231e7a52336480bfad752a9240a16462bd826dc8e9555ecdb38515bb28557dbc54261dc8f81c8e95b46b48538dd6a60b5b32f0460d362b7dcc41ad10e76ec75c9f5a6e571854c9f5ae8f68f617b185d3b6840205098da2905ba93f2e39ab009c95dbf8d34a283d714b999a4a104d6831ad9d0659463da9de5ee5ce0f4a91d8ecfbd9a4492458b8a57614d3bbe633a48ca37351862a4107906af5c4259039618aa2c306ba212ba386b53e591e9fe00d564dca8ec79af6eb09fcd814af422be78f025c289b637506bdf7459035aa63d2b9eaa49e88c29c3951b5b7e514c2083d6a4ec29a473cd4a7a17b15a6191554801aaeca7e5354c8c9a6a4d6c0da1e83d2a6078a8d3afb54c38031d296ec345a91f6e39a42c777b548700f069b81914df611195a910028f9f41fce9a7ae29c87f76ff41fcea40ed8f403da93bd2b76fa52576884c76a53d28a3b5020c7140eb8a0d140c38a3bf147d6814083b52e78c5277a281876a4141345020c52f6c514679a0031da9929db19a79351ceb9898fa54caf60381f165c1585f7600c57cd9e2194b6a3721c900b715f4278cc3341213d81af9d3c452017854f27359a4ddac441be666282cec3daba2d2c648504f2
2b0202049c9c0ae97492b8c1fc2b2c53b44edc3adce8208418301c83de9aca51701b8a9229422e579f5aaf24aace49c827a578cb99b3a9a4d5c8598231c3735565bb6fbbb43525c3fcdc1aafbf6c6db860f635d5082dd9297ba364726e1146429a9d8a6e0031355a396e0e3f761aa54573212ca466b592b0e1252f7424453306627e9eb434721ced6280f6a74aac17383ed512a4b2725cfd285b5ee5db5d0a8ec1414da091dea247081976edcd5d36c77161d7bd40f6eccd8ad949131e5bd9958c9c1403f1a74414fdf19a9becf818c73eb519421f6e0d55d3d8bbab0d7556caa80bef51a02b959188fa55a11c663225e0f6a3ecb1ba125d81edc53524b7314efa14278c795f23122a8739c56c4b6fb21c2f35972aed278eb5b5295cc2bd3b2b9d1f8324db7d8248e6be85f0ecb9b7418ea2be6ff0b4c22d4d3271935f4678625530a1073c56357994fc8c15ba6e74e14e3a527be69fe664f14c391ef46b6021917208aaa54038356e56238aaac79eb4d790ada8a9c0c54aa78a621ce2a618c74a76e50d861a41d6a461f2f0298471cf152ddec3e846d9dd8a7a0da8ff0041fce838cd3971b1fe83f9d0c0ecdba0fa537da95bb7d28aec2043e94bda93bd2f7a061da8068ef4628101e4d1d28a281873d293da97a0a2810518e29334b40c00a4a7629a68105046518514a7ee9a996c079ef8ce3416f2640e86be63f1044dfda92c871b43702bea1f18dbef864e07435f3278ad366aac07af22953dac42b29ea6270f90a315ada4b3f98a8188ac78ced6cf6addd3514ca8c8c49c56588d22ceea366d3674510645209e290c0eec5bb76ab36f1798b935656061f285fcebc3752cce8e5d6e658b3129da78a73d80520655beb566e008589dea3f1aa12dca93c30fceb48b94b54529ae83bcb58895db83ed51b3739e955e4bbf9b99302a27bb4f3026e24915b2a7264a945487cd2bb0c7040a8fcc60463bd33ce0723207d0d264e460d6aa36d0b7364f23055f94f5eb9a58e48c00acb93ea6a84d2beff5aad2ddc8770dd8c7a55aa4da26d6dcb525c33ccea170a0f14d424b162df8566c770fbf96a735f1070056dec9ad110a50b6a5d2a6460ac6ac44932f2d82a2b23edac1f76734f8f5173260b10b4dd295b412ab052d4d6914b9e00c5646a518598053c62b421bc8647f2c125'
integers = []
value = ord(text[0])  # the first character is \xff itself
integers.append(value)
text = text[1:]
while text:
    value = int(text[:2], 16)
    integers.append(value)
    text = text[2:]
data = bytearray(integers)
with open('output.jpg', 'wb') as fh:
    print(fh.write(data))
Because the string in your question is incomplete, it creates an incomplete image.
But with the data from the link it creates a correct JPG file.
EDIT:
It seems you have a raw string, so \x is treated as two ordinary characters rather than as part of the byte \xff, and you have to remove the \x at the start using text = text[2:]
text = r'\xffd8ff...'
integers = []
text = text[2:]
while text:
    value = int(text[:2], 16)
    integers.append(value)
    text = text[2:]
data = bytearray(integers)
with open('output.jpg', 'wb') as fh:
    print(fh.write(data))
EDIT:
A simpler version uses the standard module codecs. You still need to remove \x from the string.
If you have bytes:
import codecs
text = b'\\xffd8ff...' # bytes
text = text[2:] # remove `\x`
data = codecs.decode(text, 'hex_codec')
with open('output.jpg', 'wb') as fh:
    fh.write(data)
If you have a string, you first have to encode() it to bytes:
import codecs
text = '\\xffd8ff...' # string
text = text.encode() # bytes
text = text[2:] # remove `\x`
data = codecs.decode(text, 'hex_codec')
with open('output-1.jpg', 'wb') as fh:
    fh.write(data)
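As a further variation on the approaches above, Python 3's built-in bytes.fromhex can do the hex-to-bytes conversion in one call. This is only a sketch: the helper name hex_text_to_bytes and the short sample (just the first ten bytes of a JPEG header) are made up for illustration, and it assumes the text starts with a literal backslash followed by x, as in the raw-string case.

```python
def hex_text_to_bytes(text):
    """Convert text like r'\\xffd8ff...' (literal backslash-x prefix) to bytes."""
    if text.startswith("\\x"):
        text = text[2:]          # drop the literal `\x` prefix
    return bytes.fromhex(text)   # raises ValueError on non-hex characters

# Hypothetical, shortened sample: the start of a JPEG/JFIF header.
sample = r"\xffd8ffe000104a464946"
data = hex_text_to_bytes(sample)
```

The resulting bytes can then be written with open('output.jpg', 'wb') exactly as in the snippets above.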

Python Hex values in ascii encoded string

I have a problem in Python reading a string from a .txt file.
The file contains this data: \xce\x97
It is encoded in ASCII (similar to "\\xce\\x97" as a Python string).
I want to convert it to UTF-8 encoding.
file = open("file.txt", "r")
a = file.read() #a = "\\xce\\x97"
file.close()
The correct value of this string is: "Η" (it's a Greek letter, capital "η")
I can use
>>> a = b'\xce\x97'
>>> print(a.decode("utf-8"))
Η
How can I do it using the variable a?
You can undo the escaping first and then decode the resulting bytes as UTF-8:
a = "\\xce\\x97"
print(a.encode().decode('unicode-escape').encode("latin-1").decode('utf-8'))
Η
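The one-liner is easier to follow when split into steps. The following sketch assumes Python 3; the intermediate names (raw, chars, data, result) are made up for illustration.

```python
a = "\\xce\\x97"                      # 8 characters: backslash, x, c, e, ...

raw = a.encode("ascii")               # same 8 characters, now as bytes
chars = raw.decode("unicode_escape")  # escapes interpreted: U+00CE, U+0097
data = chars.encode("latin-1")        # code points 0-255 map 1:1 to bytes
result = data.decode("utf-8")         # the two bytes form one UTF-8 character
```

The latin-1 step works because that codec maps each code point below 256 to the byte with the same value, which turns the two interpreted characters back into the original two bytes.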

Python: Reading from two array of tuples at a time and placing them side-by-side on CSV file

So I have two arrays of tuples that are arranged with Restaurant Name and an Int:
("Restaurant Name", 0)
One is called arrayForInitialSpots, and the other is called arrayForChosenSpots. What I want to do is write the tuples from both arrays side by side in a csv file like this:
"First Restaurant in arrayForInitialSpots",0,"First Restaurant in arrayForChosenSpots",1
"Second Restaurant in arrayForInitialSpots",0,"Second Restaurant in arrayForChosenSpots",0
So far I've tried doing this:
import csv
with open('data.csv','w') as out:
    csv_out=csv.writer(out)
    csv_out.writerow(['Restaurant Name','Change'])
    for x, y in zip(arrayForInitialSpots, arrayForChosenSpots):
        csv_out.writerow(x + y)
        #csv_out.writerow(y)
But I get:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 3-6: ordinal not in range(128)
If I remove the zip function, I get "too many values to unpack". Any suggestions? Thank you very much in advance.
There are two things you could use to handle non-ASCII characters while writing to files:
Set the default encoding to utf-8 (Python 2 only):
import sys
reload(sys).setdefaultencoding("utf-8")
Use the unicodecsv writer to write data to files:
import unicodecsv
With mhawke's help, here is my solution:
with open('data.csv','w') as out:
    csv_out=csv.writer(out)
    csv_out.writerow(['Restaurant Name','Change'])
    for x, y in zip(arrayForInitialSpots, arrayForChosenSpots):
        list_ = [str(word).decode("utf8") for word in (x+y)]
        counter = 0
        while counter < len(list_):
            s=""
            for i in range(counter,counter+4):
                s+=list_[i].encode('utf-8')
                s+=","
            counter = counter + 4
            csv_out.writerow(s[:-1])
The problem is not due to your use of zip() - that looks OK, but instead it is an encoding issue. Probably the restaurant names are unicode strings or in some encoding other than ASCII or UTF8? ISO-8859-1 perhaps?
The csv module does not handle unicode; other encodings might work, but it depends. The module does handle 8-bit values OK (except ASCII NUL), so you should be able to encode them as UTF8 like this:
import csv
ENCODING = 'iso-8859-1' # assume strings are encoded in this encoding
def to_utf8(item, from_encoding):
    if isinstance(item, str):
        # byte strings are first decoded to unicode
        item = unicode(item, from_encoding)
    return unicode(item).encode('utf8')
with open('data.csv', 'w') as out:
    csv_out = csv.writer(out)
    csv_out.writerow(['Restaurant Name', 'Change'] * 2)
    for x, y in zip(arrayForInitialSpots, arrayForChosenSpots):
        csv_out.writerow([to_utf8(item, ENCODING) for item in x+y])
This works by converting each element of the tuple formed by x+y into a UTF-8 string. This includes byte strings in other encodings, as well as other objects such as integers that can be converted to a unicode string via unicode(). If your strings are unicode, just set ENCODING to None.
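The code above targets Python 2. For comparison, here is a minimal sketch of the same side-by-side layout assuming Python 3, where the csv module works with str directly and the encoding is chosen when the file is opened. The function name write_side_by_side and the sample restaurant names are made up for illustration.

```python
import csv
import io

def write_side_by_side(out, initial, chosen):
    """Write paired (name, int) tuples from two lists on the same CSV row."""
    writer = csv.writer(out)
    writer.writerow(["Restaurant Name", "Change"] * 2)
    for x, y in zip(initial, chosen):
        writer.writerow(x + y)   # tuple concatenation gives four columns

# In-memory demo with hypothetical data; for a real file, open it with
# open("data.csv", "w", encoding="utf-8", newline="") instead.
buf = io.StringIO(newline="")
write_side_by_side(buf, [("Caf\u00e9 Uno", 0)], [("Caf\u00e9 Dos", 1)])
```

Opening the file with newline="" follows the csv module's documented recommendation and lets the writer control line endings itself.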
I'd suggest using numpy:
import numpy as np
IniSpots=[("Restaurant Name0a", 0),("Restaurant Name1a", 1)]
ChoSpots=[("Restaurant Name0b", 0),("Restaurant Name1b", 0)]
c=np.hstack((IniSpots,ChoSpots))
np.savetxt("data.csv", c, fmt='%s',delimiter=",")

Zeroes appearing when reading file (where there aren't any)

When reading a file (UTF-8 Unicode text, csv) with Python on Linux, either with:
csv.reader()
file()
the values of some columns get a zero as their first character (there are no such zeroes in the input), and others get a few zeroes. These zeroes are not seen when viewing the file with Geany or any other editor. For example:
Input
10016;9167DE1;Tom;Sawyer ;Street 22;2610;Wil;;378983561;tom#hotmail.com;1979-08-10 00:00:00.000;0;1;Wil;081208608;NULL;2;IZMH726;2010-08-30 15:02:55.777;2013-06-24 08:17:22.763;0;1;1;1;NULL
Output
10016;9167DE1;Tom;Sawyer ;Street 22;2610;Wil;;0378983561;tom#hotmail.com;1979-08-10 00:00:00.000;0;1;Wil;081208608;NULL;2;IZMH726;2010-08-30 15:02:55.777;2013-06-24 08:17:22.763;0;1;1;1;NULL
See 378983561 > 0378983561
Reading with:
f = file('/home/foo/data.csv', 'r')
data = f.read()
split_data = data.splitlines()
lines = list(line.split(';') for line in split_data)
print data[51220][8]
>>> '0378983561' #should have been '478983561' (reads like this in Geany etc.)
Same result with csv.reader().
Help me solve the mystery, what could be the cause of this? Could it be related to encoding/decoding?
The data you're getting is a string.
print data[51220][8]
>>> '0478983561'
If you want to use this as an integer, you should parse it.
print int(data[51220][8])
>>> 478983561
If you want this as a string without the leading zero, you should convert it back to a string.
print repr(str(int(data[51220][8])))
>>> '478983561'
csv.reader treats all columns as strings. Conversion to the appropriate type is up to you as in:
print int(data[51220][8])
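A small sketch of that point, assuming Python 3 and a made-up, shortened sample row (the real file and row index from the question are not reproduced here):

```python
import csv
import io

# Hypothetical shortened row; fields are separated by semicolons.
sample = "10016;9167DE1;Tom;0378983561\n"

rows = list(csv.reader(io.StringIO(sample), delimiter=";"))
raw = rows[0][3]      # csv.reader always yields strings
as_int = int(raw)     # explicit conversion drops the leading zero
```

Note that converting to int is only appropriate for genuinely numeric columns; for identifiers such as phone numbers, a leading zero may be meaningful and would be lost.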

Python: How do I compare unicode to ascii text?

I'm trying to convert characters in one list into characters in another list at the same index in Japanese (zenkaku to hangaku moji, for those interested), and I can't get the comparison to work. I am decoding into utf-8 before I compare (decoding into ascii broke the program), but the comparison doesn't ever return true. Does anyone know what I'm doing wrong? Here's the code (indents are a little wacky due to SO's editor):
#!C:\Python27\python.exe
# coding=utf-8
import os
import shutil
import sys
zk = [
'。',
'、',
'「',
'」',
'(',
')',
'!',
'?',
'・',
'/',
'ア','イ','ウ','エ','オ',
'カ','キ','ク','ケ','コ',
'サ','シ','ス','セ','ソ',
'ザ','ジ','ズ','ゼ','ゾ',
'タ','チ','ツ','テ','ト',
'ダ','ヂ','ヅ','デ','ド',
'ラ','リ','ル','レ','ロ',
'マ','ミ','ム','メ','モ',
'ナ','ニ','ヌ','ネ','ノ',
'ハ','ヒ','フ','ヘ','ホ',
'バ','ビ','ブ','ベ','ボ',
'パ','ピ','プ','ペ','ポ',
'ヤ','ユ','ヨ','ヲ','ン','ッ'
]
hk = [
'。',
'、',
'「',
'」',
'(',
')',
'!',
'?',
'・',
'/',
'ア','イ','ウ','エ','オ',
'カ','キ','ク','ケ','コ',
'サ','シ','ス','セ','ソ',
'ザ','ジ','ズ','ゼ','ゾ',
'タ','チ','ツ','テ','ト',
'ダ','ヂ','ヅ','デ','ド',
'ラ','リ','ル','レ','ロ',
'マ','ミ','ム','メ','モ',
'ナ','ニ','ヌ','ネ','ノ',
'ハ','ヒ','フ','ヘ','ホ',
'バ','ビ','ブ','ベ','ボ',
'パ','ピ','プ','ペ','ポ',
'ヤ','ユ','ヨ','ヲ','ン','ッ'
]
def main():
    if len(sys.argv) > 1:
        filename = sys.argv[1]
    else:
        print("Please specify a file to check.")
        return
    try:
        f = open(filename, 'r')
    except IOError as e:
        print("Sorry! The file doesn't exist.")
        return
    filecontent = f.read()
    f.close()
    #y = zk[29]
    #print y.decode('utf-8')
    for f in filecontent:
        for z in zk:
            if f == z.decode('utf-8'):
                print f
    print filename
if __name__ == "__main__":
    main()
Am I missing a step?
Several.
zk = [
u'。',
u'、',
u'「',
...
...
f = codecs.open(filename, 'r', encoding='utf-8')
...
I'll let you work out the rest now that the hard work's been done.
Make sure that zk and hk lists contain Unicode strings. Either use Unicode literals e.g., u'a' or decode them at runtime:
fromutf8 = lambda s: s.decode('utf-8') if not isinstance(s, unicode) else s
zk = map(fromutf8, zk)
hk = map(fromutf8, hk)
You could use unicode.translate() to convert characters in one list into characters in another list at the same index:
import codecs
import sys
translation_table = dict(zip(map(ord,zk), hk))
with codecs.open(sys.argv[1], encoding='utf-8') as f:
    for line in f:
        print line.translate(translation_table),
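The same idea can be sketched for Python 3, where every str is already Unicode and str.maketrans builds the table. The short zk and hk lists below are a made-up subset written with escapes (full-width and half-width forms of the same four characters), and to_hankaku is a hypothetical helper name.

```python
# Full-width (zenkaku) characters: ideographic full stop, A, I, U.
zk = ["\u3002", "\u30a2", "\u30a4", "\u30a6"]
# Their half-width (hankaku) counterparts, at the same indexes.
hk = ["\uff61", "\uff71", "\uff72", "\uff73"]

# str.maketrans accepts a dict whose keys are single characters.
table = str.maketrans(dict(zip(zk, hk)))

def to_hankaku(text):
    """Replace each character found in zk with the hk entry at the same index."""
    return text.translate(table)

converted = to_hankaku("\u30a2\u30a4\u3002x")
```

Characters that are not in the table, such as the trailing x here, pass through unchanged.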
You need to convert everything to the same form, and that form is Unicode strings. Unicode strings have no encoding in the sense of .encode() or .decode(). A non-Unicode string is actually a sequence of bytes that expresses the value in some encoding. To convert bytes to Unicode, you .decode() them. To store a Unicode string as a sequence of bytes, you .encode() the abstraction into concrete bytes.
So, when loading Unicode strings from a UTF-8 encoded file, either read it into old-style strings (non-Unicode, sequences of bytes) and then call .decode('utf-8'), or use codecs.open(..., encoding='utf-8'), which gives you Unicode strings automatically.
The form # coding=utf-8 is not the usual one, but it is OK, provided the editor (the tool you use to write the text) also saves the file as UTF-8. Then the old-style strings are displayed correctly by the editor, and they should be .decode('utf-8')d to get Unicode. Old-style strings in the same source that contain only ASCII characters can also be converted to Unicode using .decode('utf-8').
To summarize: you decode from bytes to Unicode, and you encode Unicode strings into sequences of bytes. It seems from the question that you are doing the opposite.
The following is completely wrong:
for f in filecontent:
for z in zk:
if f == z.decode('utf-8'):
print f
because filecontent is the result of f.read() on a file opened without an encoding, so it is a sequence of bytes. The f in the loop is one byte, while z.decode('utf-8') returns one Unicode character, so the two never compare equal. (By the way, f is a misleading name for a byte value.)
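The mismatch between bytes and characters can be seen in a short sketch. It assumes Python 3, where iterating over bytes yields integers (in Python 2, as in the question, it yields one-byte strings); the sample text and variable names are made up for illustration.

```python
text = "\u30a2\u3002"            # two characters: a katakana and a full stop
data = text.encode("utf-8")      # what a binary read of the file would return

byte_items = list(data)                 # individual bytes, not characters
chars = list(data.decode("utf-8"))      # decoded first: real characters

# Each of these characters takes three bytes in UTF-8, so no single
# byte can ever equal one of them.
bytes_per_char = len(byte_items) // len(chars)
```

Decoding the whole content once, and only then looping over it, gives one loop iteration per character.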
