Problem writing unicode UTF-16 data to file in python

I'm working on Windows with Python 2.6.1.
I have a Unicode UTF-16 text file containing the single string Hello, if I look at it in a binary editor I see:
FF FE 48 00 65 00 6C 00 6C 00 6F 00 0D 00 0A 00
BOM H e l l o CR LF
What I want to do is read in this file, run it through Google Translate API, and write both it and the result to a new Unicode UTF-16 text file.
I wrote the following Python script (actually I wrote something more complex than this with more error checking, but this is stripped down as a minimal test case):
#!/usr/bin/python
import urllib
import urllib2
import sys
import codecs

def translate(key, line, lang):
    ret = ""
    print "translating " + line.strip() + " into " + lang
    url = "https://www.googleapis.com/language/translate/v2?key=" + key + "&source=en&target=" + lang + "&q=" + urllib.quote(line.strip())
    f = urllib2.urlopen(url)
    for l in f.readlines():
        if l.find("translatedText") > 0 and l.find('""') == -1:
            a,b = l.split(":")
            ret = unicode(b.strip('"'), encoding='utf-16', errors='ignore')
            break
    return ret

rd_file_name = sys.argv[1]
rd_file = codecs.open(rd_file_name, encoding='utf-16', mode="r")
rd_file_new = codecs.open(rd_file_name+".new", encoding='utf-16', mode="w")
key_file = open("api.key","r")
key = key_file.readline().strip()

for line in rd_file.readlines():
    new_line = translate(key, line, "ja")
    rd_file_new.write(unicode(line) + "\n")
    rd_file_new.write(new_line)
    rd_file_new.write("\n")
This gives me an almost-Unicode file with some extra bytes in it:
FF FE 48 00 65 00 6C 00 6C 00 6F 00 0D 00 0A 00 0A 00
20 22 E3 81 93 E3 82 93 E3 81 AB E3 81 A1 E3 81 AF 22 0A 00
I can see that 20 is a space, 22 is a quote, I assume that "E3" is an escape character that urllib2 is using to indicate that the next character is UTF-16 encoded??
If I run the same script but with "cs" (Czech) instead of "ja" (Japanese) as the target language, the response is all ASCII and I get the Unicode file with my "Hello" first as UTF-16 chars and then "Ahoj" as single byte ASCII chars.
I'm sure I'm missing something obvious but I can't see what. I tried urllib.unquote() on the result from the query but that didn't help. I also tried printing the string as it comes back in f.readlines() and it all looks pretty plausible, but it's hard to tell because my terminal window doesn't support Unicode properly.
Any other suggestions for things to try? I've looked at the suggested dupes but none of them seem to quite match my scenario.

I believe the output from Google is UTF-8, not UTF-16. Try this fix:
ret = unicode(b.strip('"'), encoding='utf-8', errors='ignore')
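Once the bytes are decoded with the right codec, the codecs.open(..., encoding='utf-16') writer from the question handles the output side. A minimal standalone sketch of the idea, with a hard-coded byte string standing in for the API response and out.txt as a placeholder file name:
import codecs

response = "\xe3\x81\x93\xe3\x82\x93\xe3\x81\xab\xe3\x81\xa1\xe3\x81\xaf"  # UTF-8 bytes as returned by the API
text = unicode(response, encoding='utf-8', errors='ignore')

out = codecs.open("out.txt", encoding='utf-16', mode="w")
out.write(text + u"\n")  # the codecs writer encodes the unicode object to UTF-16 (with BOM) on the way out
out.close()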

Those E3 bytes are not "escape characters". If one had no access to documentation, and was forced to make a guess, the most likely suspect for the response encoding would be UTF-8. Expectation (based on a one-week holiday in Japan): something like "konnichiwa".
>>> response = "\xE3\x81\x93\xE3\x82\x93\xE3\x81\xAB\xE3\x81\xA1\xE3\x81\xAF"
>>> ucode = response.decode('utf8')
>>> print repr(ucode)
u'\u3053\u3093\u306b\u3061\u306f'
>>> import unicodedata
>>> for c in ucode:
...     print unicodedata.name(c)
...
HIRAGANA LETTER KO
HIRAGANA LETTER N
HIRAGANA LETTER NI
HIRAGANA LETTER TI
HIRAGANA LETTER HA
>>>
Looks close enough to me ...

Related

python read binary file byte by byte and read line feed as 0a

This might seem pretty stupid, but I'm a complete newbie in python.
So, I have a binary file that holds data as
47 40 ad e8 66 29 10 87 d7 73 0a 40 10
When I tried to read it with python by
with open(absolutePathInput, "rb") as f:
    for line in self.file:
        for byte, nextbyte in zip(line[:], line[1:]):
            if state == 'wait_for_sync_1':
                if ((byte == 0x10) and (nextbyte == 0x87)):
                    state = 'message_id'
I get all the bytes, but 0a is read as a line feed (i.e. \n).
It seems the file object treats 0a as a line separator, so each iteration of self.file only reads up to the 0a.
How can I correct this so that 0a is read as an ordinary byte value?
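In binary mode the file object still treats 0a as a line separator when you iterate line by line, which is why it shows up as \n. A minimal sketch of one way around that (assuming Python 2 and a file small enough to read in one go) is to read all the bytes at once and walk them directly:
with open(absolutePathInput, "rb") as f:
    data = f.read()  # one byte string; 0a is just another byte here

state = 'wait_for_sync_1'
for byte, nextbyte in zip(data, data[1:]):
    # iterating a byte string yields one-character strings, so compare with ord()
    if state == 'wait_for_sync_1':
        if ord(byte) == 0x10 and ord(nextbyte) == 0x87:
            state = 'message_id'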

Python search for a start hex string, save the hex string after that to a new file

I would like to open a text file, search for a HEX pattern that starts with 45 3F, grab the following three HEX values, for example 45 4F 5D, and put them all in a new file. I also know that the pattern always ends with 00 00.
So the file can look like: bla bla sdsfsdf 45 3F 08 DF 5D 00 00 dsafasdfsadf 45 3F 07 D3 5F 00 00 xztert
And should be put in a new file like this:
08 DF 5D
07 D3 5F
How can I do that?
I have tried:
import re

output = open('file.txt','r')
my_data = output.read()
print re.findall(r"45 3F[0-9a-fA-F]", my_data)
but it only prints:
[]
Any suggestions?
Thank you :-)
Here's a complete answer (Python 3):
import re

with open('file.txt') as reader:
    my_data = reader.read()

matches = re.findall(r"45 3F ([0-9A-F]{2} [0-9A-F]{2} [0-9A-F]{2})", my_data)
data_to_write = "\n".join(matches)

with open('out.txt', 'w') as writer:
    writer.write(data_to_write)
re.findall returns only the capturing group, so there's no need to strip out '45 3F' afterwards.
When dealing with files, use with open so that the file descriptor is released when the block ends.
If you want to accept the first two octets dynamically:
search_string = input()  # e.g. "45 3f"
regex_pattern = search_string.upper() + r" ([0-9A-F]{2} [0-9A-F]{2} [0-9A-F]{2})"
matches = re.findall(regex_pattern, my_data)
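A quick illustration of why the .upper() call matters (the character class only lists uppercase hex digits), using the sample text from the question as stand-in data:
import re

my_data = "bla bla sdsfsdf 45 3F 08 DF 5D 00 00 dsafasdfsadf 45 3F 07 D3 5F 00 00 xztert"
search_string = "45 3f"  # lowercase input still matches once uppercased
regex_pattern = search_string.upper() + r" ([0-9A-F]{2} [0-9A-F]{2} [0-9A-F]{2})"
print(re.findall(regex_pattern, my_data))  # ['08 DF 5D', '07 D3 5F']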

How do I translate a hash function from Python to R

I'm trying to translate the following Python script to R, but I'm having difficulty that reflects the fact that I am not well versed in either Python or R.
Here is what I have for Python:
import hashlib, hmac
print hmac.new('123456', 'hello'.encode('utf-8'), hashlib.sha256).digest()
When I run this in Python I'm getting a message that says standard output is empty.
Question: What am I doing wrong?
Here's what I'm using for R
library('digest')
hmac('123456','hello', algo='sha256', serialize=FALSE)
My questions with the R code are:
How do I encode to UTF-8 in R? I couldn't find a package.
What are the correct parameter settings for serialize and raw in R, given that I want to match the output of the Python function above (once it's working)?
If you want to get the bytes of the hash in R, set raw=TRUE. Then you can write it out as a binary file:
library('digest')
x <- hmac('123456', enc2utf8('hello'), algo='sha256', serialize=FALSE, raw=TRUE)
writeBin(x, "Rout.txt")
If you're not outputting text, the encoding doesn't matter. These are raw bytes. The only difference in the output is that the Python print statement seems to add a newline character. If I run hexdump on the R file I see:
0000000 ac 28 d6 02 c7 67 42 4d 0c 80 9e de bf 73 82 8b
0000010 ed 5c e9 9c e1 55 6f 4d f8 e2 23 fa ee c6 0e dd
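If you want the same bytes from the Python side for a byte-for-byte comparison, a minimal sketch (Pyout.txt is just an example name) that writes the raw digest instead of printing it, so no newline gets appended:
import hashlib, hmac

digest = hmac.new('123456', 'hello'.encode('utf-8'), hashlib.sha256).digest()
with open("Pyout.txt", "wb") as f:
    f.write(digest)  # 32 raw bytes, no trailing newline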

PyCrypto AES encryption not working as expected

I am creating a Python function to perform counter mode encryption using the PyCrypto module. I am aware of the builtin, but want to implement it myself.
I'm trying Test Vector #1 from RFC 3686, and have the correct Counter Block and the correct Key in ASCII form. But when I encrypt the Counter Block using the Key, I don't get the expected Key Stream.
The relevant parts of my code:
cipher = AES.new(key)
ctr_block = iv + nonce + ctr
key_stream = base64.b64decode(cipher.encrypt(ctr_block))
I can provide more code if needed, but it's hard to show the values because ctr_block and key contain many question-mark characters when I print them.
Why am I not getting the expected answer? It seems like everything should go right. Perhaps I made some mistake with the encoding of the string.
Edit
Self-contained code:
from Crypto.Cipher import AES
import base64
def hex_to_str(hex_str):
    return str(bytearray([int(n, 16) for n in hex_str.split()]))
key = hex_to_str("AE 68 52 F8 12 10 67 CC 4B F7 A5 76 55 77 F3 9E")
iv = hex_to_str("00 00 00 00 00 00 00 00")
nonce = hex_to_str("00 00 00 30")
ctr = hex_to_str("00 00 00 01")
cipher = AES.new(key)
ctr_block = iv + nonce + ctr
key_stream = base64.b64decode(cipher.encrypt(ctr_block))
print "".join([hex(ord(char)) for char in key_stream])
# 0xd90xda0x72
First, the correct CTR block order is nonce + iv + ctr. Second, that base64.b64decode call is wrong: cipher.encrypt already returns a raw byte string, not base64-encoded data. After these two fixes your code prints 0xb70x600x330x280xdb0xc20x930x1b0x410xe0x160xc80x60x7e0x620xdf which seems to be a correct key stream.
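A minimal sketch of the last lines of the question's code with those two changes applied (key, iv, nonce and ctr built exactly as above):
cipher = AES.new(key)
ctr_block = nonce + iv + ctr            # nonce first, then IV, then block counter
key_stream = cipher.encrypt(ctr_block)  # encrypt() already returns raw bytes; no base64 step
print "".join([hex(ord(char)) for char in key_stream])
# 0xb70x600x330x280xdb0xc20x930x1b0x410xe0x160xc80x60x7e0x620xdf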
First, use byte strings:
In [14]: keystring = "AE 68 52 F8 12 10 67 CC 4B F7 A5 76 55 77 F3 9E"
In [15]: keystring.replace(' ', '').decode('hex')
Out[15]: '\xaehR\xf8\x12\x10g\xccK\xf7\xa5vUw\xf3\x9e'
Second, you shouldn't use base64.

Convert UTF-16 to UTF-8 and remove BOM?

We have a data entry person who encoded in UTF-16 on Windows and would like to have utf-8 and remove the BOM. The utf-8 conversion works but BOM is still there. How would I remove this? This is what I currently have:
import os
import codecs

batch_3={'src':'/Users/jt/src','dest':'/Users/jt/dest/'}
batches=[batch_3]

for b in batches:
    s_files=os.listdir(b['src'])
    for file_name in s_files:
        ff_name = os.path.join(b['src'], file_name)
        if (os.path.isfile(ff_name) and ff_name.endswith('.json')):
            print ff_name
            target_file_name=os.path.join(b['dest'], file_name)
            BLOCKSIZE = 1048576
            with codecs.open(ff_name, "r", "utf-16-le") as source_file:
                with codecs.open(target_file_name, "w+", "utf-8") as target_file:
                    while True:
                        contents = source_file.read(BLOCKSIZE)
                        if not contents:
                            break
                        target_file.write(contents)
If I hexdump -C I see:
Wed Jan 11$ hexdump -C svy-m-317.json
00000000 ef bb bf 7b 0d 0a 20 20 20 20 22 6e 61 6d 65 22 |...{.. "name"|
00000010 3a 22 53 61 76 6f 72 79 20 4d 61 6c 69 62 75 2d |:"Savory Malibu-|
in the resulting file. How do I remove the BOM?
thx
This is the difference between UTF-16LE and UTF-16:
UTF-16LE is little endian without a BOM
UTF-16 is big or little endian with a BOM
So when you use UTF-16LE, the BOM is just part of the text. Use UTF-16 instead, so the BOM is automatically removed. The reason UTF-16LE and UTF-16BE exist is so people can carry around "properly-encoded" text without BOMs, which does not apply to you.
Note what happens when you encode using one encoding and decode using the other. (UTF-16 automatically detects UTF-16LE sometimes, not always.)
>>> u'Hello, world'.encode('UTF-16LE')
'H\x00e\x00l\x00l\x00o\x00,\x00 \x00w\x00o\x00r\x00l\x00d\x00'
>>> u'Hello, world'.encode('UTF-16')
'\xff\xfeH\x00e\x00l\x00l\x00o\x00,\x00 \x00w\x00o\x00r\x00l\x00d\x00'
^^^^^^^^ (BOM)
>>> u'Hello, world'.encode('UTF-16LE').decode('UTF-16')
u'Hello, world'
>>> u'Hello, world'.encode('UTF-16').decode('UTF-16LE')
u'\ufeffHello, world'
^^^^ (BOM)
Or you can do this at the shell:
for x in * ; do iconv -f UTF-16 -t UTF-8 <"$x" | dos2unix >"$x.tmp" && mv "$x.tmp" "$x"; done
Just use str.decode and str.encode:
with open(ff_name, 'rb') as source_file:
    with open(target_file_name, 'w+b') as dest_file:
        contents = source_file.read()
        dest_file.write(contents.decode('utf-16').encode('utf-8'))
str.decode will get rid of the BOM for you (and deduce the endianness).
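A quick interactive check of that (Python 2):
>>> '\xff\xfeH\x00i\x00'.decode('utf-16')   # FF FE BOM followed by UTF-16LE "Hi"
u'Hi'
>>> u'Hi'.encode('utf-8')                   # no BOM in the UTF-8 output
'Hi'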
