Opening a UTF-16 file and reading each byte - python

I am attempting to parse a file that I believe is UTF-16 encoded (the file magic is 0xFEFF), and I can open the file just as I want with:
f = open(file, 'rb')
But when for example I do
print f.read(40)
it prints the actual unicode strings of the file where I would like to access the hexadecimal data and read that byte-by-byte. This may be a stupid question but I haven't been able to find out how to do this.
Also, as a follow up question. Once I get this working, I would like to parse the file looking for a specific set of bytes, in this case:
0x00 00 00 43 00 00 00
And after that pattern is found, begin parsing an entry. What's the best way to accomplish this? I was thinking about using a generator to walk through each byte, and once this pattern shows up, yield the bytes until the next instance of that pattern? Is there a more efficient way to do this?
EDIT: I am using Python 2.7

Shouldn't you just be able to do this?
>>> string = 'string'
>>> hex(ord(string[1]))
'0x74'
hexString = ''
with open(filename, 'rb') as f:
    while True:
        chars = f.read(40)
        if not chars:
            break
        hexString += ''.join(hex(ord(char)) for char in chars)

If you want a string of hexadecimal, you can pass it through binascii.hexlify():
import binascii

with open(filename, 'rb') as f:
    raw = f.read(40)
    hexadecimal = binascii.hexlify(raw)
    print(hexadecimal)
(This also works without modification on Python 3)
If you need the numerical value of each byte, you can call ord() on each element, or equivalently, map() the function over the string:
with open(filename, 'rb') as f:
    raw = f.read(40)
    byte_list = map(ord, raw)
    print byte_list
(This doesn't work on Python 3, but on 3.x, you can just iterate over raw directly)
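Neither answer addresses the follow-up about scanning for the byte pattern. As a minimal sketch (Python 3 bytes shown; on the asker's 2.7 the same `split()` approach works on plain str), splitting the raw data on the delimiter yields the entries between occurrences, which avoids walking byte-by-byte:

```python
import io

# The pattern from the question; the sample data is made up for illustration.
pattern = b'\x00\x00\x00\x43\x00\x00\x00'
data = b'header' + pattern + b'entry1' + pattern + b'entry2'

# Everything before the first occurrence is not an entry, so skip it.
entries = data.split(pattern)[1:]
print(entries)  # [b'entry1', b'entry2']
```

For files too large to read whole, `bytes.find()` in a buffered loop would be the next step, but for typical sizes one `read()` plus `split()` is both simpler and faster than a per-byte generator.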

Related

Python 3.10 binary splitting script (inconsistent output)

I need to split a .bin file into chunks. However, I seem to face a problem when it comes to writing the output in the split/new binary file. The output is inconsistent, I can see the data, but there are shifts and gaps everywhere when comparing the split binary with the bigger original one.
def hash_file(filename: str, blocksize: int = 4096) -> str:
    blocksCount = 0
    with open(filename, "rb") as f:
        while True:
            # Read a new chunk from the binary file
            full_string = f.read(blocksize)
            if not full_string:
                break
            new_string = ' '.join('{:02x}'.format(b) for b in full_string)
            split_string = ''.join(chr(int(i, 16)) for i in new_string.split())
            # Append the split chunk to the new binary file
            newf = open("SplitBin.bin", "a", encoding="utf-8")
            newf.write(split_string)
            newf.close()
            # Check if the desired number of mem blocks has been reached
            blocksCount = blocksCount + 1
            if blocksCount == 1:
                break
For characters with ordinals between 0 and 0x7f, their UTF-8 representation will be the same as their byte value. But for characters with ordinals between 0x80 and 0xff, UTF-8 will output two bytes neither of which will be the same as the input. That's why you're seeing inconsistencies.
The easiest way to fix it would be to open the output file in binary mode as well. Then you can eliminate all the formatting and splitting, because you can directly write the data you just read:
with open("SplitBin.bin", "ab") as newf:
    newf.write(full_string)
Note that reopening the file each time you write to it will be very slow. Better to leave it open until you're done.
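Putting that advice together, a minimal sketch of the fixed loop (file names and the one-block limit are taken from the question; the sample input is generated here so the sketch runs standalone):

```python
blocksize = 4096

# Generate a sample input file so the sketch is self-contained.
with open("BigBin.bin", "wb") as f:
    f.write(bytes(range(256)) * 64)  # 16384 bytes of test data

blocksCount = 0
# Open both files in binary mode, and keep the output open for the whole loop.
with open("BigBin.bin", "rb") as f, open("SplitBin.bin", "wb") as newf:
    while True:
        full_string = f.read(blocksize)
        if not full_string:
            break
        newf.write(full_string)  # write the chunk directly, no reformatting
        blocksCount += 1
        if blocksCount == 1:  # stop after the desired number of blocks
            break
```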

How to read a single UTF-8 character from a file in Python?

f.read(1) will return 1 byte, not one character. The file is binary but particular ranges in the file are UTF-8 encoded strings with the length coming before the string. There is no newline character at the end of the string. How do I read such strings?
I have seen this question but none of the answers address the UTF-8 case.
Example code:
file = 'temp.txt'
with open(file, 'wb') as f:
    f.write(b'\x41')
    f.write(b'\xD0')
    f.write(b'\xB1')
    f.write(b'\xC0')
with open(file, 'rb') as f:
    print(f.read(1), '+', f.read(1))
with open(file, 'r') as f:
    print(f.buffer.read(1), '+', f.read(1))
This outputs:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc0 in position 2: invalid start byte
When f.write(b'\xC0') is removed, it works as expected. It seems to read more than it is told: the code doesn't say to read the 0xC0 byte.
The file is binary but particular ranges in the file are UTF-8 encoded strings with the length coming before the string.
You have the length of the string, which is likely the byte length as it makes the most sense in a binary file. Read the range of bytes in binary mode and decode it after-the-fact. Here's a contrived example of writing a binary file with a UTF-8 string with the length encoded first. It has a two-byte length followed by the encoded string data, surrounded with 10 bytes of random data on each side.
import os
import struct

string = "我不喜欢你女朋友。你需要一个新的。"

with open('sample.bin', 'wb') as f:
    f.write(os.urandom(10))                   # write 10 random bytes
    encoded = string.encode()
    f.write(len(encoded).to_bytes(2, 'big'))  # write a two-byte big-endian length
    f.write(encoded)                          # write the string
    f.write(os.urandom(10))                   # 10 more random bytes

with open('sample.bin', 'rb') as f:
    print(f.read())  # show the raw data

# Option 1: seek to the known offset, read the length, then the string
with open('sample.bin', 'rb') as f:
    f.seek(10)
    length = int.from_bytes(f.read(2), 'big')
    result = f.read(length).decode()
    print(result)

# Option 2: read the fixed portion as a structure
with open('sample.bin', 'rb') as f:
    # read 10 bytes and a big-endian 16-bit value
    *other, header = struct.unpack('>10bH', f.read(12))
    result = f.read(header).decode()
    print(result)
Output:
b'\xa3\x1e\x07S8\xb9LA\xf0_\x003\xe6\x88\x91\xe4\xb8\x8d\xe5\x96\x9c\xe6\xac\xa2\xe4\xbd\xa0\xe5\xa5\xb3\xe6\x9c\x8b\xe5\x8f\x8b\xe3\x80\x82\xe4\xbd\xa0\xe9\x9c\x80\xe8\xa6\x81\xe4\xb8\x80\xe4\xb8\xaa\xe6\x96\xb0\xe7\x9a\x84\xe3\x80\x82ta\xacg\x9c\x82\x85\x95\xf9\x8c'
我不喜欢你女朋友。你需要一个新的。
我不喜欢你女朋友。你需要一个新的。
If you do need to read UTF-8 characters from a particular byte offset in a file, you can wrap the binary stream in a UTF-8 reader after seeking:
import codecs

with open('sample.bin', 'rb') as f:
    f.seek(12)
    c = codecs.getreader('utf8')(f)
    print(c.read(1))
Output:
我
Here's a character that takes up more than one byte. Whether you open the file giving the utf-8 encoding or not, reading one byte seems to do the job and you get the whole character.
file = 'temp.txt'
with open(file, 'wb') as f:
    f.write('⾀'.encode('utf-8'))
    f.write(b'\x01')
with open(file, 'rb') as f:
    print(f.read(1))
with open(file, 'r') as f:
    print(f.read(1))
Output:
b'\xe2'
⾀
Even though some of the file is not UTF-8, you can still open the file in reading mode (non-binary), skip to the byte you want to read, and then read a whole character by calling read(1).
This works even if your character isn't in the beginning of the file:
file = 'temp.txt'
with open(file, 'wb') as f:
    f.write(b'\x01')
    f.write('⾀'.encode('utf-8'))
with open(file, 'rb') as f:
    print(f.read(1), '+', f.read(1))
with open(file, 'r') as f:
    print(f.read(1), '+', f.read(1))
If this does not work for you please provide an example.

How to unpack data using byte objects in python

I am trying to unpack a specific piece of data from an encoded file. The data is of type int32.
My approach was to try to read each line of the file and check if a section of that line's size matches the size of '
import struct
import sys

with open(r2sFile2, encoding="latin-1") as datafile:
    for line in datafile:
        for i in range(0, len(line)):
            text = line.encode('latin-1')
            fmt = text[0: i+1]
            print(sys.getsizeof(fmt))
            if (sys.getsizeof(fmt) == 4):
                PacketSize = struct.unpack('<I', fmt)
                if PacketSize == 'BTH0':
                    print("Identifier Found")
I am encountering one problem so far: sys.getsizeof(fmt) [which reads the size of the line segment] always returns a size higher than the required size [4]. Maybe I have to convert every byte object into an int32 (if possible)?
You cannot open a file using an encoding and also read raw bytes from the same stream. Specifying an encoding will convert those bytes into unicode code points. You want to open the file in binary mode so that it returns bytes instead of unicode characters. You should also read fixed-size chunks from the file instead of reading line by line, which doesn't really make sense in binary mode:
import struct

with open('/path/to/file.dat', 'rb') as f:
    # struct.unpack returns a tuple, so unpack the single value
    (val,) = struct.unpack('<I', f.read(4))
    if val == 123:
        ...
Also, 'BTH0' is not an unsigned integer, so it's never going to compare equal to the value unpacked from your binary file with the format string <I, which will always be an unsigned integer.
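As a hedged sketch of what the asker may have been after (the layout around 'BTH0' is assumed for illustration, not taken from a real file spec): search the raw bytes for the identifier, then unpack the uint32 that follows it:

```python
import struct

# Made-up sample data: an identifier followed by a little-endian uint32.
data = b'garbageBTH0\x10\x00\x00\x00payload'

idx = data.find(b'BTH0')
if idx != -1:
    print("Identifier Found")
    # unpack the 4 bytes after the 4-byte identifier as an unsigned 32-bit int
    (packet_size,) = struct.unpack_from('<I', data, idx + 4)
    print(packet_size)  # 16
```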

Create text file of hexadecimal from binary

I would like to convert a binary to hexadecimal in a certain format and save it as a text file.
The end product should be something like this:
"\x7f\xe8\x89\x00\x00\x00\x60\x89\xe5\x31\xd2\x64\x8b\x52"
Input is from an executable file "a".
This is my current code:
with open('a', 'rb') as f:
    byte = f.read(1)
    hexbyte = '\\x%02s' % byte
    print hexbyte
A few issues with this:
This only prints the first byte.
The result is "\x" followed by an unprintable box glyph (the original post included a terminal screenshot showing 00 and 7f inside the box).
Why is this so? And finally, how do I save all the hexadecimals to a text file to get the end product shown above?
EDIT: Able to save the file as text with
txt = open('out.txt', 'w')
print >> txt, hexbyte
txt.close()
You can't inject numbers into escape sequences like that. Escape sequences are essentially constants, so, they can't have dynamic parts.
There's already a module for this, anyway:
from binascii import hexlify

with open('test', 'rb') as f:
    print(hexlify(f.read()).decode('utf-8'))
Just use the hexlify function on a byte string and it'll give you a hex byte string. You need the decode to convert it back into an ordinary string.
Not quite sure if decode works in Python 2, but you really should be using Python 3, anyway.
Your output looks like a representation of a bytestring in Python returned by repr():
with open('input_file', 'rb') as file:
    print repr(file.read())
Note: some bytes are shown as ascii characters e.g. '\x52' == 'R'. If you want all bytes to be shown as the hex escapes:
with open('input_file', 'rb') as file:
    print "\\x" + "\\x".join([c.encode('hex') for c in file.read()])
Just add the content to list and print:
with open("default.png", 'rb') as file_png:
    a = file_png.read()
    l = []
    l.append(a)
    print l

Python3 ASCII Hexadecimal to Binary String Conversion

I'm using Python 3.2.3 on Windows, and am trying to convert binary data within a C-style ASCII file into its binary equivalent for later parsing using the struct module. For example, my input file contains "0x000A 0x000B 0x000C 0x000D", and I'd like to convert it into "\x00\x0a\x00\x0b\x00\x0c\x00\x0d".
The problem I'm running into is that the string datatypes have changed in Python 3, and the built-in functions to convert from hexadecimal to binary, such as binascii.unhexlify(), no longer accept regular unicode strings, but only byte strings. This process of converting from unicode strings to byte strings and back is confusing me, so I'm wondering if there's an easier way to achieve this. Below is what I have so far:
import binascii

with open(path, "r") as f:
    l = []
    data = f.read()
    values = data.split(" ")
    for v in values:
        if v.startswith("0x"):
            l.append(binascii.unhexlify(bytes(v[2:], "utf-8")).decode("utf-8"))
    string = ''.join(l)
>>> ''.join(chr(int(x, 16)) for x in "0x000A 0x000B 0x000C 0x000D".split()).encode('utf-16be')
b'\x00\n\x00\x0b\x00\x0c\x00\r'
As agf says, opening the file with mode 'r' will give you string data.
Since the only thing you are doing here is looking at binary data, you probably want to open with 'rb' mode and make your result of type bytes, not str.
Something like:
import binascii

with open(path, "rb") as f:
    l = []
    data = f.read()
    values = data.split(b" ")
    for v in values:
        if v.startswith(b"0x"):
            l.append(binascii.unhexlify(v[2:]))
    result = b''.join(l)
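For input that fits in memory, the same conversion can be sketched without the loop (the sample data is inlined from the question rather than read from a file):

```python
import binascii

# The example input from the question, already as bytes.
data = b"0x000A 0x000B 0x000C 0x000D"

# Strip the "0x" prefix from each token and unhexlify it directly.
result = b''.join(binascii.unhexlify(v[2:])
                  for v in data.split(b' ') if v.startswith(b'0x'))
print(result)  # b'\x00\n\x00\x0b\x00\x0c\x00\r'
```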
