How to read hex values at specific addresses in Python?

Say I have a file and I'm interested in reading and storing hex values at certain addresses, like the snippet below:
22660 00 50 50 04 00 56 0F 50 25 98 8A 19 54 EF 76 00
22670 75 38 D8 B9 90 34 17 75 93 19 93 19 49 71 EF 81
I want to read the value at 0x2266D and be able to replace it with another hex value, but I can't work out how to do it. I've tried using open('filename', 'rb'), but this reads it as the ASCII representation of the values, and I don't see how to pick and choose which addresses I want to change.
Thanks!
Edit: For an example, I have
rom = open("filename", 'rb')
for i in range(5):
    test = rom.next().split()
    print test
rom.close()
This outputs: ['NES\x1a', 'B\x00\x00\x00\x00\x00\x00\x00\x00\x00!\x0f\x0f\x0f(\x0f!!', '!\x02\x0f\x0f', '!\x0f\x01\x08', '!:\x0f\x0f\x03!', '\x0f', '\x0f\x0f', '!', '\x0f\x0f!\x0f\x03\x0f\x12', '\x0f\x0f\x0f(\x02&%\x0f', '\x0f', '#', '!\x0f\x0f1', '!"#$\x14\x14\x14\x13\x13\x03\x04\x0f\x0f\x03\x13#!!\x00\x00\x00\x00\x00!!', '(', '\x0f"\x0f', '#\x14\x11\x12\x0f\x0f\x0f#', '\x10', "5'4\x0270&\x02\x02\x02\x02\x0f\x0f\x0f\x0f\x0f\x0f\x0f\x0f\x0f\x0f\x0f\x0f\x0f\x0f126&\x13\x0f\x0f\x0f\x13&6222\x0f", '\x1c,', etc etc.
Much more than 5 bytes, and while some of it is in hex, some has been replaced with ASCII.

There's no indication that some of the bytes were replaced by their ASCII representations. Some bytes happen to be printable.
With a binary file, you can simply seek to the offset and write the bytes in. Working with the line iterator on a binary file is problematic, as there are no meaningful "lines" in a binary blob.
You can edit in place as follows (your_function is a placeholder for whatever transformation you need):
with open("filename", "rb+") as f:
    f.seek(0x2266D)
    the_byte = f.read(1)
    if len(the_byte) != 1:
        # nothing to read at that offset; bail out
        raise EOFError("offset 0x2266D is beyond the end of the file")
    transformed_byte = your_function(the_byte)  # must return a single byte
    f.seek(-1, 1)  # back one byte relative to the current position
    f.write(transformed_byte)
But of course, you may want to do the edit on a copy, either in memory (and commit later, as in the answer of @JosepValls) or on a copy of the file. The problem with gulping the whole file into memory is that, for large files, the system may choke ;) In that case you may want to mmap part of the file.
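A minimal sketch of the mmap route mentioned above, assuming Python 3 and a file small enough to map in one piece. The helper name patch_byte and its parameters are made up for illustration; they are not part of any library.

```python
import mmap

def patch_byte(path, offset, new_value):
    """Overwrite one byte at `offset` in the file at `path`; return the old value."""
    with open(path, "r+b") as f:
        # Length 0 maps the whole file; the mapping is read/write by default.
        with mmap.mmap(f.fileno(), 0) as mm:
            if not 0 <= offset < len(mm):
                raise IndexError("offset is outside the file")
            old_value = mm[offset]   # indexing a mmap yields an int in Python 3
            mm[offset] = new_value   # new_value must be an int in range 0..255
            mm.flush()               # push the change back to the file
    return old_value
```

Something like patch_byte("filename", 0x2266D, 0xFF) would then overwrite the byte from the question and hand back the old value. Trying it on a copy of the ROM first is the safer habit.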

Given that it is not a very big file (ROMs should fit comfortably in a modern computer's memory), just do data = open('filename', 'rb').read(). Now you can do whatever you want with the data (if you print it, it will show ASCII characters wherever a byte happens to be printable, but it is still just data!). Unfortunately, string objects don't support item assignment; see this answer for more:
Change one character in a string in Python?
In your case, slice around the byte at 0x2266D and splice in the new one:
data = data[:0x2266D] + b'\xff' + data[0x2266E:]
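Since string objects don't support item assignment, a bytearray is one way around the slicing. This is only a sketch assuming Python 3; replace_byte is a hypothetical helper, and the sample row is the first line of the dump in the question (addresses 0x22660 to 0x2266F), so index 0x0D stands in for address 0x2266D.

```python
def replace_byte(data, offset, new_value):
    """Return a copy of `data` (bytes) with the byte at `offset` set to `new_value`."""
    buf = bytearray(data)      # mutable copy; bytes objects are immutable
    buf[offset] = new_value    # new_value is an int in range 0..255
    return bytes(buf)

# First row of the dump in the question (addresses 0x22660..0x2266F).
row = bytes.fromhex("00 50 50 04 00 56 0F 50 25 98 8A 19 54 EF 76 00")
patched = replace_byte(row, 0x0D, 0xFF)

# Show the byte as two hex digits before and after the change.
print(format(row[0x0D], "02X"), "->", format(patched[0x0D], "02X"))  # EF -> FF
```

With the whole file loaded, the same call would take 0x2266D as the offset, and the result can be written back out with open('filename', 'wb').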

Related

How to read double, float and int values from binary files in python?

I have a binary file that was created in C++. The first value is double and the second is integer. I am reading the values fine using the following code in C++.
double dob_value;
int integer_value;
fread(&dob_value, sizeof(dob_value), 1, fp);
fread(&integer_value, sizeof(integer_value), 1, fp);
I am trying to read the same file in python but I am running into issues. My dob_value is 400000000.00 and my integer_value 400000. I am using following code in python for double.
def interpret_float(x):
    return struct.unpack('d',x[4:]+x[:4])
with open(file_name, 'rb') as readfile:
    dob = readfile.read(8)
    dob_value = interpret_float(dob)[0]
    val = readfile.read(4)
    test2 = readfile.read(4)
    integer_value = int.from_bytes(test2, "little")
My dob_value is 400000000.02384186. My question is: where are these extra decimals coming from? Also, how do I get the correct integer_value? With the above code, my integer_value is 1091122467. I also have float values after the integer, but I haven't looked into that yet.
If the link goes broken and just in case the test.bin contains 00 00 00 00 84 D7 B7 41 80 1A 06 00 70 85 69 C0.
Your binary contains correct 41B7D78400000000 hexadecimal representation of 400000000.0 in the first 8 bytes. Running
import binascii
import struct
fname = r'test.bin'
with open(fname, 'rb') as readfile:
    dob = readfile.read(8)
    print(struct.unpack('d', dob)[0])
    print(binascii.hexlify(dob))
outputs
>> 400000000.0
>> b'0000000084d7b741'
which is also correct little endian representation of the double. When you swap parts, you get
print(binascii.hexlify(dob[4:]+dob[:4]))
>> b'84d7b74100000000'
and if you check the decimal value, it will give you 5.45e-315, not what you expect. Moreover,
struct.unpack('d', dob[4:]+dob[:4])[0]
>>5.44740625e-315
So I'm not sure how you could get 400000000.02384186 from the code above. However, to obtain 400000000.02384186 using your test.bin, just skip the four bytes in the beginning:
with open(fname, 'rb') as readfile:
    val = readfile.read(4)
    dob = readfile.read(8)
    dob = dob[4:]+dob[:4]
    print(binascii.hexlify(dob))
    print(struct.unpack('d', dob)[0])
>>b'801a060084d7b741'
>>400000000.02384186
Binary value 0x41B7D78400061A80 corresponds to 400000000.02384186. So you first read the wrong bytes, then incorrectly swap the halves, and end up with a result close to what you expect. As for the integer value, 400000 is 0x00061A80, which is also present in the binary, but you have already consumed those bytes for the double, so you read past them and get the wrong value.
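For what it's worth, the whole record can be unpacked in one call without any swapping. A sketch assuming the file is little-endian with no padding between the fields, and assuming the last four bytes are a C float (the question only says floats follow the integer):

```python
import struct

# The 16 bytes of test.bin as listed in the question.
raw = bytes.fromhex("00 00 00 00 84 D7 B7 41 80 1A 06 00 70 85 69 C0")

# "<" = little-endian, standard sizes, no padding
# "d" = 8-byte double, "i" = 4-byte int,
# "f" = 4-byte float (an assumption about the trailing bytes)
dob_value, integer_value, float_value = struct.unpack("<dif", raw)

print(dob_value)      # 400000000.0
print(integer_value)  # 400000
print(float_value)
```

When reading from the file itself, readfile.read(struct.calcsize("<dif")) would supply the same 16 bytes.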

How to cipher UTF-8 beyond A-Z in python?

Many years ago, I made a program in C# on Windows which "encrypts" text files using (what I thought was) a Caesar cipher.
Back then I wanted more characters than just A-Z,0-9 and made it possible but never thought about the actual theory behind it.
Looking at some of the files, and comparing it to this website, it seems like the UTF-8 is being shifted.
I started up a Windows VM (because I'm using Linux now) and typed this: abcdefghijklmnopqrstuvwxyz
It generated a text that looks like this in hexadecimals (Shifted 15 times):
70 71 72 73 74 75 76 77 78 79 7a 7b 7c 7d 7e 7f c280 c281 c282 c283 c284 c285 c286 c287 c288 c289
How can I shift the hexadecimals to look like this?
61 62 63 64 65 66 67 68 69 6a 6b 6c 6d 6e 6f 70 71 72 73 74 75 76 77 78 79 7a
Or are there any easier/better methods of doing this?
UPDATE
I'm using Python 3.5.3, and this is the code I have so far:
import sys
arguments = sys.argv[1:]
file = ""
for arg in arguments:
    if arg[0] != "-":
        file = arg
lines = []
with open(file) as f:
    lines = f.readlines()
for line in lines:
    result = 0
    for value in list(line):
        #value = "0x"+value
        temp=value.encode('utf-8').hex()
        temp+=15
        if(temp>0x7a):
            temp-=0x7a
        elif(temp<=0):
            temp+=0x7a
        #result = result + temp
    print (result)
Unfortunately, I don't have the C# source code available for the moment. I can try to find it
Assuming your input is ASCII text, the simplest solution is to encode/decode as ASCII and use the built-in methods ord() and chr() to convert from character to byte value and vice versa.
Note that the temp value cannot be less than 0, so the second if-statement can be removed.
NB: This is outside the scope of the question, but I also noticed that you're doing argument parsing yourself. I highly recommend using argparse instead, since it's very easy and gives you a lot extra for free (i.e. it performs error checking and it prints a nice help message if you start your application with '--help' option). See the example code below:
import argparse
parser = argparse.ArgumentParser()
parser.add_argument(dest='filenames', metavar='FILE', type=str, nargs='+',
                    help='file(s) to encrypt')
args = parser.parse_args()
for filename in args.filenames:
    with open(filename, 'rt', encoding='ascii') as file:
        lines = file.readlines()
    for line in lines:
        result = ""
        for value in line:
            temp = ord(value) # character to int value
            temp += 15
            if temp > 0x7a:
                temp -= 0x7a
            result += chr(temp) # int value to character
        print(result)
You can convert back and forth between hex strings and integers using int() and hex(). However, hex() only accepts integers, so first convert the hex string to an integer using base=16.
hex_int = int(hex_str, 16)
cipher = hex_int - 15
hex_cipher = hex(cipher)
Now apply that in a loop and you can shift your results left or right as desired. And you could of course condense the code as well.
result = hex(int(hex_string, 16) - 15)
# in a loop
hexes = ['70', '71', 'c280']
ciphered = []
for n in hexes:
    ciphered.append(hex(int(n, 16) - 15))
You can use int('somestring'.encode('utf-8').hex(),16) to get the exact values on that website. If you want to apply the same rules to each character, you can do it in a character list. You can use
import codecs
def myencode(character,diff):
    temp=int(character.encode('utf-8').hex(),16)
    temp+=diff
    if(temp>0x7a):
        temp-=0x7a
    elif(temp<=0):
        temp+=0x7a
    result=codecs.decode(hex(temp)[2:],"hex").decode("utf-8")
    return result
diff is the shift for the cipher and must be an integer (it may be negative). encode('utf-8') converts the string to bytes and .hex() renders those bytes as hex. Feed this function only one character at a time, so that each character is shifted on its own.
After shifting, the integer has to be turned back into a character: codecs.decode(..., "hex") converts the hex digits back to bytes, and decode("utf-8") turns those bytes into a string.
Edit: Updated, now it works.
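Another way to look at the hex dump in the question: c280 is simply the UTF-8 encoding of U+0080, so the old program may have shifted each character's code point by 15 and then saved the result as UTF-8. Under that assumption (no wrap-around, shift of 15) the text can be recovered by decoding first and shifting code points afterwards. shift_text is a made-up helper name:

```python
def shift_text(text, shift):
    """Shift every character's code point by `shift` (no wrap-around)."""
    return "".join(chr(ord(ch) + shift) for ch in text)

# Hex dump from the question: a-z after a shift of 15, stored as UTF-8.
cipher_bytes = bytes.fromhex(
    "70 71 72 73 74 75 76 77 78 79 7a 7b 7c 7d 7e 7f "
    "c280 c281 c282 c283 c284 c285 c286 c287 c288 c289"
)

cipher_text = cipher_bytes.decode("utf-8")   # 26 characters, U+0070..U+0089
plain_text = shift_text(cipher_text, -15)    # undo the shift
print(plain_text)  # abcdefghijklmnopqrstuvwxyz
```

Shifting the code point rather than the hex of the encoded bytes avoids producing invalid UTF-8 for multi-byte characters. Note that chr() raises ValueError if a shift pushes a code point below 0.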

Converting broken byte string from unicode back to corresponding bytes

The following code retrieves an iterable object of strings in rows which contains a PDF byte stream. The string row was type of str. The resulting file was a PDF format and could be opened.
with open(fname, "wb") as fd:
    for row in rows:
        fd.write(row)
Due to a new C-Library and changes in the Python implementation the str changes to unicode. And the corresponding content changed as well so my PDF file is broken.
Starting bytes of first row object:
old row[0]: 25 50 44 46 2D 31 2E 33 0D 0A 25 E2 E3 CF D3 0D 0A ...
new row[0]: 25 50 44 46 2D 31 2E 33 0D 0A 25 C3 A2 C3 A3 C3 8F C3 93 0D 0A ...
I adjust the corresponding byte positions here so it looks like a unicode problem.
I think this is a good start but I still have a unicode string as input...
>>> "\xc3\xa2".decode('utf8') # but as input I have u"\xc3\xa2"
u'\xe2'
I already tried several calls of encode and decode so I need a more analytical way to fix this. I can't see the wood for the trees. Thank you.
When you find u"\xc3\xa2" in a Python unicode string, it often means that you have read a UTF-8 encoded file as if it were Latin1 encoded. So the best thing to do is certainly to fix the initial read.
That being said if you have to depend on broken code, the fix is still easy: you just encode the string as Latin1 and then decode it as UTF-8:
fixed_u_str = broken_u_str.encode('Latin1').decode('UTF-8')
For example:
u"\xc3\xa2\xc3\xa3".encode('Latin1').decode('utf8')
correctly gives u"\xe2\xe3" which displays as âã
This looks like you should be doing
fd.write(row.encode('utf-8'))
assuming the type of row is now unicode (this is my understanding of how you presented things).
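To check the Latin1/UTF-8 round trip against the bytes posted in the question, here is a small sketch in Python 3 syntax (the question itself is Python 2). It assumes the broken row holds one code point per byte, i.e. the u"\xc3\xa2" case; the variable names are only illustrative.

```python
# First bytes of the old (good) and new (broken) rows from the question.
old_bytes = bytes.fromhex("25 50 44 46 2D 31 2E 33 0D 0A 25 E2 E3 CF D3 0D 0A")
new_bytes = bytes.fromhex(
    "25 50 44 46 2D 31 2E 33 0D 0A 25 C3 A2 C3 A3 C3 8F C3 93 0D 0A"
)

# Assumption: the broken row is a text string with one code point per byte.
broken_u_str = new_bytes.decode("latin-1")

# Undo the misread, then turn the repaired text back into raw PDF bytes.
fixed_u_str = broken_u_str.encode("latin-1").decode("utf-8")
recovered = fixed_u_str.encode("latin-1")

print(recovered == old_bytes)  # True
```

If the row instead already holds u"\xe2\xe3..." (and only looks like C3 A2 once written out as UTF-8), a single row.encode('latin-1') before fd.write() would be enough.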

numpy.array.tofile() binary file looks "strange" in notepad++

I am just wondering how the function actually stores the data. Because to me, it looks completely strange. Say I have the following code:
import numpy as np
filename = "test.dat"
print(filename)
fileobj = open(filename, mode='wb')
off = np.array([1, 300], dtype=np.int32)
off.tofile(fileobj)
fileobj.close()
fileobj2 = open(filename, mode='rb')
off = np.fromfile(fileobj2, dtype = np.int32)
print(off)
fileobj2.close()
Now I expect 8 bytes inside the file, where each element is represented by 4 bytes (and I could live with any endianness). However when I open up the file in a hex editor (used notepad++ with hex editor plugin) I get the following bytes:
01 00 C4 AC 00
5 bytes, and I have no idea at all what it represents. The first byte looks like it is the number, but then what follows is something weird, certainly not "300".
Yet reloading shows the original array.
Is this something I don't understand in python, or is it a problem in notepad++? - I notice the hex looks different if I select a different "encoding" (huh?). Also: Windows does report it being 8 bytes long.
You can tell very easily that the file actually does have 8 bytes, the same 8 bytes you'd expect (01 00 00 00 2C 01 00 00), just by using anything other than Notepad++ to look at the file, including just replacing your off = np.fromfile(fileobj2, dtype=np.int32) with off = fileobj2.read() and then printing the bytes (which will give you b'\x01\x00\x00\x00,\x01\x00\x00' *).
And, from your comments, after I suggested that, you tried it, and saw exactly that.
Which means this is either a bug in Notepad++, or a problem with the way you're using it; Python, NumPy, and your own code are perfectly fine.
* In case it isn't clear: '\x2c' and ',' are the same character, and bytes uses the printable ASCII representation for printable ASCII characters, as well as familiar escapes like '\n', when possible, only using the hex backslash escape for other values.
What are you expecting 300 to look like?
Write the array, and read it back as binary (in ipython):
In [478]: np.array([1,300],np.int32).tofile('test')
In [479]: with open('test','rb') as f: print(f.read())
b'\x01\x00\x00\x00,\x01\x00\x00'
There are 8 bytes; ',' is just the displayable form of the byte 0x2C.
Actually, I don't have to go through a file to get this:
In [505]: np.array([1,300]).tostring()
Out[505]: b'\x01\x00\x00\x00,\x01\x00\x00'
Do the same with:
[255]
b'\xff\x00\x00\x00'
[256]
b'\x00\x01\x00\x00'
[300]
b',\x01\x00\x00'
[1,255]
b'\x01\x00\x00\x00\xff\x00\x00\x00'
With powers of 2 (and 1 less) it is easy to identify a pattern in the bytes.
frombuffer converts a byte string back to an array:
In [513]: np.frombuffer(np.array([1,300]).tostring(),int)
Out[513]: array([ 1, 300])
In [514]: np.frombuffer(np.array([1,300]).data,int)
Out[514]: array([ 1, 300])
Judging from this last expression, the tofile is just writing the array buffer to the file as bytes.
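The same byte layout can be reproduced without NumPy or Notepad++. This sketch uses the standard struct module as a stand-in for the array buffer, assuming little-endian 32-bit integers (what an int32 array holds on x86):

```python
import struct

# Two little-endian 32-bit integers, the same layout as the int32 array.
raw = struct.pack("<2i", 1, 300)

# Render every byte as two hex digits, the way a hex editor should.
hex_dump = " ".join(format(b, "02X") for b in raw)

print(hex_dump)            # 01 00 00 00 2C 01 00 00
print(raw)                 # b'\x01\x00\x00\x00,\x01\x00\x00'
print(raw[4] == ord(","))  # True
```

300 is 0x012C, so its low byte 0x2C comes first and prints as a comma, which is all the "strange" byte is.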

Python extraction of date data from a binary file

I have a file that I open in binary format using
with open(filename, 'br') as f2
I then need to extract certain blocks of Hex. There will be lots of these 'dates' in the code that will look like:
F2 96 E6 20 36 1B E4 40
I need to extract every instance of this in order for me to complete my date editing on it. Each 'date' will end with hex char 40 as above.
I have tried regex but these do not seem to work as I want.
For example
re.findall(b'............\\\x40', filename)
Can anyone assist?
I think you are confusing bytes with their hex representation. 0x40 is the hexadecimal representation of the integer 64, which is the ASCII code of the symbol @.
import re
with open(filename, 'rb') as f:
    # bytes pattern for binary data; re.DOTALL lets '.' match newline bytes too
    results = re.findall(b'.{7}@', f.read(), re.DOTALL)
print(results)
Please note that the following regexps are equivalent: b'.{7}@', b'.......@', b'.......\x40'
>>> print(0x40, hex(64))
64 0x40
>>> print(chr(0x40))
@
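If the offsets of the matches are needed for editing the dates later, re.finditer reports them. A sketch assuming Python 3 and that every date is exactly 7 arbitrary bytes followed by the byte 0x40; find_dates and the sample padding bytes are made up for illustration:

```python
import re

# 7 arbitrary bytes followed by 0x40; DOTALL so '.' also matches 0x0A.
DATE_PATTERN = re.compile(rb".{7}\x40", re.DOTALL)

def find_dates(data):
    """Return (offset, 8-byte block) pairs for every non-overlapping match."""
    return [(m.start(), m.group()) for m in DATE_PATTERN.finditer(data)]

# The example date from the question with some filler bytes around it.
sample = b"\x00\x01" + bytes.fromhex("F2 96 E6 20 36 1B E4 40") + b"\xff"

for offset, block in find_dates(sample):
    print(hex(offset), block.hex())
```

Keep in mind that 0x40 can also turn up in unrelated data, so a pattern this loose may report false positives; anything else known about the date format would help narrow it down.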
