How to cipher UTF-8 beyond A-Z in python? - python

Many years ago, I made a program in C# on Windows which "encrypts" text files using (what I thought was) a Caesar cipher.
Back then I wanted more characters than just A-Z,0-9 and made it possible but never thought about the actual theory behind it.
Looking at some of the files, and comparing it to this website, it seems like the UTF-8 is being shifted.
I started up a Windows VM (because I'm using Linux now) and typed this: abcdefghijklmnopqrstuvwxyz
It generated a text that looks like this in hexadecimals (Shifted 15 times):
70 71 72 73 74 75 76 77 78 79 7a 7b 7c 7d 7e 7f c280 c281 c282 c283 c284 c285 c286 c287 c288 c289
How can I shift the hexadecimals to look like this?
61 62 63 64 65 66 67 68 69 6a 6b 6c 6d 6e 6f 70 71 72 73 74 75 76 77 78 79 7a
Or are there any easier/better methods of doing this?
UPDATE
I'm using Python 3.5.3, and this is the code I have so far:
import sys
arguments = sys.argv[1:]
file = ""
for arg in arguments:
    if arg[0] != "-":
        file = arg
lines = []
with open(file) as f:
    lines = f.readlines()
for line in lines:
    result = 0
    for value in list(line):
        #value = "0x"+value
        temp=value.encode('utf-8').hex()
        temp+=15
        if(temp>0x7a):
            temp-=0x7a
        elif(temp<=0):
            temp+=0x7a
        #result = result + temp
    print (result)
Unfortunately, I don't have the C# source code available at the moment. I can try to find it.

Assuming your input is ASCII text, the simplest solution is to encode/decode as ASCII and use the built-in methods ord() and chr() to convert from character to byte value and vice versa.
Note that the temp value cannot be less than 0, so the second if-statement can be removed.
NB: This is outside the scope of the question, but I also noticed that you're doing argument parsing yourself. I highly recommend using argparse instead, since it's very easy and gives you a lot of extras for free (e.g. it performs error checking and prints a nice help message if you start your application with the '--help' option). See the example code below:
import argparse
parser = argparse.ArgumentParser()
parser.add_argument(dest='filenames', metavar='FILE', type=str, nargs='+',
                    help='file(s) to encrypt')
args = parser.parse_args()
for filename in args.filenames:
    with open(filename, 'rt', encoding='ascii') as file:
        lines = file.readlines()
    for line in lines:
        result = ""
        for value in line:
            temp = ord(value)  # character to int value
            temp += 15
            if temp > 0x7a:
                temp -= 0x7a
            result += chr(temp)  # int value to character
        print(result)

You can convert between hex strings and integers using int() and hex(). Note that hex() only accepts integers, so first convert the hex string to an integer with int(hex_str, 16).
hex_int = int(hex_str, 16)
cipher = hex_int - 15
hex_cipher = hex(cipher)
Now apply that in a loop and you can shift your results left or right as desired. And you could of course condense the code as well.
result = hex(int(hex_string, 16) - 15)
#in a loop
hexes = ['70', '71', 'c280']
ciphered = []
for n in hexes:
    ciphered.append(hex(int(n, 16) - 15))

You can use int('somestring'.encode('utf-8').hex(),16) to get the exact values on that website. If you want to apply the same rules to each character, you can do it in a character list. You can use
import codecs
def myencode(character,diff):
    temp=int(character.encode('utf-8').hex(),16)
    temp+=diff
    if(temp>0x7a):
        temp-=0x7a
    elif(temp<=0):
        temp+=0x7a
    result=codecs.decode(hex(temp)[2:],"hex").decode("utf-8")
    return result
diff is the shift for the cipher (an integer). encode('utf-8') converts the string to bytes, and .hex() renders those bytes as a hex string. Feed this function one character at a time so that each character is shifted independently.
After shifting, you need to turn the number back into a character: codecs.decode(..., "hex") converts the hex string back to bytes, and decode("utf-8") turns those bytes into a string.
Edit: Updated, now it works.
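The hex dump in the question is consistent with each character's code point having been shifted by 15 and the result then saved as UTF-8 (for example, 'q' is 0x71, and 0x71 + 15 = 0x80, which UTF-8 encodes as c2 80). Assuming that is what the original C# program did, which cannot be confirmed without its source, a minimal sketch of reversing it works on decoded characters rather than on raw bytes. The name shift_text is made up for illustration.

```python
def shift_text(text, shift):
    """Shift every character's Unicode code point by `shift` (no wrap-around)."""
    return "".join(chr(ord(ch) + shift) for ch in text)


# The bytes from the question, written as one hex string.
cipher_bytes = bytes.fromhex(
    "70 71 72 73 74 75 76 77 78 79 7a 7b 7c 7d 7e 7f "
    "c280 c281 c282 c283 c284 c285 c286 c287 c288 c289"
)

# Decode UTF-8 first so multi-byte sequences such as c2 80 become one character.
cipher_text = cipher_bytes.decode("utf-8")
plain_text = shift_text(cipher_text, -15)
print(plain_text)
```

Because there is no wrap-around, chr() raises ValueError if a shift pushes a code point below 0, so this sketch only suits text that was produced by the matching forward shift.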

Related

How to read double, float and int values from binary files in python?

I have a binary file that was created in C++. The first value is double and the second is integer. I am reading the values fine using the following code in C++.
double dob_value;
int integer_value;
fread(&dob_value, sizeof(dob_value), 1, fp);
fread(&integer_value, sizeof(integer_value), 1, fp);
I am trying to read the same file in python but I am running into issues. My dob_value is 400000000.00 and my integer_value 400000. I am using following code in python for double.
def interpret_float(x):
    return struct.unpack('d',x[4:]+x[:4])
with open(file_name, 'rb') as readfile:
    dob = readfile.read(8)
    dob_value = interpret_float(dob)[0]
    val = readfile.read(4)
    test2 = readfile.read(4)
    integer_value = int.from_bytes(test2, "little")
My dob_value comes out as 400000000.02384186. Where are these extra decimals coming from? Also, how do I get the correct integer_value? With the code above, my integer_value is 1091122467. I also have float values after the integer, but I haven't looked into that yet.
In case the link breaks, test.bin contains 00 00 00 00 84 D7 B7 41 80 1A 06 00 70 85 69 C0.
Your binary contains the correct hexadecimal representation of 400000000.0, 0x41B7D78400000000, in the first 8 bytes. Running
import binascii
import struct
fname = r'test.bin'
with open(fname, 'rb') as readfile:
    dob = readfile.read(8)
    print(struct.unpack('d', dob)[0])
    print(binascii.hexlify(dob))
outputs
>> 400000000.0
>> b'0000000084d7b741'
which is also the correct little-endian representation of the double. When you swap the two halves, you get
print(binascii.hexlify(dob[4:]+dob[:4]))
>> b'84d7b74100000000'
and if you check the decimal value, it will give you 5.45e-315, not what you expect. Moreover,
struct.unpack('d', dob[4:]+dob[:4])[0]
>>5.44740625e-315
So I'm not sure how you could get 400000000.02384186 from the code above. However, to obtain 400000000.02384186 using your test.bin, just skip the four bytes in the beginning:
with open(fname, 'rb') as readfile:
    val = readfile.read(4)
    dob = readfile.read(8)
    dob = dob[4:]+dob[:4]
    print(binascii.hexlify(dob))
    print(struct.unpack('d', dob)[0])
>>b'801a060084d7b741'
>>400000000.02384186
The binary value 0x41B7D78400061A80 corresponds to 400000000.02384186. So you first read the wrong bytes, then incorrectly swap the halves, and end up with a result close to what you expect. As for the integer, 400000 is 0x00061A80, which is also present in the binary, but you read past those bytes, since you used them for the double, so you get wrong values.
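For reference, here is a minimal sketch of reading all three values in one call with struct. It assumes the layout is a little-endian 8-byte double, a 4-byte int and a 4-byte float with no padding, which matches the 16 bytes quoted in the question. The bytes are inlined here rather than read from test.bin.

```python
import struct

# The 16 bytes of test.bin as quoted in the question.
raw = bytes.fromhex("00 00 00 00 84 D7 B7 41 80 1A 06 00 70 85 69 C0")

# '<' = little-endian with no padding; d = 8-byte double,
# i = 4-byte signed int, f = 4-byte float.
dob_value, integer_value, float_value = struct.unpack("<dif", raw)
print(dob_value, integer_value, float_value)
```

With a real file, raw would instead come from readfile.read(struct.calcsize("<dif")). No byte swapping is needed, because the format string already states the byte order.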

Converting broken byte string from unicode back to corresponding bytes

The following code iterates over rows, an iterable of strings that together contain a PDF byte stream. Each row was of type str. The resulting file was a valid PDF and could be opened.
with open(fname, "wb") as fd:
    for row in rows:
        fd.write(row)
Due to a new C library and changes in the Python implementation, the rows changed from str to unicode. The content changed as well, so my PDF file is now broken.
Starting bytes of first row object:
old row[0]: 25 50 44 46 2D 31 2E 33 0D 0A 25 E2 E3 CF D3 0D 0A ...
new row[0]: 25 50 44 46 2D 31 2E 33 0D 0A 25 C3 A2 C3 A3 C3 8F C3 93 0D 0A ...
I have aligned the corresponding byte positions here; it looks like a Unicode problem.
I think this is a good start but I still have a unicode string as input...
>>> "\xc3\xa2".decode('utf8') # but as input I have u"\xc3\xa2"
u'\xe2'
I already tried several calls of encode and decode so I need a more analytical way to fix this. I can't see the wood for the trees. Thank you.
When you find u"\xc3\xa2" in a Python unicode string, it often means that you have read a UTF-8 encoded file as if it were Latin1 encoded. So the best thing to do is certainly to fix the initial read.
That being said if you have to depend on broken code, the fix is still easy: you just encode the string as Latin1 and then decode it as UTF-8:
fixed_u_str = broken_u_str.encode('Latin1').decode('UTF-8')
For example:
u"\xc3\xa2\xc3\xa3".encode('Latin1').decode('utf8')
correctly gives u"\xe2\xe3" which displays as âã
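The same round trip can be sketched in Python 3 syntax, using the four mangled characters from the question's byte dump. This assumes the rows really hold UTF-8 bytes that were misread as Latin-1, as the byte comparison in the question suggests. The variable names are illustrative only.

```python
# Text that was UTF-8 encoded but wrongly decoded as Latin-1 (mojibake).
broken = "\xc3\xa2\xc3\xa3\xc3\x8f\xc3\x93"

# Latin-1 maps code points 0-255 straight back to bytes 0-255, which
# recovers the UTF-8 bytes; those are then decoded properly.
fixed = broken.encode("latin-1").decode("utf-8")

# The original PDF bytes are the Latin-1 encoding of the repaired text.
original_bytes = fixed.encode("latin-1")
print(original_bytes.hex())
```

The result matches the E2 E3 CF D3 bytes shown for the old row, so writing original_bytes to a file opened in "wb" mode should reproduce the old content for this part of the stream.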
This looks like you should be doing
fd.write(row.encode('utf-8'))
assuming the type of row is now unicode (this is my understanding of how you presented things).

How to read hex values at specific addresses in Python?

Say I have a file and I'm interested in reading and storing hex values at certain addresses, like the snippet below:
22660 00 50 50 04 00 56 0F 50 25 98 8A 19 54 EF 76 00
22670 75 38 D8 B9 90 34 17 75 93 19 93 19 49 71 EF 81
I want to read the value at 0x2266D, and be able to replace it with another hex value, but I can't understand how to do it. I've tried using open('filename', 'rb'), however this reads it as the ASCII representation of the values, and I don't see how to pick which addresses I want to change.
Thanks!
Edit: For an example, I have
rom = open("filename", 'rb')
for i in range(5):
    test = rom.next().split()
    print test
rom.close()
This outputs: ['NES\x1a', 'B\x00\x00\x00\x00\x00\x00\x00\x00\x00!\x0f\x0f\x0f(\x0f!!', '!\x02\x0f\x0f', '!\x0f\x01\x08', '!:\x0f\x0f\x03!', '\x0f', '\x0f\x0f', '!', '\x0f\x0f!\x0f\x03\x0f\x12', '\x0f\x0f\x0f(\x02&%\x0f', '\x0f', '#', '!\x0f\x0f1', '!"#$\x14\x14\x14\x13\x13\x03\x04\x0f\x0f\x03\x13#!!\x00\x00\x00\x00\x00!!', '(', '\x0f"\x0f', '#\x14\x11\x12\x0f\x0f\x0f#', '\x10', "5'4\x0270&\x02\x02\x02\x02\x0f\x0f\x0f\x0f\x0f\x0f\x0f\x0f\x0f\x0f\x0f\x0f\x0f\x0f126&\x13\x0f\x0f\x0f\x13&6222\x0f", '\x1c,', etc etc.
Much more than 5 bytes, and while some of it is in hex, some has been replaced with ASCII.
There's no indication that some of the bytes were replaced by their ASCII representations. Some bytes happen to be printable.
With a binary file, you can simply seek to the offset and write the bytes in. Working with the line iterator on a binary file is problematic, as there are no meaningful "lines" in a binary blob.
You can edit in place as follows (your_function is a placeholder for whatever transformation you need):
with open("filename", "rb+") as f:
    f.seek(0x2266D)
    the_byte = f.read(1)
    if len(the_byte) != 1:
        # something's wrong; bail out
        raise EOFError("could not read the byte at 0x2266D")
    else:
        transformed_byte = your_function(the_byte)
        f.seek(-1, 1)  # back one byte relative to the current position
        f.write(transformed_byte)
But of course, you may want to do the edit on a copy, either in memory (and commit later, as in the answer of @JosepValls), or on a file copy. The problem with gulping the whole file into memory is, of course, that sometimes the system may choke ;) In that case you may want to mmap part of the file.
Given that it is not a very big file (ROMs should fit fine in a modern computer's memory), just do data = open('filename', 'rb').read(). Now you can do whatever you want to the data (if you print it, printable bytes show up as ASCII characters, but it is still just data!). Unfortunately, string objects don't support item assignment, so you have to build a new string; see this answer for more:
Change one character in a string in Python?
In your case, to replace the byte at 0x2266D with 0xFF (Python 2, where the data is a str):
data = data[:0x2266D] + chr(0xFF) + data[0x2266E:]
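In Python 3 the file contents are bytes rather than str, so a single byte is written as bytes([value]). The following is a minimal sketch of the seek-and-overwrite approach under that assumption. patch_byte is a hypothetical helper name, and the demo runs against a throwaway temporary file rather than a real ROM.

```python
import os
import tempfile


def patch_byte(path, offset, new_value):
    """Overwrite the single byte at `offset` and return the old value."""
    with open(path, "rb+") as f:
        f.seek(offset)
        old = f.read(1)
        if len(old) != 1:
            raise EOFError("offset {:#x} is past the end of the file".format(offset))
        f.seek(offset)               # reading moved the position; go back
        f.write(bytes([new_value]))  # write exactly one byte
    return old[0]


# Demo on a throwaway 16-byte file so nothing real is modified.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(bytes(range(16)))
old_value = patch_byte(tmp.name, 0x0D, 0xFF)
with open(tmp.name, "rb") as f:
    patched = f.read()
os.remove(tmp.name)
print(hex(old_value), patched.hex())
```

For the ROM in the question the call would look like patch_byte("filename", 0x2266D, 0xFF), ideally on a copy of the file.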

How do I force recv() in Socket to NOT convert my hex values into ASCII if it can (python)

I am using python 3.4 socket interface of python-can. I am having a problem, when I receive the data via recv() or recvfrom() it converts some of the hex data in the message to ASCII if it can for example '63' becomes a 'c'. I do not want this, I want the raw hex data.
Here is a snippet part of the code:
def dissect_can_frame(frame):
    can_id, can_dlc, data = struct.unpack(can_frame_fmt, frame)
    global dataS
    dataS = data[:can_dlc]
    return (can_id, can_dlc, data[:can_dlc])
s = socket.socket(socket.AF_CAN,socket.SOCK_RAW,socket.CAN_RAW)
print(s)
s.bind((can_interface,))
#s.bind((sys.argv[1],)) #used for 'can0' as argument at initial execution
print(socket.AF_CAN,",",socket.SOCK_RAW,",",socket.CAN_RAW)
#while True:
cf, addr = s.recvfrom(4096)
print(cf,',',addr)
I get "b'\x18c\xd8\xd6\x1f\x01 \x18'" as the output section of the data instead of "18 63 D8 D6 1F 01 20 18". Do not care about the formatting but notice how '63' has become 'c' and '20' has inserted a space. Can I stop it doing this?
Is it common for socket to convert the data rather than producing the raw data?
Thank you for any help.
That's just how the data looks when it is printed: recv returns the raw bytes, and Python displays any printable byte as its ASCII character. If you want a hex-looking string, you can format each byte (iterating over a bytes object in Python 3 yields integers, so no ord() is needed):
>>> s = b'\x18c\xd8\xd6\x1f\x01 \x18'
>>> " ".join(["{:02X}".format(c) for c in s])
'18 63 D8 D6 1F 01 20 18'
Of course, this is an inconvenient format for actually doing any kind of analysis on the data. But it looks nice for display purposes.
Alternatively, there's hexlify, but that doesn't space out the values for you:
>>> import binascii
>>> binascii.hexlify(s)
b'1863d8d61f012018'
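To confirm that nothing was converted, it helps to look at the integer values behind the printed representation. This is a small Python 3 sketch using the frame from the question. The variable names are illustrative.

```python
import binascii

frame = b'\x18c\xd8\xd6\x1f\x01 \x18'

# Each element of a bytes object is already an integer; 'c' and ' ' are
# only how the repr displays the values 0x63 and 0x20.
values = list(frame)

# Space-separated upper-case hex, for display purposes.
spaced = " ".join("{:02X}".format(byte) for byte in frame)
print(spaced)

# hexlify returns bytes in Python 3; decode it to get a str.
compact = binascii.hexlify(frame).decode("ascii")
print(compact)
```

The data handed to struct.unpack in dissect_can_frame is therefore unaffected by how print displays it.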

How to write a file of ASCII bytes to a binary file as actual bytes?

Trying to do an MD5 collision homework problem and I'm not sure how to write raw bytes in Python. I gave it a shot but just ended up with a .bin file with ASCII in it. Here's my code:
fileWriteObject1 = open("md5One.bin", 'wb')
fileWriteObject2 = open("md5Two.bin", 'wb')
fileReadObject1 = open('bytes1.txt', 'r')
fileReadObject2 = open('bytes2.txt', 'r')
bytes1Contents = fileReadObject1.readlines()
bytes2Contents = fileReadObject2.readlines()
fileReadObject1.close()
fileReadObject2.close()
for bytes in bytes1Contents:
    toWrite = r"\x" + bytes
    fileWriteObject1.write(toWrite.strip())
for bytes in bytes2Contents:
    toWrite = r"\x" + bytes
    fileWriteObject2.write(toWrite.strip())
fileWriteObject1.close()
fileWriteObject2.close()
sample input:
d1
31
dd
02
c5
e6
ee
c4
69
3d
9a
06
98
af
f9
5c
2f
ca
b5
I had a link to my input file but it seems a mod removed it. It's a file with a hex byte written in ASCII on each line.
EDIT: SOLVED! Thanks to Circumflex.
I had two different text files, each with 128 bytes written as ASCII hex. I converted them to binary, wrote them using struct.pack, and got an MD5 collision.
If you want to write them as raw bytes, you can use the pack() function of the struct module.
You could write the MD5 out as two long long ints, but you'd have to write it in two 8-byte sections.
http://docs.python.org/library/struct.html
Edit:
An example:
import struct
bytes = "6F"
byteAsInt = int(bytes, 16)
packedString = struct.pack('B', byteAsInt)
If I've got this right, you're trying to pull in some text with hex strings written, convert them to binary format and output them? If that is the case, that code should do what you want.
It basically converts the raw hex string to an int, then packs it in binary form (as a byte) into a string.
You could loop over something like this for each byte in the input string
>>> import binascii
>>> binary = binascii.unhexlify("d131dd02c5")
>>> binary
'\xd11\xdd\x02\xc5'
binascii.unhexlify() is defined in binascii.c. Here's a "close to C" implementation in Python:
def binascii_unhexlify(ascii_string_with_hex):
    arglen = len(ascii_string_with_hex)
    if arglen % 2 != 0:
        raise TypeError("Odd-length string")
    retval = bytearray(arglen//2)
    for j, i in enumerate(xrange(0, arglen, 2)):
        top = to_int(ascii_string_with_hex[i])
        bot = to_int(ascii_string_with_hex[i+1])
        if top == -1 or bot == -1:
            raise TypeError("Non-hexadecimal digit found")
        retval[j] = (top << 4) + bot
    return bytes(retval)
def to_int(c):
    assert len(c) == 1
    return "0123456789abcdef".find(c.lower())
If there were no binascii.unhexlify() or bytearray.fromhex() or str.decode('hex') or similar you could write it as follows:
def unhexlify(s, table={"%02x" % i: chr(i) for i in range(0x100)}):
    if len(s) % 2 != 0:
        raise TypeError("Odd-length string")
    try:
        return ''.join(table[top+bot] for top, bot in zip(*[iter(s.lower())]*2))
    except KeyError, e:
        raise TypeError("Non-hexadecimal digit found: %s" % e)
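The snippets above are Python 2. In Python 3, bytes.fromhex() covers the question's case directly. Below is a minimal sketch that assumes, as in the sample input, one two-digit hex byte per line. hex_lines_to_bytes is a made-up helper name.

```python
def hex_lines_to_bytes(lines):
    """Join lines that each hold one hex byte (e.g. 'd1') into a bytes object."""
    hex_string = "".join(line.strip() for line in lines)
    return bytes.fromhex(hex_string)


# First five lines of the sample input, as readlines() would return them.
sample = ["d1\n", "31\n", "dd\n", "02\n", "c5\n"]
payload = hex_lines_to_bytes(sample)
print(payload)
```

The result can be written unchanged to a file opened with mode 'wb', such as md5One.bin in the question.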
