I have written two Python scripts. One encodes a file as a string of binary digits and stores it in a text file for later decryption. The other script turns that text file back into the original data, or at least that's my aim.
Script 1 (encrypt), using any .png image file as input and any .txt file as output:
u_input = input("What file to encrypt?")
file_store = input("Where do you want to store the binary?")
character = ""  # Blank for now
encrypted = ""  # Blank for now, stores the bytes before they are written

with open(u_input, 'rb') as f:
    while True:
        c = f.read(1)
        if not c:
            f.close()
            break
        encrypted = encrypted + str(bin(ord(c))[2:].zfill(8))

print("")
print(encrypted)  # This line is not necessary, but I have included it to show that the encryption works
print("")

with open(file_store, 'wb') as f:
    f.write(bytes(encrypted, 'UTF-8'))
    f.close()
As far as I can tell, this works okay for text files (.txt).
I then have a second script to decrypt the file. Use the previously created .txt file as source and any .png file as destination:
u_input = input("Sourcefile:")
file_store = input("Decrypted output:")
character = ""
decoded_string = ""

with open(u_input, 'r') as f:
    while True:
        c = f.read(1)
        if not c:
            f.close()
            break
        character = character + c
        if len(character) % 8 == 0:
            decoded_string = decoded_string + chr(int(character, 2))
            character = ""

with open(file_store, 'wb') as f:
    f.write(bytes(decoded_string, 'UTF-8'))
    f.close()

print("SUCCESS!")
This works partially, i.e. it writes the file. However, I cannot open or edit the result. When I compare my original file (img.png) with my second file (img2.png), I see that characters have been replaced or line breaks not entered correctly, and I can't view the file in any image viewing or editing program. I do not understand why.
Could someone try to explain and provide a solution (albeit a partial one)? Thanks in advance.
Note: I am aware that my use of "encryption" and "decryption" is not necessarily correct, but this is a personal project, so it doesn't matter to me.
It appears you're using Python 3, since you pass a UTF-8 parameter to the bytes call. That's your problem: the decoded output should be assembled as a byte string, but you're building a Unicode text string instead, and the conversion isn't 1:1. It's easy to fix:
decoded_string = b""
# ...
decoded_string = decoded_string + bytes([int(character, 2)])
# ...
f.write(decoded_string)
For a version that works in both Python 2 and Python 3, make another small modification. This version actually measures faster for me in Python 3.5, so it should be the preferred method:
import struct
# ...
decoded_string = decoded_string + struct.pack('B', int(character, 2))
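Assembled into a complete script, a minimal sketch of the fixed decoder (Python 3, using the bytes([...]) variant; the prompts are copied from the question):
u_input = input("Sourcefile:")
file_store = input("Decrypted output:")

character = ""
decoded = b""  # accumulate raw bytes, not text

with open(u_input, 'r') as f:
    while True:
        c = f.read(1)
        if not c:
            break
        character = character + c
        if len(character) % 8 == 0:
            decoded = decoded + bytes([int(character, 2)])  # one reconstructed byte
            character = ""

with open(file_store, 'wb') as f:
    f.write(decoded)  # raw bytes; no text encoding involved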
Related
I have an input file (File A) as shown below:
Start of the program
This is my first program ABCDE
End of the program
I receive the program name 'PYTHON' as input, and I need to replace 'ABCDE' with it. So I read the file to find the word 'program' and then replace the string after it, as shown below; I have done that in my program. Then I would like to write the updated string back to the original file without changing lines 1 or 3, only line 2.
Start of the program
This is my first program PYTHON
End of the program
My code:
fileName1 = open(filePath1, "r")
search = "program"
for line in fileName1:
    if search in line:
        line = line.split(" ")
        update = line[5].replace(line[5], input)
        temp = " ".join(line[:5]) + " " + update
        fileName1 = open(filePath1, "r+")
        fileName1.write(temp)
        fileName1.close()
    else:
        fileName1 = open(filePath1, "w+")
        fileName1.write(line)
        fileName1.close()
I am sure this can be done in an elegant way, but I got a little confused with the reading and writing as I experimented with the above code. The output is not as expected. What is wrong with my code?
You can do this with a simple replace:
file_a.txt
Start of the program
This is my first program ABCDE
End of the program
code:
with open('file_a.txt', 'r') as file_handle:
    file_content = file_handle.read()

orig_str = 'ABCDE'
rep_str = 'PYTHON'
result = file_content.replace(orig_str, rep_str)
# print(result)

with open('file_a.txt', 'w') as file_handle:
    file_handle.write(result)
Also, if just replacing ABCDE is not going to work (it may appear in other parts of the file as well), you can use a more specific pattern or even a regular expression (sketched after the next example) to replace it more accurately.
For example, here we only replace ABCDE when it comes right after 'program':
with open('file_a.txt', 'r') as file_handle:
    file_content = file_handle.read()

orig_str = 'ABCDE'
rep_str = 'PYTHON'
result = file_content.replace('program {}'.format(orig_str),
                              'program {}'.format(rep_str))
# print(result)

with open('file_a.txt', 'w') as file_handle:
    file_handle.write(result)
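And a minimal sketch of the regular-expression version mentioned above, which replaces whatever word follows 'program ' rather than a fixed string. The lookbehind pattern is my assumption about the data; note it would also hit other 'program <word>' occurrences elsewhere in the file:
import re

with open('file_a.txt', 'r') as file_handle:
    file_content = file_handle.read()

# (?<=program ) is a lookbehind: match the word that follows 'program '
result = re.sub(r'(?<=program )\w+', 'PYTHON', file_content)

with open('file_a.txt', 'w') as file_handle:
    file_handle.write(result)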
I'm writing a JPEG file carver as part of a forensics lab.
The assignment is to write a script that extracts JPEG files from a 10 MB dd dump. We are not permitted to load the file into a variable in memory (because if it were too big, that would cause an overflow); instead, the script should read directly from the file.
My script seems to work fine, but it takes extremely long to finish (upwards of 30-40 minutes). Is this expected behavior, even for such a small 10 MB file? Is there anything I can do to shorten the time?
This is my code:
# Extract JPEGs from a file.
import sys

with open(sys.argv[1], "rb") as binary_file:
    binary_file.seek(0, 2)  # Seek the end
    num_bytes = binary_file.tell()  # Get the file size
    count = 0  # Our counter of which file we are currently extracting.
    for i in range(num_bytes):
        binary_file.seek(i)
        four_bytes = binary_file.read(4)
        whole_file = binary_file.read()
        if four_bytes == b"\xff\xd8\xff\xd8" or four_bytes == b"\xff\xd8\xff\xe0" or four_bytes == b"\xff\xd8\xff\xe1":  # JPEG signature
            whole_file = whole_file.split(four_bytes)
            for photo in whole_file:
                count += 1
                name = "Pic " + str(count) + ".jpg"
                file(name, "wb").write(four_bytes + photo)
                print name
Aren't you reading your whole file on every pass of the for loop?
Edit: What I mean is that at every byte offset you read the whole rest of the file (for a 10 MB file you are reading up to 10 MB roughly ten million times), even when the four bytes didn't match the JPEG signature.
Edit 3: What you need is to check at every byte whether there is a file to be written, by looking for the header/signature. If you match the signature, you have to start writing bytes to a file; but first, since you already read 4 bytes, you have to jump back to where you were. Then, while reading each byte and writing it to the file, you have to check for the JPEG end marker. When the file ends, write that last byte, close the stream, and start searching for a header again. Note this will not extract a JPEG from inside another JPEG.
import sys

with open("C:\\Users\\rauno\\Downloads\\8-jpeg-search\\8-jpeg-search.dd", "rb") as binary_file:
    binary_file.seek(0, 2)  # Seek the end
    num_bytes = binary_file.tell()  # Get the file size
    write_to_file = False
    count = 0  # Our counter of which file we are currently extracting.
    for i in range(num_bytes):
        binary_file.seek(i)
        if write_to_file is False:
            four_bytes = binary_file.read(4)
            if four_bytes == b"\xff\xd8\xff\xd8" or four_bytes == b"\xff\xd8\xff\xe0" or four_bytes == b"\xff\xd8\xff\xe1":  # JPEG signature
                write_to_file = True
                count += 1
                name = "Pic " + str(count) + ".jpg"
                f = open(name, "wb")
                binary_file.seek(i)
        if write_to_file is True:  # not 'else' or you miss the first byte
            this_byte = binary_file.read(1)
            f.write(this_byte)
            next_byte = binary_file.read(1)  # "read" advances the file position, which is why there is a .seek(i) at the top of the loop
            if this_byte == b"\xff" and next_byte == b"\xd9":
                f.write(next_byte)
                f.close()
                write_to_file = False
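On the speed question itself: the dominant cost is calling seek() and read() once per byte offset. A hedged sketch of a chunked scan (Python 2, like the thread; the chunk size and script interface are my assumptions) that finds the signature offsets without loading the whole dump at once and without re-reading it per byte:
import sys

CHUNK = 64 * 1024
SIGS = (b"\xff\xd8\xff\xd8", b"\xff\xd8\xff\xe0", b"\xff\xd8\xff\xe1")

def find_signatures(path):
    # Yield absolute offsets of JPEG start-of-image signatures.
    with open(path, "rb") as f:
        base = 0    # bytes consumed so far
        tail = b""  # last 3 bytes of the previous chunk
        while True:
            chunk = f.read(CHUNK)
            if not chunk:
                break
            window = tail + chunk  # overlap so boundary-spanning matches aren't missed
            for sig in SIGS:
                p = window.find(sig)
                while p != -1:
                    yield base - len(tail) + p
                    p = window.find(sig, p + 1)
            tail = window[-3:]
            base += len(chunk)

print sorted(find_signatures(sys.argv[1]))
Extraction can then copy from each start offset forward to the next \xff\xd9 marker in the same chunked fashion.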
I have the following code to read the lines in a file and split them on a specified delimiter. After the split, I have to write some specific fields into another file.
Sample Data:
Week49_A_60002000;Mar;FY14;Actual;Working;E_1000;PC_000000;4287.63
Code:
import os
import codecs

sfilename = "WEEK_RPT_1108" + os.extsep + "dat"
sfilepath = "Club" + "/" + sfilename
sbackupname = "Club" + "/" + sfilename + os.extsep + "bak"

try:
    os.unlink(sbackupname)
except OSError:
    pass
os.rename(sfilepath, sbackupname)

try:
    inputfile = codecs.open(sbackupname, "r", "utf-16-le")
    outputfile = codecs.open(sfilepath, "w", "utf-16-le")
    sdelimdatfile = ";"
    for line in inputfile:
        record = line.split(';')
        outputfile.write(record[1])
except IOError, err:
    pass
I can see that index 0 of the list contains the whole line instead of the first field:
record[0] = Week49_A_60002000;Mar;FY14;Actual;Working;E_1000;PC_000000;4287.63
while printing record[1] fails with an index-out-of-range error.
I need help, as I am new to Python. Thanks!
After your comment saying that print line outputs u'\u6557\u6b65\u3934\u415f\u365f\u3030\u3230\u3030\u3b30\u614d\u3b72\u5946\u3431\u413b\u7463\u6175\u3b6c\u6f57\u6b72\u6e69\u3b67\u5f45\u3031\u3030\u503b\u5f43\u3030\u3030\u3030\u343b\u3832\u2e37\u3336', I can explain what happens and how to fix it.
What happens:
You have a normal 8-bit-characters file, and the line you show is even plain ASCII, but you try to decode it as if it were UTF-16 little-endian. So you wrongly combine every two bytes into a single 16-bit Unicode character! If your system had been able to display them correctly, and if you had printed line directly instead of repr(line), you would have got 敗步㤴䅟㙟〰㈰〰㬰慍㭲奆㐱䄻瑣慵㭬潗歲湩㭧彅〱〰倻彃〰〰〰㐻㠲⸷㌶. Of course, none of those Unicode characters is the semicolon (; i.e. \x3b or \u003b), so the line cannot be split on it.
But since you encode it back before writing, record[0] holds the whole line in the new file, which leads you to believe, erroneously, that the problem is in the split function.
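A quick interactive demonstration of the effect (Python 2): every pair of ASCII bytes fuses into one 16-bit code unit, so the semicolons vanish:
>>> "We".decode("utf-16-le")  # 'W' (0x57) and 'e' (0x65) fuse into one character
u'\u6557'
>>> u";" in "Week49;Mar".decode("utf-16-le")
False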
How to fix:
Just open the file normally, or use the correct encoding if it contains non-ASCII characters. Since you are using Python 2, I would just do:
try:
    inputfile = open(sbackupname, "r")
    outputfile = open(sfilepath, "w")
    sdelimdatfile = ";"
    for line in inputfile:
        record = line.split(sdelimdatfile)
        outputfile.write(record[1])
except IOError, err:
    pass
If you really need to use the codecs module, for example if the file contains UTF-8 or Latin-1 characters, you can replace the open calls with:
encoding = "utf8" # or "latin1" or whatever the actual encoding is...
inputfile = codecs.open(sbackupname, "r", encoding)
outputfile = codecs.open(sfilepath, "w", encoding)
Then, guard against lines that have no index [1], i.e. lines without a semicolon: either skip them with continue when len(record) < 2, or just don't write to the file (as here):
for line in inputfile:
    record = line.split(';')
    if len(record) >= 2:
        outputfile.write(record[1])
I work with a thermal printer. This printer is able to print images, but it needs to receive the data in hex format, so I need a Python function that reads an image and returns a value containing the image data in hex format.
I currently use this format to send hex data to the printer:
content = b"\x1B\x4E"
What is the simplest way to do this using Python 2.7?
All the best;
I don't really know what you mean by "hex format", but if the printer needs the whole file as a sequence of bytes you can do:
with open("image.jpeg", "rb") as fp:
img = fp.read()
If your printer expects the image in some other format (like 8-bit values for every pixel), then try the Pillow library; it has many image manipulation functions and handles a wide range of input and output formats.
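For instance, a minimal sketch with Pillow (the 1-bit conversion and the file name are assumptions for illustration; check the printer's manual for the pixel layout it actually expects):
from PIL import Image

img = Image.open("image.jpeg").convert("1")  # 1-bit black and white, one bit per pixel
width, height = img.size
raster = img.tobytes()  # rows of packed pixel bits, ready to reformat for the printer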
How about this:
with open('something.jpeg', 'rb') as f:
    binValue = f.read(1)
    while len(binValue) != 0:
        hexVal = hex(ord(binValue))
        # Do something with the hex value
        binValue = f.read(1)
Or for a function, something like this:
import re

def imgToHex(file):
    string = ''
    with open(file, 'rb') as f:
        binValue = f.read(1)
        while len(binValue) != 0:
            hexVal = hex(ord(binValue))
            string += '\\' + hexVal
            binValue = f.read(1)
    string = re.sub('0x', 'x', string)  # Replace '0x' with 'x' for your needs
    return string
Note: you do not necessarily need the re.sub step if you use struct.pack to write the bytes instead, but this gets the string into the format that you need.
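A hypothetical usage example (Python 2), just to show the output shape; note the result is a text string of escape-like sequences, i.e. the characters backslash, x, f, f, not raw bytes:
print imgToHex('my_img.jpg')[:16]  # e.g. '\xff\xd8\xff\xe0'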
Read in a JPEG and make a string of hex values, then reverse the procedure: take the string of hex and write it back out as a JPEG file...
import binascii

with open('my_img.jpg', 'rb') as f:
    data = f.read()
print(data[:10])

im_hex = binascii.hexlify(data)
# check out the hex...
print(im_hex[:10])

# reversing the procedure
im_bin = binascii.a2b_hex(im_hex)
print(im_bin[:10])

# write it back out to a jpg file
with open('my_hex.jpg', 'wb') as image_file:
    image_file.write(im_bin)
I am using csv.reader to pull in info from a very long sheet. I am doing work on that data set and then I am using the xlwt package to give me a workable Excel file.
However, I get this error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position 34: ordinal not in range(128)
My question to you all is, how can I find exactly where that error is in my data set? Also, is there some code that I can write which will look through my data set and find out where the issues lie (because some data sets run without the above error and others have problems)?
The answer is quite simple, actually: as soon as you read the data from your file, convert it to Unicode using the file's encoding, and handle the UnicodeDecodeError exception:
try:
    # decode using utf-8 (use ascii if you want)
    unicode_data = str_data.decode("utf-8")
except UnicodeDecodeError, e:
    print "The error is there!"
This will save you from many troubles; you won't have to worry about multibyte character encodings, and external libraries (including xlwt) will just do the right thing if they need to write the data. Python 3 makes the distinction between text and bytes mandatory, so it's a good idea to get it right now.
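A small extension of the same idea (Python 2): the UnicodeDecodeError object records exactly where decoding failed, so you can report the offending byte directly:
try:
    unicode_data = str_data.decode("utf-8")
except UnicodeDecodeError, e:
    # e.object is the original byte string; e.start/e.end bracket the bad bytes
    print "Bad byte %r at offset %d: %s" % (e.object[e.start], e.start, e.reason)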
The csv module doesn't support Unicode or null characters. You might be able to replace them by doing something like this, though (replace 'utf-8' with the encoding your CSV data is actually in):
import codecs
import csv

class AsciiFile:
    def __init__(self, path):
        self.f = codecs.open(path, 'rb', 'utf-8')

    def close(self):
        self.f.close()

    def __iter__(self):
        for line in self.f:
            # 'replace' turns unicode characters into '?'; use 'ignore' to drop them
            y = line.encode('ascii', 'replace')
            y = y.replace('\0', '?')  # Can't handle null characters!
            yield y

f = AsciiFile(PATH)
r = csv.reader(f)
...
f.close()
If you want to find the positions of the characters which can't be handled by the csv module, you could do e.g.:
import codecs

lineno = 0
f = codecs.open(PATH, 'rb', 'utf-8')
for line in f:
    for x, c in enumerate(line):
        if not c.encode('ascii', 'ignore') or c == '\0':
            print "Character ordinal %s line %s character %s is unicode or null!" % (ord(c), lineno, x)
    lineno += 1
f.close()
Alternatively again, you could use this CSV opener which I wrote, which can handle Unicode characters:
import codecs

def OpenCSV(Path, Encoding, Delims, StartAtRow, Qualifier, Errors):
    infile = codecs.open(Path, "rb", Encoding, errors=Errors)
    for Line in infile:
        Line = Line.strip('\r\n')
        if (StartAtRow - 1) and StartAtRow > 0:
            StartAtRow -= 1
        elif Qualifier != '(None)':
            # Take a note of the chars 'before', just
            # in case of excel-style """ quoting.
            cB41 = ''; cB42 = ''
            L = ['']
            qMode = False
            for c in Line:
                if c == Qualifier and c == cB41 == cB42 and qMode:
                    # Triple qualifiers, so allow it with one
                    L[-1] = L[-1][:-2]
                    L[-1] += c
                elif c == Qualifier:
                    # A qualifier, so reverse qual mode
                    qMode = not qMode
                elif c in Delims and not qMode:
                    # Not in qual mode and delim
                    L.append('')
                else:
                    # Nothing to see here, move along
                    L[-1] += c
                cB42 = cB41
                cB41 = c
            yield L
        else:
            # There aren't any qualifiers.
            cB41 = ''; cB42 = ''
            L = ['']
            for c in Line:
                cB42 = cB41; cB41 = c
                if c in Delims:
                    # Delim
                    L.append('')
                else:
                    # Nothing to see here, move along
                    L[-1] += c
            yield L

for listItem in OpenCSV(PATH, Encoding='utf-8', Delims=[','], StartAtRow=0, Qualifier='"', Errors='replace'):
    ...
You can refer to code snippets in the question below to get a csv reader with unicode encoding support:
General Unicode/UTF-8 support for csv files in Python 2.6
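The gist of that approach, adapted from the Python 2 csv documentation: temporarily encode the text as UTF-8 (which csv.reader can handle) and decode each cell back afterwards:
import csv

def unicode_csv_reader(unicode_csv_data, **kwargs):
    # csv.reader cannot take unicode input in Python 2, but UTF-8 bytes are fine
    utf8_lines = (line.encode('utf-8') for line in unicode_csv_data)
    for row in csv.reader(utf8_lines, **kwargs):
        yield [cell.decode('utf-8') for cell in row]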
PLEASE give the full traceback that you got along with the error message. Once we know where you are getting the error (reading the CSV file, "doing work on that data set", or writing an XLS file using xlwt), we can give a focused answer.
It is very possible that your input data is not all plain old ASCII. What produces it, and in what encoding?
To find where the problems (not necessarily errors) are, try a little script like this (untested):
import sys, glob

for pattern in sys.argv[1:]:
    for filepath in glob.glob(pattern):
        for linex, line in enumerate(open(filepath, 'r')):
            if any(c >= '\x80' for c in line):
                print "Non-ASCII in line %d of file %r" % (linex + 1, filepath)
                print repr(line)
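A hypothetical invocation (the script name is my choice; the quotes stop the shell from expanding the globs itself):
python find_non_ascii.py "*.csv" "*.txt"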
It would be useful if you showed some samples of the "bad" lines that you find, so that we can judge what the encoding might be.
I'm curious about your using "csv.reader to pull in info from a very long sheet": what kind of "sheet"? Do you mean that you are saving an XLS file as CSV, then reading the CSV file? If so, you could use xlrd to read directly from the input XLS file, getting Unicode text which you can give straight to xlwt, avoiding any encode/decode problems.
Have you worked through the tutorial from the python-excel.org site?