I'm writing a JPEG file carver as a part of a forensic lab.
The assignment is to write a script that can extract JPEG files from a 10 MB dd-dump. We are not permitted to assign the file to a memory variable (because if it were too big, it would cause an overflow); instead, the Python script should read directly from the file.
My script seems to work fine, but it takes extremely long to finish (upwards of 30-40 minutes). Is this expected behavior, even for such a small 10 MB file? Is there anything I can do to shorten the time?
This is my code:
# Extract JPEGs from a file.
import sys
with open(sys.argv[1], "rb") as binary_file:
    binary_file.seek(0, 2)  # Seek to the end
    num_bytes = binary_file.tell()  # Get the file size
    count = 0  # Our counter of which file we are currently extracting.
    for i in range(num_bytes):
        binary_file.seek(i)
        four_bytes = binary_file.read(4)
        whole_file = binary_file.read()
        if four_bytes == b"\xff\xd8\xff\xd8" or four_bytes == b"\xff\xd8\xff\xe0" or four_bytes == b"\xff\xd8\xff\xe1":  # JPEG signature
            whole_file = whole_file.split(four_bytes)
            for photo in whole_file:
                count += 1
                name = "Pic " + str(count) + ".jpg"
                file(name, "wb").write(four_bytes + photo)
                print name
Aren't you reading your whole file on every iteration of the for loop?
Edit: What I mean is that at every byte you read your whole file (for a 10 MB file you are reading 10 MB roughly ten million times, aren't you?), even if the four bytes didn't match the JPEG signature.
Edit 3: What you need is, at every byte, to check whether there is a file to be written (by checking for the header/signature). If you match the signature, you start writing bytes to a file, but first, since you have already read 4 bytes, you have to jump back to where you were. Then, while reading each byte and writing it to the file, you check for the JPEG ending. When the file ends, you write the final byte, close the stream and start searching for a header again. This will not extract a JPEG from inside another JPEG.
import sys
with open("C:\\Users\\rauno\\Downloads\\8-jpeg-search\\8-jpeg-search.dd", "rb") as binary_file:
    binary_file.seek(0, 2)  # Seek to the end
    num_bytes = binary_file.tell()  # Get the file size
    write_to_file = False
    count = 0  # Our counter of which file we are currently extracting.
    for i in range(num_bytes):
        binary_file.seek(i)
        if write_to_file is False:
            four_bytes = binary_file.read(4)
            if four_bytes == b"\xff\xd8\xff\xd8" or four_bytes == b"\xff\xd8\xff\xe0" or four_bytes == b"\xff\xd8\xff\xe1":  # JPEG signature
                write_to_file = True
                count += 1
                name = "Pic " + str(count) + ".jpg"
                f = open(name, "wb")
                binary_file.seek(i)
        if write_to_file is True:  # not 'else' or you miss the first byte
            this_byte = binary_file.read(1)
            f.write(this_byte)
            next_byte = binary_file.read(1)  # "read" advances the file position, which is why there is a .seek(i) at the top of the loop
            if this_byte == b"\xff" and next_byte == b"\xd9":
                f.write(next_byte)
                f.close()
                write_to_file = False
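A side note on the speed question: the per-byte seek/read loop above is what makes this slow. If the images in the dump start on block boundaries (typical for dd images in this kind of exercise, but an assumption on my part), you can read the dump one 512-byte block at a time and only compare the first four bytes of each block. This is a rough sketch of that alternative, not the method of the answer above:

# Sketch: block-based carving. Assumes each JPEG starts at a 512-byte block
# boundary and that the images are stored contiguously in the dump.
import sys

SIGNATURES = (b"\xff\xd8\xff\xd8", b"\xff\xd8\xff\xe0", b"\xff\xd8\xff\xe1")
BLOCK_SIZE = 512  # an arbitrary but common dd block size

out = None
count = 0
with open(sys.argv[1], "rb") as dump:
    while True:
        block = dump.read(BLOCK_SIZE)
        if not block:
            break
        if block[:4] in SIGNATURES:
            if out is not None:
                out.close()  # the previous image ends where a new one begins
            count += 1
            out = open("Pic " + str(count) + ".jpg", "wb")
        if out is not None:
            out.write(block)
if out is not None:
    out.close()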
I have a .raw file containing a 52-line HTML header followed by the data themselves. The file is encoded as little-endian signed 24-bit integers, and I want to convert the data to integers in an ASCII file. I use Python 3.
I tried to 'unpack' the entire file with the following code found in this post:
import sys
import chunk
import struct
f1 = open('/Users/anais/Documents/CR_lab/Lab_files/labtest.raw', mode = 'rb')
data = struct.unpack('<i', chunk + ('\0' if chunk[2] < 128 else '\xff'))
But I get this error message:
TypeError: 'module' object is not subscriptable
EDIT
It seems this is better:
data = struct.unpack('<i','\0'+ bytes)[0] >> 8
But I still get an error message:
TypeError: must be str, not type
Easy to fix I presume?
That's not a nice file to process in Python! Python is great at processing text files, because it reads them in big chunks into an internal buffer and then iterates over lines, but you cannot easily access binary data that comes after text read that way. Additionally, the struct module has no support for 24-bit values.
The only way I can imagine is to read the file one byte at a time: first skip 52 end-of-line markers, then read the bytes 3 at a time, concatenate each group into a 4-byte byte string and unpack that.
Possible code could be:
import struct

eol = b'\n'  # or whatever the end of line is in your file
nlines = 52  # number of lines to skip
with open('/Users/anais/Documents/CR_lab/Lab_files/labtest.raw', mode='rb') as f1:
    for i in range(nlines):  # process nlines lines
        t = b''  # to store the content of each line
        while True:
            x = f1.read(1)  # one byte at a time
            if x == eol:  # ok we have one full line
                break
            else:
                t += x  # else concatenate into current line
        print(t)  # to control the initial 52 lines
    while True:
        t = bytes((0,))  # struct only knows how to process 4-byte ints
        for i in range(3):  # so build one starting with a null byte
            t += f1.read(1)
        # print(t)
        if len(t) == 1:
            break  # reached end of file
        if len(t) < 4:  # reached end of file with an incomplete value
            print("Remaining bytes at end of file", t)
            break
        # the trick is that the integer division by 256 skips the initial 0 byte and keeps the sign
        i = struct.unpack('<i', t)[0] // 256  # // for Python 3, plain / for Python 2
        print(i, hex(i))  # or any other more useful processing
Remark: the above code assumes that your description of 52 lines (each terminated by an end of line) is accurate, but the image shown suggests that the last line is not. In that case, you should first skip 51 lines and then skip the content of the last line.
def skiplines(fd, nlines, eol):
    for i in range(nlines):  # process nlines lines
        t = b''  # to store the content of each line
        while True:
            x = fd.read(1)  # one byte at a time
            if x == eol:  # ok we have one full line
                break
            else:
                t += x  # else concatenate into current line
        # print(t)  # to control the skipped lines

with open('/Users/anais/Documents/CR_lab/Lab_files/labtest.raw', mode='rb') as f1:
    skiplines(f1, 51, b'\n')  # skip 51 lines terminated with a \n
    skiplines(f1, 1, b'>')    # skip the last line, assuming it ends at the >
    ...
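If you are on Python 3, int.from_bytes can decode a signed 24-bit value directly, which avoids padding to 4 bytes and dividing by 256. A minimal sketch under the same assumption that the header is 52 '\n'-terminated lines (adjust the skipping as in the remark above if the last line ends differently):

# Sketch (Python 3): decode 24-bit little-endian signed samples directly.
with open('/Users/anais/Documents/CR_lab/Lab_files/labtest.raw', mode='rb') as f1:
    for _ in range(52):          # skip the 52 header lines
        f1.readline()
    while True:
        sample = f1.read(3)      # one 24-bit value
        if len(sample) < 3:
            break                # end of file (or trailing incomplete value)
        value = int.from_bytes(sample, 'little', signed=True)
        print(value, hex(value))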
I have written two Python scripts. One encodes a file to binary and stores it as a text file for later decryption; the other script should turn that text file back into readable information, or at least that's my aim.
Script 1 (encrypt)
(use any .png image file as input, any .txt file as output):
u_input = input("What file to encrypt?")
file_store = input("Where do you want to store the binary?")
character = "" #Blank for now
encrypted = "" #Blank for now, stores the bytes before they are written
with open(u_input, 'rb') as f:
while True:
c = f.read(1)
if not c:
f.close()
break
encrypted = encrypted + str(bin(ord(c))[2:].zfill(8))
print("")
print(encrypted) # This line is not necessary, but I have included it to show that the encryption works
print("")
with open(file_store, 'wb') as f:
f.write(bytes(encrypted, 'UTF-8'))
f.close()
As far as I can tell, this works okay for text files (.txt).
I then have a second script (to decrypt the file).
Use the previously created .txt file as the source, and any .png file as the destination:
u_input =("Sourcefile:")
file_store = input("Decrypted output:")
character = ""
decoded_string = ""
with open(u_input, 'r' as f:
while True:
c = f.read(1)
if not c:
f.close()
break
character = character + c
if len(character) % 8 == 0:
decoded_string = decoded_string + chr(int(character, 2))
character = ""
with open(file_store, 'wb') as f:
f.write(bytes(decoded_string, 'UTF-8'))
f.close()
print("SUCCESS!")
Which works partially, i.e. it writes the file. However, I cannot open or edit it. When I compare my original file (img.png) with my second file (img2.png), I see that characters have been replaced or line breaks not written correctly. I can't view the file in any image viewing/editing program, and I do not understand why.
Could someone please try to explain and provide a solution (albeit a partial one)? Thanks in advance.
Note: I am aware that my use of "encryption" and "decryption" is not necessarily correct, but this is a personal project, so it doesn't matter to me.
It appears you're using Python 3, as you put a UTF-8 parameter on the bytes call. That's your problem - the input should be decoded to a byte string, but you're putting together a Unicode string instead, and the conversion isn't 1:1. It's easy to fix.
decoded_string = b""
# ...
decoded_string = decoded_string + bytes([int(character, 2)])
# ...
f.write(decoded_string)
For a version that works in both Python 2 and Python 3, make another small modification. This actually measures faster for me in Python 3.5, so it should be the preferred method.
import struct
# ...
decoded_string = decoded_string + struct.pack('B', int(character, 2))
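For context, this is roughly what the decoding loop looks like with the byte-string fix applied (a Python 3 sketch; the file names are placeholders):

# Sketch: the decryption loop collecting raw bytes instead of a str.
decoded_bytes = b""
character = ""

with open("binary.txt", "r") as f:            # placeholder source file
    while True:
        c = f.read(1)
        if not c:
            break
        character += c
        if len(character) == 8:
            decoded_bytes += bytes([int(character, 2)])
            character = ""

with open("img2.png", "wb") as f:             # placeholder destination file
    f.write(decoded_bytes)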
I have a very large big-endian binary file. I know how many numbers are in this file. I found a solution for reading a big-endian file using struct, and it works perfectly if the file is small:
data = []
file = open('some_file.dat', 'rb')
for i in range(0, numcount):
    data.append(struct.unpack('>f', file.read(4))[0])
But this code works very slowly if the file size is more than ~100 MB.
My current file is 1.5 GB and contains 399,513,600 float numbers. The above code takes about 8 minutes with this file.
I found another solution, that works faster:
datafile = open('some_file.dat', 'rb').read()
f_len = ">" + "f" * numcount #numcount = 399513600
numbers = struct.unpack(f_len, datafile)
This code runs in about 1.5 minutes, but this is still too slow for me. Earlier I wrote the same functionality in Fortran and it ran in about 10 seconds.
In Fortran I open the file with a "big-endian" flag and can simply read the file into a REAL array without any conversion, but in Python I have to read the file as a string and convert every 4 bytes into a float using struct. Is it possible to make the program run faster?
You can use numpy.fromfile to read the file, specifying that the type is big-endian with > in the dtype parameter:
numpy.fromfile(filename, dtype='>f')
There is an array.fromfile method too, but unfortunately I cannot see any way to control endianness with it, so depending on your use case this might avoid the dependency on a third-party library or be useless.
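For completeness, the array route would look roughly like this; array has no endianness option, so on a little-endian machine you have to byteswap the values yourself after reading (the file name and numcount are taken from the question):

# Sketch: array.array plus a manual byte swap for big-endian floats.
import sys
from array import array

numcount = 399513600                  # number of floats, from the question
numbers = array('f')                  # 32-bit floats in native byte order
with open('some_file.dat', 'rb') as f:
    numbers.fromfile(f, numcount)
if sys.byteorder == 'little':
    numbers.byteswap()                # the data on disk is big-endian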
The following approach gave a good speed up for me:
import struct
import time

block_size = 4096
start = time.time()
with open('some_file.dat', 'rb') as f_input:
    data = []
    while True:
        block = f_input.read(block_size * 4)
        data.extend(struct.unpack('>{}f'.format(len(block) // 4), block))
        if len(block) < block_size * 4:
            break
print "Time taken: {:.2f}".format(time.time() - start)
print "Length", len(data)
Rather than using >fffffff you can specify a count, e.g. >1000f. The file is read 4096 floats (16384 bytes) at a time; when the final read returns less than that, the format count is adjusted to what was actually read and the loop exits.
From the struct - Format Characters documentation:
A format character may be preceded by an integral repeat count. For
example, the format string '4h' means exactly the same as 'hhhh'.
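In other words, the two spellings below decode identically:

import struct

data = struct.pack('>4f', 1.0, 2.0, 3.0, 4.0)
assert struct.unpack('>4f', data) == struct.unpack('>ffff', data)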
def read_big_endian(filename):
    all_text = ""
    with open(filename, "rb") as template:
        try:
            template.read(2)  # first 2 bytes are FF FE
            while True:
                dchar = template.read(2)
                all_text += dchar[0]  # keep the low byte of each 2-byte character
        except:
            pass  # indexing the empty read at end of file raises, which ends the loop
    return all_text

def save_big_endian(filename, text):
    with open(filename, "wb") as fic:
        fic.write(chr(255) + chr(254))  # first 2 bytes are FF FE
        for letter in text:
            fic.write(letter + chr(0))
Used to read .rdp files
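Side note: a file that begins with the bytes FF FE is UTF-16 little-endian text with a byte-order mark, so if your .rdp files really are that, Python 3's codecs can do the same job. A hedged sketch, under the assumption that the files are UTF-16-LE with a BOM:

# Sketch (Python 3): treat the file as UTF-16 text instead of stripping
# every second byte by hand.
def read_big_endian(filename):
    with open(filename, encoding='utf-16') as f:   # 'utf-16' consumes the BOM
        return f.read()

def save_big_endian(filename, text):
    with open(filename, 'w', encoding='utf-16-le') as f:
        f.write('\ufeff' + text)                   # write the FF FE BOM first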
I'm currently working on a school project and I seem to have encountered some problems with an MPEG file. The scope of my project is to:
1) split an MPEG file into many fixed-size chunks.
2) reassemble some of them while omitting certain chunks.
Problem 1:
When I play the file in a media player, it plays the video until it reaches the chunk that I omitted.
Example:
chunk = ["yui_1", "yui_2", "yui_3", "yui_5", "yui_6"]
Duration of each chunk: 1 second
(Notice that I have omitted the "yui_4" chunk.)
If I assemble all the chunks except "yui_4", the video plays the first 2 seconds before it hangs for the rest of the duration.
Problem 2:
When I assemble the chunks while omitting the first chunk, the entire MPEG file becomes unplayable.
Example:
chunk = ["yui_2", "yui_3", "yui_4", "yui_5", "yui_6"]
Duration of each chunk: 1 second
Below is a portion of my code (hardcoded):
def splitFile(inputFile, chunkSize):
    splittext = inputFile.split(".")
    name = splittext[0]
    extension = splittext[1]
    os.chdir("./" + media_dir)
    # read the contents of the file
    f = open(inputFile, 'rb')
    data = f.read()  # read the entire content of the file
    f.close()
    # get the length of data, i.e. the size of the input file in bytes
    bytes = len(data)
    # calculate the number of chunks to be created
    noOfChunks = bytes / chunkSize
    if (bytes % chunkSize):
        noOfChunks += 1
    # create an info.txt file for writing metadata
    f = open('info.txt', 'w')
    f.write(inputFile + ',' + 'chunk,' + str(noOfChunks) + ',' + str(chunkSize))
    f.close()
    chunkNames = []
    count = 1
    for i in range(0, bytes + 1, chunkSize):
        fn1 = name + "_%s" % count
        chunkNames.append(fn1)
        f = open(fn1, 'wb')
        f.write(data[i:i + chunkSize])
        count += 1
        f.close()
Below is a portion of how I assemble the chunks:
def assemble():
    datafile = ["yui_1", "yui_2", "yui_3", "yui_4", "yui_5", "yui_6", "yui_7"]
    output = open("output.mpeg", "wb")
    for item in datafile:
        data = open(item, "rb").read()
        output.write(data)
    output.close()
MPEG video files contain encoded (i.e. compressed) video data. The bottom line is that cutting the video into chunks that can be appended or played separately is not trivial. Neither of your problems can be solved unless you read the MPEG-2 transport stream specification carefully and understand where to find the points at which you can cut and "splice" the stream while still producing a compliant MPEG stream. My guess is that this isn't what you want to do for a school project.
Maybe you should instead read up on how to use FFmpeg (http://www.ffmpeg.org/) to cut and append video files.
Good luck on your project.
I am trying to split up a large XML file into smaller chunks. I write to the output file and then check its size to see whether it has passed a threshold, but I don't think the getsize() method is working as expected.
What would be a good way to get the size of a file that is changing in size?
I've done something like this...
import string
import os

f1 = open('VSERVICE.xml', 'r')
f2 = open('split.xml', 'w')
for line in f1:
    if str(line) == '</Service>\n':
        break
    else:
        f2.write(line)
        size = os.path.getsize('split.xml')
        print('size = ' + str(size))
Running this prints 0 as the file size for about 80 iterations and then 4176. Does Python store the output in a buffer before actually writing it out?
File size is different from file position. For example,
os.path.getsize('sample.txt')
It exactly returns file size in bytes.
But
f = open('sample.txt')
print f.readline()
f.tell()
Here f.tell() returns the current position of the file handle, i.e. where the next read or write will occur. Because the file object knows about its own buffering, this is accurate as long as you are simply appending to the output file.
Yes, Python is buffering your output. You'd be better off tracking the size yourself, something like this:
size = 0
for line in f1:
    if str(line) == '</Service>\n':
        break
    else:
        f2.write(line)
        size += len(line)
        print('size = ' + str(size))
(That might not be 100% accurate, e.g. on Windows each line will gain a byte because of the \r\n line separator, but it should be good enough for simple chunking.)
Have you tried replacing os.path.getsize with the file object's tell() method, like this:
f2.write(line)
size = f2.tell()
Tracking the size yourself will be fine for your case. A different way would be to flush the file buffers just before you check the size:
f2.write(line)
f2.flush() # <-- buffers are written to disk
size = os.path.getsize('split.xml')
Doing that too often will slow down file I/O, of course.
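A tiny demonstration of the buffering effect and the flush (the file name is just an example):

# Sketch: getsize() lags behind tell() until the buffer is flushed.
import os

with open('split.xml', 'w') as f2:
    f2.write('<Service>\n')
    print(f2.tell())                      # reflects what was written so far
    print(os.path.getsize('split.xml'))   # may still report 0 (data buffered)
    f2.flush()
    print(os.path.getsize('split.xml'))   # now matches the bytes on disk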
To find the offset to the end of a file:
file.seek(0,2)
print file.tell()
Real world example - read updates to a file and print them as they happen:
file = open('log.txt', 'r')

# find the initial End Of File offset
file.seek(0, 2)
eof = file.tell()

while True:
    # get the file size again
    file.seek(0, 2)
    neweof = file.tell()
    # if the file is larger...
    if neweof > eof:
        # go back to the last position...
        file.seek(eof)
        # print from the last position to the current one
        print file.read(neweof - eof),
        eof = neweof