How to read from one file and write to several other files - python

I have a file containing several images. The images are chopped up into packets, which I call chunks in my code example. Every chunk contains a header with: count, uniqueID, start, length. start holds the start index of the img_data within the chunk and length is the length of the img_data within the chunk. count runs from 0 to 255, and the img_data of all these 256 chunks combined forms one image.

Before reading the chunks I open a 'dummy.bin' file so that I have something to write to; otherwise I get that f is not defined. At the end I remove the 'dummy.bin' file. The problem is that I need a file reference to start with. Although this code works, I wonder whether there is another way than creating a dummy file to get a file reference. The first chunk in 'test_file.bin' has hdr['count'] == 0, so f.close() will be called in the first iteration. That is why I need a file reference f before entering the for loop. Apart from that, every iteration writes img_data to a file with f.write(img_data); here too I need a file reference that is defined prior to entering the for loop, in case the first chunk has hdr['count'] != 0.

Is this the best solution? How do you generally read from one file and create several other files from it?
# read file, write several other files
import os

def read_chunks(filename, chunksize=512):
    f = open(filename, 'rb')
    while True:
        chunk = f.read(chunksize)
        if chunk:
            yield chunk
        else:
            break

def parse_header(data):
    count = data[0]
    uniqueID = data[1]
    start = data[2]
    length = data[3]
    return {'count': count, 'uniqueID': uniqueID, 'start': start, 'length': length}

filename = 'test_file.bin'
f = open('dummy.bin', 'wb')
for chunk in read_chunks(filename):
    hdr = parse_header(chunk)
    if hdr['count'] == 0:
        f.close()
        img_filename = 'img_' + str(hdr['uniqueID']) + '.raw'
        f = open(img_filename, 'wb')
    img_data = chunk[hdr['start']: hdr['start'] + hdr['length']]
    f.write(img_data)
    print(type(f))
f.close()
os.remove('dummy.bin')
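One way to avoid the dummy file, sketched below, is to start with f = None and guard the close/write calls. This is only a sketch of the main loop (it reuses read_chunks and parse_header from above) and it assumes, as in the original, that a chunk with count == 0 arrives before any image data needs to be written.

f = None
for chunk in read_chunks(filename):
    hdr = parse_header(chunk)
    if hdr['count'] == 0:
        if f is not None:
            f.close()          # close the previous image, if any
        f = open('img_' + str(hdr['uniqueID']) + '.raw', 'wb')
    img_data = chunk[hdr['start']: hdr['start'] + hdr['length']]
    if f is not None:          # only write once an image file is open
        f.write(img_data)
if f is not None:
    f.close()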

Related

Python JPEG file carver takes incredibly long on small dumps

I'm writing a JPEG file carver as a part of a forensic lab.
The assignment is to write a script that can extract JPEG files from a 10 MB dd-dump. We are not permitted to assign the file to a memory variable (because if it were to be too big, it would cause an overflow), but instead, the Python script should read directly from the file.
My script seems to work perfectly fine, but it takes extremely long to finish (upwards of 30-40 minutes). Is this expected behavior, even for such a small 10 MB file? Is there anything I can do to shorten the time?
This is my code:
# Extract JPEGs from a file.
import sys

with open(sys.argv[1], "rb") as binary_file:
    binary_file.seek(0, 2)  # Seek the end
    num_bytes = binary_file.tell()  # Get the file size
    count = 0  # Our counter of which file we are currently extracting.
    for i in range(num_bytes):
        binary_file.seek(i)
        four_bytes = binary_file.read(4)
        whole_file = binary_file.read()
        if four_bytes == b"\xff\xd8\xff\xd8" or four_bytes == b"\xff\xd8\xff\xe0" or four_bytes == b"\xff\xd8\xff\xe1":  # JPEG signature
            whole_file = whole_file.split(four_bytes)
            for photo in whole_file:
                count += 1
                name = "Pic " + str(count) + ".jpg"
                file(name, "wb").write(four_bytes + photo)
                print name
Aren't you reading your whole file on every iteration of the for loop?
E: What I mean is that at every byte you read your whole file (for a 10 MB file you are reading 10 MB ten million times, aren't you?), even if the four bytes didn't match the JPEG signature.
E3: What you need is, at every byte, to check whether there is a file to be written (by checking for the header/signature). If you match the signature, you start writing bytes to a file, but first, since you already read 4 bytes, you have to jump back to where you are. Then, while reading each byte and writing it to the file, you check for the JPEG ending. When the file ends, you write that last byte, close the stream, and start searching for a header again. This will not extract a JPEG from inside another JPEG.
import sys

with open("C:\\Users\\rauno\\Downloads\\8-jpeg-search\\8-jpeg-search.dd", "rb") as binary_file:
    binary_file.seek(0, 2)  # Seek the end
    num_bytes = binary_file.tell()  # Get the file size
    write_to_file = False
    count = 0  # Our counter of which file we are currently extracting.
    for i in range(num_bytes):
        binary_file.seek(i)
        if write_to_file is False:
            four_bytes = binary_file.read(4)
            if four_bytes == b"\xff\xd8\xff\xd8" or four_bytes == b"\xff\xd8\xff\xe0" or four_bytes == b"\xff\xd8\xff\xe1":  # JPEG signature
                write_to_file = True
                count += 1
                name = "Pic " + str(count) + ".jpg"
                f = open(name, "wb")
                binary_file.seek(i)
        if write_to_file is True:  # not 'else' or you miss the first byte
            this_byte = binary_file.read(1)
            f.write(this_byte)
            next_byte = binary_file.read(1)  # 'read' advances the file position (which is why .seek(i) is at the top of the loop)
            if this_byte == b"\xff" and next_byte == b"\xd9":
                f.write(next_byte)
                f.close()
                write_to_file = False
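For reference, a different (and usually much faster) way to locate the signatures is to read the file in large buffered chunks and scan each buffer with bytes.find, instead of seeking and reading for every single byte. This is only a sketch of the signature search, under the same three-signature assumption as above; carving the actual image data out would still need the end-marker logic described earlier.

import sys

SIGNATURES = (b"\xff\xd8\xff\xd8", b"\xff\xd8\xff\xe0", b"\xff\xd8\xff\xe1")

def find_signature_offsets(path, chunk_size=1 << 20):
    offsets = []
    with open(path, "rb") as f:
        base = 0        # file offset of the start of the current chunk
        tail = b""      # last 3 bytes of the previous chunk, to catch matches on the boundary
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            window = tail + chunk
            for sig in SIGNATURES:
                pos = window.find(sig)
                while pos != -1:
                    offsets.append(base - len(tail) + pos)
                    pos = window.find(sig, pos + 1)
            tail = chunk[-3:]
            base += len(chunk)
    return sorted(offsets)

if __name__ == "__main__":
    print(find_signature_offsets(sys.argv[1]))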

xor-ing a large file in python

I am trying to apply an XOR operation to a number of files, some of which are very large.
Basically I am getting a file and XOR-ing it byte by byte (or at least this is what I think I'm doing). When it hits a larger file (around 70 MB) I get an out-of-memory error and my script crashes.
My computer has 16 GB of RAM with more than 50% of it available, so I would not relate this to my hardware.
def xor3(source_file, target_file):
    b = bytearray(open(source_file, 'rb').read())
    for i in range(len(b)):
        b[i] ^= 0x71
    open(target_file, 'wb').write(b)
I tried to read the file in chunks, but it seems I'm too inexperienced for this, as the output is not the desired one. The first function returns what I want, of course :)
def xor(data):
    b = bytearray(data)
    for i in range(len(b)):
        b[i] ^= 0x41
    return data

def xor4(source_file, target_file):
    with open(source_file, 'rb') as ifile:
        with open(target_file, 'w+b') as ofile:
            data = ifile.read(1024*1024)
            while data:
                ofile.write(xor(data))
                data = ifile.read(1024*1024)
What is the appropriate solution for this kind of operation? What is it that I am doing wrong?
Read the file in chunks (read advances the file position for you, so no seek is needed) and append each transformed chunk to the output file:
CHUNK_SIZE = 1000  # for example

with open(source_file, 'rb') as source:
    with open(target_file, 'ab') as target:
        chunk = bytearray(source.read(CHUNK_SIZE))
        while chunk:
            for i in range(len(chunk)):
                chunk[i] ^= 0x71
            target.write(chunk)
            chunk = bytearray(source.read(CHUNK_SIZE))
Unless I am mistaken, in your second example you create a copy of data by calling bytearray and assigning it to b. You then modify b, but return data.
The modification of b has no effect on data itself.
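A minimal corrected sketch of that helper, operating on (and returning) the copy rather than the untouched input:

def xor(data, key=0x41):
    # Work on a mutable copy and return it, not the original data.
    b = bytearray(data)
    for i in range(len(b)):
        b[i] ^= key
    return bytes(b)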
Iterate lazily over the large file.
from operator import xor
from functools import partial

def chunked(file, chunk_size):
    return iter(lambda: file.read(chunk_size), b'')

myoperation = partial(xor, 0x71)

with open(source_file, 'rb') as source, open(target_file, 'ab') as target:
    processed = (map(myoperation, bytearray(data)) for data in chunked(source, 65536))
    for data in processed:
        target.write(bytearray(data))
This probably only works in Python 2, which shows again how much nicer it is to use for byte streams:
def xor(infile, outfile, val=0x71, chunk=1024):
    with open(infile, 'r') as inf:
        with open(outfile, 'w') as outf:
            c = inf.read(chunk)
            while c != '':
                s = "".join([chr(ord(cc) ^ val) for cc in c])
                outf.write(s)
                c = inf.read(chunk)
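If Python 3 is an option, one more sketch: bytes.translate() with a precomputed 256-byte table does the XOR per chunk entirely in C, so there is no per-byte Python loop at all. The chunk size and key below are just the values used earlier in the thread.

# Precompute the substitution table once: table[i] == i ^ 0x71.
XOR_TABLE = bytes(b ^ 0x71 for b in range(256))

def xor_file(source_file, target_file, chunk_size=1024 * 1024):
    with open(source_file, 'rb') as ifile, open(target_file, 'wb') as ofile:
        while True:
            chunk = ifile.read(chunk_size)
            if not chunk:
                break
            ofile.write(chunk.translate(XOR_TABLE))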

How to read the last few lines within a file using Python?

I am reading a folder containing files with a specific file name. I can read the content of a file, but how do I read specific lines, or the last 6 lines, within a file?
************************************
Test Scenario No. 1
TestcaseID = FB_71125_1
dpSettingScript = FB_71125_1_DP.txt
************************************
Setting Pre-Conditions (DP values, Sqlite DB):
cp /fs/images/nfs/FileRecogTest/MNT/test/Databases/FB_71125_1_device.sqlite $NUANCE_DB_DIR/device.sqlite
"sync" twice.
Starting the test:
0#00041511#0000000000# FILERECOGNITIONTEST: = testScenarioNo (int)1 =
0#00041514#0000000000# FILERECOGNITIONTEST: = TestcaseID (char*)FB_71125_1 =
0#00041518#0000000000# FILERECOGNITIONTEST: = dpSettingScript (char*)FB_71125_1_DP.txt =
0#00041520#0000000000# FILERECOGNITIONTEST: = UtteranceNo (char*)1 =
0#00041524#0000000000# FILERECOGNITIONTEST: = expectedEventData (char*)0||none|0||none =
0#00041528#0000000000# FILERECOGNITIONTEST: = expectedFollowUpDialog (char*) =
0#00041536#0000000000# FILERECOGNITIONTEST: /fs/images/nfs/FileRecogTest/MNT/test/main_menu.wav#MEDIA_COND:PAS_MEDIA&MEDIA_NOT_BT#>main_menu.global<#<FS0000_Pos_Rec_Tone><FS1000_MainMenu_ini1>
0#00041789#0000000000# FILERECOGNITIONTEST: Preparing test data done
0#00043768#0000000000# FILERECOGNITIONTEST: /fs/images/nfs/FileRecogTest/MNT/test/Framework.wav##>{any_device_name}<#<FS0000_Pos_Rec_Tone><FS1400_DeviceDisambig_<slot>_ini1>
0#00044008#0000000000# FILERECOGNITIONTEST: Preparing test data done
0#00045426#0000000000# FILERECOGNITIONTESTWARNING: expected >{any_device_name}<, got >lowconfidence1#FS1000_MainMenu<
1900#00046452#0000000000# FILERECOGNITIONTESTERROR: expected <FS0000_Pos_Rec_Tone><FS1400_DeviceDisambig_<slot>_ini1>, got <FS0000_Misrec_Tone><FS1000_MainMenu_nm1_004><pause300><FS1000_MainMenu_nm_001>
0#00046480#0000000000# FILERECOGNITIONTEST: Preparing test data done
0#00047026#0000000000# FILERECOGNITIONTEST: Stopping dialog immediately
[VCALogParser] Scenario 1 FAILED.
Can someone suggest how to read specific lines, or the last 6 lines, within a file?
I can think of two methods. If your files are not too big, you can just read all lines, and keep only the last six ones:
f = open(some_path)
last_lines = f.readlines()[-6:]
But that's really brute-force. Something cleverer is to make a guess, using the seek() method of your file object:
file_size = os.stat(some_path).st_size # in _bytes_, so take care depending on encoding
f = open(some_path)
f.seek(file_size - 1000) # here's the guess. Adjust with expected line length
last_lines = f.readlines()[-6:]
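Another common approach (just a sketch, using the same some_path as above): collections.deque with maxlen streams the file and only ever keeps the last six lines in memory.

from collections import deque

with open(some_path) as f:
    last_lines = list(deque(f, maxlen=6))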
To read the last 6 lines of a single file, you could use Python's file.seek to move near to the end of the file and then read the remaining lines. You need to decide what the maximum line length could possibly be, e.g. 1024 characters.
The seek command is first used to move to the end of the file (without reading it in); tell is used to determine the position in the file (as we are at the end, this will be the length). It then goes backwards in the file and reads the lines in. If the file is very short, the whole file is read in.
import os

filename = r"C:\Users\hemanth_venkatappa\Desktop\TEST\Language\test.txt"
back_up = 6 * 1024  # Go back from the end more than 6 lines' worth.

with open(filename, "r") as f_input:
    f_input.seek(0, os.SEEK_END)
    backup = min(back_up, f_input.tell())
    f_input.seek(-backup, os.SEEK_END)
    print f_input.readlines()[-6:]
Using with ensures your file is automatically closed afterwards. Prefixing your file path with r avoids having to double the backslashes in the path.
So to then apply this to your directory walk and write your results to a separate output file, you could do the following:
import os
import re

back_up = 6 * 256  # Go back from the end more than 6 lines' worth
directory = r"C:\Users\hemanth_venkatappa\Desktop\TEST\Language"
output_filename = r"C:\Users\hemanth_venkatappa\Desktop\TEST\output.txt"

with open(output_filename, 'w') as f_output:
    for dirpath, dirnames, filenames in os.walk(directory):
        for filename in filenames:
            if filename.startswith('VCALogParser_output'):
                cur_file = os.path.join(dirpath, filename)
                with open(cur_file, "r") as f_input:
                    f_input.seek(0, os.SEEK_END)
                    backup = min(back_up, f_input.tell())
                    f_input.seek(-backup, os.SEEK_END)
                    last_lines = ''.join(f_input.readlines()[-6:])
                try:
                    summary = ', '.join(re.search(r'(\d+ warning\(s\)).*?(\d+ error\(s\)).*?(\d+ scenarios\(s\))', last_lines, re.S).groups())
                except AttributeError:
                    summary = "No summary"
                f_output.write('{}: {}\n'.format(filename, summary))
Or, essentially, use a for loop to append the lines to an array and then delete items from the front until only the last six remain:
array = []
f = open("file.txt", "r")
for line in f:
    array.append(line)
f.close()

while len(array) > 6:
    del array[0]

Splitting a CSV file into equal parts?

I have a large CSV file that I would like to split into a number of parts equal to the number of CPU cores in the system. I then want to use multiprocessing to have all the cores work on the file together. However, I am having trouble even splitting the file into parts. I've looked all over Google and found some sample code that appears to do what I want. Here is what I have so far:
def split(infilename, num_cpus=multiprocessing.cpu_count()):
    READ_BUFFER = 2**13
    total_file_size = os.path.getsize(infilename)
    print total_file_size
    files = list()
    with open(infilename, 'rb') as infile:
        for i in xrange(num_cpus):
            files.append(tempfile.TemporaryFile())
            this_file_size = 0
            while this_file_size < 1.0 * total_file_size / num_cpus:
                files[-1].write(infile.read(READ_BUFFER))
                this_file_size += READ_BUFFER
            files[-1].write(infile.readline())  # get the possible remainder
            files[-1].seek(0, 0)
    return files

files = split("sample_simple.csv")
print len(files)

for ifile in files:
    reader = csv.reader(ifile)
    for row in reader:
        print row
The two prints show the correct file size and that it was split into 4 pieces (my system has 4 CPU cores).
However, the last section of the code that prints all the rows in each of the pieces gives the error:
for row in reader:
_csv.Error: line contains NULL byte
I tried printing the rows without running the split function and it prints all the values correctly. I suspect the split function has added some NULL bytes to the resulting 4 file pieces but I'm not sure why.
Does anyone know if this is a correct and fast method to split the file? I just want the resulting pieces to be readable by csv.reader.
As I said in a comment, csv files would need to be split on row (or line) boundaries. Your code doesn't do this and potentially breaks them up somewhere in the middle of one, which I suspect is the cause of your _csv.Error.
The following avoids doing that by processing the input file as a series of lines. I've tested it and it seems to work standalone, in the sense that it divided the sample file up into approximately equal-size chunks; approximately, because it's unlikely that a whole number of rows will fit exactly into a chunk.
Update
This is a substantially faster version of the code than the one I originally posted. The improvement is that it now uses the temp file's own tell() method to determine the constantly changing length of the file as it's being written, instead of calling os.path.getsize(), which eliminates the need to flush() the file and call os.fsync() on it after each row is written.
import csv
import multiprocessing
import os
import tempfile

def split(infilename, num_chunks=multiprocessing.cpu_count()):
    READ_BUFFER = 2**13
    in_file_size = os.path.getsize(infilename)
    print 'in_file_size:', in_file_size
    chunk_size = in_file_size // num_chunks
    print 'target chunk_size:', chunk_size
    files = []
    with open(infilename, 'rb', READ_BUFFER) as infile:
        for _ in xrange(num_chunks):
            temp_file = tempfile.TemporaryFile()
            while temp_file.tell() < chunk_size:
                try:
                    temp_file.write(infile.next())
                except StopIteration:  # end of infile
                    break
            temp_file.seek(0)  # rewind
            files.append(temp_file)
    return files

files = split("sample_simple.csv", num_chunks=4)
print 'number of files created: {}'.format(len(files))

for i, ifile in enumerate(files, start=1):
    print 'size of temp file {}: {}'.format(i, os.path.getsize(ifile.name))
    print 'contents of file {}:'.format(i)
    reader = csv.reader(ifile)
    for row in reader:
        print row
    print ''

Assemble mpeg file unable to play in mediaplayer

I'm currently working on a school project and I seem to encounter some problems with an MPEG file. The scope of my project is to:
1) split an MPEG file into many fixed-size chunks.
2) assemble some of them while omitting certain chunks.
Problem 1:
When I play the assembled file in a media player, it plays the video until it reaches the chunk that I omitted.
Example:
chunk = ["yui_1", "yui_2", "yui_3", "yui_5", "yui_6"]
Duration of each chunk: 1 second
*Notice that I have omitted the "yui_4" chunk.*
If I assemble all the chunks except "yui_4", the video plays the first 2 seconds before it hangs for the rest of its duration.
Problem 2:
When I assemble the chunks while omitting the first chunk, the entire MPEG file becomes unplayable.
Example:
chunk = ["yui_2", "yui_3", "yui_4", "yui_5", "yui_6"]
Duration of each chunk: 1 second
Below is a portion of my code (hard-coded):
def splitFile(inputFile, chunkSize):
    splittext = string.split(filename, ".")
    name = splittext[0]
    extension = splittext[1]
    os.chdir("./" + media_dir)
    # read the contents of the file
    f = open(inputFile, 'rb')
    data = f.read()  # read the entire content of the file
    f.close()
    # get the length of data, ie size of the input file in bytes
    bytes = len(data)
    # calculate the number of chunks to be created
    noOfChunks = bytes / chunkSize
    if (bytes % chunkSize):
        noOfChunks += 1
    # create a info.txt file for writing metadata
    f = open('info.txt', 'w')
    f.write(inputFile + ',' + 'chunk,' + str(noOfChunks) + ',' + str(chunkSize))
    f.close()
    chunkNames = []
    count = 1
    for i in range(0, bytes + 1, chunkSize):
        fn1 = name + "_%s" % count
        chunkNames.append(fn1)
        f = open(fn1, 'wb')
        f.write(data[i:i + chunkSize])
        count += 1
        f.close()
Below is a portion of how I assemble the chunks:
def assemble():
    data = ["yui_1", "yui_2", "yui_3", "yui_4", "yui_5", "yui_6", "yui_7"]
    output = open("output.mpeg", "wb")
    for item in data:
        chunk_data = open(item, "rb").read()
        output.write(chunk_data)
    output.close()
MPEG video files contain encoded video data, which means the data is compressed. The bottom line is that cutting the video into chunks that can be appended or played separately is not trivial. Neither of your problems will be solved unless you read the MPEG-2 transport stream specification carefully and understand where to find points at which you can cut and "splice" and still output a compliant MPEG stream. My guess is that this isn't what you want to do for a school project.
Maybe you should instead read up on how to use FFmpeg (http://www.ffmpeg.org/) to cut and append video files.
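For illustration only, a rough sketch of what that could look like from Python by calling the ffmpeg command line tool (the piece names and the one-second cut length are made up; check the FFmpeg documentation for the exact options your build supports):

import subprocess

# Cut a one-second, independently playable piece out of the source file.
subprocess.check_call([
    "ffmpeg", "-i", "yui.mpeg", "-ss", "0", "-t", "1", "-c", "copy", "yui_1.mpeg",
])

# Re-join selected pieces with FFmpeg's concat demuxer.
with open("list.txt", "w") as listing:
    for name in ["yui_1.mpeg", "yui_2.mpeg", "yui_3.mpeg"]:
        listing.write("file '%s'\n" % name)

subprocess.check_call([
    "ffmpeg", "-f", "concat", "-safe", "0", "-i", "list.txt", "-c", "copy", "output.mpeg",
])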
Good luck on your project.
