XOR-ing a large file in Python

I am trying to apply an XOR operation to a number of files, some of which are very large.
Basically I am reading a file and XOR-ing it byte by byte (or at least that is what I think I'm doing). When it hits a larger file (around 70 MB) I get an out-of-memory error and my script crashes.
My computer has 16 GB of RAM with more than 50% of it available, so I would not attribute this to my hardware.
def xor3(source_file, target_file):
    b = bytearray(open(source_file, 'rb').read())
    for i in range(len(b)):
        b[i] ^= 0x71
    open(target_file, 'wb').write(b)
I tried to read the file in chunks, but it seems I'm too inexperienced for this, as the output is not the desired one. The first function returns what I want, of course :)
def xor(data):
    b = bytearray(data)
    for i in range(len(b)):
        b[i] ^= 0x41
    return data
def xor4(source_file, target_file):
    with open(source_file, 'rb') as ifile:
        with open(target_file, 'w+b') as ofile:
            data = ifile.read(1024*1024)
            while data:
                ofile.write(xor(data))
                data = ifile.read(1024*1024)
What is the appropriate solution for this kind of operation? What is it that I am doing wrong?

Read the file in chunks and append each processed chunk to the output file (read() already advances the file position, so an explicit seek is not needed):
CHUNK_SIZE = 1000  # for example
with open(source_file, 'rb') as source:
    with open(target_file, 'ab') as target:
        while True:
            data = bytearray(source.read(CHUNK_SIZE))
            if not data:
                break
            for i in range(len(data)):
                data[i] ^= 0x71
            target.write(data)

Unless I am mistaken, in your second example you create a copy of data by calling bytearray and assigning it to b. You then modify b but return data; the modifications made to b have no effect on data itself.
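A minimal fix, keeping the shape of the helper from the question (the key value is whatever you actually use), is to return the modified copy instead of the original:
def xor(data, key=0x41):
    b = bytearray(data)  # mutable copy of the input bytes
    for i in range(len(b)):
        b[i] ^= key
    return bytes(b)  # return the modified copy, not data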

Iterate lazily over the large file.
from operator import xor
from functools import partial

def chunked(file, chunk_size):
    return iter(lambda: file.read(chunk_size), b'')

myoperation = partial(xor, 0x71)

with open(source_file, 'rb') as source, open(target_file, 'ab') as target:
    processed = (map(myoperation, bytearray(data)) for data in chunked(source, 65536))
    for data in processed:
        target.write(bytearray(data))

This probably only works in Python 2, which shows again how much nicer Python 2 is for byte streams:
def xor(infile, outfile, val=0x71, chunk=1024):
    with open(infile, 'r') as inf:
        with open(outfile, 'w') as outf:
            c = inf.read(chunk)
            while c != '':
                s = "".join([chr(ord(cc) ^ val) for cc in c])
                outf.write(s)
                c = inf.read(chunk)
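For Python 3, here is a sketch of the same task (the function name is my own) using bytes.translate with a precomputed 256-entry table, which avoids the per-byte Python loop entirely:
def xor_file(source_file, target_file, key=0x71, chunk_size=1024 * 1024):
    table = bytes(b ^ key for b in range(256))  # table[b] == b ^ key
    with open(source_file, 'rb') as src, open(target_file, 'wb') as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            dst.write(chunk.translate(table))  # translate() runs the mapping in C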

Related

How can I read a .txt file letter by letter in Python? [duplicate]

In Python, given the name of a file, how can I write a loop that reads one character each time through the loop?
with open(filename) as f:
    while True:
        c = f.read(1)
        if not c:
            print("End of file")
            break
        print("Read a character:", c)
First, open a file:
with open("filename") as fileobj:
for line in fileobj:
for ch in line:
print(ch)
This goes through every line in the file and then every character in that line.
I like the accepted answer: it is straightforward and will get the job done. I would also like to offer an alternative implementation:
def chunks(filename, buffer_size=4096):
    """Reads `filename` in chunks of `buffer_size` bytes and yields each chunk
    until no more characters can be read; the last chunk will most likely have
    less than `buffer_size` bytes.

    :param str filename: Path to the file
    :param int buffer_size: Buffer size, in bytes (default is 4096)
    :return: Yields chunks of `buffer_size` size until exhausting the file
    :rtype: str
    """
    with open(filename, "rb") as fp:
        chunk = fp.read(buffer_size)
        while chunk:
            yield chunk
            chunk = fp.read(buffer_size)

def chars(filename, buffersize=4096):
    """Yields the contents of file `filename` character-by-character. Warning:
    will only work for encodings where one character is encoded as one byte.

    :param str filename: Path to the file
    :param int buffersize: Buffer size for the underlying chunks,
        in bytes (default is 4096)
    :return: Yields the contents of `filename` character-by-character.
    :rtype: char
    """
    for chunk in chunks(filename, buffersize):
        for char in chunk:
            yield char

def main(buffersize, filenames):
    """Reads several files character by character and redirects their contents
    to `/dev/null`.
    """
    for filename in filenames:
        with open("/dev/null", "wb") as fp:
            for char in chars(filename, buffersize):
                fp.write(char)

if __name__ == "__main__":
    # Try reading several files varying the buffer size
    import sys
    buffersize = int(sys.argv[1])
    filenames = sys.argv[2:]
    sys.exit(main(buffersize, filenames))
The code I suggest is essentially the same idea as your accepted answer: read a given number of bytes from the file. The difference is that it first reads a good chunk of data (4096 is a good default for x86, but you may want to try 1024 or 8192; any multiple of your page size), and then yields the characters in that chunk one by one.
The code I present may be faster for larger files. Take, for example, the entire text of War and Peace, by Tolstoy. These are my timing results (MacBook Pro using OS X 10.7.4; so.py is the name I gave to the code I pasted):
$ time python so.py 1 2600.txt.utf-8
python so.py 1 2600.txt.utf-8 3.79s user 0.01s system 99% cpu 3.808 total
$ time python so.py 4096 2600.txt.utf-8
python so.py 4096 2600.txt.utf-8 1.31s user 0.01s system 99% cpu 1.318 total
Now: do not take the buffer size at 4096 as a universal truth; look at the results I get for different sizes (buffer size (bytes) vs wall time (sec)):
2 2.726
4 1.948
8 1.693
16 1.534
32 1.525
64 1.398
128 1.432
256 1.377
512 1.347
1024 1.442
2048 1.316
4096 1.318
As you can see, gains appear even with fairly small buffer sizes (and my timings are likely quite inaccurate); the buffer size is a trade-off between performance and memory. The default of 4096 is just a reasonable choice but, as always, measure first.
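A minimal sketch of how this kind of measurement could be reproduced from within Python (the file name and the buffer sizes are assumptions; chars() is the generator defined above):
import time
for size in (1, 64, 1024, 4096):
    start = time.perf_counter()
    n = sum(1 for _ in chars("2600.txt.utf-8", size))  # count characters read
    print(size, n, round(time.perf_counter() - start, 3))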
Just:
myfile = open(filename)
onecharacter = myfile.read(1)
Python itself can help you with this, in interactive mode:
>>> help(file.read)
Help on method_descriptor:

read(...)
    read([size]) -> read at most size bytes, returned as a string.

    If the size argument is negative or omitted, read until EOF is reached.
    Notice that when in non-blocking mode, less data than what was requested
    may be returned, even if no size parameter was given.
I learned a new idiom for this today while watching Raymond Hettinger's Transforming Code into Beautiful, Idiomatic Python:
import functools
with open(filename) as f:
    f_read_ch = functools.partial(f.read, 1)
    for ch in iter(f_read_ch, ''):
        print 'Read a character:', repr(ch)
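For Python 3, a sketch of the same idiom with print as a function:
import functools
with open(filename) as f:
    for ch in iter(functools.partial(f.read, 1), ''):
        print('Read a character:', repr(ch))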
Just read a single character
f.read(1)
This will also work:
with open("filename") as fileObj:
for line in fileObj:
for ch in line:
print(ch)
It goes through every line in the the file and every character in every line.
(Note that this post now looks extremely similar to a highly upvoted answer, but this was not the case at the time of writing.)
Best answer for Python 3.8+:
with open(path, encoding="utf-8") as f:
while c := f.read(1):
do_my_thing(c)
You may want to specify utf-8 and avoid the platform encoding. I've chosen to do that here.
Function – Python 3.8+:
def stream_file_chars(path: str):
    with open(path) as f:
        while c := f.read(1):
            yield c
Function – Python<=3.7:
def stream_file_chars(path: str):
    with open(path, encoding="utf-8") as f:
        while True:
            c = f.read(1)
            if c == "":
                break
            yield c
Function – pathlib + documentation:
from pathlib import Path
from typing import Union, Generator
def stream_file_chars(path: Union[str, Path]) -> Generator[str, None, None]:
    """Streams characters from a file."""
    with Path(path).open(encoding="utf-8") as f:
        while (c := f.read(1)) != "":
            yield c
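Usage is the same for all three variants; for example (the file name here is just a placeholder):
for c in stream_file_chars("example.txt"):
    print(c, end="")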
You should try f.read(1), which is definitely correct and the right thing to do.
f = open('hi.txt', 'w')
f.write('0123456789abcdef')
f.close()
f = open('hej.txt', 'r')
f.seek(12)
print f.read(1) # This will read just "c"
As a supplement: if you are reading a file that contains a very long line, which might exhaust your memory, you can read it into a buffer and then yield each character:
def read_char(inputfile, buffersize=10240):
    with open(inputfile, 'r') as f:
        while True:
            buf = f.read(buffersize)
            if not buf:
                break
            for char in buf:
                yield char
        yield ''  # handle the case where the file is empty

if __name__ == "__main__":
    for char in read_char('./very_large_file.txt'):
        process(char)  # process() stands for whatever you do with each character
os.system("stty -icanon -echo")
while True:
raw_c = sys.stdin.buffer.peek()
c = sys.stdin.read(1)
print(f"Char: {c}")
Combining qualities of some other answers, here is something that is invulnerable to long files / lines, while being more succinct and faster:
import functools as ft, itertools as it
with open(path) as f:
    for c in it.chain.from_iterable(
        iter(ft.partial(f.read, 4096), '')
    ):
        print(c)
# reading out the file at once in a list and then printing one-by-one
f = open('file.txt')
for i in list(f.read()):
    print(i)

How to read from one file and write to several other files

I have a file containing several images. The images are chopped up into packets, which I call chunks in my code example. Every chunk contains a header with: count, uniqueID, start, length. Start contains the start index of the img_data within the chunk and length is the length of the img_data within the chunk. Count runs from 0 to 255, and the img_data of all 256 chunks combined forms one image.
Before reading the chunks I open a 'dummy.bin' file to have something to write to, otherwise I get that f is not defined. At the end I remove the 'dummy.bin' file. The problem is that I need a file reference to start with. Although this code works, I wonder if there is another way than creating a dummy file to get a file reference.
The first chunk in 'test_file.bin' has hdr['count'] == 0, so f.close() will be called in the first iteration. That is why I need to have a file reference f before entering the for loop. Apart from that, every iteration I write img_data to a file with f.write(img_data); here too I need a file reference defined prior to entering the for loop, in case the first chunk has hdr['count'] != 0.
Is this the best solution? How do you generally read from a file and create several other files from it? (A sketch of one alternative follows the code below.)
# read file, write several other files
import os

def read_chunks(filename, chunksize=512):
    f = open(filename, 'rb')
    while True:
        chunk = f.read(chunksize)
        if chunk:
            yield chunk
        else:
            break

def parse_header(data):
    count = data[0]
    uniqueID = data[1]
    start = data[2]
    length = data[3]
    return {'count': count, 'uniqueID': uniqueID, 'start': start, 'length': length}

filename = 'test_file.bin'
f = open('dummy.bin', 'wb')
for chunk in read_chunks(filename):
    hdr = parse_header(chunk)
    if hdr['count'] == 0:
        f.close()
        img_filename = 'img_' + str(hdr['uniqueID']) + '.raw'
        f = open(img_filename, 'wb')
    img_data = chunk[hdr['start']: hdr['start'] + hdr['length']]
    f.write(img_data)
    print(type(f))
f.close()
os.remove('dummy.bin')
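As a sketch of one alternative (my own, not from the original post): track the current output handle with None and only write once a real image file has been opened, which removes the need for 'dummy.bin':
f = None
for chunk in read_chunks(filename):
    hdr = parse_header(chunk)
    if hdr['count'] == 0:
        if f is not None:
            f.close()
        f = open('img_' + str(hdr['uniqueID']) + '.raw', 'wb')
    if f is not None:  # skip data until the first chunk with count == 0
        f.write(chunk[hdr['start']: hdr['start'] + hdr['length']])
if f is not None:
    f.close()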

Read a large big-endian binary file

I have a very large big-endian binary file. I know how many numbers are in this file. I found a solution for reading a big-endian file using struct, and it works perfectly if the file is small:
data = []
file = open('some_file.dat', 'rb')
for i in range(0, numcount):
    data.append(struct.unpack('>f', file.read(4))[0])
But this code works very slowly if the file size is more than ~100 MB.
My current file is 1.5 GB and contains 399,513,600 float numbers. The above code takes about 8 minutes on this file.
I found another solution that works faster:
datafile = open('some_file.dat', 'rb').read()
f_len = ">" + "f" * numcount #numcount = 399513600
numbers = struct.unpack(f_len, datafile)
This code runs in about 1.5 minutes, but this is still too slow for me. Earlier I wrote the same functionality in Fortran, and it ran in about 10 seconds.
In Fortran I open the file with the "big-endian" flag and can simply read the file into a REAL array without any conversion, but in Python I have to read the file as a string and convert every 4 bytes into a float using struct. Is it possible to make the program run faster?
You can use numpy.fromfile to read the file and specify that the type is big-endian by passing > in the dtype parameter:
numpy.fromfile(filename, dtype='>f')
There is an array.fromfile method too, but unfortunately I cannot see any way to control endianness with it, so depending on your use case it might let you avoid the third-party dependency or it might be useless.
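That said, here is a sketch of how array could still handle the endianness by byteswapping after the read (numcount and the file name come from the question):
import array, sys
arr = array.array('f')
with open('some_file.dat', 'rb') as f:
    arr.fromfile(f, numcount)  # read numcount raw 4-byte floats
if sys.byteorder == 'little':
    arr.byteswap()  # convert the big-endian file data to native order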
The following approach gave a good speed up for me:
import struct
import random
import time

block_size = 4096
start = time.time()
with open('some_file.dat', 'rb') as f_input:
    data = []
    while True:
        block = f_input.read(block_size * 4)
        data.extend(struct.unpack('>{}f'.format(len(block) / 4), block))
        if len(block) < block_size * 4:
            break

print "Time taken: {:.2f}".format(time.time() - start)
print "Length", len(data)
Rather than using >fffffff you can specify a count, e.g. >1000f. The code reads the file 4096 floats (16384 bytes) at a time; if the amount read is less than that, it adjusts the unpack count to what was actually read and exits the loop.
From the struct - Format Characters documentation:
A format character may be preceded by an integral repeat count. For
example, the format string '4h' means exactly the same as 'hhhh'.
def read_big_endian(filename):
    all_text = ""
    with open(filename, "rb") as template:
        try:
            template.read(2)  # first 2 bytes are FF FE
            while True:
                dchar = template.read(2)
                all_text += dchar[0]
        except:
            pass
    return all_text

def save_big_endian(filename, text):
    with open(filename, "wb") as fic:
        fic.write(chr(255) + chr(254))  # first 2 bytes are FF FE
        for letter in text:
            fic.write(letter + chr(0))
I used this to read .rdp files.

logical "OR" between bin files

I'm trying to write a script that takes a list of files and performs a "logical OR" between them. As you can see in the script, at the first stage I'm creating an empty append_buffer. Then I want to OR all the files in the list together.
My problem is that when I read the files I get a str and not a bytearray, so when I tried to perform the OR it failed. I have tried to convert it without any success.
import struct
# import sys, ast

buffera = bytearray()
append_buffer = bytearray()
output_buffer = bytearray()
files_list = ['E:\out.jpg', 'E:\loala2.jpg', 'E:\Koala.jpg', 'E:\loala2.jpg']
print(files_list[1])
################################################################################
# create_dummy_bin_file_for_first_iteration, based on first file size
temp_file = open(files_list[1], "rb")
print(temp_file)
buffera = temp_file.read(temp_file.__sizeof__())
temp_file.close()
for x in range(0, len(buffera)):
    append_buffer.append(0x00)
################################################################################
for i in range(1, len(files_list)):
    print(files_list[i])
    file = open(files_list[i], "rb")
    file_buffer = file.read(file.__sizeof__())
    file.close()
    if len(file_buffer) != len(append_buffer):
        print("Can't merge different size bin files ")
        exit(1)
    else:
        for x in range(0, len(buffera)):
            or_data = (file_buffer[x] | append_buffer[x])
            print("---")
            print(type(file_buffer[x]))
            print(file_buffer[x])
            print("---")
            print(type(append_buffer[x]))
            print(append_buffer[x])
outputfile = open(files_list[0], "wb")
outputfile.write(output_buffer)
outputfile.close()
You can use the ord and chr built-in functions to convert each character to an integer and back.
Using this, your code would be:
or_data = chr(ord(file_buffer[x]) | ord(append_buffer[x]))
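Note that in Python 3, indexing a bytes object already yields integers, so the two buffers could be OR-ed directly without ord and chr; a sketch:
or_bytes = bytes(a | b for a, b in zip(file_buffer, append_buffer))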
This code example does all the work in memory.
# Read data from the first file
with open("file1.txt", "rt") as f:
    d1 = f.read()

# Read data from the second file
with open("file2.txt", "rt") as f:
    d2 = f.read()

# Make sure that both sizes are equal
assert len(d1) == len(d2)

# Calculate OR-ed data
d3 = "".join(chr(ord(d1[i]) | ord(d2[i])) for i in range(len(d1)))

# Write the output data
with open("file3.txt", "wt") as f:
    f.write(d3)
It's possible to process the data byte by byte too, in order to reduce memory consumption.
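A rough chunked sketch of that idea (my own addition, reusing the file names from the example above):
CHUNK = 64 * 1024
with open("file1.txt", "rb") as f1, open("file2.txt", "rb") as f2, open("file3.txt", "wb") as out:
    while True:
        b1 = f1.read(CHUNK)
        b2 = f2.read(CHUNK)
        if not b1 and not b2:
            break
        assert len(b1) == len(b2)  # the files must have equal length
        out.write(bytes(a | b for a, b in zip(b1, b2)))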

What is the best data structure to store chunks of file in python

I want to store chunks of a file in a list so that later on I can perform some operations with the map function on each chunk. Intuitively I am tempted to do something like below (but it doesn't work):
fi = open(fileName, "rb")
data = fi.read()
fi.close()
max = len(data)
block = 1024
tmp = []
for i in range(0, max, block):
    tmp.append(data[i:i+block])
I'd suggest reading the file in chunks in the first place:
block = 1024
with open(fileName, 'rb') as f:
    tmp = [chunk for chunk in iter(lambda: f.read(block), b'')]
See the documentation for iter().
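To then apply an operation to each chunk with map, as the question mentions (the XOR transform here is only an illustrative assumption):
def transform(chunk):
    return bytes(b ^ 0x71 for b in chunk)  # example per-chunk operation
results = list(map(transform, tmp))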
