Delete / Insert Data in mmap'ed File - python

I am working on a script in Python that maps a file for processing using mmap().
The task requires me to change the file's contents by:
Replacing data
Adding data into the file at an offset
Removing data from within the file (not just blanking it out)
Replacing data works great as long as the old data and the new data have the same number of bytes:
VDATA = mmap.mmap(f.fileno(),0)
start = 10
end = 20
VDATA[start:end] = "0123456789"
However, when I try to remove data (replacing the range with "") or inserting data (replacing the range with contents longer than the range), I receive the error message:
IndexError: mmap slice assignment is wrong size
This makes sense.
The question now is, how can I insert and delete data from the mmap'ed file?
From reading the documentation, it seems I can move the file's entire contents back and forth using a chain of low-level actions but I'd rather avoid this if there is an easier solution.

Lacking an alternative, I went ahead and wrote two helper functions - deleteFromMmap() and insertIntoMmap() - to handle the low-level file actions and ease development.
Closing and reopening the mmap instead of using resize() is due to a bug in Python on Unix derivatives that causes resize() to fail. (http://mail.python.org/pipermail/python-bugs-list/2003-May/017446.html)
The functions are included in a complete example.
The use of a global is due to the format of the main project but you can easily adapt it to match your coding standards.
import mmap
# f contains "0000111122223333444455556666777788889999"
f = open("data","r+")
VDATA = mmap.mmap(f.fileno(),0)
def deleteFromMmap(start, end):
    global VDATA
    length = end - start
    size = len(VDATA)
    newsize = size - length
    VDATA.move(start, end, size - end)
    VDATA.flush()
    VDATA.close()
    f.truncate(newsize)
    VDATA = mmap.mmap(f.fileno(), 0)

def insertIntoMmap(offset, data):
    global VDATA
    length = len(data)
    size = len(VDATA)
    newsize = size + length
    VDATA.flush()
    VDATA.close()
    f.seek(size)
    f.write("A" * length)
    f.flush()
    VDATA = mmap.mmap(f.fileno(), 0)
    VDATA.move(offset + length, offset, size - offset)
    VDATA.seek(offset)
    VDATA.write(data)
    VDATA.flush()
deleteFromMmap(4,8)
# -> 000022223333444455556666777788889999
insertIntoMmap(4,"AAAA")
# -> 0000AAAA22223333444455556666777788889999

There is no way to shift contents of a file (be it mmap'ed or plain) without doing it explicitly. In the case of a mmap'ed file, you'll have to use the mmap.move method.
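For reference, mmap.move(dest, src, count) copies count bytes from offset src to offset dest within the map. A minimal sketch of deleting a byte range this way (the filename and offsets are illustrative, not from the question):
import mmap

with open("data", "r+b") as f:
    m = mmap.mmap(f.fileno(), 0)
    start, end = 4, 8                  # byte range to remove
    size = len(m)
    m.move(start, end, size - end)     # shift the tail left over the gap
    m.flush()
    m.close()
    f.truncate(size - (end - start))   # shrink the file to its new size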

Related

Reading binary file. Translate matlab to python

I want to translate working MATLAB code for reading a binary file into Python. Is there an equivalent for the following?
% open the file for reading
fid=fopen (filename,'rb','ieee-le');
% first read the signature
tmp=fread(fid,2,'char');
% read sizes
rows=fread(fid,1,'ushort');
cols=fread(fid,1,'ushort');
There's the struct module for that, specifically the unpack function, which accepts a buffer; you'll have to read the required size from the input using struct.calcsize.
import struct

endian = "<"  # little endian
with open(filename, 'rb') as f:
    tmp = struct.unpack(f"{endian}cc", f.read(struct.calcsize("cc")))
    tmp_int = [int.from_bytes(x, byteorder="little") for x in tmp]
    rows = struct.unpack(f"{endian}H", f.read(struct.calcsize("H")))[0]
    cols = struct.unpack(f"{endian}H", f.read(struct.calcsize("H")))[0]
You might want to use the struct.Struct class for reading the rest of the data in chunks, as it is going to be faster than decoding numbers one at a time. For example:
data = []
reader = struct.Struct(endian + "i" * cols)  # "i" for integer
row_size = reader.size
for row_number in range(rows):
    row = reader.unpack(f.read(row_size))
    data.append(row)
Edit: corrected the answer and added an example for larger chunks.
Edit 2: one more improvement. Assuming we are reading a 1 GB file of shorts, storing them as Python ints makes no sense and will most likely give an out-of-memory error (or the system will freeze); the proper way to do it is with numpy:
import numpy as np
data = np.fromfile(f,dtype=endian+'H').reshape(cols,rows) # ushorts
This way it takes the same space in memory as it does on disk.
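If the file has more data after the matrix, or if you need to match MATLAB's column-major ordering, a hedged variant of the same read (reusing rows, cols, endian and f from the snippets above):
# count limits the read to exactly rows*cols values;
# order='F' mirrors MATLAB's column-major fill from fread
data = np.fromfile(f, dtype=endian + 'H', count=rows * cols)
data = data.reshape((rows, cols), order='F')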

What's the fastest way to save/load a large collection (list/set) of strings in Python 3.6?

The file is 5 GB.
I did find a similar question on Stack Overflow where people suggest the use of a numpy array, but I suppose that solution would be applicable to a collection of numbers, not strings.
Would there be anything that beats eval(list.txt) or importing a Python file with a variable set to the list?
What is the most efficient way to load/save a Python list of strings?
For the read-only case:
import numpy as np

class IndexedBlob:
    def __init__(self, filename):
        index_filename = filename + '.index'
        blob = np.memmap(filename, mode='r')
        try:
            # if there is an existing index
            indices = np.memmap(index_filename, dtype='>i8', mode='r')
        except FileNotFoundError:
            # else, create it
            indices, = np.where(blob == ord('\n'))
            # force dtype to predictable file
            indices = np.array(indices, dtype='>i8')
            with open(index_filename, 'wb') as f:
                # add a virtual newline
                np.array(-1, dtype='>i8').tofile(f)
                indices.tofile(f)
            # then reopen it as a file to reduce memory
            # (and also pick up that -1 we added)
            indices = np.memmap(index_filename, dtype='>i8', mode='r')
        self.blob = blob
        self.indices = indices

    def __getitem__(self, line):
        assert line >= 0
        lo = self.indices[line] + 1
        hi = self.indices[line + 1]
        return self.blob[lo:hi].tobytes().decode()
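Usage might look like this (a sketch; 'strings.txt' is an assumed newline-separated file):
blob = IndexedBlob('strings.txt')
print(blob[0])      # first line
print(blob[12345])  # random access by line number without loading the whole file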
Some additional notes:
Adding new strings at the end (by simply opening the file in append mode and writing a line - but beware of previously broken writes) is easy - but remember to manually update the index file too. Note that you'll need to re-mmap if you want existing IndexedBlob objects to see the new data. You could avoid that and simply keep a list of "loose" objects.
By design, if the last line is missing a newline, it is ignored (to detect truncation or concurrent writing).
You could significantly shrink the size of the index by only storing every nth newline, then doing a linear search at lookup time. I found this not worth it, however.
If you use separate indices for the start and end, you are no longer constrained to storing the strings in order, which opens up several possibilities for mutation. But if mutation is rare, rewriting the whole file and regenerating the index isn't too expensive.
Consider using '\0' as your separator instead of '\n'.
And of course:
General concurrent mutation is hard no matter what you do. If you need to do anything complicated, use a real database: it is the simplest solution at that point.

How to read only part of a list of strings in python

I need to find a way to be able to read x bytes of data from a list containing strings. Each item in the list is ~36MB. I need to be able to run through each item in the list, but only grabbing about ~1KB of that item at a time.
Essentially it looks like this:
for item in list:
    # grab part of item
    # do something with that part
    # move on to the next part, until you've gone through the whole item
My current code (which kind of works, but seems to be rather slow and inefficient) is such:
for character in bucket:
    print character
    packet = "".join(character)
    if(len(packet.encode("utf8")) >= packetSizeBytes):
        print "Bytes: " + str(len(packet.encode("utf8")))
        return packet
I'm wondering if there exists anything like f.read(bufSize), but for strings.
Not sure if it's relevant, but for more context this is what I'm doing:
I'm reading data from a very large file (several GB) into much smaller (and more manageable) chunks. I chunk the file using f.read(chunkSize) and store those as buckets. However, even those buckets are still too large for what I ultimately need to do with the data, so I want to grab only parts of a bucket at a time.
Originally, I bypassed the whole bucket thing, and just chunked the file into chunks that were small enough for my purposes. However, this led to me having to chunk the file hundreds of thousands of times, which got kind of slow. My hope now is to be able to have buckets queued up so that while I'm doing something with one bucket, I can begin reading from others. If any of this sounds confusing, let me know and I'll try to clarify.
Thanks
If you're using str (or bytes in Python 3), each character is a byte, so f.read(5) on a file corresponds to s[:5] on one of your strings. If you want just the first 5 bytes from every string in a list, you could do
[s[:5] for s in buckets]
But be aware that this is making a copy of all those strings. It would be more memory efficient to take just the data that you want as you're reading it, rather than create a bunch of intermediary lists, then send that data to another thread to process it and continue reading the file.
import threading

def worker(chunk):
    # do stuff with chunk
    ...

def main():
    with open('file', 'r') as f:
        bucket = f.read(500)
        while bucket:
            chunk = bucket[:5]
            thread = threading.Thread(target=worker, args=(chunk,))
            thread.start()
            bucket = f.read(500)
Please check the speed of this if you want to modify the input list in place.
l = []  # your list
x = 0
processed = 0
while processed != len(l):
    if l[x]:  # skip items that are already exhausted
        bts = l[x][:1024]
        l[x] = l[x][1024:]
        # do something with bts
        if not l[x]:
            processed += 1
    x += 1
    if x == len(l):
        x = 0
Some servers use this method for buffering, but string operations become slow once strings grow past a certain size. So the best approach is to have a list of lists, already truncated to about one KB each at the point of creation, as sketched below.
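A rough sketch of that pre-chunking idea (buckets is assumed to be your list of large strings; the names are illustrative):
CHUNK = 1024

def pre_chunk(buckets):
    # split each large string into a list of ~1 KB slices up front
    return [[b[i:i + CHUNK] for i in range(0, len(b), CHUNK)] for b in buckets]

for pieces in pre_chunk(buckets):
    for piece in pieces:
        pass  # do something with each ~1 KB piece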

Flask: Get the size of request.files object

I want to get the size of uploading image to control if it is greater than max file upload limit. I tried this one:
@app.route("/new/photo", methods=["POST"])
def newPhoto():
    form_photo = request.files['post-photo']
    print form_photo.content_length
It printed 0. What am I doing wrong? Should I find the size of this image from the temp path of it? Is there anything like PHP's $_FILES['foo']['size'] in Python?
There are a few things to be aware of here - the content_length property will be the content length of the file upload as reported by the browser, but unfortunately many browsers don't send this, as noted in the docs and source.
Also note that file uploads under 500 KB are stored in memory as a StringIO object rather than spooled to disk (see those docs again), so trying to stat the file object directly will fail with a TypeError.
MAX_CONTENT_LENGTH is the correct way to reject file uploads larger than you want, and if you need it, the only reliable way to determine the length of the data is to figure it out after you've handled the upload - either stat the file after you've .save()d it:
request.files['file'].save('/tmp/foo')
size = os.stat('/tmp/foo').st_size
Or if you're not using the disk (for example storing it in a database), count the bytes you've read:
blob = request.files['file'].read()
size = len(blob)
Though obviously be careful that you're not reading too much data into memory if your MAX_CONTENT_LENGTH is very large.
If you don't want to save the file to disk first, use the following code; it works on an in-memory stream.
import os
file = request.files['file']
# os.SEEK_END == 2
# seek() return the new absolute position
file_length = file.seek(0, os.SEEK_END)
# also can use tell() to get current position
# file_length = file.tell()
# seek back to start position of stream,
# otherwise save() will write a 0 byte file
# os.SEEK_SET == 0
file.seek(0, os.SEEK_SET)
Otherwise, this will be better:
request.files['file'].save('/tmp/file')
file_length = os.stat('/tmp/file').st_size
The proper way to set a max file upload limit is via the MAX_CONTENT_LENGTH app configuration. For example, if you wanted to set an upload limit of 16 megabytes, you would do the following to your app configuration:
app.config['MAX_CONTENT_LENGTH'] = 16 * 1024 * 1024
If the uploaded file is too large, Flask will automatically return status code 413 Request Entity Too Large - this should be handled on the client side.
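If you'd rather return a friendlier response than the default 413 page, one option (a sketch, assuming app is your Flask instance) is a Flask error handler:
from flask import jsonify

@app.errorhandler(413)
def too_large(e):
    # runs when an upload exceeds MAX_CONTENT_LENGTH
    return jsonify(error="File is too large"), 413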
The following snippet should meet your purpose:
form_photo.seek(0,2)
size = form_photo.tell()
As someone else already suggested, you should use the
app.config['MAX_CONTENT_LENGTH']
to restrict file sizes. But since you specifically want to find out the image size, you can do:
import os
photo_size = os.stat(request.files['post-photo']).st_size
print photo_size
You can use popen from the os module.
Save it first:
photo=request.files['post-photo']
photo.save('tmp')
Now just get the size:
os.popen('ls -l tmp | cut -d " " -f5').read()
This is in bytes.
For megabytes or gigabytes, use --block-size=M or --block-size=G.

get specific content from file python

I have a file test.txt which has an array:
array = [3,5,6,7,9,6,4,3,2,1,3,4,5,6,7,8,5,3,3,44,5,6,6,7]
Now what I want to do is get the contents of the array and perform some calculations with it. But the problem is that when I do open("test.txt") I get the contents as a string. The array is very big, and if I parse it in a loop it might not be efficient. Is there any way to get the contents without splitting on ','? Any ideas?
I recommend that you save the file as JSON instead and read it in with the json module. Either that, or make it a .py file and import it as Python. A .txt file that looks like a Python assignment is kind of odd.
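For example, a minimal JSON round trip might look like this (the filename is illustrative):
import json

with open("test.json", "w") as f:
    json.dump([3, 5, 6, 7, 9, 6, 4], f)  # write the list out

with open("test.json") as f:
    array = json.load(f)                  # read it back as a real Python list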
Does your text file need to look like python syntax? A list of comma separated values would be the usual way to provide data:
1,2,3,4,5
Then you could read/write with the csv module or the numpy functions mentioned in other answers. There's a lot of documentation about how to read CSV data efficiently. Once you have your csv reader object set up, data could be stored with something like:
data = [map(float, row) for row in csvreader]
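Setting up that reader might look something like this (a sketch; 'data.csv' is an assumed filename):
import csv

with open("data.csv", newline="") as f:
    csvreader = csv.reader(f)
    data = [[float(value) for value in row] for row in csvreader]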
If you want to store a python-like expression in a file, store only the expression (i.e. without array =) and parse it using ast.literal_eval().
However, consider using a different format such as JSON. Depending on the calculations you might also want to consider using a format where you do not need to load all data into memory at once.
Must the array be saved as a string? Could you use a pickle file and save it as a Python list?
If not, could you try lazy evaluation? Maybe only process sections of the array as needed.
Possibly, if there are calculations on the entire array that you must always do, it might be a good idea to pre-compute those results and store them in the txt file either in addition to the list or instead of the list.
You could also use numpy to load the data from the file using numpy.genfromtxt or numpy.loadtxt. Both are pretty fast and both have the ability to do the recasting on load. If the array is already loaded though, you can use numpy to convert it to an array of floats, and that is really fast.
import numpy as np
a = np.array(["1", "2", "3", "4"])
a = a.astype(float)
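Reading directly from a comma-separated file might look like this (a sketch that assumes the file contains only the numbers, without the 'array =' prefix):
import numpy as np

# parses comma-separated numbers straight into a float array
a = np.loadtxt("test.txt", delimiter=",")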
You could write a parser. Parsers are very straightforward, and much faster than regular expressions (please don't use those - not that anyone suggested it).
# open up the file (r = read-only, b = binary)
stream = open("file_full_of_numbers.txt", "rb")
prefix = ''  # end of the last chunk
full_number_list = []
# get a chunk of the file at a time
while True:
    # just a small 1k chunk
    buffer = stream.read(1024)
    # no more data is left in the file
    if '' == buffer:
        break
    # delimit this chunk of data by a comma
    split_result = buffer.split(",")
    # append the end of the last chunk to the first number
    split_result[0] = prefix + split_result[0]
    # save the end of the buffer (a partial number perhaps) for the next loop
    prefix = split_result[-1]
    # only work with full results, so skip the last one
    numbers = split_result[0:-1]
    # do something with the numbers we got (like save it into a full list)
    full_number_list += numbers
# now full_number_list contains all the numbers in text format
You'll also have to add some logic to use the prefix when the buffer is blank. But I'll leave that code up to you.
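For reference, one way that last bit could look (a sketch assuming the file does not end with a trailing comma):
# after the loop: whatever is left in prefix is the final, complete number
if prefix:
    full_number_list.append(prefix)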
OK, so the following methods ARE dangerous. Since they can be used to attack systems by injecting code, use them at your own risk.
array = eval(open("test.txt", 'r').read().strip('array = '))
execfile('test.txt') # this is the fastest but most dangerous.
Safer methods.
import ast
array = ast.literal_eval(open("test.txt", 'r').read().strip('array = '))
...
array = [float(value) for value in open('test.txt', 'r').read().strip('array = [').strip('\n]').split(',')]
The easiest way to serialize Python objects so you can load them later is to use pickle - assuming you don't need a human-readable format, since that adds major overhead; otherwise, CSV is fast and JSON is flexible.
import pickle
import random
array = random.sample(range(10**3), 20)
pickle.dump(array, open('test.obj', 'wb'))
loaded_array = pickle.load(open('test.obj', 'rb'))
assert array == loaded_array
pickle does have some overhead, and if you need to serialize large objects you can specify the pickle protocol; the default in Python 2 is protocol 0, the oldest ASCII-based format, and setting it to pickle.HIGHEST_PROTOCOL gives a more compact binary format: pickle.dump(array, open('test.obj', 'wb'), pickle.HIGHEST_PROTOCOL)
If you are working with large numerical or scientific data sets, use numpy.tofile/numpy.fromfile or scipy.io.savemat/scipy.io.loadmat; they have little overhead, but again only if you are already using numpy/scipy.
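A quick sketch of the numpy round trip (the filename is illustrative; note that tofile/fromfile store raw values only, so you must supply the dtype again on load):
import numpy as np

a = np.arange(20, dtype=np.int64)
a.tofile("test.bin")                         # raw bytes, no header
b = np.fromfile("test.bin", dtype=np.int64)  # dtype must match what was written
assert (a == b).all()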
good luck.
