I'm sort of a Python noob, so bear with me. I have a set of custom classes, each of which basically wraps and adds some functionality to an image file that's been converted to a numpy.ndarray. Since it takes about 2 minutes to create all these objects each time the script is run, I was hoping to create a list of them and pickle that list. The pickling seems to go well; the unpickling fails.
This is all I'm doing:
Pickling
frame_jar_file = open(os.path.join(asset_path, "frame_jar.pkl"), "wb+")  # binary mode for pickle data
for x in range(1, 500):
    path = os.path.join(img_path, "{0}.jpg".format(str(x).zfill(8)))
    surface = NumpySurface(path)
    self.scene_surfaces.append(surface)
frame_jar = cPickle.Pickler(frame_jar_file, -1)  # have tried this with no protocol arg as well
frame_jar.dump(self.scene_surfaces)
frame_jar_file.close()
exit()
This produces a file about 2 GB in size, which seems about right to me given the data.
Unpickling
self.scene_surfaces = cPickle.Unpickler(os.path.join(asset_path, "frame_jar.pkl"))
Provokes this error:
TypeError: argument must have 'read' and 'readline' attributes
You need to pass in an open file object, not the filename:
with open(os.path.join(asset_path, "frame_jar.pkl"), 'rb') as infh:
    unpickler = cPickle.Unpickler(infh)
    self.scene_surfaces = unpickler.load()
I also assumed you wanted to load the data, not just create an unpickler.
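For symmetry with the load above, here is a minimal sketch of the dump side using binary mode and a context manager; NumpySurface, img_path and asset_path are the names from the question, and collecting into a local list instead of self.scene_surfaces is just for illustration.

import os
import cPickle

surfaces = []
for x in range(1, 500):
    path = os.path.join(img_path, "{0}.jpg".format(str(x).zfill(8)))
    surfaces.append(NumpySurface(path))

# Dump the whole list in one call, using the highest binary protocol.
with open(os.path.join(asset_path, "frame_jar.pkl"), "wb") as outfh:
    cPickle.dump(surfaces, outfh, cPickle.HIGHEST_PROTOCOL)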
Related
I would like to load a LightGBM model from a string or buffer rather than a file on disk.
It seems that there is a method called model_from_string (documentation link), but it produces an error, which seemingly defeats the purpose of the method as I understand it.
import boto3
import lightgbm as lgb
import io

model_path = 'some/path/here'
s3_bucket = boto3.resource('s3').Bucket('some-bucket')
obj = s3_bucket.Object(model_path)
buf = io.BytesIO()
try:
    obj.download_fileobj(buf)
except Exception as e:
    raise e
else:
    model = lgb.Booster().model_from_string(buf.read().decode("UTF-8"))
which produces the following error:
TypeError: Need at least one training dataset or model file to create booster instance
Alternatively, I thought that I might be able to use the regular loading method
lgb.Booster(model_file=buf.read().decode("UTF-8"))
... but this also doesn't work.
FileNotFoundError: [Errno 2] No such file or directory: ''
Now, I realize that I can create a workaround by writing the buffer to disk, and then reading it. However, this feels very redundant and inefficient.
Thus, my question is: how can I instantiate a model to use for prediction without pointing to an actual file on disk?
It seems that there is an undocumented parameter model_str which can be used to initialize the lgb.Booster object.
model = lgb.Booster({'model_str': buf.read().decode("UTF-8")})
Source: https://github.com/Microsoft/LightGBM/issues/2097#issuecomment-482332232
Credit goes to Nikita Titov aka StrikerRUS on GitHub.
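For completeness, a minimal usage sketch: buf is the BytesIO from the question, while the buf.seek(0) rewind and the random feature matrix are my own assumptions added for illustration.

import numpy as np
import lightgbm as lgb

buf.seek(0)  # rewind after download_fileobj so read() returns the full model text
model = lgb.Booster({'model_str': buf.read().decode("UTF-8")})

X = np.random.rand(5, model.num_feature())  # dummy feature matrix, illustration only
preds = model.predict(X)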
The problem
My application is extracting a list of zip files in memory and writing the data to a temporary file. I then memory map the data in the temp file for use in another function. When I do this in a single process it works fine: reading the data doesn't affect memory, and max RAM stays around 40 MB. However, when I do this using concurrent.futures, RAM goes up to 500 MB.
I have looked at this example and I understand I could be submitting the jobs in a nicer way to save memory during processing. But I don't think my issue is related, as I am not running out of memory during processing. The issue I don't understand is why it is holding onto the memory even after the memory maps are returned. Nor do I understand what is in the memory, since doing this in a single process does not load the data in memory.
Can anyone explain what is actually in the memory and why this is different between single and parallel processing?
PS: I used memory_profiler to measure the memory usage.
Code
Main code:
import concurrent.futures
import os
import tempfile
import time
from zipfile import ZipFile, ZIP_DEFLATED

import numpy as np

def main():
    datadir = './testdata'
    files = os.listdir('./testdata')
    files = [os.path.join(datadir, f) for f in files]
    datalist = download_files(files, multiprocess=False)
    print(len(datalist))
    time.sleep(15)
    del datalist  # See here that memory is freed up
    time.sleep(15)
Other functions:
def download_files(filelist, multiprocess=False):
    datalist = []
    if multiprocess:
        with concurrent.futures.ProcessPoolExecutor(max_workers=4) as executor:
            returned_future = [executor.submit(extract_file, f) for f in filelist]
        for future in returned_future:
            datalist.append(future.result())
    else:
        for f in filelist:
            datalist.append(extract_file(f))
    return datalist
def extract_file(input_zip):
    buffer = next(iter(extract_zip(input_zip).values()))
    with tempfile.NamedTemporaryFile() as temp_logfile:
        temp_logfile.write(buffer)
        del buffer
        data = np.memmap(temp_logfile, dtype='float32', shape=(2000000, 4), mode='r')
    return data
def extract_zip(input_zip):
    with ZipFile(input_zip, 'r') as input_zip:
        return {name: input_zip.read(name) for name in input_zip.namelist()}
Helper code for data
I can't share my actual data, but here's some simple code to create files that demonstrate the issue:
for i in range(1, 16):
    outdir = './testdata'
    outfile = 'file_{}.dat'.format(i)
    fp = np.memmap(os.path.join(outdir, outfile), dtype='float32', mode='w+', shape=(2000000, 4))
    fp[:] = np.random.rand(*fp.shape)
    del fp
    with ZipFile(outdir + '/' + outfile[:-4] + '.zip', mode='w', compression=ZIP_DEFLATED) as z:
        z.write(outdir + '/' + outfile, outfile)
The problem is that you're trying to pass an np.memmap between processes, and that doesn't work.
The simplest solution is to instead pass the filename, and have the child process memmap the same file.
When you pass an argument to a child process or pool method via multiprocessing, or return a value from one (including doing so indirectly via a ProcessPoolExecutor), it works by calling pickle.dumps on the value, passing the pickle across processes (the details vary, but it doesn't matter whether it's a Pipe or a Queue or something else), and then unpickling the result on the other side.
A memmap is basically just an mmap object with an ndarray allocated in the mmapped memory.
And Python doesn't know how to pickle an mmap object. (If you try, you will either get a PicklingError or a BrokenProcessPool error, depending on your Python version.)
A np.memmap can be pickled, because it's just a subclass of np.ndarray—but pickling and unpickling it actually copies the data and gives you a plain in-memory array. (If you look at data._mmap, it's None.) It would probably be nicer if it gave you an error instead of silently copying all of your data (the pickle-replacement library dill does exactly that: TypeError: can't pickle mmap.mmap objects), but it doesn't.
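A tiny demonstration of that copy-on-pickle behaviour; 'test.dat' is just an assumed scratch file name.

import pickle
import numpy as np

m = np.memmap('test.dat', dtype='float32', mode='w+', shape=(10,))
roundtripped = pickle.loads(pickle.dumps(m))
print(type(roundtripped))   # still reported as numpy.memmap
print(roundtripped._mmap)   # None: the data now lives in ordinary memory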
It's not impossible to pass the underlying file descriptor between processes—the details are different on every platform, but all of the major platforms have a way to do that. And you could then use the passed fd to build an mmap on the receiving side, then build a memmap out of that. And you could probably even wrap this up in a subclass of np.memmap. But I suspect if that weren't somewhat difficult, someone would have already done it, and in fact it would probably already be part of dill, if not numpy itself.
Another alternative is to explicitly use the shared memory features of multiprocessing, and allocate the array in shared memory instead of a mmap.
But the simplest solution is, as I said at the top, to just pass the filename instead of the object, and let each side memmap the same file. This does, unfortunately, mean you can't just use a delete-on-close NamedTemporaryFile (although the way you were using it was already non-portable and wouldn't have worked on Windows the same way it does on Unix), but changing that is still probably less work than the other alternatives.
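A minimal sketch of that filename-passing approach, reusing extract_zip from the question; the delete=False temp file and leaving cleanup to the parent process are my own assumptions.

import concurrent.futures
import tempfile
import numpy as np

def extract_file_to_disk(input_zip):
    buffer = next(iter(extract_zip(input_zip).values()))
    # Keep the temp file after close so the parent can memmap it by name;
    # the parent is then responsible for deleting it when done.
    with tempfile.NamedTemporaryFile(delete=False) as temp_logfile:
        temp_logfile.write(buffer)
        return temp_logfile.name

def download_files(filelist):
    datalist = []
    with concurrent.futures.ProcessPoolExecutor(max_workers=4) as executor:
        futures = [executor.submit(extract_file_to_disk, f) for f in filelist]
        for future in futures:
            path = future.result()
            # Only the filename string crosses the process boundary; the parent
            # maps the file itself, so no array data is pickled or copied.
            datalist.append(np.memmap(path, dtype='float32',
                                      shape=(2000000, 4), mode='r'))
    return datalist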
I have an .obj file in which I previously stored an image: I converted it to base64 and saved the result with pickle.
The problem occurs when I try to load the .obj file with pickle, decode the base64 string back into image data, and load it with pygame.
The function that loads the image:
def mainDisplay_load(self):
    main_folder = path.dirname(__file__)
    img_main_folder = path.join(main_folder, "sd_graphics")
    # loadImg
    self.mainTerminal = pg.image.load(path.join(img_main_folder, self.main_uncode("tr.obj"))).convert_alpha()
The function that decodes the file:
def main_uncode(self, object):
    openFile = open(object, "rb")
    str = pickle.load(openFile)
    openFile.close()
    fileData = base64.b64decode(str)
    return fileData
The error I get when the code is run:
str = pickle.load(openFile)
EOFError: Ran out of input
How can I fix it?
Python version: 3.6.2
Pygame version: 1.9.3
Update 1
This is the code I used to create the .obj file:
import base64, pickle

with open("terminal.png", "rb") as imageFile:
    str = base64.b64encode(imageFile.read())
    print(str)

file_pi = open("tr.obj", "wb")
pickle.dump(str, file_pi)
file_pi.close()

file_pi2 = open("tr.obj", "rb")
str2 = pickle.load(file_pi2)
file_pi2.close()

imgdata = base64.b64decode(str2)
filename = 'some_image.jpg'  # I assume you have a way of picking unique filenames
with open(filename, 'wb') as f:
    f.write(imgdata)
Once the file is created, it is loaded back and a second image is written out, to check whether the image is the same or whether there are errors in the conversion.
As you can see, I reused part of that code to load the image in the game, but instead of saving it to disk it is loaded into pygame, and that's where the error occurs.
Update 2
I finally managed to solve it.
In the main code:
def mainDisplay_load(self):
    self.all_graphics = pg.sprite.Group()
    self.graphics_menu = pg.sprite.Group()
    # loadImg
    self.img_mainTerminal = mainGraphics(self, 0, 0, "sd_graphics/tr.obj")
In the library containing graphics classes:
import pygame as pg
import base64 as bs
import pickle as pk
from io import BytesIO as by
from lib.setting import *

class mainGraphics(pg.sprite.Sprite):
    def __init__(self, game, x, y, object):
        self.groups = game.all_graphics, game.graphics_menu
        pg.sprite.Sprite.__init__(self, self.groups)
        self.game = game
        self.object = object
        self.outputGraphics = by()
        self.x = x
        self.y = y
        self.eventType()
        self.rect = self.image.get_rect()
        self.rect.x = self.x * tilesizeDefault
        self.rect.y = self.y * tilesizeDefault

    def eventType(self):
        openFile = open(self.object, "rb")
        str = pk.load(openFile)
        openFile.close()
        self.outputGraphics.write(bs.b64decode(str))
        self.outputGraphics.seek(0)
        self.image = pg.image.load(self.outputGraphics).convert_alpha()
For the question of why I should do such a thing, it is simple:
any attacker with sufficient motivation can still get to it easily
Python is free and open.
On the one hand, we have a person who intentionally sets out to modify and recover the hidden data. But since Python is an open language, just as with more complicated and protected languages, the most motivated people are able to crack the game or program and retrieve the same data.
On the other hand, we have a person who knows only the basics, or not even that: a person who cannot access the files without knowing more about the language or how the files are encoded.
So you can understand that, from my point of view, the encoding of the files does not need to protect against a motivated person, because even with a more complex and protected language that person would be able to get what he wants. The protection is aimed at people who have no knowledge of the language.
So, if the error you get is indeed pickle's "Ran out of input", that probably means you have mixed up your directories in the code above and are trying to read an empty file with the same name as your .obj file.
Actually, as it is, this line in your code:
self.mainTerminal = pg.image.load(path.join(img_main_folder, self.main_uncode("tr.obj"))).convert_alpha()
is completely messed up. Just read it and you can see the problem: you are passing the main_uncode method just the file name, without any directory information. And then, even if that had by chance worked, as I pointed out in the comments a while ago, you would be trying to use the unserialized image data as a filename from which to read your image. (You or someone else probably intended main_uncode to create a temporary image file and write the image data to it so that Pygame could read it, but as it is, it just returns the raw image data as a string.)
Therefore, fixing the above call to pass an actual path to main_uncode, and further modifying main_uncode to write the decoded data to a temporary file and return that file's path, would fix the snippets of code above.
The second thing is that I can't figure out why you need this ".obj" file at all. If it is just for "security through obscurity", hoping that people who get your bundled files can't open the images, that is far from a recommended practice. To sum it up in one sentence: it will hinder legitimate uses of your file (you yourself do not seem to be able to use it), while any attacker with sufficient motivation can still get to it easily. By opening an image, base64-encoding it, and pickling it, then doing the reverse, you are essentially performing a no-operation. What's more, pickle exists to serialize complex Python objects to disk, but a base64 serialization of an image can be written directly to a file, with no need for pickle.
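To make that last point concrete, here is a minimal sketch that skips pickle entirely and hands the decoded bytes straight to pygame through a BytesIO; the tr.b64 filename is just an illustration.

import base64
from io import BytesIO
import pygame as pg

# Saving: write the base64 text directly to a plain file, no pickle involved.
with open("terminal.png", "rb") as img, open("tr.b64", "wb") as out:
    out.write(base64.b64encode(img.read()))

# Loading: decode and feed the raw bytes to pygame.image.load.
with open("tr.b64", "rb") as f:
    raw = base64.b64decode(f.read())
surface = pg.image.load(BytesIO(raw), "terminal.png")  # namehint helps pygame detect the format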
Third thing: just use with to open all your files, not only the ones you read with the imaging library. Take your time to learn a little bit more about Python.
Is this a bug? The following script demonstrates what happens when you use libtiff to extract an image from an open TIFF file handle. It works in Python 2.x and does not work in Python 3.2.3.
import os

# any file will work here, since it's not actually loading the tiff
# assuming it's big enough for the seek
filename = "/home/kostrom/git/wiredfool-pillow/Tests/images/multipage.tiff"

def test():
    fp1 = open(filename, "rb")
    buf1 = fp1.read(8)
    fp1.seek(28)
    fp1.read(2)

    for x in range(16):
        fp1.read(12)
    fp1.read(4)

    fd = os.dup(fp1.fileno())
    os.lseek(fd, 28, os.SEEK_SET)
    os.close(fd)

    # this magically fixes it: fp1.tell()
    fp1.seek(284)
    expect_284 = fp1.tell()
    print("expected 284, actual %d" % expect_284)

test()
The output which I feel is in error is:
expected 284, actual -504
Uncommenting the fp1.tell() has some side effect which stabilizes the Python 3 handle, and I don't know why. I'd also appreciate it if someone could test other versions of Python 3.
No, this is not a bug. The Python 3 io library, which provides you with the file object from an open() call, gives you a buffered file object. For binary files, you are given a (subclass of) io.BufferedIOBase.
The Python 2 file object is far more primitive, although you can use the io library there too.
By seeking at the OS level you are bypassing the buffer and are mucking up the internal state. Generally speaking, as the doctor said to the patient complaining that pinching his skin hurts: don't do that.
If you have a pressing need to do this anyway, at the very least use the underlying raw file object (a subclass of the io.RawIOBase class) via the io.BufferedIOBase.raw attribute:
fp1 = open(filename, "rb").raw
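A minimal sketch applying that to the question's test(): seek and tell on the raw object go straight to the OS, so the external lseek on the duplicated descriptor cannot desynchronize a cached position.

import os

def test_raw(filename):
    fp1 = open(filename, "rb").raw   # io.FileIO, no buffering layer
    fp1.read(8)
    fd = os.dup(fp1.fileno())
    os.lseek(fd, 28, os.SEEK_SET)    # moves the shared file position
    os.close(fd)
    fp1.seek(284)
    print("expected 284, actual %d" % fp1.tell())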
os.dup creates a duplicate file descriptor that refers to the same open file description. Therefore, os.lseek(fd, 28, os.SEEK_SET) changes the seek position of the file underlying fp1.
Python's file objects cache the file position to avoid repeated system calls. The side effect of this is that changing the file position without using the file object methods will desynchronize the cached position and the real position, leading to nonsense like you've observed.
Worse yet, because the files are internally buffered by Python, seeking outside the file methods could actually cause the returned file data to be incorrect, leading to corruption or other nasty stuff.
The documentation in bufferedio.c notes that tell can be used to reinitialize the cached value:
* The absolute position of the raw stream is cached, if possible, in the
`abs_pos` member. It must be updated every time an operation is done
on the raw stream. If not sure, it can be reinitialized by calling
_buffered_raw_tell(), which queries the raw stream (_buffered_raw_seek()
also does it). To read it, use RAW_TELL().
I've written a small cryptographic module in Python whose task is to encrypt a file and put the result in a tarfile. The original file to encrypt can be quite large, but that's not a problem because my program only needs to work with a small block of data at a time, which can be encrypted on the fly and stored.
I'm looking for a way to avoid doing it in two passes, first writing all the data to a temporary file and then inserting the result into the tarfile.
Basically I do the following (where generator_encryptor is a simple generator that yields chunks of data read from the source file):
t = tarfile.open("target.tar", "w")
tmp = file('content', 'wb')
for chunk in generator_encryptor("sourcefile"):
tmp.write(chunks)
tmp.close()
t.add(content)
t.close()
I'm a bit annoyed at having to use a temporary file, as I feel it should be easy to write blocks directly into the tar file. But collecting every chunk into a single string and using something like t.addfile('content', StringIO(bigcipheredstring)) seems excluded, because I can't guarantee that I have enough memory to hold bigcipheredstring.
Any hints on how to do that?
You can create your own file-like object and pass it to TarFile.addfile. Your file-like object will generate the encrypted contents on the fly in its read() method.
Huh? Can't you just use the subprocess module to run a pipe through to tar? That way, no temporary file should be needed. Of course, this won't work if you can't generate your data in small enough chunks to fit in RAM, but if you have that problem, then tar isn't the issue.
Basically, using a file-like object and passing it to TarFile.addfile does the trick, but there are still some open issues:
I need to know the full encrypted file size at the beginning.
The way tarfile calls the read method is such that the custom file-like object must always return full read buffers, or tarfile assumes it has hit the end of the file. This leads to some really inefficient buffer copying in the code of the read method, but it's either that or changing the tarfile module.
The resulting code is below. Basically, I had to write a wrapper class that transforms my existing generator into a file-like object. I also added the GeneratorEncryptor class to my example to make the code complete. You can notice it has a __len__ method that returns the length of the written file (but understand it's just a dummy placeholder that does nothing useful).
import tarfile

class GeneratorEncryptor(object):
    """Dummy class for testing purposes

    The real one performs on-the-fly encryption of the source file.
    """
    def __init__(self, source):
        self.source = source
        self.BLOCKSIZE = 1024
        self.NBBLOCKS = 1000

    def __call__(self):
        for c in range(0, self.NBBLOCKS):
            yield self.BLOCKSIZE * str(c % 10)

    def __len__(self):
        return self.BLOCKSIZE * self.NBBLOCKS

class GeneratorToFile(object):
    """Transform a data generator into a conventional file handle
    """
    def __init__(self, generator):
        self.buf = ''
        self.generator = generator()

    def read(self, size):
        # Accumulate chunks until a full buffer of `size` bytes is available,
        # because tarfile treats a short read as end of file.
        chunk = self.buf
        while len(chunk) < size:
            try:
                chunk = chunk + self.generator.next()
            except StopIteration:
                self.buf = ''
                return chunk
        self.buf = chunk[size:]
        return chunk[:size]

t = tarfile.open("target.tar", "w")
tmp = file('content', 'wb')  # creates an empty placeholder so gettarinfo can stat it
generator = GeneratorEncryptor("source")
ti = t.gettarinfo(name="content")
ti.size = len(generator)
t.addfile(ti, fileobj=GeneratorToFile(generator))
t.close()
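For readers on Python 3, here is a minimal sketch of the same idea using bytes buffers; it builds the TarInfo by hand, so no empty placeholder file is needed on disk. The dummy generator and the sizes stand in for the real encryptor.

import tarfile
import time

BLOCKSIZE, NBBLOCKS = 1024, 1000

def generator():
    for c in range(NBBLOCKS):
        yield (str(c % 10) * BLOCKSIZE).encode()

class GeneratorToFile:
    """Wrap a byte-chunk generator in a minimal read(size) interface."""
    def __init__(self, gen):
        self.buf = b''
        self.gen = gen

    def read(self, size):
        # Return full buffers of `size` bytes until the generator is exhausted.
        chunk = self.buf
        while len(chunk) < size:
            try:
                chunk += next(self.gen)
            except StopIteration:
                self.buf = b''
                return chunk
        self.buf = chunk[size:]
        return chunk[:size]

with tarfile.open("target.tar", "w") as t:
    ti = tarfile.TarInfo(name="content")
    ti.size = BLOCKSIZE * NBBLOCKS   # must be known before writing
    ti.mtime = time.time()
    t.addfile(ti, fileobj=GeneratorToFile(generator()))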
I guess you need to understand how the tar format works, and handle the tar writing yourself. Maybe this can be helpful?
http://mail.python.org/pipermail/python-list/2001-August/100796.html