Persisting a Large scipy.sparse.csr_matrix - python

I have a very large sparse scipy matrix. Attempting to use save_npz resulted in the following error:
>>> sp.save_npz('/projects/BIGmatrix.npz',W)
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/numpy/lib/npyio.py", line 716, in _savez
pickle_kwargs=pickle_kwargs)
File "/usr/local/lib/python3.5/dist-packages/numpy/lib/format.py", line 597, in write_array
array.tofile(fp)
OSError: 6257005295 requested and 3283815408 written
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.5/dist-packages/scipy/sparse/_matrix_io.py", line 78, in save_npz
np.savez_compressed(file, **arrays_dict)
File "/usr/local/lib/python3.5/dist-packages/numpy/lib/npyio.py", line 659, in savez_compressed
_savez(file, args, kwds, True)
File "/usr/local/lib/python3.5/dist-packages/numpy/lib/npyio.py", line 721, in _savez
raise IOError("Failed to write to %s: %s" % (tmpfile, exc))
OSError: Failed to write to /projects/BIGmatrix.npzg6ub_z3y-numpy.npy: 6257005295 requested and 3283815408 written
As such, I wanted to try persisting it to Postgres via psycopg2, but I haven't found a way to iterate over all of the nonzeros so that I can store them as rows in a table.
What is the best way to handle this task?
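For the Postgres route, the nonzero entries can be walked by converting the matrix to COO format; below is a minimal sketch, in which the stand-in matrix and the batching strategy are assumptions rather than part of the original question.
from scipy import sparse

# Stand-in for the large CSR matrix W from the question
W = sparse.random(1000, 1000, density=0.01, format='csr')

# COO format exposes the nonzeros as three parallel arrays
coo = W.tocoo()
for r, c, v in zip(coo.row, coo.col, coo.data):
    # each (r, c, v) triple could become one table row,
    # e.g. batched through psycopg2's executemany or COPY
    pass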

Save the data, indices, indptr and shape attributes of the matrix, and recreate the csr_matrix from them when loading:
from scipy import sparse
import numpy as np
# Build a random sparse matrix for demonstration
a = np.zeros((1000, 2000))
a[np.random.randint(0, 1000, 100), np.random.randint(0, 2000, 100)] = np.random.randn(100)
b = sparse.csr_matrix(a)
# Save the raw CSR components of the matrix
np.savez("tmp", data=b.data, indices=b.indices, indptr=b.indptr, shape=np.array(b.shape))
# Reload and rebuild the matrix from those components
f = np.load("tmp.npz")
b2 = sparse.csr_matrix((f["data"], f["indices"], f["indptr"]), shape=f["shape"])
(b != b2).sum()  # 0 -> every entry survived the round trip

It seems the way this works is:
When you invoke scipy.sparse.save_npz(), it saves a compressed file by default; to do so, it first writes a temporary uncompressed version of the target file, which it then compresses into the final result. This means whatever drive you save to needs to be large enough to hold the uncompressed temp file, which in my case was 47 GB.
I retried the save on a larger drive and the process completed without incident.
Note: The compression can take quite a long time.
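As a side note, if the compression time is the main concern, scipy.sparse.save_npz also accepts compressed=False, which writes a larger .npz but skips the slow compression pass; a minimal sketch with a hypothetical path:
from scipy import sparse

W = sparse.random(1000, 1000, density=0.01, format='csr')  # stand-in matrix

# compressed=False trades file size for speed by skipping the compression step
sparse.save_npz('/projects/BIGmatrix_uncompressed.npz', W, compressed=False)
W2 = sparse.load_npz('/projects/BIGmatrix_uncompressed.npz')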

Related

Python Pillow Image to PDF and then merging memory issues

Goal:
Convert a finite number of .jpg files and merge them into one PDF file.
Expected result:
Files from the folder are converted and merged into one PDF file at the specified location.
Problem:
When the total size of the files exceeds a certain threshold (around 400 MB in my tests), the program crashes with the following message:
Traceback (most recent call last):
File "C:\Users\kaczk\AppData\Local\Programs\Python\Python38-32\lib\site-packages\PIL\ImageFile.py", line 498, in _save
fh = fp.fileno()
io.UnsupportedOperation: fileno
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "MakePDF.py", line 10, in <module>
im1.save(pdf1_filename, "PDF" ,resolution=1000.0, save_all=True, append_images=imageList)
File "C:\Users\kaczk\AppData\Local\Programs\Python\Python38-32\lib\site-packages\PIL\Image.py", line 2084, in save
save_handler(self, fp, filename)
File "C:\Users\kaczk\AppData\Local\Programs\Python\Python38-32\lib\site-packages\PIL\PdfImagePlugin.py", line 46, in _save_all
_save(im, fp, filename, save_all=True)
File "C:\Users\kaczk\AppData\Local\Programs\Python\Python38-32\lib\site-packages\PIL\PdfImagePlugin.py", line 175, in _save
Image.SAVE["JPEG"](im, op, filename)
File "C:\Users\kaczk\AppData\Local\Programs\Python\Python38-32\lib\site-packages\PIL\JpegImagePlugin.py", line 770, in _save
ImageFile._save(im, fp, [("jpeg", (0, 0) + im.size, 0, rawmode)], bufsize)
File "C:\Users\kaczk\AppData\Local\Programs\Python\Python38-32\lib\site-packages\PIL\ImageFile.py", line 513, in _save
fp.write(d)
MemoryError
After running the program while watching Task Manager, I noticed that the computer does indeed run out of RAM while executing it. Below is the code used.
import os
from PIL import Image
fileList = os.listdir(r'C:\location\of\photos\folder')
imageList = []
im1 = Image.open(os.path.join(r'C:\location\of\photos\folder',fileList[0]))
for file in fileList[1:]:
    imageList.append(Image.open(os.path.join(r'C:\location\of\photos\folder', file)))
pdf1_filename = r'C:\location\of\pdf\destination.pdf'
im1.save(pdf1_filename, "PDF" ,resolution=500.0, save_all=True, append_images=imageList)
Is there an easy mistake I am making here regarding memory usage? Is there a different module that would make the task easier when working with more and larger files? I will be very grateful for any help.
This question is quite old, but since I got here struggling with the same issue, here is an answer.
You simply have to close your images after using them:
im1.close()
for i in imageList:
    i.close()
This solved it for me.
PS: take a look at glob; it makes working with paths a lot easier.
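For completeness, a minimal sketch combining the question's script with the fix (the paths are the placeholders from the question):
import os
from PIL import Image

folder = r'C:\location\of\photos\folder'
fileList = os.listdir(folder)

im1 = Image.open(os.path.join(folder, fileList[0]))
imageList = [Image.open(os.path.join(folder, f)) for f in fileList[1:]]

pdf1_filename = r'C:\location\of\pdf\destination.pdf'
im1.save(pdf1_filename, "PDF", resolution=500.0, save_all=True, append_images=imageList)

# Release the image buffers once the PDF has been written
im1.close()
for im in imageList:
    im.close()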

Read very long array from mat with scipy

I have a result file from Dymola (.mat v4) which stores all variables in one huge 1D array (more or less 2 GB of data in a single array...). I can't do anything about the file format, as we are bound to use Dymola. When trying to read the file using scipy (with Python 2.7.13, 64-bit), I get the following error:
C:\Users\...\scipy\io\matlab\mio4.py:352: RuntimeWarning: overflow encountered in long_scalars
  remaining_bytes = hdr.dtype.itemsize * n
C:\...\scipy\io\matlab\mio4.py:172: RuntimeWarning: overflow encountered in long_scalars
  num_bytes *= d
Traceback (most recent call last):
File
...
self.mat = scipy.io.loadmat(fileName, chars_as_strings=False)
File "C:\...\scipy\io\matlab\mio.py", line 136, in loadmat
matfile_dict = MR.get_variables(variable_names)
File "C:\...\scipy\io\matlab\mio4.py", line 399, in get_variables
mdict[name] = self.read_var_array(hdr)
File "C:\...\scipy\io\matlab\mio4.py", line 374, in read_var_array
return self._matrix_reader.array_from_header(header, process)
File "C:\...\scipy\io\matlab\mio4.py", line 137, in array_from_header
arr = self.read_full_array(hdr)
File "C:\...\scipy\io\matlab\mio4.py", line 207, in read_full_array
return self.read_sub_array(hdr)
File "C:\...\scipy\io\matlab\mio4.py", line 178, in read_sub_array
"`variable_names` kwarg to `loadmat`" % hdr.name)
ValueError: Not enough bytes to read matrix 'data_2'; is this a badly-formed
file? Consider listing matrices with `whosmat` and loading named matrices with `variable_names` kwarg to `loadmat`
The error/problem is pretty clear to me. My question: Are there any workarounds? Can I still read the file and get the data? Is it possible to split the array while reading it?
I suggest turning on conversion to the SDF file format, which is based on HDF5 and handles large files better. See Simulation/Setup in Dymola.
Alternatively, you can reduce the number of variables stored in the file using Variable Selections in Dymola.
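If the result is exported as SDF (HDF5), it can then be read incrementally with h5py; a minimal sketch, where the file name and dataset path are hypothetical:
import h5py

with h5py.File('result.sdf', 'r') as f:
    f.visit(print)                   # list the groups/datasets in the file
    # signal = f['/some/signal'][:]  # read one signal into a numpy array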

IOError: encoder zip not available ubuntu

I am getting the following error when I try to run my Python program, which uses PIL.
Generate_Dot.py:14: RuntimeWarning: the frombuffer defaults may change in a future release; for portability, change the call to read:
frombuffer(mode, size, data, 'raw', mode, 0, 1)
img = Image.frombuffer('L', size, data)
Traceback (most recent call last):
File "Generate_Dot.py", line 15, in <module>
img.save('image.png')
File "/home/kapil/python/lib/python2.7/site-packages/PIL-1.1.7-py2.7-linux-i686.egg/PIL/Image.py", line 1439, in save
save_handler(self, fp, filename)
File "/home/kapil/python/lib/python2.7/site-packages/PIL-1.1.7-py2.7-linux-i686.egg/PIL/PngImagePlugin.py", line 572, in _save
ImageFile._save(im, _idat(fp, chunk), [("zip", (0,0)+im.size, 0, rawmode)])
File "/home/kapil/python/lib/python2.7/site-packages/PIL-1.1.7-py2.7-linux-i686.egg/PIL/ImageFile.py", line 481, in _save
e = Image._getencoder(im.mode, e, a, im.encoderconfig)
File "/home/kapil/python/lib/python2.7/site-packages/PIL-1.1.7-py2.7-linux-i686.egg/PIL/Image.py", line 401, in _getencoder
raise IOError("encoder %s not available" % encoder_name)
IOError: encoder zip not available
PIL's compiled extensions depend on various third-party libraries being installed on your system. In this case the 'zip' encoder, which PIL uses for PNG (zlib) compression, requires zlib. Install the zlib development package (zlib1g-dev with apt-get, or zlib-devel with yum) and then rebuild/reinstall PIL so it is compiled with zlib support.
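As a quick check on newer Pillow versions (the features module did not exist in PIL 1.1.7, so treat this as an assumption for old installs), you can verify whether the zip/zlib codec was compiled in:
from PIL import features

# True only if PIL/Pillow was built against zlib, i.e. the "zip" PNG encoder exists
print(features.check("zlib"))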

numpy loading file error

I tried to load a .npy file created by numpy:
import numpy as np
F = np.load('file.npy')
And numpy raises this error:
C:\Miniconda3\lib\site-packages\numpy\lib\npyio.py in load(file,
mmap_mode)
379 N = len(format.MAGIC_PREFIX)
380 magic = fid.read(N)
--> 381 fid.seek(-N, 1) # back-up
382 if magic.startswith(_ZIP_PREFIX):
383 # zip-file (assume .npz)
OSError: [Errno 22] Invalid argument
Could anyone explain to me what this means? How can I recover my file?
You are using a file object that does not support the seek method; note that the file parameter of numpy.load must support seek. My guess is that the file you are loading is still held open by another file object that was opened elsewhere and never closed:
>>> f = open('test.npy', 'wb') # file remains open after this line
>>> np.load('test.npy') # numpy now wants to use the same file
# but cannot apply `seek` to the file opened elsewhere
Traceback (most recent call last):
File "<pyshell#114>", line 1, in <module>
np.load('test.npy')
File "C:\Python27\lib\site-packages\numpy\lib\npyio.py", line 370, in load
fid.seek(-N, 1) # back-up
IOError: [Errno 22] Invalid argument
Note that I receive the same error you did. If you have an open file object, close it before using np.load; likewise, make sure the handle you pass to np.save is closed before you try to load the file again.
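A minimal sketch of the safe pattern, making sure the handle used for saving is closed before the file is loaded again:
import numpy as np

arr = np.arange(10)

# Writing inside a `with` block guarantees the handle is closed afterwards
with open('test.npy', 'wb') as f:
    np.save(f, arr)

# np.load can now open and seek the file without trouble
loaded = np.load('test.npy')
print(loaded)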

Pickle: Reading a dictionary, EOFError

I recently found out about pickle, which is amazing, but it errors out when used in my actual script; testing it with a one-item dictionary worked fine. My real script is thousands of lines of code that stores various Maya objects into the dictionary. I do not know if the size has anything to do with it; I have read a lot of threads here, but none are specific to my error.
I have tried writing with every pickle protocol. No luck.
This is my output code:
output = open('locatorsDump.pkl', 'wb')
pickle.dump(l.locators, output, -1)
output.close()
This is my read code:
jntdump = open('locatorsDump.pkl', 'rb')
test = pickle.load(jntdump)
jntdump.close()
This is the error:
# Error: Error in maya.utils._guiExceptHook:
# File "C:\Program Files\Autodesk\Maya2011\Python\lib\site-packages\pymel-1.0.0-py2.6.egg\maya\utils.py", line 277, in formatGuiException
# exceptionMsg = excLines[-1].split(':',1)[1].strip()
# IndexError: list index out of range
#
# Original exception was:
# Traceback (most recent call last):
# File "<maya console>", line 3, in <module>
# File "C:\Program Files\Autodesk\Maya2011\bin\python26.zip\pickle.py", line 1370, in load
# return Unpickler(file).load()
# File "C:\Program Files\Autodesk\Maya2011\bin\python26.zip\pickle.py", line 858, in load
# dispatch[key](self)
# File "C:\Program Files\Autodesk\Maya2011\bin\python26.zip\pickle.py", line 880, in load_eof
# raise EOFError
# EOFError #
Try using pickle.dumps() and pickle.loads() as a test.
If you don't receive the same error, you know the problem is related to the file write.
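A minimal sketch of that test, using a stand-in dictionary in place of l.locators:
import pickle

data = {'example_locator': (0.0, 1.0, 2.0)}  # stand-in for l.locators

# In-memory round trip: if this also fails, the data itself is the problem
blob = pickle.dumps(data, -1)
assert pickle.loads(blob) == data

# File round trip: if only this fails, look at how the file is written and read
with open('locatorsDump.pkl', 'wb') as output:
    pickle.dump(data, output, -1)
with open('locatorsDump.pkl', 'rb') as jntdump:
    assert pickle.load(jntdump) == data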
