So I'm trying to convert a few hundred .svs files of 120 MB to 2 GB each (high-resolution microscopy images at 40x magnification) to JPEG tiles, and I keep getting a MemoryError at the step where my script allocates a large amount of data in RAM. Do I simply need more RAM, or are there other tricks I can try?
My laptop has a 64-bit i3 processor and 4 GB of RAM (usually about 3.8 GB available). I run Python 3.8.2. Even when I input only one of the smaller slides at a time, this error occurs (see below), so smaller batches don't seem to be the answer. Downsampling would reduce the scanning resolution, so that is not an option.
output_jpeg_tiles('D:/TCGA slides/TCGA-O1-A52J-01A.svs', 'D:/JPEG tiles')
converting D:/TCGA slides/TCGA-O1-A52J-01A.svs with width 66640 and height 25155
Traceback (most recent call last):
File "<pyshell#29>", line 1, in <module>
output_jpeg_tiles('D:/TCGA slides/TCGA-O1-A52J-01A.svs', 'D:/JPEG tiles')
File "<pyshell#23>", line 22, in output_jpeg_tiles
patch = img.read_region((begin_x, begin_y), 0, (patch_width, patch_height))
File "C:\Users\...\site-packages\openslide\__init__.py", line 222, in read_region
return lowlevel.read_region(self._osr, location[0], location[1],
File "C:\Users\...\site-packages\openslide\lowlevel.py", line 258, in read_region
buf = (w * h * c_uint32)()
MemoryError
Any suggestions? :)
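For context, here is a stripped-down sketch of the step that fails. The tile size, coordinates, and output file name below are placeholders rather than my exact code; the point is that read_region() at level 0 returns an RGBA PIL image, so each patch needs roughly width * height * 4 bytes of RAM before it can be saved:

import openslide

img = openslide.OpenSlide('D:/TCGA slides/TCGA-O1-A52J-01A.svs')

# Placeholder tile size and position; my script computes these in a loop.
patch_width, patch_height = 2048, 2048
begin_x, begin_y = 0, 0

# read_region() returns an RGBA PIL.Image covering the requested area at level 0.
patch = img.read_region((begin_x, begin_y), 0, (patch_width, patch_height))
patch.convert('RGB').save('D:/JPEG tiles/tile_0_0.jpg', quality=90)
img.close()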
Related
I made a simple script in Python (3.7) that classifies a satellite image, but it can only classify a clip of the satellite image. When I try to classify the whole satellite image, it returns this:
Traceback (most recent call last):
File "v0-3.py", line 219, in classification_tool
File "sklearn\cluster\k_means_.py", line 972, in fit
File "sklearn\cluster\k_means_.py", line 312, in k_means
File "sklearn\utils\validation.py", line 496, in check_array
File "numpy\core\_asarray.py", line 85, in asarray
ValueError: array is too big; `arr.size * arr.dtype.itemsize` is larger than the maximum possible size.
I tried using MiniBatchKMeans instead of KMeans (from Sklearn.KMeans : how to avoid Memory or Value Error?), but it still doesn't work. How can I avoid or solve this error? Maybe there are some mistakes in my code?
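For reference, the MiniBatchKMeans variant looks roughly like this; the cluster count, batch size, and the placeholder pixel array below are illustrative, not my actual script:

import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Stand-in for the satellite image reshaped to (n_pixels, n_bands).
pixels = np.random.rand(1000000, 4).astype(np.float32)

kmeans = MiniBatchKMeans(n_clusters=8, batch_size=10000, random_state=0)

# Feed the pixels to the model in chunks instead of all at once.
chunk = 100000
for start in range(0, pixels.shape[0], chunk):
    kmeans.partial_fit(pixels[start:start + chunk])

labels = kmeans.predict(pixels)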
Oh, I'm an idiot: I used the 32-bit version of Python instead of the 64-bit one.
Maybe reinstalling Python as the 64-bit version will solve your problem too.
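To check which build is currently installed, the standard library is enough:

import struct
import sys

# 32 means a 32-bit interpreter, 64 means a 64-bit one.
print(struct.calcsize("P") * 8, "bit")
print(sys.maxsize > 2**32)  # True only on a 64-bit build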
I get the following error while testing an audio file of more than 100 MB.
Traceback (most recent call last):
File "C:\Users\opensource\Desktop\pyAudioAnalysis-master\audioFeatureExtractio
n.py", line 542, in stFeatureExtraction signal = numpy.double(signal)MemoryError
Assuming your data was int16 before, upcasting to float64 quadrupled the size of your array. That is likely more memory than you had left, so it threw a MemoryError.
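A quick way to see the effect; the array length here is just a stand-in for a roughly 100 MB int16 signal:

import numpy as np

signal = np.zeros(50000000, dtype=np.int16)       # stand-in for the decoded audio
print(signal.nbytes / 1e6)                        # ~100 MB as int16
print(np.double(signal).nbytes / 1e6)             # ~400 MB as float64, 4x larger

# If double precision isn't required, float32 only doubles the footprint:
print(signal.astype(np.float32).nbytes / 1e6)     # ~200 MB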
I get a memory error when I am using the opening operation in the scikit-image package (it saturates my RAM). This memory error occurs for a 3-D structuring element that is a sphere/ball of radius 16 or larger. I am trying to use granulometry to measure the size distribution of objects in the image (a 3D array), so I need structuring elements of increasing radii. The memory requirements also increase exponentially, and I cannot find a way around it. Is there a simple solution to this problem so that I can use structuring elements of even greater radii? The image size is 200x200x200. TIA
Traceback (most recent call last):
File "R3.py", line 124, in <module>
output_image = skimage.morphology.binary_opening(image, ball)
File "/usr/lib/python2.7/dist-packages/skimage/morphology/binary.py", line 117, in binary_opening
eroded = binary_erosion(image, selem)
File "/usr/lib/python2.7/dist-packages/skimage/morphology/binary.py", line 41, in binary_erosion
ndimage.convolve(binary, selem, mode='constant', cval=1, output=conv)
File "/usr/lib/python2.7/dist-packages/scipy/ndimage/filters.py", line 696, in convolve
origin, True)
File "/usr/lib/python2.7/dist-packages/scipy/ndimage/filters.py", line 544, in _correlate_or_convolve
_nd_image.correlate(input, weights, output, mode, cval, origins)
MemoryError
A volume of dimensions 200x200x200 is pretty small. A granulometry is made of sequential openings, so you just need two more volumes for the computation: one temporary between the erosion and the dilation, and one more for the final result. Which means three volumes total. And the structuring element should be a list of coordinates, so nothing too big.
Consequently, there is absolutely no reason you cannot perform a granulometry on your computer for a volume of such dimensions. The only explanation for the exponential memory use would be that the intermediate results are not erased.
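A minimal sketch of that sequential-openings idea; the radius range and the random test volume are placeholders, and on older scikit-image versions the second argument of binary_opening is called selem rather than footprint:

import numpy as np
from skimage import morphology

# Placeholder binary volume of the same size as yours (200x200x200).
image = np.random.rand(200, 200, 200) > 0.5

granulometry = []
for radius in range(1, 17):
    ball = morphology.ball(radius)                   # spherical structuring element
    opened = morphology.binary_opening(image, ball)  # erosion followed by dilation
    granulometry.append(opened.sum())                # keep only the measure, not the volume

# 'opened' is reassigned each iteration, so only a handful of volumes are alive at once.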
In Theano, the following code snippet is throwing a MemoryError:
self.w = theano.shared(
    np.asarray(
        np.random.normal(
            loc=0.0, scale=np.sqrt(1.0/n_out), size=(n_in, n_out)),
        dtype=theano.config.floatX),
    name='w', borrow=True)
Just to mention the sizes: n_in = 64*56*56 and n_out = 4096. The snippet is taken from the __init__ method of a fully connected layer. See the traceback:
Traceback (most recent call last):
File "<stdin>", line 8, in <module>
File "final.py", line 510, in __init__
loc=0.0, scale=np.sqrt(1.0/n_out), size=(n_in, n_out)),
File "mtrand.pyx", line 1636, in mtrand.RandomState.normal (numpy/random/mtrand/mtrand.c:20676)
File "mtrand.pyx", line 242, in mtrand.cont2_array_sc (numpy/random/mtrand/mtrand.c:7401)
MemoryError
Is there any way we can get around the problem?
A MemoryError is Python's way of saying: "I tried getting enough memory for that operation but your OS says it doesn't have enough".
So there's no workaround. You have to do it another way (or buy more RAM!). I don't know what your floatX is, but your array contains 64*56*56*4096 elements, which translates to:
6.125 GB if you use float64
3.063 GB if you use float32
1.531 GB if you use float16 (not sure if float16 is supported for your operations though)
But the problem with MemoryErrors is that just avoiding them once generally isn't enough. If you don't change your approach you'll get problems again as soon as you do an operation that requires an intermediate or new array (then you have two huge arrays) or that coerces to a higher dtype (then you have two huge arrays and the new one is of higher dtype so requires more space).
So the only viable workaround is to change the approach; maybe you can start by calculating subsets (a map-reduce approach)?
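A quick back-of-the-envelope check of those numbers, using nothing Theano-specific:

n_in = 64 * 56 * 56              # 200,704
n_out = 4096
n_elements = n_in * n_out        # 822,083,584 elements

for name, bytes_per_element in [("float64", 8), ("float32", 4), ("float16", 2)]:
    gib = n_elements * bytes_per_element / (1024.0 ** 3)
    print(name, gib, "GiB")      # 6.125, 3.0625 and 1.53125 GiB respectively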
I am working on a two-class machine learning problem. The training set contains 2 million rows of URLs (strings) with labels 0 and 1. The LogisticRegression() classifier should predict one of the two labels when a test dataset is passed in. I get about 95% accuracy when I use a smaller dataset, i.e. 78,000 URLs with 0 and 1 as labels.
The problem I am having is that when I feed in the big dataset (2 million rows of URL strings), I get this error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 540, in runfile
execfile(filename, namespace)
File "C:/Users/Slim/.xy/startups/start/chi2-94.85 - Copy.py", line 48, in <module>
bi_counts = bi.fit_transform(url_list)
File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 780, in fit_transform
vocabulary, X = self._count_vocab(raw_documents, self.fixed_vocabulary)
File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 717, in _count_vocab
j_indices.append(vocabulary[feature])
MemoryError
My code, which works for small datasets with reasonable accuracy, is:
bi = CountVectorizer(ngram_range=(3, 3), binary=True, max_features=9000, analyzer='char_wb')
bi_counts = bi.fit_transform(url_list)
tf = TfidfTransformer(norm='l2', use_idf=True)
X_train_tf = tf.fit_transform(bi_counts)
clf = LogisticRegression(penalty='l1', intercept_scaling=0.5, random_state=True)
clf.fit(train_x2, y)
I tried to keep max_features as small as possible, say max_features=100, but I still get the same result.
Please note:
I am using a Core i5 with 4 GB of RAM.
I tried the same code on 8 GB of RAM, but no luck.
I am using Python 2.7.6 with sklearn, NumPy 1.8.1, SciPy 0.14.0, Matplotlib 1.3.1.
UPDATE:
@Andreas Mueller suggested using HashingVectorizer(). I used it with both the small and the large dataset; the 78,000-row dataset ran successfully, but the 2-million-row dataset gave me the same memory error as shown above. I tried it on 8 GB of RAM, and in-use memory was around 30% while processing the big dataset.
IIRC, max_features is only applied after the whole vocabulary has been computed.
The easiest way out is to use the HashingVectorizer that does not compute a dictionary.
You will lose the ability to get the corresponding token for a feature, but you shouldn't run into memory issues any more.
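A rough sketch of that swap, reusing the rest of the pipeline from the question; the url_list and y placeholders stand in for the question's data, and n_features=2**20 is just a starting point to tune:

from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression

# Placeholders standing in for the question's data:
url_list = ['http://example.com/a', 'http://example.org/b', 'http://example.net/c', 'http://example.com/d']
y = [0, 1, 0, 1]

# No vocabulary is stored: every char 3-gram is hashed straight into one of
# 2**20 columns, so memory use stays flat however many URLs you feed in.
hv = HashingVectorizer(ngram_range=(3, 3), analyzer='char_wb',
                       binary=True, n_features=2**20)
bi_counts = hv.transform(url_list)   # hashing is stateless, no fit needed

tf = TfidfTransformer(norm='l2', use_idf=True)
X_train_tf = tf.fit_transform(bi_counts)

clf = LogisticRegression(penalty='l1', intercept_scaling=0.5)
# Note: newer scikit-learn versions need solver='liblinear' (or 'saga') for penalty='l1'.
clf.fit(X_train_tf, y)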