Python: numpy.corrcoef Memory Error - python

I am trying to calculate the correlation across a large set of data read from a text file. For extremely large data sets the program gives a memory error. Can anyone please tell me how to correct this problem? Thanks.
The following is my code:
import numpy
from numpy import *
from array import *
from decimal import *
import sys
Threshold = 0.8;
TopMostData = 10;
FileName = sys.argv[1]
File = open(FileName,'r')
SignalData = numpy.empty((1, 128));
SignalData[:][:] = 0;
for line in File:
    TempLine = line.split();
    TempInt = [float(i) for i in TempLine]
    SignalData = vstack((SignalData,TempInt))
    del TempLine;
    del TempInt;
File.close();
TempData = SignalData;
SignalData = SignalData[1:,:]
SignalData = SignalData[:,65:128]
print "File Read | Data Stored" + " | Total Lines: " + str(len(SignalData))
CorrelationData = numpy.corrcoef(SignalData)
The following is the error:
Traceback (most recent call last):
File "Corelation.py", line 36, in <module>
CorrelationData = numpy.corrcoef(SignalData)
File "/usr/lib/python2.7/dist-packages/numpy/lib/function_base.py", line 1824, in corrcoef
return c/sqrt(multiply.outer(d, d))
MemoryError

You run out of memory, as the comments show. If that happens because you are using 32-bit Python, even the method below will fail. But with 64-bit Python and limited RAM we can do a lot, because correlations are easy to calculate piecewise: you only need two rows in memory at a time.
So you can split your input into chunks of, say, 1000 rows; the resulting 1000 x 1000 matrices are easy to keep in memory. Then you can assemble the result into the big output matrix, which does not necessarily have to be in RAM. I recommend this approach even if you have a lot of RAM, because it is much more memory-friendly. Correlation coefficient calculation is not an operation where fast random access helps much, provided the input can be kept in RAM.
Unfortunately, numpy.corrcoef does not do this automatically, so we have to roll our own correlation coefficient calculation. Fortunately, that is not as hard as it sounds.
Something along these lines:
import numpy as np
# number of rows in one chunk
SPLITROWS = 1000
# the big table, which is usually bigger
bigdata = np.random.random((27000, 128))
numrows = bigdata.shape[0]
# subtract means from the input data
bigdata -= np.mean(bigdata, axis=1)[:,None]
# normalize the data
bigdata /= np.sqrt(np.sum(bigdata*bigdata, axis=1))[:,None]
# reserve the resulting table onto HDD
res = np.memmap("/tmp/mydata.dat", 'float64', mode='w+', shape=(numrows, numrows))
for r in range(0, numrows, SPLITROWS):
    for c in range(0, numrows, SPLITROWS):
        r1 = r + SPLITROWS
        c1 = c + SPLITROWS
        chunk1 = bigdata[r:r1]
        chunk2 = bigdata[c:c1]
        res[r:r1, c:c1] = np.dot(chunk1, chunk2.T)
Some notes:
the code above is tested against np.corrcoef(bigdata)
if you have complex values, you'll need to create a complex output array res and take the complex conjugate of chunk2.T
the code modifies bigdata in place to maintain performance and minimize memory use; if you need to preserve it, make a copy
The above code takes about 85 s to run on my machine, but the data mostly fits in RAM, and I have an SSD. The algorithm is ordered to avoid overly random access to the disk, i.e. the access is reasonably sequential. In comparison, the non-memmapped standard version is not significantly faster even if you have a lot of memory. (Actually, it took a lot more time in my case, but I suspect I ran out of my 16 GiB and then there was a lot of swapping going on.)
You can make the actual calculations faster by omitting half of the matrix, because res.T == res. In practice, you can skip all blocks where c > r and then mirror them later on. On the other hand, the performance is most likely limited by disk performance, so other optimizations do not necessarily bring much more speed.
Of course, this approach is easy to make parallel, as the chunk calculations are completely independent. Also, the memmapped array can be shared between threads rather easily.
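The symmetry shortcut can be sketched on a small array as follows. This is only an illustration under assumptions: the function name chunked_corrcoef, the tiny chunk size, and the in-memory result array are made up for the example (for large inputs the result would be an np.memmap as above), and the input is copied rather than modified in place.

```python
import numpy as np

def chunked_corrcoef(data, split_rows=4):
    """Row-wise correlation matrix, computed block by block.

    Only blocks with c <= r are computed; the rest are mirrored.
    """
    # work on a float copy so the caller's array is left untouched
    x = np.array(data, dtype=np.float64)
    x -= x.mean(axis=1)[:, None]
    x /= np.sqrt((x * x).sum(axis=1))[:, None]
    n = x.shape[0]
    res = np.empty((n, n))  # could be an np.memmap for large n
    for r in range(0, n, split_rows):
        for c in range(0, r + 1, split_rows):  # skip blocks where c > r
            block = np.dot(x[r:r + split_rows], x[c:c + split_rows].T)
            res[r:r + split_rows, c:c + split_rows] = block
            if c != r:
                # mirror the off-diagonal block into the upper triangle
                res[c:c + split_rows, r:r + split_rows] = block.T
    return res

rng = np.random.default_rng(0)
small = rng.random((10, 16))
result = chunked_corrcoef(small)
print(np.allclose(result, np.corrcoef(small)))
```

Comparing against np.corrcoef on a small input like this is a cheap way to check the chunking logic before running it on the full data set.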

Related

How do I improve the speed of this parser using python?

I am currently parsing historic delay data from a public transport network in Sweden. I have ~5700 files (one every 15 seconds) from the 27th of January containing momentary delay data for vehicles on active trips in the network. Unfortunately, there is a lot of overhead / duplicate data, so I want to parse out the relevant parts to do visualizations on them.
However, when I try to parse and filter out the relevant delay data on a trip level using the script below, it runs really slowly. It has been running for over 1.5 hours now (on my 2019 15-inch MacBook Pro) and isn't finished yet.
How can I optimize / improve this Python parser?
Or should I reduce the number of files, i.e. the frequency of the data collection, for this task?
Thank you so much in advance. 💗
from google.transit import gtfs_realtime_pb2
import gzip
import os
import datetime
import csv
import numpy as np
directory = '../data/tripu/27/'
datapoints = np.zeros((0,3), int)
read_trips = set()
# Loop through all files in directory
for filename in os.listdir(directory)[::3]:
    try:
        # Uncompress and parse protobuff-file using gtfs_realtime_pb2
        with gzip.open(directory + filename, 'rb') as file:
            response = file.read()
            feed = gtfs_realtime_pb2.FeedMessage()
            feed.ParseFromString(response)
            print("Filename: " + filename, "Total entities: " + str(len(feed.entity)))
            for trip in feed.entity:
                if trip.trip_update.trip.trip_id not in read_trips:
                    try:
                        if len(trip.trip_update.stop_time_update) == len(stopsOnTrip[trip.trip_update.trip.trip_id]):
                            print("\t","Adding delays for",len(trip.trip_update.stop_time_update),"stops, on trip_id",trip.trip_update.trip.trip_id)
                            for i, stop_time_update in enumerate(trip.trip_update.stop_time_update[:-1]):
                                # Store the delay data point (arrival difference of two ascending nodes)
                                delay = int(trip.trip_update.stop_time_update[i+1].arrival.delay-trip.trip_update.stop_time_update[i].arrival.delay)
                                # Store contextual metadata (timestamp and edgeID) for the unique delay data point
                                ts = int(trip.trip_update.stop_time_update[i+1].arrival.time)
                                key = int(str(trip.trip_update.stop_time_update[i].stop_id) + str(trip.trip_update.stop_time_update[i+1].stop_id))
                                # Append data to numpy array
                                datapoints = np.append(datapoints, np.array([[key,ts,delay]]), axis=0)
                            read_trips.add(trip.trip_update.trip.trip_id)
                    except KeyError:
                        continue
                else:
                    continue
    except OSError:
        continue
I suspect the problem here is repeatedly calling np.append to add a new row to a numpy array. Because the size of a numpy array is fixed when it is created, np.append() must create a new array, which means that it has to copy the previous array. On each loop, the array is bigger and so all these copies add a quadratic factor to your execution time. This becomes significant when the array is quite big (which apparently it is in your application).
As an alternative, you could just create an ordinary Python list of tuples, and then if necessary convert that to a complete numpy array at the end.
That is (only the modified lines):
datapoints = []
# ...
datapoints.append((key,ts,delay))
# ...
npdata = np.array(datapoints, dtype=int)
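To make the difference concrete, here is a small self-contained sketch comparing the two patterns. The rows are made-up (key, ts, delay) values, not data from the question; the point is only that both approaches produce the same array, while the list version converts to NumPy once instead of copying on every iteration.

```python
import numpy as np

# made-up (key, ts, delay) rows standing in for the parsed delay data
rows = [(101102, 1580000000 + i, i % 7) for i in range(1000)]

# Slow pattern: np.append allocates and copies the whole array on every call
slow = np.zeros((0, 3), int)
for row in rows:
    slow = np.append(slow, np.array([row]), axis=0)

# Faster pattern: collect plain tuples in a list, convert once at the end
collected = []
for row in rows:
    collected.append(row)
fast = np.array(collected, dtype=int)

print(slow.shape, fast.shape, np.array_equal(slow, fast))
```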
I still think the parse routine is your bottleneck (even if it did come from Google), but all those '.'s were killing me! (And they do slow down performance somewhat.) Also, I converted your i, i+1 iteration to two iterators zipping through the list of updates, which is a slightly more advanced style of working through a list. Plus, the cur_update/next_update names helped me keep straight when you wanted to reference one vs. the other. Finally, I removed the trailing "else: continue", since you are at the end of the for loop anyway.
for trip in feed.entity:
    this_trip_update = trip.trip_update
    this_trip_id = this_trip_update.trip.trip_id
    if this_trip_id not in read_trips:
        try:
            if len(this_trip_update.stop_time_update) == len(stopsOnTrip[this_trip_id]):
                print("\t", "Adding delays for", len(this_trip_update.stop_time_update), "stops, on trip_id",
                      this_trip_id)
                # create two iterators to walk through the list of updates
                cur_updates = iter(this_trip_update.stop_time_update)
                nxt_updates = iter(this_trip_update.stop_time_update)
                # advance the nxt_updates iter so it is one ahead of cur_updates
                next(nxt_updates)
                for cur_update, next_update in zip(cur_updates, nxt_updates):
                    # Store the delay data point (arrival difference of two ascending nodes)
                    delay = int(next_update.arrival.delay - cur_update.arrival.delay)
                    # Store contextual metadata (timestamp and edgeID) for the unique delay data point
                    ts = int(next_update.arrival.time)
                    key = "{}/{}".format(cur_update.stop_id, next_update.stop_id)
                    # Append data to numpy array
                    datapoints = np.append(datapoints, np.array([[key, ts, delay]]), axis=0)
                read_trips.add(this_trip_id)
        except KeyError:
            continue
This code should be equivalent to what you posted, and I don't really expect major performance gains either, but perhaps this will be more maintainable when you come back to look at it in 6 months.
(This probably is more appropriate for CodeReview, but I hardly ever go there.)

Python sparse matrix creation: parallelize to speed up

I am creating a sparse matrix file by extracting the features from an input file. Each row of the input file contains one film ID, followed by some feature IDs and each feature's score.
6729792 4:0.15568 8:0.198796 9:0.279261 13:0.17829 24:0.379707
The first number is the ID of the film; in each pair, the value to the left of the colon is the feature ID and the value to the right is the score of that feature.
Each line represents one film, and the number of feature:score pairs varies from one film to another.
Here is how I construct my sparse matrix.
import sys
import os
import os.path
import time
import json
import numpy as np
import tables as tb
from Film import Film
import scipy
from scipy.sparse import coo_matrix, csr_matrix, rand
def sparseCreate(self, Debug):
    a = rand(self.total_rows, self.total_columns, format='csr')
    l, m = a.shape[0], a.shape[1]
    f = tb.open_file("sparseFile.h5", 'w')
    filters = tb.Filters(complevel=5, complib='blosc')
    data_matrix = f.create_carray(f.root, 'data', tb.Float32Atom(), shape=(l, m), filters=filters)
    index_film = 0
    input_data = open('input_file.txt', 'r')
    for line in input_data:
        my_line = np.array(line.split())
        id_film = my_line[0]
        my_line = np.core.defchararray.split(my_line[1:], ":")
        self.data_matrix_search_normal[str(id_film)] = index_film
        self.data_matrix_search_reverse[index_film] = str(id_film)
        for element in my_line:
            if int(element[0]) in self.selected_features:
                column = self.index_selected_feature[str(element[0])]
                data_matrix[index_film, column] = float(element[1])
        index_film += 1
    self.selected_matrix = data_matrix
    json.dump(self.data_matrix_search_reverse,
              open(os.path.join(self.output_path, "data_matrix_search_reverse.json"), 'wb'),
              sort_keys=True, indent=4)
    my_films = Film(
        self.selected_matrix, self.data_matrix_search_reverse, self.path_doc, self.output_path)
    x_matrix_unique = self.selected_matrix[:, :]
    r_matrix_unique = np.asarray(x_matrix_unique)
    f.close()
    return my_films
Question:
I feel that this function is too slow on big datasets, and it takes too long to calculate.
How can I improve and accelerate it? Maybe using MapReduce? What is wrong in this function that makes it so slow?
The slow parts are the I/O, the conversions (from str, to str, even converting the same variable to str twice), the splits, and the explicit loops. By the way, Python's csv module can be used to parse your input file; you can experiment with it (I suppose you use a space as the delimiter). I also see that you convert element[0] to both int and str, which is bad: you create many temporary objects. If you call this function several times, you may try to reuse some internal objects (an array?). You can also try implementing it in another style, with map or a list comprehension, but experiments are needed.
The general idea of Python code optimization is to avoid explicit Python byte-code execution and to prefer native/C Python functions (for anything). And do try to get rid of all those conversions. Also, if the input file is yours, you can format it with fixed-length fields; this lets you avoid split/parse entirely (only string indexing).
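One way to cut down on per-element work, sketched here under assumptions, is to collect (row, column, value) triplets in plain lists and build the sparse matrix in a single call at the end, instead of writing element by element into a dense on-disk array. The second sample line and the feature_to_column mapping are made up for illustration; each feature ID is converted to int exactly once.

```python
from scipy.sparse import coo_matrix

sample_lines = [
    "6729792 4:0.15568 8:0.198796 9:0.279261 13:0.17829 24:0.379707",
    "1234567 8:0.5 24:0.25",  # made-up second film for illustration
]
# hypothetical mapping from selected feature ID to matrix column
feature_to_column = {4: 0, 8: 1, 24: 2}

rows, cols, vals, film_ids = [], [], [], []
for row_index, line in enumerate(sample_lines):
    fields = line.split()
    film_ids.append(fields[0])
    for pair in fields[1:]:
        feature, score = pair.split(":")
        column = feature_to_column.get(int(feature))  # None if not selected
        if column is not None:
            rows.append(row_index)
            cols.append(column)
            vals.append(float(score))

# build the matrix once from the triplets, then convert to CSR
matrix = coo_matrix(
    (vals, (rows, cols)),
    shape=(len(sample_lines), len(feature_to_column)),
).tocsr()
print(matrix.toarray())
```

Because each input line is processed independently, the triplet lists from separate chunks of the file could also be produced in parallel and concatenated before the final coo_matrix call.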

Scipy Sparse Eigensolver: MemoryError after multiple passes through loop without anything new being written during loop

I'm using Python + Scipy to diagonalize sparse matrices with random entries on the diagonal; in particular, I need eigenvalues in the middle of the spectrum. The code I've written has worked fine for months, but now I'm looking at bigger matrices and am running into "MemoryError"s. What's confusing/driving me insane is that the error only shows up after a few iterations (namely 9) of constructing a random matrix and diagonalizing it, but I don't see any way in which my code stores anything extra in memory from one iteration to the next, and so can't see how my code could fail during the 9th iteration but not the 1st.
Here are the details (and I apologize in advance if I've left anything out, I'm new to posting on this site):
Each matrix I construct is 16000x16000, with 15x16000 non-zero entries. Everything ran fine when I was looking at 4000x4000-size matrices. The bulk of my code is
#Initialization
#...
for i in range(dim):
    for n in range(N):
        digit = (i % 2**(n+1)) / 2**n
        index = (i % 2**n) + ((digit + 1) % 2)*(2**n) + (i / 2**(n+1))*(2**(n+1))
        row[dim + N*i + n] = index
        col[dim + N*i + n] = i
        dat[dim + N*i + n] = -G
e_list = open(e_list_name + "_%03dk_%010ds" % (num_states, int(start_time)), "w")
e_log = open(e_log_name + "_%03dk_%010ds" % (num_states, int(start_time)), "w")
for t in range(num_itr): #Begin iterations
    dat[0:dim] = math.sqrt(N/2.0)*np.random.randn(dim) #Get new diagonal elements
    H = sparse.csr_matrix((dat, (row, col))) #Construct new matrix
    vals = sparse.linalg.eigsh(H, k = num_states + 2, sigma = target_energy, which = 'LM', return_eigenvectors = False) #Get new eigenvalues
    vals = np.sort(vals)
    vals.tofile(e_list)
    e_log.write("Iter %d complete\n" % (t+1))
    e_list.flush()
    e_log.flush()
e_list.close()
e_log.close()
I've been setting num_itr to 100. During the 9th pass through the num_itr loop (as indicated by 8 lines having been written to e_log), the program crashes with the error message
Can't expand MemType 0: jcol 7438
Traceback (most recent call last):
File "/usr/lusers/clb37/QREM_Energy_Gatherer.py", line 55, in <module>
vals = sparse.linalg.eigsh(H, k = num_states + 2, sigma = target_energy, which = 'LM', return_eigenvectors = False)
File "/usr/lusers/clb37/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/scipy/sparse/linalg/eigen/arpack/arpack.py", line 1524, in eigsh
symmetric=True, tol=tol)
File "/usr/lusers/clb37/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/scipy/sparse/linalg/eigen/arpack/arpack.py", line 1030, in get_OPinv_matvec
return SpLuInv(A.tocsc()).matvec
File "/usr/lusers/clb37/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/scipy/sparse/linalg/eigen/arpack/arpack.py", line 898, in __init__
self.M_lu = splu(M)
File "/usr/lusers/clb37/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/scipy/sparse/linalg/dsolve/linsolve.py", line 242, in splu
ilu=False, options=_options)
MemoryError
Sure enough, the program will fail during the 9th pass through that loop every time I run it on my machine, and when I try running this code on machines with more memory the program makes it through more iterations before crashing, so it looks like the computer really is running out of memory. If that's all there is to it then fine, but what I can't understand is why the program doesn't crash during the 1st iteration. I don't see any point in the 8 lines of the num_itr loop at which something gets written to memory without just being overwritten during the following iteration. I've used Heapy's heap() function to look at my memory usage, and it just prints out "Total size = 11715240 bytes" during every pass.
I feel like there's something fundamental that I just don't know about going on here, either some bug in my writing that I don't know to look for or some detail about how memory is handled. Can anyone explain to me why this code fails during the 9th pass through the num_itr loop but not the 1st?
Ok, this seems to be reproducible on Scipy 0.14.0.
The issue can apparently be worked around by adding
import gc; gc.collect()
inside the loop to force Python's cyclic garbage collector to run.
The issue appears to be that somewhere inside scipy.sparse.linalg.eigsh there is a cyclic reference loop, in the vein of:
class Foo(object):
    pass
a = Foo()
b = Foo()
a.spam = b
b.spam = a
del a, b # <- but a, b still refer to each other and are not dead
This is still perfectly OK in principle: although Python's reference counting doesn't detect such cyclic garbage, a collection is run periodically to gather such objects. However, if each object is very large in memory (e.g. big NumPy arrays), the periodic runs are too infrequent, and you run out of memory before the next cyclic garbage collection run is done.
So a workaround is to force the GC to run when you know there's big garbage to collect.
A better workaround would be to change scipy.sparse.linalg.eigsh so that such cyclic garbage is not generated in the first place.
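The effect can be observed directly with a weak reference. The following is a minimal sketch, not taken from SciPy: the Node class and the probe name are made up, and automatic collection is switched off only to make the demonstration deterministic. On CPython, the object in the cycle should still be alive after del and gone after gc.collect().

```python
import gc
import weakref

class Node(object):
    pass

gc.disable()  # switch off automatic collection so the demo is deterministic
a = Node()
b = Node()
a.spam = b
b.spam = a
probe = weakref.ref(a)  # lets us observe whether the object is still alive

del a, b
# reference counting alone cannot free the cycle
alive_before = probe() is not None

gc.collect()  # force the cyclic garbage collector to run
alive_after = probe() is not None
gc.enable()

print(alive_before, alive_after)
```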

Python pickle file strangely large

I made a pickle file, storing a grayscale value of each pixel in 100,000 80x80 sized images.
(Plus an array of 100,000 integers whose values are one-digit).
My approximation for the total size of the pickle is
4 bytes x 80 x 80 x 100000 = 2.56 GB
plus the array of integers, which shouldn't be that large.
The generated pickle file however is over 16GB, so it's taking hours just to unpickle it and load it, and it eventually freezes, after it takes full memory resources.
Is there something wrong with my calculation or is it the way I pickled it?
I pickled the file in the following way.
from PIL import Image
import pickle
import os
import numpy
import time
trainpixels = numpy.empty([80000,6400])
trainlabels = numpy.empty(80000)
validpixels = numpy.empty([10000,6400])
validlabels = numpy.empty(10000)
testpixels = numpy.empty([10408,6400])
testlabels = numpy.empty(10408)
i=0
tr=0
va=0
te=0
for (root, dirs, filenames) in os.walk(indir1):
    print 'hello'
    for f in filenames:
        try:
            im = Image.open(os.path.join(root,f))
            Imv=im.load()
            x,y=im.size
            pixelv = numpy.empty(6400)
            ind=0
            for ii in range(x):
                for j in range(y):
                    temp=float(Imv[j,ii])
                    temp=float(temp/255.0)
                    pixelv[ind]=temp
                    ind+=1
            if i<40000:
                trainpixels[tr]=pixelv
                tr+=1
            elif i<45000:
                validpixels[va]=pixelv
                va+=1
            else:
                testpixels[te]=pixelv
                te+=1
            print str(i)+'\t'+str(f)
            i+=1
        except IOError:
            continue
trainimage=(trainpixels,trainlabels)
validimage=(validpixels,validlabels)
testimage=(testpixels,testlabels)
output=open('data.pkl','wb')
pickle.dump(trainimage,output)
pickle.dump(validimage,output)
pickle.dump(testimage,output)
Please let me know if you see something wrong with either my calculation or my code!
Python Pickles are not a thrifty mechanism for storing data as you're storing objects instead of "just the data."
The following test case takes 24kb on my system and this is for a small, sparsely populated numpy array stored in a pickle:
import os
import sys
import numpy
import pickle
testlabels = numpy.empty(1000)
testlabels[0] = 1
testlabels[99] = 0
test_labels_size = sys.getsizeof(testlabels) #80
output = open('/tmp/pickle', 'wb')
pickle.dump(testlabels, output)
output.close()  # flush to disk before measuring the file size
print os.path.getsize('/tmp/pickle')
Further, I'm not sure why you believe 4 bytes to be the size of a number in Python: non-numpy ints are 24 bytes (sys.getsizeof(1)) and numpy arrays are a minimum of 80 bytes (sys.getsizeof(numpy.array([0], float))).
As you stated in response to my comment, you have reasons for staying with Pickle, so I won't try to convince you further not to store objects, but be aware of the overhead of storing objects.
As an option: reduce the size of your training data, or pickle fewer objects.
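A rough size estimate helps separate the two effects. The sketch below is an illustration under assumptions, written for Python 3 with the array shapes taken from the question: numpy.empty defaults to float64, so each pixel costs 8 bytes rather than 4, and a binary pickle protocol adds only a small header on top of the raw array bytes. Python 2's default pickle protocol (0) is text-based, which plausibly accounts for much of the remaining blow-up, though that is not measured here.

```python
import pickle
import numpy as np

# array shapes from the question: train + valid + test rows, 80x80 pixels each
rows, cols = 80000 + 10000 + 10408, 80 * 80

for name in ("float64", "float32", "uint8"):
    size_gb = rows * cols * np.dtype(name).itemsize / 1e9
    print("%-8s %.2f GB" % (name, size_gb))

# numpy.empty defaults to float64, so each pixel costs 8 bytes, not 4
default_itemsize = np.empty(1).dtype.itemsize

# with a binary protocol the pickle is barely larger than the raw data
small = np.zeros((1000, 64), dtype=np.float32)
blob = pickle.dumps(small, protocol=pickle.HIGHEST_PROTOCOL)
overhead = len(blob) - small.nbytes
print("pickle overhead: %d bytes" % overhead)
```

Storing the grayscale values as uint8 and dividing by 255.0 after loading would shrink the data by a factor of eight compared with float64.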

Why is loading this file taking so much memory?

I'm trying to load a file into Python. It's a very big file (1.5Gb), but I have the available memory and I just want to do this once (hence the use of Python; I just need to sort the file one time, so Python was an easy choice).
My issue is that loading this file results in way too much memory usage. When I've loaded about 10% of the lines into memory, Python is already using 700Mb, which is clearly too much. At around 50% the script hangs, using 3.03 Gb of real memory (and slowly rising).
I know this isn't the most efficient method of sorting a file (memory-wise) but I just want it to work so I can move on to more important problems :D So, what is wrong with the following python code that's causing the massive memory usage:
print 'Loading file into memory'
input_file = open(input_file_name, 'r')
input_file.readline() # Toss out the header
lines = []
totalLines = 31164015.0
currentLine = 0.0
printEvery100000 = 0
for line in input_file:
    currentLine += 1.0
    lined = line.split('\t')
    printEvery100000 += 1
    if printEvery100000 == 100000:
        print str(currentLine / totalLines)
        printEvery100000 = 0;
    lines.append( (lined[timestamp_pos].strip(), lined[personID_pos].strip(), lined[x_pos].strip(), lined[y_pos].strip()) )
input_file.close()
print 'Done loading file into memory'
EDIT: In case anyone is unsure, the general consensus seems to be that each variable allocated eats up more and more memory. I "fixed" it in this case by calling readlines(), which still loads all the data but has only one string's worth of overhead per line. This loads the entire file using about 1.7Gb. Then, when I call lines.sort(), I pass a function to key that splits on tabs and returns the right column value, converted to an int. This is slow computationally, and memory-intensive overall, but it works. Learned a ton about variable allocation overhead today :D
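The workaround described in the edit can be sketched like this (Python 3 syntax). The sample rows and the column index are made up; the first part compares the per-line cost of one raw string against a tuple of four stripped strings, and the second part sorts raw lines with a key function that extracts the timestamp column as an int.

```python
import sys

raw_line = "1327650000\tperson42\t12.5\t48.1\n"  # made-up sample row

# per-line cost of keeping a tuple of four separate strings...
as_tuple = tuple(field.strip() for field in raw_line.split("\t"))
tuple_cost = sys.getsizeof(as_tuple) + sum(sys.getsizeof(s) for s in as_tuple)
# ...versus keeping the single unsplit string
string_cost = sys.getsizeof(raw_line)
print(tuple_cost, string_cost)

timestamp_pos = 0  # assumed column index for the example
lines = ["30\tp2\t1.0\t2.0\n", "4\tp1\t0.5\t0.1\n", "12\tp3\t3.0\t4.0\n"]
# sort the raw lines; the key splits on tabs and converts the column to int
lines.sort(key=lambda line: int(line.split("\t")[timestamp_pos]))
sorted_keys = [line.split("\t")[0] for line in lines]
print(sorted_keys)
```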
Here is a rough estimate of the memory needed, based on the constants derived from your example. At a minimum you have to figure the Python internal object overhead for each split line, plus the overhead for each string.
It estimates 9.1 GB to store the file in memory, assuming the following constants, which are off by a bit, since you're only using part of each line:
1.5 GB file size
31,164,015 total lines
each line split into a list with 4 pieces
Code:
import sys
def sizeof(lst):
    return sys.getsizeof(lst) + sum(sys.getsizeof(v) for v in lst)
GIG = 1024**3
file_size = 1.5 * GIG
lines = 31164015
num_cols = 4
avg_line_len = int(file_size / float(lines))
val = 'a' * (avg_line_len / num_cols)
lst = [val] * num_cols
line_size = sizeof(lst)
print 'avg line size: %d bytes' % line_size
print 'approx. memory needed: %.1f GB' % ((line_size * lines) / float(GIG))
Returns:
avg line size: 312 bytes
approx. memory needed: 9.1 GB
I don't know about the analysis of the memory usage, but you might try this to get it to work without running out of memory. You sort into a new file that is accessed through a memory mapping (I've been led to believe this is efficient in terms of memory). mmap has some OS-specific behavior; I tested this on Linux (at very small scale).
This is the basic code. To make it run with decent time efficiency you'd probably want to do a binary search on the sorted file to find where to insert each line; otherwise it will probably take a long time.
You can find a file-seeking binary search algorithm in this question.
Hopefully a memory efficient way of sorting a massive file by line:
import os
from mmap import mmap
input_file = open('unsorted.txt', 'r')
output_file = open('sorted.txt', 'w+')
# need to provide something in order to be able to mmap the file
# so we'll just copy the first line over
output_file.write(input_file.readline())
output_file.flush()
mm = mmap(output_file.fileno(), os.stat(output_file.name).st_size)
cur_size = mm.size()
for line in input_file:
    mm.seek(0)
    tup = line.split("\t")
    while True:
        cur_loc = mm.tell()
        o_line = mm.readline()
        o_tup = o_line.split("\t")
        if o_line == '' or tup[0] < o_tup[0]: # EOF or we found our spot
            mm.resize(cur_size + len(line))
            mm[cur_loc+len(line):] = mm[cur_loc:cur_size]
            mm[cur_loc:cur_loc+len(line)] = line
            cur_size += len(line)
            break
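A different technique from the mmap insertion above, named plainly: an external merge sort. It sorts fixed-size chunks of lines in memory, writes each chunk to a temporary file, and then merges the sorted chunks with heapq.merge, so only one chunk is held in memory at a time. This is a Python 3 sketch under assumptions: the function names, the chunk size, and the string comparison on the first tab-separated column are illustrative, and every line is assumed to end with a newline.

```python
import heapq
import itertools
import os
import tempfile

def sort_key(line):
    # sort on the first tab-separated column (compared as a string)
    return line.split("\t")[0]

def external_sort(in_path, out_path, chunk_lines=100000):
    """Sort a text file by sort_key using sorted temp chunks and a k-way merge."""
    chunk_paths = []
    with open(in_path) as src:
        while True:
            # read at most chunk_lines lines into memory
            chunk = list(itertools.islice(src, chunk_lines))
            if not chunk:
                break
            chunk.sort(key=sort_key)
            fd, path = tempfile.mkstemp(text=True)
            with os.fdopen(fd, "w") as tmp:
                tmp.writelines(chunk)
            chunk_paths.append(path)
    files = [open(p) for p in chunk_paths]
    try:
        with open(out_path, "w") as dst:
            # heapq.merge reads the sorted chunk files lazily, line by line
            dst.writelines(heapq.merge(*files, key=sort_key))
    finally:
        for handle in files:
            handle.close()
        for p in chunk_paths:
            os.remove(p)

# tiny usage example with made-up data
workdir = tempfile.mkdtemp()
unsorted_path = os.path.join(workdir, "unsorted.txt")
sorted_path = os.path.join(workdir, "sorted.txt")
with open(unsorted_path, "w") as out:
    out.write("3\tc\n1\ta\n2\tb\n5\te\n4\td\n")
external_sort(unsorted_path, sorted_path, chunk_lines=2)
with open(sorted_path) as result_file:
    sorted_lines = result_file.read().splitlines()
print(sorted_lines)
```

For numeric timestamps the key would need an int conversion, since string comparison orders "10" before "9".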
