This question already has answers here:
Is there support for sparse matrices in Python?
(2 answers)
Closed 10 years ago.
I am looking for a way to store about 10 million floating point (double precision) numbers of a sparse matrix. The matrix is actually a two-dimensional triangular matrix of 1 million by 1 million elements. The element (i,j) is the score measure score(i,j) between element i and element j. The storage method must allow very fast access to this information, perhaps by memory-mapping the file containing the matrix. I certainly don't want to load the whole file into memory.
from tables import IsDescription, UInt32Col, FloatCol

class Score(IsDescription):
    grid_i = UInt32Col()
    grid_j = UInt32Col()
    score = FloatCol()
I've tried pytables using the Score class shown above, but I cannot access element (i,j) directly without scanning all the rows. Any suggestions?
10 million double precision floats take up 80 MB of memory. If you store them in a 1 million x 1 million sparse matrix, in CSR or CSC formats, you will need an additional 11 million int32s, for a total of around 125 MB. That's probably less than 7% of the physical memory in your system. And in my experience, on a system with 4GB running a 32-bit version of python, you rarely start having trouble allocating arrays until you try to get a hold of ten times that.
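Spelled out as a back-of-the-envelope calculation (this snippet is mine, not part of the original answer; it just restates the arithmetic above):

# CSR storage for 10 million float64 values in a 1,000,000 x 1,000,000 matrix
nnz, nrows = 10**7, 10**6
data_bytes    = nnz * 8            # float64 values:        80,000,000 bytes
indices_bytes = nnz * 4            # int32 column indices:  40,000,000 bytes
indptr_bytes  = (nrows + 1) * 4    # int32 row pointers:     ~4,000,000 bytes
total_mb = (data_bytes + indices_bytes + indptr_bytes) / 1e6   # ~124 MB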
Run the following code on your computer:
import itertools
import numpy as np

for j in itertools.count(100):
    try:
        a = np.empty((j * 10**6,), dtype='uint8')
        print 'Allocated {0} MB of memory!'.format(j)
        del a
    except MemoryError:
        print 'Failed to allocate {0} MB of memory!'.format(j)
        break
Unless it fails to get you at least 4 times the amount calculated above, don't hesitate to stick the whole thing in memory using a scipy.sparse format.
I have no experience with pytables, nor much with numpy's memmap arrays. But it seems to me that either one of those will involve you writing the logic to handle the sparsity yourself, something I would avoid unless it proves impossible.
You should use scipy.sparse. Here's some more info about the formats and usage.
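To make that concrete, here is a minimal sketch (the sample data and the choice of CSR are illustrative assumptions, not from the question): assemble the matrix from your (i, j, score) triples in COO format, convert it to CSR, and element access and row slicing are then fast.

import numpy as np
from scipy import sparse

# Illustrative data: replace with your real (i, j, score) triples.
n = 1000000                       # 1 million x 1 million matrix
i = np.array([3, 17, 42], dtype=np.int32)
j = np.array([5, 99, 123456], dtype=np.int32)
scores = np.array([0.5, 1.25, -3.0], dtype=np.float64)

# COO is convenient for construction, CSR for fast lookups and row slicing.
m = sparse.coo_matrix((scores, (i, j)), shape=(n, n)).tocsr()

print m[17, 99]        # fast element access
print m[42].toarray()  # a whole row as a dense 1 x n array

COO is the natural format for assembling from triples; CSR (or CSC) is the one you want for repeated lookups.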
I have a vector v of size n and I need to increment its entries by 1 using this code:
for output_diff in results:
    for i in range(n):
        if output_diff & (1 << i):
            v[i] += 1
The size of results is approximately 10,000,000 and n = 4096. How can I do this using parallelism, or maybe multiprocessing, in Python? I tried using the idea in How to implement a reduce operation in python multiprocessing?, but it takes longer than the serial version.
If your operation is taking a long time (say, 30 seconds or longer), then you could benefit from dividing results into as many pieces as the number of Python processes you want to run, and using Python's multiprocessing module. If the operation isn't taking that long, the overhead of starting the new processes will outweigh the benefit of using them.
Since the operation being carried out does not depend on the values stored in v, each process can write to an independent vector and you can aggregate the results at the end. Pass each process a vector v_prime of 0's of the same length as v. Perform the above operation, each process handling a portion of the output_diffs in results and incrementing the corresponding values in v_prime instead of v. Then at the end, each process returns its vector v_prime. Sum all of the returned v_primes and the original v (this is where having the items expressed as numpy arrays is helpful, as it is easy to add numpy vectors of the same length) to get the correct result.
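A minimal sketch of that idea, assuming results is a list of Python ints and v is a numpy array (the chunking scheme and function names here are mine, not from the original answer):

import numpy as np
from multiprocessing import Pool

N = 4096  # length of v

def count_bits(chunk):
    """Worker: build a private counter vector v_prime for one chunk of results."""
    v_prime = np.zeros(N, dtype=np.int64)
    for output_diff in chunk:
        for i in range(N):
            if output_diff & (1 << i):
                v_prime[i] += 1
    return v_prime

def parallel_count(results, v, processes=4):
    # Split results into roughly equal pieces, one per process.
    chunks = [results[k::processes] for k in range(processes)]
    pool = Pool(processes)
    try:
        partials = pool.map(count_bits, chunks)
    finally:
        pool.close()
        pool.join()
    # Aggregate: the original v plus every process's private counts.
    return v + sum(partials)

On platforms that spawn rather than fork worker processes (e.g. Windows), call parallel_count from under an if __name__ == '__main__': guard.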
I am trying to understand why this python code results in a process that requires 236 MB of memory, considering that the list is only 76 MB long.
import sys
import psutil
initial = psutil.virtual_memory().available / 1024 / 1024
available_memory = psutil.virtual_memory().available
vector_memory = sys.getsizeof([])
vector_position_memory = sys.getsizeof([1]) - vector_memory
positions = 10000000
print "vector with %d positions should use %d MB of memory " % (positions, (vector_memory + positions * vector_position_memory) / 1024 / 1024)
print "it used %d MB of memory " % (sys.getsizeof(range(0, positions)) / 1024 / 1024)
final = psutil.virtual_memory().available / 1024 / 1024
print "however, this process used in total %d MB" % (initial - final)
The output is:
vector with 10000000 positions should use 76 MB of memory
it used 76 MB of memory
however, this process used in total 236 MB
Using 10x more positions (i.e. positions = 100000000) results in roughly 10x more memory.
vector with 100000000 positions should use 762 MB of memory
it used 762 MB of memory
however, this process used in total 2330 MB
My ultimate goal is to use as much memory as I can to create a very long list. To do this, I wrote this code to understand/predict how big my list could be based on the available memory. To my surprise, Python apparently needs a ton of memory just to manage my list.
Why does Python use so much memory? What is it doing with it? Any idea how I can predict Python's memory requirements, so that I can create a list that uses pretty much all the available memory without pushing the OS into swap?
The getsizeof function only includes the space used by the list itself.
But the list is effectively just an array of pointers to int objects, and you created 10000000 of those, and each one of those takes memory as well—typically 24 bytes.
The first few small integers (CPython caches -5 through 256) are pre-created and shared by the interpreter, so they're effectively free, but the rest are not. So, you want to add something like this:
int_memory = sys.getsizeof(10000)
print "%d int objects should use another %d MB of memory " % (positions - 256, (positions - 256) * int_memory / 1024 / 1024)
And then the results will make more sense.
But notice that if you aren't creating a range with 10M unique ints, but instead, say, 10M random ints from 0-10000, or 10M copies of 0, that calculation will no longer be correct. So if you want to handle those cases, you need to do something like stash the id of every object you've seen so far and skip any additional references to the same id.
The Python 2.x docs used to have a link to an old recursive getsizeof function that does that, and more… but that link went dead, so it was removed.
The 3.x docs have a link to a newer one, which may or may not work in Python 2.7. (I notice from a quick glance that it uses a __future__ statement for print, and falls back from reprlib.repr to repr, so it probably does.)
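For reference, a minimal sketch of that idea (mine, not the recipe the docs linked to; it only descends into the common built-in containers):

import sys

def total_size(obj, seen=None):
    """Rough deep sizeof: counts each distinct object once, keyed by id."""
    if seen is None:
        seen = set()
    if id(obj) in seen:
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(total_size(item, seen) for item in obj)
    elif isinstance(obj, dict):
        size += sum(total_size(k, seen) + total_size(v, seen)
                    for k, v in obj.items())
    return size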
If you're wondering why every int is 24 bytes long (in 64-bit CPython; it's different for different platforms and implementations, of course):
CPython represents every builtin type as a C struct that contains, at least, space for a refcount and a pointer to the type. Any actual value the object needs to represent is in addition to that.[1] So, the smallest non-singleton type is going to take 24 bytes per instance.
If you're wondering how you can avoid using up 24 bytes per integer, the answer is to use NumPy's ndarray—or, if for some reason you can't, the stdlib's array.array.
Either one lets you specify a "native type", like np.int32 for NumPy or 'i' for array.array, and create an array that holds 100M of those native-type values directly. That takes exactly 4 bytes per value, plus a few dozen constant bytes of header overhead, which is a lot smaller than a list's 8 bytes of pointer per value, plus a bit of over-allocation slack at the end that scales with the length, plus a 24-byte int object wrapping each value.
Using array.array, you're sacrificing speed for space,[2] because every time you want to access one of those values, Python has to pull it out and "box" it as an int object.
Using NumPy, you're gaining both speed and space, because NumPy will let you perform vectorized operations over the whole array in a tightly-optimized C loop.
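A quick way to see the difference for yourself (this snippet is mine; the numbers assume 64-bit CPython 2, where each small int object is 24 bytes):

import sys
import array
import numpy as np

n = 10000000

# A list of n distinct ints: ~8 bytes of pointer per entry in the list
# itself, plus ~24 bytes for each int object it points to.
lst = list(range(n))
list_mb = (sys.getsizeof(lst) + sum(sys.getsizeof(x) for x in lst)) / 1024.0 / 1024.0

# array.array and numpy store the raw 4-byte values directly.
arr = array.array('i', range(n))
nparr = np.arange(n, dtype=np.int32)

print 'list + int objects: ~%d MB' % list_mb
print 'array.array:        ~%d MB' % (sys.getsizeof(arr) / 1024 / 1024)
print 'numpy ndarray:      ~%d MB' % (nparr.nbytes / 1024 / 1024)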
[1] What about non-builtin types that you create in Python with class? They have a pointer to a dict (which you can see from Python-land as __dict__) that holds all the attributes you add. So they're 24 bytes according to getsizeof, but of course you also have to add the size of that dict.
[2] Unless you aren't. Preventing your system from going into swap hell is likely to speed things up a lot more than the boxing and unboxing slows things down. And, even if you aren't avoiding that massive cliff, you may still be avoiding smaller cliffs involving VM paging or cache locality.
I'm currently processing some ocean model outputs. At each time step, it has 42*1800*3600 grid points.
I found that the bottleneck in my program is the slicing, and calling xarray's built-in methods to extract the values. What's more interesting, the same syntax sometimes requires a vastly different amount of time.
ds = xarray.open_dataset(filename, decode_times=False)
vvel0=ds.VVEL.sel(lat=slice(-60,-20),lon=slice(0,40))/100 #in CCSM output, unit is cm/s convert to m/s
uvel0=ds.UVEL.sel(lat=slice(-60,-20),lon=slice(0,40))/100 ## why the speed is that different? now it's regional!!
temp0=ds.TEMP.sel(lat=slice(-60,-20),lon=slice(0,40)) #de
Take this for example: reading VVEL and UVEL took ~4 s each, while reading TEMP only needed ~6 ms. Without slicing, VVEL and UVEL took ~1 s, and TEMP needed ~120 nanoseconds.
I always thought that when I only request part of the full array, I need less memory and therefore less time. It turns out that xarray loads in the full array, and any extra slicing takes more time. But could somebody please explain why reading different variables from the same netCDF file takes such different amounts of time?
The program is designed to extract a stepwise section and calculate the cross-sectional heat transport, so I need to pick out either UVEL or VVEL and multiply it by TEMP along the section. So it may seem that loading TEMP that fast is good, isn't it?
Unfortunately, that's not the case. When I loop through about ~250 grid points along the prescribed section...
# Calculate VT flux orthogonal to the chosen grid cells, which is the heat transport across the GOODHOPE line
vtflux=[]
utflux=[]
vap = vtflux.append
uap = utflux.append
#for i in range(idx_north,idx_south+1):
for i in range(10):
    yidx=gh_yidx[i]
    xidx=gh_xidx[i]
    lon_next=ds_lon[i+1].values
    lon_current=ds_lon[i].values
    lat_next=ds_lat[i+1].values
    lat_current=ds_lat[i].values
    tt=np.squeeze(temp[:,yidx,xidx].values)   #<< calling .values is slow
    if (lon_next<lon_current) and (lat_next==lat_current):   # The condition is incorrect
        dxlon=Re*np.cos(lat_current*np.pi/180.)*0.1*np.pi/180.
        vv=np.squeeze(vvel[:,yidx,xidx].values)
        vt=vv*tt
        vtdxdz=np.dot(vt[~np.isnan(vt)],layerdp[0:len(vt[~np.isnan(vt)])])*dxlon
        vap(vtdxdz)
        #del vtdxdz
    elif (lon_next==lon_current) and (lat_next<lat_current):
        #ut=np.array(uvel[:,gh_yidx[i],gh_xidx[i]].squeeze().values*temp[:,gh_yidx[i],gh_xidx[i]].squeeze().values) # slow
        uu=np.squeeze(uvel[:,yidx,xidx]).values   # slow
        ut=uu*tt
        utdxdz=np.dot(ut[~np.isnan(ut)],layerdp[0:len(ut[~np.isnan(ut)])])*dxlat
        uap(utdxdz)   #m/s*degC*m*m ## looks fine, something wrong with the sign
        #del utdxdz
total_trans=(np.nansum(vtflux)-np.nansum(utflux))*3996*1026/1e15
Especially this line:
tt=np.squeeze(temp[:,yidx,xidx].values)
It takes ~3.65 s, but now it has to be repeated ~250 times. If I remove .values, this time drops to ~4 ms. But I need to multiply tt with vv to get vt, so I have to extract the values. What's weird is that the similar expression, vv=np.squeeze(vvel[:,yidx,xidx].values), requires much less time, only about ~1.3 ms.
To summarize my questions:
Why does loading different variables from the same netCDF file take different amounts of time?
Is there a more efficient way to pick out a single column in a multidimensional array? (not necessarily an xarray structure; a plain numpy.ndarray would also work)
Why does extracting values from xarray structures take different amounts of time for the exact same syntax?
Thank you!
When you index a variable loaded from a netCDF file, xarray doesn't load it into memory immediately. Instead, we create a lazy array that supports any number of further deferred indexing operations. This is true even if you aren't using dask.array (triggered by setting chunks= in open_dataset or using open_mfdataset).
This explains the surprising performance you observe. Calculating temp0 is fast, because it doesn't load any data from disk. vvel0 is slow, because dividing by 100 requires loading the data into memory as a numpy array.
Later, it's slower to index temp0 because each operation loads data from disk, instead of indexing a numpy array already in memory.
The work-around is to explicitly load the portion of your dataset that you need into memory first, e.g., by writing temp0.load(). The netCDF section of the xarray docs also gives this tip.
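For example, a minimal sketch using the variable names from the question (DataArray.load() pulls the selected data into memory and returns the same object):

import xarray

ds = xarray.open_dataset(filename, decode_times=False)

# Select the region lazily, pull it into memory once, then do the arithmetic
# on in-memory numpy data.
temp0 = ds.TEMP.sel(lat=slice(-60, -20), lon=slice(0, 40)).load()
vvel0 = ds.VVEL.sel(lat=slice(-60, -20), lon=slice(0, 40)).load() / 100  # cm/s -> m/s
uvel0 = ds.UVEL.sel(lat=slice(-60, -20), lon=slice(0, 40)).load() / 100

# Later indexing such as temp0[:, yidx, xidx].values now works on data
# that is already in memory instead of re-reading from disk each time.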
I have a list that I .append() to in a for-loop; in the end the length of the list is around 180,000. Each item in the list is a numpy array of 7680 float32 values.
Then I convert the list to a numpy array, i.e. I expect an array of shape (180000, 7680):
d = numpy.asarray( dlist, dtype = 'float32' )
That caused the script to crash with the message Killed.
Is memory the problem? Assuming float32 takes 4 bytes, 180000x7680x4bytes = 5.5 GB.
I am using 64 bit Ubuntu, 12 GB RAM.
Yes, memory is the problem.
Your estimate also needs to take into account the memory already allocated for the list representation of the 180000 x 7680 float32 values, so, without other details on dynamic memory releases / garbage collection, the numpy.asarray() call needs a bit more than just another block of 180000 x 7680 x sizeof(numpy.float32) bytes.
If you test with, say, a third of the list length, you can inspect the effective overhead of the numpy.array data representation and get concrete numbers for a memory-feasible design.
Memory profiling may help to point out the bottleneck and understand the code's requirements; that can sometimes save half of the allocation space needed for data, compared with the original data flow and operations:
(Figure: memory-allocation envelopes for numpy-based vs. BLAS-direct calling methods, courtesy of scikit-learn testing.)
You should take into account that you need roughly twice the size of the data in memory during the conversion.
Also, other software may take some of your RAM, and if you have no additional paging space defined, using 11 GB of your 12 GB of memory will probably get your system into trouble.
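One way to avoid holding both the list and the full array at once, sketched below under the assumption that the final number of rows is known in advance (compute_row here is a hypothetical stand-in for whatever produced each appended array):

import numpy as np

n_rows, n_cols = 180000, 7680        # assumed known in advance

def compute_row(k):
    # Placeholder for whatever produced each appended 7680-float32 array.
    return np.zeros(n_cols, dtype=np.float32)

# Allocate the final ~5.5 GB array once, instead of holding both a list of
# 180,000 small arrays and a second full-size copy during numpy.asarray().
d = np.empty((n_rows, n_cols), dtype=np.float32)
for k in range(n_rows):
    d[k, :] = compute_row(k)         # copies the row's values into place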
Right, I'm iterating through a large binary file
I need to minimise the time of this loop:
def NB2(self, ID_LEN):
    r1=np.fromfile(ReadFile.fid,dTypes.NB_HDR,1)
    num_receivers=r1[0][0]
    num_channels=r1[0][1]
    num_samples=r1[0][5]
    blockReturn = np.zeros((num_samples,num_receivers,num_channels))
    for rec in range(0,num_receivers):
        for chl in range(0,num_channels):
            for smpl in range(0,num_samples):
                r2_iq=np.fromfile(ReadFile.fid,np.int16,2)
                blockReturn[smpl,rec,chl] = np.sqrt(math.fabs(r2_iq[0])*math.fabs(r2_iq[0]) + math.fabs(r2_iq[1])*math.fabs(r2_iq[1]))
    return blockReturn
So, what's going on is as follows:
r1 is the header of the file, dTypes.NB_HDR is a type I made:
NB_HDR= np.dtype([('f3',np.uint32),('f4',np.uint32),('f5',np.uint32),('f6',np.int32),('f7',np.int32),('f8',np.uint32)])
That gets all the information about the forthcoming data block, and nicely puts us in the right position within the file (the start of the data block!).
In this data block there is:
4096 samples per channel,
4 channels per receiver,
9 receivers.
So num_receivers, num_channels, num_samples will always be the same (at the moment anyway), but as you can see this is a fairly large amount of data. Each 'sample' is a pair of int16 values that I want to find the magnitude of (hence Pythagoras).
This NB2 code is executed for each 'Block' in the file; for a 12GB file (which is how big they are) there are about 20,900 Blocks, and I've got to iterate through 1000 of these files (so, 12TB overall). Any speed advantage, even if it's milliseconds, would be massively appreciated.
EDIT: Actually it might be of help to know how I'm moving around inside the file. I have a function as follows:
def navigateTo(self, blockNum, indexNum):
    ReadFile.fid.seek(ReadFile.fileIndex[blockNum][indexNum],0)
    ReadFile.currentBlock = blockNum
    ReadFile.index = indexNum
Before I run all this code I scan the file and make a list of index locations at ReadFile.fileIndex that I browse using this function and then 'seek' to the absolute location - is this efficient?
Cheers
Because you know the length of a block after you read the header, read the whole block at once. Then reshape the array (very fast, only affects metadata) and use the np.hypot ufunc:
blockData = np.fromfile(ReadFile.fid, np.int16, num_receivers*num_channels*num_samples*2)
blockData = blockData.reshape((num_receivers, num_channels, num_samples, 2))
return np.hypot(blockData[:,:,:,0], blockData[:,:,:,1])
On my machine it runs in 11ms per block.
import numpy as np

def NB2(self, ID_LEN):
    r1=np.fromfile(ReadFile.fid,dTypes.NB_HDR,1)
    num_receivers=r1[0][0]
    num_channels=r1[0][1]
    num_samples=r1[0][5]
    # first, match your array bounds to the way you are walking the file
    blockReturn = np.zeros((num_receivers,num_channels,num_samples))
    for rec in range(0,num_receivers):
        for chl in range(0,num_channels):
            # second, read in all the samples at once if you have enough memory
            r2_iq=np.fromfile(ReadFile.fid,np.int16,2*num_samples)
            r2_iq.shape = (-1,2)  # tell numpy that it is an array of (I, Q) pairs
            # widen from int16 before squaring so the squares cannot overflow
            r2_iq = r2_iq.astype(np.int64)
            # build the squared-magnitude vector by squaring the data elementwise
            # and adding the pairs together. Result is of length num_samples
            r2_iq = r2_iq * r2_iq
            r2_iq = r2_iq[:,0] + r2_iq[:,1]
            # get the magnitude by performing the square root "into" blockReturn
            np.sqrt(r2_iq, out=blockReturn[rec,chl,:])
    return blockReturn
This should help your performance. There are two main ideas at work in numpy. First, your result array's dimensions should match the order in which your loops walk the file, for memory locality.
Second, NumPy is FAST. I've beaten hand-coded C with numpy, simply because it uses LAPACK and vector acceleration. However, to get that power, you have to let it manipulate more data at a time. That is why your sample loop has been collapsed into a single large read of all the samples for a given receiver and channel. Then numpy's vectorized operations compute the magnitudes via an elementwise square, sum, and square root.
There is a little more optimization to be had in the magnitude calculation, but numpy recycles buffers for you, making it less important than you might think. I hope this helps!
I'd try to use as few loops and as many constants as possible. Everything that can be done in a linear fashion should be done so. If values don't change, use constants to reduce lookups and such, because that eats up CPU cycles. This is from a theoretical point of view ;-)
If possible, use highly optimised libraries. I don't exactly know what you are trying to achieve, but I'd rather use an existing FFT lib than write one myself :>
One more thing: http://en.wikipedia.org/wiki/Big_O_notation (can be an eye-opener)
Most importantly, you shouldn't do file access at the lowest level of a triple nested loop, whether you do this in C or Python. You've got to read in large chunks of data at a time.
So to speed this up, read in large chunks of data at a time, and process that data using numpy indexing (that is, vectorize your code). This is particularly easy in your case since all your data is int16. Just read in big chunks, reshape the data into an array that reflects the (receiver, channel, sample) structure, and then use the appropriate indexing to multiply and add things for Pythagoras, and a 'sum' to add up the terms in the resulting array.
This is more of an observation than a solution, but porting that function to C++ and loading it in with the Python API would get you a lot of speed gain to begin with before loop optimization.