I'm currently processing some ocean model output. At each time step it has 42*1800*3600 grid points.
I found that the bottleneck in my program is the slicing, and calling the xarray built-in method to extract the values. What's more interesting, the same syntax sometimes requires a vastly different amount of time.
import xarray

ds = xarray.open_dataset(filename, decode_times=False)
vvel0 = ds.VVEL.sel(lat=slice(-60, -20), lon=slice(0, 40)) / 100  # in CCSM output, unit is cm/s; convert to m/s
uvel0 = ds.UVEL.sel(lat=slice(-60, -20), lon=slice(0, 40)) / 100  ## why is the speed that different? now it's regional!!
temp0 = ds.TEMP.sel(lat=slice(-60, -20), lon=slice(0, 40))
Take this for example: reading VVEL and UVEL took ~4 s each, while reading TEMP needed only ~6 ms. Without slicing, VVEL and UVEL took ~1 s, and TEMP needed 120 nanoseconds.
I always thought that when I only ask for part of the full array, I need less memory and therefore less time. It turns out that xarray loads the full array, and any extra slicing takes more time. But could somebody please explain why reading different variables from the same netCDF file takes such different amounts of time?
The program is designed to extract a stepwise section and calculate the cross-sectional heat transport, so I need to pick out either UVEL or VVEL and multiply it by TEMP along the section. So it may seem that loading TEMP that fast is a good thing, isn't it?
Unfortunately, that's not the case. When I loop through about ~250 grid points along the prescribed section...
# Calculate VT flux orthogonal to the chosen grid cells, which is the heat transport across GOODHOPE line
vtflux = []
utflux = []
vap = vtflux.append
uap = utflux.append
#for i in range(idx_north, idx_south+1):
for i in range(10):
    yidx = gh_yidx[i]
    xidx = gh_xidx[i]
    lon_next = ds_lon[i+1].values
    lon_current = ds_lon[i].values
    lat_next = ds_lat[i+1].values
    lat_current = ds_lat[i].values
    tt = np.squeeze(temp[:, yidx, xidx].values)  # << calling values is slow
    if (lon_next < lon_current) and (lat_next == lat_current):  # The condition is incorrect
        dxlon = Re * np.cos(lat_current*np.pi/180.) * 0.1 * np.pi/180.
        vv = np.squeeze(vvel[:, yidx, xidx].values)
        vt = vv * tt
        vtdxdz = np.dot(vt[~np.isnan(vt)], layerdp[0:len(vt[~np.isnan(vt)])]) * dxlon
        vap(vtdxdz)
        #del vtdxdz
    elif (lon_next == lon_current) and (lat_next < lat_current):
        #ut = np.array(uvel[:, gh_yidx[i], gh_xidx[i]].squeeze().values * temp[:, gh_yidx[i], gh_xidx[i]].squeeze().values)  # slow
        uu = np.squeeze(uvel[:, yidx, xidx]).values  # slow
        ut = uu * tt
        utdxdz = np.dot(ut[~np.isnan(ut)], layerdp[0:len(ut[~np.isnan(ut)])]) * dxlat
        uap(utdxdz)  # m/s*degC*m*m  ## looks fine, something wrong with the sign
        #del utdxdz
total_trans = (np.nansum(vtflux) - np.nansum(utflux)) * 3996 * 1026 / 1e15
Especially this line:
tt=np.squeeze(temp[:,yidx,xidx].values)
It takes ~3.65 s, and it has to be repeated ~250 times. If I remove .values, the time drops to ~4 ms, but I need to multiply tt by vv to get vt, so I have to extract the values. What's weird is that the similar expression vv=np.squeeze(vvel[:,yidx,xidx].values) requires much less time, only about ~1.3 ms.
To summarize my questions:
Why does loading different variables from the same netCDF file take different amounts of time?
Is there a more efficient way to pick out a single column of a multidimensional array? (not necessarily an xarray structure; a numpy.ndarray works too)
Why does extracting values from xarray structures take a different amount of time for the exact same syntax?
Thank you!
When you index a variable loaded from a netCDF file, xarray doesn't load it into memory immediately. Instead, it creates a lazy array that supports any number of further deferred indexing operations. This is true even if you aren't using dask.array (which is triggered by setting chunks= in open_dataset or using open_mfdataset).
This explains the surprising performance you observe. Calculating temp0 is fast, because it doesn't load any data from disk. vvel0 is slow, because dividing by 100 requires loading the data into memory as a numpy array.
Later, it's slower to index temp0 because each operation loads data from disk, instead of indexing a numpy array already in memory.
The work-around is to explicitly load the portion of your dataset that you need into memory first, e.g., by writing temp0.load(). The netCDF section of the xarray docs also gives this tip.
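For example, a minimal sketch of that workaround, reusing the variable names from the question (filename, yidx and xidx are assumed to be defined as in the code above):

import numpy as np
import xarray

ds = xarray.open_dataset(filename, decode_times=False)

# Load the regional subsets into memory once, up front.
temp0 = ds.TEMP.sel(lat=slice(-60, -20), lon=slice(0, 40)).load()
vvel0 = (ds.VVEL.sel(lat=slice(-60, -20), lon=slice(0, 40)) / 100).load()
uvel0 = (ds.UVEL.sel(lat=slice(-60, -20), lon=slice(0, 40)) / 100).load()

# Per-column indexing inside the loop now hits numpy arrays already in memory,
# not the netCDF file on disk.
tt = np.squeeze(temp0[:, yidx, xidx].values)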
I have over 65 million numeric values stored in a text file. I need to compute the maximum, minimum, average, standard deviation, and the 25th, 50th, and 75th percentiles.
Normally I would use the code below, but I need a more efficient way to compute these metrics because I cannot store every value p in a list. How can I calculate these values more effectively in Python?
import numpy as np

np.average(mylist)
np.min(mylist)
np.max(mylist)
np.std(mylist)
np.percentile(mylist, 25)
np.percentile(mylist, 50)
np.percentile(mylist, 75)
maxx = float('-inf')
minx = float('+inf')
sumz = 0
for index, p in enumerate(open("foo.txt", "r")):
    maxx = max(maxx, float(p))
    minx = min(minx, float(p))
    sumz += float(p)
index += 1
my_max = maxx
my_min = minx
my_avg = sumz / index
Use a binary file. Then you can use numpy.memmap to map it into memory and perform all sorts of algorithms, even if the dataset is larger than RAM.
You can even use numpy.memmap to create a memory-mapped array and read your data in from the text file; you can work on it, and when you are done you also have the data in binary format.
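A rough sketch of that idea (the file names are illustrative, and for a text file that doesn't fit in RAM the one-time conversion would itself have to be done in chunks):

import numpy as np

# One-time conversion: text file with one number per line -> raw binary on disk.
np.loadtxt("foo.txt").astype(np.float64).tofile("foo.bin")

# Memory-map the binary file; the OS pages the data in as needed.
mm = np.memmap("foo.bin", dtype=np.float64, mode="r")
print(mm.min(), mm.max(), mm.mean(), mm.std())
print(np.percentile(mm, [25, 50, 75]))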
I think you are on the right track by iterating over the file and keeping track of max and min values. To calculate the std, you should keep a sum of squares inside the loop: sum_of_squares += z**2. After the loop you can then calculate std = sqrt(sum_of_squares / n - (sumz / n)**2); see the formula here (but note that this formula can suffer from numerical problems). For performance, you might want to iterate over the file in decently sized chunks of data.
To calculate the median and percentiles in a 'continuous' way, you could build up a histogram inside your loop. After the loop, you can get approximate percentiles and the median by converting the histogram to the CDF; the error will depend on the number of bins.
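Here is a sketch of that single-pass approach (the histogram range lo/hi and the number of bins are assumptions you would set from rough knowledge of your data):

import math
import numpy as np

lo, hi, nbins = 0.0, 1000.0, 10000   # assumed value range and bin count
hist = np.zeros(nbins, dtype=np.int64)
n, sumz, sum_of_squares = 0, 0.0, 0.0
maxx, minx = float('-inf'), float('inf')

with open("foo.txt") as f:
    for line in f:
        z = float(line)
        n += 1
        sumz += z
        sum_of_squares += z ** 2
        maxx = max(maxx, z)
        minx = min(minx, z)
        b = min(nbins - 1, max(0, int((z - lo) / (hi - lo) * nbins)))
        hist[b] += 1

mean = sumz / n
std = math.sqrt(sum_of_squares / n - mean ** 2)   # naive formula; can be numerically unstable

# Approximate percentiles and median from the histogram via the CDF.
cdf = np.cumsum(hist) / float(n)
edges = np.linspace(lo, hi, nbins + 1)
p25, p50, p75 = [edges[np.searchsorted(cdf, q)] for q in (0.25, 0.50, 0.75)]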
As Antti Haapala says, the easiest and most efficient way to do this is to stick with numpy and just use a memmapped binary file instead of a text file. Yes, converting from one format to the other will take a bit of time, but it'll almost certainly save more time than it costs (because you can use numpy vectorized operations instead of loops), and it will also make your code a lot simpler.
If you can't do that, Python 3.4 will come with a statistics module. A backport to 2.6+ will hopefully be available at some point after the PEP is finalized; at present I believe you can only get stats, the earlier module it's based on, which requires 3.1+. Unfortunately, while stats does do single-pass algorithms on iterators, it doesn't have any convenient way to run multiple algorithms in parallel on the same iterator, so you have to be clever with itertools.tee and zip to force it to interleave the work instead of pulling the whole thing into memory.
And of course there are plenty of other modules out there if you search PyPI for "stats" and/or "statistics" and/or "statistical".
Either way, using a pre-built module will mean someone's already debugged all the problems you're going to run into, and they may have also optimized the code (maybe even ported it to C) to boot.
To get the percentiles, sort the text file using a command-line program. Use the line count (index in your program) to find the line numbers of the percentiles (index // 4, etc.), then retrieve those lines from the file.
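A sketch of that recipe (it assumes a GNU-style sort is available; the file names are illustrative):

import subprocess

# External numeric sort; GNU sort spills to temporary files, so RAM is not the limit.
subprocess.check_call(["sort", "-g", "foo.txt", "-o", "foo_sorted.txt"])

n = sum(1 for _ in open("foo_sorted.txt"))              # total line count
wanted = {n // 4: "p25", n // 2: "p50", (3 * n) // 4: "p75"}

percentiles = {}
with open("foo_sorted.txt") as f:
    for lineno, line in enumerate(f):
        if lineno in wanted:
            percentiles[wanted[lineno]] = float(line)
print(percentiles)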
Most of these operations can be expressed easily in terms of simple arithmetic. In that case, it can actually (surprisingly) be quite efficient to compute simple statistics directly from the Linux command line using awk and sed, e.g. as in this post: http://www.unixcl.com/2008/09/sum-of-and-group-by-using-awk.html
If you need to generalize to more advanced operations, like weighted percentiles, then I'd recommend using Python Pandas (notably the HDFStore capabilities for later retrieval). I've used Pandas with a DataFrame of over 25 million records (10 columns by 25 million distinct rows). If you're more memory constrained, you could read the data in chunks, calculate partial contributions from each chunk, store the intermediate results, and then finish the calculation by just loading the intermediate results, in a serialized map-reduce kind of fashion.
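For instance, a minimal chunked pass with pandas might look like this (the file name and chunk size are illustrative; only max/min/mean/std are shown):

import numpy as np
import pandas as pd

n, total, total_sq = 0, 0.0, 0.0
vmin, vmax = np.inf, -np.inf

# Read the text file in chunks and accumulate partial contributions per chunk.
for chunk in pd.read_csv("foo.txt", header=None, chunksize=5 * 10**6):
    col = chunk[0].values.astype(np.float64)
    n += col.size
    total += col.sum()
    total_sq += np.square(col).sum()
    vmin = min(vmin, col.min())
    vmax = max(vmax, col.max())

mean = total / n
std = np.sqrt(total_sq / n - mean ** 2)   # same numerical-stability caveat as above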
I am looking for a solution to store about 10 million floating-point (double precision) numbers of a sparse matrix. The matrix is actually a two-dimensional triangular matrix of 1 million by 1 million elements. The element (i,j) is the actual score measure score(i,j) between element i and element j. The storage method must allow very fast access to this information, maybe by memory-mapping the file containing the matrix. I certainly don't want to load the whole file into memory.
class Score(IsDescription):
    grid_i = UInt32Col()
    grid_j = UInt32Col()
    score = FloatCol()
I've tried PyTables using the Score class above, but I cannot access element (i,j) directly without scanning all the rows. Any suggestions?
10 million double-precision floats take up 80 MB of memory. If you store them in a 1 million x 1 million sparse matrix in CSR or CSC format, you will need an additional 11 million int32s, for a total of around 125 MB. That's probably less than 7% of the physical memory in your system. And in my experience, on a system with 4 GB running a 32-bit version of Python, you rarely start having trouble allocating arrays until you try to get hold of ten times that.
Run the following code on your computer:
import itertools
import numpy as np

for j in itertools.count(100):
    try:
        a = np.empty((j * 10**6,), dtype='uint8')
        print 'Allocated {0} MB of memory!'.format(j)
        del a
    except MemoryError:
        print 'Failed to allocate {0} MB of memory!'.format(j)
        break
And unless it fails to get you at least 4 times the amount calculated above, don't even hesitate to stick the whole thing in memory using a scipy.sparse format.
I have no experience with PyTables, nor much with numpy's memmap arrays. But it seems to me that either of those will involve you coding the logic to handle the sparsity, something I would try to avoid unless it's impossible.
You should use scipy.sparse. Here's some more info about the formats and usage.
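A small sketch of how that might look (the dimensions follow the question; the inserted scores are made up):

import numpy as np
from scipy import sparse

n = 10**6   # 1 million x 1 million, as in the question

# Build incrementally with a DOK matrix (cheap random inserts)...
m = sparse.dok_matrix((n, n), dtype=np.float64)
m[12, 345] = 0.75    # made-up score(12, 345)
m[999, 7] = 1.25     # made-up score(999, 7)

# ...then convert to CSR for fast arithmetic, row slicing and direct (i, j) lookup.
csr = m.tocsr()
print(csr[12, 345])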
Right, I'm iterating through a large binary file
I need to minimise the time of this loop:
def NB2(self, ID_LEN):
    r1 = np.fromfile(ReadFile.fid, dTypes.NB_HDR, 1)
    num_receivers = r1[0][0]
    num_channels = r1[0][1]
    num_samples = r1[0][5]
    blockReturn = np.zeros((num_samples, num_receivers, num_channels))
    for rec in range(0, num_receivers):
        for chl in range(0, num_channels):
            for smpl in range(0, num_samples):
                r2_iq = np.fromfile(ReadFile.fid, np.int16, 2)
                blockReturn[smpl, rec, chl] = np.sqrt(math.fabs(r2_iq[0])*math.fabs(r2_iq[0]) + math.fabs(r2_iq[1])*math.fabs(r2_iq[1]))
    return blockReturn
So, what's going on is as follows:
r1 is the header of the file; dTypes.NB_HDR is a dtype I made:
NB_HDR= np.dtype([('f3',np.uint32),('f4',np.uint32),('f5',np.uint32),('f6',np.int32),('f7',np.int32),('f8',np.uint32)])
That gets all the information about the forthcoming data block, and nicely puts us in the right position within the file (the start of the data block!).
In this data block there is:
4096 samples per channel,
4 channels per receiver,
9 receivers.
So num_receivers, num_channels, num_samples will always be the same (at the moment anyway), but as you can see this is a fairly large amount of data. Each 'sample' is a pair of int16 values that I want to find the magnitude of (hence Pythagoras).
This NB2 code is executed for each 'Block' in the file. For a 12 GB file (which is how big they are) there are about 20,900 blocks, and I've got to iterate through 1000 of these files (so, 12 TB overall). Any speed advantage, even if it's milliseconds, would be massively appreciated.
EDIT: Actually it might be of help to know how I'm moving around inside the file. I have a function as follows:
def navigateTo(self, blockNum, indexNum):
    ReadFile.fid.seek(ReadFile.fileIndex[blockNum][indexNum], 0)
    ReadFile.currentBlock = blockNum
    ReadFile.index = indexNum
Before I run all this code, I scan the file and make a list of index locations in ReadFile.fileIndex, which I browse using this function before 'seeking' to the absolute location - is this efficient?
Cheers
Because you know the length of a block after you read the header, read the whole block at once. Then reshape the array (very fast, only affects metadata) and use the np.hypot ufunc:
blockData = np.fromfile(ReadFile.fid, np.int16, num_receivers*num_channels*num_samples*2)
blockData = blockData.reshape((num_receivers, num_channels, num_samples, 2))
return np.hypot(blockData[:, :, :, 0], blockData[:, :, :, 1])
On my machine it runs in 11ms per block.
import numpy as np

def NB2(self, ID_LEN):
    r1 = np.fromfile(ReadFile.fid, dTypes.NB_HDR, 1)
    num_receivers = r1[0][0]
    num_channels = r1[0][1]
    num_samples = r1[0][5]
    # first, match your array bounds to the way you are walking the file
    blockReturn = np.zeros((num_receivers, num_channels, num_samples))
    for rec in range(0, num_receivers):
        for chl in range(0, num_channels):
            # second, read in all the samples at once if you have enough memory
            r2_iq = np.fromfile(ReadFile.fid, np.int16, 2*num_samples)
            r2_iq.shape = (-1, 2)  # tell numpy that it is an array of (I, Q) pairs
            r2_iq = r2_iq.astype(np.float64)  # widen before squaring to avoid int16 overflow
            # square the data elementwise and add the pair elements together;
            # the result has length num_samples
            r2_iq = r2_iq * r2_iq
            r2_iq = r2_iq[:, 0] + r2_iq[:, 1]
            # get the magnitude by performing the square root "into" blockReturn
            np.sqrt(r2_iq, out=blockReturn[rec, chl, :])
    return blockReturn
This should help your performance. There are two main ideas at work in numpy. First, your result array's dimensions should match the way your loop dimensions are crafted, for memory locality.
Second, numpy is FAST. I've beaten hand-coded C with numpy, simply because it uses LAPACK and vector acceleration. However, to get that power you have to let it manipulate more data at a time. That is why your sample loop has been collapsed to read in the full set of samples for the receiver and channel in one large read. Then use the supreme vector powers of numpy to calculate your magnitude by dot product.
There is a little more optimization to be had in the magnitude calculation, but numpy recycles buffers for you, making it less important than you might think. I hope this helps!
I'd try to use as few loops and as many constants as possible.
Everything that can be done in a linear fashion should be done so.
If values don't change, use constants to reduce lookups and such, because that eats up CPU cycles.
This is from a theoretical point of view ;-)
If possible, use highly optimised libraries. I don't exactly know what you are trying to achieve, but I'd rather use an existing FFT lib than write it myself :>
One more thing: http://en.wikipedia.org/wiki/Big_O_notation (can be an eye-opener)
Most importantly, you shouldn't do file access at the lowest level of a triple nested loop, whether you do this in C or Python. You've got to read in large chunks of data at a time.
So to speed this up, read in large chunks of data at a time, and process that data using numpy indexing (that is, vectorize your code). This is particularly easy in your case since all your sample data is int16. Just read in big chunks of data, reshape the data into an array that reflects the (receiver, channel, sample) structure, and then use the appropriate indexing to multiply and add things for Pythagoras, and the 'sum' command to add up the terms in the resulting array.
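In sketch form, using the names from the question (and widening the int16 samples before squaring):

import numpy as np

# Read one whole block of I/Q pairs in a single call, then vectorize the magnitude.
raw = np.fromfile(ReadFile.fid, np.int16, num_receivers * num_channels * num_samples * 2)
iq = raw.reshape(num_receivers, num_channels, num_samples, 2).astype(np.float64)
magnitude = np.sqrt((iq ** 2).sum(axis=-1))   # Pythagoras over each (I, Q) pair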
This is more of an observation than a solution, but porting that function to C++ and loading it in with the Python API would get you a lot of speed gain to begin with before loop optimization.
What is an efficient way to initialize and access elements of a large array in Python?
I want to create an array in Python with 100 million entries, unsigned 4-byte integers, initialized to zero. I want fast array access, preferably with contiguous memory.
Strangely, NumPy arrays seem to be performing very slowly. Are there alternatives I can try?
There is the array.array module, but I don't see a method to efficiently allocate a block of 100 million entries.
Responses to comments:
I cannot use a sparse array. It will be too slow for this algorithm because the array becomes dense very quickly.
I know Python is interpreted, but surely there is a way to do fast array operations?
I did some profiling, and I get about 160K array accesses (looking up or updating an element by index) per second with NumPy. This seems very slow.
I have done some profiling, and the results are completely counterintuitive.
For simple array access operations, numpy and array.array are 10x slower than native Python lists.
Note that for array access, I am doing operations of the form:
a[i] += 1
Profiles:
[0] * 20000000
  Access: 2.3M / sec
  Initialization: 0.8 s

numpy.zeros(shape=(20000000,), dtype=numpy.int32)
  Access: 160K / sec
  Initialization: 0.2 s

array.array('L', [0] * 20000000)
  Access: 175K / sec
  Initialization: 2.0 s

array.array('L', (0 for i in range(20000000)))
  Access: 175K / sec (presumably, based on the profile for the other array.array)
  Initialization: 6.7 s
Just a reminder of how Python's integers work: if you allocate a list by saying
a = [0] * K
you need the memory for the list (sizeof(PyListObject) + K * sizeof(PyObject*)) and the memory for the single integer object 0. As long as the numbers in the list stay below the magic number V that Python uses for caching, you are fine, because those are shared, i.e. any name that points to a number n < V points to the exact same object. You can find this value by using the following snippet:
>>> i = 0
>>> j = 0
>>> while i is j:
...     i += 1
...     j += 1
...
>>> i  # on my system!
257
This means that as soon as the counts go above this number, the memory you need is sizeof(PyListObject) + K * sizeof(PyObject*) + d * sizeof(PyIntObject), where d < K is the number of integers above V (== 256). On a 64 bit system, sizeof(PyIntObject) == 24 and sizeof(PyObject*) == 8, i.e. the worst case memory consumption is 3,200,000,000 bytes.
With numpy.ndarray or array.array, memory consumption is constant after initialization, but you pay for the wrapper objects that are created transparently, as Thomas Wouters said. Probably, you should think about converting the update code (which accesses and increases the positions in the array) to C code, either by using Cython or scipy.weave.
Try this:
x = [0] * 100000000
It takes just a few seconds to execute on my machine, and access is close to instant.
If you are not able to vectorize your calculations, Python/NumPy will be slow. NumPy is fast because vectorized calculations occur at a lower level than Python. The core numpy functions are all written in C or Fortran. Hence numpy's sum(a) is not a Python loop with many accesses; it's a single low-level C call.
Numpy's Performance Python demo page has a good example with different options. You can easily get a 100x increase by using a lower-level compiled language, Cython, or vectorized functions if feasible. This blog post shows a 43-fold increase using Cython for a numpy use case.
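As a rough illustration of the difference (the array size matches the question; the indices to update are made up):

import numpy as np

a = np.zeros(10**8, dtype=np.uint32)
idx = np.random.randint(0, a.size, size=10**6)   # made-up positions to increment

# Slow: a Python loop pays interpreter overhead on every single access.
# for i in idx:
#     a[i] += 1

# Fast: one call does the same work in compiled code, handling repeated indices.
np.add.at(a, idx, 1)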
It's unlikely you'll find anything faster than numpy's array. The implementation of the array itself is as efficient as it would be in, say, C (and basically the same as array.array, just with more usefulness).
If you want to speed up your code, you'll have to do it by doing just that. Even though the array is implemented efficiently, accessing it from Python code has a certain overhead; for example, indexing the array produces integer objects, which have to be created on the fly. numpy offers a number of operations implemented efficiently in C, but without seeing the actual code that isn't performing as well as you want, it's hard to make any specific suggestions.
For fast creation, use the array module.
Using the array module is ~5 times faster for creation, but about twice as slow for accessing elements compared to a normal list:
# Create array
python -m timeit -s "from array import array" "a = array('I', '\x00' * 100000000)"
10 loops, best of 3: 204 msec per loop

# Access array
python -m timeit -s "from array import array; a = array('I', '\x00' * 100000000)" "a[4975563]"
10000000 loops, best of 3: 0.0902 usec per loop

# Create list
python -m timeit "a = [0] * 100000000"
10 loops, best of 3: 949 msec per loop

# Access list
python -m timeit -s "a = [0] * 100000000" "a[4975563]"
10000000 loops, best of 3: 0.0417 usec per loop
In addition to the other excellent solutions, another way is to use a dict instead of an array (elements which exist are non-zero, otherwise they're zero). Lookup time is O(1).
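A tiny sketch of that idea (the index is made up; a plain dict with .get keeps missing positions at zero without storing them):

scores = {}                                   # only non-zero positions are stored
scores[4975563] = scores.get(4975563, 0) + 1  # update
print(scores.get(4975563, 0))                 # -> 1
print(scores.get(123, 0))                     # -> 0, never stored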
You might also check if your application is resident in RAM, rather than swapping out. It's only 381 MB, but the system may not be giving you it all for whatever reason.
However, there are also some really fast sparse-matrix implementations (SciPy and ndsparse). They are written in low-level C and might also be good.
If
access speed of array.array is acceptable for your application
compact storage is most important
you want to use standard modules (no NumPy dependency)
you are on platforms that have /dev/zero
then the following may be of interest to you. It initialises an array.array about 27 times faster than array.array('L', [0]*size):
import array

size = 100000000  # e.g. 100 million entries

myarray = array.array('L')
f = open('/dev/zero', 'rb')
myarray.fromfile(f, size)
f.close()
In "How to initialise an integer array.array object with zeros in Python" I'm looking for an even better way.
I would simply create my own data type that doesn't initialize ANY values.
If you want to read an index position that has NOT been initialized, you return zero; still, do not initialize any storage.
If you want to read an index position that HAS been initialized, simply return the value.
If you want to write to an index position that has NOT been initialized, initialize it and store the input.
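One way to sketch this idea is a hypothetical LazyArray class backed by a dict (not from the original answer, just an illustration):

class LazyArray(object):
    """Array-like object whose unwritten positions read as zero and cost no storage."""

    def __init__(self, default=0):
        self._data = {}            # only explicitly written positions are stored
        self._default = default

    def __getitem__(self, i):
        # reading an uninitialized position: return the default, allocate nothing
        return self._data.get(i, self._default)

    def __setitem__(self, i, value):
        # writing to an uninitialized position initializes it
        self._data[i] = value

a = LazyArray()
a[4975563] = a[4975563] + 1
print(a[4975563])   # -> 1
print(a[0])         # -> 0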
NumPy is the appropriate tool for a large, fixed-size, homogeneous array. Accessing individual elements of anything in Python isn't going to be all that fast, though whole-array operations can often be conducted at speeds similar to C or Fortran. If you need to do operations on millions and millions of elements individually quickly, there is only so much you can get out of Python.
What sort of algorithm are you implementing? How do you know that using sparse arrays is too slow if you haven't tried it? What do you mean by "efficient"? Do you want quick initialization? Is that the bottleneck of your code?