Is it possible to translate this Python code to Cython? - python

I'm actually looking to speed up #2 of this code by as much as possible, so I thought that it might be useful to try Cython. However, I'm not sure how to implement sparse matrix in Cython. Can somebody show how to / if it's possible to wrap it in Cython or perhaps Julia to make it faster?
#1) This part computes u_dict dictionary filled with unique strings and then enumerates them.
import scipy.sparse as sp
import numpy as np
from scipy.sparse import csr_matrix
full_dict = set(train1.values.ravel().tolist() + test1.values.ravel().tolist() + train2.values.ravel().tolist() + test2.values.ravel().tolist())
print len(full_dict)
u_dict= dict()
for i, q in enumerate(full_dict):
u_dict[q] = i
shape = (len(full_dict), len(full_dict))
H = sp.lil_matrix(shape, dtype=np.int8)
def load_sparse_csr(filename):
loader = np.load(filename)
return csr_matrix((loader['data'], loader['indices'], loader['indptr']),
shape=loader['shape'])
#2) I need to speed up this part
# train_full is pandas dataframe with two columns w1 and w2 filled with strings
H = load_sparse_csr('matrix.npz')
correlation_train = []
for idx, row in train_full.iterrows():
if idx%1000 == 0: print idx
id_1 = u_dict[row['w1']]
id_2 = u_dict[row['w2']]
a_vec = H[id_1].toarray() # these vectors are of length of < 3 mil.
b_vec = H[id_2].toarray()
correlation_train.append(np.corrcoef(a_vec, b_vec)[0][1])

While I contributed to How to properly pass a scipy.sparse CSR matrix to a cython function? quite some time ago, I doubt if cython is the way to go. Especially if you don't already have experience with numpy and cython. cython gives the biggest speedup when you replace iterative calculations with code that it can translate to C without calling numpy or other python code. Throw pandas into the mix and you have an even bigger learning curve.
And important parts of sparse code are already written with cython.
Without touching the cython issue I see a couple of problems.
H is defined twice:
H = sp.lil_matrix(shape, dtype=np.int8)
H = load_sparse_csr('matrix.npz')
That's either an oversight, or a failure to understand how Python variables are created and assigned. The 2nd assignment replaces the first; thus the first does nothing. In addition the first just makes an empty lil matrix. Such a matrix could be filled iteratively; while not fast it is the intended use of the lil format.
The 2nd expression creates a new matrix from data saved in an npz file. That involves the numpy npz file loaded as well as the basic csr matrix creation code. And since the attributes are already in csr format, there's nothing for cython touch.
You do have an iteration here - but over a Pandas dataframe:
for idx, row in train_full.iterrows():
id_1 = u_dict[row['w1']]
a_vec = H[id_1].toarray()
Looks like you are picking a particular row of H based on a dictionary/array look up. Sparse matrix indexing is slow compared to dense matrix indexing. That is, if Ha = H.toarray() fits your memory then,
a_vec = Ha[id_1,:]
will be a lot faster.
Faster selection of rows (or columns) from a sparse matrix has been asked before. If you could work directly with the sparse data of a row I could recommend something more direct. But you want a dense array that you can pass to np.corrcoef, so we'd have to implement the toarray step as well.
How to read/traverse/slice Scipy sparse matrices (LIL, CSR, COO, DOK) faster?

Related

scipy sparse matrix -- accessing multiple elements of a path

I have a scipy sparse matrix A and a (long) list of coordinates
myrows=[i1,i2,...] mycols=[j1,j2,...]. I need a list of their values [A[i1,j2],A[i2,j2],...]. How can I do this quickly. A loop is too slow.
I've thought about cython.inline() (which I use in other places in my code) or weave, but I don't see how to use the sparse type efficiently in cython or C++. Am I missing something simple?
Currently I'm using a hack that seems inefficient and possibly wrong sometimes -- which I flag with an error message. Here is my badly written code. Note that it relies on the ordering of elements to be preserved under addition and assumes that the elements in myrows,mycols are in A.
import scipy.sparse as sps
def getmatvals(A,myrows,mycols) #A is a coo_matrix
B = sps.coo_matrix((range(1,1+A.nnz),(A.row,A.col)),shape=A.shape)
T = sps.coo_matrix(([A.nnz+1]*len(myrows),(myrows,mycols)),shape=A.shape)
G = B-T #signify myelements in G by negatives and others by 0's
H = np.minimum([0]*A.nnz,G.data) #remove extra elements
H = H[np.nonzero(H)]
H = H + A.nnz
return A.data[H]

Python/Numpy: Build 2D array without adding duplicate rows (for triangular mesh)

I'm working on some code that manipulates 3D triangular meshes. Once I have imported mesh data, I need to "unify" vertices that are at the same point in space.
I've been assuming that numpy arrays would be the fastest way of storing & manipulating the data, but I can't seem to find a fast way of building a list of vertices while avoiding adding duplicate entries.
So, to test out methods, creating a 3x30000 array with 10000 unique rows:
import numpy as np
points = np.random.random((10000,3))
raw_data = np.concatenate((points,points,points))
np.random.shuffle(raw_data)
This serves as a good approximation of mesh data, with each point appearing as a facet vertex 3 times. While unifying, I need to build a list of unique vertices; if a point already is in the list a reference to it must be stored.
The best I've been able to come up with using numpy so far has been the following:
def unify(raw_data):
# first point must be new
unified_verts = np.zeros((1,3),dtype=np.float64)
unified_verts[0] = raw_data[0]
ref_list = [0]
for i in range(1,len(raw_data)):
point = raw_data[i]
index_array = np.where(np.all(point==unified_verts,axis=1))[0]
# point not in array yet
if len(index_array) == 0:
point = np.expand_dims(point,0)
unified_verts = np.concatenate((unified_verts,point))
ref_list.append(len(unified_verts)-1)
# point already exists
else:
ref_list.append(index_array[0])
return unified_verts, ref_list
Testing using cProfile:
import cProfile
cProfile.run("unify(raw_data)")
On my machine this runs in 5.275 seconds. I've though about using Cython to speed it up, but from what I've read Cython doesn't typically run much faster than numpy methods. Any advice on ways to do this more efficiently?
Jaime has shown a neat trick which can be used to view a 2D array as a 1D array with items that correspond to rows of the 2D array. This trick can allow you to apply numpy functions which take 1D arrays as input (such as np.unique) to higher dimensional arrays.
If the order of the rows in unified_verts does not matter (as long as the ref_list is correct with respect to unifed_verts), then you could use np.unique along with Jaime's trick like this:
def unify2(raw_data):
dtype = np.dtype((np.void, (raw_data.shape[1] * raw_data.dtype.itemsize)))
uniq, inv = np.unique(raw_data.view(dtype), return_inverse=True)
uniq = uniq.view(raw_data.dtype).reshape(-1, raw_data.shape[1])
return uniq, inv
The result is the same in the sense that the raw_data can be reconstructed from the return values of unify (or unify2):
unified, ref = unify(raw_data)
uniq, inv = unify2(raw_data)
assert np.allclose(uniq[inv], unified[ref]) # raw_data
On my machine, unified, ref = unify(raw_data) requires about 51.390s, while uniq, inv = unify2(raw_data) requires about 0.133s (~ 386x speedup).

Python - Difficulty Calculating Partition Function using Large NumPy arrays

I had a pretty compact way of computing the partition function of an Ising-like model using itertools, lambda functions, and large NumPy arrays. Given a network consisting of N nodes and Q "states"/node, I have two arrays, h-fields and J-couplings, of sizes (N,Q) and (N,N,Q,Q) respectively. J is upper-triangular, however. Using these arrays, I have been computing the partition function Z using the following method:
# Set up lambda functions and iteration tuples of the form (A_1, A_2, ..., A_n)
iters = itertools.product(range(Q),repeat=N)
hf = lambda s: h[range(N),s]
jf = lambda s: np.array([J[fi,fj,s[fi],s[fj]] \
for fi,fj in itertools.combinations(range(N),2)]).flatten()
# Initialize and populate partition function array
pf = np.zeros(tuple([Q for i in range(N)]))
for it in iters:
hterms = np.exp(hf(it)).prod()
jterms = np.exp(-jf(it)).prod()
pf[it] = jterms * hterms
# Calculates partition function
Z = pf.sum()
This method works quickly for small N and Q, say (N,Q) = (5,2). However, for larger systems (N,Q) = (18,3), this method cannot even create the pf array due to memory issues because it has Q^N nontrivial elements. Any ideas on how to either overcome this memory issue or how to alter the code to work on subarrays?
Edit: Made a small mistake in the definition of jf. It has been corrected.
You can avoid the large array just by initializing Z to 0, and incrementing it by jterms * iterms in each iteration. This still won't get you out of calculating and summing Q^N numbers, however. To do that, you probably need to figure out a way to simplify the partition function algebraically.
Not sure what you are trying to compute but I tested your code with ChrisB suggestion and jf will not work for Q=3.
Perhaps you shouldn't use a dense numpy array to encode your function? You could try sparse arrays or just straight Python with Numba compilation. This blogpost shows using Numba on the simple Ising model with good performance.

numpy: efficient execution of a complex reshape of an array

I am reading a vendor-provided large binary array into a 2D numpy array tempfid(M, N)
# load data
data=numpy.fromfile(file=dirname+'/fid', dtype=numpy.dtype('i4'))
# convert to complex data
fid=data[::2]+1j*data[1::2]
tempfid=fid.reshape(I*J*K, N)
and then I need to reshape it into a 4D array useful4d(N,I,J,K) using non-trivial mappings for the indices. I do this with a for loop along the following lines:
for idx in range(M):
i=f1(idx) # f1, f2, and f3 are functions involving / and % as well as some lookups
j=f2(idx)
k=f3(idx)
newfid[:,i,j,k] = tempfid[idx,:] #SLOW! CAN WE IMPROVE THIS?
Converting to complex takes 33% of the time while the copying of these slices M slices takes the remaining 66%. Calculating the indices is fast irrespective of whether I do this one by one in a loop as shown or by numpy.vectorizing the operation and applying it to an arange(M).
Is there a way to speed this up? Any help on more efficient slicing, copying (or not) etc appreciated.
EDIT:
As learned in the answer to question "What's the fastest way to convert an interleaved NumPy integer array to complex64?" the conversion to complex can be sped up by a factor of 6 if a view is used instead:
fid = data.astype(numpy.float32).view(numpy.complex64)
idx = numpy.arange(M)
i = numpy.vectorize(f1)(idx)
j = numpy.vectorize(f2)(idx)
k = numpy.vectorize(f3)(idx)
# you can index arrays with other arrays
# that lets you specify this operation in one line.
newfid[:, i,j,k] = tempfid.T
I've never used numpy's vectorize. Vectorize just means that numpy will call your python function multiple times. In order to get speed, you need use array operations like the one I showed here and you used to get complex numbers.
EDIT
The problem is that the dimension of size 128 was first in newfid, but last in tempfid. This is easily by using .T which takes the transpose.
How about this. Set us your indicies using the vectorized versions of f1,f2,f3 (not necessarily using np.vectorize, but perhaps just writing a function that takes an array and returns an array), then use np.ix_:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.ix_.html
to get the index arrays. Then reshape tempfid to the same shape as newfid and then use the results of np.ix_ to set the values. For example:
tempfid = np.arange(10)
i = f1(idx) # i = [4,3,2,1,0]
j = f2(idx) # j = [1,0]
ii = np.ix_(i,j)
newfid = tempfid.reshape((5,2))[ii]
This maps the elements of tempfid onto a new shape with a different ordering.

Best way to create a NumPy array from a dictionary?

I'm just starting with NumPy so I may be missing some core concepts...
What's the best way to create a NumPy array from a dictionary whose values are lists?
Something like this:
d = { 1: [10,20,30] , 2: [50,60], 3: [100,200,300,400,500] }
Should turn into something like:
data = [
[10,20,30,?,?],
[50,60,?,?,?],
[100,200,300,400,500]
]
I'm going to do some basic statistics on each row, eg:
deviations = numpy.std(data, axis=1)
Questions:
What's the best / most efficient way to create the numpy.array from the dictionary? The dictionary is large; a couple of million keys, each with ~20 items.
The number of values for each 'row' are different. If I understand correctly numpy wants uniform size, so what do I fill in for the missing items to make std() happy?
Update: One thing I forgot to mention - while the python techniques are reasonable (eg. looping over a few million items is fast), it's constrained to a single CPU. Numpy operations scale nicely to the hardware and hit all the CPUs, so they're attractive.
You don't need to create numpy arrays to call numpy.std().
You can call numpy.std() in a loop over all the values of your dictionary. The list will be converted to a numpy array on the fly to compute the standard variation.
The downside of this method is that the main loop will be in python and not in C. But I guess this should be fast enough: you will still compute std at C speed, and you will save a lot of memory as you won't have to store 0 values where you have variable size arrays.
If you want to further optimize this, you can store your values into a list of numpy arrays, so that you do the python list -> numpy array conversion only once.
if you find that this is still too slow, try to use psycho to optimize the python loop.
if this is still too slow, try using Cython together with the numpy module. This Tutorial claims impressive speed improvements for image processing. Or simply program the whole std function in Cython (see this for benchmarks and examples with sum function )
An alternative to Cython would be to use SWIG with numpy.i.
if you want to use only numpy and have everything computed at C level, try grouping all the records of same size together in different arrays and call numpy.std() on each of them. It should look like the following example.
example with O(N) complexity:
import numpy
list_size_1 = []
list_size_2 = []
for row in data.itervalues():
if len(row) == 1:
list_size_1.append(row)
elif len(row) == 2:
list_size_2.append(row)
list_size_1 = numpy.array(list_size_1)
list_size_2 = numpy.array(list_size_2)
std_1 = numpy.std(list_size_1, axis = 1)
std_2 = numpy.std(list_size_2, axis = 1)
While there are already some pretty reasonable ideas present here, I believe following is worth mentioning.
Filling missing data with any default value would spoil the statistical characteristics (std, etc). Evidently that's why Mapad proposed the nice trick with grouping same sized records.
The problem with it (assuming there isn't any a priori data on records lengths is at hand) is that it involves even more computations than the straightforward solution:
at least O(N*logN) 'len' calls and comparisons for sorting with an effective algorithm
O(N) checks on the second way through the list to obtain groups(their beginning and end indexes on the 'vertical' axis)
Using Psyco is a good idea (it's strikingly easy to use, so be sure to give it a try).
It seems that the optimal way is to take the strategy described by Mapad in bullet #1, but with a modification - not to generate the whole list, but iterate through the dictionary converting each row into numpy.array and performing required computations. Like this:
for row in data.itervalues():
np_row = numpy.array(row)
this_row_std = numpy.std(np_row)
# compute any other statistic descriptors needed and then save to some list
In any case a few million loops in python won't take as long as one might expect. Besides this doesn't look like a routine computation, so who cares if it takes extra second/minute if it is run once in a while or even just once.
A generalized variant of what was suggested by Mapad:
from numpy import array, mean, std
def get_statistical_descriptors(a):
if ax = len(shape(a))-1
functions = [mean, std]
return f(a, axis = ax) for f in functions
def process_long_list_stats(data):
import numpy
groups = {}
for key, row in data.iteritems():
size = len(row)
try:
groups[size].append(key)
except KeyError:
groups[size] = ([key])
results = []
for gr_keys in groups.itervalues():
gr_rows = numpy.array([data[k] for k in gr_keys])
stats = get_statistical_descriptors(gr_rows)
results.extend( zip(gr_keys, zip(*stats)) )
return dict(results)
numpy dictionary
You can use a structured array to preserve the ability to address a numpy object by a key, like a dictionary.
import numpy as np
dd = {'a':1,'b':2,'c':3}
dtype = eval('[' + ','.join(["('%s', float)" % key for key in dd.keys()]) + ']')
values = [tuple(dd.values())]
numpy_dict = np.array(values, dtype=dtype)
numpy_dict['c']
will now output
array([ 3.])

Categories

Resources