Overhead in networkx reverse function? - python

I have the following code:
import networkx
def reverse_graph(g):
    reversed = networkx.DiGraph()
    for e in g.edges():
        reversed.add_edge(e[1], e[0])
    return reversed

g = networkx.DiGraph()
for i in range(500000):
    g.add_edge(i, i+1)
g2 = g.reverse()
g3 = reverse_graph(g)
And according to my line profiler I am spending WAY more time reversing the graph using networkx (their reverse took about 21 sec, mine took about 7). The overhead seems high in this simple case, and it's even worse in other code I have with more complex objects. Is there something happening under the hood of networkx I'm not aware of? This seems like it should be a relatively cheap function.
For reference, here is the doc for the reverse function
EDIT: I also tried running the implementations the other way around (i.e. mine first) to make sure there was no caching happening when they created theirs. Mine was still significantly faster.

The source code for the reverse method looks like this:
def reverse(self, copy=True):
    """Return the reverse of the graph.

    The reverse is a graph with the same nodes and edges
    but with the directions of the edges reversed.

    Parameters
    ----------
    copy : bool optional (default=True)
        If True, return a new DiGraph holding the reversed edges.
        If False, the reverse graph is created using
        the original graph (this changes the original graph).
    """
    if copy:
        H = self.__class__(name="Reverse of (%s)" % self.name)
        H.add_nodes_from(self)
        H.add_edges_from((v, u, deepcopy(d)) for u, v, d
                         in self.edges(data=True))
        H.graph = deepcopy(self.graph)
        H.node = deepcopy(self.node)
    else:
        self.pred, self.succ = self.succ, self.pred
        self.adj = self.succ
        H = self
    return H
So by default, when copy=True, not only are the edges reversed, but a deepcopy of any edge data is also made. Then the graph attributes (held in
self.graph) are deepcopied, and then the nodes themselves are deepcopied.
That's a lot of copying that reverse_graph does not do.
If you don't deepcopy everything, modifying g3 could affect g.
If you don't need to deepcopy everything, (and if mutating g is acceptable) then
g.reverse(copy=False)
is even faster than
g3 = reverse_graph(g)
In [108]: %timeit g.reverse(copy=False)
1000000 loops, best of 3: 359 ns per loop
In [95]: %timeit reverse_graph(g)
1 loops, best of 3: 1.32 s per loop
In [96]: %timeit g.reverse()
1 loops, best of 3: 4.98 s per loop
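If you need a new graph but can live with the reversed copy sharing edge-data dicts with the original (no deepcopy), a middle-ground sketch, using only plain DiGraph API:
import networkx

# Sketch: a reversed copy that shares edge-data dicts with the original
# instead of deep-copying them. Mutating a shared data dict afterwards
# will affect both graphs.
def reverse_shallow(g):
    h = networkx.DiGraph()
    h.add_nodes_from(g)  # keep isolated nodes, which reverse_graph drops
    h.add_edges_from((v, u, d) for u, v, d in g.edges(data=True))
    return h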

Related

Efficient tensor contraction with Python

I have a piece of code with a bottleneck calculation involving tensor contractions. Let's say I want to calculate a tensor A_{i,j,k,l}(X) whose non-zero entries for a single x ∈ X number about N ~ 10^5, and X represents a grid with M total points, with M ~ 1000 approximately. For a single element of the tensor A, the right-hand side of the equation looks something like:
A_{i,j,k,l}(M) = Sum_{m,n,p,q} S_{i,j,m,n}(M) B_{m,n,p,q}(M) T_{p,q,k,l}(M)
In addition, the middle tensor B_{m,n,p,q}(M) is obtained by numerical convolution of arrays, so that:
B_{m,n,p,q}(M) = (L_{m,n} * F_{p,q})(M)
where "*" is the convolution operator, and all tensors have approximately the same number of elements as A. My problem has to do with the efficiency of the sums; computing a single right-hand side of A takes a very long time given the complexity of the problem. I have a "keys" system, where each tensor element is accessed by its unique key combination ((p,q,k,l) for T, for example) taken from a dictionary. The dictionary for that specific key gives the NumPy array associated with that key, and all operations (convolutions, multiplications, ...) are done using NumPy. I have seen that the most time-consuming part is actually the nested loop (I loop over all keys (i,j,k,l) of the A tensor, and for each key a right-hand side like the one above needs to be computed). Is there any efficient way to do this? Consider that:
1) Using plain NumPy arrays of 4+1 dimensions results in high memory usage, since all tensors are of complex type.
2) I have tried several approaches: Numba is quite limited when working with dictionaries, and some important NumPy features that I need are not currently supported. For instance, under Numba numpy.convolve() only takes the first two arguments and does not accept the "mode" argument, which would considerably reduce the needed convolution interval in this case; I don't need the "full" output of the convolution.
3) My most recent approach is trying to implement everything using Cython for this part... but this is quite time-consuming as well as more error-prone given the logic of the code.
Any ideas on how to deal with such complexity using Python?
Thanks!
You have to make your question a bit more precise, which also includes a working code example you have already tried. It is, for example, unclear why you use dictionaries in these tensor contractions. Dictionary lookups look like a weird thing for this calculation, but maybe I didn't get the point of what you really want to do.
Tensor contraction is actually very easy to implement in Python (NumPy), and there are methods to find the best way to contract the tensors that are really easy to use (np.einsum).
Creating some data (this should be part of the question)
import numpy as np
import time

i = j = k = l = m = n = p = q = 20

# I don't know what "complex 2" means; I assume it is complex128
# (real and imaginary parts in float64).
# Each array has 20**4 = 1.6e5 elements.
Sum_ = np.random.rand(m, n, p, q).astype(np.complex128)
S_ = np.random.rand(i, j, m, n).astype(np.complex128)
B_ = np.random.rand(m, n, p, q).astype(np.complex128)
T_ = np.random.rand(p, q, k, l).astype(np.complex128)
The naive way
This code is basically the same as writing it in loops using Cython or Numba without calling BLAS routines (ZGEMM) or optimizing the contraction order -> 8 nested loops to do the job.
t1=time.time()
A=np.einsum("mnpq,ijmn,mnpq,pqkl",Sum_,S_,B_,T_)
print(time.time()-t1)
This results in a very slow runtime of about 330 seconds.
How to increase the speed by a factor of 7700
%timeit A=np.einsum("mnpq,ijmn,mnpq,pqkl",Sum_,S_,B_,T_,optimize="optimal")
#42.9 ms ± 2.71 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Why is this so much faster?
Let's have a look at the contraction path and the internals.
path=np.einsum_path("mnpq,ijmn,mnpq,pqkl",Sum_,S_,B_,T_,optimize="optimal")
print(path[1])
# Complete contraction: mnpq,ijmn,mnpq,pqkl->ijkl
# Naive scaling: 8
# Optimized scaling: 6
# Naive FLOP count: 1.024e+11
# Optimized FLOP count: 2.562e+08
# Theoretical speedup: 399.750
# Largest intermediate: 1.600e+05 elements
#--------------------------------------------------------------------------
#scaling current remaining
#--------------------------------------------------------------------------
# 4 mnpq,mnpq->mnpq ijmn,pqkl,mnpq->ijkl
# 6 mnpq,ijmn->ijpq pqkl,ijpq->ijkl
# 6 ijpq,pqkl->ijkl ijkl->ijkl
and
path=np.einsum_path("mnpq,ijmn,mnpq,pqkl",Sum_,S_,B_,T_,optimize="optimal",einsum_call=True)
print(path[1])
#[((2, 0), set(), 'mnpq,mnpq->mnpq', ['ijmn', 'pqkl', 'mnpq'], False), ((2, 0), {'n', 'm'}, 'mnpq,ijmn->ijpq', ['pqkl', 'ijpq'], True), ((1, 0), {'p', 'q'}, 'ijpq,pqkl->ijkl', ['ijkl'], True)]
Doing the contraction in multiple well-chosen steps reduces the required FLOPs by a factor of 400. But that's not the only thing einsum does here. Just look at 'mnpq,ijmn->ijpq', ['pqkl', 'ijpq'], True: the True stands for a BLAS contraction -> tensordot call -> matrix-matrix multiplication.
Internally this looks basically as follows:
# Consider X as a 4th-order tensor {mnpq}
# and Y as a 4th-order tensor {ijmn}.
X_ = X.reshape(m*n, p*q)  # just another (2D) view on the data; costs almost nothing (no copy)
Y_ = Y.reshape(i*j, m*n)  # just another (2D) view on the data; costs almost nothing (no copy)
res = np.dot(Y_, X_)      # dot is just a wrapper for highly optimized BLAS; ZGEMM for complex128
output = res.reshape(i, j, p, q)  # back to a 4D view; costs almost nothing (no copy)
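One follow-up note, since the question computes the same contraction on every iteration of the program: the optimize argument also accepts a precomputed path, so the path search can be done once and reused. A small sketch under that assumption:
# Sketch: compute the optimal contraction path once, reuse it every iteration.
path = np.einsum_path("mnpq,ijmn,mnpq,pqkl", Sum_, S_, B_, T_,
                      optimize="optimal")[0]
for _ in range(10):  # stand-in for the program's outer loop
    A = np.einsum("mnpq,ijmn,mnpq,pqkl", Sum_, S_, B_, T_, optimize=path)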

Fast way to select from numpy array without intermediate index array

Given the following 2-column array, I want to select items from the second column that correspond to "edges" in the first column. This is just an example, as in reality my a has potentially millions of rows. So, ideally I'd like to do this as fast as possible, and without creating intermediate results.
import numpy as np
a = np.array([[1,4],[1,2],[1,3],[2,6],[2,1],[2,8],[2,3],[2,1],
              [3,6],[3,7],[5,4],[5,9],[5,1],[5,3],[5,2],[8,2],
              [8,6],[8,8]])
i.e. I want to find the result,
desired = np.array([4,6,6,4,2])
which is entries in a[:,1] corresponding to where a[:,0] changes.
One solution is,
b = a[(a[1:,0]-a[:-1,0]).nonzero()[0]+1, 1]
which gives np.array([6,6,4,2]); I could simply prepend the first item, no problem. However, this creates an intermediate array of the indices of the first items. I could avoid the intermediate by using a list comprehension:
c = [a[i+1,1] for i,(x,y) in enumerate(zip(a[1:,0],a[:-1,0])) if x!=y]
This also gives [6,6,4,2]. Assuming a generator-based zip (true in Python 3), this doesn't need to create an intermediate representation and should be very memory efficient. However, the inner loop is not numpy, and it necessitates generating a list which must be subsequently turned back into a numpy array.
Can you come up with a numpy-only version with the memory efficiency of c but the speed efficiency of b? Ideally only one pass over a is needed.
(Note that measuring the speed won't help much here, unless a is very big, so I wouldn't bother with benchmarking this, I just want something that is theoretically fast and memory efficient. For example, you can assume rows in a are streamed from a file and are slow to access -- another reason to avoid the b solution, as it requires a second random-access pass over a.)
Edit: a way to generate a large a matrix for testing:
from itertools import repeat
N, M = 100000, 100
a = np.array(list(zip([x for y in zip(*repeat(np.arange(N), M)) for x in y],
                      np.random.random(N*M))))
I am afraid that if you are looking to do this in a vectorized way, you can't avoid an intermediate array, as there's no built-in for it.
Now, let's look for vectorized approaches other than nonzero(), which might be more performant. Going by the same idea of differencing as in the original code, (a[1:,0]-a[:-1,0]), we can use boolean indexing after looking for the non-zero differences that correspond to "edges" or shifts.
Thus, we would have a vectorized approach like so -
a[np.append(True,np.diff(a[:,0])!=0),1]
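Applied to the sample a from the question, this reproduces desired, first row included:
a[np.append(True, np.diff(a[:,0]) != 0), 1]
# -> array([4, 6, 6, 4, 2])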
Runtime test
The original solution a[(a[1:,0]-a[:-1,0]).nonzero()[0]+1,1] would skip the first row. But let's just say, for the sake of timing purposes, that it's a valid result. Here are the runtimes for it against the solution proposed in this post -
In [118]: from itertools import repeat
...: N, M = 100000, 2
...: a = np.array(zip([x for y in zip(*repeat(np.arange(N),M))\
for x in y ], np.random.random(N*M)))
...:
In [119]: %timeit a[(a[1:,0]-a[:-1,0]).nonzero()[0]+1,1]
100 loops, best of 3: 6.31 ms per loop
In [120]: %timeit a[1:][np.diff(a[:,0])!=0,1]
100 loops, best of 3: 4.51 ms per loop
Now, let's say you want to include the first row too. The updated runtimes would look something like this -
In [123]: from itertools import repeat
...: N, M = 100000, 2
...: a = np.array(zip([x for y in zip(*repeat(np.arange(N),M))\
for x in y ], np.random.random(N*M)))
...:
In [124]: %timeit a[np.append(0,(a[1:,0]-a[:-1,0]).nonzero()[0]+1),1]
100 loops, best of 3: 6.8 ms per loop
In [125]: %timeit a[np.append(True,np.diff(a[:,0])!=0),1]
100 loops, best of 3: 5 ms per loop
OK, actually I found a solution; I just learned about np.fromiter, which can build a numpy array from a generator:
d = np.fromiter((a[i+1,1] for i,(x,y) in enumerate(zip(a[1:,0],a[:-1,0])) if x!=y), int)
I think this does it: it generates a numpy array without any intermediate arrays. The caveat, however, is that it does not seem to be all that efficient! Forgetting what I said in the question about testing:
t = [lambda a: a[(a[1:,0]-a[:-1,0]).nonzero()[0]+1, 1],
     lambda a: np.array([a[i+1,1] for i,(x,y) in enumerate(zip(a[1:,0],a[:-1,0])) if x!=y]),
     lambda a: np.fromiter((a[i+1,1] for i,(x,y) in enumerate(zip(a[1:,0],a[:-1,0])) if x!=y), int)]
from timeit import Timer
[Timer(lambda: f(a)).timeit(number=10) for f in t]
[0.16596235800034265, 1.811289312000099, 2.1662971739997374]
It seems the first solution is drastically faster! I assume this is because even though it generates intermediate data, it is able to perform the inner loop completely in numpy, while the others run Python code for each item in the array.
Like I said, this is why I'm not sure this kind of benchmarking makes sense here -- if accesses to a were much slower, the benchmark wouldn't be CPU-bound. Thoughts?
Not "accepting" this answer since I am hoping someone can come up with something faster.
If memory efficiency is your concern, that can be solved as such: the only intermediate of the same size-order as the input data can be made of type bool (a[1:,0] != a[:-1,0]), and if your input data is int32, that is 8 times smaller than a itself. You can count the nonzeros of that boolean array to preallocate the output array as well, though that should not be very significant if the output of the != is as sparse as your example suggests.
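Going one step further than this answer literally proposes, the bool mask itself can be filled in place, so nothing beyond the mask is ever allocated. A sketch:
import numpy as np

# Sketch: preallocate the bool mask and fill it in place with the ufunc's
# `out=` argument; no temporary besides the mask itself is created.
mask = np.empty(len(a), dtype=bool)
mask[0] = True                                   # always keep the first row
np.not_equal(a[1:, 0], a[:-1, 0], out=mask[1:])  # True where a[:,0] changes
result = a[mask, 1]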

exceptions for numpy arrays

I'm looking to remove values within a constant range around values held in a second array. I.e., I have one large np array and I want to remove any value within ±3 of a value in another array of specific values, say [20,50,90,210]. So if my large array were [14,21,48,54,92,215], I would want [14,54,215] returned. The values are double precision, so I'm trying to avoid creating a large mask array of specific values and want to use a range instead.
You mentioned that you wanted to avoid a large mask array. Unless both your "large array" and your "specific values" array are very large, I wouldn't try to avoid this. Often, with numpy it's best to allow relatively large temporary arrays to be created.
However, if you do need to control memory usage more tightly, you have several options. A typical trick is to only vectorize one part of the operation and iterate over the shorter input (this is shown in the second example below). It saves having nested loops in Python, and can significantly decrease the memory usage involved.
I'll show three different approaches. There are several others (including dropping down to C or Cython if you really need tight control and performance), but hopefully this gives you some ideas.
On a side note, for these small inputs, the overhead of array creation will overwhelm the differences. The speed and memory usage I'm referring to is only for large (>~1e6 elements) arrays.
Fully vectorized, but most memory usage
The easiest way is to calculate all distances at once and then reduce the mask back to the same shape as the initial array. For example:
import numpy as np
vals = np.array([14,21,48,54,92,215])
other = np.array([20,50,90,210])
dist = np.abs(vals[:,None] - other[None,:])
mask = np.all(dist > 3, axis=1)
result = vals[mask]
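On the question's sample inputs this returns exactly the values asked for:
print(result)
# [ 14  54 215]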
Partially vectorized, intermediate memory usage
Another option is to build up the mask iteratively for each element in the "specific values" array. This iterates over all elements of the shorter "specific values" array (a.k.a. other in this case):
import numpy as np
vals = np.array([14,21,48,54,92,215])
other = np.array([20,50,90,210])
mask = np.ones(len(vals), dtype=bool)
for num in other:
    dist = np.abs(vals - num)
    mask &= dist > 3
result = vals[mask]
Slowest, but lowest memory usage
Finally, if you really want to reduce memory usage, you could iterate over every item in your large array:
import numpy as np
vals = np.array([14,21,48,54,92,215])
other = np.array([20,50,90,210])
result = []
for num in vals:
    if np.all(np.abs(num - other) > 3):
        result.append(num)
The temporary list in that case is likely to take up more memory than the mask in the previous version. However, you could avoid the temporary list by using np.fromiter if you wanted. The timing comparison below shows an example of this.
Timing Comparisons
Let's compare the speed of these functions. We'll use 10,000,000 elements in the "large array" and 4 values in the "specific values" array. The relative speed and memory usage of these functions depend strongly on the sizes of the two arrays, so you should only consider this as a vague guideline.
import numpy as np

vals = np.random.random(10**7)
other = np.array([0.1, 0.5, 0.8, 0.95])
tolerance = 0.05

def basic(vals, other, tolerance):
    dist = np.abs(vals[:,None] - other[None,:])
    mask = np.all(dist > tolerance, axis=1)
    return vals[mask]

def intermediate(vals, other, tolerance):
    mask = np.ones(len(vals), dtype=bool)
    for num in other:
        dist = np.abs(vals - num)
        mask &= dist > tolerance
    return vals[mask]

def slow(vals, other, tolerance):
    def func(vals, other, tolerance):
        for num in vals:
            if np.all(np.abs(num - other) > tolerance):
                yield num
    return np.fromiter(func(vals, other, tolerance), dtype=vals.dtype)
And in this case, the partially vectorized version wins out. That's to be expected in most cases where vals is significantly longer than other. However, the first example (basic) is almost as fast, and is arguably simpler.
In [7]: %timeit basic(vals, other, tolerance)
1 loops, best of 3: 1.45 s per loop
In [8]: %timeit intermediate(vals, other, tolerance)
1 loops, best of 3: 917 ms per loop
In [9]: %timeit slow(vals, other, tolerance)
1 loops, best of 3: 2min 30s per loop
Either way you choose to implement things, these are common vectorization "tricks" that show up in many problems. In high-level languages like Python, Matlab, R, etc., it's often useful to try full vectorization first, then mix vectorization and explicit loops if memory usage is an issue. Which one is best usually depends on the relative sizes of the inputs, but this is a common pattern to try when trading off speed against memory in high-level scientific programming.
You can try:
def closestmatch(x, y):
    val = np.abs(x - y)
    return val.min() > 3  # True if y is strictly farther than 3 from every element of x
Then:
b[np.array([closestmatch(a, x) for x in b])]
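The answer leaves a and b undefined; presumably b is the large array and a holds the specific values. A sketch of how it would be used on the question's data:
import numpy as np

a = np.array([20, 50, 90, 210])           # specific values
b = np.array([14, 21, 48, 54, 92, 215])   # large array
result = b[np.array([closestmatch(a, x) for x in b])]
# -> array([ 14,  54, 215])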

Product of a sequence in NumPy

I need to implement the following function with NumPy (in the notation of the code below):
F_l(x) = (G(x)/A_l) * prod_{j != l} [ (G(x) - A_l)/(G(x) + A_j) * (A_l - A_j)/(A_l + A_j) ]
where the F_l(x) are N arrays that I need to calculate, which depend on an array G(x) that I am given, and the A_j are N coefficients that are also given. I would like to implement it in NumPy because I have to calculate the F_l(x) on every iteration of my program. The dummy way to do this is with for loops and ifs:
import numpy as np

A = np.arange(1., 5., 1)
G = np.array([[1., 2.], [3., 4.]])

def calcF(G, A):
    N = A.size
    print(A)
    print(N)
    F = []
    for l in range(N):
        F.append(G / A[l])
        print(F[l])
        for j in range(N):
            if j != l:
                F[l] *= ((G - A[l]) / (G + A[j])) * ((A[l] - A[j]) / (A[l] + A[j]))
    return F

F = calcF(G, A)
print(F)
As for loops and if statements are relatively slow, I am looking for a NumPy witty way to do the same thing. Anyone has an idea?
Listed in this post is a vectorized solution making heavy use of NumPy's powerful broadcasting feature, after extending the dimensions of the input arrays to 3D and 4D with np.newaxis/None at various places according to the computation involved. Here's the implementation -
def vectorized_calcF(G, A):
    # Get size of A
    N = A.size
    # Perform "(G - A[l])/(G + A[j])" in a vectorized manner
    p1 = (G - A[:,None,None,None]) / (G + A[:,None,None])
    # Perform "((A[l] - A[j])/(A[l] + A[j]))" in a vectorized manner
    p2 = (A[:,None] - A) / (A[:,None] + A)
    # Elementwise multiplications between the previously calculated parts
    p3 = p1 * p2[...,None,None]
    # Overwrite the diagonal (the "j == l" cases skipped by the IF) with "G/A[l]"
    p3[np.eye(N, dtype=bool)] = G / A[:,None,None]
    Fout = p3.prod(1)
    # If you need separate arrays just like in the question, split it
    Fout_split = np.array_split(Fout, N)
    return Fout_split
Sample run -
In [284]: # Original inputs
...: A = np.arange(1.,5.,1)
...: G = np.array([[1.,2.],[3.,4.]])
...:
In [285]: calcF(G,A)
Out[285]:
[array([[-0. , -0.00166667],
[-0.01142857, -0.03214286]]), array([[-0.00027778, 0. ],
[ 0.00019841, 0.00126984]]), array([[ 1.26984127e-03, 1.32275132e-04],
[ -0.00000000e+00, -7.93650794e-05]]), array([[-0.00803571, -0.00190476],
[-0.00017857, 0. ]])]
In [286]: vectorized_calcF(G,A) # Posted solution
Out[286]:
[array([[[-0. , -0.00166667],
[-0.01142857, -0.03214286]]]), array([[[-0.00027778, 0. ],
[ 0.00019841, 0.00126984]]]), array([[[ 1.26984127e-03, 1.32275132e-04],
[ -0.00000000e+00, -7.93650794e-05]]]), array([[[-0.00803571, -0.00190476],
[-0.00017857, 0. ]]])]
Runtime test -
In [289]: # Larger inputs
...: A = np.random.randint(1,500,(400))
...: G = np.random.randint(1,400,(20,20))
...:
In [290]: %timeit calcF(G,A)
1 loops, best of 3: 4.46 s per loop
In [291]: %timeit vectorized_calcF(G,A) # Posted solution
1 loops, best of 3: 1.87 s per loop
Vectorization with NumPy/MATLAB: General approach
Felt like I could throw in my two cents on my general approach, and I would think others follow similar strategies when trying to vectorize code, especially on a high-level platform like NumPy or MATLAB. So, here's a quick checklist of things to consider for vectorization -
Idea about extending the dimensions: Extend the dimensions of the input arrays so that the new dimensions hold the results that would otherwise have been generated iteratively within the nested loops.
Where to start vectorizing? Start from the deepest stage of computation (the loop where the code iterates the most) and see how the inputs can be extended and the relevant computation brought in. Take good care to trace the iterators involved and extend dimensions accordingly. Move outwards onto the outer loops until you are satisfied with the vectorization.
How to take care of conditional statements? For simple cases, brute-force compute everything and see how the IF/ELSE parts can be taken care of afterwards; this is highly context specific (see the sketch after this list).
Are there dependencies? If so, see whether the dependencies can be traced and implemented accordingly. This could form another topic for discussion.
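A minimal sketch of the third point, mirroring the p3[np.eye(N, dtype=bool)] trick used in the solution above: brute-force compute every (l, j) pair with broadcasting, then patch the j == l cases afterwards with a boolean mask.
import numpy as np

N = 4
A = np.arange(1., N + 1)
# Brute-force compute all (l, j) combinations via broadcasting...
table = (A[:,None] - A) / (A[:,None] + A)
# ...then overwrite the j == l entries that an "if j != l" would have skipped.
table[np.eye(N, dtype=bool)] = 1.0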

Fastest way of finding the index of the closest element in a non-sorted Python list of floats

Given as input a list of floats that is not sorted, what would be the most efficient way of finding the index of the closest element to a certain value? Some potential solutions come to mind:
For:
import random
x = random.sample([float(i) for i in range(1000000)], 1000000)
1) Own function:
def min_val(lst, val):
    min_i = None
    min_dist = 1000000.0
    for i, v in enumerate(lst):
        d = abs(v - val)
        if d < min_dist:
            min_dist = d
            min_i = i
    return min_i
Result:
%timeit min_val(x, 5000.56)
100 loops, best of 3: 11.5 ms per loop
2) Min
%timeit min(range(len(x)), key=lambda i: abs(x[i]-5000.56))
100 loops, best of 3: 16.8 ms per loop
3) Numpy (including conversion)
%timeit np.abs(np.array(x)-5000.56).argmin()
100 loops, best of 3: 3.88 ms per loop
From that test, it seems that converting the list to a numpy array is the best solution. However, two questions come to mind:
Was that indeed a realistic comparison?
Is the numpy solution the fastest way to achieve this in Python?
Consider the partition algorithm from QuickSort. The partition algorithm rearranges a list such that the pivot element is in its final location after invocation. Based on the value of the pivot, you can then partition the portion of the array that is likely to contain the element closest to your target. Once you've either found the element you're after or have a partition of length 1 that contains it, you're done.
The general problem you're addressing is a selection problem.
In your question you were wondering about what sort of array/list implementation to use, and that will have an impact on performance. A bigger factor will be the search algorithm as opposed to the list/array representation.
Edit, in light of a comment from @Andrzej:
Ah, then I misunderstood your question. Strictly speaking, linear search is always O(n), so efficiency within the bounds of Big-Oh analysis is the same regardless of underlying data structure. The gotcha here is that for linear search you want a nice simple data structure to make the run-time performance as good as possible.
A Python list is an array of references to objects, while (to my understanding) a Numpy array is a contiguous array of raw values. The Numpy array will perform better since it doesn't have to dereference objects to get at the values.
Your comparison technique seems reasonable for Python list vs. Numpy array. I'd be reluctant to say that a Numpy array is the fastest way to solve the problem, but it should perform better than a Python list.
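A related point: the NumPy timing above includes the list-to-array conversion on every call. If the data can be kept in a NumPy array from the start, only the search itself remains. A sketch, assuming x as defined in the question:
import numpy as np

xa = np.asarray(x)  # one-time conversion (or build the data as an ndarray to begin with)
idx = int(np.abs(xa - 5000.56).argmin())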
