Memory efficient storage of many large scipy sparse matrices - python

I need to store around 50,000 scipy sparse csr matrices, where each matrix is a vector of length 3.7 million:
    x = scipy.sparse.csr_matrix((3700000, 1))
I currently store them into a simple dictionary, because I also need to know the corresponding key for each vector (in this case the key is just a simple integer).
The problem is the huge amount of memory this needs. Is there a more efficient way to store them?

Try using lazy data structures, which defer a computation until its result is actually needed.
For example:
    def lazy(func):
        def lazyfunc(*args, **kwargs):
            # Return a zero-argument callable that runs func only when called.
            temp = lambda: func(*args, **kwargs)
            temp.__name__ = "lazy-" + func.__name__
            return temp
        return lazyfunc

    # Add some simple functions
    def add(x, y):
        print("Not lazy")
        return x + y

    @lazy
    def add_lazy(x, y):
        print("lazy!")
        return x + y
Usage:
    >>> add(1, 2)
    Not lazy
    3
    >>> add_lazy(1, 2)
    <function lazy-add_lazy at 0x021E9470>
    >>> myval = add_lazy(1, 2)
    >>> myval()
    lazy!
    3
Look at:
http://finderweb.com/blog/lazy_dict/
http://www.pages.drexel.edu/~kmk592/rants/lazy-python/index.html
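To connect the lazy idea back to a dictionary of vectors, here is a minimal sketch, not a definitive implementation. LazyDict and make_vector are hypothetical names, and a plain Python list stands in for the sparse vector so the example has no dependencies. Each key maps to a small "recipe", and the full vector is built only when the key is read. Whether this saves memory depends on the recipe being much smaller than the built vector.

```python
class LazyDict:
    """Maps keys to zero-argument factories and builds values on access."""

    def __init__(self):
        self._factories = {}

    def __setitem__(self, key, factory):
        self._factories[key] = factory

    def __getitem__(self, key):
        # The value is rebuilt on every access and never cached,
        # so only the small factory stays in memory.
        return self._factories[key]()

    def __len__(self):
        return len(self._factories)


def make_vector(nonzeros, length):
    """Hypothetical stand-in for building one sparse vector.

    nonzeros maps index -> value; the result is a dense list here.
    """
    def build():
        vec = [0.0] * length
        for index, value in nonzeros.items():
            vec[index] = value
        return vec
    return build


vectors = LazyDict()
vectors[7] = make_vector({2: 1.5}, length=5)  # nothing is built yet
print(vectors[7])                             # built on access
```

In real code the factory would return a scipy sparse matrix instead of a list. That substitution is an assumption, not something the answer above spells out.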

Related

Tuple vs Generator expression. Why does performance flip, the longer the sequence gets?

I was looking at the following code and thought I noticed a potential performance issue.
    def multiply(x):
        return x * 10

    def test():
        data = tuple(x for x in range(5))
        first_map = tuple(map(
            multiply,
            tuple(x for x in data)
        ))
        second_map = tuple(map(
            multiply,
            tuple(x for x in data)
        ))
        ziped = tuple(zip(data, second_map, first_map))
        return ziped

    import timeit
    print(timeit.timeit('test()', globals=globals()))
Running this prints 5.317162958992412 for me.
I thought all these intermediate tuples would cause multiple iterations over data, so I turned it into the following.
    def multiply(x):
        return x * 10

    def test():
        data = tuple(x for x in range(5))
        first_map = map(
            multiply,
            (x for x in data)
        )
        second_map = map(
            multiply,
            (x for x in data)
        )
        ziped = tuple(zip(data, second_map, first_map))
        return ziped

    import timeit
    print(timeit.timeit('test()', globals=globals()))
And in fact, running this prints 4.783027238998329 for me, so we've optimized the performance. Yay!
But now comes the surprising (to me) part! As we scale up the data, the results flip!
In other words, if we change data to be data = tuple(x for x in range(50)), then, the version with the intermediate tuples becomes the faster one.
What am I missing?
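One way to investigate the flip is to time both versions side by side at several sizes. The following is a minimal sketch under my own assumptions: the function names with_tuples, with_generators and compare are hypothetical, and the sizes and repeat count are arbitrary. It only measures; it does not explain the cause.

```python
import timeit


def multiply(x):
    return x * 10


def with_tuples(n):
    """Version that materializes every intermediate sequence as a tuple."""
    data = tuple(x for x in range(n))
    first_map = tuple(map(multiply, tuple(x for x in data)))
    second_map = tuple(map(multiply, tuple(x for x in data)))
    return tuple(zip(data, second_map, first_map))


def with_generators(n):
    """Version that keeps the intermediate sequences lazy."""
    data = tuple(x for x in range(n))
    first_map = map(multiply, (x for x in data))
    second_map = map(multiply, (x for x in data))
    return tuple(zip(data, second_map, first_map))


def compare(sizes=(5, 50, 500), number=2000):
    """Return {size: (seconds_with_tuples, seconds_with_generators)}."""
    results = {}
    for n in sizes:
        t_tuples = timeit.timeit(lambda: with_tuples(n), number=number)
        t_gens = timeit.timeit(lambda: with_generators(n), number=number)
        results[n] = (t_tuples, t_gens)
    return results


if __name__ == "__main__":
    for size, (t_tuples, t_gens) in compare().items():
        print(size, t_tuples, t_gens)
```

Timings vary by machine and Python version, so no expected numbers are shown here.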

How to vectorize a class instantiation to allow NumPy arrays as input?

I wrote a class that looks something like this:
    import numpy as np

    class blank():
        def __init__(self,a,b,c):
            self.a=a
            self.b=b
            self.c=c
            n=5
            c=a/b*8
            if (a>b):
                y=c+a*b
            else:
                y=c-a*b
            p = np.empty([1,1])
            k = np.empty([1,1])
            l = np.empty([1,1])
            p[0]=b
            k[0]=b*(c-1)
            l[0]=p+k
            for i in range(1, n, 1):
                p=np.append(p,l[i-1])
                k=np.append(k,(p[i]*(c+1)))
                l=np.append(l,p[i]+k[i])
            komp = np.zeros(shape=(n, 1))
            for i in range(0, n):
                pl_avg = (p[i] + l[i]) / 2
                h=pl_avg*3
                komp[i]=pl_avg*h/4
            self.tot=komp+l
And when I call it like this:
    from ex1 import blank
    import numpy as np
    res=blank(1,2,3)
    print(res.tot)
everything works well.
BUT I want to call it like this:
    res = blank(np.array([1,2,3]), np.array([3,4,5]), 3)
Is there an easy way to call it for each element i of these two arrays without editing the class code?
You won't be able to instantiate the class with NumPy arrays as inputs without changing the class code. @PabloAlvarez and @NagaKiran already provided an alternative: iterate over the arrays with zip and instantiate the class for each pair of elements. While this is a pretty simple solution, it defeats the purpose of using NumPy with its efficient vectorized operations.
Here is how I suggest you rewrite the code:
    from typing import Union
    import numpy as np

    def total(a: Union[float, np.ndarray],
              b: Union[float, np.ndarray],
              n: int = 5) -> np.ndarray:
        """Calculates what your self.tot was"""
        bc = 8 * a  # equals b * c, because c = a / b * 8
        c = bc / b
        vectorized_geometric_progression = np.vectorize(geometric_progression,
                                                        otypes=[np.ndarray])
        l = np.stack(vectorized_geometric_progression(bc, c, n))
        l = np.atleast_2d(l)
        # p is l shifted right by one element, with b as its first element
        p = np.insert(l[:, :-1], 0, b, axis=1)
        l = np.squeeze(l)
        p = np.squeeze(p)
        pl_avg = (p + l) / 2
        komp = np.array([0.75 * pl_avg ** 2]).T
        return komp + l

    def geometric_progression(bc, c, n):
        """Calculates array l"""
        return bc * np.logspace(start=0,
                                stop=n - 1,
                                num=n,
                                base=c + 2)
And you can call it both for single numbers and for NumPy arrays like this:
    >>> print(total(1, 2))
    [[2.6750000e+01 6.6750000e+01 3.0675000e+02 1.7467500e+03 1.0386750e+04]
     [5.9600000e+02 6.3600000e+02 8.7600000e+02 2.3160000e+03 1.0956000e+04]
     [2.1176000e+04 2.1216000e+04 2.1456000e+04 2.2896000e+04 3.1536000e+04]
     [7.6205600e+05 7.6209600e+05 7.6233600e+05 7.6377600e+05 7.7241600e+05]
     [2.7433736e+07 2.7433776e+07 2.7434016e+07 2.7435456e+07 2.7444096e+07]]
    >>> print(total(3, 4))
    [[1.71000000e+02 3.39000000e+02 1.68300000e+03 1.24350000e+04 9.84510000e+04]
     [8.77200000e+03 8.94000000e+03 1.02840000e+04 2.10360000e+04 1.07052000e+05]
     [5.59896000e+05 5.60064000e+05 5.61408000e+05 5.72160000e+05 6.58176000e+05]
     [3.58318320e+07 3.58320000e+07 3.58333440e+07 3.58440960e+07 3.59301120e+07]
     [2.29323574e+09 2.29323590e+09 2.29323725e+09 2.29324800e+09 2.29333402e+09]]
    >>> print(total(np.array([1, 3]), np.array([2, 4])))
    [[[2.67500000e+01 6.67500000e+01 3.06750000e+02 1.74675000e+03 1.03867500e+04]
      [1.71000000e+02 3.39000000e+02 1.68300000e+03 1.24350000e+04 9.84510000e+04]]
     [[5.96000000e+02 6.36000000e+02 8.76000000e+02 2.31600000e+03 1.09560000e+04]
      [8.77200000e+03 8.94000000e+03 1.02840000e+04 2.10360000e+04 1.07052000e+05]]
     [[2.11760000e+04 2.12160000e+04 2.14560000e+04 2.28960000e+04 3.15360000e+04]
      [5.59896000e+05 5.60064000e+05 5.61408000e+05 5.72160000e+05 6.58176000e+05]]
     [[7.62056000e+05 7.62096000e+05 7.62336000e+05 7.63776000e+05 7.72416000e+05]
      [3.58318320e+07 3.58320000e+07 3.58333440e+07 3.58440960e+07 3.59301120e+07]]
     [[2.74337360e+07 2.74337760e+07 2.74340160e+07 2.74354560e+07 2.74440960e+07]
      [2.29323574e+09 2.29323590e+09 2.29323725e+09 2.29324800e+09 2.29333402e+09]]]
You can see that the array results agree with the scalar ones.
Explanation:
First of all, I'd like to note that your calculation of p, k, and l doesn't have to be in a loop. Moreover, calculating k is unnecessary. If you look carefully at how the elements of p and l are calculated, you'll see they are just geometric progressions (except for the first element of p):
p = [b, b*c, b*c*(c+2), b*c*(c+2)**2, b*c*(c+2)**3, b*c*(c+2)**4, ...]
l = [b*c, b*c*(c+2), b*c*(c+2)**2, b*c*(c+2)**3, b*c*(c+2)**4, b*c*(c+2)**5, ...]
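A quick way to convince yourself of these closed forms is to compare them with the original recurrence. This is a minimal sketch in plain Python; the function names loop_sequences and closed_form_sequences are hypothetical, and lists stand in for NumPy arrays so the check has no dependencies.

```python
import math


def loop_sequences(b, c, n=5):
    """p and l built step by step, mirroring the loop in the question."""
    p = [b]
    k = [b * (c - 1)]
    l = [p[0] + k[0]]          # l[0] = b + b*(c-1) = b*c
    for i in range(1, n):
        p.append(l[i - 1])
        k.append(p[i] * (c + 1))
        l.append(p[i] + k[i])  # l[i] = p[i] * (c + 2)
    return p, l


def closed_form_sequences(b, c, n=5):
    """p and l written directly as geometric progressions."""
    l = [b * c * (c + 2) ** i for i in range(n)]
    p = [b] + l[:-1]           # p is l shifted by one, starting with b
    return p, l


a, b = 1, 2
c = a / b * 8
p_loop, l_loop = loop_sequences(b, c)
p_closed, l_closed = closed_form_sequences(b, c)
print(all(math.isclose(x, y) for x, y in zip(p_loop, p_closed)))
print(all(math.isclose(x, y) for x, y in zip(l_loop, l_closed)))
```

Both checks should print True if the closed forms match the recurrence.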
So, instead of that loop, you can use np.logspace. Unfortunately, np.logspace doesn't support an array as its base parameter, so we have no other choice but to use np.vectorize, which is just a loop under the hood...
Calculating komp, though, is easily vectorized. You can see it in my example. No need for loops there.
Also, as I already noted in a comment, your class doesn't have to be a class, so I took the liberty of changing it to a function.
Next, note that the input parameter c is overwritten, so I got rid of it. The variable y is never used. (If you did need it for arrays, np.where(a > b, c + a * b, c - a * b) would compute it; the shorter y = c + a * b * np.sign(a - b) gives the same result whenever a != b.)
And finally, I'd like to remark that creating NumPy arrays with np.append is very inefficient (as was pointed out by @kabanus), so you should always try to create them at once: no loops, no appending.
P.S.: I used np.atleast_2d and np.squeeze in my code, and it may be unclear why. They are necessary to avoid if-else clauses that would check the dimensions of array l. You can print the intermediate results to see what is really going on there. Nothing difficult.
If you just need to call the class for each pair of elements, a loop works well:
    res = [blank(i, j, 3) for i, j in zip(np.array([1,2,3]), np.array([3,4,5]))]
The variable res then holds a list with one result per pair.
The only way I can think of to iterate over the arrays is to use a loop in the main program and do the operations you need inside it.
This solution works for each element of both arrays (use the zip function to iterate over both lists in parallel, which is fine if they are small, as noted in this answer):
    for n, x in zip(np.array([1,2,3]), np.array([3,4,5])):
        res = blank(n, x, 3)
        print(res.tot)
Hope it is what you need!

How to differentiate within a class? python beginner

I'm constructing a beam model, but I've run into trouble with the last part, getSlope. The rest should be fine.
    from sympy.mpmath import *

    class beam(object):
        """Model of a beam.
        """
        def __init__(self, E, I, L):
            """The class constructor.
            """
            self.E = E  # Young's modulus of the beam in N/m^2
            self.I = I  # Second moment of area of the beam in m^4
            self.L = L  # Length of the beam in m
            self.Loads = [(0.0, 0.0)]  # the list of loads applied to the beam

        def setLoads(self, Loads):
            '''This function allows multiple point loads to be applied to the beam
            using a list of tuples of the form (load, position)
            '''
            self.Loads = Loads
The above doesn't need any adjustment since it was given.
        def beamDeflection(self, Load, x):
            """A measure of how much the beam bends.
            """
            a = 2.5
            b = a + (x - a)
            (P1, A) = Load
            if 0 <= x <= a:
                v = ((P1*b*x)/(6*self.L*self.E*self.I))*((self.L**2)-(x**2)-(b**2))
            else:
                if a < x <= 5:
                    v = ((P1*b)/(6*self.L*self.E*self.I)) * (((self.L/b)*((x-a)**3)) - (x**3) + (x*((self.L**2) - (b**2))))
            return v
The above function 'beamDeflection' is some simple hardcoding that I've done, where if a single load is placed on the left hand side, then a certain formula is used and if the load is on the other side, then a different formula is used.
        def getTotalDeflection(self, x):
            """A superposition of the deflection.
            """
            return sum(self.beamDeflection(loadall, x) for loadall in self.Loads)
'getTotalDeflection' calculates the total deflection at a point when multiple loads are placed on it.
        def getSlope(self, x):
            """Differentiate 'v' then input a value for x to obtain a result.
            """
            mp.dps = 15
            mp.pretty = True
            theta = sympy.diff(lambda x: self.beamDeflection(self.Loads, x), x)
            return theta

    b = beam(8.0E9, 1.333E-4, 5.0)
    b.setLoads([(900, 3.1), (700, 3.8), (1000, 4.2)])
    print(b.getSlope(1.0))
For this function, I'm supposed to differentiate 'beamDeflection' or 'v' as I defined it while it's under more than one load then input a value for x to find the gradient/slope.
I'm following this: "http://docs.sympy.org/dev/modules/mpmath/calculus/differentiation.html" to differentiate, but it needs a second argument (an integer it seems) for it to work, so I don't think this is the correct method of differentiating it. Could anyone shed some light on this please?
Imports
First things first, get in the good habit of not importing things with a star.
Here's a tangible example from the sympy package.
    >>> from sympy import *  # imports the symbol pi
    >>> type(pi)  # can be used to construct analytical expressions
    <class 'sympy.core.numbers.Pi'>
    >>> 2*pi
    2*pi
    >>> from sympy.mpmath import *  # imports a different pi and shadows the previous one
    >>> type(pi)  # floating point precision of the constant pi
    <class 'sympy.mpmath.ctx_mp_python.constant'>
    >>> 2*pi
    mpf('6.2831853071795862')
Overall, I'd advise you to use from sympy import mpmath as mp, and then you can use anything from that package like so: mp.diff(), mp.pi, etc. (In SymPy 1.0 and later, sympy.mpmath no longer exists; mpmath is a separate package, so use import mpmath as mp instead.)
Differentiation
sympy.mpmath.diff() (or mp.diff() from now on) computes the derivative of a function at some point x. You need to provide at least the two mandatory arguments: a function of x, and x, the point of interest.
If your function were one like getTotalDeflection(), with only an x input, you could pass it as is. For example,
        def getSlope(self, x):
            return mp.diff(self.getTotalDeflection, x)
However, if you want to use a function like beamDeflection(), you'll have to wrap it in a function of x only, while passing the other argument along. For example,
        def getSlope(self, x, load):
            f_of_x = lambda x: self.beamDeflection(load, x)
            return mp.diff(f_of_x, x)
According to the way you've set up the method beamDeflection(), the argument Load is a tuple of two values, i.e. load and position. An example use would be
    b.getSlope(1.0, (900, 3.1))
If you want to get the derivative for a list of loads, you'll have to give it a list of lists (or tuples).
        def getSlope(self, x, loads):
            f_of_x = lambda x: sum(self.beamDeflection(load, x) for load in loads)
            return mp.diff(f_of_x, x)

    b.getSlope(1.0, [(900, 3.1), (700, 3.8)])
Of course, if the loads you want to use are the ones stored in self.Loads, then you can simply use
        def getSlope(self, x):
            return mp.diff(self.getTotalDeflection, x)
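The wrapping pattern used in this answer does not depend on mpmath. Here is a minimal sketch of the same idea in plain Python, with a central finite difference standing in for mp.diff. The names central_difference, deflection and slope are hypothetical, and deflection is a toy formula, not the beam equation from the question.

```python
def central_difference(f, x, h=1e-6):
    """Approximate f'(x); a simple stand-in for mp.diff in this sketch."""
    return (f(x + h) - f(x - h)) / (2 * h)


def deflection(load, x):
    """Toy two-argument function of (load, x), used only for illustration."""
    magnitude, position = load
    return magnitude * (x - position) ** 2


def slope(loads, x):
    # Close over `loads` so the differentiator sees a function of x only,
    # summing the contribution of every load (superposition).
    total = lambda t: sum(deflection(load, t) for load in loads)
    return central_difference(total, x)


# Analytically: 2*2*(2-1) + 2*3*(2-0) = 16
print(slope([(2, 1), (3, 0)], 2.0))
```

The printed value should be close to 16, up to floating point error from the finite difference.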

How can I manipulate cartesian coordinates in Python?

I have a collection of basic cartesian coordinates and I'd like to manipulate them with Python. For example, I have the following box (with coordinates shown as the corners):
0,4---4,4
0,0---4,0
I'd like to be able to find a row that starts with (0,2) and goes to (4,2). Do I need to break up each coordinate into separate X and Y values and then use basic math, or is there a way to process coordinates as an (x,y) pair? For example, I'd like to say:
    New_Row_Start_Coordinate = (0,2) + (0,0)
    New_Row_End_Coordinate = New_Row_Start_Coordinate + (4,0)
Sounds like you're looking for a Point class. Here's a simple one:
    class Point:
        def __init__(self, x, y):
            self.x, self.y = x, y

        def __str__(self):
            return "{}, {}".format(self.x, self.y)

        def __neg__(self):
            return Point(-self.x, -self.y)

        def __add__(self, point):
            return Point(self.x+point.x, self.y+point.y)

        def __sub__(self, point):
            return self + -point
You can then do things like this:
    >>> p1 = Point(1,1)
    >>> p2 = Point(3,4)
    >>> print(p1 + p2)
    4, 5
You can add as many other operations as you need. For a list of all of the methods you can implement, see the Python docs.
Depending on what you want to do with the coordinates, you can also misuse complex numbers, with the real part as x and the imaginary part as y:
    New_Row_Start_Coordinate = (0+2j) + (0+0j)
    New_Row_End_Coordinate = New_Row_Start_Coordinate + (4+0j)
    print(New_Row_End_Coordinate.real)
    print(New_Row_End_Coordinate.imag)
Python doesn't natively support elementwise operations on lists; you could do it via list comprehensions or map but that's a little clunky for this use case. If you're doing a lot of this kind of thing, I'd suggest looking at NumPy.
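For reference, the comprehension-based approach mentioned here can be sketched as follows. The helper name add_points is hypothetical; it pairs up the components with zip and adds them one by one.

```python
def add_points(p, q):
    """Element-wise sum of two equal-length coordinate tuples."""
    return tuple(a + b for a, b in zip(p, q))


start = add_points((0, 2), (0, 0))
end = add_points(start, (4, 0))
print(start, end)

# Plain + on tuples concatenates instead of adding element-wise.
print((0, 2) + (0, 0))
```

Here start is (0, 2) and end is (4, 2), while the plain + on the last line gives (0, 2, 0, 0).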
For a = (0,2) and b = (0,0), a + b will yield (0, 2, 0, 0), which is probably not what you want. I suggest using numpy's add function: http://docs.scipy.org/doc/numpy/reference/generated/numpy.add.html
Parameters : x1, x2 : array_like
Returns: The sum of x1 and x2, element-wise. (...)

Best way to find roots of a multidimensional, scalar function with SciPy

Suppose I have a function whose range is a scalar but whose domain is a vector. For example:
    def func(x):
        return x[0] + 1 + x[1]**2
What's a good way to find a root of this function? scipy.optimize.fsolve and scipy.optimize.root expect func to return a vector (rather than a scalar), and scipy.optimize.newton only takes scalar arguments. I can redefine func as
    def func(x):
        return [x[0] + 1 + x[1]**2, 0]
Then root and fsolve can find a root, but the zeros in the Jacobian mean they won't always do a good job. For example:
    fsolve(func, array([0,2]))
    => array([-5, 2])
It only varies the first parameter and not the second, meaning that it often finds a zero that's far away.
EDIT: it looks like the following redefinition of func works better:
    def func(x):
        fx = x[0] + 1 + x[1]**2
        return [fx, fx]

    fsolve(func, array([0,5]))
    => array([-16.27342781, 3.90812331])
So it's now willing to change both parameters. The code is still kind of ugly though.
Have you tried minimizing the absolute value of your function using fmin?
For example:
    >>> import scipy.optimize as op
    >>> import numpy as np
    >>> def func(x):
    ...     return x[0] + 1 + x[1]**2
    ...
    >>> func1 = lambda x: np.abs(func(x))
    >>> tmp = op.fmin(func1, [10000., 10000.])
    >>> func(tmp)
    0.0
    >>> print(tmp)
    [-8346.12025122 91.35162971]
Since, for my problem, I have a good initial guess and a non-crazy function, Newton's method works well. For a scalar function of a vector argument, the Newton update (as implemented in the code below) becomes:
    $x_{n+1} = x_n - \frac{f(x_n)}{\lVert \nabla f(x_n) \rVert^2}\,\nabla f(x_n)$
Here's a rough code example:
    from numpy import array
    from numpy.linalg import norm

    def func(x):  # the function to find a root of
        return x[0] + 1 + x[1]**2

    def dfunc(x):  # the gradient of that function
        return array([1, 2*x[1]])

    def newtRoot(x0, func, dfunc):
        x = array(x0)
        for n in range(100):  # do at most 100 iterations
            f = func(x)
            df = dfunc(x)
            if abs(f) < 1e-6:  # exit the loop if we're close enough
                break
            x = x - df*f/norm(df)**2  # update guess
        return x
In use:
    newtRoot([0,2], func, dfunc)
    => array([-1.0052546 , 0.07248865])
    func([-1.0052546 , 0.07248865])
    => 4.3788225025098715e-09
Not bad! Of course, this function is very rough, but you get the idea. It also won't work well for "tricky" functions or where you don't have a good starting guess. I think I'll use something like this but then fall back to fsolve or root if Newton's method doesn't converge.
