I've been trying to optimize my computations, and for most operations I've tried, TensorFlow is much faster. Here I'm trying to do a fairly simple operation: transform a matrix (multiply each value by 1/2 and then add 1/2 to it).
With the help of @mrry, I was able to do these operations in TensorFlow. However, to my surprise, the NumPy method was significantly faster?!
TensorFlow seems like an extremely useful tool for data scientists, and I think answering this could help clarify its use and advantages.
Am I not using TensorFlow's data structures and operations in the most efficient way? I'm not sure how non-TensorFlow methods could be faster. I'm using a Mid-2012 MacBook Air with 4GB RAM.
trans1 is the TensorFlow version, while trans2 is NumPy. DF_var is a pandas DataFrame object.
import pandas as pd
import tensorflow as tf
import numpy as np
def trans1(DF_var):
    # Total user time is 31.8532807827 seconds

    # Create placeholder
    T_feed = tf.placeholder(tf.float32, DF_var.shape)

    # Matrix transformation: 0.5 + 0.5 * x
    T_signed = tf.add(
        tf.constant(0.5, dtype=tf.float32),
        tf.mul(T_feed, tf.constant(0.5, dtype=tf.float32))
    )

    # Get rid of the top triangle
    T_ones = tf.constant(np.tril(np.ones(DF_var.shape)), dtype=tf.float32)
    T_tril = tf.mul(T_signed, T_ones)

    # Start graph session
    sess = tf.Session()
    DF_signed = pd.DataFrame(
        sess.run(T_tril, feed_dict={T_feed: DF_var.as_matrix()}),
        columns=DF_var.columns, index=DF_var.index
    )

    # Close graph session
    sess.close()
    return DF_signed

def trans2(DF_var):
    # Total user time is 1.71233415604 seconds
    M_computed = np.tril(np.ones(DF_var.shape)) * (0.5 + 0.5 * DF_var.as_matrix())
    DF_signed = pd.DataFrame(M_computed, columns=DF_var.columns, index=DF_var.index)
    return DF_signed
My timing method was:
import time
start_time = time.time()
#operation
print str(time.time() - start_time)
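As an aside (not from the original post): much of trans1's cost is rebuilding the graph and starting a new tf.Session on every call. A minimal sketch of hoisting that setup out of the timed call, using the same TF 0.x-era API as the question, could look like this:
import numpy as np
import pandas as pd
import tensorflow as tf

def make_trans1(shape):
    # Build the graph once and reuse one session across calls.
    T_feed = tf.placeholder(tf.float32, shape)
    T_signed = tf.add(tf.constant(0.5, dtype=tf.float32),
                      tf.mul(T_feed, tf.constant(0.5, dtype=tf.float32)))
    T_tril = tf.mul(T_signed, tf.constant(np.tril(np.ones(shape)), dtype=tf.float32))
    sess = tf.Session()  # close it yourself when finished

    def run(DF_var):
        result = sess.run(T_tril, feed_dict={T_feed: DF_var.as_matrix()})
        return pd.DataFrame(result, columns=DF_var.columns, index=DF_var.index)

    return run
With this structure only sess.run and the DataFrame construction are timed per call; the remaining gap to trans2 is then mostly the cost of feeding data into and out of the graph.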
Your results are consistent with the benchmarks published by another user. In his benchmark he compared NumPy, Theano, and TensorFlow on an Intel Core i5-4460 CPU with 16 GiB RAM and an Nvidia GTX 970 with 4 GiB RAM, using Theano 0.8.2, TensorFlow 0.11.0, and CUDA 8.0 on Linux Mint 18.
His results for addition, matrix multiplication, and a few other functions are shown as charts in the original post; in summary:
It is clear that the main strengths of Theano and TensorFlow are very
fast dot products and matrix exponents. The dot product is
approximately 8 and 7 times faster respectively with Theano/Tensorflow
compared to NumPy for the largest matrices. Strangely, matrix addition
is slow with the GPU libraries and NumPy is the fastest in these
tests.
The minimum and mean of matrices are slow in Theano and quick in
Tensorflow. It is not clear why Theano is as slow (worse than NumPy)
for these operations.
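For reference, a minimal way to reproduce the NumPy side of such a comparison on your own machine (my sketch; matrix size and repeat count are arbitrary, and the absolute numbers will differ):
import timeit
import numpy as np

n = 2000
A = np.random.rand(n, n)
B = np.random.rand(n, n)

# Element-wise addition vs. dot product on the same pair of matrices.
add_time = timeit.timeit(lambda: A + B, number=20) / 20
dot_time = timeit.timeit(lambda: A.dot(B), number=20) / 20
print("add: %.4f s, dot: %.4f s" % (add_time, dot_time))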
Related
I have a loop in which I'm calculating several pseudoinverses of rather large, non-sparse matrices (e.g. 20000x800).
Since my code spends most of its time in pinv, I've been trying to speed up that computation. I'm already using multiprocessing (joblib/loky) to run several processes, but that of course also adds overhead. Using jit did not help much.
Is there a faster way or a better implementation to compute the pseudoinverse? Precision isn't key.
My current benchmark:
import time
import numba
import numpy as np
from numpy.linalg import pinv as np_pinv
from scipy.linalg import pinv as scipy_pinv
from scipy.linalg import pinv2 as scipy_pinv2
@numba.njit
def np_jit_pinv(A):
    return np_pinv(A)

matrix = np.random.rand(20000, 800)
for pinv in [np_pinv, scipy_pinv, scipy_pinv2, np_jit_pinv]:
    start = time.time()
    pinv(matrix)
    print(f'{pinv.__module__ + "." + pinv.__name__} took {time.time()-start:.3f}')
numpy.linalg.pinv took 2.774
scipy.linalg.basic.pinv took 1.906
scipy.linalg.basic.pinv2 took 1.682
__main__.np_jit_pinv took 2.446
EDIT:
JAX turns out to be noticeably faster (roughly 40% faster than the best SciPy call in the run below). Impressive! Thanks for letting me know, @yuri-brigance. On Windows it works well under WSL.
numpy.linalg.pinv took 2.774
scipy.linalg.basic.pinv took 1.906
scipy.linalg.basic.pinv2 took 1.682
__main__.np_jit_pinv took 2.446
jax._src.numpy.linalg.pinv took 0.995
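Since precision isn't critical and the matrix is tall and skinny (20000x800), another shortcut worth benchmarking (my own sketch, not from the post; it assumes the matrix has full column rank and is less numerically robust than an SVD-based pinv) is building the pseudoinverse from the normal equations:
import numpy as np

def fast_pinv_tall(A):
    # For full-column-rank A with m >> n: pinv(A) = (A^T A)^-1 A^T.
    # Solving one n x n system is much cheaper than an SVD of the m x n matrix.
    AtA = A.T @ A                      # (800, 800)
    return np.linalg.solve(AtA, A.T)   # X solving AtA @ X = A^T, i.e. pinv(A)

A = np.random.rand(20000, 800)
P = fast_pinv_tall(A)
print(np.allclose(P @ A, np.eye(800), atol=1e-6))  # sanity check: P is a left inverse
For a 20000x800 input this replaces the large SVD with one 800x800 solve, which is typically much faster when the matrix is reasonably conditioned.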
Try with JAX:
import jax.numpy as jnp
jnp.linalg.pinv(A)
Seems to be slightly faster than regular numpy.linalg.pinv. On my machine your benchmark looks like this:
jax._src.numpy.linalg.pinv took 3.127
numpy.linalg.pinv took 4.284
CPU: i7-9750 @ 2.6GHz (16GB DDR4 RAM); GPU: Nvidia GeForce GTX 1660 Ti (6GB); OS: Windows 10 64-bit
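One timing caveat worth adding here (my note, not part of the answer): JAX dispatches work asynchronously, so a benchmark should block on the result before stopping the clock, and the first call also pays one-time compilation and transfer costs:
import time
import numpy as np
import jax.numpy as jnp

A = np.random.rand(20000, 800)

start = time.time()
P = jnp.linalg.pinv(A)
P.block_until_ready()  # wait for the asynchronous computation to actually finish
print(f'jax pinv took {time.time() - start:.3f}')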
I wanted to see how fast the GPU is at basic matrix operations compared with the CPU, and I basically followed this https://towardsdatascience.com/heres-how-to-use-cupy-to-make-numpy-700x-faster-4b920dda1f56. The following is my super simple code:
import numpy as np
import cupy as cp
import time
### Numpy and CPU
s = time.time()
A = np.random.random([10000,10000]); B = np.random.random([10000,10000])
CPU = np.matmul(A,B); CPU *= 5
e = time.time()
print(f'CPU time: {e - s: .2f}')
### CuPy and GPU
s = time.time()
C= cp.random.random([10000,10000]); D = cp.random.random([10000,10000])
GPU = cp.matmul(C,D); GPU *= 5
cp.cuda.Stream.null.synchronize()
# to let the code finish executing on the GPU before calculating the time
e = time.time()
print(f'GPU time: {e - s: .2f}')
Ironically, it shows
CPU time: 11.74
GPU time: 12.56
This really confuses me. How could the GPU be even slower than the CPU on large matrix operations? Note that I haven't even applied parallel computing explicitly (I am a beginner and I am not sure whether the system does that for me). I did check similar questions such as Why is my CPU doing matrix operations faster than GPU instead?, but here I am using CuPy rather than mxnet (CuPy is newer and designed for GPU computing).
Can someone help? I would really appreciate it!
Both np.random.random and cp.random.random generate float64 (double-precision) values by default, and consumer GeForce GPUs have far lower float64 throughput than float32 throughput, so this matmul doesn't play to the GPU's strength. To see what the GPU can do, generate the GPU arrays as float32:
C= cp.random.random([10000,10000], dtype=cp.float32)
D = cp.random.random([10000,10000], dtype=cp.float32)
I have different hardware (both CPU and GPU) than you, but once this change is made the GPU version is about 12x faster than the CPU version. Generating both ndarrays of random numbers, the matrix multiplication, and the scalar multiplication take less than one second in total with CuPy.
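A minimal dtype-matched version of the comparison (my sketch; it assumes a CUDA-capable GPU with CuPy installed, and the absolute numbers will vary with hardware):
import time
import numpy as np
import cupy as cp

shape = (10000, 10000)

# CPU, float32
A = np.random.random(shape).astype(np.float32)
B = np.random.random(shape).astype(np.float32)
s = time.time()
C_cpu = np.matmul(A, B)
print(f'CPU float32 matmul: {time.time() - s:.2f} s')

# GPU, float32
A_gpu = cp.random.random(shape, dtype=cp.float32)
B_gpu = cp.random.random(shape, dtype=cp.float32)
cp.cuda.Stream.null.synchronize()   # finish generation before starting the clock
s = time.time()
C_gpu = cp.matmul(A_gpu, B_gpu)
cp.cuda.Stream.null.synchronize()   # wait for the GPU kernel to finish
print(f'GPU float32 matmul: {time.time() - s:.2f} s')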
I have been calculating the Fiedler eigenvector for matrices up to 10K x 10K in size using numpy/scipy eigendecomposition operators in Python. But I want to scale to larger matrices (e.g., 100K and more), and perform the calculation as fast as possible. For 10K, it takes me a few minutes to do the eigendecomposition. Here is the code I am currently using:
import numpy as np
from scipy.linalg import eigh

w, v = eigh(lapMat)
sortWInds = np.argsort(w)
fVec = v[:, sortWInds[1]]
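If lapMat fits in memory as (or can be built as) a sparse matrix, one alternative worth noting here (my sketch, not part of the original question) is to compute only the two smallest eigenpairs with SciPy's shift-invert Lanczos solver instead of a full dense eigendecomposition:
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

L = sp.csr_matrix(lapMat)   # lapMat: the (symmetric, PSD) Laplacian from above

# Shift-invert around a tiny sigma > 0 targets the eigenvalues closest to zero;
# the small shift avoids factorizing the exactly singular Laplacian.
w, v = eigsh(L, k=2, sigma=1e-8, which='LM')

order = np.argsort(w)
fVec = v[:, order[1]]       # Fiedler eigenvector (second-smallest eigenvalue)
This is usually far cheaper than the dense eigh above, although the sparse factorization behind shift-invert still costs memory on very large graphs.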
From the little I understand about Spark, the eigendecomposition operator still requires a lot of cross-talk between cores in a distributed system. I ran some tests through a contractor and didn't see the speed-up I was hoping for using Spark on a multi-core AWS AMI. Here is the main code used to perform the SVDs on an AWS Linux Ubuntu AMI:
# Benchmarking setup - configure here for the benchmark size required
sizes = [10000]
cores = [64]
max_cores_for_a_single_executor = 8

from pyspark import SparkContext, SparkConf
from pyspark.mllib.linalg.distributed import RowMatrix
from datetime import datetime
from pyspark.mllib.random import RandomRDDs
from pyspark.mllib.linalg import Vectors

# Iterate over matrix sizes
for size in sizes:
    # Iterate over number of cores used
    for core in cores:
        # Calculating spark configuration for a distributed setup
        executor_cores = max_cores_for_a_single_executor if core > max_cores_for_a_single_executor else core
        executors = 1 if core / max_cores_for_a_single_executor == 0 else core / max_cores_for_a_single_executor

        # Initializing Spark
        conf = SparkConf().setAppName("SVDBenchmarking")\
            .set("spark.executor.cores", executor_cores)\
            .set("spark.executor.instances", executors)\
            .set("spark.dynamicAllocation.enabled", "false")\
            .set("spark.driver.maxResultSize", "25g")\
            .set("spark.executor.memory", "60g")
        sc = SparkContext.getOrCreate(conf=conf)

        start = datetime.now()

        # Input matrix of specific size generated and saved earlier
        inputRdd = sc.textFile("hdfs://ip-172-31-34-253.us-west-2.compute.internal:8020/data/input" + str(size))
        inputRdd = sc.textFile("/Users/data/input" + str(size))

        # textToVector and extract are user-defined parsing helpers (not shown)
        intermid2 = inputRdd\
            .map(lambda x: textToVector(x))\
            .sortByKey()\
            .map(lambda x: extract(x))
        mat = RowMatrix(intermid2)

        # Step 2: running the SVD
        svd = mat.computeSVD(size, computeU=True)
        U = svd.U  # The U factor is a RowMatrix.
        s = svd.s  # The singular values are stored in a local dense vector.
        V = svd.V  # The V factor is a local dense matrix.

        # Stopping the clock for the benchmark
        end = datetime.now()
Given that the eigenstructure of a matrix is key for recommendation algorithms, there must be "community" efforts available for calculating SVDs far faster than a single-core approach in NumPy/SciPy.
There have also been recent efforts at multigrid algorithms for explicitly calculating the Fiedler eigenvector (Urschel 2014). I think at one time he made MATLAB code available.
Does anyone have any pointers to 1) the state of the art for rapidly calculating non-dominant eigenvectors/singular vectors (such as the Fiedler eigenvector) of large matrices, 2) public codebases for performing these operations, preferably in Python or at least callable from Python, and 3) recommended architectures for calculations on matrices of this size or bigger (> 10K) that don't swamp the RAM?
Thanks (and humbly),
Nirmal
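On point 2), one public building block worth knowing (my sketch, not an answer from the original thread): SciPy's LOBPCG solver iterates toward the smallest eigenpairs of a sparse Laplacian and pairs naturally with an algebraic multigrid preconditioner (e.g. from pyamg), which is close in spirit to the multigrid work cited above:
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import lobpcg

L = sp.csr_matrix(lapMat)           # lapMat: the Laplacian from the question
X = np.random.rand(L.shape[0], 2)   # random initial block for the 2 smallest eigenpairs

# Without a preconditioner convergence can be slow; passing an AMG preconditioner
# via the M= argument usually speeds this up considerably.
w, v = lobpcg(L, X, largest=False, tol=1e-6, maxiter=500)
fVec = v[:, np.argsort(w)[1]]       # Fiedler eigenvector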
I am trying to improve the performance of NumPy in Python 3.6 using Intel's MKL. With a fresh Anaconda installation, I created an MKL environment using:
conda create -n idp intelpython3_core python=3
As written in this article, it seems that MKL has internal thresholds to decide whether to use threading or not. One of these thresholds appears to be the vector size used in the calculations (which is kind of obvious). This threshold is set to a vector size of 8192 (at least on my machine). When vectors exceed this size, I can observe my Python scripts using 4 threads (I have 2 cores with hyper-threading) for calculations like:
import numpy as np
x = np.random.rand(8193)
y = np.sin(x)
So far everything is working as intended.
Besides the threading part, MKL "Features highly optimized, threaded, and vectorized math functions that maximize performance on each processor family" (read here). Since the problems I usually work on do not exceed the vector-size threshold, I'm not interested in the performance increase obtained by threading, but rather in the optimized math functions of MKL. Unfortunately, it seems like those are only used when the vector size is above the threshold.
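As a quick sanity check (my addition, not part of the question), you can confirm which BLAS/LAPACK backend your NumPy build is linked against and which MKL version the mkl-service module reports:
import numpy as np
import mkl  # provided by the mkl-service package in the Intel distribution

np.show_config()                  # should list MKL-related libraries for an MKL build
print(mkl.get_version_string())   # reports the linked MKL version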
I've written a sample code to measure the performance of the sine operation on vectors with different sizes:
from timeit import default_timer as timer
import mkl
import numpy as np

mkl.set_num_threads(1)
print("MKL threads:%i" % mkl.get_max_threads())
np.random.seed(0)

Nop = int(1e4)

def func(x):
    return np.sin(x)

def measure(x):
    t1 = timer()
    for i in range(0, Nop):
        func(x)
    t2 = timer()
    diff = (t2 - t1) * 1000.0
    print("vec size: %i:" % len(x), end="")
    print("\t time needed: %f ms" % diff)

x0 = np.random.rand(20000)
measure(np.array(x0[:8192]))
measure(np.array(x0[:8193]))
measure(np.array(x0[:8192]))
These lines:
import mkl
mkl.set_num_threads(1)
print("MKL threads:%i" % mkl.get_max_threads())
are just there to make sure that the increase in performance is not due to threading (I also checked the CPU usage; it is indeed only using one thread).
I get these results:
vec size: 8192: time needed: 8185.900477 ms
vec size: 8193: time needed: 436.843237 ms
vec size: 8192: time needed: 1777.306942 ms
As you can see, the 8193-element vector runs roughly 20x faster than the 8192-element one. What is even more confusing is that the second run of the 8192-vector is about 4x faster than the first one, after doing the calculation on the bigger vector.
Now my questions:
1. Am I doing anything obviously wrong that I am not aware of, which leads to these results?
2. Can anyone reproduce these results, or is it just my installation/my machine behaving like this?
3. Is the increase in performance really due to an optimized implementation of sine?
4. Is it possible to enforce always using the optimized version of sine, independent of the vector size?
PS:
I actually tried the following in the simulation I'm running for my master's thesis, which involves a lot of sine and cosine function calls. I just added this line before anything else is calculated:
np.sin(np.zeros(8193))
And now everything runs 50% faster.
I am working on a project in Python. For certain reasons, I have to call MATLAB for some calculations. My environment:
ubuntu 14.04 64bit
python 2.7.6
numpy 1.11.1
matlab 2016a linux-64bit
import matlab
import matlab.engine
import numpy as np
import time
data = np.random.rand(1000, 100, 100)
print ('pass begin')
st = time.time()
data_matlab = matlab.double(data.tolist())
print ('pass numpy to matlab finished in {:.2f} sec'.format(time.time() - st))
Passing a float64 NumPy array with shape (1000, 100, 100) to a MATLAB array takes 63.49 seconds. This is unacceptable. Is there any efficient way to pass a big NumPy array to a MATLAB array in Python?
pass begin
pass numpy to matlab finished in 63.49 sec
Starting with MATLAB R2022a, this operation is at least an order of magnitude faster than in previous releases. When I run the code sample above on a Windows 10 machine, the operation takes consistently less than 2 seconds now as opposed to the more than 63 seconds reported in the original question. See release notes under "Performance"/"MATLAB Engine API for Python: Improved performance with large multidimensional arrays in Python".
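For releases older than R2022a, a common workaround (a sketch with placeholder file and variable names, not an official recommendation) is to hand the array over through a .mat file instead of converting it element by element with tolist():
import numpy as np
import scipy.io
import matlab.engine

data = np.random.rand(1000, 100, 100)

# Write the array once with SciPy, then load it inside MATLAB.
scipy.io.savemat('data.mat', {'data': data})

eng = matlab.engine.start_matlab()
eng.eval("load('data.mat')", nargout=0)    # 'data' is now a 1000x100x100 double in MATLAB
eng.eval("disp(size(data))", nargout=0)
scipy.io.savemat writes the N-D double array directly, and MATLAB's load reads it without the Python-side list conversion that dominates the original timing.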