MPI python - Open MPI

I have a 20,000*515 numpy matrix representing biological data. I need to find the correlation coefficients of the biological data, meaning that as a result I would have a 20,000*20,000 matrix of correlation values. Then I populate a numpy array with 1's and 0's depending on whether each correlation coefficient is greater than a threshold value.
I used numpy.corrcoef to find the correlation coefficients, and it works well on my machine.
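In sketch form (with Data already loaded as the 20,000*515 array, and 0.9 as my threshold), this is what I do:
import numpy

# Data: 20,000 x 515; numpy.corrcoef correlates the rows
corr = numpy.corrcoef(Data)  # 20,000 x 20,000, ~3.2 GB in float64
adjacency = (numpy.absolute(corr) >= 0.9).astype(numpy.int8)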
Then I wanted to run it on the cluster (10 computers, with node counts varying from 2 to 8). Each node generates 40 random numbers and takes those 40 random columns from the biological data, resulting in a 20,000*40 matrix per node. But I ran into a memory issue:
mpirun noticed that process rank # with PID # on node name exited on signal 9 (Killed).
Then I tried rewriting the program to work row by row: for each row, find the correlation coefficients, and if a value is greater than the threshold store 1 in the matrix, otherwise 0, rather than building the whole correlation matrix.
But that program takes about 1 h 30 min to run, and I need to run it 100 times.
Can anyone please suggest a better way of doing this, e.g. how to solve the memory issue by allocating jobs as each core finishes its current one? I am new to MPI. My code is below.
If you need any more information, please let me know.
Thank you
import numpy
from mpi4py import MPI
import time

Size = MPI.COMM_WORLD.Get_size()
Rank = MPI.COMM_WORLD.Get_rank()
Name = MPI.Get_processor_name()

RandomNumbers = {}
rndm_indx = numpy.random.choice(range(515), 40, replace=False)
rndm_indx = numpy.sort(rndm_indx)

Data = numpy.genfromtxt('MyData.csv', usecols=rndm_indx)
RandomNumbers[Rank] = rndm_indx

CORR_CR = numpy.zeros((Data.shape[0], Data.shape[0]))

start = time.time()
for i in range(0, Data.shape[0]):
    Data[i] = Data[i] - numpy.mean(Data[i])
    alpha1 = 1. / numpy.linalg.norm(Data[i])
    for j in range(i, Data.shape[0]):
        if i == j:
            CORR_CR[i][j] = 1
        else:
            Data[j] = Data[j] - numpy.mean(Data[j])
            alpha2 = 1. / numpy.linalg.norm(Data[j])
            corr = numpy.inner(Data[i], Data[j]) * (alpha1 * alpha2)
            corr = int(numpy.absolute(corr) >= 0.9)
            CORR_CR[i][j] = CORR_CR[j][i] = corr
end = time.time()

CORR_CR = CORR_CR - numpy.eye(CORR_CR.shape[0])
elapsed = (end - start)
print('Total Time', elapsed)

The execution time of the program you posted is about 96s on my computer. Let's optimize a couple of things before exploring parallel computations.
Let's store the norms of the vectors to avoid recomputing them each time they are needed. Moving alpha1=1./numpy.linalg.norm(Data[i]); out of the second loop is a good starting point. Since the vectors do not change during the computation, their norms can be computed in advance:
alpha = numpy.zeros(Data.shape[0])
for i in range(0, Data.shape[0]):
    Data[i] = Data[i] - numpy.mean(Data[i])
    alpha[i] = 1. / numpy.linalg.norm(Data[i])

for i in range(0, Data.shape[0]):
    for j in range(i, Data.shape[0]):
        if i == j:
            CORR_CR[i][j] = 1
        else:
            corr = numpy.inner(Data[i], Data[j]) * (alpha[i] * alpha[j])
            corr = int(numpy.absolute(corr) >= 0.9)
            CORR_CR[i][j] = CORR_CR[j][i] = corr
The computation time is already down to 17s.
Assuming that the vectors are not highly correlated, most of the correlation coefficients will be rounded to zero, so the matrix is likely to be sparse (full of zeros). Let's use the scipy.sparse.coo_matrix sparse matrix format, which is very easy to populate: the non-zero items and their coordinates i,j are stored in lists.
data = []
ii = []
jj = []
...
if corr != 0:
    data.append(corr)
    ii.append(i)
    jj.append(j)
    data.append(corr)
    ii.append(j)
    jj.append(i)
...
CORR_CR = scipy.sparse.coo_matrix((data, (ii, jj)),
                                  shape=(Data.shape[0], Data.shape[0]))
The computation time is down to 13s (a modest improvement), and the memory footprint is greatly reduced. The sparse format would be a major improvement if larger datasets were to be considered.
for loops in Python are pretty inefficient (see For loop in Python is 10x slower than Matlab, for instance). But there are plenty of ways around this, such as vectorized operations (a sketch follows) or optimized iterators like those provided by numpy.nditer. One of the reasons for loops are inefficient is that Python is an interpreted language: no compilation occurs in the process. Hence, to overcome this problem, the trickiest part of the code can be compiled with a tool like Cython.
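For illustration, here is what the fully vectorized route would look like for this problem (a sketch; note that it materializes the dense 20,000 x 20,000 product, so it speeds things up without addressing the memory issue):
import numpy

# Center each row, scale it to unit norm, then obtain every pairwise
# correlation at once from a single matrix product.
centered = Data - Data.mean(axis=1, keepdims=True)
normed = centered / numpy.linalg.norm(centered, axis=1, keepdims=True)
CORR_CR = (numpy.absolute(numpy.dot(normed, normed.T)) >= 0.9).astype(numpy.int8)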
Here, the critical part of the code is written in Cython, in a dedicated file correlator.pyx.
This file is turned into a correlator.c file by Cython.
That file is compiled by your favorite C compiler (gcc) to build a shared library correlator.so.
The optimized function can then be used in your program after import correlator.
The content of correlator.pyx, adapted from Numpy vs Cython speed, looks like:
import numpy
cimport numpy
cimport scipy.linalg.cython_blas
ctypedef numpy.float64_t DTYPE_t
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.nonecheck(False)
def process(numpy.ndarray[DTYPE_t, ndim=2] array, numpy.ndarray[DTYPE_t, ndim=1] alpha, int imin, int imax):
    cdef unsigned int rows = array.shape[0]
    cdef int cols = array.shape[1]
    cdef unsigned int row, row2
    cdef int one = 1
    ii = []
    jj = []
    data = []
    for row in range(imin, imax):
        for row2 in range(row, rows):
            if row == row2:
                data.append(0)
                ii.append(row)
                jj.append(row2)
            else:
                corr = scipy.linalg.cython_blas.ddot(&cols, &array[row, 0], &one, &array[row2, 0], &one) * alpha[row] * alpha[row2]
                corr = int(numpy.absolute(corr) >= 0.9)
                if corr != 0:
                    data.append(corr)
                    ii.append(row)
                    jj.append(row2)
                    data.append(corr)
                    ii.append(row2)
                    jj.append(row)
    return ii, jj, data
where scipy.linalg.cython_blas.ddot() will perform the inner product.
To cythonize and compile the .pyx file, the following makefile can be used (I hope you are using Linux...)
all: correlator correlatorb

correlator: correlator.pyx
	cython -a correlator.pyx

correlatorb: correlator.c
	gcc -shared -pthread -fPIC -fwrapv -O2 -Wall -fno-strict-aliasing -I/usr/include/python2.7 -o correlator.so correlator.c
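As an alternative to the makefile (an assumption on my part, not required), a minimal setup.py can drive the same build via python setup.py build_ext --inplace:
# setup.py -- hypothetical equivalent of the makefile above
from distutils.core import setup
from distutils.extension import Extension
from Cython.Build import cythonize
import numpy

extensions = [Extension("correlator", ["correlator.pyx"],
                        include_dirs=[numpy.get_include()])]  # headers for 'cimport numpy'
setup(ext_modules=cythonize(extensions))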
The new correlation function is called in the main python file by:
import correlator
ii,jj,data=correlator.process(Data,alpha,0,Data.shape[0])
By using a compiled loop, the time is down to 5.4s: already ten times faster! Moreover, these computations are still performed by a single process!
Let's introduce parallel computations.
Please notice that two arguments were added to the function process: imin and imax. These arguments delimit the range of rows of CORR_CR that are computed, in anticipation of the parallel computation. Indeed, a straightforward way to parallelize the program is to split the outer for loop (index i) across the different processes.
Each process will perform the outer for loop for a particular range of the index i, chosen so as to balance the workload between the processes.
The program has to complete the following operations:
Process 0 (the "root" process) reads the vectors Data from the file.
The Data is broadcast to all processes using the MPI function bcast().
The range of indexes i to be considered by each process is computed.
The correlation coefficients are computed by each process for its range of indexes. The non-zero terms of the matrix are stored in lists data, ii, jj on each process.
These lists are gathered on the root process by calling the MPI function gather(). This produces three lists of Size lists each, which are concatenated to get the 3 lists required to create the sparse adjacency matrix.
Here goes the code:
import numpy
from mpi4py import MPI
import time
import scipy.sparse
import warnings
warnings.simplefilter('ignore', scipy.sparse.SparseEfficiencyWarning)

Size = MPI.COMM_WORLD.Get_size()
Rank = MPI.COMM_WORLD.Get_rank()
Name = MPI.Get_processor_name()

# a dummy set of data is created here.
# Samples such that (i-j)%10==0 are highly correlated.
RandomNumbers = {}
rndm_indx = numpy.random.choice(range(515), 40, replace=False)
rndm_indx = numpy.sort(rndm_indx)

Data = numpy.ascontiguousarray(numpy.zeros((2000, 515), dtype=numpy.float64))
if Rank == 0:
    # Data = numpy.genfromtxt('MyData.csv', usecols=rndm_indx)
    Data = numpy.ascontiguousarray(numpy.random.rand(2000, 515))
    lin = numpy.linspace(0., 1., 515)
    for i in range(Data.shape[0]):
        Data[i] += numpy.sin((1 + i % 10) * 10 * lin) * 100

start = time.time()

# broadcasting the matrix
Data = MPI.COMM_WORLD.bcast(Data, root=0)
RandomNumbers[Rank] = rndm_indx
print Data.shape[0]

# an array to store the inverse of the norm of each sample
alpha = numpy.zeros(Data.shape[0], dtype=numpy.float64)
for i in range(0, Data.shape[0]):
    Data[i] = Data[i] - numpy.mean(Data[i])
    if numpy.linalg.norm(Data[i]) == 0:
        print "process " + str(Rank) + " is facing a big problem"
    else:
        alpha[i] = 1. / numpy.linalg.norm(Data[i])

# Balancing the computational load between the processes:
# each process must perform about Data.shape[0]*Data.shape[0]/(2*Size) correlations.
# Each process cares for a set of rows; of course, the last ranks must care
# about more rows than the first ones.
ilimits = numpy.zeros(Size + 1, numpy.int32)
if Rank == 0:
    nbtaskperprocess = Data.shape[0] * Data.shape[0] / (2 * Size)
    icurr = 0
    for i in range(Size):
        nbjob = 0
        while nbjob < nbtaskperprocess and icurr <= Data.shape[0]:
            nbjob += (Data.shape[0] - icurr)
            icurr += 1
        ilimits[i + 1] = icurr
    ilimits[Size] = Data.shape[0]
ilimits = MPI.COMM_WORLD.bcast(ilimits, root=0)

# the "local" job has been cythonized in file correlator.pyx
import correlator
ii, jj, data = correlator.process(Data, alpha, ilimits[Rank], ilimits[Rank + 1])

# gathering the matrix inputs from every process on the root process
data = MPI.COMM_WORLD.gather(data, root=0)
ii = MPI.COMM_WORLD.gather(ii, root=0)
jj = MPI.COMM_WORLD.gather(jj, root=0)

if Rank == 0:
    # concatenate the lists
    data2 = sum(data, [])
    ii2 = sum(ii, [])
    jj2 = sum(jj, [])
    # create the adjacency matrix
    CORR_CR = scipy.sparse.coo_matrix((data2, (ii2, jj2)),
                                      shape=(Data.shape[0], Data.shape[0]))
    print CORR_CR

end = time.time()
elapsed = (end - start)
print('Total Time', elapsed)
By running mpirun -np 4 main.py, the computation time is 3.4s. That's not 4 times faster... This is likely because the bottleneck of the computation is the scalar products, which require a large memory bandwidth, and my personal computer is really limited regarding memory bandwidth...
Last comment: there are plenty of possibilities for improvement.
- The vectors in Data are copied to every process, which affects the memory required by the program. Dispatching the computation in a different way and trading memory against communications could overcome this problem.
- Each process still computes the norms of all the vectors. That could be improved by having each process compute the norms of some of the vectors and using the MPI function allreduce() on alpha; a sketch of this idea follows.
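A sketch of that second idea (untested; every process still centers all rows, but computes only a round-robin share of the norms before the partial alpha arrays are summed):
# every process centers all rows, as before
for i in range(0, Data.shape[0]):
    Data[i] = Data[i] - numpy.mean(Data[i])
# each process fills only the entries i with i % Size == Rank
alpha = numpy.zeros(Data.shape[0], dtype=numpy.float64)
for i in range(Rank, Data.shape[0], Size):
    alpha[i] = 1. / numpy.linalg.norm(Data[i])
# sum the partial arrays so every process ends up with the full alpha
MPI.COMM_WORLD.Allreduce(MPI.IN_PLACE, alpha, op=MPI.SUM)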
What to do with this adjacency matrix?
I think you already know the answer to this question, but you can provide this adjacency matrix to sparse.csgraph routines such as connected_components() or laplacian(). Indeed, you are not very far from spectral clustering!
For instance, if the clusters are obvious, using connected_components() is sufficient:
import scipy.sparse.csgraph  # needed in addition to the imports above

if Rank == 0:
    # coo to csr format
    S = CORR_CR.tocsr()
    print S
    n_components, labels = scipy.sparse.csgraph.connected_components(S,
            directed=False, connection='weak', return_labels=True)
    print "number of different families " + str(n_components)
    numpy.set_printoptions(threshold=numpy.nan)
    print labels
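If the clusters are not that obvious, a possible next step (a sketch on my part, not part of the solution above: spectral embedding via the graph Laplacian, to be fed to e.g. k-means) would be:
import scipy.sparse.csgraph
import scipy.sparse.linalg

# build the normalized graph Laplacian of the adjacency matrix
L = scipy.sparse.csgraph.laplacian(S.astype(numpy.float64), normed=True)
# eigenvectors of the smallest eigenvalues embed the nodes;
# which='SM' is simple but slow, shift-invert being the faster option
vals, vecs = scipy.sparse.linalg.eigsh(L, k=5, which='SM')
# each row of 'vecs' is now a coordinate vector for a clustering step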

Related

For loop vs Numpy vectorization computation time

I was comparing the computation times of an explicit for-loop with a vectorized implementation in numpy. I ran exactly 1 million iterations and found some astounding differences: the for-loop took about 646ms, while the np.exp() function computed the same result in less than 20ms.
import time
import math
import numpy as np

iter = 1000000
x = np.zeros((iter, 1))
v = np.random.randn(iter, 1)

before = time.time()
for i in range(iter):
    x[i] = math.exp(v[i])
after = time.time()
print(x)
print("Non vectorized= " + str((after - before) * 1000) + "ms")

before = time.time()
x = np.exp(v)
after = time.time()
print(x)
print("Vectorized= " + str((after - before) * 1000) + "ms")
The result I got:
[[0.9256753 ]
[1.2529006 ]
[3.47384978]
...
[1.14945181]
[0.80263805]
[1.1938528 ]]
Non vectorized= 646.1577415466309ms
[[0.9256753 ]
[1.2529006 ]
[3.47384978]
...
[1.14945181]
[0.80263805]
[1.1938528 ]]
Vectorized= 19.547224044799805ms
My questions are:
- What exactly is happening in the second case? The first one is using an explicit for-loop, and thus the computation time is justified. What is happening "behind the scenes" in the second case?
- How can one implement such computations (the second case) without using numpy, in plain Python?
What is happening is that NumPy is calling high-quality numerical libraries (BLAS, for instance) which are very good at vector arithmetic.
I imagine you could specifically call the exact libraries used by NumPy; however, NumPy likely knows best which ones to use.
NumPy is a Python wrapper over libraries and code written in C; this is a large part of the efficiency of NumPy. C code compiles directly to instructions that are executed by your processor or GPU. On the other hand, Python code must be interpreted as it executes. Despite the ever-increasing speed we can get from interpreted languages with advances like just-in-time compilers, for some tasks they will never be able to approach the speed of compiled languages.
It comes down to the fact that Python does not have direct access to the hardware level.
Python can't use the SIMD (single instruction, multiple data) instructions that most modern CPUs and GPUs have. These SIMD instructions allow a single operation to execute on a vector of data all at once (within a single clock cycle) at the hardware level.
NumPy, on the other hand, has functions built in C, and C is a language capable of running SIMD instructions. Therefore NumPy can take advantage of the vectorization hardware in your processor.
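Regarding the third question: in plain Python the loop can at best be pushed into a builtin such as map or a comprehension, but it stays interpreter-bound. A quick sketch:
import math

v_list = [0.5, -1.2, 3.0]                    # plain-list stand-in for the array
x_list = [math.exp(val) for val in v_list]   # or: list(map(math.exp, v_list))
# still one interpreted call per element -- no SIMD, so nowhere near np.exp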

Fast Calculation of Fiedler Eigenvector for Large Matrices in Python

I have been calculating the Fiedler eigenvector for matrices up to 10K x 10K in size using the numpy/scipy eigendecomposition operators in Python. But I want to scale to larger matrices (e.g., 100K and more) and perform the calculation as fast as possible. For 10K, it takes me a few minutes to do the eigendecomposition. Here is the code I am currently using:
from numpy import argsort
from scipy.linalg import eigh

w, v = eigh(lapMat)
sortWInds = argsort(w)
fVec = v[:, sortWInds[1]]
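For reference, a sparse-solver sketch of the same computation (an assumption that lapMat is mostly zeros, as graph Laplacians usually are; untested at 100K scale):
import numpy as np
import scipy.sparse
import scipy.sparse.linalg

lapSparse = scipy.sparse.csr_matrix(lapMat)   # keep only the nonzeros
# which='SM' targets the smallest eigenvalues; shift-invert is faster
w, v = scipy.sparse.linalg.eigsh(lapSparse, k=2, which='SM')
fVec = v[:, np.argsort(w)[1]]                 # second-smallest eigenpair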
From the little I understand about Spark, the eigendecomposition operator still requires a lot of cross-talk between cores in a distributed system. I ran some tests through a contractor and didn't see the speed-up I was hoping for using Spark on a multi-core AWS AMI. Here is the main code used to perform SVDs on an AWS Linux Ubuntu AMI:
# Benchmarking setup - configure here for the benchmark size required
sizes = [10000]
cores = [64]
max_cores_for_a_single_executor = 8

from pyspark import SparkContext, SparkConf
from pyspark.mllib.linalg.distributed import RowMatrix
from datetime import datetime
from pyspark.mllib.random import RandomRDDs
from pyspark.mllib.linalg import Vectors

# Iterate over matrix sizes
for size in sizes:
    # Iterate over number of cores used
    for core in cores:
        # Calculating spark configuration for a distributed setup
        executor_cores = max_cores_for_a_single_executor if core > max_cores_for_a_single_executor else core
        executors = 1 if core / max_cores_for_a_single_executor == 0 else core / max_cores_for_a_single_executor

        # Initializing Spark
        conf = SparkConf().setAppName("SVDBenchmarking")\
            .set("spark.executor.cores", executor_cores)\
            .set("spark.executor.instances", executors)\
            .set("spark.dynamicAllocation.enabled", "false")\
            .set("spark.driver.maxResultSize", "25g")\
            .set("spark.executor.memory", "60g")
        sc = SparkContext.getOrCreate(conf=conf)

        start = datetime.now()

        # Input matrix of specific size generated and saved earlier
        inputRdd = sc.textFile("hdfs://ip-172-31-34-253.us-west-2.compute.internal:8020/data/input" + str(size))
        inputRdd = sc.textFile("/Users/data/input" + str(size))
        intermid2 = inputRdd\
            .map(lambda x: textToVector(x))\
            .sortByKey()\
            .map(lambda x: extract(x))
        mat = RowMatrix(intermid2)

        # Step 2: running SVD
        svd = mat.computeSVD(size, computeU=True)
        U = svd.U  # The U factor is a RowMatrix.
        s = svd.s  # The singular values are stored in a local dense vector.
        V = svd.V  # The V factor is a local dense matrix.

        # Stopping clock for benchmark
        end = datetime.now()
Given that the eigenstructure of a matrix is key for recommendation algorithms, there must be "community" efforts available for calculating SVDs far faster than a single-core approach in numpy/scipy.
There have also been recent efforts at multigrid algorithms for explicitly calculating the Fiedler eigenvector (Urschel 2014). I think at one time he made Matlab code available.
Does anyone have any pointers to 1) the state of the art for rapidly calculating non-dominant eigenvectors/singular vectors (such as the Fiedler eigenvector) of large matrices, 2) public codebases for performing these operations, preferably in Python, or at least callable from Python, and 3) recommended architectures for calculations on matrices of this size or bigger (>10K) that don't swamp the RAM?
Thanks (and humbly),
Nirmal

Why is numpy faster than my c/c++ code for summing an array of float?

I was testing the efficiency of my simple shared C library and comparing it with the numpy implementation.
Library creation: The following function is defined in sum_function.c:
float sum_vector(float* data, int num_row){
    float value = 0.0;
    for (int i = 0; i < num_row; i++){
        value += data[i];
    }
    return value;
}
Library compilation: the shared library sum.so is created by
clang -c sum_function.c
clang -shared -o sum.so sum_function.o
Measurement: a simple numpy array is created and the sum of its elements is calculated using the above function.
from ctypes import *
import numpy as np

N = int(1e7)
data = np.arange(N, dtype=np.float32)

libc = cdll.LoadLibrary("sum.so")
libc.sum_vector.restype = c_float
libc.sum_vector(data.ctypes.data_as(POINTER(c_float)),
                c_int(N))
The above function takes 30 ms. However, if I use numpy.sum, the execution time is only 4 ms.
So my question is: what makes numpy a lot faster than my C implementation? I cannot think about any improvement in terms of algorithm for calculating the sum of a vector.
There are many reasons that could be involved, depending even on the compiler you are using. Your numpy backend is in many cases C/C++. In other words, you have to appreciate that languages like C++ allow for a lot more efficiency and closeness to hardware, but also demand a lot of knowledge (C++ less so than C, as long as you use the STL, as in @PaulMcKenzie's comment). Those are routines that are optimized for runtime performance.
The next thing is memory allocation. Your vector seems large enough that the allocator inside std::vector would align the memory on the heap. Memory on the stack can end up unaligned, which keeps even std::accumulate slow. Here's an idea of how such an allocator could be written to avoid that: https://github.com/kvahed/codeare/blob/master/src/matrix/Allocator.hpp. This is part of an MRI image reconstruction library I wrote as a PhD student.
A word on SIMD: same library, other aspect. https://github.com/kvahed/codeare/blob/master/src/matrix/SIMDTraits.hpp. How to do state-of-the-art arithmetic is anything but trivial.
Both of the above concepts culminate in https://github.com/kvahed/codeare/blob/master/src/matrix/Matrix.hpp, where you can easily outperform any standardized code on a specific machine.
And last but not least: the compiler and the compiler flags. Once debugged, your runtime code should probably be compiled with -O2 -g or even -O3. If you have good test coverage, you might even be able to get away with -Ofast, which ditches IEEE math precision. Apart from numerical integration, I have never witnessed issues.
You need to enable optimizations.
In addition to that, you have to check whether the compiler is able to autovectorize your loop. If you want to distribute a compiled binary, you may want to add multiple codepaths (AVX2, SSE2) to get a runnable and performant version on all platforms.
Here is a small overview of different implementations and their performance. If you can't beat the numpy sum implementation (the binary version installed via pip) on a recent processor, you have done something wrong; but also keep the varying implementation- and compiler-dependent (fastmath) precision in mind. I was too lazy to install clang, so I used Numba, which also has an LLVM backend (the same one clang has).
import numba as nb
import numpy as np
import time

# prints information about SIMD vectorization
import llvmlite.binding as llvm
llvm.set_option('', '--debug-only=loop-vectorize')

@nb.njit(fastmath=True)  # eq. -O3, -march=native, fastmath
def sum_nb(ar):
    s1 = 0.  # double
    for i in range(ar.shape[0]):
        s1 += ar[i + 0]
    return s1

N = int(1e7)
ar = np.random.rand(N).astype(np.float32)

# Numba solution float32 with float64 accumulator
# don't measure compilation time
sum_1 = sum_nb(ar)
t1 = time.time()
for i in range(1000):
    sum_1 = sum_nb(ar)
print(time.time() - t1)

# Numba solution float64 with float64 accumulator
# don't measure compilation time
arr_64 = ar.astype(np.float64)
sum_2 = sum_nb(arr_64)
t1 = time.time()
for i in range(1000):
    sum_2 = sum_nb(arr_64)
print(time.time() - t1)

# Numpy solution (float32)
t1 = time.time()
for i in range(1000):
    sum_3 = np.sum(ar)
print(time.time() - t1)

# Numpy solution (float32, with float64 accumulator)
t1 = time.time()
for i in range(1000):
    sum_4 = np.sum(ar, dtype=np.float64)
print(time.time() - t1)

# Numpy solution (float64)
t1 = time.time()
for i in range(1000):
    sum_5 = np.sum(arr_64)
print(time.time() - t1)

print(sum_1)
print(sum_2)
print(sum_3)
print(sum_4)
print(sum_5)
Performance (per call)
#Numba solution float32 with float64 accumulator: 2.29ms
#Numba solution float64 with float64 accumulator: 4.76ms
#Numpy solution (float32): 5.72ms
#Numpy solution (float32, with float64 accumulator): 7.97ms
#Numpy solution (float64): 10.61ms

Parallel Scipy COO Matrix Computations

I am trying to perform sparse matrix calculations using scipy for an algorithm that requires intensive, dependent computations (PageRank) on very large RDF datasets. I want to use multiple cores for the scipy calculation within the following code:
F = sparse.coo_matrix((y['data'], (y['row'], y['col'])), shape=y['shape'])
W = sparse.coo_matrix((y['data'], (y['row'], y['col'])), shape=y['shape'])
P = sparse.bmat([[None, W], [F, None]])

previous = np.ones(n) / n
ones = np.ones(n) / n
while error > epsilon:
    tmp = np.array(previous)
    previous = damping * P.T.dot(previous) + (1 - damping) * ones
    error = np.linalg.norm(tmp - previous)
    if printerror:
        print(error)
I have searched every possible answer I could find, and I tried integrating MKL (the Anaconda build) with the code, but the performance on multiple cores does not seem to scale up. I have come to the understanding that the scipy code in csr.h does not make use of BLAS calls. I am wondering whether I need to replace the call to csr_matvec in scipy/sparsetools with an appropriate sparse BLAS call, since MKL has those, and then link scipy to MKL. Am I misunderstanding or missing something? I would really appreciate some help in the matter. A similar question is here. Thanks!!
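A side note on the loop itself, separate from the multicore question (a sketch, not from the original post): the transpose in the loop above is recomputed every iteration, and hoisting it out once, in CSR form, removes that repeated work:
# transpose and convert once, outside the iteration
PT = P.T.tocsr()
while error > epsilon:
    tmp = np.array(previous)
    previous = damping * PT.dot(previous) + (1 - damping) * ones
    error = np.linalg.norm(tmp - previous)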

Why is numpy much slower than matlab on a digitize example?

I am comparing the performance of numpy vs Matlab, and in several cases I have observed that numpy is significantly slower (indexing, simple operations on arrays such as absolute value, multiplication, sum, etc.). Let's look at the following example, which is somewhat striking, involving the function digitize (which I plan to use for synchronizing timestamps):
import numpy as np
import time
scale=np.arange(1,1e+6+1)
y=np.arange(1,1e+6+1,10)
t1=time.time()
ind=np.digitize(scale,y)
t2=time.time()
print 'Time passed is %2.2f seconds' %(t2-t1)
The result is:
Time passed is 55.91 seconds
Let's now try the same example in Matlab using the equivalent function histc:
scale=[1:1e+6];
y=[1:10:1e+6];
tic
[N,bin]=histc(scale,y);
t=toc;
display(['Time passed is ',num2str(t), ' seconds'])
The result is:
Time passed is 0.10237 seconds
That's 560 times faster!
As I'm learning to extend Python with C++, I implemented my own version of digitize (using boost libraries for the extension):
import analysis # my C++ module implementing digitize
t1=time.time()
ind2=analysis.digitize(scale,y)
t2=time.time()
print 'Time passed is %2.2f seconds' %(t2-t1)
np.all(ind==ind2) #ok
The result is:
Time passed is 0.02 seconds
There is a bit of cheating here, as my version of digitize assumes that all inputs are monotonic; this might explain why it is even faster than Matlab. However, sorting an array of size 1e+6 takes 0.16 seconds (with numpy.sort), which makes the performance of my function worse by a factor of approximately 1.6 compared to the Matlab function histc.
So the questions are:
Why is numpy.digitize so slow? Is this function not supposed to be written in compiled and optimized code?
Why is my own version of digitize much faster than numpy.digitize, but still slower than Matlab (I am quite confident I use the fastest algorithm possible, given that I assume inputs are already sorted)?
I am using Fedora 16, and I recently installed the ATLAS and LAPACK libraries (but there has been no change in performance). Should I perhaps rebuild numpy? I am not sure whether my installation of numpy uses the appropriate libraries to gain maximum speed; perhaps Matlab is using better libraries.
Update
Based on the answers so far, I would like to stress that the Matlab function histc is not equivalent to numpy.histogram if someone (like me in this case) does not care about the histogram. I need the second output of histc, which is a mapping from input values to the index of the provided input bins. Such an output is provided by the numpy functions digitize and searchsorted. As one of the answers says, searchsorted is much faster than digitize. However, searchsorted is still slower than Matlab by a factor of 2:
t1=time.time()
ind3=np.searchsorted(y,scale,"right")
t2=time.time()
print 'Time passed is %2.2f seconds' %(t2-t1)
np.all(ind==ind3) #ok
The result is
Time passed is 0.21 seconds
So the questions are now:
What is the sense of having numpy.digitize if there is an equivalent function numpy.searchsorted which is 280 times faster?
Why is the Matlab function histc (which also provides the output of numpy.searchsorted) 2 times faster than numpy.searchsorted?
First, let's look at why numpy.digitize is slow. If your bins are found to be monotonic, then one of these functions is called depending on whether the bins are nondecreasing or nonincreasing (the code for this is found in numpy/lib/src/_compiled_base.c in the numpy git repo):
static npy_intp
incr_slot_(double x, double *bins, npy_intp lbins)
{
    npy_intp i;

    for ( i = 0; i < lbins; i ++ ) {
        if ( x < bins [i] ) {
            return i;
        }
    }
    return lbins;
}

static npy_intp
decr_slot_(double x, double * bins, npy_intp lbins)
{
    npy_intp i;

    for ( i = lbins - 1; i >= 0; i -- ) {
        if (x < bins [i]) {
            return i + 1;
        }
    }
    return 0;
}
As you can see, it is doing a linear search. Linear search is much, much slower than binary search so there is your answer as to why it is slow. I will open a ticket for this on the numpy tracker.
Second, I think that Matlab is actually slower than your C++ code because Matlab also assumes that the bins are monotonically nondecreasing.
I can't answer why numpy.digitize() is so slow, but I could confirm your timings on my machine.
The function numpy.searchsorted() does basically the same thing as numpy.digitize(), but efficiently.
ind = np.searchsorted(y, scale, "right")
takes about 0.15 seconds on my machine and gives exactly the same result as your code.
Note that your Matlab code does something different from both of those functions -- it is the equivalent of numpy.histogram().
Before the question can get answered, several subquestions need to be addressed:
- In order to get more reliable results, you should run several iterations of the tests and average their results. This would eliminate startup effects, which do not have anything to do with the algorithm. Also, try to use larger data for the same purpose. (A timeit sketch follows this list.)
- Use the same algorithms across the frameworks. This has already been addressed in other answers here.
- Make sure the algorithms are really similar enough. How do they utilize system resources? How do they iterate over memory? If (just an example) a Matlab algorithm uses repmat and the numpy one does not, the comparison is not fair.
- How does the corresponding framework parallelize? This is possibly connected to your individual machine/processor configuration. Matlab does parallelize some (but by far not all) builtin functions. I don't know about numpy/CPython.
- Use a memory profiler in order to find out how both implementations behave from that performance point of view.
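On the first point, a minimal sketch of repeated timing with timeit (best-of-N filters out startup and scheduling noise):
import timeit

setup = ("import numpy as np;"
         "scale = np.arange(1, 1e6 + 1);"
         "y = np.arange(1, 1e6 + 1, 10)")
best = min(timeit.repeat("np.digitize(scale, y)", setup=setup,
                         repeat=3, number=1))
print("best of 3: %.2f seconds" % best)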
Afterwards (this is only a guess) we will probably find out that numpy often behaves slower than Matlab. Many questions here on SO come to the same conclusion. One explanation could be that Matlab has an easier job optimizing array access, because it does not need to take into account a whole collection of general-purpose objects (as CPython does): the requirements on mathematical arrays are much lower than those on general arrays. numpy, on the other hand, builds on CPython, which must serve the full Python language, not only numpy. However, according to this comparison test (among many others), Matlab is still pretty slow...
I don't think you are comparing the same functions in numpy and Matlab. The equivalent of histc is np.histogram, as far as I can tell from the documentation. I don't have Matlab to do a comparison, but when I do the following on my machine:
In [7]: import numpy as np
In [8]: scale=np.arange(1,1e+6+1)
In [9]: y=np.arange(1,1e+6+1,10)
In [10]: %timeit np.histogram(scale,y)
10 loops, best of 3: 135 ms per loop
I get a number that is approximately equivalent to what you get for histc.
