To be clear, below is what I am trying to do. The question is: how can I change the function oper_AB() so that, instead of the nested for loop, it uses NumPy vectorization/broadcasting to produce ret_list much faster?
def oper(a_1D, b_1D):
    return np.dot(a_1D, b_1D) / np.dot(b_1D, b_1D)

def oper_AB(A_2D, B_2D):
    ret_list = []
    for a_1D in A_2D:
        for b_1D in B_2D:
            ret_list.append(oper(a_1D, b_1D))
    return ret_list
Strictly addressing the question (with the reservation that I suspect the OP wants the norm, not the norm squared, as divisor below):
r = a @ b.T / np.linalg.norm(b, axis=1)**2
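Wrapped as a drop-in replacement for oper_AB() (a sketch of mine; the name oper_AB_vec is not from the original post):

import numpy as np

def oper_AB_vec(A_2D, B_2D):
    # One matrix multiply yields all pairwise dot products; each column j is
    # divided by the squared norm of row j of B_2D (i.e. np.dot(b, b));
    # ravel keeps the (a outer, b inner) order of the original nested loop.
    return np.ravel(A_2D @ B_2D.T / np.linalg.norm(B_2D, axis=1)**2)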
Example:
np.random.seed(0)
a = np.random.randint(0, 10, size=(2,2))
b = np.random.randint(0, 10, size=(2,2))
Then:
>>> a
array([[5, 0],
       [3, 3]])
>>> b
array([[7, 9],
       [3, 5]])
>>> oper_AB(a, b)
[0.2692307692307692,
0.4411764705882353,
0.36923076923076925,
0.7058823529411765]
>>> a @ b.T / np.linalg.norm(b, axis=1)**2
array([[0.26923077, 0.44117647],
       [0.36923077, 0.70588235]])
>>> np.ravel(a @ b.T / np.linalg.norm(b, axis=1)**2)
array([0.26923077, 0.44117647, 0.36923077, 0.70588235])
Speed:
n, m = 1000, 100
a = np.random.uniform(size=(n, m))
b = np.random.uniform(size=(n, m))
orig = %timeit -o oper_AB(a, b)
# 2.73 s ± 11 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
new = %timeit -o np.ravel(a @ b.T / np.linalg.norm(b, axis=1)**2)
# 2.22 ms ± 33.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
orig.average / new.average
# 1228.78 (speedup)
Our solution is 1200x faster than the original.
Correctness:
>>> np.allclose(np.ravel(a @ b.T / np.linalg.norm(b, axis=1)**2), oper_AB(a, b))
True
Speed on large array, comparison to @Ahmed AEK's solution:
n, m = 2000, 2000
a = np.random.uniform(size=(n, m))
b = np.random.uniform(size=(n, m))
new = %timeit -o np.ravel(a @ b.T / np.linalg.norm(b, axis=1)**2)
# 86.5 ms ± 484 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
other = %timeit -o AEK(a, b) # Ahmed AEK's answer
# 102 ms ± 379 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Our solution is 15% faster :-)
This should work:
result = (np.matmul(A_2D, B_2D.transpose()) / np.sum(B_2D*B_2D, axis=1)).flatten()
But this second implementation will be faster because of better cache utilization:
def oper_AB(A_2D, B_2D):
    b_squared = np.sum(B_2D*B_2D, axis=1).reshape([-1, 1])
    b_normalized = B_2D / b_squared
    del b_squared
    returned_val = np.matmul(A_2D, b_normalized.transpose())
    return returned_val.flatten()
The del is there just in case the memory allocated for b_squared is too big (or it's just me being used to working with multi-GB arrays).
Edit: as requested, for A_1D - B_1D:
def oper2_AB(A_2D, B_2D):
    output = np.zeros([A_2D.shape[0]*B_2D.shape[0], A_2D.shape[1]], dtype=A_2D.dtype)
    for i in range(len(A_2D)):
        output[i*len(B_2D):(i+1)*len(B_2D)] = A_2D[i] - B_2D
    return output
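A broadcasting-only sketch of the same pairwise difference (my addition, not part of the answer; peak memory is comparable, since the full (a*b, m) result is materialized either way):

def oper2_AB_broadcast(A_2D, B_2D):
    # A_2D[:, None, :] has shape (a, 1, m) and B_2D[None, :, :] has (1, b, m);
    # broadcasting gives (a, b, m), reshaped row-major to (a*b, m) in the
    # same order as the loop version above.
    return (A_2D[:, None, :] - B_2D[None, :, :]).reshape(-1, A_2D.shape[1])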
I'm new to TensorFlow 2.0 and haven't done much except design and train some artificial neural networks from boilerplate code. I'm trying to solve an exercise for newcomers to the new TensorFlow. I created some code, but it doesn't work. Below is the problem definition:
Assuming we have a tensor M of rational numbers with shape (a, b, c) and a scalar p ∈ (0, 1) (memory factor), let's create a function that will return a tensor N with shape (a, b, c). Each element of N, moving along axis c, should be increased by the value of its predecessor multiplied by p.
Assuming we have tensor:
T = [x1, x2, x3, x4]
in shape of (1, 1, 4), we would like to get vector:
[x1, x2+x1·p, x3+(x2+x1·p)·p, x4+(x3+(x2+x1·p)·p)·p]
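In other words (my restatement, not part of the exercise text), moving along the last axis the output obeys the recurrence

$$N_1 = x_1, \qquad N_k = x_k + p\,N_{k-1} \quad (k > 1)$$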
The solution should be created in TensorFlow 2.0 and should focus on delivering the shortest execution time on CPU. The created graph should allow efficient calculation of the derivative both with respect to the tensor M and to the value p.
This is the code I have created so far:
import tensorflow as tf
@tf.function
def vectorize_predec(t, p):
    last_elem = 0
    result = []
    for el in t:
        result.append(el + (p * last_elem))
        last_elem = el + (p * last_elem)
    return result
p = tf.Variable(0.5, dtype='double')
m = tf.constant([[0, 1, 2, 3, 4],
                 [1, 3, 5, 7, 10],
                 [1, 1, 1, -1, 0]])
vectorize_predec(m, p)
But it throws a TypeError.
I looked around the documentation and saw functions like cumsum and polyeval, but I'm not sure they fit my needs. To my understanding, I need to write my own custom function annotated with @tf.function. I'm also not sure how to handle 3-dimensional tensors properly according to the problem definition (adding the predecessor should happen along the last ("c") axis).
I've seen in the documentation (here: https://www.tensorflow.org/tutorials/customization/performance) that there are ways to measure the size of the produced graph. However, I'm not sure how a "graph" allows efficiently calculating the derivative both on the tensor M and the value p. ELI5 answers appreciated, or at least some materials I can read to educate myself better.
Thanks a lot!
I'll give you a couple of different methods to implement that. I think the most obvious solution is to use tf.scan:
import tensorflow as tf
def apply_momentum_scan(m, p, axis=0):
    # Put axis first
    axis = tf.convert_to_tensor(axis, dtype=tf.int32)
    perm = tf.concat([[axis], tf.range(axis), tf.range(axis + 1, tf.rank(m))], axis=0)
    m_t = tf.transpose(m, perm)
    # Do computation
    res_t = tf.scan(lambda a, x: a * p + x, m_t)
    # Undo transpose
    perm_t = tf.concat([tf.range(1, axis + 1), [0], tf.range(axis + 1, tf.rank(m))], axis=0)
    return tf.transpose(res_t, perm_t)
However, you can also implement this as a particular matrix product, if you build a matrix of exponential factors:
import tensorflow as tf
def apply_momentum_matmul(m, p, axis=0):
    # Put axis first and reshape
    m = tf.convert_to_tensor(m)
    p = tf.convert_to_tensor(p)
    axis = tf.convert_to_tensor(axis, dtype=tf.int32)
    perm = tf.concat([[axis], tf.range(axis), tf.range(axis + 1, tf.rank(m))], axis=0)
    m_t = tf.transpose(m, perm)
    shape_t = tf.shape(m_t)
    m_tr = tf.reshape(m_t, [shape_t[0], -1])
    # Build factors matrix
    r = tf.range(tf.shape(m_tr)[0])
    p_tr = tf.linalg.band_part(p ** tf.dtypes.cast(tf.expand_dims(r, 1) - r, p.dtype), -1, 0)
    # Do computation
    res_tr = p_tr @ m_tr
    # Reshape back and undo transpose
    res_t = tf.reshape(res_tr, shape_t)
    perm_t = tf.concat([tf.range(1, axis + 1), [0], tf.range(axis + 1, tf.rank(m))], axis=0)
    return tf.transpose(res_t, perm_t)
This can also be rewritten to avoid the initial transpose (which is expensive in TensorFlow) by using tf.tensordot:
import tensorflow as tf
def apply_momentum_tensordot(m, p, axis=0):
    # Put axis first and reshape
    m = tf.convert_to_tensor(m)
    # Build factors matrix
    r = tf.range(tf.shape(m)[axis])
    p_mat = tf.linalg.band_part(p ** tf.dtypes.cast(tf.expand_dims(r, 1) - r, p.dtype), -1, 0)
    # Do computation
    res_t = tf.linalg.tensordot(m, p_mat, axes=[[axis], [1]])
    # Transpose
    last_dim = tf.rank(res_t) - 1
    perm_t = tf.concat([tf.range(axis), [last_dim], tf.range(axis, last_dim)], axis=0)
    return tf.transpose(res_t, perm_t)
The three functions would be used in a similar way:
import tensorflow as tf
p = tf.Variable(0.5, dtype=tf.float32)
m = tf.constant([[0, 1, 2, 3, 4],
                 [1, 3, 5, 7, 10],
                 [1, 1, 1, -1, 0]], tf.float32)
# apply_momentum is one of the functions above
print(apply_momentum(m, p, axis=0).numpy())
# [[ 0. 1. 2. 3. 4. ]
# [ 1. 3.5 6. 8.5 12. ]
# [ 1.5 2.75 4. 3.25 6. ]]
print(apply_momentum(m, p, axis=1).numpy())
# [[ 0. 1. 2.5 4.25 6.125 ]
# [ 1. 3.5 6.75 10.375 15.1875]
# [ 1. 1.5 1.75 -0.125 -0.0625]]
Using a matrix product is asymptotically more complex, but it can be faster than scanning. Here is a small benchmark:
import tensorflow as tf
import numpy as np
# Make test data
tf.random.set_seed(0)
p = tf.constant(0.5, dtype=tf.float32)
m = tf.random.uniform([100, 30, 50], dtype=tf.float32)
# Axis 0
print(np.allclose(apply_momentum_scan(m, p, 0).numpy(), apply_momentum_matmul(m, p, 0).numpy()))
# True
print(np.allclose(apply_momentum_scan(m, p, 0).numpy(), apply_momentum_tensordot(m, p, 0).numpy()))
# True
%timeit apply_momentum_scan(m, p, 0)
# 11.5 ms ± 610 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit apply_momentum_matmul(m, p, 0)
# 1.36 ms ± 18.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit apply_momentum_tensordot(m, p, 0)
# 1.62 ms ± 7.39 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# Axis 1
print(np.allclose(apply_momentum_scan(m, p, 1).numpy(), apply_momentum_matmul(m, p, 1).numpy()))
# True
print(np.allclose(apply_momentum_scan(m, p, 1).numpy(), apply_momentum_tensordot(m, p, 1).numpy()))
# True
%timeit apply_momentum_scan(m, p, 1)
# 4.27 ms ± 60.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit apply_momentum_matmul(m, p, 1)
# 1.27 ms ± 36.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit apply_momentum_tensordot(m, p, 1)
# 1.2 ms ± 11.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# Axis 2
print(np.allclose(apply_momentum_scan(m, p, 2).numpy(), apply_momentum_matmul(m, p, 2).numpy()))
# True
print(np.allclose(apply_momentum_scan(m, p, 2).numpy(), apply_momentum_tensordot(m, p, 2).numpy()))
# True
%timeit apply_momentum_scan(m, p, 2)
# 6.29 ms ± 64.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit apply_momentum_matmul(m, p, 2)
# 1.41 ms ± 21.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit apply_momentum_tensordot(m, p, 2)
# 1.05 ms ± 26 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
So, matrix product seems to win. Let's see if this scales:
import tensorflow as tf
import numpy as np
# Make test data
tf.random.set_seed(0)
p = tf.constant(0.5, dtype=tf.float32)
m = tf.random.uniform([1000, 300, 500], dtype=tf.float32)
# Axis 0
print(np.allclose(apply_momentum_scan(m, p, 0).numpy(), apply_momentum_matmul(m, p, 0).numpy()))
# True
print(np.allclose(apply_momentum_scan(m, p, 0).numpy(), apply_momentum_tensordot(m, p, 0).numpy()))
# True
%timeit apply_momentum_scan(m, p, 0)
# 784 ms ± 6.78 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit apply_momentum_matmul(m, p, 0)
# 1.13 s ± 76.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit apply_momentum_tensordot(m, p, 0)
# 1.3 s ± 27 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# Axis 1
print(np.allclose(apply_momentum_scan(m, p, 1).numpy(), apply_momentum_matmul(m, p, 1).numpy()))
# True
print(np.allclose(apply_momentum_scan(m, p, 1).numpy(), apply_momentum_tensordot(m, p, 1).numpy()))
# True
%timeit apply_momentum_scan(m, p, 1)
# 852 ms ± 12.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit apply_momentum_matmul(m, p, 1)
# 659 ms ± 10.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit apply_momentum_tensordot(m, p, 1)
# 741 ms ± 19.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# Axis 2
print(np.allclose(apply_momentum_scan(m, p, 2).numpy(), apply_momentum_matmul(m, p, 2).numpy()))
# True
print(np.allclose(apply_momentum_scan(m, p, 2).numpy(), apply_momentum_tensordot(m, p, 2).numpy()))
# True
%timeit apply_momentum_scan(m, p, 2)
# 1.06 s ± 16.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit apply_momentum_matmul(m, p, 2)
# 924 ms ± 17 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit apply_momentum_tensordot(m, p, 2)
# 483 ms ± 10.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Well, now it's not so clear anymore. Scanning is still not super fast, but the matrix products are sometimes slower. As you can imagine, if you go to even bigger tensors, the complexity of the matrix products will dominate the timings.
So, if you want the fastest solution and know your tensors are not going to get huge, use one of the matrix product implementations. If you're fine with okay speed but want to make sure you don't run out of memory (the matrix solution also takes much more) and want predictable timing, use the scanning solution.
Note: Benchmarks above were carried out on CPU, results may vary significantly on GPU.
Here is an answer that just provides some information and a naive solution to fix the code, not the actual problem (see below for why).
First off, the TypeError is a problem of incompatible types in the tensors of your early attempt. Some tensors contain floating-point numbers (double), some contain integers. It would have helped to show the full error message:
TypeError: Input 'y' of 'Mul' Op has type int32 that does not match type float64 of argument 'x'.
This happens to put us on the right track (despite the gory details of the stack trace).
Here is a naive fix to get the code to work (with caveats regarding the target problem):
import tensorflow as tf
@tf.function
def vectorize_predec(t, p):
    _p = tf.transpose(
        tf.convert_to_tensor(
            [p * t[..., idx] for idx in range(t.shape[-1] - 1)],
            dtype=tf.float64))
    _p = tf.concat([
        tf.zeros((_p.shape[0], 1), dtype=tf.float64),
        _p
    ], axis=1)
    return t + _p
p = tf.Variable(0.5, dtype='double')
m = tf.constant([[0, 1, 2, 3, 4],
                 [1, 3, 5, 7, 10],
                 [1, 1, 1, -1, 0]], dtype=tf.float64)
n = tf.constant([[0.0, 1.0, 2.5, 4.0, 5.5],
                 [1.0, 3.5, 6.5, 9.5, 13.5],
                 [1.0, 1.5, 1.5, -0.5, -0.5]], dtype=tf.float64)
print(f'Expected: {n}')
result = vectorize_predec(m, p)
print(f'Result: {result}')
tf.test.TestCase().assertAllEqual(n, result)
The main changes:
The m tensor gets a dtype=tf.float64 to match the original double, so the type error vanishes.
The function is basically a complete rewrite. The naive idea is to exploit the problem definition, which does not state whether the values in N are calculated before or after updates. Here is a version that uses the values before the update, which is way easier. Solving what seems to be the "real" problem requires working a bit more on the function (see the other answers, and I may work more on this here).
How the function works:
It calculates the expected increments p * x1, p * x2, etc. into a standard Python list. Note that it stops before the last element of the last dimension, as we will shift the array.
It converts the list to a tensor with tf.convert_to_tensor, thereby adding it to the computation graph. The transpose is necessary to match the original tensor shape (we could avoid it).
It prepends a column of zeros along the last axis.
The result is the sum of the original tensor and the constructed one.
The values become x1 + 0.0 * p, then x2 + x1 * p, etc. This illustrates a few functions and issues to look at (types, shapes), but I admit it cheats and does not solve the actual problem.
Also, this code is not efficient on any hardware. It is just illustrative, and would need to (1) eliminate the Python list, (2) eliminate the transpose, (3) eliminate the concatenate operation. Hopefully great training :-)
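As one possible direction for (1) and (2), here is a small sketch of my own (the name vectorize_predec_sketch is hypothetical, and it still only computes the naive "before update" variant on 2-D input):

import tensorflow as tf

@tf.function
def vectorize_predec_sketch(t, p):
    # Shift t one step to the right along the last axis: tf.pad prepends a
    # zero column and the slice drops the last column (no Python list, no
    # transpose; tf.pad stands in for the concatenation).
    shifted = tf.pad(t, [[0, 0], [1, 0]])[:, :-1]
    # Values become x1 + 0*p, x2 + x1*p, x3 + x2*p, ...
    return t + p * shifted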
Extra notes:
The problem asks for a solution on tensors of shape (a, b, c). The code you shared works on tensors of shape (a, b), so fixing the code will still not solve the problem.
The problem requires rational numbers. I am not sure what the intent is, and this answer leaves that requirement aside.
The shape of T = [x1, x2, x3, x4] is actually (4,), assuming xi are scalars.
Why tf.float64? By default we get tf.float32, and removing the double would make the code work. But the example would lose the point that types matter, hence the choice of an explicit non-default type (and uglier code).
I am creating this array for my shader, and this step is very slow because it is a nested for loop. Currently this method takes approximately 1 second. Can anyone suggest a faster way to create this array?
import numpy as np
elems = []
b = 23503
a = 24
for i in range(0, a - 1):
    for j in range(0, b - 1):
        elems += [j + b * i, j + b * i + 1, j + b * (i + 1)]
        elems += [j + b * (i + 1), j + b * (i + 1) + 1, j + b * i + 1]
elems = np.array(elems, dtype=np.int32)
First I would recognise that there is a lot of repeated computation. The base term involving the iterator variables here is i*b+j, so let's have NumPy create an array that contains those values in the order they should appear:
ib_j = (np.arange(a-1)[:, None]*b + np.arange(b-1)).flatten()
Next we compute the six different columns from this base, stack them horizontally, and flatten:
def create_shader_array(a, b):
    ib_j = (np.arange(a-1)[:, None]*b + np.arange(b-1)).flatten()
    return np.column_stack((ib_j, ib_j+1, ib_j+b, ib_j+b, ib_j+b+1, ib_j+1)).flatten()
Validation:
>>> all(create_shader_array(a, b) == AKS(a, b)) # AKS is your original implementation
True
Timing:
>>> %timeit AKS(24, 23503)
1.02 s ± 8.25 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit create_shader_array(24, 23503)
28.8 ms ± 364 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
You can use meshgrid to cover the i and j iterations and then do an outer add with the six inner offsets. Use ravel at the end to get a 1D array.
inner = np.array([0, 1, b, b, b+1, 1], dtype="int32")
j, i = np.meshgrid(np.arange(b-1), np.arange(a-1))
elems = np.add.outer((j+b*i), inner).ravel()
or with a one-liner:
elems = ([0, 1, b, b, b+1, 1]+np.arange(b-1)[:, None]+b*np.arange(a-1)[:,None, None]).ravel()
This finishes in under 6 ms on my computer:
In [9]: %timeit ([0, 1, b, b, b+1, 1]+np.arange(b-1)[:,None]+b*np.arange(a-1)[:,None, None]).ravel()
5.23 ms ± 112 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [10]: %timeit create_shader_array(a, b)
29.8 ms ± 176 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Say I have a matrix A of dimension N by M.
I wish to return an N dimensional vector V where the nth element is the double sum of all pairwise product of the entries in the nth row of A.
In loops, I guess I could do:
V = np.zeros(A.shape[0])
for n in range(A.shape[0]):
    for i in range(A.shape[1]):
        for j in range(A.shape[1]):
            V[n] += A[n,i] * A[n,j]
I want to vectorise this and I guess I could do:
V_temp = np.einsum('ij,ik->ijk', A, A)
V = np.einsum('ijk->i', V_temp)
But I don't think this is a very memory-efficient way, as the intermediate step V_temp unnecessarily stores all the outer products when all I need are their sums. Is there a better way to do this?
Thanks
You can use
V = np.einsum('ni,nj->n', A, A)
You are actually calculating
A.sum(-1)**2
In other words, the sum over an outer product is just the product of the sums of the factors.
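To spell this out for row n of A (my addition), each term of the double sum factors:

$$\sum_i \sum_j A_{ni} A_{nj} = \Big(\sum_i A_{ni}\Big)\Big(\sum_j A_{nj}\Big) = \Big(\sum_i A_{ni}\Big)^2$$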
Demo:
import timeit
import numpy as np

A = np.random.random((1000, 1000))
np.allclose(np.einsum('ij,ik->i', A, A), A.sum(-1)**2)
# True
t = timeit.timeit('np.einsum("ij,ik->i",A,A)', globals=dict(A=A,np=np), number=10)*100; f"{t:8.4f} ms"
# '948.4210 ms'
t = timeit.timeit('A.sum(-1)**2', globals=dict(A=A,np=np), number=10)*100; f"{t:8.4f} ms"
# ' 0.7396 ms'
Perhaps you can use
np.einsum('ij,ik->i', A, A)
or the equivalent
np.einsum(A, [0,1], A, [0,2], [0])
On a 2015 Macbook, I get
In [35]: A = np.random.rand(100,100)
In [37]: %timeit for_loops(A)
640 ms ± 24.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [38]: %timeit np.einsum('ij,ik->i', A, A)
658 µs ± 7.25 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [39]: %timeit np.einsum(A, [0,1], A, [0,2], [0])
672 µs ± 19.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Say I have two arrays,
import numpy as np
x = np.array([1, 2, 3, 4])
y = np.array([5, 6, 7, 8])
What's the fastest, most Pythonic, etc., etc. way to get a new array, z, with a number of elements equal to x.size * y.size, in which the elements are the products of every pair of elements (x_i, y_j) from the two input arrays.
To rephrase, I'm looking for an array z in which z[k] is x[i] * y[j] (with k = i * y.size + j, i.e., iterating over y fastest).
A simple but inefficient way to get this is as follows:
z = np.empty(x.size * y.size)
counter = 0
for i in x:
    for j in y:
        z[counter] = i * j
        counter += 1
Running the above code shows that z in this example is
In [3]: z
Out[3]:
array([ 5., 6., 7., 8., 10., 12., 14., 16., 15., 18., 21.,
24., 20., 24., 28., 32.])
Here's one way to do it:
import itertools
z = np.empty(x.size * y.size)
counter = 0
for i, j in itertools.product(x, y):
    z[counter] = i * j
    counter += 1
It'd be nice to get rid of that counter, though, as well as the for loop (but at least I got rid of one of the loops).
UPDATE
Being one-liners, the other provided answers are better than this one (according to my standards, which value brevity). The timing results below show that @BilalAkil's answer is faster than @TimLeathart's:
In [10]: %timeit np.array([x * j for j in y]).flatten()
The slowest run took 4.37 times longer than the fastest. This could mean that an intermediate result is being cached
10000 loops, best of 3: 24.2 µs per loop
In [11]: %timeit np.multiply.outer(x, y).flatten()
The slowest run took 5.59 times longer than the fastest. This could mean that an intermediate result is being cached
100000 loops, best of 3: 10.5 µs per loop
Well, I don't have much experience with numpy, but a quick search gave me this:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.ufunc.outer.html
>>> np.multiply.outer([1, 2, 3], [4, 5, 6])
array([[ 4,  5,  6],
       [ 8, 10, 12],
       [12, 15, 18]])
You can then flatten that array to get the same output as you requested:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.flatten.html
EDIT: @Divakar's answer showed us that ravel will do the same thing as flatten, except faster o.O (ravel returns a view when it can, while flatten always copies). So use that instead.
So in your case, it'd look like this:
>>> np.multiply.outer(x, y).ravel()
BONUS: You can go multi-dimensional with this!
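For example (a small illustration of my own), with a 2-D first argument the result gains an extra axis, element [i, j, k] being a[i, j] * b[k]:

>>> a = np.array([[1, 2], [3, 4]])
>>> np.multiply.outer(a, np.array([10, 100]))
array([[[ 10, 100],
        [ 20, 200]],

       [[ 30, 300],
        [ 40, 400]]])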
Two more approaches could be suggested here.
Using matrix-multiplication with np.dot:
np.dot(x[:,None],y[None]).ravel()
With np.einsum:
np.einsum('i,j->ij',x,y).ravel()
Runtime tests
In [31]: N = 10000
...: x = np.random.rand(N)
...: y = np.random.rand(N)
...:
In [32]: %timeit np.dot(x[:,None],y[None]).ravel()
1 loops, best of 3: 302 ms per loop
In [33]: %timeit np.einsum('i,j->ij',x,y).ravel()
1 loops, best of 3: 274 ms per loop
Same as @BilalAkil's answer but with ravel() instead of flatten() as a faster alternative -
In [34]: %timeit np.multiply.outer(x, y).ravel()
1 loops, best of 3: 211 ms per loop
@BilalAkil's answer:
In [35]: %timeit np.multiply.outer(x, y).flatten()
1 loops, best of 3: 451 ms per loop
@Tim Leathart's answer:
In [36]: %timeit np.array([y * a for a in x]).flatten()
1 loops, best of 3: 766 ms per loop
Here's a way to do it:
import numpy as np
x = np.array([1, 2, 3, 4])
y = np.array([5, 6, 7, 8])
z = np.array([y * a for a in x]).flatten()
I know I'm super late to the party here, but I thought I'd throw my hat into the ring for anyone reading this question in the future. Using the same metric as @Divakar, I added what I consider to be a much more intuitive solution to the list (the first code snippet measured):
import numpy as np
N = 10000
x = np.random.rand(N)
y = np.random.rand(N)
%timeit np.ravel(x[:,None] * y[None])
635 ms ± 19.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit np.outer(x, y).ravel()
640 ms ± 16 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit np.dot(x[:,None],y[None]).ravel()
853 ms ± 57.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit np.einsum('i,j->ij',x,y).ravel()
754 ms ± 19.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Based on the similarity in execution time, it seems likely that numpy.outer functions exactly the same way as my solution internally, although you should take observations like this with a hefty grain of salt.
The reason why I find it more intuitive is that, unlike all other solutions, its syntax isn't strictly limited to multiplication. For example, np.ravel(x[:,None] / y[None]) will give you a / b for every a in x and b in y.
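A quick check of that claim with small arrays (my own example):

>>> x = np.array([1., 2.])
>>> y = np.array([4., 8.])
>>> np.ravel(x[:, None] / y[None])
array([0.25 , 0.125, 0.5  , 0.25 ])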