I would like to calculate the log-ratios for my 2D array, e.g.
a = np.array([[3,2,1,4], [2,1,1,6], [1,5,9,1], [7,8,2,2], [5,3,7,8]])
The formula is ln(x/g(x)), where g(x) is the geometric mean of each row. I execute it like this:
logvalues = np.array(a, dtype=float) # the values will be overwritten through the code below.
for i in range(len(a)):
    row = np.array(a[i])
    geo_mean = row.prod()**(1.0/len(row))
    flr = lambda x: math.log(x/geo_mean)
    logvalues[i] = np.array([flr(x) for x in row])
I was wondering if there is any way to vectorise the above lines (preferably without introducing other modules) to make it more efficient?
This should do the trick:
geo_means = a.prod(1)**(1.0/a.shape[1])
logvalues = np.log(a/geo_means[:, None])
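As a hedged variant, the same log-ratio can be computed entirely in the log domain, since ln(x/g(x)) = ln(x) - mean(ln(x)) for each row. This avoids forming the row product, which can overflow for long rows. The name logvalues_stable is only for illustration:

```python
import numpy as np

a = np.array([[3, 2, 1, 4], [2, 1, 1, 6], [1, 5, 9, 1], [7, 8, 2, 2], [5, 3, 7, 8]])

# ln(x / g(x)) equals ln(x) minus the row mean of ln(x)
log_a = np.log(a)
logvalues_stable = log_a - log_a.mean(axis=1, keepdims=True)

# Product-based version from above, for comparison
geo_means = a.prod(axis=1) ** (1.0 / a.shape[1])
logvalues = np.log(a / geo_means[:, None])

print(np.allclose(logvalues_stable, logvalues))
```

Both versions assume all entries are strictly positive, otherwise the logarithm is undefined.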
Another way you could do this is just write the function as though for a single 1-D array, ignoring the 2-D aspect:
def f(x):
    return np.log(x / x.prod()**(1.0 / len(x)))
Then if you want to apply it to all rows in a 2-D array (or N-D array):
>>> np.apply_along_axis(f, 1, a)
array([[ 0.30409883, -0.10136628, -0.79451346, 0.5917809 ],
[ 0.07192052, -0.62122666, -0.62122666, 1.17053281],
[-0.95166562, 0.65777229, 1.24555895, -0.95166562],
[ 0.59299864, 0.72653003, -0.65976433, -0.65976433],
[-0.07391256, -0.58473818, 0.26255968, 0.39609107]])
Some other general notes on your attempt:
for i in range(len(a)): If you want to loop over all rows in an array, it's generally faster to simply write for row in a. NumPy can optimize this case somewhat, whereas with for idx in range(len(a)) you have to index the array again with a[idx] on every iteration, which is slower. Even then, it's better to avoid the for loop entirely where possible, as you already know.
row = np.array(a[i]): The np.array() isn't necessary. If you index a multi-dimensional array, the returned value is already an array.
lambda x: math.log(x/geo_mean): Don't use math functions with NumPy arrays; use the equivalents in the numpy module. Wrapping this in a function adds unnecessary overhead as well. Since you use it as [flr(x) for x in row], it is equivalent to the already vectorized NumPy operation np.log(row / geo_mean).
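To make the notes above concrete, here is a small sketch that iterates with for row in a and checks that the element-wise math.log version matches np.log on the whole row. The variable names looped and vectorized are only for illustration:

```python
import math
import numpy as np

a = np.array([[3, 2, 1, 4], [2, 1, 1, 6], [1, 5, 9, 1], [7, 8, 2, 2], [5, 3, 7, 8]])

for row in a:  # iterate over rows directly, no index needed
    geo_mean = row.prod() ** (1.0 / len(row))
    # Element-by-element version using the math module
    looped = np.array([math.log(x / geo_mean) for x in row])
    # Same computation applied to the whole row at once
    vectorized = np.log(row / geo_mean)
    assert np.allclose(looped, vectorized)
```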
I have two arrays I and X. I want to perform an operation which basically takes the indices from I and uses values from X. For example, I[0]=[0,1], I want to calculate X[0] and X[1] followed by X[0]-X[1] and append to a new array T. Similarly, for I[1]=[1,2], I want to calculate X[1] and X[2] followed by X[1]-X[2] and append to T. The expected output is presented.
import numpy as np
I=np.array([[0,1],[1,2]])
X=np.array([10,5,3])
The expected output is
T=array([[X[0]-X[1]],[X[1]-X[2]]])
The most basic approach is using nested indices together with the np.append() function.
It works like below:
T = np.append(X[I[0][0]] - X[I[0][1]], X[I[1][0]] - X[I[1][1]])
Here, X[I[0][0]] looks up the value stored at I[0][0] and uses it as the index into the array X.
You can also implement a loop to do that:
T = np.array([], dtype="int64")
for i in range(I.shape[0]):
    for j in range(I.shape[1]-1):
        T = np.append(T, X[I[i][j]] - X[I[i][j+1]])
You can do this using integer array indexing. For large arrays, using for loops like in the currently accepted answer is going to be much slower than using vectorized operations.
import numpy as np
I = np.array([[0, 1], [1, 2]])
X = np.array([10, 5, 3])
T = X[I[:, 0:1]] - X[I[:, 1:2]]
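A hedged variation on the same idea: index X with the whole of I once, then subtract the columns of the result. The names pairs, T_column and T_flat are only for illustration:

```python
import numpy as np

I = np.array([[0, 1], [1, 2]])
X = np.array([10, 5, 3])

# Integer array indexing: pairs has the same shape as I,
# with each index replaced by the corresponding value from X
pairs = X[I]

# Keep a trailing axis of length 1, matching the expected output shape
T_column = pairs[:, :1] - pairs[:, 1:]

# Or drop it when a flat result is enough
T_flat = pairs[:, 0] - pairs[:, 1]

print(T_column.shape, T_flat.shape)
```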
I have a function predictions like
def predictions(degree):
    some magic,
    return an np.ndarray([0..100])
I want to call this function for a few values of degree and use it to populate a larger np.ndarray (n=2), filling each row with the outcome of the function predictions. It seems like a simple task, but somehow I can't get it working. I tried with
for deg in [1,2,4,8,10]:
    np.append(result, predictions(deg),axis=1)
with result being an np.empty(100). But that failed with Singleton array array(1) cannot be considered a valid collection.
I could not get fromfunction to work: it only operates on a coordinate tuple, and an irregular list of degrees is not covered in the docs.
Don't use np.ndarray until you are older and wiser! I couldn't even use it without rereading the docs.
arr1d = np.array([1,2,3,4,5])
is the correct way to construct a 1d array from a list of numbers.
Also don't use np.append. I won't even add the 'older and wiser' qualification. It doesn't work in-place; and is slow when used in a loop.
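A minimal sketch of why the loop in the question leaves result untouched: np.append returns a new array and never modifies its argument. The variable names here are only for illustration:

```python
import numpy as np

result = np.empty(0)

# np.append builds and returns a new array; `result` itself is unchanged
returned = np.append(result, [1.0, 2.0])

print(result.size)    # size of the original array
print(returned.size)  # size of the newly created array
```

Because every call copies all existing data into a fresh array, calling it repeatedly in a loop also gets slower as the array grows.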
A good way of building a 2d array from 1d arrays is:
alist = []
for i in ....:
    alist.append(<alist or 1d array>)
arr = np.array(alist)
Provided all the sublists have the same size, arr should be a 2d array.
This is equivalent to building a 2d array from
np.array([[1,2,3], [4,5,6]])
that is a list of lists.
Or a list comprehension:
np.array([predictions(i) for i in range(10)])
Again, predictions must all return the same length arrays or lists.
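Applied to the degrees from the question, this could look roughly like the sketch below. The body of predictions here is a made-up stand-in (the real function is not shown in the question); the only assumption that matters is that it returns 100 values each time:

```python
import numpy as np

def predictions(degree):
    # Hypothetical placeholder for the real function: 100 values per call
    x = np.linspace(0.0, 1.0, 100)
    return x ** degree

degrees = [1, 2, 4, 8, 10]

# One row per degree; works because every call returns the same length
result = np.array([predictions(deg) for deg in degrees])

print(result.shape)
```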
append is in the boring section of numpy. Here you know the shape in advance, so you can preallocate:
len_predictions = 100
def predictions(degree):
    return np.ones((len_predictions,))
degrees = [1,2,4,8,10]
result = np.empty((len(degrees), len_predictions))
for i, deg in enumerate(degrees):
    result[i] = predictions(deg)
If you want to store the degree somehow, you can use custom dtypes.
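One possible reading of the custom dtype suggestion is a structured array that keeps each degree next to its row of predictions. This is only a sketch: the field names "degree" and "values" and the placeholder body of predictions are assumptions, not part of the original answer:

```python
import numpy as np

len_predictions = 100

def predictions(degree):
    # Placeholder: fill the row with the degree so rows are easy to tell apart
    return np.full(len_predictions, float(degree))

degrees = [1, 2, 4, 8, 10]

# Structured dtype: one integer field plus a fixed-length float sub-array
record_type = np.dtype([("degree", np.int64),
                        ("values", np.float64, (len_predictions,))])
result = np.zeros(len(degrees), dtype=record_type)

for i, deg in enumerate(degrees):
    result["degree"][i] = deg
    result["values"][i] = predictions(deg)

print(result["degree"])
print(result["values"].shape)
```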
I have 2 arrays of unequal size:
>>> np.size(array1)
4004001
>>> np.size(array2)
1000
Now, each element in array2 needs to be compared to all the elements in array1, to find the element which has the nearest value to that of this element in array2.
Upon finding this value, I need to store it in a different array of size 1000 - one of a size corresponding to array2.
The tedious and crude way of doing it would be a for loop that takes each element of array2, computes the absolute difference to every element of array1, and then takes the minimum. That is going to make my code really slow.
I'd like to use numpy vectorized operations to do this, but I've kind of hit a wall.
To make full use of the numpy parallelism we need vectorized functions. Furthermore, all values are found in the same array (array1) using the same criterion (nearest). Therefore, it is possible to write a special function for searching in array1 specifically.
However, to make the solution more reusable it is better to write a general solution first and then transform it into a more specific one. Thus, as a general approach to finding the closest value, we start with this find nearest solution. Then we turn it into a more specific one and vectorize it, so that it works on multiple elements at once:
import math
import numpy as np
from functools import partial
def find_nearest_sorted(array,value):
    idx = np.searchsorted(array, value, side="left")
    if idx > 0 and (idx == len(array) or math.fabs(value - array[idx-1]) < math.fabs(value - array[idx])):
        return array[idx-1]
    else:
        return array[idx]
array1 = np.random.rand(4004001)
array2 = np.random.rand(1000)
array1_sorted = np.sort(array1)
# Partially apply array1 to find function, to turn the general function
# into a specific, working with array1 only.
find_nearest_in_array1 = partial(find_nearest_sorted, array1_sorted)
# Vectorize specific function to allow us to apply it to all elements of
# array2, the numpy way.
vectorized_find = np.vectorize(find_nearest_in_array1)
output = vectorized_find(array2)
Hopefully this is what you wanted, a new vector, mapping the data in array2 to the nearest values in array1.
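As a hedged alternative sketch, np.searchsorted also accepts an array of values, so the neighbour comparison can be done for all elements at once without np.vectorize. The function name nearest_values and the smaller array sizes are assumptions for illustration, and the sketch assumes the sorted array has at least two elements:

```python
import numpy as np

def nearest_values(haystack, needles):
    """For each needle, return the closest value found in haystack."""
    s = np.sort(haystack)
    # Insertion positions for all needles in one call
    idx = np.searchsorted(s, needles, side="left")
    # Clip so that both idx - 1 and idx are valid positions
    idx = np.clip(idx, 1, len(s) - 1)
    left = s[idx - 1]
    right = s[idx]
    # Pick whichever neighbour is closer to the needle
    return np.where(needles - left <= right - needles, left, right)

rng = np.random.default_rng(0)
array1 = rng.random(10000)
array2 = rng.random(50)
output = nearest_values(array1, array2)
print(output.shape)
```

The memory use stays proportional to the two input arrays, since no distance matrix is built.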
The most "numpythonic" way is to use broadcasting. This is a quick and easy way to calculate a distance matrix, for which you can then take the argmin of the absolute value.
array1 = np.random.rand(4004001)
array2 = np.random.rand(1000)
# Calculate distance matrix (on truncated array1 for memory reasons)
dmat = array1[:400400] - array2[:,None]
# Take the abs of the distance matrix and work out the argmin along the last axis
ix = np.abs(dmat).argmin(axis=1)
shape of dmat:
(1000, 400400)
shape of ix and contents:
(1000,)
array([237473, 166831, 72369, 11663, 22998, 85179, 231702, 322752, ...])
However, it's memory hungry if you do this operation in one go, and actually doesn't work on my 8GB machine for the size of arrays that you specify, which is why I reduced the size of array1.
To make it work within memory constraints, simply slice one of the arrays into chunks and apply broadcasting on each chunk in turn (or parallelise). In this case, I've sliced array2 into 10 chunks:
# Define number of chunks and calculate chunk size
n_chunks = 10
chunk_len = array2.size // n_chunks
# Preallocate output array
out = np.zeros(array2.size, dtype=np.intp)
for i in range(n_chunks):
    s = slice(i*chunk_len, (i+1)*chunk_len)
    out[s] = np.abs(array1 - array2[s, None]).argmin(axis=1)
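The chunked loop above can be wrapped in a small helper that also copes with lengths that are not a multiple of the chunk size. This is a sketch only; the function name nearest_indices_chunked, the chunk_size parameter and the reduced array sizes are assumptions for illustration:

```python
import numpy as np

def nearest_indices_chunked(haystack, needles, chunk_size=100):
    """Index into haystack of the nearest value for each needle."""
    out = np.empty(needles.size, dtype=np.intp)
    for start in range(0, needles.size, chunk_size):
        stop = start + chunk_size
        # Distance matrix for this chunk only: (chunk rows, haystack.size)
        diffs = np.abs(haystack[None, :] - needles[start:stop, None])
        out[start:stop] = diffs.argmin(axis=1)
    return out

rng = np.random.default_rng(1)
array1 = rng.random(5000)
array2 = rng.random(250)
ix = nearest_indices_chunked(array1, array2, chunk_size=60)
nearest = array1[ix]  # the nearest values themselves
print(ix.shape, nearest.shape)
```

Peak memory is governed by chunk_size times the size of the larger array, so the chunk size can be tuned to the machine.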
import numpy as np
a = np.random.random(size=4004001).astype(np.float16)
b = np.random.random(size=1000).astype(np.float16)
# Use numpy broadcasting to compute the pairwise differences, then find the argmin in a
# for each element in b. Finally, extract elements from a using the argmin array as indexes.
output = a[np.argmin(np.abs(b[:,None] -a),axis=1)]
This solution, while simple, can be very memory intensive. It may need further optimisation if used on large arrays.
Is there an easier way to get the sum of all values (assuming they are all numbers) in an ndarray :
import numpy as np
m = np.array([[1,2],[3,4]])
result = 0
(dim0,dim1) = m.shape
for i in range(dim0):
    for j in range(dim1):
        result += m[i,j]
print result
The above code seems somewhat verbose for a straightforward mathematical operation.
Thanks!
Just use numpy.sum():
result = np.sum(matrix)
or equivalently, the .sum() method of the array:
result = matrix.sum()
By default this sums over all elements in the array - if you want to sum over a particular axis, you should pass the axis argument as well, e.g. matrix.sum(0) to sum over the first axis.
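A short sketch of the difference between summing everything and summing along one axis, using the 2x2 array from the question (the variable names are only for illustration):

```python
import numpy as np

m = np.array([[1, 2], [3, 4]])

total = m.sum()            # all elements
column_sums = m.sum(axis=0)  # collapse the first axis (rows), one value per column
row_sums = m.sum(axis=1)     # collapse the second axis (columns), one value per row

print(total, column_sums, row_sums)
```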
As a side note your "matrix" is actually a numpy.ndarray, not a numpy.matrix - they are different classes that behave slightly differently, so it's best to avoid confusing the two.
Yes, just use the sum method:
result = m.sum()
For example,
In [17]: m = np.array([[1,2],[3,4]])
In [18]: m.sum()
Out[18]: 10
By the way, NumPy has a matrix class which is different than "regular" numpy arrays. So calling a regular ndarray matrix causes some cognitive dissonance. To help others understand your code, you may want to change the name matrix to something else.
How can I represent matrices in python?
Take a look at this answer:
from numpy import matrix
from numpy import linalg
A = matrix( [[1,2,3],[11,12,13],[21,22,23]]) # Creates a matrix.
x = matrix( [[1],[2],[3]] ) # Creates a matrix (like a column vector).
y = matrix( [[1,2,3]] ) # Creates a matrix (like a row vector).
print A.T # Transpose of A.
print A*x # Matrix multiplication of A and x.
print A.I # Inverse of A.
print linalg.solve(A, x) # Solve the linear equation system.
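For comparison, here is a hedged sketch of the same operations with plain ndarray objects and the @ operator (Python 3.5+); the NumPy documentation no longer recommends the numpy.matrix class for new code. The 2x2 matrix below is an assumption chosen to be invertible; it is not the matrix from the answer above:

```python
import numpy as np

A = np.array([[2.0, 1.0], [1.0, 3.0]])  # example matrix, chosen to be invertible
x = np.array([[1.0], [2.0]])            # column vector as a (2, 1) array

A_transposed = A.T                  # transpose of A
product = A @ x                     # matrix multiplication of A and x
A_inverse = np.linalg.inv(A)        # inverse of A
solution = np.linalg.solve(A, x)    # solve A @ solution = x

print(product)
print(solution)
```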
Python doesn't have matrices. You can use a list of lists or NumPy.
If you are not going to use the NumPy library, you can use nested lists. This code builds a dynamic nested list (a 2-dimensional list).
Let r be the number of rows:
r = 3
m = []
for i in range(r):
    m.append([int(x) for x in raw_input().split()])
At any time, you can append a row using
m.append([int(x) for x in raw_input().split()])
Above, you have to enter the matrix row-wise. To insert a column:
for i in m:
    i.append(x) # x is the value to be added in column
To print the matrix:
print m # all in single row
for i in m:
    print i # each row in a different line
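The same nested-list steps, sketched in Python 3 without interactive input so the snippet runs as-is. The list rows stands in for the lines a user would type, and the values are made up for illustration:

```python
# Stand-in for three lines of user input, one matrix row per line
rows = ["1 2 3", "4 5 6", "7 8 9"]

# Build the matrix row-wise
m = [[int(x) for x in line.split()] for line in rows]

# Append another row
m.append([10, 11, 12])

# Append a column: one new value per existing row
new_column = [0, 0, 0, 0]
for row, value in zip(m, new_column):
    row.append(value)

# Transpose with zip: columns of m become rows
transposed = [list(col) for col in zip(*m)]

for row in m:
    print(row)  # each row on its own line
```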
((1,2,3,4),
(5,6,7,8),
(9,0,1,2))
Using tuples instead of lists makes it marginally harder to change the data structure in unwanted ways.
If you are going to make extensive use of those, you are best off wrapping a true number array in a class, so you can define methods and properties on them. (Or, you could use NumPy, SciPy, ... if you are going to do your processing with those libraries.)
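A rough sketch of what such a wrapper class might look like, using the standard library array module for storage. The class name Matrix and its methods are hypothetical, and the sketch does no bounds checking on indices:

```python
from array import array

class Matrix:
    """Hypothetical minimal wrapper around a flat array of floats."""

    def __init__(self, rows):
        self.n_rows = len(rows)
        self.n_cols = len(rows[0])
        if any(len(r) != self.n_cols for r in rows):
            raise ValueError("all rows must have the same length")
        # Row-major storage in a typed array of doubles
        self._data = array("d", (v for r in rows for v in r))

    def __getitem__(self, pos):
        i, j = pos
        return self._data[i * self.n_cols + j]

    def __setitem__(self, pos, value):
        i, j = pos
        self._data[i * self.n_cols + j] = value

    def transpose(self):
        return Matrix([[self[i, j] for i in range(self.n_rows)]
                       for j in range(self.n_cols)])

m = Matrix([(1, 2, 3, 4), (5, 6, 7, 8), (9, 0, 1, 2)])
print(m[1, 2], m.transpose().n_rows)
```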