Is this best way or most efficient way to generate random numbers from a geometric distribution with an array of parameters that may contain 0?
allids["c"]=[2,0,1,1,3,0,0,2,0]
[ 0 if x == 0 else numpy.random.geometric(1./x) for x in allids["c"]]
Note I am somewhat concerned about optimization.
EDIT:
A bit of context: I have an sequence of characters (i.e. ATCGGGA) and I would like to expand/contract runs of a single character (i.e. if original sequence had a run of 2 'A's I want to simulate a sequence that will have an expected value of 2 'A's, but vary according to a geometric distribution). All the characters that are runs of length 1 I do NOT want to be of variable length.
So if
seq = 'AATCGGGAA'
allids["c"]=[2,0,1,1,3,0,0,2,0]
rep=[ 0 if x == 0 else numpy.random.geometric(1./x) for x in allids["c"]]
"".join([s*r for r, s in zip(rep, seq)])
will output (when rep is [1, 0, 1, 1, 3, 0, 0, 1, 0])
"ATCGGGA"
You can use a masked array to avoid the division by zero.
import numpy as np
a = np.ma.masked_equal([2, 0, 1, 1, 3, 0, 0, 2, 0], 0)
rep = np.random.geometric(1. / a)
rep[a.mask] = 0
This generates a random sample for each element of a, and then deletes some of them later. If you're concerned about this waste of random numbers, you could generate just enough, like so:
import numpy as np
a = np.ma.masked_equal([2, 0, 1, 1, 3, 0, 0, 2, 0], 0)
rep = np.zeros(a.shape, dtype=int)
rep[~a.mask] = np.random.geometric(1. / a[~a.mask])
What about this:
counts = array([2, 0, 1, 1, 3, 0, 0, 2, 0], dtype=float)
counts_ma = numpy.ma.array(counts, mask=(counts == 0))
counts[logical_not(counts.mask)] = \
array([numpy.random.geometric(v) for v in 1.0 / counts[logical_not(counts.mask)]])
You could potentially precompute the distribution of homopolymer runs and limit the number of calls to geometric as fetching large numbers of values from RNGs is more efficient than individual calls
Related
For a given array (1 or 2-dimensional) I would like to know, how many "patches" there are of nonzero elements. For example, in the array [0, 0, 1, 1, 0, 1, 0, 0] there are two patches.
I came up with a function for the 1-dimensional case, where I first assume the maximal number of patches and then decrease that number if a neighbor of a nonzero element is nonzero, too.
def count_patches_1D(array):
patches = np.count_nonzero(array)
for i in np.nonzero(array)[0][:-1]:
if (array[i+1] != 0):
patches -= 1
return patches
I'm not sure if that method works for two dimensions as well. I haven't come up with a function for that case and I need some help for that.
Edit for clarification:
I would like to count connected patches in the 2-dimensional case, including diagonals. So an array [[1, 0], [1, 1]] would have one patch as well as [[1, 0], [0, 1]].
Also, I am wondering if there is a build-in python function for this.
The following should work:
import numpy as np
import copy
# create an array
A = np.array(
[
[0, 1, 1, 1, 0, 1],
[0, 0, 1, 0, 0, 0],
[1, 0, 0, 1, 0, 1],
[1, 0, 0, 0, 0, 1],
[0, 0, 1, 0, 0, 1]
]
)
def isadjacent(pos, newpos):
"""
Check whether two coordinates are adjacent
"""
# check for adjacent columns and rows
return np.all(np.abs(np.array(newpos) - np.array(pos)) < 2):
def count_patches(A):
"""
Count the number of non-zero patches in an array.
"""
# get non-zero coordinates
coords = np.nonzero(A)
# add them to a list
inipatches = list(zip(*coords))
# list to contain all patches
allpatches = []
while len(inipatches) > 0:
patch = [inipatches.pop(0)]
i = 0
# check for all points adjacent to the points within the current patch
while True:
plen = len(patch)
curpatch = patch[i]
remaining = copy.deepcopy(inipatches)
for j in range(len(remaining)):
if isadjacent(curpatch, remaining[j]):
patch.append(remaining[j])
inipatches.remove(remaining[j])
if len(inipatches) == 0:
break
if len(inipatches) == 0 or plen == len(patch):
# nothing added to patch or no points remaining
break
i += 1
allpatches.append(patch)
return len(allpatches)
print(f"Number of patches is {count_patches(A)}")
Number of patches is 5
This should work for arrays with any number of dimensions.
def create_matrix(xy):
matrix = []
matrix_y = []
x = xy[0]
y = xy[1]
for z in range(y):
matrix_y.append(0)
for n in range(x):
matrix.append(matrix_y)
return matrix
def set_matrix(matrix,xy,set):
x = xy[0]
y = xy[1]
matrix[x][y] = set
return matrix
index = [4,5]
index_2 = [3,4]
z = create_matrix(index)
z = set_matrix(z,index_2, 12)
print(z)
output:
[[0, 0, 0, 0, 12], [0, 0, 0, 0, 12], [0, 0, 0, 0, 12], [0, 0, 0, 0, 12]]
This code should change only the last array
In your for n in range(x): loop you are appending the same y matrix multiple times. Python under the hood does not copy that array, but uses a pointer. So you have a row of pointers to the same one column.
Move the matrix_y = [] stuff inside the n loop and you get unique y arrays.
Comment: python does not actually have a pointer concept but it does use them. It hides from you when it does a copy data and when it only copies a pointer to that data. That's kind of bad language design, and it tripped you up here. So now you now that pointers exist, and that most of the time when you "assign arrays" you will actually only set a pointer.
Another comment: if you are going to be doing anything serious with matrices, you should really look into numpy. That will be many factors faster if you do numerical computations.
you don't need first loop in create_matrix, hide them with comment:
#for z in range(y):
# matrix_y.append(0)
change second one like this, it means an array filled with and length = y:
for n in range(x):
matrix.append([0] * y)
result (only last cell was changed in matrix):
z = set_matrix(z,index_2, 12)
print(z)
# [[0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 12]]
I have a square 2D numpy array, A, and an array of zeros, B, with the same shape.
For every index (i, j) in A, other than the first and last rows and columns, I want to assign to B[i, j] the value of np.sum(A[i - 1:i + 2, j - 1:j + 2].
Example:
A =
array([[0, 0, 0, 0, 0],
[0, 1, 0, 1, 0],
[0, 1, 1, 0, 0],
[0, 1, 0, 1, 0],
[0, 0, 0, 0, 0])
B =
array([[0, 0, 0, 0, 0],
[0, 3, 4, 2, 0],
[0, 4, 6, 3, 0],
[0, 3, 4, 2, 0],
[0, 0, 0, 0, 0])
Is there an efficient way to do this? Or should I simply use a for loop?
There is a clever (read "borderline smartass") way to do this with np.lib.stride_tricks.as_strided. as_strided allows you to create views into your buffer that simulate windows by adding another dimension to the view. For example, if you had a 1D array like
>>> x = np.arange(10)
>>> np.lib.stride_tricks.as_strided(x, shape=(3, x.shape[0] - 2), strides=x.strides * 2)
array([[0, 1, 2, 3, 4, 5, 6, 7],
[1, 2, 3, 4, 5, 6, 7, 8],
[2, 3, 4, 5, 6, 7, 8, 9]])
Hopefully it is clear that you can just sum along axis=0 to get the sum of each size 3 window. There is no reason you couldn't extrend that to two or more dimensions. I've written the shape and index of the previous example in a way that suggests a solution:
A = np.array([[0, 0, 0, 0, 0],
[0, 1, 0, 1, 0],
[0, 1, 1, 0, 0],
[0, 1, 0, 1, 0],
[0, 0, 0, 0, 0]])
view = np.lib.stride_tricks.as_strided(A,
shape=(3, 3, A.shape[0] - 2, A.shape[1] - 2),
strides=A.strides * 2
)
B[1:-1, 1:-1] = view.sum(axis=(0, 1))
Summing along multiple axes simultaneously has been supported in np.sum since v1.7.0. For older versions of numpy, just sum repeatedly (twice) along axis=0.
Filling in the edges of B is left as an exercise for the reader (since it's not really part of the question).
As an aside, the solution here is a one-liner if you want it to be. Personally, I think anything with as_strided is already illegible enough, and doesn't need any further obfuscation. I'm not sure if a for loop is going to be bad enough performance-wise to justify this method in fact.
For future reference, here is a generic window-making function that can be used to solve this sort of problem:
def window_view(a, window=3):
"""
Create a (read-only) view into `a` that defines window dimensions.
The first ``a.ndim`` dimensions of the returned view will be sized according to `window`.
The remaining ``a.ndim`` dimensions will be the original dimensions of `a`, truncated by `window - 1`.
The result can be post-precessed by reducing the leading dimensions. For example, a multi-dimensional moving average could look something like ::
window_view(a, window).sum(axis=tuple(range(a.ndim))) / window**a.ndim
If the window size were different for each dimension (`window` were a sequence rather than a scalar), the normalization would be ``np.prod(window)`` instead of ``window**a.ndim``.
Parameters
-----------
a : array-like
The array to window into. Due to numpy dimension constraints, can not have > 16 dims.
window :
Either a scalar indicating the window size for all dimensions, or a sequence of length `a.ndim` providing one size for each dimension.
Return
------
view : numpy.ndarray
A read-only view into `a` whose leading dimensions represent the requested windows into `a`.
``view.ndim == 2 * a.ndim``.
"""
a = np.array(a, copy=False, subok=True)
window = np.array(window, copy=False, subok=False, dtype=np.int)
if window.size == 1:
window = np.full(a.ndim, window)
elif window.size == a.ndim:
window = window.ravel()
else:
raise ValueError('Number of window sizes must match number of array dimensions')
shape = np.concatenate((window, a.shape))
shape[a.ndim:] -= window - 1
strides = a.strides * 2
return np.lib.stride_tricks.as_strided(a, shake=shape, strides=strides)
I have found no 'simple' ways of doing this. But here are two ways:
Still involves a for loop
# Basically, get the sum for each location and then pad the result with 0's
B = [[np.sum(A[j-1:j+2,i-1:i+2]) for i in range(1,len(A)-1)] for j in range(1,len(A[0])-1)]
B = np.pad(B, ((1,1)), "constant", constant_values=(0))
Is longer but no for loops (this will be a lot more efficient on big arrays):
# Roll basically slides the array in the desired direction
A_right = np.roll(A, -1, 1)
A_left = np.roll(A, 1, 1)
A_top = np.roll(A, 1, 0)
A_bottom = np.roll(A, -1, 0)
A_bot_right = np.roll(A_bottom, -1, 1)
A_bot_left = np.roll(A_bottom, 1, 1)
A_top_right = np.roll(A_top, -1, 1)
A_top_left = np.roll(A_top, 1, 1)
# After doing that, you can just add all those arrays and these operations
# are handled better directly by numpy compared to when you use for loops
B = A_right + A_left + A_top + A_bottom + A_top_left + A_top_right + A_bot_left + A_bot_right + A
# You can then return the edges to 0 or whatever you like
B[0:len(B),0] = 0
B[0:len(B),len(B[0])-1] = 0
B[0,0:len(B)] = 0
B[len(B[0])-1,0:len(B)] = 0
You can just sum the 9 arrays that make up a block, each one being shifted by 1 w.r.t. the previous in either dimension. Using slice notation this can be done for the whole array A at once:
B = np.zeros_like(A)
B[1:-1, 1:-1] = sum(A[i:A.shape[0]-2+i, j:A.shape[1]-2+j]
for i in range(0, 3) for j in range(0, 3))
General version for arbitrary rectangular windows
def sliding_window_sum(a, size):
"""Compute the sum of elements of a rectangular sliding window over the input array.
Parameters
----------
a : array_like
Two-dimensional input array.
size : int or tuple of int
The size of the window in row and column dimension; if int then a quadratic window is used.
Returns
-------
array
Shape is ``(a.shape[0] - size[0] + 1, a.shape[1] - size[1] + 1)``.
"""
if isinstance(size, int):
size = (size, size)
m = a.shape[0] - size[0] + 1
n = a.shape[1] - size[1] + 1
return sum(A[i:m+i, j:n+j] for i in range(0, size[0]) for j in range(0, size[1]))
Ive been breaking my head over trying to come up with a recursive way to build the following matrix in python. It is quite a challenge without pointers. Could anyone maybe help me out?
The recursion is the following:
T0 = 1,
Tn+1 = [[Tn, Tn],
[ 0, Tn]]
I have tried many iterations of some recursive function, but I cannot wrap my head around it.
def T(n, arr):
n=int(n)
if n == 0:
return 1
else:
c = 2**(n-1)
Tn = np.zeros((c,c))
Tn[np.triu_indices(n=c)] = self.T(n=n-1, arr=arr)
return Tn
arr = np.zeros((8,8))
T(arr=arr, n=3)
It's not hard to do this, but you need to be careful about the meaning of the zero in the recursion. This isn't really precise for larger values of n:
Tn+1 = [[Tn, Tn],
[ 0, Tn]]
Because that zero can represent a block of zeros for example on the second iteration you have this:
[1, 1, 1, 1],
[0, 1, 0, 1],
[0, 0, 1, 1],
[0, 0, 0, 1]
Those four zeros in the bottom-left are all represented by the one zero in the formula. The block of zeros needs to be the same shape as the blocks around it.
After that it's a matter of making Numpy put thing in the right order and shape for you. numpy.block is really handy for this and makes it pretty simple:
import numpy as np
def makegasket(n):
if n == 0:
return np.array([1], dtype=int)
else:
node = makegasket(n-1)
return np.block([[node, node], [np.zeros(node.shape, dtype=int), node]])
makegasket(3)
Result:
array([[1, 1, 1, 1, 1, 1, 1, 1],
[0, 1, 0, 1, 0, 1, 0, 1],
[0, 0, 1, 1, 0, 0, 1, 1],
[0, 0, 0, 1, 0, 0, 0, 1],
[0, 0, 0, 0, 1, 1, 1, 1],
[0, 0, 0, 0, 0, 1, 0, 1],
[0, 0, 0, 0, 0, 0, 1, 1],
[0, 0, 0, 0, 0, 0, 0, 1]])
If you use larger n you might enjoy matplotlib.pyplot.imshow for display:
from matplotlib.pyplot import imshow
# ....
imshow(makegasket(7))
You don't really need a recursive function to implement this recursion. The idea is to start with the UR corner and build outward. You can even start with the UL corner to avoid some of the book-keeping and flip the matrix along either axis, but this won't be as efficient in the long run.
def build_matrix(n):
size = 2**n
# Depending on the application, even dtype=np.bool might work
matrix = np.zeros((size, size), dtype=np.int)
# This is t[0]
matrix[0, -1] = 1
for i in range(n):
k = 2**i
matrix[:k, -2 * k:-k] = matrix[k:2 * k, -k:] = matrix[:k, -k:]
return matrix
Just for fun, here is a plot of timing results for this implementation vs #Mark Meyer's answer. It shows the slight timing advantage (also memory) of using a looping approach in this case:
Both algorithms run out of memory around n=15 on my machine, which is not too surprising.
I am using Python and I need to find the most efficient way to perform the following task.
Task: Given any 1-dimensional array v of zeros and ones, denote by k>=0 the number of subsequences of all ones of v.
I need to obtain from v a 2-dimensional array w such that:
1) shape(w)=(k,len(v)),
2) for every i=1,..,k, the i-th row of "w" is an array of all zeros except for the i-th subsequence of all ones of v.
Let me make an example: suppose $v$ is the array
v=[0,1,1,0,0,1,0,1,1,1]
Then k=3 and w should be the array
w=[[0,1,1,0,0,0,0,0,0,0],[0,0,0,0,0,1,0,0,0,0],[0,0,0,0,0,0,0,1,1,1]]
It is possible to write the code to perform this task in many ways, for example:
import numpy as np
start=[]
end=[]
for ii in range(len(v)-1):
if (v[ii:ii+2]==[0,1]).all():
start.append(ii)
if (v[ii:ii+2]==[1,0]).all():
end.append(ii)
if len(start)>len(end):
end.append(len(v)-1)
w=np.zeros((len(start),len(v)))
for jj in range(len(start)):
w[jj,start[jj]+1:end[jj]+1]=np.ones(end[jj]-start[jj])
But I need to perform this task on a very big array v and this task is part of a function which then undergoes minimization.. so I need it to be as efficient and fast as possible..
So in conclusion my question is: what is the most computationally efficient way to perform it in Python?
Here's one vectorized way -
def expand_islands2D(v):
# Get start, stop of 1s islands
v1 = np.r_[0,v,0]
idx = np.flatnonzero(v1[:-1] != v1[1:])
s0,s1 = idx[::2],idx[1::2]
# Initialize 1D id array of size same as expected o/p and has
# starts and stops assigned as 1s and -1s, so that a final cumsum
# gives us the desired o/p
N,M = len(s0),len(v)
out = np.zeros(N*M,dtype=int)
# Setup starts with 1s
r = np.arange(N)*M
out[s0+r] = 1
# Setup stops with -1s
if s1[-1] == M:
out[s1[:-1]+r[:-1]] = -1
else:
out[s1+r] = -1
# Final cumsum on ID array
out2D = out.cumsum().reshape(N,-1)
return N, out2D
Sample run -
In [105]: v
Out[105]: array([0, 1, 1, 0, 0, 1, 0, 1, 1, 1])
In [106]: k,out2D = expand_islands2D(v)
In [107]: k # number of islands
Out[107]: 3
In [108]: out2D # 2d output with 1s islands on different rows
Out[108]:
array([[0, 1, 1, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 1, 1, 1]])