Problem:
Populate a 10 x 10 array of zeros randomly with 10 1's, 20 2's, 30 3's.
I don't actually have to use an array; I just need the coordinates of the positions where the values would go. It's just easier to think of in terms of an array.
I have written several solutions for this, but they all seem non-straightforward and unpythonic. I am hoping someone can give me some insight. My method has been to use a linear array of 0–99, randomly choose (np.random.choice) 10 values and remove them from the array, then randomly choose 20 values, and so on. After that, I convert the linear positions into (y, x) coordinates.
import numpy as np
dim = 10
grid = np.arange(dim**2)
n1 = 10
n2 = 20
n3 = 30
def populate(grid, n, dim):
    # draw n unique linear positions, remove them from the remaining pool,
    # and convert each linear position to (y, x) coordinates
    pos = np.random.choice(grid, size=n, replace=False)
    yx = np.zeros((n, 2), dtype=int)
    for i in range(n):
        delPos = np.where(grid == pos[i])
        grid = np.delete(grid, delPos)
        yx[i, :] = [pos[i] // dim, pos[i] % dim]
    return yx, grid
pos1, grid = populate(grid, n1, dim)
pos2, grid = populate(grid, n2, dim)
pos3, grid = populate(grid, n3, dim)
Extra
Suppose when I populate the 1's, I want them all on one half of the "array." I can do it with my method (sampling from grid[dim**2 // 2:]), but I haven't figured out how to do the same with the other suggestions.
You can create a list of all coordinates, shuffle that list and take the first 60 of those (10 + 20 + 30):
>>> import random
>>> coordinates = [(i, j) for i in range(10) for j in range(10)]
>>> random.shuffle(coordinates)
>>> coordinates[:60]
[(9, 5), (6, 9), (1, 5), ..., (0, 2), (5, 9), (2, 6)]
You can then use the first 10 to insert the 10 values, the next 20 for the 20 values and the remaining for the 30 values.
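For example, a minimal sketch of the split (the names are illustrative):
ones = coordinates[:10]      # positions of the ten 1s
twos = coordinates[10:30]    # positions of the twenty 2s
threes = coordinates[30:60]  # positions of the thirty 3s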
To generate the array, you can use numpy.random.choice.
np.random.choice([0, 1, 2, 3], size=(10,10), p=[.4, .1, .2, .3])
Then you can convert to coordinates. Note that numpy.random.choice generates a random sample using probabilities p, and thus you are not guaranteed to get the exact proportions in p.
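For example, a sketch of the coordinate conversion with np.argwhere, which returns the (row, column) pairs where a condition holds:
M = np.random.choice([0, 1, 2, 3], size=(10, 10), p=[.4, .1, .2, .3])
coords = {v: np.argwhere(M == v) for v in (1, 2, 3)}  # value -> array of (y, x) positions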
Extra
If you want to have all the 1s on a particular side of the array, you can generate two random arrays and then hstack them. The trick is to slightly modify the probabilities of each number on each side.
In [1]: import numpy as np
In [2]: rem = .1/3 # amount to de- / increase the probability for non-1s
In [3]: A = np.random.choice([0, 1, 2, 3], size=(5, 10),
   ...:                      p=[.4-rem, .2, .2-rem, .3-rem])
In [4]: B = np.random.choice([0, 2, 3], size=(5, 10), p=[.4+rem, .2+rem, .3+rem])
In [5]: M = np.hstack( (A, B) )
In [6]: M
Out[6]:
array([[1, 1, 3, 0, 3, 0, 0, 1, 1, 0, 2, 2, 0, 2, 0, 2, 3, 3, 2, 0],
[0, 3, 3, 3, 3, 0, 1, 3, 1, 3, 0, 2, 3, 0, 0, 0, 3, 3, 2, 3],
[1, 0, 0, 0, 1, 0, 3, 1, 2, 2, 0, 3, 0, 3, 3, 0, 0, 3, 0, 0],
[3, 2, 3, 0, 3, 0, 1, 2, 3, 2, 0, 0, 0, 0, 3, 2, 0, 0, 0, 3],
[3, 3, 0, 3, 3, 3, 1, 3, 0, 3, 0, 2, 0, 2, 0, 0, 0, 3, 3, 3]])
Here, because I'm putting all the 1s on the left half, I double the probability of a 1: the ten 1s now have to fit into 50 cells instead of 100, so p(1) = 10/50 = 0.2 instead of 0.1, and the probability of each other number is decreased by rem so the side still sums to 1. The same logic applies when creating the other side.
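As a quick sanity check (a sketch; rem is repeated here so it runs standalone), both probability vectors must still sum to 1:
rem = .1/3
assert abs(sum([.4-rem, .2, .2-rem, .3-rem]) - 1) < 1e-12
assert abs(sum([.4+rem, .2+rem, .3+rem]) - 1) < 1e-12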
Not sure if this is any more "pythonic", but here's something I came up with using part of Simeon's answer.
import random
dim = 10
n1 = 10
n2 = 20
n3 = 30
coords = [[i, j] for i in range(dim) for j in range(dim)]

def setCoords(coords, n):
    pos = []
    for i in range(n):
        random.shuffle(coords)  # reshuffle before each draw, then pop one coordinate
        pos.append(coords.pop())
    return coords, pos

coordsTmp, pos1 = setCoords(coords[dim**2 // 2:], n1)  # 1s restricted to one half
coords = coords[:dim**2 // 2] + coordsTmp
coords, pos2 = setCoords(coords, n2)
coords, pos3 = setCoords(coords, n3)
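A shorter variant of the same helper (a sketch; random.sample draws without replacement directly, so no reshuffling is needed):
def setCoords(coords, n):
    pos = random.sample(coords, n)
    remaining = [c for c in coords if c not in pos]
    return remaining, pos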
Related
I have an n x k binary numpy array, and I am trying to find an efficient way to count, for each column j, the number of pairs of rows that both have a one in column j but do not both have a one in any higher column; here "higher" means larger column index.
For example in the array:
array([[1, 1, 1, 0, 1, 0],
[1, 0, 1, 1, 1, 0],
[1, 0, 1, 0, 0, 0],
[1, 1, 1, 1, 1, 0],
[1, 0, 1, 0, 1, 1],
[1, 0, 1, 0, 1, 1],
[1, 1, 1, 1, 0, 0],
[1, 1, 1, 0, 1, 0]], dtype=int32)
the output should be array([ 0, 0, 11, 2, 14, 1], dtype=int32). This makes sense: column[2] has all ones, so any pair of ones necessarily has a highest common column of at least 2. Even though column[0] also has all ones, it is lower, so no pair of ones has it as its highest common column. In all cases I am considering, column[0] will always have all ones.
Here is some example code that works and that I believe is something like O(n²k):
def hcc(i, j, k, bin_mat):
    # hcc means highest common column
    # i, j: row indices
    # k: number of columns - 1
    # bin_mat: binary matrix
    for q in range(k, 0, -1):
        if bin_mat[i, q] and bin_mat[j, q]:
            return q
    return 0

def get_num_pairs_columns(bin_mat):
    k = bin_mat.shape[1] - 1
    num_pairs_hcc = np.zeros(k+1, dtype=np.int32)  # number of one-pairs per column
    for i in range(bin_mat.shape[0]):
        for j in range(i + 1, bin_mat.shape[0]):
            num_pairs_hcc[hcc(i, j, k, bin_mat)] += 1
    return num_pairs_hcc
Another way I've thought of approaching the problem is through sets. Every column gets its own set, and the index of every row with a one gets added to that set. For the example above, this looks like:
sets = [{0, 1, 2, 3, 4, 5, 6, 7},
        {0, 3, 6, 7},
        {0, 1, 2, 3, 4, 5, 6, 7},
        {1, 3, 6},
        {0, 1, 3, 4, 5, 7},
        {4, 5}]
The idea is to find the number of pairs in sets[j] that appear in no higher set (a pair may appear in a lower set, just not in a higher one). Since, as mentioned before, all cases have column zero filled with ones, every set is a subset of sets[0]. A much worse-performing implementation using this approach looks like this:
import itertools
import numpy as np

def generate_sets(bin_mat):
    sets = []
    for j in range(bin_mat.shape[1]):
        column = set()
        for i in range(bin_mat.shape[0]):
            if bin_mat[i, j] == 1:
                column.add(i)
        sets.append(column)
    return sets

def get_hcc_sets(bin_mat):
    sets = generate_sets(bin_mat)
    pairs_sets = []
    num_pairs_hcc = np.zeros(len(sets), dtype=np.int32)
    for subset in sets:
        pairs_sets.append({p for p in itertools.combinations(sorted(subset), r=2)})
    for j in range(len(sets) - 1):
        intersections = [pairs_sets[j].intersection(pairs_sets[q])
                         for q in range(j + 1, len(sets))]
        num_pairs_hcc[j] = len(pairs_sets[j] - set.union(*intersections))
    num_pairs_hcc[len(sets) - 1] = len(pairs_sets[len(sets) - 1])
    return num_pairs_hcc
I haven't checked that this sets implementation always produces the same results as the previous one, but it does in the cases I tried. However, I am 100% certain that my first implementation gives exactly the result I need.
Another reference example:
input:
array([[1, 1, 0],
[1, 1, 0],
[1, 1, 0],
[1, 1, 0],
[1, 0, 1],
[1, 0, 1],
[1, 0, 1],
[1, 0, 1]], dtype=int32)
output:
array([16, 6, 6], dtype=int32)
Is there a way to beat my O(n²k) implementation? It seems rather brute-force, and like there should be something I can exploit to make this calculation faster. I always expect n to be greater than k, by orders of magnitude in many cases, so I'd rather k have the higher exponent than n.
If you are going for the O(n² k) approach in python, you can do it with much shorter code using itertools and set; the code might be more efficient too.
import itertools
t = [[1, 1, 1, 0, 1, 0],
     [1, 0, 1, 1, 1, 0],
     [1, 0, 1, 0, 0, 0],
     [1, 1, 1, 1, 1, 0],
     [1, 0, 1, 0, 1, 1],
     [1, 0, 1, 0, 1, 1],
     [1, 1, 1, 1, 0, 0],
     [1, 1, 1, 0, 1, 0]]

n, k = len(t), len(t[0])

# build set of pairs of 1s in column j
def candidates(j):
    return {(i1, i2) for (i1, i2) in itertools.combinations(range(n), 2)
            if 1 == t[i1][j] == t[i2][j]}

# build set of pairs of 1s in higher columns
def badpairs(j):
    return {(i1, i2) for (i1, i2) in itertools.combinations(range(n), 2)
            if any(1 == t[i1][j0] == t[i2][j0] for j0 in range(j + 1, k))}

# set difference
def finalpairs(j):
    return candidates(j) - badpairs(j)

# print pairs
for j in range(k):
    print(j, finalpairs(j))
# 0 set()
# 1 set()
# 2 {(2, 4), (1, 2), (2, 7), (4, 6), (0, 6), (2, 3), (6, 7), (0, 2), (2, 6), (5, 6), (2, 5)}
# 3 {(1, 6), (3, 6)}
# 4 {(0, 1), (0, 7), (0, 4), (3, 4), (1, 5), (3, 7), (0, 3), (1, 4), (5, 7), (1, 7), (0, 5), (1, 3), (4, 7), (3, 5)}
# 5 {(4, 5)}
# print number of pairs
for j in range(k):
print(j, len(finalpairs(j)))
# 0 0
# 1 0
# 2 11
# 3 2
# 4 14
# 5 1
Alternate definition for badpairs:
def badpairs(j):
    return set().union(*(candidates(j0) for j0 in range(j + 1, k)))
Slightly different approach: avoid building badpairs altogether:
def finalpairs(j):
    return {(i1, i2) for (i1, i2) in itertools.combinations(range(n), 2)
            if 1 == t[i1][j] == t[i2][j]
            and not any(1 == t[i1][j0] == t[i2][j0] for j0 in range(j + 1, k))}
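If numpy is available, the same O(n²k) work can also be vectorized (a sketch, not asymptotically better; it leans on the question's guarantee that column 0 is all ones, so every pair has at least one common column and argmax always finds a 1):
import numpy as np

t_arr = np.array(t, dtype=np.int32)
n, k = t_arr.shape
common = t_arr[:, None, :] & t_arr[None, :, :]       # (n, n, k): pairwise AND of rows
hcc = k - 1 - np.argmax(common[:, :, ::-1], axis=2)  # highest common column per pair
iu = np.triu_indices(n, 1)                           # each unordered pair exactly once
counts = np.bincount(hcc[iu], minlength=k)
print(counts)  # expected: [ 0  0 11  2 14  1]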
Given an arbitrary number of 3D trajectories with N points (timesteps) each, I would like to compute the distance between the points of consecutive trajectories at a given timestep.
Let's say we'll look at timestep 3 and have four trajectories t_0 ... t_3. The point of the third timestep of trajectory 0 is given as t_0(3). I want to calculate the distances as follows:
d_0 = norm(t_0(3) - t_1(3))
d_1 = norm(t_1(3) - t_2(3))
d_2 = norm(t_2(3) - t_3(3))
d_3 = norm(t_3(3) - t_0(3))
As you can see there is a kind of circular behavior in it (the last one calculates the distance to the first one), but that is not strictly necessary.
I know how to write some for-loops and calculate what I want. What I am looking for is a concept, or maybe an implementation in numpy (or a combination of np functions), that performs this logic just using the right axes and other numpy magic.
Here are some example trajectories:
import numpy as np
TIMESTEP_COUNT = 70
origin = np.array([0, 0, 0])
run1_direction = np.array([1, 0, 0]) / np.linalg.norm([1, 0 ,0])
run2_direction = np.array([0, 1, 0]) / np.linalg.norm([0, 1, 0])
run3_direction = np.array([0, 0, 1]) / np.linalg.norm([0, 0, 1])
run4_direction = np.array([1, 1, 0]) / np.linalg.norm([1, 1, 0])
run1_trajectory = [origin]
run2_trajectory = [origin]
run3_trajectory = [origin]
run4_trajectory = [origin]
for t in range(TIMESTEP_COUNT - 1):
run1_trajectory.append(run1_trajectory[-1] + run1_direction)
run2_trajectory.append(run2_trajectory[-1] + run2_direction)
run3_trajectory.append(run3_trajectory[-1] + run3_direction)
run4_trajectory.append(run4_trajectory[-1] + run4_direction)
run1_trajectory = np.array(run1_trajectory)
run2_trajectory = np.array(run2_trajectory)
run3_trajectory = np.array(run3_trajectory)
run4_trajectory = np.array(run4_trajectory)
... which results in four straight-line trajectories leaving the origin along the x, y, z, and xy-diagonal directions (the original post showed a plot here).
Thank you in advance!!
EDIT:
My question is different from the suggested answer below because I don't want to calculate a full distance matrix. My algorithm should work with the distances between consecutive runs only.
I think you can stack them vertically to get an array whose first axis indexes the runs, and then use np.roll to take the difference with the next run at each timestep, namely:
r = np.vstack([t0,t1,t2,t3])
r - np.roll(r,shift=-1,axis=0)
Numeric example:
t0,t1,t2,t3 = np.random.randint(1,10, 5), np.random.randint(1,10, 5), np.random.randint(1,10, 5), np.random.randint(1,10, 5)
r = np.vstack([t0,t1,t2,t3])
r
array([[1, 7, 7, 6, 2],
[9, 1, 2, 3, 6],
[1, 1, 6, 8, 1],
[2, 9, 5, 9, 3]])
r - np.roll(r,shift=-1,axis=0)
array([[-8, 6, 5, 3, -4],
[ 8, 0, -4, -5, 5],
[-1, -8, 1, -1, -2],
[ 1, 2, -2, 3, 1]])
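To get the actual distances d_0 ... d_3 from the question, a minimal sketch along the same lines (it assumes the run*_trajectory arrays built above, each of shape (TIMESTEP_COUNT, 3); note np.stack rather than np.vstack, so the runs land on a new first axis):
r = np.stack([run1_trajectory, run2_trajectory,
              run3_trajectory, run4_trajectory])  # shape (4, T, 3)
diffs = r - np.roll(r, shift=-1, axis=0)          # difference to the next run, circularly
dists = np.linalg.norm(diffs, axis=-1)            # shape (4, T): row i holds d_i per timestep
d_at_3 = dists[:, 3]                              # the four distances at timestep 3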
I have a dataset that needs to output boolean-style data, just 1 and 0, for true or not true. I am trying to parse simple data sets I've processed to look for a subset of information in a numpy array; the array is about 100,000 elements in one direction and 20 in the other. I only need to search along the 20 axis, but I need to do that for each of the 100,000 entries and get output that I can map.
I've produced an array of this size made up of zeros, with the intention of simply flipping the matching index indicator to a 1. The main hitch is that if I find a long set (I work from long sets down to small sets), I must NOT count any smaller set contained within it.
Sample:
[0,0,1,1,1,1,1,0,0,1,1,1,0,0,0,1,0,1]
I need to find here that there is one group of 5 starting at index 2 and one group of 3 starting at index 9, and not return any subset of the group of 5 as though it were a group of 4 or a group of 3, leaving the already-covered values alone; i.e., for groups of 3, the indices 2, 3, 4, 5, and 6 would all remain zero. It doesn't need to be overly efficient; I don't care if it searches anyway, I just need to not keep the result.
Currently I'm using a codeblock basically like this for a simple search:
import numpy

values = numpy.array([0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1])
searchval = [1, 2]
N = len(searchval)
possibles = numpy.where(values == searchval[0])[0]
print(possibles)

solns = []
for p in possibles:
    check = values[p:p+N]
    if numpy.all(check == searchval):
        solns.append(p)
print(solns)
I've been wracking my brain trying to come up with a way to restructure this or similar code to produce the desired results. The end goal is to search for groups of 9 down to groups of 3 and end up with, effectively, a matrix of 1s and 0s indicating whether an index has a group of a given length starting on it.
Hopefully someone can point me to what I'm missing to make this work. Thanks!
Using more_itertools, a third-party library (pip install more_itertools):
import more_itertools as mit
sample = [0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1]
groups = [list(c) for c in mit.consecutive_groups(mit.locate(sample))]
d = {group[0]: len(group) for group in groups}
d
# {2: 5, 9: 3, 15: 1, 17: 1}
This result reads "At index 2 is a group of 5 ones. At index 9 is a group of 3 ones," etc.
Details
more_itertools.locate finds indices for truthy items by default.
more_itertools.consecutive_groups chunks consecutive numbers together.
The result is a dictionary of (starting-index, length) pairs.
As a dictionary, you can extract different kinds of information:
>>> # List of starting indices
>>> list(d)
[2, 9, 15, 17]
>>> # List indices for all lonely groups
>>> [k for k, v in d.items() if v == 1]
[15, 17]
>>> # List indices of groups with more than one item
>>> [k for k, v in d.items() if v > 1]
[2, 9]
Here is a numpy solution. I'm using a small example for demonstration but it easily scales (20 x 100,000 takes 25 ms on my rather modest laptop, see timings at the end of this post):
>>> import numpy as np
>>>
>>>
>>> a = np.random.randint(0, 2, (5, 10), dtype=np.int8)
>>> a
array([[0, 1, 0, 0, 1, 1, 0, 0, 0, 0],
[0, 1, 1, 0, 1, 0, 1, 0, 0, 0],
[1, 0, 1, 1, 1, 1, 0, 0, 0, 0],
[0, 0, 0, 0, 1, 0, 1, 0, 0, 0],
[0, 0, 1, 0, 1, 1, 1, 1, 0, 0]], dtype=int8)
>>>
>>> padded = np.pad(a,((1,1),(0,0)), 'constant')
# compare array to itself with offset to mark all switches from
# 0 to 1 or from 1 to 0
# then use 'where' to extract the coordinates
>>> colinds, rowinds = np.where((padded[:-1] != padded[1:]).T)
>>>
# the lengths of sets are the differences between switch points
>>> lengths = rowinds[1::2] - rowinds[::2]
# now we have the lengths we are free to throw the off-switches away
>>> colinds, rowinds = colinds[::2], rowinds[::2]
>>>
# admire
>>> from pprint import pprint
>>> pprint(list(zip(colinds, rowinds, lengths)))
[(0, 2, 1),
(1, 0, 2),
(2, 1, 2),
(2, 4, 1),
(3, 2, 1),
(4, 0, 5),
(5, 0, 1),
(5, 2, 1),
(5, 4, 1),
(6, 1, 1),
(6, 3, 2),
(7, 4, 1)]
Timings:
>>> def find_stretches(a):
... padded = np.pad(a,((1,1),(0,0)), 'constant')
... colinds, rowinds = np.where((padded[:-1] != padded[1:]).T)
... lengths = rowinds[1::2] - rowinds[::2]
... colinds, rowinds = colinds[::2], rowinds[::2]
... return colinds, rowinds, lengths
...
>>> a = np.random.randint(0, 2, (20, 100000), dtype=np.int8)
>>> from timeit import repeat
>>> kwds = dict(globals=globals(), number=100)
>>> repeat('find_stretches(a)', **kwds)
[2.475784719004878, 2.4715258619980887, 2.4705517270049313]
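To map this back onto the goal in the question, a hedged sketch: build one 0/1 matrix per group length and mark the start of each maximal run. Sub-runs of longer runs are never marked, because the detected stretches are maximal. Note the runs here go down the columns of a, so transpose first if your 20-axis runs horizontally.
cols, rows, lengths = find_stretches(a)
marks = {L: np.zeros(a.shape, dtype=np.int8) for L in range(3, 10)}
for c, r, L in zip(cols, rows, lengths):
    if 3 <= L <= 9:
        marks[L][r, c] = 1  # a maximal run of exactly length L starts here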
Something like this?
from collections import defaultdict

sample = [0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1]

# Keys are numbers of consecutive 1's, values are starting indices
results = defaultdict(list)
found = 0
for i, x in enumerate(sample):
    if x == 1:
        found += 1
    elif i == 0 or found == 0:
        continue
    else:
        results[found].append(i - found)
        found = 0
if found:
    results[found].append(i - found + 1)

assert results == {1: [15, 17], 3: [9], 5: [2]}
I cut the zeros out of a numpy array, do some stuff, and want to insert them back for visual purposes. I do have the indices of the sections and tried to insert the zeros back in with numpy.insert and zip, but the index runs out of bounds even though I start at the lower end. Example:
import numpy as np
a = np.array([1, 2, 4, 0, 0, 0, 3, 6, 2, 0, 0, 1, 3, 0, 0, 0, 5])
a = a[a != 0] # cut zeros out
zero_start = [3, 9, 13]
zero_end = [5, 10, 15]
# Now insert the zeros back in using the former indices
for ev in zip(zero_start, zero_end):
    a = np.insert(a, ev[0], np.zeros(ev[1]-ev[0]))

IndexError: index 13 is out of bounds for axis 0 with size 12
It seems the array size is not refreshed inside the loop. Any suggestions, or other (more pythonic) approaches to solve this problem?
Approach #1: Using indexing -
# Get all zero indices
idx = np.concatenate([range(i,j+1) for i,j in zip(zero_start,zero_end)])
# Setup output array of zeros
N = len(idx) + len(a)
out = np.zeros(N,dtype=a.dtype)
# Get mask of non-zero places and assign values from a into those
out[~np.in1d(np.arange(N),idx)] = a
We can also generate the actual indices where a had non-zeros originally and then assign. Thus, the last step of masking could be replaced with something like this -
out[np.setdiff1d(np.arange(N),idx)] = a
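For completeness, a quick self-contained check of Approach #1 on the sample data (a sketch):
import numpy as np

a0 = np.array([1, 2, 4, 0, 0, 0, 3, 6, 2, 0, 0, 1, 3, 0, 0, 0, 5])
a = a0[a0 != 0]
zero_start, zero_end = [3, 9, 13], [5, 10, 15]
idx = np.concatenate([range(i, j+1) for i, j in zip(zero_start, zero_end)])
N = len(idx) + len(a)
out = np.zeros(N, dtype=a.dtype)
out[~np.in1d(np.arange(N), idx)] = a
assert (out == a0).all()  # the original array is reconstructed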
Approach #2: Using np.insert given zero_start and zero_end as arrays -
insert_start = np.r_[zero_start[0], zero_start[1:] - zero_end[:-1]-1].cumsum()
out = np.insert(a, np.repeat(insert_start, zero_end - zero_start + 1), 0)
Sample run -
In [755]: a = np.array([1, 2, 4, 0, 0, 0, 3, 6, 2, 0, 0, 1, 3, 0, 0, 0, 5])
...: a = a[a != 0] # cut zeros out
...: zero_start = np.array([3, 9, 13])
...: zero_end = np.array([5, 10, 15])
...:
In [756]: s0 = np.r_[zero_start[0], zero_start[1:] - zero_end[:-1]-1].cumsum()
In [757]: np.insert(a, np.repeat(s0, zero_end - zero_start + 1), 0)
Out[757]: array([1, 2, 4, 0, 0, 0, 3, 6, 2, 0, 0, 1, 3, 0, 0, 0, 5])
Is there a function in Python that samples from an n-dimensional numpy array and returns the indices of each draw? If not, how would one go about defining such a function?
E.g.:
>>> probabilities = np.array([[.1, .2, .1], [.05, .5, .05]])
>>> print function(probabilities, draws = 10)
([1,1],[0,2],[1,1],[1,0],[0,1],[0,1],[1,1],[0,0],[1,1],[0,1])
I know this problem can be solved in many ways with 1-D arrays. However, I will be dealing with large n-dimensional arrays and cannot afford to reshape them just to do a single draw.
You can use np.unravel_index:
a = np.random.rand(3, 4, 5)
a /= a.sum()
def sample(a, n=1):
    a = np.asarray(a)
    choices = np.prod(a.shape)
    index = np.random.choice(choices, size=n, p=a.ravel())
    return np.unravel_index(index, a.shape)
>>> sample(a, 4)
(array([2, 2, 0, 2]), array([0, 1, 3, 2]), array([2, 4, 2, 1]))
This returns a tuple of arrays, one per dimension of a, each with length equal to the number of samples requested. If you would rather have an array of shape (samples, dimensions), change the return statement to:
return np.column_stack(np.unravel_index(index, a.shape))
And now:
>>> sample(a, 4)
array([[2, 0, 0],
[2, 2, 4],
[2, 0, 0],
[1, 0, 4]])
If your array is contiguous in memory, you can change the shape of your array in place:
probabilities = np.array([[.1, .2, .1], [.05, .5, .05]])
nrow, ncol = probabilities.shape
idx = np.arange( nrow * ncol ) # create 1D index
probabilities.shape = ( 6, ) # this is OK because your array is contiguous in memory
samples = np.random.choice( idx, 10, p=probabilities ) # sample in 1D
rowIndex = samples // ncol # convert back to 2D: integer-divide by the number of columns
colIndex = samples % ncol
rowIndex
array([1, 0, 0, 0, 1, 1, 1, 1, 1, 0])
colIndex
array([1, 1, 2, 0, 1, 1, 1, 1, 1, 1])
Note that since your array is contiguous in memory, reshape returns a view as well (demonstrated here on the original 2-D array):
In [53]: view = probabilities.reshape( 6, -1 )
    ...: view[ 0 ] = 9
    ...: probabilities[ 0, 0 ]
Out[53]:
9.0
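A variant that avoids mutating the array's shape at all (a sketch on the same 2 x 3 array): np.random.choice can take the flattened probabilities directly, and np.divmod performs both index conversions in one call.
import numpy as np

probabilities = np.array([[.1, .2, .1], [.05, .5, .05]])
nrow, ncol = probabilities.shape
flat_p = probabilities.ravel()                 # a view for contiguous arrays; shape untouched
samples = np.random.choice(flat_p.size, 10, p=flat_p)
rowIndex, colIndex = np.divmod(samples, ncol)  # row = idx // ncol, col = idx % ncol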