Finding unique sets without subsets in python array - python

I have a dataset that needs to output boolean-style data, just 1 and 0 for true or false. I am trying to parse simple data sets I've processed to look for a subset of information in a numpy array; the array is about 100,000 elements in one direction and 20 in the other. I only need to search along the 20 axis, but I need to do that for each of the 100,000 entries and get output that I can map.
I've produced an array of this size made up of zeros, with the intention of simply marking the matching index to 1. A main hitch is that if I find a long set (I'm working from long sets down to small sets), I need to NOT include any smaller set that's within it.
Sample:
[0,0,1,1,1,1,1,0,0,1,1,1,0,0,0,1,0,1]
I need to find here that there is one group of 5 starting at index 2 and one group of 3 starting at index 9, and not return any subset of the group of 5 as though it were a group of 4 or a group of 3, thus leaving the results zero for all those already-covered values. I.e., for groups of 3, the indices 2, 3, 4, 5, and 6 would all remain zero. It doesn't need to be overly efficient; I don't care if it searches anyway, I just need to not keep the result.
Currently I'm using a codeblock basically like this for a simple search:
values = numpy.array([0,1,1,1,1,1,0,0,1,1,1])
searchval = [1,2]
N = len(searchval)
possibles = numpy.where(values == searchval[0])[0]
print(possibles)
solns = []
for p in possibles:
    check = values[p:p+N]
    if numpy.all(check == searchval):
        solns.append(p)
print(solns)
I've been wracking my brain trying to come up with a way to restructure this or similar code to produce the desired results. The end goal is to search for groups of 9 down to groups of 3, and to end up with, effectively, a matrix of 1s and 0s indicating whether an index has a group starting on it that is as long as we want.
Hopefully someone can point me to what I'm missing to make this work. Thanks!

Using more_itertools, a third-party library (pip install more_itertools):
import more_itertools as mit
sample = [0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1]
groups = [list(c) for c in mit.consecutive_groups(mit.locate(sample))]
d = {group[0]: len(group) for group in groups}
d
# {2: 5, 9: 3, 15: 1, 17: 1}
This result reads "At index 2 is a group of 5 ones. At index 9 is a group of 3 ones," etc.
Details
more_itertools.locate finds indices for truthy items by default.
more_itertools.consecutive_groups chunks consecutive numbers together.
The result is a dictionary of (starting-index, length) pairs.
As a dictionary, you can extract different kinds of information:
>>> # List of starting indices
>>> list(d)
[2, 9, 15, 17]
>>> # List indices for all lonely groups
>>> [k for k, v in d.items() if v == 1]
[15, 17]
>>> # List indices of groups with more than one item
>>> [k for k, v in d.items() if v > 1]
[2, 9]
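
To map this back to the OP's requested output format (a 0/1 matrix with one row per run length), here is a minimal sketch; the 3-to-9 range of lengths and the mask layout are illustrative assumptions, not part of more_itertools:
import numpy as np
import more_itertools as mit

sample = [0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1]
groups = [list(c) for c in mit.consecutive_groups(mit.locate(sample))]
d = {group[0]: len(group) for group in groups}

# One row per run length 3..9, one column per index. Each run is recorded
# exactly once, at its full length, so subsets of longer runs stay 0.
mask = np.zeros((7, len(sample)), dtype=np.int8)
for start, length in d.items():
    if 3 <= length <= 9:
        mask[length - 3, start] = 1
# mask[0, 9] == 1 (run of 3 at index 9); mask[0, 2] stays 0 because the
# run at index 2 is only recorded as a run of 5 (mask[2, 2] == 1)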

Here is a numpy solution. I'm using a small example for demonstration but it easily scales (20 x 100,000 takes 25 ms on my rather modest laptop, see timings at the end of this post):
>>> import numpy as np
>>>
>>>
>>> a = np.random.randint(0, 2, (5, 10), dtype=np.int8)
>>> a
array([[0, 1, 0, 0, 1, 1, 0, 0, 0, 0],
       [0, 1, 1, 0, 1, 0, 1, 0, 0, 0],
       [1, 0, 1, 1, 1, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 1, 0, 0, 0],
       [0, 0, 1, 0, 1, 1, 1, 1, 0, 0]], dtype=int8)
>>>
>>> padded = np.pad(a,((1,1),(0,0)), 'constant')
# compare array to itself with offset to mark all switches from
# 0 to 1 or from 1 to 0
# then use 'where' to extract the coordinates
>>> colinds, rowinds = np.where((padded[:-1] != padded[1:]).T)
>>>
# the lengths of sets are the differences between switch points
>>> lengths = rowinds[1::2] - rowinds[::2]
# now we have the lengths we are free to throw the off-switches away
>>> colinds, rowinds = colinds[::2], rowinds[::2]
>>>
# admire
>>> from pprint import pprint
>>> pprint(list(zip(colinds, rowinds, lengths)))
[(0, 2, 1),
 (1, 0, 2),
 (2, 1, 2),
 (2, 4, 1),
 (3, 2, 1),
 (4, 0, 5),
 (5, 0, 1),
 (5, 2, 1),
 (5, 4, 1),
 (6, 1, 1),
 (6, 3, 2),
 (7, 4, 1)]
Timings:
>>> def find_stretches(a):
...     padded = np.pad(a, ((1,1),(0,0)), 'constant')
...     colinds, rowinds = np.where((padded[:-1] != padded[1:]).T)
...     lengths = rowinds[1::2] - rowinds[::2]
...     colinds, rowinds = colinds[::2], rowinds[::2]
...     return colinds, rowinds, lengths
...
>>> a = np.random.randint(0, 2, (20, 100000), dtype=np.int8)
>>> from timeit import repeat
>>> kwds = dict(globals=globals(), number=100)
>>> repeat('find_stretches(a)', **kwds)
[2.475784719004878, 2.4715258619980887, 2.4705517270049313]
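To turn those stretch records back into the OP's 0/1 start-marker matrix, a small follow-up sketch (assuming, as in the demo above, that runs go down the columns of a, and using an illustrative minimum length of 3):
cols, rows, lengths = find_stretches(a)
keep = lengths >= 3                  # ignore runs shorter than 3
mask = np.zeros_like(a)
mask[rows[keep], cols[keep]] = 1     # 1 marks the start cell of each kept run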

Something like this?
from collections import defaultdict

sample = [0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1]

# Keys are numbers of consecutive 1's, values are starting indices
results = defaultdict(list)
found = 0
for i, x in enumerate(sample):
    if x == 1:
        found += 1
    elif i == 0 or found == 0:
        continue
    else:
        results[found].append(i - found)
        found = 0

if found:
    results[found].append(i - found + 1)

assert results == {1: [15, 17], 3: [9], 5: [2]}
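
Since results maps run length to start indices, filtering out short runs afterwards is a one-liner; a quick hedged example (the threshold of 3 is illustrative):
long_runs = {length: starts for length, starts in results.items() if length >= 3}
# long_runs -> {5: [2], 3: [9]}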

Related

Find runs and lengths of consecutive values in an array

I'd like to find equal values in an array and their indices if they occur consecutively more than 2 times.
[0, 3, 0, 1, 0, 1, 2, 1, 2, 2, 2, 2, 1, 3, 4]
So in this example I would find that the value "2" occurred "4" times, starting from position "8". Is there any built-in function to do that?
I found a way with collections.Counter
collections.Counter(a)
# Counter({2: 5, 1: 4, 0: 3, 3: 2, 4: 1})
but this is not what I am looking for.
Of course I can write a loop and compare two values and then count them, but maybe there is a more elegant solution?
Find consecutive runs and length of runs with condition
import numpy as np
arr = np.array([0, 3, 0, 1, 0, 1, 2, 1, 2, 2, 2, 2, 1, 3, 4])
res = np.ones_like(arr)
np.bitwise_xor(arr[:-1], arr[1:], out=res[1:]) # set equal, consecutive elements to 0
# use this for np.floats instead
# arr = np.array([0, 3, 0, 1, 0, 1, 2, 1, 2.4, 2.4, 2.4, 2, 1, 3, 4, 4, 4, 5])
# res = np.hstack([True, ~np.isclose(arr[:-1], arr[1:])])
idxs = np.flatnonzero(res) # get indices of non zero elements
values = arr[idxs]
counts = np.diff(idxs, append=len(arr)) # differences between consecutive indices are the run lengths
cond = counts > 2
values[cond], counts[cond], idxs[cond]
Output
(array([2]), array([4]), array([8]))
# (array([2.4, 4. ]), array([3, 3]), array([ 8, 14]))
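If this is needed repeatedly, the steps above can be bundled into a helper; a sketch under the same assumptions (integer input, since bitwise_xor does not apply to floats; the name find_runs is made up here):
import numpy as np

def find_runs(arr, min_len=3):
    res = np.ones_like(arr)
    np.bitwise_xor(arr[:-1], arr[1:], out=res[1:])  # non-zero marks the start of a run
    idxs = np.flatnonzero(res)                      # start index of each run
    counts = np.diff(idxs, append=len(arr))         # run lengths
    cond = counts >= min_len
    return arr[idxs][cond], counts[cond], idxs[cond]

find_runs(np.array([0, 3, 0, 1, 0, 1, 2, 1, 2, 2, 2, 2, 1, 3, 4]))
# (array([2]), array([4]), array([8]))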
_, i, c = np.unique(np.r_[[0], ~np.isclose(arr[:-1], arr[1:])].cumsum(),
                    return_index=1,
                    return_counts=1)
for index, count in zip(i, c):
    if count > 1:
        print([arr[index], count, index])
Out[]: [2, 4, 8]
A little more compact way of doing it that works for all input types.
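The cumsum expression is the heart of this version: every change point increments a running label, so each run gets its own id, and np.unique then reports each label's first index and count. A small illustration on the same arr:
labels = np.r_[[0], ~np.isclose(arr[:-1], arr[1:])].cumsum()
# labels -> [0, 1, 2, 3, 4, 5, 6, 7, 8, 8, 8, 8, 9, 10, 11]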

find the number of pairs that belong to a column but not a higher order column

I have an n x k binary numpy array, and I am trying to find an efficient way to find the number of pairs of ones that belong to some column[j] but not to any higher column; in this case, higher means a larger column index.
For example in the array:
array([[1, 1, 1, 0, 1, 0],
       [1, 0, 1, 1, 1, 0],
       [1, 0, 1, 0, 0, 0],
       [1, 1, 1, 1, 1, 0],
       [1, 0, 1, 0, 1, 1],
       [1, 0, 1, 0, 1, 1],
       [1, 1, 1, 1, 0, 0],
       [1, 1, 1, 0, 1, 0]], dtype=int32)
the output should be array([ 0, 0, 11, 2, 14, 1], dtype=int32). This makes sense because column[2] has all ones, so any pair of ones will necessarily have a highest common column of at least 2; even though column[0] also has all ones, it's lower, so no pair of ones has it as their highest in common. In all cases I am considering, column[0] will always have all ones.
Here is some example code that works and I believe is something like O(n^2 k)
def hcc(i, j, k, bin_mat):
    # hcc means highest common column
    # i: index i
    # j: index j
    # k: number of columns - 1
    # bin_mat: binary matrix
    for q in range(k, 0, -1):
        if bin_mat[i, q] and bin_mat[j, q]:
            return q
    return 0

def get_num_pairs_columns(bin_mat):
    k = bin_mat.shape[1] - 1
    num_pairs_hcc = np.zeros(k + 1, dtype=np.int32)  # number of one-pairs per column
    for i in range(bin_mat.shape[0]):
        for j in range(bin_mat.shape[0]):
            if i < j:
                num_pairs_hcc[hcc(i, j, k, bin_mat)] += 1
    return num_pairs_hcc
Another way I've thought of approaching the problem is through sets. So every column gets its own set, and the index of every row with a one gets added to that set. So for the example above, this would look like:
sets = [{0, 1, 2, 3, 4, 5, 6, 7},
        {0, 3, 6, 7},
        {0, 1, 2, 3, 4, 5, 6, 7},
        {1, 3, 6},
        {0, 1, 3, 4, 5, 7},
        {4, 5}]
The idea is to find the number of pairs in sets[j] that are in no higher set (they can be in a lower set, just not a higher one). Since, as I mentioned before, all cases will have column zero all ones, every set is a subset of sets[0]. So a much worse performing code using this approach looks like this:
import itertools
import numpy as np

def generate_sets(bin_mat):
    sets = []
    for j in range(bin_mat.shape[1]):
        column = set()
        for i in range(bin_mat.shape[0]):
            if bin_mat[i, j] == 1:
                column.add(i)
        sets.append(column)
    return sets

def get_hcc_sets(bin_mat):
    sets = generate_sets(bin_mat)
    pairs_sets = []
    num_pairs_hcc = np.zeros(len(sets), dtype=np.int32)
    for subset in sets:
        pairs_sets.append({p for p in itertools.combinations(sorted(subset), r=2)})
    for j in range(len(sets) - 1):
        intersections = [pairs_sets[j].intersection(pairs_sets[q]) for q in range(j + 1, len(sets))]
        num_pairs_hcc[j] = len(pairs_sets[j] - set.union(*intersections))
    num_pairs_hcc[len(sets) - 1] = len(pairs_sets[len(sets) - 1])
    return num_pairs_hcc
I haven't checked that this sets implementation always produces the same results as the previous one, but in the finitely many cases I tried, it works. However, I am 100% certain that my first implementation gives exactly the result I need.
another reference example:
input:
array([[1, 1, 0],
       [1, 1, 0],
       [1, 1, 0],
       [1, 1, 0],
       [1, 0, 1],
       [1, 0, 1],
       [1, 0, 1],
       [1, 0, 1]], dtype=int32)
output:
array([16, 6, 6], dtype=int32)
Is there a way to beat my O(n^2 k) implementation? It seems rather brute force, and it feels like there should be something I can exploit to make this calculation faster. I always expect n to be greater than k, by orders of magnitude in many cases, so I'd rather k have a higher exponent than n.
If you are going for the O(n² k) approach in python, you can do it with much shorter code using itertools and set; the code might be more efficient too.
import itertools
t = [[1, 1, 1, 0, 1, 0],
     [1, 0, 1, 1, 1, 0],
     [1, 0, 1, 0, 0, 0],
     [1, 1, 1, 1, 1, 0],
     [1, 0, 1, 0, 1, 1],
     [1, 0, 1, 0, 1, 1],
     [1, 1, 1, 1, 0, 0],
     [1, 1, 1, 0, 1, 0]]
n, k = len(t), len(t[0])

# build set of pairs of 1 in column j
def candidates(j):
    return {(i1, i2) for (i1, i2) in itertools.combinations(range(n), 2)
            if 1 == t[i1][j] == t[i2][j]}

# build set of pairs of 1 in higher columns
def badpairs(j):
    return {(i1, i2) for (i1, i2) in itertools.combinations(range(n), 2)
            if any(1 == t[i1][j0] == t[i2][j0] for j0 in range(j + 1, k))}

# set difference
def finalpairs(j):
    return candidates(j) - badpairs(j)

# print pairs
for j in range(k):
    print(j, finalpairs(j))
# 0 set()
# 1 set()
# 2 {(2, 4), (1, 2), (2, 7), (4, 6), (0, 6), (2, 3), (6, 7), (0, 2), (2, 6), (5, 6), (2, 5)}
# 3 {(1, 6), (3, 6)}
# 4 {(0, 1), (0, 7), (0, 4), (3, 4), (1, 5), (3, 7), (0, 3), (1, 4), (5, 7), (1, 7), (0, 5), (1, 3), (4, 7), (3, 5)}
# 5 {(4, 5)}
# print number of pairs
for j in range(k):
print(j, len(finalpairs(j)))
# 0 0
# 1 0
# 2 11
# 3 2
# 4 14
# 5 1
Alternate definition for badpairs:
def badpairs(j):
    return set().union(*(candidates(j0) for j0 in range(j + 1, k)))
Slightly different approach: avoid building badpairs
def finalpairs(j):
    return {(i1, i2) for (i1, i2) in itertools.combinations(range(n), 2)
            if 1 == t[i1][j] == t[i2][j]
            and not any(1 == t[i1][j0] == t[i2][j0] for j0 in range(j + 1, k))}
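
The same O(n² k) idea can also be vectorized with numpy broadcasting, which is typically much faster in practice even though the asymptotic cost is unchanged. A sketch, relying on the question's guarantee that column 0 is all ones (so every pair has at least one common column):
import numpy as np

t = np.array([[1, 1, 1, 0, 1, 0],
              [1, 0, 1, 1, 1, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 1, 1, 1, 0],
              [1, 0, 1, 0, 1, 1],
              [1, 0, 1, 0, 1, 1],
              [1, 1, 1, 1, 0, 0],
              [1, 1, 1, 0, 1, 0]], dtype=np.int32)
n, k = t.shape

common = t[:, None, :] & t[None, :, :]                   # (n, n, k): columns where both rows have a 1
highest = k - 1 - np.argmax(common[:, :, ::-1], axis=2)  # last common column of each pair
iu = np.triu_indices(n, 1)                               # each unordered pair (i1 < i2) once
counts = np.bincount(highest[iu], minlength=k)
# counts -> array([ 0,  0, 11,  2, 14,  1])
The (n, n, k) intermediate array trades memory for speed, so for very large n a chunked variant would be needed.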

How to get the max index of distinct groups of integers in a python list

Example:
[0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
In this case I need:
1st '0' group = index: 0-4 , length : 5
1st '1' group = index: 5-6 , length : 2
2nd '0' group = index: 7 , length : 1
2nd '1' group = index: 8-17 , length : 10 <---- NEED THIS the index of max length of '1's
3rd '0' group = index: 18 - 22 , length : 5
I think you are looking for itertools.groupby. With this you can get a list of lists, one for each run of equal integers in the original dataset.
>>> data = [0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
>>> [list(group) for _, group in itertools.groupby(data)]
[[0, 0, 0, 0, 0], [1, 1], [0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [0, 0, 0,0, 0]]
Or to get indexes, you can also do this in one line using itertools.groupby, enumerate and operator.itemgetter
>>> [sorted(set(itemgetter(0, -1)([i[0] for i in g]))) for _, g in groupby(enumerate(data), key=itemgetter(1))]
[[0, 4], [5, 6], [7], [8, 17], [18, 22]]
Or to get the starting or ending indexes, use this: (notice min and max determine the start or end index)
>>> [min(i[0] for i in group) for _, group in groupby(enumerate(data), key=itemgetter(1))]
[0, 5, 7, 8, 18]
>>> [max(i[0] for i in group) for _, group in groupby(enumerate(data), key=itemgetter(1))]
[4, 6, 7, 17, 22]
And to get the starting index of the largest group use:
>>> max(([next(group)[0], sum(1 for _ in group)] for _, group in groupby(enumerate(data), key=itemgetter(1))), key=itemgetter(1))[0]
8
The standard library provides itertools.groupby for this purpose. It's a bit tricky to use, because it does a lot of work:
>>> from itertools import groupby
>>> data = [0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
>>> groupby(data)
<itertools.groupby object at 0x0000015AB6EB3C78>
Hmm. It doesn't seem very useful yet. But we look at the documentation and see that it's a generator, so let's try expanding it into a list:
>>> list(groupby(data))
[(0, <itertools._grouper object at 0x0000015AB6EB3C78>),
 (1, <itertools._grouper object at 0x0000015AB6ED82B0>),
 (0, <itertools._grouper object at 0x0000015AB6ED8518>),
 (1, <itertools._grouper object at 0x0000015AB6EFE780>),
 (0, <itertools._grouper object at 0x0000015AB6F028D0>)]
The 0 and 1 values in here correspond to the 0s and 1s in the original data, but we still have these other objects. Those are also generators:
>>> [(value, list(grouper)) for value, grouper in groupby(data)]
[(0, [0, 0, 0, 0, 0]), (1, [1, 1]), (0, [0]),
 (1, [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]), (0, [0, 0, 0, 0, 0])]
Now we can see what's going on: the grouper objects generate chunks from the list.
So all we need to do is check the len of those lists and get the maximum value. We fix the comprehension so that we ignore the value and get the len of each grouper, and feed the results to the built-in max instead of making a list:
>>> max(len(list(grouper)) for value, grouper in groupby(data))
10
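The same loop extends naturally to also recover the starting index of the longest run of 1s, which is what the question ultimately asks for; a hedged sketch that keeps a running position:
>>> best_len, best_start, pos = 0, None, 0
>>> for value, grouper in groupby(data):
...     size = len(list(grouper))
...     if value == 1 and size > best_len:
...         best_len, best_start = size, pos
...     pos += size
...
>>> best_len, best_start
(10, 8)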
You can do it another way, without itertools:
j = 0
for i, val in enumerate(data):
    if i == 0:
        out = [[val]]
    elif val == data[i-1]:
        out[j] += [val]
    else:
        j += 1
        out += [[val]]
output:
[[0, 0, 0, 0, 0], [1, 1], [0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [0, 0, 0, 0, 0]]
now, make a dict with the unique values and the lengths of the sublists for each value:
counts = {}
for o in out:
    if o[0] not in counts.keys():
        counts[o[0]] = [len(o)]
    else:
        counts[o[0]] += [len(o)]
output:
{0: [5, 1, 5], 1: [2, 10]}
now get the max length of the sequences with the value you are after, in your case it's 1:
max(counts[1])
output:
10
EDIT: to also get the indices of this specific sequence you can do this:
id0 = 0
for o in out:
    if o[0] == 1 and len(o) == max(counts[1]):
        break
    id0 += len(o)
id1 = id0 + max(counts[1]) - 1
print(max(counts[1]), id0, id1)
output:
10 8 17
it isn't the prettiest... but it works :)
You could iterate using the following function:
import pandas as pd

def count_through_a_list(x):
    """
    Returns all distinct continuous groups of values in a list.
    Output is in the form of records.
    """
    group_start = 0
    group_count = 1
    prev = x[0]
    groups = []
    for i, n in enumerate(x):
        # when n differs from the previous value, the group ending at i-1 is complete
        if n != prev:
            groups.append({'start': group_start, 'end': i - 1, 'value': prev,
                           'length': i - group_start, 'group_counter': group_count})
            group_count += 1
            group_start = i
            prev = n
    # close out the final group, which the loop never flushes
    groups.append({'start': group_start, 'end': len(x) - 1, 'value': prev,
                   'length': len(x) - group_start, 'group_counter': group_count})
    return groups

x = [0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
groups = count_through_a_list(x)
pd.DataFrame(groups, columns=['start', 'end', 'value', 'length', 'group_counter'])
   start  end  value  length  group_counter
0      0    4      0       5              1
1      5    6      1       2              2
2      7    7      0       1              3
3      8   17      1      10              4
4     18   22      0       5              5
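With the records in a DataFrame, picking out the longest run of 1s is a standard filter-and-idxmax lookup; a short usage sketch:
df = pd.DataFrame(groups, columns=['start', 'end', 'value', 'length', 'group_counter'])
ones = df[df['value'] == 1]
ones.loc[ones['length'].idxmax()]
# start 8, end 17, value 1, length 10, group_counter 4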

python consecutive counts of an occurrence with length

This is probably really easy to do, but I am looking to calculate the length of consecutive positive occurrences in a list in Python. For example, I have a and I am looking to return b:
a=[0,0,1,1,1,1,0,0,1,0,1,1,1,0]
b=[0,0,4,4,4,4,0,0,1,0,3,3,3,0]
I note a similar question, Counting consecutive positive values in Python array, but it only returns consecutive counts, not the length of the group each element belongs to.
Thanks
This is similar to a run length encoding problem, so I've borrowed some ideas from that Rosetta code page:
import itertools
a = [0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0]
b = []
for item, group in itertools.groupby(a):
    size = len(list(group))
    for i in range(size):
        if item == 0:
            b.append(0)
        else:
            b.append(size)
b
Out[8]: [0, 0, 4, 4, 4, 4, 0, 0, 1, 0, 3, 3, 3, 0]
At last after so many tries came up with these two lines.
In [9]: from itertools import groupby
In [10]: lst=[list(g) for k,g in groupby(a)]
In [21]: [x*len(_lst) if x>=0 else x for _lst in lst for x in _lst]
Out[21]: [0, 0, 4, 4, 4, 4, 0, 0, 1, 0, 3, 3, 3, 0]
Here's one approach.
The basic premise is that when in a consecutive run of positive values, it will remember all the indices of these positive values. As soon as it hits a zero, it will backtrack and replace all the positive values with the length of their run.
a = [0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0]
run = []  # indices of the current run of positive values
for idx, val in enumerate(a):
    if val > 0:
        run.append(idx)
    elif run:
        # hit a zero: backtrack and replace the run with its length
        for j in run:
            a[j] = len(run)
        run = []
# flush a trailing run, in case the list ends with positive values
for j in run:
    a[j] = len(run)
# > [0, 0, 4, 4, 4, 4, 0, 0, 1, 0, 3, 3, 3, 0]

Pythonic way to sparsely randomly populate array?

Problem:
Populate a 10 x 10 array of zeros randomly with 10 1's, 20 2's, 30 3's.
I don't actually have to use an array, rather I just need coordinates for the positions where the values would be. It's just easier to think of in terms of an array.
I have written several solutions for this, but they all seem to be non-straightforward and non-Pythonic. I am hoping someone can give me some insight. My method has been to use a linear array of 0--99, choose randomly (np.random.choice) 10 values, remove them from the array, then choose 20 random values. After that, I convert the linear positions into (y,x) coordinates.
import numpy as np
dim = 10
grid = np.arange(dim**2)
n1 = 10
n2 = 20
n3 = 30
def populate(grid, n, dim):
    pos = np.random.choice(grid, size=n, replace=False)
    yx = np.zeros((n, 2))
    for i in xrange(n):
        delPos = np.where(grid == pos[i])
        grid = np.delete(grid, delPos)
        yx[i, :] = [np.floor(pos[i] / dim), pos[i] % dim]
    return (yx, grid)
pos1, grid = populate(grid, n1, dim)
pos2, grid = populate(grid, n2, dim)
pos3, grid = populate(grid, n3, dim)
Extra
Suppose when I populate the 1's, I want them all on one half of the "array." I can do it using my method (sampling from grid[dim**2/2:]), but I haven't figured out how to do the same with the other suggestions.
You can create a list of all coordinates, shuffle that list and take the first 60 of those (10 + 20 + 30):
>>> import random
>>> coordinates = [(i, j) for i in xrange(10) for j in xrange(10)]
>>> random.shuffle(coordinates)
>>> coordinates[:60]
[(9, 5), (6, 9), (1, 5), ..., (0, 2), (5, 9), (2, 6)]
You can then use the first 10 to insert the 10 values, the next 20 for the 20 values and the remaining for the 30 values.
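For example, slicing the shuffled list into the three groups might look like this (a sketch; the variable names are made up):
ones, twos, threes = coordinates[:10], coordinates[10:30], coordinates[30:60]
grid = [[0] * 10 for _ in range(10)]
for i, j in ones:
    grid[i][j] = 1
for i, j in twos:
    grid[i][j] = 2
for i, j in threes:
    grid[i][j] = 3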
To generate the array, you can use numpy.random.choice.
np.random.choice([0, 1, 2, 3], size=(10,10), p=[.4, .1, .2, .3])
Then you can convert to coordinates. Note that numpy.random.choice generates a random sample using probabilities p, and thus you are not guaranteed to get the exact proportions in p.
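If exact counts are required rather than expected proportions, one hedged alternative is to shuffle a flat array containing exactly the desired multiset of values and then reshape it:
import numpy as np

flat = np.zeros(100, dtype=int)
flat[:10], flat[10:30], flat[30:60] = 1, 2, 3  # exactly 10 ones, 20 twos, 30 threes
np.random.shuffle(flat)
grid = flat.reshape(10, 10)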
Extra
If you want to have all the 1s on a particular side of the array, you can generate two random arrays and then hstack them. The trick is to slightly modify the probabilities of each number on each side.
In [1]: import numpy as np
In [2]: rem = .1/3 # amount to de- / increase the probability for non-1s
In [3]: A = np.random.choice([0, 1, 2, 3], size=(5, 10),
   ...:                      p=[.4-rem, .2, .2-rem, .3-rem])
In [4]: B = np.random.choice([0, 2, 3], size=(5, 10), p=[.4+rem, .2+rem, .3+rem])
In [5]: M = np.hstack( (A, B) )
In [6]: M
Out[6]:
array([[1, 1, 3, 0, 3, 0, 0, 1, 1, 0, 2, 2, 0, 2, 0, 2, 3, 3, 2, 0],
       [0, 3, 3, 3, 3, 0, 1, 3, 1, 3, 0, 2, 3, 0, 0, 0, 3, 3, 2, 3],
       [1, 0, 0, 0, 1, 0, 3, 1, 2, 2, 0, 3, 0, 3, 3, 0, 0, 3, 0, 0],
       [3, 2, 3, 0, 3, 0, 1, 2, 3, 2, 0, 0, 0, 0, 3, 2, 0, 0, 0, 3],
       [3, 3, 0, 3, 3, 3, 1, 3, 0, 3, 0, 2, 0, 2, 0, 0, 0, 3, 3, 3]])
Here, because I'm putting all the 1s on the left, I double the probability of 1 and decrease the probability of each number equally. The same logic applies when creating the other side.
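As a quick sanity check of those adjusted probabilities (a sketch; the counts are random, so only approximately as expected):
In [7]: np.unique(M, return_counts=True)
# roughly 40 zeros, 10 ones, 20 twos and 30 threes out of 100 cells,
# with all of the ones in the left half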
Not sure if this is any more "pythonic", but here's something I came up with using part of Simeon's answer.
import random
dim = 10
n1 = 10
n2 = 20
n3 = 30
coords = [[i, j] for i in xrange(dim) for j in xrange(dim)]

def setCoords(coords, n):
    pos = []
    for i in xrange(n):
        random.shuffle(coords)
        pos.append(coords.pop())
    return (coords, pos)

coordsTmp, pos1 = setCoords(coords[dim**2/2:], n1)
coords = coords[:dim**2/2] + coordsTmp
coords, pos2 = setCoords(coords, n2)
coords, pos3 = setCoords(coords, n3)
