I have an n x k binary numpy array, and I am trying to find an efficient way to count, for each column j, the number of pairs of ones that belong to column[j] but not to any higher column ("higher" meaning larger index value).
For example in the array:
array([[1, 1, 1, 0, 1, 0],
       [1, 0, 1, 1, 1, 0],
       [1, 0, 1, 0, 0, 0],
       [1, 1, 1, 1, 1, 0],
       [1, 0, 1, 0, 1, 1],
       [1, 0, 1, 0, 1, 1],
       [1, 1, 1, 1, 0, 0],
       [1, 1, 1, 0, 1, 0]], dtype=int32)
the output should be array([ 0, 0, 11, 2, 14, 1], dtype=int32). This makes sense because column[2] is all ones, so any pair of ones necessarily has a highest common column of at least 2; column[0] is also all ones, but since it is lower, no pair of ones has it as their highest column in common. In all cases I am considering, column[0] will always be all ones.
Here is some example code that works, and which I believe is something like O(n^2 k):
def hcc(i, j, k, bin_mat):
    # hcc means highest common columns
    # i: index i
    # j: index j
    # k: number of columns - 1
    # bin_mat: binary matrix
    for q in range(k, 0, -1):
        if bin_mat[i, q] and bin_mat[j, q]:
            return q
    return 0
def get_num_pairs_columns(bin_mat):
    k = bin_mat.shape[1] - 1
    num_pairs_hcc = np.zeros(k+1, dtype=np.int32)  # number of one-pairs per column
    for i in range(bin_mat.shape[0]):
        for j in range(bin_mat.shape[0]):
            if i < j:
                num_pairs_hcc[hcc(i, j, k, bin_mat)] += 1
    return num_pairs_hcc
Another way I've thought of approaching the problem is through sets: every column gets its own set, and the index of every row with a one gets added to that column's set. For the example above, this would look like:
sets = [{0, 1, 2, 3, 4, 5, 6, 7},
        {0, 3, 6, 7},
        {0, 1, 2, 3, 4, 5, 6, 7},
        {1, 3, 6},
        {0, 1, 3, 4, 5, 7},
        {4, 5}]
The idea is to find the number of pairs in sets[j] that appear in no higher set (a pair may appear in a lower set, just not in a higher one). Since, as I mentioned before, all cases will have column zero all ones, every set is a subset of sets[0]. A much worse performing implementation using this approach looks like this:
def generate_sets(bin_mat):
    sets = []
    for j in range(bin_mat.shape[1]):
        column = set()
        for i in range(bin_mat.shape[0]):
            if bin_mat[i, j] == 1:
                column.add(i)
        sets.append(column)
    return sets
def get_hcc_sets(bin_mat):
    sets = generate_sets(bin_mat)
    pairs_sets = []
    num_pairs_hcc = np.zeros(len(sets), dtype=np.int32)
    for subset in sets:
        pairs_sets.append({p for p in itertools.combinations(sorted(subset), r=2)})
    for j in range(len(sets) - 1):
        intersections = [pairs_sets[j].intersection(pairs_sets[q]) for q in range(j+1, len(sets))]
        num_pairs_hcc[j] = len(pairs_sets[j] - set.union(*intersections))
    num_pairs_hcc[len(sets) - 1] = len(pairs_sets[len(sets) - 1])
    return num_pairs_hcc
I haven't checked that this set implementation always produces the same results as the previous one, but it works in the finitely many cases I tried. However, I am 100% certain that my first implementation gives exactly the result I need.
Another reference example:
input:
array([[1, 1, 0],
       [1, 1, 0],
       [1, 1, 0],
       [1, 1, 0],
       [1, 0, 1],
       [1, 0, 1],
       [1, 0, 1],
       [1, 0, 1]], dtype=int32)
output:
array([16, 6, 6], dtype=int32)
Is there a way to beat my O(n^2 k) implementation? It seems rather brute force, and it feels like there should be something I can exploit to make this calculation faster. I always expect n to be greater than k, by orders of magnitude in many cases, so I'd rather k have a higher exponent than n.
If you are going for the O(n² k) approach in Python, you can do it with much shorter code using itertools and set; the code might be more efficient too.
import itertools
t = [[1, 1, 1, 0, 1, 0],
     [1, 0, 1, 1, 1, 0],
     [1, 0, 1, 0, 0, 0],
     [1, 1, 1, 1, 1, 0],
     [1, 0, 1, 0, 1, 1],
     [1, 0, 1, 0, 1, 1],
     [1, 1, 1, 1, 0, 0],
     [1, 1, 1, 0, 1, 0]]
n, k = len(t), len(t[0])

# build set of pairs of 1s in column j
def candidates(j):
    return {(i1, i2) for (i1, i2) in itertools.combinations(range(n), 2) if 1 == t[i1][j] == t[i2][j]}

# build set of pairs of 1s in higher columns
def badpairs(j):
    return {(i1, i2) for (i1, i2) in itertools.combinations(range(n), 2) if any(1 == t[i1][j0] == t[i2][j0] for j0 in range(j+1, k))}

# set difference
def finalpairs(j):
    return candidates(j) - badpairs(j)

# print pairs
for j in range(k):
    print(j, finalpairs(j))
# 0 set()
# 1 set()
# 2 {(2, 4), (1, 2), (2, 7), (4, 6), (0, 6), (2, 3), (6, 7), (0, 2), (2, 6), (5, 6), (2, 5)}
# 3 {(1, 6), (3, 6)}
# 4 {(0, 1), (0, 7), (0, 4), (3, 4), (1, 5), (3, 7), (0, 3), (1, 4), (5, 7), (1, 7), (0, 5), (1, 3), (4, 7), (3, 5)}
# 5 {(4, 5)}
# print number of pairs
for j in range(k):
    print(j, len(finalpairs(j)))
# 0 0
# 1 0
# 2 11
# 3 2
# 4 14
# 5 1
Alternate definition for badpairs:
def badpairs(j):
    return set().union(*(candidates(j0) for j0 in range(j+1, k)))
Slightly different approach: avoid building badpairs altogether:
def finalpairs(j):
    return {(i1, i2) for (i1, i2) in itertools.combinations(range(n), 2) if 1 == t[i1][j] == t[i2][j] and not any(1 == t[i1][j0] == t[i2][j0] for j0 in range(j+1, k))}
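For reference, the same O(n² k) pairwise idea can also be fully vectorized with numpy broadcasting. This is only a sketch: it builds an (n, n, k) boolean array, so memory grows as n²k, and it relies on the question's guarantee that column 0 is all ones (so every pair shares at least one column and the argmax is well defined).
import numpy as np

def hcc_counts(bin_mat):
    n, k = bin_mat.shape
    b = bin_mat.astype(bool)
    # common[i1, i2, j] is True when rows i1 and i2 both have a 1 in column j
    common = b[:, None, :] & b[None, :, :]
    # highest common column = k - 1 minus the position of the first True
    # when the column axis is reversed
    hcc = k - 1 - np.argmax(common[:, :, ::-1], axis=2)
    i1, i2 = np.triu_indices(n, 1)  # each unordered pair exactly once
    return np.bincount(hcc[i1, i2], minlength=k)

On the example matrix this reproduces array([ 0, 0, 11, 2, 14, 1]).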
I want to build a numpy array from the result of itertools.product. My first approach was a simple:
from itertools import product
import numpy as np
max_init = 6
init_values = range(1, max_init + 1)
repetitions = 12
result = np.array(list(product(init_values, repeat=repetitions)))
This code works well for "small" repetitions (like <=4), but with "large" values (>= 12) it completely hogs the memory and crashes. I assumed that building the list was the thing eating all the RAM, so I searched how to make it directly with an array. I found Numpy equivalent of itertools.product and Using numpy to build an array of all combinations of two arrays.
So, I tested the following alternatives:
Alternative #1:
results = np.empty((max_init**repetitions, repetitions))
for i, row in enumerate(product(init_values, repeat=repetitions)):
    results[i] = row
Alternative #2:
init_values_args = [init_values] * repetitions
results = np.array(np.meshgrid(*init_values_args)).T.reshape(-1, repetitions)
Alternative #3:
results = np.indices([max_init] * repetitions).reshape(repetitions, -1).T + 1
#1 is extremely slow; I didn't have enough patience to let it finish (I gave up after a few minutes of processing on a 2017 MacBook Pro). #2 and #3 eat all the memory until the Python interpreter crashes, just like the initial approach.
After that, I thought I could express the same information in a different way that would still be useful for me: a dict whose keys are all the possible (sorted) combinations, and whose values are the counts of those combinations. So I tried:
Alternative #4:
from collections import Counter

def sorted_product(iterable, repeat=1):
    for el in product(iterable, repeat=repeat):
        yield tuple(sorted(el))

def count_product(repetitions=1, max_init=6):
    init_values = range(1, max_init + 1)
    sp = sorted_product(init_values, repeat=repetitions)
    counted_sp = Counter(sp)
    return np.array(list(counted_sp.values())), \
           np.array(list(counted_sp.keys()))

cnt, values = count_product(repetitions=repetitions, max_init=max_init)
But the line counted_sp = Counter(sp), which exhausts the generator, is also too slow for my needs (it too ran for several minutes before I canceled it).
Is there another way to generate the same data (or a different data structure containing the same information) that does not have the mentioned shortcomings of being too slow or using too much memory?
PS: I tested all the implementations above against my tests with small repetitions values, and all the tests passed, so they give consistent results.
I hope that editing the question is the best way to expand it; otherwise, let me know and I'll post where I should.
After reading the first two answers below and thinking about it, I agree that I am approaching the issue from the wrong angle: instead of a "brute force" approach, I should use probabilities and work with those.
My intention is, later on, for each combination:
- Count how many values are under a threshold X.
- Count how many values are equal or over threshold X and below a threshold Y.
- Count how many values are over threshold Y.
And group the combinations that have the same counts.
As an illustrative example:
If I roll 12 dice of 6 sides, what's the probability of having M dice with a value <3, N dice with a value >=3 and <4, and P dice with a value >5, for all possible combinations of M, N, and P?
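To sketch what I mean for the dice example (my assumption for the categories: "<3" is {1, 2}, ">=3 and <4" is {3}, ">5" is {6}, and the remaining values {4, 5} form a fourth, leftover category), the multinomial distribution gives the probability directly:
from math import factorial

def prob_counts(M, N, P, n_dice=12, sides=6):
    # category probabilities: {1, 2} -> 2/6, {3} -> 1/6, {6} -> 1/6, {4, 5} -> 2/6
    Q = n_dice - M - N - P  # dice falling in the leftover category
    if Q < 0:
        return 0.0
    pa, pb, pc, pd = 2 / sides, 1 / sides, 1 / sides, 2 / sides
    coef = factorial(n_dice) // (factorial(M) * factorial(N) * factorial(P) * factorial(Q))
    return coef * pa**M * pb**N * pc**P * pd**Q

Summing prob_counts(M, N, P) over all valid (M, N, P) gives 1.0, which is a quick sanity check.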
So, I think that I'll close this question in a few days while I go with this new approach. Thank you for all the feedback and your time!
The number of tuples that list(product(range(1,7), repeat=12)) makes is 6**12 = 2,176,782,336. Whether as a list or an array, that's probably too large for most computers.
In [119]: len(list(product(range(1,7),repeat=12)))
....
MemoryError:
Trying to make an array of that size directly:
In [129]: A = np.ones((6**12,12),int)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-129-e833a9e859e0> in <module>()
----> 1 A = np.ones((6**12,12),int)
/usr/local/lib/python3.5/dist-packages/numpy/core/numeric.py in ones(shape, dtype, order)
190
191 """
--> 192 a = empty(shape, dtype, order)
193 multiarray.copyto(a, 1, casting='unsafe')
194 return a
ValueError: Maximum allowed dimension exceeded
Array memory size, at 4 bytes per item:
In [130]: 4*12*6**12
Out[130]: 104,485,552,128
100GB?
Why do you need to generate 2 billion combinations of 12 numbers?
So with your Counter you reduce the number of items:
In [138]: sp = sorted_product(range(1,7), 2)
In [139]: counter=Counter(sp)
In [140]: counter
Out[140]:
Counter({(1, 1): 1,
         (1, 2): 2,
         (1, 3): 2,
         (1, 4): 2,
         (1, 5): 2,
         (1, 6): 2,
         (2, 2): 1,
         (2, 3): 2,
         (2, 4): 2,
         (2, 5): 2,
         (2, 6): 2,
         (3, 3): 1,
         (3, 4): 2,
         (3, 5): 2,
         (3, 6): 2,
         (4, 4): 1,
         (4, 5): 2,
         (4, 6): 2,
         (5, 5): 1,
         (5, 6): 2,
         (6, 6): 1})
from 36 to 21 items (for 2 repetitions). It shouldn't be hard to generalize this to more repetitions (combinations? permutations?). It will still push time and/or memory boundaries.
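In fact the Counter idea can be generalized without ever enumerating the full product (a sketch of the combinatorial shortcut, not tested against the OP's full use case): each sorted tuple is a combination with replacement, and its multiplicity in the full product is the multinomial permutation count, so both can be generated directly. For repetitions=12 over 6 values that is only C(17, 12) = 6188 tuples instead of 6**12 items.
from collections import Counter
from itertools import combinations_with_replacement
from math import factorial

def counted_sorted_product(values, repeat):
    # each sorted tuple appears repeat! / (m1! * m2! * ...) times in the full
    # product, where the m_i are the multiplicities of its distinct values
    for combo in combinations_with_replacement(values, repeat):
        count = factorial(repeat)
        for m in Counter(combo).values():
            count //= factorial(m)
        yield combo, count

For 2 repetitions, dict(counted_sorted_product(range(1, 7), 2)) reproduces the 21 counts shown above.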
A variant on meshgrid using mgrid:
In [175]: n=7; A=np.mgrid[[slice(1,7)]*n].reshape(n,-1).T
In [176]: A.shape
Out[176]: (279936, 7)
In [177]: B=np.array(list(product(range(1,7),repeat=7)))
In [178]: B.shape
Out[178]: (279936, 7)
In [179]: A[:10]
Out[179]:
array([[1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 2],
[1, 1, 1, 1, 1, 1, 3],
[1, 1, 1, 1, 1, 1, 4],
[1, 1, 1, 1, 1, 1, 5],
[1, 1, 1, 1, 1, 1, 6],
[1, 1, 1, 1, 1, 2, 1],
[1, 1, 1, 1, 1, 2, 2],
[1, 1, 1, 1, 1, 2, 3],
[1, 1, 1, 1, 1, 2, 4]])
In [180]: B[:10]
Out[180]:
array([[1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 2],
[1, 1, 1, 1, 1, 1, 3],
[1, 1, 1, 1, 1, 1, 4],
[1, 1, 1, 1, 1, 1, 5],
[1, 1, 1, 1, 1, 1, 6],
[1, 1, 1, 1, 1, 2, 1],
[1, 1, 1, 1, 1, 2, 2],
[1, 1, 1, 1, 1, 2, 3],
[1, 1, 1, 1, 1, 2, 4]])
In [181]: np.allclose(A,B)
Out[181]: True
mgrid is quite a bit faster:
In [182]: timeit B=np.array(list(product(range(1,7),repeat=7)))
317 ms ± 3.58 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [183]: timeit A=np.mgrid[[slice(1,7)]*n].reshape(n,-1).T
13.9 ms ± 242 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
but, yes, it will have the same overall memory usage and limit.
With n=10,
ValueError: array is too big; `arr.size * arr.dtype.itemsize` is larger than the maximum possible size.
The right answer is: Don't. Whatever you want to do with all these combinations, adjust your approach so that you either generate them one at a time and use them immediately without storing them, or better yet, find a way to get the job done without inspecting every combination. Your current solution works for toy problems, but is not suitable for larger parameters. Explain what you are up to and maybe someone here can help.
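As a sketch of the generate-one-at-a-time pattern (with a reduced repeat, since at repeat=12 even an empty loop over the 6**12 items would run for hours):
from itertools import product

# stream the combinations and use each one immediately instead of storing them;
# repeat=6 gives 6**6 = 46656 items, so this actually finishes quickly
total = 0
for combo in product(range(1, 7), repeat=6):
    total += sum(combo)  # stand-in for the real per-combination work
print(total)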
I am creating a conditioning experiment with three conditions (0, 1, 2) and need to pseudo-randomize the condition order. I need a randomized list in which each condition occurs at most 2 times in a row. Here is how I tried to achieve it. The code runs, but it takes an eternity...
Any ideas why this code is not working well and any different approaches to solve the problem?
#create a list with 36 values
types = [0] * 4 + [1] * 18 + [2] * 14  # 0 = CS+ without reinforcement; 1 = CS-; 2 = CS+ with shock
#random.shuffle(types)
while '1,1,1' or '2,2,2' or '0,0,0' in types:
    random.shuffle(types)
else:
    print(types)
Thank you in advance!
Martina
Your loop has several problems. First, while '1,1,1' or '2,2,2' or '0,0,0' in types: is the same as while ('1,1,1') or ('2,2,2') or ('0,0,0' in types):. Non-empty strings are always truthy, so your condition is always true and the while never stops. Even if it did, types is a list of integers, while '0,0,0' is a string, so it could never be an element of the list.
itertools.groupby is a good tool to solve this problem. It's an iterator designed to group a sequence into sub-iterators; you can use it to see whether any run of equal numbers is too long.
import random
import itertools

# create a list with 36 values
types = [0] * 4 + [1] * 18 + [2] * 14
print(types)
while True:
    random.shuffle(types)
    # try to find runs that are too long
    for key, subiter in itertools.groupby(types):
        if len(list(subiter)) >= 3:
            break  # found one, stay in the while
    else:
        break  # found none, leave the while
print(types)
while '1,1,1' or '2,2,2' or '0,0,0' in types:
    random.shuffle(types)
evaluates as:
while True or True or '0,0,0' in types:
    random.shuffle(types)
and short-circuits at while True.
Instead, use any(), which returns True if any of the inner terms is True.
Additionally, your types contains numbers while you're comparing against strings:
>>> types
[0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
so you need to map those numbers to a string which can be compared:
>>> ','.join(map(str, types))
'0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2'
Try:
while any(run in ','.join(map(str, types)) for run in ['0,0,0', '1,1,1', '2,2,2']):
    random.shuffle(types)
>>> types
[1, 2, 1, 2, 1, 2, 1, 1, 0, 2, 0, 1, 2, 1, 1, 2, 1, 2, 1, 2, 2, 1, 1, 2, 0, 2, 1, 1, 0, 2, 1, 1, 2, 2, 1, 1]
I have a simple dataset on HDFS that I'm loading into Spark. It looks like this:
1 1 1 1 1 1 1
1 1 1 1 1 1 1
1 1 1 1 1 1 1
...
basically, a matrix. I'm trying to implement something that requires grouping matrix rows, so I'm trying to add a unique key to every row, like so:
(1, [1 1 1 1 1 ... ])
(2, [1 1 1 1 1 ... ])
(3, [1 1 1 1 1 ... ])
...
I tried something somewhat naive: set a global variable and write a lambda function to iterate over the global variable:
# initialize global index
global global_index
global_index = 0

# function to generate keys
def generateKeys(x):
    global_index += 1
    return (global_index, x)

# read in data and operate on it
data = sc.textFile("/data.txt")
...some preprocessing...
data.map(generateKeys)
And it seemed not to recognize the existence of the global variable.
Is there an easy way that comes to mind to do this?
Thanks,
Jack
>>> lsts = [
... [1, 1, 1, 1, 1, 1],
... [1, 1, 1, 1, 1, 1],
... [1, 1, 1, 1, 1, 1],
... [1, 1, 1, 1, 1, 1],
... [1, 1, 1, 1, 1, 1],
... [1, 1, 1, 1, 1, 1],
... [1, 1, 1, 1, 1, 2],
... [1, 1, 1, 2, 1, 2]
... ]
...
>>> list(enumerate(lsts))
[(0, [1, 1, 1, 1, 1, 1]),
(1, [1, 1, 1, 1, 1, 1]),
(2, [1, 1, 1, 1, 1, 1]),
(3, [1, 1, 1, 1, 1, 1]),
(4, [1, 1, 1, 1, 1, 1]),
(5, [1, 1, 1, 1, 1, 1]),
(6, [1, 1, 1, 1, 1, 2]),
(7, [1, 1, 1, 2, 1, 2])]
enumerate generates a unique index for each item in the iterable and yields tuples of (index, original_item).
If you want to start numbering with something other than 0, pass the starting value to enumerate as the second parameter.
>>> list(enumerate(lsts, 1))
[(1, [1, 1, 1, 1, 1, 1]),
(2, [1, 1, 1, 1, 1, 1]),
(3, [1, 1, 1, 1, 1, 1]),
(4, [1, 1, 1, 1, 1, 1]),
(5, [1, 1, 1, 1, 1, 1]),
(6, [1, 1, 1, 1, 1, 1]),
(7, [1, 1, 1, 1, 1, 2]),
(8, [1, 1, 1, 2, 1, 2])]
Note that list is used to pull the real values out of enumerate, which returns an iterator, not a list.
Alternative: a globally available id assigner
enumerate is easy to use, but if you needed to assign ids in different pieces of your code, it would become difficult or impossible. For such a case, a globally available generator (as drafted in the OP) would be the way to go.
itertools provides count, which can serve our need:
>>> from itertools import count
>>> idgen = count()
Now we have a (globally available) idgen generator ready to yield unique ids.
We can test it with a function prid (print id):
>>> def prid():
... id = idgen.next()
... print id
...
>>> prid()
0
>>> prid()
1
>>> prid()
2
>>> prid()
3
As it works, we can test it on a list of values:
>>> lst = ['100', '101', '102', '103', '104', '105', '106', '107', '108', '109']
and define the actual function which, when called with a value, returns the tuple (id, value):
>>> def assignId(val):
... return (idgen.next(), val)
...
Note that there is no need to declare idgen as global, as we are not rebinding it (the generator only changes its internal state when called, but the name keeps referring to the same generator object).
Test that it works:
>>> assignId("ahahah")
(4, 'ahahah')
and try it on the list:
>>> map(assignId, lst)
[(5, '100'),
(6, '101'),
(7, '102'),
(8, '103'),
(9, '104'),
(10, '105'),
(11, '106'),
(12, '107'),
(13, '108'),
(14, '109')]
The main difference from the enumerate solution is that we can assign ids one by one anywhere in the code, without having to do it all inside a single enumerate pass.
>>> assignId("lonely line")
(15, 'lonely line')
Try dataRdd.zipWithIndex(), and swap the resulting tuple if having the index first is a must.
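A minimal sketch of that, assuming the SparkContext sc and the /data.txt path from the question:
data = sc.textFile("/data.txt")
rows = data.map(lambda line: [int(x) for x in line.split()])
# zipWithIndex yields (row, index); swap to get (index, row)
keyed = rows.zipWithIndex().map(lambda pair: (pair[1], pair[0]))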
I have to create a list in Python of all the numbers of length X where each digit is lower than 3. For example, for length 4:
[[0000],[0001],[0002],[0010],[0011],...] and so on..
I have some ideas, but I can't think of any good, performant solution.
I thought about doing the following:
- Create a function that checks whether all of a number's digits are < 3.
- Loop over the 9999 numbers, run the function on each, and add the matches to the list.
To summarize, I want to list all numbers of length X in base 3.
Edit:
this can help: [(x, y, z) for x in xrange(3) for y in xrange(3) for z in xrange(3)]. It is even better for me if the output is a generator, but this answer isn't dynamic: I can't change its length.
Using itertools.product:
[''.join(map(str,tup)) for tup in product(range(3),repeat=4)]
I joined them into strings since [0000] will just display as [0]. You could instead leave them as tuples and get rid of all the join(map(str, ...)) mumbo jumbo. In that case you don't even need the list comp; it's just
list(product(range(3), repeat=4))
The following is not exactly what you asked for, but it comes close:
from itertools import product

def create_list(x):
    return list(product(range(3), repeat=x))

print(create_list(3))
This will print:
[(0, 0, 0), (0, 0, 1), (0, 0, 2), (0, 1, 0), (0, 1, 1), (0, 1, 2),
(0, 2, 0), (0, 2, 1), (0, 2, 2), (1, 0, 0), (1, 0, 1), (1, 0, 2),
(1, 1, 0), (1, 1, 1), (1, 1, 2), (1, 2, 0), (1, 2, 1), (1, 2, 2),
(2, 0, 0), (2, 0, 1), (2, 0, 2), (2, 1, 0), (2, 1, 1), (2, 1, 2),
(2, 2, 0), (2, 2, 1), (2, 2, 2)]
1. Come up with a rule that orders all the responses, so that each one comes either before or after each other one.
2. Write code to find the first response.
3. Write code to determine if a response is the last response.
4. Write code to convert a response into the next response.

Now the algorithm is trivial:

1. Set an indicator to the first response (step 2 above).
2. Output the current value of the indicator.
3. If the indicator is the last response (step 3 above), stop.
4. Increment the indicator (step 4 above).
5. Go to step 2.

I would suggest you order them numerically, so for four digits it would be 0000, 0001, 0002, 0010, 0011, and so on. The first response is then all zeroes, and the last response is all twos. This just leaves the issue of writing the code to increment to the next response; a sketch follows.
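A minimal sketch of that increment step (my illustration, not the answerer's code), treating a response as a fixed-length list of base-3 digits:
def next_response(digits, base=3):
    # add 1 in the given base, rightmost digit first;
    # returns None when the input was the last response (all digits == base - 1)
    digits = digits[:]  # don't mutate the caller's list
    for i in reversed(range(len(digits))):
        if digits[i] < base - 1:
            digits[i] += 1
            return digits
        digits[i] = 0  # carry into the next digit to the left
    return None

# walk all length-4 responses: 0000, 0001, 0002, 0010, ..., 2222
response = [0, 0, 0, 0]
while response is not None:
    print(''.join(map(str, response)))
    response = next_response(response)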
inputBase, expectedSize = 2, 3

def convertToBase(num, base):
    result, current = [], 0
    if not num:
        result.append(0)
    while num:
        result.append(num % base)
        current += 1
        num //= base
    result.reverse()
    return result, current

currentNum, result = 0, []
while True:
    based, size = convertToBase(currentNum, inputBase)
    if size > expectedSize:
        break
    while len(based) < expectedSize:
        based.insert(0, 0)
    result.append(based)
    currentNum += 1
print(result)
Output:
[[0, 0, 0], [0, 0, 1], [0, 1, 0], [0, 1, 1], [1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]]
Just change
inputBase, expectedSize = 2, 3
to any base and the number of digits you want.