Repeat array to arbitrary length - Python

Is it possible to create an array that looks like
0, 1, 2, 3, 0, 1, 2, 0, 1, 2, 3, 4, 0, 1
given the following array as input
4, 3, 5, 2
without using loops in Python/Numpy?
EDIT:
This is just an example; the input (4, 3, 5, 2) may have any length and contain any numbers.

>>> lengths = np.array([4, 3, 5, 2])
>>> np.concatenate(map(np.arange, lengths))
array([0, 1, 2, 3, 0, 1, 2, 0, 1, 2, 3, 4, 0, 1])
Of course, this is cheating, because map is a loop in disguise. There's no NumPy idiom to do this any more directly, AFAIK.
The above creates len(lengths) temporaries. An alternative that does not construct these temporaries is to use fromiter and an adapted version of @jonrsharpe's answer:
>>> from itertools import chain, imap
>>> np.fromiter(chain.from_iterable(imap(xrange, lengths)), dtype=int,
... count=np.sum(lengths))
array([0, 1, 2, 3, 0, 1, 2, 0, 1, 2, 3, 4, 0, 1])
Somewhat surprisingly, the fromiter idiom is faster, and it gets faster if you don't compute the count first:
>>> lengths = np.arange(30)
>>> %timeit np.concatenate(map(np.arange, lengths))
10000 loops, best of 3: 64.8 µs per loop
>>> %timeit np.fromiter(chain.from_iterable(imap(xrange, lengths)), dtype=int, count=np.sum(lengths))
10000 loops, best of 3: 28.3 µs per loop
>>> %timeit np.fromiter(chain.from_iterable(imap(xrange, lengths)), dtype=int)
10000 loops, best of 3: 25.8 µs per loop
(Timings of NumPy 1.8.1 and Python 2.7.6 on an x86-64 running Linux.)

You can do it without writing for or while, but I assure you there's a loop under there somewhere!
>>> from itertools import chain, imap
>>> list(chain.from_iterable(imap(xrange, (4, 3, 5, 2))))
[0, 1, 2, 3, 0, 1, 2, 0, 1, 2, 3, 4, 0, 1]
Powered by itertools.
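Side note: imap and xrange are Python 2 names (the timings above were taken on Python 2.7). On Python 3, the built-in map and range are already lazy, so a sketch of the equivalent would be:
>>> from itertools import chain
>>> list(chain.from_iterable(map(range, (4, 3, 5, 2))))
[0, 1, 2, 3, 0, 1, 2, 0, 1, 2, 3, 4, 0, 1]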

Related

Can NumPy take care that an array is (nonstrictly) increasing along one axis?

Is there a function in numpy to guarantee or rather fix an array such that it is (nonstrictly) increasing along one particular axis?
For example, I have the following 2D array:
X = array([[1, 2, 1, 4, 5],
           [0, 3, 1, 5, 4]])
then np.foobar(X) should return
array([[1, 2, 2, 4, 5],
       [0, 3, 3, 5, 5]])
Does foobar exist or do I need to do that manually by using something like np.diff and some smart indexing?
Use np.maximum.accumulate to take a running (accumulated) maximum along that axis, which enforces the nonstrictly increasing criterion -
np.maximum.accumulate(X,axis=1)
Sample run -
In [233]: X
Out[233]:
array([[1, 2, 1, 4, 5],
       [0, 3, 1, 5, 4]])
In [234]: np.maximum.accumulate(X,axis=1)
Out[234]:
array([[1, 2, 2, 4, 5],
       [0, 3, 3, 5, 5]])
For memory efficiency, we can write the result back into the input array for in-place changes using its out argument.
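For example (a minimal sketch, assuming X may be overwritten):
np.maximum.accumulate(X, axis=1, out=X)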
Runtime tests
Case #1 : Array as input
In [254]: X = np.random.rand(1000,1000)
In [255]: %timeit np.maximum.accumulate(X,axis=1)
1000 loops, best of 3: 1.69 ms per loop
# @cᴏʟᴅsᴘᴇᴇᴅ's pandas soln using df.cummax
In [256]: %timeit pd.DataFrame(X).cummax(axis=1).values
100 loops, best of 3: 4.81 ms per loop
Case #2 : DataFrame as input
In [257]: df = pd.DataFrame(np.random.rand(1000,1000))
In [258]: %timeit np.maximum.accumulate(df.values,axis=1)
1000 loops, best of 3: 1.68 ms per loop
# @cᴏʟᴅsᴘᴇᴇᴅ's pandas soln using df.cummax
In [259]: %timeit df.cummax(axis=1)
100 loops, best of 3: 4.68 ms per loop
pandas offers you the df.cummax function:
import pandas as pd
pd.DataFrame(X).cummax(axis=1).values
array([[1, 2, 2, 4, 5],
       [0, 3, 3, 5, 5]])
It's useful to know that there's a first-class function on hand in case your data is already loaded into a DataFrame.

Build a large numpy array from itertools.product

I want to build a numpy array from the result of itertools.product. My first approach was a simple:
from itertools import product
import numpy as np
max_init = 6
init_values = range(1, max_init + 1)
repetitions = 12
result = np.array(list(product(init_values, repeat=repetitions)))
This code works well for "small" repetitions (like <=4), but with "large" values (>= 12) it completely hogs the memory and crashes. I assumed that building the list was the thing eating all the RAM, so I searched how to make it directly with an array. I found Numpy equivalent of itertools.product and Using numpy to build an array of all combinations of two arrays.
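A rough back-of-the-envelope check (my own arithmetic, assuming 8 bytes per element in the final array) shows why:
6 ** 12            # 2176782336 rows
6 ** 12 * 12 * 8   # 208971104256 bytes, ~209 GB for the ndarray alone,
                   # before counting the intermediate list of tuples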
So, I tested the following alternatives:
Alternative #1:
results = np.empty((max_init**repetitions, repetitions))
for i, row in enumerate(product(init_values, repeat=repetitions)):
    results[i] = row
Alternative #2:
init_values_args = [init_values] * repetitions
results = np.array(np.meshgrid(*init_values_args)).T.reshape(-1, repetitions)
Alternative #3:
results = np.indices([max_init] * repetitions).reshape(repetitions, -1).T + 1
#1 is extremely slow. I didn't have enough patience to let it finish (after a few minutes of processing on a 2017 MacBook Pro). #2 and #3 eat all the memory until the python interpreter crashes, as with the initial approach.
After that, I thought that I could express the same information in a different way that was still useful for me: a dict where the keys would be all the possible (sorted) combinations, and the values would be the counting of these combinations. So I tried:
Alternative #4:
from collections import Counter

def sorted_product(iterable, repeat=1):
    for el in product(iterable, repeat=repeat):
        yield tuple(sorted(el))

def count_product(repetitions=1, max_init=6):
    init_values = range(1, max_init + 1)
    sp = sorted_product(init_values, repeat=repetitions)
    counted_sp = Counter(sp)
    return np.array(list(counted_sp.values())), \
           np.array(list(counted_sp.keys()))

cnt, values = count_product(repetitions=repetitions, max_init=max_init)
But the line counted_sp = Counter(sp), which consumes the entire generator, is also too slow for my needs (it too ran for several minutes before I canceled it).
Is there another way to generate the same data (or a different data structure containing the same information) that does not have the mentioned shortcomings of being too slow or using too much memory?
PS: I tested all the implementations above against my test suite with small repetitions values, and all the tests passed, so they give consistent results.
I hope that editing the question is the best way to expand it. Otherwise, let me know, and I'll edit the post where I should.
After reading the first two answers below and thinking about it, I agree that I am approaching the issue from the wrong angle. Instead of going with a "brute force" approach, I should have used probabilities and worked with those.
My intention is, later on, for each combination:
- Count how many values are under a threshold X.
- Count how many values are equal to or over threshold X and below a threshold Y.
- Count how many values are equal to or over threshold Y.
And group the combinations that have the same counts.
As an illustrative example:
If I roll 12 dice of 6 sides, what's the probability of having M dice with a value <3, N dice with a value >=3 and <5, and P dice with a value >=5, for all possible combinations of M, N, and P?
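A minimal sketch of that multinomial computation (my own illustration; the bucket probabilities below are assumptions for a fair die, not from the question):
from math import factorial

def multinomial_pmf(counts, probs):
    # P(exactly counts[i] dice fall in bucket i), via the multinomial formula
    n = sum(counts)
    coeff = factorial(n)
    for k in counts:
        coeff //= factorial(k)
    p = float(coeff)
    for k, prob in zip(counts, probs):
        p *= prob ** k
    return p

# Assumed buckets for a fair 6-sided die: {1, 2}, {3, 4}, {5, 6}
probs = (2.0/6, 2.0/6, 2.0/6)
print(multinomial_pmf((4, 2, 6), probs))  # P(M=4, N=2, P=6) for 12 dice, ~0.026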
So, I think that I'll close this question in a few days while I go with this new approach. Thank you for all the feedback and your time!
The number of tuples that list(product(range(1,7), repeat=12)) makes is 6**12 = 2,176,782,336. Whether as a list or an array, that's probably too large for most computers.
In [119]: len(list(product(range(1,7),repeat=12)))
....
MemoryError:
Trying to make an array of that size directly:
In [129]: A = np.ones((6**12,12),int)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-129-e833a9e859e0> in <module>()
----> 1 A = np.ones((6**12,12),int)
/usr/local/lib/python3.5/dist-packages/numpy/core/numeric.py in ones(shape, dtype, order)
190
191 """
--> 192 a = empty(shape, dtype, order)
193 multiarray.copyto(a, 1, casting='unsafe')
194 return a
ValueError: Maximum allowed dimension exceeded
Array memory size, at 4 bytes per item:
In [130]: 4*12*6**12
Out[130]: 104485552128
100GB?
Why do you need to generate 2 billion combinations of 12 numbers?
So with your Counter you reduce the number of items
In [138]: sp = sorted_product(range(1,7), 2)
In [139]: counter=Counter(sp)
In [140]: counter
Out[140]:
Counter({(1, 1): 1,
         (1, 2): 2,
         (1, 3): 2,
         (1, 4): 2,
         (1, 5): 2,
         (1, 6): 2,
         (2, 2): 1,
         (2, 3): 2,
         (2, 4): 2,
         (2, 5): 2,
         (2, 6): 2,
         (3, 3): 1,
         (3, 4): 2,
         (3, 5): 2,
         (3, 6): 2,
         (4, 4): 1,
         (4, 5): 2,
         (4, 6): 2,
         (5, 5): 1,
         (5, 6): 2,
         (6, 6): 1})
from 36 to 21 (for 2 repetitions). It shouldn't be hard to generalize this to more repetitions (combinations? permutations?), as sketched below. It will still push time and/or memory boundaries.
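For instance, here's one possible generalization (my own sketch, not from the original answer): enumerate the sorted tuples directly with itertools.combinations_with_replacement and compute each tuple's multiplicity as a multinomial coefficient, so the full 6**r product is never enumerated:
from itertools import combinations_with_replacement
from collections import Counter
from math import factorial

def counted_sorted_product(values, repeat):
    counts = {}
    for combo in combinations_with_replacement(values, repeat):
        # multiplicity = repeat! / (k1! * k2! * ...), where k_i counts
        # how often each distinct value occurs in the tuple
        denom = 1
        for k in Counter(combo).values():
            denom *= factorial(k)
        counts[combo] = factorial(repeat) // denom
    return counts

# repeat=2 reproduces the 21 entries above; repeat=12 yields only
# C(17, 12) = 6188 entries instead of 6**12 products.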
A variant on meshgrid using mgrid:
In [175]: n=7; A=np.mgrid[[slice(1,7)]*n].reshape(n,-1).T
In [176]: A.shape
Out[176]: (279936, 7)
In [177]: B=np.array(list(product(range(1,7),repeat=7)))
In [178]: B.shape
Out[178]: (279936, 7)
In [179]: A[:10]
Out[179]:
array([[1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 2],
       [1, 1, 1, 1, 1, 1, 3],
       [1, 1, 1, 1, 1, 1, 4],
       [1, 1, 1, 1, 1, 1, 5],
       [1, 1, 1, 1, 1, 1, 6],
       [1, 1, 1, 1, 1, 2, 1],
       [1, 1, 1, 1, 1, 2, 2],
       [1, 1, 1, 1, 1, 2, 3],
       [1, 1, 1, 1, 1, 2, 4]])
In [180]: B[:10]
Out[180]:
array([[1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 2],
       [1, 1, 1, 1, 1, 1, 3],
       [1, 1, 1, 1, 1, 1, 4],
       [1, 1, 1, 1, 1, 1, 5],
       [1, 1, 1, 1, 1, 1, 6],
       [1, 1, 1, 1, 1, 2, 1],
       [1, 1, 1, 1, 1, 2, 2],
       [1, 1, 1, 1, 1, 2, 3],
       [1, 1, 1, 1, 1, 2, 4]])
In [181]: np.allclose(A,B)
Out[181]: True
mgrid is quite a bit faster:
In [182]: timeit B=np.array(list(product(range(1,7),repeat=7)))
317 ms ± 3.58 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [183]: timeit A=np.mgrid[[slice(1,7)]*n].reshape(n,-1).T
13.9 ms ± 242 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
but, yes, it will have the same overall memory usage and limit.
With n=10,
ValueError: array is too big; `arr.size * arr.dtype.itemsize` is larger than the maximum possible size.
The right answer is: Don't. Whatever you want to do with all these combinations, adjust your approach so that you either generate them one at a time and use them immediately without storing them, or better yet, find a way to get the job done without inspecting every combination. Your current solution works for toy problems, but is not suitable for larger parameters. Explain what you are up to and maybe someone here can help.
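To make the "generate them one at a time" advice concrete, here is a minimal sketch (with a hypothetical statistic, since the question doesn't say what the combinations are for):
from itertools import product

def stream_count(values, repeat, threshold):
    # Consume product() lazily; only a running count is ever stored
    count = 0
    for combo in product(values, repeat=repeat):
        if sum(combo) > threshold:
            count += 1
    return count

# stream_count(range(1, 7), 4, 15) visits all 6**4 tuples in O(1) memory;
# for repeat=12 the time cost is still 6**12 iterations, hence the advice
# to avoid inspecting every combination at all.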

Generate 1D NumPy array of concatenated ranges

I want to generate the following array a:
nv = np.random.randint(3, 10+1, size=(1000000,))
a = np.concatenate([np.arange(1,i+1) for i in nv])
Thus, the output would be something like -
[1, 2, 3, 4, 1, 2, 3, 4, 5, 1, 2, 3, 1, 2, 3, 4, 5, 6, 1, ...]
Does there exist any better way to do it?
Here's a vectorized approach using cumulative summation -
def ranges(nv, start=1):
    shifts = nv.cumsum()
    # An array of ones cumsums to 1, 2, 3, ...; at each group boundary,
    # write a negative jump so the cumulative sum restarts.
    id_arr = np.ones(shifts[-1], dtype=int)
    id_arr[shifts[:-1]] = -nv[:-1] + 1
    id_arr[0] = start  # Skip if we know the start of ranges is 1 already
    return id_arr.cumsum()
Sample runs -
In [23]: nv
Out[23]: array([3, 2, 5, 7])
In [24]: ranges(nv, start=0)
Out[24]: array([0, 1, 2, 0, 1, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 5, 6])
In [25]: ranges(nv, start=1)
Out[25]: array([1, 2, 3, 1, 2, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 7])
Runtime test -
In [62]: nv = np.random.randint(3, 10+1, size=(100000,))
In [63]: %timeit your_func(nv)  # @MSeifert's solution
10 loops, best of 3: 129 ms per loop
In [64]: %timeit ranges(nv)
100 loops, best of 3: 5.54 ms per loop
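For reference, an equivalent vectorized formulation (my own variant, not from the answers) subtracts each element's group offset from a single global arange:
import numpy as np

def ranges_repeat(nv, start=1):
    # index at which each group begins in the concatenated output
    offsets = nv.cumsum() - nv
    # global position minus the group's offset restarts the count per group
    return np.arange(nv.sum()) - np.repeat(offsets, nv) + start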
Instead of doing this with numpy methods you could use normal python ranges and just convert the result to an array:
from itertools import chain
import numpy as np
def your_func(nv):
    ranges = (range(1, i+1) for i in nv)
    flattened = list(chain.from_iterable(ranges))
    return np.array(flattened)
This doesn't need to use hard-to-understand NumPy slicing and constructs. To show a sample case:
>>> import random
>>> nv = [random.randint(1, 10) for _ in range(5)]
>>> print(nv)
[4, 2, 10, 5, 3]
>>> print(your_func(nv))
[ 1 2 3 4 1 2 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 1 2 3]
Why two steps?
a = np.concatenate([np.arange(0,np.random.randint(3,11)) for i in range(1000000)])

Count how many times each row is present in numpy.array

I am trying to count the number of times each row appears in a np.array, for example:
import numpy as np
my_array = np.array([[1, 2, 0, 1, 1, 1],
                     [1, 2, 0, 1, 1, 1],  # duplicate of row 0
                     [9, 7, 5, 3, 2, 1],
                     [1, 1, 1, 0, 0, 0],
                     [1, 2, 0, 1, 1, 1],  # duplicate of row 0
                     [1, 1, 1, 1, 1, 0]])
Row [1, 2, 0, 1, 1, 1] shows up 3 times.
A simple naive solution would involve converting all my rows to tuples, and applying collections.Counter, like this:
from collections import Counter
def row_counter(my_array):
    list_of_tups = [tuple(ele) for ele in my_array]
    return Counter(list_of_tups)
Which yields:
In [2]: row_counter(my_array)
Out[2]: Counter({(1, 2, 0, 1, 1, 1): 3, (1, 1, 1, 1, 1, 0): 1, (9, 7, 5, 3, 2, 1): 1, (1, 1, 1, 0, 0, 0): 1})
However, I am concerned about the efficiency of my approach. And maybe there is a library that provides a built-in way of doing this. I tagged the question as pandas because I think that pandas might have the tool I am looking for.
You can use the answer to this other question of yours to get the counts of the unique items.
In numpy 1.9 there is a return_counts optional keyword argument, so you can simply do:
>>> my_array
array([[1, 2, 0, 1, 1, 1],
       [1, 2, 0, 1, 1, 1],
       [9, 7, 5, 3, 2, 1],
       [1, 1, 1, 0, 0, 0],
       [1, 2, 0, 1, 1, 1],
       [1, 1, 1, 1, 1, 0]])
>>> # view each row as a single opaque void scalar, so np.unique compares whole rows
>>> dt = np.dtype((np.void, my_array.dtype.itemsize * my_array.shape[1]))
>>> b = np.ascontiguousarray(my_array).view(dt)
>>> unq, cnt = np.unique(b, return_counts=True)
>>> unq = unq.view(my_array.dtype).reshape(-1, my_array.shape[1])
>>> unq
array([[1, 1, 1, 0, 0, 0],
       [1, 1, 1, 1, 1, 0],
       [1, 2, 0, 1, 1, 1],
       [9, 7, 5, 3, 2, 1]])
>>> cnt
array([1, 1, 3, 1])
In earlier versions, you can do it as:
>>> unq, _ = np.unique(b, return_inverse=True)
>>> cnt = np.bincount(_)
>>> unq = unq.view(my_array.dtype).reshape(-1, my_array.shape[1])
>>> unq
array([[1, 1, 1, 0, 0, 0],
       [1, 1, 1, 1, 1, 0],
       [1, 2, 0, 1, 1, 1],
       [9, 7, 5, 3, 2, 1]])
>>> cnt
array([1, 1, 3, 1])
I think just specifying axis in np.unique gives what you need.
import numpy as np
unq, cnt = np.unique(my_array, axis=0, return_counts=True)
Note: this feature is available only in numpy>=1.13.0.
(This assumes that the array is fairly small, e.g. fewer than 1000 rows.)
Here's a short NumPy way to count how many times each row appears in an array:
>>> (my_array[:, np.newaxis] == my_array).all(axis=2).sum(axis=1)
array([3, 3, 1, 1, 3, 1])
This counts how many times each row appears in my_array, returning an array where the first value shows how many times the first row appears, the second value shows how many times the second row appears, and so on.
A pandas approach might look like this:
import pandas as pd
df = pd.DataFrame(my_array,columns=['c1','c2','c3','c4','c5','c6'])
df.groupby(['c1','c2','c3','c4','c5','c6']).size()
Note: supplying column names is not necessary
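In newer pandas (1.1 or later, if I recall correctly), DataFrame.value_counts collapses this into one call:
pd.DataFrame(my_array).value_counts()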
Your solution is not bad, but if your matrix is large you will probably want to use a more efficient hash (compared to the default one Counter uses) for the rows before counting. You can do that with joblib:
A = np.random.rand(5, 10000)
%timeit (A[:,np.newaxis,:] == A).all(axis=2).sum(axis=1)
10000 loops, best of 3: 132 µs per loop
%timeit Counter(joblib.hash(row) for row in A).values()
1000 loops, best of 3: 1.37 ms per loop
%timeit Counter(tuple(ele) for ele in A).values()
100 loops, best of 3: 3.75 ms per loop
%timeit pd.DataFrame(A).groupby(range(A.shape[1])).size()
1 loops, best of 3: 2.24 s per loop
The pandas solution is extremely slow (about 2s per loop) with this many columns. For a small matrix like the one you showed your method is faster than joblib hashing but slower than numpy:
numpy: 100000 loops, best of 3: 15.1 µs per loop
joblib:1000 loops, best of 3: 885 µs per loop
tuple: 10000 loops, best of 3: 27 µs per loop
pandas: 100 loops, best of 3: 2.2 ms per loop
If you have a large number of rows then you can probably find a better substitute for Counter to find hash frequencies.
Edit: Added numpy benchmarks from @acjr's solution in my system so that it is easier to compare. The numpy solution is the fastest one in both cases.
A solution identical to Jaime's can be found in the numpy_indexed package (disclaimer: I am its author)
import numpy_indexed as npi
npi.count(my_array)

How to efficiently concatenate many arange calls in numpy?

I'd like to vectorize calls like numpy.arange(0, cnt_i) over a vector of cnt values and concatenate the results like this snippet:
import numpy
cnts = [1,2,3]
numpy.concatenate([numpy.arange(cnt) for cnt in cnts])
array([0, 0, 1, 0, 1, 2])
Unfortunately the code above is very memory inefficient due to the temporary arrays and list comprehension looping.
Is there a way to do this more efficiently in numpy?
Here's a completely vectorized function:
def multirange(counts):
    counts = np.asarray(counts)
    # Remove the following line if counts is always strictly positive.
    counts = counts[counts != 0]
    counts1 = counts[:-1]
    reset_index = np.cumsum(counts1)
    # An array of ones cumsums to 0, 1, 2, ...; at each reset index,
    # write a negative jump so the cumulative sum restarts at 0.
    incr = np.ones(counts.sum(), dtype=int)
    incr[0] = 0
    incr[reset_index] = 1 - counts1
    # Reuse the incr array for the final result.
    incr.cumsum(out=incr)
    return incr
Here's a variation of @Developer's answer that only calls arange once:
def multirange_loop(counts):
    counts = np.asarray(counts)
    ranges = np.empty(counts.sum(), dtype=int)
    seq = np.arange(counts.max())
    starts = np.zeros(len(counts), dtype=int)
    starts[1:] = np.cumsum(counts[:-1])
    for start, count in zip(starts, counts):
        ranges[start:start + count] = seq[:count]
    return ranges
And here's the original version, written as a function:
def multirange_original(counts):
    ranges = np.concatenate([np.arange(count) for count in counts])
    return ranges
Demo:
In [296]: multirange_original([1,2,3])
Out[296]: array([0, 0, 1, 0, 1, 2])
In [297]: multirange_loop([1,2,3])
Out[297]: array([0, 0, 1, 0, 1, 2])
In [298]: multirange([1,2,3])
Out[298]: array([0, 0, 1, 0, 1, 2])
Compare timing using a larger array of counts:
In [299]: counts = np.random.randint(1, 50, size=50)
In [300]: %timeit multirange_original(counts)
10000 loops, best of 3: 114 µs per loop
In [301]: %timeit multirange_loop(counts)
10000 loops, best of 3: 76.2 µs per loop
In [302]: %timeit multirange(counts)
10000 loops, best of 3: 26.4 µs per loop
Try the following to solve the memory problem; efficiency is almost the same.
out = np.empty(sum(cnts), dtype=int)
k = 0
for cnt in cnts:
    out[k:k+cnt] = np.arange(cnt)
    k += cnt
so no concatenation is used.
np.tril_indices pretty much does this for you:
In [28]: def f(c):
   ....:     return np.tril_indices(c, -1)[1]
In [29]: f(10)
Out[29]:
array([0, 0, 1, 0, 1, 2, 0, 1, 2, 3, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 5, 0, 1,
       2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3, 4, 5, 6, 7, 8])
In [33]: %timeit multirange(range(10))
10000 loops, best of 3: 93.2 us per loop
In [34]: %timeit f(10)
10000 loops, best of 3: 68.5 us per loop
much faster than @Warren Weckesser's multirange when the dimension is small.
But it becomes much slower when the dimension is larger (@hpaulj, you have a very good point):
In [36]: %timeit multirange(range(1000))
100 loops, best of 3: 5.62 ms per loop
In [37]: %timeit f(1000)
10 loops, best of 3: 68.6 ms per loop
