Related: inspired by the post "How to create a sequence of sequences of numbers in R?".
Question:
I would like to make the following sequence in NumPy.
[1, 2, 3, 4, 5, 2, 3, 4, 5, 3, 4, 5, 4, 5, 5]
I have tried the following:
Non-generic and hard-coded, using np.r_.
np.r_[1:6, 2:6, 3:6, 4:6, 5:6]
# array([1, 2, 3, 4, 5, 2, 3, 4, 5, 3, 4, 5, 4, 5, 5])
Pure Python to generate the desired array.
n = 5
a = np.r_[1:n+1]
[i for idx in range(a.shape[0]) for i in a[idx:]]
# [1, 2, 3, 4, 5, 2, 3, 4, 5, 3, 4, 5, 4, 5, 5]
Create a 2D array and take the upper triangle from it.
n = 5
a = np.r_[1:n+1]
arr = np.tile(a, (n, 1))
print(arr)
# [[1 2 3 4 5]
# [1 2 3 4 5]
# [1 2 3 4 5]
# [1 2 3 4 5]
# [1 2 3 4 5]]
o = np.triu(arr).flatten()
# array([1, 2, 3, 4, 5,
# 0, 2, 3, 4, 5,
# 0, 0, 3, 4, 5, # This is 1D array
# 0, 0, 0, 4, 5,
# 0, 0, 0, 0, 5])
out = o[o > 0]
# array([1, 2, 3, 4, 5, 2, 3, 4, 5, 3, 4, 5, 4, 5, 5])
The above solution is generic but I want to know if there's a more efficient way to do it in NumPy.
I'm not sure if this is a good idea, but I tried running it against your pure Python method and it seems to be faster.
np.concatenate([np.arange(i, n+1) for i in range(1, n+1)])
Here is the full code:
import numpy as np
from time import time
n = 5000
t = time()
c = np.concatenate([np.arange(i, n+1) for i in range(1, n+1)])
print(time() - t)
# 0.039876699447631836
t = time()
a = np.r_[1:n+1]
b = np.array([i for idx in range(a.shape[0]) for i in a[idx:]])
print(time() - t)
# 2.0875167846679688
print(all(b == c))
# True
A really plain Python (no numpy) way is:
n = 5
a = [r for start in range(1, n+1) for r in range(start, n+1)]
This will be faster for small n (around 150) but slower than @tangolin's solution for larger n. It is still faster than the OP's "pure Python" way.
A faster implementation prepares the data in advance, avoiding the creation of a new range each time:
source = np.arange(1, n+1)
d = np.concatenate([source[i: n+1] for i in range(0, n)])
NOTE
My original implementation both allocated space for the return value and prepared the data in advance, but it was not Pythonic. I changed it to use concatenate after reading @tangolin's answer and noticing that concatenate does the same thing.
Original implementation:
e = np.empty((n*(n+1)//2,), dtype='int64')
source = np.arange(1, n+1)
for i in range(n):
    init = n*i - i*(i-1)//2
    end = n - i + init
    e[init:end] = source[i:n]
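As a quick sanity check, the preallocated result matches the concatenate version (a minimal verification, reusing d and e from the snippets above with the same n):
print(np.array_equal(e, d))  # True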
I have this expression in MATLAB:
cn = reshape(repmat(sn, n_rep, 1), 1, []);
Now in Python, the key code is:
import numpy as np
from numpy.random import randint

M = 2
N = 2 * 10**8   # data size
n_rep = 3       # number of repetitions
sn = randint(0, M, size=N)   # integers 0 and 1
print("sn =", sn)
cn_repmat = np.tile(sn, n_rep)
print("cn_repmat =", cn_repmat)
cn = np.reshape(cn_repmat, 1, [])
print(cn)
I don't see what is wrong; I get the following error:
File "C: / Users / Sergio Malhao / .spyder-py3 / Desktop / untitled6.py", line 17, under <module>
cn = np.reshape (cn_repmat, 1, [])
File "E: \ Anaconda3 \ lib \ site-packages \ numpy \ core \ fromnumeric.py", line 232, in reshape
return _wrapfunc (a, 'reshape', newshape, order = order)
File "E: \ Anaconda3 \ lib \ site-packages \ numpy \ core \ fromnumeric.py", line 57, in _wrapfunc
return getattr (obj, method) (* args, ** kwds)
ValueError: Can not reshape the array of size 600000000 in shape (1,)
NumPy is not meant to be a 1:1 copy of MATLAB. It works similarly, but not identically.
I assume you want to convert a matrix into one dimensional array.
Try to:
np.reshape(cn_repmat, (1, -1))
where (1, -1) is a tuple defining the shape of the new array.
One shape dimension can be -1. In this case, the value is inferred
from the length of the array and remaining dimensions.
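For instance, a minimal sketch with a small stand-in array (the real sn here is far larger):
import numpy as np

cn_repmat = np.tile(np.array([0, 1, 0, 1]), 3)   # small stand-in for the tiled data
cn = np.reshape(cn_repmat, (1, -1))              # -1 lets numpy infer the length
print(cn.shape)  # (1, 12)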
In Octave:
>> sn = [0,1,2,3,4]
sn =
0 1 2 3 4
>> repmat(sn,4,1)
ans =
0 1 2 3 4
0 1 2 3 4
0 1 2 3 4
0 1 2 3 4
>> reshape(repmat(sn,4,1),1,[])
ans =
0 0 0 0 1 1 1 1 2 2 2 2 3 3 3 3 4 4 4 4
In numpy:
In [595]: sn=np.array([0,1,2,3,4])
In [596]: np.repeat(sn,4)
Out[596]: array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4])
In [597]: np.tile(sn,4)
Out[597]: array([0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4])
In MATLAB matrices are at least 2d; in numpy they may be 1d. Out[596] is 1d.
We could get closer to the MATLAB by making the sn 2d:
In [608]: sn2 = sn[None,:] # = sn.reshape((1,-1))
In [609]: sn2
Out[609]: array([[0, 1, 2, 3, 4]])
In [610]: np.repeat(sn2,4,1)
Out[610]: array([[0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4]])
With tile we have to transpose or play order games (MATLAB is order F):
In [613]: np.tile(sn,[4,1])
Out[613]:
array([[0, 1, 2, 3, 4],
[0, 1, 2, 3, 4],
[0, 1, 2, 3, 4],
[0, 1, 2, 3, 4]])
In [614]: np.tile(sn,[4,1]).T.ravel()
Out[614]: array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4])
In [615]: np.tile(sn,[4,1]).ravel(order='F')
Out[615]: array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4])
ravel is the equivalent of reshape(..., -1); -1 functions like [] in MATLAB when reshaping.
In numpy repeat is the basic function; tile uses repeat with a different user interface (more like repmat).
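Putting this together, a one-line equivalent of the question's reshape(repmat(sn, n_rep, 1), 1, []) follows from the Octave demonstration above:
cn = np.repeat(sn, n_rep)   # each element repeated n_rep times, matching MATLAB's column-major reshape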
I have an array containing chunks of negative and chunks of positive elements. A much simplified example of it would be an array a looking like: array([-3, -2, -1, 1, 2, 3, 4, 5, 6, -5, -4])
(a<0).sum() and (a>0).sum() give me the total number of negative and positive elements but how do I count these in order? By this I mean I want to know that my array contains first 3 negative elements, 6 positive and 2 negative.
This sounds like a topic that has been addressed somewhere, and there may be a duplicate out there, but I can't find one.
One method would be to use numpy.roll(a, 1) in a loop over the whole array, counting elements of a given sign as they roll past a fixed position, but that doesn't look very NumPy-like (or Pythonic), nor very efficient, to me.
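For reference, a rough sketch of that rolling idea, vectorized with a single np.roll comparison instead of a loop:
import numpy as np

a = np.array([-3, -2, -1, 1, 2, 3, 4, 5, 6, -5, -4])
breaks = (a > 0) != (np.roll(a, 1) > 0)   # True where the sign flips
breaks[0] = True                          # the first element always starts a run
run_starts = np.flatnonzero(breaks)
run_lengths = np.diff(np.append(run_starts, a.size))
print(run_lengths)  # [3 6 2]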
Here's one vectorized approach -
def pos_neg_counts(a):
    mask = a > 0
    idx = np.flatnonzero(mask[1:] != mask[:-1])
    count = np.concatenate(([idx[0]+1], idx[1:] - idx[:-1], [a.size-1-idx[-1]]))
    if a[0] < 0:
        return count[1::2], count[::2]  # pos, neg counts
    else:
        return count[::2], count[1::2]  # pos, neg counts
Sample runs -
In [155]: a
Out[155]: array([-3, -2, -1, 1, 2, 3, 4, 5, 6, -5, -4])
In [156]: pos_neg_counts(a)
Out[156]: (array([6]), array([3, 2]))
In [157]: a[0] = 3
In [158]: a
Out[158]: array([ 3, -2, -1, 1, 2, 3, 4, 5, 6, -5, -4])
In [159]: pos_neg_counts(a)
Out[159]: (array([1, 6]), array([2, 2]))
In [160]: a[-1] = 7
In [161]: a
Out[161]: array([ 3, -2, -1, 1, 2, 3, 4, 5, 6, -5, 7])
In [162]: pos_neg_counts(a)
Out[162]: (array([1, 6, 1]), array([2, 1]))
Runtime test
Other approach(es) -
# @Franz's solution
def split_app(my_array):
    negative_index = my_array < 0
    splits = np.split(negative_index, np.where(np.diff(negative_index))[0]+1)
    len_list = [len(i) for i in splits]
    return len_list
Timings on bigger dataset -
In [20]: # Setup input array
...: reps = np.random.randint(3,10,(100000))
...: signs = np.ones(len(reps),dtype=int)
...: signs[::2] = -1
...: a = np.repeat(signs, reps)*np.random.randint(1,9,reps.sum())
...:
In [21]: %timeit split_app(a)
10 loops, best of 3: 90.4 ms per loop
In [22]: %timeit pos_neg_counts(a)
100 loops, best of 3: 2.21 ms per loop
Just use
my_array = np.array([-3, -2, -1, 1, 2, 3, 4, 5, 6, -5, -4])
negative_index = my_array<0
and you'll get a boolean mask marking the negative values. After that you can split this array:
splits = np.split(negative_index, np.where(np.diff(negative_index))[0]+1)
and then compute the sizes of the inner arrays:
len_list = [len(i) for i in splits]
print(len_list)
And you'll get what you are looking for:
Out[1]: [3, 6, 2]
You just have to know the sign of the first element; by definition in my code, it is negative.
So just execute:
my_array = np.array([-3, -2, -1, 1, 2, 3, 4, 5, 6, -5, -4])
negative_index = my_array<0
splits = np.split(negative_index, np.where(np.diff(negative_index))[0]+1)
len_list = [len(i) for i in splits]
print(len_list)
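Since the lengths alone don't say which sign comes first, a small hedged addition records the sign of the first run alongside the counts:
first_run_negative = bool(my_array[0] < 0)
print(first_run_negative, len_list)  # True [3, 6, 2]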
My (rather simple and probably inefficient) solution would be:
import numpy as np

arr = np.array([-3, -2, -1, 1, 2, 3, 4, 5, 6, -5, -4])
sgn = np.sign(arr[0])
res = []
cntr = 1  # counting the first one
for i in range(1, len(arr)):
    if np.sign(arr[i]) != sgn:
        res.append(cntr)
        cntr = 0
        sgn *= -1
    cntr += 1
res.append(cntr)
print(res)
Say you have two matrices: A is 2x2 and B is 2x7 (2 rows, 7 columns). I want to create a matrix C of shape 2x7 out of copies of A. The problem is that np.hstack only handles the case where the column counts divide evenly (say 2 and 8, so you can stack 4 copies of A to get C), but what about when they do not? Any ideas?
A = [[0,1]    B = [[1,2,3,4,5,6,7],    C = [[0,1,0,1,0,1,0],
     [2,3]]        [1,2,3,4,5,6,7]]         [2,3,2,3,2,3,2]]
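To see the mismatch, whole copies of A can only produce even column counts (a small demonstration, assuming A as above):
import numpy as np

A = np.array([[0, 1], [2, 3]])
print(np.hstack([A] * 3).shape)  # (2, 6) -- one column short
print(np.hstack([A] * 4).shape)  # (2, 8) -- one column too many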
Here's an approach with modulus -
In [23]: ncols = 7 # No. of cols in output array
In [24]: A[:,np.mod(np.arange(ncols),A.shape[1])]
Out[24]:
array([[0, 1, 0, 1, 0, 1, 0],
[2, 3, 2, 3, 2, 3, 2]])
Or with % operator -
In [27]: A[:,np.arange(ncols)%A.shape[1]]
Out[27]:
array([[0, 1, 0, 1, 0, 1, 0],
[2, 3, 2, 3, 2, 3, 2]])
For such repeated indices, using np.take would be more performant -
In [29]: np.take(A, np.arange(ncols)%A.shape[1], axis=1)
Out[29]:
array([[0, 1, 0, 1, 0, 1, 0],
[2, 3, 2, 3, 2, 3, 2]])
A solution without numpy (although the np solution posted above is a lot nicer):
A = [[0, 1],
     [2, 3]]
B = [[1, 2, 3, 4, 5, 6, 7],
     [1, 2, 3, 4, 5, 6, 7]]

i_max, j_max = len(A), len(A[0])
C = []
for i, line_b in enumerate(B):
    line_c = [A[i % i_max][j % j_max] for j, _ in enumerate(line_b)]
    C.append(line_c)
print(C)
The first solution is very nice. Another possible way is to still use hstack: if the pattern doesn't repeat a whole number of times, you can use array slicing to trim to the width you need:
a.shape  # (2, 2)
b.shape  # (2, 7)

repeats = int(np.ceil(b.shape[1] / a.shape[1]))
# Slice to exactly b's width; this also covers the case where the widths divide evenly.
c = np.hstack([a] * repeats)[:, :b.shape[1]]
c
# array([[0, 1, 0, 1, 0, 1, 0],
#        [2, 3, 2, 3, 2, 3, 2]])
I am trying to count the number of times each row appears in a np.array, for example:
import numpy as np
my_array = np.array([[1, 2, 0, 1, 1, 1],
[1, 2, 0, 1, 1, 1], # duplicate of row 0
[9, 7, 5, 3, 2, 1],
[1, 1, 1, 0, 0, 0],
[1, 2, 0, 1, 1, 1], # duplicate of row 0
[1, 1, 1, 1, 1, 0]])
Row [1, 2, 0, 1, 1, 1] shows up 3 times.
A simple naive solution would involve converting all my rows to tuples, and applying collections.Counter, like this:
from collections import Counter
def row_counter(my_array):
list_of_tups = [tuple(ele) for ele in my_array]
return Counter(list_of_tups)
Which yields:
In [2]: row_counter(my_array)
Out[2]: Counter({(1, 2, 0, 1, 1, 1): 3, (1, 1, 1, 1, 1, 0): 1, (9, 7, 5, 3, 2, 1): 1, (1, 1, 1, 0, 0, 0): 1})
However, I am concerned about the efficiency of my approach. And maybe there is a library that provides a built-in way of doing this. I tagged the question as pandas because I think that pandas might have the tool I am looking for.
You can use the answer to this other question of yours to get the counts of the unique items.
In numpy 1.9 there is a return_counts optional keyword argument, so you can simply do the following (the np.void view packs each row into a single item, so np.unique compares whole rows at once):
>>> my_array
array([[1, 2, 0, 1, 1, 1],
[1, 2, 0, 1, 1, 1],
[9, 7, 5, 3, 2, 1],
[1, 1, 1, 0, 0, 0],
[1, 2, 0, 1, 1, 1],
[1, 1, 1, 1, 1, 0]])
>>> dt = np.dtype((np.void, my_array.dtype.itemsize * my_array.shape[1]))
>>> b = np.ascontiguousarray(my_array).view(dt)
>>> unq, cnt = np.unique(b, return_counts=True)
>>> unq = unq.view(my_array.dtype).reshape(-1, my_array.shape[1])
>>> unq
array([[1, 1, 1, 0, 0, 0],
[1, 1, 1, 1, 1, 0],
[1, 2, 0, 1, 1, 1],
[9, 7, 5, 3, 2, 1]])
>>> cnt
array([1, 1, 3, 1])
In earlier versions, you can do it as:
>>> unq, _ = np.unique(b, return_inverse=True)
>>> cnt = np.bincount(_)
>>> unq = unq.view(my_array.dtype).reshape(-1, my_array.shape[1])
>>> unq
array([[1, 1, 1, 0, 0, 0],
[1, 1, 1, 1, 1, 0],
[1, 2, 0, 1, 1, 1],
[9, 7, 5, 3, 2, 1]])
>>> cnt
array([1, 1, 3, 1])
I think just specifying axis in np.unique gives what you need.
import numpy as np
unq, cnt = np.unique(my_array, axis=0, return_counts=True)
Note: this feature is available only in numpy>=1.13.0.
Here's a short NumPy way to count how many times each row appears in an array (this assumes the array is fairly small, e.g. fewer than 1000 rows):
>>> (my_array[:, np.newaxis] == my_array).all(axis=2).sum(axis=1)
array([3, 3, 1, 1, 3, 1])
This counts how many times each row appears in my_array, returning an array where the first value shows how many times the first row appears, the second value shows how many times the second row appears, and so on.
A pandas approach might look like this:
import pandas as pd
df = pd.DataFrame(my_array,columns=['c1','c2','c3','c4','c5','c6'])
df.groupby(['c1','c2','c3','c4','c5','c6']).size()
Note: supplying column names is not necessary
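For example, a sketch of the same groupby using the default integer column labels instead of names:
import pandas as pd

df = pd.DataFrame(my_array)
print(df.groupby(list(range(my_array.shape[1]))).size())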
Your solution is not bad, but if your matrix is large you will probably want to use a more efficient hash for the rows (compared to the default one Counter uses) before counting. You can do that with joblib:
A = np.random.rand(5, 10000)
%timeit (A[:,np.newaxis,:] == A).all(axis=2).sum(axis=1)
10000 loops, best of 3: 132 µs per loop
%timeit Counter(joblib.hash(row) for row in A).values()
1000 loops, best of 3: 1.37 ms per loop
%timeit Counter(tuple(ele) for ele in A).values()
100 loops, best of 3: 3.75 ms per loop
%timeit pd.DataFrame(A).groupby(range(A.shape[1])).size()
1 loops, best of 3: 2.24 s per loop
The pandas solution is extremely slow (about 2 s per loop) with this many columns. For a small matrix like the one you showed, your method is faster than joblib hashing but slower than numpy:
numpy: 100000 loops, best of 3: 15.1 µs per loop
joblib:1000 loops, best of 3: 885 µs per loop
tuple: 10000 loops, best of 3: 27 µs per loop
pandas: 100 loops, best of 3: 2.2 ms per loop
If you have a large number of rows then you can probably find a better substitute for Counter to find hash frequencies.
Edit: Added numpy benchmarks from @acjr's solution on my system so that it is easier to compare. The numpy solution is the fastest one in both cases.
A solution identical to Jaime's can be found in the numpy_indexed package (disclaimer: I am its author):
import numpy_indexed as npi
npi.count(my_array)